Topological Indices for Medicinal Chemistry, Biology, Parasitology, Neurological and Social Networks 2010 Editor
Humberto González-Díaz Department of Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela, 15782 - Santiago de Compostela, Spain
Co-Editor
Cristian Robert Munteanu Department of Information and Communication Technologies, Computer Science Faculty, University of A Coruña, Campus de Elviña, 15071 - A Coruña, Spain
Transworld Research Network, T.C. 37/661 (2), Fort P.O., Trivandrum-695 023 Kerala, India
Published by Transworld Research Network
2010; Rights Reserved
Transworld Research Network, T.C. 37/661(2), Fort P.O., Trivandrum-695 023, Kerala, India
Editor: Humberto González-Díaz
Co-Editor: Cristian Robert Munteanu
Managing Editor: S.G. Pandalai
Publication Manager: A. Gayathri
Transworld Research Network and the Editors assume no responsibility for the opinions and statements advanced by contributors.
ISBN: 978-81-7895-489-9
Contents
Preface

Chapter 1. Multi-target QSAR of antiviral drugs
Francisco J. Prado-Prado and Humberto González-Díaz

Chapter 2. Multi-target QSAR & phylogenetic analysis of antifungal activity
Francisco J. Prado-Prado, Lourdes Santana and Humberto González-Díaz

Chapter 3. Directed network topological indices for van der Waals complexes based on coupled cluster interaction energies
Cristian Robert Munteanu, Berta Fernández, Vanessa Aguiar, José A. Serantes, Julián Dorado, Alejandro Pazos and Humberto González-Díaz

Chapter 4. QSRR construction of networks for chirality inversion reactions
Sonia Arrasate, Nuria Sotomayor, Esther Lete and Humberto González-Díaz

Chapter 5. Network entropies classification of fungi and bacteria cellulases of interest for biotechnology
Guillermín Agüero-Chapin, Aminael Sanchez-Rodríguez, Agostinho Antunes, Gustavo A. de la Riva and Humberto González-Díaz

Chapter 6. Scoring function for DNA-drug docking based on topological indices of supra-molecular networks
Lázaro Guillermo Pérez-Montoto, Lourdes Santana and Humberto González-Díaz

Chapter 7. Entropy analysis of enzymes with QSAR, partial order, and 3D-contact networks
Riccardo Concu, Gianni Podda, Bairong Shen and Humberto González-Díaz

Chapter 8. QSPR models for human Rhinovirus surface networks
Santiago Vilar and Humberto González-Díaz

Chapter 9. Predicting bacterial co-aggregation networks with phylogenetic spectral moments
Ronal Ramos de Armas, Liane Saíz-Urra and Humberto González-Díaz

Chapter 10. QSPR models for cerebral cortex co-activation networks
Humberto González-Díaz, Santiago Vilar, Daniel Rivero, Enrique Fernández-Blanco, Ana Porto and Cristian Robert Munteanu

Chapter 11. Network prediction of fasciolosis spreading in Galicia (NW Spain)
Humberto González-Díaz, Mercedes Mezo, Marta González-Warleta, Laura Muíño-Pose, Esperanza Paniagua and Florencio M. Ubeira

Chapter 12. Study of criminal law networks with Markov-probability centralities
Aliuska Duardo-Sanchez
Preface

The use of graph-theoretic Topological Indices (TIs), Connectivity Indices (CIs), and node Centrality measures to study Complex Network representations of different systems is gaining importance across a broad spectrum of topics. These topics cover the same areas as the complex interacting networks studied, which may be observed in systems from areas as diverse as physics, biology, economics, ecology, and computer science. For example, at the molecular level the structure of drugs, DNA sequences, RNA secondary structures, and the spatial structure of proteins may be described in terms of molecular graphs and/or different contact networks. In other current problems of the Biosciences, we can find prominent examples of supramolecular networks such as protein-protein interaction (PPI) networks and RNA transcript co-expression networks, as well as molecular networks in the genome of living cells. However, the uses of graphs and networks are not limited to the molecular, macromolecular, or supramolecular level. Economic and social interactions often organize themselves in complex network structures. Similar phenomena are observed in traffic flow and in communication networks such as the Internet. On larger scales one finds networks of cells, as in neural networks, up to the scale of organisms in ecological food webs. This has led to the recent development of several interesting software packages, web servers, and theoretical methods to construct graphs and networks of different systems, calculate the TIs/CIs of these graphs, seek structure-function relationships, and manage data mining in many fields. It is very difficult to condense all this information into a single research paper or review manuscript, so at least one book is needed to describe the different uses of TIs/CIs at different levels of organization of matter.
A book of this kind is of interest because many users of these programs confine themselves to a narrow field of application and overlook applications at higher or lower levels of organization that have implications for their own research. On the other hand, many researchers who work at the frontiers of these fields lack materials reviewing the current applications and future perspectives of these methods, and the possible data-flow relationships between them, in a common theoretical framework. Taking all these aspects into consideration, we decided to edit the present e-book, composed of a collection of papers that review and/or introduce new results on the common theoretical basis, applications, and interconnections between the inputs and outputs of different TIs/CIs and graph/network approaches in different areas. In view of its contents, we decided to entitle this book: Topological Indices for Medicinal Chemistry, Biology, Parasitology, Neurological and Social Networks. We hope that the present book may serve as a bridge between theoretical scientists in graph theory and experimentalists in all these areas, in order to suggest new areas of mutual interchange and collaboration. Finally, I would like to express, on behalf of all co-authors, our sincere gratitude to the editorial team of Research Signpost for their decisive and kind attention.

Spain
González-Díaz H, PhD.
Topological Indices for Medicinal Chemistry, Biology, Parasitology, Neurological and Social Networks, 2010: 1-14 ISBN: 978-81-7895-489-9 Editors: Humberto González-Díaz and Cristian Robert Munteanu
1. Multi-target QSAR of antiviral drugs
Francisco J. Prado-Prado1 and Humberto González-Díaz2,*
1 Department of Organic Chemistry, Faculty of Pharmacy, University of Santiago de Compostela (USC), Santiago de Compostela, 15782, Spain
2 Department of Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela (USC), Santiago de Compostela, 15782, Spain
Abstract. Graph theory and complex network analysis have applications at the molecular level to describe drug-virus action pairs in antiviral medicinal chemistry research. With graph parameters called Topological Indices (TIs), we can search for Quantitative Structure-Activity Relationship (QSAR) models for the prediction and discovery of antimicrobial drugs. In this work, we decided to test the potential of one class of TIs. For this test we selected the class of TIs called the node absolute probabilities πk(i), which can be calculated with the MARCH-INSIDE method based on Markov models. We report a new QSAR model that can be used to predict drug activity against different viral strains. The model correctly classifies 428 out of 533 active compound/virus cases (80.30%) and 481 out of 596 non-active compound/virus cases (80.7%). Using this QSAR model we were able to reconstruct a large complex network of observed effective drug-virus pairs with a total Accuracy = 89.8%. The work opens new directions in the generalization of TIs to develop QSAR/QSPR models for predicting relevant information about systems at different structural levels.

Correspondence/Reprint request: Dr. González-Díaz H, Department of Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela (USC), Santiago de Compostela, 15782, Spain. E-mail: gonzalezdiazh@yahoo.es
1. Introduction

Graph theory and Complex Network analysis are tools with many applications in the search for rational approaches to antimicrobial drug discovery. Currently, there are many pathogenic microbial species with very different susceptibilities to antimicrobial drugs. In particular, viral pathogens are responsible for many human diseases. Examples of diseases caused by viruses include the common cold, which is caused by any one of a variety of related viruses; smallpox; AIDS, which is caused by HIV; and cold sores, which are caused by herpes simplex. Other connections are being studied, such as the involvement of Human Herpes Virus (HHV) in organic neurological diseases such as multiple sclerosis and chronic fatigue syndrome. Recently, it has been shown that cervical cancer is caused at least partly by papillomavirus (which causes papillomas, or warts), representing the first significant evidence in humans for a link between cancer and an infective agent. There is current controversy over whether borna virus, previously thought of primarily as the causative agent of neurological disease in horses, could be responsible for psychiatric illness in humans. The relative ability of viruses to cause disease is described in terms of virulence. This very high number of drug-species combinations may be investigated using networks to group or cluster drugs with similar multi-species activity profiles and possibly similar mechanisms of action [1]. In fact, the applications of information-mining techniques based on graphs or networks are not limited to drug discovery. We can use different classes of graphs and networks, such as molecular graphs to describe the structure of antimicrobial drugs, or Artificial Neural Networks (ANN) [2-13] for dataset mining. We can also use Hasse graphs to depict relationships within the genetic code [14-20], or interaction and/or co-expression networks to represent relationships between proteins, genes, or RNAs [21-35].
Specifically, co-expression networks can be constructed by measuring the expression of pairs of genes in different tissues [36-40]. Similarly, protein networks describe experimentally or theoretically established protein-protein interactions [21, 41]. In co-expression networks, two RNAs are connected (and assumed to be involved in a common regulatory mechanism) if the levels of both RNAs across different tissues correlate strongly [42]. We proposed to use the same network approach to study multi-species antimicrobial drug action. The antimicrobial drug plays the role of the RNA molecule, and the drug activity against different species plays the role of the RNA expression level in different tissues. In the co-expression network, we need to measure each RNA tissue profile if we do not have a computational approach to predict it [36, 43].
Disappointingly, QSPR studies generally focus on a limited set of properties of small molecules. For example, in the case of antimicrobials, many QSPR models predict only structurally related compounds acting against a single microbial species [44-46]. Currently, there are more than 1 600 molecular descriptors that may in principle be generalized and used to solve the QSPR problem for small molecules [47]. In any case, most of these indices have not yet been extended to encode information beyond chemical structure [48-50]. In addition, QSPR-like models based on graph and network TIs have been reported for protein, DNA, or RNA structures, and some authors have extended the applications of TIs to whole blood proteomes, protein-protein interaction networks, or even tissues [41, 51, 52]. Nevertheless, the applications of TIs in QSPR research are far from covering all the potential of TIs, and new gateways to molecular, biological, or even social QSPR models are still waiting to be opened. Along this line of thinking, our group has introduced a Markov model (MM) method named MARCH-INSIDE: Markovian Chemicals In Silico Design. MARCH-INSIDE generates TIs in the form of matrix invariants such as stochastic entropies, spectral moments, or absolute probabilities for the study of molecular properties [53-57]. Recently the method has been renamed MARCH-INSIDE 2.0: Markov Chain Invariants for Network Simulation & Design, in order to give a clearer idea of its unexplored potential [58]. In this work, we decided to test, at different structural levels, the potential of one of the classes of TIs calculated by MARCH-INSIDE. For this test we selected the class of TIs called the node absolute probabilities πk(i). The πk(i) values represent the absolute probability of reaching node i after a walk of length k starting from any node in the network. We calculate the πk(i) values from a Markov matrix associated with a graph or network.
These TIs differ from other MARCH-INSIDE TIs because they are useful only to describe a node or, if we sum several πk(i) values, a collection of nodes (atoms, amino acids, a group of electric plants, a social subgroup...) that forms part of a larger network system (molecule, protein, US electric power system, society...). This is because the sum of all πk(j) values over the whole system is always equal to one for any system and so gives no structural information. Consequently, the πk(i) values may be considered local node TIs. This class of TIs is commonly known as network node Centralities. Several node centralities have been defined before, and the software CentiBin calculates some of the most widely used [59]. However, the definition of new Centralities is an active field of research, and new centralities have recently been introduced, such as the sub-graph centrality
introduced by Estrada [60]. Certainly, the πk(j) values have been used in the past by our group [61, 62], but only at the molecular level and never for non-molecular problems. In order both to confirm the potential applications of πk(j) values at the molecular level and to extend these applications beyond traditional frontiers, we develop here a new QSPR model. This QSPR model, based on πk(j) values, can be used to predict the activity of antiviral drugs against multiple virus species.
2. Materials and methods

Multi-target probability centrality for atoms in molecules

By using the Chapman-Kolmogorov equations we can calculate multi-target Cπ,s(j) values referred to atoms (nodes) in molecular graphs. As mentioned above, multi-target here means that we obtain different kCπ,s(j) values for the same atom in the same molecule when the molecular target (bacterium, virus, parasite, receptor, enzyme, etc.) changes. First, we have to calculate the absolute probabilities spk(j) of interaction, in many steps, of the different j-th atoms with the specific target. Here the targets are only different microbial species (s); for this reason we insert the superscript s in the symbol of the centrality. These values can be determined as the elements of the vectors kπ(s). These vectors are elements of a Markov chain based on the stochastic matrix 1Π, which describes the probabilities of interaction sp1(i,j) of the j-th atom given that another i-th atom has previously interacted with the target. The specificity for one target is introduced by using target-specific weights in the definition of the elements of the matrix 1Π. The theoretical foundations of the method have been given in previous works, so we do not detail them here but refer the reader to those works [63, 64]. After that, the entropy centrality is very easy to calculate by applying Shannon's formula to each element spk(j) of the vectors kπ(s) to obtain the entropy centrality measures kCπ,s(j). We can sum the kCπ,s(j) values for specific atom sets (AS), that is, groups of nodes, to create local molecular descriptors for the drug-target interaction. Herein the AS used were: halogens (X), unsaturated carbons (Cins), saturated carbons (Csat), heteroatoms (Het), and hydrogen atoms bound to heteroatoms (H-Het). The corresponding symbols of the local entropy centrality for these AS are: kCπ,s(X), kCπ,s(Cins), kCπ,s(Csat), kCπ,s(Het), kCπ,s(H-Het), and, for the totality of the atoms, kCπ,s(T).
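As a rough illustration of the entropy step described above, the sketch below applies a Shannon term to each absolute probability and sums the result over an atom set. It assumes the per-atom term is -p·ln(p) with the natural logarithm; the text only says that Shannon's formula is applied to each element, so the exact form used by MARCH-INSIDE may differ. The function names, the probability vector, and the atom indices are hypothetical.

```python
import math


def entropy_centrality(p):
    """Shannon term -p*ln(p) for one absolute probability (assumed form)."""
    if p < 0.0 or p > 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    # By convention the term is taken as 0 when p is 0.
    return 0.0 if p == 0.0 else -p * math.log(p)


def atom_set_centrality(probabilities, atom_set):
    """Sum the per-atom entropy terms over the atom indices in atom_set."""
    return sum(entropy_centrality(probabilities[j]) for j in atom_set)


# Hypothetical absolute probabilities for a four-atom fragment (they sum to 1).
p_k = [0.4, 0.3, 0.2, 0.1]
hetero_atoms = [0, 3]  # hypothetical indices of the heteroatoms (Het)

c_het = atom_set_centrality(p_k, hetero_atoms)        # local descriptor
c_total = atom_set_centrality(p_k, range(len(p_k)))   # totality of atoms (T)
```

Summing over an atom set rather than over the whole molecule is what makes the descriptor local.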
In this study, we calculated the first six classes of entropy centrality (k = 0 to 5) for the 5 AS, in total 6·5 = 30 local molecular centralities for each drug [64]. The theoretical foundations of the method have been given in previous works, so we do not detail them here but refer the reader to those works [64, 65]:
$$
{}^{k}\pi(s) = {}^{0}\pi(s)\cdot\left[{}^{1}\Pi(s)\right]^{k}
= \left[\pi_{0}(1,s),\ \pi_{0}(2,s),\ \pi_{0}(3,s),\ \ldots,\ \pi_{0}(n,s)\right]\cdot
\begin{bmatrix}
{}^{1}\pi_{11}(s) & {}^{1}\pi_{12}(s) & \cdots & {}^{1}\pi_{1n}(s)\\
{}^{1}\pi_{21}(s) & {}^{1}\pi_{22}(s) & \cdots & \cdot\\
\vdots & \vdots & \ddots & \vdots\\
{}^{1}\pi_{n1}(s) & \cdot & \cdots & {}^{1}\pi_{nn}(s)
\end{bmatrix}^{k}
\tag{1}
$$

The Aπk(j,s) can be summed for specific sets of atoms (AS) to create local molecular descriptors for the drug-target interaction. Herein, the AS used were the following: halogens (X), unsaturated carbons (Cins), saturated carbons (Csat), heteroatoms (Het), and hydrogens bound to heteroatoms (H-Het). The corresponding symbols of the local absolute probabilities for these AS are: Aπk(X,s), Aπk(Cins,s), Aπk(Csat,s), Aπk(Het,s), Aπk(H-Het,s). In this study, we calculated the first six classes of probabilities (k = 0 to 5) for the 5 AS, in total 6·5 = 30 molecular descriptors [64].
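The calculation in equation (1) and the atom-set sums can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the three-atom stochastic matrix, the uniform initial vector, and the atom-set indices are hypothetical, and the target-specific weighting that MARCH-INSIDE uses to build the matrix is not reproduced here.

```python
def step(pi, matrix):
    """One Markov step: row vector pi times the row-stochastic matrix."""
    n = len(pi)
    return [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def absolute_probabilities(pi0, matrix, k_max):
    """Return [pi_0, pi_1, ..., pi_kmax], where pi_k = pi_0 * matrix**k."""
    vectors = [list(pi0)]
    for _ in range(k_max):
        vectors.append(step(vectors[-1], matrix))
    return vectors


def atom_set_probability(pi_k, atom_set):
    """Sum the absolute probabilities of the atoms in one atom set (AS)."""
    return sum(pi_k[j] for j in atom_set)


# Hypothetical stochastic matrix for a three-atom chain; each row sums to 1.
matrix = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
pi0 = [1 / 3, 1 / 3, 1 / 3]  # hypothetical initial probabilities

vectors = absolute_probabilities(pi0, matrix, k_max=5)  # k = 0 to 5
het = [2]  # hypothetical: the third atom is the only heteroatom
descriptors = [atom_set_probability(v, het) for v in vectors]
```

Every vector in `vectors` sums to one, which is why only sums over a subset of atoms carry structural information; with six values of k and five atom sets this scheme yields the 30 descriptors mentioned above.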
Statistical analysis

As a continuation of the previous sections, we can attempt to develop a simple linear QSPR using the MARCH-INSIDE methodology, as defined previously, with the general formula [62]:

$$
Actv = {}^{c}b_{0}\cdot{}^{A}\pi_{0}(C,s) + {}^{c}b_{1}\cdot{}^{A}\pi_{1}(C,s) + {}^{c}b_{2}\cdot{}^{A}\pi_{2}(C,s) + {}^{c}b_{3}\cdot{}^{A}\pi_{3}(C,s) + \ldots + {}^{c}b_{k}\cdot{}^{A}\pi_{k}(C,s) + b
\tag{2}
$$

Here, the absolute probabilities Aπk(C,s) play the role of molecule-target interaction descriptors for specific microbial species. We selected Linear Discriminant Analysis (LDA) [66] to fit the classification functions. The model deals with the classification of a set of compounds as active or not against different microbial species. A dummy variable (Actv) was used to codify the antimicrobial activity. This variable indicates either the presence (Actv = 1) or absence (Actv = -1) of antimicrobial activity of the drug against the microbial species in question. In equation (2), cbk represents the coefficients of the classification function, determined by the LDA module of the STATISTICA 6.0 software package [67] using a forward stepwise strategy for variable selection. The quality of the LDA models was determined by examining the Wilks' U statistic, the Fisher ratio (F), and the p-level (p). We also inspected the percentage of good classification. The model was validated with external prediction series.
3. Results and discussion

QSAR model for antiviral molecules

One of the main advantages of the present approach is that the generalized parameters kCπ,s(j) can be fitted to larger and more complex databases than previous ones. This work introduces for the first time a single linear mt-QSAR equation to predict the antiviral activity of drugs against different species. The data set consisted of marketed and/or very recently reported antiviral drugs with low reported MIC50 (< 10 μM) against different virus strains. The data set contains different drugs experimentally tested against some species from a list of more than 40 viruses. We did not find in the literature experimental values for all compounds against all listed virus species, so we were able to collect 950 cases (drug/virus pairs). The names or codes and activities of all compounds, as well as the references used to collect them, have been saved in a supplementary material file, available from the authors upon request. The model obtained was:

$$
\begin{aligned}
S(\mathrm{DVP}) ={}& -16.47\cdot{}^{1}C_{\pi}(C_{sp\,\&\,sp^{2}}) + 17.34\cdot{}^{2}C_{\pi}(C_{sp\,\&\,sp^{2}}) - 7.05\cdot{}^{0}C_{\pi}(C_{sp^{3}})\\
& - 3.06\cdot{}^{0}C_{\pi}(\text{H-Het}) - 9.69\cdot{}^{0}C_{\pi}(C_{sp\,\&\,sp^{2}}) - 0.91\cdot{}^{5}C_{\pi}(\text{H-Het}) + 0.18\\
& U = 0.48 \qquad F = 94.767 \qquad p < 0.001
\end{aligned}
\tag{3}
$$

S(DVP), the output of the model, is a real-valued variable (not a probability) that scores Drug-Virus activity specificity. In this equation, the kCπ,s(j) values were summed over the totality (T) of the atoms in the molecule or over specific atom sets (AS), as described above. These sets are atoms with a common characteristic, for instance saturated carbon atoms (Csat) or hydrogen atoms linked to heteroatoms (H-Het). The model correctly classifies 428 out of 533 active compound/virus cases (80.30%) and 481 out of 596 non-active compound/virus cases (80.7%). Overall training Accuracy was 80.5%.
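To make the use of equation (3) concrete, the following sketch evaluates S(DVP) for one drug-virus pair and recomputes the reported classification percentages from the counts given in the text. The coefficients and counts are copied from this section; the descriptor values in `example` are invented for illustration and do not correspond to any real compound, and the atom-set labels used as dictionary keys are shorthand chosen here.

```python
# Coefficients copied from equation (3). Keys are (k, atom set); the atom-set
# labels are shorthand chosen for this sketch.
COEFFICIENTS = {
    (1, "Csp&sp2"): -16.47,
    (2, "Csp&sp2"): 17.34,
    (0, "Csp3"): -7.05,
    (0, "H-Het"): -3.06,
    (0, "Csp&sp2"): -9.69,
    (5, "H-Het"): -0.91,
}
INTERCEPT = 0.18


def score(descriptors):
    """Evaluate S(DVP) for one drug-virus pair from its six descriptors."""
    return INTERCEPT + sum(b * descriptors[key] for key, b in COEFFICIENTS.items())


def percentage(correct, total):
    """Percentage of correctly classified cases."""
    return 100.0 * correct / total


# Hypothetical descriptor values for one drug-virus pair (not real data).
example = {
    (1, "Csp&sp2"): 0.10,
    (2, "Csp&sp2"): 0.12,
    (0, "Csp3"): 0.05,
    (0, "H-Het"): 0.02,
    (0, "Csp&sp2"): 0.08,
    (5, "H-Het"): 0.01,
}
s_example = score(example)

# Classification counts reported in the text.
active_rate = percentage(428, 533)
non_active_rate = percentage(481, 596)
overall_accuracy = percentage(428 + 481, 533 + 596)
```

The three percentages agree with the reported 80.30%, 80.7%, and 80.5% after rounding. How a score is turned into an active/non-active call (for example, a cut-off on S or an LDA posterior probability) is not specified in this section, so the sketch stops at the score.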
Construction of a drug-virus network

Next, we used the outputs of the mt-QSAR as inputs to construct the first CN for antiviral drugs and species based on kCπ,s(j) values. In previous works, we constructed for the first time mt-QSAR models accounting for pairs of anti-parasite [68, 69], antifungal [61, 64], or antiviral drugs [70] with similar/dissimilar multi-species activity profiles and represented them as large
networks. In this work, we have to deal with a very high number of possible Drug-Virus Pairs (DVPs). These DVPs may be investigated using CNs to regroup or cluster drugs with similar multi-virus affinity profiles. In the DVP-CN, the DVPs are nodes interconnected by edges if they have similar drug-virus activity. We need to measure the activity of the drug against different viruses if we cannot predict it. We propose to construct here, for the first time, a DVP-CN taking into consideration only the number of DVPs predicted by the mt-QSAR model based on kCπ(j) values. In order to construct this CN we followed these steps:

1. First, we calculated two types of activity Z-scores (drug score and virus score) for both experimental and QSAR-predicted values:

$$
z_{obs}(d) = \frac{\log MIC_{i}}{\log MIC_{max}}
\tag{4}
$$

$$
z_{pred}(d) = p(-)_{i}
\tag{5}
$$

where d is the affinity score, either the observed score (sobs) or the predicted score (spred). The sobs values were calculated from the experimental data (IC50). We calculated the spred of each one of the 950 compounds with all the viral strains studied here by substituting the molecular descriptors into the QSAR equation using the Microsoft Excel application [71]. Mean is the average of either sobs or spred for the DVP. We calculated the distance matrix between all DVPs using a Euclidean distance:

$$
{}^{obs}d_{ij} = \frac{1}{\log MIC_{max}}\cdot\left|\log MIC_{i} - \log MIC_{j}\right|
\tag{6}
$$

$$
{}^{pred}d_{ij} = \left|p(-)_{i} - p(-)_{j}\right|
\tag{7}
$$
2. Using Microsoft Excel [71] again, we transformed the derived DVP distance matrices into Boolean matrices. The elements of this type of matrix are equal to 1 if two DVPs have a Euclidean distance dij below a cut-off value. We explored threshold values in the range from log ICsobs to ICspred, trying to obtain an average DVP node degree equal to 1 while minimizing the number of disconnected DVPs [61].
3. The Boolean matrix was saved as a .txt file. After renaming the .txt file as a .mat file, we read it with the software CentiBin [59]; we then renamed the .mat file as a .net file and read it with the Pajek software. Using Pajek, we can not only represent the network but also highlight all DVPs (nodes) connected to a specific DVP and calculate connectivity parameters [72].

4. Last, we compared the observed and predicted DVP-DVP networks pair by pair, calculating the total Accuracy values. The mt-QSAR correctly predicted 751 334 similar/dissimilar drug-virus pairs out of 836 310 pairs. Thus, the mt-QSAR based on kCπ(j) values predicts the real CN (based on MIC50 values) with an Accuracy of 89.8% (similarity/dissimilarity relationships present in both CNs). The cut-off values that maximize the similarity between the two CNs were used to decide whether two nodes are connected or not: p(real) = (logMIC50/logMICmax) = 0.03 for the observed CN and p(predicted) = 0.017. Figure 1 illustrates both the observed CN and the CN predicted with the mt-QSAR model.
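The four steps above can be sketched in plain Python as follows. This is an illustrative stand-in for the Excel, CentiBin, and Pajek workflow described in the text, not a reproduction of it: the scores and labels are invented, the two cut-offs (0.03 and 0.017) are the ones reported above, and the helper names are hypothetical. The `to_pajek` helper writes the plain `*Vertices` / `*Edges` layout that Pajek reads from .net files.

```python
from itertools import combinations


def distance_matrix(scores):
    """Absolute score differences between all pairs, as in equations (6)-(7)."""
    n = len(scores)
    return [[abs(scores[i] - scores[j]) for j in range(n)] for i in range(n)]


def boolean_matrix(distances, cutoff):
    """1 where two distinct nodes are closer than the cut-off, else 0."""
    n = len(distances)
    return [[1 if i != j and distances[i][j] < cutoff else 0 for j in range(n)]
            for i in range(n)]


def pairwise_accuracy(observed, predicted):
    """Percentage of node pairs with the same connected/unconnected status."""
    pairs = list(combinations(range(len(observed)), 2))
    matches = sum(1 for i, j in pairs if observed[i][j] == predicted[i][j])
    return 100.0 * matches / len(pairs)


def to_pajek(adjacency, labels):
    """Serialize an undirected network as Pajek .net text."""
    lines = ["*Vertices %d" % len(labels)]
    lines += ['%d "%s"' % (i + 1, name) for i, name in enumerate(labels)]
    lines.append("*Edges")
    for i, j in combinations(range(len(labels)), 2):
        if adjacency[i][j]:
            lines.append("%d %d" % (i + 1, j + 1))
    return "\n".join(lines)


# Hypothetical scores for four drug-virus pairs (not real data).
labels = ["DVP-1", "DVP-2", "DVP-3", "DVP-4"]
z_obs = [0.10, 0.12, 0.50, 0.52]
z_pred = [0.010, 0.020, 0.030, 0.300]

observed = boolean_matrix(distance_matrix(z_obs), cutoff=0.03)
predicted = boolean_matrix(distance_matrix(z_pred), cutoff=0.017)
toy_accuracy = pairwise_accuracy(observed, predicted)
net_text = to_pajek(observed, labels)

# Accuracy recomputed from the pair counts reported in step 4.
reported_accuracy = 100.0 * 751334 / 836310
```

In this toy example four of the six node pairs agree between the two networks; the value recomputed from the reported pair counts rounds to the 89.8% given in the text.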
Figure 1. Drug-Virus Complex Networks: (A) observed network and (B) predicted network.
Table 1. Results of the Leave-Species-Out (LSO) validation for the mt-QSAR model.

Virus                      # Cases   LSO (%)   Accuracy (%)
Cowpox Virus                    24      91.7           87.4
Epstein-Barr Virus              16      87.5           87.4
Hepatitis B Virus               36      88.9           87.3
Hepatitis C Virus              114      89.5           86.5
Herpes Simplex Virus 1          55      96.4           87.0
HIV-1                          409      98.5           87.3
Human Cytomegalovirus          124      99.2           87.0
Human Papillomavirus            37      81.1           87.4
Influenza Virus                 38      89.5           87.2
Vaccinia Virus                  51      94.1           86.9
Varicella-Zoster Virus          55     100             87.0
Acknowledgements

Prado-Prado, F. acknowledges financial support from Xunta de Galicia and the European Social Fund (ESF) for a one-year post-doctoral position (Research Project IN89A 2008/75-0). González-Díaz, H. acknowledges financial support from the Program Isidro Parga Pondal and a one-year post-doctoral position (Research Project IN89A 2008/117-0), both funded by Xunta de Galicia and the ESF.
References

1. Prado-Prado FJ, de la Vega OM, Uriarte E, Ubeira FM, Chou KC, González-Díaz H. Unified QSAR approach to antimicrobials. 4. Multi-target QSAR modeling and comparative multi-distance study of the giant components of antiviral drug-drug complex networks. Bioorg Med Chem. 2009;17:569-75.
2. Katritzky AR, Dobchev DA, Fara DC, Karelson M. QSAR studies on 1-phenylbenzimidazoles as inhibitors of the platelet-derived growth factor. Bioorg Med Chem. 2005 Dec 15;13(24):6598-608.
3. Basak SC, Grunwald GD, Gute BD, Balasubramanian K, Opitz D. Use of statistical and neural net approaches in predicting toxicity of chemicals. J Chem Inf Comput Sci. 2000 Jul-Aug;40(4):885-90.
4. Baskin II, Ait AO, Halberstam NM, Palyulin VA, Zefirov NS. An approach to the interpretation of backpropagation neural network models in QSAR studies. SAR QSAR Environ Res. 2002 Mar;13(1):35-41.
5. Benigni R, Giuliani A. Quantitative structure-activity relationship (QSAR) studies in genetic toxicology: mathematical models and the "biological activity" term of the relationship. Mutat Res. 1994 Apr 15;306(2):181-6.
6. Fernandez M, Caballero J. Bayesian-regularized genetic neural networks applied to the modeling of non-peptide antagonists for the human luteinizing hormone-releasing hormone receptor. J Mol Graph Model. 2006 Feb 28.
7. Fernandez M, Caballero J, Tundidor-Camba A. Linear and nonlinear QSAR study of N-hydroxy-2-[(phenylsulfonyl)amino]acetamide derivatives as matrix metalloproteinase inhibitors. Bioorg Med Chem. 2006 Jun 15;14(12):4137-50.
8. Fernandez M, Tundidor-Camba A, Caballero J. Modeling of cyclin-dependent kinase inhibition by 1H-pyrazolo[3,4-d]pyrimidine derivatives using artificial neural network ensembles. Journal of chemical information and modeling. 2005 Nov-Dec;45(6):1884-95.
9. Fernandez M, Caballero J, Helguera AM, Castro EA, Gonzalez MP. Quantitative structure-activity relationship to predict differential inhibition of aldose reductase by flavonoid compounds. Bioorg Med Chem. 2005 May 2;13(9):3269-77.
10. Caballero J, Garriga M, Fernandez M. Genetic neural network modeling of the selective inhibition of the intermediate-conductance Ca2+-activated K+ channel by some triarylmethanes using topological charge indexes descriptors. J Comput Aided Mol Des. 2005 Nov;19(11):771-89.
11. Caballero J, Fernandez M. Linear and nonlinear modeling of antifungal activity of some heterocyclic ring derivatives using multiple linear regression and Bayesian-regularized neural networks. J Mol Model. 2006 Jan;12(2):168-81.
12. Caballero J, Fernandez M. Linear and nonlinear modeling of antifungal activity of some heterocyclic ring derivatives using multiple linear regression and Bayesian-regularized neural networks. J Mol Model (Online). 2005 Oct 21:1-14.
13. Caballero J, Fernandez L, Abreu JI, Fernandez M. Amino Acid Sequence Autocorrelation vectors and ensembles of Bayesian-Regularized Genetic Neural Networks for prediction of conformational stability of human lysozyme mutants. Journal of chemical information and modeling. 2006 May-Jun;46(3):1255-68.
14. Sanchez R, Grau R. A genetic code Boolean structure. II. The genetic information system as a Boolean information system. Bulletin of mathematical biology. 2005 Sep;67(5):1017-29.
15. Sanchez R, Grau R. A novel algebraic structure of the genetic code over the Galois field of four DNA bases. Acta biotheoretica. 2006;54(1):27-42.
16. Sanchez R, Grau R, Morgado E. A novel Lie algebra of the genetic code over the Galois field of four DNA bases. Mathematical biosciences. 2006 Jul;202(1):156-74.
17. Sanchez R, Morgado E, Grau R. A genetic code Boolean structure. I. The meaning of Boolean deductions. Bulletin of mathematical biology. 2005 Jan;67(1):1-14.
18. Sanchez R, Morgado E, Grau R. Gene algebra from a genetic code algebraic structure. Journal of mathematical biology. 2005 Oct;51(4):431-57.
19. Bashford JD, Jarvis PD. The genetic code as a periodic table: algebraic aspects. Bio Systems. 2000 Aug-Sep;57(3):147-61.
20. Beland P, Allen TF. The origin and evolution of the genetic code. Journal of theoretical biology. 1994 Oct 21;170(4):359-65.
21. Zhang Z, Grigorov MG. Similarity networks of protein binding sites. Proteins. 2006 Feb 1;62(2):470-8.
22. Voy BH, Scharff JA, Perkins AD, Saxton AM, Borate B, Chesler EJ, et al. Extracting gene networks for low-dose radiation using graph theoretical algorithms. PLoS Comput Biol. 2006 Jul 21;2(7):e89.
23. Tanaka T, Ikeo K, Gojobori T. Evolution of metabolic networks by gain and loss of enzymatic reaction in eukaryotes. Gene. 2006 Jan 3;365:88-94.
24. Sun S, Zhao Y, Jiao Y, Yin Y, Cai L, Zhang Y, et al. Faster and more accurate global protein function assignment from protein interaction networks using the MFGO algorithm. FEBS Lett. 2006 Mar 20;580(7):1891-6.
25. Gelfand MS. Evolution of transcriptional regulatory networks in microbial genomes. Curr Opin Struct Biol. 2006 Jun;16(3):420-9.
26. Barabasi AL. Sociology. Network theory--the emergence of the creative enterprise. Science. 2005 Apr 29;308(5722):639-41.
27. Barabasi AL, Freeh VW, Jeong H, Brockman JB. Parasitic computing. Nature. 2001 Aug 30;412(6850):894-7.
28. Barabasi AL, Oltvai ZN. Network biology: understanding the cell's functional organization. Nature reviews. 2004 Feb;5(2):101-13.
29. de Menezes MA, Barabasi AL. Fluctuations in network dynamics. Physical review letters. 2004 Jan 16;92(2):028701.
30. Dezso Z, Barabasi AL. Halting viruses in scale-free networks. Physical review. 2002 May;65(5 Pt 2):055103.
31. Dezso Z, Oltvai ZN, Barabasi AL. Bioinformatics analysis of experimentally determined protein complexes in the yeast Saccharomyces cerevisiae. Genome research. 2003 Nov;13(11):2450-4.
32. Dobrin R, Beg QK, Barabasi AL, Oltvai ZN. Aggregation of topological motifs in the Escherichia coli transcriptional regulatory network. BMC bioinformatics. 2004 Jan 30;5:10.
33. Jeong H, Mason SP, Barabasi AL, Oltvai ZN. Lethality and centrality in protein networks. Nature. 2001 May 3;411(6833):41-2.
34. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL. The large-scale organization of metabolic networks. Nature. 2000 Oct 5;407(6804):651-4.
35. Oliveira JG, Barabasi AL. Human dynamics: Darwin and Einstein correspondence patterns. Nature. 2005 Oct 27;437(7063):1251.
36. Yu X, Lin J, Masuda T, Esumi N, Zack DJ, Qian J. Genome-wide prediction and characterization of interactions between transcription factors in Saccharomyces cerevisiae. Nucleic Acids Res. 2006;34(3):917-27.
37. Carter SL, Brechbuhler CM, Griffin M, Bond AT. Gene co-expression network topology provides a framework for molecular characterization of cellular state. Bioinformatics. 2004 Sep 22;20(14):2242-50.
38. Carlson MR, Zhang B, Fang Z, Mischel PS, Horvath S, Nelson SF. Gene connectivity, function, and sequence conservation: predictions from modular yeast co-expression networks. BMC Genomics. 2006;7:40.
39. Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Statistical applications in genetics and molecular biology. 2005;4:Article17. 40. Reverter A, Barris W, McWilliam S, Byrne KA, Wang YH, Tan SH, et al. Validation of alternative methods of data normalization in gene co-expression studies. Bioinformatics. 2005 Apr 1;21(7):1112-20. 41. Estrada E. Virtual identification of essential proteins within the protein interaction network of yeast. Proteomics. 2006 Jan;6(1):35-40. 42. Yu X, Lin J, Zack DJ, Qian J. Computational analysis of tissue-specific combinatorial gene regulation: predicting interaction between transcription factors in human tissues. Nucleic Acids Res. 2006;34(17):4925-36. 43. Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, et al. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006;7 Suppl 1:S7. 44. Todeschini R, Consonni V. Handbook of Molecular Descriptors: Wiley-VCH 2002. 45. Otzen T, Wempe EG, Kunz B, Bartels R, Lehwark-Yvetot G, Hansel W, et al. Folate-synthesizing enzyme system as target for development of inhibitors and inhibitor combinations against Candida albicans-synthesis and biological activity of new 2,4-diaminopyrimidines and 4'-substituted 4-aminodiphenyl sulfones. J Med Chem. 2004 Jan 1;47(1):240-53. 46. Fratev F, Benfenati E. 3D-QSAR and molecular mechanics study for the differences in the azole activity against yeastlike and filamentous fungi and their relation to P450DM inhibition. 1. 3-substituted-4(3H)-quinazolinones. Journal of chemical information and modeling. 2005 May-Jun;45(3):634-44. 47. Kubinyi H. Quantitative structure-activity relationships (QSAR) and molecular modelling in cancer research. J Cancer Res Clin Oncol. 1990;116(6):529-37. 48. Marrero-Ponce Y, Medina-Marrero R, Torrens F, Martinez Y, Romero-Zaldivar V, Castro EA. 
Atom, atom-type, and total nonstochastic and stochastic quadratic fingerprints: a promising approach for modeling of antibacterial activity. Bioorg Med Chem. 2005 Apr 15;13(8):2881-99. 49. Marrero-Ponce Y, Castillo-Garit JA, Olazabal E, Serrano HS, Morales A, Castanedo N, et al. Atom, atom-type and total molecular linear indices as a promising approach for bioorganic and medicinal chemistry: theoretical and experimental assessment of a novel method for virtual screening and rational design of new lead anthelmintic. Bioorg Med Chem. 2005 Feb 15;13(4):1005-20. 50. Marrero-Ponce Y, Montero-Torres A, Zaldivar CR, Veitia MI, Perez MM, Sanchez RN. Non-stochastic and stochastic linear indices of the 'molecular pseudograph's atom adjacency matrix': application to 'in silico' studies for the rational discovery of new antimalarial compounds. Bioorg Med Chem. 2005 Feb 15;13(4):1293-304. 51. González-Díaz H, González-Díaz Y, Santana L, Ubeira FM, Uriarte E. Proteomics, networks and connectivity indices. Proteomics. 2008;8:750-78. 52. Manke T, Demetrius L, Vingron M. Lethality and entropy of protein interaction networks. Genome Inform Ser. 2005;16(1):159-63.
53. González-Díaz H, Olazabal E, Castanedo N, Sanchez IH, Morales A, Serrano HS, et al. Markovian chemicals "in silico" design (MARCH-INSIDE), a promising approach for computer aided molecular design II: experimental and theoretical assessment of a novel method for virtual screening of fasciolicides. J Mol Model (Online). 2002 Aug;8(8):237-45. 54. González-Díaz H, Gia O, Uriarte E, Hernadez I, Ramos R, Chaviano M, et al. Markovian chemicals "in silico" design (MARCH-INSIDE), a promising approach for computer-aided molecular design I: discovery of anticancer compounds. J Mol Model. 2003 Dec;9(6):395-407. 55. Ferino G, Delogu G, Podda G, Uriarte E, González-Díaz H. Quantitative Proteome-Disease Relationships (QPDRs) in Clinical Chemistry: Prediction of Prostate Cancer with Spectral Moments of PSA/MS Star Networks. In: Mitchem BHaS, Ch.L., ed. Clinical Chemistry Research (ISBN: 978-1-60692-517-1). NY: Nova Science Publisher 2009. 56. Concu R, Podda G, Uriarte E, González-Díaz H. A New Computational Chemistry & Complex Networks approach to Structure-Function and Similarity Relationships in Protein Enzymes. In: Collett CTaR, C.D., ed. Handbook of Computational Chemistry Research: Nova Science Publishers 2009. 57. González-Díaz H, Prado-Prado F, Ubeira FM. Predicting Antimicrobial Drugs and Targets with the MARCH-INSIDE approach. Cuerrent Topics in Medicinal Chemistry. 2008;8(18):1676-90. 58. Cruz-Monteagudo M, González-Díaz H, Borges F, Dominguez ER, Cordeiro MN. 3D-MEDNEs: An Alternative “in Silico” Technique for Chemical Research in Toxicology. 2. Quantitative Proteome-Toxicity Relationships (QPTR) based on Mass Spectrum Spiral Entropy. Chem Res Toxicol. 2008(21):619–32. 59. Junker BH, Koschuetzki D, Schreiber F. Exploration of biological network centralities with CentiBiN. BMC Bioinformatics. 2006 Apr 21;7(1):219. 60. Estrada E, Rodriguez-Velazquez JA. Subgraph centrality in complex networks. Phys Rev E. 2005 May;71(5 Pt 2):056103. 61. González-Díaz H, Prado-Prado F. 
Unified QSAR and Network-Based Computational Chemistry Approach to Antimicrobials, Part 1: Multispecies Activity Models for Antifungals. J Comput Chem. 2008;29:656-7. 62. González-Díaz H, Sanchez IH, Uriarte E, Santana L. Symmetry considerations in Markovian chemicals 'in silico' design (MARCH-INSIDE) I: central chirality codification, classification of ACE inhibitors and prediction of sigma-receptor antagonist activities. Comput Biol Chem. 2003 Jul;27(3):217-27. 63. Prado-Prado F, González-Díaz H, Santana L, Uriarte E. Unified QSAR approach to antimicrobials. Part 2: Predicting activity against more than 90 different species in order to halt antibacterial resistance. Bioorg Med Chem. 2007;15 897-902. 64. González-Díaz H, Prado-Prado FJ, Santana L, Uriarte E. Unify QSAR approach to antimicrobials. Part 1: Predicting antifungal activity against different species. Bioorg Med Chem. 2006 Jun 5;14 5973-80. 65. Prado-Prado F, González-Díaz H, Santana L, Uriarte E. Unified QSAR approach to antimicrobials. Part 2: Predicting activity against more than 90 different species in order to halt antibacterial resistance. Bioorg Med Chem. 2007;15:897-902.
66. Van Waterbeemd H. Chemometric methods in molecular design. New York: Wiley-VCH 1995. 67. StatSoft.Inc. STATISTICA (data analysis software system), version 6.0, www.statsoft.com.Statsoft, Inc. 6.0 ed 2002. 68. Prado-Prado FJ, González-Díaz H, Martinez de la Vega O, Ubeira FM, Chou KC. Unified QSAR approach to antimicrobials. Part 3: First multi-tasking QSAR model for Input-Coded prediction, structural back-projection, and complex networks clustering of antiprotozoal compounds. Bioorg Med Chem. 2008;16:5871–80. 69. González-Díaz H, Prado-Prado F, Ubeira FM. Predicting Antimicrobial Drugs and Targets with the MARCH-INSIDE approach. Curr Top Med Chem. 2008;8(18):1676-90. 70. Prado-Prado J, Martinez de la Vega O, Uriarte E, Ubeira FM, Chou K-C, GonzálezDíaz H. Unified QSAR approach to antimicrobials. 4. Multi-target QSAR modeling and comparative multi-distance study of the giant components of antiviral drug– drug complex networks. Bioorg Med Chem. 2008;doi:10.1016/j.bmc.2008.11.075. 71. Microsoft.Corp. Microsoft Excel 2002:Microsoft Excel 72. Batagelj V, Mrvar A. Pajek 1.15. 2006. 73. Devah P. Mark of a Criminal Record. Am J Soc. 2003(108):937-75.
Transworld Research Network 37/661 (2), Fort P.O. Trivandrum-695 023 Kerala, India
Topological Indices for Medicinal Chemistry, Biology, Parasitology, Neurological and Social Networks, 2010: 15-34 ISBN: 978-81-7895-489-9 Editors: Humberto González-Díaz and Cristian Robert Munteanu
2. Multi-target QSAR & phylogenetic analysis of antifungal activity
Francisco J. Prado-Prado¹, Lourdes Santana¹ and Humberto González-Díaz²
¹ Department of Organic Chemistry, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
² Department of Microbiology & Parasitology, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
Abstract. Several important pathogenic fungal species infect the skin. These infections show a broad profile of susceptibility to different antifungal drugs. This fact justifies the need for new methods to classify fungi according to drug susceptibility and to discover antifungal compounds with species-specific action. In this sense, phylogenetic tree analysis is often used to classify fungal species based on gene or protein sequences, and Quantitative Structure-Activity Relationship (QSAR) models are used to explore large databases of compounds prior to synthesis and biological assay. One limitation of phylogenetic analysis is that it encodes phylogenetic similarity between fungal species but not necessarily their susceptibility to drugs. Many QSAR models predict the biological activity of drugs against only one fungal species because almost all molecular descriptors describe the molecular structure but not the drug target. This work develops a multi-target QSAR (mt-QSAR) model based on entropy indices calculated with Markov Chain theory and Linear Discriminant Analysis (LDA). The new model classifies 280 drugs as active or non-active against 90 fungal species (19 500+ pairs). The model correctly classifies 12 420 out of 12 566 non-active compounds (98.84%) and 468 out of 468 active compounds (100%) in training. Validation of the model was carried out by means of external predicting series, in which the model correctly classified 6 210 out of 6 277 non-active compounds (98.93%) and 239 out of 239 active compounds (100%). In addition, we propose a method to construct fungal phylogenetic trees based on drug susceptibility calculated with the mt-QSAR equation instead of protein or gene sequences. The new phylogenetic tree is able to include 16 680 drug-fungus target outcomes not yet measured experimentally. The method offers complementary information to classic phylogenetic trees based on Small Sub-Unit ribosomal RNA (SSU-rRNA), which do not necessarily correspond to drug activity profiles.

Correspondence/Reprint request: Dr. FJ. Prado-Prado, Department of Organic Chemistry, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain. E-mail: fenol1@hotmail.com. Dr. H González-Díaz, Department of Microbiology & Parasitology, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain. E-mail: humberto.gonzalez@usc.es
1. Introduction
Fungi are spread far and wide in our environment. In nature, one may find them in soil and water as well as on plants and animals. Of the roughly 50,000 different species belonging to the realm of fungi, fewer than 300 have been held responsible for infections in humans. Among the pathogenic species, fewer than a dozen cause far more than 90% of infections. Fungi form part of the normal skin flora, but fungal infections of the skin, hair and nails are common skin diseases. Fungi can infect the skin of people of all ages. Increased incidence occurs in immunocompromised patients who have AIDS or are being treated with chemotherapeutic agents and therapy directed at reducing inflammation. People with diabetes and older people also have more infections [1, 2]. Skin infections can be divided into the most common superficial group, which stays in the outer layers of the skin, and an invasive group, which extends beyond the skin into adjacent tissues and spreads to other organs. Invasive skin infections such as blastomycosis often develop after a primary lung infection is established. The infecting yeast travels in the blood from the lung to skin areas [3, 4]. Consequently, there is increasing interest in the development of rational approaches for the discovery of antifungal drugs. In this sense, a very important role may be played by computer-aided drug discovery techniques based on Quantitative Structure-Activity Relationship (QSAR) models [5]. Unfortunately, almost all QSAR studies, including those for antifungal activity, use limited databases of structurally related compounds acting against one single fungus species [6]. One important step in the evolution of this field was the introduction of QSAR models for heterogeneous series of antimicrobial compounds; see for instance the works of Cronin, de Julián-Ortiz, Gálvez, García-Domenech, Gosalbez, Marrero-Ponce, Torrens, and others [7-12]. As a result, researchers may predict very heterogeneous series of compounds but often need to use or develop as many QSAR equations as there are microbial species to be predicted. In any case, if the aim is to predict activity against different targets, one different QSAR model is still needed for each target. An interesting alternative is the prediction of structurally diverse series of antimicrobial compounds (antiviral in this case) against different targets (mechanisms) using complicated non-linear Artificial Neural Networks with multi-class prediction, e.g. the work of Vilar et al. [13]. We can understand strategies developed in this sense as Multi-Objective Optimization (MOOP) techniques; in this case we aim to optimize the activity of antifungal drugs against many different objectives or targets (fungus species). A very useful strategy related to the MOOP problem uses Derringer's desirability function and many QSAR models for different objectives [14, 15]. In this sense, it is of major importance to develop unified but simple linear equations explaining the antimicrobial activity (in the present work, antifungal activity) of structurally heterogeneous series of compounds active against as many targets (fungus species) as possible. We call this class of QSAR problem the multi-target QSAR (mt-QSAR) [16-22]; see Figure 1.
Figure 1. Comparison of classic QSAR vs. mt-QSAR.
There are nearly 2000 chemical molecular descriptors that may, in principle, be generalized and used to solve the mt-QSAR problem. Many of these indices are known as Topological Indices (TIs) or simply invariants of a molecular graph G. We can rationalize G as a drawing composed of vertices (atoms) weighted with physicochemical properties (mass, polarity, electronegativity, or charge) and edges (chemical bonds) [23]. In any case, many of these indices have not yet been extended to encode information additional to chemical structure. One alternative for tackling mt-QSAR is the substitution of classic atomic weights by target-specific weights. For instance, we introduced and/or reviewed TIs that use atomic weights for the propensity of the atom to interact with different microbial targets [24], to undergo partition in a biphasic system, or to distribute to biological tissues [25, 26]. The method, called the MARCH-INSIDE approach (MARkovian CHemicals IN SIlico DEsign), calculates TIs using Markov Chain theory. In fact, MARCH-INSIDE defines a Markov matrix to derive matrix invariants such as stochastic spectral moments, mean values, absolute probabilities, or entropy measures for the study of molecular properties. Applications to macromolecules have been extended to RNA, proteins, and the blood proteome [27, 28]. In particular, one of the classes of MARCH-INSIDE descriptors is defined in terms of entropy measures, which have demonstrated flexibility in many bioorganic and medicinal chemistry problems such as the estimation of anticoccidial activity, modelling the interaction between drugs and HIV-packaging-region RNA, and predicting protein and virus activity [29, 30]. We give high importance to entropy measures because entropy has been widely demonstrated to be an excellent function to codify information in molecular systems; see for instance the important works of Graham [31, 32].
However, the proficiency of entropy indices (of MARCH-INSIDE type or not) to solve mt-QSAR problems for antifungal compounds has not been studied. On the other hand, Bioinformatics methods based on sequence alignment are very useful to perform taxonomic classification of living organisms based on phylogenetic trees constructed with sequences of proteins and nucleic acids [33-35]. However, some authors have reported that alignment procedures may fail in cases of low sequence homology between the query and the template sequences deposited in the database. Alignment techniques are also useless if there is high query-template homology but the function of the template sequence deposited in the database is unknown [36]. Readers can consult reviews and compilations on this topic for an overview of this area [37]. Phylogenetics is commonly used to classify different organisms in a taxonomic system [38]. These classic phylogenetic methods are very dependent on sequence alignment. Therefore, complementary or alternative approaches for the prediction of critical residues would be desirable. Thibert and Bredesen [39] reported a study of cancer proteins in an extensive human protein interaction network (PIN) constructed by computational methods. They compared a couple of phylogenetic approaches to several different network-based methods for the prediction of critical residues, and showed that a combination of one phylogenetic method and one network-based method is superior to other methods previously employed. The approach associates a network with each member of a set of proteins for which the 3D structure is known and the critical residues have been previously determined experimentally. Phylogenetic approaches led to predictions that were as reliable as the network-based measurements although, interestingly, the two general approaches tend to predict different sets of critical residues. Hence these authors proposed a hybrid method composed of one network CI (closeness centrality) and one phylogenetic approach (Conseq server). This hybrid approach predicts critical residues more accurately than the other methods tested. The present study develops the first mt-QSAR model based on entropy indices to predict the antifungal activity of drugs against different fungus species. The model fits one of the largest datasets used to date in QSAR studies (19 500+ cases), which results from forming different antifungal compound/fungus target pairs.
2. Methods
2.1. Markov entropy (θk) for drug-target k-th step-by-step interaction
One can consider a hypothetical situation in which a drug molecule is free in space at an arbitrary initial time (t0). It is then interesting to develop a simple stochastic model for a step-by-step interaction between the atoms of a drug molecule and a molecular receptor during the time in which the pharmacological effect is triggered. For the sake of simplicity, we consider from now on a general structure-less receptor, that is, a model of the receptor in which its chemical structure and position are not taken into consideration. Specifically, the molecular descriptors used in the present work are called stochastic entropies θk, which are entropies describing the connectivity and the distribution of electrons for each atom in the molecule [40]. The initial entropy of interaction of the j-th atom of the drug with the target, 0θj(s), is considered a state function, so a reversible process of interaction may be decomposed into several elemental interactions between the j-th atom and the receptor. The 0 indicates that we refer to the initial interaction, and the argument (s) indicates that this quantity depends on the specific fungus species.
Afterwards, the interaction continues and we have to define the interaction probability kθij(s) between the j-th atom and the receptor for a specific fungus species (s), given that the i-th atom has interacted at the previous time tk. In particular, immediately after the first interaction (t0 = 0), an interaction 1pij(s) takes place at time t1 = 1, and so on. So, one can suppose that the atoms begin their interaction with the structure-less molecular receptor by binding to it at discrete intervals of time tk. However, there are several alternative ways in which such a step-by-step binding process may occur [41, 42]. Figure 2 illustrates this idea.
Figure 2
The entropy 0θj(s) will be considered here as a function of the absolute temperature of the system and the local equilibrium constant of interaction between the j-th atom and the receptor, 0Γj(s), for a given microbial species. Additionally, 1θij(s) can be defined by analogy as γij(s) [30, 43]:
$$ {}^{0}\theta_j(s) = -R \cdot T \cdot \log {}^{0}\Gamma_j(s) \tag{1} $$

$$ {}^{1}\theta_{ij}(s) = \gamma_{ij}(s) = -R \cdot T \cdot \log {}^{1}\Gamma_{ij}(s) \tag{2} $$
The present approach to antimicrobial-species-specific drug-receptor interaction has two main drawbacks. The first is the difficulty of defining the constants. In this work, we solve the first problem by estimating 0Γj(s) from the rate of occurrence nj(s) of the j-th atom in molecules active against a given species with respect to the number of atoms of the j-th class in the molecules tested against the same species, nT(s). With respect to 1Γij(s), we must take into consideration that once the j-th atom has interacted, the preferred candidates for the next interaction are those i-th atoms bound to j by a chemical bond. Both constants can then be written as [30, 43]:

$$ {}^{0}\Gamma_j(s) = \left( \frac{n_j(s)}{n_T(s)} + 1 \right) = e^{-\frac{{}^{0}\theta_j(s)}{R \cdot T}} \tag{3} $$

$$ {}^{1}\Gamma_{ij}(s) = \left( \alpha_{ij} \cdot \frac{n_j(s)}{n_T(s)} + 1 \right) = e^{-\frac{{}^{1}\theta_{ij}(s)}{R \cdot T}} \tag{4} $$
Here, αij are the elements of the atom adjacency matrix; nj(s), nT(s), 0θj(s), and 1θij(s) have been defined in the paragraph above; R is the universal gas constant; and T is the absolute temperature. The number 1 is added to avoid scale and logarithmic function definition problems. The second problem relates to the description of the interaction process at higher times tk > t1. Here, Markov model (MM) theory enables a simple calculation of the probabilities with which the drug-receptor interaction takes place in the time until the studied effect is achieved. In this work we focus on the interaction between drugs and a structure-less microbial target. As depicted in Figure 1, this model deals with the calculation of the probabilities (kpij) with which any arbitrary j-th atom of the molecule binds to the structure-less molecular receptor given that another i-th atom has been bound before, along discrete time periods tk (k = 1, 2, 3, …; k = 1 in grey, k = 2 in blue, and k = 3 in red) throughout the chemical bonding system. The procedure described here considers the atoms of the molecule as the states of the MM. The method arranges all the 0θj(s) values in a vector θ(s) and all the 1θij(s) entropies of interaction as a square table of dimension n × n. After normalization of both the vector and the matrix we can build up the corresponding absolute initial probability vector φ(s) and the stochastic matrix 1Π(s), which have the elements 0pj(s) and 1pij(s), respectively. The elements 0pj(s) of the vector φ(s) are the absolute probabilities with which the j-th atom interacts with the molecular target or receptor in the species s at the initial time with respect to any atom in the molecule [30, 43]:

$$ {}^{0}p_j(s) = \frac{{}^{0}\theta_j(s)}{\sum_{a=1}^{m} {}^{0}\theta_a(s)} = \frac{-RT \cdot \log\left( \frac{n_j(s)}{n_T(s)} + 1 \right)}{\sum_{a=1}^{m} -RT \cdot \log\left( \frac{n_a(s)}{n_T(s)} + 1 \right)} = \frac{\log\left( \frac{n_j(s)}{n_T(s)} + 1 \right)}{\sum_{a=1}^{m} \log\left( \frac{n_a(s)}{n_T(s)} + 1 \right)} \tag{5} $$
Here, m represents all the atoms in the molecule including the j-th, and na is the rate of occurrence of any atom a, including the j-th with value nj. On the other hand, the matrix 1Π(s) is called the 1-step drug-target interaction stochastic matrix. It is also built as a square table of order n, where n represents the number of atoms in the molecule. The elements 1pij(s) of the 1-step drug-target interaction stochastic matrix are the binding probabilities with which a j-th atom binds to a structure-less molecular receptor given that another i-th atom has interacted before, at time t1 = 1 (considering t0 = 0) [30, 43, 44]:

$$ {}^{1}p_{ij}(s) = \frac{{}^{1}\theta_{ij}(s)}{\sum_{a=1}^{n} {}^{1}\theta_{ia}(s)} = \frac{\alpha_{ij} \cdot (-RT) \cdot \log\left( \frac{n_j(s)}{n_T(s)} + 1 \right)}{\sum_{a=1}^{n} \alpha_{ia} \cdot (-RT) \cdot \log\left( \frac{n_a(s)}{n_T(s)} + 1 \right)} = \frac{\alpha_{ij} \cdot \log\left( \frac{n_j(s)}{n_T(s)} + 1 \right)}{\sum_{a=1}^{n} \alpha_{ia} \cdot \log\left( \frac{n_a(s)}{n_T(s)} + 1 \right)} \tag{6} $$
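The normalization in Eqs. (5) and (6) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MARCH-INSIDE software: the three-atom chain, the occurrence ratios nj(s)/nT(s), the temperature, and all function names are hypothetical, and every atom is assumed to have at least one bonded neighbour.

```python
import math

R = 8.314    # universal gas constant, J/(mol*K)
T = 298.15   # absolute temperature in K (assumed; it cancels in Eqs. 5-6)

def theta0(ratio):
    """Eqs. (1) and (3): theta = -R*T*log(n_j(s)/n_T(s) + 1)."""
    return -R * T * math.log(ratio + 1.0)

def initial_vector(ratios):
    """Eq. (5): absolute initial probabilities 0p_j(s)."""
    thetas = [theta0(r) for r in ratios]
    total = sum(thetas)
    return [t / total for t in thetas]

def stochastic_matrix(ratios, alpha):
    """Eq. (6): 1-step matrix; row i is normalised over the atoms bonded to i."""
    thetas = [theta0(r) for r in ratios]
    matrix = []
    for row in alpha:
        weights = [a * t for a, t in zip(row, thetas)]
        total = sum(weights)  # assumes atom i has at least one neighbour
        matrix.append([w / total for w in weights])
    return matrix

# Hypothetical three-atom chain 0-1-2 with made-up ratios n_j(s)/n_T(s).
ratios = [0.5, 0.25, 0.5]
alpha = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]
phi = initial_vector(ratios)
pi1 = stochastic_matrix(ratios, alpha)
```

Because the factor −RT appears in both numerator and denominator, the probabilities depend only on the occurrence ratios and the adjacency matrix, which is why the assumed temperature has no effect on `phi` or `pi1`.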
By using φ(s), 1Π(s) and the Chapman-Kolmogorov equations one can describe the further evolution of the system [10-17]. Summing up all the atomic entropies of interaction 0θj(s) pre-multiplied by the absolute probabilities of drug-target interaction Apk(j,s), one can derive the average changes in entropy kθs of the gradual interaction between the drug and the receptor at a specific time k in a given microbial species (s) [30]:

$$ {}^{k}\theta_s = \varphi(s) \cdot {}^{k}\Pi(s) \cdot {}^{0}\theta(s) = \varphi(s) \cdot \left[ {}^{1}\Pi(s) \right]^{k} \cdot {}^{0}\theta(s) = \sum_{j=1}^{n} {}^{k}\theta_j(s) = \sum_{j=1}^{n} {}^{A}p_k(j,s) \cdot {}^{0}\theta_j(s) \tag{7} $$
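As a rough illustration of Eq. (7), the sketch below propagates the initial probability vector through k steps of the stochastic matrix and then averages the atomic entropies. The two-atom system, its numerical values, and the function names are hypothetical and chosen only so that the arithmetic is easy to follow by hand.

```python
def vec_mat(v, m):
    """Row vector times matrix: result_j = sum_i v_i * m_ij."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

def markov_entropy(phi, pi1, theta0, k):
    """Eq. (7): phi * (pi1)^k * theta0, evaluated step by step."""
    p = list(phi)
    for _ in range(k):        # Chapman-Kolmogorov: one matrix product per step
        p = vec_mat(p, pi1)
    # p now holds the absolute probabilities A_p_k(j, s) after k steps
    return sum(pj * tj for pj, tj in zip(p, theta0))

# Hypothetical two-atom system (arbitrary entropy units).
phi = [0.25, 0.75]
pi1 = [[0.0, 1.0],
       [1.0, 0.0]]
theta0 = [-2.0, -4.0]
series = [markov_entropy(phi, pi1, theta0, k) for k in range(3)]
```

In this toy case the two atoms simply exchange their probabilities at every step, so the k = 0 and k = 2 values coincide.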
Such a model is stochastic per se (probabilistic step-by-step atom-receptor interaction in time) but also considers molecular connectivity (the step-by-step atom union in space throughout the chemical bonding system). Another interesting direction is the use of TIs of the DNA, RNA and/or protein graph representations described above, as well as others, to construct phylogenetic trees in an alignment-independent way. For instance, Zupan and Randić [45] studied Spectrum-like and Zig-Zag representations of the beta-globin gene for different species and also obtained phylogenetic trees without alignment. In another paper, Liao proposed a 2-D graphical representation of a DNA sequence [46]. Liao et al. [47] used this representation as a basis to compute the similarities between 11 mitochondrial sequences belonging to different species and used the elements of the similarity matrix to construct
the phylogenetic tree. Among all the above-mentioned authors, Liao, Randić, Basak, Vračko, Nandy and Wang [48, 49] associated a DNA sequence having n bases with an n × n non-negative real symmetric matrix M with elements aij and used its leading eigenvalue (λ) to characterize the DNA sequence in phylogenetic studies (see also section 4 of this paper). These matrices have been derived from 2DD representations, and different CIs have been calculated from them [50]. On the other hand, Zhang, Luo, and Yang [51] very recently introduced numerical parameters referred to as Zinv for 3DD curves and used these CIs to analyze the phylogenetic relationships for seven HA (H5N1) sequences of avian influenza virus. The general formulae to calculate some of these TIs and the formula for the phylogenetic dissimilarity between two sequences are given as follows [51]:

$$ \chi(M) = \frac{1}{2} \left( \frac{1}{n} \sum_{i,j=1}^{n,n} a_{ij} + \sqrt{ \frac{n-1}{n} \left( \sum_{i,j=1}^{n,n} a_{ij}^{2} \right) } \right) \tag{8} $$

$$ Inv(M) = \frac{1}{n-1} \sum_{i=1}^{n} \left( \sum_{j=1}^{n} a_{ij} \right) \tag{9} $$

$$ Zinv(M) = \frac{1}{n - \left( \frac{1}{n} \right)} \sum_{i=1}^{n} \left( \sum_{j=1}^{n} a_{ij} \right) \tag{10} $$

$$ d_{mn}(Zinv) = \sum_{k=1}^{m} \left[ Zinv_k(m) - Zinv_k(n) \right]^{2} \tag{11} $$
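The indices above can be sketched as short Python functions. This is a minimal sketch under two assumptions that should be checked against the cited source [51]: that Eq. (8) is the usual approximation to the leading eigenvalue, with a square root over the second term, and that Eq. (11) is a plain sum of squared differences (taking a square root, if intended, would not change the ordering of the dissimilarities). The function names and the 2 × 2 example matrix are hypothetical.

```python
import math

def chi_index(M):
    """Eq. (8): approximation to the leading eigenvalue of a symmetric matrix M."""
    n = len(M)
    total = sum(sum(row) for row in M)
    total_sq = sum(a * a for row in M for a in row)
    return 0.5 * (total / n + math.sqrt((n - 1) / n * total_sq))

def zinv(M):
    """Eq. (10): sum of all elements of M divided by (n - 1/n)."""
    n = len(M)
    return sum(sum(row) for row in M) / (n - 1.0 / n)

def dissimilarity(zinv_a, zinv_b):
    """Eq. (11) as written here: sum of squared differences of Zinv values."""
    return sum((x - y) ** 2 for x, y in zip(zinv_a, zinv_b))

# Hypothetical 2 x 2 example; its exact leading eigenvalue is 1.
M = [[0.0, 1.0],
     [1.0, 0.0]]
chi = chi_index(M)
```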
However, such approaches do have inherent limitations, such as the requirement for the identification of multiple homologies of the protein under consideration. Thus, phylogenetic trees are very dependent on the sequence used and may not reflect the susceptibility of fungus species to drug candidate compounds that have not yet been biologically assayed.
2.2. Statistical analysis
As a continuation of the previous sections, we can attempt to develop a simple linear QSAR using the MARCH-INSIDE methodology, as defined previously, with the general formula:

$$ Actv = a_0 \cdot {}^{0}\theta_s + a_1 \cdot {}^{1}\theta_s + a_2 \cdot {}^{2}\theta_s + a_3 \cdot {}^{3}\theta_s + \dots + a_k \cdot {}^{k}\theta_s + b_0 \tag{12} $$
Here, kθs act as the microbial species-specific molecule-target interaction descriptors. The calculation of these indices is explained in the supplementary material for reasons of space. We selected Linear Discriminant Analysis (LDA) to fit the classification functions. The model deals with the classification of a set of compounds as active or not against different microbial species [43]. A dummy variable (Actv) was used to codify the antimicrobial activity. This variable indicates either the presence (Actv = 1) or absence (Actv = –1) of antimicrobial activity of the drug against the specific species. In equation (12), ak represents the coefficients of the classification function and b0 the independent term, determined by the least squares method as implemented in the LDA module of the STATISTICA 6.0 software package [52]. Forward stepwise was fixed as the strategy for variable selection [43]. The quality of LDA models was determined by examining Wilks' U statistic, the Fisher ratio (F), and the p-level (p). We also inspected the percentage of good classification and the ratios between the number of cases and both the variables in the equation and the variables explored, in order to avoid over-fitting or chance correlation. Validation of the model was corroborated by re-substitution of cases in four predicting series [43, 52].
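To make the use of a fitted classification function concrete, the sketch below evaluates a linear function of the form of Eq. (12) and maps the score to the dummy coding Actv = 1 or –1. It is only an illustration: the coefficients and descriptor values are invented, and the zero threshold is an assumption, since the STATISTICA LDA module assigns classes from posterior probabilities rather than from a fixed cut-off.

```python
def classification_score(thetas, coefficients, intercept):
    """Linear function of the form of Eq. (12): sum_k a_k * theta_k + b0."""
    return sum(a * t for a, t in zip(coefficients, thetas)) + intercept

def predicted_class(score, threshold=0.0):
    """Map a score to the dummy coding: 1 (active) or -1 (non-active)."""
    return 1 if score > threshold else -1

# Hypothetical coefficients and descriptor values, for illustration only.
coefficients = [0.5, -1.0]
intercept = 1.0
thetas = [1.0, 2.0]
score = classification_score(thetas, coefficients, intercept)
label = predicted_class(score)
```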
2.3. Data set
The data set was formed by a set of marketed and/or very recently reported antifungal drugs with low reported MIC50 (< 10 μM) against different fungi. The data set comprised more than 280 different drugs experimentally tested against some species of a list of 90. Not all drugs were tested in the literature against all listed species, so we were able to collect 19 550 cases (drug/species pairs) instead of 280 × 90 cases. The names or codes and activity for all compounds, as well as the references used to collect them, are given in the supplementary material files.
2.4. mt-QSAR phylogenetic trees for fungi species
In principle, we can use different distance functions; here we selected the Euclidean distance only. Using the Tree Joining Cluster (TJC) analysis algorithm implemented in the software STATISTICA 6.0 we were able to construct, visualize, and compare the phylogenetic trees based on either θk alone or the weighted values ak·θk. The cases used in this study were the same 19 000+ drug-fungi pairs. The equation for the Euclidean distance is:

$$ D_{pq}(\theta) = \sqrt{ \sum_{k=0}^{5} (a_k)^{2} \cdot \left( {}^{p}\theta_k(s) - {}^{q}\theta_k(s) \right)^{2} } \tag{13} $$
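A distance of this kind can be computed as in the sketch below, which assumes that Eq. (13) is the Euclidean distance between two profiles p and q with each term weighted by ak. The profiles, weights, and function names are hypothetical; the resulting matrix is the kind of input a tree-joining clustering routine would take.

```python
import math

def weighted_distance(theta_p, theta_q, weights):
    """Weighted Euclidean distance: sqrt(sum_k a_k^2 * (theta_p_k - theta_q_k)^2)."""
    return math.sqrt(sum((a * (tp - tq)) ** 2
                         for a, tp, tq in zip(weights, theta_p, theta_q)))

def distance_matrix(profiles, weights):
    """Pairwise distances between all profiles."""
    return [[weighted_distance(p, q, weights) for q in profiles] for p in profiles]

# Hypothetical entropy profiles with two descriptors each.
profiles = [[1.0, 2.0], [4.0, 6.0]]
d_unweighted = distance_matrix(profiles, [1.0, 1.0])
d_weighted = distance_matrix(profiles, [2.0, 0.0])
```

Setting all weights to 1 gives the tree based on θk alone, while using the model coefficients as weights gives the tree based on ak·θk.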
3. Results and discussion

3.1. mt-QSAR model

One of the main advantages of the present stochastic approach is the possibility of deriving average thermodynamic parameters that depend on the probability of the states of the MM. The generalized parameters have a clearer physicochemical meaning than our previous ones [30, 42]. Specifically, this work introduces for the first time a linear mt-QSAR equation useful for prediction and MOOP of the antifungal activity of drugs against different fungal target species or objectives. The best model found was:

$$ actv = -2.37 \cdot \theta_4(s)_{het} - 19.94 \cdot \theta_0(s)_{total} - 11.18 \cdot \theta_0(s)_{Csat} + 2.62 \cdot \theta_5(s)_{Csp3} + 26.43 \cdot \theta_3(s)_{total} + 6.07 \cdot \theta_4(s)_{Csp\&sp2} - 13.55 \cdot \theta_5(s)_{total} - 42.25 \qquad (14) $$

N = 13034, λ = 0.25, χ2 = 17833.13, p < 0.001

In the model, λ is Wilks' statistic for the overall discrimination, χ2 is the Chi-square statistic, and p is the error level. In this equation, the θk values were calculated for the totality (T) of the atoms in the molecule or for specific collections of atoms. These collections are atoms with a common characteristic, for instance heteroatoms (Het), unsaturated carbon atoms (Cunst) and saturated carbon atoms (Csat). The model correctly classifies 12 420 out of 12 566 non-active compounds (98.84%) and 468 out of 468 active compounds (100%). Overall training predictability was 98.88%.

Table 1. Results of the model: analysis and validation.

Series      Parameter     %       Observed class   Classified as Non-active   Classified as Antifungal
Analysis    Sensitivity   98.84   Non-active       12 420                     146
Analysis    Specificity   100     Antifungal       0                          468
Analysis    Accuracy      98.88
Validation  Sensitivity   98.93   Non-active       6 210                      67
Validation  Specificity   100     Antifungal       0                          239
Validation  Accuracy      98.97
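The percentages in Table 1 follow directly from the classification counts. A minimal sketch of that arithmetic is given below; the function and variable names are illustrative assumptions, while the counts are those reported in the table.

```python
def classification_rates(tn, fp, fn, tp):
    """Return the percentage of correctly classified non-active compounds,
    active compounds, and all compounds together."""
    non_active = 100.0 * tn / (tn + fp)
    active = 100.0 * tp / (tp + fn)
    overall = 100.0 * (tn + tp) / (tn + fp + fn + tp)
    return non_active, active, overall

# Counts from Table 1 (analysis series and external validation series)
train = classification_rates(tn=12420, fp=146, fn=0, tp=468)
valid = classification_rates(tn=6210, fp=67, fn=0, tp=239)
```

Rounded to two decimals, `train` and `valid` reproduce the values 98.84/100/98.88 and 98.93/100/98.97 listed in the table.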
Validation of the model was carried out by means of an external prediction series; the model correctly classified 6 210 out of 6 277 non-active compounds and 239 out of 239 active compounds (see Table 1). The most interesting fact is that the θk values are able to discern the active/non-active classification of compounds among a large number of fungal species. This property is related to the definition of the θk values using species-specific atomic weights (see the supplementary material file for the method). It allows us to model for the first time a very heterogeneous and diverse data set with more than 19 500 cases (one of the largest in QSAR). The predicted class for every drug-species pair is given in the supplementary material file for the data. In Figure 3 we illustrate the high accuracy of the model in this sense. The figure shows that the model is able to predict antifungal compounds active both against species with high susceptibility and against species with lower susceptibility to the drugs found in the dataset. We understand species susceptibility to drugs as the ratio, in %, of the number of active drugs to the total number of drugs assayed or predicted. In Table 2 we give more detailed results in this sense. The present work is the first reported mt-QSAR model using the entropy θk values as molecular descriptors that allows the prediction of the antifungal activity of any organic compound against a very large diversity of fungal pathogens.
Figure 3. mt-QSAR prediction of fungus species susceptibility to antifungal drugs.
Table 2. Observed vs. predicted antifungal drugs, in % with respect to the drugs tested against each species.

Species              Obs.   Pred.   Species               Obs.   Pred.   Species               Obs.   Pred.
A. corymbifera       1.4    1.9     Ca. utilis            2.3    2.8     M. praecox            3.2    1.4
A. strumarium        2.4    0.9     C. atrobrunneum       1.9    1.4     M. racemosum          2.3    2.8
A. elegans           1.4    1.4     C. globosum           1.9    2.4     P. lilacinus          1.9    2.8
Asp. candidus        1.0    1.4     C. nigrocolor         1.9    0.0     P. variotii           2.8    2.8
Asp. flavus          7.4    6.1     Chrysosporium spp.    0.9    1.9     R. oryzae             1.4    1.0
Asp. fumigatus       7.7    6.8     C. immitis            1.4    0.9     R. glutinis           2.3    3.2
Asp. glaucus         1.4    0.0     C. recurvatus         1.4    1.4     S. cerevisiae         4.0    3.6
Asp. nidulans        2.3    2.3     Coprinus species      66.7   66.7    S. vasiformis         1.9    2.4
Asp. niger           5.8    5.8     C. laurentii          2.3    3.7     S. apiospermum        2.3    1.9
Asp. ochraceus       1.0    0.5     C. neoformans         6.2    4.9     S. prolificans        0.9    2.8
Asp. spp.            2.4    1.4     Cunninghamella spp.   60.0   20.0    S. commune            2.3    1.9
Asp. sydowii         2.4    0.9     E. floccosum          5.5    4.1     S. salmonicolor       2.8    2.8
Asp. terreus         2.8    3.7     F. oxysporum          0.5    1.4     T. ajelloi            5.9    5.0
Asp. ustus           1.4    0.5     F. solani             0.5    1.4     T. balcaneum          3.7    1.9
Asp. versicolor      1.4    1.9     Fusarium spp.         1.9    0.9     T. concentricum       4.2    2.3
Bipolaris spp.       0.5    1.9     Ma. mycetomatis       1.4    0.5     T. erinacei           3.2    2.3
B. adusta            1.9    1.4     M. furfur             4.5    4.5     T. interdigitale      3.2    3.2
B. dermatitidis      1.4    1.0     M. pachydermatis      4.6    4.1     T. mentagrophytes     7.1    4.9
Ca. albicans         13.7   13.3    M. slooffiae          2.3    0.9     T. phaseoliforme      3.2    0.9
Ca. dubliniensis     4.5    6.3     M. sympodialis        2.3    1.9     T. rubrum             7.1    6.2
Ca. famata           0.9    0.9     M. audouinii          3.7    2.3     T. schoenleinii       3.7    2.3
Ca. glabrata         11.8   8.4     M. canis              7.1    5.8     T. simii              3.2    1.4
Ca. guilliermondii   10.3   7.3     M. cookei             1.9    0.9     T. tonsurans          4.1    3.7
Ca. kefyr            100.0  77.8    M. ferrugineum        4.2    3.2     T. verrucosum         4.6    3.2
Ca. krusei           11.6   8.3     M. fulvum             3.2    2.8     T. violaceum          5.4    2.7
Ca. lusitaniae       5.9    3.2     M. gallinae           3.2    1.9     T. asahii             3.2    4.1
Ca. neoformans       1.9    0.9     M. gypseum            5.0    4.5     Basidiomycetes spp.   66.7   66.7
Ca. parapsilosis     11.0   7.6     M. nanum              3.7    2.8     W. dermatitidis       1.9    0.0
Ca. tropicalis       83.9   64.5
3.2. Drug-susceptibility phylogenetic analysis of fungi based on mt-QSAR

The use of information about the distribution of amino acids in the sequence of specific phylogenetic biomarker proteins or genes has been the major tendency in molecular phylogenetic analysis [53]. In the introduction, we discussed the importance of new phylogenetic approaches for fungi species based on the predicted susceptibility to assayed and/or un-assayed drugs. In Materials and methods we outlined for the first time the possibility of constructing such a phylogenetic tree using the terms of the mt-QSAR equation as input. We calculated the Dpq(θ) values for all pairs of fungi species present in our dataset. The new phylogenetic tree is able to include the large number of 16 680 drug-fungi target outcomes not yet measured experimentally (we do not show it for reasons of space). In Figure 4 (A) we illustrate a section of this tree that contains the highest number of Candida spp. used, including some important pathogens. On the other hand, until recently attempts to find the most parsimonious trees for large data sets were impractical, given computational limitations. A seminal work by Tehler, Little, and Farris [54] used a large data set with 1551 fungal sequences of the Small Subunit ribosomal RNA (SSU-rRNA) to analyse fungal species phylogenetically. Despite the differences in methodology, which may change the final tree obtained, we find that in general the two methods, mt-QSAR phylogeny and SSU-rRNA phylogeny, may be seen as complementary rather than redundant sources of information. For example, the mt-QSAR based tree predicts Candida glabrata as the nearest species to Candida albicans (one of the most common fungal pathogens of human beings). The two species are separated by a distance above 10% in terms of *Dpq(θ) = 100·(Dpq(θ)/Dpq(θ)max). Conversely, SSU-rRNA phylogeny detects Candida dubliniensis as the nearest species to Candida albicans. This may be because SSU-rRNA is not the drug target for all the antifungal drugs used to construct the mt-QSAR tree.

Figure 4. Part of mt-QSAR Drug-susceptibility (A) vs. SSU-rRNA phylogenetic tree (B) for some Candida spp.

To show how well the method performs, Table 3 gives interesting information about structure-activity and property-activity relationships. In this table we study the contribution of fragments to the activity of antifungal drugs such as Eupolauridine, Flucytosine, Fluconazole and Terbinafine against Candida albicans. For example, the fragments E1, E2 and E3 have similar contributions to activity; this means that each of these fragments has a high affinity for the receptor and produces the antifungal activity. By contrast, F1, F2 and F3 have a low contribution, which means that their affinity for the receptor is low [55].

Table 3. Study of fragment contribution to antifungal activity against Candida albicans.
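The relative distance *Dpq(θ) and the search for the nearest species can be sketched as below. This is only an illustration of the scaling 100·(Dpq/Dpq,max); the species labels and distance values are hypothetical placeholders, not data from this chapter, and the function names are assumptions.

```python
def relative_distances(distances):
    """Scale a {(species_p, species_q): D_pq} mapping to % of the maximum."""
    d_max = max(distances.values())
    return {pair: 100.0 * d / d_max for pair, d in distances.items()}

def nearest_species(target, distances):
    """Return (species, distance) for the species closest to `target`."""
    candidates = {pair: d for pair, d in distances.items() if target in pair}
    (a, b), d = min(candidates.items(), key=lambda item: item[1])
    return (b if a == target else a), d

# Hypothetical D_pq values, for illustration only (not from the chapter).
toy = {("sp_A", "sp_B"): 0.8, ("sp_A", "sp_C"): 2.0, ("sp_B", "sp_C"): 4.0}
toy_percent = relative_distances(toy)
```

With these placeholder values, `sp_B` is the nearest neighbour of `sp_A`, at 20% of the largest distance in the toy set.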
4. Conclusions

The entropy-based mt-QSAR equation is able to predict the biological activity of antifungal drugs in more general situations than traditional QSAR models, whose major limitation is that they predict the biological activity of drugs against only one fungal species. The terms of the mt-QSAR equation are excellent inputs to construct phylogenetic trees for fungi species based on the predicted drug susceptibility instead of protein or gene sequences. Consequently, we expect that the present approach may be very useful for medicinal chemists in the search for new antifungal compounds.
Acknowledgments

Prado-Prado, F. acknowledges financial support from Xunta de Galicia for a one-year post-doctoral position (Research Project IN89A 2008/75-0). González-Díaz, H. acknowledges financial support from the Program Isidro Parga Pondal and a one-year post-doctoral position (Research Project IN89A 2008/117-0), both funded by Xunta de Galicia and by European research funds from the European Social Fund (F.S.E.).
References

1. Fedder AM, Morn B, Moller JK. [Candidemia in the hospitals in the Aarhus County, Denmark, 1993-2002]. Ugeskrift for laeger. 2006 Jan 23;168(4):363-6.
2. Chopra A, Khuller GK. Lipid metabolism in fungi. Critical reviews in microbiology. 1984;11(3):209-71.
3. Grant SM, Clissold SP. Fluconazole. A review of its pharmacodynamic and pharmacokinetic properties, and therapeutic potential in superficial and systemic mycoses. Drugs. 1990 Jun;39(6):877-916.
4. Vartian CV, Shlaes DM, Padhye AA, Ajello L. Wangiella dermatitidis endocarditis in an intravenous drug user. The American journal of medicine. 1985 Apr;78(4):703-7.
5. González-Díaz H, Vilar S, Santana L, Uriarte E. Medicinal Chemistry and Bioinformatics – Current Trends in Drugs Discovery with Networks Topological Indices. Curr Top Med Chem. 2007;7(10):1025-39.
6. Fratev F, Benfenati E. 3D-QSAR and molecular mechanics study for the differences in the azole activity against yeastlike and filamentous fungi and their relation to P450DM inhibition. 1. 3-substituted-4(3H)-quinazolinones. Journal of chemical information and modeling. 2005 May-Jun;45(3):634-44.
7. Cronin MT, Aptula AO, Dearden JC, Duffy JC, Netzeva TI, Patel H, et al. Structure-based classification of antibacterial activity. J Chem Inf Comput Sci. 2002 Jul-Aug;42(4):869-78.
8. Vega MC, Montero-Torres A, Marrero-Ponce Y, Rolon M, Gomez-Barrio A, Escario JA, et al. New ligand-based approach for the discovery of antitrypanosomal compounds. Bioorg Med Chem Lett. 2006 Apr 1;16(7):1898-904.
9. Garcia-Domenech R, Galvez J, de Julian-Ortiz JV, Pogliani L. Some new trends in chemical graph theory. Chem Rev. 2008 Mar;108(3):1127-69.
10. Marrero-Ponce Y, Khan MT, Casanola-Martin GM, Ather A, Sultankhodzhaev MN, Garcia-Domenech R, et al. Bond-based 2D TOMOCOMD-CARDD approach for drug discovery: aiding decision-making in 'in silico' selection of new lead tyrosinase inhibitors. J Comput Aided Mol Des. 2007 Apr;21(4):167-88.
11. Garcia-Garcia A, Galvez J, de Julian-Ortiz JV, Garcia-Domenech R, Munoz C, Guna R, et al. New agents active against Mycobacterium avium complex selected by molecular topology: a virtual screening method. J Antimicrob Chemother. 2004 Jan;53(1):65-73.
12. Meneses-Marcel A, Rivera-Borroto OM, Marrero-Ponce Y, Montero A, Machado Tugores Y, Escario JA, et al. New antitrichomonal drug-like chemicals selected by bond (edge)-based TOMOCOMD-CARDD descriptors. J Biomol Screen. 2008 Sep;13(8):785-94.
13. Vilar S, Santana L, Uriarte E. Probabilistic neural network model for the in silico evaluation of anti-HIV activity and mechanism of action. J Med Chem. 2006;49(3):1118-24.
14. Cruz-Monteagudo M, Borges F, Cordeiro MN, Cagide Fajin JL, Morell C, Ruiz RM, et al. Desirability-based methods of multiobjective optimization and ranking for global QSAR studies. Filtering safe and potent drug candidates from combinatorial libraries. J Comb Chem. 2008 Nov-Dec;10(6):897-913.
15. Gohlke H, Schwarz S, Gundisch D, Tilotta MC, Weber A, Wegge T, et al. 3D QSAR analyses-guided rational design of novel ligands for the (alpha4)2(beta2)3 nicotinic acetylcholine receptor. Journal of medicinal chemistry. 2003 May 22;46(11):2031-48.
16. Prado-Prado FJ, Uriarte E, Borges F, González-Díaz H. Multi-target spectral moments for QSAR and Complex Networks study of antibacterial drugs. Eur J Med Chem. 2009 Jun 24.
17. Prado-Prado FJ, Ubeira FM, Borges F, González-Díaz H. Unified QSAR & network-based computational chemistry approach to antimicrobials. II. Multiple distance and triadic census analysis of antiparasitic drugs complex networks. J Comput Chem. 2009 May 6.
18. Prado-Prado FJ, González-Díaz H, Santana L, Uriarte E. Unified QSAR approach to antimicrobials. Part 2: predicting activity against more than 90 different species in order to halt antibacterial resistance. Bioorg Med Chem. 2007 Jan 15;15(2):897-902.
19. Prado-Prado FJ, González-Díaz H, de la Vega OM, Ubeira FM, Chou KC. Unified QSAR approach to antimicrobials. Part 3: first multi-tasking QSAR model for input-coded prediction, structural back-projection, and complex networks clustering of antiprotozoal compounds. Bioorg Med Chem. 2008 Jun 1;16(11):5871-80.
20. Prado-Prado FJ, de la Vega OM, Uriarte E, Ubeira FM, Chou KC, González-Díaz H. Unified QSAR approach to antimicrobials. 4. Multi-target QSAR modeling and comparative multi-distance study of the giant components of antiviral drug-drug complex networks. Bioorg Med Chem. 2009;17:569-75.
21. Prado-Prado FJ, Borges F, Perez-Montoto LG, González-Díaz H. Multi-target spectral moment: QSAR for antifungal drugs vs. different fungi species. Eur J Med Chem. 2009 May 5.
22. Prado-Prado F, Borges F, Uriarte E, González-Díaz H. Multi-Target Spectral Moments for QSAR & Complex Networks study of antibacterial drugs. Eur J Med Chem. 2009:doi:10.1016/j.ejmech.2009.06.018.
23. Todeschini R, Consonni V. Handbook of Molecular Descriptors: Wiley-VCH 2002.
24. González-Díaz H, Prado-Prado F, Ubeira FM. Predicting antimicrobial drugs and targets with the MARCH-INSIDE approach. Curr Top Med Chem. 2008;8(18):1676-90.
25. González-Díaz H, Cabrera-Pérez MA, Agüero-Chapín G, Cruz-Monteagudo M, Castañedo-Cancio N, del Río MA, et al. Multi-target QSPR assemble of a Complex Network for the distribution of chemicals to biphasic systems and biological tissues. Chemometrics Intellig Lab Syst. 2008;94:160-5.
26. Cruz-Monteagudo M, González-Díaz H, Agüero-Chapin G, Santana L, Borges F, Domínguez RE, et al. Computational Chemistry Development of a Unified Free Energy Markov Model for the Distribution of 1300 Chemicals to 38 Different Environmental or Biological Systems. J Comput Chem. 2007;28:1909-22.
27. González-Díaz H, Uriarte E. Proteins QSAR with Markov average electrostatic potentials. Bioorg Med Chem Lett. 2005 Nov 15;15(22):5088-94.
28. González-Díaz H, González-Díaz Y, Santana L, Ubeira FM, Uriarte E. Proteomics, networks and connectivity indices. Proteomics. 2008;8:750-78.
29. González-Díaz H, Saiz-Urra L, Molina R, Santana L, Uriarte E. A Model for the Recognition of Protein Kinases Based on the Entropy of 3D van der Waals Interactions. Journal of proteome research. 2007 Feb 2;6(2):904-8.
30. González-Díaz H, Aguero G, Cabrera MA, Molina R, Santana L, Uriarte E, et al. Unified Markov thermodynamics based on stochastic forms to classify drugs considering molecular structure, partition system, and biological species: distribution of the antimicrobial G1 on rat tissues. Bioorg Med Chem Lett. 2005 Feb 1;15(3):551-7.
31. Graham DJ. Information Content in Organic Molecules: Brownian Processing at Low Levels. Journal of chemical information and modeling. 2007;47(2):376-89.
32. Graham DJ. Information Content and Organic Molecules: Aggregation States and Solvent Effects. Journal of chemical information and modeling. 2005;45(1223).
33. Tamiya T, Fujimi TJ. Molecular evolution of toxin genes in Elapidae snakes. Mol Divers. 2006 Nov;10(4):529-43.
34. Lajoix AD, Gross R, Aknin C, Dietz S, Granier C, Laune D. Cellulose membrane supported peptide arrays for deciphering protein-protein interaction sites: the case of PIN, a protein with multiple natural partners. Mol Divers. 2004;8(3):281-90.
35. Balakrishnan R, Christie KR, Costanzo MC, Dolinski K, Dwight SS, Engel SR, et al. Fungal BLAST and Model Organism BLASTP Best Hits: new comparison resources at the Saccharomyces Genome Database (SGD). Nucleic Acids Res. 2005 Jan 1;33(Database issue):D374-7.
36. Han L, Cui J, Lin H, Ji Z, Cao Z, Li Y, et al. Recent progresses in the application of machine learning approach for predicting protein functional class independent of sequence similarity. Proteomics. 2006 Jul;6(14):4023-37.
37. Liu WC, Lin WH, Davis AJ, Jordan F, Yang HT, Hwang MJ. A network perspective on the topological importance of enzymes and their phylogenetic conservation. BMC Bioinformatics. 2007;8:121.
38. Wang B, Chen P, Huang DS, Li JJ, Lok TM, Lyu MR. Predicting protein interaction sites from residue spatial sequence profile and evolution rate. FEBS Lett. 2006 Jan 23;580(2):380-4.
39. Thibert B, Bredesen DE, del Rio G. Improved prediction of critical residues for protein function based on network and phylogenetic analyses. BMC Bioinformatics. 2005;6:213.
40. González-Díaz H, Tenorio E, Castanedo N, Santana L, Uriarte E. 3D QSAR Markov model for drug-induced eosinophilia--theoretical prediction and preliminary experimental assay of the antimicrobial drug G1. Bioorg Med Chem. 2005 Mar 1;13(5):1523-30.
41. González-Díaz H, Cruz-Monteagudo M, Molina R, Tenorio E, Uriarte E. Predicting multiple drugs side effects with a general drug-target interaction thermodynamic Markov model. Bioorg Med Chem. 2005 Feb 15;13(4):1119-29.
42. Cruz-Monteagudo M, González-Díaz H. Unified drug-target interaction thermodynamic Markov model using stochastic entropies to predict multiple drugs side effects. Eur J Med Chem. 2005 Oct;40(10):1030-41.
43. Van Waterbeemd H. Discriminant Analysis for Activity Prediction. In: Van Waterbeemd H, ed. Chemometric methods in molecular design. New York: Wiley-VCH 1995:265-82.
44. González-Díaz H, Prado-Prado FJ, Santana L, Uriarte E. Unify QSAR approach to antimicrobials. Part 1: Predicting antifungal activity against different species. Bioorg Med Chem. 2006 Jun 5;14 5973–80.
45. Zupan J, Randic M. Algorithm for coding DNA sequences into "spectrum-like" and "zigzag" representations. Journal of chemical information and modeling. 2005 Mar-Apr;45(2):309-13.
46. Liao B. A 2D graphical representation of DNA sequence. Chem Phys Lett. 2005;401:196-9.
47. Liao B, Tan M, Ding K. Application of 2-D graphical representation of DNA sequence. Chem Phys Lett. 2005;414(4-6):296-300.
48. Liao B, Wang TM. New 2D graphical representation of DNA sequences. J Comput Chem. 2004 Aug;25(11):1364-8.
49. Randič M, Vračko M, Nandy A, Basak SC. On 3-D Graphical Representation of DNA Primary Sequences and Their Numerical Characterization. J Chem Inf Comput Sci. 2000;40:1235-44.
50. Zhang Y, Chen W. Analysis of similarity/dissimilarity of long DNA sequences based on three 2DD-curves. Comb Chem High Throughput Screen. 2007 Mar;10(3):231-7.
51. Zhang X, Luo J, Yang L. New Invariant of DNA Sequence Based on 3DD-Curves and Its Application on Phylogeny. J Comput Chem. 2007;28:2342-6.
52. StatSoft, Inc. STATISTICA (data analysis software system), version 6.0, www.statsoft.com. StatSoft, Inc. 6.0 ed 2002.
53. Puslednik L, Serb JM. Molecular phylogenetics of the Pectinidae (Mollusca: Bivalvia) and effect of increased taxon sampling and outgroup selection on tree topology. Mol Phylogenet Evol. 2008 Sep;48(3):1178-88.
54. Tehler A, Little DP, Farris JS. The full-length phylogenetic tree from 1551 ribosomal sequences of chitinous fungi, Fungi. Mycol Res. 2003 Aug;107(Pt 8):901-16.
55. Molina E, Díaz HG, Gonzalez MP, Rodriguez E, Uriarte E. Designing antibacterial compounds through a topological substructural approach. Journal of chemical information and computer sciences. 2004 Mar-Apr;44(2):515-21.
Transworld Research Network 37/661 (2), Fort P.O. Trivandrum-695 023 Kerala, India
Topological Indices for Medicinal Chemistry, Biology, Parasitology, Neurological and Social Networks, 2010: 35-51 ISBN: 978-81-7895-489-9 Editors: Humberto González-Díaz and Cristian Robert Munteanu
3. Directed network topological indices for van der Waals complexes based on coupled cluster interaction energies

Cristian Robert Munteanu¹,², Berta Fernández¹, Vanessa Aguiar², José A. Serantes², Julián Dorado², Alejandro Pazos² and Humberto González-Díaz³

¹Department of Physical Chemistry, Faculty of Chemistry, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain; ²Department of Information and Communication Technologies, Computer Science Faculty, University of A Coruña, Campus de Elviña, 15071 A Coruña, Spain; ³Department of Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
Abstract. We propose the theoretical study of van der Waals interactions by transforming the ab initio Coupled Cluster interaction energies of Ne-Ar, N2-Ar, acetylene-Ar, cyclopropane-Ar and fluorobenzene-Ar into complex networks. The topics include the general topology, the local structure (triadic census), the node degree distribution, and the shortest van der Waals dissociation paths. In addition, each real network is compared with a Barabasi–Albert network, a Kleinberg small world network, a 2D lattice network, an Erdos–Renyi network, and an Eppstein power law network. These results can motivate future studies on the formation and stability of van der Waals complexes, or on evaluation models for physical properties.

Correspondence/Reprint request: Dr. Cristian Robert Munteanu, Department of Physical Chemistry, Faculty of Chemistry, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain. E-mail: muntisa@gmail.com
1. Introduction

Any group of items linked by at least one property can form a complex network [1,2]. In theoretical research these networks are characterised by Topological Indices (TIs) [3-7]. These numbers make it possible to compare networks or to create models for the prediction of new ones. Thus, an increasing number of scientists are using complex network TIs in studies of molecular graphs [8], clinical proteomics [9,10], enzymology [11-14], DNA/protein structures [15-19], drug-target interactions [16,20], and biochemical networks [21]. Several complex physical-chemistry items such as reactions [22], and metabolic [23], protein–protein [24-29] and hydrogen bond interactions [30,31] have been studied. The intermolecular interaction networks of van der Waals complexes can be included in the same field. Intermolecular interactions play an important role in many areas of science, and the control and understanding of these forces has given rise to new fields such as nanochemistry, an integral part of nanotechnology. Among the inter/intramolecular interactions [32], van der Waals forces arise from the attractive forces between temporary dipoles/multipoles in non-polar molecules.
Figure 1. Triad isomorphism classes.
This work extends the use of complex networks to the study of van der Waals complexes [33]. Previous work on the theoretical evaluation of Coupled Cluster interaction energies of small-size molecules with the argon rare gas atom is considered. The analysis of the results is carried out in four parts: general topology, triadic census (local structure), node degree distribution of the networks, and the shortest van der Waals dissociation paths. For each complex, the topology of the ab initio potential network (real network) is compared with the structure of the five theoretical networks. The local structure consists of configurations and properties of small sub-networks, of nodes and arcs, most notably dyads and triads [34]. A triad is a sub-network of three nodes and the arcs between these nodes. In a directed graph each triad is isomorphic with one of the sixteen isomorphism classes or triad types shown in Figure 1. The node degree distributions of the real networks were compared with two ideal ones: the normal and the exponential distributions.
2. Methodology

Directed networks are constructed using as vertices/nodes (n) the ab initio interaction energies of the van der Waals complexes formed by the argon (Ar) atom and neon [35], N2 [36], acetylene (C2H2) [37], cyclopropane (C3H6) [38] and fluorobenzene (FB) [39]. These potentials were obtained in previous work by using the coupled cluster singles and doubles including connected triples method (CCSD(T)) [40], and augmented correlation consistent polarized valence basis sets [41,42] extended with the same set of 3s3p2d1f1g midbond functions [43]. Cartesian coordinates x, y, z were chosen to give the position of the argon atom with respect to the atom or the center of mass of the rigid molecule (the geometries were not optimised, but taken from experimental microwave results). We selected these complexes in order to have the same mobile Ar atom and an increasing complexity of the van der Waals interaction along the set. Each node is characterised by the ab initio energy and the position of the Ar atom. The network edges are constructed using the following two rules: the distance between two nodes must be less than a distance cutoff (dcutoff), and the corresponding energy difference must be less than an energy cutoff (Vcutoff). The energy cutoff is calculated as the product of the absolute value of the minimum energy (ab initio calculations) and a correction factor ncorr. The ncorr values were chosen so that the networks did not contain any unconnected nodes, and therefore the energy cutoffs are evaluated only in these cases. A Boolean pair matrix is obtained with Microsoft Excel [44] (1 = both the distance and the energy conditions are true; 0 otherwise). Figure 2 shows the variation of the unconnected node fraction (the percentage of the total nodes that are unconnected) with the distance cutoff and with the energy correction factor (ncorr). It is clear that, in order to satisfy the first (distance) connectivity condition for all nodes in all the potentials, dcutoff cannot be lower than 1 Å. Figure 2B shows the results for the selection of the different ncorr factors. These are reported in Table 1 together with the energy minima (Vmin) and the energy cutoffs (Vcutoff).
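The two edge rules can be sketched as a small routine that builds the Boolean pair matrix. This is a hedged illustration, not the Excel procedure used in the study. The text does not spell out how the arc direction is assigned, so the sketch assumes that an arc i → j exists when the signed energy step V_j − V_i is below Vcutoff; the function names and the toy node values are also assumptions.

```python
import math

def build_adjacency(nodes, d_cutoff, v_cutoff):
    """nodes: list of (x, y, z, energy) tuples for the Ar positions.

    Returns a 0/1 matrix in which entry [i][j] == 1 denotes an arc i -> j.
    """
    n = len(nodes)
    adj = [[0] * n for _ in range(n)]
    for i, (xi, yi, zi, vi) in enumerate(nodes):
        for j, (xj, yj, zj, vj) in enumerate(nodes):
            if i == j:
                continue
            close = math.dist((xi, yi, zi), (xj, yj, zj)) < d_cutoff
            # Assumption: signed energy step, which makes the arc directed.
            allowed = (vj - vi) < v_cutoff
            adj[i][j] = 1 if (close and allowed) else 0
    return adj

def unconnected_fraction(adj):
    """Percentage of nodes with no incoming and no outgoing arcs."""
    n = len(adj)
    isolated = sum(1 for i in range(n)
                   if not any(adj[i]) and not any(row[i] for row in adj))
    return 100.0 * isolated / n

# Toy nodes (hypothetical coordinates in angstrom, energies in cm-1)
toy_nodes = [(0.0, 0.0, 0.0, -45.1), (0.5, 0.0, 0.0, -40.0),
             (0.9, 0.0, 0.0, 20.0), (5.0, 0.0, 0.0, 0.0)]
toy_adj = build_adjacency(toy_nodes, d_cutoff=1.0, v_cutoff=54.12)
```

Scanning `d_cutoff` and `v_cutoff` while recording `unconnected_fraction` gives curves of the kind summarised in Figure 2.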
Figure 2. Unconnected node fraction vs. distance cutoff (A) and energy correction factor (B).
Table 1. Minimum energies [min(Vi)], correction factors [ncorr], and energy cutoffs [Vcutoff = abs(min(Vi))*ncorr; Vi = ab initio CCSD(T) interaction energy] for the real networks. Energies are in cm⁻¹.

           Ne-Ar     N2-Ar     C2H2-Ar    C3H6-Ar    FB-Ar
Vmin       -45.10    -96.58    -121.99    -300.97    -391.20
ncorr      1.20      0.35      0.50       0.80       0.50
Vcutoff    54.12     33.80     61.00      240.78     195.60
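The energy cutoffs in Table 1 follow from Vcutoff = abs(min(Vi))*ncorr. A minimal check of that arithmetic, using the tabulated minima and correction factors (the dictionary names are illustrative), is:

```python
# Values taken from Table 1 (cm-1 for energies, dimensionless for ncorr)
v_min = {"Ne-Ar": -45.10, "N2-Ar": -96.58, "C2H2-Ar": -121.99,
         "C3H6-Ar": -300.97, "FB-Ar": -391.20}
n_corr = {"Ne-Ar": 1.20, "N2-Ar": 0.35, "C2H2-Ar": 0.50,
          "C3H6-Ar": 0.80, "FB-Ar": 0.50}

# Energy cutoff: absolute value of the minimum energy times the correction factor
v_cutoff = {name: abs(v_min[name]) * n_corr[name] for name in v_min}
```

The computed values agree with the tabulated cutoffs to within the rounding of the table.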
Once constructed, the matrices are saved in *.mat files in order to be analysed with the CentiBin [45,46] and the Pajek [47] applications. CentiBin calculates the node degree (Z) and generates random networks by five different algorithms: a Barabasi–Albert network (B-A), a Kleinberg small world network (SWN), a 2D lattice network (2D-L), an Erdos–Renyi network (E-R), and an Eppstein power law network (PLN) [46]. We plot in Figure 3 the variation of the average node degree with the same variables used in Figure 2, i.e. dcutoff and ncorr. The distance and energy cutoffs were chosen so that the average node degree values were the lowest possible. The complex networks depict the variation of the potential energy with the distance, in a similar way to potential energy surfaces. This way of constructing the complex networks links the abstract representation with physical properties such as the energy and the energy barriers. The difference between the new method and the classical potential energy surface fitting method is that the fitted function of the latter is replaced by a simple energy variation condition. In addition, the transitions between the nodes in these complex networks are directed by the connections (conditions), and the representation of these networks is not performed using energy-distance coordinates. In Figure 4 we illustrate the N2-Ar network in the CentiBin interface, where each node corresponds to an ab initio interaction energy for specific x, y, z Cartesian coordinates of the argon atom. The topology and triadic census were calculated with the Pajek application. The main TIs are the following: n*ln(n) (LN), the total adjacency index or number of edges (m), the network density (d), the Zagreb group index 1 (M1), the Zagreb group index 2 (M2), the Randic connectivity index (Xr), the Platt index (F) and the index of re-linking (P) [48]. The description of this kind of parameter has been reported previously [46], and applications to the study of small molecule, macromolecule, and other networks have been reviewed [16]. The general topological properties of the five ideal networks (B-A, SWN, 2D-L, E-R, and PLN) have been studied in detail earlier [46].
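For orientation, several of the listed TIs can be computed from an edge list with their standard textbook definitions. The sketch below is an assumption-laden illustration and not the CentiBin or Pajek implementation: it treats the network as undirected, and it omits the index of re-linking (P), whose definition is not given here. The function name is hypothetical.

```python
import math

def topological_indices(n, edges):
    """Standard graph indices for an undirected graph with n nodes.

    edges: list of (u, v) pairs with 0 <= u, v < n, each edge listed once.
    """
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    m = len(edges)
    return {
        "LN": n * math.log(n),                      # n*ln(n)
        "m": m,                                     # number of edges
        "d": 2.0 * m / (n * (n - 1)),               # network density
        "Z": sum(deg) / n,                          # average node degree
        "M1": sum(k * k for k in deg),              # Zagreb group index 1
        "M2": sum(deg[u] * deg[v] for u, v in edges),  # Zagreb group index 2
        "Xr": sum(1.0 / math.sqrt(deg[u] * deg[v]) for u, v in edges),  # Randic
        "F": sum(deg[u] + deg[v] - 2 for u, v in edges),  # Platt index
    }

# Three-node path 0 - 1 - 2 as a small worked case
path_indices = topological_indices(3, [(0, 1), (1, 2)])
```

For the three-node path, the node degrees are 1, 2, 1, which gives M1 = 6, M2 = 4 and F = 2.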
Figure 3. Average node degree vs. distance cutoff (A) and energy correction factor (B).
It is interesting to compare the features of our real van der Waals interaction networks with the TIs of the ideal ones. The ideal networks were generated to be as similar as possible to the real ones. The deviation of the ideal networks from the real behaviour was measured in terms of the relative difference percentage (RD%), defined as RD% = (TIreal - TIideal) * 100 / TIreal [49]. All node degrees were used as input in STATISTICA [50] in order to study the node degree distribution of each network and compare it with the ideal normal and exponential distributions.
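The RD% definition translates directly into a one-line function; the name `relative_difference` is an illustrative choice.

```python
def relative_difference(ti_real, ti_ideal):
    """RD% = (TIreal - TIideal) * 100 / TIreal."""
    return (ti_real - ti_ideal) * 100.0 / ti_real
```

A negative RD% therefore means that the ideal network has a larger value of the index than the real one; for example, an ideal network with 25% more nodes than the real network gives RD% = -25 for n.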
Figure 4. N2-Ar network inside CentiBin.
The shortest van der Waals dissociation paths between the absolute minimum and maximum ab initio energies of each complex were calculated with the program Agna [51].
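How Agna computes these paths is not described here. As a hedged illustration only, a shortest path in terms of the number of arcs between two nodes of a directed 0/1 adjacency matrix can be found with a breadth-first search; the function name and the toy matrix are assumptions.

```python
from collections import deque

def shortest_path(adj, start, goal):
    """Breadth-first search on a directed 0/1 adjacency matrix.

    Returns the list of node indices from start to goal with the fewest
    arcs, or None when goal cannot be reached from start.
    """
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt, linked in enumerate(adj[node]):
            if linked and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

# Toy directed chain 0 -> 1 -> 2 -> 3
chain = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
```

In the setting of this chapter, `start` and `goal` would be the indices of the nodes holding the absolute minimum and maximum ab initio energies.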
3. Results and discussion

The aim of this work is to propose the use of complex networks to study van der Waals complexes. Four aspects are considered: the topology, the distribution of the networks, the local structure (triadic census), and the shortest van der Waals dissociation paths.
3.1. Network topology

The general topologies of the ideal theoretical networks (E-R, B-A, PLN, SWN, 2D-L) relative to those of the real networks are given in terms of RD% in Table 2. For the Ne-Ar networks, the E-R, PLN, SWN and 2D-L averaged RD%s range between 9.5% and 11.2%. For the Ne-Ar and the FB-Ar complexes, the B-A network is the most similar to the real one, with averaged deviations of only 2 and 1.4%, respectively. Excluding the Ne-Ar complex, the averaged RD%s display large fluctuations, but the E-R and PLN ideal networks are characterised by similar values of the averaged RD%s and even of some of the TIs.
Table 2. Summary of the comparative study of the real versus the ideal networks, given in terms of RD% (see text).

Ne-Ar
TIs (%)   E-R     B-A      PLN     SWN     2D-L
n         0.0     -5.0     0.0     -25.0   -25.0
Z         1.1     -3.5     1.1     -8.7    13.0
LN        0.0     -6.7     0.0     -34.3   -34.3
m         1.1     -8.7     0.0     -35.9   -8.7
M1        27.0    -2.9     26.0    -4.3    33.6
M2        51.5    14.8     50.4    27.6    63.3
Xr        -0.3    1.3      -1.1    -27.8   -28.0
F         29.1    -2.4     28.1    -1.7    37.1
P         -1.9    -8.1     -3.3    -15.5   13.1
d         1.1     1.7      0.0     13.9    31.2

N2-Ar
TIs (%)   E-R     B-A      PLN     SWN     2D-L
n         0.0     -1.5     0.0     1.5     1.5
Z         0.2     3.6      1.2     38.8    51.0
LN        0.0     -1.9     0.0     1.9     1.9
m         0.2     2.1      0.0     39.7    51.8
M1        14.0    -32.9    14.6    69.6    80.7
M2        29.7    -60.9    30.4    85.4    92.7
Xr        -1.0    7.0      -1.5    -0.8    -1.1
F         14.7    -34.8    15.3    71.1    82.3
P         0.2     4.0      0.0     44.1    58.0
d         0.2     5.0      0.0     37.8    50.3

C2H2-Ar
TIs (%)   E-R     B-A      PLN     SWN     2D-L
n         0.0     -0.6     0.0     6.1     6.1
Z         0.2     -1.7     0.3     65.9    72.7
LN        0.0     -0.7     0.0     7.3     7.3
m         0.2     -2.3     0.0     68.0    74.4
M1        31.1    -26.9    30.3    92.6    95.3
M2        58.9    -36.4    58.0    98.5    99.3
Xr        -1.4    7.8      -1.2    4.3     4.1
F         31.9    -27.5    31.0    93.2    95.8
P         -1.8    -4.0     -2.0    70.1    77.6
d         0.2     -1.2     0.0     63.7    70.9

C3H6-Ar
TIs (%)   E-R     B-A      PLN     SWN     2D-L
n         0.0     -0.5     0.0     -4.8    -4.8
Z         -0.7    0.9      0.3     61.7    69.4
LN        0.0     -0.6     0.0     -5.8    -5.8
m         -0.7    0.4      0.0     59.9    67.9
M1        13.7    -56.5    14.3    87.1    91.9
M2        31.3    -122.9   32.0    96.2    98.1
Xr        -1.1    8.4      -1.0    -6.5    -6.8
F         14.1    -58.4    14.8    88.0    92.6
P         -0.7    1.0      0.0     66.8    75.1
d         -0.7    1.5      0.0     63.5    70.8

FB-Ar
TIs (%)   E-R     B-A      PLN     SWN     2D-L
n         0.0     -0.3     0.0     4.4     4.4
Z         -4.2    -0.7     0.1     83.7    87.0
LN        0.0     -0.3     0.0     5.2     5.2
m         -4.2    -1.0     0.0     84.4    87.5
M1        46.3    -2.0     50.4    98.7    99.2
M2        76.9    14.8     79.5    99.9    100.0
Xr        -2.5    7.7      -2.4    2.0     1.8
F         46.7    -2.0     50.8    98.9    99.3
P         -5.2    -1.5     -0.9    86.4    89.8
d         -4.2    -0.4     0.0     83.0    86.4
In order to compare the real networks with each other, all TIs were divided by the number of vertices (n) and the resulting values were normalised. In Figure 5 the complexes are presented in increasing order of molecular complexity and interaction energy, from Ne-Ar to FB-Ar, and different TI patterns can be observed. Z*, LN* and m* generally increase with the complexity of the complexes, owing to the increasing complexity of the networks (nodes, edges, node degrees). Ne-Ar, N2-Ar, C2H2-Ar and C3H6-Ar have comparable M1*, M2* and F* topological indices, with values up to 0.20, in contrast to FB-Ar, where they equal 1.00. Two TIs, P* and d*, generally decrease as the van der Waals interaction becomes stronger. The exception to these patterns is Xr*, with almost constant values for all the real networks. In addition, Table 3 compares the clustering index (C = Z/n) values of the van der Waals interaction networks (between 0.14 and 0.46) with the corresponding values of some well known networks [52]. The simplest networks, Ne-Ar and N2-Ar, have values similar to those of the word co-occurrence and the Internet (autonomous systems) networks, respectively. In contrast, the values of the C2H2-Ar, C3H6-Ar and FB-Ar networks are analogous to that of the mathematics collaborations network.
Figure 5. Modified TIs of the Ne-Ar, N2-Ar, C2H2-Ar, C3H6-Ar and FB-Ar networks [TIi*=normalised(TIi/n)].
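The scaling TIi* = normalised(TIi/n) used for Figure 5 can be sketched as follows. This is only an illustration: the chapter does not state which normalisation was applied, so the sketch assumes division by the largest absolute per-vertex value across the networks (which makes the largest entry equal to 1.00). The function name and the input numbers are hypothetical.

```python
def normalised_per_vertex(ti_values, n_vertices):
    """Return TI* = normalised(TI / n) for one index across several networks.

    ti_values and n_vertices are parallel lists (one entry per network).
    Assumption (not stated in the chapter): 'normalised' means dividing by
    the largest absolute per-vertex value, so the largest entry becomes +/-1.
    """
    per_vertex = [ti / n for ti, n in zip(ti_values, n_vertices)]
    scale = max(abs(v) for v in per_vertex)
    if scale == 0:
        # All values are zero; nothing to scale.
        return [0.0 for _ in per_vertex]
    return [v / scale for v in per_vertex]


# Hypothetical values of one index for three networks (not taken from the chapter).
example = normalised_per_vertex([10.0, 60.0, 400.0], [5, 20, 100])
print(example)
```

A different convention (for example min-max scaling) would change the numbers but not the ordering of the complexes.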
Table 3. Clustering index of different network types.
Network | C = Z/n | Type
power grid | 0.080 | technological
biology collaborations | 0.081 | social
WWW (sites) | 0.11 | technological
C3H6-Ar van der Waals interaction | 0.14 | physicochemical
mathematics collaborations | 0.15 | social
C2H2-Ar van der Waals interaction | 0.16 | physicochemical
FB-Ar van der Waals interaction | 0.18 | physicochemical
film actor collaborations | 0.20 | social
food web | 0.22 | biological
Internet (autonomous systems) | 0.24 | technological
N2-Ar van der Waals interaction | 0.25 | physicochemical
neural network | 0.28 | biological
word co-occurrence | 0.44 | linguistics
Ne-Ar van der Waals interaction | 0.46 | physicochemical
company directors | 0.59 | social
metabolic network | 0.59 | biological
3.2. Node degree distribution
We studied the fit of the real networks to normal and exponential node degree distributions. The input data are the in-degree and out-degree centralities of the vertices, obtained with the CentiBiN tool for each network. The Kolmogorov-Smirnov and Chi-square tests [53] were used within the Distribution Fitting calculations in STATISTICA (see Table 4). For both tests the null hypothesis is that the degree centralities follow the tested distribution (normal or exponential). If the probabilities (p) in both tests are greater than the significance level of 0.05 (5%), the null hypothesis cannot be rejected. The p values of both tests show that Ne-Ar is the only complex for which the null hypothesis cannot be rejected for either the normal or the exponential distribution.
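The Kolmogorov-Smirnov statistic d reported in Table 4 can be illustrated with a short, dependency-free sketch. It is not the STATISTICA procedure: it assumes the distribution parameters are estimated from the sample (mean and standard deviation for the normal case, rate = 1/mean for the exponential case), it ignores the ties that integer degrees produce, and it omits the p-value lookup. All function names and the example degrees are hypothetical.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def exponential_cdf(x, rate):
    """Cumulative distribution function of an exponential distribution."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0


def ks_statistic(sample, cdf):
    """Largest distance between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # Compare the model CDF with the empirical CDF just after and just before x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_normal_and_exponential(degrees):
    """Return (d_normal, d_exponential) with parameters estimated from the data."""
    n = len(degrees)
    mean = sum(degrees) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in degrees) / (n - 1))
    d_normal = ks_statistic(degrees, lambda x: normal_cdf(x, mean, sd))
    d_expon = ks_statistic(degrees, lambda x: exponential_cdf(x, 1.0 / mean))
    return d_normal, d_expon


# Hypothetical in-degree values (not taken from the chapter).
d_norm, d_exp = ks_normal_and_exponential([1, 2, 2, 3, 5, 8])
```

A smaller d means a closer fit; whether the null hypothesis is rejected still depends on the p value associated with d for the given sample size.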
Table 4. Kolmogorov-Smirnov and Chi-Square test for the distribution of in-Degree and out-Degree centrality of the real network vertices.
Network | Centrality | Distribution | Kolmogorov-Smirnov d | p< | Chi-Square χ2 test | p<
Ne-Ar | in-Degree | normal | 0.26133 | 0.15 | - | -
Ne-Ar | in-Degree | exponential | 0.23167 | 0.2 | 1.94527 | 0.1631
Ne-Ar | out-Degree | normal | 0.28032 | 0.1 | - | -
Ne-Ar | out-Degree | exponential | 0.19538 | n.s. | 1.16997 | 0.27941
N2-Ar | in-Degree | normal | 0.14373 | 0.15 | 14.26558 | 0.04665
N2-Ar | in-Degree | exponential | 0.28596 | 0.01 | 46.76067 | 0
N2-Ar | out-Degree | normal | 0.149 | 0.15 | 14.926 | 0.02084
N2-Ar | out-Degree | exponential | 0.23692 | 0.01 | 48.13 | 0
C2H2-Ar | in-Degree | normal | 0.12127 | 0.05 | 53.91222 | 0
C2H2-Ar | in-Degree | exponential | 0.15742 | 0.01 | 28.60431 | 0.00007
C2H2-Ar | out-Degree | normal | 0.16313 | 0.01 | 93.05314 | 0
C2H2-Ar | out-Degree | exponential | 0.12408 | 0.01 | 22.47245 | 0.00211
C3H6-Ar | in-Degree | normal | 0.11795 | 0.05 | 50.74608 | 0
C3H6-Ar | in-Degree | exponential | 0.28613 | 0.01 | 167.58766 | 0
C3H6-Ar | out-Degree | normal | 0.10634 | 0.05 | 29.21284 | 0.00115
C3H6-Ar | out-Degree | exponential | 0.238 | 0.01 | 114.12588 | 0
FB-Ar | in-Degree | normal | 0.24937 | 0.01 | 695.52764 | 0
FB-Ar | in-Degree | exponential | 0.16078 | 0.01 | 397.1891 | 0
FB-Ar | out-Degree | normal | 0.2462 | 0.01 | 716.2792 | 0
FB-Ar | out-Degree | exponential | 0.14315 | 0.01 | 371.33026 | 0
3.3. Triadic census approach to the local sub-structure
The local structure of the real networks was analysed with the Pajek application and the results are presented in Table 5. ni* and ei* are the number of triads and the expected number of triads, respectively, each divided by the number of vertices. Among all the triads, the most interesting ones are discussed here. First, 1-003 (all null) triads are common in van der Waals interaction networks. This coincides with the behaviour of complex social networks such as friendship between adolescent boys, grazing preference among cows and
dominance between nursery school boys, where 1-003 triads account for more than 50% of the total [54]. The numbers of self-return 2-folded (3-102) and full connectivity (16-300) triads increase with the average node degree and the strength of the van der Waals interaction. The 3-102 triads represent a direct reversible transition between two states with energies lower than the cut-off constraints. The 16-300 triads represent clusters of three states that are fully interconnected because all possible transition energies obey the cut-off constraints. These completely connected triads are rare in social networks, with the exception of the network of grooming between chimpanzees. In contrast, the number of transferability (6-021C) [55,56] triads decreases in going from N2-Ar to C3H6-Ar and FB-Ar, and the bifurcation (4-021D) values are notably lower than the expected ones. The self-return 3-folded transitions (also known as transitive triads) were detected by the Pajek algorithm neither in the form of 10-030C nor as 14-120C triads. On the other hand, the transitive triad is prevalent in social networks such as agonistic bouts between baboons, threats between highland ponies, fights between adult rhesus monkeys, dominance between sparrows and aggressive encounters between juvenile vervet monkeys [54]. This could indicate that transitive triads appear in networks at higher levels of matter organisation and are absent in van der Waals complex networks. However, both the fully connected (16-300) and the 15-210 triads account for this type of energy node transition, with values notably higher than expected.
Table 5. Triad distributions for the Ne-Ar, N2-Ar, C2H2-Ar, C3H6-Ar and FB-Ar real networks (ni = number of triads; ei = expected triads; * denotes values divided by the number of network vertices).
Triad type | Ne-Ar ni* | Ne-Ar ei* | N2-Ar ni* | N2-Ar ei* | C2H2-Ar ni* | C2H2-Ar ei* | C3H6-Ar ni* | C3H6-Ar ei* | FB-Ar ni* | FB-Ar ei*
3 - 102 | 23.9 | 3.3 | 158.9 | 19.0 | 950.5 | 75.9 | 1002.6 | 63.3 | 3886.1 | 321.1
16 - 300 | 3.1 | 0.0 | 6.3 | 0.0 | 27.2 | 0.0 | 18.8 | 0.0 | 243.9 | 0.0
1 - 003 | 23.5 | 10.8 | 399.1 | 296.2 | 3992.7 | 3180.1 | 4535.7 | 3706.2 | 14342.0 | 10720.8
4 - 021D | 0.0 | 3.3 | 1.0 | 19.0 | 1.7 | 75.9 | 0.0 | 63.3 | 1.6 | 321.1
5 - 021U | 0.0 | 3.3 | 1.1 | 19.0 | 1.7 | 75.9 | 1.0 | 63.3 | 0.5 | 321.1
9 - 030T | 0.0 | 2.1 | 1.1 | 5.6 | 0.0 | 13.5 | 0.1 | 9.6 | 0.0 | 64.2
12 - 120D | 0.0 | 0.3 | 0.6 | 0.4 | 3.5 | 0.6 | 0.4 | 0.4 | 1.7 | 3.2
13 - 120U | 0.8 | 0.3 | 4.5 | 0.4 | 10.8 | 0.6 | 3.6 | 0.4 | 14.5 | 3.2
2 - 012 | 3.1 | 20.7 | 87.0 | 260.0 | 264.1 | 1702.3 | 135.7 | 1678.4 | 307.9 | 6427.4
14 - 120C | 0.0 | 0.7 | 0.0 | 0.8 | 0.0 | 1.2 | 0.0 | 0.7 | 0.0 | 6.4
15 - 210 | 0.3 | 0.2 | 1.5 | 0.1 | 4.2 | 0.1 | 0.9 | 0.1 | 25.3 | 0.6
6 - 021C | 0.0 | 6.6 | 0.6 | 38.0 | 0.0 | 151.9 | 0.2 | 126.7 | 0.0 | 642.2
7 - 111D | 0.0 | 2.1 | 2.0 | 5.6 | 4.2 | 13.5 | 1.0 | 9.6 | 8.0 | 64.2
8 - 111U | 0.5 | 2.1 | 4.0 | 5.6 | 15.5 | 13.5 | 5.0 | 9.6 | 26.7 | 64.2
10 - 030C | 0.0 | 0.7 | 0.0 | 1.9 | 0.0 | 4.5 | 0.0 | 3.2 | 0.0 | 21.4
11 - 201 | 1.9 | 0.3 | 4.4 | 0.4 | 34.3 | 0.6 | 29.9 | 0.4 | 126.1 | 3.2
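The idea behind a triadic census can be illustrated with a simplified sketch. It is not the Pajek algorithm: it classifies every triple of vertices only by its M-A-N code (number of mutual, asymmetric and null dyads), so it distinguishes types such as 003, 102 and 300 but does not resolve the letter suffixes (D, U, C, T) of the full 16-type census, and it does not compute expected counts. The function name and the example arcs are hypothetical.

```python
from collections import Counter
from itertools import combinations


def man_census(n_nodes, arcs):
    """Count triads of a directed graph by their M-A-N code.

    n_nodes: number of vertices, labelled 0 .. n_nodes - 1.
    arcs: iterable of (source, target) pairs.
    Returns a Counter mapping codes such as '003', '102' or '300' to counts.
    Simplification: letter suffixes (e.g. 021D vs 021U vs 021C) are not resolved.
    """
    arc_set = set(arcs)
    census = Counter()
    for triple in combinations(range(n_nodes), 3):
        mutual = asymmetric = 0
        for u, v in combinations(triple, 2):
            forward = (u, v) in arc_set
            backward = (v, u) in arc_set
            if forward and backward:
                mutual += 1
            elif forward or backward:
                asymmetric += 1
        null = 3 - mutual - asymmetric
        census[f"{mutual}{asymmetric}{null}"] += 1
    return census


# Hypothetical 4-vertex network: vertices 0, 1, 2 are fully and reversibly
# connected; vertex 3 is isolated.
toy_arcs = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0)]
toy_census = man_census(4, toy_arcs)
print(toy_census)
```

This brute-force loop visits every triple, so it is only practical for small networks; dedicated tools use faster algorithms.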
3.4. Shortest van der Waals dissociation paths
The complex networks of the interaction energies in van der Waals complexes can be used to study the shortest van der Waals dissociation paths, in a way similar to a classical reaction path study [57]. Table 6 shows the shortest paths between the absolute minimum and the maximum energies, and the number and lengths of these paths, for the smallest complexes: Ne-Ar and N2-Ar. The length of a path is the number of connections between its ends and does not depend on the Cartesian positions of the nodes in the plots; a connectivity matrix that characterises a network can have multiple graphical representations. Because of the computational cost, we carried out the calculations only for these two complexes. The characteristics of the shortest paths depend on the network node degree. Thus, the simplest Ne-Ar complex has a unique one-step path from the minimum to the maximum energy, in contrast with the 25 possible shortest paths (each of length 5) of the N2-Ar complex. Figure 6 visualises the shortest van der Waals dissociation path in the case of Ne-Ar.
Table 6. Shortest dissociation path for the Ne-Ar and the N2-Ar networks.
Ab initio network | Min./Max. energies (cm-1) | No. of paths | Path length
Ne-Ar | -45.10 / 34.27 | 1 | 1
N2-Ar | -96.58 / 82.13 | 25 | 5
Dissociation shortest path, Ne-Ar: 5→1
Dissociation shortest paths, N2-Ar:
63→49→38→31→27→23; 63→49→39→31→27→23; 63→49→39→32→27→23; 63→49→39→32→28→23;
63→50→38→31→27→23; 63→50→39→31→27→23; 63→50→39→32→27→23; 63→50→39→32→28→23; 63→50→40→31→27→23; 63→50→40→32→27→23; 63→50→40→32→28→23;
63→51→38→31→27→23; 63→51→39→31→27→23; 63→51→39→32→27→23; 63→51→39→32→28→23; 63→51→40→31→27→23; 63→51→40→32→27→23; 63→51→40→32→28→23;
63→52→38→31→27→23; 63→52→39→31→27→23; 63→52→39→32→27→23; 63→52→39→32→28→23; 63→52→40→31→27→23; 63→52→40→32→27→23; 63→52→40→32→28→23
Figure 6. The shortest van der Waals reaction paths for the Ne-Ar ab initio network.
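Counting shortest paths between two nodes, as summarised in Table 6, can be sketched with a breadth-first search. This is an illustration rather than the tool used by the authors. The toy graph below contains only the directed connections that appear in the N2-Ar paths listed in Table 6 (an assumption made for the example; the full N2-Ar network has many more connections), and the function name is hypothetical.

```python
from collections import deque


def count_shortest_paths(adjacency, source, target):
    """Return (length, number) of shortest directed paths from source to target.

    adjacency: dict mapping a node to an iterable of its successor nodes.
    Returns (None, 0) if target cannot be reached.
    """
    dist = {source: 0}
    n_paths = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, ()):
            if v not in dist:
                # First time v is reached: its distance is fixed by BFS order.
                dist[v] = dist[u] + 1
                n_paths[v] = n_paths[u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:
                # Another shortest route into v through u.
                n_paths[v] += n_paths[u]
    if target not in dist:
        return None, 0
    return dist[target], n_paths[target]


# Connections taken from the N2-Ar paths of Table 6 only (partial network).
n2_ar_partial = {
    63: [49, 50, 51, 52],
    49: [38, 39],
    50: [38, 39, 40],
    51: [38, 39, 40],
    52: [38, 39, 40],
    38: [31],
    39: [31, 32],
    40: [31, 32],
    31: [27],
    32: [27, 28],
    27: [23],
    28: [23],
}
length, number = count_shortest_paths(n2_ar_partial, 63, 23)
print(length, number)
```

On this partial graph the search recovers the path length and the number of paths given in Table 6 for N2-Ar.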
Conclusions
We have shown that Complex Network theory can be used in the study of van der Waals complexes formed by small molecules. Using this method we were able to describe the general network topologies, local structures (triadic census), node degree distributions and the shortest van der Waals dissociation paths, as well as to compare the real networks with theoretical graph models. This study confirms the utility of complex networks and graphs in theoretical chemistry [58-63]. Future studies can deal with complex stability and formation, or can serve as the basis for new models for evaluating physical properties.
Acknowledgement
The authors acknowledge the CTQ2005-01076 and CTQ2008-01861 projects from the Ministerio de Ciencia e Innovacion (Spain), and the European Research and Training Network NANOQUANT, contract No. MRTN-CT-2003506842. The work was additionally supported by the Xunta de Galicia and FEDER (fellowship IN809A 2007/81-0 and Axuda para Consolidación e Estruturación de Unidades de Investigación Competitivas do Sistema Universitario de Galicia 2007/050 and 2007-2013, and INCITE09 314 252 PR project). The authors also thank the partial financial support from the grants 2006/60, 2007/127 and 2007/144 from the General Directorate of Scientific and Technologic Promotion of the Galician University System of the Xunta de Galicia. We also thank the Ibero-American Network of the Nano-Bio-Info-Cogno Convergent Technologies (Ibero-NBIC) network (209RT0366) funded by CYTED (Ciencia y Tecnología para el Desarrollo). González-Díaz H. and Munteanu C. R. acknowledge the Isidro Parga Pondal Programme, Xunta de Galicia, Spain.
References
1. R Menke, Perspect Biol Med 47 (2004) 300-03.
2. D Bonchev, GA Buck, J Chem Inf Model 47 (2007) 909-17.
3. E Estrada, Chem Phys Lett 336 (2001) 248-52.
4. H González-Díaz, Y González-Díaz, L Santana, FM Ubeira, E Uriarte, Proteomics 8 (2008) 750-78.
5. AT Balaban, A Beteringhe, T Constantinescu, PA Filip, O Ivanciuc, J Chem Inf Model 47 (2007) 716-31.
6. O Ivanciuc, DJ Klein, J Chem Inf Comput Sci 42 (2002) 8-22.
7. O Ivanciuc, T Ivanciuc, DJ Klein, WA Seitz, AT Balaban, J Chem Inf Comput Sci 41 (2001) 536-49.
8. M Mandado, MJ Gonzáles-Moa, RA Mosquera, J Comput Chem 28 (2007) 1625-33.
9. H González-Díaz, G Ferino, G Podda, E Uriarte, ECSOC 11 (2007) G1:1-10.
10. H González-Díaz, D Vina, L Santana, E de Clercq, E Uriarte, Bioorg Med Chem 14 (2006) 1095-107.
11. KC Chou, Biophys Chem 35 (1990) 1-24.
12. KC Chou, J Biol Chem 264 (1989) 12074-79.
13. KC Chou, Bioinformatics 21 (2005) 10-9.
14. KC Chou, S Forsen, Biochem J 187 (1980) 829-35.
15. M Randič, M Vračko, A Nandy, SC Basak, J Chem Inf Comput Sci 40 (2000) 1235-44.
16. H González-Díaz, S Vilar, L Santana, E Uriarte, Curr Top Med Chem 7 (2007) 1025-39.
17. E Estrada, Comput Biol Chem 27 (2003) 305-13.
18. B Liao, TM Wang, J Comput Chem 25 (2004) 1364-8.
19. KC Chou, Curr Med Chem 11 (2004) 2105-34.
20. H González-Díaz, M Cruz-Monteagudo, R Molina, E Tenorio, E Uriarte, Bioorg Med Chem 13 (2005) 1119-29.
21. LB Kier, D Bonchev, GA Buck, Chem Biodivers 2 (2005) 233-43.
22. T Loftsson, T Thorsteinsson, M Masson, J Pharm Pharmacol 57 (2005) 721-7.
23. OG Mekenyan, SD Dimitrov, TS Pavlov, GD Veith, Curr Pharm Des 10 (2004) 1273-93.
24. TS Rush, 3rd, JA Grant, L Mosyak, A Nicholls, J Med Chem 48 (2005) 1489-95.
25. KC Chou, YD Cai, J Proteome Res 5 (2006) 316-22.
26. XW Chen, M Liu, Bioinformatics 21 (2005) 4394-400.
27. R Chen, Z Weng, Proteins 51 (2003) 397-408.
28. D Bonchev, Chem. Biodivers. 1 (2004) 312-26.
29. JF Sharom, DS Bellows, M Tyers, Curr. Opin. Chem. Biol. 8 (2004) 81-90.
30. M Shibata, TJ Zielinski, J Mol Graph 10 (1992) 88-95, 107-9.
31. AF Jalbout, M Solimannaejad, JK Labanowski, Chem. Phys. Lett. 379 (2003) 503-06.
32. DMA Smith, AF Jalbout, J Smets, L Adamowicz, J. Chem. Phys. 260 (2000) 45-51.
33. LS Bartell, Chem. Rev. 86 (1986) 491-505.
34. S Wasserman, K Faust: Social Network Analysis: Methods and Applications, Cambridge University Press, New York, 1994.
35. J López Cacheiro, B Fernández, D Marchesan, S Coriani, C Hättig, A Rizzo, Mol. Phys. 102 (2004) 101-10.
36. CR Munteanu, JL Cacheiro, B Fernández, J Chem Phys 121 (2004) 10419-25.
37. CR Munteanu, B Fernández, J Chem Phys 123 (2005) 014309.
38. TB Pedersen, B Fernández, J. Chem. Phys. 115 (2001) 8431-39.
39. JL Cagide Fajín, J López Cacheiro, B Fernández, J Makarewicz, J. Chem. Phys. 120 (2004) 8582-86.
40. K Raghavachari, GW Trucks, JA Pople, M Head-Gordon, Chemical Physics Letters 157 (1989) 479-83.
41. RA Kendall, TH Dunning Jr., RJ Harrison, J. Chem. Phys. 96 (1992) 6796-806.
42. DE Woon, TH Dunning Jr., J. Chem. Phys. 98 (1993) 1358-71.
43. H Koch, B Fernández, O Christiansen, J. Chem. Phys. 108 (1998) 2784-90.
44. Microsoft.Corp., Microsoft Excel 2002.
45. D Koschützki, CentiBiN Version 1.4.2, Centralities in Biological Networks © 2004-06 Dirk Koschützki, Research Group Network Analysis, IPK Gatersleben, Germany, 2006.
46. BH Junker, D Koschuetzki, F Schreiber, BMC Bioinformatics 7 (2006) 219.
47. V Batagelj, A Mrvar, Pajek 1.15, 2006.
48. J Devillers, AT Balaban: Topological Indices and Related Descriptors in QSAR and QSPR, Gordon and Breach, The Netherlands, 1999.
49. H González-Díaz, FJ Prado-Prado, J Comput Chem 29 (2008) 656-67.
50. StatSoft.Inc., STATISTICA, 2002.
51. MI Benta, Cognition, Brain, Behavior IX (2005) 567-74.
52. S Bornholdt, HG Schuster: Handbook of Graphs and Complex Networks: From the Genome to the Internet, WILEY-VCH GmbH & CO. KGaA, Weinheim, 2003.
53. T Hill, P Lewicki: STATISTICS Methods and Applications. A Comprehensive Reference for Science, Industry and Data Mining, StatSoft, Tulsa, 2006.
54. J Moody, Social Networks 20 (1998) 291-99.
55. JL Lopez, M Mandado, AM Grana, RA Mosquera, Int J Quantum Chem 86 (2002) 190-98.
56. A Vila, RA Mosquera, J Chem Phys 115 (2001) 1264-73.
57. JLC Fajín, MNDS Cordeiro, JRB Gomes, J. Phys. Chem. C 111 (2007) 17311-17321.
58. B Liao, K Ding, J Comput Chem 26 (2005) 1519-23.
59. M Randic, J Zupan, SAR QSAR Environ Res 15 (2004) 191-205.
60. B Liao, M Tan, et al., Chem Phys Lett 402 (2005) 380-383.
61. J Zupan, M Randic, J Chem Inf Model 45 (2005) 309-13.
62. B Liao, T Wang, et al., Molecular Simulation 31 (2005) 1063-1071.
63. M Randic, N Lers, et al., J Proteome Res 4 (2005) 1347-52.
Topological Indices for Medicinal Chemistry, Biology, Parasitology, Neurological and Social Networks, 2010: 53-68 ISBN: 978-81-7895-489-9 Editors: Humberto González-Díaz and Cristian Robert Munteanu
4. QSRR construction of networks for chirality inversion reactions
Sonia Arrasate1, Nuria Sotomayor1, Esther Lete1 and Humberto González-Díaz2
1 Department of Organic Chemistry II, Faculty of Science and Technology, University of the Basque Country/Euskal Herriko Unibertsitatea, Apto. 644, 48080 Bilbao, Spain
2 Department of Microbiology and Parasitology, University of Santiago de Compostela, 15782, Spain
Abstract. Many enantioselective additions of organolithium reagents to imines have been described, with very different substrates, organolithium compounds, chiral ligands, solvents and specific reaction conditions, such as temperature, addition time, reaction time and order of addition of the reactants. This implies that mathematical tools may be needed to study this huge amount of information. In this work, we constructed from experimental outcomes a large Complex Network, which may be used to perform data mining and to describe quantitatively the changes in reaction variables that determine the enantiomeric excess and configuration of the stereogenic centre formed in the product. Unfortunately, there are no models to predict enantioselectivity for these reactions. Computational chemistry prediction of reactivity based on Quantitative Structure-Reactivity Relationships (QSRR) may be used in this sense. We developed here a Multiple Linear Regression QSRR (MLR-QSRR) model that predicts the variation in enantioselectivity after changing some of the above-mentioned reaction conditions and/or reactants. Overall model accuracy was high: the model explains 80.37% of the variance (R2) for the 17404 reaction pairs used in training and 79.99% for the 26106 reaction pairs used in cross-validation. Using the experimental values we constructed a Complex Network for Enantioselectivity (ECNobs) for these reactions. In addition, the outputs of the MLR-QSRR model were used as inputs to predict an ECNpred for these reactions. The ECNobs has 228 nodes (reactions) and 23586 edges (pairs of reactions with high propensity to R/S chirality inversion), whereas the ECNpred has 20374 edges. After edge-to-edge comparison we have demonstrated that the ECNpred is significantly similar to the ECNobs (Accuracy = 76.82%). The present study opens a new and interesting field of research in organic chemistry by reporting, for the first time, the construction and QSRR prediction of ECNs for organic reactions.
Correspondence/Reprint request: Dr. Humberto González-Díaz, Faculty of Pharmacy, University of Santiago de Compostela 15782, Spain. E-mail: humberto.gonzalez@usc.es
Introduction
The asymmetric 1,2-addition of organometallic reagents to imines is a powerful tool for forming carbon-carbon bonds and makes it possible to introduce a new stereogenic centre into organic molecules [1-10]. It thus provides ready access to enantiomerically enriched amines with a stereogenic centre at the α-position, an important structural feature in many biologically active compounds. These optically active amines are also important because of their pharmacological properties and their broad range of applications as chiral auxiliaries, resolving agents and building blocks for the synthesis of natural and unnatural compounds [11-15]. Many variables are involved in this kind of reaction: substrates, organolithium reagents, chiral ligands, products and reaction conditions, for instance. Consequently, there is a huge space of possible reactions to investigate. In this sense, it is of major interest to search for rational approaches that predict and describe the highly complex information generated by changes in enantioselectivity across large databases of pairs of such reactions. Satoh and Funatsu created a reaction generator in SOPHIA (System for Organic reaction Prediction by Heuristic Approach) that considers reaction conditions to recognize suitable AAGs (Atoms and/or Atomic Groups) for suitable free bonds, using reaction condition groups obtained by a classification based on word combinations in the reaction condition descriptions of a reaction database. They also extended SOPHIA to employ these reaction condition groups for interpreting the reaction conditions entered by the user and to consider the reaction conditions in its reaction prediction procedures [16]. In addition, Patel and coworkers described a new approach, iterated reaction graphs (IRG), that simulates complex chemical reaction systems.
To achieve this, they modelled a subset of all possible reactions (those specific to Maillard chemistry), modelled the rate kinetics with reaction rate probabilities, and blocked the reactions into logical groups [17]. In particular, Quantitative Structure-Reactivity Relationship (QSRR) studies, based on molecular descriptors of chemical structure, may play an important role in predicting the biological activity or a specific property of a reaction. For example, Ignatz-Hoover et al. described a QSRR for the kinetic chain-transfer constants of 90 agents in styrene polymerization at 60 °C, in which three- and five-parameter correlations were obtained with R2 of 0.725 and 0.818, respectively [18]. Varnek et al. proposed substructural fragments as a simple and safe way to encode molecular structures in a matrix containing the occurrence of fragments of a given type [19]. Satoh and coworkers investigated a dataset of 131 reactions, focusing on the changes of electronic features on the oxygen atoms at the reaction sites, by principal component analysis and self-organizing neural network analyses [20]. On the other hand, Long and Niu developed quantitative structure-property relationship/quantitative structure-activity relationship (QSPR/QSAR) models for the rate constants (k) of alkylnaphthalene reactions with chlorine, hydroxyl and nitrate radicals using partial least squares (PLS) regression [21]. Mu et al. investigated the prediction of oxidoreductase-catalyzed reactions based on the atomic properties of metabolites [22]. As mentioned above, QSRR models may be used to predict the effect of changes in reaction variables on enantioselectivity, but we also need tools to describe the huge amount of information generated. This sort of problem may be investigated using Complex Networks (CNs) to group reactions with inverse results, in which the enantiomeric excess and configuration change from R to S. In fact, we can use CNs to study relationships between reactions, proteins, genes, RNAs, organisms, or even non-living objects such as web pages, and we can also develop in silico procedures to predict these CNs [23-25].
In principle, more than 1 600 different molecular descriptors could be applied to this problem [26]. Our group has introduced elsewhere a Markov Chain Model (MCM) method named MARkov CHains Invariants for Network SImulation and Design (MARCH-INSIDE). The MARCH-INSIDE approach makes use of an MCM to calculate the average values of different molecular physicochemical properties in chemical structures [27]. We propose herein, for the first time, a QSRR model able to predict the difference in enantiomeric excess for the R-product between the two reactions of a pair (Δee(R)%), which reach similar or dissimilar enantioselectivity after modification of the reaction variables. This QSRR may predict the configuration of the new stereogenic centre formed in the synthesis of amines, taking into consideration similar reaction pairs in which the enantiomeric excess increases or decreases. The task is difficult but interesting because we intend to avoid using the 3D structures of substrates, chiral ligands and products.
A method independent of these aspects may become notably faster because we do not have to run optimization algorithms to predict the 3D structure; these optimization algorithms are computationally expensive. After developing the MLR-QSRR prediction model we used it to construct an ECN. Last, we compared the ECN predicted with an ECN constructed here based on measured values of product enantiomeric excess. A summary flowchart for all the steps given in the work is presented in Figure 1 in order to guide the reader.
Figure 1. Flow chart for all steps given in the work.
Results and discussion
Training and validation of the MLR-QSRR model. We used forward stepwise selection to investigate which variables most strongly influence the change in enantioselectivity and to construct the MLR-QSRR equation. The most important variables were the differences between the initial and final reaction for: product partition coefficient (ΔPp), chiral ligand hardness (ΔHl), solvent dipole moment (ΔDs), reaction time (Δtr), reaction temperature (ΔTr), addition temperature (ΔTa), average enantiomeric excess for reactions using the same procedure (ΔAe), substrate molar refractivity (ΔMi), and the steric constant (ΔSo) and hardness (ΔPo) of the organolithium compounds. Using these variables, the best model found was:
$$\Delta ee(R)\% = -6.60 + 5.80\,\Delta P_p - 4.63\,\Delta H_l - 23.08\,\Delta D_s + 44.18\,\Delta t_r - 1.23\,\Delta T_r - 0.18\,\Delta T_a + 0.24\,\Delta A_e + 1.90\,\Delta S_o - 8.22\,\Delta P_o - 0.24\,\Delta M_i \qquad (1)$$
$$n = 17404 \quad R^2 = 0.803 \quad R^2_{adjusted} = 0.803 \quad F = 7120.7 \quad p < 0.00001$$
where n is the number of cases (reaction pairs) used to train the model, R2 and R2adjusted are the training and adjusted squared regression coefficients, F is the Fisher ratio, and p is the error level. All these reactions were previously reported in the literature [28-40]. This model, with ten variables, correctly predicts 80.3% of the variance of the data set with a standard error of 29.35%. Notably, the values of R2 and R2adjusted are equal, which indicates that the model is not over-fitted by incorporating an elevated number of parameters. In Figure 2 we plot the observed Δee(R)% values vs. the values predicted with the model. In order to validate the model we used it to predict 26106 reaction pairs never used to train the model (validation series). In this series the results were: R2 = 79.98%, F = 1043E2 and p < 0.00001. The model correctly explains 80.0% of the variance of the data set with a standard error of 29.79% in the validation series. These results indicate that we developed an accurate model according to previous reports on the use of MLR in QSRR [41-43]. Using this model we can construct QSRR-based charts to depict visually the influence of changes in reaction variables on the enantioselectivity of the reaction [44]. This kind of analysis, known as desirability analysis (DA), allows us to predict which levels of the reaction variables ensure a desired enantioselectivity [45]. It could be used to optimize the reaction by changing only one property, either through synthetic modification of the substrate or chiral ligand or by modifying a reaction condition. In Figure 3, we illustrate some of these charts. Note that these charts may refer to only one receptor region or two different regions at the same time. In this sense, if we represent Mi vs. Po we can observe that for Mi values between -6.6 and 0.7 the enantiomeric excess is maintained in spite of changes in the substrates (Figure 3a). If we represent T(a) vs. t(r) we can observe that for reaction time (t(r)) values between -3.44 and 49.9 the enantiomeric excess is maintained although the reagent addition temperature changes (Figure 3b).
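How equation (1) turns a set of differences between two reactions into a predicted Δee(R)% can be sketched as follows. The coefficients are those of equation (1); the dictionary keys, the function name and the example inputs are hypothetical, and the chapter does not state the units or scaling of each variable, so the inputs are assumed to be expressed as in the authors' data set.

```python
# Intercept and coefficients of equation (1); key names are illustrative only.
EQ1_INTERCEPT = -6.60
EQ1_COEFFICIENTS = {
    "dPp": 5.80,    # product partition coefficient
    "dHl": -4.63,   # chiral ligand hardness
    "dDs": -23.08,  # solvent dipole moment
    "dtr": 44.18,   # reaction time
    "dTr": -1.23,   # reaction temperature
    "dTa": -0.18,   # addition temperature
    "dAe": 0.24,    # average enantiomeric excess for the same procedure
    "dSo": 1.90,    # steric constant of the organolithium compound
    "dPo": -8.22,   # hardness of the organolithium compound (symbol as in the text)
    "dMi": -0.24,   # substrate molar refractivity
}


def predict_delta_ee(deltas):
    """Evaluate equation (1) for one reaction pair.

    deltas: dict with one difference (final minus initial reaction) per key
    of EQ1_COEFFICIENTS. Units are assumed to match the original data set.
    """
    missing = set(EQ1_COEFFICIENTS) - set(deltas)
    if missing:
        raise KeyError(f"missing variables: {sorted(missing)}")
    return EQ1_INTERCEPT + sum(
        coefficient * deltas[name] for name, coefficient in EQ1_COEFFICIENTS.items()
    )


# Hypothetical pair in which no variable changes.
no_change = {name: 0.0 for name in EQ1_COEFFICIENTS}
baseline = predict_delta_ee(no_change)
```

With all differences set to zero the prediction reduces to the intercept of equation (1).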
Figure 2. Observed vs predicted values.
Figure 3. Example of chart used for the desirability analysis of the QSRR model.
Complex networks study. Molecular CNs are used to study large databases and/or complex systems [46-48]. For instance, proteins, nucleic acids and small molecules (metabolites) form a dense network of molecular interactions and metabolic reactions in a cell [49]. In order to assess the capacity of the MLR-QSRR model to predict new CNs, we used the same data employed for training and validating the QSRR model. We first constructed a new ECNobs from the observed (experimental) values. Next, we predicted the ECNpred with the QSRR model, and finally we compared both ECNs. For the ECNobs we explored threshold values in a range from -98.31 to -175, obtaining average output node degrees from 89.3 to 103.4, respectively (see Table 2). Finally, a cut-off of -175 was selected, giving an average node degree of 103.4, which guarantees that the number of disconnected reactions is 0. Next, we used the MLR-QSRR equation to predict the enantiomeric excess and configuration of some amines. As before, we explored threshold values in a range from -98.31 to -175, obtaining average output node degrees from 77 to 89.4, respectively; a cut-off of -175, which leads to an average output node degree of 89.4 and 0 disconnected reactions, was selected. With this threshold, the number of edges is 23586 for the observed network and 20374 for the predicted network. In Figure 4 we illustrate the complex relationships between the ECNs by drawing the edges that coincide in ECNobs and ECNpred. In order to compare the ECNobs and ECNpred, we used the sensitivity, specificity and accuracy, together with a Chi-square test; the value obtained was Chi-square = 293.364 (p < 0.00001).
Figure 4. Graphical view of the observed vs. predicted ECNs.
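The edge-to-edge comparison of an observed and a predicted network can be sketched as below. The sketch makes several assumptions that the chapter does not spell out at this point: every entry of the two adjacency matrices is compared (including the diagonal), accuracy, sensitivity and specificity are taken in their usual ratio form, and the Chi-square statistic is the standard one for a 2x2 contingency table without continuity correction. The function name and the toy matrices are hypothetical.

```python
def compare_networks(observed, predicted):
    """Compare two 0/1 adjacency matrices entry by entry.

    Returns a dict with the contingency counts a, b, c, d and with accuracy,
    sensitivity, specificity and the 2x2 Chi-square statistic.
    a: edge in both; b: edge only in observed; c: edge only in predicted;
    d: edge in neither.
    """
    a = b = c = d = 0
    for obs_row, pred_row in zip(observed, predicted):
        for o, p in zip(obs_row, pred_row):
            if o == 1 and p == 1:
                a += 1
            elif o == 1 and p == 0:
                b += 1
            elif o == 0 and p == 1:
                c += 1
            else:
                d += 1
    total = a + b + c + d
    nan = float("nan")
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return {
        "a": a, "b": b, "c": c, "d": d,
        "accuracy": (a + d) / total if total else nan,
        "sensitivity": a / (a + b) if (a + b) else nan,
        "specificity": d / (c + d) if (c + d) else nan,
        # Standard 2x2 Chi-square without continuity correction (assumption).
        "chi_square": total * (a * d - b * c) ** 2 / denominator if denominator else nan,
    }


# Hypothetical two-reaction networks (not taken from the chapter).
toy_observed = [[0, 1], [1, 0]]
toy_predicted = [[0, 1], [0, 0]]
toy_result = compare_networks(toy_observed, toy_predicted)
```

For networks of 228 reactions the same loop runs over 228 x 228 entries, which is small enough for this direct approach.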
Methods
Computational chemistry methods
The MARCH-INSIDE approach is based on the calculation of different physicochemical molecular properties (λm) for substrates, organolithium reagents, chiral ligands and products (λs, λo, λl and λp, respectively). These λm are calculated as averages of atomic properties (λj). For instance, it is possible to derive average estimates of refractivities (MRs, MRo, MRl, MRp), partition coefficients (Ps, Po, Pl, Pp), and hardness (ηs, ηo, ηl, ηp), which we use in this work, as shown in the equation below [50]:
$$\lambda_m = \frac{1}{6}\sum_{k=0}^{5} {}^{k}\lambda = \frac{1}{6}\sum_{k=0}^{5}\sum_{j} {}^{k}p(\lambda_j)\cdot\lambda_j \qquad (2)$$
It is possible to consider isolated atoms (k = 0) in the estimation of the molecular properties 0η, 0χ, 0MR, 0α, 0P. In this case the probabilities 0p(λj) are determined without considering the formation of chemical bonds (simple additive scheme). However, it is also possible to consider the gradual effects of the neighbouring atoms at different distances in the molecular backbone. To this end, the method uses a Markov model (MM), which determines the absolute probabilities kp(λj) with which the atoms placed at different distances k affect the contribution of atom j to the molecular property in question.
$${}^{k}\lambda = \left[\,{}^{0}p(\lambda_1),\ {}^{0}p(\lambda_2),\ \dots,\ {}^{0}p(\lambda_n)\,\right] \cdot \begin{bmatrix} {}^{1}p_{1,1} & {}^{1}p_{1,2} & \cdots & {}^{1}p_{1,n} \\ {}^{1}p_{2,1} & {}^{1}p_{2,2} & \cdots & {}^{1}p_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ {}^{1}p_{n,1} & {}^{1}p_{n,2} & \cdots & {}^{1}p_{n,n} \end{bmatrix}^{k} \cdot \begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_n \end{bmatrix} = \sum_{j=1}^{n} {}^{k}p(\lambda_j)\cdot\lambda_j \qquad (3)$$
where, from left to right, the first term is kλ, the average molecular property considering the effects of all the atoms placed at distance k on every atomic property λj. The row vector on the left of the product contains the probabilities 0p(λj) for every atom in the molecule, without considering chemical bonds. The matrix in the centre of the equation is the so-called stochastic matrix. Its elements (1pij) are the probabilities with which every atom affects the parameters of the atom bonded to it. Both kinds of probabilities, 0p(λj) and 1pij, are easily calculated from the atomic parameters (λj) and the chemical bonding information:
$${}^{0}p(\lambda_j) = \frac{\lambda_j}{\sum_{k=1}^{n}\lambda_k} \qquad (4)$$
$${}^{1}p_{ij} = \frac{\delta_{ij}\cdot\lambda_j}{\sum_{k=1}^{n}\delta_{ik}\cdot\lambda_k} \qquad (5)$$
The only difference is that in the probabilities 0p(λj) we consider isolated atoms by carrying out the sum in the denominator over all n atoms in the molecule. On the other hand, for 1pij chemical bonding is taken into consideration by means of the factor δij. This factor has the value 1 if atoms i and j are chemically bonded and it is 0 otherwise. All calculations were performed using the program MARCH-INSIDE version 3.0 [51]; which can be obtained for free academic use, upon request, from the corresponding author of the present work.
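Equations (2) to (5) can be combined into a short numerical sketch. It is not the MARCH-INSIDE program: it assumes positive atomic property values (so that the ratios are valid probabilities) and it sets δii = 1, i.e. each atom is counted as its own neighbour, which is an assumption about a detail the text does not state. The function name, the toy molecule and its atomic values are hypothetical.

```python
def average_property(atomic_values, bonds, k_max=5):
    """Average molecular property following the scheme of equations (2)-(5).

    atomic_values: list of positive atomic property values (lambda_j).
    bonds: list of (i, j) index pairs for chemically bonded atoms.
    Returns (lambda_m, [k_lambda for k = 0 .. k_max]).
    Assumption: delta_ii = 1 (each atom also counts as its own neighbour).
    """
    n = len(atomic_values)
    delta = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for i, j in bonds:
        delta[i][j] = delta[j][i] = 1

    # Equation (4): probabilities for isolated atoms.
    total = sum(atomic_values)
    p = [value / total for value in atomic_values]

    # Equation (5): stochastic matrix built from the bonding information.
    matrix = []
    for i in range(n):
        row_sum = sum(delta[i][k] * atomic_values[k] for k in range(n))
        matrix.append([delta[i][j] * atomic_values[j] / row_sum for j in range(n)])

    # Equation (3): k_lambda = p0 . (matrix ** k) . lambda, built up step by step.
    k_lambdas = []
    for _ in range(k_max + 1):
        k_lambdas.append(sum(pj * lj for pj, lj in zip(p, atomic_values)))
        p = [sum(p[i] * matrix[i][j] for i in range(n)) for j in range(n)]

    # Equation (2): average over k = 0 .. k_max (six terms when k_max = 5).
    return sum(k_lambdas) / (k_max + 1), k_lambdas


# Hypothetical three-atom chain with made-up atomic property values.
lambda_m, terms = average_property([1.0, 3.0, 2.0], [(0, 1), (1, 2)])
```

Because every row of the stochastic matrix sums to 1, a molecule whose atoms all share the same value returns exactly that value for every k.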
Statistical analysis Given the λm the molecular parameters above-mentioned and λorv other reaction variables such as (T(a), T(r), t(r)) we can calculate the differences Δλ = λ(r2) - λ(r1) for any reaction pairs. Using these Δλm and Δλorv values as input we performed a MLR anlysis to fit the QSRR equation with the form:
Δ ee ( R )% pred =
∑b
m
s ,l ,o , p
⋅ Δ λm + ∑ borv ⋅ Δ λorv + b0
(6)
orv
The output of the model, Δee(R)%pred, is the predicted difference in enantiomeric excess of the R-product between the two reactions of a pair. In equation (6), b represents the coefficients of the variables in the model, determined with the MLR module of the software package STATISTICA 6.0 [52]. We used the Forward Stepwise algorithm for variable selection. The statistical significance of the MLR model was assessed with R2 = 0.803, R2adjusted = 0.803, F = 7120.7 and p-level (p) < 0.00001. The model was validated with an external prediction series. The quality of the validation was given by R2 = 0.799, R2adjusted = 0.799, F = 1043E2 and p-level (p) < 0.00001. The data set comprised reported additions of organolithium reagents to imines in the presence of chiral ligands.
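To make the structure of equation (6) concrete, the short Python sketch below builds the descriptor differences for one reaction pair and evaluates a linear model on them. The descriptor names, their values, the coefficients and the intercept are all hypothetical placeholders chosen for easy arithmetic; they are not the fitted values of the QSRR reported here.

```python
def delta_descriptors(r1, r2):
    """Delta-lambda = lambda(r2) - lambda(r1) for every descriptor."""
    return {name: r2[name] - r1[name] for name in r1}


def predict_delta_ee(deltas, coefficients, intercept):
    """Eq. (6): intercept b0 plus the weighted sum of descriptor differences."""
    return intercept + sum(coefficients[name] * deltas[name] for name in deltas)


# Hypothetical descriptors of two reactions: two molecular parameters
# (substrate and ligand) and two other reaction variables.
r1 = {"lam_s": 2.0, "lam_l": 1.5, "T_r": -78.0, "t_r": 2.0}
r2 = {"lam_s": 2.5, "lam_l": 1.0, "T_r": -42.0, "t_r": 1.0}

# Hypothetical regression coefficients b and intercept b0.
coefficients = {"lam_s": 10.0, "lam_l": 4.0, "T_r": 0.5, "t_r": 2.0}
intercept = 1.0

deltas = delta_descriptors(r1, r2)
delta_ee_pred = predict_delta_ee(deltas, coefficients, intercept)
print(deltas)
print(delta_ee_pred)
```

In an actual study the coefficients would come from the stepwise regression; the sketch only shows how a prediction is assembled once they are known.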
Complex network (CN) analysis
In order to describe the enantiomeric excess and configuration of the product with a network approach, in which each node represents a reaction and the edges connect reaction pairs with a high propensity to R/S chirality inversion, we carried out the following steps:

1. First, we calculated the observed and QSRR-predicted average scores that numerically characterize the propensity of one reaction to yield R/S chirality inversion. These scores were labelled Obs.Avg.Δee(R)% and Pred.Avg.Δee(R)%:

$$ \mathrm{Obs.Avg.}\Delta ee(R)\%_{v} = \frac{1}{228}\sum_{w=1}^{228} \Delta ee(R)\%_{obs}(v,w) = \frac{1}{228}\sum_{w=1}^{228} \left( ee(R)\%_{obs}(v) - ee(R)\%_{obs}(w) \right) \tag{7a} $$

$$ \mathrm{Pred.Avg.}\Delta ee(R)\%_{v} = \frac{1}{228}\sum_{w=1}^{228} \Delta ee(R)\%_{pred}(v,w) = \frac{1}{228}\sum_{w=1}^{228} \left( ee(R)\%_{pred}(v) - ee(R)\%_{pred}(w) \right) \tag{7b} $$

where Obs.Avg.Δee(R)%v is the average, over all 228 reactions w, of the difference between the observed R enantiomeric excess for reaction v and the observed R enantiomeric excess for reaction w, and Pred.Avg.Δee(R)%v is the corresponding average of the differences between the predicted R enantiomeric excesses for reactions v and w.

2. Then, we used these scores as inputs in a Microsoft Excel sheet to calculate the elements of the Boolean or adjacency matrix (A) associated with the Enantioselectivity Complex Networks (ECNs) as follows:

$$ A \equiv \begin{cases} \text{if } \operatorname{sign}\left(ee(R)\%_{obs}(v)\right) = \operatorname{sign}\left(ee(R)\%_{obs}(w)\right) & \text{then } a_{vw} = 0 \\ \text{else if } \left[\mathrm{Obs.Avg.}\Delta ee(R)\%_{v} - \mathrm{Obs.Avg.}\Delta ee(R)\%_{w}\right] \le \text{cut-off} & \text{then } a_{vw} = 0 \\ \text{else} & a_{vw} = 1 \end{cases} \tag{8} $$

3. Next, we compared the observed and predicted ECNs pair-by-pair. For this comparison we measured the total number of coincident predictions
(Accuracy), the total number of reactions connected (Sensitivity) and not connected (Specificity), as well as a Chi-square (χ2) test. For the χ2 test, we used a contingency table where a, b, c and d are the observed frequencies in our networks (see Table 1). These frequencies were calculated with the spreadsheet formula f = IF(AND(obs!B2 = 1, pred!B2 = 1), 1, IF(AND(obs!B2 = 1, pred!B2 = 0), 2, IF(AND(obs!B2 = 0, pred!B2 = 1), 3, 4))). Then:
“a” is the number of pairs of reactions connected in both the observed and the predicted networks (observed and predicted are 1).
“b” is the number of pairs of reactions connected in the observed but not in the predicted network (observed is 1 and predicted is 0).
“c” is the number of pairs of reactions not connected in the observed but connected in the predicted network (observed is 0 and predicted is 1).
“d” is the number of pairs of reactions connected neither in the observed nor in the predicted network (the elements in the observed and predicted matrices are both equal to 0).

4. The Chi-square test lets us determine whether the variables are associated or not. If they are not associated, we can conclude that they are independent. The first step of the Chi-square test for independence is to establish the hypotheses. The null hypothesis is that the two variables are independent (the observed and predicted activities of the ECNs are not associated). The alternative hypothesis to be tested is that the two variables are dependent. χ2 was calculated as follows [53]:

$$ \chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{k} \frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}} \tag{9} $$
5. Here Oij is the observed frequency and Eij is the expected or theoretical frequency. With n = a + b + c + d, the total number of reaction pairs, Eij is calculated as follows:

$$ E_{11} = \frac{(a + b) \times (a + c)}{n} \tag{10a} $$

$$ E_{21} = \frac{(c + d) \times (a + c)}{n} \tag{10b} $$

$$ E_{12} = \frac{(a + b) \times (b + d)}{n} \tag{10c} $$

$$ E_{22} = \frac{(c + d) \times (b + d)}{n} \tag{10d} $$
6. Then we compared the value calculated with the formula above to a standard set of χ2 tables. The value returned from the table is p < 0.005. Thus, we can reject the null hypothesis and conclude that there is an association between the variables.
7. The Boolean matrix was saved as a .txt file. After renaming the .txt file as a .mat file, we read it with the software CentiBiN [54, 55]. Using CentiBiN we can not only represent the network and highlight all nodes connected to a specific node, but also calculate connectivity parameters including node degree.
8. CentiBiN was also used to generate random networks with five different algorithms: Barabási-Albert random network, Kleinberg Small World Network (SWN), 2D Lattice network, Erdős-Rényi network and Eppstein power law network (PLN) [55]. These random networks were compared with the observed and predicted networks.
9. Last, all node degrees were used as input in STATISTICA in order to study the degree distribution of the network and compare it with ideal distributions including normal, lognormal, exponential, gamma, and Chi-square [52].
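Steps 1 to 5 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ee values and the cut-off are invented, the bracketed difference in equation (8) is read as an absolute value, and the contingency table is counted over unordered pairs v < w. The spreadsheet and STATISTICA workflow actually used may differ in these details.

```python
def average_scores(ee):
    """Eqs. (7a)/(7b): average over all w of ee(v) - ee(w).
    Algebraically this equals ee(v) minus the mean ee of the data set."""
    mean_ee = sum(ee) / len(ee)
    return [value - mean_ee for value in ee]


def sign(x):
    return (x > 0) - (x < 0)


def adjacency(ee, cutoff):
    """Eq. (8): a_vw = 1 only if the ee signs differ and the gap between
    the average scores exceeds the cut-off (absolute value is assumed)."""
    scores = average_scores(ee)
    n = len(ee)
    a = [[0] * n for _ in range(n)]
    for v in range(n):
        for w in range(n):
            if sign(ee[v]) != sign(ee[w]) and abs(scores[v] - scores[w]) > cutoff:
                a[v][w] = 1
    return a


def contingency(obs, pred):
    """Frequencies a, b, c, d counted over unordered pairs v < w (assumed)."""
    a = b = c = d = 0
    n = len(obs)
    for v in range(n):
        for w in range(v + 1, n):
            if obs[v][w] and pred[v][w]:
                a += 1
            elif obs[v][w]:
                b += 1
            elif pred[v][w]:
                c += 1
            else:
                d += 1
    return a, b, c, d


def chi_square(a, b, c, d):
    """Eqs. (9) and (10a)-(10d) for the 2 x 2 contingency table."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Hypothetical ee(R)% values for five reactions (negative = S excess).
ee_obs = [80.0, 60.0, -40.0, -70.0, 10.0]
ee_pred = [75.0, 55.0, -35.0, 20.0, 15.0]
cutoff = 50.0  # hypothetical cut-off

net_obs = adjacency(ee_obs, cutoff)
net_pred = adjacency(ee_pred, cutoff)
a, b, c, d = contingency(net_obs, net_pred)
accuracy = (a + d) / (a + b + c + d)
chi2 = chi_square(a, b, c, d)
print(a, b, c, d, accuracy, chi2)
```

The resulting χ2 value would then be compared with tabulated critical values for one degree of freedom, as in step 6.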
Supplementary material
In the online supplementary material we provide all the parameters necessary to evaluate a reaction with the QSRR, as well as the dataset used, including the corresponding structures, in the form of SMILES codes, for all compounds involved in this study.
Conclusions
Using the MARCH-INSIDE approach, it is possible to seek an MLR-QSRR classifier to predict the probability of chirality inversion of reactions that occur by addition of organolithium reagents to imines in the presence of chiral ligands. The model can be used as a tool for preliminary screening of reactions without relying upon geometrical optimization of the substrate, organolithium, chiral ligand, or product structure. This MLR-QSRR was also demonstrated to be an efficient tool for the computational construction of Enantioselectivity Complex Networks that accurately reproduce the network based on experimental findings. This kind of Complex Network could become a valuable approach to explore the complexity of the enantioselectivity in these reactions.
Acknowledgments
Arrasate S. acknowledges sponsorship for a tenure-track research position at the University of Santiago de Compostela from the “Ikertzaileak Hobetzeko eta Mugitzeko/Perfeccionamiento y Movilidad del Personal Investigador” Program of the “Hezkuntza, Unibertsitate eta Ikerketa Saila/Departamento de Educación, Universidades e Investigación, Eusko Jaurlaritza/Gobierno Vasco”. Financial support from Gobierno Vasco (GIC07/92-IT-227-07) is also gratefully acknowledged. González-Díaz H. acknowledges sponsorship for a tenure-track research position at the University of Santiago de Compostela from the Isidro Parga Pondal Program of the “Dirección Xeral de Investigación e Desenvolvemento, Xunta de Galicia”.
References
1. Klein J. The Chemistry. In: Patai S, ed. The Chemistry of Double-bonded Functional Groups: Supplement A. Chichester: Wiley 1989.
2. Volkmann RA. In: Schreiber SL, ed. Comprehensive Organic Synthesis, Additions to C-X π-Bonds, Part 1. Oxford: Pergamon Press 1991.
3. Kleinman EFV, R. A. In: Heathcock CH, ed. Comprehensive Organic Synthesis, Additions to C-X π-Bonds, Part 2. Oxford: Pergamon Press 1991.
4. Berrisford DJ. Angew Chem, Int Ed Engl. 1995;34:178-80.
5. Risch NA, M. In: Helmchen GH, R. W.; Mulzer, J.; Schaumann, E., ed. Methods of Organic Chemistry Stereoselective Synthesis [Houben-Weyl]. Stuttgart: Thieme 1996.
6. North M. Contemp Org Synth. 1996;3:323-43.
7. Denmark SEN, O. J.-C. Chem Commun. 1996:999-1004.
8. Enders DR, U. Tetrahedron: Asymmetry. 1997;8:1895-946.
9. Bloch R. Chem Rev. 1998;98:1404-38.
10. Denmark SEN, O. J.-C. In: Jacobsen ENP, A.; Yamamoto, H., ed. Comprehensive Asymmetric Catalysis. Berlin: Springer-Verlag 1999.
11. Seyden-Penne J. Chiral Auxiliaries and Ligands in Asymmetric Synthesis. New York: Wiley 1995.
12. Jacques JC, A.; Wilen, S. H. Enantiomers, Racemates, and Resolution. New York: Wiley 1981.
13. Eliel ELW, S. H.; Mander, L. N. Stereochemistry of Organic Compounds. New York: Wiley 1994.
14. Moser HR, G.; Santer, H. Z. Naturforsch. 1982;37B:451-62.
15. Ariëns EJS, W.; Timmermans, P. B. M. W. M. Stereochemistry and Biological Activity of Drugs. Oxford: Blackwell Scientific 1983.
16. Satoh H, Funatsu K. Further Development of a Reaction Generator in the SOPHIA System for Organic Reaction Prediction. Knowledge-Guided Addition of Suitable Atoms and/or Atomic Groups to Product Skeleton. J Chem Inf Comput Sci. 1996;36:173-84.
17. Shail Patel JR, Stephen Russell, Jos Tissen, and Werner Klaffke. J Chem Inf Comput Sci. 2001;41:926-33.
18. Ignatz-Hoover F, Petrukhin R, Karelson M, Katritzky AR. QSRR correlation of free-radical polymerization chain-transfer constants for styrene. J Chem Inf Comput Sci. 2001 Mar-Apr;41(2):295-9.
19. Varnek AF, D.; Hoonakker, F.; Solov'ev, V. P. Substructural fragments: an universal language to encode reactions, molecular and supramolecular structures. J Comput Aided Mol Des. 2005;19:693-703.
20. Hiroko Satoh OS, Tadashi Nakata, Lingran Chen, Johann Gasteiger, Funatsu K. J Chem Inf Comput Sci. 1998;38:210-9.
21. Long XN, J. Estimation of gas-phase reaction rate constants of alkylnaphthalenes with chlorine, hydroxyl and nitrate radicals. Chemosphere. 2007;67:2028-34.
22. Mu FU, P. J.; Unkefer, C. J.; Hlavacek, W. S. Prediction of oxidoreductase-catalyzed reactions based on atomic properties of metabolites. Bioinformatics. 2006;22(24):3082-8.
23. Bornholdt S, Schuster HG. Handbook of Graphs and Complex Networks: From the Genome to the Internet. Weinheim: Wiley-VCH GmbH & Co. KGaA 2003.
24. Boccaletti S, Latora V, Moreno Y, Chavez M, Hwang DU. Complex networks: Structure and dynamics. Phys Rep. 2006;424:175-308.
25. Albert R, Barabási A-L. Statistical mechanics of complex networks. Rev Mod Phys. 2002;74(1):47-97.
26. Todeschini R, Consonni V. Handbook of Molecular Descriptors. Wiley-VCH 2002.
27. González-Díaz H, Vilar S, Santana L, Uriarte E. Medicinal Chemistry and Bioinformatics – Current Trends in Drugs Discovery with Networks Topological Indices. Curr Top Med Chem. 2007;7(10):1025-39.
28. Arrasate S, Lete E, Sotomayor N. Synthesis of enantiomerically enriched amines by chiral ligand mediated addition of organolithium reagents to imines. Tetrahedron: Asymmetry. 2001;12(14):2077-82.
29. Denmark SES, C. M. Effect of Ligand Structure in the Bisoxazoline Mediated Asymmetric Addition of Methyllithium to Imines. J Org Chem. 2000;65:5875-8.
30. Hasegawa MT, D.; Tomioka, K. Facile Asymmetric Synthesis of α-Amino Acids Employing Chiral Ligand-Mediated Asymmetric Addition Reactions of Phenyllithium with Imines. Tetrahedron. 2000;56:10153-8.
31. Taniyama DH, M.; Tomioka, K. A facile asymmetric synthesis of 1-substituted tetrahydroisoquinoline based on a chiral ligand-mediated addition of organolithium to imine. Tetrahedron: Asymmetry. 1999;10:221-3.
32. Inoue IS, M.; Koga, K; Kanai, M.; Tomioka, K. Enantioselective Reaction of An Imine with Methyllithium Catalyzed by A Chiral Ligand. Tetrahedron: Asymmetry. 1995;6:2527-33.
33. Denmark SEN, N.; Nicaise, O. J.-C. Asymmetric Addition of Organolithium Reagents to Imines. J Am Chem Soc. 1994;116:8797-8.
34. Anderson PGJ, F.; Tanner, D. Enantioselective Addition of Organolithium Reagents to Imines Mediated by C2-Symmetric Bis(aziridine) Ligands. Tetrahedron. 1998;54:11549-66.
35. Inoue IS, M.; Koga, K; Tomioka, K. Asymmetric 1,2-Addition of Organolithium to Aldimines Catalyzed by Chiral Ligand. Tetrahedron. 1994;50:4429-38.
36. Perron QA, A. Synthesis and application of a new pseudo C2-symmetric tertiary diamine for the enantioselective addition of MeLi to aromatic imines. Tetrahedron: Asymmetry. 2007;18:2503-6.
37. Cabello NK, J.-C.; Gille, S.; Alexakis, A.; Bernardinelli, G.; Pinchard, L.; Caille, J.-C. Simple 1,2-Diamine Ligands for Asymmetric Addition of Aryllithium Reagents to Imines. Eur J Org Chem. 2005:4835-42.
38. Kizirian J-CC, N.; Pinchard, L.; Caille, J.-C.; Alexakis, A. Enantioselective addition of methyllithium to aromatic imines catalyzed by C2 symmetric tertiary diamines. Tetrahedron. 2005;61:8939-46.
39. Gille SC, N.; Kizirian, J.-C.; Alexakis, A. A new pseudo C2-symmetric tertiary diamine for the enantioselective addition of MeLi to aromatic imines. Tetrahedron: Asymmetry. 2006;17:1045-7.
40. Cabello NK, J.-C.; Alexakis, A. Enantioselective addition of aryllithium reagents to aromatic imines mediated by 1,2-diamine ligands. Tetrahedron Lett. 2004;45:4639-42.
41. Cheng Z, Ren J, Li Y, Chang W, Chen Z. Study on the multiple mechanisms underlying the reaction between hydroxyl radical and phenolic compounds by qualitative structure and activity relationship. Bioorg Med Chem. 2002 Dec;10(12):4067-73.
42. Pompe MV, M.; Randic, M.; Balaban, A. T. Using variable and fixed topological indices for the prediction of reaction rate constants of volatile unsaturated hydrocarbons with OH radicals. Molecules (Basel, Switzerland). 2004;9:1160-76.
43. Ren YL, H.; Yao, X.; Liu, M. Prediction of ozone tropospheric degradation rate constants by projection pursuit regression. Anal Chim Acta. 2007;589:150-8.
44. González-Díaz H, Saiz-Urra L, Molina R, Santana L, Uriarte E. A Model for the Recognition of Protein Kinases Based on the Entropy of 3D van der Waals Interactions. J Proteome Res. 2007 Feb 2;6(2):904-8.
45. Cruz-Monteagudo M, González-Díaz H, Agüero-Chapin G, Santana L, Borges F, Domínguez RE, et al. Computational Chemistry Development of a Unified Free Energy Markov Model for the Distribution of 1300 Chemicals to 38 Different Environmental or Biological Systems. J Comput Chem. 2007;28:1909-22.
46. Bonchev D, Buck GA. From molecular to biological structure and back. J Chem Inf Model. 2007 May-Jun;47(3):909-17.
47. Bonchev D. On the complexity of directed biological networks. SAR QSAR Environ Res. 2003 Jun;14(3):199-214.
48. Park J, Barabási AL. Distribution of node characteristics in complex networks. Proc Natl Acad Sci U S A. 2007 Nov 13;104(46):17916-20.
49. Spirin V, Mirny LA. Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci U S A. 2003 Oct 14;100(21):12123-8.
50. Santana L, Uriarte E, González-Díaz H, Zagotto G, Soto-Otero R, Mendez-Alvarez E. A QSAR model for in silico screening of MAO-A inhibitors. Prediction, synthesis, and biological assay of novel coumarins. J Med Chem. 2006 Feb 9;49(3):1149-56.
51. González-Díaz H, Molina-Ruiz R, Hernandez I. MARCH-INSIDE v3.0 (MARkov CHains INvariants for SImulation & DEsign); Windows supported version available on request from the main author, contact email: gonzalezdiazh@yahoo.es. 3.0 ed 2007.
52. StatSoft, Inc. STATISTICA (data analysis software system), version 6.0. www.statsoft.com. 2002.
53. Hill T, Lewicki P. STATISTICS Methods and Applications. A Comprehensive Reference for Science, Industry and Data Mining. Tulsa: StatSoft 2006.
54. Koschützki D. CentiBiN Version 1.4.2, Centralities in Biological Networks. © 2004-6 Dirk Koschützki, Research Group Network Analysis, IPK Gatersleben, Germany. 2006.
55. Junker BH, Koschützki D, Schreiber F. Exploration of biological network centralities with CentiBiN. BMC Bioinformatics. 2006;7:219.
Transworld Research Network 37/661 (2), Fort P.O. Trivandrum-695 023 Kerala, India
Topological Indices for Medicinal Chemistry, Biology, Parasitology, Neurological and Social Networks, 2010: 69-94 ISBN: 978-81-7895-489-9 Editors: Humberto González-Díaz and Cristian Robert Munteanu
5. Network entropies classification of fungi and bacteria cellulases of interest for biotechnology
Guillermín Agüero-Chapin¹,², Aminael Sanchez-Rodríguez¹, Agostinho Antunes², Gustavo A. de la Riva³ and Humberto González-Díaz⁴
¹ CBQ, IBP, Faculty of Chemistry and Pharmacy, UCLV, SC, 54830, Cuba
² CIMAR, Centro Interdisciplinar de Investigação Marinha e Ambiental, Universidade do Porto, Rua dos Bragas, 177, 4050-123 Porto, Portugal
³ Departamento de Biología, Instituto Superior Tecnológico de Irapuato (ITESI), Irapuato, Guanajuato, 36821, México
⁴ Department of Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela (USC), Santiago de Compostela, 15782, Spain
Abstract. Enzymatic hydrolysis is a limiting step in the conversion of lignocellulosic polymeric biomass into bio-ethanol. Searching for novel and more efficient polymer-degrading cellulase components is one of the solutions for improving cellulose hydrolysis. Here, we propose a fast multi-target Quantitative Structure-Activity Relationship (mt-QSAR) approach using three models, one for each component of the cellulolytic complex: endoglucanase, exoglucanase, and β-glucosidase. These models are based on Entropy measures (Θk) of the pseudo-folding of protein sequences in HP-Lattice Networks. The QSAR models developed correctly classified 363 protein sequences downloaded from GenBank, with accuracy higher than 85% in both training and validation. Comparable results were obtained by using sophisticated alignment methods such as HMM-Pfam. Finally, an mt-QSAR solution is presented that assigns a score to each complex of three sequences based on the three different models. It allows screening complexes from different sources and selecting “in silico” the best enzymatic candidates with up to three components.

Correspondence/Reprint request: Dr. González-Díaz H, Department of Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela (USC), Santiago de Compostela, 15782, Spain. E-mail: humberto.gonzalez@usc.es or gonzalezdiazh@gmail.com
1. Introduction
Biodegradation of naturally occurring and synthetic polymers is of major importance for the development of a polymer industry with low environmental risk [1]. In this sense, it is possible to study and/or develop new polymer-based biodegradable materials [2] and polymers with medical applications [3]. Enzymatic biodegradation is also useful for developing new routes to obtain fuels from alternative sources. For instance, fuel ethanol is becoming an increasingly important alternative energy source, triggered by the world energy crisis. In recent years, growing attention has been devoted to the conversion of lignocellulosic biomass into fuel ethanol, considered the cleanest liquid alternative to fossil fuels [4]. Economically competitive production of ethanol from lignocellulosic biomass by enzymatic hydrolysis and fermentation is currently limited, in part, by the relatively high cost and low efficiency of the enzymes required to hydrolyze cellulose to fermentable sugars. In this sense, the discovery of new cellulolytic enzymes is of major interest [5]. Lignocellulosic biomass, composed of cellulose, hemicellulose and lignin, is the most abundant and cheapest bio-resource on earth and an excellent raw material for ethanol production. There are mainly two processes involved in the conversion: (i) hydrolysis of cellulose in the lignocellulosic biomass to produce reducing sugars and (ii) fermentation of the sugars to ethanol. Based on current technologies, the cost of ethanol production from lignocellulosic materials is relatively high, with the main challenges being the low yield and high cost of the hydrolysis process. Research efforts to improve the hydrolysis of lignocellulosic materials include the pretreatment of lignocellulosic materials, simultaneous saccharification and fermentation, and the optimization and loading of the cellulase enzymes in the hydrolysis step [6].
Several factors influence the efficiency of enzymatic hydrolysis, namely the size of the particles, the degree of polymerization and crystallinity of the substrate, the reaction conditions, the nature and proportion of other lignocellulosic components in the biomass, and the cellulase complex [7]. The efficiency of the cellulase complex is influenced both by the quality of its components and by its composition. The performance of the cellulase complex varies among microorganisms, with detectable differences observed even among species of the same genus [8].
One of the main goals of enzymatic bio-ethanol technology is the search for efficient cellulase complexes or the use of recombinant ethanologenic microorganisms with high cellulolytic potential [9]. The cellulase complex is composed mainly of three enzymes (endoglucanase, exoglucanase and β-glucosidase) acting synergistically. These enzymes are found in many fungal and bacterial species [10], but only a few of these microorganisms are used in the biotechnology industry, despite the fact that they all represent a natural source of cellulase genes. Moreover, there has been a dramatic growth in the number of DNA and protein sequences in databases for each enzyme of the cellulase complex. Such sequences were isolated from fungi and bacteria in general, regardless of their utility in the biotechnology industry. However, due to the growing amount of protein database information, it is difficult to select the best cellulase complexes for enzymatic hydrolysis or the most promising cellulolytic components, which would allow the discovery of new and more efficient complexes or the establishment of recombinant microorganisms with high cellulolytic potential. Bioinformatics tools relying on Computational Chemistry methods allow the search and retrieval of relevant protein sequences. However, alignment-type approaches are based solely on linear information and may perform poorly at low levels of sequence similarity [11]. To overcome this problem, we can use different types of classification models and alignment-free sequence parameters derived from graph [12-14] or network theory in the development of a predictive model [15-17]. The use of molecular parameters to predict biological activity is often known as the Quantitative Structure-Activity Relationship (QSAR) approach. 
The QSAR-type models may use as inputs small-molecule 2D-3D parameters [18, 19], ligand-target interaction parameters, or protein [20] and DNA/RNA parameters derived from different types of network representations [21]. The MARCH-INSIDE methodology (MARkov CHains Invariants for SImulation & DEsign) generates molecular descriptors based on Markov Chain models associated with graph or network representations of chemical structure. In previous works, this methodology has often been reported as MARkovian CHemicals IN SIlico Design, mainly in studies with small molecules [22]. However, the challenge is to use the QSAR approach in genomics [23] and proteomics [24] studies, allowing the functional annotation of nucleic acids and proteins. Here, we used the MARCH-INSIDE methodology to codify information about protein sequences after rearranging or folding the sequences into a 2D-Lattice network. Several types of Complex networks have been reported, including random network models, small-world networks, several subtypes of Lattice networks and others [25]. In this work, we used the Hydrophobicity-Polarity (HP)
Polymer Lattice networks, which are very useful for protein studies [26, 27]. From this 2D-Lattice network-like representation of the protein sequence we derived different numerical indices to characterize the protein backbone [14]. Once a type of network is selected, there are also different numbers that can be used to describe the information of the protein sequence and to seek classification models. We selected Entropy and Spectral Moment type measures [28-30] because they are predictors for many different Computational Chemistry problems. Two recent reviews cover the uses of several types of network and graph parameters in small-sized molecules, DNA, RNA, proteins and Complex Networks of proteins, gene co-expression, metabolism and others, including MARCH-INSIDE and the other methods referred to above [31, 32]. The methodology allowed the screening and selection of cellulolytic components from many producer microorganisms without the need for detailed protein structural information. Therefore, it may become a useful tool for guiding the optimization of cellulase complexes and for the design of recombinant enzymes to improve the efficiency of enzymatic hydrolysis in ethanol production.
2. Materials and methods
2.1. Computational methods
First, we divided the cellulase complex into its three polymer components (endoglucanase, exoglucanase and β-glucosidase). Secondly, we downloaded the amino acid (aa) sequences belonging to each component from the GenBank database, taking care to gather as many cellulase sequences as possible, since our methodology relies only on the primary protein structure. Each sequence was labeled by its accession number; see Tables I, III, and V in file S01 of the supplementary materials (SM). We also used a control group comprising heterogeneous proteins with diverse functions other than cellulolytic action. As part of it, a set of hydrolases (glycosidases, O-glycosyl hydrolases, polygalacturonases) belonging to families closely related to our target sequences was also included (see Tables II, IV, and VI in file S01 of SM). Afterwards, we encoded biologically interesting sequence information with numeric parameters defined in terms of the k-order Markov Matrix Entropy (Θk) and Markov Matrix Spectral Moments (πk). We used a Markov model (MM) to codify useful information about enzymatic cellulase complex sequences coming from different fungi and bacteria species. The MM approach, referred to above as MARCH-INSIDE, considers any atom, nucleotide or amino acid as a state of a Markov Chain (MC), depending on
the kind of molecule to be described [33]. Variations of this methodology have been explained in depth in several earlier reports [20, 34] describing the codification of protein sequences into numerical descriptors (Θk and πk values). To calculate the Θk values, we first used the MM to determine the absolute probabilities Apk(j) of the distribution of aa charges within a vicinity of length k for a given aa j in a specific space [35]. We express in probability terms the changes in amino acid electron density (charge) in subsequent intervals of time along the protein backbone until a stationary or steady-state distribution arises. The calculation of Apk(j) depends on the definition of two matrices (0Q and kΠ) and of their respective components (Ap0(j); kpij), which are defined below. The Θk values are used to estimate sequence information content as the entropies of the charge distribution in a whole protein at time k. They are calculated from the probabilities Apk(j) of the distribution of aa charges in different spaces [36]:

$$ \Theta_k = -k_B \sum_{j=1}^{n} {}^{A}p_k(j) \, \log {}^{A}p_k(j) \tag{1} $$
where kB is the Boltzmann constant. Before calculating the Apk(j) values for each aa, we rearrange the protein in a 2D Cartesian space. Folding the cellulase protein sequences in a 2D HP-Lattice network overcomes the need to know the cellulase 3D structure. Thus, the present study relies only on the protein sequence. This 2D Cartesian representation of proteins was first used successfully by our group [24] to study enzymes of interest in plant metabolism engineering and biotechnology and, more recently, in the recognition of ribonucleases without using alignment tools [14]. The representation was described in these previous reports, which cover a sequence built from 20 types of aa for proteins instead of a sequence built from 4 types of bases for DNA [37]. We group the aa according to their physicochemical nature into 4 groups: polar, non-polar, acid, or basic. Different sub-types of these HP networks have been reported, sometimes also referred to as HP-maps or HP-graphs [27]. In this HP-Polymer Lattice network we begin with the first aa of the sequence at the center of the space, with coordinates (0,0). We then arrange the whole aa sequence as follows [24]:
a) Increase the abscissa by 1 for an acid aa (rightwards step); or
b) Decrease the abscissa by 1 for a basic aa (leftwards step); or
c) Increase the ordinate by 1 for a polar aa (upwards step); or
d) Decrease the ordinate by 1 for a non-polar aa (downwards step).
After arranging the aa sequence within the 2D space, we used the usual matrix MARCH-INSIDE approach for proteins [36], which has been adapted to calculate the Apk(j) values. The method essentially uses three matrix magnitudes:
a) The matrix 1Π, called the one-step aai-aaj direct charge interaction stochastic matrix. This matrix is built up as a square matrix (n × n) [38] and contains the probabilities 1pij of reaching a node nj by moving along a walk of length k = 1 from another node ni covalently bound to nj. Note that the number of nodes (n) in the graph (see Fig. 2) may be equal to or smaller than the number of amino acids in the protein sequence [39].

$$ {}^{1}p_{ij} = \frac{\alpha_{ij} \cdot \left( \dfrac{q_j}{r_{j0}} \right)}{\sum_{m=1}^{n} \alpha_{im} \cdot \left( \dfrac{q_m}{r_{m0}} \right)} = \frac{\alpha_{ij} \cdot \varphi_j}{\sum_{m=1}^{n} \alpha_{im} \cdot \varphi_m} \tag{2} $$

$$ {}^{A}p_0(j) = \frac{q_j}{\sum_{m=1}^{n} q_m} \tag{3} $$
where qj is the charge of the node nj, αij equals 1 if the nodes ni and nj are adjacent in the graph and 0 otherwise, and ϕj is the electrostatic potential on nj with respect to the centre of coordinates. We rescaled the charge parameter to positive values, representing negative charges by the lowest positive values, in order to satisfy the non-negativity of probabilities. b) The initial probabilities vector 0Q (see Eq. 4) contains the absolute initial probabilities Ap0(j) of the distribution of aa charges in the 2D-HP Polymer Lattice network. The method considers that a total charge or weight (qj) can be assigned to each node. The charge of a node is equal to the sum of the charges of all the aa coinciding in that node. To retain a more compact matrix notation, all charges are arranged as a column vector 0Q. c) The AΠk matrix (vector matrix) is defined as the product of 0Q and the kΠ matrix, which is the kth power of the 1Π matrix (see Eq. 4). This row matrix contains the Apk(j) values we need to calculate the sequence information content in the 2D space [40]:
$$ {}^{A}\Pi_k = {}^{0}Q \times {}^{k}\Pi = {}^{0}Q \times \left({}^{1}\Pi\right)^{k} \tag{4} $$
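The lattice placement rules a) to d) given earlier can be sketched in Python as follows. The grouping of one-letter amino acid codes into the four classes is an assumption of this sketch, since the assignment is not listed in this section, and the function name is purely illustrative.

```python
# Assumed grouping of one-letter amino acid codes into the four classes.
AA_CLASS = {}
for code in "DE":
    AA_CLASS[code] = "acid"
for code in "KRH":
    AA_CLASS[code] = "basic"
for code in "STNQCY":
    AA_CLASS[code] = "polar"
for code in "GAVLIMFWP":
    AA_CLASS[code] = "non-polar"

# Rules a) to d): one unit step per residue, direction set by its class.
STEP = {"acid": (1, 0), "basic": (-1, 0), "polar": (0, 1), "non-polar": (0, -1)}


def fold(sequence):
    """Place a one-letter sequence on the 2D lattice.

    Returns (nodes, edges): nodes maps each visited coordinate to the
    residues that landed on it; edges is the set of links between
    consecutively visited coordinates.
    """
    x = y = 0
    nodes = {(0, 0): [sequence[0]]}  # the first residue sits at the origin
    edges = set()
    for residue in sequence[1:]:
        dx, dy = STEP[AA_CLASS[residue]]
        previous = (x, y)
        x, y = x + dx, y + dy
        nodes.setdefault((x, y), []).append(residue)
        edges.add(frozenset((previous, (x, y))))
    return nodes, edges


# Asp-Glu-Asp-Lys: three rightward positions are visited, then Lys
# (basic) steps back onto the node already holding Glu.
nodes, edges = fold("DEDK")
print(nodes)
print(len(edges))
```

Because several residues can land on the same coordinate, the number of nodes can be smaller than the number of residues, which is why the tetrapeptide above yields a linear graph with only three nodes.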
Expanding expression (4) for k = 0, 1, 2 and 3 gives the absolute probabilities Apk(j) of charge distribution for a given aaj within a vicinity of length k in the 2D space. This expansion is illustrated for the linear graph n1-n2-n3 characteristic of the sequence (Asp-Glu-Asp-Lys). Please note that, following the placement rules above, the central node contains both Glu and Lys:
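$$ {}^{A}\Pi_0 = \left[ {}^{A}p_0(n_1),\ {}^{A}p_0(n_2),\ {}^{A}p_0(n_3) \right] \cdot \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \left[ {}^{A}p_0(n_1),\ {}^{A}p_0(n_2),\ {}^{A}p_0(n_3) \right] \tag{4a} $$

$$ {}^{A}\Pi_1 = \left[ {}^{A}p_0(n_1),\ {}^{A}p_0(n_2),\ {}^{A}p_0(n_3) \right] \cdot \begin{bmatrix} {}^{1}p_{11} & {}^{1}p_{12} & 0 \\ {}^{1}p_{21} & {}^{1}p_{22} & {}^{1}p_{23} \\ 0 & {}^{1}p_{32} & {}^{1}p_{33} \end{bmatrix} = \left[ {}^{A}p_1(n_1),\ {}^{A}p_1(n_2),\ {}^{A}p_1(n_3) \right] \tag{4b} $$

$$ {}^{A}\Pi_2 = \left[ {}^{A}p_0(n_1),\ {}^{A}p_0(n_2),\ {}^{A}p_0(n_3) \right] \cdot \begin{bmatrix} {}^{1}p_{11} & {}^{1}p_{12} & 0 \\ {}^{1}p_{21} & {}^{1}p_{22} & {}^{1}p_{23} \\ 0 & {}^{1}p_{32} & {}^{1}p_{33} \end{bmatrix} \cdot \begin{bmatrix} {}^{1}p_{11} & {}^{1}p_{12} & 0 \\ {}^{1}p_{21} & {}^{1}p_{22} & {}^{1}p_{23} \\ 0 & {}^{1}p_{32} & {}^{1}p_{33} \end{bmatrix} = \left[ {}^{A}p_2(n_1),\ {}^{A}p_2(n_2),\ {}^{A}p_2(n_3) \right] \tag{4c} $$

$$ {}^{A}\Pi_3 = \left[ {}^{A}p_0(n_1),\ {}^{A}p_0(n_2),\ {}^{A}p_0(n_3) \right] \cdot \begin{bmatrix} {}^{1}p_{11} & {}^{1}p_{12} & 0 \\ {}^{1}p_{21} & {}^{1}p_{22} & {}^{1}p_{23} \\ 0 & {}^{1}p_{32} & {}^{1}p_{33} \end{bmatrix} \cdot \begin{bmatrix} {}^{1}p_{11} & {}^{1}p_{12} & 0 \\ {}^{1}p_{21} & {}^{1}p_{22} & {}^{1}p_{23} \\ 0 & {}^{1}p_{32} & {}^{1}p_{33} \end{bmatrix} \cdot \begin{bmatrix} {}^{1}p_{11} & {}^{1}p_{12} & 0 \\ {}^{1}p_{21} & {}^{1}p_{22} & {}^{1}p_{23} \\ 0 & {}^{1}p_{32} & {}^{1}p_{33} \end{bmatrix} = \left[ {}^{A}p_3(n_1),\ {}^{A}p_3(n_2),\ {}^{A}p_3(n_3) \right] \tag{4d} $$

The calculation of the Θk values for all aa sequences in both groups was carried out with our in-house software MARCH-INSIDE version 3.0 ® [41]. Finally, a raw data table with the eleven Θk values for each sequence (k = 0, 1, 2, …, 10) and a dummy variable indicating whether or not it is an active sequence was uploaded to the statistical analysis software. The other class of parameters used was the Spectral Moments of the Markov Matrix, which are easy to calculate in terms of the Trace (Tr) operator (sum of the probabilities in the main diagonal):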
\[ \pi_k = \mathrm{Tr}\left[\left({}^{1}\Pi\right)^{k}\right] = \sum_{i=j}^{n} {}^{k}p_{ij} \tag{4e} \]
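The expansion in Eqs. 4a-4d and the spectral moments of Eq. 4e can be checked numerically on the three-node graph n1-n2-n3. The following is only a minimal sketch under stated assumptions: the node charges are hypothetical, each node is treated as adjacent to itself (as the non-zero diagonal of the matrices above suggests), the elements of 1Π are taken as αij·qj normalised over each row, 0Q is normalised to sum to 1, and Θk is taken to be a Shannon-type entropy of the Apk(j) values. None of these choices is confirmed by the text above.

```python
import math

# Hypothetical rescaled (positive) charges of the nodes n1, n2, n3.
charges = [1.0, 2.0, 3.0]
# Adjacency alpha_ij of the linear graph n1-n2-n3; self-adjacency is assumed.
alpha = [[1, 1, 0],
         [1, 1, 1],
         [0, 1, 1]]


def transition_matrix(q, a):
    """Row-stochastic 1Pi with p_ij = a_ij*q_j / sum_m a_im*q_m (assumed form)."""
    rows = []
    for i in range(len(q)):
        norm = sum(a[i][m] * q[m] for m in range(len(q)))
        rows.append([a[i][j] * q[j] / norm for j in range(len(q))])
    return rows


def mat_mul(x, y):
    """Plain matrix product of two nested lists."""
    return [[sum(x[i][m] * y[m][j] for m in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def mat_power(p, k):
    """k-th power of a square matrix; k = 0 gives the identity (Eq. 4a)."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, p)
    return result


def absolute_probabilities(q, p, k):
    """Eq. 4: row vector 0Q x (1Pi)^k, with 0Q normalised to sum to 1."""
    total = sum(q)
    q0 = [[value / total for value in q]]
    return mat_mul(q0, mat_power(p, k))[0]


def entropy(probs):
    """Shannon-type entropy -sum p ln p (assumed form of Theta_k)."""
    return -sum(x * math.log(x) for x in probs if x > 0)


def spectral_moment(p, k):
    """Eq. 4e: trace of the k-th power of 1Pi."""
    pk = mat_power(p, k)
    return sum(pk[i][i] for i in range(len(pk)))


pi1 = transition_matrix(charges, alpha)
for k in range(4):
    ap_k = absolute_probabilities(charges, pi1, k)
    print(k, [round(x, 4) for x in ap_k],
          round(entropy(ap_k), 4), round(spectral_moment(pi1, k), 4))
```

For k = 0 the loop simply returns the normalised charges, mirroring the identity matrix in Eq. 4a; larger k values propagate the charge distribution along the graph.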
2.2. Statistical analysis. K-Means cluster analysis (k-MCA)
The k-MCA was used to design the training and prediction series [42]. The method requires partitioning each class of cellulolytic component (endoglucanase, exoglucanase and β-glucosidase) and each series of non-active proteins independently into several statistically representative clusters of sequences. The members of the training and prediction series are then selected from all of these clusters. This procedure ensures that the main protein classes (as determined by the clusters derived from k-MCA) are represented in both the cellulase-related sequence group and the inactive group, so that both the training and the prediction series are representative of the entire “experimental universe”. This method has been applied before in QSAR research for the same purpose [43].
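The selection step can be pictured as a stratified random draw from the clusters. The sketch below is an illustration only: the sequence identifiers, the cluster labels, the 75% training fraction and the random seed are all hypothetical, and the clustering itself (k-MCA) is assumed to have been done beforehand.

```python
import random


def split_by_cluster(cluster_of, train_fraction=0.75, seed=0):
    """Draw training and prediction members from every cluster.

    cluster_of maps a sequence identifier to its cluster label.
    The fraction and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    clusters = {}
    for seq_id, label in cluster_of.items():
        clusters.setdefault(label, []).append(seq_id)
    train, predict = [], []
    for label in sorted(clusters):
        members = sorted(clusters[label])
        rng.shuffle(members)
        cut = round(train_fraction * len(members))
        train.extend(members[:cut])
        predict.extend(members[cut:])
    return train, predict


# Hypothetical example: 16 sequences in three clusters of 8, 4 and 4 members.
cluster_of = {f"seq{i:02d}": ("A" if i < 8 else "B" if i < 12 else "C")
              for i in range(16)}
train, predict = split_by_cluster(cluster_of)
print(len(train), len(predict))
```

Because the draw is made inside each cluster, every cluster contributes to both series, which is the property the procedure relies on.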
2.3. Linear Discriminant Analysis (LDA)
We used three data sets in total, one for each component of the cellulase complex. Training and prediction series were selected following k-MCA. The numbers of active and non-active sequences were balanced according to LDA requirements. A forward stepwise LDA was performed to seek the three quantitative sequence-function relationships [44]. The statistical parameters of these models are those usually reported for LDA-based QSAR models [45]: Wilks' lambda (λ), which measures the overall discrimination and ranges from 0 (perfect discrimination) to 1 (no discrimination); and the Fisher ratio (F), which measures the extent to which a variable makes a unique contribution to the prediction of group membership, with a probability of error (p-level) p(F) < 0.05.
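For a single descriptor and two groups, Wilks' λ reduces to the ratio of the within-group to the total sum of squares, and the Fisher ratio follows from it. The sketch below illustrates that special case only; the descriptor values are hypothetical and the multivariate, stepwise procedure used in this work is not reproduced.

```python
def wilks_lambda_and_f(active, inactive):
    """Univariate two-group Wilks' lambda and the corresponding F ratio."""
    everything = active + inactive
    grand_mean = sum(everything) / len(everything)
    ss_total = sum((x - grand_mean) ** 2 for x in everything)
    ss_within = 0.0
    for group in (active, inactive):
        mean = sum(group) / len(group)
        ss_within += sum((x - mean) ** 2 for x in group)
    lam = ss_within / ss_total
    # One variable, two groups: F has 1 and N - 2 degrees of freedom.
    f_ratio = (1 - lam) / lam * (len(everything) - 2)
    return lam, f_ratio


# Hypothetical entropy values for active and non-active sequences.
active = [2.1, 2.3, 2.2, 2.4]
inactive = [1.0, 1.2, 0.9, 1.1]
lam, f_ratio = wilks_lambda_and_f(active, inactive)
print(round(lam, 3), round(f_ratio, 1))
```

Well-separated groups drive λ towards 0 and F upwards; groups with identical means give λ = 1 and F = 0.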
2.4. Classification based on sequence homology
We evaluated the extent to which our methodology codifies information beyond the merely linear sequence. Sequence homology was used as a metric of the linear sequence content of proteins. The Smith-Waterman local alignment tool [46] was used to evaluate homology between all cellulase-related sequences (see Tables VII, VIII and IX in files S02, S03 and S04 of SM). In addition, the Hidden Markov Model (HMM) profile for each cellulase family was downloaded from the Pfam database (www.sanger.ac.uk/Software/Pfam/) [47]. To compare the classification power of the methodologies,
each member was aligned against its Pfam profile using the program hmmsearch from the hmmer-2.3.2 package. We also assessed the linear correlation between the scores of different sequence comparison algorithms and sequence homology. The algorithms tested were the Smith-Waterman tool, HMM-Pfam and the MM-LDA classification model proposed in the present work. A data set of endoglucanase protein pairs with a variable degree of homology was used. Sequences were compared pairwise, and in each case the Smith-Waterman score was normalized with respect to the alignment length and recorded. The HMM-Pfam score for the plot was obtained as the score difference between the members of each pair, normalized with respect to the alignment length. Finally, each protein of a pair was scored with the MM-LDA model for the endoglucanase family and the score difference was recorded. For the correlation analysis, plots of sequence homology versus score were obtained in each case.
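The correlation analysis amounts to an ordinary least-squares fit of score against sequence similarity for each scoring scheme. A minimal sketch is given below; the similarity and score values are hypothetical and serve only to show how the regression coefficient, the intercept and R2 are obtained.

```python
def linear_fit(x, y):
    """Least-squares slope, intercept and squared correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = sxy ** 2 / (sxx * syy)
    return slope, intercept, r2


# Hypothetical pairwise similarities (fractions) and normalized scores.
similarity = [0.25, 0.40, 0.55, 0.70, 0.85, 1.00]
scores = [0.31, 0.42, 0.60, 0.71, 0.83, 1.02]
slope, intercept, r2 = linear_fit(similarity, scores)
print(round(slope, 3), round(intercept, 3), round(r2, 3))
```

A scoring scheme that tracks linear sequence similarity gives R2 close to 1, whereas a scheme that is independent of it gives R2 close to 0.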
3. Results and discussion
3.1. Alignment-free QSAR models for cellulases
Markov stochastic processes have been used for protein folding recognition [48] and can be applied in proteomics to predict protein signal sequences [49]. In recent works, we have shown that MARCH-INSIDE is an alternative method for the functional annotation of proteins, using numeric sequence descriptors generated by our software [24]. In this work, we focus on the cellulase enzymes that convert cellulose from biomass into bio-ethanol. As mentioned before, there are three cellulase families (endoglucanases, exoglucanases and β-glucosidases), each with its respective control group. The endoglucanase family and its control group were split independently into four clusters of 37, 49, 31, 27 and 36, 43, 57, 11 members, respectively. The k-MCA divided the exoglucanases into three clusters with 27, 24 and 21 members; their control group was also separated into three clusters, with 34, 61 and 54 members. The β-glucosidase groups (active and non-active) were each split into three clusters: 28, 15, 40 and 34, 14, 71 members, respectively. Clusters were determined using two types of molecular descriptors defined in the MARCH-INSIDE methodology: entropy-type (Θk) and spectral-moment-type (πk) indices. The members of the training and prediction sets were selected at random, in proportion to the cluster sizes [50]. We took into consideration the standard deviation between and within clusters, the respective Fisher ratio and the significance of their p-level [51]. All variables used to construct the clusters (Θk and πk) showed p-levels < 0.05 in the Fisher test, as depicted in Table 1. We described
three or four statistically homogeneous clusters, depending on the cellulase family, which supports the existence of several subfamilies. The main conclusion to be drawn from the k-MCA is that the amino acid sequences of cellulases are diverse even within the same family (as codified by the MARCH-INSIDE descriptors). This is also supported by the pairwise alignments performed between members of the three families with the Smith-Waterman procedure, which retrieved homology percentages as low as 23%, and even lower for the endoglucanase and β-glucosidase classes.

Table 1. MM-QSAR classification results in Training, Validation (CV), Overall (Training + CV), and after removing homologous sequences.

Endoglucanases
Series                   Cases    %       Endogs   others
QSAR Train               Endogs   98.15   106      2
                         others   83.48   19       96
                         Total    90.58
QSAR CV                  Endogs   94.44   34       2
                         others   87.5    4        28
                         Total    91.17
QSAR Train + CV          Endogs   97.22   140      4
                         others   84.35   23       124
                         Total    90.72
21 homologous removed    Endogs   97.56   120      3
                         others   82.31   26       121
                         Total    89.26

Exoglucanases
Series                   Cases    %       Exogs    others
QSAR Train               Exogs    94.44   51       3
                         others   84.11   17       90
                         Total    87.57
QSAR CV                  Exogs    100     18       0
                         others   78.05   9        32
                         Total    84.75
QSAR Train + CV          Exogs    95.83   69       3
                         others   82.43   26       122
                         Total    86.81
12 homologous removed    Exogs    95      57       3
                         others   86.48   20       128
                         Total    88.94

β-glucosidases
Series                   Cases    %       β-glcs   others
QSAR Train               β-glcs   100     105      0
                         others   83.18   18       89
                         Total    91.5
QSAR CV                  β-glcs   100     42       0
                         others   78.05   9        32
                         Total    89.16
QSAR Train + CV          β-glcs   100     147      0
                         others   81.76   27       121
                         Total    90.85
27 homologous removed    β-glcs   100     120      0
                         others   81.76   27       121
                         Total    89.92
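The percentages in Table 1 follow directly from the counts in each classification matrix. As a worked check, the short sketch below recomputes the three percentages of the endoglucanase training series from its four counts (106, 2, 19 and 96); the function and argument names are illustrative only.

```python
def classification_rates(active_hit, active_miss, other_miss, other_hit):
    """Percentages of correctly classified actives, others and all cases."""
    actives = 100 * active_hit / (active_hit + active_miss)
    others = 100 * other_hit / (other_hit + other_miss)
    total = 100 * (active_hit + other_hit) / (
        active_hit + active_miss + other_miss + other_hit)
    return actives, others, total


# Endoglucanase training series of Table 1.
rates = classification_rates(106, 2, 19, 96)
print([round(r, 2) for r in rates])  # [98.15, 83.48, 90.58]
```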
Once a representative training set had been selected for the three enzymes (aa sequences of endoglucanases, exoglucanases and β-glucosidases) and the control group, it was used to fit the discriminant functions. We chose the functions with the highest statistical significance and as few parameters as possible. Each discriminant function expresses, in probability terms, the propensity of a given aa sequence to belong to one of the three cellulase families. The equations classify the sequences according to their biological function and provide a predicted probability as a numerical score. The best classification functions found by LDA for each enzyme of the cellulolytic complex were:

\[ \text{Endoglucanase} = 22.17 \times {}^{endo}\Theta_0 - 1.37 \times {}^{endo}\Theta_{10} - 36.67 \tag{5} \]
N = 291, λ = 0.43, F = 279.45, p < 0.001

\[ \text{Exoglucanase} = 16.17 \times {}^{exo}\Theta_0 - 28.91 \tag{6} \]
N = 220, λ = 0.43, F = 292.03, p < 0.001

\[ \beta\text{-glucosidase} = 24.89 \times {}^{gluc}\Theta_0 - 43.56 \tag{7} \]
N = 295, λ = 0.32, F = 625.78, p < 0.001
where N is the number of proteins used to seek the corresponding classification models, which discriminate between one class of cellulolytic enzymes (endoglucanases, exoglucanases or β-glucosidases) and proteins unrelated to cellulases. The statistical parameters of the above equations are those usually reported for LDA-based QSAR models [45], including Wilks' lambda (λ) and the Fisher ratio (F) with its probability of error (p-level) p(F). Note that in all three cases the value of p(F) is significant, so the null hypothesis (Ho) of no difference between the two groups is rejected. In these studies, the training and validation series together comprised 144 endoglucanases, 72 exoglucanases and 147 β-glucosidases, respectively. The first discriminant function (Equation 5) correctly classified 264 of the 291 proteins used in the training and validation series (an accuracy of 90.72%). More specifically, this model correctly classified 140/144 (97.22%) of the endoglucanases and 124/147 (84.35%) of the heterogeneous proteins. For the exoglucanases (Equation 6), the model correctly classified 191 of 220 proteins (86.81%). In particular, it correctly classified 69/72 (95.83%) of the exoglucanases and 122/148 (82.43%) of the non-exoglucanase proteins. For the β-glucosidases (Equation 7), the model correctly classified 268
out of 295 proteins (90.85%). It recognized 147/147 (100%) of the β-glucosidases and 121/148 (81.76%) of the other proteins. In Table 2 we present the classification matrices of the training series, the cross-validation series and the two series together for each enzymatic component of the complex (endoglucanases, exoglucanases and β-glucosidases). We performed a validation procedure based on an external series [52] in order to assess the predictability of the QSAR models. This validation was carried out with external prediction series made up of 36 endoglucanases, 18 exoglucanases and 42 β-glucosidases. The models achieved average predictabilities of 91.17, 84.75 and 89.16%, respectively (see Table 2), which are similar to values reported previously by our group using Markovian entropies derived from other types of 2D graphs to characterize protein backbones [36]. To test the performance of our model at a low redundancy level, we removed from the three data sets the proteins that share 100% amino acid identity according to the Smith-Waterman local alignment analysis (see Table X in S01_file of SM). Although the input information was reduced, the classification power and robustness of our MM were maintained. The new statistical parameters were also similar to those presented above, and the same significant variables entered the model (see Table 3). Despite the varied levels of sequence similarity (23-100%) used to build the MM, only one endoglucanase (AAD01959), which shares 24.60% similarity with YP_084010, was misclassified. These results indicate that our simple MM performs well at variable similarity levels, even for proteins that share less than 30-40% sequence identity.

Table 2. Classification results derived from the Pfam Hidden Markov Model and Markov Model (MM) for each cellulase class

Method      Cellulase component   Total %   Well-classified   Misclassified
HMM-Pfam    Endoglucanase         68.75     99                45
HMM-Pfam    Exoglucanase          79.16     57                15
HMM-Pfam    β-Glucosidase         82.99     122               25
MM          Endoglucanase         97.22     140               4
MM          Exoglucanase          95.83     69                3
MM          β-Glucosidase         100       147               0

Table 3. The statistical parameters (a) of the linear regression models obtained for the correlation between the scores obtained by different methodologies and the similarity percentage of the endoglucanase sequences.

Scoring method    B       Intercept   R2      S       F          p-level
MM-QSAR           0.035   0.129       0.001   0.084   0.138      0.711
HMM-Pfam          0.403   -333.479    0.162   0.114   12.414     < 0.001
Smith-Waterman    0.964   0.293       0.930   0.027   1497.687   < 0.001
(a) B (Regression coefficient), F (Fisher ratio), S (Standard deviation), and R2 (squared correlation coefficient).

As can be seen from Equations 5, 6 and 7, the parameter Θ0 is a common variable. It contributes significantly, with coefficients of 22.17, 16.17 and 24.89 for the endoglucanase, exoglucanase and β-glucosidase functions, respectively. Because these coefficients are positive, a higher Θ0 raises the score of a sequence. In the endoglucanase function a second variable, Θ10, is also included; it carries information on the long-range charge distribution of the aa in the sequence (topological distance k = 10). For the endoglucanase function the behavior is therefore the same, but only if Θ10 simultaneously decreases. Thus the entropy of the initial state, in which each aa conserves its charge, contributes significantly to the function, and any redistribution of charge along the polypeptide chain tending towards the stationary state affects the biological function. The πk values, although important for the k-MCA, do not outperform the Θk values in the LDA-QSAR study. We also built the Receiver Operating Characteristic (ROC) curve [53-55] for each model. Notably, each curve is markedly convex with respect to the y = x line (see Figure 1). This result confirms that each model is a significant classifier, with an area under the curve above 0.8. According to ROC curve theory, random classifiers have an area of only 0.5, so this result clearly differentiates our classifiers from those working at random [56].
Figure 1. ROC-curves analysis.
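The area under a ROC curve can be read as the probability that a randomly chosen active sequence receives a higher score than a randomly chosen non-active one. The sketch below computes the area in that way; it is an illustration with hypothetical scores, not the procedure used to draw Figure 1.

```python
def roc_auc(scores_active, scores_inactive):
    """Area under the ROC curve from pairwise comparisons (ties count 0.5)."""
    wins = 0.0
    for a in scores_active:
        for b in scores_inactive:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(scores_active) * len(scores_inactive))


# Hypothetical predicted probabilities for active and non-active sequences.
print(roc_auc([0.9, 0.4], [0.6, 0.1]))  # 0.75
```

A classifier that ranks every active sequence above every non-active one reaches an area of 1, while indistinguishable scores give 0.5.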
3.2. Comparison of alignment-free QSAR models with alignment methods
For the comparison with other classifiers, each member of the three enzymatic classes was aligned against its HMM profile downloaded from Pfam [57]. The HMM-Pfam methodology correctly classified 99/144 (68.75%) endoglucanases, 122/147 (82.99%) β-glucosidases and 57/72 (79.16%) exoglucanases, following a profile derived from a multiple alignment of conserved domains. Our simple MM and the HMM-Pfam methodology are comparable (see Table 4), despite the simplicity of the MM-LDA, which is a linear equation with two variables at most. Our model does not require a sequence alignment as a prerequisite and can codify the complete sequence, whereas HMM-Pfam depends on a sequence alignment profile that represents a subsequence or domain. Consequently, Pfam may perform poorly in classifying short domains, being less general than our MM, which extracts information from the whole sequence. Our MM could be applied to predict short domains and even poorly conserved intergenic sequences that cannot be aligned. The generality of the MM is seen in particular instances of the classification. Sequences misclassified by HMM-Pfam (no hits detected) are well recognized by our model, such as the 25 members of the β-glucosidase family (see Table 4). Other sequences close to the E-value cutoff (<= 10) in HMM-Pfam, such as the AAD48493 exoglucanase (8.6), the CAB65568 endoglucanase (9.4) and the BAB85988 β-glucosidase (9.3), are scored highly by the MM, with values of 0.96, 0.90 and 0.89, respectively (see files S05, S06 and S07 of SM). The HP networks are alignment-independent and codify higher-order information, not only primary sequence information, thereby complementing the Pfam methodology [58]. To corroborate this statement, we demonstrated that the scoring scheme used by our methodology is independent of linear sequence information.
There was no correlation between the differences in the HP-Polymer Lattice MM scores of protein pairs and their 1D sequence homology or similarity. In contrast, we found a strong linear correlation with sequence similarity for the Smith-Waterman local alignment tool, and a weak but significant correlation for the HMM-Pfam normalized scores. The results of the regression analysis for each scoring scheme are shown in Table 5. We concluded that our MM scores result from the codification of 2D higher-order sequence information. However, it is also important to note that the model misclassified some cellulases with relatively high scores in both series, perhaps because their sequences were too short to codify enough protein information. In other cases, the aa sequences downloaded from GenBank did not properly represent coding regions or domains, because protein sequences are sometimes deduced by translation and their functional annotation is determined by sequence alignment. Both aspects affect the quality of the input data used to train our Markov model and decrease its predictability. These limitations arise from our main goal of screening as many cellulase sequences as possible, since sequence information is far more abundant than structural details and experimental confirmation.

The methodology could also account for evolutionary relationships between the cellulases of bacteria and fungi. We selected the β-glucosidase family, as one component of the cellulase complex, to illustrate this extension of our methodology. The 2D-HP Polymer Lattice network representations of these enzymes showed the similarity in their aa sequence and composition (see Figure 2). The figure shows that β-glucosidase sequences overlap at the same positions of the Cartesian system, which means that they are closely similar in the connectivity (sequence) and nature of their aa, suggesting conservation of the primary structure. This fact supports the evolutionary theory stating that cellulase genes are derived from bacteria by horizontal gene transfer [59]. Since this methodology does not consider aa identity, grouping the aa according to their properties increases the likelihood of finding better matching regions than with alignment techniques.

Table 4. Scores for some cellulase components and scores representing the whole cellulolytic complex from different microorganisms.

Microorganism                          Kingdom    Endo   Exo    β      Score
Cellulomonas fimi                      bacteria   0.90   0.96   0.96   0.83
Thermobifida fusca                     bacteria   0.94   0.98   1.00   0.92
Clostridium cellulovorans              bacteria   0.95   0.76   0.95   0.69
Piromyces sp. E2                       fungi      0.64   0.96   0.97   0.59
Trichoderma viride                     fungi      0.98   0.97   0.96   0.92
Gibberella zeae                        fungi      0.99   0.97   0.78   0.75
Bacillus sp.                           bacteria   0.99   0.96   0.75   0.72
Clostridium stercorarium               bacteria   0.99   0.95   0.98   0.92
Streptomyces coelicolor A3(2)          bacteria   0.98   0.74   1.00   0.72
Aspergillus aculeatus                  fungi      0.79   0.94   0.92   0.68
Aspergillus fumigatus Af293            fungi      0.88   0.70   0.99   0.61
Caldicellulosiruptor saccharolyticus   bacteria   0.99   0.91   0.96   0.86
Phanerochaete chrysosporium            fungi      0.95   0.97   0.92   0.85
Humicola grisea var. thermoidea        fungi      0.98   0.98   0.96   0.92
Neurospora crassa                      fungi      0.95   0.90   0.94   0.80
Clostridium thermocellum               bacteria   0.88   0.94   0.88   0.73
Hypocrea jecorina                      fungi      1.00   0.98   0.58   0.57
Pectobacterium carotovorum             bacteria   0.87   0.63   0.93   0.51
Agaricus bisporus                      fungi      0.91   0.98   1.00   0.89
Xanthomonas campestris                 bacteria   0.91   0      1.00   0
Actinomyces sp. 40                     bacteria   0.91   0      0.99   0
Cellulomonas flavigena                 bacteria   0.97   0.91   0      0
Alternaria alternata                   fungi      0.89   0.98   0      0
Talaromyces emersonii                  fungi      0.97   0      1.00   0
Piromyces equi                         fungi      1.00   0.95   0      0
Burkholderia pseudomallei 1710b        bacteria   0.66   0      1.00   0
Aspergillus kawachii                   fungi      0.99   0.95   0      0
Clostridium josui                      bacteria   0.96   0.60   0      0
Pseudomonas sp. PE2                    bacteria   0.92   0.93   0      0
Thermotoga maritima MSB8               bacteria   0.99   0      0.99   0
Oceanobacillus iheyensis HTE831        bacteria   0.91   0      0.99   0
Erwinia chrysanthemi                   bacteria   0.94   0      0.86   0
Streptomyces sp. KSM                   bacteria   0.89   0      0.99   0
Bacillus subtilis                      bacteria   0.99   0      1.00   0
Fusarium oxysporum                     fungi      1.00   0.99   0      0
Bacillus cereus E33L                   bacteria   0.96   0      1.00   0
Aspergillus niger                      fungi      1.00   0      0.93   0
Candida albicans                       fungi      0.97   0      1.00   0
Figure 2. Superposition of HP-Lattices for β-glucosidases of cellulolytic bacteria (thick line) and fungi (thin line).
3.3. mt-QSAR final scores for selection of cellulases
As mentioned before, one of the main aspects affecting cellulose hydrolysis is the quality and composition of the cellulase complex, which depend on the producing microorganism [8]. On the other hand, more and more cellulase sequences and cellulolytic microorganisms have been described to date [60]. Our method assigns a score to each cellulase sequence according to its relevance. As a result, we can also estimate a total score that characterizes the whole cellulolytic complex. This overall score may guide the design and molecular engineering of new cellulolytic complexes or organisms [61]. With this aim, we inspected the protein sequences reported for diverse cellulases in the GenBank database [62]. We screened 363 cellulase sequences in total, including endoglucanases, exoglucanases and β-glucosidases, from more than 50 different microorganisms. We also included microorganisms that are uncommon in the biotechnology industry in order to explore other sources of cellulase enzymes. Figure 3 shows a Two-way Joining Cluster Analysis [51] of the similarity/dissimilarity relationships between several microorganisms with respect to the predicted probabilities for the three cellulase components.
Figure 3. Two-way joining analysis of scores of the three enzyme classes for 38 microorganisms.
We scored each cellulase enzymatic component with the probability predicted by the corresponding model according to its biological activity. These numerical indices express sequence quality: the closer the score is to 1, the more relevant the cellulase. To obtain the overall score of a microorganism or enzymatic complex, we multiply the three probabilities predicted for the three classes of enzymes. In this sense, the score is the joint probability that all three components have a high predicted probability (see the multiplication rule of probabilities) [63]. Thus, we also characterize the cellulolytic potential of each microorganism (see Table 6). The blank spaces in the table indicate that either the sequence has not been reported or the producing microorganism lacks that component. When more than one sequence was reported for the same protein (isoforms), we used the longest sequence for the score calculation, since the methodology depends on the input information. Consequently, the method could also be useful to score isoforms of the same gene or related genes as a preliminary selection. Different alternatives to the present score could be considered, but the study of more elaborate scoring schemes is beyond the scope of this work. However, we would like to mention a simple alternative scoring rule that could be used when the probabilities for each component cannot easily be calculated because the appropriate software is not available. In this case, MS Excel can simply be used to calculate the average of the scores given by the QSAR equations reported above:

\[ \text{Score-B} = \frac{1}{3}\left[(\text{Endogluc-score}) + (\text{Exogluc-score}) + (\beta\text{-gluc-score})\right] \tag{8} \]

\[ \text{Score-B} = \frac{1}{3}\left[\left(22.17 \times {}^{endo}\Theta_0 - 1.37 \times {}^{endo}\Theta_{10} - 36.67\right) + \left(16.17 \times {}^{exo}\Theta_0 - 28.91\right) + \left(24.89 \times {}^{gluc}\Theta_0 - 43.56\right)\right] \tag{9} \]

\[ \text{Score-B} = 7.39 \times {}^{endo}\Theta_0 - 0.46 \times {}^{endo}\Theta_{10} + 5.39 \times {}^{exo}\Theta_0 + 8.30 \times {}^{gluc}\Theta_0 - 36.38 \tag{10} \]

Given all these considerations, we perform for the first time an mt-QSAR-guided Computational Chemistry selection of components for new cellulolytic complexes. We now bring up some examples that show the rationale of our approach. The cellulase complex from Trichoderma reesei (Hypocrea jecorina) is well known and is one of the most widely used enzymatic complexes for degrading cellulose [64]. However, it has limited β-glucosidase activity, owing to poor secretion of the enzyme to the periplasmic space and to the nature of the enzyme [65]. This fact is also corroborated by our methodology, which gives a low score to the Hypocrea jecorina
β-glucosidase in comparison with its very high endoglucanase and exoglucanase scores (see Table 6), which support its frequent application [66]. In addition, some works have reported the need to add β-glucosidase from other sources to the medium when cellulases from Trichoderma reesei are used to hydrolyze lignocellulosic raw materials efficiently [67]. The total score calculated for this fungus supports this fact (see Table 6). Consequently, we can propose adding more efficient (high-score) β-glucosidases from diverse sources (Humicola grisea var. thermoidea, Thermobifida fusca, Talaromyces emersonii and the genus Aspergillus). All these sequences were highly scored for this enzyme (see Table 6). Indeed, the expression of β-glucosidase from Aspergillus niger in Trichoderma reesei has been proposed previously [68]. Co-cultures of fungi belonging to both genera have also been reported to enhance β-glucosidase activity [69], taking advantage of the production of this enzyme by the genus Aspergillus [70], which is supported by the high β-glucosidase scores in this genus (see Table 6).
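The two scoring rules described in this section can be sketched in a few lines. The function names and the entropy values passed to score_b are hypothetical; the probabilities are the tabulated per-component scores of two microorganisms, and the coefficients are those of Equations 5-7.

```python
def complex_score(endo_p, exo_p, gluc_p):
    """Overall score: product of the three predicted probabilities."""
    return endo_p * exo_p * gluc_p


def score_b(endo_theta0, endo_theta10, exo_theta0, gluc_theta0):
    """Alternative rule: average of the three discriminant scores (Eq. 8)."""
    endo = 22.17 * endo_theta0 - 1.37 * endo_theta10 - 36.67   # Eq. 5
    exo = 16.17 * exo_theta0 - 28.91                           # Eq. 6
    gluc = 24.89 * gluc_theta0 - 43.56                         # Eq. 7
    return (endo + exo + gluc) / 3


# Tabulated probabilities for Thermobifida fusca and Hypocrea jecorina.
print(round(complex_score(0.94, 0.98, 1.00), 2))  # 0.92
print(round(complex_score(1.00, 0.98, 0.58), 2))  # 0.57

# Hypothetical entropy values, used only to exercise the function.
print(round(score_b(2.0, 1.0, 2.0, 2.0), 3))
```

The product penalizes a complex as soon as one component scores poorly, which is why the low β-glucosidase value dominates the overall score of Hypocrea jecorina.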
3.4. Approach to select cellulolytic microorganisms for fermentation
The presented scores could be useful for selecting microorganisms when several cellulolytic microorganisms must be co-cultured. For example, we can select one microorganism with a high score for two cellulolytic components and another with a remarkable score for the third component. However, other factors must also be considered when co-culturing cellulolytic microorganisms, such as cellulase secretion and restrictions on culture conditions, in order to avoid operational problems [71]. Bacterial cultures are commonly easier to handle, because bacteria grow faster and are generally less demanding than fungi. These reasons make cellulolytic bacteria attractive for enzymatic hydrolysis, particularly species with a high total score such as Thermobifida fusca and Clostridium stercorarium. Other species score highly for two cellulolytic components, such as Cellulomonas fimi, Clostridium cellulovorans, Bacillus sp., Streptomyces coelicolor, Thermotoga maritima and Pseudomonas sp. (see Table 6). The possibility of co-culturing bacterial and fungal species [72] enhances the efficiency of cellulose hydrolysis by complementing the cellulase complex of fungi with cellulolytic components of bacteria. Given the advantages of bacterial cultures, many authors have attempted to co-culture bacteria to enhance cellulose degradation and operational conditions [73]. In this regard, Clostridium stercorarium and Clostridium thermocellum are both thermophilic cellulolytic bacteria, and according to their scores their cellulolytic complexes could complement or boost each other synergistically (see Table 6). In addition, Clostridium
thermocellum is an ethanologenic bacterium capable of directly converting cellulosic substrates into ethanol, which is of great practical importance [74]. In this way, our results help to increase the number of microorganisms that can be used in co-cultures for cellulose hydrolysis, particularly bacteria (see Table 6 and Tables I, III and V in file S01 of SM).
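One way to picture the selection of complementary partners is to rank pairs of microorganisms by the product of the best score available for each component. This pairing rule is an assumption made purely for illustration; it is not part of the methodology described above and ignores secretion and culture conditions. The scores are the tabulated per-component values of four of the microorganisms.

```python
from itertools import combinations

# (endoglucanase, exoglucanase, beta-glucosidase) scores as tabulated.
scores = {
    "Thermobifida fusca": (0.94, 0.98, 1.00),
    "Clostridium cellulovorans": (0.95, 0.76, 0.95),
    "Hypocrea jecorina": (1.00, 0.98, 0.58),
    "Aspergillus niger": (1.00, 0.0, 0.93),
}


def combined_score(first, second):
    """Product of the best per-component score in the pair (assumed rule)."""
    best = [max(x, y) for x, y in zip(scores[first], scores[second])]
    return best[0] * best[1] * best[2]


def rank_pairs(names):
    """All pairs of microorganisms, best combined score first."""
    pairs = combinations(sorted(names), 2)
    return sorted(pairs, key=lambda pair: combined_score(*pair), reverse=True)


for pair in rank_pairs(scores):
    print(pair, round(combined_score(*pair), 3))
```

Under this rule, pairing Hypocrea jecorina with Aspergillus niger raises the combined value well above the score of Hypocrea jecorina alone, in line with the complementation by β-glucosidase discussed in Section 3.3.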
3.5. Computational chemistry based gene selection and engineering
Saccharomyces cerevisiae has been a target for the introduction of genes encoding cellulase components [75]. The same is true for Zymomonas mobilis [76], Klebsiella oxytoca [77] and the recombinant E. coli KO11 [78], which carries the ethanol pathway genes from Zymomonas mobilis. Other works have focused on improving the cellulolytic potential of Trichoderma reesei by cloning β-glucosidase [79] from other sources. Our scores have successfully predicted the efficiency of the cellulase genes cloned into the aforementioned recombinants. They also predicted the efficiency of endoglucanases and exoglucanases from Phanerochaete chrysosporium, Trichoderma reesei, Clostridium thermocellum and Erwinia chrysanthemi, and of β-glucosidases from Aspergillus niger and Talaromyces emersonii. Following these facts, we can explore other sources to find novel efficient cellulase genes for recombinant technology in ethanol production, namely the good candidates for β-glucosidase mentioned above and others listed in Table II and Table V of SM (see score values close to 1). Endoglucanases and exoglucanases from the genera Trichoderma and Aspergillus had high scores, supporting their common use in ethanol production. Significant scores for these enzymes were found in Thermobifida fusca, Humicola grisea var. thermoidea, Piromyces equi and Pseudomonas sp. PE2 (see Table II of SM). Many of these enzymes are thermotolerant, which is convenient for the process, and their sequences could be modified in order to increase expression and secretion in the host.
4. Conclusions
The MARCH-INSIDE methodology may be applied in the biotechnology and polymer science fields to develop a multi-factor model for finding new cellulase complexes without relying on alignment techniques. The work demonstrates the utility of Markov Chain calculation of the Entropy of protein sequences folded on 2D-HP Polymer Lattice networks. The classification power of the MM for the three cellulase components is comparable to that of sequence-alignment based methods such as HMM-Pfam. A numerical score is assigned to each cellulolytic component and to the whole complex, allowing their selection based on protein sequence and without the
need for protein 3D structure. The methodology may be very useful for searching and screening cellulase genes from different sources in order to improve the efficiency of cellulolytic complexes.
Acknowledgments
González-Díaz, H. acknowledges a tenure-track research position at the USC funded by Program Isidro Parga Pondal, which is supported by Xunta de Galicia. The authors acknowledge the Portuguese Fundação para a Ciência e a Tecnologia (FCT) (SFRH/BD/47256/2008) for financial support.
Appendix A. Supplementary data
Supplementary data associated with this article can be obtained upon request by e-mailing the corresponding author, González-Díaz H (gonzalezdiazh@yahoo.es).
References
1. Du J, Fang Y, Zheng Y. Synthesis, characterization and biodegradation of biodegradable-cum-photoactive liquid-crystalline copolyesters derived from ferulic acid. Polymer. 2007;48:5541-7.
2. Wu C-S, Liao H-T. A new biodegradable blends prepared from polylactide and hyaluronic acid. Polymer. 2005;46:10017-26.
3. Wang Y, Steinhoff B, Brinkmann C, Alig I. In-line monitoring of the thermal degradation of poly(L-lactic acid) during melt extrusion by UV-vis spectroscopy. Polymer. 2008;49:1257-65.
4. Lin Y, Tanaka S. Ethanol fermentation from biomass resources: current state and prospects. Appl Microbiol Biotechnol. 2006 Feb;69(6):627-42.
5. Kim KH, Brown KM, Harris PV, Langston JA, Cherry JR. A proteomics strategy to discover beta-glucosidases from Aspergillus fumigatus with two-dimensional page in-gel activity assay and tandem mass spectrometry. Journal of proteome research. 2007;6(12):4749-57.
6. Sun Y, Cheng J. Hydrolysis of lignocellulosic materials for ethanol production: a review. Bioresour Technol. 2002 May;83(1):1-11.
7. Rivers DB, Emert GH. Factors affecting the enzymatic hydrolysis of municipal solid waste components. Biotech and Bioeng. 1988;1988:278-81.
8. Klyosov AA. Trends in biochemistry and enzymology of cellulose degradation. Biochemistry. 1990 Nov 27;29(47):10577-85.
9. Ghosh P, Ghose TK. Bioethanol in India: recent past and emerging future. Adv Biochem Eng Biotechnol. 2003;85:1-27.
10. Watson DL, Wilson DB, Walker LP. Synergism in Binary Mixtures of Thermobifida fusca Cellulases Cel6B, Cel9A and Cel5A on BMCC and Avicel. Appl Biochem Biotechnol. 2002(101):97-111.
11. Dobson PD, Cai YD, Stapley BJ, Doig AJ. Prediction of protein function in the absence of significant sequence similarity. Curr Med Chem. 2004 Aug;11(16):2135-42.
12. Fernández M, Caballero J, Fernández L, Abreu J, Acosta G. Classification of conformational stability of protein mutants from 2D graph representation of protein sequences using support vector machines. Molecular Simulation. 2007;33(11):889-96.
13. Fernández M, Caballero J, Fernández L, Abreu JI, Acosta G. Classification of conformational stability of protein mutants from 3D pseudo-folding graph representation of protein sequences using support vector machines. Proteins. 2008;70(1):167-75.
14. Agüero-Chapín G, González-Díaz H, de la Riva G, Rodríguez E, Sánchez-Rodríguez A, Podda G, et al. Recognition of Ribonucleases without Alignment: Comparison with an HMM Model and Isolation from Schizosaccharomyces pombe, Prediction, and Experimental Assay of a New Sequence. J Chem Inf Model. 2008;48(2):434-48.
15. Fernández L, Caballero J, Abreu JI, Fernández M. Amino Acid Sequence Autocorrelation Vectors and Bayesian-Regularized Genetic Neural Networks for Modeling Protein Conformational Stability: Gene V Protein Mutants. Proteins. 2007;67:834-52.
16. Caballero J, Fernandez L, Abreu JI, Fernandez M. Amino Acid Sequence Autocorrelation vectors and ensembles of Bayesian-Regularized Genetic Neural Networks for prediction of conformational stability of human lysozyme mutants. Journal of chemical information and modeling. 2006 May-Jun;46(3):1255-68.
17. Chou KC, Shen HB. Euk-mPLoc: a fusion classifier for large-scale eukaryotic protein subcellular location prediction by incorporating multiple sites. Journal of proteome research. 2007;6:1728-34.
18. Fernandez M, Tundidor-Camba A, Caballero J. Modeling of cyclin-dependent kinase inhibition by 1H-pyrazolo[3,4-d]pyrimidine derivatives using artificial neural network ensembles. Journal of chemical information and modeling. 2005 Nov-Dec;45(6):1884-95.
19. Caballero J, Fernandez M. Linear and nonlinear modeling of antifungal activity of some heterocyclic ring derivatives using multiple linear regression and Bayesian-regularized neural networks. J Mol Model. 2005 Oct 21:1-14.
20. González-Díaz H, Pérez-Castillo Y, Podda G, Uriarte E. J Comp Chem. 2007;28:1990-5.
21. Zhang X, Luo J, Yang L. New Invariant of DNA Sequence Based on 3DD-Curves and Its Application on Phylogeny. J Comput Chem. 2007;28:2342-6.
22. González-Díaz H, Vina D, Santana L, de Clercq E, Uriarte E. Stochastic entropy QSAR for the in silico discovery of anticancer compounds: Prediction, synthesis, and in vitro assay of new purine carbanucleosides. Bioorg Med Chem. 2005 Oct 24.
23. González-Díaz H, de Armas RR, Molina R. Markovian negentropies in bioinformatics. 1. A picture of footprints after the interaction of the HIV-1 PsiRNA packaging region with drugs. Bioinformatics. 2003 Nov 1;19(16):2079-87. 24. Agüero-Chapin G, González-Díaz H, Molina R, Varona-Santos J, Uriarte E, Gonzalez-Diaz Y. Novel 2D maps and coupling numbers for protein sequences. The first QSAR study of polygalacturonases; isolation and prediction of a novel sequence from Psidium guajava L. FEBS lett. 2006;580 723-30. 25. Boccaletti S, Latora V, Moreno Y, Chavez M, Hwang DU. Complex networks: Structure and dynamics. Phys Rep. 2006;424:175-308. 26. Chikenji G, Fujitsuka Y, Takada S. Shaping up the protein folding funnel by local interaction: lesson from a structure prediction study. Proc Natl Acad Sci U S A. 2006 Feb 28;103(9):3141-6. 27. Chen M, Huang WQ. A branch and bound algorithm for the protein folding problem in the HP lattice model. Genomics Proteomics Bioinformatics. 2005 Nov;3(4):225-30. 28. Graham DJ. Information Content in Organic Molecules: Brownian Processing at Low Levels. Journal of chemical information and modeling. 2007;47(2):376-89. 29. Graham DJ. Information Content and Organic Molecules: Aggregation States and Solvent Effects. Journal of chemical information and modeling. 2005;45(1223). 30. Scholz TH, Sondey JM, Randall WC, Schwam H, Thompson WJ, Mallorga PJ, et al. Sulfonylmethanesulfonamide inhibitors of carbonic anhydrase. J Med Chem. 1993 Jul 23;36(15):2134-41. 31. González-Díaz H, González-Díaz Y, Santana L, Ubeira FM, Uriarte E. Proteomics, networks and connectivity indices. Proteomics. 2008;8:750-78. 32. González-Díaz H, Vilar S, Santana L, Uriarte E. Medicinal Chemistry and Bioinformatics – Current Trends in Drugs Discovery with Networks Topological Indices. Curr Top Med Chem. 2007;7(10):1025-39. 33. Cruz-Monteagudo M, González-Díaz H, Agüero-Chapin G, Santana L, Borges F, Domínguez RE, et al. 
Computational Chemistry Development of a Unified Free Energy Markov Model for the Distribution of 1300 Chemicals to 38 Different Environmental or Biological Systems. J Comput Chem. 2007; 28:1909-22. 34. González-Díaz H, Saiz-Urra L, Molina R, Santana L, Uriarte E. J Proteome Res. 2007;6(2):904-8. 35. González-Díaz H, de Armas RR, Molina R. Vibrational Markovian modelling of footprints after the interaction of antibiotics with the packaging region of HIV type 1. Bull Math Biol. 2003 Nov;65(6):991-1002. 36. Ramos de Armas R, González-Díaz H, Molina R, Uriarte E. Markovian Backbone Negentropies: Molecular descriptors for protein research. I. Predicting protein stability in Arc repressor mutants. Proteins. 2004 Sep 1;56(4):715-23. 37. Randic´ M. Graphical representation of DNA as a 2-D map. Chem Phys Lett. 2004(386):468–71. 38. González-Díaz H, Uriarte E, Ramos de Armas R. Predicting stability of Arc repressor mutants with protein stochastic moments. Bioorg Med Chem. 2005 Jan 17;13(2):323-31.
39. Randic M, Vracko M. On the similarity of DNA primary sequences. J Chem Inf Comput Sci. 2000 May-Jun;40(3):599-606. 40. González-Díaz H, Aguero-Chapin G, Varona-Santos J, Molina R, de la Riva G, Uriarte E. 2D RNA-QSAR: assigning ACC oxidase family membership with stochastic molecular descriptors; isolation and prediction of a sequence from Psidium guajava L. Bioorg Med Chem Lett. 2005 Jun 2;15(11):2932-7. 41. González-Díaz H, Molina-Ruiz R, Hernandez I. 3.0 ed 2007:MARCH-INSIDE version 3.0 (MARkov CHains INvariants for SImulation & DEsign). Windows supported version under request to the main author contact email: gonzalezdiazh@yahoo.es. 42. Kowalski WJ, Marcoin W. Estimation of bioavailability of selected magnesium organic salts by means of molecular modelling. Boll Chim Farm. 2001 SepOct;140(5):322-8. 43. Marrero-Ponce Y, Iyarreta-Veitia M, Montero-Torres A, Romero-Zaldivar C, Brandt CA, Avila PE, et al. Ligand-based virtual screening and in silico design of new antimalarial compounds using nonstochastic and stochastic total and atomtype quadratic maps. Journal of chemical information and modeling. 2005 JulAug;45(4):1082-100. 44. STATISTICA for Windows release 6.0. Statsoft Inc. 2001. 45. Vilar S, Estrada E, Uriarte E, Santana L, Gutierrez Y. In silico studies toward the discovery of new anti-HIV nucleoside compounds through the use of TOPSMODE and 2D/3D connectivity indices. 2. Purine derivatives. Journal of chemical information and modeling. 2005 Mar-Apr;45(2):502-14. 46. Kann MG, Goldstein RA. Performance evaluation of a new algorithm for the detection of remote homologs with sequence comparison. Proteins. 2002 Aug 1;48(2):367-76. 47. Finn RD, Mistry J, Schuster-Böckler B, Griffiths-Jones S, Hollich V, Lassmann T, et al. Pfam: clans, web tools and services. Nucleic Acids Res. 2006 (Database Issue 34):D247-D51 48. Di Francesco VM, P. J.; Garnier, J. FORESST: fold recognition from secondary structure predictions of proteins. Bioinformatics. 1999;15(2):131-40. 
49. Chou KC. Prediction of protein signal sequences. Curr Protein Pept Sci. 2002 Dec;3(6):615-22. 50. Molina E, Diaz HG, Gonzalez MP, Rodriguez E, Uriarte E. Designing antibacterial compounds through a topological substructural approach. J Chem Inf Comput Sci. 2004 Mar-Apr;44(2):515-21. 51. Hill T, Lewicki P. STATISTICS Methods and Applications. A Comprehensive Reference for Science, Industry and Data Mining. Tulsa: StatSoft 2006 52. Vilar S, Santana L, Uriarte E. Probabilistic neural network model for the in silico evaluation of anti-HIV activity and mechanism of action. J Med Chem. 2006;49(3):1118-24. 53. Dewdney MM, Biggs AR, Turechek WW. A Statistical Comparison of the Blossom Blight Forecasts of MARYBLYT and Cougarblight with Receiver Operating Characteristic Curve Analysis. Phytopathology. 2007 Sep;97(9):1164-76.
54. Soreide K. Receiver-operating characteristic curve analysis in diagnostic, prognostic and predictive biomarker research. J Clin Pathol. 2009 Jan;62(1):1-5. 55. Centor RM, Schwartz JS. An evaluation of methods for estimating the area under the receiver operating characteristic (ROC) curve. Med Decis Making. 1985 Summer;5(2):149-56. 56. Aguero-Chapin G, Varona-Santos J, de la Riva GA, Antunes A, Gonzalez-Villa T, Uriarte E, et al. Alignment-Free Prediction of Polygalacturonases with Pseudofolding Topological Indices: Experimental Isolation from Coffea arabica and Prediction of a New Sequence. Journal of proteome research. 2009 Apr 3;8(4):2122-8. 57. Finn RD, Mistry J, Schuster-Böckler B, Griffiths-Jones S, Hollich V, Lassmann T, et al. Nucleic Acids Research. 2006(Database Issue 34):D247-D51. 58. Jiang M, Zhu B. Protein folding on the hexagonal lattice in the HP model. J Bioinform Comput Biol. 2005 Feb;3(1):19-34. 59. Davison A, Blaxter M. Ancient Origin of Glycosyl Hydrolase Family 9 Cellulase Genes Molecular Biology and Evolution. 2005;22(5):1273-84 60. Tokuda G, Lo N, Watanabe H, Arakawa G, Matsumoto T, Noda H. Major alteration of the expression site of endogenous cellulases in members of an apical termite lineage. Mol Ecol. 2004 Oct;13(10):3219-28. 61. Lynd LR, van Zyl WH, McBride JE, Laser M. Consolidated bioprocessing of cellulosic biomass: an update. Curr Opin Biotechnol. 2005 Oct;16(5):577-83. 62. Benson DA, Boguski MS, Lipman DJ, Ostell J, Ouellette BF, Rapp BA, et al. GenBank. Nucleic Acids Res. 1999 Jan 1;27(1):12-7. 63. Gnedenko B. The theory of probability. Moscow: Mir Publishers 1978. 64. Sharma SK, Kalra KL, Grewal HS. Fermentation of enzymatically saccharified sunflower stalks for ethanol production and its scale up. Bioresour Technol. 2002 Oct;85(1):31-3. 65. Xiao Z, Storms R, Tsang A. Microplate-based filter paper assay to measure total cellulase activity. Biotechnol Bioeng. 2004 Dec 30;88(7):832-7. 66. 
Hari Krishna S, Janardhan Reddy T, Chowdary GV. Simultaneous saccharification and fermentation of lignocellulosic wastes to ethanol using a thermotolerant yeast. Bioresour Technol. 2001 Apr;77(2):193-6. 67. Rosgaard L, Pedersen S, Cherry JR, Harris P, Meyer AS. Efficiency of New Fungal Cellulase Systems in Boosting Enzymatic Degradation of Barley Straw Lignocellulose. Biotechnol Prog. 2006 Apr 7;22(2):493-8. 68. Ai Y, Teng R, Gao P, Meng F, Wang Z. [The intergeneric compatibility of heredity and expression for cellulase genomes between Aspergillus niger and Trichoderma reesei]. Wei Sheng Wu Xue Bao. 1998 Jun;38(3):186-92. 69. Wen Z, Liao W, Chen S. Production of cellulase/beta-glucosidase by the mixed fungi culture of Trichoderma reesei and Aspergillus phoenicis on dairy manure. Appl Biochem Biotechnol. 2005 Spring;121-124:93-104. 70. Reczey K, Brumbauer A, Bollok M, Szengyel ZS, Zacchi G. Use of hemicellulose hydrolysate for beta-glucosidase fermentation. Appl Biochem Biotechnol. 1998 Spring;70-72:225-35.
71. Chadha BS, Garcha HS. Mixed cultivation of Trichoderma reesei and Aspergillus ochraceus for improved cellulase production. Acta Microbiol Hung. 1992;39(1):61-7. 72. Teunissen MJ, Kets EP, Op den Camp HJ, Huis in't Veld JH, Vogels GD. Effect of coculture of anaerobic fungi isolated from ruminants and non-ruminants with methanogenic bacteria on cellulolytic and xylanolytic enzyme activities. Arch Microbiol. 1992;157(2):176-82. 73. Demain AL, Newcomb M, Wu JH. Cellulase, clostridia, and ethanol. Microbiol Mol Biol Rev. 2005 Mar;69(1):124-54. 74. Ng TK, Ben-Bassat A, Zeikus JG. Ethanol Production by Thermophilic Bacteria: Fermentation of Cellulosic Substrates by Cocultures of Clostridium thermocellum and Clostridium thermohydrosulfuricum. Appl Environ Microbiol. 1981 Jun;41(6):1337-43. 75. Ito J, Fujita Y, Ueda M, Fukuda H, Kondo A. Improvement of cellulosedegrading ability of a yeast strain displaying Trichoderma reesei endoglucanase II by recombination of cellulose-binding domains. Biotechnol Prog. 2004 MayJun;20(3):688-91. 76. Lawford HG, Rousseau JD. Cellulosic fuel ethanol: alternative fermentation process designs with wild-type and recombinant Zymomonas mobilis. Appl Biochem Biotechnol. 2003 Spring;105 -108:457-69. 77. Zhou S, Davis FC, Ingram LO. Gene integration and expression and extracellular secretion of Erwinia chrysanthemi endoglucanase CelY (celY) and CelZ (celZ) in ethanologenic Klebsiella oxytoca P2. Appl Environ Microbiol. 2001 Jan;67(1):6-14. 78. Kim TH, Lee YY. Pretreatment of corn stover by soaking in aqueous ammonia. Appl Biochem Biotechnol. 2005 Spring;121-124:1119-31. 79. Murray P, Aro N, Collins C, Grassick A, Penttila M, Saloheimo M, et al. Expression in Trichoderma reesei and characterisation of a thermostable family 3 beta-glucosidase from the moderately thermophilic fungus Talaromyces emersonii. Protein Expr Purif. 2004 Dec;38(2):248-57.
Transworld Research Network 37/661 (2), Fort P.O. Trivandrum-695 023 Kerala, India
Topological Indices for Medicinal Chemistry, Biology, Parasitology, Neurological and Social Networks, 2010: 95-122 ISBN: 978-81-7895-489-9 Editors: Humberto González-Díaz and Cristian Robert Munteanu
6. Scoring function for DNA-drug docking based on topological indices of supramolecular networks
Lázaro Guillermo Pérez-Montoto1, Lourdes Santana1 and Humberto González-Díaz2
1Department of Organic Chemistry, Faculty of Pharmacy, USC, Santiago de Compostela, 15782, Spain; 2Department of Microbiology and Parasitology, Faculty of Pharmacy, USC, Santiago de Compostela, 15782, Spain
Abstract. Ehrlich cells constitute one of the most commonly used systems to test the antiproliferative effect of drug-like leads for PUVA therapy. In Medicinal Chemistry we can use Quantitative Structure-Activity/Binding Relationships (QSAR/QSBR) models, based on Molecular Connectivity Indices (mTIs), to predict these activities. In addition, we can use Docking techniques to predict the optimal geometry of drug-target interactions such as drug-protein or drug-DNA interactions. In principle, we can use QSAR/QSBR studies as a way to derive scoring functions useful to guide the search for optimal conformations in Docking studies. Nevertheless, many QSAR models do not offer information about the 3D geometry of the drug-DNA intercalation complex and its structural similarity with other known complexes. In this work, we introduce for the first time Supra-Molecular Network Connectivity Indices (smnTIs), based on a Supra-Molecular Complex Network, and use them to seek QSAR/QSBR models that predict antiproliferative activity in Ehrlich cells. In this complex network the nodes are conformations of the intercalation complexes and the edges connect conformations with similar Monte Carlo molecular dynamics energetic profiles. A total of 22 mTIs and 15 smnTIs were calculated using the molecular graph and local centrality measures in a complex network, respectively. At the molecular network level, the best model obtained was the entropy-type mTIs model, which correctly classified 94.74% of the molecules. The model obtained using smnTIs had the smallest classification power (82.46%), but the supra-molecular information provided by Ccen enhanced the quality and significance of the correct classification of the compounds in the mixed model that uses all TIs. Although the calculation of the smnTIs requires a lot of time and resources, the supra-molecular information that they encode allows us to evaluate or predict the contribution of a complex conformation to the compound. This approach may constitute an interesting route for the development of scoring functions for DNA-drug Docking studies.
Correspondence/Reprint request: Dr. González-Díaz H, Department of Microbiology and Parasitology, Faculty of Pharmacy, USC, Santiago de Compostela, 15782, Spain. E-mail: gonzalezdiazh@yahoo.es
1. Introduction
The furocoumarins are a class of natural or synthetic compounds with very interesting pharmacological properties [1]. They are commonly used in the treatment of skin diseases characterized by hyperproliferation, such as psoriasis and mycosis fungoides [2]. This treatment, called PUVA, is a therapy that combines the use of these chemicals with long-wave ultraviolet light (UV-A) [3]. The molecular basis of PUVA is connected with highly specific photodamage to the DNA of epidermal cells. This damage interferes with DNA replication, producing an inhibition of DNA synthesis which reduces or blocks cell duplication [4]. The mechanism of action of the furocoumarins has been investigated in depth: it rests on their ability to bind covalently to the pyrimidine bases of nucleic acids in a two-step reaction. In the first step, they interact in the dark with the DNA double helix, forming an intercalation complex. Then, after UV irradiation, they undergo cycloadditions with adjacent pyrimidine bases, preferably at the 5,6 double bond of the thymine unit. Furocoumarins have two reactive sites, i.e. the 4’,5’ (furan) and the 3,4 (pyrone) double bonds, so different types of cycloadducts can be formed: monoadducts (furan-side or pyrone-side) and diadducts (cross-links) [5]. Although the linear furocoumarins (psoralens) are able to form all three adduct types, the geometry of the angular ones (angelicins) only allows them to form monoadducts with DNA. On the other hand, it is well known that the side effects observed in PUVA therapy, such as skin phototoxicity and risk of skin cancer, are strictly connected with the bifunctional lesions in DNA [6]. The biological activity of these compounds is normally studied by evaluating their capacity to form an intercalated complex with DNA and their ability to photobind to the same macromolecule through mono- or bifunctional addition [7].
A traditional procedure to determine the photobiological and antiproliferative activity of furocoumarins is based on the good correlation between the rate constant of photobinding to DNA and the capacity to inhibit DNA synthesis in Ehrlich Ascites tumor cells (EATC), reducing or blocking cell duplication [8]. The biological activity expressed as ID50 (the UVA dose that reduces DNA synthesis in Ehrlich cells to 50%) has been of great utility in many Quantitative Structure-Activity Relationship (QSAR) studies aimed at elucidating or evaluating different structural and physicochemical requirements for antitumor activity in a great variety of compounds [9]. Classic QSAR studies connect information on the chemical structure of the molecule, expressed by means of numbers, with the biological activity [10]. These numerical indices are called molecular descriptors when applied to molecules, and are calculated by applying graph or network theory. Several molecular descriptors have been introduced for small-molecule drug discovery. A compilation by Todeschini and Consonni systematizes more than 1600 molecular descriptors. Some of these are redundant in some way or have topics in common [11]. However, the graph concept in chemistry, called the “chemical graph”, is wider, because it only requires the presence of vertices (nodes) and connections (edges). The vertices are chemical species such as atoms, molecules, molecular fragments, intermediates, etc. The connections between vertices can represent bonds, steps of a reaction, Van der Waals forces, etc. [12]. In recent years, different connectivity measures or topological indices (TIs) based on the structure of complex networks, which can be used as descriptors in QSAR studies, have been defined. The field of application of TIs is, of course, not restricted to the chemistry of low-molecular-weight compounds and extends to other branches of science. For instance, we can cite the work of Chou et al. on the extension of TIs to the study of protein sequence-function relationships (protein QSAR) based on pseudo-amino acid composition [13]. The number of publications in the area has steadily increased and, consequently, in-depth reviews have appeared in recent years that may be useful to the readers of the present manuscript [14-22]. Many researchers define TIs for graphs or networks using vector-matrix-vector procedures, a fact that reveals significant similarities between them [23]. In this definition, the elements of the vectors are atomic characteristics such as valence, electronegativity, mass, etc. in molecular networks, and different node properties (e.g. node valence) in supra-molecular networks. In the case of molecular networks in particular, there are very important precedents in the results obtained by Bonchev's [24] and Schreiber's [25] groups, respectively. Generally, all these TIs offer information about the connections between chemical species [26].
Many authors prefer the term Quantitative Structure-Binding affinity Relationship (QSBR) when QSAR-like procedures are used to predict drug-target binding affinity and 3D structural information [27]. The term QSBR has to be used carefully, however, to avoid confusion with Quantitative Structure-Biodegradability Relationship analysis [28, 29]. In this work we use QSBR in the first sense. In any case, the QSAR and QSBR approaches diverge to some degree in the type of measure (activity or binding) and sometimes in how much detail of the chemical structure is needed (2D or 3D), but both use essentially the same algorithm. In addition to predicting drug activity, we can use 3D drug-target QSAR/QSBR models as scoring functions to guide the search for optimal drug-target interaction geometries in drug-target Docking studies [30-32]. Almost all QSAR/QSBR or other types of Docking scoring functions are aimed at predicting protein-drug interactions. For instance, Wang et al. [33] reported a comparative study of eleven Docking scoring functions, whereas Ferrara et al. [34] studied nine, all of them for protein-drug interactions. Conversely, DNA-drug and RNA-drug Docking are generally less investigated; in particular, we did not find a QSAR/QSBR scoring function for DNA-furocoumarin Docking. In this sense, it would be very interesting to work with TIs that encode supra-molecular information about the intercalation complexes of furocoumarins, taking into account that the prior non-covalent binding (in the dark) between drug and DNA has a strong influence on the subsequent photoreaction and therefore on their biological activity [35]. In the present paper, we obtain and compare different quantitative models based on molecular and supra-molecular network TIs that are able to differentiate furocoumarin derivatives according to their antiproliferative activity. Some of these QSAR models are also QSBR-like models that have potential applications as DNA-furocoumarin Docking scoring functions.
2. Materials and methods
In this study we selected different furocoumarins and some of their aza-analogues whose antiproliferative activities in Ehrlich Ascites tumor cells have been determined. These activities are expressed in the literature as ID50, the UVA dose that reduces DNA synthesis in Ehrlich cells to 50% in the presence of the tested compound at a certain concentration (18-20 µM). The protocols used in the activity determination are heterogeneous; however, 8-MOP is very commonly used as a reference to express the activity. In general, compounds such as 4’-MAP and 4-MBAP, with activities (ID50 relative to 8-MOP) of 0.13 and 0.14, are considered poorly active [36, 37]. Keeping all the above aspects in mind, we classified the 57 compounds compiled for our dataset into two observed activity groups: 0 for the inactive compounds (activity ≤ 0.1) and 1 for the active ones (activity > 0.1). QSAR studies were carried out to obtain models that allow us to classify the furocoumarin derivatives into one of these two activity groups. With this aim, we used a total of 37 TIs that encode information at two levels of complexity: the molecular level and the supra-molecular level.
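The two-group labelling described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (0 for activity ≤ 0.1, 1 for activity > 0.1); the function name and the example values are hypothetical and are not taken from the dataset.

```python
def activity_class(relative_activity: float, threshold: float = 0.1) -> int:
    """Return 1 (active) if the activity exceeds the threshold, else 0 (inactive)."""
    return 1 if relative_activity > threshold else 0


# Hypothetical relative activities, for illustration only.
example_activities = {"compound_A": 0.05, "compound_B": 0.10, "compound_C": 0.13}
classes = {name: activity_class(value) for name, value in example_activities.items()}
print(classes)  # {'compound_A': 0, 'compound_B': 0, 'compound_C': 1}
```

Note that a compound lying exactly on the threshold (0.1) falls in the inactive group, as stated in the text.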
2.1. Molecular connectivity indices (mTIs)
For the study at the molecular network level, we used 22 mTIs for each of the 57 compounds. Twelve of them are classic mTIs commonly employed in this type of study, and the other ten are entropy-type mTIs based on Markov chains. The classic mTIs were calculated as implemented in the Chem3D Ultra software [38]. For the entropy-type mTIs we used the software Markovian Chemicals "In Silico" Design (MARCH-INSIDE) [39]. The definition of each mTI and the software used to calculate it [11, 40, 41] are shown in Table 1.
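To give a feeling for what an entropy-type index based on a Markov chain looks like, the following Python sketch computes Shannon-type entropies of the k-step probability distribution on a small molecular graph. It is not the MARCH-INSIDE algorithm: the transition matrix used here (row-normalized adjacency matrix plus self-loops), the uniform initial distribution and the unit prefactor are all assumptions made only for illustration, since this section does not specify how the software weights the atoms.

```python
import numpy as np


def markov_entropies(adjacency, k_max=3, prefactor=1.0):
    """Entropy of the k-step distribution p_k for k = 0..k_max.

    Assumptions (not taken from the chapter): uniform edge weights with
    self-loops, uniform initial distribution, natural logarithm.
    """
    a = np.asarray(adjacency, dtype=float)
    n = a.shape[0]
    weights = a + np.eye(n)                             # assumed self-loops
    pi = weights / weights.sum(axis=1, keepdims=True)   # row-stochastic matrix
    p = np.full(n, 1.0 / n)                             # assumed uniform p_0
    entropies = []
    for _ in range(k_max + 1):
        nonzero = p[p > 0]                              # skip zeros: 0*log(0) -> 0
        entropies.append(-prefactor * float(np.sum(nonzero * np.log(nonzero))))
        p = p @ pi                                      # advance the chain one step
    return entropies


# Carbon skeleton of propane (C1-C2-C3) as a toy molecular graph.
propane = [[0, 1, 0],
           [1, 0, 1],
           [0, 1, 0]]
theta = markov_entropies(propane, k_max=3)
```

With a uniform starting distribution, the k = 0 entropy equals log(n); later values decrease as probability concentrates on the better-connected central atom.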
2.2. Supra-molecular network topological indices (smnTIs)
At the supra-molecular network level we used 15 smnTIs obtained by means of local centrality measures in a complex network whose nodes (vertices) are the intercalation complexes and whose edges connect the pairs of complexes with similar potential energy profiles. The intercalation complexes were modeled and the energy profiles were obtained by means of Monte Carlo molecular dynamics simulation. Then, the network was built and the centrality measures were calculated [42].
2.2.1. Canonical decanucleotide used in this study
For our supra-molecular study we simulated, starting from the decanucleotide of sequence d(CCGCTAGCGG) and using the HyperChem package, a fragment of DNA with the double helix in the B form and the sugars in the 2'-endo form. This decanucleotide sequence has been used in different studies concerning psoralen intercalation [43].
2.2.2. Model building of drug-DNA intercalation complexes
All the compounds were designed using the interactive model building package of HyperChem [44]. Their geometries were optimized by semiempirical quantum mechanical calculations (PM3) [45] using the Polak-Ribiere algorithm and the default options of the package. The minimized molecular structures were then intercalated by hand into the DNA fragment, using the HyperChem package and taking into account the following experimentally demonstrated statements:
1. In the dark, the poly[dA-dT]·poly[dA-dT] sequence in DNA is the most favorable site for intercalation, since the subsequent photoreaction takes place mainly at the 5,6 double bond of the thymine [46]. So, the optimized molecules were inserted between the thymine units, in a plane parallel to the bases and, by our choice, in a halfway position (Figure 1, left).
2. The furocoumarins have two reactive sites, so after photoreaction different types of cycloadducts can be formed: monoadducts (furan-side or pyrone-side) and diadducts (cross-links) [5]. Although psoralens are able to form all the cycloadduct types, angelicins form only monoadducts owing to their angular molecular structure. Keeping this in mind, for each linear molecule we modeled only one starting conformation, for which cycloadduct formation at either reactive site (furan-side or pyrone-side) is equally feasible from a geometric point of view. For each angular molecule we decided to model two starting conformations, one for each monoadduct formation (for the furan side, which we named the j-conformation, and for the pyrone side, which we named the c-conformation).
3. The stereochemistry of the furocoumarin adducts is cis-syn [47, 48]. Consequently, the molecules were oriented in such a way that the intercalation complex mainly favors the formation of cycloadducts with this stereochemistry. For the furan side, syn stereochemistry means that the furan O1’ and the pyrimidine N1 will be on adjacent corners of the future cyclobutane ring. For the pyrone side, syn stereochemistry is defined as having the carbonyl carbon of the pyrone ring and the N1 of the pyrimidine on adjacent corners of the future cyclobutane ring (Figure 1, right).

Table 1. Definitions of the TIs used and the software employed for their calculation.

| Name | Symbol used | Formula (a) | Soft (b) |
|---|---|---|---|
| Balaban Index | Cbala | $J = \frac{1}{2} \cdot C \cdot (d' \cdot A \cdot d'^{T})$ | C3DI |
| Cluster Count | Ccc | - | C3DI |
| Diameter | Cdiam | $D = \max_{w}(dist(v, w))$ | C3DI |
| Molecular Topological Index | Cmti | $MTI = \sum_{i=1}^{A} [(A + D)v]_i$ | C3DI |
| Radius | Cradius | $R = \min_{w}(dist(v, w))$ | C3DI |
| Shape Attribute | Cshpa | - | C3DI |
| Shape Coefficient | Cshpc | $I_2 = \frac{D - R}{R}$ | C3DI |
| Sum Of Degrees | Csd | $\sum_{v} C_{deg}(v) = \sum_{v} \deg(v)$ | C3DI |
| Sum Of Valence Degrees | Csvd | $\sum_{v} C_{deg}(v)^{*} = \sum_{v} \deg(v)^{*}$ | C3DI |
| Total Connectivity | Ctc | $\chi_T = \left( \prod_{i=1}^{A} \deg(v) \right)^{-1/2}$ | C3DI |
| Total Valence Connectivity | Ctvc | $\chi_T^{*} = \left( \prod_{i=1}^{A} \deg(v)^{*} \right)^{-1/2}$ | C3DI |
| Wiener Index | Cwien | $W(G) = \frac{1}{2} (u \cdot D \cdot u^{T})$ | C3DI |
| Markov Entropies | CEk | $\Theta_k = -kT \cdot \sum_{j}^{n} p_k(j) \cdot \log p_k(j)$ | MII |

(a) All symbols used in these formulae are very common in the complex network theory literature and cannot be explained in detail here. However, G = (V, E) is an undirected or directed, (strongly) connected graph with n = |V| vertices; deg(v) denotes the degree of the vertex v in an undirected graph and deg(v)* denotes the valence degree, for molecular networks only; dist(v, w) denotes the length of a shortest path between the vertices v and w; σst denotes the number of shortest paths from s to t and σst(v) the number of shortest paths from s to t that use the vertex v. D and A are the topological distance matrix and the adjacency matrix of the graph G. For more details, please see the references cited and others.
(b) Indices calculated with different software: Chem3D Ultra Indices (C3DI); MARCH-INSIDE Indices (MII); CentiBin Indices (CBI).
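Several of the classic indices defined in Table 1 follow directly from the adjacency and topological distance matrices of the molecular graph. The Python sketch below is a minimal illustration under stated assumptions, not the Chem3D Ultra implementation: it uses a hydrogen-suppressed graph with unweighted bonds, and it takes the diameter and radius as the largest and smallest vertex eccentricity, respectively. All function names are hypothetical.

```python
from collections import deque

import numpy as np


def distance_matrix(adjacency):
    """Topological distances (number of bonds) via breadth-first search."""
    a = np.asarray(adjacency)
    n = a.shape[0]
    dist = np.full((n, n), -1, dtype=int)       # -1 marks "not reached yet"
    for source in range(n):
        dist[source, source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in np.flatnonzero(a[v]):
                if dist[source, w] < 0:
                    dist[source, w] = dist[source, v] + 1
                    queue.append(w)
    return dist


def classic_indices(adjacency):
    """A few Table 1 indices; assumes a connected graph with at least two atoms."""
    a = np.asarray(adjacency, dtype=int)
    d = distance_matrix(a)
    if (d < 0).any():
        raise ValueError("the graph must be connected")
    u = np.ones(a.shape[0])                     # unit vector of the W(G) formula
    degrees = a.sum(axis=1)
    eccentricity = d.max(axis=1)                # max_w dist(v, w) for each v
    diameter, radius = int(eccentricity.max()), int(eccentricity.min())
    return {
        "wiener": float(0.5 * (u @ d @ u)),                      # W = 1/2 u D u^T
        "sum_of_degrees": int(degrees.sum()),
        "total_connectivity": float(np.prod(degrees.astype(float)) ** -0.5),
        "diameter": diameter,
        "radius": radius,
        "shape_coefficient": (diameter - radius) / radius,      # (D - R) / R
    }


# Carbon skeleton of n-butane (C1-C2-C3-C4) as a toy example.
butane = [[0, 1, 0, 0],
          [1, 0, 1, 0],
          [0, 1, 0, 1],
          [0, 0, 1, 0]]
indices = classic_indices(butane)
```

For this four-atom path the Wiener index is 10, the sum of degrees is 6, the total connectivity is 0.5, and the diameter, radius and shape coefficient are 3, 2 and 0.5.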
Figure 1. Left: Intercalation complex of compound 1 between the thymine units (DT). Right: Plane projections of the main conformations used in the model building of drug-DNA intercalation complexes (linear and angular furocoumarins). Deoxyribose is designated by -dRib.
On the other hand, some of the angular molecules studied bear substituents at the C3 carbon that prevented us from modeling their j-conformation appropriately, owing to steric clashes with the thymine ring. We also found steric hindrance from the DNA backbone when these substituents are much bigger. In all these cases we decided to model several alternative starting conformations in which the steric effects were eliminated. In the majority of cases we just varied the degree of insertion of the molecule in the DNA; in the most critical cases we also had to rotate the molecule clockwise. Both the outward displacement from the DNA and the rotation of the molecule were carried out in the halfway plane parallel to the nitrogen bases. In this sense, the geometric criteria used were the relative distance (in the plane projection) between the geometric centers of the double bonds that will take part in the photoaddition (the j or c bond of the furocoumarin and the 5,6 bond of the thymine) and the relative angle between them. Figure 2 represents, in a simplified way, the variations of these geometric parameters used to model the j-conformations. Taking into account all the above considerations, a total of 175 starting conformations were modeled: one conformation for each of the 21 linear molecules and the rest for the angular ones.
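The geometric criterion just described (distance between the geometric centers of the two double bonds in the plane projection, plus the angle between them) can be sketched as follows. The coordinates are hypothetical and in arbitrary units, and the function name is an assumption; the sketch only illustrates how the two parameters could be measured from projected atom positions.

```python
import math


def bond_geometry(bond_a, bond_b):
    """Return (distance between bond midpoints, angle between bonds in degrees).

    Each bond is ((x1, y1), (x2, y2)) in the plane projection. The angle is
    measured between the direction vectors, so it depends on the atom order.
    """
    (ax1, ay1), (ax2, ay2) = bond_a
    (bx1, by1), (bx2, by2) = bond_b
    mid_a = ((ax1 + ax2) / 2.0, (ay1 + ay2) / 2.0)
    mid_b = ((bx1 + bx2) / 2.0, (by1 + by2) / 2.0)
    distance = math.hypot(mid_a[0] - mid_b[0], mid_a[1] - mid_b[1])
    va = (ax2 - ax1, ay2 - ay1)
    vb = (bx2 - bx1, by2 - by1)
    cos_angle = (va[0] * vb[0] + va[1] * vb[1]) / (math.hypot(*va) * math.hypot(*vb))
    cos_angle = max(-1.0, min(1.0, cos_angle))   # guard against rounding error
    return distance, math.degrees(math.acos(cos_angle))


# Hypothetical projected coordinates (arbitrary units), for illustration only.
thymine_5_6_bond = ((0.0, 0.0), (2.0, 0.0))
furan_side_bond = ((0.0, 1.0), (2.0, 3.0))
distance, angle = bond_geometry(thymine_5_6_bond, furan_side_bond)
```

In this toy case the bond midpoints are 2.0 units apart and the bonds form an angle of 45 degrees.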
Figure 2. Modifications made to the j-conformations to avoid steric hindrance. (top) Decrease of the degree of insertion of the molecule in the DNA. The relative distance between the geometric centers of the double bonds takes discrete values of 0, 0.5 and 1 times the length of a C-C bond. (bottom) Decrease of the degree of insertion accompanied by a clockwise rotation of the molecule by 45 degrees.
2.2.3. Obtaining Monte Carlo molecular dynamics energetic profiles
The DNA-drug docking molecular dynamics trajectories, or energetic profiles, of all the starting intercalation complexes were obtained by means of the Monte Carlo method [49], using the HyperChem package. The AMBER94 molecular mechanics force field was used with a distance-dependent dielectric constant (scale factor 1), default electrostatic and van der Waals values, and shifted cutoffs with an outer radius of 14 Å. All the components of the force field were included, and the atom types were recalculated keeping their current charges. Finally, the simulation was executed in vacuo at 300 K, and potential energy profiles of 100 optimization steps were calculated to build the supra-molecular network.

2.2.4. Supra-molecular network construction
In order to obtain the smnTIs for the QSAR studies, we constructed a complex network as follows:
1. The potential energy values of each of the 175 profiles were organized into a table of steps (rows) vs. intercalation complexes (columns) using Microsoft Excel. These raw data were used as input for the STATISTICA 6.0 package [50], which was employed to calculate multistep correlations between the Monte Carlo energetic profiles of pairs of starting conformations, expressed as the correlation coefficient (r) [51].
2. The correlation matrix obtained was transformed into a Boolean adjacency matrix using Microsoft Excel. The elements of the correlation matrix are replaced by “1” or “0” if they are greater or smaller than a certain threshold value, respectively. The selected threshold should yield a matrix in which a minimum percentage of the profiles correlate with no other profile (intercalation complexes that would be completely disconnected in the network), so that as few cases as possible are lost. Losing cases is undesirable given the small size of our dataset. After scanning different threshold values in the range r = 0 to 1 with a step width of 0.001, we found that a threshold of 0.976 gave a matrix with 0% of disconnected profiles, in which each intercalation complex is directly connected, on average, to 32 other complexes. Finally, the elements of the main diagonal are replaced by “0” to avoid loops in the network, and the rows and columns whose components are all “0” (disconnected cases) are eliminated. The Boolean matrix is saved as a text file and then renamed as a “.mat” file.
3. The “.mat” file is read with the CentiBiN software [41], which facilitates the representation of the supra-molecular network and highlights all the starting complexes or conformations (nodes) and the pairs of them with similar potential energy minimization profiles (edges). It also calculates the desired smnTIs (centrality measures of the network) [40].

Following the above-indicated procedure, we obtained a loop-free network with 175 nodes and 2913 undirected edges (pairs of starting complexes or conformations whose potential energy profiles correlate with r > 0.976). Figure 3 illustrates one of the networks obtained, as displayed in the CentiBiN interface.
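The thresholding procedure of steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the original work used Microsoft Excel and STATISTICA, the function names (`pearson_r`, `adjacency_from_profiles`, `drop_isolated`) are hypothetical, and the toy energy profiles are invented numbers, not data from this chapter.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def adjacency_from_profiles(profiles, threshold):
    """Boolean adjacency matrix: 1 where r > threshold, 0 elsewhere.

    The main diagonal is left at 0 so that the network has no loops.
    """
    n = len(profiles)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if pearson_r(profiles[i], profiles[j]) > threshold:
                adj[i][j] = adj[j][i] = 1
    return adj

def drop_isolated(adj):
    """Remove rows/columns of disconnected nodes; return kept indices and reduced matrix."""
    keep = [i for i, row in enumerate(adj) if any(row)]
    return keep, [[adj[i][j] for j in keep] for i in keep]

# Toy energy profiles (hypothetical values): the first two decay almost
# proportionally, the third fluctuates and correlates with neither.
profiles = [
    [10.0, 8.0, 6.5, 6.0],
    [20.0, 16.1, 13.0, 12.1],
    [5.0, 7.0, 4.0, 9.0],
]
adj = adjacency_from_profiles(profiles, threshold=0.976)
keep, reduced = drop_isolated(adj)
```

Scanning the threshold, as described in step 2, would amount to calling `adjacency_from_profiles` for each candidate value and counting the nodes that `drop_isolated` discards.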
Figure 3. One Supra-molecular Complex network for different DNA-Furocoumarin complexes
2.2.5. Calculation of smnTIs
As previously mentioned, the smnTIs are obtained by means of local calculations on the supra-molecular network built. These descriptors are 15 centrality measures (implemented in the CentiBiN software), calculated for each node of our network (see definitions in Table 2). For the compounds for which we modeled two or more intercalation complexes (angular molecules), the arithmetic mean of each smnTI was assigned.
Table 2. Definitions of the smnTIs used and the software employed for their calculation.

| Name | Symbol used | Formula (a) | Soft (b) |
|---|---|---|---|
| Degree | Cdeg | $C_{deg}(v) = deg(v)$ | CBI |
| Eccentricity | Cecc | $C_{ecc}(v) = \max\{dist(v, w) : w \in V\}$ | CBI |
| Closeness | Cclo | $C_{clo}(v) = 1 / \left( \sum_{w \in V} dist(v, w) \right)$ | CBI |
| Radiality | Crad | $C_{rad}(v) = \sum_{w \in V} (\Delta_G + 1 - dist(v, w)) / (n - 1)$ | CBI |
| Centroid Values | Ccen | $C_{cen}(v) = \min\{f(v, w) : w \in V \setminus \{v\}\}$ | CBI |
| Stress | Cstr | $C_{str}(v) = \sum_{s \neq v \in V} \sum_{t \neq v \in V} \sigma_{st}(v)$ | CBI |
| Shortest-path Betweenness | Cspb | $C_{spb}(v) = \sum_{s \neq v \in V} \sum_{t \neq v \in V} \delta_{st}(v)$ | CBI |
| Current-Flow Closeness | Ccfc | $C_{cfc}(v) = (n - 1) / \left( \sum_{t \neq v} (p_{vt}(v) - p_{vt}(t)) \right)$ | CBI |
| Current-Flow Betweenness | Ccfb | $C_{cfb}(v) = \sum_{s,t \in V} \tau_{st}(v) / ((n - 1)(n - 2))$ | CBI |
| Katz Status Index | Ckatz | $C_{katz} = \sum_{k=1}^{\infty} \alpha^{k} \cdot (A^{t})^{k} \cdot u$ | CBI |
| Eigenvector | Ceig | $EC(v) = e_1(v)$ | CBI |
| Hubbell Index | Chub | $C_{hubbell} = E + W C_{hubbell}$ | CBI |
| Bargaining | Cbarg | $C_{barg} = \alpha \cdot (I - \beta A)^{-1} \cdot A \cdot u$ | CBI |
| PageRank | Cpage | $C_{pagerank} = d P C_{pagerank} + (1 - d) \cdot u$ | CBI |
| HITS-Authority | Chauth | $C_{auths} = A^{T} C_{hubs}$ | CBI |
| HITS-Hubs | Chhubs | $C_{hubs} = A \cdot C_{auths}$ | CBI |
| Closeness Vitality | Cclv | $C_{clv}(v) = W(G) - W(G \setminus \{v\})$ | CBI |

(a) All symbols used in these formulae are very common in the complex networks literature and cannot be explained in detail here. However, G = (V, E) is an undirected or directed, (strongly) connected graph with n = |V| vertices; deg(v) denotes the degree of the vertex v in an undirected graph and deg(v)* denotes the valence degree, for molecular networks only; dist(v, w) denotes the length of a shortest path between the vertices v and w; σst denotes the number of shortest paths from s to t and σst(v) the number of shortest paths from s to t that use the vertex v. D and A are the topological distance matrix and the adjacency matrix of the graph G. For more details, please see the references cited.
(b) Indices calculated with different software: Chem3D Ultra Indices (C3DI); MARCH-INSIDE Indices (MII); CentiBiN Indices (CBI).
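To make a few of the definitions in Table 2 concrete, the following minimal Python sketch computes the degree, eccentricity, closeness and closeness vitality of a node in a small undirected graph. It is not the CentiBiN implementation: the function names are hypothetical, the Wiener index W(G) is assumed here to be the sum of shortest-path lengths over unordered node pairs, and the graph is assumed to stay connected after the node is deleted.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def degree(adj, v):
    """C_deg(v) = deg(v)."""
    return len(adj[v])

def eccentricity(adj, v):
    """C_ecc(v) = max dist(v, w), as written in Table 2."""
    return max(bfs_distances(adj, v).values())

def closeness(adj, v):
    """C_clo(v) = 1 / sum of dist(v, w)."""
    return 1.0 / sum(bfs_distances(adj, v).values())

def wiener(adj):
    """Sum of dist(v, w) over unordered pairs (assumes a connected graph)."""
    return sum(sum(bfs_distances(adj, v).values()) for v in adj) / 2

def closeness_vitality(adj, v):
    """C_clv(v) = W(G) - W(G without v); assumes G without v is still connected."""
    reduced = {u: nbrs - {v} for u, nbrs in adj.items() if u != v}
    return wiener(adj) - wiener(reduced)

# Toy network (hypothetical): path a-b-c-d plus the extra edge b-d.
adj = {"a": {"b"}, "b": {"a", "c", "d"}, "c": {"b", "d"}, "d": {"b", "c"}}
print(degree(adj, "b"), eccentricity(adj, "a"), wiener(adj), closeness_vitality(adj, "a"))
```

In this toy graph, deleting node a leaves a triangle, so the closeness vitality of a is the drop in the Wiener index caused by removing a peripheral node.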
2.3. Statistical analysis
Linear discriminant analysis (LDA) [52] is often used in QSAR as an appropriate technique for classification problems. LDA was therefore selected in this work to seek the linear discriminant functions that can classify the furocoumarins and their aza-analogues into the two activity groups mentioned previously. The classification functions have the following form:

$Z(1) = a_{1,0} + a_{1,1}C_1 + a_{1,2}C_2 + a_{1,3}C_3 + \dots + a_{1,n}C_n$ (1)

$Z(0) = a_{0,0} + a_{0,1}C_1 + a_{0,2}C_2 + a_{0,3}C_3 + \dots + a_{0,n}C_n$ (2)

where Z is an indicator variable of antiproliferative activity; the $a_{group,n}$ are the coefficients of the classification functions for each group, determined by least squares as implemented in the LDA module of STATISTICA 6.0; and the $C_n$ are the different descriptors (TIs) selected by the software by means of the Forward Stepwise strategy (fixed by us) to build the model. In our case, with the aim of simplifying the models, we used a single discriminant function of the type [53]:

$dZ = Z(1) - Z(0) = \alpha_0 + \alpha_1 C_1 + \alpha_2 C_2 + \alpha_3 C_3 + \dots + \alpha_n C_n$ (3)

where dZ is a score of the antiproliferative activity. A very high value of dZ indicates that the compound belongs to group 1 (active); otherwise the compound belongs to group 0 (inactive). In equation (3), the $\alpha_n$ coefficients result from subtracting function (2) from function (1) ($\alpha_n = a_{1,n} - a_{0,n}$). The quality of the LDA analysis was determined by examining Wilks' λ (also known as the U-statistic), the Fisher ratio (F), and the p-level (p). Wilks' statistic for overall discrimination takes values in the range from 0 (perfect discrimination) to 1 (no discrimination). The Fisher ratio allowed us to check the hypothesis of group separation with a probability of error (p-level) < 0.05. We also analyzed the sensitivity, specificity and accuracy [54].
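The step from the two classification functions (1) and (2) to the single score of equation (3) can be illustrated with a short Python sketch. The coefficients and descriptor values below are invented for illustration, the function names are hypothetical, and the decision cutoff of 0 (assigning the group with the larger Z) is an assumption rather than a setting reported in the text.

```python
def discriminant_score(coef_active, coef_inactive, descriptors):
    """Return (alphas, dZ) following equation (3).

    coef_active / coef_inactive: [a_0, a_1, ..., a_n] of Z(1) and Z(0).
    descriptors: [C_1, ..., C_n] for one compound.
    """
    alphas = [a1 - a0 for a1, a0 in zip(coef_active, coef_inactive)]
    dz = alphas[0] + sum(a * c for a, c in zip(alphas[1:], descriptors))
    return alphas, dz

def predicted_group(dz, cutoff=0.0):
    """Group 1 (active) if dZ exceeds the cutoff, else group 0 (assumed cutoff)."""
    return 1 if dz > cutoff else 0

# Hypothetical coefficients and descriptors, not taken from Table 3.
alphas, dz = discriminant_score(
    coef_active=[1.0, 2.0, -1.0],
    coef_inactive=[0.5, 1.0, 1.0],
    descriptors=[0.5, 0.25],
)
print(alphas, dz, predicted_group(dz))
```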
3. Results and discussion
In this work, the models obtained to discriminate the molecules in the three groups of antiproliferative activity (very active, active or inactive) have a maximum of six variables to avoid over-fitting problems. One of the parameters currently used to assess whether a QSAR model is over-fitted is the ratio between cases and adjustable parameters (ρ) [55]. The definition of ρ in the LDA case is given below:

$\rho = \dfrac{N_c}{N_g (1 + N_v)}$ (4)

where Nc is the number of compounds included in the model, Ng is the number of groups and Nv is the number of variables in the model. A necessary condition for the development of a linear model is that ρ > 4, which has to be considered when introducing the variables [56]. In our case, the models obtained using all cases (N = 57) and the LDA set with training and validation series (N = 42) should contain at most six and four variables, respectively, when the ρ criterion is considered. In contrast, the number of variables increases to fourteen (all cases) and ten (training and validation series) if Tropsha's criterion is taken into account. Considering both criteria, we decided to work with an intermediate value of six variables.
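As a quick check of equation (4), the sketch below evaluates ρ and searches for the largest number of variables that still satisfies ρ > 4. The function names are hypothetical; the inputs (57 or 42 compounds, two groups) are the values quoted in the text.

```python
def rho(n_cases, n_groups, n_vars):
    """Ratio between cases and adjustable parameters, equation (4)."""
    return n_cases / (n_groups * (1 + n_vars))

def max_vars_rho(n_cases, n_groups, minimum=4.0):
    """Largest number of variables Nv for which rho stays above `minimum`."""
    n_vars = 0
    while rho(n_cases, n_groups, n_vars + 1) > minimum:
        n_vars += 1
    return n_vars

print(rho(57, 2, 5))          # five-variable model on all cases
print(max_vars_rho(57, 2))    # limit for N = 57
print(max_vars_rho(42, 2))    # limit for N = 42
```

With two groups, this reproduces the limits of six variables for N = 57 and four variables for N = 42 stated above.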
3.1. Molecular network QSAR studies. Models based on mTIs
At the molecular level we obtained models (A) and (B) using two different types of mTIs: entropies based on Markov chains (CEk) and classic mTIs (Table 3). Model (A) was the best model found at this complexity level, with 94.74% accuracy when all cases were used. In this model, the most significant variables were the first-, tenth-, zero-, ninth- and second-order entropies. These results coincide with the high importance of entropy-type measures for explaining many chemical properties of molecular and/or supra-molecular structures reported by other authors, such as Graham et al. [57]. The values of the statistical parameters (U is Wilks' lambda or U-statistic, F is the Fisher ratio, and p is the p-level) demonstrate the high significance of the model. This model correctly classifies 11 out of 12 non-active compounds (Sp = 91.67%) and 43 out of 45 active compounds (Sn = 95.56%). The main indicator of good classification is the probability with which our model assigns a compound to its correct group of antiproliferative activity. The posterior probabilities [58] of classification for each compound using the all-cases model are shown in Table 4. The structures and references of all drugs numbered in Table 4 are detailed in Table 5 and Table 6. The model is very stable and robust: its accuracy for the training series is 95.24%, that is, 40 out of 42 compounds were correctly classified. The predictive power of this model is also high, since it correctly predicts 14 out of 15 compounds (Ac = 93.33%) in the validation series (Table 3). These results can be considered good for this class of QSAR models [59]. Model (C) (Table 3) was obtained by mixing all the mTIs, but it does not improve on model (A).
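The specificity, sensitivity and accuracy quoted for model (A) follow directly from the classification counts. The sketch below recomputes them; the function name is hypothetical and the counts are those given in the paragraph above (11 of 12 non-active and 43 of 45 active compounds correctly classified).

```python
def classification_metrics(true_neg, false_pos, false_neg, true_pos):
    """Return (specificity, sensitivity, accuracy) in percent."""
    total = true_neg + false_pos + false_neg + true_pos
    specificity = 100.0 * true_neg / (true_neg + false_pos)
    sensitivity = 100.0 * true_pos / (true_pos + false_neg)
    accuracy = 100.0 * (true_neg + true_pos) / total
    return specificity, sensitivity, accuracy

# Model (A), all cases: group 0 = non-active, group 1 = active.
sp, sn, ac = classification_metrics(true_neg=11, false_pos=1, false_neg=2, true_pos=43)
print(f"Sp = {sp:.2f}%  Sn = {sn:.2f}%  Ac = {ac:.2f}%")
```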
Table 3. Detailed results for different QSAR/QSBR models including docking scoring functions.

| Model (TI) | Param.* | All % | All 0 | All 1 | Train % | Train 0 | Train 1 | CV % | CV 0 | CV 1 |
|---|---|---|---|---|---|---|---|---|---|---|
| (A) Entropy type | Sp | 91.5 | 11 | 1 | 100 | 9 | 0 | 100 | 3 | 0 |
| | Sn | 95.6 | 2 | 43 | 93.9 | 2 | 31 | 91.7 | 1 | 11 |
| | Ac | 94.7 | | | 95.2 | | | 93.3 | | |
| (B) Classic | Sp | 75.0 | 9 | 3 | 88.9 | 8 | 1 | 100 | 3 | 0 |
| | Sn | 95.6 | 2 | 43 | 90.9 | 3 | 30 | 100 | 0 | 12 |
| | Ac | 91.2 | | | 90.5 | | | 100 | | |
| (C) Entropy & classic mTIs | Sp | 91.7 | 11 | 1 | 100 | 9 | 0 | 100 | 3 | 0 |
| | Sn | 95.7 | 2 | 43 | 93.9 | 2 | 31 | 100 | 0 | 12 |
| | Ac | 94.7 | | | 95.2 | | | 100 | | |
| (D) smnTIs | Sp | 50.0 | 6 | 6 | 77.8 | 7 | 2 | 66.7 | 2 | 1 |
| | Sn | 91.1 | 4 | 41 | 78.8 | 7 | 26 | 66.7 | 4 | 8 |
| | Ac | 82.5 | | | 78.6 | | | 66.7 | | |
| (E) All TIs mixed | Sp | 91.7 | 11 | 1 | 100 | 9 | 0 | 100 | 3 | 0 |
| | Sn | 95.6 | 2 | 43 | 93.9 | 2 | 31 | 91.7 | 1 | 11 |
| | Ac | 94.7 | | | 95.2 | | | 93.3 | | |

| Model | Series | Equation | N | U | F | p | ρ | Nc/Nv |
|---|---|---|---|---|---|---|---|---|
| (A) | All | dZ = -4.10CE01 +262.67CE10 -16.45CE00 -260.65CE09 +15.23CE02 -0.01 | 57 | 0.28 | 25.83 | < 0.01 | 4.75 | 11.4 |
| (A) | Train | dZ = -3.01CE01 +206.18CE10 -14.18CE00 -203.09CE09 +11.13CE02 -0.01 | 42 | 0.31 | 16.10 | < 0.01 | 3.50 | 8.40 |
| (B) | All | dZ = -12.39Csvd -4.74Ctc +0.44 | 57 | 0.31 | 59.23 | < 0.01 | 9.50 | 29.0 |
| (B) | Train | dZ = -10.86Csvd -4.40Ctc +0.67 | 42 | 0.34 | F(2,39) = 37.19 | < 0.01 | 7.00 | 21 |
| (C) | All | dZ = +1.75Csvd +42.40CE10 -38.63CE08 -10.12CMTI +2.71Ctvc -0.11 | 57 | 0.24 | 31.55 | < 0.01 | 4.75 | 11.4 |
| (C) | Train | dZ = -31.85CE08 +36.13CE10 -13.53CMTI +7.13Csvd +4.48Ctvc +0.02 | 42 | 0.26 | F(5,36) = 20.07 | < 0.01 | 3.50 | 8.40 |
| (D) | All | dZ = 0.88Ccfc -3.36Cpage +3.98Cdeg +2.02Cstr -0.91Cbarg -1.55Cspb +1.10 | 57 | 0.74 | 2.88 | < 0.05 | 4.07 | 10.0 |
| (D) | Train | dZ = -0.30Ccfc -3.79Cpage +4.21Cdeg +2.30Cstr -1.25Cbarg -1.94Cspb +1.11 | 42 | 0.74 | 2.00 | 0.09 | 3.00 | 7.00 |
| (E) | All | dZ = 3.13Csvd +50.00CE10 -45.74CE08 -11.21CMTI +3.38Ctvc -1.25Ccen -0.11 | 57 | 0.24 | 26.88 | < 0.01 | 4.07 | 10.0 |
| (E) | Train | dZ = -43.86CE08 +49.22CE10 -16.37CMTI +10.54Csvd +6.17Ctvc -2.12Ccen -0.02 | 42 | 0.24 | 18.88 | < 0.01 | 3.00 | 7.00 |

* Sp: Specificity; Sn: Sensitivity and Ac: Accuracy.
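Table 4. Probabilities with which the compounds were correctly classified (Models A-E).

| Drug | Act. (a) | Obs. (b) | A | B | C | D | E |
|---|---|---|---|---|---|---|---|
| 1 | 0.34 | 1 | 1 | 1 | 1 | 0.96 | 1 |
| 2 | 0.66 | 1 | 0.98 | 1 | 1 | 0.96 | 1 |
| 3 | 0.84 | 1 | 1 | 0.56 | 1 | 0.98 | 1 |
| 4 | 0.89 | 1 | 1 | 0.99 | 1 | 0.07 | 1 |
| 5 | 1 | 1 | 1 | 1 | 1 | 0.85 | 1 |
| 6 | 1.01 | 1 | 1 | 1 | 1 | 0.99 | 1 |
| 7 | 1.26 | 1 | 1 | 1 | 1 | 0.97 | 1 |
| 8 | 1.34 | 1 | 1 | 1 | 1 | 0.98 | 1 |
| 9 | 1.52 | 1 | 1 | 1 | 1 | 0.94 | 1 |
| 10 | 1.79 | 1 | 1 | 1 | 1 | 0.44 | 1 |
| 11 | 2.32 | 1 | 1 | 1 | 1 | 0.94 | 1 |
| 12 | 27.6 | 1 | 1 | 1 | 1 | 0.95 | 1 |
| 13 | 0.13 | 1 | 1 | 1 | 1 | 0.96 | 1 |
| 14 | 0.14 | 1 | 1 | 1 | 1 | 0.96 | 1 |
| 15 | 0.18 | 1 | 1 | 1 | 1 | 0.94 | 1 |
| 16 | 0.25 | 1 | 1 | 1 | 1 | 0.94 | 1 |
| 17 | 0.67 | 1 | 1 | 1 | 1 | 0.94 | 1 |
| 18 | 0.68 | 1 | 1 | 1 | 1 | 0.95 | 1 |
| 19 | 0.97 | 1 | 1 | 1 | 1 | 0.99 | 1 |
| 20 | 1.83 | 1 | 1 | 1 | 1 | 0.97 | 1 |
| 21 | 3.66 | 1 | 1 | 1 | 1 | 0.85 | 1 |
| 22 | <0.01 | 0 | 0.24 | 0.03 | 0.2 | 0.48 | |
| 23 | <0.01 | 0 | 1 | 1 | 0.31 | 0.59 | 1 |
| 24 | <0.01 | 0 | 1 | 0.88 | 1 | 0.81 | 1 |
| 25 | <0.01 | 0 | 1 | 0.99 | 1 | 0.68 | 1 |
| 26 | <0.01 | 0 | 1 | 1 | 1 | 0.68 | 1 |
| 27 | <0.01 | 0 | 1 | 1 | 1 | 0.69 | 1 |
| 28 | <0.01 | 0 | 1 | 1 | 1 | 0.62 | 1 |
| 29 | <0.01 | 0 | 1 | 1 | 1 | 0.35 | 1 |
| 30 | 0.05 | 0 | 0.61 | 0.22 | 0.58 | 0.2 | 0.68 |
| 31 | 0.06 | 0 | 1 | 1 | 1 | 0.1 | 1 |
| 32 | 0.07 | 0 | 0.56 | 0.74 | 0.75 | 0.12 | 0.87 |
| 33 | 0.07 | 0 | 0.96 | 0.44 | 0.98 | 0.27 | 0.98 |
| 34 | 0.2 | 1 | 0.09 | 0.29 | 0.15 | 0.38 | 0.12 |
| 35 | 0.2 | 1 | 1 | 1 | 1 | 0.97 | 1 |
| 36 | 0.33 | 1 | 1 | 1 | 1 | 0.96 | 1 |
| 37 | 0.35 | 1 | 1 | 1 | 1 | 0.77 | 1 |
| 38 | 0.4 | 1 | 0.02 | 0.19 | 0.04 | 0.78 | 0.04 |
| 39 | 0.55 | 1 | 1 | 1 | 1 | 0.96 | 1 |
| 40 | 0.55 | 1 | 1 | 1 | 1 | 0.97 | 1 |
| 41 | 0.61 | 1 | 0.81 | 0.99 | 0.96 | 0.86 | 0.92 |
| 42 | 0.8 | 1 | 1 | 1 | 1 | 0.7 | 1 |
| 43 | 0.81 | 1 | 1 | 1 | 1 | 0.93 | 1 |
| 44 | 1.27 | 1 | 1 | 1 | 1 | 0.96 | 1 |
| 45 | 1.47 | 1 | 1 | 1 | 1 | 0.94 | 1 |
| 46 | 5.3 | 1 | 1 | 1 | 1 | 0.91 | 1 |
| 47 | 5.75 | 1 | 1 | 1 | 1 | 0.59 | 1 |
| 48 | 5.78 | 1 | 1 | 1 | 1 | 0.86 | 1 |
| 49 | 0.48 | 1 | 1 | 1 | 1 | 0.97 | 1 |
| 50 | 0.66 | 1 | 1 | 1 | 1 | 0.74 | 1 |
| 51 | 1.07 | 1 | 1 | 1 | 1 | 0.88 | 1 |
| 52 | 1.36 | 1 | 1 | 1 | 1 | 0.81 | 1 |
| 53 | 2.09 | 1 | 1 | 1 | 1 | 0.36 | 1 |
| 54 | 2.59 | 1 | 1 | 1 | 1 | 0.95 | 1 |
| 55 | 4.62 | 1 | 1 | 1 | 1 | 0.86 | 1 |
| 56 | 5.6 | 1 | 1 | 1 | 1 | 0.76 | 1 |
| 57 | 9.25 | 1 | 1 | 1 | 1 | 0.94 | 1 |

(a) Activity: experimental antiproliferative activity in Ehrlich Ascites tumor cells expressed as ID50 relative to 8-MOP.
(b) Observed: observed activity group; it is 0 for non-active (Activity ≤ 0.10) and 1 for active (Activity > 1.05) compounds.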
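Table 5. Linear furocoumarins (psoralens) and their aza-analogues used.

[Structure scheme: furocoumarin skeleton showing ring position Z and substituent positions R3, R4, R5, R4´, R5´ and R8.]

| Drug | Z | R3 | R4 | R5 | R4´ | R5´ | R8 | Act. (a) | Obs. (b) | Ref. (c) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C | Me | Me | H | Me | H | H | 0.34 | 0 | [37] |
| 2 | C | H | H | OMe | H | H | H | 0.66 | 0 | [4] |
| 3 | C | H | CH2OH | H | Me | H | OMe | 0.84 | 0 | [7] |
| 4 | C | Me | H | H | Me | H | OMe | 0.89 | 0 | [69] |
| 5 | C | H | H | H | H | H | OMe | 1.00 | 0 | [36] |
| 6 | C | Me | H | H | Me | H | H | 1.01 | 0 | [70] |
| 7 | C | H | H | H | Me | Me | H | 1.26 | 1 | [37] |
| 8 | C | Me | H | H | Me | H | Me | 1.34 | 1 | [70] |
| 9 | C | H | H | H | H | H | H | 1.52 | 1 | [71] |
| 10 | C | Me | H | H | Me | Me | H | 1.79 | 1 | [37] |
| 11 | C | H | CH2OH | H | Me | H | H | 2.32 | 1 | [7] |
| 12 | C | H | Me | H | H | Me | Me | 27.6 | 1 | [4] |
| 13 | N | H | H | H | Me | H | - | 0.13 | 0 | [36] |
| 14 | N | H | Me | H | H | H | - | 0.14 | 0 | [36] |
| 15 | N | H | H | H | Me | Me | - | 0.18 | 0 | [72] |
| 16 | N | H | H | Me | Me | Me | - | 0.25 | 0 | [36] |
| 17 | N | Me | Me | H | Me | H | - | 0.67 | 0 | [36] |
| 18 | N | H | Me | H | Me | H | - | 0.68 | 0 | [72] |
| 19 | N | H | H | Me | Me | H | - | 0.97 | 0 | [36] |
| 20 | N | Me | Me | H | Me | Me | - | 1.83 | 1 | [72] |
| 21 | N | H | Me | H | Me | Me | - | 3.66 | 1 | [72] |

(a) Act.: the experimental antiproliferative activity in Ehrlich Ascites tumor cells expressed as IC50 (µmol of compound that reduces to 50% the DNA synthesis in Ehrlich cells).
(b) Obs.: observed activity group. It takes the value -1 for the inactive (Act < 0.10), 0 for the active (0.10 ≤ Act ≤ 1.05) and 1 for the very active (Act > 1.05) compounds.
(c) Ref.: references in which the activity of the compounds was reported.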
3.2. Supra-molecular network QSAR studies. Models based only on smnTIs
The LDA model (D) was obtained when we used only the smnTIs (Table 3). The most significant variables were the Current-Flow Closeness (Ccfc), PageRank (Cpage), Degree (Cdeg), Stress (Cstr), Bargaining (Cbarg) and Shortest-path Betweenness (Cspb) topological indices. Although the p-level < 0.05 indicates that these TIs have a significant influence on the activity, the high value of U (0.74) means that the quality of the model for the correct classification of the compounds is not good enough. The model (using all cases) correctly classifies only 47/57 (Ac = 82.46%) molecules in their corresponding group of antiproliferative activity. Specifically, only 6/12 (Sp = 50.00%) compounds were correctly classified as non-active and 41/45 (Sn = 91.11%) as active. The classification percentages for the training and validation series were also low: only 78.57% and 66.67% of the molecules were correctly classified in the training and validation series, respectively [60] (Table 3). The poor classification and prediction of this model are clearly reflected in the posterior probabilities: the classification probabilities for each compound (Table 4) are lower than those obtained with the models previously discussed. This result is consistent with the fact that our study focuses on the effect of the prior non-covalent binding (in the dark) between drug and DNA on the antiproliferative activity. Although the capacity and effectiveness of this intercalation process are certainly of great importance for the subsequent covalent photo-addition of the compounds, this process is only the first step in the mechanism of action of these compounds. Evidently, our smnTIs only encode information about this first step.

Table 6. Angular furocoumarins (angelicins) and their aza-analogues used.

[Structure scheme: angular furocoumarin skeleton showing ring position Z and substituent positions R1, R3, R4, R5, R6, R4´ and R5´.]

| Compd. | Z | R1 | R3 | R4 | R5 | R6 | R4´ | R5´ | Act. (a) | Obs. (b) | Ref. (c) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 22 | O | - | COMe | H | H | H | H | H | <0.01 | -1 | [66] |
| 23 | O | - | COPh | H | H | H | H | H | <0.01 | -1 | [66] |
| 24 | O | - | CON(Et)2 | H | H | H | H | H | <0.01 | -1 | [66] |
| 25 | O | - | CONH(CH2)2OH | H | H | H | H | H | <0.01 | -1 | [66] |
| 26 | O | - | CONH(CH2)2OEt | H | H | H | H | H | <0.01 | -1 | [66] |
| 27 | O | - | CONH(CH2)2NMe2 | H | H | H | H | H | <0.01 | -1 | [66] |
| 28 | O | - | CON[(CH2)2OH]2 | H | H | H | H | H | <0.01 | -1 | [66] |
| 29 | O | - | CON(CH2)2NMe | H | H | H | H | H | <0.01 | -1 | [66] |
| 30 | O | - | CONH2 | H | H | H | H | H | 0.05 | -1 | [66] |
| 31 | O | - | CON(CH2)2O | H | H | H | H | H | 0.06 | -1 | [66] |
| 32 | O | - | CO2H | H | H | H | H | H | 0.07 | -1 | [66] |
| 33 | O | - | CON(Me)2 | H | H | H | H | H | 0.07 | -1 | [66] |
| 34 | O | - | CO2Me | H | H | H | H | H | 0.20 | 0 | [66] |
| 35 | O | - | Me | H | H | H | H | H | 0.20 | 0 | [67] |
| 36 | O | - | Me | Me | H | H | Me | H | 0.03 | 0 | [37] |
| 37 | O | - | Me | Me | H | H | H | H | 0.35 | 0 | [67] |
| 38 | O | - | CO2Et | H | H | H | H | H | 0.40 | 0 | [67] |
| 39 | O | - | H | H | H | H | H | H | 0.55 | 0 | [73] |
| 40 | O | - | H | Me | H | H | H | H | 0.55 | 0 | [37] |
| 41 | O | - | H | Me | H | H | CH2OMe | Me | 0.60 | 0 | [67] |
| 42 | O | - | H | H | H | H | H | Me | 0.80 | 0 | [73] |
| 43 | O | - | H | H | H | H | Me | H | 0.81 | 0 | [73] |
| 44 | O | - | H | Me | H | H | H | Me | 1.27 | 1 | [67] |
| 45 | O | - | H | H | H | H | Me | Me | 1.47 | 1 | [67] |
| 46 | O | - | H | H | Me | H | Me | H | 5.30 | 1 | [73] |
| 47 | O | - | H | Me | H | H | Me | H | 5.75 | 1 | [73] |
| 48 | O | - | H | Me | Me | H | Me | H | 5.78 | 1 | [67] |
| 49 | N | H | H | Me | H | H | Me | H | 0.48 | 0 | [74] |
| 50 | N | H | H | CH2OH | H | Me | H | Me | 0.66 | 0 | [75] |
| 51 | N | H | H | Me | H | Me | Me | CH2OH | 1.07 | 1 | [74] |
| 52 | N | H | H | Me | H | Me | H | Me | 1.36 | 1 | [74] |
| 53 | N | Me | H | CH2OMe | H | Me | H | Me | 2.09 | 1 | [75] |
| 54 | N | H | H | Me | H | H | Me | Me | 2.59 | 1 | [74] |
| 55 | N | H | H | Me | H | Me | Me | H | 4.62 | 1 | [74] |
| 56 | N | Me | H | CH2OH | H | Me | H | Me | 5.60 | 1 | [75] |
| 57 | N | H | H | Me | H | Me | Me | Me | 9.25 | 1 | [74] |

(a) Act.: the experimental antiproliferative activity in Ehrlich Ascites tumor cells expressed as IC50 (µmol of compound that reduces to 50% the DNA synthesis in Ehrlich cells).
(b) Obs.: observed activity group. It takes the value -1 for the inactive (Act < 0.10), 0 for the active (0.10 ≤ Act ≤ 1.05) and 1 for the very active (Act > 1.05) compounds.
(c) Ref.: references in which the activity of the compounds was reported.
3.3. Global QSAR studies. Models based on both mTIs and smnTIs
In general, mTIs and smnTIs are of the same nature in mathematical terms: both classes of TIs describe networks. The difference lies in the structural scale (molecular or supra-molecular) of the system. A clear example is that we can use the molecular Wiener index (Cwien), one of the mTIs, and/or the Closeness Vitality (Cclv) to describe the same system at two different levels (molecular and supra-molecular). Consequently, Cclv is a generalization of Cwien. However, Cwien is calculated at the chemical structure level, whereas Cclv is calculated as the difference between the Wiener indices of the complete and the one-node-deleted supra-molecular complex network. Many theoretical developments of Wiener-type indices have been reported by Ivanciuc and others [61]. In this section, we investigate the effect of the combined use of mTIs and their higher-level analogues, the smnTIs.

The best model found using all TIs mixed was model (E) (Table 3). This model has a total of six descriptors. Two of them are entropy-type mTIs based on Markov chains: the tenth- and eighth-order entropies (CEk). Three other descriptors are classic mTIs: CMTI, the Sum of Valence Degrees (Csvd) and the Total Valence Connectivity (Ctvc). Here, the results again confirm the high importance of entropy-type measures for molecular and/or supra-molecular structures reported by other authors, such as Graham et al. [62]. Finally, only one descriptor, the Centroid Value (Ccen), encodes supra-molecular information. This model presents the same specificity (91.67%), sensitivity (95.56%) and accuracy (94.74%) as model (A), but the values of the statistical parameters U (0.24) and F (26.88) demonstrate that model (E) has a superior quality and significance in the correct classification of the compounds (Table 3). Although the differences are apparently small, Table 4 clearly shows that the probabilities with which model (E) correctly classifies the compounds are higher than those of model (A). A similar behavior can be observed in the stability, robustness and predictive power of the model evaluated in the training and prediction series (Table 3). However, from a practical point of view, these improvements do not justify the use of model (E) instead of (A), for two fundamental reasons: the number of variables (it uses one more variable) and the fact that the calculation of the smnTIs requires much more time and resources. The use of this mixed model would only be justified if it gave additional information that cannot be obtained in another way. In the present case, it is of major relevance that models including smnTIs allow us to predict the suitability of different conformations of the compounds, which is not possible using mTIs alone.

3.3.1. Effect of the starting conformation on classification
Taking advantage of the fact that we used two or more possible starting conformations to calculate the smnTIs of the angelicins, we decided to study whether model (E) is able to differentiate between the contributions of the starting conformations to the correct classification of their respective compounds. To carry out the study, we used these 154 starting conformations as an external prediction series, assigning to each of them the mTIs of its respective compound. The appropriate parameter to quantify the contribution of a given starting conformation to the correct classification of its compound is the posterior probability [63] with which the model classifies this conformation in the activity group of its compound. The posterior probabilities for each starting conformation (p conf) in the external prediction series are depicted in Figure 4, which shows that only 14% of the starting conformations used contribute to the correct classification of their respective compound, with a probability ≤ 0.5.
Figure 4. Accumulative distribution of the posterior probabilities with which the model (E) classifies the conformations in the activity group of their respective compounds.
In order to differentiate the starting conformations geometrically, we used an indicator of the “good” overlap between the furocoumarin derivative and the thymine double bond [64, 65]. To this end, we decided to combine both the insertion degree and the molecule rotation in a single parameter. We defined this geometric parameter as the Deviation of the Overlapping Geometry (DOG), which is obtained by dividing the distance between the geometric centers of the double bonds (in the plane projection) by the cosine of the molecule rotation angle. Table 7 shows the values of distance, angle and DOG for the ten typical cases of the conformations used. Figure 5 shows the contribution of each conformation according to its DOG value for compounds 22 (non-active) [66] and 38 (active) [67]. In the case of compound 22 (graph A), we can observe that the conformation guided by the furan side presents a slightly higher contribution than the pyrone-side conformation with the same DOG value. This is in agreement with the fact that the compound bears a substituent at the C3 carbon and, consequently, the furan-side approach is less hindered. In this case, the DOG increase in the j-conformations is due only to the increase in the insertion degree of the molecule (the compound was not rotated), and this movement into the DNA causes an increase in the contribution to the correct classification of compound 22. However, in the case of compound 38, as depicted in graph B, the contribution increases as the molecule goes into the DNA starting from the conformation with DOG = -1 (p conf = 0.004), reaching a maximum value of 0.11 at the conformation with DOG = -0.5. After this position the contribution falls to 0.01. This behavior could be the result, among other factors, of an increase in steric hindrance, given the larger size of the substituent at the C3 carbon.

Table 7. Geometric parameters and representations for starting conformations used.
[Structural representations of the ten starting conformations.]

| Case | Dist. (a) | Ang. (a) | DOG (a) |
|---|---|---|---|
| 1 | 1 | 0° | 1 |
| 2 | 1 | 45° | 1.4 |
| 3 | 0.5 | 0° | 0.5 |
| 4 | 0.5 | 45° | 0.7 |
| 5 | 0 | 0° | 0 |
| 6 | -0.5 | 0° | -0.5 |
| 7 | -0.5 | 45° | -0.7 |
| 8 | -1 | 0° | -1 |
| 9 | -1 | 45° | -1.4 |
| 10 | 1 | 0° | 1 |

(a) Dist.: discrete distance between the geometric centers of the two double bonds (in the plane projection). The possible absolute values are 0, 0.5 and 1. The value is positive if the compound was moved into the DNA pocket and negative if it was moved out of the DNA. Ang.: magnitude of the clockwise rotation of the compound (0° or 45°). DOG: the Deviation of the Overlap Geometry, which we defined as DOG = dist/cos(ang). Its sign has the same meaning as in the case of the distance.
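The DOG values of Table 7 follow from the definition DOG = dist/cos(ang). A minimal Python sketch, assuming the angle is given in degrees and using a hypothetical function name `dog`, reproduces them to one decimal place:

```python
from math import cos, radians

def dog(distance, angle_deg):
    """Deviation of the Overlap Geometry: distance divided by cos(rotation angle)."""
    return distance / cos(radians(angle_deg))

# (distance, angle in degrees) pairs for the distinct cases listed in Table 7.
cases = [(1, 0), (1, 45), (0.5, 0), (0.5, 45), (0, 0),
         (-0.5, 0), (-0.5, 45), (-1, 0), (-1, 45)]
for dist, ang in cases:
    print(f"dist = {dist:5}, ang = {ang:2} deg, DOG = {dog(dist, ang):5.1f}")
```

Because cos(45°) is about 0.707, a 45° rotation scales the distance by a factor of about 1.41, which is why the rotated conformations take the values ±0.7 and ±1.4.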
Figure 5. Contribution of the starting conformations to the correct classification of the compound 22 (graph A) and 38 (graph B) using model (E).
4. Conclusions
Using Monte Carlo molecular dynamics trajectories or energetic profiles, it is possible to construct supra-molecular complex networks in which drug-DNA complex conformations play the role of nodes. In our case, two nodes were connected when their energetic profiles were correlated with r ≥ 0.976. Different smnTIs that encode supra-molecular information can be obtained by means of the node centrality measures of these networks. The smnTIs can increase the classification and predictive power of QSAR models based on mTIs. In the classification of furocoumarins and their aza-analogues according to their antiproliferative activities in Ehrlich cells, the supra-molecular information provided by Ccen was responsible for the increase in the quality and significance of the mixed model in the correct classification of the compounds, as compared to the model based on entropy-type mTIs. Although the calculation of the smnTIs requires a lot of time and resources, they allow us to evaluate or predict the contribution of a complex conformation to the compound activity [68].

Acknowledgments
Pérez-Montoto, L.G. acknowledges a scholarship from the University of Santiago de Compostela for foreign Latin American students in Spain. González-Díaz, H. thanks the Isidro Parga Pondal Program, supported by the Xunta de Galicia, and European research funds from the Fondo Social Europeo (F.S.E.).
References
1. Santana L, Uriarte E, Roleira F, Milhazes N, Borges F. Furocoumarins in medicinal chemistry. Synthesis, natural occurrence and biological activity. Curr Med Chem. 2004 Dec;11(24):3239-61.
2. Pathak MA, Fitzpatrick TB. The evolution of photochemotherapy with psoralens and UVA (PUVA): 2000 BC to 1992 AD. J Photochem Photobiol B. 1992 Jun 30;14(1-2):3-22.
3. Parrish JA, Stern RS, Pathak MA, Fitzpatrick TB. In: Regan JD, Parrish JA, eds. The Science of Photomedicine. New York: Plenum Press 1982:595.
4. Dall'Acqua F, Vedaldi D, Baccichetti F, Bordin F, Averbeck D. Photochemotherapy of skin-diseases: comparative studies on the photochemical and photobiological properties of various mono- and bifunctional agents. Farmaco [Sci]. 1981 Jul;36(7):519-35.
5. Tessman JW, Isaacs ST, Hearst JE. Photochemistry of the furan-side 8-methoxypsoralen-thymidine monoadduct inside the DNA helix. Conversion to diadduct and to pyrone-side monoadduct. Biochemistry (Mosc). 1985 Mar 26;24(7):1669-76.
6. Pathak MA, Parrish JA, Fitzpatrick TB. Psoralens in photochemotherapy of skin diseases. Farmaco [Sci]. 1981 Jul;36(7):479-91.
7. Zagotto G, Gia O, Baccichetti F, Uriarte E, Palumbo M. Synthesis and photobiological properties of 4-hydroxymethyl-4'-methylpsoralen derivatives. Photochem Photobiol. 1993 Oct;58(4):486-91.
8. Musajo L, Visentini P, Baccichetti F, Razzi MA. Photoinactivation of Ehrlich Ascites Cells in vitro Obtained with Skin-Photosensitizing Furocoumarins. Experientia. 1967;23:335-6.
9. Giordanetto F, Fossa P, Menozzi G, Mosti L. In silico rationalization of the structural and physicochemical requirements for photobiological activity in angelicine derivatives and their heteroanalogues. J Comput Aided Mol Des. 2003 Jan;17(1):53-64.
10. Prado-Prado FJ, González-Díaz H, Martinez de la Vega O, Ubeira FM, Chou KC. Unified QSAR approach to antimicrobials. Part 3: First multi-tasking QSAR model for Input-Coded prediction, structural back-projection, and complex networks clustering of antiprotozoal compounds. Bioorg Med Chem. 2008;16:5871–80.
11. Todeschini R, Consonni V. Handbook of Molecular Descriptors: Wiley-VCH 2002.
12. Balaban AT. QAPR/QSAR Studies by Molecular Descriptors. New York: Huntington 2000.
13. Chou KC. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics. 2005 Jan 1;21(1):10-9.
14. Mason O, Verwoerd M. Graph theory and networks in Biology. IET systems biology. 2007 Mar;1(2):89-119.
15. Krishnan A, Zbilut JP, Tomita M, Giuliani A. Proteins as networks: usefulness of graph theory in protein science. Curr Protein Pept Sci. 2008 Feb;9(1):28-38.
16. Kayser K, Gabius HJ. Graph theory and the entropy concept in histochemistry. Theoretical considerations, application in histopathology and the combination with receptor-specific approaches. Prog Histochem Cytochem. 1997;32(2):1-106.
17. Glassman RB. Topology and graph theory applied to cortical anatomy may help explain working memory capacity for three or four simultaneous items. Brain Res Bull. 2003 Apr 15;60(1-2):25-42.
18. Garcia-Domenech R, Galvez J, de Julian-Ortiz JV, Pogliani L. Some new trends in chemical graph theory. Chem Rev. 2008 Mar;108(3):1127-69.
19. Chou KC. Applications of graph theory to enzyme kinetics and protein folding kinetics. Steady and non-steady-state systems. Biophys Chem. 1990 Jan;35(1):1-24.
20. Estrada E, Uriarte E. Recent advances on the role of topological indices in drug discovery research. Curr Med Chem. 2001;8:1573-88.
21. González-Díaz H, Vilar S, Santana L, Uriarte E. Medicinal Chemistry and Bioinformatics – Current Trends in Drugs Discovery with Networks Topological Indices. Curr Top Med Chem. 2007;7(10):1025-39.
22. González-Díaz H, González-Díaz Y, Santana L, Ubeira FM, Uriarte E. Proteomics, networks and connectivity indices. Proteomics. 2008;8:750-78.
23. Estrada E. Generalization of topological indices. Chem Phys Lett. 2001;336:248-52.
24. Bonchev D, Buck GA. From molecular to biological structure and back. Journal of chemical information and modeling. 2007 May-Jun;47(3):909-17.
25. Haggarty SJ, Clemons PA, Schreiber SL. Chemical Genomic Profiling of Biological Networks Using Graph Theory and Combinations of Small Molecule Perturbations. J Am Chem Soc. 2003;125(35):10543-5.
26. Zhang W. Computer inference of network of ecological interactions from sampling data. Environ Monit Assess. 2007 Jan;124(1-3):253-61.
27. Zhang S, Golbraikh A, Tropsha A. Development of Quantitative Structure-Binding Affinity Relationship Models Based on Novel Geometrical Chemical Descriptors of the Protein-Ligand Interfaces. J Med Chem. 2006;49:2713-24.
28. Cuissart B, Touffet F, Cremilleux B, Bureau R, Rault S. The maximum common substructure as a molecular depiction in a supervised classification context: experiments in quantitative structure/biodegradability relationships. J Chem Inf Comput Sci. 2002 Sep-Oct;42(5):1043-52.
29. Andrews CW, Bennett L, Yu LX. Predicting human oral bioavailability of a compound: development of a novel quantitative structure-bioavailability relationship. Pharm Res. 2000 Jun;17(6):639-44.
30. Hetenyi C, Paragi G, Maran U, Timar Z, Karelson M, Penke B. Combination of a modified scoring function with two-dimensional descriptors for calculation of binding affinities of bulky, flexible ligands to proteins. J Am Chem Soc. 2006 Feb 1;128(4):1233-9.
31. Lill MA, Vedani A, Dobler M. Raptor: combining dual-shell representation, induced-fit simulation, and hydrophobicity scoring in receptor modeling: application toward the simulation of structurally diverse ligand sets. J Med Chem. 2004 Dec 2;47(25):6174-86.
32. Smith R, Hubbard RE, Gschwend DA, Leach AR, Good AC. Analysis and optimization of structure-based virtual screening protocols. (3). New methods and old problems in scoring function design. J Mol Graph Model. 2003 Sep;22(1):41-53.
33. Wang R, Lu Y, Wang S. Comparative evaluation of 11 scoring functions for molecular docking. J Med Chem. 2003 Jun 5;46(12):2287-303.
34. Ferrara P, Gohlke H, Price DJ, Klebe G, Brooks CL, 3rd. Assessing scoring functions for protein-ligand interactions. J Med Chem. 2004 Jun 3;47(12):3032-47.
35. Gia O, Marciani Magno S, González-Díaz H, Quezada E, Santana L, Uriarte E, et al. Design, synthesis and photobiological properties of 3,4-cyclopentenepsoralens. Bioorg Med Chem. 2005 Feb 1;13(3):809-17.
36. Baccichetti F, Bordin F, Simonato M, Toniolo L, Marzano C, Rodighiero P, et al. Photobiological activity of certain new methylazapsoralens. Il Farmaco. 1992 Dec;47(12):1529-41.
37. Antonello C, Zagotto G, Mobilio S, Marzano C, Gia O, Uriarte E. Synthesis and characterization of new methylpsoralens as potential photochemotherapeutic agents. Il Farmaco. 1994 Apr;49(4):277-80.
38. CambridgeSoft.Corporation. Chem3D Ultra software, Molecular Modeling and Analysis. 8.0 ed. Cambridge, MA. USA: CambridgeSoft 2005.
39. González-Díaz H, Molina-Ruiz R, Hernández I. MARCH-INSIDE version 2.0 (Markovian Chemicals In Silico Design). 2.0 ed. gonzalezdiazh@yahoo.es: Main author information requesting contact email 2005.
40. Junker BH, Koschuetzki D, Schreiber F. Exploration of biological network centralities with CentiBiN. BMC Bioinformatics. 2006 Apr 21;7(1):219.
41. Koschützki D. CentiBiN Version 1.4.2. 2006:CentiBiN Version 1.4.2, Centralities in Biological Networks © 2004-6 Dirk Koschützki Research Group Network Analysis, IPK Gatersleben, Germany. 42. Estrada E. Virtual identification of essential proteins within the protein interaction network of yeast. Proteomics. 2006 Jan;6(1):35-40. 43. Eichman BF, Mooers BH, Alberti M, Hearst JE, Ho PS. The crystal structures of psoralen cross-linked DNAs: drug-dependent formation of Holliday junctions. J Mol Biol. 2001 Apr 20;308(1):15-26. 44. Hypercube.Inc. Hyperchem software. Release 7.5 for windows, Molecular Modeling System. Gainesville, FL, USA: Hypercube Inc. 2002. 45. Clark T. A Handbook of Computational Chemistry. New York: John Wiley & Sons 1985. 46. Kitamura N, Kohtani S, Nakagaki R. Molecular aspects of furocoumarin reactions: Photophysics, photochemistry, photobiology, and structural analysis. J Photochem Photobiol C: Photochem Rev. 2005;6:168-85. 47. Cimino GD, Gamper HB, Isaacs ST, Hearst JE. Psoralen as photoactive probes of nucleic acid structure and function: organic chemistry, photochemistry, and biochemistry. Ann Rev Biochem. 1985;54:1151-93. 48. Caffieri S, Miolo G, Dall'Acqua F, Benetollo F, Bombieri G. Photoaddition of 4,6dimethyltetrahydrobenzoangelicin to thymine in DNA: X-ray studies and experiments with model oligonucleotides. Photochem Photobiol. 2000 Jul;72(1):23-7. 49. Tominaga Y, Jorgensen WL. General model for estimation of the inhibition of protein kinases using Monte Carlo simulations. J Med Chem. 2004 May 6;47(10):2534-49. 50. StatSoft.Inc. STATISTICA (data analysis software system), version 6. www.statsoft.com. 6.0 ed. Tulsa, OK, USA: StatSoft. Inc. 2001. 51. Voy BH, Scharff JA, Perkins AD, Saxton AM, Borate B, Chesler EJ, et al. Extracting gene networks for low-dose radiation using graph theoretical algorithms. PLoS Comput Biol. 2006 Jul 21;2(7):e89. 52. Ponce YM, Hassan Khan MT, Martin GMC, Ather A, Sultankhodzhaev MN, Torrens F, et al. 
Atom-based 2D quadratic indices in drug discovery of novel tyrosinase inhibitors: results of In Silico studies supported by experimental results. QSAR Comb Sci. 2007;26(4):469-87. 53. González-Díaz H, Olazabal E, Castanedo N, Sanchez IH, Morales A, Serrano HS, et al. Markovian chemicals "in silico" design (MARCH-INSIDE), a promising approach for computer aided molecular design II: experimental and theoretical assessment of a novel method for virtual screening of fasciolicides. J Mol Model 2002 Aug;8(8):237-45. 54. Marrero-Ponce Y, Castillo-Garit JA, Olazabal E, Serrano HS, Morales A, Castanedo N, et al. Atom, atom-type and total molecular linear indices as a promising approach for bioorganic and medicinal chemistry: theoretical and experimental assessment of a novel method for virtual screening and rational design of new lead anthelmintic. Bioorg Med Chem. 2005 Feb 15;13(4):1005-20. 55. Garcia-Domenech R, de Julian-Ortiz JV. Antimicrobial activity characterization in a heterogeneous group of compounds. J Chem Inf Comput Sci. 1998 MayJun;38(3):445-9.
56. González MP, Terán C, Teijeira M. Search for NewAntagonist Ligands forAdenosine Receptors from QSAR Point of View. How CloseAreWe? Med Res Rev. 2007;DOI 10.1002/med.20108. 57. Graham DJ. Information Content in Organic Molecules: Structure Considerations Based on Integer Statistics. J Chem Inf Comput Sci. 2002;42:215. 58. Marrero-Ponce Y, Diaz HG, Zaldivar VR, Torrens F, Castro EA. 3D-chiral quadratic indices of the 'molecular pseudograph's atom adjacency matrix' and their application to central chirality codification: classification of ACE inhibitors and prediction of sigma-receptor antagonist activities. Bioorg Med Chem. 2004 Oct 15;12(20):5331-42. 59. Marrero-Ponce Y, Marrero RM, Torrens F, Martinez Y, Bernal MG, Zaldivar VR, et al. Non-stochastic and stochastic linear indices of the molecular pseudograph's atom-adjacency matrix: a novel approach for computational in silico screening and "rational" selection of new lead antibacterial agents. J Mol Model (Online). 2005 Nov 4:1-17. 60. Patankar SJ, Jurs PC. Classification of inhibitors of protein tyrosine phosphatase 1B using molecular structure based descriptors. J Chem Inf Comput Sci. 2003 May-Jun;43(3):885-99. 61. Ivanciuc O. QSAR comparative study of Wiener descriptors for weighted molecular graphs. J Chem Inf Comput Sci. 2000 Nov-Dec;40(6):1412-22. 62. Graham DJ. Information Content in Organic Molecules: Brownian Processing at Low Levels. Journal of chemical information and modeling. 2007;47(2):376-89. 63. González-Díaz H, Cruz-Monteagudo M, Molina R, Tenorio E, Uriarte E. Predicting multiple drugs side effects with a general drug-target interaction thermodynamic Markov model. Bioorg Med Chem. 2005 Feb 15;13(4):1119-29. 64. Gia O, Anselmo A, Pozzan A, Antonello C, Magno SM, Uriarte E. Some new methyl-8-methoxypsoralens: synthesis, photobinding to DNA, photobiological properties and molecular modelling. Il Farmaco. 1997 Jun-Jul;52(6-7):389-97. 65. Boggia R, Fanciullo M, Finzi L, Incani O, Mosti L. 
DNA - psoralens molecular recognition using molecular dynamics. Il Farmaco. 1999;54(4):202-12. 66. Iester M, Fossa P, Menozzi G, Mosti L, Baccichetti F, Marzano C, et al. Synthesis and photobiological properties of 3-acylangelicins, 3alkoxycarbonylangelicins and related derivatives. Il Farmaco. 1995 Oct;50(10):669-78. 67. Vedaldi D, Dall'Acqua F, Baccichetti F, Carlassare F, Bordin F, Rodighiero P, et al. Methylangelicins: structure activity studies on the role of methyl groups present in 3,4 and 4',5' photoreactive sites. Il Farmaco. 1991 Nov;46(11):1381-406. 68. Overton E. Studien uber die Narkose Zugleich ein Beitrag zur Allgemeine Phamakologie Germany 1901. 69. Palumbo M, Baccichetti F, Antonello C, Gia O, Capozzi A, Magno SM. Photobiological activity of 3,4'-dimethyl-8-methoxypsoralen, a linear furocoumarin with unusual DNA-binding properties. Photochem Photobiol. 1990 Sep;52(3):533-40. 70. Gia O, Uriarte E, Zagotto G, Baccichetti F, Antonello C, Marciani-Magno S. Synthesis and photobiological activity of new methylpsoralen derivatives. J Photochem Photobiol B. 1992 Jun 30;14(1-2):95-104.
71. Bordin F, Baccichetti F, Carlassare F, Peron M, Dall'Acqua F, Vedaldi D, et al. Pre-clinical evaluation of new antiproliferative agents for the photochemotherapy of psoriasis: angelicin derivatives. Il Farmaco [Sci]. 1981 Jul;36(7):506-18. 72. Vedaldi D, Caffieri S, Miolo G, Dall'Acqua F, Baccichetti F, Guiotto A, et al. Azapsoralens: new potential photochemotherapeutic agents for psoriasis. Il Farmaco. 1991 Dec;46(12):1407-33. 73. Rodighiero G, Dall´Acqua F, Phatak MA. Photobiological properties of monofunctional furocoumarin derivatives. Top Photomed. 1984:319-98. 74. Marzano C, Chilin A, Bordin F, Baccichetti F, Guiotto A. DNA damage and biological effects induced by photosensitization with new N(1)-unsubstituted furo[2,3-h]quinolin-2(1H)-ones. Bioorg Med Chem. 2002 Sep;10(9):2835-44. 75. Chilin A, Marzano C, Baccichetti F, Simonato M, Guiotto A. 4-Hydroxymethyland 4-methoxymethylfuro[2,3-h]quinolin-2(1H)-ones: synthesis and biological properties. Bioorg Med Chem. 2003 Apr 3;11(7):1311-8.
Topological Indices for Medicinal Chemistry, Biology, Parasitology, Neurological and Social Networks, 2010: 123-144 ISBN: 978-81-7895-489-9 Editors: Humberto González-Díaz and Cristian Robert Munteanu
7. Entropy analysis of enzymes with QSAR, partial order, and 3D-contact networks
Riccardo Concu1,2, Gianni Podda1, Bairong Shen2 and Humberto González-Díaz3
1Dipartimento Farmaco Chimico Tecnologico, Facoltà di Farmacia, Università degli Studi di Cagliari, Cagliari 09124, Italy; 2Center for Systems Biology, Soochow University, Room-101, Wegetang, No1. Shizi street, Suzhou, Jiangsu 215006, China; 3Department of Microbiology and Parasitology, USC, Santiago de Compostela, 15782, Spain
Abstract. In previous work we introduced, for the first time, the Markovian Backbone Negentropies (MBN) descriptor to model the effect on protein stability of a complete set of alanine substitutions in the Arc repressor. In this work we apply the Markov chain approach to a total of 1371 proteins, divided into 689 enzymes and 682 non-enzymes, by means of Linear Discriminant Analysis (LDA) using MBN as the molecular descriptor; this descriptor is based on a Markov chain model of electron delocalization throughout the protein backbone. With this approach we address the problem of predicting the function of proteins with a known 3D structure but an unknown function. To refine the investigation we also perform an orthogonalization analysis, Tanimoto and Chi-square tests, and a partial order ranking. The database was retrieved from the work of Dobson & Doig (J. Mol. Biol. 2003, 330, 771–783), and all proteins were collected from the Protein Data Bank. The best model found was a linear model obtained with LDA, which was able to correctly classify 76.86% of the proteins using only 2 entropy potentials. In addition, we define 3D-HINT potentials (μk) and use them for the first time to derive a classifier for protein enzymes. In closing, the MBN approach allows fast calculation and comparison of different potentials, leading to accurate protein 3D structure-function relationships.
Correspondence/Reprint request: Dr. González-Díaz H, Department of Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela (USC), Santiago de Compostela, 15782, Spain. E-mail: humberto.gonzalez@usc.es or gonzalezdiazh@gmail.com
Introduction
In recent years the number of proteins with a defined 3D structure but an unknown function has grown considerably. Consequently, many research groups use computational methods to study the relationship between the structure and the function of proteins. The methods applied in this field are varied. For example, Dobson & Doig (D&D) showed that proteins can be predicted as enzymatic or non-enzymatic from their spatial structure, without resorting to alignments. In that paper the authors used 52 protein features and a non-linear support vector machine (SVM) to classify more than 1000 proteins with 77% accuracy [1]. In addition to the work of D&D, there are other SVM models able to distinguish between enzymatic and non-enzymatic proteins. Most of them have a lower overall accuracy than our model; for instance, the model of Munteanu, which is also based on topological indices, has an overall accuracy of around 70% [2]. Lu et al. recently proposed an automatic enzyme classifier based on SVM [3]; using a dataset of 7329 proteins, they obtained a classifier able to predict proteins with 90% overall accuracy, but they did not report the number of variables, the protein features used by the SVM, or other model specifications. The same paper discusses other models, such as that of Cai and Chou [4], who applied a protein functional domain composition approach to the problem and reached a success rate of about 85%, although with a dataset consisting of sequences with no more than 20% identity. Along these lines, a group of researchers recently published a review on the growing importance of machine learning methods for predicting protein functional class independently of sequence similarity [5]. These methods often use as input 1D sequence numerical parameters, specifically defined to seek sequence-function relationships. On the other hand, many authors have introduced 2D or higher-dimensional representations of protein or DNA sequences prior to the calculation of numerical parameters. We discussed many of these Computational Chemistry and Bioinformatics methods in a recent in-depth review published in Current Topics in Medicinal Chemistry [6]. In any case, all these methods are ultimately based on sequence and not on protein 3D structure.
In our opinion, the 3D Quantitative Structure-Activity Relationships (3D-QSAR) methodology, usually applied to small-sized molecules, may become an alternative to sequence methods for 3D structure-function predictions in proteins. A major factor for the future will necessarily be the growth of the database of experimentally determined protein structures through structural genomics projects. In this context, the ability to predict protein function from the experimentally determined 3D structure is becoming increasingly important, because the number of resolved structures is growing more rapidly than our capacity to study their function [6]. Furthermore, entropy theory has already been applied in many fields, for example to encode the structure governing biological activity [7-9], to distinguish between natural products and synthetic molecules [10], in the computational design of proteins [11], and in the detection of protein homology [12]. Until now, however, entropy theory has never been applied in 3D-QSAR to build a model able to distinguish enzymes from non-enzymes using only the 3D structure of the proteins. For all these reasons we decided to approach the problem using the Markovian Backbone Negentropies (MBN); this approach is a Markov Chain Model (MCM) of the intra-molecular movement of electrons. The method has since been renamed MARkov CHains INvariants for SImulation & DEsign (MARCH-INSIDE 2.0), which describes its broad uses more adequately. A very recent review, published in Proteomics in 2008 [13], showed the need to give general formulae for the different kinds of molecular fields by incorporating other important factors such as the hydrophobic interaction field (HINT); this makes it possible to carry out comparative studies and to determine the influence of the different fields on biological function.
Here we give general MCM formulae to calculate local averages of 3D potentials considering neighbouring amino acids at different distances, including the electrostatic eΘk(R), van der Waals vΘk(R), and HINT hΘk(R) fields, and we develop a 3D-QSAR prediction of enzymes. The inclusion of the HINT potential is new in this paper. The work is based on essentially the same dataset used by D&D, composed of 1371 proteins (689 enzymes and 682 non-enzymes). We calculated the three types of total and local potentials for four regions of the protein structure (core, inner, middle, and surface). The simplest model found was able to classify 76.86% of the proteins using only two entropy potentials. We performed an orthogonalization test to investigate the meaning of the variables, because in such cases collinear variables are likely; the method chosen is that of Randić [14]. Finally, we performed a Partial Order (PO) ranking, because orders based on a single ranking attribute are simple but may fail to describe all the relevant sample features in complex cases such as DNA sequences, protein structures, proteomics maps or gene microarray data. In these cases it may be more useful to construct a PO based on more than one ranking parameter (xi). These POs can be represented with so-called Hasse diagrams, which are also graph or network-like representations. The nodes of a Hasse diagram are the samples or cases (here, proteins), and the edges express ordering instead of similarity/dissimilarity relationships. Figure 1 summarizes the steps of this work. In closing, the use of MCM allows fast calculation and comparative study of different types of potentials, and the investigation of protein structure-function relationships. The advantage in this sense is similar to that of Comparative Molecular Field Analysis (CoMFA) studies for small-sized drugs [15].
Figure 1. Graphical abstract.
Materials and methods
Markov Chain Model (MCM)
In previous works we used the following types of MCM entropy measures for protein QSAR prediction: Electrostatic Entropy and vdW Entropy. In this paper we use, for the first time, the HINT Entropy type of MCM invariants. The method uses as a source of protein descriptors the stochastic matrices 1Πf, built up as square matrices (n × n), where n is the number of amino acids (aa) in the protein. In what follows we give the general formula to calculate MCM entropy measures of any potential, as well as the formulae for specific fields [10]:
[Equations (1), (2), (3), (4) and (1d): not recoverable from the source text]
The superscript f indicates the type of molecular force field. The stochastic matrices used may encode any potential field, including the most common ones: a) short-term vibrations field (f = vib, based on aa vibrations, fj = νj) with stochastic matrix 1Πvib; b) electrostatic (f = eΘ, fj = ϕj) with 1Πe; c) van der Waals (f = vΘ, fj = vΘj) with 1Πvdw; d) HINT (f = hΘ, fj = hj) with 1Πh. To extend the method, we can consider a hypothetical situation in which every jth aa has a general potential fj at an arbitrary initial time (t0). All these potentials can be listed as elements of the vector 0ϕf. It can be supposed that, after this initial situation, every aai interacts with every other aaj in the protein with interaction energy 1Eij. For the sake of simplicity, a truncation function αij is applied in such a way that, in a first approximation, a short-term interaction takes place only between neighbouring aa (αij = 1); otherwise the interaction is forbidden (αij = 0). Neglecting direct interactions between distant aa in 1Πf does not exclude the possibility that potential interactions propagate between those aa within the protein backbone in an indirect manner. Consequently, in the present model long-range interactions are possible (not forbidden) but are estimated indirectly using the natural powers of 1Πf. MCM theory thus provides a simple and fast model to calculate the average values Ψk, considering the indirect interaction between an aaj and another aai after previous interaction of aaj with k other neighbouring amino acids. It is remarkable that the average general potentials Ψk(f) depend on the absolute probabilities Apk(j) with which the amino acids interact with other amino acids, and on the order k. The potential Ψk also depends on the initial unperturbed potential of the amino acid. In the equations above, the Apk(j) values are calculated from the vector of absolute initial probabilities, 0πf, and the matrix 1Πf by means of the Chapman–Kolmogorov equations. In particular, evaluating these expansions for k = 0 gives the initial average unperturbed electrostatic potential (eΘ0); k = 1 gives the short-range potential (eΘ1), k = 2 the middle-range potential (eΘ2), and k = 3 the long-range one. To carry out the calculations referred to in equations (1) for any kind of potential, and detailed in (2a), (2b), (2c) and (2d), the elements (1pij) of 1Πf and the absolute initial probabilities Ap0(j) were calculated as follows (see equations 3 and 4 for the general case, 3a and 4a for the particular electrostatic potential case, and 3b and 4b for the vdW case) [10]:
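$$ {}^{1}p_{ij} = \frac{\alpha_{ij} \cdot E_{ij}}{\sum_{m=1}^{\delta+1} \alpha_{im} \cdot E_{im}} = \frac{\alpha_{ij} \cdot \psi_j(w_i, w_j, d_{ij})}{\sum_{m=1}^{\delta+1} \alpha_{im} \cdot \psi_m(w_i, w_m, d_{im})} = \frac{\alpha_{ij} \cdot \phi_j}{\sum_{m=1}^{\delta+1} \alpha_{im} \cdot \phi_m} \tag{3} $$

$$ {}^{1}p_{ij} = \frac{\alpha_{ij} \cdot E_{ij}}{\sum_{m=1}^{\delta+1} \alpha_{im} \cdot E_{im}} = \frac{\alpha_{ij} \cdot \dfrac{q_i \cdot q_j}{d_{ij}}}{\sum_{m=1}^{\delta+1} \alpha_{im} \cdot \dfrac{q_i \cdot q_m}{d_{im}}} = \frac{\alpha_{ij} \cdot \dfrac{q_j}{d_{ij}}}{\sum_{m=1}^{\delta+1} \alpha_{im} \cdot \dfrac{q_m}{d_{im}}} = \frac{\alpha_{ij} \cdot \phi_j}{\sum_{m=1}^{\delta+1} \alpha_{im} \cdot \phi_m} \tag{3a} $$

$$ {}^{1}p_{ij} = \frac{\alpha_{ij} \cdot E_{ij}}{\sum_{m=1}^{\delta+1} \alpha_{im} \cdot E_{im}} = \frac{\alpha_{ij} \cdot \left( \dfrac{a_j}{d_{ij}^{12}} - \dfrac{b_j}{d_{ij}^{6}} \right)}{\sum_{m=1}^{\delta+1} \alpha_{im} \cdot \left( \dfrac{a_m}{d_{im}^{12}} - \dfrac{b_m}{d_{im}^{6}} \right)} \tag{3b} $$

$$ {}^{A}p_0(j) = \frac{f(w_j, d_{0j})}{\sum_{m=1}^{n} f(w_m, d_{0m})} \tag{4} $$

$$ {}^{A}p_0(j) = \frac{\dfrac{q_j}{d_{0j}}}{\sum_{m=1}^{n} \dfrac{q_m}{d_{0m}}} \tag{4a} $$

$$ {}^{A}p_0(j) = \frac{\dfrac{a_j}{d_{0j}^{12}} - \dfrac{b_j}{d_{0j}^{6}}}{\sum_{m=1}^{n} \left( \dfrac{a_m}{d_{0m}^{12}} - \dfrac{b_m}{d_{0m}^{6}} \right)} \tag{4b} $$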
where wi are the weights or parameters of the amino acid related to the specific potential field, and f is a non-negative potential function of w and d. For instance, qi and qj are the electronic charges of the ith and jth aa, and aj and bj are the vdW field parameters of the jth aa. The neighbourhood relationship (truncation function αij = 1) was turned on if the two amino acids participate in a peptidic hydrogen bond or if dij < dcut-off = 5 Å [10]. Truncation of the molecular field is commonly applied to simplify the calculations in large biological systems. The distance dij is the Euclidean distance between the Cα atoms of the two amino acids, and d0j is the distance between the amino acid and the centre of charge of the protein. Both kinds of distances were derived from the x, y, and z coordinates of the amino acids collected from the protein PDB files. All calculations were carried out with our in-house software MARCH-INSIDE. For the calculations, all water molecules and metal ions were removed [10]. We can consider the Cα atoms as nodes and depict the whole structure as a complex network in the plane, also known as a protein contact map. The protein contact map is a graph built up from a binary two-dimensional matrix; using this matrix we are able to draw a three-dimensional graph of a protein. For two residues i and j, the ij element of the matrix is 1 if the two residues are closer than a predetermined threshold, and 0 otherwise [16]. In Table 1 we give detailed information for the Cα atoms of one selected protein (1PI2 in this case), and in Figure 2 we illustrate the complex network and the 3D structure of this protein for dij < dcut-off = 5 Å. In this complex network the nodes are the Cα atoms and the edges are the links between two adjacent Cα atoms.
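The Markov chain step described in this section can be illustrated with a short Python sketch. It builds a row-stochastic matrix from a truncation (neighbourhood) matrix and a vector of potentials, following the last form of the transition probability, and then propagates the absolute probabilities with the Chapman–Kolmogorov relation. This is a minimal sketch and not the MARCH-INSIDE implementation: the three-residue chain, the potential values, the function names, the choice of counting each residue as its own neighbour (suggested by the upper limit δ + 1 of the sums), and the Shannon form of the entropy are all assumptions made for illustration.

```python
import math

# Hypothetical three-residue chain 1-2-3; alpha[i][j] = 1 marks neighbours.
# Each residue is assumed to be its own neighbour (alpha[i][i] = 1).
ALPHA = [[1, 1, 0],
         [1, 1, 1],
         [0, 1, 1]]
# Hypothetical non-negative potentials phi_j of the three residues.
PHI = [1.0, 2.0, 1.0]


def transition_matrix(alpha, phi):
    """p_ij = alpha_ij * phi_j / sum_m(alpha_im * phi_m); every row sums to 1."""
    matrix = []
    for row in alpha:
        weights = [a * f for a, f in zip(row, phi)]
        total = sum(weights)
        matrix.append([w / total for w in weights])
    return matrix


def absolute_probabilities(phi, alpha, k):
    """Ap_k = Ap_0 * (1Pi)^k, with Ap_0(j) = phi_j / sum_m(phi_m)."""
    total = sum(phi)
    prob = [f / total for f in phi]
    matrix = transition_matrix(alpha, phi)
    n = len(phi)
    for _ in range(k):
        prob = [sum(prob[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return prob


def shannon_entropy(prob):
    """Assumed entropy form: -sum_j p_j * ln(p_j), skipping zero entries."""
    return -sum(p * math.log(p) for p in prob if p > 0)


def average_potential(prob, phi):
    """Average potential: sum_j Ap_k(j) * phi_j."""
    return sum(p * f for p, f in zip(prob, phi))


for k in range(4):
    p_k = absolute_probabilities(PHI, ALPHA, k)
    print(k, shannon_entropy(p_k), average_potential(p_k, PHI))
```

Although residues 1 and 3 are not direct neighbours in this toy chain, their probabilities influence each other from k = 2 onwards, which is how the natural powers of the matrix account for indirect long-range interactions.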
Linear discriminant analysis
A forward stepwise LDA was carried out for variable selection to build up the models. All the variables included in the models were calculated with the in-house software MARCH-INSIDE [13, 17] and then standardized in order to bring them onto the same scale. Subsequently, a standardized linear discriminant equation was obtained, which allows comparison of the coefficients. The square of the Mahalanobis distance (D2) and Wilks' λ statistic (λ ranges between 0 and 1, with λ = 0 indicating perfect discrimination) were examined in order to assess the discriminatory power of the model. All the LDA models were trained with the software STATISTICA 6.0® [18], for which our laboratory holds rights of use; all the graphics for the LDA analysis and discussion were built with the same program.
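To make the role of Wilks' λ concrete, the following Python sketch standardizes a single descriptor and computes λ as the ratio of within-group to total sum of squares for two groups, together with the corresponding F ratio. It is only an illustration under stated assumptions: the descriptor values are invented, the function names are hypothetical, and the univariate two-group formulas are standard textbook relations rather than the STATISTICA procedure used in this work.

```python
def standardize(values):
    """Bring a descriptor onto a common scale (zero mean, unit sample variance)."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    return [(v - mean) / sd for v in values]


def wilks_lambda(group_a, group_b):
    """Univariate Wilks' lambda: within-group SS divided by total SS."""
    pooled = group_a + group_b
    grand_mean = sum(pooled) / len(pooled)
    ss_total = sum((v - grand_mean) ** 2 for v in pooled)
    ss_within = 0.0
    for group in (group_a, group_b):
        mean = sum(group) / len(group)
        ss_within += sum((v - mean) ** 2 for v in group)
    return ss_within / ss_total


def f_ratio(lam, n_cases, n_vars=1):
    """Two-group relation: F = (1 - lambda) / lambda * (N - p - 1) / p."""
    return (1.0 - lam) / lam * (n_cases - n_vars - 1) / n_vars


# Hypothetical descriptor values for non-enzymes and enzymes.
non_enzymes = [1.0, 2.0, 3.0]
enzymes = [4.0, 5.0, 6.0]
lam = wilks_lambda(non_enzymes, enzymes)
print(lam, f_ratio(lam, n_cases=6))
```

Because λ is a ratio of sums of squares, it does not change when the descriptor is standardized; standardization matters for comparing the coefficients of the discriminant equation, not for λ itself.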
Table 1. 1PI2 serine proteinase inhibitor.
Naa  Na   aa type  x       y       z       δ    Naa  Na   aa type  x       y       z       δ
1    2    TYR      47.516  56.711  33.951  1    32   237  CYS      35.348  62.996  33.298  2
2    14   SER      44.468  58.276  35.796  2    33   243  LYS      31.718  63.788  34.509  2
3    20   LYS      45.598  61.921  34.927  2    34   252  SER      29.494  61.701  32.006  2
4    29   PRO      43.551  63.922  32.283  2    35   258  CYS      31.305  58.619  30.532  3
5    36   CYS      45.011  64.477  28.713  3    36   264  MET      29.797  56.33   27.818  2
6    42   CYS      43.394  64.316  25.149  2    37   272  CYS      30.741  52.662  27.132  3
7    48   ASP      44.994  63.407  21.711  3    38   278  THR      29.691  49.617  24.997  2
8    56   LEU      42.084  64.325  19.304  2    39   285  ARG      27.862  46.7    26.709  2
9    64   CYS      40.151  67.495  20.442  3    40   296  SER      30.766  44.242  25.95   3
10   70   MET      37.465  69.482  18.5    2    41   302  GLN      34.005  42.773  27.507  2
11   78   CYS      36.028  72.955  19.248  3    42   311  PRO      36.33   44.706  27.351  2
12   84   THR      33.144  75.189  17.827  2    43   318  GLY      34.099  47.822  26.904  4
13   91   ARG      34.196  78.531  16.125  2    44   322  GLN      35.368  50.769  24.71   3
14   102  SER      31.979  80.459  18.711  2    45   331  CYS      34.638  54.168  26.443  3
15   108  MET      33.163  82.626  21.756  2    46   337  ARG      34.5    57.896  25.482  3
16   116  PRO      33.681  81.026  24.36   2    47   348  CYS      34.017  61.22   27.469  3
17   123  PRO      34.493  77.762  22.401  3    48   354  LEU      30.936  63.351  26.469  2
18   130  GLN      32.731  74.382  23.293  3    49   362  ASP      31.832  66.334  28.809  2
19   139  CYS      35.28   71.383  23.102  3    50   370  THR      32.275  69.747  27.02   2
20   145  SER      35.383  67.449  23.597  2    51   377  ASN      34.625  72.592  28.298  2
21   151  CYS      37.615  64.199  22.895  3    52   385  ASP      35.852  75.988  26.811  4
22   157  GLU      38.342  60.27   22.261  2    53   393  PHE      39.441  74.505  26.656  2
23   166  ASP      38.601  56.411  23.219  3    54   404  CYS      41.295  71.307  25.434  2
24   174  ARG      39.784  54.192  26.261  2    55   410  TYR      43.502  69.038  27.759  3
25   185  ILE      39.016  50.457  27.315  3    56   422  LYS      47.202  68.497  26.519  2
26   193  ASN      36.728  49.914  30.5    3    57   431  PRO      48.217  65.448  24.309  3
27   201  SER      37.073  53.474  32.101  2    58   438  CYS      49.007  62.133  26.102  2
28   207  CYS      37.33   57.287  31.28   2    59   444  LYS      52.713  62.197  27.125  2
29   213  HIS      39.753  59.879  32.922  2    60   453  SER      53.909  58.577  26.412  2
30   223  SER      39.278  60.719  36.656  2    61   458  ARG      53.855  58.555  22.547  1
31   229  ASP      38.154  64.301  35.659  2
Naa = number of the amino acid; Na =; aa type = amino acid type; x, y, z = spatial coordinates; δ = number of connections in the network
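The contact map construction can be sketched in Python using the Cα coordinates of the first four residues listed in Table 1. This is a minimal sketch under stated assumptions: only the distance criterion (dij < 5 Å) is applied, so contacts defined by peptidic hydrogen bonds are ignored, and the function names are hypothetical.

```python
import math

# C-alpha coordinates (in angstrom) of the first four residues of 1PI2 (Table 1).
CA_COORDS = [
    (47.516, 56.711, 33.951),  # TYR 1
    (44.468, 58.276, 35.796),  # SER 2
    (45.598, 61.921, 34.927),  # LYS 3
    (43.551, 63.922, 32.283),  # PRO 4
]


def contact_map(coords, cutoff=5.0):
    """Binary matrix: element ij is 1 if residues i and j are closer than cutoff."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                cmap[i][j] = cmap[j][i] = 1
    return cmap


def degrees(cmap):
    """Number of connections (delta) of each node in the network."""
    return [sum(row) for row in cmap]


cmap = contact_map(CA_COORDS)
for row in cmap:
    print(row)
print(degrees(cmap))
```

For this four-residue fragment only sequence neighbours fall within the cut-off. The last residue of the fragment has one connection here, whereas Table 1 reports δ = 2 for PRO 4, because its contact with CYS 5 lies outside the fragment.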
Figure 2. Complex network and 3D structure of the protein 1PI2.
Orthogonalization of the variables
Inspection of the variables included in the model, and of the others used in the partial order ranking, reveals a high level of correlation among them. The main philosophy of the orthogonalization approach is to avoid excluding descriptors on the basis of their collinearity with variables previously included in the model. The interrelatedness among the different variables makes it difficult to know the relative importance of each variable in the final model. Several methods are available to eliminate collinearity and obtain orthogonal variables; in this work we selected the method proposed by Randić et al. [14]. This method is well known, and here we investigate how the orthogonalization order affects the final result.
Partial order
Partial order theory allows a set to be ranked on the basis of two or more attributes. A partially ordered set (poset) consists of a set together with a binary relation that determines the ranking of its elements. Unlike a total order, a partial order may leave some pairs of elements incomparable. A finite poset can be visualized through a so-called Hasse diagram. In Figure 3 we show a simple comparison between a total order and a PO. To build up our partial order ranking we used three sample attributes as inputs for the construction of alternative two-attribute PO schemes. As inputs for the construction of the PO rank we selected three models obtained with the LDA methodology. Different tests and statistics were calculated to compare and assess the quality of these alternative POs, including the two most important: the T(g1, g2) index (Tanimoto's coefficient) and χ2 (Chi-square). The Chi-square is a classic statistic, and the T(g1, g2) index for two alternative POs of two posets (A and B) can be calculated as follows [19]:

$$ T(g_1, g_2) \equiv \frac{A \cap B}{A \cup B} = \frac{\sum_{sr}}{\sum_{sr} + \sum_{rr} + g_1 \cdot \sum_{irA} + g_2 \cdot \sum_{irB}} \tag{5} $$

where g1 and g2 are weights that can take the value of either 0 or 1. To explain the above equation, Sørensen, Brüggemann et al. introduced the following notation for comparable (<A, <B, ≤A, ≤B) or incomparable (║A, ║B) elements in posets A and B, where < and ≤ are the classic symbols for 'less than' and 'less than or equal to', and ║ is the symbol introduced to mark non-comparable elements.
Figure 3. Hasse diagram.
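$$ \sum_{sr}: \text{sum of pairs } x_i <_A x_j \text{ and } x_i <_B x_j \Leftrightarrow \text{same ranking} \tag{5A} $$

$$ \sum_{rr}: \text{sum of pairs } x_i <_A x_j \text{ and } x_j <_B x_i \Leftrightarrow \text{reverse ranking} \tag{5B} $$

$$ \sum_{irA}: \text{sum of pairs } x_i \le_A x_j \text{ and } x_i \parallel_B x_j \Leftrightarrow \text{incomplete ranking in A} \tag{5C} $$

$$ \sum_{irB}: \text{sum of pairs } x_i \parallel_A x_j \text{ and } x_i \le_B x_j \Leftrightarrow \text{incomplete ranking in B} \tag{5D} $$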
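The pair counts that enter the T(g1, g2) index can be illustrated with a small Python sketch. Each poset is represented by a list of attribute tuples, and two objects are compared attribute by attribute (product order). This is a minimal sketch under stated assumptions: the three objects and their attribute values are invented, the function names are hypothetical, and pairs that are equal in one poset and strictly ordered in the other are simply not counted.

```python
from itertools import combinations


def relation(u, v):
    """Product order on attribute tuples: '<', '>', '=' or '||' (incomparable)."""
    if u == v:
        return "="
    if all(a <= b for a, b in zip(u, v)):
        return "<"
    if all(a >= b for a, b in zip(u, v)):
        return ">"
    return "||"


def pair_counts(poset_a, poset_b):
    """Count same (sr), reverse (rr) and incomplete (irA, irB) rankings."""
    counts = {"sr": 0, "rr": 0, "irA": 0, "irB": 0}
    for i, j in combinations(range(len(poset_a)), 2):
        ra = relation(poset_a[i], poset_a[j])
        rb = relation(poset_b[i], poset_b[j])
        if ra in ("<", ">") and rb in ("<", ">"):
            counts["sr" if ra == rb else "rr"] += 1
        elif ra != "||" and rb == "||":
            counts["irA"] += 1  # comparable in A, incomparable in B
        elif ra == "||" and rb != "||":
            counts["irB"] += 1  # incomparable in A, comparable in B
    return counts


def tanimoto(counts, g1, g2):
    """T(g1, g2) = sr / (sr + rr + g1 * irA + g2 * irB)."""
    denominator = (counts["sr"] + counts["rr"]
                   + g1 * counts["irA"] + g2 * counts["irB"])
    return counts["sr"] / denominator if denominator else float("nan")


# Hypothetical two-attribute descriptions of the same three objects.
POSET_A = [(1, 1), (2, 2), (3, 1)]
POSET_B = [(1, 1), (2, 2), (3, 3)]

counts = pair_counts(POSET_A, POSET_B)
print(counts)
print(tanimoto(counts, 0, 0), tanimoto(counts, 1, 1))
```

Setting a weight to 0 ignores the corresponding incomplete rankings, whereas setting it to 1 penalizes them, so T(1, 1) can never exceed T(0, 0).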
Other important statistics reported were P(IB): stability of ranking; d(N): diversity; t(N): selectivity; NL: number of levels; NEL: number of elements in the largest level; V(N): comparability; U(N): contradictions; K(N): level of degeneracy; NEC: number of equivalent classes; and C: complexity (see references for details). We carried out the PO analysis using the software Hasse for Windows (WHASSE), kindly provided by Prof. R. Brüggemann.
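To show what a Hasse diagram encodes, the Python sketch below derives the edges (cover relations) and the levels of a small poset defined by two attributes per object. It is a hedged illustration and not the WHASSE algorithm: the four objects are invented, the function names are hypothetical, and the level of an object is assumed here to be the length of the longest chain ending at it, which may differ from the convention used by the software.

```python
from functools import lru_cache

# Hypothetical objects described by two ranking attributes.
OBJECTS = {"p1": (1, 1), "p2": (2, 2), "p3": (3, 1), "p4": (3, 3)}


def less(u, v):
    """Strict product order: u is below v if no attribute is larger and u != v."""
    return u != v and all(a <= b for a, b in zip(u, v))


def cover_relations(objects):
    """Edges of the Hasse diagram: x < y with no object strictly in between."""
    edges = []
    for x, ux in objects.items():
        for y, uy in objects.items():
            if not less(ux, uy):
                continue
            between = any(less(ux, uz) and less(uz, uy) for uz in objects.values())
            if not between:
                edges.append((x, y))
    return edges


def levels(objects):
    """Level of each object: 1 + the highest level among the objects below it."""
    @lru_cache(maxsize=None)
    def level(name):
        below = [level(other) for other in objects
                 if less(objects[other], objects[name])]
        return 1 + max(below, default=0)

    return {name: level(name) for name in objects}


print(cover_relations(OBJECTS))
object_levels = levels(OBJECTS)
print(object_levels, "NL =", max(object_levels.values()))
```

In this example p2 and p3 are incomparable and share a level, while the pair (p1, p4) is comparable but is not drawn as an edge because p2 and p3 lie between them.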
Dataset
The list of protein structures was taken from the D&D article, and the structures were downloaded from the PDB. All the PDB files with the 3D protein structures [20] and their classification into enzymes or non-enzymes were taken from the literature [1]. After collecting the PDB files we calculated the Electrostatic Potentials, Electrostatic Entropy and Electrostatic Spectral moments, as well as the vdW Potentials, vdW Entropy, vdW Spectral moments, HINT Potentials, HINT Entropy and HINT Spectral moments. The inclusion of the vdW average potential and HINT average potential types of MCM invariants is new; we introduce these invariants here for the first time to address the problem of assigning a function to a protein of known structure. All calculations were carried out with our in-house software MARCH-INSIDE [13]. All the models were then built with the software STATISTICA 6.0® [18]. The dataset was composed of 689 enzymes and 682 non-enzymes, for a total of 1371 proteins, essentially the same as that of D&D. For the calculation, the MARCH-INSIDE software divides the protein into four orbits (O), called c, i, m and s, which constitute specific groups or collections of amino acids placed at the protein core (c), inner (i), middle (m) or surface (s) region. In Figure 4 we represent a 3D protein structure with its orbits (i corresponds to the core, ii to the inner, iii to the middle, and iv to the surface region). The diameters of the orbits, as a percentage of the longest distance with respect to the centre of charge, are 0 to 25 for orbit c, 26 to 50 for orbit i, 51 to 75 for orbit m, and 76 to 100 for orbit s. Detailed information about the name, the PDB ID, the number of proteins, the values of the electrostatic potential, the corresponding observed classification, the predicted classification, and the resulting probability for each enzyme is given in Table SM1 of the Supplementary Material.
Figure 4. 3D protein structure with its orbits.
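The assignment of amino acids to orbits can be sketched in Python as follows. Each residue is assigned to the core (c), inner (i), middle (m) or surface (s) orbit according to its distance from a reference centre, expressed as a percentage of the longest such distance. This is a minimal sketch, not the MARCH-INSIDE routine: the coordinates are invented, the centre is passed in directly instead of being computed as a centre of charge, and the function name and the handling of the boundaries (25, 50 and 75 included in the lower orbit) are assumptions.

```python
import math


def assign_orbits(coords, centre):
    """Label each residue c, i, m or s by relative distance from the centre."""
    distances = [math.dist(point, centre) for point in coords]
    longest = max(distances)
    orbits = []
    for distance in distances:
        percent = 100.0 * distance / longest
        if percent <= 25:
            orbits.append("c")
        elif percent <= 50:
            orbits.append("i")
        elif percent <= 75:
            orbits.append("m")
        else:
            orbits.append("s")
    return orbits


# Hypothetical residues placed along one axis, centre at the origin.
residues = [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(assign_orbits(residues, (0.0, 0.0, 0.0)))
```

Local descriptors for a given orbit can then be obtained by restricting the sums over amino acids to the residues carrying that label.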
Results and discussion
LDA-QSAR models
In this study we used MCM to calculate average non-interacting (eΘ0, vΘ0, hΘ0), short-range (eΘ1, vΘ1, hΘ1), middle-range (eΘ2, vΘ2, hΘ2), and long-range (eΘk, vΘk, hΘk; k > 3) electrostatic, vdW, and HINT interactions in different regions of 1371 proteins. These descriptors were subsequently used to carry out an LDA with a random training subset of proteins in order to classify each of them as enzyme or non-enzyme. The best models found are the following:

$$\text{Enzyme-score} = 0.03 \times {}^{v}\Theta_{0}(t) - 0.89 \qquad N = 823,\ \lambda = 0.85,\ F = 139.62,\ p < 0.001 \qquad (1a)$$

$$\text{Enzyme-score} = 0.13 \times {}^{v}\Theta_{0}(t) + 0.01 \times {}^{h}\Theta_{0}(t) - 0.89 \qquad N = 823,\ \lambda = 0.85,\ F = 70.06,\ p < 0.001 \qquad (1b)$$

$$\text{Enzyme-score} = 0.004 \times {}^{e}\Theta_{1}(t) + 0.013 \times {}^{v}\Theta_{0}(t) + {}^{h}\Theta_{0}(t) - 0.60 \qquad N = 823,\ \lambda = 0.85,\ F = 46.82,\ p < 0.001 \qquad (1c)$$

Here the symbols vΘk(O) and hΘk(O) used in the equations have the following elements: vΘ represents the van der Waals potential, hΘ represents the HINT potential, k is the topological distance between the amino acids considered, and O is the orbit of the amino acids considered in the calculation (given in brackets) [21]. In equation (1) the value N = 823 is the number of proteins used in the training series, which were selected at random out of the 1371. We report three models here because the PO test described later is performed with these same models.
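As a minimal sketch of how the linear scores in equations (1a) and (1b) are evaluated, the snippet below applies the reported coefficients to one protein. The descriptor values, the dictionary keys and the function name `enzyme_score` are hypothetical; the source does not state the score cut-off used to assign the enzyme class, so none is applied here.

```python
def enzyme_score(descriptors, coefficients, intercept):
    """Linear discriminant score: intercept + sum(coefficient * descriptor)."""
    return intercept + sum(
        coefficients[name] * descriptors[name] for name in coefficients
    )


# Coefficients and intercepts as reported in equations (1a) and (1b).
MODEL_1A = ({"v_theta0": 0.03}, -0.89)
MODEL_1B = ({"v_theta0": 0.13, "h_theta0": 0.01}, -0.89)

# Hypothetical descriptor values for a single protein (illustration only).
protein = {"v_theta0": 40.0, "h_theta0": 10.0}

score_1a = enzyme_score(protein, *MODEL_1A)
score_1b = enzyme_score(protein, *MODEL_1B)
print(round(score_1a, 2), round(score_1b, 2))
```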
The statistical parameters of the above equations are also shown, including Wilks' statistic (λ), the Fisher ratio (F) and the significance level (p-level) [22]. The training series contained 414 enzymatic proteins, of which the model correctly classified 306 (73.91% accuracy); it also correctly classified 202 out of 274 in the validation series (73.72%) and 508 out of 688 in the combined training + validation series (73.84%); on average the level of accuracy was 72.86%. The respective classification matrices are given in Table 2. In addition, we carried out a desirability analysis, which makes it easy to see how the biological activity changes with respect to the variables; it is reported in Figure 5. Table 3 reports the results of all the models obtained by combining the most important variables; the table shows the predictive power of each LDA model, built with either a standard selection of the variables or forward stepwise selection. In fact, the most important variables were retrieved using forward stepwise selection. All the models were fitted with the STATISTICA 6.0® [18] software. In our investigation we found the same accuracy for the models obtained with the vdW and HINT variables retrieved with forward stepwise selection.

Table 2. Accuracy for training and validation series for the kΘ(s) model.

| Series | Parameter | % | Predicted non-enzyme | Predicted enzyme |
|---|---|---|---|---|
| Train | Specificity | 80.44 | 329 | 80 |
| Train | Sensitivity | 73.91 | 108 | 306 |
| Train | Accuracy | 77.16 | npv = 75.29 | ppv = 79.27 |
| Validation | Specificity | 79.12 | 216 | 57 |
| Validation | Sensitivity | 73.72 | 72 | 202 |
| Validation | Accuracy | 76.42 | npv = 75.00 | ppv = 77.99 |
| Both (Train + Validation) | Specificity | 79.91 | 545 | 137 |
| Both (Train + Validation) | Sensitivity | 73.84 | 180 | 508 |
| Both (Train + Validation) | Accuracy | 76.86 | npv = 75.17 | ppv = 78.76 |
| Average (Train + Validation) | Specificity | 82.70 | 272.5 | 57 |
| Average (Train + Validation) | Sensitivity | 73.83 | 90 | 254 |
| Average (Train + Validation) | Accuracy | 78.27 | npv = 75.38 | ppv = 81.67 |

The parameter npv is the negative predictive value and ppv is the positive predictive value.
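The percentages in Table 2 follow from the classification counts by the usual definitions. The sketch below recomputes them for the training series; the function name and the mapping of the counts to true/false negatives and positives (non-enzymes as the negative class, enzymes as the positive class) are our reading of the table rather than something stated explicitly in the text.

```python
def classification_metrics(tn, fp, fn, tp):
    """Percent specificity, sensitivity, accuracy, npv and ppv from counts."""
    return {
        "specificity": 100.0 * tn / (tn + fp),
        "sensitivity": 100.0 * tp / (tp + fn),
        "accuracy": 100.0 * (tn + tp) / (tn + fp + fn + tp),
        "npv": 100.0 * tn / (tn + fn),
        "ppv": 100.0 * tp / (tp + fp),
    }


# Training-series counts from Table 2: 329 non-enzymes and 306 enzymes
# classified correctly, 80 non-enzymes and 108 enzymes misclassified.
train = classification_metrics(tn=329, fp=80, fn=108, tp=306)
for name, value in train.items():
    print(f"{name}: {value:.2f}")
```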
Figure 5. Desirability analysis.

Table 3. Summary of all the models.

Column headings: Potential, select, p(E), N° Var, Name Var, NE%, E%, TOT%.

| Row | NE% | E% | TOT% |
|---|---|---|---|
| 1 | 71 | 71 | 71 |
| 2 | 75 | 69 | 72 |
| 3 | 68 | 70 | 69 |
| 4 | 73 | 70 | 71 |
| 5 | 77 | 70 | 73 |
| 6 | 75 | 74 | 75 |
| 7 | 71 | 68 | 70 |
| 8 | 73 | 68 | 71 |
| 9 | 76 | 69 | 72 |
| 10 | 75 | 74 | 75 |
| 11 | 70 | 69 | 70 |
| 12 | 74 | 72 | 73 |
| 13 | 77 | 72 | 74 |
| 14 | 70 | 70 | 70 |
| 15 | 75 | 72 | 74 |
| 16 | 76 | 73 | 74 |
| 17 | 66 | 72 | 69 |
| 18 | 70 | 72 | 71 |
| 19 | 80 | 74 | 77 |
| 20 | 69 | 71 | 70 |
| 21 | 70 | 72 | 71 |
| 22 | 79 | 72 | 76 |

Entries of the Potential, select, p(E) and N° Var columns, in reading order: eΘ STD 53 ALL; eΘ FWD 57 1; eΘ STD 58 ALL; STD 57 ALL; vΘ; v FWD 57 4; v FWD 575 1; v STD; Θ Θ Θ; hΘ STD 57 ALL; hΘ FWD 575 2; hΘ STD 575 1; STD 575 ALL; eΘ+vΘ; eΘ+vΘ FWD 575 5; eΘ+vΘ STD 575 2; eΘ+hΘ STD 575 ALL; 575; v STD 575 ALL; v FWD 58 2; h v STD 575 2; v STD 56 ALL; eΘ+vΘ+hΘ FWD 58 2; eΘ+vΘ+hΘ STD 58 3; Θ+Θ+Θ; STD; Θ+hΘ; 575; Θ+hΘ; FWD; Θ+hΘ; Θ+Θ; 5; eΘ+hΘ.

Entries of the Name Var column, in reading order: Θ 00 04; Θ 01 04; vΘ 00 04; Θ 00 04; Θ 01 04; Θ 00 00; Θ 01 00; Θ 04 02; Θ 00 04; Θ 01 04; Θ 00 00; Θ 01 00; Θ 04 02; Θ 00 04; Θ 00 04; hΘ 02 02; Θ 00 04; Θ 00 00; Θ 01 00; Θ 03 03; Θ 01 04; Θ 01 04; hΘ 00 04; Θ 00 04; hΘ 02 02; Θ 00 04; hΘ 00 04; Θ 00 04; hΘ 02 02; Θ 01 04; vΘ 00 04; hΘ 00 04.

eΘ = electrostatic entropy; vΘ = van der Waals entropy; hΘ = HINT entropy.
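Table 4. Summary of the models before the orthogonalization process.

| Model | Ω eΘ,vΘ,hΘ: P(e) | NE% | E% | TOT% | Ω eΘ,hΘ,vΘ: P(e) | NE% | E% | TOT% |
|---|---|---|---|---|---|---|---|---|
| eΘ | 57 | 75 | 69 | 72 | 57 | 75 | 69 | 72 |
| vΘ | 575 | 73 | 74 | 73 | 50 | 71 | 47 | 59 |
| hΘ | 5001 | 62 | 69 | 65 | 575 | 75 | 73 | 74 |
| eΘ+vΘ | 575 | 77 | 72 | 74 | 57 | 74 | 68 | 71 |
| eΘ+hΘ | 572 | 73 | 71 | 72 | 575 | 76 | 73 | 74 |
| vΘ+hΘ | 575 | 79 | 73 | 76 | 575 | 79 | 73 | 76 |
| eΘ+vΘ+hΘ | 575 | 79 | 72 | 76 | 575 | 79 | 72 | 76 |

| Model | Ω vΘ,eΘ,hΘ: P(e) | NE% | E% | TOT% | Ω vΘ,hΘ,eΘ: P(e) | NE% | E% | TOT% |
|---|---|---|---|---|---|---|---|---|
| eΘ | 50 | 61 | 48 | 55 | 501 | 59 | 51 | 55 |
| vΘ | 575 | 76 | 74 | 75 | 575 | 76 | 74 | 75 |
| hΘ | 5001 | 62 | 69 | 65 | 5001 | 59 | 75 | 67 |
| eΘ+vΘ | 575 | 77 | 72 | 74 | 575 | 75 | 73 | 74 |
| eΘ+hΘ | 5001 | 64 | 49 | 57 | 5001 | 64 | 49 | 57 |
| vΘ+hΘ | 575 | 80 | 74 | 77 | 575 | 80 | 74 | 77 |
| eΘ+vΘ+hΘ | 575 | 79 | 72 | 76 | 575 | 79 | 72 | 76 |

| Model | Ω hΘ,eΘ,vΘ: P(e) | NE% | E% | TOT% | Ω hΘ,vΘ,eΘ: P(e) | NE% | E% | TOT% |
|---|---|---|---|---|---|---|---|---|
| eΘ | 50 | 65 | 48 | 57 | 50 | 65 | 48 | 57 |
| vΘ | 5001 | 69 | 47 | 58 | 5001 | 60 | 72 | 66 |
| hΘ | 575 | 75 | 74 | 75 | 575 | 75 | 74 | 75 |
| eΘ+vΘ | 5001 | 64 | 49 | 56 | 5001 | 64 | 49 | 56 |
| eΘ+hΘ | 575 | 76 | 73 | 74 | 575 | 76 | 73 | 74 |
| vΘ+hΘ | 575 | 80 | 74 | 77 | 575 | 80 | 74 | 77 |
| eΘ+vΘ+hΘ | 575 | 79 | 72 | 76 | 575 | 79 | 72 | 76 |

eΘ = electrostatic entropy; vΘ = van der Waals entropy; hΘ = HINT entropy; Ω = orthogonalized; P(e) = prior probabilities for enzymes; NE = specificity; E = sensitivity; TOT = accuracy.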
For this reason we decided to investigate the collinearity between the variables. To do so, we selected the three most important variables of each potential and then varied the order of orthogonalization to investigate whether the models change when the input order of the variables changes. Our references for this experiment were the model selected by the LDA and the model with the most important variable of each of the electrostatic, vdW and HINT potentials. As reported in Table 4, the order does not change the final result of the selected models; we noted a small loss of accuracy when the first variable of the process is the electrostatic one. After the orthogonal transformation procedure we obtain new equations that retain the predictive power [19]:

$$\text{Enzyme-score} = 0.014 \times {}^{1}\Omega_{0}(v) \qquad N = 823,\ \%\text{train} = 74.40,\ \%\text{cv} = 74.09 \qquad (2a)$$

$$\text{Enzyme-score} = 0.014 \times {}^{1}\Omega_{0}(v) + 0.03 \times {}^{2}\Omega_{0}(h) - 0.89 \qquad N = 823,\ \%\text{train} = 73.91,\ \%\text{cv} = 73.72 \qquad (2b)$$

$$\text{Enzyme-score} = 0.03 \times {}^{1}\Omega_{0}(v) + 0.02 \times {}^{2}\Omega_{0}(h) + 0.004 \times {}^{3}\Omega_{1}(e) - 0.60 \qquad N = 823,\ \%\text{train} = 74.15,\ \%\text{cv} = 69.71 \qquad (2c)$$

In these equations, the symbol Ω indicates the final forms of the indices after the orthogonal transformation. The variables of the orthogonalized models do not present collinearity problems: after the transformation all features are orthogonal. The process does not affect the accuracy of the models, which confirms the success of the procedure. We consider the accuracy of these models high given that the results were obtained with LDA only; the models proposed here are linear and reach an average accuracy between 74% and 76% using only one, two or three variables. Table SM2 gives all the specifications and statistical parameters of the models reported here, including the orthogonalized models.
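The idea of order-dependent orthogonalization can be sketched as follows. This is a generic sequential (Gram-Schmidt style) procedure on mean-centred descriptors, assumed here as a stand-in for the orthogonalization used in the chapter; the descriptor values and the names `orthogonalize`, `v`, `h` and `e` are hypothetical.

```python
def center(values):
    """Subtract the mean so that orthogonality means zero covariance."""
    mean = sum(values) / len(values)
    return [x - mean for x in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(columns):
    """Sequentially orthogonalize descriptor columns in the given order.

    Each column is replaced by its residual after removing the part
    explained by the previously orthogonalized columns.
    """
    basis = []
    for column in columns:
        residual = center(column)
        for previous in basis:
            weight = dot(residual, previous) / dot(previous, previous)
            residual = [r - weight * p for r, p in zip(residual, previous)]
        basis.append(residual)
    return basis


# Hypothetical descriptor values for five proteins (illustration only).
v = [1.0, 2.0, 3.0, 4.0, 5.0]
h = [2.1, 3.9, 6.2, 8.1, 9.8]
e = [5.0, 3.0, 4.0, 1.0, 2.0]

omega_vhe = orthogonalize([v, h, e])  # order v, h, e
omega_evh = orthogonalize([e, v, h])  # a different input order
print(abs(dot(omega_vhe[0], omega_vhe[1])) < 1e-9)
```

Changing the input order changes the individual orthogonal variables, but every pair within one ordering remains uncorrelated.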
Partial order
Ordering of samples may be very useful for classifying enzymatic proteins and comparing them with non-enzymatic proteins. In principle, we may propose different alternative 1D orders for the whole dataset. These orders may be constructed from different sample features. In this study it is easy to see that some of the models fitted above can themselves play the role of ranking attributes to order the samples. For instance, our total 1D order is based on three different inputs, namely the three selected models. In any case, a total order based on one single parameter is poorer in information content and may easily fail to capture all the biologically relevant sample characteristics, owing to the high complexity of the 3D structure of a protein. Consequently, it is more reasonable to construct a 2D order of samples based on more than one feature at a time. In PO theory one may use different combinations of sample features to construct the PO scheme. The general principle is to order or rank different elements or samples (xi) using multiple ranking attributes or sample features. Consequently, the study of the best set of attributes or sample features used to build the PO becomes of major importance [23-40]. In our case, a PO scheme in which the separation between the various models is minimal is desirable; for this reason we used Tanimoto and chi-square analysis to compare the different POs proposed. Table 5 reports all the statistical parameters of the POs; it shows the high Tanimoto index and the chi-square statistic for each PO study proposed. The overall PO similarity is high, with total Tanimoto indices of 1, 0.94, 0.99 and 1 for the four rankings proposed. These and the other parameters in Table 5 confirm the robustness and validity of our method and approach. Inspection of Table 5 confirms that all the reported models are very good and that the LDA classifications obtained with the various models are mutually consistent.
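The general principle of ranking samples with several attributes at once can be illustrated with the component-wise (product) order that underlies Hasse diagrams. The sketch below is not the software used for the PO test: the sample names, scores and function names are hypothetical, and the pair counts are only meant to show what comparable and incomparable (contradictory) pairs are.

```python
from itertools import combinations


def dominates(a, b):
    """True if sample a is at least as large as sample b on every attribute."""
    return all(x >= y for x, y in zip(a, b))


def compare_pairs(samples):
    """Count comparable and incomparable pairs under the product order."""
    comparable = 0
    incomparable = 0
    for a, b in combinations(samples.values(), 2):
        if dominates(a, b) or dominates(b, a):
            comparable += 1
        else:
            incomparable += 1
    return comparable, incomparable


# Hypothetical scores of four proteins under two ranking attributes
# (for example, the scores given by two different models).
scores = {
    "P1": (0.9, 0.8),
    "P2": (0.5, 0.4),
    "P3": (0.7, 0.3),
    "P4": (0.2, 0.6),
}
n_comparable, n_incomparable = compare_pairs(scores)
print(n_comparable, n_incomparable)
```

Here P1 can be ranked above every other sample, whereas P2, P3 and P4 cannot be ordered among themselves because the two attributes disagree.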
Figure 6. Partial order simple graph and protein contact map.
Table 5. Statistical parameters for the PO test.

| Parameter (a) | All models | Models 2+3 | Models 1+3 | Models 1+2 |
|---|---|---|---|---|
| T | 1 | 0.94357 | 0.99999 | 1 |
| χ2 | 3.8415 | 3.8415 | 3.8415 | 3.8415 |
| P(IB) | 0.055541 | 0 | 0.055532 | 0.055541 |
| d(N) | 0.013 | 0.0019 | 0.013 | 0.013 |
| t(N) | 0.29 | 1 | 0.29 | 0.29 |
| NL | 199 | 539 | 199 | 199 |
| NEL | 9 | 1 | 9 | |
| V(N) | 223853 | 237241 | 223855 | 223853 |
| U(N) | 26328 | 0 | 26324 | 26328 |
| K(N) | 2 | 450 | 2 | 2 |
| NEC | 688 | 539 | 688 | 688 |
| C | No | No | No | No |

The columns are the cases for Tanimoto's analysis.
(a) T: Overall Tanimoto's coefficient; χ2: Chi-square statistic; P(IB): Stability of ranking; d(N): Diversity; t(N): Selectivity; NL: Number of Levels; NEL: Number of Elements in the Largest Level; V(N): Comparability; U(N): Contradictions; K(N): Level of degeneracy; NEC: Number of equivalent classes; C: Complexity. Models: 1 = eΘ0 + vΘ0 + hΘ0, 2 = vΘ0 + hΘ0, 3 = vΘ0.
In addition, the PO can be used to predict new enzymes: an unknown protein can be compared with a known one when both are at the same level in the PO rank.
Conclusions
In this paper we demonstrate that it is possible to distinguish between enzymes and non-enzymes with a linear classifier based on average 3D electrostatic potentials, which is notably simpler than the previously reported model. Although the results obtained in this field often depend on the database selected, we believe that the accuracy of the models obtained by the MBN arises because the spread of this energy describes the 3D structure of a protein more effectively and thus yields more accurate constants for prediction, in accordance with our previous works.
Acknowledgments
González-Díaz, H. acknowledges a research contract at the Faculty of Pharmacy, USC, Spain, financed by the Contract/grant sponsor: “Program Isidro Parga Pondal, Xunta de Galicia” and the European Social Fund (F.S.E.). Concu R. is indebted to the Regione Autonoma della Sardegna, which approved a Master & Back project (RO2669) to fund a one-year research training position at the USC.
References
1. Dobson PD, Cai YD, Stapley BJ, Doig AJ. Prediction of protein function in the absence of significant sequence similarity. Curr Med Chem. 2004 Aug;11(16):2135-42.
2. Munteanu CR, González-Díaz H, Magalhaes AL. Enzymes/non-enzymes classification model complexity based on composition, sequence, 3D and topological indices. Journal of theoretical biology. 2008 Sep 21;254(2):476-82.
3. Lu L, Qian Z, Cai YD, Li Y. ECS: an automatic enzyme classifier based on functional domain composition. Computational biology and chemistry. 2007 Jun;31(3):226-32.
4. Cai CZ, Han LY, Ji ZL, Chen YZ. Enzyme family classification by support vector machines. Proteins. 2004 Apr 1;55(1):66-76.
5. Han L, Cui J, Lin H, Ji Z, Cao Z, Li Y, et al. Recent progresses in the application of machine learning approach for predicting protein functional class independent of sequence similarity. Proteomics. 2006;6:4023-37.
6. González-Díaz H, Vilar S, Santana L, Uriarte E. Medicinal chemistry and bioinformatics--current trends in drugs discovery with networks topological indices. Current topics in medicinal chemistry. 2007;7(10):1015-29.
7. Kier LB. Use of molecular negentropy to encode structure governing biological activity. Journal of pharmaceutical sciences. 1980 Jul;69(7):807-10.
8. Abhiman S, Sonnhammer EL. Large-scale prediction of function shift in protein families with a focus on enzymatic function. Proteins. 2005 Sep 1;60(4):758-68.
9. Das B, Meirovitch H. Solvation parameters for predicting the structure of surface loops in proteins: transferability and entropic effects. Proteins. 2003 May 15;51(3):470-83.
10. Stahura FL, Godden JW, Xue L, Bajorath J. Distinguishing between natural products and synthetic molecules by descriptor Shannon entropy analysis and binary QSAR calculations. J Chem Inf Comput Sci. 2000 Sep-Oct;40(5):1245-52.
11. Sciretti D, Bruscolini P, Pelizzola A, Pretti M, Jaramillo A. Computational protein design with side-chain conformational entropy. Proteins. 2008 Jul 10;74(1):176-91.
12. Alejandro Sánchez-Flores EP-RLS. Protein homology detection and fold inference through multiple alignment entropy profiles. 2008:248-56.
13. González-Díaz H, Gonzalez-Diaz Y, Santana L, Ubeira FM, Uriarte E. Proteomics, networks and connectivity indices. Proteomics. 2008 Feb;8(4):750-78.
14. Randić M. Orthogonal molecular descriptors. New J Chem. 1991;15:517-25.
15. Caballero J, Saavedra M, Fernandez M, Gonzalez-Nilo FD. Quantitative structure-activity relationship of rubiscolin analogues as delta opioid peptides using comparative molecular field analysis (CoMFA) and comparative molecular similarity indices analysis (CoMSIA). Journal of agricultural and food chemistry. 2007 Oct 3;55(20):8101-4.
16. Vassura M, Margara L, Di Lena P, Medri F, Fariselli P, Casadio R. Reconstruction of 3D structures from protein contact maps. IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM. 2008 Jul-Sep;5(3):357-67.
17. Humberto González-Díaz FP-PaFMU. Predicting Antimicrobial Drugs and Targets with the MARCH INSIDE Approach. Current Topics in Medicinal Chemistry. 2008 No. 18;8(18):1676-90.
18. StatSoft, Inc. STATISTICA (data analysis software system), version 6.0, www.statsoft.com. StatSoft, Inc. 6.0 ed. 2002.
19. Bruggemann R, Pudenz S, Carlsen L, Sorensen PB, Thomsen M, Mishra RK. The use of Hasse diagrams as a potential approach for inverse QSAR. SAR QSAR Environ Res. 2001 Feb;11(5-6):473-87.
20. Ivanisenko VA, Pintus SS, Grigorovich DA, Kolchanov NA. PDBSite: a database of the 3D structure of protein functional sites. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D183-7.
21. Cruz-Monteagudo M, González-Díaz H, Agüero-Chapin G, Santana L, Borges F, Domínguez RE, et al. Computational Chemistry Development of a Unified Free Energy Markov Model for the Distribution of 1300 Chemicals to 38 Different Environmental or Biological Systems. J Comput Chem. 2007;28:1909-22.
22. Casanola-Martin GM, Marrero-Ponce Y, Khan MT, Ather A, Khan KM, Torrens F, et al. Dragon method for finding novel tyrosinase inhibitors: Biosilico identification and experimental in vitro assays. European journal of medicinal chemistry. 2007 Nov-Dec;42(11-12):1370-81.
23. Sorensen PB, Bruggemann R, Carlsen L, Mogensen BB, Kreuger J, Pudenz S. Analysis of monitoring data of pesticide residues in surface waters using partial order ranking theory. Environ Toxicol Chem. 2003 Mar;22(3):661-70.
24. Bruggemann R, Sorensen PB, Lerche D, Carlsen L. Estimation of averaged ranks by a local partial order model. J Chem Inf Comput Sci. 2004 Mar-Apr;44(2):618-25.
25. Lee C, Grasso C, Sharlow MF. Multiple sequence alignment using partial order graphs. Bioinformatics. 2002 Mar;18(3):452-64.
26. Todeschini R, Consonni V, Mauri A, Ballabio D. Characterization of DNA Primary Sequences by a New Similarity/Diversity Measure Based on the Partial Ordering. Journal of chemical information and modeling. 2006;46(5):1905-11.
27. Grasso C, Modrek B, Xing Y, Lee C. Genome-wide detection of alternative splicing in expressed sequences using partial order multiple sequence alignment graphs. Pac Symp Biocomput. 2004:29-41.
28. Grasso C, Lee C. Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems. Bioinformatics. 2004 Jul 10;20(10):1546-56.
29. Pfleiderer C, Reznik D, Pintschovius L, Lohneysen HV, Garst M, Rosch A. Partial order in the non-Fermi-liquid phase of MnSi. Nature. 2004 Jan 15;427(6971):227-31.
30. Ye Y, Godzik A. Multiple flexible structure alignment using partial order graphs. Bioinformatics. 2005 May 15;21(10):2362-9.
31. Carlsen L. A combined QSAR and partial order ranking approach to risk assessment. SAR QSAR Environ Res. 2006 Apr;17(2):133-46.
32. Grasso C, Quist M, Ke K, Lee C. POAVIZ: a Partial order multiple sequence alignment visualizer. Bioinformatics. 2003 Jul 22;19(11):1446-8.
33. Lee C. Generating consensus sequences from partial order multiple sequence alignment graphs. Bioinformatics. 2003 May 22;19(8):999-1008.
34. Lerche D, Matsuzaki SY, Sorensen PB, Carlsen L, Nielsen OJ. Ranking of chemical substances based on the Japanese Pollutant Release and Transfer Register using partial order theory and random linear extensions. Chemosphere. 2004 May;55(7):1005-25.
35. Lerche D, Bruggemann R, Sorensen P, Carlsen L, Nielsen OJ. A comparison of partial order technique with three methods of multi-criteria analysis for ranking of chemical substances. J Chem Inf Comput Sci. 2002 Sep-Oct;42(5):1086-98.
36. Lerche D, Sorensen PB, Bruggemann R. Improved estimation of the ranking probabilities in partial orders using random linear extensions by approximation of the mutual ranking probability. J Chem Inf Comput Sci. 2003 Sep-Oct;43(5):1471-80.
37. Jensen TS, Lerche DB, Sorensen PB. Ranking contaminated sites using a partial ordering method. Environ Toxicol Chem. 2003 Apr;22(4):776-83.
38. Sandford D, Fendorf M, Stacy AM, Holstein WL, Crawford MK. Observation of Hendricks-Teller partial order in a tetragonal cuprate superconductor: La1.68Nd0.14Na0.10K0.082CuO4. Phys Rev B Condens Matter. 1994 Oct 1;50(13):9419-25.
39. Sorensen PB, Mogensen BB, Carlsen L, Thomsen M. The influence on partial order ranking from input parameter uncertainty. Definition of a robustness parameter. Chemosphere. 2000 Aug;41(4):595-601.
40. Lerche D, Sorensen PB, Larsen HS, Carlsen L, Nielsen OJ. Comparison of the combined monitoring-based and modelling-based priority setting scheme with partial order theory and random linear extensions for ranking of chemical substances. Chemosphere. 2002 Nov;49(6):637-49.
Transworld Research Network 37/661 (2), Fort P.O. Trivandrum-695 023 Kerala, India
Topological Indices for Medicinal Chemistry, Biology, Parasitology, Neurological and Social Networks, 2010: 145-161 ISBN: 978-81-7895-489-9 Editor: Humberto González-Díaz and Cristian Robert Munteanu
8. QSPR models for human Rhinovirus surface networks
Santiago Vilar1 and Humberto González-Díaz2
1 National Institutes of Health, DHHS, Bethesda, Maryland 20892, USA
2 Department of Microbiology & Parasitology, Faculty of Pharmacy, University of Santiago de Compostela, Santiago de Compostela, 15782, Spain
1. Introduction
Quantitative Structure-Property/Activity Relationship (QSPR/QSAR) [1] techniques based on different indices have a wide variety of applications in bioorganic chemistry research, where they connect the chemical structure of small molecules with the antiviral activity of drugs [2]. Interest in the application of QSAR has steadily increased in recent decades, and in a very recent work Verma and Hansch pointed out that it may be useful in the search for anti-HRV (Human Rhinovirus) agents [3]. In that paper they discussed QSAR models to predict new anti-HRV drugs taking into consideration the chemical structure of the drug candidate. In our opinion, QSAR approaches can be of interest for the computational study not only of small molecules but also of large biopolymers or non-molecular biological systems. This opens, for instance, the possibility of a QSAR view of the HRV problem from the side of the virus instead of the drug side, which can be seen as the second part of the work of Verma and Hansch.
Correspondence/Reprint request: Dr. González-Díaz H, Department of Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela (USC), Santiago de Compostela, 15782, Spain. E-mail: humberto.gonzalez@usc.es or gonzalezdiazh@gmail.com
In fact, knowledge of the 3D structure of proteins, including their surface, increases our understanding of their function and interaction with other proteins. However, knowledge of the sequence and its 3D structure does not provide a clear relationship with biological properties. For this reason, it is important to search for novel protein 3D indices, derived from new molecular graphics representations of proteins, that are useful for finding QSAR models able to predict the biological functions of proteins. Some 3D molecular descriptors used to codify molecular structures of polymers include Arteca's mean crossing-over number, the Flory radius of gyration and the I3 index, amongst others [4]. In addition, the creation of databases of DNA/RNA and protein sequences without a determined 3D structure has led to significant developments in the search for 2D network/graph-type molecular graphics methodologies that characterize DNA and protein sequences. Molecular graphics representations of DNA or RNA sequences have been reported by different authors [5, 6]. In the case of proteins, Hydrophobicity-Polarity Lattices (HP-Lattices) are one of the most widely used types of molecular graphics for modelling protein structure-property relationships and folding dynamics in 2D or 3D spaces (pseudo-folding) [7, 8]. A new protein pseudo-folding molecular graphics or network-type representation was recently introduced by Fernández and Caballero et al. [9]. In any case, using different types of numerical indices derived from these protein or DNA/RNA 2D molecular graphics to perform QSAR studies is simpler than approaches that require knowledge of the 3D structure. These indices describe graph/network topology, connectivity, or branching and are often referred to as graph Topological Indices (TIs) or Network Connectivity Indices (CIs). Often, CIs/TIs are efficient enough to codify substantial amounts of information in a timely way compared with 3D indices [10].
The uses of these indices to seek structure-function relationships in Cellular Biochemistry are diverse [11]. In two recent reviews, we reviewed in depth the applications in Theoretical Biology and Bioinformatics [12] and the CIs/TIs derived from network/graph-type molecular graphics of small molecules, macromolecules, and more complex sources of information, including whole Proteome Mass Spectra, Genomes, or Protein Interaction Networks [13]. Recently, the MARCH-INSIDE approach introduced by our group has been generalised to encode structural features of DNA/RNA and proteins. In these works we included 2D-RNA secondary structure graphs, pseudo-3D protein molecular graphics, and other types of graphics or networks [14], including the HP-Lattice type of complex networks [15]. The study described here concerns the protein sequences present in the capsid of HRV. In each case the sequence of viral proteins of 19 strains of HRV is represented by an HP-Lattice network. We then calculated, for the first time, up to 11 different classes of TIs for these HP-Lattice networks. The TIs include total indices of the full network and local indices for specific groups of amino acids. Next, we used these TIs as input parameters of a Linear Discriminant Analysis (LDA) in order to construct 11 different QSAR models. These QSAR models are discriminant functions that may classify an HRV as a major or minor group virus. The mechanisms are the ICAM-1-mediated and the LDL-mediated mechanisms referred to above. We compared the QSAR models based on the different types of TIs. We also compared these models with 3D Topographic Indices (TGIs) derived from virus surface road maps. These road maps are graphs equivalent to viral SCNs, that is, complex networks of amino acid vicinity at the viral surface. Some of the QSAR models based on TGIs of SCN were previously published and others were developed in this work in order to make a more rigorous comparative study [13, 16, 17].
2. Materials and methods
2.1. Statistical analysis
Once the different descriptors had been calculated for all the Human Rhinoviruses (HRVs) in the databases, we proceeded with the statistical analysis. The variables were standardised and a Linear Discriminant Analysis (LDA) was carried out using the STATISTICA 6.0 software package [18] to develop three classification functions capable of differentiating between the different groups. The formula for the LDA classification function is:

$$S = a_0 + \sum_{c,k,g}^{3,m,21} b_{kcg} \cdot {}^{k}TI_{c}(g) \qquad (1)$$

The variable S is a real-valued score for the biological property under investigation: the tendency of the virus to bind the LDLR receptor instead of ICAM. The values a0 and bkcg are the coefficients obtained for the LDA classification functions (QSAR model) [19, 20]. The statistical quality of the models was assessed using parameters such as Wilks' statistic (λ), the Fisher ratio (F), the square of the Mahalanobis distance (D2) and the percentage of good classification for the training set as well as for the cross-validation procedure. The classification of cases was carried out by considering the posterior classification probabilities, which are the probabilities that the respective case belongs to a particular group, i.e., active or inactive. The different discriminant functions were obtained using the forward-stepwise method with a minimum tolerance of 0.01 [19, 20]. As an alternative to TIs we also explored TGIs, using the same notation:

$$S = a_0 + \sum_{c,k,g}^{3,m,21} b_{kcg} \cdot {}^{k}TGI_{c}(g) \qquad (2)$$
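For readers unfamiliar with the quality parameters named above, the following sketch computes Wilks' λ and the Fisher ratio for the simplest case of a single descriptor. It uses the textbook one-variable definitions (within-group over total sum of squares) and is not a reproduction of the STATISTICA output; the function names and the descriptor values for the two virus groups are hypothetical.

```python
def wilks_lambda_univariate(groups):
    """Wilks' lambda for one descriptor: within-group SS / total SS."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    ss_within = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        ss_within += sum((x - group_mean) ** 2 for x in group)
    return ss_within / ss_total


def fisher_ratio(lam, n_cases, n_groups):
    """F ratio associated with lambda for a single descriptor."""
    return ((1.0 - lam) / lam) * (n_cases - n_groups) / (n_groups - 1)


# Hypothetical values of one TI for minor-group and major-group viruses.
minor = [1.0, 2.0, 3.0]
major = [4.0, 5.0, 6.0]

lam = wilks_lambda_univariate([minor, major])
f_value = fisher_ratio(lam, n_cases=6, n_groups=2)
print(round(lam, 4), round(f_value, 2))
```

Values of λ close to 0 indicate well-separated groups, whereas values close to 1 indicate poor discrimination.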
3. Results and discussion
3.1. QSAR models based on TIs of 2D-lattice surface complex networks
Human Rhinoviruses (HRVs) are the single most important cause of common colds. We used the same HRV series recently used by Vlasak et al. [21]; a total of 19 HRVs were studied, 10 belonging to the minor group and the other 9 to the major group. The widespread nature of this affliction, its economic consequences, and the well-known impracticality of vaccine development due to the large number of HRV serotypes (>100) have justified the search for antiviral chemotherapeutic agents. Rhinoviruses belong to the Picornaviridae family and are small RNA viruses. HRVs are naturally occurring polymer-composed systems in the form of small icosahedral particles (~30 nm) composed of 60 copies of the viral capsid proteins VP1, 2, 3 and 4 and a positive-strand (messenger sense) RNA. In terms of the mechanism of cellular infection they may be classified into two groups: the major group (viruses binding intercellular adhesion molecule 1, ICAM-1) and the minor group (viruses binding the low-density lipoprotein receptor, LDLR). The structure of the VP surface has been found to be involved in the receptor specificity of HRVs. In this sense, predicting the mechanism of infection of new viral mutants with theoretical models becomes of major importance to assist bioorganic and medicinal chemists in the design of new effective drugs [22]. However, in many cases knowing a DNA/RNA or protein sequence does not provide information about relationships with biological properties. For this reason it is necessary to codify sequence information by constructing 2D graphical representations [23] or other representations for DNA [24], Liao's RNA graphs [25], or HP-Lattice protein network representations [26], from which TIs can be calculated with the aim of finding relationships between a protein structure and its biological activity.
Liao and other authors have successfully used lattice-like representations of viral DNA sequences to model biological properties [27]. Randic et al. [28] demonstrated that 2D-Lattice representations are the result of projecting onto the 2D plane the view of a 3D-Tetrahedral representation of the DNA sequence. The results also apply to HP-Lattice representations of proteins; we only have to replace the four classes of nucleotides with the four groups of amino acids grouped according to polarity and hydrophobicity. In this 2D-Lattice projection, we draw only the nodes in the 3D space with tetrahedral coordinates that are closer to the plane (the more external nodes in this plane view with respect to the centre), and all the remaining nodes with the same coordinates along the projecting axis overlap in the same node. In this sense, we interpret 2D-Lattices here as the projection of the surface of the protein represented in 3D-Tetrahedral coordinates. For this reason, we refer to HP-Lattices here as 2D-Lattice Surface Complex Networks (2DL-SCN), as opposed to real 3D-SCN. Consequently, 2DL-SCN are pseudo-folding graph representations of proteins, whereas 3D-SCN are representations of realistic 3D protein folding. Initially, the sequences of the different HRVs were introduced into the programme MARCH-INSIDE 2.0 [29], which was used to generate one HP-Lattice network representation per sequence. Specifically in this work, we interpreted HP-Lattice networks as 2D-Lattice Surface Complex Networks (2DL-SCN); see the results and discussion section. In Figure 1 we superposed the 2DL-SCN of all HRV strains to illustrate the similarities and dissimilarities between the two groups and how difficult it may be to discriminate between them. The aim of the method proposed here is to overcome the 10D-amino acid space bottleneck by grouping the twenty natural amino acids into only four groups:
Figure 1. 2DL-SCN for HRVs 2, 1A, 29 and 1B and BPA of the different fragments (F1-F5).
150
Santiago Vilar & Humberto González-Díaz
1. The abscissa coordinate increases by +1 for an acidic amino acid (rightwards step); or
2. The abscissa coordinate decreases by 1 for a basic amino acid (leftwards step); or
3. The ordinate coordinate increases by +1 for a polar amino acid (upwards step); or
4. The ordinate coordinate decreases by 1 for a non-polar amino acid (downwards step).

Table 1. Names, symbols, formula, and network type for TIs and/or TGIs in this work.
a The notation of TIs used in this work.
b The parameters Apk(j) and kpjj denote the absolute probabilities of finding an amino acid with charge q, or the probabilities of self-return to an amino acid with charge q after a loop-type random walk of length k within the 2DL-SCN. The indices R and D are the topological radius and the topological diameter obtained from the distance matrix.
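The four lattice steps listed above, together with distance-based indices such as the radius and diameter mentioned in the table note, can be sketched as follows. This is an illustrative reconstruction and not the MARCH-INSIDE code: the assignment of amino acids to the polar group (`POLAR`) is an assumption, every residue outside the acidic, basic and polar sets is treated as non-polar, and the five-residue sequence is hypothetical.

```python
from collections import deque

ACIDIC = set("DE")
BASIC = set("KRH")
POLAR = set("STNQYC")  # assumed polar set; all other residues count as non-polar


def step(residue):
    """Lattice step for one amino acid; acidic/basic take priority."""
    if residue in ACIDIC:
        return (1, 0)    # rightwards
    if residue in BASIC:
        return (-1, 0)   # leftwards
    if residue in POLAR:
        return (0, 1)    # upwards
    return (0, -1)       # downwards (non-polar)


def lattice_path(sequence):
    """Coordinates of each residue; the first one sits at (0, 0)."""
    x, y = 0, 0
    path = [(x, y)]
    for residue in sequence[1:]:
        dx, dy = step(residue)
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


def build_network(path):
    """Nodes are distinct lattice points; consecutive residues are linked."""
    neighbours = {node: set() for node in path}
    for a, b in zip(path, path[1:]):
        neighbours[a].add(b)
        neighbours[b].add(a)
    return neighbours


def distances_from(source, neighbours):
    """Topological distances from one node by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def distance_indices(neighbours):
    """Wiener index, topological radius and diameter of the network."""
    total, eccentricity = 0, []
    for node in neighbours:
        dist = distances_from(node, neighbours)
        total += sum(dist.values())
        eccentricity.append(max(dist.values()))
    return {"wiener": total // 2, "radius": min(eccentricity),
            "diameter": max(eccentricity)}


path = lattice_path("DKSAE")  # hypothetical five-residue fragment
indices = distance_indices(build_network(path))
print(path, indices)
```

Residues that land on the same coordinates share one node, which is why the five residues of the example give a network of only three nodes.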
Table 2. Experiment 1: Coefficients of QSAR models using TIs of 2DL-SCN vs. TGIs of 3D-SCN.

TI type               k   TGI of VSCN (3D-SCN)   QSAR coefficient   TI of HP-Lattice network (2DL-SCN)   QSAR coefficient
Markov Chain indices
Entropy               5   TGIΘ(T)                0.84               TIΘ(T)                               0.09
                      0   TGIΘ(L)                289.9              TIΘ(L)                               48.6
                          a0                     -22.53             a0                                   -8.77
Spectral Moment       2   TGIπ(T)                0                  TIπ(T)                               0.24
                      0   TGIπ(L)                17.3               TIπ(L)                               31.14
                          a0                     -15.39             a0                                   -14.61
Classic TIs
Wiener                0   TGIW(T)                8.0·10-4           TIW(T)                               9.6·10-4
                      0   TGIW(L)                3.24               TIW(L)                               0.09
                          a0                     -4.68              a0                                   23.42
Randic                0   TGIχ(T)                -1.72              TIχ(T)                               5.04·10-4
                      0   TGIχ(L)                0.72               TIχ(L)                               -14.97
                          a0                     -5.56              a0                                   15.12
Sum of Node Degrees   0   TGIδ(T)                0.06               TIδ(T)                               -0.11
                      0   TGIδ(L)                3.4                TIδ(L)                               -14.97
                          a0                     -18.63             a0                                   15.12
Diameter              0   TGID(T)                -1.72              TID(T)                               0.58
                      0   TGID(L)                0.72               TID(L)                               -0.75
                          a0                     4.53               a0                                   -7.07

In this work, LDA was used to link the TIs of the sequence 2DL-SCN networks with the cellular entry route of a series of Human Rhinoviruses and to discriminate between the HRV strains. We developed 11 different QSAR classification models in total, one for each class of TIs. The names, symbols, and notation of all the TIs that entered the 11 models after statistical analysis appear in Table 1. The equations of the models are given below.

The four amino acid groups characterise the physicochemical nature of the amino acids as polar, non-polar, acidic or basic; in essence, Hydrophobicity or Polarity (HP). This kind of classification has been used for the annotation of protein fragment patterns and motifs and to generate HP-Lattice networks such as the 2DL-SCN [30]. Classification as acidic or basic prevails over the polar/non-polar classification, so that the four groups do not overlap. Subsequently, each amino acid in the sequence is placed in a Cartesian 2D space, starting with the first amino acid at the (0, 0) coordinates. The coordinates of the successive amino acids are calculated in a similar manner as for DNA spaces [31]. Once the different 2DL-SCNs had been generated for the proteins of the different viruses, a series of total and local molecular descriptors for the whole sequence and for the different amino acids in the sequence were calculated with the aforementioned programme MARCH-INSIDE 2.0 [29]. We calculated these TIs for the whole
Santiago Vilar & Humberto González-Díaz
2DL-SCN and for local groups of amino acids. In this work we used the uniform notation kTIc(g) for all TIs, where c is the classic symbol of the TI and refers to one of the 11 classes of TIs calculated, k is the order of the TI within the class, and g is the local group of amino acids. When a TI is calculated for the entire lattice, the local group is g = T, indicating that it is a total and not a local TI. If a TI belongs to a class without TIs of different orders, we use k = 0. The groups of TIs include g = T and 20 other groups, one for each kind of amino acid. The calculation of these and other TIs for different graphs/networks has been explained in detail before; consequently, we give herein only general formulae in Table 2 [12].

Classic TIs models

Balaban TIs model:
$$S = -7.97 \cdot 10^{-8} \cdot {}^{0}TI_{J}(T) + 0.09 \cdot {}^{0}TI_{J}(L) - 0.92 \tag{3}$$

Wiener TIs model:
$$S = 9.6 \cdot 10^{-4} \cdot {}^{0}TI_{W}(T) - 0.92 \cdot {}^{0}TI_{W}(L) + 23.47 \tag{4}$$

MTI TIs model:
$$S = -7.0 \cdot 10^{-5} \cdot {}^{0}TI_{MTI}(T) - 1.50 \cdot {}^{0}TI_{MTI}(L) + 21.94 \tag{5}$$

Randic Connectivity TIs model:
$$S = 5.04 \cdot 10^{-4} \cdot {}^{0}TI_{\chi}(T) - 14.97 \cdot {}^{0}TI_{\chi}(L) + 15.12 \tag{6}$$

Lattice network nodes Sum of Degrees TIs model:
$$S = -0.11 \cdot {}^{0}TI_{\delta}(T) - 9.78 \cdot {}^{0}TI_{\delta}(L) + 28.10 \tag{7}$$

Shape coefficient TIs models

Radius TIs model:
$$S = 1.54 \cdot {}^{0}TI_{R}(T) - 0.86 \cdot {}^{0}TI_{R}(L) - 12.7 \tag{8}$$

Diameter TIs model:
$$S = 0.58 \cdot {}^{0}TI_{D}(T) - 0.75 \cdot {}^{0}TI_{D}(L) - 7.07 \tag{9}$$
Shape coefficient type TIs model:
$$S = 3.07 \cdot {}^{0}TI_{I_2}(T) + 11.04 \cdot {}^{0}TI_{I_2}(L) - 0.92 \tag{10}$$

Markov Chain TIs Models

Entropy TIs model:
$$S = 0.09 \cdot {}^{3}TI_{\Theta}(T) + 48.60 \cdot {}^{0}TI_{\Theta}(L) - 8.77 \tag{11}$$

Spectral Moments TIs model:
$$S = 0.24 \cdot {}^{0}TI_{\pi}(T) + 31.14 \cdot {}^{1}TI_{\pi}(L) - 14.61 \tag{12}$$

2DL-Electrostatic Potential TIs model:
$$S = -22.21 \cdot {}^{0}TI_{\xi}(T) + 3.47 \cdot {}^{0}TI_{\xi}(L) + 32.68 \cdot {}^{5}TI_{\xi}(L) + 16.04 \tag{13}$$
Almost all models have p-level values < 0.05 and showed very good predictability in the training series. The Wiener indices model is the only one that correctly classified all the HRV strains in both the training and cross-validation series. The models with the Balaban, MTI, Lattice Electrostatic Potential, and Spectral Moment HP-Lattice network descriptors correctly evaluated 100% of the viruses in training but misclassified some strains in validation. However, the p-level of the Balaban model is relatively high for a statistically significant model, considering that p = 0.05 is just the threshold value for the test. The Shape coefficient indices model showed the lowest discriminatory power, with only 60% of the LDLR group correctly evaluated and 88.9% of the ICAM-1 group recognised by the theoretical model.
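As an illustration of how the classification equations above are applied, the scoring step of Eqs. (4) and (9) can be sketched in Python. This is a minimal sketch and not the workflow used by the authors: the dictionary `MODELS`, the function `discriminant_score` and the descriptor values in the usage example are hypothetical, and only the coefficients are copied from the equations. The decision rule that turns S into an LDLR or ICAM-1 assignment is not stated in this section, so the sketch returns only the score.

```python
# Coefficients copied from Eqs. (4) and (9):
# (coefficient of the total TI, coefficient of the local TI, intercept).
MODELS = {
    "wiener": (9.6e-4, -0.92, 23.47),
    "diameter": (0.58, -0.75, -7.07),
}


def discriminant_score(model, ti_total, ti_local):
    """Return S = a_T * TI(T) + a_L * TI(L) + a_0 for the chosen model."""
    a_total, a_local, intercept = MODELS[model]
    return a_total * ti_total + a_local * ti_local + intercept


if __name__ == "__main__":
    # Hypothetical descriptor values, used only to show the call.
    print(discriminant_score("diameter", ti_total=20.0, ti_local=6.0))
```

Keeping the coefficients in one table makes it easy to score the same strain with several of the 11 models and compare their outputs.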
3.2. Back Projection Analysis of 2DL-SCN based models

In the context of QSAR, the so-called Back-Projection Analysis (BPA) is the process of drawing a map that depicts the influence of every molecular substructure on the property under investigation [32]. Some of the descriptors used in this study, such as entropy or moment, allow this type of approach to be used, which is extremely useful for interpreting the results. These descriptors and the application of this type of back-projection analysis have been reported in previous publications [33]. In fact, one of the most significant advantages of the QSAR approach reported here concerns the interpretation of the results in terms of the influence that each network
sub-structure has over the property in question. This information can be obtained by applying a BPA, which consists of the projection of the QSAR model backwards onto the 2DL-SCN network. To do so, we first partition the network or graph into nodes. Then, we calculate the local TIs of these network nodes. Next, we substitute the values of these local TIs or node centralities into the QSAR model, and finally we sum the contributions of nodes re-grouped into sub-networks in order to map these fragment contributions over the network on a colour scale [34]. This kind of analysis has been widely reported for sub-graphs in the QSAR study of small-sized molecules [35]. We extended BPA to map the function of the different fragments of a protein backwards over the network representation of the large secondary structure of the corresponding RNA [33]. In a very recent study, the contributions obtained in this way were matched against the degree of conservation of these sequence fragments by BLAST-based sequence alignment [36]. In this work, the BPA of the QSAR model was carried out for the first time using the node TIs on selected 2DL-SCN structures, and the results are shown in Figure 2. In this figure, the fragments that contribute most to the interaction with the viral receptor are represented by darker colours and those that contribute least by lighter colours. The node containing the Lys of the HI loop appears as a black dot. The TIs of a node for small-sized graphs and for large complex networks (also known as node centralities) formally differ, but they are essentially local TIs of the same graph/network nature [13]. For instance, see the case of Closeness-vitality, Cclv(j) = W(G) – W(G/j), a node centrality of complex networks derived as the difference between the Wiener index of the network with and without the node j [37].
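The Closeness-vitality formula quoted above can be checked on a toy graph. The following Python sketch is illustrative only: it assumes an unweighted, undirected network stored as a dictionary of neighbour sets, takes W(G) as the sum of shortest-path lengths over unordered node pairs, and uses hypothetical function names. It is not the CentiBiN implementation cited in [37].

```python
from collections import deque


def wiener_index(adj):
    """Sum of shortest-path lengths over all unordered node pairs (BFS)."""
    total = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < len(adj):
            return float("inf")  # some node is unreachable
        total += sum(dist.values())
    return total // 2  # every pair was counted from both ends


def remove_node(adj, node):
    """Copy of the network without `node` and its edges."""
    return {u: {v for v in nbrs if v != node}
            for u, nbrs in adj.items() if u != node}


def closeness_vitality(adj, node):
    """Cclv(j) = W(G) - W(G/j), following the formula in the text."""
    return wiener_index(adj) - wiener_index(remove_node(adj, node))


if __name__ == "__main__":
    # Toy 4-node cycle, not a real 2DL-SCN.
    cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {2, 0}}
    print(wiener_index(cycle), closeness_vitality(cycle, 0))
```

In this sketch, removing a node that disconnects the network makes W(G/j) infinite, so the vitality is minus infinity; how such cut nodes should be treated is left open here.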
The 2DL-SCNs of the protein sequences of HRVs 2, 1A, 29 and 1B were partitioned into 4 or 5 relevant fragments (F1-F5), depending on the structures studied. The contributions of the different fragments to the interaction with the viral receptor were then calculated. This calculation was carried out with the spectral moment model and the entropy model. The calculation of fragment contributions with spectral-moment-based QSAR models is one of the most widespread in the literature [38]. We also performed the calculation with the entropy model to illustrate that the results obtained with both models are very similar and to validate the consistency of the method. Interestingly, for all models the statistical procedure selected local descriptors of the HI loop region. This region of the virus makes the highest contribution to binding the low-density lipoprotein receptor (LDLR). This is the region in which the lysine is maintained in the HI loop, located in fragment F2. It appears that this amino acid is very important in influencing the entry route of the virus, although it is possible that a range of factors could be responsible for the
ability of the virus to penetrate the cell through various mechanisms [39]. Fragment F1 also appears to make a significant contribution in the four proteins studied, although its contribution is markedly lower than that of fragment F2.
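The fragment-contribution step of the BPA described above (substitute local node TIs into the QSAR model, then sum over the nodes of each fragment) can be sketched as follows. This is a hedged illustration in Python: the function names, the node values and the fragment boundaries are hypothetical, and a single model coefficient is assumed to multiply every local node value; the actual MARCH-INSIDE calculation may differ.

```python
def fragment_contributions(node_values, fragments, coefficient):
    """Sum coefficient * local TI over the nodes assigned to each fragment."""
    return {
        name: sum(coefficient * node_values[node] for node in nodes)
        for name, nodes in fragments.items()
    }


def to_colour_scale(contributions):
    """Rescale contributions linearly to [0, 1]; 1 is the darkest colour."""
    low = min(contributions.values())
    high = max(contributions.values())
    span = high - low
    return {
        name: (value - low) / span if span else 0.0
        for name, value in contributions.items()
    }


if __name__ == "__main__":
    # Hypothetical local TI values for six nodes and three fragments.
    node_values = [0.10, 0.20, 0.40, 0.30, 0.05, 0.05]
    fragments = {"F1": [0, 1], "F2": [2, 3], "F3": [4, 5]}
    contributions = fragment_contributions(node_values, fragments, 2.0)
    print(contributions)
    print(to_colour_scale(contributions))
```

The rescaling mirrors the convention of the text, in which the fragments that contribute most are drawn in darker colours.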
Figure 2. Surface road map and 3D-SCN for HRV 1A
3.3. QSAR models based on TIs of 3D Surface Complex Networks

In general, Complex Networks other than the Lattice-like networks treated above are widely used in modern science, including for proteins [13]. Different types of protein contact maps or protein structural Complex Networks can be used to represent spatial protein structure information in the form of 2D graph/network representations. In these networks, two amino acids (nodes) are connected by an edge if they are spatial neighbours, or the nodes or edges of the network are weighted with labels that depend on the 3D structure. Consequently, the TIs derived for these classes of networks depend on the 3D structure of the protein, and several researchers prefer to call these TIs graph Topographic Indices (TGIs). In previous works, we investigated the road maps of the HRV surface. In mathematical terms, these road maps are protein 3D-Surface Complex Networks (3D-SCN), with the amino acids exposed on the viral surface playing the role of nodes. In the 3D-SCN, two amino acids (nodes) are connected by an edge (arc) if they are spatial neighbours (adjacent) on the virus surface. The reader should be aware that being chemically connected (contiguous amino acids in the protein sequence or amino acids connected by an S-S bridge) is neither a necessary nor a sufficient condition for being a neighbour in the 3D-SCN. The construction of 3D-SCN or viral road maps has been described in detail before, so we refer to the original work and omit detailed explanations here. In Figure 3 we illustrate a road map of the HRV surface and the corresponding 3D-SCN for the viral strain HRV 1A.
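The neighbour rule of the 3D-SCN can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: two surface residues are taken to be spatial neighbours when the distance between their representative points is at most a cutoff, and the cutoff value, the coordinates and the function name `build_surface_network` are hypothetical. The original road-map construction is described in the earlier work and may use a different criterion.

```python
import math


def build_surface_network(coords, cutoff):
    """Link residues i and j when their Euclidean distance is <= cutoff."""
    adjacency = {i: set() for i in range(len(coords))}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                adjacency[i].add(j)
                adjacency[j].add(i)
    return adjacency


if __name__ == "__main__":
    # Hypothetical coordinates for four surface residues, in sequence order.
    coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
    print(build_surface_network(coords, cutoff=4.0))
```

In the toy example, residues 0 and 3 are linked although they are not contiguous in the sequence, while the contiguous residues 2 and 3 are not linked, which is the point made in the text about chemical connectivity being neither necessary nor sufficient.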
Figure 3. (A) 3D-SCN, and (B) 2DL-SCN for HRV2 strain.
In this work, we compare the QSAR models obtained with the TIs of the HP-Lattice network (see previous section) and the TGIs of a 3D-SCN. Two of the QSAR models based on TGIs of 3D-SCN have been reported before (the Markov Chain Entropy and Electrostatic Potential models) [16, 17]. The other models based on TGIs of 3D-SCN and used in the comparison are reported here for the first time.

Markov Chain TGIs Models

Spectral Moments TGIs model (previously reported) [16]:
$$S = 3.11 \cdot {}^{2}TGI_{\pi}(Bs) + 17.30 \cdot {}^{0}TGI_{\pi}(L) - 15.39 \tag{14}$$

Entropy TGIs model (previously reported) [17]:
$$S = 0.84 \cdot {}^{5}TGI_{\Theta}(T) + 289.9 \cdot {}^{0}TGI_{\Theta}(L) + 24.07 \cdot {}^{0}TGI_{\Theta}(Bs) - 22.53 \tag{15}$$

Absolute Probability TGIs model (reported in this work):
$$S = 24.27 \cdot {}^{0}TGI_{Pa}(Bs) + 384.21 \cdot {}^{0}TGI_{Pa}(L) - 17.22 \tag{16}$$

Models based on SCN TGIs analogues of Classic TIs

Wiener TGIs model (reported in this work):
$$S = 8.0 \cdot 10^{-4} \cdot {}^{0}TGI_{W}(T) + 3.24 \cdot {}^{0}TGI_{W}(L) - 4.68 \tag{17}$$

SCN Sum of Degrees model (reported in this work):
$$S = 0.06 \cdot {}^{0}TGI_{\delta}(T) + 3.4 \cdot {}^{0}TGI_{\delta}(L) - 18.63 \tag{18}$$

Diameter TGIs model (reported in this work):
$$Vr(LDLR/ICAM) = -1.72 \cdot {}^{0}TGI_{D}(T) + 0.72 \cdot {}^{0}TGI_{D}(L) + 4.53 \tag{19}$$

Randic TGIs model (reported in this work):
$$S = -4.86 \cdot {}^{0}TGI_{\chi}(T) + 23.2 \cdot {}^{0}TGI_{\chi}(L) - 5.56 \tag{20}$$

Models based on SCN TGIs analogues of Complex Networks Centralities

Node Eccentricity TGIs model (reported in this work):
$$S = 286.9 \cdot {}^{0}TGI_{C_{ecc}}(T) + 3729.9 \cdot {}^{0}TGI_{C_{ecc}}(L) - 59.59 \tag{21}$$

Node Closeness TGIs model (reported in this work):
$$S = -630.3 \cdot {}^{0}TGI_{C_{clo}}(T) + 61379.7 \cdot {}^{0}TGI_{C_{clo}}(L) - 2.393 \tag{22}$$
We have to be aware that the 2DL-SCN, as in the case of Nandy-type lattices for nucleotides, is a 2D projection of a pseudo-folding (a rough approximation) of protein or nucleic acid folding in 3D space [40]. On the other hand, the 3D-SCN presupposes detailed knowledge of the 3D folding of the protein in order to determine which amino acids are surface neighbours. Consequently, the QSAR models based on TGIs of 3D-SCN were statistically significant and based on a more realistic network than the 2DL-SCN model. However, the 2DL-SCN proved rigorous enough to produce accurate QSAR models. This difference in the type of information encoded means that the TGIs of 3D-SCN and the TIs of 2DL-SCN are essentially different even when they are based on the same type of invariant. In order to illustrate visually the differences between 3D-SCN and 2DL-SCN, we show both types of network for the virus strain HRV2 in Figure 4. In general, we should not expect the same behaviour in the coefficients of the QSAR model even for the same class of invariants. For instance, the QSAR coefficients of the same type of entropy differ from one network to the other, but the entropy QSAR models are both accurate. This coincides with the successful application of entropy-type measures to codify information content at different structural scales of the system, reported by Graham et al. [41, 42]. Interestingly, we can determine the exact QSAR coefficients of some TIs of 2DL-SCN and of their more rigorous TGI analogues based on 3D-SCN, which demonstrates the previous statement. Hence, we can confirm here that the utility of a type of graph invariant depends not only on the type of problem we are trying to solve, the database, or the invariant formula, but also on the type of graph representation used. This justifies the recent search by many authors for new graph or network representations for nucleic acids [43, 44], proteins [45, 46], or proteomic maps [47].
4. Conclusions

We demonstrated that TIs of SCN are indices of general use at different levels of structural organization. In particular, we show that both the TGIs of the realistic-folding 3D-SCN and the TIs of the pseudo-folding 2DL-SCN of viral capsids predict HRV-receptor interactions.
Acknowledgments

González-Díaz H. acknowledges a tenure-track research position funded by the Program Isidro Parga Pondal, Xunta de Galicia.
References

1. Balaban AT, Beteringhe A, Constantinescu T, Filip PA, Ivanciuc O. Four new topological indices based on the molecular path code. Journal of chemical information and modeling. 2007 May-Jun;47(3):716-31.
2. Marrero-Ponce Y. Linear indices of the "molecular pseudograph's atom adjacency matrix": definition, significance-interpretation, and application to QSAR analysis of flavone derivatives as HIV-1 integrase inhibitors. J Chem Inf Comput Sci. 2004 Nov-Dec;44(6):2010-26.
3. Verma RP, Hansch C. Understanding human rhinovirus infections in terms of QSAR. Virology. 2007 Mar 1;359(1):152-61.
4. Estrada E. Characterization of the folding degree of proteins. Bioinformatics. 2002;18:697-704.
5. Nandy A, Nandy P. Graphical analysis of DNA sequence structure: II. Relative abundances of nucleotides in DNAs, gene evolution and duplication. Curr Sci. 1995;68:75-85.
6. Liao B, Ding K, Wang T. On A Six-Dimensional Representation of RNA Secondary Structures. J Biomol Struc Dynamics. 2005;22:455-64.
7. Chikenji G, Fujitsuka Y, Takada S. Shaping up the protein folding funnel by local interaction: lesson from a structure prediction study. Proc Natl Acad Sci U S A. 2006 Feb 28;103(9):3141-6.
8. Jiang M, Zhu B. Protein folding on the hexagonal lattice in the HP model. J Bioinform Comput Biol. 2005 Feb;3(1):19-34.
9. Fernández M, Caballero F, Fernández L, Abreu JI, Acosta G. Classification of conformational stability of protein mutants from 3D pseudo-folding graph representation of protein sequences using support vector machines. Proteins. 2008;70(1):167-75.
10. González-Díaz H, Pérez-Castillo Y, Podda G, Uriarte E. Computational Chemistry Comparison of Stable/Nonstable Protein Mutants Classification Models Based on 3D and Topological Indices. J Comput Chem. 2007;28:1990-5.
11. Chou KC, Cai YD. Prediction and classification of protein subcellular location-sequence-order effect and pseudo amino acid composition. J Cell Biochem. 2003 Dec 15;90(6):1250-60.
12. González-Díaz H, Vilar S, Santana L, Uriarte E. Medicinal Chemistry and Bioinformatics – Current Trends in Drugs Discovery with Networks Topological Indices. Curr Top Med Chem. 2007;7(10):1025-39.
13. González-Díaz H, González-Díaz Y, Santana L, Ubeira FM, Uriarte E. Proteomics, networks and connectivity indices. Proteomics. 2008;8:750-78.
14. González-Díaz H, Saiz-Urra L, Molina R, Gonzalez-Diaz Y, Sanchez-Gonzalez A. Computational chemistry approach to protein kinase recognition using 3D stochastic van der Waals spectral moments. J Comput Chem. 2007 Jan 31;28(6):1042-8.
15. Agüero-Chapin G, González-Díaz H, Molina R, Varona-Santos J, Uriarte E, González-Díaz Y. Novel 2D maps and coupling numbers for protein sequences. The first QSAR study of polygalacturonases; isolation and prediction of a novel sequence from Psidium guajava L. FEBS Lett. 2006;580:723-30.
16. González-Díaz H, Uriarte E. Biopolymer stochastic moments. I. Modeling human rhinovirus cellular recognition with protein surface electrostatic moments. Biopolymers. 2005 Apr 5;77(5):296-303.
17. González-Díaz H, Molina RR, Uriarte E. Stochastic molecular descriptors for polymers. 1. Modelling the properties of icosahedral viruses with 3D-Markovian negentropies. Polymer. 2003(45):3845-53.
18. STATISTICA-6.0. 6.0 ed. Tulsa, OK, U.S.A.: StatSoft Inc. 2002.
19. Marrero-Ponce Y, Khan MT, Casanola Martin GM, Ather A, Sultankhodzhaev MN, Torrens F, et al. Prediction of Tyrosinase Inhibition Activity Using Atom-Based Bilinear Indices. ChemMedChem. 2007 Apr 16;2(4):449-78.
20. Castillo-Garit JA, Marrero-Ponce Y, Torrens F, Rotondo R. Atom-based stochastic and non-stochastic 3D-chiral bilinear indices and their applications to central chirality codification. J Mol Graph Model. 2007 Jul;26(1):32-47.
21. Vlasak M, Blomqvist S, Hovi T, Hewat E, Blaas D. Sequence and structure of human rhinoviruses reveal the basis of receptor discrimination. J Virol. 2003;77:6923-30.
22. Herz J. Deconstructing the LDL receptor--a rhapsody in pieces. Nat Struct Biol. 2001;8:476-8.
23. Liao B, Wang TM. New 2D graphical representation of DNA sequences. J Comput Chem. 2004 Aug;25(11):1364-8.
24. Zhang Y, Chen W. Analysis of similarity/dissimilarity of long DNA sequences based on three 2DD-curves. Comb Chem High Throughput Screen. 2007 Mar;10(3):231-7.
25. Liao B, Wang T, Ding K. On A Seven-Dimensional Representation of RNA Secondary Structures. Molecular Simulation. 2005;31(14):1063-71.
26. Thachuk C, Shmygelska A, Hoos HH. A replica exchange Monte Carlo algorithm for protein folding in the HP model. BMC Bioinformatics. 2007 Sep 17;8(1):342.
27. Liao B, Xiang X, Zhu W. Coronavirus phylogeny based on 2D graphical representation of DNA sequence. J Comput Chem. 2006;27(11):1196-202.
28. Randic M, Vracko M, Nandy A, Basak SC. On 3-D graphical representation of DNA primary sequences and their numerical characterization. J Chem Inf Comput Sci. 2000 Sep-Oct;40(5):1235-44.
29. Gonzales-Diaz H, Molina R, Hernandez I. MARCH-INSIDE version 2.0 (Markovian Chemicals In Silico Design). 2.0 ed: Chemicals Bio-actives Center, Central University of Las Villas, Cuba. 2006. This is a preliminary experimental version; a future professional version shall be available to the public. For any information about it, send an e-mail to the corresponding author: gonzalezdiazh@yahoo.es or humbertogd@uclv.edu.cu.
30. Berger B, Leighton T. Protein folding in the hydrophobic-hydrophilic (HP) model is NP-complete. J Comput Biol. 1998 Spring;5(1):27-40.
31. Randic M. Graphical representations of DNA as 2-D map. Chem Phys Lett. 2004;386(4-6):468-71.
32. Stiefl N, Baumann K. Mapping Property Distributions of Molecular Surfaces: Algorithm and Evaluation of a Novel 3D Quantitative Structure-Activity Relationship Technique. J Med Chem. 2003;46:1390-407.
33. González-Díaz H, Aguero-Chapin G, Varona-Santos J, Molina R, de la Riva G, Uriarte E. 2D RNA-QSAR: assigning ACC oxidase family membership with stochastic molecular descriptors; isolation and prediction of a sequence from Psidium guajava L. Bioorg Med Chem Lett. 2005 Jun 2;15(11):2932-7.
34. Gia O, Marciani Magno S, González-Díaz H, Quezada E, Santana L, Uriarte E, et al. Design, synthesis and photobiological properties of 3,4-cyclopentenepsoralens. Bioorg Med Chem. 2005 Feb 1;13(3):809-17.
35. Vilar S, Estrada E, Uriarte E, Santana L, Gutierrez Y. In silico studies toward the discovery of new anti-HIV nucleoside compounds through the use of TOPS-MODE and 2D/3D connectivity indices. 2. Purine derivatives. Journal of chemical information and modeling. 2005 Mar-Apr;45(2):502-14.
36. González-Díaz H, Agüero-Chapin G, Varona J, Molina R, Delogu G, Santana L, et al. 2D-RNA-Coupling Numbers: A New Computational Chemistry Approach to Link Secondary Structure Topology with Biological Function. J Comput Chem. 2007;28:1049-56.
37. Junker BH, Koschuetzki D, Schreiber F. Exploration of biological network centralities with CentiBiN. BMC Bioinformatics. 2006 Apr 21;7(1):219.
38. Estrada E, Vilar S, Uriarte E, Gutierrez Y. In silico studies toward the discovery of new anti-HIV nucleoside compounds with the use of TOPS-MODE and 2D/3D connectivity indices. 1. Pyrimidyl derivatives. J Chem Inf Comput Sci. 2002 Sep-Oct;42(5):1194-203.
39. Vlasak M, Roivainen M, Reithmayer M, Goesler I, Laine P, Snyers L, et al. The minor receptor group of human rhinovirus (HRV) includes HRV23 and HRV25, but the presence of a lysine in the VP1 HI loop is not sufficient for receptor binding. J Virol. 2005 Jun;79(12):7389-95.
40. Randič M, Vračko M, Nandy A, Basak SC. On 3-D Graphical Representation of DNA Primary Sequences and Their Numerical Characterization. J Chem Inf Comput Sci. 2000;40:1235-44.
41. Graham DJ. Information Content in Organic Molecules: Brownian Processing at Low Levels. Journal of chemical information and modeling. 2007;47(2):376-89.
42. Graham DJ. Information Content in Organic Molecules: Structure Considerations Based on Integer Statistics. J Chem Inf Comput Sci. 2002;42:215.
43. Nandy A, Harle M, Basak SC. Mathematical descriptors of DNA sequences: development and applications. ARKIVOC. 2006;9:211-38.
44. Raychaudhury C, Nandy A. Indexing Scheme and Similarity Measures for Macromolecular Sequences. J Chem Inf Comput Sci. 1999;39:243-7.
45. Randić M, Butina D, Zupan J. Novel 2-D graphical representation of proteins. Chem Phys Lett. 2006;419:528-32.
46. Fernández M, Caballero J, Fernández L, Abreu JI, Garriga M. Protein radial distribution function (P-RDF) and Bayesian-Regularized Genetic Neural Networks for modeling protein conformational stability: Chymotrypsin inhibitor 2 mutants. J Mol Graph Model. 2007;26(4):748-59.
47. Randic M, Estrada E. Order from chaos: observing hormesis at the proteome level. Journal of proteome research. 2005 Nov-Dec;4(6):2133-6.
Transworld Research Network 37/661 (2), Fort P.O. Trivandrum-695 023 Kerala, India
Topological Indices for Medicinal Chemistry, Biology, Parasitology, Neurological and Social Networks, 2010: 163-178 ISBN: 978-81-7895-489-9 Editor: Humberto González-Díaz and Cristian Robert Munteanu
9. Predicting bacterial co-aggregation networks with phylogenetic spectral moments

Ronal Ramos de Armas1, Liane Saíz-Urra2,3 and Humberto González-Díaz2

1Smiths Detection-LiveWave, Inc., Toronto, Canada; 2CBQ, Central University of Las Villas (UCLV), 54830, Santa Clara, Cuba; 3Rega Institute, Katholieke Universiteit Leuven, B-3000 Leuven, Belgium; 4Department of Microbiology & Parasitology, Faculty of Pharmacy, University of Santiago de Compostela, 15782, Spain
Abstract. Pair-wise co-aggregation of bacterial species has been recognized as a very important step in the biofilm formation process, a process first reported by R. J. Gibbons 40 years ago that has high clinical and industrial relevance. On the other hand, phylogenetic trees offer a graph-based picture of the similarities between species, which is often useful for deriving biologically relevant information. Unfortunately, it is not possible to predict the occurrence of co-aggregation by direct visual inspection of local phylogenetic tree topology. Here, we introduce tree Topological Indices (TTIs) called Species-Species Evolution Divergence Spectral moments (${}^{k}\pi_{a,b}$) to account for bacteria-bacteria co-aggregations. They can be derived from 16S rRNA phylogenetic trees using Markov Chain theory. We sought a classification function, Co-agg.score = $-2.326 \cdot {}^{2}\pi_{a,b} - 0.398$, which correctly discriminates 88.24% (165 out of 187) and 87.09% (54 out of 62) of the co-aggregating/non-co-aggregating pairs of bacterial species in the training and cross-validation series, respectively. The outputs of the model were used to reconstruct a bacterial co-aggregation Complex Network for known biofilm-forming bacterial species. We named this kind of model, based on TTIs derived from a phylogenetic tree, the Quantitative Phylogenetics-Property Relationships (QPhPR) approach. The name is a direct analogy with the classic Quantitative Structure-Property Relationships (QSPR) models used in chemistry, which are based on TIs of molecular graphs.

Correspondence/Reprint request: Dr. González-Díaz H, Department of Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela, 15782, Santiago de Compostela, Spain. E-mail: gonzalezdiazh@yahoo.es or humberto.gonzalez@usc.es
1. Introduction

Nearly 40 years ago, Dr. R.J. Gibbons made the first reports of the clinical relevance of what we now know as bacterial biofilms when he published his observations of the role of polysaccharide glycocalyx formation on teeth by S. mutans [1]. As the clinical relevance of bacterial biofilm formation became increasingly apparent, interest in the phenomenon exploded. Studies are rapidly shedding light on the biomolecular pathways leading to this sessile mode of growth, but many fundamental questions remain. Four potential incentives behind the formation of biofilms by bacteria during infection are considered: protection from harmful conditions in the host (defense), sequestration to a nutrient-rich area (colonization), utilization of cooperative benefits (community), and the view that bacteria normally grow as biofilms and planktonic cultures are an in vitro artifact (biofilms as the default mode of growth) [2]. In this sense, co-aggregation has been recognized as a very important step in the process of biofilm formation. Specifically, co-aggregation may enhance biofilm development in the human oral cavity [3, 4]. Co-aggregation interactions could enhance the development of biofilms in fast-flowing water systems [5] and mediate the integration of pathogens into biofilms [6]. Ultimately, co-aggregation interactions could influence the bacterial diversity of freshwater biofilms. On the other hand, co-aggregation interactions can be studied as interaction networks (co-aggregation networks). In general, biological networks such as metabolic networks [7] and protein interaction networks (PINs) [8] share important structural features with other real-world networks in dissimilar fields ranging from the Internet to social networks [9, 10]. These networks can be used to study a wide variety of biological endpoints, such as the robustness of food webs against species loss [11].
Therefore, the representation of this type of interaction as a network is an interesting goal intended to further studies related to biofilm formation. Computational methods for predicting interactions have advanced in past years, with completely new approaches and sophisticated ‘mining’ of existing interaction data to infer additional interactions. One new trend was to study the tendency of interacting proteins to exhibit similar phylogenetic trees [12]; quantitative algorithms for assigning interaction partners involved analyzing trees of families of interacting proteins, such as a ligand and receptor tree, and finding proteins that occupy similar positions in two trees [13, 14]. Phylogenetics has played an important role in evolutionary biology [15]. Considering the potential of phylogenetics, it is interesting to make use of phylogenetic trees for the reconstruction of bacterial co-aggregation networks in order to shed light on the bacterial co-aggregation phenomenon. However, recent studies of co-aggregating freshwater biofilm bacteria have demonstrated that co-aggregation often occurs between bacteria that are taxonomically distant (intergeneric co-aggregation) and occasionally between strains belonging to the same species (intra-species co-aggregation) [16-18]. As a consequence, it becomes very difficult to predict the occurrence of co-aggregation by direct visual inspection of the phylogenetic tree. More generally, many researchers use the qualitative comparison of tree topology to infer different rules in phylogenetic analysis [19, 20]. This points to the need to develop Tree Topological Indices (TTIs) in order to compare tree topology in quantitative and not only qualitative terms and to connect it to other relevant properties. With this purpose, we use here the representation of the phylogenetic tree as a graph. This means that we represent both the bacterial species and the evolutionary stages as nodes, connected by evolutionary steps represented by edges. In principle, different kinds of graphs or networks and their indices have been used to describe very different types of chemical and biological systems [21]. On this basis, we can use local measures of the connectedness of the graphs, such as Connectivity Indices (CIs) and Topological Indices (TIs), to describe bacterial co-aggregation.
We propose here a new kind of TIs, which may be understood as evolution pathway divergence spectral moments (${}^{k}\pi_{a,b}$). These new TIs are TTIs that can be calculated using a Markov Model (MM) at different steps of bacterial co-evolution within a 16S rRNA gene phylogenetic tree. In any case, the new TTIs may be applied in principle to describe quantitatively the topology of any kind of tree previously published for phylogenetic analysis [22-25]. The selection of spectral moment measures is justified by the several successful applications reported for this kind of parameter in the study of diverse complex systems [26, 27]. Next, we used the ${}^{k}\pi_{a,b}$ values to develop, for the first time, a Quantitative Phylogenetics-Property Relationship (QPhPR) model to predict the probability with which two bacterial species co-aggregate given their 16S rRNA sequences, for a large set of 138 pairs of non-co-aggregating bacteria and 111 pairs of co-aggregating ones. The corresponding bacterial co-aggregation network was reconstructed using the above-mentioned methodology.
2. Materials and methods 2.1. Spectral moments for the evolutionary divergence path of two bacteria species We used for a database previously experimentally studied and reported in the literature to derive the present modeling of bacteria co-aggregation [28]. In the Figure 1 we illustrated the phylogenetic tree used in this previous experimental work. In the tree, the names of the bacteria species investigated appeared in boldface style and other bacteria species did not investigated by these authors appear in normal style. We focused our attention herein on the evolution divergence pathway followed by every pair of two bacteria species a and b after k evolution steps moving forward (and also backwards) from the root of the tree (ancient common bacteria specie) to the final or current species. Our hypothesis is that spectral moments measures of the different steps given by both species in this pathway express certain degree of divergence of two bacteria and may be then connected to their co-aggregation. We used a Markov Model (MM) to calculate the values of these values of Spectral moments (kπa,b) for the evolutionary divergence. The algorithm based on the so-called MARCH-INSIDE approach [26, 27]. Here, the classic Markov matrix MARCH-INSIDE approach [29] has been adapted to characterize phylogentic information in the following way. First, we need to construct the matrix 1Π (see Eq. 1). This matrix is built up as a square matrix (n × n). The matrix 1Π contains the evolution transition probabilities 1pij to reach a node ni moving from a node nj inside the phylogenetic tree throughout an walk of length k = 1 (one evolution step). Let be δj the number of possible evolution steps that the specie may give from node nj to ni (degree of the phylogenetic tree node). 
Given that αij = 1 if and only if the two nodes ni and nj are neighbors placed at topological distance k = 1 in the phylogenetic graph (that is, separated by one evolution step) and αij = 0 otherwise, we can calculate:

$$ {}^{1}p_{ij} = \frac{\alpha_{ij} \cdot \delta_j}{\sum_{l=1}^{n} \alpha_{il} \cdot \delta_l} \qquad (1) $$
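As an illustration of Eq. 1, the following is a minimal sketch in Python that builds the one-step matrix from a 0/1 adjacency matrix. The function name `transition_matrix` and the four-node toy tree are hypothetical and are not taken from MARCH-INSIDE; the sketch follows the definition literally, with no self-transitions (αii = 0), which is an assumption.

```python
def transition_matrix(adjacency):
    """Build the one-step matrix 1Π of Eq. 1 from a 0/1 adjacency matrix."""
    n = len(adjacency)
    # δ_j: degree of node j, i.e. the number of neighbours at distance 1.
    degree = [sum(row) for row in adjacency]
    matrix = []
    for i in range(n):
        # Denominator of Eq. 1: sum over l of α_il · δ_l.
        norm = sum(adjacency[i][l] * degree[l] for l in range(n))
        row = [adjacency[i][j] * degree[j] / norm if norm else 0.0
               for j in range(n)]
        matrix.append(row)
    return matrix


# Hypothetical toy tree: root 0 - internal node 1 - leaves 2 and 3.
TOY_TREE = [
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
ONE_STEP = transition_matrix(TOY_TREE)
```

With this normalization every row of the toy matrix sums to 1, so each row can be read as a probability distribution over the neighbours of a node.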
Next, we used the theory of Markov chains to calculate the evolution spectral moments (kπa,b):

$$ \pi_k = \mathrm{Tr}\left[ \left( {}^{1}\Pi \right)^{k} \right] = \sum_{i = j \in \mathrm{Path}(0,a,b)} {}^{k}p_{ij} \qquad (2) $$
Predicting bacterial co-aggregation networks with phylogenetic spectral moments
Here, kpij are the probabilities of reaching a node ni moving through a walk of length k from a node nj. The symbol i = j ∈ Path(0,a,b) indicates that we sum only the values of kpij on the main diagonal of the matrix that at the same time lie within the shortest paths (divergent evolution paths) connecting na or nb with n0 (the root of the phylogenetic tree). The values of kπa,b were calculated with the software MARCH-INSIDE 3.0 [30].
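The path-restricted diagonal sum of Eq. 2 can be sketched as follows. This is only an illustrative reading of the formula, not the MARCH-INSIDE implementation; the helper names and the three-node chain are hypothetical, and the toy matrix has no self-transitions.

```python
def mat_mul(a, b):
    """Plain-Python product of two square matrices of equal size."""
    n = len(a)
    return [[sum(a[i][l] * b[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]


def spectral_moments(one_step, path_nodes, k_max):
    """Return [π_0, ..., π_k_max]: diagonal sums of (1Π)^k over path_nodes."""
    n = len(one_step)
    # (1Π)^0 is the identity matrix.
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    moments = []
    for _ in range(k_max + 1):
        moments.append(sum(power[i][i] for i in path_nodes))
        power = mat_mul(power, one_step)
    return moments


# Hypothetical chain 0 - 1 - 2, already normalized as in Eq. 1.
TOY_CHAIN = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]
MOMENTS = spectral_moments(TOY_CHAIN, path_nodes=[0, 1, 2], k_max=3)
```

On this toy chain the odd-order moments are zero because the sketch allows no self-transitions; the convention used by the original software may differ.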
Figure 1. Bacterial phylogenetic tree based on 540-bp-long sequences of 16S rRNA genes. The tree was constructed by Rickard et al. (see reference Rickard et al. 2003b) using the neighbour-joining method of Jukes and Cantor. The scale bar represents one estimated substitution for every 10 nucleotides. The bootstrap values indicate confidence limits of the phylogenies, based on percentages of 100 replications. T. thermophilus (X07998) was used to root the tree.
2.2. Statistical analysis
Using the values of kπa,b, as defined previously, for all pairs of bacterial species, we can attempt to discriminate co-aggregating from non-co-aggregating species by fitting a simple linear classifier of the form:

$$ S(C) = a_1 \cdot \left( \frac{{}^{1}\pi_{a,b}}{{}^{0}\pi_{a,b}} \right) + a_2 \cdot \left( \frac{{}^{2}\pi_{a,b}}{{}^{0}\pi_{a,b}} \right) + \dots + a_k \cdot \left( \frac{{}^{k}\pi_{a,b}}{{}^{0}\pi_{a,b}} \right) + b_0 \qquad (3) $$

$$ S(C) = a_1 \cdot {}^{*}\pi_1 + a_2 \cdot {}^{*}\pi_2 + \dots + a_k \cdot {}^{*}\pi_k + b_0 \qquad (4) $$
S(C) is a real-valued output of the model that scores the propensity of species a and b to co-aggregate. We used 0πa,b as a scaling factor for all the spectral moments with k > 0, so that *πk in Eq. 4 denotes the ratio kπa,b/0πa,b of Eq. 3. We selected Linear Discriminant Analysis (LDA) [31] to fit the discriminant function. In the model, b0 and ak represent the coefficients of the classification function, determined by the least squares method as implemented in the LDA module of the STATISTICA 6.0 software package [32]. A forward-stepwise algorithm was used for variable selection [33, 34]. The statistical significance of the LDA model was determined by Fisher's test, examining the respective p-level (p) for the canonical regression coefficient (Rc). All the variables included in the model were standardized in order to bring them onto the same scale. Subsequently, a standardized linear discriminant equation that allows their coefficients to be compared was obtained [35]. The model was trained with a training series and later validated with an external validation series.
Table 1. Classification results.
Table 2. Observed vs. Predicted Bacterial co-aggregation.
3. Results and discussion
In order to reconstruct the bacterial co-aggregation interaction network, we randomly selected 104 co-aggregating and 83 non-co-aggregating pairs of bacteria to train the first model. The model correctly predicts more than 80% of cases in all training and validation experiments (see Table 1). The equation of the model is:

$$ S(C) = 1602.9 \cdot {}^{*}\pi_1 - 940.4 \cdot {}^{*}\pi_2 - 273.8 \cdot {}^{*}\pi_3 - 227.5 \qquad R_c = 0.63 \quad p < 0.001 \qquad (5) $$
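To make the use of Eq. 5 concrete, here is a minimal sketch that scales a set of spectral moments by the zero-order moment and evaluates the score. The coefficients are those of Eq. 5; the input moments are hypothetical numbers chosen only for the arithmetic and do not correspond to any pair of species in Table 2.

```python
# Coefficients of *π1, *π2, *π3 and the intercept, as printed in Eq. 5.
COEFFICIENTS = (1602.9, -940.4, -273.8)
INTERCEPT = -227.5


def scaled_moments(moments):
    """Map [π0, π1, π2, π3] to [*π1, *π2, *π3] by dividing by π0."""
    pi0 = moments[0]
    return [m / pi0 for m in moments[1:4]]


def score(scaled):
    """Evaluate S(C) of Eq. 5 for scaled moments [*π1, *π2, *π3]."""
    return sum(a * x for a, x in zip(COEFFICIENTS, scaled)) + INTERCEPT


# Hypothetical raw moments, used only to illustrate the calculation.
EXAMPLE_MOMENTS = [4.0, 2.0, 1.0, 0.5]
EXAMPLE_SCORE = score(scaled_moments(EXAMPLE_MOMENTS))
```

The sketch stops at the raw score; how S(C) is converted into the classification probability is handled by the LDA software and is not reproduced here.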
In Table 2, we show all pairs of bacterial species and their classification as co-aggregating or non-co-aggregating pairs. For the reconstruction of the corresponding network, we used the adjacency matrix generated by the classifications of the LDA model as input for the CentiBiN software. Using the values in Table 2, we can reconstruct the network by connecting with an edge all pairs predicted by the model to be co-aggregating species (probability p > 0.5). The predicted bacterial co-aggregation interaction network was represented using the Centralities in Biological Networks (CentiBiN) software [36], an application for the calculation and visualization of centralities for biological networks. The adjacency matrix related to the observed co-aggregation interactions (if the pair of bacteria co-aggregate then the matrix element is 1, otherwise 0) was used as input. A predicted bacterial co-aggregation interaction network is shown in Figure 2.
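The thresholding step described above (an edge for every pair with p > 0.5) can be sketched as follows. The species labels and probabilities are hypothetical placeholders, and the function name is ours; exporting the matrix to whatever file format the network software expects is left out.

```python
def adjacency_from_predictions(species, probabilities, threshold=0.5):
    """Return a symmetric 0/1 matrix with an edge where p > threshold."""
    index = {name: i for i, name in enumerate(species)}
    n = len(species)
    adjacency = [[0] * n for _ in range(n)]
    for (a, b), p in probabilities.items():
        if p > threshold:
            i, j = index[a], index[b]
            # Co-aggregation is treated as an undirected relation.
            adjacency[i][j] = adjacency[j][i] = 1
    return adjacency


# Hypothetical species labels and predicted probabilities.
SPECIES = ["A", "B", "C"]
PREDICTED = {("A", "B"): 0.90, ("A", "C"): 0.20, ("B", "C"): 0.51}
ADJACENCY = adjacency_from_predictions(SPECIES, PREDICTED)
```

The same function yields the observed network if the probabilities are replaced by the experimental 0/1 outcomes.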
Figure 2. Bacterial co-aggregation networks.
4. Conclusions
The phylogenetic tree may hide interesting biological information that can be unraveled with TTIs, as in this case. The TTIs called Markov evolution divergence spectral moments (kπa,b) were introduced as phylogenetic bio-descriptors for bacteria-bacteria inter-species similarity. The present results introduce one of the first quantitative rules to reconstruct bacterial co-aggregation interaction networks based on RNA sequences, which constitutes a relevant step in the bioinformatics approach to bacterial biofilm formation. Numerical indices derived from phylogenetic trees can be understood as an alternative to models based on parameters derived directly from protein or nucleic acid sequences.
Acknowledgements
González-Díaz H. acknowledges financial support from the Program Isidro Parga Pondal of the Xunta de Galicia, University of Santiago de Compostela, Spain.
References
1. Costerton JW, Geesey GG, Cheng KJ. How bacteria stick. Sci Am. 1978 Jan;238(1):86-95.
2. Jefferson KK. What drives bacteria to produce a biofilm? FEMS Microbiol Lett. 2004 Jul 15;236(2):163-73.
3. Kolenbrander PE, Andersen RN, Clemans DL, Whittaker CJ, Klier CM. Potential role of functionally similar coaggregation mediators in bacterial succession. In: Newman HN, Wilson M, eds. Dental Plaque Revisited: Oral Biofilms in Health and Disease. Cardiff: Bioline Press 1999:171-86.
4. Bradshaw DJ, Marsh PD, Watson GK, Allison C. Role of Fusobacterium nucleatum and coaggregation in anaerobe survival in planktonic and biofilm oral microbial communities during aeration. Infect Immun. 1998 Oct;66(10):4729-32.
5. Elvers KT, Leeming K, Moore CP, Lappin-Scott HM. Bacterial-fungal biofilms in flowing water photo-processing tanks. J Appl Microbiol. 1998 Apr;84(4):607-18.
6. Buswell CM, Herlihy YM, Lawrence LM, McGuiggan JT, Marsh PD, Keevil CW, et al. Extended survival and persistence of Campylobacter spp. in water and aquatic biofilms and their detection by immunofluorescent-antibody and -rRNA staining. Appl Environ Microbiol. 1998 Feb;64(2):733-41.
7. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL. The large-scale organization of metabolic networks. Nature. 2000 Oct 5;407(6804):651-4.
8. Bork P, Jensen LJ, von Mering C, Ramani AK, Lee I, Marcotte EM. Protein interaction networks from yeast to human. Curr Opin Struct Biol. 2004 Jun;14(3):292-9.
9. Albert R, Barabási A-L. Statistical mechanics of complex networks. Rev Mod Phys. 2002;74(1):47-97.
10. Strogatz SH. Exploring complex networks. Nature. 2001 Mar 8;410(6825):268-76.
11. Estrada E. Food webs robustness to biodiversity loss: the roles of connectance, expansibility and degree distribution. J Theor Biol. 2007 Jan 21;244(2):296-307.
12. Goh CS, Cohen FE. Co-evolutionary analysis reveals insights into protein-protein interactions. J Mol Biol. 2002 Nov 15;324(1):177-92.
13. Gertz J, Elfond G, Shustrova A, Weisinger M, Pellegrini M, Cokus S, et al. Inferring protein interactions from phylogenetic distance matrices. Bioinformatics. 2003 Nov 1;19(16):2039-45.
14. Ramani AK, Marcotte EM. Exploiting the co-evolution of interacting proteins to discover interaction specificity. J Mol Biol. 2003 Mar 14;327(1):273-84.
15. Harvey PH, Pagel MD. The Comparative Method in Evolutionary Biology. Oxford: Oxford Univ. Press 1991.
16. Rickard AH, Leach SA, Hall LS, Buswell CM, High NJ, Handley PS. Phylogenetic relationships and co-aggregation ability of freshwater biofilm bacteria. Appl Environ Microbiol. 2002 Jul;68(7):3644-50.
17. Buswell CM, Herlihy YM, Marsh PD, Keevil CW, Leach SA. Coaggregation amongst aquatic biofilm bacteria. J Appl Microbiol. 1997 Sep;83(4):397-530.
18. Rickard AH, Gilbert P, High NJ, Kolenbrander PE, Handley PS. Bacterial coaggregation: an integral process in the development of multi-species biofilms. Trends Microbiol. 2003 Feb;11(2):94-100.
19. Puslednik L, Serb JM. Molecular phylogenetics of the Pectinidae (Mollusca: Bivalvia) and effect of increased taxon sampling and outgroup selection on tree topology. Mol Phylogenet Evol. 2008 Sep;48(3):1178-88.
20. Hampl V, Cepicka I, Flegr J, Tachezy J, Kulda J. Critical analysis of the topology and rooting of the parabasalian 16S rRNA tree. Mol Phylogenet Evol. 2004 Sep;32(3):711-23.
21. Bonchev D, Buck GA. From molecular to biological structure and back. J Chem Inf Model. 2007 May-Jun;47(3):909-17.
22. Tippery NP, Les DH. Phylogenetic analysis of the internal transcribed spacer (ITS) region in Menyanthaceae using predicted secondary structure. Mol Phylogenet Evol. 2008 Aug 6.
23. Hunt T, Vogler AP. A protocol for large-scale rRNA sequence analysis: towards a detailed phylogeny of Coleoptera. Mol Phylogenet Evol. 2008 Apr;47(1):289-301.
24. Volokhov DV, Neverov AA, George J, Kong H, Liu SX, Anderson C, et al. Genetic analysis of house-keeping genes of members of the genus Acholeplasma: phylogeny and complementary molecular markers to the 16S rRNA gene. Mol Phylogenet Evol. 2007 Aug;44(2):699-710.
25. Hofstetter V, Miadlikowska J, Kauff F, Lutzoni F. Phylogenetic comparison of protein-coding versus ribosomal RNA-coding sequence data: a case study of the Lecanoromycetes (Ascomycota). Mol Phylogenet Evol. 2007 Jul;44(1):412-26.
26. González-Díaz H, Prado-Prado F, Ubeira FM. Predicting antimicrobial drugs and targets with the MARCH-INSIDE approach. Curr Top Med Chem. 2008;8(18):1676-90.
27. González-Díaz H, González-Díaz Y, Santana L, Ubeira FM, Uriarte E. Proteomics, networks and connectivity indices. Proteomics. 2008;8:750-78.
28. Rickard AH, McBain AJ, Ledder RG, Handley PS, Gilbert P. Coaggregation between freshwater bacteria within biofilm and planktonic communities. FEMS Microbiol Lett. 2003 Mar 14;220(1):133-40.
29. Ramos de Armas R, González-Díaz H, Molina R, Uriarte E. Markovian Backbone Negentropies: Molecular descriptors for protein research. I. Predicting protein stability in Arc repressor mutants. Proteins. 2004 Sep 1;56(4):715-23.
30. González-Díaz H, Molina-Ruiz R, Hernandez I. MARCH-INSIDE v3.0 (MARkov CHains INvariants for SImulation & DEsign); Windows supported version under request to the main author contact email: gonzalezdiazh@yahoo.es. 3.0 ed 2007.
31. Van Waterbeemd H. Discriminant Analysis for Activity Prediction. In: Van Waterbeemd H, ed. Chemometric methods in molecular design. New York: Wiley-VCH 1995:265-82.
32. STATISTICA. 6.0 for Windows ed: Statsoft Inc. 2001.
33. Kowalski RB, Wold S. Pattern recognition in chemistry. In: Krishnaiah PR, Kanal LN, eds. Handbook of statistics. Amsterdam: North Holland Publishing Company 1982:673-97.
34. Van Waterbeemd H. Chemometric methods in molecular design. New York: Wiley-VCH 1995.
35. Kutner MH, Nachtsheim CJ, Neter J, Li W. Standardized Multiple Regression Model. Applied Linear Statistical Models. Fifth ed. New York: McGraw Hill 2005:271-7.
36. Koschützki D. CentiBiN, Centralities in Biological Networks. 1.4.2 ed. Germany: IPK Gatersleben 2004.
Transworld Research Network 37/661 (2), Fort P.O. Trivandrum-695 023 Kerala, India
Topological Indices for Medicinal Chemistry, Biology, Parasitology, Neurological and Social Networks, 2010: 179-189 ISBN: 978-81-7895-489-9 Editor: Humberto González-Díaz and Cristian Robert Munteanu
10. QSPR models for cerebral cortex co-activation networks
Humberto González-Díaz¹, Santiago Vilar², Daniel Rivero³, Enrique Fernández-Blanco³, Ana Porto³ and Cristian Robert Munteanu³
¹ Department of Microbiology & Parasitology, Faculty of Pharmacy, University of Santiago de Compostela, Santiago de Compostela, 15782, Spain
² National Institutes of Health, DHHS, Bethesda, Maryland 20892, USA
³ Department of Information and Communication Technologies, Computer Science Faculty, University of A Coruña, Campus de Elviña, 15071 A Coruña, Spain
Correspondence/Reprint request: Dr. González-Díaz H, Department of Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela (USC), Santiago de Compostela, 15782, Spain. E-mail: humberto.gonzalez@usc.es or gonzalezdiazh@gmail.com

1. Introduction
Complex networks can be used to study systems from areas as diverse as physics, biology, economics, ecology, and computer science. In current problems of the biosciences, prominent examples are protein molecular networks in the genome. On larger scales one finds complex networks of cells, as in neural networks, up to the scale of organisms in ecological food webs [1]. The reader may see the results of Anwander, Tittgemeyer et al. [2] on a connectivity-based parcellation of Broca's area. These authors report that it is generally agreed that the cerebral cortex can be segregated into structurally and functionally distinct areas (which may play the role of network nodes). However, brain function also strongly depends upon area-area anatomical connectivity (edges), which therefore forms a sensible criterion for the functio-anatomical segregation of cortical
areas. Diffusion-weighted magnetic resonance (MR) imaging offers the opportunity to apply this criterion in the individual living subject. Probabilistic tractographic methods provide excellent means to extract the connectivity signatures from diffusion-weighted MR data sets (functional edges in a network). The correlations among these signatures may then be used by an automatic clustering method to identify cortical regions with mutually distinct and internally coherent connectivity. In any case, cerebral cortex networks may be considered static (at least for a given period of time) or time dependent. For instance, Honey, Kotter et al. [3] studied how the network structure of the cerebral cortex shapes functional connectivity on multiple time scales. Functional networks recovered from long windows of neural activity (minutes) largely overlap with the underlying structural network. As a result, hubs in these long-run functional networks correspond to structural hubs. In contrast, significant fluctuations in functional topology are observed across the sequence of networks recovered from consecutive shorter (seconds) time windows. The functional centrality of individual nodes varies across time as interregional couplings shift. Furthermore, the transient couplings between brain regions are coordinated in a manner that reveals the existence of two anticorrelated clusters. These clusters are linked by prefrontal and parietal regions that are hub nodes in the underlying structural network. At an even faster time scale (hundreds of milliseconds), these authors detected individual episodes of interregional phase-locking and found that slow variations in the statistics of these transient episodes, contingent on the underlying anatomical structure, produce the transfer entropy functional connectivity and simulated blood oxygenation level-dependent correlation patterns observed on slower time scales.
Many results in this direction, see for instance the work of Costa, Kaiser et al., have revealed a high interest in predicting the connectivity of primate cortical networks from topological and spatial node properties. They presented a computational reconstruction approach to the problem of network organization, considering the topological and spatial features of each area in the primate cerebral cortex as a basis for the reconstruction of the global cortical network connectivity. Starting with all areas being disconnected, pairs of areas with similar sets of features are linked together, in an attempt to recover the original network structure. This type of study may be relevant for the study of different diseases. For instance, Mizuno, Villalobos et al. [4] examined the functional connectivity between thalamus and cerebral cortex in terms of blood oxygen level dependent signal cross-correlation in subjects with high-functioning autism and matched normal controls, using functional MRI during simple visuo-motor coordination. Both groups exhibited widespread connectivity, consistent with known extensive thalamo-cortical connectivity. In a direct
group comparison, overall more extensive connectivity was observed in the autism group, especially in the left insula and in right post-central and middle frontal regions. Consequently, numerical parameters that quantitatively describe the connectivity of these networks may be used to discriminate healthy subjects from patients in the neurosciences. This role may be played by the so-called graph theoretical indices. These indices are numerical parameters easily derived from graphs/networks that may be used in Quantitative Structure-Property Relationships (QSPR) studies. QSPR studies have been used for the construction of models predicting the properties of a complex system based on numerical parameters that describe the structure of the system. In particular, Topological Indices (TIs), Connectivity Indices (CIs), or node centralities Ct(j) of type t may be calculated from the graph/network representation of a system to describe full network topology, node connectivity, or sub-graph branching. This class of indices is one of the most flexible classes of indices for QSPR studies; see the recent reviews of González-Díaz et al. [5-9]. In this work, we report for the first time a QSPR study of the cat cerebral cortex network using TIs derived from node centralities Ct(j) calculated by the software CentiBiN [10].
2. Materials and methods
Names and abbreviations used for all areas
Here we list the abbreviations used for the areas in Table 1 and Figure 1, each followed by the name of the area and the relation of our parcellation to other relevant anatomical and physiological schemes. This information comes from many other references cited in the annex of the work of Scannell et al. [11]. 17, area 17, primary visual cortex, or striate cortex. 18, area 18, a retinotopically organized visual area. PMLS, posteromedial lateral suprasylvian area. AMLS, anteromedial lateral suprasylvian area, a visual area in the medial wall of the suprasylvian sulcus. VLS, ventrolateral suprasylvian area, a visual area situated in the posterior wall of the posterior part of the suprasylvian sulcus. PMLS, VLS, and parts of PLLS and AMLS correspond to the Clare-Bishop area of other parcellation schemes. PLLS, posterolateral lateral suprasylvian area. ALLS, anterolateral lateral suprasylvian area, a visual area in the lateral wall of the anterior part of the middle suprasylvian sulcus. DLS, dorsolateral suprasylvian area, a visual area in the anterior wall of the posterior part of the suprasylvian gyrus. PLLS, ALLS, and DLS overlap with the lateral suprasylvian area in other parcellation schemes. We have treated PMLS,
PLLS, AMLS, ALLS, VLS, and DLS as separate entities for the collation of connection data, but the precise disposition of visual areas in the lateral suprasylvian cortex remains unclear. There is considerable evidence, however, for several visual areas within the lateral suprasylvian sulcus. 21a, area 21a, a visual area on the posterior part of the suprasylvian gyrus and the superior wall of the posterior part of the suprasylvian sulcus. 21a overlaps with the Clare-Bishop area in the parcellation of Sherk. 21b, area 21b, a visual area on the posterior part of the suprasylvian gyrus and the posterior wall of the posterior part of the suprasylvian gyrus. 20a, area 20a, a retinotopically organized visual area on the posterior ectosylvian gyrus. 20b, area 20b, a retinotopically organized visual area on the posterior ectosylvian gyrus [11]. ALG, a visual area located in the lateral wall of the lateral gyrus between areas 7 and 19. ALG may be part of area 19. Although ALG was included in the literature survey and is shown in Figure 1, it was excluded from most of the topological analyses because of the lack of available connectional data. 7, area 7, a region of cortex on the middle part of the suprasylvian gyrus, lateral sulcus and lateral gyrus. The area contains cells responsive to visual, auditory, and somatosensory stimuli and has been implicated in the control of eye movements. AES, anterior ectosylvian sulcus, a region of multimodal cortex, containing cells responsive to auditory, visual, and somatosensory stimulation. A large proportion of cells are multimodal. Dorsoposterior parts of the sulcus are dominated by auditory responsivity, the fundus and ventral bank of the middle part are dominated by visually responsive cells, and the anterodorsal part contains a somatosensory representation (SIV) that is considered as a distinct area in this analysis. AES has strong connections with the superior colliculus and may be involved in oculomotor function.

Table 1. Forward-stepwise analysis of TIs for Cerebral Cortex regions.
Name of TI (a)                   Notation (b)   F       p        Effect (c)
Node degree                      TIδ(g)         106.1   0.0000   In
Betweenness for shortest path    TICb(g)        10.1    0.0019   In
Values for centroid              TICc(g)        8.6     0.0043   In
HITS-Hub                         TIChh(g)       16.0    0.0001   In
Radiality                        TICr(g)        0.1     0.8079   Out
Closeness                        TICclo(g)      8.6     0.0041   Out
Stress                           TICs(g)        0.6     0.4228   Out
Eccentricity                     TICecc(g)      1.9     0.1663   Out
Katz status index                TICk(g)        0.2     0.6247   Out
Eigenvector                      TICt(g)        14.1    0.0003   Out
Bargain                          TICg(g)        6.2     0.0144   Out
Page rank                        TICpr(g)       9.6     0.0025   Out
HITS-Authority                   TICha(g)       1.4     0.2425   Out
(a) Name of the respective centrality indices, see also Table 1. (b) Notation used in this work, see also Table 1 and Materials and Methods. (c) Variables in the model (In) or not in the model (Out).

Figure 1. (A) Views of the parcellation of the cat brain cerebral cortex in areas; the hippocampus and subiculum are not shown. (B) Complex network of region-region coactivation; all the areas shown on the map were included in the analysis, with the exception of TCA (corticoamygdaloid transition area), PPC (prepiriform cortex), and OB (olfactory bulb).

SVA, splenial visual area, situated in the splenial sulcus, between area 17 and the cingulate gyrus. SVA may be a visually responsive region of the posterior cingulate cortex [11]. PS, posterior suprasylvian area, a retinotopically organized area on the inferior part of the suprasylvian gyrus and sulcus. AI, primary auditory field. AAF, anterior auditory field. P, posterior auditory field, in the posterior wall of the posterior ectosylvian sulcus. VP, ventroposterior auditory field. AI, AAF, P and VP are the “core” auditory fields showing tonotopic organization with narrowly tuned cells. AII, second auditory field. DP, dorsoposterior auditory field. V, ventral auditory field. SSF, suprasylvian fringe, a thin band
of multimodal cortex running along the inferolateral border of the suprasylvian sulcus. EPp, posterior part of the posterior ectosylvian gyrus, a visual and auditory association area. Tem, temporal auditory field. AII, SSF, EPp, DP and Tem lie within the “auditory belt.” They lack the strict tonotopy of the “core” areas. The tonotopic organization of AII, for example, is less orderly than that of the “core” auditory fields, and cells within this area exhibit broad tuning curves. 3a, area 3a. 3b, area 3b. 1, area 1. 2, area 2. These are areas of somatosensory cortex. 3b, 1, and 2 constitute SI and contain one or more somatotopic representations dominated by cells responsive to cutaneous stimulation. 3b, 1, and 2 may constitute a single somatic koniocortical area. 3a contains a single contralateral somatotopic representation dominated by cells responsive to deep stimuli. SII, second somatosensory area, having multiple representations of some of the body regions. The majority of cells respond to superficial stimuli. In contrast to the primary somatosensory areas, SII shows a degree of ipsilateral input [11]. SIV, fourth somatosensory area, occupying the dorsal bank of the anterior part of the anterior ectosylvian sulcus and adjoining anterior ectosylvian gyrus. SIV has an orderly topographic representation of the body surface. 4g, area 4h; a region of motor cortex that occupies the anterior part of the cruciate sulcus and a small area of surrounding cortex. 4, area 4. This corresponds to areas 4 f, 4sf, and 4a that are in the superior and posterior aspects of the cruciate sulcus. 6l, lateral division of area 6, an area of premotor cortex. This area includes all regions of area 6 lateral to the rostral margin of the cruciate sulcus. It corresponds to the lateral region of area 6a. Electrical stimulation in this area may evoke movements. 6m, medial division of area 6. This area consists of regions of area 6 medial to the rostral margin of the cruciate sulcus and corresponds to the medial part of 6aB, 6ª (u, and 6if). 6m contains a region, the medial frontal eye field, where electrical stimulation may elicit eye movements. POA, presylvian oculomotor areas. These areas are located in the medial and lateral walls of the presylvian sulcus and correspond to DLo and VLo and to the lateral frontal eye fields. These two physiologically distinct areas have been considered together because of the relative lack of connectional data. They may be part of area 6. 5am, medial part of area 5a, on the medial side of the anterior part of the lateral gyrus and the medial part of the lateral sulcus. 5al, lateral part of area 5a, on the anterior suprasylvian gyrus and lateral side of the lateral sulcus. 5a overlaps with SIII, the third somatosensory area. 5bm, medial part of area 5b, on the anterior part of the lateral gyrus and the medial side of the lateral sulcus. 5bl, lateral part of area 5b, on the anterior part of the suprasylvian gyrus running into the lateral side of the lateral gyrus. 5m, medial division of area 5, on the anterior part of the medial lateral gyrus. SSAo, outer part of suprasylvian sulcal division of area
5, in the anterior part of the medial wall of the suprasylvian sulcus. SSAi, inner (deep) part of suprasylvian sulcal division of area 5, in the anterior part of the medial wall of the suprasylvian sulcus. Area 5 receives somatosensory, visual, and auditory input. 5b may be involved in visuomotor integration, and SSAo and SSAi may have polysensory responsivity. PFCr, rostral division of the prefrontal cortex. This area overlaps with the dorsal division of prefrontal cortex as defined by Musil and Olson. PFCdl, dorsolateral division of the prefrontal cortex. PFCv, ventral division of the prefrontal cortex. This area overlaps with both the dorsal and infralimbic divisions of the medial prefrontal cortex of Musil and Olson. PFCdm, dorsomedial division of the prefrontal cortex. This area overlaps with the dorsal division of the medial prefrontal cortex of Musil and Olson. Ia, agranular insula. Ig, granular insula. The dorsal insula contains a region responsive to visual stimuli, while more ventral insula regions respond to multimodal stimulation. CGa, anterior part of cingulate cortex. This corresponds to the anterior part of the posterior cingulate area of Olson and Musil. CGp, posterior part of cingulate cortex. This corresponds to the posterior part of the posterior cingulate area of Olson and Musil. CGa and CGp contain neurons that respond to multimodal sensory stimulation. LA, anterior limbic cortex. This corresponds to the anterior cingulate area of Musil and Olson. RS, retrosplenial cortex. PL, prelimbic area. This corresponds to area 32. This overlaps with the infralimbic division of the medial prefrontal cortex of Musil and Olson. IL, infralimbic area. This corresponds to area 25 of Room. 35, area 35 of the perirhinal cortex. 36, area 36 of the perirhinal cortex. PSb, presubiculum, parasubiculum, and postsubicular cortex. Because of the relative lack of connectional data for these areas in the cat, they were considered together in the analysis. Sb, subiculum. ER, entorhinal cortex [11].
Statistical analysis
Once the different descriptors had been calculated for all cerebral cortex regions or areas, we proceeded with the statistical analysis. For this, the variables were standardised and a Multiple Linear Regression (MLR) was carried out using the software package STATISTICA [12]. The MLR equation (1) is:

$$ \log S(A) = a_0 + \sum_{c,k}^{3,m,21} b_{kc} \cdot {}^{k}TI_{c} \qquad (1) $$
The variable S(A) is a real-valued score for the biological property under investigation: the activation of a cerebral cortex region. S(A) is the sum of the row for the region or area A in the matrix of corticocortical connections in the cat reported in the work of Scannell et al. [11]. The parameter log S(A), calculated by taking the logarithm of S(A), was the dependent variable. The values a0 and bkc are the coefficients of the MLR equation (QSPR model) [13, 14]. The statistical quality of the models was assessed using parameters such as the regression coefficient (R), the Fisher ratio (F), and the level of error (p) [13, 14].
Table 2. Detailed results for QSAR analysis of Cerebral Cortex regions.
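The definition of S(A) as a row sum of the connection matrix can be sketched as follows. The 3 × 3 matrix is a hypothetical placeholder rather than data from Scannell et al. [11], and the base-10 logarithm is an assumption, since the text only says "logarithm".

```python
import math


def activation_scores(connections):
    """S(A): sum of the row of the connection matrix for each area A."""
    return [sum(row) for row in connections]


def log_scores(connections):
    """log S(A) per area; None where S(A) = 0 and the logarithm is undefined."""
    # Base 10 is an assumption; the source does not state the base.
    return [math.log10(s) if s > 0 else None
            for s in activation_scores(connections)]


# Hypothetical connection strengths between three areas.
TOY_CONNECTIONS = [
    [0, 3, 2],
    [1, 0, 0],
    [0, 0, 0],
]
SCORES = activation_scores(TOY_CONNECTIONS)
LOG_SCORES = log_scores(TOY_CONNECTIONS)
```

Areas with no outgoing connections have S(A) = 0 and are returned as None, because their logarithm cannot enter the regression.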
3. Results
The best model found, see equation (2), presented excellent results in training with R = 0.94, which indicates that the model accounts for 88% of the variance of the data. In addition, the model presented R = 0.86 in an external validation series with N = 32 cases not used to train the model. We also report the adjusted value of the regression model in both training and validation series, which removes the effect of over-fitting due to an excessive number of variables. These values are also high (Radj = 0.88 and Radj = 0.73) and very similar to the non-adjusted values in training and validation. The detailed information for all brain regions analyzed, including the values of the TIs in the QSAR equation and the type of input/output activation, appears in Table 2.

$$ \log S = 0.0242 \cdot TI_{\delta}(j) + 0.0128 \cdot TI_{V}(j) - 0.00007 \cdot TI_{B}(j) - 15.4112 \cdot TI_{H}(j) + 1.6503 $$

$$ N = 98 \qquad R = 0.94 \qquad F = 176.75 \qquad p < 0.001 \qquad (2) $$
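As a worked illustration of equation (2), the sketch below evaluates the fitted model for one set of index values and recovers the share of variance explained from R. The coefficients and R are those reported above; the input values are hypothetical and are not taken from Table 2, and the argument names simply mirror the subscripts of the equation.

```python
def predict_log_s(ti_delta, ti_v, ti_b, ti_h):
    """Evaluate equation (2) for one cortical area j."""
    return (0.0242 * ti_delta + 0.0128 * ti_v
            - 0.00007 * ti_b - 15.4112 * ti_h + 1.6503)


# Hypothetical index values for a single area, for illustration only.
EXAMPLE_LOG_S = predict_log_s(ti_delta=10.0, ti_v=5.0, ti_b=100.0, ti_h=0.01)

# Share of variance explained in training: R squared with R = 0.94.
R_TRAIN = 0.94
VARIANCE_EXPLAINED = R_TRAIN ** 2
```

Squaring R = 0.94 gives about 0.88, which is the 88% of variance quoted in the text.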
4. Conclusions
We demonstrated that TIs are indices of general use at different levels of structural organization. In particular, we show that TIs may be used to seek QSPR models that predict cerebral cortex co-activation. This approach opens a new gateway for the extension of classic TIs to other uses in the neurological sciences.
Acknowledgments
The authors acknowledge partial financial support from the grants 2006/60, 2007/127 and 2007/144 from the General Directorate of Scientific and Technologic Promotion of the Galician University System of the Xunta de Galicia, from the grant (Ref. PIO52048 and RD07/0067/0005) funded by the Carlos III Health Institute and from the grant TIN2009-07707 from the Spanish “Ministerio de Ciencia e Innovación (MICINN)”. We also thank the Ibero-American Network of the Nano-Bio-Info-Cogno Convergent Technologies (Ibero-NBIC) network (209RT0366) funded by CYTED (Ciencia y Tecnología para el Desarrollo). González-Díaz H. and Munteanu C. R. acknowledge the funding for a research position by the Isidro Parga Pondal Programme, Xunta de Galicia, Spain.
References
1. Bornholdt S, Schuster HG. Handbook of Graphs and Complex Networks: From the Genome to the Internet. Weinheim: WILEY-VCH GmbH & Co. KGaA. 2003.
2. Anwander A, Tittgemeyer M, von Cramon DY, Friederici AD, Knosche TR. Connectivity-Based Parcellation of Broca's Area. Cereb Cortex. 2007 Apr;17(4):816-25.
3. Honey CJ, Kotter R, Breakspear M, Sporns O. Network structure of cerebral cortex shapes functional connectivity on multiple time scales. Proc Natl Acad Sci U S A. 2007 Jun 12;104(24):10240-5.
4. Mizuno A, Villalobos ME, Davies MM, Dahl BC, Muller RA. Partially enhanced thalamocortical functional connectivity in autism. Brain Res. 2006 Aug 9;1104(1):160-74.
5. Pérez-Montoto LG, Prado-Prado F, Ubeira FM, González-Díaz H. Study of Parasitic Infections, Cancer, and other Diseases with Mass-Spectrometry and Quantitative Proteome-Disease Relationships. Curr Proteomics. 2009;6:in press.
6. González-Díaz H, Prado-Prado F, Pérez-Montoto LG, Duardo-Sánchez A, López-Díaz A. QSAR Models for Proteins of Parasitic Organisms, Plants and Human Guests: Theory, Applications, Legal Protection, Taxes, and Regulatory Issues. Curr Proteomics. 2009;6:in press.
7. González-Díaz H, Prado-Prado F, Ubeira FM. Predicting antimicrobial drugs and targets with the MARCH-INSIDE approach. Curr Top Med Chem. 2008;8(18):1676-90.
8. González-Díaz H, González-Díaz Y, Santana L, Ubeira FM, Uriarte E. Proteomics, networks and connectivity indices. Proteomics. 2008;8:750-78.
9. González-Díaz H, Vilar S, Santana L, Uriarte E. Medicinal Chemistry and Bioinformatics – Current Trends in Drugs Discovery with Networks Topological Indices. Curr Top Med Chem. 2007;7(10):1025-39.
10. Junker BH, Koschuetzki D, Schreiber F. Exploration of biological network centralities with CentiBiN. BMC Bioinformatics. 2006 Apr 21;7(1):219.
11. Scannell JW, Blakemore C, Young MP. Analysis of connectivity in the cat cerebral cortex. J Neurosci. 1995 Feb;15(2):1463-83.
12. STATISTICA-6.0. 6.0 ed. Tulsa, OK, U.S.A.: StatSoft Inc. 2002.
13. Marrero-Ponce Y, Khan MT, Casanola Martin GM, Ather A, Sultankhodzhaev MN, Torrens F, et al. Prediction of Tyrosinase Inhibition Activity Using Atom-Based Bilinear Indices. ChemMedChem. 2007 Apr 16;2(4):449-78.
14. Castillo-Garit JA, Marrero-Ponce Y, Torrens F, Rotondo R. Atom-based stochastic and non-stochastic 3D-chiral bilinear indices and their applications to central chirality codification. J Mol Graph Model. 2007 Jul;26(1):32-47.
Topological Indices for Medicinal Chemistry, Biology, Parasitology, Neurological and Social Networks, 2010: 191-204 ISBN: 978-81-7895-489-9 Editor: Humberto González-Díaz and Cristian Robert Munteanu
11. Network prediction of fasciolosis spreading in Galicia (NW Spain)
Humberto González-Díaz¹, Mercedes Mezo², Marta González-Warleta², Laura Muíño-Pose¹, Esperanza Paniagua¹ and Florencio M. Ubeira¹
¹Department of Microbiology & Parasitology, Faculty of Pharmacy, University of Santiago de Compostela, Santiago de Compostela, 15782, Spain; ²Laboratorio de Parasitología, Centro de Investigaciones Agrarias de Mabegondo-Xunta de Galicia, Abegondo, 15318, Spain
Abstract. Complex interacting networks are observed in systems from such diverse areas as physics, biology, economics, ecology, and computer science. Landscape networks incorporate topographical (coordinates and altitude), climate, hydrographical and other types of information to describe phenomena (species migration, transport, disease spreading) that involve different places (lakes, cities, industries, forests) interconnected in the form of a network. By analogy with previous works, we aim to study the landscape spreading of fasciolosis in Galicia (NW Spain). Fasciolosis is a parasitic infection caused by Fasciola hepatica that has become an important cause of lost productivity in livestock worldwide. Effective control of fasciolosis is difficult, especially in milking cows, which can only be treated during the dry period, a control strategy that has not yet been evaluated. However, there is no report of a network study of F. hepatica spread in Galicia coupled to a Geographical Information System (GIS). In this work, we construct for the first time a network for fasciolosis spreading in Galicia. We also calculated many centrality measures for all the nodes (livestock farms) of the new network. Last, using these measures of landscape network structure as inputs, we seek a Quantitative Structure-Property Relationship (QSPR) model. This QSPR model may predict the prevalence of disease, in existing or new farms, after or in the absence of different medical treatments, based only on details retrieved from the GIS (location and altitude). The study may have predictive value for the positioning of new farms with lower risk of infection or for managing cattle during infections.

Correspondence/Reprint request: Dr. González-Díaz H, Department of Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela (USC), Santiago de Compostela, 15782, Spain. E-mail: humberto.gonzalez@usc.es or gonzalezdiazh@gmail.com
1. Introduction
Complex networks (CNs) are observed in systems from such diverse areas as physics, biology, economics, ecology, and computer science. For example, economic or social interactions often organize themselves into CN structures. Similar phenomena are observed in traffic flow and in communication networks such as the Internet. In current problems of the biosciences, prominent examples are protein structure networks and protein-protein networks in the living cell, as well as molecular networks in the genome. On larger scales one finds networks of cells, as in neural networks, up to the scale of organisms in ecological food webs [1].

Landscape CNs incorporate topographical (coordinates and altitude), climate, hydrographical and other types of information to describe phenomena (species migration, disease spreading, transport) that involve different places (lakes, cities, industries, forests) interconnected in the form of a CN. The places are represented by nodes, and the edges or arcs connecting these nodes usually express geographical proximity but may incorporate additional information about the flow or transmission of some magnitude or phenomenon between two nodes. For instance, Minor et al. [2] studied the role of landscape connectivity in assembling exotic plant communities by means of a CN analysis. They considered the spatial arrangement of habitat fragments (nodes) to be critical in affecting the movement of individuals through a landscape, and examined how invasive species respond to landscape configuration relative to native species. This information is crucial for managing the global threat of invasive species spread. Using CN analysis, they showed that forest plant communities in a fragmented landscape have a spatial structure that is best captured by a CN representation of landscape connectivity (edges). In another study, Michels et al. [3] investigated geographical and genetic distances among zooplankton populations in a set of interconnected ponds. They considered systems of interconnected ponds or lakes (CN nodes). Using a landscape-based approach, they modelled the
effective geographical distance among a set of interconnected ponds in a Geographic Information System (GIS) environment. The first model reported corrects for the presence of direct connections among ponds (edges) based on the existing landscape structure (i.e. the CN of connecting elements among ponds, travelling distance and direction of the current). A second model, called the Flow Rate Model, also incorporated field data on flow rates in the connecting elements as the driving force for the passive dispersal of the active zooplankton population component. Finally, the third model (the Dispersal Rate Model) incorporated field data on zooplankton dispersal rates. An analysis of the pattern of genetic differentiation among Daphnia ambigua populations inhabiting 10 ponds revealed that the effective geographical distance as modelled by the Flow Rate and Dispersal Rate models provides a better approximation of the true rates of genetic exchange among populations than mere Euclidean geographical distances or the landscape model that takes into account solely the presence of physical connections.

Very interestingly, CNs (whether or not they incorporate landscape information) may be used to study disease spread. For instance, Natale et al. [4] carried out a network analysis of Italian cattle trade patterns in 2007 to evaluate the risks of potential disease spread. Livestock movement data represent a valuable source of information for understanding the pattern of contacts between premises, which may determine the spread of diseases. The authors described the structure of the Italian cattle industry and investigated the main trade flows and the relations between premises with regard to the potential spread of cattle diseases. Epidemic simulations were carried out on the network built from movement data using a network-based metapopulation model. The simulations show the influence of the network structure on the dynamics and size of a hypothetical epidemic and give useful indications of the effects of targeted removal of nodes based on the centrality of premises within the network of animal movements. In another work, Bigras-Poulin et al. [5] studied the relationship of trade patterns of the Danish swine industry animal-movements network to potential disease spread. The movements of animals were analysed under the conceptual framework of graph theory. The swine production-related premises of Denmark were considered to constitute the nodes of a network and the links were the animal movements. In this framework, each farm has a CN of other premises to which it is linked. A premise was a farm (breeding, rearing or slaughter pig), an abattoir or a trade market. The overall network was divided into premise-specific subnets that linked the other premises from and to which animals were moved. This approach allowed them to visualise and analyse the three levels of organization related to animal movements that
existed in the Danish swine production registers: the movement of animals between two premises, the premise-specific networks, and the industry network. The assumption that animal movements can be randomly generated on the basis of the farm density of the area surrounding any farm was not correct, since the patterns of animal movements have the topology of a scale-free CN with a large degree of heterogeneity. This supported the opinion that disease-spread software assuming homogeneity in farm-to-farm relationships should only be used for large-scale interpretation and for epidemic preparedness. The authors concluded that the network approach, based on graph theory, can be used efficiently to express more precisely, on a local scale (premise), the heterogeneity of animal movements. This approach, by providing CN knowledge to the local veterinarian in charge of controlling disease spread, should also be evaluated as a potential tool to manage epidemics during a crisis. GIS could also be linked to the approach to produce knowledge about local transmission of disease.

On the other hand, Quantitative Structure-Property Relationships (QSPR) are a type of technique for constructing models that predict the properties of a molecular system from numerical parameters that describe the structure of the system. In particular, Topological Indices (TIs), Connectivity Indices (CIs), or node centralities Ct(j) of type t may be calculated from the graph/network representation of a system to describe the full CN topology, node connectivity, or sub-graph branching. This class of indices is one of the most flexible for QSPR studies; see the recent reviews by González-Díaz et al. [6-10]. The reader may also see the interesting works of Estrada et al. [11-18], Managbanag et al. [19], Bonchev & Buck [20] and other authors who have used TIs and Ct(j) measures to study different complex networks at different structural levels.
By analogy with previous works, we aim to study a database very recently reported by Mezo et al. [21] related to the landscape spreading of fasciolosis in Galicia (NW Spain). Fasciolosis is a parasitic infection caused by Fasciola hepatica (a liver fluke) that has become an important cause of lost productivity in livestock worldwide. Considered a secondary zoonotic disease until the mid-1990s, human fascioliasis is at present emerging or re-emerging in many countries, with increases in prevalence and intensity as well as geographical expansion. Research in recent years has justified the inclusion of fascioliasis in the list of important human parasitic diseases. At present, fasciolosis is the vector-borne disease presenting the widest known latitudinal, longitudinal and altitudinal distribution [22]. In endemic areas of Central and South America, Europe, Africa and Asia, human fasciolosis presents a range of epidemiological characteristics related to a wide diversity of environments. Besides having a wide spectrum of hepatobiliary symptoms like obstructive
jaundice, cholangitis and liver cirrhosis, the parasitic infection also has extrabiliary manifestations. Effective control of fasciolosis is difficult, especially in milking cows, which can only be treated during dry periods, a control strategy that has not yet been evaluated. In this sense, the study of the geographical spreading of fasciolosis becomes a goal of major interest. However, there is no report of a network study of F. hepatica spread in Galicia coupled to a GIS. In this work, we construct for the first time a network for fasciolosis spreading in Galicia. We also calculated many Ct(j) measures for all the nodes (livestock farms) of the new CN. Last, using these Ct(j) measures as inputs, we seek a QSPR model that may predict the prevalence of disease, in existing or new farms, after or in the absence of different medical treatments, based only on details retrieved from the GIS (location and altitude). The study may have predictive value for the positioning of new farms with lower risk of infection or for managing cattle during infections.
2. Materials and methods
2.1. Network construction and centrality calculation
We used the dataset reported by Mezo et al. [21] to construct a network of farm-to-farm spreading of fasciolosis in cattle in Galicia (NW Spain). Each farm was considered as a node of the network, which is associated with a Boolean matrix with elements bij [23-25]. We placed an arc (directed edge) connecting the i-th farm with the j-th farm if the pair satisfied the condition given below in the form of a Microsoft Excel command (equation 1), which is used to truncate the farm-to-farm distance function (equation 2):

bij = IF(OR(dij = 0, dij > dcutoff*AVERAGE(dij)), 0, 1)    (1)

$$d_{ij} = 0.5\cdot(h_i + h_j)\cdot Tr_i\cdot Tr_j\cdot\sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} \qquad (2)$$

Here hi is the altitude and Tri the treatment code of the i-th farm (see Table 2), and (xi, yi) are taken to be its map coordinates. As this is a symmetric condition, the existence of a connection ij implies the existence of the inverse connection ji. Loops, that is, self-connections from the j-th farm to itself (representing self-infection of animals inside the same farm), were allowed. We used this network as input for the software CentiBiN, which was used to calculate the Ct(j) values [26]. The general formulae and symbols for all the indices calculated appear in Table 1.
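The construction described by equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' spreadsheet: the dictionary keys, the function names and the three example farms are hypothetical, and the average in equation (1) is assumed to run over every entry of the distance matrix.

```python
import math


def distance(farm_i, farm_j):
    """Equation (2): altitude- and treatment-weighted Euclidean distance."""
    planar = math.hypot(farm_i["x"] - farm_j["x"], farm_i["y"] - farm_j["y"])
    mean_altitude = 0.5 * (farm_i["h"] + farm_j["h"])
    return mean_altitude * farm_i["tr"] * farm_j["tr"] * planar


def boolean_matrix(farms, d_cutoff):
    """Equation (1): b_ij = 0 if d_ij = 0 or d_ij > d_cutoff * mean(d_ij), else 1."""
    n = len(farms)
    d = [[distance(farms[i], farms[j]) for j in range(n)] for i in range(n)]
    # Assumption: the mean is taken over all n*n entries of the matrix.
    threshold = d_cutoff * sum(map(sum, d)) / (n * n)
    # With the formula as printed, d_ii = 0, so diagonal entries come out as 0.
    return [[0 if d[i][j] == 0 or d[i][j] > threshold else 1 for j in range(n)]
            for i in range(n)]


# Hypothetical farms: coordinates, altitude h and treatment code tr are made up.
farms = [
    {"x": 0.0, "y": 0.0, "h": 1.0, "tr": 1},
    {"x": 1.0, "y": 0.0, "h": 1.0, "tr": 1},
    {"x": 10.0, "y": 0.0, "h": 2.0, "tr": 3},
]
b = boolean_matrix(farms, d_cutoff=1.0)
```

In this toy example only the two nearby farms become connected, and the resulting matrix is symmetric, as the text states for the real network.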
Table 1. Definitions of the network node centrality Ct(j) measures used.
a All symbols used in these formulae are common in the complex network literature and are not explained in detail here. G = (V, E) is an undirected or directed, (strongly) connected graph with n = |V| vertices; deg(v) denotes the degree of the vertex v in an undirected graph and deg(v)* denotes the valence degree, for molecular networks only; dist(v, w) denotes the length of a shortest path between the vertices v and w; σst denotes the number of shortest paths from s to t and σst(v) the number of shortest paths from s to t that use the vertex v. D and A are the topological distance matrix and the adjacency matrix of the graph G. For more details, see the references cited.
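To make the symbols in the footnote concrete, the following sketch computes three common node centralities (degree, closeness and shortest-path betweenness) for a small undirected graph using only the Python standard library. These are textbook definitions and may differ in normalisation from the formulae of Table 1 or from CentiBiN's output; the function names and the example graph are illustrative only.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths dist(source, w) in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def degree(adj, v):
    """deg(v): number of neighbours of v."""
    return len(adj[v])


def closeness(adj, v):
    """1 / sum of dist(v, w) over all reachable vertices w."""
    total = sum(bfs_distances(adj, v).values())
    return 1.0 / total if total else 0.0


def betweenness(adj):
    """Sum over pairs s, t of sigma_st(v) / sigma_st (Brandes' accumulation)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Undirected graph: each pair was visited from both ends, so halve.
    return {v: c / 2.0 for v, c in score.items()}


# Hypothetical four-farm chain a - b - c - d.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
chain_betweenness = betweenness(chain)
```

In the chain, the two inner nodes carry all shortest paths between the other nodes, so they receive the highest betweenness, which is the kind of structural signal the Ct(j) values are meant to capture.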
3. Results
In the previous cross-sectional study, Mezo et al. [21] investigated the effect of the type of flukicide treatment on the prevalence and intensity of infection in dairy cattle from Galicia, an area where fasciolosis is endemic and which is also the main milk-producing region in Spain. Faecal samples were taken from 5188 dairy cows on 275 randomly selected farms for measurement of the concentration of F. hepatica copro-antigens by a monoclonal antibody-based immunoassay (MM3-COPRO ELISA). We used this database to construct a landscape parasite-spreading network for fasciolosis in Galicia (NW Spain); see Figure 1. On the same day as the sampling, each
Figure 1. Geographical maps of Galicia showing the location of the 275 sampled farms. A - The status of infection (empty circles: F. hepatica free; filled circles: F. hepatica infected) and the treatment administered on each farm (blue: none; red: an anthelmintic effective against mature fluke stages; green: a fasciolicide effective against immature and mature stages). B - Distribution of farms according to the presence of F. hepatica infection (grey: uninfected; cyan: infected with a within-herd prevalence < 25%; pink: infected with a within-herd prevalence ≥ 25%). C - SCN for landscape parasite-spreading.
farm owner/manager was questioned about the types of treatment used on the farm. Three groups of farms were considered according to the fasciolicide treatment: (A) flukicides were not used, (B) an anthelmintic effective against mature stages of flukes was used (albendazole or netobimin) and (C) a fasciolicide effective against immature and mature stages was used (triclabendazole: TCBZ). The survey showed that most dairy farmers are unaware of the existence of F. hepatica infection on their farms, and that treatments, when given, are administered without prior diagnosis. Treatment with TCBZ administered only at drying off did not show advantages over other measures, including no treatment or treatment with other benzimidazoles. Consequently, TCBZ should only be used to treat individual animals after correct diagnosis of the infection, with correct management measures taken to control re-infection. Detailed information for all the farms analyzed appears in Table 2.

This situation prompted us to seek a combined QSPR & CN model that may help in disease management. For this study we decided to test the potential of Classification Tree (CT) models as QSPR models. CTs were used because they are non-linear models that do not rely on assumptions about the parametric distribution of the data [27]. We used the Ct(j) variables calculated with CentiBiN as ordered predictors. Starting from here, several split methods were tried: i) discriminant-based linear combinations (CT-LC), ii) discriminant-based univariate splits (CT-US), and iii) CRT-style exhaustive search for univariate splits (CRT). In CRT we used three different goodness-of-fit measures: the Gini measure, Chi-square, and G-square. As in Linear Discriminant Analysis (LDA), we always set equal prior probabilities for the two classes, p(pPPI) = p(npPPI) = 0.5, unless a different value is specified. Last, we used a FACT-style direct stopping rule with a value of 0.01 to control the size of the CT.
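The idea behind a CRT-style exhaustive search for a univariate split with the Gini measure can be illustrated as follows. This is a simplified sketch for a single predictor and two classes, not the STATISTICA implementation; the function names and the six example values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_univariate_split(values, labels):
    """Try every midpoint between consecutive distinct values of one predictor
    and keep the threshold with the lowest size-weighted Gini impurity."""
    n = len(values)
    best_threshold, best_impurity = None, gini(labels)
    candidates = sorted(set(values))
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2.0
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# Hypothetical centrality values for six farms and their class
# (1 = prevalence after treatment >= 25%, 0 = below 25%).
centrality = [0.10, 0.15, 0.20, 0.60, 0.70, 0.80]
high_prevalence = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_univariate_split(centrality, high_prevalence)
```

A full tree repeats this search over all predictors and recursively within each child node until a stopping rule, such as the direct stopping fraction mentioned above, is met.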
All the CTs were trained with the software STATISTICA 6.0 ®, for which our laboratory holds rights of use [28]. A summary of the results found using different CTs appears in Table 3. The most promising results were found for CRT based on the Gini measure with a stopping fraction of 0.01. This CT presented the highest values in training, with Accuracy = Specificity = Sensitivity = 100%. The model also has adequate Accuracy = 72.1% in the external cross-validation (cv) series, mainly due to a significant Specificity = 88.4%, but Sensitivity = 44.0% was poor. Consequently, we propose as further research varying different aspects of the method and the parameterization of some variables in order to optimize the results. For instance, we should test different rules, distance functions, and/or cut-off values to construct the CN in order to obtain Ct(j) values with more predictive power. In fact, the choice of distance function has been shown to be relevant for the final CN topology. For instance, Zhang [29] investigated a total of 16 distance
Table 2. Parasite infection spread out to different jth Farms in Galicia.
a Tr. is the chemotherapy treatment used: 1-not treated, 2-doramectin (non-flukicidal treatment), 3-albendazole/netobimin (adult flukicide), 4-triclabendazole (all parasite stages).
b h is the altitude with respect to sea level, given in feet · 10^-2.
c Pr. is the province: CO-Coruña, LG-Lugo, PV-Pontevedra, and OR-Ourense.
d PAT is disease Prevalence After Treatment ≥ 25%.
Table 3. Results for CT analysis of fasciolosis spreading.
a Stopping rules: direct stopping - fractions 0.05-0.01; pom - pruned on misclassifications; pod - pruned on deviance.
b Class refers to farms with high (≥ 25%) or low (< 25%) prevalence of fasciolosis after treatment.
and similarity measures, including Euclidean distance, Manhattan distance, Pearson correlation, partial correlation, point correlation, linkage coefficients, Jaccard coefficient etc., to detect taxa pairs in ecological CNs. In addition, we propose scanning the cut-off values, which may lead us to find an
optimal cut-off in further research. A similar case was the recent research by da Silveira et al. [30] establishing a value of 7Å as the optimal cut-off for protein contact CNs.
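The accuracy, sensitivity and specificity quoted in this section follow the usual confusion-matrix definitions, sketched below. The counts in the example are hypothetical: they are not taken from Table 3 and were chosen only because they give percentages close to those quoted for the external cross-validation series.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts.

    tp/fn: high-prevalence farms classified correctly/incorrectly;
    tn/fp: low-prevalence farms classified correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity


# Hypothetical counts (assumption, not data from the study).
accuracy, sensitivity, specificity = classification_metrics(tp=11, fn=14, tn=38, fp=5)
print(f"Accuracy = {100 * accuracy:.1f}%, "
      f"Sensitivity = {100 * sensitivity:.1f}%, "
      f"Specificity = {100 * specificity:.1f}%")
```

The example makes the imbalance visible: a model can reach an acceptable overall accuracy while missing more than half of the high-prevalence farms, which is why the low sensitivity motivates the proposed scanning of rules, distance functions and cut-off values.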
4. Conclusion
We demonstrated that Ct(j) values of landscape complex networks are promising indices of general use for the prediction of fasciolosis spreading in Galicia.
Acknowledgments
González-Díaz H. acknowledges a tenure-track research position funded by the Isidro Parga Pondal Programme, Xunta de Galicia. The authors acknowledge partial financial support from project AGL2006-13936-C02 of the Ministry of Education and Science, Spain, which is co-financed with European Union funds (FEDER), and from grants 2007/127 and 2007/144 from the General Directorate of Scientific and Technologic Promotion of the Galician University System of the Xunta de Galicia.
References
1. Bornholdt S, Schuster HG. Handbook of Graphs and Complex Networks: From the Genome to the Internet. Weinheim: WILEY-VCH GmbH & Co. KGaA. 2003.
2. Minor ES, Tessel SM, Engelhardt KA, Lookingbill TR. The role of landscape connectivity in assembling exotic plant communities: a network analysis. Ecology. 2009 Jul;90(7):1802-9.
3. Michels E, Cottenie K, Neys L, De Gelas K, Coppin P, De Meester L. Geographical and genetic distances among zooplankton populations in a set of interconnected ponds: a plea for using GIS modelling of the effective geographical distance. Mol Ecol. 2001 Aug;10(8):1929-38.
4. Natale F, Giovannini A, Savini L, Palma D, Possenti L, Fiore G, et al. Network analysis of Italian cattle trade patterns and evaluation of risks for potential disease spread. Prev Vet Med. 2009 Dec;92(4):341-50.
5. Bigras-Poulin M, Thompson RA, Chriel M, Mortensen S, Greiner M. Network analysis of Danish cattle industry trade patterns as an evaluation of risk potential for disease spread. Prev Vet Med. 2006 Sep 15;76(1-2):11-39.
6. Pérez-Montoto LG, Prado-Prado F, Ubeira FM, González-Díaz H. Study of Parasitic Infections, Cancer, and other Diseases with Mass-Spectrometry and Quantitative Proteome-Disease Relationships. Curr Proteomics. 2009;6:in press.
7. González-Díaz H, Prado-Prado F, Pérez-Montoto LG, Duardo-Sánchez A, López-Díaz A. QSAR Models for Proteins of Parasitic Organisms, Plants and Human Guests: Theory, Applications, Legal Protection, Taxes, and Regulatory Issues. Curr Proteomics. 2009;6:in press.
8. González-Díaz H, Prado-Prado F, Ubeira FM. Predicting antimicrobial drugs and targets with the MARCH-INSIDE approach. Curr Top Med Chem. 2008;8(18):1676-90.
9. González-Díaz H, González-Díaz Y, Santana L, Ubeira FM, Uriarte E. Proteomics, networks and connectivity indices. Proteomics. 2008;8:750-78.
10. González-Díaz H, Vilar S, Santana L, Uriarte E. Medicinal Chemistry and Bioinformatics – Current Trends in Drugs Discovery with Networks Topological Indices. Curr Top Med Chem. 2007;7(10):1025-39.
11. Estrada E, Bodin O. Using network centrality measures to manage landscape connectivity. Ecol Appl. 2008 Oct;18(7):1810-25.
12. Estrada E, Higham DJ, Hatano N. Communicability and multipartite structures in complex networks at negative absolute temperatures. Phys Rev E Stat Nonlin Soft Matter Phys. 2008 Aug;78(2 Pt 2):026102.
13. Estrada E, Hatano N. Communicability in complex networks. Phys Rev E Stat Nonlin Soft Matter Phys. 2008 Mar;77(3 Pt 2):036111.
14. Estrada E. How the parts organize in the whole? A top-down view of molecular descriptors and properties for QSAR and drug design. Mini Rev Med Chem. 2008 Mar;8(3):213-21.
15. Estrada E. Food webs robustness to biodiversity loss: the roles of connectance, expansibility and degree distribution. J Theor Biol. 2007 Jan 21;244(2):296-307.
16. Estrada E. Topological structural classes of complex networks. Phys Rev E Stat Nonlin Soft Matter Phys. 2007 Jan;75(1 Pt 2):016103.
17. Estrada E. Protein bipartivity and essentiality in the yeast protein-protein interaction network. Journal of proteome research. 2006 Sep;5(9):2177-84.
18. Estrada E. Virtual identification of essential proteins within the protein interaction network of yeast. Proteomics. 2006 Jan;6(1):35-40.
19. Managbanag JR, Witten TM, Bonchev D, Fox LA, Tsuchiya M, Kennedy BK, et al. Shortest-path network analysis is a useful approach toward identifying genetic determinants of longevity. PLoS ONE. 2008;3(11):e3802.
20. Bonchev D, Buck GA. From molecular to biological structure and back. Journal of chemical information and modeling. 2007 May-Jun;47(3):909-17.
21. Mezo M, Gonzalez-Warleta M, Castro-Hermida JA, Ubeira FM. Evaluation of the flukicide treatment policy for dairy cattle in Galicia (NW Spain). Vet Parasitol. 2008 Nov 7;157(3-4):235-43.
22. Mas-Coma S. Epidemiology of fascioliasis in human endemic areas. J Helminthol. 2005 Sep;79(3):207-16.
23. Wollbold J, Huber R, Pohlers D, Koczan D, Guthke R, Kinne RW, et al. Adapted Boolean network models for extracellular matrix formation. BMC systems biology. 2009;3:77.
24. Pomerance A, Ott E, Girvan M, Losert W. The effect of network topology on the stability of discrete state models of genetic control. Proc Natl Acad Sci U S A. 2009 May 19;106(20):8209-14.
25. Zhang SQ, Ching WK, Ng MK, Akutsu T. Simulation study in Probabilistic Boolean Network models for genetic regulatory networks. International journal of data mining and bioinformatics. 2007;1(3):217-40.
26. Junker BH, Koschuetzki D, Schreiber F. Exploration of biological network centralities with CentiBiN. BMC Bioinformatics. 2006 Apr 21;7(1):219.
27. Hill T, Lewicki P. STATISTICS Methods and Applications. A Comprehensive Reference for Science, Industry and Data Mining. Tulsa: StatSoft 2006.
28. StatSoft.Inc. STATISTICA (data analysis software system), version 6.0, www.statsoft.com. Statsoft, Inc. 6.0 ed 2002.
29. Zhang W. Computer inference of network of ecological interactions from sampling data. Environ Monit Assess. 2007 Jan;124(1-3):253-61.
30. da Silveira CH, Pires DE, Minardi RC, Ribeiro C, Veloso CJ, Lopes JC, et al. Protein cutoff scanning: A comparative analysis of cutoff dependent and cutoff free methods for prospecting contacts in proteins. Proteins. 2009 Feb 15;74(3):727-43.
Topological Indices for Medicinal Chemistry, Biology, Parasitology, Neurological and Social Networks, 2010: 205-212 ISBN: 978-81-7895-489-9 Editor: Humberto González-Díaz and Cristian Robert Munteanu
12. Study of criminal law networks with Markov-probability centralities
Aliuska Duardo-Sanchez
Department of Especial Public Law, Financial and Tributary Law Area, Faculty of Law, USC, Santiago de Compostela, 15782, Spain
Abstract. Graph theory and Complex Network analysis tools are expanding to new potential fields of application at different levels in the Information Sciences. For instance, at the molecular level we can use them to describe drug-virus action pairs in antiviral medicinal chemistry research. However, the applications are far from being restricted to the world of molecules. We can use the same type of graphs and complex networks to describe relationships between non-living objects, organisms, or even social actors. Regardless of the type of system described, we can use numerical parameters of graphs and networks to characterize the structural information of these systems. With these indices, also called Topological Indices (TIs), we can search for Quantitative Structure-Activity Relationship (QSAR) models for the prediction and discovery of antimicrobial drugs. More generally, we can use TIs to find Quantitative Structure-Property Relationship (QSPR) models for complex social networks. However, almost all works focus on the development of QSAR/QSPR models at only one structural level. In this work, we decided to test the potential of one of the classes of TIs in criminal law networks. For this test we selected the class of TIs called the node absolute probabilities πk(i), which can be calculated with the MARCH-INSIDE method based on Markov models. The second model developed is able to discriminate between main and secondary causes in causality criminal case networks with Accuracy = 94.74%. The work opens new directions in the generalization of TIs to develop QSAR/QSPR models for predicting relevant information about systems at different structural levels.

Correspondence/Reprint request: Dr. Aliuska Duardo-Sanchez, Department of Especial Public Law, Financial and Tributary Law Area, Faculty of Law, USC, Santiago de Compostela, 15782, Spain. E-mail: aliuskaduardo@yahoo.es
1. Introduction
Graph theory and Complex Network analysis tools are expanding to new potential fields of application in the Information Sciences at different levels, from molecular to population, social or technological, such as genome networks, protein-protein networks, sexual disease transmission networks, electric power networks or the Internet [1]. In particular, relationships among social actors, as well as relationships among actors at different levels of analysis (such as persons and groups), are the subject of intensive investigation [2]. Social network analysis (SNA) provides a common approach for all those disciplines involved in the study of social structure [3-6] that are susceptible to network depiction. The concept of social structure is used mainly in sociology and social theory. Although there is no agreement among theorists, it can refer to a specific type of relation between entities or groups, to enduring patterns of behavior and relationship within a society, or to social institutions and norms becoming embedded in social systems. For a more complete account of SNA, see the in-depth review by Newman entitled The Structure and Function of Complex Networks [7]. If we consider that a network is a set of items, usually called nodes, with connections between them, called edges [8], then SNA means the representation of social relationships in terms of nodes and ties, where nodes can be the individual actors within the networks and ties the relationships between these actors [1]. In fact, SNA is nothing new in social science studies: in the early 1930s, sociologists had already built a social network to study friendships between school children [9]. Since then, the importance of the network approach in the social sciences has greatly increased, and its applications range from the interrelations between family members [10] to business interactions between companies [11, 12] or patterns of sexual contacts [13, 14].
Although the network approach is pervasive in the social sciences, its application in the field of law is still weak. Network tools and methodologies may be useful to illustrate the interrelations between different types of law, and to check the importance of a specific legal instrument as well as whether legislators respect the normative hierarchy, so that the matters most important to individuals are regulated through legal instruments that require the approval of the most representative democratic actors. They can also help to understand the consequences of laws for life in society and whether or not they are effective.
Study of criminal law networks with Markov-probability centralities
In this sense, network TIs applicable at different levels of organization of matter (molecular, biological, social, economic, or even technological) may be of high interest for developing a general methodology for the search for Quantitative Structure-Property Relationship (QSPR) models. These QSPR models connect the structure of a system (drug, protein, microorganism, people, social groups, internet...) with its properties and can be used to predict the behavior of the system in different situations. Disappointingly, QSPR studies generally focus on a limited set of properties of small molecules. The applications of TIs in QSPR research are therefore far from covering all the potentialities of TIs, and new gateways to molecular, biological, or even social QSPR models are still waiting to be opened. Along this line of thinking, our group has introduced a Markov model (MM) method named MARCH-INSIDE: Markovian Chemicals In Silico Design. MARCH-INSIDE generates TIs in the form of matrix invariants such as stochastic entropies, spectral moments, or absolute probabilities for the study of molecular properties. Recently the method has been renamed MARCH-INSIDE 2.0: Markov Chain Invariants for Network Simulation & Design, in order to give a clearer idea of its unexplored potentialities. Recent reviews of MARCH-INSIDE and similar QSPR methods, including discussion of multiple applications in different fields, have been published by González-Díaz et al. [15-19]. In this work, we decided to test the potentialities, at different structural levels, of one of the classes of TIs calculated by MARCH-INSIDE. For this test we selected the class of TIs called node absolute probabilities πk(i). The πk(i) values represent the absolute probability of reaching node i after a walk of length k starting from any node in the network. We calculate the πk(i) values from a Markov matrix associated with a graph or network.
These TIs differ from other MARCH-INSIDE TIs because they are useful only to describe a single node or, if we sum several πk(i) values, a collection of nodes (atoms, amino acids, a group of electric plants, a social subgroup...) that forms part of a larger network system (molecule, protein, US electric power system, society...). This happens because the sum of all πk(j) values over the whole system is always equal to one for any system and therefore gives no structural information. Consequently, the πk(i) values may be considered local node TIs. This class of TIs is commonly known as network node centralities. Several node centralities have been defined before, and the software CentiBiN calculates some of the most widely used [20]. However, the definition of new centralities is an active field of research, and new centralities have recently been introduced, such as the subgraph centrality introduced by Estrada [21]. Certainly, the πk(j) values were used in the past by our group [22, 23], but always at the molecular level and never for non-molecular problems.
Aliuska Duardo-Sanchez
In order to confirm the potential applications of πk(j) values beyond traditional frontiers, we develop here new QSPR models. One model is the first QSPR model based on πk(j) values that can be used to predict the probability of crime causation for a single actor (person or cause) in a criminal law network.
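The statement that the πk(j) values always sum to one can be checked with a short derivation. It is a sketch that assumes the standard Markov-chain setting, which the text does not spell out at this point: the initial probabilities 0pi sum to one, and each row of the one-step transition matrix (entries 1pij) sums to one.

```latex
\sum_{j} {}^{k}p_{j}
  = \sum_{j} \sum_{i} {}^{k-1}p_{i}\, {}^{1}p_{ij}
  = \sum_{i} {}^{k-1}p_{i} \Bigl( \sum_{j} {}^{1}p_{ij} \Bigr)
  = \sum_{i} {}^{k-1}p_{i}
  = \dots
  = \sum_{i} {}^{0}p_{i}
  = 1
```

Because the total is fixed at one for every network, only partial sums over a subset of nodes can carry structural information, which is why these values act as local descriptors.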
2. Materials and methods
Absolute probability centralities kCπ(j) for actions in crime networks
First, we need to construct the crime causality Markov matrix 1Π. This is a square matrix (n × n), where n is the number of actions related to the crime, including the original actions (causes), the co-actions (secondary causes), and the consequence (crime). The matrix 1Π contains the transition probabilities 1pij that action i is the cause of another action j or, at least, that j occurred immediately after i in the crime. The probabilities 1pij may be calculated using Eq. 28 and 29. δj represents the number of actions that occurred immediately after the i-th action. In addition, we use the vector of absolute initial probabilities π0; see Eq. 26. This vector lists the absolute initial probabilities 0pj of reaching a node ni from a randomly selected node nj. Here we take the initial probability to be the inverse of the dimension (N, number of nodes) of the shortest path (shp) connecting ni with nii. Next, we used the theory of Markov chains to calculate the criminal causation entropy centrality kCπ(i,ii):

$$ {}^{k}C_{\pi}(i,ii) = -\sum_{j \in shp} {}^{k}C_{\pi}(j) = -\sum_{j \in shp} {}^{k}p_{j} \qquad (1) $$
In this equation, the values kpj are the absolute probabilities of reaching the nodes along a walk of length k from node ni. The sum runs only over the nodes that lie on the shp connecting ni with nii. The Chapman-Kolmogorov equations were used to calculate the vector πk containing the kpj values from the vector π0 of initial probabilities (0pj) and the matrix 1Π of first-step transition probabilities (1pij):

$$ \pi_k = \pi_0 \times {}^{k}\Pi = \pi_0 \times \left( {}^{1}\Pi \right)^{k} \qquad (2) $$
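The calculation in Eqs. 1 and 2 can be illustrated with a minimal pure-Python sketch. Several parts are assumptions made only for illustration and are not taken from the chapter: the transition probabilities are taken as uniform over the actions that immediately follow action i, a node without successors keeps its probability through a self-loop, π0 is uniform over all nodes, and the four-action case is a hypothetical toy example.

```python
from collections import deque


def transition_matrix(successors, nodes):
    """Row-stochastic matrix from 'action i is immediately followed by j' links.

    Assumptions (not from the chapter): successors of i are equally likely,
    and a node with no successors gets a self-loop with probability 1.
    """
    index = {name: pos for pos, name in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0.0] * size for _ in range(size)]
    for name in nodes:
        targets = successors.get(name, [])
        if not targets:
            matrix[index[name]][index[name]] = 1.0
            continue
        for target in targets:
            matrix[index[name]][index[target]] = 1.0 / len(targets)
    return matrix


def step(vector, matrix):
    """One Chapman-Kolmogorov step: row vector times matrix."""
    size = len(vector)
    return [sum(vector[i] * matrix[i][j] for i in range(size))
            for j in range(size)]


def absolute_probabilities(pi0, matrix, k):
    """pi_k = pi_0 x (1Pi)^k, computed as k repeated steps (Eq. 2)."""
    vector = list(pi0)
    for _ in range(k):
        vector = step(vector, matrix)
    return vector


def shortest_path(successors, start, goal):
    """Breadth-first search for one shortest directed path start -> goal."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in successors.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


def path_centrality(pi_k, nodes, path):
    """Eq. 1 as printed: minus the sum of kpj over the nodes on the shp."""
    index = {name: pos for pos, name in enumerate(nodes)}
    return -sum(pi_k[index[name]] for name in path)


# Hypothetical toy case: two original actions, one co-action, one crime.
nodes = ["threat", "theft_of_weapon", "assault", "crime"]
successors = {
    "threat": ["assault"],
    "theft_of_weapon": ["assault", "crime"],
    "assault": ["crime"],
}
matrix = transition_matrix(successors, nodes)
pi0 = [1.0 / len(nodes)] * len(nodes)  # assumed uniform start
pi1 = absolute_probabilities(pi0, matrix, 1)
path = shortest_path(successors, "theft_of_weapon", "crime")
score = path_centrality(pi1, nodes, path)
```

Repeating the last three lines for k = 0, 1, 2, ... gives one value of kCπ(i,ii) per order k for each cause-consequence pair, which is the kind of input the data analysis step works with.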
Data analysis
Using the values of kCπ(i,ii), as defined previously, for all cause(i)-consequence(ii) pairs, that is, for all causality paths, we can attempt to
discriminate determinant causes from less important causes in a crime network. We selected Linear Discriminant Analysis (LDA) [24] to fit the discriminant function:

$$ CC\text{-}score = b_0 + a_0 \cdot {}^{0}C_{\pi}(i,ii) + a_1 \cdot {}^{1}C_{\pi}(i,ii) + \dots + a_k \cdot {}^{k}C_{\pi}(i,ii) = b_0 + \sum_{k} a_k \cdot {}^{k}C_{\pi}(i,ii) \qquad (3) $$

In Eq. 3, b0 and ak represent the coefficients of the classification function, determined by the least squares method implemented in the LDA module of the software STATISTICA 6.0; see the associated book [25]. A forward-stepwise algorithm was used for variable selection [26, 27]. The statistical significance of the LDA model was determined by Fisher's test, by examining F and p. All the variables included in the model were standardized in order to bring them onto the same scale. This yields a standardized linear discriminant equation whose coefficients can be compared directly [28]. We also inspected the percentage of good classification, the cases/variables ratio (ρ parameter), and the number of variables to be explored, in order to avoid overfitting or chance correlation [26].
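The standardization step mentioned above can be sketched as follows. The sketch assumes the usual z-score (subtract the mean, divide by the sample standard deviation); the text does not state which variant STATISTICA applied, and the input values are hypothetical.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    centre = mean(values)
    spread = stdev(values)
    return [(x - centre) / spread for x in values]


# Hypothetical values of one descriptor over four cause-consequence pairs.
raw = [-0.2, -0.5, -0.9, -0.4]
z_scores = standardize(raw)
```

After this transformation every descriptor has mean zero and unit spread, so the magnitudes of the fitted coefficients become comparable across descriptors.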
3. Results and discussion
QSPR models for Criminal Causality
One of the reasons people have difficulty in dealing with complex systems is that the linear causal chain way of thinking (A causes B causes C causes D, etc.) breaks down in the presence of feedback and multiple interactions between causal and influence pathways. One could say that complex systems are characterized by networked rather than linear causal relationships. Nevertheless, it is important to be able to reason about complex systems, to make inferences about the factors that contribute to their current and alternative states, and to explore their possible future trajectories, especially if we wish to steer them towards more favorable futures and away from more dangerous ones. Large-scale examples include ecosystems, economic systems, coupled biophysical-socioeconomic systems, integrated supply chains/industrial systems, and social systems, but these remarks also apply, for example, to attempts to understand a physical organism as a complex system. Crime causality is a very important phenomenon in this sense. Different measures of crime causality have been developed before [29]. In this work, we introduced the Markov entropy centrality kCπ(j) for a node in a crime causality network. At the same time, we propose new measures of crime causality calculated as the sum of all the kCπ(j) values of the same order k for all nodes placed in the
shortest path (shp) connecting the original node ni (possible cause) with the final node nii (consequence). The model was trained and later validated with an external validation series. The best model found was:

$$ CC\text{-}score = -35.36 \cdot {}^{0}C_{\pi}(i,ii) + 112.80 \cdot {}^{5}C_{\pi}(i,ii) - 25.59 \qquad (4) $$
$$ n = 47 \qquad R_c = 0.85 \qquad \chi^2 = 72.8 \qquad p < 0.005 $$
The output of the model, the CC-score, is a real-valued variable that scores the possibility that a Crime Cause (CC) is the main cause of a given crime. This model correctly predicts 94.74% of the main crime causes out of 47 potential crime causes in 17 crime cases. We also obtained two additional classification functions to discriminate secondary causes of lower degrees, which are not reported here for reasons of space. The model also correctly predicts 94.74% of the main crime causes in Leave-One-Out (LOO) cross-validation experiments. Figure 1 illustrates the separation of the different crime networks in the canonical space using this LDA model.
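As a small illustration of how Eq. 4 is evaluated, the sketch below codes the reported coefficients as a function of the order-0 and order-5 path centralities. The input values are hypothetical, and because the text does not state the cutoff used to assign a cause to the main class, the sketch only computes the score.

```python
def cc_score(c0, c5):
    """CC-score of Eq. 4 from the order-0 and order-5 centralities."""
    return -35.36 * c0 + 112.80 * c5 - 25.59


# Hypothetical centralities for one cause-consequence pair.
example_score = cc_score(-0.50, -0.10)
```

The two coefficients have opposite signs, so the score depends on the contrast between the order-0 and order-5 values and not on either one alone.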
Figure 1. Canonical space representation of different criminal network cases
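The Leave-One-Out procedure mentioned in the results can be sketched generically. The classifier below is a nearest-centroid stand-in chosen to keep the example self-contained; it is not the LDA model of the chapter, and the samples and labels are hypothetical.

```python
def leave_one_out_accuracy(samples, labels, fit, predict):
    """Hold out each case once, refit on the rest, and count correct calls."""
    hits = 0
    for held in range(len(samples)):
        train_x = samples[:held] + samples[held + 1:]
        train_y = labels[:held] + labels[held + 1:]
        model = fit(train_x, train_y)
        if predict(model, samples[held]) == labels[held]:
            hits += 1
    return hits / len(samples)


def fit_centroids(xs, ys):
    """Stand-in classifier (not LDA): per-class mean of each feature."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {y: [sum(col) / len(rows) for col in zip(*rows)]
            for y, rows in groups.items()}


def predict_centroid(model, x):
    """Assign x to the class whose centroid is closest (squared distance)."""
    def dist2(centre):
        return sum((a - b) ** 2 for a, b in zip(x, centre))
    return min(model, key=lambda y: dist2(model[y]))


# Hypothetical (order-0, order-5) descriptor pairs for six causes.
samples = [[-0.9, -0.1], [-0.8, -0.2], [-0.7, -0.1],
           [-0.2, -0.6], [-0.3, -0.7], [-0.1, -0.8]]
labels = ["main", "main", "main", "secondary", "secondary", "secondary"]
accuracy = leave_one_out_accuracy(samples, labels,
                                  fit_centroids, predict_centroid)
```

Each case is predicted by a model that never saw it, so the resulting fraction is a less optimistic estimate than the accuracy measured on the training cases themselves.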
Acknowledgments
Duardo-Sánchez, A. gratefully acknowledges partial financial support from the research project (2006/PX 207) of the Department of Especial Public Law, Financial and Tributary Law Area, Faculty of Law, of the University of
Santiago de Compostela in Spain, which was supported by Xunta de Galicia and ESF.
References
1. Bornholdt S, Schuster HG. Handbook of Graphs and Complex Networks: From the Genome to the Internet. Weinheim: WILEY-VCH GmbH & CO. KGaA 2003.
2. Breiger R. The Analysis of Social Networks. In: Hardy M, Bryman A, eds. Handbook of Data Analysis. London: Sage Publications 2004:505-26.
3. Abercrombie N, Hill S, Turner BS. Social structure. The Penguin Dictionary of Sociology. 4th ed. London: Penguin 2000.
4. Craig C. Social Structure. Dictionary of the Social Sciences. Oxford: Oxford University Press 2002.
5. White H, Boorman S, Breiger R. Social Structure from Multiple Networks: I. Blockmodels of Roles and Positions. American Journal of Sociology. 1976;81:730-80.
6. Wellman B, Berkowitz SD. Social Structures: A Network Approach. Cambridge: Cambridge University Press 1988.
7. Newman M. The Structure and Function of Complex Networks. SIAM Review. 2003;56:167-256.
8. Newman M. The Structure and Function of Complex Networks. SIAM Review. 2003;56:167-256.
9. Moreno JL. Who Shall Survive? New York: Beacon House 1934.
10. Padgett JF, Ansell CK. Robust action and the rise of the Medici, 1400-1434. Amer J Sociol. 1993;98:259-1319.
11. Mariolis P. Interlocking directorates and control of corporations: The theory of bank control. Social Sci Quart. 1975;56:425-39.
12. Mizruchi MS. The American Corporate Network, 1904-1974. Beverly Hills: Sage 1982.
13. Klovdahl AS, Potterat JJ, Woodhouse DE, Muth JB, Muth SQ, Darrow WW. Social networks and infectious disease: The Colorado Springs study. Soc Sci Med. 1994;38:79-88.
14. Liljeros F, Edling CR, Amaral LAN, Stanley HE, Aberg Y. The web of human sexual contacts. Nature. 2001;411:907-8.
15. Pérez-Montoto LG, Prado-Prado F, Ubeira FM, González-Díaz H. Study of Parasitic Infections, Cancer, and other Diseases with Mass-Spectrometry and Quantitative Proteome-Disease Relationships. Curr Proteomics. 2009;6:in press.
16. González-Díaz H, Prado-Prado F, Pérez-Montoto LG, Duardo-Sánchez A, López-Díaz A. QSAR Models for Proteins of Parasitic Organisms, Plants and Human Guests: Theory, Applications, Legal Protection, Taxes, and Regulatory Issues. Curr Proteomics. 2009;6:in press.
17. González-Díaz H, Prado-Prado F, Ubeira FM. Predicting antimicrobial drugs and targets with the MARCH-INSIDE approach. Curr Top Med Chem. 2008;8(18):1676-90.
18. González-Díaz H, González-Díaz Y, Santana L, Ubeira FM, Uriarte E. Proteomics, networks and connectivity indices. Proteomics. 2008;8:750-78.
19. González-Díaz H, Vilar S, Santana L, Uriarte E. Medicinal Chemistry and Bioinformatics – Current Trends in Drugs Discovery with Networks Topological Indices. Curr Top Med Chem. 2007;7(10):1025-39.
20. Junker BH, Koschuetzki D, Schreiber F. Exploration of biological network centralities with CentiBiN. BMC Bioinformatics. 2006 Apr 21;7(1):219.
21. Estrada E, Rodriguez-Velazquez JA. Subgraph centrality in complex networks. Phys Rev E. 2005 May;71(5 Pt 2):056103.
22. González-Díaz H, Prado-Prado F. Unified QSAR and Network-Based Computational Chemistry Approach to Antimicrobials, Part 1: Multispecies Activity Models for Antifungals. J Comput Chem. 2008;29:656-7.
23. González-Díaz H, Sanchez IH, Uriarte E, Santana L. Symmetry considerations in Markovian chemicals 'in silico' design (MARCH-INSIDE) I: central chirality codification, classification of ACE inhibitors and prediction of sigma-receptor antagonist activities. Comput Biol Chem. 2003 Jul;27(3):217-27.
24. Estrada E, Molina E. 3D connectivity indices in QSPR/QSAR studies. J Chem Inf Comput Sci. 2001 May-Jun;41(3):791-7.
25. Hill T, Lewicki P. STATISTICS Methods and Applications. A Comprehensive Reference for Science, Industry and Data Mining. Tulsa: StatSoft 2006.
26. Van Waterbeemd H. Chemometric methods in molecular design. New York: Wiley-VCH 1995.
27. Cruz-Monteagudo M, González-Díaz H, Aguero-Chapin G, Santana L, Borges F, Dominguez ER, et al. Computational chemistry development of a unified free energy Markov model for the distribution of 1300 chemicals to 38 different environmental or biological systems. J Comput Chem. 2007 Aug;28(11):1909-23.
28. Kutner MH, Nachtsheim CJ, Neter J, Li W. Standardized Multiple Regression Model. Applied Linear Statistical Models. Fifth ed. New York: McGraw Hill 2005:271-7.
29. Pager D. Mark of a Criminal Record. Am J Soc. 2003;108:937-75.