Complex, Network, Entropy by Research Signpost

Complex Network Entropy: From Molecules to Biology, Parasitology, Technology, Social, Legal, and Neurosciences

Editor
Humberto González-Díaz
Department of Microbiology & Parasitology, Faculty of Pharmacy, USC, 15782 Santiago de Compostela, Spain

Co-editors
Francisco J. Prado-Prado
Xerardo García-Mera
Department of Organic Chemistry, Faculty of Pharmacy, USC, 15782 Santiago de Compostela, Spain

Transworld Research Network, T.C. 37/661 (2), Fort P.O., Trivandrum-695 023 Kerala, India

Published by Transworld Research Network
2011; Rights Reserved
Transworld Research Network, T.C. 37/661(2), Fort P.O., Trivandrum-695 023, Kerala, India

Editor: Humberto González-Díaz
Co-editors: Francisco J. Prado-Prado, Xerardo García-Mera
Managing Editor: S.G. Pandalai
Publication Manager: A. Gayathri

Transworld Research Network and the Editors assume no responsibility for the opinions and statements advanced by contributors.

ISBN: 978-81-7895-507-0

Preface

The applications of Complex Network theory are as diverse as the complex interacting systems found in nature, spanning physics, biology, economics, ecology, and computer science. For example, at the molecular level the structure of drugs, DNA sequences, RNA secondary structures, and the spatial structure of proteins may be described in terms of molecular graphs and/or different contact networks. In other current problems of the Biosciences we find prominent examples of supra-molecular networks, such as protein-protein interaction (PPI) networks, RNA transcript co-expression networks, and molecular networks in the genome of living cells. However, the use of graphs and networks is not limited to the molecular, macro-molecular, or supra-molecular level. Economic and social interactions often organize themselves into complex network structures. Similar phenomena are observed in traffic flow and in communication networks such as the Internet. On larger scales one finds networks of cells, as in neural networks, up to the scale of organisms in ecological food webs. The use of Entropy, or of measures of whole graphs or node centrality, to study the Complex Networks of different systems is currently gaining importance across a broad spectrum of topics. This has driven the recent development of software, web servers, and theoretical methods to construct graphs and networks of different systems, calculate Entropy measures of these graphs, seek structure-function relationships, and carry out data mining in many fields. It is very difficult to condense all this information into a single research paper or review manuscript. At least one book is therefore needed to describe the different uses of Entropy measures at different levels of organization of matter.
This kind of book is of interest because many users of these programs confine themselves to a narrow field of application and overlook the applications at higher or lower levels that have implications for their own research. On the other hand, many researchers working at the frontiers of these fields lack material reviewing the current applications and future perspectives of these methods, and the possible data-flow relationships between them within a common theoretical framework. Taking all these aspects into consideration, we decided to edit the present e-book, composed of a collection of papers that review and/or introduce new results on the common theoretical basis, applications, and interconnections between the inputs and outputs of different Entropy measures and graph/network approaches in different areas. Given its contents, we decided to entitle this book Complex Network Entropy: From Molecules to Biology, Parasitology, Technology, Social, Legal, and Neurosciences. We hope that the present book may serve as a bridge between theoretical scientists in graph theory and experimentalists in all these areas, and that it may suggest new areas of mutual interchange and collaboration. Finally, I would like to express, on behalf of all co-authors, our sincere gratitude to the editorial team of Transworld Research Network for their decisive and kind attention.

González-Díaz H

Contents

Chapter 1 Multi-target Markov Entropy QSAR for antiviral drugs vs. different viral species Francisco J. Prado-Prado, Xerardo García-Mera, Olga Caamaño and Humberto González-Díaz

Chapter 2 Entropy Multi-target QSAR model for Anti-Parasitic and Anti-Alzheimer GSK-3 inhibitors Isela García, Yagamare Fall, Generosa Gómez and Humberto González-Díaz

Chapter 3 Predicting drug-parasite networks based on Markov Entropy indices of drug structure Francisco Prado-Prado, Xerardo García-Mera, Franco Fernandez and Humberto González-Díaz

Chapter 4 A model based on Markov Entropy to predict the stability of collagen peptides Riccardo Concu, Gianni Podda, Bairong Shen and Humberto Gonzalez Diaz

Chapter 5 Non-self discrimination of parasite proteins with entropy of 3D structure networks and artificial neural networks Humberto González-Díaz, Xerardo García-Mera and Francisco Prado-Prado

Chapter 6 Predicting parasite-host networks with Markov Entropy measures for secondary structures of RNA phylogenetic biomarkers Humberto González-Díaz, Santiago Vilar and Lázaro Guillermo Pérez-Montoto

Chapter 7 Using Shannon entropy to seek a QSPR model for cerebral cortex co-activation networks Humberto González-Díaz

Chapter 8 Criminal law networks, Markov chains, Shannon entropy and artificial neural networks Aliuska Duardo-Sanchez

Chapter 9 Predicting fasciolosis in Galicia with Shannon entropy of landscape complex network Humberto González-Díaz, Mercedes Mezo, Marta González-Warleta, Esperanza Paniagua and Florencio M. Ubeira

Chapter 10 Markov entropy for biology, parasitology, linguistic, technology, social and law networks Marilena N. Berca, Aliuska Duardo-Sanchez, Humberto González-Díaz, Alejandro Pazos and Cristian R. Munteanu


Complex Network Entropy: From Molecules to Biology, Parasitology, Technology, Social, Legal, and Neurosciences, 2011: 1-15 ISBN: 978-81-7895-507-0 Editors: Humberto González-Díaz, Francisco J. Prado-Prado and Xerardo García-Mera

1. Multi-target Markov Entropy QSAR for antiviral drugs vs. different viral species

Francisco J. Prado-Prado¹, Xerardo García-Mera¹, Olga Caamaño¹ and Humberto González-Díaz²

¹Department of Organic Chemistry, Faculty of Pharmacy, USC, 15782 Santiago de Compostela, Spain; ²Department of Microbiology & Parasitology, Faculty of Pharmacy, USC, 15782 Santiago de Compostela, Spain

Correspondence/Reprint request: Dr. Francisco J. Prado-Prado, Department of Organic Chemistry, Faculty of Pharmacy, USC, 15782, Santiago de Compostela, Spain. E-mail: fenol1@hotmail.com

1. Introduction

Examples of diseases caused by viruses include the common cold (produced by any one of a variety of related viruses), AIDS (caused by HIV), and cold sores (caused by herpes simplex); these have been among the major health problems of the last 30 years. Other relationships are being studied, such as the connection of Human Herpesvirus 6 (HHV6), one of the eight known members of the human herpes virus family, with organic neurological diseases such as multiple sclerosis and chronic fatigue syndrome. Severe acute respiratory syndrome (SARS) is caused by a novel coronavirus, called the SARS coronavirus (SARS-CoV). Over 95% of well-characterized cohorts of SARS have evidence of recent SARS-CoV infection. The genome of SARS-CoV has been sequenced, and it is not related to any of the previously known human or animal coronaviruses. It is probable that SARS-CoV was an animal virus that adapted to human-human transmission in the recent past. Recently, it has been shown that cervical cancer is caused, at least partially, by Papillomavirus, representing the first significant evidence in humans for a link between cancer and an infective agent. The relative ability of viruses to cause disease is described in terms of virulence [1-14].

Consequently, there is increasing interest in the development of rational approaches for the discovery of antiviral drugs. In this sense, a very important role may be played by computer-aided drug discovery techniques based on Quantitative Structure-Activity Relationship (QSAR) models [15]. Unfortunately, almost all QSAR studies, including those for antiviral activity, use limited databases of structurally related compounds acting against one single viral species [16]. One important step in the evolution of this field was the introduction of QSAR models for heterogeneous series of antimicrobial compounds; see for instance the works of Cronin, de Julián-Ortiz, Gálvez, García-Domenech, Gosalbez, Marrero-Ponce, Torrens, and others [17-29]. As a result, researchers may predict very heterogeneous series of compounds, but they often need to use or develop as many QSAR equations as there are microbial species to be predicted; that is, predicting activity against different targets still requires a different QSAR model for each target. An interesting alternative is the prediction of structurally diverse series of antimicrobial compounds (antiviral in this case) against different targets (mechanisms) using complicated non-linear Artificial Neural Networks with multi-class prediction, e.g. the work of Vilar et al. [30]. We can understand strategies developed in this sense as Multi-Objective Optimization (MOOP) techniques; in this case we aim to optimize the activity of antiviral drugs against many different objectives or targets (viral species).
A very useful strategy related to the MOOP problem uses Derringer's desirability function and many QSAR models for different objectives [31]. In this sense, it is of major importance to develop unified but simple linear equations explaining the antimicrobial activity (in the present work, antiviral activity) of structurally heterogeneous series of compounds active against as many targets (viral species) as possible. We call this class of QSAR problem multi-target QSAR (mt-QSAR) [32, 33]. There are nearly 2000 chemical molecular descriptors that may in principle be generalized and used to solve the mt-QSAR problem. Many of these indices are known as Topological Indices (TIs) or simply invariants of a molecular graph G. We can think of G as a drawing composed of vertices (atoms) weighted with physicochemical properties (mass, polarity, electronegativity, or charge) and edges (chemical bonds) [34]. In any case, many of these indices have not yet been extended to encode information additional to chemical structure. One approach to mt-QSAR is the substitution of classic atomic weights by target-specific weights. For instance, we introduced and/or reviewed TIs that use atomic weights for the propensity of the atom to interact with different microbial targets [35], or to undergo partition in biphasic systems or distribution to biological tissues [36-38]. The method, called the MARCH-INSIDE approach [35, 39-41] (MARkovian CHemicals IN SIlico DEsign), calculates TIs using Markov Chain theory. In fact, MARCH-INSIDE defines a Markov matrix to derive matrix invariants such as stochastic spectral moments, mean values, absolute probabilities, or entropy measures for the study of molecular properties. Applications to macromolecules have extended to RNA, proteins, and the blood proteome [42-47]. In particular, one of the classes of MARCH-INSIDE descriptors is defined in terms of entropy measures, which have demonstrated flexibility in many bioorganic and medicinal chemistry problems such as estimation of anticoccidial activity, modelling the interaction between drugs and HIV-packaging-region RNA, and predicting protein and virus activity [38, 48-50]. We give high importance to entropy measures because entropy has been widely demonstrated to be an excellent function for codifying information in molecular systems; see for instance the important works of Graham [51-56]. However, the proficiency of entropy indices (of MARCH-INSIDE type or not) in solving mt-QSAR problems for antiviral compounds has not been studied. The present study develops the first mt-QSAR model based on entropy indices to predict the antiviral activity of drugs against different viral species. The model fits one of the largest datasets used to date in QSAR studies, with more than 47 000 cases resulting from the formation of different (antiviral compound/viral target) pairs.

2. Methods

2.1. Markov Entropy (θk) for drug-target k-th step-by-step interaction

One can consider a hypothetical situation in which a drug molecule is free in space at an arbitrary initial time (t0). It is then interesting to develop a simple stochastic model for a step-by-step interaction between the atoms of a drug molecule and a molecular receptor during the time in which the pharmacological effect is triggered. For the sake of simplicity, we consider from now on a general structure-less receptor, that is, a model of a receptor whose chemical structure and position are not taken into consideration. Specifically, the molecular descriptors used in the present work are called stochastic entropies θk, which are entropies describing the connectivity and the distribution of electrons for each atom in the molecule [57]. The initial entropy of interaction of the j-th atom of the drug with the target, 0θj(s), is considered a state function, so a reversible process of interaction may be decomposed into several elemental interactions between the j-th atom and the receptor. The 0 indicates that we refer to the initial interaction, and the argument (s) indicates that this quantity depends on the specific viral species. Afterwards, the interaction continues, and we have to define the interaction entropy kθij(s) between the j-th atom and the receptor for a specific viral species (s), given that the i-th atom has interacted at the previous time tk. In particular, immediately after the first interaction (t0 = 0), an interaction with probability 1pij(s) takes place at time t1 = 1, and so on. So one can suppose that the atoms begin their interaction with the structure-less molecular receptor by binding to it in discrete intervals of time tk. However, there are several alternative ways in which such a step-by-step binding process may occur [38, 58, 59]; Figure 1 illustrates this idea. The entropy 0θj(s) is considered here a function of the absolute temperature of the system and of the local equilibrium constant of interaction between the j-th atom and the receptor, 0Γj(s), for a given microbial species. Additionally, the quantity 1θij(s) can be defined by analogy in terms of 1Γij(s) [38, 58, 60]:

$${}^{0}\theta_{j}(s) = -R \cdot T \cdot \log {}^{0}\Gamma_{j}(s) \tag{1}$$

$${}^{1}\theta_{ij}(s) = -R \cdot T \cdot \log {}^{1}\Gamma_{ij}(s) \tag{2}$$

The present approach to antimicrobial-species-specific drug-receptor interaction has two main drawbacks. The first is the difficulty of defining the constants. In this work, we address this first question by estimating 0Γj(s) from the rate of occurrence nj(s) of the j-th atom in molecules active against a given species, relative to the number of atoms of the j-th class in the molecules tested against the same species, nT(s). With respect to 1Γij(s), we must take into consideration that once the j-th atom has interacted, the preferred candidates for the next interaction are those i-th atoms bound to j by a chemical bond. Both constants can then be written as [38, 58, 60]:

$${}^{0}\Gamma_{j}(s) = \left(\frac{n_{j}(s)}{n_{T}(s)} + 1\right) = e^{-\frac{{}^{0}\theta_{j}(s)}{R \cdot T}} \tag{3}$$

$${}^{1}\Gamma_{ij}(s) = \left(\alpha_{ij} \cdot \frac{n_{j}(s)}{n_{T}(s)} + 1\right) = e^{-\frac{{}^{1}\theta_{ij}(s)}{R \cdot T}} \tag{4}$$
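Equations (1) to (4) can be sketched numerically as follows. This is only a minimal illustration and not the MARCH-INSIDE implementation: the function names, the temperature of 298.15 K, the use of the natural logarithm, and the occurrence counts are all assumptions made for the example.

```python
import math

R = 8.314462618   # universal gas constant, J/(mol*K)
T = 298.15        # absolute temperature in K (assumed; not given in the text)


def gamma(n_j, n_t, alpha_ij=1):
    """Eqs. (3)-(4): local constant from occurrence counts; alpha_ij=1 gives eq. (3)."""
    return alpha_ij * n_j / n_t + 1.0


def theta(n_j, n_t, alpha_ij=1):
    """Eqs. (1)-(2): theta = -R*T*log(Gamma), natural logarithm assumed."""
    return -R * T * math.log(gamma(n_j, n_t, alpha_ij))


# Hypothetical counts: the atom class occurs 30 times in active molecules
# and 120 times in all molecules tested against the same species.
gamma_j = gamma(30, 120)              # 30/120 + 1
theta_j = theta(30, 120)              # negative, because Gamma > 1
theta_unbonded = theta(30, 120, 0)    # alpha_ij = 0 gives log(1), i.e. zero
print(gamma_j, theta_j, theta_unbonded)
```

Adding 1 inside the logarithm keeps the argument at or above 1, so an atom class that never occurs in active molecules contributes exactly zero rather than an undefined value.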


Figure 1. Alternative routes to step-by-step drug-target Markov interaction.

In equations (3) and (4), αij are the elements of the atom adjacency matrix; nj(s), nT(s), 0θj(s), and 1θij(s) have been defined in the paragraph above; R is the universal gas constant; and T is the absolute temperature. The number 1 is added to avoid scale problems and problems with the definition of the logarithmic function. The second problem relates to the description of the interaction process at higher times tk > t1. Here, Markov model (MM) theory enables a simple calculation of the probabilities with which the drug-receptor interaction takes place over time until the studied effect is achieved. In this work we focus on the interaction of drugs with a structure-less microbial target. As depicted in Figure 1, this model deals with the calculation of the probabilities (kpij) with which any arbitrary j-th atom of the molecule binds to the structure-less molecular receptor, given that another i-th atom has been bound before, along discrete time periods tk (k = 1, 2, 3, …; k = 1 in grey, k = 2 in blue, and k = 3 in red) throughout the chemical bonding system. The procedure described here considers the atoms of the molecule as the states of the MM. The method arranges all the 0θj(s) values in a vector θ(s) and all the 1θij(s) entropies of interaction in a square table of dimension n x n. After normalization of both the vector and the matrix we can build the corresponding absolute initial probability vector φ(s) and the stochastic matrix 1Π(s), which have the elements 0pj(s) and 1pij(s), respectively. The elements 0pj(s) of the vector φ(s) constitute the absolute probabilities with which the j-th atom interacts with the molecular target or receptor in the species s at the initial time, with respect to any atom in the molecule [38, 58, 60]:

$${}^{0}p_{j}(s) = \frac{{}^{0}\theta_{j}(s)}{\sum_{a=1}^{m} {}^{0}\theta_{a}(s)} = \frac{-RT \cdot \log\left(\frac{n_{j}(s)}{n_{T}(s)} + 1\right)}{\sum_{a=1}^{m} -RT \cdot \log\left(\frac{n_{a}(s)}{n_{T}(s)} + 1\right)} = \frac{\log\left(\frac{n_{j}(s)}{n_{T}(s)} + 1\right)}{\sum_{a=1}^{m} \log\left(\frac{n_{a}(s)}{n_{T}(s)} + 1\right)} \tag{5}$$
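The normalization in equation (5) can be illustrated with a few lines of code. The sketch below is only an illustration under stated assumptions: the function name, the three occurrence counts, and the single value used for nT(s) are hypothetical, and the natural logarithm is assumed. The factor -RT cancels between numerator and denominator, so it does not appear.

```python
import math


def initial_probabilities(counts, n_total):
    """Eq. (5): normalize log(n_a/n_T + 1) over the atoms of the molecule."""
    weights = [math.log(n_a / n_total + 1.0) for n_a in counts]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical occurrence counts n_a(s) for a three-atom molecule, n_T(s) = 120.
phi = initial_probabilities([30, 60, 10], 120)
print(phi)
```

Atoms whose class occurs more often in active molecules receive a larger share of the initial probability, and the entries of the vector sum to one.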

In equation (5), m represents the number of atoms in the molecule, including the j-th, and na is the rate of occurrence of any atom a, including the j-th with value nj. On the other hand, the matrix 1Π(s) is called the 1-step drug-target interaction stochastic matrix. 1Π(s) is also built as a square table of order n, where n represents the number of atoms in the molecule. The elements 1pij(s) of the 1-step drug-target interaction stochastic matrix are the binding probabilities with which a j-th atom binds to a structure-less molecular receptor, given that another i-th atom has interacted before, at time t1 = 1 (considering t0 = 0) [32, 38, 58, 60]:

$${}^{1}p_{ij}(s) = \frac{{}^{1}\theta_{ij}(s)}{\sum_{a=1}^{n} {}^{1}\theta_{ia}(s)} = \frac{\alpha_{ij} \cdot (-RT) \cdot \log\left(\frac{n_{j}(s)}{n_{T}(s)} + 1\right)}{\sum_{a=1}^{n} \alpha_{ia} \cdot (-RT) \cdot \log\left(\frac{n_{a}(s)}{n_{T}(s)} + 1\right)} = \frac{\alpha_{ij} \cdot \log\left(\frac{n_{j}(s)}{n_{T}(s)} + 1\right)}{\sum_{a=1}^{n} \alpha_{ia} \cdot \log\left(\frac{n_{a}(s)}{n_{T}(s)} + 1\right)} \tag{6}$$

Because the adjacency elements αij take only the values 0 or 1, log(αij·x + 1) = αij·log(x + 1), so this form agrees with equations (2) and (4).
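A short sketch of equation (6) may help to see how the adjacency matrix restricts the transitions. Everything named here is hypothetical: the function, the three-atom chain, its occurrence counts, and the choice of an adjacency matrix without self-loops (the text does not say whether the diagonal is included).

```python
import math


def stochastic_matrix(adjacency, counts, n_total):
    """Eq. (6): row-normalize alpha_ia * log(n_a/n_T + 1)."""
    weights = [math.log(n_a / n_total + 1.0) for n_a in counts]
    matrix = []
    for row in adjacency:
        terms = [alpha * w for alpha, w in zip(row, weights)]
        total = sum(terms)
        if total == 0:
            raise ValueError("atom has no weighted neighbours; row cannot be normalized")
        matrix.append([t / total for t in terms])
    return matrix


# Hypothetical three-atom chain A-B-C (no self-loops assumed in alpha).
alpha = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]
pi_1 = stochastic_matrix(alpha, [30, 60, 10], 120)
print(pi_1)
```

Each row sums to one, and a terminal atom with a single neighbour passes the interaction to that neighbour with probability 1.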

By using φ(s), 1Π(s), and the Chapman-Kolmogorov equations one can describe the further evolution of the system. Summing up all the atomic free energies of interaction 0θj(s), pre-multiplied by the absolute probabilities of drug-target interaction Apk(j,s), one can derive the average change in entropy kθs of the gradual interaction between the drug and the receptor at a specific time k in a given microbial species (s) [38]:

$${}^{k}\theta_{s} = \varphi(s) \cdot {}^{k}\Pi(s) \cdot \theta(s) = \varphi(s) \cdot \left[{}^{1}\Pi(s)\right]^{k} \cdot \theta(s) = \sum_{j=1}^{n} {}^{k}\theta_{j}(s) = \sum_{j=1}^{n} {}^{A}p_{k}(j,s) \cdot {}^{0}\theta_{j}(s) \tag{7}$$

Such a model is stochastic per se (probabilistic step-by-step atom-receptor interaction in time) but also considers molecular connectivity (the step-by-step atom union in space throughout the chemical bonding system).
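The k-th order average of equation (7) amounts to propagating the initial probability vector through the stochastic matrix k times and then weighting the atomic values. The sketch below illustrates this under stated assumptions: the function names and the two-atom toy system with its θ values are hypothetical and chosen only so that the result is easy to check by hand.

```python
def step(p, matrix):
    """One Chapman-Kolmogorov step: row vector p times the stochastic matrix."""
    n = len(p)
    return [sum(p[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def markov_entropy(phi, pi_1, theta0, k):
    """Eq. (7): k-th order average, phi * (pi_1)^k * theta."""
    p = list(phi)
    for _ in range(k):
        p = step(p, pi_1)
    return sum(p_j * t_j for p_j, t_j in zip(p, theta0))


# Hypothetical two-atom molecule: interaction starts on atom 0 and must hop to atom 1.
phi = [1.0, 0.0]
pi_1 = [[0.0, 1.0],
        [1.0, 0.0]]
theta0 = [-2.0, -4.0]   # illustrative 0-theta_j(s) values, arbitrary units
series = [markov_entropy(phi, pi_1, theta0, k) for k in range(4)]
print(series)   # [-2.0, -4.0, -2.0, -4.0]
```

In this toy case the probability alternates between the two atoms, so the descriptor alternates between their two values; for a real molecule the series kθs (k = 0, 1, 2, ...) provides the set of descriptors entering the QSAR equation.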

2.2. Statistical analysis

Following on from the previous section, we can attempt to develop a simple linear QSAR using the MARCH-INSIDE methodology, as defined previously, with the general formula:

$$\mathrm{Actv} = a_{0} \cdot {}^{0}\theta_{s} + a_{1} \cdot {}^{1}\theta_{s} + a_{2} \cdot {}^{2}\theta_{s} + a_{3} \cdot {}^{3}\theta_{s} + \dots + a_{k} \cdot {}^{k}\theta_{s} + b_{0} \tag{8}$$

Here, kθs act as the microbial-species-specific molecule-target interaction descriptors. The calculation of these indices is explained in the supplementary material for reasons of space. We selected Linear Discriminant Analysis (LDA) to fit the classification functions. The model deals with the classification of a set of compounds as active or not against different microbial species [60]. A dummy variable (Actv) was used to codify the antimicrobial activity. This variable indicates either the presence (Actv = 1) or absence (Actv = -1) of antimicrobial activity of the drug against the specific species. In equation (8), ak represents the coefficients of the classification function and b0 the independent term, determined by the least squares method as implemented in the LDA module of the STATISTICA 6.0 software package [61]. Forward stepwise selection was fixed as the strategy for variable selection [60]. The quality of LDA models was determined by examining Wilks' U statistic, the Fisher ratio (F), and the p-level (p). We also inspected the percentage of good classification and the ratios between the cases and variables in the equation and variables to be explored, in order to avoid over-fitting or chance correlation. Validation of the model was corroborated by re-substitution of cases in four predicting series [60, 61].
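Once the coefficients of equation (8) are known, scoring a drug/species pair is a weighted sum of its descriptors. The following sketch shows that step only; it does not fit an LDA model. The function names, coefficients, and descriptor values are hypothetical, and the zero decision threshold is an assumption, since the text does not state the rule used to assign the class.

```python
def actv_score(thetas, coefficients, b0):
    """Eq. (8): Actv = a_0*theta_0 + ... + a_k*theta_k + b0."""
    if len(thetas) != len(coefficients):
        raise ValueError("one coefficient is needed per descriptor")
    return sum(a * t for a, t in zip(coefficients, thetas)) + b0


def classify(score, threshold=0.0):
    """Map the score onto the dummy variable: 1 = active, -1 = non-active."""
    return 1 if score > threshold else -1


# Hypothetical coefficients and descriptor values, for illustration only.
coefficients = [0.5, -1.0, 2.0]
b0 = -1.0
active_like = actv_score([4.0, 1.0, 0.5], coefficients, b0)     # 2.0 - 1.0 + 1.0 - 1.0
inactive_like = actv_score([1.0, 2.0, 0.5], coefficients, b0)   # 0.5 - 2.0 + 1.0 - 1.0
print(classify(active_like), classify(inactive_like))
```

Because the descriptors depend on the species s, the same compound can receive different scores, and therefore different predicted classes, for different viral species.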


2.3. Data set

The data set was formed by a set of marketed and/or very recently reported antiviral drugs with low reported MIC50 values (< 10 μM) against different viruses. The data set comprised more than 1100 different drugs experimentally tested against some species of a list of 40 viruses. Not all drugs were tested in the literature against all listed species, so we were able to collect 47 470 cases (drug/species pairs) instead of 1100 x 40 cases. The names or codes and activity of all compounds, as well as the references used to collect them, are given in the supplementary material files; see Table 1SM.

3. Results and discussion

3.1. mt-QSAR model

One of the main advantages of the present stochastic approach is the possibility of deriving average thermodynamic parameters that depend on the probability of the states of the MM. The generalized parameters have a clearer physicochemical meaning than our previous ones [38, 58, 59]. Specifically, this work introduces for the first time a linear mt-QSAR equation model useful for the prediction and MOOP of the antiviral activity of drugs against different viral target species or objectives. The best model found was:

$$\mathrm{Actv} = 0.38 \cdot {}^{3}\theta(s)_{\mathrm{Het}} - 0.84 \cdot {}^{0}\theta(s)_{\mathrm{Total}} - 0.91 \cdot {}^{0}\theta(s)_{\mathrm{Csat}} + 0.89 \cdot {}^{1}\theta(s)_{\mathrm{Csat}} + 2.01 \cdot {}^{0}\theta(s)_{\mathrm{Csp\&sp2}} - 0.32 \cdot {}^{5}\theta(s)_{\mathrm{H-Het}} - 4.71 \tag{9}$$

$$N = 31190 \qquad \lambda = 0.38 \qquad \chi^{2} = 377.43 \qquad p < 0.001$$

In the model, λ is Wilks' statistic for the overall discrimination, χ2 is the Chi-square, and p is the error level. In this equation, the kθs were calculated for the totality (T) of the atoms in the molecule or for specific collections of atoms. These collections are atoms with a common characteristic, for instance: heteroatoms (Het), unsaturated carbon atoms (Cunst, written Csp&sp2 in equation (9)), saturated carbon atoms (Csat), and hydrogen bound to a heteroatom (H-Het). The model correctly classifies 31 188 out of 31 213 non-active compounds (99.92%) and 432 out of 434 active compounds (99.54%). Overall training predictability was 98.56%. Validation of the model was carried out by means of external predicting series, the model correctly classifying 15 588 out of 15 606 non-active compounds and 213 out of 217 active compounds. Overall validation predictability was 98.54%; see Table 1.

Table 1. Results of the model, analysis, validation.

Series      Parameter    %      Classes     Non-active  Antiviral
Analysis    Sensitivity  99.92  Non-active  31 188      25
            Specificity  99.54  Antiviral   2           432
            Accuracy     98.56
Validation  Sensitivity  99.88  Non-active  15 588      18
            Specificity  98.16  Antiviral   4           213
            Accuracy     98.54

The most interesting fact is that the kθs can discern the active/non-active classification of compounds among a large number of viral species. This property is related to the definition of the kθs using species-specific atomic weights (see the supplementary material file for the method). It allows us to model for the first time a very heterogeneous and diverse data set with 47 470 cases (one of the largest in QSAR). Another interesting characteristic of the model is that the kθs used as molecular descriptors depend both on the molecular structure of the drug and on the viral species against which the drug must act. The codification of the molecular structure is basically due to the use of the adjacency factor αij to encode atom-atom bonding (molecular connectivity). The other aspect that allows the encoding of molecular structural changes is that the entropies kθs are atom-class specific. This property is related to the definition of the kθs. The values of these species-specific atomic standard free energies, reported herein for the first time, are given in Table 2 for some atoms and more than 40 species. For example, one change in the molecular structure, e.g. the replacement of S by O, necessarily implies a change in the values of these interaction descriptors. Moreover, the most interesting fact is that the kθs are molecular descriptors for antimicrobial mt-QSAR studies that are able to distinguish among a large number of viral species. The present work is the first reported mt-QSAR model using the entropies kθs as molecular descriptors, allowing the prediction of the antiviral activity of any organic compound against a very large diversity of viral pathogens.
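The class-wise percentages in Table 1 can be recomputed directly from the classification counts. The sketch below does this with hypothetical helper names; only the counts come from the text. The class-wise values reproduce the reported 99.92%, 99.54%, 99.88%, and 98.16%. The overall percentage recomputed this way (about 99.9% for both series) does not coincide with the accuracy values quoted in the text (98.56% and 98.54%), and the source does not state how those were obtained.

```python
def percent(part, whole):
    """Percentage rounded to two decimals, as reported in Table 1."""
    return round(100.0 * part / whole, 2)


def summarize(nonactive_ok, nonactive_wrong, antiviral_wrong, antiviral_ok):
    """Class-wise and overall percentages of correct classification."""
    n_nonactive = nonactive_ok + nonactive_wrong
    n_antiviral = antiviral_ok + antiviral_wrong
    return {
        "non_active": percent(nonactive_ok, n_nonactive),
        "antiviral": percent(antiviral_ok, n_antiviral),
        "overall": percent(nonactive_ok + antiviral_ok, n_nonactive + n_antiviral),
        "cases": n_nonactive + n_antiviral,
    }


training = summarize(31188, 25, 2, 432)      # analysis series counts from Table 1
validation = summarize(15588, 18, 4, 213)    # validation series counts from Table 1
print(training)
print(validation)
```

The two series together contain 31 647 + 15 823 = 47 470 cases, which matches the size of the data set given in section 2.3.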

4. Conclusions

The entropy-based mt-QSAR equation is able to predict the biological activity of antiviral drugs in more general situations than traditional QSAR models, whose major limitation is that they predict the biological activity of drugs against only one viral species. The present model, with a very large data set, significantly improves on previous QSAR models and may help to perform MOOP of drug activity against different viral species. This mt-QSAR methodology improves on such models by using entropy as a molecular descriptor, allowing the prediction of the antiviral activity of any organic compound against a very large diversity of viral pathogens.

Table 2. Standard atomic free energy values for atom-receptor interactions.


Investigación e Desenvolvemento, Xunta de Galicia and European Social Fund (ESF). The authors acknowledge partial financial support from project nº 07CSA008203PR, which has been sponsored by Xunta de Galicia.

References

1. Mushahwar IK. Hepatitis E virus: molecular virology, clinical features, diagnosis, transmission, epidemiology, and prevention. Journal of medical virology. 2008 Apr;80(4):646-58.
2. Fryer JF, Baylis SA, Gottlieb AL, Ferguson M, Vincini GA, Bevan VM, et al. Development of working reference materials for clinical virology. J Clin Virol. 2008 Dec;43(4):367-71.
3. Rabenau HF, Kessler HH, Kortenbusch M, Steinhorst A, Raggam RB, Berger A. Verification and validation of diagnostic laboratory tests in clinical virology. J Clin Virol. 2007 Oct;40(2):93-8.
4. Nindl I, Gottschling M, Stockfleth E. Human papillomaviruses and nonmelanoma skin cancer: basic virology and clinical manifestations. Disease markers. 2007;23(4):247-59.
5. Hayes EB, Sejvar JJ, Zaki SR, Lanciotti RS, Bode AV, Campbell GL. Virology, pathology, and clinical manifestations of West Nile virus disease. Emerging infectious diseases. 2005 Aug;11(8):1174-9.
6. Raimondo G, Pollicino T, Squadrito G. Clinical virology of hepatitis B virus infection. Journal of hepatology. 2003;39 Suppl 1:S26-30.
7. Rizzetto M. Hepatitis D: virology, clinical and epidemiological aspects. Acta gastro-enterologica Belgica. 2000 Apr-Jun;63(2):221-4.
8. Pawlotsky JM. Hepatitis C: virology, clinical aspects and the relation to cryoglobulinemia. Acta gastro-enterologica Belgica. 2000 Apr-Jun;63(2):200-1.
9. Hendley JO. Clinical virology of rhinoviruses. Advances in virus research. 1999;54:453-66.
10. Coutlee F, Mayrand MH, Provencher D, Franco E. The future of HPV testing in clinical laboratories and applied virology research. Clinical and diagnostic virology. 1997 Aug;8(2):123-41.
11. Shafer RW, Merigan TC. HIV virology for clinical trials. AIDS (London, England). 1995;9 Suppl A:S193-202.
12. McCutchan JA. Virology, immunology, and clinical course of HIV infection. Journal of consulting and clinical psychology. 1990 Feb;58(1):5-12.
13. Hoofnagle JH. Type B hepatitis: virology, serology and clinical course. Seminars in liver disease. 1981 Feb;1(1):7-14.
14. Nicholls J, Dong XP, Jiang G, Peiris M. SARS: clinical virology and pathogenesis. Respirology (Carlton, Vic). 2003 Nov;8 Suppl:S6-8.
15. Prado-Prado FJ, Martinez de la Vega O, Uriarte E, Ubeira FM, Chou KC, Gonzalez-Diaz H. Unified QSAR approach to antimicrobials. 4. Multi-target QSAR modeling and comparative multi-distance study of the giant components of antiviral drug-drug complex networks. Bioorg Med Chem. 2009 Jan 15;17(2):569-75.

Francisco J. Prado-Prado et al.

16. Fratev F, Benfenati E. 3D-QSAR and molecular mechanics study for the differences in the azole activity against yeastlike and filamentous fungi and their relation to P450DM inhibition. 1. 3-substituted-4(3H)-quinazolinones. Journal of chemical information and modeling. 2005 May-Jun;45(3):634-44. 17. Cronin MT, Aptula AO, Dearden JC, Duffy JC, Netzeva TI, Patel H, et al. Structure-based classification of antibacterial activity. J Chem Inf Comput Sci. 2002 Jul-Aug;42(4):869-78. 18. Marrero-Ponce Y, Castillo-Garit JA, Olazabal E, Serrano HS, Morales A, Castanedo N, et al. Atom, atom-type and total molecular linear indices as a promising approach for bioorganic and medicinal chemistry: theoretical and experimental assessment of a novel method for virtual screening and rational design of new lead anthelmintic. Bioorg Med Chem. 2005 Feb 15;13(4):1005-20. 19. Marrero-Ponce Y, Medina-Marrero R, Torrens F, Martinez Y, Romero-Zaldivar V, Castro EA. Atom, atom-type, and total nonstochastic and stochastic quadratic fingerprints: a promising approach for modeling of antibacterial activity. Bioorg Med Chem. 2005 Apr 15;13(8):2881-99. 20. Marrero-Ponce Y, Meneses-Marcel A, Castillo-Garit JA, Machado-Tugores Y, Escario JA, Barrio AG, et al. Predicting antitrichomonal activity: a computational screening using atom-based bilinear indices and experimental proofs. Bioorg Med Chem. 2006 Oct 1;14(19):6502-24. 21. Montero-Torres A, Vega MC, Marrero-Ponce Y, Rolon M, Gomez-Barrio A, Escario JA, et al. A novel non-stochastic quadratic fingerprints-based approach for the 'in silico' discovery of new antitrypanosomal compounds. Bioorg Med Chem. 2005 Nov 15;13(22):6264-75. 22. Meneses-Marcel A, Marrero-Ponce Y, Machado-Tugores Y, Montero-Torres A, Pereira DM, Escario JA, et al. A linear discrimination analysis based virtual screening of trichomonacidal lead-like compounds: outcomes of in silico studies supported by experimental results. Bioorg Med Chem Lett. 2005 Sep 1;15(17):3838-43. 23. 
Vega MC, Montero-Torres A, Marrero-Ponce Y, Rolon M, Gomez-Barrio A, Escario JA, et al. New ligand-based approach for the discovery of antitrypanosomal compounds. Bioorg Med Chem Lett. 2006 Apr 1;16(7):1898-904. 24. Marrero-Ponce Y, Meneses-Marcel A, Rivera-Borroto OM, Garcia-Domenech R, De Julian-Ortiz JV, Montero A, et al. Bond-based linear indices in QSAR: computational discovery of novel anti-trichomonal compounds. J Comput Aided Mol Des. 2008 Aug;22(8):523-40. 25. Garcia-Domenech R, Galvez J, de Julian-Ortiz JV, Pogliani L. Some new trends in chemical graph theory. Chem Rev. 2008 Mar;108(3):1127-69. 26. Marrero-Ponce Y, Khan MT, Casanola-Martin GM, Ather A, Sultankhodzhaev MN, Garcia-Domenech R, et al. Bond-based 2D TOMOCOMD-CARDD approach for drug discovery: aiding decision-making in 'in silico' selection of new lead tyrosinase inhibitors. J Comput Aided Mol Des. 2007 Apr;21(4):167-88. 27. Garcia-Garcia A, Galvez J, de Julian-Ortiz JV, Garcia-Domenech R, Munoz C, Guna R, et al. Search of chemical scaffolds for novel antituberculosis agents. J Biomol Screen. 2005 Apr;10(3):206-14.

28. Garcia-Garcia A, Galvez J, de Julian-Ortiz JV, Garcia-Domenech R, Munoz C, Guna R, et al. New agents active against Mycobacterium avium complex selected by molecular topology: a virtual screening method. J Antimicrob Chemother. 2004 Jan;53(1):65-73. 29. Meneses-Marcel A, Rivera-Borroto OM, Marrero-Ponce Y, Montero A, Machado Tugores Y, Escario JA, et al. New antitrichomonal drug-like chemicals selected by bond (edge)-based TOMOCOMD-CARDD descriptors. J Biomol Screen. 2008 Sep;13(8):785-94. 30. Vilar S, Santana L, Uriarte E. Probabilistic neural network model for the in silico evaluation of anti-HIV activity and mechanism of action. J Med Chem. 2006;49(3):1118-24. 31. Cruz-Monteagudo M, Borges F, Cordeiro MN, Cagide Fajin JL, Morell C, Ruiz RM, et al. Desirability-based methods of multiobjective optimization and ranking for global QSAR studies. Filtering safe and potent drug candidates from combinatorial libraries. J Comb Chem. 2008 Nov-Dec;10(6):897-913. 32. González-Díaz H, Prado-Prado FJ, Santana L, Uriarte E. Unify QSAR approach to antimicrobials. Part 1: Predicting antifungal activity against different species. Bioorg Med Chem. 2006 Jun 5;14 5973-80. 33. González-Díaz H, Prado-Prado F. Unified QSAR and Network-Based Computational Chemistry Approach to Antimicrobials, Part 1: Multispecies Activity Models for Antifungals. J Comput Chem. 2008;29:656-7. 34. Todeschini R, Consonni V. Handbook of Molecular Descriptors: Wiley-VCH 2002. 35. Gonzalez-Diaz H, Prado-Prado F, Ubeira FM. Predicting antimicrobial drugs and targets with the MARCH-INSIDE approach. Curr Top Med Chem. 2008; 8(18): 1676-90. 36. González-Díaz H, Cabrera-Pérez MA, Agüero-Chapín G, Cruz-Monteagudo M, Castañedo-Cancio N, del Río MA, et al. Multi-target QSPR assemble of a Complex Network for the distribution of chemicals to biphasic systems and biological tissues. Chemometrics Intellig Lab Syst. 2008;94:160-5. 37. 
Cruz-Monteagudo M, González-Díaz H, Agüero-Chapin G, Santana L, Borges F, Domínguez RE, et al. Computational Chemistry Development of a Unified Free Energy Markov Model for the Distribution of 1300 Chemicals to 38 Different Environmental or Biological Systems. J Comput Chem. 2007; 28:1909-22. 38. González-Díaz H, Aguero G, Cabrera MA, Molina R, Santana L, Uriarte E, et al. Unified Markov thermodynamics based on stochastic forms to classify drugs considering molecular structure, partition system, and biological species: distribution of the antimicrobial G1 on rat tissues. Bioorg Med Chem Lett. 2005 Feb 1;15(3):551-7. 39. González-Díaz H, Torres-Gomez LA, Guevara Y, Almeida MS, Molina R, Castanedo N, et al. Markovian chemicals "in silico" design (MARCH-INSIDE), a promising approach for computer-aided molecular design III: 2.5D indices for the discovery of antibacterials. J Mol Model. 2005 Mar;11(2):116-23. 40. González-Díaz H, Gia O, Uriarte E, Hernadez I, Ramos R, Chaviano M, et al. Markovian chemicals "in silico" design (MARCH-INSIDE), a promising approach for computer-aided molecular design I: discovery of anticancer compounds. J Mol Model 2003 Dec;9(6):395-407.

41. González-Díaz H, Olazabal E, Castanedo N, Sanchez IH, Morales A, Serrano HS, et al. Markovian chemicals "in silico" design (MARCH-INSIDE), a promising approach for computer aided molecular design II: experimental and theoretical assessment of a novel method for virtual screening of fasciolicides. J Mol Model 2002 Aug;8(8):237-45. 42. González-Díaz H, Uriarte E. Proteins QSAR with Markov average electrostatic potentials. Bioorg Med Chem Lett. 2005 Nov 15;15(22):5088-94. 43. Saiz-Urra L, González-Díaz H, Uriarte E. Proteins Markovian 3D-QSAR with spherically-truncated average electrostatic potentials. Bioorg Med Chem. 2005 Jun 1;13(11):3641-7. 44. Ferino G, Delogu G, Podda G, Uriarte E, González-Díaz H. Quantitative Proteome-Disease Relationships (QPDRs) in Clinical Chemistry: Prediction of Prostate Cancer with Spectral Moments of PSA/MS Star Networks. In: Mitchem BHaS, Ch.L., ed. Clinical Chemistry Research (ISBN: 978-1-60692-517-1). NY: Nova Science Publisher 2009. 45. Concu R, Podda G, Uriarte E, González-Díaz H. A New Computational Chemistry & Complex Networks approach to Structure-Function and Similarity Relationships in Protein Enzymes. In: Collett CTaR, C.D., ed. Handbook of Computational Chemistry Research: Nova Science Publishers 2009. 46. González-Díaz H, González-Díaz Y, Santana L, Ubeira FM, Uriarte E. Proteomics, networks and connectivity indices. Proteomics. 2008;8:750-78. 47. González-Díaz H, Vilar S, Santana L, Uriarte E. Medicinal Chemistry and Bioinformatics – Current Trends in Drugs Discovery with Networks Topological Indices. Curr Top Med Chem. 2007;7(10):1025-39. 48. González-Díaz H, Saiz-Urra L, Molina R, Santana L, Uriarte E. A Model for the Recognition of Protein Kinases Based on the Entropy of 3D van der Waals Interactions. Journal of proteome research. 2007 Feb 2;6(2):904-8. 49. González-Díaz H, Marrero Y, Hernandez I, Bastida I, Tenorio E, Nasco O, et al. 
3D-MEDNEs: an alternative "in silico" technique for chemical research in toxicology. 1. prediction of chemically induced agranulocytosis. Chem Res Toxicol. 2003 Oct;16(10):1318-27. 50. González-Díaz H, Molina R, Uriarte E. Markov Entropy backbone electrostatic descriptors for predicting proteins biological activity. Bioorg Med Chem Lett. 2004 Sep 20;14(18):4691-5. 51. Graham DJ. Information Content in Organic Molecules: Brownian Processing at Low Levels. Journal of chemical information and modeling. 2007;47(2):376-89. 52. Graham DJ, Schacht D. Base Information Content in Organic Molecular Formulae. J Chem Inf Comput Sci. 2000;40:942. 53. Graham DJ. Information Content in Organic Molecules: Structure Considerations Based on Integer Statistics. J Chem Inf Comput Sci. 2002;42:215. 54. Graham DJ, Malarkey C, Schulmerich MV. Information Content in Organic Molecules: Quantification and Statistical Structure via Brownian Processing. J Chem Inf Comput Sci. 2004;44(1601).

55. Graham DJ, Schulmerich MV. Information Content in Organic Molecules: Reaction Pathway Analysis via Brownian Processing. J Chem Inf Comput Sci. 2004;44(1612). 56. Graham DJ. Information Content and Organic Molecules: Aggregation States and Solvent Effects. Journal of chemical information and modeling. 2005;45(1223). 57. Gonzalez-Diaz H, Tenorio E, Castanedo N, Santana L, Uriarte E. 3D QSAR Markov model for drug-induced eosinophilia--theoretical prediction and preliminary experimental assay of the antimicrobial drug G1. Bioorg Med Chem. 2005 Mar 1;13(5):1523-30. 58. González-Díaz H, Cruz-Monteagudo M, Molina R, Tenorio E, Uriarte E. Predicting multiple drugs side effects with a general drug-target interaction thermodynamic Markov model. Bioorg Med Chem. 2005 Feb 15;13(4):1119-29. 59. Cruz-Monteagudo M, González-Díaz H. Unified drug-target interaction thermodynamic Markov model using stochastic entropies to predict multiple drugs side effects. Eur J Med Chem. 2005 Oct;40(10):1030-41. 60. Van Waterbeemd H. Discriminant Analysis for Activity Prediction. In: Van Waterbeemd H, ed. Chemometric methods in molecular design. New York: Wiley-VCH 1995:265-82. 61. StatSoft.Inc. STATISTICA (data analysis software system), version 6.0, www.statsoft.com.Statsoft, Inc. 6.0 ed 2002.

Transworld Research Network 37/661 (2), Fort P.O. Trivandrum-695 023 Kerala, India

Complex Network Entropy: From Molecules to Biology, Parasitology, Technology, Social, Legal, and Neurosciences, 2011: 17-29 ISBN: 978-81-7895-507-0 Editors: Humberto González-Díaz, Francisco J. Prado-Prado and Xerardo García-Mera

2. Entropy Multi-target QSAR model for Anti-Parasitic and Anti-Alzheimer GSK-3 inhibitors

Isela García1, Yagamare Fall1, Generosa Gómez1 and Humberto González-Díaz2

1Department of Organic Chemistry, University of Vigo, Spain; 2Department of Microbiology and Parasitology, Faculty of Pharmacy, USC, Santiago de Compostela, 15782, Spain

Introduction

At present, there is increasing interest in evaluating kinases from unicellular parasites as targets for potential new anti-parasitic drugs. The evolutionary difference between unicellular kinases and their human homologues might be sufficient to allow the design of parasite-specific inhibitors. The Plasmodium falciparum genome contains 65 genes that encode kinases, including three forms of glycogen synthase kinase-3 (GSK-3). An initial study showed that P. falciparum exports PfGSK-3 to the cytoplasm of host erythrocytes (which are devoid of GSK-3), where it colocalizes with parasite-generated membrane structures known as Maurer's clefts. The function of PfGSK-3 is unknown, but the presence of PfCK1, a CK1 homolog, in infected red blood cells supports the hypothesis that both kinases play a role in regulating the strong circadian rhythm of the parasite, which is responsible for the circadian fevers that are characteristic of this infectious disease [1].

Correspondence/Reprint request: Dr. Humberto González-Díaz, Department of Microbiology and Parasitology, Faculty of Pharmacy, USC, Santiago de Compostela, 15782, Spain. E-mail: gonzalezdiazh@yahoo.es

The vector-borne parasitic disease African trypanosomiasis, caused by members of the Trypanosoma brucei complex, is a serious health threat. It is estimated that 300,000 to 500,000 people in sub-Saharan Africa are infected. If the disease is left inadequately treated, it often has a fatal outcome. Once infection is established, safe and effective therapy is critically important, yet it has been difficult to achieve. Despite the critical need, the available therapies are becoming less satisfactory because of the rising level of resistance to the available drugs, the long period of treatment required to achieve a cure, and the unacceptable and sometimes severe adverse effects associated with current therapies [2]. An urgent priority is to identify and validate new targets for the development of safe, effective, and inexpensive therapeutic alternatives. Compounds that inhibit T. brucei GSK-3 activity but not host GSK-3 might be required for the therapy of pregnant women and infants, because GSK-3 regulates proteins critical in development, such as the wnt gene product. However, optimizing the selectivity of drug candidates for parasite kinases is an issue because of the highly conserved amino acids and protein conformation of the catalytic domains [3-6]. Understanding the differences in the substrate binding properties and the three-dimensional structures of mammalian and parasite GSK-3 enzymes is important for the optimization of selected target inhibitors for drug development [7, 8]. Beyond these cases, there are other parasites, fungi, etc. that may be targeted with compounds that also inhibit the GSK-3 enzyme, and the search for such compounds is the aim of this work.

On the other hand, Alzheimer's disease (AD) is currently the most common cause of dementia in the elderly [9]. This serious degenerative disorder involves the gradual loss of neurons and, in spite of the efforts made by the major pharmaceutical companies, the cause of this pathology is still not clear. The fundamental characteristic of Alzheimer's disease is the presence in the brain of two lesions: the neurofibrillary tangles, which are formed by paired helical filaments (PHF) whose main component is Tau protein phosphorylated by Tau protein kinase (TPK), and the senile plaques formed by the aggregation of the β-amyloid peptide.

In addition, GSK-3 is a serine-threonine kinase encoded by two isoforms in mammals, termed GSK-3α and GSK-3β [10]. Initially GSK-3 was implicated in muscle energy storage and metabolism, but since its cloning a more generalized role in cellular regulation has emerged, highlighted by the wide array of substrates controlled by this enzyme, which includes cytoplasmic proteins and nuclear transcription factors. GSK-3 targets encompass proteins implicated in Alzheimer's disease, neurological disorders, the wnt and insulin signaling pathways, glycogen and protein synthesis, regulation of transcription factors, embryonic development, cell proliferation and adhesion, tumorigenesis, apoptosis, circadian rhythm, etc.

The functions of GSK-3 and its implication in various human diseases have stimulated an active search for potent and selective GSK-3 inhibitors [11]. Studies of GSK-3 homologues in various organisms have revealed physiological roles for the enzyme in differentiation, cell fate determination, and spatial patterning to establish bilateral embryonic symmetry [12]. Purified GSK-3α and GSK-3β exhibit similar biochemical and substrate properties [12, 13]. It is known that glycogen synthase kinase 3β (GSK-3β) takes an active part in the phosphorylation of Tau protein. GSK-3β not only plays a fundamental role in the synthesis of glycogen (where it was first identified) but is also very important in several processes such as cellular signaling, metabolic control, embryogenesis, cell death and oncogenesis [14], and it is related to a wide range of neurodegenerative diseases [15], bipolar mood disorders [16] and diabetes. For this reason, inhibition of this enzyme is one of the most promising therapeutic strategies currently being pursued. In 1988 Ishiguro et al. [17] isolated an enzyme while studying a brain extract that generated the paired helical filament form of Tau protein, a typical lesion of Alzheimer's disease. TPKI and TPKII are the two kinases involved in this process, and TPKI was found to have a structure identical to that of GSK-3β.

In parallel, the development of QSARs using simple molecular indices appears to be a promising alternative or complementary technique to drug-protein docking, high-throughput screening and combinatorial chemistry techniques. Almost all QSAR techniques are based on the use of molecular descriptors, which are numerical series that codify useful chemical information and enable statistical correlations with biological properties [18-20]. Shannon entropy and other entropy-like measures are among the most prominent parameters used to codify structural information in QSAR studies [21, 22]. In this direction, our research group has introduced a novel series of stochastic indices in the so-called MARCH-INSIDE approach. The method is based on the use of Markov Models (MM) to calculate absolute probabilities of the distribution of different atomic properties within the molecular skeleton with specific bonding patterns. Using these absolute probabilities we can apply the Shannon formula to calculate entropy parameters of the distribution of atomic properties in the molecule [23, 24].

In this work we explore the potential of MARCH-INSIDE to seek a QSAR for GSK-3 inhibitors from a heterogeneous series of compounds. In the first step, the aforementioned molecular descriptors were calculated for a large series of active/non-active compounds. LDA was subsequently used to fit a classification function. The QSAR developed was then validated with an external prediction series by the re-substitution technique.
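The idea of applying the Shannon formula to Markov absolute probabilities can be illustrated with a minimal Python sketch. It is not the MARCH-INSIDE implementation: the function name `markov_entropies`, the use of an atomic weight (for example electronegativity) to scale the transition probabilities, the self-loops, and the initial probability vector are all assumptions made only for illustration.

```python
import numpy as np

def markov_entropies(adjacency, weights, k_max=5):
    """Return [theta_0, ..., theta_k_max]: Shannon entropies of the absolute
    probability vectors of a Markov chain defined on a molecular graph.

    adjacency : square 0/1 matrix of bonds between atoms
    weights   : one positive atomic property value per atom (assumed weighting)
    """
    A = np.asarray(adjacency, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Assumption: an atom may also keep its own state, so self-loops are added.
    M = (A + np.eye(len(w))) * w            # column j is scaled by the weight of atom j
    P = M / M.sum(axis=1, keepdims=True)    # row-stochastic transition matrix
    p = w / w.sum()                         # assumed initial absolute probabilities
    thetas = []
    for _ in range(k_max + 1):
        nz = p[p > 0]                       # skip zeros to avoid log(0)
        thetas.append(float(-(nz * np.log(nz)).sum()))
        p = p @ P                           # absolute probabilities one step later
    return thetas

# Hypothetical three-atom chain C-C-O weighted by Pauling electronegativities.
chain = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(markov_entropies(chain, [2.55, 2.55, 3.44]))
```

Each value in the returned list plays the role of an entropy descriptor θk for step k; restricting the sum to a subset of atoms (for example heteroatoms) would give group-specific values analogous to θk(G).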

Materials and methods

Computational methods. The MARCH-INSIDE approach [25-27] is based on the calculation of the different physicochemical molecular properties as an average of atomic properties (ap). For instance, it is possible to derive average estimations of molecular descriptors or group indices [28, 29].

Multi-target Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) was used to construct the classifiers. One of the most important steps in this work was the organization of the spreadsheet containing the raw data used as input for the LDA, because this is not a classic classifier and the layout of the data is unusual. Our aim is to use a two-group discriminant function to classify compounds into two possible groups: compounds that belong to a particular group and compounds that do not. To this end, we have to indicate which group we intend to predict in each case. We followed the steps below, which are essentially the same as those given by Concu et al. [30, 31] for the QSAR study of six classes of enzymes:

1. We created a raw data set in which each compound entry is a vector made up of 1 output variable, 108 structural variables (inputs) and the Compound Assay Conditions query (CACq) variable. The inputs are divided into values (see the first term of equation 1), averages (see the second term of equation 1) and differences between values and averages (see the third term of equation 1). CACq is an auxiliary variable that is not used to construct the model. The first element (output) is a dummy (Boolean) variable called Observed Group (OG); OG = 0 if the compound belongs to the class referred to in CACq and OG = 1 otherwise. Each compound can appear more than once in the raw data. In fact, each compound can be repeated 38 times, corresponding to the 38 CACq assay conditions (see Table 1). The first time we used CACq = CAC number, that is, the real CAC class of the compound. In this case, the LDA model has to give the highest probability to the group OG = 0 because it has to predict the real class of the compound. The remaining 37 times we used a CAC class number different from the real one in CACq, and the LDA model then has to predict the highest probability for the group OG = 1, indicating that the compound does not belong to this group.
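The row layout described in this step can be sketched in Python as follows. This is only an illustration of the bookkeeping: the function `expand_cases`, the dictionary keys, and the example compound are hypothetical, and the sketch generates all 38 queries for every compound, which the text presents as a possibility rather than a requirement.

```python
def expand_cases(compounds, n_conditions=38):
    """Expand each compound into one row per assay-condition query (CACq).

    compounds : list of (name, descriptors, real_cac) tuples, where real_cac
                is the condition (1..n_conditions) under which the compound
                was actually reported.
    Returns a list of dict rows with OG = 0 when CACq is the real class and
    OG = 1 otherwise.
    """
    rows = []
    for name, descriptors, real_cac in compounds:
        for q in range(1, n_conditions + 1):
            rows.append({
                "compound": name,
                "CACq": q,                        # auxiliary, never a model input
                "OG": 0 if q == real_cac else 1,  # observed group (output)
                "theta": descriptors,             # structure-only values
            })
    return rows

# Hypothetical compound reported under assay condition 5.
rows = expand_cases([("cmpd_A", [0.12, 0.87], 5)])
print(len(rows), sum(r["OG"] for r in rows))
```

For a single compound the expansion gives 38 rows, one with OG = 0 and 37 with OG = 1, which matches the 1 + 37 split used in the text.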

Table 1. Compound Assay Conditions query (CACq).

Column headings: q | Param. | Enz. | Iso. | Target | Type | Class | Cond. | Obs.

[Table body only partly recoverable: the row numbers (q) and the column alignment were lost. Surviving entries, in source order:]

IC50 (μg/mL); MRS; bacterium
IC50 (μg/mL); S. aureus; bacterium
IC50 (μM); MRSA; bacterium; —
IC50 (μM); Hep2; bacterium; —
MIC (μg/mL); bacterium
IC50 (μg/mL); HVC
IC50 (μM); HVC
IC50 (μM); U937
IC50 (μM); HT29
GSK-3; enzyme
cKi; GSK-3; enzyme; 100
IC50 (nM); GSK-3; enzyme; 2000
IC50 (nM); GSK-3; enzyme; 2000
IC50 (nM); GSK-3; M. intracellulare; enzyme; 2000
IC50 (μM); GSK-3; enzyme
IC50 (μM); GSK-3; T. brucei; enzyme
IC50 (μM); GSK-3; P. falciparum; enzyme
IC50 (μM); GSK-3 α/β; M. intracellulare; enzyme
IC50 E-9 (M); GSK-3; L. donovani; enzyme
pIC50; GSK-3; enzyme
pIC50; GSK-3; enzyme
IC50 (μM); C. albicans; fungi
IC50 (μg/mL); C. neoformans; fungi

CE = Cell Efficacy glycogen synthase stimulation, HVC = Human Vero cells, W2 = chloroquine-resistant W2 clone, D6 = chloroquine-sensitive D6 clone, CL = Cellular Line, M. tuberculosis h = M. tuberculosis (H37Rv), nd = not determined, NA = not active, NC = not cytotoxic

The problem with this type of organization of the raw data is that the θk(G) values are global or local compound constants that depend only on structure. Consequently, if the LDA model is based only on these values, it will necessarily fail when we change the OG values. A related drawback arises if we intend to use the model for a real compound, since we would obtain only one unspecific prediction whereas we need 38 specific probabilities: 1 confirming the real class and 37 giving low probabilities for the other CACq. We can solve this problem by introducing variables characteristic of each CAC class referred to in CACq, but without giving information in the input about the real CAC class of the compound. To this end, we used the average value of each θk(G) for all compounds that belong to the same CAC class. We also calculated the deviation of each θk(G) from the average of the group indicated in CACq. Altogether, we then have 36 θk(G) values + 36 θk(G)avg average values for the CAC class + 36 θk(G)dif deviation values from the CAC class average = 108 input variables. It is important to understand that we never used CACq as an input, so the model only includes as inputs the θk(G) values for the compound entry and the averages and deviations of these values for the class given in CACq, which is not necessarily the real CAC class. The general formula for this class of LDA model is shown below, where S(CACq) is not a probability but a real-valued score that predicts the propensity of a compound to act as an inhibitor of a given class:

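$$
\begin{aligned}
S(\mathrm{CACq}) &= \sum_{k,G,D} b\cdot\theta_k(G) + \sum_{k,G,D} c\cdot\theta_k(G)_{avg} + \sum_{k,G,D} d\cdot\left(\theta_k(G) - \theta_k(G)_{avg}\right) + a_0 \\
&= \sum_{k,G,D} b\cdot\theta_k(G) + \sum_{k,G,D} c\cdot\theta_k(G)_{avg} + \sum_{k,G,D} d\cdot\theta_k(G)_{dif} + a_0
\end{aligned}
\tag{1}
$$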
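The three blocks of inputs in Equation 1 (values, class averages, and deviations) can be assembled as in the sketch below. It assumes NumPy arrays of 36 descriptor values per compound; the helper names `class_averages`, `build_inputs` and `score` are hypothetical, and the coefficients passed to `score` would come from a fitted LDA model, which is not reproduced here.

```python
import numpy as np

def class_averages(theta_matrix, cac_labels):
    """Average descriptor vector of the compounds assigned to each CAC class."""
    theta_matrix = np.asarray(theta_matrix, dtype=float)
    cac_labels = np.asarray(cac_labels)
    return {c: theta_matrix[cac_labels == c].mean(axis=0)
            for c in np.unique(cac_labels).tolist()}

def build_inputs(theta, class_avg):
    """Concatenate values, class averages and deviations (3 x 36 = 108 inputs)."""
    theta = np.asarray(theta, dtype=float)
    class_avg = np.asarray(class_avg, dtype=float)
    return np.concatenate([theta, class_avg, theta - class_avg])

def score(inputs, coefficients, a0):
    """Linear score of Equation 1: weighted sum of the inputs plus a0."""
    return float(np.dot(coefficients, inputs) + a0)

# Hypothetical data: three compounds, 36 descriptors each, two CAC classes.
thetas = [[1.0] * 36, [3.0] * 36, [10.0] * 36]
averages = class_averages(thetas, [1, 1, 2])
x = build_inputs(thetas[0], averages[1])   # query compound 0 against class 1
print(x.shape)
```

Querying the same compound against another class only changes the average and deviation blocks, which is what allows a single structure to receive a different score for each CACq.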

LDA forward stepwise analysis was carried out for variable selection to build the models [29]. All the variables included in the model were standardized in order to bring them onto the same scale. Subsequently, a standardized linear discriminant equation that allows comparison of the coefficients was obtained [32]. The square of the canonical regression coefficient (Rc) and Wilks' statistic (U) were examined in order to assess the discriminatory power of the model (0 < U < 1, with U = 0 indicating perfect discrimination); the separation of the two groups of compounds was statistically verified by the Fisher ratio (F) test with an error level p < 0.05.

Data set. The data set consisted of marketed and/or reported drug/receptor pairs for which the affinity or non-affinity of the drugs for the receptors was established taking into consideration the IC50, Ki, pKi, etc. values. We collected 1012 cases of active compounds in different CACq. In addition, we used a negative control series of 2536 cases of non-active results for compounds evaluated at different CACq. The two data sets used were a training series with 249 active + 638 non-active cases (887 in total) and a validation series with 763 + 1898 = 2661 cases in total. Owing to space constraints, the names or codes of all compounds, as well as the references consulted to compile the data, are given in the Supporting Information. The active series is composed at random of the most representative families of GSK-3 inhibitors taken from the literature (supplementary material). The remaining compounds were a heterogeneous series of inactive compounds, including members of the aforementioned families and compounds included in the Merck Index [33].
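For readers who want to see how Wilks' statistic relates to the F ratio in the two-group case, the following Python sketch computes both from raw data. It uses the standard textbook definitions (ratio of within-group to total scatter determinants, and the exact F transformation for two groups); it is not the STATISTICA procedure used by the authors, and the function names are hypothetical.

```python
import numpy as np

def wilks_lambda(X, groups):
    """Wilks' statistic U = det(W) / det(T) for the columns of X.

    W is the pooled within-group scatter matrix and T the total scatter
    matrix; U close to 0 means well-separated groups, U close to 1 means
    no separation.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    groups = np.asarray(groups)
    centered = X - X.mean(axis=0)
    T = centered.T @ centered
    W = np.zeros_like(T)
    for g in np.unique(groups):
        Xg = X[groups == g]
        cg = Xg - Xg.mean(axis=0)
        W += cg.T @ cg
    return float(np.linalg.det(W) / np.linalg.det(T))

def f_two_groups(u, n_cases, n_vars):
    """Exact F for two groups, with (n_vars, n_cases - n_vars - 1) degrees of freedom."""
    return (1.0 - u) / u * (n_cases - n_vars - 1) / n_vars

# Toy one-variable example with two clearly separated groups.
print(wilks_lambda([[0.0], [0.1], [5.0], [5.1]], [0, 0, 1, 1]))
```

As a rough consistency check, and assuming that the statistics quoted with Equation 2 refer to the 887 training cases and the 5 selected variables, `f_two_groups(0.54, 887, 5)` gives about 150, close to the reported F = 147.8; the small gap is compatible with rounding of the Wilks value, although the source does not state how F was computed.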

Results and discussion

General QSAR for GSK-3 inhibitors. The development of a discriminant function [34] that allows the classification of organic compounds as active or non-active is the key step in the present approach for the discovery of GSK-3 inhibitors. It was therefore necessary to select a training data set of GSK-3 inhibitors containing wide structural variability. The assay conditions used to characterize the compounds were gathered from the literature and are indicated in the supplementary material. Discriminant techniques were selected instead of regression techniques because of the lack of homogeneity in the conditions under which these values were measured. As reported in different sources, numerous IC50 values are given as a range rather than a single value. In other cases, the activity is not scored in terms of IC50 values but is quoted as an inhibitory percentage at a given concentration. Once the training series had been designed, forward stepwise Linear Discriminant Analysis (LDA) was carried out in order to derive the QSAR; the full equation and the compact notation of the model are:

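$$
\begin{aligned}
s(\mathrm{CACq}) ={}& -1.75\cdot\theta_2(C_{inst})_{avg} + 8.39\cdot\theta_0(X)_{avg} + 2.33\cdot\theta_5(Het)_{avg} \\
&+ 1.66\cdot\left(\theta_0(Total) - \theta_0(Total)_{avg}\right) + 0.49\cdot\left(\theta_1(C_{inst}) - \theta_1(C_{inst})_{avg}\right) + 0.62
\end{aligned}
\tag{2a}
$$

$$\lambda = 0.54 \qquad F = 147.8 \qquad p < 0.001$$

$$
\begin{aligned}
s(\mathrm{CACq}) ={}& -1.75\cdot\theta_2(C_{inst})_{avg} + 8.39\cdot\theta_0(X)_{avg} + 2.33\cdot\theta_5(Het)_{avg} \\
&+ 1.66\cdot\theta_0(Total)_{dif} + 0.49\cdot\theta_1(C_{inst})_{dif} + 0.62
\end{aligned}
\tag{2b}
$$

$$\lambda = 0.54 \qquad F = 147.8 \qquad p < 0.001$$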
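The compact form of the model can be evaluated directly once the five descriptors are available. The sketch below only transcribes the coefficients of Equation 2b; the function and argument names are hypothetical, the inputs are presumably the standardized variables mentioned in the methods, and the threshold or probability rule that turns the score into a class assignment is not given in the text and is therefore not implemented.

```python
def s_cacq(theta2_cinst_avg, theta0_x_avg, theta5_het_avg,
           theta0_total_dif, theta1_cinst_dif):
    """Score of Equation 2b for one compound and one assay-condition query.

    The *_avg arguments are class averages for the queried CACq and the
    *_dif arguments are deviations of the compound from those averages.
    """
    return (-1.75 * theta2_cinst_avg
            + 8.39 * theta0_x_avg
            + 2.33 * theta5_het_avg
            + 1.66 * theta0_total_dif
            + 0.49 * theta1_cinst_dif
            + 0.62)

# With every descriptor equal to zero only the constant term remains.
print(s_cacq(0.0, 0.0, 0.0, 0.0, 0.0))
```

Because three of the five terms are class averages, compounds queried against the same CACq share that part of the score and differ only through the two deviation terms.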

The statistical significance of this model was determined by examining Wilks' λ statistic, the Fisher ratio (F), and the p-level (p). This equation confirms our intuitive hypothesis, and we can conclude that the deviation of the parameters of one compound from the average values for active compounds tested in a given assay condition (CACq) is very important for predicting this compound as active. In any case, the present model is of more general application than other known methods, which apply only to compounds tested in a single CAC and/or belonging to a single homogeneous structural class of compounds. This statement is supported by the fact that the present classification function gave an efficient separation of all compounds, with Accuracy = 84.0% (training series) and Accuracy = 84.4% (validation series); see Table 2 for details. The names, observed classification, predicted classification and subsequent probabilities for all 3548 cases in the training and validation series are given as supplementary material. This level of total Accuracy, Sensitivity and Specificity is considered excellent by other researchers who have used LDA for QSAR studies; see for instance the works of Garcia-Domenech, R.; Prado-Prado, F. J.; Marrero-Ponce, Y., etc. [35-50].

Table 2. Training and validation results.

| Series | Group | Parameter | % | GSK-3 inhibitors | Non-active |
|---|---|---|---|---|---|
| Training | GSK-3 inhibitor | Sensitivity | 96.4 | 240 | |
| Training | Non-active | Specificity | 79.2 | 133 | 505 |
| Training | Total | Accuracy | 84.0 | | |
| Validation | GSK-3 inhibitor | Sensitivity | 96.3 | 735 | |
| Validation | Non-active | Specificity | 79.6 | 387 | 1511 |
| Validation | Total | Accuracy | 84.4 | | |
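The percentages in Table 2 follow from the classification counts and the series sizes given in the data set description. The short Python check below recomputes them; the number of misclassified actives in each series (9 and 28) is not printed in the table and is derived here from the stated totals (249 and 763 active cases), and the function name `rates` is hypothetical.

```python
def rates(tp, fn, fp, tn):
    """Return (sensitivity, specificity, accuracy) as percentages."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, specificity, accuracy

# Training: 249 actives (240 recovered), 638 non-actives (505 recovered).
training = rates(tp=240, fn=249 - 240, fp=133, tn=505)
# Validation: 763 actives (735 recovered), 1898 non-actives (1511 recovered).
validation = rates(tp=735, fn=763 - 735, fp=387, tn=1511)

print([round(v, 1) for v in training])    # [96.4, 79.2, 84.0]
print([round(v, 1) for v in validation])  # [96.3, 79.6, 84.4]
```

The recomputed values agree with the table, which also shows that the model errs mainly on the side of false positives: about one non-active case in five is predicted as active.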

Conclusions

In this work we have shown that the MARCH-INSIDE methodology can be considered a good alternative for developing GSK-3 inhibitors in a fast and efficient way. This approach is able to correctly classify the GSK-3 inhibitory activity of compounds with different structural patterns.

Acknowledgment

We are grateful to the Xunta de Galicia (INCITE08PXIB314255PR) for partial financial support. González-Díaz, H. acknowledges financial support from Program Isidro Parga Pondal, Xunta de Galicia.

References

1. Meijer L, Flajolet M, Greengard P. Pharmacological inhibitors of glycogen synthase kinase 3. Trends in Pharmacological Sciences. 2004 Sep;25(9):471-80.
2. Fairlamb AH. Chemotherapy on human African trypanosomiasis: current and future prospects. Trends Parasitol. 2003;19:488-94.
3. Copeland RA, Pompliano DL, Meek TD. Drug-target residence time and its implications for lead optimization. Nat Rev Drug Discov. 2006;5:730-2.
4. Liao JJ. Molecular recognition of protein kinase binding pockets for design of potent and selective kinase inhibitors. J Med Chem. 2007;50:409-24.
5. Pink R, Hudson A, Mouries MA, Bending M. Opportunities and challenges in antiparasitic drug discovery. Nat Rev Drug Discov. 2005;4:727-40.
6. Plyte SE, Hughes K, Nikolakaki E, Pulverer BJ, Woodgett JR. Glycogen Synthase Kinase-3: functions in oncogenesis and development. Biochim Biophys Acta. 1992;1114:147-62.
7. Dajani R, Fraser E, Roe SM, Young N, Good V, Dale TC, et al. Crystal structure of glycogen synthase kinase-3 beta: structural basis for phosphate-primed substrate specificity and autoinhibition. Cell (Cambridge, Mass). 2001;105:721-32.
8. Ojo KK, Gillespie RG, Riechers A, Napuli AJ, Verlinde CL, Buckner FS, et al. Glycogen Synthase Kinase 3 is a potential drug target for African trypanosomiasis therapy. Antimicrob Agents and Chemother. 2008 Oct;52(10):3710-7.
9. Olson RE. Secretase inhibitors as therapeutics for Alzheimer's disease. Annu Rep Med Chem. 2000;35:31-40.
10. Woodgett JR. Molecular cloning and expression of glycogen synthase kinase-3/factor A. EMBO J. 1990 Aug;9(8):2431-8.
11. Doucheau E. Plasmodium falciparum glycogen synthase kinase-3: molecular model, expression, intracellular localisation and selective inhibitors. Biochim Biophys Acta. 2004;1697:181-96.
12. Ali A, Hoeflich KP, Woodgett JR. Glycogen Synthase Kinase-3: Properties, Functions, and Regulation. Chem Rev. 2001;101:2527-40.
13. Woodgett JR. cDNA cloning and properties of glycogen synthase kinase-3. Methods Enzymol. 1991;200:564-77.
14. Grimes CA, Jope RS. The Multifaceted roles of glycogen synthase kinase 3B in cellular signaling. Prog Neurobiol. 2001;65:391-426.
15. Nadri C, Lipska B, Kozlovsky N, Weinberger DR, Belmaker RH, Agam G. Dev Brain Res. 2003;141(1,2):33-7.
16. Gould TD, Zarate CA, Manji HK. Glycogen Synthase Kinase-3: A Target for Novel Bipolar Disorder Treatments. 2004;65(1):10-21.
17. Ishiguro K, Ihara Y, Uchida T, Imahori K. A Novel Tubulin-Dependent Protein Kinase Forming a Paired Helical Filament Epitope on Tau. J Bio Chem. 1988;104(3):319-21.
18. Nunez MB, Maguna FP, Okulik NB, Castro EA. QSAR modeling of the MAO inhibitory activity of xanthones derivatives. Bioorg Med Chem Lett. 2004 Nov 15;14(22):5611-7.
19. Todeschini R, Consonni V. Handbook of Molecular Descriptors. Wiley VCH. 2000.
20. Freund JA, Poschel T. Stochastic processes in physics, chemistry, and biology. Lect Notes Phys. Berlin, Germany: Springer-Verlag 2000.
21. Stahura FL, Godden JW, Bajorath J. Differential Shannon entropy analysis identifies molecular property descriptors that predict aqueous solubility of synthetic compounds with high accuracy in binary QSAR calculations. J Chem Inf Comput Sci. 2002 May-Jun;42(3):550-8.
22. Stahura FL, Godden JW, Xue L, Bajorath J. Distinguishing between natural products and synthetic molecules by descriptor Shannon entropy analysis and binary QSAR calculations. J Chem Inf Comput Sci. 2000 Sep-Oct;40(5):1245-52.
23. González-Díaz H, Marrero Y, Hernandez I, Bastida I, Tenorio E, Nasco O, et al. 3D-MEDNEs: an alternative "in silico" technique for chemical research in toxicology. 1. prediction of chemically induced agranulocytosis. Chem Res Toxicol. 2003 Oct;16(10):1318-27.
24. González-Díaz H, Aguero G, Cabrera MA, Molina R, Santana L, Uriarte E, et al. Unified Markov thermodynamics based on stochastic forms to classify drugs considering molecular structure, partition system, and biological species: distribution of the antimicrobial G1 on rat tissues. Bioorg Med Chem Lett. 2005 Feb 1;15(3):551-7.
25. González-Díaz H, Prado-Prado F, Ubeira FM. Predicting antimicrobial drugs and targets with the MARCH-INSIDE approach. Curr Top Med Chem. 2008;8(18):1676-90.
26. González-Díaz H, González-Díaz Y, Santana L, Ubeira FM, Uriarte E. Proteomics, networks and connectivity indices. Proteomics. 2008;8:750-78.
27. González-Díaz H, Vilar S, Santana L, Uriarte E. Medicinal Chemistry and Bioinformatics – Current Trends in Drugs Discovery with Networks Topological Indices. Curr Top Med Chem. 2007;7(10):1025-39.
28. Santana L, Gonzalez-Diaz H, Quezada E, Uriarte E, Yanez M, Vina D, et al. Quantitative structure-activity relationship and complex network approach to monoamine oxidase A and B inhibitors. J Med Chem. 2008 Nov 13;51(21):6740-51.
29. Santana L, Uriarte E, González-Díaz H, Zagotto G, Soto-Otero R, Mendez-Alvarez E. A QSAR model for in silico screening of MAO-A inhibitors. Prediction, synthesis, and biological assay of novel coumarins. J Med Chem. 2006 Feb 9;49(3):1149-56.
30. Concu R, Dea-Ayuela MA, Perez-Montoto LG, Prado-Prado FJ, Uriarte E, Bolas-Fernandez F, et al. 3D entropy and moments prediction of enzyme classes and experimental-theoretic study of peptide fingerprints in Leishmania parasites. Biochim Biophys Acta. 2009 Aug 28;1794(12):1784-94.
31. Concu R, Dea-Ayuela MA, Perez-Montoto LG, Bolas-Fernandez F, Prado-Prado FJ, Podda G, et al. Prediction of Enzyme Classes from 3D Structure: A General Model and Examples of Experimental-Theoretic Scoring of Peptide Mass Fingerprints of Leishmania Proteins. Journal of proteome research. 2009 Sep 4;8(9):4372-82.
32. Kutner MH, Nachtsheim CJ, Neter J, Li W. Standardized Multiple Regression Model. Applied Linear Statistical Models. Fifth ed. New York: McGraw Hill 2005:271-7.
33. Hall Ca. The Merck Index, twelfth ed. 1996.
34. Van Waterbeemd H. Discriminant Analysis for Activity Prediction. In: Van Waterbeemd H, ed. Chemometric methods in molecular design. New York: Wiley-VCH 1995:265-82.
35. Calabuig C, Anton-Fos GM, Galvez J, Garcia-Domenech R. New hypoglycaemic agents selected by molecular topology. Int J Pharm. 2004 Jun 18;278(1):111-8.
36. Garcia-Garcia A, Galvez J, de Julian-Ortiz JV, Garcia-Domenech R, Munoz C, Guna R, et al. New agents active against Mycobacterium avium complex selected by molecular topology: a virtual screening method. J Antimicrob Chemother. 2004 Jan;53(1):65-73.
37. Prado-Prado FJ, Ubeira FM, Borges F, Gonzalez-Diaz H. Unified QSAR & network-based computational chemistry approach to antimicrobials. II. Multiple distance and triadic census analysis of antiparasitic drugs complex networks. J Comput Chem. 2009 May 6.
38. Prado-Prado FJ, Martinez de la Vega O, Uriarte E, Ubeira FM, Chou KC, Gonzalez-Diaz H. Unified QSAR approach to antimicrobials. 4. Multi-target QSAR modeling and comparative multi-distance study of the giant components of antiviral drug-drug complex networks. Bioorg Med Chem. 2009 Jan 15;17(2):569-75.
39. Prado-Prado FJ, de la Vega OM, Uriarte E, Ubeira FM, Chou KC, Gonzalez-Diaz H. Unified QSAR approach to antimicrobials. 4. Multi-target QSAR modeling and comparative multi-distance study of the giant components of antiviral drug-drug complex networks. Bioorg Med Chem. 2009;17:569-75.
40. Prado-Prado FJ, Borges F, Perez-Montoto LG, Gonzalez-Diaz H. Multi-target spectral moment: QSAR for antifungal drugs vs. different fungi species. Eur J Med Chem. 2009 May 5;44(10):4051-6.
41. Prado-Prado FJ, Gonzalez-Diaz H, de la Vega OM, Ubeira FM, Chou KC. Unified QSAR approach to antimicrobials. Part 3: first multi-tasking QSAR model for input-coded prediction, structural back-projection, and complex networks clustering of antiprotozoal compounds. Bioorg Med Chem. 2008 Jun 1;16(11):5871-80.
42. Prado-Prado FJ, Gonzalez-Diaz H, Santana L, Uriarte E. Unified QSAR approach to antimicrobials. Part 2: predicting activity against more than 90 different species in order to halt antibacterial resistance. Bioorg Med Chem. 2007 Jan 15;15(2):897-902.
43. Marrero-Ponce Y, Khan MT, Casanola Martin GM, Ather A, Sultankhodzhaev MN, Torrens F, et al. Prediction of Tyrosinase Inhibition Activity Using Atom-Based Bilinear Indices. ChemMedChem. 2007 Apr 16;2(4):449-78.
44. Marrero-Ponce Y, Meneses-Marcel A, Castillo-Garit JA, Machado-Tugores Y, Escario JA, Barrio AG, et al. Predicting antitrichomonal activity: a computational screening using atom-based bilinear indices and experimental proofs. Bioorg Med Chem. 2006 Oct 1;14(19):6502-24.
45. Meneses-Marcel A, Marrero-Ponce Y, Machado-Tugores Y, Montero-Torres A, Pereira DM, Escario JA, et al. A linear discrimination analysis based virtual screening of trichomonacidal lead-like compounds: outcomes of in silico studies supported by experimental results. Bioorg Med Chem Lett. 2005 Sep 1;15(17):3838-43.
Marrero-Ponce Y, Diaz HG, Zaldivar VR, Torrens F, Castro EA. 3D-chiral quadratic indices of the 'molecular pseudograph's atom adjacency matrix' and their application to central chirality codification: classification of ACE inhibitors and prediction of sigma-receptor antagonist activities. Bioorg Med Chem. 2004 Oct 15;12(20):5331-42. Murcia-Soler M, Perez-Gimenez F, Garcia-March FJ, Salabert-Salvador MT, Diaz-Villanueva W, Medina-Casamayor P. Discrimination and selection of new potential antibacterial compounds using simple topological descriptors. J Mol Graph Model. 2003 Mar;21(5):375-90. Cercos-del-Pozo RA, Perez-Gimenez F, Salabert-Salvador MT, Garcia-March FJ. Discrimination and molecular design of new theoretical hypolipaemic agents using the molecular connectivity functions. J Chem Inf Comput Sci. 2000 Jan;40(1):178-84. Estrada E, Vilar S, Uriarte E, Gutierrez Y. In silico studies toward the discovery of new anti-HIV nucleoside compounds with the use of TOPS-MODE and

Multi-target QSAR GSK-3 inhibitors

2D/3D connectivity indices. 1. Pyrimidyl derivatives. J Chem Inf Comput Sci. 2002 Sep-Oct;42(5):1194-203. 50. Cronin MT, Aptula AO, Dearden JC, Duffy JC, Netzeva TI, Patel H, et al. Structure-based classification of antibacterial activity. J Chem Inf Comput Sci. 2002 Jul-Aug;42(4):869-78.

Transworld Research Network 37/661 (2), Fort P.O. Trivandrum-695 023 Kerala, India

Complex Network Entropy: From Molecules to Biology, Parasitology, Technology, Social, Legal, and Neurosciences, 2011: 31-40 ISBN: 978-81-7895-507-0 Editors: Humberto González-Díaz, Francisco J. Prado-Prado and Xerardo García-Mera

3. Predicting drug-parasite networks based on Markov Entropy indices of drug structure

Francisco Prado-Prado (1), Xerardo García-Mera (1), Franco Fernandez (1) and Humberto González-Díaz (2)

Departments of (1) Organic Chemistry and (2) Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela, Santiago de Compostela, 15782, Spain

Abstract. In principle, one can predict which nodes are connected in a Complex Network (CN) using Computational Chemistry methods based on Topological Indices (TIs) that numerically describe the structure of the nodes of the system. This means that TIs calculated for objects at a lower level can be used to predict the networks formed by the same objects at higher levels. However, we did not find previous work investigating how efficiently TIs solve this problem at different levels. In this work, we present the first study of TIs in this sense, focusing on, but not limited to, entropy-type indices. We tested TIs in two experiments carried out at two structural levels: 1) drug-target and 2) parasite-host interactions. In Experiment 1 we constructed molecular graphs for many anti-parasitic and non-active drugs (lower-level network), where atoms are nodes and chemical bonds are edges. Next, we calculated TIs for all molecules. We then used these indices to seek a model and predict which drugs are connected in a drug-target CN because they have a common target parasite (higher-level network). Similarly, in Experiment 2 we built graphs for the secondary structure of RNAs used as phylogenetic biomarkers of both parasites and hosts. In these graphs (lower-level network) the nodes represent nucleotides and the edges represent nucleotide or hydrogen bonds. We also calculated the TIs for the secondary structure of the RNAs. Finally, all these indices can be used to seek a model and predict which parasites and hosts should be connected in a parasite-host CN because they have a common target (higher-level network). The study opens new trends in the applications of graph theory in biology in general, with special emphasis on Parasitology.

Correspondence/Reprint request: Dr. Humberto González-Díaz, Department of Microbiology & Parasitology and Department of Organic Chemistry, Faculty of Pharmacy, University of Santiago de Compostela, 15782 Spain. E-mail: gonzalezdiazh@yahoo.es

Introduction

Complex networks are present everywhere (or at least objects that, as a first approach, we can see as networks). Drug-target interactions, disease-genome correspondences, whole-cell regulation processes, metabolic reactions, protein-protein interactions, sexual relationships, disease transmission, internet communications, transport, electric power systems, politics, crime, legislative action, scientific collaboration, and many other matters can all be seen as networks [1-7]. On larger scales one finds networks of cells, as in neural networks, up to the scale of organisms in ecological food webs [8]. The elucidation of structural and functional relationships in these and other chemical, biological, technological, and social networks generates the need for a meaningful ranking of networks with numerical indices, often known as Topological Indices (TIs). TIs are numerical indices that describe the connectedness or connectivity between all nodes in a network and are very useful in Computational Chemistry and other sciences for studying global rather than local network properties [9-12]. In addition, TIs based on Markov Chain Models (MCMs) are a very powerful Computational Chemistry tool for describing interesting phenomena in complex systems. We have introduced the MCM method called MARCH-INSIDE (MARkov CHain Invariants for Network SImulation & DEsign) and applied it to different Computational Chemistry problems [13-20]. With this method, we can calculate different types of TIs for systems represented by means of graphs or complex networks. In particular, TIs based on Markov Entropy values, symbolized by kθ, are among the most successful of the MARCH-INSIDE indices, with applications to systems ranging from small molecules to proteins. These indices are entropy measures related to all nodes or states in a system (network) separated from each other by at least k steps in the full graph [21-25].

In these models, we put special emphasis on entropy measures because many authors have demonstrated that they are very useful tools for Computational Chemistry; see, for instance, the works of Graham, which use entropy to codify the information content of organic molecules and other systems [26-31].

Entropy Prediction of drug-parasite networks

In a recent review, we discussed the application of MARCH-INSIDE indices, including entropies, to the study of antimicrobial drugs, targets, and drug-drug networks. We put emphasis on works focused on antibacterial, antiviral, and antifungal drugs as well as recently reported applications to anti-parasitic drug-drug networks [32]. In all these works, we have noted that one can predict which nodes are connected in a CN using TIs derived from a graph representation of the nodes of the system. However, we did not find previous work investigating how efficiently TIs solve this problem at different levels. In this work, we present the first study of TIs in this sense, focused on, but not limited to, MARCH-INSIDE entropy-type indices. We tested TIs in two experiments carried out at two structural levels: 1) drug-target and 2) parasite-host interactions. This means that TIs calculated for objects at a lower level can be used to predict the networks formed by the same objects at higher levels. For instance, in Experiment 1 we constructed molecular graphs for many anti-parasitic and non-active drugs (lower-level network), where atoms are nodes and chemical bonds are edges. Next, we calculated the kθ values for some groups of atoms or for whole molecules. We then used these indices to seek a QSAR model and predict which drugs are connected in a drug-target CN because they have a common target parasite (higher-level network) [32]. Similarly, in Experiment 2 we built graphs for the secondary structure of RNAs used as phylogenetic biomarkers of both parasites and hosts. In these graphs (lower-level network) the nodes represent nucleotides and the edges represent nucleotide or hydrogen bonds. The reader may consult the works of Marrero-Ponce et al. [33-35], Galindo and Bermúdez et al. [36,37], and Shu and Bo et al. [38], which describe different TIs and graph representations of RNA secondary structure.

We also calculated the kθ values of these nucleotides and the TIs for the secondary structure of the RNAs. Finally, all these indices can be used to seek a QSPR model and predict which parasites and hosts should be connected in a parasite-host CN because they have a common target (higher-level network). The study is intended to open new trends in the applications of graph theory in biology in general, with special emphasis on Parasitology.

Materials and methods

Using the Chapman-Kolmogorov equations, we can calculate multi-target θ values for the atoms (nodes) of molecular graphs. As mentioned above, multi-target here means that we obtain different ${}^{k}\theta$ values for the same atom in the same molecule when the molecular target (bacterium, virus, parasite, receptor, enzyme, etc.) changes. First, we have to calculate the absolute probabilities ${}^{s}p_k(j)$ with which the different j-th atoms interact with the specific target after k steps. Here the targets are only different microbial species (s). These values can be determined as the elements of the vectors ${}^{k}\pi(s)$. These vectors are elements of a Markov chain based on the stochastic matrix ${}^{1}\Pi$, which describes the probabilities of interaction ${}^{s}p_1(i,j)$ of the j-th atom given that another, i-th, atom has previously interacted with the target.

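$$
{}^{k}\pi(s) = {}^{0}\pi(s) \cdot \left[{}^{1}\Pi(s)\right]^{k}
= \left[{}^{s}p_0(1),\ {}^{s}p_0(2),\ {}^{s}p_0(3),\ \ldots,\ {}^{s}p_0(n)\right] \cdot
\begin{bmatrix}
{}^{s}p_1(1,1) & {}^{s}p_1(1,2) & \cdots & {}^{s}p_1(1,n) \\
{}^{s}p_1(2,1) & \cdot & \cdots & \cdot \\
\vdots & \vdots & \ddots & \vdots \\
{}^{s}p_1(n,1) & \cdot & \cdots & {}^{s}p_1(n,n)
\end{bmatrix}^{k}
\qquad (1)
$$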
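As an illustration of Eq. 1, the following minimal sketch propagates a vector of initial absolute probabilities through a one-step stochastic matrix by repeated multiplication, which is the Chapman-Kolmogorov step assumed here. The function name, the three-atom vector, and the matrix values are hypothetical and were chosen only so that every row sums to one; they do not come from any drug in the data set.

```python
import numpy as np

def absolute_probabilities(pi0, Pi1, k_max=5):
    """Return the list [0π, 1π, ..., k_maxπ], where kπ = 0π · (1Π)^k."""
    pi0 = np.asarray(pi0, dtype=float)
    Pi1 = np.asarray(Pi1, dtype=float)
    vectors = [pi0]
    for _ in range(k_max):
        # One Chapman-Kolmogorov step: (k+1)π = kπ · 1Π
        vectors.append(vectors[-1] @ Pi1)
    return vectors

# Hypothetical 3-atom system (illustrative values only).
pi0 = [0.5, 0.3, 0.2]
Pi1 = [[0.6, 0.4, 0.0],
       [0.3, 0.4, 0.3],
       [0.0, 0.5, 0.5]]

vectors = absolute_probabilities(pi0, Pi1)
```

Because the matrix is row-stochastic, each of the six vectors (k = 0 to 5) remains a probability distribution over the atoms.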

The specificity for one target is introduced through target-specific weights in the definition of the elements of the matrix ${}^{1}\Pi$. The theoretical foundations of the method have been given in previous works, so we do not detail them here but refer the reader to those works [39,40]. After that, the entropy TI is very easy to calculate by applying Shannon's formula to each element ${}^{s}p_k(j)$ of the vectors ${}^{k}\pi(s)$. As in Experiment 1, we can sum the ${}^{k}\theta(j)$ values over specific atom sets (AS), that is, groups of nodes, to create local molecular descriptors for the drug-target interaction. Herein the AS used were: halogens (X), unsaturated carbons (Cins), saturated carbons (Csat), heteroatoms (Het), and hydrogens bound to heteroatoms (H-Het). The corresponding symbols of the local entropy TIs for these AS are ${}^{k}\theta(X)$, ${}^{k}\theta(C_{ins})$, ${}^{k}\theta(C_{sat})$, ${}^{k}\theta(Het)$, and ${}^{k}\theta(H\text{-}Het)$, with ${}^{k}\theta(T)$ or ${}^{k}\theta$ for the whole molecule. In this study, we calculated the first six classes of entropy TI (k = 0 to 5) for the 5 AS, in total 6·5 = 30 local molecular TIs for each drug [40]. The formulas for the initial absolute probabilities, the transition probabilities (elements of the matrix), and the atom-set entropy TIs are given below, where ${}^{s}w_j$ is the target-specific weight of the j-th atom and n is the number of atoms.

$$
{}^{s}p_0(j) = \frac{{}^{s}w_j}{\sum_{k=1}^{n} {}^{s}w_k} \qquad (2)
$$

$$
{}^{s}p_1(i,j) = \frac{\delta_{ij} \cdot {}^{s}w_j}{\sum_{k=1}^{n} \delta_{ik} \cdot {}^{s}w_k} \qquad (3)
$$

$$
{}^{k}\theta(AS) = \sum_{j \in AS} {}^{k}\theta(j) = -\sum_{j \in AS} {}^{s}p_k(j)\,\log\left[{}^{s}p_k(j)\right] \qquad (4)
$$
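The calculation in Eqs. 2 to 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the initial probabilities are taken as weights normalized over all atoms, δ is taken to be 1 when two atoms are bonded (or identical) and 0 otherwise, the natural logarithm is used, and the weights, the three-atom graph, and the function name are hypothetical rather than MARCH-INSIDE parameters.

```python
import numpy as np

def markov_entropy_by_atom_set(weights, delta, atom_sets, k_max=5):
    """Return {(k, set_name): entropy} for k = 0..k_max (sketch of Eqs. 2-4)."""
    w = np.asarray(weights, dtype=float)
    delta = np.asarray(delta, dtype=float)

    p0 = w / w.sum()                                   # Eq. 2 (assumed form)
    unnorm = delta * w                                 # delta_ij * w_j
    Pi1 = unnorm / unnorm.sum(axis=1, keepdims=True)   # Eq. 3, rows sum to 1

    theta = {}
    pk = p0
    for k in range(k_max + 1):
        # Shannon term -p log p per atom, with 0 log 0 taken as 0
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = np.where(pk > 0, -pk * np.log(pk), 0.0)
        for name, atoms in atom_sets.items():
            theta[(k, name)] = float(terms[list(atoms)].sum())   # Eq. 4
        pk = pk @ Pi1                                  # advance one step
    return theta

# Hypothetical C-C-O fragment: weights and atom sets are illustrative only.
weights = [2.5, 2.5, 3.5]
delta = [[1, 1, 0],
         [1, 1, 1],
         [0, 1, 1]]
atom_sets = {"Csat": [0, 1], "Het": [2]}

theta = markov_entropy_by_atom_set(weights, delta, atom_sets)
```

Keying the result by (k, atom set) mirrors the kθ(AS) notation, so that, for example, `theta[(0, "Het")]` plays the role of 0θ(Het).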


Results and discussion

Multi-target entropy QSAR for anti-parasitic drugs

With the mt-QSAR generalization of QSAR models (including but not limited to MARCH-INSIDE) we may select pairs of anti-parasitic drugs with similar/dissimilar predicted multi-species activities and represent them as a Complex Network (CN). We call this type of CN the drug-drug multi-species CN (msCN). The network used to represent the molecular structure of a drug (the molecular graph) should not be confused with the network of drug-target interactions. The first refers to only one molecule (the nodes are atoms and the edges are chemical bonds). Conversely, the second refers to many drugs and targets, so each node is a drug, target, or drug-target pair and the edges express relationships between pairs of drugs and/or targets [41]. In fact, we can use the first network (molecular graph), or the TIs of this network, as inputs to predict which pairs of nodes (drugs or targets) are connected in the second or output network (here the msCN) [42]. The msCN is useful, for instance, to identify drugs with a similar mechanism of action or similar activity against many different species. At the same time, we may invert the procedure and select pairs of parasite species with similar/dissimilar drug sensitivity to construct a parasite-parasite multi-drug resistance CN (mdrCN) [17,43-46]. The mdrCNs may be used to identify parasite species sensitive to the same drugs in order to select parasites with specific resistance to drugs. However, we have not previously reported an mt-QSAR study based on kθ values. Consequently, in the next two sections we give the theoretical basis and discuss the results obtained when, for the first time, kθ values are extended to calculate TIs useful for mt-QSAR across different species (s). One of the main advantages of the present approach is that the generalized parameters kθ fit larger and more complex databases than the previous ones.

Specifically, this work introduces for the first time a single linear mt-QSAR equation to classify anti-parasitic drugs as active or non-active against different species. The data set used here contains marketed and/or very recently reported anti-parasitic drugs. It includes 4 442 Drug-Parasite Pairs (DPPs) for different drugs experimentally tested against some species from a list of more than 16 parasites. Not all drugs were tested in the literature against all listed parasite species. The names or codes, the observed and predicted activity for all compounds, and the references used appear in a supplementary material file. The best mt-QSAR model found was:


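$$
S(DPP) = -2.60 \cdot {}^{5}\theta(C_{sat}) - 4.77 \cdot {}^{0}\theta(C_{ins}) + 9.12 \cdot {}^{1}\theta(C_{ins}) + 6.83 \cdot {}^{4}\theta(C_{ins}) - 2.81 \cdot {}^{0}\theta(H\text{-}Het) + 2.48 \cdot {}^{2}\theta(H\text{-}Het) + 5.38 \qquad (5)
$$

$$
n = 1468 \qquad U = 0.4691 \qquad p < 0.001
$$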
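Equation 5 is a plain linear combination, so scoring a drug-parasite pair reduces to a dot product once the six entropy descriptors are available. The sketch below copies the coefficients and intercept from Eq. 5; the dictionary layout keyed by (k, atom set) and the function name are our own illustrative choices, and no classification cutoff is applied because the text does not state one.

```python
# Coefficients of Eq. 5, keyed by (k, atom set); the key layout is an
# illustrative convention, not part of the original model description.
COEFFS = {
    (5, "Csat"): -2.60,
    (0, "Cins"): -4.77,
    (1, "Cins"): 9.12,
    (4, "Cins"): 6.83,
    (0, "H-Het"): -2.81,
    (2, "H-Het"): 2.48,
}
INTERCEPT = 5.38

def dpp_score(theta):
    """Return S(DPP) for a mapping {(k, atom_set): entropy value}."""
    return INTERCEPT + sum(coef * theta[key] for key, coef in COEFFS.items())

# Hypothetical descriptor values, used only to exercise the function.
zeros = {key: 0.0 for key in COEFFS}
ones = {key: 1.0 for key in COEFFS}
score_zeros = dpp_score(zeros)   # reduces to the intercept
score_ones = dpp_score(ones)     # intercept plus the sum of all coefficients
```

With all descriptors at zero the score equals the intercept, which gives a quick sanity check that the coefficients were transcribed in the right order.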

In the model, S(DPP) is a real-valued output variable that scores the propensity of a drug to be effective against a parasite species, forming a DPP. The coefficient U is Wilks' statistic for the overall discrimination, F is the Fisher ratio, and p is the error level. In this equation, the kθ(j) values were summed over the totality (T) of the atoms in the molecule or over specific atom sets (AS), as described above. These sets are atoms with a common characteristic, for instance saturated carbon atoms (Csat) or hydrogen atoms bonded to a heteroatom (H-Het). In the training series, the model correctly classifies 1126 out of 1291 non-DPPs for non-active compounds (training Sensitivity = 87.22%) and 170 out of 177 DPPs for active compounds (training Specificity = 96.05%). Overall training Accuracy with respect to DPPs of both active and non-active compounds was 88.28% (1296 out of 1468 training cases). To validate the model we used an external, independent validation series. The model correctly classifies 2262 out of 2563 non-DPPs for non-active compounds (validation Sensitivity = 88.26%) and 343 out of 371 DPPs for active compounds (validation Specificity = 92.45%). Overall validation Accuracy was 88.79% (2605 out of 2934 validation cases in total). Other researchers who have applied Linear Discriminant Analysis (LDA) in QSAR generally accept this level of accuracy; see, e.g., the works of García-Domenech, Galvez, Bruno-Blanch, Marrero-Ponce, Rotondo, and others [47-58]. Next, we used the outputs of the mt-QSAR as inputs to construct the first drug-parasite CN based on kθ(j) values. In previous works we constructed, for the first time, mt-QSAR models accounting for pairs of anti-parasitic [59,60], anti-fungal [17,40], or anti-viral drugs [61] with similar/dissimilar multi-species activity profiles and represented them as large networks. In this work, we have to deal with a very high number of possible DPPs.
In the DPP-CN the DPPs are nodes, which are interconnected by an edge if they have similar drug-parasite activity. We propose to construct here, for the first time, a DPP-CN taking into consideration only the DPPs predicted by the mt-QSAR model based on kθ(j) values. Finally, we compared the predicted DPP-CN with a DPP-CN constructed here from experimental activity values. The predicted DPP-CN correctly recognizes 864 872 DPPs, which represents an Accuracy = 72.9%, Sensitivity = 72.9%, and Specificity = 70.4% of the mt-QSAR model for the reconstruction of the real DPP-CN. Figure 1 depicts both CNs.
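The training and validation rates quoted above for the mt-QSAR model can be re-derived from the reported counts. The short sketch below does this arithmetic; the dictionary keys and function names are illustrative, and the class-wise rates are labeled neutrally rather than as Sensitivity or Specificity.

```python
def rate(correct, total):
    """Percentage of correctly classified cases, rounded to two decimals."""
    return round(100.0 * correct / total, 2)

# (correctly classified, total) per class, as reported in the text.
training = {"non_active": (1126, 1291), "active": (170, 177)}
validation = {"non_active": (2262, 2563), "active": (343, 371)}

def summarize(counts):
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return {
        "non_active_pct": rate(*counts["non_active"]),
        "active_pct": rate(*counts["active"]),
        "overall_pct": rate(correct, total),
        "n": total,
    }

train_summary = summarize(training)
# → {'non_active_pct': 87.22, 'active_pct': 96.05, 'overall_pct': 88.28, 'n': 1468}
valid_summary = summarize(validation)
# → {'non_active_pct': 88.26, 'active_pct': 92.45, 'overall_pct': 88.79, 'n': 2934}
```

The recomputed percentages agree with the values given in the text, and the training total matches the n = 1468 reported with Eq. 5.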


Figure 1. Giant components of: (A) the DPP-CN Observed vs. (B) DPP-CN predicted with mt-QSAR.

Acknowledgments

González-Díaz H and Prado-Prado F acknowledge financial support from the research programs Isidro Parga Pondal (IPP) and Angeles Alvariño, respectively; both programs are funded by the Dirección Xeral de Investigación e Desenvolvemento, Xunta de Galicia, and the European Social Fund (ESF). The authors are grateful for partial financial support from project nº 07CSA008203PR, sponsored by the Xunta de Galicia.

References

1. Fowler, J. H.; Jeon, S. Social Networks 2008, 30, 16-30.
2. Mason, O.; Verwoerd, M. IET systems biology 2007, 1(2), 89-119.
3. Newman, M. E. Phys Rev E Stat Nonlin Soft Matter Phys 2001, 64(1 Pt 2), 016132.
4. Newman, M. E. Phys Rev E Stat Nonlin Soft Matter Phys 2001, 64(1 Pt 2), 016131.
5. Ohn, J. H.; Kim, J.; Kim, J. H. AMIA Annu Symp Proc 2003, 958.
6. De, P.; Singh, A. E.; Wong, T.; Yacoub, W.; Jolly, A. M. Sex Transm Infect 2004, 80(4), 280-285.
7. Johnson, J. C.; Orbach, M. K. Social Networks 2002, 24, 291-310.
8. Bornholdt, S.; Schuster, H. G. Handbook of Graphs and Complex Networks: From the Genome to the Internet; Wiley-VCH GmbH & Co. KGaA: Weinheim, 2003.
9. González-Díaz, H.; Vilar, S.; Santana, L.; Uriarte, E. Curr Top Med Chem 2007, 7(10), 1025-1039.
10. Vilar, S.; Cozza, G.; Moro, S. Curr Top Med Chem 2008, 8(18), 1555-1572.
11. Helguera, A. M.; Combes, R. D.; Gonzalez, M. P.; Cordeiro, M. N. Curr Top Med Chem 2008, 8(18), 1628-1655.
12. González-Díaz, H.; González-Díaz, Y.; Santana, L.; Ubeira, F. M.; Uriarte, E. Proteomics 2008, 8, 750-778.
13. Concu, R.; Podda, G.; Uriarte, E.; Gonzalez-Diaz, H. J Comput Chem 2009, doi:10.1002/jcc.
14. Cruz-Monteagudo, M.; González-Díaz, H.; Agüero-Chapin, G.; Santana, L.; Borges, F.; Domínguez, R. E.; Podda, G.; Uriarte, E. J Comput Chem 2007, 28, 1909-1922.
15. González-Díaz, H.; Agüero-Chapin, G.; Varona, J.; Molina, R.; Delogu, G.; Santana, L.; Uriarte, E.; Gianni, P. J Comput Chem 2007, 28, 1049–1056.
16. González-Díaz, H.; Pérez-Castillo, Y.; Podda, G.; Uriarte, E. J Comput Chem 2007, 28, 1990-1995.
17. González-Díaz, H.; Prado-Prado, F. J Comput Chem 2008, 29, 656-657.
18. Gonzalez-Diaz, H.; Saiz-Urra, L.; Molina, R.; Gonzalez-Diaz, Y.; Sanchez-Gonzalez, A. J Comput Chem 2007, 28(6), 1042-1048.
19. Prado-Prado, F. J.; Ubeira, F. M.; Borges, F.; Gonzalez-Diaz, H. J Comput Chem 2009.
20. Vilar, S.; González-Díaz, H.; Santana, L.; Uriarte, E. J Comput Chem 2008, 29, 2613-2622.
21. González-Díaz, H.; Molina, R.; Uriarte, E. Bioorg Med Chem Lett 2004, 14(18), 4691-4695.
22. González-Díaz, H.; Saíz-Urra, L.; Molina, R.; Uriarte, E. Polymer 2005, 46, 2791–2798.
23. González-Díaz, H.; Vina, D.; Santana, L.; de Clercq, E.; Uriarte, E. Bioorg Med Chem 2006, 14(4), 1095-1107.
24. González-Díaz, H.; Saiz-Urra, L.; Molina, R.; Santana, L.; Uriarte, E. Journal of proteome research 2007, 6(2), 904-908.
25. Cruz-Monteagudo, M.; González-Díaz, H.; Borges, F.; Dominguez, E. R.; Cordeiro, M. N. Chem Res Toxicol 2008, 21, 619–632.
26. Graham, D. J. Journal of chemical information and modeling 2007, 47(2), 376-389.
27. Graham, D. J.; Schacht, D. J Chem Inf Comput Sci 2000, 40, 942.
28. Graham, D. J. J Chem Inf Comput Sci 2002, 42, 215.
29. Graham, D. J.; Malarkey, C.; Schulmerich, M. V. J Chem Inf Comput Sci 2004, 44, 1601.
30. Graham, D. J.; Schulmerich, M. V. J Chem Inf Comput Sci 2004, 44, 1612.
31. Graham, D. J. Journal of chemical information and modeling 2005, 45, 1223.
32. Gonzalez-Diaz, H.; Prado-Prado, F.; Ubeira, F. M. Curr Top Med Chem 2008, 8(18), 1676-1690.


33. Marrero-Ponce, Y.; Nodarse, D.; González-Díaz, H.; Ramos de Armas, R.; Romero-Zaldivar, V.; Torrens, F.; Castro, E. A. Int J Mol Sci 2004, 5, 276-293.
34. Marrero-Ponce, Y.; Ortega-Broche, S. E.; Diaz, Y. E.; Alvarado, Y. J.; Cubillan, N.; Cardoso, G. C.; Torrens, F.; Perez-Gimenez, F. J Theor Biol 2009.
35. Aguero-Chapin, G.; Antunes, A.; Ubeira, F. M.; Chou, K. C.; Gonzalez-Diaz, H. Journal of chemical information and modeling 2008, 48, 2265–2277.
36. Bermudez, C. I.; Daza, E. E.; Andrade, E. J Theor Biol 1999, 197(2), 193-205.
37. Galindo, J. F.; Bermudez, C. I.; Daza, E. E. J Theor Biol 2006, 240(4), 574-582.
38. Shu, W.; Bo, X.; Zheng, Z.; Wang, S. BMC Bioinformatics 2008, 9, 188.
39. Prado-Prado, F.; González-Díaz, H.; Santana, L.; Uriarte, E. Bioorg Med Chem 2007, 15, 897-902.
40. González-Díaz, H.; Prado-Prado, F. J.; Santana, L.; Uriarte, E. Bioorg Med Chem 2006, 14, 5973–5980.
41. Yildirim, M. A.; Goh, K. I.; Cusick, M. E.; Barabasi, A. L.; Vidal, M. Nat Biotechnol 2007, 25(10), 1119-1126.
42. Yamanishi, Y.; Araki, M.; Gutteridge, A.; Honda, W.; Kanehisa, M. Bioinformatics 2008, 24(13), i232-240.
43. Prado-Prado, F. J.; de la Vega, O. M.; Uriarte, E.; Ubeira, F. M.; Chou, K. C.; Gonzalez-Diaz, H. Bioorg Med Chem 2009, 17, 569–575.
44. Prado-Prado, F. J.; Gonzalez-Diaz, H.; Santana, L.; Uriarte, E. Bioorg Med Chem 2007, 15(2), 897-902.
45. Prado-Prado, F. J.; Gonzalez-Diaz, H.; de la Vega, O. M.; Ubeira, F. M.; Chou, K. C. Bioorg Med Chem 2008, 16(11), 5871-5880.
46. Gonzalez-Diaz, H.; Prado-Prado, F. J.; Santana, L.; Uriarte, E. Bioorg Med Chem 2006, 14(17), 5973-5980.
47. Garcia-Garcia, A.; Galvez, J.; de Julian-Ortiz, J. V.; Garcia-Domenech, R.; Munoz, C.; Guna, R.; Borras, R. J Antimicrob Chemother 2004, 53(1), 65-73.
48. Gozalbes, R.; Brun-Pascaud, M.; Garcia-Domenech, R.; Galvez, J.; Pierre-Marie, G.; Jean-Pierre, D.; Derouin, F. Antimicrob Agents Chemother 2000, 44(10), 2771-2776.
49. Gozalbes, R.; Galvez, J.; Garcia-Domenech, R.; Derouin, F. SAR QSAR Environ Res 1999, 10(1), 47-60.
50. Marrero-Ponce, Y.; Meneses-Marcel, A.; Rivera-Borroto, O. M.; Garcia-Domenech, R.; De Julian-Ortiz, J. V.; Montero, A.; Escario, J. A.; Barrio, A. G.; Pereira, D. M.; Nogal, J. J.; Grau, R.; Torrens, F.; Vogel, C.; Aran, V. J. J Comput Aided Mol Des 2008, 22(8), 523-540.
51. Garcia-Domenech, R.; Galvez, J.; de Julian-Ortiz, J. V.; Pogliani, L. Chem Rev 2008, 108(3), 1127-1169.
52. Talevi, A.; Cravero, M. S.; Castro, E. A.; Bruno-Blanch, L. E. Bioorg Med Chem Lett 2007, 17(6), 1684-1690.
53. Talevi, A.; Bellera, C. L.; Castro, E. A.; Bruno-Blanch, L. E. J Comput Aided Mol Des 2007, 21(9), 527-538.
54. Prieto, J. J.; Talevi, A.; Bruno-Blanch, L. E. Mol Divers 2006, 10(3), 361-375.
55. Bruno-Blanch, L.; Galvez, J.; Garcia-Domenech, R. Bioorg Med Chem Lett 2003, 13(16), 2749-2754.


56. Alvarez-Ginarte, Y. M.; Marrero-Ponce, Y.; Ruiz-Garcia, J. A.; Montero-Cabrera, L. A.; Garcia de la Vega, J. M.; Noheda Marin, P.; Crespo-Otero, R.; Zaragoza, F. T.; Garcia-Domenech, R. J Comput Chem 2008, 29(3), 317-333.
57. Casanola-Martin, G. M.; Marrero-Ponce, Y.; Khan, M. T.; Ather, A.; Sultan, S.; Torrens, F.; Rotondo, R. Bioorg Med Chem 2007, 15(3), 1483-1503.
58. Casanola-Martin, G. M.; Marrero-Ponce, Y.; Khan, M. T.; Ather, A.; Khan, K. M.; Torrens, F.; Rotondo, R. Eur J Med Chem 2007, 42(11-12), 1370-1381.
59. Prado-Prado, F. J.; González-Díaz, H.; Martinez de la Vega, O.; Ubeira, F. M.; Chou, K. C. Bioorg Med Chem 2008, 16, 5871–5880.
60. González-Díaz, H.; Prado-Prado, F.; Ubeira, F. M. Curr Top Med Chem 2008, 8(18), 1676-1690.
61. Prado-Prado, J.; Martinez de la Vega, O.; Uriarte, E.; Ubeira, F. M.; Chou, K.-C.; González-Díaz, H. Bioorg Med Chem 2008, doi:10.1016/j.bmc.2008.11.075.
62. Forst, C. V. DDT 2006, 11(5/6).
63. González-Díaz, H.; Pérez-Bello, A.; Cruz-Monteagudo, M.; González-Díaz, Y.; Santana, L.; Uriarte, E. Chemom Intell Lab Systs 2007, 85, 20-26.
64. González-Díaz, H.; de Armas, R. R.; Molina, R. Bioinformatics 2003, 19(16), 2079-2087.
65. Gonzalez-Diaz, H.; de Armas, R. R.; Molina, R. Bull Math Biol 2003, 65(6), 991-1002.

Transworld Research Network 37/661 (2), Fort P.O. Trivandrum-695 023 Kerala, India

Complex Network Entropy: From Molecules to Biology, Parasitology, Technology, Social, Legal, and Neurosciences, 2011: 41-55 ISBN: 978-81-7895-507-0 Editors: Humberto González-Díaz, Francisco J. Prado-Prado and Xerardo García-Mera

4. A model based on Markov Entropy to predict the stability of collagen peptides

Riccardo Concu (1,2), Gianni Podda (2), Bairong Shen (1) and Humberto Gonzalez Diaz (3)

(1) Center for Systems Biology, Soochow University, No. 1 Shizi Street, Suzhou, Jiangsu 215006, China; (2) Università di Cagliari, Dipartimento Farmaco Chimico Tecnologico, Cagliari, Italy; (3) Department of Microbiology & Parasitology, Faculty of Pharmacy, USC, 15782 Santiago de Compostela, Spain

Introduction

The MARCH-INSIDE (Markovian Chemicals In Silico Design) methodology was developed by our research group to generate molecular descriptors based on Markov Chain (MC) theory [1, 2]. This approach has been successfully employed in Quantitative Structure-Property Relationship (QSPR) and Quantitative Structure-Activity Relationship (QSAR) studies, including studies of proteomics and nucleic acid-drug interactions [3], the discovery of antimicrobial targets [4], and the prediction of protein stability [5]. We have described the approach in many papers and summarized it in a recent review [6]. The method has also proved flexible across many different problems in protein research [4, 7-10]. Recently we used this approach to predict protein function and enzyme classification using only the 3D structure of the protein [11-13]. We have used the entropy force field in many areas, such as predicting the stability of Arc repressor mutants [14, 15], studying local drug–nucleic acid complexes [16], and studying the biological activity of biopolymers [17]. In this paper, we extend this methodology to predict the stability of collagen, using the MARCH-INSIDE approach to calculate molecular descriptors based on the entropy parameter of the MC and then using them as inputs to an Artificial Neural Network (ANN) analysis [18-27].

Collagen is the most abundant protein in the human body; it represents 30% of the proteins in our body. We can find collagen in many tissues and organs, such as veins, arteries, skin, tendons, vascular tissue, ligaments, bone, cartilage, and the extracellular matrix, and in an increasing set of noncollagenous proteins, many of which are involved in host defense [28]. This implies that any loss of functionality or stability of collagen may have serious repercussions for the human body, such as diseases, disorders, and syndromes like osteogenesis imperfecta, Ehlers-Danlos syndrome, infantile cortical hyperostosis, and collagenopathy [29-33]. It is therefore easy to understand that any change in the classical structure of the protein can trigger a cascade effect that involves several tissues and organs. The structure of collagen is well known: it is a helix made up of three alpha chains, each with the conformation of a left-handed helix. Stabilization of the helix requires a glycine every three residues, which generates the repeated motif Gly-X-Y, where the X and Y positions can hold amino acids able to form hydrogen bonds that stabilize the structure. In the classical structure, both the X and the Y positions are occupied by proline (Pro, P), but after a post-translational modification the Pro in the Y position is hydroxylated to hydroxyproline (Hyp, O), which is not one of the 20 standard amino acids.

Correspondence/Reprint request: Dr. Humberto Gonzalez Diaz, Department of Microbiology & Parasitology, Faculty of Pharmacy, USC, 15782 Santiago de Compostela, Spain. E-mail: gonzalezdiazh@yahoo.es
The hydroxylation of the Pro is an important factor for the stabilization of the collagen because the new amino acid has a hydroxyl free for a new hydrogen bond. But recent studies have demonstrated that a repeated sequence G-P-P have a stability comparable with a G-P-O sequence, moreover the stability of the structure is increased if the classical structure Gly-X-Y is abundant[34]. Otherwise, in this kind of approach we analyze the energy correlated to every amino acid, and the stability is calculated through a derivation that is correlated only with the side chain energy of each amino acid residue. The ability to predict structure and stability from amino acid sequence is an important step in the understanding of basic protein principles and the structural consequences of pathological mutations. The vast number of amino acid sequences available from DNA data contrasts with the smaller number of high resolution protein structures and the limited experimental data on protein stability. The ability to make predictions that are in good agreement with experimental data provides insight into the stabilizing interactions within proteins. In addition, there is much interest in computing the effect of

Markov Entropy approach to collagen stability

single amino acid replacements on protein stability, because destabilizing effects are associated with deleterious mutations that result in clinically detectable phenotypes. Predicting the stability of peptides is an important goal for preventing and treating diseases, designing new enzymes and antibodies, and so on[35]. For this reason, the computational study of structure/stability relationships has become an important area in protein science. Many groups have worked, and are still working, to develop smart and fast tools to predict protein stability. For instance, Shortle et al. have studied 118 mutants of Staphylococcal nuclease, and other researchers have modeled the stability of 145 mutants of T4 Lysozyme, 96 mutants of Barnase, and 71 mutants of Chymotrypsin[36]. In a recent paper, Persikov et al. developed an algorithm to relate the amino acid sequence of a collagen triple helix to its thermal stability[34].

Materials and methods
Theoretical method
In our previous works we used a Markov Chain (MC) approach to codify information about molecular structure. We gave a precise definition of the descriptors generated by MARCH-INSIDE in several reports [15, 37]. Briefly, the MARCH-INSIDE methodology considers as states of the MC the atoms, nucleotides, or amino acids in the molecule, depending on the kind of molecule to be described: a small-to-medium sized drug, a nucleic acid, or a protein, respectively[40]. The method uses, as the source of molecular descriptors, the 1Π matrix (the one-step electron-transition stochastic matrix), built as an n × n square matrix, where n is the number of atoms, nucleotides, or amino acids in the molecule. Since this work deals with proteins, from now on we will refer only to amino acids, abbreviated as aa. One can imagine a real situation in which, after a perturbation by some external factor, the electron density around these aa residues reaches a distribution different from the density distribution in the stationary state at the initial time (t0). In this case, it is of interest to develop a simple stochastic model for the distribution of electrons within the protein backbone and their return to the original position with time. It can be supposed that, after this initial situation, electrons around amino acid residues begin to distribute in different ways at discrete intervals of time (tk with k = 0, 1, 2, ...). Thus, by using MC theory it is possible to develop a simple model of the probabilities with which the amino acid electron density changes in subsequent intervals of time until a stationary or steady-state distribution arises. In Figure 1 we give a simple representation of the situation.

Riccardo Concu et al.

Figure 1. 1Π Matrix calculation.

As depicted in Figure 1, such a model deals with the calculation of the probabilities (kpij) with which the electron distribution moves from any aa i at time t0 (in black) to another aa j in its vicinity (in white) along discrete time periods. In this context, the elements (1pij) of 1Π are calculated as the ratio between the electronic charge index (ECI) of the jth aa and the sum of this index over all the δ aa covalently bound or linked through a hydrogen bond to the ith aa, plus the ith aa itself (δ + 1 terms in total), as exemplified in Figure 2; see also Eq. 1:

$${}^{1}p_{ij} = \frac{ECI_j}{\sum_{k=1}^{\delta+1} ECI_k} \qquad (1)$$
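The construction of the one-step matrix described above (each element is the ECI of residue j divided by the summed ECI of residue i and the residues linked to it) can be sketched as follows. This is only an illustrative sketch: the function name `transition_matrix`, the ECI values, and the bond list are hypothetical and are not taken from the MARCH-INSIDE software.

```python
import numpy as np

def transition_matrix(eci, bonds):
    """Row-stochastic one-step matrix: p_ij = ECI_j / sum of ECI over i and its neighbours."""
    eci = np.asarray(eci, dtype=float)
    n = len(eci)
    adj = np.eye(n)                    # every residue counts itself (the "+1" term)
    for i, j in bonds:                 # covalent or hydrogen-bond links, assumed symmetric
        adj[i, j] = adj[j, i] = 1.0
    weighted = adj * eci               # entry (i, j) equals ECI_j when i and j are linked
    return weighted / weighted.sum(axis=1, keepdims=True)

# Hypothetical three-residue chain: 0-1 and 1-2 are linked, 0-2 are not.
eci = [1.0, 2.0, 3.0]
bonds = [(0, 1), (1, 2)]
P = transition_matrix(eci, bonds)
print(P)
```

Each row of the resulting matrix sums to one, which is the property that allows it to be used as a Markov transition matrix.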

Markovian Molecular Negentropies generalized for the protein backbone, the molecular descriptors (Θk), were defined as the entropies of the charge distribution over the whole protein molecule with time (k):

$$\Theta_k = -\sum_{j} {}^{A}p_k(j)\,\log\left({}^{A}p_k(j)\right) \qquad (2)$$

The parameter k is neither a pure time measurement nor a topological distance coordinate with respect to j. This parameter codifies magnitudes of both time and space. The parameter k accounts for the integer intervals of


Figure 2. Representation of the stochastic amino acid distribution kinetics in a simple Markovian model. The symbol ts indicates the stationary time: the time at which electrons reach the equilibrium distribution around amino acid residues.

time at which the intensity of the charge distribution varies with Markovian probabilities along the protein. These molecular descriptors for the protein backbone can be interpreted as the entropy involved in the charge distribution over the protein domains after time k. The calculation of the absolute probabilities follows directly from classical results of Markov chain theory[38, 39]:

$${}^{A}\Pi_k = {}^{A}\Pi_0 \times {}^{k}\Pi = {}^{A}\Pi_0 \times \left({}^{1}\Pi\right)^{k} \qquad (3)$$

where AΠk are 1 × n vectors whose elements Apk(j) are the aforementioned absolute probabilities, AΠ0 is a 1 × n vector whose elements are the Ap0(j) probabilities for the n atoms in the molecule, and kΠ are the kth natural powers of the 1Π matrix. The Apk(j) values were defined similarly to the kpij probabilities [Equation (1)] but consider in the sum all the amino acids in the protein molecule. All the calculations were carried out using our experimental software MARCH-INSIDE[40].
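As an illustration of how the absolute probabilities and the entropy at each step k could be obtained, the sketch below propagates an initial probability vector with a one-step matrix and evaluates the entropy at every step. It assumes a natural logarithm and the usual convention that 0·log 0 = 0; the function name `markov_entropies` and the toy two-state matrix are hypothetical and do not reproduce the MARCH-INSIDE implementation.

```python
import numpy as np

def markov_entropies(P, p0, k_max):
    """Return [Theta_0, ..., Theta_k_max], with Theta_k = -sum_j p_k(j) log p_k(j)."""
    P = np.asarray(P, dtype=float)
    p = np.asarray(p0, dtype=float)
    thetas = []
    for _ in range(k_max + 1):
        nz = p[p > 0]                              # skip zeros: 0*log(0) is taken as 0
        thetas.append(float(-(nz * np.log(nz)).sum()))
        p = p @ P                                  # next absolute probability vector
    return thetas

# Toy example (assumed values): two states, all probability initially on state 0.
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])
p0 = np.array([1.0, 0.0])
thetas = markov_entropies(P, p0, 2)
print(thetas)
```

In this toy case the entropy starts at zero (the distribution is concentrated on one state) and reaches log 2 as soon as the distribution becomes uniform.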


Dataset
The list of peptides was retrieved from the literature[34, 41, 42], for a total of 100 sequences. In order to build the LDA and ANN models, we split the dataset into a stable and an unstable series: peptides with a melting temperature (Tm) > 40 are in the stable series, while peptides with a Tm < 40 are assigned to the unstable series. In Table 1 we report the ID, the Tm, and the amino acid sequence (in one-letter code) of every peptide. The backbone of each peptide was built using the "draw mode" of the program, and every O was replaced by the P amino acid. In this respect, we considered only covalent interactions (peptide bonds) and hydrogen-bonding interactions. Finally, we calculated the indices for all the peptides.

Table 1. ID, sequence, length and melting temperature of the peptides.

ID | Sequence | Length | Tm (°C)
AchE-146(t) | pp-grp-grk-grp-gvr-gpr-(gpp)4-g | 30 | 15.5
AchE-146/241 | pp-grp-gkr-gkp-gvr-gpr-(gpp)4-g | 30 | 19.7
AchE-146A(t) | pp-grp-gaa-gap-gvr-gpr-(gpp)4-g | 30 | 18.6
AchE-146B(t) | pp-gap-grk-grp-gva-gpa-(gpp)4-g | 30 | 11
AchE-224(t) | pp-glp-gml-gqk-gem-gpk-(gpp)4-g | 30 | 7
AchE-251(t) | pp-grp-gkr-gkt-glk-gdi-(gpp)4-g | 30 | 9.8
AchE-HG-Alt | gpp-gpp-grq-gkr-gkp-gpp-gpp-gpp-gg | 29 | 20.1
AchE-HG-C1 | gpp-gpp-gpp-gkr-gkp-gpp-gpp-gpp-gg | 29 | 32.3
AchE-HG-C2 | gpp-gpp-grp-gkr-gkp-gpp-gpp-gpp-gg | 29 | 26.9
AchE-HG-C3 | gpp-gpp-grp-gkr-gkq-gpp-gpp-gpp-gg | 29 | 21.3
AchE-HG-C4 | gpp-gpp-grp-gkr-gkq-gqk-gpp-gpp | 24 | 8.1
AchE-HG-N1 | gpp-gpp-gpp-grk-grp-gpp-gpp-gpp-gg | 26 | 22.2
AchE-HG-N2 | gpp-gpp-grp-grk-grp-gpp-gpp-gpp-gg | 26 | 17.4
AchE-P126(r) | gpp-gpp-grp-grk-grp-(gpp)5-g | 31 | 30
AchE-P231(r) | gpp-gpp-grp-gkr-gkq-gqk-(gpp)4-g | 31 | 23.9
C1qA-15 | gpp-grp-grr-grp-glk-geq-(gpp)4-gy | 32 | 23.3
C1qC-67 | gpp-gir-gpk-gqk-gep-glp-(gpp)4-gy | 32 | 28.4
GAA | gpp-gpp-gpp-gaa-gpp-gpp-gpp-gpp-gg | 24 | 32.9
GAAGAA | gpp-gpp-gpp-gaa-gaa-gpp-gpp-gpp-gg | 20 | 20
GAAGPP | gpp-gpp-gpp-gaa-gpp-gpp-gpp-gpp-gg | 27 | 32.9
GAD | gpp-gpp-gpp-gad-gpp-gpp-gpp-gpp-gg | 24 | 33
GAF | gpp-gpp-gpp-gaf-gpp-gpp-gpp-gpp-gg | 24 | 21.9
GAK | gpp-gpp-gpp-gak-gpp-gpp-gpp-gpp-gg | 24 | 30.8
GAL | gpp-gpp-gpp-gal-gpp-gpp-gpp-gpp-gg | 24 | 27.8
GAPGAP | gpp-gpp-gpp-gap-gap-gpp-gpp-gpp-gg | 27 | 36.9
GAR | gpp-gpp-gpp-gar-gpp-gpp-gpp-gpp-gg | 24 | 38.2
GAS | gpp-gpp-gpp-gas-gpp-gpp-gpp-gpp-gg | 24 | 33
GDA | gpp-gpp-gpp-gda-gpp-gpp-gpp-gpp-gg | 24 | 31.6
GDK | gpp-gpp-gpp-gdk-gpp-gpp-gpp-gpp-gg | 24 | 30.9
GDPGPR | gpp-gpp-gpp-gdp-gpr-gpp-gpp-gpp-gg | 27 | 36.2
GDR | gpp-gpp-gpp-gdr-gpp-gpp-gpp-gpp-gg | 24 | 37.1
GEA | gpp-gpp-gpp-gea-gpp-gpp-gpp-gpp-gg | 24 | 34.6
GED | gpp-gpp-gpp-ged-gpp-gpp-gpp-gpp-gg | 24 | 29.7
GEK | gpp-gpp-gpp-gek-gpp-gpp-gpp-gpp-gg | 24 | 35
GEKGPP | gpp-gpp-gpp-gek-gpp-gpp-gpp-gpp-gg | 27 | 35
GEN | gpp-gpp-gpp-gen-gpp-gpp-gpp-gpp-gg | 24 | 29.5
GEPGPK | gpp-gpp-gpp-gep-gpk-gpp-gpp-gpp-gg | 27 | 38
GEQ | gpp-gpp-gpp-geq-gpp-gpp-gpp-gpp-gg | 24 | 37.7
GER | gpp-gpp-gpp-ger-gpp-gpp-gpp-gpp-gg | 24 | 40.4
GET | gpp-gpp-gpp-get-gpp-gpp-gpp-gpp-gg | 24 | 35.9
GEV | gpp-gpp-gpp-gev-gpp-gpp-gpp-gpp-gg | 24 | 35.3
GFA | gpp-gpp-gpp-gfa-gpp-gpp-gpp-gpp-gg | 24 | 24.1
GGA | gpp-gpp-gpp-gga-gpp-gpp-gpp-gpp-gg | 24 | 26
GGF | gpp-gpp-gpp-ggf-gpp-gpp-gpp-gpp-gg | 24 | 19.7
GGK | gpp-gpp-gpp-ggk-gpp-gpp-gpp-gpp-gg | 24 | 26.9
GGL | gpp-gpp-gpp-ggl-gpp-gpp-gpp-gpp-gg | 24 | 25.3
GIA | gpp-gpp-gpp-gia-gpp-gpp-gpp-gpp-gg | 24 | 33.9
GKD | gpp-gpp-gpp-gkd-gpp-gpp-gpp-gpp-gg | 24 | 35.8
GKE | gpp-gpp-gpp-gke-gpp-gpp-gpp-gpp-gg | 24 | 35.3
GKN | gpp-gpp-gpp-gkn-gpp-gpp-gpp-gpp-gg | 24 | 31.7
GKPGEP | gpp-gpp-gpp-gkp-gep-gpp-gpp-gpp-gg | 27 | 37.9
GKQ | gpp-gpp-gpp-gkq-gpp-gpp-gpp-gpp-gg | 24 | 38.9
GKR | gpp-gpp-gpp-gkr-gpp-gpp-gpp-gpp-gg | 24 | 39.1
GLA | gpp-gpp-gpp-gla-gpp-gpp-gpp-gpp-gg | 24 | 31.2
GLK | gpp-gpp-gpp-glk-gpp-gpp-gpp-gpp-gg | 24 | 31.1
GLL | gpp-gpp-gpp-gll-gpp-gpp-gpp-gpp-gg | 24 | 26.9
GLPGLP | gpp-gpp-gpp-glp-glp-gpp-gpp-gpp-gg | 27 | 38.1
GLQ | gpp-gpp-gpp-glq-gpp-gpp-gpp-gpp-gg | 24 | 35.7
GMK | gpp-gpp-gpp-gmk-gpp-gpp-gpp-gpp-gg | 24 | 31.7
GPAGPA | gpp-gpp-gpp-gpa-gpa-gpp-gpp-gpp-gg | 27 | 35.8
GPKGDP | gpp-gpp-gpp-gpk-gdp-gpp-gpp-gpp-gg | 27 | 47.1
GPKGEP | gpp-gpp-gpp-gpk-gep-gpp-gpp-gpp-gg | 27 | 47.8
GPKGPE | gpp-gpp-gpp-gpk-gpe-gpp-gpp-gpp-gg | 27 | 36.5
GPLGLP | gpp-gpp-gpp-gpl-gpl-gpp-gpp-gpp-gg | 27 | 28.2
GPP30 | gpp-gpp-gpp-gpp-gpp-gpp-gpp-gpp-gg | 30 | 43
GPPGAA | gpp-gpp-gpp-gpp-gaa-gpp-gpp-gpp-gg | 27 | 33.3
GPPGPP | gpp-gpp-gpp-gpp-gpp-gpp-gpp-gpp-gg | 27 | 47.3
GPRGDP | gpp-gpp-gpp-gpr-gdp-gpp-gpp-gpp-gg | 27 | 39.6
GPRGEP | gpp-gpp-gpp-gpr-gep-gpp-gpp-gpp-gg | 27 | 42.8
GQK | gpp-gpp-gpp-gqk-gpp-gpp-gpp-gpp-gg | 24 | 32.6
GQR | gpp-gpp-gpp-gqr-gpp-gpp-gpp-gpp-gg | 24 | 39.5
GRD | gpp-gpp-gpp-grd-gpp-gpp-gpp-gpp-gg | 24 | 34.5
GRE | gpp-gpp-gpp-gre-gpp-gpp-gpp-gpp-gg | 24 | 33.8
GRK | gpp-gpp-gpp-grk-gpp-gpp-gpp-gpp-gg | 24 | 29.5
GRS | gpp-gpp-gpp-grs-gpp-gpp-gpp-gpp-gg | 24 | 30.5
GVK | gpp-gpp-gpp-gvk-gpp-gpp-gpp-gpp-gg | 24 | 32.5
MBL 42–61 | gin-gfp-gkd-grd-gtk-gek-gep-(gpp)4-gg | 35 | 23
MBL 45–61 | gfp-gkd-grd-gtk-gek-gep-(gpp)4-gy | 32 | 17.9
MSR-1 | pp-(gpp)2-gpk-gqk-gek-(gpp)4-g | 30 | 30
T1–655 | (gpp)3-gak-gda-gpp-gpa-(gpp)3-gy | 32 | 42.8
T1–892 | gpa-gpa-gpv-gpa-gar-gpa-(gpp)4-gv | 32 | 26
T1–892 unbl | gpa-gpa-gpv-gpa-gar-gpa-(gpp)4-gy | 32 | 20.6
T1–892(P24A) | (gpa)2-gpv-gpa-gar-gpa-gpp-gpa-(gpp)2-gv | 32 | 23.2
T1–892(P26A) | (gpa)2-gpv-gpa-gar-gpa-(gpp)2-gap-gpp-gv | 35 | 24.1
T1–892r | (gpp)4-gpa-gpa-gpv-gpa-gar-gpa-gv | 32 | 26
T1–904 | gar-gpa-gpq-gpr-gdk-get-(gpp)4-gv | 32 | 30.8
T1A2–697 | gfp-gaa-grt-gpp-gps-gis-(gpp)4-gv | 32 | <4
T2–508 | gsp-gaq-glq-gpr-glp-gtp-(gpp)4-gv | 32 | 25
T3–505 | ggk-gda-gap-ger-gpp-gla-(gpp)4-gv | 32 | 20.9
T3–508 | gda-gap-ger-gpp-gla-gap-(gpp)4-gv | 32 | 23.2
T3–511 | gap-ger-gpp-gla-gap-glr-(gpp)4-gv | 32 | 25.9
T3–514 | ger-gpp-gla-gap-glr-gga-(gpp)4-gv | 32 | 16.5
T3–517 | gpp-gla-gap-glr-gga-gpp-(gpp)4-gv | 32 | 15.8
T3–520 | gla-gap-glr-gga-gpp-gpe-(gpp)4-gv | 32 | 17.5
T3–772 | gpp-gap-gpl-gia-git-gar-gla-(gpp)4-gg | 35 | <4
T3–785 | pp-(gpp)2-git-gar-gla-(gpp)4-g | 30 | 18
T3–997 | gpr-gnr-ger-gse-gsp-ghp-gqp-gpp-gpp-gap-gv | 32 | <4
T7–2031 | gla-gep-gkp-gip-glp-gra-(gpp)4-gv | 32 | 25.4
T7–2058 | ger-ger-gek-ger-geq-grd-(gpp)4-gv | 32 | 23.2
α1CB2 | gps-gpr-glp-gpp-gap-gpq-gfq-gpp-(gep)2-gas-gpm | 36 | 12

*Tm = melting temperature; length = length of the peptide
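The sequences in Table 1 use a compact notation in which a repeated unit is written as, for example, (gpp)4. The sketch below shows one possible way to expand this notation into a plain one-letter sequence and to assign a peptide to the stable or unstable series using the Tm threshold of 40 described above. The helper names `expand` and `label` are hypothetical; the treatment of a Tm exactly equal to 40 (assigned here to the unstable series) is an assumption, since the text does not specify it, and entries reported as "<4" would need separate handling.

```python
import re

def expand(seq):
    """Expand '(gpp)4'-style repeats and drop hyphens: 'pp-(gpp)2-g' -> 'PPGPPGPPG'."""
    seq = re.sub(r"\((\w+)\)(\d+)", lambda m: m.group(1) * int(m.group(2)), seq)
    return seq.replace("-", "").upper()

def label(tm, threshold=40.0):
    """Assign a peptide to the stable (Tm > threshold) or unstable series."""
    return "stable" if tm > threshold else "unstable"

# First row of Table 1: AchE-146(t), listed with length 30 and Tm 15.5.
full = expand("pp-grp-grk-grp-gvr-gpr-(gpp)4-g")
print(full, len(full), label(15.5))
```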

ANN models
The ANN models were built using the STATISTICA 6.0 software[43]. All the variables used were standardized to bring them onto the same scale, and a cross-validation series was included to validate the model. For the ANN models we used different types of network: Probabilistic Neural Network (PNN), Multi-Layer Perceptron (MLP), Radial Basis Function (RBF), and Linear Neural Network (LNN)[44]. All the ANNs were subjected to a one-step test (one training period) and then to a two-step test (two training periods) with different training algorithms. In the two-step training, different algorithms were combined, such as Back-Propagation, Levenberg-Marquardt, Quick Propagation, Quasi-Newton, and Conjugate Gradient Descent. The combinations of two different methods were tested using different numbers of epochs to train the ANN (within a range of 10 to 100 000 epochs). In addition, in order to obtain the ROC curve[45] with the ANN models, we built several Linear Neural Networks (LNN) and then chose the one most similar to our final LDA model.
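The standardization step mentioned above can be illustrated with a z-score transformation, in which each variable is centered and scaled using statistics computed on the training series only. This is a generic sketch and an assumption about the procedure; the exact standardization applied by STATISTICA is not described in the text, and the function name `standardize` and the numbers below are hypothetical.

```python
import numpy as np

def standardize(train, other):
    """Z-score both arrays using the column means and standard deviations of `train`."""
    train = np.asarray(train, dtype=float)
    other = np.asarray(other, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0, ddof=1)    # sample standard deviation
    std[std == 0] = 1.0                # leave constant columns unscaled
    return (train - mean) / std, (other - mean) / std

# Hypothetical entropy descriptors: rows are peptides, columns are variables.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])
Z_train, Z_test = standardize(X_train, X_test)
print(Z_train, Z_test)
```

Reusing the training statistics for the validation series avoids leaking information from the validation cases into the scaling.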


Results and discussion
ANN prediction models
We tried to find a good classifier using the ANN technique, testing four different types of ANN: PNN, RBF, MLP, and LNN. Almost all models perform well, with training and validation performance higher than 70% and sometimes higher than 80%; see Table 2. In this sense, the best model should be selected in terms of model complexity, including the number of input entropy parameters, which is equivalent to the number of neurons in the input layer (I), the number of neurons in the hidden layers (Hi), and the number of these Hi layers per se. In Table 2, the Profile column gives the number of neurons in these layers for each model.

Table 2. Results of the ANN analysis.

ANN type | Profile | Train (%) | Test (%)
LNN | LNN 1:1-1:1 | 72.2 | 69.2
LNN | LNN 2:2-1:1 | 72.2 | 76.9
LNN | LNN 3:3-1:1 | 70.9 | 73.1
LNN | LNN 4:4-1:1 | 73.4 | 80.8
LNN | LNN 5:5-1:1 | 77.2 | 57.7
LNN | LNN 6:6-1:1 | 78.5 | 80.8
PNN | PNN 1:1-79-2-2:1 | 75.9 | 76.9
PNN | PNN 2:2-79-2-2:1 | 75.9 | 76.9
PNN | PNN 3:3-79-2-2:1 | 75.9 | 76.9
PNN | PNN 4:4-79-2-2:1 | 75.9 | 76.9
PNN | PNN 5:5-79-2-2:1 | 75.9 | 76.9
PNN | PNN 6:6-79-2-2:1 | 75.9 | 76.9
RBF | RBF 1:1-1-1:1 | 43.0 | 57.7
RBF | RBF 1:1-1-1:1 | 43.0 | 57.7
RBF | RBF 2:2-1-1:1 | 35.4 | 38.5
MLP1 | MLP 1:1-5-1:1 | 57.0 | 34.6
MLP1 | MLP 1:1-7-1:1 | 75.9 | 84.6
MLP1 | MLP 2:2-9-1:1 | 75.9 | 84.6
MLP1 | MLP 2:2-9-1:1 | 75.9 | 84.6
MLP1 | MLP 2:2-9-1:1 | 73.4 | 80.8
MLP1 | MLP 3:3-8-1:1 | 72.2 | 80.8
MLP1 | MLP 3:3-9-1:1 | 75.9 | 84.6
MLP1 | MLP 4:4-9-1:1 | 72.2 | 76.9
MLP1 | MLP 4:4-9-1:1 | 72.2 | 76.9
MLP1 | MLP 5:5-7-1:1 | 73.4 | 76.9
MLP1 | MLP 6:6-4-1:1 | 73.4 | 76.9
MLP1 | MLP 6:6-6-1:1 | 72.2 | 76.9
MLP1 | MLP 6:6-7-1:1 | 73.4 | 76.9
MLP1 | MLP 6:6-7-1:1 | 73.4 | 76.9
MLP1 | MLP 6:6-7-1:1 | 73.4 | 76.9
MLP1 | MLP 6:6-7-1:1 | 73.4 | 76.9
MLP1 | MLP 6:6-8-1:1 | 73.4 | 76.9
MLP1 | MLP 6:6-8-1:1 | 73.4 | 76.9
MLP2 | MLP 1:1-11-5-1:1 | 75.9 | 84.6
MLP2 | MLP 1:1-11-6-1:1 | 35.4 | 34.6
MLP2 | MLP 1:1-11-7-1:1 | 75.9 | 84.6
MLP2 | MLP 1:1-11-8-1:1 | 70.9 | 80.8
MLP2 | MLP 1:1-11-8-1:1 | 65.8 | 65.4
MLP2 | MLP 1:1-11-8-1:1 | 75.9 | 84.6
MLP2 | MLP 1:1-11-9-1:1 | 75.9 | 84.6
MLP2 | MLP 1:1-11-9-1:1 | 75.9 | 84.6
MLP2 | MLP 1:1-11-9-1:1 | 72.2 | 69.2
MLP2 | MLP 1:1-8-5-1:1 | 75.9 | 84.6
MLP2 | MLP 3:3-8-7-1:1 | 72.2 | 80.8
MLP2 | MLP 5:5-8-8-1:1 | 75.9 | 80.8
MLP2 | MLP 6:6-6-6-1:1 | 72.2 | 80.8
MLP2 | MLP 2:2-11-7-1:1 | 68.4 | 80.8
MLP2 | MLP 3:3-11-8-1:1 | 77.2 | 84.6
MLP2 | MLP 4:4-11-9-1:1 | 75.9 | 76.9
MLP2 | MLP 5:5-11-7-1:1 | 70.9 | 76.9
MLP2 | MLP 5:5-11-7-1:1 | 72.2 | 76.9
MLP2 | MLP 5:5-11-8-1:1 | 77.2 | 84.6
MLP2 | MLP 5:5-11-9-1:1 | 72.2 | 53.8
MLP2 | MLP 6:6-11-7-1:1 | 72.2 | 80.8
MLP2 | MLP 6:6-11-7-1:1 | 72.2 | 80.8
MLP2 | MLP 6:6-11-8-1:1 | 72.2 | 76.9
MLP2 | MLP 6:6-11-9-1:1 | 74.7 | 80.8
MLP2 | MLP 6:6-11-9-1:1 | 72.2 | 76.9
MLP2 | MLP 6:6-11-9-1:1 | 72.2 | 80.8
MLP2 | MLP 6:6-11-9-1:1 | 72.2 | 80.8
MLP2 | MLP 6:6-11-9-1:1 | 73.4 | 76.9
MLP2 | MLP 6:6-11-9-1:1 | 70.9 | 80.8
MLP2 | MLP 3:3-11-11-1:1 | 72.2 | 76.9
MLP2 | MLP 5:5-11-10-1:1 | 74.7 | 57.7
MLP2 | MLP 5:5-11-11-1:1 | 73.4 | 76.9
MLP2 | MLP 6:6-11-10-1:1 | 72.2 | 76.9

The general formula of these profiles is: ANN Ni:I-H1-H2-O:No. In this profile formula, ANN = PNN, RBF, MLP, or LNN refers to the type of network; Ni is the number of input parameters (peptide entropy values); I, H1, H2, and O are the numbers of neurons in the Input, Hidden 1, Hidden 2, and Output layers; and No is the number of output answers (always 1 in this case). See also Figure 3 for a graphical depiction of this idea. In this sense, the simplest but still accurate predictor seems to be the LNN classifier with profile LNN 4:4-1:1. This classifier uses only four entropy parameters and has only one hidden layer with one neuron, which is equivalent to a linear equation


with four variables. Its performance in training and validation, 73.4% and 80.8% respectively, is similar to or higher than the performance of the more complicated ANNs. This result confirms the strong linear relationship between the entropy values of the collagen polypeptide chains and their stability. Furthermore, we analyzed the ROC curves for the models listed in Table 2; all the models present a notably high area under the ROC curve, in accordance with all the results presented so far. In Figure 4 we report the ROC curves for the best models. The area under the ROC curve of a significant classifier must be higher than 0.5 (the value for a random classifier) and as near to 1 as possible. We also used these values to confirm that the LNN is the best classifier in terms of the complexity/performance ratio.
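To make the 0.5 and 1 reference values concrete, the area under the ROC curve can be computed as the probability that a randomly chosen stable peptide receives a higher classifier score than a randomly chosen unstable one, counting ties as one half. The sketch below uses this rank-based definition; the function name `roc_auc` and the scores are hypothetical and unrelated to the STATISTICA output.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A positive scoring above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores: label 1 = stable, label 0 = unstable.
perfect = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
uninformative = roc_auc([0.5, 0.5], [1, 0])
print(perfect, uninformative)
```

A classifier that ranks every stable peptide above every unstable one gives an area of 1, whereas identical scores for both classes give 0.5.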

Figure 3. Topology of the LNN and ANN models (some examples).


Figure 4. Results of the ROC curve analysis.

Conclusions
In this paper, the MARCH-INSIDE methodology has been extended to collagen stability studies. We demonstrate that it is possible to predict collagen stability with linear and non-linear classifiers based only on the entropy of the amino acid sequence. The models presented here are the first in this field, and the data show that the entropy of an electrostatic charge distribution may affect protein stability, so that by calculating it we can predict collagen stability with a fast and accurate model.

References
1. Gonzalez Diaz H, Olazabal E, Castanedo N, Sanchez IH, Morales A, Serrano HS, et al. Markovian chemicals "in silico" design (MARCH-INSIDE), a promising approach for computer aided molecular design II: experimental and theoretical assessment of a novel method for virtual screening of fasciolicides. Journal of molecular modeling. 2002 Aug;8(8):237-45. 2. Gonzales-Diaz H, Gia O, Uriarte E, Hernadez I, Ramos R, Chaviano M, et al. Markovian chemicals "in silico" design (MARCH-INSIDE), a promising approach for computer-aided molecular design I: discovery of anticancer compounds. Journal of molecular modeling. 2003 Dec;9(6):395-407. 3. Gonzalez-Diaz H, Aguero G, Cabrera MA, Molina R, Santana L, Uriarte E, et al. Unified Markov thermodynamics based on stochastic forms to classify drugs considering molecular structure, partition system, and biological species: distribution of the antimicrobial G1 on rat tissues. Bioorganic & medicinal chemistry letters. 2005 Feb 1;15(3):551-7. 4. Gonzalez-Diaz H, Prado-Prado F, Ubeira FM. Predicting antimicrobial drugs and targets with the MARCH-INSIDE approach. Current topics in medicinal chemistry. 2008;8(18):1676-90. 5. Gonzalez-Diaz H, Uriarte E, Ramos de Armas R. Predicting stability of Arc repressor mutants with protein stochastic moments. Bioorganic & medicinal chemistry. 2005 Jan 17;13(2):323-31. 6. Gonzalez-Diaz H, Vilar S, Santana L, Uriarte E. Medicinal chemistry and bioinformatics--current trends in drugs discovery with networks topological indices. Current topics in medicinal chemistry. 2007;7(10):1015-29. 7. Munteanu CR, Vázquez JM, Dorado J, Pazos-Sierra A, Sánchez-González A, Prado-Prado FJ, et al. Complex Network Spectral Moments for ATCUN Motif DNA Cleavage: First Predictive Study on Proteins of Human Pathogen Parasites. Journal of proteome research. 2009:doi:10.1021/pr900556g. 8. Aguero-Chapin G, Varona-Santos J, de la Riva GA, Antunes A, Gonzalez-Villa T, Uriarte E, et al. Alignment-Free Prediction of Polygalacturonases with Pseudofolding Topological Indices: Experimental Isolation from Coffea arabica and Prediction of a New Sequence. Journal of proteome research. 2009 Apr 3;8(4):2122-8. 9. González-Díaz H, González-Díaz Y, Santana L, Ubeira FM, Uriarte E. Proteomics, networks and connectivity indices. Proteomics. 2008;8:750-78. 10. González-Díaz H, Saiz-Urra L, Molina R, Santana L, Uriarte E. A Model for the Recognition of Protein Kinases Based on the Entropy of 3D van der Waals Interactions. Journal of proteome research. 2007 Feb 2;6(2):904-8. 11. Concu R, Dea-Ayuela MA, Perez-Montoto LG, Prado-Prado FJ, Uriarte E, Bolas-Fernandez F, et al. 3D entropy and moments prediction of enzyme classes and experimental-theoretic study of peptide fingerprints in Leishmania parasites. Biochimica et biophysica acta. 2009 Aug 28. 12. Concu R, Dea-Ayuela MA, Perez-Montoto LG, Bolas-Fernandez F, Prado-Prado FJ, Podda G, et al. Prediction of Enzyme Classes from 3D Structure: A General Model and Examples of Experimental-Theoretic Scoring of Peptide Mass Fingerprints of Leishmania Proteins. Journal of proteome research. 2009 Sep 4;8(9):4372-82. 13. Concu R, Podda G, Uriarte E, Gonzalez-Diaz H. Computational chemistry study of 3D-structure-function relationships for enzymes based on Markov models for protein electrostatic, HINT, and van der Waals potentials. Journal of computational chemistry. 2009 Jul 15;30(9):1510-20. 14. Ramos de Armas R, González-Díaz H, Molina R, Uriarte E. Markovian Backbone Negentropies: Molecular descriptors for protein research. I. Predicting protein stability in Arc repressor mutants. Proteins. 2004 Sep 1;56(4):715-23.


15. González-Díaz H, Uriarte E, Ramos de Armas R. Predicting stability of Arc repressor mutants with protein stochastic moments. Bioorg Med Chem. 2005 Jan 17;13(2):323-31. 16. González-Díaz H, de Armas RR, Molina R. Markovian negentropies in bioinformatics. 1. A picture of footprints after the interaction of the HIV-1 PsiRNA packaging region with drugs. Bioinformatics (Oxford, England). 2003 Nov 1;19(16):2079-87. 17. de Armas RR, Diaz HG, Molina R, Uriarte E. Stochastic-based descriptors studying biopolymers biological properties: extended MARCH-INSIDE methodology describing antibacterial activity of lactoferricin derivatives. Biopolymers. 2005 Apr 5;77(5):247-56. 18. Marini F. Artificial neural networks in foodstuff analyses: Trends and perspectives A review. Anal Chim Acta. 2009 Mar 9;635(2):121-31. 19. Ivanciuc O. Drug Design with Artificial Neural Networks. In: Meyers RA, ed. Encyclopedia of Complexity and Systems Science. Berlin: Springer-Verlag 2009:2139-59. 20. Zou J, Han Y, So SS. Overview of artificial neural networks. Methods Mol Biol. 2008;458:15-23. 21. Krogh A. What are artificial neural networks? Nat Biotechnol. 2008 Feb;26(2):195-7. 22. Cartwright HM. Artificial neural networks in biology and chemistry: the evolution of a new analytical tool. Methods Mol Biol. 2008;458:1-13. 23. Caballero J, Fernandez M. Artificial neural networks from MATLAB in medicinal chemistry. Bayesian-regularized genetic neural networks (BRGNN): application to the prediction of the antagonistic activity against human platelet thrombin receptor (PAR-1). Curr Top Med Chem. 2008;8(18):1580-605. 24. Bartosch-Harlid A, Andersson B, Aho U, Nilsson J, Andersson R. Artificial neural networks in pancreatic disease. Br J Surg. 2008 Jul;95(7):817-26. 25. Patel JL, Goyal RK. Applications of artificial neural networks in medical science. Current clinical pharmacology. 2007 Sep;2(3):217-26. 26. Huang Y, Kangas LJ, Rasco BA. Applications of artificial neural networks (ANNs) in food science. 
Crit Rev Food Sci Nutr. 2007;47(2):113-26. 27. Grossi E, Buscema M. Introduction to artificial neural networks. Eur J Gastroenterol Hepatol. 2007 Dec;19(12):1046-54. 28. Brodsky B, Persikov AV. Molecular structure of the collagen triple helix. Advances in protein chemistry. 2005;70:301-39. 29. Tedeschi E, Antoniazzi F, Venturi G, Zamboni G, Tato L. Osteogenesis imperfecta and its molecular diagnosis by determination of mutations of type I collagen genes. Pediatr Endocrinol Rev. 2006 Sep;4(1):40-6. 30. Makareeva E, Cabral WA, Marini JC, Leikin S. Molecular mechanism of alpha 1(I)-osteogenesis imperfecta/Ehlers-Danlos syndrome: unfolding of an N-anchor domain at the N-terminal end of the type I collagen triple helix. The Journal of biological chemistry. 2006 Mar 10;281(10):6463-70. 31. Shuster S. Osteoporosis, a unitary hypothesis of collagen loss in skin and bone. Medical hypotheses. 2005;65(3):426-32.


32. Galicka A, Wolczynski S, Gindzienski A. Studies on type I collagen in skin fibroblasts cultured from twins with lethal osteogenesis imperfecta. Acta biochimica Polonica. 2003;50(2):481-8. 33. Myllyharju J, Kivirikko KI. Collagens and collagen-related diseases. Annals of medicine. 2001 Feb;33(1):7-21. 34. Persikov AV, Ramshaw JA, Brodsky B. Prediction of collagen stability from amino acid sequence. The Journal of biological chemistry. 2005 May 13;280(19):19343-9. 35. Cheng J, Randall A, Baldi P. Prediction of protein stability changes for single-site mutations using support vector machines. Proteins. 2006 Mar 1;62(4):1125-32. 36. Deep S, Ahluwalia JC. Theoretical studies on solvation contribution to the thermodynamic stability of mutants of lysozyme T4. Protein engineering. 2003 Jun;16(6):415-22. 37. Ramos de Armas R, Gonzalez Diaz H, Molina R, Uriarte E. Markovian Backbone Negentropies: Molecular descriptors for protein research. I. Predicting protein stability in Arc repressor mutants. Proteins. 2004 Sep 1;56(4):715-23. 38. Diaz HG, Marrero Y, Hernandez I, Bastida I, Tenorio E, Nasco O, et al. 3DMEDNEs: an alternative "in silico" technique for chemical research in toxicology. 1. prediction of chemically induced agranulocytosis. Chemical research in toxicology. 2003 Oct;16(10):1318-27. 39. Cruz-Monteagudo M, Gonzalez-Diaz H, Borges F, Dominguez ER, Cordeiro MN. 3D-MEDNEs: an alternative "in silico" technique for chemical research in toxicology. 2. quantitative proteome-toxicity relationships (QPTR) based on mass spectrum spiral entropy. Chemical research in toxicology. 2008 Mar;21(3):619-32. 40. González-Díaz H, Molina-Ruiz R, Hernandez I. MARCH-INSIDE version 2.0 (Markovian Chemicals In Silico Design). 2.0 ed. 2005. Main author contact email: gonzalezdiazh@yahoo.es. 41. Persikov AV, Ramshaw JA, Kirkpatrick A, Brodsky B. 
Electrostatic interactions involving lysine make major contributions to collagen triple-helix stability. Biochemistry. 2005 Feb 8;44(5):1414-22. 42. Persikov AV, Ramshaw JA, Kirkpatrick A, Brodsky B. Peptide investigations of pairwise interactions in the collagen triple-helix. Journal of molecular biology. 2002 Feb 15;316(2):385-94. 43. StatSoft.Inc. STATISTICA (data analysis software system), version 6.0, www.statsoft.com.Statsoft, Inc. 6.0 ed 2002. 44. Dohnal V, Kuca K, Jun D. [Methods of artificial intelligence: a new trend in pharmacy]. Ceska Slov Farm. 2005 Jul;54(4):163-7. 45. Swets JA. Measuring the accuracy of diagnostic systems. Science. 1988 Jun 3;240(4857):1285-93.

Transworld Research Network 37/661 (2), Fort P.O. Trivandrum-695 023 Kerala, India

Complex Network Entropy: From Molecules to Biology, Parasitology, Technology, Social, Legal, and Neurosciences, 2011: 57-71 ISBN: 978-81-7895-507-0 Editors: Humberto González-Díaz, Francisco J. Prado-Prado and Xerardo García-Mera

5. Non-self discrimination of parasite proteins with entropy of 3D structure networks and artificial neural networks

Humberto González-Díaz1, Xerardo García-Mera2 and Francisco Prado-Prado2
1Department of Microbiology & Parasitology, Faculty of Pharmacy, University of Santiago de Compostela, 15782, Santiago de Compostela, Spain; 2Department of Organic Chemistry, Faculty of Pharmacy, USC, 15782 Santiago de Compostela, Spain

Correspondence/Reprint request: Dr. Humberto González-Díaz, Department of Microbiology and Parasitology, Faculty of Pharmacy, University of Santiago de Compostela, 15782, Santiago de Compostela, Spain. E-mail: gonzalezdiazh@yahoo.es

Introduction
There are many species of Human Parasites (HPs) whose members cause important parasitic diseases in human and cattle hosts. The overall burden of the world's HP infections may be severely underestimated [1-6]. Nowadays, a goal of major importance is the computational prediction of new drugs or drug-target proteins for different HPs [7]. This question is very interesting because proteins differentially expressed in only one organism (self proteins), with a specific function and unique peptides, may present unique drug-binding pockets. It is then reasonable to expect that these self proteins are good candidates for new drug targets, so they may be considered proteins with high targetability. The concept of self proteins (related to protein selfness) comes from immunology. The nature of the relationship between an antigenic protein and its capability to evoke an immune response

Humberto González-Díaz et al.

is still an unsolved problem. Although experiments indicate that specific (dis)continuous amino acid sequences may determine specific immune responses, how immunogenic properties and recognition information are mapped onto a non-linear sequence is not understood. Immunology has invoked the concept of self/non-self discrimination in order to explain the capability of the organism to immune-react selectively. However, no clear, logical and rational pathway has emerged to relate a structure to its immuno-non-reactivity. What Koshland wrote in 1990 cannot yet be dismissed: "Of all the mysteries of modern science, the mechanism of self versus non-self recognition in the immune system ranks at or near the top". Kanduc [8] has reviewed the concept of self/non-self discrimination in the immune system, starting from the historical perspective and the conceptual framework that underlie immune reaction patterns. Dummer and Mittelman et al. [9] have used non-self discrimination as a driving concept in the identification of an immune-dominant epitopic peptide sequence by auto-antibodies from melanoma cancer patients. Following these ideas, in our opinion there is an important analogy between self/non-self protein discrimination in antigen-antibody recognition and drug target/non-drug target discrimination for different diseases, including infections caused by pathogenic organisms such as parasites. In this sense, previous research has shed some light on the search for immunologically self proteins (unique sequences), which may become important in vaccine design, but the computational discovery of Parasite-Self-Proteins or Peptides (PSPs) in 3D structural terms remains a goal of major importance for drug target discovery. Consequently, there is also great interest in the development of free online methods to predict the function of all the peptides found in a protein sample after PMF analysis and how unique they are across different organisms. 
We can summarize all these facts as the need for a free web tool for self/non-self protein and peptide discrimination (PSP/non-PSP discrimination or, in short, PSP/nPSP discrimination). We can apply sequence alignment procedures, such as the very successful BLAST-like procedures [10-16], to predict the function and the PSP scores (in sequence terms) of these proteins or peptides. However, techniques relying on alignment may fail because there are no similar candidates in the databases or because the most similar templates have not been annotated [17]. An alternative is the application of alignment-free Machine Learning methods that predict the protein functional class from sequence (1D) parameters regardless of sequence-sequence similarity [18-21]. These studies are very similar to the Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) methods used for drugs, but applied to large protein and/or drug-protein structures. Generally speaking, a QSAR/QSPR study uses more

Entropy-based prediction of non-self

or less detailed structural indices (composition, sequence, 2D, and/or 3D) of molecular systems (drugs, RNAs, proteins, protein-protein complexes) to predict the biological activity or desired properties, which are usually expensive to measure experimentally, especially for large databases. With this aim, QSAR/QSPR studies make use of Statistical, Artificial Intelligence (Artificial Neural Networks) and/or Machine Learning methods to correlate the structural indices with the property of the systems, obtaining predictor models with different degrees of complexity. No matter the type of system described, the structural descriptor used, or the learning method selected, there is a series of steps common to almost all QSAR/QSPR studies; see recent reviews on this topic [22, 23]. A versatile QSAR/QSPR method was introduced by González-Díaz et al. as Markovian Chemicals In Silico Design (MARCH-INSIDE 1.0) for the computational design of small-sized drugs. In successive studies, we have extended this method to perform fast calculation of 2D and 3D alignment-free numeric parameters, including Shannon entropy-like parameters. We calculated these parameters with the natural powers (k) of a Markovian transition matrix associated with the complex network representation of these systems. For instance, we used the Shannon entropies of a Markovian transition matrix based on molecular vibrations for RNA secondary structures [24]. The method has since been renamed Markov Chains Invariants for Networks SImulation & Design (MARCH-INSIDE 2.0). The approach uses a Markov Chain model (MCM) to calculate parameters of small-sized and also complex chemical structures, including Shannon entropies and others [25-27]. In a very recent review, we discussed the details and many applications of the MARCH-INSIDE method, based on a Markovian transition matrix of a complex network, to discover both drugs and targets in Molecular Microbiology and Parasitology [28]. 
In this work, we approach the QSAR/QSPR problem of 3D-protein selfness in parasites using the Shannon entropy parameters calculated with the MARCH-INSIDE method. In Figure 1 we depict a flowchart of all the steps followed in this work to construct the new classifiers and server.

Materials and methods

Theoretic methods

Figure 1. Representations of a protein (PDB-ID 3hha): (A) 3D structure and (B) complex network.

Entropy parameters of protein structure. In previous works we have predicted protein function based on different protein structural parameters derived from a Markov matrix that accounts for electrostatic interactions between amino acid pairs in the 3D structure of the protein. One of the classes of parameters used was called the Shannon Entropy θk(R) of the Markov matrix. These values are used here as inputs to construct the protein QSAR model for HPs. The detailed explanation has been published before [26, 27, 29-35] and reviewed in detail more recently [36]. The formula for the θk(R) values is given below, followed by some general explanations:

$$\theta_k(R) = -\sum_{i=j\in R} {}^{k}p_{ij}(R) \cdot \log\left[{}^{k}p_{ij}(R)\right] \qquad (1)$$

It is remarkable that the spectral moments are the direct sum of the probabilities kpii(R) with which the effect of the electrostatic interaction propagates from the ith amino acid to other jth amino acids next to it and returns to the ith amino acid after k steps. These probabilities refer to amino acids considered isolated in space (k = 0), direct interactions between amino acids and others in contact with them in the residue network (k = 1), and spatial (k > 1) indirect interactions between amino acids placed at a distance equal to k times the cut-off distance (rij = k·rcut-off). The method uses a Markov Chain Model (MCM) to calculate these probabilities. However, for the sake of simplicity, a truncation or cut-off function αij is applied in such a way that, in a first approximation, a short-term interaction takes place only between neighboring amino acids (αij = 1 if rij < rcut-off). Otherwise, the interaction vanishes (αij = 0). The relationship αij may be visualized in the form of a protein structure complex network. In this network the nodes are the Cα atoms of the amino acids and the edges connect pairs of amino acids with αij = 1. The Euclidean 3D coordinates r3 = (x, y, z) of the Cα atoms are those listed in the protein PDB files. For the calculation, all water molecules and metal ions were removed [28]. All calculations were carried out with our in-house software MARCH-INSIDE 2.0 [28]. The MARCH-INSIDE software always uses the full matrix, never a sub-matrix, but the last summation term may run either over all amino acids or only over some specific protein regions (R). These regions or orbits are often defined in geometric terms and called the core, inner, middle, or surface region. In Figure 2 we depict (A) the 3D structure model of a protein (PDB-ID 3HHA) and (B) the respective complex network graph representation of the protein structure.
At this structural level the nodes are amino acids, and we link two nodes with an edge if the distance between them is lower than 7 Å [37] (an optimal cut-off value recently determined by a scan study). This type of network is also known as a contact map or protein residue network [38-45].
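The residue network and the entropy of Eq. 1 can be illustrated with a minimal Python sketch. It is not the MARCH-INSIDE implementation: the function names are hypothetical, the `weights` argument is only a positive stand-in for the electrostatic term (whose exact form is not reproduced here), and each residue is assumed to be in contact with itself.

```python
import math

def contact_matrix(coords, cutoff=7.0):
    """alpha[i][j] = 1 if two C-alpha atoms are closer than cutoff (angstroms).
    A residue is at distance 0 from itself, so alpha[i][i] = 1 (assumption)."""
    n = len(coords)
    return [[1 if math.dist(coords[i], coords[j]) < cutoff else 0
             for j in range(n)] for i in range(n)]

def markov_matrix(alpha, weights):
    """Row-stochastic matrix p_ij = alpha_ij * w_j / sum_m(alpha_im * w_m).
    weights must be positive; they are a placeholder for the interaction term."""
    matrix = []
    for row in alpha:
        total = sum(a * w for a, w in zip(row, weights))
        matrix.append([a * w / total for a, w in zip(row, weights)])
    return matrix

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(matrix, k):
    """k-th natural power of the matrix (k = 0 gives the identity)."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, matrix)
    return result

def theta(matrix, k, region):
    """Eq. 1: entropy summed over the diagonal terms (i = j) of region R."""
    pk = mat_power(matrix, k)
    return -sum(pk[i][i] * math.log(pk[i][i]) for i in region if pk[i][i] > 0)

# Toy example: three C-alpha atoms on a line, 5 angstroms apart (made-up data).
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
alpha = contact_matrix(coords)
pi1 = markov_matrix(alpha, weights=[1.0, 1.0, 1.0])
theta_values = [theta(pi1, k, region=range(3)) for k in range(6)]
```

Passing a subset of residue indices as `region` restricts the final summation to one orbit while the full matrix is still used for the powers, which mirrors the description above.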


Figure 2. Flowchart for all the steps given in the construction of the classifiers.


In this work, we denote the regions of the protein as c (core), i (inner), m (middle), and s (surface) orbits. The diameters of the orbits, as a percentage of the longest distance rmax with respect to the centre of charge, are: 0 ≤ c ≤ 25, 25 < i ≤ 50, 50 < m ≤ 75, and 75 < s ≤ 100. Additionally, we consider the total orbit (t), which contains all the amino acids in the protein (orbit diameter 0 to 100% of rmax). Consequently, we can calculate different θk(R) values for the amino acids contained in a region (c, i, m, s, or t) and placed at a topological distance k from each other within this orbit (k is the order) [25-27, 46, 47]. In this work, we calculated altogether 5 (types of regions) × 6 (orders considered) = 30 θk(R) indices for each PSP or nPSP.

Dataset. The protein structures were downloaded from the PDB [48] using the following schemes for searching the database: (i) introducing the name of each parasite genus (Plasmodium, Fasciola, Ascaris, Giardia, Trypanosoma, Toxoplasma, Entamoeba, etc.) as the input parameter in the search item called source organism (for positive cases); and (ii) introducing the PDB IDs of all the proteins contained in the list reported by Dobson and Doig [49], which are considered negative cases after curation to confirm the source organism. The dataset consists of 15,341 cases (protein/organism pairs), including 1,733 positive cases of proteins differentially expressed in the query parasite organisms (PSPs) and 13,608 negative cases in a control group of proteins not differentially expressed by the query organism (nPSPs). The list covers more than 20 organisms, including bacteria, viruses, fungi, and parasites, as well as their human or cattle hosts. In total, 1,300 PSPs and 10,212 nPSPs were used to train the model. In addition, 433 PSPs and 3,396 nPSPs were used in the external validation set.
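The orbit definition given before the dataset description can be sketched as follows. This is only an illustration under stated assumptions: `assign_orbits` and `index_names` are hypothetical helpers, the centre of charge is computed as a charge-weighted mean of the coordinates, and equal weights are used when no charges are supplied.

```python
import math

def assign_orbits(coords, charges=None):
    """Label each residue c, i, m or s by its distance from the centre of
    charge, expressed as a percentage of the largest such distance (rmax)."""
    if charges is None:
        charges = [1.0] * len(coords)  # assumption: equal weights
    total = sum(charges)
    centre = tuple(sum(q * c[axis] for q, c in zip(charges, coords)) / total
                   for axis in range(3))
    dists = [math.dist(c, centre) for c in coords]
    rmax = max(dists)
    labels = []
    for d in dists:
        pct = 100.0 * d / rmax if rmax > 0 else 0.0
        if pct <= 25:
            labels.append("c")
        elif pct <= 50:
            labels.append("i")
        elif pct <= 75:
            labels.append("m")
        else:
            labels.append("s")
    return labels

def index_names(regions="cimst", max_order=5):
    """Names of the 5 regions x 6 orders = 30 entropy indices."""
    return [f"theta_{k}({r})" for r in regions for k in range(max_order + 1)]
```

The total orbit t is not returned as a label because it simply contains every residue; it only appears in the list of index names.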
Detailed information about the PDB ID, the values of the electrostatic spectral moment indices, the corresponding observed classification, and the predicted classification for each PSP or nPSP protein/organism pair is given in Table SM1 of the Supplementary Material.

Alignment-free 3D-QSAR analysis of PSPs. Both non-linear Artificial Neural Networks (ANN) and the particular case of the Linear Neural Network (LNN) were used to construct the classifiers. One of the most important steps in this work was the organization of the spreadsheet containing the raw data used as input for the LDA and for the ANN, because this is not a classic classifier and the layout of the data is therefore peculiar. Our aim is to use a two-group discriminant function to classify proteins into two possible groups: proteins that belong to a particular group and proteins that do not belong to this group. To this end, we have to indicate somehow which group we intend to predict in each case. In this regard, we followed the steps below, which have been explained in detail in recent works [50, 51]:

1. First of all, the PSPs (proteins expressed in at least one HPP) were divided according to their genus membership, without specifying the species. The genus of the HPP that differentially expresses a PSP is the first term in the genus-species taxonomic classification of the source organism; this information was obtained from the PDB file of the protein [48].
2. We created a raw data table representing each protein item as a vector made up of 90 3D-structural variables (inputs), one dependent variable (PSPg), and two auxiliary variables, Genus observed (Go) and Genus query (Gq). As auxiliary variables, Go and Gq were never used in the analysis.
3. Using Go = o we express that a given protein is a PSP of (is differentially expressed by) at least one parasite organism that belongs to the observed genus o.
4. Using Gq = q we express that we are using the model to predict whether or not a given protein is a PSP of (is differentially expressed by) at least one parasite organism that belongs to the query genus q.
5. The dependent variable PSPg is a dummy (Boolean) variable: PSPg = yes if Gq = Go = g (the PSP is differentially expressed by a parasite whose observed genus equals the query genus); otherwise PSPg = no. This means that each protein can appear more than once in the raw data. In fact, each protein could be repeated 9 times, corresponding to the 9 different Gq or Go = g = 1, 2, 3…9 values found in these data. The possible Go and Gq values are: 1-Ascaris, 2-Entamoeba, 3-Fasciola, 4-Giardia, 5-Leishmania, 6-Tolypocladium, 7-Toxoplasma, 8-Trypanosoma, and 9-Plasmodium.
6. To sort the data, the first time that a protein appears we used Gq = Go, and consequently PSPg = yes. This means that Gq holds the real genus (Go) of the organism that expresses this protein. In this case, the LDA model has to give the highest probability to the group PSPg = yes, because it has to predict the real organism that expresses the protein.
7. The other times we used Gq ≠ Go (query genus different from the real or observed genus), and then the LDA model has to predict the highest probability for the group PSPg = no, meaning that this is not a PSP for the parasite of this genus. Conversely, non-PSP proteins (or simply nPSPs) have more than one line entry per protein (selected at random) with different Gq values, in a proportion of above 10 nPSPs (decoys) for each PSP.
8. The problem with this type of organization of the raw data is that the 30 θk(R) values are protein constants. Consequently, if the LDA is based only on these values, the model will necessarily fail when we change the Gq values. A drawback in this regard arises if we intend to use the model for a real PSP, since we have only one unspecific prediction and we need 9 specific probabilities (9 possible Gq values and only one real Gq = Go). This means that, for a PSP, a perfect classifier has to predict a probability equal to 1 (or approximately 1) confirming the real Go of the organism that expresses this protein, and 8 probability values approximately equal to 0 for the other possible values of Gq.
9. We can solve this problem by introducing characteristic variables for each Gq, but without giving information in the input about the Go of the organism that expresses a given protein. To this end, we used the average values θk(R)g of each θk(R) for all PSPs expressed in organisms of the same genus. We also calculated the deviation (Δθk(R)g = θk(R) - θk(R)g) of each θk(R) from the θk(R)g of the respective genus indicated in Gq. Altogether, we then have (30 θk(R) values of the PSP) + (30 θk(R)g average values for each Gq) + (30 Δθk(R)g values of deviations for each PSP from the average of the respective Gq) = 90 input variables.
10. It is of major importance to understand that we never used Go or Gq as input, so the model only includes as input the θk(R) values for the protein entry and the averages and deviations of these values for the Gq, which is not necessarily the real Go.

The general formula for the generated LNN model is shown below, where Sg(r3) (the model output) is neither the dependent variable PSPg nor the output probability. Sg(r3) is a real-valued score that predicts the propensity of a protein to act as a PSP differentially expressed in an HPP organism that belongs to genus Gq = g:

$$S_g(r_3) = a_0 + \sum_{k=0,\,o=0}^{k=5,\,o=4} b_k \cdot \theta_k(R) + \sum_{k=0,\,o=0}^{k=5,\,o=4} c_k \cdot \theta_k(R)_g + \sum_{k=0,\,o=0}^{k=5,\,o=4} d_k \cdot \left[\theta_k(R) - \theta_k(R)_g\right]$$

$$= a_0 + \sum_{k=0,\,o=0}^{k=5,\,o=4} b_k \cdot \theta_k(R) + \sum_{k=0,\,o=0}^{k=5,\,o=4} c_k \cdot \theta_k(R)_g + \sum_{k=0,\,o=0}^{k=5,\,o=4} d_k \cdot \Delta\theta_k(R)_g \qquad (2)$$

In fact, the variable Sg(r3) is the output of the LNN model and may be characterized as a real value score proportional to the probability with which the organisms of genus Gq with average values θk(R) g differentially express the PSP protein p with structural parameters θk(R) and deviation Δθk(R) g from the training sub-set of proteins expressed in this genus. All the variables included in the model were standardized in order to bring them onto the same scale [52].
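The construction of the input vectors and of the linear score can be sketched in Python as below. This is a simplified illustration, not the workflow actually used: all function names are hypothetical, the toy proteins carry two indices instead of 30, standardization of the variables is omitted, and the genus labels are used only to build averages and the dependent variable, never as inputs.

```python
def genus_averages(theta_by_protein, genus_by_protein):
    """Average each theta index over all PSPs observed in the same genus."""
    sums, counts = {}, {}
    for pid, theta in theta_by_protein.items():
        genus = genus_by_protein[pid]
        acc = sums.setdefault(genus, [0.0] * len(theta))
        for n, value in enumerate(theta):
            acc[n] += value
        counts[genus] = counts.get(genus, 0) + 1
    return {g: [s / counts[g] for s in acc] for g, acc in sums.items()}

def input_vector(theta, genus_avg):
    """Concatenate theta, the query-genus average and their difference.
    With 30 indices per protein this gives the 90 input variables."""
    deviation = [t - a for t, a in zip(theta, genus_avg)]
    return list(theta) + list(genus_avg) + deviation

def build_rows(theta_by_protein, genus_by_protein, query_genera):
    """One row per (protein, query genus) pair; label is True when Gq = Go."""
    averages = genus_averages(theta_by_protein, genus_by_protein)
    rows = []
    for pid, theta in theta_by_protein.items():
        for gq in query_genera:
            x = input_vector(theta, averages[gq])
            rows.append((pid, gq, x, genus_by_protein[pid] == gq))
    return rows

def linear_score(x, a0, coeffs):
    """Linear score of the form of Eq. 2: a0 plus a weighted sum of inputs."""
    return a0 + sum(c * v for c, v in zip(coeffs, x))

# Toy data (made up): three proteins, two genera, two indices per protein.
thetas = {"p1": [1.0, 2.0], "p2": [3.0, 4.0], "p3": [5.0, 6.0]}
genera = {"p1": "A", "p2": "A", "p3": "B"}
rows = build_rows(thetas, genera, query_genera=["A", "B"])
```

Because the same protein appears once per query genus, only the average and deviation blocks change between its rows, which is what lets a single model return a genus-specific score.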

Results and conclusions

We tried to find a better model by processing our data with different ANNs [53, 54]; we trained four different types of network, namely, Probabilistic Neural Network (PNN), Radial Basis Function (RBF), Multi Layer Perceptron (MLP), and Linear Neural Network (LNN). We report a summary of the ANNs that we trained and tested in Table 1. In Figure 3 we present the topology of some of the networks trained. All these QSAR models are more complex (in terms of number of layers) than the LNN model, and the performance was also higher than 90%. In any case, it should be taken into consideration that the simpler LNN is very large in terms of the number of input parameters (90). Certainly, the high number of cases in training (N = 11,512) excludes the possibility of chance correlation due to an excessive number of parameters. However, we decided to look for better models without restricting model simplicity in terms of layers. In this sense, the simplest model found in terms of the input parameters/number of layers ratio was MLP1 (we excluded MLP2 because it has an additional hidden layer). This study allowed us to confirm that Shannon entropy parameters derived with an MCM are useful predictors for PSP/nPSP discrimination involving 3D structural and not only sequence information. With the LNN we confirmed a useful but very large linear relationship. However, using ANN classifiers we may obtain notably accurate models. This combined strategy may be used to identify and predict proteins/peptides of prokaryote and eukaryote parasites, which may be of interest in drug development or target identification, or serve in anti-parasitic vaccine design.

Table 1. Results of LDA and ANN analysis.
Technique a   Performance Train   Performance Test   Inputs   Hidden(1)   Hidden(2)
PNN    0.93   16249
PNN    0.93   16249
MLP1   0.60
MLP2   0.64
MLP2   0.94
MLP2   0.84   0.85
RBF    0.85
RBF    0.85
LNN    0.94   0.93
LNN    0.94

Figure 3. Topology of some ANN models trained.

References

1. Hotez PJ, Kamath A. Neglected tropical diseases in sub-saharan Africa: review of their prevalence, distribution, and disease burden. PLoS Negl Trop Dis. 2009;3(8):e412. 2. Franco-Paredes C, Bottazzi ME, Hotez PJ. The unfinished public health agenda of chagas disease in the era of globalization. PLoS Negl Trop Dis. 2009;3(7):e470. 3. Hotez PJ. One world health: neglected tropical diseases in a flat world. PLoS Negl Trop Dis. 2009;3(4):e405. 4. Hotez PJ, Wilkins PP. Toxocariasis: America's Most Common Neglected Infection of Poverty and a Helminthiasis of Global Importance? PLoS Negl Trop Dis. 2009;3(3):e400. 5. Utzinger J, Xiao SH, Tanner M, Keiser J. Artemisinins for schistosomiasis and beyond. Curr Opin Investig Drugs. 2007 Feb;8(2):105-16. 6. Matthys B, Tschannen AB, Tian-Bi NT, Comoe H, Diabate S, Traore M, et al. Risk factors for Schistosoma mansoni and hookworm in urban farming communities in western Cote d'Ivoire. Trop Med Int Health. 2007 Jun;12(6):709-23. 7. Ibarra-Velarde F, Vera-Montenegro Y, Huesca-Guillen A, Canto-Alarcon G, Alcala-Canto Y, Marrero-Ponce Y. In silico fasciolicide activity of three experimental compounds in sheep. Ann N Y Acad Sci. 2008 Dec;1149:183-5. 8. Kanduc D. Immunogenicity in peptide-immunotherapy: from self/nonself to similar/dissimilar sequences. Adv Exp Med Biol. 2008;640:198-207. 9. Dummer R, Mittelman A, Fanizzi FP, Lucchese G, Willers J, Kanduc D. Nonself-discrimination as a driving concept in the identification of an immunodominant HMW-MAA epitopic peptide sequence by autoantibodies from melanoma cancer patients. Int J Cancer. 2004 Sep 20;111(5):720-6. 10. Zehetner G. OntoBlast function: From sequence similarities directly to potential functional annotations by ontology terms. Nucleic Acids Res. 2003 Jul 1;31(13):3799-803. 11. Zhou Y, Huang GM, Wei L. UniBLAST: a system to filter, cluster, and display BLAST results and assign unique gene annotation. Bioinformatics. 2002 Sep;18(9):1268-9. 12. Neuwald AF, Poleksic A. PSI-BLAST searches using hidden markov models of structural repeats: prediction of an unusual sliding DNA clamp and of beta-propellers in UV-damaged DNA-binding protein. Nucleic Acids Res. 2000 Sep 15;28(18):3570-80. 13. Muller A, MacCallum RM, Sternberg MJ. Benchmarking PSI-BLAST in genome annotation. J Mol Biol. 1999 Nov 12;293(5):1257-71. 14. Zhang J, Madden TL. PowerBLAST: a new network BLAST application for interactive or automated sequence analysis and annotation. Genome Res. 1997 Jun;7(6):649-56. 15. Durand P, Canard L, Mornon JP. Visual BLAST and visual FASTA: graphic workbenches for interactive analysis of full BLAST and FASTA outputs under MICROSOFT WINDOWS 95/NT. Comput Appl Biosci. 1997 Aug;13(4):407-13. 16. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl Acids Res. 1997;25:3389-402. 17. Han L, Cui J, Lin H, Ji Z, Cao Z, Li Y, et al. Recent progresses in the application of machine learning approach for predicting protein functional class independent of sequence similarity. Proteomics. 2006;6:4023–37. 18. Lin HH, Han LY, Zhang HL, Zheng CJ, Xie B, Chen YZ. Prediction of the functional class of lipid binding proteins from sequence-derived properties irrespective of sequence similarity. J Lipid Res. 2006 Apr;47(4):824-31.


19. Lin HH, Han LY, Cai CZ, Ji ZL, Chen YZ. Prediction of transporter family from protein sequence by support vector machine approach. Proteins. 2006 Jan 1;62(1):218-31. 20. Han LY, Cai CZ, Ji ZL, Cao ZW, Cui J, Chen YZ. Predicting functional family of novel enzymes irrespective of sequence similarity: a statistical learning approach. Nucleic Acids Res. 2004;32(21):6437-44. 21. Han LY, Cai CZ, Ji ZL, Chen YZ. Prediction of functional class of novel viral proteins by a statistical learning method irrespective of sequence similarity. Virology. 2005 Jan 5;331(1):136-43. 22. González-Díaz H, Prado-Prado F, Pérez-Montoto LG, Duardo-Sánchez A, López-Díaz A. QSAR Models for Proteins of Parasitic Organisms, Plants and Human Guests: Theory, Applications, Legal Protection, Taxes, and Regulatory Issues. Curr Proteomics. 2009;6:214-27. 23. González-Díaz H, Vilar S, Santana L, Uriarte E. Medicinal Chemistry and Bioinformatics – Current Trends in Drugs Discovery with Networks Topological Indices. Curr Top Med Chem. 2007;7(10):1025-39. 24. González-Díaz H, de Armas RR, Molina R. Markovian negentropies in bioinformatics. 1. A picture of footprints after the interaction of the HIV-1 PsiRNA packaging region with drugs. Bioinformatics. 2003 Nov 1;19(16):2079-87. 25. Concu R, Podda G, Uriarte E, Gonzalez-Diaz H. Computational chemistry study of 3D-structure-function relationships for enzymes based on Markov models for protein electrostatic, HINT, and van der Waals potentials. J Comput Chem. 2009;30:1510-20. 26. Gonzalez-Diaz H, Saiz-Urra L, Molina R, Gonzalez-Diaz Y, Sanchez-Gonzalez A. Computational chemistry approach to protein kinase recognition using 3D stochastic van der Waals spectral moments. J Comput Chem. 2007 Apr 30;28(6):1042-8. 27. González-Díaz H, Pérez-Castillo Y, Podda G, Uriarte E. Computational Chemistry Comparison of Stable/Nonstable Protein Mutants Classification Models Based on 3D and Topological Indices. J Comput Chem. 2007;28: 1990-5. 28. 
González-Díaz H, González-Díaz Y, Santana L, Ubeira FM, Uriarte E. Proteomics, networks and connectivity indices. Proteomics. 2008;8:750-78. 29. Aguero-Chapin G, Antunes A, Ubeira FM, Chou KC, Gonzalez-Diaz H. Comparative Study of Topological Indices of Macro/Supramolecular RNA Complex Networks. J Chem Inf Model. 2008 Oct 21;48:2265–77. 30. Cruz-Monteagudo M, Munteanu CR, Borges F, Cordeiro MNDS, Uriarte E, Chou K-C, et al. Stochastic molecular descriptors for polymers. 4. Study of complex mixtures with topological indices of mass spectra spiral and star networks: The blood proteome case. Polymer. 2008;49(25):5575-87. 31. Dea-Ayuela MA, Perez-Castillo Y, Meneses-Marcel A, Ubeira FM, BolasFernandez F, Chou KC, et al. HP-Lattice QSAR for dynein proteins: experimental proteomics (2D-electrophoresis, mass spectrometry) and theoretic study of a Leishmania infantum sequence. Bioorg Med Chem. 2008 Aug 15;16(16):7770-6.


32. Aguero-Chapin G, Gonzalez-Diaz H, de la Riva G, Rodriguez E, SanchezRodriguez A, Podda G, et al. MMM-QSAR recognition of ribonucleases without alignment: comparison with an HMM model and isolation from Schizosaccharomyces pombe, prediction, and experimental assay of a new sequence. J Chem Inf Model. 2008 Feb;48(2):434-48. 33. Ferino G, Gonzalez-Diaz H, Delogu G, Podda G, Uriarte E. Using spectral moments of spiral networks based on PSA/mass spectra outcomes to derive quantitative proteome-disease relationships (QPDRs) and predicting prostate cancer. Biochem Biophys Res Commun. 2008 Jul 25;372(2):320-5. 34. Gonzalez-Diaz H, Dea-Ayuela MA, Perez-Montoto LG, Prado-Prado FJ, AgueroChapin G, Bolas-Fernandez F, et al. QSAR for RNases and theoreticexperimental study of molecular diversity on peptide mass fingerprints of a new Leishmania infantum protein. Mol Divers. 2009 Jul 4. 35. Aguero-Chapin G, Varona-Santos J, de la Riva GA, Antunes A, Gonzalez-Villa T, Uriarte E, et al. Alignment-Free Prediction of Polygalacturonases with Pseudofolding Topological Indices: Experimental Isolation from Coffea arabica and Prediction of a New Sequence. J Proteome Res. 2009 Apr 3;8(4):2122-8. 36. Gonzalez-Diaz H, Prado-Prado F, Ubeira FM. Predicting antimicrobial drugs and targets with the MARCH-INSIDE approach. Curr Top Med Chem. 2008;8(18):1676-90. 37. da Silveira CH, Pires DE, Minardi RC, Ribeiro C, Veloso CJ, Lopes JC, et al. Protein cutoff scanning: A comparative analysis of cutoff dependent and cutoff free methods for prospecting contacts in proteins. Proteins. 2009 Feb 15;74(3):727-43. 38. Gupta N, Mangal N, Biswas S. Evolution and similarity evaluation of protein structures in contact map space. Proteins. 2005 May 1;59(2):196-204. 39. Webber CL, Jr., Giuliani A, Zbilut JP, Colosimo A. Elucidating protein secondary structures using alpha-carbon recurrence quantifications. Proteins. 2001 Aug 15;44(3):292-303. 40. Gobel U, Sander C, Schneider R, Valencia A. 
Correlated mutations and residue contacts in proteins. Proteins. 1994 Apr;18(4):309-17. 41. Krishnan A, Zbilut JP, Tomita M, Giuliani A. Proteins as networks: usefulness of graph theory in protein science. Curr Protein Pept Sci. 2008 Feb;9(1):28-38. 42. Krishnan A, Giuliani A, Zbilut JP, Tomita M. Implications from a network-based topological analysis of ubiquitin unfolding simulations. PLoS ONE. 2008;3(5):e2149. 43. Palumbo MC, Colosimo A, Giuliani A, Farina L. Essentiality is an emergent property of metabolic network wiring. FEBS Lett. 2007 May 29;581(13):2485-9. 44. Krishnan A, Giuliani A, Zbilut JP, Tomita M. Network scaling invariants help to elucidate basic topological principles of proteins. J Proteome Res. 2007 Oct;6(10):3924-34. 45. Krishnan A, Giuliani A, Tomita M. Indeterminacy of reverse engineering of Gene Regulatory Networks: the curse of gene elasticity. PLoS ONE. 2007;2(6):e562. 46. González-Díaz H, Saiz-Urra L, Molina R, Santana L, Uriarte E. A Model for the Recognition of Protein Kinases Based on the Entropy of 3D van der Waals Interactions. J Proteome Res. 2007 Feb 2;6(2):904-8.


47. Gonzalez-Diaz H, Molina R, Uriarte E. Recognition of stable protein mutants with 3D stochastic average electrostatic potentials. FEBS Lett. 2005 Aug 15;579(20):4297-301. 48. Ivanisenko VA, Pintus SS, Grigorovich DA, Kolchanov NA. PDBSite: a database of the 3D structure of protein functional sites. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D183-7. 49. Dobson PD, Doig AJ. Distinguishing enzyme structures from non-enzymes without alignments. J Mol Biol. 2003 Jul 18;330(4):771-83. 50. Concu R, Dea-Ayuela MA, Perez-Montoto LG, Prado-Prado FJ, Uriarte E, Bolas-Fernandez F, et al. 3D entropy and moments prediction of enzyme classes and experimental-theoretic study of peptide fingerprints in Leishmania parasites. Biochim Biophys Acta. 2009 Dec;1794(12):1784-94. 51. Concu R, Dea-Ayuela MA, Perez-Montoto LG, Bolas-Fernandez F, Prado-Prado FJ, Podda G, et al. Prediction of enzyme classes from 3D structure: a general model and examples of experimental-theoretic scoring of peptide mass fingerprints of Leishmania proteins. J Proteome Res. 2009 Sep;8(9):4372-82. 52. Kutner MH, Nachtsheim CJ, Neter J, Li W. Standardized Multiple Regression Model. Applied Linear Statistical Models. Fifth ed. New York: McGraw Hill 2005:271-7. 53. Ivanciuc O. Drug Design with Artificial Neural Networks. In: Meyers RA, ed. Encyclopedia of Complexity and System Science. New York: Springer-Verlag 2009. 54. Ivanciuc O. New Neural Networks for Structure-Property Models. In: Diudea MV, ed. QSPR/QSAR Studies by Molecular Descriptors. Huntington, N.Y.: Nova Science 2001:213-31.

Transworld Research Network 37/661 (2), Fort P.O. Trivandrum-695 023 Kerala, India

Complex Network Entropy: From Molecules to Biology, Parasitology, Technology, Social, Legal, and Neurosciences, 2011: 73-88 ISBN: 978-81-7895-507-0 Editors: Humberto González-Díaz, Francisco J. Prado-Prado and Xerardo García-Mera

6. Predicting parasite-host networks with Markov Entropy measures for secondary structures of RNA phylogenetic biomarkers

Humberto González-Díaz1, Santiago Vilar2 and Lázaro Guillermo Pérez-Montoto1
1 Department of Microbiology & Parasitology, University of Santiago de Compostela, 15782, Spain
2 National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health DHHS, Bethesda, Maryland 20892, USA

1. Introduction

Complex networks are present everywhere (or at least objects that we can see as networks in a first approach). We can see as networks drug-target interactions, disease-genome correspondences, whole-cell regulation processes, metabolic reactions, protein-protein interactions, sexual relationships, disease transmission, internet communications, transport, electric power systems, politics, crime, legislative action, scientific collaboration, and many other matters [1-7]. On larger scales one finds networks of cells, as in neural networks, up to the scale of organisms in ecological food webs [8]. The elucidation of structural and functional relationships in these and other chemical, biological, technological, and social networks generates the need for a meaningful ranking of networks with numerical indices often known as Topological Indices (Entropy measures). Entropy measures are numerical indices that describe the connectedness or connectivity between all nodes in a network and are very useful in Computational Chemistry and other sciences to study not local but global network properties [9-14]. On the other hand, Entropy measures based on Markov Chain Models (MCMs) are a very powerful Computational Chemistry tool for describing interesting phenomena in complex systems. We have introduced the MCM method called MARCH-INSIDE (MARkov CHain Invariants for Network SImulation & DEsign) and applied it to different Computational Chemistry problems [15-21]. With this method, we can calculate different types of Entropy measures for systems represented by means of graphs or complex networks. In particular, Entropy measures based on Markov Entropy values and symbolized by kθ are among the most successful of the MARCH-INSIDE indices, with applications to systems ranging from small molecules to proteins. These indices are entropy measures related to all nodes or states in a system (network) separated from each other by at least k steps in the full graph [22-26]. In these models, we put special emphasis on entropy measures because many authors have demonstrated that they are very useful tools for Computational Chemistry; see for instance the interesting works of Graham using entropy to codify the information content of organic molecules and other systems [27-32].

Correspondence/Reprint request: Dr. Humberto González-Díaz, Department of Microbiology & Parasitology, University of Santiago de Compostela, 15782, Spain. E-mail: gonzalezdiazh@yahoo.es

Table 1. Results for some Host-Parasite models.

Entropy Prediction of parasite-host networks

In a recent review, we discussed the application of MARCH-INSIDE indices, including entropies, to the study of antimicrobial drugs, targets, and drug-drug networks. We put emphasis on works focused on antibacterial, antiviral, and antifungal drugs as well as recently reported applications to anti-parasite drug-drug networks [33]. In all these works, we have noted that one can predict which nodes interconnect or not in a complex network (CN) using Entropy measures derived from a graph representation of the nodes of the system. However, we did not find a previous work investigating the efficiency of Entropy measures to solve this problem at different levels. In this work, we present the first study of Entropy measures in this sense, focused on, but not limited to, MARCH-INSIDE entropy-type indices. We tested Entropy measures in two experiments developed at two structural levels: 1) drug-target and 2) parasite-host interactions. This means that, in different experiments, Entropy measures for objects at a lower level can be used to predict networks formed by the same objects at higher levels. For instance, in Experiment 1 we constructed molecular graphs for many anti-parasite and non-active drugs (lower-level network), where the atoms are nodes and the chemical bonds are edges. Next, we calculated the kθ values for some groups of atoms or for the whole molecules. Later, we used these indices to seek a QSAR model and predict which drugs are connected in a drug-target CN because they have a common target parasite (higher-level network) [33]. Similarly, in Experiment 2 we built graphs for the secondary structure of RNAs used as phylogenetic biomarkers of both parasites and hosts. In these graphs (lower-level network) the nodes represent nucleotides and the edges represent nucleotide or hydrogen bonds. Readers may see the works of Marrero-Ponce et al. [34-36], Galindo and Bermúdez et al. [37, 38], and Shu and Bo et al. [39], describing different Entropy measures and graph representations of RNA secondary structure. We also calculated the kθ values of these nucleotides and Entropy measures for the secondary structure of the RNAs. Last, we can use all these indices to seek a QSPR model and predict which parasites and hosts we have to connect in a parasite-host CN because they have a common target (higher-level network). The study is intended to open new trends in the applications of graph theory in biology in general, with special emphasis on Parasitology.

Materials and methods

We used MARCH-INSIDE to calculate the parameters for the RNAs of both parasites and hosts. This software uses as inputs the connectivity table (ct) files generated by the servers developed by Zuker et al., such as Mfold [40] and DINAMelt [41]. These ct files contain the information related to the secondary structure of the RNA molecules. Figure 1 depicts the secondary structure of RNA AJ564112 of the parasite Dactylogyrus auriculatus in order to illustrate the type of graph used as input for the MARCH-INSIDE software. The method uses a Markov Chain associated with the network of the secondary structure of the RNA to calculate Entropy measures and kθ values of the RNA molecules. The indices are derived in the form of invariants of the stochastic matrix 1Π, which is a square matrix that characterizes the secondary structure of the RNA in terms of a network of nucleotide-nucleotide electrostatic interactions. Accordingly, the matrix 1Π contains the probabilities 1pij of reaching a nucleotide (node) ni by moving along a walk of length k = 1 from another node nj:

$${}^{1}p_{ij} = \frac{Q_j}{\sum_{l=1}^{n} \alpha_{il} \cdot Q_l} \qquad (1)$$

Here, Qj is the charge of the node nj, and αij equals 1 if the nodes ni and nj are adjacent in the graph and 0 otherwise. The charge of a node is equal to the sum of the charges of the nucleotide placed at this node. Afterwards, we can derive different invariants to characterize the RNA. For instance, it is straightforward to calculate the spectral moments of this matrix in order to numerically describe some or all nucleotides in the RNA:

Figure 1. Secondary structure of RNA AJ564112 of Dactylogyrus auriculatus.

Entropy Prediction of parasite-host networks

$$ {}^{k}\pi(S) = \sum_{j \in S} {}^{k}\pi(j) = \sum_{i=j}^{n} {}^{k}p_{ij} = \mathrm{Tr}\left[ \left( {}^{1}\Pi \right)^{k} \right] \qquad (2) $$

where Tr is the trace, that is, the sum of all the values on the main diagonal of the matrices kΠ = (1Π)k, which are calculated as natural powers of 1Π [42]. Other Markov measures, such as the entropy (θk) and the average electrostatic potential (ξk), are equally straightforward to calculate (Eqs. 3 and 4).
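The trace formula of Eq. 2 can be illustrated with a short sketch that raises a stochastic matrix to the power k and sums its diagonal. The 2 x 2 matrix used here is a hypothetical example chosen only because its powers are easy to check by hand.

```python
def mat_mul(a, b):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def spectral_moment(pi1, k):
    """Return Tr[(1Pi)^k] for an integer k >= 0 (Eq. 2)."""
    n = len(pi1)
    # Start from the identity matrix, i.e. the zeroth power.
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        power = mat_mul(power, pi1)
    return sum(power[i][i] for i in range(n))


# Hypothetical two-node stochastic matrix (rows sum to 1).
pi1 = [[0.5, 0.5],
       [0.25, 0.75]]

moments = [spectral_moment(pi1, k) for k in range(4)]
print(moments)
```

For this matrix the eigenvalues are 1 and 0.25, so the k-th moment equals 1 + 0.25^k, which gives a quick independent check of the loop.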

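$$ {}^{k}\theta(S) = \sum_{j \in S} {}^{k}\theta(j) = -k \cdot \sum_{j=1}^{n} \left( {}^{k}p_j \right) \cdot \log\left( {}^{k}p_j \right) \qquad (3) $$

$$ {}^{k}\xi(S) = \sum_{j \in S} {}^{k}\xi(j) = \sum_{j=1}^{n} {}^{k}p_j \cdot Q_j \qquad (4) $$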

Please note that in these cases we used absolute probabilities (kpj) instead of the interaction probabilities (kpij) directly. The calculation of the kpj values has been explained in detail in the literature, so we omit it here. In any case, it is worth highlighting that the calculation of the kpj values used for θk and ξk is notably more complicated than the calculation of the kpij values used to determine πk [12, 33].
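The following sketch shows how Eqs. 3 and 4 could be evaluated once the absolute probabilities are available. Because the chapter omits the calculation of kpj, the sketch assumes the standard Markov chain propagation (an initial distribution multiplied k times by 1Π); this propagation, the initial distribution, the `scale` argument standing for the constant in front of the sum in Eq. 3, and all numbers are assumptions for illustration only.

```python
import math


def propagate(p0, pi1, k):
    """Assumed scheme: absolute probabilities after k steps, p_k = p_0 * (1Pi)^k."""
    p = list(p0)
    n = len(p)
    for _ in range(k):
        p = [sum(p[i] * pi1[i][j] for i in range(n)) for j in range(n)]
    return p


def markov_entropy(p, scale=1.0):
    """Eq. 3: -scale * sum_j p_j * log(p_j); zero probabilities contribute nothing."""
    return -scale * sum(x * math.log(x) for x in p if x > 0)


def mean_potential(p, charges):
    """Eq. 4: sum_j p_j * Q_j."""
    return sum(x * q for x, q in zip(p, charges))


# Hypothetical two-node example.
charges = [1.0, 3.0]
p0 = [0.25, 0.75]                 # assumed initial distribution
pi1 = [[0.5, 0.5], [0.5, 0.5]]    # assumed stochastic matrix

p1 = propagate(p0, pi1, 1)
print(markov_entropy(p1), mean_potential(p0, charges), mean_potential(p1, charges))
```

With a different definition of the absolute probabilities only `propagate` would change; the two descriptor functions depend solely on the probability vector and the charges.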

Results and discussion

Phylogenetic-free models for Host-Parasite interactions

With the availability of complete genomic sequences of various hosts and pathogens, together with breakthroughs in proteomics, metabolomics and other experimental areas, the investigation of host–pathogen systems on a multitude of levels of detail has come within reach. Unlike traditional biological research that focuses on a small set of components, systems biology studies the complex interactions between a large number of genes, proteins and other elements of biological networks and systems. Host–pathogen systems biology examines the interactions between the components of two distinct organisms, either a microbial or viral pathogen and its animal host or two different microbial species in a community [43]. In these networks, the nodes are the microbial or host species and the edges (we can use arcs if necessary) indicate the presence of Host-Parasite Relationships (HPRs). To construct these networks it is necessary to confirm (or discard) experimentally the existence of a specific relationship between two species;


which may be a time-consuming task with a high material cost. In this sense, the development of new computational models to predict HPRs is of major interest. On the other hand, our group has vast experience in the calculation of different types of Entropy measures and Kθ values for RNA secondary structures. Sequence-only graphs may contain insufficient information, whereas the 3D RNA structure may be computationally expensive to determine. Considering that RNA folding Entropy measures and Kθ values contain structural information about the molecular biomarkers of both the host and the parasite, we can expect that they may be used to predict HPRs. In addition, because they are alignment-free indices, they may become an alternative to alignment procedures like BLAST. In previous works, we have reported many results on the prediction of different properties, including mycobacterium promoters, miRNAs, fruit-ripening related proteins, and drug-target interactions for HIV RNA regions, with this class of indices (Markov entropy and others) [16, 36, 44-46]. However, until now these indices have not been used to predict parasite-host interactions. In this work we calculated Entropy measures and kθ values for the RNA secondary structure of a specific gene of parasites and fish (hosts) and used these indices to find a model that is able to discriminate host-parasite pairs that interact in nature from those that do not.

$$ S(\mathrm{HPR}) = 0.27 \cdot {}^{1}\theta(p) + 2.37 \cdot {}^{1}\theta(f) + 0.27 \cdot {}^{4}\pi(p-f) + 0.41 \cdot {}^{0}\xi(p-f) + 1.60 \qquad (5) $$

$$ n = 159 \qquad U = 0.56 \qquad p < 0.001 $$

In this equation, S(HPR) is a real-valued output variable that scores the propensity of a parasite-host pair to present an HPR. The symbols used for the different Entropy measures are: kθ(j) = Markov entropy TI of order k; kπ(j) = parasite-fish Markov spectral moment TI of order k; and kξ(j) = Markov average electrostatic potential of order k. In these symbols, if j = p (parasite) the descriptors are calculated for the secondary structure of the RNA transcript corresponding to the mitochondrial gene encoding the protein cytochrome b. In this case, we used only the nucleotides in positions 600-649 to generate the secondary structures. We base this decision on the results of a BLAST analysis carried out with all the parasite sequences (see Figure 2), which shows that for these species this region presents the highest variability and is consequently of the greatest interest for discriminating between species.


Figure 2. BLAST for parasites using as query the sequence AJ564112 of Dactylogyrus auriculatus.

Nevertheless, if j = f (host fish) the descriptors are calculated for partial 18S rRNA sequences. Last, if j = p – f the index is the difference between the respective parasite and fish Entropy measures. For instance, 0ξ(p-f) = 0ξ(p) - 0ξ(f). In total, 47 out of 59 positive HPRs (80%) and 82 out of 100 negative HPRs (82%) in the HPR-CN were correctly classified in training, whereas in Leave-One-Out (LOO) cross-validation 45 out of 59 positive HPRs (80%) and again 82 out of 100 negative HPRs (82%) were correctly classified. This model was selected as the best one among the many models obtained here with different indices. Notably, the first two (most significant) variables that entered model (5) are entropy measures of type kθ(j), which confirms the high success of entropy measures in predicting properties of molecular systems. A summary of all the models found appears in Table 1 and detailed results for all HPRs predicted with model (5) appear in Table 2.
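To make the use of Eq. 5 concrete, the sketch below evaluates S(HPR) for one parasite-host pair, taking each p - f term as the parasite value minus the fish value as defined above. The function and argument names are hypothetical, and the descriptor values in the example are invented placeholders rather than values from Table 2.

```python
def hpr_score(theta1_p, theta1_f, pi4_p, pi4_f, xi0_p, xi0_f):
    """Evaluate Eq. 5 for one parasite (p) and fish host (f) pair."""
    return (0.27 * theta1_p                # first-order Markov entropy, parasite
            + 2.37 * theta1_f              # first-order Markov entropy, fish
            + 0.27 * (pi4_p - pi4_f)       # difference of 4th-order spectral moments
            + 0.41 * (xi0_p - xi0_f)       # difference of 0th-order potentials
            + 1.60)                        # intercept


# Placeholder descriptor values, for illustration only.
score = hpr_score(theta1_p=1.0, theta1_f=1.0,
                  pi4_p=2.0, pi4_f=1.0,
                  xi0_p=3.0, xi0_f=1.0)
print(round(score, 2))
```

Keeping the parasite and fish descriptors as separate arguments makes the sign convention of the difference terms explicit.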


Table 2. Results for HPR-CN predictions for different Dactylogyrus parasites and different hosts.


Figure 3. Tanglegram tree representing the evolutionary relationships between host and parasites.


Figure 4. Host-Parasite networks of: (A) HPR-CN Observed vs. (B) HPR-CN predicted with model.

A detailed inspection of both the real and the predicted HPRs summarized in Table 2 shows that they cannot be derived directly from the phylogenetic proximity between host or parasite species (see also Figure 3). This means that two parasites with high phylogenetic similarity do not necessarily interact with the same host. Conversely, two fish with high phylogenetic similarity do not necessarily interact with the same parasite. This fact makes the prediction of HPRs more interesting and challenging. Using the software Pajek we constructed and drew both the observed CN and a CN predicted from the HPRs that present a probability value p > 0.5 in Table 2. We illustrate these networks in Figure 4 in order to depict visually the high similarity between them (above 80%, as predicted with the LDA function; see previous paragraph).
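The construction of the predicted network can be sketched as a simple thresholding step: a parasite-host pair becomes an edge when its predicted probability exceeds 0.5. The species labels, probabilities and helper names below are hypothetical, and the agreement measure is one simple choice among several, not necessarily the comparison used for Figure 4.

```python
def predicted_edges(probabilities, threshold=0.5):
    """Keep the parasite-host pairs whose predicted probability exceeds the threshold."""
    return {pair for pair, p in probabilities.items() if p > threshold}


def edge_agreement(observed, predicted, all_pairs):
    """Fraction of candidate pairs on which the observed and predicted networks agree."""
    same = sum((pair in observed) == (pair in predicted) for pair in all_pairs)
    return same / len(all_pairs)


# Hypothetical pairs and probabilities, for illustration only.
probabilities = {
    ("parasite_A", "host_X"): 0.9,
    ("parasite_A", "host_Y"): 0.2,
    ("parasite_B", "host_X"): 0.6,
    ("parasite_B", "host_Y"): 0.7,
}
observed = {("parasite_A", "host_X"), ("parasite_B", "host_Y")}

predicted = predicted_edges(probabilities)
agreement = edge_agreement(observed, predicted, list(probabilities))
print(sorted(predicted), agreement)
```

The resulting edge set can then be written to whatever network file format the drawing software expects.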


Acknowledgments

González-Díaz H acknowledges financial support from the Program Isidro Parga Pondal, funded by the Dirección Xeral de Investigación e Desenvolvemento, Xunta de Galicia.

References

1. Fowler JH, Jeon S. The authority of Supreme Court precedent. Social Networks. 2008;30:16-30.
2. Mason O, Verwoerd M. Graph theory and networks in Biology. IET Syst Biol. 2007 Mar;1(2):89-119.
3. Newman ME. Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Phys Rev E Stat Nonlin Soft Matter Phys. 2001 Jul;64(1 Pt 2):016132.
4. Newman ME. Scientific collaboration networks. I. Network construction and fundamental results. Phys Rev E Stat Nonlin Soft Matter Phys. 2001 Jul;64(1 Pt 2):016131.
5. Ohn JH, Kim J, Kim JH. Social network analysis of gene expression data. AMIA Annu Symp Proc. 2003:958.
6. De P, Singh AE, Wong T, Yacoub W, Jolly AM. Sexual network analysis of a gonorrhoea outbreak. Sex Transm Infect. 2004 Aug;80(4):280-5.
7. Johnson JC, Orbach MK. Perceiving the political landscape: ego biases in cognitive political networks. Social Networks. 2002;24:291-310.
8. Bornholdt S, Schuster HG. Handbook of Graphs and Complex Networks: From the Genome to the Internet. Weinheim: WILEY-VCH GmbH & Co. KGaA; 2003.
9. González-Díaz H, Vilar S, Santana L, Uriarte E. Medicinal Chemistry and Bioinformatics – Current Trends in Drugs Discovery with Networks Topological Indices. Curr Top Med Chem. 2007;7(10):1025-39.
10. Vilar S, Cozza G, Moro S. Medicinal chemistry and the molecular operating environment (MOE): application of QSAR and molecular docking to drug discovery. Curr Top Med Chem. 2008;8(18):1555-72.
11. Helguera AM, Combes RD, Gonzalez MP, Cordeiro MN. Applications of 2D descriptors in drug design: a DRAGON tale. Curr Top Med Chem. 2008;8(18):1628-55.
12. González-Díaz H, González-Díaz Y, Santana L, Ubeira FM, Uriarte E. Proteomics, networks and connectivity indices. Proteomics. 2008;8:750-78.
13. González-Díaz H, Mezo M, González-Warleta M, Muíño-Pose L, Paniagua E, Ubeira FM. Network prediction of fasciolosis spreading in Galicia (NW Spain). In: González-Díaz HaM, C.R., ed. Topological Indices for Medicinal Chemistry, Biology, Parasitology, Neurological and Social Networks. Kerala: Transworld Research Network; 2010:191-204.
14. González-Díaz H, Vilar S, Rivero D, Fernández-Blanco E, Porto A, Munteanu CR. QSPR models for cerebral cortex co-activation networks. In: González-Díaz H, ed. Topological Indices for Medicinal Chemistry, Biology, Parasitology, Neurological and Social Networks. Kerala: Transworld Research Network; 2010:179-90.
15. Cruz-Monteagudo M, González-Díaz H, Agüero-Chapin G, Santana L, Borges F, Domínguez RE, et al. Computational Chemistry Development of a Unified Free Energy Markov Model for the Distribution of 1300 Chemicals to 38 Different Environmental or Biological Systems. J Comput Chem. 2007;28:1909-22.
16. González-Díaz H, Agüero-Chapin G, Varona J, Molina R, Delogu G, Santana L, et al. 2D-RNA-Coupling Numbers: A New Computational Chemistry Approach to Link Secondary Structure Topology with Biological Function. J Comput Chem. 2007;28:1049-56.
17. González-Díaz H, Pérez-Castillo Y, Podda G, Uriarte E. Computational Chemistry Comparison of Stable/Nonstable Protein Mutants Classification Models Based on 3D and Topological Indices. J Comput Chem. 2007;28:1990-5.
18. González-Díaz H, Prado-Prado F. Unified QSAR and Network-Based Computational Chemistry Approach to Antimicrobials, Part 1: Multispecies Activity Models for Antifungals. J Comput Chem. 2008;29:656-7.
19. Gonzalez-Diaz H, Saiz-Urra L, Molina R, Gonzalez-Diaz Y, Sanchez-Gonzalez A. Computational chemistry approach to protein kinase recognition using 3D stochastic van der Waals spectral moments. J Comput Chem. 2007 Apr 30;28(6):1042-8.
20. Prado-Prado FJ, Ubeira FM, Borges F, Gonzalez-Diaz H. Unified QSAR & network-based computational chemistry approach to antimicrobials. II. Multiple distance and triadic census analysis of antiparasitic drugs complex networks. J Comput Chem. 2009 May 6;31(1):164-73.
21. Vilar S, González-Díaz H, Santana L, Uriarte E. QSAR Model for Alignment-Free Prediction of Human Breast Cancer Biomarkers Based on Electrostatic Potentials of Protein Pseudofolding HP-Lattice Networks. J Comput Chem. 2008;29:2613-22.
22. González-Díaz H, Molina R, Uriarte E. Markov entropy backbone electrostatic descriptors for predicting proteins biological activity. Bioorg Med Chem Lett. 2004 Sep 20;14(18):4691-5.
23. González-Díaz H, Saíz-Urra L, Molina R, Uriarte E. Stochastic molecular descriptors for polymers. 2. Spherical truncation of electrostatic interactions on entropy based polymers 3D-QSAR. Polymer. 2005;46:2791-8.
24. González-Díaz H, Vina D, Santana L, de Clercq E, Uriarte E. Stochastic entropy QSAR for the in silico discovery of anticancer compounds: prediction, synthesis, and in vitro assay of new purine carbanucleosides. Bioorg Med Chem. 2006 Feb 15;14(4):1095-107.
25. González-Díaz H, Saiz-Urra L, Molina R, Santana L, Uriarte E. A Model for the Recognition of Protein Kinases Based on the Entropy of 3D van der Waals Interactions. J Proteome Res. 2007 Feb 2;6(2):904-8.
26. Cruz-Monteagudo M, González-Díaz H, Borges F, Dominguez ER, Cordeiro MN. 3D-MEDNEs: An Alternative “in Silico” Technique for Chemical Research in Toxicology. 2. Quantitative Proteome-Toxicity Relationships (QPTR) based on Mass Spectrum Spiral Entropy. Chem Res Toxicol. 2008(21):619-32.
27. Graham DJ. Information Content in Organic Molecules: Brownian Processing at Low Levels. J Chem Inf Model. 2007;47(2):376-89.
28. Graham DJ, Schacht D. Base Information Content in Organic Molecular Formulae. J Chem Inf Comput Sci. 2000;40:942.
29. Graham DJ. Information Content in Organic Molecules: Structure Considerations Based on Integer Statistics. J Chem Inf Comput Sci. 2002;42:215.
30. Graham DJ, Malarkey C, Schulmerich MV. Information Content in Organic Molecules: Quantification and Statistical Structure via Brownian Processing. J Chem Inf Comput Sci. 2004;44(1601).
31. Graham DJ, Schulmerich MV. Information Content in Organic Molecules: Reaction Pathway Analysis via Brownian Processing. J Chem Inf Comput Sci. 2004;44(1612).
32. Graham DJ. Information Content and Organic Molecules: Aggregation States and Solvent Effects. J Chem Inf Model. 2005;45(1223).
33. Gonzalez-Diaz H, Prado-Prado F, Ubeira FM. Predicting antimicrobial drugs and targets with the MARCH-INSIDE approach. Curr Top Med Chem. 2008;8(18):1676-90.
34. Marrero-Ponce Y, Nodarse D, González-Díaz H, Ramos de Armas R, Romero-Zaldivar V, Torrens F, et al. Nucleic Acid Quadratic Indices of the “Macromolecular Graph’s Nucleotides Adjacency Matrix”. Modeling of Footprints after the Interaction of Paromomycin with the HIV-1 Ψ-RNA Packaging Region. Int J Mol Sci. 2004;5:276-93.
35. Marrero-Ponce Y, Ortega-Broche SE, Diaz YE, Alvarado YJ, Cubillan N, Cardoso GC, et al. Nucleotide's bilinear indices: Novel bio-macromolecular descriptors for bioinformatics studies of nucleic acids. I. Prediction of paromomycin's affinity constant with HIV-1 Psi-RNA packaging region. J Theor Biol. 2009 Mar 9.
36. Aguero-Chapin G, Antunes A, Ubeira FM, Chou KC, Gonzalez-Diaz H. Comparative Study of Topological Indices of Macro/Supramolecular RNA Complex Networks. J Chem Inf Model. 2008 Oct 21;48:2265-77.
37. Bermudez CI, Daza EE, Andrade E. Characterization and comparison of Escherichia coli transfer RNAs by graph theory based on secondary structure. J Theor Biol. 1999 Mar 21;197(2):193-205.
38. Galindo JF, Bermudez CI, Daza EE. tRNA structure from a graph and quantum theoretical perspective. J Theor Biol. 2006 Jun 21;240(4):574-82.
39. Shu W, Bo X, Zheng Z, Wang S. A novel representation of RNA secondary structure based on element-contact graphs. BMC Bioinformatics. 2008;9:188.
40. Zuker M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 2003 Jul 1;31(13):3406-15.
41. Markham NR, Zuker M. DINAMelt web server for nucleic acid melting prediction. Nucleic Acids Res. 2005;33:W577-W81.
42. Agüero-Chapín G, González-Díaz H, de la Riva G, Rodríguez E, Sánchez-Rodríguez A, Podda G, et al. MMM-QSAR Recognition of Ribonucleases without Alignment: Comparison with HMM model and Isolation from Schizosaccharomyces pombe, Prediction, and Experimental assay of a new sequence. J Chem Inf Mod. 2008;48:434-48.
43. Forst CV. Host–pathogen systems biology. DDT. 2006 March;11(5/6).
44. González-Díaz H, Pérez-Bello A, Cruz-Monteagudo M, González-Díaz Y, Santana L, Uriarte E. Chemometrics for QSAR with low sequence homology: Mycobacterial promoter sequences recognition with 2D-RNA entropies. Chemom Intell Lab Systs. 2007;85:20-6.
45. González-Díaz H, de Armas RR, Molina R. Markovian negentropies in bioinformatics. 1. A picture of footprints after the interaction of the HIV-1 Psi-RNA packaging region with drugs. Bioinformatics. 2003 Nov 1;19(16):2079-87.
46. Gonzalez-Diaz H, de Armas RR, Molina R. Vibrational Markovian modelling of footprints after the interaction of antibiotics with the packaging region of HIV type 1. Bull Math Biol. 2003 Nov;65(6):991-1002.

Transworld Research Network 37/661 (2), Fort P.O. Trivandrum-695 023 Kerala, India

Complex Network Entropy: From Molecules to Biology, Parasitology, Technology, Social, Legal, and Neurosciences, 2011: 89-105 ISBN: 978-81-7895-507-0 Editors: Humberto González-Díaz, Francisco J. Prado-Prado and Xerardo García-Mera

7. Using Shannon entropy to seek a QSPR model for cerebral cortex co-activation networks Humberto González-Díaz Department of Microbiology & Parasitology, Faculty of Pharmacy University of Santiago de Compostela, Santiago de Compostela, 15782, Spain

Correspondence/Reprint request: Dr. Humberto González-Díaz, Department of Microbiology & Parasitology, Faculty of Pharmacy, University of Santiago de Compostela, Santiago de Compostela, 15782, Spain. E-mail: gonzalezdiazh@yahoo.es

1. Introduction

Claude Elwood Shannon was an American electronic engineer and mathematician who is known as the father of information theory [1]. Information theory, and particularly the Shannon entropy parameters defined by Shannon himself in 1948 to measure the amount of information, has several applications in modern science [2]. For instance, Staura et al. [3-5] have used Shannon entropy parameters as molecular descriptors to seek Quantitative Structure-Property Relationship (QSPR) models with different applications. Recently, a novel set of molecular descriptors called SHED (SHannon Entropy Descriptors) was presented by Gregori-Puigjane and Mestres [6]. They are derived from distributions of atom-centered feature pairs extracted directly from the topology of molecules. The value of a SHED is then obtained by applying the information-theoretical concept of Shannon entropy to quantify the variability in a feature-pair distribution. The collection of SHED values reflecting the overall distribution of pharmacophoric features in a molecule constitutes its SHED profile. Similarity between pairs of molecules is then assessed by calculating the Euclidean distance between their SHED profiles.

Also, a box-counting-based algorithm (SEBC) has been developed by Lorenzo and Mosquera [7] for the numerical computation of the Shannon entropy from samples of continuous functions. Its performance was tested by applying it to several samples of known continuous distribution functions. The results obtained with SEBC reproduced those obtained by analytical or numerical integration. SEBC was also employed for computing the Shannon entropies of the steric energy, Sh(E(S)), of several amino acids from their in vacuo NVE molecular dynamics simulations using the AMBER-4 force field. The results obtained correlate linearly with the experimental standard thermodynamic entropies of these compounds. This work resembles a QSPR-like study and points to the possibility of introducing straightforward and reliable calculations of thermodynamic entropies from empirical linear relationships with Sh(E(S)) obtained from MD simulations.

The use of Shannon entropy is not limited to small molecules; we can also use it for protein and gene analysis in genomics and proteomics [8-10]. Last, in neuro-informatic studies authors have also found applications for Shannon entropy parameters. For instance, de Araujo, Tedeschi et al. [11] published the manuscript "Shannon entropy applied to the analysis of event-related fMRI time series". In this paper they propose a new method for the analysis of ER-fMRI based on a specific aspect of information theory: the entropy of a signal using the Shannon formulation, which makes no assumption about the shape of the response. The results show the ability to discriminate between activated and resting cerebral regions for motor and visual stimuli.
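The applications reviewed above share one basic ingredient, the Shannon entropy of a discrete distribution. The sketch below computes it from raw counts and compares two entropy profiles with the Euclidean distance, in the spirit of the SHED profiles; it is a simplified illustration with invented counts and hypothetical function names, not the published SHED definition.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in bits) of the distribution given by non-negative counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # empty bins contribute nothing
    return -sum(p * math.log2(p) for p in probs)


def profile_distance(profile_a, profile_b):
    """Euclidean distance between two equally long descriptor profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(profile_a, profile_b)))


# Invented feature-pair distance histograms for two hypothetical molecules.
molecule_a = [[1, 1, 1, 1], [4, 0, 0, 0]]
molecule_b = [[2, 2, 0, 0], [4, 0, 0, 0]]

profile_a = [shannon_entropy(h) for h in molecule_a]
profile_b = [shannon_entropy(h) for h in molecule_b]
print(profile_a, profile_b, profile_distance(profile_a, profile_b))
```

A uniform histogram gives the maximum entropy for its number of bins, while a histogram concentrated in one bin gives zero, which is why the measure captures the variability of a distribution.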
Moreover, the results on simulated data show that the method is more stable than typical algorithms when the signal-to-noise ratio decreases. On the other hand, we can use complex networks to study different systems from such diverse areas as physics, biology, economics, ecology, and computer science. In current problems of the Biosciences, prominent examples are protein molecular networks in the genome. On larger scales one finds complex networks of cells, as in neural networks, up to the scale of organisms in ecological food webs [12]. The reader may see the results of Anwander, Tittgemeyer et al. [13] on the study of a connectivity-based parcellation of Broca's Area. These authors note that it is generally agreed that the cerebral cortex can be segregated into structurally and functionally distinct areas (that may play the role of network nodes). However, brain function also strongly depends upon area-area anatomical connectivity (edges), which therefore forms a sensible criterion for the functio-anatomical segregation of cortical areas. Diffusion-weighted magnetic resonance (MR)

Shannon entropy model for cerebral cortex networks

imaging offers the opportunity to apply this criterion in the individual living subject. Probabilistic tractographic methods provide excellent means to extract the connectivity signatures from diffusion-weighted MR data sets (functional edges in a network). The correlations among these signatures may then be used by an automatic clustering method to identify cortical regions with mutually distinct and internally coherent connectivity. In any case, cerebral cortex networks may be considered static (at least for a given period of time) or time dependent. For instance, Honey, Kotter, et al. [14] studied how the network structure of the cerebral cortex shapes functional connectivity on multiple time scales. Functional networks recovered from long windows of neural activity (minutes) largely overlap with the underlying structural network. As a result, hubs in these long-run functional networks correspond to structural hubs. In contrast, significant fluctuations in functional topology are observed across the sequence of networks recovered from consecutive shorter (seconds) time windows. The functional centrality of individual nodes varies across time as interregional couplings shift. Furthermore, the transient couplings between brain regions are coordinated in a manner that reveals the existence of two anticorrelated clusters. These clusters are linked by prefrontal and parietal regions that are hub nodes in the underlying structural network. At an even faster time scale (hundreds of milliseconds) the authors detect individual episodes of interregional phase-locking and find that slow variations in the statistics of these transient episodes, contingent on the underlying anatomical structure, produce the transfer entropy functional connectivity and simulated blood oxygenation level-dependent correlation patterns observed on slower time scales.
Many results in this direction (see, for instance, the work of Costa, Kaiser et al.) have revealed a high interest in predicting the connectivity of primate cortical networks from topological and spatial node properties. These authors presented a computational reconstruction approach to the problem of network organization, considering the topological and spatial features of each area in the primate cerebral cortex as the basis for the reconstruction of the global cortical network connectivity. Starting with all areas disconnected, pairs of areas with similar sets of features are linked together in an attempt to recover the original network structure. This type of study may be relevant for the study of different diseases. For instance, Mizuno, Villalobos, et al. [15] examined the functional connectivity between thalamus and cerebral cortex in terms of blood oxygen level dependent signal cross-correlation in subjects with high-functioning autism and matched normal controls, using functional MRI during simple visuo-motor coordination. Both groups exhibited widespread connectivity, consistent with known extensive thalamo-cortical connectivity. In a direct group comparison, overall more extensive connectivity was


observed in the autism group, especially in the left insula and in the right post-central and middle frontal regions. Consequently, numerical parameters such as the Shannon entropy, which quantitatively describe the connectivity of these networks, may be useful to discriminate healthy subjects from patients in neurosciences.

2. Materials and methods

Names and abbreviations used for all areas

Here we give the abbreviations used for the areas in Table 1 and Figure 1, each followed by the name of the area and the relation of our parcellation to other relevant anatomical and physiological schemes. This information comes from many other references cited in the annex of the work of Scannell et al. [16]. 17, area 17, is the primary visual cortex or striate cortex, and 18, area 18, is a retinotopically organized visual area. PMLS is the posteromedial lateral suprasylvian area. AMLS is the anteromedial lateral suprasylvian area, a visual area in the medial wall of the suprasylvian sulcus. VLS is the ventrolateral suprasylvian area, a visual area situated in the posterior wall of the posterior part of the suprasylvian sulcus. PMLS, VLS, and parts of PLLS and AMLS correspond to the Clare-Bishop area of other parcellation schemes. PLLS is the posterolateral lateral suprasylvian area. ALLS, the anterolateral lateral suprasylvian area, is a visual area in the lateral wall of the anterior part of the middle suprasylvian sulcus. DLS, the dorsolateral suprasylvian area, is a visual area in the anterior wall of the posterior part of the suprasylvian gyrus. PLLS, ALLS, and DLS overlap with the lateral suprasylvian area in other parcellation schemes. We have treated PMLS, PLLS, AMLS, ALLS, VLS, and DLS as separate entities for the collation of connection data, but the precise disposition of visual areas in the lateral suprasylvian cortex remains unclear. There is considerable evidence, however, for several visual areas within the lateral suprasylvian sulcus. 21a, area 21a, is a visual area on the posterior part of the suprasylvian gyrus and the superior wall of the posterior part of the suprasylvian sulcus. 21a overlaps with the Clare-Bishop area in the parcellation of Sherk. 21b, area 21b, is a visual area on the posterior part of the suprasylvian gyrus and the posterior wall of the posterior part of the suprasylvian gyrus. 20a, area 20a, is a retinotopically organized visual area on the posterior ectosylvian gyrus. 20b, area 20b, is a retinotopically organized visual area on the posterior ectosylvian gyrus [16]. ALG is a visual area located in the lateral wall of the lateral gyrus between areas 7 and 19. ALG may be part of area 19. Although ALG was included in the literature survey and is shown in Figure 1 (A), it was excluded from most


Table 1. Detailed results for QSAR analysis of Cerebral Cortex regions. logSpred 1.14

logSres 0.04

In/Out

-0.08

logSobs 1.18

train

out

Area2-Aff(ij)

-0.08

1.18

1.14

0.04

train

out

Area4-Aff(ij)

-0.21

1.41

1.50

-0.08

train

out

Area17-Aff(ij)

-0.16

1.32

1.14

0.18

train

out

Area18-Aff(ij)

-0.17

1.34

1.24

0.10

train

out

Area19-Aff(ij)

-0.28

1.53

1.48

0.05

train

out

Area36-Aff(ij)

-0.30

1.56

1.60

-0.04

train

out

Area20a-Aff(ij)

-0.40

1.71

1.66

0.05

train

out

Area20b-Aff(ij)

-0.26

1.49

1.56

-0.07

train

out

Area21b-Aff(ij)

-0.07

1.15

1.24

-0.10

train

out

Area3a-Aff(ij)

-0.11

1.23

1.24

-0.01

train

out

Area3b-Aff(ij)

-0.15

1.30

1.35

-0.05

train

out

Area5al-Aff(ij)

-0.41

1.72

1.65

0.07

train

out

Area5am-Aff(ij)

-0.33

1.60

1.53

0.07

train

out

Area5bl-Aff(ij)

-0.34

1.62

1.60

0.02

train

out

Area5m-Aff(ij)

-0.33

1.61

1.59

0.02

train

out

Area6I-Aff(ij)

-0.29

1.54

1.56

-0.02

train

out

Area6m-Aff(ij)

-0.37

1.67

1.64

0.03

train

out

AreaAES-Aff(ij)

-0.35

1.64

1.67

-0.02

train

out

AreaAI-Aff(ij)

-0.10

1.20

1.19

0.01

train

out

AreaAII-Aff(ij)

-0.16

1.32

0.00

train

out

AreaALLS-Aff(ij)

-0.02

1.04

1.19

-0.15

train

out

AreaAMLSAff(ij)

-0.15

1.30

1.38

-0.08

train

out

AreaAmyg-Aff(ij)

-0.14

1.28

1.24

0.04

train

out

AreaCGp-Aff(ij)

-0.40

1.71

1.60

0.11

train

out

AreaDLS-Aff(ij)

-0.04

1.08

1.01

0.07

train

out

AreaDP-Aff(ij)

-0.11

1.23

1.19

0.04

train

out

AreaER-Aff(ij)

-0.21

1.41

1.44

-0.02

train

out

AreaHipp-Aff(ij)

0.11

0.70

0.84

-0.14

train

out

AreaIa-Aff(ij)

-0.40

1.71

1.68

0.03

train

out

cereb cortex area

Î¸0

Area1-Aff(ij)

Humberto González-Díaz

Table 1. Continued AreaIL-Aff(ij)

-0.16

1.32

1.35

-0.03

train

out

AreaLA-Aff(ij)

-0.41

1.72

1.69

0.03

train

out

AreaP-Aff(ij)

-0.15

1.30

1.24

0.06

train

out

AreaPFCdm-ff(ij)

-0.30

1.56

1.60

-0.04

train

out

AreaPFCr-Aff(ij)

-0.30

1.56

1.58

-0.02

train

out

AreaPFCv-Aff(ij)

-0.26

1.49

1.59

-0.10

train

out

AreaPLLS-Aff(ij)

-0.20

1.40

1.44

-0.04

train

out

AreaPMLS-Aff(ij)

-0.21

1.41

1.38

0.03

train

out

AreaPOA-Aff(ij)

0.04

0.90

0.84

0.06

train

out

AreaPSb-Aff(ij)

0.00

1.00

1.08

-0.08

train

out

AreaRS-Aff(ij)

-0.20

1.40

1.44

-0.04

train

out

AreaSb-Aff(ij)

0.02

0.95

1.01

-0.05

train

out

AreaSIV-Aff(ij)

-0.22

1.43

1.46

-0.03

train

out

AreaSSAi-Aff(ij)

-0.28

1.52

1.55

-0.03

train

out

AreaSSAo-Aff(ij)

-0.29

1.54

1.56

-0.02

train

out

AreaSVA-Aff(ij)

-0.02

1.04

1.01

0.03

train

out

AreaTem-Aff(ij)

0.02

0.95

1.08

-0.12

train

out

AreaV-Aff(ij)

0.00

1.00

0.93

0.07

train

out

AreaVP-Aff(ij)

0.04

0.90

0.93

-0.03

train

out

Area1-Eff(ji)

-0.21

1.41

1.38

0.03

train

Area2-Eff(ji)

-0.23

1.45

1.44

0.01

train

Area7-Eff(ji)

-0.40

1.72

1.64

0.07

train

Area17-Eff(ji)

-0.16

1.32

1.19

0.13

train

Area18-Eff(ji)

-0.18

1.36

1.32

0.04

train

Area35-Eff(ji)

-0.44

1.78

1.71

0.06

train

Area36-Eff(ji)

-0.42

1.75

1.71

0.04

train

Area20a-Eff(ji)

-0.32

1.59

1.55

0.04

train

Area21a-Eff(ji)

-0.21

1.41

1.38

0.03

train

Area21b-Eff(ji)

-0.16

1.32

1.38

-0.06

train

Area3a-Eff(ji)

-0.21

1.41

1.32

0.09

train

Area4g-Eff(ji)

-0.31

1.58

1.48

0.10

train

Area5al-Eff(ji)

-0.23

1.45

1.52

-0.07

train

Area5am-Eff(ji)

-0.18

1.36

1.41

-0.05

train

Area5bm-Eff(ji)

-0.25

1.48

0.00

train

Area5m-Eff(ji)

-0.17

1.34

1.44

-0.09

train

Table 1. Continued

Area6l-Eff(ji)        -0.25   1.48   1.50  -0.02  train
AreaAAF-Eff(ji)       -0.12   1.26   1.24   0.01  train
AreaAES-Eff(ji)       -0.41   1.73   1.68   0.06  train
AreaAI-Eff(ji)        -0.12   1.26   1.24   0.01  train
AreaALG-Eff(ji)        0.06   0.85   0.78   0.07  train
AreaALLS-Eff(ji)      -0.17   1.34   1.44  -0.09  train
AreaAMLS-Eff(ji)      -0.28   1.52   1.55  -0.03  train
AreaCGa-Eff(ji)       -0.38   1.68   1.62   0.06  train
AreaCGp-Eff(ji)       -0.35   1.63   1.53   0.10  train
AreaDLS-Eff(ji)       -0.08   1.18   1.14   0.04  train
AreaEPp-Eff(ji)       -0.28   1.52   1.56  -0.04  train
AreaER-Eff(ji)        -0.28   1.52   1.48   0.04  train
AreaHipp-Eff(ji)       0.04   0.90   0.84   0.06  train
AreaIg-Eff(ji)        -0.42   1.74   1.68   0.06  train
AreaIL-Eff(ji)        -0.17   1.34   1.38  -0.04  train
AreaLA-Eff(ji)        -0.25   1.48           0.00  train
AreaPFCdl-Eff(ji)     -0.25   1.48   1.50  -0.02  train
AreaPFCdm-Eff(ji)     -0.21   1.41   1.44  -0.02  train
AreaPFCr-Eff(ji)      -0.16   1.32           0.00  train
AreaPL-Eff(ji)        -0.18   1.36   1.41  -0.05  train
AreaPLLS-Eff(ji)      -0.27   1.51   1.55  -0.04  train
AreaPMLS-Eff(ji)      -0.23   1.45   1.44   0.01  train
AreaPS-Eff(ji)        -0.23   1.45   1.52  -0.07  train
AreaPSb-Eff(ji)       -0.17   1.34   1.38  -0.04  train
AreaRS-Eff(ji)        -0.19   1.38   1.41  -0.03  train
AreaSII-Eff(ji)       -0.33   1.60   1.52   0.08  train
AreaSIV-Eff(ji)       -0.30   1.56          -0.01  train
AreaSSAi-Eff(ji)      -0.21   1.41   1.48  -0.07  train
AreaSSF-Eff(ji)       -0.24   1.46   1.53  -0.07  train
AreaSVA-Eff(ji)       -0.20   1.40   1.41  -0.01  train
AreaTem-Eff(ji)       -0.11   1.23   1.24  -0.01  train
AreaVLS-Eff(ji)       -0.04   1.08   1.14  -0.06  train
AreaVP-Eff(ji)        -0.11   1.23   1.24  -0.01  train
Area7-Aff(ij)         -0.40   1.71   1.64   0.07         out
Area35-Aff(ij)        -0.39   1.69   1.64   0.05         out

Table 1. Continued

Area21a-Aff(ij)       -0.14   1.28          -0.01         out
Area4g-Aff(ij)        -0.30   1.56   1.53   0.02         out
Area5bm-Aff(ij)       -0.32   1.59   1.58   0.01         out
AreaAAF-Aff(ij)        0.06   0.85   0.84   0.00         out
AreaALG-Aff(ij)        0.06   0.85   0.78   0.07         out
AreaCGa-Aff(ij)       -0.49   1.85   1.74   0.11         out
AreaEPp-Aff(ij)       -0.35   1.64           0.00         out
AreaIg-Aff(ij)        -0.37   1.66   1.62   0.04         out
AreaPFCdl-Aff(ij)     -0.35   1.64   1.65  -0.01         out
AreaPL-Aff(ij)        -0.28   1.53   1.48   0.05         out
AreaPS-Aff(ij)        -0.15   1.30   1.28   0.02         out
AreaSII-Aff(ij)       -0.22   1.43   1.44  -0.01         out
AreaSSF-Aff(ij)       -0.27   1.51   1.59  -0.08         out
AreaVLS-Aff(ij)       -0.08   1.18   1.19  -0.02         out
Area4-Eff(ji)         -0.27   1.51   1.52  -0.01
Area19-Eff(ji)        -0.33   1.61   1.53   0.08
Area20b-Eff(ji)       -0.25   1.48   1.53  -0.06
Area3b-Eff(ji)        -0.21   1.41           0.00
Area5bl-Eff(ji)       -0.20   1.40   1.44  -0.04
Area6m-Eff(ji)        -0.35   1.63   1.59   0.04
AreaAII-Eff(ji)       -0.19   1.38   1.41  -0.03
AreaAmyg-Eff(ji)      -0.33   1.60   1.58   0.03
AreaDP-Eff(ji)        -0.10   1.20   1.24  -0.04
AreaIa-Eff(ji)        -0.37   1.66   1.65   0.01
AreaP-Eff(ji)         -0.21   1.41   1.38   0.03
AreaPFCv-Eff(ji)      -0.18   1.36   1.44  -0.08
AreaPOA-Eff(ji)       -0.12   1.26   1.32  -0.07
AreaSb-Eff(ji)        -0.18   1.36   1.38  -0.02
AreaSSAo-Eff(ji)      -0.23   1.45   1.48  -0.03
AreaV-Eff(ji)          0.00   0.00   2.03  -2.03  cv

In/Out indicator: In for Efferent (ji) activations and Out for Afferent (ij) activations.

of the topological analyses because of the lack of available connectional data. 7, area 7, a region of cortex on the middle part of the suprasylvian gyrus, lateral sulcus and lateral gyrus. The area contains cells responsive to visual, auditory, and somatosensory stimuli and has been implicated in the control of eye movements. AES, anterior ectosylvian sulcus, a region of multimodal cortex containing cells responsive to auditory, visual, and somatosensory stimulation. A large proportion of cells are multimodal. Dorsoposterior parts of the sulcus are dominated by auditory responsivity, the fundus and ventral bank of the middle part are dominated by visually responsive cells, and the anterodorsal part contains a somatosensory representation (SIV) that is considered as a distinct area in this analysis. AES has strong connections with the superior colliculus and may be involved in oculomotor function. SVA, splenial visual area, situated in the splenial sulcus, between area 17 and the cingulate gyrus. SVA may be a visually responsive region of the posterior cingulate cortex [16]. PS, posterior suprasylvian area, a retinotopically organized area on the inferior part of the suprasylvian gyrus and sulcus. AI, primary auditory field.

Figure 1. (A) Views of the parcellation of the cat brain cerebral cortex in areas; the hippocampus and subiculum are not shown. (B) Complex network of region-region co-activation; all the areas shown on the map were included in the analysis, with the exception of TCA (corticoamygdaloid transition area), PPC (prepiriform cortex), and OB (olfactory bulb).

AAF, anterior auditory field. P, posterior auditory field, in the posterior wall of the posterior ectosylvian sulcus. VP, ventroposterior auditory field. AI, AAF, P and VP are the “core” auditory fields showing tonotopic organization with narrowly tuned cells. AII, second auditory field. DP, dorsoposterior auditory field. V, ventral auditory field. SSF, suprasylvian fringe, a thin band of multimodal cortex running along the inferolateral border of the suprasylvian sulcus. EPp, posterior part of the posterior ectosylvian gyrus, a visual and auditory association area. Tem, temporal auditory field. AII, SSF, EPp, DP and Tem lie within the “auditory belt.” They lack the strict tonotopy of the “core” areas. The tonotopic organization of AII, for example, is less orderly than that of the “core” auditory fields, and cells within this area exhibit broad tuning curves.

3a, area 3a. 3b, area 3b. 1, area 1. 2, area 2. These are areas of somatosensory cortex. 3b, 1, and 2 constitute SI and contain one or more somatotopic representations dominated by cells responsive to cutaneous stimulation. 3b, 1, and 2 may constitute a single somatic koniocortical area. 3a contains a single contralateral somatotopic representation dominated by cells responsive to deep stimuli. SII, second somatosensory area, having multiple representations of some of the body regions. The majority of cells respond to superficial stimuli. In contrast to the primary somatosensory areas, SII shows a degree of ipsilateral input [16]. SIV, fourth somatosensory area, occupying the dorsal bank of the anterior part of the anterior ectosylvian sulcus and adjoining anterior ectosylvian gyrus. SIV has an orderly topographic representation of the body surface.

4g, area 4γ, a region of motor cortex that occupies the anterior part of the cruciate sulcus and a small area of surrounding cortex. 4, area 4. This corresponds to areas 4f, 4sf, and 4a that are in the superior and posterior aspects of the cruciate sulcus. 6l, lateral division of area 6, an area of premotor cortex. This area includes all regions of area 6 lateral to the rostral margin of the cruciate sulcus. It corresponds to the lateral region of area 6a. Electrical stimulation in this area may evoke movements. 6m, medial division of area 6. This area consists of regions of area 6 medial to the rostral margin of the cruciate sulcus and corresponds to the medial part of 6aB, 6ª (u, and 6if). 6m contains a region, the medial frontal eye field, where electrical stimulation may elicit eye movements. POA, presylvian oculomotor areas. These areas are located in the medial and lateral walls of the presylvian sulcus and correspond to DLo and VLo and to the lateral frontal eye fields. These two physiologically distinct areas have been considered together because of the relative lack of connectional data. They may be part of area 6.

5am, medial part of area 5a, on the medial side of the anterior part of the lateral gyrus and the medial part of the lateral sulcus. 5al, lateral part of area 5a, on the anterior suprasylvian gyrus and lateral side of the lateral sulcus. 5a overlaps with SIII, the third somatosensory area. 5bm, medial part of area 5b, on the anterior part of the lateral gyrus and the medial side of the lateral sulcus. 5bl, lateral part of area 5b, on the anterior part of the suprasylvian gyrus running into the lateral side of the lateral gyrus. 5m, medial division of area 5, on the anterior part of the medial lateral gyrus. SSAo, outer part of the suprasylvian sulcal division of area 5, in the anterior part of the medial wall of the suprasylvian sulcus. SSAi, inner (deep) part of the suprasylvian sulcal division of area 5, in the anterior part of the medial wall of the suprasylvian sulcus. Area 5 receives somatosensory, visual, and auditory input. 5b may be involved in visuomotor integration, and SSAo and SSAi may have polysensory responsivity.

PFCr, rostral division of the prefrontal cortex. This area overlaps with the dorsal division of prefrontal cortex as defined by Musil and Olson. PFCdl, dorsolateral division of the prefrontal cortex. PFCv, ventral division of the prefrontal cortex. This area overlaps with both the dorsal and infralimbic divisions of the medial prefrontal cortex of Musil and Olson. PFCdm, dorsomedial division of the prefrontal cortex. This area overlaps with the dorsal division of the medial prefrontal cortex of Musil and Olson. Ia, agranular insula. Ig, granular insula. The dorsal insula contains a region responsive to visual stimuli, while more ventral insula regions respond to multimodal stimulation.

CGa, anterior part of cingulate cortex. This corresponds to the anterior part of the posterior cingulate area of Olson and Musil. CGp, posterior part of cingulate cortex. This corresponds to the posterior part of the posterior cingulate area of Olson and Musil. CGa and CGp contain neurons that respond to multimodal sensory stimulation. LA, anterior limbic cortex. This corresponds to the anterior cingulate area of Musil and Olson. RS, retrosplenial cortex. PL, prelimbic area. This corresponds to area 32 and overlaps with the infralimbic division of the medial prefrontal cortex of Musil and Olson. IL, infralimbic area. This corresponds to area 25 of Room. 35, area 35 of the perirhinal cortex. 36, area 36 of the perirhinal cortex. PSb, presubiculum, parasubiculum, and postsubicular cortex. Because of the relative lack of connectional data for these areas in the cat, they were considered together in the analysis. Sb, subiculum. ER, entorhinal cortex [16].

Complex network analysis

We used as input a Shannon entropy-like parameter θ0(j) [4, 5, 17-22], calculated by means of equation (1a), to quantify the amount of local information emitted and/or received by the jth cerebral cortex region. The parameter θ0(j) depends on the value δ(j), which is the number of regions with efferent and/or afferent connections to the jth cerebral cortex region. Consequently, δ(j) is the node degree of the jth node in the complex network represented in Figure 1 (B). In this complex network the cerebral regions are nodes, and two nodes are connected by an arc if there is at least a weak afferent (ij-arc) or efferent (ji-arc) connection between the respective ith and jth regions. In fact, δ(j) is a total node degree equal to the sum of the input and output connections of the jth region, see equation (1b). θ0(j) is a Shannon entropy-like parameter because it has the form −p(j)·log p(j), where p(j) = 1/δ(j) is a probability that measures the tendency of the node to be connected by means of one specific ij-arc or ji-arc (input or output arc) out of the total number of arcs δ(j) of the jth node. The value p(j) obeys the conditions for a probability because it is a real value in the domain (0, 1] and it also obeys the normalization condition, see equation (1c):

$$\theta_0(j) = -p(j) \cdot \log p(j) = -\frac{1}{\delta(j)} \cdot \log \frac{1}{\delta(j)}$$ (1a)

$$\delta(j) = \delta(j)_{in} + \delta(j)_{out}$$ (1b)

$$p_0 = \sum_{1}^{\delta(j)} p(j) = \sum_{1}^{\delta(j)} \frac{1}{\delta(j)} = 1$$ (1c)
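Equations (1a) and (1b) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `theta0`, the base-10 logarithm, and the example degrees are hypothetical and are not taken from the chapter or from the cat cortex data.

```python
import math


def theta0(delta_in, delta_out):
    """Shannon entropy-like parameter of eq. (1a) for one node.

    delta_in, delta_out: numbers of input and output arcs of the node.
    The logarithm base (10) is an assumption; the text does not state it.
    """
    delta = delta_in + delta_out            # eq. (1b): total node degree
    if delta <= 0:
        raise ValueError("the node needs at least one arc")
    p = 1.0 / delta                         # probability of one specific arc
    # -p * log(p) written as p * log(delta), which is the same quantity
    return p * math.log10(delta)


# Hypothetical degrees, for illustration only
print(theta0(3, 2))   # node with delta = 5
print(theta0(1, 0))   # delta = 1 gives p = 1 and theta0 = 0
```

The second call shows why a node with a single arc carries no entropy in this scheme: p(j) = 1 and log 1 = 0.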

Statistical analysis

The θ0(j) values were calculated for all cerebral cortex regions (areas), and we then proceeded with the statistical analysis. The variables were standardised and a Multiple Linear Regression (MLR) was carried out using the software package STATISTICA [23]. The MLR model has the form of equation (2):

$$\log S(j)_{pred} = \log S(j)_{obs} + \varepsilon = b \cdot \theta_0(j) + a$$ (2)

The variable S(j) is a real-valued score for the biological property under investigation: the activation of the jth cerebral cortex region, which is represented as the jth node in the complex network. S(j) is the sum of the row for the jth region or area in the matrix of cortico-cortical connections in the cat reported by Scannell et al. [16]. The parameter logS(j)obs was calculated by taking the logarithm of S(j). The values a and b are the coefficients obtained for the regression function (QSAR model) [24, 25]. The statistical quality of the models was assessed using parameters such as the regression coefficient (R), the Fisher ratio (F), and the level of error (p) [24, 25].
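As a rough illustration of the single-descriptor regression in equation (2), the sketch below fits log S = b·θ0 + a by ordinary least squares and reports the regression coefficient R. The chapter used STATISTICA for this step; the plain-Python function `fit_line` and the toy data are assumptions made only for this example.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit of y = b * x + a; returns (a, b, r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                      # slope
    a = mean_y - b * mean_x            # intercept
    r = sxy / math.sqrt(sxx * syy)     # Pearson correlation coefficient
    return a, b, r


# Toy data (hypothetical): x plays the role of theta0(j), y of log S(j)obs
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
a, b, r = fit_line(x, y)
print(a, b, r)
```

With real data, x would hold the θ0(j) values of the training cases and y the corresponding logS(j)obs values.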


3. Results

The best model found in this work, see equation (3), showed excellent results in training with R = 0.96, which indicates that the model accounts for 92.7% of the variance of the data. In addition, the model presented R = 0.97 in an external validation series with N = 31 cases not used to train the model. We found one outlier, the case AreaV-Eff(ji) (case 32). This input case corresponds to the efferent activation of cerebral cortex region Area V. This case becomes an outlier because it presents δ(j) = 1, which implies θ0(j) = 0, and a predicted value of 2.03, which is outside the standard residual limit of ± 2. Detailed information for all brain regions analyzed, including the values of θ0(j) used in the QSAR equation and the type of input/output activation, appears in Table 1.

$$\log S = 2.03 \cdot \theta_0(j) - 7.88$$ (3)

N = 98, R = 0.96, F = 1219.03, p < 0.001
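The ± 2 limit on standard residuals mentioned above can be checked with a short script. This is a minimal sketch under assumptions: the chapter does not say how the residuals were standardised, so here each raw residual is simply divided by the sample standard deviation of all residuals; the function name `flag_outliers` and the data are hypothetical.

```python
import statistics


def flag_outliers(observed, predicted, limit=2.0):
    """Return the indices of cases whose scaled residual exceeds the limit.

    Scaling convention (an assumption): raw residual divided by the sample
    standard deviation of all residuals.
    """
    residuals = [o - p for o, p in zip(observed, predicted)]
    spread = statistics.stdev(residuals)
    return [i for i, res in enumerate(residuals) if abs(res / spread) > limit]


# Hypothetical values: nine well-fitted cases and one poorly fitted case
observed = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.02, 0.98, 3.0]
predicted = [1.0] * 10
print(flag_outliers(observed, predicted))
```

Only the last case is flagged here, because its residual is several times larger than the spread of the others.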

We also illustrate graphically in Figure 2 the excellent fit of this equation for the training and external validation series together. On the other hand, the best model found in a previous work for the same problem [26], see equation (4), also showed excellent results in training with R = 0.94, which indicates that the model accounts for 88% of the variance of the data. In addition, that model presented R = 0.86 in an external validation series with N = 32 cases not used to train the model. We also reported the adjusted value of the regression coefficient in both the training and validation series, which removes the effect of over-fitting due to an excessive number of variables. These values are also high, Radj = 0.88 and Radj = 0.73, and very similar to the non-adjusted values in training and validation, but notably lower than the value of R = 0.96 for our equation. In any case, that model is based on four different parameters (node centralities) of the complex network, which require more complicated calculations, whereas our entropy equation uses only one parameter.

$$\log S = 0.0242 \cdot \delta(j) + 0.0128 \cdot V(j) - 0.00007 \cdot B(j) - 15.4112 \cdot H(j) + 1.6503$$ (4)

N = 98, R = 0.94, F = 176.75, p < 0.001
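The effect of the number of variables on the adjusted statistic can be sketched as follows. The chapter does not state which adjustment formula was used; the code assumes the common adjusted R² = 1 − (1 − R²)(N − 1)/(N − p − 1), where p is the number of predictors, and the function name `adjusted_r2` is hypothetical.

```python
def adjusted_r2(r, n_cases, n_predictors):
    """Adjusted R^2 from the regression coefficient R (assumed formula)."""
    r2 = r * r
    return 1.0 - (1.0 - r2) * (n_cases - 1) / (n_cases - n_predictors - 1)


# Training statistics quoted in the text: N = 98 for both models
print(round(adjusted_r2(0.94, 98, 4), 2))   # four-parameter model, eq. (4)
print(round(adjusted_r2(0.96, 98, 1), 2))   # one-parameter entropy model, eq. (3)
```

Under this assumed formula the four-parameter training value comes out at about 0.88, in line with the adjusted training value quoted above, while the one-parameter entropy model is barely penalised.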

In this last equation the input parameters were the so-called graph-theoretical indices. These indices are numerical parameters easily derived from graphs/networks that may be used in QSPR studies. QSPR studies have been used for the construction of models that predict the properties of complex systems based on numerical parameters describing the structure of the system. In particular, Topological Indices (TIs), Connectivity Indices (CIs), or Node centralities Ct(j) of type t may be calculated from the graph/network representation of a system to describe full network topology, node connectivity, or sub-graph branching. This class of indices is one of the most flexible for QSPR studies; see the recent reviews by González-Díaz et al. [27-31]. In this work, we report for the first time a QSPR study of the cat cerebral cortex network using TIs derived from node centralities Ct(j) calculated with the software CentiBiN [32].

Figure 2. Plot of Observed (logS(j)obs) vs. Predicted (logS(j)pred) values using our entropy equation.

4. Conclusions

We demonstrated that entropy indices are of general use at different levels of structural organization. In particular, we showed that they may be used to seek QSPR models that predict cerebral cortex co-activation. This approach opens a new gateway for the extension of classic TIs to other uses in the neurological sciences.


Acknowledgments

González-Díaz H. and Munteanu C. R. acknowledge the funding for a research position by the Isidro Parga Pondal Programme, Xunta de Galicia, Spain.

References

1. Sloane NJA, Wyner AD, eds. Claude Elwood Shannon, Collected Papers: IEEE Press.
2. Shannon CE. A Mathematical Theory of Communication. Bell System Technical Journal. 1948;27:379-423.
3. Godden JW, Stahura FL, Bajorath J. Variability of molecular descriptors in compound databases revealed by Shannon entropy calculations. J Chem Inf Comput Sci. 2000 May;40(3):796-800.
4. Stahura FL, Godden JW, Bajorath J. Differential Shannon entropy analysis identifies molecular property descriptors that predict aqueous solubility of synthetic compounds with high accuracy in binary QSAR calculations. J Chem Inf Comput Sci. 2002 May-Jun;42(3):550-8.
5. Stahura FL, Godden JW, Xue L, Bajorath J. Distinguishing between natural products and synthetic molecules by descriptor Shannon entropy analysis and binary QSAR calculations. J Chem Inf Comput Sci. 2000 Sep-Oct;40(5):1245-52.
6. Gregori-Puigjane E, Mestres J. SHED: Shannon entropy descriptors from topological feature distributions. Journal of chemical information and modeling. 2006 Jul-Aug;46(4):1615-22.
7. Lorenzo L, Mosquera RA. A box-counting-based algorithm for computing Shannon entropy in molecular dynamics simulations. J Comput Chem. 2003 Apr 30;24(6):707-13.
8. Stewart JJ, Lee CY, Ibrahim S, Watts P, Shlomchik M, Weigert M, et al. A Shannon entropy analysis of immunoglobulin and T cell receptor. Mol Immunol. 1997 Oct;34(15):1067-82.
9. Frappat L, Minichini C, Sciarrino A, Sorba P. Universality and Shannon entropy of codon usage. Phys Rev E Stat Nonlin Soft Matter Phys. 2003 Dec;68(6 Pt 1):061910.
10. Zhang Y. Relations between Shannon entropy and genome order index in segmenting DNA sequences. Phys Rev E Stat Nonlin Soft Matter Phys. 2009 Apr;79(4 Pt 1):041918.
11. de Araujo DB, Tedeschi W, Santos AC, Elias J, Jr., Neves UP, Baffa O. Shannon entropy applied to the analysis of event-related fMRI time series. Neuroimage. 2003 Sep;20(1):311-7.
12. Bornholdt S, Schuster HG. Handbook of Graphs and Complex Networks: From the Genome to the Internet. Weinheim: Wiley-VCH GmbH & Co. KGaA. 2003.
13. Anwander A, Tittgemeyer M, von Cramon DY, Friederici AD, Knosche TR. Connectivity-Based Parcellation of Broca's Area. Cereb Cortex. 2007 Apr;17(4):816-25.
14. Honey CJ, Kotter R, Breakspear M, Sporns O. Network structure of cerebral cortex shapes functional connectivity on multiple time scales. Proc Natl Acad Sci U S A. 2007 Jun 12;104(24):10240-5.
15. Mizuno A, Villalobos ME, Davies MM, Dahl BC, Muller RA. Partially enhanced thalamocortical functional connectivity in autism. Brain Res. 2006 Aug 9;1104(1):160-74.
16. Scannell JW, Blakemore C, Young MP. Analysis of connectivity in the cat cerebral cortex. J Neurosci. 1995 Feb;15(2):1463-83.
17. Graham DJ. Information Content in Organic Molecules: Brownian Processing at Low Levels. Journal of chemical information and modeling. 2007;47(2):376-89.
18. Graham DJ, Schacht D. Base Information Content in Organic Molecular Formulae. J Chem Inf Comput Sci. 2000;40:942.
19. Graham DJ. Information Content in Organic Molecules: Structure Considerations Based on Integer Statistics. J Chem Inf Comput Sci. 2002;42:215.
20. Graham DJ, Malarkey C, Schulmerich MV. Information Content in Organic Molecules: Quantification and Statistical Structure via Brownian Processing. J Chem Inf Comput Sci. 2004;44:1601.
21. Graham DJ, Schulmerich MV. Information Content in Organic Molecules: Reaction Pathway Analysis via Brownian Processing. J Chem Inf Comput Sci. 2004;44:1612.
22. Graham DJ. Information Content and Organic Molecules: Aggregation States and Solvent Effects. Journal of chemical information and modeling. 2005;45:1223.
23. STATISTICA-6.0. 6.0 ed. Tulsa, OK, U.S.A.: StatSoft Inc. 2002.
24. Marrero-Ponce Y, Khan MT, Casanola Martin GM, Ather A, Sultankhodzhaev MN, Torrens F, et al. Prediction of Tyrosinase Inhibition Activity Using Atom-Based Bilinear Indices. ChemMedChem. 2007 Apr 16;2(4):449-78.
25. Castillo-Garit JA, Marrero-Ponce Y, Torrens F, Rotondo R. Atom-based stochastic and non-stochastic 3D-chiral bilinear indices and their applications to central chirality codification. J Mol Graph Model. 2007 Jul;26(1):32-47.
26. González-Díaz H, Vilar S, Rivero D, Fernández-Blanco E, Porto A, Munteanu CR. QSPR models for cerebral cortex co-activation networks. In: González-Díaz H, ed. Topological Indices for Medicinal Chemistry, Biology, Parasitology, Neurological and Social Networks. Kerala: Transworld Research Network 2010:179-90.
27. Pérez-Montoto LG, Prado-Prado F, Ubeira FM, González-Díaz H. Study of Parasitic Infections, Cancer, and other Diseases with Mass-Spectrometry and Quantitative Proteome-Disease Relationships. Curr Proteomics. 2009;6:246-61.
28. González-Díaz H, Prado-Prado F, Pérez-Montoto LG, Duardo-Sánchez A, López-Díaz A. QSAR Models for Proteins of Parasitic Organisms, Plants and Human Guests: Theory, Applications, Legal Protection, Taxes, and Regulatory Issues. Curr Proteomics. 2009;6:214-27.
29. González-Díaz H, Prado-Prado F, Ubeira FM. Predicting antimicrobial drugs and targets with the MARCH-INSIDE approach. Curr Top Med Chem. 2008;8(18):1676-90.
30. González-Díaz H, González-Díaz Y, Santana L, Ubeira FM, Uriarte E. Proteomics, networks and connectivity indices. Proteomics. 2008;8:750-78.
31. González-Díaz H, Vilar S, Santana L, Uriarte E. Medicinal Chemistry and Bioinformatics – Current Trends in Drugs Discovery with Networks Topological Indices. Curr Top Med Chem. 2007;7(10):1025-39.
32. Junker BH, Koschuetzki D, Schreiber F. Exploration of biological network centralities with CentiBiN. BMC Bioinformatics. 2006 Apr 21;7(1):219.

Transworld Research Network 37/661 (2), Fort P.O. Trivandrum-695 023 Kerala, India

Complex Network Entropy: From Molecules to Biology, Parasitology, Technology, Social, Legal, and Neurosciences, 2011: 107-114 ISBN: 978-81-7895-507-0 Editors: Humberto González-Díaz, Francisco J. Prado-Prado and Xerardo García-Mera

8. Criminal law networks, Markov chains, Shannon entropy and artificial neural networks

Aliuska Duardo-Sanchez
Department of Especial Public Law, Financial and Tributary Law Area, Faculty of Law, USC, Santiago de Compostela, 15782, Spain

Correspondence/Reprint request: Dr. Aliuska Duardo-Sanchez, Department of Especial Public Law, Financial and Tributary Law Area, Faculty of Law, USC, Santiago de Compostela, 15782, Spain. E-mail: aliuskaduardo@yahoo.es

1. Introduction

The artificial neural network (ANN), or simply neural network, is a machine learning method that evolved from the idea of simulating the human brain. The data explosion in modern drug discovery research requires sophisticated analysis methods to uncover the hidden causal relationships between single or multiple responses and a large set of properties. The ANN is one of many versatile tools that meet this demand in drug discovery modeling. Compared to a traditional regression approach, the ANN is capable of modeling complex nonlinear relationships. The ANN also has excellent fault tolerance and is fast and highly scalable with parallel processing. This chapter introduces the background of ANN development and outlines the basic concepts crucially important for understanding more sophisticated ANN. Several commonly used learning methods and network setups are discussed briefly at the end of the chapter [1, 2]. A recent review described the main neural network architectures and gave examples of their application to food analytical problems, together with some considerations about their uses and misuses [3]. In addition, another recent work recognised that ANNs have been applied to many more diverse problems, ranging from speech recognition to prediction of protein secondary structure, classification of cancers, and gene prediction.

The vast range of applications makes ANNs a candidate method for the study of complex systems present in legal sciences too. For instance, prediction of criminal recidivism has been extensively studied in criminology with a variety of statistical models. An interesting article proposed the use of ANN models to address the problem of splitting the population into two groups, non-recidivists and eventual recidivists, based on a set of predictor variables [4]. The results of an empirical study of the classification capabilities of ANN on a well-known recidivism data set are presented and discussed in comparison with logistic regression. The analysis indicates that ANN models are competitive with, and may offer some advantages over, traditional statistical models in this domain. Last, the assignment of different weights to risk factors in actuarial formulas for the assessment of violence risk in criminal offenders has been debated. The authors of one study explore the predictive validity of an index with 10 well-established risk factors for criminal recidivism with respect to violent reconvictions among 404 former forensic psychiatric examinees in Sweden. Four different weighting conditions are tested experimentally, including Nuffield's method, bivariate and multivariate logistic regression, and an ANN procedure [5]. Simpler weighting techniques do not improve predictive accuracy over that of a nonweighted reference, and the more complex procedures yield a statistical shrinkage effect.
The authors hypothesize that the general lack of causal risk factors in prediction models may contribute to the observed low utility of weighting techniques.

On the other hand, Graph theory and Complex Network analysis tools are expanding to new potential fields of application of Information Sciences at different levels, from molecular to population, social, or technological, such as genome networks, protein-protein networks, sexual disease transmission networks, electric power networks, or the internet [6]. In particular, the relationships among social actors, as well as the relationships among actors at different levels of analysis (such as persons and groups), are the subject of intensive investigation [7]. Network analysis provides a common approach for all those disciplines involved in the study of social structure [8-11] that are susceptible to network depiction. The concept of social structure is mainly used in sociology and social theory. Although there is no agreement among theorists, it can refer to a specific type of relation between entities or groups, to enduring patterns of behavior and relationship within a society, or to social institutions and norms becoming embedded into social systems. For a more complete review of Social Network Analysis (SNA), see the in-depth review by Newman entitled The Structure and Function of Complex Networks [12]. If we consider that a network is a set of items, usually called nodes, with connections between them, called edges [13], then a social network is the representation of social relationships in terms of nodes and ties, where nodes are the individual actors within the network and ties are the relationships between these actors [6]. In fact, SNA is nothing new in social science studies; in the early 1930s, sociologists had already built a social network to study friendships between school children [14]. Since then, the importance of the network approach in the social sciences has increased greatly, and its applications range from interrelations between family members [15] to business interactions between companies [16, 17] or patterns of sexual contacts [18, 19].

Although the network approach is so pervasive in the social sciences, its application in the field of Law is still weak. Network tools and methodologies might be useful to illustrate the interrelation between the different types of law, or to check the importance of a specific instrument as well as the respect of legislators for the normative hierarchy, which requires that the most important matters for individuals be regulated through legal instruments approved by the most representative democratic actors. They can also help to understand the consequences of laws for life in society and their effectiveness or lack of it. In this sense, network Shannon entropy parameters may be of high interest for developing a general methodology for the search of Quantitative Structure-Property Relationships (QSPR) models. A recent book chapter discussed many applications of entropy parameters of different types of networks [20]. In this chapter, we propose the study of multiple systems using node centrality or connectedness information measures derived from a Graph or Complex Network.
The information is quantified in terms of the Entropy centrality θk(j) of the jth parts or states (nodes) of a Markov Chain associated to the system, represented by a network graph. The procedure is the same for all systems regardless of their complexity. First, we define the phenomenon to study, ranging from molecular systems composed of single molecules (drug activity, drug toxicity), multiple molecules (networks of chemical reactions), and macromolecules (DNA-Drug interaction, protein function), to ecological systems (bacterial co-aggregation) or social systems (criminal causation, legislative productivity). Secondly, we collect several cases from the literature (drugs, chemical reactions, proteins, bacterial species, or criminal cases). Next, we classify the cases in at least two different groups (active/non-active drugs, enantioselective/non-enantioselective reactions, functional/non-functional proteins, co-aggregating/non-co-aggregating bacteria, crime/non-crime cause, or efficient/non-efficient law). After that, we represent the interconnectivity of the discrete parts of the system (atoms, amino acids, reactants, bacterial species, or people) as a graph or network. The Markov Chain theory is used to calculate the Entropy of the system for nodes placed at different distances. Last, we derive and validate a classification model using the Entropy values as input variables and the classification of cases as the output variable. The model is used to predict the probability with which a case presents the studied property. The referred work proposed the Entropy of a Markov Chain associated to a network or graph as a universal quantity in pattern recognition, regardless of the chemical, biological, social, or other nature of the systems under study. These QSPR models connect the structure of the system (drug, protein, microorganism, people, social groups, internet ...criminal law networks) with its properties and can be used to predict the behavior of these systems in different situations. The algorithm used to connect the entropy parameters with the desired property may be a linear equation or, precisely, an ANN.

Our group has introduced a Markov model (MM) method named MARCH-INSIDE: Markovian Chemicals In Silico Design. MARCH-INSIDE generates TIs in the form of matrix invariants such as stochastic entropies, spectral moments, or absolute probabilities for the study of molecular properties. Recently the method has been renamed MARCH-INSIDE 2.0: Markov Chain Invariants for Network Simulation & Design, in order to give a clearer idea of its unexplored potentialities. Recent reviews about MARCH-INSIDE and similar QSPR methods have been published by González-Díaz et al., including discussion of multiple applications in different fields [21-25]. In the present work, we decided to illustrate the study of criminal networks with the combined use of two computer programs that perform the above-mentioned calculations.
First, we use MARCH-INSIDE to create criminal networks and calculate their entropy parameters. Next, we use STATISTICA to train ANNs that classify criminal actions, using as inputs the entropy parameters calculated with MARCH-INSIDE.

Materials and methods

Markov Chain approach to Shannon Entropy θk(j) for actions in Crime networks

First, we need to construct the crime causality Markov matrix 1Π. This matrix is built as a square matrix (n × n), where n is the number of actions related to the crime, including the original actions (causes), the co-actions (secondary causes), and the consequence (crime). The matrix 1Π contains the transition probabilities (1pij) that action i is the cause of another action j, or at least that action j occurs immediately after action i in the crime. The probabilities 1pij may be calculated using Eq. 28 and 29. δj represents the number of actions that occurred immediately after the i-th action. In addition, we use the vector of absolute initial probabilities π0; see Eq. 26. This vector lists the absolute initial probabilities 0pj to reach a node ni from a randomly selected node nj. Here we consider the initial probability to be the inverse of the dimension (N, number of nodes) of the shp connecting ni with nii. Next, we used the theory of Markov chains to calculate the criminal causation entropy centrality kθ(i,ii):

$${}^{k}\theta(i, ii) = -\sum_{j \in shp} {}^{k}\theta(j) = -\sum_{j \in shp} {}^{k}p_j \log {}^{k}p_j$$ (1)

In this equation the values kpj are the absolute probabilities of reaching the nodes nj by a walk of length k starting from node ni. The sum runs only over the nodes that lie on the shp connecting ni with nii. The Chapman-Kolmogorov equations were used to calculate the vector πk containing the kpj values from the vector π0 of initial probabilities (0pj) and the matrix 1Π of first-step transition probabilities (1pij):

$$ \pi_k = \pi_0 \times {}^k\Pi = \pi_0 \times ({}^1\Pi)^k \qquad (2) $$
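Equations (1) and (2) can be illustrated with a minimal sketch. It is not the MARCH-INSIDE implementation: the toy network, the function names, the uniform transition probabilities (one over the number of successor actions) and the uniform start vector over all nodes are assumptions made only for illustration, and the natural logarithm is assumed because the text does not state a base.

```python
import math

def transition_matrix(adj):
    """Row-normalise a 0/1 adjacency matrix into a first-step matrix 1Pi."""
    n = len(adj)
    pi1 = []
    for i, row in enumerate(adj):
        out_degree = sum(row)
        if out_degree == 0:
            # Assumption: an action with no successor keeps its probability.
            pi1.append([1.0 if j == i else 0.0 for j in range(n)])
        else:
            pi1.append([a / out_degree for a in row])
    return pi1

def absolute_probabilities(pi0, pi1, k):
    """Eq. 2: pi_k = pi_0 x (1Pi)^k, evaluated as k vector-matrix products."""
    vec = list(pi0)
    n = len(vec)
    for _ in range(k):
        vec = [sum(vec[i] * pi1[i][j] for i in range(n)) for j in range(n)]
    return vec

def path_entropy(pi_k, shp):
    """Eq. 1: -sum of p_j * log(p_j) over the nodes j on the shortest path."""
    return -sum(pi_k[j] * math.log(pi_k[j]) for j in shp if pi_k[j] > 0)

# Hypothetical toy network: action 0 leads to 1 and 2, both lead to crime 3.
adj = [
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
pi1 = transition_matrix(adj)
pi0 = [1.0 / len(adj)] * len(adj)  # uniform initial probabilities (assumption)
pi_1 = absolute_probabilities(pi0, pi1, k=1)
theta_1 = path_entropy(pi_1, shp=[0, 1, 3])  # shortest path 0 -> 1 -> 3
print(pi_1, theta_1)
```

Repeating the vector-matrix product k times avoids forming the matrix power explicitly, which is enough for small causality networks such as this one.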

Results and discussion
ANN-Entropy QSPR models for Criminal Causality
One of the reasons people have difficulty in dealing with complex systems is that the linear causal chain way of thinking (A causes B causes C causes D, etc.) breaks down in the presence of feedback and multiple interactions between causal and influence pathways. One could say that complex systems are characterized by networked rather than linear causal relationships. Nevertheless, it is important to be able to reason about complex systems, make inferences about the factors that contribute to their current and alternative states, and explore their possible future trajectories, especially if we wish to steer them towards more favorable futures and away from more dangerous ones. Large-scale examples include ecosystems, economic systems, coupled biophysical-socioeconomic systems, integrated supply chains/industrial systems and social systems, but these remarks also apply, for example, to attempts to understand a physical organism as a complex system. Crime causality is a very important phenomenon in this sense. Different measures of crime


causality have been developed before [26]. In this work, we used the Markov entropy centrality kθ(j) for a node in a Crime causality network. At the same time, we propose new measures of crime causality calculated as the sum of all the kθ(j) values of the same order k for all nodes placed on the shortest path (shp) connecting the original node ni (possible cause) with the final node nii (consequence) [20]. The ANN models were trained and later validated with an external validation series. The best ANN models found are summarized in Table 1. The output of these ANN models, the CC-score, is a real-valued variable that scores the possibility that a Crime Cause (CC) is the main cause of a given crime. The best models were able to correctly predict above 87% of the crime causes (CC) out of many potential crime causes in 17 crime cases. In Figure 1, we illustrate the topology of some of the ANN models trained in this study.
Table 1. Results of the ANN study.

Figure 1. Topology of some ANNs tested in this work.


Acknowledgments
Duardo-Sánchez, A., gratefully acknowledges partial financial support of the Research project (2006/PX 207) from the Department of Especial Public Law, Financial and Tributary Law Area, Faculty of Law, of the University of Santiago de Compostela in Spain, which was supported by Xunta de Galicia and ESF.

References
1. Ivanciuc O. Drug Design with Artificial Neural Networks. In: Meyers RA, ed. Encyclopedia of Complexity and Systems Science. Berlin: Springer-Verlag 2009:2139-59.
2. Zou J, Han Y, So SS. Overview of artificial neural networks. Methods Mol Biol. 2008;458:15-23.
3. Marini F. Artificial neural networks in foodstuff analyses: Trends and perspectives. A review. Anal Chim Acta. 2009 Mar 9;635(2):121-31.
4. Palocsay SW, Wang P, Brookshire RG. Predicting criminal recidivism using neural networks. Socioecon Plann Sci. 2000;34(4):271-84.
5. Grann M, Långström N. Actuarial Assessment of Violence Risk: To Weigh or Not to Weigh? Criminal Justice and Behavior. 2007;34(1):22-36.
6. Bornholdt S, Schuster HG. Handbook of Graphs and Complex Networks: From the Genome to the Internet. Weinheim: Wiley-VCH GmbH & Co. KGaA 2003.
7. Breiger R. The Analysis of Social Networks. In: Hardy M, Bryman A, eds. Handbook of Data Analysis. London: Sage Publications 2004:505-26.
8. Abercrombie N, Hill S, Turner BS. Social structure. The Penguin Dictionary of Sociology. 4th ed. London: Penguin 2000.
9. Craig C. Social Structure. Dictionary of the Social Sciences. Oxford: Oxford University Press 2002.
10. White H, Boorman S, Breiger R. Social Structure from Multiple Networks: I. Blockmodels of Roles and Positions. American Journal of Sociology. 1976;81:730-80.
11. Wellman B, Berkowitz SD. Social Structures: A Network Approach. Cambridge: Cambridge University Press 1988.
12. Newman M. The Structure and Function of Complex Networks. SIAM Review. 2003;56:167-256.
13. Newman M. The Structure and Function of Complex Networks. SIAM Review. 2003;56:167-256.
14. Moreno JL. Who Shall Survive? New York: Beacon House 1934.
15. Padgett JF, Ansell CK. Robust action and the rise of the Medici, 1400-1434. Amer J Sociol. 1993;98:1259-1319.
16. Mariolis P. Interlocking directorates and control of corporations: The theory of bank control. Social Sci Quart. 1975;56:425-39.
17. Mizruchi MS. The American Corporate Network, 1904-1974. Beverly Hills: Sage 1982.
18. Klovdahl AS, Potterat JJ, Woodhouse DE, Muth JB, Muth SQ, Darrow WW. Social networks and infectious disease: The Colorado Springs study. Soc Sci Med. 1994;38:79-88.
19. Liljeros F, Edling CR, Amaral LAN, Stanley HE, Aberg Y. The web of human sexual contacts. Nature. 2001;411:907-8.
20. Munteanu CR, Dorado J, Pazos Sierra A, Prado-Prado F, Pérez-Montoto LG, Vilar S, et al. Markov Entropy Centrality: Chemical, Biological, Crime and Legislative Networks. In: Dehmer M, Emmert-Streib F, Mehler A, eds. Information Theory Analysis of Complex Networks: Statistical Methods and Applications. Springer-Verlag 2010.
21. Pérez-Montoto LG, Prado-Prado F, Ubeira FM, González-Díaz H. Study of Parasitic Infections, Cancer, and other Diseases with Mass-Spectrometry and Quantitative Proteome-Disease Relationships. Curr Proteomics. 2009;6:246-61.
22. González-Díaz H, Prado-Prado F, Pérez-Montoto LG, Duardo-Sánchez A, López-Díaz A. QSAR Models for Proteins of Parasitic Organisms, Plants and Human Guests: Theory, Applications, Legal Protection, Taxes, and Regulatory Issues. Curr Proteomics. 2009;6:214-27.
23. González-Díaz H, Prado-Prado F, Ubeira FM. Predicting antimicrobial drugs and targets with the MARCH-INSIDE approach. Curr Top Med Chem. 2008;8(18):1676-90.
24. González-Díaz H, González-Díaz Y, Santana L, Ubeira FM, Uriarte E. Proteomics, networks and connectivity indices. Proteomics. 2008;8:750-78.
25. González-Díaz H, Vilar S, Santana L, Uriarte E. Medicinal Chemistry and Bioinformatics - Current Trends in Drugs Discovery with Networks Topological Indices. Curr Top Med Chem. 2007;7(10):1025-39.
26. Pager D. The Mark of a Criminal Record. Am J Sociol. 2003;108:937-75.


Complex Network Entropy: From Molecules to Biology, Parasitology, Technology, Social, Legal, and Neurosciences, 2011: 115-125 ISBN: 978-81-7895-507-0 Editors: Humberto González-Díaz, Francisco J. Prado-Prado and Xerardo García-Mera

9. Predicting fasciolosis in Galicia with Shannon entropy of landscape complex network
Humberto González-Díaz1, Mercedes Mezo2, Marta González-Warleta2, Esperanza Paniagua1 and Florencio M. Ubeira1
1Department of Microbiology & Parasitology, Faculty of Pharmacy, University of Santiago de Compostela, Santiago de Compostela, 15782, Spain; 2Laboratorio de Parasitología, Centro de Investigaciones Agrarias de Mabegondo-Xunta de Galicia, Abegondo, 15318, Spain

Correspondence/Reprint request: Dr. Humberto González-Díaz, Department of Microbiology & Parasitology, Faculty of Pharmacy, University of Santiago de Compostela, Santiago de Compostela, 15782, Spain. E-mail: gonzalezdiazh@yahoo.es

1. Introduction
Complex networks (CN) may be used to study disease spread. For instance, Bigras-Poulin, Barfod, et al. [1] studied the relationship between trade patterns in the animal-movement network of the Danish swine industry and potential disease spread. The movements of animals were analysed under the conceptual framework of graph theory. The swine production premises of Denmark were taken as the nodes of a network and the animal movements as the links. In this framework, each farm has a CN of other premises to which it is linked. A premise was a farm (breeding, rearing or slaughter pig), an abattoir or a trade market. The overall network was divided into premise-specific subnets that linked the other premises from and to which animals were moved. This approach allowed them to


visualise and analyse the three levels of organization related to animal movements that existed in the Danish swine production registers: the movement of animals between two premises, the premise-specific networks, and the industry network. The assumption that animal movements can be randomly generated on the basis of the farm density of the area surrounding any farm was not correct, since the patterns of animal movements have the topology of a scale-free CN with a large degree of heterogeneity. This supported the opinion that disease spread software assuming homogeneity in farm-to-farm relationships should only be used for large-scale interpretation and for epidemic preparedness. The authors concluded that the network approach, based on graph theory, can be used efficiently to express more precisely, on a local scale (premise), the heterogeneity of animal movements. In another work, Natale, Giovannini et al. [2] carried out a network analysis of Italian cattle trade patterns in 2007 to evaluate the risks of potential disease spread. The simulations show the influence of the network structure on the dynamics and size of a hypothetical epidemic and give useful indications of the effects of targeted removal of nodes based on the centrality of premises within the network of animal movements. In this connection, Landscape Complex Networks (LCNs) may be very useful too. LCNs incorporate topographical (coordinates and altitude), climatic, hydrographical and other types of information to describe phenomena (species migration, disease spreading, transport) that involve different interconnected places (lakes, cities, industries, forests). The places are represented by nodes, and the edges or arcs connecting these nodes usually express geographical proximity but may incorporate additional information about the flow or transmission of some magnitude or phenomenon between two nodes.
For instance, in a recent work we built an LCN to study the landscape spreading of the parasite infection fasciolosis in Galicia (NW Spain) [3]. Fasciolosis is a parasite infection caused by Fasciola hepatica (a liver fluke) that has become an important cause of lost productivity in livestock worldwide. Considered a secondary zoonotic disease until the mid-1990s, human fascioliasis is at present emerging or re-emerging in many countries, with increases in prevalence and intensity as well as geographical expansion. Research in recent years has justified the inclusion of fascioliasis in the list of important human parasitic diseases. At present, fasciolosis is the vector-borne disease presenting the widest known latitudinal, longitudinal and altitudinal distribution [4]. In endemic areas of Central and South America, Europe, Africa and Asia, human fasciolosis presents a range of epidemiological characteristics related to a wide diversity of environments. In addition, F. hepatica produces varied clinical presentations that still


make a high index of suspicion mandatory. Besides a wide spectrum of hepatobiliary symptoms such as obstructive jaundice, cholangitis and liver cirrhosis, the parasitic infection also has extrabiliary manifestations. Effective control of fasciolosis is difficult, especially in milking cows, which can only be treated during dry periods, a control strategy that has not yet been evaluated. In this sense, the study of the geographical spreading of fasciolosis becomes a goal of major interest. In this previous work, we also calculated many centrality measures for all the nodes of the new network (livestock farms). Last, using these measures of landscape network structure as inputs, we sought a Quantitative Structure-Property Relationship (QSPR) model. This QSPR model may predict the prevalence of disease, in the former or new farms, after or in the absence of different medical treatments, based only on details retrieved from GIS (location and altitude). The study may have predictive value for positioning new farms with lower risk of infection or for managing cattle during infections. Another example is the work of Minor, Tessel et al. [5], who studied the role of landscape connectivity in assembling exotic plant communities by means of a CN analysis. They considered the spatial arrangement of habitat fragments (nodes) to be critical in affecting the movement of individuals through a landscape, and examined how invasive species respond to landscape configuration relative to native species. This information is crucial for managing the global threat of invasive species spread. Using LCN analysis they showed that forest plant communities in a fragmented landscape have a spatial structure that is best captured by an LCN representation of landscape connectivity (edges). In another study, Michels, Cottenie et al. [6] investigated geographical and genetic distances among zooplankton populations in a set of interconnected ponds.
They considered systems of interconnected ponds or lakes (CN nodes). Using a landscape-based approach, they modelled the effective geographical distance among a set of interconnected ponds in a GIS environment. By analogy with these previous works, we aim to study a database recently reported by Mezo et al. [7], related to the landscape spreading of the parasite infection fasciolosis in Galicia (NW Spain), using Shannon entropy parameters as a measure of information related to the landscape structure for fasciolosis spreading in Galicia.

2. Materials and methods
2.1. Complex network study
We used the dataset reported by Mezo et al. [7] to construct a network of farm-to-farm spreading of fasciolosis in cattle for Galicia (NW Spain), see


Figure 1. Each farm was considered as a node of the network, associated with a Boolean matrix with elements bij [8-10]. We place an arc (directed edge) connecting the i-th farm with the j-th farm if they satisfy the condition given below in the form of a Microsoft Excel command (equation 1a), which is used to truncate the farm-to-farm distance function (equation 1b):

bij = if(or(dij = 0, dij > dcutoff * average(dij)), 0, 1)    (1a)

dij = 0.5 * (hi + hj) * Tri * Trj * SQR((xi - xj)^2 + (yi - yj)^2)    (1b)
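The truncation rule in equations (1a) and (1b) can be sketched as follows. This is an illustration, not the spreadsheet used by the authors. The reading of h as altitude, Tr as a treatment code and x, y as map coordinates is an assumption based on the surrounding text. The farm records, the field names and the choice to average dij over all n × n entries are hypothetical. Applied literally, the dij = 0 test in (1a) leaves the diagonal at 0 in this sketch.

```python
import math

def weighted_distance(a, b):
    """Eq. 1b: Euclidean distance scaled by mean altitude and treatment codes."""
    euclid = math.sqrt((a["x"] - b["x"]) ** 2 + (a["y"] - b["y"]) ** 2)
    return 0.5 * (a["h"] + b["h"]) * a["tr"] * b["tr"] * euclid

def boolean_matrix(farms, d_cutoff):
    """Eq. 1a: b_ij = 0 if d_ij = 0 or d_ij > d_cutoff * average(d_ij), else 1."""
    n = len(farms)
    d = [[weighted_distance(farms[i], farms[j]) for j in range(n)]
         for i in range(n)]
    # Assumption: the average is taken over all n x n entries of d.
    mean_d = sum(sum(row) for row in d) / (n * n)
    return [[0 if d[i][j] == 0 or d[i][j] > d_cutoff * mean_d else 1
             for j in range(n)] for i in range(n)]

# Hypothetical farms: two close neighbours and one distant farm.
farms = [
    {"x": 0.0, "y": 0.0, "h": 100.0, "tr": 1.0},
    {"x": 3.0, "y": 4.0, "h": 100.0, "tr": 1.0},
    {"x": 30.0, "y": 40.0, "h": 100.0, "tr": 1.0},
]
b = boolean_matrix(farms, d_cutoff=1.0)
print(b)
```

Raising `d_cutoff` relaxes the truncation, so more distant farm pairs become connected.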

As this is a symmetric condition, the existence of a connection ij implies the existence of the inverse connection ji. Loops, that is, connections from the j-th farm to itself (representing self-infection of animals inside the same farm), were allowed. We used this network as input for the software CentiBiN [11], which was used to calculate the δ(j) values, that is, the number of farms with an outgoing and/or incoming connection to the j-th farm. Consequently, δ(j) is the node degree of the j-th node in the complex network represented in Figure 1 (C). With these values we calculated the Shannon entropy θ0(j) [12-19] by means of equation (2a), in order to quantify the amount of local information related to the farm neighborhood landscape and the previous pharmacologic treatment used (see equations 1a and 1b). It is easy to show that θ0(j) is a Shannon entropy-like parameter because it has the form -p(j)·log p(j), where p(j) = 1/δ(j) is a probability that measures the tendency of the node (farm) to be connected by one specific link to another specific farm out of the total number of farms δ(j) linked to the j-th node. It is also easy to show that p(j) satisfies the conditions for a probability: it is a real value in the interval (0, 1] and it obeys the normalization condition in equation (2b):

$$ \theta_0(j) = -p(j) \cdot \log p(j) = -\frac{1}{\delta(j)} \cdot \log \frac{1}{\delta(j)} \qquad (2a) $$

$$ p_0 = \sum_{1}^{\delta(j)} p(j) = \sum_{1}^{\delta(j)} \frac{1}{\delta(j)} = 1 \qquad (2b) $$
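As a small worked illustration of equation (2a), the sketch below derives δ(j) from the rows of a Boolean matrix and evaluates θ0(j). The function name, the example matrix, the use of the natural logarithm and the value 0 assigned to isolated nodes are assumptions, since the text does not specify them.

```python
import math

def degree_entropy(b):
    """Eq. 2a: theta0(j) = -(1/delta(j)) * log(1/delta(j)) for every node j."""
    thetas = []
    for row in b:
        delta = sum(row)  # node degree delta(j) from the Boolean matrix row
        if delta == 0:
            # Assumption: an isolated farm is given zero entropy.
            thetas.append(0.0)
        else:
            p = 1.0 / delta
            thetas.append(-p * math.log(p))
    return thetas

# Hypothetical star network: farm 0 is linked to farms 1-4.
b = [
    [0, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
theta0 = degree_entropy(b)
print(theta0)
```

In this toy case the hub has δ = 4 and θ0 = 0.25·log 4, while each peripheral farm has δ = 1, so p = 1 and θ0 = 0.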

3. Results
In the previous cross-sectional study, Mezo et al. investigated the effect of the type of flukicide treatment on the prevalence and intensity of infection in dairy cattle from Galicia, an area where fasciolosis is endemic and which is also the main milk-producing region in Spain. Faecal samples were taken


Figure 1. Geographical maps of Galicia showing the location of the 275 sampled farms. A - The status of infection (empty circles: F. hepatica free and filled circles: F. hepatica infected) and the treatment administered on each farm are shown (blue: none; red: an anthelmintic effective against fluke mature stages and green: a fasciolicide effective against immature and mature stages). B - Distribution of farms according to the presence of F. hepatica infection (grey: uninfected; cyan: infected with a within-herd prevalence <25% and pink: infected with a within-herd prevalence ≥ 25%). C - SCN for landscape parasite-spreading.


from 5 188 dairy cows on 275 randomly selected farms for measurement of the concentration of F. hepatica copro-antigens by a monoclonal antibody-based immunoassay (MM3-COPRO ELISA). We used this database by Mezo et al. to construct a landscape parasite-spreading network for fasciolosis in Galicia (NW Spain) [7], see Figure 1. On the same day as the sampling, each farm owner/manager was questioned about the types of treatment used on the farm. Three groups of farms were considered according to the fasciolicide treatment: (A) flukicides were not used, (B) an anthelmintic effective against mature stages of flukes was used (albendazole or netobimin) and (C) a fasciolicide effective against immature and mature stages was used (triclabendazole: TCBZ). The survey showed that most dairy farmers are unaware of the existence of F. hepatica infection on their farms, and treatments, when given, are administered without prior diagnosis. Treatment with TCBZ administered only at drying off did not show advantages over other measures, including no treatment or treatment with other benzimidazoles. Consequently, TCBZ should only be used to treat individual animals after correct diagnosis of the infection, with correct management measures taken to control re-infection. The detailed information for all farms analyzed, as well as other details, appears in Table 1. This situation prompted us to seek a combined QSPR & LCN model that may help in disease management. For this study we decided to test the predictive power of θ0(j) in fasciolosis spreading problems using one of the simplest known classifiers, Linear Discriminant Analysis (LDA). The best classification function found with this technique was:

$$ S(PAT) = 0.035 \cdot \theta_0(j) - 1.409 \qquad n = 207 \quad \chi^2 = 14.0 \quad p < 0.001 \qquad (3) $$

In this equation, S(PAT) is a real-valued output variable that scores the propensity of a farm to present a Prevalence After Treatment (PAT) > 24% for F. hepatica infection. The Chi-square statistic (χ2) has a low p-level (< 0.001), which indicates a significant discrimination between high- and low-prevalence farms. We used a total of n = 207 farms to train the model, and the remaining farms were used as an external validation set. The model presents good values of Accuracy, Sensitivity, and Specificity in both the training and validation series, see Table 2. In the previous work [3] we also found excellent classifiers for this problem (see also Table 1), but all those models were based on more complicated decision algorithms in the form of Classification Trees (CT) [20], not on a simple one-parameter linear equation like the QSPR model introduced here.
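For readers who want to reproduce this kind of evaluation, the sketch below evaluates equation (3) and computes Accuracy, Sensitivity and Specificity from observed and predicted classes. It is only an illustration: the function names and the label lists are hypothetical, no data from Tables 1 or 2 are used, and the cutoff that converts S(PAT) into a class is not given in the text, so the class labels are passed in directly.

```python
def s_pat(theta0):
    """Eq. 3: S(PAT) = 0.035 * theta0(j) - 1.409."""
    return 0.035 * theta0 - 1.409

def classification_metrics(observed, predicted):
    """Accuracy, sensitivity and specificity for 0/1 class labels."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # fraction of high-PAT farms found
        "specificity": tn / (tn + fp),  # fraction of low-PAT farms found
    }

# Hypothetical labels: 1 = PAT > 24%, 0 = otherwise.
observed = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(observed, predicted)
print(s_pat(0.0), metrics)
```

Reporting all three statistics separately for the training and validation series, as the chapter does, guards against a model that looks accurate only because one class dominates.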

Table 1. Parasite infection spread out to different jth Farms in Galicia.

Table 2. Results for CT analysis of Fasciolosis spreading.

4. Conclusion
We demonstrated that Shannon entropy values of LCNs are promising indices of general use for the prediction of fasciolosis spreading in Galicia.

Acknowledgments
González-Díaz H. acknowledges a tenure track research position funded by Program Isidro Parga Pondal, Xunta de Galicia. The authors are grateful for partial financial support from project AGL2006-13936-C02 of the Ministry of Education and Science, Spain, which is co-financed with European Union funds (FEDER), and for grants 2007/127 and 2007/144 from the General Directorate of Scientific and Technologic Promotion of the Galician University System of the Xunta de Galicia.

References
1. Bigras-Poulin M, Thompson RA, Chriel M, Mortensen S, Greiner M. Network analysis of Danish cattle industry trade patterns as an evaluation of risk potential for disease spread. Prev Vet Med. 2006 Sep 15;76(1-2):11-39.
2. Natale F, Giovannini A, Savini L, Palma D, Possenti L, Fiore G, et al. Network analysis of Italian cattle trade patterns and evaluation of risks for potential disease spread. Prev Vet Med. 2009 Sep 21.
3. González-Díaz H, Mezo M, González-Warleta M, Muíño-Pose L, Paniagua E, Ubeira FM. Network prediction of fasciolosis spreading in Galicia (NW Spain). In: González-Díaz H, Munteanu CR, eds. Topological Indices for Medicinal Chemistry, Biology, Parasitology, Neurological and Social Networks. Kerala: Transworld Research Network 2010:191-204.
4. Mas-Coma S. Epidemiology of fascioliasis in human endemic areas. J Helminthol. 2005 Sep;79(3):207-16.
5. Minor ES, Tessel SM, Engelhardt KA, Lookingbill TR. The role of landscape connectivity in assembling exotic plant communities: a network analysis. Ecology. 2009 Jul;90(7):1802-9.
6. Michels E, Cottenie K, Neys L, De Gelas K, Coppin P, De Meester L. Geographical and genetic distances among zooplankton populations in a set of interconnected ponds: a plea for using GIS modelling of the effective geographical distance. Mol Ecol. 2001 Aug;10(8):1929-38.
7. Mezo M, Gonzalez-Warleta M, Castro-Hermida JA, Ubeira FM. Evaluation of the flukicide treatment policy for dairy cattle in Galicia (NW Spain). Vet Parasitol. 2008 Nov 7;157(3-4):235-43.
8. Wollbold J, Huber R, Pohlers D, Koczan D, Guthke R, Kinne RW, et al. Adapted Boolean network models for extracellular matrix formation. BMC Systems Biology. 2009;3:77.
9. Pomerance A, Ott E, Girvan M, Losert W. The effect of network topology on the stability of discrete state models of genetic control. Proc Natl Acad Sci U S A. 2009 May 19;106(20):8209-14.
10. Zhang SQ, Ching WK, Ng MK, Akutsu T. Simulation study in Probabilistic Boolean Network models for genetic regulatory networks. International Journal of Data Mining and Bioinformatics. 2007;1(3):217-40.
11. Junker BH, Koschuetzki D, Schreiber F. Exploration of biological network centralities with CentiBiN. BMC Bioinformatics. 2006 Apr 21;7(1):219.
12. Stahura FL, Godden JW, Bajorath J. Differential Shannon entropy analysis identifies molecular property descriptors that predict aqueous solubility of synthetic compounds with high accuracy in binary QSAR calculations. J Chem Inf Comput Sci. 2002 May-Jun;42(3):550-8.
13. Stahura FL, Godden JW, Xue L, Bajorath J. Distinguishing between natural products and synthetic molecules by descriptor Shannon entropy analysis and binary QSAR calculations. J Chem Inf Comput Sci. 2000 Sep-Oct;40(5):1245-52.
14. Graham DJ. Information Content in Organic Molecules: Brownian Processing at Low Levels. Journal of Chemical Information and Modeling. 2007;47(2):376-89.
15. Graham DJ, Schacht D. Base Information Content in Organic Molecular Formulae. J Chem Inf Comput Sci. 2000;40:942.
16. Graham DJ. Information Content in Organic Molecules: Structure Considerations Based on Integer Statistics. J Chem Inf Comput Sci. 2002;42:215.
17. Graham DJ, Malarkey C, Schulmerich MV. Information Content in Organic Molecules: Quantification and Statistical Structure via Brownian Processing. J Chem Inf Comput Sci. 2004;44:1601.
18. Graham DJ, Schulmerich MV. Information Content in Organic Molecules: Reaction Pathway Analysis via Brownian Processing. J Chem Inf Comput Sci. 2004;44:1612.
19. Graham DJ. Information Content and Organic Molecules: Aggregation States and Solvent Effects. Journal of Chemical Information and Modeling. 2005;45:1223.
20. Hill T, Lewicki P. STATISTICS Methods and Applications. A Comprehensive Reference for Science, Industry and Data Mining. Tulsa: StatSoft 2006.
21. StatSoft, Inc. STATISTICA (data analysis software system), version 6.0, www.statsoft.com. StatSoft, Inc. 2002.
22. Zhang W. Computer inference of network of ecological interactions from sampling data. Environ Monit Assess. 2007 Jan;124(1-3):253-61.
23. da Silveira CH, Pires DE, Minardi RC, Ribeiro C, Veloso CJ, Lopes JC, et al. Protein cutoff scanning: A comparative analysis of cutoff dependent and cutoff free methods for prospecting contacts in proteins. Proteins. 2009 Feb 15;74(3):727-43.


Complex Network Entropy: From Molecules to Biology, Parasitology, Technology, Social, Legal, and Neurosciences, 2011: 127-142 ISBN: 978-81-7895-507-0 Editors: Humberto González-Díaz, Francisco J. Prado-Prado and Xerardo García-Mera

10. Markov entropy for biology, parasitology, linguistic, technology, social and law networks
Marilena N. Berca1, Aliuska Duardo-Sanchez2, Humberto González-Díaz3, Alejandro Pazos4 and Cristian R. Munteanu4
1ASLI Santiago S.C., Rúa Nova de Abajo 13, 15701 Santiago de Compostela, Spain; 2Department of Especial Public Law, Financial and Tributary Law Area, Faculty of Law, USC, 15782 Santiago de Compostela, Spain; 3Department of Microbiology & Parasitology, Faculty of Pharmacy, USC, 15782 Santiago de Compostela, Spain; 4Department of Information and Communication Technologies, Computer Science Faculty, University of A Coruña, Campus de Elviña, 15071 A Coruña, Spain

Correspondence/Reprint request: Dr. Cristian R Munteanu, Department of Information and Communication Technologies, Computer Science Faculty, University of A Coruña, Campus de Elviña, 15071, A Coruña, Spain. E-mail: muntisa@gmail.com

Introduction
Graphical visualization and processing of the information in complex systems has become a very useful tool in most scientific fields. One such method is Graph and Complex Network theory, whose applications are expanding to different levels of matter organization, such as genome networks, protein-protein networks, sexual disease transmission networks, linguistic networks, law and social networks [1-6], electric power networks or the internet [7]. A network is a set of items, usually called nodes, with connections between them, called edges [8]. The nodes can be atoms, molecules, proteins, nucleic acids, drugs, cells, organisms, parasites, people, words, laws, computers


or any other part of a real system. The edges are relationships between the nodes such as chemical bonds, physical interactions, metabolic pathways, pharmacological action, law recurrence or social behavior. The branch of mathematical chemistry dedicated to encoding DNA/protein information in graph representations has become an intense research area [9-16]. Graphic approaches to the study of biological systems can provide useful insights into protein folding kinetics [17], enzyme-catalyzed reactions [18-21], inhibition kinetics of processive nucleic acid polymerases and nucleases [22-26], DNA sequence analysis [27], anti-sense strand base frequencies [28], analysis of codon usage [29, 30] and investigations of complicated network systems [31-33]. The interactions between proteins in parasites, between drugs and parasites, or the spread of parasites have been studied in recent works [34, 35]. In actor social networks, social relationships are represented by nodes and ties, where the nodes are the individual actors within the network and the ties are the relationships between these actors [7]. The study of social networks began in the early 1930s with a social network used to study friendships between school children [36]. Since then the importance of the network approach to the social sciences has increased greatly, and its applications range from interrelations between family members [37] to business interactions between companies [38, 39] or patterns of sexual contacts [40, 41]. In contrast to these social networks, the application of these methods in the field of Law is still in its infancy [42]. Network tools might illustrate the interrelation between different types of law and help to understand the consequences of laws for life in society and whether or not they are effective. Network theory provides a powerful tool for the representation and analysis of complex systems of interacting agents. Thus, Porter et al. [43] investigated the U.S.
House of Representatives network of committees and subcommittees, with committees connected according to “interlocks” or common membership. Analysis of this network clearly reveals the strong links between different committees, as well as the intrinsic hierarchical structure within the House as a whole. The team showed that network theory, combined with the analysis of roll-call votes using singular value decomposition, successfully uncovers political and organizational correlations between committees in the House without the need to incorporate other political information. The same methodology has been used in linguistic studies. Computational linguistics is an interdisciplinary field dealing with the statistical and/or rule-based modeling of natural language from a computational perspective [44-46]. Forster and Toth [47] studied the phylogenetic chronology of ancient Gaulish, Celtic, and Indo-European. Indo-European is the largest and best-documented language family in the world, yet the reconstruction of the Indo-European tree, first proposed in 1863, has remained controversial. Complications may include ascertainment bias when choosing the linguistic data,


and disregard for the wave model of 1872 when attempting to reconstruct the tree. Essentially analogous problems were solved in evolutionary genetics by DNA sequencing and phylogenetic network methods, respectively. The authors adapt these tools to linguistics, and analyze Indo-European language data, focusing on Celtic and in particular on the ancient Celtic language of Gaul (modern France), by using bilingual Gaulish–Latin inscriptions. The phylogenetic network reveals an early split of Celtic within Indo-European. Interestingly, the next branching event separates Gaulish (Continental Celtic) from the British (Insular Celtic) languages, with Insular Celtic subsequently splitting into Brythonic (Welsh, Breton) and Goidelic (Irish and Scottish Gaelic). Taken together, the network thus suggests that the Celtic language arrived in the British Isles as a single wave (and then differentiated locally), rather than in the traditional two-wave scenario (“P-Celtic” to Britain and “Q-Celtic” to Ireland). The phylogenetic network furthermore permits the estimation of time in analogy to genetics, and the authors obtain tentative dates for Indo-European at 8100 BC ± 1,900 years, and for the arrival of Celtic in Britain at 3200 BC ± 1,500 years. Thus, the phylogenetic method promises to be an informative approach for many problems in historical linguistics. The similitude between linguistics and protein structure was studied by Gimona [48]. The correspondence between biology and linguistics at the level of sequence and lexical inventories, and of structure and syntax, has fuelled attempts to describe genome structure by the rules of formal linguistics. She proposed definitions of protein linguistic rules and explain how the compositional semantics improve our understanding of protein organization and functional plasticity. Recently, Friedlaender et al. show that linguistics is more robust compared with genetics [49]. 
Studies in the Pacific have shown the robustness of linguistic phylogenetic reconstructions in comparison to genetic ones, when adequate linguistic data sets are available. Any congruence between linguistics and genetics is disrupted when populations speaking unrelated languages are in close contact. In such cases, genetic distinctions between groups rapidly become blurred, because genetic exchange is generally more prevalent and pervasive than language borrowing or adoption. Languages are more integrated sets of features than gene pools are. Language change does not occur in a social vacuum, and sociolinguistic pressures to maintain distinctions between groups can evidently have a strong inhibitory effect on linguistic convergence. This underlines the comparative power of historical linguistics for reconstructions of population histories, especially in contact situations. The ontology structures used throughout the life sciences are an example of the application of networks to linguistics. The idea of representing knowledge in a standardized manner is not new. In Philosophy, the word Ontology has been used since the
Ancient Greece era to refer to “the science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality” [50]. Philosophical Ontology has taken many forms throughout history, and different schools of philosophy have offered different approaches. However, one central goal in philosophical Ontology is a definitive and exhaustive classification of all entities [51]. Beyond the philosophical point of view, towards the end of the 20th century the term “ontology” (or ontologies) started to gain usage in computer science, where it refers to a research area in the subfield of Artificial Intelligence (AI) primarily concerned with the semantics of concepts and with expressive (or interpretive) processes in computer-based communications. There are several definitions of ontology in AI, which have evolved over the years. The first definition was offered by Neches and colleagues in 1991. They stated that “An ontology defines the basic terms and relations comprising the vocabulary of a topic area as well as the rules for combining terms and relations to define extensions to the vocabulary” [52]. According to this definition, an ontology includes not only the terms and relations explicitly defined in it, but also the new terms that can be inferred by means of rules. Two years later, in 1993, Gruber defined an ontology as “an explicit specification of a conceptualization” [53]. Currently, the most popular and widely used ontology language is the Web Ontology Language (OWL) [54], developed by the World Wide Web Consortium (W3C). In bio-domains, like biochemistry or biomedicine, ontologies consist of defined biological concepts and the relationships between them. The common strategy involves the ontology-based annotation of primary data, which is the association of elements from ontologies (i.e.
concepts and relationships) to data, so that ontologies can be used both by humans and computers to share, search and navigate across genetic, phenotype and disease information [55-59]. Such ontologies, widely known as “bio-ontologies” [60], are increasingly used in a variety of bioinformatics applications, ranging from semantic annotation [61] and search [62] to large-scale analysis [63]. One example of a semantics-based description of microarray experiments is the MGED ontology [64]. Shannon entropy centrality parameters of complex networks have been used to develop a general methodology for the search for Quantitative Structure-Property Relationship (QSPR) models. Other applications of entropy parameters for different types of networks have been presented [65]. In the present chapter, we propose a comparison of different systems using entropy node centralities derived from Graph or Complex Network theory. The information is quantified in terms of the entropy centrality, θk(j), of the jth part or state (node) of a Markov chain associated with the system, which is represented by a network graph. The study procedure is the same for all the systems and does not depend on the
complexity. In the first step, we define the phenomena to study, ranging from molecular systems composed of single molecules (drug activity, drug toxicity), multiple molecules (biochemical pathways, metabolic reactions), and macromolecules (DNA-drug interaction, protein function), to ecological systems (bacterial co-aggregation), linguistic dictionaries or social systems (US airline trips, dolphin associations, organizations in drug policy, football networks, Dutch elite, financial laws, legislative productivity). In the second step, we represent the interconnectivity of the discrete parts of the system (amino acids, reactants, bacterial species, laws, words or people) as a graph or network. Markov chain theory is then used to calculate the entropy centralities of the system for nodes placed at different distances. These centralities of a Markov chain associated with a network or graph can be used as a universal quantity in pattern recognition, regardless of whether the complex system is chemical, biological, linguistic, social, or of another type.

Materials and methods

Biological, linguistic, technological, social and law complex networks

This study is based on fourteen networks from five scientific fields. The nodes and edges of each network represent specific items and the relations between them:

1. Bacterial Co-aggregation (C_1) [66]: nodes are bacteria, linked if they form mixed films.
2. NW Spain Fasciolosis (C_2) [67]: nodes are farms, linked if fasciolosis spread from one to the other.
3. Spain Financial Law (C_3) [65]: nodes are laws, linked if one is the evolution of the other.
4. Biochemical Pathways (C_4) [68]: nodes are biochemical compounds, linked by biochemical reactions/transformations.
5. Brain Pathways (C_5) [68]: nodes are neural centers of the brain, linked by pathways of electrical/neural activity.
6. Yeast Metabolism (C_6) [69]: nodes are yeast proteins, linked if they interact.
7. US Airlines (C_7) [68]: nodes are airports, linked by airplane trips.
8. Dolphins (C_8) [70]: an undirected social network of frequent associations between 62 dolphins in a community living off Doubtful Sound, New Zealand.
9. Transcriptional E. Coli (C_9) [68]: nodes are genes involved in the transcription process in E. coli, and the links are the interactions between them.
10. Organizations in Drug Policy (C_10) [68]: nodes are organizations involved in drug policy, linked if a relation exists between them.
11. Football network (C_11) [71]: describes the 22 soccer teams that participated in the World Championship in Paris, 1998. Players of a national team often have contracts in other countries. This constitutes a players market in which national teams export players to other countries. Members of the 22 teams had contracts in 35 countries altogether. Counting which team exports how many players to which country yields a valued, asymmetric graph. The graph is highly asymmetric: some countries only export players, and some countries are only importers.
12. Roget Thesaurus (C_12) [72]: each vertex of the graph corresponds to one of the categories in the 1879 edition of Peter Mark Roget's Thesaurus of English Words and Phrases, edited by John Lewis Roget. An arc goes from one category to another if Roget gave a reference to the latter among the words and phrases of the former, or if the two categories were directly related to each other by their positions in Roget's book.
13. FOLDOC (C_13) [73]: FOLDOC is a searchable dictionary of acronyms, jargon, programming languages, tools, architecture, operating systems, networking, theory, conventions, standards, mathematics, telecoms, electronics, institutions, companies, projects, products, history, in fact anything to do with computing. The dictionary has been growing since 1985 and now contains over 13000 definitions totaling nearly five megabytes of text. Entries are cross-referenced to each other and to related resources elsewhere on the net. An arc (X,Y) from term X to term Y exists in the network if, in the FOLDOC dictionary, the term Y is used to describe the meaning of term X.
14. Dutch elite TOP200 (C_14) [68]: data on the administrative elite in The Netherlands, April 2006, analysed by De Volkskrant and Wouter de Nooy.

Figure 1 shows four of the networks, drawn with Pajek [74, 75] and Gephi [76]: (A) Roget's Thesaurus (1879), (B) FOLDOC, Free Online Dictionary of Computing (only letter “A” terms), (C) NW Spain Fasciolosis, (D) Spain Financial Law (Giant Component).

Figure 1. Graphical representation of the following networks using Pajek and Gephi: A - Roget's Thesaurus (1879), B - FOLDOC (Free Online Dictionary of Computing, only letter “A” terms), C - NW Spain Fasciolosis, D - Spain Financial Law (Giant Component).

Markov Shannon entropy centralities for complex networks

In information theory, entropy is a measure of the uncertainty associated with a random variable [77]. In this context the term usually refers to the Shannon entropy, which quantifies, in the sense of an expected value, the information contained in a message, usually in units such as bits. Equivalently, the Shannon entropy is a measure of the average information content one is missing when one does not know the value of the random variable. The concept was introduced by Claude E. Shannon in his 1948 paper "A Mathematical Theory of Communication" [78]. In the present work, we construct the classical Markov matrix (${}^{1}\Pi$) for each network as follows:

a) The links between the nodes of a network generate the connectivity matrix C (n by n, where n is the number of vertices), the first numerical input of the calculation;
b) The Markov matrix ${}^{1}\Pi$ is built from C and contains the vertex-to-vertex probabilities ($p_{ij}$);

c) The probability matrix is raised to the power k, giving $({}^{1}\Pi)^{k}$, and multiplied by the vector of initial probabilities (${}^{0}p_{j}$):

$$ {}^{k}\mathbf{P} = {}^{0}\mathbf{P} \times \left({}^{1}\Pi\right)^{k} = \left[{}^{k}p_{1}, {}^{k}p_{2}, \ldots, {}^{k}p_{n}\right] \qquad (1) $$

d) The resulting vectors contain the absolute probabilities (${}^{k}p_{j}$) of reaching the nodes by moving along a walk of length k from node $n_i$, for each k, and are the basis for calculating the entropy centrality (${}^{k}\theta$):

$$ {}^{k}\theta = -\sum_{j=1}^{n} {}^{k}p_{j} \log {}^{k}p_{j} \qquad (2) $$
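Steps a) to d) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MCeCoNet code: the Markov matrix is taken to be the row-normalised connectivity matrix, the initial probabilities are assumed uniform, the logarithm is natural, and the function names (`markov_matrix`, `step`, `entropy_centralities`) are hypothetical.

```python
import math

def markov_matrix(C):
    """Row-normalise a connectivity matrix C (list of lists) into probabilities p_ij."""
    P = []
    for row in C:
        total = sum(row)
        # Assumption: a node without links keeps an all-zero row.
        P.append([c / total if total else 0.0 for c in row])
    return P

def step(p, P):
    """One Markov step: row vector p multiplied by matrix P."""
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]

def entropy_centralities(C, k_max=5):
    """Return the entropies for k = 0 .. k_max, following Eqs. (1) and (2)."""
    n = len(C)
    P = markov_matrix(C)
    p = [1.0 / n] * n  # assumption: uniform initial probabilities
    thetas = []
    for _ in range(k_max + 1):
        # Eq. (2); zero probabilities contribute nothing to the sum.
        thetas.append(-sum(x * math.log(x) for x in p if x > 0))
        p = step(p, P)  # Eq. (1), applied one power at a time
    return thetas

# Toy undirected path graph 0 - 1 - 2 (not one of the chapter's networks).
C = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
thetas = entropy_centralities(C, k_max=2)
```

Multiplying the probability vector by the matrix once per step avoids forming the matrix power explicitly, which gives the same vectors as Eq. (1) at lower cost.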

Law networks

Complex systems are characterized by networked rather than linear causal relationships. This is why people have difficulty dealing with complex systems: the linear causal-chain way of thinking (A causes B causes C causes D, etc.) breaks down in the presence of feedback and multiple interactions between causal and influence pathways. It is therefore important to understand complex systems and to identify the factors that contribute to their current and alternative states in order to explore their possible future trajectories. Large-scale examples include ecosystems, economic systems, coupled biophysical-socioeconomic systems, integrated supply chains/industrial systems and social systems, but these remarks also apply, for example, to attempts to understand a physical organism as a complex system. The evolution of financial law is an important phenomenon in this sense. The corresponding network is built by connecting laws of the same type if the time lag between them is less than 1. The cutoff function for the Spain financial law recurrence network, associated with the matrix L with elements Lij, is the following (see Figure 2):

Lij = if($H4=J$2,0,if(ABS($G4-J$1)>$G$2,0,if($I4=J$3,1,0)))

The parameters of this function are:

1. $H4 and J$2: Excel references to the column and row containing the node names.
2. $G4 and J$1: Excel references to the column and row containing the values of the time variable (yy), equal to the year in which the norm was approved.
3. $G$2: Excel reference to the cell in which the time-lag cutoff value toff is entered.
4. $I4 and J$3: Excel references to the column and row containing the values of the law-type variable, which are one-letter codes used to identify the type of financial law approved.
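The nested Excel conditions above can be restated as a short Python sketch. It is an illustrative translation only: the record layout (name, year, type), the function name `recurrence_matrix` and the sample laws are assumptions made for the example, not data from the chapter.

```python
def recurrence_matrix(laws, t_off):
    """Build L from a list of (name, year, law_type) tuples, mirroring the Excel formula."""
    n = len(laws)
    L = [[0] * n for _ in range(n)]
    for i, (name_i, year_i, type_i) in enumerate(laws):
        for j, (name_j, year_j, type_j) in enumerate(laws):
            if name_i == name_j:               # $H4=J$2: same node, no self-link
                continue
            if abs(year_i - year_j) > t_off:   # ABS($G4-J$1)>$G$2: time lag above cutoff
                continue
            if type_i == type_j:               # $I4=J$3: same law type
                L[i][j] = 1
    return L

# Hypothetical records: names, years and type codes are invented for illustration.
laws = [("law_a", 2000, "T"),
        ("law_b", 2001, "T"),
        ("law_c", 2001, "R"),
        ("law_d", 2005, "T")]
L = recurrence_matrix(laws, t_off=1)
```

In this toy input only law_a and law_b are linked: they share a type and lie within the cutoff, whereas law_c differs in type and law_d is too far away in time.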

Figure 2. The Excel sheet calculation to obtain the matrix L.

Markov centralities for complex networks (MCeCoNet)

MCeCoNet is a GUI Python/wxPython [79] application, with Graphviz [80] and gnuplot as graphical back-ends, for calculating a new class of Markov topological indices/centralities of a network based on the Markov normalized node probabilities (see Figure 3):

• Markov topological indices kTIc(G) of class c and power k for the graph G during a network attack that consecutively removes the network nodes (the network loses one node at each attack step);
• Markov centralities kCc(j) for each removed node j (the network loses only one node).

The tool reads networks in MAT format (CentiBin/Pajek compatible). Nodes are eliminated in order of classical connectivity (node degree), average probability of each node [avg(pij)], entropy of each node [-pj*log(pj)], node asymmetry, defined as the difference between the probabilities averaged by row and by column [pij-pji], or autocorrelation, defined as the product of the same averaged probabilities [pij*pji]; alternatively, the original order of the matrix is kept (no sorting). MCeCoNet can calculate the following types of centralities and topological indices [81]: Markov Shannon Entropy, Markov Traces, Markov Harary number, Markov Wiener index, Markov Gutman topological index, Markov Schultz topological index, Markov Moreau-Broto indices, Markov Balaban distance connectivity index, Markov Kier-Hall connectivity indices, Markov Randic connectivity index, Markov Galvez indices and Markov Leverage indices. The networks are displayed using several Graphviz layout programs (dot, circo, twopi, neato and fdp), and the plots of the centralities/TIs during an attack are generated with gnuplot. The entropy centralities for all the studied complex networks were analyzed with two-way joining cluster analysis (tw-JCA) and tree joining methods from STATISTICA [82].
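A node-removal attack of this kind can be illustrated with the sketch below. It is not the MCeCoNet implementation: it assumes that nodes are removed from highest to lowest original degree, that the Shannon entropy of the k-step probability vector is the index being tracked, that the initial probabilities are uniform, and that isolated nodes simply get an all-zero row (so the vector no longer sums to 1 once nodes become isolated). The names `shannon_entropy` and `degree_attack` are hypothetical.

```python
import math

def shannon_entropy(C, k):
    """Entropy of the k-step probability vector of the Markov chain built from C."""
    n = len(C)
    if n == 0:
        return 0.0
    P = []
    for row in C:
        total = sum(row)
        # Simplification: isolated nodes get an all-zero row.
        P.append([c / total if total else 0.0 for c in row])
    p = [1.0 / n] * n  # assumption: uniform initial probabilities
    for _ in range(k):
        p = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]
    return -sum(x * math.log(x) for x in p if x > 0)

def degree_attack(C, k=1):
    """Remove nodes one at a time, highest original degree first; record the entropy."""
    order = sorted(range(len(C)), key=lambda i: sum(C[i]), reverse=True)
    alive = list(range(len(C)))
    trace = [shannon_entropy(C, k)]  # value before any removal
    for node in order:
        alive.remove(node)
        sub = [[C[i][j] for j in alive] for i in alive]  # induced subgraph
        trace.append(shannon_entropy(sub, k))
    return order, trace

# Toy star graph: node 0 is linked to nodes 1, 2 and 3.
C = [[0, 1, 1, 1],
     [1, 0, 0, 0],
     [1, 0, 0, 0],
     [1, 0, 0, 0]]
order, trace = degree_attack(C, k=1)
```

In the star graph the hub is removed first and the remaining nodes become isolated, so the recorded entropy drops to zero after the first attack step.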

Figure 3. MCeCoNet graphical interface.

Results and discussion

The Shannon entropy centralities were calculated with the MCeCoNet tool for powers k from 0 to 5, without any attack, for the 14 networks (see Table 1). Additional information is included, such as the type of network and its principal topological numbers: the numbers of nodes and edges.

Table 1. Entropy centralities of several types of networks calculated with MCeCoNet.

To detect similarities in the entropy centralities of order k across the networks, we computed a tw-JCA (threshold value of variability StDv/2; see Figure 4). The entropy centralities span the same range for all k values in the case of the linguistic (C_12, C_13) and law (C_3) networks. The tree clustering method gives more detailed results on the similarity between different types of networks based on the entropy centralities (Figure 5). It groups together networks of the same type in the case of Biochemical Pathways (C_4) and Brain Pathways (C_5).

Figure 4. Two-way joining cluster analysis for entropy centralities of studied networks.

Figure 5. Tree clustering of the complex network based on the entropy centralities.

Conclusion

This work presents for the first time a comparison of 14 networks from biology, linguistics, technology, sociology and law using the entropy centralities calculated with a new tool, MCeCoNet. The results show how the information entropy of complex networks can identify similar networks and demonstrate the usefulness of MCeCoNet in network-based studies.

Acknowledgments

González-Díaz H. and Munteanu C. R. acknowledge the Isidro Parga Pondal Programme, Xunta de Galicia, Spain. We also thank the Ibero-American Network of the Nano-Bio-Info-Cogno Convergent Technologies (Ibero-NBIC, 209RT0366), funded by CYTED (Ciencia y Tecnología para el Desarrollo).

References

1. Breiger R. The Analysis of Social Networks. In: Hardy M, Bryman A, eds. Handbook of Data Analysis. London: Sage Publications 2004:505-26.
2. Newman M. The Structure and Function of Complex Networks. SIAM Review. 2003;56:167-256.
3. Abercrombie N, Hill S, Turner BS. Social structure. The Penguin Dictionary of Sociology. 4th ed. London: Penguin 2000.
4. Craig C. Social Structure. Dictionary of the Social Sciences. Oxford: Oxford University Press 2002.
5. White H, Boorman S, Breiger R. Social Structure from Multiple Networks: I. Blockmodels of Roles and Positions. American Journal of Sociology. 1976;81:730-80.
6. Wellman B, Berkowitz SD. Social Structures: A Network Approach. Cambridge: Cambridge University Press 1988.
7. Bornholdt S, Schuster HG. Handbook of Graphs and Complex Networks: From the Genome to the Internet. Weinheim: Wiley-VCH GmbH & Co. KGaA 2003.
8. Newman M. The Structure and Function of Complex Networks. SIAM Review. 2003;56:167-256.
9. Liao B, Wang TM. Analysis of similarity/dissimilarity of DNA sequences based on nonoverlapping triplets of nucleotide bases. J Chem Inf Comput Sci. 2004 Sep-Oct;44(5):1666-70.
10. Liao B, Ding K. Graphical approach to analyzing DNA sequences. J Comput Chem. 2005 Nov 15;26(14):1519-23.
11. Randic M. Condensed representation of DNA primary sequences. J Chem Inf Comput Sci. 2000 Jan-Feb;40(1):50-6.
12. Randič M, Vračko M, Nandy A, Basak SC. On 3-D Graphical Representation of DNA Primary Sequences and Their Numerical Characterization. J Chem Inf Comput Sci. 2000;40:1235-44.
13. Randic M, Basak SC. Characterization of DNA primary sequences based on the average distances between bases. J Chem Inf Comput Sci. 2001 May-Jun;41(3):561-8.
14. Randic M, Balaban AT. On a four-dimensional representation of DNA primary sequences. J Chem Inf Comput Sci. 2003 Mar-Apr;43(2):532-9.
15. Bielinska-Waz D, Nowak W, Waz P, Nandy A, Clark T. Distribution Moments of 2D-graphs as Descriptors of DNA Sequences. Chem Phys Lett. 2007;443:408-13.
16. Agüero-Chapin G, Gonzalez-Diaz H, Molina R, Varona-Santos J, Uriarte E, Gonzalez-Diaz Y. Novel 2D maps and coupling numbers for protein sequences. The first QSAR study of polygalacturonases; isolation and prediction of a novel sequence from Psidium guajava L. FEBS Lett. 2006;580:723-30.
17. Chou KC. Review: Applications of graph theory to enzyme kinetics and protein folding kinetics. Steady and non-steady state systems. Biophys Chem. 1990;35:1-24.
18. Chou KC. Graphical rules in steady and non-steady enzyme kinetics. J Biol Chem. 1989;264:12074-9.
19. Chou KC, Forsen S. Graphical rules for enzyme-catalyzed rate laws. Biochem J. 1980;187:829-35.
20. Chou KC, Liu WM. Graphical rules for non-steady state enzyme kinetics. J Theor Biol. 1981 Aug 21;91(4):637-54.
21. Kuzmic P, Ng KY, Heath TD. Mixtures of tight-binding enzyme inhibitors. Kinetic analysis by a recursive rate equation. Anal Biochem. 1992;200:68-73.
22. Althaus IW, Chou JJ, Gonzales AJ, Diebel MR, Chou KC, Kezdy FJ, et al. Steady-state kinetic studies with the non-nucleoside HIV-1 reverse transcriptase inhibitor U-87201E. J Biol Chem. 1993;268:6119-24.
23. Althaus IW, Chou JJ, Gonzales AJ, Diebel MR, Chou KC, Kezdy FJ, et al. Kinetic studies with the nonnucleoside HIV-1 reverse transcriptase inhibitor U-88204E. Biochemistry. 1993;32:6548-54.
24. Althaus IW, Chou JJ, Gonzales AJ, LeMay RJ, Deibel MR, Chou KC, et al. Steady-state kinetic studies with the polysulfonate U-9843, an HIV reverse transcriptase inhibitor. Experientia. 1994 Jan 15;50(1):23-8.
25. Althaus IW, Chou KC, Lemay RJ, Franks KM, Deibel MR, Kezdy FJ, et al. The benzylthio-pyrimidine U-31,355, a potent inhibitor of HIV-1 reverse transcriptase. Biochem Pharmacol. 1996 Mar 22;51(6):743-50.
26. Chou KC, Kezdy FJ, Reusser F. Review: Steady-state inhibition kinetics of processive nucleic acid polymerases and nucleases. Anal Biochem. 1994;221:217-30.
27. Qi XQ, Wen J, Qi ZH. New 3D graphical representation of DNA sequence based on dual nucleotides. J Theor Biol. 2007 Dec 21;249(4):681-90.
28. Chou KC, Zhang CT, Elrod DW. Do "antisense proteins" exist? J Protein Chem. 1996 Jan;15(1):59-61.
29. Chou KC, Zhang CT. Diagrammatization of codon usage in 339 HIV proteins and its biological implication. AIDS Research and Human Retroviruses. 1992;8:1967-76.
30. Zhang CT, Chou KC. Analysis of codon usage in 1562 E. coli protein coding sequences. J Mol Biol. 1994;238:1-8.
31. Diao Y, Li M, Feng Z, Yin J, Pan Y. The community structure of human cellular signaling network. J Theor Biol. 2007 Aug 21;247(4):608-15.
32. González-Díaz H, Vilar S, Santana L, Uriarte E. Medicinal Chemistry and Bioinformatics – Current Trends in Drugs Discovery with Networks Topological Indices. Curr Top Med Chem. 2007;7(10):1025-39.
33. Gonzalez-Diaz H, Gonzalez-Diaz Y, Santana L, Ubeira FM, Uriarte E. Proteomics, networks and connectivity indices. Proteomics. 2008 Feb;8(4):750-78.
34. Gonzalez-Diaz H, Dea-Ayuela MA, Perez-Montoto LG, Prado-Prado FJ, Aguero-Chapin G, Bolas-Fernandez F, et al. QSAR for RNases and theoretic-experimental study of molecular diversity on peptide mass fingerprints of a new Leishmania infantum protein. Mol Divers. 2009 Jul 4.
35. Concu R, Dea-Ayuela MA, Perez-Montoto LG, Prado-Prado FJ, Uriarte E, Bolas-Fernandez F, et al. 3D entropy and moments prediction of enzyme classes and experimental-theoretic study of peptide fingerprints in Leishmania parasites. Biochim Biophys Acta. 2009 Dec;1794(12):1784-94.
36. Moreno JL. Who Shall Survive? New York: Beacon House 1934.
37. Padgett JF, Ansell CK. Robust action and the rise of the Medici, 1400-1434. Amer J Sociol. 1993;98:1259-1319.
38. Mariolis P. Interlocking directorates and control of corporations: The theory of bank control. Social Sci Quart. 1975;56:425-39.
39. Mizruchi MS. The American Corporate Network, 1904-1974. Beverly Hills: Sage 1982.
40. Klovdahl AS, Potterat JJ, Woodhouse DE, Muth JB, Muth SQ, Darrow WW. Social networks and infectious disease: The Colorado Springs study. Soc Sci Med. 1994;38:79-88.
41. Liljeros F, Edling CR, Amaral LAN, Stanley HE, Aberg Y. The web of human sexual contacts. Nature. 2001;411:907-8.
42. Duardo-Sanchez A. Study of criminal law networks with Markov-probability centralities. In: González-Díaz H, Munteanu CR, eds. Topological Indices for Medicinal Chemistry, Biology, Parasitology, Neurological and Social Networks. Kerala, India: Research Signpost 2010.
43. Porter MA, Mucha PJ, Newman ME, Warmbrand CM. A network analysis of committees in the U.S. House of Representatives. Proc Natl Acad Sci U S A. 2005 May 17;102(20):7057-62.
44. Stabler EP, Jr. Berwick and Weinberg on linguistics and computational psychology. Cognition. 1984 Jul;17(2):155-79.
45. Walker DE. The organization and use of information: contributions of information science, computational linguistics and artificial intelligence. J Am Soc Inf Sci. 1981 Sep;32(5):347-63.
46. Wang Y, Patrick J, Miller G, O'Hallaran J. A computational linguistics motivated mapping of ICPC-2 PLUS to SNOMED CT. BMC Med Inform Decis Mak. 2008;8 Suppl 1:S5.
47. Forster P, Toth A. Toward a phylogenetic chronology of ancient Gaulish, Celtic, and Indo-European. Proc Natl Acad Sci U S A. 2003 Jul 22;100(15):9079-84.
48. Gimona M. Protein linguistics - a grammar for modular protein assembly? Nat Rev Mol Cell Biol. 2006 Jan;7(1):68-73.
49. Friedlaender J, Hunley K, Dunn M, Terrill A, Lindstrom E, Reesink G, et al. Linguistics more robust than genetics. Science. 2009 Apr 24;324(5926):464-5.
50. Smith B. Ontology: philosophical and computational. In: Floridi L, ed. The Blackwell Guide to the Philosophy of Computing and Information. Oxford: Blackwell Publishers 2003.
51. Yu AC. Methods in biomedical ontology. Journal of Biomedical Informatics. 2006;39(3):252-66.
52. Neches R, Fikes R, Finin TW, Gruber TR, Patil R, Senator TE, et al. Enabling technology for knowledge sharing. AI Magazine. 1991;12(3):36-56.
53. Gruber TR. A translation approach to portable ontology specifications. Knowledge Acquisition. 1993;5(2):199-220.
54. McGuinness DL, Van Harmelen F. OWL Web Ontology Language. Overview. [cited 2010 February 22]; Available from: http://www.w3.org/TR/owl-features/
55. Ashburner M, Lewis S. On ontologies for biologists: the Gene Ontology - untangling the web. Novartis Found Symposium. 2002;247:66-80.
56. Stevens R, Goble CA, Bechhofer S. Ontology-based knowledge representation for bioinformatics. Briefings in Bioinformatics. 2000;1(4):398-414.
57. Alonso-Calvo R, Maojo V, Billhardt H, Martin-Sanchez F, Garcia-Remesal M, Perez-Rey D. An agent- and ontology-based system for integrating public gene, protein, and disease databases. J Biomed Inform. 2007 Feb;40(1):17-29.
58. Maojo V, Garcia-Remesal M, Billhardt H, Alonso-Calvo R, Perez-Rey D, Martin-Sanchez F. Designing new methodologies for integrating biomedical information in clinical trials. Methods Inf Med. 2006;45(2):180-5.
59. de la Calle G, Garcia-Remesal M, Chiesa S, de la Iglesia D, Maojo V. BIRI: a new approach for automatically discovering and indexing available public bioinformatics resources from the literature. BMC Bioinformatics. 2009;10:320.
60. Blake J. Bio-ontologies - fast and furious. Nature Biotechnology. 2004;22(6):773-4.
61. Shah NH, Rubin DL, Espinosa I, Montgomery K, Musen MA. Annotation and query of tissue microarray data using the NCI Thesaurus. BMC Bioinformatics. 2007;8(1):296.
62. Noy NF, Shah NH, Whetzel PL, Dai B, Dorf M, Griffith N, et al. BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Research. 2009;37(Web Server issue):W170-W3.
63. Gusev Y, Schmittgen T, Lerner M, Postier R, Brackett D. Computational analysis of biological functions and pathways collectively targeted by co-expressed microRNAs in cancer. BMC Bioinformatics. 2007;8(Suppl 7):S16.
64. Whetzel PL, Parkinson H, Causton HC, Fan L, Fostel J, Fragoso G, et al. The MGED Ontology: a resource for semantics-based description of microarray experiments. Bioinformatics. 2006 Apr 1;22(7):866-73.
65. Munteanu CR, Dorado J, Pazos Sierra A, Prado-Prado F, Pérez-Montoto LG, Vilar S, et al. Markov Entropy Centrality: Chemical, Biological, Crime and Legislative Networks. In: Dehmer M, Emmert-Streib F, Mehler A, eds. Information Theory Analysis of Complex Networks: Statistical Methods and Applications. Springer-Verlag 2010.
66. Rickard AH, McBain AJ, Ledder RG, Handley PS, Gilbert P. Coaggregation between freshwater bacteria within biofilm and planktonic communities. FEMS Microbiol Lett. 2003;220(1):133-40.
67. Mezo M, Gonzalez-Warleta M, Castro-Hermida JA, Ubeira FM. Evaluation of the flukicide treatment policy for dairy cattle in Galicia (NW Spain). Vet Parasitol. 2008 Nov 7;157(3-4):235-43.
68. Batagelj V, Mrvar A. Pajek databases. 2006. Available from: http://vlado.fmf.uni-lj.si/pub/networks/data/
69. Bu D, Zhao Y, Cai L, Xue H, Zhu X, Lu H, et al. Topological structure analysis of the protein-protein interaction network in budding yeast. Nucleic Acids Res. 2003 May 1;31(9):2443-50.
70. Lusseau D, Schneider K, Boisseau OJ, Haase P, Slooten E, Dawson SM. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behavioral Ecology and Sociobiology. 2003;54:396-405.
71. Link Analysis and Visualization. Dagstuhl seminar. Dagstuhl 2001.
72. Roget PM. Roget's Thesaurus of English Words and Phrases. 1993. Available from: http://www.worldwideschool.org/library/books/gnrl/thesaurus/RogetsThesaurus/toc.html
73. Howe D. Free on-line dictionary of computing: FOLDOC (www.foldoc.org) 2002.
74. Batagelj V, Mrvar A. Pajek 1.15. 2006.
75. Batagelj V, Mrvar A. Pajek - Program for Large Network Analysis. Connections. 1998;21:47-57.
76. Bastian M, Heymann S, Jacomy M. Gephi: An Open Source Software for Exploring and Manipulating Networks. International AAAI Conference on Weblogs and Social Media (ICWSM09). North America 2009.
77. Shannon CE. The mathematical theory of communication. 1963. MD Comput. 1997 Jul-Aug;14(4):306-17.
78. Shannon CE. A Mathematical Theory of Communication. Bell System Technical Journal. 1948;27:379-423, 623-56.
79. Rappin N, Dunn R. wxPython in Action. Greenwich, CT: Manning Publications Co. 2006.
80. Koutsofios E, North SC. Drawing Graphs with dot. Murray Hill, NJ, USA: AT&T Bell Laboratories 1993.
81. Todeschini R, Consonni V. Handbook of Molecular Descriptors. Wiley-VCH 2002.
82. StatSoft, Inc. STATISTICA (data analysis software system), version 6.0. www.statsoft.com. 2002.