QSPR-QSAR Studies on Desired Properties for Drug Design
Editor
Eduardo A. Castro Professor of Physical Chemistry and Theoretical Chemistry, La Plata National University INIFTA (UNLP-CONICET), Suc.4, C.C. 16, La Plata 1900, Buenos Aires, Argentina
Research Signpost, T.C. 37/661 (2), Fort P.O., Trivandrum-695 023 Kerala, India
Published by Research Signpost
2010; Rights Reserved
Research Signpost, T.C. 37/661 (2), Fort P.O., Trivandrum-695 023, Kerala, India
Editor: Eduardo A. Castro
Managing Editor: S.G. Pandalai
Publication Manager: A. Gayathri
Research Signpost and the Editor assume no responsibility for the opinions and statements advanced by contributors.
ISBN: 978-81-308-0404-0
Preface
The application of chemoinformatics in the drug discovery process is today a wide field of theoretical and applied research. The development of a new drug is both time-consuming and cost-intensive. The long periods that still elapse before a new compound reaches the market are coupled with a high financial effort. Despite the introduction of combinatorial chemistry and the establishment of high-throughput screening techniques, the average number of new chemical drugs introduced annually to the world market is relatively low. The major reasons for attrition of new drugs are lack of clinical efficacy, inappropriate pharmacokinetics, animal toxicity, adverse reactions in humans, commercial reasons, and formulation issues. As a consequence, the prediction of pharmacological properties, in addition to lead finding, is a central task of chemoinformatics in drug development. QSAR/QSPR theories are concerned with finding quantitative structure-activity and structure-property relationships, respectively, in order to make sound inferences about which structure might have a desired biological activity and/or a given physico-chemical property. The different approaches for proposing useful QSAR/QSPR relationships have grown markedly in recent years, and the related predictions are very accurate, so that their place in the drug design realm is widely recognized. The purpose of this book is to present some valuable and up-to-date contributions of frontier QSAR/QSPR studies on desired properties for drug design, and I hope that readers will find it a relevant document and a useful set of alternative methods and valuable results.
Eduardo A. Castro
Contents
Chapter 1 Quantitative structure-activity relationship (QSAR) studies on bioactive cyclopeptides Alicia B. Pomilio, Stella M. Battista and Arturo A. Vitale
Chapter 2 QSPR studies on amino acids: Application to proteins Francisco Torrens and Gloria Castellano
Chapter 3 Molecular topology in QSAR and drug design studies J. Gálvez and R. García-Domenech
Chapter 4 QSAR models constructed by optimal descriptors and by multiple regression analysis for the prediction of carbonic anhydrase II inhibitory activity of substituted aromatic sulfonamides Georgia Melagraki, Antreas Afantitis, Andrey A. Toropov, Haralambos Sarimveis and Olga Igglessi-Markopoulou
Chapter 5 From molecular structure to molecular design through the Molecular Descriptors Family Methodology Sorana D. Bolboacă and Lorentz Jäntschi
Chapter 6 Selection of an optimal set of descriptors: use of the Enhanced Replacement Method Andrew G. Mercader
Chapter 7 Orthogonalization methods in QSPR - QSAR Studies Pablo R. Duchowicz, Francisco M. Fernández and Eduardo A. Castro
Chapter 8 Modeling some chemical reactions as a spatio-temporal discrete dynamical systems Juan Luis García Guirao
QSPR-QSAR Studies on Desired Properties for Drug Design, 2010: 1-34 ISBN: 978-81-308-0404-0 Editor: Eduardo A. Castro
1. Quantitative structure-activity relationship (QSAR) studies on bioactive cyclopeptides
Alicia B. Pomilio1,*, Stella M. Battista1 and Arturo A. Vitale2,*
1PRALIB (UBA-CONICET), Facultad de Farmacia y Bioquímica, Universidad de Buenos Aires, Junín 956, C1113AAD Buenos Aires, Argentina; 2Research Institute INIFTA (UNLP-CCT La Plata-CONICET), Universidad Nacional de La Plata, Diag. 113 y 64, Suc. 4, CC 16, B1900 La Plata, Argentina
Abstract. Cyclopeptides show a variety of structures, from cyclodipeptides through cyclohepta- and octapeptides to cyclotides with 30 amino acids, and can be found in bacteria, insects, higher plants, fungi, animals, and humans. Peptides, and in particular cyclopeptides, are therapeutic targets which play an important role in medicinal chemistry, including drug design and quantitative structure-activity relationships (QSARs). However, few QSAR studies have been carried out on cyclopeptides in comparison with small organic molecules. Some structural features, substructures, and functional groups contribute to enhancing the bioactivity. Configurational and conformational studies of cyclopeptides of plant and fungal origin were analysed, and related to antitumour activity and toxicity. Some physico-chemical properties, e.g., the partition coefficient (log P), and several molecular parameters have been found to be relevant for activity. QSAR models, including a variety of descriptors, have been applied to synthetic and natural cyclopeptides as well as to their analogues. The balance between antimicrobial and haemolytic properties for the design of antimicrobial cyclic peptides is discussed. Considerations about the anticancer and cytotoxic activity of cyclopeptides are also included. Studies of this kind also help in the design of selective drugs.
* Research Members of the National Research Council of Argentina (CONICET). These authors contributed equally to this work.
Correspondence/Reprint request: Prof. Dr. Alicia B. Pomilio, PRALIB (UBA-CONICET), Facultad de Farmacia y Bioquímica, Universidad de Buenos Aires, Junín 956, C1113AAD Buenos Aires, Argentina. E-mail: pomilio@ffyb.uba.ar
Abbreviations
QSARs, quantitative structure-activity relationships; SARs, structure-activity relationships; NMR, nuclear magnetic resonance; MS, mass spectrometry; ESI-MS, electrospray ionization mass spectrometry; LPS, lipopolysaccharide; LA, lipid A; MIC, minimum inhibitory concentration; IC50, inhibitory concentration for 50% inhibition; 2D-NMR, two-dimensional NMR; 3D-QSAR, three-dimensional QSAR; Abu, L-α-aminobutyric acid; Ala, alanine; Arg, arginine; Asn, asparagine; Cys, cysteine; β-Phe, β-phenylalanine; Phe, phenylalanine; Gly, glycine; Leu, leucine; Lys, lysine; Pro(Cl2), β,γ-dichlorinated proline; HOPro, hydroxyproline; D-Pro, D-proline; Pro, proline; Ser, serine; alloThr, allothreonine; Thr, threonine; Trp, tryptophan; Tyr, tyrosine; Val, valine. A, alanine; C, cysteine; D, aspartic acid; E, glutamic acid; F, phenylalanine; G, glycine; H, histidine; I, isoleucine; K, lysine; L, leucine; M, methionine; N, asparagine; P, proline; Q, glutamine; R, arginine; S, serine; T, threonine; V, valine; W, tryptophan; Y, tyrosine.
Introduction
Proteins and peptides play an important role in biological systems and affect most physiological processes of plants, animals and microorganisms, and in particular of humans, usually acting as agonists and/or antagonists at specific receptors. Bioactive cyclopeptides, or cyclic peptides, have been found in animals, plants, fungi, bacteria, insects and humans. Cyclopeptides have shown cytotoxic, cytostatic, antifungal, antiviral, antibacterial, plant-stimulating, insecticidal, antimalarial, estrogenic, sedative, nematicidal, immunosuppressive, and enzyme-inhibitory activities. Recently, the occurrence, structures and bioactivity of cyclopeptides in higher plants and higher fungi have been reported [1]. In fact, many biologically active proteins and peptides, including cyclopeptides, are involved in disease processes, and the abnormal expression of these peptides has also been associated with human disease [2]. The experimental demonstration of the occurrence of antimicrobial peptides in several unexplained human inflammatory disorders can provide novel therapeutic approaches to the treatment of disease. Therefore, these compounds can be therapeutic targets, which can be used in the development of future drugs [3].
A variety of antimicrobial peptides are naturally produced by different organisms through either ribosomal (defensins and small bacteriocins) or non-ribosomal synthesis (peptaibols, cyclopeptides and pseudopeptides). Some of these natural antimicrobial peptides are used for the design of new synthetic analogues, and have also been expressed in transgenic plants to confer disease protection. These compounds are also secreted by microorganisms, being active ingredients of commercial biopesticides [4,5]. However, most native peptides have limited applicability as drug candidates due to: (a) rapid metabolism by proteolysis, (b) poor absorption from the gastrointestinal tract and poor transport across the blood-brain barrier, (c) rapid excretion by the liver and kidneys, and (d) lack of receptor specificity due to conformational flexibility [6]. Therefore, the structure and key pharmacophore groups of native peptides are usually converted into new nonpeptidic drug candidates [7], the so-called peptidomimetics. The development of peptidomimetic protease inhibitors has become feasible in recent years thanks to three-dimensional (3D) structural information on proteases from X-ray diffraction and NMR spectroscopy [8,9]. Until the 3D structures of the important group of seven-transmembrane G-protein-coupled receptors (GPCRs) were resolved [10], it was necessary to rely on homology modelling of the receptors based on the structure of bovine rhodopsin, and on site-specific mutagenesis, to provide important clues in the development of new drugs for these receptors [11].
Structural features of the peptides
Peptides have a large degree of conformational freedom due to their freely rotating bonds, thus giving rise to a large number of conformations. However, in solution and in the absence of receptors or enzymes, the biologically active conformation may be poorly populated. Conformationally constrained analogues can significantly contribute to the identification of these conformations. Hence, peptidomimetics provide valuable information on SARs of both peptides and complex proteins [12]. Once the primary structure of the biologically active peptide has been identified, the first design step is to remove the amino acids from the amino and carboxy termini, one at a time, in order to obtain the smallest biologically active peptide fragment. Subsequently, side chain requirements, e.g., pharmacophore groups, can be determined by systematically replacing each residue in the peptide with a specific amino acid and evaluating the biological activity. The most often used amino acid is Ala, but occasionally Gly is used [13].
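The two design steps just described (stepwise removal of terminal residues, then systematic replacement of each residue by Ala or Gly) can be illustrated by enumerating the candidate sequences to be synthesized and tested. The following is only a minimal sketch: the function names, the `min_len` parameter and the example sequence are hypothetical, sequences are written in one-letter code, and the biological assay itself is outside its scope.

```python
def terminal_truncations(seq, min_len=2):
    """List fragments obtained by removing residues one at a time
    from the N-terminus and then from the C-terminus."""
    steps = range(1, len(seq) - min_len + 1)
    n_truncated = [seq[i:] for i in steps]           # drop from the amino terminus
    c_truncated = [seq[:len(seq) - i] for i in steps]  # drop from the carboxy terminus
    return n_truncated + c_truncated


def residue_scan(seq, probe="A"):
    """Replace each residue in turn with the probe amino acid
    (Ala by default; pass probe="G" for a Gly scan).
    Positions that already carry the probe are skipped."""
    return [seq[:i] + probe + seq[i + 1:]
            for i, residue in enumerate(seq) if residue != probe]


# Hypothetical pentapeptide, for illustration only.
peptide = "GRKFW"
print(terminal_truncations(peptide))
print(residue_scan(peptide))
```

Each generated sequence corresponds to one analogue whose activity would be compared with that of the parent peptide; a large loss of activity on replacement points to a side chain that belongs to the pharmacophore.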
Constrained structures mimicking the bioactive conformation will be less exposed to proteolytic cleavage, may give more selective ligands, and provide an entropy advantage in receptor binding compared to the more flexible linear peptides [12,14]. After the SAR of each amino acid in the peptide has been determined, the next step is to try to establish the bioactive conformation(s). The conformational freedom of the highly flexible peptide can be reduced by the introduction of local and/or global constraints. Methods of obtaining local constraints include incorporation of modified amino acids, e.g., D-amino acids and N-methyl amino acids, introduction of modified amide bonds, or short-range cyclizations forming a link between two backbone termini, between one of the termini and one side chain, between two side chains, or between backbone atoms other than the termini. Secondary structure mimetics can also be used as constraints. When incorporated into a peptide, a secondary structure mimetic enforces a particular conformation. Incorporation of such a moiety provides additional information about requirements for receptor binding and/or activation [14]. SARs of the constrained analogues, together with information obtained from NMR spectroscopy and molecular modelling, can, in an iterative process, give a 3D pharmacophore model of the bioactive conformation. The last step is to introduce the pharmacophore groups onto a nonpeptidic scaffold in the correct spatial arrangement, in agreement with the obtained model of the bioactive conformation [12-14]. The receptor-based screening of natural and synthetic compound collections has proved to be a useful method for identifying peptidomimetics [11]. Furthermore, design, combinatorial chemistry and classical medicinal chemistry play important roles. Hirschmann et al. [15] reported an example in which pharmacophore modelling was used to obtain a peptidomimetic agonist of the cyclopeptide hormone somatostatin.
Furthermore, a systematic exploration of the conformational space for a series of analogues of FC131, a cyclopentapeptide CXCR4 antagonist, has recently been performed, thus leading to a minimalistic 3D pharmacophore model for binding of these cyclopentapeptides [16]. The chemokine receptor CXCR4 is involved in HIV entry, and therefore is an attractive target for antiretroviral drugs. The most essential conformational components of peptides and proteins are the secondary structure elements: α-helices, β-sheets, reverse turns and loops. These elements are often located on the protein surface and seem to act as molecular recognition sites in biological processes. Even small peptide fragments can fold into turn conformations in which the amino acid side chains are displayed on the surface of a compact backbone core [17]. There is
a lot of evidence suggesting that the side chain groups in the peptide are the most important recognition elements in peptide-receptor interaction [11]. The reverse turns can be divided into β-turns and γ-turns. One of the most common structure motifs in proteins is the β-turn [18]. To be considered a β-turn, the tetrapeptide sequence should not be part of an α-helical region, and the distance between Cα of the first and Cα of the fourth amino acid residue should be ≤ 7 Å. A number of slightly different turn types have been reported [19]. A hydrogen bond between the carbonyl in the first residue and the NH group in the fourth residue is often present to stabilize the turn in a pseudo-ten-membered ring structure. A γ-turn, which is a rarer reverse turn, consists of three amino acid residues. It is defined by a hydrogen bond between the carbonyl group in the first residue and the NH group of the third residue, forming a pseudo-seven-membered ring. γ-Turns are divided into two classes, inverse and classic. The second side chain is oriented in an equatorial position in the most common inverse γ-turn, while the rare classic γ-turn has an axial second side chain. γ-Turns are rare in proteins but are frequently found in small peptides, especially in small cyclic peptides [20].
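The geometric β-turn criterion quoted above (tetrapeptide not in an α-helical region, Cα(i) to Cα(i+3) distance ≤ 7 Å) lends itself to a simple check on atomic coordinates. The sketch below assumes that the Cα coordinates are available in ångströms and that the helix assignment comes from a separate secondary-structure analysis; the function name and the example coordinates are hypothetical, and the optional hydrogen bond is not tested.

```python
import math


def is_beta_turn_candidate(ca_first, ca_fourth, in_alpha_helix, cutoff=7.0):
    """Apply the two criteria from the text to a tetrapeptide stretch.

    ca_first, ca_fourth: (x, y, z) coordinates in angstroms of the
        C-alpha atoms of the first and fourth residues.
    in_alpha_helix: True if the stretch belongs to an alpha-helical
        region (must be supplied by a separate assignment step).
    cutoff: maximum allowed C-alpha to C-alpha distance in angstroms.
    """
    if in_alpha_helix:
        return False
    return math.dist(ca_first, ca_fourth) <= cutoff


# Hypothetical coordinates giving a C-alpha separation of 5.0 angstroms.
print(is_beta_turn_candidate((0.0, 0.0, 0.0), (3.0, 4.0, 0.0), False))
```

A stretch passing this test is only a candidate: assigning the turn type would additionally require the backbone dihedral angles of the two central residues.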
Cyclotides
Cyclotides are small plant proteins, i.e., cyclic disulfide-rich plant peptides of about 30 amino acids, found in plants of the Rubiaceae, Violaceae and Cucurbitaceae families [21], and are believed to be part of the host defence system [22] on the basis of their high expression levels in plants, and their toxic and growth-retardant activity in feeding trials against Helicoverpa spp. insect pests [23]. These compounds have a macrocyclic peptide backbone and a cystine-knotted arrangement made up of six conserved cysteine residues (three conserved disulfide bonds), which makes them very stable [24]. This unique structure, together with a variety of biological activities, makes them of great interest as possible leads in drug development or as carriers of grafted peptide sequences [25,26]. A database of cyclic protein sequences and structures, with applications in protein discovery and engineering, is available [27], as are chemical and biomimetic total syntheses of natural and engineered MCoTI cyclotides [28]. Molecular dynamics simulations and MM-PBSA free energy calculations have been carried out to elucidate the structure and folding of these disulfide-rich miniproteins [29]. The NMR spatial structure of the ternary complex Kalata B7/Mn2+/DPC micelle has been recently reported [30].
The insecticidal activity of cyclotides and the comparison with structurally similar cystine knot proteins from peas (Pisum sativum) and an amaranth crop plant (Amaranthus hypochondriacus) have been recently reviewed [23]. In addition to their insecticidal effect, cyclotides have also been shown to be cytotoxic, anti-HIV [31], anthelmintic [32,33], molluscicidal [34], antimicrobial and haemolytic agents (see Anticancer and cytotoxic activities section).
Configurational and conformational studies on bioactive cyclopeptides
Conformational studies of the natural cyclopeptides and their derivatives were related to the stereochemical requirements for a bioactive compound. Antitumour cyclic pentapeptides, astins A-H, have been isolated from Aster tataricus (Asteraceae), which is known as a Chinese medicine and as an ornamental higher plant [35]. Astins contain a 16-membered ring system with a unique mono- or dichlorinated proline and/or allothreonine residues. The main active principle, astin B, contains a β,γ-dichlorinated Pro, an alloThr, a Ser, a β-Phe and an α-aminobutyric acid, and was identified as cyclo(Pro(Cl2)-alloThr-Ser-β-Phe-Abu). Astin C was identified as cyclo(Pro(Cl2)-Abu-Ser-β-Phe-Abu) [36]. Thus, astins A-C (Fig. 1) proved to be similar to cyclochlorotine, cyclo(Pro(Cl2)-Abu-Ser-β-Phe-Ser), which has been isolated from Penicillium islandicum Sopp. (Fig. 2). In order to understand the mechanisms involved in the action of cyclic peptides, it was necessary to assess their conformational characteristics. The conformational analysis of the antitumour astin B was performed by 2D NMR techniques [9], temperature effects on NH protons, rate of hydrogen-deuterium exchange, vicinal NH-CαH coupling constants, and NOE experiments [37].
Figure 1. Structures of astins A, B and C from the higher plant Aster tataricus (Asteraceae). Astin A: R1 = H, R2 = OH; astin B: R1 = OH, R2 = H; astin C: R1 = H, R2 = H.
Figure 2. Structure of cyclochlorotine isolated from Penicillium islandicum.
The combination of 2D NMR analysis with molecular dynamics and mechanics calculations allowed the energetically favourable conformation of astin B in solution to be determined. The methods of molecular mechanics and restrained molecular dynamics calculations were applied to understand the energetic preferences of various conformations of astin B. A conformational difference was observed between astin B, showing a cis configuration in a Pro amide bond, and cyclochlorotine from Penicillium islandicum, showing all-trans amide configurations [38]. A detailed knowledge of the conformation of astin B in a polar solvent such as DMSO-d6 was considered the basis for SARs, allowing the design of new analogues with higher activity [39]. The assignments of the 1H and 13C NMR signals of astin B were made by a combination of 1H-1H COSY, HMQC and HMBC spectra [35,36]. The HMBC spectrum, which provided 1H-13C long-range couplings, proved to be extremely valuable for the assignments. The conformational determination of astin B in solution was made on the basis of the results of the following experiments: hydrogen bonding, vicinal NH-CαH coupling, NOE enhancements, and quenched molecular dynamics [36,38]. Computational procedures using NMR data were applied to the elucidation of the solution conformation of astin B and further to the disclosure of the difference between the conformations in the solid and solution states. Molecular dynamics techniques were applied to astins. Three distance constraints involved in the hydrogen bonds, as also found in the crystal state, and four distance constraints derived from the NOE experiments were used to show that this solution structure of astin B was consistent with experimental data [38].
Conformations, dipole moments and toxicity
The conformation of the cyclopeptides isolated from the higher fungus Amanita phalloides (Vaill. ex Fr.) Secr. has been related to their toxicity. Hence, the electronic structures and conformations of the cyclopeptides,
O-methyl-α-amanitin, phalloidin, and antamanide, were obtained from molecular parameters on the basis of semiempirical and ab initio methods [40]. The electronic structures and conformational analysis of the toxic cyclopeptides α-amanitin, O-methyl-α-amanitin, S-deoxo-α-amanitin, α-amanitin-(S)-sulfoxide and α-amanitin-sulfone (Fig. 3) were obtained from molecular parameters on the basis of AM1 and ab initio methods [41]. Accordingly, the planar indole moiety of α-amanitin was found to lie ahead of the rest of the bean-shaped bicyclic structure (Fig. 4). Therefore, the upper and lower sides of the π-heterocycle were available for interacting with any π-compounds, forming stable π-complexes. This region was also as lipophilic as required for transport through membranes to enter cells [40,41]. Total and point-charge dipole moments and sp hybrid were calculated for the five cyclic peptides of Fig. 3. The negative charge was located on the sulphur atom of α-amanitin, O-methyl-α-amanitin, S-deoxo-α-amanitin, α-amanitin-(S)-sulfoxide and α-amanitin-sulfone towards the inner cavity. This cavity was therefore negative, nucleophilic, and thus suited to scavenging cations to form complexes, probably with a high stability constant. Inclusion of molecules was also possible depending on the size of the inner cavity of each compound, ranging from nearly 10 Å to 6-8 Å [41]. Smaller dipole moment values were associated with a higher toxicity of the molecule examined. The lowest point-charge dipole moment was that of α-amanitin-sulfone, similar to that of the thioether S-deoxo-α-amanitin, while the highest point-charge dipole moment was found for α-amanitin-(S)-sulfoxide, followed by the (R)-isomer, α-amanitin, due to the distinct
Substituents:
α-amanitin: R1 = CH2OH; R2 = OH; R3 = OH; R4 = NH2; R5 = OH
β-amanitin: R1 = CH2OH; R2 = OH; R3 = OH; R4 = OH; R5 = OH
γ-amanitin: R1 = CH3; R2 = OH; R3 = OH; R4 = NH2; R5 = OH
ε-amanitin: R1 = CH3; R2 = OH; R3 = OH; R4 = OH; R5 = OH
amanin: R1 = CH2OH; R2 = OH; R3 = H; R4 = OH; R5 = OH
amanin amide: R1 = CH2OH; R2 = OH; R3 = H; R4 = NH2; R5 = OH
amanullin: R1 = CH3; R2 = H; R3 = OH; R4 = NH2; R5 = OH
amanullinic acid: R1 = CH3; R2 = H; R3 = OH; R4 = OH; R5 = OH
promanullin: R1 = CH3; R2 = H; R3 = OH; R4 = NH2; R5 = H
Figure 3. Structures of α-amanitin, O-methyl-α-amanitin, S-deoxo-α-amanitin, α-amanitin-(S)-sulfoxide and α-amanitin-sulfone.
Figure 4. Spatial structure of α-amanitin.
orientation of the oxygen atom in relation to the inner cavity, thus being able to modify its negative charge [41]. The α-amanitin-(R)-sulfoxide decreased the inner negative charge, while the (S)-compound increased it. The important difference was the direction of the dipole moment vector of both cyclic compounds. The direction of the dipole moment was nearly the same for α-amanitin, O-methyl-α-amanitin, S-deoxo-α-amanitin, and α-amanitin-sulfone; only α-amanitin-(S)-sulfoxide showed quite a distinct orientation. In the case of α-amanitin-(S)-sulfoxide, HOPro (amino acid 2), Asn (amino acid 1) and Cys (amino acid 8) were situated on the axis of the dipole moment. The dipole moment pointed away from the sulphur atom towards this positively charged portion of the molecule. Therefore, in the (S)-sulfoxide, the distribution of charges was disturbed owing to the marked polar character of one side, which usually should take part in the binding to macromolecules. Furthermore, this feature hindered its passage through membranes. Hence, on the basis of dipole moment calculations it was possible to explain the decrease in toxicity and binding of this molecule, and the results were in agreement with the inhibition constants Ki for RNA polymerase II and lethal doses in mice [1,41].
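For orientation, a point-charge dipole moment of the kind discussed here is the vector sum of the partial atomic charges multiplied by their positions, so that both its magnitude and its direction follow directly from the charge distribution. The sketch below is a generic illustration and not the procedure of refs. [40,41]: the function name and the two-site example are hypothetical, charges are assumed to be in units of e and coordinates in ångströms, and the result is independent of the origin only for a molecule with zero net charge.

```python
import math

E_ANGSTROM_TO_DEBYE = 4.80320  # 1 e·Å expressed in debye


def point_charge_dipole(charges, coords):
    """Return (vector, magnitude) of the point-charge dipole in debye.

    charges: partial atomic charges in units of e.
    coords: matching (x, y, z) positions in angstroms.
    The vector points from negative towards positive charge.
    """
    if len(charges) != len(coords):
        raise ValueError("charges and coords must have the same length")
    mu = [0.0, 0.0, 0.0]
    for q, position in zip(charges, coords):
        for axis in range(3):
            mu[axis] += q * position[axis]
    vector = tuple(component * E_ANGSTROM_TO_DEBYE for component in mu)
    magnitude = math.sqrt(sum(component ** 2 for component in vector))
    return vector, magnitude


# Hypothetical two-site model: a negative site at the origin and a
# positive site 1 Å away along the x axis.
vector, magnitude = point_charge_dipole(
    [-0.5, 0.5], [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
)
print(vector, magnitude)
```

In this toy case the vector points along +x, away from the negative site and towards the positive one, which is the same sign convention as in the statement that the dipole moment points away from the sulphur atom towards the positively charged portion of the molecule.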
Hydrophobicity of cyclopeptides
The physical properties of peptides have not been studied as extensively as those of small organic molecules. In fact, only a few QSAR studies on peptides and proteins have been carried out [42]. Physico-chemical descriptors, such as the partition coefficient, are useful for selecting compounds for screening and for the development of predictive QSAR models. In fact, experimental partition coefficients are main descriptors of
lipophilicity or hydrophobicity, and many other ADMET (absorption, distribution, metabolism, excretion and toxicity) properties [43-47]. Hydrophobicity governs a variety of biological processes, such as transport, distribution and metabolism of biological molecules, molecular recognition, and protein folding. Therefore, such a parameter is essential to predict the transport and activity of drugs and potential pharmaceuticals. As is known, the partition coefficient (P) is the ratio between the molar concentration of a chemical compound in an organic nonpolar layer, e.g., n-hexane, and that in an aqueous layer, e.g., water. This partition coefficient is expressed as log P. Elution times from RP-HPLC have been used as a measure of the relative hydrophobicity of peptides and peptide analogues [48]. Unfortunately, the availability of measured log P values for peptides is limited [42]. During the past three decades, many methods for the prediction of log P have been reported [49-51]. Various physico-chemical parameters were used in these models, including structural effects, β-turn formation corrections, N- and C-terminal effects, etc. Akamatsu’s results were incorporated into the PLogP program [52]. At present, the most widely used method is an additive approach, where a molecule is divided into fragments, and the log P value is obtained by summing the contributions of each fragment. Addition of the log P values of each atom within the compound is also used, e.g., in the XLogP program [53]. Other approaches are based upon the use of topological indices and quantum mechanics. Thompson et al. [42] have recently reported on the accuracy of available programs for the prediction of log P values for peptides as effective measures of hydrophobicity for use in peptide QSAR studies [54]. Eight log P prediction programs were tested, of which seven programs were fragment-based methods.
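The definition of log P and the additive fragment idea can be written out in a few lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, the fragment contribution values are invented for illustration and are not taken from PLogP, XLogP or any published scheme, and real programs add correction terms (e.g., for β-turn formation or terminal effects) that are omitted here.

```python
import math


def log_p(conc_organic, conc_aqueous):
    """log P from the molar concentrations of a compound in the
    organic layer and in the aqueous layer at equilibrium."""
    if conc_organic <= 0 or conc_aqueous <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_organic / conc_aqueous)


def additive_log_p(fragment_counts, contributions):
    """Estimate log P as the sum of fragment contributions,
    each weighted by how often the fragment occurs."""
    return sum(count * contributions[name]
               for name, count in fragment_counts.items())


# Illustrative (invented) contribution values, not a real parameter set.
contributions = {"CH3": 0.5, "CH2": 0.5, "OH": -1.0, "CONH": -1.5}

# A compound ten times more concentrated in the organic layer.
print(log_p(1.0e-3, 1.0e-4))

# Hypothetical molecule described by its fragment counts.
print(additive_log_p({"CH3": 2, "CH2": 1, "CONH": 1}, contributions))
```

Atom-based methods follow the same summation with atoms in place of fragments; the quality of the estimate then rests entirely on the parameter set and on the corrections applied.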
Owing to the different input requirements of each program, various representations of the structures were used: amino acid sequences for use with PLogP [52]; SMILES strings [55] for ALogP, LogKow and Interactive Analysis’s LogP (IALogP); 2D SYBYL ‘mol2’ files for XLogP [53]; and 3D structures from Corina [56] for MLogP and for the one whole-molecule approach (QikProp). The dataset consisted of 340 peptides, varying from 2 to 16 amino acids in length, and included 141 blocked peptides, 158 unblocked peptides, and 41 cyclic peptides [42]. The predictive accuracy of the programs was assessed using r2 values, with ALogP being the most effective program and MLogP the least effective. Blocked, unblocked, and cyclic peptide structures were studied. All programs gave better predictions for blocked peptides, while, in general, log P values for cyclic peptides were under-predicted and those of unblocked peptides were over-predicted. The performance of the programs (from best to
worst) for cyclopeptides was as follows: LogKow, ALogP, XLogP, MLogP, QikProp, ACDLogP, IALogP [42]. Hattotuwagama and Flower [57] developed a new approach to the prediction of log P values for both blocked and unblocked peptides based on an empirical relationship between global molecular properties and measured physical properties. The final model consisted of five physico-chemical descriptors: molecular weight, number of single bonds, two-dimensional van der Waals (2D-VDW) volume, and two-dimensional hydrophobic and polar van der Waals surface areas (2D-VSA). The approach was peptide-specific and its predictive accuracy was high, but it was not applied to cyclopeptides. It is worth mentioning that many authors have suggested that measuring the partition into other organic phases, such as phospholipid bilayers or micelles, might be more adequate than n-hexane for seeking biologically relevant measures of peptide hydrophobicity.
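The comparison of log P programs described above relies on r2 values and on whether a program under- or over-predicts a class of peptides. Both quantities are easy to compute from paired measured and predicted values. The sketch below assumes that r2 is the squared Pearson correlation coefficient, which is a common choice but is not stated explicitly in the text; the function names and the numbers are hypothetical.

```python
def r_squared(measured, predicted):
    """Squared Pearson correlation between measured and predicted values."""
    n = len(measured)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_m == 0 or var_p == 0:
        raise ValueError("r2 is undefined for a constant series")
    return cov * cov / (var_m * var_p)


def mean_signed_error(measured, predicted):
    """Average of (predicted - measured); a negative value indicates
    systematic under-prediction, a positive value over-prediction."""
    return sum(p - m for m, p in zip(measured, predicted)) / len(measured)


# Invented log P values, for illustration only.
measured = [-1.2, -0.4, 0.3, 1.1]
predicted = [-1.9, -1.0, -0.2, 0.4]
print(r_squared(measured, predicted))
print(mean_signed_error(measured, predicted))
```

The two numbers answer different questions: a program can correlate well with experiment (high r2) and still be systematically shifted, which is why the sign of the mean error is worth reporting alongside r2.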
Antimicrobial activity of peptides
The extensive clinical use of the classical antibiotics has led to resistant bacterial strains, in particular those responsible for infectious diseases [58,59]. Therefore, new effective antibiotics are required [60]. Cationic antimicrobial peptides can represent such a class of antibiotics [61,62]. The endotoxin of Gram-negative bacteria is a lipopolysaccharide (LPS), which is a component of the outer membrane of these bacteria, and the endotoxic membrane anchor moiety is lipid A (LA) [63]. LPS is released during Gram-negative bacterial infection and antimicrobial therapy and/or bacterial lysis, and may result in a lethal endotoxemia [64]. Therefore, the target of any novel class of antimicrobial peptides is the neutralization of LPS and/or LA [65]. Endotoxin-binding host defence proteins showed that an LPS- and LA-binding substructure was formed by amphipathic sequences, i.e., with hydrophilic and hydrophobic moieties on opposite faces of the molecule [66,67], rich in cationic residues with a β-sheet conformation [68,69]. The amphipathicity of antimicrobial peptides was necessary for their mechanism of action, because the positively charged polar face would help the molecules reach the biomembrane through electrostatic interaction with the negatively charged head groups of phospholipids, and then the nonpolar face of the peptides would allow insertion into the membrane through hydrophobic interactions, causing increased permeability and loss of barrier function of target cells [68,70]. Amphipathic cationic peptides were therefore proposed as antimicrobials against Gram-negative bacteria by targeted disruption of LPS [71].
Antimicrobial peptides take part in the innate immune response by providing a rapid first-line defence against infection [72]. Examples of antimicrobial peptides are magainins, cecropins, defensins, lactoferricins, tachyplesins, protegrins, thanatin, and others [73,74]. Antimicrobial peptide databases have been recently reported [75,76]. These compounds have been classified into three classes on the basis of secondary structures: a) linear peptides with a propensity for amphiphilic α-helical structure [77,78], which mainly occur as disordered structures in aqueous media and become amphipathic helices upon interaction with hydrophobic membranes [79,80], e.g., cecropins, magainins, and melittins; b) peptides with β or αβ structure stabilized by different numbers of disulfide bridges. The β-sheet class consists of cyclic peptides constrained in this conformation either by intramolecular disulfide bonds, e.g., defensins [81] and protegrins [82], or by an N-terminal to C-terminal covalent bond, e.g., gramicidin S and tyrocidins [83]; and c) peptides with over-representation of certain amino acids or unusual structures. The third group has been recently reviewed [84]; it includes aromatic amino acid-rich peptides, (Pro-Arg)-rich peptides, unusual defensins and defensin-like molecules, unusual antimicrobial peptides from amphibians, bacteriocins with unusual structure and anionic antimicrobial peptides [84]. Design of new molecules has been achieved using combinatorial chemistry procedures coupled to high-throughput screening systems and data processing with design-of-experiments (DOE) methodology to obtain QSAR equation models and optimized compounds. Upon selection of the best candidates with low cytotoxicity and moderate stability to protease digestion, anti-infective activity has also been evaluated in plant-pathogen model systems [85].
Large-scale production of very small peptides can be achieved by organic synthesis in solution or by chemoenzymatic procedures but, in many cases, production can be performed by biotechnological methods using genetically modified microorganisms (fermentation) or transgenic crops (plant biofactories) [85].

A variety of human proteins and peptides have antimicrobial activity and play important roles in innate immunity. There are three important groups of human antimicrobial peptides: defensins, histatins, and cathelicidins [86]. Defensins are cationic non-glycosylated peptides containing six cysteine residues that form three intramolecular disulfide bridges, resulting in a triple-stranded β-sheet structure, e.g., α-defensins and β-defensins in humans. The second group is the family of histatins, which are small, cationic, histidine-rich peptides present in human saliva. Histatins adopt a random coil
QSAR of cyclopeptides
conformation in aqueous solvents and form α-helices in non-aqueous solvents. The third group comprises only one antimicrobial peptide, the cathelicidin LL-37. This peptide is derived proteolytically from the C-terminal end of the human CAP18 protein. Like the histatins, it adopts a largely random coil conformation in a hydrophilic environment and forms an α-helical structure in a hydrophobic environment [86].

The cathelicidin and defensin gene families encode multifunctional natural antibiotic peptides and signalling molecules that activate host cell processes involved in immune defence and repair [87,88]. In mammals, defensins have evolved to have a central function in the host defence properties of granulocytic leukocytes, mucosal surfaces, skin and other epithelia. Three structural subgroups of mammalian defensins are involved as effectors of antimicrobial innate immunity [89]. Furthermore, over 80 different α-defensin or β-defensin peptides are expressed by the leukocytes and epithelial cells of birds and mammals. Although these compounds may be candidates for therapeutic development owing to their broad-spectrum antimicrobial properties, there are technical limitations related to their size (30-45 residues) and complex structure. Therefore, minidefensins have been developed, which are antimicrobial peptides with 16-18 residues, approximately half the number found in α-defensins [90]. The θ-defensins are evolutionarily related to α- and β-defensins, but other minidefensins probably arose independently. Like α- or β-defensins, minidefensin molecules have a net positive charge and a largely β-sheet structure that is stabilized by intramolecular disulfide bonds. Whereas α-defensins are found only in mammals and θ-defensins only in nonhuman primates, the other minidefensins come from widely divergent species, including horseshoe crabs, spiders, and pigs.
Several α-defensins and minidefensins are effective inhibitors of HIV-1 infection in vitro, and recent evidence implicates α-defensins in resistance to HIV-1 progression in vivo [90].

Cyclic peptides have been, and continue to be, studied as potential antimicrobial therapeutic agents. These peptides usually exhibit broad-spectrum activity against Gram-positive and Gram-negative bacteria, yeasts, fungi and enveloped viruses [86]. Combinatorial synthesis of cyclic peptides, together with screening for antimicrobial and other bioactivities, can provide new lead identification and the construction of QSARs. Redman et al. [91] reported a new sequencing protocol, based on automated computer analysis of mass spectra, for the rapid identification of the members of a cyclic peptide library. The utility of the new MS-sequencing approach was demonstrated using sonic spray ionization ion trap MS and MS/MS spectrometry on a single-compound-per-bead cyclic peptide
library and validated with individually synthesised pure cyclic D,L-α-peptides [91]. Also, complex libraries of glycosidated cyclic peptides, which are an important class of drug-like compounds, have been developed by incorporating glycosidated amino acids into linear peptides via solid-phase peptide synthesis followed by thioesterase-mediated peptide cyclization [92].

The two major classes of cationic amphipathic antimicrobial peptides are thus α-helical and β-sheet peptides [93,94]. In fact, potent cyclic antimicrobial peptides selective for Gram-negative bacteria have been successfully developed on the basis of the β-stranded framework mimicking the putative LPS-binding sites of the LPS-binding protein family [95]. The disadvantage of antimicrobial peptides for clinical use as antibiotics is their toxicity or ability to lyse eukaryotic cells [61]. It would therefore be necessary to dissociate anti-eukaryotic activity from antimicrobial activity in order to use them as broad-spectrum antibiotics. SAR studies indicated that changes in the amphipathicity of these antibacterial peptides could be used to dissociate the antimicrobial activity from the haemolytic activity [96,97]. Furthermore, peptide cyclization increased the selectivity for bacteria by substantially reducing the haemolytic activity [98]. Antimicrobial and haemolytic activities of de novo designed cyclic β-sheet gramicidin S analogues have been successfully dissociated by systematic alterations in amphipathicity/hydrophobicity through D-amino acid substitutions [99,100]. Chen et al. [48,101] also demonstrated that in linear peptides the helix-destabilizing properties of D-amino acids offered a systematic approach to the controlled alteration of the hydrophobicity, amphipathicity, and helicity of amphipathic α-helical model peptides. Frecer et al.
[102] reported the de novo design of a series of synthetic cyclic amphipathic cationic peptides for which a high affinity of binding to LA was predicted from molecular modelling. These V peptides were composed of two identical symmetric amphipathic LPS- and LA-binding motifs containing cationic residues, such as HBHPHBH and HBHBHBH (where B is a cationic residue, H is a hydrophobic residue, and P is a polar residue). The motifs formed the two strands of a β-hairpin, joined on one side by a G9S10G11 turn and on the other by a disulfide bond between C1 and C19 bridging the N- and C-terminal residues (Fig. 5). Each peptide was thus cyclized via a disulfide bridge and showed a β-sheet conformation, which would bind to the bisphosphorylated glucosamine disaccharide head group of LA, primarily by ion-pair formation between the anionic phosphates of LA and the cationic side chains [68].
[Figure 5: β-hairpin diagram of V1 with residue labels Ac-C1, V2, K3, V4, K5, V6, K7, V8, G9, S10, G11, V12, K13, V14, K15, V16, K17, V18, C19-NH2]
Figure 5. Chemical structure of the cyclic cationic peptide V1.
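The architecture described above (two identical seven-residue binding motifs, a GSG turn, and flanking cysteines) can be written down programmatically. The following is a minimal sketch rather than code from the cited work: the function names, the use of one-letter codes, and the assumption that a single residue type fills every H position and every flanking B position are illustrative choices. The "formal charge" simply counts Lys and Arg and is not claimed to be the QM descriptor of ref. [102].

```python
def build_v_peptide(h: str, b: str, center: str) -> str:
    """Assemble C-[H B H X H B H]-GSG-[H B H X H B H]-C in one-letter codes.

    h      : hydrophobic residue used at every H position (assumption)
    b      : cationic residue used at the flanking B positions (assumption)
    center : residue at the central B(P) position of each motif
    The N-terminal acetyl and C-terminal amide caps are implied, not encoded.
    """
    motif = f"{h}{b}{h}{center}{h}{b}{h}"
    return f"C{motif}GSG{motif}C"


def formal_charge(seq: str) -> int:
    """Count Lys/Arg side chains; capped termini are assumed to carry no charge."""
    return sum(seq.count(residue) for residue in "KR")


v1_like = build_v_peptide("V", "K", "K")   # matches the residue labels of Fig. 5
hkhqhkh = build_v_peptide("V", "K", "Q")   # polar Gln at the motif centre
print(v1_like, len(v1_like), formal_charge(v1_like))
print(hkhqhkh, len(hkhqhkh), formal_charge(hkhqhkh))
```

Replacing the central cationic residue by a polar one lowers the formal charge by two, one per motif, because the two motifs are kept identical.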
The V peptides contained binding motifs of seven alternating H and B or P residues, with the general sequence Ac-C-HBHB(P)HBHGSG-HBHB(P)HBH-C-NH2, where Ac is an acetyl group. The two LPS- and LA-binding sites showed structural similarity to cyclic β-sheet defence peptides, such as protegrin 1, thanatin, and androctonin [73]. Peptides were further characterized by electrospray ionization mass spectrometry (ESI-MS) and amino acid analysis. MD simulations showed that the backbone conformations of free V peptides evolved from the initial β-hairpin with defined patterns of secondary structure into flexible random conformations. The patterns of the molecular shape fluctuations and torsional flexibility indicated high degrees of flexibility of the free V peptides in solution [102].

Antibacterial activity tests, haemolytic activity assays, and cytotoxicity tests were carried out. The therapeutic index, a widely used parameter to represent the specificity of antimicrobial agents, was calculated as the ratio of the minimal haemolytic concentration (MHC, a measure of haemolytic activity) to the minimal inhibitory concentration (MIC, a measure of antimicrobial activity); thus, larger values of the therapeutic index indicated higher antimicrobial specificity [59,103]. The therapeutic index could be increased either by increasing antimicrobial activity or by decreasing haemolytic activity while maintaining antimicrobial activity. High peptide hydrophobicity and amphipathicity also led to a higher peptide self-association in solution. When the self-association of a peptide in aqueous media was too strong, it decreased the ability of the peptide to dissociate, penetrate into the biomembrane, and kill target cells. Temperature profiling in RP-HPLC from 5 to 80 °C was used to measure the self-association of small amphipathic molecules, including cyclic β-sheet peptides [104], the peptides being dimerized at 5 °C and monomeric at 80 °C owing to dissociation of the dimers [102].
A higher ability to self-associate in solution was correlated with weaker antimicrobial activity and stronger haemolytic activity of the peptides. In addition, self-associating ability was correlated with the secondary structure of peptides, i.e. disrupting the secondary structure by replacing the L-amino
acid with its D-amino acid counterpart decreased the peptide association parameter (PA) values [102]. Therefore, the D-amino acid substituted peptides possessed an enhanced average antimicrobial activity compared with L-diastereomers.
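The therapeutic index defined earlier in this section (MHC divided by MIC) is a simple ratio, and a short sketch makes the two routes to improving it explicit. The function name and all concentration values below are hypothetical and serve only as an illustration; they are not data from the cited studies.

```python
def therapeutic_index(mhc: float, mic: float) -> float:
    """Return MHC / MIC; both concentrations must be in the same units."""
    if mhc <= 0 or mic <= 0:
        raise ValueError("concentrations must be positive")
    return mhc / mic


# Hypothetical concentrations in arbitrary but identical units.
parent = therapeutic_index(mhc=50.0, mic=10.0)
less_haemolytic = therapeutic_index(mhc=200.0, mic=10.0)  # MHC raised, MIC kept
more_potent = therapeutic_index(mhc=50.0, mic=2.5)        # MIC lowered, MHC kept

print(parent, less_haemolytic, more_potent)
```

In this toy case a fourfold reduction in haemolytic activity and a fourfold gain in antimicrobial potency raise the index by the same factor, which is why the index alone does not reveal which of the two activities changed.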
QSAR models for antimicrobial cyclopeptides

QSARs were obtained by relating the experimental biological potencies to physico-chemical molecular properties derived from the peptide sequences [102]. A rational strategy was used to design cationic antimicrobial peptides via repeated sequences of alternating cationic and nonpolar residues. In line with the above-mentioned considerations, to achieve a high level of antimicrobial activity and selectivity toward bacteria rather than eukaryotic cells, systematic modifications of molecular properties were made by varying the amino acid residues of the amphipathic LPS- and LA-binding motifs [68] while preserving the size, symmetry, and amphipathic character of the peptides. In peptides V1 to V7, the molecular charges, amphipathicities, and lipophilicities were modulated by varying the cationic (polar) amino acid residues in the center of the binding motifs, where B(P) was Lys or Arg (Ser or Gln), and the hydrophobic residues, where H was Ala, Val, Phe, or Trp; the alternation of polar and nonpolar residues was retained throughout. Lysine residues were previously shown to contribute most to the high affinity to LA when placed at the flanking basic residue positions of the HBHB(P)HBH motif with a β-sheet conformation [68,102].

QSAR analysis of peptide sequences and their antimicrobial, cytotoxic, and haemolytic activities revealed that site-directed substitutions of residues in the hydrophobic face of the amphipathic peptides with less lipophilic residues selectively decreased the haemolytic effect without significantly affecting the antimicrobial or cytotoxic activity. On the other hand, the antimicrobial effect was enhanced by substitutions in the polar face with more polar residues, which increased the amphipathicity of the peptide [105].
The combination of three molecular properties (charge, amphipathicity, and lipophilicity) was found to correlate with the observed antimicrobial, haemolytic, and cytotoxic activities of the V peptides. Single-variate QSAR correlations of these properties to the biological effects could not be established, suggesting that the membrane disruption involved a concerted process [43,106,107]. The V peptides exhibited strong effects against five Gram-negative bacteria (Escherichia coli, Klebsiella pneumoniae, and Pseudomonas aeruginosa), with MICs in the nanomolar range, and low cytotoxic and
haemolytic activities at concentrations significantly exceeding their MICs. Simple properties derived from the peptide sequences, such as the molecular charge (QM), amphipathicity index (AI), and lipophilicity index (Πo/w), were then correlated to the mean antimicrobial effect against Gram-negative bacteria by multivariate linear regression, as shown in eq. 1 [102].

For antimicrobial effect:

$$\ln(\mathrm{MIC}) = 9.49\,Q_M + 10.17\,\mathrm{AI} - 0.05\,\Pi_{o/w} - 22.16 \qquad \text{(equation 1)}$$
The t test of the multivariate correlation equation revealed that the antimicrobial effect on bacteria was mainly determined by the V-peptide charge (QM) and amphipathicity (AI), i.e., by the number of cationic and polar residues forming the polar face of the V peptides and their distribution throughout the two symmetric amphipathic LPS- and LA-binding motifs [102]. In fact, a higher affinity to the outer bacterial membrane seemed to be a favourable prerequisite for the antimicrobial effects, since the V peptides displayed low micromolar Kd values and antimicrobial activities at concentrations in the nanomolar range, where Kd is the dissociation constant of the peptide-LA complex.

The haemolytic activity of the peptides against human erythrocytes was determined as a main measure of peptide toxicity towards higher eukaryotic cells. Since both antimicrobial and haemolytic activities of the cationic peptides involved cell membrane lysis and depended on the same physico-chemical properties [99,106], a similar correlation equation (eq. 2) was obtained for the haemolytic activities of the V peptides [102].

For haemolysis:

$$\ln(\mathrm{EC}_{50}) = -5.34\,Q_M - 4.94\,\mathrm{AI} - 0.23\,\Pi_{o/w} + 31.87 \qquad \text{(equation 2)}$$
In this case the correlation parameters and the t statistics showed that the haemolytic activity against eukaryotic cells was mainly influenced by the molecular lipophilicity, i.e., the sum of the lipophilicities of all residues, with the major contribution coming from the H residues, which formed the nonpolar face of the V peptides; the peptides were predicted to acquire a β-hairpin-like structure in the peptide-LA complexes [102]. The EC50 values of the V peptides for cytotoxicity ranged from 40 µM to 5.7 mM, which exceeded their MICs by up to 3 orders of magnitude. For the cytotoxic effects of the V peptides, the correlation in eq. 3 was obtained [102].

For cytotoxicity:

$$\ln(\mathrm{EC}_{50}) = 8.98\,Q_M + 11.74\,\mathrm{AI} - 0.04\,\Pi_{o/w} - 8.70 \qquad \text{(equation 3)}$$
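Equations 1 to 3 share the same three descriptors, so they can be evaluated side by side. The sketch below copies the coefficients from the equations as printed here; the class and function names are illustrative, and the descriptor values are hypothetical placeholders, since the actual QM, AI, and Πo/w values of the V peptides are not reproduced in this chapter.

```python
from typing import NamedTuple


class Descriptors(NamedTuple):
    qm: float     # molecular charge, QM
    ai: float     # amphipathicity index, AI
    pi_ow: float  # lipophilicity index, Pi_o/w


# (QM coefficient, AI coefficient, Pi_o/w coefficient, intercept) from eqs. 1-3.
COEFFS = {
    "ln_MIC": (9.49, 10.17, -0.05, -22.16),
    "ln_EC50_haemolysis": (-5.34, -4.94, -0.23, 31.87),
    "ln_EC50_cytotoxicity": (8.98, 11.74, -0.04, -8.70),
}


def predict(d: Descriptors) -> dict[str, float]:
    """Evaluate the three linear models for one set of descriptors."""
    return {
        name: a * d.qm + b * d.ai + c * d.pi_ow + k
        for name, (a, b, c, k) in COEFFS.items()
    }


# Hypothetical descriptor values, chosen only to exercise the equations.
example = Descriptors(qm=4.0, ai=-1.0, pi_ow=20.0)
more_lipophilic = example._replace(pi_ow=30.0)

base, shifted = predict(example), predict(more_lipophilic)
for name in COEFFS:
    print(f"{name}: {base[name]:.2f} -> {shifted[name]:.2f}")
```

With these placeholder inputs, raising Πo/w by 10 units at constant QM and AI changes ln(EC50) for haemolysis by -2.3 but ln(MIC) by only -0.5, which follows directly from the coefficients. Because the descriptors are on different scales, coefficient size alone should not be read as a measure of importance; the text relies on t statistics for that.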
Therefore, the correlation parameters and t statistics indicated that the cytotoxic activity was determined mainly by the peptide charge and the amphipathicity. Amphipathicity of the L-amino acid substituted peptides was determined by calculating the hydrophobic moment [108] with the software package Jemboss [109], modified to include the experimentally determined hydrophobicity scale. Hydrophobicity coefficients were determined by RP-HPLC at pH 7 (phosphate buffer) [102]. Since the antimicrobial activities of the V peptides strongly increased with increasing amphipathicity of the molecules at constant QM and Πo/w, it was suggested that aggregates of V peptides, rather than individual molecules, behave as strong antimicrobials [102]. The ability of cationic peptides to form aggregates has been related to their antimicrobial potencies, as previously reported for dermaseptin S4 [110,111], protegrin-1 [112], and human defensins [59,113].

The validity of the QSAR model for the antimicrobial potencies of the V peptides against Gram-negative bacteria was verified with the set of cyclic cationic amphipathic peptides designed by Muhle and Tam [95], which were similar to the V peptides, e.g., cyclo(PACRCRAG-PARCRCAG) sequences constrained by two cross-linking disulfide bonds. These peptides displayed potent activities against Gram-negative bacteria (Escherichia coli and Pseudomonas aeruginosa). The MICs of these peptides were 20 nM for E. coli. Frecer's correlation equation for the antimicrobial activity (eq. 1) was able to reproduce the qualitative rank order of antimicrobial potencies at low salt concentrations for eight of the nine peptides [95].

On the basis of the QSARs, new analogues that had strong antimicrobial effects but lacked haemolytic activity have been proposed [102]. QSAR eq. 1 predicted rapid increases in the antimicrobial activity with an increase in the molecular charge QM over 4 e (in units of electron charge) when amphipathicity and hydrophobicity were kept constant at the levels of the most promising peptide, V4. Model analogues of V4 that shared the polar HKHQHKH motif and differed only in the H residues (thus retaining the amphipathicity index of V4) were predicted to possess decreasing haemolytic activities with decreasing lipophilicities, while their predicted antimicrobial and cytotoxic activities remained unchanged. In other analogues of V4, the replacement of the two central Gln residues by more polar Asn residues was predicted to lead to significantly increased antimicrobial potencies owing to the increased amphipathicity, independent of the H residues (predicted MICs were lower than that of V4) [102].
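The hydrophobic moment used above to quantify amphipathicity can be illustrated with a few lines of code. This is a sketch of the standard definition (the magnitude of the vector sum of per-residue hydrophobicities, each rotated by a fixed angle per residue), not a reimplementation of the Jemboss calculation of ref. [102]. The two-value hydrophobicity scale is a placeholder invented for the example, not the RP-HPLC scale of the cited work, and the 180° step is an idealized β-strand geometry assumed for illustration.

```python
import math


def hydrophobic_moment(seq: str, scale: dict[str, float], delta_deg: float) -> float:
    """Magnitude of the hydrophobic moment for residues spaced delta_deg apart.

    delta_deg is the angular step between successive side chains
    (about 100 degrees for an alpha-helix, 160-180 degrees for a beta-strand).
    """
    delta = math.radians(delta_deg)
    sin_sum = sum(scale[aa] * math.sin(i * delta) for i, aa in enumerate(seq))
    cos_sum = sum(scale[aa] * math.cos(i * delta) for i, aa in enumerate(seq))
    return math.hypot(sin_sum, cos_sum)


# Placeholder scale (assumption): +1 for a hydrophobic residue, -1 for a cationic one.
PLACEHOLDER_SCALE = {"V": 1.0, "K": -1.0}

alternating = hydrophobic_moment("VKVKVKV", PLACEHOLDER_SCALE, 180.0)
segregated = hydrophobic_moment("VVVVKKK", PLACEHOLDER_SCALE, 180.0)
print(alternating / 7, segregated / 7)  # moment per residue
```

On an idealized strand, strict alternation of hydrophobic and cationic residues places all like residues on the same face and maximizes the moment, whereas the same composition arranged in blocks largely cancels out. This is the sense in which the alternating HBHB(P)HBH motif is amphipathic.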
Thus, variations in the H residues forming the hydrophobic face of the analogues of V4 mainly affected the haemolytic activity, which was shown to depend strongly on Πo/w, but did not affect the predicted antimicrobial activity of the analogues. Therefore, replacement of the H residues with less hydrophobic residues in the nonpolar face of the amphipathic analogues was appropriate for decreasing the haemolytic activities of the V peptides. On the other hand, directed substitutions of the B and P residues in the polar faces of the V peptides with more polar residues, which increased the amphipathic character (more negative AI values) of the peptide while keeping the net charge, the symmetry of the binding motifs, and the composition of the hydrophobic face, were predicted to bring about a significant increase in antimicrobial potencies [59,102].

The positively charged antimicrobial peptide cyclo[VKLdKVdYPLKVKLdYP] (GS14dK4), which is a diastereomeric lysine ring-size analogue of the naturally occurring antimicrobial peptide gramicidin S (Fig. 6), exhibited enhanced antimicrobial and markedly reduced haemolytic activities compared with gramicidin S itself [114]. The binding of GS14dK4 to various lipid bilayer model membranes has been recently studied using isothermal titration calorimetry [114]. Dynamic light scattering results indicated the absence of any major peptide-induced alteration in vesicle size or vesicle fusion under the experimental conditions. The binding of GS14dK4 was significantly influenced by the surface charge density of the phospholipid bilayer and by the presence of cholesterol, which markedly reduced the affinity of the peptide for phospholipid bilayers. The binding isotherms could be described quantitatively by a one-site binding model. The measured endothermic binding enthalpy (ΔH) varied strongly (+6.3 to +26.5 kcal/mol) and appeared to be inversely related to the order of the phospholipid bilayer system.
Figure 6. Structure of gramicidin S.

However, the negative free energy (ΔG) of binding remained relatively constant (-8.5 to -11.5 kcal/mol) for all lipid membranes examined. The
relatively small variation of the negative free energy of peptide binding, together with a pronounced variation of the positive enthalpy, produced an equally strong variation of TΔS (+16.2 to +35.0 kcal/mol), indicating that GS14dK4 binding to phospholipid bilayers was primarily entropy driven [114].

The properties and SAR studies of a macrocyclic analogue of porcine protegrin-1 have been recently reported [115]. Protegrin-1 (PG-1) is an 18-residue β-hairpin peptide containing two disulfide bridges (Cys6–15 and Cys8–13) that belongs to the cathelicidin class of antimicrobial peptides (Fig. 7). These disulfides constrained the peptide backbone into a β-hairpin conformation, with a β-turn formed by residues 9–12, as detected by NMR. SAR studies [116] led to the discovery of analogue IB367 with Cys5–14 and Cys7–12 disulfide bridges, which has been clinically tested to treat ulcerative oral mucositis, ventilator-associated pneumonia, and respiratory infections associated with cystic fibrosis. An approach to PG-1 peptidomimetics based on the use of β-hairpin-stabilizing organic templates has been reported earlier [117,118]. The template D-Pro-Pro was chosen for its ability to promote a β-hairpin loop structure. This design was used to prepare β-hairpin peptidomimetics [119-122]. The lead compound, containing the sequence cyclo(Leu-Arg-Leu-Lys-Lys-Arg-Arg-Trp-Lys-Tyr-Arg-Val-D-Pro-Pro), showed antimicrobial activity against Gram-positive and Gram-negative bacteria, but a much lower haemolytic activity than PG-1. SAR studies were carried out on over 100 single-site substituted synthetic analogues, and the biological profiles were assessed. Some analogues showed slightly improved antimicrobial activities (2–4-fold lowering of MICs), whereas other substitutions caused large increases in haemolytic activity [115].
Frecer [123] quantitatively analysed the antimicrobial and haemolytic activities of protegrin-1 mimetics, i.e., cyclic cationic peptides with a β-hairpin fold synthesised by Robinson et al. [115] (Fig. 7).

Figure 7. Structure of protegrin-1 (PG-1) and cyclic β-hairpin peptide analogues of PG-1.
The polar face of the cyclic lead peptide R1 [115] was formed by the side chains of cationic residues 2, 4, 6, 7, 9, 11 and D-Pro13 and Pro14, while the nonpolar face consisted of residues 1, 3, 5, 8, 10 and 12 with predominant lipophilic/aromatic character. For the QSAR models, properties characterizing the peptide's charge, lipophilicity, amphipathicity, size, shape and flexibility were selected, including the following descriptors: charge (Q), overall lipophilicity (L), lipophilicity of polar and nonpolar faces (P and N), surface areas of polar and nonpolar faces (SP and SN), molecular mass of the polar and nonpolar faces (MwP and MwN), count of small lipophilic, highly lipophilic and aromatic residues forming the nonpolar face (CSL, CHL and CAR), total number of hydrogen bond donor and acceptor centres (HBdon and HBacc), total number of rotatable bonds (RotBon) and various amphipathicity descriptors (P/L, P/N, L/N, Q/L, Q/N, SP/SN, MwP/MwN, Q/CSL, Q/CHL and Q/CAR) [123]. These simple additive molecular descriptors were easily derived from peptide sequences and tabulated amino acid properties [124]. Frecer [123] assumed that the analogues adopted an amphipathic β-hairpin secondary structure, which was in agreement with the fact that protegrins and synthetic analogues with a constrained β-hairpin conformation displayed higher antimicrobial potencies than linear or nonconstrained counterparts [69,125]. The best models obtained by application of the genetic function approximation algorithm correlated antimicrobial potencies (log MICa) to the peptide's charge and amphipathicity index, while the haemolytic effect (log %Hem) correlated well with the lipophilicity of the residues forming the nonpolar face of the β-hairpin [123].
The lipophilicity of the nonpolar face N and the amphipathicity parameter Q/N showed some relation to the antimicrobial activity, while N and the intercorrelated count of highly lipophilic residues in the nonpolar face (CHL) appeared to be related to the haemolytic activity. This finding was consistent with the above-mentioned QSAR studies (eqs. 1-3), which suggested that charge and amphipathicity correlated with MIC of cyclic cationic peptides and overall lipophilicity was very important for the haemolytic activity [102]. A large set of QSAR models combining up to five descriptors in each correlation equation was prepared by the genetic function approximation (GFA) algorithm [126] of the Cerius2 package. The fitness of each generated model was evaluated by using the lack-of-fit score [127]. The best performing QSAR model of the antimicrobial effect of R1 analogues accounted for two descriptors, molecular charge Q and amphipathicity parameter Q/N, which are
determined by the cationic residues forming mainly the polar face and the lipophilicity of the nonpolar face, N (eq. 4) [123].

$$\log(\mathrm{MIC}_a) = 1.291 - 0.180\,Q + 1.438\,(Q/N) \qquad \text{(equation 4)}$$
The relation of the antimicrobial effect of cationic peptides to molecular charge and amphipathicity has been previously reported [102,128]. Based on this QSAR model, the best variants of the R1 lead should have polar faces formed only by charged residues and nonpolar faces formed by highly lipophilic residues in order to display strong antimicrobial activity. Thus, any analogue with the same charge as the lead peptide R1 (Q = 7 e) should show more potent antimicrobial activity than R1 when the lipophilicity of its nonpolar face, N, is greater than 6.84 (the value of N for R1) [123]. Predicted MICa values were validated for the 97 peptides of Robinson et al. [115] from their Q and N descriptors [123]. The best QSAR model of the haemolytic effect for the R1 analogues related the lysis of human erythrocytes to the lipophilicity of the nonpolar face, N (eq. 5) [123].

$$\log(\%\mathrm{Hem}) = -2.551 + 0.431\,N \qquad \text{(equation 5)}$$
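Equations 4 and 5 can be evaluated together to see how the lipophilicity of the nonpolar face, N, pulls the two predicted activities in opposite directions. In the sketch below the coefficients and the R1 values (Q = 7, N = 6.84) are taken from the text; the function names and the two comparison values of N are hypothetical and chosen only for illustration.

```python
def log_mic(q: float, n: float) -> float:
    """Eq. 4: log MICa from charge Q and nonpolar-face lipophilicity N."""
    return 1.291 - 0.180 * q + 1.438 * (q / n)


def log_hem(n: float) -> float:
    """Eq. 5: log %Hem from nonpolar-face lipophilicity N."""
    return -2.551 + 0.431 * n


Q_R1, N_R1 = 7.0, 6.84            # lead peptide R1, values quoted in the text
for n in (6.0, N_R1, 8.0):        # 6.0 and 8.0 are hypothetical analogues
    print(f"N={n:5.2f}  log MICa={log_mic(Q_R1, n):.3f}  log %Hem={log_hem(n):.3f}")
```

At fixed charge, a larger N lowers the predicted log MICa (stronger antimicrobial effect) but raises the predicted log %Hem, so the two models together describe a trade-off along N rather than a single optimum.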
Therefore, the haemolytic potency of the R1 analogues depended mainly on the lipophilicity of the nonpolar face and was almost independent of the charge and composition of the polar face. Based on this QSAR model, any analogue with N below 6.84 should show lower levels of haemolytic activity than the lead peptide R1 [123]. The %Hem values of the 97 peptides of Robinson et al. [115] fitted this model. The combination of the QSAR models with the cyclic backbone of the protegrin analogues (constant turns, residues 6, 7 and 13, 14), sequence amphipathicity (regular alternation of cationic and nonpolar residues) and peptide symmetry provided a sufficient strategy for peptide design. The secondary structure of the backbone of the peptides NR7–NR9 was stabilized by a Cys3-Cys10 disulfide bridge in the form of a cyclic β-hairpin [123]. Tam et al. [128] showed by circular dichroism experiments that cyclic protegrins containing one to three cysteine bonds displayed some degree of β-strand structure in solution. The occurrence of the β-hairpin fold was essential for membrane permeation/disruption by PG-1 analogues, as previously reported [128-130].

Recently, Bhonsle et al. [131] used 3D-QSAR to identify descriptors defining the bioactivity of antimicrobial peptides. The resulting 3D physico-chemical properties were controlled by the placement of amino acids with well-defined properties (hydrophobicity, charge density, electrostatic
potential, and those mentioned above) at specific locations along the peptide backbone. These peptides exhibited different in vitro activities against Staphylococcus aureus and Mycobacterium ranae. The differences in biological activity seem to be due to the different physico-chemical interactions that occur between the peptides and the cell membranes of the bacteria. 3D-QSAR analyses showed that specific physico-chemical properties were responsible for antibacterial activity and selectivity. Five physico-chemical properties were specific to the S. aureus QSAR model, and five were specific to the M. ranae QSAR model. Accordingly, for any particular antimicrobial peptide, organism selectivity and potency are controlled by the chemical composition of the target cell membrane [131].

As described above, cationic peptide antibiotics possess an amphiphilic structure and thereby display lytic activity against bacterial cell membranes. Naturally occurring antimicrobial peptides contain a large number of amino acid residues, which limits their clinical applicability. Recent studies indicated that it is possible to decrease the chain length of these peptides without loss of activity, and suggested that a minimum of two positive ionizable (hydrophilic) groups and two bulky (hydrophobic) groups are required for antimicrobial activity. By employing the HipHop module of the software package CATALYST, these experimental findings have been translated into 3D pharmacophore models by finding common features among active peptides [132]. Positively ionizable and hydrophobic features were the important characteristics of the compounds used for pharmacophore model development. Based on the highest score and the presence of amphiphilic structure, two separate hypotheses, Ec-2 and Sa-6 for Escherichia coli and Staphylococcus aureus, respectively, were selected for mapping analysis of active and inactive peptides against these organisms.
The resulting models not only provided information on the minimum requirement of positively ionizable and hydrophobic features but also indicated the importance of their relative arrangement in space. The minimum requirement for positively ionizable features was two in both cases, but the number of hydrophobic features required in the case of E. coli was four, while for S. aureus it was found to be three [132]. Hypotheses were further validated using cationic steroid antibiotics, a different class of facial amphiphiles with the same mechanism of antimicrobial action as that of cationic peptide antibiotics. The results showed that cationic steroid antibiotics also require similar minimum features to be active against both E. coli and S. aureus.
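The minimum feature counts reported above lend themselves to a crude sequence-level pre-filter. The sketch below is only a necessary-condition check under stated assumptions: it is not the HipHop/CATALYST pharmacophore mapping, it ignores the spatial arrangement that the text identifies as important, and the assignment of residues to "positive ionizable" and "hydrophobic" feature classes is an illustrative choice rather than one taken from ref. [132].

```python
# Minimum (positive ionizable, hydrophobic) feature counts quoted in the text.
MIN_FEATURES = {"E. coli": (2, 4), "S. aureus": (2, 3)}

# Residue-to-feature assignment: an assumption made for this example only.
POSITIVE_IONIZABLE = set("KR")
HYDROPHOBIC = set("AVLIFWM")


def count_features(seq: str) -> tuple[int, int]:
    """Return (number of positive ionizable, number of hydrophobic) residues."""
    positive = sum(1 for aa in seq if aa in POSITIVE_IONIZABLE)
    hydrophobic = sum(1 for aa in seq if aa in HYDROPHOBIC)
    return positive, hydrophobic


def meets_minimum(seq: str, organism: str) -> bool:
    """True if the sequence has at least the minimum feature counts."""
    min_pos, min_hyd = MIN_FEATURES[organism]
    positive, hydrophobic = count_features(seq)
    return positive >= min_pos and hydrophobic >= min_hyd


for peptide in ("KWKW", "KKWWF", "KKWWFL"):  # hypothetical short sequences
    print(peptide, {org: meets_minimum(peptide, org) for org in MIN_FEATURES})
```

A sequence that fails this check cannot map onto the corresponding hypothesis, but one that passes may still be inactive if its features cannot adopt the required relative arrangement in space.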
Interaction of antimicrobial peptides in biomembranes

The cytoplasmic membrane is the main target of some antimicrobial peptides [133]. In fact, all cationic amphipathic peptides interact with membranes [134,135]. Cationic peptides first bind to the negatively charged LPS or LA of Gram-negative bacteria [98,136], then permeate the membrane by different mechanisms, finally leading to bacterial death. The development of resistance to membrane-active peptides whose target is the cytoplasmic membrane is not expected, because this would imply severe changes in the lipid composition of the cell membranes of microorganisms. The mechanism of bacterial membrane disruption by cationic amphipathic peptides should involve several molecular properties of the peptides: a net positive charge (attachment to anionic outer membrane constituents), amphipathicity (aggregation on the membrane surface), and lipophilicity (permeation into the membrane), as discussed above [137].

Many models have been proposed for the interaction of cationic amphipathic antimicrobials with the cytoplasmic membrane [61,136-139], because lethal action could result either from membrane disruption or from translocation through the membrane to target receptors inside the cell. The two main proposed mechanisms are: (i) the “barrel-stave” mechanism, in which the peptides form transmembrane channels/pores, as their hydrophobic surfaces interact with the lipid core of the membrane and their hydrophilic surfaces point inwards, producing an aqueous pore [140]; and (ii) the “carpet” mechanism, in which the peptides lie at the interface parallel with the membrane, allowing their hydrophobic surface to interact with the hydrophobic component of the lipid while the positively charged residues can still interact with the negatively charged head groups of the phospholipid [141]. An NMR study of the amphipathic cyclic β-sheet antimicrobial peptide gramicidin S [142] supported the interface model.
However, neither of these mechanisms alone could fully explain the reported data. The mechanism of action depends upon the difference in membrane composition between prokaryotic and eukaryotic cells [143]. If the peptides formed pores/channels in the hydrophobic core of the eukaryotic bilayer, they would cause the haemolysis of human erythrocytes. In contrast, in prokaryotic cells the peptides lysed cells by a detergent-like mechanism, as described in the carpet mechanism. In fact, the extent of interaction between peptide and biomembrane depends on the composition of the lipid bilayer. Liu et al. [144,145] used a polyleucine-based α-helical transmembrane peptide to demonstrate that the peptide reduced the phase transition temperature to a higher extent in
phosphatidylethanolamine (PE) bilayers than in phosphatidylcholine (PC) or phosphatidylglycerol bilayers, indicating a greater disruption of PE organization. The zwitterionic PE is the main lipid component in prokaryotic cell membranes, and PC is the main lipid component in eukaryotic cell membranes [146]. According to these results, the carpet mechanism is essential for strong antimicrobial activity, and if the peptide preferentially penetrated into the hydrophobic core of the bilayer, the antimicrobial activity would actually decrease [143]. Recently, membrane interactions of designed cationic antimicrobial peptides were reported [147]. Novel cationic antimicrobial peptides typified by sequences such as KKKKKKAAX-AAXAAXAA-NH2, where X = Phe/Trp, displayed high antibacterial activity but exhibited little or no haemolytic activity towards human red blood cells, even at high doses. To clarify the mechanism of their selectivity for bacterial vs. mammalian membranes and to increase the understanding of the relationships between primary sequence and bioactivity, a library of derivatives with increased segmental hydrophobicity was prepared, in which two, three, or four Ala residues were systematically replaced by Leu. Conformationally constrained dimeric and cyclic derivatives were also synthesised. The peptides were examined for activity against pathogenic bacteria (Pseudomonas aeruginosa), haemolytic activity on human red blood cells, and insertion into models of natural bacterial membranes (containing anionic lipids) and mammalian membranes (containing zwitterionic lipids + cholesterol). Results were compared with the corresponding properties of the natural cationic antimicrobial peptides magainin and cecropin. Using circular dichroism and fluorescence spectroscopy, Glukhov et al. [147] found that peptide conformation and membrane insertion were sequence dependent, both upon the number of Leu residues and upon their positions along the hydrophobic core. Membrane disruption was likely enhanced by the fact that the peptides contained potent dimerization-promoting sequence motifs, as assessed by SDS-PAGE gel analysis. Overall, the results identified distinctions in the mechanisms of action of these cationic antimicrobial peptides in disrupting bacterial vs. mammalian membranes, the latter depending on surpassing a "second hydrophobicity threshold" for insertion into zwitterionic membranes.
Anticancer and cytotoxic activities
There is a need for novel drugs for the treatment of infectious diseases, autoimmunity and cancer. Cyclic peptides constitute a class of compounds
that have made important contributions to the treatment of certain diseases. Penicillin, vancomycin, cyclosporin, the echinocandins and bleomycin are well-known cyclic peptides [148]. Cyclic peptides, compared to linear peptides, have been considered to have greater potential as therapeutic agents due to their increased chemical and enzymatic stability, receptor selectivity, and improved pharmacodynamic properties. They have been used as synthetic immunogens, transmembrane ion channels, antigens for Herpes Simplex Virus, potential immunotherapeutic vaccines for diabetes and Experimental Autoimmune Encephalomyelitis - an animal model of Multiple Sclerosis, as inhibitors against α-amylase and as protein stabilizers. Cyclic peptides as therapeutic agents in disease have been recently reviewed [148]. Isolation and anti-cancer effects of cyclotides obtained from Violaceae plants have been also reviewed [149]. A fractionation protocol was developed, leading to varv cyclotides from Viola arvensis (Violaceae). Separation methods included adsorption, ion exchange chromatography and solvent-solvent partitioning. Structures were determined on the basis of MS for cyclotide sequencing and mapping of disulfide bonds. Finally, to assess SARs, regarding their anti-cancer and cytotoxic effects, the three dimensional structures of cyclotides were characterized by homology modelling techniques [149]. Cytotoxic cyclotides were obtained from Viola tricolor [150]. Bioguided fractionation was carried out by RP-HPLC and a fluorometric cytotoxicity assay. Cyclotides were assayed against two human cancer cell lines, U-937 GTB (lymphoma) and RPMI-8226/s (myeloma). The most potent compounds isolated, which showed the lowest IC50 values, were: vitri A (IC50 = 0.6 μM and IC50 = 1 μM, respectively), varv A (IC50 = 6 μM and IC50 = 3 μM, respectively), and varv E (IC50 = 4 μM in both cell lines). 
Their sequences, determined by automated Edman degradation, quantitative amino acid analysis, and MS, were cyclo-GESCVWIPCITSAIGCSCKSKVCYRNGIPC (vitri A), cyclo-GETCVGGTCNTPGCSCSWPVCTRNGLPVC (varv A), and cyclo-GETCVGGTCNTPGCSCSWPVCTRNGLPIC (varv E) [150]. Cycloviolacin H4, a hydrophobic cyclotide, was isolated from the Australian native violet Viola hederaceae. Its sequence, cyclo(CAESCVWIPCTVTALLGCSCSNNVCYNGIP), was determined by nanospray MS/MS and quantitative amino acid analysis. This cyclotide was classified into the bracelet subfamily of cyclotides owing to the absence of a cis-Pro peptide bond in the circular peptide backbone. Cycloviolacin H4 exhibited the most potent haemolytic activity among the cyclotides, and this activity correlated with the size of a surface-exposed hydrophobic patch. These findings provided insight into the factors that modulate the cytotoxic properties of cyclotides [151].
Recently, the sequences of 11 cyclotides, vibi A-K, isolated from the alpine violet Viola biflora, were determined by MS/MS sequencing of proteins and screening of a cDNA library of V. biflora in parallel [152]. To correlate amino acid sequence to cytotoxic potency, vibi D, E, G and H were analysed by a fluorometric microculture cytotoxicity assay using a lymphoma cell line. The IC50-values of the bracelet cyclotides vibi E, G and H ranged between 0.96 and 5.0 μM while the Möbius cyclotide vibi D was not cytotoxic at 30 μM [152]. Rational design, structure, and biological evaluation of cyclic peptides mimicking the vascular endothelial growth factor (VEGF) have been recently reported [153]. Angiogenesis is the development of a novel vascular network from a pre-existing structure. Blocking angiogenesis is an attractive strategy to inhibit tumor growth and metastasis formation. Based on structural and mutagenesis data, novel cyclic peptides were developed, which mimic, simultaneously, two regions of the VEGF crucial for the interaction with the VEGF receptors. The peptides, displaying the best affinity for VEGF receptor 1 on a competition assay, inhibited endothelial cell transduction pathway, migration, and capillary-like tubes formation. The specificity of these peptides for VEGF receptors was demonstrated by microscopy using a fluorescent peptide derivative. The resolution of the structure of some cyclic peptides by NMR and molecular modelling allowed the identification of various factors accounting for their inhibitory activity [153]. The interest in the application of the QSAR paradigm has steadily increased in recent decades and it may be useful in the design and development of DNA-binding molecules as new anticancer agents. Due to the great potential of DNA as a receptor, many classes of synthetic and naturally occurring molecules exert their anticancer activities through DNA-binding [154]. 
In the field of antitumour DNA-binding agents, a number of acridine and anthracycline derivatives are in the market as chemotherapeutic agents. However, the clinical application of such compounds has shown multi-drug resistance and secondary and/or collateral effects. Therefore, there has been increasing interest in discovering and developing small molecules that are capable of DNA-binding. Recently [154], the DNA-binding properties of different compound series have been discussed using 27 QSAR models. The most important determinants for the activity in these models were Hammett electronic (σ and σ+), hydrophobic, molar refractivity, and Sterimol width parameters. P-glycoprotein is implicated in multiple drug resistance exhibited by several types of cancer against a multitude of anticancer chemotherapeutic agents. Therefore, several research groups searched for effective
P-glycoprotein inhibitors. Cyclosporine A, aureobasidin A and related analogues were reported to possess potent inhibitory actions against P-glycoprotein. Recently, receptor surface analysis was used to construct two satisfactory receptor surface models for cyclosporine- and aureobasidin-based P-glycoprotein inhibitors [155]. These pseudoreceptors were combined to achieve a satisfactory 3D-QSAR for 68 different cyclosporine and aureobasidin derivatives. Upon validation against an external set of 16 randomly selected P-glycoprotein inhibitors, the optimal 3D-QSAR was found to be self-consistent and predictive (r2LOO = 0.673, r2PRESS = 0.600). The resulting 3D-QSAR was employed to probe the structural factors that control the inhibitory activities of cyclosporine and aureobasidin analogues against P-glycoprotein [155].
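As a rough illustration of the leave-one-out statistic quoted above, the sketch below computes a cross-validated r2 as 1 − PRESS/SS_total for a one-descriptor linear model. This is the common textbook definition and is only assumed here; the text does not state how the cited r2LOO was obtained. The function names and the data values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x for one descriptor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def loo_q2(xs, ys):
    """Leave-one-out cross-validated r2, taken as 1 - PRESS / SS_total."""
    y_mean = sum(ys) / len(ys)
    ss_total = sum((y - y_mean) ** 2 for y in ys)
    press = 0.0
    for i in range(len(xs)):
        # Refit the model without compound i, then predict compound i.
        intercept, slope = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (intercept + slope * xs[i])) ** 2
    return 1.0 - press / ss_total


# Hypothetical descriptor/activity values, for illustration only.
descriptor = [0.0, 1.0, 2.0, 3.0, 4.0]
activity = [0.1, 0.9, 2.2, 2.8, 4.1]
q2 = loo_q2(descriptor, activity)
```

Each compound is predicted by a model that never saw it, so the statistic penalizes models that merely fit the training set well.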
Acknowledgements
Thanks are due to Universidad de Buenos Aires and CONICET (Argentina) for financial support, and to Ministerio de Ciencia, Tecnología e Innovación Productiva (MINCYT, Argentina) for electronic bibliography facilities.
References
1. Pomilio, A.B., Battista, M.E., and Vitale, A.A. 2006, Curr. Org. Chem., 10, 2075. 2. Gallo, R.L., Murakami, M., Ohtake, T., and Zaiou, M. 2002, J. Allergy Clin. Immunol., 110, 823. 3. Ahn, J.-M., Boyle, N.A., MacDonald, M.T., and Janda, K.D. 2002, Mini Rev. Med. Chem., 2, 463. 4. Monroc, S., Badosa, E., Besalú, E., Planas, M., Bardají, E., Montesinos, E., and Feliu, L. 2006, Peptides, 27, 2575. 5. Montesinos, E. 2007, FEMS Microbiol. Lett., 270, 1. 6. Marr, A.K., Gooderham, W.J., and Hancock, R.E.W. 2006, Curr. Opin. Pharmacol., 6, 468. 7. Kelso, M.J., and Fairlie, D.P. 2003, Current Approaches to Peptidomimetics. In: Molecular Pathomechanisms and New Trends in Drug Research. Toth I. and Keri G. eds., Taylor and Francis Ltd., London and New York, Chapter 44, 579. 8. Leung, D., Abbenante, G., and Fairlie, D.P. 2000, J. Med. Chem., 43, 305. 9. Westermann, J.C., and Craik, D.J. 2008, Methods Mol. Biol., 494, 87. 10. Jones, R.M., Boatman, P.D., Semple, G., Shin, Y.-J., and Tamura, S.Y. 2003, Curr. Opin. Pharmacol., 3, 530. 11. Hruby, V.J. 2002, Nature Rev. Drug Discov., 1, 847. 12. Giannis, A., and Kolter, T. 1993, Angew. Chem. Int. Ed. Engl., 32, 1244. 13. Gante, J. 1994, Angew. Chem. Int. Ed. Engl., 33, 1699.
14. Nakanishi, H., and Kahn, M. 2003, Design of Peptidomimetics. In: The Practice of Medicinal Chemistry. 2nd ed., Wermuth C.G. ed., Academic Press, London. 15. Hirschmann, R., Nicolaou, K.C., Pietranico, S., Leahy, E.M., Salvino, J., Arison, B., Cichy, M.A., Spoors, P.G., Shakespeare, W.C., Sprengeler, P.A., Hamley, P., Smith III, A.B, Reisine, T., Raynor, K., Maechler, L., Donaldson, C., Vale, W., Freidinger, R.M., Cascieri, M.R., and Strader, C.D. 1993, J. Am. Chem. Soc., 115, 12550. 16. Våbeno, J., Nikiforovich, G.V., and Marshall, G.R. 2006, Biopolymers, 84, 459. 17. Ball, J.B., and Alewood, P.F. 1990, J. Mol. Recogn., 3, 55. 18. Ball, J.B., Hughes, R.A., Alewood, P.F., and Andrews, P.R. 1993, Tetrahedron, 49, 3467. 19. Hutchinson, E.G., and Thornton, J.M. 1994, Protein Sci., 3, 2207. 20. Etzkorn, F.A., Travins, J.M., and Hart, S.A. 1999, Rare protein turns: γ-turn, helix-turn-helix, and cis-proline mimics. In: Advances in Amino Acid Mimetics and Peptidomimetics. Abell A. Ed., Jai Press Inc., Stanford, Vol. 2, 126. 21. Gruber, C.W., Elliott, A.G., Ireland, D.C., Delprete, P.G., Dessein, S., Göransson, U., Trabi, M., Wang, C.K., Kinghorn, A.B., Robbrecht, E., and Craik, D.J. 2008, Plant Cell, 20, 2471. 22. Pelegrini, P.B., Quirino, B.F., and Franco, O.L. 2007, Peptides, 28, 1475. 23. Gruber, C.W., Cemazar M., Anderson, M.A., and Craik, D.J. 2007, Toxicon, 49, 561. 24. Mach, J. 2008, Plant Cell, 20, 2285. 25. Craik, D.J., Clark, R.J., and Daly, N.L. 2007, Expert Opin. Investig. Drugs, 16, 595 26. Craik, D.J., Cemazar, M., and Daly, N.L. 2007, Curr. Opin. Drug Discov. Devel., 10, 176. 27. Wang, C.K., Kaas, Q., Chiche, L., and Craik, D.J. 2008, Nucleic Acids Res., 36, D206. 28. Thongyoo, P., Roqué-Rosell, N., Leatherbarrow, R.J., and Tate, E.W. 2008, Org. Biomol. Chem., 6, 1462. 29. Combelles, C., Gracy, J., Heitz, A., Craik, D.J., and Chiche, L. 2008, Proteins, 73, 87. 30. 
Shenkarev, Z.O., Nadezhdin, K.D., Lyukmanova, E.N., Sobol, V.A., Skjeldal, L., and Arseniev, A.S. 2008, J. Inorg. Biochem., 102, 1246. 31. Ireland, D.C., Wang, C.K., Wilson, J.A., Gustafson, K.R., and Craik, D.J. 2008, Biopolymers, 90, 51. 32. Colgrave, M.L., Kotze, A.C., Ireland, D.C., Wang, C.K., and Craik, D.J. 2008, Chembiochem, 9, 1939. 33. Colgrave, M.L., Kotze, A.C., Huang, Y.H., O'Grady, J., Simonsen, S.M., and Craik, D.J. 2008, Biochemistry, 47, 5581. 34. Plan, M.R., Saska, I., Cagauan, A.G., and Craik, D.J. 2008, J. Agric. Food Chem., 56, 5237. 35. Morita, H., Nagashima, S., Takeya, K., and Itokawa, H. 1993, Chem. Pharm. Bull., 41, 992.
36. Morita, H., Nagashima, S., Uchiumi, Y., Kuroki, O., Takeda, K., and Itokawa, H. 1996, Chem. Pharm. Bull., 44, 1026. 37. Saviano, G., Benedetti, E., Cozzolino, R., De Capua, A., Laccetti, P., Palladino, P., Zanotti, G., Amodeo, P., Tancredi, T., and Rossi, F. 2004, Biopolymers, 76, 477. 38. Rossi, F., Zanotti, G., Saviano, M., Iacovino, R., Palladino, P., Saviano, G., Amodeo, P., Tancredi, T., Laccetti, P., Corbier, C., and Benedetti, E. 2004, J. Pept. Sci., 10, 92. 39. Cozzolino, R., Palladino, P., Rossi, F., Cali, G., Benedetti, E., and Laccetti, P. 2005, Carcinogenesis, 26, 733. 40. Battista, M.E., Vitale, A.A., and Pomilio, A.B. 2000, Molecules, 5, 489. 41. Pomilio, A.B., Battista, M.E., and Vitale, A.A. 2001, Theochem, 536, 243. 42. Thompson, S.J., Hattotuwagama, C.K., Holliday, J.D., and Flower, D.R. 2006, Bioinformation, 1, 237. 43. Eros, D., Kövesdi, I., Orfi, L., Takács-Novák, K., Acsády, G., and Kéri, G. 2002, Curr. Med. Chem., 9, 1819. 44. Modi, S. 2003, Drug Discov. Today, 8, 621. 45. Obrezanova, O., Csanyi, G., Gola, J.M., and Segall, M.D. 2007, J. Chem. Inf. Model., 47, 1847. 46. Obrezanova, O., Gola, J.M., Champness, E.J., and Segall, M.D. 2008, J. Comput. Aided Mol. Des., 22, 431. 47. Frecer, V., Berti, F., Benedetti, F., and Miertus, S. 2008, J. Mol. Graph. Model., 27, 376. 48. Chen, Y., Mant, C.T., and Hodges, R.S. 2002, J. Pept. Res., 59, 18. 49. Akamatsu, M., and Fujita, T. 1992, J. Pharm. Sci., 81, 164. 50. Akamatsu, M., Katayama, T., Kishimoto, D., Kurokawa, Y., Shibata, H., Ueno, T., and Fujita T. 1994, J. Pharm. Sci., 83, 1026. 51. Osakai, T., Hirai, T., Wakamiya, T., and Sawada, S. 2006, Phys. Chem. Chem. Phys., 8, 985. 52. Peng, T., Wang, R., and Luhua Lai, L. 1999, J. Mol. Model., 5, 189. 53. Wang, R., Fu, Y., and Lai, L. 1997, J. Chem. Inf. Comp. Sci., 37, 615. 54. Guan, P., Doytchinova, I.A., Walshe, V.A., Borrow, P., and Flower, D.R. 2005, J. Med. Chem., 48, 7418. 55. Weininger, D. 1988, J. Chem. Inf. Comp. Sci., 28, 31. 56. 
Sadowski, J., and Gasteiger, J. 1993, Chem. Rev., 93, 2567. 57. Hattotuwagama, C.K., and Flower, D.R. 2006, Bioinformation, 1, 257. 58. Andres, E., and Dimarcq, J.L. 2004, J. Int. Med., 255, 519. 59. Hancock, R.E.W., and Sahl, H.G. 2006, Nature Biotech., 24, 1551. 60. Beisswenger, C., and Bals, R. 2005, Curr. Protein Pept. Sci., 6, 255. 61. Sitaram, N., and Nagaraj, R. 2002, Curr. Pharm. Des., 8, 727. 62. Park, Y., and Hahm, K.S. 2005, J. Biochem. Mol. Biol., 38, 507. 63. Takada, H., and Kotani, S. 1989, Crit. Rev. Microbiol., 16, 477. 64. Parillo, J.E. 1993, N. Engl. J. Med., 328, 1471. 65. Gough, M., Hancock, R.E.W., and Kelly, N.M. 1996, Infect. Immun., 64, 4922.
66. Ahmad, A., Yadav, S.P., Asthana, N., Mitra, K., Srivastava, S.P., and Ghosh, J.K. 2006, J. Biol. Chem., 281, 22029. 67. Brown, K.L., and Hancock, R.E.W. 2006, Curr. Opin. Immunol., 18, 24. 68. Frecer, V., Ho, B., and Ding, J.L. 2000, Eur. J. Biochem., 267, 837. 69. Chen, X., Dings, R.P., Nesmelova, I., Debbert, S., Haseman, J.R., Maxwell, J., Hoye, T.R., and Mayo, K.H. 2006, J. Med. Chem., 49, 7754. 70. Hilpert, K., Elliot, M.R., Volkmer-Engert, R., Henklein, P., Donini, O., Zhou, Q., Winkler, D.F., and Hancock, R.E.W. 2006, Chem Biol., 13, 1101. 71. Bowdish, D.M.E., Davidson, D.J., Lau, Y.E., Lee, K., Scott, M.G., and Hancock, R.E.W. 2005, J. Leukoc. Biol., 77, 451. 72. McPhee, J.B., and Hancock, R.E.W. 2005, J. Pept. Sci., 11, 677. 73. Lee, M.K., Cha, L., Lee, S.H., and Hahm, K.S. 2002, J. Biochem. Mol. Biol., 35, 291. 74. Zikou, S., Koukkou, A.I., Mastora, P., Sakarellos-Daitsiotis, M., Sakarellos, C., Drainas, C., and Panou-Pomonis, E. 2007, J. Pept. Sci., 13, 481. 75. Hammami, R., Ben Hamida, J., Vergoten, G., and Fliss, I. 2008, Nucleic Acids Res., Epub ahead of print. PMID: 18836196. 76. Wang, G., Li, X., and Wang, Z. 2008, Nucleic Acids Res., Epub ahead of print. PMID: 18957441. 77. Dennison, S.R., Morton, L.H., Harris, F., and Phoenix, D.A. 2007, Biophys. Chem., 129, 279. 78. Dennison, S.R., Morton, L.H., Harris, F., and Phoenix, D.A. 2008, Chem. Phys. Lipids, 151, 92. 79. Chen, Y., Vasil, A.I., Rehaume, L., Mant, C.T., Burns, J.L., Vasil, M.L., Hancock, R.E.W., and Hodges, R.S. 2006, Chem. Biol. Drug Des., 67, 162. 80. Chen, Y., Guarnieri, M.T., Vasil, A.I., Vasil, M.L., Mant, C.T., and Hodges, R.S. 2007, Antimicrob. Agents Chemother., 51, 1398. 81. Ganz, T., and Lehrer, R. I. 1994, Curr. Opin. Immunol., 6, 584. 82. Fattorini, L., Gennaro, R., Zanetti, M., Tan, D., Brunori, L., Giannoni, F., Pardini, M., and Orefici, G. 2004, Peptides, 25, 1075. 83. Mootz, H.D., and Marahiel, M.A. 1997, J. Bacteriol., 179, 6843. 84. Sitaram, N. 2006, Curr. Med. 
Chem., 13, 679. 85. Montesinos, E., and Bardají, E. 2008, Chem. Biodivers., 5, 1225. 86. De Smet, K., and Contreras, R. 2005, Biotechnol. Lett., 27, 1337. 87. Klüver, E., Schulz-Maronde, S., Scheid, S., Meyer, B., Forssmann, W.G., and Adermann, K. 2005, Biochemistry, 44, 9804. 88. Mookherjee, N., Rehaume, L., and Hancock, R.E.W. 2007, Expert Opin. Therap. Targets, 11, 993. 89. Selsted, M.E., and Ouellette, A.J. 2005, Nat. Immunol., 6, 551. 90. Cole, A.M., and Lehrer, R.I. 2003, Curr. Pharm. Des., 9, 1463. 91. Redman, J.E., Wilcoxen, K.M., and Ghadiri, M.R. 2003, J. Comb. Chem., 5, 33. 92. Boddy, C.N. 2004, Chem. Biol., 11, 1599. Comment on: Chem. Biol., 2004, 11, 1635. 93. Devine, D.A., and Hancock, R.E.W. 2002, Curr. Pharm. Des., 8, 703. 94. Jin, Y., Hammer, J., Pate, M., Zhang, Y., Zhu, F., Zmuda, E., and Blazyk, J. 2005, Antimicrob. Agents Chemother., 49, 4957.
95. Muhle, S.A., and Tam, J.P. 2001, Biochemistry, 40, 5777. 96. Ramamoorthy, A., Thennarasu, S., Tan, A., Gottipati, K., Sreekumar, S., Heyl, D.L., An, F.Y., and Shelburne, C.E. 2006, Biochemistry, 45, 6529. 97. Schmitt, M.A., Weisblum, B., and Gellman, S.H. 2007, J. Am. Chem. Soc., 129, 417. 98. Oren, Z., and Shai, Y. 2000, Biochemistry, 39, 6103. 99. Kondejewski, L.H., Lee, D.L., Jelokhani-Niaraki, M., Farmer, S.W., Hancock, R.E.W., and Hodges, R.S. 2002, J. Biol. Chem., 277, 67. 100. Lee, D.L., Powers, J.P.S., Pflegerl, K., Vasil, M.L., Hancock, R.E.W., and Hodges, R.S. 2004, J. Pept. Res., 63, 69. 101. Chen, Y., Guarnieri, M.T., Vasil, A.I., Vasil, M.L., Mant, C.T., and Hodges, R.S. 2007, Antimicrob. Agents Chemother., 51, 1398. 102. Frecer, V., Ho, B., and Ding, J. L. 2004, Antimicrob. Agents Chemother., 48, 3349. 103. Jenssen, H., Hamill, P., and Hancock, R.E.W. 2006, Clin. Microbiol. Rev., 19, 491. 104. Lee, D.L., Mant, C.T., and Hodges, R.S. 2003, J. Biol. Chem., 278, 22918. 105. Brown, K.L, Mookherjee, N., and Hancock, R.E.W. 2007, Antimicrobial, Host Defence Peptides and Proteins. Encyclopedia of Life Sciences. John Wiley & Sons, Ltd, Chichester. 106. Mei, H., Liao, Z.H., Zhou, Y., and Li, S.Z. 2005, Biopolymers, 80, 775. 107. Tian, F., Zhou, P., Lv, F., Song, R., and Li, Z. 2007, J. Pept. Sci., 13, 549. 108. Eisenberg, D., Weiss, R.M., and Terwilliger, T.C. 1982, Nature, 299, 371. 109. Carver, T., and Bleasby, A. 2003, Bioinformatics, 19, 1837. 110. Feder, R., Dagan, A., and Mor, A. 2000, J. Biol. Chem., 275, 4230. 111. Rotem, S., Radzishevsky, I., and Mor, A. 2006, Antimicrob. Agents Chemother., 50, 2666. 112. Roumestand, C., Louis, V., Aumelas, A., Grassy, G., Calas, B., and Chavanieu, A. 1998, FEBS Lett., 421, 263. 113. Skalicky, J.J., Selsted, M.E., and Pardi, A. 1994, Proteins, 20, 52. 114. Abraham, T., Lewis, R.N., Hodges, R.S., and McElhaney, R.N. 2005, Biochemistry, 44, 2103. 115. 
Robinson, J.A., Shankaramma, S.C., Jetter, P., Kienzl, U., Schwenderer, R.A., Vrijbloed, J. W., and Obrecht, D. 2005, Bioorg. Med. Chem., 13, 2055. 116. Chen, J., Falla, T.J., Liu, H., Hurst, M.A., Fujii, C.A., Mosca, D.A., Embree, J.R., Loury, D.J., Radel, P.A., Chang, C.C., Gu, L., and Fiddes, J.C. 2000, Biopolymers, 55, 88. 117. Shankaramma, S.C., Athanassiou, Z., Zerbe, O., Moehle, K., Mouton, C., Bernardini, F., Vrijbloed, J.W., Obrecht, D., and Robinson, J.A. 2002, ChemBioChem, 3, 1126. 118. Shankaramma, S.C., Moehle, K., James, S., Vrijbloed, J.W., Obrecht, D., and Robinson, J.A. 2003, Chem. Commun., 1842. 119. Favre, M., Moehle, K., Jiang, L., Pfeiffer, B., and Robinson, J.A. 1999, J. Am. Chem. Soc., 121, 2679.
120. Descours, A., Moehle, K., Renard, A., and Robinson, J.A. 2002. ChemBioChem, 3, 318. 121. Athanassiou, Z., Dias, R.L.A., Moehle, K., Dobson, N., Varani, G., and Robinson, J.A. 2004, J. Am. Chem. Soc., 126, 6906. 122. Fasan, R., Dias, R.L.A., Moehle, K., Zerbe, O., Vrijbloed, J.W., Obrecht, D., and Robinson, J.A. 2004, Angew. Chem., Int. Ed., 43, 2109. 123. Frecer, V. 2006, Biorg. Med. Chem., 14, 6065. 124. Dawson, R.M.C., Elliott, D.C., Elliott, W.H., and Jones, K.M. 1986, Data for Biochemical Research. 3rd Ed., Oxford Science Publications, 1. 125. Jenssen, H., Lejon, T., Hilpert, K., Fjell, C., Cherkasov, A., and Hancock, R.E.W. 2007, Chem. Biol. Drug Design, 70, 134. 126. Rogers, D., and Hopfinger, A.J. 1994, J. Chem. Inf. Comput. Sci., 34, 854. 127. Friedman, J. 1990, Multivariate Adaptive Regression Splines. Revised edition, 1st ed: 1988, Technical Report 102, Laboratory for Computational Statistics, Department of Statistics, Stanford University: Stanford, CA. 128. Tam, J.P., Wu, C., Yang, J.-L. 2000, Eur. J. Biochem., 267, 3289. 129. Lai, J.R., Huck, B.R., Weisblum, B., and Gellman, S.H. 2002, Biochemistry, 41, 12835. 130. Mani, R., Waring, A.J., Lehrer, R.I., and Hong, M. 2005, Biochim. Biophys. Acta, 1716, 11. 131. Bhonsle, J.B., Venugopal, D., Huddler, D.P., Magill, A.J., and Hicks, R.P. 2007, J. Med. Chem., 50, 6545. 132. Sundriyal, S., Sharma, R.K., Jain, R., and Bharatam, P.V. 2008, J. Mol. Model., 14, 265. 133. Yu, L., Ding, J.L., Ho, B., and Wohland, T. 2005, Biochim. Biophys. Acta, 1716, 29. 134. Wu, M., and Hancock, R.E.W. 1999, J. Biol. Chem., 274, 29. 135. Glukhov, E., Stark, M., Burrows, L.L., and Deber, C.M. 2005, J. Biol. Chem., 280, 33960. 136. Hancock, R.E.W., and Rozek, A. 2002, FEMS Microbiol. Lett., 206, 143. 137. Blondelle, S.E., Lohner, K., and Aguilar, M. 1999, Biochim. Biophys. Acta, 1462, 89. 138. Sitaram, N., and Nagaraj, R. 1999, Biochim. Biophys. Acta, 1462, 29. 139. Zhang, L., Rozek, A., and Hancock, R.E.W. 2001, J. Biol. 
Chem., 276, 35714. 140. Ehrenstein, G., and Lecar, H. 1977, Q. Rev. Biophys., 10, 1. 141. Pouny, Y., Rapaport, D., Mor, A., Nicolas, P., and Shai, Y. 1992, Biochemistry, 31, 12416. 142. Salgado, J., Grage, S.L., Kondejewski, L.H., Hodges, R.S., McElhaney, R.N., and Ulrich, A.S. 2001, J. Biomol. NMR, 21, 191. 143. Chen, Y., Mant, C.T., Farmer, S.W., Hancock, R.E.W., Vasil, M.L., and Hodges, R.S. 2005, J. Biol. Chem., 280, 12316. 144. Liu, F., Lewis, R.N., Hodges, R.S., and McElhaney, R.N. 2004, Biochemistry, 43, 3679. 145. Liu, F., Lewis, R.N., Hodges, R.S., and McElhaney, R.N. 2004, Biophys. J., 87, 2470.
146. Devaux, P.F., and Seigneuret, M. 1985, Biochim. Biophys. Acta, 822, 63. 147. Glukhov, E., Burrows, L.L., and Deber, C.M. 2008, Biopolymers, 89, 360. 148. Katsara, M., Tselios, T., Deraos, S., Deraos, G., Matsoukas, M., Lazoura, E., Matsoukas, J., and Apostolopoulos, V. 2006, Curr. Med. Chem., 13, 2221. 149. Göransson, U., Svangård, E., Claeson, P., and Bohlin, L. 2004, Curr. Protein Pept. Sci., 5, 317. 150. Svangård, E., Göransson, U., Hocaoglu, Z., Gullbo, J., Larsson, R., Claeson, P., and Bohlin, L. 2004, J. Nat. Prod., 67, 144. 151. Chen, B., Colgrave, M.L., Wang, C., and Craik, D.J. 2006, J. Nat. Prod., 69, 23. 152. Herrmann, A., Burman, R., Mylne, J.S., Karlsson, G., Gullbo, J., Craik, D.J., Clark, R.J., and Göransson, U. 2008, Phytochemistry, 69, 939. 153. Gonçalves, V., Gautier, B., Coric, P., Bouaziz, S., Lenoir, C., Garbay, C., Vidal, M., and Inguimbert, N. 2007, J. Med. Chem., 50, 5135. 154. Verma, R.P., and Hansch, C. 2008, J. Pharm. Sci., 97, 88. 155. Zalloum, H.M., and Taha, M.O. 2008, J. Mol. Graph. Model., 27, 439.
Research Signpost 37/661 (2), Fort P.O. Trivandrum-695 023 Kerala, India
QSPR-QSAR Studies on Desired Properties for Drug Design, 2010: 35-62 ISBN: 978-81-308-0404-0 Editor: Eduardo A. Castro
2. QSPR studies on amino acids: Application to proteins
Francisco Torrens1,* and Gloria Castellano2
1Institut Universitari de Ciència Molecular, Universitat de València, Edifici d’Instituts de Paterna, P. O. Box 22085, E 46071 València, Spain; 2Departamento de Ciencias Experimentales y Matemáticas, Facultad de Ciencias Experimentales, Universidad Católica de Valencia San Vicente Mártir, Guillem de Castro 94, E 46003 València, Spain
Abstract. Valence topological charge-transfer (CT) indices are applied to the calculation of the pH at the isoelectric point (pI). The combination of CT indices allows the estimation of pI. The model is generalized for molecules with heteroatoms. The ability of the indices to describe the molecular charge distribution is established by comparing them with the pI of 21 amino acids. Linear correlation models are obtained. The CT indices improve multivariable regression equations for pI. The variance decreases by 95%. No superposition of the corresponding Gk–Jk and GkV–JkV pairs is observed in most fits, which diminishes the risk of collinearity. The inclusion of heteroatoms in the π-electron system is beneficial for the description of pI, owing either to the role of the additional p orbitals provided by the heteroatom or to the role of steric factors in π-electron conjugation. The use of only CT and valence CT indices {Gk, Jk, GkV, JkV} gives limited results for modelling the pI of amino acids. Furthermore, the inclusion of the numbers of acidic and basic groups improves all models. The effect is especially noticeable for amino acids with more than two functional groups. The fitting line obtained for the 21 amino acids can be used to estimate the isoelectric point of lysozyme and its fragments, by only replacing (1+Δn/nT) with (M+Δn)/nT. For lysozyme, the results of smaller fragments can estimate that of the whole protein with 1–13% errors.
Correspondence/Reprint request: Dr. Francisco Torrens, Institut Universitari de Ciència Molecular, Universitat de València, Edifici d’Instituts de Paterna, P. O. Box 22085, E 46071 València, Spain
Introduction
During the simulation of the pH at the isoelectric point (pI) of n = 21 amino acids, Pogliani introduced the indices D and Dv, and the concept of fragmentary molecular connectivity indices [1,2]. He defined terms of the form XpI = (χ/0χv)(1+Δn/nT), where Δn = nA – nB, nA is the number of acidic groups (two for Asp and Glu, one for all others), nB is the number of basic groups (two for His and Lys, three for Arg, and one for all others), and nT = nA + nB is the total number of functional groups; for nT = 2, Δn = 0 [3–5]. There were eight such terms, one for each type of index χ that enters the numerator. The nomenclature follows the index: for χ = Dv, X ≡ DXv, etc. The best single descriptor for pI was 0Xv, with Q = 2.12, F = 267, r = 0.966, s = 0.46, u = (16, 28). He improved the statistic Q at the expense of the statistics F and u with a linear combination of X terms made up of connectivity indices, which can be derived with the aid of both forward and full combinatorial techniques: {DXv, 0X, 0Xv, 1X}: Q = 2.53, F = 95, r = 0.980, s = 0.39, u = (3.1, 2.8, 4.7, 2.8, 26). The average <u> dropped to 7.9, the utility of 0Xv dropped dramatically, and only the unitary index maintained a good utility. He then used the vector of orthogonalized terms Ω = (1Ω, 2Ω, 3Ω, 4Ω, U0), where 1Ω ≡ 0Xv, 2Ω ← DXv, 3Ω ← 1X, 4Ω ← 0X. The vector showed u = (19, 1.3, 1.0, 2.8, 33). The parameters 1Ω ≡ 0Xv and U0 ≡ Ω0 ≡ 1 were important, with enhanced utilities of 19 and 33, respectively. The statistical score of the molar masses for pI was Q = 0.002 and F = 0.14. He confirmed a small interrelation between the eight terms: <rIM(pI:{X})> = 0.560, rw(DX, Xt) = 0.004, and rs(DX, 1X) = 0.975, where rw and rs stand for the weakest and strongest interrelations, respectively. The 0Xv term was trivial, being nothing other than (1 + Δn/nT) (cf. [6] for a review).
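The functional-group factor (1 + Δn/nT) that enters these terms follows directly from the group counts quoted above. The short sketch below evaluates it for a few residues; the dictionary and function names are illustrative, not part of the original work.

```python
# Numbers of acidic (n_A) and basic (n_B) groups as listed in the text;
# every amino acid not named here has one group of each kind.
N_ACIDIC = {"Asp": 2, "Glu": 2}
N_BASIC = {"His": 2, "Lys": 2, "Arg": 3}


def group_factor(residue):
    """Return (delta_n, n_T, 1 + delta_n / n_T) for a three-letter residue code."""
    n_a = N_ACIDIC.get(residue, 1)
    n_b = N_BASIC.get(residue, 1)
    delta_n = n_a - n_b
    n_t = n_a + n_b
    return delta_n, n_t, 1 + delta_n / n_t


for name in ("Gly", "Asp", "Lys", "Arg"):
    print(name, group_factor(name))
```

For residues with only two functional groups the factor reduces to 1, so it discriminates only the acidic and basic amino acids.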
He discovered the term X’pI = {[(1χv)0.5– 180χtv]/D}(0.04χtv+Δn/nT), whose modelling power was remarkable: Q = 3.41, F = 693, r = 0.987, s = 0.29, <u> = 58, u = (26, 90), with the correlation vector C = (77.99429, 5.75382). He wrote the modelling equation pI = 5.75 + 77.99X’pI, in which X’pI was a highly dominant dead-end term. The 0Xv term was mainly based on valence-type molecular connectivity indices. The generation and decomposition of amino-acid and peptide radicals are processes of biological importance, owing to their connection to the oxidative damage caused by ionizing radiation or oxidizing agents [7,8].
Experimental studies showed that amino-acid and peptide radical cations can be generated by the electrospray technique and peptide cationization using Cu2+ [9]. The mass spectra obtained in these cases are rich and differ considerably from those of protonated systems, which can provide useful information in peptide sequencing. The group of Sodupe performed quantum chemical calculations on nine amino acids and the smallest peptide, N-glycylglycine [10,11]. They discussed the influence of intramolecular hydrogen bonds and of the amino-acid side chain on the localization of the electron hole upon oxidation and on the subsequent fragmentation process. They showed that for systems involving aromatic amino acids oxidation occurs mainly at the side chain, whereas for nonaromatic ones it occurs at either the basic NH2 or the CO group, the nature of the electron hole depending on the existing intramolecular hydrogen bonds. In earlier publications, topological charge-transfer (CT) indices [12] were applied to the calculation of the molecular dipole moment of hydrocarbons [13], valence-isoelectronic series of benzene, styrene [14,15] and cyclopentadiene [16], phenyl alcohols [17], 4-alkylanilines [18] and amino acids [19]. In the present report, the valence CT indices are applied to the calculation of the pH at the isoelectric point (pI) of 21 amino acids. The next section presents the CT indices and their generalization for heteroatoms. Following that, the fractal and fractal hybrid orbital (HO) analyses of the tertiary structure of protein molecules are presented. Then, the phylogenesis of avian birds and the 1918 influenza virus are examined. Next, distinct molecular surfaces and the hydrophobicity of amino-acid residues in proteins are reviewed. Following that, volumetric studies of L-valine and L-leucine in aqueous solutions of NaBr are reviewed. Then, the pH-dependent properties of proteins using pKa calculations, and accurate, conformation-dependent predictions of solvent effects on protein ionization constants, are analyzed. Next, a simple method for protein structural classification is summarized. Following that, the use of amino-acid composition to predict ligand-binding sites is reported. Then, the calculation results are presented and discussed. The last section summarizes the perspectives.
Computational method
The most important matrices that delineate the labelled chemical graph are the adjacency (A) [20] and distance (D) matrices, wherein Dij = lij if i ≠ j, and 0 otherwise; lij is the shortest edge count between vertices i and j [21]. In A, Aij = 1 if vertices i and j are adjacent, and 0 otherwise. The D[–2] matrix is that whose elements are the squares of the reciprocal distances, Dij–2. The intermediate matrix M is defined as the matrix product of A by D[–2]: M = AD[–2].
Francisco Torrens & Gloria Castellano
The CT matrix C is defined as C = M – MT, where MT is the transpose of M [22]. By agreement Cii = Mii. For i ≠ j, the Cij terms represent a measure of the intramolecular net charge transferred from atom j to i. The topological CT indices Gk are described as the sum of absolute values of the Cij terms defined for the vertices i,j placed at a topological distance Dij equal to k:

$$G_k = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} |C_{ij}| \, \delta(k, D_{ij}) \qquad (1)$$
where N is the number of vertices in the graph, Dij are the entries of the D matrix, and δ is the Kronecker δ function, with δ(k,Dij) = 1 for Dij = k and δ(k,Dij) = 0 for Dij ≠ k. The Gk thus represent the sum of the absolute values of all the Cij terms for every pair of vertices i and j at topological distance k. Another topological CT index, Jk, is defined as:

$$J_k = \frac{G_k}{N-1} \qquad (2)$$
The index represents the mean value of CT for each edge, since the number of edges for acyclic compounds is N – 1. When heteroatoms are present, some way of discriminating atoms of different kinds needs to be considered [23]. In valence CT-index terms, the presence of each heteroatom is taken into account by introducing its electronegativity in the corresponding entry of the main diagonal of the adjacency matrix A. For each heteroatom X its entry Aii is redefined as:

$$A^{V}_{ii} = 2.2\,(\chi_X - \chi_C) \qquad (3)$$
to give the valence adjacency matrix AV, where χX and χC are the electronegativities of heteroatom X and carbon, respectively, in Pauling units. The subtractive term keeps AiiV = 0 for the C atom, and the factor gives AiiV = 2.2 for O, which was taken as standard. From AV instead of A, MV, CV, GkV and JkV are calculated following the former procedure. The CiiV, GkV and JkV are graph invariants. The enzyme protein lysozyme (129 amino-acid residues, molecular weight 14 307 g·mol–1) has been taken from the Protein Data Bank, code 2LYM (cf. [24] for a review). The charge on lysozyme is +12.0e at pH 4.0, +8.0e at pH 7.0, +4.0e at pH 10.0 and decreases rapidly as the isoelectric point at pH 11.35 is approached [25].
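The definitions in Equations (1)–(3) can be illustrated with a minimal Python sketch for glycine. Several choices are assumptions of this sketch and are not stated in the text: the function name ct_indices, the heavy-atom (hydrogen-suppressed) graph, the entry 2 for the C=O double bond in the adjacency matrix, and the electronegativities χC = 2.5, χN = 3.0 and χO = 3.5. Under these assumptions the sketch returns G1 = 2.0000, G2 = 2.3333, G3 = 0.2500, G1V = 4.5000, G2V = 2.9361 and G3V = 0.4944, which agree with the Gly row of Table 7.

```python
import numpy as np


def ct_indices(adjacency, max_order=5):
    """Return (G, J): lists of G_k and J_k for k = 1..max_order."""
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    # Topological distance matrix D from off-diagonal connectivity (Floyd-Warshall).
    bonded = (A != 0) & ~np.eye(n, dtype=bool)
    D = np.where(bonded, 1.0, np.inf)
    np.fill_diagonal(D, 0.0)
    for m in range(n):
        D = np.minimum(D, D[:, [m]] + D[[m], :])
    # D^[-2]: squared reciprocal distances, zeros on the diagonal.
    D_inv2 = np.zeros_like(D)
    off = D > 0
    D_inv2[off] = 1.0 / D[off] ** 2
    M = A @ D_inv2  # intermediate matrix M = A D^[-2]
    C = M - M.T     # charge-transfer matrix (off-diagonal terms)
    rows, cols = np.triu_indices(n, k=1)
    G = [float(np.abs(C[rows, cols])[D[rows, cols] == k].sum())
         for k in range(1, max_order + 1)]
    J = [g / (n - 1) for g in G]  # Equation (2)
    return G, J


# Heavy-atom graph of glycine: N, C-alpha, C, O (carbonyl), O (hydroxyl).
# Assumption of this sketch: the C=O double bond is entered with weight 2.
gly = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 2, 1],
    [0, 0, 2, 0, 0],
    [0, 0, 1, 0, 0],
], dtype=float)

# Assumed electronegativities; they give A_ii^V = 2.2 for O, as in the text.
chi = {"C": 2.5, "N": 3.0, "O": 3.5}
gly_valence = gly.copy()
for i, element in enumerate(["N", "C", "C", "O", "O"]):
    gly_valence[i, i] = 2.2 * (chi[element] - chi["C"])  # Equation (3)

G, J = ct_indices(gly)
GV, JV = ct_indices(gly_valence)
print([round(g, 4) for g in G], [round(g, 4) for g in GV])
```

The diagonal terms Cii = Mii are not needed for Gk and Jk, so the sketch leaves them out.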
QSPR studies on amino acids: Application to proteins
Fractal analysis of tertiary structure of protein molecule
A method for the computation of a dimension index D was implemented in our program TOPO and applied to calculate the solvent-accessible surfaces (SASs) of molecules [26–30]. TOPO distinguished external from internal atoms and used this feature to give two fractal-like dimension indices, viz. D and D’ [31]. The D’–D difference was a sensitive measure of the occurrence of atoms that were hidden from the solvent. For molecules with buried (solvent-excluded) atoms the difference was greater (e.g., faujasite). The procedure was compared with our version of code GEPOL, which provided high-quality results [32–41]. The TOPO systematic error was easily corrected by simple addition of a small constant value (0.011). Correlation models between indices D and D’, globularity G, rugosity G’, dipole moment and other properties made clear the existence of a homogeneous molecular structure in each series. Additional applications were the extrapolation of D to infinite polymers, the variation of D with generations of dendrimers and a revision of D for lysozyme. A comparative analysis with the results obtained with our version of program SURMO2, which does not consider the cavity, allowed the cavity to be characterized [42]. The method was applied to the lysozyme secondary structures and cavity-like space (cf. Figure 1). The dipole moments calculated for the helices were greater than for the sheet. For helices, the main contribution to the water-accessible surface area was the hydrophobic term, while the hydrophilic component dominated in the sheet. The molecular globularity G was the topological index that best differentiated quantitatively between helices and sheet (cf. Table 1).
Figure 1. Ribbon image of lysozyme linking Cα skeleton: A–D) helices, E) β-sheet and S) binding site.
Table 1. Topological indices for lysozyme secondary-structure regions [G’ (Å–1)].
Structure                              G      G ref.  G'     G' ref.  D      D ref.  D'
Mean of helices A–D                    0.472  0.442   1.092  1.161    1.581  1.597   1.749
Antiparallel β-sheet E                 0.404  0.381   1.107  1.173    1.629  1.646   1.821
All molecule                           0.203  0.188   1.033  1.113    1.908  1.930   2.201
All molecule (cavity not considered)   0.776  0.188   0.242  1.113    1.542  1.930   2.201
Cavity                                 0.140  –       2.058  –        3.242  –       –
The cavity-like space showed the greatest fractal-like index, indicating the maximal sensitivity to solvent size. It was suggested that the catalytic activity of the enzyme was located at the cavity-like space. The tertiary structures of 43 proteins, selected to cover the five structural classes of protein molecules, were analyzed with fractal theory, with the intention of devising a new tool for the quantitative description of the tertiary structure of proteins [43]. A brief introduction to fractal theory was reported. It was demonstrated that the principles dictating the folding of the local and global backbone structures were well characterized in terms of the representation of fractal theory (cf. Figure 2) [44]. Comparison of the fractal character D of protein molecules with that of the ideal Gaussian chain (GC) revealed several features of these principles (cf. Table 2). The proteins in the β-type structural class were quantitatively distinguished from those in other classes with this representation. The unexpected finding that several proteins showed a fractal dimension greater than two was reported and discussed.
Figure 2. Measurement of the length of a protein.
Table 2. Fractal dimension of the solvent-accessible surface for some proteins.
Structural class                                               Number of residues   D
Only α-helix                                                   136                  1.39
Almost exclusively β-sheet                                     191                  1.29
α-Helix and β-sheet tend to be segregated throughout the chain 180                  1.34
α-Helix and β-sheet tend to alternate throughout the chain     292                  1.34
Neither α-helix nor β-sheet                                    109                  1.40
Mean of the five structural classes                            211                  1.34
Gaussian chain                                                 –                    1.50
Fractal hybrid orbital analysis of tertiary structure of protein
The bond angles and fractal dimensions for ideal hybrid orbitals (HOs) were reported (cf. Table 3). The hybridizations of the C atom were reported (cf. Table 4).
Table 3. Bond angles and fractal dimensions for ideal hybrid orbitals [θ: bond angle (º)].
Ideal hybrid orbital   n   s-ratio   Example polymer    θ        D
sp                     1   0.500     –C≡C–C≡            180.00   1.000
sp2                    2   0.333     –CH=CH–CH=         120.00   1.262
sp3                    3   0.250     –CH2–CH2–CH2–      109.47   1.413
p                      ∞   0.000     –                  90.00    2.000
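The relation between hybridization, bond angle and fractal dimension is not written out in the text. As a hedged illustration, the Python sketch below assumes cos θ = –1/n for an sp^n hybrid, s-ratio = 1/(1 + n) and D = ln 2 / ln[2 sin(θ/2)]; the function names are hypothetical. These assumptions reproduce the θ and D columns of Table 3, and the inverse relation maps D = 1.40 to s ≈ 0.26 and n ≈ 2.9 and D = 1.50 to s ≈ 0.21 and n ≈ 3.8, as tabulated for the proteins and the Gaussian chain in Table 5.

```python
import math


def fractal_dimension(n):
    """Fractal dimension D of an ideal sp^n chain (n = math.inf for a pure p orbital)."""
    cos_theta = 0.0 if math.isinf(n) else -1.0 / n      # assumed: cos(theta) = -1/n
    step = 2.0 * math.sqrt((1.0 - cos_theta) / 2.0)     # 2 sin(theta/2)
    return math.log(2.0) / math.log(step)               # assumed: D = ln 2 / ln(2 sin(theta/2))


def hybridization_index(d):
    """Invert the assumed relation: n of the sp^n hybrid for a given D (1 <= D < 2)."""
    return 1.0 / (2.0 ** (2.0 / d - 1.0) - 1.0)


def s_ratio(n):
    """Fraction of s character in an sp^n hybrid."""
    return 1.0 / (1.0 + n)


for label, n in [("sp", 1), ("sp2", 2), ("sp3", 3), ("p", math.inf)]:
    print(label, round(s_ratio(n), 3), round(fractal_dimension(n), 3))

n_protein = hybridization_index(1.40)
print(round(s_ratio(n_protein), 2), round(n_protein, 1))
```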
Table 4. Hybridizations of carbon [C–H energy (kJ·mol–1), length (Å), bond angle (º), JC–H (Hz)].
Carbon hybrid   s-ratio   Example molecule   C–H energy   C–H length   Bond angle   pKa   JC–H
sp              0.500     CH≡CH              506          1.057        180.00       25    249
sp2             0.333     CH2=CH2            444          1.079        120.00       42    156
sp3             0.250     CH4                423          1.094        109.47       47    125
The concept of fractal was applied to a number of properties of proteins [45]. The structure and shape of the polypeptide chain of proteins were determined by the hybridized states of atomic orbitals (AOs) in the molecular chain. The fractal dimensions, in the range of short distances (1–10 Å), of the tertiary structures of eight proteins covering four structural classes were analyzed and compared with a GC (cf. Table 5). The s-ratios in the sp^n HOs were computed from the fractal dimensions. A mean value of ca. 0.29 predicted sp^2.46 HOs, halfway between planar sp2 and tetrahedral sp3 HOs. The proteins in the β-structural class were quantitatively distinguished from other classes with this representation. They showed a higher s-ratio (ca. 0.32), which predicted ca. sp^2.1 HOs, rather similar to planar sp2 HOs. The comparison of the proteins with a GC was interpreted in terms of steric repulsion.
Table 5. Contribution of s AO in sp^n HOs from fractal dimension for proteins (Nr: No. of residues).
Secondary structure                          Protein                             Nr    D      s      n
α-Helix                                      Haemoglobin (deoxy)                 141   1.40   0.26   2.9
α-Helix                                      Myoglobin (sperm whale, deoxy)      153   1.42   0.25   3.1
β-Sheet                                      Immunoglobulin                      208   1.26   0.33   2.0
β-Sheet                                      Trypsin (native, pH 8)              223   1.30   0.31   2.2
α-Helix and β-sheet in separate regions      Lysozyme (hen egg-white)            129   1.42   0.25   3.1
α-Helix and β-sheet in separate regions      Ribonuclease A                      124   1.33   0.29   2.4
α-Helix and β-sheet in alternate regions     Adenylate kinase (porcine muscle)   194   1.36   0.28   2.6
α-Helix and β-sheet in alternate regions     Phosphoglycerate kinase (horse)     408   1.33   0.29   2.4
Mean                                         –                                   198   1.34   0.29   2.5
Gaussian chain                               –                                   –     1.50   0.21   3.8
The s-ratios in the sp^n HOs were also computed for 81 proteins (cf. Table 6) [46]. Iron proteins were compared with the self-avoiding random walk (SAW) model, and two classes were quantitatively distinguished: ferric hemeproteins and iron–sulphur (Fe–S) proteins [47]. The unexpected finding that a protein showed sp^0.5 HOs was discussed [48]. A dependence of fractal HOs on ionic strength was observed [49]. The s-ratios in the sp^n HOs were also computed for 43 proteins selected to cover the five structural classes of protein molecules [50]. It was demonstrated that the principles dictating the folding of the local and global backbone structures were well characterized in terms of the representation given by fractal theory. Comparison of the fractal character of protein molecules with that of the ideal GC revealed several features of these principles. β-type proteins were distinguished quantitatively from those in other classes. They showed a greater s-ratio (ca. 0.32), corresponding to sp^2.20 HOs, rather similar to planar sp2 HOs.
Table 6. Contribution of s AO in the sp^n HOs from fractal dimension of SAS surface of Fe proteins.
Protein                                Nr    D      s-ratio   n
Ferric hemeproteins                    155   1.56   0.18      4.8
Iron–sulphur proteins (Fe2S2·Cys4)     102   1.30   0.31      2.3
Mean of iron proteins                  138   1.47   0.23      4.0
Various proteins                       196   1.48   0.22      3.6
Mean of Tables 5 and 6                 186   1.46   0.23      4.1
Phylogenesis of avian birds and the 1918 influenza virus
Lysozyme is an enzyme with 129 residues. The amino-acid compositions of certain avian lysozymes were determined. The amino-acid sequence of hen egg-white lysozyme was annotated [51]. Certain discrepancies exist between this sequence and that reported in [52] (at residues 40, 41, 42, 46, 48, 58, 65, 66, 92 and 93). Crystallographic analysis [53] gave results for residues 40, 41, 42, 58, 59, 92 and 93 that are in agreement with the former. The discrepancies at residues 46, 48, 65 and 66 concern the choice between Asp and Asn; from the electron density maps it could not be determined whether these residues are amide or free acid. The amino-acid sequences of duck, Japanese quail and turkey egg-white lysozymes were determined [54–56]. Amino-acid sequences for human urine and milk lysozymes were also determined.
Comparative studies of sequences for lysozymes of different origins are interesting from the viewpoint of structure–function relationships (Asp-101 of hen lysozyme, which is known to be implicated at the substrate binding site, is replaced by Gly in turkey lysozyme). Trp-62 of hen lysozyme, which also plays an important role in substrate binding, is replaced by Tyr in human lysozyme. The differences between the sequences of the avian species compared are expressed as the percentage of different amino acids in lysozyme. The greater the differences, the farther back in time the separation between species must be. Grouping level b can be identified with biological time. The obtained phylogenetic tree is represented by the scheme: (1,…,5) → (1,4,5)(2,3) → (1,5)(2,3)(4) → (1)(2,3)(4)(5) → (1)(2)(3)(4)(5). The scheme is in agreement with data obtained in morphological studies. The optimality criterion SS associated with different proposals for phylogenetic trees allows the equipartition conjecture to be validated or invalidated in phylogenesis. If, in the calculation of the entropy associated with the phylogenetic tree, a species is systematically omitted, the difference between the entropies with and without the species can be considered as a measure of the species entropy. The contributions may be studied with the equipartition conjecture. It is not within the scope of the simulation method to replace biological tests of drugs or field data in palaeontology, but such simulation methods can be useful to set priorities in detailed experimental research. Available experimental and field data should be examined by different classification algorithms to reveal possible features of real biological significance [57]. The scheme (cf. Figure 3) also agrees with the method based on entropy production and the conjecture of equipartition of entropy production. Each band of parallel edges indicates a split.
The distance between any two taxa x and y corresponds to the sum of weights of all splits that separate x and y. An arsenal of effective medicines is available, and others are under development. Research on viral genomes expedites progress [58–61]. In the search for the keys to the origin of the 1918 virus hemagglutinin (HA), the gene sequences of HA subtype H1 of several strains of influenza virus were analyzed [62–64]. Its phylogenetic tree was built [65,66]. The samples of the 1918 strain fall within the family of influenza viruses adapted to man (cf. Figure 4). The distance between the 1918 gene H1 and the known avian family indicates that it originated in a strain of avian influenza virus, although it evolved in an unidentified host before emerging in 1918. Biomacromolecular structural data are maintained by the Protein Data Bank (PDB) [67]. Classroom applications were described [68–71]. Three-dimensional structures can be displayed by program RasMol [72]. Program WPDB compresses PDB structure files into a set of indexed files [73,74]. Program BABEL converts
Figure 3. Dendrogram (binary tree) for distances as different amino acids in lysozyme.
molecular modelling file formats [75]. Database system RELIBASE+ analyzes protein-ligand structures in PDB [76]. Classroom applications of WPDB were described [77]. The successor of RasMol and Chime [78] is Jmol [79–86].
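The distance used above for the lysozyme sequences, the percentage of different amino acids between two species, can be sketched in Python as follows. The function names and the three ten-residue strings are hypothetical placeholders, not real lysozyme data, and the sketch assumes the sequences are already aligned to the same length.

```python
from itertools import combinations


def percent_difference(seq_a, seq_b):
    """Percentage of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return 100.0 * mismatches / len(seq_a)


def distance_table(sequences):
    """Pairwise percentage differences for a dict {species: sequence}."""
    return {(p, q): percent_difference(sequences[p], sequences[q])
            for p, q in combinations(sorted(sequences), 2)}


# Hypothetical aligned fragments, for illustration only.
toy = {
    "species_1": "KVFGRCELAA",
    "species_2": "KVYGRCELAA",
    "species_3": "KVYSRCELAA",
}
table = distance_table(toy)
for pair, value in table.items():
    print(pair, value)
```

A table of this kind is the input that a hierarchical grouping procedure would use to build a dendrogram such as the one in Figure 3.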
Molecular surfaces/hydrophobicity of amino-acid residues in proteins
Hydrophobicity was a useful concept to rationalize the role played by amino-acid residues, in terms of buried or exposed conformations, with regard to the aqueous environment in proteins [87]. The relationship of this concept with distinct approaches to represent the molecular surface was analyzed by computing reliable surface areas for three definitions, viz. the
Figure 4. Familiar dendrogram of influenza.
van der Waals, SAS, and solvent-excluded molecular surfaces. The surface areas were obtained for all the naturally occurring amino acids by first setting a proper reference standard state and then calculating the values for a database of proteins containing a total of 4 297 residues. Despite the great differences in the molecular surfaces, proper indices were defined for handling the information of interest to study the hydrophobic behaviour of amino acids provided by the surfaces.
Volumetric studies of L-valine and L-leucine in aqueous solutions of NaBr
Densities (ρ) of the amino acids L-valine and L-leucine in aqueous solutions of sodium bromide (NaBr, 0.05–1.0 mol·kg–1) were measured at 298.15 K [88]. From the densities, apparent molar volumes (Vφ) and partial
molar volumes of the amino acids (Vφº) at infinite dilution were evaluated. The data were combined with the known values of Vφº of glycine and L-alanine in aqueous NaBr solutions at 298.15K in order to determine the group contributions. The trends of transfer volumes (ΔVφº) were interpreted in terms of solute–cosolute interactions on the basis of a cosphere overlap model. Pair and triplet interaction coefficients were also calculated from transfer parameters.
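The text does not give the working equations of [88]. As a hedged sketch, the Python code below uses the standard textbook definition of the apparent molar volume from densities, Vφ = M/ρ – 1000(ρ – ρ0)/(mρρ0), with ρ and ρ0 in g·cm–3 and m in mol·kg–1, and takes the transfer volume as the difference between the limiting values in the cosolute solution and in water. The function names and the density of the example solution are hypothetical.

```python
def apparent_molar_volume(molar_mass, molality, rho, rho0):
    """Apparent molar volume in cm^3/mol.

    molar_mass in g/mol, molality in mol/kg, rho (solution) and rho0 (solvent) in g/cm^3.
    """
    return molar_mass / rho - 1000.0 * (rho - rho0) / (molality * rho * rho0)


def transfer_volume(v_in_cosolute, v_in_water):
    """Transfer volume: limiting partial molar volume in the cosolute solution minus that in water."""
    return v_in_cosolute - v_in_water


M_VALINE = 117.15    # g/mol, molar mass of valine
RHO_WATER = 0.99705  # g/cm^3, water at 298.15 K
rho_solution = 0.99970  # hypothetical density, for illustration only

v_phi = apparent_molar_volume(M_VALINE, 0.10, rho_solution, RHO_WATER)
print(round(v_phi, 2))
```

When the solvent is an aqueous NaBr solution, ρ0 is the density of that solution rather than of pure water.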
Analysis/predictions of solvent effects on protein ionization constants
The results of protein pKa calculations were routinely analyzed to understand the pH-dependence of protein characteristics, e.g., stability and catalysis [89]. Systems of functionally important titratable groups were identified from protein pKa calculations, but the rationalization of the behaviour of these systems was inherently problematic owing to a lack of theoretical tools and methods. A number of methods for analyzing the results of protein pKa calculations were reported, which were embedded in a graphical user interface (pKaTool). Methods for assessing the reliability of protein pKa calculations and for analyzing the roles of individual residues in determining active-site pKa values and the pH-dependence of protein stability were reported. The published methods were freely available to academic researchers. Predicting how aqueous solvent modulated the conformational transitions and influenced the pKa values that regulated the biological functions of biomolecules remained an unsolved challenge [90]. To address the problem, FDPB_MF was developed, a rotamer repacking method that exhaustively sampled side-chain conformational space and rigorously calculated multibody protein–solvent interactions. FDPB_MF predicted the effects on pKa values of various solvent exposures, large ionic strength variations, strong energetic couplings, structural reorganizations and sequence mutations. FDPB_MF achieved high accuracy, with root-mean-square deviations within 0.3 pH unit of the experimental values measured for turkey ovomucoid third domain, hen lysozyme, Bacillus circulans xylanase, as well as human and Escherichia coli thioredoxins. FDPB_MF provided a faithful, quantitative assessment of electrostatic interactions in biological macromolecules.
A simple method for protein structural classification
Since the concept of structural classes of proteins was proposed, the problem of protein classification was tackled by many groups [91]. Most of
the classification criteria were based only on the helix/strand contents of proteins. A method for protein structural classification based on the secondary-structure sequences was reported. It was a classification scheme that confirmed existing classifications. A mathematical model was constructed to describe protein secondary-structure sequences, in which each protein secondary-structure sequence corresponded to a transition probability matrix that characterized and differentiated protein structure numerically. Its application to a set of real data indicated that the method classified protein structures correctly. The final classification result was shown schematically, so that the structural classifications could be inspected visually, unlike with traditional methods.
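The model of [91] is not spelled out in the text. As a minimal sketch of the idea of mapping a secondary-structure sequence to a transition probability matrix, the Python code below estimates first-order transition frequencies between consecutive states. The three-letter alphabet (H = helix, E = strand, C = coil), the function name and the example string are assumptions of this sketch.

```python
from collections import Counter

STATES = "HEC"  # assumed alphabet: helix, strand, coil


def transition_matrix(ss, states=STATES):
    """Row-normalized counts of transitions between consecutive secondary-structure states."""
    counts = Counter(zip(ss, ss[1:]))
    matrix = []
    for a in states:
        row_total = sum(counts[(a, b)] for b in states)
        # A state that never occurs as a predecessor gets an all-zero row.
        matrix.append([counts[(a, b)] / row_total if row_total else 0.0
                       for b in states])
    return matrix


example = "CHHHHCCEEEC"  # hypothetical secondary-structure string
P = transition_matrix(example)
for state, row in zip(STATES, P):
    print(state, [round(p, 3) for p in row])
```

Two proteins can then be compared numerically through their matrices, for instance by a matrix norm of the difference; that choice is also an assumption and is not taken from the source.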
Use of amino-acid composition to predict ligand-binding sites
A novel method for predicting the binding sites for druglike compounds on the surface of proteins was developed on the basis of the specific amino-acid composition observed at the ligand-binding sites of ligand–protein complexes determined by X-ray analysis [92]. A profile, representing the preference of each of the 20 standard amino acids at the binding sites of druglike molecules, was obtained for a small set of high-quality complex structures. An index termed propensity for ligand binding (PLB) was created from the profiles. The PLB index was used to predict the propensity of binding for 804 ligands, at all potential binding sites on the proteins whose structures were determined by X-ray analysis. When the sites with the two highest PLB indices were taken into consideration, 86% of the sites were predicted successfully. The PLB prediction was relatively simple, but the validation study showed that it detected ligand-binding sites, especially the binding sites of druglike molecules, both quickly and accurately. The PLB index was used to predict the ligand-binding sites of uncharacterized protein structures and also to identify novel drug-binding sites of known drug targets. Many biochemical processes and phenomena involving amino acids and proteins are stereospecific; e.g., L- and D-enantiomers of amino acids have different tastes [93,94].
Computation results and discussion
The molecular CT indices Gk, Jk, GkV and JkV (with k < 6) are reported in Table 7 for 21 amino acids. Hydroxyproline (4-hydroxypyrrolidine-2-carboxylic acid, Hyp) differs from proline (Pro) by the presence of a hydroxyl (–OH) group attached to the Cγ atom. The Gk indices contain both CT and size effects, e.g., Gk(Pro) < Gk(Hyp). The size effect is eliminated in the
Table 7. Values of the Gk and Jk charge-transfer indices up to fifth order for 21 amino acids (AA).
AA    N    G1       G2        G3       G4       G5
Ala   6    2.5000   2.6667    0.5000   0.0000   0.0000
Arg   12   4.2500   6.2222    1.2500   0.4622   0.2639
Asn   9    4.0000   4.7778    0.9375   0.3689   0.1250
Asp   9    4.0000   4.7778    0.9375   0.3689   0.1250
Cys   7    2.5000   2.8889    0.6875   0.1111   0.0000
Gln   10   4.0000   5.0000    1.0000   0.3422   0.2708
Glu   10   4.0000   5.0000    1.0000   0.3422   0.2708
Gly   5    2.0000   2.3333    0.2500   0.0000   0.0000
His   11   3.7500   7.5000    1.2569   0.5211   0.3472
Hyp   9    3.0000   2.8889    0.8750   0.3422   0.0625
Ile   9    3.0000   3.3333    1.0000   0.3422   0.0625
Leu   9    3.5000   2.8889    0.9375   0.3511   0.1250
Lys   10   2.5000   2.8889    0.8125   0.3111   0.2014
Met   9    2.5000   2.8889    0.8125   0.3111   0.1458
Phe   12   3.2500   9.2222    1.2500   0.6844   0.4306
Pro   8    2.0000   2.6667    0.6528   0.2222   0.0000
Ser   7    2.5000   2.8889    0.6875   0.1111   0.0000
Thr   8    3.0000   3.1111    0.8750   0.2222   0.0000
Trp   15   4.2500   13.3889   2.0278   0.9989   0.6425
Tyr   13   4.5000   10.5556   1.6250   0.9867   0.4861
Val   8    3.0000   3.1111    0.8750   0.2222   0.0000

AA    J1       J2       J3       J4       J5
Ala   0.5000   0.5333   0.1000   0.0000   0.0000
Arg   0.3864   0.5657   0.1136   0.0420   0.0240
Asn   0.5000   0.5972   0.1172   0.0461   0.0156
Asp   0.5000   0.5972   0.1172   0.0461   0.0156
Cys   0.4167   0.4815   0.1146   0.0185   0.0000
Gln   0.4444   0.5556   0.1111   0.0380   0.0301
Glu   0.4444   0.5556   0.1111   0.0380   0.0301
Gly   0.5000   0.5833   0.0625   0.0000   0.0000
His   0.3750   0.7500   0.1257   0.0521   0.0347
Hyp   0.3750   0.3611   0.1094   0.0428   0.0078
Ile   0.3750   0.4167   0.1250   0.0428   0.0078
Leu   0.4375   0.3611   0.1172   0.0439   0.0156
Lys   0.2778   0.3210   0.0903   0.0346   0.0224
Met   0.3125   0.3611   0.1016   0.0389   0.0182
Phe   0.2955   0.8384   0.1136   0.0622   0.0391
Pro   0.2857   0.3810   0.0933   0.0317   0.0000
Ser   0.4167   0.4815   0.1146   0.0185   0.0000
Thr   0.4286   0.4444   0.1250   0.0317   0.0000
Trp   0.3036   0.9563   0.1448   0.0713   0.0459
Tyr   0.3750   0.8796   0.1354   0.0822   0.0405
Val   0.4286   0.4444   0.1250   0.0317   0.0000

AA    G1V      G2V       G3V      G4V      G5V
Ala   4.5000   3.3222    1.2333   0.0000   0.0000
Arg   7.1500   6.9306    1.9722   0.6922   0.3497
Asn   6.8000   6.0889    1.5458   0.5058   0.2130
Asp   7.9000   6.0889    1.6625   0.5408   0.1250
Cys   4.5000   3.3222    1.4181   0.3861   0.0000
Gln   6.8000   5.7611    1.8472   0.5147   0.2062
Glu   7.9000   6.3111    1.9694   0.5835   0.2942
Gly   4.5000   2.9361    0.4944   0.0000   0.0000
His   8.6500   7.6583    1.8625   0.4696   0.3174
Hyp   7.8000   4.3583    1.1556   0.4597   0.0625
Ile   5.0000   3.5444    1.6028   0.8810   0.2385
Leu   5.5000   3.3222    1.4236   0.6036   0.4770
Lys   5.1000   3.3750    1.4153   0.4836   0.2663
Met   4.5000   3.3222    1.4181   0.4949   0.3103
Phe   5.2500   9.6556    1.7361   0.5642   0.3519
Pro   5.6000   3.7028    1.1361   0.5222   0.0000
Ser   6.2000   3.4278    1.2875   0.1111   0.0000
Thr   6.2000   3.9778    1.4722   0.4972   0.0000
Trp   6.9500   13.1028   2.5111   0.8436   0.4239
Tyr   7.2000   10.5444   1.8611   0.7289   0.4399
Val   5.0000   3.3222    1.6028   0.7722   0.0000

AA    J1V      J2V      J3V      J4V      J5V
Ala   0.9000   0.6644   0.2467   0.0000   0.0000
Arg   0.6500   0.6301   0.1793   0.0629   0.0318
Asn   0.8500   0.7611   0.1932   0.0632   0.0266
Asp   0.9875   0.7611   0.2078   0.0676   0.0156
Cys   0.7500   0.5537   0.2363   0.0644   0.0000
Gln   0.7556   0.6401   0.2052   0.0572   0.0229
Glu   0.8778   0.7012   0.2188   0.0648   0.0327
Gly   1.1250   0.7340   0.1236   0.0000   0.0000
His   0.8650   0.7658   0.1863   0.0470   0.0317
Hyp   0.9750   0.5448   0.1444   0.0575   0.0078
Ile   0.6250   0.4431   0.2003   0.1101   0.0298
Leu   0.6875   0.4153   0.1780   0.0755   0.0596
Lys   0.5667   0.3750   0.1573   0.0537   0.0296
Met   0.5625   0.4153   0.1773   0.0619   0.0388
Phe   0.4773   0.8778   0.1578   0.0513   0.0320
Pro   0.8000   0.5290   0.1623   0.0746   0.0000
Ser   1.0333   0.5713   0.2146   0.0185   0.0000
Thr   0.8857   0.5683   0.2103   0.0710   0.0000
Trp   0.4964   0.9359   0.1794   0.0603   0.0303
Tyr   0.6000   0.8787   0.1551   0.0607   0.0367
Val   0.7143   0.4746   0.2290   0.1103   0.0000
Jk, e.g., J2(Pro) > J2(Hyp). The effect of heteroatoms is included in both GkV and JkV, e.g., G4V(Pro) > G4V(Hyp). The calculated and experimental isoelectric points pI for the 21 amino acids are listed in Table 8. For the {Gk,Jk} database the best linear model turns out to be:

$$\mathrm{p}I = 12.0 - 12.7J_1 - 22.5J_4 \qquad (4)$$
n = 21, r = 0.478, s = 1.598, F = 2.7, MAPE = 17.73%, AEV = 0.7718

where the mean absolute percentage error (MAPE) is 17.73% and the approximation error variance (AEV) is 0.7718. The inclusion of N improves the correlation:

$$\mathrm{p}I = 7.13 + 0.751N - 7.99J_1 - 15.7J_3 - 81.7J_4 \qquad (5)$$
n = 21, r = 0.629, s = 1.499, F = 2.6, MAPE = 16.95%, AEV = 0.6065

and AEV decreases by 21%. However, the model is limited to small N because N increases with both nA and nB, which makes it inadequate for polypeptides and proteins. The database {Gk,Jk,GkV,JkV} improves the model:

$$\mathrm{p}I = 16.9 + 0.421G_2 - 22.2J_4 - 2.62J_1^{V} - 9.53J_2^{V} - 13.9J_3^{V} - 23.1J_4^{V} \qquad (6)$$
n = 21, r = 0.699, s = 1.475, F = 2.2, MAPE = 14.73%, AEV = 0.5465

and AEV decreases by 29%. The inclusion of N improves the correlation:

$$\mathrm{p}I = -11.1 + 3.35N + 5.56G_3 - 18.2G_5 + 3.39J_2 - 66.6J_4 - 3.31G_2^{V} + 0.955G_4^{V} - 8.12J_1^{V} + 24.7J_2^{V} - 22.3J_3^{V} - 50.7J_4^{V} - 14.0J_5^{V} \qquad (7)$$
n = 21, r = 0.958, s = 0.781, F = 7.5, MAPE = 8.25%, AEV = 0.1754

and AEV decreases by 77%. However, the model is inadequate for proteins because N, G3, G5, G2V and G4V increase with nA and nB. The use of (1+Δn/nT) = 0.5 for Arg, 4/3 for Asp and Glu, 2/3 for His and Lys, and one for all others improves the fit:

$$\mathrm{p}I = 14.8 - 9.01\left(1 + \frac{\Delta n}{n_T}\right) \qquad (8)$$
n = 21, r = 0.965, s = 0.462, F = 259.8, MAPE = 5.29%, AEV = 0.0682
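Equation (8) is simple enough to evaluate directly. The Python sketch below applies it to a subset of the amino acids, using the (1+Δn/nT) values quoted above and the experimental pI values of Table 8, and computes a mean absolute percentage error for that subset only. The function names are hypothetical, and because the published coefficients are rounded and only seven amino acids are included, the result is not expected to match the MAPE reported for the full fit.

```python
def pi_eq8(x):
    """Equation (8): pI = 14.8 - 9.01 x, with x = 1 + dn/nT."""
    return 14.8 - 9.01 * x


def mape(predicted, observed):
    """Mean absolute percentage error over the keys of `observed`."""
    errors = [abs(predicted[k] - observed[k]) / observed[k] for k in observed]
    return 100.0 * sum(errors) / len(errors)


# (1 + dn/nT) values quoted in the text; every other amino acid takes 1.
X = {"Arg": 0.5, "Asp": 4 / 3, "Glu": 4 / 3, "His": 2 / 3, "Lys": 2 / 3}

# Experimental pI values for a subset of the amino acids (Table 8).
experiment = {"Ala": 6.00, "Arg": 10.76, "Asp": 2.77, "Glu": 3.22,
              "Gly": 5.97, "His": 7.59, "Lys": 9.74}

predicted = {aa: pi_eq8(X.get(aa, 1.0)) for aa in experiment}
error = mape(predicted, experiment)
for aa in experiment:
    print(aa, round(predicted[aa], 2), experiment[aa])
```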
Table 8. Calculated and experimental values of pH at isoelectric point pI for 21 amino acids (AA).
AA    pI (Eq. 8)   pI (Eq. 12)   Experiment
Ala   5.80         5.76          6.00
Arg   10.31        10.33         10.76
Asn   5.80         5.66          5.41
Asp   2.79         2.65          2.77
Cys   5.80         5.79          5.07
Gln   5.80         5.85          5.65
Glu   2.79         2.86          3.22
Gly   5.80         5.90          5.97
His   8.80         8.38          7.59
Hyp   5.80         5.81          5.80
Ile   5.80         5.96          6.02
Leu   5.80         5.99          5.98
Lys   8.80         9.28          9.74
Met   5.80         5.98          5.74
Phe   5.80         5.86          5.48
Pro   5.80         6.07          6.30
Ser   5.80         5.56          5.68
Thr   5.80         5.65          5.60
Trp   5.80         5.57          5.89
Tyr   5.80         5.58          5.66
Val   5.80         5.81          5.96
and AEV decreases by 91%. The correlation coefficient is 96.8% of that obtained for the correlation of the means (n = 4, r = 0.997). The isoelectric points pI calculated with Equation (8) for the 21 amino acids are also included in Table 8. For Equation (8) the absolute relative error is ca. 5%. The calculated (Equation 8) and experimental isoelectric points for the 21 amino acids are shown in Figure 5. For Equation (8) the two amino acids farthest from the experimental value are His and Lys, with an absolute error of ca. 1.1 units. The variation of the isoelectric point pI as a function of (1+Δn/nT) for the 21 amino acids (cf. Figure 6) shows that some amino acids appear superposed.
Figure 5. Isoelectric points pI for the 21 amino acids with Equations (8/12): fitted pI vs. experimental pI. Line represents bisector.
Figure 6. Variation of pI vs. (1+Δn/nT) for 21 amino acids, lysozyme and its fragments. The fitting line corresponds to the amino acids.
The fitting line corresponds to the 21 amino acids; the two amino acids farthest from the line are His and Lys (nB = 2). The inclusion of {Gk,Jk} improves the fit:

$$\mathrm{p}I = 15.3 - 8.99\left(1 + \frac{\Delta n}{n_T}\right) - 1.00J_2 \qquad (9)$$
n = 21, r = 0.971, s = 0.435, F = 147.9, MAPE = 5.10%, AEV = 0.0450

and AEV decreases by 93%. The inclusion of {JkV} improves the fit:

$$\mathrm{p}I = 16.8 - 8.59\left(1 + \frac{\Delta n}{n_T}\right) - 0.958J_2 - 8.98J_3 - 1.16J_1^{V} \qquad (10)$$
n = 21, r = 0.977, s = 0.407, F = 85.7, MAPE = 4.70%, AEV = 0.0450

and AEV decreases by 94%; the model allows studying polypeptides, proteins and protein fragments. The inclusion of {GkV} improves the fit:

$$\mathrm{p}I = 16.0 - 8.94\left(1 + \frac{\Delta n}{n_T}\right) - 0.828J_2 - 9.77J_3 + 0.619G_4^{V} \qquad (11)$$
n = 21, r = 0.981, s = 0.378, F = 100.1, MAPE = 4.12%, AEV = 0.0425

and AEV decreases by 94%. However, the model is inadequate for proteins because G4V increases with nA and nB. No superposition of the corresponding Gk–Jk or GkV–JkV pairs is observed in Equations (4–6,8–11), which decreases the risk of collinearity in the fits, given the close relationship between each pair Gk–Jk in Equation (2) [95,96]. The simultaneous inclusion of {GkV,JkV} improves the fit:

$$\mathrm{p}I = 16.6 - 8.71\left(1 + \frac{\Delta n}{n_T}\right) - 0.787J_2 - 9.52J_3 + 0.485G_4^{V} - 0.801J_1^{V} - 2.73J_4^{V} \qquad (12)$$
n = 21, r = 0.989, s = 0.306, F = 103.9, MAPE = 4.20%, AEV = 0.0380

and AEV decreases by 95%. However, the model is inadequate for proteins because G4V increases with nA and nB. The isoelectric points pI calculated with Equation (12) for the 21 amino acids are also included in Table 8. For Equation (12) the absolute relative error decreases to 4%. The calculated (Equation 12) and experimental isoelectric points for the 21 amino acids are displayed in Figure 5. For Equation (12) the error is reduced for most amino acids; in particular, for His and Lys the error decreases to 0.6 units. The molecular CT indices are collected in Table 9 for lysozyme, five fragments of its tertiary structure and its binding site. In general, the CT indices
Table 9. Values of Gk and Jk charge-transfer indices up to fifth order for lysozyme and its fragments.
Fragment         N      G1         G2         G3         G4        G5
α-Helix A        87     28.5000    39.3889    10.9097    6.4083    4.5069
α-Helix B        83     26.2500    45.5000    11.5417    6.5189    4.7258
3.0₁₀-Helix C    39     13.2500    14.2222    5.2500     3.2222    2.1181
α-Helix D        61     20.2500    24.6667    8.3750     5.1022    3.6111
β-Sheet E        104    39.0000    55.5556    13.8750    8.4000    6.4167
Binding site S   105    30.2500    60.1111    13.5833    5.0556    2.5906
Lysozyme         1001   349.2500   541.7222   137.0833   85.9639   64.8828

Fragment         J1       J2       J3       J4       J5
α-Helix A        0.3314   0.4580   0.1269   0.0745   0.0524
α-Helix B        0.3201   0.5549   0.1408   0.0795   0.0576
3.0₁₀-Helix C    0.3487   0.3743   0.1382   0.0848   0.0557
α-Helix D        0.3375   0.4111   0.1396   0.0850   0.0602
β-Sheet E        0.3786   0.5394   0.1347   0.0816   0.0623
Binding site S   0.2909   0.5780   0.1306   0.0486   0.0249
Lysozyme         0.3493   0.5417   0.1371   0.0860   0.0649

Fragment         G1V        G2V        G3V        G4V        G5V
α-Helix A        56.5000    50.9778    19.8069    11.3040    7.0642
α-Helix B        51.3500    56.1472    19.0750    10.6837    6.8304
3.0₁₀-Helix C    27.9500    19.9000    9.2583     5.2535     3.2994
α-Helix D        41.5500    34.1083    14.6944    8.9512     5.1353
β-Sheet E        79.6000    74.1861    22.9611    13.2918    8.6063
Binding site S   59.5500    70.9806    19.1278    7.0843     3.1818
Lysozyme         683.4500   691.8222   230.6111   139.8265   91.6138

Fragment         J1V      J2V      J3V      J4V      J5V
α-Helix A        0.6570   0.5928   0.2303   0.1314   0.0821
α-Helix B        0.6262   0.6847   0.2326   0.1303   0.0833
3.0₁₀-Helix C    0.7355   0.5237   0.2436   0.1382   0.0868
α-Helix D        0.6925   0.5685   0.2449   0.1492   0.0856
β-Sheet E        0.7728   0.7203   0.2229   0.1290   0.0836
Binding site S   0.5726   0.6825   0.1839   0.0681   0.0306
Lysozyme         0.6835   0.6918   0.2306   0.1398   0.0916
QSPR studies on amino acids: Application to proteins
Table 10. Values of the pH at the isoelectric point, pI, for lysozyme and its fragments not included in the fit.

| Fragment | Residues | pI | Experiment |
|---|---|---|---|
| α-Helix A | 5–15 | 12.95 | – |
| α-Helix B | 24–34 | 10.89 | – |
| 3₁₀-Helix C | 80–85 | 10.31 | – |
| α-Helix D | 88–96 | 11.02 | – |
| Total α-helix | 5–15,24–34,88–96 | 11.62 | – |
| Total helix | 5–15,24–34,80–85,88–96 | 11.29 | – |
| β-Sheet E | 41–54 | 10.87 | – |
| Total helix+sheet | 5–15,24–34,41–54,80–85,88–96 | 11.21 | – |
| Binding site S | 34,35,37,44,57,59,62,63,101,107,114 | 11.00 | – |
| Total helix+sheet+BS | 5–15,24–35,37,41–54,57,59,62,63,80–85,88–96,101,107,114 | 11.17 | – |
| Lysozyme | 1–129 | 11.49 | 11.35 |
do not distinguish α-helices, 3₁₀-helix, β-sheet and binding site. In particular, both the Jk and JkV indices for the whole molecule are similar to those for the α-helices and, especially, for α-helix D. The pI isoelectric points for lysozyme and its fragments not included in the fit are calculated by a modification of Equation (8):
$$\mathrm{pI} = 14.8 - 9.01\,\frac{M + \Delta n}{n_T} \qquad (13)$$
where M is the number of amino-acid residues in the protein or fragment. The choice seems sensible, as pI values are strongly dependent on the type of side-chain functional groups. The pI isoelectric points (calculated and experimental) for lysozyme and its fragments not included in the fit are reported in Table 10. The calculation result for α-helix A (M = 11 residues) is an estimate for that of the whole lysozyme (M = 129 residues) with a relative error of 13%. Furthermore, the inclusion of the other two α-helices (A+B+D, M = 31 residues) reduces the error to 1%. The variation of the pI isoelectric point for lysozyme (experiment) and its fragments (calculation) as a function of (M+Δn)/nT (Figure 6) shows that some fragments appear superposed. Both lysozyme and its fragments lie on the fitting line obtained for the amino acids.
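The two relative errors quoted above can be reproduced from the Table 10 entries. The following minimal sketch assumes that the errors are taken relative to the pI calculated for the whole protein (11.49 in Table 10); the function name is illustrative only.

```python
# Calculated pI values copied from Table 10 (Equation 13).
PI_CALC = {
    "alpha-helix A": 12.95,
    "total alpha-helix (A+B+D)": 11.62,
    "lysozyme": 11.49,
}


def relative_error_percent(estimate, reference):
    """Absolute relative error of `estimate` with respect to `reference`, in percent."""
    return abs(estimate - reference) / abs(reference) * 100.0


# Assumption of this sketch: the reference is the calculated whole-protein value.
reference = PI_CALC["lysozyme"]
for name in ("alpha-helix A", "total alpha-helix (A+B+D)"):
    err = relative_error_percent(PI_CALC[name], reference)
    print(f"{name}: {err:.1f}%")
```

Rounded to whole percentages, the two results agree with the 13% and 1% figures given in the text.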
Perspectives
From the present results and discussion the following perspectives can be drawn.
1. The detailed comparison of the sequences (primary structures) of the enzyme lysozyme has allowed the reconstruction of a molecular phylogenetic tree for birds. Single- and complex-linkage perform a binary taxonomy of the parameters that separates the birds by the following scheme with successive branchings: (1,…,5) → (1,4,5)(2,3) → (1,5)(2,3)(4) → (1)(2,3)(4)(5) → (1)(2)(3)(4)(5). Genetic analyses have been applied in judicial investigations [97,98].
2. The inclusion of heteroatoms in the π-electron system was beneficial for the description of the isoelectric point, owing to either the role of the additional p orbitals provided by the heteroatom or the role of steric factors in the π-electron conjugation.
3. The use of only charge-transfer and valence charge-transfer indices {Gk,Jk,GkV,JkV} gave limited results for modelling the isoelectric point of amino acids. However, the inclusion of (1+Δn/nT) improved all the models. The effect is especially noticeable for those amino acids with more than two functional groups, viz. Arg, Asp, Glu and, especially, His and Lys. Moreover, the fractional index casts some light on the importance of the side-chain functional groups in the pI simulations of functional-rich molecules. The satisfactory modelling of the pI of 21 amino acids with the aid of a fractional index, based mainly on the Δn index, shows how to bypass the problem of deriving and working with an extended set of charge-transfer indices (here, m = 20) because, in this case, a good description can be obtained with only one index.
4. The fitting line obtained for the 21 amino acids can be used to estimate the isoelectric point of lysozyme and its fragments, simply by replacing (1+Δn/nT) with (M+Δn)/nT.
5. For lysozyme, the results for smaller fragments can estimate that of the whole protein with 1–13% errors. An extension of the present study to other enzymes and proteins would give an insight into a possible generality of these conclusions, because most globular, water-soluble proteins are ionic, e.g., lysozyme (charge +8.0e) and bovine serum albumin (anionic) at pH 7.0. The present study may also be of interest in charge-migration peptide studies.
6. Work is in progress on the further elucidation of the value of Δn in the fractional indices for a better definition of indices, which are highly dependent on side-chain functional groups.
Acknowledgements The authors acknowledge financial support from the Spanish MEC (Project Nos. CTQ2004-07768-C02-01/BQU and CCT005-07-00365) and EU (Program FEDER).
References
1. Pogliani, L. 1992, J. Pharm. Sci., 81, 334. 2. Pogliani, L. 1992, J. Pharm. Sci., 81, 967. 3. Pogliani, L. 1996, J. Phys. Chem., 100, 18065. 4. Pogliani, L. 1997, Med. Chem. Res., 7, 380. 5. Pogliani, L. 1999, J. Chem. Inf. Comput. Sci., 39, 104. 6. Pogliani, L. 2000, Chem. Rev., 100, 3827. 7. Stadtman, E. R. 1993, Annu. Rev. Biochem., 62, 797. 8. Berlett, B. S., and Stadtman, E. R. 1997, J. Biol. Chem., 272, 20313. 9. Chu, I. K., Rodriquez, C. F., Lau, T.-C., Hopkinson, A. C., and Siu, K. W. M. 2000, J. Phys. Chem. B, 104, 3393. 10. Simon, S., Gil, A., Sodupe, M., and Bertrán, J. 2005, J. Mol. Struct. (Theochem), 727, 191. 11. Gil, A., Bertran, J., and Sodupe, M. 2006, J. Chem. Phys., 124, 154306. 12. Torrens, F. 2003, Comb. Chem. High Throughput Screen., 6, 801. 13. Torrens, F. 2001, J. Comput.-Aided Mol. Design, 15, 709. 14. Torrens, F. 2003, J. Mol. Struct. (Theochem), 621, 37. 15. Torrens, F. 2004, Mol. Diversity, 8, 365. 16. Torrens, F. 2005, Molecules, 10, 334. 17. Torrens, F. 2003, Molecules, 8, 169. 18. Torrens, F. 2004, Molecules, 9, 1222. 19. Torrens, F., and Castellano, G. 2007, Synthetic Organic Chemistry XI, J. A. Seijas and M. P. Vázquez-Tato (Eds.), MDPI, Basel (Switzerland), 1. 20. Randić, M. 1975, J. Am. Chem. Soc., 97, 6609. 21. Hosoya, H. 1971, Bull. Chem. Soc. Jpn., 44, 2332. 22. Gálvez, J., García, R., Salabert, M. T., and Soler, R. 1984, J. Chem. Inf. Comput. Sci., 34, 520. 23. Kier, L. B., and Hall, L. H. 1976, J. Pharm. Sci., 65, 1806. 24. Torrens, F., and Castellano, G. 2007, Biomedical Data and Applications, A. S. Sidhu, T. S. Dillon, and E. Chang (Eds.), Springer, Berlin, in press. 25. Bergers, J. J., Vingerhoeds, M. H., van Bloois, L., Herron, J. N., Janssen, L. H. M., Fischer, M. J. E., and Crommelin, D. J. A. 1993, Biochemistry, 32, 4641. 26. Torrens, F., Ortí, E., and Sánchez-Marín, J. 1991, High Performance Computing II, M. Durand, and F. El Dabaghy (Eds.), Elsevier, Amsterdam, 549. 27. Torrens, F., Ortí, E., and Sánchez-Marín, J. 1991, J. Chim. Phys. Phys.-Chim. Biol., 88, 2435.
28. Torrens, F., Ortí, E., and Sánchez-Marín, J. 1991, Advances in Biomolecular Simulations, R. Lavery, J.-L. Rivail, and J. Smith (Eds.), AIP Conference Proceedings Series No. 239, American Institute of Physics, New York, 1991, 118. 29. Torrens, F., Sánchez-Marín, J., and Nebot-Gil, I. 1998, J. Mol. Graphics. Model., 16, 57. 30. Torrens, F. 2001, Int. J. Mol. Sci., 2, 72. 31. Torrens, F., Sánchez-Marín, J., and Nebot-Gil, I. 2001, J. Comput. Chem., 22, 477. 32. Pascual-Ahuir, J. L., Silla, E., Tomasi, J., and Bonaccorsi, R. 1987, J. Comput. Chem., 8, 778. 33. Pascual-Ahuir, J. L., and Silla, E. 1990, J. Comput. Chem., 11, 1047. 34. Silla, E., Villar, F., Nilsson, O., Pascual-Ahuir, J. L., and Tapia, O. 1990, J. Mol. Graphics, 8, 168. 35. Silla, E., Tuñón, I., and Pascual-Ahuir, J. L. 1991, J. Comput. Chem., 12, 1077. 36. Tuñón, I., Silla, E., and Pascual-Ahuir, J. L. 1992, Protein Eng., 5, 715. 37. Tuñón, I., Silla, E., and Pascual-Ahuir, J. L. 1993, Chem. Phys. Lett., 203, 289. 38. Pascual-Ahuir, J. L., Silla, E., and Tuñón, I. 1994, J. Comput. Chem., 15, 1127. 39. Tuñón, I., Silla, E., and Pascual-Ahuir, J. L. 1994, J. Phys. Chem., 98, 377. 40. Scharlin, P., Battino, R., Silla, E., Tuñón, I., and Pascual-Ahuir, J. L. 1998, Pure Appl. Chem., 70, 1895. 41. Pascual-Ahuir, J. L., Silla, E., and Tuñón, I. 1998, J. Mol. Struct. (Theochem), 426, 331. 42. Terryn, B., and Barriol, J. 1981, J. Chim. Phys. Phys.-Chim. Biol., 78, 207. 43. Isogai, Y., and Itoh, T. 1984, J. Phys. Soc. Jpn., 53, 2162. 44. Torrens, F. 2000, Encuentros en la Biología, 8(64), 4. 45. Torrens, F., Sánchez-Marín, J., and Nebot-Gil, I. 1998, Information Technology Applications in Biomedicine, S. Laxminarayan (Ed.), IEEE, Washington, 1. 46. Torrens, F. 2000, Zh. Fiz. Khim., 74, 125. 47. Torrens, F. 2000, Russ. J. Phys. Chem. (Engl. Transl.), 74, 115. 48. Torrens, F. 2001, Complexity Int., 8, torren01. 49. Torrens, F. 2001, Synthetic Organic Chemistry V, O. Kappe, P. Merino, A. Marzinzik, H. 
Wennemers, T. Wirth, J.-J. vanden Eynde and S.-K. Lin (Eds.), MDPI, Basel, 1. 50. Torrens, F. 2002, Molecules, 7, 26. 51. Canfield, R. E. 1963, J. Biol. Chem., 238, 2698. 52. Jollès, J., Hermann, J., Niemann, B., and Jollès, P. 1967, Eur. J. Biochem., 1, 344. 53. Blake, C. C. F., Mair, G. A., North, A. C. T., Phillips, D. C., and Sarma, V. R. 1967, Proc. R. Soc. London, Ser. B, 167, 365. 54. Hermann, J., and Jollès, J. 1970, Biochim. Biophys. Acta, 200, 178. 55. Kaneda, M., Kato, T., Tominaga, N., Chitani, K., Narita, K. 1969, J. Biochem. (Tokyo), 66, 747. 56. LaRue, J. N., and Speck, J. C., Jr. 1969, Fed. Proc., 28, 662. 57. Torrens, F. 2000, Encuentros en la Biología, 8(60), 3. 58. Jones, P. S. 1998, Antivir. Chem. Chemother., 9, 283. 59. Ellis, R. W. 1999, Vaccine, 17, 1596. 60. Root, M. J., Kay, M. S., and Kim, P. S. 2001, Science, 291, 884.
61. Bernstein, J. M. 2000, Antiviral Chemotherapy: General Overview, Wright State University School of Medicine, Dayton (OH). 62. Crosby, A. W. 2003, America’s Forgotten Pandemic: The Influenza of 1918, Cambridge University, Cambridge. 63. Reid, A. H., and Taubenberger, J. K. 2003, J. Gen. Virol., 84, 2285. 64. Kash, J. C., Basler, C. F., García-Sastre, A., Cartera, V., Billharz, R., Swayne, D. E., Przygodzki, R. M., Taubenberger, J. K., Katze, J. G., and Tumpey, T. M. 2004, J. Virol. 78, 9499. 65. Tumpey, T. M., Basler, C. F., Aguilar, P. V., Zeng, H., Solórzano, A., Swayne, D. E., Cox, N. J., Katz, J. M., Taubenberger, J. K., Palese, P., and García-Sastre, A. 2005, Science, 310, 77. 66. Taubenberger, J. K., Reid, A. H., Lourens, R. M., Wang, R., Jin, G., and Fanning, T. G. 2005, Nature (London), 437, 889. 67. Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Jr., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T., and Tasumi, M. 1977, J. Mol. Biol., 112, 535. 68. Torrens, F., Sánchez-Marín, J., and Sánchez-Pérez, E. 1989, Actes del II Sympòsium sobre l'Ensenyament de les Ciències Naturals, S. Riera (Ed.), Documents No. 11, Eumo, Vic, 595. 69. Torrens, F., Sánchez-Marín, J., and Sánchez-Pérez, E. 1989, Actes del II Sympòsium sobre l'Ensenyament de les Ciències Naturals, S. Riera (Ed.), Documents No. 11, Eumo, Vic, 669. 70. Torrens, F., Sánchez-Pérez, E., and Sánchez-Marín, J. 1989, Enseñanza de las Ciencias, Extra-III Congreso(1), 267. 71. Torrens, F., Ortí, E., and Sánchez-Marín, J. 1991, Colloquy University Pedagogy, Horsori, Barcelone, 375. 72. Sayle, R. A., and Milner-White, E. J. 1995, Trends Biochem, Sci, 20, 374. 73. Shindyalov, I. N., and Bourne, P. E. 1995, J. Appl. Crystallogr., 28, 847. 74. Shindyalov, I. N., and Bourne, P. E. 1997, CABIOS, 13, 487. 75. Walters, P., and Stahl, M. 1996, Program BABEL, University of Arizona, Tucson (AZ). 76. Hendlich, M. 1998, Acta Crystallogr., Sect. D, 54, 1178. 77. Tsai, C. S. 2001, J. Chem. 
Educ., 78, 837. 78. MDL 2007, Program Chime, MDL Information Systems, San Leandro (CA). 79. Claros, M. G., Fernández-Fernández, J. M., González-Mañas, J. M., Herráez, Á., Sanz, J. M., and Urdiales, J.L. 2001, BioROM 1.0 and 1.1, Sociedad Española de Bioquímica y Biología Molecular, Malaga. 80. Claros, M. G., Fernández-Fernández, J. M., García-Vallvé, S., González-Mañas, J. M., Herráez, Á., Oliver, J., Pons, G., Pujadas, G., Roca, P., Rodríguez, S., Sanz, J. M., Segués, T., Urdiales, J. L. 2002, BioROM 2002, Sociedad Española de Bioquímica y Biología Molecular, Malaga. 81. Claros, M. G., Alonso, T., Corzo, J., Fernández-Fernández, J. M., García-Vallvé, S., González-Mañas, J. M., Herráez, Á., Oliver, J., Pons, G., Roca, P., Sanz, J. M., Segués, T., Urdiales, J. L., Valle, A., and Villalaín, J. 2003, BioROM 2003, Sociedad Española de Bioquímica y Biología Molecular–Roche Diagnostics, Malaga.
82. Claros, M. G., Alfama, R., Alonso, T., Amthauer, R., Castro, E., Corzo, J., Fernández-Fernández, J. M., Figueroa, M., García-Vallvé, S., González-Mañas, J. M., Herráez, Á., Herrera, R., Moya, A., Oliver, J., Pons, G., Roca, P., Sanz, J. M., Segués, T., Tejedor, M. C., Urdiales, J. L., and Villalaín, J. 2004, BioROM 2005, Sociedad Española de Bioquímica y Biología Molecular–Universidad Miguel Hernández–Universidad del País Vasco, Malaga. 83. Claros, M. G., Alfama, R., Alonso, T., Amthauer, R., Castro, E., Corzo, J., Fernández-Fernández, J. M., Figueroa, M., García-Mondéjar, L., García-Vallvé, S., Garrido, M. B., González-Mañas, J. M., Herráez, Á., Herrera, R., Miró, M. J., Moya, A., Oliver, J., Palacios, E., Pons, G., Roca, P., Sanz, J. M., Segués, T., Tejedor, M. C., Urdiales, J. L., and Villalaín, J. 2005, BioROM 2006, Sociedad Española de Bioquímica y Biología Molecular–Pearson Educación, Malaga. 84. Herráez, Á. 2006, Biochem. Mol. Biol. Educ. 34, 255. 85. Claros, M. G., Alfama, R., Alonso, T., Amthauer, R., Carrero, I., Castro, E., Corzo, J., Fernández-Fernández, J. M., Figueroa, M., García-Vallvé, S., Garrido, M. B., González-Mañas, J. M., Herráez, Á., Herrera, R., Miró, M. J., Moya, A., Oliver, J., Palacios, E., Pons, G., Roca, P., Salgado, J., Sancho, P., Sanz, J. M., Segués, T., Tejedor, M. C., Urdiales, J. L., and Villalaín, J. 2006, BioROM 2007, Sociedad Española de Bioquímica y Biología Molecular–Pearson Educación, Malaga. 86. Miró, M. J., Méndez, M. T., Raposo, R., Herráez, Á., Barrero, B., and Palacios, E. 2007, III Jornada Campus Virtual UCM: Innovación en el Campus Virtual, Metodologías y Herramientas, Complutense, Madrid, 304. 87. Pacios, L. F. 2001, J. Chem. Inf. Comput. Sci., 41, 1427. 88. Pal, A., and Kumar, S. 2005, Indian J. Chem., Sect. A, 44, 1589. 89. Nielsen, J. E. 2007, J. Mol. Graphics Model., 25, 691. 90. Barth, P., Alber, T., and Harbury, P. B. 2007, Proc. Natl. Acad. Sci. USA, 104, 4898. 91. Liu, N., and Wang, T. 2007, J. Mol. 
Graphics Model., 25, 852. 92. Soga, S., Shirai, H., Kobori, M., and Hirayama, N. 2007, J. Chem. Inf. Model., 47, 400. 93. Solms, J., Vuataz, L., and Egli, R. H. 1965, Experientia, 21, 692. 94. Schiffman, S. S., Clark, T. B., III, and Gagnon, J. 1982, Physiol. Behav., 28, 457. 95. Box, G. E. P., Hunter, W. G., MacGregor, J. F., and Erjavec, J. 1973, Technometrics, 15, 33. 96. Hocking, R. R. 1976, Biometrics, 32, 1. 97. Ou, C.-Y., Ciesielski, C. A., Myers, G., Bandea, C. I., Luo, C.-C., Korber, B. T. M., Mullins, J. I., Schochetman, G., Berkelman, R. L., Economou, A. N., Witte, J. J., Furman, L. J., Satten, G. A., MacInnes, K. A., Curran, J. W., and Jaffe, H. W. 1992, Science, 256, 1165. 98. De Oliveira, Tm, Pybus, O. G., Rambaut, A., Salemi, M., Cassol, S., Ciccozzi, M., Rezza, G., Gattinara, G. C., d’Arrigo, R., Amicosante, M., Perrin, L., Colizzi, V., Perno, C. F., and Benghazi Study Group 2006, Nature (London) 444, 836.
Research Signpost 37/661 (2), Fort P.O. Trivandrum-695 023 Kerala, India
QSPR-QSAR Studies on Desired Properties for Drug Design, 2010: 63-94 ISBN: 978-81-308-0404-0 Editor: Eduardo A. Castro
3. Molecular topology in QSAR and drug design studies J. Gálvez and R. García-Domenech Unidad de Diseño de Fármacos y Conectividad Molecular, Dept. Química Física, Facultad de Farmacia, Universitat de Valencia, Avd. V.A. Estellés, s/n 46100 Burjassot-Valencia, Spain
Abstract. The use of graph-theoretical approaches to describe the chemical structure of organic compounds has gained increasing relevance in recent years. From the early days, when such formalism was used to predict simple properties of simple molecules, such as the boiling points of alkanes, to the design of, for instance, novel lead anticancer drugs, significant progress has been achieved and a long path has been covered. The aim of this review is to depict some of the milestones along that path and to forecast the challenges and expectations for the future of the field.
Correspondence/Reprint request: Dr. J. Gálvez, Unidad de Diseño de Fármacos y Conectividad Molecular, Dept. Química Física, Facultad de Farmacia, Universitat de Valencia, Avd. V.A. Estellés, s/n 46100 Burjassot-Valencia, Spain. E-mail: jorge.galvez@uv.es and ramongar@uv.es
1. Introduction Since Corwin Hansch [1] introduced, at the beginning of the 1960s, his famous equation relating some experimental properties of chemical compounds to certain electronic and steric parameters of molecules, thereby introducing quantitative structure-activity relationships (QSAR), the development of such methods has been dizzying. The introduction
J. Gálvez & R. García-Domenech
of more and more capable computers made that progress possible, up to the point that nowadays the so-called in silico approaches stand as an essential tool. The initial QSAR technique assumed that there was a relationship between the properties of a molecule and its structure, this relationship being profiled by a set of physicochemical parameters. QSAR based on this original concept is viewed as the classic approach, as shown in the works of Hansch and Fujita [1]. A variety of modern QSAR methods are founded on the classic QSAR. These methods have become powerful tools to predict chemical properties or possible biological functions of unknown compounds, thus accelerating the process of designing novel compounds and their optimization. Other approaches, such as Free-Wilson's [2], built on the same concept and also yielded good results. The Free-Wilson analysis relies on the additive nature of the properties of molecular fragments, which does not always hold. However, in those cases in which it is applicable it yields very good results. Another significant advance in the field of QSAR was the introduction by Cramer in 1988 of three-dimensional molecular parameters, which led to what was later developed as 3D-QSAR [3]. This initiated a new era at the level of both conceptual evolution and practical applications. Taking into account the effects of different conformers, stereoisomers or enantiomers of chemical compounds, 3D-QSAR models permitted the comparison of molecular structures, thereby setting up a representative structural group known as the pharmacophore [4]. The first, and currently one of the most widely used, models employing 3D-QSAR to draw pharmacophores was the Comparative Molecular Field Analysis (CoMFA) method proposed by Cramer et al. [3].
Other 3D-QSAR approaches, such as Comparative Molecular Similarity Indices Analysis (CoMSIA) [5] or Self Organizing Molecular Field Analysis (SomFA) [6], have also been developed, some of which incorporate comparisons of different sets of molecular descriptors. Along with the rapid development of computational science, a pool of new techniques based on formalisms such as molecular mechanics, molecular dynamics, docking, scoring and pharmacophore analysis is now widely used in the area of drug discovery. These computational techniques have been shown to assist in the design of novel, more potent inhibitors because they can visualize the mechanism of ligand-receptor interactions. Molecular topology (MT), a discipline usually considered within the QSAR methods, has proved to be an excellent tool for the quick and accurate prediction of many physicochemical and biological properties [7-9]. One of the most interesting advantages of MT is the straightforward calculation of the molecular descriptors to work with. Within this mathematical formalism a molecule is assimilated to a graph, where each vertex represents
Molecular topology and drug design
one atom and each edge one bond. Starting from the interconnections between the vertices, an adjacency (topological) matrix can be built, whose elements ij take the value one or zero depending on whether vertex i is connected or not connected to vertex j, respectively (see Figure 1). The manipulation of this matrix gives rise to a set of topological indices, or topological descriptors, which characterize each graph and allow the development of QSPR [10–12] and QSAR [13–18] analyses. The singularity of molecular topology can be summarized in the following items: a) It is a completely mathematical framework in which molecular structure is profiled. b) It is a very efficient approach for drug discovery, either by screening large databases of compounds or by designing novel compounds following the inverse process (properties→structure). Furthermore, it is easily computerizable. Item b) is a direct consequence of item a). The use of molecular topology in the search for QSAR has grown exponentially over recent years. Altogether, the topological scope covers about 20% of all papers on QSAR: a recent search carried out with the SciFinder Scholar database disclosed that about 3000 papers out of 15000 dealing with QSAR were devoted to topological descriptors (see Figure 2). In the current work, we focus on the contributions of molecular topology to QSAR studies obtained by our research group in recent years and on their application to drug design.
Figure 1. The chemical graph and adjacency matrix of the isopentane.
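As an illustration of how the adjacency matrix is built from the molecular graph, the following minimal sketch constructs it for the hydrogen-depleted graph of isopentane. The vertex numbering is an assumption made here and need not coincide with the one used in Figure 1.

```python
# Hydrogen-depleted graph of isopentane (2-methylbutane).
# Numbering assumed for this sketch: C1-C2(-C5)-C3-C4,
# i.e. vertex 2 carries the methyl branch (vertex 5).
BONDS = [(1, 2), (2, 3), (3, 4), (2, 5)]
N_ATOMS = 5


def adjacency_matrix(n_atoms, bonds):
    """Return the symmetric 0/1 adjacency matrix as a list of lists."""
    a = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        a[i - 1][j - 1] = 1
        a[j - 1][i - 1] = 1
    return a


A = adjacency_matrix(N_ATOMS, BONDS)
for row in A:
    print(row)

# Row sums give the vertex degrees (number of bonded non-hydrogen atoms).
degrees = [sum(row) for row in A]
print(degrees)  # [1, 3, 2, 1, 1]
```

Any other numbering of the atoms gives a matrix that differs only by a simultaneous permutation of rows and columns, which is why indices derived from it are expected to be independent of the labelling.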
2. Methods and applications Molecular topology, MT, can be defined as a part of chemistry consisting of the topological description of molecular structures. Such description deals basically with the connectivity of the atoms forming the molecule and should
Figure 2. Bibliographic research obtained with Scifinder Scholar data base. White bars: papers including the key QSAR; Black bars: papers including the keys QSAR + molecular topology.
yield numerical descriptors which are invariant under deformation of the structure. Note that, although graph theory is usually the main source of descriptors and concepts feeding MT, this is a much broader concept and actually it is not a part of graph theory. Throughout this section somewhat different molecular connectivity indices will be introduced. In order to outline the particular QSAR techniques used with this methodology, descriptors will be defined before explaining the modeling tools applied with them. Diverse statistical and molecular techniques will be sketched here.
2.1. Descriptors The following types of indices, which have been mainly used in this research, are described in increasing order of complexity. Discrete invariants These are natural numbers calculated from what chemists understand qualitatively as the chemical structure. N is the number of non-hydrogen atoms, i.e., the number of molecular graph vertices [19,20]. Vk, where k is 3 or 4, is the number of vertices of degree k, i.e., the number of atoms having
k bonds, σ or π, to non-hydrogen atoms [20]. PRk, for k between 0 and 3, is the number of pairs of ramifications at distance k, i.e., the number of pairs of single branches at distance k in terms of bonds [20]. L is the length, i.e., the maximum distance between non-hydrogen atoms measured in bonds, and is thus the diameter of the molecular graph, defined as max(dij) [20]. W is the Wiener number, i.e., the sum of the distances between any two non-hydrogen atoms measured in bonds [21].
Connectivity indices
Throughout the present section the connectivity indices defined as in Eq. 1 will be used [22, 23]. Some of them are slightly different from the previously defined indices. The connectivity index of order k [23] may be derived from the adjacency matrix and is normally written as kχt. The order k is between 0 and 4 and is the number of connected non-hydrogen atoms which appear in a given sub-structure.
$${}^{k}\chi_t = \sum_{j=1}^{{}^{k}n_t} \left( \prod_{i \in S_j} \delta_i \right)^{-1/2} \qquad \text{(Eq. 1a)}$$
In Eq. 1a, δi is the number of simple (i.e., sigma) bonds of atom i to non-hydrogen atoms, Sj represents the jth sub-structure of order k and type t, and knt is the total number of sub-graphs of order k and type t that can be identified in the molecular structure. The types used are path (p), cluster (c), and path-cluster (pc). A sub-graph of type p is formed by a path, a sub-graph of type c is formed by a star, while a pc sub-graph can be defined as every tree which is neither a path nor a star. Alternatively, a pc sub-graph is any tree containing at least a star and a path. As an example, Table 1 displays all the p, c, and pc sub-graphs found in a simple molecular structure. The use of the valence delta, δv, instead of δ enables the encoding of π and lone-pair electrons [22] in the form given in Eq. 1b.
$${}^{k}\chi_t^{v} = \sum_{j=1}^{{}^{k}n_t} \left( \prod_{i \in S_j} \delta_i^{v} \right)^{-1/2} \qquad \text{(Eq. 1b)}$$
Here δv is just the degree of a vertex in a pseudograph and in this context the old definition, δv = Zv / (Z - Zv - 1), for the δiv of higher row atoms holds [22]. The values listed in Table 2 and used in this approach were adopted because of their general performance.
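The path-type connectivity indices of Eq. 1a can be sketched in a few lines. The example below is an illustration only, not the Desmol11 implementation: it handles path sub-graphs alone, takes the order k of a path as its number of bonds (the usual Kier-Hall convention, assumed here), and uses the simple δ values of isopentane with an arbitrary vertex numbering.

```python
from math import prod, sqrt

# Isopentane skeleton, 0-based numbering chosen for this sketch:
# 0-1-2-3 chain with a methyl branch (vertex 4) on vertex 1.
BONDS = [(0, 1), (1, 2), (2, 3), (1, 4)]
N_ATOMS = 5


def neighbours(n_atoms, bonds):
    """Neighbour list of each vertex of the hydrogen-depleted graph."""
    nb = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        nb[i].append(j)
        nb[j].append(i)
    return nb


def paths_of_length(nb, k):
    """All simple paths containing k bonds, each counted once."""
    found = []

    def extend(path):
        if len(path) == k + 1:
            # A path and its reverse are the same sub-graph: keep one of them.
            if k == 0 or path[0] < path[-1]:
                found.append(tuple(path))
            return
        for nxt in nb[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])

    for start in range(len(nb)):
        extend([start])
    return found


def chi_path(n_atoms, bonds, k):
    """Path connectivity index of order k (Eq. 1a with t = p)."""
    nb = neighbours(n_atoms, bonds)
    delta = [len(x) for x in nb]  # simple delta: bonds to non-hydrogen atoms
    return sum(1.0 / sqrt(prod(delta[i] for i in p))
               for p in paths_of_length(nb, k))


for k in range(3):
    print(k, round(chi_path(N_ATOMS, BONDS, k), 4))
# 0 4.2845
# 1 2.2701
# 2 1.8021
```

Replacing the list `delta` by valence deltas δv would give the valence indices of Eq. 1b; cluster and path-cluster sub-graphs would require an additional enumeration step that is not attempted here.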
Table 1. Types of sub-graphs present in the 2-methylpropanol structure. [Table of structure drawings: rows Path, Cluster and Path-Cluster; columns Order 1 to Order 5. The drawings are not reproduced here.]
Table 2. Values of δv for the different heteroatoms present in the listed groups.
Groups (N and S): NH4+ NH3 -NH2 -NH=NH -N=N=N+= (azide) =N- (azide) -N= (nitro) -S=S S (-SO2-)
δv, in the same order: 1, 2, 3, 4, 4, 5, 5, 4, 6, 6, 1.33, 0.99, 2.67
Groups (O, halogens and P): H3O+ H2O -OH -O=O O (nitro) O (carboxyl) -F -Cl -Br -I =PP(5)
δv, in the same order: 3, 4, 5, 6, 6, 6, 6, 7, 0.690, 0.254, 0.15, 0.560, 2.22
Topological Charge Indices (TCI)
The Topological Charge Indices Gk and Jk of order k = 1–5 are defined for a given graph by Eq. 2 [24], in which N is the number of non-hydrogen atoms and cij = mij − mji is the charge term between vertices i and j. δ represents here the Kronecker delta symbol, i.e., δ(a,b) = 1 if a = b and δ(a,b) = 0 if a ≠ b, and, finally, dij is the topological distance between vertices i and j.
$$G_k = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} |c_{ij}|\,\delta(k, d_{ij}) \quad \text{and} \quad J_k = \frac{G_k}{N-1} \qquad \text{(Eq. 2)}$$
The variables mij are the elements of the N×N matrix M obtained as the product of two matrices, i.e., M = A·Q. The elements of M expanded in terms of the elements of A and Q are given in Eq. 3.
$$m_{ij} = \sum_{h=1}^{N} a_{ih}\, q_{hj} \qquad \text{(Eq. 3)}$$
A is the adjacency matrix, in which the elements aih are: 0 if either i = h or i is not linked to h; 1 if i is linked to h by a single bond; 1.5 if linked to h by an aromatic bond; 2 if linked to h by a double bond; and 3 if linked to h by a triple bond. Q is the inverse squared distance or Coulombian matrix. Its elements, qhj, are 0 if h = j and otherwise qhj = 1/dhj2, where dhj is the topological distance between vertices h and j. Thus, Gk represents the overall sum of the cij charge terms for every pair of vertices i and j separated by a topological distance k. The valence Topological Charge Indices Gkv and Jkv are defined in a similar way, but using Av, the electronegativity-modified adjacency matrix, instead of A. The elements of A and Av are identical except for the main diagonal, where A has zeroes and Av the corresponding Pauling electronegativity values, EN, weighed for EN(Cl) = 2 for each heteroatom. To illustrate the calculation of the topological charge indices, let us consider n-butane. Its hydrogen-depleted graph is: •⎯•⎯•⎯•. If we number the vertices of this graph in the following way: 1-2-3-4, we can write the A, Q and M matrices, which are used to derive the following G values: G1 = |c12| + |c23| + |c34| = 1/4 + 0 + 1/4 = 0.500, G2 = |c13| + |c24| = 1/9 + 1/9 = 0.2222, and G3 = |c14| = 0.
$$A = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix} \qquad Q = \begin{pmatrix} 0 & 1 & 1/4 & 1/9 \\ 1 & 0 & 1 & 1/4 \\ 1/4 & 1 & 0 & 1 \\ 1/9 & 1/4 & 1 & 0 \end{pmatrix} \qquad M = A \cdot Q = \begin{pmatrix} 1 & 0 & 1 & 1/4 \\ 1/4 & 2 & 1/4 & 10/9 \\ 10/9 & 1/4 & 2 & 1/4 \\ 1/4 & 1 & 0 & 1 \end{pmatrix}$$
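The n-butane example can be reproduced with exact fractions. The sketch below follows Eqs. 2 and 3 as stated in the text (with absolute values of the charge terms, as in the worked example); the function names are illustrative and the code assumes a connected graph.

```python
from fractions import Fraction


def topological_distances(a):
    """All-pairs shortest path lengths (in bonds) via Floyd-Warshall.

    Assumes a connected graph, so that every distance ends up an integer.
    """
    n = len(a)
    inf = float("inf")
    d = [[0 if i == j else (1 if a[i][j] else inf) for j in range(n)]
         for i in range(n)]
    for h in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][h] + d[h][j] < d[i][j]:
                    d[i][j] = d[i][h] + d[h][j]
    return d


def charge_indices(a, k_max):
    """Return ({k: G_k}, {k: J_k}) for k = 1..k_max."""
    n = len(a)
    d = topological_distances(a)
    # Q: inverse squared topological distances, zero on the diagonal.
    q = [[Fraction(0) if i == j else Fraction(1, d[i][j] ** 2)
          for j in range(n)] for i in range(n)]
    # M = A.Q (Eq. 3).
    m = [[sum(Fraction(a[i][h]) * q[h][j] for h in range(n))
          for j in range(n)] for i in range(n)]
    g = {}
    for k in range(1, k_max + 1):
        # Sum |c_ij| = |m_ij - m_ji| over pairs at distance k (Eq. 2).
        g[k] = sum((abs(m[i][j] - m[j][i])
                    for i in range(n) for j in range(i + 1, n)
                    if d[i][j] == k), Fraction(0))
    j_idx = {k: g[k] / (n - 1) for k in g}
    return g, j_idx


# n-Butane, vertices numbered 1-2-3-4 along the chain (all single bonds).
A_BUTANE = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

G, J = charge_indices(A_BUTANE, 3)
print(G)  # G1 = 1/2, G2 = 2/9, G3 = 0
print(J)  # J1 = 1/6, J2 = 2/27, J3 = 0
```

Bond orders other than 1 (1.5, 2 or 3) can be entered directly in the adjacency matrix; the valence indices would additionally require the electronegativity terms on the diagonal, which are not included in this sketch.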
Differences and quotients of connectivity indices
The differences of connectivity indices, kDt, with k = 0–4 and t = p, c, pc, are defined in the following way [25]:
$${}^{k}D_t = {}^{k}\chi_t - {}^{k}\chi_t^{v} \qquad \text{(Eq. 4)}$$
The quotients of connectivity indices, kCt, with k = 0–4 and t = p, c, pc, are defined in the following way [20]:
$${}^{k}C_t = \frac{{}^{k}\chi_t}{{}^{k}\chi_t^{v}} \qquad \text{(Eq. 5)}$$
All descriptors used in this work were obtained with the aid of the Desmol11 program developed by us and available by e-mail request.
2.2. QSAR algorithm
2.2.1. Multilinear regression analysis, MLR
Several multilinear descriptive functions (Eq. 6) have been obtained by the linear correlation of biological properties with the aforementioned descriptors.
$$P_i = A_0 + \sum A_i X_i \qquad \text{(Eq. 6)}$$
where Pi is a property, Xi are the topological indices, and Ao and Ai are the regression coefficients of the equation obtained. The Furnival-Wilson algorithm [26] is used to obtain subsets of descriptors and equations with the least Mallows parameter, Cp [27]. This algorithm combines two methods of computing the residual sums of squares for all possible regressions to form a simple leap and bound technique for finding the best subsets without examining all possible ones. The result is a reduction by several orders of magnitude in the number of operations required to find the best subsets. The predictive ability of the selected mathematical models was evaluated through cross-validation, using the leave-one-out [28] method. To do this, one compound in the set was removed and the model was recalculated using the remaining N-1 compounds as the training set. The property was then predicted for the removed compound. This process was repeated for all the compounds in the set to obtain a prediction for each. A plot of the residuals vs. the cross-validation residuals (cv) allowed the detection of outliers. In order to reveal the possible existence of fortuitous regressions, the randomization test is adopted in this paper. Thus, the values of the property of each compound are randomly permuted and linearly correlated with the aforementioned descriptors. This process is repeated as many times as needed. The usual way to represent the results of a randomization test is to plot the correlation coefficients against the prediction coefficients, r2 and Q2 respectively.
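The leave-one-out and randomization procedures described above can be sketched as follows. This is a simplified illustration and not the procedure actually run by the authors: it uses a single descriptor instead of a multilinear model, takes Q2 = 1 − PRESS/TSS with TSS computed about the mean of the full data set (one common definition), and the data are hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a0 + a1*x (one descriptor)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a1 = sxy / sxx
    return my - a1 * mx, a1


def total_sum_of_squares(y):
    my = sum(y) / len(y)
    return sum((yi - my) ** 2 for yi in y)


def r_squared(x, y):
    """Squared correlation coefficient of the fitted model."""
    a0, a1 = fit_line(x, y)
    rss = sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - rss / total_sum_of_squares(y)


def q_squared_loo(x, y):
    """Leave-one-out prediction coefficient, Q2 = 1 - PRESS / TSS."""
    press = 0.0
    for i in range(len(x)):
        # Refit without compound i, then predict it.
        a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a0 + a1 * x[i])) ** 2
    return 1.0 - press / total_sum_of_squares(y)


def randomization_test(x, y, n_trials=20, seed=0):
    """(r2, Q2) pairs obtained after randomly permuting the property values."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        y_perm = y[:]
        rng.shuffle(y_perm)
        results.append((r_squared(x, y_perm), q_squared_loo(x, y_perm)))
    return results


# Hypothetical descriptor values and property values, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

print("r2 =", round(r_squared(x, y), 3), " Q2 =", round(q_squared_loo(x, y), 3))
for r2, q2 in randomization_test(x, y, n_trials=5):
    print("permuted: r2 =", round(r2, 3), " Q2 =", round(q2, 3))
```

In a randomization plot, the point of the real model is expected to lie well apart from the cloud of (r2, Q2) points obtained with the permuted property values.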
The predictive ability of the equations obtained can be better assessed by an external test. A random sub-set of molecules is initially chosen (the "external test set") and the modeling study is carried out with the remaining molecules (the "training set"). The predictive performance of the model is assessed by the results obtained when it is applied to the external test set.
MLR applications
Prediction of plasmatic protein binding for a group of antineoplastics
A set of 41 highly heterogeneous antineoplastic drugs was used in this study. The multilinear regression analysis was performed by means of the BMDP software [29]. The dependent variable, namely plasmatic protein binding (PPB), was correlated against the topological descriptors. The selected equation was:
$$\mathrm{PPB}(\%) = 194.9 + 7.2\,{}^{4}\chi_{pc} + 3.6\,G_1 - 10.6\,G_5^{V} - 124.3\,J_1^{V} - 704.0\,J_4^{V} - 8.5\,{}^{2}D_p \qquad \text{(Eq. 7)}$$
N = 41, r2 = 0.805, Q2 = 0.732, SEE = 15.5, F = 23.3, p < 0.0001

The J and G labels correspond to the topological charge indices, which take into account the charge transfers inside the molecule, whereas 4χpc and 2Dp stand for the connectivity and difference of connectivity indices, respectively. The former encodes information about the topological assembly of the molecule and the latter about its electronic properties. Table 3 and Figure 3 show the outcome of the prediction of PPB for each drug. Altogether the predictive performance is fairly good, inasmuch as all the drugs showing a large PPB rate (23 compounds with PPB above 80%) are predicted above 70% (except raltitrexed, which is a clear outlier). On the other hand, all drugs with values below 20% (namely six compounds) are predicted to lie under a 35% threshold. As with the top values, there is a clear outlier here: cyclophosphamide.

Table 3. Results of prediction of plasmatic protein binding obtained by multilinear regression analysis for a group of antineoplastics.

| Compound | PPBexp (%) | PPBcalc (%)* | PPBcalc(cv) (%) |
|---|---|---|---|
| Dacarbazine | 5.0 | 15.4 | 17.0 |
| Gemcitabine | 10.0 | 33.6 | 37.8 |
| Cytarabine | 13.0 | 32.1 | 33.9 |
| Cyclophosphamide | 15.0 | 48.7 | 51.2 |
| Temozolomide | 15.0 | -2.1 | -5.8 |
| Mercaptopurine | 19.0 | 18.6 | 18.5 |
| Aminoglutethimide | 25.0 | 53.1 | 57.2 |
| Cladribine | 25.0 | 9.9 | 7.3 |
| Busulfan | 30.0 | 53.0 | 62.4 |
| Nimustine | 34.0 | 26.6 | 26.0 |
| Topotecan | 35.0 | 37.7 | 38.0 |
| Anastrozole | 45.0 | 31.8 | 29.1 |
| Methotrexate | 50.0 | 71.0 | 74.7 |
| Capecitabine | 54.0 | 50.0 | 49.8 |
| Letrozole | 60.0 | 55.3 | 54.6 |
| Irinotecan | 65.0 | 87.6 | 90.3 |
| Vinblastine | 75.0 | 74.4 | 74.0 |
| Vincristine | 75.0 | 78.3 | 80.3 |
| Mitoxantrone | 76.0 | 58.9 | 56.8 |
| Doxorubicin | 80.0 | 75.1 | 74.7 |
| Melphalan | 80.0 | 84.7 | 85.1 |
| Hydroxycarbamide | 80.0 | 72.2 | 69.0 |
| Epirubicin | 85.0 | 75.1 | 74.1 |
| Exemestane | 90.0 | 90.1 | 90.1 |
| Formestane | 93.0 | 93.0 | 93.1 |
| Raltitrexed | 93.0 | 68.7 | 65.5 |
| Medroxyprogesterone | 94.0 | 92.7 | 92.4 |
| Cyproterone | 95.0 | 94.3 | 94.1 |
| Docetaxel | 95.0 | 95.5 | 95.7 |
| Etoposide | 95.0 | 88.4 | 86.5 |
| Flutamide | 95.0 | 80.6 | 77.5 |
| Imatinib | 95.0 | 103.7 | 107.0 |
| Paclitaxel | 95.0 | 98.9 | 99.9 |
| Idarubicin | 96.0 | 76.0 | 73.9 |
| Amsacrine | 97.0 | 99.8 | 100.2 |
| Bicalutamide | 98.0 | 88.7 | 86.9 |
| Chlorambucil | 99.0 | 72.4 | 70.0 |
| Estramustine | 99.0 | 90.7 | 89.9 |
| Tamoxifen | 99.0 | 102.5 | 103.2 |
| Thiotepa | 99.0 | 89.3 | 80.2 |
| Toremifene | 99.0 | 110.4 | 112.8 |

* Values obtained from Eq. 7.

Figure 3. Observed versus calculated values of the PPB.

Figure 4. Cross-validated residuals versus residuals for the PPB model.
J. Gálvez & R. García-Domenech
The uniform distribution of points on both sides of the trend line (see Figure 3) indicates a well-balanced selection of variables in the model. The predictive ability of the selected mathematical model was evaluated through leave-one-out cross-validation. Table 3 and Figure 4 show the results obtained. The value of Q2=0.732 is considered satisfactory.

Prediction of the phenoloxidase inhibition by a group of benzaldehyde thiosemicarbazones

A topological-mathematical model was built to search for new derivatives of benzaldehyde thiosemicarbazone and related compounds acting as phenoloxidase inhibitors. Phenoloxidase, PO, also known as tyrosinase, is a key enzyme in different metabolic processes of microorganisms, animals and plants [30,31]. In insects, PO is related to three important biochemical functions: cuticle sclerotization, defensive encapsulation and the melanization of foreign organisms [32]. By using multilinear regression analysis, a function with two descriptors, 1χv and 4χpv, and r2=0.940 was able to predict adequately the IC50 of each compound. The best linear regression equation obtained, including its statistical parameters, was:

$$\mathrm{pIC}_{50} = -3.132 + 5.716\,{}^{1}\chi^{v} - 16.581\,{}^{4}\chi_{p}^{v} \qquad \text{(Eq. 8)}$$

N=44 r2=0.940 Q2=0.931 SEE=0.500 F=321.1 p<0.00001

The presence of the 1χv and 4χpv indices in the equation reflects the influence of branching and of the position of the substituent on the aromatic ring, respectively. Figure 5 shows the predicted results obtained for each of the compounds in the training and test sets. The results of the randomization test, Figure 6, suggest a high stability of the model (all regressions were rather poor except for the selected equation). For more details about this study, see ref. [33].

Other prediction results for biological properties obtained by our research group

The most interesting results obtained recently with the methodology described in the preceding paragraphs are shown in Table 4. Additional details are available in the references cited.
Figure 5. Relationship of pIC50exp with pIC50calc from prediction function obtained using multilinear regression analysis, Eq. 8. Open circles represent predictions for the training set; solid circles represent predictions for the test set.
Figure 6. Validation of the mathematical model obtained for the pIC50. Correlation coefficients, r2, versus prediction coefficients, Q2, obtained by randomization test.
Table 4. QSAR topological models to predict pharmacological properties.

Property: IC50 (μM), L. donovani. Ref. [34]
Predictive equation: $\log \mathrm{IC}_{50} = 13.32 + 0.814\,{}^{4}\chi_{p} + 1.381\,G_{3}^{v} - 32.16\,J_{3}^{v} + 0.0018\,W - 0.717\,N + 0.332\,PR1 - 0.263\,PR2$
N=48 r2=0.806 SEE=0.366 F=10.6 p<0.0001

Property: IC50 (μM), K1 strain of P. falciparum. Ref. [35]
Predictive equation: $\log \mathrm{IC}_{50} = 3.154 - 0.338\,{}^{1}\chi^{v} + 0.381\,{}^{4}\chi_{pc}$
N=54 r2=0.842 Q2=0.825 SEE=0.308 F=136.4 p<0.0001

Property: Log LD50 (oral in rat). Ref. [36]
Predictive equation: $\log \mathrm{LD}_{50} = 15.13 + 0.79\,G_{1} - 1.83\,{}^{4}C_{p} - 1.10\,{}^{1}\chi^{v} - 0.28\,G_{2} - 8.89\,{}^{0}C + 1.43\,G_{3}^{v} - 20.04\,J_{3}^{v} - 0.13\,{}^{4}\chi_{pc}^{v}$
N=39 r2=0.821 Q2=0.701 SEE=0.40 F=17.2 p<0.0001

Property: Log LD50 (i.p. in rat). Ref. [36]
Predictive equation: $\log \mathrm{LD}_{50} = 8.40 + 9.35\,J_{1} - 2.03\,{}^{4}C_{p} - 10.11\,{}^{0}C + 1.20\,{}^{1}D + 0.83\,G_{5}^{v} - 0.39\,{}^{4}\chi_{pc}^{v}$
N=39 r2=0.721 Q2=0.613 SEE=0.54 F=13.8 p<0.0001

Property: Toxicity to Chlorella vulgaris. Ref. [37]
Predictive equation: $pC = -4.494 + 0.568\,{}^{0}\chi^{v} - 0.113\,G_{1}^{v} - 1.161\,G_{5}^{v} + 10.071\,J_{4} + 0.188\,V3$
N=70 r2=0.928 Q2=0.918 SEE=0.405 F=180 p<0.0001
2.2.2. Linear discriminant analysis, LDA

The objective of linear discriminant analysis, LDA, is to find a linear combination of variables that allows discrimination between two or more categories of objects. Generally, two sets of compounds are considered in the analysis: first, a set of compounds with proven pharmacological activity and, second, a set of compounds known to be inactive. Introductory accounts of LDA are given by Kachigan [38] and McFarland and Gans [39]. The selection of the descriptors is based on the Fisher-Snedecor parameter, and the classification criterion is the shortest Mahalanobis distance, i.e., each case is assigned to the group whose mean lies closest to it. Variables used in computing the linear classification functions are chosen stepwise. At each step, either the variable that adds the most to the separation of the groups is entered into the discriminant function, or the variable that adds the least to the separation of the groups is removed from it. The quality of the discriminant function is evaluated by Wilks' λ, a multivariate analysis of variance parameter that tests the equality of group means for the variable(s) in the discriminant function. Minimization of Wilks' λ guides the selection of the predictors to be entered into or deleted from the discriminant function [40]. The technique is described in detail by Tabachnick and Fidell [41]. The discriminant ability of the selected function is assessed by:
- The classification matrix, in which each case is classified into a group according to the classification function. The number of cases classified into each group and the percentage of correct classifications are shown.
- The jack-knifed classification matrix, in which each case is classified into a group according to the classification functions computed from all the data except the case being classified.
- Cross-validation with random sub-samples and classification of new cases. Here, the cases in each group are randomly subdivided into two separate sets, the first of which is used to estimate the classification function, and the second of which is classified according to that function. The proportion of correct classifications obtained for the second set gives an empirical measure of the success of the discrimination.
- Use of an external test set, i.e., a set of compounds not used in the analysis, to check the validity of the selected discriminant functions.
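The classification matrix, its jack-knifed version and Wilks' λ can be illustrated with a short NumPy sketch. It is a minimal example under stated assumptions, not the BMDP stepwise procedure: the two-descriptor data are synthetic, the helper names are hypothetical, and all descriptors are used at once with a pooled within-group covariance matrix.

```python
import numpy as np

def group_stats(X, labels):
    """Group means, pooled within-group covariance and within-group SSCP."""
    groups = sorted(set(labels.tolist()))
    means = {g: X[labels == g].mean(axis=0) for g in groups}
    within = sum((X[labels == g] - means[g]).T @ (X[labels == g] - means[g])
                 for g in groups)
    return means, within / (len(X) - len(groups)), within

def classify(x, means, cov_inv):
    """Assign x to the group with the shortest Mahalanobis distance."""
    dist = {g: (x - m) @ cov_inv @ (x - m) for g, m in means.items()}
    return min(dist, key=dist.get)

def classification_matrix(X, labels, jackknife=False):
    """Rows: observed group; columns: assigned group."""
    groups = sorted(set(labels.tolist()))
    matrix = np.zeros((len(groups), len(groups)), dtype=int)
    means, cov, _ = group_stats(X, labels)
    for i in range(len(X)):
        if jackknife:  # recompute the functions without case i
            keep = np.arange(len(X)) != i
            means, cov, _ = group_stats(X[keep], labels[keep])
        assigned = classify(X[i], means, np.linalg.inv(cov))
        matrix[groups.index(labels[i]), groups.index(assigned)] += 1
    return matrix

def wilks_lambda(X, labels):
    """det(within-group SSCP) / det(total SSCP); smaller means better separation."""
    _, _, within = group_stats(X, labels)
    centered = X - X.mean(axis=0)
    return np.linalg.det(within) / np.linalg.det(centered.T @ centered)

# Synthetic data (assumption): 20 "inactive" (0) and 20 "active" (1) compounds.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(20, 2)), rng.normal(loc=3.0, size=(20, 2))])
labels = np.array([0] * 20 + [1] * 20)

resub = classification_matrix(X, labels)
jack = classification_matrix(X, labels, jackknife=True)
lam = wilks_lambda(X, labels)
print("classification matrix:\n", resub)
print("jack-knifed classification matrix:\n", jack)
print(f"Wilks' lambda = {lam:.3f}")
```

The diagonal of each matrix holds the correctly classified cases, so the percentage of correct classification per group is the diagonal entry divided by the row sum.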
LDA applications

Inhibition of Trypanosoma cruzi hexokinase by bisphosphonates

American trypanosomiasis, or Chagas disease, is a pathology whose causative agent is the protozoan parasite Trypanosoma cruzi. The infection is transmitted by blood-feeding hemipteran insect vectors of the subfamily Triatominae and affects humans as well as domestic and wild animals. An estimated 18 million people are infected and one hundred million are at high risk of infection in fifteen Latin American countries, from Mexico to Argentina [42]. At present the therapy for this pathology is based on two drugs, nifurtimox and benznidazole, which have a hazardous level of toxicity and are effective only in the acute stage of the disease. For this reason it is necessary to find new drugs against this disease. Hexokinase is the first enzyme involved in glycolysis in most organisms, including the etiological agents of Chagas disease (Trypanosoma cruzi) and African sleeping sickness (Trypanosoma brucei). Recent studies have shown that bisphosphonate analogues are potent inhibitors of T. cruzi hexokinase, which may represent a novel target for finding new compounds active against T. cruzi [43]. In this work, the inhibition of T. cruzi hexokinase, TcHK, by a group of bisphosphonate derivatives was investigated in order to obtain a predictive QSAR model using molecular topology and linear discriminant analysis. A group of 42 bisphosphonate derivatives that inhibit TcHK was selected.
Tables 5 and 6 show the structures and the inhibitory activity, expressed as IC50 (μM) values obtained in the in vitro assays, for each compound as reported in refs. [43, 44]. To obtain the discriminant function, LDA was applied to a training group of 35 compounds and, to validate it, to a test group of 7 compounds. The training series comprises two subgroups: an active group (compounds with IC50<20μM) and an inactive group (compounds with IC50>20μM). The test series consists of compounds with IC50>20μM, randomly selected from the inactive group.

Table 5. Structures of bisphosphonates studied in order of decreasing potency in TcHK inhibition.
[Structure drawings of compounds 4 to 45.]

Table 6. Results of prediction obtained by linear discriminant analysis with IC50 for TcHK and the bisphosphonate derivatives analysed.

Compound   IC50(μM)   1χ       G4v     J2v     J5v     DF       Class.
Active group, training (IC50<20μM)
5          0.81       9.847    1.869   0.252   0.055   1.19     A
6          0.95       6.661    0.782   0.187   0.029   2.25     A
7          1.45       9.721    1.278   0.193   0.045   3.28     A
8          1.82       8.500    2.024   0.383   0.045   -1.85    I
9          1.95       8.321    0.959   0.197   0.038   2.23     A
10         2.29       10.761   2.222   0.321   0.045   3.07     A
11         2.29       9.933    1.884   0.326   0.046   0.88     A
12         2.4        7.604    0.776   0.171   0.035   2.41     A
13         2.75       8.561    1.126   0.191   0.040   2.81     A
14         2.75       9.149    1.286   0.206   0.040   3.31     A
15         3.47       8.821    0.967   0.254   0.038   1.04     A
16         3.63       8.738    1.267   0.212   0.039   2.91     A
17         4.07       7.721    1.248   0.230   0.037   1.67     A
18         12.6       6.618    0.813   0.338   0.036   -4.08    I
19         14.8       7.221    0.976   0.230   0.031   1.56     A
Inactive group, training (IC50>20μM)
4          300        6.618    0.804   0.324   0.028   -1.87    I
22         300        7.972    1.906   0.360   0.058   -4.59    I
23         300        6.689    0.808   0.208   0.039   -0.45    I
24         300        8.209    1.149   0.313   0.043   -2.13    I
26         300        8.321    1.149   0.295   0.040   -0.82    I
27         300        9.100    1.495   0.334   0.042   -0.52    I
28         300        8.732    1.365   0.265   0.044   0.55     A
29         300        9.557    2.080   0.403   0.041   -0.43    I
30         300        9.442    1.718   0.172   0.086   -3.47    I
31         300        8.689    1.858   0.338   0.043   -0.34    I
32         300        6.618    1.222   0.368   0.031   -2.83    I
33         300        7.833    1.343   0.321   0.038   -1.14    I
35         300        7.911    1.590   0.224   0.061   -1.93    I
36         300        7.077    0.833   0.218   0.039   -0.30    I
37         300        7.121    2.162   0.255   0.058   -1.26    I
38         300        9.284    1.276   0.268   0.050   -0.55    I
39         300        7.221    1.071   0.258   0.034   0.29     A
40         300        9.512    1.594   0.261   0.068   -3.00    I
42         300        7.118    0.677   0.290   0.030   -1.11    I
43         300        5.693    0.990   0.234   0.053   -4.55    I
Test group
20         34.7       8.232    0.957   0.233   0.030   2.69     A
21         45.7       6.667    0.692   0.349   0.067   -11.10   I
25         300        8.735    2.027   0.392   0.059   -4.83    I
34         300        8.581    2.115   0.305   0.059   -1.83    I
41         300        7.141    2.180   0.241   0.081   -5.51    I
44         300        5.710    0.913   0.233   0.047   -3.53    I
45         300        8.078    1.551   0.180   0.080   -4.33    I

The discriminant function selected was:

$$DF = 5.261 + 1.013\,{}^{1}\chi + 3.011\,G_{4}^{v} - 32.49\,J_{2}^{v} - 207.4\,J_{5}^{v} \qquad \text{(Eq. 9)}$$

N=35 F=5.90 λ (Wilks' lambda) = 0.556
From this, a given compound is selected as a potential TcHK inhibitor if DF > 0; otherwise it is classified as "inactive". The classification matrix is highly significant for the training set: 86.7% correct prediction for the active group (13 out of 15 correctly classified) and 90.0% for the inactive group (18 out of 20; see Table 6). A simple way to evaluate the quality of the selected discriminant function is to apply it to an external test group. In our case we used 7 compounds that had not been included in the discriminant analysis, all of them with IC50 > 20μM. Table 6 shows the results obtained for each compound. As shown, all the compounds except number 20 are correctly classified as inactive (DF<0). For more details see ref. [45].
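As a worked illustration, Eq. 9 can be applied directly to the descriptor values listed in Table 6. The sketch below uses a handful of rows copied from that table; the variable and function names are hypothetical. Because the tabulated descriptors are rounded to three decimals, the recomputed DF values are expected to differ from the tabulated ones in the second decimal, while the sign, and hence the class, is unchanged for these rows.

```python
# Coefficients of Eq. 9: intercept, then 1chi, G4v, J2v, J5v.
INTERCEPT = 5.261
COEFFS = (1.013, 3.011, -32.49, -207.4)

# (compound, 1chi, G4v, J2v, J5v, DF reported in Table 6, observed group)
ROWS = [
    (5, 9.847, 1.869, 0.252, 0.055, 1.19, "A"),
    (6, 6.661, 0.782, 0.187, 0.029, 2.25, "A"),
    (8, 8.500, 2.024, 0.383, 0.045, -1.85, "A"),   # active, predicted inactive
    (4, 6.618, 0.804, 0.324, 0.028, -1.87, "I"),
    (28, 8.732, 1.365, 0.265, 0.044, 0.55, "I"),   # inactive, predicted active
    (20, 8.232, 0.957, 0.233, 0.030, 2.69, "I"),   # test set, predicted active
]

def discriminant(descriptors):
    """Evaluate Eq. 9 for one compound."""
    return INTERCEPT + sum(c * d for c, d in zip(COEFFS, descriptors))

def classify(df):
    """DF > 0 means predicted active (A); otherwise inactive (I)."""
    return "A" if df > 0 else "I"

results = []
for compound, *descriptors, df_table, observed in ROWS:
    df = discriminant(descriptors)
    results.append((compound, df, df_table, classify(df), observed))
    print(f"compound {compound}: DF = {df:.2f} (table {df_table:.2f}), "
          f"predicted {classify(df)}, observed {observed}")

# Training-set success rates quoted in the text.
active_rate = 100 * 13 / 15
inactive_rate = 100 * 18 / 20
print(f"active group: {active_rate:.1f}%, inactive group: {inactive_rate:.1f}%")
```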
Inhibition of parasitary activity against Leishmania donovani

Leishmaniasis is a parasitic disease caused by protozoa of the genus Leishmania. It is transmitted by the bite of a sandfly of the subfamily Phlebotominae; the main reservoirs are wild or domestic mammals, including humans for some species such as L. donovani and L. tropica [46]. The disease is endemic in 88 countries on four continents. It is currently a real public health problem, affecting 12 million people, with an incidence of 1.5-2 million new cases per year [47]. Leishmaniasis is considered by the WHO as one of the orphan diseases to be included in the TDR programme (Special Programme for Research and Training in Tropical Diseases) [48]. Here we focus on some dinitrobenzene sulphonamides, all of them derivatives of the herbicide oryzalin. These compounds exhibit proven anti-Leishmania activity by inhibiting the polymerization of tubulin in purified parasites, thereby hindering the growth of the parasite in the G2/M phases of the cell cycle [49]. The study used discriminant analysis to obtain a topological model able to classify correctly the activity or inactivity of a given compound against Leishmania donovani. Once built, the model could be applied to the search for new potentially active compounds. A set of 57 compounds was used (37 of which were 3,5-dinitrobenzene derivatives and 20 of which were selected from the Maybridge Organics Compounds database [50], according to the results of Werbovetz [49]). Table 7 shows both the chemical structure and the antiparasitic activity of the selected compounds, expressed as IC50 (μM). To obtain the discriminant function, LDA was applied to a training set comprising all the active compounds (IC50 <20μM) plus 80% of the inactive compounds (IC50>20μM). The remaining 20% of the molecules were used as an external validation set to check the performance of the chosen discriminant function, DF, which was:

$$DF = -7.74 - 5.32\,{}^{4}\chi_{pc} - 0.542\,G_{1}^{v} + 5.11\,G_{3} + 5.16\,G_{4}^{v} + 18.52\,J_{2}^{v} - 141.58\,J_{3}^{v} + 2.51\,{}^{3}D_{c} \qquad \text{(Eq. 10)}$$

N=48 F=8.97 λ (Wilks' lambda) = 0.389

According to this equation, a compound is classified as active if DF > 0; otherwise it is classified as inactive. The resulting classification matrix is very significant because 100% of the active compounds are correctly classified, whereas 90.6% of the inactive ones (namely 29 out of 32 molecules) were also correctly placed. Altogether, the rate of correct classification is 93.8%. To check the ability of the equation to identify novel structures, an external validation test is necessary. To this end, a set of 10 compounds not included in the training set, randomly selected and all of them showing IC50 values above 20μM, was tested with the model. As shown in Table 7, all compounds except no. 31 are correctly classified by the model as inactive, which implies a success rate of 90%. For more details see ref. [34].

Table 7. Chemical structure of the studied compounds and results of the classification as for their activity against L. donovani (A = active; I = inactive).
[General structure with substituents R and X, and structure drawings of the individual compounds.]

Compound   IC50exp(a)   DF(b)    Classif.
Active group, training (IC50<20μM)
46         0.5          0.87     A
48         2.3          5.23     A
33         2.5          3.09     A
11         2.6          3.66     A
02         3.7          5.72     A
01         5            1.42     A
32         5            4.55     A
17         5.5          3.53     A
12         5.6          4.69     A
23         5.7          5.29     A
30         8.1          2.69     A
36         9            2.9      A
35         11           1.32     A
37         12           3.83     A
16         13           3.34     A
26         20           1.82     A
Inactive group, training (IC50>20μM)
04         21           -6.38    I
44         25           -3.34    I
05         27           -7.46    I
15         32           2.88     I
29         32           1.83     I
45         32           -5.88    I
54         37           -5.93    I
55         39           -4.74    I
09         43           -3.97    I
19         43           -1.21    I
22         43           -6.71    I
08         50           -1.89    I
10         50           -1.22    I
06         54           -3.25    I
07         55           -6.14    I
03         60           1.35     I
Oryzalin   65           -0.7     I
43         66           -4.05    I
24         67           -1.28    I
25         69           -0.48    I
53         80           -3.34    I
27         90           -9.36    I
57         98           -3.17    I
13         >100         -4.55    I
14         >100         -1.45    I
34         >100         -0.25    I
38         >100         -7.15    I
39         >100         -1.28    I
40         >100         -4.54    I
49         >100         -3.31    I
50         >100         -3.05    I
51         >100         -8.57    I
Test group
42         21           -9.01    I
31         23           3.68     A
20         26           -0.58    I
21         47           -8.72    I
18         48           -3.39    I
56         60           -9.48    I
41         72           -14.41   I
28         76           -3.23    I
47         >100         -9.33    I
52         >100         -8.90    I

(a) Values of IC50 (μM) from reference [49]. (b) Values of the discriminant function, Eq. 10.
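The success rates quoted for the L. donovani model follow from simple counting. As a check, and assuming that the number of active training compounds is 16 (the N=48 of Eq. 10 minus the 32 inactive training compounds), the arithmetic is:

$$\text{active group: } \frac{16}{16} = 100\%, \qquad \text{inactive group: } \frac{29}{32} \approx 90.6\%$$

$$\text{training set overall: } \frac{16 + 29}{48} = \frac{45}{48} \approx 93.8\%, \qquad \text{external test set: } \frac{9}{10} = 90\%$$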
2.3. Molecular topology and drug design

The use of molecular topology as a tool for the design and selection of new drugs has been the main objective of our group for over twenty years [51-54]. The outcome of this work has been the discovery of many new active compounds in different therapeutic areas, as illustrated in Table 8. Details on assays and protocols can be found in the references therein. Some of these compounds can be considered as new leads, which can be the starting point for the design of new and more effective drugs.

Table 8. New biological activities discovered through virtual screening. For details see the references in the last column. Activity found
Selected drugs
Ref
Cytostatic
6-azuridine, quinine 1-Chloro-2,4-dinitrobenzene, 3-Chloro-5-nitroindazole, 1-Phenyl-3-methyl-2-pyrazolin-5-one, neohesperidin, amaranth, mordant brown 24, hesperidin, morine, niflumic acid, silymarine, fraxine Neotetrazolium chloride, benzotropine mesilate, 3-(2-Bromethyl)-indole, 1-Chloro-2,4-dinitrobenzene 3-Hydroxybutyl acetate 4-(3-Methyl-5-oxo-2-pyrazolin-1-yl) benzoic acid 1-(Mesitylene-2-sulfonyl) 1H-1,2,3-triazole 3,5-dimethyl-4-nitroisoxazole, nitrofurantoin, (pyrrolidinocarbonylmethyl)piperazine, nebularine, cordycepin, adipic acid, thymidine, α-thymidine, inosine, 2,4-diamino-6-(hydroxymethyl)pteridine, 7-(carboxymethoxy)-4-methylcoumarin, 5-methylcytidine Carminic acid, tetracycline, piromidic acid, doxycycline Monensin, nigericin, vinblastine, vincristine, vindesine, ethylhydrocupreine, quinacrine, salinomycin Cefamandole nafate Prazosin Andrographolide Dibenzothiophene sulfone 2-Acetamido-4-methyl-5 thiazolesulfonyl chloride Benzydamine 4-(1-Butylpentyl)pyridine N-(3-Bromopropyl)phtalimide N-(3-Chloropropyl)phtalimide N-(3-Chloropropyl)piperidine hydrochloride 5-Bromoindole Griseofulvin, anthrarobin, 9,10-Dihydro-2-methyl-4H-benzo [5,6] cyclohept [1,2-d] oxazol-4-ol, 2-Aminothiazole, Maltol, esculetin, fisetin, hesperetin, 4-methyl-umbellipheryl-4-guanidine benzoate 2-(1-propenyl)phenol, 2',4' dimethylacetophenone, p-chlorobenzohydrazide, 1-(p-chlorophenyl) propanol, 4-benzoyl-3-methyl-1-phenyl-2-pyrazolin-5-one
[55]
Antibacterial Antifungal Hypoglycaemic Antivirals (anti-Herpes) Antineoplastic Antimalarial
Antitoxoplasma
Antihistaminic
Bronchodilator
Analgesics
[56,57]
[58] [59]
[60] [61] [62]
[63]
[64]
[65]
[66,67]
The overall process for topologically driven drug design can be summarized in the following steps:

- Step I: Selection of the therapeutic group and of the most representative drugs therein (training set).
- Step II: Search for the physicochemical, biological and pharmacological information for every drug in the training set.
- Step III: Calculation of the appropriate topological descriptors.
- Step IV: QSAR analysis by means of different statistical techniques such as multilinear regression, MLR, linear discriminant analysis, LDA, neural networks, NN, etc.
- Step V: Selection of the best topological model. Usually it is formed by both predictive equations and discriminant functions.
- Step VI: Application of the model to the search for new drugs, either through database screening, chemical libraries from combinatorial chemistry or de novo design if necessary.
- Step VII: Finally, the selected compounds should be tested in the laboratory to confirm their predicted activity. Typically the outcome is used for further refinement of the molecular candidates until the goals are fulfilled.
Just as an example, the most significant results recently achieved in antineoplastics and antimalarials are shown.

Antineoplastic agents

Of particular relevance in this field has been the design of a novel lead compound named MT477 (see Figure 7), which has shown very potent activity in vivo against human cell carcinoma [68]. MT477 is a novel thiopyrano[2,3-c]quinoline that has been identified using molecular topology screening as a potential anticancer drug with high activity against protein kinase C (PKC) isoforms. The objective of this study was to determine the mechanism of action of MT477 and its activity against human cancer cell lines. MT477 interfered with PKC activity as well as with the phosphorylation of Ras and ERK1/2 in H226 human lung carcinoma cells. It also induced poly-caspase-dependent apoptosis.

Figure 7. Chemical structure of MT477. The chemical name of MT477 is dimethyl 5,6-dihydro-7-methoxy-5,5-dimethyl-6-(2-(2,5-dioxopyrrolidin-1-yl)acetyl)-1H-1-(4,5-dimethoxycarbonyl-1,3-dithiolo-2-spiro) thiopyrano[2,3-c]quinoline-2,3-dicarboxylate.

Antimalarial agents against liver stages of Plasmodium

Each year, the malaria parasite Plasmodium falciparum infects 300 to 660 million persons worldwide and causes several million deaths [69]. New antimalarial drugs are urgently needed, especially considering the increasing prevalence of drug-resistant P. falciparum strains and the lack of effective vaccines and vector control measures. The Plasmodium liver stage is an interesting drug target, as it precedes the emergence of the blood stages that cause the symptoms and complications of malaria. Drugs that inhibit parasite maturation within hepatocytes could be used for short-term prophylaxis in areas of endemicity (refugees, travelers, etc.). We conducted a quantitative structure-activity relationship (QSAR) study based on a database of 127 compounds previously tested against the liver stage of Plasmodium yoelii in order to develop a model capable of predicting the in vitro antimalarial activities of new compounds. Topological indices were used as structural descriptors, and their relation to antimalarial activity was determined by using linear discriminant analysis. A topological model consisting of two discriminant functions, DF1 and DF2, was created:

$$DF_{1} = 3.17 + 1.00\,{}^{0}\chi^{v} + 1.28\,{}^{1}\chi^{v} - 14.04\,J_{1}^{v} - 22.94\,J_{3}^{v} + 96.23\,J_{4}^{v} - 65.98\,J_{5} + 1.88\,{}^{0}D - 23.53\,{}^{4}D_{c} + 0.29\,{}^{4}C_{c} - 0.51\,PR3 + 0.38\,V3 \qquad \text{(Eq. 11)}$$

N=76 F=7.95 λ=0.42

$$DF_{2} = 81.90 - 3.45\,{}^{4}\chi_{p}^{v} + 7.41\,G_{5}^{v} - 70.21\,{}^{1}C - 1.44\,PR1 - 1.21\,V3 + 2.44\,V4 \qquad \text{(Eq. 12)}$$

N=28 F=12.4 λ=0.21
The first function, DF1, discriminated between active and inactive compounds, and the second, DF2, identified the most active among the active compounds. The model was then applied sequentially to a large database of compounds with unknown activity against liver stages of Plasmodium. Seventeen drugs that were predicted to be active or inactive were selected for testing against the hepatic stage of P. yoelii in vitro (see Table 9). Antiretroviral, antifungal, and cardiotonic drugs were found to be highly active (nanomolar 50% inhibitory concentration values), and two ionophores completely inhibited parasite development. The 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide (MTT) assay was performed on hepatocyte cultures for all compounds, and none of these compounds was toxic in vitro. For both ionophores, the same in vitro assay as that used for P. yoelii confirmed their activity against Plasmodium falciparum. For more details see ref. [70].

Table 9. Predicted drug activity on the liver stage of Plasmodium yoelii yoelii.

Drug (therapeutic category)                   DF1 (Class)   DF2 (Class)    IC50 exp (nM)
Active drugs
Monensin (Antibacterial/Ionophore)            4.22 (A)      -21.62 (NC)    < 10^-3
Nigericin (Ionophore)                         3.36 (A)      -24.22 (NC)    < 10^-3
Delavirdine (Antiviral)                       -0.21 (NC)    6.75 (HA)      0.846
Mibefradil (Antihypertensive)                 7.59 (A)      6.66 (HA)      0.873
Licochalcone A (Estrogenic flavonoid)         8.26 (A)      7.99 (HA)      0.927
Miconazole (Antifungal)                       1.46 (A)      1.77 (HA)      2.03
Dobutamine (Cardiotonic)                      5.74 (A)      1.52 (HA)      3.7
Ritonavir (Antiviral)                         8.80 (A)      -5.38 (A)      34.2
Saquinavir (Antiviral)                        9.14 (A)      -6.76 (A)      35.2
Epoxomicin (Antineoplastic)                   8.17 (A)      7.92 (HA)      3.95 x 10^3
Indinavir (Antiviral)                         8.89 (A)      -10.55 (A)     5 x 10^3
Vinblastine (Antineoplastic)                  1.21 (A)      38.71 (NC)     7.95 x 10^3
Nordihydroguaiaretic acid (Antineoplastic)    9.03 (A)      2.88 (HA)      3 x 10^4
Inactive drugs
Fenbendazole (Anthelmintic)                   -3.33 (I)     /              3 x 10^4
Quinacrine (Anthelmintic/Antimalarial)        -0.93 (I)     /              3 x 10^4
Rimantadine (Antiviral)                       -2.69 (I)     /              35.6
Thiophanate (Anthelmintic)                    -4.04 (I)     /              3 x 10^4
Atovaquone (ref. drug)                        9.37 (A)      3.63 (HA)      57
Primaquine (ref. drug)                        0.91 (A)      1.79 (HA)      75.7

A = active; I = inactive; HA = highly active; NC = not classified.
3. Conclusions

The results outlined here clearly demonstrate that molecular topology (MT) based QSAR has become a powerful tool for the prediction of properties and for the design and selection of novel drugs. Furthermore, the fact that MT consists of a purely mathematical description of molecular structure, independent of any geometrical or physical profile, is also an important asset of the approach. The reasons why MT works with such a level of efficacy remain an open question and probably constitute a good challenge to be addressed in the near future.
4. Acknowledgements

The authors acknowledge financial support from the Fondo de Investigación Sanitaria, Ministerio de Sanidad, Spain (project: SAF2005PI052128). We also thank Prof. Eduardo Castro for his help in our research and his contributions to the field.
5. References

1. Hansch, C., Fujita, T., 1964, J. Amer. Chem. Soc., 86, 1616.
2. Free, S.M., Wilson, J.W., 1964, J. Med. Chem., 7, 395.
3. Cramer, R.D., Patterson, D.E., Bunce, J.D., 1988, J. Am. Chem. Soc., 110, 5959.
4. Güner, O.F., 2002, Curr. Top. Med. Chem., 2, 1321.
5. Klebe, G., Abraham, U., Mietzner, T., 1994, J. Med. Chem., 37, 4130.
6. Robinson, D.D., Winn, P.J., Lyne, P.D., Richards, W.G., 1999, J. Med. Chem., 42, 573.
7. Kier, L.B., Hall, L.H., 1976, Molecular Connectivity in Chemistry and Drug Research, Academic Press, London.
8. Devillers, J., 2000, Current Opinion in Drug Discovery and Development, 3(3), 275.
9. Diudea, M.V., Florescu, M.S., Khadikar, P.V., 2006, Molecular Topology and its Applications, EfiCon Press, Bucarest.
10. Pogliani, L., 2000, Chem. Rev., 100, 3827.
11. Ivanciuc, O., Balaban, A.T., 1998, Tetrahedron, 54, 9129.
12. Hosoya, H., Gotoh, M., Murakami, M., Ikeda, S., 1999, J. Chem. Inf. Comput. Sci., 39, 192.
13. Garcia-Domenech, R., Galvez, J., de Julian-Ortiz, J. V., Pogliani, L., 2008, Chem. Rev., 108(3), 1127.
14. Basak, S. C., Mills, D., Gute, B. D., Natarajan, R., 2006, Topics in Heterocyclic Chemistry, 3, 39.
15. Balaban, A. T., Motoc, I., Bonchev, D., Mekenyan, O., 1983, Topics in Current Chemistry, 114, 21.
16. Estrada, E., Uriarte, E., 2001, Current Medicinal Chemistry, 8(13), 1573.
17. Marrero-Ponce, Y., 2004, Bioorganic & Medicinal Chemistry, 12(24), 6351.
18. Dudek, A.Z., Arodz, T., Galvez, J., 2006, Combinatorial Chemistry & High Throughput Screening, 9(3), 213.
19. Galvez, J., Garcia-Domenech, R., 1994, Farmaindustria, 357.
20. Gálvez, J., García-Domenech, R., Julián-Ortiz, J. V. de, Soler, R., 1995, J. Chem. Inf. Comp. Sci., 35, 272.
21. Wiener, H., 1947, J. Am. Chem. Soc., 69, 17.
22. Kier, L.B., Hall, L.H., 1986, Molecular Connectivity in Structure-Activity Analysis, Wiley, New York.
23. Randić, M., 1975, J. Am. Chem. Soc., 97, 6609.
24. Gálvez, J., García-Domenech, R., Salabert, M. T., Soler, R., 1994, J. Chem. Inf. Comp. Sci., 34, 520.
25. Kier, L. B., Hall, L. M., 1989, Pharm. Res., 6, 497.
26. Furnival, G.M., Wilson, R.W., 1974, Technometrics, 16, 499.
27. Hocking, R.R., 1972, Technometrics, 14, 967.
28. Allen, D.M., 1974, Technometrics, 16, 125.
29. Dixon, W.J., Brown, M.B., Engelman, L., Jennrich, R.I., 1990, BMDP Statistical Software Manual, Vol. I, University of California Press, Berkeley, 33904358.
30. Sanchez-Ferrer, A., Rodriguez-Lopez, J. N., García-Cánovas, F., García-Carmona, F., 1995, Biochim. Biophys. Acta, 1247, 1.
31. Chase, M. R., Raina, K., Bruno, J., Sugumaran, M., 2000, Insect Biochem. Molec., 30, 953.
32. Ashida, M., Brey, P.T., 1995, Proc. Natl. Acad. Sci. U.S.A., 92, 10698.
33. García-Domenech, R., Calvo-Chamorro, M.L., Cuervo-Arias, A.Y., Gómez-Sucerquia, L.J., Ortega-Chávez, V., Pérez-Torrado, E., Gálvez, J., 2008, Afinidad, 538, 430.
34. García-Domenech, R., Domingo-Puig, C., Esteve-Martinez, M.A., Schmitt, J., Vera-Martinez, J., Chindemi, A.L., Galvez, J., 2008, Anales de la Real Academia de Farmacia, 74, 345.
35. Garcia-Domenech, R., Lopez-Peña, W., Sanchez-Perdomo, Y., Sanders, J.R., Sierra-Araujo, M.M., Zapata, C., Galvez, J., 2008, International Journal of Pharmaceutics, 363, 78.
36. Garcia-Domenech, R., Alarcon-Elbal, P., Bolas, G., Bueno-Mari, R., Chorda-Olmos, F. A., Delacour, S. A., Mourino, M. C., Vidal, A., Galvez, J., 2007, SAR and QSAR in Environmental Research, 18, 745.
37. García-Domenech, R., Villanueva, A., Gálvez, J., 2008, Organic Chemistry: An Indian Journal.
38. Kachigan, S.L., 1991, Multivariate Statistical Analysis, Radius Press, New York.
39. McFarland, J.W., Gans, D.J., 1986, Journal of Medicinal Chemistry, 30, 46.
40. Wold, S., Eriksson, L., 1995, Statistical validation of QSAR results. In: Van de Waterbeemd, H. (ed.), Chemometric Methods in Molecular Design, VCH, New York.
41. Tabachnick, B.G., Fidell, L.S., 1990, Using Multivariate Statistics, Harper Collins, New York.
42. World Health Organization. Burdens and Trends in Chagas disease. Available from: www.who.int/ctd/chagas/burdens
43. Hudock, M.P., Sanz-Rodriguez, C.E., Song, Y., Chan, J.M., Zhang, Y., Odeh, S., et al. 2006, J. Med. Chem., 49(1), 215. 44. Racagni, G.E., Machado de Doménech, E.E., 1983, Mol. Biochem. Parasitol., 9(2), 181. 45. García-Domenech, R., Espinoza, N., Galarza, R.F., Moreno-Padilla, M.J., RojasRuiz, B., Roldan-Arroyo, LL., Sanchez-Lavado, M.I., Gálvez, J., 2008, Ars Pharmaceutica, 49(3), . 46. Cheng, C., 1986, General Parasitology, Academic Press College Division. Orlando. 47. Weekly epidemiological record. WHO 2002, (44), 77, 365-372 48. Weekly epidemiological record. WHO 2002, (25), 77, 205-212. 49. Delfín, D.A., Bhattacharjee, A.K., Yakovich, A.J., Werbovetz, K.A., 2006, J. Med. Chem., 49, 4196. 50. Data base: Maybridge Organics Compounds, http://www.maybridge.com 51. Arviza, M.P., 1985, Predicción e interpretación de algunas propiedades fisicoquímicas y biológicas de un grupo de barbitúricos y sulfonamidas por el método de conectividad molecular. Doctoral Thesis, Universidad de Valencia, Spain. 52. Bernal, J., 1988, Desarrollo de un nuevo método de diseño molecular asistido por ordenador. Su aplicación a fármacos betabloqueantes y benzodiazepinas. Doctoral Thesis, Universidad de Valencia, Spain. 53. Gálvez, J., García-Domenech, R., Bernal, J. M., García-March, F. J., 1991, An. Real Acad. Farm., 57, 533. 54. Garcia-Domenech,R., Gálvez, J., Garcia-March, F.J., Moliner, R., 1991, Drug. Invest., 3(5), 344. 55. Gálvez, J., García-Domenech, R., Gómez-Lechón, M.J., Castell, J.V., 2000, J. Mol. Struc. (Theochem), 504, 241. 56. de Gregorio-Alapont, C., García-Domenech, R., Gálvez, J., Ros, M.J., Wolski, S., García, M.D., 2000, Bioorg. Med. Chem. Lett., 10, 2033. 57. Gálvez, J., García-Domenech, R., Gregorio Alapont, C.de, Julián-Ortiz, J. V. de, Salabert-Salvador, M. T., Soler-Roca, R., 1996, In Advances in Molecular Similarity. Vol. I , Carbó-Dorca, R., Mezey, P. G., JAI Press Inc : London. 58. 
Pastor L., García-Domenech, R., de Gregorio Alapont, C., Gálvez, J., 1998, Bioorg. Med. Chem. Lett., 8, 2577. 59. Antón-Fos, G. M., García-Domenech, R., Pérez-Giménez F., Peris-Ribera, J. E., García-March, F., Salabert-Salvador, M. T., 1994, Arzneim-Forsch/Drug Res., 44, 821. 60. De Julian-Ortiz, J.V., Galvez, J., Muñoz-Collado, C., Garcia-Domenech, R., Jimeno-Cardona, C., 1999, J. Med. Chem., 42, 3308. 61. Gálvez, J., Gómez-Lechón, M. J., García-Domenech, R., Castell, J. V., 1996, Bioorg. & Med. Chem. Lett., 6, 2301. 62. Mahmoudi, N., de Julian-Ortiz, J.V., Ciceron, L., Galvez, J., Mazier, D., Danis, D., Derouin, F, Garcia-Domenech, R., 2006, J. Antimic. Chemother., 57, 489. 63. Gozalbes, R., Gálvez, J., García-Domenech, R., Derouin, F., 1999, SAR QSAR Environ. Res., 10, 47.
64. Casabán-Ros, E., Antón-Fos, G. M., Gálvez, J., Duart, M. J., García-Domenech, R., 1999, Quant. Struct.-Act. Relat., 18, 35. 65. 65a: Rios-Santamarina, I., García-Domenech, R., Cortijo, J., Santamaria, P., Morcillo, E.J., Gálvez, J., 2002, Internet Electron. J. Mol. Des., 1, 70. 65b: RíosSantamarina, I., García-Domenech, R., Gálvez, J., Santamaría, P., Cortijo, J., Morcillo, E.J., 1998, Bioorg. Med. Chem. Lett., 8, 477. 66. García-Domenech, R., García-March, F.J., Soler, R.M., Gálvez, J., Antón-Fos, G.M., de Julián-Ortiz, J.V., 1996, Quant. Struct.-Act. Relat., 15, 201. 67. Gálvez, J., García-Domenech, R., de Julián-Ortiz, J.V., Soler, R., 1994, J. Chem. Inf. Comput. Sci., 34, 1198. 68. Jasinski, P., Welsh, B., Galvez, J., Land, D., Zwolak, P., Ghandi, L., Terai, K., Dudek, A. Z., 2008, Investigational New Drugs., 26, 223. 69. Snow, R. W., Guerra, C. A., Noor, A. M., Myint, H. Y., Hay, S. I., 2005, Nature, 434, 214. 70. Mahmoudi, N., Garcia-Domenech, R., Galvez, J., Farhati, K., Franetich, J.F., Sauerwein, R., Hannoun, L., Derouin, F., Danis, M., Mazier, D., 2008, Antimicrobial Agents and Chemotherapy, 52(4), 1215.
Research Signpost 37/661 (2), Fort P.O. Trivandrum-695 023 Kerala, India
QSPR-QSAR Studies on Desired Properties for Drug Design, 2010: 95-116 ISBN: 978-81-308-0404-0 Editor: Eduardo A. Castro
4. QSAR models constructed by optimal descriptors and by multiple regression analysis for the prediction of carbonic anhydrase II inhibitory activity of substituted aromatic sulfonamides
Georgia Melagraki1, Antreas Afantitis1,2, Andrey A. Toropov3, Haralambos Sarimveis1 and Olga Igglessi-Markopoulou1
1 School of Chemical Engineering, National Technical University of Athens
2 Department of Chemoinformatics, NovaMechanics Ltd, Nicosia, Cyprus
3 Uzbek Academy of Science Institute of Geology and Geophysics, Tashkent, Uzbekistan
Abstract. In this work two different approaches are presented for the modeling and prediction of the carbonic anhydrase II (CA II) inhibitory activity of substituted aromatic sulfonamides. The predictive models have been obtained by means of optimal descriptors calculated with the simplified molecular input line entry system (SMILES) and by multiple linear regression analysis (MLR). Both models were shown to describe and predict the CA II inhibitory activity significantly. The statistical results obtained from the two approaches are the following: SMILES based optimal descriptors: n=30, R2=0.76, s=0.252, F=89 (training set); n=17, R2pred=0.81, s=0.210 (test set); MLR: n=30, R2=0.77, Q2=0.66, s=0.268, F=20.8 (training set); n=17, R2pred=0.76, s=0.292 (test set).
Correspondence/Reprint request: Dr. Andrey A. Toropov, Uzbek Academy of Science Institute of Geology and Geophysics, Tashkent, Uzbekistan. E-mail: aatoropov@yahoo.com
Georgia Melagraki et al.
The carbonic anhydrases (CAs) are ubiquitous zinc enzymes, present in prokaryotes and eukaryotes. These enzymes catalyze a very simple physiological reaction, the interconversion between carbon dioxide and the bicarbonate anion, and are thus involved in crucial physiological processes associated with pH control, ion transport and fluid secretion [1]. The CA enzymes found in mammals are divided into four broad subgroups, which in turn consist of several isoforms. Carbonic anhydrase II is a cytosolic isoform and one of the fourteen forms of human α-carbonic anhydrases.
Sulfonamides are an important class of biologically active compounds. Among them, those that inhibit the zinc enzyme carbonic anhydrase have many applications as diuretic, antiglaucoma, antiepileptic or antithyroid drugs. Acetazolamide, methazolamide, ethoxzolamide and dichlorophenamide are among the sulfonamides with carbonic anhydrase (CA, EC 4.2.1.1) inhibitory properties that have been used for therapeutic purposes. More recently, indisulam has entered Phase II clinical trials as an anticancer agent for the treatment of solid tumors [2].
A quantitative structure-activity relationship (QSAR) is a tool for numerically estimating biochemical endpoints of interest for substances for which experimental data are missing. The carbonic anhydrase II inhibitory activity (Ki) of substituted aromatic sulfonamides is an important endpoint from the biochemical and medicinal point of view, and several efforts to model and predict it quantitatively have been reported in the literature. Among other scientists, Supuran and his group have presented a great amount of work in the field [3,4]. A review on the subject [5] indicates the number and importance of these attempts and could be a useful guide to the interested reader.
A summary of the QSAR studies that have recently been conducted in the field is given next. A QSAR study of two sets of carbonic anhydrase inhibitors was presented by A.T. Balaban et al. [6], using a variety of molecular descriptors including topological indices. The first set consisted of 29 benzenesulphonamides and the second set included 35 sulphanilamide Schiff bases; statistically significant models were defined for both sets. P.V. Khadikar et al. [7] presented QSAR studies on 29 benzenesulphonamide carbonic anhydrase inhibitors using distance-based topological indices, and investigated the need to include the hydrophobic parameter (logP) in the topological modeling of CA II inhibition. P.V. Khadikar et al. [8] also used topological and quantum-theoretical descriptors for the prediction of CA II inhibition on a large database of 95 compounds.
Optimal descriptors for aromatic sulfonamides
A dataset of 47 para-substituted aromatic sulphonamides has been investigated by using information topological indices [9]. Multi-parameter models have been proposed with significant accuracy. More recently, P.V. Khadikar has published a study on the para-substituted compounds using information, distance-based and connectivity indices and combinations of these three categories [10]. Their work resulted in a nine-parameter and an eight-parameter model with R2 values of 0.8375 and 0.8343, respectively [10].
In the present study two different techniques are employed for the development of a quantitative relationship between CA II inhibitory activity and sulfonamide structure: optimal descriptors calculated with the simplified molecular input line entry system (SMILES) and multiple linear regression analysis (MLR). The predictive ability of both approaches is tested using various statistical techniques. The utilized database contains the structures of substituted aromatic sulfonamides which are shown in Table 1.
Table 1. Structures and numbers of para-substituted aromatic sulfonamides under consideration.
[Structure drawings of compounds 1-47 are not reproduced here; the SMILES notation of each compound is listed in Table 6.]
2. Method
The two models were developed based on the methodologies described below.

2.1. SMILES based optimal descriptors
Optimal descriptors calculated with SMILES [11-13] utilized in the present study have been calculated as

$$DCW = \prod_{k=1}^{N} CW(SA_k) \tag{1}$$

where $SA_k$ is an attribute of the SMILES. There is the following hierarchy of the SMILES attributes: (i) global attributes; (ii) attributes which contain two characters (for instance, Cl, Br, etc.); and (iii) all others, which contain one SMILES character. The global SMILES attribute is the number of brackets (these are indicators of branching in the molecular structure); this kind of $SA_k$ is denoted by "xxx" (where x is some digit). By the Monte Carlo method one can calculate numerical values of the $CW(SA_k)$ which produce the largest possible correlation coefficient between the endpoint of interest and the optimal descriptor DCW over a training set. Based on the $CW(SA_k)$ numerical data, one can calculate a single-variable model of the following form:

$$K = C_0 + C_1 \cdot DCW \tag{2}$$

One can estimate the predictive potential of the model calculated with Eq. 2 using an external test set. Separation into the training and test sets can be random but must comply with one limitation: the intervals of the KI values for the training and test sets should be similar. The SMILES used in the present study have been obtained with the ACD/ChemSketch software [14].
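The Monte Carlo scheme itself is not spelled out in this section, so the following is only a minimal sketch of one simple variant: a single randomly chosen correlation weight is perturbed at each step, and the change is kept only if the training-set correlation between DCW and the endpoint improves. The function names, the step size and the toy attribute lists are assumptions made for illustration and are not taken from the original work.

```python
import math
import random


def dcw(attributes, cw):
    """Eq. 1: product of the correlation weights of all SMILES attributes."""
    return math.prod(cw[a] for a in attributes)


def pearson_r(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either variance is zero."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def optimize_weights(molecules, endpoint, n_steps=2000, step=0.01, seed=0):
    """Hypothetical random-search optimizer (an assumption, not the authors' code).

    molecules: list of attribute lists, one list per training compound.
    endpoint:  list of experimental values in the same order.
    """
    rng = random.Random(seed)
    attrs = sorted({a for mol in molecules for a in mol})
    cw = {a: 1.0 for a in attrs}          # start from neutral weights
    best = pearson_r([dcw(m, cw) for m in molecules], endpoint)
    for _ in range(n_steps):
        a = rng.choice(attrs)
        old = cw[a]
        cw[a] = old * (1.0 + rng.uniform(-step, step))
        r = pearson_r([dcw(m, cw) for m in molecules], endpoint)
        if r > best:
            best = r                      # keep the improving perturbation
        else:
            cw[a] = old                   # otherwise revert it
    return cw, best


# Toy data (hypothetical): four "molecules" given directly as attribute lists.
toy_molecules = [["C", "O"], ["C", "C", "O"], ["C", "N"], ["C", "C", "N", "O"]]
toy_endpoint = [1.0, 2.0, 0.5, 2.5]
weights, r_train = optimize_weights(toy_molecules, toy_endpoint, n_steps=500, seed=1)
```

Running the optimizer from several seeds would correspond loosely to the separate "probes" of the optimization reported later; the coefficients C0 and C1 of Eq. 2 would then follow from an ordinary least-squares fit of the endpoint against DCW.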
2.2. Multiple linear regression analysis
Multiple linear regression analysis was performed by considering exactly the same compounds that constituted the training and test sets in the SMILES modeling approach. A total of 69 physicochemical constants, topological and structural descriptors (Table 2) were considered as possible input candidates in order to build a Multiple Linear Regression (MLR) model. Before the calculation of the descriptors, all structures were fully optimized using Cambridge Software Mechanics, more specifically the MM2 force field and the Truncated-Newton-Raphson optimizer, which provide a balance between speed and accuracy (Chemoffice Manual). Before calculating the HOMO and LUMO energies (eV), all the structures were additionally fully optimized using the AM1 semi-empirical method. All the descriptors were calculated using Chem3D and Topix [15,16].

Table 2. Calculated descriptors.
ID | Description | Notation
1 | Molar Refractivity | MR
2 | Diameter | Diam
3 | Partition Coefficient (Octanol Water) | ClogP
4 | Molecular Topological Index | TIndx
5 | Principal Moment of Inertia Z | PMIZ
6 | Number of Rotatable Bonds | NRBo
7 | Principal Moment of Inertia Y | PMIY
8 | Polar Surface Area | PSAr
9 | Principal Moment of Inertia X | PMIX
10 | Radius | Rad
11 | LUMO Energy | LUMO
12 | Shape attribute | ShpA
13 | HOMO Energy | HOMO
14 | Shape coefficient | ShpC
15 | Balaban Index | BIndx
16 | Sum of Valence Degrees | SVDe
17 | Cluster Count | ClsC
18 | Total Connectivity | TCon
19 | Wiener Index | WIndx
20 | Total Valence Connectivity | TVCon
21 | DistEqTotal | DistEqTotal
22 | Randic 0 | Chi0
23 | Randic 1 | Chi1
24 | Randic 2 | Chi2
25 | Randic 3 | Chi3
26 | Randic 4 | Chi4
27 | Randic Information 0 | ChiInf0
28 | Randic Information 1 | ChiInf1
29 | Randic Information 2 | ChiInf2
30 | Randic Information 3 | ChiInf3
31 | Randic Information 4 | ChiInf4
32 | Molecular Weight | MW
33 | Randic Mod | ChiMod
34 | Xu1 | Xu1
35 | Xu2 | Xu2
36 | Xu3 | Xu3
37 | Balaban Topological | TopoJ
38 | Number of Branches | NBranch
39 | Number of Rings | NRings
40 | Wiener Dim | Wiener Dim
41 | Bertz | Bertz
42 | AtomCompMean | AtomCompMean
43 | AtomCompTot | AtomCompTot
44 | Zagreb1 | Zagreb1
45 | Zagreb2 | Zagreb2
46 | Kappa1 | Kappa1
47 | Kappa2 | Kappa2
48 | Kappa3 | Kappa3
49 | Wiener Distance | WienerDistCode
50 | Polarity | Polarity
51 | DistEqMean | DistEqMean
52 | Quadratic | Quadr
53 | InfMagnitDistTot | InfMagnitDistTot
54 | Schultz | ScHultz
55 | Gordon | Gordon
56 | Kier-Hall 0 | Ki0
57 | Kier-Hall 1 | Ki1
58 | Kier-Hall 2 | Ki2
59 | Kier-Hall 3 | Ki3
60 | Kier-Hall 4 | Ki4
61 | Kier-Hall Information 0 | KiInf0
62 | Kier-Hall Information 1 | KiInf1
63 | Kier-Hall Information 2 | KiInf2
64 | Kier-Hall Information 3 | KiInf3
65 | Kier-Hall Information 4 | KiInf4
66 | Randic Cluster 3 | ChiCl3
67 | Randic Cluster 4 | ChiCl4
68 | Wiener Information | InfWiener
69 | Wiener Index | WIndx
Our first objective was to determine the variables which produce the most significant linear QSAR models linking the structure of the compounds with their inhibitory activity. The Elimination Selection-Stepwise Regression (ES-SWR) algorithm was used on the training data set to select the most appropriate descriptors. ES-SWR is a popular stepwise technique [17] that combines Forward Selection (FS-SWR) and Backward Elimination (BE-SWR). The accuracy of the produced MLR model was assessed using the following evaluation techniques: the leave-one-out cross-validation procedure and validation through an external test set. According to Tropsha's group [19-21], a QSAR model is considered predictive if the following conditions are satisfied:

$$R^2_{pred} > 0.6 \tag{3}$$

$$\frac{R^2 - R_o^2}{R^2} < 0.1 \quad \text{or} \quad \frac{R^2 - R_o'^2}{R^2} < 0.1 \tag{4}$$

$$0.85 \le k \le 1.15 \quad \text{or} \quad 0.85 \le k' \le 1.15 \tag{5}$$

In Eqs. (4) and (5), $R^2$ is the correlation coefficient between experimental values and model predictions on the test set. The mathematical definitions of $R_o^2$, $R_o'^2$, $k$ and $k'$ are based on the regression of the observed activities against the predicted activities and the opposite (regression of the predicted activities against the observed activities). The definitions are presented clearly in [21] and are not repeated here for brevity.
Moreover, in order for a QSAR model to be used for screening new compounds, its domain of application must be defined, and only predictions for compounds that fall into this domain may be considered reliable. Extent of Extrapolation is one simple approach to define the applicability domain. It is based on the calculation of the leverage $h_i$ [21-22] for each chemical for which the QSAR model is used to predict the activity:

$$h_i = x_i (X^T X)^{-1} x_i^T \tag{6}$$

In Eq. (6), $x_i$ is the row vector containing the k model parameters of the query compound and $X$ is the n x k matrix containing the k model parameters for each one of the n training compounds. A leverage value greater than 3k/n is considered large. It means that the predicted response is the result of a substantial extrapolation of the model and may not be reliable.
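As an illustration of Eq. (6), the leverage of a query compound can be computed without forming the inverse explicitly, by solving the linear system (X^T X) z = x^T and taking h = x z. The sketch below uses only the Python standard library; the helper names and the small toy matrix are assumptions for illustration. Whether the descriptor matrix of the original work also contained a constant column is not stated here, so the example makes no assumption about it.

```python
def solve_linear(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular (collinear descriptors)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z


def leverage(x_query, x_train):
    """Eq. 6: h = x (X^T X)^-1 x^T for one query row x."""
    k = len(x_query)
    xtx = [[sum(row[i] * row[j] for row in x_train) for j in range(k)]
           for i in range(k)]
    z = solve_linear(xtx, list(x_query))
    return sum(xi * zi for xi, zi in zip(x_query, z))


def warning_leverage(k, n):
    """Threshold 3k/n above which a prediction is treated as extrapolation."""
    return 3.0 * k / n


# Toy design matrix (hypothetical values): n = 3 compounds, k = 2 descriptors.
X_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_toy = [leverage(row, X_toy) for row in X_toy]
limit = warning_leverage(k=4, n=30)  # four descriptors, 30 training compounds
```

With k = 4 descriptors and n = 30 training compounds the threshold 3k/n equals 0.40, which is the warning limit quoted for the MLR model in the results.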
3. Results and discussion
In general, the models produced by the two different approaches exhibited very good statistical performance for both the training and the test set. Detailed results for each method are presented below.

3.1. SMILES based optimal descriptors
The statistical characteristics of the models for logKI over three probes of the Monte Carlo optimization are shown in Table 3. One can see that the statistical quality of these models is quite good for both the training and test sets. Table 4 contains numerical data on the CW(SAk). An example of the calculation of the DCW is given in Table 5. The one-variable model for logKI obtained in the first probe of the optimization is the following:

$$\log K_I = -19.444\,(\pm 0.3418) + 16.449\,(\pm 0.2672) \cdot DCW \tag{7}$$

n=30, R2=0.7612, s=0.252, F=89 (training set)
n=17, R2=0.8536, s=0.210 (test set)

Table 6 depicts the separation of the compounds into training and validation sets. The same table presents the experimental logKI values and the respective predicted values calculated by Eq. 7. Graphically, the model for the training and test sets is shown in Figs. 1 and 2, respectively.

Table 3. Statistical characteristics of QSAR based on optimal descriptors obtained in three probes of the Monte Carlo optimization.
Probe | Training set (n=30): R2 | s | F | Test set (n=17): R2 | R2pred* | s | F
1 | 0.7612 | 0.252 | 89 | 0.8536 | 0.8148 | 0.210 | 87
2 | 0.7604 | 0.252 | 89 | 0.8913 | 0.8645 | 0.196 | 123
3 | 0.7606 | 0.252 | 89 | 0.8768 | 0.8447 | 0.205 | 107

*)
$$R^2_{pred} = 1 - \frac{\sum_{k=1}^{n} (Ey_k - Cy_k)^2}{\sum_{k=1}^{n} (Ey_k - Ay)^2}$$
where $Ey_k$ and $Cy_k$ are the experimental and calculated values of the inhibitory activity, respectively; $Ay$ is the average value of the inhibitory activity on the given set (i.e., training or test set); and n is the number of compounds in the set.
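The footnote formula can be checked directly against the tabulated data. The sketch below implements it as written and applies it to the 17 test-set compounds, using the experimental and calculated logKI values as they are printed in Table 6 (three decimals). The function and variable names are illustrative assumptions.

```python
def r2_pred(experimental, calculated):
    """R2pred = 1 - sum((Ey - Cy)^2) / sum((Ey - Ay)^2), Ay = mean of Ey."""
    n = len(experimental)
    mean_exp = sum(experimental) / n
    ss_res = sum((e - c) ** 2 for e, c in zip(experimental, calculated))
    ss_tot = sum((e - mean_exp) ** 2 for e in experimental)
    return 1.0 - ss_res / ss_tot


# Test-set values copied from Table 6 (compounds 9, 12, 14, 16, 19, 21, 22,
# 23, 28, 31, 33, 37, 41, 42, 43, 46, 47), rounded as printed there.
test_exp = [1.176, 0.991, 0.959, 1.881, 2.365, 2.412, 2.330, 2.362, 2.021,
            1.690, 1.447, 2.505, 2.041, 1.602, 1.845, 2.097, 2.041]
test_calc = [1.220, 1.171, 0.982, 1.814, 2.292, 2.067, 2.251, 2.067, 1.817,
             1.543, 1.503, 2.030, 1.929, 1.610, 2.218, 2.054, 2.034]

r2_test = r2_pred(test_exp, test_calc)
print(f"R2pred (test set, probe 1) = {r2_test:.3f}")
```

Because the tabulated values are rounded, the result should be close to, but not necessarily identical with, the R2pred reported for probe 1 in Table 3.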
Table 4. Numerical data for correlation weights of the SAk obtained in three probes of the Monte Carlo optimization.
SMILES attribute, SAk | CW(SAk) in probe 1 | CW(SAk) in probe 2 | CW(SAk) in probe 3
(003________ | 0.9946079 | 0.9961606 | 0.9735016
(004________ | 1.0156395 | 1.0198094 | 1.0201966
(005________ | 1.0091059 | 1.0038481 | 1.0160863
(006________ | 1.0136803 | 1.0028883 | 1.0291028
(007________ | 1.0258007 | 1.0135517 | 1.0612002
(008________ | 1.0010987 | 0.9715683 | 1.0188980
(___________ | 0.9989675 | 1.0022593 | 0.9949041
1___________ | 1.1835983 | 1.2239750 | 1.2992088
2___________ | 1.0020425 | 0.9990315 | 0.9966276
3___________ | 1.0074892 | 1.0069452 | 1.0070966
=___________ | 0.9994141 | 1.0018132 | 1.0053143
C___________ | 0.9990570 | 0.9983833 | 0.9967137
Br__________ | 0.9751599 | 0.9621277 | 0.9488951
Cl__________ | 0.9899030 | 0.9845061 | 0.9791878
F___________ | 0.9963299 | 0.9940541 | 0.9922458
I___________ | 1.0033072 | 0.9969325 | 1.0000845
N___________ | 0.9946235 | 0.9921148 | 0.9892841
O___________ | 0.9913011 | 0.9847984 | 0.9797559
S___________ | 0.9975581 | 0.9953914 | 0.9907505
c___________ | 0.9963897 | 0.9959790 | 0.9952362
The SMILES based descriptors can be calculated with the correlation weights (Table 4) for an arbitrary substance (more exactly, for an arbitrary SMILES). Unfortunately, criteria for a rational definition of the applicability domain for such models do not exist. There are several pieces of software that generate SMILES, and the choice can influence the statistical quality of the SMILES-based models. It is important that all SMILES are generated by the same software (using SMILES prepared by different pieces of software is not correct). Taking into account the growing number of SMILES-oriented databases on the Internet, one can conclude that the approach (SMILES-based descriptors) is promising.
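The calculation described above can be sketched in a few lines. The parsing convention used here is inferred from Table 5 rather than stated in the text: two-character attributes (Cl, Br) are read first, opening and closing brackets both map to the attribute "(", and the global attribute is "(" followed by the three-digit count of opening brackets. The function names are assumptions; the weights are the probe-1 values of Table 4 and the regression coefficients are those of Eq. 7.

```python
import math

# Probe-1 correlation weights from Table 4 (the "_" padding is dropped).
CW_PROBE1 = {
    "(003": 0.9946079, "(004": 1.0156395, "(005": 1.0091059,
    "(006": 1.0136803, "(007": 1.0258007, "(008": 1.0010987,
    "(": 0.9989675, "1": 1.1835983, "2": 1.0020425, "3": 1.0074892,
    "=": 0.9994141, "C": 0.9990570, "Br": 0.9751599, "Cl": 0.9899030,
    "F": 0.9963299, "I": 1.0033072, "N": 0.9946235, "O": 0.9913011,
    "S": 0.9975581, "c": 0.9963897,
}


def smiles_attributes(smiles):
    """Split a SMILES string into attributes (convention inferred from Table 5)."""
    tokens = []
    i = 0
    while i < len(smiles):
        two = smiles[i:i + 2]
        if two in ("Cl", "Br"):          # two-character attributes first
            tokens.append(two)
            i += 2
        else:
            ch = smiles[i]
            tokens.append("(" if ch == ")" else ch)  # both brackets count as "("
            i += 1
    tokens.append("(%03d" % smiles.count("("))       # global bracket attribute
    return tokens


def dcw(smiles, cw=CW_PROBE1):
    """Eq. 1; raises KeyError for attributes that have no tabulated weight."""
    return math.prod(cw[t] for t in smiles_attributes(smiles))


def log_ki(smiles):
    """Eq. 7 with the probe-1 coefficients."""
    return -19.444 + 16.449 * dcw(smiles)


compound_1 = "O=S(N)(=O)c1ccc(cc1)C(O)=O"
print(f"DCW = {dcw(compound_1):.7f}, logKI = {log_ki(compound_1):.3f}")
```

Under this convention the sketch reproduces the DCW values listed in Table 6 for compounds 1, 2 and 15 to within rounding. A SMILES containing an attribute that is absent from Table 4 has no weight, so the function deliberately fails instead of guessing one.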
Table 5. Example of the calculation of the DCW with the CW(SAk) obtained in the first probe of the Monte Carlo optimization for the first SMILES of the training set: SMILES="O=S(N)(=O)c1ccc(cc1)C(O)=O"; DCW = 1.3194129.
SAk | CW(SAk)
O___________ | 0.9913011
=___________ | 0.9994141
S___________ | 0.9975581
(___________ | 0.9989675
N___________ | 0.9946235
(___________ | 0.9989675
(___________ | 0.9989675
=___________ | 0.9994141
O___________ | 0.9913011
(___________ | 0.9989675
c___________ | 0.9963897
1___________ | 1.1835983
c___________ | 0.9963897
c___________ | 0.9963897
c___________ | 0.9963897
(___________ | 0.9989675
c___________ | 0.9963897
c___________ | 0.9963897
1___________ | 1.1835983
(___________ | 0.9989675
C___________ | 0.9990570
(___________ | 0.9989675
O___________ | 0.9913011
(___________ | 0.9989675
=___________ | 0.9994141
O___________ | 0.9913011
(004________ | 1.0156395

3.2. Multiple linear regression analysis
The ES-SWR algorithm was applied on the set of training data, by considering all 69 available descriptors as potential input parameters to the model. The separation of the dataset into training and validation sets was performed according to the popular Kennard and Stone algorithm [18]. The advantages of this algorithm are that the calibration samples map the measured region of the input variable space completely with respect to the induced metric and that the test samples all fall inside the measured region. The validation data were not involved by any means in the process of selecting the most appropriate descriptors or in the development of the QSAR model. They were considered as a completely unknown external set of data, which was used only to test the accuracy of the produced model.
The most significant descriptors according to the ES-SWR algorithm are: lipophilicity (ClogP), KiInf2, ChiInf1 and the Bertz index. Table 7 presents the correlation matrix, where it is clear that the four selected descriptors are not highly correlated. Lipophilicity is known to be important for absorption, permeability and in vivo distribution of organic compounds [17] and has been used as a physicochemical descriptor in QSARs with great success.
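The Kennard and Stone selection mentioned above for splitting the data can be sketched as follows. The algorithm starts from the two most distant samples and then repeatedly adds the sample whose nearest already-selected neighbour is farthest away. The Euclidean metric, the function name and the toy coordinates are assumptions made for illustration; the metric and descriptor scaling actually used in the original work are not stated here.

```python
import math


def kennard_stone(points, n_select):
    """Return the indices of n_select points chosen by the Kennard-Stone rule."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of points")
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Step 1: start with the two most distant points.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda ij: dist[ij[0]][ij[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in (first, second)]

    # Step 2: add the candidate whose nearest selected point is farthest away.
    while len(selected) < n_select:
        nxt = max(remaining, key=lambda i: min(dist[i][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


# Toy descriptor space (hypothetical values): five compounds on a line.
toy_points = [(0.0, 0.0), (10.0, 0.0), (5.0, 0.0), (1.0, 0.0), (9.0, 0.0)]
training_idx = kennard_stone(toy_points, 3)
test_idx = [i for i in range(len(toy_points)) if i not in training_idx]
```

In the toy example the two end points are chosen first and the mid point third, so the remaining points lie inside the region spanned by the selected ones, which is the property the text attributes to the algorithm.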
Table 6. Experimental and calculated values (with Eq. 7) of the carbonic anhydrase II inhibitory activity (logKI) of para-substituted aromatic sulfonamides using the SMILES approach.
No. | Structure (SMILES) | DCW | Expr | Calc | Expr-Calc
Training set
1 | O=S(N)(=O)c1ccc(cc1)C(O)=O | 1.3194129 | 2.412 | 2.260 | 0.152
2 | O=S(N)(=O)c1ccc(cc1)C(=O)NN | 1.3167175 | 2.093 | 2.216 | -0.123
3 | Clc2ccc(NC(=O)NNC(=O)c1ccc(cc1)S(N)(=O)=O)cc2Cl | 1.2404880 | 1.114 | 0.962 | 0.152
4 | O=C(Nc1ccc(cc1)C(C)=O)NNC(=O)c2ccc(cc2)S(N)(=O)=O | 1.2641648 | 1.176 | 1.351 | -0.175
5 | O=C(Nc1ccc(cc1)C(=O)OCC)NNC(=O)c2ccc(cc2)S(N)(=O)=O | 1.2519862 | 0.954 | 1.151 | -0.197
6 | Brc2ccc(NC(=O)NNC(=O)c1ccc(cc1)S(N)(=O)=O)cc2 | 1.2344774 | 0.863 | 0.863 | 0.000
7 | O=C(Nc1ccc(cc1)c2ccccc2)NNC(=O)c3ccc(cc3)S(N)(=O)=O | 1.2573710 | 1.041 | 1.240 | -0.199
8 | NS(=O)(=O)c1ccc(cc1)C(=O)NNC(=O)Nc3ccc(Oc2ccccc2)cc3 | 1.2464332 | 1.255 | 1.060 | 0.195
10 | O=S(C1=CC=C(C(NNC(NS(=O)(C2=CC=CC=C2)=O)=O)=O)C=C1)(N)=O | 1.2880059 | 1.826 | 1.743 | 0.083
11 | O=S(C1=CC=C(C(NNC(NS(=O)(C2=CC=CC=C2C)=O)=O)=O)C=C1)(N)=O | 1.2867913 | 1.732 | 1.723 | 0.009
13 | O=S(C1=CC=C(C(NNC(NS(=O)(C2=CC=C(F)C=C2)=O)=O)=O)C=C1)(N)=O | 1.2497919 | 0.978 | 1.115 | -0.137
15 | Fc1cc(ccc1NN)S(N)(=O)=O | 1.3006607 | 1.708 | 1.952 | -0.244
17 | O=S(N)(=O)c1ccc(NC(C)=O)cc1 | 1.3225867 | 2.391 | 2.312 | 0.079
18 | O=S(N)(=O)c1ccc(NC(=O)C(F)(F)F)cc1 | 1.3001713 | 2.124 | 1.944 | 0.180
20 | O=S(N)(=O)c1ccc(NC(=O)CCC)cc1 | 1.3200935 | 2.356 | 2.271 | 0.085
24 | O=S(N)(=O)c1ccc(NC(=O)CCCCC)cc1 | 1.3176050 | 1.799 | 2.230 | -0.431
25 | O=S(N)(=O)c2ccc(NC(=O)c1ccccc1)cc2 | 1.3007132 | 1.568 | 1.952 | -0.384
26 | Fc1c(c(F)c(F)c(F)c1F)C(=O)Nc2ccc(cc2)S(N)(=O)=O | 1.2483770 | 1.230 | 1.092 | 0.138
27 | O=C(Nc1ccccc1)Nc2ccc(cc2)S(N)(=O)=O | 1.2937200 | 2.380 | 1.837 | 0.543
29 | O=C(Nc1ccccc1)NCCc2ccc(cc2)S(N)(=O)=O | 1.2912812 | 1.875 | 1.797 | 0.078
30 | Clc2ccc(NC(=O)NCCc1ccc(cc1)S(N)(=O)=O)cc2Cl | 1.2546021 | 1.114 | 1.194 | -0.080
32 | NS(=O)(=O)c2ccc(CNS(=O)(=O)c1ccccc1)cc2 | 1.2745907 | 1.602 | 1.523 | 0.079
34 | O=S(=O)(Nc1ccc(cc1)S(N)(=O)=O)c2ccc(F)cc2 | 1.2742383 | 0.954 | 1.517 | -0.563
35 | CC(=O)Nc1ccc(cc1)S(=O)(=O)NCCc2ccc(cc2)S(N)(=O)=O | 1.2678946 | 1.875 | 1.413 | 0.462
36 | O=S(N)(=O)c1ccc(N)cc1 | 1.3125085 | 2.477 | 2.147 | 0.330
38 | NCc1ccc(cc1)S(N)(=O)=O | 1.3112708 | 2.230 | 2.126 | 0.104
39 | NCCc1ccc(cc1)S(N)(=O)=O | 1.3100342 | 2.204 | 2.106 | 0.098
40 | Fc1cc(ccc1N)S(N)(=O)=O | 1.3076914 | 1.778 | 2.067 | -0.289
44 | Clc1c(cc(c(N)c1Cl)S(N)(=O)=O)S(N)(=O)=O | 1.2811755 | 1.447 | 1.631 | -0.184
45 | O=S(N)(=O)c1cc(c(N)cc1Cl)S(N)(=O)=O | 1.2815962 | 1.875 | 1.638 | 0.237
Test set
9 | O=C(Nc1ccc(cc1)Cc2ccccc2)NNC(=O)c3ccc(cc3)S(N)(=O)=O | 1.2561853 | 1.176 | 1.220 | -0.044
12 | O=S(C1=CC=C(C(NNC(NS(=O)(C2=CC=C(C)C=C2)=O)=O)=O)C=C1)(N)=O | 1.2532128 | 0.991 | 1.171 | -0.180
14 | O=S(C1=CC=C(C(NNC(NS(=O)(C2=CC=C(Cl)C=C2)=O)=O)=O)C=C1)(N)=O | 1.2417301 | 0.959 | 0.982 | -0.023
16 | NNc1ccc(cc1Cl)S(N)(=O)=O | 1.2922707 | 1.881 | 1.814 | 0.067
19 | O=S(N)(=O)c1ccc(NC(=O)CC)cc1 | 1.3213395 | 2.365 | 2.292 | 0.073
21 | O=S(N)(=O)c1ccc(NC(=O)CC(C)C)cc1 | 1.3076600 | 2.412 | 2.067 | 0.345
22 | O=S(N)(=O)c1ccc(NC(=O)CCCC)cc1 | 1.3188486 | 2.330 | 2.251 | 0.079
23 | O=S(N)(=O)c1ccc(NC(=O)CC(C)C)cc1 | 1.3076600 | 2.362 | 2.067 | 0.295
28 | O=C(Nc1ccccc1)NCc2ccc(cc2)S(N)(=O)=O | 1.2925000 | 2.021 | 1.817 | 0.204
31 | O=S(=O)(Nc1ccc(cc1)S(N)(=O)=O)c2ccccc2 | 1.2757938 | 1.690 | 1.543 | 0.147
33 | NS(=O)(=O)c2ccc(CCNS(=O)(=O)c1ccccc1)cc2 | 1.2733888 | 1.447 | 1.503 | -0.056
37 | NNc1ccc(cc1)S(N)(=O)=O | 1.3054518 | 2.505 | 2.030 | 0.475
41 | Nc1ccc(cc1Cl)S(N)(=O)=O | 1.2992561 | 2.041 | 1.929 | 0.112
42 | Nc1ccc(cc1Br)S(N)(=O)=O | 1.2799056 | 1.602 | 1.610 | -0.008
43 | Ic1cc(ccc1N)S(N)(=O)=O | 1.3168492 | 1.845 | 2.218 | -0.373
46 | OCc1ccc(cc1)S(N)(=O)=O | 1.3068906 | 2.097 | 2.054 | 0.043
47 | OCCc1ccc(cc1)S(N)(=O)=O | 1.3056582 | 2.041 | 2.034 | 0.007
Figure 1. Plot of experimental versus calculated (with Eq. 7) values of logKI for the training set (n=30; axes: logK(calc.) versus logK(expr.)).
Figure 2. Plot of experimental versus calculated (with Eq. 7) values of logKI for the test set (n=17; axes: logK(calc.) versus logK(expr.)).

Information indices such as KiInf2 and ChiInf1 are graph theoretical invariants that view the molecular graph as a source of different probability distributions to which information theory definitions can be applied. They can be considered as a quantitative measure of the lack of structural homogeneity or the diversity of a graph, in this way being related to the symmetry associated with the structure [17]. Bertz's complexity index [17], the most popular complexity index, takes into account both the variety of kinds of bond connectivities and the atom types of the H-depleted molecular graph.

Table 7. Correlation matrix for the four selected descriptors.
 | KiInf2 | ClogP | ChiInf1 | Bertz
KiInf2 | 1 | | |
ClogP | 0.301 | 1 | |
ChiInf1 | 0.208 | -0.439 | 1 |
Bertz | 0.511 | 0.273 | 0.239 | 1

The produced MLR QSAR model is the following:

$$\log K_I = 2.25 + 0.0792\,\mathrm{KiInf2} - 0.189\,\mathrm{ClogP} - 0.0654\,\mathrm{ChiInf1} - 0.00457\,\mathrm{Bertz} \tag{8}$$

n = 30; R2 = 0.77; F = 20.8; RMS = 0.27; Q2 = 0.66

Eq. 8 was used to predict the binding affinity for the validation examples. The results are presented in Table 8 and correspond to R2pred = 0.76. The results illustrate that the linear MLR technique combined with a successful variable selection procedure is adequate to generate an efficient QSAR model for predicting the binding affinity of different compounds. The proposed model (Eq. 8) passed all the tests for predictive ability (Eqs. 3-5):

$$R^2_{pred} = 0.76 > 0.6$$

$$\frac{R^2 - R_o^2}{R^2} = -0.2938 < 0.1, \qquad \frac{R^2 - R_o'^2}{R^2} = -0.3148 < 0.1$$

$$k = 0.98 \quad \text{and} \quad k' = 1.000$$

Graphically, the MLR QSAR model for the training and test sets is shown in Figs. 3 and 4, respectively.
112
Georgia Melagraki et al.
Table 8. Experimental and calculated values (with Eq. 8) of the carbonic anhydrase II inhibitory activity (logKI) of para-substituted aromatic sulfonamides using MLR. Exp log(1/IC50)
Calc log(1/IC50)
Expr-Calc
Leverages (limit 0.40)
Training Set 1 2 3 4 5 6 7 8 10 11 13 15 17 18 20 24 25 26 27 29 30 32 34 35 36 38 39 40 44 45
2.4116 2.0934 1.1139 1.1761 0.9542 0.8633 1.0414 1.2553 1.8261 1.7324 0.9777 1.7076 2.3909 2.2124 2.3560 1.7993 1.5682 1.2304 2.3802 1.8751 1.1139 1.6021 0.9542 1.8751 2.4771 2.2304 2.2041 1.7782 1.4472 1.8751
2.1097 2.1725 0.9494 1.4226 1.0316 1.1960 1.1208 1.1675 1.7162 1.4906 1.3180 1.8862 2.3700 1.9471 2.1641 1.8912 1.9853 0.8678 1.9564 2.0653 1.1035 1.8683 1.2699 1.4038 2.1284 2.2944 2.2505 2.0849 1.5128 1.7782
-0.3019 0.0791 -0.1645 0.2465 0.0774 0.3327 0.0794 -0.0878 -0.1099 -0.2418 0.3403 0.1786 -0.0209 -0.2653 -0.1919 0.0919 0.4171 -0.3626 -0.4238 0.1902 -0.0104 0.2662 0.3157 -0.4713 -0.3487 0.0640 0.0464 0.3067 0.0656 -0.0969
0.2037 0.1677 0.1352 0.0678 0.1275 0.0879 0.1442 0.1492 0.2664 0.1898 0.2918 0.2000 0.1188 0.1642 0.1310 0.1490 0.0846 0.2203 0.1534 0.2247 0.2281 0.1185 0.1066 0.0817 0.3687 0.2394 0.2058 0.1388 0.1154 0.1196
Test Set 9 12 14 16 19 21 22 23 28 31
1.1761 0.9912 0.9590 1.8808 2.3655 2.4116 2.3304 2.3617 2.0212 1.6902
1.2650 1.3982 1.1897 1.7579 2.2019 2.0581 2.0288 2.1901 2.1660 1.8147
0.0889 0.4070 0.2307 -0.1229 -0.1636 -0.3535 -0.3016 -0.1716 0.1448 0.1245
0.1348 0.1556 0.2178 0.2479 0.0926 0.1609 0.1313 0.1329 0.2466 0.0939
No.
Optimal descriptors for aromatic sulfonamides
113
Table 8. Continued 33 37 41 42 43 46 47
1.4472 2.5051 2.0414 1.6021 1.8451 2.0969 2.0414
1.6994 2.2284 1.9386 1.8936 1.8463 2.4382 2.3964
0.2522 -0.2767 -0.1028 0.2915 0.0012 0.3413 0.3550
0.1310 0.1462 0.1648 0.1803 0.2004 0.1847 0.1736
Training set, n=30 2.5
logK(expr.)
2
1.5
1
0.5 0.5
1
1.5
2
2.5
logK(calc.)
Figure 3. Plot of experimental versus calculated (with Eq.8) values of logKI for the training set.
[Figure 4: scatter plot for the test set, n=17; axes: logK(calc.) and logK(expr.)]
Figure 4. Plot of experimental versus calculated (with Eq. 8) values of logKI for the test set.

The extrapolation method was applied to the compounds that constitute the test set. The leverages for all 17 compounds were computed (Table 9). All 17 compounds in the test set fall inside the domain of the model (warning leverage limit 0.40). The MLR model provides qualitative and quantitative information on the dependency between the descriptors and the CA II inhibitory activity. The model proposed on the basis of the MLR methodology is simple, requires the calculation of only four descriptors, and is easily interpretable in terms of the impact that each descriptor has on the activity. A medicinal chemist can derive useful conclusions about the structure-activity relationship from the positive or negative sign and the magnitude of each contribution. The accuracy of the model demonstrates the efficiency of the produced QSAR model, which can be used for obtaining accurate predictions and for drawing safe conclusions.
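The leverage check described above can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes the usual definition h = x^T (X^T X)^-1 x, with X built from the training descriptors, and it assumes the warning limit follows the convention h* = 3k/n, which reproduces the quoted 0.40 for k = 4 descriptors and n = 30 training compounds. The function names and the small data set are hypothetical.

```python
def solve(a, b):
    """Solve a·x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; descriptors are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def leverages(train, query):
    """Leverage h = x^T (X^T X)^-1 x for each row x of `query`; X holds the training rows."""
    p = len(train[0])
    xtx = [[sum(row[i] * row[j] for row in train) for j in range(p)] for i in range(p)]
    return [sum(u * v for u, v in zip(x, solve(xtx, x))) for x in query]


def warning_leverage(n_descriptors, n_training):
    """Assumed convention (not stated in this section): h* = 3k/n."""
    return 3.0 * n_descriptors / n_training


# Hypothetical descriptor values, for illustration only.
train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = leverages(train, train)
limit = warning_leverage(4, 30)  # 4 descriptors, 30 training compounds
inside_domain = [value <= limit for value in h]
```

A compound whose leverage exceeds the limit would be treated as lying outside the domain of the model, so its prediction would be an extrapolation.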
4. Conclusions

From the above discussion, it can be seen that both approaches are statistically meaningful and that the predicted values agree with the experimental ones. The SMILES approach gives a correlation coefficient of 0.81 and a standard error of 0.210 for the test set, slightly better than the MLR model, whose corresponding statistical indices are 0.76 and 0.292. The differences between the experimental and calculated inhibition activities are shown in the figures provided in this work for both the SMILES and the MLR models. These figures show that the residuals produced by the SMILES model are in general lower than those produced by the MLR model, so the prediction results of the SMILES model are better than those of the MLR model.
The MLR model requires the calculation of only four descriptors, which can be easily computed and interpreted in terms of their influence on the CA II inhibitory activity. Based on the MLR equation, which indicates the dependence of the inhibitory activity on the descriptors and the extent of their influence, various structural modifications can be proposed for the design of novel structures with desired characteristics.
Research Signpost 37/661 (2), Fort P.O. Trivandrum-695 023 Kerala, India
QSPR-QSAR Studies on Desired Properties for Drug Design, 2010: 117-165 ISBN: 978-81-308-0404-0 Editor: Eduardo A. Castro
5. From molecular structure to molecular design through the Molecular Descriptors Family Methodology

Sorana D. Bolboacă1 and Lorentz Jäntschi2

1 “Iuliu Hatieganu” University of Medicine and Pharmacy Cluj-Napoca, 6 Louis Pasteur, 400349 Cluj-Napoca, Romania
2 Technical University of Cluj-Napoca, 103-105 Muncii Bvd, 400641 Cluj-Napoca, Romania
Abstract. The Molecular Descriptors Family on the Structure-Property/Activity Relationships (MDF SPR/SAR) is an area of computational research that generates a family of molecular descriptors and builds models in order to estimate and predict the property/activity of chemical compounds. This review aims to briefly present the MDF SPR/SAR methodology and to discuss its abilities to estimate and predict different properties and activities.
Correspondence/Reprint request: Dr. Sorana D. Bolboacă, “Iuliu Hatieganu” University of Medicine and Pharmacy Cluj-Napoca, 6 Louis Pasteur, 400349 Cluj-Napoca, Romania. E-mail: sbolboaca@umfcluj.ro

Introduction

Structure-Activity Relationships (SARs), Structure-Property Relationships (SPRs) and Property-Activity Relationships (PARs) were first introduced by Louis Plack Hammett in 1937 [1]. A more recent review summarizes the most important applications of Hammett’s equation [2]. The idea of linking the structure of a compound with its activity or property was published before the introduction of the SAR concept. In 1868,
Crum-Brown and Fraser noted that the activity of a compound is a function of its chemical composition and structure [3]. In 1893, Richet and Seancs demonstrated that cytotoxicity was inversely related to water solubility for a set of organic compounds [4]. In 1899, Meyer demonstrated that the narcotic action of a sample of organic compounds was related to their solubility in olive oil [5]. Therefore, Hammett [6] and Taft [7] could be said to have laid the basis of QSAR/QSPR development.

Quantitative relationships (QSAR, QSPR, QPAR), which are mathematical approaches that link the structure and the property/activity of chemical compounds in a quantitative manner [8], are applied when the property and/or activity are quantitative. Note that not all the properties and activities of chemical compounds can be classified as quantitative. An example is the sweetness of sugar (one of the five basic tastes, almost universally perceived as a pleasurable experience), which can be appreciated only through comparison (on a relative scale) since there is no single reference scale (such as the boiling and freezing points used for the Celsius temperature scale). Properties that are expressed quantitatively may have several reference scales. Consequently, the use of terms such as QSAR, QSPR, and QPAR is currently avoided, the terms (Q)SAR, (Q)SPR, and (Q)PAR, or simply SAR, SPR, and PAR, being preferred.

As far as the structure is concerned, things are relatively simpler. An atom or a bond in a molecule either exists (as highlighted through electronic transitions and/or molecular vibrations and/or rotations) or does not exist (1 or 0). Molecular geometry (especially in the liquid or gas phase) is more complicated. The Heisenberg principle (Werner Heisenberg, 1901-1976, one of the founders of quantum mechanics and a Nobel Prize winner) establishes that uncertainty is inherent at the micro (atomic and molecular) level. Moreover, molecular geometry depends on the environment in which a molecule is placed (the vicinity of the molecule), the temperature, the pressure, etc. Thus, dealing with molecular geometry is a matter of relativity if not a matter of uncertainty.

In conclusion, in the field of Structure-Property-Activity Relationships (SPARs) there are certainties (e.g. molecular topology), uncertainties (e.g. molecular geometry), relativities (e.g. biological activities), and evidences (e.g. physical-chemical properties).

Mathematical Chemistry [9,10], Quantum Chemistry [11], and Medicinal Chemistry [12] make increasingly significant contributions to the design and improvement of drugs. The dynamics of pharmaceuticals is high; new drugs appear on the market daily, even if the process is a long-lasting one. Drug design has recently emerged as a new field [13,14]. The usage of computing power was a major breakthrough. Grassy et al. [15] reported a search for
peptides possessing immunosuppressive activity by using 27 descriptors derived from the structure (topology and shape). Twenty-six peptides were selected from a combinatorial library of about 280000 compounds. The predicted activity was high. Five of them were actually synthesized and tested experimentally. The most potent compounds showed an immunosuppressive activity approximately 100 times higher than the lead compound. Combinatorial Chemistry is now a new field [16,17].

Almost 40 years have passed [18] since the QSAR (Quantitative Structure-Activity Relationship) paradigm proved its utility in agriculture, pharmacy, toxicology, and other fields. Many methods were introduced and proved to be good estimators and predictors: CoMFA [19] (Comparative Molecular Field Analysis) and its variants (CoMSIA, MSIA: Molecular Similarity Indices Analysis); WHIM [20] (Weighted Holistic Invariant Molecular) and its variant MS-WHIM (Molecular Surface WHIM); MTD [21] (Minimal Topological Distance) and its variant MSD (S: Steric); FPIF [22] (Fragmental Property Index Family); and MDF [23] (Molecular Descriptors Family). Other emergent results such as S(Q)SAR (Spectral (Quantitative) Structure-Activity Relationship) [24] are again challenging the classical approaches.

The results in cellular genetics, the mechanic dynamics of cells, genetic mutations, and the sequencing and coding of macromolecular topology information have led to the recording and storing of molecular geometry in databases for a large number of biologically active chemical compounds [25]. Information stored in databases, integrated with proper combinatorial methods, may lead to the identification of active compounds with high potency. Statistical methods for internal and external validation [26], correlated correlation analysis [27], and principal component analysis [28] have a role in identifying (Q)SA/PRs (see [29,30,31] for the meaning of the parenthesized Q). These methods allow the selection of compounds with the best biological activity.

The aim of the present review is to present the Molecular Descriptors Family (MDF) methodology and the results obtained by applying this methodology to over 50 physical-chemical properties or biologically active compounds.
MDF SPR/SAR methodology

The MDF SAR method integrates the complex information obtained from the structure of the compounds into models in order to explain the activity/property of interest. The input data for a given set of molecules are represented by the molecular and/or structural formulas and the experimental
values for the activity and/or property of interest. By applying the methodology, the molecular descriptors and the uni- and/or multivariate models are obtained. The following six steps were used in the modelling process [32]:

Step 1: drawing the topological model (2D) of each compound from the set of interest by using HyperChem [33].
Step 2: building the geometrical model (3D) of each compound from the set of interest by using HyperChem [33].
Step 3: applying a semiempirical model (for calculating the partial charge distribution on atoms) and (sometimes) a quantum mechanics model (up to the most advanced ones, such as ab initio and Time-Dependent Density Functional Theory) using specific modules of HyperChem [33] (examples: HyperNewton, HyperGauss, HyperNDO) in order to obtain an optimized geometrical model in vitro or in vivo.
Step 4: generating the molecular descriptors family by using the MDF software. The MDF software is described below.
Step 5: applying the bias procedure in order to eliminate identical descriptors from the generated family.
Step 6: obtaining simple and/or multivariate linear regression relationships between the members of the descriptors family and the given property/activity. The following criteria are used [34,35]: (1) the goodness-of-fit of the model [36,37] (the correlation coefficient and the squared correlation coefficient; values close to ±1 indicate a good model); (2) the co-linearity between pairs of descriptors (values lower than 0.5 indicate the absence of co-linearity between descriptors); and (3) the significance of the regression model (for a significance level of 5%).

The internal validation of the MDF SAR models is analyzed through leave-one-out cross-validation [38]. A correlated correlation analysis is applied whenever appropriate by using Steiger’s Z test at a significance level of 5% [27].
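The goodness-of-fit, co-linearity, and leave-one-out criteria listed above can be illustrated with a short sketch. It is not the MDF software: it handles only the univariate case, it assumes that the co-linearity criterion is applied to the absolute value of the Pearson correlation between two descriptors (the text does not say whether r or r squared is meant), and the function names and data are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares slope and intercept of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


def loo_predictions(x, y):
    """Predict each molecule from a model fitted on all the other molecules."""
    predictions = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        predictions.append(slope * x[i] + intercept)
    return predictions


def r_cv_loo(x, y):
    """Correlation between measured values and leave-one-out predictions."""
    return pearson_r(y, loo_predictions(x, y))


def collinear(x1, x2, limit=0.5):
    """Assumed reading of criterion (2): |r| of 0.5 or more flags co-linearity."""
    return abs(pearson_r(x1, x2)) >= limit


# Hypothetical descriptor and property values, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r_fit, r_loo = pearson_r(x, y), r_cv_loo(x, y)
```

Comparing `r_fit` with `r_loo` gives a first impression of how much of the fit survives when each molecule is predicted by a model that has not seen it.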
MDF SAR physical model

The MDF SAR approach has a mathematical model that comprises seven pieces. Every piece has a list of possibilities, which come from the physical approach, and every piece provides a letter in the descriptor’s name:

- The linearizing operator (the first letter) makes the link between micro, nano, and macro levels. Example: pH = -log[H+] is a macro property (measure, effect) of a phenomenon measured in the micro environment (cause): the presence and the number of H+ in a given solution. It takes six values: I (identity), i (inverse), A (absolute), a (inverse of absolute), L (logarithm of absolute), l (logarithm).
- The molecular level superposing operator (the second letter) lies upon the fragmental contributions. Its existence is sustained by the variety of the causality of the molecular property/activity, from specificity, regioselectivity, and selectivity (which most biological activities have) to independence from the structural formula (such as relative mass, which is the same for all isomers of a molecular formula). It takes nineteen values: sized group (m = smallest; M = largest; n = smallest absolute; N = largest absolute); averaged group (S = sum; A = average over all values; a = sum divided by the number of all fragments; B = average first by atom group and then by the whole molecule; b = adjusted B by bonds); geometric group (P = multiplication; G = geometric mean; g = adjusted G by fragments; F = geometric mean first by atom group and then by the whole molecule; f = adjusted F by bonds); harmonic group (s = harmonic sum; H = harmonic mean; h = adjusted H by fragments; I = harmonic mean first by atom group and then by the whole molecule; i = adjusted I by bonds).
- The pair-based fragmentation criterion (the third letter) implements different criteria. Some parts of a molecule are more active than others and give most of the activity/property of the molecule (the substituent’s role), as was observed in the first SAR studies carried out by Hammett. It takes four values: m (minimal), M (maximal), D (Szeged, distance based), and P (Cluj, shortest paths based [22,39]).
- The interaction model (the fourth letter) implements different levels of approximation (scalar and vectorial) for superposing the descriptors of interaction at the fragment level. It is well known that a series of field-type interactions (such as gravitational and electrostatic) are treated vectorially at short range and as scalars at a distance. It takes six values: scalar (R = rare model and resultant relative to fragment’s head; r = rare model and resultant relative to conventional origin; M = medium model and resultant relative to fragment’s head; m = medium model and resultant relative to conventional origin) and vectorial (D = dense model and resultant relative to fragment’s head; d = dense model and resultant relative to conventional origin).
- The interaction descriptor (the fifth letter) implements a series of interaction descriptors for physical entities (such as force, field, energy, potential), as they occur in magnetism, electrostatics, gravity and quantum mechanics. Different physical entities have different formulas. It takes twenty-four values: D (distance), d (inverted distance), O (first atom’s property), o (inverted first atom’s property), P (product of atomic property), p (inverted P), Q (squared P), q (inverted Q), J (first atom’s property multiplied by distance), j (inverted J), K (product of atomic property and distance), k (inverted K), L (product of distance and squared atomic property), l (inverted L), V (first atom’s property potential), E (first atom’s property field), W (first atom’s property work), w (property’s work), F (first atom’s property’s force), f (property’s force), S (first atom’s property’s weak nuclear force), s (property’s weak nuclear force), T (first atom’s property strong nuclear force), t (property’s strong nuclear force).
- The atomic property (the sixth letter) discriminates atoms through elemental properties. Every atom has a series of characteristics and/or properties making it similar and/or dissimilar to another. It takes six values: M (mass), Q (charge), C (cardinality), E (electronegativity), G (group electronegativity), and H (number of attached hydrogen atoms).
- The distance operator (the seventh letter) implements both 2D and 3D approaches (topology and geometry). It takes two values: g (geometry), t (topology).
Each piece of the mathematical model corresponds to a physical model, and each piece can take more than one value. The molecular descriptors family is generated by calculating 787968 (6 × 19 × 4 × 6 × 24 × 6 × 2) possibilities. Not all these possibilities have physical meaning (e.g. the logarithm of a negative number). Moreover, not all of them produce finite numbers (e.g. division by zero). For a given set of molecules, a descriptor can be degenerate relative to the set (having the same value for all the molecules in the set) or relative to another descriptor (two descriptors with different calculation formulas produce the same result for all molecules in the set). A bias procedure removes these descriptors from the family of the set. Depending on the set, the number of remaining MDF members is around 100000.
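The seven-letter naming scheme and the size of the family can be checked with a small sketch. The alphabets are transcribed from the list of pieces above; the names `PIECES`, `family_size`, and `decode` are hypothetical and are not part of the MDF software.

```python
# One (piece, alphabet) pair per letter of a descriptor name, in name order.
PIECES = [
    ("linearizing operator", "IiAaLl"),
    ("superposing operator", "mMnNSAaBbPGgFfsHhIi"),
    ("fragmentation criterion", "mMDP"),
    ("interaction model", "RrMmDd"),
    ("interaction descriptor", "DdOoPpQqJjKkLlVEWwFfSsTt"),
    ("atomic property", "MQCEGH"),
    ("distance operator", "gt"),
]


def family_size():
    """Number of letter combinations, i.e. the product of the alphabet sizes."""
    size = 1
    for _, alphabet in PIECES:
        size *= len(alphabet)
    return size


def decode(name):
    """Split a seven-letter descriptor name into its pieces; reject invalid names."""
    if len(name) != len(PIECES):
        raise ValueError("a descriptor name has exactly seven letters")
    parts = {}
    for letter, (piece, alphabet) in zip(name, PIECES):
        if letter not in alphabet:
            raise ValueError(f"{letter!r} is not a valid {piece}")
        parts[piece] = letter
    return parts


total = family_size()            # all seven pieces
before_linearizing = total // 6  # the six-letter part, before the first letter is applied
example = decode("ISDRTHg")      # a descriptor name reported later in this chapter
```

For `ISDRTHg` the decoding reads: identity, sum, Szeged fragmentation, rare model relative to the fragment’s head, first atom’s property strong nuclear force, number of attached hydrogen atoms, geometry.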
MDF software

The MDF software was created by using the triad PHP (PHP: Hypertext Preprocessor [40]), MySQL database [41], and FreeBSD server [42]. A set of programs completes the task of generating the molecular descriptors family. Two databases (one temporary, `MDFSARtmp`, for the sets being processed, and one permanent, `MDFSARs`, for finalized sets) are stored on a FreeBSD server from the IntraNet [IP 172.27.211.5] by using a MySQL database server. On December 25, 2007, `MDFSARtmp` had 174 tables (1.8 GB), and `MDFSARs` more than 300 tables (> 3.5 GB). Four tables were generated for each investigated set:
- `"NameSet"_tmpx` (787968/6 = 131328 records; fields: molecules; records: descriptors)
- `"NameSet"_data` (field: property/activity; records: molecules; the number of records equals the number of molecules)
- `"NameSet"_valx` (fields: molecules; records: descriptors; after bias it has about 100000 records)
- `"NameSet"_valy` (records: the same number as in the `"NameSet"_valx` table; fields: M(X), M(X*X), M(X*Y), r2(X,Y), Name(X), where M = average operator, r2 = determination coefficient, Name = name of X, Y = property/activity, X = MDF member).
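The averages stored in the `"NameSet"_valy` table are enough to recover the determination coefficient without revisiting the per-molecule values, which is one plausible reason for keeping them. The sketch below uses the standard identity r2 = (M(XY) - M(X)M(Y))^2 / ((M(X*X) - M(X)^2)(M(Y*Y) - M(Y)^2)); it assumes that M(Y) and M(Y*Y) are computed once from the property table, and the function names are hypothetical.

```python
def mean(values):
    """The averaging operator M applied to a list of numbers."""
    return sum(values) / len(values)


def stored_moments(x, y):
    """The three averages kept per descriptor: M(X), M(X*X), M(X*Y)."""
    return mean(x), mean([a * a for a in x]), mean([a * b for a, b in zip(x, y)])


def r2_from_means(mx, mxx, mxy, my, myy):
    """Determination coefficient computed only from averages."""
    covariance = mxy - mx * my
    return covariance * covariance / ((mxx - mx * mx) * (myy - my * my))


# Hypothetical values for one descriptor (x) and the measured property (y).
x = [1.0, 2.0, 3.0]
y = [1.0, 3.0, 2.0]
mx, mxx, mxy = stored_moments(x, y)
my, myy = mean(y), mean([b * b for b in y])
r2 = r2_from_means(mx, mxx, mxy, my, myy)
```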
Note that the numeric fields of the `"NameSet"_valy` table are computed for multivariate regression purposes (significantly decreasing the execution time). The `0_MDFSARRes` table (one per database) contains all obtained MDF SARs (fields: name (of the set), eq (MDF SAR), r2 (determination coefficient), m (number of molecules), n (number of MDF members in the MDF SAR model)). A set of five PHP applications, running on a FreeBSD server from the IntraNet [IP 172.27.211.4], generated the MDF:

- 0_mdf_prepare.php: creates the structure for the set tables using the name of the directory (for the set name) and the names of the files (for the molecule names).
- 1_mdf_generate.php: generates the MDF for the set of interest, filling 131328 records in the `"NameSet"_tmpx` table for every molecule. It is a multitasking application, one task being executed for every molecule at the same time.
- 2_mdf_linearize.php: applies the linearizing operator (131328 × 6 = 787968) and fills the valid records (meaningful and finite) into the `"NameSet"_xval` and `"NameSet"_yval` tables.
- 3_mdf_bias.php: sorts the records in memory by r2 and deletes the degenerate ones from both the `"NameSet"_xval` and the `"NameSet"_yval` tables simultaneously.
- 4_mdf_order.php: sorts the records in memory by r2 again, creates two temporary tables containing the records from `"NameSet"_xval` and `"NameSet"_yval` ordered by r2, deletes the old tables, and renames the new ones.
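The linearize, bias, and order stages can be sketched in a few lines. This is an illustration in Python, not the PHP applications themselves. It assumes natural logarithms for L and l, it treats a record as valid only if the operator yields a finite number for every molecule, and it treats two descriptors as duplicates when their determination coefficients agree to nine decimal places (a close stand-in for the "first nine digits" rule mentioned for the selection step). All function names and data are hypothetical.

```python
import math

# The six linearizing operators, keyed by the first letter of the descriptor name.
OPERATORS = {
    "I": lambda v: v,
    "i": lambda v: 1.0 / v,
    "A": lambda v: abs(v),
    "a": lambda v: 1.0 / abs(v),
    "L": lambda v: math.log(abs(v)),  # natural logarithm is an assumption
    "l": lambda v: math.log(v),
}


def linearize(six_letter_name, values):
    """Yield (seven-letter name, values) for every operator valid on all molecules."""
    for letter, operator in OPERATORS.items():
        try:
            transformed = [operator(v) for v in values]
        except (ValueError, ZeroDivisionError):
            continue  # e.g. logarithm of a negative number, division by zero
        if all(math.isfinite(t) for t in transformed):
            yield letter + six_letter_name, transformed


def r_squared(x, y):
    """Determination coefficient between a descriptor and the property."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def bias(descriptors, y, digits=9):
    """Drop degenerate descriptors, then return the survivors ordered by r2."""
    kept = {}
    for name, values in descriptors:
        if max(values) == min(values):
            continue  # same value for all molecules in the set
        key = round(r_squared(values, y), digits)
        kept.setdefault(key, (name, values))  # the first descriptor seen wins
    return [kept[key] for key in sorted(kept, reverse=True)]


# Hypothetical values of one six-letter descriptor for three molecules.
members = list(linearize("SDRTHg", [1.0, 2.0, 4.0]))
survivors = bias(members, [1.0, 2.0, 4.1])
```

In this toy case all six operators are valid, but A duplicates I, a duplicates i, and l duplicates L, so only three members survive the bias step.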
Our previous experience in working with a great number of molecular descriptors [22] indicated that the best found descriptor (the one which correlates the best with the measured property) is never found among the descriptors of the best found structure-activity relationship with two
descriptors. Thus, MDF uses pairs of descriptors when we search for SAR models with more than one descriptor (natural selection). MDF uses a genetic algorithm for QSPR/QSAR modelling (genetic algorithms are a particular class of evolutionary algorithms, being categorized as global search heuristics [43]). The peculiarities of the genetic algorithm used are:

- Step 1 (inheritance and mutation). The linearization procedure described above is applied to the solution domain (2 × 6 × 24 × 6 × 4 × 19 MDF members having a genetic representation with six letters), where every descendant is obtained from a parent (inheritance) through a transformation (mutation). The number of descendants obtained is six times higher than that of the parents. In this step, the fitness function is defined as having real and distinct values. Over half of the descendants die due to mutation (around 300000 descendants remain, which now have genetic representations with seven letters).
- Step 2 (selection). A bias procedure (selection) is applied to the solution domain (the MDF descendants from Step 1). In this step, the fitness function is defined as having distinct first nine digits of the determination coefficient. Only around 100000 members pass the selection process. Another selection is made from the solution domain obtained: the best descriptor (the one that best correlates with the measured property).
- Step 3 (crossover). Pairs of MDF members are crossed over in order to obtain models with two descriptors. Two fitness functions are used here: “to have the best determination coefficient” and “to have the best cross-validation leave-one-out score”.
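The crossover step can be pictured as a search over pairs of descriptors scored by the determination coefficient of the two-descriptor linear model. The sketch below is a plain exhaustive search and only a stand-in for the authors' programs, whose search strategy is not detailed here. It uses the standard identity R2 = (r1^2 + r2^2 - 2·r1·r2·r12) / (1 - r12^2) for a least-squares model with two predictors; the names and the data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def r2_two_descriptors(y, x1, x2):
    """R2 of the least-squares model y ~ x1 + x2, or None if x1 and x2 are co-linear."""
    r1, r2, r12 = pearson_r(x1, y), pearson_r(x2, y), pearson_r(x1, x2)
    if 1.0 - r12 * r12 < 1e-12:
        return None
    return (r1 * r1 + r2 * r2 - 2.0 * r1 * r2 * r12) / (1.0 - r12 * r12)


def best_pair(descriptors, y):
    """Return (R2, name1, name2) for the best-scoring pair of descriptors."""
    best = None
    for (name1, x1), (name2, x2) in combinations(descriptors.items(), 2):
        score = r2_two_descriptors(y, x1, x2)
        if score is not None and (best is None or score > best[0]):
            best = (score, name1, name2)
    return best


# Hypothetical descriptor values; y is constructed as a + b for the demonstration.
descriptors = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 1.0, 4.0, 3.0, 6.0],
    "c": [5.0, 3.0, 1.0, 4.0, 2.0],
}
y = [3.0, 3.0, 7.0, 7.0, 11.0]
winner = best_pair(descriptors, y)
```

With around 100000 surviving members, the number of pairs is of the order of 5 × 10^9, which is why the stored averages and an efficient search matter in practice. The leave-one-out score could be used as a second ranking criterion in the same loop.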
Searching procedures of the uni- and multivariate models were created using Delphi client-server programs [44].
MDF structure-property/activity relationships discussed

More than one hundred and forty models are presented. Seventy properties or activities on different classes of compounds (thirty-one) have been investigated. The results are structured in two sections (properties: MDF SPR models; activities: MDF SAR models) based on the investigated property/activity and the class of compounds. The univariate regression model is presented for each investigated class of compounds. The model with two descriptors is also presented whenever possible [45].
The statistics associated with the MDF model are expressed as the correlation coefficient (r), the standard error of the estimate (s), and the Fisher parameter of the model with its associated type I error (F(p)) at a significance level of 5%. The statistics obtained in the leave-one-out cross-validation analysis are expressed as the correlation coefficient (rcv-loo), the standard error of prediction (scv-loo), and the Fisher parameter of the predictive model with its associated type I error (Fcv-loo(p)) [34] at a significance level of 5%.
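These statistics are linked by standard regression formulas, which makes it possible to cross-check reported values. The sketch below assumes the usual definitions for a linear model with k descriptors fitted on n molecules: determination = 100·r^2, F = (r^2 / k) / ((1 - r^2) / (n - k - 1)), and s = sqrt(SSE / (n - k - 1)). The function names are hypothetical, and the p-value is left out because it needs the F distribution.

```python
from math import sqrt


def determination_percent(r):
    """Determination of the model, in percent."""
    return 100.0 * r * r


def f_statistic(r, n, k):
    """Fisher parameter of a linear model with k descriptors and n molecules."""
    r2 = r * r
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


def standard_error(residuals, k):
    """Standard error of the estimate from the residuals of a model with k descriptors."""
    n = len(residuals)
    return sqrt(sum(e * e for e in residuals) / (n - k - 1))


# Values reported in this chapter for the volatile organic compounds set (n = 24).
f_one = f_statistic(0.7309, 24, 1)  # model with one descriptor, reported F = 25
f_two = f_statistic(0.9006, 24, 2)  # model with two descriptors, reported F = 45
```

Both values agree with the reported Fisher parameters after rounding, and the corresponding determinations round to the reported 53% and 81%.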
MDF SPR models

In order to justify the introduction of the molecular descriptors family on the structure-property relationships, a series of twenty properties on nine classes of compounds have been investigated. The results obtained are presented and discussed below.
Partition & activity coefficients

Volatile organic compounds - n-octanol/water partition coefficient

The analysis of the results reveals that the model with two descriptors obtained better performances in terms of goodness-of-fit and cross-validation [34]. According to the model with two descriptors, the octanol/water partition coefficient of the investigated volatile organic compounds is of geometrical and topological nature and it depends on the partial charge and the number of directly bonded hydrogens.

Characteristic | Model with one descriptor | Model with two descriptors
Sample size [reference] | 24 [46] | 24 [46]
MDF SPR Equation | ŷ=-0.004·x+2.09 | ŷ=0.65·x1-0.13·x2+3.99
SPR Determination (%) | 53 | 81
MDF Descriptor(s): x1 & x2 | ISDRTHg | LsPrDQt & IADRSHg
Dominant Atomic Property | Hydrogen’s (H) | Charge (Q) & Hydrogen (H)
Interaction Via | Space (geometry) | Bonds (topology) & Space (geometry)
Interaction Model | H^2·d^-4 | d & H^2·d^-3
Structure on Property Scale | Identity | Logarithmic & Identity
Model Statistics | r=0.7309; s=0.43; F(p)=25 (4.98·10^-5) | r=0.9006; s=0.28; F(p)=45 (2.51·10^-8)
Cross-Validation Leave-One-Out | rcv-loo=0.6792; scv-loo=0.47; Fcv-loo(p)=18 (3.00·10^-4) | rcv-loo=0.8815; scv-loo=0.31; Fcv-loo(p)=36.41 (1.49·10^-7)
Para-substituted phenols - n-octanol/water partition coefficient

The octanol/water partition coefficient of para-substituted phenols depended directly on the molecules’ geometry and it was related to partial charge and group electronegativity (see the model with two descriptors).

Characteristic | Model with one descriptor | Model with two descriptors
Sample size [reference] | 30 [47] | 30 [47]
MDF SPR Equation [reference] | ŷ=-923.42·x+4.25 | ŷ=0.003·x1-0.40·x2+1.07 [48]
SPR Determination (%) | 71 | 89
MDF Descriptor(s): x1 & x2 | IsPdOQg | isDDkGg & IMmrKQg
Dominant Atomic Property | Charge (Q) | Group Electronegativity (G) & Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model | Q | G^-2·d^-1 & Q^-2·d^-1
Structure on Property Scale | Identity | Inversed & Identity
Model Statistics | r=0.8412; s=0.60; F(p)=68 (5.83·10^-9) | r=0.9457; s=0.37; F(p)=114 (6.65·10^-14)
Cross-Validation Leave-One-Out | rcv-loo=0.8115; scv-loo=0.65; Fcv-loo(p)=53 (5.82·10^-8) | rcv-loo=0.9306; scv-loo=0.41; Fcv-loo(p)=87 (1.69·10^-12)
Polychlorinated biphenyls - n-octanol/water partition coefficient

According to the model with two descriptors, the octanol/water partition coefficient of the investigated polychlorinated biphenyls is of geometrical nature and depends on the atomic and group electronegativity.

Characteristic | Model with one descriptor | Model with two descriptors
Sample size [reference] | 206 [49] | 206 [49]
MDF SPR Equation | ŷ=2552.7·x-14.22 | ŷ=-0.44·x1+0.04·x2+3.12
SPR Determination (%) | 87 | 89
MDF Descriptor(s): x1 & x2 | iBMmwHg | IIDDKGg & IHDRKEg
Dominant Atomic Property | Hydrogen (H) | Group Electronegativity (G) & Electronegativity (E)
Interaction Via | Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model | H^2·d^-1 | G^2·d & E^2·d
Structure on Property Scale | Inversed | Identity & Identity
Model Statistics | r=0.9347; s=0.30; F(p)=1410 (7.75·10^-94) | r=0.9433; s=0.28; F(p)=819 (1.71·10^-98)
Cross-Validation Leave-One-Out | rcv-loo=0.9330; scv-loo=0.30; Fcv-loo(p)=1372 (9.13·10^-93) | rcv-loo=0.9409; scv-loo=0.28; Fcv-loo(p)=784 (9.33·10^-97)
Organic pollutants - Soil/water partition coefficient normalized to organic carbon

The soil/water partition coefficient of the studied organic pollutants proved to be a topological property related to the number of directly bonded hydrogen atoms and atomic electronegativity (see the equation with two descriptors).

Characteristic | Model with one descriptor | Model with two descriptors
Sample size [reference] | 8 [50] | 8 [50]
MDF SPR Equation | ŷ=-17.45·x+8.12 | ŷ=-0.22·x1-0.68·x2+16.62
SPR Determination (%) | 90 | 98
MDF Descriptor(s): x1 & x2 | IbPMtMt | lfDMWHt & IbmrTEt
Dominant Atomic Property | Mass (M) | Hydrogen (H) & Electronegativity (E)
Interaction Via | Bonds (topology) | Bonds (topology) & Bonds (topology)
Interaction Model | M^2·d^-4 | H^2·d^-1 & E^2·d^-4
Structure on Property Scale | Identity | Logarithmic & Identity
Model Statistics | r=0.9483; s=0.21; F(p)=54 (3.33·10^-4) | r=0.9839; s=0.22; F(p)=26 (2.27·10^-3)
Cross-Validation Leave-One-Out | rcv-loo=0.8710; scv-loo=0.34; Fcv-loo(p)=18 (5.27·10^-3) | rcv-loo=0.9481; scv-loo=0.24; Fcv-loo(p)=22 (3.30·10^-3)
Fifteen standard amino acids - Partition (1st & 2nd columns) & activity coefficients (3rd column)

Fifteen standard amino acids were investigated: alanine (Ala), asparagine (Asn), aspartate (Asp), cysteine (Cys), glutamine (Gln), glutamate (Glu), glycine (Gly), isoleucine (Ile), leucine (Leu), lysine (Lys), methionine (Met), phenylalanine (Phe), serine (Ser), threonine (Thr), and valine (Val).
The partition coefficients of the 15 amino acids were studied on two different scales. In both situations (column 1 and 2) the property was identified as being of geometrical nature and also depended on the partial charge. The activity coefficient is of topological nature but it also depends on the partial charge.
Chromatographic parameters

Polychlorinated biphenyls - Relative retention time

The relative retention time of polychlorinated biphenyls proved to be of both topological and geometrical nature, and was related to the number of directly bonded hydrogens (see the model with two descriptors).

Characteristic | Model with one descriptor | Model with two descriptors
Sample size [reference] | 206 [49] | 206 [49]
MDF SPR Equation | ŷ=0.09·x-0.17 | ŷ=0.02·x1-1.02·x2-5.99
SPR Determination (%) | 98 | 99.7
MDF Descriptor(s): x1 & x2 | iIDRwHg | ISDmsHt & lADrtHg
Dominant Atomic Property | Hydrogen (H) | Hydrogen (H) & Hydrogen (H)
Interaction Via | Space (geometry) | Bonds (topology) & Space (geometry)
Interaction Model | H^2·d^-1 | H^2·d^-3 & H^2·d^-4
Structure on Property Scale | Inversed | Identity & Logarithmic
Model Statistics | r=0.9921; s=0.02; F(p)=13013 (1.64·10^-189) | r=0.9986; s=0.01; F(p)=36600 (1.10·10^-265)
Cross-Validation Leave-One-Out | rcv-loo=0.9920; scv-loo=0.02; Fcv-loo(p)=12777 (1.06·10^-188) | rcv-loo=0.9985; scv-loo=0.01; Fcv-loo(p)=35416 (3.33·10^-264)
Polychlorinated biphenyls - Relative response factor

The MDF SPR abilities to estimate and predict the relative response factor are not strong, the SPR determination being lower than 70%. According to the model with two descriptors, the relative response factor of the polychlorinated biphenyls is of both topological and geometrical nature and it strongly depends on the number of directly bonded hydrogens.

Characteristic | Model with one descriptor | Model with two descriptors
Sample size [reference] | 209 [49] | 209 [49]
MDF SPR Equation | ŷ=0.53·x-0.51 | ŷ=-357.3·x1+2.16·x2+5.08
SPR Determination (%) | 63 | 69
MDF Descriptor(s): x1 & x2 | iHMdTHg | imMrFHt & iHDdFHg
Dominant Atomic Property | Hydrogen (H) | Hydrogen (H) & Hydrogen (H)
Interaction Via | Space (geometry) | Bonds (topology) & Space (geometry)
Interaction Model | H^2·d^-4 | H^2·d^-2 & H^2·d^-2
Structure on Property Scale | Inversed | Inversed & Inversed
Model Statistics | r=0.7929; s=0.22; F(p)=351 (1.67·10^-46) | r=0.8324; s=0.20; F(p)=232 (9.59·10^-54)
Cross-Validation Leave-One-Out | rcv-loo=0.7873; scv-loo=0.22; Fcv-loo(p)=337 (2.02·10^-45) | rcv-loo=0.8258; scv-loo=0.20; Fcv-loo(p)=221 (3.73·10^-52)
Organophosphorus herbicides - Retention chromatography index

The retention chromatography index of organophosphorus herbicides could be estimated and predicted by using the MDF approach. The SPR determination was higher than 90%. According to the model with two descriptors, the retention chromatography index is of topological and geometrical nature and depends on relative atomic mass and partial charge.

Characteristic | Model with one descriptor | Model with two descriptors
Sample size [reference] | 10 [53] | 10 [53]
MDF SPR Equation [reference] | ŷ=0.32·x-3.37 | ŷ=6.37·x1+0.06·x2-62.36 [23]
SPR Determination (%) | 94 | 99.92
MDF Descriptor(s): x1 & x2 | IBPdqHg | lSDmwMt & iHPDEQg
Dominant Atomic Property | Hydrogen (H) | Mass (M) & Charge (Q)
Interaction Via | Space (geometry) | Bonds (topology) & Space (geometry)
Interaction Model | 1/H√H | M^2·d^-1 & M·d^-2
Structure on Property Scale | Logarithmic | Logarithmic & Inversed
Model Statistics | r=0.9708; s=0.78; F(p)=131 (3.08·10^-6) | r=0.9996; s=0.10; F(p)=4348 (1.47·10^-11)
Cross-Validation Leave-One-Out | rcv-loo=0.9563; scv-loo=0.95; Fcv-loo(p)=85 (1.55·10^-5) | rcv-loo=0.9993; scv-loo=0.13; Fcv-loo(p)=2344 (1.28·10^-10)
Molar refraction
Cyclic organophosphorus (1st & 2nd column) & fifteen standard amino acids (3rd column)
The MDF SPR approach performed very well: the determination coefficient is higher than or equal to 98 percent. For the cyclic organophosphorus compounds, the molar refraction proved to be of topological nature and related to the relative atomic mass and the atomic electronegativity (see the model with two variables). In the sample of standard amino acids, with an SPR determination of 98%, the molar refraction proved to be of geometrical nature and linearly related to the partial charge.
Sample size [reference]: 10 [53] | 10 [53] | 15 [51]
MDF SPR Equation [reference]: ŷ=16.37·x-0.28 | ŷ=28.25·x183.97·x2+17.39 [54] | ŷ=-0.89·x+6.7 [52]
SPR Determination (%): 99 | 100 | 98
MDF Descriptor(s) (x1 & x2): iIMdsGg | lGDmSMt & lAmrfEt | lFMMwQg
Dominant Atomic Property: Group electronegativity (G) | Mass (M) & Electronegativity (E) | Charge (Q)
Interaction Via: Space (geometry) | Bonds (topology) & Bonds (topology) | Space (geometry)
Interaction Model: G2·d-3 | M2·d-3 & E2·d-2 | Q2·d-1
Structure on Property Scale: Inversed | Logarithmic & Logarithmic | Logarithmic
Model Statistics: r=0.9959; s=0.96; F (p)=975 (1.20·10-9) | r=0.9999; s=0.07; F (p)=83206 (4.44·10-16) | r=0.9892; s=1.13; F (p)=590 (3.22·10-12)
Cross-Validation Leave-One-Out: rcv-loo=0.9942; scv-loo=1.15; Fcv-loo (p)=679 (5.04·10-9) | rcv-loo=0.9999; scv-loo=0.11; Fcv-loo (p)=40592 (5.99·10-15) | rcv-loo=0.9845; scv-loo=1.35; Fcv-loo (p)=409 (3.29·10-11)
Boiling point
Alkanes
Good determinations were obtained in the estimation and prediction of the boiling point of the investigated alkanes. The best performing model, the one with two descriptors, showed that the boiling point of the investigated compounds is of topological nature and directly related to group electronegativity and the number of directly bonded hydrogens.
Model: one descriptor | two descriptors
Sample size [reference]: 73 [55] (both models)
MDF SPR Equation: ŷ=188.40·x-507.95 | ŷ=-67.45·x1+4.89·x2-129.20
SPR Determination (%): 99 | 99.8
MDF Descriptor(s) (x1 & x2): lbMdsHg | lGDrtGt & IbDrfHt
Dominant Atomic Property: Hydrogen (H) | Group Electronegativity (G) & Hydrogen (H)
Interaction Via: Space (geometry) | Bonds (topology) & Bonds (topology)
Interaction Model: H2·d-3 | G2·d-4 & H2·d-2
Structure on Property Scale: Logarithmic | Inversed & Identity
Model Statistics: r=0.9956; s=3.81; F (p)=8050 (1.23·10-75) | r=0.9991; s=1.75; F (p)=19361 (4.66·10-99)
Cross-Validation Leave-One-Out: rcv-loo=0.9954; scv-loo=3.91; Fcv-loo (p)=7654 (7.44·10-75) | rcv-loo=0.9990; scv-loo=1.82; Fcv-loo (p)=17837 (8.87·10-98)
Other properties
Fifteen standard amino acids - Magnetic susceptibility (1st column) & dipole moment (2nd column) & solubility (3rd column)
The magnetic susceptibility of the investigated standard amino acids proved to be of geometrical nature and linearly dependent on partial charge. The dipole moment and the solubility of the studied standard amino acids are of topological nature and depend on partial charge. For these last two properties the model used the same molecular descriptor, but the performance was better for solubility than for the dipole moment.
Property: magnetic susceptibility | dipole moment | solubility
Sample size [reference]: 15 [51] | 15 [51] | 15 [51]
MDF SPR Equation [reference]: ŷ=-92.99·x+84.21 [52] | ŷ=-8.70·x+0.19 [52] | ŷ=-25.28·x+4.06 [52]
SPR Determination (%): 91 | 79 | 87
MDF Descriptor: iHMRqQg | IiDRLQt | IiDRLQt
Dominant Atomic Property: Charge (Q) | Charge (Q) | Charge (Q)
Interaction Via: Space (geometry) | Bonds (topology) | Bonds (topology)
Interaction Model: Q-1(√Q)-1 | d·Q | Q·d
Structure on Property Scale: Inversed | Identity | Identity
Model Statistics: r=0.9548; s=2.98; F (p)=134 (3.20·10-8) | r=0.8885; s=0.50; F (p)=49 (9.61·10-6) | r=0.9338; s=1.08; F (p)=89 (3.63·10-7)
Cross-Validation Leave-One-Out: rcv-loo=0.9381; scv-loo=3.48; Fcv-loo (p)=95 (2.42·10-7) | rcv-loo=0.8339; scv-loo=0.61; Fcv-loo (p)=29 (1.18·10-4) | rcv-loo=0.9070; scv-loo=1.27; Fcv-loo (p)=60 (3.15·10-6)
Fifteen standard amino acids - Hückel energy (1st column) & hydration energy (2nd column)
The Hückel energy of the investigated amino acids proved to be a geometrical property that depends on atomic electronegativity. Hydration energy proved to be of topological nature and related to partial charge. Note that the determination coefficient is higher than 90% in both models.
Property: Hückel energy | hydration energy
Sample size [reference]: 15 [33] | 15 [33]
MDF SPR Equation [reference]: ŷ=868.02·x-1417.5 [52] | ŷ=17.59·x+19.46 [52]
SPR Determination (%): 99.7 | 93
MDF Descriptor: lfPdkEg | iGPmLQt
Dominant Atomic Property: Electronegativity (E) | Charge (Q)
Interaction Via: Space (geometry) | Bonds (topology)
Interaction Model: E-2·d-1 | Q·d
Structure on Property Scale: Logarithmic | Inversed
Model Statistics: r=0.9983; s=235; F (p)=3915 (1.53·10-18) | r=0.9646; s=0.86; F (p)=174 (6.74·10-9)
Cross-Validation Leave-One-Out: rcv-loo=0.9979; scv-loo=264; Fcv-loo (p)=3089 (1.11·10-16) | rcv-loo=0.9503; scv-loo=1.01; Fcv-loo (p)=121 (5.90·10-8)
Fifteen standard amino acids - Polarizability (1st column) & refractivity (2nd column)
The polarizability and refractivity of the investigated standard amino acids were estimated by the same molecular descriptor. Thus, both properties are of geometrical nature and directly related to atomic electronegativity.
Property: polarizability | refractivity
Sample size [reference]: 15 [33] | 15 [33]
MDF SPR Equation [reference]: ŷ=36.97·x-4.84 [52] | ŷ=93.72·x-13.09 [52]
SPR Determination (%): 98 | 97
MDF Descriptor: iIMdWEg | iIMdWEg
Dominant Atomic Property: Electronegativity (E) | Electronegativity (E)
Interaction Via: Space (geometry) | Space (geometry)
Interaction Model: E2·d-1 | E2·d-1
Structure on Property Scale: Inversed | Inversed
Model Statistics: r=0.9883; s=0.46; F (p)=546 (5.32·10-12) | r=0.9862; s=1.27; F (p)=462 (1.53·10-11)
Cross-Validation Leave-One-Out: rcv-loo=0.9825; scv-loo=0.56; Fcv-loo (p)=362 (7.10·10-11) | rcv-loo=0.9794; scv-loo=1.55; Fcv-loo (p)=306 (2.03·10-10)
MDF-SAR models
The abilities of the molecular descriptors family approach have been investigated on twenty-one samples of biologically active compounds. A total of seventy-three activities were investigated on different classes of compounds.
Water activated carbon adsorption
Organic compounds
The water activated carbon adsorption of the investigated organic compounds proved to be of topological and geometrical nature and related to the number of directly bonded hydrogens and partial charge (see the model with two descriptors). The power of determination of the activity is good, close to optimum (98%).
Model: one descriptor | two descriptors
Sample size [reference]: 16 [56] (both models)
MDF SAR Equation [reference]: ŷ=-57.99·x+1.99 [57] | ŷ=0.85·x1+0.003·x2+2.58 [57]
SAR Determination (%): 86 | 98
MDF Descriptor(s) (x1 & x2): iSDrDQt | IiMMWHt & lPMDVQg
Dominant Atomic Property: Charge (Q) | Hydrogen (H) & Charge (Q)
Interaction Via: Bonds (topology) | Bonds (topology) & Space (geometry)
Interaction Model: d | H2·d-1 & Q·d-1
Structure on Activity Scale: Inversed | Identity & Logarithmic
Model Statistics: r=0.9270; s=0.13; F (p)=86 (2.43·10-7) | r=0.9905; s=0.05; F (p)=337 (6.30·10-12)
Cross-Validation Leave-One-Out: rcv-loo=0.8959; scv-loo=0.16; Fcv-loo (p)=56 (2.92·10-6) | rcv-loo=0.9873; scv-loo=0.06; Fcv-loo (p)=251 (4.14·10-11)
Hydrophobic vs. hydrophilic character
The hydrophobic or hydrophilic character, which is an important property in protein structure and protein-protein interactions, is one of the most studied properties of the amino acids. Many hydrophobicity scales have already been reported (see below). The differences between scales are significant: Janin (1979) and Kyte & Doolittle (1982) classified cysteine as the most hydrophobic while Wolfenden et al. [58] or Rose et al. [59] did not. These differences could be explained by the fundamentally different methods used for constructing the scales.
Fifteen standard amino acids – Hydrophobicity
The hydrophobicity on the Bumble, Hessa et al. and Kyte & Doolittle scales proved to be of geometrical nature and directly related to partial charge on the sample of fifteen standard amino acids (see the table below). Except for the Bumble scale, the MDF models obtained had a determination coefficient higher than 90%.
Sample size [reference]: 15 [51] | 15 [60] | 15 [61]
MDF SAR Equation [reference]: ŷ=-160·x-0.07 [52] | ŷ=8.5·x-0.58 [52] | ŷ=-21·x+12 [52]
SAR Determination (%): 65 | 90.5 | 95
MDF Descriptor: AbmrEQg | iMDRoQg | IGDROQg
Dominant Atomic Property: Charge (Q) | Charge (Q) | Charge (Q)
Interaction Via: Space (geometry) | Space (geometry) | Space (geometry)
Interaction Model: Q·d2 | Q-1 | Q
Structure on Activity Scale: Proportional | Inversed | Proportional
Model Statistics: r=0.8085; s=1.68; F (p)=25 (2.64·10-4) | r=0.9514; s=0.44; F (p)=124 (5.05·10-8) | r=0.9759; s=0.71; F (p)=260 (5.66·10-10)
Cross-Validation Leave-One-Out: rcv-loo=0.7550; scv-loo=1.88; Fcv-loo (p)=17 (1.21·10-3) | rcv-loo=0.9351; scv-loo=0.51; Fcv-loo (p)=90 (3.26·10-7) | rcv-loo=0.9659; scv-loo=0.80; Fcv-loo (p)=203 (2.57·10-9)
Twenty standard amino acids – Hydrophobicity
The sample of twenty standard amino acids used for the following MDF SAR models contains: alanine (Ala), arginine (Arg), asparagine (Asn), aspartate (Asp), cysteine (Cys), glutamine (Gln), glutamate (Glu), glycine (Gly), histidine (His), isoleucine (Ile), leucine (Leu), lysine (Lys), methionine (Met), phenylalanine (Phe), proline (Pro), serine (Ser), threonine (Thr), tryptophan (Trp), tyrosine (Tyr), and valine (Val).
Sample size [reference]: 20 [62] | 20 [63] | 20 [64]
MDF SAR Equation [reference]: ŷ=0.39·x-1.23 [65] | ŷ=-27.79·x+6.55 [65] | ŷ=-1.73·x-2.88 [65]
SAR Determination (%): 44 | 66 | 69
MDF Descriptor: amMRLQt | immRoQg | LmDROQg
Dominant Atomic Property: Charge (Q) | Charge (Q) | Charge (Q)
Interaction Via: Bonds (topology) | Space (geometry) | Space (geometry)
Interaction Model: Q·d | Q-1 | Q
Structure on Activity Scale: Inversed | Inversed | Logarithmic
Model Statistics: r=0.6649; s=1.21; F (p)=14 (1.38·10-3) | r=0.8163; s=2.19; F (p)=6 (1.14·10-5) | r=0.8309; s=1.70; F (p)=40 (5.70·10-6)
Cross-Validation Leave-One-Out: rcv-loo=0.5961; scv-loo=1.37; Fcv-loo (p)=7 (1.44·10-2) | rcv-loo=0.7740; scv-loo=2.41; Fcv-loo (p)=27 (6.50·10-5) | rcv-loo=0.7936; scv-loo=1.87; Fcv-loo (p)=30 (3.34·10-5)
Two out of three of the above presented models for hydrophobicity proved that the activity was of geometrical nature and depended on the partial charge. None of the above models was strong; the coefficient of determination was lower than 70%.
Twenty standard amino acids – Hydrophobicity
The MDF SAR model for hydrophobicity on the Wimley & White scale is of topological nature and depends on partial charge. The hydrophobicity models on the Hopp & Woods and Cowan & Whittaker scales proved to be of geometrical nature and dependent on partial charge. The coefficients of determination associated with these models are not strong, even though the values are higher than 50%.
Sample size [reference]: 20 [66] | 20 [67] | 20 [68]
MDF SAR Equation [reference]: ŷ=7.35·x-3.37 [65] | ŷ=10.63·x-1.99 [65] | ŷ=-6.57·x+1.47 [65]
SAR Determination (%): 71 | 74 | 75
MDF Descriptor: iBmrWQt | iMPRoQg | AmDROQg
Dominant Atomic Property: Charge (Q) | Charge (Q) | Charge (Q)
Interaction Via: Bonds (topology) | Space (geometry) | Space (geometry)
Interaction Model: Q2/d | Q-1 | Q
Structure on Activity Scale: Inversed | Inversed | Absolute
Model Statistics: r=0.8434; s=0.48; F (p)=44 (3.00·10-6) | r=0.8608; s=1.01; F (p)=52 (1.11·10-6) | r=0.8661; s=0.6; F (p)=54 (1.15·10-8)
Cross-Validation Leave-One-Out: rcv-loo=0.800; scv-loo=0.54; Fcv-loo (p)=32 (2.25·10-5) | rcv-loo=0.8288; scv-loo=1.11; Fcv-loo (p)=39 (6.49·10-6) | rcv-loo=0.8344; scv-loo=0.73; Fcv-loo (p)=41 (7.94·10-8)
Twenty standard amino acids – Hydrophobicity
Hydrophobicity on the Manavalan & Ponnuswamy, the Fauchere et al., and the Rao & Argos scales proved to be of geometrical nature and linearly dependent on partial charge.
Sample size [reference]: 20 [69] | 20 [70] | 20 [71]
MDF SAR Equation [reference]: ŷ=23.43·x+14.55 [65] | ŷ=5.94·x-4.36 [65] | ŷ=-2.73·x+1.43 [65]
SAR Determination (%): 78 | 78 | 79
MDF Descriptor: inMrpQg | ibDRPQg | AmDROQg
Dominant Atomic Property: Charge (Q) | Charge (Q) | Charge (Q)
Interaction Via: Space (geometry) | Space (geometry) | Space (geometry)
Interaction Model: Q-2 | Q2 | Q
Structure on Activity Scale: Inversed | Inversed | Proportional
Model Statistics: r=0.8814; s=0.76; F (p)=63 (2.84·10-7) | r=0.8832; s=0.50; F (p)=65 (2.50·10-7) | r=0.8901; s=0.24; F (p)=69 (1.48·10-7)
Cross-Validation Leave-One-Out: rcv-loo=0.8546; scv-loo=0.84; Fcv-loo (p)=49 (1.65·10-6) | rcv-loo=0.8611; scv-loo=0.54; Fcv-loo (p)=51 (1.13·10-6) | rcv-loo=0.8545; scv-loo=0.28; Fcv-loo (p)=48 (1.78·10-6)
All the above MDF SAR models presented a moderate coefficient of determination (the values were higher than 75%). At this value, there is an important linear relationship between the molecular descriptor and the hydrophobic or hydrophilic character.
Twenty standard amino acids – Hydrophobicity
All three models showed that the hydrophobic or hydrophilic character was of geometrical nature and depended on partial charge. The determination was fairly good in these models (greater than 80%).
Sample size [reference]: 20 [72] | 20 [73] | 20 [59]
MDF SAR Equation [reference]: ŷ=1.74·x+0.86 [65] | ŷ=-3.78·x+5.30 [65] | ŷ=1.75·x+0.86 [65]
SAR Determination (%): 81 | 85 | 90
MDF Descriptor: inMrpQg | IAmrLQg | inMrpQg
Dominant Atomic Property: Charge (Q) | Charge (Q) | Charge (Q)
Interaction Via: Space (geometry) | Space (geometry) | Space (geometry)
Interaction Model: Q-2 | Q·d | Q2·d-3
Structure on Activity Scale: Inversed | Identity | Inversed
Model Statistics: r=0.8974; s=0.05; F (p)=76 (6.76·10-8) | r=0.9208; s=0.80; F (p)=100 (8.69·10-9) | r=0.8974; s=0.05; F (p)=74 (8.21·10-8)
Cross-Validation Leave-One-Out: rcv-loo=0.8744; scv-loo=0.06; Fcv-loo (p)=56 (6.37·10-7) | rcv-loo=0.9073; scv-loo=0.68; Fcv-loo (p)=84 (3.48·10-8) | rcv-loo=0.8744; scv-loo=0.06; Fcv-loo (p)=58 (4.73·10-7)
Twenty standard amino acids – Hydrophobicity
The hydrophobicity on the Urry scale proved to be of topological nature and linearly dependent on the atomic electronegativity of the standard amino acids studied. The hydrophobicity on the Engelman et al. and on the Eisenberg et al. scales proved to be of geometrical nature and directly dependent on the partial charge. The determination was considered rather good.
Sample size [reference]: 20 [74] | 20 [75] | 20 [76]
MDF SAR Equation [reference]: ŷ=-11.96·x-29.73 [65] | ŷ=-753.09·x+1.85 [65] | ŷ=-0.92·x+1.68 [65]
SAR Determination (%): 82 | 83 | 83
MDF Descriptor: iBDMkEt | INPrWQg | IAMdKQg
Dominant Atomic Property: Electronegativity (E) | Charge (Q) | Charge (Q)
Interaction Via: Bonds (topology) | Space (geometry) | Space (geometry)
Interaction Model: Q-2·d-1 | Q2·d-1 | Q2·d
Structure on Activity Scale: Inversed | Logarithmic | Logarithmic
Model Statistics: r=0.9047; s=1.07; F (p)=81 (4.40·10-8) | r=0.9116; s=2.07; F (p)=89 (2.26·10-8) | r=0.9128; s=0.42; F (p)=90 (2.02·10-8)
Cross-Validation Leave-One-Out: rcv-loo=0.8819; scv-loo=1.18; Fcv-loo (p)=63 (2.85·10-7) | rcv-loo=0.8731; scv-loo=2.56; Fcv-loo (p)=51 (1.13·10-6) | rcv-loo=0.8935; scv-loo=0.46; Fcv-loo (p)=70 (1.31·10-7)
Twenty standard amino acids – Hydrophobicity
The hydrophobicity on the Cowan et al. scale is of geometrical nature and depends on charge. The same is valid for the model with fifteen amino acids. This observation is also true for the hydrophobicity on the Hessa et al. scale, the determination being lower than in the model obtained on the sample of fifteen amino acids. All models showed that the hydrophobic or hydrophilic character was of geometrical nature and depended on the partial charge.
Sample size [reference]: 20 [68] | 20 [77] | 20 [60]
MDF SAR Equation [reference]: ŷ=-2.16·x+4.64 [65] | ŷ=817.95·x+81.72 [65] | ŷ=7.18·x-0.41 [65]
SAR Determination (%): 84 | 85 | 85
MDF Descriptor: lbmdKQg | inMrpQg | AmDROQg
Dominant Atomic Property: Charge (Q) | Charge (Q) | Charge (Q)
Interaction Via: Space (geometry) | Space (geometry) | Space (geometry)
Interaction Model: Q2·d | Q-2 | Q
Structure on Activity Scale: Logarithmic | Inversed | Proportional
Model Statistics: r=0.9182; s=0.52; F (p)=97 (1.15·10-8) | r=0.9232; s=20.73; F (p)=104 (6.69·10-9) | r=0.9238; s=0.32; F (p)=105 (6.24·10-9)
Cross-Validation Leave-One-Out: rcv-loo=0.8984; scv-loo=0.58; Fcv-loo (p)=75 (7.94·10-8) | rcv-loo=0.9082; scv-loo=22.58; Fcv-loo (p)=85 (3.16·10-8) | rcv-loo=0.9018; scv-loo=0.58; Fcv-loo (p)=78 (6.01·10-8)
Twenty standard amino acids – Hydrophobicity
Hydrophobicity is of geometrical nature and depends on charge. In these models, the determination is slightly better compared with the above models in the sample of twenty standard amino acids.
Sample size [reference]: 20 [78] | 20 [79] | 20 [61]
MDF SAR Equation [reference]: ŷ=-0.20·x+1.36 [65] | ŷ=1.85·x+11.05 [65] | ŷ=19.17·x-7.60 [65]
SAR Determination (%): 85 | 86 | 87
MDF Descriptor: iIPmLQt | lfPROQg | IiDRLQt
Dominant Atomic Property: Charge (Q) | Charge (Q) | Charge (Q)
Interaction Via: Bonds (topology) | Space (geometry) | Bonds (topology)
Interaction Model: Q·d | Q | d·Q
Structure on Activity Scale: Inversed | Logarithmic | Identity
Model Statistics: r=0.9252; s=0.36; F (p)=107 (5.30·10-9) | r=0.9259; s=2.46; F (p)=108 (4.88·10-9) | r=0.9328; s=1.11; F (p)=120 (2.10·10-9)
Cross-Validation Leave-One-Out: rcv-loo=0.9003; scv-loo=0.42; Fcv-loo (p)=75 (8.02·10-8) | rcv-loo=0.8935; scv-loo=2.97; Fcv-loo (p)=69 (4.91·10-8) | rcv-loo=0.9226; scv-loo=1.18; Fcv-loo (p)=103 (7.25·10-9)
Twenty standard amino acids – Hydrophobicity
The best determination on the sample of twenty amino acids was obtained for the Black et al. scale. On the Monera et al. scale the determination was better, but the sample contained nineteen amino acids (proline was excluded from the generation of molecular descriptors). As an overall conclusion, the hydrophobicity of amino acids is of geometrical nature and depends on partial charge.
Sample size [reference]: 20 [80] | 19 [81] (proline excluded)
MDF SAR Equation [reference]: ŷ=-0.96·x+0.86 [65] | ŷ=843.88·x+86.05 [65]
SAR Determination (%): 88 | 90
MDF Descriptor: lAmrLQg | inMrpQg
Dominant Atomic Property: Charge (Q) | Charge (Q)
Interaction Via: Space (geometry) | Space (geometry)
Interaction Model: d·√Q | Q-2
Structure on Activity Scale: Proportional | Inversed
Model Statistics: r=0.9376; s=0.12; F (p)=131 (1.09·10-9) | r=0.9504; s=16.49; F (p)=159 (4.77·10-10)
Cross-Validation Leave-One-Out: rcv-loo=0.9263; scv-loo=0.13; Fcv-loo (p)=109 (4.73·10-9) | rcv-loo=0.9380; scv-loo=18.37; Fcv-loo (p)=125 (3.00·10-9)
Toxicity
Polychlorinated organic compounds
The toxicity of the investigated polychlorinated organic compounds proved to be of geometrical nature and related to partial charge in the univariate model as well as in the MDF model with two descriptors.
Model: one descriptor | two descriptors
Sample size [reference]: 31 [82] (both models)
MDF SAR Equation: ŷ=-9.06·x+4.00 | ŷ=-8.33·x1+0.28·x2+0.83
SAR Determination (%): 72 | 87
MDF Descriptor(s) (x1 & x2): IsMRKQg | IsMRKQg & AHPROQg
Dominant Atomic Property: Charge (Q) | Charge (Q) & Charge (Q)
Interaction Via: Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model: Q-2·d-1 | Q-2·d-1 & Q
Structure on Activity Scale: Identity | Identity & Absolute
Model Statistics: r=0.8514; s=0.30; F (p)=76 (1.27·10-9) | r=0.9318; s=0.21; F (p)=92 (4.79·10-13)
Cross-Validation Leave-One-Out: rcv-loo=0.8353; scv-loo=0.32; Fcv-loo (p)=67 (5.24·10-9) | rcv-loo=0.9118; scv-loo=0.24; Fcv-loo (p)=68 (1.68·10-11)
Mono-substituted nitrobenzenes
The toxicity of the investigated mono-substituted nitrobenzenes is of both geometrical and topological nature. It also depends on group electronegativity and partial charge. Sixty percent of the variation in toxicity could be explained by the linear relationship with two molecular descriptors.
Model: one descriptor | two descriptors
Sample size [reference]: 39 [83] (both models)
MDF SAR Equation: ŷ=-91.15·x+6.27 | ŷ=-92.37·x1-7.28·x2+6.37
SAR Determination (%): 60 | 60
MDF Descriptor(s) (x1 & x2): IBMrkGg | IBMrkGg & IsPmVQt
Dominant Atomic Property: Group Electronegativity (G) | Group Electronegativity (G) & Charge (Q)
Interaction Via: Space (geometry) | Space (geometry) & Bonds (topology)
Interaction Model: G-2·d-1 | E-2·d-1 & Q·d-1
Structure on Activity Scale: Identity | Identity & Identity
Model Statistics: r=0.7717; s=0.35; F (p)=54 (8.87·10-9) | r=0.7739; s=0.35; F (p)=27 (7.33·10-8)
Cross-Validation Leave-One-Out: rcv-loo=0.7474; scv-loo=0.37; Fcv-loo (p)=48 (4.71·10-8) | rcv-loo=0.6947; scv-loo=0.41; Fcv-loo (p)=16 (1.25·10-5)
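For the nitrobenzene sample, the second descriptor raises r only from 0.7717 to 0.7739 while rcv-loo drops. One common way to quantify such a gain is a partial F test between nested least-squares models. The sketch below is a generic statistical check, not a step taken from the MDF procedure; it assumes the one-descriptor model is nested in the two-descriptor model (both use IBMrkGg), and the function name is illustrative.

```python
def partial_f(r_reduced, r_full, n, k_reduced, k_full):
    """Partial F statistic for the descriptors added to a nested least-squares model."""
    r2_reduced = r_reduced ** 2
    r2_full = r_full ** 2
    gain_per_added_term = (r2_full - r2_reduced) / (k_full - k_reduced)
    residual_per_dof = (1.0 - r2_full) / (n - k_full - 1)
    return gain_per_added_term / residual_per_dof


# r values reported for the mono-substituted nitrobenzenes (n = 39)
f_added = partial_f(0.7717, 0.7739, n=39, k_reduced=1, k_full=2)
print(round(f_added, 2))
```

A value well below the 5% critical value of F(1, 36), which is roughly 4.1, suggests that the added descriptor does not improve the fit significantly, in line with the unchanged determination of 60%.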
Benzene derivatives
The toxicity of the investigated benzene derivatives proved to be of topological and geometrical nature. It also depended on partial charge and on the number of directly bonded hydrogens.
Model: one descriptor | two descriptors
Sample size [reference]: 69 [84] (both models)
MDF SAR/SPR Equation: ŷ=-0.91·x+2.92 | ŷ=-9.66·x1+1.00·x2+3.25
SAR/SPR Determination (%): 68 | 87
MDF Descriptor(s) (x1 & x2): lFPdoGg | ABmrsQg & iGPrfHt
Dominant Atomic Property: Group Electronegativity (G) | Charge (Q) & Hydrogen (H)
Interaction Via: Space (geometry) | Space (geometry) & Bonds (topology)
Interaction Model: G-1 | Q2·d-3 & H2·d-2
Structure on Activity/Property Scale: Logarithmic | Absolute & Inversed
Model Statistics: r=0.8262; s=0.43; F (p)=144 (1.87·10-18) | r=0.9331; s=0.28; F (p)=222 (1.48·10-30)
Cross-Validation Leave-One-Out: rcv-loo=0.8160; scv-loo=0.44; Fcv-loo (p)=133 (1.08·10-17) | rcv-loo=0.9267; scv-loo=0.29; Fcv-loo (p)=201 (2.97·10-29)
Alkyl metal compounds
The toxicity of alkyl metal compounds proved to be of geometrical nature and dependent on the relative atomic mass as well as on the partial charge. The MDF model was very good, its determination coefficient being close to 1.
Model: one descriptor | two descriptors
Sample size [reference]: 10 [85] (both models)
MDF SAR Equation [reference]: ŷ=0.25·x+0.33 | ŷ=28.06·x1+0.08·x2+2.80 [86]
SAR Determination (%): 97 | 99.7
MDF Descriptor(s) (x1 & x2): iFDmdCg | IbMmpMg & LPPROQg
Dominant Atomic Property: Cardinality (C) | Mass (M) & Charge (Q)
Interaction Via: Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model: d-1 | M-2 & Q
Structure on Activity Scale: Inversed | Identity & Logarithmic
Model Statistics: r=0.9830; s=0.19; F (p)=230 (3.54·10-7) | r=0.9988; s=0.06; F (p)=1473 (6.49·10-10)
Cross-Validation Leave-One-Out: rcv-loo=0.9729; scv-loo=0.24; Fcv-loo (p)=141 (2.29·10-6) | rcv-loo=0.9980; scv-loo=0.07; Fcv-loo (p)=841 (4.57·10-9)
Para-substituted phenols – Toxicity
The model obtained for the toxicity of the para-substituted phenols proved to be good. This model showed that the toxicity was of topological and geometrical nature and was related to partial charge.
Model: one descriptor | two descriptors
Sample size [reference]: 30 [47] (both models)
MDF SAR Equation: ŷ=-603.71·x+1.74 | ŷ=0.04·x1-0.22·x2-2.26
SAR Determination (%): 71 | 90
MDF Descriptor(s) (x1 & x2): IsPdOQg | ASMmVQt & lfDdOQg
Dominant Atomic Property: Charge (Q) | Charge (Q) & Charge (Q)
Interaction Via: Space (geometry) | Bonds (topology) & Space (geometry)
Interaction Model: Q | Q·d-1 & Q
Structure on Activity Scale: Identity | Absolute & Logarithmic
Model Statistics: r=0.8458; s=0.38; F (p)=70 (4.01·10-9) | r=0.9472; s=0.23; F (p)=118 (4.56·10-14)
Cross-Validation Leave-One-Out: rcv-loo=0.8219; scv-loo=0.41; Fcv-loo (p)=58 (2.68·10-8) | rcv-loo=0.9352; scv-loo=0.26; Fcv-loo (p)=93 (7.57·10-13)
Para-substituted phenols - Relative toxicity
The relative toxicity of the investigated para-substituted phenols proved to be of geometrical nature and linearly dependent on the partial charge and the number of directly bonded hydrogens. The model with two descriptors can be considered good, with a determination coefficient of eighty-five percent.
Model: one descriptor | two descriptors
Sample size [reference]: 30 [47] (both models)
MDF SAR Equation: ŷ=4.50·x+6.59 | ŷ=-1.76·x1-1.42·x2+12.14
SAR Determination (%): 68 | 85
MDF Descriptor(s) (x1 & x2): InDDoQg | AHMMVQg & inDmwHg
Dominant Atomic Property: Charge (Q) | Charge (Q) & Hydrogen (H)
Interaction Via: Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model: Q-1 | Q·d-1 & Q2·d-1
Structure on Activity Scale: Identity | Absolute & Inversed
Model Statistics: r=0.8275; s=0.54; F (p)=61 (1.71·10-8) | r=0.9225; s=0.38; F (p)=77 (6.87·10-12)
Cross-Validation Leave-One-Out: rcv-loo=0.8011; scv-loo=0.57; Fcv-loo (p)=50 (1.07·10-7) | rcv-loo=0.9054; scv-loo=0.42; Fcv-loo (p)=61 (9.98·10-11)
Quinoline – Cytotoxicity
The model with two descriptors obtained in the investigation of cytotoxicity in the quinoline sample showed that the activity was of topological nature and depended on relative atomic mass and partial charge. The model was able to predict the activity successfully (the determination coefficient was close to 1). The prediction power was supported by the leave-one-out cross-validation score, which had a value of 0.9805 (see the model with two descriptors).
Model: one descriptor | two descriptors
Sample size [reference]: 15 [87] (both models)
MDF SAR Equation [reference]: ŷ=-6.58·x+4.63 | ŷ=8.35·x1+1.96·x2-4.49 [88]
SAR Determination (%): 65 | 98
MDF Descriptor(s) (x1 & x2): iHPMdCg | INDRLQt & lHPmTMt
Dominant Atomic Property: Cardinality (C) | Charge (Q) & Mass (M)
Interaction Via: Space (geometry) | Bonds (topology) & Bonds (topology)
Interaction Model: d-1 | Q·d & M2·d-4
Structure on Activity Scale: Inversed | Identity & Inversed
Model Statistics: r=0.8044; s=0.63; F (p)=24 (2.99·10-4) | r=0.9882; s=0.17; F (p)=250 (1.65·10-10)
Cross-Validation Leave-One-Out: rcv-loo=0.7576; scv-loo=0.70; Fcv-loo (p)=17 (1.21·10-3) | rcv-loo=0.9805; scv-loo=0.22; Fcv-loo (p)=149 (3.34·10-9)
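The leave-one-out scores quoted in these tables can be illustrated with a short sketch. It assumes that rcv-loo is the correlation between the observed values and the values predicted for each compound by a model refitted without that compound; the chapter does not spell out its exact formula here, so this is an assumption. The data are made-up toy values, not taken from any of the samples, and all function names are illustrative.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    saa = sum((u - mean_a) ** 2 for u in a)
    sbb = sum((v - mean_b) ** 2 for v in b)
    sab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    return sab / sqrt(saa * sbb)


def loo_predictions(xs, ys):
    """Predict each y from a line fitted on all the other points."""
    predictions = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predictions.append(slope * xs[i] + intercept)
    return predictions


# Hypothetical descriptor values (xs) and measured activities (ys)
xs = [0.1, 0.4, 0.5, 0.9, 1.3, 1.6, 2.0, 2.2]
ys = [1.2, 1.9, 2.4, 3.0, 4.1, 4.4, 5.3, 5.9]

slope, intercept = fit_line(xs, ys)
r_fit = pearson_r(ys, [slope * x + intercept for x in xs])
r_loo = pearson_r(ys, loo_predictions(xs, ys))
print(round(r_fit, 4), round(r_loo, 4))
```

A small gap between the fitted r and the leave-one-out r, as in the two-descriptor quinoline model, indicates that no single compound dominates the fit.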
Quinoline – Mutagenicity
The model with two descriptors demonstrated that mutagenicity was of geometrical nature and strongly dependent on partial charge. The estimation power of the model was very close to 1.
Model: one descriptor | two descriptors
Sample size [reference]: 14 [87] (both models)
MDF SAR Equation [reference]: ŷ=0.008·x-4.14 | ŷ=0.21·x1+0.09·x2-1.57 [88]
SAR Determination (%): 72 | 96
MDF Descriptor(s) (x1 & x2): aAmrKQt | lNMrSQg & ASPrVQg
Dominant Atomic Property: Charge (Q) | Charge (Q) & Charge (Q)
Interaction Via: Bonds (topology) | Space (geometry) & Space (geometry)
Interaction Model: Q2·d | Q2·d-3 & Q·d-1
Structure on Activity Scale: Inversed | Logarithmic & Proportional
Model Statistics: r=0.8456; s=0.44; F (p)=30 (1.39·10-4) | r=0.9782; s=0.18; F (p)=122 (3.12·10-8)
Cross-Validation Leave-One-Out: rcv-loo=0.8054; scv-loo=0.49; Fcv-loo (p)=21 (6.46·10-4) | rcv-loo=0.9666; scv-loo=0.22; Fcv-loo (p)=78 (3.18·10-7)
Insecticidal & herbicidal activities
Neonicotinoids - Insecticidal Activity
The model with two descriptors achieved a very good estimation power. According to this model, the activity was of both geometrical and topological nature. It also depended directly on atomic electronegativity and partial charge.
Model: one descriptor | two descriptors
Sample size [reference]: 8 [89] (both models)
MDF SAR Equation [reference]: ŷ=-77.49·x+19.15 | ŷ=-2.21·x1+3.74·x2+43.34 [90]
SAR Determination (%): 88 | 99.9
MDF Descriptor(s) (x1 & x2): iIDrSMg | ImMdsEg & lIMMFQt
Dominant Atomic Property: Mass (M) | Electronegativity (E) & Charge (Q)
Interaction Via: Space (geometry) | Space (geometry) & Bonds (topology)
Interaction Model: M2·d-3 | E2·d-3 & Q2·d-2
Structure on Activity Scale: Inversed | Identity & Inversed
Model Statistics: r=0.9370; s=0.48; F (p)=43 (5.95·10-4) | r=0.9996; s=0.04; F (p)=2865 (2.24·10-8)
Cross-Validation Leave-One-Out: rcv-loo=0.8692; scv-loo=0.68; Fcv-loo (p)=18 (5.49·10-3) | rcv-loo=0.9991; scv-loo=0.06; Fcv-loo (p)=1386 (1.37·10-7)
Substituted triazines (Triazines) - Herbicidal activity A very good MDF model for estimation and prediction was obtained for the herbicidal activity of substituted triazines. The herbicidal activity was of both geometrical and topological nature and depended on the number of directly bonded hydrogens and partial charge. Sample size [reference] MDF SAR Equation [reference] SAR Determination (%) MDF Descriptor(s): x1 & x2
30 ŷ=-4284.7·x+7.47
ŷ=-8112.2·x1+194.35·x2+5.52
95 iSDRFHg
98 iSMMWHg & iSMmEQt
Dominant Atomic Property
Hydrogen (H)
Hydrogen (H) & Charge (Q)
Interaction Via
Space (geometry)
Space (geometry) & Bonds (topology)
Interaction Model Structure on Activity Scale Model Statistics
H2·d2 Inversed r=0.9754; s=0.16; F (p)=549 (2.18·10-20) rcv-loo=0.9725; scv-loo=0.17; Fcv-loo (p)=488 (1.09·10-19)
Cross-Validation Leave-One-Out
Inversed & Inversed r=0.9876; s=0.13; F (p) = 533 (1.37·10-23) rcv-loo=0.9855; scv-loo=0.12; Fcv-loo (p)=449 (1.52·10-22)
Therapeutic activities
3-Indolyl derivatives - Antioxidant efficacy
The antioxidant efficacy of the investigated 3-indolyl derivatives proved to be of geometrical nature and dependent on partial charge (see the model with one descriptor). As expected, the model with two descriptors gave better results (determination coefficient of 99.9%), but this model is close to the limit according to the Hawkins criteria [45].
Model: one descriptor | two descriptors
Sample size [reference]: 8 [93] (both models)
MDF SAR Equation [reference]: ŷ=-1.34·10-5·x-3.76 | ŷ=-1.10·x1-33.24·x2+7.18 [94]
SAR Determination (%): 90 | 99.9
MDF Descriptor(s) (x1 & x2): asMmtQg | lbPMkHg & iAPrVGt
Dominant Atomic Property: Charge (Q) | Hydrogen (H) & Group Electronegativity (G)
Interaction Via: Space (geometry) | Space (geometry) & Bonds (topology)
Interaction Model: Q2·d-4 | H-2·d-1 & G·d-1
Structure on Activity Scale: Inversed | Logarithmic & Inversed
Model Statistics: r=0.9508; s=0.21; F (p)=56 (2.87·10-4) | r=0.9999; s=0.01; F (p)=12591 (5.55·10-10)
Cross-Validation Leave-One-Out: rcv-loo=0.9119; scv-loo=0.29; Fcv-loo (p)=29 (1.64·10-3) | rcv-loo=0.9997; scv-loo=0.02; Fcv-loo (p)=3877 (1.05·10-8)
Substituted N 4-methoxyphenyl benzamides - Antiallergic activity
The model with two descriptors indicated that the antiallergic activity of the investigated substituted N 4-methoxyphenyl benzamides was of both geometrical and topological nature and depended on group electronegativity and relative atomic mass.
Model: one descriptor | two descriptors
Sample size [reference]: 23 [95] (both models)
MDF SAR/SPR Equation: ŷ=0.03·x+0.21 | ŷ=-6.8·10-5·x1-1.3·10-6·x2+0.18
SAR/SPR Determination (%): 92 | 98
MDF Descriptor(s) (x1 & x2): InPrdQg | IFDDpGg & ISDrFMt
Dominant Atomic Property: Charge (Q) | Group Electronegativity (G) & Mass (M)
Interaction Via: Space (geometry) | Space (geometry) & Bonds (topology)
Interaction Model: d-1 | G2 & M2·d-2
Structure on Activity/Property Scale: Identity | Identity & Identity
Model Statistics: r=0.9603; s=0.33; F (p)=249 (4.05·10-13) | r=0.9920; s=0.15; F (p)=616 (4.87·10-20)
Cross-Validation Leave-One-Out: rcv-loo=0.9513; scv-loo=0.36; Fcv-loo (p)=199 (3.46·10-12) | rcv-loo=0.9910; scv-loo=0.16; Fcv-loo (p)=540 (1.99·10-19)
Polyhydroxyxanthones - Antituberculotic activity

The antituberculotic activity of the investigated polyhydroxyxanthones was well estimated and predicted by the model with two descriptors, whose estimated power was 99.7%. The activity was of geometrical nature and depended on partial charge and group electronegativity.

Sample size [reference]: 10 [96]

Characteristic | Model with one descriptor | Model with two descriptors
MDF SAR Equation [reference] | ŷ=-0.07·x+9.74 | ŷ=2.32·x1+19.34·x2-19.11 [97]
SAR Determination (%) | 82 | 99.7
MDF Descriptor(s): x1 & x2 | isDDoHg | lHPDOQg & IsMRKGg
Dominant Atomic Property | Hydrogen (H) | Charge (Q) & Group Electronegativity (G)
Interaction Via | Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model | H^-1 | Q & G^-2·d^-1
Structure on Activity Scale | Inversed | Logarithmic & Identity
Model Statistics | r=0.9082; s=0.23; F (p)=38 (2.78·10^-4) | r=0.9987; s=0.03; F (p)=1327 (9.33·10^-10)
Cross-Validation Leave-One-Out | rcv-loo=0.8346; scv-loo=0.31; Fcv-loo (p)=18 (2.99·10^-3) | rcv-loo=0.9974; scv-loo=0.04; Fcv-loo (p)=663 (1.05·10^-8)
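The tables in this section pair every seven-letter MDF descriptor name with a dominant atomic property and an interaction route. As a minimal sketch, and assuming (from the tables shown in this section only) that the sixth letter of the name encodes the atomic property and the seventh letter encodes geometry versus topology, the two fields can be read from a name as follows. The function name `decode_mdf_suffix` and both lookup dictionaries are illustrative and are not part of the MDF software.

```python
# Hypothetical helper: the letter-to-meaning maps below are inferred from the
# tables in this section, not taken from the MDF software itself.
ATOMIC_PROPERTY = {
    "C": "Cardinality (C)",
    "H": "Hydrogen (H)",
    "M": "Mass (M)",
    "E": "Electronegativity (E)",
    "G": "Group Electronegativity (G)",
    "Q": "Charge (Q)",
}
INTERACTION_VIA = {"g": "Space (geometry)", "t": "Bonds (topology)"}


def decode_mdf_suffix(name: str) -> tuple[str, str]:
    """Return (dominant atomic property, interaction via) for a 7-letter name."""
    if len(name) != 7:
        raise ValueError(f"expected a 7-letter descriptor name, got {name!r}")
    # Sixth letter: atomic property; seventh letter: geometry or topology.
    return ATOMIC_PROPERTY[name[5]], INTERACTION_VIA[name[6]]


# Descriptor names taken from the polyhydroxyxanthones table.
for descriptor in ("isDDoHg", "lHPDOQg", "IsMRKGg"):
    print(descriptor, decode_mdf_suffix(descriptor))
```

The remaining five letters (including the linearisation operator and the interaction model) are deliberately left undecoded here, because their meaning cannot be recovered unambiguously from the tables alone.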
Taxoids - Growth inhibition activity

A model with an estimated power of 92% was obtained for the taxoids. According to the model with two descriptors, the growth inhibition activity of the investigated taxoids was of geometrical nature and was related to the number of directly bonded hydrogens.

Sample size [reference]: 34 [98]

Characteristic | Model with one descriptor | Model with two descriptors
MDF SAR Equation [reference] | ŷ=0.89·x-8.23 | ŷ=0.002·x1+77.22·x2-17.7 [99]
SAR Determination (%) | 83 | 92
MDF Descriptor(s): x1 & x2 | IHDrFHt | isMdTHg & IiDrQHg
Dominant Atomic Property | Hydrogen (H) | Hydrogen (H) & Hydrogen (H)
Interaction Via | Bonds (topology) | Space (geometry) & Space (geometry)
Interaction Model | H^2·d^-2 | H^2·d^-4 & H√H
Structure on Activity Scale | Identity | Inversed & Identity
Model Statistics | r=0.9108; s=0.51; F (p)=156 (7.75·10^-14) | r=0.9583; s=0.36; F (p)=174 (2.86·10^-18)
Cross-Validation Leave-One-Out | rcv-loo=0.9006; scv-loo=0.54; Fcv-loo (p)=137 (4.08·10^-13) | rcv-loo=0.9507; scv-loo=0.39; Fcv-loo (p)=146 (2.22·10^-16)
MDF: From molecular structure to molecular design
HEPTA and TIBO derivatives - anti-HIV-1 potencies

The anti-HIV-1 potencies of the investigated HEPTA and TIBO derivatives proved to be of geometrical nature and related to atomic and group electronegativity. The estimated power of the model was modest, but increasing the number of molecular descriptors to five provided a significantly better model [101].

Sample size [reference]: 57 [100]

Characteristic | Model with one descriptor | Model with two descriptors
MDF SAR Equation | ŷ=-3776.8·x+8.29 | ŷ=-19.43·x1+11.07·x2-4.30
SAR Determination (%) | 61 | 78
MDF Descriptor(s): x1 & x2 | imMDPMg | lIDrFEg & iMMsGg
Dominant Atomic Property | Mass (M) | Electronegativity (E) & Group Electronegativity (G)
Interaction Via | Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model | M^2 | E^2·d^-2 & G^2·d^-3
Structure on Activity Scale | Inversed | Logarithmic & Logarithmic
Model Statistics | r=0.7812; s=0.95; F (p)=86 (7.58·10^-13) | r=0.8849; s=0.71; F (p)=97 (5.77·10^-19)
Cross-Validation Leave-One-Out | rcv-loo=0.7636; scv-loo=0.98; Fcv-loo (p)=76 (5.03·10^-12) | rcv-loo=0.8750; scv-loo=0.74; Fcv-loo (p)=88 (5.18·10^-18)
Substituted 1,3,4-thiadiazole- and 1,3,4-thiadiazoline-disulfonamides (40846_1) - Inhibition activity on carbonic anhydrase I

The inhibition activity on carbonic anhydrase I of the substituted 1,3,4-thiadiazole- and 1,3,4-thiadiazoline-disulfonamides proved to be of geometrical nature and depended on partial charge and relative atomic mass according to the model with two descriptors. When the number of descriptors was increased, a model whose estimated power was 10% higher than that of the model with two descriptors was obtained [103].

Sample size [reference]: 40 [102]

Characteristic | Model with one descriptor | Model with two descriptors
MDF SAR Equation | ŷ=-0.008·x+0.66 | ŷ=0.11·x1+3.10·10^-3·x2+1.74
SAR Determination (%) | 63 | 81
MDF Descriptor(s): x1 & x2 | isMRdQt | inPRlQg & lPDMqMg
Dominant Atomic Property | Charge (Q) | Charge (Q) & Mass (M)
Interaction Via | Bonds (topology) | Space (geometry) & Space (geometry)
Interaction Model | d^-1 | Q^-2·d^-1 & M√M
Structure on Activity Scale | Inversed | Inversed & Logarithmic
Model Statistics | r=0.7927; s=0.33; F (p)=64 (1.09·10^-9) | r=0.8975; s=0.24; F (p)=77 (6.95·10^-14)
Cross-Validation Leave-One-Out | rcv-loo=0.7787; scv-loo=0.34; Fcv-loo (p)=58 (3.38·10^-9) | rcv-loo=0.8882; scv-loo=0.25; Fcv-loo (p)=69 (3.37·10^-13)
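The "SAR Determination (%)" row appears to be the squared correlation coefficient expressed as a percentage. The short sketch below checks that reading against the two carbonic anhydrase I models; the function name `determination_percent` is hypothetical, and the relation itself is an assumption inferred from the tabulated numbers rather than a definition quoted from the chapter.

```python
def determination_percent(r: float) -> float:
    """Coefficient of determination, 100 * r^2 (assumed meaning of the table row)."""
    return 100.0 * r * r


# Correlation coefficients of the one- and two-descriptor models above.
for label, r in (("one descriptor", 0.7927), ("two descriptors", 0.8975)):
    print(f"{label}: r = {r:.4f}, determination = {determination_percent(r):.1f}%")
```

Rounded to whole percentages, both values agree with the tabulated 63 and 81.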
Substituted 1,3,4-thiadiazole- and 1,3,4-thiadiazoline-disulfonamides (40846_2) - Inhibition activity on carbonic anhydrase II

The inhibition activity on carbonic anhydrase II of the substituted 1,3,4-thiadiazole- and 1,3,4-thiadiazoline-disulfonamides was of geometrical nature and depended on cardinality and partial charge (see the model with two descriptors). When the investigation continued, a more powerful model was obtained: a model with four descriptors whose estimated power was 11% higher than that of the model with two descriptors [104].

Sample size [reference]: 40 [102]

Characteristic | Model with one descriptor | Model with two descriptors
MDF SAR Equation | ŷ=-2.49·10^-3·x+1.43 | ŷ=2.44·x1+0.09·x2-4.45
SAR Determination (%) | 55 | 79
MDF Descriptor(s): x1 & x2 | lPmrSMg | imDdSCg & iiMrqQg
Dominant Atomic Property | Mass (M) | Cardinality (C) & Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model | M^2·d^-3 | C^2·d^-3 & Q√Q
Structure on Activity Scale | Logarithmic | Inversed & Inversed
Model Statistics | r=0.7422; s=0.35; F (p)=47 (4.21·10^-8) | r=0.8862; s=0.25; F (p)=68 (4.36·10^-13)
Cross-Validation Leave-One-Out | rcv-loo=0.7187; scv-loo=0.37; Fcv-loo (p)=41 (1.80·10^-7) | rcv-loo=0.8697; scv-loo=0.26; Fcv-loo (p)=57 (4.60·10^-12)
Substituted 1,3,4-thiadiazole- and 1,3,4-thiadiazoline-disulfonamides (40846_4) - Inhibition activity on carbonic anhydrase IV

The analysis of the model with two descriptors revealed that the inhibition activity on carbonic anhydrase IV of the substituted 1,3,4-thiadiazole- and 1,3,4-thiadiazoline-disulfonamides was of geometrical and topological nature and depended on partial charge. The estimated power of the model was only moderate, but increasing the number of descriptors to four raised the estimated power by 16% [105].

Sample size [reference]: 40 [102]

Characteristic | Model with one descriptor | Model with two descriptors
MDF SAR Equation | ŷ=-0.006·x+0.40 | ŷ=0.11·x1+9.98·10^-9·x2+0.80
SAR Determination (%) | 56 | 75
MDF Descriptor(s): x1 & x2 | iAPmsQt | inPRlQg & iHMMTQt
Dominant Atomic Property | Charge (Q) | Charge (Q) & Charge (Q)
Interaction Via | Bonds (topology) | Space (geometry) & Bonds (topology)
Interaction Model | Q^2·d^-3 | Q^-1·d^-1 & Q^2·d^-4
Structure on Activity Scale | Inversed | Inversed & Inversed
Model Statistics | r=0.7455; s=0.36; F (p)=48 (3.41·10^-8) | r=0.8672; s=0.27; F (p)=56 (6.21·10^-12)
Cross-Validation Leave-One-Out | rcv-loo=0.7230; scv-loo=0.38; Fcv-loo (p)=42 (1.38·10^-7) | rcv-loo=0.8536; scv-loo=0.29; Fcv-loo (p)=49 (3.54·10^-11)
Dipeptides - Inhibition activity

A model with moderate estimation power was obtained when the inhibition activity of dipeptides was investigated. The model with two descriptors showed that the activity was of both topological and geometrical nature and depended on relative atomic mass and the number of directly bonded hydrogens.

Sample size [reference]: 58 [106]

Characteristic | Model with one descriptor | Model with two descriptors
MDF SAR Equation | ŷ=2.26·10^-4·x+1.94 | ŷ=-1.47·x1+0.12·x2+2.57
SAR Determination (%) | 75 | 85
MDF Descriptor(s): x1 & x2 | ISMrSGg | ibDMFHt & ISPdlMg
Dominant Atomic Property | Group Electronegativity (G) | Hydrogen (H) & Mass (M)
Interaction Via | Space (geometry) | Bonds (topology) & Space (geometry)
Interaction Model | G^2·d^-3 | H^2·d^-2 & M^-1·d^-1
Structure on Activity Scale | Identity | Inversed & Identity
Model Statistics | r=0.8656; s=0.51; F (p)=167 (1.32·10^-18) | r=0.9208; s=0.40; F (p)=153 (1.15·10^-23)
Cross-Validation Leave-One-Out | rcv-loo=0.8551; scv-loo=0.52; Fcv-loo (p)=152 (9.73·10^-18) | rcv-loo=0.9135; scv-loo=0.41; Fcv-loo (p)=139 (1.27·10^-22)
2,4-diamino-5-(substituted-benzyl)-pyrimidines - Inhibition activity

The inhibition activity of the investigated 2,4-diamino-5-(substituted-benzyl)-pyrimidines proved to be of both geometrical and topological nature. It was also related to the number of directly bonded hydrogens (see the model with two descriptors).

Sample size [reference]: 67 [107]

Characteristic | Model with one descriptor | Model with two descriptors
MDF SAR Equation | ŷ=-707.32·x+10.83 | ŷ=-4.90·x1+2.31·x2+3.26
SAR Determination (%) | 73 | 86
MDF Descriptor(s): x1 & x2 | ibMrEMt | lImrKHt & lIMDWHg
Dominant Atomic Property | Mass (M) | Hydrogen (H) & Hydrogen (H)
Interaction Via | Bonds (topology) | Bonds (topology) & Space (geometry)
Interaction Model | M·d^-2 | H^-2·d^-1 & H^2·d^-1
Structure on Activity Scale | Inversed | Logarithmic & Logarithmic
Model Statistics | r=0.8551; s=0.32; F (p)=177 (2.42·10^-20) | r=0.9268; s=0.23; F (p)=195 (2.06·10^-28)
Cross-Validation Leave-One-Out | rcv-loo=0.8459; scv-loo=0.33; Fcv-loo (p)=163 (1.62·10^-19) | rcv-loo=0.9195; scv-loo=0.24; Fcv-loo (p)=175 (4.05·10^-27)
Peptide analogues - Inhibition activity

The model with two descriptors obtained for the inhibition activity of the peptide analogues showed that the activity was of geometrical nature and strongly related to the relative atomic mass.

Sample size [reference]: 47 [108]

Characteristic | Model with one descriptor | Model with two descriptors
MDF SAR Equation | ŷ=3202.8·x-16.51 | ŷ=240.54·x1-0.10·x2+0.94
SAR Determination (%) | 81 | 88
MDF Descriptor(s): x1 & x2 | IHmRpMg | IHMdpMg & IHMdOMg
Dominant Atomic Property | Mass (M) | Mass (M) & Mass (M)
Interaction Via | Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model | M^2·d^-3 | M^2·d^-3 & M
Structure on Activity Scale | Identity | Identity & Identity
Model Statistics | r=0.8999; s=0.28; F (p)=192 (5.16·10^-18) | r=0.9400; s=0.22; F (p)=167 (8.04·10^-22)
Cross-Validation Leave-One-Out | rcv-loo=0.8900; scv-loo=0.29; Fcv-loo (p)=171 (1.11·10^-16) | rcv-loo=0.9325; scv-loo=0.23; Fcv-loo (p)=147 (1.10·10^-20)
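The F values in the "Model Statistics" row can be cross-checked against r and the sample size. The sketch below assumes that F is the usual regression F-statistic of a least-squares model with an intercept, F = r²/(1 − r²) · (n − k − 1)/k, where n is the sample size and k the number of descriptors; the chapter does not state this formula in this section, so the relation and the function name `f_statistic` are assumptions made for illustration.

```python
def f_statistic(r: float, n: int, k: int) -> float:
    """Regression F-statistic from the correlation coefficient r,
    the sample size n and the number of descriptors k (assumed formula)."""
    r2 = r * r
    return (r2 / (1.0 - r2)) * (n - k - 1) / k


# Peptide analogues, n = 47: one- and two-descriptor models from the table above.
f_one = f_statistic(0.8999, n=47, k=1)
f_two = f_statistic(0.9400, n=47, k=2)
print(f"one descriptor: F = {f_one:.1f}")
print(f"two descriptors: F = {f_two:.1f}")
```

Both results fall within one unit of the tabulated values (192 and 167); the small gap is expected because the tabulated r is rounded to four decimals.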
Lethal/effective concentration

Ordnance compounds - Fertilization of Sea Urchin

The lethal/effective concentration of the investigated ordnance compounds in the fertilization of the Sea Urchin proved to be of topological nature. It also depended on group electronegativity (see the model with one descriptor). The estimated power did not increase when more descriptors were used, which supports the ability of the model with one descriptor to estimate and predict.

Sample size [reference]: 8 [109]

Characteristic | Model with one descriptor | Model with two descriptors
MDF SAR Equation [reference] | ŷ=75.98·x+3937.6 | ŷ=-4291.5·x1-24751·x2+82488 [110]
SAR Determination (%) | 99.9 | 99.9
MDF Descriptor(s): x1 & x2 | ISPRwGt | iSDmtQg & lAMrFEt
Dominant Atomic Property | Group Electronegativity (G) | Charge (Q) & Electronegativity (E)
Interaction Via | Bonds (topology) | Space (geometry) & Bonds (topology)
Interaction Model | G^2·d^-1 | Q^2·d^-4 & E^2·d^-2
Structure on Activity Scale | Identity | Inversed & Logarithmic
Model Statistics | r=0.9996; s=192; F (p)=7754 (1.44·10^-10) | r=0.9999; s=13; F (p)=823154 (1.61·10^-14)
Cross-Validation Leave-One-Out | rcv-loo=0.9990; scv-loo=328; Fcv-loo (p)=2645 (3.62·10^-9) | rcv-loo=0.9999; scv-loo=18; Fcv-loo (p)=418264 (8.74·10^-14)
Ordnance compounds - Embryological development (1st column) & germination (2nd column) of Sea Urchin

The lethal/effective concentration in the embryological development of the Sea Urchin proved to be of topological nature and depended on partial charge. This model had very good estimation power. The lethal/effective concentration in the germination of the Sea Urchin was also of topological nature and related to partial charge. The estimation power of this model was good, but not as high as that of the previous model.

Characteristic | Embryological development | Germination
Sample size [reference] | 5 [109] | 7 [109]
MDF SAR Equation | ŷ=-0.37·x-0.16 | ŷ=-1.09·x-7.09
SAR Determination (%) | 99.9 | 93
MDF Descriptor | LIMmwQt | lNPmfQt
Dominant Atomic Property | Charge (Q) | Charge (Q)
Interaction Via | Bonds (topology) | Bonds (topology)
Interaction Model | Q^2·d^-1 | Q^2·d^-2
Structure on Activity Scale | Logarithmic | Logarithmic
Model Statistics | r=0.9997; s=0.02; F (p)=5677 (5.15·10^-6) | r=0.9650; s=0.35; F (p)=68 (4.32·10^-4)
Cross-Validation Leave-One-Out | rcv-loo=0.9992; scv-loo=0.05; Fcv-loo (p)=1220 (5.16·10^-5) | rcv-loo=0.9197; scv-loo=0.52; Fcv-loo (p)=27 (3.43·10^-3)
Ordnance compounds - Zoospore germination of Green Macroalgae

The lethal/effective concentration of ordnance compounds in the zoospore germination of the Green Macroalgae seemed to be of geometrical nature and depended on relative atomic mass and partial charge (see the model with two descriptors).

Sample size [reference]: 8 [109]

Characteristic | Model with one descriptor | Model with two descriptors
MDF SAR Equation [reference] | ŷ=0.06·x-1.50 | ŷ=-0.004·x1+11.11·x2+21.58 [110]
SAR Determination (%) | 89 | 99.9
MDF Descriptor(s): x1 & x2 | aIDmjQg | iHDRkMg & inMrPQg
Dominant Atomic Property | Charge (Q) | Mass (M) & Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model | Q^-1·d^-1 | M^-2·d^-1 & Q^2
Structure on Activity Scale | Inversed | Inversed & Inversed
Model Statistics | r=0.9435; s=0.39; F (p)=49 (4.32·10^-4) | r=0.9996; s=0.04; F (p)=2942 (2.10·10^-8)
Cross-Validation Leave-One-Out | rcv-loo=0.9129; scv-loo=0.50; Fcv-loo (p)=27 (1.94·10^-3) | rcv-loo=0.9988; scv-loo=0.06; Fcv-loo (p)=1042 (2.80·10^-7)
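Each table reports a leave-one-out cross-validated correlation (rcv-loo) next to the fitted r. The following sketch shows one common way to obtain such a pair for a one-descriptor model: every compound is left out in turn, the line is refitted on the remaining compounds, and the left-out value is predicted. It assumes that rcv-loo is the Pearson correlation between observed values and these leave-one-out predictions, which this section does not spell out. The data are made up for illustration and are not taken from any of the cited sets, and all function names are hypothetical.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def pearson_r(us, vs):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    suv = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    suu = sum((u - mu) ** 2 for u in us)
    svv = sum((v - mv) ** 2 for v in vs)
    return suv / sqrt(suu * svv)


def loo_predictions(xs, ys):
    """Predict each y from a line fitted on all the other points."""
    preds = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(a * xs[i] + b)
    return preds


# Made-up descriptor values (x) and activities (y) for eight compounds.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

a, b = fit_line(x, y)
r = pearson_r(y, [a * xi + b for xi in x])
r_cv = pearson_r(y, loo_predictions(x, y))
print(f"r = {r:.4f}, rcv-loo = {r_cv:.4f}")
```

With sets of five to eight compounds, as in the ordnance tables, every left-out point changes the fit noticeably, which is why the gap between r and rcv-loo is worth reporting alongside r itself.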
Ordnance compounds - Germling length of Green Macroalgae

The lethal/effective concentration of ordnance compounds in the germling length of the Green Macroalgae proved to be of geometrical nature and related to the number of directly bonded hydrogens and partial charge, according to the model with two descriptors. The estimation and prediction power of this model was almost perfect.

Sample size [reference]: 8 [109]

Characteristic | Model with one descriptor | Model with two descriptors
MDF SAR Equation [reference] | ŷ=-1.88·x-6.13 | ŷ=-10.09·x1-1.39·x2+6.95 [110]
SAR Determination (%) | 89 | 99.9
MDF Descriptor(s): x1 & x2 | LIDmjQg | iGDREHg & lnDDVQg
Dominant Atomic Property | Charge (Q) | Hydrogen (H) & Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model | Q^-1·d^-1 | H·d^-2 & Q·d^-1
Structure on Activity Scale | Logarithmic | Inversed & Logarithmic
Model Statistics | r=0.9445; s=0.35; F (p)=50 (4.09·10^-4) | r=0.9999; s=0.02; F (p)=11350 (7.20·10^-10)
Cross-Validation Leave-One-Out | rcv-loo=0.8969; scv-loo=0.49; Fcv-loo (p)=23 (3.09·10^-3) | rcv-loo=0.9992; scv-loo=0.06; Fcv-loo (p)=1089 (2.51·10^-7)
Ordnance compounds - Germling cell number of Green Macroalgae

Two models were obtained for the lethal/effective concentration of ordnance compounds in the germling cell number of the Green Macroalgae. According to the model with two descriptors, the activity was of geometrical nature and strongly depended on partial charge. The estimated power of the model with two descriptors was very high.

Sample size [reference]: 8 [109]

Characteristic | Model with one descriptor | Model with two descriptors
MDF SAR Equation [reference] | ŷ=-1.87·x-6.02 | ŷ=382.96·x1-5.15·x2+5.97 [110]
SAR Determination (%) | 88 | 99.9
MDF Descriptor(s): x1 & x2 | LIDmjQg | AHDmtQg & inMDqQg
Dominant Atomic Property | Charge (Q) | Charge (Q) & Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) & Space (geometry)
Interaction Model | Q^-1·d^-1 | Q^2·d^-2 & Q^-1·(√Q)^-1
Structure on Activity Scale | Logarithmic | Absolute & Inversed
Model Statistics | r=0.9359; s=0.38; F (p)=42 (6.28·10^-4) | r=0.9996; s=0.03; F (p)=3132 (1.80·10^-8)
Cross-Validation Leave-One-Out | rcv-loo=0.8907; scv-loo=0.50; Fcv-loo (p)=22 (3.46·10^-3) | rcv-loo=0.9992; scv-loo=0.05; Fcv-loo (p)=1545 (1.05·10^-7)
Ordnance compounds - Survival and reproductive success of Polychaete (1st column) & Juveniles survival of Opossum Shrimp (2nd column)

The lethal/effective concentration of ordnance compounds expressed as the survival and reproductive success of the Polychaete (1st column) proved to be of topological nature and related to partial charge. The lethal/effective concentration of ordnance compounds expressed as the juvenile survival of the Opossum Shrimp (2nd column) proved to be of geometrical nature and also depended on partial charge.

Characteristic | Polychaete | Opossum Shrimp
Sample size [reference] | 7 [109] | 7 [109]
MDF SAR Equation | ŷ=-102.72·x-0.79 | ŷ=-1.31·x+0.28
SAR Determination (%) | 97 | 91
MDF Descriptor(s) | IAPmtQt | LHDmjQg
Dominant Atomic Property | Charge (Q) | Charge (Q)
Interaction Via | Bonds (topology) | Space (geometry)
Interaction Model | Q^2·d^-4 | Q^-1·d^-1
Structure on Activity Scale | Identity | Logarithmic
Model Statistics | r=0.9835; s=0.22; F (p)=148 (6.65·10^-5) | r=0.9531; s=0.25; F (p)=50 (8.93·10^-4)
Cross-Validation Leave-One-Out | rcv-loo=0.9748; scv-loo=0.27; Fcv-loo (p)=94 (1.98·10^-4) | rcv-loo=0.9171; scv-loo=0.35; Fcv-loo (p)=24 (4.53·10^-3)
Ordnance compounds - Redfish larvae survival

The lethal/effective concentration of ordnance compounds expressed as redfish larvae survival proved to be of topological nature and related to cardinality and partial charge.

Sample size [reference]: 8 [109]

Characteristic | Model with one descriptor | Model with two descriptors
MDF SAR Equation [reference] | ŷ=16.91·x-1.73 | ŷ=-14.72·x1-0.11·x2+17 [110]
SAR Determination (%) | 93 | 99.9
MDF Descriptor(s): x1 & x2 | anDRJQt | iAMrECt & aAPmfQt
Dominant Atomic Property | Charge (Q) | Cardinality (C) & Charge (Q)
Interaction Via | Bonds (topology) | Bonds (topology) & Bonds (topology)
Interaction Model | Q·d | C·d^-2 & Q^2·d^-2
Structure on Activity Scale | Inversed | Inversed & Inversed
Model Statistics | r=0.9655; s=0.32; F (p)=82 (9.99·10^-5) | r=0.9998; s=0.03; F (p)=6295 (3.14·10^-9)
Cross-Validation Leave-One-Out | rcv-loo=0.9408; scv-loo=0.42; Fcv-loo (p)=45 (5.10·10^-4) | rcv-loo=0.9996; scv-loo=0.04; Fcv-loo (p)=2813 (2.35·10^-8)
No observed effect concentration (NOEC)

Ordnance compounds - Fertilization (1st column) & embryological development (2nd column) & germination of Sea Urchin (3rd column)

The NOEC of the investigated ordnance compounds in the fertilization of the Sea Urchin proved to be of geometrical nature and related to the compounds’ cardinality. The NOEC in the embryological development of the Sea Urchin proved to be of geometrical nature and related to partial charge. The NOEC in the germination of the Sea Urchin proved to be of topological nature and related to partial charge. All three models had good estimation and prediction power, with a coefficient of determination greater than or equal to 95%.

Characteristic | Fertilization | Embryological development | Germination
Sample size [reference] | 7 [109] | 7 [109] | 6 [109]
MDF SAR Equation | ŷ=-0.80·x+3.93 | ŷ=0.17·x+1.42 | ŷ=0.001·x-1.27
SAR Determination (%) | 96 | 95 | 97
MDF Descriptor | imMrtCg | ASPmwQg | asmrfQt
Dominant Atomic Property | Cardinality (C) | Charge (Q) | Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) | Bonds (topology)
Interaction Model | C^2·d^-4 | Q^2·d^-1 | Q^2·d^-2
Structure on Activity Scale | Inversed | Logarithmic | Inversed
Model Statistics | r=0.9787; s=0.10; F (p)=114 (1.25·10^-4) | r=0.9738; s=0.08; F (p)=92 (2.09·10^-4) | r=0.9859; s=0.27; F (p)=139 (2.97·10^-4)
Cross-Validation Leave-One-Out | rcv-loo=0.9540; scv-loo=0.13; Fcv-loo (p)=61 (5.58·10^-4) | rcv-loo=0.9704; scv-loo=0.10; Fcv-loo (p)=50 (8.80·10^-4) | rcv-loo=0.9627; scv-loo=0.39; Fcv-loo (p)=64 (1.34·10^-3)
Ordnance compounds - Germling length and cell number of Green Macroalgae

The NOEC of the investigated ordnance compounds in the germling length and cell number of the Green Macroalgae proved to be of both topological and geometrical nature. It also depended on group electronegativity and relative atomic mass (model with two descriptors). The model with two descriptors proved to have excellent estimation and prediction power.

Sample size [reference]: 8 [109]

Characteristic | Model with one descriptor | Model with two descriptors
MDF SAR Equation | ŷ=0.06·x-1.74 | ŷ=24.40·x1-28.28·x2+252.94
SAR Determination (%) | 87 | 99.9
MDF Descriptor(s): x1 & x2 | aIDmjQg | lMMRSGt & lsPmpMg
Dominant Atomic Property | Charge (Q) | Group Electronegativity (G) & Mass (M)
Interaction Via | Space (geometry) | Bonds (topology) & Space (geometry)
Interaction Model | Q^-1·d^-1 | G^2·d^-3 & M^-2
Structure on Activity Scale | Inversed | Logarithmic & Logarithmic
Model Statistics | r=0.9355; s=0.40; F (p)=42 (6.38·10^-4) | r=0.9995; s=0.04; F (p)=2499 (3.16·10^-8)
Cross-Validation Leave-One-Out | rcv-loo=0.9003; scv-loo=0.53; Fcv-loo (p)=22 (3.17·10^-3) | rcv-loo=0.9990; scv-loo=0.06; Fcv-loo (p)=1173 (2.08·10^-7)
Ordnance compounds - Survival and reproductive success of Green Macroalgae

The NOEC of the investigated ordnance compounds in the survival and reproductive success of the Green Macroalgae proved to be of both geometrical and topological nature and related to the compounds’ atomic electronegativity and partial charge (model with two descriptors). The model with two descriptors proved to have excellent estimation and prediction power.

Sample size [reference]: 8 [109]

Characteristic | Model with one descriptor | Model with two descriptors
MDF SAR Equation | ŷ=-1.28·x+3.71 | ŷ=-0.89·x1-13455·x2+28.03
SAR Determination (%) | 92 | 99.9
MDF Descriptor(s): x1 & x2 | LnDRJQt | IHMRFEg & INPmsQt
Dominant Atomic Property | Charge (Q) | Electronegativity (E) & Charge (Q)
Interaction Via | Bonds (topology) | Space (geometry) & Bonds (topology)
Interaction Model | Q·d | E^2·d^-2 & Q^2·d^-3
Structure on Activity Scale | Logarithmic | Identity & Identity
Model Statistics | r=0.9578; s=0.36; F (p)=65 (1.83·10^-4) | r=0.9998; s=0.03; F (p)=5938 (3.63·10^-9)
Cross-Validation Leave-One-Out | rcv-loo=0.9237; scv-loo=0.61; Fcv-loo (p)=19 (4.68·10^-3) | rcv-loo=0.9994; scv-loo=0.05; Fcv-loo (p)=1821 (6.95·10^-8)
Ordnance compounds - Redfish larvae survival (1st column) & survival and reproductive success of Polychaete (2nd column)

The NOEC of the investigated ordnance compounds in redfish larvae survival proved to be of both geometrical and topological nature and related to partial charge and the number of directly bonded hydrogens. The NOEC of the investigated ordnance compounds in the survival and reproductive success of Polychaete was of geometrical nature and related to partial charge.

Characteristic | Redfish larvae, model with one descriptor | Redfish larvae, model with two descriptors | Polychaete, model with one descriptor
Sample size [reference] | 8 [109] | 8 [109] | 6 [109]
MDF SAR Equation | ŷ=-1.37·x+0.09 | ŷ=-0.003·x1+2.25·x2+4.59 | ŷ=-1.42·x-10.25
SAR Determination (%) | 91 | 99.9 | 95
MDF Descriptor(s): x1 & x2 | LHDmjQg | asDmkQg & IGMmTHt | LsmrfQg
Dominant Atomic Property | Charge (Q) | Charge (Q) & Hydrogen (H) | Charge (Q)
Interaction Via | Space (geometry) | Space (geometry) & Bonds (topology) | Space (geometry)
Interaction Model | Q^-1·d^-1 | Q^-2·d^-1 & H^2·d^-4 | Q^2·d^-2
Structure on Activity Scale | Logarithmic | Inversed & Identity | Logarithmic
Model Statistics | r=0.9542; s=0.24; F (p)=61 (2.33·10^-4) | r=0.9995; s=0.03; F (p)=2373 (3.59·10^-8) | r=0.9754; s=0.32; F (p)=78 (8.98·10^-4)
Cross-Validation Leave-One-Out | rcv-loo=0.9162; scv-loo=0.34; Fcv-loo (p)=28 (1.84·10^-3) | rcv-loo=0.9987; scv-loo=0.05; Fcv-loo (p)=907 (3.96·10^-7) | rcv-loo=0.9519; scv-loo=0.46; Fcv-loo (p)=37 (3.70·10^-3)
Ordnance compounds - Juveniles survival of Opossum Shrimp

The NOEC of the investigated ordnance compounds in the juvenile survival of the Opossum Shrimp was of geometrical nature and related to partial charge and cardinality (model with two descriptors). The determination coefficient of the model with one descriptor was moderately good, while that of the model with two descriptors was very good.

Sample size [reference]: 8 [109]

Characteristic | Model with one descriptor | Model with two descriptors
MDF SAR Equation | ŷ=668.36·x+19.24 | ŷ=1.46·x1-0.008·x2+0.28
SAR Determination (%) | 82 | 99.9
MDF Descriptor(s): x1 & x2 | iBPMwEt | iIPdqQg & iImrSCg
Dominant Atomic Property | Electronegativity (E) | Charge (Q) & Cardinality (C)
Interaction Via | Bonds (topology) | Space (geometry) & Space (geometry)
Interaction Model | E^2·d^-1 | Q^-1·(√Q)^-1 & C^2·d^-3
Structure on Activity Scale | Inversed | Inversed & Inversed
Model Statistics | r=0.9048; s=0.28; F (p)=27 (2.01·10^-3) | r=0.9998; s=0.01; F (p)=5435 (4.53·10^-9)
Cross-Validation Leave-One-Out | rcv-loo=0.8522; scv-loo=0.40; Fcv-loo (p)=11 (1.75·10^-2) | rcv-loo=0.9993; scv-loo=0.03; Fcv-loo (p)=1695 (8.32·10^-8)
Lowest observed effect concentration (LOEC)

Ordnance compounds - Fertilization (1st column) & embryological development (2nd column) of Sea Urchin

The LOEC of the investigated ordnance compounds in the fertilization and embryological development of the Sea Urchin was of topological nature and depended on partial charge. The estimation power of the models was good, with determination values greater than or equal to 93%.

Characteristic | Fertilization | Embryological development
Sample size [reference] | 6 [109] | 7 [109]
MDF SAR Equation | ŷ=-47.56·x+0.57 | ŷ=-1.14·x-7.62
SAR Determination (%) | 99.8 | 93
MDF Descriptor(s) | IAPmfQt | lNPmfQt
Dominant Atomic Property | Charge (Q) | Charge (Q)
Interaction Via | Bonds (topology) | Bonds (topology)
Interaction Model | Q^2·d^-2 | Q^2·d^-2
Structure on Activity Scale | Identity | Logarithmic
Model Statistics | r=0.9993; s=0.04; F (p)=2781 (7.74·10^-7) | r=0.9653; s=0.36; F (p)=68 (4.22·10^-4)
Cross-Validation Leave-One-Out | rcv-loo=0.9981; scv-loo=0.10; Fcv-loo (p)=479 (2.58·10^-5) | rcv-loo=0.9356; scv-loo=0.49; Fcv-loo (p)=35 (1.99·10^-3)
Ordnance compounds - Germination of Sea Urchin

The LOEC of the investigated ordnance compounds in the germination of the Sea Urchin was of both topological and geometrical nature. It was also related to group electronegativity and relative atomic mass. The estimation and prediction power of the model with two descriptors was very close to the optimum value.

Sample size [reference]: 8 [109]

Characteristic | Model with one descriptor | Model with two descriptors
MDF SAR Equation | ŷ=0.06·x-1.43 | ŷ=11.77·x1+14.55·x2-76.46
SAR Determination (%) | 88 | 99.9
MDF Descriptor(s): x1 & x2 | aIDmjQg | lGPrfGt & iGPMqMg
Dominant Atomic Property | Charge (Q) | Group Electronegativity (G) & Mass (M)
Interaction Via | Space (geometry) | Bonds (topology) & Space (geometry)
Interaction Model | Q^-1·d^-1 | G^2·d^-2 & M^-1·(√M)^-1
Structure on Activity Scale | Inversed | Logarithmic & Inversed
Model Statistics | r=0.9357; s=0.40; F (p)=42 (6.33·10^-4) | r=0.9996; s=0.04; F (p)=3002 (1.99·10^-8)
Cross-Validation Leave-One-Out | rcv-loo=0.9022; scv-loo=0.51; Fcv-loo (p)=24 (2.73·10^-3) | rcv-loo=0.9992; scv-loo=0.05; Fcv-loo (p)=1492 (1.14·10^-7)
Ordnance compounds - Germling length and cell number of Green Macroalgae

The LOEC of the investigated ordnance compounds in the germling length and cell number of the Green Macroalgae proved to be of topological nature and strongly depended on atomic electronegativity (model with two descriptors).

Sample size [reference]: 8 [109]

Characteristic | Model with one descriptor | Model with two descriptors
MDF SAR Equation | ŷ=0.06·x-2.02 | ŷ=0.66·x1-688.62·x2-0.47
SAR Determination (%) | 90 | 99.9
MDF Descriptor(s): x1 & x2 | aIDmjQg | ISPRfEt & imDrwEt
Dominant Atomic Property | Charge (Q) | Electronegativity (E) & Electronegativity (E)
Interaction Via | Space (geometry) | Bonds (topology) & Bonds (topology)
Interaction Model | Q^-1·d^-1 | E^2·d^-2 & E^2·d^-1
Structure on Activity Scale | Inversed | Identity & Inversed
Model Statistics | r=0.9504; s=0.35; F (p)=56 (2.94·10^-4) | r=0.9995; s=0.04; F (p)=2539 (3.03·10^-8)
Cross-Validation Leave-One-Out | rcv-loo=0.9320; scv-loo=0.41; Fcv-loo (p)=39 (7.99·10^-4) | rcv-loo=0.9990; scv-loo=0.05; Fcv-loo (p)=1255 (1.76·10^-7)
Ordnance compounds - Survival and reproductive success of Polychaete

The LOEC of the investigated ordnance compounds in the survival and reproductive success of the Polychaete proved to be of both geometrical and topological nature and related to atomic and group electronegativity. Both models had good estimation and prediction power.

Sample size [reference]: 8 [109]

Characteristic | Model with one descriptor | Model with two descriptors
MDF SAR Equation | ŷ=16.60·x-1.69 | ŷ=-20.97·x1+51.18·x2-267.22
SAR Determination (%) | 92 | 99.9
MDF Descriptor(s): x1 & x2 | anDRJQt | lsPmkEg & lSmRFGt
Dominant Atomic Property | Charge (Q) | Electronegativity (E) & Group Electronegativity (G)
Interaction Via | Bonds (topology) | Space (geometry) & Bonds (topology)
Interaction Model | Q·d | E^-2·d^-1 & G^2·d^-2
Structure on Activity Scale | Inversed | Logarithmic & Logarithmic
Model Statistics | r=0.9612; s=0.34; F (p)=73 (1.42·10^-4) | r=0.9999; s=0.02; F (p)=13109 (5.02·10^-10)
Cross-Validation Leave-One-Out | rcv-loo=0.9361; scv-loo=0.43; Fcv-loo (p)=42 (6.33·10^-4) | rcv-loo=0.9998; scv-loo=0.03; Fcv-loo (p)=5687 (4.05·10^-9)
Ordnance compounds - Survival and reproductive success of Green Macroalgae (1st column) & Redfish larvae survival (2nd column) & Juveniles survival of Opossum Shrimp (3rd column)

The LOEC of the investigated ordnance compounds in the survival and reproductive success of the Green Macroalgae proved to be of geometrical nature and depended on partial charge. The LOEC in the Redfish larvae survival proved to be of geometrical nature and depended on partial charge. The LOEC in the juvenile survival of the Opossum Shrimp proved to be of geometrical nature and depended on cardinality. All three models estimated and predicted well, with a determination higher than 88%.

Characteristic | Green Macroalgae | Redfish larvae | Opossum Shrimp
Sample size [reference] | 7 [109] | 7 [109] | 7 [109]
MDF SAR Equation | ŷ=0.11·x-3.69 | ŷ=-1.30·x+0.39 | ŷ=-0.83·x+4.22
SAR Determination (%) | 95 | 94 | 98
MDF Descriptor(s) | iIDdPQg | LHDmjQg | imMrtCg
Dominant Atomic Property | Charge (Q) | Charge (Q) | Cardinality (C)
Interaction Via | Space (geometry) | Space (geometry) | Space (geometry)
Interaction Model | Q^2 | Q^-1·d^-1 | C^2·d^-4
Structure on Activity Scale | Inversed | Logarithmic | Inversed
Model Statistics | r=0.9764; s=0.28; F (p)=102 (1.62·10^-4) | r=0.9694; s=0.20; F (p)=78 (3.09·10^-4) | r=0.9897; s=0.07; F (p)=239 (2.06·10^-5)
Cross-Validation Leave-One-Out | rcv-loo=0.9534; scv-loo=0.41; Fcv-loo (p)=45 (1.09·10^-3) | rcv-loo=0.9404; scv-loo=0.29; Fcv-loo (p)=34 (2.08·10^-3) | rcv-loo=0.9790; scv-loo=0.10; Fcv-loo (p)=111 (1.33·10^-4)
Discussions

Researchers from many chemistry-related fields are interested in quantitative structure-activity/property relationship approaches because of their advantages. A quick search of the SCOPUS database (http://www.scopus.com) with the query “TITLE-ABS-KEY(qsar)” retrieved 6278 abstracts, of which almost 10% were published last year. The same query retrieved 4920 items from the PubMed database (http://www.ncbi.nlm.nih.gov). Of these, almost 13% were published last year and almost 26% in the last two years. The amount of research published on the QSAR subject shows the importance of these studies. The three main advantages offered by these approaches are as follows:
- Time and money efficiency [111].
- Effective alternatives (with comparable or better accuracy) to their experimental counterparts [111].
- Virtual screening environments (particularly receptor-based virtual screening) that offer reliable and inexpensive methods for identifying leads [112].
A new approach, termed MDF SPR/SAR, was developed. It uses the information extracted from the 2D and 3D structure of the compounds to generate and calculate molecular descriptors able to estimate and predict the property/activity of interest. The approach was tested on different properties and activities in a number of chemically and/or biologically active classes of compounds. The analysis of the MDF abilities in estimation and prediction of properties led to the following:
- The estimated power, expressed as the correlation coefficient, varied from 0.8324 to 1.0000.
- The prediction power, expressed as the leave-one-out correlation coefficient, varied from 0.8258 to 0.9999.
- The stability of the model, expressed as the difference between estimated and predictive power, varied from 0.01% to 5.46%. The most stable models (with a difference between estimated and predicted power of 0.0001) were obtained in the following classes of compounds: polychlorinated biphenyls (relative retention time), organophosphorus herbicides (retention chromatography index), cyclic organophosphorus (molar refraction), alkanes (boiling point), and standard amino acids – 15aa (Hückel energy).
- The estimation power of the best performing model (when two models were reported) was greater than or equal to 0.9 in 90% of cases (eighteen out of twenty investigated sets).
- The prediction power of the best performing model (when two models were reported) was greater than or equal to 0.9 in 85% of cases (seventeen out of twenty investigated sets).
The analysis of the MDF abilities in estimation and prediction of activities led to the following:
- The estimated power, expressed as the correlation coefficient, varied from 0.6649 to 0.9999.
- The predictive power, expressed as the leave-one-out correlation coefficient, varied from 0.5961 to 0.9999.
- The stability of the model, expressed as the difference between estimation and prediction power, varied from 0.01% to 7.92%. The most stable model (with a difference between estimated and predicted power of 0.0001) was obtained when the fertilization of the Sea Urchin was investigated; the sample size of this set was small (8 compounds). A comparably small difference between estimation and prediction power (0.0002) was obtained in the investigation of the antioxidant efficacy of 3-indolyl derivatives. The worst result was obtained in the toxicity investigation of the mono-substituted nitrobenzenes, where a difference of 0.0792 was obtained.
- The estimated power of the best performing model (when two models were reported) was greater than or equal to 0.9 in almost 77% of cases (fifty-six out of seventy-three investigated sets).
- The predicted power of the best performing model (when two models were reported) was greater than or equal to 0.9 in 70% of cases (fifty-one out of seventy-three investigated sets).
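The summary figures above combine two simple calculations: the stability of a model, taken here as the plain difference r − rcv-loo expressed in percent (the reading suggested by the pairing of 0.0792 with 7.92%), and the share of data sets whose correlation reaches 0.9. The sketch below illustrates both on four two-descriptor models copied from the tables in this section. The selection of models is arbitrary and the helper names are hypothetical, so the output illustrates the arithmetic only and does not reproduce the chapter's overall statistics.

```python
# (r, rcv-loo) of four two-descriptor models from the tables in this section.
models = {
    "Taxoids": (0.9583, 0.9507),
    "Dipeptides": (0.9208, 0.9135),
    "HEPTA and TIBO derivatives": (0.8849, 0.8750),
    "Carbonic anhydrase II inhibitors": (0.8862, 0.8697),
}


def stability_percent(r: float, r_cv: float) -> float:
    """Difference between estimation and prediction power, in percent (assumed definition)."""
    return (r - r_cv) * 100.0


def share_at_least(values, threshold: float = 0.9) -> float:
    """Percentage of values greater than or equal to the threshold."""
    values = list(values)
    return 100.0 * sum(v >= threshold for v in values) / len(values)


for name, (r, r_cv) in models.items():
    print(f"{name}: stability = {stability_percent(r, r_cv):.2f}%")

estimated = share_at_least(r for r, _ in models.values())
predicted = share_at_least(r_cv for _, r_cv in models.values())
print(f"estimated power >= 0.9 in {estimated:.0f}% of these sets")
print(f"predicted power >= 0.9 in {predicted:.0f}% of these sets")

# The reported activity shares follow from the stated counts.
print(f"{100 * 56 / 73:.1f}% and {100 * 51 / 73:.1f}%")
```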
The MDF SPR/SAR models proved to have good correlation coefficients and predictive power compared with the previously reported models [23,29,32,34,35,47,54,56,59,67,88,90,92,94,96,99,101,103,105,106,107]. The MDF methodology was successful in extracting information from the 2D and 3D structure of compounds. This is useful for identifying the link between the compounds’ structure and the property or activity of interest. The use of structural information to characterize chemical compounds allows good correlations between the compounds’ structure and their chemical and/or biological activity. Thus, the MDF methodology is a powerful approach for investigating the activities and properties of chemically active compounds, since it includes the information extracted from the compounds’ structure in the SPR/SAR models. Furthermore, the approach could be useful in the process of drug design.
References
1. Hammett, L.P. 1937, J. Am. Chem. Soc., 59, 96.
2. Hansch, C., Leo, A., and Taft, R.W. 1991, Chem. Rev., 91, 165.
3. Crum-Brown, A., and Fraser, T.R. 1868, Philos. Trans. R. Soc. Lond., 25, 151.
4. Richet, C.R. 1893, C. R. Seances Soc. Biol. Fil., 9, 775.
5. Meyer, H. 1899, Naunyn-Schmiedeberg's Arch. Pharmacol., 42, 109.
6. Hammett, L.P. 1935, Chem. Rev., 17, 125.
7. Taft, R.W. 1952, J. Am. Chem. Soc., 74, 3120.
8. Hansch, C. 1969, Acc. Chem. Res., 2, 232.
9. Mezey, P.G., Journal of Mathematical Chemistry, Springer Netherlands, since 1987 (periodical).
10. Gutman, I., MATCH - Communications in Mathematical and in Computer Chemistry, Faculty of Science Kragujevac and University of Kragujevac Serbia, from 1975 (periodical).
11. Brändas, E., and Öhrn, Y., International Journal of Quantum Chemistry, John Wiley & Sons NY USA, since 1967 (periodical).
12. Portoghese, P.S., Journal of Medicinal Chemistry, American Chemical Society OH USA, since 1959 (periodical).
13. Atta-ur-Rahman, Letters in Drug Design & Discovery, Bentham Science Publishers, since 2004 (periodical).
14. Sawyer, T.K., Chemical Biology & Drug Design, Blackwell Publishing, since 2006 (periodical, formerly known as Journal of Peptide Research until 2005).
15. Grassy, G., Calas, B., Yasri, A., Lahana, R., Woo, J., Iyer, S., Kaczorek, M., Floc'h, R., and Buelow, R. 1998, Nat. Biotechnol., 16, 748.
16. van Breemen, R.B., Combinatorial Chemistry & High Throughput Screening, Bentham Science Publishers, from 1998 (periodical).
17. Czarnik, A.W., Journal of Combinatorial Chemistry, American Chemical Society OH USA, from 1999 (periodical).
18. Hansch, C., and Leo, A. 1979, Substituent Constants for Correlation Analysis in Chemistry and Biology, John Wiley & Sons, Inc., New York.
19. Cramer, R.D. III, Patterson, D.E., and Bunce, J.D. 1988, J. Am. Chem. Soc., 110, 5959.
20. Todeschini, R., Lasagni, M., and Marengo, E. 1994, J. Chemometrics, 8, 263.
21. Simon, Z., Chiriac, A., Holban, S., Ciubotariu, D., and Mihalas, G.I. 1984, Minimum Steric Difference. The MTD Method for QSAR Studies, Research Studies Press, 1.
22. Jäntschi, L., Katona, G., and Diudea, M.V. 2000, MATCH Commun. Math. Comput. Chem., 41, 151.
23. Jäntschi, L. 2004, Leonardo Journal of Sciences, 3, 68.
24. Putz, M.V., and Lacrămă, A.-M. 2007, Int. J. Mol. Sci., 8, 363.
25. Cambridge Structural Database (CSD), available from: http://www.ccdc.cam.ac.uk/products/csd/; Protein Data Bank (PDB), available from: http://www.pdb.org/; Visual Molecular Dynamics (VMD), available from: http://www.ks.uiuc.edu/Research/vmd/; Molecular Modeling DataBase (MMDB), available from: http://130.14.29.110/Structure/MMDB/mmdb.shtml; PubChem Compounds (PCC), available from: http://ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pccompound; PubChem Substance (PCS), available from: http://ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pcsubstance.
26. Eriksson, L., Jaworska, J., Worth, A.P., Cronin, M.T.D., McDowell, R.M., and Gramatica, P. 2003, Environ. Health Perspect., 111, 1361.
27. Steiger, J.H. 1980, Psychol. Bull., 87, 245.
28. Thurstone, L.L. 1931, Psychol. Rev., 38, 406.
29. Bolboacă, S.D., and Jäntschi, L. 2006, Leonardo Journal of Sciences, 5, 179.
30. Tarko, L. 2006, Revista de Chimie, 57, 1014.
31. Jäntschi, L., and Bolboacă, S.D. 2007, Leonardo Electronic Journal of Practices and Technologies, 6, 169.
32. Jäntschi, L. 2005, Leonardo Electronic Journal of Practices and Technologies, 6, 76.
33. HyperChem, Molecular Modelling System [online]; ©2003, Hypercube [cited 2007 Feb]. Available from: URL: http://hyper.com/products/.
34. Bolboacă, S.D., and Jäntschi, L. 2008, Environ. Chem. Lett., 6, 175.
35. Jäntschi, L., and Bolboacă, S.D. 2007, Studii şi Cercetări Ştiinţifice Universitatea Bacău Seria Biologie, 12, 57.
36. Lu, A., Zhang, J., Yin, X., Luo, X., and Jiang, H. 2007, Bioorg. Med. Chem. Lett., 17, 243.
37. Ma, W., Luan, F., Zhao, C., Zhang, X., Liu, M., Hu, Z., and Fan, B. 2006, QSAR Comb. Sci., 25, 895.
38. Tropsha, A., Gramatica, P., and Gombar, V.K. 2003, QSAR Comb. Sci., 22, 69.
39. Diudea, M., Gutman, I., and Jäntschi, L. 2002, Molecular Topology, 2nd Edition, Nova Science, Huntington, New York.
40. The PHP Group [online]; ©2001-2007, The PHP Group [cited 2007 June]. Available from: URL: http://php.net.
41. MySQL AB [online]; ©1995-2007 MySQL AB [cited 2007 June]. Available from: URL: http://mysql.com.
42. The FreeBSD Project [online]; ©1995-2007 The FreeBSD Project [cited 2007 June]. Available from: URL: http://freebsd.org.
43. Chambers, D.L. 2001, The Practical Handbook of Genetic Algorithms, Chapman & Hall, Boca Raton.
44. Borland Software Corporation [online]; ©1994-2007 Borland Software Corporation [cited 2007 June]. Available from: URL: http://borland.com.
45. Hawkins, D.M. 2004, J. Chem. Inf. Comput. Sci., 44, 1.
46. Abraham, M.H., Kumarsingh, R., Cometto-Muniz, J.E., and Cain, W.S. 1998, Toxicol. In Vitro, 12, 201.
47. Ivanciuc, O. 1998, Revue Roumaine de Chimie, 43, 255.
48. Jäntschi, L., and Bolboacă, S.D. 2007, Int. J. Quantum Chem., 107, 1736.
49. Eisler, R., and Belisle, A.A. 1996, Biological Report 31 and Contaminant Hazard Reviews Report, 31, 75.
50. Baker, J.R., Mihelcic, J.R., and Sabljić, A. 2001, Chemosphere, 45, 213.
51. Bumble, S. 1999, Computer Generated Physical Properties, CRC Press, LLC, Boca Raton.
52. Bolboacă, S.D., and Jäntschi, L. 2008, Chem. Biol. Drug Des., 71(2), 173.
53. Jäntschi, L., Mureşan, S., and Diudea, M. 2000, Studia Universitatis Babeş-Bolyai, Chemia, XLV, 313.
54. Jäntschi, L., and Bolboacă, S.D. 2005, Leonardo Electronic Journal of Practices and Technologies, 7, 55.
55. Toropov, A., Toropova, A., Ismailov, T., and Bonchev, D. 1998, Theochem, 424, 237.
56. Brasquet, C., and Le Cloirec, P. 1999, Water Res., 33, 3603.
57. Jäntschi, L. 2004, Leonardo Journal of Sciences, 5, 63.
58. Wolfenden, R., Andersson, L., Cullis, P., and Southgate, C. 1981, Biochemistry, 20, 849.
59. Rose, G.D., Geselowitz, A.R., Lesser, G.J., Lee, R.H., and Zehfus, M.H. 1985, Science, 229, 834.
60. Hessa, T., Kim, H., Bihlmaier, K., Lundin, C., Boekel, J., Andersson, H., Nilsson, I., White, S.H., and von Heijne, G. 2005, Nature, 433, 377.
61. Kyte, J., and Doolittle, R.F. 1982, J. Mol. Biol., 157, 105.
62. Welling, G.W., Weijer, W.J., Van der Zee, R., and Welling-Wester, S. 1985, FEBS Lett., 188, 215.
63. Wilson, K.J., Honegger, A., Stotzel, R.P., and Hughes, G.J. 1981, Biochem. J., 199, 31.
64. Cornette, J.L., Cease, K.B., Margalit, H., Spouge, J.L., Berzofsky, J.A., and DeLisi, C. 1987, J. Mol. Biol., 195, 659.
65. Bolboacă, S.D., and Jäntschi, L. 2007, Recent Advances in Synthesis & Chemical Biology VI, Centre for Synthesis & Chemical Biology, University of Dublin, December 14, Dublin, Ireland, P2.
66. Wimley, W.C., and White, S.H. 1996, Nature Struct. Biol., 3, 842.
67. Hopp, T.P., and Woods, K.R. 1981, Proc. Natl. Acad. Sci. USA, 78, 3824.
68. Cowan, R. 1990, Pept. Res., 3, 75.
69. Manavalan, P., and Ponnuswamy, P.K. 1978, Nature, 275, 673.
70. Fauchere, J.-L., and Pliska, V.E. 1983, Eur. J. Med. Chem., 18, 369.
71. Rao, M.J.K., and Argos, P. 1986, Biochim. Biophys. Acta, 869, 197.
72. Janin, J. 1979, Nature, 277, 491.
73. Roseman, M.A. 1988, J. Mol. Biol., 200, 513.
74. Urry, D.W. 2004, Chem. Phys. Lett., 399, 177.
75. Engelman, D.M., Steitz, T.A., and Goldman, A. 1986, Annu. Rev. Biophys. Chem., 15, 321.
76. Eisenberg, D., Schwarz, E., Komaromy, M., and Wall, R. 1984, J. Mol. Biol., 179, 125.
77. Sereda, T.J., Mant, C.T., Sonnichsen, F.D., and Hodges, R.S. 1994, J. Chromatogr. A, 676, 139.
78. Bull, H.B., and Breese, K. 1974, Arch. Biochem. Biophys., 161, 665.
79. Parker, J.M.R., Guo, D., and Hodges, R.S. 1986, Biochemistry, 25, 5425.
80. Black, S.D., and Mould, D.R. 1991, Anal. Biochem., 193, 72.
81. Monera, O.D., Sereda, T.J., Zhou, N.E., Kay, C.M., and Hodges, R.S. 1995, J. Pept. Sci., 1, 319.
82. Wei, D., Zhang, A., Wu, C., Han, S., and Wang, L. 2001, Chemosphere, 44, 1421.
83. Agrawal, V.K., and Khadikar, P.V. 2001, Bioorg. Med. Chem., 9, 3035.
84. Toropov, A.A., and Toropova, A.P. 2002, Theochem, 578, 129.
85. Ade, T., Zaucke, F., and Krug, H.F. 1996, Fresenius J. Anal. Chem., 354, 609.
86. Bolboacă, S.D., and Jäntschi, L. 2006, Therapeutics, Pharmacology and Clinical Toxicology, X(1), 110.
87. Smith, C.J., Hansch, C., and Morton, M.J. 1997, Mutat. Res., 379, 167.
88. Jäntschi, L., and Bolboacă, S. 2005, The 10th Electronic Computational Chemistry Conference, April 2005.
89. Hasegawa, K., Arakawa, M., and Funatsu, K. 1999, Chemometr. Intell. Lab., 47, 33.
90. Bolboacă, S., and Jäntschi, L. 2005, Leonardo Journal of Sciences, 6, 78.
91. Diudea, M., Jäntschi, L., and Pejov, L. 2002, Leonardo Electronic Journal of Practices and Technologies, 1, 1.
92. Bolboacă, S., and Jäntschi, L. 2006, Bulletin of University of Agricultural Sciences and Veterinary Medicine – Agriculture, 62, 35.
93. Shertzer, G.H., Tabor, M.W., Hogan, I.T.D., Brown, J.S., and Sainsbury, M. 1996, Arch. Toxicol., 70, 830.
94. Bolboacă, S., Filip, C., Ţigan, Ş., and Jäntschi, L. 2006, Clujul Medical, LXXIX, 204.
95. Zhou, Y.X., Xu, L., Wu, Y.P., and Liu, B.L. 1999, Chemometr. Intell. Lab., 45, 95.
96. Frahm, A.W., and Chaudhuri, R. 1979, Tetrahedron, 35, 2035.
97. Bolboacă, S.D., and Jäntschi, L. 2005, Leonardo Journal of Sciences, 7, 58.
98. Morita, H., Gonda, A., Wei, L., Takeya, K., and Itokawa, H. 1997, Bioorg. Med. Chem. Lett., 7, 2387.
99. Bolboacă, S.D., and Jäntschi, L. 2008, Arch. Med. Sci., 4(1), 7.
100. Castro, E.A., Torrens, F., Toropov, A.A., Nesterov, I.V., and Nabiev, O.M. 2004, Mol. Simulat., 30, 691.
101. Bolboacă, S., Ţigan, Ş., and Jäntschi, L. 2006, Proceedings of the European Federation for Medical Informatics Special Topic Conference, April 6-8, 222.
102. Supuran, C.T., and Clare, B.W. 1999, Eur. J. Med. Chem., 34, 41.
103. Bolboacă, S.D., and Jäntschi, L. 2007, Integration of Structure Information, Computer-Aided Chemical Engineering, Elsevier Netherlands & UK, 24, 965.
104. Jäntschi, L., Ungureşan, M.L., and Bolboacă, S.D. 2005, Applied Medical Informatics, 17, 12.
105. Jäntschi, L., and Bolboacă, S. 2006, Electron. J. Biomed., 2, 22.
106. Opris, D., and Diudea, M.V. 2001, SAR QSAR Environ. Res., 12, 1599.
107. Selassie, C.D., Li, R.L., Poe, M., and Hansch, C. 1991, J. Med. Chem., 34, 46.
108. Hellberg, S., Eriksson, L., Jonsson, J., Lindgren, F., Sjostrom, M., Skagerberg, B., Wold, S., and Andrews, P. 1991, Int. J. Pept. Protein Res., 37, 414.
109. Carr, R.S., and Nipper, M. 2000, Toxicity of marine sediments and porewaters spiked with ordnance compounds: Report prepared for Naval Facilities Engineering Commands, NFESC Contract Report CR 01-001-ENV, Washington, DC, USA. Available from: http://enviro.nfesc.navy.mil/erb/erb_a/restoration/fcs_area/con_sed/MarineSed2000.pdf.
110. Stoenoiu, C.E., Bolboacă, S.D., and Jäntschi, L. 2007, MetEcoMat - Ecomaterials and Processes: Characterization and Metrology, April 19-21, St. Kirik, Plovdiv, Bulgaria, 54.
111. Prathipati, P., Dixit, A., and Saxena, A.K. 2007, Current Computer-Aided Drug Design, 3, 133.
112. Lyne, P.D. 2002, Drug Discov. Today, 7, 1047.
Research Signpost 37/661 (2), Fort P.O. Trivandrum-695 023 Kerala, India
QSPR-QSAR Studies on Desired Properties for Drug Design, 2010: 167-187 ISBN: 978-81-308-0404-0 Editor: Eduardo A. Castro
6. Selection of an optimal set of descriptors: use of the Enhanced Replacement Method
Andrew G. Mercader
Instituto de Investigaciones Fisicoquímicas Teóricas y Aplicadas (INIFTA, UNLP, CCT La Plata-CONICET), Diag. 113 y 64, Sucursal 4, C.C. 16, 1900 La Plata, Argentina
Abstract. The Enhanced Replacement Method (ERM) is a search algorithm that selects an optimal group of descriptors from a much larger set in the construction of linear QSAR/QSPR models. This algorithm has shown many superior attributes, which are presented in this chapter. In addition, successful applications are shown to give a better general picture. Finally, an innovative methodology to determine the optimal number of parameters using the ERM is presented.
Correspondence/Reprint request: Dr. Andrew G. Mercader, Instituto de Investigaciones Fisicoquímicas Teóricas y Aplicadas (INIFTA, UNLP, CCT La Plata-CONICET), Diag. 113 y 64, Sucursal 4, C.C. 16, 1900 La Plata, Argentina. E-mail: amercader@inifta.unlp.edu.ar

1. Introduction
Quantitative Structure-Property/Activity Relationships (QSPR/QSAR) are generally applied to overcome the lack of experimental data in complex chemical phenomena.[1] Therefore, there is a permanently renewed interest in the development of this kind of predictive technique.[2-5] The ultimate role of the QSPR/QSAR theory is to suggest mathematical models capable of estimating relevant properties of interest. Such studies rely on the basic assumption that the structure of a compound entirely determines its properties, which can therefore be translated into so-called molecular descriptors. These parameters are calculated through mathematical formulae obtained from several theories, such as Chemical Graph Theory, Information Theory, Quantum Mechanics, etc.[6, 7]

Thousands of descriptors encoding different aspects of the molecular structure have been developed to date and are available in the literature.[8] As in any QSAR/QSPR study, we have to determine how to select those that characterize the property/activity under consideration in the most efficient way. First, the optimal number of parameters to include in the model has to be determined. Afterwards, the mathematical problem of selecting a subset $\mathbf{d}$ of $d$ descriptors from a much larger set $\mathbf{D}$ of $D \gg d$ descriptors has to be addressed. The search for the optimal set of descriptors may be monitored by the minimization or maximization of a chosen statistical parameter; we normally prefer to search for the model that makes the Standard Deviation (S) as small as possible. In other words, we look for the global minimum of $S(\mathbf{d})$, where $\mathbf{d}$ is a point in a space of $D!/[d!(D-d)!]$ points. A full search (FS) for the optimal variables is impractical, since it requires $D!/[d!(D-d)!]$ linear regressions and D is normally higher than one thousand.

For this reason, we have recently proposed the Enhanced Replacement Method (ERM)[9] as an adaptation of the Replacement Method (RM).[10-12] This new algorithm presents many advantages and has been shown to outperform the RM and the more elaborate Genetic Algorithms (GA).[13] The main differences between the ERM and the RM are discussed in the next sections, together with the resemblance of the new algorithm to Simulated Annealing (SA). SA is an adaptation of the Metropolis-Hastings algorithm, a Monte Carlo method[14] to generate sample states of a thermodynamic system; its name and inspiration come from annealing in metallurgy, a technique involving heating and controlled cooling of a material to increase the size of its crystals and reduce their defects. The heat causes the atoms to become unstuck from their initial positions (a local minimum of the internal energy) and wander randomly through states of higher energy; the cooling gives them more chances of finding configurations with lower internal energy than the initial one.[15]

Additionally, successful practical applications that give a better understanding of the algorithm and of the way it can be used are presented. Finally, a new method for selecting the optimal number of parameters to include in a model using the ERM is described.
1.1. Motivation to develop the ERM
Even though the RM is a rapidly convergent iterative algorithm that produces linear regression models with small S in remarkably little computer time,[12, 16, 17] in some cases it can get trapped in a local minimum of S, leaving some room for improvement, which led to the development of the ERM. It should be mentioned that such local minima provided acceptable models, as shown in all earlier applications of the RM.[12, 16, 17] In summary, the RM is a technique that approaches the minimum of S by judiciously taking into account the relative errors of the coefficients of the least-squares model given by a set of d descriptors $\mathbf{d} = \{X_1, X_2, \ldots, X_d\}$. It has been shown[16] that the RM gives models with better statistical parameters than the Forward Stepwise Regression (FSW) procedure[18] and similar to those of the GA. We considered the RM preferable to the GA[12, 19] because the former takes into account the error in the regression coefficients and, as a result, the replacement of a descriptor is not random as in the GA.

GA are more complicated, and their practical application requires the tuning of parameters such as population size, generation gap, crossover rate, and mutation rate. These parameters typically interact among themselves nonlinearly and cannot be optimized one at a time. There is considerable discussion about parameter settings and approaches to parameter adaptation in the evolutionary computation literature; however, there do not seem to be conclusive results on which is best. In summary, GA are search techniques based on natural evolution, where variables play the role of genes (in this case a set of descriptors) in an individual of the species. An initial group of random individuals (population) evolves according to a fitness function (in this case the standard deviation) that determines the survival of the individuals. The algorithm searches for those individuals that lead to better values of the fitness function through selection, mutation and crossover genetic operations. The selection operators guarantee the propagation of individuals with better fitness in future populations. The GA explore the solution space by combining genes from two individuals (parents) using the crossover operator to form two new individuals (children), and also by randomly mutating individuals using the mutation operator. The GA offer a combination of hill-climbing ability (natural selection) and a stochastic method (crossover and mutation), and explore many solutions in parallel, processing information in a very efficient manner.[20]
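The GA loop described above can be illustrated with a minimal sketch. It is not the GA of refs. [13, 20]: the operator choices (truncation selection, elitism, a crossover that pools the descriptors of both parents and yields a single child instead of two) and all function names, parameter names and default values are assumptions made for illustration. The fitness function is supplied by the caller; in a QSAR/QSPR setting it would return S for a subset of descriptor indices.

```python
import random

def genetic_search(fitness, D, d, pop_size=30, generations=40,
                   crossover_rate=0.8, mutation_rate=0.1, seed=0):
    """Search for a d-subset of range(D) that minimizes fitness(subset).

    Hypothetical sketch: requires D > d and pop_size >= 4.
    """
    rng = random.Random(seed)
    # Initial population: random individuals, each a sorted tuple of d genes.
    pop = [tuple(sorted(rng.sample(range(D), d))) for _ in range(pop_size)]
    history = []                       # best fitness at the start of each generation
    for _ in range(generations):
        pop.sort(key=fitness)          # lower value = fitter (e.g. smaller S)
        history.append(fitness(pop[0]))
        parents = pop[:pop_size // 2]  # selection: only the fitter half reproduces
        children = [pop[0]]            # elitism: the best individual survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            if rng.random() < crossover_rate:
                # Crossover: draw d distinct genes from the parents' pooled genes.
                child = rng.sample(sorted(set(a) | set(b)), d)
            else:
                child = list(a)
            if rng.random() < mutation_rate:
                # Mutation: swap one gene for a descriptor not in the individual.
                outside = [j for j in range(D) if j not in child]
                child[rng.randrange(d)] = rng.choice(outside)
            children.append(tuple(sorted(child)))
        pop = children
    best = min(pop, key=fitness)
    return best, fitness(best), history
```

Even this small sketch exposes four tunable settings (population size, number of generations, crossover rate, mutation rate), which is the practical drawback of the GA noted above.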
1.2. Development of ERM
The first algorithm that was built was a slight modification of the RM and was called the Modified Replacement Method (MRM).[9] The comparison of the behaviour of the RM and the MRM, shown in Figure 1 and Figure 2, respectively, suggested that we implement a further optimization routine that integrates the two algorithms, associating the MRM to a thermal agitation of the RM. We tried the following combinations: MRM-RM, RM-MRM and RM-MRM-RM. When RM was applied after MRM (MRM-RM), the starting solution for RM consisted of the best set of variables obtained from the previous application of the MRM algorithm. It should be mentioned that after several runs we decided to discard the MRM-RM-MRM sequence simply because it increased the number of linear regressions significantly without achieving appreciable improvements in the statistical results. The sequence RM-MRM-RM emerged as the best algorithm, which is in fact the ERM (Figure 3); this is in line with the idea that the ERM is the only algorithm from the above list that goes through a complete simulated annealing cycle.[15]

Figure 1. Standard deviation vs. number of steps of the RM.
Figure 2. Standard deviation vs. number of steps of the MRM.
Figure 3. Standard deviation vs. number of steps of the ERM.
1.3. Description of the algorithms: Differences between ERM and RM
Having a large set $\mathbf{D} = \{X_1, X_2, \ldots, X_D\}$ of D descriptors, we need to choose an optimal subset $\mathbf{d}_m = \{X_{m1}, X_{m2}, \ldots, X_{md}\}$ of $d \ll D$ descriptors with minimum standard deviation S:

$$S = \frac{1}{(N-d-1)} \sum_{i=1}^{N} \mathrm{res}_i^2 \qquad (1)$$

where N is the number of molecules in the training set, and $\mathrm{res}_i$ is the residual for molecule i (difference between the experimental and predicted property). Notice that $S(\mathbf{d}_n)$ is a distribution on a discrete space of $D!/[d!(D-d)!]$ disordered points $\mathbf{d}_n$. The full search (FS), which consists of calculating $S(\mathbf{d}_n)$ on all those points, always enables us to arrive at the global minimum; however, it is computationally prohibitive if D is sufficiently large. The RM consists of the following steps:
• We choose an initial set of descriptors $\mathbf{d}_k$ at random, replace one of the descriptors, say $X_{ki}$, with all the remaining $D-d$ descriptors, one by one, and keep the set with the smallest value of S. That is what we define as a ‘step’.
• From the resulting set we choose the descriptor with the greatest standard deviation in its coefficient (not considering the one changed previously) and substitute all the remaining $D-d$ descriptors, one by one, in its position. The procedure is repeated until the set remains unmodified. In each cycle the descriptor optimized in the previous one is not modified.
• Thus, we obtain the candidate $\mathbf{d}_m^{(i)}$ that comes from the so-constructed path i. It should be noticed that if the replacement of the descriptor with the largest error by those in the pool does not decrease the value of S, then the descriptor is not changed.
• The process is carried out for all the possible paths $i = 1, 2, \ldots, d$ and we keep the point $\mathbf{d}_m$ with the smallest standard deviation: $\min_i S(\mathbf{d}_m^{(i)})$.
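The steps above can be sketched in Python with NumPy. This is an illustrative reading of the list, not the authors' Matlab implementation: the function names are hypothetical, the coefficient error is taken as the relative standard error from ordinary least squares, and "until the set remains unmodified" is interpreted as "until every position has been tried without lowering S". S follows Eq. (1) as printed; taking its square root would not change which set is selected.

```python
import numpy as np

def fit_stats(X, y, cols):
    """Least-squares fit with intercept on the descriptor columns `cols`.

    Returns S from Eq. (1) and the relative standard error of each
    descriptor coefficient (intercept excluded). Assumes N > d + 1.
    """
    N, d = len(y), len(cols)
    A = np.column_stack([np.ones(N), X[:, cols]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    res = y - A @ coef
    S = float(res @ res) / (N - d - 1)
    var = S * np.abs(np.diag(np.linalg.pinv(A.T @ A)))[1:]
    return S, np.sqrt(var) / (np.abs(coef[1:]) + 1e-12)

def step(X, y, cols, pos):
    """One RM 'step': try every descriptor outside the set at position `pos`
    and keep the set with the smallest S (the current set if none is better)."""
    best_cols, best_S = list(cols), fit_stats(X, y, cols)[0]
    for j in range(X.shape[1]):
        if j not in cols:
            trial = list(cols)
            trial[pos] = j
            S = fit_stats(X, y, trial)[0]
            if S < best_S:
                best_cols, best_S = trial, S
    return best_cols, best_S

def rm_path(X, y, start, first_pos):
    """Follow path `first_pos`: replace that position first, then repeatedly
    replace the descriptor with the largest relative coefficient error."""
    cols, S = step(X, y, start, first_pos)
    tried = {first_pos}                # positions tried since the last change
    while len(tried) < len(cols):
        rel_err = fit_stats(X, y, cols)[1]
        pos = next(int(p) for p in np.argsort(-rel_err) if int(p) not in tried)
        new_cols, new_S = step(X, y, cols, pos)
        if new_S < S:                  # accept only if S decreases
            cols, S, tried = new_cols, new_S, {pos}
        else:
            tried.add(pos)
    return cols, S

def replacement_method(X, y, start):
    """Run all d paths and keep the candidate with the smallest S."""
    return min((rm_path(X, y, start, i) for i in range(len(start))),
               key=lambda r: r[1])
```

Here `X` is assumed to be an N × D array of descriptor values, `y` the vector of N experimental properties, and `start` a list of d distinct column indices; each call to `step` costs D − d linear regressions, which is where the regression counts reported for the RM come from.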
The MRM follows the same strategy, except that in each step the descriptor with the largest error is replaced even if that substitution is not accompanied by a smaller value of S (the next smallest value of S is chosen). The MRM converges to different solutions and commonly bounces from one point to another, occasionally repeating some of them; in such a case we find that a plausible solution is the first one that appears four times. If convergence is too slow, the process is stopped after 350 steps; this number of steps achieves a sufficiently small S. The ERM, as mentioned before, is the combination of MRM and RM in the sequence RM-MRM-RM.
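A compact sketch of how the three algorithms relate is given below, again in Python with NumPy and under stated assumptions rather than as the published Matlab code. The names `step`, `search`, `worst` and `erm` are hypothetical; the 350-step cap is applied to both modes for simplicity; each stage hands the best set it has seen to the next one; and the second and third stages start from the descriptor with the largest relative coefficient error.

```python
import numpy as np

def fit_stats(X, y, cols):
    """Return S (Eq. 1 as printed) and the relative standard errors of the coefficients."""
    N, d = len(y), len(cols)
    A = np.column_stack([np.ones(N), X[:, cols]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    res = y - A @ coef
    S = float(res @ res) / (N - d - 1)
    var = S * np.abs(np.diag(np.linalg.pinv(A.T @ A)))[1:]
    return S, np.sqrt(var) / (np.abs(coef[1:]) + 1e-12)

def step(X, y, cols, pos, forced):
    """Try every outside descriptor at `pos`. RM (forced=False) keeps the
    current set unless S decreases; MRM (forced=True) always replaces."""
    best_cols, best_S = (None, np.inf) if forced else (list(cols), fit_stats(X, y, cols)[0])
    for j in range(X.shape[1]):
        if j not in cols:
            trial = list(cols)
            trial[pos] = j
            S = fit_stats(X, y, trial)[0]
            if S < best_S:
                best_cols, best_S = trial, S
    return best_cols, best_S

def search(X, y, start, pos, forced, max_steps=350):
    """RM (forced=False) or MRM (forced=True) from `start`; returns the best set seen."""
    cols, S = list(start), fit_stats(X, y, start)[0]
    best_cols, best_S = list(cols), S
    visits, tried = {}, set()
    for _ in range(max_steps):
        new_cols, new_S = step(X, y, cols, pos, forced)
        if forced or new_S < S:
            cols, S, tried = new_cols, new_S, {pos}
        else:
            tried.add(pos)
        if S < best_S:
            best_cols, best_S = list(cols), S
        key = tuple(sorted(cols))
        visits[key] = visits.get(key, 0) + 1
        if forced and visits[key] == 4:             # MRM: first set seen four times
            break
        if not forced and len(tried) == len(cols):  # RM: no position can be improved
            break
        order = [int(p) for p in np.argsort(-fit_stats(X, y, cols)[1])]
        pos = next((p for p in order if p not in tried), order[0])
    return best_cols, best_S

def worst(X, y, cols):
    """Position of the descriptor with the largest relative coefficient error."""
    return int(np.argmax(fit_stats(X, y, cols)[1]))

def erm(X, y, start):
    """ERM as the sequence RM-MRM-RM, run over all d starting paths."""
    results = []
    for i in range(len(start)):
        c, _ = search(X, y, start, i, forced=False)                    # RM
        c, _ = search(X, y, c, worst(X, y, c), forced=True)            # MRM
        results.append(search(X, y, c, worst(X, y, c), forced=False))  # RM
    return min(results, key=lambda r: r[1])
```

Because the MRM stage keeps the best set it has visited (which includes its own starting point) and the final RM stage can only lower S, this sketch never returns a worse S than the RM alone for the same starting set; the price is the additional regressions spent in the MRM stage.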
1.4. ERM, MRM vs RM: Test results
The different algorithms were tested and compared using four experimental data sets: a fluorophilicity data set (FLUOR), consisting of 116 organic compounds characterized by 1269 theoretical descriptors;[16] a Growth Inhibition data set (GI), with growth inhibition values to the ciliated protozoan Tetrahymena pyriformis for 200 mechanistically diverse phenolic compounds and 1338 structural descriptors;[17] a GABA receptor data set (GABA), containing 78 inhibition data for flavone derivatives and 1187 molecular descriptors;[21] and a data set of 100 ED50 values (MES, mice, i.p.) for enaminones (MES), with 1306 descriptors.[22-24] Additionally, a data set of 209 Polychlorinated Biphenyls (PCB) with measured Relative Response Factor, containing 63912 molecular descriptors,[25] was used to test whether the application of the improved algorithm to an extremely large data set was possible.

In all cases the structures of the compounds were first pre-optimized with the Molecular Mechanics Force Field (MM+) procedure included in Hyperchem version 6.03,[26] and the resulting geometries were further refined by means of the semi-empirical method PM3 (Parametric Method-3) using the Polak-Ribiere algorithm and a gradient norm limit of 0.01 kcal/Å. More than a thousand molecular descriptors were calculated using the software Dragon 5.0,[27] including parameters of all types, such as constitutional, topological, geometrical, quantum mechanical, etc. In the last data set, 62873 descriptors calculated by the molecular descriptors family methodology[25] were added to the pool. All the algorithms were programmed in Matlab 5.0.[28]

All the numerical tests were carried out for d = 7, as an example of a computationally demanding search with a reasonable number of descriptors for a potential model in common QSPR/QSAR studies. It should be mentioned that in the practical use of the algorithm the correlation of the descriptors can be easily avoided by taking out of the pool those descriptors that have a correlation higher than a set limit. A comparison against the exact solution is not possible because even for the smallest database (GABA, D = 1187) the number of linear regressions for a FS with d = 7 amounts to $6.47 \times 10^{17}$, which would take about $3.4 \times 10^{6}$ years on a PC with an AMD Athlon 64 2800+ processor. Even on a much more powerful computer the solution would not be reached in a reasonable time. Table 1 shows the average percentage improvement of S taking RM as reference, using all databases for three random initial sets of seven descriptors. The last rows of the table give the number of linear regressions of each algorithm and its ratio with respect to the RM.

Table 1. Average improvement on the standard deviation taking RM as reference, number of linear regressions and computational time (between parentheses) for MRM and ERM, for the four full data sets with three different initial seven-descriptor sets.

Algorithm                                      RM               MRM               ERM
Average improvement                            0%               4.76%             5.73%
Number of linear regressions (time in minutes*)
  Average                                      283878 (0.78)    1629938 (4.51)    1926775 (5.33)
  Ratio                                        1                5.74              6.79
* Using an AMD Athlon 64 2800+ processor DDR 1GB
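The full-search figures quoted above for the GABA data set can be checked with a few lines of Python. The number of regressions is the binomial coefficient D!/[d!(D−d)!]; the time estimate additionally assumes the throughput of the RM runs reported in Table 1 (283878 regressions in 0.78 minutes), which is our assumption about how such an estimate can be obtained rather than something stated in the text.

```python
from math import comb

D, d = 1187, 7                      # GABA data set, seven-descriptor models
n_regressions = comb(D, d)          # D! / [d! (D - d)!]

# Assumed throughput: the RM average from Table 1 (283878 regressions in 0.78 min).
regressions_per_minute = 283878 / 0.78
years = n_regressions / regressions_per_minute / (60 * 24 * 365)

print(f"{n_regressions:.3e} regressions, about {years:.2e} years")
```

Under that assumption the estimate lands on the same order of magnitude as the values quoted in the text, which illustrates why heuristic searches such as RM, MRM and ERM are needed even for the smallest of the four data sets.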
It follows from Table 1 that the MRM outperforms the RM and that the ERM gives even better results. The computational demand of the ERM is comparable to that of the MRM and almost seven times greater than that of the RM. This is the small price we have to pay for obtaining QSAR models with better statistical parameters than the ones obtained previously.[16] As a final test of the ERM on a much more demanding problem, we evaluated it on the PCB database containing 63912 descriptors.[25] The ERM converged in a reasonable time; Table 2 shows the results. As expected, the ERM gave smaller values of S for the same three random initial sets chosen in the preceding tests.

It arises from this section that the ERM outperforms the RM, which, as mentioned, gives results similar to those of the GA; it is therefore expected that the ERM will outperform the GA as well. The ERM increases the computational demand with respect to the RM (by a factor of about five to seven in these tests); nevertheless, it was successfully used in a problem with an extremely large number of descriptors (63912) calculated by Dragon and the molecular descriptors family methodology.

Table 2. Average improvement on the standard deviation taking RM as reference, number of linear regressions and computational time (between parentheses) for ERM, for the PCB data set with three different initial seven-descriptor sets.

Algorithm                                                  RM                  ERM
Average improvement                                        0%                  2%
Number of linear regressions (computation time in minutes*)
  Average                                                  1.42E+07 (39.25)    7.42E+07 (205.18)
  Ratio                                                    1                   5.23
* Using an AMD Athlon 64 2800+ processor DDR 1GB
2. Programming language
As previously mentioned, we use Matlab 5.0[28] as the programming platform for all the search algorithm calculations performed in our research group, because this platform presents many advantages, which are detailed below. As a reference point, in the past we used Derive[29] as programming language; it worked well, and all the algorithms achieved satisfactory results. Nevertheless, the computational time was very high; sometimes the programs had to be left running for several days. For that reason we decided to translate the algorithms to a more efficient programming language. We evaluated two alternatives, ANSI C[30] and Matlab. After reading the opinions of several programming experts from around the globe, it was decided to proceed with Matlab. In summary, the experts said that ANSI C could potentially be faster than Matlab; nevertheless, a basic knowledge of the language was not enough to write efficient C code, and several years of experience were required. The estimated ratio of the time needed to program in C vs. Matlab for a person with basic knowledge was as high as 10 to 1. Moreover, it was estimated that, although the new Matlab program might not be optimized, the Matlab language itself has been optimized for matrix handling; hence the management of computational resources would be optimal in that sense.

After translating all the codes from Derive to Matlab the results were marvelous. The reduction in computational time exceeded our expectations. As can be seen in Table 3, the time required in Matlab is from 140 to 700 times lower than in Derive. The time savings grow with d because Derive's computational time increases exponentially, whereas Matlab's increases polynomially. This reduction in computational time is one of the reasons why an extremely demanding problem, such as the one presented in the previous section on descriptors calculated by the molecular descriptors family methodology, can be solved using the ERM.

Table 3. Calculation time of Derive vs. Matlab for different d. D=1269, N=116.

d    Number of regressions    Derive's time (s)    Matlab's time (s)    Derive/Matlab
1    1269                     41                   0.125                330.4
2    20290                    270                  2.0                  137.6
3    53175                    1541                 5.5                  280.2
4    80964                    3850                 9.8                  394.8
5    126405                   6382                 16.8                 378.9
6    189456                   15170                28.1                 540.3
7    247359                   26647                37.5                 710.6
8    322824                   38345                54.8                 699.7
* Using an AMD Athlon 64 2800+ processor DDR 1GB
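One way to read Table 3 is to convert the total times into a cost per linear regression. The short Python sketch below does this arithmetic on the tabulated values only; no new measurements are involved, and because the tabulated times are rounded, ratios recomputed from them may differ slightly from the printed Derive/Matlab column.

```python
# Rows of Table 3: d, number of regressions, Derive time (s), Matlab time (s).
table3 = [
    (1, 1269, 41, 0.125),
    (2, 20290, 270, 2.0),
    (3, 53175, 1541, 5.5),
    (4, 80964, 3850, 9.8),
    (5, 126405, 6382, 16.8),
    (6, 189456, 15170, 28.1),
    (7, 247359, 26647, 37.5),
    (8, 322824, 38345, 54.8),
]

# Milliseconds per regression for each platform, and the recomputed ratio.
derive_ms = [1e3 * t_derive / n for _, n, t_derive, _ in table3]
matlab_ms = [1e3 * t_matlab / n for _, n, _, t_matlab in table3]
ratios = [t_derive / t_matlab for _, _, t_derive, t_matlab in table3]

for (d, *_), dm, mm, r in zip(table3, derive_ms, matlab_ms, ratios):
    print(f"d={d}: Derive {dm:7.2f} ms/regression, "
          f"Matlab {mm:6.3f} ms/regression, ratio {r:6.1f}")
```

Seen this way, the Matlab cost per regression stays well below one millisecond for every d in the table, while the Derive cost per regression is tens of milliseconds and grows for the larger models, which is consistent with the widening gap described above.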
We cannot compare the results against ANSI C because we do not have the C code; nevertheless, we can say that programming in Matlab was very straightforward and that the interface is very friendly, making the input of the data very simple. To test the ERM code, please refer to the Mathworks[44] community file exchange server, where a QSAR/QSPR Search Algorithms Toolbox containing the ERM code along with a tutorial has been uploaded and is available.
3. Determining the optimal number of parameters to include in a model using ERM A fundamental step in the construction of any QSAR model is determining the optimal number of parameters (dopt) to include in the model. As the number of parameters included in the model is increased the statistical parameters of the training set (molecules used to elaborate the model) always improve, nevertheless the statistical parameters in the test set (molecules left aside to test the predictive power of the model) start to improve, as more relevant information from the structure is used, and after a certain number of parameters that depend on the case start to deteriorate due to over-fitting of the model to the training set molecules. A scheme that summarizes the problem of over-fitting was include in Figure 4.[31] One may think that a good way of determining the optimal number of parameters is to develop models with increasing number of parameters, try
Figure 4. Typical behaviour of the standard deviation of a model in the Training and Test Set as d is increased.
ERM: Selection of an optimal set of descriptors
them in the test set and keep the one with the best statistical parameters in the test set. This may seem appropriate, but it cannot be done because the chosen model would have been determined using the test set; consequently the test set would no longer be a true external set of molecules. Thus we need a way of determining dopt that uses only the training set. To that end we use the Kubinyi function (FIT),[32, 33] a statistical parameter closely related to the Fisher ratio (F) that avoids the main disadvantage of the latter, namely that it is too sensitive to changes in small d values and poorly sensitive to changes in large d values. The FIT(d) criterion has a low sensitivity to changes in small d values and a substantially increasing sensitivity for large d values. The greater the FIT value, the better the linear equation. It is given by the following expression, where R is the correlation coefficient of the model, N the number of molecules in the training set and d the number of descriptors:

\[
\mathrm{FIT} = \frac{R^{2}\,(N - d - 1)}{(N + d^{2})\,(1 - R^{2})}
\tag{2}
\]
Commonly, a plot of FIT vs. d presents a maximum from which it is possible to determine the optimal number of molecular descriptors (dopt) to be included in the linear regression model. On many occasions, however, the maximum is not reached after adding a reasonable number of descriptors to the model. For this reason we recently proposed a variable FIT equation, or VFIT, which depends on a semi-empirical constant k that gives more weight to d in the FIT equation.[34] It reads:

\[
\mathrm{VFIT} = \frac{R^{2}\,(N - k\,d - 1)}{(N + d^{2})\,(1 - R^{2})}
\tag{3}
\]
By means of this equation we obtain dopt as the number of descriptors that yields the maximum value of VFIT (dmax) in the plot of VFIT vs. d. The semi-empirical constant k is determined by increasing it in steps of 0.5 until the maximum (dmax) remains unchanged for two increments and complies with the rule of thumb that at least 5 data points should be present for each fitting parameter.[35] In this way we obtain dopt without ever using the data from the test set. An example is presented in section 4.2.
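The FIT and VFIT criteria of Eqs. (2) and (3) can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's Matlab toolbox: the function names `fit`, `vfit` and `d_at_max` are hypothetical, and treating a maximum located at the largest d tried as "no maximum yet" is an assumption about how the plot of VFIT vs. d is read.

```python
def vfit(r, n, d, k):
    """VFIT of Eq. (3): r = correlation coefficient, n = training-set size,
    d = number of descriptors, k = semi-empirical constant."""
    r2 = r * r
    return r2 * (n - k * d - 1) / ((n + d ** 2) * (1.0 - r2))


def fit(r, n, d):
    """Kubinyi FIT of Eq. (2), i.e. VFIT with k = 1."""
    return vfit(r, n, d, k=1.0)


def d_at_max(r_by_d, n, k):
    """Return the d at which VFIT peaks for a given k.

    r_by_d maps each d to the R of the best d-descriptor model.
    Assumption: a peak at the largest d tried is reported as None,
    because the curve may still be rising there.
    """
    ds = sorted(r_by_d)
    best = max(ds, key=lambda d: vfit(r_by_d[d], n, d, k))
    return None if best == ds[-1] else best


# Values taken from the models reported in sections 4.1 and 4.2
print(round(fit(0.9523, 55, 6), 2))        # FIT of Eq. (4)
print(round(vfit(0.9746, 23, 4, 3.0), 3))  # VFIT (k = 3) of the ERM model in Table 7
```

With the R, N and d values reported for the two applications below, these functions should give values close to the FIT = 5.14 and VFIT = 4.856 quoted there. The search for k would then call `d_at_max` for k = 1, 1.5, 2, ... and stop as described in the paragraph above.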
4. Applications of ERM
The algorithm is as widely applicable as any other QSAR methodology. Its potential applications are very numerous, and in this section only a couple of recent ones are presented as examples. The main goal is to show how the algorithm is used, the type of predictive models that can be achieved, how these models are successfully used to predict experimental values, and how the results of ERM and GA compare.
4.1. QSAR Prediction of inhibition of aldose reductase for flavonoids[36]
In this study we performed a predictive analysis based on Quantitative Structure-Activity Relationships (QSAR) of an important property of flavonoids, namely the inhibition (IC50) of Aldose Reductase (AR). AR inhibition is important because it prevents cataract formation in diabetic patients. The best linear model constructed from 55 molecular structures incorporated six molecular descriptors, selected from more than a thousand geometrical, topological, quantum-mechanical and electronic types of descriptors. We chose a training set of 55 flavonoid derivatives whose activities were reported in the literature by Štefanič-Petek et al.[37] We first tried to use all 75 molecules from that paper, but a further revision of the references containing the experimental data, carried out to clarify some doubts about the structures,[38, 39] revealed important errors in the representation of some of them. For this reason the number of flavones was reduced to just 55 reliable structures. More precisely, some molecules either did not exhibit the desired flavone backbone or their exact structure was not found in the references. In addition, a set of 4 flavones with the desired backbone[40] was chosen to test the predictive ability of the new model. The structures of the compounds were first pre-optimized with the Molecular Mechanics Force Field (MM+) procedure included in the Hyperchem 6.03 package,[26] and the resulting geometries were further refined by means of the semiempirical method PM3 (Parametric Method-3) using the Polak-Ribiere algorithm and a gradient norm limit of 0.01 kcal·Å⁻¹.
We computed the molecular descriptors using the software Dragon 5.0,[27] including parameters of all types: Constitutional, Topological, Geometrical, Charge, GETAWAY (Geometry, Topology and Atoms-Weighted AssemblY), WHIM (Weighted Holistic Invariant Molecular descriptors), 3D-MoRSE (3D-Molecular Representation of Structure based on Electron diffraction), Molecular Walk Counts, BCUT descriptors, 2D-Autocorrelations, Aromaticity Indices, Randic Molecular Profiles, Radial Distribution Functions, Functional Groups, Atom-Centred Fragments, Empirical and Properties.[8] We enlarged that pool by adding 18 constitutional and 4 quantum-chemical descriptors (molecular dipole moments, total energies, HOMO-LUMO energies) not provided by the program Dragon. The resulting total pool thus consisted of D=1233 descriptors. It should be mentioned that dopt=6 was not determined in this case with the methodology described above, because that improved methodology had not been developed at that time. The best model found using ERM was the following:

\[
\begin{aligned}
-\log \mathrm{IC}_{50} = {} & -85.5375(\pm 10) - 10.882(\pm 1.4)\,\mathrm{E1u} - 15.398(\pm 0.9)\,\mathrm{MATS4e} + 55.920(\pm 5.35)\,\mathrm{BELm2} \\
& + 7.7606(\pm 0.9)\,\mathrm{HATS6e} - 2.6755(\pm 0.5)\,\mathrm{DISPe} - 18.253(\pm 1.1)\,\mathrm{R4p}
\end{aligned}
\tag{4}
\]

\[
N = 55,\quad R = 0.9523,\quad S = 0.3789,\quad \mathrm{FIT} = 5.14,\quad p < 10^{-5}
\]
\[
R_{loo} = 0.934,\quad S_{loo} = 0.447,\quad R_{l-30\%-o} = 0.803,\quad S_{l-30\%-o} = 0.886
\]
\[
\mathrm{RMSE}_{\text{Test Set}} = 2.9127
\]
The theoretical validation of the models was done using the well-known Leave-One-Out (loo) and Leave-More-Out (l-n%-o) Cross-Validation procedures,[41] where n% represents the percentage of molecules removed from the training set. We generated 5,000,000 cases of random data removal for l-n%-o, where n%=30% (16 flavonoids). As a benchmark we used RM and a GA to select an optimal model with dopt = 6 descriptors. To this end we optimized the GA parameters for this particular problem by means of several trials and arrived at the following convenient settings: Number of individuals = 250; Generation Gap = 0.9; Single Point Crossover Probability = 0.6; Mutation Probability = 0.7/d. We decided to stop the evolution process when one individual occupied more than 90% of the population or when the number of generations reached 2500. The results are summarized in Table 4, and the details of the descriptors involved in all models are presented in Table 5. By examining the statistical parameters calculated from the training and test sets we conclude that ERM produces better results than both the GA and RM when exploring large sets of descriptors.

Table 4. Comparison between the best models found using ERM, RM and GA with d=6.

Model    Descriptors used                              R        S        RMSE Test Set
ERM      E1u, MATS4e, BELm2, HATS6e, DISPe, R4p        0.952    0.379    2.9127
RM       BELp4, GGI8, MATS4e, Mor22e, E1p, R4v         0.936    0.436    3.9059
GA       TIC0, MATS4e, H7e, E1u, BELe4, R3v            0.937    0.433    4.1607
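The leave-n%-out cross-validation described above can be sketched as follows. This is a minimal illustration assuming NumPy and an ordinary least-squares model with an intercept; the function name `leave_many_out`, its default arguments and the choice of reporting the root-mean-square of the out-of-sample residuals are assumptions made here, not the exact statistics (R and S for l-n%-o) used in the chapter.

```python
import numpy as np


def leave_many_out(X, y, frac=0.30, n_trials=1000, seed=0):
    """Repeated random leave-n%-out cross-validation for a linear model.

    X: (N, d) matrix with the d selected descriptors; y: (N,) property.
    Returns the root-mean-square of all out-of-sample residuals
    (an assumed summary statistic, chosen for illustration).
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    n_out = int(round(frac * n))              # molecules removed per trial
    A = np.column_stack([np.ones(n), X])      # intercept + descriptors
    residuals = []
    for _ in range(n_trials):
        out = rng.choice(n, size=n_out, replace=False)
        keep = np.setdiff1d(np.arange(n), out)
        # Refit on the retained molecules only
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        # Predict the removed molecules
        residuals.append(y[out] - A[out] @ coef)
    res = np.concatenate(residuals)
    return float(np.sqrt(np.mean(res ** 2)))
```

The descriptor set is kept fixed during the resampling; only the regression coefficients are refitted in each trial. The chapter uses 5,000,000 random removals, whereas the default of 1000 trials here is just a convenient illustrative value.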
Table 5. Symbols for molecular descriptors involved in the models.

Molecular descriptor    Type                   Detail
E1u                     WHIM                   1st component accessibility directional WHIM index / unweighted.
MATS4e                  2D Autocorrelations    Moran autocorrelation - lag 4 / weighted by atomic Sanderson electronegativities.
BELm2                   BCUT                   Lowest eigenvalue n. 2 of Burden matrix / weighted by atomic masses.
HATS6e                  GETAWAY                Leverage-weighted autocorrelation of lag 6 / weighted by atomic Sanderson electronegativities.
DISPe                   Geometrical            d COMMA2 value / weighted by atomic Sanderson electronegativities.
R4p                     GETAWAY                R autocorrelation of lag 4 / weighted by atomic polarizabilities.
BELp4                   BCUT                   Lowest eigenvalue n. 4 of Burden matrix / weighted by atomic polarizabilities.
GGI8                    Topological            Topological charge index of order 8.
Mor22e                  3D-MoRSE               3D-MoRSE - signal 22 / weighted by atomic Sanderson electronegativities.
E1p                     WHIM                   1st component accessibility directional WHIM index / weighted by atomic polarizabilities.
R4v                     GETAWAY                R autocorrelation of lag 4 / weighted by atomic van der Waals volumes.
TIC0                    Topological            Total information content index (neighborhood symmetry of 0-order).
H7e                     GETAWAY                H autocorrelation of lag 7 / weighted by atomic Sanderson electronegativities.
BELe4                   BCUT                   Lowest eigenvalue n. 4 of Burden matrix / weighted by atomic Sanderson electronegativities.
R3v                     GETAWAY                R autocorrelation of lag 3 / weighted by atomic van der Waals volumes.
The plot of predicted vs. experimental −logIC50 for the best model obtained using ERM, shown in Figure 5, suggests that the 55 flavone derivatives follow a straight line.

Figure 5. Predicted (Eq. 4) versus experimental −logIC50 for the training (rhombus) and test (triangles) sets.

To demonstrate that Eq. (4) does not result from happenstance, we resorted to a widely used approach to establish model robustness: the so-called y-randomization.[42] It consists of scrambling the experimental property p in such a way that the activities do not correspond to the respective compounds. After analyzing 1,000,000 cases of y-randomization, the smallest value obtained from this process, S = 0.8254, turned out to be considerably greater than the one corresponding to the true calibration, S = 0.3789. This result supports the conclusions that the model is robust, that the calibration is not a fortuitous correlation, and that we have derived a reliable structure-activity relationship.
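The y-randomization test just described can be illustrated with a short Python sketch. It assumes NumPy, keeps the selected descriptors fixed while permuting the property vector, and uses the residual standard deviation S with N − d − 1 degrees of freedom; the function names and the default number of trials are hypothetical choices for illustration, not the author's implementation.

```python
import numpy as np


def residual_std(X, y):
    """Standard deviation S of a least-squares fit with intercept,
    using N - d - 1 degrees of freedom."""
    n, d = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    res = y - A @ coef
    return float(np.sqrt(res @ res / (n - d - 1)))


def y_randomization(X, y, n_trials=1000, seed=0):
    """Compare S of the true calibration with the smallest S obtained
    after randomly scrambling the property values n_trials times."""
    rng = np.random.default_rng(seed)
    s_true = residual_std(X, y)
    s_min = min(residual_std(X, rng.permutation(y)) for _ in range(n_trials))
    return s_true, s_min
```

A model is considered robust in this sense when the smallest S found over the scrambled data remains clearly larger than the S of the true calibration, as reported above (0.8254 vs. 0.3789).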
4.2. Predictive Study of solvent quenching of the ⁵D₀ → ⁷F₂ emission of Eu(6,6,7,7,8,8,8-heptafluoro-2,2-dimethyl-3,5-octanedionate)3[34]
In this study we performed a predictive analysis of the luminescence lifetime (τ) of Eu(fod)3 in different solvents. The luminescent properties of Eu(fod)3 have been helpful as replacements for radioactivity, as alternatives to standard fluorescent dyes and as donors in energy transfer experiments, and they are finding expanding applications in a wide variety of bioanalytical assays. Our best linear model, constructed from 23 molecular structures, combines four molecular descriptors selected from more than a thousand geometrical, topological, quantum-mechanical and electronic types of descriptors. We chose a training set of 23 solvents and a test set of 2 solvents for which the Eu(fod)3 luminescence lifetime (τ) had already been measured.[41] The size of the training set was selected so as to include as much structural information as possible in the calculated models. It is interesting to note that in this example only one molecule is involved and the solvent is changed, unlike the majority of QSAR studies, where different molecules are modeled in the same medium. This shows that the range of possible applications can be extended further by taking different approaches. Again the structures of the compounds were first pre-optimized with the Molecular Mechanics Force Field (MM+) procedure included in the Hyperchem 6.03 package,[26] and the resulting geometries were further refined by means of the semiempirical method PM3 (Parametric Method-3) using the Polak-Ribiere algorithm and a gradient norm limit of 0.01 kcal·Å⁻¹. We computed the molecular descriptors by means of the software Dragon 5.0,[27] including parameters of all types.[8] Additionally, 18 constitutional descriptors and 4 quantum-chemical ones (molecular dipole moments, total energies, HOMO-LUMO energies), which were not provided by the program Dragon, were added to the pool. We thus had a pool of D=1057 available descriptors. The optimal number of descriptors was determined using the new methodology: different predictive relationships linking the molecular structure of the solvents with the luminescence lifetime (τ) of Eu(fod)3 were calculated by means of linear regression models with 1 to 14 parameters (d), selected by ERM from the pool of D=1057 descriptors.
As k in VFIT is increased (Table 6), a first maximum appears at d=9 (k=2), a second one at d=7 (k=2.5) and finally a third one at d=4 (k=3), which remains unchanged for an additional increment and complies with the above-mentioned rule of thumb, which in this case indicates a maximum of 5 parameters. The resulting VFIT with k=3 increases with d up to a maximum at d=dmax=4, as shown in Figure 6; this is the optimal number of descriptors in the model. Figure 6 also shows that FIT does not present a maximum in the interval of d between 1 and 14. Table 6 shows that dmax=4 persists for two more k increments, supporting the conclusion that this is actually the optimal number of model parameters.
Table 6. Incremental values of k and the resulting number of descriptors (d) that present a maximum in VFIT.

k      d (max. in VFIT)
1      -
1.5    -
2      9
2.5    7
3      4
3.5    4
4      4
4.5    3
5      2
5.5    2
6      1
Figure 6. VFIT (squares) and FIT (circles, secondary axis) parameters as a function of the number of descriptors for the training set.
The best model found using ERM and dopt=4 is:

\[
\tau = 0.6903(\pm 0.04) + 0.1575(\pm 0.03)\cdot\mathrm{GATS5v} - 0.1731(\pm 0.01)\cdot\mathrm{Sp} + 0.2010(\pm 0.03)\cdot\mathrm{Jhetv} + 0.2120(\pm 0.01)\cdot\mathrm{RDF015e}
\tag{5}
\]

\[
N = 23,\quad R = 0.9746,\quad S = 0.0375,\quad \mathrm{FIT} = 8.7242,\quad p < 10^{-4}
\]
\[
R_{loo} = 0.9597,\quad S_{loo} = 0.0470,\quad R_{l-22\%-o} = 0.9031,\quad S_{l-22\%-o} = 0.0735
\]
\[
\mathrm{RMSE}_{\text{Test Set}} = 0.03808
\]
Once more the theoretical validation of the models was done using the well-known Leave-One-Out (loo) and Leave-More-Out (l-n%-o) Cross-Validation procedures,[41] where n% represents the percentage of molecules removed from the training set. We generated 5,000,000 cases of random data removal for l-n%-o, where n%=22% (5 solvents). We also tried the GA on the same problem, which required several runs to optimize its parameters. They were: Number of individuals = 20; Generation Gap = 0.9; Single Point Crossover Probability = 0.2; Mutation Probability = 0.7/d. The algorithm was stopped when one individual occupied more than 90% of the population or when the number of generations reached 2000. The results are summarized in Table 7, and definitions of the descriptors involved in all models are presented in Table 8. ERM presents better statistical parameters than GA. The plot of predicted vs. experimental τ shown in Figure 7 suggests that the 23 training and 2 test set solvents approximately follow a straight line. A y-randomization procedure was also performed by analyzing 5,000,000 cases; the smallest S value obtained, 0.06739, is considerably larger than the one found in the true calibration (S=0.0375).

Table 7. Linear models for the training set with N=23.

Model    Descriptors used                   R        S         VFIT     RMSE Test Set
ERM      GATS5v, Sp, Jhetv, RDF015e         0.975    0.0375    4.856    0.0381
GA       Mor30e, piPC05, BEHm5, X1Av        0.968    0.0420    3.815    0.1246
Table 8. Symbols for molecular descriptors appearing in different models.

Molecular descriptor    Type                            Description
GATS5v                  2D Autocorrelations             Geary autocorrelation - lag 5 / weighted by atomic van der Waals volumes.
Sp                      Constitutional                  Sum of atomic polarizabilities (scaled on Carbon atom).
Jhetv                   Topological                     Balaban-type index from van der Waals weighted distance matrix.
RDF015e                 Radial Distribution Function    Radial Distribution Function - 1.5 / weighted by atomic Sanderson electronegativities.
Mor30e                  3D-MoRSE                        3D-MoRSE - signal 30 / weighted by atomic Sanderson electronegativities.
piPC05                  Topological                     Molecular multiple path count of order 05.
BEHm5                   BCUT                            Highest eigenvalue n. 5 of Burden matrix / weighted by atomic masses.
X1Av                    Topological                     Average valence connectivity index chi-1.
Figure 7. Lifetime τ experimental versus predicted by Eq. (5) for the training set (rhombus) and the test set (triangles).
5. Conclusions
The Enhanced Replacement Method has been shown to be a superior algorithm for the search of an optimal set of descriptors from a much larger group. The algorithm outperforms RM, presenting better statistical parameters at the cost of an acceptable additional computational effort. It was successfully applied to an extremely demanding problem, namely searching a pool of 62873 descriptors calculated by the molecular descriptors family methodology. The applications presented above show that it is easy to use and confirm its superiority to GA. The linear nature of the models obtained makes interpretation of the structural parameters possible, in contrast to nonlinear methodologies such as Artificial Neural Networks.[43] Using the algorithm and the methodology presented above it is possible to determine the optimal number of parameters to include in a model using only information from the training set. The algorithms are available at the Mathworks[44] community file exchange site.
References
1. Hansch, C. and Leo, A. 1995, Exploring QSAR: Fundamentals and Applications in Chemistry and Biology, Am. Chem. Soc., Washington, D.C.
2. Sexton, W.A. 1950, Chemical Constitution and Biological Activity, D. Van Nostrand, New York.
3. Hansch, C. and Fujita, T. 1964, J. Am. Chem. Soc., 86, 1616-1626.
4. Hansch, C. 1969, Acc. Chem. Res., 2, 232-239.
5. King, R.B., Ed. 1983, Chemical Applications of Topology and Graph Theory; Studies in Physical and Theoretical Chemistry, Elsevier, Amsterdam.
6. Katritzky, A.R., Lobanov, V.S. and Karelson, M. 1995, Chem. Rev. Soc., 24, 279-287.
7. Trinajstic, N. 1992, Chemical Graph Theory, CRC Press, Boca Raton, FL.
8. Todeschini, R. and Consonni, V. 2000, Handbook of Molecular Descriptors, Wiley VCH, Weinheim, Germany.
9. Mercader, A.G., Duchowicz, P.R., Fernandez, F.M. and Castro, E.A. 2008, Chemom. Intell. Lab. Syst., 92, 138-144.
10. Duchowicz, P.R., Castro, E.A. and Fernández, F.M. 2006, MATCH Commun. Math. Comput. Chem., 55, 179-192.
11. Duchowicz, P.R., Fernández, M., Caballero, J., Castro, E.A. and Fernández, F.M. 2006, Bioorg. Med. Chem., 14, 5876-5889.
12. Helguera, A.M., Duchowicz, P.R., Pérez, M.A.C., Castro, E.A., Cordeiro, M.N.D.S. and González, M.P. 2006, Chemometr. Intell. Lab., 81, 180-187.
13. So, S.S. and Karplus, M. 1996, J. Med. Chem., 39, 1521-1530.
14. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H. and Teller, E. 1953, J. Chem. Phys., 21, 1087-1092.
15. Kirkpatrick, S., Gelatt Jr., C.D. and Vecchi, M.P. 1983, Science, 220, 671-680.
16. Mercader, A.G., Duchowicz, P.R., Sanservino, M.A., Fernandez, F.M. and Castro, E.A. 2007, Journal of Fluorine Chemistry, 128, 484-492.
17. Duchowicz, P.R., Mercader, A.G., Fernández, F.M. and Castro, E.A. 2007, Chemom. Intell. Lab. Syst., 90, 97-107.
18. Draper, N.R. and Smith, H. 1981, Applied Regression Analysis, John Wiley & Sons, New York.
19. Duchowicz, P.R., González, M.P., Helguera, A.M., Cordeiro, M.N.D.S. and Castro, E.A. 2007, Chemom. Intell. Lab. Syst., 88, 197-203.
20. Melanie, M. 1998, in An Introduction to Genetic Algorithms, A Bradford Book, The MIT Press, Cambridge, Massachusetts / London, England, pp. 3-9, 130-131.
21. Duchowicz, P.R., Vitale, M.G., Castro, E.A., Autino, J.C., Romanelli, G.P. and Bennardi, D.O. 2007, Eur. J. Med. Chem., 43, 1593-1602.
22. Edafiogho, I.O., Hinko, C.N., Chang, H., Moore, J.A., Mulzac, D., Nicholson, J.M. and Scott, K.R. 1992, J. Med. Chem., 35, 2798-2805.
23. Edafiogho, I.O., V.V, K., Ananthalakshmi and Kombian, S.B. 2006, Bioorganic & Medicinal Chemistry, 14, 5266-5272.
24. Eddington, N.D., Cox, D.S., Khurana, M., Salama, N.N., Stables, J.P., Harrison, S.J., Negussie, A., Taylor, R.S., Tran, U.Q., Moore, J.A., Barrow, J.C. and Scott, K.R. 2003, European Journal of Medicinal Chemistry, 38, 49-64.
25. Jäntschi, L. 2007, Leonardo Electronic Journal of Practices and Technologies, 3, 67-84.
26. HYPERCHEM. 6.03 (Hypercube) http://www.hyper.com.
27. DRAGON. 5.0 Evaluation Version http://www.disat.unimib.it/chm.
28. Matlab. 5.0 The MathWorks Inc. http://www.mathworks.com/.
29. Derive. 5, Texas Instrument Incorporated, http://www.derive-europe.com/specs.asp?d6.
30. C. ANSI, Tutorial, http://cprogramminglanguage.net/c-programming-languagetutorial.aspx.
31. Hawkins, D.M. 2004, Journal of Chemical Information and Computer Sciences, 44, 1-12.
32. Kubinyi, H. 1994, QSAR & Combinatorial Science, 13, 285-294.
33. Kubinyi, H. 1994, QSAR & Combinatorial Science, 13, 393-401.
34. Mercader, A.G., Duchowicz, P.R., Fernández, F.M., Castro, E.A. and Wolcan, E. 2008, Chem. Phys. Lett., 462, 352-357.
35. Hansch, C. 1990, Comprehensive Drug Design, Pergamon Press, New York.
36. Mercader, A.G., Duchowicz, P.R., Fernández, F.M., Castro, E.A., Bennardi, D.O., Autino, J.C. and Romanelli, G.P. 2008, Bioorganic & Medicinal Chemistry, 16, 7470-7476.
37. Štefanič-Petek, A., Krbavčič, A. and Šolmajer, T. 2002, Croat. Chem. Acta, 75, 517-529.
38. Okuda, J., Miwa, I., Inagaki, K., Horie, T. and Nakayama, M. 1982, Biochemical Pharmacology, 31, 3807-3822.
39. Okuda, J., Miwa, I., Inagaki, K., Horie, T. and Nakayama, M. 1984, Chem. Pharm. Bull., 32, 767-772.
40. Sung Lim, S., Hoon Jung, S., Ji, J., Shin, K.H. and Keum, S.R. 2001, Journal of Pharmacy and Pharmacology, 53, 653-668.
41. Hawkins, D.M., Basak, S.C. and Mills, D. 2003, J. Chem. Inf. Model., 43, 579-586.
42. Wold, S. and Eriksson, L. 1995, in Statistical validation of QSAR results, Vol. 2 (Ed. H. v. d. Waterbeemd), VCH, Weinheim, pp. 309-318.
43. Wold, S., Sjostrom, M. and Eriksson, L. 1998, Encyclopedia of Computational Chemistry, Wiley, Chichester, England.
44. Mathworks. http://www.mathworks.com/matlabcentral/fileexchange/19578.
Research Signpost 37/661 (2), Fort P.O. Trivandrum-695 023 Kerala, India
QSPR-QSAR Studies on Desired Properties for Drug Design, 2010: 189-203 ISBN: 978-81-308-0404-0 Editor: Eduardo A. Castro
7. Orthogonalization methods in QSPR - QSAR Studies
Pablo R. Duchowicz*, Francisco M. Fernández and Eduardo A. Castro
Instituto de Investigaciones Fisicoquímicas Teóricas y Aplicadas INIFTA (UNLP, CCT La Plata-CONICET), Diag. 113 y 64, C.C. 16, Suc.4, (1900) La Plata, Argentina
Abstract. We discuss some features of the orthogonalization methods commonly applied to QSPR - QSAR studies. We outline the well-known multivariable linear regression analysis in vector form in order to compare mainly the Randic and Gram-Schmidt orthogonalization procedures, and also lay the basis for other approaches such as Löwdin's. We expect that the present review may become the starting point for future developments in QSAR - QSPR Theory.
Correspondence/Reprint request: Dr. Pablo R. Duchowicz, Instituto de Investigaciones Fisicoquímicas Teóricas y Aplicadas INIFTA (UNLP, CCT La Plata-CONICET), Diag. 113 y 64, C.C. 16, Suc.4, (1900) La Plata, Argentina. E-mail: pabloducho@gmail.com / prduchowicz@yahoo.com.ar

1. Introduction
The Quantitative Structure - Property and Structure - Activity Relationships (QSPR - QSAR) Theory [1, 2] is one of the theoretical methods available in the literature for predicting physicochemical properties or biological activities of substances based solely on knowledge of their molecular structure. Since the pioneering work of Hansch in 1964 [3], the basis of QSPR - QSAR has been the assumption that the variation of the behaviour of chemical compounds, as expressed by any experimentally measured property or activity of such compounds, can be correlated with numerical entities related to some aspect of the chemical structure represented through well-defined molecular descriptors [4, 5]. The hypotheses involved in QSPR - QSAR analysis have proved suitable for a wide spectrum of properties / activities of interest. The molecular descriptors employed for establishing the model are generally used to describe different characteristics / attributes of a given structure in order to yield information about the property / activity being studied. These numerical variables capture different constitutional, topological, geometrical or electronic aspects of the molecular structure under consideration, and in most cases can be obtained through mathematical formulae derived from several theories, such as Chemical Graph Theory, Information Theory, Quantum Mechanics, etc. [6-8]. Nowadays there are more than a thousand descriptors in the literature, which can be readily calculated via different software packages [9-11]. QSPR - QSAR models make it possible to predict the property / activity of substances that have not yet been tested for many different reasons, either because they are unstable or toxic, or simply because their measurement requires too much time or is expensive. In economic terms, these studies allow a rational use of the resources available in the laboratory by avoiding expensive and unnecessary experimental determinations. With respect to ethical aspects, QSPR - QSAR analyses applied to the field of Toxicology have become very important in the virtual screening of the toxic potential of compounds before their synthesis [12], and thus represent an effective alternative that reduces animal testing in biological assays. In Medicinal Chemistry, QSPR - QSAR predictions of both ADMET properties [13] and the oral bioavailability of compounds [14] have been conveniently addressed.
Finally, from the theoretical point of view, the model can shed light on the mechanisms of the physicochemical properties or biological activities of the compounds. Because a single descriptor is unable to carry the complete structural information of a given molecule, one has to search among the more than a thousand descriptors available in the literature for those that are the most representative / descriptive parameters for the modeled property / activity. Various standard statistical methods are in common use for QSPR - QSAR model design. These include linear methods, such as Multivariable Linear Regression (MLR) [15], Principal Component Analysis (PCA) [16], Genetic Algorithms (GA) [17] and the Replacement Method (RM) [18], and non-linear methods, such as Artificial Neural Networks (ANN) [19] or Support Vector Machines (SVM) [20].
In past years, the MLR technique has proved valuable for establishing predictive QSPR - QSAR models by performing an exhaustive analysis of a pool containing a great number of structural descriptors. Linear models are more general and can transparently reveal the effect that the inclusion / exclusion of structural variables has on the property being modeled, thus making it possible to suggest cause / effect relationships by means of parallelisms. The main advantage of linear regression models over non-linear ones is that they suffer less from the over-fitting (over-training) problem [21, 22], since model building does not involve too many optimization parameters, just one regression coefficient for each numerical variable of the model. Earlier attempts to derive the best QSPR - QSAR models based on MLR have led to the use of orthogonal descriptors [23-27], that is to say, descriptors with non-overlapping structural information. It is well known that the use of a set of orthogonal descriptors does not improve the global statistical parameters of the linear regression model, although the standard errors of the regression coefficients appear to be smaller [26]. However, a set of orthogonal descriptors (orthogonal predictor variables in general) offers several advantages. First, several expressions (such as those for the correlation coefficient and the standard deviation) are simpler [28]. Second, orthogonal descriptors have proved more suitable for the search of the best model [28-31]. Third, the orthogonalization algorithms are useful for identifying linearly dependent or almost linearly dependent descriptors [32, 33] and make the linear regression more stable [32]. Fourth, the coefficients of orthogonal descriptors do not change when we add more descriptors to the set.
During the last decades, several authors have developed programs that make it possible to obtain the best subset of d descriptors out of a large number D of descriptors, in the sense of a smaller standard deviation S or a larger Fisher parameter F [11, 18, 28, 34]. This analysis requires linear regressions for all $\binom{D}{d} = \dfrac{D!}{(D-d)!\,d!}$ possible subsets, and it is also facilitated by the use of orthogonal descriptors because they make it easier to determine the contribution of each descriptor to the set.
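To give a feeling for how quickly the number of subsets grows, the binomial coefficient above can be evaluated directly. The short Python sketch below is only illustrative; the function name `n_subsets` and the round pool size D = 1000 (standing for "more than a thousand" descriptors) are assumptions chosen for the example.

```python
from math import comb


def n_subsets(D, d):
    """Number of distinct d-descriptor subsets in a pool of D descriptors,
    i.e. the number of regressions an exhaustive search would require."""
    return comb(D, d)


# Growth of the exhaustive search with d for an illustrative pool of D = 1000
for d in range(1, 7):
    print(d, n_subsets(1000, d))
```

The count grows roughly by a factor of D/d with each additional descriptor, which is why exhaustive searches become impractical for all but the smallest d.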
2. Overview of vector notions of multivariable linear regression analysis
Linear algebra gives us an appropriate language for the discussion of the Least Squares Method and the Multivariable Regression Analysis [15]. It is possible to develop this approach further and derive some useful relationships related to orthogonal concepts that have not been discussed earlier, as reported in a recent work [35]. We consider a vector space $V = \{\mathbf{f}, \mathbf{g}, \ldots\}$ over the field of real numbers, endowed with an inner product $\mathbf{f}\cdot\mathbf{g}$, and define the norm $\|\mathbf{f}\| = \sqrt{\mathbf{f}\cdot\mathbf{f}}$ of a vector $\mathbf{f} \in V$ and the distance between $\mathbf{f}$ and $\mathbf{g}$, $\mathbf{f}, \mathbf{g} \in V$, as $D(\mathbf{f}, \mathbf{g}) = \|\mathbf{f} - \mathbf{g}\|$. We consider a set of d + 1 linearly independent vectors $B = \{\mathbf{f}_0, \mathbf{f}_1, \ldots, \mathbf{f}_d\}$ that span a subspace $S_B \subseteq V$. It is well known that the set B is linearly independent if and only if the determinant $|\mathbf{M}|$ of the matrix $\mathbf{M}$ with elements $M_{ij} = \mathbf{f}_i\cdot\mathbf{f}_j$, $i, j = 0, 1, \ldots, d$, is nonzero. It is our purpose to find the closest approximation to a given vector $\mathbf{f} \in V$ by means of a linear combination $\hat{\mathbf{f}} \in S_B$ of the vectors of B:

\[
\hat{\mathbf{f}} = \sum_{j=0}^{d} c_j\,\mathbf{f}_j
\tag{2.1}
\]

Clearly, the problem to solve here is to find the set of coefficients $c_j$ that minimize the distance $D(\mathbf{f}, \hat{\mathbf{f}})$:

\[
\frac{\partial D(\mathbf{f}, \hat{\mathbf{f}})^{2}}{\partial c_j} = 0, \qquad j = 0, 1, \ldots, d
\tag{2.2}
\]

It is quite easy to prove that these coefficients $c_j$ make $\mathbf{f} - \hat{\mathbf{f}}$ orthogonal to the subspace $S_B$. Therefore, it follows from $(\mathbf{f} - \hat{\mathbf{f}})\cdot\mathbf{f}_j = 0$, $j = 0, 1, \ldots, d$, that the optimal coefficients $c_j$ are solutions to the following set of linear equations and are unique:

\[
\sum_{k=0}^{d} M_{jk}\,c_k = \mathbf{f}\cdot\mathbf{f}_j, \qquad j = 0, 1, \ldots, d
\tag{2.3}
\]
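Under such conditions it follows that $D(\mathbf{g}, \mathbf{f}) \geq D(\hat{\mathbf{f}}, \mathbf{f})$ for all $\mathbf{g} \in S_B$. One can also prove that

\[
(\mathbf{f} - \mathbf{v})\cdot(\hat{\mathbf{f}} - \mathbf{v}) = \|\hat{\mathbf{f}} - \mathbf{v}\|^{2} \qquad \text{for all } \mathbf{v} \in S_B
\]

as well as

\[
\|\mathbf{f} - \hat{\mathbf{f}}\|^{2} = \|\mathbf{f} - \mathbf{v}\|^{2} - \|\hat{\mathbf{f}} - \mathbf{v}\|^{2} .
\]

It follows from these expressions that $\mathbf{f}\cdot\hat{\mathbf{f}} = \|\hat{\mathbf{f}}\|^{2}$ and $\|\hat{\mathbf{f}}\|^{2} \leq \|\mathbf{f}\|^{2}$.

If we now define $\mathbf{u}_w = \mathbf{u} - \left(\mathbf{u}\cdot\mathbf{w} / \|\mathbf{w}\|^{2}\right)\mathbf{w}$ and $\mathbf{v}_w = \mathbf{v} - \left(\mathbf{v}\cdot\mathbf{w} / \|\mathbf{w}\|^{2}\right)\mathbf{w}$ for $\mathbf{u}, \mathbf{v}, \mathbf{w} \in V$, and take into account the Cauchy - Schwarz inequality [15], we conclude that the ratio

\[
C(\mathbf{u}, \mathbf{v}, \mathbf{w}) = \frac{\mathbf{u}_w\cdot\mathbf{v}_w}{\|\mathbf{u}_w\|\,\|\mathbf{v}_w\|}
\tag{2.4}
\]

satisfies $-1 \leq C(\mathbf{u}, \mathbf{v}, \mathbf{w}) \leq 1$. In the application of these results to MLR the vectors are N-tuples of real numbers, such as $\mathbf{f} = (f(1), f(2), \ldots, f(N))$, and we choose the inner product to be:

\[
\mathbf{f}\cdot\mathbf{g} = \sum_{j=1}^{N} f(j)\,g(j)
\tag{2.5}
\]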
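Typically the components of the vector $\mathbf{f}_0$ are $f_0(j) = 1$, $j = 1, 2, \ldots, N$, so that $\|\mathbf{f}_0\|^{2} = N$ and

\[
\frac{\mathbf{g}\cdot\mathbf{f}_0}{\|\mathbf{f}_0\|^{2}} = \frac{1}{N}\sum_{j=1}^{N} g(j) = \bar{g}
\tag{2.6}
\]

is the expectation value of $\mathbf{g} \in V$. It follows from the definition of Eq. (2.4) and from $\mathbf{f}\cdot\hat{\mathbf{f}} = \|\hat{\mathbf{f}}\|^{2}$ and $\hat{\mathbf{f}}\cdot\mathbf{f}_0 = \mathbf{f}\cdot\mathbf{f}_0 = b_0\,\|\mathbf{f}_0\|^{2}$, with $b_0 = \mathbf{f}\cdot\mathbf{f}_0 / \|\mathbf{f}_0\|^{2}$, that:

\[
C(\mathbf{f}, \hat{\mathbf{f}}, \mathbf{f}_0) = \frac{\|\hat{\mathbf{f}}\|^{2} - b_0^{2}\,\|\mathbf{f}_0\|^{2}}{\sqrt{\left(\|\mathbf{f}\|^{2} - b_0^{2}\,\|\mathbf{f}_0\|^{2}\right)\left(\|\hat{\mathbf{f}}\|^{2} - b_0^{2}\,\|\mathbf{f}_0\|^{2}\right)}} = \sqrt{\frac{\|\hat{\mathbf{f}}\|^{2} - b_0^{2}\,\|\mathbf{f}_0\|^{2}}{\|\mathbf{f}\|^{2} - b_0^{2}\,\|\mathbf{f}_0\|^{2}}}
\tag{2.7}
\]

Therefore, the coefficient of correlation R between $\mathbf{f}$ and $\hat{\mathbf{f}}$ is simply given by:

\[
0 \leq R = C(\mathbf{f}, \hat{\mathbf{f}}, \mathbf{f}_0) = \sqrt{\frac{\|\hat{\mathbf{f}}\|^{2} - b_0^{2}\,\|\mathbf{f}_0\|^{2}}{\|\mathbf{f}\|^{2} - b_0^{2}\,\|\mathbf{f}_0\|^{2}}} \leq 1
\tag{2.8}
\]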
The standard deviation S in terms of the present notation reads [36]:

\[
S = \frac{D(\mathbf{f}, \hat{\mathbf{f}})}{\sqrt{N - d - 1}}
\tag{2.9}
\]
From now on we assume that the vector f represents a given physicochemical property or biological activity for a set of N molecules, and the vectors f1 , f 2 ,..., f d are predictor variables or descriptors in the language of the QSPR - QSAR Theory. Accordingly, f is the regression model. Both R and S parameters depend upon the subspace spanned by the vectors f1,f2,…, fd, and measure the model’s fitting between f and f . The R parameter is based on a theoretical definition that is usually employed as a non absolute criterion of linearity for the model. It is only based on random errors and does not contemplate the precision of the experimental data. The value of R increases and approaches unity and D ( f , f ) decreases and approaches zero as the number of linear independent vectors f j increases. When d + 1 = N we have an interpolation case with R = 1 and D(f , f ) = 0 because SB = V. The S parameter quantifies the precision of the predicted data by taking into account the degrees of freedom of the mathematical problem: N−d−1. It is to be noted that the MLR would have sense whenever the value of S is smaller than the experimental standard deviation (Sexp), which indicates the deviation of the experimental observations from their mean value. A QSPR - QSAR model is successful if it gives acceptable global statistical parameters (S, R, etc.) with a small number of descriptors. Finally, for calculating the standard error of the regression coefficients cj we employ the following formula [36]: δ c j = (M −1 ) jj S
(2.10)
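The quantities of Eqs. (2.8)-(2.10) can be illustrated numerically. The following is only a minimal sketch on synthetic data, assuming NumPy is available; the function name `mlr_statistics` and the data are illustrative and not taken from the chapter, and $D(\mathbf{f}, \hat{\mathbf{f}})$ is assumed to be the squared norm of the residual vector.

```python
import numpy as np

def mlr_statistics(F, f):
    """Least-squares fit of f on the columns of F (first column = f0, all ones).

    Returns the coefficients c, the correlation R (Eq. 2.8), the standard
    deviation S (Eq. 2.9) and the coefficient standard errors (Eq. 2.10).
    Assumption: D(f, f_hat) is the squared norm of the residual vector.
    """
    N, p = F.shape                      # p = d + 1 columns, including f0
    M = F.T @ F                         # matrix of scalar products f_i . f_j
    c = np.linalg.solve(M, F.T @ f)     # normal equations
    f_hat = F @ c
    f0 = F[:, 0]
    b0 = (f @ f0) / (f0 @ f0)           # mean value of f
    R = np.sqrt((f_hat @ f_hat - b0**2 * (f0 @ f0)) /
                (f @ f - b0**2 * (f0 @ f0)))
    D = np.sum((f - f_hat) ** 2)
    S = np.sqrt(D / (N - p))            # N - d - 1 degrees of freedom
    dc = S * np.sqrt(np.diag(np.linalg.inv(M)))
    return c, R, S, dc

# Synthetic example: N = 20 "molecules", d = 2 descriptors plus the constant f0.
rng = np.random.default_rng(0)
N = 20
x = rng.normal(size=(N, 2))
F = np.column_stack([np.ones(N), x])
f = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 1] + 0.1 * rng.normal(size=N)
c, R, S, dc = mlr_statistics(F, f)
```

For a model that contains the constant vector, the value of R obtained in this way coincides with the ordinary correlation coefficient between the observed and the fitted values.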
3. Outline of some popular orthogonalization techniques
As argued in the Introduction, from both the theoretical and practical points of view it is convenient to use an orthogonal basis set for the subspace $S_B$. There are many ways of orthogonalizing a finite set of linearly independent vectors; here we compare two well-established methods: the Randic orthogonalization method and the Gram Schmidt procedure. The outcome of both methods depends on the order of the sequential process; in other words, different orthogonalization orders produce different sets of orthogonal vectors that span the same subspace. For the case of d nonorthogonal vectors, there exist d! orders of orthogonalization. In addition, it is very important to note that the values of the global statistical parameters (S, R, etc.) depend on the subspace spanned by the chosen vectors $\mathbf{f}_j$ and are independent of the order of construction of the orthogonal vectors. From a given set of linearly independent vectors $\{\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_d\}$ we want to obtain a set of orthogonal vectors $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_d\}$ by means of linear combinations
$$ \mathbf{u}_i = \sum_{j=1}^{d} c_{ji} \mathbf{f}_j, \quad i = 1, 2, \ldots, d \tag{3.1} $$
that should satisfy $\mathbf{u}_i \cdot \mathbf{u}_j = \|\mathbf{u}_i\|^2 \delta_{ij}$. Therefore, if $\mathbf{c}$, $\mathbf{\Delta}$ and $\mathbf{\lambda}$ are the matrices with elements $c_{ij}$, $\Delta_{ij} = \mathbf{f}_i \cdot \mathbf{f}_j$ and $\lambda_{ij} = \mathbf{u}_i \cdot \mathbf{u}_j$, respectively, then the coefficients of the linear combinations of Eq. (3.1) should satisfy the general expression:
$$ \mathbf{c}^t \mathbf{\Delta} \mathbf{c} = \mathbf{\lambda} \tag{3.2} $$
If, for example, $\mathbf{c}^t = \mathbf{c}^{-1}$ then Eq. (3.2) is the diagonalization of the positive-definite, symmetric matrix $\mathbf{\Delta}$. Alternatively, if $\mathbf{c} = \mathbf{\Delta}^{-1/2}$ we have Löwdin's orthogonalization procedure and $\mathbf{\lambda} = \mathbf{I}$ is the identity matrix (orthonormal vectors). In what follows we compare the most widely applied strategies.
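As an illustration of Eq. (3.2), the Löwdin choice $\mathbf{c} = \mathbf{\Delta}^{-1/2}$ can be checked numerically. This is a hedged sketch assuming NumPy; the function name `lowdin` and the random test vectors are illustrative only, and the inverse square root is built from the eigen-decomposition of $\mathbf{\Delta}$.

```python
import numpy as np

def lowdin(Fmat):
    """Columns of Fmat are linearly independent vectors f_j.

    Returns c = Delta^(-1/2) and the orthonormal vectors U = Fmat @ c,
    i.e. u_i = sum_j c_ji f_j as in Eq. (3.1).
    """
    Delta = Fmat.T @ Fmat                    # Delta_ij = f_i . f_j
    w, V = np.linalg.eigh(Delta)             # Delta is symmetric, positive definite
    c = V @ np.diag(w ** -0.5) @ V.T         # symmetric inverse square root
    return c, Fmat @ c

rng = np.random.default_rng(1)
Fmat = rng.normal(size=(10, 3))              # three made-up descriptor vectors
c, U = lowdin(Fmat)
lam = c.T @ (Fmat.T @ Fmat) @ c              # left-hand side of Eq. (3.2)
```

Here `lam` should be the identity matrix up to rounding, which is the statement $\mathbf{\lambda} = \mathbf{I}$ for orthonormal vectors.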
3.1. Randic orthogonalization
This is called "the method of correlations and residuals". Suppose we consider the order of orthogonalization to be $B = \{\mathbf{f}_1, \ldots, \mathbf{f}_d\}$; then the method proposed by Randic [23] is the following step-by-step process:
1. From a set of d linearly independent vectors $B = \{\mathbf{f}_1, \ldots, \mathbf{f}_d\}$ that span a subspace $S_B \subseteq V$, select a vector which will be considered the first orthogonal descriptor, ${}^1\Omega$. According to the order in B, ${}^1\Omega = \mathbf{f}_1$.
2. Make the remaining d − 1 vectors $\{\mathbf{f}_2, \mathbf{f}_3, \ldots, \mathbf{f}_d\}$ orthogonal to $\mathbf{f}_1$ by calculating the linear regression residuals of $\{\mathbf{f}_2, \mathbf{f}_3, \ldots, \mathbf{f}_d\}$ against $\mathbf{f}_1$, which leads to a new set ${}^i\Omega_1 = \mathbf{f}_i - \hat{\mathbf{f}}_i$, i = 2, ..., d, where $\hat{\mathbf{f}}_i$ is the regression model for $\mathbf{f}_i$. This is called the first orthogonalization step.
3. According to the selected order in B, from the set ${}^i\Omega_1$, i = 2, ..., d, select the second orthogonal vector to be ${}^2\Omega = {}^2\Omega_1$.
4. Make the remaining d − 2 vectors ${}^i\Omega_1$, i = 3, ..., d, orthogonal to ${}^2\Omega$ by calculating the linear regression residuals of ${}^i\Omega_1$, i = 3, ..., d, against ${}^2\Omega$, which leads to a new set of residuals ${}^i\Omega_{2,1}$, i = 3, ..., d. The notation ${}^i\Omega_{2,1}$ means that ${}^i\Omega_1$ has been made orthogonal to ${}^2\Omega_1$, which in turn is $\mathbf{f}_2$ already made orthogonal to $\mathbf{f}_1$. This is the second orthogonalization step.
5. Select the next orthogonal descriptor according to the order of orthogonalization and proceed as in the previous steps. The zth orthogonalization step consists of making d − z vectors orthogonal to a given one, leading to a new set of vectors. Finally, the orthogonal basis derived from the set $B = \{\mathbf{f}_1, \ldots, \mathbf{f}_d\}$ will be $O = \{{}^1\Omega, \ldots, {}^d\Omega\}$. The Randic orthogonalization process is illustrated in Fig. 1.
Figure 1. Schematic illustration of the Randic orthogonalization process.
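The step-by-step process above can be sketched in a few lines of plain Python. This is only an illustrative reading of the procedure, not the authors' implementation: the helper names are hypothetical, the data are made up, and each regression is taken through the origin, so the constant vector of ones is simply placed first in the list when regressions with a constant term are wanted (an assumption of this sketch).

```python
def dot(a, b):
    """Scalar product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def randic_orthogonalize(vectors):
    """Sequential residual orthogonalization in the given order.

    vectors: list of equal-length lists, assumed linearly independent.
    At step z the z-th vector is frozen as an orthogonal descriptor and every
    later vector is replaced by its residual from a regression (through the
    origin) against it.
    """
    work = [list(v) for v in vectors]
    for z in range(len(work)):
        omega = work[z]
        norm2 = dot(omega, omega)
        for i in range(z + 1, len(work)):
            slope = dot(work[i], omega) / norm2
            work[i] = [x - slope * w for x, w in zip(work[i], omega)]
    return work

# Made-up example with the constant vector placed first.
f0 = [1.0, 1.0, 1.0, 1.0, 1.0]
f1 = [1.0, 2.0, 3.0, 4.0, 5.0]
f2 = [1.0, 4.0, 9.0, 16.0, 25.0]
omegas = randic_orthogonalize([f0, f1, f2])
```

The first vector is returned unchanged, and every pair of returned vectors has a vanishing scalar product up to rounding.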
3.2. Gram Schmidt orthogonalization
The Gram Schmidt method [15, 32, 33] consists of the following procedure. Starting from the basis $B = \{\mathbf{f}_0, \mathbf{f}_1, \ldots, \mathbf{f}_d\}$ we construct a new basis $\{\mathbf{u}_0, \mathbf{u}_1, \ldots, \mathbf{u}_d\}$ hierarchically and with a certain order of orthogonalization according to:
$$ \mathbf{u}_0 = \mathbf{f}_0, \quad \mathbf{u}_j = \mathbf{f}_j - \sum_{k=0}^{j-1} \frac{\mathbf{u}_k \cdot \mathbf{f}_j}{\|\mathbf{u}_k\|^2} \mathbf{u}_k \tag{3.3} $$
The closest approximation to $\mathbf{f}$ in terms of the set of orthogonal vectors takes the simpler form:
$$ \hat{\mathbf{f}} = \sum_{j=0}^{d} b_j \mathbf{u}_j, \quad b_j = \frac{\mathbf{f} \cdot \mathbf{u}_j}{\|\mathbf{u}_j\|^2} \tag{3.4} $$
and it is seen that the orthogonal coefficients $b_j$ are independent of d.
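The Gram Schmidt construction and the orthogonal coefficients can be sketched in plain Python as follows. The function names and the numerical data are illustrative assumptions; the point of the example is that the coefficients of the smaller model are reused unchanged in the larger one.

```python
def dot(a, b):
    """Scalar product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def gram_schmidt(basis):
    """u_0 = f_0, u_j = f_j - sum_k (u_k . f_j / |u_k|^2) u_k."""
    us = []
    for fj in basis:
        uj = list(fj)
        for uk in us:
            coef = dot(uk, fj) / dot(uk, uk)
            uj = [a - coef * b for a, b in zip(uj, uk)]
        us.append(uj)
    return us

def orthogonal_coefficients(f, us):
    """b_j = f . u_j / |u_j|^2 for each orthogonal vector u_j."""
    return [dot(f, u) / dot(u, u) for u in us]

f0 = [1.0] * 5
f1 = [1.0, 2.0, 3.0, 4.0, 5.0]
f2 = [1.0, 4.0, 9.0, 16.0, 25.0]
f = [2.0, 3.0, 7.0, 12.0, 21.0]                # made-up property values

us = gram_schmidt([f0, f1, f2])
b_full = orthogonal_coefficients(f, us)        # model with d = 2
b_small = orthogonal_coefficients(f, us[:2])   # model with d = 1
```

In this example `b_full[0]` is the mean of the property values, and the first two entries of `b_full` are identical to `b_small`.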
4. Comparison of Randic and Gram Schmidt methods
We will try to go beyond a previous discussion on the subject [33] and show that some numerical results discussed previously by other authors can be proved rigorously and easily. By showing in detail all the equations used in the present calculation, together with some numerical examples, we hope to make it easier for the reader to reproduce our results. For the case of orthogonal descriptors, it follows from Eq. (2.8) that
$$ R^2 = \sum_{j=1}^{d} R_j^2, \quad R_j^2 = \frac{b_j^2 \|\mathbf{u}_j\|^2}{\|\mathbf{f}\|^2 - b_0^2 \|\mathbf{f}_0\|^2} \tag{4.1} $$
where $R_j$ is the coefficient of correlation between $\mathbf{f}$ and $\hat{\mathbf{f}}$ calculated with a single orthogonal descriptor. As indicated previously in Section 3.1, the orthogonalization method proposed by Randic does not modify one of the d vectors, say $\mathbf{f}_1$, during orthogonalization. The Gram Schmidt coefficient $b_0$ can be compared to Randic's $b_0^R$ by taking into account that:
$$ \hat{\mathbf{f}} = b_0^R \mathbf{f}_0 + b_1 \mathbf{f}_1 + \sum_{j=2}^{d} b_j \mathbf{u}_j, \quad b_0^R = b_0 - b_1 \frac{\mathbf{f}_1 \cdot \mathbf{u}_0}{\|\mathbf{u}_0\|^2} \tag{4.2} $$
Now, note that according to Eq. (3.3) we can write:
$$ \mathbf{u}_j = \sum_{k=0}^{j} d_{kj} \mathbf{f}_k, \quad d_{jj} = 1 \tag{4.3} $$
where the explicit expressions for the coefficients $d_{kj}$ are not relevant for the present discussion. If we substitute Eq. (4.3) into Eq. (3.4) and take into account that the least-squares solution is unique, we conclude that $c_k = \sum_{j=k}^{d} d_{kj} b_j$ (see Eq. (2.1)), from which it follows that
$$ c_d = b_d \tag{4.4} $$
This result was suggested by numerical experiments conducted by Randic [23] and later discussed by this and other authors [23, 25, 27, 37], but it was not proved rigorously until several years later [35]. Equation (4.4) is a consequence of the particular triangular form of the Gram Schmidt linear combination of Eq. (3.3) and one does not expect that it applies to other orthogonalization procedures. According to the standard formula [36], the error of the orthogonal coefficients $b_j$ is:
$$ \delta b_j = \frac{S}{\|\mathbf{u}_j\|} \tag{4.5} $$
It can be demonstrated [35] that, in addition to Eq. (4.4), we have
$$ \delta c_d = \delta b_d \tag{4.6} $$
We illustrate these results with a previous numerical investigation, where Hosoya's Z index is fitted by means of the connectivity and higher connectivity indices ${}^j\chi$ discussed by Randic [23]. In this case, we choose $\mathbf{f} = Z$ and $\mathbf{f}_j = {}^j\chi$, j = 1, 2, 3, 4. Consequently, the Gram Schmidt orthogonal descriptors $\mathbf{u}_j$ should be compared with Randic's ${}^j\Omega$. The best model with lowest S that results from the connectivity indices involves the descriptors $\mathbf{f}_0, \mathbf{f}_1, \mathbf{f}_3, \mathbf{f}_4$. In Table 1 we show the coefficients of the nonorthogonal and orthogonal descriptors and their respective standard errors for the best model. The coefficient $b_0$ (the constant in the language of linear regression) should be modified according to Eq. (4.2) in order to have complete agreement. We see that the errors are smaller for the orthogonal descriptors, as argued earlier by Randic [26], and that $c_4 = b_4$ and $\delta c_4 = \delta b_4$ (Eqs. (4.4) and (4.6), respectively). The calculations demonstrate that the order of orthogonalization affects the coefficients $b_j$, their standard errors $\delta b_j$, and the absolute relative errors $|\delta b_j / b_j|$.
Table 1. Coefficients of the nonorthogonal and orthogonal descriptors for the model with lowest S.
descriptor   coefficient   standard error   descriptor   coefficient   standard error
f0           -44.248       1.337            u0           17.000        0.051
f1           18.935        0.474            u1           17.967        0.359
f3           0.781         0.204            u3           1.102         0.146
f4           -0.686        0.304            u4           -0.686        0.304
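The equality of the last coefficient and of its standard error in the nonorthogonal and Gram Schmidt descriptions can be reproduced on synthetic data. The sketch below assumes NumPy and random descriptors (it does not use the Hosoya index data of Table 1), and it takes $S^2$ as the squared residual norm divided by N − d − 1.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 12, 3
F = np.column_stack([np.ones(N), rng.normal(size=(N, d))])   # f0, f1, ..., fd
f = rng.normal(size=N)                                        # synthetic property

# Nonorthogonal fit: coefficients c and standard errors S * sqrt((M^-1)_jj).
M = F.T @ F
c = np.linalg.solve(M, F.T @ f)
S = np.sqrt(np.sum((f - F @ c) ** 2) / (N - d - 1))
dc = S * np.sqrt(np.diag(np.linalg.inv(M)))

# Gram Schmidt descriptors, orthogonal coefficients b and errors S / |u_j|.
U = np.zeros_like(F)
for j in range(d + 1):
    U[:, j] = F[:, j]
    for k in range(j):
        U[:, j] -= (U[:, k] @ F[:, j]) / (U[:, k] @ U[:, k]) * U[:, k]
norms = np.sqrt(np.sum(U ** 2, axis=0))
b = (U.T @ f) / norms ** 2
db = S / norms
```

With these definitions `c[-1]` agrees with `b[-1]` and `dc[-1]` with `db[-1]`, while the remaining orthogonal errors are never larger than the nonorthogonal ones.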
What happens when we add or remove one of the descriptors to or from our model? It is possible to demonstrate [35] that
$$ \left(S^{[d]}\right)^2 = \left(S^{[d+1]}\right)^2 + \frac{b_j^2 \|\mathbf{u}_j\|^2 - \left(S^{[d+1]}\right)^2}{N - d - 1} \tag{4.7} $$
which clearly shows the effect on S of adding or removing $\mathbf{u}_j$. Equation (4.7) explains why Lucic et al. [28] could improve the model by removing carefully selected orthogonal descriptors. If $b_j^2 \|\mathbf{u}_j\|^2$ is sufficiently small, then removal of $\mathbf{u}_j$ may decrease the value of S while slightly reducing the value of R (refer to Eq. (4.1)). Therefore, taking into account that $b_j^2 \|\mathbf{u}_j\|^2 = (\mathbf{f} \cdot \mathbf{u}_j)^2 / \|\mathbf{u}_j\|^2$ plays such a relevant role in the values of R and S, we consider it to be a measure of the contribution or weight of the orthogonal descriptor $\mathbf{u}_j$ to the model. We illustrate these concepts by means of the modeling of the boiling points of 18 octanes with a set of connectivity indices, as done by Lucic et al. [28]. In this case we choose $\mathbf{f}_0$ as usual and $\mathbf{f}_{j+1} = {}^j\chi$, j = 0, ..., 6. The best model found with five nonorthogonal descriptors that minimizes S is:
$$ bp = (2583.416 \pm 450.924)\mathbf{f}_0 - (182.756 \pm 25.593)\mathbf{f}_1 - (310.498 \pm 67.625)\mathbf{f}_2 - (46.150 \pm 12.544)\mathbf{f}_3 + (8.304 \pm 1.872)\mathbf{f}_4 - (5.666 \pm 1.703)\mathbf{f}_6 \tag{4.8} $$
with R = 0.993 and S = 0.887 °C. In Table 2 we show the weights of the orthogonal descriptors in this subset with smallest S, for four different orders of orthogonalization. The entries of this table suggest that it may be reasonable to remove the orthogonal vector $\mathbf{u}_6$ with the smallest weight of $9.980 \cdot 10^{-5}$. If we do that we obtain four new models with four descriptors each that yield R = 0.993 and S = 0.855 °C. This value of S is smaller than the smallest one that we get by removal of nonorthogonal descriptors. This notable advantage of orthogonal predictor variables, which helps us to remove insignificant descriptors, was first pointed out by Lucic et al. [28]. Therefore, the possibility of removing orthogonal descriptors from the model depends upon the selected order of orthogonalization.
Table 2. Weights of the descriptors for several orderings of the optimal model.
descriptor indices   descriptor weights
23614                0.674   0.135   9.980·10^-5   0.152   0.024
32614                0.777   0.032   9.980·10^-5   0.152   0.024
23641                0.674   0.135   9.980·10^-5   0.113   0.063
32641                0.777   0.032   9.980·10^-5   0.113   0.063
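The effect of removing one orthogonal descriptor on S, as expressed by the relation between $S^{[d]}$ and $S^{[d+1]}$ discussed in this section, can be checked with a small plain-Python sketch. The three orthogonal vectors and the property values are made up for illustration (they are not the octane data), and $S^2$ is assumed to be the squared residual norm divided by N − d − 1.

```python
def dot(a, b):
    """Scalar product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def std_dev(f, us):
    """S for a model built on the orthogonal vectors us = [u_0, ..., u_d].

    Assumption: S^2 = |f - f_hat|^2 / (N - d - 1), with len(us) = d + 1.
    """
    weights = [dot(f, u) ** 2 / dot(u, u) for u in us]   # b_j^2 |u_j|^2
    D = dot(f, f) - sum(weights)
    return (D / (len(f) - len(us))) ** 0.5

# Mutually orthogonal vectors on six points (made-up example).
u0 = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
u1 = [-5.0, -3.0, -1.0, 1.0, 3.0, 5.0]
u2 = [5.0, -1.0, -4.0, -4.0, -1.0, 5.0]
f = [1.2, 2.9, 5.1, 7.0, 9.2, 10.8]

S_big = std_dev(f, [u0, u1, u2])        # model with u_1 and u_2
S_small = std_dev(f, [u0, u1])          # u_2 removed, so d = 1
w2 = dot(f, u2) ** 2 / dot(u2, u2)      # weight b_2^2 |u_2|^2
N, d = len(f), 1
rhs = S_big ** 2 + (w2 - S_big ** 2) / (N - d - 1)
```

In this example the weight `w2` is smaller than `S_big ** 2`, so removing the descriptor lowers S, and `S_small ** 2` coincides with `rhs`.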
5. Selection of an appropriate order of orthogonalization
Once we have the best set of d nonorthogonal descriptors, we look for the best order of orthogonalization out of the d! possible ones, according to a Dominant Component Analysis (DCA) [28-31]. This procedure consists of selecting the order of orthogonalization that leads to the highest value of the coefficient of correlation between $\mathbf{f}$ and each orthogonal descriptor individually ($R_j$, Eq. (4.1)). These descriptors are called dominant descriptors. Thus, one searches for the order of orthogonalization which generates the greatest contributions for the first, second, third, etc. dominant descriptors. A similar scheme was proposed by Randic [23]. Considering again the modeling of the boiling points of several octanes with a set of connectivity indices [28], Table 3 includes different orders of orthogonalization with the descriptors ${}^1\chi$, ${}^4\chi$, ${}^5\chi$. As can be seen, DCA establishes the order given by Eq. (5.4) as the best one. The appropriate selection of the order of orthogonalization may lead to insignificant orthogonal descriptors in the model, having smaller $R_j$ than the rest of the variables present in it, thus making it possible to remove descriptors as discussed in the previous section (Eq. (4.7)). An order established with the DCA criterion maximizes the contribution of some descriptors of the model and necessarily minimizes the contributions of the remaining variables.
Table 3. Best model for descriptors with the orthogonal basis (R = 0.991, S = 0.922 °C, F = 2.449).
Orthogonalization order: ${}^1\chi$, ${}^4\chi$, ${}^5\chi$
$bp = 113.713 + 30.328\,\mathbf{u}_1 + 12.270\,\mathbf{u}_4 - 12.773\,\mathbf{u}_5$ (5.1)
$R_1 = 0.821$, $R_4 = 0.404$, $R_5 = 0.379$
Orthogonalization order: ${}^1\chi$, ${}^5\chi$, ${}^4\chi$
$bp = 113.713 + 30.328\,\mathbf{u}_1 - 13.129\,\mathbf{u}_4 + 8.029\,\mathbf{u}_5$ (5.2)
$R_1 = 0.821$, $R_4 = 0.495$, $R_5 = 0.249$
Orthogonalization order: ${}^4\chi$, ${}^1\chi$, ${}^5\chi$
$bp = 113.713 - 40.726\,\mathbf{u}_1 + 2.879\,\mathbf{u}_4 - 12.773\,\mathbf{u}_5$ (5.3)
$R_1 = 0.905$, $R_4 = 0.138$, $R_5 = 0.379$
Orthogonalization order: ${}^4\chi$, ${}^5\chi$, ${}^1\chi$
$bp = 113.713 + 53.416\,\mathbf{u}_1 + 2.879\,\mathbf{u}_4 - 6.369\,\mathbf{u}_5$ (5.4)
$R_1 = 0.952$, $R_4 = 0.138$, $R_5 = 0.236$
Orthogonalization order: ${}^5\chi$, ${}^1\chi$, ${}^4\chi$
$bp = 113.713 - 36.485\,\mathbf{u}_1 - 13.129\,\mathbf{u}_4 + 6.732\,\mathbf{u}_5$ (5.5)
$R_1 = 0.820$, $R_4 = 0.495$, $R_5 = 0.251$
Orthogonalization order: ${}^5\chi$, ${}^4\chi$, ${}^1\chi$
$bp = 113.713 + 53.416\,\mathbf{u}_1 - 2.262\,\mathbf{u}_4 + 6.731\,\mathbf{u}_5$ (5.6)
$R_1 = 0.952$, $R_4 = 0.108$, $R_5 = 0.251$
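A search over all d! orders of orthogonalization in the spirit of DCA can be sketched as follows. This is a hypothetical reading of the selection rule, not the procedure of refs. [28-31]: the orders are ranked by their sorted $R_j^2$ values, compared lexicographically, the constant vector is always orthogonalized first, and the descriptor names and data are made up.

```python
from itertools import permutations

def dot(a, b):
    """Scalar product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def r2_contributions(f, f0, descriptors):
    """R_j^2 for Gram Schmidt vectors built in the given order,
    with the constant vector f0 always taken first."""
    b0 = dot(f, f0) / dot(f0, f0)
    total = dot(f, f) - b0 ** 2 * dot(f0, f0)
    us, out = [f0], []
    for fj in descriptors:
        uj = list(fj)
        for uk in us:
            coef = dot(uk, fj) / dot(uk, uk)
            uj = [a - coef * b for a, b in zip(uj, uk)]
        us.append(uj)
        out.append(dot(f, uj) ** 2 / dot(uj, uj) / total)
    return out

def dominant_order(f, f0, named):
    """Hypothetical DCA-style selection: among all d! orders keep the one whose
    sorted R_j^2 values are lexicographically largest."""
    best = None
    for order in permutations(named):
        r2 = r2_contributions(f, f0, [named[k] for k in order])
        key = sorted(r2, reverse=True)
        if best is None or key > best[0]:
            best = (key, order, r2)
    return best[1], best[2]

f0 = [1.0] * 6
named = {"x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
         "x2": [1.0, 4.0, 9.0, 16.0, 25.0, 36.0],
         "x3": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]}
f = [1.0, 3.5, 9.5, 15.0, 26.0, 35.0]           # made-up property values
order, r2 = dominant_order(f, f0, named)
```

Whatever order is selected, the sum of the individual $R_j^2$ values is the same, because the global $R^2$ depends only on the subspace spanned by the descriptors.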
6. Final remarks
Among the first data reduction techniques that attempt a description of physicochemical properties or biological activities through a regression on a small number of orthogonal descriptors (components) are the Principal Component Analysis (PCA) and the Partial Least Squares techniques [38-40]. Furthermore, various publications [41-44] employ the Randic orthogonalization procedure as a step prior to the standardization of the regression coefficients of a linear regression model built with a small number of descriptors. This represents an alternative way that allows interpretation and assigns more importance to those descriptors exhibiting larger absolute standardized orthogonal coefficients, therefore leading to a ranking of the contributions of the descriptors to the analyzed property. Other studies apply the Gram Schmidt procedure in the recently proposed Spectral-Structure Activity Relationship (S-SAR) method [45-47]. The advantage of this technique is that it enables one to replace the Multivariable Linear Regression analysis by purely algebraic models with some conceptual and computational advantages. However, it has also been proved that they give the same final analytical equation and correlation results as MLR. In conclusion, the orthogonalization procedure is recommended as a technique for elucidating the structural factors, expressed by mathematical descriptors, that contribute most during the model design in the prediction of a given physicochemical property or biological activity. Furthermore, the proposal of effective QSPR-QSAR models requires the employment of orthogonal predictors for avoiding structural redundancy in the developed models. Finally, the collinearity of the molecular descriptors should be as low as possible, because the interrelatedness among the different descriptors can lead to highly unstable regression coefficients, which makes it impossible to know the relative importance of an index and underestimates the utility of the regression coefficients of the model.
Acknowledgements
The authors thank Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET) and Universidad Nacional de La Plata for financial support.
References
1. Hansch, C. and Leo, A. 1995, Exploring QSAR. Fundamentals and Applications in Chemistry and Biology, American Chemical Society, Washington, D. C.
2. Katritzky, A.R., Lobanov, V.S. and Karelson, M. 1995, Chem. Soc. Rev., 24, 279.
3. Hansch, C. 1964, J. Am. Chem. Soc., 86, 1616.
4. Karelson, M. 2000, Molecular Descriptors in QSAR / QSPR, Wiley Interscience, New York.
5. Todeschini, R. and Consonni, V. 2000, Handbook of Molecular Descriptors, Wiley VCH, Weinheim.
6. Akaike, H. 1973, Second International Symposium on Information Theory, B.N. Petrov, F. Csáki (Eds.), Akademiai Kiado, Budapest, 267.
7. Bader, R.F.W. 1990, Atoms in Molecules - A Quantum Theory, Clarendon Press, Oxford.
8. Trinajstic, N. 1992, Chemical Graph Theory, CRC Press, Boca Raton (FL).
9. Recon, Version 5.5. Rensselaer Polytechnic Institute, Troy, New York, USA.
10. CoDESSA, http://ufark12.chem.ufl.edu
11. Dragon, http://michem.disat.unimib.it/chm
12. Worth, A.P., Bassan, A., De Bruijn, J., Saliner, A.G., Netzeva, T., Patlewicz, G., Pavan, M., Tsakovska, I. and Eisenreich, S. 2007, SAR&QSAR Environ. Res., 18, 111.
13. Noringer, U. 2005, SAR&QSAR Environ. Res., 16, 1.
14. Martin, Y.C. 2005, J. Med. Chem., 48, 3164.
15. Apostol, T.M. 1969, Calculus, Blaisdell Publishing Co., Waltham, Massachusetts.
16. Malinowski, E.R. 1991, Factor Analysis in Chemistry, Wiley, New York.
17. Leardi, R. 1996, Genetic Algorithms in Molecular Modeling. Principles of QSAR and Drug Design, J. Devillers (Ed.), Academic Press, London, 67.
18. Duchowicz, P.R., Castro, E.A. and Fernández, F.M. 2006, MATCH Commun. Math. Comput. Chem., 55, 179.
19. Zupan, J. 1998, Encyclopedia of Computational Chemistry, Wiley, Chichester.
20. Vapnik, V. 1995, The nature of statistical learning theory, Springer-Verlag, New York.
21. Livingstone, D.J. and Manallack, D.T. 1993, J. Med. Chem., 36, 1295.
22. Tetko, I.V., Luik, A.I. and Poda, G.I. 1993, J. Med. Chem., 36, 811.
23. Randic, M. 1991, J. Chem. Inf. Model., 31, 311.
24. Randic, M. 1991, Croat. Chem. Acta, 64, 43.
25. Randic, M. 1993, J. Comput. Chem., 14, 363.
26. Randic, M. 1994, Int. J. Quantum Chem., 21S, 215.
27. Randic, M. 2001, J. Chem. Inf. Comput. Sci., 41, 602.
28. Lucic, B., Nikolic, S., Trinajstic, N. and Juretic, D. 1995, J. Chem. Inf. Comput. Sci., 35, 532.
29. Lucic, B. and Trinajstic, N. 1997, SAR&QSAR Environ. Res., 7, 45.
30. Lucic, B. and Trinajstic, N. 1999, J. Chem. Inf. Comput. Sci., 39, 121.
31. Lucic, B., Trinajstic, N., Sild, S., Karelson, M. and Katritzky, A.R. 1999, J. Chem. Inf. Comput. Sci., 39, 610.
32. Draper, N.R. and Smith, H. 1981, Applied Regression Analysis, John Wiley&Sons, New York.
33. Klein, D.J., Randic, M., Babic, D., Lucic, B., Nikolic, S. and Trinajstic, N. 1997, Int. J. Quantum Chem., 63, 215.
34. Katritzky, A.R., Lobanov, V. and Karelson, M. 1996, CODESSA Reference Manual, University of Florida, Gainesville (FL).
35. Fernandez, F.M., Duchowicz, P.R. and Castro, E.A. 2004, MATCH Commun. Math. Comput. Chem., 51, 39.
36. Hildebrand, F.B. 1956, Introduction to numerical analysis, McGraw-Hill Book Company, Inc., New York.
37. Soskic, M. 1996, J. Chem. Inf. Comput. Sci., 36, 829.
38. Cash, G.G. and Breen, J.J. 1992, Chemosphere, 24, 1607.
39. Sutter, J.M., Kalivas, J.H. and Lang, P.K. 1992, J. Chemometrics, 6, 217.
40. Wold, S., Sjöström, M. and Eriksson, L. 1998, Encyclopedia of Computational Chemistry, R. von Schleyer, N.L. Allinger, T. Clark, J. Gasteiger, P.A. Kollman, H.F. Schaefer III and P.R. Schreiner (Eds.), Wiley, Chichester, 2006.
41. Duchowicz, P.R., Vitale, M.G., Castro, E.A., Fernandez, M. and Caballero, J. 2007, Bioorg. Med. Chem., 15, 2680.
42. Duchowicz, P.R., Gonzalez, M.P., Helguera, A.M., Cordeiro, M.N.D.S. and Castro, E.A. 2007, Chemom. Intell. Lab. Syst., 88, 197.
43. Duchowicz, P.R., Garro, J.C.M. and Castro, E.A. 2008, Chemom. Intell. Lab. Syst., 91, 133.
44. Duchowicz, P.R. and Ocsachoque, M.A. 2009, QSAR&Combinat. Sci., 28, 281.
45. Lacrama, A.-M. 2007, Int. J. Mol. Sci., 8, 842.
46. Putz, M.V. and Lacrama, A.-M. 2007, Int. J. Mol. Sci., 8, 363.
47. Putz, M.V., Putz, A.-M., Lazea, M., Ienciu, L. and Chiriac, A. 2009, Int. J. Mol. Sci., 10, 1193.
Research Signpost 37/661 (2), Fort P.O. Trivandrum-695 023 Kerala, India
QSPR-QSAR Studies on Desired Properties for Drug Design, 2010: 205-218 ISBN: 978-81-308-0404-0 Editor: Eduardo A. Castro
8. Modeling some chemical reactions as a spatio-temporal discrete dynamical systems
Juan Luis García Guirao
Departamento de Matemática Aplicada y Estadística. Universidad Politécnica de Cartagena, Hospital de Marina, 30203-Cartagena, Región de Murcia, Spain
Abstract. The aim of the present work is to state some topological dynamics results for a family of lattice dynamical systems stated by K. Kaneko in [Phys. Rev. Lett., 65, 1391-1394, 1990], which is related to the Belousov-Zhabotinskii chemical reactions. We prove that these lattice dynamical systems (LDS) are chaotic in the sense of Li-Yorke and in the sense of Devaney, and have positive topological entropy for zero coupling constant. Moreover, we present a definition of distributional chaos on a sequence (DCS) for LDS and we state two different sufficient conditions for having DCS. These results survey three different papers, two of them written jointly with M. Lampart.
1. Introduction
Classical Discrete Dynamical Systems (DDS's), i.e., a pair composed of a space X (usually compact and metric) and a continuous self-map ψ on X, have been widely studied in the literature (see e.g., [BC] or [D]) because they are good examples of problems coming from the theory of Topological Dynamics and model many phenomena from biology, physics, chemistry, engineering and social sciences (see for example, [Da], [KO], [Pu] or [Po]). In most cases, in the formulation of such models ψ is a C∞, an analytic or a polynomial map. Coming from physical/chemical engineering applications, such as digital filtering, imaging and spatial vibrations of the elements which compose a given chemical product, a generalization of DDS's has recently appeared as an important subject for investigation: the so-called Lattice Dynamical Systems (LDS) or 1d Spatiotemporal Discrete Systems. In the next section we provide all the definitions. For the importance of this type of system, see for instance [ChF]. Deciding whether a system of this type has complicated dynamics or not by the observation of one topological dynamics property is an open problem. The aim of this work is to characterize, by using different notions of chaos and the concept of topological entropy, the dynamical complexity of a family of coupled lattice dynamical systems which contains the one stated by K. Kaneko in [K] (for more details see the references therein), which is related to reactions of Belousov-Zhabotinskii type. Concretely, we prove that these LDS are chaotic in the sense of Li-Yorke and in the sense of Devaney, and have positive topological entropy for zero coupling constant. Moreover, we present a definition of distributional chaos on a sequence (DCS) for LDS and we state two different sufficient conditions for having DCS. We present some other problems for the future related to physical/chemical applications.
Correspondence/Reprint request: Dr. Juan Luis García Guirao, Departamento de Matemática Aplicada y Estadística. Universidad Politécnica de Cartagena, Hospital de Marina, 30203-Cartagena, Región de Murcia, Spain. E-mail: juan.garcia@upct.es
2. Definitions and notation
Let us start by introducing two of the most well-known notions of chaos for a discrete dynamical system generated by the iteration of a continuous self-map f defined on a compact metric space X with metric d.
Definition 1. A pair of points x, y ∈ X is called a Li-Yorke pair if
(1) $\limsup_{n \to \infty} d(f^n(x), f^n(y)) > 0$,
(2) $\liminf_{n \to \infty} d(f^n(x), f^n(y)) = 0$.
A set S ⊂ X is called a LY-scrambled set for f (Li-Yorke set) if #S ≥ 2 and every pair of different points in S is a LY-pair, where # means the cardinality.
For continuous self-maps on the interval [0, 1], Li and Yorke [LY] suggested that a map should be called "chaotic" if it admits an uncountable scrambled set. This was subsequently accepted as a formal definition.
Definition 2. We say that a map f is Li and Yorke chaotic if it has an uncountable LY-scrambled set.
One may consider weaker variants of chaos in the sense of Li and Yorke based on the cardinality of scrambled sets (see for instance [GL1]). On the other hand, a map f is:
1) transitive if for any pair of nonempty open sets U, V ⊂ X there exists an n ∈ N such that $f^n(U) \cap V \neq \emptyset$;
2) locally eventually onto if for every nonempty open set U ⊂ X there exists an m ∈ N such that $f^m(U) = X$. Since this property can be regarded as the topological analog of exactness defined in ergodic theory, it is often called topological exactness. We use the second name here.
Recall that a periodic point of period n of f is a point x such that $f^n(x) = x$ and $f^j(x) \neq x$ for 0 < j < n.
Definition 3. A map f is called Devaney chaotic if it satisfies the following two properties: (1) f is transitive, (2) the set of periodic points of f is dense in X.
The original definition given by Devaney [D] contained an additional condition on f, which reflects the unpredictability of chaotic systems: sensitive dependence on initial conditions. However, it was proved (see, e.g., [Ba]) that sensitivity is a consequence of transitivity and dense periodicity under the assumption that X is an infinite set.
Let us recall the notion of positive topological entropy, which is known as topological chaos. An attempt to measure the complexity of a dynamical system is based on a computation of how many points are necessary in order to approximate (in some sense) with their orbits all possible orbits of the system. A formalization of this intuition leads to the notion of topological entropy of the map f, which is due to Adler, Konheim and McAndrew [AKM].
We recall here the equivalent definition formulated by Bowen [B], and independently by Dinaburg [Di]: the topological entropy of a map f is a number defined by
$$ h(f) = \lim_{\varepsilon \to 0} \limsup_{n \to \infty} \frac{1}{n} \log \# E(n, f, \varepsilon), $$
where E(n, f, ε) is an (n, f, ε)-span with minimal possible number of points, i.e., a set such that for any x ∈ X there is y ∈ E(n, f, ε) satisfying $d(f^j(x), f^j(y)) < \varepsilon$ for 1 ≤ j ≤ n. A map f is topologically chaotic (briefly, PTE) if its topological entropy h(f) is positive.
Lattice dynamical systems. The state space of an LDS (Lattice Dynamical System) is the set
$$ \mathcal{X} = \{ x : x = \{x_i\},\ x_i \in \mathbb{R}^d,\ i \in \mathbb{Z}^D,\ \|x_i\| < \infty \}, $$
where d ≥ 1 is the dimension of the range space of the map of state $x_i$, D ≥ 1 is the dimension of the lattice, and the $l^2$ norm $\|x\|_2 = \left( \sum_{i \in \mathbb{Z}^D} |x_i|^2 \right)^{1/2}$ is usually taken ($|x_i|$ is the length of the vector $x_i$).
We deal with the following family of LDS, which contains the system stated by K. Kaneko in [K] (for more details see the references therein) and is related to the Belousov-Zhabotinskii reactions (see [KO] and, for an experimental study of chemical turbulence by this method, [HGS], [HOY], [HHM]):
(1)
where m is the discrete time index, n is the lattice side index with system size L (i.e. n = 1, 2, ..., L), ε is the coupling constant and f(x) is a unimodal map on the unit closed interval I = [0, 1], i.e. f(0) = f(1) = 0 and f has a unique critical point c with 0 < c < 1 such that f(c) = 1. For simplicity we will deal with the so-called "tent map", defined by
$$ f(x) = 1 - |1 - 2x|, \quad x \in [0, 1]. \tag{2} $$
In general, one of the following periodic boundary conditions of the system (1) is assumed:
(1) ,
(2) ,
(3) ,
standardly, the first case of the boundary conditions is used.
Equation (1) has been studied by many authors, mostly experimentally or semi-analytically rather than analytically. The first paper with analytic results is [ChL], where it was proved that this system is Li-Yorke chaotic; we give an alternative and easier proof of it in this paper. We consider, as an example, the 2-element one-way coupled logistic lattice (see [KW]) $H : I^2 \to I^2$ written as
(3)
where f is the tent map.
3. Li-Yorke, Devaney and topological chaos
The following two lemmas will be used for the proof of the main results. The proof of the first one is obvious (or see, e.g., [DK]).
Lemma 4. Let f : X → X and g : Y → Y be maps with dense sets of periodic points. Then the Cartesian product f × g : X × Y → X × Y also has a dense set of periodic points.
Proposition 5 ([BC]). Let f be the tent map defined by (2). Put $I_{k,l} = [(l-1)/2^k,\ l/2^k]$, where $l \in \{1, 2, 3, \ldots, 2^k\}$ and $k \in \mathbb{N}$. Then the restriction of $f^k$ to $I_{k,l}$ is a linear homeomorphism onto [0, 1].
Let us note that the Cartesian product of two topologically transitive maps is not necessarily topologically transitive (see e.g. [DK]). Hence, for the proof of Theorem 7 we need to prove:
Lemma 6. The system (1) is topologically exact for ε = 0.
Proof. Let U be a given nonempty open subset of $I^L$. Then the projection of U to the m-th coordinate contains an open connected subset $U_m$ of I, for each m = 1, 2, ..., L. Then by Proposition 5 there is $k_m$ such that $f^{k_m}(U_m) = I$. If we put $K = \max\{k_m \mid m = 1, 2, \ldots, L\}$, then the K-th iterate of U under the system (1) equals $I^L$.
Theorem 7. The system (1) is chaotic in the sense of Devaney for ε = 0.
Proof. The assertion follows by Lemma 4 and Lemma 6.
The following proposition is a very powerful tool of symbolic dynamics¹ for observing nearly all dynamical properties.
Proposition 8. There is a subsystem of (1) which is conjugated² to .
Proof. Since the critical point for the tent map is equal to 1/2 we can divide the interval I into two sets $P_1 = [0, 1/3)$ and $P_2 = (2/3, 1]$ and get a family . Then each point can be represented as an infinite symbol sequence $C_1(x_0) = \alpha = a_1 a_2 a_3 \ldots$, where $\Lambda_1$ is the Cantor ternary set and
Returning to (3), we can divide its range set into four sets (see the figure below), where the upper index corresponds to the $x_1$ coordinate and the lower one to $x_2$. Then again each point can be encoded as an infinite symbol sequence $C_2(p) = \alpha = a_1 a_2 a_3 \ldots$, where $\Lambda_2$ is the 2-dimensional Cantor ternary set³ and
__________________________
¹ Here, $\sigma_2$ is the shift operator on the space of all two-element sequences $\Sigma_2$.
² We say that two dynamical systems (X, f) and (Y, g) are topologically conjugated if there is a homeomorphism h : X → Y such that h ∘ f = g ∘ h; such a homeomorphism is called a conjugacy.
³ By n-dimensional Cantor set we mean the Cantor set constructed as a subset of .
Now, we denote by $\sigma_k$ the k-shift operator on the k-symbol alphabet, defined by $\sigma_k : \Sigma_k \to \Sigma_k$ and $\sigma_k(a_1 a_2 a_3 \ldots) = a_2 a_3 \ldots$, where $\Sigma_k = \{\alpha \mid \alpha = a_1 a_2 a_3 \ldots \text{ and } a_i \in \{1, 2, \ldots, k\}\}$, so the effect of this operator is to delete the first symbol of the sequence α. We can observe that $\Lambda_2$ is an invariant⁴ subset of the range space of the system (3) and that each of its points is encoded by exactly one point from $\Sigma_4$, for ε = 0. So, by [F], the shift operator $\sigma_4$ acts on $\Sigma_4$ exactly as (3) on $\Lambda_2$, for ε = 0.
Theorem 9. The system (1) is chaotic in the sense of Li-Yorke for ε = 0.
Proof. By Proposition 8 the system (1) has a subsystem conjugated to , which is Li-Yorke chaotic (see e.g. [BGKM]).
Proposition 10 ([W]). If (X, f) and (Y, g) are topologically conjugated systems then h(f) = h(g).
For the proof of the result concerning topological entropy we use the well-known result:
Proposition 11 ([W]). Let $\sigma_k$ be the k-shift operator. Then $h(\sigma_k) = \log k$.
Theorem 12. The system (1) has positive topological entropy for ε = 0. Moreover, its entropy equals L log 2.
Proof. By the construction of Section 2 it follows that the 2-dimensional system (1) contains a 2-dimensional Cantor set on which it is conjugated (see, e.g., [F]) to the shift space $\Sigma_4$ by the conjugacy map $C_2$, for ε = 0. Then by Proposition 10 the system has topological entropy equal to the entropy of $\sigma_4$. Consequently, by Proposition 11 its entropy is log 4 = 2 log 2.
__________________________
⁴ A set M is invariant for the map f if f(M) ⊂ M.
To end the proof, it suffices to note that the construction of Section 2 can be generalized to L-dimensional systems. Such a system will be conjugated to the $2^L$-shift by the conjugacy $C_L$ and, by the same arguments as in the paragraph above, its entropy equals L log 2.
Remark 13. There are many other notions of chaos, like distributional chaos, ω-chaos or the specification property. The system (1) exhibits all of this chaotic behavior, by the same arguments as in the proof of Theorem 9, for zero coupling constant. But obviously this system is not minimal, where minimal means that there is no proper subset which is invariant, nonempty and closed. The proof of Theorem 12 can be done in an alternative way. For zero coupling constant it is obvious that each lattice side contains a subsystem conjugated to $(\Sigma_2, \sigma_2)$. Then the system (1) contains a subsystem conjugated to the L-fold product of $(\Sigma_2, \sigma_2)$ and by (see, e.g., [W]) the assertion follows.
For non-zero coupling constants the dynamical behavior of the system (1) is more complicated. The first question is what the invariant subsets of the phase space look like. Secondly, what are the properties of ω-limit sets (i.e., sets of limit points of the trajectories)? The answers to these questions will be nontrivial. A similar system was studied in [BGLL], where the method of resultants was used to prove the existence of periodic points of higher order. The same approach as in [BGLL] should be used.
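Proposition 5, on which the topological exactness argument of Lemma 6 rests, can be checked with exact rational arithmetic. The sketch below assumes that the tent map has the standard form f(x) = 1 − |2x − 1| (consistent with f(0) = f(1) = 0 and f(1/2) = 1); the function names are illustrative.

```python
from fractions import Fraction

def tent(x):
    """Tent map on [0, 1] with critical point 1/2: f(x) = 1 - |2x - 1|."""
    return 1 - abs(2 * x - 1)

def iterate(x, k):
    """k-th iterate of the tent map applied to x."""
    for _ in range(k):
        x = tent(x)
    return x

def maps_onto_unit_interval(k, l):
    """Check Proposition 5 on I_{k,l} = [(l-1)/2^k, l/2^k] with exact rationals.

    The endpoints must go to {0, 1} and the midpoint to 1/2, which for a map
    that is linear on I_{k,l} means that the image is all of [0, 1].
    """
    a = Fraction(l - 1, 2 ** k)
    b = Fraction(l, 2 ** k)
    ends = {iterate(a, k), iterate(b, k)}
    return ends == {0, 1} and iterate((a + b) / 2, k) == Fraction(1, 2)

k = 4
results = [maps_onto_unit_interval(k, l) for l in range(1, 2 ** k + 1)]
```

Every entry of `results` is true for the sixteen dyadic intervals of length 1/16, in agreement with the statement that the k-th iterate stretches each such interval over the whole of [0, 1].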
4. Distributional chaos on a sequence for LDS

The aim of this section is to characterize the dynamical complexity of the family of coupled lattice systems (1) by introducing the notion of distributional chaos on a sequence (DCS) for lattice dynamical systems (LDS). We present two different sufficient conditions for this family of LDS to exhibit DCS. These results complete and generalize the results from [GL1, GL2] surveyed in the previous sections, where Li-Yorke chaos and topological entropy, respectively, are studied. The main results in this direction are the following, see [G]:

Theorem 14. Let f be a continuous self-map defined on a compact interval [a, b]. If f is Li-Yorke chaotic, then the LDS system defined by f in the form (1) is distributionally chaotic with respect to a sequence, considering [a, b]^∞ endowed with the metrics ρi, i = 1, 2, respectively.

Theorem 15. Let f be a continuous self-map defined on a compact interval [a, b]. If f has positive topological entropy, then the LDS system defined by f in the form (1) has an uncountable distributionally scrambled set, composed of almost periodic points, with respect to a sequence, considering [a, b]^∞ endowed with the metrics ρi, i = 1, 2, respectively.
4.1. From LDS to classical DDS

Consider the set R^∞ of sequences of real numbers. In R^∞ we consider the following two non-equivalent metrics ρ1 and ρ2:

[…] (4)

and

[…] (5)

Note that (R^∞, ρi), i = 1, 2, is a complete metric space. We consider [a, b]^∞, the subset of R^∞ composed of sequences with terms in the compact interval [a, b], endowed with the restriction of ρi. Let […] and let f : [a, b] → [a, b] be a continuous self-map. Let x = […] be a solution of the LDS system (1) with initial condition […], where […] for all […]. Define […] for all […], and consider the self-map Ff defined on [a, b]^∞ in the form

[…] (6)

where […] and […].
Remark 16. From the previous construction, for a given self-map f defined on a compact interval [a, b], the LDS system (1) associated with f is equivalent to the classical dynamical system ([a, b]^∞, Ff), where Ff is defined in (6).

Let us recall the definition of distributional chaos with respect to a sequence in the setting of discrete dynamical systems. Let $\{p_k\}_{k \in \mathbb{N}}$ be an increasing sequence of positive integers, let x, y ∈ [a, b] and t > 0. Let

$$F_{xy}(t, \{p_k\}) = \liminf_{n \to \infty} \frac{1}{n}\, \#\bigl(\{k : |f^{p_k}(x) - f^{p_k}(y)| < t,\ 1 \le k \le n\}\bigr),$$

$$F^{*}_{xy}(t, \{p_k\}) = \limsup_{n \to \infty} \frac{1}{n}\, \#\bigl(\{k : |f^{p_k}(x) - f^{p_k}(y)| < t,\ 1 \le k \le n\}\bigr),$$

where #(A) denotes the cardinality of a set A. Using this notation, distributional chaos with respect to a sequence is defined as follows:

Definition 17. A pair of points (x, y) ∈ [a, b]^2 is called distributionally chaotic with respect to a sequence $\{p_k\}$ if $F_{xy}(s, \{p_k\}) = 0$ for some s > 0 and $F^{*}_{xy}(t, \{p_k\}) = 1$ for all t > 0. A set S containing at least two points is called distributionally scrambled with respect to $\{p_k\}$ if any pair of distinct points of S is distributionally chaotic with respect to $\{p_k\}$. A map f is distributionally chaotic with respect to $\{p_k\}$ if it has an uncountable set distributionally scrambled with respect to $\{p_k\}$.

Definition 18. A point x is called an almost periodic point of f if for any ε > 0 there exists N > 0 such that for any q ≥ 0 there exists r, q < r ≤ q + N, with $|f^{r}(x) - x| < \varepsilon$. By AP(f) we denote the set of all almost periodic points of f.

The following results from Oprocha [1] and Liao et al. [L] will play a key role in the proofs of Theorems 14 and 15.

Lemma 19. Let f be a continuous self-map on [a, b]. The map f is Li-Yorke chaotic iff there exists an increasing sequence $\{p_k\}$ such that f is distributionally chaotic with respect to $\{p_k\}$.

Lemma 20. Let f be a continuous self-map on [a, b]. If f has positive topological entropy, then there exists an increasing sequence $\{p_k\}$ such that f has an uncountable distributionally scrambled set T with respect to $\{p_k\}$. Moreover, the set T is composed of almost periodic points.
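The almost periodicity condition of Definition 18 can be probed numerically on a finite stretch of an orbit. The Python sketch below assumes the usual form of the condition, |f^r(x) − x| < ε, and reports the largest gap between consecutive return times to the ε-neighbourhood of x up to a finite horizon; this gap is only a lower bound for the constant N of the definition, since a finite computation cannot certify almost periodicity. The function name `max_return_gap` and the example map f(x) = 1 − x are hypothetical choices made for illustration, not taken from the text.

```python
from typing import Callable


def max_return_gap(
    f: Callable[[float], float], x: float, eps: float, horizon: int
) -> int:
    """Largest gap between consecutive times r in (0, horizon] with |f^r(x) - x| < eps.

    Time 0 counts as the starting point, and the stretch after the last
    return (up to `horizon`) is included, so the result is a lower bound
    for any N that works in the almost periodicity condition.
    Hypothetical helper, written for illustration only.
    """
    last_return = 0
    largest = 0
    z = x
    for r in range(1, horizon + 1):
        z = f(z)  # z is now f^r(x)
        if abs(z - x) < eps:
            largest = max(largest, r - last_return)
            last_return = r
    # Trailing stretch without a return inside the horizon.
    return max(largest, horizon - last_return)


def flip(v: float) -> float:
    """Assumed example map on [0, 1]: every point has period at most 2."""
    return 1.0 - v


if __name__ == "__main__":
    # For the flip map, the orbit of 0.3 returns every 2 steps.
    print(max_return_gap(flip, 0.3, 0.01, 50))
```

For a periodic point of period p the reported gap is p for every sufficiently small ε, which is the simplest instance of an almost periodic point.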
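For details on the definition of topological entropy see [W]. Note that the definition of distributional chaos in a sequence $\{p_k\}$ for a continuous self-map f defined on an interval [a, b] is equivalent to the existence of an uncountable subset S ⊂ [a, b] such that for any x, y ∈ S, x ≠ y,

• there exists δ > 0 such that

$$\liminf_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} \chi_{[0,\delta)}\bigl(|f^{p_k}(x) - f^{p_k}(y)|\bigr) = 0,$$

• for every t > 0,

$$\limsup_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} \chi_{[0,t)}\bigl(|f^{p_k}(x) - f^{p_k}(y)|\bigr) = 1,$$

where $\chi_A(x) = 1$ if x ∈ A and $\chi_A(x) = 0$ otherwise.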
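The two conditions above can be explored numerically. The following Python sketch computes the finite-n averages (1/n) Σ χ_[0,t)(|f^{p_k}(x) − f^{p_k}(y)|) along a given increasing sequence of times. It is only an illustration: the liminf and limsup in the definition cannot be read off from finitely many terms, and floating-point orbits of chaotic maps are not exact. The function name `running_fractions`, the logistic map, the initial points and the sequence p_k = k^2 are all assumptions chosen for the example and do not come from the text.

```python
from typing import Callable, List, Sequence


def running_fractions(
    f: Callable[[float], float],
    x: float,
    y: float,
    times: Sequence[int],
    t: float,
) -> List[float]:
    """Finite-n averages (1/n) * #{k <= n : |f^{p_k}(x) - f^{p_k}(y)| < t}.

    `times` plays the role of the increasing sequence p_1 < p_2 < ... of
    positive integers. Hypothetical helper, written for illustration only.
    """
    fractions: List[float] = []
    hits = 0
    current_time = 0
    for n, p in enumerate(times, start=1):
        if p <= current_time:
            raise ValueError("times must be strictly increasing positive integers")
        # Advance both orbits from time current_time to time p.
        for _ in range(p - current_time):
            x, y = f(x), f(y)
        current_time = p
        # Indicator chi_[0,t) evaluated at the distance between the two orbits.
        if abs(x - y) < t:
            hits += 1
        fractions.append(hits / n)
    return fractions


def logistic(v: float) -> float:
    """Assumed example map on [0, 1]; not taken from the text."""
    return 4.0 * v * (1.0 - v)


if __name__ == "__main__":
    p = [k * k for k in range(1, 201)]  # assumed sequence p_k = k^2
    for t in (0.05, 0.5):
        fr = running_fractions(logistic, 0.2, 0.2000001, p, t)
        tail = fr[len(fr) // 2:]
        # min/max over the tail are crude stand-ins for liminf/limsup.
        print(f"t={t}: min={min(tail):.3f}, max={max(tail):.3f}")
```

A pair (x, y) is a candidate for a distributionally chaotic pair along the chosen sequence when the averages come arbitrarily close to 0 for some small threshold and arbitrarily close to 1 for every threshold; the sketch only shows how those averages are formed.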
Proof of Theorem 14. Since the map f is Li-Yorke chaotic, by Lemma 19 there exists an increasing sequence $\{p_k\}$ such that f is distributionally chaotic with respect to $\{p_k\}$. Let S ⊂ [a, b] be the uncountable set distributionally scrambled with respect to $\{p_k\}$ for f. Let E ⊂ [a, b]^∞ be the uncountable set whose elements are the constant sequences equal to an element of S. Let […] and […] be two different elements of E. Then there exists δ > 0 such that

[…]

and for every t > 0,

[…]

In a similar way, for the distance ρ2 there exists δ* > 0 such that

[…]

and for every t > 0,

[…]

Thus, F is distributionally chaotic with respect to $\{p_k\}$, using in [a, b]^∞ the metrics ρ1 and ρ2 respectively, which ends the proof.
Proof of Theorem 15. Since f has positive topological entropy, by Lemma 20 there exists an increasing sequence $\{p_k\}$ such that f is distributionally chaotic with respect to $\{p_k\}$. Let S ⊂ [a, b] be the uncountable set, composed of almost periodic points, distributionally scrambled with respect to $\{p_k\}$ for f. Let E ⊂ [a, b]^∞ be the uncountable set whose elements are the constant sequences equal to an element of S. The proof of Theorem 14 shows that E is an uncountable distributionally scrambled set for F with respect to $\{p_k\}$. Now we shall see that E is composed of almost periodic points of F for the metrics ρ1 and ρ2, respectively. Indeed, let […] ∈ E, where x ∈ AP(f). Then, for any ε > 0 there exists N > 0 such that for any q ≥ 0 there exists r, q < r ≤ q + N, with $|f^{r}(x) - x| < \varepsilon$. In this setting,

[…]

and

[…]

proving that E ⊂ AP(F), which ends the proof.
Acknowledgement

This work has been partially supported by MCYT/FEDER grant number MTM2008-03679/MTM, Fundación Séneca de la Región de Murcia, grant number 08667/PI/08, and JCCM (Junta de Comunidades de Castilla-La Mancha), grant number PEII09-0220-0222.
References

[AKM] R.L. Adler, A.G. Konheim and M.H. McAndrew, Topological entropy, Trans. Amer. Math. Soc., 114, 309-319, 1965.
[BGLL] F. Balibrea, J.L. García Guirao, M. Lampart and J. Llibre, Dynamics of a Lotka-Volterra map, Fund. Math., 191(3), 265-279, 2006.
[Ba] J. Banks, J. Brooks, G. Cairns, G. Davis and P. Stacey, On Devaney's definition of chaos, Amer. Math. Monthly, 4, 332-334, 1992.
[BC] L.S. Block and W.A. Coppel, Dynamics in One Dimension, Springer Monographs in Mathematics, Springer-Verlag, 1992.
[BGKM] F. Blanchard, E. Glasner, S. Kolyada and A. Maass, On Li-Yorke pairs, Journal für die reine und angewandte Mathematik (Crelle's Journal), 547, 51-68, 2002.
[BCP] E. Bollt, N.J. Corron and S.D. Pethel, Symbolic dynamics of coupled map lattice, Phys. Rev. Lett., 96, 1-4, 2006.
[B] R. Bowen, Entropy for group endomorphisms and homogeneous spaces, Trans. Amer. Math. Soc., 153, 401-414, 1971.
[ChF] J.R. Chazottes and B. Fernández, Dynamics of Coupled Map Lattices and of Related Spatially Extended Systems, Lecture Notes in Physics, 671, 2005.
[ChL] G. Chen and S.T. Liu, On spatial periodic orbits and spatial chaos, Int. J. of Bifur. Chaos, 13, 935-941, 2003.
[Da] R.A. Dana and L. Montrucchio, Dynamical Complexity in Duopoly Games, J. Econom. Theory, 40, 40-56, 1986.
[D] R.L. Devaney, An Introduction to Chaotic Dynamical Systems, Benjamin/Cummings, Menlo Park, CA, 1986.
[DK] N. Degirmenci and S. Kocak, Chaos in product maps, Turkish J. Math., to appear.
[Di] E.I. Dinaburg, A connection between various entropy characterizations of dynamical systems, Izv. Akad. Nauk SSSR Ser. Mat., 35, 324-366, 1971.
[F] H. Furstenberg, Recurrence in Ergodic Theory and Combinatorial Number Theory, Princeton University Press, Princeton, New Jersey, 1981.
[GL1] J.L. García Guirao and M. Lampart, Positive entropy of a coupled lattice system related with Belusov-Zhabotinskii reaction, Journal of Math. Chem., DOI: 10.1007/s10910-009-9624-3.
[GL2] J.L. García Guirao and M. Lampart, Chaos of a coupled lattice system related with the Belusov-Zhabotinskii reaction, Journal of Math. Chem., DOI: 10.1007/s10910-009-9647-9.
[G] J.L. García Guirao, Distributional chaos of generalized Belusov-Zhabotinskii's reaction models, MATCH Commun. Math. Comput. Chem., 64(2), 335-344, 2010.
[HHM] J.L. Hudson, M. Hart and D. Marinko, An experimental study of multiple peak periodic and nonperiodic oscillations in the Belusov-Zhabotinskii reaction, J. Chem. Phys., 71, 1601-1606, 1979.
[HOY] K. Hirakawa, Y. Oono and H. Yamakazi, Experimental study on chemical turbulence. II, Jour. Phys. Soc. Jap., 46, 721-728, 1979.
[HGS] J.L. Hudson, K.R. Graziani and R.A. Schmitz, Experimental evidence of chaotic states in the Belusov-Zhabotinskii reaction, J. Chem. Phys., 67, 3040-3044, 1977.
[K] K. Kaneko, Globally Coupled Chaos Violates Law of Large Numbers, Phys. Rev. Lett., 65, 1391-1394, 1990.
[KO] M. Kohmoto and Oono, Discrete model of Chemical Turbulence, Phys. Rev. Lett., 55, 2927-2931, 1985.
[KW] K. Kaneko and H.F. Willeboordse, Bifurcations and spatial chaos in an open flow model, Phys. Rev. Lett., 73, 533-536, 1994.
[LY] T.Y. Li and J.A. Yorke, Period three implies chaos, Amer. Math. Monthly, 82(10), 985-992, 1975.
[L] G.F. Liao and G.F. Huang, Almost periodicity and SS scrambled set, Chinese Annals of Mathematics, 23, 685-692, 2002.
[1] P. Oprocha, A note on distributional chaos with respect to a sequence, Nonlinear Analysis, 71, 5835-5839, 2009.
[Po] B. Van der Pol, Forced oscillations in a circuit with nonlinear resistance, London, Edinburgh and Dublin Phil. Mag., 3, 109-123, 1927.
[Pu] T. Puu, Chaos in Duopoly Pricing, Chaos, Solitons and Fractals, 1, 573-581, 1991.
[W] P. Walters, An Introduction to Ergodic Theory, Springer, New York, 1982.