![](https://assets.isu.pub/document-structure/220323161403-4e1756c4ec9d7cbf49a5bb218c3c492a/v1/6a43b54518db88bbb75907175ab693e8.jpeg?width=720&quality=85%2C50)
Innovate UK-funded Project Results in Next-generation AI Drug Discovery Technology
from IPI Spring 2021
Introduction

A promise of artificial intelligence (AI) is that it will extract more value from the complex and expensive data generated in drug discovery, by identifying hidden patterns and valuable conclusions that will guide better decisions when designing and selecting high-quality clinical candidates. Achieving this requires several factors to align:
• Algorithms – The latest deep learning algorithms have transformed many fields with their ability to capture complex data relationships. However, many of these methods fail in the context of drug discovery, due to the specific challenges of its data: what we consider 'big data' in our field is orders of magnitude smaller than in many areas where these algorithms have seen notable success, and the available data carry significant uncertainties due to the variability inherent in biological experiments. Robust and proven methods are required that make a genuine difference in drug discovery.
• Data – High-quality and up-to-date data are required as input, to distinguish genuine signals from the noise and provide timely insights as drug discovery projects progress.
• Intuitive and interactive access – The results need to be readily accessible to the key decision-makers in a project, who may not be computational experts, and presented in a way that addresses their questions and drives better-informed decisions.
This article describes a successful public-private partnership, the DeepADMET project supported by Innovate UK, that addressed these challenges, delivering a 'next-generation' software platform. The project consortium combined the skills of Intellegens Limited, a Cambridge University spin-out developing unique deep learning imputation algorithms; Medicines Discovery Catapult, which brought cutting-edge AI approaches to the curation of high-quality data; and Optibrium Limited, a software company with a proven track record of innovation and delivery of elegant software that guides successful drug discovery. We will provide an overview of the underlying methods, proof-of-concept and implementation of this platform, and illustrate the unique ways it addresses critical challenges faced by drug discovery projects.

The Alchemite™ method accepts compound descriptors (orange squares in a complete matrix) and sparse assay data (green squares in a sparse matrix), and imputes the missing values (purple squares).

The Challenges of Drug Discovery Data

Drug discovery data present challenges for the application of standard machine learning tools: in particular, the data are sparse, as not every compound is measured in every assay, and no assays are run for every compound in a pharmaceutical company's collection. Drug discovery data are also noisy, as biological variability leads to (sometimes radically) different results when repeating the same experiment. These challenges are not unique to drug discovery and are found in many experimental sciences, but they are particularly acute in our field, where it is not unusual for over 99% of possible compound/assay measurements to be unavailable. To realise the full potential of machine learning for drug discovery, an approach designed for this complexity of data is needed.
Intellegens' machine learning algorithm, Alchemite™, was originally developed for and successfully applied to materials science, optimising superalloy compositions for jet engine turbine blades. That field also suffers from sparse, noisy datasets, with the expense of generating and testing samples limiting the amount of data available. The project partners identified the similarities between the data from these different experimental domains and proposed adapting Alchemite™ for use in drug discovery.
This unique algorithm follows an imputation approach: it 'fills in the gaps' in sparse experimental data, using the limited experimental data that are available to inform the imputation of the values that are missing. This has high value in large drug discovery datasets, multiplying many-fold the amount of data available on which to base project decisions. The model may also be applied to predict the performance of 'virtual', unsynthesised compounds, saving experimental effort by triaging results to identify those with a good chance of success against a broad range of endpoints and prioritising those compounds for synthesis.
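Alchemite™ itself is proprietary, so the sketch below uses scikit-learn's IterativeImputer on a small synthetic descriptor/assay matrix purely to illustrate the 'filling in the gaps' idea; it is not the Alchemite algorithm, and the data are randomly generated.

```python
# Conceptual illustration of multi-endpoint imputation on a sparse assay
# matrix, using scikit-learn's IterativeImputer as a stand-in for the
# proprietary Alchemite(TM) method. Rows are compounds; the descriptor
# columns are complete, while ~70% of the assay values are missing.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(100, 5))          # complete descriptor block
assays = descriptors @ rng.normal(size=(5, 3))   # three correlated endpoints
assays[rng.random(assays.shape) < 0.7] = np.nan  # sparse, as in real projects

data = np.hstack([descriptors, assays])
imputed = IterativeImputer(max_iter=10, random_state=0).fit_transform(data)
print("missing before:", int(np.isnan(data).sum()),
      "after:", int(np.isnan(imputed).sum()))
```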
The ability to leverage existing sparse experimental data in its models results in more accurate models for complex, heterogeneous endpoints, including absorption, distribution, metabolism, excretion, and toxicity (ADMET) endpoints, than is possible with conventional quantitative structure-activity relationship (QSAR) models that are based purely on compounds' structures. This increased accuracy enables greater reliance on the model's predictions for drug discovery project decisions, further supported by the inclusion of reliable uncertainty quantification for each prediction. These uncertainty estimates enable a focus on the most confident predictions from the model, by setting a tolerance for the uncertainty to ensure that decisions are based only on the best information available.
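A minimal sketch of applying such a tolerance, with hypothetical predictions and uncertainty estimates standing in for a model's output:

```python
# A minimal sketch of confidence filtering: keep only imputed values whose
# predicted uncertainty falls below a user-chosen tolerance. The values and
# the 0.5 cut-off are hypothetical stand-ins for a model's output.
import numpy as np

pred_mean = np.array([6.2, 5.1, 7.8, 4.9])  # imputed endpoint values (e.g. pIC50)
pred_std = np.array([0.2, 1.5, 0.3, 0.9])   # per-prediction uncertainty estimates
tolerance = 0.5                              # project-specific uncertainty cut-off

confident = pred_std < tolerance
print(pred_mean[confident])  # -> [6.2 7.8]: only the most reliable predictions
```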
Accurate predictions and reliable uncertainty quantification also enable the construction of robust experimental designs, by proposing the next experiments to carry out that have the highest likelihood of improving the underlying model and future predictions. This is achieved using a probabilistic approach to identify which experiments are most likely to improve the model's understanding of particular endpoints or regions of chemical space.

The Generation of Large-scale Data

There have been several historical approaches to generating large-scale data for use in machine learning approaches to predict ADMET properties of novel potential drugs. Foremost among these efforts are the ChEMBL database of literature-abstracted medicinal chemistry data (https://www.ebi.ac.uk/chembl) and the deposited experimental screening data in PubChem BioAssay (https://pubchem.ncbi.nlm.nih.gov). However, due to the need for novel data to predict novel target or property endpoints, to broaden the domain of applicability or to improve model accuracy, there is an ongoing need to develop approaches to identify, index and curate novel SAR data. Ideally, this data supply would be as automated as possible and provide 'data on-demand'.
The current under-representation and under-use of supplementary data from journal articles focused initial studies on processing these data for relevant ADMET-related endpoints, leveraging existing specialist lexicons and data standards where possible. Data prioritisation was directed towards a manually selected set of high-interest ADMET target genes, for example drug transporters, where advances in molecular science, screening, and the industry's experience in drug optimisation have highlighted the need for more predictive data. Specific heuristics were used to identify the required sections in supplementary data, typically the presence of a relevant gene name (e.g. OCT1), a chemical structure (e.g. tamsulosin), numerical data (e.g. 13.14), and an assay endpoint description and unit (e.g. IC50, micromolar). Seventy million diverse supplementary data documents were gathered from public sources, requiring ca. 35 TB of space for storage and processing. On top of these data, a software pipeline was built, enabling parallelisation, a component-based architecture for best-in-class named entity recognition (NER) tools, and enhancement of dictionaries of compounds, genes, assay endpoints, etc. Further work was performed on image identification and segmentation and, for image data, the application of optical character recognition (OCR) technologies.
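As an illustration of the kind of section-level heuristic described above, the sketch below checks a text span for a gene name, numerical data and an endpoint/unit pattern. The tiny dictionary and regular expressions are illustrative stand-ins for the specialist lexicons used in the actual pipeline.

```python
# A simplified sketch of the section-level heuristics: flag text containing
# a gene name of interest, numerical data, and an assay endpoint with a
# unit. The GENES set and patterns here are illustrative only.
import re

GENES = {"OCT1", "OCT2", "MATE1"}
ENDPOINT_UNIT = re.compile(r"\b(IC50|EC50|Ki)\b.{0,40}?\b(nM|uM|micromolar)\b", re.I)
NUMBER = re.compile(r"\b\d+(\.\d+)?\b")

def is_candidate_section(text: str) -> bool:
    """True if the text matches all three heuristic signals."""
    has_gene = any(g in text for g in GENES)
    return has_gene and bool(NUMBER.search(text)) and bool(ENDPOINT_UNIT.search(text))

print(is_candidate_section("Tamsulosin inhibited OCT1 with an IC50 of 13.14 micromolar"))
```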
Optimisation of the software was required to minimise the false-positive rate in prioritised documents, and to manage the complexities of handling Unicode characters across different stages of the process and of sometimes-corrupt source supplementary data. Expert review was performed to identify the causes of false positives and negatives, and the results were fed back into earlier stages to identify the most relevant subset of documents. Once sufficient positive examples had been identified, a document-relevance classifier was developed to automatically assign the probability that a new document contains information relevant to the subject area. This used features from the document itself and boosted performance using inter-publication author and citation linkages.
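The real classifier used richer document features plus the author and citation linkages described above; in the hedged sketch below, a simple TF-IDF and logistic-regression pipeline stands in to show the overall shape of the approach (documents in, relevance probability out). The toy documents and labels are invented.

```python
# A stand-in document-relevance classifier: TF-IDF features feeding a
# logistic regression, returning the probability that a document contains
# relevant ADMET data. Training examples here are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["OCT1 transporter IC50 assay results",
        "quarterly financial report of the company",
        "tamsulosin uptake inhibition in HEK cells",
        "conference travel and accommodation details"]
labels = [1, 0, 1, 0]  # 1 = relevant ADMET data, 0 = irrelevant

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(docs, labels)
print(clf.predict_proba(["MATE1 inhibition Ki measurements"])[0, 1])
```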
These processing steps led to a focused set of documents for processing into a set of consistently annotated bioassay data, similar in principle to the bioassay data stored in the ChEMBL database. Significant technical challenges remain in achieving completely automated discovery and extraction of novel data, although this work provided significant steps towards that goal, and research is ongoing to further optimise the delivery of 'data on-demand'.

Proven Success

A vital element of DeepADMET was demonstrating the practical application of deep learning imputation to drug discovery data and projects. After an initial proof-of-concept using public domain data published by Whitehead et al.1, collaborations with pharmaceutical and biotech companies and academic groups enabled applications in real-world scenarios and across ongoing projects. These resulted in peer-reviewed publications and public presentations of the results.
A collaboration with Constellation Pharmaceuticals demonstrated an application to heterogeneous biological endpoints, including activities against biological targets and phenotypic screens, and a range of absorption, distribution, metabolism and excretion (ADME) assays. A critical aspect of this project was that it validated the application of Intellegens' algorithm to small-scale data sets, typical of those encountered in individual drug discovery projects; in this case, the full data set comprised approximately 2,500 compounds. The success in this scenario contrasts with most deep learning methods, which typically require very large data sets to achieve an improvement over conventional machine learning methods. Full details of this project were published by Irwin et al.2 and described in a presentation that can be viewed at https://bit.ly/practical_deep_learning.
A collaboration with Takeda Pharmaceutical Company Limited involved an application to a data set of approximately 700,000 compounds and 2,500 experimental endpoints. This demonstrated the scalability of the approach to global pharma-scale data sets and explored applications to predicting target activities in projects, a diverse range of ADME and toxicity (ADMET) properties, and high-throughput screening data. The presentation detailing the results is available to watch at https://bit.ly/large_scale_imputation.
Translation of in vitro ADME data into in vivo disposition was the subject of a collaboration with AstraZeneca (described in a presentation available to watch at http://bit.ly/PK_prediction). In this application, sparse in vitro ADME data and predictions from in silico models were used to impute pharmacokinetic parameters and concentration-time curves, achieving industry-leading results.
Finally, an application to guiding the design of new compounds, by making predictions for 'virtual' compounds, was explored in collaboration with Open Source Malaria (OSM). A model of multiple antimalarial experimental endpoints was built and applied to the prioritisation of new compound designs for synthesis and testing. OSM ran this as a competition and, of the compounds proposed by the participants and synthesised and tested in vitro by the OSM team, only the compound selected by our model was confirmed to be active against the malaria parasite Plasmodium falciparum with an IC50 better than 1 µM, as illustrated in Figure 1. Full details of this project were published by Tse et al.3 and Irwin et al.4 and can be viewed in a presentation at https://bit.ly/AI_guided_design.
![](https://assets.isu.pub/document-structure/220323161403-4e1756c4ec9d7cbf49a5bb218c3c492a/v1/1c852d9a33ecf31ca880c9b711bc082a.jpeg?width=720&quality=85%2C50)
![](https://assets.isu.pub/document-structure/220323161403-4e1756c4ec9d7cbf49a5bb218c3c492a/v1/3b179627e2d9e683869fdad288deb865.jpeg?width=720&quality=85%2C50)
![](https://assets.isu.pub/document-structure/220323161403-4e1756c4ec9d7cbf49a5bb218c3c492a/v1/8d0bd3f215e0527b60dc0d87fa7eba6c.jpeg?width=720&quality=85%2C50)
![](https://assets.isu.pub/document-structure/220323161403-4e1756c4ec9d7cbf49a5bb218c3c492a/v1/d0031dd7d524f71872fef8843a4f2d76.jpeg?width=720&quality=85%2C50)
![](https://assets.isu.pub/document-structure/220323161403-4e1756c4ec9d7cbf49a5bb218c3c492a/v1/2853ad024220ce04df2d5a176a76f62e.jpeg?width=720&quality=85%2C50)
![](https://assets.isu.pub/document-structure/220323161403-4e1756c4ec9d7cbf49a5bb218c3c492a/v1/1fd304f13714a3166cbd8ff2e41a805e.jpeg?width=720&quality=85%2C50)
Figure 1. The compounds submitted by four organisations to the Open Source Malaria project. For each, the structure and experimentally measured activity against Plasmodium falciparum are shown. Data from Tse et al.3.
Delivering a Secure and Scalable Platform

The final element of the DeepADMET project was developing a platform that would enable the deployment of deep learning imputation for interactive application on an ongoing basis. This development addressed several challenging, and sometimes conflicting, goals:
• The platform must be scalable from small data sets, for organisations pursuing only a small number of projects, to global pharma-scale data sets containing millions of compounds and tens of thousands of experimental endpoints.
• The models should be updated regularly, to reflect the latest experimental data and enable 'active learning'.
• Data within the system must be secure. In particular, compound structures and details of the assays are critical intellectual property (IP) for drug discovery organisations.
• The results must be intuitive and accessible interactively, enabling scientists to quickly answer relevant questions for their drug discovery projects and prioritise experimental and synthetic resources.
The requirements for scalability, regular model building and interactivity motivate a cloud deployment, providing access to large-scale computational resources in a cost-effective way to handle large data sets. However, there is an understandable reluctance to transfer information on the most sensitive IP to the cloud. The solution to this conflict was a unique hybrid on-premises/cloud architecture, as illustrated in Figure 2. In this architecture, the most sensitive information is processed only in the 'blue zone', hosted on the customer's private network. The platform connects directly to in-house databases, via a 'Query Interface', to refresh the data regularly, e.g. nightly. The raw data are pre-processed in the blue zone to clean and prepare them for modelling, remove compound structures, and anonymise compound identifiers and assay information. The resulting anonymised data are encrypted before transfer to the 'green zone', where the models are built and a matrix of results is stored and searched. These results are true 'big data' for larger data sets, because they can comprise tens of billions of data points. Rapid searching and retrieval from this 'massive matrix', and building of the underlying models, require the scalability in the green zone provided by a cloud-hosted environment.
![](https://assets.isu.pub/document-structure/220323161403-4e1756c4ec9d7cbf49a5bb218c3c492a/v1/5d04e4024e786498f4bcbf3f8094c48f.jpeg?width=720&quality=85%2C50)
Figure 2. Schematic of the platform architecture. The 'blue zone' is hosted on-premises and manages sensitive information, such as compound structures and assay identifiers. The 'green zone' is hosted in a virtual private cloud, providing scalability for modelling, storage and searching of the 'massive matrix' of experimental and imputed data, but has no access to the most sensitive information.
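The sketch below illustrates the blue-zone pre-processing step under stated assumptions: the compound structure is dropped, the identifier is replaced with a salted hash, and the payload is encrypted before leaving the private network. The field names, hashing scheme and Fernet encryption are illustrative choices, not the platform's actual implementation.

```python
# A sketch of blue-zone pre-processing: strip the structure, anonymise the
# compound identifier with a salted hash, and encrypt the payload before it
# leaves the private network. All names and schemes here are illustrative.
import hashlib
import json
from cryptography.fernet import Fernet

def anonymise(record: dict, salt: bytes) -> dict:
    """Strip the structure and replace the compound ID with a salted hash."""
    token = hashlib.sha256(salt + record["compound_id"].encode()).hexdigest()[:16]
    return {"compound": token, "assay_values": record["assay_values"]}

key = Fernet.generate_key()        # encryption key held in the blue zone only
salt = b"project-specific-salt"    # keeps identifier hashes non-guessable
record = {"compound_id": "CPD-000123", "smiles": "CCO",
          "assay_values": [6.2, None, 4.9]}

payload = json.dumps(anonymise(record, salt)).encode()
ciphertext = Fernet(key).encrypt(payload)  # safe to send to the green zone
```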
![](https://assets.isu.pub/document-structure/220323161403-4e1756c4ec9d7cbf49a5bb218c3c492a/v1/2c259a69d91d8aabfa63254d3752679a.jpeg?width=720&quality=85%2C50)
![](https://assets.isu.pub/document-structure/220323161403-4e1756c4ec9d7cbf49a5bb218c3c492a/v1/a8353c742f5c1bd5d937c335d3cd0e25.jpeg?width=720&quality=85%2C50)
![](https://assets.isu.pub/document-structure/220323161403-4e1756c4ec9d7cbf49a5bb218c3c492a/v1/8f12958d13910ee8fc52888785b880cc.jpeg?width=720&quality=85%2C50)
Figure 3. Example workflows with which to use the results of an imputation model. (a) Querying a database for compounds with desired criteria can also return compounds with values that are imputed to meet the criteria (blue cells), as well as those that have already been experimentally measured (white cells). (b) An outlier (purple cell) can be investigated to compare the measured value (red line) with the probability distribution for the corresponding imputed value. (c) Selecting a target assay (dark blue column) for prediction, together with additional assays that can be performed, suggests the most valuable assays and compounds to measure with which to make better predictions for the best compounds for the target assay.
Users reside in the blue zone, on the organisation's private network, so results returned from the green zone to the blue zone are decrypted and matched back to the compound structures and assay information to which the data relate.
These results can be used proactively to highlight new opportunities to find high-quality drug candidates, by 'filling in' missing data based on limited experimental results for existing compounds. For example, when a scientist performs a query to find compounds that meet their requirements, the massive matrix of imputed results can be searched in real time, in addition to the experimental data in the company database, to find additional compounds based on confident predictions, as illustrated in Figure 3(a).
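A minimal sketch of this query workflow, with hypothetical records and thresholds:

```python
# A sketch of the query workflow in Figure 3(a): return compounds that meet
# a potency criterion either from measured data or from confidently imputed
# values. The records, threshold and tolerance are hypothetical.
rows = [
    {"compound": "A", "measured": 7.1, "imputed": None, "std": None},
    {"compound": "B", "measured": None, "imputed": 6.8, "std": 0.3},
    {"compound": "C", "measured": None, "imputed": 7.4, "std": 1.2},
]

def meets_criterion(row, threshold=6.5, tolerance=0.5):
    """Experimental data take precedence; imputed values must be confident."""
    if row["measured"] is not None:
        return row["measured"] >= threshold
    return (row["std"] is not None and row["std"] < tolerance
            and row["imputed"] >= threshold)

print([r["compound"] for r in rows if meets_criterion(r)])
# -> ['A', 'B']; C is excluded because its imputation is too uncertain
```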
Where experimental measurements have been made, these can be compared with the probability distributions generated by the model to highlight unlikely results. Due to the variability of biological experiments, experimental errors arise that can mislead optimisation projects or produce false negatives that result in missed opportunities. Automatically flagging unlikely results identifies valuable candidates for retesting, as illustrated in Figure 3(b).
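A minimal sketch of such an outlier check, assuming a Gaussian predictive distribution and a 3-sigma cut-off (both assumptions for illustration, not the platform's actual criteria):

```python
# A sketch of the outlier check in Figure 3(b): compare a measured value
# with the model's predicted distribution and flag improbable results for
# retest. The Gaussian assumption and 3-sigma cut-off are illustrative.
def flag_unlikely(measured: float, pred_mean: float, pred_std: float,
                  z_cutoff: float = 3.0) -> bool:
    """True if the measurement lies far out in the predicted distribution."""
    z = abs(measured - pred_mean) / pred_std
    return z > z_cutoff

print(flag_unlikely(measured=4.1, pred_mean=7.0, pred_std=0.4))  # True: retest
```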
The models can also drive 'active learning', suggesting the most valuable data points which, if measured, will result in the greatest improvements in predictive accuracy, and hence better selection of compounds for expensive downstream experiments. Also known as 'design of experiments', this powerful approach is illustrated in Figure 3(c): a scientist selects one or more 'target' assays, along with lower-cost or higher-throughput assays that can be run as input to predict the target assays. The output highlights the most valuable experiments and compounds to measure that will most accurately identify the best compounds for the target assay. This approach reduces the experimental effort required to progress compounds and make the best decisions quickly.
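The sketch below shows one simple way such a selection could work, using pure uncertainty sampling: proposing the unmeasured compound/assay cells with the largest predicted uncertainty. This heuristic is an assumption for illustration; the platform's probabilistic approach is more sophisticated.

```python
# A sketch of uncertainty-based experiment selection: among unmeasured
# (compound, assay) cells, propose those with the largest predicted
# uncertainty as candidates for the next experiments. This pure
# uncertainty-sampling heuristic is an illustrative assumption.
import numpy as np

pred_std = np.array([[0.2, 1.4, 0.9],   # rows: compounds, columns: assays
                     [1.1, 0.3, 1.6],
                     [0.5, 0.8, 0.4]])
measured = np.array([[True, False, False],
                     [False, True, False],
                     [False, False, True]])

rows, cols = np.where(~measured)           # unmeasured cells only
order = np.argsort(-pred_std[rows, cols])  # most uncertain first
for i in order[:3]:
    print(f"measure compound {rows[i]} in assay {cols[i]}")
```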
The workflows illustrated in Figure 3 demonstrate how the results from sophisticated AI algorithms can be used intuitively and interactively by scientists who are not expert data scientists.

Conclusion

The DeepADMET project is an example of a successful collaboration, bringing together publicly-funded and SME organisations. It illustrates how cutting-edge science, software engineering and drug discovery experience can come together to deliver state-of-the-art technology that will have a major impact on the discovery of new medicines.

References
1. T. Whitehead, B. Irwin, P. Hunt, M. Segall and G. Conduit, "Imputation of Assay Bioactivity Data Using Deep Learning," J. Chem. Inf. Model., vol. 59, no. 3, pp. 1197–1204, 2019.
2. B. Irwin, J. Levell, T. Whitehead, M. Segall and G. Conduit, "Practical Applications of Deep Learning to Impute Heterogeneous Drug Discovery Data," J. Chem. Inf. Model., vol. 60, no. 6, pp. 2848–2857, 2020.
3. E. Tse, L. Aithani, M. Anderson, J. Cardoso-Silva, G. Cincilla, G. J. Conduit et al., "An Open Drug Discovery Competition: Experimental Validation of Predictive Models in a Series of Novel Antimalarials," (preprint), 2020.
4. B. Irwin, S. Mahmoud, T. Whitehead, G. Conduit and M. Segall, "Imputation versus prediction: applications in machine learning for drug discovery," Fut. Drug Discov., vol. 2, no. 2, 2020.
Tom Whitehead
Tom is Head of Machine Learning at Intellegens, a machine learning spin-out from the University of Cambridge that specialises in handling sparse and noisy experimental data. He has a PhD in theoretical physics from the University of Cambridge and is now leading the application of Intellegens' novel deep learning approaches to a wide variety of industrial applications. He is also developing a series of application-specific machine learning modules to address high-value data analysis bottlenecks and is interested in developing machine learning approaches to solve previously intractable problems in a range of scientific and engineering fields.

John Overington
Professor John Overington studied Chemistry and then completed a PhD in protein modelling and sequence-structure relationships. He then joined Pfizer, leading the Molecular Informatics Structure and Design department. This was followed by Inpharmatica, where he led the development of a series of computational and data platforms to improve drug discovery. In 2008, John was central to the transfer of this technology to the EMBL-EBI, as the ChEMBL database. He then joined the artificial intelligence technology company Stratified Medical (later renamed BenevolentAI), applying machine learning to the development of biomedical data extraction and integration strategies. In 2017, John joined the Medicines Discovery Catapult as CIO, where he leads the development and application of informatics approaches to promote and support drug discovery.
Matt Segall
Matt has a Master of Science in computation from the University of Oxford and a PhD in theoretical physics from the University of Cambridge. As Associate Director at Camitro (UK), ArQule Inc. and then Inpharmatica, he led a team developing predictive ADME models and state-of-the-art intuitive decision and visualisation tools for drug discovery, and was responsible for Inpharmatica's ADME business, including experimental ADME services and the StarDrop platform. Following the acquisition of Inpharmatica, Matt led a management buyout of the StarDrop business to found Optibrium, which continues to develop research technologies and ground-breaking artificial intelligence services to improve the efficiency and productivity of drug discovery.