Drug Discovery, Development & Delivery
Innovate UK Funded Project Results in Next-generation AI Drug Discovery Technology

Introduction

A promise of artificial intelligence (AI) is that it will extract more value from the complex and expensive data generated in drug discovery, by identifying hidden patterns and drawing conclusions that guide better decisions when designing and selecting high-quality clinical candidates. Achieving this requires several factors to align:

• Algorithms – The latest deep learning algorithms have transformed many fields with their ability to capture complex data relationships. However, many of these methods fail in the context of drug discovery, due to the specific challenges of drug discovery data: what we consider 'big data' in our field is orders of magnitude smaller than in many areas of notable success for these algorithms, and the available data have significant uncertainties due to the variability inherent in biological experiments. Robust and proven methods are required that make a genuine difference in drug discovery.
• Data – High-quality and up-to-date data are required as input, to distinguish genuine signals from the noise and provide timely insights as drug discovery projects progress.
• Intuitive and interactive access – The results need to be readily accessible to the key decision-makers in a project, who may not be computational experts, and presented in a way that will address their questions and drive better-informed decisions.
This article describes a successful public-private partnership, the DeepADMET project supported by Innovate UK, that addressed these challenges, delivering a 'next-generation' software platform. The project consortium combined the skills of Intellegens Limited, a Cambridge University spin-out developing unique deep learning imputation algorithms; Medicines Discovery Catapult, which brought cutting-edge AI approaches to curation of high-quality data; and Optibrium Limited, a software company with a proven track record of innovation and delivery of elegant software that guides successful drug discovery. We will provide an overview of the underlying methods, proof-of-concept and implementation of this platform, and illustrate the unique ways it addresses critical challenges faced by drug discovery projects.

36 INTERNATIONAL PHARMACEUTICAL INDUSTRY

Figure: The Alchemite method accepts compound descriptors (orange squares in a complete matrix) and sparse assay data (green squares in a sparse matrix), and imputes the missing values (purple squares).

The Challenges of Drug Discovery Data

Drug discovery data present challenges for the application of standard machine learning tools. In particular, the data are sparse: not every compound is measured in every assay, and no assay is run for every compound in a pharmaceutical company’s collection. Drug discovery data are also noisy, as biological variability leads to different, sometimes radically different, results when the same experiment is repeated. These challenges are not unique to drug discovery and are found in many experimental sciences, but they are particularly acute in drug discovery, where it is not unusual for over 99% of possible compound/assay measurement results to be unavailable. To realise the full potential of machine learning for drug discovery, it is necessary to use an approach designed for data of this complexity.

Intellegens’ machine learning algorithm, Alchemite™, was originally developed for and successfully applied to materials science, optimising superalloy compositions for jet engine turbine blades. That field also suffers from sparse, noisy datasets, with the
expense of generating and testing samples limiting the amount of data available. The project partners identified the similarities between the data from these different experimental domains and suggested the adaptation of Alchemite™ for use in drug discovery.

This unique algorithm follows an imputation approach: it ‘fills in the gaps’ in sparse experimental data, using what little experimental data are available to help impute the points that are missing. This is highly valuable in large drug discovery datasets, multiplying many-fold the amount of data on which project decisions can be based. The model may also be applied to predict the performance of ‘virtual’, unsynthesised compounds, saving experimental effort by identifying those with a good chance of success against a broad range of endpoints and prioritising them for synthesis.

Spring 2021 Volume 13 Issue 1

Because the model can draw on existing sparse experimental data, it is more accurate for complex, heterogeneous endpoints, including absorption, distribution, metabolism, excretion, and toxicity (ADMET) endpoints, than is possible with conventional quantitative structure-activity relationship (QSAR) models, which are based purely on compounds’ structures. This increased accuracy allows project decisions to rely more heavily on the model’s predictions, and that reliance is further supported by the inclusion of reliable uncertainty quantification for each prediction. These uncertainty estimates enable a focus on the most confident