Predicting Oral Drug Elimination Half-life in Humans Using Regression Models

Michael Song
Senior Thesis | 2024
Abstract: In drug development, the elimination half-life, or the time it takes for half of a drug to be excreted from the body, is a critical factor in determining the effectiveness of a drug. However, determining half-life with clinical tests is often time-consuming and expensive. This project studied the use of regression models with different molecular fingerprints to predict elimination half-life. The data consisted of 1645 drugs, each with a half-life and a molecular structure represented by a SMILES string. Because raw SMILES strings are difficult to use with supervised learning models, Extended-Connectivity Fingerprints (ECFPs) and a molecular fingerprint designed for predicting logP, logS, and logBB were used as representations of molecular structure. The models used were Linear Least Squares, Random Forest, SVR, SGDRegressor, and XGBoost. Benchmarks show that SVR and Random Forest models trained with ECFPs produced the best predictions.

1. Introduction:

In the field of drug development, elimination half-life, or the time it takes for half of a drug to be broken down and excreted from the body, is an important characteristic. After four to five half-lives, around 94 to 97% of a drug will be excreted and is considered eliminated, so knowing half-life is especially significant when examining the longevity of a drug (Hallare and Gerriets 2023). Additionally, excessively long half-lives may result in drugs that have a delayed effect, as the body may not break down these drugs quickly enough. More importantly, drugs that stay in the body for too long may have toxic effects and not achieve their desired purpose, while excessively short half-lives result in drugs that are excreted too quickly, necessitating frequent doses to achieve a consistent, prolonged effect. Finding the specific half-life of a drug is especially important when establishing a dosing schedule for patients, as a drug may have varied half-lives in different patients, causing different effects.
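These figures follow from exponential decay: after n half-lives, the fraction of a drug remaining is (1/2)^n, so the fraction eliminated is 1 - (1/2)^n. After four half-lives this gives 1 - (1/2)^4 = 0.9375, or about 94%; after five, 1 - (1/2)^5 ≈ 0.969, or about 97%.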

Ultimately, understanding a drug's half-life and excretion rate has significant clinical applications.

Despite the significance of elimination half-life in drug development, determining half-lives is difficult due to the differences between patients and potential interactions with other drugs. Factors such as aging play a large role in the variance of half-lives, as the effectiveness of the human body's cardiovascular and renal systems declines with age (Mangoni and Jackson 2004). These decreases in blood flow, nutrient absorption, and general body composition cause many drugs, especially nonpolar lipid-soluble compounds, to have a longer half-life in older patients. For example, the half-life of the lipid-soluble anesthetic thiopentone is roughly two times greater in 60-80 year-olds than in 20-40 year-olds (Christensen et al. 1981).

Elimination half-life is a characteristic that can only be experimentally tested in later stages of drug development, and it has been shown to be a critical factor in the failure of many drugs (Van de Waterbeemd and Gifford 2003). Due to the costly and often time-consuming nature of clinical trials, much research has been done on predicting elimination half-life and human pharmacokinetics in earlier stages of development, such as through the preclinical testing of other species like rats or dogs (Caldwell et al. 2004). However, limitations still exist with these predictions, as human pharmacokinetics may differ significantly from rat or dog pharmacokinetics. Additionally, trialing costs are still high for nonhuman subjects. Therefore, computational analysis has become an increasingly popular method for predicting elimination half-life in earlier stages of drug development.

In the past 15 years, extensive progress has been made in incorporating machine learning into molecular biology and drug development. Much of this work centers on biological processes that generate large amounts of data, such as whole genomes, or structures with many factors that require large amounts of processing, such as protein folding (Jumper et al. 2021). While applying machine learning to small-molecule drug data does not require as much computing power as projects such as gene sequence data analysis, the often limited amounts of data, combined with the various ways of representing molecular structure, can cause other issues.

Elimination half-life is not the only variable of interest when the bioavailability of a drug is being considered, and much research has been done on predicting other molecular properties, such as solubility (Lee et al. 2022). To predict these properties, the molecular structure of the drug must be supplied in a form that allows machine learning models to recognize patterns and substructures. However, the way molecular structure is presented can be a challenge. Molecules are often presented as SMILES strings, which encode molecular structure as a 1D textual sequence of characters. SMILES strings have seen some success in deep learning models with natural language processing techniques, but in many cases they offer limited representations of molecular structure due to their 1D nature (Li et al. 2022). Additionally, basic SMILES strings are non-unique, which presents a problem when incorporating them into studies that use supervised machine learning models. Past research on predicting solubility values with machine learning has explored various methods of presenting molecular structure. Lee et al. (2022) studied the use of graph convolutional network models by converting molecules into graph representations, where atoms and bonds corresponded to nodes and edges. Additionally, they studied the use of Molecular Fingerprints and Physicochemical Features (MFPCP), where molecular fingerprints were combined with inexpensive molecular properties that produce a high prediction rate when used as descriptors. Molecular fingerprints are a common technique in cheminformatics, where SMILES strings are turned into a form in which computers can recognize patterns. In the case of Lee et al. (2022), the Molecular ACCess System (MACCS), a popular 166-dimension data structure, was used.

With this past research in mind, molecular fingerprints were chosen to represent molecular structure in this project for their ease of use with supervised machine learning models. However, past research on predicting elimination half-life has also used other physicochemical descriptors, so two methods of representing molecular structure were used. The first method involved Extended-Connectivity Fingerprints (ECFPs), a common cheminformatics method that procedurally generates unique fingerprints for each molecule of a dataset (Rogers and Hahn 2010). ECFPs were used due to their popularity in cheminformatics and their ease of use, since they can be generated deterministically from SMILES strings. Like other molecular fingerprints, ECFPs represent the structure of a molecule by capturing key substructures. Unlike other fingerprints, ECFPs can be procedurally generated to encompass different numbers of features or characteristics, giving them flexibility in cleaning datasets and training models. Additionally, factors such as the number of iterations and the length of the bit array are up to the user when generating ECFPs, which may influence the complexity or size of each individual fingerprint. These fingerprints are valuable in virtual screening and drug discovery, helping to identify molecules with similar structures or properties. They enable computational methods to compare and analyze chemical compounds efficiently, contributing to the discovery of new drug candidates or the optimization of existing ones.

The second method of representing molecular structure used a generic molecular descriptor (Sun 2004). While other descriptors may only capture certain properties of molecules, this descriptor was chosen for its general-purpose nature: it is based on key substructures and patterns in a molecule's structure, and it can be used without pre-selection to model various molecular properties.

In recent studies using machine learning to predict elimination half-life, classification and deep learning models have found some success. Dawson et al. (2023) used an ensemble classification model, where ranges of half-lives were sorted into bins and then predicted using a Random Tree Classifier. In this project, however, we used regression models. An advantage of regression models is their ability to predict exact values instead of sorting drugs into ranges of half-lives. This is important, as drugs for different purposes require different half-lives. Additionally, supervised learning models are generally simpler and easier to program than deep-learning models, making them a better option for researchers without extensive computer science backgrounds. Available regression models vary in training methods, optimization algorithms, and means of fitting the data. Linear models such as Linear Least Squares and SGDRegressor were used, alongside nonlinear models such as Random Forest, SVR, and XGBoost. Due to the nonlinear nature of molecular structure, linear models were not expected to perform as well as nonlinear models in this study.

2. Materials and Methods:

2.1 Dataset: A text file with half-life data for 1645 drugs was curated from DrugBank, with each entry containing the drug name, a labeled half-life in hours, and molecular structure in the form of a SMILES string (Wishart et al. 2006). All half-lives were measured in humans after oral administration, and ranges of half-lives were converted into the mean of the range. Due to the large variance in half-lives, the log of each half-life was taken to dampen the effects of outliers. SMILES strings were not a viable representation of molecular structure, so the strings were converted into molecular fingerprints using two methods.
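As a minimal sketch of this preprocessing step (the file layout and column names below are assumptions, since the exact format of the curated text file is not specified):

    import numpy as np
    import pandas as pd

    # Assumed layout: tab-separated file with drug name, SMILES, and half-life in hours
    df = pd.read_csv("half_life_data.txt", sep="\t",
                     names=["name", "smiles", "half_life_hours"])

    # Dampen the effect of outliers by taking the natural log of the half-lives
    df["log_half_life"] = np.log(df["half_life_hours"])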

2.1.1 Dataset 1:

In Dataset 1, the SMILES strings were converted to Extended-Connectivity Fingerprints (ECFPs), a type of molecular fingerprint used to represent chemical structures in a format suitable for searching, clustering, and machine learning in drug discovery. ECFPs are circular fingerprints that capture the local chemical environment around each atom in a molecule. The fingerprints were generated by first assigning a unique integer value to each atom using the Daylight atomic invariants. To capture and represent substructures and bonds, each atom's integer value, together with those of its neighbors, was fed into a hash function, and a new integer value was assigned to the atom. This was done twice, with more information about the relationship between neighboring atoms captured in each iteration. After duplicate features were removed, all atoms' unique integer values formed the identifier list. This identifier list was converted into a single 1024-length bit array, with each value being hashed to a location in the array. An example of this is seen in Figure 1.

Figure 1. Visual representation of how atom integers from an identifier list are hashed and converted into a bit array.
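A minimal sketch of how such fingerprints can be generated with RDKit's Morgan fingerprint implementation, which produces ECFP-style fingerprints (the thesis does not state that RDKit was used, and the SMILES below is illustrative; a radius of 2 corresponds to the two hashing iterations described above):

    from rdkit import Chem
    from rdkit.Chem import AllChem

    smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, as an illustrative example
    mol = Chem.MolFromSmiles(smiles)

    # Radius 2 = two neighborhood-hashing iterations, folded into a 1024-bit array
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
    features = list(fp)  # 1024 binary features for this molecule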

Because a 1024-length bit array was used for each molecule, the number of features for each drug increased to 1024. Many of these features, however, could be redundant and similar across drugs, so redundant features were removed using calculations from a Principal Component Analysis. However, later testing revealed no significant difference in performance between the datasets before and after redundant features were removed.
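A sketch of this reduction step with scikit-learn (the variance threshold below is an assumption; the thesis does not state the cutoff that was used):

    from sklearn.decomposition import PCA

    # Keep the fewest components that together explain 95% of the variance (assumed cutoff)
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X)  # X: (n_drugs, 1024) fingerprint matrix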

2.1.2 Dataset 2:

In Dataset 2, a different molecular fingerprint, specifically designed to accurately predict common and important drug development properties such as logP, logS, and logBB, was used. Particular atom types were generated using properties such as each atom's element, neighboring atoms, bond types, and aromaticity. Atom types were further classified with a classification tree trained on logP values. This classification yielded 270 unique molecular descriptors for each fingerprint, and thus 270 features for each drug. As opposed to ECFPs, each feature generated was a base-10 number, not a bit.

After data cleaning, two separate datasets were produced. Dataset 1 contained 1645 drugs with their elimination half-lives and 1024 features. Dataset 2 contained the same 1645 drugs with their elimination half-lives and 270 features.

2.2 Methods:

Five different models were used, each with varying algorithms and strengths. The regression models used were Linear Least Squares, Random Forest, Support Vector Regressor (SVR), Stochastic Gradient Descent Regressor (SGDRegressor), and Extreme Gradient Boosting Regressor (XGBoost).

2.2.1 Linear Least Squares

Linear Least Squares is one of the most common and simple types of regression models, and plays a role in the workings of more complex models (Guthrie 2023). The model fits the data using a function of the form f(x) = β₀ + β₁x₁ + β₂x₂ + …, with each xᵢ being a particular feature, and each βᵢ being an unknown parameter or weight. Linear Least Squares is only linear in the sense that it is linear in the unknown parameters, not in x. Importantly, this means that best-fit functions created by Linear Least Squares do not have to be straight lines; the fitting function may also include polynomials, exponents, trigonometric functions, or other expressions. In each iteration of the model, the unknown parameters are adjusted to minimize the sum of the squared differences between the best-fit function and the true target half-life.

In this study, scikit-learn’s Linear Regression model was employed as the ordinary Linear Least Squares model (“Linear Regression” 2023). A fit intercept was added, but all other parameters were kept at default values.
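This configuration corresponds roughly to the following (a sketch; X_train and y_train stand for the fingerprint matrix and log half-lives):

    from sklearn.linear_model import LinearRegression

    # Ordinary Linear Least Squares with a fitted intercept
    model = LinearRegression(fit_intercept=True)
    model.fit(X_train, y_train)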

2.2.2 Random Forest

Random Forest, a common ensemble method, works by averaging the results of many decision trees to form a prediction (VanderPlas 2023). As seen in Figure 2, the decision trees follow a simple branching process, with each node having two possible outcomes and with multiple features determining each split. Averaging the results of all trees is done to avoid overfitting the data.

Figure 2. A simplified example of a decision tree.

Each decision tree should differ, if only slightly, in the predictions it makes. Therefore, this study used bootstrapping, where each tree was only trained on a random subset of the training data.

This study used scikit-learn’s Random Forest Regressor model (“Random Forest Regressor” 2023). To account for the large number of features, 200 trees were used and bootstrapping was set to true. Additionally, the number of features considered in each split of a decision tree was set as the square root of the number of features. All other parameters were kept at default values.
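These settings map onto scikit-learn's API roughly as follows (a sketch; no random seed is stated in the text, so none is set here):

    from sklearn.ensemble import RandomForestRegressor

    model = RandomForestRegressor(n_estimators=200,     # 200 trees
                                  max_features="sqrt",  # sqrt of the feature count per split
                                  bootstrap=True)       # each tree trains on a random subset
    model.fit(X_train, y_train)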

2.2.3 Support Vector Regression (SVR)

SVRs are a subset of Support Vector Machines (SVMs), which are more commonly seen in discriminative classification. When working with classification problems containing data points with many features, SVMs use a multidimensional space and try to find a manifold that divides the data points into distinct predicted classes that are in line with the real classes. This manifold is formed by trying to maximize the margin between the data points and the manifold, or by trying to fit the manifold optimally between the two classifications. Because SVMs can work in higher-dimensional spaces, fit functions can appear nonlinear when projected onto a 2D space.

Similar to SVMs, SVRs also work to find a data-fitting manifold, but instead of separating data points into distinct classes, the manifold is used to estimate values (Rodríguez-Pérez and Bajorath 2022). It is formed in the same way as classification SVMs, except instead of trying to maximize the margin between data points, SVRs penalize deviations between the predicted and target values. To avoid overfitting, deviations less than ε are ignored, but data points outside of the manifold’s ε-tube are penalized. SVRs also contain a C value, which dictates the intensity of regularization and affects accuracy and risk of overfitting.

In this study, scikit-learn’s SVR model was used (“SVR” 2023). A radial basis function kernel was chosen, along with a C of 3 and ε of 0.1. All other parameters were kept at default values.
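A sketch of this configuration:

    from sklearn.svm import SVR

    # Radial basis function kernel with moderate regularization (C=3)
    model = SVR(kernel="rbf", C=3, epsilon=0.1)
    model.fit(X_train, y_train)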

2.2.4 Stochastic Gradient Descent Regressor (SGDRegressor)

Unlike the previous models used, the main focus of SGDRegressor is not the model itself, but the way the model is optimized. In all optimization algorithms, the goal is to find the minimum point of a loss function, that is, the point where the difference between expected and predicted values is smallest. This is done by iteratively adjusting the parameters, or weights, of the model.

In the Gradient Descent optimization algorithm, the learning rate, or the rate at which the model approaches the minimum of a loss function, decreases as the loss function approaches the minimum (IBM 2023). This is done to avoid overshooting the minimum point of the loss function and to avoid oscillation.

However, with the large number of features and data points in this study, batch Gradient Descent would have been too inefficient, as the model would need to calculate the sum of all errors for the loss function in each iteration. Therefore, Stochastic Gradient Descent was chosen, where a random point is picked to calculate the loss function in each iteration. This greatly increases computing efficiency.

In this study, scikit-learn’s linear model SGDRegressor was used (“SGDRegressor” 2023). In order to minimize the effects of outliers, a Huber loss function with ε of 0.1 was used (“Stochastic” 2023). The number of max iterations for the model was set to 1000, and all other parameters were kept at default values.
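A sketch of this configuration:

    from sklearn.linear_model import SGDRegressor

    model = SGDRegressor(loss="huber",  # Huber loss to limit the influence of outliers
                         epsilon=0.1,
                         max_iter=1000)
    model.fit(X_train, y_train)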

2.2.5 Extreme Gradient Boosting (XGBoost)

One of the most popular machine learning methods for structured data, XGBoost works similarly to Random Forest models in its use of decision tree ensembles (“Introduction to” 2024). However, a key difference between the two models is that XGBoost uses a specialized form of decision tree, called a Classification And Regression Tree (CART). By default, CARTs use the Gini splitting rule, or Gini impurity, to decide the optimal split at the nodes of the tree (Waheed et al. 2006). This is done by iteratively identifying the most important variable to split on, which helps in weighting and identifying the influence of each feature in determining the target half-life. Unlike other ensemble methods, XGBoost performs tree boosting, where trees are made stepwise instead of all at once. Once one tree is made, it is tested, and its decisions are improved upon by adding the decisions of another tree, which is meant to correct for the first tree's errors. This is done additively, where many weak models' decisions are summed to form a final decision. Importantly, this differs from Random Forest Regression, where all decisions are averaged instead of summed.
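For reference, the Gini impurity of a node whose samples fall into classes with proportions p₁, …, pₖ is G = 1 − Σᵢ pᵢ², and CART chooses the split that most reduces this impurity. Strictly speaking, Gini impurity applies to classification splits; for the regression task in this study, the trees instead minimize a squared-error-based objective.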

XGBoost's Gradient Boosting builds on tree boosting: in each iteration, a tree is built to try to minimize the error of the previous trees. Similar to the idea of gradient descent in SGDRegressor, the rate at which this error is minimized decreases over the iterations.

In this study, XGBoost's Regressor model was used (“Python API” 2022), with 500 estimators, a learning rate of 0.05, a gamma of 0, and a subsample of 0.75. All other parameters were left at default values.
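A sketch of this configuration:

    from xgboost import XGBRegressor

    model = XGBRegressor(n_estimators=500,
                         learning_rate=0.05,
                         gamma=0,          # no minimum loss reduction required to split
                         subsample=0.75)   # each tree trains on 75% of the rows
    model.fit(X_train, y_train)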

2.2.6 Training and Evaluation

Using scikit-learn, the two datasets were split into training sets with 80% of the drugs and testing sets with the remaining 20%. A random state of 400 was used for both splits. After each model was trained with the training sets, it was evaluated on its predictions for the testing set using R2 and root mean squared error.
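A sketch of this training and evaluation loop (model stands for any of the five regressors described above):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score, mean_squared_error

    # 80/20 split with the random state used in this study
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=400)

    model.fit(X_train, y_train)
    preds = model.predict(X_test)

    r2 = r2_score(y_test, preds)
    rmse = np.sqrt(mean_squared_error(y_test, preds))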

3. Results and Discussion

3.1 Diversity and Distribution of Data

In this study's dataset, the original half-lives of the 1645 drugs ranged from 0.01 to 4000 hours. However, 1585 of the 1645 drugs had half-lives at or below 100 hours. As seen in Figure 3, most half-lives skewed towards values between 0 and 5 hours, which was not an optimal distribution for training the models.

Figure 3. Histogram of the 1585 half-lives between 0 and 100 hours. Half-lives were collected from the original dataset and plotted using Matplotlib with 40 bins.

In this study, the natural log of each half-life was taken and then rounded to two decimal places. This was done to regularize the distribution of the half-lives and reduce the impact of outliers, as seen in Figure 4. The log half-lives ranged from -6.91 to 8.29, and the mean log half-life was 1.87.

Figure 4. Histogram of the 1645 drug half-lives between -6.91 and 8.29 ln(hours). The natural log of the half-lives was plotted using Matplotlib with 40 bins.
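A sketch of how a histogram like Figure 4 can be produced (the column name follows the assumed layout from Section 2.1):

    import matplotlib.pyplot as plt

    plt.hist(df["log_half_life"], bins=40)
    plt.xlabel("ln(half-life in hours)")
    plt.ylabel("Number of drugs")
    plt.show()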

After the molecular fingerprints were generated, a Principal Component Analysis (PCA) was performed on the two datasets. The datasets were plotted using their cumulative explained variance to identify redundant features, measure the distribution of effective features, and reduce the dimensionality of the datasets.
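A sketch of the cumulative explained variance plot (X stands for either dataset's feature matrix):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    pca = PCA().fit(X)  # fit with all components retained
    plt.plot(np.cumsum(pca.explained_variance_ratio_))
    plt.xlabel("Number of components")
    plt.ylabel("Cumulative explained variance")
    plt.show()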

Dataset 1 exhibited a much larger distribution of important features and fewer redundant features than Dataset 2, which showed a less diverse distribution of important features. As seen in Figure 5, while the ECFPs in Dataset 1 effectively captured the information and variance of the drugs' molecular structures, the molecular fingerprints in Dataset 2 did not, performing poorly in reflecting this information.

Figure 5. Cumulative explained variance plots for Dataset 1 (top) and Dataset 2 (bottom) using a PCA.

3.2 Comparison of Models

The predictions of the models were scored on their R2 and Root Mean Squared Error, presented in Table 1 and Table 2.

Table 1. Scores of Model Predictions for Dataset 1

Table 2. Scores of Model Predictions for Dataset 2

Random Forest Regressor was identified as the model with the highest prediction accuracy for both datasets, with an R2 of 0.293 using Dataset 1 and R2 of 0.250 using Dataset 2. Linear Least Squares performed the worst, with an R2 of -3.94 using Dataset 1 and R2 of -597 using Dataset 2. Other than Linear Least Squares for both datasets and SGDRegressor for Dataset 2, all other models produced positive R2 values, meaning that the models performed better than just taking the mean of the half-lives. As expected, the linear regression models, which were Linear Least Squares and SGDRegressor, performed the worst, as they could not sufficiently capture the nonlinear nature of molecular structure’s relationship to half-life.

Random Forest’s superior performance across both datasets is likely because of its use of many trees and its ability to effectively identify the importance of individual features. Its method of averaging each tree’s decisions also reduces overfitting for irrelevant features, which is a weakness in other models.

Because of its ability to identify the weight of each feature, Random Forest handled datasets with many important features, as seen in Table 1, and datasets with only a few important features, as seen in Table 2. This flexibility is what allowed Random Forest to do well with both datasets. This is also reflected in the similar R2 values for XGBoost across both datasets. Since it also uses a tree ensemble method with CARTs, XGBoost was also able to effectively predict half-lives for both datasets. These findings echo those of Lu et al. (2016), where Gradient Boosting Machines, which also use many weaker decision trees, had the best accuracy.

However, SVR exhibited a much lower R2 value when trained with Dataset 2 compared to Dataset 1, likely due to Dataset 2's narrower distribution of important features. While SVR's fitting is very flexible and can model nonlinear relationships well, this flexibility also makes the model prone to overfitting. SVR utilizes an ε-value to avoid overfitting, where deviations less than ε are ignored, but data points outside of the manifold's ε-tube are penalized. In this study, however, multiple ε-values were tested without a large improvement in accuracy. Additionally, SVR's poor performance may be due to its poor feature weighting (McKearnan et al. 2023). This is reflected in its poor performance in Table 2, where redundant or irrelevant features were weighted similarly to important features and incorrect relationships were identified.

Overall, all models performed better when trained with the ECFPs in Dataset 1 compared to the other molecular fingerprints in Dataset 2. This was expected, as the ECFPs’ features retained much less redundant information than the other molecular fingerprint.

Although models like Random Forest were able to predict some variance in the elimination half-life of drugs, there exists much room for improvement in these predictions. The deviations of these predictions from their real values could come from insufficient or redundant molecular fingerprints that cause models to overfit or misinterpret relationships with half-life. Lu et al. (2016) documented the success of using 3D molecular structures whose molecular descriptors were calculated in Codessa (Katritzky et al. 1996). This was followed by multiple rounds of removing constant or redundant features, which was not done in this study but could be done in future studies. With the success of ensemble methods in this project, future studies could examine different or more complex ensemble methods, such as Stacking Regressor, Gradient Boosting, or consensus models made up of multiple ensemble models.

4. Conclusion

With the costly and time-consuming process of finding elimination half-life through experimental tests, predicting pharmacokinetic values like elimination half-life through computational analysis has grown in popularity. However, many contemporary predictions of half-life use classification or unsupervised models, which may be insufficient for practical needs in drug discovery.

In this study, we examined the use of ECFPs and other molecular fingerprints to train regression models. Simple ensemble methods produced the most accurate predictions; SVR also produced accurate predictions, but only when trained with ECFPs. In the future, more powerful and optimized ensemble methods could be used to recognize relationships between half-life and molecular structure. Additionally, these same ensemble methods could be used to predict other important properties, such as the bioavailability or logS of drugs, and may ultimately help speed up drug development.

References

1. Caldwell, G. W., Masucci, J. A., Yan, Z., & Hageman, W. (2004). Allometric scaling of pharmacokinetic parameters in drug discovery: can human CL, Vss and t1/2 be predicted from in-vivo rat data?. European journal of drug metabolism and pharmacokinetics, 29(2), 133–143. https://doi.org/10.1007/BF03190588

2. Christensen J. H., Andreasen F., & Jansen J. A. (1981). Influence of age and sex on the pharmacokinetics of thiopentone. British journal of anaesthesia, 53(11), 1189–1195. https://doi.org/10.1093/bja/53.11.1189

3. Dawson DE, Lau C, Pradeep P, Sayre RR, Judson RS, Tornero-Velez R, Wambaugh JF. A Machine Learning Model to Estimate Toxicokinetic Half-Lives of Per- and Polyfluoroalkyl Substances (PFAS) in Multiple Species. Toxics. 2023; 11(2):98. https://doi.org/10.3390/toxics11020098

4. Guthrie, W. (2024, January 12). NIST/SEMATECH e-Handbook of Statistical Methods. https://www.itl.nist.gov/div898/handbook/pmd/section1/pmd141.htm

5. Hallare J, Gerriets V. Half Life. [Updated 2023 Jun 20]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2023 Jan-. Available from: https://www.ncbi.nlm.nih.gov/books/NBK554498/#

6. Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., Back, T., … Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583–589. https://doi.org/10.1038/s41586-021-03819-2

7. Katritzky, A. R., Lobanov, V. S., Karelson, M., Murugan, R., Grendze, M. P., & Toomey, J. E. (1996). Comprehensive descriptors for structural and statistical analysis. 1: correlations between structure and physical properties of substituted pyridines. Revue Roumaine de Chimie, 41(11-12), 851-867.

8. Lee, S., Lee, M., Gyak, K. W., Kim, S. D., Kim, M. J., & Min, K. (2022). Novel Solubility Prediction Models: Molecular Fingerprints and Physicochemical Features vs Graph Convolutional Neural Networks. ACS omega, 7(14), 12268–12277. https://doi.org/10.1021/acsomega.2c00697

9. Li, C., Feng, J., Liu, S., & Yao, J. (2022). A Novel Molecular Representation Learning for Molecular Property Prediction with a Multiple SMILES-Based Augmentation. Computational Intelligence and Neuroscience, 2022, 8464452. https://doi.org/10.1155/2022/8464452

10. Lu J., Lu D., Zhang X., Bi Y., Cheng K., Zheng M., Luo X. (2016). Estimation of elimination half-lives of organic chemicals in humans using gradient boosting machine. Biochimica et Biophysica Acta (BBA) - General Subjects, 1860(11), 2664-2671. https://doi.org/10.1016/j.bbagen.2016.05.019

11. Mangoni, A. A., & Jackson, S. H. (2004). Age-related changes in pharmacokinetics and pharmacodynamics: basic principles and practical applications. British Journal of Clinical Pharmacology, 57(1), 6–14. https://doi.org/10.1046/j.1365-2125.2003.02007.x

12. McKearnan, S. B., Vock, D. M., Marai, G. E., Canahuate, G., Fuller, C. D., & Wolfson, J. (2023). Feature selection for support vector regression using a genetic algorithm. Biostatistics (Oxford, England), 24(2), 295–308. https://doi.org/10.1093/biostatistics/kxab022

13. Python API reference. Python API Reference - xgboost 2.0.3 documentation. (2022). https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor

14. Rodríguez-Pérez, R., & Bajorath, J. (2022). Evolution of Support Vector Machine and Regression Modeling in Chemoinformatics and Drug Discovery. Journal of Computer-Aided Molecular Design, 36(5), 355–362. https://doi.org/10.1007/s10822-022-00442-9

15. Rogers D and Hahn M. Extended-Connectivity Fingerprints. Journal of Chemical Information and Modeling, 2010 50 (5), 742-754. DOI: 10.1021/ci100050t

16. sklearn.linear_model.SGDRegressor. scikit-learn. (2023). https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor

17. sklearn.svm.SVR. scikit-learn. (2023). https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html

18. sklearn.linear_model.LinearRegression. scikit-learn. (2023). https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

19. sklearn.ensemble.RandomForestRegressor. scikit-learn. (2023). https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

20. 1.5. Stochastic gradient descent. scikit-learn. (2023). https://scikit-learn.org/stable/modules/sgd.html#sgd-mathematical-formulation

21. Sun H. A Universal Molecular Descriptor System for Prediction of LogP, LogS, LogBB, and Absorption. Journal of Chemical Information and Computer Sciences 2004 44 (2), 748-757. DOI: 10.1021/ci030304f

22. Waheed, T., Bonnell, R. B., Prasher, S. O., & Paulet, E. (2006). Measuring performance in precision agriculture: CART, a decision tree approach. Agricultural Water Management, 84(1-2), 173-185. https://doi.org/10.1016/j.agwat.2005.12.003

23. Van de Waterbeemd, H., Gifford, E. ADMET in silico modeling: towards prediction paradise?. Nat Rev Drug Discov 2, 192–204 (2003). https://doi.org/10.1038/nrd1032

24. VanderPlas, J. (2023). Python Data Science Handbook. O’Reilly Media, Inc.

25. What is gradient descent?. IBM. (n.d.). https://www.ibm.com/topics/gradient-descent

26. Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D668-72.

27. xgboost developers. (n.d.). Introduction to boosted trees. Introduction to Boosted Trees - xgboost 2.0.3 documentation. https://xgboost.readthedocs.io/en/stable/tutorials/model.html
