Finn McMillan Senior Thesis 2024 by Boston University Academy

Utilizing Machine Learning Methods in Genome Scale Stoichiometric Models of P. simiae with COMETS

Finn McMillan, Segre Lab

Plant root microbiomes are incredibly diverse, hosting thousands of different species of bacteria at any given moment. While such microbial interactions can either benefit or inhibit plant growth, it is difficult to say which bacterial colonies are responsible for certain effects. The Microbial Community Analysis & Functional Evaluation in Soils project (m-CAFEs) was created for exactly this purpose; a greater understanding of these interactions will allow for not only more accurate predictions of such communities, but also for potential microbiome control and manipulation (m-CAFEs, 2023). Founded at the Lawrence Berkeley National Laboratory, m-CAFEs is a diverse project spanning several major research institutions and areas of focus. Numerous research publications have come to fruition through this project, many of which detail ways to mitigate climate change through plant microbiome engineering. One such emphasis is placed on rhizospheric effects, which are biological and chemical changes that occur in soils due to plant microorganisms. m-CAFEs seeks to identify the interactions that influence carbon flow in particular; combined with CRISPR-Cas and RNAi community editing, the goal of the research is to artificially optimize plant growth for the maximum yield of biomass. Such optimization would allow for biofuels such as ethanol to become more sustainable and feasible for consumers.

According to the United States Environmental Protection Agency, replacing fossil fuels with biofuels can generate numerous benefits for the environment, while also supplying the world with an indefinite fuel supply (US EPA, 2014). Their prominence means a drastic decrease in carbon monoxide emissions and fossil fuel imports. However, biofuel feedstocks also require more land devoted to agriculture, increasing pesticide pollutants and food prices as crops are diverted towards fuel production. In order for biofuels to overtake coal and fossil fuels as a dominant form of energy, their pros must outweigh their cons. Artificially optimizing plant growth will allow for biofuels to become cheaper, while simultaneously occupying less agricultural land. Studies such as this one are crucial for the future of biofuels, and the new reality of climate change.

In plant root biomes, the presence of certain bacteria can either hinder or promote plant growth. One such promoter is Pseudomonas simiae, a Gram-negative, catalase- and oxidase-positive, rod-shaped bacterium isolated from monkeys (Preston, 2004). In order to artificially edit these root biomes to maximize plant growth using CRISPR-Cas, an accurate simulation of P. simiae and its subsequent growth must be created. The method of plant microbiome analysis utilized in this study is Computation Of Microbial Ecosystems in Time and Space (COMETS), a multi-scale modeling framework that computes group dynamics through metabolic stoichiometry, separated from any prior assumptions of how species interact (Harcombe et al., 2014). First becoming publicly available in 2014, COMETS was founded through a collaboration between researchers at Boston University, Yale University, and the University of Minnesota (COMETS, 2023). Rather than utilizing classical kinetic models in community analysis that require large-scale kinetic parameters and differential equations, COMETS employs both stoichiometric and environmental modeling in accurately predicting metabolic activity at the genome-scale and community level. This is performed through dynamic flux balance analysis (dFBA), which assumes steady optimality in order to predict the metabolic fluxes of a system, and consequently, the system’s microbial growth. In COMETS, analysis is performed within time-dependent interacting subsystems that simultaneously record intracellular metabolic fluxes and data on communal distribution of populations and nutrients. In order to combine both species-level and ecosystem-level analysis of a system, COMETS first models cellular growth through a hybrid kinetic-dFBA approach, before taking finite differences approximations of the use of extracellular nutrients and the effects of the environment.
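The hybrid kinetic-dFBA idea can be illustrated with a deliberately simplified sketch. It is not the COMETS implementation: the FBA step is replaced by a hypothetical constant yield (`yield_coeff`), the function name is invented for illustration, and the remaining numbers simply echo the parameter values quoted later in this thesis. In each cycle a kinetic (Michaelis-Menten) bound caps glucose uptake, and biomass and glucose are then advanced by one time step.

```python
def simulate_batch_growth(biomass, glucose, vmax, km, yield_coeff,
                          time_step, max_cycles):
    """Toy batch culture: a kinetic uptake bound followed by an Euler update."""
    history = [biomass]
    for _ in range(max_cycles):
        if glucose > 0.0:
            # Kinetic step: Michaelis-Menten cap on the uptake flux.
            uptake = vmax * glucose / (km + glucose)
            # Never take up more glucose than is left in this time step.
            uptake = min(uptake, glucose / (biomass * time_step))
        else:
            uptake = 0.0
        # Stand-in for the FBA step: growth rate proportional to uptake.
        growth_rate = yield_coeff * uptake
        consumed = uptake * biomass * time_step
        glucose = max(glucose - consumed, 0.0)
        biomass += growth_rate * biomass * time_step
        history.append(biomass)
    return history, glucose


# yield_coeff is a hypothetical value chosen only for illustration.
history, glucose_left = simulate_batch_growth(
    biomass=8e-6, glucose=0.0008, vmax=15.0, km=0.002,
    yield_coeff=0.05, time_step=0.25, max_cycles=220)
```

In this toy version the growth curve rises while glucose is available and flattens once it is exhausted; a real COMETS run would obtain the growth rate from the genome-scale model instead of a fixed yield.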

To fully understand and utilize the various benefits of the COMETS software, one must first have an understanding of flux balance analysis (FBA) (Orth et al., 2010). FBA is a commonly used approach in predicting organism growth that uses information about organisms' metabolic properties to reconstruct biochemical networks and calculate the flux rates of enzymatic reactions.

Calculation begins with mathematical representations of the reactions within an organism, portrayed as stoichiometric coefficients. These coefficients make up a stoichiometric matrix (S), composed of m compounds and n reactions, resulting in a size of m × n. Unlike other theory-based models that require complex kinetic parameters and differential equations, FBA uses the stoichiometric coefficients as flux constraints that control the flow of metabolites through the system. The coefficients are represented in a matrix system where negative numbers represent consumed metabolites, positive numbers represent produced metabolites, and zeros represent metabolites that are non-participants in a reaction.
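As a small illustration of this sign convention, the sketch below builds a stoichiometric matrix for a hypothetical three-reaction chain (uptake of A, conversion of A to B, export of B) and checks whether a flux vector leaves every metabolite balanced. The network and the function name are assumptions made for the example; they are not taken from the P. simiae model.

```python
# Rows are metabolites (m = 2), columns are reactions (n = 3).
S = [
    [1, -1, 0],   # A: produced by R1 (uptake), consumed by R2
    [0, 1, -1],   # B: produced by R2, consumed by R3 (export)
]


def mass_balance(matrix, fluxes):
    """Return S * v: the net production rate of each metabolite."""
    return [sum(coef * flux for coef, flux in zip(row, fluxes))
            for row in matrix]


balanced = mass_balance(S, [2.0, 2.0, 2.0])     # every metabolite nets to zero
unbalanced = mass_balance(S, [2.0, 1.0, 1.0])   # A accumulates
```

A flux vector is only admissible for FBA when every entry of the result is zero, which is the steady-state condition.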

Figure 1: Mass balance constraints are placed on the stoichiometric matrix S, with capacity constraints ai and bi defining lower and upper bounds of the system. v represents a particular flux distribution: a maximum or minimum of the objective function, which satisfies the given constraints (Orth et al., 2010).

Certain constraints must be imposed to perform FBA and determine global minima and maxima (Figure 1). By defining such constraints for a given biological system, an allowable solution space can be created. Figure 1 first depicts an infinite space devoid of kinetic parameters, where all solutions are valid. By outlining an allowable space, a single optimal solution can be found, which, in the case of m-CAFEs and this study, is the set of simulated values that most accurately corresponds with observed data. Constraints are imposed both as reaction balances and as upper and lower bounds for reactions. Within the stoichiometric matrix, constraints exist so that compounds are being consumed and produced at a steady state. Upper and lower bounds are utilized to define the allowable solution space by limiting the consumption and production rates of certain metabolites. Because FBA does not use kinetic parameters, it cannot predict metabolite concentrations, and must also assume that all fluxes remain in a steady state. The steady-state assumption allows for a quicker and simpler way of assessing enzymatic reactions.
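The constraints described above can be written compactly as a linear program. This is the standard FBA formulation following Orth et al. (2010); the objective weight vector c is notation assumed here, while S, v, ai and bi match the symbols in the Figure 1 caption.

```latex
\max_{v} \; c^{T} v
\quad \text{subject to} \quad
S v = 0,
\qquad a_i \le v_i \le b_i \quad (i = 1, \dots, n)
```

The equality enforces steady state for all m metabolites, and the inequalities bound each of the n reaction fluxes.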

The m-CAFEs project seeks a way to optimize plant growth, thereby optimizing plant biomass for use in biofuels. Through the COMETS software, simulations can be created for individual bacterial root colonies, where parameters and metabolites for growth can be numerically specified. One such root bacterium is the previously mentioned Pseudomonas simiae, which is the primary subject of this paper. Pseudomonas is an incredibly diverse genus that can occupy several different niches, with many species being found in grass and plants. Some plant-associated Pseudomonas work to increase plant growth and disease resistance, while other species act as parasites and inhibit biomass growth. P. simiae promotes plant biomass by suppressing pathogenic microorganisms, synthesizing growth hormones, and increasing plant disease resistance. However, very few studies have been performed on the growth of P. simiae; simulated data does not exist in large enough quantities for the bacteria to be subject to CRISPR-Cas editing. This scarcity, along with the availability of reliable experimental growth data, makes P. simiae a strong target for COMETS analysis. In particular, this study seeks to create a COMETS simulation for P. simiae that more accurately reflects experimental data, given glucose as the primary carbon source.

The experimental data used in this study was first reported by Adam M. Deutschbauer, the department head at Berkeley Lab and Technical Co-Manager on the m-CAFEs project (Wetmore et al., 2015). His data was used as a benchmark for P. simiae; in this particular study, the goal is to create a COMETS simulation that most accurately represents the physical analysis. Before the COMETS simulation can be created, one must first derive a genome-scale model of the given bacteria. This can be accomplished through The Department of Energy Systems Biology Knowledgebase (KBase), an open data platform that allows for an integrated analysis of plant-microbe communities (KBase, 2023). The genome-scale model for P. simiae was first created by supplying the given reads in KBase, before using RAST annotation to complete the model.

COMETS simulations were conducted using Jupyter Notebooks (a Python-based interface), the COBRA toolbox, and standard NumPy software packages. COMETS scripts can also be run in MATLAB, but the Python language was chosen instead based on its compatibility with the machine learning algorithms defined later in this project. A COMETS layout was then created to cultivate P. simiae in the specified environment. The simulation layout can be compared to an experimental test tube, used to house the bacteria and given metabolites. The volume of this test tube was specified to be 150 μL, and the metabolites were then outlined. In the COMETS interface, metabolites are specified in millimolar amounts (mM); along with elements and compounds such as calcium, iron, and ammonium being loaded into the test tube, a carbon source was also given to supply the growth of P. simiae. In this study, the carbon source utilized was glucose, which was set to 0.0008 mM.

After basic test tube outlines were created, the genome-scale model was then loaded in. Defined as a Systems Biology Markup Language (SBML) file, the model was first created using the COBRA toolbox before being translated into the COMETS interface. At this point, the initial biomass of P. simiae was declared at 0.000008 g/m2. After adding this model to the test tube layout, general parameters were set to specify experimental conditions. Time variables such as timeStep and maxCycles were defined as 0.25 and 220 respectively; while timeStep dictates when data on the bacteria is collected (specified in hours), maxCycles controls how long the experiment runs for. Most importantly, default Vmax and Km were given initial values of 15.0 and 0.002 respectively. Modifying Vmax and Km is the primary focus of this experiment; initial values were chosen as a benchmark for the experiment, with the expectation that these values were not optimal.

Growth curves were simulated and compared to experimental data (Figure 2). In order to accomplish this, a common time scale was created that allowed for the experimental results taken in real time to be plotted on the same graph as the simulated data on the COMETS time scale. Since maxCycles for the COMETS experiment was set at 220, experimental data was parsed in a way that allowed for 220 equally spaced cycles. Plotting growth curves on a shared graph is crucial to creating a more accurate simulation, as individual points on each curve can then be compared to each other. Such comparison is defined through a created cost function, which is a third curve that plots the difference between experimental and simulated data and returns a numerical representation of data discrepancy. The creation of the cost function allows for the effects of changed simulation parameters to be measured in real time and compared to older solutions. Slight increments and decrements of Vmax and Km will either improve or hinder accuracy of the simulation, and a global cost minimum can be found.
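The resampling and cost calculation can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names are hypothetical, linear interpolation is an assumed way of obtaining equally spaced points, and the cost is assumed to be the sum of absolute point-by-point differences. The sample data are invented and use 7 points instead of 220 to keep the example readable.

```python
def resample(times, values, n_points):
    """Linearly interpolate (times, values) onto n_points equally spaced times."""
    start, end = times[0], times[-1]
    step = (end - start) / (n_points - 1)
    grid = [start + i * step for i in range(n_points)]
    resampled = []
    j = 0
    for t in grid:
        # Advance to the measurement interval that contains t.
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        t0, t1 = times[j], times[j + 1]
        frac = (t - t0) / (t1 - t0)
        resampled.append(values[j] + frac * (values[j + 1] - values[j]))
    return grid, resampled


def cost(experimental, simulated):
    """Return the point-by-point cost curve and its sum (assumed definition)."""
    curve = [abs(e - s) for e, s in zip(experimental, simulated)]
    return curve, sum(curve)


# Length of the simulated experiment: 0.25 h per cycle, 220 cycles, in seconds.
total_seconds = 0.25 * 220 * 3600

# Invented measurements (time in seconds, biomass in arbitrary units).
exp_times = [0.0, 3600.0, 7200.0, 10800.0]
exp_biomass = [0.0, 1.0, 2.0, 3.0]
grid, exp_resampled = resample(exp_times, exp_biomass, 7)
simulated = [value + 0.1 for value in exp_resampled]
cost_curve, total_cost = cost(exp_resampled, simulated)
```

Returning both the curve and its sum mirrors the two uses of cost in this study: the curve is plotted alongside the growth data, while the single number is what an optimizer can minimize.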

Figure 2: Experimental, simulated, and cost data overlaid on one graph. The x-axis corresponds to time, in seconds; the y-axis represents biomass.

One can view the discrepancy between simulated (blue) and experimental (orange) data through the representative cost curve (green) (Figure 2). In the beginning, there is expected exponential growth in P. simiae, as substrate concentrations approach the Km value. Eventually, simulated data overshoots experimental growth, and a spike in cost can be observed at 100,000 seconds. The simulated curve does not try to emulate the random, low-frequency volatility of the experimental curve, but rather matches its general shape to minimize cost.

The primary objective of this study is to observe the effectiveness of changing parameters, more specifically Vmax and Km, in the context of improving simulation accuracy (University College London, 2019).

Figure 3: A graph relating rate of reaction and substrate concentration in the context of Vmax and Km (University College London, 2019).

At low concentrations of substrate, the rate of reaction rises steeply and almost linearly with concentration (Figure 3). The enzyme is ready to undergo a reaction and is only limited by the concentration of substrate which is available. As concentration increases, the formation of product becomes limited by the enzyme’s activity rather than a lack of substrate. This rate of reaction during sufficient substrate concentration is referred to as Vmax, or the maximum rate of reaction. Similarly, the relationship between reaction rate and substrate concentration depends on a given enzyme’s affinity for its substrate; this enzyme affinity is expressed as Km, or the Michaelis constant. In relation to one another, Km also refers to the concentration of substrate needed to reach half of Vmax. The impact of the Km value on plant growth and curve structure cannot be overlooked. Due to its large influence on rate of reaction, Km values are consequently of smaller magnitude than Vmax values.
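The relationship between Vmax and Km is the Michaelis-Menten rate law, which can be checked numerically with a short sketch. The function name is chosen for illustration, and the parameter values are the initial Vmax and Km used in this study.

```python
def michaelis_menten(substrate, vmax, km):
    """Reaction rate for a given substrate concentration."""
    return vmax * substrate / (km + substrate)


vmax, km = 15.0, 0.002

rate_at_km = michaelis_menten(km, vmax, km)             # half of vmax
rate_saturating = michaelis_menten(1000 * km, vmax, km) # approaches vmax

# Far below km the curve is close to the straight line (vmax / km) * substrate.
low_substrate = km / 100
rate_low = michaelis_menten(low_substrate, vmax, km)
linear_estimate = (vmax / km) * low_substrate
```

Lowering Km makes the curve reach its plateau at smaller substrate concentrations, while changing Vmax rescales the plateau itself.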

Figure 4: A graph showing the impact of the Michaelis constant Km on the general shape of the growth curve (University College London, 2019).

In terms of the simulation, optimal values of Vmax and Km correspond with observed values of experimental results. An optimal solution yields a curve that maintains a similar shape to that of experimental growth, consequently decreasing cost. To search for such an optimal solution, a machine learning algorithm known as simulated annealing was utilized (ScienceDirect Topics, 2023). Simulated annealing belongs to a larger subcategory of machine learning known as Monte Carlo, a search model that uses random sampling to address integration and optimization problems (ScienceDirect Topics, 2023). First utilized in the nineteenth century to estimate the value of π, Monte Carlo experiments were initially carried out physically, and were thus costly to implement. It was only when physicists in the 1940s began using random number generators, rather than physical experiments, that the Monte Carlo Method took off. Its most important feature lies in its inversion of statistics, using random quantities to estimate deterministic ones as opposed to an analytic approach. One common Monte Carlo problem examines rain falling uniformly at random over a certain region, in which the probability of a drop landing in a particular area is given as π/4. By analyzing the number of raindrops falling in that area, also known as random sampling, one can find a deterministic estimate of π. Simulated annealing is one such Monte Carlo method and is modeled on the real-world process of physical annealing. Physical annealing involves heating up a material to a specific annealing temperature, before cooling it to reach an optimal shape. When the material is hot, molecules are less structured and thus more susceptible to change; the inverse is true for the latter stages of cooling.

Simulated annealing works in the same manner, beginning at a specified temperature with a high probability of change before cooling to a less volatile state. Python lends itself well to simulated annealing, allowing for physical processes to be replaced by loops and random number generators. For this experiment, an initial annealing temperature was defined at 20°, base values of Vmax and Km were set at 15 and 0.002 respectively, and the base cost was determined to be 4.1. The algorithm was created within a loop designed to run 1000 times, where temperature gradually decreases, and the function consequently becomes less volatile. In each iteration of the algorithm, the absolute value of the change of Vmax and Km remained constant, but the sign of these changes varied. To determine whether each value was increased or decreased, two random number generators were used, one for each value: an even number corresponded to an increase in the respective value, while an odd number signified a decrease. The absolute value of the change in Vmax per iteration was set at 0.1, while the change in Km was set at 0.0001 to reflect the reasonable magnitudes of the values.
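The proposal step described above can be sketched in a few lines. The function and constant names are hypothetical, and the integer range passed to `randint` is an assumption, since the thesis only states that even numbers mean an increase and odd numbers a decrease.

```python
import random

VMAX_STEP = 0.1      # fixed change in Vmax per iteration
KM_STEP = 0.0001     # fixed change in Km per iteration


def propose(vmax, km, rng):
    """Move each parameter one fixed-size step in a randomly chosen direction."""
    # One random integer per parameter: even means increase, odd means decrease.
    vmax_sign = 1 if rng.randint(0, 99) % 2 == 0 else -1
    km_sign = 1 if rng.randint(0, 99) % 2 == 0 else -1
    return vmax + vmax_sign * VMAX_STEP, km + km_sign * KM_STEP


rng = random.Random(0)  # fixed seed so the sketch is repeatable
new_vmax, new_km = propose(15.0, 0.002, rng)
```

Because the two signs are drawn independently, each iteration explores one of four neighboring points in the (Vmax, Km) plane.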

After these new Vmax and Km values were determined, they were passed in as parameters of a method call to the cost function. For every cycle, the cost function is fed the new data and builds a new simulation with the updated parameters. All other factors remain constant; thus, the inevitable change in cost of the function can be attributed to the changes experienced by Vmax and Km. At this point, two variables are defined: one that represents the difference between the new cost calculated by the stated method call and the old cost of the previous cycle, and another that calculates the probability of the solution being accepted. The algorithm first checks to see if the difference variable is less than 0, which represents a more optimal solution. In this case, the new values of Vmax and Km are automatically accepted, and the algorithm moves on to the next iteration. If the difference variable is greater than zero, the algorithm defines a random floating-point value and compares it to the probability variable stated earlier. If the random value happens to be less than the probability variable, the solution is accepted; otherwise, Vmax and Km are not updated, and the algorithm moves to the next iteration. This methodology is referred to as the Metropolis algorithm, in which non-optimal solutions are accepted so that the algorithm can explore a greater solution space (Weisstein, 2023).

The parallels to physical annealing, as well as the characteristic randomness of Monte Carlo methods, are most evident in the probability variable. After new values are found for Vmax and Km, the initial annealing temperature of the function is divided by the number of that iteration and assigned to a new, temporary temperature variable. In terms of physical annealing, the probability is derived from thermodynamics,

$$P(\Delta E) = e^{-\Delta E / (k \cdot t)}$$

where the probability of the internal energy of the material changing is dependent on some given energy magnitude ΔE, some temperature t, and the Boltzmann constant k. Similarly, the probability function for simulated annealing is defined as

$$P(s) = e^{-s / t}$$

where the probability of a solution being accepted is dependent on the change in cost s and temperature t. Since the probability function is only taken into account when s > 0, the function will always return a number less than 1. As s becomes greater, i.e., as the new solution becomes less optimal, P(s) consequently decreases, making such a solution less likely to be accepted. As t decreases, i.e., as the algorithm continues to iterate, P(s) consequently decreases, making the algorithm less likely to accept a non-optimal solution.
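Putting the pieces together, the loop below is a minimal sketch of the annealing procedure as described in this section, not the code used in the study. The cost function is a hypothetical bowl-shaped surface (`toy_cost`) standing in for a full COMETS simulation, and its minimum at (20, 0.003) is arbitrary. The step sizes, starting values, initial temperature of 20, the 1000 iterations, and the cooling rule (initial temperature divided by the iteration number) follow the text.

```python
import math
import random


def acceptance_probability(cost_change, temperature):
    """P(s) = exp(-s / t), used only when the new cost is not an improvement."""
    return math.exp(-cost_change / temperature)


def toy_cost(vmax, km):
    """Hypothetical stand-in for a COMETS run: a bowl centered on (20, 0.003)."""
    return abs(vmax - 20.0) + 500.0 * abs(km - 0.003)


def anneal(cost_fn, vmax, km, initial_temp=20.0, iterations=1000, seed=0):
    rng = random.Random(seed)
    cost = cost_fn(vmax, km)
    history = [(vmax, km, cost)]
    for i in range(1, iterations + 1):
        # Fixed step sizes, random signs (even integer: up, odd integer: down).
        new_vmax = vmax + (0.1 if rng.randint(0, 99) % 2 == 0 else -0.1)
        new_km = km + (0.0001 if rng.randint(0, 99) % 2 == 0 else -0.0001)
        new_cost = cost_fn(new_vmax, new_km)
        difference = new_cost - cost
        temperature = initial_temp / i  # cooling: T0 divided by the iteration
        # Metropolis rule: always accept improvements, sometimes accept worse.
        accept = difference < 0
        if not accept:
            accept = rng.random() < acceptance_probability(difference, temperature)
        if accept:
            vmax, km, cost = new_vmax, new_km, new_cost
        history.append((vmax, km, cost))
    return history


p_early = acceptance_probability(0.5, 20.0 / 1)     # hot: close to 1
p_late = acceptance_probability(0.5, 20.0 / 1000)   # cold: close to 0
history = anneal(toy_cost, vmax=15.0, km=0.002)
initial_cost, final_cost = history[0][2], history[-1][2]
```

The same increase in cost is accepted almost always at the first iteration and almost never at the last, which is how the search shifts from exploration to refinement.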

This particular experiment began with a baseline cost of 4.1, derived from initial Vmax and Km values of 15.0 and 0.002. After the completion of the simulated annealing algorithm, the cost was decreased to 2.41 with optimal Vmax and Km values of 35 and 0.0036.

Figure 5: Experimental, simulated, and cost data overlaid on one graph. The x-axis corresponds to time, in seconds; the y-axis represents biomass.

One can view the discrepancy between simulated (blue) and experimental (orange) data through the representative cost curve (green) (Figure 5). The graph shows closer agreement between observed and simulated data, in comparison to Figure 2. The spike in cost observed at 100,000 seconds in Figure 2 is eliminated in the optimized simulation, resulting in a significant decrease in cost.

This process was also applied to a similar experiment involving P. simiae, but with different metabolites and experimental layout. Starting with Vmax and Km values of 10 and 0.03 and a cost of 8.76, the simulation was optimized to yield a final cost of 1.73, with optimal Vmax and Km found at 7.1 and 0.069.

Figure 6: A separate simulation of P. simiae: The x-axis corresponds to time, in seconds; the y-axis represents biomass. Cost is represented only as a value, rather than a curve.

Figure 7: The optimized simulation of P. simiae: The x-axis corresponds to time, in seconds; the y-axis represents biomass. Cost is represented only as a value, rather than a curve.

In this second experiment, optimizing Vmax and Km proved to be even more effective in lowering cost, while also suggesting that the methods outlined in this study can work under other simulation parameters. While experimental data remains constant across both experiments, this particular simulation utilized different metabolic values, resulting in different optimal Vmax and Km values (Figures 6, 7). In this case, while the optimized simulation undershot experimental results, it ultimately decreased cost drastically.

This study proved a success, slashing cost in both experiments simply by optimizing two control parameters. In the primary experiment documented in Figures 2 and 5, cost was decreased from 4.1 to 2.41, a reduction of about 41%, which is interpreted here as a 41% increase in simulation accuracy when compared with observable data. In the secondary experiment outlined in Figures 6 and 7, cost was decreased from 8.76 to 1.73, resulting in an even greater 80% increase in simulation accuracy. The methods outlined in this study averaged roughly a 60% improvement in simulation accuracy, making simulated annealing a highly effective method of finding optimal kinetic Vmax and Km parameters.
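The percentages quoted above follow directly from the reported costs, as the short calculation below shows. The function name is chosen for illustration; the input values are the costs reported in this study.

```python
def percent_reduction(initial_cost, final_cost):
    """Relative decrease in cost, expressed as a percentage."""
    return 100.0 * (initial_cost - final_cost) / initial_cost


primary = percent_reduction(4.1, 2.41)      # about 41.2
secondary = percent_reduction(8.76, 1.73)   # about 80.3
average = (primary + secondary) / 2         # about 60.7
```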

The high-temperature, high-volatility portion of the simulated annealing algorithm is crucial to its success in determining a global solution (Liang, 2020). In this initial stage, t is high enough so that P(s) will always return a high value, regardless of s. New solutions, regardless of their inaccuracy, will almost always be accepted.

Figure 8: A depiction of a multi-solution system, with a non-optimal barrier separating the two minima (Liang, 2020).

The algorithm’s ability to accept non-optimal solutions allows it to search beyond a local minimum (Figure 8). By returning to a former state of inaccuracy, the algorithm is able to bypass crevices of false optimality. As temperature continues to decrease, this period of high volatility is replaced by a more direct search pattern. In this state, the randomness of the system still allows for non-optimal solutions to be accepted, but such volatility is limited to allow for a global minimum to be identified. Once the simulated annealing algorithm completed its 1000 iterations, the final ten solutions were examined. Of these ten, the Vmax and Km values that yielded the lowest cost were then selected as optimal solutions to the simulation.
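The final selection step can be expressed as a short helper. This is a sketch under assumptions: the function name and the (Vmax, Km, cost) tuple layout are hypothetical, and the history values are invented to show that only the last ten entries are considered.

```python
def best_of_last(history, n=10):
    """Return the lowest-cost (vmax, km, cost) entry among the final n entries."""
    return min(history[-n:], key=lambda entry: entry[2])


# Invented history of twelve entries, oldest first. The very first entry has
# the lowest cost overall but falls outside the final ten.
history = [(14.9, 0.0019, 0.5)] + [
    (15.0 + 0.1 * i, 0.002, abs(5 - i) + 1.0) for i in range(11)
]
best = best_of_last(history)
```

Restricting the choice to the final entries favors solutions found after the temperature has dropped; tracking the best cost seen across the whole run would be an alternative design.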

One important note of this study concerns the cost function, when expressed as a single number representing the sum of all discrepancy. The objective of this study is not to reach a cost of zero; in fact, such a feat is impossible to accomplish. Optimizing Vmax and Km is shown to drastically improve simulation accuracy, yet such changes occur on a broad scale that does not concern volatility on a second-to-second basis. These kinetic parameters influence the general shape of the simulated growth curve; in order to further decrease the cost of the simulation, metabolic amounts must also be optimized. Future studies could employ simulated annealing algorithms to find these metabolic global minima.

In terms of COMETS simulations for P. simiae, much more can be accomplished. For starters, this particular study employed glucose as a primary carbon source for growth in both experiments; future studies would involve utilizing a similar fitting process for other carbon source models, such as fructose. Additionally, this study only concerned global Vmax and Km values that dictated the general shape of the growth curve, influencing all enzymes in P. simiae. To further decrease cost, Vmax and Km values of individual exchange reactions in the given simulation should also be optimized using simulated annealing. This process was explored in a preliminary fashion with generally positive results, but the solutions were not finalized enough to include in this study.

Small errors in data collection may have occurred through lack of significant figures, but such discrepancies are small enough to rule out. Experimental data was not collected in this study and was instead taken from Adam M. Deutschbauer at the Berkeley Lab. This study cannot directly attest to the accuracy of this data, nor can it attest to the genome-scale model that was created for this simulation, as this was also provided from an outside source. The COMETS software is still in a developmental stage and is thus subject to regular change and bug fixing. Future studies may need to account for changes in the software.

While this study only focused on improving one particular aspect of the m-CAFEs project, its strides should not be overlooked; by enhancing the accuracy of COMETS simulations for given plant root bacteria, one can better determine their effect on plant growth. The potential for artificial root biome editing with CRISPR-Cas and RNAi is vast; this potential can only be realized with accurate simulations of such root bacteria. Understanding the behavior of these bacteria is crucial in plant biomass optimization, which in turn will make plant biofuels a more feasible and sustainable source of clean energy.

Acknowledgements

I would like to thank my project mentor, Ilija Dukovski, as well as Haroon Qureshi and Hui Shi for their contributions to this project.

Works Cited

m-CAFEs. (n.d.). Mcafes.lbl.gov. Retrieved October 5, 2023, from https://mcafes.lbl.gov/

US EPA, OP. (2014, April 17). Economics of Biofuels. US EPA. www.epa.gov/environmental-economics/economics-biofuels

Preston, G. M. (2004). Plant Perceptions of Plant Growth-Promoting Pseudomonas. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 359(1446), 907–918. https://doi.org/10.1098/rstb.2003.1384

Harcombe, W. R., Riehl, W. J., Dukovski, I., Granger, B. R., Betts, A., Lang, A. H., Bonilla, G., Kar, A., Leiby, N., Mehta, P., Marx, C. J., & Segrè, D. (2014). Metabolic Resource Allocation in Individual Microbes Determines Ecosystem Interactions and Spatial Dynamics. Cell Reports, 7(4), 1104–1115. https://doi.org/10.1016/j.celrep.2014.03.070

COMETS. (n.d.). Www.runcomets.org. Retrieved March 4, 2024, from www.runcomets.org/

Orth, J. D., Thiele, I., & Palsson, B. Ø. (2010). What is flux balance analysis? Nature Biotechnology, 28(3), 245–248. https://doi.org/10.1038/nbt.1614

Wetmore, K. M., Price, M. N., Waters, R. J., Lamson, J. S., He, J., Hoover, C. A., Blow, M. J., Bristow, J., Butland, G., Arkin, A. P., & Deutschbauer, A. (2015). Rapid Quantification of Mutant Fitness in Diverse Bacteria by Sequencing Randomly Bar-Coded Transposons. MBio, 6(3). https://doi.org/10.1128/mbio.00306-15

KBase: DOE Systems Biology Knowledgebase. (n.d.). Retrieved October 5, 2023, from https://genomicscience.energy.gov/kbase/

University College London. (2019). The Effect of Substrate Concentration on Enzyme Activity. Ucl.ac.uk. https://www.ucl.ac.uk/~ucbcdab/enzass/substrate.htm

Simulated Annealing Algorithm - an overview | ScienceDirect Topics. (n.d.). Www.sciencedirect.com. https://www.sciencedirect.com/topics/engineering/simulated-annealing-algorithm

Monte Carlo Method - an overview | ScienceDirect Topics. (n.d.). Www.sciencedirect.com. https://www.sciencedirect.com/topics/medicine-and-dentistry/monte-carlo-method

Weisstein, E. W. (n.d.). Simulated Annealing. Mathworld.wolfram.com. Retrieved October 5, 2023, from https://mathworld.wolfram.com/SimulatedAnnealing.html

Liang, F. (2020, April 21). Optimization Techniques Simulated Annealing. Medium. https://towardsdatascience.com/optimization-techniques-simulated-annealing-d6a4785a1de7