NU Sci Issue 47: Bloom

Math

Hidden Markov Models for biochemical applications

BY DINA ZEMLYANKER, DATA SCIENCE & BIOCHEMISTRY, 2024

DESIGN BY KATIE GREEN, BIOENGINEERING, 2022

As the amount of data in biology expands exponentially, thanks to more efficient laboratory techniques such as next-generation sequencing, machine learning has become a tool for turning this data into contributions to drug development and medicine. One of the most widely used machine learning models is the Hidden Markov Model (HMM).

HMMs are based on Markov chains, mathematical systems that describe the probabilities of transitions between states. For example, a Markov model describing weather could have two states: rainy and sunny. The model would give the probability that tomorrow is sunny given that today was rainy, and vice versa.

HMMs add a complication called observables. These are useful when directly seeing whether it is sunny or rainy is impossible, making these states “invisible,” or hidden. For example, in an HMM predicting the weather, the invisible states would be rainy and sunny, and the observations could be what people outside are wearing. In simpler terms, the model could predict which days in a week were rainy and which were sunny based on people’s outfits.

To determine the sequence of states, researchers use the probability of seeing each observable given a certain invisible state, known as the emission probability. The collection of all of these probabilities is called the emission probability matrix, while the collection of all transition probabilities between states is called the transition probability matrix. These two matrices are combined to determine the most probable invisible state at each step.

For an HMM that predicts protein-coding regions, the invisible states can be exons and introns. Exons are the regions of RNA that code for proteins, and introns are the non-coding regions. Each state in the model has its own set of emission probabilities, based on the observable qualities of exons and introns. This kind of HMM has multiple uses. The first is computing whether or not a sequence is coding, meaning that it contains the instructions to create a protein. Another is predicting the locations of exons and introns in a sequence that is known to code for a protein. For this process, the states are first predicted from the observables. Once the optimal sequence of states has been predicted, it is clear which parts of the sequence belong to which exon or intron based on their locations.

Several algorithms work with these probabilities, including the forward and Viterbi algorithms. The forward algorithm adds up the probabilities of every possible state path to give the overall probability of the observations. The Viterbi algorithm is usually used when trying to determine the entire optimal state sequence rather than just the next state, and it works by recursively finding, at each position, the most probable path ending in each latent state, given the observables.

Out of the plethora of biological applications of HMMs, the most useful have proven to be modeling characteristics of protein families, such as globins and kinases, to help classify new proteins into different protein families. This allows researchers to infer a new protein's qualities and to build multiple sequence alignments, in which related proteins are lined up position by position so that their similarity can be scored. One widely used implementation is the Sequence Alignment and Modeling System (SAM), which uses HMMs for multiple protein sequence alignment and profiling.

Another essential application of HMMs is protein structure prediction, where the structure of an unfamiliar protein is predicted from its sequence using an HMM; this has proven to be very useful in drug development and medicine. Knowing the properties and shapes of proteins, researchers can design drugs that ameliorate the proteins' negative effects by altering their structures and interrupting their processes. With the help of HMMs, scientists can develop new medicines much more cheaply and quickly.
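As a rough illustration of the Viterbi decoding described above, here is a minimal Python sketch built on the weather example. The outfit names, the starting probabilities, and every number in the two matrices are made-up assumptions chosen for illustration; they do not come from any real model. A gene-finding HMM would follow the same pattern with states such as exon and intron.

```python
import math

# Hypothetical weather HMM. All probabilities are illustrative assumptions.
STATES = ("rainy", "sunny")
START = {"rainy": 0.5, "sunny": 0.5}
# Transition probability matrix: TRANSITION[today][tomorrow]
TRANSITION = {
    "rainy": {"rainy": 0.7, "sunny": 0.3},
    "sunny": {"rainy": 0.4, "sunny": 0.6},
}
# Emission probability matrix: EMISSION[hidden state][observed outfit]
EMISSION = {
    "rainy": {"raincoat": 0.8, "t-shirt": 0.2},
    "sunny": {"raincoat": 0.1, "t-shirt": 0.9},
}


def viterbi(observations, states=STATES, start=START,
            transition=TRANSITION, emission=EMISSION):
    """Return (most probable hidden-state sequence, its log probability)."""
    if not observations:
        return [], 0.0

    # best[t][s] is the log probability of the best path that ends in
    # state s after the first t + 1 observations. Logs avoid underflow
    # on long sequences.
    best = [{s: math.log(start[s]) + math.log(emission[s][observations[0]])
             for s in states}]
    back = []  # back[t][s] is the state that preceded s on that best path

    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Pick the previous state that makes arriving in s most probable.
            prev = max(states,
                       key=lambda p: best[-1][p] + math.log(transition[p][s]))
            scores[s] = (best[-1][prev]
                         + math.log(transition[prev][s])
                         + math.log(emission[s][obs]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Start from the best final state and follow the pointers backwards.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]
```

With these assumed numbers, `viterbi(["raincoat", "raincoat", "t-shirt", "t-shirt"])` decodes the four days as rainy, rainy, sunny, sunny. Changing the matrices changes the answer, which is why real applications estimate them from data.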

Genomics, Proteomics & Bioinformatics (2004). DOI: 10.1016/S1672-0229(04)02014-5
Current Genomics (2009). DOI: 10.2174/138920209789177575

PHOTO BY SHUTTERSTOCK
