Transcriptomic Analysis of COVID-19 Patients
from PCR - Fall 2021
by Maya Khan (V)
This past summer, I participated in a virtual Transcriptomic Analysis + COVID-19 course run by Milrd, an educational organization in New York City, supported by the Mason Lab at Cornell Medical Center. I was new to transcriptomic analysis, the study of RNA transcripts, but the course gave me a strong foundation and incredible mentors. Through this program, I learned how to analyze a broad collection of data and use specific programs to compare patient samples against reference genomes. This course would be perfect for anyone interested in computational biology and large-scale sequencing surveillance.

The class focused on a two-step analysis of COVID-19 transcriptomic data. While the genome is the complete collection of an organism's nuclear and mitochondrial DNA, the initial product of genome expression is the collection of mRNA copied from the genes during transcription. The transcriptome is this complete set of RNA transcripts, the intermediary stage between genes and proteins. For the analysis, cDNA is synthesized from the single-stranded RNA. Using a series of programs to clean and organize the raw sequencing data, we transformed it into an understandable format. Then, using the Integrative Genomics Viewer (IGV), we generated visualizations of the genomic data and identified point mutations.

Figure 1: Graphic showing the transcriptome’s variability due to alternative splicing.

Figure 2: The IGV program was used to compare the cleaned COVID-19 patient sample and the reference genomes.

Over the past decade, more reliable sequencing methods to track and map genes have been developed. In response to the COVID-19 pandemic, caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), many countries have invested in genome sequencing (a process used in molecular biology to study genomes and the resulting proteins). However, transcriptomic analysis can further explain the complexity of gene expression. The nucleotide sequence found in RNA reflects the code in DNA. By analyzing the transcriptome, researchers can also investigate gene expression and activity in COVID-infected cells. Because of alternative splicing, each gene may produce more than one type of mRNA. Changes in normal gene activity may signal disease and may help guide vaccine development.

To perform our analysis, we cleaned and organized the data in stages using the command-line Terminal found on most computers. In the first stage, we turned our sets of raw DNA reads into a _strain profile_ of the SARS-CoV-2 patient isolate that delineates nucleotide- and amino-acid-level mutations in the sample. This process had two major steps: quality control and mutation analysis. During the second stage, we removed human-derived reads from our samples using a program called Bowtie2 and a human reference genome. Most SARS-CoV-2 patient samples contain human RNA that gets converted to cDNA, but we were not interested in analyzing it. To focus on the important COVID genomic data, we had to get rid of the host cell reads. We then aligned our data with the indexed SARS-CoV-2 reference genome to find mutations. The resulting alignment gave coordinates indicating where mapped segments had mismatches between the reference sequence nucleotides and our COVID-19 patient sample. The discovered mismatches between the reference genome and our sample could be due to two factors: (1) a genuine biological mutation in the sample or (2) an error introduced during sample preparation and sequencing.

After we had aligned the reads in our sample and (hopefully) called a list of high-quality variants, we visualized the results and listed the nucleotide positions and associated amino acid mutations. These catalogued mutations allow genomic epidemiologists to track how a particular virus travels around the globe. We used a genome browser called the Integrative Genomics Viewer (IGV), which is authored and distributed by the Broad Institute of MIT and Harvard.
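To make the mismatch step more concrete, here is a minimal Python sketch of the idea, not the course's actual pipeline (which relied on command-line tools such as Bowtie2). It piles up already-aligned reads over a reference, keeps only the mismatches supported by most reads (to separate likely mutations from one-off sequencing errors), and reports each one at the nucleotide and amino-acid level. The 12-base reference, the reads, the function names, and the `min_depth` and `min_fraction` thresholds are all made up for illustration, and the sketch assumes ungapped reads and a reference that is a single coding sequence starting in frame.

```python
from collections import Counter

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def call_variants(reference, reads, min_depth=3, min_fraction=0.8):
    """Return (position, ref_base, alt_base) tuples with 1-based positions.

    `reads` is a list of (start, sequence) pairs, where `start` is the
    0-based reference coordinate of the read's first base (no gaps).
    The thresholds are illustrative assumptions, not recommended values.
    """
    # Count which bases the reads show at every reference position.
    pileup = [Counter() for _ in reference]
    for start, sequence in reads:
        for offset, base in enumerate(sequence):
            position = start + offset
            if 0 <= position < len(reference) and base in "ACGT":
                pileup[position][base] += 1

    variants = []
    for position, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to trust any call here
        base, support = counts.most_common(1)[0]
        # A mismatch seen in most reads is treated as a likely mutation;
        # one seen in a single read is treated as a likely sequencing error.
        if base != reference[position] and support / depth >= min_fraction:
            variants.append((position + 1, reference[position], base))
    return variants


def annotate(reference, variants):
    """Label each variant as (nucleotide change, amino-acid change).

    Each variant is annotated on its own, so two variants falling in the
    same codon are not combined.
    """
    labels = []
    for position, ref_base, alt_base in variants:
        codon_start = (position - 1) // 3 * 3
        ref_codon = reference[codon_start:codon_start + 3]
        if len(ref_codon) < 3:
            continue  # incomplete codon at the end of the sequence
        alt_codon = list(ref_codon)
        alt_codon[(position - 1) % 3] = alt_base
        ref_aa = CODON_TABLE[ref_codon]
        alt_aa = CODON_TABLE["".join(alt_codon)]
        codon_number = codon_start // 3 + 1
        labels.append((f"{ref_base}{position}{alt_base}",
                       f"{ref_aa}{codon_number}{alt_aa}"))
    return labels


# Toy data: four overlapping reads agree on A->G at position 5, while a
# single read carries a stray T->C at position 9.
reference = "ATGGATGTTTAA"
reads = [(0, "ATGGGTGT"), (2, "GGGTGTTT"), (3, "GGTGTTTAA"), (4, "GTGTCTAA")]

variants = call_variants(reference, reads)
print(annotate(reference, variants))  # [('A5G', 'D2G')]
```

In this toy run, the change at position 5 appears in every read covering it and is reported, while the change at position 9 appears in only one of three reads and is discarded as a probable error. Real variant callers also weigh base quality scores, strand bias, and insertions or deletions, which this sketch ignores.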
Performing sequencing on viral and host cell genomes is important for many reasons. On average, the SARS-CoV-2 genome accumulates about two changes per month. Sequencing SARS-CoV-2 genomes and identifying subtle shifts help researchers follow how the virus spreads. Most of the mutations don’t affect how the virus functions, but a few may change the virion’s transmissibility or severity. Sequencing is an important tool, especially during this pandemic, because it allows lineages and virulent genetic shifts to be tracked around the globe. For example, researchers have found that the coronavirus outbreak in New York City stemmed primarily from Europe, while the SARS-CoV-2 genomes sequenced in Washington State indicate a Wuhan origin. Using the data and mutations, scientists can group global outbreaks and build phylogenetic “family” trees for viruses. These trees are constantly being updated on GISAID and nextstrain.org.

All in all, this was a great class to take to get an inside look at the tools being used to track and sequence SARS-CoV-2. Taking part in this program is a great way to explore current research being done in computational biology. After applying, I chose from a variety of classes and different sessions. The experience was virtual, extremely flexible, and allowed me to pursue my interests in science.
Figure 3: Final results showing the mutations in the patient sample, displayed as a table.