3.2 Chang, Hung, Lee, Li, Lucena, Mardjoko, Tsarnakova, Wang, Wang, Zheng. Microbial Forensics: Identifying Bacteria and Yeast Using Ribosomal DNA Fingerprints
The fragment was copied into the Sequence Manipulation Suite’s restriction digest tool, and the test was performed three times. Each trial used a single restriction enzyme, so the fragment was digested separately with AluI, HaeIII, and MboI. The program displayed the number of fragments generated and their lengths for each enzyme, and the results of this virtual digest were recorded and compared to the results of our program. Test data was successfully collected from twenty bacteria in this manner.
The last fragment length for each enzyme was calculated by subtracting the last site cut by that enzyme from the length of the original sequence (Figure 5). These fragment lengths were then appended to the list in ‘length_dict’ stored under the key of the enzyme that produced the fragments. Finally, the dictionary ‘length_dict’ was returned with the fragment lengths.
Bacteria Identification Process
The virtual digest found the restriction enzyme cut sites of AluI, HaeIII, and MboI for each genome sequence in the bacteria database. Then, the lengths of the fragments cut by the restriction enzymes were calculated. The information for each bacterium in the bacteria database was stored together in a class.
We read the FASTA database file using the SeqIO reading function, parse(). We also iterated through the dataset and performed a manual PCR to check whether the region between the forward and reverse primers could be identified in each sequence.
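The manual PCR step can be illustrated with a minimal sketch. It assumes exact primer matching on a single strand; the function names, the primers, and the template below are hypothetical and are not the ones used in the study.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def virtual_pcr(template, forward_primer, reverse_primer):
    """Return the region bounded by the two primers, or None if either is absent.

    Assumption: exact matching only, so one mismatch in a primer site
    means no product is reported.
    """
    template = template.upper()
    start = template.find(forward_primer.upper())
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so on the template
    # strand we search for its reverse complement downstream of the forward primer.
    rev_site = reverse_complement(reverse_primer)
    rev_start = template.find(rev_site, start + len(forward_primer))
    if rev_start == -1:
        return None
    return template[start:rev_start + len(rev_site)]


# Hypothetical primers and template, chosen only to keep the example short.
template = "TTTTACGTACGGGGAGCTCCCCTTGCAAGGTTTT"
amplicon = virtual_pcr(template, "ACGTAC", "CCTTGCAA")
```

A sequence for which the function returns None would be one where the primers could not be located, which is the check described above.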
A BioPython RestrictionBatch object, ‘rb’, was instantiated in order to apply the three restriction enzymes, AluI, HaeIII, and MboI, simultaneously when performing the virtual digest. The BioPython search function could then be applied to the restriction batch to determine the restriction sites of a given sequence, which were returned as a dictionary with the three enzymes as keys and a list of restriction cut sites as values.
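To make the shape of that dictionary concrete, the following is a dependency-free stand-in written in plain Python rather than the BioPython call itself. The name ‘find_cut_sites’ and the position convention (1-based index of the first base after the cut) are assumptions of this sketch; the recognition sequences are the standard ones for the three enzymes.

```python
# Recognition sequence and the number of site bases to the left of the cut.
# AluI cuts AG^CT, HaeIII cuts GG^CC, MboI cuts ^GATC.
ENZYMES = {"AluI": ("AGCT", 2), "HaeIII": ("GGCC", 2), "MboI": ("GATC", 0)}


def find_cut_sites(sequence):
    """Return {enzyme name: [cut positions]} for a linear sequence.

    Assumed convention: a position is the 1-based index of the first
    base after the cut.
    """
    sequence = sequence.upper()
    sites = {}
    for name, (site, offset) in ENZYMES.items():
        positions = []
        start = sequence.find(site)
        while start != -1:
            positions.append(start + offset + 1)
            # Step forward by one so overlapping sites are not skipped.
            start = sequence.find(site, start + 1)
        sites[name] = positions
    return sites


# Hypothetical 26-base sequence with two AluI sites, one HaeIII site, one MboI site.
seq_dict = find_cut_sites("TTAGCTTTGGCCTTGATCTTAGCTTT")
```

Here ‘seq_dict’ has one key per enzyme and a list of cut positions as each value, mirroring the structure described above.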
A ‘BacteriaInfo’ class was created to concisely store the identifying information of a given bacterium in one object. The class contained three instance variables, ‘name’, ‘id’, and ‘lengths’, with ‘name’ representing the bacterium’s name, ‘id’ representing its accession number, and ‘lengths’ containing the dictionary of fragment lengths. In iterating through the bacteria database, each bacterial sequence was first virtually digested and its fragment lengths were calculated; then a ‘BacteriaInfo’ object was created to store the information for that bacterium.
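A class of this kind might look like the sketch below. The class and field names come from the description above; the use of a dataclass and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BacteriaInfo:
    """Identifying information for one database record."""
    name: str                                     # bacterium name
    id: str                                       # accession number
    lengths: dict = field(default_factory=dict)   # {enzyme: [fragment lengths]}


# Hypothetical record; the name, accession number and lengths are placeholders.
record = BacteriaInfo(
    name="Example bacterium",
    id="XX000000",
    lengths={"AluI": [4, 18, 4], "HaeIII": [10, 16], "MboI": [14, 12]},
)
```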
The function ‘find_lengths’ took two parameters: ‘sequence’, the original genome sequence that was virtually digested, preprocessed to remove all extraneous space and newline characters, and ‘seq_dict’, the dictionary of restriction cut sites returned by the BioPython virtual digest search function. This function created a new dictionary, ‘length_dict’, with the same restriction enzyme keys as ‘seq_dict’ but with an empty list as each value to store fragment lengths. By iterating through each key in ‘seq_dict’ and each cut site in the list for that key, fragment lengths were calculated as the difference between a particular cut site and the previous cut site. The last fragment, which runs from the final cut site to the end of the sequence, was handled separately.
Figure 5: Visual Representation of Fragment Length Calculation
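The fragment length calculation can be sketched as follows. The function and parameter names follow the text, but the body is a reconstruction, and it assumes each cut site is given as the number of bases to the left of the cut; a different position convention would shift the first and last lengths by one.

```python
def find_lengths(sequence, seq_dict):
    """Return {enzyme: [fragment lengths]} for a linear sequence.

    Assumption: each cut site is the count of bases to the left of the cut.
    """
    length_dict = {enzyme: [] for enzyme in seq_dict}
    for enzyme, cut_sites in seq_dict.items():
        previous = 0
        for site in sorted(cut_sites):
            # Fragment between this cut and the previous one (or the start).
            length_dict[enzyme].append(site - previous)
            previous = site
        # Last fragment: from the final cut site to the end of the sequence.
        # With no cut sites this is the whole, undigested sequence.
        length_dict[enzyme].append(len(sequence) - previous)
    return length_dict


# Hypothetical 26-base sequence and cut sites.
sequence = "TTAGCTTTGGCCTTGATCTTAGCTTT"
cuts = {"AluI": [4, 22], "HaeIII": [10], "MboI": [14]}
length_dict = find_lengths(sequence, cuts)
```

Under this convention the fragment lengths for each enzyme sum to the length of the original sequence, which is a convenient sanity check.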
Possibility Reducer
The possibility reducer was created to identify known bacteria from the database of unnamed bacteria sequences. The algorithm finds possible bacteria matches from the fragment lengths calculated from the cut sites produced by the virtual digests. As mentioned before, after the virtual PCR and digest were performed, each bacterium would yield unique ribosomal DNA cut sites and a unique number of cuts. Therefore, the identity of the bacterium can be determined by analyzing the fragment lengths.
We created an algorithm to compare fragment lengths from two individual bacteria records. It analyzed two lists of fragment lengths and determined whether the numbers in these lists were close enough in value.
Preconditions were examined first. To prevent comparison against a record with no fragment lengths for a particular enzyme, empty lists were deemed invalid for comparison with filled lists. Fragment lengths below the minimum of 100 base pairs were filtered out of the fragment length list and disregarded in the comparison. This minimum value marks the lower cutoff of the most effective range of fragment sizes. In Figure 5, the two example lists of fragment lengths each contain fragment lengths below 100, and in the ‘filter lists’ step all of those numbers are removed from the list.
After the lists were filtered, the algorithm checked whether the two sequences contained the same number of fragments. If the numbers differed, the algorithm classified the two sequences as non-matching and did not execute a comparison. If the two sequences contained the same number of fragment lengths, the fragment lengths were then sorted in increasing order.
During the comparison, the algorithm took the difference between the corresponding lengths and evaluated whether this difference was within 10% (the use of this percentage will be expanded on in the percent difference number section).
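The comparison steps described so far can be sketched as one function. The 100 bp minimum and the 10% threshold come from the text; the function name, the choice to measure the difference relative to the larger of the two lengths, and the treatment of two lists that are both empty are assumptions of this sketch.

```python
MIN_BP = 100       # fragments shorter than this are ignored
TOLERANCE = 0.10   # allowed relative difference between paired lengths


def lists_match(lengths_a, lengths_b, min_bp=MIN_BP, tolerance=TOLERANCE):
    """Return True if two fragment-length lists are close enough to match."""
    # Precondition: a list with no fragment lengths is not comparable.
    # (Assumption: this also rejects the case where both lists are empty.)
    if not lengths_a or not lengths_b:
        return False
    # Filter out short fragments, then sort in increasing order.
    a = sorted(x for x in lengths_a if x >= min_bp)
    b = sorted(x for x in lengths_b if x >= min_bp)
    # Different fragment counts mean no match, with no further comparison.
    if len(a) != len(b):
        return False
    # Pairwise check; the difference is taken relative to the larger value
    # (assumption). Two lists left empty by the filter pass trivially.
    return all(abs(x - y) <= tolerance * max(x, y) for x, y in zip(a, b))


# Hypothetical fragment lists.
same = lists_match([520, 85, 310], [300, 500, 40])   # short fragments dropped
different_count = lists_match([520, 310], [300])
too_far = lists_match([500], [400])
```

Sorting before pairing means the result does not depend on the order in which the fragments were recorded.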
Figure 6: Matching Decision Process Diagram
We iterated through the records in the database; for each bacterial sequence record, a virtual digest was performed and the fragment lengths were calculated from the resulting cut sites. Comparisons were made between the bacteria from the database and the lab dataset. Each bacterium had fragment lengths from three enzyme digests (AluI, HaeIII, and MboI). In the comparison process, the fragment length lists for each corresponding enzyme were compared using the aforementioned algorithm. If the fragment-comparing algorithm reported a match for all three enzymes, the bacterial sequence was added to a list of possible matches. Figure 6 shows an example of this process and Figure 7 illustrates what a potential match would look like.
Chapter 3. Life Science
Figure 7: Comparing Bacteria Fragments from Each Enzyme
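The three-enzyme rule can be sketched as a filter over the database. The function names, the record layout (name paired with a dictionary of lengths), and all of the data below are hypothetical; the compact comparison helper carries the same assumptions about the 10% rule noted in the text, with the difference measured relative to the larger length.

```python
ENZYME_NAMES = ("AluI", "HaeIII", "MboI")


def lists_match(a, b, min_bp=100, tolerance=0.10):
    """Compact fragment-list comparison: filter, count check, sort, 10% rule."""
    if not a or not b:
        return False
    a = sorted(x for x in a if x >= min_bp)
    b = sorted(x for x in b if x >= min_bp)
    return len(a) == len(b) and all(
        abs(x - y) <= tolerance * max(x, y) for x, y in zip(a, b)
    )


def possible_matches(unknown_lengths, database):
    """Return names of database records that match on all three enzymes."""
    return [
        name
        for name, lengths in database
        if all(
            lists_match(unknown_lengths.get(e, []), lengths.get(e, []))
            for e in ENZYME_NAMES
        )
    ]


# Hypothetical database records and lab sample.
database = [
    ("Strain A", {"AluI": [210, 430], "HaeIII": [640], "MboI": [150, 490]}),
    ("Strain B", {"AluI": [210, 430], "HaeIII": [300, 340], "MboI": [150, 490]}),
]
unknown = {"AluI": [200, 450], "HaeIII": [620], "MboI": [155, 480]}
matches = possible_matches(unknown, database)
```

In this example Strain B agrees on two enzymes but has a different number of HaeIII fragments, so only Strain A is kept as a possible match.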
Results
Across the 18 tested bacteria, there were 8328 fragment matches with respect to the reference database. Although our fragments were compared within a 10% size tolerance rather than exactly, these results are nevertheless indicative of the functionality of the program. They indicate that the fragment sizes obtained from the PCR and digest can be matched with established and proven fragment data. Looking more closely at the matches, 15 of the bacteria had at least one match, while three bacteria had no matches among the fragments from the database. The exact names of the bacteria and the number of matches are shown in Table 5.
Table 5: Number of Matches for Each Bacteria in the Test Dataset
Discussion
Our results had to be validated in order to see whether or not our program was in fact correctly matching our code-obtained fragments with the fragments from the database. There were multiple matches with the database fragments. It can be assumed that the multiple fragment matches are a result of the prevalence of the 16S region. It is highly conserved across many different strains of bacteria; therefore, it is likely that some bacteria types in the database may yield very similar fragment sizes, producing multiple matches to our test fragments. For the three bacteria that did not produce matches, this indicates that the fragments obtained from our program did not match any bacteria from the database within the 10% allowable range.
Most of the sixteen bacteria that had multiple matches, such as Campylobacter coli, likely had such results because, similar to the other four bacteria, there were multiple strains of each bacterium in the database. Since the 16S region is highly conserved across bacterial organisms, it is possible that certain sequences, and thus fragments, appear in more than one strain, producing more than one match. Some of the matches and non-matches were manually validated after running the program. We verified the resulting matches and the legitimacy of the program by checking for occurrences of each test bacterium in the bacteria database, as shown in Table 6. We found false positive matches for four of our test bacteria: Shigella dysenteriae, Obesumbacterium proteus, Yersinia kristensenii, and Enterobacter cloacae. Three false negative results were found as well: Helicobacter pylori, Haemophilus influenzae, and Haemophilus parainfluenzae.
Table 6: Number of Occurrences of Test Bacteria Species in the Bacteria Database
The sequences could have contained small errors such as single nucleotide insertions or deletions. The virtual PCR and restriction enzyme digest have no tolerance for such small errors and therefore may produce incorrect results. Also, the limitations of the gel electrophoresis done in the lab were not fully taken into account. If the cut fragments were too small to be resolved by gel electrophoresis, or if there were more fragments in the wet work than expected, the algorithm was not tolerant of this. For all of these errors, the tool can be further programmed to be more tolerant, whether regarding gaps, sequence inaccuracies, or the number of fragments from the wet work that are to be compared. The aim would be to still give the correct bacteria identification without giving too many possible bacteria choices.
Table 7: Comparison of Lab Data to Test Data for E. coli
To further ensure the validity of the test data obtained from the program, we compared the test fragments with real laboratory-derived data from a PCR and restriction digest, as seen in Table 7. For MboI, the fragments were relatively similar. However, for AluI and HaeIII, it was clear that the test data contained more fragments than were actually produced by a traditional restriction digest. As with the limitations of gel electrophoresis, this deviation in fragment number was most likely due to our program’s simplified matching approach. We used a restricted set of enzymes and primers; therefore, when we consider the possibility of errors in the initial input sequence for the test data, it is inevitable that the resulting fragment sizes would vary such that a single fragment from the lab data is seen as multiple fragments in our program output.
Finding ways to efficiently identify unknown microorganisms provides a stronger understanding of the natural world and allows researchers to pursue novel innovations.
With further development, our program could assist in producing affordable, high-quality identifications of unknown bacteria and yeasts. The identification of microorganisms is important to the scientific community in numerous ways. Recent research has investigated how microbes could even be engineered for use as biofuels, in therapies, and more. For example, it has been proposed that the bacterium Alcanivorax borkumensis could be used to clean up oil spills after its genome was sequenced and the enzymes it uses to break down oil were identified [4]. The enzymes, hydroxylases, were efficient at breaking down oil in both water and soil, and broke down around 80 percent of various crude oil compounds. As a result of these enzymes, A. borkumensis could be used for efficient cleanup of oil spills [4]. Another application of 16S rDNA fingerprinting is in the seafood industry, specifically in the traceability of seafood. Fingerprinting techniques (specifically, PCR-DGGE) were used in one 2017 study to determine the geographic origin of sea bass by analyzing samples of sea bass skin mucus. The data show that fish from different geographical locations had unique operational taxonomic units (OTUs), and that PCR-DGGE was