3.2 Chang, Hung, Lee, Li, Lucena, Mardjoko, Tsarnakova, Wang, Wang, Zheng. Microbial Forensics: Identifying Bacteria and Yeast Using Ribosomal DNA Fingerprints
The fragment was copied into the Sequence Manipulation Suite’s restriction digest tool, and the test was performed three times, once with each of the restriction enzymes AluI, HaeIII, and MboI. For each enzyme, the program displayed the number of fragments generated and their lengths; these virtual digest results were recorded and compared with the output of our program. Test data were successfully collected from twenty bacteria in this manner.
for each enzyme was calculated by subtracting the position of the last site cut by that enzyme from the length of the original sequence (Figure 5). These fragment lengths were then appended to the list in ‘length_dict’ stored under the key of the enzyme that produced the fragments. Finally, the dictionary ‘length_dict’ was returned with the fragment lengths.
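The fragment length calculation performed by ‘find_lengths’ can be sketched roughly as follows. This is a minimal illustration rather than the original implementation. It assumes that cut sites are given as offsets counted from the start of the sequence, so that the first fragment runs from position 0 to the first cut site, and that an enzyme with no cut sites keeps an empty list. If the cut positions follow a 1-based convention, the first and last fragments may each need an adjustment of one base.

```python
def find_lengths(sequence, seq_dict):
    """Return a dict mapping each enzyme to its list of fragment lengths.

    sequence: the cleaned sequence string that was virtually digested.
    seq_dict: dict mapping each enzyme to a list of cut positions.
    """
    # Same enzyme keys as seq_dict, each starting with an empty list.
    length_dict = {enzyme: [] for enzyme in seq_dict}

    for enzyme, cut_sites in seq_dict.items():
        previous = 0  # assumption: positions are offsets from the sequence start
        for site in sorted(cut_sites):
            # Fragment length = this cut site minus the previous cut site.
            length_dict[enzyme].append(site - previous)
            previous = site
        if cut_sites:
            # Last fragment: sequence length minus the last cut site.
            length_dict[enzyme].append(len(sequence) - previous)

    return length_dict


# Illustrative input only: a 1000-base dummy sequence and made-up cut sites.
example = find_lengths(
    "A" * 1000,
    {"AluI": [120, 450], "HaeIII": [], "MboI": [990]},
)
print(example)
```

With these made-up inputs, the fragment lengths for each enzyme that cuts add up to the sequence length, which is a convenient sanity check on any implementation.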
Bacteria Identification Process
The virtual digest found the restriction enzyme cut sites of AluI, HaeIII, and MboI for each genome sequence in the bacteria database. Then, the lengths of the fragments produced by the restriction enzymes were calculated. The information for each bacterium in the database was stored together in a class. We read the FASTA database file using the SeqIO reading function, parse(). We also iterated through the dataset and performed a manual PCR to check whether each sequence could be identified between the forward and reverse primers.
A BioPython RestrictionBatch object, ‘rb’, was instantiated so that the three restriction enzymes, AluI, HaeIII, and MboI, could be used simultaneously in the virtual digest. The BioPython search function could then be applied to the restriction batch to determine the restriction sites of a given sequence; it returned a dictionary with the three enzymes as keys and lists of restriction cut sites as values.
A ‘BacteriaInfo’ class was created to store the identifying information of a given bacterium concisely in one object. The class contained three instance variables: ‘name’, the bacterium’s name; ‘id’, its accession number; and ‘lengths’, the dictionary of fragment lengths. While iterating through the bacteria database, each sequence was first virtually digested and its fragment lengths were calculated; a ‘BacteriaInfo’ object was then created to store the information.
The function ‘find_lengths’ took two parameters: ‘sequence’, the original genome sequence that was virtually digested, preprocessed to remove all extraneous space and newline characters, and ‘seq_dict’, the dictionary of restriction cut sites returned by the BioPython virtual digest search function.
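Before the details of ‘find_lengths’, the ‘BacteriaInfo’ record described above can be sketched as a small Python class. This is an illustrative sketch, not the original code: the use of a dataclass, the type hints, and the example values (including the placeholder accession string) are assumptions. Only the three field names come from the description.

```python
from dataclasses import dataclass, field


@dataclass
class BacteriaInfo:
    """Identifying information for one bacterium in the database."""
    name: str                                    # bacterium name
    id: str                                      # accession number
    lengths: dict = field(default_factory=dict)  # enzyme -> fragment lengths


# Hypothetical record; the accession and lengths are made-up placeholders.
record = BacteriaInfo(
    name="Example bacterium",
    id="EXAMPLE_0001",
    lengths={"AluI": [120, 330, 550], "HaeIII": [], "MboI": [990, 10]},
)
print(record.name, record.id, sorted(record.lengths))
```

Keeping the three values in one object means that a later comparison step only needs to pass records around instead of parallel lists of names, accessions, and fragment lengths.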
This function created a new dictionary, ‘length_dict’, with the same restriction enzyme keys as ‘seq_dict’ but with empty lists as values to hold the fragment lengths. By iterating through each key in ‘seq_dict’ and each cut site in the list corresponding to that key, fragment lengths were calculated as the difference between a given cut site and the previous cut site. The last fragment length
Figure 5: Visual Representation of Fragment Length Calculation
Possibility Reducer
The possibility reducer was created to identify known bacteria from the database of unnamed bacteria sequences. The algorithm finds possible bacteria matches from the fragment lengths calculated from the cut sites found by the virtual digest. As mentioned before, after the PCR and virtual digest were performed, each bacterium yielded a unique set of ribosomal DNA cut sites and a unique number of cuts. Therefore, the identity of a bacterium can be determined by analyzing its fragment lengths. We created an algorithm to compare the fragment lengths from two individual bacteria records. It analyzed two lists of fragment lengths and determined whether the numbers in the lists were close enough in value. Preconditions were examined first. To prevent comparison with a record that had no fragment lengths for a particular enzyme, empty lists were deemed invalid for comparison with filled lists. Fragment lengths below the minimum of 100 base pairs were filtered out of the fragment length lists and disregarded in the comparison. This minimum denotes the lower bound of the most effective range of DNA sequence lengths. In Figure 5, the two example lists of fragment lengths each contain fragment lengths below 100, and in the ‘filter lists’ step all of those numbers are removed from the lists.
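The precondition check, the filtering step, and the comparison can be sketched as below. The 100 base pair minimum and the rule that an empty list is invalid against a filled list come from the description above. The function names, the tolerance of 10 base pairs, the sorted pairwise matching, and the requirement that both filtered lists have the same number of fragments are assumptions made only for illustration, since the text does not state how "close enough" was defined.

```python
MIN_BASE_PAIRS = 100  # minimum fragment length stated in the text


def filter_lengths(lengths, minimum=MIN_BASE_PAIRS):
    """Drop fragment lengths below the minimum base pair cutoff."""
    return [n for n in lengths if n >= minimum]


def lengths_match(first, second, tolerance=10):
    """Return True if two fragment length lists are close enough in value.

    tolerance is a hypothetical parameter (in base pairs).
    """
    # Precondition: an empty list cannot be compared with a filled list.
    if bool(first) != bool(second):
        return False

    a = sorted(filter_lengths(first))
    b = sorted(filter_lengths(second))

    # Assumption: matching records have the same number of usable fragments.
    if len(a) != len(b):
        return False

    # Compare fragments pairwise after sorting.
    return all(abs(x - y) <= tolerance for x, y in zip(a, b))


print(filter_lengths([50, 100, 99, 250]))
print(lengths_match([50, 120, 330], [118, 335, 20]))
print(lengths_match([], [200]))
```

In this sketch, fragments shorter than 100 base pairs never influence the outcome, so two records that differ only in their short fragments are still treated as a possible match.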