Technical and Operational Assessment of Genomic Sequencing Platforms by Biological Defense Therapeutics

August 28, 2013

LANL FOR JPM-MCS

TECHNICAL AND OPERATIONAL ASSESSMENT OF GENOMIC SEQUENCING PLATFORMS

Final Report

David Bruce, Shannon Johnson, Matthew Scholz, Momchilo Vuyisich and Gary Xie


Contents

Summary and Background
  Executive summary
  Introduction to next generation sequencing (NGS)
Section 1: Technical Assessment of Genomic Sequencing Platforms
  Chapter 1: Applications of Next Generation Sequencing Technologies
  Chapter 2: Requirements for Pathogen Detection and Characterization by NGS
  Chapter 3: Comparative Analysis of Performance of Current Sequencing Platforms
  Chapter 4: Survey of Sequencing Centers & Platform Vendors
Section 2: Operational Assessment
Appendices
  Appendix 1: Glossary
  Appendix 2: Analysis Pipelines
  Appendix 3: List of software packages mentioned
  Appendix 4: Comparative Analysis of Performance of Current Sequencing Platforms
  Appendix 5: Survey to Sequencing Centers and Platform Vendors
References Cited

Summary and Background

Executive summary

Next Generation Sequencing (NGS) is rapidly becoming the technology of choice for detection and characterization of pathogens in clinical and environmental samples. Until recently, NGS was a slow and costly process. However, it is becoming cost-competitive and sufficiently rapid for many applications. Even though NGS is unlikely to replace the rapid and portable pathogen detection platforms in the near future, in many cases it will provide actionable information faster than the current rapid systems. This is mainly due to the vast amounts of data that NGS provides. It is the only technology that can perform all of the following tasks in parallel from almost any sample: 1) detect all known pathogens: viruses, bacteria, and protozoa, 2) identify emerging pathogens, whether they have naturally evolved or been engineered, and 3) characterize the pathogens (for example, determine antibiotic resistance or pathogenicity markers).

Within a decade, it is conceivable that NGS applications will contribute to generating a world map displaying the real-time status of all infectious diseases. The data will be provided by a global network of inter-connected NGS-utilizing clinics. Clinical and sequencing data, combined with the computational models of disease progression and easy visualization, will enable accurate prediction and monitoring of disease spread, and reduce the effects on human lives and local economies. Additionally, increased understanding of gene features and function will improve understanding of genomic markers for potential disease.

With the existing or forthcoming hardware and software upgrades, NGS technology will provide actionable information in 16-48 hours, depending on the platform, the number of samples, and types of information needed. The simplest process includes detection of known pathogens and determination of some of their features, such as antibiotic resistance. More complex processes will involve identification of novel pathogens in mixed samples (clinical or environmental), prediction of their pathogenicity and susceptibility to antibiotics, and matching their identities to pathogens that previously caused serious outbreaks.

As described in detail in Chapters 1 and 2, sequencing data can be obtained with three different pipelines, each providing different amounts and types of information, depending on the user's requirements (Table 1). It is important to note that most of the facility and training requirements for the three pipelines are the same, regardless of which sequencing system is implemented.

The three sequencing pipelines can be implemented in many applications and offer significant advantages over traditional methods for pathogen detection and characterization. Here are some realistic scenarios in which the power of genomics can provide relevant, timely, and actionable information.

Pipeline: 1. Amplicon sequencing
  Description: Rapid sequencing of very small portions of pathogen genomes.
  Actionable information: Identify and characterize known pathogens, and some emerging ones. Able to test 100s of samples in parallel.

Pipeline: 2. Pathogen identification and characterization in mixed samples
  Description: Full sequencing of environmental and clinical samples.
  Actionable information: Identify and characterize known and emerging pathogens, including bacteria, viruses, and protozoa.

Pipeline: 3. Pure culture (isolate) whole genome sequencing
  Description: Whole genome sequencing of one bacterial pathogen isolated from a sample and grown in the lab.
  Actionable information: Can identify sequences associated with specific outbreaks. Allows rapid detection of the same pathogen in future outbreaks.

Table 1: Overview of the three NGS pipelines for pathogen detection and characterization that are described in this document.

Outbreaks. The number and severity of infectious disease outbreaks are likely to increase (in humans and animals) due to global trade and travel, human encroachment on wild environments, higher concentrations of domestic animals, climate change, etc. A portion of future outbreaks will likely be caused by emerging pathogens, many of which will not be detectable by current technologies searching for known pathogens. NGS can detect both known and unknown organisms and, combined with accurate data analysis, can rapidly identify any pathogen and its characteristics. The high resolution data provided by NGS can also enable more accurate mitigation and forecasting of outbreaks.

Monitoring surface contamination in hospitals and living/working areas. Even though decontamination is routinely performed in hospitals, many surfaces still harbor pathogens, often organisms resistant to antibiotics. These pathogens cause millions of hospital-acquired (nosocomial) infections each year, prolonging existing infections and causing additional ones. In addition, infected (ill) people can spread the pathogens to various surfaces in living or sleeping areas, causing additional infections. Sequencing can be used to test a large number of surfaces for presence of pathogens and their antibiotic resistance, enabling more effective decontamination procedures and minimizing infections.

Discovery and tracking of emerging pathogens in all environments (biosurveillance). Identification and characterization of emerging (mutated, unknown, or engineered) pathogens are particularly challenging for current detection technologies that mostly detect known pathogen signatures. NGS has a unique advantage in this area, since it can detect all living organisms, not just already characterized ones. Not only understanding the geographic distribution and diversity of all pathogens, but also tracking their change in real time, will enable much better prediction of, and more effective responses to, many biological outbreaks. Robust sample preparation methods enable sequencing of a wide variety of environmental and clinical samples: insects, soils, indoor surfaces, domestic and wild animals, and humans.

Table 2 summarizes two example facilities that rely on different NGS platforms. The questions being answered are: What pathogens are in a sample and what actionable information can we gather about them?

Platforms compared: Illumina MiSeq; Life Technologies Ion Torrent PGM / Ion Proton

Lab setup and training
  Both platforms: 2-6 weeks
Lab setup cost (includes computer)
  MiSeq: ~$200,000
  PGM / Proton: ~$200,000 / $375,000
Lab area
  Both platforms: ~220-250 ft2
Power requirements
  Both platforms: ~40 Amps of peak electrical power and standard outlets
Personnel
  Both platforms: Two people with standard microbiological/molecular background
Upgrade path
  MiSeq: Upgrade to HiSeq ($700,000) increases the number of samples in similar amounts of time.
  PGM / Proton: No hardware upgrades required for years (only disposable sequencing chips will change).
System speed for amplicon sequencing
  MiSeq: Hundreds of samples in ~30 hours
  PGM / Proton: Hundreds of samples in ~20/30 hours
System speed for sequencing mixed samples
  MiSeq: ~1 sample in ~30 hours
  PGM / Proton: ~1 / 4 samples in ~20/30 hours
System speed for sequencing pure culture (isolate)
  MiSeq: 1-4 samples in ~48 hours
  PGM / Proton: ~1 / 10 samples in ~38/48 hours

Table 2: Description of two hypothetical laboratory facilities that use NGS for pathogen detection and characterization.

In conclusion, next generation sequencing (NGS) technologies have rapidly improved over the last several years and can now quickly and inexpensively generate large amounts of data. NGS is the only molecular platform whose large data output can match the vast diversity of microorganisms. Combined with clinical information, NGS data can enable appropriate mitigation responses in the shortest amount of time in certain scenarios, such as in cases of unknown outbreaks or hospital infections. As more genomic data sets are obtained about pathogens with known pathogenicity, transmissibility, and susceptibility to vaccines and antibiotics, NGS will be able to provide much more accurate data to inform the choice of the best method of decontamination or treatment for any new pathogen. In addition, the sequencing technology will continue to improve, and should match the speed of today's rapid detectors in just a few years. The sooner NGS is implemented for clinical and biosurveillance applications, the sooner we will have the power to detect, track, and predict the spread and behavior of all pathogens on a global scale.

Introduction to next generation sequencing (NGS)

NGS technologies can be used to sequence almost any sample containing biological material, such as clinical (human, animal), environmental (insects, water, soil, surfaces, plants, etc.), or pure cultures of organisms of interest. Sequencing can include DNA, RNA, or both, depending on the type of information needed. DNA sequencing usually reveals which organisms are present in a sample and their characteristics (specific type). RNA sequencing must be used for RNA viruses, but it is also used to study the activity of organisms during infection, allowing us to better understand their behavior and identify genes that play key roles in pathogenesis, genetic disorders, inflammatory response, etc.

NGS technologies require that RNA and DNA molecules to be sequenced are first converted to NGS libraries. RNA is always converted to DNA first, as currently there are no direct RNA sequencing technologies. Standard library preparation processes include DNA fragmentation and addition of appropriate adapter molecules to DNA fragment ends. Adapters are unique DNA sequences (usually 30-60 bps long) which allow sequencing to occur and can also incorporate barcodes (or indices), providing the ability to multiplex many samples in one sequencing run. Depending on the application, library preparation methods take between two hours and three days.

Once the libraries are prepared, each DNA fragment present in the library is clonally amplified just before sequencing. This process and its automation level depend on the NGS platform. For example, the Illumina platforms use clustering (MiSeq is fully automated), whereas Ion Torrent platforms use emulsion PCR (manual process). Sequencing is always an automated process.

NGS produces vast amounts of data, output as sequencing reads (or just "reads"). A read is a string of DNA nucleotides corresponding to the sequence of the original DNA or RNA molecule in the sample. Each NGS platform outputs reads with three important characteristics: read length, the number of reads, and their quality (fidelity). Read lengths vary (50-10,000 bps) depending on the platform and sequencing kit used. Read numbers vary from as few as 1.5 million to 4 billion per run. Reads can be evaluated independently, or combined into much longer strings of DNA or RNA sequences using a variety of computational tools.

[Figure 1 flowchart: DNA/RNA extraction from a sample -> Preparation of a sequencing library -> Sequencing -> Computational analysis of sequencing reads]

Figure 1: Overview of the next generation sequencing (NGS) process.
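The three read characteristics named above (read length, number of reads, and quality) can be summarized directly from a file of reads. The following is a minimal sketch, assuming reads stored as four-line FASTQ records with Phred+33 quality encoding; the function names and the two example reads are hypothetical and are not taken from this report.

```python
import io
from statistics import mean

def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual

def summarize_reads(handle, phred_offset=33):
    """Return read count, mean read length and mean per-read Phred quality."""
    lengths, quals = [], []
    for _name, seq, qual in read_fastq(handle):
        lengths.append(len(seq))
        # Assumption: Phred+33 encoding (quality = ASCII code - 33).
        quals.append(mean(ord(c) - phred_offset for c in qual))
    return {
        "reads": len(lengths),
        "mean_length": mean(lengths),
        "mean_quality": mean(quals),
    }

# Two made-up reads, for illustration only.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\n555555\n")
stats = summarize_reads(example)
print(stats)
```

In practice a run contains millions of reads, so the same summary would be computed while streaming through the file rather than from an in-memory string.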

For many applications, sequencing offers great advantages over traditional biochemical methods. For example, in the field of pathogen detection, NGS can identify not only known organisms, but also novel, emerging and engineered ones. This is highly relevant, especially for rapidly-evolving and highly diverse organisms, such as RNA viruses and Burkholderia spp. In addition, unlike traditional detection methods, NGS does not require prior knowledge of the pathogens present in a sample. Therefore, NGS shows promise as the ultimate pathogen detection tool.

Other application areas where NGS will play a significant role include pathogen characterization (strain typing, antibiotic resistance, etc.), bioforensics, biometrics and biosurveillance. For NGS to succeed in all these applications, basic studies and databases that correlate genotype (genetic sequence of an organism) to phenotype (the behavior of an organism, such as its pathogenicity, transmissibility, resistance, etc.) are required.

The promise of NGS cannot be realized without significant investments in data analysis (typically referred to as "bioinformatics"). Analysis of NGS data is highly specialized, depending on the application and level of resolution desired for a particular analysis. There are several levels of analysis possible for pathogen identification and characterization, each with different requirements and ability to reach more or less detailed conclusions.

In conclusion, NGS shows promise for improving current microbiology, virology, and molecular biology methods and for providing new data streams that will help us understand the current state of pathogens and anticipate future changes in the microbial world.

Physical and personnel requirements

The sequencing process includes laboratory processing of samples and data analysis. Each NGS platform has slightly different requirements for space, electrical supply, equipment and personnel. The platforms discussed in this document in general require ~250 ft2 of temperature-controlled laboratory space with multiple instruments and one computer. More information is provided in Chapter 2.

Section 1: Technical Assessment of Genomic Sequencing Platforms

Chapter 1: Applications of Next Generation Sequencing Technologies

In addition to the generation of sequencing data in the laboratory, data analysis is an essential component of the process. The large amount of information that NGS provides can be used to rapidly identify and characterize potential biothreat agents from both pure and mixed samples. This chapter describes three analysis pipelines that utilize different input data types for pathogen identification and characterization. Each pipeline has advantages and disadvantages compared to other NGS pipelines and traditional detection approaches. They offer different sequencing throughputs and information, and require different levels of expertise and equipment. Selection of the desired analysis pipeline(s) should be based on the NGS application.

Current biothreat agent identification/characterization methods

Sensitive and specific detection and characterization of bacterial and viral pathogens is essential for rapid and accurate decision making. Molecular detection methods have mostly replaced the traditional culturing techniques. Nucleic acid and antibody-based assays, such as PCR and lateral flow detectors, allow rapid and accurate detection of known pathogens from many sample types, thus decreasing the time required for pathogen identification. However, these assays rely on prior knowledge of the target pathogens and must be redesigned when new strains or variants are discovered or when the signature they target is discovered in non-pathogenic organisms [1-3]. Table 3 shows a comparison of several molecular methods for detection and characterization of pathogens.

Technique: Immunoassays
  Sensitivity: Low to moderate
  Primary advantage: Rapid and low cost
  Primary disadvantage: Poor specificity

Technique: PCR (single and multiplex)
  Sensitivity: Very high
  Primary advantage: Sensitivity and specificity
  Primary disadvantage: Low throughput

Technique: Microarrays
  Sensitivity: Moderate to high
  Specificity: Moderate to very high
  Primary advantage: Highly multiplexed
  Primary disadvantage: Low throughput

Technique: Sequencing
  Sensitivity: Very high
  Specificity: Very high
  Primary advantage: Very high content
  Primary disadvantage: Cost and time

Table 3: Table of pathogen ID/characterization tools currently available compared to Sequencing based methods.

Next Generation Sequencing (NGS) for biothreat agent identification and characterization

Use of NGS for pathogen identification and characterization offers a highly sensitive and specific method to accurately identify pathogens from many sample types. It has several advantages:

1) It can detect multiple pathogens simultaneously in a single sample.
2) It can utilize universal methods for all pathogenic microbes, including unculturable, or not yet culturable, and hard to detect pathogens.
3) It can detect emerging (mutated, novel, and engineered) pathogens.

Actionable information generated from analysis

Applications of NGS methods in microbiology and virology are not limited to high-throughput whole genome sequencing. NGS is an essential tool for discovery of new microorganisms, investigation of microbial communities in various environments, tracking rapid evolution of viruses, and detection of drug-resistant mutations in pathogens [4-6]. Here we emphasize the actionable information generated from NGS data analysis, such as antibiotic resistance determinants for defining drug susceptibility patterns and guiding treatment of infectious disease, accurate and definitive pathogen identification, and tracking of disease outbreaks associated with microbial and viral infections.

Pipelines

Three pipelines for NGS data analysis are described in this document, each with different applications, data requirements and types of actionable information it can generate. In this section, each of these methods is outlined at a high level. Additionally, a workflow of analysis methods and other technical details are described in Appendix 2.

Method: 1. Amplicon sequencing
  Description: Rapid sequencing & analysis of very small portions of pathogen genomes (signatures)
  Pros:
    - High sensitivity and specificity at species level
    - Limited ability to detect novel pathogens
    - High throughput
  Cons:
    - Specificity & classification depend on the choice of primers
    - Minimal characterization
  Computational requirements: Low

Method: 2. Pathogen identification and characterization in mixed samples
  Description: Sequences all living organisms in any sample (environmental, clinical, etc.)
  Pros:
    - Does not rely on culturing
    - Accurate for abundant pathogens
    - Identifies and characterizes emerging pathogens
  Cons:
    - Reduced sensitivity
    - High computational requirements for assembly-based data analysis
    - Can be difficult to interpret
  Computational requirements: Low to high

Method: 3. Pure culture (isolate) whole genome sequencing
  Description: Sequencing of one cultured pathogen.
  Pros:
    - Sequences a pathogen to very high accuracy and coverage
    - Can characterize virulence factors and antibiotic resistance genes
  Cons:
    - Requires isolation
    - Requires additional computational power
    - Not a detection method
  Computational requirements: Low to medium

Table 4: Analysis methods or "pipelines" described in this report

1. Amplicon sequencing

Overview

Amplicon sequencing is the deep sequencing of PCR products (amplicons) generated with known primers. The process is well established and does not require culturing. It uses NGS to sequence PCR reactions that have been traditionally used as detection assays. This NGS method can run hundreds of tests on hundreds of samples in a single sequencing run, enabling very high throughput and low per-sample cost. This approach can detect all known pathogens and characterize them to a desired depth (e.g. known markers for antibiotic resistance and pathogenicity). However, it has limited ability to detect emerging pathogens, because the PCR assays have to be designed to amplify known sequences. The laboratory process needed to generate the data is very mature and can be easily semi-automated.

Amplicon sequencing data enables rapid and accurate detection and classification of known pathogens at any taxonomic level (even strain). Trained bioinformatics experts are needed only if amplicon data are used to detect emerging pathogens. For pathogen identification analysis, the wide range of available signatures means that specificity and sensitivity are both high for identifying known pathogens at a strain level.

Data analysis and requirements

The data are generated in the laboratory by sequencing highly multiplexed PCR reactions. False positive results are not a great concern, since each amplicon is fully sequenced. Current NGS platforms provide long enough reads to sequence amplicons of up to 400 bps, providing detailed sequence information within short stretches of the genome. Data analysis can be fully automated when looking for known pathogens and their features. Detection and characterization of emerging pathogens requires some manual data analysis and expertise.

Due to the maturity of the characterization tools available, the computational requirements for amplicon-based analyses are relatively low. A desktop computer operated by a trained person can handle the majority of the tasks. These tools are primarily developed to run on Linux; therefore the person conducting the analysis should be familiar with the Linux operating system and have some biology training.
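Running many samples in one sequencing run relies on the barcodes added during library preparation, which let reads be sorted back to their sample of origin. The following is a minimal sketch of that sorting step, assuming an exact-match barcode at the start of each read; the barcode sheet, sample names and reads are hypothetical, and vendor or instrument software typically performs this step in practice.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample sheet; real barcodes come from the library kit.
BARCODES = {"ACGT": "sample_A", "TGCA": "sample_B"}

def demultiplex(reads, barcodes):
    """Group reads by their leading barcode and strip the barcode from each read."""
    length = len(next(iter(barcodes)))  # assumes all barcodes share one length
    bins = defaultdict(list)
    for read in reads:
        sample = barcodes.get(read[:length], "unassigned")
        # Keep unassigned reads intact so they can be inspected later.
        bins[sample].append(read[length:] if sample != "unassigned" else read)
    return dict(bins)

# Made-up reads, for illustration only.
reads = ["ACGTGGGCCC", "TGCATTTAAA", "ACGTGGGCCA", "NNNNAAAAAA"]
by_sample = demultiplex(reads, BARCODES)
for sample, sample_reads in by_sample.items():
    print(sample, sample_reads)
```

Exact matching is the simplest possible rule; tolerating one mismatch in the barcode would recover more reads at the cost of a small risk of misassignment.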

2. Pathogen identification and characterization in mixed samples

Overview

Sequence-based metagenomics involves extracting and sequencing DNA directly from a mixed sample (such as a soil sample, or a blood sample of an infected individual). This method can rapidly identify both known pathogens and virulence genes present in the sample, without attribution to a specific agent. The greatest challenge with sequence-based metagenomics is the large number of sequences without significant similarity to previously sequenced genes or organisms. Without known reference sequences, virulence and resistance genes cannot be easily identified in the metagenomes, and a significant portion of metagenomic reads cannot be annotated or assigned a taxonomy. Given the amount of novel sequence in metagenomic shotgun reads, read-based classification methods may fail to recognize novel pathogens present in the samples. Therefore, several computational tools based on metagenomic de novo assemblies can be applied. Once contigs have been annotated, the pathogenicity of the pathogens in the community can be inferred by comparing the metagenomic sequences to large databases of pathogens, virulence factors and antibiotic resistance genes (for the abundant pathogens).

Data Analysis

Sequencing and analysis of metagenomic mixed DNA samples can yield valuable information; however, the target of interest (potential biothreat agent) will often be only a very small fraction of the reads. Typically, trimming and QC of the data will be required before use for read mapping. Removal of expected contaminating sequences (e.g. removal of human reads from clinical samples) can also be performed.

Benefits/Drawbacks

The obvious advantage of the metagenomics approach is that culturing and isolation are not required. Metagenomics can therefore identify difficult to isolate or especially dangerous pathogens. Identification is limited to the pathogens represented in the reference set used for classification and will not necessarily reveal novel or non-targeted potential biothreat agents in a sample. Therefore some level of false negatives is likely to occur, particularly for low abundance pathogens. All metagenomic analysis is limited by the depth of coverage generated by the sequencing technology. More complex samples, or low pathogen loads, will require more sequencing. This increased sequencing requires proportionately greater computational time and power.

Requirements and Personnel requirements

Preliminary analysis, including testing for known pathogens and virulence factors, requires tools to classify reads and some degree of analysis by the user. For more detailed analysis, including assembly of the pathogen genome, the computational and training requirements are very high, requiring assembly, annotation and analysis. While tools are constantly being developed to improve metagenomic analysis, they are not currently at Technology Readiness Levels (TRL) 4 or above. Assembly of metagenomes requires a single machine with large RAM and use of proprietary software (CLCBio) to assemble contigs, or the ability to transfer data to a location with these capabilities. Additionally, annotation software and analysis tools listed in the appendix would be required. Extensive expertise in microbiology would also be required to perform these analyses. We therefore recommend that all such analysis be conducted by a highly skilled bioinformatician on a computer cluster at a CONUS laboratory instead. A high-speed internet connection (T1 or higher) will consequently be required for data transfers.
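The link between pathogen load, depth of coverage and sequencing effort described in this section can be made concrete with a back-of-the-envelope calculation. The sketch below assumes reads are spread uniformly over the pathogen genome; the function names and all numbers in the example (run size, read length, pathogen fraction, genome size, target depth) are hypothetical and are not taken from this report.

```python
import math

def expected_coverage(total_reads, read_length, pathogen_fraction, genome_size):
    """Mean depth of coverage of the pathogen genome expected from one run."""
    pathogen_bases = total_reads * read_length * pathogen_fraction
    return pathogen_bases / genome_size

def reads_needed(target_coverage, read_length, pathogen_fraction, genome_size):
    """Total reads required to reach a target mean depth for the pathogen."""
    return math.ceil(target_coverage * genome_size / (read_length * pathogen_fraction))

# Hypothetical scenario: 10 million reads of 150 bp, with 0.1% of reads
# coming from a pathogen that has a 5 Mbp genome.
coverage = expected_coverage(10_000_000, 150, 0.001, 5_000_000)
needed = reads_needed(30, 150, 0.001, 5_000_000)
print(f"expected depth: {coverage:.2f}x; reads for 30x: {needed:,}")
```

Under these assumed numbers the pathogen genome is covered well below 1x, and reaching 30x would take on the order of a billion reads, which illustrates why low pathogen loads demand more sequencing and more computation.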

3. Pure culture (isolate) whole genome sequencing

Overview

If a pure culture of a biothreat agent (pathogen) has been isolated, characterization by sequencing and analysis is possible. If similar pathogens have previously been sequenced and classified, sequence data generated from an isolated pathogen can be used to trace the pathogen's origin and its relationship to other previously characterized pathogens. Further analysis of the presence or absence of genes such as antibiotic resistance or virulence factors can also allow researchers to more accurately classify a potential pathogen, either to determine if it is harmful or to select a treatment regime. To fully classify an unknown or emerging pathogen, assembly of the generated reads would not only result in better comparisons to reference genomes, but also allow annotation. Annotation results can then be examined to identify the biothreat agent and better classify it by its functional capacity, such as antibiotic resistance or other factors. Use of isolate genome assembly and analysis techniques may detect and characterize novel or emerging pathogens.

Input Data

Input for analysis is data from an isolated pathogen sequencing run and data from a related reference genome. Typically, trimming and QC of the sequencing data will be required before use for read mapping.

Benefits/Drawbacks

This is a highly robust and discriminatory method for characterizing the genome of an isolated biothreat agent. It has low to medium computational requirements, so it can rapidly identify a pathogen and its virulence/antibiotic resistance genes. It is more difficult to identify novel genes or DNA molecules that have been introduced into a biothreat agent, leading to potential false-negative results when searching for specific pathogenicity or virulence factors. An assembly-based method is therefore appropriate for identifying foreign elements in a biothreat agent that are not present in the near-neighbor reference. Such elements can arise when a pathogen acquires pathogenicity factors, either naturally or through a genetic engineering event. Since assembly of genomes is a difficult process and requires additional computational power, it takes longer to process. It is recommended that in-depth analysis or assembly be performed by well-trained, experienced personnel.

Hardware Requirements and Personnel requirements

To perform analysis of genomic differences, rapid alignment tools, a reference genome and its annotation are all required. To maintain all required components on a local machine, a multiple-processor computer with sufficient RAM (>4 GB) would be needed. If de novo annotation of assembled sequences is desired, either significant investment in software and hardware for this purpose or high-speed internet access to an external annotation portal will be required. While this is the most powerful of the analysis techniques for isolated genomes, it requires a high degree of expertise (at CONUS laboratories), including understanding of gene annotations and virulence factors.

As described above, the analyst should be familiar with the Linux operating system and have biology training. Alternatively, use of complete genome analysis solutions, such as EDGE or CLC bio, will reduce the need for understanding of Linux. Assembly and analysis of antibiotic resistance and virulence genes can be done at OCONUS laboratories by well-trained personnel. Due to the complexities of in-depth analysis of complete genome assemblies, more detailed analyses or improved assemblies must be handled by highly trained biologists and are not recommended at OCONUS laboratories. This process will have very low value when unknown or emerging pathogens are to be analyzed.
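The presence/absence screening of antibiotic resistance and virulence genes described in this section can be illustrated with a simplified stand-in for read mapping. The sketch below swaps in exact k-mer matching in place of the alignment tools the report refers to, and it ignores sequencing errors and reverse-complement reads; the gene names, sequences, reads and the 0.9 threshold are all hypothetical.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def marker_fraction_covered(marker, reads, k=8):
    """Fraction of the marker's k-mers that occur in at least one read."""
    marker_kmers = kmers(marker, k)
    if not marker_kmers:  # marker shorter than k
        return 0.0
    seen = set()
    for read in reads:
        seen |= kmers(read, k) & marker_kmers
    return len(seen) / len(marker_kmers)

def screen(markers, reads, k=8, threshold=0.9):
    """Report each marker as present (True) or absent (False) in the reads."""
    return {
        name: marker_fraction_covered(seq, reads, k) >= threshold
        for name, seq in markers.items()
    }

# Made-up marker genes and isolate reads, for illustration only.
markers = {
    "resistance_gene_X": "ATGGCTAGCTAGGATCCGATCGTA",
    "virulence_gene_Y": "TTGACCGGTAACGTTAGCCATGCA",
}
isolate_reads = ["ATGGCTAGCTAGGATC", "AGCTAGGATCCGATCGTA"]
result = screen(markers, isolate_reads)
print(result)
```

A screen of this kind can only report genes that are already in the marker set, which mirrors the false-negative risk noted above for novel or introduced genes and is why assembly-based analysis is suggested for those cases.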

Chapter 2: Requirements for Pathogen Detection and Characterization by NGS

This chapter describes the physical and personnel requirements for producing sequence data to be used by the computational pipelines described in Chapter 1. In addition, the three recommended sequencing pipelines are described in some detail, including steps starting from the collection of clinical and environmental samples to the generation of sequencing data in a specific output format (fastq, sff).

Recommended sequencing pipelines and their physical requirements

Based on literature searches, interviews with other sequencing centers (see Chapter 4) and our own studies (see Chapter 3), we make specific recommendations for laboratory processes (pipelines) that transform crude samples into sequencing data. These processes are described below and depicted in Figure 2.

[Figure 2 flowchart: Sample types (Clinical: swabs, blood, stool, biopsy, etc.; Environmental: arthropods, swipes, soil, liquids, etc.) -> Culturing and DNA isolation (for pure cultures only) -> DNA / RNA extraction -> Preparation of sequencing libraries (amplicons or shotgun libraries) -> Sequencing on next generation platforms: MiSeq (Illumina); PGM / Proton (Life Tech / Ion Torrent)]

Figure 2: Overview of the laboratory processes required to generate sequencing data that can provide actionable information.

Sample collection and storage

Samples to be sequenced are of either clinical or environmental origin. Clinical samples are collected from humans or animals, and can be in the form of blood, stool, cerebrospinal fluid (CSF), or swabs from wounds, nose, or throat. Environmental samples may be collected from arthropods (mosquitos, ticks, etc.), surfaces (swipes), soil, or water sources.

Clinical samples are collected by trained medical staff using appropriate collection devices. Environmental samples can be collected by any trained person. Samples that will be sequenced without any culturing should be frozen as soon as possible. Samples that require culturing and isolation of potential pathogens should be stored at 2-8 °C.

Isolation of single microbial clones

For in-depth sequencing of specific pathogens, pathogens first need to be isolated from complex mixtures and grown in pure cultures. This is accomplished with standard microbiological techniques, using selective media or cell lines that are selected based on the symptoms and expected pathogens.

Figure 3: Various types of samples and sample containers.

Figure 4: QIAcube is an inexpensive benchtop platform which can automatically extract DNA or RNA using Qiagen kits. The instrument processes 12 samples in one hour.

Extraction of DNA and RNA (sample prep)

The sample preparation process refers to the extraction of DNA and RNA molecules from the sample. Even though there are a variety of sample types, it is likely that only three DNA/RNA extraction products (kits) are needed to efficiently purify DNA and RNA from any sample. The kits are commercially available from Qiagen and can be automated by use of a relatively small and inexpensive platform called QIAcube. The kits can also be used manually by personnel, requiring only a vortexer and a mini centrifuge.

Figure 5: Nucleic acid extraction with Qiagen kits can also be manually performed, using a vortexer and a centrifuge.

For extraction of DNA from all clinical samples (except stool), the QIAamp DNA mini kit can be used. For extraction of RNA from all clinical samples except stools, a combination of QIAzol and miRNeasy mini kits can be used. For DNA and RNA extraction from stool and environmental samples, QIAamp DNA stool mini kits can be used. The QIAamp DNA stool mini kits can purify RNA, DNA, or both types of nucleic acids from samples. After extraction, the concentration and total amount of extracted DNA and RNA must be determined. The simple, small, and inexpensive Qubit device (available from Life Technologies, formerly Invitrogen) performs this task with sufficient speed and accuracy. Purified DNA should be stored at 2‐8 °C, while RNA samples should be stored at ‐20 °C or ‐80 °C, if available.
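The kit choices and storage temperatures given above can be summarised as a small lookup. The following Python sketch is illustrative only: the function names and the sample categories ("stool", "environmental") are assumptions introduced here and are not part of any Qiagen software.

```python
# Hypothetical lookup of the extraction kits named in the text.
# Category and function names are illustrative assumptions.

STOOL_OR_ENVIRONMENTAL = {"stool", "environmental"}


def recommended_kit(sample_type: str, nucleic_acid: str) -> str:
    """Return the kit the text recommends for a sample type and target."""
    sample_type = sample_type.lower()
    nucleic_acid = nucleic_acid.upper()
    if nucleic_acid not in {"DNA", "RNA"}:
        raise ValueError("nucleic_acid must be 'DNA' or 'RNA'")
    if sample_type in STOOL_OR_ENVIRONMENTAL:
        # The stool kit is described as purifying DNA, RNA, or both.
        return "QIAamp DNA stool mini kit"
    if nucleic_acid == "DNA":
        return "QIAamp DNA mini kit"
    return "QIAzol + miRNeasy mini kit"


def storage_temperature(nucleic_acid: str) -> str:
    """Return the storage condition the text gives for purified material."""
    return "2-8 °C" if nucleic_acid.upper() == "DNA" else "-20 °C or -80 °C"


print(recommended_kit("blood", "RNA"), "/", storage_temperature("RNA"))
```

Any clinical sample type other than stool (blood, CSF, swabs) falls through to the two clinical kits, which mirrors the "all clinical samples except stool" wording above.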

Preparation of sequencing libraries

All NGS platforms require adapter‐tagged DNA fragments as the starting “template” material. Therefore, all DNA molecules must first be converted to sequencing libraries using a library preparation process. Sequencing libraries are DNA molecules containing original DNA from an unknown sample with specific adapter sequences attached to both sides of the molecule. RNA molecules to be sequenced must be converted to cDNA prior to library preparation. For amplicon sequencing (the first pipeline in Chapter 1), specific pathogen sequences (signatures) or bacterial 16S genes are amplified using polymerase chain reaction (PCR). Sequencing adapters with barcodes are added during PCR, enabling very rapid library preparation.

Figure 6: A thermocycler is required during the library prep process, whether DNA fragments or amplicons are to be prepared for sequencing.

For shotgun sequencing, where all nucleic acids present in a sample are sequenced, sequencing libraries must be prepared. The first step in this process is fragmentation of all DNA molecules to the size suitable for sequencing. Covaris instruments are best suited for this purpose, as they provide very reproducible results and are easy to use.

Figure 7: Covaris M220 instrument fragments DNA molecules as the first step in the library preparation process. This model can process one sample every 2 minutes.

Once the DNA is fragmented, many commercially available kits can be used to perform library preparations. Comparisons carried out at LANL, but not yet published, find NEBNext Ultra kits from New England Biolabs provide the best overall performance. They are easy to use, fully automatable (for high sample numbers), inexpensive, robust and require very small amounts of input sample. They also produce excellent sequencing data across a spectrum of microbial pathogens. There are two types of NEBNext Ultra kits, one for preparation of sequencing libraries from DNA samples and the other one from RNA samples. The RNA library prep kit is identical to the DNA kit, with three additional steps to convert RNA to cDNA.

Sequencing platforms

There are currently two sequencing platforms that provide significant advantages over the rest of the NGS systems in terms of cost, speed, data output and physical footprint. These are the Illumina MiSeq and the LifeTech Ion Torrent. The two will be detailed below.

MiSeq (Illumina)

MiSeq is a fully integrated bench‐top device, automatically performing the clonal amplification of library fragments, sequencing by synthesis, and data processing. The final step in a sequencing run is the creation of fastq files that are ready for analysis. The following lists assume the newest MiSeq model and 2×150bp sequencing chemistry are used.

Strengths of MiSeq:
• Fully automated and easy to operate
• Sequencing reagents are pre‐loaded inside a cartridge
• As an Illumina platform, produces the highest quality NGS data (lowest sequencing error)
• A single sequencing run produces ~3.6 Gbps of data, sufficient for sequencing two or more bacterial genomes per run (depends on the genome size)
• Operators require very little training
• Very small footprint (2 feet of the bench top)

Weaknesses of MiSeq:
• A sequencing run takes at least 24 hours (though it is fully automated)

Figure 8: Illumina's MiSeq platform.

PGM (Life Technologies)

The other NGS platform that may be useful for field operations is the Ion Torrent Personal Genome Machine (PGM). Even though it has significant drawbacks compared to MiSeq, it can provide sequencing data in a much shorter amount of time.

As Figure 9 shows, the PGM is accompanied by several other pieces of equipment and the entire system requires hours of manual work (as opposed to fully automated on MiSeq) for clonal amplification of library fragments, bead purification, primer and polymerase loading, chip loading and initiation of PGM.

Figure 9: Life Technologies' Ion Torrent PGM system consists of seven required pieces of equipment: Ion One Touch, Ion ES, chip centrifuge, stir plate, water purification system, compressed argon tank, and PGM.

The following lists assume that the 318 chip and 300bp sequencing chemistry are used.

Strengths of Ion Torrent PGM:
• Length of the sequencing run is about 13 hours (see text above for explanation)
• One sequencing run can provide sufficient data (~1.8 Gbps) to sequence at least one bacterial genome (depends on the genome size)

Drawbacks of Ion Torrent PGM:
• Requires several hours of manual work
• Sequencing reagents are numerous and must be individually handled
• Operators require intensive training
• Utilizes a larger footprint than MiSeq
• Requires a supply of compressed argon gas
• Produces slightly lower quality data (more errors) than MiSeq
• A single sequencing run produces less data than MiSeq

Besides MiSeq and PGM, two other platforms may be of interest for use in field‐based sequencing laboratories.



Ion Proton (Life Technologies) is the larger version of the PGM. The current version of the sequencing chip that the Proton uses has more than ten times higher data output (~10 Gbps) than PGM. It offers shorter run times than the PGM by approximately two hours, due to the introduction of Ion Chef, which automates the process of clonal amplification of library fragments. Life Technologies has already announced plans for higher throughput sequencing chips that provide more data in the same amount of time. Overall, the Ion Proton promises to deliver the best overall performance of any NGS platform. The cost of a new instrument is about $250,000.

HiSeq 2500 (Illumina) is a new instrument that has two different run modes. In the rapid mode, the fully automated sequencing run (just like MiSeq) will take 27 hrs and produce 120 Gbps of sequencing data, sufficient to sequence more than 75 bacterial genomes at once. The cost of a new instrument is about $750,000.

Data storage and transfer and management

Sequencing platforms generate enormous quantities of data. This data must be managed efficiently, processing the raw data into useable data files and storing them for downstream analysis. The Ion Torrent PGM can generate output data files ranging in size from 300MB to 5GB. A MiSeq will typically generate 20GB to 40GB of data per run (HiSeq can generate 3,000GB to 5,000GB of data per run). In order to maintain an archive of the raw and processed data, a file system of ≥50TB may be required. This would allow for archiving of raw data and intermediate analysis results. Depending on project throughput, platform usage, and whether stringent data cleanup is implemented, a more modest sized file system should suffice. If transfer is desired to off‐site locations, connections allowing transfer of gigabytes of data in short timeframes will be required.

Laboratory layouts, equipment, and space and power requirements

Illumina MiSeq platform

Table 5 and Table 6 list required and optional equipment and bench space needed in a laboratory using MiSeq for sequencing of complex samples and pure cultures. Minimum bench space required is approximately 27 feet, which could be further reduced to 21 feet by placing the refrigerator 1, freezer 1 and incubator under the bench. Minimal space requirements may negatively affect the quality of the work performed and compromise the trustworthiness of the data produced, due to potential cross‐contamination of the samples. Therefore, the optional equipment and space are highly recommended.

| Required equipment | Function | Space, feet* | Power, Amps** |
|---|---|---|---|
| Refrigerator 1 | Sample and reagent storage | 2 | 2 |
| Freezer 1 | Sample and reagent storage | 2 | 2 |
| Incubator (shaking) | For culturing microorganisms | 2 | 1 |
| Mini centrifuge | Sample prep, library prep | 1 | <1 |
| Vortexer | Sample prep, library prep | 1 | <1 |
| Heat block | Sample heating | 1 | 2 |
| Thermocycler | PCR amplification | 2 | 7 |
| Covaris M220 | DNA fragmentation | 2 | <1 |
| Qubit | DNA and RNA quantification | 1 | <1 |
| MiSeq | Sequencing | 3 | 4 |
| Working bench | For performing all work | 4 | N/A |
| Sink | For liquid waste and washing hands | 2 | N/A |
| Computer and desk | Computer work and data keeping | 4 | N/A |
| Required space and power | | 27 | <22 |

Table 5: Space and power requirements for MiSeq sequencing. All instruments use standard household outlets ranging from 110-240V. * Required bench-top width. ** Based on 110V and peak power.

| Suggested equipment | Function | Space, feet* | Power, Amps** |
|---|---|---|---|
| Microbiology hood | For sterile microbiology work | 3 | <1 |
| Mini centrifuge | Sample prep, library prep | 1 | <1 |
| Vortexer | Sample prep, library prep | 1 | <1 |
| QIAcube | Sample prep | 3 | 6 |
| Refrigerator 2 | Post-PCR sample and reagent storage | 2 | 2 |
| Freezer 2 | Post-PCR sample and reagent storage | 2 | 2 |
| Post-PCR bench | Performing steps before PCR | 4 | N/A |
| Additional space and power | | 16 | <13 |
| Total space and power | | 43 | <35 |

Table 6: Space and power recommendations for MiSeq sequencing. Notes same as Table 5.

Figure 10: Example laboratory layout for MiSeq sequencing, with sufficient laboratory equipment to produce highly reliable data.
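The totals in Table 5 can be checked with a few lines of arithmetic. The sketch below is a hedged illustration, not part of the original assessment: entries listed as "<1" A are entered as 1 A (an assumption), so the power figure is an upper bound, and the names `REQUIRED`, `UNDER_BENCH` and `totals` are introduced here for illustration.

```python
# Required equipment for the MiSeq laboratory, copied from Table 5.
# Each entry: (name, bench width in feet, peak current in amps or None).
# "<1" A entries are entered as 1 A, so the amp total is an upper bound.
REQUIRED = [
    ("Refrigerator 1", 2, 2), ("Freezer 1", 2, 2),
    ("Incubator (shaking)", 2, 1), ("Mini centrifuge", 1, 1),
    ("Vortexer", 1, 1), ("Heat block", 1, 2), ("Thermocycler", 2, 7),
    ("Covaris M220", 2, 1), ("Qubit", 1, 1), ("MiSeq", 3, 4),
    ("Working bench", 4, None), ("Sink", 2, None),
    ("Computer and desk", 4, None),
]

# Items the text suggests could be placed under the bench.
UNDER_BENCH = {"Refrigerator 1", "Freezer 1", "Incubator (shaking)"}


def totals(items, under_bench=frozenset()):
    """Return (bench feet, peak amps) for a list of equipment entries."""
    space = sum(width for name, width, _ in items if name not in under_bench)
    amps = sum(a for _, _, a in items if a is not None)
    return space, amps


print(totals(REQUIRED))               # bench space with everything on the bench
print(totals(REQUIRED, UNDER_BENCH))  # bench space with three items stowed below
```

Under these assumptions the first call reproduces the 27 feet and the "<22" A bound of Table 5, and the second reproduces the reduced figure of 21 feet quoted in the text.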

Ion Torrent PGM platform

Table 7 and Table 8 list the required and suggested equipment and bench space needed in a laboratory that uses the PGM for sequencing of complex samples and pure cultures. The minimum bench space required is about 33 feet, which could be further reduced to 27 feet by placing the refrigerator 1, freezer 1, and incubator under the bench. Again, minimum space requirements may negatively affect the work quality and compromise the trustworthiness of the data produced, thus optional equipment and space are highly suggested.

| Required equipment | Function | Space, feet* | Power, Amps** |
|---|---|---|---|
| Refrigerator 1 | Sample and reagent storage | 2 | 2 |
| Freezer 1 | Sample and reagent storage | 2 | 2 |
| Incubator (shaking) | For culturing microorganisms | 2 | <1 |
| Mini centrifuge | Sample prep, library prep | 1 | <1 |
| Vortexer | Sample prep, library prep | 1 | <1 |
| Heat block | Sample heating | 1 | 2 |
| Thermocycler | PCR amplification | 2 | 7 |
| Covaris M220 | DNA fragmentation | 2 | <1 |
| Qubit | DNA and RNA quantification | 1 | <1 |
| Ion One Touch | PGM accessory | 2 | <1 |
| Ion ES | PGM accessory | 2 | <1 |
| Chip centrifuge | PGM accessory | 1 | <1 |
| Stir plate | PGM accessory | 1 | <1 |
| Argon cylinder | PGM accessory | 1 | N/A |
| PGM | Sequencing | 2 | 9 |
| Working bench | For performing all work | 4 | N/A |
| Sink | For liquid waste and washing hands | 2 | N/A |
| Computer and desk | Computer work and data keeping | 4 | N/A |
| Required space and power | | 33 | <31 |

Table 7: Space and power requirements for PGM sequencing. All instruments use standard household outlets ranging from 110-240V. * Required bench-top width. ** Based on 110V and peak power.

| Suggested equipment | Space, feet* | Power, Amps** |
|---|---|---|
| Microbiology hood | 3 | <1 |
| Mini centrifuge | 1 | <1 |
| Vortexer | 1 | <1 |
| QIAcube | 3 | 6 |
| Refrigerator 2 | 2 | 2 |
| Freezer 2 | 2 | 2 |
| Post-PCR bench | 4 | N/A |
| Additional space and power | 16 | <13 |
| Total space and power | 49 | <44 |

Table 8: Space and power recommendations for PGM sequencing. Notes same as Table 7.

Figure 11: Example laboratory layout for sequencing with the PGM platform.

Personnel requirements

For each laboratory set‐up recommended above, a single trained technician could perform all tasks. However, the sample throughput of such a laboratory would be fairly low and the equipment would not be utilized to its full potential. An additional technician would increase the throughput and maximize equipment use. The highest expertise level in the entire process described above is required for isolation and culturing of potential pathogens from a complex sample. This task should probably be performed by a trained technician. All other steps use standard operating procedures that can be learned by a non‐technical person. As described earlier, the MiSeq platform requires much less training and manual work than the PGM. This leads to the possibility that one technician could operate a laboratory with two MiSeq platforms just as easily as one PGM.

Throughput of a sequencing laboratory

There are many possible sequencing laboratory set‐ups, depending on the needs that the facility must satisfy.

Table 9 lists some possible scenarios in terms of instrumentation, number of samples, and timelines, and the same data are plotted in Figure 12. From the data, it is clear that the choice of sequencing platforms will depend on the sample throughput requirements for a given facility. Adding another technician is also a way to increase the throughput of a laboratory without any additional equipment. Ultimately, decisions about the number of technicians and the choice of sequencing platforms should be based on the predicted requirements for each facility.

| Sample description | Sample count | Platform | Time to sequence data | Time/sample |
|---|---|---|---|---|
| Pure culture | 1 | MiSeq | 48 hrs | 48 hrs |
| Pure culture | 4 | MiSeq | 48 hrs | 12 hrs |
| Pure culture | 12 | MiSeq | 96 hrs | 8 hrs |
| Pure culture | 1 | PGM | 37 hrs | 37 hrs |
| Pure culture | 4 | PGM | 50 hrs | 12.5 hrs |
| Pure culture | 12 | PGM | 102 hrs | 8.5 hrs |
| Mixed sample | 1 | MiSeq | 30 hrs | 30 hrs |
| Mixed sample | 4 | MiSeq | 54 hrs | 13.5 hrs |
| Mixed sample | 12 | MiSeq | 150 hrs | 12.5 hrs |
| Mixed sample | 1 | PGM | 19 hrs | 19 hrs |
| Mixed sample | 4 | PGM | 58 hrs | 14.5 hrs |
| Mixed sample | 12 | PGM | 162 hrs | 13.5 hrs |
| Pure culture | 12 | HiSeq | 51 hrs | 4.25 hrs |
| Pure culture | 12 | Ion Proton* | 37 hrs | 3.1 hrs |
| Mixed sample | 12 | HiSeq | 33 hrs | 2.75 hrs |
| Mixed sample | 12 | Ion Proton* | 17 hrs | 1.4 hrs |

Table 9: Time required for various tasks in a sequencing lab. These estimates assume that:
1. Pure culture is of an average genome length bacterium (~4 Mbps). MiSeq can sequence 4 samples per run, and PGM 2 samples per run. Estimated time includes 18 hrs for culturing.
2. Two mixed samples can be sequenced by MiSeq and one by PGM in one sequencing run. Estimated time includes 6 hours for sample and library preps for up to 12 samples.
3. One laboratory technician is performing all the work.
* Ion Proton process includes a not-yet-available Ion Chef.
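The MiSeq, PGM and HiSeq rows of Table 9 are consistent with a simple additive model: culturing time (pure cultures only), plus preparation time, plus the number of back-to-back sequencing runs multiplied by the run length. The Python sketch below is an inferred reconstruction, not the authors' stated method. In particular, the 6-hour preparation step for pure cultures and the HiSeq capacity of 12 samples per run are assumptions chosen because they reproduce the tabulated values.

```python
import math

# Run lengths in hours, as quoted in the platform descriptions.
RUN_HOURS = {"MiSeq": 24, "PGM": 13, "HiSeq": 27}

# Samples per run as (pure culture, mixed sample), from the table notes.
# The HiSeq value of 12 is an assumption matching its single-run rows.
SAMPLES_PER_RUN = {"MiSeq": (4, 2), "PGM": (2, 1), "HiSeq": (12, 12)}

CULTURE_HOURS = 18  # pure cultures only
PREP_HOURS = 6      # sample and library prep for up to 12 samples


def time_to_data(platform: str, n_samples: int, pure_culture: bool):
    """Return (total hours, hours per sample) under the additive model."""
    per_run = SAMPLES_PER_RUN[platform][0 if pure_culture else 1]
    runs = math.ceil(n_samples / per_run)  # runs performed one after another
    total = PREP_HOURS + runs * RUN_HOURS[platform]
    if pure_culture:
        total += CULTURE_HOURS
    return total, total / n_samples


for args in [("MiSeq", 12, True), ("PGM", 4, True), ("PGM", 12, False)]:
    print(args, time_to_data(*args))
```

The Ion Proton rows are left out of the sketch because its run length is only described as approximately two hours shorter than the PGM's.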

Figure 12: Time to data for different sequencing scenarios (bar chart; y-axis: time in hours; x-axis: platform, number of samples, and sample type; series: total time and time per sample). Data are also shown in Table 9.

Good laboratory practices (GLP)

Independent of the platform of choice and types of applications, all NGS facilities should follow Good Laboratory Practices (GLP). GLP is a set of administrative and laboratory processes that ensure the results obtained can be trusted and shared among equivalent facilities, and that maximize productivity by minimizing failures. GLP consists of proper laboratory set up, implementing a Quality Assurance (QA) plan, thorough training, following standard operating procedures, keeping records of all work performed and performing Quality Control (QC) on every sample. A specific example of QA/QC would be the use of positive and negative controls during daily operations. It is important to note that the laboratory layouts depicted in Figures 10 and 11 are the minimalist versions of NGS laboratories. If at all possible, multiple rooms should be utilized to perform sequencing applications under GLP guidelines. For example, a proper set‐up would include a sample receiving and processing room, a pre‐PCR room where small amounts of nucleic acids are handled, followed by a post‐PCR room where prepared sequencing libraries are handled and sequencing is performed. This unidirectional process would minimize sample cross‐contamination.

Chapter 3: Comparative Analysis of Performance of Current Sequencing Platforms

This chapter (and Appendix 4) details a direct comparison of various library preparation and sequencing methods for a variety of samples in order to make recommendations regarding the most appropriate chemistries for sequencing in OCONUS settings.

Technologies Analyzed and Samples Utilized

Platforms examined are those assumed to be flexible and rapid enough for deployment to an OCONUS laboratory. These include the Roche 454 FLX Jr., Illumina MiSeq and Ion Torrent PGM. Each sequencing technology has identified weaknesses for particular types of sequencing applications. These weaknesses are examined in detail here and in Appendix 4. Additionally, for the MiSeq sequencing, several library preparations were performed to determine the impact of these preparations on overall sequencing quality. It is generally accepted that the greatest impact on quality of sequencing is the average ratio of G+C to A+T (commonly referred to as GC ratio) in a DNA sample. To examine each technology in detail, we chose three potential biothreat agents spanning a range of GC ratios. Table 10 lists these organisms and several key characteristics.

| Isolate | %GC | Size | Notes |
|---|---|---|---|
| Burkholderia thailandensis | 68% | 6.71Mb | 2 Chromosomes |
| Escherichia coli | 50% | 5.3Mb | Isolate from the Republic of Georgia; 4 Plasmids |
| Bacillus anthracis | 36% | 5.3Mb | Isolated variant of B. anthracis Ames; 1 Plasmid |

Table 10: List of bacterial strains used in comparative study.
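The %GC values in Table 10 are percentages of the genome made up of G and C bases. As a minimal sketch of how such a value can be computed from a sequence string, the Python function below counts only unambiguous bases. The function name `gc_percent` and the decision to ignore ambiguity codes such as N are assumptions made for this illustration.

```python
def gc_percent(sequence: str) -> float:
    """Percentage of unambiguous bases (A, C, G, T) that are G or C."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / (gc + at)


print(gc_percent("ATGCGCNN"))  # N bases are ignored in this sketch
```

For a whole genome, the same calculation would be applied to the concatenated chromosome and plasmid sequences.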

Sequencing Quality

There are generally four types of sequencing error in current NGS platforms: low quality sequence, substitution errors, InDel errors, and loss of genetic material during preparation. Of these error types, sequence data of low quality is the least likely to have a negative impact on analysis. Low quality sequence data can be easily dealt with by trimming low quality bases from the ends of individual sequence reads. The next most impactful type of error is a substitution miscall, where a nucleotide is incorrectly classified (referred to in literature as single nucleotide polymorphisms, or SNPs). These errors can have minor to intermediate effects on sequence analysis or assembly of a genome and must be controlled for, but can typically be overcome by additional sequencing. Slightly more damaging to analyses are the incorrect addition or subtraction of a nucleotide in the sequence (called InDel errors). InDel errors can have much more severe impacts on the quality of analysis, because not only the sequence, but also the order and spacing of sequence data are important for correct analysis. Due to the type of sequencing performed, this type of error is more frequent for Ion Torrent and 454 technologies than for Illumina, but because the issue is well known, there are protocols and software to minimize errors of this type.

Of greatest importance for this analysis, the most difficult error to identify is missing genomic sequence. This flaw is due to the inability of preparation and sequencing methods to sequence every part of a genome. The dangers of this error type are: 1) it causes an unrecoverable loss of information from known organisms (all information about the genes within that region is missing and is not analyzed), and 2) it is difficult to identify such regions for a newly sequenced organism. It is generally assumed that the laboratory preparation techniques before sequencing are primarily responsible for this type of error.

Low Quality Errors

The frequency and impact of low quality bases can be easily determined by quantitatively measuring the number of sequencing reads that are removed from analysis when removing areas of low quality. Table 11 shows the number of original reads and the number of reads remaining after removing poor quality data. While most data sets have some proportion of their reads removed due to quality concerns, 454 and Ion Torrent have the highest proportion. In no case, however, are sufficient reads removed to be concerned about the platforms’ ability to generate sufficient information for analysis.
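Trimming low quality bases from the ends of a read, as described above, can be sketched in a few lines of Python. This is an illustrative example and not the trimming software used in this study: the threshold of Q20, the Phred+33 quality encoding, and the function name `trim_read` are all assumptions.

```python
def trim_read(seq: str, qual: str, min_q: int = 20, offset: int = 33):
    """Trim bases with quality below min_q from both ends of a read.

    seq and qual are the sequence and quality lines of one FASTQ record;
    quality characters are assumed to be Phred+33 encoded.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    scores = [ord(c) - offset for c in qual]
    start, end = 0, len(seq)
    while start < end and scores[start] < min_q:
        start += 1  # drop low quality bases from the 5' end
    while end > start and scores[end - 1] < min_q:
        end -= 1    # drop low quality bases from the 3' end
    return seq[start:end], qual[start:end]


print(trim_read("ACGTACGT", "##IIII#!"))
```

A read that is trimmed to zero length (or below some minimum length) would then be counted as removed, which is the quantity compared in Table 11.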

Platform + Chemistry Roche 454* Ion Torrent PGM MiSeq TruSeq +Betaine NebNext2 +Betaine

Bacillus anthracis

Escherichia coli

Burkholderia thailandensis

Reads

2.71×10^5 1.58×10^6

High Quality Reads 2.47×10^5 1.43×10^6

4.58×10^5 1.33×10^6

High Quality Reads 3.71×10^5 8.75×10^5

N/A 7.17×10^6 9.08×10^6 7.45×10^6

N/A 7.07×10^6 8.99×10^6 7.36×10^6

2.03×10^7 2.13×10^7 2.50×10^7 3.24×10^7

1.97×10^7 2.08×10^7 2.41×10^7 3.14×10^7

2.77×10^5 2.20×10^6 7.16×10^6 6.48×10^6 9.24×10^6 8.57×10^6

High Quality Reads 2.50×10^5 1.98×10^6 7.13×10^6 6.46×10^6 9.08×10^6 8.51×10^6

Table 11: Reads and trimming results for all platforms and chemistries.

Substitution and InDel Errors

Error rates for all samples run are shown in Table 12. Sequencing was performed with all three technologies (MiSeq, Ion Torrent, 454) for 3 pathogens, and the sequencing results were compared to the finished genomes. Each technology has an individual error profile; overall, 454 and Ion Torrent have a significantly higher percentage of InDel errors than Illumina. Additionally, the G/C ratio of the organism appears to increase the substitution error for both Ion Torrent and Illumina.

| Technology | Sample | Insertion Percentage | Deletion Percentage | Substitution Percentage |
|---|---|---|---|---|
| MiSeq | B. anthracis | 0.1% | 0.1% | 2.8% |
| MiSeq | E. coli | 0.0% | 0.1% | 3.6% |
| MiSeq | B. thailandensis | 0.1% | 0.1% | 3.9% |
| Ion Torrent | B. anthracis | 8.5% | 10.2% | 9.7% |
| Ion Torrent | E. coli | 6.1% | 10.6% | 9.0% |
| Ion Torrent | B. thailandensis | 7.8% | 9.2% | 12.7% |
| 454 | B. anthracis | 3.3% | 4.4% | 2.7% |
| 454 | E. coli | 2.3% | 3.0% | 1.9% |
| 454 | B. thailandensis | 2.5% | 3.1% | 25.0% |

Table 12: Sequencing error rates for all technologies across all platforms.
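The three error classes in Table 12 can be illustrated by counting mismatching columns in a pairwise alignment of a read against its reference. The sketch below is a hedged example only: this report does not state how the percentages were calculated, so the choice of denominator (all alignment columns) and the function name `error_percentages` are assumptions.

```python
def error_percentages(aligned_read: str, aligned_ref: str):
    """Return (insertion %, deletion %, substitution %) for one alignment.

    Both strings must come from the same pairwise alignment and use '-'
    for gaps. Percentages are relative to the number of alignment columns.
    """
    if len(aligned_read) != len(aligned_ref):
        raise ValueError("aligned strings must have equal length")
    if not aligned_ref:
        raise ValueError("alignment is empty")
    ins = dele = sub = 0
    for r, g in zip(aligned_read.upper(), aligned_ref.upper()):
        if g == "-" and r != "-":
            ins += 1    # base present in the read but not the reference
        elif r == "-" and g != "-":
            dele += 1   # reference base missing from the read
        elif r != g:
            sub += 1    # both present but different
    n = len(aligned_ref)
    return 100.0 * ins / n, 100.0 * dele / n, 100.0 * sub / n


# One insertion, one deletion and one substitution in six columns.
print(error_percentages("ACGT-A", "AC-TTC"))
```

In practice these counts would be accumulated over all aligned reads for a sample before the percentages are taken.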

Genome Coverage

All platforms are capable of sequencing 99+% of every genome tested, with MiSeq generating the highest number of bases covered, as well as the most even coverage of the finished target organism genome (Table 13). Figure 13 is a box and whisker plot illustrating the evenness of coverage of each technology, and shows that for each organism, MiSeq generated the most even coverage, with Ion Torrent and 454 generating different levels of coverage for varying genomes.

| Platform | Reads/Run (Ave Length) | Ave. Genome Coverage (%) | Fold Coverage (Min-Max) | Multiplex (max samples/run) |
|---|---|---|---|---|
| MiSeq | ~20 Million (100Bp) | 100% | 40-800× | 2-4** |
| PGM (316 Chip) | 1-2 Million (~200Bp) | 99.99% | 10-100× | 1 |
| 454 FLX* | 100,000 (400Bp) | < 99% | 5-45× | 1 |

Table 13: Sample table for platform analysis. * FLX is used in lieu of the GS Jr.; previous studies have shown highly similar behavior between the two. ** Genome size coupled with desired fold coverage drives the calculations of how many samples may be multiplexed per run.
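The relationship between genome size, desired fold coverage and the number of samples per run can be made concrete with a short calculation. The Python sketch below assumes that reads are spread evenly across samples and across the genome, and the target of 200× used in the example is an illustrative assumption, not a recommendation from this study.

```python
import math


def fold_coverage(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average fold coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_size_bp


def max_samples_per_run(run_yield_bp: int, genome_size_bp: int,
                        target_fold: int) -> int:
    """Largest number of samples that each still reach target_fold coverage."""
    return math.floor(run_yield_bp / (genome_size_bp * target_fold))


# ~20 million 100 bp reads against a 5.3 Mb genome (values from the tables).
print(round(fold_coverage(20_000_000, 100, 5_300_000)))

# ~3.6 Gb MiSeq run, ~4 Mb genomes, hypothetical 200x target per sample.
print(max_samples_per_run(3_600_000_000, 4_000_000, 200))
```

The first value (about 377×) falls inside the 40-800× range listed for MiSeq, and with a 200× target the second calculation allows four average-sized bacterial genomes per run.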

Figure 13: Comparisons of evenness of coverage between platforms. Evenness of coverage across the genome ranges from 1.0 (all regions of the target genome are covered by the same number of reads) to < 0.20 (coverage varies by > 5-fold between regions of the genome). Illumina MiSeq performs better for all tested organisms.

Assembly of Reads

To illustrate the sequencer's ability to characterize a pathogen of unknown origin, sequence data from each platform was assembled and the resulting assemblies analyzed. Due to the relatively low genome coverage of Ion Torrent and 454 data in these samples, assembly of Illumina data produced much longer assemblies than either of the other two technologies. However, assembled reads from all technologies maintained a similar level of coverage of the genome (>85%). Analysis of the assembled sequences does indicate significantly more substitution errors, for all platforms, than in the reads from the same platforms. This indicates that while assembly is necessary for improved ability to detect novel genes or acquired genes not present in a reference genome, there are likely more errors in the assembled sequence than in the reads. For further discussion of assembly for analysis, please see Appendix 3.

Sequencing Mixed Samples

From analysis of multiple previously sequenced mixed community samples, sequencing of mixed samples (such as blood, stool samples, or environmental samples) is possible, but requires relatively high concentrations of the pathogen(s) for detection. An exercise performed using environmental air filter samples with a spiked control of Francisella tularensis, at a concentration equal to approximately 3% of the sample genetic material sequenced, was able to identify the presence of F. tularensis but was unable to characterize the pathogen to strain level. While successful, this scenario would be considered a high limit of detection when compared to PCR based assays, and multiple efforts are underway to produce improved samples for such scenarios. For identification of human pathogens from blood samples, the expected pathogen load is very low. A single sample may reasonably contain only tens or hundreds of pathogen cells mixed with several million human cells. In such cases the required sensitivity is well below 3% of the sample using current DNA extraction techniques and sequencing technologies. Two main areas of research to improve detection limits are (1) improved DNA extraction protocols to potentially remove DNA from non‐target sources and (2) improved sequencing throughput. Currently, a study investigating the preferred DNA extraction protocol from human derived samples, including blood, fecal material, and sputum, is underway. Protocols for sequencing human derived samples to detect potential bio‐threat organisms are expected to reach a maturity stage sufficient for diagnostic work in the near future.
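A back-of-envelope calculation shows why such samples fall far below the 3% level. The Python sketch below is illustrative only and rests on several assumptions that are not taken from this report: each human cell contributes about 6.2 Gb of DNA (a diploid genome), DNA is extracted from all cells with equal efficiency, and the share of reads from each source equals its share of the DNA. The cell counts and genome size in the example are hypothetical values within the ranges mentioned above.

```python
def pathogen_dna_fraction(pathogen_cells: float, pathogen_genome_bp: float,
                          human_cells: float,
                          human_dna_bp_per_cell: float = 6.2e9) -> float:
    """Fraction of total DNA (in base pairs) that comes from the pathogen."""
    pathogen_bp = pathogen_cells * pathogen_genome_bp
    human_bp = human_cells * human_dna_bp_per_cell
    return pathogen_bp / (pathogen_bp + human_bp)


# Hypothetical sample: 100 pathogen cells with a 5 Mb genome
# mixed with 3 million human cells.
fraction = pathogen_dna_fraction(100, 5e6, 3e6)

# Expected pathogen reads in a run of ~20 million reads.
expected_reads = fraction * 20e6

print(f"pathogen fraction of DNA: {fraction:.2e}")
print(f"expected pathogen reads per run: {expected_reads:.2f}")
```

Under these assumptions the pathogen accounts for a few parts in 100 million of the DNA, and a run of 20 million reads would be expected to contain less than one pathogen read, which is why removal of non-target DNA and higher sequencing throughput are both listed as research priorities.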

B. anthracis

E. coli

B. thailandensis

Platform Library Prep. Reads Generated (Million) % Genome Coverage Fold Coverage ±StDev* Reads Generated (Million) % Genome Coverage Fold Coverage ±StDev* Reads Generated (Million) % Genome Coverage Fold Coverage ±StDev*

HiSeq

MiSeq

TruSeq 26.4

+ Betaine 25.4

NebNext2 27.6

+Betaine 24.6

TruSeq 20.3

+Betaine 21.3

NebNext2 25

+Betaine 32.4

99.99%

100%

324±112

310±82

278±76

306±84

388±128

401±90

453±98

629±143

18.5

26.7

22.9

N/A

7.2

7.4

100%

N/A

100%

240±32

337±53

345±40

321±38

N/A

158±28

223.17±28

182.59±30

47.8

17.1

33.7

3.2

7.1

6.4

9.2

8.8

100%

204±82

569±205

874±121

39±9

105±41

167±62

192±30

204±31

Table 14: Sample table of results by library preparation method. *Coverage and standard deviation values for Burkholderia thailandensis are presented as an average of both chromosomes.

MiSeq Sequencing Kits and Betaine Treatment

As discussed in the appendix, two MiSeq sequencing kits and two MiSeq treatment methods were evaluated to determine if they had any effect on the genome sequencing coverage and sequencing quality in low GC regions. The methods and analysis are covered in depth in the appendix. In brief, the NebNext2 sequencing kit demonstrates lower variability of genome sequence coverage, and the Betaine treatment method appears to improve both genome coverage and evenness of coverage for low GC ratio genomes.

Summary

After examination of multiple NGS platforms’ ability to reliably, accurately and evenly sequence the genomes of a range of sample types, the Illumina MiSeq seems to generate more complete and even coverage of genomes than either the Ion Torrent or 454 platforms. Illumina MiSeq has the lowest rate of sequencing errors, followed by 454 and Ion Torrent technologies. Using current technologies and DNA extraction protocols, sequencing of mixed community samples (e.g. swab or blood sample, or soil or air filter sample) requires very high coverage to reliably characterize, again indicating that Illumina MiSeq has a significant advantage. For samples derived from a human source, the signal‐to‐noise ratio for detection of a pathogen is very low, making detection via sequencing potentially unreliable. However, methods for DNA extraction and preparation of human‐derived samples are currently being developed to reduce noise.

Chapter 4: Survey of Sequencing Centers & Platform Vendors

In an effort to understand the diversity of sequencing goals, methods, and technological implementations, this survey includes many of the world’s leading sequencing centers and two major platform vendors with a total of 14 responses from 13 institutions. Generally speaking, the types of sequencing stayed in line with the goals of a particular sequencing group. Notably, 90% of the sequencing centers utilize the Illumina MiSeq or HiSeq platforms regularly; respondents processing clinical samples have greatly relaxed incoming QC requirements for samples; and few centers still utilize the Roche 454 platform regularly. Information gleaned from the vendor survey generally echoes other publicly released information on the platforms.

Sequencing centers

Introduction

This survey was conducted to elucidate the methods applied for sequencing and analysis of data generated by NGS platforms, under the direction of Joint Program Manager ‐ Medical Countermeasure Systems (JPM‐MCS, formerly Transformational Medical Technologies, JPM‐TMT). The purpose of the survey was to understand if the methods and operating procedures used by the Genome Science teams at LANL were in agreement with those at other major sequencing centers, academia and national laboratories. One individual conducted the surveys with all respondents and carried out the analysis of the results. The basis of the questionnaire was outlined by JPM‐MCS in conjunction with LANL staff. Surveys were conducted by providing the questionnaires in advance via email (October 2012; see Appendix 5 for an example survey), then discussing answers via phone when the respondents were available for phone interviews. Two respondents were unable to speak by phone but provided written responses; survey answers were collected between October and November 2012. Most of the survey respondents are located in the USA (four OCONUS centers were contacted, however only one responded; Figure 14). All questions were asked of all respondents (n=12).

Figure 14: Location of survey respondents (created using pinmaps.net)

The sample size for this survey was small (n=12) owing both to the small pool of potential respondents and to the response rate (~65%). All efforts were made to ensure that each participant understood the purpose of the survey and that their responses were accurately recorded. Survey respondents included both researchers and project managers from each of the institutions contacted (all legal adults; institutions are listed in Table 15).

Institution | Type
Broad Institute | Large sequencing center, research laboratory
Centers for Disease Control and Prevention – Influenza | Government, research laboratory
Centers for Disease Control and Prevention – Rapid Response | Government, rapid response
Center for Infection and Immunity | Academic, rapid response, research laboratory
Edgewood Chemical Biological Center | Government, research laboratory
J. Craig Venter Institute | Large sequencing center, research laboratory
Joint Genome Institute | Large sequencing center, research laboratory
Los Alamos National Laboratory | Moderate sequencing center, research laboratory
National Center for Genome Resources | Moderate sequencing center, research laboratory
Naval Medical Research Center | Government, research laboratory
Sanger Institute | Large sequencing center, research laboratory
Translational Genomics Research Institute | Academic, research laboratory

Table 15: Sequencing centers participating in the survey. The low response count may indicate that the summarized responses are skewed; however, the respondents did cover the breadth of possible interviewees.

Sample handling

Generally speaking, respondents handle samples in a similar fashion, with most of the differences directly linked to the goals of the individual centers. Most centers were able to process any nucleic acid, but the viable starting samples varied, as did the prevalence of work with any particular source type. Nearly all centers worked with both DNA and RNA, and all have very similar initial processing steps upon sample arrival. A comparison of sample tracking systems yielded an interesting result: only four of the twelve centers used a commercially available LIMS (Laboratory Information Management System, a database designed to integrate with standard laboratory processes to enable more complete sample tracking), while seven continued to use systems developed in-house and the last still used spreadsheets to document the progression of a sample through pipelines.

Sequencing process

This set of questions looks at the sequencing platforms employed and how they are used. Most notable is the prevalence of Illumina technologies among those surveyed (Figure 15); many centers maintained more than one of the available Illumina platforms. About two-thirds of the centers routinely adjust the manufacturer protocols, which requires either an in-house development team or a trusted team outside the organization. Also common (>60%) is the use of a robotic system to generate the Illumina libraries, which allows for both increased consistency (closeness of results) and reduced labor costs.

[Figure 15 bar chart: y-axis 0% to 100%; values shown: Illumina 92%, PacBio RS 58%, Roche 454 58%]

Figure 15: Prevalence of Illumina sequencing platforms.

Staff training

A single question regarding the level of formal education required for an incoming laboratory technician to handle sample preparation through sequencing runs (no bioinformatics) led to a discussion regarding the importance not only of college level degrees but also of general laboratory experience, personal characteristics and platform specific training. In short, no center interviewed used the same individuals for both wet-lab and bioinformatics work and, with the exception of training programs, all incoming staff held college level degrees.

Sequencing Center Survey Summary

Overall the responses to the survey were quite positive and provided valuable feedback. Each center expressed an interest in ensuring that samples were handled and tracked properly through all stages of processing. Generally the processes and methods were similar between locations; the exceptions were either based on preference, with little impact, or followed logically from that center's goals (such as processing speed and incoming sample QC stringency). The most notable similarity was the implementation of Illumina sequencing platforms across the varied groups.

Platform Vendors

Introduction

This survey was also conducted to answer questions brought up in the Statement of Work and subsequent outline, at the direction of JPM-MCS. The purpose of the survey was to understand, from the perspective of the vendors providing the sequencing platforms, how the equipment should perform now and in the near future. Contact was attempted with many vendors; however, responses were slow and few. LifeTech was the most enthusiastic respondent, providing full written answers as well as a teleconference to discuss both the Ion Torrent and the IonProton. Roche responded late but just in time to be included. Illumina promised a response but did not provide one. To represent those vendors that did not respond directly to the survey, an answer was included if it could be easily gleaned from the manufacturer's website or news reports. As with the sequencing center survey, S. Johnson made contact with all respondents and carried out the analysis of the results.

Vendor | Platform
LifeTech | Ion Torrent & IonProton
Illumina | MiSeq & HiSeq 2000
Roche | 454 Jr & 454 FLX+

Table 16: NGS platforms included in vendor survey.

General questions

Notably different from the initial emergence of NGS technologies, the shortest maximal read length from a major vendor is now 100bp long (early platforms were known for 20-30bp reads). Interestingly, there is a strong correlation between increased read length and decreased read count (Figure 16).

[Figure 16 plot: y-axis, Current Expected Reads Per Full Run* (log scale, 1.E+04 to 1.E+12); x-axis, Common Single Ended Read Length (100 to 450); trend line y = 2E+13e^(-0.047x), R² = 0.8363]

Figure 16: Comparison of read length to read count. *Illumina HiSeq is considered on a per lane basis.
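The trend line reported with Figure 16 (y = 2E+13e^(-0.047x), R² = 0.8363) can be evaluated directly to see how quickly the expected read count falls as read length grows. The Python sketch below is illustrative only: the function name and the example read lengths are assumptions, and the fitted curve is a rough guide rather than a vendor specification.

```python
import math

# Coefficients copied from the trend line shown with Figure 16.
FIT_SCALE = 2e13    # reads per run extrapolated to a read length of 0 bp
FIT_RATE = -0.047   # exponent per base pair of read length


def expected_reads(read_length_bp: float) -> float:
    """Reads per full run predicted by the Figure 16 trend line."""
    return FIT_SCALE * math.exp(FIT_RATE * read_length_bp)


if __name__ == "__main__":
    # Illustrative read lengths within the plotted range (100 to 450 bp).
    for length in (100, 150, 250, 400):
        print(f"{length} bp: ~{expected_reads(length):.2e} reads per run")
```

Because the fit is exponential, every additional 100 bp of read length lowers the predicted read count by the same factor, e^(-4.7), or roughly a hundredfold.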

Protocols and future directions

Both short and long read protocols are available for the three platform chemistries, with a vendor-supplied kit available for most combinations. LifeTech relies heavily on its online user community to develop and share new protocols, while Illumina and other commercial vendors continually produce revised library preparation kits for faster preparation with lower nucleic acid inputs.

Library preparation and sequencing runs

Standard library preparation, using a vendor-provided reagent set and instructions, should take between two and six hours to complete. Automated library preparation is available for all three platform types, although only LifeTech offers such a system itself (third-party vendors offer automation for the Illumina and Roche platforms). Steps required between library preparation and the sequencing run take from two hours to 1½ days to complete, depending on platform, and so can substantially increase the overall time needed to generate usable data. The one exception is the Illumina MiSeq platform, which accounts for these intermediate steps in the sequencing run time. Sequencing run times also vary greatly between platforms and with the specifications of the runs on each platform. The most rapid sequencing runs take three hours on the Ion Torrent, followed closely by a short read (50bp) run on the MiSeq requiring 4 hours. The longest run times are found with the HiSeq, with a 2×100 run taking 10 days.

Vendor-estimated training requirements for all steps are lower than discussions with sequencing centers indicate. The vendors suggest that approximately two days of instruction (from DNA to data) is all that is required for experienced laboratory workers. As indicated before, this training covers only generating the data, not analyzing it.
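The time ranges quoted above can be combined into a rough sample-to-data estimate. The sketch below is a minimal illustration in Python; the class and field names are hypothetical, and because the text does not say which platform sits at which end of the two-hour to 1½-day intermediate range, the full range is applied to every platform except the MiSeq (an assumption).

```python
from dataclasses import dataclass

# Ranges taken from the text above, in hours.
LIBRARY_PREP_HOURS = (2.0, 6.0)    # standard vendor kit
INTERMEDIATE_HOURS = (2.0, 36.0)   # two hours to 1.5 days


@dataclass
class SequencingRun:
    """One platform/run configuration (hypothetical helper type)."""
    name: str
    run_hours: float
    intermediate_in_run: bool = False  # True for the MiSeq, per the text

    def total_hours(self) -> tuple[float, float]:
        """Return (best case, worst case) hours from library prep to data."""
        low = LIBRARY_PREP_HOURS[0] + self.run_hours
        high = LIBRARY_PREP_HOURS[1] + self.run_hours
        if not self.intermediate_in_run:
            # Assumption: the whole intermediate range may apply.
            low += INTERMEDIATE_HOURS[0]
            high += INTERMEDIATE_HOURS[1]
        return low, high


if __name__ == "__main__":
    runs = [
        SequencingRun("Ion Torrent (fastest run)", 3.0),
        SequencingRun("MiSeq (50bp)", 4.0, intermediate_in_run=True),
        SequencingRun("HiSeq (2x100)", 240.0),
    ]
    for run in runs:
        low, high = run.total_hours()
        print(f"{run.name}: {low:.0f} to {high:.0f} hours")
```

Under these assumptions a nominally faster run does not guarantee a faster result: the intermediate steps can dominate the total for the shortest runs.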

Data handling requirements

As sequencing platforms generate larger amounts of data, data handling has become a major issue in genomics [7-9]. These challenges are being met with greater computational power and more efficient data transfer and analysis software. This section queried the vendors as to how such challenges have been addressed with respect to their platforms. All responding vendors suggested that the effort to transfer data from a sequencer to a local system where analysis could take place would be minimal. Experience in the LANL Genome Sequencing team suggests otherwise, although once the informatics networks are in place transfers can and do occur smoothly and without manual input. With data outputs from a single sequencing run ranging from 540MB to 300GB, it is not surprising that the effort in analysis varies as well. For the LifeTech systems no additional hardware is required, whereas running a Roche platform likely requires the purchase of additional computational hardware. Experience at LANL indicates that substantial computational hardware is required both to store and to analyze the large datasets generated by the HiSeq instrument.
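To give a sense of scale for the 540MB to 300GB range quoted above, the following Python sketch estimates transfer time over a network link. The link speeds and the 70% efficiency factor are hypothetical values chosen for illustration; they are not figures from the survey.

```python
def transfer_hours(size_gb: float, link_mbit_per_s: float,
                   efficiency: float = 0.7) -> float:
    """Hours needed to move size_gb over a link.

    Uses decimal units (1 GB = 8000 Mbit). `efficiency` is an assumed
    fraction of the nominal link speed that is actually achieved.
    """
    if link_mbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be > 0 and efficiency in (0, 1]")
    seconds = size_gb * 8000 / (link_mbit_per_s * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    run_sizes_gb = {"smallest run": 0.54, "largest run": 300.0}  # from the text
    for label, size in run_sizes_gb.items():
        for link in (100, 1000):  # hypothetical link speeds, Mbit/s
            minutes = transfer_hours(size, link) * 60
            print(f"{label} ({size} GB) at {link} Mbit/s: ~{minutes:.1f} min")
```

The same function can be pointed at a reach-back link between laboratories by substituting that link's measured throughput for the assumed values.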

Costs

Sequencer costs vary widely, but in general a bench-top system (Ion Torrent, MiSeq or 454 Jr.) costs less than $150,000 for the initial purchase (not including the reagents required to run the system), while a larger system such as the HiSeq costs $690,000. The only additional costs indicated by the vendors are service contracts to maintain the instruments over time.

Platform Vendor Survey Summary

The vendor survey is not meant to be the final descriptor of any platform, but rather to summarize how the manufacturers expect the sequencing equipment to perform. Much of the information can be found publicly or through various online searches, but there is value in validating it with the vendors themselves and compiling it in one place. Comparing metrics across all platforms (read lengths ranging from 2×100 to 1×1000 and read counts ranging from one hundred thousand to three trillion reads per run) tends to mask the inherent differences in those platforms. Each project or goal will likely have a different data type that is most appropriate, so the general aim for most groups investing in such a platform is to find one that best meets most, if not all, of their needs.

Section 2: Operational Assessment

Demonstrated and Expected Contribution of NGS Technology toward Fulfilling OCONUS Mission

Culture-free characterization of samples

Description
OCONUS laboratories utilizing culture-free rapid PCR-based assays are limited to detecting known signatures based on reference pathogens. Although NGS technology is more costly and complex than PCR-based assays, NGS derived analysis can provide more information regarding the nature of pathogen threats and is less prone to false negatives due to PCR signature erosion or the lack of a signature.

Requirements
NGS methodology can be added to enhance existing rapid PCR-based assays. This requires the training described in Section 1 for DNA extraction, library preparation, and sequencing protocols. Implementation also requires selection and deployment of one or more analysis pipelines to OCONUS laboratories, along with training for pipeline utilization and result interpretation. In-depth analysis, beyond automated pipeline analysis, may be performed by offsite support personnel at CONUS (DoD or DOE) laboratories as a reach-back support mechanism. This requires data transfer or continuous connectivity between OCONUS laboratories and support personnel.

Challenges
Sequencing of mixed samples (blood, sputum, swabs, etc.) currently requires significant sequencing capacity. The amount of sequencing possible on deployable sequencing machines is limited, highlighting the need for improved host removal techniques. Development of an easily applied treatment to rapidly remove host DNA (e.g. human DNA would need to be removed from human blood samples prior to sequencing) will improve the usability of the generated data. Multiple efforts exist to support this growing need, but none is currently at or above TRL 5.
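The link between host background and required sequencing capacity can be made concrete with simple arithmetic: if only a fraction f of reads come from the pathogen, reaching coverage C of a genome of G bases with reads of length L needs roughly C·G/(L·f) total reads. The Python sketch below applies this relationship; the genome size, coverage, read length and pathogen fractions are hypothetical illustration values, not measurements from this report.

```python
import math


def total_reads_needed(genome_bp: int, coverage: float,
                       read_length_bp: int, pathogen_fraction: float) -> int:
    """Total reads to sequence so the pathogen genome reaches `coverage`.

    pathogen_fraction is the share of all reads that derive from the
    pathogen rather than from host or other background DNA.
    """
    if not 0 < pathogen_fraction <= 1:
        raise ValueError("pathogen_fraction must be in (0, 1]")
    pathogen_reads = coverage * genome_bp / read_length_bp
    return math.ceil(pathogen_reads / pathogen_fraction)


if __name__ == "__main__":
    # Hypothetical scenario: 5 Mb bacterial genome, 20x coverage, 150 bp reads.
    for fraction in (0.001, 0.01, 0.1):  # before and after host depletion
        reads = total_reads_needed(5_000_000, 20, 150, fraction)
        print(f"pathogen fraction {fraction:.3f}: {reads:,} total reads")
```

In this simple model, each tenfold enrichment of the pathogen fraction cuts the total reads required by a factor of ten, which is why host removal matters most on lower-output deployable instruments.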

Future Applications
As sequencing capability increases, monitoring will become both more extensive and more detailed. Furthermore, improved future technology releases (decreased cost/base and procedure simplification) will allow additional laboratories to perform the sequencing and analysis, thus providing broader surveillance capacity.

Rapid characterization of isolated pathogens

Description
An alternative use of on-site sequencing capacity for OCONUS laboratories is rapid sequencing of isolated pathogens. Current methodology for field isolated pathogens typically requires shipment of either cultures or extracted DNA to CONUS laboratories for sequencing and analysis. Deployment of sequencing capacity to OCONUS sites allows OCONUS laboratories to sequence organisms within country, improving turnaround time and avoiding biosecurity and political barriers.

Requirements
Nucleic acid extraction from cultured isolates typically requires greater safety training than mixed samples and (depending on regulations) may require biosafety level 2 (BSL-2) or above training, equipment and facilities. Personnel tasked with extraction must also be trained in quality control and in methods to verify that samples are non-infectious. All subsequent procedures mirror sequencing requirements for mixed samples, including similar materiel and personnel. In-depth analysis, beyond pipeline analysis, would be performed by support personnel at DoD or DOE CONUS laboratories as a reach-back support mechanism. Such reach-back support requires data transfer or continuous connectivity between OCONUS and CONUS laboratories and support personnel.

Challenges
Biosafety and biosecurity are paramount concerns with cultured potential pathogens. Additionally, on-site analysis will be limited to automated or potentially automated computational analyses, including limited searches against previously identified pathogens or antibiotic resistance genes. More detailed analyses (e.g. genome assembly, annotation of novel pathogens) require high levels of expertise, such as those available at CONUS DoD and DOE facilities.

Future Applications
Improvements in sequencing, assembly and analysis protocols are expected to improve the quality of analysis possible by OCONUS laboratories over time, with decreasing need for reach-back support. Increased sequencing capabilities would allow sequencing of additional strains, increasing capacity and improving outcomes during outbreak scenarios.

Current Next Generation Sequencing Capacity

Current CONUS DoD NGS capabilities reside largely at three facilities: U.S. Army Edgewood Chemical Biological Center (ECBC), Naval Medical Research Center (NMRC) and USAMRIID's Center for Genome Sciences (CGS). Each laboratory is home to highly educated and capable staff along with modern sequencing equipment, including at least one Illumina sequencing platform per locale. Potential sample throughput is likely in the range of tens to hundreds per month and could be improved with the addition of automated robotic systems to speed the library preparation process. All three employ not only highly trained bench staff but also scientists well versed in bioinformatics analyses and in the scientific context for the potential data applications. Each of the three main laboratories is capable of sequencing and assembling high quality draft microbial genomes and transcriptomes. Their throughput and expertise also put metagenomic sequencing and analysis within their capabilities. The laboratory at ECBC is most suited to bacterial processing, while those at NMRC and CGS have greater resources focused on viral samples; however, all three labs either house or are associated with BSL-3 facilities and so are capable of processing all sample types discussed in this document. Additional DoD CONUS laboratory facilities include the partner laboratories, many of which are also well staffed and equipped to handle low to moderate sample throughput.

Requirements for Deployment of NGS Technology to OCONUS laboratories

Physical Requirements
As described in Section 1, the equipment required for nucleic acid extraction, library preparation, and sequencing can be placed in a single 10’×25’ room. Requirements for ancillary materials such as gas cylinders or high quality water depend on the sequencing system selected. All NGS platforms require, at a minimum, this amount of space and uninterrupted power for the duration of a sequencing run (which also depends on the sequencing platform). Additional recommendations include UV lights, cleaning supplies and other materials to reduce the possibility of cross-contamination between samples during sequencing. Computers capable of storing the data produced (~2 GB per sequencing run) and performing analyses would be required for such an OCONUS NGS establishment. To enable reach-back support, a universal system must be deployed both to the OCONUS laboratories and to their partnering CONUS facilities. For effective bioinformatic analysis, particularly across multiple locations, protocol standardization and internal and external sequencing standards (synthesized DNA used both as a portion of each sequencing event and as an individual sequencing event for standardization) must be developed and implemented at all participating laboratories.

Personnel
Effective OCONUS sequencing requires training of personnel in nucleic acid extraction (including Good Laboratory Practices to reduce potential contamination events), library preparation and sequencing protocols. Additional training would be required to analyze the results of sequencing events. Machine "down time" may be reduced if on-site staff are able to perform minor equipment repairs.

Conclusion

Globalization has increased, and will continue to increase, the emergence of pathogens not previously seen and the spread of known pathogens. This is due to global trade, rapid movement of people, closer interactions with domestic and wild animals, etc. DoD OCONUS diagnostic and surveillance laboratories relying on traditional detection methods have a limited ability to detect emerging pathogens. NGS technologies offer a very powerful tool for detection and characterization of pathogens, known or unknown, in many sample types. The most recent NGS platforms, described in detail in this report, are now reasonably priced and require a relatively small footprint. Most importantly, they have the potential to generate highly detailed information about pathogens in a cost-efficient and timely manner. Implementation of NGS technologies in DoD OCONUS labs will enable more rapid and accurate detection of and response to outbreaks.

Appendices

Appendix 1: Glossary

Abbreviation | Description
Agarose | Gelling agent, extracted from marine algae. Works similarly to gelatin but with greater rigidity at room temperature
Alignment tools | Software that helps align sequences to each other by lining up the individual bases
Annotation | A process that attaches biological information to sequence data. This consists of two steps: (1) features of interest (genes) are identified (feature prediction), and (2) gene function and taxonomy profiling are assigned (functional annotation).
Ave | Average determined by mean
Barcode | Also referred to as "index". A short DNA sequence that uniquely distinguishes one sample from another and enables multiplexing.
BCM | Baylor College of Medicine
BioAnalyzer | Commercially available system that utilizes microfluidic chips to determine the concentration and molecular weight (a.k.a. size) of DNA and RNA
BLAST | Basic Local Alignment Search Tool, an easy to use but computationally expensive way to locate regions of local similarity between two sequences.
CDC | US Centers for Disease Control and Prevention
cDNA | Complementary DNA, generated through reverse-transcription of RNA
CDS | CoDing Sequences, portion of DNA that codes for a protein. Will have both a start and stop codon (see below).
Chaotropic | Substance, generally a salt, that denatures or breaks down macromolecules such as proteins, DNA and RNA
Chimeric | Being made from two entities; in genomics this refers to a sequence that is partially from one organism and partially from another
CII | Center for Infection and Immunity at Columbia University
Codon | A series of three nucleotides (bases of DNA) that code for an amino acid (building blocks of proteins)
Commensal | A symbiotic relationship in which one member benefits and the other is unaffected
Contig | Derived from the word contiguous, it is a set of overlapping DNA reads or segments that represent a consensus region of DNA sequence.
CONUS | Contiguous United States
CU | Columbia University
De novo assembly | Assembling sequencing reads together without the aid of a reference
DNA | DeoxyriboNucleic Acid, genomic code for all non-viral organisms
DoD | Department of Defense
DTRA | Defense Threat Reduction Agency
ECBC | Edgewood Chemical Biological Center
Epigenetics | Study of functional changes (gene expression or phenotype) that are heritable but not due to changes in DNA sequence (such as DNA methylation or modifications in histones)
Fluorescence | Light emission by a substance after it has absorbed light or radiation.
Fluorometric | Fluorescence spectroscopy, also known as fluorometry or spectrofluorometry, is a type of electromagnetic spectroscopy which analyzes fluorescence from a sample.
Fragmentation | Physical shearing of large DNA into smaller fragments, generally required prior to sequencing library preparation.
Gene annotations | Identification of gene locations and determining what those genes do.
HIPAA | Health Insurance Portability and Accountability Act of 1996, requires high levels of documentation and explanation from health care providers and insurance companies
Homology | A similarity in characteristic (such as gene or genome sequence) due to shared ancestry
ID | Identification
InDel | Insertion/Deletion mutation, where one or more bases are added to or removed from the genome
JCVI | J. Craig Venter Institute
JGI | Joint Genome Institute
LANL | Los Alamos National Laboratory
Metagenome | Sequencing and analysis of all organisms in a mixed sample, may refer to environmental or clinical samples
Microbial clones | Microbes (bacteria, archaea, yeasts and many other fungi) grow asexually, so a single cell placed under growth conditions should grow into a colony of clones, cells that are genetically identical.
Microbiology hood | Also referred to as a laminar flow hood. This is a large desk sized item, connected to the building’s HVAC system and equipped with HEPA filters to provide a sterile workspace for microbiological work.
Microbiome | A microbiome includes all microbes (both genomes and interactions) in an environment. Most often used in terms of the “Human Microbiome Project”, which looked at all microbes from various parts of the human body.
Multiplex | Multiple reactions occurring simultaneously in a single vessel (see singleplex for comparison)
NAU | Northern Arizona University
NCBI | National Center for Biotechnology Information
NCGR | National Center for Genome Resources
NGS | Next Generation Sequencing, essentially all platforms of genomic sequencing described here are considered NGS
NMRC | Naval Medical Research Center
OCONUS | Outside the Contiguous United States
OTU | Operational Taxonomic Unit, used to define the smallest level of taxonomy for an organism (particular isolate), often prior to assigning a strain designation
Paired-end sequencing | A sequencing method in which each DNA fragment is sequenced from both ends.
PCR | Polymerase Chain Reaction
PGM | Personal Genome Machine, commercially available sequencing platform from LifeTech
Putative | Expected to be but not confirmed
QC | Quality Control, used in reference to sample, library or data quality
qPCR | Quantitative PCR, a process that uses fluorescent detectors to monitor the amount of DNA in a PCR reaction over time; often incorporated into sequencing library QC processes
Quasi-species | Viruses employ an error prone replication system so (unlike bacteria) when they grow within a host not all individuals are clones of each other
Rarefaction curves | A graphical plot of the number of species (or OTU) as a function of the number of samples
Read mapping | Alignment of sequencing reads to reference genomes
Reference-based assembly | Assembly of sequencing reads using a reference genome as a guide for placement.
RNA | RiboNucleic Acid, cellular messaging and genomic code for some viruses
rRNA | Ribosomal RNA, present in all known cells. Often used for high- to mid-level identification
RT-PCR | Reverse Transcriptase PCR
Shotgun sequencing | Random sequencing of a sample. Analogous to the rapidly expanding, quasi-random firing pattern of a shotgun.
SI | Wellcome Trust Sanger Institute
Signature | Signatures are nucleotide sequences that can be used to detect the presence of an organism and to distinguish that organism from all others
Singleplex | A single reaction occurring in a single vessel (see multiplex for comparison)
StdDev | Standard Deviation
TGen | Translational Genomics Research Institute
Thermalcycler | Laboratory apparatus used to quickly cycle through temperatures, aiding in amplification reactions (see PCR)
TRL | Technology Readiness Level
USAMRIID | United States Army Medical Research Institute of Infectious Diseases
Virulence factors | Genes expressed by pathogens that enable (1) colonization of a host, (2) immune-evasion, (3) immune-suppression, or (4) entry into/exit from cells [only with intracellular pathogens]. Often includes traits such as antibiotic resistance.
Vortex(er) | Laboratory apparatus used for mixing by shaking, common and inexpensive equipment.
WashU | Genome Institute at Washington University
α-diversity | Originally defined by R.H. Whittaker in the study of ecology, refers either to (1) the species diversity in a single unit or subunit or (2) the average [mean] species diversity in a set of units or subunits

Table 17: List of abbreviations and non-standard terms used in text.

Appendix 2: Analysis Pipelines

Metagenome Sequence Analysis

Isolate Genome Sequence Analysis

Amplicon Sequence Analysis

Application of NGS to diagnosis and analysis of pathogens is a relatively young field, but several pipelines for analysis of NGS data have been proposed, such as Mothur, Qiime, MG-RAST, IMG-M-ER, CloVR, SmashCommunity, Virome, and Metamos, each with different applications, data requirements, and types of actionable information that can be generated. This section discusses each of these pipelines, both their benefits and drawbacks.

Method: rRNA amplicon sequencing
Description: Amplification of sample with universal rRNA primers and sequencing of amplicons. Amount of information depends on the primers used.
Pros: High sensitivity. Can have high specificity at species level. Can identify novel bacterial, archaeal, and fungal species.
Cons: Specificity and classification depend on the choice of primers/amplicons. Cannot detect virus and rare community members due to lower coverage.
Computational Requirements: Classification software & rRNA DBs.

Method: Pathogen ID amplicon sequencing
Description: Amplification of sample with specific primers and sequencing of amplicons. Read mapping and rapid ID and characterization. Amount of information depends on primers used.
Pros: High sensitivity. Can have high specificity at strain level. Fastest method.
Cons: Specificity and classification depend on the choice of primers/amplicons. Data cannot be used in other analysis pipelines.
Computational Requirements: Homology search software & pathogen ID DBs. Low computational requirements.

Method: Isolate genome sequencing (read based analysis)
Description: ID/characterization from databases of known pathogens. Potential ID of antibiotic resistance/virulence factors.
Pros: Highly discriminatory. Rapid, specific pathogen ID. Virulence/Antibiotic resistance gene ID.
Cons: Minimal characterization. Characterization may have false negatives for virulence factors/antibiotic resistance. Requires isolation.
Computational Requirements: Low computational requirements. Pathogen genome databases. Read mapping software.

Method: Isolate genome sequencing (assembly based analysis)
Description: Assembly of reads, ID from contigs. Annotation of contigs. Analysis of annotated contigs for characterization.
Pros: Highly discriminatory. Specific pathogen ID. Virulence
Cons: Requires additional computational power. Slightly slower (time diff?). Annotation is slow.
Computational Requirements: Medium computational requirements (~16+ GB RAM minimum for assembly (?)). Assembly software. Annotation software.

Method: Mixed sample based pathogen ID & characterization (read based)
Description: Map reads to pathogen DB, ID species present in sample. Map reads to virulence/antibiotic resistance DB, ID virulence factors/AB resistance.
Pros: Rapid analysis. Capable of ID of difficult to isolate pathogens. No requirements for culture.
Cons: Reduced sensitivity (false negatives for ID and virulence factors).
Computational Requirements: Low computational requirements. Pathogen genome DB. Read mapping software.

Method: Mixed sample based pathogen ID & characterization (assembly based)
Description: Assemble all reads from sample. ID potential pathogens from assembled contigs. Annotate assembled contigs. Characterize contigs based on annotations.
Pros: Accurate for abundant pathogens. Increased potential to ID pathogen/virulence factors (for abundant pathogens).
Cons: High computational requirements. Difficult to interpret.
Computational Requirements: High computational requirements. Metagenome assembly pipeline. Metagenome annotation pipeline.

Table 18: Analysis methods or "pipelines" available.

Amplicon Sequencing

rRNA amplicon sequencing

Methodology
Amplicon sequencing involves the analysis of genetic variations in deep NGS sequencing of polymerase chain reaction (PCR) products. As each amplicon molecule within a mixture of amplicons can be sequenced by multiplex sequencing platforms, with up to 96 barcoded samples sequenced in a single Illumina lane, this high-throughput technology has high sensitivity and the power to detect rare variants with detection limits of ≤0.5%. The hyper-variable regions of the small subunit (SSU) rRNA molecule (approximately 40% of the gene is considered hyper-variable) are ideally suited for variance analysis, while samples with low variance are better suited to other analytical techniques, such as pathogen ID (or signature) detection. Identification of uncultivated microbes can be greatly aided by character-based naive Bayesian classifiers (NBC) or homology-based (BLAST) approaches. These analyses allow for the identification of known potential microbes in mixed samples, with identification to the genus and at times species level. Phylogenetic relationships among DNA sequences can also be inferred for potential novel pathogens, either with existing phylogeny-based reference databases (such as greengenes, RDP, or SILVA) or by de novo phylogenic analysis, using tools such as NAST (for sequence alignment) and FastTree (for phylogeny inference from aligned sequences). Such tools for phylogeny inference can be used in isolation or within the context of QIIME, Mothur or other pipelines.

[Figure 17 workflow: Ribo-tags; Data QC (Denoising/Chimera Checking); Character based approaches (NBC) or Homology based approaches (BLAST); Taxonomy profiling; Pathogen identification]

Figure 17: Work flow of rRNA amplicon sequence analysis.
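The character-based naive Bayesian classification step in this workflow can be illustrated with a toy example. The Python sketch below scores a query read against each taxon by the k-mer "words" it shares with that taxon's training sequences. It is a simplified illustration only: the class name, the word size, the smoothing constants and the training sequences are all invented for this example, and it does not reproduce the RDP classifier or the interfaces of QIIME or Mothur.

```python
import math
from collections import defaultdict

K = 4  # word size; kept short so the toy sequences below share words


def kmers(seq: str, k: int = K) -> set[str]:
    """Set of overlapping k-mers (words) found in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class KmerNaiveBayes:
    """Toy word-presence naive Bayes classifier (hypothetical helper)."""

    def __init__(self) -> None:
        self.n_seqs: dict[str, int] = defaultdict(int)
        self.word_counts: dict[str, dict[str, int]] = defaultdict(
            lambda: defaultdict(int))

    def train(self, taxon: str, seq: str) -> None:
        """Record which words occur in one training sequence of a taxon."""
        self.n_seqs[taxon] += 1
        for word in kmers(seq):
            self.word_counts[taxon][word] += 1

    def log_score(self, taxon: str, seq: str) -> float:
        """Sum of log P(word | taxon) over the query's words.

        P(word | taxon) = (m + 0.5) / (M + 1), where m is the number of
        the taxon's training sequences containing the word and M is the
        number of training sequences for the taxon (simplified smoothing).
        """
        total = self.n_seqs[taxon]
        counts = self.word_counts[taxon]
        return sum(math.log((counts.get(w, 0) + 0.5) / (total + 1))
                   for w in kmers(seq))

    def classify(self, seq: str) -> str:
        """Return the taxon with the highest log score for the query."""
        return max(self.n_seqs, key=lambda taxon: self.log_score(taxon, seq))


def build_toy_model() -> KmerNaiveBayes:
    """Train on invented sequences; real use needs a curated rRNA database."""
    model = KmerNaiveBayes()
    model.train("genus_A", "ACGTACGTGGCCAATT")
    model.train("genus_A", "ACGTACGAGGCCAATT")
    model.train("genus_B", "TTGACCTTAGGCTAAC")
    model.train("genus_B", "TTGACCATAGGCTAAC")
    return model


if __name__ == "__main__":
    model = build_toy_model()
    print(model.classify("ACGTACGTGGCCTATT"))
```

A production classifier would add confidence estimation (for example by bootstrapping over the query's words) and would be trained against a reference set such as those named above; neither is attempted here.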

Technical Considerations: Which sequencing platform to choose
The analysis pipeline for these amplicons will vary depending on the sequencing platform utilized. Expected inputs are either fewer long reads (Roche 454) or many paired-end short reads (Illumina) of PCR amplicons generated using universal degenerate (non-specific) rRNA primers. Pyrosequencing technologies (454) introduce homopolymer errors (also called 'noise') into sequence data, so analysis must begin by "denoising" the data to remove errors. Recent results show that using paired-end Illumina reads significantly improves the accuracy of taxonomic assignments compared to single-end amplicon sequencing runs [10].

Technical Limitation
Ribosomal ribonucleic acid (rRNA) is the RNA component of the ribosome and is essential for protein synthesis in all cellular organisms. Viruses are not cellular, so rRNA amplicon sequencing is not useful for detecting viral pathogens. Several biases can also be introduced into SSU rRNA data. First, degenerate "universal" primers are likely biased toward known sequences, leading to the exclusion of divergent genes. Additionally, the PCR reaction can introduce chimeric molecules, potentially confounding analysis and identification of reads. Computer algorithms, such as UCHIME and the CHECK_CHIMERA program of RDP, are useful for predicting and locating such chimeric molecules within a dataset.

Computational requirements
Due to the maturity of the characterization tools available for targeted rRNA amplicon sequence analysis, the computational requirements are relatively low. A single sample can be classified quickly using the QIIME and Mothur packages on a desktop computer. We recommend that OCONUS laboratories first install the QIIME Virtual Box, a virtual machine based on Ubuntu Linux that comes pre-packaged with QIIME's dependencies. This is useful for small analyses (approximately a full 454 run) and for testing QIIME to determine whether it meets OCONUS needs. If it does not, the next option would be to invest time in installing the native version on a large Linux cluster environment in a CONUS reach-back support lab. Similarly, Mothur can also be installed on desktop computers in OCONUS labs.
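The chimera problem noted under Technical Limitation can be illustrated with a deliberately simple check: split a read in half and ask whether the two halves are most similar to different reference sequences. This sketch is an assumption-laden toy and is not the algorithm used by UCHIME or CHECK_CHIMERA; the function names, the shared k-mer similarity measure and `k=8` are hypothetical choices.

```python
def kmer_set(seq, k=8):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def best_parent(fragment, references, k=8):
    """Return the name of the reference sharing the most k-mers with the fragment."""
    words = kmer_set(fragment, k)
    return max(references, key=lambda name: len(words & kmer_set(references[name], k)))

def looks_chimeric(read, references, k=8):
    """Flag a read whose left and right halves point to different best parents."""
    mid = len(read) // 2
    left = best_parent(read[:mid], references, k)
    right = best_parent(read[mid:], references, k)
    return left != right, (left, right)
```

Given two toy references `A` and `B`, a read built from the first half of `A` and the second half of `B` is flagged, while an unmodified copy of `A` is not. Production tools score candidate breakpoints along the whole read and compare against many parents, which this sketch does not attempt.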

PathogenID amplicon sequencing

Methodology
DNA signatures are nucleotide sequences that can be used to detect the presence of an organism and to distinguish that organism from all other species. There are several stand-alone applications used for direct, automated selection of DNA signatures. KPATH was a pioneering computational methodology developed for the identification of DNA signatures in silico. The signature discovery pipeline of KPATH integrates several previously developed algorithms in a multi-step approach. It first uses MGA (Multiple Genome Aligner) to align numerous bacterial genomes simultaneously, followed by Vmatch, a suffix tree algorithm, to compare the target genome(s) against all available genome sequences. This analysis allows genomic regions that are present in other microorganisms to be filtered out, leaving only the sequences unique to the target(s). The signature sequences are then transferred to the software Primer3, and probes are designed for use in real-time TaqMan assays. These assays have been field-tested for routine pathogen screening, demonstrating the potential of in silico prediction of DNA signatures. Similarly, Insignia, TOFI, and Yoda are other open access tools aimed at identifying DNA signatures for bacteria and viruses, optimized for either TaqMan assays or microarray assays.

Alternatively, these PCR products could be sequenced by NGS. NGS platforms are capable of sequencing hundreds or thousands of PCR amplicons simultaneously. This allows all available PCR primers designed for diagnostics to be applied to a single sample, followed by pooling of all PCR products into a single sequencing library per sample. Sequences can then be rapidly mapped back to a database of the target genes, and positive hits would indicate that a targeted pathogen is present in the sample.

Technical Considerations
PCR-based techniques allow for the identification of known pathogen(s), regardless of their cultivability. The wide range of available primers for pathogen detection in a clinical setting is both a huge benefit and a minor drawback. The wide range of targets means that specificity and sensitivity are both high for identifying known pathogens, but this method will not allow for identifying unknown or emerging pathogens.

Computation requirements
As with rRNA amplicon sequencing, computational requirements are low and do not require more than a desktop computer and access to sequencing data. After these PCR products have been sequenced, rapid alignment tools are required (BLAST, BWA, Bowtie2, etc.) as well as the SAMtools package, which can be installed on a desktop computer in an OCONUS lab. Both amplicon methods require little scientific training and are easily standardized for a wide range of users.

Figure 18: Work flow chart of pathogenID amplicon sequence analysis. Boxes in the chart: PathogenID Amplicon; Quality Check; Homology search against signature database; Pathogen identification (species/strain level).
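The idea behind signature discovery (keep only sequence that is present in the target and absent from every other genome) and the later screening of reads against a signature database can be sketched with plain k-mer sets. This is an illustrative simplification: KPATH itself is described above as using MGA and the suffix-tree tool Vmatch, whereas the sketch below swaps in Python sets, and the function names and `k` value are assumptions.

```python
def kmers(seq, k):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def candidate_signatures(target, background, k=12):
    """K-mers found in the target genome but in none of the background genomes."""
    unique = kmers(target, k)
    for genome in background:
        unique -= kmers(genome, k)
    return unique

def reads_with_signature(reads, signatures, k=12):
    """Count reads that carry at least one signature k-mer."""
    return sum(1 for read in reads if kmers(read, k) & signatures)
```

With `target = "ACGTACGGATCCTTAG"`, `background = ["ACGTACGGAT"]` and `k=6`, the shared prefix is filtered out and six k-mers remain as candidates. A read count above some laboratory-defined threshold would then be treated as a positive hit; that threshold is not specified in this report.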

Isolate genome sequencing: Read based analysis for Isolate genome

Methodology
If an isolated biothreat agent (pathogen) has been cultivated, sequencing and analysis of strain level variants are possible. Characterization is performed to determine how similar or different a given isolate is from its nearest neighbor. Identification and characterization requires three steps: (1) identification of the nearest neighbor, (2) detection of variation among them and (3) characterization of genes known to contribute to pathogenicity or virulence in near-neighbors. Typically, to reach these goals, shotgun reads are first compared to a reference genome database and are then mapped to the nearest reference genome to identify genome variations (such as SNPs). These variations may be used to trace a pathogen's origin and its phylogenetic relationship to other known species. The last step, mapping reads to an antibiotic resistance/virulence database, helps determine if sequence reads fall within genes of potential interest, such as antibiotic resistance genes or other factors that may play a role in pathogenicity.

Technical Considerations
This is a highly robust method for characterizing the genome of an isolated biothreat agent. However, it is time consuming, requiring hours of computational time followed by analysis of the results. It is also more difficult with read-based analyses to identify novel genes or plasmids that have been introduced into a biothreat agent, leading to potential false-negative results when searching for known specific pathogenicity or virulence factors.

Computation requirements
Rapid alignment tools are required (BLAST, BWA, Bowtie2, BLAT, etc.) along with associated analysis tools (SAMtools, etc.). To perform analysis of genomic differences, a reference genome and its annotation are also required. Currently, there are >3,000 bacterial genomes and >30,000 viral genomes deposited at NCBI. To maintain all of these on a local database currently requires >6 GB of storage, which is readily accomplished but does require maintenance. If the organism has not been classified at species level, software capable of classifying reads to higher taxonomic levels could also be used. These characterization tools, such as PhyloSift and Sequedex, require little computational power but do require some expertise.
In order to identify potential virulence factors from a list of genes, centralized databases, such as Mvirdb and VFDB, are used as references. Variation detection among genomes (SNP, InDel) requires genotyping and SNP calling programs (such as SOAP2, realSFS, Samtools, GATK, Beagle, IMPUTE2, QCall, MaCH), followed by phylogenetic analysis (such as MEGA).
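To make the variation detection step concrete, the following sketch lists SNPs and InDels between a reference and an isolate that have already been aligned (gaps written as `-`). It is only a toy comparison under that assumption; the SNP callers named above work from read pileups and genotype likelihoods, which this sketch does not model. The function name and output tuple layout are hypothetical.

```python
def list_variants(ref_aln, iso_aln):
    """Compare two gap-aligned sequences; return (ref_pos, type, ref_base, alt_base) tuples."""
    if len(ref_aln) != len(iso_aln):
        raise ValueError("aligned sequences must have equal length")
    variants = []
    ref_pos = 0  # 1-based position on the ungapped reference
    for r, q in zip(ref_aln, iso_aln):
        if r != "-":
            ref_pos += 1
        if r == q:
            continue
        if r == "-":
            # Base present only in the isolate, reported after reference position ref_pos.
            variants.append((ref_pos, "insertion", "-", q))
        elif q == "-":
            variants.append((ref_pos, "deletion", r, "-"))
        else:
            variants.append((ref_pos, "SNP", r, q))
    return variants
```

For the aligned pair `"ACGT-ACGT"` and `"ACCTTAC-T"`, the function reports one SNP at reference position 3, one insertion after position 4 and one deletion at position 7.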

Assembly based analysis for Isolate genome

Methodology
To better address the need to identify unknown or emerging pathogens, the generated reads can be assembled into contigs that can be analyzed not only by comparison to a reference genome, but also by de novo analyses. Assembly of NGS data is highly platform specific: Illumina reads require a k-mer-based assembler (such as Velvet, SOAPdenovo, or CLC bio), while 454 or Ion Torrent reads are better assembled using Roche's proprietary Newbler software or the publicly available MIRA assembler, which are trained to recognize platform specific sequencing errors that make assembly difficult. Assembled contigs can be aligned to the reference genome using alignment software (such as BLAST or the MUMmer tools) and analyzed for SNPs/InDels. Genome rearrangements can also be detected by comparative genome analysis. Analysis of the assembled sequences can be performed by annotating the contigs individually, using publicly available web-based tools (such as RAST) or an in-house annotation system (Ergatis and CloVR). Annotation results can then be examined for information to identify the biothreat organism and to classify it by its functional capacity, such as antibiotic resistance or other factors. Formation of large contigs comprising several unmapped reads that lack significant alignment similarity to any sequence in the reference databases may be suggestive of a previously undetected organism. Such non-mapped sequences can also be analyzed for gene content, to determine if they contribute towards pathogenicity or virulence.

Technical Considerations
This method is ideal for identifying foreign elements in a biothreat agent that are not present in the near-neighbor reference, as can occur when a bacterium picks up a new plasmid containing pathogenicity factors, or potentially through a recombination or genetic engineering event intentionally inserting a new gene into the genome. Genome assembly may result in false negatives, as genes of interest may be incorrectly assembled or not assembled at all. Additionally, genome rearrangement detection can lead to false positives, resulting in incorrect conclusions.

Computation requirements
This approach has medium computational requirements. The de novo assembler and homology searching tools listed above would be needed. If de novo annotation of assembled sequences is desired, either significant investment in software and hardware for this purpose or high-speed internet access to an external annotation portal will be required.

Figure 19: Work flow chart of isolated genome sequence analysis. Boxes in the chart: Isolate genome sequencing; Mapping to pathogen DB; BLAST against antibiotic resistance/virulence database; Detect variation among genomes (SNP, InDel); Pathogen characterization: virulence factors such as antibiotic resistance; Genome assembly; Comparative genome analysis; Pathogen ID (emerging or engineered pathogen).
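K-mer-based assemblers such as Velvet organize reads into a de Bruijn graph, in which overlapping k-mers define edges between (k-1)-mers and contigs correspond to unambiguous paths. The sketch below shows that core idea on error-free toy reads only; the function names are hypothetical, and real assemblers add error correction, coverage handling and repeat resolution that are omitted here.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def extend_contig(graph, start):
    """Walk forward from start while each node has exactly one successor, stopping on a revisit."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig
```

Three overlapping toy reads, `"ACGTTGCA"`, `"GTTGCATC"` and `"GCATCGGA"`, with `k=4` are joined into the single contig `"ACGTTGCATCGGA"` when the walk starts at `"ACG"`. A branch in the graph (a node with two successors) ends the contig, which is one reason repeats and sequencing errors fragment real assemblies.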

Metagenome Sequence Analysis: Mixed sample based pathogen ID and characterization (read based)

Methodology
Benefits of metagenomic sequencing to OCONUS laboratories include increased information richness and a decreased need for on-site expertise in pathogen culturing or isolation. Additionally, no information about the potential virulence genes sequenced is needed. To start, community composition and the presence of known or unknown pathogens must be understood. To determine the phylogenetic membership of microbial communities based on metagenomic sequences, several freely available and popular software packages compare the metagenomic reads to a variety of full genome sequences using a read-based approach, such as read mapping or BLAST. The identity of the best match then determines the likely phylogenetic origin of the sequence. An alternative is to find and extract informative phylogenetic markers from the metagenomic reads, which can be processed with methods similar to those used for targeted gene surveys. However, taxonomic assignment of arbitrary metagenomic fragments remains a big challenge, as much of the novelty in metagenomes still corresponds to organisms that lack a sequenced reference genome. Complementing metagenomic analyses with 16S rDNA analyses, for which much larger reference databases exist, is therefore often useful.

One advantage of metagenomic approaches is their ability to discriminate strains of common species by gene content, beyond the resolution that is possible with 16S rDNA sequences, although this approach requires high coverage and thus cannot be applied to rarer members of the community. Using a wide range of reference resistance genes, the potential for multiple antibiotic resistances can be predicted from a single metagenome. The metagenomic sequences represent the diversity of the community, including strains that cannot be cultured, which is valuable information for the study of community changes as a result of antibiotic treatment.
The biggest challenge with sequence-based metagenomics is the large number of sequences with no significant similarity to previously sequenced genes or organisms. Without known reference sequences, resistance genes cannot be easily identified in the metagenomes. The strong selection for antibiotic resistance alleles results in convergent evolution: the adaptation of very different genes to perform the same function. Many resistance genes identified in functional screens have low similarity to known genes, whereas sequence-based approaches are generally limited to identifying genes that are already known.
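The read-based approach described in the Methodology, where the best match determines the likely origin of each read, amounts to a best-hit tally. The sketch below assumes the alignment step (read mapping or BLAST) has already produced `(read_id, taxon, score)` tuples; the function name, the tuple layout and the `min_score` cutoff are illustrative assumptions and not the output format of any particular tool.

```python
from collections import Counter

def best_hit_profile(read_ids, hits, min_score=50.0):
    """Assign each read to its highest-scoring taxon and tally the community profile.

    hits is an iterable of (read_id, taxon, score); reads whose hits all fall
    below min_score, or that have no hits, are returned as unassigned.
    """
    best = {}
    for read_id, taxon, score in hits:
        if score >= min_score and (read_id not in best or score > best[read_id][1]):
            best[read_id] = (taxon, score)
    profile = Counter(taxon for taxon, _ in best.values())
    unassigned = [r for r in read_ids if r not in best]
    return profile, unassigned
```

Reads left in `unassigned` correspond to the sequences with no significant similarity to known references discussed above; in a real metagenome this fraction can be large, which is why a best-hit profile alone may understate novel organisms.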

Technical Considerations
This is a rapid classification method for mixed samples and can identify potential pathogens rapidly, but it is limited to the set of known targets and will not necessarily identify all potential biothreat agents in a sample. All metagenomic analysis is limited by the depth of coverage generated by the sequencing technology, with increased sequence data requiring greater computational time and power. Some pipelines developed for rapid identification of pathogens, such as RINS and PathSeq, are especially useful for viral genome detection. However, mapping against all genomes is slower and demands more computational resources.

Computational Requirements
Rapid alignment tools (BWA, Bowtie2, CLCaligner, etc.) and associated analysis tools (SAMtools, etc.) are required. For the reference database, there are >3,000 bacterial genomes and >30,000 viral genomes available. To maintain all of these on a local database currently requires >6 GB of storage, which is readily accomplished but requires maintenance. If the organism has not been classified at species level, software capable of classifying reads to higher taxonomic levels could also be used. These characterization tools, such as PhyloSift and Sequedex, require little computational power, but do require some expertise to understand the potential false positive/negative rates in these assignments. In order to identify potential virulence factors from a list of genes, centralized databases such as Mvirdb and VFDB are used as references.

Figure 20: Work flow chart of mixed sample based pathogen ID and characterization. Boxes in the chart: Isolate genome sequencing; Mapping to pathogen DB; Metagenome assembling; BLAST against antibiotic resistance/virulence database; Metagenome annotation; Pathogen identification (emerging or engineered pathogen).

Mixed sample based pathogen ID and characterization (assembly based)

Methodology
Given the amount of novel sequence in metagenomic shotgun reads, read-based classification methods relying on databases will fail to classify many novel or divergent pathogens present in samples. This is of particular danger with viral pathogens, where evolution is rapid and classification of virulence genes is less universal. In order to recover those novel pathogens from mixed samples or obtain full-length CDS of virulence factors/antibiotic resistance genes for subsequent characterization, assembly of short read fragments must first be performed to obtain longer genomic contigs. Two strategies can be employed for metagenomic samples: reference-based assembly (co-assembly) and de novo assembly.

Reference-based assembly can be done with software packages such as Newbler (Roche), metAMOS or MIRA. These software packages include fast and memory-efficient algorithms and hence can often be run on laptop-sized machines in a couple of hours. Reference-based assembly works well if the metagenomic dataset contains sequences for which closely related reference genomes are available. Differences between the true genome and the reference, such as large insertions, deletions, or polymorphisms, can mean that the assembly is fragmented or that divergent regions are not covered.

De novo assembly typically requires larger computational resources. A whole class of assembly tools based on de Bruijn graphs was specifically created to handle very large amounts of data. Machine requirements for the de Bruijn assemblers Velvet or SOAP are still significantly greater than for reference-based assembly (co-assembly), often requiring hundreds of gigabytes of memory in a single machine, and run times frequently take days. Without assembly, however, longer and more complex genetic elements (e.g., pathogenicity islands) cannot be analyzed. This leads to the need for metagenomic assembly to obtain high-confidence contigs, enabling study of virulence factors and antibiotic resistance genes in samples.

To complement read-based analysis, several computational tools based on metagenomic de novo assembly can be applied. Once contigs have been annotated, virulence factors from the pathogens in the community could be inferred by comparing the metagenomic sequences to large databases of pathogen/virulence factors/antibiotic resistance (for the abundant pathogen). In practice, depending on the sequencing strategy, coverage and community complexity, the sequences can be assembled into larger contigs for gene calling prior to annotation. Some software packages, such as metAMOS, SmashCommunity and MOCAT, exist to tie together various components, although no single standard exists yet.

Annotation of metagenomic sequence data has two general steps: (1) features of interest (genes) are identified (feature prediction), and (2) gene function and taxonomy are assigned (functional annotation). Feature prediction is the process of labeling sequences as genes or genomic elements.
A number of available tools are specifically designed to handle metagenomic prediction of CDS, including FragGeneScan, MetaGeneMark, MetaGeneAnnotator (MGA)/MetaGene and Orphelia. All of these tools use internal information (e.g., codon usage) to classify sequence stretches as either coding or non-coding; they differ from each other in the quality of the training sets used and in their usefulness for short or error-prone sequences. FragGeneScan is currently the only algorithm known to the authors that explicitly models sequencing errors and thus results in gene prediction errors of only 1-2%. True positive rates of FragGeneScan are around 70% (better than most other methods), meaning that this tool still misses a significant subset of genes. These missing genes could potentially be identified by BLAST-based searches; however, the size of current metagenomic datasets often makes this computationally prohibitive.

Functional annotation represents a major computational challenge for mixed sample metagenomic studies. Current estimates are that only 20 to 50% of a metagenomic dataset can be annotated, leaving open the question of the importance and function of the remaining genes. Metagenomic annotation typically relies on classifying sequences to known functions or taxonomic units based on homology searches against available "annotated" data. Considering the large size of metagenomic datasets, manual annotation is not feasible; therefore the ideal automated annotation would be very accurate and computationally inexpensive. Running a BLASTx similarity search is currently computationally expensive. Computationally less demanding methods, which detect feature composition in genes, have limited success for short reads. With growing dataset sizes, some software packages (such as MG-RAST, IMG-M-ER, CloVR, SmashCommunity and Virome) now exist to address this; however, no single standard exists yet.

Technical Considerations
The computational requirements are very high, typically requiring a machine with large RAM to assemble contigs or the ability to transfer data to a location with these capabilities. Interpretation of data may be a slow process.

Computation Requirements
The de novo assembler and homology searching tools listed above are needed. For de novo annotation of assembled sequences, either significant investment in software and hardware for this purpose or high-speed internet access to an external annotation portal will be required.
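Feature prediction, the first annotation step described above, can be illustrated with a naive open reading frame (ORF) scan. This sketch only looks for ATG-to-stop stretches on the forward strand and is far simpler than the tools named above, which use codon usage models and, in the case of FragGeneScan, sequencing error models. The function name, the `min_codons` parameter and the output layout are assumptions.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=30):
    """Return (start, end, frame) for ATG..stop ORFs on the forward strand; end is exclusive."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                # Length counts the start codon but not the stop codon.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

A scan like this misses genes truncated at read ends and genes on the reverse strand, and it breaks on a single insertion or deletion error, which helps explain why dedicated metagenomic gene callers are preferred for short or error-prone reads.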

Appendix 3: List of software packages mentioned

Tool Name | Description | Website | Reference

rRNA Amplicon Analysis

Pipelines
Quantitative Insights Into Microbial Ecology (QIIME) | Software package for comparison and analysis of microbial communities, primarily based on high-throughput amplicon sequencing data (SSU rRNA) | http://qiime.org/ | [11]
Mothur | Collection of tools for analysis of 16S rRNA datasets | http://www.mothur.org/ | [12]

Database
Greengenes | 16S rRNA gene sequence alignment for browsing, blasting, probing, and downloading. | http://greengenes.lbl.gov/cgi-bin/nphindex.cgi | [13]
Ribosomal Database Project (RDP) | Data analysis and aligned and annotated Bacterial and archaeal small-subunit 16S rRNA sequences. | http://rdp.cme.msu.edu/ | [14]
SILVA | Comprehensive, quality checked datasets of aligned rRNA sequences for all three domains of life | http://www.arb-silva.de/ | [15]

Standalone tools
NAST | Align a batch of sequences against the 16S greengenes rRNA gene database | http://greengenes.lbl.gov/cgi-bin/nphNAST_align.cgi | [16]
FastTree | Infers approximately-maximum-likelihood phylogenetic trees from alignments of nucleotide or protein sequences. | http://www.microbesonline.org/fasttree | [17]

QC tools
AmpliconNoise | Removal of noise from 454 sequenced PCR amplicons | http://code.google.com/p/ampliconnoise/ | [18]
Denoiser | Removes sequencing noise characteristic to pyrosequencing by flowgram clustering. | http://qiime.org/scripts/denoiser.html | [11]
Find Chimeras | Uncover chimeras hidden in 16S rRNA sequences. | http://decipher.cee.wisc.edu/FindChimeras.html | [19]
UCHIME | Check for chimeras | http://drive5.com/uchime/ | [20]

PathogenID Amplicon Sequencing

Identification of DNA signatures
KPATH | Identification of DNA signatures in silico | https://www.llnl.gov/str/April04/Slezak.html | n/a
TOFI | Oligonucleotide fingerprint identification for microarray-based pathogen diagnostic assays | https://applications.bioanalysis.org/tofi/ | n/a
Yoda | Yet-another Oligonucleotide Design Application (PMID: 15572465) | http://pathport.vbi.vt.edu/YODA | [21]
Insignia | Generates unique DNA signatures for any and all pathogens | http://insignia.cbcb.umd.edu/index.php | [22]

Alignment engines
Basic Local Alignment Search Tool (BLAST) | homology searching engine | https://www.ncbi.nlm.nih.gov/ | [23]

Bowtie2 | Aligns short NGS reads to long reference sequences. | http://bowtiebio.sourceforge.net/bowtie2/index.shtml | [24]
BWA | Burrows-Wheeler Aligner (BWA) is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence | http://bio-bwa.sourceforge.net/ | [25]

Isolate genome sequencing (read based analysis)

Alignment engine
Blast | homology searching engine | https://www.ncbi.nlm.nih.gov/ | [23]
Bowtie2 | Aligns short NGS reads to long reference sequences. | http://bowtiebio.sourceforge.net/bowtie2/index.shtml | [24]
BWA | Burrows-Wheeler Aligner (BWA) is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence | http://bio-bwa.sourceforge.net/ | [25]
Blat | BLAST-like alignment tool | http://genome.ucsc.edu/cgibin/hgBlat?command=start | [26]
Samtools | Package for manipulation of NGS alignments, which includes a computation of genotype likelihoods (samtools) and SNP and genotype calling (bcftools) | http://samtools.sourceforge.net/ | [27]

Virulence factor Database
Mvirdb | a microbial database of protein toxins, virulence factors and antibiotic resistance genes | http://mvirdb.llnl.gov/ | [28]
VFDB | reference database for bacterial virulence factors | http://www.mgc.ac.cn/VFs/main.htm | [29]

SNP call tools
SOAP2 | Package for NGS data analysis, which includes a single individual genotype caller (SOAPsnp) | http://soap.genomics.org.cn/index.html | [30]
realSFS | Software for SNP and genotype calling using single individuals and allele frequencies. Site frequency spectrum (SFS) estimation | http://128.32.118.212/thorfinn/realSFS/ |
Samtools | Package for manipulation of NGS alignments, which includes a computation of genotype likelihoods (samtools) and SNP and genotype calling (bcftools) | http://samtools.sourceforge.net/ | [27]
GATK | Package for aligned NGS data analysis, which includes a SNP and genotype caller (Unifed Genotyper), SNP filtering (Variant Filtration) and SNP quality recalibration (Variant Recalibrator) | http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit | [31]
Beagle | Software for imputation, phasing and association that includes a mode for genotype calling | http://faculty.washington.edu/browning/beagle/beagle.html | [32]
IMPUTE2 | Software for imputation and phasing, including a mode for genotype calling. Requires fine-scale linkage map | http://mathgen.stats.ox.ac.uk/impute/impute_v2.html | [33]

Tool Name

Description

QCall

SNP and genotype calling, including a method for generating candidate SNPs without LD information (NLDA) and a method for incorporating LD information (LDA). The 'feasible' genealogies can be generated using Margarita Software for SNP and genotype calling, including a method (GPT_Freq) for generating candidate SNPs without LD information and a method (thunder_glf_freq) for incorporating LD information

MaCH

Website ftp://ftp.sanger.ac.uk/pub/rd/QCALL http://www.sanger.ac.uk/resources/softwa re/margarita http://genome.sph.umich.edu/wiki/Thund er

Reference n/a

[34]

Molecular Evolutionary analysis tool MEGA

an integrated tool for conducting sequence alignment, inferring phylogenetic trees, mining web-based databases, estimating rates of molecular evolution, inferring ancestral sequences, and testing evolutionary hypotheses.

http://www.megasoftware.net/

[35]

Short read assembler for small genomes, ideal for Illumina data Short read assembler (designed for Illumina GA reads) that can handle up to human sized genomes Commercially available software for analysis, visualization and comparison of nucleic acid and protein sequence data. Short read NGS data assembler optimized for Roche 454 pyrosequencing data De-novo assemblies using reads gathered through Sanger, 454 or Solexa sequencing technologies.

http://www.ebi.ac.uk/~zerbino/velvet/ http://soap.genomics.org.cn/soapdenovo.ht ml http://www.clcbio.com/

[36] [33]

http://www.chevreux.org/projects_mira.ht ml

[37]

Fully-automated service for annotating bacterial and archaeal genomes.

http://rast.nmpdr.org/

[28]

Ergatis: A web interface and scalable software system for bioinformatics workflows. CloVR: A virtual machine for automated and portable sequence analysis from the desktop using cloud computing

http://ergatis.sourceforge.net/

[38]

http://clovr.org/

[39]

http://www.sanger.ac.uk/resources/softwa re/act/ http://gel.ahabs.wisc.edu/mauve/

[40]

http://mummer.sourceforge.net/

[24]

Isolate genome sequencing (assembly based analysis)
Assembly tools Velvet SOAPdenovo CLC bio

Newbler MIRA

n/a

Annotation system Rapid Annotation using Subsystem Technology (RAST) Ergatis CloVR

Comparative genome tools ACT

ACT: Artemis Comparison Tool

Mauve

System for constructing multiple genome alignments in the presence of large-scale evolutionary events such as rearrangement and inversion. Graphical viewing tools for analyzing genome alignments

MUMmer

Read based annotation


[41]

Tool Name

Description

MG-RAST IMG-M-ER

Automated analysis platform for metagenomes Microbial community metagenome datasets

Virome

Viral metagenome analysis

CloVR

Uclust, blastx for function, blastn for tax, metastats for beta diversity, alternatively, metagene for protein seq

Website

Reference

http://metagenomics.anl.gov/ https://img.jgi.doe.gov/cgibin/mer/main.cgi http://virome.diagcomputing.org/#view=h ome http://clovr.org/methods/clovrmetagenomics/

[42] [43]

http://khavarilab.stanford.edu/resources.ht ml https://github.com/capsid/capsid http://www.broadinstitute.org/software/pa thseq/

[45]

http://bio-bwa.sourceforge.net/ http://bowtiebio.sourceforge.net/bowtie2/index.shtml http://www.novocraft.com/main/page.php ?s=novoalign http://www.clcbio.com/

[25] [24]

http://mltreemap.org http://huttenhower.sph.harvard.edu/metap hlan/ http://www.cbcb.umd.edu/~boliu/metaph yler/ http://omics.informatics.indiana.edu/mg/p hyloshop/ https://github.com/gjospin/PhyloSift

[47] [48]

http://sequedex.lanl.gov/

[31]

http://www.bork.embl.de/software/smash / https://github.com/treangen/metAMOS/w iki

[51]

[44] [39]

Mapping based pipeline RINS

Rapid Viral detection

CaPSID Pathseq

Rapid pathogen detection (especially good for virus) Computational tool for ID & analysis of microbial sequences in high-throughput human sequencing data, designed to work with large numbers of sequencing reads in a scalable manner.

[46]

Standalone mapping tools BWA Bowtie2

Read mapping tool Read mapping tool

Novaalign

Commercially available aligner for single-ended and pairedend reads from the Illumina Genome Analyser Commercially available software for analysis, visualization and comparison of nucleic acid and protein sequence data.

CLC bio

Marker gene based pipeline MLTreeMap MetaPhlAn

phylogenetic markers based on tax assignment Profiling the composition based on marker genes

Metaphyler

Profiling the composition based on marker genes

PHYLOSHOP

Inferred from 16S rRNA gene sequencing and shotgun metagenomics Pipeline conduct phylogenetic analysis of genomes and metagenomes (good for pathogen detection) Rapid phylogenetic and functional classification of short genomic fragments with signature peptides

PhyloSift Sequedex

[49] [50] -

Mixed sample based pathogen ID

Assembly based


SmashCommunity

454 assembly and gene prediction, blast based assignment

Metamos

Assembling and analysis

[52]

Tool Name

Description

MOCAT

pipeline: trim--read mapping--assembling--gene prediction

Website http://vmlux.embl.de/~kultima/MOCAT//about.htm l

Reference [53]

FragGeneScan

predicts the protein-coding regions in short reads

[14]

MetaGene

prokaryotic gene-finding program, that utilizes di-codon frequencies estimated by the GC content of a given sequence with other various measures Predicts prokaryotic genes from a single or a set of anonymous genomic sequences metagenomic ORF finding tool for the prediction of protein coding genes in short, environmental DNA sequences with unknown phylogenetic origin gene identification in DNA sequences derived from shotgun sequencing of microbial communities

http://omics.informatics.indiana.edu/FragG eneScan http://rgd.mcw.edu/wg/toolmenu/metagene-to-be-retired http://metagene.cb.k.u-tokyo.ac.jp.

[18]

http://orphelia.gobics.de/

[54, 55]

http://exon.gatech.edu/GeneMark/metagen ome/index.cgi

[19]

https://www.msi.umn.edu/sw/amos http://bowtiebio.sourceforge.net/bowtie2/index.shtml http://www.clcbio.com/

[24] n/a

http://www.cs.hku.hk/~alse/metaidba.

[20]

http://metavelvet.dna.bio.keio.ac.jp/ http://www.chevreux.org/projects_mira.html

[21] [22]

Metagenome gene call

MetaGeneAnnotator (MGA) Orphelia

MetaGeneMark

[17]

Metagenome assembly tool Amos Bowtie2

Short read assembler Aligns short NGS reads to long reference sequences.

CLCbio

Commercially available software for analysis, visualization and comparison of nucleic acid and protein sequence data. An iterative De Bruijn Graph de novo short read assembler specially designed for de novo metagenomic assembly Metagenome assembler for short read (Illumina) datasets De-novo assemblies using reads gathered through Sanger, 454 or Solexa sequencing technologies. Short read NGS data assembler optimized for Roche 454 pyrosequencing data Short read assembler (designed for Illumina GA reads) that can handle up to human sized genomes Reconstructs a large fraction of transcripts, including alternatively spliced isoforms and transcripts from recently duplicated genes. Short read assembler for small genomes, ideal for Illumina data

Meta-IDBA MetaVelvet MIRA Newbler SOAPdenovo Trinity

Velvet

Table 19: Software tools and packages mentioned in Chapter 1.

n/a http://soap.genomics.org.cn/soapdenovo.html http://evomics.org/learning/genomics/trinity/

[33]

http://www.ebi.ac.uk/~zerbino/velvet/

[37]

[34]

Appendix 4: Comparative Analysis of Performance of Current Sequencing Platforms

Abstract

To investigate the relative accuracy, utility and applicability of sequencing platforms (Ion Torrent, Illumina MiSeq, Illumina HiSeq, Roche 454 and PacBio RS), three bacteria of varying G+C content were sequenced on each platform, and several kits and chemistries were evaluated for the Illumina platform. Due to the necessary amplification for the library preparation stage, most platforms show some level of G+C bias. This investigation indicates that for most applications, the NebNext2 kit shows more even and better coverage than the TruSeq library preparation methodology. The addition of betaine to the library preparation steps did not negatively affect samples and appears to have improved coverage of high G+C organisms. Illumina sequencing yields fewer InDel errors than other platforms.

Samples and Method

Strain selection and sequencing methodology

Three isolate bacterial samples with finished genomes ranging from high (69%) to low (32%) G+C ratio were selected. Each isolate genome was sequenced using Roche 454 FLX, Ion Torrent PGM, Illumina and PacBio instruments. Additionally, for Illumina sequencing, two kits (Illumina TruSeq, NebNext 2) were tested, as was the addition of a DNA stabilization solution (betaine), to determine their effects on evenness of coverage and their contribution towards changes in assembly. Table 20 lists the organisms used for this investigation. Illumina samples were multiplexed onto multiple lanes, yielding varying coverage for each sample. Ion Torrent samples were run using the manufacturer’s instructions and software on a 316 chip. For 454 libraries, ¼ or ½ plate runs were selected for each sample. Each sample was prepared for a ~2Kb sequence run on PacBio and 8 SMRT-cells were sequenced. For a detailed description of the library preparation and sequencing methodologies see Chapter 4.

Isolate                      %GC   Size     Notes
Burkholderia thailandensis   68%   6.71Mb   2 Chromosomes
Escherichia coli             50%   5.3Mb    Isolate from the Republic of Georgia; 4 Plasmids
Bacillus anthracis           36%            Isolated variant of B. anthracis Ames; 1 Plasmid

Table 20: List of bacterial strains used in comparative study. (Duplicate of table 11 in main text).

Analysis

We performed analysis of sequencing quality by trimming all reads at quality 5 to remove low quality sequence. The percent of bases and percent of reads removed during trimming were calculated for all sequencing samples and compared between samples. Trimmed reads were aligned to finished reference genomes using the Burrows-Wheeler Aligner (BWA). SAMtools (see Table 19 for information on this and other tools used) was used to calculate all coverage statistics and to identify and count SNP/InDels. All samples were assembled with the Velvet assembler, using default parameters and K=75. Next, each generated contig set was aligned to the reference using nucmer. The percent coverage and SNP/InDels were identified using MUMmer tools. For Roche 454 and Ion Torrent PGM data, sequence data was also assembled using Newbler (454), an overlap based consensus tool designed for 454 data. Comparisons between these contigs and the assembled reference were performed using MUMmer tools to calculate genome coverage and SNP and InDel counts. As the samples used have finished reference genomes, InDels and SNPs are presumed to be false positives, so fewer is better.

Results

Illumina sequencing, assembly, and analysis


Table 21 details the results of sequencing and read‐mapping of Illumina reads to references. The results show that the Illumina platforms perform well for all organisms listed, produce accurate assemblies and have the highest throughput and per base sequencing quality. There are no noticeable differences between kits from the perspective of coverage or accuracy and the addition of betaine to the preparation improves sequencing coverage of high G+C content DNA without negative impacts on lower G+C regions.

Platform and library preparation, in column order: HiSeq TruSeq; HiSeq TruSeq +Betaine; HiSeq NebNext2; HiSeq NebNext2 +Betaine; MiSeq TruSeq; MiSeq TruSeq +Betaine; MiSeq NebNext2; MiSeq NebNext2 +Betaine. Where a row gives eight values, they follow this order.

B. thailandensis
  Reads Generated (Million):  26.4, 25.4, 27.6, 24.6, 20.3, 21.3, 25, 32.4
  % Genome Coverage:          99.99%, 100%
  Fold Coverage ±StDev*:      324±112, 310±82, 278±76, 306±84, 388±128, 401±90, 453±98, 629±143

E. coli
  Reads Generated (Million):  18.5, 26.7, 22.9, N/A, 7.2, 7.4
  % Genome Coverage:          100%, N/A, 100%
  Fold Coverage ±StDev*:      240±32, 337±53, 345±40, 321±38, N/A, 158±28, 223.17±28, 182.59±30

B. anthracis
  Reads Generated (Million):  47.8, 17.1, 33.7, 3.2, 7.1, 6.4, 9.2, 8.8
  % Genome Coverage:          100%
  Fold Coverage ±StDev*:      204±82, 569±205, 874±121, 39±9, 105±41, 167±62, 192±30, 204±31

Table 21: Sample table of results by library preparation method. *Coverage and standard deviation values for Burkholderia thailandensis are presented as an average of both chromosomes. (Duplicate of table 15 in main text).

Read based analysis

Figure 21 illustrates the variation in evenness of coverage of the three pathogens between kits, and between MiSeq and HiSeq. For the high G+C Burkholderia thailandensis strain, a noticeable drop in percent coverage is seen for TruSeq libraries generated without betaine, coupled with a higher standard deviation of fold coverage, indicating a drop in coverage of the higher %G+C regions. For all other samples, evenness of coverage (calculated as a function of the normalized standard deviation of fold coverage for each base in each genome) does not show significant differences between treatments. There is some indication that NebNext kits result in lower deviation in coverage than TruSeq; however, these results are not significant.

Figure 21: Evenness of coverage of each sample type ordered by GC content. Measurement of evenness is expressed on a scale of 0-1, where 1 would indicate that all bases are covered at exactly the same fold coverage.
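The read-based metrics discussed here (percent of the genome covered, mean fold coverage with its standard deviation, and evenness) can all be derived from a vector of per-base read depths such as the one SAMtools can report. The Python sketch below is illustrative only. The report does not give its exact evenness formula, so the function assumes evenness = 1 - (standard deviation / mean), clipped to the 0-1 range, which satisfies the stated property that identical depth at every base gives a value of 1. The name coverage_summary is hypothetical.

```python
from statistics import fmean, pstdev


def coverage_summary(depths):
    """Summarize per-base read depth for one reference sequence.

    depths: one non-negative number per reference position
            (0 where no read maps).
    """
    if not depths:
        raise ValueError("depths must contain at least one position")

    mean_depth = fmean(depths)
    sd_depth = pstdev(depths)  # population SD: every base is observed
    covered = sum(1 for d in depths if d > 0)

    # ASSUMPTION (not from the report): evenness is taken as
    # 1 - coefficient of variation, clipped to the range [0, 1].
    if mean_depth > 0:
        evenness = max(0.0, 1.0 - sd_depth / mean_depth)
    else:
        evenness = 0.0

    return {
        "percent_covered": 100.0 * covered / len(depths),
        "mean_fold": mean_depth,
        "sd_fold": sd_depth,
        "evenness": evenness,
    }
```

Under this assumed definition, a library with a larger standard deviation at the same mean depth scores lower, which matches the direction of the comparison made for the TruSeq libraries prepared without betaine.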

Comparisons of MiSeq to HiSeq generated data indicate that there is a minor but consistent drop in base coverage of the genome, with MiSeq data producing consistently more gaps than HiSeq sequencing of the same library. This trend holds regardless of differences in fold coverage between the two datasets and is most pronounced for Escherichia coli, with a nearly 3-fold increase in the number of bases not covered by reads compared to the HiSeq. There is also a noticeable shift in SNP calls between the MiSeq and HiSeq platforms. However, it is unclear which of the two platforms is more accurate.

Assembly based analysis

Assembly of Illumina reads using Velvet with default parameters yields acceptable draft assemblies for all samples. As anticipated, the assembled contigs do not cover 100% of the genome. In general, more bases are covered by assemblies produced from higher coverage and from sequencing reactions using the NebNext2 kit. These differences between assemblies are not significant between samples. Additionally, the numbers of SNPs/InDels detected from assemblies are significantly greater than those from read-mapping analyses. For every assembly, the analysis identified several possible genomic re-arrangements, each of which is known to be incorrect.

Discussion and recommendations for Illumina sequencing

These analyses indicate that read-mapping based analyses are more accurate and produce fewer SNP/InDel calls than assembly. Assembly of reads from isolate genomes has limited value for locating possible genome rearrangements, which are difficult or impossible to find using current read-mapping techniques, but such rearrangements can be incorrectly assembled, as illustrated above. The MiSeq is comparable, but not identical, to the HiSeq instrument, with the results indicating a higher error rate for SNP/InDel detection on the MiSeq.

Use of betaine is highly recommended for high %GC organisms based on this study. Library preparation kit selection has limited effect on the variability of coverage, but there is limited support for use of the NebNext2 kit over TruSeq.

Platform comparisons

In this section we compare Illumina sequencing using NebNext2 and betaine to sequencing using Roche 454 and the Ion Torrent PGM. As discussed above, de novo assembly of reads will result in worse coverage of bases for both the 454 and PGM; however, use of an overlap based assembler (Newbler, 454) will be discussed briefly.

Sequencing and read mapping

Even with an average coverage of 10× or greater, the coverage of the whole genome is less even for both the 454 and PGM. Generally the PGM performed better for read-mapping based coverage than 454 (99+% vs <98% average for 454). The relatively lower fold coverage for 454 may be responsible for some degree of this drop. It is important to note that similar depth of coverage from Illumina generates both more even and more complete coverage of the genome than the other platforms. Due to the sequencing process involved in both 454 and PGM sequencing, read-mapping based analysis of these data types will result in significantly more InDel detections than Illumina reads, an effect that is typically minimized during assembly.

                           B. anthracis            E. coli                 B. thailandensis
Platform + Chemistry       Reads      HQ Reads     Reads      HQ Reads     Reads      HQ Reads
Roche 454*                 2.77×10^5  2.50×10^5    2.71×10^5  2.47×10^5    4.58×10^5  3.71×10^5
Ion Torrent PGM            2.20×10^6  1.98×10^6    1.58×10^6  1.43×10^6    1.33×10^6  8.75×10^5
MiSeq TruSeq               7.16×10^6  7.13×10^6    N/A        N/A          2.03×10^7  1.97×10^7
MiSeq TruSeq +Betaine      6.48×10^6  6.46×10^6    7.17×10^6  7.07×10^6    2.13×10^7  2.08×10^7
MiSeq NebNext2             9.24×10^6  9.08×10^6    9.08×10^6  8.99×10^6    2.50×10^7  2.41×10^7
MiSeq NebNext2 +Betaine    8.57×10^6  8.51×10^6    7.45×10^6  7.36×10^6    3.24×10^7  3.14×10^7

HQ Reads = High Quality Reads.

Table 22: Reads and trimming results for all platforms and chemistries. Duplicated from Table 12 in the main text.
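The trimming statistics behind Table 22 (percent of reads and percent of bases removed) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the report states only that reads were trimmed at quality 5, so the sketch assumes simple removal of trailing bases whose quality falls below the threshold, and it counts a read as removed when nothing remains. The function names trim_3prime and trimming_summary are hypothetical.

```python
def trim_3prime(qualities, threshold=5):
    """Return how many leading bases are kept after dropping trailing
    bases whose quality score is below `threshold`."""
    keep = len(qualities)
    while keep > 0 and qualities[keep - 1] < threshold:
        keep -= 1
    return keep


def trimming_summary(reads, threshold=5, min_length=1):
    """reads: list of per-base quality-score lists, one list per read.

    ASSUMPTION: a read shorter than `min_length` after trimming is
    discarded, and all of its bases count as removed.
    """
    if not reads:
        raise ValueError("reads must not be empty")

    total_reads = len(reads)
    total_bases = sum(len(q) for q in reads)
    kept_lengths = [trim_3prime(q, threshold) for q in reads]
    surviving = [n for n in kept_lengths if n >= min_length]
    reads_kept = len(surviving)
    bases_kept = sum(surviving)

    return {
        "reads_kept": reads_kept,
        "bases_kept": bases_kept,
        "percent_reads_removed": 100.0 * (total_reads - reads_kept) / total_reads,
        "percent_bases_removed": (
            100.0 * (total_bases - bases_kept) / total_bases if total_bases else 0.0
        ),
    }
```

Applied to the read counts in Table 22 rather than to raw quality scores, the same percentage gives, for example, about 9.7% of Roche 454 reads removed for B. anthracis ((2.77×10^5 - 2.50×10^5) / 2.77×10^5).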

Table 23 indicates the ability of each data type to adequately cover a target genome. The number of reads generated and possibility for multiplexing are also detailed.

Platform         Reads/Run (Ave Length)   Ave. Genome Coverage (%)   Fold Coverage (Min-Max)   Multiplex (max samples/run)
MiSeq            ~20 Million (100Bp)      100%                       40-800×                   2-4**
PGM (316 Chip)   1-2 Million (~200Bp)     99.99%                     10-100×                   1
454 FLX*         100,000 (400Bp)          < 99%                      5-45×                     1

Table 23: Sample table for platform analysis. *FLX is used in lieu of the GS Jr.; previous studies have shown highly similar behavior between the two. **Genome size coupled with desired fold coverage drives the calculations of how many samples may be multiplexed per run.
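The multiplexing note in Table 23 reduces to simple arithmetic: expected fold coverage is reads × read length / genome size, and the number of samples per run is the run's read yield divided by the reads each sample needs. The Python sketch below illustrates that calculation. The function names are hypothetical, the 100× target in the example is an assumed value chosen for illustration, and the model ignores unmapped reads, trimming losses and uneven pooling.

```python
def fold_coverage(reads, read_length_bp, genome_size_bp):
    """Expected average fold coverage if every base of every read maps."""
    return reads * read_length_bp / genome_size_bp


def max_samples_per_run(reads_per_run, read_length_bp, genome_size_bp, target_fold):
    """Largest number of equally pooled samples that still reach target_fold."""
    if target_fold <= 0:
        raise ValueError("target_fold must be positive")
    reads_per_sample = target_fold * genome_size_bp / read_length_bp
    return int(reads_per_run // reads_per_sample)


if __name__ == "__main__":
    # MiSeq yield and read length from Table 23; E. coli genome size from
    # Table 20. The 100x target is an illustrative assumption.
    print(round(fold_coverage(20_000_000, 100, 5_300_000), 1))   # 377.4
    print(max_samples_per_run(20_000_000, 100, 5_300_000, 100))  # 3
```

With these inputs a single sample would reach roughly 377× and three samples could share a run at 100× each, which falls within the 2-4 samples per run listed for the MiSeq.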

Analysis of the coverage evenness in 454 and PGM data indicates that evenness is more variable for both than for Illumina. Figure 22 illustrates the variation of this coverage for all three organisms and all three technologies.

Figure 22: Comparison of evenness of coverage between platforms. Illumina MiSeq performs better in all cases.

Assembly

Use of Velvet for 454 and Ion Torrent data assembly is not ideal, due to the differences between Illumina based sequencing and the highly similar methods of sequencing employed by the 454 and Torrent platforms. The 454 assembler (Newbler) or other overlap-based assemblers designed to use 454 or PGM data (e.g. MIRA) are highly recommended for assembly of these data types. Analysis of assemblies using Velvet under the same conditions as described above yielded significantly worse results than those from any assembly of Illumina data. Assembly using Newbler gives similar coverage to that generated by Illumina reads and assembly for both platforms.

Discussion and recommendations for cross-platform analysis

This study has indicated that Illumina MiSeq generates fewer errors than either the 454 or Ion Torrent platforms, regardless of coverage tested. Read mapping analysis indicates that all platforms are capable of covering the vast majority of the bacterial genomes (>97%), with Illumina mapping to every base. Both 454 and PGM sequencing datasets generate orders of magnitude more InDels than Illumina sequencing, with a slightly reduced number of SNPs. However, the lower overall coverage of the genome may be partially responsible for the decrease in SNP generation. Ion Torrent generates more even coverage of the genome compared to 454 and is capable of multiplexing 2-4 samples, as opposed to 454 GS Jr. K-mer based assemblers are not recommended for assembling 454 or Ion Torrent reads. Newbler assemblies generate coverage of the genome similar to that of the Velvet assemblies generated for Illumina reads, and much improved when compared to the Velvet assemblies of the same data sets. The limitation of the 454 GS Junior, which yields up to 100,000 reads per run, also indicates that it is unlikely to generate sufficient coverage for most of the work outlined here.

From this study, we conclude that Illumina has more even coverage and can generate better information about strain level variants than either of the other two discussed platforms. Use of Newbler (or potentially MIRA) for 454 or PGM reads has been shown to generate assemblies that are highly similar to those generated by Velvet for Illumina data. In essence, there are distinct advantages and disadvantages to each platform, with the need of the user driving the decision to utilize the most advantageous platform for their use. For amplicon sequencing or read-based mapping, reducing the number of InDel errors is more important than for those analysis pipelines requiring assembly, where the majority of InDels are corrected. These observations are in line with those made in recent comparisons between sequencing platforms performed by the Sanger Institute and at the Beijing Genome Institute [56, 57].

PacBio sequencing and analysis

For each sample, 8 SMRT-cells were sequenced and analyzed using both PacBio-provided and standard analysis tools. In all cases, PacBio reads were able to cover 100% of the genome with similar numbers of SNPs and InDels as the PGM or 454 read sets. Assembly of PacBio reads was not performed.

Sequencing of mixed samples

Methods

To mimic expected loads of pathogens in blood and sputum, samples of human blood and human sputum were spiked with several pathogens, at varying ratios of cells/viral particles. These ratios reflect biologically relevant levels of these pathogens in their respective sample types. DNA from these samples was extracted and sequenced. Sequencing was performed using one lane each of HiSeq (2×100 bp reads). Read-mapping based analysis was performed on these reads to determine if the pathogens present in the sample could be reliably detected.

Results

Less than 0.0001% of the reads generated were mapped to a pathogen, yielding negligible coverage of any targeted genome. Table 24 shows the read mapping results from three blood samples spiked with Yersinia pestis and Bacillus anthracis.

Organism                        Blood Sample #1   Blood Sample #2   Blood Sample #3
Yersinia pestis (# Spiked)      10^2              10^3              10^4
Bacillus anthracis (# Spiked)   10^4              10^3              10^2
Reads Generated                 ~300 Million
Number of Reads Mapped
Yersinia pestis                 25                57                568
Bacillus anthracis              788               66                7

Table 24: Detection of pathogens from blood samples.

Conclusions

Due to the low levels of available sequence, it is not possible to perform most analyses on these mixed samples. Using current DNA preparation methods, sequencing of pathogens from human samples is not sensitive enough to give reliable answers for diagnostics. There are several methods of preparation that would increase signal-to-noise ratio, but all are currently in states of less than TRL-4. With current throughputs and time constraints, the only platform capable of generating sufficient depth of sequencing coverage to detect pathogens from human background, without application of pre-sequencing methods to remove this DNA, would be the Illumina MiSeq.

Appendix 5: Survey to Sequencing Centers and Platform Vendors

Survey to Sequencing Centers

Responses by Sequencing Centers to Survey Questions

Sample handling

Sample source

The goal of the first question, “What is the source of the majority of your samples,” was to ascertain how many of the centers focused on a small number of long term projects as opposed to a large number of smaller projects. Respondents were able to choose from (a) a continual stream of samples from one or two facilities, (b) all new or small projects, or (c) a nearly even mixture of both.

Of the 12 centers polled, only one received all or the bulk of their samples from 1-2 sources (Figure 23). The remaining centers either processed samples from independent projects or found themselves with nearly half of their incoming samples arriving from one or two sources.

Figure 23: Source of samples for sequencing centers. (Categories: Continual Stream, 1-2 Facilities/Locales; All New/Small Projects; Both.)

Sample types

The second and third questions inquired about the types of samples received and handled. Respondents were able to answer (a) usually, (b) occasionally, or (c) never for each of eight sample types. In the distributed form of the questionnaire, Question 2 asked about commonly processed sample types and Question 3 about sample types handled even occasionally. During the survey process it became clear that these two questions should be handled in parallel.

When asked about the types of samples that they received and handled, answers were varied but indicated that there was in fact a great diversity in the samples processed (Figure 24). Most of the centers process non-viable nucleic acids, with the exception of human samples (which carry an additional paperwork load and ethical considerations). Processing of whole samples varies greatly by their location, their goals and intents for the sequence data, and the facilities available for sample preparation.

Figure 24: Types of samples processed by different sequencing centers. (Sample types: Euk Isolat., Mic Isolat., Clinical, Environ, Human Cells, shown for Purified NA and for Samples; responses: Usually, Occasionally, Never; axis: 0% to 100%.)

Nucleic acid type (DNA or RNA)

Question 4, “Do you handle DNA or RNA?” attempts to discern the prevalence of DNA versus RNA processing. Generally speaking RNA is slightly more difficult to process as it is more sensitive to degradation and must be converted to cDNA prior to sequencing. DNA will tell the analyst the organisms and/or genes present in a sample, while RNA will either reveal the expressed genes of eukaryotes and bacteria, or can be used to determine the sequence of RNA viruses.

All of the 12 centers surveyed processed DNA samples at least occasionally, with 10 of the 12 doing so regularly (Figure 25). Similarly, all but one center processed RNA samples at least occasionally, with 10 of 12 doing so regularly. The RNA sequencing, however, was primarily targeted at transcription studies.

Figure 25: Prevalence of DNA versus RNA processing at sequencing centers. (Bars: DNA, RNA; categories: Occasionally, Regularly.)

Sample tracking

As sample tracking and management are integral operations to any sequencing center, question 5 asked “what type of LIMS or sample tracking system do you use?”. A LIMS, or Laboratory Information Management System, is a database designed to integrate with standard laboratory processes to enable more complete tracking. If the procedures used do not vary and only encompass a few samples, such tracking can be done in a spreadsheet based fashion (Microsoft Excel for example); however, doing so can quickly become intractable and require a great deal of human effort to maintain.

All centers tracked sample receipt and processing, with all but one using a LIMS to do so (Table 25). Over half of those surveyed utilized a commercially available system (with one exception, no two centers used the same vendor), four used an in-house developed system, and one center relied solely on spreadsheet-based tracking.

LIMS Type            Count
In-house developed   7
Commercial           4
Neither/No LIMS      1

Table 25: Sample tracking method used by sequencing centers.

Initial sample processing

In an effort to determine if there was a standard set of steps taken at the arrival of a sample, question 6 asked “when a sample comes in what are the first steps taken?” Upon receipt of samples, all groups agreed that the most important steps were to ensure that full sample information was logged into the tracking system in use, including external QC (when available), applicable biosafety documentation and any metadata available. Because all answers were simple, direct and in agreement with the stated question there was little discussion on this point.

Incoming sample standard protocols

Question 7 attempts to determine what standard protocols and methods are applied to all or most samples upon arrival at a given sequencing center. The question was worded in a rather open-ended way and often interpreted as “what do you routinely do when a sample arrives?”

Depending on the center, available metadata for the samples were entered into the LIMS either prior to or upon sample arrival. All centers generated their own QC of each incoming sample, with the occasional exception for small "precious" samples, where insufficient material was received to run both QC and library preparation (Table 26). Internal sample QC generally included a measurement of nucleic acid concentration (Qubit or other fluorometric method most common) and quality (agarose gel or BioAnalyzer profile). Depending on the analysis plan some utilized qPCR as well. Those handling clinical samples often de-identified them at this stage to comply with ethical guidelines, as well as recording whether the samples had been stabilized with a chaotropic agent. Such actions agree with the overall mission of a given center (routine sample processing and sequencing versus rapid pathogen identification).

Answers                            Yes?
Sample QC                          12
Metadata accumulation              11
Add to freezer management system   12

Table 26: SOP for incoming sample handling.

Incoming sample requirements

Centers impose a variety of stringencies on incoming samples, based largely on their mission. The goal of this question was both to ascertain the strictness and enforcement of rules for incoming samples, and to see how this strictness correlated with the proportion of rapid ID/clinical samples they processed (information gleaned from discussions during an earlier question). This question was simply worded “do you impose rules upon sample handling during processing?” with the possible answers being (a) strict, (b) guidelines, or (c) none.

Of the 12 centers surveyed about half imposed guidelines on incoming samples (Figure 26). For those centers without guidelines, while they asked that incoming samples conform to quality specifications (concentration, mass, molecular weight, etc.), samples would be accepted outside of those specifications, generally with the explicit understanding that the library preparation and sequencing processes may or may not succeed. For the centers most familiar with accepting clinical samples (CDC and CII), no rules were imposed on the incoming sample quality. In fact, at times the goal of the sequencing work was to show that nothing could be sequenced from it. In contrast, those centers dedicated to sequencing genomes and samples for assembly, but not clinical samples, had strict rules that were imposed and did not allow for receipt of any sample that was unlikely to be successfully sequenced. This distinction between the goal of the center and the stringency of the incoming sample guidelines was not quantified but in relative terms was quite explicit (Figure 27).

Figure 26: Enforcement of incoming sample handling/QC rules. (Categories: Strict, Guidelines, None.)

Figure 27: Relative correlation between incoming sample QC and frequency of clinical sample processing. (Axes: Incoming QC Stringency versus Frequency of Incoming Clinical Samples.)

Incoming sample processing time

The turnaround time, or time from sample receipt to data availability, differs in importance based on the project type being discussed. Question 9, “how much time does it take an average sample, from delivery, to enter library preparation?” attempts to ascertain what a normal time frame is across all centers surveyed. This question does not address the time required to QC the samples, prepare the libraries or sequence those libraries; only the time between receipt and the initiation of processing.

One-third of the sequencing centers surveyed began processing samples within one day of receipt, with the remaining two-thirds generally beginning library preparation between 1 and 8 weeks after receipt (Table 27). Generally speaking, those with the shortest time spans dealt largely with clinical samples and felt that delaying sample processing may have a negative impact on the health and/or well-being of another person. Samples with the longest normal time between receipt and processing were often considered routine, such that delays would likely have little negative impact. One center described a particular batch of samples that were delayed approximately one year due to delays in paperwork (data removed from the table below); however, their normal queue time was measured in weeks, not months.

Time from Receipt to Processing (days)
Min       0.2-1
Max       28-56
Ave       8.02-19.4
Std Dev   8.50-19.9

Table 27: Time (days) a sample normally waits between receipt at a sequencing facility and the initiation of processing.

Sequencing process

Sequencing platforms utilized

Until 2005, the only commercially available sequencing platforms were based on di-deoxy terminator sequencing (often referred to as “Sanger” sequencing). Since that time, several instruments have been released, each with particular attributes, all of which have vastly greater throughput than the di-deoxy platforms. In question 1 we asked “which sequencing platforms are utilized in your laboratory space?” The question was asked regarding the most commonly used platforms (PacBio RS, Illumina HiSeq, Illumina MiSeq, Ion Torrent, IonProton, Roche 454 FLX, Roche 454 Jr.); however, other responses provided by the respondents were recorded as well. It should be noted that at the time of writing this report, the IonProton platform (the most recently released machine from Ion) has been delivered to two of the locations but not installed.

By far the most popular NGS chemistry was the one sold by Illumina. Of the 12 centers surveyed, 10 had at least one unit on site, and another had access to that technology at an outside location (Table 28, Figure 28).

NGS Platform     Onsite   Offsite   Delivered
Illumina GAiiX   2        1         0
Illumina HiSeq   7        2         0
Illumina MiSeq   7        1         0
IonProton        0        0         2
Ion Torrent      7        1         0
PacBio RS        6        1         0
Roche 454 FLX    6        0         0
Roche 454 Jr     1        0         0
Sanger           3        0         0
SOLiD            2        0         0

Table 28: Sequencing platforms utilized

Figure 28: Most common sequencing platforms utilized. (Categories: Illumina, LifeTech Ion, PacBio RS, Roche 454.)

It should be noted that the cost to purchase and operate the three different platforms offered by Illumina varies greatly, as does the maximum throughput from a given sequencing run on each. These differences in investment and output lead to alternate uses of the platforms, with the GAiiX (the oldest) being the least desirable and, as seen here, the least commonly utilized (Table 29).

NGS Platform   Onsite   Offsite
GAiiX          2        1
HiSeq          7        2
MiSeq          7        1

Table 29: Types of Illumina platforms utilized at various sequencing centers.

Library preparation modifications

As with any new technology, the methods for preparing libraries from nucleic acids are constantly changing; this is due to various groups experimenting with protocols to find ways to optimize for their particular process, and to vendors releasing and re-releasing kits for library preparation. Generally speaking these methods are painstaking, and without question even the most robust method requires some testing and modification before moving it into production in a new setting. This question asked each center if they “utilize the manufacturer’s protocols for library preparation or make minor/major modifications for an internal SOP?” As this question covered preparation for all sequencing platforms, most of the respondents were uncomfortable or unable to provide a single answer; in fact fully half of the centers affirmed two answers rather than one, leading to a re-coding of the responses from the original options.

Most of the respondents (66%, Figure 29) said that their center made either minor or minor to major modifications to the protocols provided by reagent/kit vendors. The remaining 4 centers either followed the protocols exactly as written by the vendors or made only occasional minor changes.

Figure 29: Customization of protocols from those provided by vendors. (Categories: As written; As written or minor; Minor; Minor or major.)

Frequency of library SOP revision

The continuous changes in protocols and available reagents for preparing and sequencing samples force each center to determine how often they will revisit and/or revise their SOP for both library preparation and sequencing runs. This continual investment in research and development of the sequencing methodologies is quite costly; however, without it an entire center can quickly become outdated.

Of the sequencing centers surveyed, one-half said they had a dedicated team that was continually testing new reagent kits and protocols. Respondents with testing teams tended to be the larger centers, and those known for sequencing large numbers of samples on a continual basis (Figure 30).

One-third of the respondents said that they tested new SOP and kits every few months, often shortly after the release of a new commercially available kit or published protocol. Most of these mentioned that they followed protocols released by other sequencing centers (for example, the Broad Institute’s release of a new protocol often prompts two other centers surveyed here to update their processes).

Finally, two teams, both at the CDC, said that they rarely altered their operating methods. This is largely due to the fact that they currently have highly robust systems that are validated and accepted for the diagnosis of human disease. To change their systems would require a revalidation of each step.

Figure 30: Frequency of protocol and reagent testing. (Categories: Rarely; Every few months; Continually.)

Desired microbial genome coverage (by platform)

The goal of this question was to determine what level of coverage each center used in assemblies of microbial genomes. Unfortunately the question was worded somewhat ambiguously: “provide normal values you see or aim for on each platform utilized in your facility.” Of the centers that routinely assemble microbial genomes, all eight used the Illumina platforms as their primary data set (Table 30). Of those, most (75%) aim for 50-100 fold genome coverage with Illumina data, one aims for 300-fold coverage and the last only 15-30 fold; however, this low-coverage outlier only generates reference-based assemblies and so requires less data input. Three of the centers routinely generated PacBio data to assist in scaffolding of the assembled data, and aimed for between 20- and 50-fold genome coverage. In contrast, only two centers routinely generated Ion Torrent data for genome projects, and their desired coverage varied greatly (20- versus 200-fold genome coverage). Finally, only one center still routinely generates Roche 454 data for microbial genomes, aiming for 25-fold coverage. As newer methods (including long-insert library preparation for the Illumina platform) have become available, 454 data has become less commonly generated.

| Platform | Coverage Aimed for (fold) | # Centers |
|---|---|---|
| Illumina | 10* | 1 |
| Illumina | 15-30** | 1 |
| Illumina | 50-150 | 6 |
| Illumina | 300 | 1 |
| PacBio | 20-50 | 3 |
| Ion Torrent | 20 | 1 |
| Ion Torrent | 200 | 1 |
| Roche 454 | 25 | 1 |

Table 30: Desired Genome Coverage by Platform.
*Human genome sequencing uses reduced coverage to account for size/cost issues and looks for SNPs relative to a reference.
**Reference-based assemblies, which look for differences from a validated reference, as opposed to de novo (from-scratch) assemblies, which require more data.
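The fold-coverage targets in Table 30 can be converted into read counts with the standard relation coverage = (number of reads × read length) / genome size. The following is a minimal sketch of that arithmetic; the function name and the 5 Mb genome size are illustrative assumptions, not values reported by any surveyed center.

```python
def reads_for_coverage(genome_size_bp: int, target_fold: int,
                       read_length_bp: int, paired: bool = False) -> int:
    """Reads (or read pairs, when paired=True) needed to reach a target fold coverage."""
    bases_needed = genome_size_bp * target_fold
    # A read pair contributes two reads' worth of bases.
    bases_per_unit = read_length_bp * (2 if paired else 1)
    # Integer ceiling division, so the target is met rather than just missed.
    return -(-bases_needed // bases_per_unit)


# Assumption: a hypothetical 5 Mb bacterial genome.
# 100-fold coverage with 2x100 bp read pairs:
print(reads_for_coverage(5_000_000, 100, 100, paired=True))
# 25-fold coverage with single 450 bp reads:
print(reads_for_coverage(5_000_000, 25, 450))
```

The estimate assumes every base of every read is usable; in practice quality trimming and uneven coverage mean that more reads are needed than this lower bound suggests.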

The centers polled that did not routinely generate de novo microbial assemblies had noteworthy comments. For example, CII does not generally assemble genomes but will often generate 1- to 75-fold genome coverage using Roche 454 to map to references, looking for SNP changes. The Influenza group at the ECBC is still using di-deoxy based sequencing as its main platform, because highly robust sequence-specific PCR primers exist to determine differences in the viral genome, and these are well suited to sequencing on the lower-throughput platform.

Library preparation automation

Initially this question aimed to determine the number of total and human hours required for library preparation and sequencing. However, as it became clear that the times required varied little (with the exception that large-insert libraries tended to take 3 days as opposed to just 1 day for short-insert libraries), the query shifted to determining how many centers utilize an automated or robot-driven library preparation system. Library preparation for the other (non-Illumina) NGS platforms was largely considered a manual process, as adoption has not been extensive and long enough for a vendor to develop an out-of-the-box system for them.

The Sciclone by Caliper was the most commonly implemented method of automating the library preparation process. It is an automated liquid handler that was specifically designed to reduce the number of manual hours required to prepare Illumina TruSeq libraries. Of the seven laboratories that said they used automated library preparation for Illumina sequencing (Figure 31), six specifically stated they used this system (the other center simply did not mention which system it employed).

The time involved in preparing short-insert or standard libraries for any of the NGS platforms is generally one to two days, during which multiple samples can be processed at once, and that time is quite labor intensive. This time generally reaches three to four days for long-insert library preparation, which is also quite labor intensive. The overall time required for an automated standard library preparation is generally two to three days; however, very little of that time requires human input.

Figure 31: Comparison of sequencing centers generating Illumina libraries using a manual or automated process. (Bar chart; categories: Manual, Automated.)

Library types prepared

There are generally two types of libraries that can be produced on any NGS platform: standard (either single-ended or short-insert pairs) and long-insert data. Metagenomic or transcriptomic surveys generally only utilize standard data sets, both because the template material is generally not suitable for long insert and because the extra effort in producing the data would not aid in the final analysis. However, if the goal is to assemble a single genome or enriched metagenome, both library types are desired; the short-insert data at high coverage is useful for depth and quality, while the long-insert data is used to help scaffold or pull the shorter contigs together. The long-insert data can be generated to a lesser fold coverage but does require a more involved library preparation process with more stringent input requirements.

Of the 12 centers polled, only 7 ever make long-insert data on the Illumina or Roche platforms (Table 31). An alternate approach to scaffolding or bridging contigs together is to use the long reads generated by the PacBio RS platform, something used by 5 of the centers.

| Library Types | Regularly | Occasionally | Ever |
|---|---|---|---|
| Shotgun/Short-insert | 12 | 0 | 12 |
| Long-Insert (>7kb) | 6 | 1 | 7 |
| Long Read (PB) | 4 | 1 | 5 |

Table 31: Prevalence of long-insert library preparations among sequencing centers.

Staff training

Training required for incoming staff

Here we asked each sequencing center about its baseline requirements for new hires into a sequencing laboratory setting. It should be noted that the individuals discussed here are responsible for sample receipt, QC, library preparation and sequencing runs but not for any bioinformatics (data transfer, data QC, analysis, assembly, etc.). The 12 centers queried all gave very similar responses. All labs hire those with at least an Associate degree (AS), and most have a minimum requirement of a Bachelor’s degree (BS), though a few required all new hires to hold a Master’s degree (MS). A few labs (2) with longer-term projects still have technicians from the "early Sanger days" with only a high-school education, but those individuals have years of experience and training. Finally, several centers stated that they often brought in high-school or college students to work alongside their full-time staff as a means of outreach and training but would not allow those individuals to handle the sequencing equipment unsupervised.

Survey to NGS Vendors

Responses by NGS Vendors to Survey Question

General questions

Normal anticipated read length

Short read length is the number one complaint about NGS platforms by those who analyze and assemble the data. As such, the first question posed to the vendors asked: “what is the “normal” length of reads that could be expected off your platform?” Early NGS platforms often offered read lengths of 10-30bp; however, current technologies no longer experience such limitations. Standard reads on most of the major platforms in use reach or exceed 200bp in length (a 2×100bp read has an insert of known size in the middle but 200bp of sequence), and one is now on par with “standard” di-deoxy reads (Table 32).

| Platform | Current Read Lengths | Anticipated Read Lengths |
|---|---|---|
| Ion Torrent | 1×300bp | 1×400bp |
| IonProton | 1×200bp | 1×200bp |
| MiSeq | 2×250bp | 2×250bp |
| HiSeq 2000 | 2×100bp | 2×100bp |
| 454 Jr | 400-500bp | 400-500bp |
| 454 FLX+ | 500-1000bp | 500-1000bp |

Table 32: Current and anticipated read lengths from major NGS platforms.

Normal anticipated read count

While the first generation of sequencing utilized 96- and 384-well plates, presumably providing one sequence per well, current platforms have a much greater throughput. This question asked each vendor what the specifications for their platform(s) would provide in terms of read count: “what is a “normal” number of reads to be expected from a single unit of your sequencer?” As shown in Table 33, current NGS platforms provide between 100 thousand and 4×10^9 reads per run (the Illumina HiSeq generates up to 5×10^8 reads per lane with 8 lanes available per run).

| Platform | Setting | Expected Reads |
|---|---|---|
| Ion Torrent | Ion 314 Chip | >1×10^5 |
| Ion Torrent | Ion 316 Chip | >1×10^6 |
| Ion Torrent | Ion 318 Chip | ~3×10^6 |
| IonProton | Ion PI Chip | ~5×10^7 |
| IonProton | Ion PII Chip | ~2×10^8 |
| IonProton | Ion PIII Chip | 1.2×10^9 wells, reads TBD |
| MiSeq | 2×150bp | ~3×10^7 [58] |
| HiSeq 2000 | 2×100bp | ~3×10^12 per flowcell [59] |
| 454 Jr | n/a | ~1×10^5 |
| 454 FLX+ | n/a | ~1×10^6 |

Table 33: Expected number of reads generated per run on NGS platforms.
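The per-run figure quoted above for the HiSeq follows directly from the per-lane figure, and a read count becomes a sequence yield once a read length is chosen. A minimal sketch of both steps is given below; the function names are illustrative, and the 100 bp read length is an assumption taken from the 2×100bp setting rather than a vendor-supplied yield.

```python
def reads_per_run(reads_per_lane: int, lanes: int) -> int:
    """Total reads for a run that fills every lane of the flow cell."""
    return reads_per_lane * lanes


def yield_gigabases(reads: int, read_length_bp: int) -> float:
    """Sequence yield in gigabases (1 Gb = 1e9 bases)."""
    return reads * read_length_bp / 1e9


# Figures quoted in the text: up to 5x10^8 reads per lane, 8 lanes per run.
hiseq_reads = reads_per_run(5 * 10**8, 8)
print(f"{hiseq_reads:.1e} reads per run")

# Assumption: every read is 100 bp long (one half of a 2x100 bp pair).
print(f"{yield_gigabases(hiseq_reads, 100):.0f} Gb per run")
```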

Current and Future Protocols and Directions

Available protocol types

While single-ended reads often form the basis and depth of coverage in any genome assembly, they often yield multiple contigs rather than whole genomes. Therefore libraries with “long inserts”, often painstaking to make, are created. These inserts are of a known length and help to bridge or scaffold distinct contigs together. The question asked of the vendors was: “what current protocols are in place for your platform(s)?” Currently all platforms support, in some fashion, these long-insert reads (Table 34).

| Platform | Single-End | Mate-Paired | Paired-End |
|---|---|---|---|
| Ion Torrent | Kit available from vendor | Ion Community demonstrated, no formal kit | Ion Community demonstrated, no formal kit |
| IonProton | Kit available from vendor | Ion Community demonstrated, no formal kit | Ion Community demonstrated, no formal kit |
| MiSeq | Kit available from vendor | Kit available from vendor | Kit available from vendor |
| HiSeq 2000 | Kit available from vendor | Kit available from vendor | Kit available from vendor |
| 454 FLX+ | Kit available from vendor | Kit available from vendor | Unknown |

Comment column: “Supports all library preparations” (entered for two platforms; the row assignment is not recoverable from the source layout).

Table 34: Current library preparation methods available

Anticipated protocol releases

Sequencing technologies are known for rapid change, which is not surprising given that the decrease in the cost of sequencing DNA has outpaced Moore’s Law (Figure 32). Each vendor was therefore asked two questions to determine anticipated protocol releases in the next 6 months and in the next 12-18 months. From those vendors that did respond, the answers were encouraging and suggest longer read lengths with fewer labor hours to produce data (Table 35).

Figure 32: Cost of sequencing by Mb of DNA, from www.genome.gov/sequencingcosts.

| Platform | Updates in next 6 months | Updates in next 12-18 months | Other planned updates |
|---|---|---|---|
| Ion Torrent | Same as above but up to 400bp read lengths should be in production end of 2012 | No public disclosure of protocol changes beyond 6 months | IonChef: fully automated template preparation and chip loading instrumentation (library in, sequencing-ready chips out), available first half of 2013 |
| IonProton | Isothermal template preparation (e.g. Avalanche) for simpler and more rapid clonal amplification of libraries | No public disclosure of protocol changes beyond 6 months | IonChef: fully automated template preparation and chip loading instrumentation (library in, sequencing-ready chips out), available first half of 2013 |
| MiSeq | No survey response | No survey response | No survey response |
| HiSeq 2000 | No survey response | No survey response | No survey response |
| 454 Jr | Not mentioned | Undetermined at this time | Automation of the library preparation process and expanded read lengths |
| 454 FLX+ | Long amplicon sequencing | Undetermined at this time | Read lengths of up to 1000bp |

Table 35: Anticipated updates to protocols

Library Preparation and Sequencing Run

Time required to prepare library

The first question asked “how many hours do you estimate it takes to prepare your standard library?” This question assumes that the user is only purchasing the kits and reagents from the platform vendor and is following the instructions as written by that manufacturer. The responses in the following table (Table 36) may have caveats such as “with enzymatic shearing”, which will extend the overall time required for the preparation process even if no human interaction is required during that time. Only the “fragment” or standard libraries and RNA libraries are shown, as those are generally the shortest and longest processes, respectively.

| Vendor | Fragment Libraries | RNA libraries |
|---|---|---|
| LifeTech | 2 hours, with enzymatic shearing | 6 hours |
| Illumina | No survey response | No survey response |
| Roche | 4 hours | 6 hours |

Table 36: Time required for standard library preparation.

Training required prior to library preparation

The preparation of sequencing libraries can range from simple and straightforward to complicated and problematic. Often the standard library preparations are mastered quickly by experienced technicians, while long-insert libraries may take weeks or more to troubleshoot. In this section we asked each vendor “how much training do you estimate to be required for an experienced technician to become proficient in your library preparation?” The assumption was that the libraries in preparation would be of the “standard” variety.

Of the two vendors that responded, only one provided a specific answer. Roche indicated that training for the library preparation should take one day for the process to be mastered, while LifeTech simply suggested that their techniques are “standard relative to other NGS competitors so [they] expect the library training time to be comparable to other systems/platforms.”

Library reagent cost

The preparation of standard (shotgun or short-insert) libraries generally takes about one working day to complete (as suggested in Table 36). This question asked “how much would you estimate the reagents for a single sample library preparation to cost?” Generally, a single library preparation uses reagents costing between $50 and $150 per sample (Table 37). However, as shown in the following section, the material does not exit the library process ready to load onto a sequencer.

| Vendor | Fragment Libraries | Amplicon Libraries | RNA libraries |
|---|---|---|---|
| LifeTech | $50 | $125 | $83 |
| Illumina | No survey response | No survey response | No survey response |
| Roche | $130 | No survey response | $130 |

Table 37: Material and Supply costs per library preparation.

Batch library preparation

Laboratory automation is an industry in itself, with specific journals and conferences [60, 61]. This industry has worked to improve efficiency and to reduce both repetitive stress injuries and human-induced errors in common processes. Based on discussion with LifeTech and Roche, followed by a review of the Illumina documentation, all of the platforms discussed in this section offer an automated system to prepare libraries for sequencing (Table 38). The throughput of those systems ranges from 5 to 46 libraries per day, surpassing the throughput of laboratory workers.

| Vendor | Automatable | Max Daily throughput |
|---|---|---|
| LifeTech | Yes | 26 libraries/day |
| Illumina | Yes (3rd party vendor) | 46 libraries/day, 288/week* |
| Roche | Yes (3rd party vendor) | 5 libraries/day |

Table 38: Library preparation automation and throughput.
*As no response was received from Illumina or Caliper, this response was from Lance Green at LANL, where a SciClone is installed and in use.

Steps between library preparation and sequencing run

As indicated earlier in the section, there are generally processing steps required between library preparation and the start of a sequencing run. Often this part of the process involves clonal amplification of the libraries, the goal being to increase the signal so that the detection system on the sequencer can register it. Where a separate amplification step is needed, it can take as little as 5 hours and as much as 1½ days (Table 39).

| Platform | Step(s) | Estimated Time |
|---|---|---|
| Ion Torrent | Template preparation for clonal amplification on IonSphere particles | 6 hours |
| IonProton | Template preparation for clonal amplification on IonSphere particles | 6 hours |
| MiSeq | Processed on the sequencer, accounted for in sequencing run time [58] | 0 hours |
| HiSeq 2000 | Cluster generation (amplification) on cBot [59] | 5 hours |
| 454 Jr & FLX+ | Emulsion PCR (emPCR) for clonal amplification | 1.5 days |
| 454 FLX+ | Emulsion PCR (emPCR) for clonal amplification | 1.5 days |

Table 39: Steps required between library preparations and sequencing run.

Sequencing run time

NGS platforms vary greatly in the time required to generate data, and depending on the application, speed of data acquisition may not be of great importance. The question “how many hours do you estimate it takes to run your sequencing platform?” was intended to provide a succinct review of the run times required for each platform. The most rapid sequencing platform considered here is the Ion Torrent, which can generate 1×10^5 to 8×10^6 200bp reads (depending on the sequencing chip used) in about 3 hours, followed closely by the MiSeq (Table 40). The longest run times are found with the HiSeq system, where each base added takes just under 1 hour, so a 2×100bp run (200bp total) takes 10 days to complete.

| Platform | Normal Run Time |
|---|---|
| Ion Torrent | 3 hr (varies by chip, assuming 200bp reads) |
| IonProton | 21 hr |
| MiSeq | 4-39 hr (varies by read length) [58] |
| HiSeq 2000 | 10 days (2×100 run) [59] |
| 454 Jr | 10 hrs |
| 454 FLX+ | 23 hrs |

Table 40: Time for sequencing run to complete.
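The vendor estimates in Tables 36, 39 and 40 can be added together to approximate a sample-to-data turnaround time. The sketch below is only illustrative and rests on several assumptions that the survey does not confirm: the steps run back to back with no waiting, the LifeTech fragment-library time applies to the Ion Torrent, the Roche time applies to the 454 FLX+, and 1.5 days is taken as 36 hours. The `Workflow` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    """Hours per stage for one platform; values come from vendor estimates."""
    platform: str
    library_prep_hr: float   # Table 36, fragment libraries
    template_prep_hr: float  # Table 39, steps before the run
    run_hr: float            # Table 40, sequencing run

    def total_hr(self) -> float:
        # Assumption: stages are strictly sequential with no idle time.
        return self.library_prep_hr + self.template_prep_hr + self.run_hr


workflows = [
    Workflow("Ion Torrent", library_prep_hr=2, template_prep_hr=6, run_hr=3),
    Workflow("454 FLX+", library_prep_hr=4, template_prep_hr=36, run_hr=23),
]

for w in workflows:
    print(f"{w.platform}: {w.total_hr():.0f} hr ({w.total_hr() / 24:.1f} days)")
```

Real turnaround will be longer, since the estimate ignores sample QC, working hours, and data transfer and analysis.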

Training required prior to NGS platform operation

As a proxy for difficulty in running a sequencing platform, we asked each vendor “how much training do you estimate to be required for an experienced technician to become proficient in running your sequencing platform?” In theory a more difficult system would take longer to train new users on; however, the general response (Table 41) was a 1-2 day training session.

| Vendor | Estimated Staff Training |
|---|---|
| LifeTech | 2 days instruction |
| Illumina | No survey response |
| Roche | 1 day instruction |

Table 41: Training required for staff proficiency on sequencing platforms.

Sequencing reagent cost

Just as library preparations take varying amounts of time and reagents, runs on different sequencing platforms differ in cost. Prices listed here reflect the direct cost of the reagents, without taxes or handling fees (Table 42).

| Platform | Estimated M&S Cost/Run |
|---|---|
| Ion Torrent | $350-$750 (varies by chip) |
| IonProton | $1000 |
| MiSeq | No survey response |
| HiSeq 2000 | No survey response |
| 454 Jr | $1100 [62] |
| 454 FLX+ | $3600 |

Table 42: Reagent cost for a standard sequencing run.
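Combining the library reagent costs (Table 37) with the run costs (Table 42) gives a rough reagent cost per sample. The sketch below assumes the run cost is shared equally among the samples loaded on one run; the number of samples per run is a hypothetical input, not a survey value, and template preparation reagents, labor and instrument costs are excluded.

```python
def reagent_cost_per_sample(library_cost: float, run_cost: float,
                            samples_per_run: int) -> float:
    """Library reagents plus an equal share of the sequencing run reagents."""
    if samples_per_run < 1:
        raise ValueError("samples_per_run must be at least 1")
    return library_cost + run_cost / samples_per_run


# Ion Torrent fragment library ($50) on the most expensive chip ($750), one sample per run.
print(reagent_cost_per_sample(50, 750, 1))

# Roche fragment library ($130) on a 454 FLX+ run ($3600).
# Assumption: 8 samples share the run (hypothetical multiplexing level).
print(reagent_cost_per_sample(130, 3600, 8))
```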

Data Handling Requirements

Manual data transfer (sequencer to server)

The first question concerns the amount of effort involved in transferring data from the sequencer itself to a place where an analyst can access it. Initially this was a manual task; however, all major platforms now have automated means of handling it (Table 43).

| Vendor | Manual Processing for Data Transfer |
|---|---|
| LifeTech | None, customer created scripts to transfer FASTQ/BAM files from servers |
| Illumina | No survey response |
| Roche | None, data transfer process is automated. |

Table 43: Manual processing required for data transfer from sequencer to local servers.

Required hardware investment

Another issue with the amount of data produced by these sequencing platforms involves the acquisition of additional computational hardware to handle the data. As shown in the table below (Table 44), some vendors include server hardware in the initial purchase package, while others simply provide guidance as to the amount of computational power needed.

| Platform | Required Hardware Investment |
|---|---|
| Ion Torrent | PGM includes the Ion Torrent Server to support primary analysis through FASTQ/BAM file generation and on-server variant calling |
| IonProton | PGM includes the Ion Torrent Server to support primary analysis through FASTQ/BAM file generation and on-server variant calling |
| MiSeq | No survey response |
| HiSeq 2000 | No survey response |
| 454 Jr | Roche suggests using the guideline of 4 bytes RAM per input base processed. A Z800 system is included with the FLX and capable of processing most needs. |
| 454 FLX+ | Suggests user plan for 4 bytes of RAM per input base; Roche has used a workstation with 256GB RAM to assemble the human genome. |

Table 44: Required hardware investment (outside that included with the platform purchase).
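The Roche guideline in Table 44 (4 bytes of RAM per input base) lends itself to a quick sizing calculation. The sketch below applies it in both directions; the function names are illustrative, decimal gigabytes (1 GB = 10^9 bytes) are assumed, and the 700 bp mean read length is a hypothetical value chosen within the 500-1000bp range quoted for the 454 FLX+.

```python
BYTES_PER_BASE = 4  # guideline quoted by Roche in Table 44


def ram_needed_gb(input_bases: int) -> float:
    """RAM in GB (1 GB = 1e9 bytes) suggested for a given number of input bases."""
    return input_bases * BYTES_PER_BASE / 1e9


def max_input_bases(ram_gb: float) -> int:
    """Largest input, in bases, that fits in the given RAM under the guideline."""
    return int(ram_gb * 1e9 // BYTES_PER_BASE)


# One 454 FLX+ run: ~1e6 reads (Table 33) at an assumed mean length of 700 bp.
print(f"{ram_needed_gb(10**6 * 700):.1f} GB for one run")

# Input capacity of the 256 GB workstation mentioned by Roche.
print(f"{max_input_bases(256):.2e} bases fit in 256 GB")
```

The guideline covers only the input held in memory; it says nothing about the additional working memory a particular assembler may need.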

Amount of data generated

With the differences in read count, run time, and other factors described above, it is not surprising that each platform provides a different amount of output data per run (Table 45). Greater data output provides increased read depth but also increases the difficulty of transferring the data (both from the sequencer to the server and to collaborating institutions) and analyzing it. Often, especially for high-output platforms such as the Illumina HiSeq 2000, a single run may be split into >700 libraries. This adds computational time to pull the data apart but also increases the efficiency of the system.

| Platform | Setting | Expected FASTQ File Size |
|---|---|---|
| Ion Torrent | Ion 314 Chip | 0.3GB |
| Ion Torrent | Ion 316 Chip | 1.0GB |
| Ion Torrent | Ion 318 Chip | 2.4GB |
| IonProton | Ion PI Chip | 30GB |
| IonProton | Ion PII Chip | ND |
| IonProton | Ion PIII Chip | ND |
| MiSeq | 1×36bp [58] | 540-610Mb |
| MiSeq | 2×250 [58] | 7.5-8.5GB |
| HiSeq 2000 | 1×35bp [59] | 47-52GB |
| HiSeq 2000 | 2×50bp [59] | 135-150GB |
| HiSeq 2000 | 2×100bp [59] | 270-300GB |
| 454 Jr | No survey response | No survey response |
| 454 FLX+ | 500-1000bp reads | 60GB |

Table 45: Amount of data (in GB) generated per full run. ND = Not Determined
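Pulling the data apart when one run carries many libraries is usually done by barcode. The following is a deliberately simplified sketch of that step, assuming each read begins with an inline barcode and that only exact matches are accepted; the barcode sequences and sample names are hypothetical, and production tools additionally handle separate index reads, sequencing errors in the barcode, and quality values.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Bin reads by exact match of their leading barcode.

    reads: iterable of read sequences (strings).
    barcodes: dict mapping barcode sequence -> sample name (equal-length barcodes).
    Returns a dict mapping sample name -> list of barcode-trimmed reads.
    """
    lengths = {len(b) for b in barcodes}
    if len(lengths) != 1:
        raise ValueError("barcodes must be non-empty and all the same length")
    k = lengths.pop()

    bins = defaultdict(list)
    for seq in reads:
        sample = barcodes.get(seq[:k])
        if sample is None:
            # Unrecognized barcode: keep the read intact for later inspection.
            bins["undetermined"].append(seq)
        else:
            bins[sample].append(seq[k:])
    return dict(bins)


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "sample_1", "TGCA": "sample_2"}
reads = ["ACGTAAAA", "TGCACCCC", "GGGGTTTT", "ACGTGGGG"]
for sample, seqs in demultiplex(reads, barcodes).items():
    print(sample, seqs)
```

Each read is examined once, so the added computational time grows linearly with the number of reads in the run.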

Included data analysis software

As each type of sequencing data comes with a distinct error profile [63], it has become common for vendors to provide some analysis software with their sequencing platforms to aid in data analysis. With the question “what types of data analysis software is included with your platform?” we gave each vendor a chance to list the computer programs they provide (Table 46). The follow-on question asked how much time each vendor expected to be required for training an experienced bioinformatician in their software. Of the vendors that responded, the consensus was 1-2 days, often included in the overall training session (Table 47).

| Vendor | Data analysis Software Included |
|---|---|
| LifeTech | On-system variant calling capabilities along with variety of other Ion and user developed “apps” in the Plug-In Store. Offer Ion Reporter, a pay-per-use cloud-based software resource for variant calling, visualization of variants and report-out |
| Illumina | No survey response |
| Roche 454 | Assembler, Mapper, Amplicon analysis, HLA plug-in, Remote Desktop software, Documentation and firmware to run the instrument and convert data from raw to human readable. |

Table 46: Data analysis software included with platforms by vendor.

| Vendor | Estimated Staff Training |
|---|---|
| LifeTech | 2 days instruction (included with operation instruction, see 4.7) |
| Illumina | No survey response |
| 454 Jr | 1 day |

Table 47: Training required for bioinformatics associated with data handling.

Costs

Estimated initial startup cost

Initial costs to set up a laboratory can be substantial. The first question of this section asked each vendor to estimate the initial cost of the sequencing platform, required hardware and any other initial startup costs associated with their platforms (Table 48). For the Illumina platforms, values were gleaned from news articles published near the time of each platform’s official release.

| Platform | Sequencer Costs | Computer hardware | Other costs |
|---|---|---|---|
| Ion Torrent | $66K | Included with sequencer | Ion OneTouch 2 System $19K; ancillary equipment ~$10K |
| IonProton | $225K | Included with sequencer | Ion OneTouch 2 System $19K; ancillary equipment ~$10K |
| MiSeq | $125K [64] | No survey response | No survey response |
| HiSeq 2000 | $690K [65] | No survey response | No survey response |
| 454 Jr | $98K | Not released in survey | Not released in survey |
| 454 FLX+ | Not released in survey | Not released in survey | Not released in survey |

Table 48: Initial startup costs for each sequencing platform.

Long-term operational costs

The final question asked in the survey attempted to discern the long-term operating costs of each sequencing platform by asking “what other long-term operational costs should a new user anticipate?” Generally speaking, the vendors responded with the cost of service contracts to maintain the sequencers and help troubleshoot and/or repair any malfunctions that might occur (Table 49).

| Platform | Additional Costs |
|---|---|
| Ion Torrent | 2nd year service contracts range from $4.3K to $14K depending on level of service |
| IonProton | 2nd year service contracts range from $7.5K to $33K depending on level of service |
| MiSeq | No survey response |
| HiSeq 2000 | No survey response |
| 454 Jr | 2nd year service contract of $12.6K |
| 454 FLX+ | Not released in survey |

Table 49: Additional long-term operational costs.

References Cited 1. Abed, Y. and G. Boivin, New Saffold cardioviruses in 3 children, Canada. Emerging infectious diseases, 2008. 14(5): p. 834‐6. 2. Allander, T., et al., Cloning of a human parvovirus by molecular screening of respiratory tract samples. Proceedings of the National Academy of Sciences of the United States of America, 2005. 102(36): p. 12891‐12896. 3. van der Hoek, L., et al., Identification of a new human coronavirus. Nature medicine, 2004. 10(4): p. 368‐73. 4. Scholz, M.B., C.‐C. Lo, and P.S.G. Chain, Next generation sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis. Current Opinion in Biotechnology, 2012. 23(1): p. 9‐15. 5. Ahmed, S.A., et al., Genomic Comparison of Escherichia coli O104:H4 Isolates from 2009 and 2011 Reveals Plasmid, and Prophage Heterogeneity, Including Shiga Toxin Encoding Phage stx2. PLoS ONE, 2012. 7(11): p. e48228. 6. Minot, S., et al., Rapid evolution of the human gut virome. Proceedings of the National Academy of Sciences, 2013. 110(30): p. 12450‐12455. 7. Callebaut, W., Scientific perspectivism: A philosopher of science’s response to the challenge of big data biology. Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences, 2012. 43(1): p. 69‐80. 8. Howe, D., et al., Big data: The future of biocuration. Nature, 2008. 455(7209): p. 47‐50. 9. Trelles, O., et al., Big data, but are we ready? Nat Rev Genet, 2011. 12(3): p. 224‐224. 10. Soergel DA, D.N., Knight R, Brenner SE., Selection of primers for optimal taxonomic classification of environmental 16S rRNA gene sequences. ISME J, 2012. 7: p. 1440‐1444. 11. Caporaso, J.G., et al., QIIME allows analysis of high‐throughput community sequencing data. Nat Meth, 2010. 7(5): p. 335‐336. 12. Schloss, P.D., et al., Introducing mothur: Open‐Source, Platform‐Independent, Community‐Supported Software for Describing and Comparing Microbial Communities. 
Applied and Environmental Microbiology, 2009. 75(23): p. 7537‐7541. 13. DeSantis, T.Z., et al., Greengenes, a Chimera‐Checked 16S rRNA Gene Database and Workbench Compatible with ARB. Applied and Environmental Microbiology, 2006. 72(7): p. 5069‐5072. 14. Cole, J.R., et al., The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Research, 2009. 37(suppl 1): p. D141‐D145. 15. Quast, C., et al., The SILVA ribosomal RNA gene database project: improved data processing and web‐based tools. Nucleic Acids Research, 2013. 41(D1): p. D590‐D596. 16. DeSantis, T.Z., et al., NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes. Nucleic Acids Research, 2006. 34(suppl 2): p. W394‐W399.

17. Price, M.N., P.S. Dehal, and A.P. Arkin, FastTree 2 – Approximately Maximum‐Likelihood Trees for Large Alignments. PLoS ONE, 2010. 5(3): p. e9490. 18. Quince, C., et al., Removing Noise From Pyrosequenced Amplicons. BMC Bioinformatics, 2011. 12(1): p. 38. 19. Wright, E.S., L.S. Yilmaz, and D.R. Noguera, DECIPHER, a Search‐Based Approach to Chimera Identification for 16S rRNA Sequences. Applied and Environmental Microbiology, 2012. 78(3): p. 717‐725. 20. Edgar, R.C., et al., UCHIME improves sensitivity and speed of chimera detection. Bioinformatics, 2011. 27(16): p. 2194‐2200. 21. Nordberg, E.K., YODA: selecting signature oligonucleotides. Bioinformatics, 2005. 21(8): p. 1365‐1370. 22. Phillippy, A.M., et al., Insignia: a DNA signature search web server for diagnostic assay development. Nucleic Acids Research, 2009. 37(suppl 2): p. W229‐W234. 23. Altschul, S.F., et al., Gapped BLAST and PSI‐BLAST: a new generation of protein database search programs. Nucleic Acids Research, 1997. 25(17): p. 3389‐3402. 24. Langmead, B. and S.L. Salzberg, Fast gapped‐read alignment with Bowtie 2. Nat Meth, 2012. 9(4): p. 357‐359. 25. Li, H. and R. Durbin, Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 2009. 25(14): p. 1754‐1760. 26. Kent, W.J., BLAT—The BLAST‐Like Alignment Tool. Genome Research, 2002. 12(4): p. 656‐664. 27. Li, H., et al., The Sequence Alignment/Map format and SAMtools. Bioinformatics, 2009. 25(16): p. 2078‐2079. 28. Zhou, C.E., et al., MvirDB—a microbial database of protein toxins, virulence factors and antibiotic resistance genes for bio‐defence applications. Nucleic Acids Research, 2007. 35(suppl 1): p. D391‐D394. 29. Chen, L., et al., VFDB 2012 update: toward the genetic diversity and molecular evolution of bacterial virulence factors. Nucleic Acids Research, 2012. 40(D1): p. D641‐D645. 30. Li, R., et al., SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 2009. 25(15): p. 1966‐ 1967. 31. 
McKenna, A., et al., The Genome Analysis Toolkit: A MapReduce framework for analyzing next‐generation DNA sequencing data. Genome Research, 2010. 20(9): p. 1297‐1303. 32. Browning, S.R., Multilocus Association Mapping Using Variable‐Length Markov Chains. The American Journal of Human Genetics, 2006. 78(6): p. 903‐913. 33. Howie, B.N., P. Donnelly, and J. Marchini, A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome‐Wide Association Studies. PLoS Genetics, 2009. 5(6): p. e1000529. 34. Li, Y., et al., Low‐coverage sequencing: Implications for design of complex trait association studies. Genome Research, 2011. 21(6): p. 940‐951. 35. Tamura, K., et al., MEGA5: Molecular Evolutionary Genetics Analysis Using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods. Molecular Biology and Evolution, 2011. 28(10): p. 2731‐2739.

36. Zerbino, D.R. and E. Birney, Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 2008. 18(5): p. 821‐829.
37. Chevreux, B., MIRA: An Automated Genome and EST Assembler, in German Cancer Research Center, Department of Molecular Biophysics. 2005, Ruprecht‐Karls‐University: Heidelberg. p. 171.
38. Orvis, J., et al., Ergatis: a web interface and scalable software system for bioinformatics workflows. Bioinformatics, 2010. 26(12): p. 1488‐1492.
39. Angiuoli, S., et al., CloVR: A virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics, 2011. 12(1): p. 356.
40. Carver, T.J., et al., ACT: the Artemis comparison tool. Bioinformatics, 2005. 21(16): p. 3422‐3423.
41. Darling, A.E., B. Mau, and N.T. Perna, progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss and Rearrangement. PLoS ONE, 2010. 5(6): p. e11147.
42. Meyer, F., et al., The metagenomics RAST server ‐ a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics, 2008. 9(1): p. 386.
43. Markowitz, V.M., et al., IMG/M: the integrated metagenome data management and comparative analysis system. Nucleic Acids Research, 2012. 40(D1): p. D123‐D129.
44. Wommack, K.E., et al., VIROME: a standard operating procedure for analysis of viral metagenome sequences. 2012. Vol. 6.
45. Bhaduri, A., et al., Rapid identification of non‐human sequences in high‐throughput sequencing datasets. Bioinformatics, 2012. 28(8): p. 1174‐1175.
46. Kostic, A.D., et al., PathSeq: software to identify or discover microbes by deep sequencing of human tissue. Nat Biotech, 2011. 29(5): p. 393‐396.
47. Stark, M., et al., MLTreeMap ‐ accurate Maximum Likelihood placement of environmental DNA sequences into taxonomic and functional reference phylogenies. BMC Genomics, 2010. 11(1): p. 461.
48. Segata, N., et al., Metagenomic microbial community profiling using unique clade‐specific marker genes. Nat Meth, 2012. 9(8): p. 811‐814.
49. Liu, B., et al., Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. BMC Genomics, 2011. 12(Suppl 2): p. S4.
50. Ye, Y., et al., Comparing bacterial communities inferred from 16S rRNA gene sequencing and shotgun metagenomics, in Biocomputing 2011. p. 165‐176.
51. DePristo, M.A., et al., A framework for variation discovery and genotyping using next‐generation DNA sequencing data. Nat Genet, 2011. 43(5): p. 491‐498.
52. Treangen, T., et al., MetAMOS: a metagenomic assembly and analysis pipeline for AMOS. Genome Biology, 2011. 12(1): p. 1‐27.

53. Kultima, J.R., et al., MOCAT: A Metagenomics Assembly and Gene Prediction Toolkit. PLoS ONE, 2012. 7(10): p. e47656.
54. Hoff, K., et al., Gene prediction in metagenomic fragments: A large scale machine learning approach. BMC Bioinformatics, 2008. 9(1): p. 217.
55. Hoff, K.J., et al., Orphelia: predicting genes in metagenomic sequencing reads. Nucleic Acids Research, 2009.
56. Quail, M.A., et al., A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics, 2012. 13: p. 341.
57. Liu, L., et al., Comparison of next‐generation sequencing systems. J Biomed Biotechnol, 2012. 2012: p. 251364.
58. Illumina, MiSeq System Product Information Sheet: Sequencing, http://www.illumina.com/documents//products/datasheets/datasheet_miseq.pdf, Editor 2012. p. 2.
59. Illumina, HiSeq Sequencing Systems Specification Sheet: Illumina Sequencing, http://www.illumina.com/Documents/systems/hiseq/datasheet_hiseq_systems.pdf, Editor 2011. p. 4.
60. Hughes, S., Lab Automation Services. Journal of Laboratory Automation, 2012. 17(6): p. 405‐407.
61. Kong, F., et al., Automatic Liquid Handling for Life Science: A Critical Review of the Current State of the Art. Journal of Laboratory Automation, 2012. 17(3): p. 169‐185.
62. Loman, N.J., et al., Performance comparison of benchtop high‐throughput sequencing platforms. Nat Biotech, 2012. 30(5): p. 434‐439.
63. Mardis, E.R., Next‐Generation DNA Sequencing Methods. Annual Review of Genomics and Human Genetics, 2008. 9: p. 387‐402.
64. Karow, J., Illumina's low‐cost MiSeq promises to speed up next‐gen sequencing, in GenomeWeb. 2011, GenomeWeb: http://www.genomeweb.com/sequencing/illuminas‐low‐cost‐miseq‐promises‐speed‐next‐gen‐sequencing. p. 2.
65. Karow, J., Illumina launches HiSeq with half the output, single flow cell, in GenomeWeb. 2010: http://www.genomeweb.com/sequencing/illumina‐launches‐hiseq‐half‐output‐single‐flow‐cell.

Version v19, August 29, 2013