NGS by Hillol Sarkar

The Past, Present, and Future of DNA Sequencing

Craig A. Praul, Co-Director, Genomics Core Facility, Huck Institutes of the Life Sciences, Penn State University

A very short history of DNA sequencing

I started from the conviction that, if different DNA species exhibited different biological activities, there should also exist chemically demonstrable differences between deoxyribonucleic acids. Erwin Chargaff

Milestones

• First isolation of DNA : 1869 (Friedrich Miescher)

• Composition of nucleic acids; tetranucleotide theory : 1909 - 1940 (Phoebus Levene)

• G=C and A=T; however, the G/C and A/T content of different organisms varies : 1950 (Erwin Chargaff)

• G/C content measured by annealing : 1968 (Mandel and Marmur)

• Maxam-Gilbert and Sanger sequencing : 1977

• Next-generation sequencing : 2005

Genomes Sequenced

• Virus – 3222 (Bacteriophage phi X 174, 5386 nt – 1977)

• Bacteria – 2289 (Haemophilus influenzae, 1.8 x 10^6 nt – 1995)

• Eukarya – 168 (S. cerevisiae, 1.2 x 10^7 nt – 1995; H. sapiens, 3 x 10^9 nt – 2001)

• Archaea – 152 (Methanococcus jannaschii, 1.7 x 10^6 nt – 1996)

Next-Generation Sequencing

Liu et al. Journal of Biomedicine and Biotechnology Volume 2012 (2012), Article ID 251364, 11 pages doi:10.1155/2012/251364

Changes in instrument capacity*

ER Mardis. Nature 470, 198-203 (2011) doi:10.1038/nature09796

Sequencing Cost

Date      Cost per Mb    Cost per Genome
Sep-01    $5,292.39      $95,263,072
Sep-02    $3,413.80      $61,448,422
Oct-03    $2,230.98      $40,157,554
Oct-04    $1,028.85      $18,519,312
Oct-05    $766.73        $13,801,124
Oct-06    $581.92        $10,474,556
Oct-07    $397.09        $7,147,571
Oct-08    $3.81          $342,502
Oct-09    $0.78          $70,333
Oct-10    $0.32          $29,092
Oct-11    $0.09          $7,743
Oct-12    $0.07          $6,618
Jan-13    $0.06          $5,671

Source - NHGRI : http://www.genome.gov/sequencingcosts/
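The size of the drop is easier to judge as a fold change. The following is a minimal sketch in Python that uses only the cost-per-Mb figures listed above; the names `cost_per_mb` and `fold_reduction` are illustrative and not part of the NHGRI data.

```python
# Cost per Mb (USD), copied from the NHGRI figures listed above.
cost_per_mb = {
    "Sep-01": 5292.39, "Sep-02": 3413.80, "Oct-03": 2230.98, "Oct-04": 1028.85,
    "Oct-05": 766.73, "Oct-06": 581.92, "Oct-07": 397.09, "Oct-08": 3.81,
    "Oct-09": 0.78, "Oct-10": 0.32, "Oct-11": 0.09, "Oct-12": 0.07, "Jan-13": 0.06,
}


def fold_reduction(start, end):
    """Return how many times cheaper a megabase was at `end` than at `start`."""
    return cost_per_mb[start] / cost_per_mb[end]


# Compare the pre-2008 decline with the single-year drop from Oct-07 to Oct-08.
for start, end in [("Sep-01", "Oct-07"), ("Oct-07", "Oct-08"), ("Sep-01", "Jan-13")]:
    print(f"{start} -> {end}: {fold_reduction(start, end):,.0f}-fold cheaper per Mb")
```

On these figures the cost per Mb fell roughly 13-fold over the six years to Oct-07, then roughly 100-fold in the single year to Oct-08.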

Central Dogma of Molecular Biology (James Watson version, 1965)

DNA → RNA → Protein

So once we have the genomic DNA sequence of a species, we have all of the information there is?

Really?

• No, not really.

Illumina HiSeq and MiSeq

• Massively parallel
  – HiSeq : 150 or 180 million reads per lane
  – MiSeq : 15 million reads per run

• Intermediate read length
  – HiSeq : 100 nt or 150 nt
  – MiSeq : 250 nt

• High total output per run
  – HiSeq : 90 GB or 288 GB
  – MiSeq : 8 GB
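The output figures are consistent with reads x read length x lanes. The sketch below is one way to check that arithmetic in Python. The lane counts (2 or 8 for HiSeq, 1 for MiSeq) and the paired-end mode are assumptions that do not appear on the slide; they were chosen because they reproduce the quoted totals. The function name `run_output_gb` is hypothetical.

```python
def run_output_gb(reads_per_lane, read_length_nt, lanes=1, paired_end=True):
    """Approximate sequence output per run in gigabases (10^9 nt)."""
    reads_per_fragment = 2 if paired_end else 1
    total_nt = reads_per_lane * lanes * read_length_nt * reads_per_fragment
    return total_nt / 1e9


# ASSUMPTION: lane counts and paired-end mode are not stated on the slide.
print(run_output_gb(150e6, 150, lanes=2))  # HiSeq, assumed 2-lane run
print(run_output_gb(180e6, 100, lanes=8))  # HiSeq, assumed 8-lane run
print(run_output_gb(15e6, 250, lanes=1))   # MiSeq, one run
```

Under these assumptions the three calls give about 90, 288, and 7.5 gigabases, in line with the 90 GB, 288 GB, and 8 GB quoted above.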

Sequencing Types

Single Read

Paired-end read

Mate-pair read

Library Types

• Many different library preps : DNA, mate-pair, mRNA, miRNA, ChIP

• Fragmentation
  – DNA : 300 – 500 nt
  – RNA : 150 – 200 nt

• Attachment of appropriate adapters
  – Complex : flow cell binding, forward and reverse (F & R) sequencing, barcode (BC)
  – Custom : avoid if possible

• Removal of dimers/small inserts

• Amplification (or not)

Applications

• De novo sequencing (genomes, transcriptomes)

• Resequencing (genomes, exomes, custom sequence capture)

• RNA-seq (mRNA, miRNA, degradome)

• ChIP-seq

• Methyl-seq

• RIP-seq

• Amplicon

De Novo Experimental Design

• Estimate of genome size

• Coverage (30 x – 100 x)

• Sequencing type (paired-end or mate-pair)

• Example : 100 MB genome, 2 x 100 nt paired-end reads
  – (100 MB) x (30 x coverage) = 3 GB
  – 3 GB / (200 nt for each pair of paired-end reads) = 15 million read pairs

• Replicates
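The read-pair estimate can be written as a small helper. This is a minimal sketch in Python; the function name `read_pairs_needed` and its parameters are illustrative, and it assumes every base of every read is usable.

```python
import math


def read_pairs_needed(genome_size_nt, coverage, read_length_nt):
    """Read pairs required to reach `coverage` with paired-end reads."""
    total_nt = genome_size_nt * coverage   # total sequence to generate
    nt_per_pair = 2 * read_length_nt       # each pair contributes two reads
    return math.ceil(total_nt / nt_per_pair)


# The example above: 100 MB genome, 100 nt paired-end reads.
for coverage in (30, 100):
    pairs = read_pairs_needed(100_000_000, coverage, 100)
    print(f"{coverage} x coverage: {pairs:,} read pairs")
```

At 30 x this reproduces the 15 million read pairs of the example; at the top of the range, 100 x, the same genome needs 50 million read pairs.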

Resequencing : Sequence Capture

RNA-seq Experimental Design

• Estimate of transcriptome size (1-5% of genome ?)

• Coverage (30 x ?)
  – mRNA or rRNA-depleted RNA
  – Relative abundance of the transcripts you are interested in

• Sequencing type (single read or paired-end)
  – Simple transcriptome vs. complex transcriptome
  – Splice variants

• Example : 3 GB genome, 100 nt single reads
  – (3 GB genome) x (5% transcriptome) = 150 MB transcriptome
  – (150 MB transcriptome) x (30 x coverage) = 4.5 GB total sequence
  – 4.5 GB / (100 nt for each read) = 45 million reads

• Replicates : Yes!!!!
  – Biological, not technical
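The RNA-seq estimate follows the same pattern, with the transcribed fraction of the genome as an extra input. This is a minimal sketch in Python, assuming single reads and uniform coverage across transcripts (real transcript abundance is far from uniform, which is why the coverage figure carries a question mark). The function name `rnaseq_reads_needed` is illustrative.

```python
def rnaseq_reads_needed(genome_size_nt, transcribed_fraction, coverage, read_length_nt):
    """Return (transcriptome size in nt, total sequence in nt, single reads needed)."""
    transcriptome_nt = genome_size_nt * transcribed_fraction
    total_nt = transcriptome_nt * coverage
    reads = total_nt / read_length_nt
    # Round to whole numbers to hide floating-point noise from the fraction.
    return round(transcriptome_nt), round(total_nt), round(reads)


# 3 GB genome, 30 x coverage, 100 nt single reads, across the 1-5% range.
for fraction in (0.01, 0.05):
    size, total, reads = rnaseq_reads_needed(3_000_000_000, fraction, 30, 100)
    print(f"{fraction:.0%} transcribed: {size:,} nt transcriptome, "
          f"{total:,} nt total, {reads:,} reads")
```

With 5% of a 3 GB genome transcribed, the sketch gives a 150 MB transcriptome, 4.5 GB of total sequence, and 45 million reads; at 1% the requirement falls to 9 million reads.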

ChIP-Seq

http://www.nature.com/nmeth/journal/v4/n8/images/nmeth0807-613-F1.gif

RIP-seq

Source : http://openi.nlm.nih.gov/imgs/rescaled512/3269675_ijms-13-00097f6.png

Methyl-seq

20 different types of base modifications in DNA are known and there are perhaps 200 modifications of RNA

Experimental Space: Next-Gen Platform

• PacBio : 0.075 x 10^6 reads/sample, 1000 – 3000 nt
  – Whole transcript

• Roche 454 FLX+ : 0.5 – 1 x 10^6 reads/sample, 800 – 1000 nt
  – Small to medium genome de novo sequencing
  – Long amplicon
  – Transcriptome

• PGM : 1 – 2 x 10^6 reads/sample, 400 nt
  – Small genome de novo
  – Medium amplicon

• MiSeq : 1 – 2 x 10^6 reads/sample, 50 – 250 nt
  – Small genome de novo
  – Small amplicon

• HiSeq : 10 – 100 x 10^6 reads/sample, 50 – 150 nt
  – Counting applications : RNA-seq, ChIP-seq, RIP-seq, Methyl-seq
  – Large genome de novo and resequencing

Experimental Space: The Relevancy of “Classic” Techniques

Differential Gene Expression

• Northern blotting (1977) : 1 probe – 20 samples

• Dot blots (1987) : 100s of probes – 1 sample

• RT-PCR (1992) : 100s of probes – 10 – 100 samples

• Microarrays (1995) : 100,000s of probes – 1 sample

• Next-gen sequencing (2005) : 10 – 100 x 10^6 reads – 1 sample

The Future

• More reads

• Longer reads

• Faster sequencing

• Cheaper sequencing

• New applications