New Sequencing Technologies & Diploid Personal Genomes
George Church, Thu 27-Apr-2006, 9:30-11, Broad-MPG
Thanks to: NHGRI Seq Tech 2004: Agencourt, 454, Microchip; 2005: Nanofluidics, Network, VisiGen; Affymetrix, Helicos, Solexa-Lynx

‘Next Generation’ Technology Development (company: our role)
Multi-molecule
- Affymetrix: Software
- Gorfinkel: Polony to Capillary
- 454 LifeSci: Paired ends, emulsion
- Lynx/Solexa: Multiplexing & polony
- Agencourt: Seq by Ligation (SbL)
Single molecules
- Helicos Biosci: SAB, cleavable fluors
- Pacific Biosci: -
- Agilent: Nanopores
- Visigen Biotech: -
- Complete Genomics: SbL

Sequencing components
1. Applications & goals
2. Cost, accuracy, continuity goals
3. Source, consent, ELSI
4. Sample prep
5. Technology development, deployment, scaling
6. Software: data acquisition to interpretation
7. Human interface, education

Sequencing applications
1. Environment (genetic): maternal, allergens, microbes
2. Small mutations: whole genome vs targeted
3. DNA copy number & rearrangements (paired ends)
4. Exons conserved &/or mutable regions
5. Haplotype: LD &/or causative combinations in cis
6. RNA Digital Analysis of Gene Expression (by counting)
7. RNA splicing (that arrays can’t handle)
8. Proteomics: MS, Ab, aptamers
9. Metabolomics: MS, Ab, aptamers
10. Microbial evolution resequencing (needs consensus accuracy)
11. Cancer resequencing
12. Gene synthesis by sequencing (needs raw accuracy)
13. DNA methylation

Why single chromosome sequencing? (or single cell or single particle?)
(1) When we only have one cell, as in Preimplantation Genetic Diagnosis (PGD) or environmental samples
(2) Sequence relations >100 kbp (haplotypes)
(3) Prioritizing or pooling (rare) species based on an initial DNA screen
(4) Anything relating 2 or more chromosomes (in a cell or virus)
(5) Cell-cell interactions (e.g. predator-prey, symbionts, commensals, parasites, etc.)

Sequencing/genotyping on single human chromosomes
Method #1: ‘in situ’ haplotyping
153 Mbp
Zhang et al. Nature Genet. Mar 2006

Sequencing/genotyping on single human chromosomes
Method #2: Chromosome dilution library
QC: Reverse-FISH of amplicons
[Figure panels: Amplicon 19; Amplicon 6q]

Single chromosome molecule sequencing
How?
- Isothermal Strand Displacement Amplification from a single chromosome (Ploning)
- Shotgun sequencing on the amplicon
Challenges
- Non-specific amplification competes with a single template molecule
- Amplicons have high-order DNA structures, which create issues in sequencing library construction

Single cell chromosome molecule sequencing
Reduce chimeras when cloning from SDA Plones
- Phi-29 debranching
- S1 nuclease digestion
- DNA pol I nick translation
From 19% to 6%

Single cell chromosome molecule sequencing
Ploning & sequencing 2.5 Mbp molecules
Chromosome # | #1 | #2
# Good seq reads | 7,166 | 10,660
Average length (bp) | 769.4 | 676.6
Total length (bp) | 5,513,520 | 7,212,556
% genome sampled | 63% | 67%
# unknown seqs 1210
# vectors 2344
# other seqs 742
Plone amplification errors: < 1.7×10^-5

Integrated Polony Sequencing Pipeline (open source hardware, software, wetware)
- In vitro paired tag libraries
- Bead polonies via emulsion PCR
- Monolayer gel immobilization
- Enrich amplified beads
- SBE or SBL sequencing
- Epifluorescence & Flow Cell
- SOFTWARE: Images → Tag Sequences; Tag Sequences → Genome
Shendure, Porreca, Reppas, Lin, McCutcheon, Rosenbaum, Wang, Zhang, Mitra, Church (2005) Science 309:1728.

Paired-end libraries
[Figure: library construction workflow with fragment segments labeled L, M, R. Step labels: shear or Nla III digest; select; dilute, ligate; hRCA; digest Mme I; ligate; amplify; ePCR]
Shendure, Porreca, et al. (2005) Science 309:1728. Margulies et al. (2005) Nature 437:376.

Distribution of Distances Between Mate-Paired Tags
[Figure: histogram of frequency vs distance (bp), axis marks at 1.0 kb and 2.0 kb; annotations: 980 ± 96 bp; 10.7 bp FT]

[Figure: ePCR bead amplicon (3’, 5’) with Tag 1 and Tag 2 between segments L, M, R; reads labeled 7 bp, 6 bp, 7 bp, 6 bp]
- Each yields 6 to 7 bp of contiguous sequence
- 34 bp new sequence per 135 bp amplicon
- 4 positions for paired-end anchor 'primers'

Sequencing by Ligation (SBL) with fluorescent combinatorial 9-mers
Anchor primer (5'PO4): ACUCAUC…
Template: (3’)…TAGAGT????????????????TGAGTAG…(5’)
Probe | Excitation (nm) | Emission (nm)
5’-Cy5-nnnnAnnnn-3’ | 647 | 700
5’-Cy3-nnnnGnnnn-3’ | 555 | 605
5’-TR-nnnnCnnnn-3’ | 572 | 630
5’-Cy3+Cy5-nnnnTnnnn-3’ | 555 | 700
Shendure, Porreca, et al. (2005) Science 309:1728
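The four-color scheme on this slide suggests a simple per-bead base call: image each excitation/emission pair and take the brightest channel. The sketch below is illustrative only and is not the pipeline's actual base caller. The channel-to-base pairing assumes the probes and wavelength pairs are listed in the same order on the slide, and `call_base` and `min_ratio` are hypothetical names.

```python
# Assumed channel order: (base, excitation nm, emission nm), paired by slide order.
CHANNELS = [
    ("A", 647, 700),  # Cy5 probe
    ("G", 555, 605),  # Cy3 probe
    ("C", 572, 630),  # TR probe
    ("T", 555, 700),  # Cy3+Cy5 probe
]


def call_base(intensities, min_ratio=1.5):
    """Return the base of the brightest channel, or 'N' if the call is ambiguous.

    intensities: four background-subtracted values in CHANNELS order.
    min_ratio: hypothetical threshold; brightest must exceed the runner-up by this factor.
    """
    ranked = sorted(range(4), key=lambda i: intensities[i], reverse=True)
    best, second = ranked[0], ranked[1]
    if intensities[best] <= 0:
        return "N"  # no signal in any channel
    if intensities[second] > 0 and intensities[best] / intensities[second] < min_ratio:
        return "N"  # two channels too close to separate
    return CHANNELS[best][0]
```

A ratio test against the second-brightest channel is one simple way to reject mixed or dim beads; a real caller would likely normalize the channels first.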

Automation Schematic
[Figure components: HPLC autosampler (96 wells); syringe pump; flow-cell; temperature control; microscope & xyz stage]

Off the Shelf Instrumentation: $140,000
Mitra, Porreca, Shendure

Image Collection & Data Processing
- 514 raster positions x 4 images per cycle
- 26 cycles of sequencing
- 2 additional image sets for object-finding algorithms
- 54996 images (1000 x 1000, 14-bit)
- 100 GBytes, 5M reads, $500 run
Porreca et al.
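The image counts and data volume on this slide can be cross-checked with a little arithmetic. The sketch below assumes each 14-bit pixel is stored in 2 bytes, which the slide does not state; under that assumption the total comes to roughly 110 GB, in line with the quoted 100 GBytes.

```python
# Figures taken from the slide.
raster_positions = 514
images_per_position_per_cycle = 4
cycles = 26
total_images = 54996          # includes the 2 additional object-finding image sets
pixels_per_image = 1000 * 1000

# Assumption (not on the slide): 14-bit pixels are stored as 16-bit words.
bytes_per_pixel = 2

# Images acquired during the sequencing cycles alone.
sequencing_images = raster_positions * images_per_position_per_cycle * cycles

# Raw, uncompressed size of the full image set for one run.
total_bytes = total_images * pixels_per_image * bytes_per_pixel
total_gigabytes = total_bytes / 1e9

print(sequencing_images)           # 53456
print(round(total_gigabytes, 1))   # 110.0
```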

Open Source Readmapper
v1.0 (Shendure, Porreca et al)
- Hash all the reads (n)
- Scan genome (m), and for each window:
  - Does current window exist in hash?
  - If so, move downstream, scan d positions & test hash for membership
- m + (n * d) = 10+ hours, 20 nodes, 1.6e6 reads
v2.0 (Gary Gao, Sasha Wait)
- Hash all possible reads from genome (m)
- Scan the reads (n), and for each:
  - Does it occur in the hash?
  - If so, does the second exist?
  - If so, take union (k)
- n * k = 10 hours, 1 node, 1.6e6 reads
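The second strategy on this slide (index the genome, then look up each tag of a pair) can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual readmapper code: it uses exact matches on the forward strand only, keeps pairs whose tag positions fall within a distance window, and the names `index_genome` and `map_pair` are hypothetical.

```python
from collections import defaultdict


def index_genome(genome, tag_len):
    """Hash every window of length tag_len in the genome to its start positions."""
    index = defaultdict(list)
    for start in range(len(genome) - tag_len + 1):
        index[genome[start:start + tag_len]].append(start)
    return index


def map_pair(index, tag1, tag2, min_dist, max_dist):
    """Return (pos1, pos2) placements where both tags occur at a plausible spacing."""
    placements = []
    for pos1 in index.get(tag1, ()):      # does the first tag occur in the hash?
        for pos2 in index.get(tag2, ()):  # if so, does the second?
            if min_dist <= pos2 - pos1 <= max_dist:
                placements.append((pos1, pos2))
    return placements


genome = "ACGTACGTTTGACCA"
index = index_genome(genome, tag_len=4)
print(map_pair(index, "ACGT", "GACC", min_dist=5, max_dist=8))  # [(4, 10)]
```

Indexing the genome once makes the per-pair cost depend on the number of candidate hits rather than on genome length, which matches the n * k cost noted on the slide.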

Error quantitation
- Median raw Polony = 3E-3 (99.7%)
- 454 raw = 4E-2 (96%)
- 6X consensus <3E-7 [>Q65, 99.99997%]
Shendure, Porreca et al, 2005
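The Q value and percentages on this slide follow from the standard Phred relation Q = -10 log10(p). A short sketch, with `phred` and `accuracy_percent` as illustrative helper names:

```python
import math


def phred(error_rate):
    """Phred-scaled quality for a per-base error probability."""
    return -10 * math.log10(error_rate)


def accuracy_percent(error_rate):
    """Per-base accuracy as a percentage."""
    return (1 - error_rate) * 100


for label, p in [("Polony raw", 3e-3), ("454 raw", 4e-2), ("Polony 6X consensus", 3e-7)]:
    print(f"{label}: Q{phred(p):.1f}, {accuracy_percent(p):.5f}%")
```

An error rate of 3E-7 works out to about Q65.2, consistent with the ">Q65" on the slide.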

Cost vs consensus error rate
Labels: 454 Sep05; ABI; 454 Sep05; Polony Sep05; Polony Feb 06
$/kb @4E-5: $7; $9; 0.8; 0.07
$/3e9 @1X: 3M; 300K; $30K
Paired ends: yes; no; yes
Device $: 300K; 500K; 140K

Why low error rates?
Consensus error rate | Standard or method | Total errors (E.coli) | Total errors (Human)
1E-4 | Bermuda/Hapmap | 500 | 600,000
4E-5 | 454 @40X | 200 | 240,000
3E-7 | Polony-SbL @6X | 0 | 1800
1E-8 | Goal for 2006 | 0 | 60
Goal of genotyping & resequencing → Discovery of variants, e.g. cancer somatic mutations ~1E-6 (or lab evolved cells).
Also, effectively reduce (sub)genome target size by enrichment for exons or common SNPs to reduce cost & # false positives.
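The totals in this table are the consensus error rate multiplied by genome size. The sketch below reproduces them under two assumptions inferred from the first row rather than stated on the slide: about 5 Mbp for E. coli and 6 Gbp for a diploid human genome. For the two lowest rates the simple product gives 1.5 and 0.05 expected E. coli errors, where the slide lists 0.

```python
# Assumed genome sizes, inferred from the 1E-4 row (500 and 600,000 errors).
ECOLI_BP = 5_000_000
HUMAN_DIPLOID_BP = 6_000_000_000


def expected_errors(error_rate, genome_bp):
    """Expected number of erroneous consensus bases across a genome."""
    return error_rate * genome_bp


for rate in (1e-4, 4e-5, 3e-7, 1e-8):
    ecoli = expected_errors(rate, ECOLI_BP)
    human = expected_errors(rate, HUMAN_DIPLOID_BP)
    print(f"{rate:.0e}: E. coli {ecoli:g}, human {human:g}")
```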

Mutation Discovery in Engineered/Evolved E.coli
Position | Type | Gene | Location | ABI Confirm | Comments
986,334 | T > G | ompF | Promoter -10 | | Only in evolved strain
985,797 | T > G | ompF | Glu > Ala | | Only in evolved strain
931,960 | Δ 8 bp | lrp | frameshift | | Only in evolved strain
3,957,960 | C > T | ppiC | 5' UTR | | MG1655 heterogeneity
λ-3274 | T > C | cI | Glu > Glu | | λ red heterogeneity
λ-9846 | T > C | ORF61 | Lys > Gly | | λ red heterogeneity
Shendure, Porreca, et al. (2005) Science 309:1728

Sequence monitoring of evolution (optimize small molecule synthesis/transport) Sequence trp - Reppas, Lin & Church

ompF - non-specific transport channel
- Glu-117 → Ala (in the pore): charged residue known to affect pore size and selectivity
- Promoter mutation at position (-12): makes -10 box more consensus-like
  Positions: -12 -11 -10 -9 -8 -7 -6
  A AAGAT
  C AAGAT
- Can increase import & export capability simultaneously

Co-evolution of mutual biosensors sequenced across time & within each time-point
3 independent lines of Trp/Tyr co-culture frozen.
- OmpF: 42 R->G, L, C; 113 D->V; 117 E->A
- Promoter: -12 A->C; -35 C->A
- Lrp: 1 bp deletion, 9 bp deletion, 8 bp deletion, IS2 insertion, R->L in DBD
Heterogeneity within each time-point reflecting colony heterogeneity.

Mixture of wild & 2kb Inversion (pin)
[Figure: proximal tag placement vs distal tag placement, axis marks at 1,206k and 1,210k; pairs with incorrect distance highlighted; Red = same strand, Black = opposite strand]
Using paired ends, rearrangement & copy-number detection is >1000X easier than point mutation detection (6X vs 6000X)
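The plot flags mate pairs by distance and strand. A minimal sketch of that classification is below. It assumes the 980 ± 96 bp spacing reported for the library as the expected distance; the three-standard-deviation cutoff and the names `classify_pair` and `expect_same_strand` are assumptions. The slide does not say which strand arrangement is the normal one, so the caller has to supply it.

```python
def classify_pair(pos1, pos2, same_strand, expect_same_strand,
                  mean=980, sd=96, n_sd=3):
    """Label one mapped mate pair.

    pos1, pos2: genome coordinates of the two tags.
    same_strand: True if both tags mapped to the same strand.
    expect_same_strand: orientation expected for an unrearranged pair (caller's assumption).
    mean, sd: expected tag spacing in bp; n_sd is a hypothetical cutoff.
    """
    if same_strand != expect_same_strand:
        return "unexpected_orientation"  # candidate inversion breakpoint
    if abs(abs(pos2 - pos1) - mean) > n_sd * sd:
        return "incorrect_distance"      # candidate insertion, deletion, or rearrangement
    return "consistent"
```

Many pairs span each breakpoint even at low coverage, which is why flagged pairs cluster and make rearrangements easier to detect than point mutations.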

Human Diplome Sequencing Strategies
- 1M Causative Genome Changes, CGCs (10X MIP pool $20)
- Exons & conserved 3% (6X $9K)
- Diplome chromosome dilution shotgun (0.01X $300)
- 40K RNA diplome (10X MIP pool $20)
- Chip Genotyping/Haplotyping
- Strand displacement amplification (ploning)
- Polony sequencing, 7E8 pixels
Personal Genome Project (ELSI)
Open source hardware, software, wetware

Padlock, Molecular Inversion Probes (MIPs)
Causative Genomic Changes (CGCs, e.g. conserved 3%) (not restricted to Single Nucleotides or Polymorphisms >1%)
[Figure: MIP annealed to genomic DNA with arms L and R; alternative alleles (CG, CA, TG); universal primers; optional multiplex tag]
- Hardenbol.. Landegren, Davis et al. Multiplexed genotyping with sequence-tagged molecular inversion probes. Nat Biotechnol. 2003 21:673-8.
- “10,000 targeted SNPs genotyped in a single tube assay.” Genome Res. 2005 15:269
- Vitkup, Sander, Church (2003) The Amino-acid Mutational Spectrum of Human Genetic Disease. Genome Biol. 4: R72. (CG to CA, TG)

MIPs for VDJ Polonies
http://www.infobiogen.fr/services/chromcancer/Genes/TCRBID24.html
Over the whole field of human T-cells
- cDNA; 1 TRAC + 2 TRBC primers
- 47 TRAV * 50 TRAJ + 46 TRBV * 13 TRBJ = 2948 MIP oligos
- or 47 TRAV * 1 TRAC + 46 TRBV * 2 TRBC = 139 MIP oligos
- In situ RCA or PCR for each T-cell
- Polony sequencing of tag &/or gap fill (e.g. 18 to 33bp in CDR3) (two tags per cell sufficient?)
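The two oligo counts on this slide are sums of segment-pair products. A quick check of the arithmetic, using the segment counts given on the slide (variable names are illustrative):

```python
# Segment counts as given on the slide.
alpha_v, alpha_j, alpha_c = 47, 50, 1   # TRAV, TRAJ, TRAC
beta_v, beta_j, beta_c = 46, 13, 2      # the 46 V segments paired with TRBJ / TRBC

# One probe per V-J combination in each chain.
vj_probes = alpha_v * alpha_j + beta_v * beta_j

# One probe per V-C combination in each chain (far fewer oligos).
vc_probes = alpha_v * alpha_c + beta_v * beta_c

print(vj_probes)  # 2948
print(vc_probes)  # 139
```

Pairing V segments with the constant region instead of every J segment cuts the probe set about twentyfold, at the price of a longer gap to fill across CDR3.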

‘Next Generation’ Technology Development (company: our role)
Multi-molecule
- Affymetrix: Software
- Gorfinkel: Polony to Capillary
- 454 LifeSci: Paired ends, emulsion
- Lynx/Solexa: Multiplexing & polony
- Agencourt: Seq by Ligation (SbL)
Single molecules
- Helicos Biosci: SAB, cleavable fluors
- Pacific Biosci: -
- Agilent: Nanopores
- Visigen Biotech: -
- Complete Genomics: SbL

Human subjects consent
“Because the database will be public, people who do identity testing, such as for paternity testing or law enforcement, may also use the samples, the database, and the HapMap, to do general research. However, it will be very hard for anyone to learn anything about you personally from any of this research because none of the samples, the database, or the HapMap will include your name or any other information that could identify you or your family.”
http://www.hapmap.org/downloads/elsi/CEPH_Reconsent_Form.pdf
YRI = Yoruba, Ibadan, Nigeria
JPT = Japan, Tokyo
CHB = China (Han), Beijing
CEU = CEPH (N&W Europe), Utah

Is anonymity in genomics realistic?
http://arep.med.harvard.edu/PGP/Anon.htm
(1) Re-identification after “de-identification” using other public data. Group Insurance Commission list of birth date, gender, and zip code was sufficient to re-identify medical records of Governor Weld & family via voter-registration records (1998).
(2) Hacking. “Drug Records, Confidential Data vulnerable via Harvard ID number & PharmaCare loophole” (2005). A hacker gained access to confidential medical info at the U. Washington Medical Center: 4000 files (names, conditions, etc., 2000).
(3) Combination of surnames from genotype with geographical info. An anonymous sperm donor was traced on the internet in 2005 by his 15-year-old son, who used his own Y chromosome genealogy to access surname relations.
(4) Inferring phenotype from genotype. Markers for eye, skin, and hair color, height, weight, racial features, dysmorphologies, etc. are known & the list is growing.
(5) Unexpected self-identification. An example of this at Celera undermined confidence in the investigators. Kennedy D. Science. 2002 297:1237. Not wicked, perhaps, but tacky.
(6) A tiny amount of DNA data in the public domain with a name leverages the rest. This would allow the vast amount of DNA data in the HapMap (or other study) to be identified. This can happen, for example, in court cases even if the suspect is acquitted.
(7) Identification by phenotype. If CT or MR imaging data is part of a study, one could reconstruct a person’s appearance. Even blood chemistry can be identifying in some cases.

"Open-source" Personal Genome Project (PGP) Harvard Medical School IRB Human Subjects protocol submitted Sep-2004, approved Aug-2005 renewed Feb-2006. Start with 3 highly-informed individuals consenting to non- anonymous genomes & extensive phenotypes (medical records, imaging, omics). Cell lines in Coriell NIGMS Repository G M Church GM (2005) The Personal Genome Project Nature Molecular Systems Biology doi:10.1038/msb4100040 Kohane IS, Altman RB. (2005) Health-information altruists--a potentially critical resource. N Engl J Med. 10;353(19):2074-7.

Discussion: Ascertainment bias vs. risk of disclosure without consent.
It is likely that less-privileged citizens ‘might be’ less likely to volunteer & will be more likely to volunteer due to higher financial risk.
These same people ‘might be’ even less likely to volunteer if the data might become public.
These same people might be especially impacted socially if identifying (genome and/or phenome) data were to get out after they were assured that it would not.

Proposal for multi-tiered (re)consent of subjects in genomic studies
Five categories:
1) Withdrawal from studies due to new information on risks (all data destroyed).
2) Highest security (possibly higher than the original study): encryption, aggressive de-identification, only expert access with IRB approval of each person, not whole teams. Consent form clearly states the risks (see previous slides).
3) Medium security, similar to current practice, but consented as above. IRB approval for teams to download de-identified data.
4) Open-PGP-type security. Click-through agreement. IRB approval only for data collection, not for data reading.
5) Fully open. No IRB approval; full web access, e.g. subject initiated.
