US DOE Joint Genome Institute Surviving the Deluge Darren Platt Joint Genome Institute


1 US DOE Joint Genome Institute Surviving the Deluge Darren Platt Joint Genome Institute platt5@llnl.gov

2 US DOE Joint Genome Institute Surviving the Deluge
- Shrinking Read Lengths
- Hybrid Assemblies
- The Coming Storm
- Improving metabolic flux
- The Future

3 US DOE Joint Genome Institute Read Lengths are Getting Shorter
- The debate about the impact of read length on genome assembly has never been resolved
- Are more 650bp reads better than fewer 750bp reads?
- What about 100?
- How do you feel about 35?
- Why wait, join the revolution…

4 US DOE Joint Genome Institute Assembling with Four Base Pair Reads..
>GATC
GATC

5 US DOE Joint Genome Institute Not as dire as you might think
- Dramatically simplifies Genbank trace archive submissions
- Only 256 distinct sequences, 65K 3bp overlaps
- Store reads as a single byte
GGGG GGGA GGGT GGGC GGAG GGAA GGAT GGAC GGTG GGTA GGTT GGTC GGCG GGCA GGCT GGCC
GAGG GAGA GAGT GAGC GAAG GAAA GAAT GAAC GATG GATA GATT GATC GACG GACA GACT GACC
GTGG GTGA GTGT GTGC GTAG GTAA GTAT GTAC GTTG GTTA GTTT GTTC GTCG GTCA GTCT GTCC
GCGG GCGA GCGT GCGC GCAG GCAA GCAT GCAC GCTG GCTA GCTT GCTC GCCG GCCA GCCT GCCC
AGGG AGGA AGGT AGGC AGAG AGAA AGAT AGAC AGTG AGTA AGTT AGTC AGCG AGCA AGCT AGCC
AAGG AAGA AAGT AAGC AAAG AAAA AAAT AAAC AATG AATA AATT AATC AACG AACA AACT AACC
ATGG ATGA ATGT ATGC ATAG ATAA ATAT ATAC ATTG ATTA ATTT ATTC ATCG ATCA ATCT ATCC
ACGG ACGA ACGT ACGC ACAG ACAA ACAT ACAC ACTG ACTA ACTT ACTC ACCG ACCA ACCT ACCC
TGGG TGGA TGGT TGGC TGAG TGAA TGAT TGAC TGTG TGTA TGTT TGTC TGCG TGCA TGCT TGCC
TAGG TAGA TAGT TAGC TAAG TAAA TAAT TAAC TATG TATA TATT TATC TACG TACA TACT TACC
TTGG TTGA TTGT TTGC TTAG TTAA TTAT TTAC TTTG TTTA TTTT TTTC TTCG TTCA TTCT TTCC
TCGG TCGA TCGT TCGC TCAG TCAA TCAT TCAC TCTG TCTA TCTT TCTC TCCG TCCA TCCT TCCC
CGGG CGGA CGGT CGGC CGAG CGAA CGAT CGAC CGTG CGTA CGTT CGTC CGCG CGCA CGCT CGCC
CAGG CAGA CAGT CAGC CAAG CAAA CAAT CAAC CATG CATA CATT CATC CACG CACA CACT CACC
CTGG CTGA CTGT CTGC CTAG CTAA CTAT CTAC CTTG CTTA CTTT CTTC CTCG CTCA CTCT CTCC
CCGG CCGA CCGT CCGC CCAG CCAA CCAT CCAC CCTG CCTA CCTT CCTC CCCG CCCA CCCT CCCC
- Challenging to assemble
- Vector trimming out of the question
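The "single byte" remark follows from simple arithmetic: four bases at two bits each fill exactly eight bits, which is also why there are 4^4 = 256 possible reads. A minimal Python sketch of such a packing is shown here. The A=0, C=1, G=2, T=3 mapping and the function names `encode_4mer` and `decode_4mer` are illustrative assumptions, not something stated in the talk.

```python
# Hypothetical 2-bit code per base; the talk does not specify a mapping.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_4mer(read):
    """Pack a four-base read into one integer in the range 0..255."""
    if len(read) != 4:
        raise ValueError("expected exactly four bases")
    value = 0
    for base in read:
        # Shift the bases seen so far left by two bits and append the new one.
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def decode_4mer(value):
    """Unpack an integer in 0..255 back into its four-base read."""
    if not 0 <= value <= 255:
        raise ValueError("expected a single byte value")
    return "".join(BITS_TO_BASE[(value >> shift) & 3] for shift in (6, 4, 2, 0))
```

With this particular mapping, `encode_4mer("GATC")` gives 141 (binary 10001101) and `decode_4mer(141)` gives back `"GATC"`.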

6 US DOE Joint Genome Institute Testing on a Real Genome
gi|11496567|ref|NC_001830.1| Pear blister canker viroid PBCVd, complete genome
CTTTCCTGAGGTTCCTGTGGTGCTCCCCTGACCTGCGTTCCAAAAAGCGAAAAAGTGAGAGGCCCTAGGG
GCTTCTCGGCTCGTCGTCGACGAAGGGTCTAGAAGCCTGGGCGCTGGCTGGAGCGCGCGGCTGTGAGTAA
TCGCTCCTTTGGAGAAGAAAACCAGCGTTGCTTCCTGCCTGAGCCTCGTCTTCTGTCCCGCTAGTCGAGC
GGACAACCCGAGCACCGCCGAAGCGCTTTTTTCTTTTATAGCAGCTTGGCTTCGCGGCGAGGGTGGAAGT
TTACCGCGGACCCCCGAGAGGAGGCCCTCGGGTCC

7 US DOE Joint Genome Institute ABCD Assembler
- ABCD assembler: only 500 lines of C++
- Libraries with insert sizes of 4, 5, 6, 8, 10, 20, 40, 100 and 200 bp
- Generated 8Mb of sequence (22K x coverage)
- Results in 2410 unique data points
  100 AAAA CGCT
  100 AAAA GCTC
  100 AAAA GGAG
  100 AAAA GGCT
  100 AAAA TGGA
  100 AAAC GCTT
  100 AAAG CTCC
  100 AAAG GAGA
  100 AACC CTTC
  100 AAGA CTTC
  ….
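One way to read the data points above is as triples of insert size, first read and second read. The Python sketch below enumerates such triples from a known sequence. It rests on assumptions that the slide does not spell out: that the insert size is the distance between the start positions of the two four-base reads, that both reads come from the same strand, and that the genome is treated as linear. This reading appears consistent with the sample rows when checked against the slide 6 sequence (for example, an AAAA followed 100 bases later by CGCT). The function name `paired_points` is hypothetical.

```python
from collections import Counter

READ_LENGTH = 4  # four-base reads, as in the talk


def paired_points(genome, offsets, read_length=READ_LENGTH):
    """Count (offset, left_read, right_read) triples found in a linear genome.

    Assumption: `offset` is the distance between the start positions of
    the two reads, and both reads are taken from the same strand.
    """
    counts = Counter()
    for offset in offsets:
        # Last start position at which the right-hand read still fits.
        last_start = len(genome) - offset - read_length
        for start in range(last_start + 1):
            left = genome[start:start + read_length]
            right = genome[start + offset:start + offset + read_length]
            counts[(offset, left, right)] += 1
    return counts
```

For a toy sequence, `paired_points("ACGTACGGTCA", [4])` yields four triples, the first being `(4, "ACGT", "ACGG")`.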

8 US DOE Joint Genome Institute Performance of the ABCD Assembler
- Genetic Algorithm evolves candidate genomes and compares to observed data frequencies
- Reward and breed genomes that produce similar data
- Penalize genomes that generate unobserved data
- After 3-4 days on a high end CPU..
REF CTTTCCTGAGGTTCCTGTGGTGCTCCCCTGACCTGCGTTCCAAAAAGCGAAAAAGTGAGAGGCCCTAGGGGCTTCTCGGCTCGTCGTCGACGAAGGG
SEQ CTTTCCTGAGGTTCCTGTGGTGCTCCCCTGACCTGCGTTCCAAAAAGCGAAAAAGTGAGAGGCCCTAGGGGCTTCTCGGCTCGTCGTCGACGAAGGG
REF TCTAGAAGCCTGGGCGCTGGCTGGAGCGCGCGGCTGTGAGTAATCGCTCCTTTGGAGAAGAAAACCAGCGTTGCTTCCTGCCTGAGCCTCGTCTTCT
SEQ TCTAGAAGCCTGGGCGCTGGCTGGAGCGCGCGGCTGTGAGTAATCGCTCCTTTGGAGAAGAAAACCAGCGTTGCTTCCTGCCTGAGCCTCGTCTTCT
REF GTCCCGCTAGTCGAGCGGACAACCCGAGCACCGCCGAAGCGCTTTTTTCTTTTATAGCAGCTTGGCTTCGCGGCGAGGGTGGAAGTTTACCGCGGAC
SEQ GTCCCGCTAGTCGAGCGGACAACCCGAGCACCGCCGAAGCGCTTTTTTCTTTTATAGCAGCTTGGCTTCGCGGCGAGGGTGGAAGTTTACCGCGGAC
REF CCCCGAGAGGAGGCCCTCGGGTCC
SEQ CCCCGAGAGGAGGCCC
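The reward and penalty idea on this slide can be illustrated with a small scoring function. The Python sketch below is not the ABCD assembler's actual fitness function, which the talk does not show. It is one plausible form, assuming data points are (offset, first read, second read) triples, that matching points earn credit up to their observed count, and that points never observed cost a fixed penalty. The names `paired_points`, `fitness` and `unobserved_penalty` are hypothetical, and the breeding and mutation loop of the genetic algorithm is left out.

```python
from collections import Counter


def paired_points(genome, offsets, read_length=4):
    """Count (offset, left_read, right_read) triples in a linear genome."""
    counts = Counter()
    for offset in offsets:
        for start in range(len(genome) - offset - read_length + 1):
            left = genome[start:start + read_length]
            right = genome[start + offset:start + offset + read_length]
            counts[(offset, left, right)] += 1
    return counts


def fitness(candidate, observed, offsets, unobserved_penalty=1.0):
    """Score a candidate genome against observed data-point counts.

    Points the candidate shares with the observations add to the score
    (capped at the observed count); points it produces that were never
    observed subtract from it.
    """
    produced = paired_points(candidate, offsets)
    score = 0.0
    for point, count in produced.items():
        if point in observed:
            score += min(count, observed[point])
        else:
            score -= unobserved_penalty * count
    return score
```

Under this scoring a candidate identical to the source sequence reaches the maximum, and a candidate with a single substitution scores lower because every read pair covering the changed base turns into an unobserved point.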

9 US DOE Joint Genome Institute Read Pair Overview

10 US DOE Joint Genome Institute Consensus Alignment

11 US DOE Joint Genome Institute Hybrid Assemblies

12 US DOE Joint Genome Institute Forge: 454/Sanger Hybrid Assembly

13 US DOE Joint Genome Institute Up Close

14 US DOE Joint Genome Institute Accurate Consensus Generation is more Challenging

15 US DOE Joint Genome Institute The Coming Storm

16 US DOE Joint Genome Institute Growth Rates

17 US DOE Joint Genome Institute These are just the foothills
- Thought exercises
  - How to deal with 1-10 microbes/day?
  - How best to use 3 Gb/day?
  - Will human reseq technologies enable de novo large genome sequencing?
- Remember that organisms aren’t getting larger

18 US DOE Joint Genome Institute No problem, Computers are getting faster too.. http://Tomshardware.com http://intel.com

19 US DOE Joint Genome Institute What really holds us back?
- Limiting Reagents
  - CPU time
  - Disk space
  - Network Bandwidth
  - Human Bandwidth
  - Software quality

20 US DOE Joint Genome Institute Improving Metabolic Flux

21 US DOE Joint Genome Institute JGI as an Organism
[Pipeline diagram labels: Prokaryote, DNA, Library, QC, 4x, 8x, Post Ass QC, Annotation, Jamboree, Portals, IMG, 8x, Final Annotation, Draft Annotation, Finishing, Eukaryote]

22 US DOE Joint Genome Institute It’s 2am, where is your Genome..

23 US DOE Joint Genome Institute Scaling up Global Project Tracking
- How would a 30x increase in production capacity affect tracking?
- PGF has sequenced over 300 species
- More than 100 “active” in freezer
- Wave of new projects propagating through pipeline
- Majority of sequencing is in projects still underway
- Considering use of Blog like features to improve interaction

24 US DOE Joint Genome Institute Assembly and Quality Control
[Pipeline diagram labels: Prokaryote, DNA, Library, QC, 4x, 8x, Post Ass QC, Annotation, Jamboree, Portals, IMG, 8x, Final Annotation, Draft Annotation, Finishing, Eukaryote]

25 US DOE Joint Genome Institute Bimodal GC Content distributions

26 US DOE Joint Genome Institute Use test Fosmids to QC WGS data

27 US DOE Joint Genome Institute Kitchen sink Blast

28 US DOE Joint Genome Institute On a bad day..

29 US DOE Joint Genome Institute Annotation
[Pipeline diagram labels: Prokaryote, DNA, Library, QC, 4x, 8x, Post Ass QC, Annotation, Jamboree, Portals, IMG, 8x, Final Annotation, Draft Annotation, Finishing, Eukaryote]

30 US DOE Joint Genome Institute Scaling Annotation
- “Last year we annotated ~5 genomes, this year plan to do 20, CSP has twice more requests, does it mean 40 next year? At some point we may need to talk in 100s”
- How to prioritize them and share time for support of each of them?
- Measure CPU consumption in 1000 CPU day units
- Need to fundamentally rethink methods/assumptions
  - Algorithms (e.g. gene finders) not improving much
  - Need more experimental data, e.g. tiling arrays
  - Software quality holds us back

31 US DOE Joint Genome Institute Annotation Pipelines
- “So nineties” but still not a well solved problem
- Issues:
  - “Non sucking software”
  - “Skillset for building distributed scalable systems is rare in CS types, perhaps non-existent in biologists”
  - “Moore’s law will succumb to N squared”. In 3 years, computers will be 4 times faster, we will have 10 times more genomes and 100 times more comparisons to do if we insist on comparing all against all.
  - QA/QC/Reproducibility
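The arithmetic behind the "N squared" remark can be checked in a few lines of Python. The sketch below uses the slide's figures (10 times more genomes, computers 4 times faster) and the usual approximation that all-against-all work grows with the square of the number of genomes. The function names and the starting collection size of 10 genomes are illustrative assumptions.

```python
def all_vs_all_pairs(n_genomes):
    """Exact number of unordered genome pairs in an all-against-all comparison."""
    return n_genomes * (n_genomes - 1) // 2


def relative_runtime(genome_growth, cpu_speedup):
    """Approximate change in wall-clock time for all-against-all comparison.

    Work is taken to scale with the square of the number of genomes,
    and hardware speed is taken to divide the time directly.
    """
    return genome_growth ** 2 / cpu_speedup


# Slide figures: 10x more genomes, 4x faster computers.
slowdown = relative_runtime(genome_growth=10, cpu_speedup=4)  # 25.0

# Exact pair counts for a hypothetical collection growing from 10 to 100 genomes.
pairs_before = all_vs_all_pairs(10)   # 45
pairs_after = all_vs_all_pairs(100)   # 4950
```

On these numbers the same analysis would take about 25 times longer in three years despite the faster hardware, which is the point of the quote.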

32 US DOE Joint Genome Institute Environmental Interaction
[Pipeline diagram labels: Prokaryote, DNA, Library, QC, 4x, 8x, Post Ass QC, Annotation, Jamboree, Portals, IMG, 8x, Final Annotation, Draft Annotation, Finishing, Eukaryote]

33 US DOE Joint Genome Institute Data delivery Models Continuing Interaction with environment Good Luck! Data Delivery Model

34 US DOE Joint Genome Institute JGI Genome Portals
- Key tools for presenting large genomes
- Support Jamboree activities
- Attract a lot of web traffic

35 US DOE Joint Genome Institute VISTA: Comparative Genomics Tool
Frazer KA, Pachter L, Poliakov A, Rubin EM, Dubchak I. VISTA: computational tools for comparative genomics. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W273-9

36 US DOE Joint Genome Institute IMG allows 3-click comparison of proteomes
- Can rapidly discover functional differences
- BUT… 90% of “differences” are annotation quality issues

37 US DOE Joint Genome Institute http://regtransbase.lbl.gov

38 US DOE Joint Genome Institute RegTransBase statistics, March 2006
Experiment types related to:
  Gene/operon activation: 2354
  Gene/operon repression: 1128
  Operon structure characterization: 666
  Promoter mapping: 1410
  Regulatory site mapping: 1670
  Terminator mapping: 46
  Regulatory site prediction: 733
  Plasmid replication: 16
Taxonomy, Genes, Sites:
  Alphaproteobacteria 32081678
  Betaproteobacteria 10317
  Gammaproteobacteria 45422668
  E.coli 1516997
  Delta/epsilon proteobacteria 11
  Firmicutes 31951459
  B. subtilis 666320
  Cyanobacteria 135196
  Actinobacteria 33
  Bacteroidetes/Chlorobi group 12
  Archea 34
  Multi- or unknown host plasmids, transposons and phages 1331439
  TOTAL 128176470

39 US DOE Joint Genome Institute The Future?

40 US DOE Joint Genome Institute Future of Bioinformatic Data Analysis?

41 US DOE Joint Genome Institute
- Finishing will be reduced to solving the hardest problems
- The JGI will sequence 20 times more genomes in 2011 than now.
- In a few years we will look back and see that today we are doing low throughput sequencing.
- GenBank will be taken over by Google
- I think all of the old problems will stay with us :)
- Every genome will have several ref sequences (e.g. Male, Female)
- Where will users get their CPU time? Who will do the detailed number crunching?
- As a corollary to all of this, quality & usability of software will need to dramatically improve.
- Nanotech will affect computers profoundly.. Hopefully this will ease our data storage problems just as the flood becomes unmanageable
- The bottlenecks will be: Integration with other resources. Standardization of data exchange. Get the expert knowledge to database. Integration of expert’s knowledge.
- Bandwidth may finally become the bottleneck
- The flood of data will force people to think more about data management.
- The field of bioinformatics has progressed to the point where the crazy quilt of formats, modules, scripts, etc. is now interfering with people's ability to make additional research progress.
- Web-based tools will be much more valuable because of the richness of the data set.
- I don't have a sense of whether short reads will really be the future... if systematic sequencing errors end up being a problem for all of them, and substantial pairing isn't feasible, we might never be able to do anything other than a prokaryote with them.

42 US DOE Joint Genome Institute Writing
http://www.agen.ufl.edu/~chyn/age2062/lect/lect_09/FG10_008.GIF
Alter Observe Understand Recapitulate
- Synthetic technologies will improve ergonomics of gene function validation
- E.g.
  - Active site confirmation
  - Heterologous expression
  - Tagged proteins
  - Cutting and pasting regulatory elements
  - “Simplifying” systems

43 US DOE Joint Genome Institute Thanks Joint Genome Institute

44 US DOE Joint Genome Institute Annotation is Time Consuming
- Preps (pre-assembly, time=2+ weeks)
  - Identify scope, contributors, resources
  - Identify and collect available data (ESTs, FL)
  - Develop strategy for annotation
- Annotation (once assembly is available, time=5-8 weeks)
  - Identify repeats
  - Train gene prediction (1 week)
  - Customize, configure & test-run Pipeline (1 week)
  - Run Pipeline & other tools (2-4 weeks)
  - QC gene models and annotations (1-2 weeks)
- Support (post-release, time=?)
  - Analysis, custom data, user support, jamboree, publications

45 US DOE Joint Genome Institute Example: Filtered Scaffold Depth Estimate

46 US DOE Joint Genome Institute Curator interface

