
Whole Genome Sequencing, Comparative Genomics, & Systems Biology Gene Myers University of California Berkeley.


1 Whole Genome Sequencing, Comparative Genomics, & Systems Biology Gene Myers University of California Berkeley

2 A History of Genome Sequencing
- 1981: Sanger et al. sequence Lambda (50Kbp) by the shotgun method.
- Cloning: BACs permit 100-250Kbp inserts.
- Technology: Cycle sequencing (linear PCR) permits efficient sequencing of both insert ends. Capillaries improve accuracy & efficiency.
- 1998: 3% of the human genome has been sequenced using a BAC-based hierarchical plan. Common wisdom is that the shotgun approach does not scale beyond BACs, save for simple bacterial sequences.

3 Whole Genome Shotgun Sequencing
~55 million reads
- Collect 6-10x sequence in a 5-5-1 ratio of three types of read pairs: Short (2Kbp), Long (10Kbp), Extra Long (50-150Kbp).
- Assemble into “scaffolds”, ordered runs of contigs with known spacing.
- Map scaffolds to genome with STS or other markers.
Pros: single highly automated process; only a handful of library constructions.
Con: assembly is much more difficult.
[Figure labels: Contig; Gap (mean & std. dev. known); Read pair (mates)]
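
The read count and the coverage target on this slide are linked by simple arithmetic: fold coverage is total read bases divided by genome size. A minimal sketch follows. The 500bp mean read length and the 3Gbp genome size are assumptions chosen for illustration, not figures from the slide, and the function names are hypothetical.

```python
def fold_coverage(n_reads, mean_read_len_bp, genome_size_bp):
    """Average number of times each base is covered by a read."""
    return n_reads * mean_read_len_bp / genome_size_bp


def split_pairs(n_pairs, ratio=(5, 5, 1)):
    """Split a number of read pairs across libraries in the given ratio
    (short : long : extra long), using integer division."""
    total = sum(ratio)
    return tuple(n_pairs * r // total for r in ratio)


# Assumed inputs: 500bp reads and a 3Gbp genome (not stated on the slide).
coverage = fold_coverage(55_000_000, 500, 3_000_000_000)  # about 9.2x

# 55M reads are 27.5M pairs; a 5-5-1 split gives the pairs per library type.
short_pairs, long_pairs, extra_long_pairs = split_pairs(27_500_000)
```

Under these assumptions the result falls inside the 6-10x range quoted on the slide; a different read length or genome size shifts it proportionally.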

4 How to accomplish WGA in a nutshell
- Identify and assemble all the unique genomic segments
- Link together into scaffolds with paired reads
- Back-fill interspersed repeats with “anchored reads”

5 A History of Genome Sequencing
- 1981: Sanger et al. sequence Lambda (50Kbp) by the shotgun method.
- Cloning: BACs permit 100-250Kbp inserts.
- Technology: Cycle sequencing (linear PCR) permits efficient sequencing of both insert ends. Capillaries improve accuracy & efficiency.
- 1998: 3% of the human genome has been sequenced using a BAC-based hierarchical plan. Common wisdom is that the shotgun approach does not scale beyond BACs, save for simple bacterial sequences.
- 2001: 97% of the chromatin of the human genome has been determined. Mouse, Drosophila, Rice, Fugu, and Anopheles have all been sequenced with a whole genome shotgun approach.

6

7 Case Study: 3 Dros. Assemblies vs. Release 3
- Input: (Celera) 3.2M reads, 732K 2Kbp pairs, 548K 10Kbp pairs; (BDGP) 12K BAC pairs.
- WGS1: Dec. 1999, reported in Science 2000.
  Repeat walking removed, Stones debugged, SNP handling.
- WGS2: March 2001, time of Human publication.
  Error correction introduced, improvements in unitig classification.
- WGS3: July 2002, last run on melanogaster.

8 Coverage of Release 3

| | WGS1 | WGS2 | WGS3 | Rel. 3 |
|---|---|---|---|---|
| Total Mb Spanned | 116.39 | 117.44 | 117.6 | 116.91 |
| Total Mb of Rel. 3 Spanned | 116.4 | 116.5 | 116.8 | -- |
| Total Mb of Sequence | 114.15 | 115.83 | 116.42 | 116.87 |
| Total Mb of Rel. 3 Sequence | 114.1 | 115 | 115.6 | -- |
| N50 Scaffold Length (in Mb) | 10.85 | 14.45 | 13.89 | 18.5 |
| Number of Gaps | 2,173 | 2,315 | 1,130 | 44 |
| Mean Contig Length (in kb) | 52.2 | 49.5 | 102 | 2,335 |
| Mean Gap Length (in bp) | 1,531 | 912 | 1,335 | -- |

# of Scaffolds Covering Rel. 3: 55635313 [column values run together in the transcript]
Slide annotations: 98.93%, 99.91%
In addition, 20.7Mbp of heterochromatic sequence was assembled (WGS3), containing 31 known proteins and 266 newly predicted genes.
58% of Rel. 3 gaps were interspersed repeats, 12% were tandem repeats (WGS3).
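
The N50 scaffold length reported on this slide is a standard assembly statistic: the length L such that scaffolds of length L or longer contain at least half of all assembled bases. A minimal sketch of the calculation follows; the function name and the example lengths are hypothetical and are not data from the talk.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 needs at least one non-empty scaffold")


# Hypothetical scaffold lengths in Mb, for illustration only.
example_lengths = [10, 8, 5, 3, 2]
example_n50 = n50(example_lengths)  # 10 + 8 = 18 covers half of 28, so N50 is 8
```

Because N50 weights scaffolds by the bases they contain, a few long scaffolds dominate it, which is why it is usually read alongside the gap counts and mean contig length.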

9 O&O Errors vs. Release 3

| | WGS1 # segs | WGS1 # base pairs | WGS2 # segs | WGS2 # base pairs | WGS3 # segs | WGS3 # base pairs |
|---|---|---|---|---|---|---|
| Aligned Segments | 2,125 | 113.30 Mb | 2,270 | 114.41 Mb | 1,087 | 114.99 Mb |
| Local Errors | 9 | 68.33 kb | 7 | 9.80 kb | 3 | 5.64 kb |
| Repeat Errors | 25 | 42.52 kb | 1 | 0.66 kb | 1 | 0.98 kb |
| Gross misassemblies | 3 | 10.69 kb | 0 | | 0 | |

10 Sequencing Error Rates vs. Release 3

| Errors / 10 kb | WGS1 | WGS2 | WGS3 |
|---|---|---|---|
| All Sequence | 4.12 | 2.23 | 1.1 |
| In Tandem Repeats | 95.2 | 61.4 | 48.8 |
| In Interspersed Repeats | 78.2 | 15.8 | 9.62 |
| In Unique Sequence | 1.82 | 1.31 | 0.38 |
| > 10 bp from gap | 1.37 | 1.02 | 0.29 |
| > 50 bp from gap | 1.32 | 0.95 | 0.26 |

11

12
- Solid State Sequencing in Pico-wells:
  - Operational next year
  - 25-50Mbp per instrument/day in 50bp reads, 0.3-1Kbp pairs (vs. 1-2Mbp per inst./day in 800bp, 2-10Kbp pairs)
  - Applications: Resequencing, BAC drafts at 99%
- Detecting dNTP incorporations by fixed PolII complex:
  - Operational 5-10 years from now
  - 1-10Gbp per instrument/day in 100Kbp reads (they can be 30-50% noise)!
  - Assembly will not be difficult.
- Nanopore
  - My opinion: not knowable, could be 50 years.

13

14 Mouse is smaller than Human: ~15% expansion of euchromatin
Sequence anchor: >50bp at >75% id. & bidirectionally unique
[Figure: Syntenic Anchors, Human (21) vs. Mouse (16), axes in Mbp]
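
The anchor criterion on this slide (>50bp, >75% identity, bidirectionally unique) can be read as a filter over candidate alignments. The sketch below is one possible reading, not the procedure used for the talk: it assumes "bidirectionally unique" means that a human segment has exactly one passing match in mouse and that mouse segment has exactly one passing match in human. The tuple layout and function name are hypothetical.

```python
from collections import Counter


def select_anchors(matches, min_len=50, min_identity=0.75):
    """Keep matches that pass the length and identity thresholds and are
    unique in both directions.

    Each match is a tuple: (human_segment_id, mouse_segment_id,
    aligned_length_bp, fraction_identity).
    """
    passing = [m for m in matches if m[2] > min_len and m[3] > min_identity]
    # Count how many passing matches each segment takes part in.
    human_hits = Counter(m[0] for m in passing)
    mouse_hits = Counter(m[1] for m in passing)
    return [m for m in passing
            if human_hits[m[0]] == 1 and mouse_hits[m[1]] == 1]


# Made-up candidate matches, for illustration only.
candidates = [
    ("h1", "m1", 120, 0.90),  # kept: passes thresholds, unique both ways
    ("h2", "m2", 80, 0.80),   # dropped: h2 also matches m3
    ("h2", "m3", 90, 0.85),   # dropped: h2 also matches m2
    ("h3", "m4", 40, 0.95),   # dropped: too short
    ("h4", "m5", 200, 0.70),  # dropped: identity too low
]
anchors = select_anchors(candidates)
```

Applying the thresholds before the uniqueness test is a design choice in this sketch: weak matches that fail the thresholds do not disqualify an otherwise unique anchor.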

15 Evolution as Genomic Rearrangements
Based on sequence anchor blocks. Courtesy Lisa Stubbs, Oak Ridge National Laboratory.

16 Orthologous Pairs of Proteins

17 Protein-level synteny: Human chromosome 6, Mouse chromosome 17

18 Computational Gene Finding
- Computational gene finding: Identification of coordinates of coding regions.
- ‘Clues’ that differentiate coding from non-coding regions.
- Cellular machinery (ribosome, spliceosome) recognizes specific signals that mark gene boundaries.
[Figure labels: GENE: Start Codon (ATG), Donor Site (GT), Acceptor Site (AG), Stop Codon; TRANSCRIPT]
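
The signals named on this slide (ATG start codon, GT donor and AG acceptor dinucleotides, stop codon) can be located by a plain scan of the sequence. The sketch below is a toy illustration only: it looks at the forward strand, treats every GT or AG as a candidate splice site, and uses hypothetical function names. Real gene finders score much more sequence context around each signal.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) spans of forward-strand open reading frames:
    an ATG followed in frame by a stop codon. `end` is exclusive and
    includes the stop codon. `min_codons` counts codons before the stop,
    including the ATG."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


def splice_site_candidates(seq):
    """Return positions of every GT (candidate donor) and AG (candidate
    acceptor) dinucleotide."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


# Made-up sequence: ATG AAA TTT TAG sits in frame 2, starting at index 2.
orfs = find_orfs("CCATGAAATTTTAGGG")  # [(2, 14)]
```

The dinucleotide scan returns many false candidates on real sequence, which is one reason the signals alone are treated as clues rather than as a gene model.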

19 Computational Gene Finding (Homology)
- Comparative (Genewise, Procrustes, Sim4)
- Perform well when homolog has strong similarity. Performance tapers off with decrease in sequence similarity.
- Performance is (or should be) independent of sequence composition.
- Difficult to find good homologs.

20 Full Length cDNA’s: Alternate Splicing Courtesy Terry Gaasterland, Rockefeller

21 Gene Finding (Ab Initio Methods)
- Gene structure is identified by the most likely parse of the sequence through an appropriate HMM (weighted finite automaton) (ex: Genscan, Genie…).
- Fairly accurate, with well understood procedures for training models and parsing.
- Recent results (multi-gene examples) indicate that further improvements are desirable (Guigo’99).
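
The "most likely parse" through an HMM is conventionally computed with the Viterbi algorithm. The sketch below runs Viterbi in log space on a toy two-state model (C for coding, N for non-coding). The states, probabilities, and sequence are invented for illustration and bear no relation to the trained, much larger models in Genscan or Genie.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observation sequence."""
    if not obs:
        return []
    # scores[s]: best log-probability of any path that ends in state s.
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
              for s in states}
    backpointers = []
    for symbol in obs[1:]:
        new_scores, pointer = {}, {}
        for s in states:
            best_prev = max(
                states, key=lambda p: scores[p] + math.log(trans_p[p][s]))
            new_scores[s] = (scores[best_prev]
                             + math.log(trans_p[best_prev][s])
                             + math.log(emit_p[s][symbol]))
            pointer[s] = best_prev
        scores = new_scores
        backpointers.append(pointer)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path


# Toy model (all numbers are assumptions): coding is GC-rich, non-coding AT-rich.
STATES = ["C", "N"]
START = {"C": 0.5, "N": 0.5}
TRANS = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
EMIT = {
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

parse = "".join(viterbi("AATTGCGCGC", STATES, START, TRANS, EMIT))
# parse == "NNNNCCCCCC": the AT-rich prefix is labeled non-coding.
```

Working in log space avoids numerical underflow on long sequences, which matters at genome scale.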

22 1D Methods: Summary
- Homology:
  - Very specific and accurate
  - Can sample only abundant genes, and full-length is hard
- Ab Initio:
  - Good sensitivity for presence (85%) but weak for exon (60%) and gene (10%); also very non-specific (20%).
  - Main drivers of recognition are: splice site; no stop codon in exon; some bias in hexamer coding frequency
- Mouse vs. Human Homology (50-100 million years):
  - 85% of exons in a TBlastX hit
  - 85% amino acid identity in a hit
  - 25% of TBlastX hits contain a true exon

23 2D: Homology (Sagot et al., Huson & Bafna)
Require gene models (splice sites + start + no-stop) in both genomes (Human, Mouse) that have high homology.
Performance is better than 1D HMM with weak splice site model.

24 2D HMMs
[Figure labels: Target; Evidence Mask (0/1); cDNA, other evidence]
- Twinscan (Brent et al.): Given training set of known genes and evidence mask, learn HMM over  {0/1}
- SLAM (Pachter et al., Durbin et al.): Given training set of known genes and “correct” alignments, learn HMM over  k

25 Outcomes
- Exon prediction (must get splice junctions right): SN 63% → 68%, SP 58% → 66%
- Gene prediction (must get every exon): SN 15% → 24%, SP 10% → 14%
- A lot of improvement possible?
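
For reference, exon-level SN and SP in gene-finding evaluations are usually computed by exact matching: an exon counts as correct only if both of its boundaries agree with the annotation. SN is the fraction of true exons predicted exactly, and SP, in this field's convention, is the fraction of predicted exons that are correct. A minimal sketch, with hypothetical coordinates and function name:

```python
def exon_sn_sp(true_exons, predicted_exons):
    """Return (sensitivity, specificity) for exact exon matches.

    Exons are (start, end) tuples; a prediction is correct only if both
    coordinates match an annotated exon.
    """
    true_set, pred_set = set(true_exons), set(predicted_exons)
    correct = len(true_set & pred_set)
    sn = correct / len(true_set) if true_set else 0.0
    sp = correct / len(pred_set) if pred_set else 0.0
    return sn, sp


# Made-up annotation and prediction, for illustration only.
annotated = [(100, 200), (300, 400), (500, 650)]
predicted = [(100, 200), (300, 410), (500, 650), (700, 800)]
sn, sp = exon_sn_sp(annotated, predicted)  # 2 of 3 true exons found; 2 of 4 predictions correct
```

Gene-level scoring applies the same idea with a stricter unit: a gene counts as correct only if every one of its exons matches, which is why the gene-level figures on the slide are so much lower than the exon-level ones.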

