Whole Genome Sequencing, Comparative Genomics, & Systems Biology Gene Myers University of California Berkeley.

A History of Genome Sequencing
• 1981: Sanger et al. sequence Lambda (50 Kbp) by the shotgun method.
Cloning:
  – BACs permit Kbp inserts
Technology:
  – Cycle sequencing (linear PCR) permits efficient sequencing of both insert ends
  – Capillaries improve accuracy & efficiency
• 1998: 3% of the human genome has been sequenced using a BAC-based hierarchical plan. Common wisdom is that the shotgun approach does not scale beyond BACs, save for simple bacterial sequences.

Whole Genome Shotgun Sequencing
~55 million reads
– Collect 6-10x sequence in a ratio of three types of read pairs: Short (2 Kbp), Long (10 Kbp), Extra Long (Kbp).
  + single highly automated process
  + only a handful of library constructions
  – assembly is much more difficult
– Assemble into “scaffolds”, ordered runs of contigs with known spacing.
– Map scaffolds to genome with STS or other markers.
[Figure: contigs joined by read pairs (mates) across gaps whose mean & std. dev. are known]
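The "6-10x sequence" target can be related to read counts with simple arithmetic. The sketch below is an illustration, not part of the talk: it assumes the idealized Lander-Waterman (Poisson) model in which reads land uniformly at random, and the function names and example numbers are hypothetical.

```python
import math


def fold_coverage(num_reads: int, read_len_bp: int, genome_bp: int) -> float:
    """Average number of reads covering each base (the 'x' in 6-10x)."""
    return num_reads * read_len_bp / genome_bp


def expected_fraction_covered(coverage: float) -> float:
    """Expected fraction of bases hit by at least one read.

    Assumes reads start uniformly at random (Poisson idealization);
    real libraries have cloning bias and repeats, so treat as optimistic.
    """
    return 1.0 - math.exp(-coverage)


def reads_needed(target_coverage: float, read_len_bp: int, genome_bp: int) -> int:
    """Smallest read count that reaches the target fold coverage."""
    return math.ceil(target_coverage * genome_bp / read_len_bp)


if __name__ == "__main__":
    # Hypothetical numbers chosen only to exercise the functions.
    c = fold_coverage(num_reads=1_000, read_len_bp=500, genome_bp=100_000)
    print(f"coverage = {c:.1f}x, fraction covered = {expected_fraction_covered(c):.4f}")
```

Under this model the uncovered fraction falls off as e^(-c), which is one reason coverage well above 1x is collected before assembly is attempted.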

How to accomplish WGA in a nutshell
– Identify and assemble all the unique genomic segments
– Link together into scaffolds with paired reads
– Back-fill interspersed repeats with “anchored reads”
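The linking step relies on the fact that each read pair comes from an insert of roughly known length, so a pair whose reads fall in two different contigs constrains the gap between them. The following is a minimal sketch of that arithmetic, not the Celera Assembler's actual scaffolder; the coordinate conventions and function names are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean


def gap_from_mate(insert_mean: float, len_a: int, start_in_a: int, end_in_b: int) -> float:
    """Gap implied by one mate pair spanning contig A -> contig B.

    Assumed conventions (hypothetical): the insert starts at offset
    start_in_a in contig A and ends at offset end_in_b in contig B, so
    insert = (len_a - start_in_a) + gap + end_in_b.
    """
    return insert_mean - (len_a - start_in_a) - end_in_b


def estimate_gap(insert_mean, insert_sd, len_a, mates):
    """Combine several mate pairs into a gap mean and a standard error.

    Treats inserts as independent draws with the same std. dev., so the
    error of the averaged estimate shrinks as 1/sqrt(number of pairs).
    """
    estimates = [gap_from_mate(insert_mean, len_a, a, b) for a, b in mates]
    return mean(estimates), insert_sd / sqrt(len(estimates))


if __name__ == "__main__":
    # Four hypothetical 2 Kbp mate pairs linking a 5,000 bp contig to its neighbour.
    mates = [(4500, 1000), (4000, 500), (4200, 700), (4700, 1200)]
    print(estimate_gap(2000, 200, 5000, mates))
```

A negative estimate would suggest that the two contigs actually overlap, which a real scaffolder has to handle separately.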

A History of Genome Sequencing
• 1981: Sanger et al. sequence Lambda (50 Kbp) by the shotgun method.
• 1998: 3% of the human genome has been sequenced using a BAC-based hierarchical plan. Common wisdom is that the shotgun approach does not scale beyond BACs, save for simple bacterial sequences.
• 2001: 97% of the chromatin of the human genome has been determined. Mouse, Drosophila, Rice, Fugu, and Anopheles have all been sequenced with a whole genome shotgun approach.
Cloning:
  – BACs permit Kbp inserts
Technology:
  – Cycle sequencing (linear PCR) permits efficient sequencing of both insert ends
  – Capillaries improve accuracy & efficiency

Case Study: 3 Dros. Assemblies vs. Release 3
• Input: (Celera) 3.2M reads, 732K 2Kbp pairs, 548K 10Kbp pairs; (BDGP) 12K BAC pairs.
• WGS1: Dec. 1999, reported in Science
  – Repeat walking removed, Stones debugged, SNP handling
• WGS2: March 2001, time of Human publication
  – Error correction introduced, improvements in unitig classification
• WGS3: July 2002, last run on melanogaster

Coverage of Release 3
[Table: columns WGS1, WGS2, WGS3, Rel. 3. Rows: # of Scaffolds Covering Rel. 3; Total Mb Spanned; Total Mb of Rel. 3 Spanned; Total Mb of Sequence; Total Mb of Rel. 3 Sequence; N50 Scaffold Length (in Mb); Number of Gaps; Mean Contig Length (in kb); Mean Gap Length (in bp). Legible values: Number of Gaps 2,173 (WGS1), 2,315 (WGS2), 1,130 (WGS3), 44 (Rel. 3); one coverage percentage, 99.91%.]
• In addition, 20.7 Mbp of heterochromatic sequence was assembled (WGS3), containing 31 known proteins and 266 newly predicted genes.
• 58% of Rel. 3 gaps were interspersed repeats, 12% were tandem repeats (WGS3).
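For readers unfamiliar with the "N50 Scaffold Length" row: N50 is the length L such that scaffolds of length at least L contain at least half of all assembled bases. A short sketch of the standard calculation follows; the example lengths are made up and are not values from the table.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Sort longest first and walk down until the running total reaches
    half of the overall total; the length at that point is the N50.
    """
    if not lengths:
        raise ValueError("n50() needs at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer-safe test for running >= total / 2
            return length


if __name__ == "__main__":
    # Hypothetical scaffold lengths, in Mb.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because it is weighted by length, N50 is far less sensitive than the mean to the many tiny scaffolds an assembler typically leaves behind.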

O&O Errors vs. Release 3
[Table: columns WGS1, WGS2, WGS3, each giving # segs and # base pairs. Rows: Aligned Segments (Mb); Local Errors (kb); Repeat Errors (kb); Gross misassemblies (kb for WGS1; 0 for WGS2 and WGS3).]

Sequencing Error Rates vs. Release 3
[Chart: Errors / 10 kb for WGS1, WGS2, WGS3. Categories: All Sequence; In Tandem Repeats; In Interspersed Repeats; In Unique Sequence. Series: > 10 bp from gap; > 50 bp from gap.]

• Solid State Sequencing in Pico-wells:
  – Operational next year
  – 25-50 Mbp per instrument/day in 50bp reads, 0.3-1 Kbp pairs (vs. 1-2 Mbp per inst./day in 800bp, 2-10 Kbp pairs)
  – Applications: Resequencing, BAC drafts at 99%
• Detecting dNTP incorporations by fixed PolII complex:
  – Operational 5-10 years from now
  – 1-10 Gbp per instrument/day in 100 Kbp reads (they can be 30-50% noise)!
  – Assembly will not be difficult.
• Nanopore
  – My opinion: not knowable, could be 50 years.

Mouse is smaller than Human: ~15% expansion of euchromatin
Syntenic Anchors
Sequence anchor: >50bp at >75% id. & bidirectionally unique
[Figure: Human (21) vs. Mouse (16), axes in Mbp]
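The anchor definition (>50bp at >75% identity, bidirectionally unique) can be read as a filter over pairwise alignment hits. The sketch below is one possible reading, not the pipeline used for the talk: the `Hit` record is hypothetical, and "bidirectionally unique" is assumed to mean that, among hits passing the thresholds, the human segment matches only that mouse segment and vice versa.

```python
from collections import Counter
from typing import NamedTuple


class Hit(NamedTuple):
    """One human-mouse alignment (hypothetical record layout)."""
    human: str    # identifier of the human segment
    mouse: str    # identifier of the mouse segment
    length: int   # aligned length in bp
    matches: int  # identical positions within the alignment


def sequence_anchors(hits, min_len=50, min_identity=0.75):
    """Keep hits longer than min_len, above min_identity, and unique both ways."""
    passing = [h for h in hits
               if h.length > min_len and h.matches / h.length > min_identity]
    human_seen = Counter(h.human for h in passing)
    mouse_seen = Counter(h.mouse for h in passing)
    # Assumed meaning of "bidirectionally unique": each side occurs exactly once.
    return [h for h in passing
            if human_seen[h.human] == 1 and mouse_seen[h.mouse] == 1]


if __name__ == "__main__":
    hits = [
        Hit("h1", "m1", 100, 90),   # passes and is unique in both directions
        Hit("h2", "m2", 100, 90),   # h2 also hits m3, so not unique
        Hit("h2", "m3", 120, 100),
        Hit("h4", "m4", 40, 40),    # too short
        Hit("h5", "m5", 100, 70),   # identity too low
    ]
    print(sequence_anchors(hits))
```

Requiring uniqueness in both directions discards hits inside repeats, which would otherwise anchor a segment to several places at once.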

Evolution as Genomic Rearrangements
Based on sequence anchor blocks
Courtesy Lisa Stubbs, Oak Ridge National Laboratory

Orthologous Pairs of Proteins

Protein-level synteny
[Figure: Human chromosome 6 vs. Mouse chromosome 17]

Computational Gene Finding
• Computational gene finding: identification of coordinates of coding regions.
• ‘Clues’ that differentiate coding from non-coding regions.
• Cellular machinery (ribosome, spliceosome) recognizes specific signals that mark gene boundaries.
[Figure: GENE and TRANSCRIPT, marking the Start Codon (ATG), Donor Site (GT), Acceptor Site (AG), and Stop Codon]
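The signals named on this slide (ATG start, GT donor, AG acceptor, stop codon) are easy to locate individually; the hard part, taken up on the following slides, is deciding which occurrences are real. As an illustration only, a naive forward-strand scan might look like the sketch below; the function names and the toy sequence are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2):
    """Return (start, end) of ATG...stop open reading frames, forward strand only.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    min_codons counts the codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first in-frame start codon
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ORF
    return orfs


def splice_site_candidates(seq: str):
    """Positions of every GT (possible donor) and AG (possible acceptor)."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


if __name__ == "__main__":
    toy = "CCATGAAATAGGTAAGTT"
    print(find_orfs(toy))
    print(splice_site_candidates(toy))
```

Since GT and AG each occur by chance about once every 16 bases, almost all candidates are false, which is why the dinucleotides alone are weak evidence.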

Computational Gene Finding (Homology)
• Comparative (Genewise, Procrustes, Sim4)
• Perform well when homolog has strong similarity. Performance tapers off with decrease in sequence similarity.
• Performance is (or, should be) independent of sequence composition.
• Difficult to find good homologs.

Full Length cDNAs: Alternate Splicing
Courtesy Terry Gaasterland, Rockefeller

Gene Finding (Ab Initio Methods)
• Gene structure is identified by the most likely parse of the sequence through an appropriate HMM (weighted finite automaton) (ex: Genscan, Genie…).
• Fairly accurate, with well understood procedures for training models and parsing.
• Recent results (multi-gene examples) indicate that further improvements are desirable (Guigo’99).
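"Most likely parse through an HMM" refers to the Viterbi algorithm. The sketch below runs Viterbi on a deliberately tiny two-state model; it is not Genscan or Genie, which use much richer state spaces (exons by frame, introns, splice signals). The states, probabilities, and the GC-rich-coding assumption are all invented for illustration.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state path for obs, computed in log space."""
    if not obs:
        return []
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for symbol in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][symbol]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def logs(table):
    """Convert a dict of probabilities to log probabilities."""
    return {k: math.log(v) for k, v in table.items()}


# Toy model (assumed numbers): coding regions GC-rich, non-coding AT-rich.
STATES = ("coding", "noncoding")
LOG_START = logs({"coding": 0.5, "noncoding": 0.5})
LOG_TRANS = {"coding": logs({"coding": 0.9, "noncoding": 0.1}),
             "noncoding": logs({"coding": 0.1, "noncoding": 0.9})}
LOG_EMIT = {"coding": logs({"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}),
            "noncoding": logs({"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4})}

if __name__ == "__main__":
    print(viterbi("ATATATGCGCGCGCGCATATAT", STATES, LOG_START, LOG_TRANS, LOG_EMIT))
```

Working in log space avoids numerical underflow on long sequences, and the run time is linear in sequence length for a fixed number of states.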

1D Methods: Summary
• Homology:
  – Very specific and accurate
  – Can sample only abundant genes and full-length is hard
• Ab Initio:
  – Good sensitivity for presence (85%) but weak for exon (60%) and gene (10%), also very non-specific (20%).
  – Main drivers of recognition are:
    · Splice site
    · No stop codon in exon
    · Some bias in hexamer coding frequency
• Mouse vs. Human Homology ( million years):
  – 85% of exons in a TBlastX hit
  – 85% amino acid identity in a hit
  – 25% of TBlastX hits contain a true exon

2D: Homology (Sagot et al., Huson & Bafna)
Require gene models (splice sites + start + no-stop) in both genomes that have high homology.
[Figure: Human vs. Mouse]
Performance is better than 1D HMM with weak splice site model.

2D HMMs
[Figure: Target; Evidence Mask (0/1); cDNA, other evidence]
• Twinscan (Brent et al.): Given training set of known genes and evidence mask, learn HMM over  {0/1}
• SLAM (Pachter et al., Durbin et al.): Given training set of known genes and “correct” alignments, learn HMM over  k

Outcomes
• Exon prediction (must get splice junctions right)
  – SN 63% → 68%
  – SP 58% → 66%
• Gene prediction (must get every exon)
  – SN 15% → 24%
  – SP 10% → 14%
• A lot of improvement possible?
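As a reminder of how such figures are usually computed: in the gene-finding literature, SN is the fraction of annotated features that were predicted exactly, and SP is the fraction of predictions that are exactly right (what other fields call precision). The sketch below assumes those conventional definitions, which the slide does not spell out; the data structures are hypothetical.

```python
def accuracy(predicted, annotated):
    """Return (SN, SP) for two collections of hashable features.

    SN = correct / annotated, SP = correct / predicted, where a feature
    is correct only if it matches an annotated feature exactly.
    """
    predicted, annotated = set(predicted), set(annotated)
    correct = len(predicted & annotated)
    sn = correct / len(annotated) if annotated else 0.0
    sp = correct / len(predicted) if predicted else 0.0
    return sn, sp


def as_gene(exons):
    """Represent a gene as the frozen set of its (start, end) exons, so two
    genes compare equal only when every exon boundary agrees."""
    return frozenset(exons)


if __name__ == "__main__":
    # Exon level: both splice junctions must match.
    annotated_exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
    predicted_exons = [(100, 200), (300, 400), (500, 590)]
    print(accuracy(predicted_exons, annotated_exons))

    # Gene level: a single wrong exon makes the whole gene wrong.
    annotated_genes = [as_gene([(100, 200), (300, 400)]), as_gene([(500, 600), (700, 800)])]
    predicted_genes = [as_gene([(100, 200), (300, 400)]), as_gene([(500, 590), (700, 800)])]
    print(accuracy(predicted_genes, annotated_genes))
```

The all-or-nothing gene-level criterion explains why gene accuracy is so much lower than exon accuracy: per-exon errors compound across every exon of a gene.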