CISC667, S07, Lec4, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Whole genome sequencing Mapping & Assembly.


[A] Bacteria, 1.6 Mb, ~1600 genes. Science 269: 496
[B] 1997 Eukaryote, 13 Mb, ~6K genes. Nature 387: 1
[C] 1998 Animal, ~100 Mb, ~20K genes. Science 282: 1945
[D] 2001 Agrobacteria, 5.67 Mb, ~5419 genes. Science 294: 2317
[E,F] 2001 Human, ~3 Gb, ~35K genes. Science 291: 1304; Nature 409: 860
[G] 2005 Chimpanzee
As of May 2005, 225 completed microbial genomes (source: TIGR CMR)

Sequencing Technologies and Strategies
Sanger method (gel-based sequencing)
–Top-Down or “BAC-to-BAC”
–Whole Genome Shotgun
Sequencing by Hybridization
Pyrosequencing (“sequencing by synthesis”) (Science, 1998)
Polony sequencing (George Church et al., Science, August 2005)
The sequencing (Nature, July 2005)
SBS (

Physical Mapping by using Sequence-Tagged Sites
(Courtesy of Discovering Genomics, Proteomics, & Bioinformatics by Campbell & Heyer.)
STSs are unique markers.
Exercise: What is the difference between an STS and an EST (expressed sequence tag)? What is the chance that a fragment of 20 bps is unique in a genome of 3 billion bps?
(Visit to learn more about EST as an alternative to whole genome sequencing.)
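
One way to approach the second exercise question is a back-of-the-envelope estimate. The sketch below assumes every base is drawn independently and uniformly from {A, C, G, T}, which real genomes (with their repeats) do not satisfy, so the result is an optimistic upper bound rather than an answer from the lecture. The function name `prob_unique` is hypothetical.

```python
import math

def prob_unique(k, genome_len):
    """Chance that a given k-mer matches at no other position of a random genome.

    Assumption: bases are i.i.d. and uniform, so any other position matches
    the k-mer with probability 4**-k, independently of the rest.
    """
    p_match = 4.0 ** -k
    other_positions = genome_len - k
    # (1 - p)^n computed via log1p for numerical stability with tiny p
    return math.exp(other_positions * math.log1p(-p_match))

print(prob_unique(20, 3_000_000_000))
```

Under this model a 20-bp tag is almost always unique, which is why short STSs can serve as markers; repeats in a real genome lower the chance.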

Terms
BAC
YAC
Cosmid
Mapping
Tiling path
Read
Gap
Contig
Shotgun

Assembly is complicated by repeats

Informatics tasks
Shotgun coverage (Lander-Waterman)
Base-calling (Phred)
Assembly (Phrap)
Visualization (Consed – contig editor for phred and phrap)
Post-assembly analyses
–Sun Kim, Li Liao, Jean-Francois Tomb, "A Probabilistic Approach to Sequence Assembly Validation"
…

STS-content mapping
a. Actual ordering that we want to infer: [Figure: DNA carrying STSs 1, 2, 3, 4, overlapped by Clone1, Clone2, Clone3]
b. Hybridization data: [Figure: STS-by-clone matrix]
What we do not know: neither the relative location of STSs in the genome nor the relative location of clones in the genome.
c. Permutation of columns to have consecutive ones in rows. Linear time algorithm by Booth & Lueker (1976).
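
Step c can be illustrated with a small brute-force search. This is a minimal sketch, not the Booth & Lueker algorithm: it tries every column permutation, so it takes factorial time and is only usable for a handful of STSs. The matrix values and the names `has_consecutive_ones` and `find_sts_order` are made up for illustration; rows are clones and columns are STSs.

```python
from itertools import permutations

def has_consecutive_ones(row):
    """True if all 1s in the row form one contiguous block (or there are none)."""
    ones = [i for i, v in enumerate(row) if v]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)

def find_sts_order(matrix):
    """Return a column order giving consecutive ones in every row, or None.

    Brute force over all permutations: exponential, unlike the linear-time
    PQ-tree method cited on the slide.
    """
    n_cols = len(matrix[0])
    for perm in permutations(range(n_cols)):
        if all(has_consecutive_ones([row[j] for j in perm]) for row in matrix):
            return perm
    return None

# Hypothetical hybridization data: 3 clones x 4 STSs
hybridization = [
    [1, 0, 1, 0],  # clone 1 contains STS 0 and STS 2
    [0, 1, 1, 0],  # clone 2 contains STS 1 and STS 2
    [0, 1, 0, 1],  # clone 3 contains STS 1 and STS 3
]
order = find_sts_order(hybridization)
print(order)
```

The order found is generally not unique: its mirror image is always valid too, and noisy hybridization data may admit no valid order at all.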

Sequence coverage (Lander-Waterman, 1988):
Length of genome: G
Length of fragment: L
# of fragments: N
Coverage: a = NL/G
Fragments are taken randomly from the original full-length genome.
Q: What is the probability that a base is not covered by any fragment?
Assumption: fragments are taken independently from the genome; in other words, the left-hand end (LHE) of any fragment is uniformly distributed in (0, G).
Then the probability for the LHE of a fragment to fall within an interval (x, x+L) is L/G. Since there are N fragments in total, any point in the genome is covered, on average, by NL/G fragments.
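
The uniform-placement assumption can be checked with a small simulation. This sketch uses made-up sizes (G = 10,000, L = 100, N = 500) and additionally assumes a circular genome so that fragments near the end wrap around instead of being truncated; neither choice comes from the lecture.

```python
import random

def simulate_depth(genome_len, frag_len, n_frags, seed=0):
    """Return the per-base depth after placing fragments uniformly at random."""
    rng = random.Random(seed)
    depth = [0] * genome_len
    for _ in range(n_frags):
        start = rng.randrange(genome_len)  # left-hand end, uniform on the genome
        for offset in range(frag_len):
            # Assumption: circular genome, so a fragment wraps past the end
            depth[(start + offset) % genome_len] += 1
    return depth

depth = simulate_depth(genome_len=10_000, frag_len=100, n_frags=500)
mean_depth = sum(depth) / len(depth)                      # equals N*L/G here
uncovered = sum(1 for d in depth if d == 0) / len(depth)  # fraction with no fragment
print(mean_depth, uncovered)
```

The mean depth equals NL/G by construction, while the uncovered fraction varies from run to run around a small value, which is the quantity the question above asks about.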

Poisson distribution:
- The rate for an event A to occur is r. (What is the rate to see a left-hand end of a fragment?)
- Probability to see an event in the time interval (t, t + dt) is $P(A|r) = r\,dt$
- $h(t)$ = probability of no event in (0, t)
- By independence of different time intervals, $h(t + dt) = h(t)\,[1 - r\,dt] \Rightarrow \partial h/\partial t + r\,h(t) = 0 \Rightarrow h(t) = \exp(-rt)$. This is called the exponential distribution.
- Probability to have n events in (0, t): $P(n|r) = \exp(-rt)\,(rt)^n / n!$
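
A quick numerical check of the Poisson formula, as a sketch under stated assumptions: the rate of left-hand ends is taken to be N/G per base (which follows from uniform placement), "time" is read as distance along the genome, and the values N = 2000, G = 100,000, L = 500 are hypothetical. The exact count of fragment starts in a window of L bases is binomial, and the Poisson form should approximate it.

```python
import math

def poisson_pmf(n, rate, t):
    """P(n events in (0, t)) for a Poisson process with the given rate."""
    mean = rate * t
    return math.exp(-mean) * mean ** n / math.factorial(n)

def binomial_pmf(n, trials, p):
    """P(n successes in `trials` independent tries, each with probability p)."""
    return math.comb(trials, n) * p ** n * (1 - p) ** (trials - n)

# Hypothetical numbers: N fragments on a genome of G bases, window of L bases
N, G, L = 2000, 100_000, 500
rate = N / G  # assumed rate of left-hand ends per base
for n in range(4):
    print(n, poisson_pmf(n, rate, L), binomial_pmf(n, N, L / G))
```

The n = 0 row is the probability that no fragment starts in the window, exp(-NL/G), which is the term the coverage argument relies on.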

What is the mean proportion of the genome covered by one or more fragments?
–Randomly pick a point. The probability that at least one fragment starts within L to its left, so that the point is covered, is 1 – exp(-NL/G).
–Example: to have the genome 99% covered, the coverage NL/G shall be 4.6; and 99.9% covered if NL/G is 6.9.
–Is it enough to have 99.9% covered? The human genome has 3 × 10^9 bps. A 6.9x coverage will leave ~3,000,000 bps uncovered, which causes physical gaps in sequencing the human genome.
–Then, what is the number of possible gaps?
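
The figures in the example can be reproduced directly from 1 – exp(-NL/G). This is a small sketch; the function names are hypothetical, and the only inputs are the numbers quoted above.

```python
import math

def fraction_covered(coverage):
    """Mean proportion of the genome covered at coverage a = NL/G."""
    return 1.0 - math.exp(-coverage)

def coverage_needed(target_fraction):
    """Coverage a = NL/G required to cover the target fraction on average."""
    return -math.log(1.0 - target_fraction)

print(coverage_needed(0.99))    # about 4.6
print(coverage_needed(0.999))   # about 6.9

human_genome_bp = 3e9
uncovered_bp = human_genome_bp * (1.0 - fraction_covered(6.9))
print(uncovered_bp)             # about 3 million bps
```

Because the uncovered fraction decays only exponentially in NL/G, each additional factor of ten in completeness costs roughly 2.3 more units of coverage.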

What is the mean # of contigs?
–N exp(-NL/G)
–For G = 100,000 bps, and L =
[Table: NL/G vs. mean # of contigs]
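
The formula N exp(-NL/G) can be tabulated for a range of coverages. In this sketch G = 100,000 bps is taken from the slide, but the fragment length L = 500 is an assumption chosen only for illustration, since the slide's value did not survive; the printed numbers are therefore not the slide's table.

```python
import math

def mean_contigs(n_frags, frag_len, genome_len):
    """Expected number of contigs under the Lander-Waterman model: N exp(-NL/G)."""
    coverage = n_frags * frag_len / genome_len
    return n_frags * math.exp(-coverage)

G = 100_000
L = 500  # assumed fragment length, not from the slide
for coverage in (0.5, 1, 2, 3, 4, 5, 6, 7):
    n = int(coverage * G / L)  # number of fragments giving this coverage
    print(coverage, n, round(mean_contigs(n, L, G), 1))
```

The count first rises with coverage (more fragments, few overlaps), peaks at NL/G = 1, and then falls as fragments merge; the number of gaps is about the number of contigs.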

Assembly programs
Phrap
Cap
TIGR assembler
Celera assembler
CAP3
ARACHNE
EULER
AMASS
A genome sequence assembly primer is available at

Sequencing by Hybridization

Sequence reconstruction and Eulerian Path Problem (Pevzner ’89)
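
The Eulerian path idea can be sketched as follows: each k-mer in the hybridization spectrum becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix, and a path that uses every edge once spells a sequence with exactly that spectrum. The code below is an illustrative sketch using Hierholzer's algorithm; it assumes an error-free spectrum that does admit an Eulerian path, and the function name and example spectrum are made up.

```python
from collections import defaultdict

def reconstruct_from_spectrum(kmers):
    """Spell one sequence whose k-mers are exactly `kmers` via an Eulerian path."""
    graph = defaultdict(list)   # (k-1)-mer -> list of successor (k-1)-mers
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1
    # Start where out-degree exceeds in-degree; otherwise any node works
    start = next((n for n, b in balance.items() if b == 1), kmers[0][:-1])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())         # dead end: node is finished
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])

spectrum = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
sequence = reconstruct_from_spectrum(spectrum)
print(sequence)
```

This spectrum has more than one Eulerian path, so the printed sequence is only one of the valid reconstructions; repeated (k-1)-mers are what create the ambiguity.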

Chen & Skiena 2005 (