Computational Genomics and Proteomics Sequencing genomes.

Slides:



Advertisements
Similar presentations
Graph Algorithms in Bioinformatics. Outline Introduction to Graph Theory Eulerian & Hamiltonian Cycle Problems Benzer Experiment and Interval Graphs DNA.
Advertisements

Introduction to Bioinformatics Algorithms Graph Algorithms in Bioinformatics.
Sequencing a genome. Definition Determining the identity and order of nucleotides in the genetic material – usually DNA, sometimes RNA, of an organism.
SEQUENCING-related topics 1. chain-termination sequencing 2. the polymerase chain reaction (PCR) 3. cycle sequencing 4. large scale sequencing stefanie.hartmann.
13-2 Manipulating DNA.
Physical Mapping I CIS 667 February 26, Physical Mapping A physical map of a piece of DNA tells us the location of certain markers  A marker is.
DNA Sequencing. Next few topics DNA Sequencing  Sequencing strategies Hierarchical Online (Walking) Whole Genome Shotgun  Sequencing Assembly Gene Recognition.
DNA Sequencing and Assembly
3 September, 2004 Chapter 20 Methods: Nucleic Acids.
CS273a Lecture 4, Autumn 08, Batzoglou Hierarchical Sequencing.
DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA.
DNA Sequencing. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT.
CS273a Lecture 2, Autumn 10, Batzoglou DNA Sequencing (cont.)
DNA Sequencing. CS273a Lecture 3, Autumn 08, Batzoglou DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT.
Introduction to Bioinformatics Lecture 20: Sequencing genomes.
DNA Sequencing. Next few topics DNA Sequencing  Sequencing strategies Hierarchical Online (Walking) Whole Genome Shotgun  Sequencing Assembly Gene Recognition.
Introduction to Bioinformatics Algorithms Graph Algorithms in Bioinformatics.
Presentation on genome sequencing. Genome: the complete set of gene of an organism Genome annotation: the process by which the genes, control sequences.
Analyzing your clone 1) FISH 2) “Restriction mapping” 3) Southern analysis : DNA 4) Northern analysis: RNA tells size tells which tissues or conditions.
DNA Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 8, 2005 ChengXiang Zhai Department of Computer Science University of Illinois,
CS 394C March 19, 2012 Tandy Warnow.
Graphs and DNA sequencing CS 466 Saurabh Sinha. Three problems in graph theory.
Applications of DNA technology
394C March 5, 2012 Introduction to Genome Assembly.
Graph Theory And Bioinformatics Jason Wengert. Outline Introduction to Graphs Eulerian Paths & Hamiltonian Cycles Interval Graph & Shape of Genes Sequencing.
A Lot More Advanced Biotechnology Tools (Part 1) Sequencing.
Genome sequencing Haixu Tang School of Informatics.
Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake.
Sequence Assembly BMI/CS 576 Fall 2010 Colin Dewey.
A Sequenciação em Análises Clínicas Polymerase Chain Reaction.
JM - 1 Introduction to Bioinformatics: Lecture III Genome Assembly and String Matching Jarek Meller Jarek Meller Division of Biomedical.
Combinatorial Optimization Problems in Computational Biology Ion Mandoiu CSE Department.
Stratton Nature 45: 719, 2009 Evolution of DNA sequencing technologies to present day DNA SEQUENCING & ASSEMBLY.
Introduction to Bioinformatics Algorithms Graph Algorithms in Bioinformatics.
Human Genome.
GENE SEQUENCING. INTRODUCTION CELL The cells contain the nucleus. The chromosomes are present within the nucleus.
KEY CONCEPT Biotechnology relies on cutting DNA at specific places.
DNA Sequencing.
Chapter 10: Genetic Engineering- A Revolution in Molecular Biology.
Locating and sequencing genes
Sequencing tutorial Peter HANTZ EMBL Heidelberg.
Manipulating DNA. Scientists use their knowledge of the structure of DNA and its chemical properties to study and change DNA molecules Different techniques.
Johnson - The Living World: 3rd Ed. - All Rights Reserved - McGraw Hill Companies Genomics Chapter 10 Copyright © McGraw-Hill Companies Permission required.
Genomics Part 1. Human Genome Project  G oal is to identify the DNA sequence of every gene in humans Genome  all the DNA in one cell of an organism.
A new Approach to Fragment Assembly in DNA Sequenceing Fei wu April,24,2006.
Whole Genome Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 13, 2005 ChengXiang Zhai Department of Computer Science University of.
FOOTHILL HIGH SCHOOL SCIENCE DEPARTMENT Chapter 13 Genetic Engineering Section 13-2 Manipulating DNA.
Review: Graph Theory in Bioinformatics Yunkai Liu Assistant Professor Computer Science Department University of South Dakota.
Title: Studying whole genomes Homework: learning package 14 for Thursday 21 June 2016.
Topic Cloning and analyzing oxalate degrading enzymes to see if they dissolve kidney stones with Dr. VanWert.
CSCI2950-C Lecture 2 DNA Sequencing and Fragment Assembly
DNA Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics)
DNA Sequencing Project
DNA Sequencing -sayed Mohammad Amin Nourion -A’Kia Buford
Genomics Sequencing genomes.
Sequencing Technologies
AMPLIFYING AND ANALYZING DNA.
Stuff to Do.
The Human Genome Project
Graph Algorithms in Bioinformatics
DNA Technology.
DNA Sequencing The DNA from the genome is chopped into bits- whole chromosomes are too large to deal with, so the DNA is broken into manageably-sized overlapping.
Graph Algorithms in Bioinformatics
DNA and the Genome Key Area 8a Genomic Sequencing.
Massively Parallel Sequencing: The Next Big Thing in Genetic Medicine
Graph Algorithms in Bioinformatics
A Sequenciação em Análises Clínicas
CSE 5290: Algorithms for Bioinformatics Fall 2009
Introduction to Sequencing
Presentation transcript:

Computational Genomics and Proteomics Sequencing genomes

Polymerase chain reaction  PCR is a molecular biological technique for creating large amount of DNA  We need: –DNA template –Two primers –DNA-Polymerase –Nucleotides –Buffer - a suitable chemical environment

Polymerase chain reaction

DNA Sequencing  Chain Termination Method –Sanger, 1977 –single stranded DNA, b –Method: Electrophoresis can separate DNA molecules differing 1bp in length Dideoxynucleotide (ddNTP) are used - which stop replication

ddNucleotides  ddA, ddT, ddC, ddG  Each type marked with fluorescent dye  When incorporated into DNA chain –stops replication

Chain Termination Method, An Outline  Start four separate replications reactions –first obtain single stranded DNA –add a (universal) primer  Start each replication in a soup of A,T,C,G

Chain Termination Method, An Outline  add tiny amounts of –ddA to the first reaction, –ddT to second, ddC to 3rd, ddG to 4th

Chain Termination Method A read

Chain Termination Method, Reading the Sequence  Recent improvements: –one reaction and Four types of ddNTP have four different fluorescent labels –automated reading See: -> 70s -> DNA sequencing

Chain Termination Method, Results time Signal fragment size

Paired-end reads paired-end reads (500b)‏ DNA fragment (a few kb)‏

Sequencing methods  Directed –Sequencing fragments that are known to be adjacent  Top-down (hierarchical)‏ –In top-down mapping a single chromosome is cut into large pieces. These are then ordered and cut up again. The smaller pieces are then mapped further. The resulting map depicts the order and distance between the original cleaves. This approach yields maps with much continuity and few gaps but resolution is fairly low. Also the strategy cannot produce long stretches of mapped sites, being limited to around one million base pairs.  Bottom-up (shotgun)‏ –In shotgun sequencing, DNA is broken up randomly into numerous small segments, which are sequenced using the chain termination method to obtain reads. Multiple overlapping reads for the target DNA are obtained by performing several rounds of this fragmentation and sequencing. Computer programs then use the overlapping ends of different reads to assemble them into a contiguous sequence.

Directed sequencing 1  Primer walking using PCR Sanger sequencing Attach a primer Run PCR The end of the sequenced strand is used as a primer for the next part of the long DNA sequence. Primer walking is the method of choice for sequencing DNA fragments between 1.3 and 7 kilobases.

Directed sequencing 2  Nested deletion –cut DNA with exonuclease –“eats up” DNA from an end one bp at a time either 3' or 5'

Restriction enzymes  proteins  cut DNA  at a specific pattern

Shotgun vs. Hierarchical Method  Shotgun bottom-up  Hierarchical top-down

Hierarchical sequencing ~100mln bp ~1mln bp each YACs subset ~40kbp each YAC lib Yeast Artificial Chromosome

Hierarchical sequencing ~40kbp each BACs sequencing (easy, short sequence)‏

Probe libraries Filling in gaps Contig Gap Contig Gap

Shotgun DNA Sequencing Shear DNA into millions of small fragments Read 500 – 700 nucleotides at a time from the small fragments (Sanger method)‏ An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info

Shotgun Method – Haemophilus Influenzae Sequencing 1.5-2kb Extract DNA Sonicate Electrophoresis DNA library Sequence Construct seq paired-end reads

Massively parallel picolitre- scale sequencing: 454  fragment single strand DNA (ssDNA)‏  fragments bound to beads (1 f/bead)‏  replication in oil droplets –1 bead/droplet –10mln copies/bead  beads are deposited in 1.6mln microscopic wells Margulies et al., Nature Vol 437, 15 September 2005, doi: /nature03959

Massively parallel picolitre- scale sequencing: 454  ssDNA (ready to make a complement) in each well  sequencing-by-synthesis –wash the plate with special nucleotides –emits light when DNA grows –record on the camera Margulies et al., Nature Vol 437, 15 September 2005, doi: /nature03959

454 - results  advantages –100x faster (25mln nucleotides/h)‏ –1 operator  disadvantages –short reads –accuracy Margulies et al., Nature Vol 437, 15 September 2005, doi: /nature03959

A contig  Contig – a continuous set of overlapping sequences Gap!

Read Coverage Length of genomic segment: L Number of reads: n Coverage C = n l / L Length of each read: l How much coverage is enough? Lander-Waterman model: Assuming uniform distribution of reads, C=10 results in 1 gapped region per 1,000,000 nucleotides C

Shotgun Method - Pros and Cons  Pros –Human labour reduced to minimum  Cons –Computationally demanding – O(n 2 ) comparisons –High error rate in contig construction Repeats as the main problem

Shotgun vs. Hierarchical Method  Celera vs. Human Genome Project  Hierarchical (top-down) assembly: –The genome is carefully mapped –“Shotgun” into large chunks of 150kb Exact location of each chunk is known –Each piece is again “shotgun” into 2kb and sequenced

Celera (Craig Venter) vs. HGC (Francis Collins) “The big bang” June 2000

Assembling the genome  Given a set of (short) fragments from shotgun sequencing... –find overlap between all pairs  find the order of reads in DNA –determine a consensus sequence

Assembling the genome: Overlap-Layout-Consensus Assemblers:ARACHNE, PHRAP, CAP, TIGR, CELERA Overlap: find potentially overlapping reads Layout: merge reads into contigs and contigs into supercontigs Consensus: derive the DNA sequence and correct read errors..ACGATTACAATAGGTT..

Fragment Assembly Computational Challenge: assemble individual short fragments (reads) into a single genomic sequence (“contig”) Until late 1990s the shotgun fragment assembly of human genome was viewed as intractable problem

Challenges in Fragment Assembly  Repeats: A major problem for fragment assembly  > 50% of human genome are repeats: - over 1 million Alu repeats (about 300 bp)‏ - about 200,000 LINE repeats (1000 bp and longer)‏ Repeat Light green and blue fragments are interchangeable when assembling repetitive DNA

Repeat Types Low-Complexity DNA (e.g. ATATATATACATA…)‏ Microsatellite repeats (a 1 …a k ) N where k ~ 3-6 (e.g. CAGCAGCAGCAGCACCAG)‏ Transposons/retrotransposons –SINE Short Interspersed Nuclear Elements (e.g., Alu: ~300 bp long, 10 6 copies)‏ –LINE Long Interspersed Nuclear Elements ~ ,000 bp long, 200,000 copies –LTR retroposonsLong Terminal Repeats (~700 bp) at each end Gene Families genes duplicate & then diverge Segmental duplications ~very long, very similar copies

Paired-end reads help to resolve repeat order Repeat

Shortest Superstring Problem  Problem: Given a set of strings, find a shortest string that contains all of them  Input: Strings s 1, s 2,…., s n  Output: A string s that contains all strings s 1, s 2,…., s n as substrings, such that the length of s is minimized  Complexity: NP – hard  Note: this formulation does not take into account sequencing errors

Shortest Superstring Problem: Example

Reducing SSP to TSP  Traveling Salesman Problem (TSP)  Define overlap ( s i, s j ) as the length of the longest prefix of s j that matches a suffix of s i. aaaggcatcaaatctaaaggcatcaaa aaaggcatcaaatctaaa What is overlap ( s i, s j ) for these strings? TSP: If a salesman starts at point A, and if the distances between every pair of points are known, what is the shortest route which visits all points and returns to point A?

Reducing SSP to TSP  Define overlap ( s i, s j ) as the length of the longest prefix of s j that matches a suffix of s i. aaaggcatcaaatctaaaggcatcaaa aaaggcatcaaatctaaa overlap=12

Reducing SSP to TSP  Define overlap ( s i, s j ) as the length of the longest prefix of s j that matches a suffix of s i. aaaggcatcaaatctaaaggcatcaaa aaaggcatcaaatctaaa  Construct a graph with n vertices representing the n strings s 1, s 2,…., s n.  Insert edges of length overlap ( s i, s j ) between vertices s i and s j.  Find the shortest path which visits every vertex exactly once. This is the Traveling Salesman Problem (TSP), which is also NP – complete.

Reducing SSP to TSP (cont’d)‏

SSP to TSP: An Example S = { ATC, CCA, CAG, TCC, AGT } SSP AGT CCA ATC ATCCAGT TCC CAG ATCCAGT TSP ATC CCA TCC AGT CAG start

Determining 2-fragment overlap using DP aaaggcatcaaatctaaaggcatcaaa aaaggcatcaaatctaaa

Determining 2-fragment overlap using DP M[i, j] = M[i, j- 1 ] – g M[i- 1, j] – g M[i- 1, j- 1 ] + score(X[i],Y[j])‏ max No gaps!

Sequencing by Hybridization (SBH): History  1988: SBH suggested as an an alternative sequencing method. Nobody believed it will ever work  1991: Light directed polymer synthesis developed by Steve Fodor and colleagues.  1994: Affymetrix develops first 64- kb DNA microarray First microarray prototype (1989)‏ First commercial DNA microarray prototype w/16,000 features (1994)‏ 500,000 features per chip (2002)‏

Sequencing by Hybridization (SBH): A class of methods for determining the order in which nucleotides occur on a strand of DNA Typically used for looking for small changes relative to a known DNA sequence. The binding of one strand of DNA to its complementary strand in the DNA double-helix (aka hybridization) is sensitive to even single-base mismatches when the hybrid region is short This is exploited in a variety of ways, most notable via DNA chips or microarrays with thousands to billions of synthetic oligonucleotides found in a genome of interest plus many known variations or even all possible single- base variations.

DNA microarray  a chip which contains short probes –ssDNA sequences, millions of them  make DNA for sequencing fluorescent  wash it over the chip  DNA hybridizes to its complementary strand  cells light up

Universal DNA microarrray  A DNA microarray which contains all seqs of length l (l-mers)‏  therefore, we can determine l-mer composition

Hybridization on DNA Array

l-mer composition  Spectrum ( s, l )‏ –a set of all possible l-mers  Spectrum ( TATGGTGC, 3 ): {ATG, GGT, GTG, TAT, TGC, TGG}