15-853: Algorithms in the Real World. Computational Biology V – Sequencing the “Genome”

Presentation transcript:

15-853Page 1 15-853: Algorithms in the Real World Computational Biology V – Sequencing the “Genome” Thanks to: Dannie Durand for some of the slides. Various figures borrowed from the web.

15-853Page 2 Tools of the Trade Cutting: Arber, Nathans, and Smith, Nobel Prize in Medicine (1978) for “the discovery of restriction enzymes and their application to problems of molecular genetics”. Copying: Mullis, Nobel Prize in Chemistry (1993) for “his invention of the polymerase chain reaction (PCR) method”. Reading (sequencing): Gilbert and Sanger, Nobel Prize in Chemistry (1980) for “contributions concerning the determination of base sequences in nucleic acids”.

15-853Page 3 Cutting Cutting: –Restriction enzymes: cut at particular sites, e.g. ACTTCTAGAT –Chemical, physical or radiation cuts: cut at random locations

15-853Page 4 Copying Copying: Cloning a strand of DNA –Cosmids: clone sequences up to 40K bps –BAC, PAC: up to about 200K bps –YAC (yeast artificial chromosomes): up to 1M bps. Copying between two specific sites –PCR (polymerase chain reaction): 500 bps

15-853Page 5 Cloning (copying fragments) Isolate DNA

15-853Page 6 fragmentation Isolate DNA

15-853Page 7 fragmentation Isolate DNA + insert fragments plasmid

15-853Page 8 Amplification

15-853Page 9 Amplification

15-853Page 10 Amplification

15-853Page 11 PCR (Polymerase chain reaction) Select two sequences that appear in the DNA sequence (e.g. ATACTTAATG and TCTAAGATAG). Design two synthetic “primers” matching these sequences. REPEAT: 1. Denature: heat the DNA to split it into two strands 2. Anneal: cool and let the primers attach 3. Replicate: let DNA attach in both directions. Note: cells copy DNA strands character by character.
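The selection of the copied region can be sketched in Python as plain string matching. This is only a toy model: it ignores the two strands and base complementarity, and the function names `pcr_amplicon` and `copies_after` are hypothetical, introduced here for illustration. The doubling per cycle is the idealized case.

```python
def pcr_amplicon(template, left, right):
    """Return the region from the first `left` primer site through the
    next `right` primer site, or None if either site is missing."""
    start = template.find(left)
    if start < 0:
        return None
    end = template.find(right, start + len(left))
    if end < 0:
        return None
    return template[start:end + len(right)]


def copies_after(cycles, initial=1):
    """Idealized copy count: each denature/anneal/replicate cycle doubles it."""
    return initial * 2 ** cycles


# Toy template: the two example primer sequences with some filler around them.
template = "GG" + "ATACTTAATG" + "CCCC" + "TCTAAGATAG" + "TT"
print(pcr_amplicon(template, "ATACTTAATG", "TCTAAGATAG"))
print(copies_after(10))
```

Only the stretch between (and including) the two primer sites is copied, which is why PCR is listed as copying "between two specific sites".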

15-853Page 12 PCR (Polymerase chain reaction)

15-853Page 13 Reading: sequencing a fragment Currently too expensive to actually read each bp. Finding the length is cheap. –The speed of a fragment in a gel when an electric field is applied depends on its length: shorter fragments move faster (DNA is negatively charged, so it migrates toward the positive electrode). Lengths are what are used in forensic DNA analysis and for DNA “fingerprints”. Gilbert and Sanger got the Nobel Prize for figuring out how to use lengths to “read” a DNA strand from one end. Currently only good for about 500 bp.

15-853Page 14 Forensic DNA Analysis For the two samples, and some “control” DNA: 1. Copy using PCR if sample is small 2. Use restriction enzymes to cut up DNA at particular sites (e.g. AATGATGGA) 3. Tag DNA with radioactive (or fluorescent) tracer. This is a strand that will attach to particular sites of the cut DNA. 4. Put each sample (enzyme and DNA sample) on its own track on a gel 5. Apply charge for fixed time 6. Expose film to see pattern of lengths

15-853Page 15 The “fingerprint” of a DNA sample cut by seven restriction enzymes.

15-853Page 16 Reading using lengths Can use special bases (dideoxynucleotides) that stop growth: ddC, ddA, ddT, ddG. Will generate all prefixes that end in A, T, C or G.
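To see why prefix lengths are enough to read a strand, here is a minimal Python sketch. It assumes an error-free experiment in which every prefix is produced and every length is measured exactly; the function names are hypothetical.

```python
def terminated_fragments(seq):
    """For each base, list the lengths of all prefixes of `seq` that end in
    that base (what the terminator for that base would produce)."""
    lengths = {"A": [], "C": [], "G": [], "T": []}
    for i, base in enumerate(seq.upper(), start=1):
        lengths[base].append(i)
    return lengths


def read_from_lengths(lengths):
    """Recover the sequence by sorting (length, base) pairs, shortest first."""
    pairs = sorted((n, base) for base, ns in lengths.items() for n in ns)
    return "".join(base for _, base in pairs)


frags = terminated_fragments("aggctc")
print(frags)
print(read_from_lengths(frags))
```

Each length occurs in exactly one of the four lists, so sorting the lengths spells out the sequence one base at a time.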

15-853Page 17

15-853Page 18

15-853Page 19 Improvements Use fluorescent dyes on the base pairs and a laser to excite the dye as it passes a certain point on the gel.

15-853Page 20 Improvements (1) 4 “test tubes”, single track.

15-853Page 21 Improvements (2) Single “test tube”, single track

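15-853Page 22 [Figure: porous gel between the + and − electrodes, with a laser and detector; the prefixes a, ag, agg, aggc, aggct, aggctc, … of the sequence aggctcctctcccacc pass the detector in order of length.]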

15-853Page 23 ABI 3700 sequencer

15-853Page 24 History of Sequencing: 1973 First recombinant DNA; 1978 Nobel prize for restriction enzymes; 1980 Nobel prize for DNA sequencing; 1988 Congress establishes Genbank; 1995 First genomic sequence; 1998 First multicellular organism; 2000 Fly genome; 2000 First plant genome; 2001 Human genome; 2003 Mouse genome. 22 million sequences, 28 billion base pairs.

15-853Page 25 Sequencing the Whole Genome Problem: we only know how to sequence about 500 bps at a time in the lab. 1. Linear sequencing 2. The shotgun method 3. Hierarchical shotgun method 4. Whole genome and double-barreled shotgun methods

15-853Page 26 Linear Sequencing Each step takes too long. Requires “wet” runs. E.g., if each step took 4 hours, sequencing the human genome would take $4 \times 3 \times 10^{9} / 500$ hours $\approx 3000$ years. Also no interesting computer science. [Figure: PCR]
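The estimate can be checked with a few lines of Python. The constants are the ones on the slide (4 hours per step, a genome of 3 × 10^9 bps, 500 bps per step); the 365-day year is an assumption made here.

```python
HOURS_PER_STEP = 4
GENOME_BPS = 3 * 10**9
BPS_PER_STEP = 500

steps = GENOME_BPS // BPS_PER_STEP          # sequential wet-lab steps
total_hours = HOURS_PER_STEP * steps
years = total_hours / (24 * 365)            # assumes 365-day years

print(steps, total_hours, round(years))
```

The result is a little over 2,700 years, which the slide rounds to 3000. The point is the order of magnitude: strictly sequential reading is hopeless.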

15-853Page 27 The Shotgun Method 1. Make multiple copies of the sequence. 2. Randomly break sequences into parts (e.g. using radiation or chemicals). 3. Throw away parts that are too small or too large. 4. Read about 500 bp from the end of each part. 5. Try to put the information together to reconstruct the original sequence.
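Steps 1 to 4 can be simulated on a text string, in the spirit of the example that follows. This is a rough sketch: the parameter names and default values (`copies`, `cuts`, `min_len`, `max_len`, `k`, `seed`) are assumptions chosen for a 30-character toy sequence, not values from the lecture.

```python
import random


def shotgun_reads(seq, copies=8, cuts=4, min_len=4, max_len=12, k=6, seed=0):
    """Simulate shotgun steps 1-4 on a string and return the reads."""
    rng = random.Random(seed)
    reads = []
    for _ in range(copies):                       # step 1: multiple copies
        # Step 2: break this copy at random positions.
        points = sorted(rng.sample(range(1, len(seq)), cuts))
        bounds = [0] + points + [len(seq)]
        for lo, hi in zip(bounds, bounds[1:]):
            part = seq[lo:hi]
            # Step 3: discard parts that are too small or too large.
            if not (min_len <= len(part) <= max_len):
                continue
            # Step 4: read up to k characters from one end, chosen at random.
            reads.append(part[:k] if rng.random() < 0.5 else part[-k:])
    return reads


reads = shotgun_reads("this_is_a_sequence_to_sequence")
print(reads)
```

Step 5, putting the reads back together, is the computational problem the rest of the lecture is about.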

15-853Page 28 Example this_is_a_sequence_to_sequence thi s_is_a_s equenc e_to_se quence this_is_ a_seq uence _to_sequence this_i s_ a _sequence _to_sequ ence

15-853Page 29 Example Remove strands that are too short (or too long) thi s_is_a_s equenc e_to_se quence this_is_ a_seq uence _to_sequence this_i s_ a _sequence _to_sequ ence

15-853Page 30 Example Sequence k characters from each (e.g. 6), from either end. s_is_a_s equenc e_to_se quence this_is_ a_seq uence _to_sequence this_i _sequence _to_sequ ence

15-853Page 31 Example Find overlaps s_is_a equenc to_se quence s_is_a a_seq uence _to_se this_i quence o_sequ ence
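Finding overlaps between exact (error-free) reads amounts to finding the longest suffix of one read that is a prefix of another. A minimal sketch, with a hypothetical helper name `overlap`:

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of `a` that equals a prefix of `b`,
    or 0 if there is no such overlap of at least `min_len` characters."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


print(overlap("this_i", "s_is_a"))   # shared "s_i"
print(overlap("o_sequ", "quence"))   # shared "qu"
print(overlap("s_is_a", "a_seq"))    # shared "a" only
```

Raising `min_len` is one way to refuse very short overlaps, which may be coincidental.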

15-853Page 32 Example s_is_a equenc to_se quence s_is_a a_seq uence _to_se this_i quence o_sequ ence

15-853Page 33 Example equenc to_se s_is_a a_seq uence _to_se this_i quence o_sequ ence

15-853Page 34 Example equenc to_se s_is_a a_seq uence _to_se this_i quence o_sequ ence

15-853Page 35 Example equence to_se s_is_a a_seq uence _to_se this_i o_sequ ence

15-853Page 36 Example equence to_se s_is_a a_seq uence _to_se this_i o_sequ ence

15-853Page 37 Example equence to_se s_is_a a_seq _to_se this_i o_sequ

15-853Page 38 Example equence to_se s_is_a a_seq _to_se this_i o_sequ

15-853Page 39 Example to_se s_is_a a_seq _to_se this_i o_sequence

15-853Page 40 Example to_se s_is_a a_seq _to_se this_i o_sequence

15-853Page 41 Example to_se s_is_a a_seq _to_sequence this_i

15-853Page 42 Example to_se s_is_a a_seq _to_sequence this_i

15-853Page 43 Example s_is_a a_seq _to_sequence this_i

15-853Page 44 Example s_is_a a_seq _to_sequence this_i

15-853Page 45 Example Having a single character overlap might not be enough to assume they overlap. a_seq _to_sequence this_is_a

15-853Page 46 Example a_seq_to_sequencethis_is_a

15-853Page 47 Example We are left with gaps, and unsure matches. Each covered region (e.g. this_is_a ) is called a contig Is there a systematic way to find or even define a “best solution”? a_seq_to_sequencethis_is_a

15-853Page 48 The SSP: an attempt The shortest superstring problem: given a set of strings $s_1, s_2, \ldots, s_n$, find the shortest string $S$ that contains all $s_i$. NP-hard, but can be reduced to TSP and solved approximately (nearly optimally in practice). Even if easy to solve, are we done? Our example gives: this_is_a_seq_to_sequence, but this is the best we can do given the data. This problem is caused by repeats. Other problems?
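One simple heuristic for the shortest superstring problem is greedy merging: repeatedly merge the pair of strings with the largest overlap. Note that this is a swapped-in illustration, not the TSP reduction mentioned on the slide, and it is not guaranteed to find the shortest superstring. The function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_superstring(strings):
    """Greedy heuristic: repeatedly merge the pair with the largest overlap."""
    unique = set(strings)
    # Strings contained in another string need no separate treatment.
    items = sorted(s for s in unique
                   if not any(s != t and s in t for t in unique))
    while len(items) > 1:
        best_n, best_i, best_j = -1, 0, 1
        for i, a in enumerate(items):
            for j, b in enumerate(items):
                if i != j:
                    n = overlap(a, b)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        merged = items[best_i] + items[best_j][best_n:]
        items = [s for idx, s in enumerate(items)
                 if idx not in (best_i, best_j)] + [merged]
    return items[0] if items else ""


print(greedy_superstring(["abc", "bcd", "cde"]))
```

When no pair overlaps, the heuristic simply concatenates, so the result always contains every input string, but on repeat-rich inputs it can merge the wrong pair.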

15-853Page 49 Problems In practice the data is noisy. –Reads have up to a 1% error rate –Samples could have contaminants –Fragments can sometimes join up The reads could be in either direction (front-to-back or back-to-front). Cannot distinguish.
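For DNA, a read taken in the opposite direction comes from the other strand, so it appears as the reverse complement of the forward sequence. A small sketch of how an assembler might consider both orientations of each read (the helper names are hypothetical):

```python
# Complement table for the four bases, in upper and lower case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(read):
    """Sequence of the opposite strand, read in its own direction."""
    return read.translate(COMPLEMENT)[::-1]


def orientations(read):
    """Both candidate orientations to try when comparing reads."""
    return (read, reverse_complement(read))


print(reverse_complement("ATGC"))
print(orientations("aggctc"))
```

Trying both orientations doubles the number of candidate overlaps that have to be scored.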

15-853Page 50 Assembly in Practice Score all suffix-prefix pairs –This can use a variant of the global alignment problem. It is the most expensive step ($n^2$ scores). Repeat: –Select best score and check for consistency –If score is too low, quit –If there is a good overlap, merge the two. Determine consensus: –We know the ordering among strands, but since matches are approximate, we need to select bps. Can use, e.g., multiple alignment over windows. [Figure: example reads gatcgat_ga and attgactactatg]
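The suffix-prefix scoring step can be sketched as a small dynamic program. It is the usual global alignment recurrence with two changes: skipping a prefix of the first read is free, and the score is taken anywhere along the last row so that the rest of the second read is also free. The scoring values (`match`, `mismatch`, `gap`) are assumptions for illustration.

```python
def overlap_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score for aligning some suffix of `a` with some prefix of `b`,
    allowing mismatches and gaps. Returns 0 if no overlap scores positively."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        score[0][j] = j * gap      # `b` must be aligned from its first character
    # score[i][0] stays 0: skipping a prefix of `a` costs nothing.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # `a` must be consumed to its end; `b` may stop at any prefix.
    return max(score[len(a)])


print(overlap_score("this_i", "s_is_a"))
print(overlap_score("ttttacgtac", "acgaacgg"))   # overlap with one mismatch
```

Filling one table costs time proportional to the product of the two read lengths, and there are on the order of $n^2$ pairs, which is why this step dominates.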

15-853Page 51 Some Programs for Assembly: Phrap, SEQAID, CAP, TIGR, Celera assembler, ARACHNE. After using one of these programs to generate a set of “contigs” with some gaps, one can use the linear method to fill in the gaps (assuming they are small). [Figure: example sequence atgattagccagtacgtttcagcatcccagtacgttatgcattagccagatc]

15-853Page 52 Sequencing the Whole Genome Problem: we only know how to sequence about 500 bps at a time in the lab. 1. Linear sequencing 2. The shotgun method 3. Hierarchical shotgun method 4. Whole genome and double-barreled shotgun methods

15-853Page 53 Shotgun on the Whole Genome? Problems: –Computationally very expensive –50% of the genome consists of repeats, which causes major problems. –Hard to partition work among multiple labs.

15-853Page 54 Hierarchical Shotgun 1. Generate clone libraries (100K – 1M bps per clone) 2. Order the clones by finding “tags” that overlap multiple clones. Use these for ordering. 3. Identify a set of clones that cover the whole length (minimum tiling path) 4. Use shotgun technique on each identified clone 5. Put the results together. [Figure: tag]

15-853Page Clone Libraries A “BAC” library will contain sequences of about 200K bps each. These can be cloned using “BAC Vectors” (Bacterial Artificial Chromosome) A “YAC” library will contain sequences of about 1M bps each. These can be cloned using “YAC Vectors” (Yeast Artificial Chromosome) These are typically stored at a common site and can be ordered. Many can be purchased from companies.

15-853Page Ordering Clones We have the clones, but we don’t know their order or how they overlap. Pick random small sequences that appear only once, in one location covered by the library. These are called STSs (Sequence Tagged Sites). Figure out which clones contain which STSs using PCR (use the tag site to start the copy; it will only copy if the sequence contains the site). [Figure: matrix with columns A E C F B D]

15-853Page Ordering Clones (cont.) [Figure: matrix columns reordered from A E C F B D to A B C D E F] Goal: Reorder the columns so that all the 1s in each row are contiguous. Can be done in O(n) time, where n is the number of entries in the array. But what about errors?

15-853Page Ordering Clones (cont.) [Figure: matrices with column orders A E C F B D and A B C D E F; entries marked X and E]

15-853Page Ordering Clones (cont.) Find the ordering that minimizes the number of zero-one and one-zero transitions (i.e. errors). This is NP-hard, but can be posed as a Traveling Salesman Problem (TSP). Any ideas? [Figure: matrix with columns A E C F B D; entries marked E and X]

15-853Page Ordering Clones (cont.) ABCDEF F A D E B C Create graph with one vertex per STS. Edge weights = hamming distance (number of bits that differ).

15-853Page Ordering Clones (cont.) ABCDEF F A D E B C s Add in source (s) node with weights equal to number of 1s in each row. Solve TSP. Answer gives min number of transitions
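The TSP formulation can be tried on a tiny instance by brute force. This sketch assumes each STS is given as a 0/1 vector with one entry per clone, and it models the source node s as an all-zero vector, so its distance to an STS is the number of 1s in that vector. Enumerating all permutations is only feasible for a handful of STSs; the names and the toy data are hypothetical.

```python
from itertools import permutations


def hamming(u, v):
    """Number of positions in which two 0/1 vectors differ."""
    return sum(x != y for x, y in zip(u, v))


def best_sts_order(columns):
    """columns: dict mapping STS name -> tuple of 0/1, one entry per clone.
    Returns (minimum tour cost, STS order) by trying every permutation."""
    n_clones = len(next(iter(columns.values())))
    source = (0,) * n_clones                 # the added node s
    best_cost, best_order = None, None
    for order in permutations(sorted(columns)):
        path = [source] + [columns[name] for name in order] + [source]
        cost = sum(hamming(u, v) for u, v in zip(path, path[1:]))
        if best_cost is None or cost < best_cost:
            best_cost, best_order = cost, order
    return best_cost, best_order


# Three clones covering STSs {A,B}, {B,C} and {C,D}, given in scrambled order.
columns = {"C": (0, 1, 1), "A": (1, 0, 0), "D": (0, 0, 1), "B": (1, 1, 0)}
print(best_sts_order(columns))
```

With error-free data every clone contributes exactly two transitions (one 0-to-1 and one 1-to-0), so the optimal tour costs twice the number of clones; erroneous entries raise the cost.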

15-853Page Ordering Clones (cont.) ABCDEF F A D E B C s 2 2 AECFBD E X Cost = 16

15-853Page Ordering Clones (cont.) ABCDEF F A D E B C s 2 2 AECFBD E X Cost = 14 The “wrong” answer has smaller cost

15-853Page Find “Minimum Tiling Path” Minimum Tiling Path: Find a set of clones that cover the whole length and for which the total number of bps is minimized. Can be posed as a shortest path problem. Any ideas?
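One way to pose this as a shortest path problem: make each clone a node whose cost is its length, and connect clone u to clone v when v starts no later than u ends and extends further right. A sketch using Dijkstra's algorithm follows; it assumes the clone positions are already known as half-open intervals `(start, end)`, and the clone names and coordinates are made up.

```python
import heapq


def min_tiling_path(clones, genome_len):
    """clones: dict name -> (start, end), half-open intervals on the genome.
    Returns (total_bps, [clone names]) for a cheapest cover of
    [0, genome_len), or None if the clones leave a gap."""
    heap = [(end - start, name, [name])
            for name, (start, end) in clones.items() if start <= 0]
    heapq.heapify(heap)
    done = set()
    while heap:
        cost, name, path = heapq.heappop(heap)
        if name in done:
            continue
        done.add(name)
        reach = clones[name][1]
        if reach >= genome_len:              # covered through the right end
            return cost, path
        for nxt, (start, end) in clones.items():
            # Follow only clones that leave no gap and extend the cover.
            if nxt not in done and start <= reach < end:
                heapq.heappush(heap, (cost + end - start, nxt, path + [nxt]))
    return None


clones = {"a": (0, 100), "b": (50, 200), "c": (90, 300),
          "d": (180, 300), "e": (0, 150)}
print(min_tiling_path(clones, 300))
```

Because clone lengths are nonnegative, the first clone popped from the heap that reaches the right end gives a minimum-total-length cover.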

15-853Page 65 Hierarchical Shotgun (revisited) 1. Generate clone libraries (100K – 1M bps per clone) 2. Order the clones by finding “tags” that overlap multiple clones. Use these for ordering. 3. Identify a set of clones that cover the whole length (minimum tiling path) 4. Use shotgun technique on each identified clone 5. Put the results together. [Figure: tag]

15-853Page 66 Celera’s Method Whole genome shotgun: use the shotgun method on the whole genome. Use a double-barreled approach: some sequences of known length (e.g. 2-5K bps) are sequenced at both ends. These can be used to bridge across repeats. In practice they used some mapping (hierarchical) data from the NIH effort, which was freely available. This was needed to deal with long repeats.