Computational Genomics and Proteomics Sequencing genomes.


1 Computational Genomics and Proteomics Sequencing genomes

2 Polymerase chain reaction
• PCR is a molecular biology technique for creating large amounts of DNA
• We need:
  – DNA template
  – Two primers
  – DNA polymerase
  – Nucleotides
  – Buffer (a suitable chemical environment)

3 Polymerase chain reaction http://en.wikipedia.org/wiki/Polymerase_chain_reaction

4 DNA Sequencing
• Chain Termination Method
  – Sanger, 1977
  – Single-stranded DNA, 500-700 bases
  – Method: electrophoresis can separate DNA molecules that differ by 1 bp in length; dideoxynucleotides (ddNTPs), which stop replication, are used

5 ddNucleotides
• ddA, ddT, ddC, ddG
• Each type is marked with a fluorescent dye
• When incorporated into a DNA chain, a ddNTP stops replication

6 Chain Termination Method, An Outline
• Start four separate replication reactions
  – first obtain single-stranded DNA
  – add a (universal) primer
• Start each replication in a soup of A, T, C, G

7 Chain Termination Method, An Outline
• Add tiny amounts of
  – ddA to the first reaction,
  – ddT to the second, ddC to the third, ddG to the fourth
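The four-reaction outline can be illustrated with a small idealised simulation. This is only a sketch under stated assumptions: the template is written 3'→5' so the new strand is its base-by-base complement, every possible termination position yields a fragment, and the function names `chain_termination_fragments` and `read_gel` are hypothetical, not taken from the slides.

```python
def chain_termination_fragments(template_3to5: str) -> dict[str, list[int]]:
    """Return, for each ddNTP reaction, the fragment lengths it can produce.

    Idealised model: in the ddX reaction, synthesis can stop at every
    position where the growing strand receives base X.
    """
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    new_strand = "".join(complement[base] for base in template_3to5)
    lanes: dict[str, list[int]] = {"A": [], "T": [], "C": [], "G": []}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes: dict[str, list[int]]) -> str:
    """Read the synthesised strand by sorting all fragments by length.

    Electrophoresis separates fragments differing by 1 bp, so the lane in
    which the fragment of length k appears gives the k-th base.
    """
    base_at_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(base_at_length[k] for k in sorted(base_at_length))


if __name__ == "__main__":
    lanes = chain_termination_fragments("TACGGCTA")
    print(lanes)
    print(read_gel(lanes))  # ATGCCGAT
```

Real reactions produce a random mixture of fragment lengths rather than exactly one fragment per position; the sketch only shows why sorting by length recovers the sequence.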

8 Chain Termination Method A read

9 Chain Termination Method, Reading the Sequence
• Recent improvements:
  – a single reaction in which the four types of ddNTP carry four different fluorescent labels
  – automated reading
See: www.dnai.org/timeline/index.html -> 70s -> DNA sequencing

10 Chain Termination Method, Results
[Figure: signal plotted against time (fragment size). Source: www.newscientist.com]

11 Paired-end reads
[Figure: paired-end reads (500 b) on a DNA fragment (a few kb)]

12 Sequencing methods
• Directed
  – Sequencing fragments that are known to be adjacent
• Top-down (hierarchical)
  – In top-down mapping, a single chromosome is cut into large pieces. These are then ordered and cut up again, and the smaller pieces are mapped further. The resulting map depicts the order of, and distance between, the original cleavage sites. This approach yields maps with much continuity and few gaps, but the resolution is fairly low. The strategy also cannot produce long stretches of mapped sites, being limited to around one million base pairs.
• Bottom-up (shotgun)
  – In shotgun sequencing, DNA is broken up randomly into numerous small segments, which are sequenced using the chain termination method to obtain reads. Multiple overlapping reads of the target DNA are obtained by performing several rounds of this fragmentation and sequencing. Computer programs then use the overlapping ends of different reads to assemble them into a contiguous sequence.

13 Directed sequencing 1
• Primer walking using PCR and Sanger sequencing
  – Attach a primer
  – Run PCR
  – The end of the sequenced strand is used as a primer for the next part of the long DNA sequence
• Primer walking is the method of choice for sequencing DNA fragments between 1.3 and 7 kilobases.

14 Directed sequencing 2
• Nested deletion
  – cut DNA with an exonuclease
  – the exonuclease “eats up” DNA from one end, either 3' or 5', one bp at a time

15 Restriction enzymes
• proteins
• cut DNA
• at a specific pattern
http://en.wikipedia.org/wiki/Restriction_enzymes

16 Shotgun vs. Hierarchical Method
• Shotgun: bottom-up
• Hierarchical: top-down

17 Hierarchical sequencing
[Figure labels: ~100 million bp; YAC (Yeast Artificial Chromosome) library, ~1 million bp each; subset of YACs; ~40 kbp each]

18 Hierarchical sequencing
[Figure labels: BACs, ~40 kbp each; sequencing (easy, short sequence)]

19 Probe libraries: filling in gaps
[Figure: contigs separated by gaps]

20 Shotgun DNA Sequencing
• Shear DNA into millions of small fragments
• Read 500-700 nucleotides at a time from the small fragments (Sanger method)
Source: An Introduction to Bioinformatics Algorithms, www.bioalgorithms.info
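The shearing-and-reading step can be mimicked in a few lines. This is a toy sketch, assuming reads are drawn uniformly at random from a single strand with no sequencing errors; the function name `shotgun_reads`, its parameters, and the fixed seed are illustrative choices, not part of the slides.

```python
import random


def shotgun_reads(genome: str, n_reads: int,
                  min_len: int = 500, max_len: int = 700,
                  seed: int = 0) -> list[str]:
    """Sample n_reads random substrings of length min_len..max_len."""
    if len(genome) < max_len:
        raise ValueError("genome must be at least max_len long")
    rng = random.Random(seed)  # seeded so that runs are reproducible
    reads = []
    for _ in range(n_reads):
        length = rng.randint(min_len, max_len)         # inclusive bounds
        start = rng.randint(0, len(genome) - length)   # uniform start position
        reads.append(genome[start:start + length])
    return reads


if __name__ == "__main__":
    toy_genome = "".join(random.Random(1).choice("ACGT") for _ in range(10_000))
    reads = shotgun_reads(toy_genome, n_reads=100)
    total = sum(len(r) for r in reads)
    print(f"{len(reads)} reads, average coverage {total / len(toy_genome):.1f}x")
```

Real shotgun data also contains reads from the reverse strand and base-calling errors, both of which this sketch ignores.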

21 Shotgun Method – Haemophilus influenzae Sequencing
[Figure labels: Extract DNA; Sonicate; Electrophoresis (1.5-2 kb); DNA library; Sequence (paired-end reads); Construct seq]

22 Massively parallel picolitre-scale sequencing: 454
• fragment single-stranded DNA (ssDNA)
• fragments are bound to beads (1 fragment/bead)
• replication in oil droplets
  – 1 bead/droplet
  – 10 million copies/bead
• beads are deposited in 1.6 million microscopic wells
Margulies et al., Nature Vol 437, 15 September 2005, doi:10.1038/nature03959

23 Massively parallel picolitre-scale sequencing: 454
• ssDNA (ready to make a complement) in each well
• sequencing-by-synthesis
  – wash the plate with special nucleotides
  – light is emitted when the DNA strand grows
  – the light is recorded by a camera
Margulies et al., Nature Vol 437, 15 September 2005, doi:10.1038/nature03959

24 454 - results
• advantages
  – 100x faster (25 million nucleotides/h)
  – 1 operator
• disadvantages
  – short reads
  – lower accuracy
Margulies et al., Nature Vol 437, 15 September 2005, doi:10.1038/nature03959

25 A contig
• Contig: a continuous set of overlapping sequences
[Figure label: Gap!]

26 Read Coverage
• Length of genomic segment: L
• Number of reads: n
• Length of each read: l
• Coverage: C = n l / L
• How much coverage is enough? Lander-Waterman model: assuming a uniform distribution of reads, C = 10 results in 1 gapped region per 1,000,000 nucleotides
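The coverage formula and the C = 10 rule of thumb can be checked numerically. The sketch below assumes the simplest form of the Lander-Waterman model, in which the expected number of contigs (and hence, roughly, of gaps) is n·e^(−C) when the minimum overlap needed to join two reads is ignored. The read length of 500 is an assumption borrowed from the Sanger read lengths quoted earlier, and the function names are hypothetical.

```python
import math


def coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Coverage C = n * l / L."""
    return n_reads * read_len / genome_len


def expected_gaps(n_reads: int, read_len: int, genome_len: int) -> float:
    """Approximate expected number of gaps under Lander-Waterman.

    Simplification: the minimum detectable overlap is taken to be zero,
    so the expected number of contigs is n * exp(-C).
    """
    c = coverage(n_reads, read_len, genome_len)
    return n_reads * math.exp(-c)


if __name__ == "__main__":
    L = 1_000_000   # genomic segment length
    l = 500         # assumed read length
    n = 20_000      # number of reads, chosen so that C = 10
    print(coverage(n, l, L))        # 10.0
    print(expected_gaps(n, l, L))   # roughly 0.9, i.e. about one gap per Mb
```

Under these assumptions the result agrees with the slide's figure of about one gapped region per 1,000,000 nucleotides at C = 10.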

27 Shotgun Method - Pros and Cons
• Pros
  – Human labour reduced to a minimum
• Cons
  – Computationally demanding: O(n^2) comparisons
  – High error rate in contig construction, with repeats as the main problem

28 Shotgun vs. Hierarchical Method
• Celera vs. Human Genome Project
• Hierarchical (top-down) assembly:
  – The genome is carefully mapped
  – It is “shotgunned” into large chunks of 150 kb; the exact location of each chunk is known
  – Each chunk is again “shotgunned” into 2 kb pieces, which are sequenced

29 Celera (Craig Venter) vs. HGP (Francis Collins): “The big bang”, 26 June 2000

30 Assembling the genome
• Given a set of (short) fragments from shotgun sequencing...
  – find the overlap between all pairs
  – find the order of the reads in the DNA
  – determine a consensus sequence

31 Assembling the genome: Overlap-Layout-Consensus
Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA
• Overlap: find potentially overlapping reads
• Layout: merge reads into contigs and contigs into supercontigs
• Consensus: derive the DNA sequence and correct read errors (e.g. ..ACGATTACAATAGGTT..)
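The consensus step can be sketched as a per-column majority vote once the layout is known. This is a simplified illustration, not how any of the assemblers listed above actually works: it assumes each read comes with a known offset, that reads are aligned without gaps, and that coverage is contiguous. The function name `consensus` and the example reads are made up for the sketch.

```python
from collections import Counter


def consensus(layout: list[tuple[int, str]]) -> str:
    """Majority-vote consensus from (offset, read) pairs.

    Each read is placed at its offset on a common coordinate axis; the
    most frequent base in every covered column is emitted.
    """
    columns: dict[int, Counter] = {}
    for offset, read in layout:
        for i, base in enumerate(read):
            columns.setdefault(offset + i, Counter())[base] += 1
    return "".join(columns[pos].most_common(1)[0][0] for pos in sorted(columns))


if __name__ == "__main__":
    layout = [
        (0, "ACGATTACAA"),
        (3, "ATTACTATAGG"),   # contains one read error (T instead of A)
        (6, "ACAATAGGTT"),
    ]
    print(consensus(layout))  # ACGATTACAATAGGTT
```

The erroneous base is outvoted two to one by the other reads covering that column, which is how the consensus step corrects read errors.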

32 Fragment Assembly
• Computational challenge: assemble individual short fragments (reads) into a single genomic sequence (“contig”)
• Until the late 1990s, shotgun fragment assembly of the human genome was viewed as an intractable problem

33 Challenges in Fragment Assembly
• Repeats: a major problem for fragment assembly
• More than 50% of the human genome consists of repeats:
  – over 1 million Alu repeats (about 300 bp)
  – about 200,000 LINE repeats (1000 bp and longer)
[Figure, labelled “Repeat”: the light green and blue fragments are interchangeable when assembling repetitive DNA]

34 Repeat Types
• Low-complexity DNA (e.g. ATATATATACATA…)
• Microsatellite repeats (a_1…a_k)^N, where k ~ 3-6 (e.g. CAGCAGCAGCAGCACCAG)
• Transposons/retrotransposons
  – SINE: Short Interspersed Nuclear Elements (e.g. Alu: ~300 bp long, 10^6 copies)
  – LINE: Long Interspersed Nuclear Elements, ~500-5,000 bp long, 200,000 copies
  – LTR retroposons: Long Terminal Repeats (~700 bp) at each end
• Gene families: genes duplicate and then diverge
• Segmental duplications: very long, very similar copies

35 Paired-end reads help to resolve repeat order
[Figure label: Repeat]

36 Shortest Superstring Problem
• Problem: given a set of strings, find a shortest string that contains all of them
• Input: strings s_1, s_2, …, s_n
• Output: a string s that contains all strings s_1, s_2, …, s_n as substrings, such that the length of s is minimized
• Complexity: NP-hard
• Note: this formulation does not take sequencing errors into account
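Because the problem is NP-hard, practical approaches settle for heuristics. The sketch below shows one common heuristic, greedy merging of the pair with the largest overlap; it is a swapped-in illustration rather than a method described on the slide, and it is not guaranteed to return a shortest superstring. The function names are hypothetical.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings: list[str]) -> str:
    """Greedy heuristic: repeatedly merge the pair with maximum overlap."""
    # Drop duplicates and strings contained in another string.
    reads = sorted(
        s for s in set(strings)
        if not any(s != t and s in t for t in strings)
    )
    while len(reads) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads[0] if reads else ""


if __name__ == "__main__":
    print(greedy_superstring(["ATC", "CCA", "CAG", "TCC", "AGT"]))  # ATCCAGT
```

Each round compares all pairs, which mirrors the quadratic number of comparisons that makes shotgun assembly computationally demanding.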

37 Shortest Superstring Problem: Example

38 Reducing SSP to TSP
• Traveling Salesman Problem (TSP): if a salesman starts at point A, and the distances between every pair of points are known, what is the shortest route which visits all points and returns to point A?
• Define overlap(s_i, s_j) as the length of the longest prefix of s_j that matches a suffix of s_i.
  s_i = aaaggcatcaaatctaaaggcatcaaa
  s_j = aaaggcatcaaatctaaa
  What is overlap(s_i, s_j) for these strings?

39 Reducing SSP to TSP
• Define overlap(s_i, s_j) as the length of the longest prefix of s_j that matches a suffix of s_i.
  s_i = aaaggcatcaaatctaaaggcatcaaa
  s_j = aaaggcatcaaatctaaa
  overlap = 12
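The definition translates directly into code. A minimal sketch, assuming exact matching only (no sequencing errors) and a simple quadratic-time scan; the function name `overlap` follows the slide's notation.

```python
def overlap(s_i: str, s_j: str) -> int:
    """Length of the longest prefix of s_j that matches a suffix of s_i."""
    # Try the longest candidate first and stop at the first match.
    for k in range(min(len(s_i), len(s_j)), 0, -1):
        if s_i[-k:] == s_j[:k]:
            return k
    return 0


if __name__ == "__main__":
    s_i = "aaaggcatcaaatctaaaggcatcaaa"
    s_j = "aaaggcatcaaatctaaa"
    print(overlap(s_i, s_j))  # 12
    print(overlap(s_j, s_i))  # 18: the overlap is not symmetric
```

The second call shows that overlap is directional, so the graph built from it has directed edges.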

40 Reducing SSP to TSP
• Define overlap(s_i, s_j) as the length of the longest prefix of s_j that matches a suffix of s_i.
  s_i = aaaggcatcaaatctaaaggcatcaaa
  s_j = aaaggcatcaaatctaaa
• Construct a graph with n vertices representing the n strings s_1, s_2, …, s_n.
• Insert a directed edge from s_i to s_j labelled with overlap(s_i, s_j).
• Find the path which visits every vertex exactly once and has the largest total overlap (equivalently, the shortest path when each edge has length −overlap(s_i, s_j)). This is the Traveling Salesman Problem (TSP), which is also NP-hard (its decision version is NP-complete).

41 Reducing SSP to TSP (cont’d)

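42 SSP to TSP: An Example
S = { ATC, CCA, CAG, TCC, AGT }
SSP: the shortest superstring is ATCCAGT
TSP: [Figure: overlap graph on the vertices ATC, CCA, TCC, AGT, CAG, with edges labelled by overlaps of 2, 1 or 0; the path starts at ATC and spells ATCCAGT]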
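For a set this small, the example can be verified by brute force: try every ordering of the strings, merge neighbours by their overlap, and keep the shortest result. This is a sketch for checking the example only; it runs in factorial time and assumes no string is a substring of another. The function name `shortest_superstring_exhaustive` is hypothetical.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_superstring_exhaustive(strings: list[str]) -> str:
    """Try every vertex order (Hamiltonian path) in the overlap graph."""
    if not strings:
        return ""
    best = None
    for order in permutations(strings):
        merged = order[0]
        for prev, nxt in zip(order, order[1:]):
            merged += nxt[overlap(prev, nxt):]   # append the non-overlapping tail
        if best is None or len(merged) < len(best):
            best = merged
    return best


if __name__ == "__main__":
    S = ["ATC", "CCA", "CAG", "TCC", "AGT"]
    print(shortest_superstring_exhaustive(S))  # ATCCAGT
```

The winning order is ATC, TCC, CCA, CAG, AGT, with an overlap of 2 on every edge, so the superstring has length 15 − 8 = 7.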

43 Determining 2-fragment overlap using DP aaaggcatcaaatctaaaggcatcaaa aaaggcatcaaatctaaa

44 Determining 2-fragment overlap using DP
$$M[i,j] = \max \begin{cases} M[i,j-1] - g \\ M[i-1,j] - g \\ M[i-1,j-1] + \mathrm{score}(X[i], Y[j]) \end{cases}$$
[Figure annotation: No gaps!]
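The recurrence can be turned into a small suffix-prefix ("overlap") alignment. The boundary conditions are not given on the slide, so the ones below are assumptions: skipping a prefix of X is free (first column is 0), a skipped prefix of Y is charged as gaps, and the answer is the best score in the last row, so that the alignment ends at the end of X. The scoring values (match +1, mismatch −1, g = 2) are also illustrative choices.

```python
def overlap_alignment(x: str, y: str, match: int = 1,
                      mismatch: int = -1, gap: int = 2) -> tuple[int, int]:
    """Best alignment of a suffix of x with a prefix of y.

    Returns (score, j), where y[:j] is the aligned prefix of y.
    """
    n, m = len(x), len(y)
    M = [[0] * (m + 1) for _ in range(n + 1)]   # M[i][0] = 0: free start in x
    for j in range(1, m + 1):
        M[0][j] = -gap * j                      # the prefix of y must be aligned
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i][j - 1] - gap,     # gap in x
                          M[i - 1][j] - gap,     # gap in y
                          M[i - 1][j - 1] + s)   # match or mismatch
    # The alignment must reach the end of x; y may extend beyond it.
    best_j = max(range(m + 1), key=lambda j: M[n][j])
    return M[n][best_j], best_j


if __name__ == "__main__":
    X = "aaaggcatcaaatctaaaggcatcaaa"
    Y = "aaaggcatcaaatctaaa"
    print(overlap_alignment(X, Y))  # (12, 12)
```

Unlike exact suffix-prefix matching, this formulation tolerates mismatches and indels, which matters once sequencing errors are taken into account.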

45 Sequencing by Hybridization (SBH): History
• 1988: SBH suggested as an alternative sequencing method. Nobody believed it would ever work
• 1991: Light-directed polymer synthesis developed by Steve Fodor and colleagues
• 1994: Affymetrix develops the first 64-kb DNA microarray
[Figure captions: First microarray prototype (1989); First commercial DNA microarray prototype w/16,000 features (1994); 500,000 features per chip (2002)]

46 Sequencing by Hybridization (SBH)
• A class of methods for determining the order in which nucleotides occur on a strand of DNA
• Typically used for looking for small changes relative to a known DNA sequence
• The binding of one strand of DNA to its complementary strand in the DNA double helix (aka hybridization) is sensitive to even single-base mismatches when the hybrid region is short
• This is exploited in a variety of ways, most notably via DNA chips or microarrays with thousands to billions of synthetic oligonucleotides found in a genome of interest, plus many known variations or even all possible single-base variations

47 DNA microarray
• a chip which contains short probes
  – ssDNA sequences, millions of them
• make the DNA to be sequenced fluorescent
• wash it over the chip
• DNA hybridizes to its complementary strand
• cells on the chip light up

48 Universal DNA microarray
• A DNA microarray which contains all sequences of length l (l-mers)
• Therefore, we can determine the l-mer composition of the target DNA

49 Hybridization on DNA Array

50 l-mer composition
• Spectrum(s, l)
  – the (multi)set of all l-mers that occur in the string s
• Spectrum(TATGGTGC, 3): {ATG, GGT, GTG, TAT, TGC, TGG}
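The spectrum is easy to compute for a known string, which is handy for checking the example. A minimal sketch: `spectrum` returns the l-mers as a sorted list (so repeated l-mers are kept), and `universal_array_size` is a hypothetical helper giving the number of probes a universal array over {A, C, G, T} would need.

```python
def spectrum(s: str, l: int) -> list[str]:
    """All l-mers occurring in s, sorted lexicographically."""
    return sorted(s[i:i + l] for i in range(len(s) - l + 1))


def universal_array_size(l: int) -> int:
    """Number of distinct l-mers over the 4-letter DNA alphabet."""
    return 4 ** l


if __name__ == "__main__":
    print(spectrum("TATGGTGC", 3))
    # ['ATG', 'GGT', 'GTG', 'TAT', 'TGC', 'TGG']
    print(universal_array_size(3))  # 64
```

In an SBH experiment the direction is reversed: the array reports the spectrum, and the task is to reconstruct s from it.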

