Genome Sequencing and Annotation (Part 1).

Objective of most genome projects
- Sequencing: DNA, mRNA
- Identify genes
- Characterize gene features

This chapter:
- How blocks of DNA sequence are obtained
- How these blocks are assembled into contigs, then genomes
- Bioinformatics: how to do sequence alignment (cDNA/EST, genome sequences)
- Annotation of ORFs
- Other features of genes: repetitive elements, variable distribution of GC content, evolutionarily conserved elements
- Gene annotation by cross-species comparison

Figure 2.1 (Part 2) The principle of dideoxy (Sanger) sequencing
Automated DNA sequencing: in 1977, F. Sanger and colleagues published the chain-termination method (Sanger sequencing). Sanger won his second Nobel Prize for inventing this process.

Automated DNA sequencing
- Most current sequencing projects use the chain-termination method, also known as Sanger sequencing after its inventor
- Based on the action of DNA polymerase, which adds nucleotides to a complementary strand
- Requires template DNA and a primer

Most large-scale sequencing projects use the chain-termination method, which is also known as Sanger sequencing, after Fred Sanger, who won his second Nobel Prize for inventing the process. Chain-termination sequencing is based on the action of DNA polymerase, which adds nucleotides that are complementary to another strand of DNA. For it to work, the polymerase needs a template DNA strand that it can copy, as well as a short stretch of DNA, known as a primer, to which it can add nucleotides.

Chain-termination sequencing
- Dideoxynucleotides (ddA, ddT, ddC or ddG) stop synthesis
- They are chain terminators: DNA polymerase cannot add another nucleotide after one
- They are included in amounts chosen so that termination occurs every time the base appears in the template
- Four reactions are used, one for each base: A, C, G and T

Template:
  3' ATCGGTGCATAGCTTGT 5'
Sequence reaction products:
  5' TAGCCACGTATCGAACA* 3'
  5' TAGCCACGTATCGAA* 3'
  5' TAGCCACGTATCGA* 3'
  5' TAGCCACGTA* 3'
  5' TAGCCA* 3'
  5' TA* 3'

In the Sanger chain-termination method, the nucleotide analog is called a dideoxynucleotide. When the correct amount is added to the solution, the chain is terminated at each occurrence of the complementary nucleotide in the template. Termination occurs because DNA polymerase cannot add another nucleotide to a dideoxynucleotide. For example, if the right amount of dideoxy A is added, the chain is terminated at each occurrence of T in the template. Determining the complete sequence requires a separate reaction for each of the four bases A, T, C, and G. On the top right of this slide is a template strand, and beneath it are the various chains that would be terminated with dideoxy A in the reaction mix.
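
The termination pattern on this slide can be reproduced with a few lines of Python. This is only an idealized sketch: it assumes termination happens at every occurrence of the chosen base, and the function names are made up for illustration.

```python
# Complement table for the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def synthesized_strand(template_3to5: str) -> str:
    """Return the new strand (5'->3') copied from a template written 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3to5)


def terminated_fragments(template_3to5: str, dd_base: str) -> list[str]:
    """List every chain that ends where the dideoxy base was incorporated.

    Idealized: one fragment per occurrence of `dd_base` in the new strand.
    """
    new_strand = synthesized_strand(template_3to5)
    return [
        new_strand[: i + 1]
        for i, base in enumerate(new_strand)
        if base == dd_base
    ]


template = "ATCGGTGCATAGCTTGT"  # written 3'->5', as on the slide
for fragment in terminated_fragments(template, "A"):
    print(f"5' {fragment}* 3'")
```

Running the same function with "C", "G" and "T" gives the products of the other three reactions.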

Sequence detection
To detect the products of the sequencing reaction:
- Include labeled nucleotides
- Formerly, radioactive labels (33P or 35S) were used
- Now fluorescent labels are used, with a different fluorescent tag for each nucleotide
- All four reactions can be run in a single gel lane or capillary tube

Example labeled products:
  TAGCCACGTATCGAA*
  TAGCCACGTATC*
  TAGCCACG*
  TAGCCACGT*

Once the sequencing reaction has been completed, one has to be able to detect the chains generated by the reaction. This can be done by making the chains radioactive, using radioactively labeled nucleotides in the reaction. As an alternative, today's automated sequencers all use fluorescent labels. With this approach, each of the four sequencing reactions (one for each of the bases A, T, C, and G) uses a different color of fluorescent tag. After the reactions are completed, the four fluorescently labeled reactions can be combined and run in a single gel lane or capillary tube.

Sequence separation
- Terminated chains need to be separated
- Requires one-base-pair resolution: chains of X and X+1 base pairs must be distinguishable
- Gel electrophoresis: very thin gel, high voltage applied
- Works with radioactive or fluorescent labels
- Negative pole at the top

Determining the placement of each of the bases in the sequence requires separating the terminated chains and resolving them so that one can see differences of one base. This means, for example, that a chain of length 35 bases must be distinguishable from chains of 34 or 36 bases. This step is normally done using electrophoresis through a gel or capillary. Originally, very thin gels were used, with high voltages applied. Either radioactively labeled or fluorescently labeled reaction products can be separated on gels. The image in this slide shows an autoradiogram of a typical sequencing gel with the lanes marked according to the dideoxy terminator used. For gel electrophoresis, the negative pole was at the top and the positive pole at the bottom.
[Figure: autoradiogram of a sequencing gel; two sets of lanes labeled C, A, G, T; positive pole (+) marked]

Sequence reading of radioactively labeled reactions
- The final step of sequencing is to read the sequence
- Radioactively labeled reactions: the gel is dried and placed on X-ray film
- Once the film is developed, the position of each band becomes visible
- The sequence is read from the bottom up (starting at the positive pole)
- Each of the four lanes gives the position of a different base: A, T, C or G

The final step of sequencing is to read the sequence. After radioactively labeled sequencing reactions are run through a gel, the gel is dried and then a piece of X-ray film is placed on it. After the film is developed, the position of each band becomes visible. The sequence is then read from the bottom of the gel to the top, with each of the four lanes giving the position of a different base: A, T, C, or G.

Sequence reading of fluorescently labeled reactions
- Fluorescently labeled reactions are scanned by a laser as the fragments pass a particular point
- The color is picked up by a detector
- The output is sent directly to a computer
- The readout is given both in terms of bases and in terms of the intensity of each color, so that ambiguous readings are easily identified

For fluorescently labeled reactions that are separated either by gel electrophoresis or through a capillary, the fragments are detected by a laser as they pass a particular point. Each of the four colors is picked up by a detector, and the output is sent directly to a computer.

Summary of chain termination sequencing
This image summarizes the steps in DNA sequencing:
- A primer is extended by DNA polymerase based on the sequence present in the template strand.
- The chain is terminated by different dideoxynucleotides (ddNTPs) that are complementary to the template strand.
- Four reactions, one for each base, are separated on a gel that can resolve one-base differences.
- The sequence is then read from the bottom of the gel to the top.
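
Reading a four-lane gel from the bottom up amounts to sorting the bands by fragment length, because the shortest chains migrate farthest. The sketch below illustrates that idea; the band positions are hypothetical, not taken from the slide image.

```python
def read_gel(lanes: dict[str, list[int]]) -> str:
    """Reconstruct the synthesized strand from the bands in each ddNTP lane.

    `lanes` maps a terminator base to the fragment lengths seen in its lane.
    Bottom-to-top on the gel corresponds to increasing fragment length.
    """
    bands = [
        (length, base)
        for base, lengths in lanes.items()
        for length in lengths
    ]
    bands.sort()  # shortest fragment first, i.e. the bottom of the gel
    return "".join(base for _, base in bands)


# Hypothetical bands for a 9-base extension product.
lanes = {"A": [2, 6], "C": [4, 5, 7], "G": [3, 8], "T": [1, 9]}
print(read_gel(lanes))  # TAGCCACGT
```

The result is the newly synthesized strand, which is complementary to the template.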

High-Throughput Sequencing
The new techniques and equipment include:
(1) Four-color fluorescent dyes have replaced the radioactive label.
(2) Rather than stopping the electrophoresis at a particular time, the products are scanned for laser-induced fluorescence just before they run off the end of the electrophoresis medium.
(3) Improvements in the chemistry of template purification and the sequencing reaction.
(4) Slab gel electrophoresis gave way to capillary electrophoresis with the introduction in 1999 of Applied Biosystems' ABI Prism 3700 automated sequencers, which in turn were updated with ABI Prism 3730 DNA analyzers in 2003 (these deliver extremely high quality, long reads and save time and money).
[Image: ABI Prism 3730 DNA analyzers]

Reading sequence traces
- Base-calling: the reading of raw sequence traces
- Now routinely performed using automated software that reads bases, aligns similar sequences, and supports editing
- Program: phred
- The program assigns a probability score to the accuracy of each base call as the trace is read
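
Phred-style quality scores are conventionally reported on a logarithmic scale, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The slide does not spell out this formula, so the sketch below is supplied as background; the function names are illustrative.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Convert a base-call error probability into a Phred-scaled quality."""
    return -10.0 * math.log10(error_probability)


def error_probability(quality: float) -> float:
    """Convert a Phred-scaled quality back into an error probability."""
    return 10.0 ** (-quality / 10.0)


for q in (10, 20, 30):
    print(f"Q{q}: about 1 error in {round(1 / error_probability(q))} calls")
```

On this scale a score of 20 means roughly a 1% chance that the call is wrong, and a score of 30 roughly 0.1%.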

Figure 2.3 Automated sequence chromatograms
This sequence shows the noisiness of the first 30 bp of a run. The middle two rows show a segment of two sequences that are polymorphic for both SNPs and an indel. A decline in sequence quality typically occurs after about 800 bp.

Ex. 2.1 Reading a sequence trace
The base labeled N is due to poor sequence quality. Where two peaks of the same height are observed at the same location, the site is heterozygous for a C/T SNP.

Figure 2.5 An aligned-reads window in consed
Contig assembly

Assembling DNA seq. fragments
NCBI dbEST database: view the EST statistics; download EST files by FTP.

IFOM assembler: multiple EST sequences are assembled into a contig. There is a maximum number of sequences you can enter. Use gi( , , , , , ), six fragments of length 850, 1062, 634, 596, 869 and 768 bp, resulting in a single contig consensus sequence that can be used for a similarity search against a database.
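
The core idea of joining overlapping ESTs into one contig can be sketched as follows. This is a toy illustration, not the IFOM assembler's algorithm: it assumes the reads are already in left-to-right order, on the same strand, and overlap exactly, and the example reads are invented.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(reads: list[str], min_overlap: int = 3) -> str:
    """Chain ordered reads into a single contig using exact overlaps."""
    contig = reads[0]
    for read in reads[1:]:
        size = overlap_length(contig, read, min_overlap)
        if size == 0:
            raise ValueError("no overlap found; reads cannot be joined")
        contig += read[size:]  # append only the part that extends the contig
    return contig


reads = ["ATGGCGTGCA", "GTGCAATTCG", "ATTCGGAC"]
print(merge_reads(reads))  # ATGGCGTGCAATTCGGAC
```

Real assemblers also tolerate sequencing errors, try reverse complements, and build a consensus from all reads covering each position.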

Assembling DNA seq. fragments – 6 GI fragments
>gi| |gb|BI |BI F1 NIH_MGC_114 Homo sapiens cDNA clone IMAGE: ', mRNA sequenceCGGGGTGCTGCGAGCGCGGGGCCAGACCAAGGCGGGCCCGGAGCGGAACTTCGGTCCCAGCTCGGTCCCCGGCTCAGTCCCGACGTGGAACTCAGCAGCGGAGGCTGGACGCTTGCATGGCGCTTGAGAGATTCCATCGTGCCTGGCTCACATAAGCGCTTCCTGGAAGTGAAGTCGTGCTGTCCTGAACGCGGGCCAGGCAGCTGCGGCCTGGGGGTTTTGGAGTGATCACGAATGAGCAAGGCGTTTGGGCTCCTGAGGCAAATCTGTCAGTCCATCCTGGCTGAGTCCTCGCAGTCCCCGGCAGATCTTGAAGAAAAGAAGGAAGAAGACAGCAACATGAAGAGAGAGCAGCCCAGAGAGCGTCCCAGGGCCTGGGACTACCCTCATGGCCTGGTTGGTTTACACAACATTGGACAGACCTGCTGCCTTAACTCCTTGATTCAGGTGTTCGTAATGAATGTGGACTTCACCAGGATATTGAAGAGGATCACGGTGCCCAGGGGAGCTGACGAGCAGAGGAGAAGCGTCCCTTTCCAGATGCTTCTGCTGCTGGAGAAGATGCAGGACAGCCGGCAGAAAGCAGTGCGGCCCCTGGAGCTGGCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTTTGTCCAACATGATGCTGCCAACTGTACCTCAAACTCTGGAACCTGATTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATATGATCCGGGTGAAGGACTCCTTGATATGCGTTGACTGTGCCATGGGAGAGTAGCAGAAAACAGCAGCATGCTCAACCTCCCACTTTCTCTATTGGATGTGGACTCAAAGCCCT >gi| |gb|BM |BM AGENCOURT_ NIH_MGC_124 Homo sapiens cDNA clone IMAGE: ', mRNA 
sequenceGTCCGGAATTCCCGGGATCTCAGCAGCGGAGGCTGGACGCTTGCATGGCGCTTGAGAGATTCCATCGTGCCTGGCTCACATAAGCGCTTCCTGGAAGTGAAGTCGTGCTGTCCTGAACGCGGGCCAGGCAGCTGCGGCCTGGGGGTTTTGGAGTGATCACGAATGAGCAAGGCGTTTGGGCTCCTGAGGCAAATCTGTCAGTCCATCCTGGCTGAGTCCTCGCAGTCCCCGGCAGATCTTGAAGAAAAGAAGGAAGAAGACAGCAACATGAAGAGAGAGCAGCCCAGAGAGCGTCCCAGGGCCTGGGACTACCCTCATGGCCTGGTTGGTTTACACAACATTGGACAGACCTGCTGCCTTAACTCCTTGATTCAGGTGTTCGTAATGAATGTGGACTTCACCAGGATATTGAAGAGGATCACGGTGCCCAGGGGAGCTGACGAGCAGAGGAGAAGCGTCCCTTTCCAGATGCTTCTGCTGCTGGAGAAGATGCAGGACAGCCGGCAGAAAGCAGTGCGGCCCCTGGAGCTGGCCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTTTGTCCAACATGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGATTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATACGATCCGGGTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGAGAGTAGCAGAAACAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATGTGGACTCAAAGCCCCTGGAAGACACTGGAGGACGCCCTGCACTGCTTCTTCCAGCCCAGGAGTTATCAAGCAAAAGCAAGTGCTTCTGTGAGAACTGTGGGAAGAAGACCCGCGGGGAACAGGGTCCTGAAACCTGACCATTTTGCCCCAGACCTTGACCAATCCACCTCATGGCGATTCTCCCTCCAGGAATTCCCCGACCGAGAAAAAATTGGCCACTTCCCCGGAATTTCCCCCCAAAAACTTGGAATTTCACCCAAAACCTTTCCCATGTAAACCCGGAAACCCTGGGGAAGGCT >gi| |gb|AW |AW EST MAGE resequences, MAGE Homo sapiens cDNA, mRNA sequenceGAACTAGTGGATCCCCCGGGCTGCAGGAATTCGGCACGAGTGGAGCTGGCCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTTTGTCCAACATGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGATTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATATGATCCGGGTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGAGAGTAGCAGAAACAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATGTGGACTCAAAGCCCCTGAAGACACTGGAGGACGCCCTGCACTGCTTCTTCCAGCCCAGGGAGTTATCAAGCAAAAGCAAGTGCTTCTGTGAGAACTGTGGGAAGAAGACCCGTGGGAAACAGGTCTTGAAGCTGACCCATTTGCCCCAGACCCTGACAATCCACCTCATGCGATTCTTCATCAGGAATTCACAGACGAGAAAGATCTGCCACTCCCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGAACCTTCCAATGAAGCGAGAATCTTGTGAAGCTGAAGAACAGTCTGGAAGGCAAGATGAGCTTTTTGCTGGGAATGCGCACGTGGAAAGGCAGAATTCGGTCATAA >gi| |gb|AW |AW EST MAGE resequences, MAGE Homo sapiens cDNA, mRNA 
sequenceGGAGCTGGCCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTTTGTCCAACATGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGATTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATATGATCCGGGTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGAGAGTAGCAGAAACAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATGTGGACTCAAAGCCCCTGAAGACACTGGAGGACGCCCTGCACTGCTTCTTCCAGCCCAGGGAGTTATCAAGCAAAAGCAAGTGCTTCTGTGAGAACTGTGGGAAGAAGACCCGTGGGAAACAGGTCTTGAAGCTGACCCATTTGCCCCAGACCCTGACAATCCACCTTATGCGATTCTCCATCAGGAATTCACAGACGAGAAAGATCTGCCACTCCCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGATCCTTCCAATGAAGCGAGAGTCTTGTGATGCTTGAGGAGCAATCTGGAGGGCATATGAGCTTTTTGCTGTGATTGCGCACCTGGGAATGCAAAACTCCGTCATTACTG >gi| |gb|BQ |BQ AGENCOURT_ NIH_MGC_72 Homo sapiens cDNA clone IMAGE: ', mRNA sequenceAGATCTGCCACTCCCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGATCCTTCCAATGAAGCGAGAGTCTTGTGATGCTGAGGAGCAGTCTGGAGGGCAGTATGAGCTTTTTGCTGTGATTGCGCACGTGGGAATGGCAGACTCCGGTCATTACTGTGTCTACATCCGGAATGCTGTGGATGGAAAATGGTTCTGCTTCAATGACTCCAATATTTGCTTGGTGTCCTGGGAAGACATCCAGTGTACCTACGGAAATCCTAACTACCACTGGCAGGAAACTGCATATCTTCTGGTTTACATGAAGATGGAGTGCTAATGGAAATGCCCAAAACCTTCAGAGATTGACACGCTGTCATTTTCCATTTCCGTTCCTGGATCTACGGAGTCTTCTAAGAGATTTTGCAATGAGGAGAAGCATTGTTTTCAAACTATATAACTGAGCCTTATTTATAATTAGGGATATTATCAAAATATGTAACCATGAGGCCCCTCAGGTCCTGATCAGTCAGAATGGATGCTTTCACCAGCAGACCCGGCCATGTGGCTGCTCGGTCCTGGGTGCTCGCTGCTGTGCAAGACATTAGCCCTTTAGTTATGAGCCTGTGGGAACTTCAGGGGTTCCCAGTGGGGAGAGCAGTGGCAGTGGGAGGCATCTGGGGGGCCAAGGGCAGTGGCAGGGGGTATTTCAGTATTATACCACTGCTGTGACCAGACTTGTATACTGGCTGAATATCAGGGCTGGTTGTAATTTTTTCCCTTTGAAGAAACACCATTAATTTCCTAATGAATCCAAGTGGTTTGTAACTTGCCTATTCCTTTTATTCCAGCAAAAAATTAATTGATCATCCCCTCCCCCAAAAAATAGGGG >gi| |gb|BG |BG RST25778 Athersys RAGE Library Homo sapiens cDNA, mRNA 
sequenceTCCTGGGAAGACATCCAGTGTACCTACGGAAATCCTAACTACCACTGGCAGGAAACTGCATATCTTCTGGTTTACATGAAGATGGAGTGCTAATGGAAATGCCCAAAACCTTCAGAGATTGACACGCTGTCATTTTCCATTTCCGTTCCTGGATCTACGGAGTCTTCTAAGAGATTTTGCAATGAGGAGAAGCATTGTTTTCAAACTATATAACTGAGCCTTATTTATAATTAGGGATATTATCAAAATATGTAACCATGAGGCCCCTCAGGTCCTGATCAGTCAGAATGGATGCTTTCACCAGCAGACCCGGCCATGTGGCTGCTCGGTCCTGGGTGCTCGCTGCTGTGCAAGACATTAGCCCTTTAGTTATGAGCCTGTGGGAACTTCAGGGGTTCCCAGTGGGGAGAGCAGTGGCAGTGGGAGGCATCTGGGGGCCAAAGGTCAGTGGCAGGGGGTATTTCAGTATTATACAACTGCTGTGACCAGACTTGTATACTGGCTGAATATCAGTGCTGTTTGTAATTTTTCACTTTGAGAACCAACATTAATTCCATATGAATCAAGTGTTTTGGAACTGCTATTCATTTATTCAGCAAATATTTATTGGTCATCTTTTCTCCATAAGATAGTGTGATAAACACAGCATGAATAAAGGTATTTTCCACACAGACAAGTGTTTTTTCACAAAATTATTNATTTTGNTGGGGCTGTGGCGGCCGCTTCCTTTATGGGGGGGAATTTAGAACCCGTTCCTGACGCGGGGGN

List of assembled fragments
Assembling DNA seq. fragments

Overlap details

End of overlap details Assembled mRNA sequence

Box 2.1 Pairwise Sequence Alignment
The most important class of bioinformatics tools: pairwise alignment of DNA and protein sequences.

              alignment 1    alignment 2
  Seq. 1      ACGCTGA        ACGCTGA
  Seq. 2      A--CTGT        ACTGT--

- Seek alignments with high sequence identity and few mismatches and gaps
- Assumption: the observed identity in the sequences to be aligned is the result of either chance or a shared evolutionary origin
- Identity ≠ similarity
- Sequence identity = homology is a risky assumption; sequence identity ≠ homology

The same true alignment can arise through different evolutionary events.
Scoring scheme: substitution = -1, indel = -5, match = 3.
Figure A Common evolutionary events and their effects on alignment
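
With this scoring scheme, any gapped alignment can be scored column by column. The sketch below applies it to the two candidate alignments of ACGCTGA and ACTGT from Box 2.1; the function name is illustrative.

```python
def score_alignment(row1: str, row2: str,
                    match: int = 3, mismatch: int = -1, indel: int = -5) -> int:
    """Sum the column scores of two equal-length aligned rows ('-' is a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += indel      # residue aligned with a gap
        elif a == b:
            score += match
        else:
            score += mismatch   # substitution
    return score


print(score_alignment("ACGCTGA", "A--CTGT"))  # alignment 1 -> 1
print(score_alignment("ACGCTGA", "ACTGT--"))  # alignment 2 -> -3
```

Alignment 1 (4 matches, 1 mismatch, 2 indels) scores higher than alignment 2 (3 matches, 2 mismatches, 2 indels), so it is preferred under this scheme.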

Find the optimal score: the best guess for the true alignment.
- Finding the optimal pairwise alignment of two sequences means inserting gaps into one or both of them so as to maximize the total alignment score
- Dynamic programming (DP): Needleman and Wunsch (1970), Smith and Waterman (1981). This algorithm guarantees that we find all optimal alignments of two sequences of lengths m and n
- BLAST builds on DP-style alignment, adding heuristics that improve speed
[Photo: Prof. Waterman]

The score for the alignment of i residues of sequence 1 against j residues of sequence 2 is given by

  S(i,j) = max { S(i-1,j-1) + c(i,j), S(i-1,j) + c(i,-), S(i,j-1) + c(-,j) }

where c(i,j) = the score for aligning residues i and j, which takes the value 3 for a match or -1 for a mismatch, and c(i,-) = c(-,j) = the penalty for aligning a residue with a gap, which takes the value -5.

The entry for S(1,1) is the maximum of the following three events:
  S(0,0) + c(A,A) = 0 + 3 = 3        [c(A,A) = c(1,1)]
  S(0,1) + c(A,-) = -5 - 5 = -10     [c(A,-) = c(1,-)]
  S(1,0) + c(-,A) = -5 - 5 = -10     [c(-,A) = c(-,1)]
so S(1,1) = 3. Similarly, one finds S(2,1) as the maximum of three values: (-5) - 1 = -6; 3 - 5 = -2; and (-10) - 5 = -15. The best entry is the addition of the C indel to the A-A match, for a score of -2 (see next page).

The alignment matrix of sequences 1 and 2
  S(2,1) = max { S(1,0) + c(2,1), S(1,1) + c(2,-), S(2,0) + c(-,1) }
         = max { S(1,0) + c(C,A), S(1,1) + c(C,-), S(2,0) + c(-,A) }
         = max { -5 - 1, 3 - 5, -10 - 5 } = -2

Traceback: determine the actual alignment, starting from the top right-hand corner, the (7,5) cell. For example, the 1 in the (7,5) cell could only be reached by the addition of the mismatch A-T.

  ACGCTGA        ACGCTGA
  A--CTGT   or   AC--TGT

Both alignments have 4 matches, 1 mismatch and 2 indels. The ambiguity has to do with which C in seq. 1 aligns with the C in seq. 2.
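
The matrix fill and traceback described in this box can be sketched in Python as follows. It assumes the lecture's proportional gap penalty (-5 per gap position), match = 3 and mismatch = -1, and it returns only one of the optimal alignments (ties are broken in favor of the diagonal move); the function name is illustrative.

```python
def needleman_wunsch(seq1: str, seq2: str,
                     match: int = 3, mismatch: int = -1, gap: int = -5):
    """Global alignment by dynamic programming; S[i][j] follows the box notation."""
    m, n = len(seq1), len(seq2)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap          # seq1 prefix aligned against gaps only
    for j in range(1, n + 1):
        S[0][j] = j * gap          # seq2 prefix aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = match if seq1[i - 1] == seq2[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + c,   # align residue i with residue j
                          S[i - 1][j] + gap,     # residue i against a gap
                          S[i][j - 1] + gap)     # residue j against a gap

    # Traceback from the (m, n) cell back to (0, 0).
    row1, row2 = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            c = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + c:
                row1.append(seq1[i - 1]); row2.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            row1.append(seq1[i - 1]); row2.append("-")
            i -= 1
        else:
            row1.append("-"); row2.append(seq2[j - 1])
            j -= 1
    return S, "".join(reversed(row1)), "".join(reversed(row2))


S, top, bottom = needleman_wunsch("ACGCTGA", "ACTGT")
print(S[7][5])   # optimal score: 1
print(top)
print(bottom)
```

Reporting every optimal alignment, rather than just one, would require following all tied moves during the traceback.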

Parameter settings: gap penalties
- Default settings are the easiest to use, but they do not necessarily yield the correct alignment
- Constant penalty: independent of the length of the gap, A
- Proportional penalty: proportional to the length L of the gap, BL (this is what we used in this lecture)
- Affine gap penalty: gap-opening penalty + gap-extension penalty = A + BL
- There is no rule for predicting the penalty that best suits the alignment; optimal penalties vary from sequence to sequence, so it is a matter of trial and error
- Usually A > B, because opening a gap is penalized more than extending it (usually A/B ~ 10)
- Hints: (1) when comparing distantly related sequences, a high A and a very low B often give the best results, so that gaps are penalized more for their existence than for their length; (2) when comparing closely related sequences, penalize both gap opening and gap extension
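
The three gap-penalty models can be written as small functions of the gap length L. This is a sketch that treats A and B as positive costs to be subtracted from the alignment score; the example values A = 10 and B = 1 are chosen only to match the A/B ~ 10 rule of thumb.

```python
def constant_penalty(length: int, A: float) -> float:
    """Same cost for any gap, regardless of its length."""
    return A if length > 0 else 0.0


def proportional_penalty(length: int, B: float) -> float:
    """Cost grows linearly with gap length: B * L."""
    return B * length


def affine_penalty(length: int, A: float, B: float) -> float:
    """Gap-opening cost plus per-position extension cost: A + B * L."""
    return A + B * length if length > 0 else 0.0


for L in (1, 3, 10):
    print(L, constant_penalty(L, 10), proportional_penalty(L, 1), affine_penalty(L, 10, 1))
```

With a high A and low B, one long gap costs much less than several short gaps covering the same number of positions.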

Exercise 2.2 Computing an optimal sequence alignment
Two scoring schemes:
(1) Gap penalty = -5, mismatch = -1, match = 3
(2) Gap penalty = -1, mismatch = -1, match = 3

Under scheme (1):
  First alignment score = 5*3 + 2*(-1) = 13
  Second/third alignment score = 6*3 + 2*(-5) = 8
Under scheme (2):
  First alignment score = 5*3 + 2*(-1) = 13
  Second/third alignment score = 6*3 + 2*(-1) = 16

A more serious problem: the scoring scheme can identify the wrong alignment.
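
The arithmetic of this exercise can be checked with a small helper that scores an alignment from its counts of matches, mismatches and gap positions. The counts are read off the scores above (5 matches and 2 mismatches for the first alignment; 6 matches and 2 gap positions for the second and third); the function name is illustrative.

```python
def score_from_counts(matches: int, mismatches: int, gaps: int,
                      match: int = 3, mismatch: int = -1, gap: int = -5) -> int:
    """Alignment score from column counts under a proportional gap penalty."""
    return matches * match + mismatches * mismatch + gaps * gap


for gap in (-5, -1):
    first = score_from_counts(5, 2, 0, gap=gap)
    second = score_from_counts(6, 0, 2, gap=gap)
    best = "first" if first > second else "second/third"
    print(f"gap = {gap}: first = {first}, second/third = {second}, best = {best}")
```

Changing only the gap penalty flips which alignment is reported as optimal, which is why the parameter choice matters.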

Exercise 2.2 Computing an optimal sequence alignment
Gap penalty = -5 Gap penalty = -1

Emerging Sequencing Methods
Costs of genome sequencing:
- Mid [year not given]: $30-50 million to sequence a mammalian genome
- Target: $1000 per human genome by the year 2010
- J. Craig Venter Foundation: $500,000 award for the first person to achieve this goal

New technologies:
- Sequencing by hybridization (SBH): detects whether or not an exact match is present in a sample of DNA
- Mass spectrometric techniques: ionized fragments, time of flight
- Nanopore sequencing strategies: ultrafast and relatively inexpensive sequencing of long DNA fragments
- Single-molecule approaches: Solexa, Visigen and Genovoxx
- Single-molecule polony sequencing

Figure 2.6 Single-molecule polony sequencing
Emerging Sequencing Methods
- Dilute solutions of DNA are plated onto a glass microscope slide.
- In situ PCR produces thousands of tiny colonies of DNA, which then incorporate single dye-labeled dNTPs.
- Polony: PCR colony (a cluster of amplified DNA).
- The slide is read after each cycle of incorporation of a new base, allowing short sequences to be determined.
- Each numbered polony produces a short nucleotide sequence, as shown. These can then be assembled computationally into a contiguous sequence.

Figure 2.7 (Part 1) Hierarchical versus shotgun sequencing
Genome Sequencing
- Whole genome sequences are assembled from ~10^5 fragments, each typically between 500 and 1000 bp in length.
- Two general approaches for fragmentation and assembly: (1) hierarchical sequencing, (2) shotgun sequencing. For a historical overview, see (link missing).
- Hierarchical sequencing: first develop a low-resolution physical map, so that the sequence is obtained in large ordered pieces.
- Shotgun sequencing: break the genome into small fragments and use computer algorithms to assemble them (see Figure 2.7).
- Most new genome projects adopt the shotgun approach.

Genome Sequencing – hierarchical sequencing
- Top-down, map-based or clone-by-clone strategy (from the late 1980s)
- The genome is broken into small fragments
- The relative locations of the fragments are known BEFORE sequencing

Advantages:
- It fostered (helped develop) the assembly of high-resolution physical and genetic maps
- It allowed groups working around the globe to share the work
- Technology for cloning large fragments of genomes progressed rapidly throughout the 1990s, e.g. for E. coli, S. cerevisiae, C. elegans and A. thaliana

Top-down sequencing: clone sequences as manageable units of fragments (50-200 kb in length).
Cloning vectors: BAC (~300 kb), PAC (~100 kb), phage-derived cosmids.

Figure 2.7 (Part 2) Shotgun sequencing
Genome Sequencing: shotgun sequencing. In the shotgun approach, no attempt is made to order the clones in advance. Instead, the whole genome is assembled using computer algorithms that order contigs based on their overlapping sequences.

Figure 2.8 Cloning vectors used in genome sequencing

DNA libraries
- DNA is fragmented by restriction enzyme (RE) digestion or by sonication (ultrasound treatment)
- Fragments are ligated into a multiple cloning site (MCS) in the vector
- Aim for 5- to 10-fold redundancy: the library should contain 5 to 10 times as much DNA as the genome
- Each clone has different ends, so it is possible to select a scaffold of clones that forms contiguous sequence coverage, a tiling path (as in laying tiles), by aligning the regions of overlap (Fig. 2.9)
- The tiling path can be assembled using a combination of 3 methods: (1) hybridization, (2) fingerprinting, and (3) end-sequencing
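
The redundancy target translates directly into a library size: the number of clones times the insert size should be 5 to 10 times the genome size. The sketch below makes that arithmetic explicit; the 100 Mb genome and 100 kb insert size are hypothetical values chosen for illustration.

```python
import math


def clones_needed(genome_bp: int, insert_bp: int, redundancy: float) -> int:
    """Clones required so that total cloned DNA is `redundancy` x the genome."""
    return math.ceil(redundancy * genome_bp / insert_bp)


# Hypothetical example: a 100 Mb genome cloned as 100 kb BAC inserts.
for fold in (5, 10):
    print(f"{fold}-fold redundancy: {clones_needed(100_000_000, 100_000, fold)} clones")
```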

Figure 2.9 Hierarchical assembly of a sequence-contig scaffold (supercontig). A minimal tiling path through a library of aligned BAC clones that ensures complete coverage of the chromosome is chosen. After independent shotgun libraries are sequenced for each BAC, small gaps in the sequenced clone contigs remain. These are closed as far as possible by merging the two BAC sequences, as well as by the addition of mate-pair information (yellow) and cDNA structural information (red), which establishes the orientation and distance between cloned segments.

Hybridization
All of the clones in a library that carry a particular sequence can be identified rapidly by hybridizing a small radioactively or chemically labeled probe containing the sequence to a filter on which an array of ~10,000 clones is printed (Fig. 2.10A).

Fingerprinting
- Study the restriction enzyme (RE) patterns: contigs of large-insert clones are assembled by comparing and aligning the clones according to their RE fragments
- An RE with a ~6 bp recognition site cuts on average once every 4^6 = 2^12 ~ 4000 bp
- For a 100 kb BAC: 100 kb / 4 kb ~ 20-30 fragments, which can be separated by electrophoresis
- Fingerprint profile -> BAC alignment by gel -> software alignment -> overlaps -> contigs -> assembly of ~Mb-length contigs
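
The fingerprinting arithmetic, and the idea of comparing clones by shared bands, can be sketched as follows. The cut-frequency estimate assumes a random sequence with equal base frequencies, and the band sizes in the example are hypothetical.

```python
def expected_fragment_length(site_length: int) -> int:
    """Average spacing of a recognition site in random, equal-frequency DNA."""
    return 4 ** site_length


def expected_fragments(insert_bp: int, site_length: int = 6) -> float:
    """Expected number of restriction fragments from one cloned insert."""
    return insert_bp / expected_fragment_length(site_length)


def shared_bands(profile_a: set[int], profile_b: set[int]) -> set[int]:
    """Band sizes (bp) present in both fingerprint profiles."""
    return profile_a & profile_b


print(expected_fragment_length(6))          # 4096
print(round(expected_fragments(100_000)))   # about 24 fragments per 100 kb BAC

# Hypothetical band sizes for two clones; shared bands suggest overlap.
clone_1 = {1200, 3400, 5100, 8000}
clone_3 = {3400, 5100, 9500}
print(sorted(shared_bands(clone_1, clone_3)))
```

Real fingerprinting software allows a size tolerance when matching bands, since gel measurements are not exact.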

Figure 2.10 Aligning BAC clones by hybridization and fingerprinting
Genome Sequencing: hierarchical sequencing
(A) A macroarray of BAC clones is probed with a short, radioactive fragment to identify all BACs that carry a specific fragment. These clones are digested with a restriction enzyme, end-labeled, and separated by gel electrophoresis. Software converts the bands to a virtual profile, shown hypothetically for a small portion of four bands (highlighted box in part B). Shared bands (red or blue) imply that two clones share the same sequence. Green indicates the vector band common to all clones. The fingerprint profile is then converted into a BAC alignment. In this example, clone 2 does not share any bands with the others and so is placed into a separate BAC contig, while the other three clones form a tiling path.

End-sequencing
- Fills in the gaps after fingerprinting. How? By sequencing both ends of the collection of BAC clones
- Once a critical threshold of sequences has been reached, overlaps emerge
- For example, along a 10 Mbp genome, end sequences of 10,000 BAC clones provide a sequence tag every 5 kb (for 5-fold coverage):
  10 Mbp / 10,000 BACs = 1 kbp per BAC
  at five-fold coverage, 10 Mbp / 2,000 BACs ~ 5 kb (the sequence-tag distance)
- Given this tag density, it is possible to close gaps < 50 kb
- Once the tiling path is chosen, shotgun the BAC clones into small fragments
- Subcloning: use M13 phagemid (~1 kb; exists as dsDNA and ssDNA) or clone 2-3 kb fragments into a plasmid vector

Genome Sequencing – Shotgun sequencing
- Use computer algorithms to assemble the sequences (~100,000)
- About 5- to 10-fold redundancy for each fragment
- Library: from a single whole genome
- After multiple sequence alignment (MSA): screen out repetitive sequences, overlap reads of the same sequence, and generate unitigs and scaffolds; >90% of the sequences are assembled
- Finishing phase: closing gaps and cleaning up ambiguities; takes as much time as the shotgun phase
- Users are asked to trust the assemblies

Celera Genomics used the following software to assemble the sequences:
- Screener: masks (does not remove) sequences that contain repetitive DNA (such as microsatellites, LINEs, Alu repeats, retrotransposons and ribosomal DNA)
- Overlapper: compares every unscreened read against every other unscreened read, searching for overlaps of a predetermined length and identity. Parallel processing on 40 supercomputers, each with 4 GB RAM, allowed the 27 million screened human sequence reads to be overlapped in < 5 days
- Unitigger: resolves repeat-induced overlaps of a sequence (see Figure 2.11)
- Scaffolder: uses mate-pair information to link U-unitigs into scaffold contigs
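
The all-against-all comparison performed by an overlapper can be sketched as below. This is not Celera's implementation: it is a brute-force toy that checks suffix-prefix overlaps without gaps, and the thresholds and reads are invented for illustration.

```python
from itertools import permutations


def overlap_identity(left: str, right: str, size: int) -> float:
    """Fraction of identical positions between left's suffix and right's prefix."""
    tail, head = left[-size:], right[:size]
    return sum(a == b for a, b in zip(tail, head)) / size


def find_overlaps(reads: list[str], min_length: int = 5,
                  min_identity: float = 0.9) -> list[tuple[int, int, int, float]]:
    """Report (left index, right index, overlap length, identity) for read pairs.

    For each ordered pair, only the longest overlap passing both thresholds
    is reported.
    """
    hits = []
    for (i, left), (j, right) in permutations(enumerate(reads), 2):
        for size in range(min(len(left), len(right)), min_length - 1, -1):
            identity = overlap_identity(left, right, size)
            if identity >= min_identity:
                hits.append((i, j, size, identity))
                break
    return hits


reads = ["ACGTTGCATG", "GCATGCCTAA", "CCTAAGGT"]
for hit in find_overlaps(reads, min_length=5, min_identity=1.0):
    print(hit)
```

Comparing every read against every other scales quadratically with the number of reads, which is why the real computation needed parallel hardware.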

Figure 2.11 U-unitigs and repeat resolution. (A) Sequence alignment between two or more shotgun clones can arise between unique sequences (left) or repetitive sequences (right). (B) The Overlapper aligns unitigs, which are identified as unique sequence alignments (U-unitigs) or overcollapsed repeats (blue). Two contigs can be aligned and oriented by using mate-pair sequence information from the ends of longer (10- or 50-kb) clones, as shown at the bottom, while mate-pairs from 2-kb fragments allow assembly of scaffolds despite the presence of simple repeats such as microsatellites (blue) that are masked before performing alignments.

Figure 2.12 Proportion of fly and human genomes in large scaffolds. The figure shows the estimated coverage of the fly and human whole genomes after initial assembly: in both cases, 84% or more of the genome was covered by scaffolds at least 100 kb in length, while most scaffolds were in the Mb range. Going from 5x to 10x sequence coverage changes the proportion of scaffolds of lengths up to 1 Mb by about 10%. The plot shows the percentage of scaffolds that have a length greater than that indicated for the fly 10x, human 8x (CSA) and human 5x (whole-genome assembly, WGA) sequences generated by Celera. The fly and CSA assemblies include shredded (cut into small pieces) sequences generated from BAC clones by public genome sequencing efforts.

NCTS http://math.cts.nthu.edu.tw/Mathematics/conference-PT2005.html
UCSD
