Genome Sequencing and Annotation (Part 1).


Objective of most genome projects: sequencing (DNA, mRNA), identifying genes, and characterizing gene features. This chapter covers: how blocks of DNA seqs. are obtained; how these blocks are assembled into contigs and then genomes; bioinformatics, i.e. how to do seq. alignment of cDNA/EST and genome seqs.; annotation of ORFs; other features of genes, such as repetitive elements, variable distribution of GC content, and evolutionarily conserved elements; gene annotation by cross-species comparison.

Figure 2.1 (Part 2) The principle of dideoxy (Sanger) sequencing. Automated DNA sequencing: in 1977, F. Sanger published the chain-termination method (Sanger sequencing). Sanger won his second Nobel prize for inventing this process.

Automated DNA sequencing Most current sequencing projects use the chain termination method Also known as Sanger sequencing, after its inventor Based on action of DNA polymerase Adds nucleotides to complementary strand Requires template DNA and primer Most large-scale sequencing projects use the chain-termination method, which is also known as Sanger sequencing, after Fred Sanger, who won his second Nobel prize for inventing the process. Chain-termination sequencing is based on the action of DNA polymerase, which adds nucleotides that are complementary to another strand of DNA. For it to work, it needs a template DNA strand that it can copy, as well as a short stretch of DNA, known as a primer, to which it can add nucleotides.

Chain-termination sequencing
Dideoxynucleotides (ddA, ddT, ddC or ddG) stop synthesis: they are chain terminators (DNA polymerase cannot add another nucleotide). They are included in amounts chosen so that termination occurs at every position where the base appears in the template. Use four reactions, one for each base: A, C, G, and T.
Template: 3’ ATCGGTGCATAGCTTGT 5’
Sequence reaction products (ddA reaction):
5’ TAGCCACGTATCGAACA* 3’
5’ TAGCCACGTATCGAA* 3’
5’ TAGCCACGTATCGA* 3’
5’ TAGCCACGTA* 3’
5’ TAGCCA* 3’
5’ TA* 3’
In the Sanger chain-termination method, the nucleotide analog is called a dideoxynucleotide. When the correct amount is added to the solution, the chain will be terminated at each occurrence of the complementary nucleotide in the template. The reason for this reaction is that DNA polymerase cannot add another nucleotide to a dideoxynucleotide. For example, if the right amount of dideoxy A is added, then the chain will be terminated at each occurrence of T in the template. Determining the complete sequence requires a separate reaction for each of the four bases A, T, C, and G. On the top right of this slide is a template strand, and beneath it are the various chains that would be terminated with dideoxy A in the reaction mix.
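The ddA reaction on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example, and it ignores the primer and the chemistry, treating termination as "one chain ends at every A in the newly synthesized strand".

```python
# Illustrative sketch (hypothetical helper names): which chains does a
# dideoxy reaction produce for the template shown on the slide?
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def synthesized_strand(template_3to5: str) -> str:
    """New strand (written 5'->3') built on a template written 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3to5)


def terminated_chains(template_3to5: str, dd_base: str) -> list[str]:
    """All chains that end where dd_base is incorporated into the new strand."""
    strand = synthesized_strand(template_3to5)
    return [strand[: i + 1] for i, base in enumerate(strand) if base == dd_base]


template = "ATCGGTGCATAGCTTGT"  # 3' -> 5', as on the slide
chains_ddA = terminated_chains(template, "A")

# Print longest first, in the same order as the slide.
for chain in reversed(chains_ddA):
    print(f"5' {chain}* 3'")
```

Running the same function with "T", "C" and "G" gives the products of the other three reactions.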

Sequence detection To detect products of sequencing reaction Include labeled nucleotides Formerly, radioactive labels (33P or 35S) were used Now fluorescent labels Use different fluorescent tag for each nucleotide Can run all four reactions in a single gel lane or capillary tube TAGCCACGTATCGAA* TAGCCACGTATC* TAGCCACG* TAGCCACGT* Once the sequencing reaction has been completed, one has to be able to detect the chains generated by the reaction. This can be done by making the chains radioactive by using radioactively labeled nucleotides in the reaction. As an alternative, today’s automated sequencers all use fluorescent labels. With this approach, each of the four sequencing reactions (for each of the bases A, T, C, and G) uses a different color of fluorescent tag. After the reactions are completed, the four fluorescently labeled reactions can be combined and run in a single gel lane or capillary tube.

Sequence separation: terminated chains need to be separated. Requires one-base-pair resolution (see the difference between chains of X and X+1 base pairs). Gel electrophoresis: very thin gel, high voltage applied, works with radioactive or fluorescent labels, negative pole at the top. Determining the placement of each of the bases in the sequence requires separating the terminated chains and resolving them so that one can see differences of one base. This means, for example, that a chain of length 35 bases must be distinguishable from chains of 34 or 36 bases. This step is normally done using electrophoresis through a gel or capillary. Originally, very thin gels were used, with high voltages applied. Either radioactively labeled or fluorescently labeled reaction products can be separated on gels. The image in this slide shows an autoradiogram of a typical sequencing gel with the lanes marked according to the dideoxy terminator used (lanes C A G T, C A G T; positive pole marked +). For gel electrophoresis, the negative pole was at the top and the positive pole at the bottom.

Sequence reading of radioactively labeled reactions. Radioactively labeled reactions: gel dried, placed on X-ray film, film developed, and the position of each band becomes visible. Sequence read from the bottom up (from the positive pole), with each of the four lanes giving the position of a different base: A, T, C or G. The final step of sequencing is to read the sequence. After radioactively labeled sequencing reactions are run through a gel, the gel is dried and then a piece of X-ray film is placed on it. After the film is developed, the position of each band becomes visible. The sequence is then read from the bottom of the gel to the top, with each of the four lanes giving the position of a different base: A, T, C, or G.
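Reading a four-lane gel from the bottom up amounts to sorting the bands by fragment length, because the shortest chains migrate farthest toward the positive pole. The sketch below assumes an idealized gel in which every band has been converted to an exact chain length; the `read_gel` function and the band positions are hypothetical, chosen to match the ddA example used earlier in this deck.

```python
# Idealized sketch: each lane is a list of chain lengths (in bases past the
# primer) at which a band is seen. Real gels give positions, not lengths.
def read_gel(lanes: dict[str, list[int]]) -> str:
    """Read the synthesized strand 5'->3' by walking bands from bottom to top."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


# Hypothetical band positions for the strand TAGCCACGTATCGAACA.
lanes = {
    "A": [2, 6, 10, 14, 15, 17],
    "T": [1, 9, 11],
    "G": [3, 8, 13],
    "C": [4, 5, 7, 12, 16],
}
print(read_gel(lanes))
```

Note that the sequence read this way is the newly synthesized strand, which is the complement of the template.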

Sequence reading of fluorescently labeled reactions Fluorescently labeled reactions scanned by laser as particular point is passed Color picked up by detector Output sent directly to computer The read out is given both in terms of bases and the intensity of each color, so that ambiguous readings are easily identified For fluorescently labeled reactions that are separated either by gel electrophoresis or through a capillary, the fragments are detected by a laser as they pass a particular point. Each of the four colors is picked up by a detector, and the output is sent directly to a computer. The readout is given both in terms of bases and in terms of the intensity of each color, so that ambiguous readings are easily identified.

Summary of chain termination sequencing A primer is extended by DNA polymerase based on the sequence present in the template strand. The chain is terminated by different ddNTP that are complementary to the template strand. Four reactions are separated on a gel that can resolve one-base differences. The seq. is then read from the bottom of gel to the top. This image summarizes the steps in DNA sequencing. A primer is extended by DNA polymerase based on the sequence present in the template strand. The chain is terminated by different dideoxynucleotides that are complementary to the template strand. Four reactions, one for each base, are separated on a gel that can resolve one-base differences. The sequence is then read from the bottom of the gel to the top.

High-Throughput Sequencing. The new techniques and equipment include: (1) Four-color fluorescent dyes have replaced the radioactive label. (2) Rather than stopping the electrophoresis at a particular time, the products are scanned for laser-induced fluorescence just before they run off the end of the electrophoresis medium. (3) Improvements in the chemistry of template purification and the sequencing reaction. (4) Slab gel electrophoresis gave way to capillary electrophoresis with the introduction in 1999 of Applied Biosystems' ABI Prism 3700 automated sequencers, which in turn were updated with ABI Prism 3730 DNA analyzers in 2003 (these deliver extremely high quality, long reads and save time and money).

Reading sequence traces. Base-calling: the reading of raw sequence traces. Now routinely performed by automated software that reads bases, aligns similar seqs., and supports editing. Program: phred http://www.phrap.org The program assigns a probability score to the accuracy of each base call as the trace is read. (Figure 2.2)
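The slide does not say how the probability score is expressed. By the usual phred convention (an addition here, not stated on the slide), the quality value is Q = -10 log10(P), where P is the estimated probability that the base call is wrong. A minimal sketch of the conversion, with function names invented for this example:

```python
import math


def error_prob_to_phred(p_error: float) -> float:
    """Quality value Q = -10 * log10(P) for an error probability P."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def phred_to_error_prob(q: float) -> float:
    """Inverse conversion: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


for q in (10, 20, 30, 40):
    print(f"Q{q}: about 1 error in {round(1 / phred_to_error_prob(q))} calls")
```

Under this convention a base with Q20 has a 1% chance of being wrong and a base with Q30 a 0.1% chance.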

Figure 2.3 Automated sequence chromatograms. This seq. shows the noisiness of the first 30 bp of a run. The middle two rows show a segment of two seqs. that are polymorphic for both SNPs and an indel. A decline in seq. quality typically occurs after about 800 bp.

Ex. 2.1 Reading a sequence trace. The base labeled N is due to poor seq. quality. Where two peaks of the same height are observed at the same location, the site is heterozygous for a C/T SNP.
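The two rules in this exercise (call N when the signal is poor, call a heterozygote when two peaks are about equally high) can be written as a toy base caller. Everything below is a hypothetical sketch: the thresholds `min_signal` and `het_ratio` are arbitrary illustration values, and real base callers such as phred use far more information from the trace. Heterozygous sites are reported with the standard IUPAC two-base codes (for example Y for C/T).

```python
# Toy base caller for one trace position, given the four peak heights.
IUPAC_TWO_BASE = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def call_base(peaks: dict[str, float], min_signal: float = 100.0,
              het_ratio: float = 0.8) -> str:
    """Return A/C/G/T, an IUPAC heterozygote code, or N for a weak signal."""
    ranked = sorted(peaks.items(), key=lambda item: item[1], reverse=True)
    (top_base, top_height), (second_base, second_height) = ranked[0], ranked[1]
    if top_height < min_signal:
        return "N"  # poor sequence quality: no confident call
    if second_height >= het_ratio * top_height:
        return IUPAC_TWO_BASE[frozenset(top_base + second_base)]
    return top_base


print(call_base({"A": 5, "C": 480, "G": 8, "T": 470}))   # two equal peaks
print(call_base({"A": 20, "C": 35, "G": 30, "T": 25}))   # weak signal
print(call_base({"A": 12, "C": 9, "G": 650, "T": 15}))   # clean single peak
```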

Contig Assembly. Figure 2.5 An aligned-reads window in consed

Assembling DNA seq. fragments: NCBI dbEST database http://www.ncbi.nlm.nih.gov/Database/ View the EST statistics; FTP the EST files

Assembling DNA seq. fragments: IFOM assembler http://bio.ifom-firc.it/ASSEMBLY/assemble.html Multiple EST seqs. → contig. The max. number of seqs. you can enter is 10000. Example: use gi (15744427, 19124086, 8147732, 8147734, 20393914, 13728017), with lengths (850, 1062, 634, 596, 869, 768) bp, resulting in a single contig consensus seq. that can be used for a similarity search against a db.

Assembling DNA seq. fragments – 6 GI fragments >gi|15744427|gb|BI752849.1|BI752849 603022060F1 NIH_MGC_114 Homo sapiens cDNA clone IMAGE:5192510 5', mRNA sequenceCGGGGTGCTGCGAGCGCGGGGCCAGACCAAGGCGGGCCCGGAGCGGAACTTCGGTCCCAGCTCGGTCCCCGGCTCAGTCCCGACGTGGAACTCAGCAGCGGAGGCTGGACGCTTGCATGGCGCTTGAGAGATTCCATCGTGCCTGGCTCACATAAGCGCTTCCTGGAAGTGAAGTCGTGCTGTCCTGAACGCGGGCCAGGCAGCTGCGGCCTGGGGGTTTTGGAGTGATCACGAATGAGCAAGGCGTTTGGGCTCCTGAGGCAAATCTGTCAGTCCATCCTGGCTGAGTCCTCGCAGTCCCCGGCAGATCTTGAAGAAAAGAAGGAAGAAGACAGCAACATGAAGAGAGAGCAGCCCAGAGAGCGTCCCAGGGCCTGGGACTACCCTCATGGCCTGGTTGGTTTACACAACATTGGACAGACCTGCTGCCTTAACTCCTTGATTCAGGTGTTCGTAATGAATGTGGACTTCACCAGGATATTGAAGAGGATCACGGTGCCCAGGGGAGCTGACGAGCAGAGGAGAAGCGTCCCTTTCCAGATGCTTCTGCTGCTGGAGAAGATGCAGGACAGCCGGCAGAAAGCAGTGCGGCCCCTGGAGCTGGCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTTTGTCCAACATGATGCTGCCAACTGTACCTCAAACTCTGGAACCTGATTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATATGATCCGGGTGAAGGACTCCTTGATATGCGTTGACTGTGCCATGGGAGAGTAGCAGAAAACAGCAGCATGCTCAACCTCCCACTTTCTCTATTGGATGTGGACTCAAAGCCCT >gi|19124086|gb|BM807263.1|BM807263 AGENCOURT_6574903 NIH_MGC_124 Homo sapiens cDNA clone IMAGE:5732238 5', mRNA 
sequenceGTCCGGAATTCCCGGGATCTCAGCAGCGGAGGCTGGACGCTTGCATGGCGCTTGAGAGATTCCATCGTGCCTGGCTCACATAAGCGCTTCCTGGAAGTGAAGTCGTGCTGTCCTGAACGCGGGCCAGGCAGCTGCGGCCTGGGGGTTTTGGAGTGATCACGAATGAGCAAGGCGTTTGGGCTCCTGAGGCAAATCTGTCAGTCCATCCTGGCTGAGTCCTCGCAGTCCCCGGCAGATCTTGAAGAAAAGAAGGAAGAAGACAGCAACATGAAGAGAGAGCAGCCCAGAGAGCGTCCCAGGGCCTGGGACTACCCTCATGGCCTGGTTGGTTTACACAACATTGGACAGACCTGCTGCCTTAACTCCTTGATTCAGGTGTTCGTAATGAATGTGGACTTCACCAGGATATTGAAGAGGATCACGGTGCCCAGGGGAGCTGACGAGCAGAGGAGAAGCGTCCCTTTCCAGATGCTTCTGCTGCTGGAGAAGATGCAGGACAGCCGGCAGAAAGCAGTGCGGCCCCTGGAGCTGGCCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTTTGTCCAACATGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGATTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATACGATCCGGGTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGAGAGTAGCAGAAACAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATGTGGACTCAAAGCCCCTGGAAGACACTGGAGGACGCCCTGCACTGCTTCTTCCAGCCCAGGAGTTATCAAGCAAAAGCAAGTGCTTCTGTGAGAACTGTGGGAAGAAGACCCGCGGGGAACAGGGTCCTGAAACCTGACCATTTTGCCCCAGACCTTGACCAATCCACCTCATGGCGATTCTCCCTCCAGGAATTCCCCGACCGAGAAAAAATTGGCCACTTCCCCGGAATTTCCCCCCAAAAACTTGGAATTTCACCCAAAACCTTTCCCATGTAAACCCGGAAACCCTGGGGAAGGCT >gi|8147732|gb|AW958049.1|AW958049 EST370119 MAGE resequences, MAGE Homo sapiens cDNA, mRNA sequenceGAACTAGTGGATCCCCCGGGCTGCAGGAATTCGGCACGAGTGGAGCTGGCCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTTTGTCCAACATGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGATTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATATGATCCGGGTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGAGAGTAGCAGAAACAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATGTGGACTCAAAGCCCCTGAAGACACTGGAGGACGCCCTGCACTGCTTCTTCCAGCCCAGGGAGTTATCAAGCAAAAGCAAGTGCTTCTGTGAGAACTGTGGGAAGAAGACCCGTGGGAAACAGGTCTTGAAGCTGACCCATTTGCCCCAGACCCTGACAATCCACCTCATGCGATTCTTCATCAGGAATTCACAGACGAGAAAGATCTGCCACTCCCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGAACCTTCCAATGAAGCGAGAATCTTGTGAAGCTGAAGAACAGTCTGGAAGGCAAGATGAGCTTTTTGCTGGGAATGCGCACGTGGAAAGGCAGAATTCGGTCATAA >gi|8147734|gb|AW958051.1|AW958051 EST370121 MAGE resequences, MAGE Homo sapiens cDNA, mRNA 
sequenceGGAGCTGGCCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTTTGTCCAACATGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGATTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATATGATCCGGGTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGAGAGTAGCAGAAACAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATGTGGACTCAAAGCCCCTGAAGACACTGGAGGACGCCCTGCACTGCTTCTTCCAGCCCAGGGAGTTATCAAGCAAAAGCAAGTGCTTCTGTGAGAACTGTGGGAAGAAGACCCGTGGGAAACAGGTCTTGAAGCTGACCCATTTGCCCCAGACCCTGACAATCCACCTTATGCGATTCTCCATCAGGAATTCACAGACGAGAAAGATCTGCCACTCCCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGATCCTTCCAATGAAGCGAGAGTCTTGTGATGCTTGAGGAGCAATCTGGAGGGCATATGAGCTTTTTGCTGTGATTGCGCACCTGGGAATGCAAAACTCCGTCATTACTG >gi|20393914|gb|BQ213074.1|BQ213074 AGENCOURT_7559959 NIH_MGC_72 Homo sapiens cDNA clone IMAGE:6055692 5', mRNA sequenceAGATCTGCCACTCCCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGATCCTTCCAATGAAGCGAGAGTCTTGTGATGCTGAGGAGCAGTCTGGAGGGCAGTATGAGCTTTTTGCTGTGATTGCGCACGTGGGAATGGCAGACTCCGGTCATTACTGTGTCTACATCCGGAATGCTGTGGATGGAAAATGGTTCTGCTTCAATGACTCCAATATTTGCTTGGTGTCCTGGGAAGACATCCAGTGTACCTACGGAAATCCTAACTACCACTGGCAGGAAACTGCATATCTTCTGGTTTACATGAAGATGGAGTGCTAATGGAAATGCCCAAAACCTTCAGAGATTGACACGCTGTCATTTTCCATTTCCGTTCCTGGATCTACGGAGTCTTCTAAGAGATTTTGCAATGAGGAGAAGCATTGTTTTCAAACTATATAACTGAGCCTTATTTATAATTAGGGATATTATCAAAATATGTAACCATGAGGCCCCTCAGGTCCTGATCAGTCAGAATGGATGCTTTCACCAGCAGACCCGGCCATGTGGCTGCTCGGTCCTGGGTGCTCGCTGCTGTGCAAGACATTAGCCCTTTAGTTATGAGCCTGTGGGAACTTCAGGGGTTCCCAGTGGGGAGAGCAGTGGCAGTGGGAGGCATCTGGGGGGCCAAGGGCAGTGGCAGGGGGTATTTCAGTATTATACCACTGCTGTGACCAGACTTGTATACTGGCTGAATATCAGGGCTGGTTGTAATTTTTTCCCTTTGAAGAAACACCATTAATTTCCTAATGAATCCAAGTGGTTTGTAACTTGCCTATTCCTTTTATTCCAGCAAAAAATTAATTGATCATCCCCTCCCCCAAAAAATAGGGG >gi|13728017|gb|BG206330.1|BG206330 RST25778 Athersys RAGE Library Homo sapiens cDNA, mRNA 
sequenceTCCTGGGAAGACATCCAGTGTACCTACGGAAATCCTAACTACCACTGGCAGGAAACTGCATATCTTCTGGTTTACATGAAGATGGAGTGCTAATGGAAATGCCCAAAACCTTCAGAGATTGACACGCTGTCATTTTCCATTTCCGTTCCTGGATCTACGGAGTCTTCTAAGAGATTTTGCAATGAGGAGAAGCATTGTTTTCAAACTATATAACTGAGCCTTATTTATAATTAGGGATATTATCAAAATATGTAACCATGAGGCCCCTCAGGTCCTGATCAGTCAGAATGGATGCTTTCACCAGCAGACCCGGCCATGTGGCTGCTCGGTCCTGGGTGCTCGCTGCTGTGCAAGACATTAGCCCTTTAGTTATGAGCCTGTGGGAACTTCAGGGGTTCCCAGTGGGGAGAGCAGTGGCAGTGGGAGGCATCTGGGGGCCAAAGGTCAGTGGCAGGGGGTATTTCAGTATTATACAACTGCTGTGACCAGACTTGTATACTGGCTGAATATCAGTGCTGTTTGTAATTTTTCACTTTGAGAACCAACATTAATTCCATATGAATCAAGTGTTTTGGAACTGCTATTCATTTATTCAGCAAATATTTATTGGTCATCTTTTCTCCATAAGATAGTGTGATAAACACAGCATGAATAAAGGTATTTTCCACACAGACAAGTGTTTTTTCACAAAATTATTNATTTTGNTGGGGCTGTGGCGGCCGCTTCCTTTATGGGGGGGAATTTAGAACCCGTTCCTGACGCGGGGGN

Assembling DNA seq. fragments: list of assembled fragments

Assembling DNA seq. fragments Overlap details

Assembling DNA seq. fragments End of overlap details Assembled mRNA sequence

Box 2.1 Pairwise Sequence Alignment
The most important class of bioinformatics tools: pairwise alignment of DNA and protein seqs.
         alignment 1   alignment 2
Seq. 1   ACGCTGA       ACGCTGA
Seq. 2   A--CTGT       ACTGT--
Seeks alignments → high seq. identity, few mismatches and gaps. Assumption: the observed identity in the seqs. to be aligned is the result of either chance or a shared evolutionary origin. Identity ≠ similarity. Treating sequence identity as homology is a risky assumption: sequence identity ≠ homology.

Box 2.1 Pairwise Sequence Alignment. The same true alignment can arise through different evolutionary events. Scoring scheme: substitution → -1, indel → -5, match → 3. Figure A Common evolutionary events and their effects on alignment (scores shown in the figure: 9, 5, 4, 4).
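As a quick check of this scoring scheme, the following sketch scores a gapped alignment column by column. The function name is made up for this example; the two test alignments are the ones shown on the first Box 2.1 slide (ACGCTGA against A--CTGT and against ACTGT--).

```python
def alignment_score(row1: str, row2: str, match: int = 3,
                    mismatch: int = -1, indel: int = -5) -> int:
    """Sum the column scores of two equal-length gapped rows ('-' is a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += indel       # a residue aligned with a gap
        elif a == b:
            score += match
        else:
            score += mismatch    # substitution
    return score


print(alignment_score("ACGCTGA", "A--CTGT"))  # 4 matches, 1 mismatch, 2 indels
print(alignment_score("ACGCTGA", "ACTGT--"))  # 3 matches, 2 mismatches, 2 indels
```

With match = 3, substitution = -1 and indel = -5, the first alignment scores 1 and the second scores -3, so the first is preferred.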

Box 2.1 Pairwise Sequence Alignment. Find the optimal score → the best guess for the true alignment. Find the optimal pairwise alignment of two seqs. → insert gaps into one or both of them → maximize the total alignment score. Dynamic programming (DP): Needleman and Wunsch (1970), Smith and Waterman (1981); this algorithm guarantees that we find all optimal alignments of two seqs. of lengths m and n. BLAST builds on DP but uses heuristics to gain speed, at the cost of that guarantee. Prof. Waterman http://www.usc.edu/dept/LAS/biosci/faculty/waterman.html

Box 2.1 Pairwise Sequence Alignment. The score for alignment of i residues of sequence 1 against j residues of sequence 2 is given by
S(i,j) = max { S(i-1,j-1) + c(i,j), S(i-1,j) + c(i,-), S(i,j-1) + c(-,j) }
where c(i,j) = the score for alignment of residues i and j, which takes the value 3 for a match or -1 for a mismatch, and c(i,-) = c(-,j) = the penalty for aligning a residue with a gap, which takes the value -5.

Box 2.1 Pairwise Sequence Alignment. The entry for S(1,1) is the maximum of the following three values: S(0,0) + c(A,A) = 0 + 3 = 3 [c(A,A) = c(1,1)]; S(0,1) + c(A,-) = -5 + -5 = -10 [c(A,-) = c(1,-)]; S(1,0) + c(-,A) = -5 + -5 = -10 [c(-,A) = c(-,1)]. Similarly, one finds S(2,1) as the maximum of three values: (-5)-1 = -6; 3-5 = -2; and (-10)-5 = -15 → the best entry is the addition of the C indel to the A-A match, for a score of -2 (see next page).

Box 2.1 Pairwise Sequence Alignment The alignment matrix of sequences 1 and 2 S(2,1) = max {S(1,0) + c(2,1), S(1,1) + c(2,-), S(2,0) + c(-,1)} = max { S(1,0) + c(C,A), S(1,1) + c(C,-), S(2,0) + c(-,A) } = max { -5-1, 3-5, -10-5 } = -2
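The matrix on this slide can be filled in programmatically. The sketch below is a plain Needleman-Wunsch style fill using the scores from this lecture (match 3, mismatch -1, gap -5 per position); the function name and the convention that row index i refers to sequence 1 and column index j to sequence 2 are choices made for this example.

```python
def nw_matrix(seq1: str, seq2: str, match: int = 3,
              mismatch: int = -1, gap: int = -5) -> list[list[int]]:
    """Fill S so that S[i][j] is the best score for seq1[:i] versus seq2[:j]."""
    m, n = len(seq1), len(seq2)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap                    # seq1 prefix against gaps only
    for j in range(1, n + 1):
        S[0][j] = j * gap                    # seq2 prefix against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = match if seq1[i - 1] == seq2[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + c,   # align residue i with residue j
                          S[i - 1][j] + gap,     # residue i against a gap
                          S[i][j - 1] + gap)     # residue j against a gap
    return S


S = nw_matrix("ACGCTGA", "ACTGT")
print("S(1,1) =", S[1][1])
print("S(2,1) =", S[2][1])
print("S(7,5) =", S[7][5])
```

The printed entries agree with the worked values on the slides: S(1,1) = 3, S(2,1) = -2, and the final cell S(7,5) = 1.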

Box 2.1 Pairwise Sequence Alignment
Traceback → determine the actual alignment. Start from the top right-hand corner → the (7,5) cell. For example, the 1 in the (7,5) cell could only be reached by the addition of the mismatch A-T.
ACGCTGA      ACGCTGA
A--CTGT  or  AC--TGT
4 matches, 1 mismatch, 2 indels. Ambiguity: it has to do with which C in seq. 1 aligns with the C in seq. 2.
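A traceback that follows every move consistent with the matrix recovers all optimal alignments, including the ambiguity noted on this slide. The following is a self-contained sketch under the same scoring assumptions (match 3, mismatch -1, gap -5); the function name is hypothetical, and the recursive enumeration is only practical for short sequences.

```python
def optimal_alignments(seq1: str, seq2: str, match: int = 3,
                       mismatch: int = -1, gap: int = -5):
    """Return (best score, list of all optimal global alignments)."""
    m, n = len(seq1), len(seq2)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = match if seq1[i - 1] == seq2[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + c, S[i - 1][j] + gap, S[i][j - 1] + gap)

    results = []

    def walk(i: int, j: int, row1: str, row2: str) -> None:
        # Rows are built back to front, so prepend each new column.
        if i == 0 and j == 0:
            results.append((row1, row2))
            return
        if i > 0 and j > 0:
            c = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + c:
                walk(i - 1, j - 1, seq1[i - 1] + row1, seq2[j - 1] + row2)
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            walk(i - 1, j, seq1[i - 1] + row1, "-" + row2)
        if j > 0 and S[i][j] == S[i][j - 1] + gap:
            walk(i, j - 1, "-" + row1, seq2[j - 1] + row2)

    walk(m, n, "", "")
    return S[m][n], results


score, alignments = optimal_alignments("ACGCTGA", "ACTGT")
print("optimal score:", score)
for row1, row2 in alignments:
    print(row1)
    print(row2)
    print()
```

For ACGCTGA against ACTGT this yields the two alignments shown above, which differ only in which C of seq. 1 pairs with the C of seq. 2.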

Box 2.1 Pairwise Sequence Alignment. Parameter settings: gap penalties. Default settings are the easiest to use, but they do not necessarily yield the correct alignment. Constant penalty → independent of the length of the gap: A. Proportional penalty → penalty is proportional to the length L of the gap: BL (that is what we used in this lecture). Affine gap penalty → gap-opening penalty + gap-extension penalty = A + BL. There is no rule for predicting the penalty that best suits the alignment. Optimal penalties vary from seq. to seq. → it is a matter of trial and error. Usually A > B, because opening a gap should cost more than extending it (usually A/B ~ 10). Hints: (1) when comparing distantly related seqs., a high A and a very low B often give the best results → gaps are penalized more for their existence than for their length; (2) when comparing closely related seqs., penalize both gap opening and gap extension.
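The three gap-penalty models differ only in how the cost grows with gap length L. A minimal sketch, with penalties written as positive costs to be subtracted from the alignment score; the function names and the example values A = 10 and B = 1 are illustrative assumptions (chosen to match the A/B ~ 10 rule of thumb), not recommended settings.

```python
def constant_gap(length: int, A: float) -> float:
    """Same cost A for any gap, regardless of its length."""
    return A if length > 0 else 0.0


def proportional_gap(length: int, B: float) -> float:
    """Cost B per gapped position: B * L."""
    return B * length


def affine_gap(length: int, A: float, B: float) -> float:
    """Gap-opening cost A plus extension cost B per position: A + B * L."""
    return A + B * length if length > 0 else 0.0


A, B = 10.0, 1.0  # illustrative values only
for L in (1, 3, 10):
    print(L, constant_gap(L, A), proportional_gap(L, B), affine_gap(L, A, B))
```

With these values one gap of length 10 costs 20 under the affine model, while ten separate gaps of length 1 cost 110, which is how a high A and low B favor a few long gaps over many short ones.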

Exercise 2.2 Computing an optimal sequence alignment
Two score schemes:
(1) Gap penalty = -5, mismatch = -1, match = 3: first alignment score = 5*3 + 2*(-1) = 13; second/third alignment score = 6*3 + 2*(-5) = 8
(2) Gap penalty = -1, mismatch = -1, match = 3: first alignment score = 5*3 + 2*(-1) = 13; second/third alignment score = 6*3 + 2*(-1) = 16
A more serious problem: identifying the wrong alignment.
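The arithmetic in this exercise shows that the preferred alignment flips when only the gap penalty changes. A short sketch that recomputes the four scores from the counts used above (5 matches and 2 mismatches for the first alignment, 6 matches and 2 gaps for the second/third); the function name is invented for this example.

```python
def score_from_counts(matches: int, mismatches: int, gaps: int,
                      match: int = 3, mismatch: int = -1, gap: int = -5) -> int:
    """Alignment score from column counts under a linear scoring scheme."""
    return matches * match + mismatches * mismatch + gaps * gap


for gap_penalty in (-5, -1):
    first = score_from_counts(5, 2, 0, gap=gap_penalty)
    second = score_from_counts(6, 0, 2, gap=gap_penalty)
    winner = "first" if first > second else "second/third"
    print(f"gap = {gap_penalty}: first = {first}, "
          f"second/third = {second}, preferred = {winner}")
```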

Exercise 2.2 Computing an optimal sequence alignment Gap penalty = -5 Gap penalty = -1

Emerging Sequencing Methods. Costs of genome sequencing: Mid-2000: $30-50 million to sequence a mammalian genome. Target: $1000 per human genome by the year 2010. J. Craig Venter Foundation: $500,000 award for the first person to achieve this goal. New technologies: Sequencing by hybridization (SBH): detects whether or not an exact match is present in a sample of DNA. Mass spectrometric technique: ionized fragments, time of flight. Nanopore sequencing strategies: ultrafast and relatively inexpensive sequencing of long DNA fragments. Single-molecule approach: Solexa, Visigen and Genovoxx. Single-molecule polony sequencing.

Emerging Sequencing Methods. Figure 2.6 Single-molecule polony sequencing. Dilute solutions of DNA are plated onto a glass microscope slide. In situ PCR produces thousands of tiny colonies of DNA (polonies: PCR colonies), into which single dye-labeled dNTPs are incorporated. The slide is read after each cycle of incorporation of a new base, allowing short seqs. to be determined. Each numbered polony produces a short 20-25 nucleotide seq. as shown. These can then be assembled computationally into a contiguous seq.

Genome Sequencing. Whole genome seqs. are assembled from ~10^5 fragments, each typically between 500 and 1000 bp in length. Two general approaches for fragmentation and assembly: (1) hierarchical seq. (2) shotgun seq. For a historical overview, see http://www.sciencemag.org/feature/plus/sfg/human/timeline1.shtml Hierarchical seq.: first develop a low-resolution physical map, so that the seq. is obtained in large ordered pieces. Shotgun seq.: break the genome into small fragments and use computer algorithms to assemble them; see Figure 2.7. Most new genome projects adopt the shotgun approach. Figure 2.7 (Part 1) Hierarchical versus shotgun sequencing

Genome Sequencing – hierarchical sequencing. Top-down, map-based or clone-by-clone strategy, from the late 1980s. Genome → broken into small fragments. The relative locations of the fragments are known BEFORE sequencing. Advantages: it fostered (helped develop) the assembly of high-resolution physical and genetic maps, and it allowed groups working around the globe to collaborate. Technology for cloning large fragments of genomes progressed rapidly throughout the 1990s, e.g. for E. coli, S. cerevisiae, C. elegans, A. thaliana. Top-down seq. → clone seqs. as manageable units of fragments (50 – 200 kb in length). Cloning vectors: BAC (~300 kb), PAC (~100 kb), phage-derived cosmids

Genome Sequencing – Shotgun sequencing. In the shotgun approach, no attempt is made to order the clones in advance. Instead, the whole genome is assembled using computer algorithms that order contigs based on their overlapping sequences. Figure 2.7 (Part 2) Shotgun sequencing

Figure 2.8 Cloning vectors used in genome sequencing

Genome Sequencing – hierarchical sequencing. DNA libraries: made by restriction enzyme (RE) digestion or sonication (ultrasonic treatment). Fragments are ligated into a multiple cloning site (mcs) in the vector. Aim for 5- to 10-fold redundancy → the library contains 5 to 10 times the genome. Each clone will have different ends → it is possible to select a scaffold of clones that forms a contiguous seq. coverage, a tiling path (as in laying tiles), by aligning the regions of overlap (Fig. 2.9). The tiling path can be assembled using a combination of 3 methods: (1) hybridization, (2) fingerprinting, and (3) end-sequencing

Genome Sequencing – hierarchical sequencing. A minimal tiling path through a library of aligned BAC clones that ensures complete coverage of the chromosome is chosen. After sequencing independent shotgun libraries for each BAC, small gaps in the sequenced clone contigs remain. These are closed as far as possible by merging the two BAC sequences, as well as by the addition of mate-pair information (yellow) and cDNA structural information (red), which establishes the orientation and distance between cloned segments. Figure 2.9 Hierarchical assembly of a sequence-contig scaffold (supercontig)

Genome Sequencing – hierarchical sequencing
Hybridization: all of the clones in a library that carry a particular seq. can be identified rapidly by hybridizing a small radioactively or chemically labeled probe containing the seq. to a filter on which an array of ~10,000 clones is printed (Fig. 2.10A).
Fingerprinting: study the restriction enzyme (RE) patterns. One way to assemble contigs of large-insert clones is to compare and align them according to their RE patterns. An RE with a ~6 bp recognition site cuts on average once every 4^6 = 2^12 ≈ 4000 bp. For a 100 kb BAC, 100 kb / 4 kb ≈ 20 – 30 fragments. These fragments can be separated by electrophoresis → fingerprint profile → BAC alignment by gel → software alignment → overlaps → contigs → assembly of ~Mb-length contigs.
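The fragment-count estimate above is simple arithmetic and can be checked directly. The sketch assumes a random sequence with equal base frequencies, which is an idealization (real genomes deviate from it), and the function names are made up for this example.

```python
def expected_site_spacing(site_length: int) -> int:
    """Average distance between occurrences of one specific site of this length,
    assuming the four bases are equally frequent and independent."""
    return 4 ** site_length


def expected_fragments(insert_length_bp: int, site_length: int) -> float:
    """Rough number of fragments from a complete digest of one insert."""
    return insert_length_bp / expected_site_spacing(site_length)


print(expected_site_spacing(6))                  # 4^6 = 2^12
print(round(expected_fragments(100_000, 6), 1))  # fragments per 100 kb BAC
```

For a 6 bp site this gives one cut per 4096 bp, or roughly 24 fragments for a 100 kb BAC, consistent with the 20 – 30 quoted on the slide.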

Genome Sequencing – hierarchical sequencing. Figure 2.10 Aligning BAC clones by hybridization and fingerprinting. (A) A macroarray of BAC clones is probed with a short, radioactive fragment to identify all BACs that carry a specific fragment. These clones are digested with an RE, end-labeled, and separated by gel electrophoresis. Software converts the bands to a virtual profile, shown hypothetically for a small portion of four bands (highlighted box in part B). Shared bands (red or blue) imply that the two clones share the same seq. Green indicates the vector band common to all clones. The fingerprint profile is then converted into a BAC alignment. In this example, clone 2 does not share any bands with the others and so is placed into a separate BAC contig, while the other three clones form a tiling path.
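The band-sharing logic in this figure can be sketched as a set comparison followed by grouping. The band sizes below are invented for illustration (they are not read from the figure), exact size matching stands in for the tolerance-based matching that real fingerprinting software needs, and the function names are hypothetical.

```python
from itertools import combinations


def shared_bands(a, b, vector_bands=frozenset()):
    """Bands present in both clones, ignoring bands that come from the vector."""
    return (set(a) & set(b)) - set(vector_bands)


def group_clones(fingerprints, vector_bands=frozenset(), min_shared=1):
    """Put clones that share bands, directly or through other clones, together."""
    groups = {name: {name} for name in fingerprints}
    for x, y in combinations(fingerprints, 2):
        if len(shared_bands(fingerprints[x], fingerprints[y], vector_bands)) >= min_shared:
            merged = groups[x] | groups[y]
            for name in merged:
                groups[name] = merged
    unique = {frozenset(group) for group in groups.values()}
    return sorted(sorted(group) for group in unique)


# Hypothetical band sizes in bp; 7000 plays the role of the common vector band.
fingerprints = {
    "clone1": {7000, 1200, 3400, 5100},
    "clone2": {7000, 900, 2600, 4400},
    "clone3": {7000, 3400, 5100, 1800},
    "clone4": {7000, 1800, 2200, 6100},
}
print(group_clones(fingerprints, vector_bands={7000}))
```

Here clone2 shares only the vector band with the others and ends up in its own group, mirroring the example described in the figure legend.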

Genome Sequencing – hierarchical sequencing. End-sequencing: fill in the gaps after fingerprinting. How? By sequencing both ends of the collection of BAC clones. Once a critical threshold of seqs. has been reached → overlaps can be found. For example, along a 10 Mbp genome, the end seqs. of 10,000 BAC clones provide a seq. tag every 5 kb (for a 5-fold coverage): 10 Mbp / 10,000 BACs → 1 kbp per BAC; five-fold → 10 Mb / 2,000 BACs ~ 5 kb (the seq. tag distance). Given this tag density, it is possible to close gaps < 50 kb. Once the tiling path is chosen → shotgun the BAC clones into small fragments. Subcloning: use M13 phagemid (~1 kb; exists as dsDNA and ssDNA) or clone 2 ~ 3 kb fragments into a plasmid vector

Genome Sequencing – Shotgun sequencing. Use computer algorithms to assemble the seqs. (~100,000). About 5- to 10-fold redundancy for each fragment. Library: from a single whole genome. After MSA → screen out repetitive seqs. and overlap reads of the same seq. → generate unitigs and scaffolds → >90% of the seqs. are assembled. Finishing phase: closing gaps and cleaning up ambiguities → takes as much time as the shotgun phase. Users are asked to trust the assemblies. Celera Genomics used the following software to assemble the seqs. Screener: masks (does not remove) seqs. that contain repetitive DNA (such as microsatellites, LINE, Alu repeats, retrotransposons and ribosomal DNA). Overlapper: compares every unscreened read against every other unscreened read, searching for overlaps of a predetermined length and identity. Parallel processing on 40 supercomputers, each with 4 GB RAM, allowed the 27 M screened human seq. reads to be overlapped in < 5 days. Repeat-induced overlaps of a seq. are resolved using the Unitigger (see Figure 2.11). Scaffolder: uses mate-pair information to link U-unitigs into scaffold contigs.
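The idea behind the Overlapper step, comparing every read against every other read and keeping overlaps above a minimum length, can be illustrated with a toy greedy assembler. This is not Celera's algorithm: the sketch requires exact suffix-prefix matches (the real tool allows a predetermined identity below 100%), ignores reverse complements and repeats, and uses hypothetical function names and reads.

```python
def overlap_length(a: str, b: str, min_overlap: int) -> int:
    """Longest suffix of a that equals a prefix of b, if at least min_overlap."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):          # all-against-all comparison
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap_length(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs                       # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


reads = ["TAGCCACG", "CACGTATC", "TATCGAACA"]  # hypothetical overlapping reads
print(greedy_assemble(reads))
```

The all-against-all loop is quadratic in the number of reads, which is why overlapping 27 million reads needed parallel hardware; repetitive seqs. would also create false overlaps here, which is the problem the Screener and Unitigger address.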

Genome Sequencing – Shotgun sequencing. Figure 2.11 U-unitigs and repeat resolution. (A) Seq. alignment between two or more shotgun clones can arise between unique seqs. (left) or repetitive seqs. (right). (B) The Overlapper aligns unitigs, which are identified as unique seq. alignments (U-unitigs) or overcollapsed repeats (blue). Two contigs can be aligned and oriented by using mate-pair seq. information from the ends of longer (10- or 50-kb) clones, as shown at the bottom, while mate-pairs from 2-kb fragments allow assembly of scaffolds despite the presence of simple repeats such as microsatellites (blue) that are masked before performing alignments.

Genome Sequencing – Shotgun sequencing. Figure 2.12 Proportion of fly and human genomes in large scaffolds. The figure shows the estimated coverage of the fly and human whole genomes after initial assembly: in both cases, 84% or more of the genome was covered by scaffolds at least 100 kb in length, while most scaffolds were in the Mb range. Increasing seq. coverage from 5x to 10x → about a 10% increase in the proportion of scaffolds of lengths up to 1 Mb. The plot shows the percentage of scaffolds that have a length greater than that indicated for the fly 10x, human 8x (CSA) and human 5x (whole genome assembly, WGA) seqs. generated by Celera. The fly and CSA assemblies include shredded seqs. generated from BAC clones by public genome sequencing efforts.

NCTS http://math.cts.nthu.edu.tw/Mathematics/conference-PT2005.html UCSD http://research.calit2.net/recomb-workshop05/