Phred/Phrap/Consed Analysis A User’s View Arthur Gruber International Training Course on Bioinformatics Applied to Genomic Studies Rio de Janeiro 2001.

Phred/Phrap/Consed Analysis: A User’s View
Arthur Gruber
Faculty of Veterinary Medicine and Zootechny, University of São Paulo, Brazil
International Training Course on Bioinformatics Applied to Genomic Studies, Rio de Janeiro, 2001

What is Phred/Phrap/Consed?
Phred/Phrap/Consed is a software package, distributed worldwide, for:
a. Reading trace files (chromatograms);
b. Assigning a quality (confidence) value to each individual base;
c. Identifying and masking vector and repeat sequences;
d. Assembling sequences;
e. Viewing and editing assemblies;
f. Automatic finishing.

Why assemble?
- Current DNA sequencing methods generate reads of limited length, set by the resolution limit of electrophoresis.
- Whole genomes or large clones need to be fragmented into a clone library.
- Short fragments are sequenced at random (shotgun approach), and the reads are then assembled to form the final consensus sequence.

How to deal with the enormous number of reads generated by high-throughput DNA sequencers? Sanger Centre

Phred Genome Research 8: , 1998

Phred
Phred is a program that performs several tasks:
a. Reads trace files – compatible with most file formats: SCF (standard chromatogram format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR.
b. Calls bases – assigns a base to each identified peak, with a lower error rate than the standard base-calling programs.

Phred (continued)
c. Assigns quality values to the bases – a “Phred value” based on an error rate estimate calculated for each individual base.
d. Creates output files – base calls and quality values are written to output files.

Trace File High quality region – no ambiguities (Ns)

Trace File Medium quality region – some ambiguities (Ns)

Trace File Poor quality region – low confidence

Phred value formula
$q = -10 \log_{10}(p)$
where
q – quality value
p – estimated error probability for a base call
Examples:
q = 20 means p = 0.01 (1 error in 100 bases)
q = 40 means p = 0.0001 (1 error in 10,000 bases)
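The conversion between quality value and error probability can be checked with a few lines of Python. This is only a minimal sketch of the formula on the slide; the function names `phred_to_error_prob` and `error_prob_to_phred` are illustrative and are not part of Phred.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality value q into the estimated error probability p."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an estimated error probability p (0 < p <= 1) into a quality value q."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Print the error probability for a few commonly quoted quality values.
for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"q = {q}: p = {p:g} (1 error in {round(1 / p):,} bases)")
```

Because the scale is logarithmic, every 10 extra quality points correspond to a tenfold lower estimated error probability.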

The structure of a phd file
Each line between BEGIN_DNA and END_DNA holds a base call, its quality value and its peak position in the trace (for example "t 8 5"); the values are not shown for the remaining bases.
BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_COMMENT
CHROMAT_FILE: EBV10201A02.g
ABI_THUMBPRINT:
PHRED_VERSION: g
CALL_METHOD: phred
QUALITY_LEVELS: 99
TIME: Thu May 24 00:18:
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX:
TRIM:
CHEM: term
DYE: big
END_COMMENT
BEGIN_DNA
t 8 5
c a c t a g t c a g c n c n n n n n n n c n
END_DNA
END_SEQUENCE
t a a a g c c t g g g g t g c c t a a t g t g t c g n c t t c t c c c t c g g a g g
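A phd file with this layout is simple to read programmatically. The sketch below is a minimal parser, assuming the usual layout of one "base quality position" triple per line inside the BEGIN_DNA block. The function name `parse_phd` and the sample record (read name, quality values, peak positions) are hypothetical and are not taken from the slide.

```python
from typing import Dict, List, Tuple

BaseCall = Tuple[str, int, int]  # (base, quality value, peak position)


def parse_phd(text: str) -> Tuple[str, Dict[str, str], List[BaseCall]]:
    """Parse one phd record into (read name, comment fields, base calls)."""
    name = ""
    comments: Dict[str, str] = {}
    calls: List[BaseCall] = []
    section = None  # "comment", "dna", or None when outside both blocks
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("BEGIN_SEQUENCE"):
            parts = line.split(maxsplit=1)
            name = parts[1] if len(parts) > 1 else ""
        elif line == "BEGIN_COMMENT":
            section = "comment"
        elif line == "BEGIN_DNA":
            section = "dna"
        elif line in ("END_COMMENT", "END_DNA"):
            section = None
        elif line == "END_SEQUENCE":
            break
        elif section == "comment":
            # Split at the first colon only, so values such as times stay intact.
            key, _, value = line.partition(":")
            comments[key.strip()] = value.strip()
        elif section == "dna":
            base, quality, position = line.split()
            calls.append((base, int(quality), int(position)))
    return name, comments, calls


# Hypothetical record used only to exercise the parser.
SAMPLE_PHD = """\
BEGIN_SEQUENCE example_read.g
BEGIN_COMMENT
CHROMAT_FILE: example_read.g
CALL_METHOD: phred
QUALITY_LEVELS: 99
CHEM: term
DYE: big
END_COMMENT
BEGIN_DNA
t 8 5
c 12 17
a 25 30
n 4 41
END_DNA
END_SEQUENCE
"""

name, comments, calls = parse_phd(SAMPLE_PHD)
sequence = "".join(base for base, _, _ in calls)
print(name, comments["CHEM"], sequence, [q for _, q, _ in calls])
```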

Phrap
Phragment Assembly Program or… Phil’s Revised Assembly Program
Phrap is a program for assembling shotgun DNA sequence data.
Key Features:
a. Uses the entire read content – no need for trimming.
b. Combines user-supplied data (e.g. Repbase) with internally computed data – better assembly accuracy in the presence of repeats.
c. The contig sequence is a mosaic of the highest quality parts of the reads – it’s not a consensus!

Phrap (continued)
Key Features:
d. Provides extensive information about the assembly – contained in the phrap.out, *.ace and *.screen.contigs.qual files.
e. Handles very large datasets – hundreds of thousands of reads are easily handled.
f. Generates output files – these contain important data and enable visualization by other programs.
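The distinction between a quality-based mosaic and a plain consensus (key feature c) can be illustrated with a toy example. The sketch below is not Phrap's algorithm: it is a column-by-column simplification that assumes the reads are already aligned and of equal length, with "-" marking positions a read does not cover. All names and data are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Read = Tuple[str, List[int]]  # (aligned bases, per-base quality values)


def majority_consensus(reads: List[Read]) -> str:
    """Pick the most frequent base in each column, ignoring quality values."""
    out = []
    for col in range(len(reads[0][0])):
        bases = [seq[col] for seq, _ in reads if seq[col] != "-"]
        out.append(Counter(bases).most_common(1)[0][0] if bases else "n")
    return "".join(out)


def best_quality_mosaic(reads: List[Read]) -> str:
    """Pick, in each column, the base carried by the highest-quality read."""
    out = []
    for col in range(len(reads[0][0])):
        covered = [(quals[col], seq[col]) for seq, quals in reads if seq[col] != "-"]
        out.append(max(covered)[1] if covered else "n")
    return "".join(out)


# Two low-quality reads say "g" at the third position, one high-quality read says "a".
reads = [
    ("acgta", [30, 30, 10, 30, 30]),
    ("acgta", [30, 30, 12, 30, 30]),
    ("acata", [30, 30, 45, 30, 30]),
]
print("majority consensus:", majority_consensus(reads))
print("quality mosaic:    ", best_quality_mosaic(reads))
```

In this toy case the two approaches disagree at the third position: the majority vote follows the two low-quality calls, while the quality-based choice follows the single high-quality call.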

Phrap output files
*.contigs – FASTA file containing the contigs:
- Contigs with more than one read
- Singletons (single reads that match some other contig but could not be merged consistently into it)
*.singlets – FASTA file of the singlet reads:
- Reads with no match to any other read
*.ace – allows the assembly to be viewed in Consed
*.view – required for viewing the assembly in Phrapview
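Since the *.contigs and *.singlets files are plain FASTA, a quick summary of an assembly can be obtained without special tools. The following is a minimal sketch of a FASTA reader; the function name `read_fasta` and the sample records are hypothetical and do not come from a real Phrap run.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA records into a dict mapping record name to sequence."""
    records: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after ">".
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Hypothetical content standing in for a *.contigs file.
SAMPLE_CONTIGS = """\
>example.Contig1 hypothetical header text
ACGTACGTAC
GTACG
>example.Contig2
TTGCA
"""

contigs = read_fasta(SAMPLE_CONTIGS.splitlines())
lengths = {name: len(seq) for name, seq in contigs.items()}
for name, length in lengths.items():
    print(f"{name}: {length} bp")
```

For a real file, the same function can be applied to an open file handle, for example `read_fasta(open(path))`, where `path` points to the *.contigs file.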

Consed Genome Research 8: , 1998

Consed
Consed is a program for viewing and editing assemblies produced by Phrap.
Key Features:
a. Assembly viewer – allows visualization of contigs, the assembly (aligned reads), the quality values of the reads and the final sequence.
b. Trace file viewer – single or multiple trace files can be displayed, allowing a given sequence to be compared across several reads.

Consed (continued)
Key Features:
c. Navigation – identifies and lists regions that fall below a given quality threshold, contain high-quality discrepancies, have single-strand coverage, etc.
d. Autofinish – an automatic set of functions for gap closure, improvement of sequence quality, determination of the relative orientation of contigs, and identification of regions covered by a single read or by reads of a single strand. The program automatically picks primers and chooses the templates.
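The idea behind the navigation feature, listing regions that fall below a quality threshold, can be sketched in a few lines. This is an illustration of the concept only, not Consed's implementation; the function name `low_quality_regions` and the default threshold of 20 are assumptions.

```python
from typing import List, Tuple


def low_quality_regions(qualities: List[int], threshold: int = 20) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, 0-based with end exclusive, where quality < threshold."""
    regions: List[Tuple[int, int]] = []
    start = None  # start index of the low-quality run currently open, if any
    for i, q in enumerate(qualities):
        if q < threshold:
            if start is None:
                start = i
        elif start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        # The sequence ends inside a low-quality run.
        regions.append((start, len(qualities)))
    return regions


# Hypothetical per-base quality values of a short contig.
contig_quals = [10, 15, 40, 45, 50, 18, 12, 9, 40, 5]
regions = low_quality_regions(contig_quals, threshold=20)
for start, end in regions:
    print(f"low-quality region: positions {start}..{end - 1}")
```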

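Phred/Phrap/Consed Pipeline
Directories:
- chromat_dir – trace files (chromatograms)
- phd_dir – phd files written by Phred
- edit_dir – assembly files produced by Phrap and edited with Consed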
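A project following this layout can be prepared with a short script. The sketch below only creates the three directories inside a temporary folder; it does not run Phred, Phrap or Consed. The lowercase directory names, the function name `create_project_layout` and the project name `my_project` are assumptions made for illustration.

```python
import tempfile
from pathlib import Path
from typing import List

# Directory names of the pipeline, written in lowercase here (an assumption).
PIPELINE_DIRS = ("chromat_dir", "phd_dir", "edit_dir")


def create_project_layout(project_root: Path) -> List[Path]:
    """Create the three pipeline directories under project_root and return them."""
    created = []
    for name in PIPELINE_DIRS:
        path = project_root / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Build the layout in a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    dirs = create_project_layout(Path(tmp) / "my_project")
    layout_names = [d.name for d in dirs]
    all_exist = all(d.is_dir() for d in dirs)
    print(layout_names, all_exist)
```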

Finishing Problems
Finishing can be a tedious and difficult task because of DNA sequencing problems:
a. High GC content – genomes with a high GC content are more prone to artifacts such as compressions, sudden signal drops and poor-quality regions. Try using Dye Primer instead of Dye Terminator, changing the chemistry, adding DMSO, increasing the annealing temperature, or using deaza-dGTP instead of dGTP.
b. Palindromic regions – these form strong secondary structures that cause sudden signal drops. Try using deaza-dGTP instead of dGTP, or amplify the problematic region by PCR and sequence the product.
c. Homopolymeric regions – these can reduce DNA synthesis efficiency with some chemistries. Try using Dye Primer instead of Dye Terminator, or change the chemistry (dRhodamine instead of BigDye).

Finishing Problems (continued)
Finishing can also be difficult because of DNA assembly problems:
a. High content of repeats – highly repeated elements reduce the accuracy of DNA assembly. Identify the repeat unit, screen for it with cross_match or RepeatMasker and mask it. Then assemble again and add the repetitive region only at the end. Map the repetitive region with restriction enzymes to estimate its size and the number of repeat units.
b. High AT content – some highly biased genomes (e.g. Plasmodium falciparum; plastid genomes) can pose a problem for assembly programs. This is very difficult to solve. Try to determine a restriction map and combine the mapping with the DNA sequencing data.
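Regions with extreme base composition (high GC or high AT) can be located before finishing by scanning a sequence with a sliding window. The sketch below is a minimal illustration; the function names, window size and step are assumptions, and what counts as a problematic composition depends on the project.

```python
from typing import List, Tuple


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    counted = [b for b in seq.upper() if b in "ACGT"]
    if not counted:
        return 0.0
    return sum(b in "GC" for b in counted) / len(counted)


def windowed_gc(seq: str, window: int = 100, step: int = 50) -> List[Tuple[int, float]]:
    """Return (window start, GC fraction) pairs along seq."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    results = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        results.append((start, gc_content(seq[start:start + window])))
    return results


# Hypothetical sequence: an AT-rich half followed by a GC-rich half.
example_seq = "AT" * 10 + "GC" * 10
profile = windowed_gc(example_seq, window=20, step=20)
for start, gc in profile:
    print(f"window at {start}: GC = {gc:.0%}")
```

Windows with a very high GC fraction point to the sequencing artifacts described above, while windows with a very low GC fraction (high AT) point to regions that may be hard to assemble.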