Genome Assembly Preliminary Results

Genome Assembly Preliminary Results
Jeri Dilts, Suzanna Kim, Hema Nagrajan, Deepak Purushotham, Ambily Sivadas, Amit Rupani, Leo Wu
02/01/2012

Outline
• Data Pre-Processing: Formats and Conversion, PRINSEQ, Data statistics
• Error Correction: CoRAL
• Assembler results: Newbler 2.6, MIRA3, ABySS, Velvet, PCAP-454
• Results
• Lab Exercise

What are sff files?
Sff files are Roche's "Standard Flowgram Format" files, containing the sequence data produced from a 454 run. An sff file contains a manifest header at the start describing its contents, followed by the flow intensity signal values for each base in each read. The files are binary, so they need to be converted to a text format such as fastq or fasta, using programs like sff2fastq, sff_extract or sffinfo. The Sequence Read Archive requests that these .sff files be uploaded in order to obtain an accession number for publications.

Fastq = Fasta + Quality
No standard file extension, but .fq, .fastq and .txt are commonly used.
4 lines per sequence:
• Line 1 begins with the @ character, followed by a sequence ID and an optional description
    @SEQ_ID
• Line 2 is the sequence letters
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAA
• Line 3 begins with the + character, optionally followed by the same sequence ID and description
    +Optional Description
• Line 4 encodes quality values for the sequence letters in line 2
    !''*((((***+))%%%++(%%%%).1***+*''))**55CCF>>>>>
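The four-line layout described above can be read with a few lines of Python. This is a minimal sketch rather than a full parser: it assumes exactly four lines per record with no wrapped sequences, and it assumes the Phred+33 quality encoding (the `offset=33` default), which the slides do not state. The function names `read_fastq` and `phred_scores` and the sample record are made up for illustration.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (seq_id, sequence, quality) tuples from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) ends the stream
            break
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Decode a quality string into integer scores (offset 33 is an assumption)."""
    return [ord(ch) - offset for ch in qual]


# Hypothetical one-record example, not taken from the 454 data set.
example = StringIO("@read1\nGATTACA\n+\n!''*+5I\n")
records = list(read_fastq(example))
print(records)
print(phred_scores(records[0][2]))
```

A generator is used so that large read files are processed one record at a time instead of being loaded into memory.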

File Tools
A large number of sff conversion tools are now available. We list a few here:
• sffinfo: extracts fasta, quality and flowgrams as text from .sff files
    sffinfo -seq Axolotl_reads.sff > Axolotl.fna
    sffinfo -qual Axolotl_reads.sff > Axolotl.qual
• sff2fastq: extracts read information from an sff file and outputs the sequences and quality scores in fastq
    sff2fastq -o Axolotl.fastq Axolotl_reads.sff
• sfffile: joins sff files; extracts part of an sff file by MIDs, read names or random reads; or trims reads in user-defined ways

Quality Control
WHY? Saves time, effort and money
KEY CONCERNS
• Number and length of sequences
• Base qualities
• Ambiguous bases
• Sequence duplications

Data pre-processing - PRINSEQ http://prinseq.sourceforge.net/manual.html

Generating trimmed reads in fastq
$ prinseq-lite.pl -fastq M19107.fastq -out_good stdout -log M19107.txt -trim_qual_left 20 -trim_qual_right 20 -trim_tail_left 5 -trim_tail_right 5 -trim_ns_left 2 -trim_ns_right 2 -min_len 65 -min_qual_mean 20 -ns_max_n 4 | gzip -9 > M19107.fastq.gz
Reads a fastq file containing quality data and writes the data passing all filters to standard output, which is piped to gzip so that the trimmed sequences end up in a new compressed file.
• -fastq: Fastq file containing sequence and quality data.
• -out_good stdout: Writes all data passing the filters to stdout (the terminal, unless piped or redirected).
• -log: Log file to keep track of parameters, errors etc.
• -min_len: Filter out sequences shorter than this length.
• -ns_max_n: Filter out sequences with more than the specified number of Ns. We tried with 2/3 and 1.
• -min_gc: Filter out sequences with GC content less than min_gc.
• -max_gc: Filter out sequences with GC content greater than max_gc.
• -min_qual_mean: Filter out sequences with mean quality scores below the specified level. Most published thresholds vary between 15 and 25. We used 20.
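To make the three filtering thresholds concrete, the sketch below applies the same cut-offs used in the command above (-min_len 65, -min_qual_mean 20, -ns_max_n 4) to a single read. It is an illustration of the filtering logic only, not PRINSEQ's code: trimming is not modelled, the function name `passes_filters` is hypothetical, and Phred+33 quality encoding is assumed.

```python
def passes_filters(seq, qual, min_len=65, min_qual_mean=20, ns_max_n=4, offset=33):
    """Return True if a read survives length, N-count and mean-quality filters.

    Defaults mirror the thresholds on the slide; offset=33 is an assumption.
    """
    if not seq or len(seq) < min_len:
        return False  # too short (like -min_len)
    if seq.upper().count("N") > ns_max_n:
        return False  # too many ambiguous bases (like -ns_max_n)
    scores = [ord(ch) - offset for ch in qual]
    mean_quality = sum(scores) / len(scores)
    return mean_quality >= min_qual_mean  # like -min_qual_mean


# Hypothetical reads: '5' encodes Q20 and '4' encodes Q19 under Phred+33.
print(passes_filters("A" * 70, "5" * 70))
print(passes_filters("A" * 70, "4" * 70))
```

Checking the cheap criteria (length, N count) first avoids decoding quality strings for reads that would be discarded anyway.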

Generating graphs report from trimmed reads
$ prinseq-lite.pl -fastq M19107_filtered.fastq -graph_data M19107_filtered.gd -verbose -out_good null -out_bad null
Reads a filtered fastq file and writes a graph data file, which is used to generate graphs showing the distribution of length, base quality, GC content, occurrence of N, poly-A/T tails, sequence duplication, sequence complexity and dinucleotide odds ratios.
• -fastq: Fastq file containing sequence and quality data.
• -graph_data: File holding the graph data used to generate the graphs report.
• -verbose: Prints status and info messages during processing.
• -out_bad null and -out_good null: No output file is created other than the graph data file.

Length Distribution

Base Quality Distribution

Mean Base Quality Scores

Error Correction
Motivation
• Sequencing errors pose the biggest challenge
• Computational efficiency of assemblers improves
• A lot of redundant data: take advantage of it
• Ensures high data usage in assembly

Coral v1.3
CORrection with ALignment
• Corrects sequencing errors in short-read high-throughput data
• Key strategy: multiple alignment using redundant read data
• Similar reads are all updated according to the alignment, based on scoring of quality reads
• The score is based on the number of reads that align at a position / the total number of reads at that position
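The idea of correcting reads from a multiple alignment can be illustrated with a simple column vote. The sketch below is not Coral's algorithm: it assumes the reads are already aligned into equal-length rows (with "." marking positions a read does not cover), ignores gaps and base qualities, and uses a hypothetical `min_support` threshold for the fraction of reads that must agree before a column is corrected.

```python
from collections import Counter


def correct_aligned_reads(rows, min_support=0.6):
    """Replace minority bases with the column's majority base.

    rows: equal-length strings; '.' marks positions a read does not cover.
    min_support: hypothetical threshold on (agreeing reads / covering reads).
    """
    corrected = [list(row) for row in rows]
    for col in range(len(rows[0])):
        bases = [row[col] for row in rows if row[col] != "."]
        if not bases:
            continue
        base, count = Counter(bases).most_common(1)[0]
        if count / len(bases) >= min_support:
            for row in corrected:
                if row[col] != ".":
                    row[col] = base
    return ["".join(row) for row in corrected]


# Hypothetical alignment: the second read carries a T where the others have A.
aligned = ["ACGTAC..", ".CGTTCGA", "..GTACGA", "ACGTACG."]
print(correct_aligned_reads(aligned))
```

In this example the disputed column is supported by three of four reads (0.75), so the lone T is rewritten to A while all other positions are left unchanged.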

Coral Multiple Alignment

Coral Command Line
coral -f[q, s] Input File -o Output File
• Accepts input files in FASTA, FASTQ and Solexa FASTA format
• -k (length of k-mer): should be >= log4(length of genome), default 21
• -e (maximum expected error rate): default 0.07
• -454: sets optimal settings for gap penalty, mismatch penalty and reward for matching
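The lower bound on k, k >= log4(genome length), is easy to check for a given genome size. The sketch below finds the smallest integer k with 4^k >= genome length using integer arithmetic (to avoid floating-point rounding). The function name and the 2 Mbp example genome size are assumptions for illustration, not values from the slides.

```python
def min_kmer_length(genome_length):
    """Smallest integer k such that 4**k >= genome_length, i.e. k >= log4(genome_length)."""
    if genome_length < 1:
        raise ValueError("genome_length must be a positive integer")
    k = 0
    while 4 ** k < genome_length:
        k += 1
    return k


# Hypothetical 2 Mbp bacterial genome: 4**10 = 1,048,576 is too small, 4**11 = 4,194,304 is enough.
print(min_kmer_length(2_000_000))
```

For genomes of this size the bound is well below the default of 21, so the default already satisfies the constraint.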

Newbler Algorithm: How does it work?

What is Newbler?
• Roche's “GS De Novo Assembler” (where “GS” = “Genome Sequencer”)
• Designed to assemble reads from the Roche 454 sequencer
• Accepts 454 FLX Standard reads and 454 Titanium reads, single and paired-end; Sanger reads can optionally be included
• Runs on Linux, in 32-bit and 64-bit versions
• Has command-line and Java-based GUI interfaces

OLC algorithm: a quick recap
Overlap Layout Consensus (OLC) is a method used for de novo genome assemblies. OLC requires three steps: 1) overlap, 2) layout, and 3) consensus. The overlap stage computes and builds the basic assembly graph. The layout stage compresses the graph, and the consensus stage determines the genome sequence based on the graph generated in the previous two steps.

Overlap In the overlap step, the sequence of each read is compared to that of every other read, in both the forward and reverse complement orientations. As such, the overlap computation step is a very time intensive step – especially if the set of reads is very large.
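The all-against-all comparison in both orientations can be sketched as follows. This is a naive illustration that only finds exact suffix/prefix matches; real assemblers tolerate mismatches and use indexing to avoid the quadratic scan. The function names, the `min_len` default and the example reads are assumptions.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of a that exactly equals a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def all_overlaps(reads, min_len=3):
    """Compare every ordered pair of reads, with the second read in both orientations."""
    found = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            for strand, oriented in (("+", b), ("-", reverse_complement(b))):
                n = suffix_prefix_overlap(a, oriented, min_len)
                if n:
                    found.append((i, j, strand, n))
    return found


# Hypothetical reads: the end of the first matches the start of the second.
print(all_overlaps(["ACGTTG", "TTGACC"]))
```

The nested loops make the cost grow with the square of the number of reads, which is why this step dominates run time for large read sets.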

Overlap criteria The OLC overlap criteria result in two types of overlaps: true overlaps (Figure 1a) and repeat overlaps (Figure 1b). For example, in Figure 1b, an overlap occurs between reads S and T, due to the orange repeat section, not because reads S and T truly overlap one locus in the genome, as in Figure 1a.

How does it go?
Although we would like to exclude repeat overlaps, we must construct the assembly graph using both types of overlaps, as true and repeat overlaps cannot be distinguished individually. In the assembly graph, the nodes represent actual reads, and edges represent OLC-quality overlaps between these reads (Figure 2). Thus, the genome assembly becomes equivalent to finding a path through the graph that visits each node exactly once (i.e., a Hamiltonian path).

Layout Finding a path through the OLC graph is not a trivial task. Imagine you have a graph of millions of nodes and edges. Identifying a path that visits every node exactly once would be extremely difficult, even for a powerful computer. In order to find such a path, you would have to start at some node and proceed to other connected nodes. If you find that you visited a node more than once, you must backtrack, adjust the path, and test this new path. So the larger the graph is, the more options you must test.
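The start, extend, backtrack procedure described above can be written as a short recursive search. This is a textbook brute-force sketch to show why the search is expensive (its worst-case cost grows factorially with the number of nodes); it is not how an assembler actually lays out reads. The toy graph is hypothetical.

```python
def hamiltonian_path(graph):
    """Return a path visiting every node exactly once, or None if none exists.

    graph: dict mapping each node to a list of successor nodes.
    """
    nodes = list(graph)

    def extend(path, visited):
        if len(path) == len(nodes):
            return path
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                visited.add(nxt)
                result = extend(path + [nxt], visited)
                if result:
                    return result
                visited.remove(nxt)  # dead end: backtrack and try another edge
        return None

    for start in nodes:
        found = extend([start], {start})
        if found:
            return found
    return None


# Hypothetical overlap graph: A->B->D is a dead end, so the search backtracks to A->C->B->D.
toy = {"A": ["B", "C"], "B": ["D"], "C": ["B"], "D": []}
print(hamiltonian_path(toy))
```

Even in this four-node graph the first branch tried has to be abandoned, which hints at how quickly the number of candidate paths grows for millions of reads.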

Contigs
In order to decrease the size of the graph, the OLC assembly graph is simplified in the layout stage, where segments of the assembly graph are compressed into contigs, which are collections of reads that clearly overlap each other and refer to the same overall sequence. Thus in the overlap graph, a contig would be a subgraph, or a group of nodes, with many connections among each other, as they all overlap with each other and refer to the same sequence.

Graphical representation of a unique contig

Unique and repeat contigs There are two classes of contigs, unique contigs and repeat contigs. Unique contigs are composed of reads that can be unambiguously assembled. Generally, these reads only overlap in one way and do not contain repeats within the genome. Essentially, unique contigs are contigs not flagged as repeat contigs.

Contig Assignment For example, the R1 + R2 repeat contig is connected to more contigs and has more read coverage than the other unique contigs, X, Y and Z. Both the X and Z unique contigs are connected to the prefix of the R1 + R2 repeat contig as they both have overlaps to this sequence. Similarly, unique contigs Y and Z have overlap with the suffix of the R1 + R2 repeat contig.

Contig assignment
Repeat contigs are contigs with an abnormally high read coverage or that are connected to an abnormally large number of other contigs. Additionally, this repeat contig is different from other contigs because it has such high coverage. The high coverage of repeat contigs allows algorithms to identify them through statistics that compare the coverage of each contig. Contigs with too much coverage are most likely due to over-collapsing of repeats and are flagged as repeat contigs, to be used later in the layout stage.
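A coverage comparison of this kind can be sketched by flagging contigs whose coverage is far above the typical value. The rule below (coverage greater than a fixed multiple of the median) is a deliberately crude stand-in for the statistics an assembler would use; the `max_ratio` threshold and the coverage numbers are hypothetical, and only the contig names X, Y, Z and R1+R2 come from the example on the previous slide.

```python
from statistics import median


def flag_repeat_contigs(coverage, max_ratio=1.75):
    """Return the names of contigs whose coverage exceeds max_ratio times the median.

    coverage: dict mapping contig name to mean read coverage.
    max_ratio: hypothetical cut-off, not a value used by any particular assembler.
    """
    typical = median(coverage.values())
    return {name for name, cov in coverage.items() if cov > max_ratio * typical}


# Made-up coverage values: a collapsed two-copy repeat shows roughly double coverage.
example = {"X": 19.0, "Y": 21.0, "Z": 20.0, "R1+R2": 41.0}
print(flag_repeat_contigs(example))
```

The median is used instead of the mean so that a few very high-coverage repeat contigs do not inflate the baseline they are compared against.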

Consensus In the final stage of the OLC method, we derive the consensus sequence. At this point, the assembly graph has been reduced to large scaffolds – ideally a single scaffold. A single scaffold would be represented by one node that resulted from collapsing all previous nodes. Starting from the left most read of each scaffold, the OLC algorithm computes the consensus of all of the reads composing each scaffold.
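Computing a consensus from the reads in a layout can be illustrated with a per-column majority vote. This is a simplified sketch: it assumes each read's start offset in the scaffold is already known and that the reads align without gaps, whereas real consensus steps work on gapped alignments and may use signal or quality information. The layout below is hypothetical.

```python
from collections import Counter


def consensus(layout):
    """Majority-vote consensus from ungapped reads placed at known offsets.

    layout: list of (offset, read) pairs; offset is the read's start column.
    Columns covered by no read are reported as 'N'.
    """
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


# Hypothetical layout: the second read has a T at column 4 where the other two reads have A.
print(consensus([(0, "ACGTAC"), (2, "GTTCGA"), (4, "ACGATT")]))
```

Because three reads cover the disputed column and two agree, the sequencing error in the middle read does not reach the consensus.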

Why would reads go from contig 1 to contig 3, as well as from contig 2 to contig 3? If the sequence from contig 3 is repeated in the genome, the reads coming from these repeats are very similar and collapse into a single contig. At the beginning and end of this contig there will be reads extending into the respective neighboring, single copy genomic regions. Essentially, here is the problem with assembly: repeats cause a complicated contig graph structure. Newbler here reads in all information from the flowgrams (sff file, these include the light intensity during sequencing for each read) and calculates for each position in the contig the consensus signal. Also, a consensus quality score is calculated for each base. Several output files are written during this phase.

Repeat detection and repeat resolution
Repeat detection
• pre-assembly: find fragments that belong to repeats, either statistically (most existing assemblers) or with a repeat database (RepeatMasker)
• during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
• post-assembly: find repetitive regions and potential mis-assemblies using Reputer, RepeatMasker, or "unhappy" mate-pairs (too close, too far, mis-oriented)
Repeat resolution
• find DNA fragments belonging to the repeat
• determine correct tiling across the repeat
Significant deviations from average coverage are flagged as repeats:
• frequent k-mers are ignored
• “arrival” rate of reads in contigs is compared with the theoretical value (e.g., 800 bp reads & 8x coverage: reads "arrive" every 100 bp)
[Figure: Identifying Unique DNA Stretches. Arrival intervals for a unique DNA unitig and a repetitive DNA unitig. The discriminator statistic is the log-odds ratio of the probability that a unitig is unique DNA versus 2-copy DNA. On a scale marked -10 and +10, the distributions for repetitive and unique unitigs fall into the regions "Definitely Repetitive", "Don’t Know" and "Definitely Unique".]
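The arrival-rate arithmetic and the log-odds discriminator can be worked through in a few lines. The sketch assumes read starts follow a Poisson process: if a unitig of length L contains k reads and reads arrive at rate r per base, the ratio P(k reads | unique) / P(k reads | 2-copy) simplifies to exp(L*r) / 2^k, so the natural-log odds are L*r - k*ln 2. This derivation and the function names are illustrative assumptions; the scaling of the statistic in the figure (the -10 and +10 marks) may use a different log base.

```python
import math


def arrival_interval(read_length, coverage):
    """Expected distance in bases between consecutive read starts."""
    return read_length / coverage


def unique_vs_two_copy_log_odds(unitig_length, n_reads, arrival_rate):
    """Natural-log odds that a unitig is unique rather than a collapsed 2-copy repeat.

    Assumes Poisson read arrivals at arrival_rate reads per base.
    """
    expected_reads_if_unique = unitig_length * arrival_rate
    return expected_reads_if_unique - n_reads * math.log(2)


# The slide's example: 800 bp reads at 8x coverage arrive every 100 bp.
print(arrival_interval(800, 8))

# Hypothetical 10 kb unitig: 100 reads matches the unique expectation, 200 looks like two copies.
rate = 1 / arrival_interval(800, 8)
print(unique_vs_two_copy_log_odds(10_000, 100, rate))
print(unique_vs_two_copy_log_odds(10_000, 200, rate))
```

A positive value favours unique DNA and a negative value favours a collapsed repeat; values near zero correspond to the "Don't Know" region.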

How to run Newbler?
COMMAND LINE INTERFACE
The simplest command to run Newbler is:
    runAssembly [options] reads.sff
• This creates the assembly in an output directory called P_yyyy_mm_dd_hh_min_sec_runAssembly, where P_ = Project, followed by the date and time
• A large number of optional parameters are available for controlling and refining the assembly

Let’s look at a Newbler run
The first thing you’ll see is a message stating that the assembly computation started, and which version of newbler you used. Then, you’ll see messages for each input file saying Indexing XXXXXXX.sff…, and a counter. During indexing, newbler scans the input file, performs some checks and trims the reads (sometimes more than the base-calling software already did). The first phase of assembly is finding overlaps between reads. Newbler splits this phase into one for long reads (this goes very fast) and one for shorter reads (which can take quite some time). As aligning all reads against each other would take too long, newbler (and many other programs) actually makes seeds, 16-mers of each read, where each seed starts 12 bases upstream of the previous one. Checkpointing means writing the intermediate results to disk, so that in the case of a crash you can continue the assembly from the last ‘checkpoint’.
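The seeding idea can be sketched as follows: take a 16-mer every 12 bases from each read, store the seeds in an index, and then treat any two reads that share a seed as candidates for a full alignment. This is an illustration of the general technique and not Newbler's implementation; in particular, looking up every k-mer of each read in the seed index (so that matches are found regardless of the offset between reads) is an assumption of this sketch, as are the function names.

```python
from collections import defaultdict


def seeds(read, k=16, step=12):
    """Return (position, k-mer) seeds taken every `step` bases along the read."""
    return [(i, read[i:i + k]) for i in range(0, len(read) - k + 1, step)]


def candidate_pairs(reads, k=16, step=12):
    """Pairs of read indices that share at least one seed."""
    index = defaultdict(set)
    for rid, read in enumerate(reads):
        for _, seed in seeds(read, k, step):
            index[seed].add(rid)
    pairs = set()
    for rid, read in enumerate(reads):
        # Look up every k-mer of this read among the seeds of the other reads.
        for i in range(len(read) - k + 1):
            for other in index.get(read[i:i + k], ()):
                if other != rid:
                    pairs.add((min(rid, other), max(rid, other)))
    return pairs


# A 40-base read yields seeds at positions 0, 12 and 24.
print([pos for pos, _ in seeds("A" * 40)])

# Small hypothetical example with k=4, step=3: reads 0 and 1 share the seed GGTC.
print(candidate_pairs(["ACGTACGGTC", "GGTCAATTGC", "TTTTTTTTTT"], k=4, step=3))
```

Only the candidate pairs need to be aligned in full, which is what makes seeding so much cheaper than aligning every read against every other read.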

Overlap between two sequences
    …AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC
           CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT…
overlap (19 bases), overhang (6 bases)
% identity = 18/19 = 94.7%
• overlap: region of similarity between the sequences
• overhang: unaligned ends of the sequences
The assembler screens merges based on:
• length of overlap
• % identity in overlap region
• maximum overhang size
When two different reads have identical seeds, the program tries to extend the overlap between the reads until the minimum overlap (default 40 bp) with the minimum alignment percentage (default 90%) has been reached.
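The percent-identity figure on this slide can be reproduced directly from the two overlapping 19-base regions. The sketch below also applies the stated defaults (minimum overlap 40 bp, minimum identity 90%) as a simple accept/reject test; it compares ungapped, equal-length strings only, and the function names are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of matching positions between two equal-length, ungapped strings."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def accept_overlap(a, b, min_overlap=40, min_identity=90.0):
    """Screen an overlap by length and identity (defaults taken from the slide)."""
    return len(a) >= min_overlap and percent_identity(a, b) >= min_identity


# The 19-base overlap regions from the slide, differing at one position (G vs T).
top = "GGATGCGCGGACACGTAGC"
bottom = "GGATGCGCTGACACGTAGC"
print(round(percent_identity(top, bottom), 1))
print(accept_overlap(top, bottom))
print(accept_overlap(top, bottom, min_overlap=19))
```

The example overlap is similar enough (18 of 19 bases) but too short for the 40 bp default, so it is only accepted when the minimum overlap is lowered.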

Newbler Command Line
    runAssembly [options] inputFile
    runMapping [options] referenceGenome inputFile
• Accepts input files in FASTA, FASTQ and .SFF format
• -o: name of the output directory
• -qo: generate quick output for the assembly (can decrease accuracy)

M19107 - Newbler Results

M19501- Newbler Results

MIRA3
Multifunctional Inertial Reference Assembly v3.4.0
• Started in 1997 as a PhD project at the German Cancer Research Center by Bastien Chevreux, and became open source in 2007
• MIRA 3 is able to perform true hybrid de novo assemblies using reads gathered through Sanger, 454, Solexa, IonTorrent or PacBio sequencing technologies
• Can also perform regular (non-hybrid) de novo assemblies using 454 data
• Overlap layout consensus algorithm
http://sourceforge.net/apps/mediawiki/mira-assembler/index.php?title=Main_Page

MIRA3: Command Line Arguments (1/2)
"Extracting SFF"
Convert the SFFs named M19107.sff, M19107.sff and M19107.sff
Parameters of sff_extract:
• -Q: extract to FASTQ
• -s: give the FASTQ file a name we chose
• -x: give the XML file with vector clipping information a name we chose
http://mira-assembler.sourceforge.net/docs/chap_454_part.html

MIRA3: Command Line Arguments (2/2)
"Begin Assembly"
Parameters:
• --project: for naming your assembly project
• --job: perhaps to change the quality of the assembly to 'draft'
• >&: redirects the output to a file named log_assembly.txt, which can be used to observe assembly progress
http://mira-assembler.sourceforge.net/docs/chap_454_part.html

MIRA3 Data Manipulation Integrity Our objective is to produce the most accurate representation of a genome. When software tools produce better results, it doesn't necessarily indicate that the genome's representation is more accurate. (and vice versa) This makes it tricky to determine the proper tools to use. Could be detrimental to scientific integrity. Take precautions.

MIRA3 Data Analysis Data Quality (ideally) Increases

MIRA3 Further Work
• Need to look at more assembly parameters in the reference manual
• Finish scripts that calculate Min. Contig Length and Avg. Contig Length
• M19501 32bit error
• Finish running MIRA3 on all genomes

ABySS
ABySS stands for Assembly By Short Sequences. It is a de novo sequence assembler designed for short reads and large genomes.
• Single-processor version: genomes up to 100 Mbp in size
• Parallel version: capable of assembling mammalian-sized genomes
• Capable of performing assemblies for both single-end and paired-end reads
• The output of ABySS is a set of contigs assembled from the short reads

ABySS   

ABySS commands for assembly
    ABYSS -k[k-mer value] input.fastq -o output.fa
Perform the operation for multiple k-mer values in a loop:
    for k in {20..40}; do
        ABYSS -k$k reads.fa -o contigs-k$k.fa
    done

ABySS output file
The ABySS output file consists of the contigs generated by the assembly. Each contig in the output file consists of 2 lines:
Line 1: >n iii jjj
    n = contig id
    iii = contig length in nucleotides
    jjj = absolute coverage value
Line 2: AAAAACTAATTTCTGAAAT (contig sequence)
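The two-line contig layout described above can be read with a short parser. This sketch assumes each contig sequence sits on a single line and that the coverage field is an integer; the function name and the header values in the example (id 0, coverage 57) are made up, while the 19-base sequence is the one shown on the slide.

```python
from io import StringIO


def read_abyss_contigs(handle):
    """Parse '>id length coverage' header lines, each followed by one sequence line."""
    contigs = []
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            contig_id, length, coverage = line[1:].split()[:3]
            # Integer coverage is an assumption of this sketch.
            header = (contig_id, int(length), int(coverage))
        elif header is not None:
            contigs.append({
                "id": header[0],
                "length": header[1],
                "coverage": header[2],
                "sequence": line,
            })
            header = None
    return contigs


# Hypothetical header values paired with the sequence shown on the slide.
example = StringIO(">0 19 57\nAAAAACTAATTTCTGAAAT\n")
for contig in read_abyss_contigs(example):
    print(contig["id"], contig["length"], contig["coverage"], len(contig["sequence"]))
```

Comparing the stated length with the actual sequence length, as the last line does, is a cheap sanity check on the file.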

ABySS output 

VELVET
• de Bruijn graph-based assembler
• best for high-coverage, very short read data sets
• leverages paired-end information really well

Commands
velveth: performs hashing
• default k-mer value: 31
• specify input format (-fastq) and read type (-long)

Commands
velvetg: generates the graph and forms the assembly
• exp_cov (expected coverage): auto

Results

PCAP-454
• Beta version (not yet released)
• Overlap Layout Consensus assembler
• Designed for 454 paired-end data
• Computationally efficient

PCAP-454 commands
• Input files: Fasta and Qual files, separately zipped
• Fofn file: specify the name of the input file
• Run the automated script:
    $ ./autopcap fofn > auto.log

PCAP-454 Results: M19107
Metrics              Raw data
No. of Contigs       589
N50                  6574
Max Contig length    20652
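Metrics such as the number of contigs, N50 and maximum contig length can be computed from a list of contig lengths. The sketch below uses the common definition of N50 (the length of the contig at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly size); the function name and the example lengths are hypothetical and are not the M19107 data.

```python
def contig_stats(lengths):
    """Return count, N50, max, min and mean contig length for a list of lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if 2 * running >= total:  # reached half of the total assembly size
            n50 = length
            break
    return {
        "count": len(ordered),
        "n50": n50,
        "max": ordered[0],
        "min": ordered[-1],
        "mean": total / len(ordered),
    }


# Made-up contig lengths: total 290, and 80 + 70 = 150 passes the halfway mark of 145.
print(contig_stats([80, 70, 50, 40, 30, 20]))
```

The same function also yields the minimum and average contig lengths, so one pass over the lengths covers all of the summary metrics.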

Results Summary

Work Cited
• M19107 original reads: http://edwards.sdsu.edu/cgi-bin/prinseq_beta/tmp/1327884761.html
• M19107 filtered reads: http://edwards.sdsu.edu/cgi-bin/prinseq_beta/tmp/1327950415.html
• Quality Control with Prinseq: prinseq.sourceforge.net/Quality_control_with_PRINSEQ.pdf (http://prinseq.sourceforge.net)