Presentation on theme: "Understanding and Assembling 454 Transcriptome sequences"— Presentation transcript:
1Understanding and Assembling 454 Transcriptome sequences Transcriptome WorkshopNov 2010Stephen Bridgett
2Aims Why sequence transcriptomes? How does 454 sequencing work? What are ‘sff’ files?Using sff toolsWhat is assembly?Challenges to assemblyNewbler assembler and Output filesExercises with sample data
3Why sequence transcriptomes? Gives more dynamic view of the activity in a cell, (than genome sequencing would) as:Gives relative expression levels for different cells under different conditions.Could identify alternate splicing, and fusion genes (important in several cancers).Focuses on gene sequences, which are often the main research focus.
4How does 454 sequencing work ? DNA Capture bead,emPCR,Pyrosequencing reaction,Signal image,Base calling454 sequencer
5An Animation of 454 sequencing Animation of 454 sequencing from Wellcome trust website to explain Flowgrams:454Animation_from_Wellcome_WTX swfTo help understand output from assembler alignment.
6Data obtained from 454 sequencing Roche 454 ‘titanium’ genome reads approx. 400 bases long.Transcriptome reads tend to be a bit shorter eg. 350 bases.Typically 700,000 reads from one sequencing plate.Plates can be divided into 2, 4, 8 or 16 lanes.Samples can have an MID (multiplex index) ‘barcode’ added, so several samples can be run together in the same lane.
7What are ‘sff’ files ?‘Sff’ files are Roche’s “Standard Flowgram Format” files, containing the sequence data produced from a 454 run.The sff files contain:a Manifest header at the start describing the contents,flow intensity signal values for each base in each read.They are in binary format, so need converted to text format, such as a fasta file (using the ‘sffinfo’ program)The Sequence Read Archive (SRA at EBI or NCBI) request that these .sff files be uploaded, to obtain accession number for publications.
8What is Assembly?Merge the short reads into long contigs (ideally a full transcript), by finding the best sequence overlaps between reads.Eg: Roche’s Newbler assembler, MIRA assembler, TgiCl assembler, Phrap, Cap3, MOSAIK reference guided assembler, etc.This is an ‘overlap’ assembler (there are also deBruijn graph assemblers to cope with the very large numbers of short illumina reads)Reads overlapped to form a contig, viewed in the gsAssembler graphical interface.Newbler is an ‘overlap assembler’. There are also de-Bruijn graph assemblers designed to cope with the vary-large numbers of short reads from illumina or SOLiD, such as Velvet, CLC cell, Cotex, SOAP-denovo, Abyss.
9Challenges for assembly (1) Contaminants in samples (eg. from Bacteria or Human).Ribosomal RNA (small and large sub-units).PCR artifacts (eg. Chimeras and Mutations)Sequencing errors, such as “Homopolymer” errors – when eg. 3+ run of same base.MID’s (multiplex indexes), primers/adapters (eg. SMART adapters used to synthesise cDNA) still in the raw reads.Repeats and large or polyploid genomes – repeated sequences in the transcriptome make assembly more difficult.
10Challenges for assembly (2) Extra sample preparation steps in cDNA synthesis - more risk of cloning errors or contamination, wider range of read lengths.Large expression level range (eg. 105) - some transcripts have low read coverage and some very high coverage.Alternative splicing - differingreads from same part of genome.Roche’s Newbler 2.3 assembler sometimes didn’t finish transcriptome assembly, seemed to get lost when “Detangling Alignments”, but the latest Newber 2.5 beta is able to.
11Blast search to check for contaminants Blastx search of 5,000 randomly picked reads against UniRef90 or Non-redundant dataset.Sorted by frequency of Description (or Tax) with evalue > e-8FrequencySubject_description1689 (16.9 %)Picea sitchensis (Sitka Spruce)907 (9.1 %)Vitis vinifera (Common Grape Vine)311 (3.1 %)Physcomitrella patens subsp. Patens (Moss)282 (2.8 %)Arabidopsis thaliana (Thale cress)218 (2.2 %)Oryza sativa Japonica Group (Rice)153 (1.5 %)Zea mays (Maize)58 (0.6 %)Oryza sativa Indica (Rice)Oryza sativa (Rice)
12Homopolymer error A ?c TT - AAAAA ?a Different between signal of 1 and signal of 2 = 100%.Different between signal of 5 and 6 is 20% so errors more likely after eg. AAAAA.
13Roche softwareRoche have developed Data-Analysis software for processing, assembling and mapping the 454 reads:sffinfo - extract fasta, quality and flowgrams as text from .sff files.sfffile - join, split or trim sff files.gsAssembler (Newbler) - to assembly reads into contigs/isotigs.gsMapper - to map reads to a transcriptome or genome reference.gsAmplicon – to analyse Variants in Amplicons.(These run on 32 and 64 bit Linux. There is information on the wiki about obtaining and installing these.)
14Exercise 1A – sff files Aims: Using ‘sffinfo’ and ‘sfffile’ Summarise the read statisticsBlast the reads for contaminantsThe exercises are on the wiki:
15What is “Newbler” ?Roche's “GS De Novo Assembler” (where “GS” = “Genome Sequencer”)Designed to assemble reads from the Roche 454 sequencer.Accepts:454 Flx Standard reads, and454 Titanium reads.single and paired-end reads.Optionally can include Sanger reads.Initial versions focused on assembling Genomic reads.Latest versions (2.3 and now 2.5) improve transcriptome assembly.Runs on Linux, and has 32 bit and 64 bit versions.Has Command-line and Java-based GUI interface.Rarely called “Newbler” (for “New Assembler”) in Roche's documentation, rather “runAssembler”, or “gsAssembler”.
16How does Newbler work?cDNA Reads Alignments Contig graph Final untangled assembly
17Inputs to Newbler assembler Newbler accepts:Roche's .sff files (standard flowgram format)Fasta files, with or without Quality files, such as Sanger reads, (which can be used as a scaffolds.)Parameters specified by the user, to guide the assembly, (or parameters can all be left at their default values.)
18Command-line interface The simplest command to run Newbler is:runAssembly [options] reads.sffWhich creates an the assembly in an output directory called:P_yyyy_mm_dd_hh_min_sec_runAssemblywhere P_ = Project, followed by date and timeThere are a large number of optional parameters available for controlling and refining the assembly.
19Common command-line options -cdna for transcriptome (cDNA) assembly-urt ‘use read tips’ to produce longer isotigs-o output_directory to set name of output directory-vt trimmingFile.fasta to trim primers, adapters from start or end of reads-vs screeningFile.fasta to remove reads that closely matching a cloning vector such as E.Coli or rRNA.(-vs and -vt also match reverse-complements of given sequences.)
20Isogroups, Isotigs, Contigs ? Some definitions to understand Newbler output:An isogroup: - tries to represent a gene- collection of isotigs containing reads that imply connections between the isotigs.An Isotig: - represents an individual transcript.- different isotigs from a given isogroup can be inferred splice-variants.Contigs: - contigs forming an isotig may be thought of as exons.- this is not strictly correct, as untranslated regions (UTRs) and introns (in the case of primary transcripts) may exists in the reads generated from the sample.
21Isotigs - more detailsConnections between contigs in an isogroup are represented by sequences (reads) that have alignments diverging consistently towards two or more different contigs or by a depth spike.The assembler trims and ignores any poly-A tails, so the true orientation of reads in the assembly cannot be determined. So an isotig may be output as the reverse-complement of the true biological transcript.For more details see pages of the Roche software manual (which is on your computer’s Desktop in the ‘manual’ folder)
22Output files for Transcriptome projects (1) In the Assembly subdirectory:454Isotigs.fna fasta file of all Isotigs, and Contigs which are not in an isotig.454Isotigs.qual quality scores (Phred-based) for each base in '454Isotigs.fna’ file. (eg: 20 = 1 in 100 probability of incorrect base call; 50 = 1 in 100,000)454Contigs.fna fasta file of all contigs, which are used to create the Isotigs.454Contigs.qual quality scores for each base.454NewblerMetrics.txt statistics of the assembly, eg: number of reads and bases aligned, overlaps found, mean contig sizes,454ReadStatus.txt status of each read in assembly (Assembled, PartiallyAssembled, Singleton, TooShort, Outlier), and alignment 3' and 5' positions within contig.454TrimStatus.txt each read's original and revised trim-points used in the assembly.
23Output files (2)454AlignmentInfo.tsv base consensus and quality, read-depth and flow-signal, at each position in each contig.Can easily be parsed by Perl script to obtain eg: average coverage depth for each contig and isotig.eg:Position Consensus Quality Unique Align Signal SignalScore Depth Depth StdDev(incl.duplicates)>contig00008GATTG...etc...
24Output files (3)454Contigs.ace = ACE format file, showing how reads were aligned to form contigs, viewable in eg. Tablet, or Consed.Unlike traditional ace files, in Newbler’s ace files:the same read can be in several contigs (but is given an extra suffix),eg: if one contig is in a repeat (higher coverage) region, and the next is contig is a non-repeat (low coverage) region, and the read spans the junction.a contig (and hence a read) can be shared between several isotigs.But a read should only be in one isogroup.
25Output files (4) Only with -cdna option: 454IsotigLayout.txt how contigs are laid along each isotig in the isogroup, (454RefLink also gives which isotigs are in each isogroup).eg:>isogroup numIsotigs=8 numContigs=11Length : (bp)Contig : Total:isotig00004 >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> 1484isotig00005 >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> 1484isotig >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> 1497isotig >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> 1497isotig00008 >>>>> >>>>> >>>>> >>>>> >>>>>isotig >>>>> >>>>> >>>>> >>>>> >>>>>etc……
26Exercise 1B - Assembly Aims: Assemble dataset with Newbler Summarise the resulting isotigsLook into the assembly output filesView the assembly in Tablet viewer (http://bioinf.scri.ac.uk/tablet)
28Obtaining Roche software To obtain the latest Roche off-line Data_Analysis software version (which includes the sff tools, Newbler assembler, gsMapper and gsAmplicon), complete the softwre request form on the 454.com website:
30How does Newbler work? Identify pairwise overlaps between reads. Construct multiple alignments of overlapping reads.Break the multiple alignments where consistent differences are found between different sets of reads.This gives “contigs” that represent the assembled reads.Resolve branching structures between contigs, to generate isotigs.Generate consensus basecalls for the contigs using quality and flow signal information at each base in the multiple alignments.Output the contig consensus sequences, quality scores, alignment and metric files.You will see message about these steps as assembly progesses.If paired End data is available, the assembler performs these extra steps:Organize contigs into scaffolds, using paired-end information to order the contigs and to approximate the distance between contigs.
31GUI interface to Newbler gsAssembler = Roche’s graphical interface to the newbler assembler. Is based on java.Type: gsAssembler & (The ‘&’ just means can still use the command-console as runs assembler in’background’).Set project name, directory, and Genomic or cDNA option.On Project tab, select directory containing sff files, then uncheck any unwanted sff files.Set parameters for project, such as MINT adapters to trim, and ribosomal rRNA fasta file to screen out, other assembler and output options.Click the “Start” button at the right, and watch the output at the bottom.When finished assembly, can view using the Results, Alignment and Flowgrams tabs.
32Experiment 208: Using the GUI Run: gsAssemblerGraphical interface should appearCan use: /dataset2 (or other dataset)Choose options and run the assemblyLook at the resulting assembly in the viewing tab.What do you think about the accuracy of the assembly?
33Roche's software also includes: Gs Reference Mapper for mapping to reference (for model organisms can specify file of known annotations and SNP's)Amplicon Variant analyser for analysing DNA variants (eg rare alleles) in ultra deep coverage of regions of interest. (see manual Part D on website for more information)File Tools:sffinfo extract fasta, quality and flowgrams as text from .sff files.sfffile join sff files; extract part of sff file by MIDs, read names or random reads; or trim reads in user-defined ways.sff2scf converts one read from sff file into an SCF file (or performs “call throughs” to access SCF data for Sanger reads)fnafile Constructs a FASTA file (& quality file) from list of FASTA, PHD and SCF files.
34Viewing AssembliesIn addition to the alignment viewer in gsAssembler, there are several other viewers for viewing the .ace alignment files:Hawkeye = useful, but a bit tricky to compile initially: =HawkeyeTablet = Fast, nice display and easy to use. From SCRI:EagleView = a limited basic viewer. From MarthLab.Gambit = newer viewer from MarthLab:Magic Viewer = Views sam and bam files, not ace files. From
35Videos about 454 sequencing Pyrosequencing:Genome Sequencer FLX System Workflow:
36Exercise 1: Look into an .sff file ‘sffinfo’ is a command-line program that is part of this Roche Data_Analysis package.To view the binary sff file as text, run:cd ~/data/Axolotlsffinfo Axolotl_reads.sff | less(Piping to less allows you to scroll easily)Type ‘q’ to quit less.
37Exercise 2: Extract reads from an .sff file Use the file: Axolotl_reads.sffcd ~/data/AxolotlExtract reads from the .sff file into a fasta file:sffinfo -seq Axolotl_reads.sff > Axolotl.fnahead Axolotl.fnaExtract the quality information from the .sff file:sffinfo -qual Axolotl_reads.sff > Axolotl.qualhead Axolotl.qualCount the number of reads (The quotes are important):grep -c ">" Axolotl.fna
38Exercise 3: Assembly dataset 1: 454 titanium reads for 6 Mb genome ~/training/data/454/dataset_1/set1_reads.sffGet metrics for the raw reads:sffinfo -seq reads.sff > reads.fastamy_process_contigs.pl –i reads.fasta –o process_reads –t read(although these are reads, rather than contigs the same script can still be used.)more process_reads/contig_stats.txtEstimate the average read depth (genome 6Mb )
39Exercise 4: Assembly Assembly command: runAssembly -o assembly1 reads.sffWhere reads are:~/assembly_workshop/data/454/dataset_1/set1_reads.sffLook in the assembly1 subdirectory, and see what you think the files contain.
40More options-a num minimum contig length for 454AllContigs (default 100)-l num mim contig length for 454LargeContigs (default=500)-large for large or complex genomes, speeds up assembly, but reduces accuracy. Not with -cdna option.-m keep sequence data in memory to speed up assembly, but needs sufficient RAM.-cpu num num CPU’s to use (manual says default=all, but wrong), to speed up the computing alignments and generating output steps.-minlen num minimum length of reads to use in assembly (default=20, can be 15 to 45).-rip output each read in only one contig.
41(see page 21 or 77 of manual Part C for more details.) Even more options-notrim disable default quality & primer trimming of input reads.-p filename specify input file contains paired-end reads.-ud treats each reads separately, not grouping duplicates.-ss set seed step parameter (default=12, can be 1 or more)-sl set seed length parameter (default=16, can be 6 to 16)-sc set seed count parameter (default=1, can be 1 or more)-ais set alignment identity score (default=2, can be 0 or more)-ads ?set alignment difference score (default=-3, can be 0 or less)-ml set minimum overlap length (default=40, allowed 1 or more)-mi set minimum overlap identify (default=90, allowed 0 to 100)-nobig skip output of large files (.ace, 454AlignmentInfo.tsv)-consed creates subdirectory, .ace, .phd files, sff_dir for consed.(see page 21 or 77 of manual Part C for more details.)
42Extra challenges of transcriptome assembly Poly A’s, Poly T’s tails (added after gene transcription).…….ATGCTAAAAAAAAAAAAAAA-3’
43Exercise 5: Using options Use some of these options that you think may improve assembly.runAssembly [your options] –o assembly2 reads.sffChange into subdir assembly2Look through some of the output files, eg:less filename (or use a texteditor)my_process_contigs.pl -i 454AllContigs.fna -o stats -t contigThe assembler manual is available on the web links page so you can try different options for ‘runAssembly’.
44Common options (again) -o output_directory to set name of output directory (overwrites existing directory without warning!)-vt trimmingFile.fasta to trim primers, adapters or polyA tails from start or end of reads-vs screeningFile.fasta to remove reads that closely matching a cloning vector such as E.Coli.(-vs and -vt will also match for the reverse-complements of the given sequences.)
45Exercise 6: Transcriptome assembly Using: ~/assembly_workshop/data/454/dataset_2/Enter the following on one line:runAssembly-o assembly3-vt MINTadapters.fna–vs rRNA.fna (groups at front half only)-cdna-ig NUM (max contigs in an isogroup, default 500 contigs)-it NUM (max number of isotigs in an isogroup, default 100)-icc NUM (max contigs in one isotig, default 100 contigs)-icl NUM (isotig contig length threshold, default 3 bp)reads.sff
46Incremental assemblyThere are also alternative command-line commands (instead of ‘runAssembly’) that can perform incremental assembly, adding, or removing, runs to an existing project over time:newAssembly,addRun,removeRun,runProject.
47Transcriptome options Newbler collects into “Isogroups”, then creates “Isotigs”New options for transcriptomes:-cdna = for transcriptome (cDNA assembly)-ig = max contigs in an isogroup (default 500 contigs)-it = max number of isotigs in an isogroup (default 100)-icc = maximum number of contigs in one isotig (default 100 contigs)-icl = isotig contig length threshold, below which traversal stops (default 3 base pairs)Pages 142 to 146 of Part C of the Roche Assembly manual gives a good table of all the options.
48Output files (cont. 3)454ContigGraph.txt = describes the branching structure between contigs.Has 3 sections:(1) Graph Node information (the contigs):ContigNum ContigName Length Average_depth1 contig2 contig...etc...(2) Graph edges (C=contig edge; or S=scaffold edge for paired end reads, also S in -cdna graphs):Edge FromContigNum FromEnd ToContigNum ToEnd AlignmentReadDepth C ’ ’ 24C ’ ’ 36...etc....S :+;2802:-;2872:-;2575:-;2783:-;2614:-S :+;1327:-(3) More graph informationI t :2639-5' 'I atcgattgaaatcaatggagaaagatacTATAGAAAGTTAATAAAaGTATCTGTAGAGCCGACAGTTG....etc...F /188/0.0;2931/8/0.0;2957/36/0.0;1242/226/F /24/ /24/0.0