NGS Bioinformatics Workshop
2.2 Tutorial – Whole Genome Assembly Part I
May 9th, 2012, IRMACS 10900
Facilitator: Richard Bruskiewich, Adjunct Professor, MBB

Workflow for Today
• Generate a synthetic NGS read data set
• Genome assembly
  • ABySS
  • Velvet
  • ALLPATHS-LG

Generate synthetic NGS read data for assembly
• Try out a new program called “ART” from Baylor College
  Huang W, Li L, Myers JR, Marth GT. 2012. ART: a next-generation sequencing read simulator. Bioinformatics 28(4):593-4
• Available as open source and as binary programs for 32- or 64-bit Windows, Mac and Linux:
  http://www.niehs.nih.gov/research/resources/software/art
• Notes:
  • The binary archive names are a bit strange: each one is really a .tar.gz in disguise (run gunzip, then tar -xvf)
  • The FASTQ sequence line is *lower case*, which some software (e.g. ABySS) does not expect

Simulated Illumina Paired End Reads
• Using the rice chloroplast genome (~134 kb):
    art_illumina -i Chloroplast.fasta -p -l 50 -f 20 -m 200 -s 10 -o Chloroplast -sam
• Generates files:
  • Chloroplast1.aln
  • Chloroplast1.fq
  • Chloroplast2.aln
  • Chloroplast2.fq
  • Chloroplast.sam
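As a rough sanity check on the size of the simulated data set, the number of read pairs can be estimated from the parameters in the command above. This is only a sketch: it assumes fold coverage means total sequenced bases divided by genome length, that both reads of a pair count toward it, and it uses the approximate 134 kb genome size. The function name is made up for illustration and is not part of ART.

```python
def expected_read_pairs(genome_length, read_length, fold_coverage):
    """Estimate how many read pairs give the requested fold coverage.

    Assumes coverage = (number of reads * read length) / genome length,
    with both mates of a pair contributing to the total.
    """
    total_bases = genome_length * fold_coverage
    total_reads = total_bases // read_length
    return total_reads // 2


# ~134 kb genome, 50 bp reads (-l 50), 20X coverage (-f 20)
print(expected_read_pairs(134000, 50, 20))
```

Under those assumptions the estimate is about 26,800 pairs, so each of the two .fq files should hold on the order of 27,000 reads; the exact count from ART will differ somewhat.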

==============================================================================
                          ART (Q Version 1.3.6)
   Copyright(c) 2008-2012, Weichun Huang, Jason Myers. All Rights Reserved.
==============================================================================

Paired-end Simulation

Total CPU time used: 2.48

Parameters used during run
    Read Length: 50
    Fold Coverage: 20X
    Mean Fragment Length: 200
    Standard Deviation: 10
    Profile Type: Combined
    ID Tag:

Quality Profile(s)
    First Read: EMP50R1 (built-in profile)
    Second Read: EMP50R2 (built-in profile)

Output files
  FASTQ Sequence Files:
    the 1st reads: Chloroplast1.fq
    the 2nd reads: Chloroplast2.fq
  ALN Alignment Files:
    the 1st reads: Chloroplast1.aln
    the 2nd reads: Chloroplast2.aln
  SAM Alignment File:
    Chloroplast.sam

Unfortunately…
• The ART program generates peculiar IDs (it doesn't mark the paired end reads…) and lower case sequence letters, which causes some headaches…
• So, I wrote a small Python script to fix this…

#!/usr/bin/python
# Fixes the output of the ART program
# art_illumina -i reference.fa -p -l 50 -f 20 -m 200 -s 10 -o outFile_prefix -sam
# Reads FASTQ on standard input and writes the fixed FASTQ to standard output.
from sys import stdin

seq = False   # True when the next line is a sequence line
qual = False  # True when the next line is a quality line

if __name__ == '__main__':
    for line in stdin:
        line = line.strip()
        if qual:
            # quality line: pass through unchanged; checking this first
            # avoids treating rare quality score lines that start with '@' as IDs
            qual = False
        elif line.startswith('+'):
            # separator line: the next line holds the quality scores
            qual = True
        elif not seq and line.startswith('@'):
            # massage the ID: rewrite 'x|a-b-c' as 'x_a-b/c'
            part1 = line.split('|')
            part2 = part1[1].split('-')
            line = part1[0] + '_' + part2[0] + '-' + part2[1] + '/' + part2[2]
            seq = True
        elif seq:
            # convert sequence all to upper case to avoid downstream confusion...
            line = line.upper()
            seq = False
        print(line)
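To make the ID rewrite in the script easier to follow, here is the same string manipulation pulled out into a small function. The function name and the sample header are invented purely for illustration; real ART headers may differ in detail, so treat this as a sketch of the transformation rather than a description of ART's format.

```python
def fix_art_id(header):
    """Rewrite 'x|a-b-c' as 'x_a-b/c', mirroring the script's ID step."""
    part1 = header.split('|')
    part2 = part1[1].split('-')
    return part1[0] + '_' + part2[0] + '-' + part2[1] + '/' + part2[2]


# Hypothetical header, made up only to show the rewrite
print(fix_art_id('@chl|NC001-1234-1'))
```

For this made-up header the function returns '@chl_NC001-1234/1': the last hyphen-separated field ends up after a slash, which is the usual place for a mate number in paired-end read IDs.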

Getting ABySS
• Installation:
  • For Ubuntu: sudo apt-get install abyss
  • Or visit BCGSC and download the tar.gz source, then ./configure and make (more up-to-date?)
  • Perhaps put the abyss bin directory on your path…
• To test run ABySS:
    abyss-pe k=25 name=test se=https://raw.github.com/dzerbino/velvet/master/data/test_reads.fa

Try our test PE read data set
• Run:
    abyss-pe name=Chloroplast31 k=31 ABYSS_OPTIONS=--no-trim-masked in='Chloroplast1.fastq Chloroplast2.fastq'
• The 'no-trim-masked' option is needed because the default behaviour of abyss is to trim lower case letters in sequence (which designate identified vector sequences in 454 outputs…)
• Try with other k-mer sizes…
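One way to try other k-mer sizes is to generate one abyss-pe command per value of k, each with its own output name. The sketch below only builds and prints the command lines and does not run ABySS. The helper name and the particular k values are arbitrary choices for illustration, and the read file names are the ones used on this slide.

```python
def abyss_commands(name_prefix, kmers, reads):
    """Build one abyss-pe command line per k-mer size (does not run them)."""
    commands = []
    for k in kmers:
        commands.append(
            "abyss-pe name=%s%d k=%d ABYSS_OPTIONS=--no-trim-masked in='%s'"
            % (name_prefix, k, k, " ".join(reads))
        )
    return commands


# The k values here are arbitrary examples, not recommendations
for cmd in abyss_commands("Chloroplast", [21, 25, 31, 35],
                          ["Chloroplast1.fastq", "Chloroplast2.fastq"]):
    print(cmd)
```

Giving each run a distinct name keeps the output files of different k-mer sizes apart, so the resulting assemblies can be compared afterwards.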

For more info about ABySS
• http://www.bcgsc.ca/platform/bioinfo/software/abyss
• Active list service to troubleshoot issues: abyss-users@googlegroups.com

Velvet
http://www.ebi.ac.uk/~zerbino/velvet/
• download & tar -zxvf
• make
• sudo make install
• put velvet directory on your $PATH
• Run velveth:
    velveth outputdir k_mer -fastq readfile
• Run velvetg:
    velvetg outputdir -ins_length 200 -exp_cov 20
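The velveth command above takes a single read file, while the simulated data set comes as two files of mates. Velvet has traditionally expected paired reads interleaved in one file (read 1 of a pair followed directly by read 2) and ships a Perl script, shuffleSequences_fastq.pl, for that purpose. The following Python sketch does the same job under the assumptions that both files list mates in the same order and use plain four-line FASTQ records; the function names are made up for illustration.

```python
from itertools import islice


def fastq_records(handle):
    """Yield FASTQ records as lists of four lines (assumes unwrapped records)."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def interleave(handle1, handle2, out):
    """Write mates alternately (read 1, then read 2) and return the pair count."""
    count = 0
    for rec1, rec2 in zip(fastq_records(handle1), fastq_records(handle2)):
        out.writelines(rec1)
        out.writelines(rec2)
        count += 1
    return count
```

Calling interleave() on open handles for Chloroplast1.fastq and Chloroplast2.fastq, with an output handle for a new combined file, gives one file that can be passed to velveth. For paired data velveth also needs the -shortPaired read category in addition to -fastq; that flag is not on the slide, so check it against the Velvet manual for your version.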

ALLPATHS-LG
http://www.broadinstitute.org/software/allpaths-lg/blog/
• download and tar -zxvf
• ./configure
• make
• sudo make install
• Execute the program:
  • PrepareAllPathsInputs.pl # needs some config files…
  • RunAllPathsLG
