NGS Bioinformatics Workshop 2.2 Tutorial – Whole Genome Assembly Part I May 9th, 2012 IRMACS 10900 Facilitator: Richard Bruskiewich Adjunct Professor,

Presentation transcript:

NGS Bioinformatics Workshop 2.2 Tutorial – Whole Genome Assembly Part I May 9th, 2012 IRMACS Facilitator: Richard Bruskiewich Adjunct Professor, MBB

Workflow for Today
• Generate a synthetic NGS read data set
• Genome assembly
  • ABySS
  • Velvet
  • ALLPATHS-LG

Generate synthetic NGS read data for assembly
• Try out a new program called "ART" from Baylor College: Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinformatics. 2012;28(4):593-4.
• Available as open source and as binary programs for 32- or 64-bit Windows, Mac and Linux
• Notes:
  • The binary archive names are a bit strange: each is really a .tar.gz in disguise (run gunzip, then tar -xvf)
  • The FASTQ sequence line is *lower case*, which some software (e.g. ABySS) does not expect
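The unpacking step in the notes above can also be done from Python. This is a minimal sketch, assuming the download really is a gzip-compressed tar archive; the function name and the example file name are hypothetical, not part of ART.

```python
import tarfile


def unpack_art_archive(archive_path, dest_dir="."):
    """Extract a tar archive regardless of what its file extension says.

    mode="r:*" lets tarfile detect the compression from the file contents,
    so an oddly named .tar.gz still unpacks.
    Returns the list of member names found in the archive.
    """
    with tarfile.open(archive_path, mode="r:*") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names
```

For example, `unpack_art_archive("art_download", "art")` would unpack a hypothetical file named `art_download` into a directory called `art`.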

Simulated Illumina Paired End Reads
• Using rice chloroplast genome (~134 kb):
  art_illumina -i Chloroplast.fasta -p -l 50 -f 20 -m 200 -s 10 -o Chloroplast -sam
• Generates files:
  • Chloroplast1.aln
  • Chloroplast1.fq
  • Chloroplast2.aln
  • Chloroplast2.fq
  • Chloroplast.sam
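As a rough sanity check on the simulation parameters, the number of reads to expect follows from coverage × genome length / read length. The sketch below assumes the 20X fold coverage is the total over both mates of each pair and uses the rounded ~134 kb genome size, so the count ART actually reports may differ slightly. The function name is hypothetical.

```python
def expected_read_count(genome_length, fold_coverage, read_length):
    """Reads needed so that reads * read_length = coverage * genome_length."""
    return round(fold_coverage * genome_length / read_length)


genome_length = 134_000  # approximate chloroplast genome size (~134 kb)
reads = expected_read_count(genome_length, fold_coverage=20, read_length=50)
pairs = reads // 2       # assumption: coverage is split evenly between mates
print(reads, pairs)
```

With these rounded inputs the estimate is about 53,600 reads, or about 26,800 pairs.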

==============================================================================
ART (Q Version 1.3.6)
Copyright(c) , Weichun Huang, Jason Myers. All Rights Reserved.
==============================================================================
Paired-end Simulation

Total CPU time used: 2.48

Parameters used during run
    Read Length: 50
    Fold Coverage: 20X
    Mean Fragment Length: 200
    Standard Deviation: 10
    Profile Type: Combined
    ID Tag:

Quality Profile(s)
    First Read: EMP50R1 (built-in profile)
    Second Read: EMP50R2 (built-in profile)

Output files
    FASTQ Sequence Files:
        the 1st reads: Chloroplast1.fq
        the 2nd reads: Chloroplast2.fq
    ALN Alignment Files:
        the 1st reads: Chloroplast1.aln
        the 2nd reads: Chloroplast2.aln
    SAM Alignment File: Chloroplast.sam

Unfortunately…
• The ART program generates peculiar IDs (it doesn’t mark the paired end reads…) and lower case sequence letters, which cause some headaches…
• So, I wrote a small Python script to fix this…

#!/usr/bin/env python3
# Fixes the output of the ART program
# art_illumina -i reference.fa -p -l 50 -f 20 -m 200 -s 10 -o outFile_prefix -sam
from sys import stdin

seq = False   # True when the next line is a sequence line
qual = False  # True when the next line is a quality line

if __name__ == '__main__':
    for line in stdin:
        line = line.strip()
        if qual:
            # to avoid treating rare quality score lines that start with @ as IDs
            qual = False
        elif line.startswith('+'):
            qual = True
        elif not seq and line.startswith('@'):
            # massage the ID
            part1 = line.split('|')
            part2 = part1[1].split('-')
            line = part1[0] + '_' + part2[0] + '-' + part2[1] + '/' + part2[2]
            seq = True
        elif seq:
            # convert sequence all to upper case to avoid downstream confusion...
            line = line.upper()
            seq = False
        print(line)
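After running a fix like the one above, it can be worth confirming that no lower case sequence lines remain. The following is a minimal sketch, assuming strict four-line FASTQ records with no line wrapping; the function name and the sample records are made up for illustration.

```python
def check_fastq(lines):
    """Count records, lower case sequences and malformed IDs in FASTQ lines."""
    lines = [line.rstrip("\n") for line in lines]
    report = {"records": 0, "lowercase": 0, "bad_ids": 0}
    # Step through complete 4-line records: ID, sequence, '+', quality
    for i in range(0, len(lines) - 3, 4):
        header, sequence = lines[i], lines[i + 1]
        report["records"] += 1
        if not header.startswith("@"):
            report["bad_ids"] += 1
        if sequence != sequence.upper():
            report["lowercase"] += 1
    return report


# Made-up records: the first still has a lower case sequence line
sample = ["@read_1/1", "acgt", "+", "IIII",
          "@read_2/1", "ACGT", "+", "IIII"]
print(check_fastq(sample))
```

To check a real file, pass an open file handle, for example `check_fastq(open("Chloroplast1.fastq"))`.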

Getting ABySS
• Installation:
  • For Ubuntu: sudo apt-get install abyss
  • Or visit BCGSC and download the tar.gz source, then ./configure and make (more up-to-date?)
  • Perhaps put the abyss bin directory on your path…
• To test run ABySS:
  abyss-pe k=25 name=test se= velvet/master/data/test_reads.fa

Try our test PE read data set
• abyss-pe name=Chloroplast31 k=31 ABYSS_OPTIONS=--no-trim-masked in='Chloroplast1.fastq Chloroplast2.fastq'
• The 'no-trim-masked' option is needed because the default behaviour of abyss is to trim lower case letters in sequence (which designate identified vector sequences in 454 outputs…)
• Try with other k-mer sizes…
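To try several k-mer sizes without retyping the command, the command line can be generated in a loop. This sketch only builds and prints the commands and does not run ABySS. It reuses the options shown above; the particular k values and the function name are illustrative assumptions.

```python
import shlex


def abyss_pe_command(k, reads=("Chloroplast1.fastq", "Chloroplast2.fastq"),
                     prefix="Chloroplast"):
    """Build the abyss-pe argument list for one k-mer size."""
    return [
        "abyss-pe",
        f"name={prefix}{k}",               # separate output name per k
        f"k={k}",
        "ABYSS_OPTIONS=--no-trim-masked",  # keep lower case bases
        "in=" + " ".join(reads),           # both read files in one argument
    ]


# Illustrative k values; each must be shorter than the 50 bp reads
for k in (21, 25, 31, 35):
    print(shlex.join(abyss_pe_command(k)))
```

Each printed line can be pasted into a shell, or the argument list can be passed to `subprocess.run` once ABySS is installed.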

For more info about ABySS
• Active list service to troubleshoot issues:

Velvet
• download & tar -zxvf
• make
• sudo make install
• put velvet directory on your $PATH
• Run velveth:
  velveth outputdir k_mer -fastq readfile
• Run velvetg:
  velvetg outputdir -ins_length 200 -exp_cov 20
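The Velvet manual describes coverage in k-mer terms, relating k-mer coverage Ck to nucleotide coverage C by Ck = C × (L − k + 1) / L for reads of length L. The sketch below applies that relation to the 20X, 50 bp reads simulated earlier. The k values are illustrative, and whether to adjust -exp_cov accordingly is left as something to experiment with.

```python
def kmer_coverage(nucleotide_coverage, read_length, k):
    """Convert nucleotide coverage to k-mer coverage: C * (L - k + 1) / L."""
    if k > read_length:
        raise ValueError("k cannot exceed the read length")
    return nucleotide_coverage * (read_length - k + 1) / read_length


# 20X nucleotide coverage with 50 bp reads, for a few illustrative k values
for k in (21, 25, 31):
    print(k, kmer_coverage(20, 50, k))
```

For k=31 this gives a k-mer coverage of 8.0, noticeably lower than the 20X nucleotide coverage.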

ALLPATHS-LG
• download and tar -zxvf
• ./configure
• make
• sudo make install
• Execute the program:
  • PrepareAllPathsInputs.pl # needs some config files…
  • RunAllPathsLG