MCB 5472 Assignment #6: HMMER and using perl to perform repetitive tasks February 26, 2014.



One note about assignment #5: Something we didn't talk about last week in class. I showed code that skips lines after the first (best) hit is found in the BLAST output. Really, you should have done this for both BLAST outputs. If you made 2 hashes, one for each BLAST output, you are probably OK. Otherwise, you have to make a second hash just so you can skip already-found lines in the second output file.

    open (INFILE1, $ARGV[0]) or die;
    open (INFILE2, $ARGV[1]) or die;

    # First BLAST output: keep only the first (best) hit for each query
    while ($line = <INFILE1>){
        @array = split "\t", $line;
        next if ($hash1{$array[0]});               # best hit already stored
        next if ($array[3]/$array[10] < 0.7);
        next if ($array[3]/$array[11] < 0.7);
        $hash1{$array[0]} = $array[1];
    }

    # Second BLAST output: again skip everything after the first hit
    while ($line = <INFILE2>){
        @array = split "\t", $line;
        next if ($hash2{$array[0]});               # query already seen
        next if ($array[3]/$array[10] < 0.7);
        next if ($array[3]/$array[11] < 0.7);
        $hash2{$array[0]} = "found";
        if ($hash1{$array[1]} eq $array[0]){       # reciprocal best hit
            $count++;
        }
    }
    print "$count RBHs found";

This week: You have 2 weeks to complete this assignment! Central concept: you can use perl to automate repetitive computations. We will also demonstrate how to use HMMER to create and use HMMs.

Perl system command: system is a perl command that runs terminal commands from inside a perl script, e.g., system "cat file1.faa file2.faa > both.faa"; Advantage: you can make the filenames in these commands variables and run them inside a loop, e.g., system "cat $file1 $file2 > both.faa";
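As a minimal sketch of the point above, the following script wraps the cat call in a subroutine and checks the value that system returns (0 means the command succeeded). The subroutine name concat_files and the two tiny demo files are made up for illustration; the script writes them into a scratch directory so it can be run anywhere that has cat.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Concatenate two files into a third by calling the terminal command cat.
# system returns 0 when the command succeeds, so anything else is an error.
sub concat_files {
    my ($file1, $file2, $outfile) = @_;
    my $status = system "cat $file1 $file2 > $outfile";
    die "cat failed for $file1 and $file2\n" if $status != 0;
    return $outfile;
}

# Demo: write two small hypothetical fasta files into a scratch directory.
my $dir  = tempdir(CLEANUP => 1);
my %demo = (
    "$dir/file1.faa" => ">seq1\nMKV\n",
    "$dir/file2.faa" => ">seq2\nMRT\n",
);
foreach my $name (sort keys %demo) {
    open(my $out, '>', $name) or die "cannot write $name: $!";
    print $out $demo{$name};
    close $out;
}

# The filenames are variables, so the same call can sit inside a loop.
concat_files("$dir/file1.faa", "$dir/file2.faa", "$dir/both.faa");

open(my $in, '<', "$dir/both.faa") or die "cannot read both.faa: $!";
my $combined = do { local $/; <$in> };   # slurp the whole file
close $in;
print $combined;
```

Checking the return value is optional, but in a loop of thousands of commands it makes a failed step stop the script instead of silently producing empty files.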

Input: The key to making these types of loops is having input files containing filenames so that you can process them one after the other, e.g., tables with multiple columns, each containing a filename, or a list of files generated at the terminal with ls *.faa > files.list. Names don't even have to match exactly, just closely enough that you can make exact matches with regular expressions.
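One way to read such a table is sketched below, assuming a two-column, tab-separated layout with one pair of filenames per line. The subroutine name read_pairs and the filenames are hypothetical, and the table is held in a string here only so the example runs without any files on disk.

```perl
use strict;
use warnings;

# Read a tab-separated table with one pair of filenames per line and
# return the pairs as a list of two-element array references.
sub read_pairs {
    my ($fh) = @_;
    my @pairs;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;                 # skip blank lines
        my ($file1, $file2) = split "\t", $line;
        push @pairs, [$file1, $file2];
    }
    return @pairs;
}

# Hypothetical table contents; in practice this would be a file on disk.
my $table = "yp_000001.1.fasta\typ_100001.1.fasta\n"
          . "yp_000002.1.fasta\typ_100002.1.fasta\n";
open(my $fh, '<', \$table) or die "cannot open in-memory table: $!";
my @pairs = read_pairs($fh);
close $fh;

# Process the pairs one after the other.
foreach my $pair (@pairs) {
    my ($file1, $file2) = @$pair;
    print "would process $file1 and $file2\n";
}
```

To read a real file instead, open it with open(my $fh, '<', $ARGV[0]) and pass that filehandle to the same subroutine.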

Today’s exercise For each RBH pair from last week: Combine both sequences into a single file Create a multiple sequence alignment using muscle Create a HMM using HMMER3 Search the other 5 draft genomes using that model Count the hits times total! With only a few lines of code (trust me)!

Step #1: Modify your RBH script to make a table with each pair of orthologs on one line, separated by a tab. This allows you to extract the filenames for further analysis.
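A possible shape for that table is sketched here. It assumes the best hits in each direction are already stored as hashes (query => best subject), which is a slightly different bookkeeping than counting inside the second loop; the subroutine name rbh_table and the identifiers are invented for the example.

```perl
use strict;
use warnings;

# Given the best hits in each direction (query => best subject), return one
# "ortholog1<TAB>ortholog2" line for every reciprocal best hit.
sub rbh_table {
    my ($best1, $best2) = @_;                  # two hash references
    my @lines;
    foreach my $query (sort keys %$best1) {
        my $subject = $best1->{$query};
        next unless defined $best2->{$subject};
        next unless $best2->{$subject} eq $query;   # must point back
        push @lines, "$query\t$subject";
    }
    return @lines;
}

# Hypothetical best hits: only the first pair is reciprocal.
my %best1 = (
    'gi|1|ref|YP_000001.1|' => 'gi|9|ref|YP_100001.1|',
    'gi|2|ref|YP_000002.1|' => 'gi|8|ref|YP_100002.1|',
);
my %best2 = (
    'gi|9|ref|YP_100001.1|' => 'gi|1|ref|YP_000001.1|',
    'gi|8|ref|YP_100002.1|' => 'gi|3|ref|YP_000003.1|',
);

my @table = rbh_table(\%best1, \%best2);
print "$_\n" foreach @table;
```

Redirecting the printed lines to a file (for example with > rbh_table.txt at the terminal) gives the input table for the later steps.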

Step #2: Convert the protein multiple fasta files into individual fasta files using the terminal command seqret, part of the EMBOSS package (very useful!). Syntax: seqret -auto -ossingle all.faa Output: lots of files looking like yp_ fasta Note: small letters. Note: version number maintained (in gi header).

Step #3: Concatenate all of the protein faa sequences from the 5 E. coli draft genomes into a single file using cat. Syntax: cat *.faa > all.faa

Everything after this is looping in a perl script: load your RBH filename table as the input file, and use "system" to invoke terminal commands.

Step #4: Extract filenames from the input table using regular expressions. To match the seqret output: keep the version number, use small letters, and add ".fasta". Note on regular expressions and NCBI fasta headers: the "|" means "or" in a regular expression. If you want to match a literal "|", you need to precede it with a backslash, e.g., "\|".
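The conversion described in this step could look roughly like the sketch below. It assumes identifiers of the form gi|number|database|accession.version|; the subroutine name id_to_filename and the example identifier are hypothetical.

```perl
use strict;
use warnings;

# Turn an NCBI-style identifier such as gi|12345|ref|YP_000001.1| into the
# lower-case filename described for the seqret output.
sub id_to_filename {
    my ($id) = @_;
    # \| matches a literal pipe; the brackets capture the accession
    # together with its version number.
    if ($id =~ /^gi\|\d+\|\w+\|([^|]+)\|/) {
        return lc($1) . ".fasta";      # small letters, then add .fasta
    }
    return;                            # no match: caller decides what to do
}

my $filename = id_to_filename('gi|12345|ref|YP_000001.1|');
print "$filename\n";
```

Returning nothing for an unrecognised identifier lets the main loop skip or report that line instead of building a filename that does not exist.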

More regular expression fun: You can capture text out of regular expressions using round brackets, e.g., /^gi\|(.+?)\|/ matches everything between the first two "|" characters (the ? makes .+ stop at the first "|" it can; a plain .+ would run on to the last one). Everything in the round brackets is kept in the special $1 variable; this is why you can't start your own variable names with a number in perl. You can capture multiple times, e.g., /^gi\|(.+?)\|(.+?)\|/: $1 contains what was matched by the first brackets and $2 contains what was matched by the second brackets.
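A short demonstration of capturing, using a made-up identifier, may help here. It also shows how much a plain (.+) grabs compared with (.+?), which matters when a header contains more than two "|" characters. Assigning a match to a list, as done below, returns the same values that perl puts into $1 and $2.

```perl
use strict;
use warnings;

my $header = 'gi|12345|ref|YP_000001.1|';   # hypothetical identifier

# Greedy: (.+) grabs as much as possible, so it runs to the LAST pipe.
my ($greedy) = $header =~ /^gi\|(.+)\|/;

# Non-greedy: (.+?) stops at the first pipe that lets the match succeed.
my ($lazy) = $header =~ /^gi\|(.+?)\|/;

# Two capture groups: the gi number and the accession with its version.
my ($gi, $accession) = $header =~ /^gi\|(\d+)\|ref\|([^|]+)\|/;

print "greedy:    $greedy\n";
print "lazy:      $lazy\n";
print "gi:        $gi\n";
print "accession: $accession\n";
```

Writing [^|]+ ("anything that is not a pipe") is another way to keep a capture from running past the next "|".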

Step #5: Join your two RBH files into a single multiple fasta file containing 2 sequences, e.g., system "cat $file1 $file2 > both.faa";

Step #6: Align the sequences in your multiple fasta file using the program muscle. Syntax: muscle -in [both.faa] -out [both.muscle.faa] Note: muscle outputs aligned fasta files.

Step #7: Use your sequence alignment to create a HMM. Syntax: hmmbuild [both.hmm] [both.muscle.faa], followed by hmmpress [both.hmm]

Step #8: Query the multiple fasta file containing all of the proteins from the 5 draft genomes with the HMM. Syntax: hmmscan --tblout [table.out] -o [hmmscan.out] [both.hmm] [all_drafts.faa]
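Steps #5 to #8 can be strung together for one pair of files roughly as follows. This is only a sketch: the subroutine name pipeline_commands, the $dry_run switch and the input filenames are assumptions, and the commands use just the programs and options given in the steps above. With $dry_run set to 1 the commands are printed rather than run, so the script works even where muscle and HMMER are not installed.

```perl
use strict;
use warnings;

# Build the terminal commands for steps 5 to 8 for one pair of files.
# The intermediate filenames are reused (overwritten) for every pair.
sub pipeline_commands {
    my ($file1, $file2, $database) = @_;
    return (
        "cat $file1 $file2 > both.faa",
        "muscle -in both.faa -out both.muscle.faa",
        "hmmbuild both.hmm both.muscle.faa",
        "hmmpress both.hmm",
        "hmmscan --tblout table.out -o hmmscan.out both.hmm $database",
    );
}

my $dry_run  = 1;   # set to 0 to actually run the commands
my @commands = pipeline_commands("yp_000001.1.fasta", "yp_100001.1.fasta",
                                 "all_drafts.faa");

foreach my $command (@commands) {
    if ($dry_run) {
        print "$command\n";                  # show what would be run
    } else {
        system($command) == 0 or die "command failed: $command\n";
    }
}
```

One thing to watch when this runs in a loop: as far as I know, hmmpress will not overwrite index files left over from the previous pair unless it is given the -f option, so either add -f or delete the old both.hmm.h3* files at the end of each pass.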

Step #9: Parse the hmmscan output file and count the number of orthologs. Choose length and similarity thresholds in light of your results from assignment #4 (looking for paralogs). It is possible to have >5 orthologs (poor genome quality). Keep a running total of ortholog counts (in a hash table, as in assignment #4).
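A minimal sketch of the parsing step is given below. It reads the table written by --tblout, in which lines starting with # are comments and the remaining lines are whitespace-separated, with the HMM name in the first field, the sequence name in the third and the full-sequence E-value in the fifth. The E-value cutoff of 1e-10 is a placeholder assumption standing in for whatever thresholds you settle on, the subroutine name count_hits is made up, and the three table rows are invented data.

```perl
use strict;
use warnings;

# Count the sequences in an hmmscan --tblout table whose full-sequence
# E-value is at or below a cutoff. Each sequence is counted only once.
sub count_hits {
    my ($fh, $max_evalue) = @_;
    my %seen;
    while (my $line = <$fh>) {
        next if $line =~ /^#/;               # comment and header lines
        my @fields = split ' ', $line;       # split on runs of whitespace
        next if @fields < 5;
        next if $fields[4] > $max_evalue;    # field 4 = full-sequence E-value
        $seen{$fields[2]} = 1;               # field 2 = sequence name
    }
    return scalar keys %seen;
}

# Invented example table with two strong hits and one weak one.
my $table = <<'END';
# target name  accession  query name  accession  E-value  score  bias
both  -  gi|111|ref|YP_200001.1|  -  1.2e-50  170.1  0.1  1.5e-50  169.8  0.1  1.0  1  0  0  1  1  1  1  -
both  -  gi|222|ref|YP_300001.1|  -  3.4e-45  152.7  0.0  4.0e-45  152.5  0.0  1.0  1  0  0  1  1  1  1  -
both  -  gi|333|ref|YP_400001.1|  -  0.5      10.2   0.0  0.7      9.8    0.0  1.0  1  0  0  1  1  1  0  -
END
open(my $fh, '<', \$table) or die "cannot open in-memory table: $!";
my $hits = count_hits($fh, 1e-10);
close $fh;

# Running total: how many HMMs found 0, 1, 2, ... hits.
my %totals;
$totals{$hits}++;
print "$hits hits at or below the cutoff\n";
```

In the full script, %totals would be declared once before the main loop and printed after it, so that it accumulates the counts over all RBH pairs.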

Important hint: Run through your script once with die placed at the end of your main loop, to make sure everything works before running through ~25,000 commands.
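The single-pass test might be set up as in this sketch. The $test_run switch and the filenames are hypothetical, and the example uses last rather than die: die stops the whole script on the spot, while last only leaves the loop, so any summary code after the loop still runs during the test.

```perl
use strict;
use warnings;

my @pairs = (["a.fasta", "b.fasta"], ["c.fasta", "d.fasta"]);   # hypothetical
my $test_run = 1;      # set to 0 once the first pass looks right
my @processed;

foreach my $pair (@pairs) {
    my ($file1, $file2) = @$pair;
    push @processed, "$file1 $file2";        # stand-in for the real commands
    print "processed $file1 and $file2\n";
    last if $test_run;                       # stop after the first pair
}

print scalar(@processed), " pair(s) processed\n";
```

After the test pass, check the intermediate files (both.faa, the alignment, the HMM and the hmmscan table) by eye before switching $test_run off.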