BLAST
Basic Local Alignment Search Tool
Developed by NCBI
NCBI – National Center for Biotechnology Information
NLM – US National Library of Medicine
NIH – National Institutes of Health
http://blast.ncbi.nlm.nih.gov/
Latest version (executable): 2.2.28+
ftp://ftp.ncbi.nlm.nih.gov/blast+/LATEST/
BLAST
A suite of tools that work together to find sequences similar to a protein or nucleotide query.
Three categories of applications:
1. Search Tools
2. BLAST Database Tools
3. Sequence Filtering Tools
BLAST Command Line User Manual: http://www.ncbi.nlm.nih.gov/books/NBK1763/
SEARCH APPLICATIONS
Execute a BLAST search.
blastn – Nucleotide BLAST: searches a nucleotide database using a nucleotide query.
blastp – Protein BLAST: searches a protein database using a protein query.
blastx – Searches a protein database using a translated nucleotide query.
tblastx – Searches a translated nucleotide database using a translated nucleotide query.
tblastn – Searches a translated nucleotide database using a protein query.
SEARCH APPLICATIONS CONT.
psiblast – Position-Specific Iterated BLAST: finds sequences significantly similar to the query in a database search and uses the resulting alignments to build a Position-Specific Score Matrix (PSSM).
rpsblast – Reverse Position-Specific BLAST: uses a protein query to search a database of pre-calculated PSSMs and reports significant hits in a single pass.
rpstblastn – Searches a database of pre-calculated PSSMs using a translated nucleotide query.
BLAST DATABASE APPLICATIONS
Create or examine BLAST databases.
makeblastdb – Creates BLAST databases.
blastdb_aliastool – Manages BLAST databases: search multiple databases together or search a subset of sequences within a database.
makeprofiledb – Builds an RPS-BLAST database.
blastdbcmd – Examines the contents of a BLAST database.
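A database tool and a search tool are typically used together: build a database from a FASTA file, then search it. The sketch below only assembles the two command lines in Python; the file names (reference.fasta, query.fasta, hits.tsv), the database name and the E-value threshold are placeholder assumptions, and the flags shown (-in, -dbtype, -out, -query, -db, -evalue, -outfmt) are the standard BLAST+ ones.

```python
import shlex


def makeblastdb_cmd(fasta_path, dbtype, db_name):
    """Command line that builds a BLAST database from a FASTA file.

    dbtype is "nucl" for nucleotide or "prot" for protein sequences.
    """
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


def blastn_cmd(query_path, db_name, out_path, evalue=1e-5):
    """Command line for a nucleotide-vs-nucleotide search.

    -outfmt 6 requests tab-separated output; the E-value cutoff is an
    illustrative choice, not a recommendation.
    """
    return [
        "blastn",
        "-query", query_path,
        "-db", db_name,
        "-evalue", str(evalue),
        "-outfmt", "6",
        "-out", out_path,
    ]


# Placeholder file names (assumptions for illustration only).
commands = [
    makeblastdb_cmd("reference.fasta", "nucl", "reference_db"),
    blastn_cmd("query.fasta", "reference_db", "hits.tsv"),
]

for cmd in commands:
    # Print the shell form; to execute, pass cmd to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building the commands as lists keeps them safe to hand to subprocess without shell quoting issues.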
SEQUENCE FILTERING APPLICATIONS
segmasker – Identifies and masks low-complexity regions* of protein sequences.
dustmasker – Similar to segmasker, but for nucleotide sequences.
windowmasker – Uses a genome to identify sequences represented too often to be of interest to most users.
*Low-complexity regions – Regions of a sequence composed of few distinct elements. BLAST ignores these unless explicitly told to include them in searches. Left unfiltered, they may achieve high scores that push more biologically significant matches down the results.
E-VALUE
The number of hits one can expect to see by chance when searching a database of a given size.
This value decreases exponentially as the alignment score increases.
The lower the E-value, the more significant the match.
The E-value also depends on the length of the query sequence. Short queries tend to produce higher E-values, because a short match can only reach a modest score and is more likely to occur in the database by chance.
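The relationship between score and E-value can be illustrated with the Karlin-Altschul formula E = K * m * n * exp(-lambda * S), where m is the query length and n is the database length. This is a minimal sketch: the function name is hypothetical, and the lambda and K values below are illustrative assumptions (they depend on the scoring matrix and gap penalties actually used).

```python
import math


def expected_hits(raw_score, query_len, db_len, lam, k):
    """E-value from the Karlin-Altschul formula E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Illustrative statistical parameters (assumed, not taken from the slides).
LAMBDA = 0.267
K = 0.041

# Same query and database, increasing raw score: E drops exponentially.
for score in (40, 60, 80):
    e = expected_hits(score, query_len=250, db_len=1_000_000, lam=LAMBDA, k=K)
    print(f"raw score {score}: E = {e:.3g}")
```

Each additional point of raw score multiplies E by exp(-lambda), which is why modest score gains change significance so sharply.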
BITSCORE
The bit score S' is derived from the raw alignment score S:
S' = (lambda * S - ln K) / ln 2
Lambda and K are statistical parameters of the scoring system.
http://www.ncbi.nlm.nih.gov/books/NBK21106/bin/glossfig1.jpg
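As a worked example, the bit score normalizes the raw score so that the E-value can be written without lambda and K: E = m * n * 2^(-S'). The sketch below assumes illustrative values for lambda and K, and the function names are hypothetical.

```python
import math


def bit_score(raw_score, lam, k):
    """Bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def evalue_from_bits(bits, query_len, db_len):
    """E-value expressed through the bit score: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative parameters (assumed; real values depend on the scoring system).
LAMBDA = 0.267
K = 0.041

bits = bit_score(80, LAMBDA, K)
print(f"bit score: {bits:.1f}")
print(f"E-value:   {evalue_from_bits(bits, 250, 1_000_000):.3g}")
```

Because lambda and K are folded into S', bit scores from searches with different scoring systems can be compared directly.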
FASTA FORMAT
Text-based format representing nucleotide or peptide sequences.
Each record starts with a ">" followed by the sequence identifier and an optional description; the sequence itself follows on the next lines.
>seq_1 Some description
GAGGGCTCATCCGGGAATCGAACCCGGGACCT
CTCGCACCCTAAGCGAGAATCATACGACTAGACC
AATGAGCCGTGTTCAAAGAGTGTCAAAATGTGTTTC
GAGCGTCTATGTCCAAAGTGAATTGCTTGTCTTTTGA
GTTTTGCGATTG
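The record layout described above can be read with a few lines of Python. This is a minimal sketch, not a full-featured parser (libraries such as Biopython handle more edge cases); the function name parse_fasta is hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (identifier, description, sequence) tuples."""
    records = []
    header = None
    chunks = []

    def flush():
        # Store the record collected so far, if any.
        if header is not None:
            parts = header.split(None, 1)
            identifier = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            records.append((identifier, description, "".join(chunks)))

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated; line wrapping carries no meaning.
            chunks.append(line)
    flush()
    return records


example = """>seq_1 Some description
GAGGGCTCATCCGGGAATCGAACCCGGGACCT
CTCGCACCCTAAGCGAGAATCATACGACTAGACC
AATGAGCCGTGTTCAAAGAGTGTCAAAATGTGTTTC
GAGCGTCTATGTCCAAAGTGAATTGCTTGTCTTTTGA
GTTTTGCGATTG
"""

for identifier, description, sequence in parse_fasta(example):
    print(identifier, "|", description, "|", len(sequence), "bases")
```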
BWA-MEM
Burrows-Wheeler Aligner
A software package for aligning sequences against large reference genomes.
The BWA package contains three different algorithms: BWA-backtrack, BWA-SW, and BWA-MEM.
Manual page: http://bio-bwa.sourceforge.net/bwa.shtml
BWA-MEM
Aligns query sequences from 70 bp to 1 Mbp.
MEM – Maximal Exact Matches, which are used to seed alignments.
Performs local alignment.
HOW TO RUN
1. Index the reference FASTA file.
2. Run BWA-MEM with a query file (in FASTQ format) against the indexed reference.
3. The output is written in SAM format.
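The steps above correspond to two commands, bwa index and bwa mem, with the SAM output captured from standard output. The sketch below assembles them in Python; the file names are placeholder assumptions, and run_pipeline is a hypothetical helper that is defined but not called here, since it requires bwa to be installed.

```python
import shlex
import subprocess


def bwa_index_cmd(reference):
    """Command line that indexes a reference FASTA file."""
    return ["bwa", "index", reference]


def bwa_mem_cmd(reference, reads, threads=1):
    """Command line that aligns reads against an indexed reference."""
    return ["bwa", "mem", "-t", str(threads), reference, reads]


def run_pipeline(reference, reads, sam_path, threads=1):
    """Index the reference, then align and write the SAM output to sam_path."""
    subprocess.run(bwa_index_cmd(reference), check=True)
    # bwa mem writes SAM to standard output, so redirect it into a file.
    with open(sam_path, "w") as sam:
        subprocess.run(bwa_mem_cmd(reference, reads, threads), stdout=sam, check=True)


# Placeholder file names (assumptions for illustration only).
print(shlex.join(bwa_index_cmd("reference.fasta")))
print(shlex.join(bwa_mem_cmd("reference.fasta", "reads.fastq", threads=4)) + " > aligned.sam")
```

Indexing only needs to be done once per reference; later alignments can reuse the index files.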
FASTQ FORMAT
Similar to FASTA format, but with a quality score added for each base.
@HWI-EAS397:8:1:1067:18713#CTTGTA/1
TGGAGATGAGATTGTCGGCTTTATTACCCAGGGGCGGGGGGTTATTGTA
+
Y^]Lcda]YcffccffadafdWKd_V\``^\aa^BBBBBBBBBBBBBBB
The quality score is an integer mapping of the probability that the base is incorrect.
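To make the mapping concrete: a Phred quality score Q relates to the error probability p by Q = -10 * log10(p), and each quality character encodes Q as its ASCII code minus an offset. The sketch below decodes the quality string from the example record. The offset of 64 is an assumption based on the characters present (older Illumina encoding); many current files use an offset of 33, so check the data source before relying on either.

```python
def decode_quality(quality, offset):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(phred):
    """Probability that a base call is wrong, from Q = -10 * log10(p)."""
    return 10 ** (-phred / 10)


# Quality line from the example record; offset 64 is an assumption.
quality_line = r"Y^]Lcda]YcffccffadafdWKd_V\``^\aa^BBBBBBBBBBBBBBB"
scores = decode_quality(quality_line, offset=64)

print("first score:", scores[0])
print("last score: ", scores[-1])
print("error probability at Q20:", error_probability(20))
```

The run of low scores at the end of the read is the kind of pattern that quality trimming tools look for.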
SAM FILE
Each alignment line has eleven mandatory fields and a variable number of optional fields.
The optional fields are key-value pairs in the form TAG:TYPE:VALUE. These store extra information.
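The field layout can be sketched as a small parser. The eleven mandatory field names follow the SAM specification; the alignment line used as input is a made-up example, and parse_sam_line is a hypothetical helper rather than part of any library. Header lines (starting with "@") are not handled here.

```python
SAM_FIELDS = (
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
)
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Split one tab-separated SAM alignment line into fields and optional tags."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 fields")

    record = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value

    # Optional fields look like TAG:TYPE:VALUE; type "i" marks an integer.
    tags = {}
    for optional in columns[len(SAM_FIELDS):]:
        tag, value_type, value = optional.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    record["tags"] = tags
    return record


# Made-up alignment line for illustration only.
line = "read_1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:0\tAS:i:10"
record = parse_sam_line(line)
print(record["QNAME"], record["RNAME"], record["POS"], record["CIGAR"], record["tags"])
```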
BWA-MEM OPTIONS
-t INT – Number of threads.
-T INT – Don’t output alignments with a score lower than INT.
-a – Output all found alignments for single-end or unpaired paired-end reads.
(In output, ‘*’ are considered zero.)
REFERENCES
NCBI Help Manual - http://www.ncbi.nlm.nih.gov/books/NBK3831/
BWA - http://bio-bwa.sourceforge.net/
FASTA - http://en.wikipedia.org/wiki/FASTA_format
FASTQ - http://en.wikipedia.org/wiki/FASTQ_format
Li, H., et al. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16). Applications Note.