BINF350, Tutorial 4 Karen Marshall
Aim ► Examine how BLAST parameters (e.g. scoring scheme, word length) affect the alignment outcome ► Optimise BLAST parameters for alignments with different levels of sequence homology
Practical: Part 1 ► Start with an ~200 bp original DNA sequence ► Simulate mutation events over time and collect the sequences ► BLAST the original sequence against the mutated sequences ► Repeat the BLAST searches using different parameters [Figure: original sequence giving rise to mutated sequences]
Simulation of mutated sequences ► Point accepted mutation (PAM) model of molecular evolution ► 1 PAM = 1 mutation per 100 bases on average ► 1 PAM: 99.0% sequence homology; 10 PAM: 90.6% sequence homology; 50 PAM: 63.5% sequence homology ► Concept of forward and backward mutation
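The homology figures above can be approximated in closed form. The sketch below assumes (this is not stated on the slide) that every base mutates independently with probability 0.01 per PAM step and that a mutated base is replaced by one of the other three bases with equal probability, so a backward mutation can restore the original base. The function name `expected_identity` is introduced here for illustration only.

```python
def expected_identity(pam: int, rate: float = 0.01) -> float:
    """Expected fraction of sites matching the original after `pam` steps.

    Assumes a uniform model: each step a base changes with probability
    `rate` to one of the 3 other bases, chosen with equal probability.
    """
    # If p is the probability that a site still matches the original, then
    # p_next = p * (1 - rate) + (1 - p) * rate / 3, which solves to:
    return 0.25 + 0.75 * (1.0 - 4.0 * rate / 3.0) ** pam


for pam in (1, 10, 50):
    print(f"{pam:>2} PAM: {100 * expected_identity(pam):.1f}% expected identity")
```

Under these assumptions the formula reproduces the 1 PAM and 10 PAM values and gives roughly 63.3% at 50 PAM, close to but not exactly the 63.5% quoted, which may come from a simulated run or a slightly different model.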
for each successive PAM
    for each nucleotide
        draw rand uniformly from [0, 1)
        if (rand > 0.01) do not mutate
        else mutate by random selection from the three non-identical bases
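The loop above can be written as a short Python sketch. The names `mutate_once`, `simulate` and `identity` are hypothetical helpers, not part of the practical's handout, and the sketch assumes upper-case A/C/G/T input and one random draw per base per PAM step.

```python
import random

BASES = "ACGT"


def mutate_once(seq, rate=0.01, rng=random):
    """Apply one PAM step: each base mutates with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            # Pick uniformly among the bases that differ from the current one.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate(original, max_pam, rate=0.01, seed=None):
    """Return [original, seq after 1 PAM, ..., seq after max_pam PAM]."""
    rng = random.Random(seed)
    seqs = [original]
    for _ in range(max_pam):
        seqs.append(mutate_once(seqs[-1], rate, rng))
    return seqs


def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    start = "".join(random.Random(1).choice(BASES) for _ in range(200))
    history = simulate(start, max_pam=50, seed=2)
    for pam in (1, 10, 50):
        print(pam, round(identity(start, history[pam]), 3))
```

Because each step mutates the previous step's sequence, a site can change back to its original base, which is the forward and backward mutation effect mentioned above.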
BLAST - Heuristic Step [Diagram labels: suffix tree; lookup table of words/seeds and their locations; threshold T; larger sequence file]
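As a rough illustration of the word/seed lookup idea, the sketch below indexes every word of length `w` in one sequence and then scans a second sequence for exact word matches. It is a simplification, not BLAST's actual implementation: it uses exact matches only, omits the threshold T and the extension step, and the choice of which sequence to index is an assumption. The function names are hypothetical.

```python
from collections import defaultdict


def build_word_table(seq, w=11):
    """Map each word of length w in `seq` to the list of its start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - w + 1):
        table[seq[i:i + w]].append(i)
    return table


def find_seeds(query, subject, w=11):
    """Return (query_pos, subject_pos) pairs where a word of length w matches."""
    table = build_word_table(query, w)
    hits = []
    for j in range(len(subject) - w + 1):
        # .get avoids inserting empty entries into the defaultdict.
        for i in table.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    print(find_seeds("ACGT", "TTACGTAA", w=4))
```

A shorter word length finds more seeds in diverged sequences but also more chance matches, which is why word length is one of the parameters varied in the practical.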
BLAST
February 10, 2004: BLAST released (BLAST release notes)
► Correction to tblastx alignment computation
► ia32-linux now requires glibc
► Source code can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/old/ /ncbi.tar.gz
► Binaries can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.8/
February 2, 2004: BLAST released (BLAST release notes)
► Standalone BLAST is now available for amd64-linux.
► formatdb now restricts volume sizes to 1G on 32-bit platforms for performance reasons.
► The -A option has been removed from formatdb, that is, all databases will be created with ASN.1 deflines.
► tblastn query concatenation now works correctly on 64-bit platforms.
► The wwwblast source code has been merged into the C toolkit tree and is no longer distributed with the binaries.
► Source code can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/old/ /ncbi.tar.gz
► Binaries can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.7/
Source: blast_whatsnew.shtml
BLAST on your own machine ► Allows you to BLAST multiple sequences (most web versions are single sequence only) ► Steps
1. Put the sequences in files in FASTA format. Each file can hold multiple sequences, but no duplicates.
2. Format the larger sequence file into a database:
   formatdb -i dbfile.txt -p F -o T
3. Perform the BLAST using appropriate switches:
   blastall -p blastn -d dbfile.txt -i comp.txt -o out.txt
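Before formatting a database it can help to check a FASTA file for duplicates. The sketch below assumes that "no duplicates" refers to sequence identifiers (the first word after `>`); if duplicate sequences are meant instead, the same idea applies to the sequence text. The helper names are hypothetical.

```python
def fasta_ids(lines):
    """Return the identifier (first word after '>') of every FASTA record."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            ids.append(words[0] if words else "")
    return ids


def duplicate_ids(lines):
    """Return identifiers that occur more than once, in order of first repeat."""
    seen, dups = set(), []
    for seq_id in fasta_ids(lines):
        if seq_id in seen and seq_id not in dups:
            dups.append(seq_id)
        seen.add(seq_id)
    return dups


if __name__ == "__main__":
    # dbfile.txt is the database file name used in the steps above.
    with open("dbfile.txt") as handle:
        print(duplicate_ids(handle))
```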
BLAST ► Arguments (see appendix of handout)
-W seed word length (default = 11)
-r reward for a match (default = 1)
-q penalty for a mismatch (default = -3)
-G cost to open a gap
-E cost to extend a gap
-F filter query sequence
-e expectation value threshold (threshold for HSP before gaps are included)
-m output format options
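Repeating the search over several parameter settings is easier with a small script. The sketch below only builds `blastall` command lines from the switches listed above; the word lengths and reward/penalty pairs are illustrative values, not recommendations, and the function names and output file naming are hypothetical. Running the commands requires the legacy standalone BLAST binaries to be installed.

```python
import itertools
import subprocess


def blastall_command(db, query, out, word=11, reward=1, penalty=-3, fmt=3):
    """Build a blastall (blastn) argument list from the switches above."""
    return ["blastall", "-p", "blastn", "-d", db, "-i", query, "-o", out,
            "-W", str(word), "-r", str(reward), "-q", str(penalty),
            "-m", str(fmt)]


def parameter_sweep(db, query, words=(8, 11), scores=((1, -3), (5, -4))):
    """Return one command per combination of word length and scoring pair."""
    commands = []
    for w, (r, q) in itertools.product(words, scores):
        out = f"out_W{w}_r{r}_q{abs(q)}.txt"  # hypothetical naming scheme
        commands.append(blastall_command(db, query, out, w, r, q))
    return commands


if __name__ == "__main__":
    for cmd in parameter_sweep("dbfile.txt", "comp.txt"):
        print(" ".join(cmd))
        # Uncomment to run, assuming blastall is on the PATH:
        # subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names.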
Example of BLAST output: -m3 [Screenshot: the list of sequences producing significant alignments, with Score (bits) and E value columns (E values ranging from about e-046 to e-011), followed by the alignment display in which each hit is shown beneath the query line and dots mark bases identical to the query.]
QUERY 1 agattcactggtgtggcaagttgtctctcagactgtacatgcattaaaattttgcttggc 60
Substitution scores ► Optimal substitution scores were derived for different PAM distances / sequence homologies (States et al., 1991) ► Of importance is the match to mismatch score ratio
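To see why the match to mismatch ratio depends on the expected homology, scores can be computed as log-odds values. The sketch below is a simplified uniform model, not necessarily the exact calculation in States et al.: it assumes equal base frequencies (0.25 each) and that all mismatches are equally likely. The function name `log_odds_scores` is hypothetical.

```python
import math


def log_odds_scores(identity):
    """Return (match, mismatch) log-odds scores in nats for a given identity.

    Assumes uniform base frequencies and equally likely mismatches:
    score = ln(P(pair aligned in related sequences) / P(pair by chance)).
    """
    match = math.log(identity / 0.25)
    mismatch = math.log(((1.0 - identity) / 3.0) / 0.25)
    return match, mismatch


for ident in (0.99, 0.95, 0.85, 0.75):
    m, mm = log_odds_scores(ident)
    print(f"{ident:.2f} identity: match {m:+.2f}, mismatch {mm:+.2f}, "
          f"ratio 1 : {mm / m:.2f}")
```

Under these assumptions, 99% identity gives a ratio of roughly 1 : -3, matching the default reward and penalty, while 75% identity gives 1 : -1, so more diverged sequences call for a relatively milder mismatch penalty.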
Substitution scores ► ‘Better’ substitution matrices exist, but not yet implemented in most BLAST software
Practical: Part 2 ► Apply concepts from Part 1 to ‘real sequences’ ► BLAST the mRNA sequences for human and cattle IFNG against an ~1/2 Mb sequence of human DNA ► Use optimal BLAST parameters for the expected homology [Figure labels: human DNA; human IFNG mRNA; cattle IFNG mRNA]
Expected levels of sequence homology ► Varies for sequences being considered and genomic region Human to mouse comparison, from …
Efficiency of BLAST ► Human to cattle coding sequence ~85% homology (~PAM 15)
IFNG mRNA sequences ► Extracted from the NCBI website using Batch Entrez
>gi| |ref|NM_ | Homo sapiens interferon, gamma (IFNG), mRNA
TGAAGATCAGCTATTAGAAGAGAAAGATCAGTTAAGTCCTTTGGACCTGATCAGCTTGATACAAGAACTACTGATTTCAACTTCTTTGGCTTAATTCTCTCGGAAACGATGAAATATACAAGTTATATCTTGGCTTTTCAGCTCTGCATCGTTTTGGGTTCTCTTGGCTGTTACTGCCAGGACCCATATGTAAAAGAAGCAGAAAACCTTAAGAAATATTTTAATGCAGGTCATTCAGATGTAGCGGATAATGGAACTCTTTTCTTAGGCATTTTGAAGAATTGGAAAGAGGAGAGTGACAGAAAAATAATGCAGAGCCAAATTGTCTCCTTTTACTTCAAACTTTTTAAAAACTTTAAAGATGACCAGAGCATCCAAAAGAGTGTGGAGACCATCAAGGAAGACATGAATGTCAAGTTTTTCAATAGCAACAAAAAGAAACGAGATGACTTCGAAAAGCTGACTAATTATTCGGTAACTGACTTGAATGTCCAACGCAAAGCAATACATGAACTCATCCAAGTGATGGCTGAACTGTCGCCAGCAGCTAAAACAGGGAAGCGAAAAAGGAGTCAGATGCTGTTTCAAGGTCGAAGAGCATCCCAGTAATGGTTGTCCTGCCTGCAATATTTGAATTTTAAATCTAAATCTATTTATTAATATTTAACATTATTTATATGGGGAATATATTTTTAGACTCATCAATCAAATAAGTATTTATAATAGCAACTTTTGTGTAATGAAAATGAATATCTATTAATATATGTATTATTTATAATTCCTATATCCTGTGACTGTCTCACTTAATCCTTTGTTTTCTGACTAATTAGGCAAGGCTATGTGATTACAAGGCTTTATCTCAGGGGCCAACTAGGCAGCCAACCTAAGCAAGATCCCATGGGTTGTGTGTTTATTTCACTTGATGATACAATGAACACTTATAAGTGAAGTGATACTATCCAGTTACTGCCGGTTTGAAAATATGCCTGCAATCTGAGCCAGTGCTTTAATGGCATGTCAGACAGAACTTGAATGTGTCAGGTGACCCTGATGAAAACATAGCATCTCAGGAGATTTCATGCCTGGTGCTTCCAAATATTGTTGACAACTGTGACTGTACCCAAATGGAAAGTAACTCATTTGTTAAAATTATCAATATCTAATATATATGAATAAAGTGTAAGTTCACAACT
>gi| |ref|NM_ | Bos taurus interferon, gamma or immune type [interferon gamma type 2] (IFNG), mRNA
ATTAGAAAAGAAAGATCAGCTACCTCCTTGGGACCTGATCATAACACAGGAGCTACCGATTTCAACTACTCCGGCCTAACTCTCTCCTAAACAATGAAATATACAAGCTATTTCTTAGCTTTACTGCTCTGTGGGCTTTTGGGTTTTTCTGGTTCTTATGGCCAGGGCCAATTTTTTAGAGAAATAGAAAACTTAAAGGAGTATTTTAATGCAAGTAGCCCAGATGTAGCTAAGGGTGGGCCTCTCTTCTCAGAAATTTTGAAGAATTGGAAAGATGAAA
INFG_refseq.txt
Human Chr12 sub-sequence ► Extracted from the UCSC ‘Golden Path’ website ► chr12:66,589,493-67,085,092 (~½ Mb); does contain the IFNG gene ► Repeats masked to lower case
>hg16_dna range=chr12: 'pad=0 3'pad=0 revComp=FALSE strand=? repeatMasking=lower
CATTCATTACTTTTATAAGGTTTCTCTCTGGTATGCATCTGACTTACATC
ATGGGAAAGCTAGTTTCATGACTCCTTTGGAATAGTTGTGGTCCTGAATA
TGGAAAATCAATTAATGAATAGCTTAAAGCACAATAGTCAACAAATAGAT
GTGAAAATTCTTTGTGAACTTTAAAGTCTTACTTAAACGTGAGATATTAT
ATACAGTGTTTTATGTtagactgtgagcttgttaaagaaagaactatgcc
ttctttttctttctaccagttccagtgcctcgtacaacatagaaaccata
agtgtttttgaaagagcaaatGAATATTGGAAGGAGTAAGGTGATAGCTA
AAGCTAAAACAATGTTTAGGGAGAACAACTGAAACAAAAGCAGCATTTGT
GTCTTAAACTCATGGCCTCTGAAACAGCCTTGATAGATAGTAGAGAGGGT
CAGATAGAGAGAGCCTGACTCAGAGATTGGGAAGCCCTATATGGTTGGAA
GAGAAAGTAAGAGGAGACCCAAAGTATTAGACCACAGAAAGAAGTTCTAA
TAGTCAGTGTCAAGAGATTCAGCAGGAGGTTGTGTATCAGGATTTGGGTT
TGGGAGTGGTATGGAGCTTACCTATCTCTAAAACGAGCAGGAGGGCAAAA
ATGAATCCCAGTCCCAAAGAATTCACTAATGGCCAGCAAACCAACACAGG
AACCCCAGCACAGACACACAAGATAGGAAACCAGTTGTTGAAACTACAAT
GTAACGGGGCTGATTTAATAAAAACCTGTTACATGAGTTATAGGtttttt
ttttttttttttttttttAATGTATGTGCCCCACCTTAGGAAAGCCAGAA
ATAATGGCAACGAAGAAATATTCATTCACAGTGAGAAAGCCATTAGAACG
TTGGCTGGAACCTAGGGGCATATCGAGGGCCCACGTGGGAAGGACAATGA
CAACTTGTTTAGTCCTCACTGGTTTCCCAGTCTGTGGATCTTATTTGAAT
hs_chr12_subseq.txt
Human IFNG gene
From the UCSC ‘Golden Path’ website genome browser
IFNG against ~1/2 Mb region of Chr 12
Assessment ► Submit, for either Part 1 or Part 2: ► the BLAST output, concatenated into one file and annotated ► a short summary / discussion of the concepts covered in this practical (< 500 words)
References ► Strongly recommend the BLAST tutorial on the NCBI site: Altschul-1.html ► Further reading: “Bioinformatics for quantitative geneticists course notes”, J. McEwan: aabc_materials2004.htm#ModuleC