BINF350, Tutorial 4 Karen Marshall. Aim ► Examine how blast parameters (e.g. scoring scheme, word length) affect the alignment outcome ► To optimise blast.

1 BINF350, Tutorial 4 Karen Marshall

2 Aim ► Examine how blast parameters (e.g. scoring scheme, word length) affect the alignment outcome ► To optimise blast parameters for alignments with different levels of sequence homology

3 Practical: Part 1
► Start with an ~200 bp original DNA sequence
► Simulate mutation events over time and collect the resulting sequences
► BLAST the original sequence against the mutated sequences
► Repeat the BLAST runs using different parameters
[Figure labels: Original sequence, Mutated sequences]

4 Simulation of mutated sequences
► Point accepted mutation (PAM) model of molecular evolution
► 1 PAM = 1 mutation per 100 bases on average
    ► 1 PAM → 99.0% sequence homology
    ► 10 PAM → 90.6% sequence homology
    ► 50 PAM → 63.5% sequence homology
    ► Concept of forward and backward mutation
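
The homology figures on this slide can be approximated in closed form. The sketch below assumes that each base mutates with probability 0.01 per PAM step and, when it does, changes to one of the three other bases with equal probability, so that backward mutations can restore the original base. The function names are illustrative and not part of the tutorial.

```python
import math

def expected_identity(pam, rate=0.01):
    """Expected fraction of positions identical to the original after `pam` steps."""
    # Per step: p_next = p * (1 - rate) + (1 - p) * rate / 3, starting from p = 1.
    # Solving the recurrence gives 1/4 + 3/4 * (1 - 4*rate/3)^pam.
    return 0.25 + 0.75 * (1.0 - 4.0 * rate / 3.0) ** pam

def pam_for_identity(identity, rate=0.01):
    """Invert expected_identity: the PAM distance that yields `identity`."""
    return math.log((identity - 0.25) / 0.75) / math.log(1.0 - 4.0 * rate / 3.0)

for pam in (1, 10, 50):
    print(f"{pam:>3} PAM -> {100 * expected_identity(pam):.1f}% identity")
```

Under these assumptions the formula reproduces the 1 PAM and 10 PAM figures and lands close to the 50 PAM figure. The identity never falls below 25%, the level expected between two unrelated random sequences.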

5 for each ‘successive PAM’
      for each ‘nucleotide’
          draw rand once, uniformly between 0 and 1
          if (rand > 0.01) do not mutate
          else mutate by random selection from the non-identical bases
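
The pseudocode above can be sketched in Python as follows. This is a minimal version that assumes a uniform choice among the three other bases and a 200-base random starting sequence; the function names and the seed are illustrative, not taken from the tutorial.

```python
import random

BASES = "acgt"

def mutate_once(seq, rng, rate=0.01):
    """Apply one PAM step: each base mutates with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            # Choose uniformly among the three non-identical bases
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate(original, pam_steps, rng):
    """Return the sequence after `pam_steps` successive PAM steps."""
    seq = original
    for _ in range(pam_steps):
        seq = mutate_once(seq, rng)
    return seq

def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

rng = random.Random(350)  # fixed seed so the run is repeatable
original = "".join(rng.choice(BASES) for _ in range(200))
for pam in (1, 10, 50):
    mutated = simulate(original, pam, rng)
    print(pam, round(identity(original, mutated), 3))
```

Because the simulation only substitutes bases, the mutated sequences keep the original length. With only 200 bases, the observed identity of any single run will scatter around the expected value.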

6 BLAST - Heuristic
[Figure: the BLAST heuristic in three steps (Step 1, 2, 3). Labels: Words/seeds, Lookup table, Suffix Tree, Location, Threshold T, Larger seq file]
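
The seeding stage named in this figure can be illustrated with a small lookup table. The sketch below covers only exact word matches of length W between a query and a larger sequence; it does not implement the threshold T, the extension of seeds, or the suffix tree, and the function names are hypothetical.

```python
from collections import defaultdict

def build_lookup(query, w):
    """Map every length-w word in the query to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table

def find_seeds(query, subject, w=11):
    """Return (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    table = build_lookup(query, w)
    seeds = []
    for j in range(len(subject) - w + 1):
        # .get avoids adding empty entries to the table for unseen words
        for i in table.get(subject[j:j + w], ()):
            seeds.append((i, j))
    return seeds

query = "agattcactggtgtggcaagttgtctctcag"
subject = "ttt" + query[5:25] + "ccc"
print(find_seeds(query, subject, w=11)[:3])
```

A seed needs W consecutive identical bases, so the more diverged two sequences are, the fewer seeds a long word finds. Shorter words raise sensitivity at the cost of more seeds to examine.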

7 BLAST
February 10, 2004: BLAST 2.2.8 released
BLAST 2.2.8 release notes
► Correction to tblastx alignment computation
► ia32-linux now requires glibc 2.2.5
Source code can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/old/20040204/ncbi.tar.gz
Binaries can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.8/
February 2, 2004: BLAST 2.2.7 released
BLAST 2.2.7 release notes
► Standalone BLAST is now available for amd64-linux.
► formatdb now restricts volume sizes to 1G on 32-bit platforms for performance reasons.
► The -A option has been removed from formatdb, that is, all databases will be created with ASN.1 deflines.
► tblastn query concatenation now works correctly on 64-bit platforms.
► The wwwblast source code has been merged into the C toolkit tree and is no longer distributed with the binaries.
Source code can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/old/20040202/ncbi.tar.gz
Binaries can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.7/
http://www.ncbi.nih.gov/BLAST/blast_whatsnew.shtml

8 BLAST on your own machine
► Allows you to BLAST multiple sequences
    ► most web versions are single sequence only
► Steps
    ► Sequence files in FASTA format
        Can have multiple sequences in each file but no duplicates
    ► Format the larger sequence file into a database
        formatdb -i dbfile.txt -p F -o T
    ► Perform BLAST using appropriate switches
        blastall -p blastn -d dbfile.txt -i comp.txt -o out.txt
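
Before formatting a database it can help to confirm that the FASTA file has no duplicates. The sketch below reads "duplicates" as repeated sequence identifiers, which is an assumption about what the slide means, and takes the identifier to be the first word after ">". The function names are illustrative.

```python
from collections import Counter

def fasta_ids(lines):
    """Yield the identifier (first word after '>') of each FASTA record."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            yield fields[0] if fields else ""

def duplicate_ids(lines):
    """Return identifiers that occur more than once, in sorted order."""
    counts = Counter(fasta_ids(lines))
    return sorted(name for name, n in counts.items() if n > 1)

example = [">0_0", "agattcactg", ">1_10", "agattcactg", ">1_10", "agatccactg"]
print(duplicate_ids(example))  # ['1_10']
```

For a file on disk, passing an open file handle works the same way, for example `duplicate_ids(open("dbfile.txt"))`.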

9 BLAST 2.2.8
► Arguments: see appendix of handout
    ► -W seed word length (default = 11)
    ► -r reward for a match (default = 1)
    ► -q penalty for a mismatch (default = -3)
    ► -G cost to open a gap
    ► -E cost to extend a gap
    ► -F filter query sequence
    ► -e expectation threshold (threshold for HSP before gaps are included)
    ► -m output format options
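
Repeating the BLAST run with different parameters is easier to keep track of if the command lines are generated. The sketch below only builds and prints legacy blastall commands from the switches listed on this slide; it does not execute them. The parameter grid and the output file naming are hypothetical choices for illustration.

```python
from itertools import product

def blastall_command(db, query, out, word=11, reward=1, penalty=-3, view=3):
    """Build a legacy blastall argument list (not executed here)."""
    return ["blastall", "-p", "blastn", "-d", db, "-i", query, "-o", out,
            "-W", str(word), "-r", str(reward), "-q", str(penalty),
            "-m", str(view)]

# Hypothetical grid of word lengths and reward/penalty pairs to compare
word_lengths = (7, 11)
scores = ((1, -3), (5, -4))

for w, (r, q) in product(word_lengths, scores):
    out = f"out_W{w}_r{r}_q{abs(q)}.txt"
    print(" ".join(blastall_command("dbfile.txt", "comp.txt", out, w, r, q)))
```

On a machine with the legacy BLAST binaries installed, each list could be passed to `subprocess.run(cmd, check=True)`.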

10 Example of BLAST output: -m3
                                             Score    E
Sequences producing significant alignments:  (bits)  Value
1_10   170   3e-046
0_0    170   3e-046
4_10   115   2e-029
2_10   107   4e-027
5_10    96   2e-023
3_10    96   2e-023
4_20    68   3e-015
2_20    68   3e-015
5_20    56   1e-011
QUERY  1  agattcactggtgtggcaagttgtctctcagactgtacatgcattaaaattttgcttggc  60
1_10   1  ............................................................  60
0_0    1  ............................................................  60
4_10   3    ....t.....c......ag..................a....................  60
2_10   1  ............a..c....a...........a................g..........  60
5_10   2   ........c......a.........g............................c....  60
3_10   1  .................g........t.....................c.....a.....  60
4_20   3    ....t.....c......ag....a.....g.......a....................  60
2_20   1  ............a..c...ta...........aa......c..a.....g.....  55
5_20   4     ......c..c...a....g....g..............a......c......c....  60

11 Substitution scores ► Optimal substitution scores were derived for different PAM distances / sequence homologies (States et al., 1991) ► Of importance is the match to mismatch score ratio
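
Why the match to mismatch ratio matters can be seen from the standard log-odds construction of substitution scores. The sketch below assumes equal base frequencies and equal rates for all substitutions; it is an illustration of the idea, not a reproduction of the tables in States et al. (1991), and the function name is hypothetical.

```python
import math

def log_odds_scores(identity):
    """Return (match, mismatch) log-odds scores for a target fraction identity."""
    # Aligned-pair frequencies under the uniform model:
    #   each of the 4 identical pairs has frequency identity / 4
    #   each of the 12 non-identical pairs has frequency (1 - identity) / 12
    # The background frequency of any pair is 1/4 * 1/4 = 1/16.
    match = math.log((identity / 4) / (1 / 16))
    mismatch = math.log(((1 - identity) / 12) / (1 / 16))
    return match, mismatch

for identity in (0.99, 0.906, 0.635):
    match, mismatch = log_odds_scores(identity)
    print(f"{identity:.3f}: mismatch/match = {mismatch / match:.2f}")
```

Under these assumptions the ratio is close to -3 at 99% identity, which corresponds to the default reward 1 and penalty -3, and the mismatch penalty becomes relatively milder as the expected identity falls.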

12 Substitution scores ► ‘Better’ substitution matrices exist, but are not yet implemented in most BLAST software

13 Practical: Part 2
► Apply concepts from Part 1 to ‘real sequences’
► BLAST the mRNA sequences for human and cattle IFNG against an ~1/2 Mb sequence of human DNA
► Use the optimal BLAST parameters for the expected homology
[Figure labels: Human DNA, Human IFNG mRNA, Cattle IFNG mRNA]

14 Expected levels of sequence homology ► Varies for sequences being considered and genomic region Human to mouse comparison, from …

15 Efficiency of BLAST ► Human to cattle coding sequence ~85% homology (~PAM 15)

16 IFNG mRNA sequences
► Extracted from the NCBI website using Batch Entrez
>gi|10835170|ref|NM_000619.1| Homo sapiens interferon, gamma (IFNG), mRNA
TGAAGATCAGCTATTAGAAGAGAAAGATCAGTTAAGTCCTTTGGACCTGATCAGCTTGATACAAGAACTACTGATTTCAACTTCTTTGGCTTAATTCTCTCGGAAACGATGAAATATACAAGTTATATCTTGGCTTTTCAGCTCTGCATCGTTTTGGGTTCTCTTGGCTGTTACTGCCAGGACCCATATGTAAAAGAAGCAGAAAACCTTAAGAAATATTTTAATGCAGGTCATTCAGATGTAGCGGATAATGGAACTCTTTTCTTAGGCATTTTGAAGAATTGGAAAGAGGAGAGTGACAGAAAAATAATGCAGAGCCAAATTGTCTCCTTTTACTTCAAACTTTTTAAAAACTTTAAAGATGACCAGAGCATCCAAAAGAGTGTGGAGACCATCAAGGAAGACATGAATGTCAAGTTTTTCAATAGCAACAAAAAGAAACGAGATGACTTCGAAAAGCTGACTAATTATTCGGTAACTGACTTGAATGTCCAACGCAAAGCAATACATGAACTCATCCAAGTGATGGCTGAACTGTCGCCAGCAGCTAAAACAGGGAAGCGAAAAAGGAGTCAGATGCTGTTTCAAGGTCGAAGAGCATCCCAGTAATGGTTGTCCTGCCTGCAATATTTGAATTTTAAATCTAAATCTATTTATTAATATTTAACATTATTTATATGGGGAATATATTTTTAGACTCATCAATCAAATAAGTATTTATAATAGCAACTTTTGTGTAATGAAAATGAATATCTATTAATATATGTATTATTTATAATTCCTATATCCTGTGACTGTCTCACTTAATCCTTTGTTTTCTGACTAATTAGGCAAGGCTATGTGATTACAAGGCTTTATCTCAGGGGCCAACTAGGCAGCCAACCTAAGCAAGATCCCATGGGTTGTGTGTTTATTTCACTTGATGATACAATGAACACTTATAAGTGAAGTGATACTATCCAGTTACTGCCGGTTTGAAAATATGCCTGCAATCTGAGCCAGTGCTTTAATGGCATGTCAGACAGAACTTGAATGTGTCAGGTGACCCTGATGAAAACATAGCATCTCAGGAGATTTCATGCCTGGTGCTTCCAAATATTGTTGACAACTGTGACTGTACCCAAATGGAAAGTAACTCATTTGTTAAAATTATCAATATCTAATATATATGAATAAAGTGTAAGTTCACAACT
>gi|31982948|ref|NM_174086.1| Bos taurus interferon, gamma or immune type [interferon gamma type 2] (IFNG), mRNA
ATTAGAAAAGAAAGATCAGCTACCTCCTTGGGACCTGATCATAACACAGGAGCTACCGATTTCAACTACTCCGGCCTAACTCTCTCCTAAACAATGAAATATACAAGCTATTTCTTAGCTTTACTGCTCTGTGGGCTTTTGGGTTTTTCTGGTTCTTATGGCCAGGGCCAATTTTTTAGAGAAATAGAAAACTTAAAGGAGTATTTTAATGCAAGTAGCCCAGATGTAGCTAAGGGTGGGCCTCTCTTCTCAGAAATTTTGAAGAATTGGAAAGATGAAA
INFG_refseq.txt

17 Human Chr12 sub-sequence
► Extracted from the UCSC ‘Golden Path’ website
► chr12:66,589,493-67,085,092 (~½ Mb)
    ► does contain the IFNG gene
► Repeats masked to lower case
>hg16_dna range=chr12:66589493-67085092 5'pad=0 3'pad=0 revComp=FALSE strand=? repeatMasking=lower
CATTCATTACTTTTATAAGGTTTCTCTCTGGTATGCATCTGACTTACATC
ATGGGAAAGCTAGTTTCATGACTCCTTTGGAATAGTTGTGGTCCTGAATA
TGGAAAATCAATTAATGAATAGCTTAAAGCACAATAGTCAACAAATAGAT
GTGAAAATTCTTTGTGAACTTTAAAGTCTTACTTAAACGTGAGATATTAT
ATACAGTGTTTTATGTtagactgtgagcttgttaaagaaagaactatgcc
ttctttttctttctaccagttccagtgcctcgtacaacatagaaaccata
agtgtttttgaaagagcaaatGAATATTGGAAGGAGTAAGGTGATAGCTA
AAGCTAAAACAATGTTTAGGGAGAACAACTGAAACAAAAGCAGCATTTGT
GTCTTAAACTCATGGCCTCTGAAACAGCCTTGATAGATAGTAGAGAGGGT
CAGATAGAGAGAGCCTGACTCAGAGATTGGGAAGCCCTATATGGTTGGAA
GAGAAAGTAAGAGGAGACCCAAAGTATTAGACCACAGAAAGAAGTTCTAA
TAGTCAGTGTCAAGAGATTCAGCAGGAGGTTGTGTATCAGGATTTGGGTT
TGGGAGTGGTATGGAGCTTACCTATCTCTAAAACGAGCAGGAGGGCAAAA
ATGAATCCCAGTCCCAAAGAATTCACTAATGGCCAGCAAACCAACACAGG
AACCCCAGCACAGACACACAAGATAGGAAACCAGTTGTTGAAACTACAAT
GTAACGGGGCTGATTTAATAAAAACCTGTTACATGAGTTATAGGtttttt
ttttttttttttttttttAATGTATGTGCCCCACCTTAGGAAAGCCAGAA
ATAATGGCAACGAAGAAATATTCATTCACAGTGAGAAAGCCATTAGAACG
TTGGCTGGAACCTAGGGGCATATCGAGGGCCCACGTGGGAAGGACAATGA
CAACTTGTTTAGTCCTCACTGGTTTCCCAGTCTGTGGATCTTATTTGAAT
hs_chr12_subseq.txt

18 Human IFNG gene

19 From the UCSC ‘Golden Path’ website genome browser

20 IFNG against ~1/2 Mb region of Chr 12

21 Assessment
► Submit
    ► for either Part 1 or Part 2, the BLAST output, concatenated into one file and annotated
    ► a short summary / discussion of the concepts covered in this practical (< 500 words)

22 References
► Strongly recommend the BLAST tutorial on the NCBI site
    ► http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
► Further: “Bioinformatics for quantitative geneticists course notes”, J. McEwan
    ► http://www-personal.une.edu.au/~jvanderw/aabc_materials2004.htm#ModuleC

