1
BINF350, Tutorial 4 Karen Marshall
2
Aim
► Examine how BLAST parameters (e.g. scoring scheme, word length) affect the alignment outcome
► Optimise BLAST parameters for alignments with different levels of sequence homology
3
Practical: Part 1
► Start with an ~200 bp original DNA sequence
► Simulate mutation events over time and collect the mutated sequences
► BLAST the original sequence against the mutated sequences
► Repeat the BLAST runs using different parameters
[Diagram: original sequence and the mutated sequences derived from it]
4
Simulation of mutated sequences
► Point accepted mutation (PAM) model of molecular evolution
► 1 PAM = 1 mutation per 100 bases on average
   1 PAM: 99.0% sequence homology
   10 PAM: 90.6% sequence homology
   50 PAM: 63.5% sequence homology
► Concept of forward and backward mutation (a site can mutate back to its original base, so homology falls more slowly than the PAM count alone suggests)
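The homology figures above can be approximated in closed form. The following is a minimal sketch, assuming each base mutates with probability 0.01 per PAM step to one of the three other bases with equal chance. The function name `expected_identity` is illustrative and not part of the tutorial.

```python
def expected_identity(pam, rate=0.01):
    """Expected fraction of sites identical to the original after `pam` steps.

    Assumption: per step, a matching site is hit with probability `rate`,
    and a differing site reverts to the original base with probability
    rate / 3 (backward mutation).
    """
    # The excess over the 25% random-match floor shrinks by this factor per step.
    decay = (1.0 - 4.0 * rate / 3.0) ** pam
    return 0.25 + 0.75 * decay


if __name__ == "__main__":
    for pam in (1, 10, 50):
        print(pam, round(100 * expected_identity(pam), 1))
```

Under these assumptions the formula gives 99.0% at 1 PAM and 90.6% at 10 PAM, and roughly 63% at 50 PAM, close to the values quoted on the slide.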
5
for each ‘successive PAM’
    for each ‘nucleotide’
        draw rand once, uniformly on [0, 1)
        if (rand > 0.01) do not mutate
        else mutate by random selection from the three non-identical bases
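The loop above could be written in Python roughly as follows. This is a sketch, not the script used in the practical: it assumes an upper-case A/C/G/T input sequence, and the names `mutate_one_pam`, `simulate` and `identity` are hypothetical.

```python
import random

BASES = "ACGT"


def mutate_one_pam(seq, rate=0.01, rng=random):
    """Apply one PAM step: each base mutates independently with prob `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            # Mutate to one of the three non-identical bases.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate(seq, pam_steps, rate=0.01, seed=None):
    """Return {pam: sequence} for pam = 0..pam_steps, mutating cumulatively."""
    rng = random.Random(seed)
    snapshots = {0: seq}
    current = seq
    for step in range(1, pam_steps + 1):
        current = mutate_one_pam(current, rate, rng)
        snapshots[step] = current
    return snapshots


def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    original = "ACGT" * 50  # placeholder 200 bp sequence, not the tutorial's
    snaps = simulate(original, 50, seed=1)
    for pam in (1, 10, 50):
        print(pam, round(identity(original, snaps[pam]), 3))
```

Because each step mutates the already-mutated sequence, backward mutations arise naturally and no special handling is needed.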
6
BLAST - Heuristic
[Diagram: steps 1, 2, 3. Labels: words/seeds, threshold T, lookup table, suffix tree, location, larger sequence file]
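The seeding idea behind the heuristic can be illustrated with a word lookup table. This is a deliberately simplified sketch using exact word matches only. It omits the threshold T, seed extension and scoring, and it is not BLAST's actual implementation. The function names are hypothetical.

```python
from collections import defaultdict


def build_lookup(query, word_len=11):
    """Map every word of length `word_len` in the query to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - word_len + 1):
        table[query[i:i + word_len]].append(i)
    return table


def find_seeds(query, subject, word_len=11):
    """Return (query_pos, subject_pos) pairs where a whole word matches exactly."""
    table = build_lookup(query, word_len)
    seeds = []
    for j in range(len(subject) - word_len + 1):
        # .get avoids inserting empty entries into the defaultdict.
        for i in table.get(subject[j:j + word_len], ()):
            seeds.append((i, j))
    return seeds


if __name__ == "__main__":
    print(find_seeds("ACGTAC", "GGACGTACGG", word_len=4))
```

A longer word length needs a longer run of exact matches before any seed is found, which is why strongly diverged sequences can be missed at a large word length.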
7
BLAST
February 10, 2004: BLAST 2.2.8 released
BLAST 2.2.8 release notes
   Correction to tblastx alignment computation.
   ia32-linux now requires glibc 2.2.5.
   Source code can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/old/20040204/ncbi.tar.gz
   Binaries can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.8/
February 2, 2004: BLAST 2.2.7 released
BLAST 2.2.7 release notes
   Standalone BLAST is now available for amd64-linux.
   formatdb now restricts volume sizes to 1G on 32-bit platforms for performance reasons.
   The -A option has been removed from formatdb, that is, all databases will be created with ASN.1 deflines.
   tblastn query concatenation now works correctly on 64-bit platforms.
   The wwwblast source code has been merged into the C toolkit tree and is no longer distributed with the binaries.
   Source code can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi_tools/old/20040202/ncbi.tar.gz
   Binaries can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.7/
http://www.ncbi.nih.gov/BLAST/blast_whatsnew.shtml
8
BLAST on your own machine
► Allows you to BLAST multiple sequences; most web versions are single sequence only
► Steps
   Sequence files in FASTA format
      Can have multiple sequences in each file, but no duplicates
   Format the larger sequence file into a database
      formatdb -i dbfile.txt -p F -o T
   Perform BLAST using appropriate switches
      blastall -p blastn -d dbfile.txt -i comp.txt -o out.txt
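Before formatting a database it can help to check the FASTA file for duplicates. The sketch below assumes that "no duplicates" refers to sequence identifiers (the first word after ">"). That is an interpretation, not something the slide states, and the function name `duplicate_ids` is hypothetical.

```python
from collections import Counter


def duplicate_ids(fasta_lines):
    """Return the sorted sequence identifiers that occur more than once.

    The identifier is taken to be the first whitespace-delimited token
    after '>' on each header line (an assumption of this sketch).
    """
    ids = []
    for line in fasta_lines:
        if line.startswith(">") and len(line.strip()) > 1:
            ids.append(line[1:].split()[0])
    return sorted(name for name, count in Counter(ids).items() if count > 1)


if __name__ == "__main__":
    example = [">1_10", "ACGT", ">0_0", "ACGT", ">1_10 second copy", "ACGA"]
    print(duplicate_ids(example))
```

For a real file, the lines could be supplied with `open("dbfile.txt")`, since a file object iterates over its lines.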
9
BLAST 2.2.8
► Arguments (see appendix of handout)
   -W seed word length (default = 11)
   -r reward for a match (default = 1)
   -q penalty for a mismatch (default = -3)
   -G cost to open a gap
   -E cost to extend a gap
   -F filter query sequence
   -e expectation value threshold (threshold for HSP before gaps are included)
   -m output format option
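Repeating the BLAST runs with different parameters is easier to manage if the command lines are generated. The sketch below only builds and prints argument lists for the legacy `formatdb` and `blastall` programs using the switches listed above, and it does not run anything. The helper names, the output file naming scheme and the particular parameter values in the sweep are illustrative assumptions.

```python
def formatdb_cmd(db_file):
    """Argument list to format a nucleotide FASTA file as a BLAST database."""
    return ["formatdb", "-i", db_file, "-p", "F", "-o", "T"]


def blastall_cmd(db_file, query_file, out_file,
                 word_len=11, reward=1, penalty=-3, view=None):
    """Argument list for a blastn search with the given scoring parameters."""
    cmd = ["blastall", "-p", "blastn",
           "-d", db_file, "-i", query_file, "-o", out_file,
           "-W", str(word_len), "-r", str(reward), "-q", str(penalty)]
    if view is not None:
        cmd += ["-m", str(view)]  # e.g. 3 for the layout shown in the example output
    return cmd


if __name__ == "__main__":
    print(" ".join(formatdb_cmd("dbfile.txt")))
    # Hypothetical sweep: two word lengths and two match/mismatch schemes.
    for word_len in (7, 11):
        for reward, penalty in ((1, -3), (1, -1)):
            out_file = f"out_W{word_len}_r{reward}_q{penalty}.txt"
            print(" ".join(blastall_cmd("dbfile.txt", "comp.txt", out_file,
                                        word_len, reward, penalty, view=3)))
```

If the legacy BLAST binaries are installed, each list could be passed to `subprocess.run(cmd, check=True)` to execute it.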
10
Example of BLAST output: -m3

                                               Score     E
Sequences producing significant alignments:    (bits)  Value
1_10                                             170   3e-046
0_0                                              170   3e-046
4_10                                             115   2e-029
2_10                                             107   4e-027
5_10                                              96   2e-023
3_10                                              96   2e-023
4_20                                              68   3e-015
2_20                                              68   3e-015
5_20                                              56   1e-011

QUERY  1  agattcactggtgtggcaagttgtctctcagactgtacatgcattaaaattttgcttggc  60
1_10   1  ............................................................  60
0_0    1  ............................................................  60
4_10   3  ....t.....c......ag..................a....................  60
2_10   1  ............a..c....a...........a................g..........  60
5_10   2  ........c......a.........g............................c....  60
3_10   1  .................g........t.....................c.....a.....  60
4_20   3  ....t.....c......ag....a.....g.......a....................  60
2_20   1  ............a..c...ta...........aa......c..a.....g.....  55
5_20   4  ......c..c...a....g....g..............a......c......c....  60
11
Substitution scores
► Optimal substitution scores were derived for different PAM distances / sequence homologies (States et al., 1991)
► What matters is the ratio of the match score to the mismatch score
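One way to see why the match to mismatch ratio depends on the expected homology is the standard log-odds construction. The sketch below assumes equal base frequencies and equally likely substitutions. It illustrates the principle and is not necessarily the exact procedure of States et al. The function names are hypothetical.

```python
import math


def log_odds_scores(identity):
    """Log-odds match and mismatch scores for a target fraction identity.

    Assumes uniform base frequencies (0.25 each) and that all three
    possible substitutions are equally likely. Valid for 0.25 < identity < 1.
    """
    match = math.log(4.0 * identity)
    mismatch = math.log(4.0 * (1.0 - identity) / 3.0)
    return match, mismatch


def mismatch_to_match_ratio(identity):
    """Size of the mismatch penalty relative to the match reward."""
    match, mismatch = log_odds_scores(identity)
    return -mismatch / match


if __name__ == "__main__":
    for identity in (0.99, 0.90, 0.75):
        print(identity, round(mismatch_to_match_ratio(identity), 2))
```

Under these assumptions the ratio is a little over 3 at 99% identity, in line with the default reward 1 / penalty -3, and falls to 1 at 75% identity, so more diverged sequences call for a milder mismatch penalty.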
12
Substitution scores
► ‘Better’ substitution matrices exist, but are not yet implemented in most BLAST software
13
Practical: Part 2
► Apply concepts from Part 1 to ‘real sequences’
► BLAST the human and cattle IFNG mRNA sequences against a ~½ Mb sequence of human DNA
► Use optimal BLAST parameters for the expected homology
[Diagram: human DNA, human IFNG mRNA, cattle IFNG mRNA]
14
Expected levels of sequence homology
► Varies with the sequences being considered and the genomic region
[Figure: human to mouse comparison, from …]
15
Efficiency of BLAST
► Human to cattle coding sequence: ~85% homology (~PAM 15)
16
IFNG mRNA sequences
► Extracted from the NCBI website using Batch Entrez

>gi|10835170|ref|NM_000619.1| Homo sapiens interferon, gamma (IFNG), mRNA
TGAAGATCAGCTATTAGAAGAGAAAGATCAGTTAAGTCCTTTGGACCTGATCAGCTTGATACAAGAACTACTGATTTCAACTTCTTTGGCTTAATTCTCTCGGAAACGATGAAATATACAAGTTATATCTTGGCTTTTCAGCTCTGCATCGTTTTGGGTTCTCTTGGCTGTTACTGCCAGGACCCATATGTAAAAGAAGCAGAAAACCTTAAGAAATATTTTAATGCAGGTCATTCAGATGTAGCGGATAATGGAACTCTTTTCTTAGGCATTTTGAAGAATTGGAAAGAGGAGAGTGACAGAAAAATAATGCAGAGCCAAATTGTCTCCTTTTACTTCAAACTTTTTAAAAACTTTAAAGATGACCAGAGCATCCAAAAGAGTGTGGAGACCATCAAGGAAGACATGAATGTCAAGTTTTTCAATAGCAACAAAAAGAAACGAGATGACTTCGAAAAGCTGACTAATTATTCGGTAACTGACTTGAATGTCCAACGCAAAGCAATACATGAACTCATCCAAGTGATGGCTGAACTGTCGCCAGCAGCTAAAACAGGGAAGCGAAAAAGGAGTCAGATGCTGTTTCAAGGTCGAAGAGCATCCCAGTAATGGTTGTCCTGCCTGCAATATTTGAATTTTAAATCTAAATCTATTTATTAATATTTAACATTATTTATATGGGGAATATATTTTTAGACTCATCAATCAAATAAGTATTTATAATAGCAACTTTTGTGTAATGAAAATGAATATCTATTAATATATGTATTATTTATAATTCCTATATCCTGTGACTGTCTCACTTAATCCTTTGTTTTCTGACTAATTAGGCAAGGCTATGTGATTACAAGGCTTTATCTCAGGGGCCAACTAGGCAGCCAACCTAAGCAAGATCCCATGGGTTGTGTGTTTATTTCACTTGATGATACAATGAACACTTATAAGTGAAGTGATACTATCCAGTTACTGCCGGTTTGAAAATATGCCTGCAATCTGAGCCAGTGCTTTAATGGCATGTCAGACAGAACTTGAATGTGTCAGGTGACCCTGATGAAAACATAGCATCTCAGGAGATTTCATGCCTGGTGCTTCCAAATATTGTTGACAACTGTGACTGTACCCAAATGGAAAGTAACTCATTTGTTAAAATTATCAATATCTAATATATATGAATAAAGTGTAAGTTCACAACT
>gi|31982948|ref|NM_174086.1| Bos taurus interferon, gamma or immune type [interferon gamma type 2] (IFNG), mRNA
ATTAGAAAAGAAAGATCAGCTACCTCCTTGGGACCTGATCATAACACAGGAGCTACCGATTTCAACTACTCCGGCCTAACTCTCTCCTAAACAATGAAATATACAAGCTATTTCTTAGCTTTACTGCTCTGTGGGCTTTTGGGTTTTTCTGGTTCTTATGGCCAGGGCCAATTTTTTAGAGAAATAGAAAACTTAAAGGAGTATTTTAATGCAAGTAGCCCAGATGTAGCTAAGGGTGGGCCTCTCTTCTCAGAAATTTTGAAGAATTGGAAAGATGAAA

INFG_refseq.txt
17
Human Chr12 sub-sequence
► Extracted from the UCSC ‘Golden Path’ website
► chr12:66,589,493-67,085,092 (~½ Mb); contains the IFNG gene
► Repeats masked to lower case

>hg16_dna range=chr12:66589493-67085092 5'pad=0 3'pad=0 revComp=FALSE strand=? repeatMasking=lower
CATTCATTACTTTTATAAGGTTTCTCTCTGGTATGCATCTGACTTACATC
ATGGGAAAGCTAGTTTCATGACTCCTTTGGAATAGTTGTGGTCCTGAATA
TGGAAAATCAATTAATGAATAGCTTAAAGCACAATAGTCAACAAATAGAT
GTGAAAATTCTTTGTGAACTTTAAAGTCTTACTTAAACGTGAGATATTAT
ATACAGTGTTTTATGTtagactgtgagcttgttaaagaaagaactatgcc
ttctttttctttctaccagttccagtgcctcgtacaacatagaaaccata
agtgtttttgaaagagcaaatGAATATTGGAAGGAGTAAGGTGATAGCTA
AAGCTAAAACAATGTTTAGGGAGAACAACTGAAACAAAAGCAGCATTTGT
GTCTTAAACTCATGGCCTCTGAAACAGCCTTGATAGATAGTAGAGAGGGT
CAGATAGAGAGAGCCTGACTCAGAGATTGGGAAGCCCTATATGGTTGGAA
GAGAAAGTAAGAGGAGACCCAAAGTATTAGACCACAGAAAGAAGTTCTAA
TAGTCAGTGTCAAGAGATTCAGCAGGAGGTTGTGTATCAGGATTTGGGTT
TGGGAGTGGTATGGAGCTTACCTATCTCTAAAACGAGCAGGAGGGCAAAA
ATGAATCCCAGTCCCAAAGAATTCACTAATGGCCAGCAAACCAACACAGG
AACCCCAGCACAGACACACAAGATAGGAAACCAGTTGTTGAAACTACAAT
GTAACGGGGCTGATTTAATAAAAACCTGTTACATGAGTTATAGGtttttt
ttttttttttttttttttAATGTATGTGCCCCACCTTAGGAAAGCCAGAA
ATAATGGCAACGAAGAAATATTCATTCACAGTGAGAAAGCCATTAGAACG
TTGGCTGGAACCTAGGGGCATATCGAGGGCCCACGTGGGAAGGACAATGA
CAACTTGTTTAGTCCTCACTGGTTTCCCAGTCTGTGGATCTTATTTGAAT

hs_chr12_subseq.txt
18
Human IFNG gene
19
From the UCSC ‘Golden Path’ website genome browser
20
IFNG against ~½ Mb region of Chr 12
21
Assessment
► Submit, for either Part 1 or Part 2:
   the BLAST output, concatenated into one file and annotated
   a short summary / discussion of the concepts covered in this practical (< 500 words)
22
References
► Strongly recommend the BLAST tutorial on the NCBI site
   http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
► Further: “Bioinformatics for quantitative geneticists course notes”, J. McEwan
   http://www-personal.une.edu.au/~jvanderw/aabc_materials2004.htm#ModuleC