Access to sequences: GenBank – a place to start and then some more... Links: embl nucleotide archive


1 Access to sequences: GenBank – a place to start and then some more...
Links:
- EMBL-EBI European Nucleotide Archive (ENA): http://www.ebi.ac.uk/ena/
- DNA Data Bank of Japan (DDBJ): http://www.ddbj.nig.ac.jp/
- GenBank (NCBI): http://www.ncbi.nlm.nih.gov/

2

3 ... contains a wealth of data of many types

4 ... but the main part consists of sequences (DNA, RNA, amino acid; short fragments, genomes...). For an explained sample GenBank sequence record, click here. A record has many categories of information, but you can also view the sequence in a much more streamlined form, called FASTA format:

>gi|1293613|gb|U49845.1|SCU49845 Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds
GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAGACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAA
GTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTAACATATTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACGCAAAAA
[... the remaining sequence lines of record U49845.1 are omitted here ...]

The first line, introduced by '>', is the header; anything after the first line break is considered to be the sequence. FASTA (or Pearson) format is the most widely used sequence format in bioinformatics!
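The header/sequence rule of FASTA format can be sketched in a few lines of Python. This is a minimal illustration rather than a full parser; the function name `parse_fasta` and the two short records are hypothetical and not part of the slides.

```python
from typing import Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Every non-header line belongs to the current sequence.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record example (not real database entries).
example = """>seq1 hypothetical example record
GATCCTCCAT
ATACAACGGT
>seq2 another hypothetical record
MKGLDIQKVAGTWY
"""

records = dict(parse_fasta(example))
for name, seq in records.items():
    print(name, len(seq))
```

Sequence lines are simply concatenated, so the line width used in the file does not matter.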

5 ... but first, you have to find it!

6 you can search by keyword (could be name, abbreviation...)

7 ... or by unique identifier (‘accession number’)

8 ... or first filter for all sequences of a particular organism

9 ... and then use a keyword

10 check the results you want to save, click ‘Display settings’, then ‘Apply’

11 and copy results into any text editor

12 ... or click ‘Send to’, set Format to FASTA and save the file wherever you want. This way, you can also download the whole protein/nucleotide set of any particular taxonomic unit, or even a genomic sequence. Try to figure out how!
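The same search-then-download steps can also be scripted. The slides only show the web interface, so the sketch below is an assumption: it uses NCBI's E-utilities (`esearch.fcgi` and `efetch.fcgi`) and only builds the request URLs without sending them. The helper names `esearch_url` and `efetch_fasta_url` are hypothetical, and the gene keyword is just an example.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (an assumption; not shown on the slides).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(organism: str, keyword: str, db: str = "nucleotide") -> str:
    """Build a search URL: filter by organism first, then by keyword."""
    term = f'"{organism}"[Organism] AND {keyword}[All Fields]'
    return f"{EUTILS}/esearch.fcgi?" + urlencode({"db": db, "term": term})


def efetch_fasta_url(accession: str, db: str = "nucleotide") -> str:
    """Build a download URL that returns one record in FASTA format."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?" + urlencode(params)


print(esearch_url("Saccharomyces cerevisiae", "AXL2"))
print(efetch_fasta_url("U49845.1"))
```

Opening the second URL in a browser should return the same FASTA text as ‘Send to’ with Format set to FASTA; for bulk downloads, check NCBI's usage guidelines first.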

13 ... you can also search by similarity/homology using BLAST

14 The BLAST programs (Basic Local Alignment Search Tool)
- a set of sequence comparison algorithms (1990)
- search sequence databases for optimal local alignments to a query
- heuristic approach based on the Smith-Waterman algorithm
- finds best local alignments
- provides statistical significance
- available as www, standalone, and network clients
- BLAST+
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403-410.
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.” NAR 25:3389-3402.

15 The BLAST programs (Basic Local Alignment Search Tool)
1) Choose the sequence (query)
2) Select the BLAST program
3) Choose the database to search
4) Choose optional parameters

16 The BLAST programs: Select the BLAST program
Program: Description
blastp: Compares an amino acid query sequence against a protein sequence database.
blastn: Compares a nucleotide query sequence against a nucleotide sequence database.
blastx: Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.
tblastn: Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx: Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
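The table above amounts to a lookup on two things: the type of the query and the type of the database. A minimal sketch of that lookup follows; the function name `choose_blast_program` and the `translate_both` flag are hypothetical and exist only for this illustration.

```python
# (query type, database type) -> program, following the table above.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type: str, db_type: str,
                         translate_both: bool = False) -> str:
    """Pick a BLAST program from the query and database sequence types."""
    if translate_both:
        # tblastx compares six-frame translations of both sides,
        # so both query and database must be nucleotide sequences.
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, db_type)]


print(choose_blast_program("protein", "nucleotide"))
```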

17 The BLAST programs: Select the BLAST program
Program: Notes
Megablast (nucleotide only)
- Contiguous: nearly identical sequences
- Discontiguous: cross-species comparison
Position specific (protein only)
- PSI-BLAST: automatically generates a position-specific score matrix (PSSM)
- RPS-BLAST: searches a database of PSI-BLAST PSSMs

18 First choose the appropriate database/algorithm: e.g. if you have an amino acid sequence and you are after proteins, use blastp (protein BLAST); if you are looking for the coding sequence, use tblastn (translated BLAST), etc.

19 Paste your query sequence or accession number here. Sometimes it is handy to narrow the search to a specific group.

20 How does it work? BLAST algorithm in layers
“The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length w with a score of at least T.” Altschul et al. (1990)
Three heuristic layers: seeding, extension, and evaluation
- Seeding: identify where to start the alignment
- Extension: extend the alignment from the seeds
- Evaluation: determine which alignments are statistically significant

21 BLAST Algorithm: Seeding
word = defined number of letters
Compile a list of word pairs (w=3) scoring above threshold T.
Example: for a human RBP query ...FSGTWYA..., a list of words (w=3) is (first row: words taken directly from the query):
FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS
BLAST locates all common words in a pair of sequences, then uses them as seeds for the alignment.
Discriminating between real and artificial matches is done using an estimate of the probability that the match might occur by chance: scores (S) and E-values (E) of BLAST hits.
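The seeding step can be sketched as follows. This is a simplified illustration: it uses only exact word matches, whereas protein BLAST also accepts similar words that score at least T under a substitution matrix. The function names and the short subject string are hypothetical.

```python
def words(seq: str, w: int = 3) -> list:
    """Return all overlapping words of length w in seq."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def find_seeds(query: str, subject: str, w: int = 3) -> list:
    """Return (query_pos, subject_pos) for every word shared by both sequences."""
    # Index the query words by position.
    index = {}
    for i, word in enumerate(words(query, w)):
        index.setdefault(word, []).append(i)
    # Scan the subject and report every word that also occurs in the query.
    seeds = []
    for j, word in enumerate(words(subject, w)):
        for i in index.get(word, []):
            seeds.append((i, j))
    return seeds


print(words("FSGTWYA"))                  # the query words from the slide
print(find_seeds("FSGTWYA", "VAGTWYS"))  # shared words become seeds
```

For the slide's query fragment, `words` returns the first row of the word list (FSG, SGT, GTW, TWY, WYA).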

22 BLAST Algorithm: Seeding: Score
score = alignment quality

23 BLAST Algorithm: Seeding: Scoring matrix
Substitution matrices are used for amino acid alignments (aa frequency, aa properties).
- each possible residue substitution is given a score
A simpler unitary matrix is used for DNA pairs (+1 for a match, -2 for a mismatch).
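As a small worked example of the DNA scoring scheme (+1 for a match, -2 for a mismatch), the sketch below scores two equal-length sequences without gaps. The function name `score_dna` and the two sequences are made up for illustration.

```python
def score_dna(a: str, b: str, match: int = 1, mismatch: int = -2) -> int:
    """Score an ungapped alignment of two equal-length DNA sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


# Six matches and one mismatch: 6 * (+1) + 1 * (-2) = 4
print(score_dna("GATTACA", "GATCACA"))
```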

24 BLAST Algorithm: Seeding: Scoring matrix
BLOSUM vs PAM
BLOSUM 62 is the default in BLAST 2.0.
- tailored for comparisons of moderately distant proteins; performs well in detecting closer relationships
- a search for distant relatives may be more sensitive with a different matrix
More divergent ... Less divergent
BLOSUM 45, BLOSUM 62, BLOSUM 90
PAM 250, PAM 160, PAM 100
PAM (Point Accepted Mutation; sometimes expanded as Percent Accepted Mutation)
- theoretical approach
- based on assumptions of mutation probabilities
BLOSUM (BLOcks SUbstitution Matrix)
- empirical
- constructed from multiply aligned protein families
- ungapped segments (blocks) clustered based on percent identity

25 BLAST Algorithm: Seeding: E value
E-value = significance of the alignment: the number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E-value, the more significant the score.
Low E-values suggest that sequences are homologous.
Statistical significance depends on both the size of the alignments and the size of the sequence database
‣ Important consideration for comparing results across different searches
‣ E-value increases as the database gets bigger
‣ E-value decreases as alignments get longer
Suggested BLAST cutoffs
- For nucleotide-based searches, one should look for hits with E-values of 10^-6 or less and sequence identity of 70% or more
- For protein-based searches, one should look for hits with E-values of 10^-3 or less and sequence identity of 25% or more
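The slide gives no formula, but the dependence on database size and score can be made concrete with the standard Karlin-Altschul relation E = K·m·n·e^(-λS), where m is the query length, n the database length and S the raw score. The sketch below is illustrative only: the values used for λ and K are placeholders chosen for this example, since the real values depend on the scoring system.

```python
import math


def e_value(raw_score: float, query_len: int, db_len: int,
            lam: float, k: float) -> float:
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalised score, so that E = m * n * 2 ** (-bit_score)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Placeholder parameters (assumptions for illustration only).
LAM, K = 0.318, 0.134

small_db = e_value(50, 250, 1_000_000, LAM, K)
large_db = e_value(50, 250, 2_000_000, LAM, K)
better_hit = e_value(80, 250, 1_000_000, LAM, K)
print(small_db, large_db, better_hit)
```

Doubling the database size doubles E, while a higher score lowers it exponentially, which is why E-values from searches against different databases are not directly comparable.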

26 BLAST Algorithm: Extension and Evaluation
When you manage to find a hit (i.e. a match between a “word” and a database entry), extend the hit in either direction. Keep track of the score (use a scoring matrix). Stop when the score drops below some cutoff.
KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)
MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)
Hit! Extend.
- originally, hits were extended in either direction
- refinement of BLAST: two independent hits required
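The extension step can be sketched on the two sequences from the slide. This is a simplified, ungapped illustration under stated assumptions: it scores +1 for identical residues and -1 otherwise instead of using a substitution matrix such as BLOSUM 62, and it stops once the running score has fallen more than `x_drop` below the best score seen. The function name and parameter values are hypothetical.

```python
def extend_seed(query, subject, qpos, spos, w=3,
                match=1, mismatch=-1, x_drop=2):
    """Ungapped extension of a seed word; returns (score, q_start, q_end)."""

    def score(a, b):
        return match if a == b else mismatch

    # Score of the seed word itself.
    seed_score = sum(score(query[qpos + k], subject[spos + k]) for k in range(w))

    def extend(qi, si, step):
        """Walk in one direction; return the best gain and its length."""
        best = running = 0
        best_len = n = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            running += score(query[qi], subject[si])
            n += 1
            if running > best:
                best, best_len = running, n
            elif best - running > x_drop:
                break  # score dropped too far below the best: stop
            qi += step
            si += step
        return best, best_len

    right_gain, right_len = extend(qpos + w, spos + w, +1)
    left_gain, left_len = extend(qpos - 1, spos - 1, -1)
    total = seed_score + left_gain + right_gain
    return total, qpos - left_len, qpos + w + right_len


rbp = "KENFDKARFSGTWYAMAKKDPEG"           # query from the slide
lactoglobulin = "MKGLDIQKVAGTWYSLAMAASD"  # hit from the slide
result = extend_seed(rbp, lactoglobulin, qpos=10, spos=10)  # seed word GTW
print(result, rbp[result[1]:result[2]])
```

With this crude identity score the seed GTW only grows by one residue; a real substitution matrix rewards conservative substitutions and would typically extend further.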

27 BLAST Algorithm: Extension and Evaluation
The BLAST algorithm extends the initial “seed” hit into an HSP.
HSP = high-scoring segment pair = locally optimal alignment

28 BLAST Algorithm: Extension and Evaluation

29

30 BLAST-related tools for genomic DNA
- MegaBLAST at NCBI
- BLAT (BLAST-like alignment tool): BLAT parses an entire genomic DNA database into words (11-mers), then searches them against a query, a mirror image of the BLAST strategy. http://genome.ucsc.edu
- SSAHA at Ensembl uses a strategy similar to BLAT's. http://www.ensembl.org
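The "mirror image" idea can be sketched as follows: index the database once, then look up every word of the query in that index. This is a toy illustration under assumptions, not BLAT itself: the function names are hypothetical, the database is indexed with non-overlapping words, and the test sequences use k=4 instead of 11 to keep them short.

```python
def index_database(db_seq: str, k: int = 11) -> dict:
    """Map each non-overlapping k-mer of the database to its start positions."""
    index = {}
    for start in range(0, len(db_seq) - k + 1, k):
        index.setdefault(db_seq[start:start + k], []).append(start)
    return index


def lookup_query(query: str, index: dict, k: int = 11) -> list:
    """Return (query_pos, db_pos) for every query k-mer found in the index."""
    hits = []
    for q in range(len(query) - k + 1):
        for d in index.get(query[q:q + k], []):
            hits.append((q, d))
    return hits


# Toy example with k=4: database words are ACGT, TGCA, GGTA.
db_index = index_database("ACGTTGCAGGTA", k=4)
hits = lookup_query("CCTGCAGG", db_index, k=4)
print(hits)
```

The index is built once and reused for every query, which is what makes this layout attractive for a fixed genome.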

31 It will even tell you whether it found any known domain... or the level of similarity

32 scroll down to the bottom... the more the better

33 check hits you want to save... then click ‘Download’

34 Access to sequenced data: Species and Taxa Specific Databases
https://genome.ucsc.edu/ENCODE/
http://www.genecards.org/
http://www.biobase-international.com/product/hgmd

35 Comparative database of eukaryotic pathogens

36 gene/metabolic pathway oriented databases

