
1 1 Exercise 1 Bioinformatics Databases

2 2 What’s in a database?
• Sequences: genes, proteins, etc.
• Full genomes
• Annotation: information about the gene/protein:
  - function
  - cellular location
  - chromosomal location
  - introns/exons
  - protein structure
  - phenotypes, diseases
• Publications

3 3 NCBI and Entrez
• NCBI (National Center for Biotechnology Information) hosts one of the largest and most comprehensive collections of databases; it belongs to the NIH, the National Institutes of Health (USA)
• Entrez is the search engine of NCBI
• Search for: genes, proteins, genomes, structures, diseases, publications and more
• http://www.ncbi.nlm.nih.gov/

4 4 Searching for published papers
• Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry of human immunodeficiency virus type 1 envelope glycoprotein trimers during virus entry into host cells”, J Virol. 2006 May;80(9):4388-95.

5 5 Use fields!
Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J virol[TA]
For the full list of field tags, go to Help -> Search Field Descriptions and Tags
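The fielded query above can also be assembled programmatically. The sketch below is a minimal illustration, not part of the original exercise: it joins (value, field tag) pairs with AND and builds a URL for the NCBI E-utilities ESearch endpoint without sending any request. The helper names `build_pubmed_query` and `build_esearch_url` and the `retmax` default are assumptions made for this example.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(terms):
    """Join (value, field_tag) pairs into one fielded query string."""
    return " AND ".join(f"{value}[{tag}]" for value, tag in terms)


def build_esearch_url(query, db="pubmed", retmax=20):
    """Return the ESearch URL for the query; no network request is made."""
    params = {"db": db, "term": query, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


# The query from the slide: author, title word, publication date, journal.
query = build_pubmed_query([
    ("Yang", "AU"),
    ("glycoprotein", "TI"),
    ("2006", "DP"),
    ("J virol", "TA"),
])
print(query)
print(build_esearch_url(query))
```

Pasting the printed query into the PubMed search box should behave the same as typing it by hand; the URL form is only useful when scripting searches.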

6 6 Exercise
• Retrieve all publications in which the first author is Pe'er I and the last author is Shamir R

7 7 Using Limits
Retrieve the publications of Friedman N in the journals Bioinformatics and Journal of Computational Biology, from the last 5 years

8 8 Google scholar http://scholar.google.com/

9 9

10 10 NCBI gene & protein databases: GenBank
• GenBank is an annotated collection of all publicly available DNA sequences
• Holds 65 billion bases (Oct. 2007)
• GenPept is a database of translated coding sequences from GenBank

11 11 Searching for CD4 human using Entrez Search demonstration

12 12

13 13 Using Field Descriptions, Qualifiers, and Boolean Operators
• Cd4[GENE] AND human[ORGN]
  or: Cd4[gene name] AND human[organism]
• List of field codes: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers
• Boolean operators: AND, OR, NOT
Note: do not use the field Protein name [PROT], only [GENE]!

14 14

15 15 RefSeq
• RefSeq: a sub-collection of NCBI databases containing only non-redundant, highly annotated entries (genomic DNA, transcript (RNA), and protein products)

16 16

17 17 An explanation on GenBank records

18 18 Accession Numbers
• GenBank / EMBL: two letters followed by six digits, e.g. AY123456; or one letter followed by five digits, e.g. U12345
• GenPept (a.a. translations of GenBank): three letters and five digits, e.g. AAA12345
• RefSeq: distinguished from GenBank accessions by a distinct prefix format of [2 characters + underscore], e.g. NP_015325. NM_: nucleotide (mRNA), NP_: protein
• SWISS-PROT (another protein database): six characters in the format [O,P,Q] [0-9] [A-Z,0-9] [A-Z,0-9] [A-Z,0-9] [0-9], e.g. P12345 and Q9JJS7
• PDB (Protein Data Bank, structure database): one digit followed by three letters or digits, e.g. 1hxw
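As a rough illustration of the formats listed on this slide, the sketch below guesses which database an accession could belong to. The regular expressions are transcribed from the slide only; real databases accept additional formats, so treat this as an assumption-laden helper (`classify_accession` is a name invented here). Note that some strings, such as P12345, fit more than one listed format, which is why a list is returned.

```python
import re

# Patterns transcribed from the formats on the slide (a sketch, not a
# complete description of what each database accepts today).
ACCESSION_PATTERNS = [
    ("RefSeq", re.compile(r"[A-Z]{2}_\d+(\.\d+)?")),        # e.g. NP_015325
    ("GenBank/EMBL", re.compile(r"[A-Z]{2}\d{6}|[A-Z]\d{5}")),  # AY123456, U12345
    ("GenPept", re.compile(r"[A-Z]{3}\d{5}")),               # AAA12345
    ("Swiss-Prot", re.compile(r"[OPQ]\d[A-Z0-9]{3}\d")),     # P12345, Q9JJS7
    ("PDB", re.compile(r"\d[A-Za-z0-9]{3}")),                # 1hxw
]


def classify_accession(accession):
    """Return every database whose listed format matches the whole string."""
    return [
        name
        for name, pattern in ACCESSION_PATTERNS
        if pattern.fullmatch(accession)
    ]


for acc in ["AY123456", "U12345", "AAA12345", "NP_015325", "Q9JJS7", "P12345", "1hxw"]:
    print(acc, classify_accession(acc))
```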

19 19 Swiss-Prot
• A protein sequence database which strives to provide a high level of annotation:
  - the function of a protein
  - domain structure
  - post-translational modifications
  - variants
• One entry for each protein

20 20

21 21 GenBank Vs. Swiss-Prot GenBank results Swiss-Prot results

22 22 Downloading a sequence & FASTA format
• FASTA format:
>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK
ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL
LVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSG
TWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWW
QAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLA
LEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWV
LNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCV
RCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
Save accession numbers for future use (makes searching quicker):
RefSeq: NP_000607.1
Swiss-Prot: P01730
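A FASTA record is a header line starting with ">" followed by one or more sequence lines. The sketch below is a minimal parser written for illustration (the function name `parse_fasta` is an assumption, and established libraries exist for real work); it also shows how the RefSeq accession can be pulled out of the "|"-separated header used on the slide. Only the first two sequence lines of the CD4 record are included to keep the example short.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Truncated copy of the CD4 record from the slide (first two lines only).
example = """>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK
ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL
"""

for header, sequence in parse_fasta(example):
    accession = header.split("|")[3]  # fourth "|"-separated field
    print(accession, len(sequence), sequence[:10])
```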

23 23

24 24 PDB: Protein Data Bank
• Main database of 3D structures
• Includes ~47,000 entries (proteins, nucleic acids, others)
• Proteins organized in groups, families etc.
• Is highly redundant
• http://www.rcsb.org

25 25 CD4 in complex with gp120 (PDB ID 1G9M) [figure labels: gp120, CD4]

26 26 Organism specific databases
• Model organisms have independent databases
• HIV database: http://hiv-web.lanl.gov/content/index

27 27 GeneCards
• All-in-one database of human genes (a project by the Weizmann Institute)
• Attempts to integrate as many databases and publications, and as much of the available knowledge, as possible
• http://www.genecards.org

28 28

29 29 Summary
• General and comprehensive databases: NCBI, EMBL, DDBJ
• Genome specific databases: ENSEMBL, UCSC genome browser
• Highly annotated databases:
  - Human genes: GeneCards
  - Proteins: Swiss-Prot, RefSeq
  - Structures: PDB

30 30 The MOST important of all 1. Google (or any search engine)

31 31 And always remember: 2. RT(F)M – Read the manual!!

32 32 Help!
• Read the Help section
• Read the FAQ section
• Google the question!

33 33 Alignment teaser…
ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACG
TGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAG
GAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACT
GATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCA
GAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAG
GTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACA
ACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCC
TGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGT
CATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGC
ATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTT
TCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACA
ATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTT
TCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTA
CTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAG
GGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGG
TTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAAC
AAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGT
CTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAA
GGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCC
CTGGCTCACAAGTACCATTGA
MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE…
|| ||  ||||| ||| || |   ||||||||||||||||||||
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE…

34 34 Pairwise Sequence Alignment

35 35 What is sequence alignment?
Alignment: comparing two (pairwise) or more (multiple) sequences, searching for a series of identical or similar characters in the sequences.
MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE

36 36 Why sequence alignment?
Predict characteristics of a protein: use the structure or function information available in databases for known proteins with similar sequences in order to predict the structure or function of an unknown protein.
Assumption: similar sequences produce similar proteins

37 37 Local vs. Global
• Global alignment: finds the best alignment across the whole two sequences.
• Local alignment: finds regions of high similarity in parts of the sequences.
Global:
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local:
ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ
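The difference between the two modes can be sketched with the classic dynamic-programming recurrences: Needleman-Wunsch for global alignment and Smith-Waterman for local alignment. The code below computes scores only (no traceback). The scoring values (match +1, mismatch -1, linear gap -1) and the function name `alignment_score` are assumptions chosen for illustration and are not taken from the slides.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Best global (Needleman-Wunsch) or local (Smith-Waterman) score."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global mode: leading gaps are penalised along the borders.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local mode: a cell never drops below zero, so an
                # alignment can start anywhere; track the best cell.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    # Global score is the bottom-right cell; local score is the maximum cell.
    return best if local else score[-1][-1]


seq1, seq2 = "ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"
print("global:", alignment_score(seq1, seq2))
print("local: ", alignment_score(seq1, seq2, local=True))
```

Because the local score may ignore poorly matching flanks, it is never lower than the global score under the same scoring values.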

38 38 Sequence evolution
In the course of evolution, sequences changed from the ancestral sequence by random mutations.
Three types of changes:
1. Insertion: AAGA -> AAGTA

39 39 Sequence evolution
In the course of evolution, sequences changed from the ancestral sequence by random mutations.
Three types of changes:
1. Insertion: AAGA -> AAGTA
2. Deletion: AAGA -> AGA

40 40 Evolutionary changes in sequences
In the course of evolution, sequences changed from the ancestral sequence by random mutations.
Three types of mutations:
1. Insertion: AAGA -> AAGTA
2. Deletion: AAGA -> AGA
3. Substitution: AAGA -> AACA
Insertion + Deletion -> Indel

41 41 Sequence alignment
Unaligned:
AAGCTGAATTCGAA
AGGCTCATTTCTGA
Aligned:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
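Once two sequences are written as gapped rows of equal length, each column is a match, a mismatch, or a gap. The sketch below counts the three column types for the aligned pair on this slide and combines them into a score; the helper names and the scoring values (match +1, mismatch -1, gap -2) are assumptions for illustration.

```python
def alignment_stats(row_a, row_b, gap_char="-"):
    """Count match, mismatch and gap columns of two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    stats = {"match": 0, "mismatch": 0, "gap": 0}
    for x, y in zip(row_a, row_b):
        if x == gap_char or y == gap_char:
            stats["gap"] += 1
        elif x == y:
            stats["match"] += 1
        else:
            stats["mismatch"] += 1
    return stats


def score_from_stats(stats, match=1, mismatch=-1, gap=-2):
    """Sum column scores under a simple linear scoring scheme."""
    return (stats["match"] * match
            + stats["mismatch"] * mismatch
            + stats["gap"] * gap)


# The gapped rows shown on the slide.
stats = alignment_stats("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
print(stats, score_from_stats(stats))
```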

42 42 Scoring scheme
• Match/mismatch scores: substitution matrices
  - Nucleic acids: transition-transversion
  - Amino acids:
    - Evolution (empirical data) based: PAM, BLOSUM
    - Physico-chemical properties based: Grantham, McLachlan
• Gap penalty
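For nucleic acids, a transition-transversion scheme scores a purine-to-purine (A/G) or pyrimidine-to-pyrimidine (C/T) substitution more leniently than a change between the two classes. The sketch below shows one way to express that; the numeric values (+1, -1, -2) and the function name `nucleotide_score` are assumptions, since the slides give no specific numbers.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one aligned pair of DNA bases under a transition/transversion scheme."""
    x, y = x.upper(), y.upper()
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"unexpected base: {base!r}")
    if x == y:
        return match
    # Transition: both bases are purines, or both are pyrimidines.
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


for pair in [("A", "A"), ("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]:
    print(pair, nucleotide_score(*pair))
```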

43 43 Amino Acid Scoring Matrices
• PAM matrices: PAM80, PAM120, PAM250
  - The number of a PAM matrix represents evolutionary distance
  - Greater numbers denote greater distances
  - Low PAM: strong similarities
  - High PAM: weak similarities
  - PAM120 for general use (40% identity)
  - PAM60 for close relations (60% identity)
  - PAM250 for distant relations (20% identity)
• If uncertain, try several different matrices

44 44 Amino Acid Scoring Matrices
• BLOSUM matrices: BLOSUM45, BLOSUM62, BLOSUM80
  - The number of a BLOSUM matrix represents average % identity
  - Greater numbers denote greater identity
  - Low BLOSUM: weak similarities
  - High BLOSUM: strong similarities
  - BLOSUM62 for general use
  - BLOSUM80 for close relations
  - BLOSUM45 for distant relations
• If uncertain, try several different matrices

45 45 Web servers for pairwise alignment

46 46 BLAST 2 sequences (bl2seq) at NCBI
Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine for local alignment
• BLAST does not use an optimal algorithm but a heuristic

47 47 Back to NCBI

48 48 BLAST – bl2seq

49 49 Bl2seq - query
• blastn: nucleotide
• blastp: protein

50 50 Bl2seq results

51 51 Bl2seq results [figure labels: Match, Dissimilarity, Gaps, Similarity, Low complexity]

52 52 Bl2seq results:
• Bit score: a score for the alignment according to the number of identities, similarities, etc.
• Expected score (E-value): the number of alignments with the same or a better score one can “expect” to observe by chance when searching a database of a particular size. The closer the E-value is to zero, the greater the confidence that the hit is real
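The link between the two numbers is usually written as E = m · n · 2^(-S'), where S' is the bit score, m the query length and n the total database length. The sketch below simply evaluates that relation; it is a simplification, since BLAST itself uses edge-corrected "effective" lengths, and the function name and the example lengths are assumptions for illustration.

```python
def expected_hits(bit_score, query_len, db_len):
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: a 300-residue query against 10**9 database residues.
for bits in (30, 40, 50, 60):
    print(bits, expected_hits(bits, 300, 10**9))
```

Each additional bit halves the E-value, and the same bit score becomes less significant as the database grows.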

53 53 BLAST – programs
Query: DNA | Protein
Database: DNA | Protein

54 54 BLAST – Blastp

55 55 Blastp - results

56 56 Blastp – results (cont’)

57 57 Blastp – acquiring sequences

58 58 Blastp – acquiring sequences (cont’)

59 59 FASTA format – multiple sequences
>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT
SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI
ALAHKYH
>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH
>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS
ALSSRYH
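Because these globin sequences have the same length, a quick position-by-position comparison is possible even without running an aligner. The sketch below computes percent identity for every pair, using only the first 70 residues of three of the records above to keep the example short. It is an illustration under that equal-length assumption, not a substitute for a real alignment, and `percent_identity` is a name chosen here.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percentage of identical positions in two equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / len(a)


# First sequence line (70 residues) of three records from the slide.
globins = {
    "delta": "MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG",
    "beta": "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG",
    "epsilon": "MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT",
}

for (name_a, seq_a), (name_b, seq_b) in combinations(globins.items(), 2):
    print(f"{name_a} vs {name_b}: {percent_identity(seq_a, seq_b):.1f}%")
```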

60 60 Searching for remote homologs
• Sometimes BLAST isn’t enough
• Large protein family, and BLAST only finds close members. We want more distant members
• PSI-BLAST
• Profile HMMs (not discussed)

61 61 PSI-BLAST
• Position Specific Iterated BLAST
Regular BLAST -> construct profile from BLAST results -> BLAST profile search -> final results

62 62 PSI-BLAST
• Advantage: PSI-BLAST looks for sequences that are close to the query, and learns from them to extend the circle of friends
• Disadvantage: if we obtain a WRONG hit, we will get to unrelated sequences (contamination). This gets worse with each iteration
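The "construct profile" step can be pictured as turning a set of aligned hits into per-column residue scores. The sketch below is a heavily simplified stand-in for what PSI-BLAST does: it assumes ungapped hits of equal length, a uniform background frequency and a fixed pseudocount, whereas the real program also weights sequences and uses more refined statistics. All names are invented for this example, and the input is the first ten residues of the five globins shown earlier.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Return one dict per column mapping residue -> log-odds score."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_sequence(profile, seq):
    """Sum the column scores of a candidate of the same length."""
    return sum(column[aa] for column, aa in zip(profile, seq))


# First ten residues of the five globin sequences from the FASTA slide.
hits = ["MVHLTPEEKT", "MVHLTPEEKS", "MVHFTAEEKA", "MGHFTEEDKA", "MGHFTEEDKA"]
profile = build_profile(hits)
print(score_sequence(profile, "MVHLTPEEKT"))  # a family member
print(score_sequence(profile, "WWWWWWWWWW"))  # an unrelated string
```

The contamination problem from the slide shows up directly in this picture: if an unrelated hit is added to `hits`, its residues raise the scores of unrelated candidates in the next round.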

63 63 BLAST – PSI-Blast

64 64 PSI-Blast - results

