1 Introduction to Bioinformatics Fall 2008. 2 Administration  Adi Doron  Nimrod Rubinstein  Dudu Burstein.

1 1 Introduction to Bioinformatics Fall 2008

2 2 Administration  Adi Doron doronadi@post.tau.ac.il  Nimrod Rubinstein rubi@post.tau.ac.il  Dudu Burstein davidbur@post.tau.ac.il  Reception hours: by appointment Britania 405, 6409245

3 3 Course Website http://bioinfo.tau.ac.il/~intro_bioinfo/

4 4 Exercises
- Each student participates once every 2 weeks, in one of these slots:
  - Sunday 16:00-18:00
  - Monday 12:00-14:00
  - Monday 14:00-16:00
- Computer classroom Sherman 03

5 5 Requirements
- Exam: 80% of final grade
- Assignments: 20% of final grade (compulsory)
Assignments include classwork and homework:
- Classwork is planned to be completed during the exercise session. It should be mailed to the TA. It will be checked but not graded.
- Homework should be handed in at the following exercise session (2 weeks after the hand-out date). It will be checked and graded.

6 6 Goals
- To familiarize the students with research topics in bioinformatics, and with bioinformatic tools
- The emphasis will be on tools and their use
Prerequisites
- Familiarity with topics in molecular biology (cell biology and genetics)
- Basic familiarity with computers & internet

7 7 BIOINFORMATIC DATABASES

8 8 What’s in a database?  Sequences – genes, proteins, etc.  Full genomes  Annotation – information about the gene/protein: - function - cellular location - chromosomal location - introns/exons - protein structure - phenotypes, diseases  Publications

9 9 NCBI and Entrez  NCBI (National Center for Biotechnology Information) hosts one of the largest and most comprehensive collections of databases and belongs to the NIH, the National Institutes of Health (USA)  Entrez is the search engine of NCBI  Search for: genes, proteins, genomes, structures, diseases, publications and more.  http://www.ncbi.nlm.nih.gov/

10 10 Search for published papers  Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry of human immunodeficiency virus type 1 envelope glycoprotein trimers during virus entry into host cells”, J Virol. 2006 May;80(9):4388-95.

11 11 Use fields! Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J virol[TA] For the full list of field tags: go to help -> Search Field Descriptions and Tags
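The field-tagged query on this slide can also be assembled programmatically. The following is a minimal sketch in Python; the helper names `tagged` and `and_query` are hypothetical and chosen only for illustration, while the tags themselves ([AU], [TI], [DP], [TA]) are the ones shown on the slide.

```python
def tagged(term: str, tag: str) -> str:
    """Attach a search field tag to a term, e.g. ("Yang", "AU") -> "Yang[AU]"."""
    return f"{term}[{tag}]"


def and_query(*parts: str) -> str:
    """Combine already-tagged terms with the Boolean operator AND."""
    return " AND ".join(parts)


# Rebuild the query from the slide: author, title word, publication date, journal.
query = and_query(
    tagged("Yang", "AU"),
    tagged("glycoprotein", "TI"),
    tagged("2006", "DP"),
    tagged("J Virol", "TA"),
)
print(query)
```

The resulting string can be pasted into the search box as-is; restricting each term to a field usually narrows the result list considerably compared with an untagged search.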

12 12 Exercise  Retrieve all publications in which the first author is: Pe'er I and the last author is: Shamir R

13 13 Using Limits Retrieve the publications of Friedman N, in the journals: Bioinformatics and Journal of Computational Biology, in the last 5 years

14 14 Google scholar http://scholar.google.com/

15 15

16 16 NCBI gene & protein databases: GenBank  GenBank is an annotated collection of all publicly available DNA sequences.  Holds 65 billion bases (Oct. 2007)  GenPept is a database of translated coding sequences from GenBank

17 17 Searching for CD4 human using Entrez Search demonstration

18 18

19 19 Using Field Descriptions, Qualifiers, and Boolean Operators
- Cd4[GENE] AND human[ORGN]
  or: Cd4[gene name] AND human[organism]
- List of field codes: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers
- Boolean operators: AND, OR, NOT
Note: do not use the Protein name field [PROT], only [GENE]!
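The same kind of query can be sent to NCBI from a script rather than the web form. The sketch below only builds the request URL for the E-utilities ESearch service and does not contact the network. The choice of the `protein` database, the `retmax` value, and the function name `esearch_url` are assumptions made for illustration, not part of the course material.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Return an ESearch URL; urlencode escapes brackets and spaces in the term."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


# Query from the slide, using the [GENE] and [ORGN] field tags.
url = esearch_url("protein", "Cd4[GENE] AND human[ORGN]")
print(url)

# Decoding the URL again shows that the term survives the round trip unchanged.
decoded_term = parse_qs(urlsplit(url).query)["term"][0]
print(decoded_term)
```

Opening the printed URL in a browser should return an XML list of matching record identifiers, which is often more convenient than the web interface when many searches are needed.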

20 20

21 21 RefSeq  REFSEQ: sub-collection of NCBI databases with only non-redundant, highly annotated entries (genomic DNA, transcript (RNA), and protein products)

22 22

23 23 An explanation on GenBank records

24 24 Accession Numbers
- GenBank / EMBL: two letters followed by six digits, e.g. AY123456; or one letter followed by five digits, e.g. U12345
- GenPept (amino-acid translations of GenBank): three letters and five digits, e.g. AAA12345
- RefSeq: distinguished from GenBank accessions by a distinct prefix format of [2 characters + underscore], e.g. NP_015325. NM_: nucleotide, NP_: protein
- SWISS-PROT (another protein database): all are six characters, e.g. P12345 and Q9JJS7. Format by position: 1 [O,P,Q]; 2 [0-9]; 3 [A-Z,0-9]; 4 [A-Z,0-9]; 5 [A-Z,0-9]; 6 [0-9]
- PDB (Protein Data Bank, structure database): one digit followed by three letters or digits, e.g. 1hxw
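The accession formats on this slide translate naturally into regular expressions. The sketch below is one possible reading of those rules, not an official validator: the name `candidate_databases` is hypothetical, and the PDB pattern allows letters or digits after the leading digit because IDs such as 1G9M contain a digit in a later position. Some formats overlap (P12345 fits both the one-letter GenBank format and the SWISS-PROT format), so the function returns every database whose pattern matches.

```python
import re

# One pattern per database, following the formats listed on the slide.
PATTERNS = {
    "GenBank/EMBL": re.compile(r"[A-Z]{2}\d{6}|[A-Z]\d{5}"),
    "GenPept": re.compile(r"[A-Z]{3}\d{5}"),
    "RefSeq": re.compile(r"[A-Z]{2}_\d+"),
    "Swiss-Prot": re.compile(r"[OPQ]\d[A-Z0-9]{3}\d"),
    "PDB": re.compile(r"\d[A-Z0-9]{3}"),
}


def candidate_databases(accession: str) -> list[str]:
    """Return the names of all databases whose accession format matches."""
    acc = accession.strip().upper()  # PDB IDs are often written in lower case
    return [name for name, pattern in PATTERNS.items() if pattern.fullmatch(acc)]


for acc in ["AY123456", "U12345", "AAA12345", "NP_015325", "Q9JJS7", "1hxw", "P12345"]:
    print(acc, candidate_databases(acc))
```

Returning a list rather than a single name keeps the ambiguity visible: format alone cannot always tell which database an identifier came from.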

25 25 Swissprot  A protein sequence database which strives to provide a high level of annotation: * the function of a protein * domain structure * post-translational modifications * variants  One entry for each protein

26 26

27 27 GenBank Vs. Swiss-Prot GenBank results Swiss-Prot results

28 28 Downloading & Fasta format
Fasta format:
>sp|P01730|CD4_HUMAN T-cell surface glycoprotein CD4 precursor
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK
ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL
LVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSG
TWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWW
QAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLA
LEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWV
LNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCV
RCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
Save Accession Numbers for future use (makes searching quicker):
- Refseq: NP_000607
- Swissprot: P01730
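A Fasta record is a header line starting with ">" followed by one or more sequence lines, which makes it easy to read in a script. The following is a minimal sketch of such a reader; the function name `parse_fasta` is hypothetical, and the example uses only the first two sequence lines of the CD4 record above, so the sequence is deliberately truncated.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Parse Fasta text into (header, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without separators
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Truncated example: only the first two sequence lines of the CD4 record.
example = """>sp|P01730|CD4_HUMAN T-cell surface glycoprotein CD4 precursor
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK
ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL
"""

header, sequence = parse_fasta(example)[0]
# In this header style the accession is the second "|"-separated field.
accession = header.split("|")[1]
print(accession, len(sequence))
```

Splitting the header on "|" works for the Swiss-Prot style header shown here; headers from other databases are laid out differently and would need their own handling.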

29 29

30 30 PDB: Protein Data Bank  Main database of 3D structures.  Includes ~47,000 entries (proteins, nucleic acids, others).  Proteins organized in groups, families etc.  Is highly redundant.  http://www.rcsb.org

31 31 CD4 in complex with gp120 gp120 CD4 PDB ID 1G9M

32 32  Model organisms have independent databases: Organism-specific HIV database http://hiv-web.lanl.gov/content/index

33 33 Genecards  All-in-one database of human genes (a project by the Weizmann Institute)  Attempts to integrate as many databases and publications as possible, along with all available knowledge  http://www.genecards.org

34 34

35 35 Summary
- General and comprehensive databases: NCBI, EMBL, DDBJ
- Genome-specific databases: ENSEMBL, UCSC genome browser
- Highly annotated databases:
  - Human genes: Genecards
  - Proteins: Swissprot, Refseq
  - Structures: PDB

36 36 The MOST important of all 1. Google (or any search engine)

37 37 And always remember: 2. RT(F)M – Read the manual!!

38 38 Help!  Read the Help section  Read the FAQ section  Google the question!

