Cédric Notredame (20/09/2015) Finding What you Need in Biological Databases Cédric Notredame.


1 Cédric Notredame (20/09/2015) Finding What you Need in Biological Databases Cédric Notredame

2 Cédric Notredame (20/09/2015) Where is my Needle ? Databases:

3 Cédric Notredame (20/09/2015)

4 Our Scope
- Give you the means to answer simple questions
- Databases are UNFRIENDLY INFORMATION DESKS
- Give you an idea of what is possible
- WHAT can you ask? HOW can you ask it?

5 Cédric Notredame (20/09/2015) Outline
- An overall view
- Asking a database a biological question
- Turning a question into a query
- Bibliographic databases: Medline, OMIM
- Gene databases: GenBank, LocusLink, ENSEMBL
- Protein databases: SwissProt, InterPro, ProDom
- SRS

6 Cédric Notredame (20/09/2015) Database: What is a Database ?

7 Cédric Notredame (20/09/2015) Database Entries
- 1 entry = 1 sequence: AGCTGTCGAGGGATAGGACA TATACATAAATTAATATAAT
- 1 entry = 1 file = sequence + documentation (SEQ + DOC) = flat file
- Database = collection of flat files

8 Cédric Notredame (20/09/2015) Database Entries: Flat Files
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: DEA=oct-nov-dec 2002
http://www.expasy.org/people/amos.html
//
Accession number: 2
First Name: Laurent
Last Name: Falquet
Course: EMBnet=sept 2000, sept 2001; DEA=oct-nov-dec 2000;
//
Accession number: 3
First Name: Marie-Claude
Last Name: Blatter Garin
Course: EMBnet=sept 2000, sept 2001; DEA=oct-nov-dec 2000;
http://www.expasy.org/people/Marie-Claude.Blatter-Garin.html
//
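The "//"-terminated records on this slide can be read with a few lines of code. The sketch below is only an illustration: the function name `parse_flat_file` is hypothetical, the sample text is a shortened copy of the first two entries, and the key-less URL lines are left out on purpose because they do not follow the "key: value" pattern.

```python
from typing import Dict, List

# Shortened copy of the slide's records (URL lines omitted: they carry no key).
FLAT_FILE = """\
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: DEA=oct-nov-dec 2002
//
Accession number: 2
First Name: Laurent
Last Name: Falquet
Course: EMBnet=sept 2000, sept 2001; DEA=oct-nov-dec 2000;
//
"""


def parse_flat_file(text: str) -> List[Dict[str, str]]:
    """Split a flat file into entries; each entry ends with a '//' line."""
    entries: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "//":
            # End of record: store it and start a new one.
            if current:
                entries.append(current)
            current = {}
            continue
        # Split on the first colon only, so that values may contain colons.
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:  # tolerate a missing final terminator
        entries.append(current)
    return entries


entries = parse_flat_file(FLAT_FILE)
print([entry["Last Name"] for entry in entries])
```

Each entry becomes a dictionary, so a field such as the accession number can be looked up by name rather than by position in the file.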

9 Cédric Notredame (20/09/2015) Database: Relational Databases
Relational database ("table file"):

Teacher    Accession number   Education
Amos       1                  Biochemistry
Laurent    2                  Biochemistry
M-Claude   3                  Biochemistry

Course   Date                   Involved teachers
DEA      Oct-nov-dec 2000       1,3
EMBnet   Sept 2000, Sept 2001   2,3
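The two tables on this slide map naturally onto a small relational schema. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table and column names are assumptions chosen for illustration, and the "involved teachers" column is modelled as a separate link table (`teaches`), which is one common way to store a list of references.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE teacher (accession INTEGER PRIMARY KEY, name TEXT, education TEXT);
CREATE TABLE course  (name TEXT PRIMARY KEY, dates TEXT);
-- Link table: one row per (course, teacher) pair.
CREATE TABLE teaches (course  TEXT    REFERENCES course(name),
                      teacher INTEGER REFERENCES teacher(accession));

INSERT INTO teacher VALUES (1, 'Amos', 'Biochemistry'),
                           (2, 'Laurent', 'Biochemistry'),
                           (3, 'M-Claude', 'Biochemistry');
INSERT INTO course  VALUES ('DEA', 'Oct-nov-dec 2000'),
                           ('EMBnet', 'Sept 2000, Sept 2001');
INSERT INTO teaches VALUES ('DEA', 1), ('DEA', 3), ('EMBnet', 2), ('EMBnet', 3);
""")


def teachers_of(course: str) -> list:
    """Return the names of the teachers involved in a course."""
    rows = conn.execute(
        "SELECT t.name FROM teacher AS t "
        "JOIN teaches AS x ON x.teacher = t.accession "
        "WHERE x.course = ? ORDER BY t.accession",
        (course,),
    ).fetchall()
    return [name for (name,) in rows]


print(teachers_of("EMBnet"))
```

Unlike the flat file, each teacher is stored once and the courses only refer to accession numbers, so a correction to a teacher record does not have to be repeated in every course.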

10 Cédric Notredame (20/09/2015) To Summarize: What’s a database?
- A collection of data that is:
  - Structured
  - Searchable (index) -> table of contents
  - Updated periodically (release) -> new edition
  - Cross-referenced (hyperlinks) -> links with other databases
- A collection of tools (software) necessary for searching, updating and releasing the data
- Data storage management: flat files, relational databases…
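The "searchable (index) -> table of contents" idea can be illustrated with a tiny inverted index. This is a toy sketch, not how any particular database is implemented: the three documentation strings and the function names `build_index` and `search` are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical documentation strings keyed by accession number.
DOCS = {
    "1": "dUTPase from Escherichia coli",
    "2": "erythropoietin precursor from human",
    "3": "dUTPase from human",
}


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every lower-cased word to the set of entries that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for accession, text in docs.items():
        for word in text.lower().split():
            index[word].add(accession)
    return index


def search(index: Dict[str, Set[str]], *words: str) -> List[str]:
    """Return the accessions whose documentation contains all the words."""
    hits = [index.get(word.lower(), set()) for word in words]
    return sorted(set.intersection(*hits)) if hits else []


index = build_index(DOCS)
print(search(index, "dUTPase", "human"))
```

The index is built once per release; each query then only intersects a few small sets instead of rereading every entry.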

11 Cédric Notredame (20/09/2015) Database: What’s on the Menu?

12 Cédric Notredame (20/09/2015) A large amount of information
- More than 1000 different databases
- Generally accessible through the web:
  - EBI: http://www.ebi.ac.uk/
  - NCBI: http://www.ncbi.nlm.nih.gov
  - Google: http://www.google.com
- Variable size, up to 10 Gb and more:
  - DNA: > 10 Gb
  - Protein: 1 Gb
  - 3D structure: 5 Gb
  - Other: smaller
- Update frequency: daily to annually

13 Cédric Notredame (20/09/2015) A Non Exhaustive List
AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc.
There exists a specialized database on almost anything you can think of!

14 Cédric Notredame (20/09/2015) A database of databases

15 Cédric Notredame (20/09/2015) What’s on the Menu: The Art of Eating Well
- Always use fresh data: the latest update of your database
- Make sure the database is maintained: many databases are poorly maintained
- Treat databases like publications: some journals are better than others

16 Cédric Notredame (20/09/2015) Bio-Google: How Can I Search a Database ?

17 Cédric Notredame (20/09/2015) Searching Databases
There are 2 ways to search databases:
- Text-based queries (Medline, Entrez): search the documentation (DOC) of the entries, e.g. search for "Smith AND dUTPase"
- Similarity searches (BLAST): search the sequences (SEQ), e.g. with AGCTGTCGAGGGATAGGACA TATACATAAATTAATATAAT

18 Cédric Notredame (20/09/2015) Searching Databases
- Each database is a little kingdom…
  - Has its own query system
  - Has its own information structure
- The main databases are well documented and this documentation is available online
- Most databases can be searched using SRS or Entrez

19 Cédric Notredame (20/09/2015) Databases: Asking the right Question Databases ARE NOT meant for browsing. When you search a database, you must have an idea of what your needle in a haystack looks like.

20 Cédric Notredame (20/09/2015) Databases: Asking the right Question Browsing a database is like Using your phone book in place of a dating agency…

21 Cédric Notredame (20/09/2015) Databases: Asking the right Question Finding Data: Database Search Finding Questions: Data Mining

22 Cédric Notredame (20/09/2015) The Kind Of Questions We Can Ask: SEQUENCE Based
- InterPro: Any known domain in my protein?
- SwissProt: Any protein like mine?
These ARE predictions.

23 Cédric Notredame (20/09/2015) The Kind Of Questions We Can Ask: TEXT Based
- Medline: Who worked on my protein?
- SwissProt: What is the function of my protein?
- PDB: What is the structure of my protein?
These are NOT predictions.

24 Cédric Notredame (20/09/2015) Just like when you use Google: specific queries give precise answers

25 Cédric Notredame (20/09/2015) Medline: Who worked on my Protein ?

26 Cédric Notredame (20/09/2015) Medline (PubMed)

27 Cédric Notredame (20/09/2015) What is in Medline?
- MEDLINE covers the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and the preclinical sciences
- More than 4,000 biomedical journals and more than 10 million citations, from 1966 to the present
- Contains links to biological databases and to some journals
- Many papers not dealing with humans are not in Medline
- Before 1970, only the first 10 authors are kept!

28 Cédric Notredame (20/09/2015) Using Medline: Asking a question During the last Lab Meeting, I heard the word dUTPase. What can it be ? What has been published on this ?

29 Cédric Notredame (20/09/2015) Using Medline: Asking a question

30 Cédric Notredame (20/09/2015) Using Medline: Asking a question

31 Cédric Notredame (20/09/2015) Using Medline: Asking a question

32 Cédric Notredame (20/09/2015) Using Medline: Asking a question By Default, Medline Assumes you mean: Abergel AND dUTPase
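The implicit AND described on this slide can be made explicit when a query is assembled by a program. The sketch below only builds the query string and a search URL and does not contact any server. The endpoint is NCBI's E-utilities `esearch` service, which is not mentioned in the slides and is used here as an assumption about how one might submit the query. The helper names `and_query` and `esearch_url` are hypothetical.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (an assumption: not part of the original slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def and_query(*terms: str) -> str:
    """Join terms with an explicit AND, the operator Medline assumes by default."""
    return " AND ".join(terms)


def esearch_url(term: str, db: str = "pubmed", retmax: int = 20) -> str:
    """Build (but do not send) a search URL for the given query."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


query = and_query("Abergel", "dUTPase")
print(query)
print(esearch_url(query))
```

Writing the operator out makes it obvious what to change when a broader search is wanted, for example replacing AND with OR between synonyms.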

33 Cédric Notredame (20/09/2015) Using Medline: Asking a question I have found the reference I wanted. Now I want to save it so that I can use it later, for instance to import it into EndNote, my reference manager. Save your data in the proper database format.

34 Cédric Notredame (20/09/2015) Using Medline: Storing your results

35 Cédric Notredame (20/09/2015) Using Medline: Storing your results

36 Cédric Notredame (20/09/2015) Retrieving EXACTLY the Information that you need Restricted fields: [AB] (abstract), [AD] (author address)

37 Cédric Notredame (20/09/2015) Using Medline: Storing your results AB AD

38 Cédric Notredame (20/09/2015) Using Medline: Looking for a Review I Want to Find the LATEST REVIEW on the dUTPase. Use The Limit Option of Medline

39 Cédric Notredame (20/09/2015) Using Medline: Looking For a Review
1- Limits: Language, Title OR Abstract, Article type

40 Cédric Notredame (20/09/2015) Using Medline: A Few Tips
- Quoted queries (e.g. "down syndrome") behave as a single word and are great for improving the relevance of your search
- Adding initials to names (e.g. "Abergel C"), if you can, also reduces your output
- Write down the PubMed Identifier (the number in the PMID field) of that interesting paper you just found. It could be very useful in your subsequent searches for related items such as associated gene and protein sequences

41 Cédric Notredame (20/09/2015) Using Medline: A Few Tips
- When a search fails, check for spelling mistakes, wrong field restrictions or wrong Limits settings: these may be the problem
- Use abstracts to enlarge your vocabulary and look for synonyms: some papers on dUTPase might use dUTP pyrophosphatase instead!
- Try the "related papers" button (on the extreme right of the PubMed output) from time to time, to enlarge a search that is not giving you enough references
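The tips on quoted phrases, author initials and synonyms can be combined into one query string. The sketch below is a minimal illustration with hypothetical helper names (`phrase`, `any_of`, `field`). It assumes the PubMed field tag `[AU]` for author names and builds the text of the query only.

```python
def phrase(text: str) -> str:
    """Quote a multi-word term so that it is searched as a single unit."""
    return f'"{text}"' if " " in text else text


def any_of(*terms: str) -> str:
    """Accept any of several synonyms (OR), grouped in parentheses."""
    return "(" + " OR ".join(phrase(term) for term in terms) + ")"


def field(term: str, tag: str) -> str:
    """Restrict a term to one field, e.g. field('Abergel C', 'AU')."""
    return f"{term}[{tag}]"


query = " AND ".join([
    any_of("dUTPase", "dUTP pyrophosphatase"),  # synonyms found in abstracts
    field("Abergel C", "AU"),                   # author name with initial
])
print(query)
```

Keeping the synonyms in one OR group means a new spelling found in an abstract can be added without touching the rest of the query.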

42 Cédric Notredame (20/09/2015) Using Medline: A Few Tips
Storing your PDFs:
- Memory is cheap, access is sometimes strange… Storing your favourite PDFs is a good idea
- Which name on your disk? THE MEDLINE ID NUMBER!!!
- With a reference manager like EndNote

43 Cédric Notredame (20/09/2015)

44 GenBank: What is the Sequence of my Gene ?

45 Cédric Notredame (20/09/2015) GenBank: an Overview

46 Cédric Notredame (20/09/2015) GenBank: an Overview

47 Cédric Notredame (20/09/2015) GenBank: an Overview EMBL, GenBank and DDBJ are the same database. They are synchronized every day.

48 Cédric Notredame (20/09/2015) GenBank: an Overview
- GenBank contains EVERY piece of DNA that has been sequenced and made publicly available
- It contains GOOD and BAD data
- There is a historical aspect to the GenBank data: complex genes are spread over many entries

49 Cédric Notredame (20/09/2015) GenBank Entries Are Complex because Genes are complex
Prokaryotic example (figure labels): Gene, Promoter, RBS, mRNA, ORF (from ATG to STOP), Protein
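The ORF in the prokaryotic example runs from an ATG start codon to the first in-frame stop codon. The sketch below shows that idea on a made-up sequence. It is a deliberately naive scan of one strand, and the function name `first_orf` is hypothetical. Real gene finders also consider the reverse strand, alternative start codons and minimum lengths.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_orf(dna: str) -> str:
    """Return the first ATG...stop open reading frame on this strand, or ''."""
    dna = dna.upper()
    start = dna.find("ATG")
    while start != -1:
        # Walk codon by codon in the frame set by this ATG.
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                return dna[start:i + 3]
        # No in-frame stop for this ATG: try the next one.
        start = dna.find("ATG", start + 1)
    return ""


# Made-up sequence: two flanking bases, ATG, two codons, a TAA stop, two more bases.
print(first_orf("ccATGAAACGTTAAgg"))
```

The returned string includes both the start and the stop codon, which makes it easy to check that its length is a multiple of three.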

50 Cédric Notredame (20/09/2015) GenBank Entries Are Complex because Genes are complex
Figure labels: Gene, Promoter, exon, mRNA (form1), mRNA (form2), Protein (form1), Protein (form2)

51 Cédric Notredame (20/09/2015) Using GenBank: Asking a question What is the sequence of the E. coli dUTPase?

52 Cédric Notredame (20/09/2015) Using GenBank: Asking a question
The Naive Way: Escherichia coli dUTPase
- This search reports EVERY GenBank entry that contains these two words
- Most bacterial genome entries (annotated by similarity) contain these two words

53 Cédric Notredame (20/09/2015) Using GenBank: Asking a question
The Right Way: Escherichia coli[organism] dUTPase[definition]
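The difference between the naive query and the field-restricted one can be reproduced on a toy collection. In the sketch below, the two entries, their accession labels and the function names are all invented for illustration. The point is only that a genome annotated "by similarity" matches the free-text search but not the fielded one.

```python
# Two invented entries: a real E. coli gene and a genome annotated by similarity.
ENTRIES = [
    {"accession": "X1", "organism": "Escherichia coli",
     "definition": "dUTPase gene", "annotation": ""},
    {"accession": "X2", "organism": "Bacillus subtilis",
     "definition": "complete genome",
     "annotation": "similar to Escherichia coli dUTPase"},
]


def naive_search(entries, *words):
    """Match entries that contain every word anywhere, in any field."""
    hits = []
    for entry in entries:
        text = " ".join(entry.values()).lower()
        if all(word.lower() in text for word in words):
            hits.append(entry["accession"])
    return hits


def fielded_search(entries, **criteria):
    """Match entries where each named field contains the requested value."""
    hits = []
    for entry in entries:
        if all(value.lower() in entry[name].lower()
               for name, value in criteria.items()):
            hits.append(entry["accession"])
    return hits


print(naive_search(ENTRIES, "Escherichia coli", "dUTPase"))
print(fielded_search(ENTRIES, organism="Escherichia coli", definition="dUTPase"))
```

The naive search returns both entries, while the fielded search keeps only the entry whose organism and definition fields actually match.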

54 Cédric Notredame (20/09/2015) Using GenBank: And There Is Plenty More Where It Comes From…
GenBank is redundant:
- If a gene is published more than once, each publication gets its own entry
- This can mean MANY ENTRIES if you have SNPs or ESTs
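One simple way to see this redundancy in a set of retrieved entries is to group them by sequence. The sketch below uses invented accession labels and short made-up sequences, and the function name `group_identical` is hypothetical. It only collapses exact duplicates, so entries that differ by a single base (as with SNPs) stay in separate groups.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Invented (accession, sequence) pairs: A1 and A2 are the same sequence.
RETRIEVED = [("A1", "ATGGCT"), ("A2", "atggct"), ("A3", "ATGGCA")]


def group_identical(entries: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group accession numbers that share exactly the same sequence."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for accession, sequence in entries:
        groups[sequence.upper()].append(accession)
    return dict(groups)


print(group_identical(RETRIEVED))
```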

55 Cédric Notredame (20/09/2015) Header Contains all the practical Information

56 Cédric Notredame (20/09/2015) Features Contains Experimental Information and Predictions

57 Cédric Notredame (20/09/2015) Extra Gene This is common in GenBank entries

58 Cédric Notredame (20/09/2015) Using GenBank: Asking a question What is the sequence of the human dUTPase? (Previous question: what is the sequence of the E. coli dUTPase?)

59 Cédric Notredame (20/09/2015) Using GenBank: Finding the Human dUTPase
1- Request Limits
2- Check the box here to exclude ESTs

60 Cédric Notredame (20/09/2015) Using GenBank: Finding the Human dUTPase The Gene does NOT appear in a single entry

61 Cédric Notredame (20/09/2015) Using GenBank: Finding the Human dUTPase

62 Cédric Notredame (20/09/2015) Using GenBank: Reconstructing your gene

63 Cédric Notredame (20/09/2015) Some Good News…
- This information is complicated because it is RAW information
- It is necessary to keep UNINTERPRETED experimental information available
- There are SIMPLER alternatives to using this RAW information:
  - Gene-centric databases
  - Protein databases

64 Cédric Notredame (20/09/2015) RefSeq/LocusLink: What Is There To know about This Gene?

65 Cédric Notredame (20/09/2015) Using LocusLink

66 Cédric Notredame (20/09/2015) Using LocusLink: Asking a question What can I find about the DUT gene?

67 Cédric Notredame (20/09/2015) Enter Gene name Select LocusLink

68 Cédric Notredame (20/09/2015) Using LocusLink: Asking a question about a Gene

69 Cédric Notredame (20/09/2015) Using LocusLink: Asking a question about a Gene

70 Cédric Notredame (20/09/2015) OMIM: Is There a Disease Associated with This Gene?

71 Cédric Notredame (20/09/2015) OMIM: Finding Out About The Phenotype of a Gene

72 Cédric Notredame (20/09/2015) OMIM: Finding Out About The Phenotype of a Gene OMIM™: Online Mendelian Inheritance in Man A catalog of human genes and genetic disorders Contains a summary of literature, pictures, and reference information. It also contains numerous links to articles and sequence information.

73 Cédric Notredame (20/09/2015) OMIM: Finding Out About The Phenotype of a Gene

74 Cédric Notredame (20/09/2015) NCBI-GENOME: What is the Context of my Gene In Its Genome?

75 Cédric Notredame (20/09/2015) NCBI-GENOME

76 Cédric Notredame (20/09/2015) NCBI-GENOME: The Virus Section

77 Cédric Notredame (20/09/2015) NCBI-GENOME: The Virus Section

78 Cédric Notredame (20/09/2015) NCBI-GENOME: The Bacteria Section

79 Cédric Notredame (20/09/2015) NCBI-GENOME: The Bacteria Section

80 Cédric Notredame (20/09/2015) ENSEMBL: Where is my Gene in the Human Genome (who are its neighbors) ?

81 Cédric Notredame (20/09/2015) Using ENSEMBL

82 Cédric Notredame (20/09/2015) My Gene: A Summary

83 Cédric Notredame (20/09/2015) Gathering Everything you need on a gene
- GenBank: What is the sequence?
- LocusLink: What about this gene?
- ENSEMBL: What is the context?
- MEDLINE: Are there papers?
- OMIM: Are there illnesses?

84 Cédric Notredame (20/09/2015) SwissProt: What Do We Know About My Protein ?

85 Cédric Notredame (20/09/2015) The Protein Databases
Figure labels:
- GenBank: a big bag of DNA
- PREDICTION + EXPERIMENT
- Generic non-redundant protein databases: NR, trEMBL
- Specialized protein databases: SwissProt, PIR

86 Cédric Notredame (20/09/2015) What Is SwissProt ?

87 Cédric Notredame (20/09/2015) What Is SwissProt?
- Fully annotated (manually), non-redundant, cross-referenced, documented protein sequence database
- ~100,000 sequences from more than 6,800 different species; 70,000 references (publications); 550,000 cross-references (databases); ~200 Mb of annotations
- Collaboration between the SIB (CH) and EMBL/EBI (UK)

88 Cédric Notredame (20/09/2015) Using SwissProt: Asking a question We hear the word EPO quite often these days, but what exactly is known about it ?

89 Cédric Notredame (20/09/2015) Using SwissProt: Asking a question A Simple SwissProt Text Query EPO HUMAN

90 Cédric Notredame (20/09/2015) Using SwissProt: Reading an Entry

91 Cédric Notredame (20/09/2015) Using SwissProt: Reading an Entry

92 Cédric Notredame (20/09/2015) Using SwissProt: Reading an Entry

93 Cédric Notredame (20/09/2015) Using SwissProt: Reading an Entry

94 Cédric Notredame (20/09/2015) Using SwissProt: Reading an Entry Structure Information

95 Cédric Notredame (20/09/2015) Using SwissProt: Reading an Entry
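A SwissProt entry is itself a flat file: each line starts with a two-letter code (ID for the entry name, DE for the description, OS for the organism, and so on) and the entry ends with "//". The sketch below reads such lines into a dictionary. The record shown is a heavily abbreviated, illustrative stand-in for the EPO entry and not a copy of the real one, and the function name `parse_swissprot` is hypothetical.

```python
from typing import Dict, List

# Abbreviated, illustrative record; a real entry has many more lines and fields.
RECORD = """\
ID   EPO_HUMAN
DE   Erythropoietin precursor.
OS   Homo sapiens (Human).
//
"""


def parse_swissprot(text: str) -> Dict[str, List[str]]:
    """Collect the values of each two-letter line code of a single entry."""
    fields: Dict[str, List[str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):  # end of entry
            break
        # The code sits in columns 1-2; the data starts at column 6.
        code, value = line[:2], line[5:].strip()
        fields.setdefault(code, []).append(value)
    return fields


fields = parse_swissprot(RECORD)
print(fields["ID"][0].split()[0], "|", fields["OS"][0])
```

Values are stored as lists because several line codes can appear more than once in one entry.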

96 Cédric Notredame (20/09/2015) The Protein Databases
Figure labels:
- GenBank: a big bag of DNA
- PREDICTION + EXPERIMENT
- Specialized protein databases: SwissProt, PIR
- UniProt
- Generic non-redundant protein databases: NR, trEMBL

97 Cédric Notredame (20/09/2015)

98

99

100 SwissProt How Good is Good ?

101 Cédric Notredame (20/09/2015)

102

103 PDB: What is the Structure of my Protein ?

104 Cédric Notredame (20/09/2015) PDB: The Protein Data Bank

105 Cédric Notredame (20/09/2015) PDB: The Protein Data Bank
- Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA)
- Contains macromolecular structure data on proteins, nucleic acids, protein-nucleic acid complexes, and viruses
- Currently there are ~16,000 structures for about 4,000 different molecules, but far fewer protein families (highly redundant)!

106 Cédric Notredame (20/09/2015) Using PDB: Asking a question Does tolB have a known Structure? And If the answer is Yes, How can I look at it ?

107 Cédric Notredame (20/09/2015) Using PDB: Asking a question Query: TolB

108 Cédric Notredame (20/09/2015) Using PDB: Viewing a Structure View Structure

109 Cédric Notredame (20/09/2015) Using PDB: Viewing a Structure

110 Cédric Notredame (20/09/2015) Using PDB: Viewing a Structure

111 Cédric Notredame (20/09/2015) Using PDB: Viewing a Structure

112 Cédric Notredame (20/09/2015) Using PDB: Downloading Data Coordinates
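Once the coordinates are downloaded, they can be read directly, because the PDB format stores each atom on one fixed-width ATOM (or HETATM) line. The sketch below extracts the atom name, residue, chain and x/y/z coordinates by column position. The sample line is a generic illustration and is not taken from the TolB structure, and the names `Atom` and `parse_atoms` are hypothetical.

```python
from typing import List, NamedTuple


class Atom(NamedTuple):
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atoms(pdb_text: str) -> List[Atom]:
    """Read ATOM/HETATM records using the fixed column layout of PDB files."""
    atoms: List[Atom] = []
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")):
            continue
        atoms.append(Atom(
            name=line[12:16].strip(),          # columns 13-16
            residue=line[17:20].strip(),       # columns 18-20
            chain=line[21],                    # column 22
            residue_number=int(line[22:26]),   # columns 23-26
            x=float(line[30:38]),              # columns 31-38
            y=float(line[38:46]),              # columns 39-46
            z=float(line[46:54]),              # columns 47-54
        ))
    return atoms


# Generic sample line for illustration only (not from a specific structure).
SAMPLE = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N"

atoms = parse_atoms(SAMPLE)
print(atoms[0])
```

Slicing by column rather than splitting on spaces matters here, because neighbouring fields in a PDB line can touch each other without any separating space.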

113 Cédric Notredame (20/09/2015) InterPro: Are There Domains In my Protein ?

114 Cédric Notredame (20/09/2015) InterPro: The Idea of Domains

115 Cédric Notredame (20/09/2015) InterPro: The Idea of Domains

116 Cédric Notredame (20/09/2015) InterPro: A Federation of Databases

117 Cédric Notredame (20/09/2015) Using InterPro: Asking a question Which Domains does the oncogene FosB contain?

118 Cédric Notredame (20/09/2015) Using InterPro: Asking a question

119 Cédric Notredame (20/09/2015) Using InterPro: Asking a question

120 Cédric Notredame (20/09/2015) Using CDsearch: Asking a question

121 Cédric Notredame (20/09/2015) Using CDsearch: Asking a question

122 Cédric Notredame (20/09/2015) Using Domains: Some Statistics
10 most common protein domains for H. sapiens:
- Immunoglobulin and major histocompatibility complex domain
- Zinc finger, C2H2 type
- Eukaryotic protein kinase
- Rhodopsin-like GPCR superfamily
- Pleckstrin homology (PH) domain
- RING finger
- Src homology 3 (SH3) domain
- RNA-binding region RNP-1 (RNA recognition motif)
- EF-hand family
- Homeobox domain

123 Cédric Notredame (20/09/2015) My Protein: A Summary

124 Cédric Notredame (20/09/2015) Gathering Everything you need on a Protein
- trEMBL: What is the sequence?
- MEDLINE: Are there papers?
- PDB: Which structure?
- INTERPRO: Which domains?
- SwissProt: What about the function?

125 Cédric Notredame (20/09/2015) SRS: Can I search Many Databases Simultaneously ?

126 Cédric Notredame (20/09/2015) Using SRS

127 Cédric Notredame (20/09/2015) Using SRS

128 Cédric Notredame (20/09/2015) A Few Databases in Bulk

129 Cédric Notredame (20/09/2015)

130

131

132

133 A Few Addresses

134 Cédric Notredame (20/09/2015) A few Databases

