1 Introduction to Bioinformatics. 2 What is Bioinformatics?

Slides:



Advertisements
Similar presentations
© Wiley Publishing All Rights Reserved. Using Nucleotide Sequence Databases.
Advertisements

PubMed and its search options Jan Emmerich, Sonja Jacobi, Kerstin Müller (5th Semester Library Management)
NCBI/WHO PubMed/Hinari Course NCBI Literature Databases: PubMed Background.
NCBI data, sliding window programs and dot plots Sept. 25, 2012 Learning objectives-Become familiar with OMIM and PubMed. Understand the difference between.
Creating NCBI The late Senator Claude Pepper recognized the importance of computerized information processing methods for the conduct of biomedical research.
© Wiley Publishing All Rights Reserved. How Most People Use Bioinformatics.
COT 6930 HPC and Bioinformatics Bioinformatics Resources and Databases Xingquan Zhu Dept. of Computer Science and Engineering.
1.
On line (DNA and amino acid) Sequence Information Lecture 7.
Bioinformatics What is bioinformatics? Why bioinformatics? The major molecular biology facts Brief history of bioinformatics Typical problems of bioinformatics:
How to use the web for bioinformatics Molecular Technologies Ethan Strauss X 1171
Archives and Information Retrieval
Biological Databases Pharmamatrix Workshop Philip Winter -Ishwar V. Hosamani.
. Class 1: Introduction. The Tree of Life Source: Alberts et al.
Bioinformatics: a Multidisciplinary Challenge Ron Y. Pinter Dept. of Computer Science Technion March 12, 2003.
Lecture 2.21 Retrieving Information: Using Entrez.
ECE 501 Introduction to BME
Introduction to Bioinformatics
The Cell, Central Dogma and Human Genome Project.
Bioinformatics Lecture 2. Bioinformatics: is the computational branch of molecular biology Using the computer software to analyze biological data The.
IST Computational Biology1 Information Retrieval Biological Databases 2 Pedro Fernandes Instituto Gulbenkian de Ciência, Oeiras PT.
CHAPTER 15 Microbial Genomics Genomic Cloning Techniques Vectors for Genomic Cloning and Sequencing MS2, RNA virus nt sequenced in 1976 X17, ssDNA.
Introduction to Bioinformatics / Lecturer: Prof. Yael Mandel-Gutfreund Teaching Assistance: Shai Ben-Elazar Idit kosti Course web site :
Course Module: Introduction to Bioinformatics – CS 2001 July CS Databases.
On line (DNA and amino acid) Sequence Information
Bioinformatics.
Bioinformatics and it’s methods Prepared by: Petro Rogutskyi
Databases in Bioinformatics and Systems Biology Carsten O. Daub Omics Science Center RIKEN, Japan May 2008.
CSE 6406: Bioinformatics Algorithms. Course Outline
Chapter 14 Genomes and Genomics. Sequencing DNA dideoxy (Sanger) method ddGTP ddATP ddTTP ddCTP 5’TAATGTACG TAATGTAC TAATGTA TAATGT TAATG TAAT TAA TA.
X-ray crystallography NMR cryoEM Experimental approaches for structural biology.
Cédric Notredame (20/09/2015) Finding What you Need in Biological Databases Cédric Notredame.
PubMed and other Online Tools Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries/ U.F. Genetics Institute GMS 6014 January.
Introduction to Bioinformatics CPSC 265. Interface of biology and computer science Analysis of proteins, genes and genomes using computer algorithms and.
Searching PubMed® NCBI, NLM Resources, Micromedex -GSBS TTUHSC Preston Smith Library presents Rev. 08/17/14.
Biological Databases By : Lim Yun Ping E mail :
Doug Raiford Lesson 3.  More and more sequence data is being generated every day  Useless if not made available to other researchers.
Molecular Biology Primer. Starting 19 th century… Cellular biology: Cell as a fundamental building block 1850s+: ``DNA’’ was discovered by Friedrich Miescher.
We have displayed the Browse publisher drop down menu. This You have full access to: list for an institution where all the material is included in the.
Sevas Educational Society All Rights Reserved, 2008 Module 1 Introduction to Bioinformatics.
Organizing information in the post-genomic era The rise of bioinformatics.
Introduction to Bioinformatics Lecturer: Prof. Yael Mandel-Gutfreund Teaching Assistance: Rachelly Normand Edward Vitkin Course web site :
Biological Databases Biology outside the lab. Why do we need Bioinfomatics? Over the past few decades, major advances in the field of molecular biology,
Overview of Bioinformatics 1 Module Denis Manley..
Introduction to Bioinformatics Lecturer: Dr. Yael Mandel-Gutfreund Teaching Assistance: Martin Akerman Sivan Bercovici Course web site :
Bioinformatics and Computational Biology
1 From Mendel to Genomics Historically –Identify or create mutations, follow inheritance –Determine linkage, create maps Now: Genomics –Not just a gene,
Primary vs. Secondary Databases Primary databases are repositories of “raw” data. These are also referred to as archival databases. -This is one of the.
Copyright OpenHelix. No use or reproduction without express written consent1.
The iPlant Collaborative Vision Enable life science researchers and educators to use and extend cyberinfrastructure.
PubMed …featuring more than 20 million citations for biomedical literature from MEDLINE, life science journals, and online books.
Introduction to Bioinformatics Dr. Yael Mandel-Gutfreund TA: Oleg Rokhlenko.
Introduction to Bioinformatics Lecturer: Prof. Yael Mandel-Gutfreund Teaching Assistance: Rachelly Normand Olga Karinski Course web site :
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
Introduction to molecular biology Data Mining Techniques.
 What is MSA (Multiple Sequence Alignment)? What is it good for? How do I use it?  Software and algorithms The programs How they work? Which to use?
Entrez, dbSNP, GEO, OMIM & LinkOut JanPlan Entrez Distributed by NCBI in 1991 on CD-ROM Included linked nodes: GenBank & PDB Translated GenBank,
Computer Applications for Biological
Archives and Information Retrieval
Genomes and Their Evolution
What is Bioinformatics?
Introduction to Bioinformatics /234525
Mangaldai College, Mangaldai
Genomes and Their Evolution
Introduction to Bioinformatics
PubMed.
Accessing journals by Language 4
From Mendel to Genomics
SUBMITTED BY: DEEPTI SHARMA BIOLOGICAL DATABASE AND SEQUENCE ANALYSIS.
Presentation transcript:

1 Introduction to Bioinformatics

2 What is Bioinformatics?

3 NIH – definitions What is Bioinformatics? - Research, development, and application of computational tools and approaches for expanding the use of biological, medical, behavioral, and health data, including the means to acquire, store, organize, archive, analyze, or visualize such data. What is Computational Biology? - The development and application of analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social data. on molecular

4 NSF – introduction Large databases that can be accessed and analyzed with sophisticated tools have become central to biological research and education. The information content in the genomes of organisms, in the molecular dynamics of proteins, and in population dynamics, to name but a few areas, is enormous. Biologists are increasingly finding that the management of complex data sets is becoming a bottleneck for scientific advances. Therefore, bioinformatics is rapidly become a key technology in all fields of biology.

5 NSF – mission statement The present bottlenecks in bioinformatics include the education of biologists in the use of advanced computing tools, the recruitment of computer scientists into this evolving field, the limited availability of developed databases of biological information, and the need for more efficient and intelligent search engines for complex databases.

6 NSF – mission statement the education of biologists in the use of advanced computing tools The present bottlenecks in bioinformatics include the education of biologists in the use of advanced computing tools, the recruitment of computer scientists into this evolving field, the limited availability of developed databases of biological information, and the need for more efficient and intelligent search engines for complex databases.

7 Molecular Bioinformatics Molecular Bioinformatics involves the use of computational tools to discover new information in complex data sets (from the one-dimensional information of DNA through the two-dimensional information of RNA and the three-dimensional information of proteins, to the four-dimensional information of evolving living systems).

Bioinformatics (Oxford English Dictionary): The branch of science concerned with information and information flow in biological systems, esp. the use of computational methods in genetics and genomics. 8

Biologists collect molecular data: DNA & Protein sequences, gene expression, etc. Computer scientists (+Mathematicians, Statisticians, etc.) Develop tools, softwares, algorithms to store and analyze the data. Bioinformaticians Study biological questions by analyzing molecular data The field of science in which biology, computer science and information technology merge into a single discipline 9

Some biological background…. A biologist 10

The hereditary information of all living organisms, with the exception of some viruses, is carried by deoxyribonucleic acid (DNA) molecules. 2 purines: 2 pyrimidines: adenine (A) cytosine (C) guanine (G) thymine (T) two rings one ring 11

Eukaryotes may have up to 3 subcellular genomes: 1. Nuclear 2. Mitochondrial 3. Plastid Bacteria have either circular or linear genomes and may also carry plasmids The entire complement of genetic material carried by an individual is called the genome. Human chromosomes Circular genome 12

Central dogma: DNA makes RNA makes Protein Modified dogma: DNA makes DNA and RNA, RNA makes DNA, RNA an Protein 13

Amino acids - The protein building blocks 14

15

Any region of the DNA sequence can, in principle, code for six different amino acid sequences, because any one of three different reading frames can be used to interpret each of the two strands. 16

Protein folding A human Hemoglobin: 17

How does it all looks like on a computer monitor? 18

A cDNA sequence >gi| |ref|NM_ | Homo sapiens hemoglobin, alpha 1 (HBA1), mRNA ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAA CGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAG AGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCT GCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCAACGCCGTGGCGCACGTGG ACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCG GTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTC ACCCCTGCGGTGCACGCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAA TACCGTTAAGCTGGAGCCTCGGTGGCCATGCTTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTC CCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGGCGGC 19

A cDNA sequence (reading frame) A protein sequence >gi| |ref|NM_ | Homo sapiens hemoglobin, alpha 1 (HBA1), mRNA ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACC ATG GTGCTGTCTCCTGCCGACAAGACCA ACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGA GAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCTC TGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCAACGCCGTGGCGCACGTG GACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCC GGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTT CACCCCTGCGGTGCACGCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAA ATACCGT TAA GCTGGAGCCTCGGTGGCCATGCTTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCC TCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGGCGGC >gi| |ref|NP_ | alpha 1 globin [Homo sapiens] MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVA DALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLA SVSTVLTSKYR 20

ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTC AAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTC CTGTCCTTCCCCACCACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGG GCCACGGCAAGAAGGTGGCCGACGCGCTGACCAACGCCGTGGCGCACGTGGACGACATGCCCAACGCG CTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTCAACTTCAAGCTCCTAAGCC ACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCTCCCTGG ACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATG CTTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAAT AAAGTCTGAGTGGGCGGCACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCT GCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAG GCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACG GCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCAACGCCGTGGCGCACGTG GACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTC AACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTG CGGTGCACGCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCT GGAGCCTCGGTGGCCATGCTTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGT ACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGGCGGCACTCTTCTGGTCCCCACAGACTCAGAGAGAACC CACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGC TGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCCG CACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACC AACGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAG CTTCGGGTGGACCCGGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCC CCGCCGAGTTCACCCCTGCGGTGCACGCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGAC CTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGCTTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCC TCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGGCGGCGCCGTGGCGCACGTG GACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTC AACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTG CGGTGCACGCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCT GGAGCCTCGGTGGCCATGCTTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGT ACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGGCGGCACTCTTCTGGTCCCCACAGACTCAGAGAGAACC CACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGGGGTAAGGTCGGCGCGCACGC TGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGACCTACTTCCCG CACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCG... And, a whole genome… 21

E. coli 4.6 x 10 6 nucleotides –Approx. 4,000 genes Yeast15 x 10 6 nucleotides –Approx. 6,000 genes Human 3 x 10 9 nucleotides –Approx. 30,000 genes Smallest human chromosome 50 x 10 6 nucleotides How big are whole genomes? 22

What do we actually do with bioinformatics? 23

Sequence assembly (next generation sequencing) 24

Genome annotation 25

Molecular evolution Smith et al. (2009) Nature 459, Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic 26

Analysis of gene expression Gene expression profile of relapsing versus non-relapsing Wilms tumors. A set of 39 genes discriminates between the two classes of tumors. ( Prof. Gessler) 27

Analysis of regulation Toledo and Bardot (2009) Nature 460,

Protein structure prediction Protein docking 29

Luscombe, Greenbaum, Gerstein (2001) 30

31 From DNA to Genome Watson and Crick DNA model Sanger sequences insulin protein Sanger dideoxy DNA sequencing PCR (Polymerase Chain Reaction) ARPANET (early Internet) PDB (Protein Data Bank) Sequence alignment GenBank database Dayhoff’s Atlas

SWISS-PROT database NCBI World Wide Web BLAST FASTA EBI Human Genome Initiative First human genome draft First bacterial genome Yeast genome

33 The first protein sequence reported was that of bovine insulin in 1956, consisting of 51 residues. Origin of bioinformatics and biological databases: Nearly a decade later, the first nucleic acid sequence was reported, that of yeast tRNA alanine with 77 bases.

34 In 1965, Dayhoff gathered all the available sequence data to create the first bioinformatic database (Atlas of Protein Sequence and Structure). The Protein DataBank followed in 1972 with a collection of ten X-ray crystallographic protein structures. The SWISSPROT protein sequence database began in 1987.

35 Nucleotides

36 as of August 2011: Eukaryotes 37 Prokaryotes 1708 Total 1745 Complete Genomes

37 What can we do with sequences and other type of molecular information?

38 Annotation Open reading frames Functional sites Structure, function

39 CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA TAT GGA CAA TTG GTT TCT TCT CTG AAT TGAAAAACGTA

40 CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA TAT GGA CAA TTG GTT TCT TCT CTG AAT TGA AAAACGTA TF binding site promoter Ribosome binding Site ORF = Open Reading Frame CDS = Coding Sequence Transcription Start Site

41 Comparative genomics Comparing ORFs Identifying orthologs Inferences on structure and function Comparing functional sites Inferences on regulatory networks

42 Similarity profiles Researchers can learned a great deal about the structure and function of human genes by examining their counterparts in model organisms.

43 Xenopus MALWMQCLP-LVLVLLFSTPNTEALANQHL Bos MALWTRLRPLLALLALWPPPPARAFVNQHL **** : * *.*: *:..* :. *:**** Xenopus CGSHLVEALYLVCGDRGFFYYPKIKRDIEQ Bos CGSHLVEALYLVCGERGFFYTPKARREVEG ***************:***** ** :*::* Xenopus AQVNGPQDNELDG-MQFQPQEYQKMKRGIV Bos PQVG---ALELAGGPGAGGLEGPPQKRGIV.**. ** * * ***** Xenopus EQCCHSTCSLFQLENYCN Bos EQCCASVCSLYQLENYCN **** *.***:******* Alignment preproinsulin

44

45 Ultraconserved Elements in the Human Genome Gill Bejerano, Michael Pheasant, Igor Makunin, Stuart Stephen, W. James Kent, John S. Mattick, & David Haussler (Science : ) There are 481 segments longer than 200 base pairs (bp) that are absolutely conserved (100% identity with no insertions or deletions) between orthologous regions of the human, rat, and mouse genomes. Nearly all of these segments are also conserved in the chicken and dog genomes, with an average of 95 and 99% identity, respectively. Many are also significantly conserved in fish. These ultraconserved elements of the human genome are most often located either overlapping exons in genes involved in RNA processing or in introns or nearby genes involved in the regulation of transcription and development. There are 156 intergenic, untranscribed, ultraconserved segments

46 Junk is real! Junk: Supporting evidence

47 Functional genomics Genome-wide profiling of: mRNA levels Protein levels Co-expression of genes and/or proteins Identifying protein-protein interactions Networks of interactions

48 Understanding the function of genes and other parts of the genome

49 Structural genomics Assign structure to all proteins encoded in a genome

Biological databases 50

Database or databank? Initially Databank (in UK) Database (in the USA) Solution The abbreviation db 51

What is a Database? A structured collection of data held in computer storage; esp. one that incorporates software to make it accessible in a variety of ways; transf., any large collection of information. database management: the organization and manipulation of data in a database. database management system (DBMS): a software package that provides all the functions required for database management. database system: a database together with a database management system. Oxford Dictionary 52

What is a database? A collection of data –structured –searchable (index) -> table of contents –updated periodically (release) -> new edition –cross-referenced (hyperlinks) -> links with other db Includes also associated tools (software) necessary for access, updating, information insertion, information deletion…. Data storage management: flat files, relational databases… 53

Database: a « flat file » example Accession number: 1 First Name: Amos Last Name: Bairoch Course: Pottery 2000; Pottery 2001; // Accession number: 2 First Name: Dan Last name: Graur Course: Pottery 2000, Pottery 2001; Ballet 2001, Ballet 2002 // Accession number 3: First Name: John Last name: Travolta Course: Ballet 2001; Ballet 2002; // Easy to manage: all the entries are visible at the same time ! Flat-file database (« flat file, 3 entries »): 54

Database: a « relational » example TeacherAccession number Education Amos1Biochemistry Dan2Genetics John3Scientology CourseYearInvolved teachers Advanced Pottery 2000; 20011; 2 Ballet for Fat People 2001; 20022; 3 Relational database (« table file »): 55

Why biological databases? Exponential growth in biological data. Data (genomic sequences, 3D structures, 2D gel analysis, MS analysis, Microarrays….) are no longer published in a conventional manner, but directly submitted to databases. Essential tools for biological research. The only way to publish massive amounts of data without using all the paper in the world. 56

Distribution of sequences Books, articles1968 -> 1985 Computer tapes1982 -> 1992 Floppy disks1984 -> 1990 CD-ROM1989 -> FTP > On-line services > 1994 WWW1993 -> DVD > 57

Some statistics More than 1000 different ‘biological’ databases Variable size: 20Gb –DNA: > 20 Gb – Protein: 1 Gb –3D structure: 5 Gb –Other: smaller Update frequency: daily to annually to seldom to forget about it. Usually accessible through the web (some free, some not) 58

Some databases in the field of molecular biology… AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, Biolmage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD4OLbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Picty_cDB, DIP, DOGS, DOMO, DPD, DPlnteract, ECDC, ECGC, EC02DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, Genbank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HlVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP5 Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, 0-lycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnaiRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene,Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS- MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc !!!! 59

Categories of databases for Life Sciences Sequences (DNA, protein) Genomics Mutation/polymorphism Protein domain/family Proteomics (2D gel, Mass Spectrometry) 3D structure Metabolic networks Regulatory networks Bibliography Expression (Microarrays,…) Specialized 60

61 NCBI: EBI: DDBJ:

62 BookshelfBookshelf: A collection of searchable biomedical books linked to PubMed. PubMedPubMed: Allows searching by author names, journal titles, and a new Preview/Index option. PubMed database provides access to over 12 million MEDLINE citations back to the mid-1960's. It includes History and Clipboard options which may enhance your search session. PubMed CentralPubMed Central: The U.S. National Library of Medicine digital archive of life science journal literature. OMIM:OMIM: Online Mendelian Inheritance in Man is a database of human genes and genetic disorders (also OMIA). Literature Databases:

PubMed (Medline) MEDLINE covers the fields of medicine, nursing, dentistry, veterinary medicine, public health, and preclinical sciences Contains citations from approximately 5,200 worldwide journals in 37 languages; 60 languages for older journals. Contains over 20 million citations since 1948 Contains links to biological db and to some journals New records are added to PreMEDLINE daily!

65 Alerting services – –

A search by subject:“mitochondrion evolution”

68 Type in a Query term Enter your search words in the query box and hit the “Go” button

69 The Syntax … 1.Boolean operators: AND, OR, NOT must be entered in UPPERCASE (e.g., promoters OR response elements). The default is AND. 2. Entrez processes all Boolean operators in a left-to-right sequence. The order in which Entrez processes a search statement can be changed by enclosing individual concepts in parentheses. The terms inside the parentheses are processed first. For example, the search statement: g1p3 OR (response AND element AND promoter). 3. Quotation marks: The term inside the quotation marks is read as one phrase (e.g. “public health” is different than public health, which will also include articles on public latrines and their effect on health workers). 4. Asterisk: Extends the search to all terms that start with the letters before the asterisk. For example, dia* will include such terms as diaphragm, dial, and diameter.

70 Refine the Query Often a search finds too many (or too few) sequences, so you can go back and try again with more (or fewer) keywords in your query The “History” feature allows you to combine any of your past queries. The “Limits” feature allows you to limit a query to specific organisms, sequences submitted during a specific period of time, etc. [Many other features are designed to search for literature in MEDLINE]

71 You can search for a text term in sequence annotations or in MEDLINE abstracts, and find all articles, DNA, and protein sequences that mention that term. Then from any article or sequence, you can move to "related articles" or "related sequences". Relationships between sequences are computed with BLASTBLAST Relationships between articles are computed with "MESH" terms (shared keywords) Relationships between DNA and protein sequences rely on accession numbers Relationships between sequences and MEDLINE articles rely on both shared keywords and the mention of accession numbers in the articles. Related Items

72

73

74

A search by authors:“Esser” [au] AND “martin” [au]

A search by title word: “Wolbachia pipientis” [ti]

77 Database Search Strategies General search principles - not limited to sequence (or to biology). Start with broad keywords and narrow the search using more specific terms. Try variants of spelling, numbers, etc. Search many databases. Be persistent!!

78 Searching PubMed Structureless searches –Automatic term mapping Structured searches –Tags, e.g. [au], [ta], [dp], [ti] –Boolean operators, e.g. AND, OR, NOT, () Additional features –Subsets, limits –Clipboard, history

79 Start working: Search PubMed 1.cuban cigars 2.cuban OR cigars 3.“cuban cigars” 4.cuba* cigar* 5.(cuba* cigar*) NOT smok* 6.Fidel Castro 7.“fidel castro” 8.#6 NOT #7

80 “Details” and “History” in PubMed

81 “Details” and “History” in PubMed

82 The OMIM (Online Mendelian Inheritance in Man) –Genes and genetic disorders –Edited by team at Johns Hopkins –Updated daily

83 MIM Number Prefixes * gene with known sequence + gene with known sequence and phenotype # phenotype description, molecular basis known % mendelian phenotype or locus, molecular basis unknown no prefix other, mainly phenotypes with suspected mendelian basis

84 Searching OMIM Search Fields –Name of trait, e.g., hypertension –Cytogenetic location, e.g., 1p31.6 –Inheritance, e.g., autosomal dominant –Gene, e.g., coagulation factor VIII

85 OMIM search tags All Fields [ALL] Allelic Variant [AV] or [VAR] Chromosome [CH] or [CHR] Clinical Synopsis [CS] or [CLIN] Gene Map [GM] or [MAP] Gene Name [GN] or [GENE] Reference [RE] or [REF]

86

87 Start working: Search OMIM How many types of hemophilia are there? For how many is the affected gene known? What are the genes involved in hemophilia A? What are the mutations in hemophilia A?

88 Online Literature databases 1. How to use the UH online Library? 2. Online glossaries 3. Google Scholar 4. Google Books 5. Web of Science

89 How to use the online UH Library?

90

91 Online Glossaries Bioinformatics : Genomics: Molecular Evolution: Biology dictionary: Other glossaries, e.g., the list of phobias:

92 4. Google Scholar

93 Enables you to search specifically for scholarly literature, including peer-reviewed papers, theses, books, preprints, abstracts and technical reports from all broad areas of research. What is Google Scholar?

94 Use Google Scholar to find articles from a wide variety of academic publishers, professional societies, preprint repositories and universities, as well as scholarly articles available across the web.

95 Google Scholar orders your search results by how relevant they are to your query, so the most useful references should appear at the top of the page This relevance ranking takes into account the: full text of each article. the article's author, the publication in which the article appeared and how often it has been cited in scholarly literature.

96 What other DATA can we retrieve from the record?

97

98

99 5. Google Book Search

100

101 Start working: Search Google Books How many times is the tail of the giraffe mentioned in On the Origin of Species by Mr. Darwin?

Web of science =WOS&search_mode=GeneralSearch&SID=4FB7LbbLgDMhG9fDiLh&preferencesSaved=

103

104