Organizing information in the post-genomic era: The rise of bioinformatics

1 Organizing information in the post-genomic era: The rise of bioinformatics

2 An information explosion! Bioinformatics: computational tools are developed to collect, organize, and analyze a wide variety of biological data. Advances in DNA sequencing technologies have accelerated the pace of discovery. Much of the process is now automated.

3 What is a database? Which databases are important for molecular cell biology research? How is information processed in databases?

4 Literature Nucleotide Protein Organism Structure Function Biological databases use different organizing principles Hyperlinks connect records in different databases

5 Databases are organized collections of information. Information is stored in records.
Record layout: Accession # | Field 1 | Field 2 | Data
Databases assign each record a unique accession number using their own numbering system. Fields are used to cross-reference the data, and records can be searched by fields. Data is entered in the record using a defined format. Bioinformaticians work with computer scientists to set up the database structure. Curators review and link records within and between databases.
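The record structure described on this slide can be sketched in a few lines of Python. This is only an illustration: the `Record` and `Database` classes, the `DEMO` accession prefix, the field names, and the toy sequence are all hypothetical and do not correspond to any real database schema.

```python
from dataclasses import dataclass


@dataclass
class Record:
    accession: str  # unique within one database
    fields: dict    # field name -> value, used for searching and cross-referencing
    data: str       # the payload, entered in a defined format


class Database:
    """Toy database that assigns accession numbers with its own numbering system."""

    def __init__(self, prefix):
        self.prefix = prefix
        self._records = {}
        self._next_number = 1

    def add(self, fields, data):
        # Each new record gets the next accession number in this database's scheme.
        accession = f"{self.prefix}{self._next_number:06d}"
        self._next_number += 1
        self._records[accession] = Record(accession, dict(fields), data)
        return accession

    def get(self, accession):
        return self._records[accession]

    def search(self, field_name, value):
        # Records can be searched by any field.
        return [
            record.accession
            for record in self._records.values()
            if record.fields.get(field_name) == value
        ]


db = Database("DEMO")
first = db.add({"organism": "S. cerevisiae", "gene": "MET1"}, "ATGAAATAA")
second = db.add({"organism": "S. cerevisiae", "gene": "MET2"}, "ATGCCCTAA")
print(db.search("organism", "S. cerevisiae"))
```

Two databases built this way would hand out unrelated accession numbers for the same gene, which is why cross-referencing fields and curator-made links matter.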

6 The information in databases ultimately derives from experimental data. Researchers do experiments, analyze the data, and write papers. Data is published in journals, and the papers go to PubMed. Researchers also submit data to databases: nucleotide sequences; mutants and phenotype info; structural coordinates. Curators process the submissions and link entries in different databases.

7 What is a database? Which databases are important for molecular cell biology research? How is information processed in databases?

8 The largest collection is housed at the National Center for Biotechnology Information (NCBI), part of the National Library of Medicine. [Image: NLM-NCBI complex in Bethesda, MD] A large staff of curators processes the information and compiles it into derivative databases. Biologists use hundreds of different databases from around the world, some with similar foci.

9 NCBI maintains both primary and derivative databases. We'll look at three of them. PubMed is the premier literature database in the world.

10 SGD is a derivative database serving the yeast research community. It grew out of decades of research. The genome project provided a systematic organization for genes.

11 Questions for today: What is a database? Which databases are important for molecular cell biology research? How is information processed in databases?

12 Most records in the Protein database have been derived by automated translation of nucleotide sequences.
GenBank (nucleotide sequences): Annotated nucleic acid sequences are submitted to GenBank from many sources, including genome projects, individual investigators, and other databases. There is considerable REDUNDANCY in the information.
Protein (amino acid sequences): Automated translations of nucleotide sequences, experimentally determined amino acid sequences, and information from other protein databases.
RefSeq (non-redundant nucleotide and protein sequences): Sequences are compiled to generate non-redundant reference sequences.
Curators are responsible for data flow between the NCBI databases.

13 On a larger scale: genome projects have produced the reference sequences in nucleotide databases (robots and computers do much of the work). 1. Pieces of chromosomal DNA are sequenced, each ~1000 bp long. The S. cerevisiae genome is ~12 Mbp. How many reads would be necessary to cover each base pair in the genome once?
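The question on this slide is a one-line division, worked here as a small Python sketch. It uses only the round figures given on the slide (~12 Mbp genome, ~1000 bp reads) and assumes idealized reads that tile the genome end to end with no overlap; the function name `min_reads` is just an illustrative choice.

```python
import math


def min_reads(genome_bp, read_bp):
    """Fewest reads that could cover every base pair once, with no overlap."""
    return math.ceil(genome_bp / read_bp)


genome_bp = 12_000_000  # ~12 Mbp, from the slide
read_bp = 1_000         # ~1000 bp per read, from the slide

print(min_reads(genome_bp, read_bp))  # 12000
```

This is a lower bound. Assembly depends on reads that overlap one another, so a real project needs more reads than this minimum.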

14 2. Overlapping sequence reads are aligned until the sequences of entire chromosomes are complete. Computer algorithms identify areas of sequence overlap. The process is repeated to align long stretches of sequence. Complete chromosome sequences are submitted to GenBank. GenBank: NC_####### (non-redundant chromosome) sequences.
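The idea of finding overlaps and repeating the merge can be illustrated with a toy greedy assembler. This is a minimal sketch and not the method used by any genome project: it assumes error-free reads from a single strand, exact suffix-to-prefix matches, and no repeated sequence, and the names `overlap_length`, `merge_reads`, and `min_overlap` are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps; the contigs cannot be joined
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]  # toy reads, not real data
print(merge_reads(reads))  # ['ATGGCGTACCTGA']
```

Each pass through the loop joins two sequences into a longer one, which mirrors the slide's "process is repeated to align long stretches of sequence".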

15 3. Chromosomal sequences are analyzed for the presence of potential transcripts (open reading frames; ORFs). ORFs are characterized by an under-representation of stop codons. ORF-finding computer algorithms look for sequences that begin with a methionine codon, where the methionine is separated from a stop codon in the same reading frame by a large number of amino acids (often 100, equiv. to 300 bp). GenBank NM_####### records are predicted ORFs. 4. Protein sequences are computationally predicted from ORF sequences. GenBank NP_###### records.
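Steps 3 and 4 can be sketched together: scan each reading frame for an ATG followed, in the same frame, by a distant stop codon, then translate the ORF with the standard genetic code. This is a simplified illustration, not the algorithm used by any particular annotation pipeline. It assumes uppercase A/C/G/T input and scans only the three frames of the given strand; a fuller version would also scan the reverse complement. The names `find_orfs`, `translate`, and `min_codons` are hypothetical.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def find_orfs(dna, min_codons=100):
    """Return (start, end) index pairs of ORFs in the three forward frames.

    An ORF runs from an ATG to the first in-frame stop codon (inclusive) and
    must encode at least `min_codons` amino acids before the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and CODON_TABLE[codon] == "*":
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return orfs


def translate(dna):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


toy = "CCATGGCTTGGTAACC"  # toy sequence, not real data
for start, end in find_orfs(toy, min_codons=3):
    print(start, end, translate(toy[start:end]))  # 2 14 MAW
```

With the default threshold of 100 codons (the slide's "often 100, equiv. to 300 bp"), the short toy sequence yields no ORFs, which is the point of the length filter: short stretches without a stop codon occur by chance.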

17 Genes were given systematic (locus) names by their positions on chromosomes. Systematic name for MET1: YKR069W.
Format: Y (A-P) (L or R) (ORF number) (W or C)
Y: yeast
A-P: chromosome (1 = A, 2 = B, etc.)
L or R: left or right arm of the chromosome
ORF number: counting away from the centromere (position = 0)
W or C: sense strand is the Watson or Crick strand (coding sequence is read 5' to 3')
[Diagram: left arm and right arm on either side of the centromere, with ORFs on the W and C strands]
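The naming scheme on this slide is regular enough to decode mechanically. The sketch below parses a systematic name into its parts. The regular expression covers only the form shown on the slide (Y, a chromosome letter A-P, L or R, a number, W or C), and the function name `parse_systematic_name` is hypothetical.

```python
import re

# Y + chromosome letter (A-P) + arm (L/R) + ORF number + strand (W/C)
SYSTEMATIC_NAME = re.compile(r"^Y([A-P])([LR])(\d+)([WC])$")


def parse_systematic_name(name):
    """Split a yeast systematic (locus) name into its components."""
    match = SYSTEMATIC_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognized systematic name: {name!r}")
    chromosome, arm, orf_number, strand = match.groups()
    return {
        "chromosome": ord(chromosome) - ord("A") + 1,  # A = 1, B = 2, ...
        "arm": "left" if arm == "L" else "right",
        "orf_number": int(orf_number),  # counted away from the centromere
        "strand": "Watson" if strand == "W" else "Crick",
    }


print(parse_systematic_name("YKR069W"))
```

For MET1 (YKR069W) this gives chromosome 11, right arm, ORF number 69, with the sense strand on the Watson strand.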

18 Literature Nucleotide Protein Organism Structure Function

