Structure Databases DNA/Protein structure-function analysis and prediction Lecture 6 Bioinformatics Section, Vrije Universiteit, Amsterdam.



The dictionary definition
Main Entry: da·ta·base
Pronunciation: 'dA-t&-"bAs, 'da- also 'dä-
Function: noun
Date: circa 1962
: a usually large collection of data organized especially for rapid search and retrieval (as by a computer)
- Webster dictionary

WHAT is a database?
A collection of data that needs to be:
  Structured
  Searchable
  Updated (periodically)
  Cross-referenced
Challenge: to change "meaningless" data into useful information that can be accessed and analysed in the best way possible.
For example: HOW would YOU organise all biological sequences so that the biological information is optimally accessible?
You need an appropriate database management system (DBMS).

DBMS
Internal organization
  Controls speed and flexibility
A set of programs that
  Store
  Extract
  Modify
[Diagram: USER(S) reach the Database through the DBMS programs: Store, Extract, Modify]

DBMS organisation types
Flat file databases (flat DBMS)
  Simple, restrictive, table
Hierarchical databases (hierarchical DBMS)
  Simple, restrictive, tables
Relational databases (RDBMS)
  Complex, versatile, tables
Object-oriented databases (ODBMS)
  Complex, versatile, objects

Relational databases
Data is stored in multiple related tables.
Data relationships across tables can be either many-to-one or many-to-many.
A few rules allow the database to be viewed in many ways.
Let's convert the "course details" to a relational database.

Our flat file database
Course details (flat database):

Name       Depart.    Course   E1  E2  E3  P1  P2  …
Student 1  Chemistry  Biology  A   B   B   A   C   …
Student 2  Ecology    Maths    A   D   A   A   A   …
Student 2  Ecology    Biology  A   B   A   A   A   …
Student 1  Chemistry  English  A   A   A   A   A   …
Student 1  Chemistry  Maths    C   C   B   A   A   …

Normalize (1NF) … We remove repeating records (rows)

sID  Name      dID
1    Student1  1
2    Student2  2

cID  Course
1    Biology
2    Maths
3    English

dID  Department
1    Chemistry
2    Ecology

sID  cID  E1  E2  E3  P1  P2  …
1    1    A   B   B   A   C   …
2    2    A   D   A   A   A   …
2    1    A   B   A   A   A   …
1    3    A   A   A   A   A   …
1    2    C   C   B   A   A   …

Primary keys: sID, cID and dID in their own tables.
Foreign keys: dID in the student table; sID and cID in the marks table.
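The step on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the lecture's own code: the function names `assign_id` and `normalise` are made up here, and keys are assumed to be handed out in first-seen order, which happens to reproduce the numbering on the slide.

```python
# Flat-file rows as on the previous slide: name, department, course, five marks.
FLAT_ROWS = [
    ("Student 1", "Chemistry", "Biology", "A", "B", "B", "A", "C"),
    ("Student 2", "Ecology", "Maths", "A", "D", "A", "A", "A"),
    ("Student 2", "Ecology", "Biology", "A", "B", "A", "A", "A"),
    ("Student 1", "Chemistry", "English", "A", "A", "A", "A", "A"),
    ("Student 1", "Chemistry", "Maths", "C", "C", "B", "A", "A"),
]


def assign_id(lookup, value):
    """Return the integer key for value, creating the next key if it is new."""
    if value not in lookup:
        lookup[value] = len(lookup) + 1
    return lookup[value]


def normalise(rows):
    """Split flat rows into lookup tables plus a marks table that holds keys."""
    students, departments, courses = {}, {}, {}
    student_dept = {}  # sID -> dID (foreign key into departments)
    marks = []         # (sID, cID, E1, E2, E3, P1, P2)
    for name, dept, course, *grades in rows:
        d_id = assign_id(departments, dept)
        c_id = assign_id(courses, course)
        s_id = assign_id(students, name)
        student_dept[s_id] = d_id
        marks.append((s_id, c_id, *grades))
    return students, student_dept, departments, courses, marks


students, student_dept, departments, courses, marks = normalise(FLAT_ROWS)
for row in marks:
    print(row)
```

Each name, department and course is now stored once; the marks table refers to them by key only.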

Normalize (2NF) … We remove redundant fields (columns)

sID  Name      dID
1    Student1  1
2    Student2  2

cID  Course
1    Biology
2    Maths
3    English

gID  Grade
1    A
2    B
3    C

dID  Department
1    Chemistry
2    Ecology

wID  Project
1    E1
2    E2
3    E3
4    P1
5    P2

sID  cID  gID  wID
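As a rough sketch of how these tables could be declared in an actual RDBMS, the example below uses Python's built-in `sqlite3` module. The table names (`student`, `course`, `assessment`, `grade`, `result`) and the helper functions are assumptions made for illustration; only the key columns and the sample values come from the slides. It loads Student1's Biology marks (A B B A C) and joins the keys back into readable text.

```python
import sqlite3

SCHEMA = """
CREATE TABLE department (dID INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE student    (sID INTEGER PRIMARY KEY, name TEXT NOT NULL,
                         dID INTEGER NOT NULL REFERENCES department(dID));
CREATE TABLE course     (cID INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE assessment (wID INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE grade      (gID INTEGER PRIMARY KEY, letter TEXT NOT NULL);
CREATE TABLE result (
    sID INTEGER NOT NULL REFERENCES student(sID),
    cID INTEGER NOT NULL REFERENCES course(cID),
    wID INTEGER NOT NULL REFERENCES assessment(wID),
    gID INTEGER NOT NULL REFERENCES grade(gID),
    PRIMARY KEY (sID, cID, wID)
);
"""


def build_database():
    """Create an in-memory database holding the lookup tables from the slide."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce foreign keys
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO department VALUES (?, ?)",
                     [(1, "Chemistry"), (2, "Ecology")])
    conn.executemany("INSERT INTO student VALUES (?, ?, ?)",
                     [(1, "Student1", 1), (2, "Student2", 2)])
    conn.executemany("INSERT INTO course VALUES (?, ?)",
                     [(1, "Biology"), (2, "Maths"), (3, "English")])
    conn.executemany("INSERT INTO assessment VALUES (?, ?)",
                     [(1, "E1"), (2, "E2"), (3, "E3"), (4, "P1"), (5, "P2")])
    conn.executemany("INSERT INTO grade VALUES (?, ?)",
                     [(1, "A"), (2, "B"), (3, "C")])
    # Student1, Biology: E1=A, E2=B, E3=B, P1=A, P2=C, stored purely as keys.
    conn.executemany("INSERT INTO result VALUES (?, ?, ?, ?)",
                     [(1, 1, 1, 1), (1, 1, 2, 2), (1, 1, 3, 2),
                      (1, 1, 4, 1), (1, 1, 5, 3)])
    return conn


def readable_results(conn, student_name, course_name):
    """Join the key-only result table back to human-readable names."""
    cursor = conn.execute(
        """
        SELECT assessment.name, grade.letter
        FROM result
        JOIN student    ON student.sID = result.sID
        JOIN course     ON course.cID = result.cID
        JOIN assessment ON assessment.wID = result.wID
        JOIN grade      ON grade.gID = result.gID
        WHERE student.name = ? AND course.name = ?
        ORDER BY assessment.wID
        """,
        (student_name, course_name),
    )
    return cursor.fetchall()


conn = build_database()
print(readable_results(conn, "Student1", "Biology"))
```

The `result` table contains nothing but integers; the join is what turns them back into something a person can read.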

Relational Databases
What have we achieved?
  No repeating information
  Less storage space
  Better representation of reality
  Easy modification/management
  Easy usage of any combination of records
Remember: the DBMS has programs to access and edit this information, so ignore the human reading limitation of the primary keys.

Accessing database information
A request for data from a database is called a query.
Queries can be of three forms:
  Choose from a list of parameters
  Query by example (QBE)
  Query language

Query Languages
The standard:
  SQL (Structured Query Language), originally called SEQUEL (Structured English QUEry Language)
  Developed by IBM in 1974
  Introduced commercially in 1979 by Oracle Corp.
  RDBMS (SQL), ODBMS (Java, C++, OQL etc.)
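To give a feel for what a query language looks like, here is a small sketch that runs two SQL queries through Python's built-in `sqlite3` module. The single `marks` table and the function names are invented for this example; the values are the Maths marks from the flat file earlier in the lecture.

```python
import sqlite3

ROWS = [
    ("Student1", "Maths", "E1", "C"), ("Student1", "Maths", "E2", "C"),
    ("Student1", "Maths", "E3", "B"), ("Student1", "Maths", "P1", "A"),
    ("Student1", "Maths", "P2", "A"),
    ("Student2", "Maths", "E1", "A"), ("Student2", "Maths", "E2", "D"),
    ("Student2", "Maths", "E3", "A"), ("Student2", "Maths", "P1", "A"),
    ("Student2", "Maths", "P2", "A"),
]


def make_marks_table():
    """Create an in-memory table with one row per student, course and assessment."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE marks (student TEXT, course TEXT, assessment TEXT, grade TEXT)"
    )
    conn.executemany("INSERT INTO marks VALUES (?, ?, ?, ?)", ROWS)
    return conn


def marks_for(conn, student, course):
    """Selection: all marks of one student in one course."""
    query = """
        SELECT assessment, grade FROM marks
        WHERE student = ? AND course = ?
        ORDER BY assessment
    """
    return conn.execute(query, (student, course)).fetchall()


def count_grade(conn, grade):
    """Aggregation: how often each student obtained the given grade."""
    query = """
        SELECT student, COUNT(*) FROM marks
        WHERE grade = ?
        GROUP BY student
        ORDER BY student
    """
    return conn.execute(query, (grade,)).fetchall()


conn = make_marks_table()
print(marks_for(conn, "Student1", "Maths"))
print(count_grade(conn, "A"))
```

The `?` placeholders let the DBMS fill in the search values, so the query text itself stays fixed.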

Distributed databases
From local to global attitude
Data appears to be in one location but is most definitely not.
A definition: two or more data files in different locations, periodically synchronized by the DBMS to keep data in all locations consistent.
An intricate network for combining and sharing information
Administrators praise fast network technologies!!!
Users praise the internet!!!
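The slide does not say how the periodic synchronization is done. As a toy sketch only, the example below assumes each record carries a version number and that the newest version wins everywhere; real distributed DBMSs use considerably more careful protocols. The function name and the record keys are hypothetical.

```python
def synchronise(replicas):
    """Bring all replicas to the same state, keeping the newest version of each record.

    Each replica is a dict mapping a record key to a (version, value) pair.
    'Newest version wins' is an assumption made for this sketch.
    """
    merged = {}
    for replica in replicas:
        for key, (version, value) in replica.items():
            if key not in merged or version > merged[key][0]:
                merged[key] = (version, value)
    # Overwrite every location with the merged view so that all copies agree.
    for replica in replicas:
        replica.clear()
        replica.update(merged)
    return merged


# Two hypothetical sites holding partly different copies of the same data.
site_a = {"entry1": (1, "draft annotation"), "entry2": (1, "sequence only")}
site_b = {"entry1": (2, "curated annotation")}

synchronise([site_a, site_b])
print(site_a == site_b)
```

After the call both sites hold the curated version of `entry1` and the single copy of `entry2`.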

So why do biologists care?

Three main reasons
Database proliferation
  Dozens to hundreds at the moment
In the next few years biological data analysis will be trifurcated
  Bio-webs: remote data analysis and mining
  Bio-grids: transparent high-end computing
  Bio-semantic webs: biological knowledge
More and more scientific discoveries result from inter-database analysis and mining

Biological databases
Like any other database
  Data organization for optimal analysis
Data is of different types
  Raw data (DNA, RNA, protein sequences)
  Curated data (annotated DNA, RNA and protein sequences and structures, expression data)

Raw Biological data Nucleic Acids (DNA)

Raw Biological data Amino acid residues (proteins)

Curated Biological Data
DNA, nucleotide sequences
  Gene boundaries, topology
  Gene structure
  Introns, exons, ORFs, splicing
  Expression data
Proteins, residue sequences
  MCTUYTCUYFSTYRCCTYFSCD
  Extended sequence information
  Secondary structure
  Hydrophobicity, motif data

Curated Biological data 3D Structures, folds

Biological Databases The 2003 NAR Database Issue:

Distributed information Pearson’s Law: The usefulness of a column of data varies as the square of the number of columns it is compared to.

A few biological databases
Nucleotide Databases: Alternative Splicing, EMBL-Bank, Ensembl, Genomes Server, Genome, MOT, EMBL-Align, Simple Queries, dbSTS Queries, Parasites, Mutations, IMGT
Genome Databases: Human, Mouse, Yeast, C. elegans, FLYBASE, Parasites
Protein Databases: Swiss-Prot, TrEMBL, InterPro, CluSTr, IPI, GOA, GO, Proteome Analysis, HPI, IntEnz, TrEMBLnew, SP_ML, NEWT, PANDIT
Structure Databases: PDB, MSD, FSSP, DALI
Microarray Database: ArrayExpress
Literature Databases: MEDLINE, Software Biocatalog, Flybase Archives
Alignment Databases: BAliBASE, Homstrad, FSSP

Structural Databases
  Protein Data Bank (PDB)
  Structural Classification of Proteins (SCOP)

PDB
3D macromolecular structural data
Data originates from NMR or X-ray crystallography techniques
Total no. of structures (25/01/2005)
If the 3D structure of a protein is solved ... they have it

PDB content

PDB information
The PDB files have a standard format
  Key features
  Informative descriptors
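Because the legacy PDB format is column-based, a coordinate record can be read by slicing fixed character positions. The sketch below parses one ATOM line using the column ranges of the PDB file format; the `Atom` class, the function name and the sample coordinates are made up for illustration, and only a subset of the fields is extracted.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    element: str


def parse_atom_line(line):
    """Parse one fixed-width ATOM/HETATM record (columns are 1-based in the spec)."""
    if not line.startswith(("ATOM  ", "HETATM")):
        raise ValueError("not a coordinate record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21],             # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
        element=line[76:78].strip(),   # columns 77-78
    )


# A hypothetical record, split into pieces so the column widths are easy to check.
SAMPLE_LINE = (
    "ATOM      1  N   MET A   1    "  # columns 1-30: record, serial, atom, residue
    "  27.340  24.430   2.614"        # columns 31-54: x, y, z
    "  1.00  9.67"                    # columns 55-66: occupancy, B-factor
    "           N"                    # columns 67-78: padding and element symbol
)

print(parse_atom_line(SAMPLE_LINE))
```

Splitting on whitespace would be simpler but can fail on real files, where neighbouring fields sometimes touch; slicing by column avoids that.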

Let's give it a go on the WWW

SCOP
Structural Classification Of Proteins
3D macromolecular structural data grouped based on structural classification
Data originates from the PDB
Current version (v1.65)
  PDB entries (1 August 2003)
  Domains

SCOP levels bottom-up
1. Family: Clear evolutionary relationship
Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% and greater. However, in some cases similar functions and structures provide definitive evidence of common descent in the absence of high sequence identity; for example, many globins form a family though some members have sequence identities of only 15%.
2. Superfamily: Probable common evolutionary origin
Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies. For example, actin, the ATPase domain of the heat shock protein, and hexokinase together form a superfamily.
3. Fold: Major structural similarity
Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. In some cases, these differing peripheral regions may comprise half the structure. Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could arise just from the physics and chemistry of proteins favouring certain packing arrangements and chain topologies.
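The hierarchy lends itself to a simple comparison: how deep do two domains agree? The sketch below assumes SCOP's dotted classification strings of the form class.fold.superfamily.family (for example `a.1.1.2`), which the slide itself does not mention; the function names are invented and the identifiers are used purely as example strings.

```python
LEVELS = ("class", "fold", "superfamily", "family")


def parse_sccs(sccs):
    """Split a dotted classification string into its four named levels."""
    parts = sccs.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected four dot-separated fields, got {sccs!r}")
    return dict(zip(LEVELS, parts))


def deepest_shared_level(sccs_a, sccs_b):
    """Return the most specific level at which two domains are grouped together.

    Returns None when the domains do not even share a class.
    """
    a, b = parse_sccs(sccs_a), parse_sccs(sccs_b)
    shared = None
    for level in LEVELS:
        if a[level] != b[level]:
            break
        shared = level
    return shared


print(deepest_shared_level("a.1.1.2", "a.1.1.1"))
print(deepest_shared_level("a.1.1.2", "a.4.5.2"))
```

Two domains that share a family are clearly related; sharing only a fold says nothing certain about common descent, as the slide points out.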

Let's give it a go on the WWW

CATH
Class, derived from secondary structure content, is assigned automatically for more than 90% of protein structures.
Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually.
The Topology level clusters structures according to their topological connections and numbers of secondary structures.
The Homologous superfamilies cluster proteins with highly similar structures and functions.
The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons.

Let's give it a go on the WWW

DSSP
Dictionary of secondary structure of proteins
The DSSP database comprises the secondary structures of all PDB entries
DSSP is actually software that translates the PDB structural co-ordinates into secondary structure elements
A similar example is STRIDE
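DSSP output assigns one secondary structure code per residue. A small sketch of how such a per-residue string might be collapsed into secondary structure elements is given below; the function name and the example string are hypothetical, and "-" is assumed here to stand for residues without an assigned state.

```python
from itertools import groupby

# One-letter states used by DSSP.
DSSP_CODES = {
    "H": "alpha helix",
    "B": "isolated beta bridge",
    "E": "extended strand",
    "G": "3-10 helix",
    "I": "pi helix",
    "T": "turn",
    "S": "bend",
}


def segments(ss_string, min_length=1):
    """Collapse a per-residue string into (code, first_residue, last_residue) runs.

    Residue positions are 1-based. Characters outside DSSP_CODES are skipped.
    """
    result = []
    position = 1
    for code, run in groupby(ss_string):
        length = len(list(run))
        if code in DSSP_CODES and length >= min_length:
            result.append((code, position, position + length - 1))
        position += length
    return result


example = "--HHHHHH-TT-EEEEE--"
for code, start, end in segments(example):
    print(f"{DSSP_CODES[code]}: residues {start}-{end}")
```

Raising `min_length` filters out very short runs, which is handy when only helices and strands of a useful size are of interest.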

WHY bother???
Researchers create and use the data
Use of known information for analyzing new data
New data needs to be screened
Structural/Functional information
Extends the knowledge and information on a higher level than DNA or protein sequences

In the end …
"Computers can figure out all kinds of problems, except the things in the world that just don't add up." James Magary
We should add: for that we employ the human brain, experts and experience.

Bio-databases: A short word on problems
Even today we face some key limitations
There is no standard format
  Every database or program has its own format
There is no standard nomenclature
  Every database has its own names
Data is not fully optimized
  Some datasets have missing information without indications of it
Data errors
  Data is sometimes of poor quality, erroneous, misspelled

What to take home
Databases are a collection of data
  Need to access and maintain easily and flexibly
Biological information is vast and sometimes very redundant
Distributed databases bring it all together with quality controls, cross-referencing and standardization
Computers can only create data, they do not give answers