Lecture 7 Types of databases.


Bioinformatics, Lecture 7: Types of databases. Principles of organisation and functioning. Sequence formats. Conversion of one sequence format to another. Database search. FASTA, BLAST.

Protein and DNA/RNA Databases
The first biological database was created by Margaret Dayhoff in the 1960s in response to the development of protein-sequencing methods in the 1950s. Proteins in this and other DBs were organised into families and superfamilies based on their degree of similarity. Tables reflecting the frequency of changes observed in the sequences of a group of closely related proteins were then derived (Percent Accepted Mutation, or PAM, matrices). These tables were used to align sequences and to reconstruct evolutionary pathways (phylogenetic trees).

Protein and DNA/RNA Databases
In the following years numerous protein and other databases were developed. SwissProt is one example.

Protein and DNA/RNA Databases
The first DNA DBs were developed in 1979-80 by American (GenBank) and European (EMBL) groups, also in response to the development of sequencing techniques. Hundreds of specialised DBs have been constructed since then. Many DBs contain DNA or RNA as well as protein information. There are numerous links between DBs, and data and tools are exchanged regularly. The Entrez Nucleotides database is a collection of sequences from several sources, including GenBank, RefSeq, and PDB. The number of stored bases grows at an exponential rate. On 20.02.2004 the total number of bases stored in Entrez was 20,197,497,568.

ENTREZ
DBs of different kinds have merged together and become global hubs of knowledge.

Protein and DNA/RNA Databases
Each DB has a complicated internal structure. The section of GenBank dealing with sequences alone consists of 58 blocks. Each block has at least one, but usually several, links with other blocks. The major components of a DB include: storage of sequences; many blocks responsible for retrieval of sequences; several blocks responsible for alignment; blocks responsible for data input and quality control; blocks responsible for statistical analysis of sequences; and many other functionalities. Complex DBs like GenBank contain several sets of blocks, which serve the DNA, protein, genome, taxonomy and other domains of the DB, with numerous links operating between them. The major banks receive hundreds of thousands, even millions, of requests every day. The structure of the bank must accommodate this load.

Collecting sequences and other data
Primary sequences of DNA, RNA and proteins constitute a significant portion of the information accumulated in DBs.

Collecting sequences and other data
All primary sequences that are to be in the public domain must be submitted to a DB; otherwise a publication is not accepted. DNA sequences are usually submitted on-line in the following form. Each sequence is provided with an ID. If two or more identical sequences are submitted by different people, a redundancy problem emerges in the DB.

Collecting sequences and other data
As the size of genomes varies dramatically, from 10,000 bp for simple viruses up to several billion bp in higher animals and plants, the number of sequences covering a whole genome also varies very significantly, from 10 to 10^6. DNA fragments presented in DBs differ not only in length but also in origin. Some are large fragments of a genome, others represent genes or their fragments, some are repeats and non-coding sequences, etc. Many fragments overlap. Many sequences are annotated, which means that their position on genetic maps, the internal structure of genes (exon-intron) and their function are known or predicted. In many cases, however, such information is missing.

Sequence formats
It is important to ensure that sequence files do not contain special characters recognisable only by text editors. ASCII files are suitable for most sequence programs. However, independent DBs and some widely used programs have developed slightly different sequence formats. Using the different formats correctly is critical, as is the ability to recognise a sequence, file or entry format and convert it to another.

GenBank DNA sequence entry

Sequence formats
There are many different (> 20) sequence formats, including GenBank, EMBL, SwissProt, FASTA, Genetics Computer Group (GCG) Sequence Format and several others.

1. FASTA/Pearson format

    >seq1
    agctagct
    actgg
    >seq2
    aactaact
    attcg

2. GenBank format

    LOCUS seq1 16bp
    DEFINITION seq1, 16 bases, 2688 checksum.
    ORIGIN
    1 agctagctag
    //
    LOCUS seq2 20bp
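To make the FASTA layout concrete, here is a minimal sketch of a reader for it. The function name `parse_fasta` is hypothetical, and the sketch assumes the simplest form of the format: a `>` header line whose first word is the ID, followed by one or more lines of sequence.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping sequence ID to sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: take the first word after '>' as the ID.
            words = line[1:].split()
            current_id = words[0] if words else "unnamed"
            records[current_id] = []
        elif current_id is not None:
            # Sequence lines may be wrapped, so collect the pieces.
            records[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


# The two toy records from the FASTA example on the slide.
example = """>seq1
agctagct
actgg
>seq2
aactaact
attcg
"""

for seq_id, seq in parse_fasta(example).items():
    print(seq_id, len(seq), seq)
```

Real FASTA files can carry free-text descriptions after the ID and other conventions that this sketch ignores.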

Conversion of one sequence format to another
Several computer programs are able to convert between formats. READSEQ is one such program and is very useful. Sequence formats recognised by the format conversion program READSEQ:

- Abstract Syntax Notation (ASN.1)
- DNA Strider
- EMBL
- FASTA
- Fitch (phylogenetic analysis)
- GenBank
- GCG
- Intelligenetics
- Multiple sequence format
- Nat. Biomedical Research Foundation (NBRF)
- Protein Information Resource (PIR)
- 6-8 additional specialised formats
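The idea of format conversion can be illustrated with a short sketch that renders a sequence as a simplified GenBank-style entry. This is not READSEQ and not the full GenBank specification: the function name `to_genbank_like` is hypothetical, the LOCUS and DEFINITION lines are reduced to the fields shown in the slide example, and only the sequence rows (60 bases per row in blocks of 10, each row prefixed by the position of its first base) follow the usual flat-file layout.

```python
def to_genbank_like(seq_id, sequence):
    """Render one sequence as a simplified GenBank-style entry (illustrative only)."""
    sequence = sequence.lower()
    lines = [
        f"LOCUS       {seq_id} {len(sequence)} bp",
        f"DEFINITION  {seq_id}, {len(sequence)} bases.",
        "ORIGIN",
    ]
    # Sequence rows: 60 bases per row, split into blocks of 10, each row
    # prefixed by the 1-based position of its first base.
    for start in range(0, len(sequence), 60):
        row = sequence[start:start + 60]
        blocks = [row[i:i + 10] for i in range(0, len(row), 10)]
        lines.append(f"{start + 1:>9} " + " ".join(blocks))
    lines.append("//")  # end-of-entry marker
    return "\n".join(lines)


print(to_genbank_like("seq1", "agctagctactgg"))
```

A real converter also has to carry over annotation, checksums and feature tables, which is why dedicated tools are preferred in practice.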

READSEQ

Format conversion in GenBank

Storage of information in a sequence database
There are millions of entries in the major DNA and protein DBs, and each entry usually contains a significant amount of information. This information is organised in tabular form, as is usually done in a relational DB. The number of columns (fields) in such a DB is much larger than in the table below. An index of these fields can be built, which allows a very fast search of the DB using one or a few fields simultaneously. The information in one DB can be cross-referenced to that in another DB. For instance, DNA, protein and reference DBs have all been cross-referenced, so that moving between them is readily accomplished.

Accession No  Organism    Reference    Name                     Keywords                   Sequence
123           E. coli     Medline1, …  LexA protein             SOS regulon, repressor, …  ATGCCGG…
124           H. sapiens  Medline2, …  glucocorticoid receptor  transcriptional regulator  CCGATAAC
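A minimal sketch of the tabular storage, indexing and cross-referencing described above, using Python's built-in `sqlite3` module. The table and column names are hypothetical, and the rows are the two toy entries from the slide table (the first sequence is stored as the truncated prefix shown there); this is not how GenBank itself is implemented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sequences (
        accession INTEGER PRIMARY KEY,
        organism  TEXT,
        name      TEXT,
        keywords  TEXT,
        sequence  TEXT
    );
    -- A separate reference table, cross-referenced by accession number.
    CREATE TABLE refs (
        accession INTEGER REFERENCES sequences(accession),
        citation  TEXT
    );
    -- An index on a field makes searches on that field fast.
    CREATE INDEX idx_organism ON sequences(organism);
""")
conn.executemany("INSERT INTO sequences VALUES (?, ?, ?, ?, ?)", [
    (123, "E. coli", "LexA protein", "SOS regulon, repressor", "ATGCCGG"),
    (124, "H. sapiens", "glucocorticoid receptor",
     "transcriptional regulator", "CCGATAAC"),
])
conn.executemany("INSERT INTO refs VALUES (?, ?)",
                 [(123, "Medline1"), (124, "Medline2")])


def find_by_organism(organism):
    """Look up entries by organism and follow the cross-reference to refs."""
    rows = conn.execute(
        "SELECT s.accession, s.name, r.citation "
        "FROM sequences s JOIN refs r ON r.accession = s.accession "
        "WHERE s.organism = ? ORDER BY s.accession",
        (organism,),
    )
    return rows.fetchall()


print(find_by_organism("E. coli"))
```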

Database Types
There are several types of DB; the two principal types are relational and object-oriented. A relational DB orders data in tables made up of rows, each giving a specific item in the DB, and columns, giving the features (attributes) of those items. Careful indexing and cross-referencing are essential, since each item in the DB has a unique set of identifying features. The object-oriented DB structure has been useful in the development of biological DBs. Such DBs are needed to deal with complex and constantly evolving biological objects like genetic maps. A sophisticated architecture and the use of a unifying common language, like the Interface Definition Language (IDL), are required to make such DBs functional and unified. To plan and construct such object-oriented DBs, a specific set of procedures called the Unified Modelling Language (UML) was devised.

Example of object-oriented DB

Sequence retrieval from the public Databases
The essential step in providing access to a functional DB is the development of software and Web pages that allow queries to be made. The ENTREZ system implemented at NCBI is a good example of such a retrieval system. Three major types of query search are implemented in a number of DBs: 1. ID search, 2. molecule name or function search and 3. similarity search. The first two types are based on DB indexes and cross-references. The third is different, as it first has to create comparative data and then use these data for retrieval and DB searches.

DB searches for similar sequences
Since Charles Darwin, the idea of a common origin of species has been widely accepted; however, the level of similarity at the molecular level between distant species remained unclear until the 1970s and 1980s. At that time it was established that many DNA and particularly protein molecules retain significant (>60-70%) or high (>85%) similarity hundreds of millions of years after separation from a common ancestor. This discovery, together with the practical need to search growing DBs, led to the development of effective methods of similarity search. Two programs that greatly facilitated similarity searching were developed: FASTA (Pearson and Lipman 1988) and BLAST (Altschul et al. 1990).

Basics of similarity searches
The basic step in any similarity search is an alignment of two or more sequences. Principles of alignment will be considered in the next lecture. The search provides a list of DB sequences with which a query sequence can be aligned. A scoring procedure is then applied, which measures the degree of similarity, from 100% identity down to loose similarity. A common reason for performing a DB search is to find a related gene. A matched gene (or any other sequence) may provide a clue as to function. Alternatively, a sequence with a known function or role can be used as a query to search a species genome. The search must be both fast and sensitive enough.

FASTA
FASTA is a program for rapid alignment of pairs of protein and DNA sequences. Comparing every nucleotide or amino acid is not an option, even for powerful computers; instead FASTA searches for matching sequence patterns ("words") called k-tuples. These patterns comprise k consecutive matches in the compared sequences. Using k-tuples, FASTA builds a local alignment. Finally FASTA scores this alignment and outputs a list of sequences similar to the query in descending order of score.

[Figure: gaps and matching k-tuples of length k = 6, k = 8 and k = 14 marked on the two sequences below.]
ATCGAACCTGGATCGTGGCCATCGAACCTGGATCGTGGCCATCGAACCTGGATCGTGGCC
GGCGAACCCCTATCGTGGCGTTACCGCCTTATTGACGGCCATCGAACCTGGATCGTGGCC
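The k-tuple idea can be sketched in a few lines: index every k-tuple of the query, scan the database sequence for identical k-tuples, and count hits per diagonal (subject position minus query position). The function names are hypothetical, and this covers only the word-lookup stage, not FASTA's scoring or optimisation. The two test strings are the first 20 bases of the sequences in the figure.

```python
from collections import defaultdict


def ktuple_hits(query, subject, k):
    """Return (query_pos, subject_pos) pairs where identical k-tuples start."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    hits = []
    for j in range(len(subject) - k + 1):
        # .get() avoids adding empty entries to the index for misses.
        for i in index.get(subject[j:j + k], []):
            hits.append((i, j))
    return hits


def best_diagonal(hits):
    """Return (diagonal, hit_count) for the diagonal with the most k-tuple hits."""
    counts = defaultdict(int)
    for i, j in hits:
        counts[j - i] += 1
    if not counts:
        return None
    return max(counts.items(), key=lambda item: item[1])


query = "ATCGAACCTGGATCGTGGCC"
subject = "GGCGAACCCCTATCGTGGCG"
hits = ktuple_hits(query, subject, k=4)
print(hits)
print(best_diagonal(hits))
```

A diagonal that collects many hits corresponds to a region where the two sequences line up without gaps, which is where a local alignment is most likely to be found.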

FASTA
FASTA performs the following statistical tasks:
1. The average score for DB sequences of the same length is determined.
2. The average score is plotted against the log of the average sequence length in each length range.
3. The points are fitted to a straight line by linear regression.
4. A z score, the number of standard deviations from the fitted line, is calculated for each score.
5. Low-scoring sequences are removed.
6. A statistical comparison with the z distribution follows, which allows the E() value to be calculated. If E() = 0 and the z score is high, the two sequences are identical; when E() is higher than a threshold level, no clear similarity is observed.

Methods used by FASTA to locate sequence similarity:
- The 10 best matching regions in each pair are located rapidly. For DNA sequences k = 4-6, for proteins k = 1-2.
- The highest-density matches are identified.
- The highest-density regions are evaluated using special scoring matrices (next lecture) and the best initial regions (INIT1) are found (* marks the best).
- Longer regions of identity with score INITN are generated by joining initial regions whose scores are higher than a certain threshold; the score includes positive terms for similarity and negative terms for gaps.
- An optimisation procedure follows.
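Steps 2 to 4 above can be sketched as follows. This is a simplified illustration, not FASTA's actual code: it fits the line to individual (log length, score) points instead of per-length-range averages, and the lengths and scores are made-up numbers.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def z_scores(lengths, scores):
    """Number of standard deviations of each score from the fitted line."""
    xs = [math.log(n) for n in lengths]
    slope, intercept = fit_line(xs, scores)
    residuals = [s - (slope * x + intercept) for x, s in zip(xs, scores)]
    sd = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return [r / sd for r in residuals]


# Made-up library: scores grow slowly with length, except the last entry,
# which scores far higher than its length would predict.
lengths = [100, 200, 400, 800, 1600, 400]
scores = [20, 24, 28, 32, 36, 60]
z = z_scores(lengths, scores)
for length, score, value in zip(lengths, scores, z):
    print(length, score, round(value, 2))
```

The sequence whose score lies furthest above the fitted line receives the highest z score and is the candidate for a genuine match.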

Typical output of FASTA similarity search
Query: Motif2; #282 is a fragment from a DB.

>>#282 (18 aa)
initn: 48 init1: 48 opt: 71 z-score: 191.0 E(): 6.9e-06
Smith-Waterman score: 71; 61.111% identity in 18 aa overlap

               10        20
Motif2 VKTYGFAATSVEEAKEVAEERGK
           X:.:::.::X.:.:.. :
#282       GFVATSAEEAEEIAKKLG
                   10

BLAST
The Basic Local Alignment Search Tool (BLAST) was developed as a new way to perform sequence similarity searches. BLAST is faster than FASTA while being nearly as sensitive. The minimal "word" (k-tuple) length is somewhat higher than in FASTA: 3 for proteins and 11 for DNA.

BLAST procedure
The steps used by the BLAST algorithm:
1) The query sequence is optionally filtered to remove low-complexity regions (AGAGAG…).
2) A list of words of a certain length is made.
3) Using substitution score matrices (like PAM or BLOSUM62), the query sequence words are evaluated for matches with any DB sequence, and these (log) scores are added.
4) A cutoff score (T) is selected to reduce the number of matches to the most significant ones.
5) The above procedure is repeated for each word in the query sequence.
6) The remaining high-scoring words are organised into an efficient search tree and rapidly compared to the DB sequences.
7) If a good match is found, an alignment is extended from the match area in both directions for as long as the score continues to grow. The latest version of BLAST uses a more time-efficient method.
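The extension in step 7 can be sketched as an ungapped extension of a word hit in both directions. This is a simplified illustration and not BLAST's implementation: the function name `extend_hit`, the +1/-1 match/mismatch scores and the drop-off parameter `x_drop` (which stops the extension once the running score falls too far below the best score seen) are all assumptions made for the example.

```python
def extend_hit(query, subject, q_start, s_start, word_len,
               match=1, mismatch=-1, x_drop=2):
    """Extend a word hit without gaps; return (score, query_start, query_end).

    query_end is exclusive, so query[query_start:query_end] is the aligned
    region of the query.
    """
    def pair_score(a, b):
        return match if a == b else mismatch

    # Score of the seed word itself.
    best = sum(pair_score(query[q_start + i], subject[s_start + i])
               for i in range(word_len))

    # Extend to the right, remembering where the best score was reached.
    running, best_right, step = best, 0, 0
    q, s = q_start + word_len, s_start + word_len
    while q < len(query) and s < len(subject):
        running += pair_score(query[q], subject[s])
        step += 1
        if running > best:
            best, best_right = running, step
        elif best - running > x_drop:
            break  # score has dropped too far; stop extending
        q, s = q + 1, s + 1

    # Extend to the left in the same way, starting from the best score so far.
    running, best_left, step = best, 0, 0
    q, s = q_start - 1, s_start - 1
    while q >= 0 and s >= 0:
        running += pair_score(query[q], subject[s])
        step += 1
        if running > best:
            best, best_left = running, step
        elif best - running > x_drop:
            break
        q, s = q - 1, s - 1

    return best, q_start - best_left, q_start + word_len + best_right


query = "TTACGTACGTAA"
subject = "GGACGTACGTCC"
# Seed: the 4-letter word "ACGT" found at position 2 in both sequences.
score, start, end = extend_hit(query, subject, 2, 2, word_len=4)
print(score, query[start:end])
```

Protein searches would replace `pair_score` with a lookup in a substitution matrix such as BLOSUM62.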

BLAST procedure
The essence of this method is finding a diagonal connecting ungapped alignments and extending them.
[Figure: plot with axes labelled "Query sequence" and "Database sequence".]

BLAST procedure
8) The next step is to determine those high-scoring pairs (HSPs) of sequences that have a score greater than a cutoff score (S). S is determined empirically by examining the range of scores found by comparing random sequences and choosing a value that is significantly greater.
9) BLAST then determines the statistical significance of each HSP score. The probability p of observing a score S equal to or greater than x is given by the equation
$p(S \ge x) = 1 - \exp\left(-e^{-\lambda (x - u)}\right)$, where $u = \frac{\ln(K m' n')}{\lambda}$,
$K$ and $\lambda$ are parameters that BLAST calculates for the amino acid or nucleotide substitution scoring matrix, $n'$ is the effective length of the query sequence and $m'$ is the effective length of the database sequence.
10) In the next step a statistical assessment is made for the case where two or more HSP regions are found, and the matching pairs are listed in the output file in descending order of similarity score.
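The significance formula in step 9 translates directly into code. In this sketch the function name `hsp_p_value` is hypothetical and the values of K, λ and the effective lengths are illustrative placeholders, not parameters taken from BLAST.

```python
import math


def hsp_p_value(x, m_eff, n_eff, K, lam):
    """Probability of an HSP score >= x: 1 - exp(-e^(-lam * (x - u)))."""
    u = math.log(K * m_eff * n_eff) / lam  # u = ln(K m' n') / lambda
    return 1.0 - math.exp(-math.exp(-lam * (x - u)))


# Placeholder parameters, chosen only to show how p falls as the score rises.
K, lam = 0.1, 0.3
m_eff, n_eff = 1_000_000, 250  # effective database and query lengths
for x in (40, 60, 80, 100):
    print(x, hsp_p_value(x, m_eff, n_eff, K, lam))
```

Because $e^{-\lambda(x-u)} = K m' n' e^{-\lambda x}$, the exponent is the expected number of chance HSPs scoring at least x, so p is close to 1 for low scores and approaches that expected number when the score is high.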

On line BLAST results
