Bioinformatics Unit 1: Data Bases and Alignments Lecture 2: “Homology” Searches and Sequence Alignments.
Overview of Lecture Introduction: When, what and how to search for “homologous” sequences Terminology Nucleotide database searches –BLAST programs –FASTA programs –Others Protein database searches
Introduction When do I search? What do I search for (which database)? How do I search (which program)? What do the search results mean? Answer: Database searches (hopefully) identify biologically relevant sequence alignments
Sequence Alignments Sequence alignments allow comparison of new sequences to either one, a group of, or all known sequences A well-designed alignment can allow one to infer: –gene or protein function –evolutionary relationships among genes, proteins or species –structure of proteins or nucleic acids The process is highly dependent on the choice of query and the parameters of the alignment
Terminology Associated with Searches and Alignments Query: The input sequence (or other type of search term) with which all of the entries in a database are to be compared. –Examples: Your unknown DNA sequence, a word, an accession number, etc. Algorithm: A fixed procedure embodied in a computer program –Examples: Alignment programs like BLAST, FASTA, BLITZ, etc.
Terminology Associated with Searches and Alignments (cont.) Homology: Similarity attributed to descent from a common ancestor (often misused). Identity: The extent to which two (nucleotide or amino acid) sequences are invariant. Often expressed as a percentage. Similarity: The extent to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be based on percent sequence identity (nucleotides) and/or conservation (proteins, e.g., a lysine substituted for an arginine).
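The difference between identity and similarity can be made concrete with a short Python sketch. It assumes the two sequences are already aligned (equal length, "-" for gaps) and that gap columns are left out of the denominator, which is only one of several conventions in use. The amino acid groups in CONSERVATIVE_GROUPS are an illustrative assumption, not a published scoring matrix.

```python
# Illustrative groups of chemically similar amino acids. This grouping is an
# assumption made for the sketch; real searches use a substitution matrix.
CONSERVATIVE_GROUPS = ("KRH", "DE", "ILVM", "ST", "FYW", "NQ")


def _aligned_columns(a, b):
    """Return the alignment columns in which neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]


def is_conservative(x, y):
    """True if both residues fall in the same illustrative group."""
    return any(x in group and y in group for group in CONSERVATIVE_GROUPS)


def percent_identity(a, b):
    """Percentage of ungapped columns holding the same residue."""
    columns = _aligned_columns(a, b)
    if not columns:
        return 0.0
    return 100.0 * sum(x == y for x, y in columns) / len(columns)


def percent_similarity(a, b):
    """Percentage of ungapped columns that are identical or conservative."""
    columns = _aligned_columns(a, b)
    if not columns:
        return 0.0
    similar = sum(x == y or is_conservative(x, y) for x, y in columns)
    return 100.0 * similar / len(columns)


if __name__ == "__main__":
    # K/R and S/T count toward similarity but not toward identity.
    print(percent_identity("MKLS", "MRLT"), percent_similarity("MKLS", "MRLT"))
```

For nucleotide alignments only percent_identity applies; percent_similarity is meaningful for proteins, where a lysine-for-arginine change is scored as conserved.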
Terminology Associated with Searches and Alignments (cont.) Gap: A space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. –Example: Aligning a cDNA sequence with a gene requires gaps at the position of introns Substitution (Scoring) matrices: Speed vs. sensitivity –allow a query sequence to be aligned with sequences in the database very rapidly. The most significant matches (successful alignments) are reported. Less complex, faster matrices sacrifice some sensitivity (i.e., a match must be better to be recognized than with a slower, more complex matrix). The matrix, together with the choice of program, essentially determines the search sensitivity and search speed.
Terminology Associated with Searches and Alignments (cont.) Filters: usually part of an alignment program and turned on by default. –The filter masks (hides) regions of the query sequence (your sequence) that have low compositional complexity (like poly A tails). Masking is achieved by replacing the sequence with a string of N's (NNNNNN), the code for any DNA base. –Poly A tails, for example, can give rise to artificially high scores and therefore misleading results. This is due to the large numbers of such sequences distributed throughout the genome, and therefore throughout the database. –Similarly, new programs exist to filter out vector sequences.
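A minimal Python sketch of the masking idea: long single-base runs such as poly A tails are replaced with N's before the search. This is a deliberately simple stand-in, not the low-complexity filter that BLAST itself uses, and the function name and the min_run cutoff are assumptions chosen for illustration.

```python
import re


def mask_homopolymers(seq, min_run=10):
    """Replace every run of at least min_run identical bases with N's.

    Simplified stand-in for a low-complexity filter: it only catches
    single-base runs (e.g. poly A tails), not other repetitive patterns.
    """
    if min_run < 2:
        raise ValueError("min_run must be at least 2")
    # One base followed by at least (min_run - 1) copies of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: "N" * len(m.group(0)), seq.upper())


if __name__ == "__main__":
    query = "ACGTTGCA" + "A" * 15
    print(mask_homopolymers(query))
```

Masking keeps the sequence length unchanged, so alignment coordinates reported for the masked query still refer to positions in the original sequence.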
Nucleotide Database Searching Commonly used search algorithms: –BLAST (at NCBI) –FASTA (in France) –BLITZ (at EBI, part of EMBL) –SSEARCH (in France) –PSI-BLAST (at NCBI)
Basic Local Alignment Search Tool (BLAST) A set of similarity search tools Fast and sensitive “real” matches fairly easily distinguished from random matches by scoring Seeks local rather than global alignment Can detect relationships between sequences that share only regions of similarity –GREAT as proteins are “modular”
Algorithms Within BLAST Blastn: compares nucleotide query sequence against nucleotide sequence database Blastp: compares amino acid query sequence against protein sequence database Blastx: compares nucleotide query translated in all reading frames against a protein sequence database
Algorithms Within BLAST (cont.) Tblastn: compares amino acid query sequence against nucleotide sequence database dynamically translated in all reading frames Tblastx: compares the six-frame translation of a nucleotide query sequence against a nucleotide sequence database dynamically translated in all reading frames - COMPUTATIONALLY INTENSE!! Choose the correct algorithm!!!
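The choice among the five programs depends only on the type of the query, the type of the database, and whether translation is wanted. A small Python lookup sketches that decision; the function name and the "nucleotide"/"protein" labels are assumptions made for this example, not part of any BLAST interface.

```python
# (query type, database type, translate?) -> BLAST program, as described above.
PROGRAMS = {
    ("nucleotide", "nucleotide", False): "blastn",
    ("protein", "protein", False): "blastp",
    ("nucleotide", "protein", True): "blastx",
    ("protein", "nucleotide", True): "tblastn",
    ("nucleotide", "nucleotide", True): "tblastx",
}


def choose_program(query_type, database_type, translate=False):
    """Return the BLAST program name for a query/database combination."""
    try:
        return PROGRAMS[(query_type, database_type, translate)]
    except KeyError:
        raise ValueError(
            "no BLAST program for query=%r, database=%r, translate=%r"
            % (query_type, database_type, translate)
        ) from None


if __name__ == "__main__":
    # A new cDNA sequence searched against a protein database.
    print(choose_program("nucleotide", "protein", translate=True))
```

Combinations that make no sense, such as a protein query against a protein database with translation, raise an error instead of silently picking a program.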
Top red line represents query sequence Each line below indicates matching sequences sorted by score (in color) and position of match Below is a list of high scoring matches followed by actual alignments
The “Expectation” Value (E Value) The number of different alignments with scores equivalent to or better than S (threshold score) that are expected to occur in a database search by chance. The lower the E value, the more significant the match. Given in scientific notation. For example, an E value of e-167 (that is, 10^-167) means that only about 10^-167 alignments scoring this well are expected by chance, so the match is almost certainly not random. Varies with the number of bp of sequence in the database and the length of the query sequence
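How the E value depends on score, query length, and database size can be sketched with the standard Karlin-Altschul form, E = K * m * n * exp(-lambda * S), which the slides do not spell out. Here m is the query length, n the database length, and S the raw score. K and lambda depend on the scoring system; the values used in the demonstration below are placeholders chosen only for illustration, not parameters taken from BLAST.

```python
import math


def expected_chance_hits(score, query_len, db_len, K, lam):
    """E value for a raw alignment score: E = K * m * n * exp(-lam * S)."""
    return K * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    K, lam = 0.1, 1.0  # placeholder values, assumed for illustration only
    for db_len in (1e8, 1e9):
        e_value = expected_chance_hits(40, 500, db_len, K, lam)
        print("database of %.0e bp: E = %.3g" % (db_len, e_value))
```

The same raw score therefore becomes less significant as the database grows, and E falls off exponentially as the score rises.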
How Does the BLAST Algorithm Work? An Overview A two-step process Initial scanning identifies high scoring matches to “words” in the query sequence –Positive scores for exact matching bases or amino acids –Negative scores for mismatches –Default word size is 11 bases (for blastn) Sequences with high scores are extended in both directions in the second step until the best score is achieved Scoring matrices are used in each step
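The two steps can be sketched in Python for nucleotide sequences: exact word matches are found first, then each is extended without gaps in both directions while the score is tracked. This is a simplified illustration, not the BLAST implementation. The scores (+1 match, -3 mismatch) and the drop_off cutoff that ends an extension are assumptions chosen for the example.

```python
from collections import defaultdict


def find_seeds(query, subject, word_size=11):
    """Step 1: positions (i, j) where a query word exactly matches the subject."""
    index = defaultdict(list)
    for j in range(len(subject) - word_size + 1):
        index[subject[j:j + word_size]].append(j)
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def _extend(query, subject, q_pos, s_pos, step, best, match, mismatch, drop_off):
    """Extend one way; return (best score, number of positions kept)."""
    run, kept, k = best, 0, 0
    while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
        run += match if query[q_pos] == subject[s_pos] else mismatch
        k += 1
        if run > best:
            best, kept = run, k
        elif best - run > drop_off:
            break  # score has fallen too far below the best seen so far
        q_pos += step
        s_pos += step
    return best, kept


def extend_seed(query, subject, i, j, word_size=11,
                match=1, mismatch=-3, drop_off=5):
    """Step 2: ungapped extension of a seed to the right, then to the left."""
    best = word_size * match
    best, right = _extend(query, subject, i + word_size, j + word_size, 1,
                          best, match, mismatch, drop_off)
    best, left = _extend(query, subject, i - 1, j - 1, -1,
                         best, match, mismatch, drop_off)
    query_span = (i - left, i + word_size + right)
    subject_span = (j - left, j + word_size + right)
    return best, query_span, subject_span


if __name__ == "__main__":
    core = "ACGTACGTTAGC"
    q, s = "TT" + core + "GG", "CC" + core + "AA"
    for i, j in find_seeds(q, s):
        print((i, j), extend_seed(q, s, i, j))
```

Both seeds in the demonstration extend to the same 12-base region, which is why real implementations merge overlapping hits before reporting them.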
Options Word length –Set at 11 bases for blastn. –Requires a perfect 11 bp match to go to the second step –Chances of a random 11 bp exact match are 1/4^11 (= 1/4,194,304) –Shortening the word length may make the search more sensitive, but it may increase the number of non-biologically significant hits
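The arithmetic behind the word length trade-off is easy to check in Python. The sketch assumes all four bases are equally likely and independent, which real genomes only approximate, and the function names are made up for this example.

```python
def exact_word_chance(word_size, alphabet_size=4):
    """Chance that two random words of this length match exactly."""
    return 1 / alphabet_size ** word_size


def expected_chance_seeds(query_len, db_len, word_size):
    """Expected number of exact word matches between random sequences."""
    query_words = max(query_len - word_size + 1, 0)
    db_words = max(db_len - word_size + 1, 0)
    return query_words * db_words * exact_word_chance(word_size)


if __name__ == "__main__":
    print(4 ** 11)
    for w in (7, 11):
        print(w, expected_chance_seeds(1000, 10 ** 9, w))
```

Each base removed from the word multiplies the expected number of chance seeds by roughly four, which is where the extra non-significant hits at shorter word lengths come from.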
Options (cont.) Filters –Can mask regions of low complexity (poly A tails, proline-rich regions) –Can now mask human repetitive sequences –Low complexity filter is on by default; others must be activated
Options (cont.) The Expect threshold –The statistical significance threshold for reporting matches against database sequences –The default value is 10, meaning that 10 matches are expected to be found merely by chance –If the E value ascribed to a match is greater than the EXPECT threshold, the match will not be reported. –Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported. –Raising the threshold also reports less significant matches. Fractional values are acceptable.
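The reporting rule can be sketched as a simple filter in Python. The hit names and E values below are invented for illustration and do not come from a real search; the function name is likewise an assumption.

```python
def reportable_hits(hits, expect=10.0):
    """Keep hits whose E value does not exceed the threshold, best first."""
    kept = [hit for hit in hits if hit[1] <= expect]
    return sorted(kept, key=lambda hit: hit[1])


if __name__ == "__main__":
    # Invented (name, E value) pairs, for illustration only.
    hits = [("hit_C", 4.0), ("hit_A", 1e-167), ("hit_D", 25.0), ("hit_B", 0.02)]
    print([name for name, _ in reportable_hits(hits)])
    print([name for name, _ in reportable_hits(hits, expect=0.001)])
```

With the default of 10 everything except the weakest hit is listed; tightening the threshold to a fractional value such as 0.001 leaves only the match that is very unlikely to be random.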
Protein Database Searching 2-5 times more sensitive than a DNA database search! –DNA alphabet is smaller than the protein alphabet (4 vs. 20 letters) –The genetic code is redundant (e.g., 6 serine codons) –There is a selection for function, thus protein sequence is more highly conserved through time Genes or proteins in different organisms that descend from the same ancestral gene through speciation, and that usually retain the same function, are called “orthologs”
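The redundancy of the genetic code can be checked directly in Python. The sketch builds the standard genetic code from its usual compact listing (bases in the order T, C, A, G) and shows that two DNA sequences differing at several positions can encode the same protein. The function names are assumptions made for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; "*" marks a stop.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA sequence in frame 1, ignoring any trailing partial codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def codons_for(amino_acid):
    """All codons that encode the given amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


if __name__ == "__main__":
    print(codons_for("S"))
    # Different DNA, identical protein.
    print(translate("TCTAAAGGT"), translate("AGCAAGGGC"))
```

A DNA-level comparison of the two demonstration sequences sees several mismatches, while a protein-level comparison sees a perfect match, which is one reason protein searches are the more sensitive of the two.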