1
Bioinformatics III (“Systems biology”)
The course will address two areas:
• analysis and comparison of whole genome sequences
• "systems biology": an integrated view of cellular networks
1. Lecture WS 2003/04 Bioinformatics III
2
Whole Genomes - Content
• genome assembly
• gene finding
• genome alignment
• whole genome comparison (prokaryotes, human/mouse)
• genome rearrangements
• transcriptional regulation
• functional genomics
• phylogeny
• single nucleotide polymorphisms (SNPs)
Some topics were already covered in the Bioinformatics 1 lecture by Prof. Lenhof.
3
Cellular Networks - Content
• network topologies: random networks, scale-free networks
• robustness of networks
• expression analysis
• metabolic networks, metabolic flow analysis
• linear systems, non-linear dynamics
• molecular systems biology: protein-protein interaction networks, molecular machines
• ...
4
Literature
Whole genome sequences: e.g. David Mount, Bioinformatics, Chapters 6, 8, 10 (€ 68)
Systems biology: mostly taken from original literature
Web resources:
• Institute of Systems Biology, Seattle, WA
• The Systems Biology Institute
5
Assignments
12 weekly assignments are planned.
Homework is handed out in the Tuesday lectures and is available on our web server on the same day. Solutions must be returned by Tuesday of the following week, 14.00, in room 1.05, Geb. 17.1, first floor, or handed in before (!) the lecture starts. In case of illness, please notify us and provide a medical certificate.
6
Schein = successful written exam
Successful participation in the lecture course ("Schein") will be certified upon passing the written exam on Feb. 18, 2004. The exam is open to students who have received at least 50% of the credit points for the 12 assignments. Unless announced otherwise on the course website by Feb. 4, the exam will cover all material from the lectures and the assignments. In case of illness, please notify us and provide a medical certificate. A "second and final chance" exam may be offered at the beginning of April 2004 to those who failed the first exam and to those who missed it due to illness (medical certificate required).
7
Tutors
Prof. Dr. Volkhard Helms: office hours Tue, Geb. 17.1, room 1.06. Generally, I am also available after the lectures.
Dr. Tihamer Geyer: assignments for the network part, Geb. 17.1, room 1.09.
Guest lecturers and tutors
8
Tree of Life
[Figure: phylogenetic tree of the three domains of life.
Bacteria: Purple bacteria, Gram-positive, Cyanobacteria, Chlamydiae, Flavobacteria, Spirochetes, Deinococci, Green nonsulfur bacteria, Thermotogales, Aquifex
Archaea: Euryarchaeota (Methanosarcina, Halophiles, Methanobacterium, Methanococcus, Thermoplasma, Thermococcus); Crenarchaeota (Thermoproteus, Pyrodictium)
Eukarya: Animals, Fungi, Plants, Slime molds, Entamoebae, Ciliates, Stramenopiles, Trichomonads, Microsporidia, Diplomonads]
9
Genomes
A genome is the entire genetic material of an organism. We will review genome organization, known sequences, genome language, sequencing details etc. in the next lecture. Now that we have genome information from multiple organisms, I see the following issues:
1. What biological questions do we ask?
2. What bioinformatics tools do we need to find the answers?
3. What are the answers?
10
Why mouse?
[Figure: 19 mouse chromosomes.]
Geneticists have anxiously awaited the recently published draft version of the mouse genome. Why? The mouse, as a close relative of humans, is a unique lens through which we can view ourselves. As the leading mammalian system for genetic research over the past century, it has provided a model for human physiology and disease. Comparative genomics makes it possible to discern biological features that would otherwise escape our notice. Nature 420, 520 (2002)
11
How do we compare genomes?
Conservation of synteny between human and mouse. 558,000 highly conserved, reciprocally unique landmarks were detected within the mouse and human genomes, which can be joined into conserved syntenic segments and blocks. A typical 510-kb segment of mouse chromosome 12 that shares common ancestry with a 600-kb section of human chromosome 14 is shown. Blue lines connect the reciprocally unique matches in the two genomes. In general, the landmarks in the mouse genome are more closely spaced, reflecting the 14% smaller overall genome size. Nature 420, 520 (2002)
12
Genome rearrangements
Segments and blocks >300 kb in size with conserved synteny in human are superimposed on the mouse genome. Each colour corresponds to a particular human chromosome. The 342 segments are separated from each other by thin, white lines within the 217 blocks of consistent colour. Genome rearrangements have functional implications (to be discussed later). Nature 420, 520 (2002)
13
Review: Pairwise sequence alignment
• dynamic programming: Needleman-Wunsch, Smith-Waterman
• sequence alignments
• substitution matrices
• significance of alignments
• BLAST: algorithm, parameters, output
This part of the lecture is taken from the O'Reilly book "BLAST" by Korf, Yandell, and Bedell; see also the Bioinformatik I lecture by Prof. Lenhof, weeks 3 and 5.
Notes: Database similarity searching is one of the first and most important steps in analysing a new sequence. If your unknown sequence has a similar copy already in the databases, a search will quickly reveal this fact, and if the copy is well annotated you need go little further in trying to identify your sequence. Database searches usually provide the first clues as to whether the sequence belongs to an already studied and well-known protein family. If there is similarity to a sequence from another species, the two may be homologous (i.e. descended from a common ancestral sequence). Knowing the function of a similar/homologous sequence will often give a good indication of the identity of the unknown sequence. N.B. To identify homologous sequences, searches should be made at the protein sequence level, because this is about 5 times more sensitive at finding matches.
14
Sequence alignment
When 2 or more sequences are present one would like to
• detect their similarities quantitatively
• discover equivalences of single sequence motifs
• observe regularities of conservation and variability
• deduce historical relationships
Important goal: annotation of structural and functional properties.
Assumption: sequence, structure, and function are inter-related.
15
Why do sequence database searching?
Search in databases: identify similarities between a new test sequence of unknown and uncharacterized structure and function and sequences in (public) sequence databases with known structure and function.
N.B. The similar regions can encompass the entire sequence (global alignment) or parts of it (local alignment)!
• What have I cloned?
• Is this really "my gene"?
• Has someone else already found it?
• Is it interesting anyway?
• What is it related to?
• Can I get more sequence easily?
16
Sequence Alignment
The purpose of a sequence alignment is to arrange beneath each other all those residues of an arbitrary number of sequences that are derived from the same residue position in an ancestral gene or protein.
gap = insertion or deletion
What is the most important residue for aligning? Cys, because it is the most conserved.
A multiple sequence alignment is a 2D table in which the rows represent individual sequences and the columns the residue positions. Sequences are laid onto this grid in such a manner that (a) the relative positioning of residues within any one sequence is preserved, and (b) similar residues in all the sequences are brought into vertical register.
J. Leunissen
17
Needleman-Wunsch Algorithm
• general algorithm for sequence comparison
• maximises a similarity score
• maximum match = largest number of residues of one sequence that can be matched with another, allowing for all possible deletions
• finds the best GLOBAL alignment of any two sequences
• NW involves an iterative matrix method of calculation
• all possible pairs of residues (bases or amino acids), one from each sequence, are represented in a two-dimensional array
• all possible alignments (comparisons) are represented by pathways through this array
Three main steps:
1. initialization
2. fill (induction)
3. trace-back
Notes: What database sequences are most similar to (or contain the most similar regions to) my previously uncharacterised sequence? Searching a database needs to be both fast and sensitive, but the two objectives counteract each other; the goal is a high sensitivity for detecting distant sequence relationships between a query sequence and a database. Input = query sequence; output = list of sequences that match the query sequence.
18
Needleman-Wunsch Algorithm: Initialization
Task: align the words "COELACANTH" and "PELICAN" of length m = 10 and n = 7. Construct an (m+1) x (n+1) matrix. Assign the values m · gap and n · gap to elements m and n of the first row and first column. Here, gap = -1. The arrows of these fields point back to the origin.
[Figure: alignment matrix with COELACANTH along the top (first row: 0, -1, -2, ..., -10) and PELICAN down the side.]
19
Needleman-Wunsch Algorithm: Fill
Fill all matrix fields with scores and pointers using a simple operation that requires the scores from the diagonal, vertical, and horizontal neighboring cells. Compute
• match score: value of the upper-left diagonal cell + score for a match or mismatch (+1 or -1)
• horizontal gap score: value of the cell to the left + gap score (-1)
• vertical gap score: value of the cell above + gap score (-1)
Assign the maximum of these 3 scores to the cell and point the arrow in the direction of the maximum score. Examples: max(-1, -2, -2) = -1; max(-2, -2, -3) = -2. In case of a tie, make an arbitrary but consistent choice, e.g. always choose the diagonal over a gap.
[Figure: alignment matrix for COELACANTH and PELICAN with the first cells of row P filled in.]
20
Needleman-Wunsch Algorithm: Trace-back
Trace-back lets you recover the alignment from the matrix. Start at the bottom-right corner and follow the arrows until you get to the beginning.
COELACANTH
-PELICAN--
[Figure: completed alignment matrix for COELACANTH and PELICAN with the trace-back path.]
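The three steps described on the preceding slides (initialization, fill, trace-back) can be sketched in Python as follows. This is only a minimal illustration, assuming the scoring used on the slides (match +1, mismatch -1, gap -1); the function name `needleman_wunsch` and the tie-breaking order (diagonal first, then vertical, then horizontal) are choices made for this sketch, not part of the lecture.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b; returns (score, aligned_a, aligned_b)."""
    m, n = len(a), len(b)

    # 1. Initialization: first row and column hold increasing gap penalties.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap

    def sub(x, y):
        return match if x == y else mismatch

    # 2. Fill: each cell is the maximum of the diagonal, vertical and horizontal moves.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                F[i - 1][j] + gap,
                F[i][j - 1] + gap,
            )

    # 3. Trace-back: walk from the bottom-right corner back to the origin.
    # Instead of storing arrows, the move is re-derived from the scores.
    i, j = m, n
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return F[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("COELACANTH", "PELICAN")
print(score)
print(top)
print(bottom)
```

Several alignments share the optimal score for this pair, so which one is printed depends on the tie-breaking rule.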
21
Smith-Waterman-Algorithm
Smith-Waterman is a local alignment algorithm. SW is a very simple modification of Needleman-Wunsch. Only 3 changes:
• The edges of the matrix are initialized to 0 instead of increasing gap penalties.
• The maximum score is never less than 0. No pointer is recorded unless the score is greater than 0.
• Trace-back starts from the highest score in the matrix and ends at a score of 0.
ELACAN
ELICAN
[Figure: Smith-Waterman matrix for COELACANTH and PELICAN with the trace-back path.]
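The three changes relative to Needleman-Wunsch can be sketched as below. This is an illustrative sketch, again assuming match +1, mismatch -1 and gap -1; the function name `smith_waterman` and the diagonal-first trace-back are assumptions of this example.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment of a and b; returns (best_score, aligned_a, aligned_b)."""
    m, n = len(a), len(b)

    def sub(x, y):
        return match if x == y else mismatch

    # Change 1: edges stay at 0 (no increasing gap penalties).
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Change 2: a score is never allowed to fall below 0.
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                H[i - 1][j] + gap,
                H[i][j - 1] + gap,
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Change 3: trace back from the highest score until a score of 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("COELACANTH", "PELICAN"))
```

For the two example words this recovers the local alignment shown on the slide, ELACAN over ELICAN, with a score of 4.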
22
Differences: Needleman-Wunsch vs. Smith-Waterman
Needleman-Wunsch:
1. global alignments
2. requires the alignment score for a pair of residues to be ≥ 0
3. no gap penalty required
Smith-Waterman:
1. local alignments
2. residue alignment score may be positive or negative
3. requires a gap penalty to work efficiently
4. more suited for alignment of eukaryotic sequences with exons and introns
23
Algorithmic complexity
Dynamic programming methods such as Needleman-Wunsch and Smith-Waterman have O(mn) complexity in both time and memory. Variation: keep only 2 rows at a time instead of allocating the whole matrix. The memory requirement of the alignment algorithm then becomes O(n).
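The two-row variation can be sketched as follows for the global (Needleman-Wunsch) score. This is a minimal sketch assuming match +1, mismatch -1 and gap -1; the function name is hypothetical. Note that it returns only the score: with two rows the pointers needed for a trace-back are no longer available, so recovering the alignment itself in linear memory needs an additional technique such as Hirschberg's divide-and-conquer algorithm.

```python
def nw_score_two_rows(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score using O(len(b)) memory instead of the full matrix."""
    # Row 0 of the matrix: increasing gap penalties.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]  # first column of row i
        for j, cb in enumerate(b, start=1):
            s = match if ca == cb else mismatch
            curr.append(max(prev[j - 1] + s,   # diagonal
                            prev[j] + gap,     # vertical
                            curr[j - 1] + gap))  # horizontal
        prev = curr  # the old row is discarded
    return prev[-1]


print(nw_score_two_rows("COELACANTH", "PELICAN"))
```

Time remains O(mn); only the memory drops from O(mn) to O(n).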
24
Scoring or Substitution Matrices
Scoring matrices serve to better score the quality of sequence alignments.
• For protein/protein comparison: a 20 x 20 matrix of the probabilities that certain amino acids are exchanged for others by random mutations.
• The exchange of amino acids of similar character (Ile, Leu) is more likely (receives a higher score) than the exchange of amino acids of dissimilar character (e.g. Ile, Asp).
• Scoring matrices are assumed to be symmetrical (the exchange Ile → Asp has the same probability as Asp → Ile). Therefore they can be written as triangular matrices.
Notes: The matrix contains scores for matches between residues, according to observed substitution rates across large evolutionary distances. Scoring matrices are designed to detect signal above background, i.e. similarities beyond what would be observed by chance alone. All algorithms that compare protein sequences rely on some scheme to score the equivalencing of each of the 210 possible pairs of amino acids: 190 pairs of different amino acids, (20 x 20 - 20) / 2 = 190, plus 20 pairs of identical amino acids. The choice of matrix determines both the pattern and the extent of substitution in the sequences the database search is most likely to discover.
25
Substitution matrices
Not all amino acids are similar:
• some can be replaced more easily than others
• some mutations occur more frequently than others
• some mutations are more long-lived than others
Mutations prefer certain exchanges:
• some amino acids have similar 3-letter codons
• those residues are more often exchanged by random DNA mutations
Selection prefers certain exchanges:
• some amino acids have similar properties and structure (e.g. Trp cannot be inserted in the protein interior)
The two forces together yield substitution matrices. (From computational biology)
Example of codons: TTT and TTC code for Phe; TTA and TTG code for Leu.
26
PAM250 Matrix
1) Note the 1-letter code for the amino acids: on both axes are the 20 amino acids. Note the blocks of similar amino acids.
2) Symmetric, only one half shown.
3) Diagonal: for example, a high score for matching tryptophans and a "low" score for matching alanines. Cysteine. Leu abundant.
4) Off-diagonal: groups of similar amino acids. K -> F: -5.
A score above zero assigned to two amino acids indicates that these two .. each other more often than expected by chance alone, i.e. they are functionally exchangeable. A negative score indicates that the two amino acids are rarely .. interchangeable, e.g. a basic amino acid for an acidic one, or one with an … side chain for one with an aliphatic side chain.
27
Example
The score of an alignment is the sum of all individual scores of the amino acid (base) pairs of the alignment.
Sequence 1: TCCPSIVARSN
Sequence 2: SCCPSISARNT
Score: 1 12 12 6 2 5 -1 2 6 1 0
=> Alignment Score = 46
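The sum can be reproduced with a few lines of Python. The pair scores in `PAIR_SCORES` are only the eleven values listed on the slide for this example, not a complete substitution matrix; the dictionary and function names are made up for this sketch.

```python
# Pair scores taken from the slide's example (only the pairs that occur there).
PAIR_SCORES = {
    ("T", "S"): 1, ("C", "C"): 12, ("P", "P"): 6, ("S", "S"): 2,
    ("I", "I"): 5, ("V", "S"): -1, ("A", "A"): 2, ("R", "R"): 6,
    ("S", "N"): 1, ("N", "T"): 0,
}


def alignment_score(seq1, seq2, scores):
    """Sum the substitution scores over all aligned (ungapped) residue pairs."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(seq1, seq2):
        # The matrix is symmetric, so look the pair up in either order.
        total += scores[(x, y)] if (x, y) in scores else scores[(y, x)]
    return total


print(alignment_score("TCCPSIVARSN", "SCCPSISARNT", PAIR_SCORES))  # 46
```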
28
Dayhoff Matrix (1)
• derived by M.O. Dayhoff, who collected statistical data on the probabilities of amino acid exchanges
• data set: closely related protein sequences (> 85% identity); advantage: these can be aligned with high certainty
• derive a 20 x 20 matrix of the probabilities of amino acid mutations from the observed frequencies of exchanges
• This matrix is called PAM 1. An evolutionary distance of 1 PAM (point accepted mutation) means that 1 point mutation has occurred per 100 residues, i.e. the two sequences are 99% identical.
Notes: Possibly the most widely used scheme for scoring amino acid pairs is that developed by Dayhoff and co-workers. The system arose out of a general model for the evolution of proteins (1978, Atlas of Protein Sequence and Structure: 1572 changes in 71 groups of closely related proteins). Newer PAM matrices do not differ greatly from the original ones.
Dayhoff and co-workers examined alignments of closely similar sequences, where the likelihood of a particular mutation (e.g. A-D) being the result of a set of successive mutations (e.g. A-x-y-D) was low. Since relatively few families were considered, the resulting matrix of accepted point mutations included a large number of entries equal to 0 or 1. A complete picture of the mutation process, including those amino acids which did not change, was determined by calculating the average ratio of the number of changes a particular amino acid type underwent to the total number of amino acids of that type present in the database.
Procedure: take a list of aligned proteins; every time you see a substitution between two amino acids, increment the similarity score between them; then normalize by how often the amino acids occur in general. Rare amino acids will give rare substitutions.
PAM model of molecular evolution: 1 PAM corresponds to an average change in 1% of all amino acid positions. (The notes also give the expansion "Percentage of Acceptable point Mutations per 10^8 years".) After 100 PAMs of evolution, not every residue will have changed: some will have mutated several times, perhaps returning to their original state, and others not at all. Note that there is no general correspondence between PAM distance and evolutionary time, as different protein families evolve at different rates.
29
Dayhoff Matrix (2)
Log odds matrix: contains the logarithms of the PAM matrix entries.
$$\text{score of mutation } i \to j \;=\; \log\left(\frac{\text{observed mutation rate } i \to j}{\text{expected mutation rate according to amino acid frequency}}\right)$$
The probability of two independent mutational events is the product of the individual probabilities. When using a log odds matrix (i.e. using the logarithm of all values), one obtains the total alignment score as the sum of the scores for every residue pair.
30
Dayhoff Matrix (3)
Derive matrices for larger evolutionary distances by multiplying the PAM1 matrix with itself (i.e. take powers of the PAM1 matrix to obtain PAM250).
PAM250: 2.5 mutations per residue. This corresponds to about 20% matches between two sequences, i.e. mutations are observed at about 80% of all residue positions. This is the default matrix of most sequence analysis packages. However, in principle it is more effective to use a matrix that corresponds to the actual evolutionary distance between the sequences being compared.
PAM250 corresponds to ca. 20% overall sequence identity, which is the lowest sequence similarity for which we can hope to produce a correct alignment by sequence analysis alone.
Notes:
• 2 PAM = 10^8 years?? (to be looked up)
• Rule of thumb: PAM 1 = 1 million years
• PAM 10: on the diagonal S = 7, W = 13. Man vs. gorilla: on average 1-2 amino acids different. You cannot say anything about exchanges.
• PAM 85: on the diagonal 4-13. Man and horse?
• PAM 250: this is about as far as you can go; beyond that you are comparing human with slime mold.
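Taking powers of a mutation probability matrix can be illustrated on a toy example. The sketch below uses a made-up 3-letter alphabet and invented probabilities (`TOY_PAM1` is not Dayhoff's data); it only shows the mechanics of deriving a "PAM250-like" matrix by repeated multiplication.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power (square-and-multiply)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical 1-PAM-like matrix: each row gives the probabilities that a
# residue stays the same (diagonal, about 99%) or mutates into another one.
TOY_PAM1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

Each row still sums to 1, but the diagonal entries are much smaller than in the 1-step matrix, reflecting that residues may have mutated several times.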
31
BLOSUM Matrix
Limitation of the Dayhoff matrix: the matrices based on the Dayhoff model of evolutionary rates are of limited value because the substitution rates were derived from alignments of sequences that are more than 85% identical.
A different path was taken by S. Henikoff and J.G. Henikoff, who used local multiple alignments of distantly related sequences. Advantages:
• larger data sets
• multiple alignments are more robust
• NO EXTRAPOLATION NECESSARY
Notes: They examine multiple alignments of distantly related proteins directly, rather than extrapolating from closely related sequences. An advantage is that this stays closer to observation; a disadvantage is that it yields no evolutionary model. A number of tests suggest that the BLOSUM matrices produced by this method are generally superior to the PAM matrices for detecting biological relationships.
32
BLOSUM Matrix (2)
The BLOSUM matrices (BLOcks SUbstitution Matrix) are based on the BLOCKS database. The BLOCKS database uses the concept of blocks (ungapped amino acid signatures) that are characteristic of protein families. Derive the probabilities of exchange for all amino acid pairs from the observed mutations inside the blocks, then convert them into a log odds BLOSUM matrix. Different matrices are obtained by varying the lower requirement for the level of sequence identity, e.g. the BLOSUM80 matrix is derived from blocks with > 80% identity.
Notes: Built only from the most conserved domains of the BLOCKS database of conserved proteins. Dataset: 2000 blocks of aligned sequence … segments characterizing more than 500 groups of related proteins (1992).
33
Which matrix to use?
• Close relationship: low PAM, high BLOSUM
• Distant relationship: high PAM, low BLOSUM
• Reasonable default parameters: PAM250, BLOSUM62
Notes: At the level of 2,000 PAM, Schwartz and Dayhoff suggest that all the information present in the matrix has degenerated, except that the matrix element for Cys-Cys is 10% higher than would be expected by chance. At the evolutionary distance of 256 PAMs, one amino acid in five remains unchanged, but the amino acids vary in their mutability: 48% of the tryptophans, 41% of the cysteines and 20% of the histidines would be unchanged, but only 7% of the serines would remain.
34
Gap penalties
Besides substitution matrices we need a method to score gaps. How relevant are insertions or deletions relative to substitutions?
Distinguish the introduction of gaps:
aaagaaa
aaa-aaa
from the extension of gaps:
aaaggggaaa
aaa----aaa
Different programs (CLUSTAL-W, BLAST, FASTA) recommend different default parameters, which should be used as a first guess.
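One common way to treat the introduction and the extension of a gap differently is an affine gap penalty: a larger cost for opening a gap and a smaller cost for each additional gap position. The sketch below assumes this scheme; the parameter values (-10 and -1) and the function name are placeholders for illustration, not the defaults of any particular program.

```python
import re


def affine_gap_penalty(aligned, gap_open=-10, gap_extend=-1):
    """Total gap penalty of one aligned sequence under an affine gap model.

    Each run of '-' costs gap_open for its first position plus gap_extend
    for every further position in the same run.
    """
    total = 0
    for run in re.findall(r"-+", aligned):
        total += gap_open + gap_extend * (len(run) - 1)
    return total


print(affine_gap_penalty("aaa-aaa"))     # one gap of length 1
print(affine_gap_penalty("aaa----aaa"))  # one gap of length 4
```

With these placeholder values a single 4-residue gap (-13) is far cheaper than four separate 1-residue gaps (-40), which reflects the idea that one insertion or deletion event can span several residues.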
35
Significance of Alignments (1)
When is an alignment statistically significant? In other words: how different is the score obtained for an alignment from the scores that would result from aligning the test sequence with random sequences? Or: what is the probability that an alignment with this score occurred randomly?
36
Significance of Alignments (2)
Size of database = 20 x 10^6 letters
Peptide    #hits
A          1 x 10^6 (if equally distributed)
AP         50,000
IAP        2,500
LIAP       125
WLIAP      6
KWLIAP     0.3
KWLIAPY    0.015
(SwissProt has 30 million letters, but 20 million is easier to calculate with.)
Hexapeptide search: 20^6 = 64 x 10^6 possibilities; SwissProt has 30 x 10^6 letters => a hexapeptide found in SwissProt is pure chance.
Tripeptide search: 20^3 = 8000 possibilities. If the size of the database is 8000 letters, every tripeptide occurs once!
Always remember: mathematical significance ≠ biological significance.
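The numbers in the table follow from dividing the database size by the number of possible peptides of a given length. A short sketch, assuming (as the slide does) that all 20 amino acids are equally frequent and ignoring edge effects at sequence ends; the function name is made up for this example:

```python
def expected_hits(db_letters, peptide_length, alphabet_size=20):
    """Expected number of chance occurrences of one specific peptide."""
    return db_letters / alphabet_size ** peptide_length


for peptide in ["A", "AP", "IAP", "LIAP", "WLIAP", "KWLIAP", "KWLIAPY"]:
    print(peptide, expected_hits(20e6, len(peptide)))
```

For example, a tetrapeptide is expected 20 x 10^6 / 20^4 = 125 times, matching the entry for LIAP.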
37
BLAST – Basic Local Alignment Search Tool
• finds the highest-scoring locally optimal alignments of a test sequence with all sequences of a database
• very fast algorithm, ca. 50 times faster than dynamic programming
• because BLAST uses a pre-indexed database, it can be used to search very large databases
• is sufficiently sensitive and selective for most purposes
• is robust: default parameters usually work fine
38
BLAST Algorithm, Step 1
For a given word length w (usually 3 for proteins) and a given scoring matrix, construct the list of all words (w-mers) that score at least T when compared with a w-mer of the input sequence.
Test sequence: L N K C K T P Q G Q R L V N Q
Word: P Q G
Related words (score): P Q G 18, P E G 15, P R G 14, P K G 14, P N G 13, P D G 13, P M G 13
Below cut-off (T = 13): P Q A 12, P Q N 12, etc.
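The construction of the related-word list can be sketched as a brute-force enumeration over all 20^3 candidate words. Important assumptions: `KNOWN_SCORES` contains only the pair scores that can be read off the slide's example (e.g. P/P = 7, Q/Q = 5, G/G = 6, since PQG scores 18), and every other pair is given a placeholder score of -4. A complete substitution matrix would be used in practice and may admit further words; all names here are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Pair scores implied by the slide's example; NOT a complete matrix.
KNOWN_SCORES = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "K"): 1,
    ("Q", "N"): 0, ("Q", "D"): 0, ("Q", "M"): 0,
    ("G", "A"): 0, ("G", "N"): 0,
}
DEFAULT_SCORE = -4  # placeholder for all pairs not listed above


def pair_score(x, y):
    """Symmetric lookup with a placeholder for unknown pairs."""
    if (x, y) in KNOWN_SCORES:
        return KNOWN_SCORES[(x, y)]
    return KNOWN_SCORES.get((y, x), DEFAULT_SCORE)


def related_words(word, threshold):
    """All words of the same length scoring at least `threshold` against `word`."""
    hits = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(pair_score(q, c) for q, c in zip(word, candidate))
        if score >= threshold:
            hits["".join(candidate)] = score
    return hits


for w, s in sorted(related_words("PQG", 13).items(), key=lambda kv: -kv[1]):
    print(w, s)
```

Real implementations avoid enumerating every candidate, but for w = 3 the full enumeration of 8000 words is cheap enough to show the idea.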
39
BLAST Algorithm, Step 2
Each related word points to positions in the database (hit list).
[Figure: the related words (P Q G 18, P E G 15, P R G 14, P K G 14, P N G 13, P D G 13, P M G 13), with one of them, PMG, pointing to its positions in the database.]
40
BLAST Algorithm, Step 3
The program tries to extend suitable segments (seeds) in both directions by adding pairs of residues. Residues are added until the score sinks below a cut-off.
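A simplified sketch of the seed extension is given below. It assumes an ungapped extension with a simple +1/-1 score and a "drop-off" rule: extension in one direction stops once the running score has fallen more than `max_drop` below the best score seen so far. The function names, the scoring and the value of `max_drop` are assumptions of this illustration, not BLAST's actual parameters.

```python
def simple_score(x, y):
    """Toy residue score: +1 for identity, -1 otherwise."""
    return 1 if x == y else -1


def extend_seed(query, subject, q, s, w, max_drop=2, score=simple_score):
    """Extend a seed query[q:q+w] / subject[s:s+w] without gaps.

    Returns (best_score, query_start, query_end) of the extended segment.
    """
    seed = sum(score(query[q + k], subject[s + k]) for k in range(w))

    # Extend to the right.
    best, run, right = seed, seed, 0
    k = 0
    while q + w + k < len(query) and s + w + k < len(subject):
        run += score(query[q + w + k], subject[s + w + k])
        if run > best:
            best, right = run, k + 1
        elif best - run > max_drop:
            break
        k += 1

    # Extend to the left, starting from the best score reached so far.
    run, left = best, 0
    k = 1
    while q - k >= 0 and s - k >= 0:
        run += score(query[q - k], subject[s - k])
        if run > best:
            best, left = run, k
        elif best - run > max_drop:
            break
        k += 1

    return best, q - left, q + w + right


best, start, end = extend_seed("COELACANTH", "PELICAN", q=5, s=4, w=3)
print(best, "COELACANTH"[start:end])
```

Starting from the seed CAN, the extension passes over the single A/I mismatch and grows the segment to ELACAN with a score of 4.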
41
different BLAST algorithms
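• BLASTN: compares a nucleotide sequence against a nucleotide database
• BLASTP: compares a protein sequence against a protein database
• BLASTX: compares a nucleotide sequence translated in all 6 reading frames against a protein sequence database
• TBLASTN: compares a protein sequence against a nucleotide database translated in all 6 reading frames
• TBLASTX: compares a translated nucleotide sequence against a translated nucleotide database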
42
BLAST Output (1)
A small probability shows that the hit is likely not random.
43
Significance of BLAST alignment
P-value (probability)
• probability that the alignment score could result from the alignment of random sequences
• the closer P is to 0, the higher the certainty that a hit is a true hit (homologous sequence)
E-value (expectation value)
• E = P * number of sequences in database
• E is the number of alignments of a particular score that can be expected to occur randomly in a sequence database of this size
• if e.g. E = 10, one expects 10 random hits with the same score. Such an alignment is not significant. Use an appropriate threshold in BLAST.
Notes: The Expect value (E) is a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially with the Score (S) that is assigned to a match between two sequences. Essentially, the E value describes the random background noise that exists for matches between sequences. In BLAST 2.0, the Expect value is also used instead of the P value (probability) to report the significance of matches. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. The Expect value is used as a convenient way to create a significance threshold for reporting results. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported. P = probability that the HSP was generated as a chance alignment.
• Orthologs will have extremely significant scores (DNA, protein 10^-30)
• Closely related paralogs will have significant scores (protein 10^-15)
• Distantly related homologs may be hard to identify (protein 10^-4)
Orthologs: the sequences have diverged by speciation, e.g. human, mouse and chicken hemoglobin.
Paralogs: the sequences have diverged by gene duplication, e.g. the α and β hemoglobin genes.
44
Rough guide (A. M. Lesk)
P-value (probability)
• P ≤ …: sequences are identical
• … < P < 10^-50: sequences are almost identical, e.g. alleles or SNPs
• 10^-50 < P < 10^-10: closely related sequences, homology is certain
• 10^-10 < P < …: sequences are usually distantly related
• P > …: similarity probably not significant
E-value (expectation value)
• E ≤ 0.02: sequences probably homologous
• 0.02 < E < 1: homology possible
• E ≥ 1: a good agreement is most likely random
45
Rough guide
Level of sequence identity with optimal alignment:
> 45%: proteins have very similar structure and most likely the same function
> 25%: proteins probably possess a similar fold
18 – 25%: twilight zone, assuming homology is tempting
below that: alignment has little significance
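As a minimal sketch, percent identity can be computed from two already aligned sequences and then mapped to the categories of the rough guide. The function names are hypothetical, and the choice of denominator (columns in which neither sequence has a gap) is an assumption; other conventions exist and give different percentages.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identical residues over all gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences carry a residue.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def identity_zone(pct: float) -> str:
    """Rough-guide category for a percent identity value."""
    if pct > 45.0:
        return "very similar structure, most likely same function"
    if pct > 25.0:
        return "probably similar fold"
    if pct >= 18.0:
        return "twilight zone"
    return "little significance"


if __name__ == "__main__":
    # Toy alignment: 4 identical residues in 5 gap-free columns.
    pct = percent_identity("ACDE-G", "ACDQWG")
    print(pct, identity_zone(pct))
```

A category of "little significance" only says that identity alone is not informative; it does not rule out homology.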
46
Twilight-Zone (1)
Myoglobin from whale and leghemoglobin from lupins are 15% identical with optimal alignment. Both have a very similar tertiary structure; both contain a heme group and bind oxygen. They are remotely related, yet homologous proteins.
Left: whale Mb. Right: Leg Hb.
47
Twilight-Zone (2)
The N- and C-terminal halves of thiosulfate sulfurtransferase (rhodanese) have 11% sequence identity. Because they belong to the same protein, it is assumed that they resulted from gene duplication and divergent evolution. Indeed, the 3D structures of both halves show large similarity.
2ORA: rhodanese
48
Twilight-Zone (3)
The serine proteases chymotrypsin and subtilisin have 12% identity with optimal alignment. Both have the same function and the same catalytic triad of 3 amino acids (Ser – His – Asp). However, the two folds are completely different and the proteins are not related: an example of convergent evolution.
Left: 1AB9, bovine chymotrypsin. Right: 1GCI, Bacillus lentus subtilisin.
49
Summary
Pairwise alignment of sequences is routine but not trivial.
Dynamic programming is guaranteed to find the alignment with the optimal score (Smith-Waterman, Needleman-Wunsch).
Much faster yet still reliable heuristic tools are FASTA and (PSI-)BLAST.
Deeper functional insight into sequences and their relationships comes from multiple sequence alignments (see lecture on phylogenies).
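To make the dynamic programming idea concrete, here is a minimal sketch that computes only the optimal score, not the alignment itself. The scoring parameters (match +1, mismatch -1, linear gap penalty -2) and the function name are illustrative assumptions; real tools use substitution matrices and affine gap penalties.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap: int = -2, local: bool = False) -> int:
    """Optimal pairwise alignment score by dynamic programming.

    local=False: global alignment (Needleman-Wunsch recurrence).
    local=True:  local alignment (Smith-Waterman recurrence).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)         # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


if __name__ == "__main__":
    print(alignment_score("GATTACA", "GATTACA"))                  # global
    print(alignment_score("TTTACGTTT", "GGGACGGGG", local=True))  # local
```

The table has (len(a)+1) x (len(b)+1) cells, so time and memory grow with the product of the sequence lengths, which is why heuristic tools are preferred for database searches.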
50
Growth of Proteomic Data vs. Sequence Data
And we thought we had too much sequence data! This slide deals only with the growth of proteomics data. An even larger amount of other types of data will be created (including metabolic data).
Already in 2003 (in any single month from the second half of 2003 onward, after PNNL's sample-doubling breakthrough), DOE's pilot proteomics facility will produce more raw data in a month than GenBank accumulated in more than 20 years.
PNNL: 24 samples/day x ~24 days x 200 GB/sample = 115 TB, vs. GenBank's ~107 TB.
GenBank's 107 TB figure was calculated as follows: the sequence data in GenBank total 32.6 GB; with the additional character data, GenBank totals 107 GB according to release notes #136, Jun. Multiplying 107 GB by a correction factor of 1000, to represent more realistically the raw data that had to be created to produce the data sets in GenBank, gives 107 TB. This is the amount of raw data that would have been required to produce the most current release of GenBank, which makes the GenBank figure more comparable to raw proteomic data.
A kilobyte is 1024 bytes.
A gigabyte is over one billion bytes (1,000,000,000).
A terabyte is over one trillion bytes (1,000,000,000,000).
A petabyte is over 1,000,000,000,000,000 bytes.
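The comparison is plain arithmetic and can be checked in a few lines. The sketch below uses the numbers from the slide; the function names are hypothetical, and it assumes the decimal convention of 1000 GB per TB, which is what the slide's rounding to 115 TB suggests.

```python
GB_PER_TB = 1000  # decimal convention, assumed here


def pnnl_monthly_raw_tb(samples_per_day: int = 24, days_per_month: int = 24,
                        gb_per_sample: int = 200) -> float:
    """Raw proteomics data produced per month, in TB."""
    return samples_per_day * days_per_month * gb_per_sample / GB_PER_TB


def genbank_raw_equivalent_tb(genbank_gb: int = 107,
                              correction_factor: int = 1000) -> float:
    """GenBank size scaled by the slide's raw-data correction factor, in TB."""
    return genbank_gb * correction_factor / GB_PER_TB


if __name__ == "__main__":
    print("PNNL per month (TB):", pnnl_monthly_raw_tb())
    print("GenBank raw equivalent (TB):", genbank_raw_equivalent_tb())
```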
51
Systems biology
Systems biology is an emerging field that aims at a system-level understanding of biological systems. The aim itself is not new: cybernetics, for example, sought to describe animals and machines in terms of control and communication theory. Unfortunately, molecular biology had only just started at that time, so only phenomenological analysis was possible. With the progress of genome sequencing projects and a range of other molecular biology projects that accumulate in-depth knowledge of the molecular nature of biological systems, we are now at a stage where we can seriously look into the possibility of a system-level understanding solidly grounded in molecular-level understanding.
52
Systems biology
What does it mean to understand at the "system level"?
Unlike molecular biology, which focuses on molecules such as the sequences of nucleic acids and proteins, systems biology focuses on systems that are composed of molecular components. Although systems are composed of matter, the essence of a system lies in its dynamics and cannot be described merely by enumerating its components. At the same time, it is misleading to believe that only the system structure, such as network topology, is important, without paying sufficient attention to the diversity and functionality of the components. Both the structure of the system and its components play indispensable roles, forming a symbiotic state of the system as a whole.
53
Systems biology
Key milestones are:
(1) understanding the structure of the system, such as gene regulatory and biochemical networks, as well as physical structures,
(2) understanding the dynamics of the system, through both quantitative and qualitative analysis as well as the construction of theories/models with powerful predictive capability,
(3) understanding control methods for the system, and
(4) understanding design methods for the system.
There are a number of exciting and profound issues under active investigation, such as the robustness of biological systems, network structures and dynamics, and applications to drug discovery. Systems biology is in its infancy, but it is an area that has to be explored and one that we believe will be the mainstream of the biological sciences in this century.
54
Systems Biology
Nat. Biotech. Nov. 2000, 1147
55
From Genomics to Genetic Circuits
The relationship between the genotype and the phenotype is complex and highly non-linear, and cannot be predicted simply by cataloging the genes found in a genome and assigning functions to them.
56
Genetic Circuits Engineering
57
Analysis of Genetic Circuits
58
Reconstructing Metabolic Networks
59
Translating Biochemistry into Linear Algebra
60
DOE initiative: Genomes to Life
A coordinated effort. Slides borrowed from a talk by Marvin Frazier, Life Sciences Division, U.S. Dept. of Energy.
Facility 1: Production and characterization of proteins
High-throughput production of proteins on a genome-wide scale
Produce affinity reagents for each protein
Biophysical characterizations
Reagents, databases, and computational tools accessible to the broad scientific community
Overcomes a principal roadblock to whole-system analysis
61
Facility I Production and Characterization of Proteins
Estimating Microbial Genome Capability
Computational Analysis:
Genome analysis of genes, proteins, and operons
Metabolic pathways analysis from reference data
Protein machines estimate from PM reference data
Knowledge Captured:
Initial annotation of genome
Initial perceptions of pathways and processes
Recognized machines, function, and homology
Novel proteins/machines (including prioritization)
Production conditions and experience
62
Facility II Whole Proteome Analysis
Modeling Proteome Expression, Regulation, and Pathways
Analysis and Modeling:
Mass spectrometry expression analysis
Metabolic and regulatory pathway/network analysis and modeling
Knowledge Captured:
Expression data and conditions
Novel pathways and processes
Functional inferences about novel proteins/machines
Genome super annotation: regulation, function, and processes (deep knowledge about cellular subsystems)
Measure the proteome and metabolites of a cell or community system under controlled conditions. Gain functional insights by characterizing known and unknown dynamic processes to correlate proteins and machines that work together in a process.
63
Facility III Characterization and Imaging of Molecular Machines
Exploring Molecular Machine Geometry and Dynamics
Computational Analysis, Modeling and Simulation:
Image analysis/cryoelectron microscopy
Protein interaction analysis/mass spec
Machine geometry and docking modeling
Machine biophysical dynamic simulation
Knowledge Captured:
Machine composition, organization, geometry, assembly and disassembly
Component docking and dynamic simulations of machines
Isolate the repertoire of molecular machines. Characterize machines in terms of composition and molecular organization.
High-throughput isolation and identification of complexes from cells
Characterize the interactions of the components in the complex
Simulations for molecular machine function, models for assembly/disassembly of complexes
Interpret, archive and disseminate data, models, and computational simulations to the greater biological community
64
Facility IV Modeling and Analysis of Cellular Systems
Simulating Cell and Community Dynamics
Analysis, Modeling and Simulation:
Couple knowledge of pathways, networks, and machines to generate an understanding of cellular and multi-cellular systems
Metabolism, regulation, and machine simulation
Cell and multicell modeling and flux visualization
Knowledge Captured:
Cell and community measurement data sets
Protein machine assembly time-course data sets
Dynamic models and simulations of cell processes
Measure the structure and properties of a single cell in a population or community under controlled conditions.
Determine the timed choreography of events and the state of molecular machines during dynamic cell processes
65
GTL Computing Roadmap
[Figure: roadmap chart. Axis labels: Computing and Information Infrastructure Capabilities; Biological Complexity. Chart labels: Current U.S. Computing; Comparative Genomics; Genome-scale protein threading; Constrained rigid docking; Constraint-Based Flexible Docking; Protein machine Interactions; Molecular machine classical simulation; Molecule-based cell simulation; Cell, pathway, and network simulation; Community metabolic, regulatory, signaling simulations.]
The red arrow indicates the current level of computing power. More computing power, better ways to analyze algorithms, and better data handling are needed.
Data processing will be remote: databases will be so huge that the current paradigm of localized databases will be inadequate. Databases will have to be distributed across several networks and machines because there is not enough bandwidth to handle large volumes of data.
"The chess board is the world, the pieces are the phenomena of the universe, the rules of the game are what we call the laws of Nature. The player on the other side is hidden from us. We know that his play is always fair, just and patient. But we also know, to our cost, that he never overlooks a mistake, or makes the smallest allowance for ignorance." (Thomas Henry Huxley)
In biology we will soon have the chess board of entire genomes: the information that provides continuity of a species, including our own. We will have the chess pieces of genes and proteins, and once we discover one rule by which the pieces are played, we will be able, through homologous conservation over evolutionary time, to efficiently make hypotheses about other rules with similar chess pieces. Bioinformatics tracks the moves made by biologists and integrates the observed data; computational biology anticipates the moves and finds patterns in the rules of the game with homology and other data analysis.