# Final Exam on 19/2 Lectures 1-12 covered many topics rather broadly.


Some items were already discussed in more depth in Bioinformatics I/II. You don't need to understand the technical details beyond the level at which they were covered in the lecture.

What you need to understand for the first part (lectures 1-12):

- the principles of the methods, at the level they were covered in the lecture
- why particular methods are being used, and what their advantages are
- some facts about genomes

Numerical problems: not covered are: suffix trees, Hidden Markov models, extreme pathways / elementary modes, network topologies, Bayes statistics.

Jansen et al. Science 302, 449 (2003). 26. Lecture WS 2003/04 Bioinformatics III

What to prepare?

- V1: Overview
- V2: codon usage is important; CG content varies; alternative splicing allows tissue-specific variation of gene expression; Phred base calling
- V3: Sequence assembly ... → Suffix trees

V3 – genome assembly; What to prepare?

Arachne: creation of overlap graph

Arachne: table of k-mer occurrences
Find the number of k-mer matches, in the forward or reverse-complement direction, between each pair of reads in R:

1. Obtain all triplets (r, t, v), where r is a read in R, t is the index of a k-mer occurring in r, and v is the direction of occurrence (forward or reverse complement).
2. Sort the set of triplets according to the k-mer indices t.
3. Use the sorted list to create a table T of quadruplets (ri, rj, f, v), where ri and rj are reads that contain at least one common k-mer, v is a direction, and f is the number of k-mers in common between ri and rj in direction v.

Batzoglou PhD thesis (2002)
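The three steps can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Arachne's implementation: the k-mer string itself stands in for the index t, everything is held in memory, and the function names (`reverse_complement`, `kmer_table`) are made up for this example. To avoid counting the same shared k-mer twice, the lower-numbered read is always taken in forward orientation.

```python
from collections import Counter
from itertools import groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_table(reads, k):
    """Return a sorted list of quadruplets (ri, rj, f, v) with ri < rj."""
    # Step 1: collect triplets; stored as (k-mer, read, direction) so that
    # sorting the tuples orders them by k-mer first.
    triplets = set()
    for r, seq in enumerate(reads):
        for v, s in (("fwd", seq), ("rc", reverse_complement(seq))):
            for i in range(len(s) - k + 1):
                triplets.add((s[i:i + k], r, v))

    # Step 2: sort by k-mer. Step 3: scan runs of identical k-mers.
    counts = Counter()
    for _, group in groupby(sorted(triplets), key=lambda x: x[0]):
        occurrences = [(r, v) for _, r, v in group]
        for ri, vi in occurrences:
            if vi != "fwd":
                continue  # the lower-numbered read is taken in forward direction
            for rj, vj in occurrences:
                if rj > ri:
                    counts[(ri, rj, vj)] += 1
    return sorted((ri, rj, f, v) for (ri, rj, v), f in counts.items())


print(kmer_table(["AAACCC", "ACCCTT"], k=3))  # reads share ACC and CCC
```

A real assembler would additionally skip k-mers that occur too often, since those are likely to come from repeats.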

Arachne: table of k-mer occurrences
Here: k = 3. Batzoglou PhD thesis (2002)

Arachne: table of k-mer occurrences
If a k-mer occurs "too often", it is likely part of a repeat sequence, and we should not use it for detecting overlaps.

Implementation:

1. Find the k-mer occurrences (r, t, v) and sort them into 64 files according to the first three nucleotides of each k-mer.
2. For each of the 64 files: load the file into memory, sort it according to t, and store the sorted file.
3. Load the 64 sorted files into memory sequentially and create the table T incrementally.

In practice, k = 8 to 24. Batzoglou PhD thesis (2002)

Perform pairwise alignments between reads that share more than a cutoff number of common k-mers. When k-mers that are too common (occurring more often than a second cutoff) are excluded, it is guaranteed that only O(N) pairwise alignments will be performed. Why are those too common k-mers excluded?

Only a small number of base substitutions and indels is allowed in the overlapping region of two aligned reads, so a dynamic programming alignment is used that disallows deviations of more than a few characters.

Output of the alignment algorithm: for reads ri, rj, quadruplets (b1, b2, e1, e2) of the beginning (b1, b2) and end (e1, e2) positions of the detected overlap region. If a significant overlap region is detected, (ri, rj, b1, b2, e1, e2) becomes a link in the overlap graph G. Batzoglou PhD thesis (2002)

Partial alignments

Three partial alignments of length k=6 between a pair of reads coalesce to yield a single full alignment of length k=19. Vertical bars denote matching bases, whereas x's denote mismatches. This illustrates the commonly occurring situation where an extended k-mer hit is a full alignment between two reads. Batzoglou et al. Genome Res 12, 177 (2002)

Ambiguity created by the presence of repeats
In the absence of sequencing errors and repeats, it would be simple to retrieve all retrievable pairwise distances of reads and to construct G. In the presence of repeats, a link between two reads in G does not necessarily imply true overlap. A "repeat link" is a link in G between two reads that come from different regions in the genome and overlap in a repeated segment. Batzoglou PhD thesis (2002)

Using paired pairs of overlaps to merge reads
Arachne searches for instances of two plasmids of similar insert size with sequence overlaps occurring at both ends (paired pairs). What is the advantage of generating "earmuff" experimental data?

(A) A paired pair of overlaps. The top two reads are end sequences from one insert, and the bottom two reads are end sequences from another. The two overlaps must not imply too large a discrepancy between the insert lengths.

(B) Initially, the top two pairs of reads are merged. Then the third pair of reads is merged in, based on having an overlap with one of the top two left reads, an overlap with one of the top two right reads, and consistent insert lengths. The bottom pair is similarly merged.

Bottom: collections of paired pairs are merged into contigs, and consensus sequences are formed. Batzoglou et al. Genome Res 12, 177 (2002)

Detection of repeat contigs
Some of the identified contigs are repeat contigs, in which nearly identical sequence from distinct regions is collapsed together. They can be detected because (a) repeat contigs usually have an unusually high depth of coverage, and (b) they typically have conflicting links to other contigs.

Contig R is linked to contigs A and B to the right. The distances estimated between R and A and between R and B are such that A and B cannot be positioned without substantial overlap between them. If there is no corresponding detected overlap between A and B, then R is probably a repeat linking to two unique regions to the right. After marking repeat contigs, the remaining contigs should represent the correctly assembled sequence. Batzoglou et al. Genome Res 12, 177 (2002)

Contig assembly

If (a,b) and (a,c) overlap, then (b,c) are expected to overlap. Moreover, one can calculate that shift(b,c) = shift(a,c) - shift(a,b). A repeat boundary is detected toward the right of read a if there is no overlap (b,c), nor any path of reads x1, ..., xk such that (b,x1), (x1,x2), ..., (xk,c) are all overlaps and shift(b,x1) + ... + shift(xk,c) ≈ shift(a,c) - shift(a,b). Batzoglou et al. Genome Res 12, 177 (2002)

The distance d(A,B) (length of gap or negated length of overlap) between two linked contigs A and B can be estimated using the forward-reverse linked reads between them. The distance d(B,C) between two contigs B, C that are linked to the same contig A can be estimated from their respective distances to the linked contig. Design a minimal number of experiments to resolve the following non-identical sequence assembly ... Batzoglou et al. Genome Res 12, 177 (2002)

V4 – genome alignment; What to prepare?

Use Suffix Trees for Genome Alignment
MUMmer: A.L. Delcher et al. 1999, 2002 Nucleic Acids Res.

- Assume the two sequences are closely related (highly similar).
- MUMmer can align two bacterial genomes in less than 1 minute.
- Use a suffix tree to find Maximal Unique Matches.
- Maximal Unique Match (MUM), definition: a subsequence that occurs in two exactly matching copies, once in each input sequence, and that cannot be extended in either direction.
- Key idea: a significantly long MUM is certainly going to be part of the global alignment.

Figure: a maximal unique matching subsequence (MUM) of 39 nt (shown in uppercase) shared by Genome A and Genome B. Any extension of the MUM will result in a mismatch. By definition, a MUM does not occur anywhere else in either genome. Delcher et al. Nucleic Acids Res 27, 2369 (1999)

MUMmer: Key Steps

Locating MUMs (user-defined length). Example sequences: ACTGATTACGTGAACTGGATCCA and ACTCTAGGTGAAGTGATCCA. Resulting alignment (positions 1, 10 and 20 are marked in the figure): ACTGATTACGTGAACTGGATCCA aligned with ACTC--TAGGTGAAGTG-ATCCA.

Definition of MUMs

For two strings S1 and S2 and a parameter l, the substring u is a MUM sequence if:

- |u| > l
- u occurs exactly once in S1 and exactly once in S2 (uniqueness)
- for any character a, neither ua nor au occurs both in S1 and in S2 (maximality)

What is a maximal unique match (MUM)? Why does one search for MUMs in the context of genome alignment? Describe an efficient strategy for genome alignment based on detection of MUMs.
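The definition can be checked directly with a brute-force search. The sketch below is a naive illustration of the three conditions, not the suffix-tree method used by MUMmer, and it is far too slow for real genomes; the function names and the `min_len` parameter (playing the role of l) are assumptions made for this example.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, overlapping ones included."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def naive_mums(s1, s2, min_len):
    """Return (start in s1, start in s2, u) for every MUM u with |u| > min_len."""
    mums = []
    for i in range(len(s1)):
        for j in range(len(s2)):
            if s1[i] != s2[j]:
                continue
            # Maximality on the left: skip matches that could be extended.
            if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
                continue
            # Maximality on the right: extend as far as the strings agree.
            n = 0
            while i + n < len(s1) and j + n < len(s2) and s1[i + n] == s2[j + n]:
                n += 1
            u = s1[i:i + n]
            # Length threshold and uniqueness in both strings.
            if n > min_len and count_occurrences(s1, u) == 1 \
                    and count_occurrences(s2, u) == 1:
                mums.append((i, j, u))
    return mums


print(naive_mums("AACGTTT", "CCCGTGG", min_len=2))  # [(2, 2, 'CGT')]
```

Positions are 0-based here. The quadratic number of start pairs is exactly what the suffix-tree approach avoids.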

How to find MUMs? Naïve Approach
Compare all subsequences of genome A with all subsequences of genome B: O(nn).

Suffix Tree: one problem in the final exam will involve suffix trees, similar to the problems on the assignment sheet.

Suffix Tree

Suffix trees have been well established for more than 20 years. Some properties:

- A "suffix" starts at any position i of the sequence and goes until its end.
- A string of length N has N suffixes, and its suffix tree has N leaves.
- Each internal node has at least 2 child nodes.
- No 2 edges out of the same node can have edge labels beginning with the same character.
- Add \$ to the end of the string, e.g. CACATAG\$.

Constructing a Suffix Tree
CACATAG\$. Suffixes: 1. CACATAG\$ (Figure: suffix tree after inserting suffix 1, a single edge labelled CACATAG\$ leading to leaf 1.)

Constructing a Suffix Tree
CACATAG\$. Suffixes: 1. CACATAG\$ 2. ACATAG\$ (Figure: suffix tree after inserting suffixes 1 and 2, with leaves 1 and 2.)

Searching a Suffix Tree
Search Pattern: CATA (Figure: complete suffix tree of CACATAG\$ with leaves numbered 1 to 8.)

Searching a Suffix Tree
Search Pattern: ATCG (Figure: complete suffix tree of CACATAG\$ with leaves numbered 1 to 8.)
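To make the two searches concrete, here is a small sketch that builds an uncompressed suffix trie (one node per character) for CACATAG\$ and looks up both patterns. It is meant only to show how a search walks down from the root; it uses quadratic space and is not the compact, linear-time suffix tree construction. The function names and the nested-dictionary representation are assumptions made for this example.

```python
def build_suffix_trie(text):
    """Build an uncompressed suffix trie as nested dicts.

    The node reached by a complete suffix stores its 1-based start
    position under the key "leaf".
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root


def find_occurrences(root, pattern):
    """Return the sorted 1-based start positions of pattern in the text."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # fell off the tree: the pattern does not occur
        node = node[ch]
    # Every leaf below the reached node is one occurrence of the pattern.
    positions, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("CACATAG$")
print(find_occurrences(trie, "CATA"))  # [3]
print(find_occurrences(trie, "ATCG"))  # []
```

The search for CATA ends inside the tree above leaf 3, so the pattern occurs at position 3; the search for ATCG fails after reading AT, because no edge continues with C.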

Sorting the MUMs

- MUMs are sorted according to their positions in genome A.
- A variation of Longest Increasing Subsequence (LIS) is used to find sequences in ascending order in both genomes.
- The variation takes into account the lengths of the sequences represented by MUMs, and overlaps.
- Running time is O(k log k), where k is the number of MUMs.

(Figure: top alignment with MUMs 1 to 7 placed in Genome A and Genome B; bottom alignment with MUMs 1, 2, 4, 6 and 7 only.) Each MUM is indicated only by a number, regardless of its length. The top alignment shows all MUMs. The shift of MUM 5 in Genome B indicates a transposition. The shift of MUM 3 could be simply a random match or part of an inexact repeat sequence. The bottom alignment shows just the LIS of MUMs in Genome B.
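A plain LIS can be computed in O(k log k) with binary search, as sketched below. This is the textbook algorithm, not MUMmer's variant, which also weighs MUM lengths and overlaps. The Genome B ordering used in the example is hypothetical: it was chosen so that MUMs 3 and 5 are out of order, as the caption describes, and is not read off the figure.

```python
from bisect import bisect_left


def longest_increasing_subsequence(seq):
    """Return one longest strictly increasing subsequence of seq."""
    tail_values = []   # tail_values[m]: smallest tail of an increasing run of length m + 1
    tail_index = []    # index in seq of that tail element
    previous = [-1] * len(seq)  # back-pointers for reconstruction
    for i, x in enumerate(seq):
        pos = bisect_left(tail_values, x)
        if pos == len(tail_values):
            tail_values.append(x)
            tail_index.append(i)
        else:
            tail_values[pos] = x
            tail_index[pos] = i
        previous[i] = tail_index[pos - 1] if pos > 0 else -1

    # Walk the back-pointers from the tail of the longest run.
    result = []
    i = tail_index[-1] if tail_index else -1
    while i != -1:
        result.append(seq[i])
        i = previous[i]
    return result[::-1]


# MUMs numbered by their order in Genome A, listed in a hypothetical Genome B order.
order_in_genome_b = [1, 3, 2, 4, 6, 7, 5]
print(longest_increasing_subsequence(order_in_genome_b))  # [1, 2, 4, 6, 7]
```

The MUMs that are dropped (3 and 5 in this example) are the ones that would have to be explained as transpositions, random matches, or repeats.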

4 types of gaps in MUM alignment
These examples are drawn from the alignment of the two M. tuberculosis genomes. What types of gaps can exist? Which ones occur most frequently?

In conclusion, we think MUMmer is a major breakthrough in full genome alignment, and MUMmer 2 has made some further improvements in time and space requirements. It is possible to improve the data structure further, for example by using suffix arrays. It is also worth noting that the principle of MUMmer has been extended to multiple genome alignment, implemented in a program called MGA. Delcher et al. Nucleic Acids Res 27, 2369 (1999)

Closing the Gaps

- SNP
  - Simple case: gap of one base between adjacent MUMs
  - When adjacent to repeat sequences: treated as tandem repeats
- Variable / polymorphic region
  - Small region: dynamic programming alignment
  - Large region: recursively apply MUMmer with reduced minimum cut-off length
- Insertions / deletions
  - Transposition: out of alignment order
  - Simple insertion: not in the alignment
- Repeats
  - Tandem repeats are detected by overlapping MUMs
  - Other repeats (e.g. duplications) are treated as gaps

Close gaps by performing local alignment on the portion between the aligned MUMs (using e.g. Smith-Waterman). How does one close remaining gaps?

some positions are not uniquely defined
Repeat sequences surrounded by unique sequences. For the purposes of illustration, other characters besides the four DNA nucleotides are used. Are all positions well defined by the preceding steps? What difficulties do you encounter during the algorithm using MUMs ... ? Delcher et al. Nucleic Acids Res 27, 2369 (1999)

What to prepare? V5 – genome rearrangement

Formal Approach: Sorting by Reversals
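The order of genes in 2 organisms is represented by permutations $\pi = \pi_1 \pi_2 \ldots \pi_n$ and $\sigma = \sigma_1 \sigma_2 \ldots \sigma_n$. A reversal $\rho(i,j)$ of an interval $[i,j]$ is the permutation

$$\rho(i,j) = \begin{pmatrix} 1 & \ldots & i-1 & i & i+1 & \ldots & j-1 & j & j+1 & \ldots & n \\ 1 & \ldots & i-1 & j & j-1 & \ldots & i+1 & i & j+1 & \ldots & n \end{pmatrix}$$

$\rho(i,j)$ has the effect of reversing the order of $\pi_i \pi_{i+1} \ldots \pi_j$ and transforming $\pi_1 \ldots \pi_{i-1} \pi_i \ldots \pi_j \pi_{j+1} \ldots \pi_n$ into $\pi \cdot \rho(i,j) = \pi_1 \ldots \pi_{i-1} \pi_j \ldots \pi_i \pi_{j+1} \ldots \pi_n$.

Given permutations $\pi$ and $\sigma$, the reversal distance problem is to find a series of reversals $\rho_1, \rho_2, \ldots, \rho_t$ such that $\pi \cdot \rho_1 \cdot \rho_2 \cdots \rho_t = \sigma$ and $t$ is minimal. $t$ is called the reversal distance between $\pi$ and $\sigma$.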
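As a small illustration of the definition, the following sketch applies a reversal to a permutation stored as a Python list. The function name `apply_reversal` and the example permutation are assumptions made for this example; positions i and j are 1-based and inclusive, as in the definition above.

```python
def apply_reversal(pi, i, j):
    """Return pi . rho(i, j): pi with the elements at positions i..j reversed."""
    if not 1 <= i <= j <= len(pi):
        raise ValueError("need 1 <= i <= j <= n")
    return pi[:i - 1] + pi[i - 1:j][::-1] + pi[j:]


pi = [3, 1, 2, 4]
step1 = apply_reversal(pi, 1, 2)     # [1, 3, 2, 4]
step2 = apply_reversal(step1, 2, 3)  # [1, 2, 3, 4]
print(step1, step2)
```

Two reversals transform this example into the identity permutation. No single reversal does, so its reversal distance to the identity is 2.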

Why is sorting unsigned reversals NP-hard?
Caprara, Proceedings of RECOMB 97. "Analogy with problem of finding the maximum number of edge-disjoint alternating cycles in a suitably-defined bicolored graph."

Breakpoint Graph

Sorting a permutation is a hard problem. Breakpoints were introduced by Watterson et al. (1982) and by Nadeau and Taylor (1984), and correlations were noticed between the reversal distance and the number of breakpoints.

Let $i \sim j$ if $|i - j| = 1$. Extend a permutation $\pi = \pi_1 \pi_2 \ldots \pi_n$ by adding $\pi_0 = 0$ and $\pi_{n+1} = n + 1$. We call a pair of elements $(\pi_i, \pi_{i+1})$, $0 \le i \le n$, of $\pi$ an adjacency if $\pi_i \sim \pi_{i+1}$, and a breakpoint if $\pi_i \not\sim \pi_{i+1}$.

As the identity permutation has no breakpoints, sorting by reversals corresponds to eliminating breakpoints. The observation that every reversal can eliminate at most 2 breakpoints implies that the reversal distance satisfies $d(\pi) \ge b(\pi) / 2$, where $b(\pi)$ is the number of breakpoints in $\pi$. However, this bound is not tight, since assuming that every reversal removes two breakpoints is a clear overestimate.
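Counting breakpoints is easy to do by hand or in code. The sketch below follows the definition above for unsigned permutations of 1..n; the function names are made up for this example.

```python
import math


def count_breakpoints(pi):
    """Number of breakpoints b(pi) of an unsigned permutation of 1..n."""
    extended = [0] + list(pi) + [len(pi) + 1]  # add pi_0 = 0 and pi_{n+1} = n + 1
    return sum(abs(a - b) != 1 for a, b in zip(extended, extended[1:]))


def reversal_lower_bound(pi):
    """Lower bound on the reversal distance: d(pi) >= b(pi) / 2, rounded up."""
    return math.ceil(count_breakpoints(pi) / 2)


pi = [3, 1, 2, 4]
print(count_breakpoints(pi))     # 3: the pairs (0,3), (3,1) and (2,4)
print(reversal_lower_bound(pi))  # 2
```

For this small example the bound happens to equal the true reversal distance; in general it only limits the distance from below.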

Breakpoint Graph

The breakpoint graph of a permutation $\pi$ is an edge-colored graph $G(\pi)$ with $n + 2$ vertices $\{\pi_0, \pi_1, \ldots, \pi_n, \pi_{n+1}\} = \{0, 1, \ldots, n, n+1\}$. We join vertices $\pi_i$ and $\pi_{i+1}$ by a black edge for $0 \le i \le n$. We join vertices $\pi_i$ and $\pi_j$ by a gray edge if $\pi_i \sim \pi_j$.

A breakpoint graph is obtained by a superposition of a black path traversing the vertices 0, 1, ..., n, n+1 in the order given by the permutation $\pi$ and a gray path traversing the vertices in the order given by the identity permutation. (Figure: black path, gray path, and their superposition forming the breakpoint graph.)

Construct a breakpoint graph for the following situation ... Give a formula to estimate the breakpoint distance.

What to prepare? V6 – human mouse comparison

Identify regions of conserved synteny
Syntenic segment: maximal region in which a series of landmarks occur in the same order on a single chromosome in both species.

Syntenic block: one or more syntenic segments that are all adjacent on the same chromosome in human and on the same chromosome in mouse; they may otherwise be shuffled with respect to order and orientation. (Only regions > 300 kb are considered.)

Each genome could be parsed into a total of 342 conserved syntenic segments. On average, each landmark resides in a segment containing 1600 other landmarks. Segments vary greatly in length: 303 kb to 64.9 Mb. About 90.2% of the human and 93.3% of the mouse genome unambiguously reside within conserved syntenic segments. The mouse genome. Nature 420,

Conservation of synteny between human and mouse
A typical 510-kb segment of mouse chromosome 12 that shares common ancestry with a 600-kb section of human chromosome 14 is shown. Blue lines connect the reciprocal unique matches in the two genomes. The cyan bars represent sequence coverage in each of the two genomes for the regions. In general, the landmarks in the mouse genome are more closely spaced, reflecting the 14% smaller overall genome size. Discuss the alignment of a human and mouse chromosome. The mouse genome. Nature 420,

Size distribution of elements with conserved synteny
Draw a distribution of the elements with conserved synteny and discuss the plot. Size distribution of segments and blocks with synteny conserved between mouse and human. a, b, The number of segments (a) and blocks (b) with synteny conserved between mouse and human in 5-Mb bins (starting with 0.3–5 Mb) is plotted on a logarithmic scale. The dots indicate the expected values for the exponential curve of random breakage given the number of blocks and segments, respectively. The mouse genome. Nature 420,

Genome rearrangement?

Using the methods from lecture 5 (Pevzner & Tesler algorithms), one can compute the minimal number of rearrangements needed to "transform" one genome into the other. When applied to the 342 syntenic segments, the most parsimonious (= shortest) path has 295 rearrangements. The analysis suggests that chromosomal breaks may have a tendency to reoccur in certain regions. With only two species, however, it is not yet possible to recover the ancestral chromosomal order or to reconstruct the precise pathway of rearrangements. This will become possible soon, as more and more mammalian species are sequenced. The mouse genome. Nature 420,

Next: Genome landscape

- genome expansion and contraction: What accounts for the smaller size of the mouse genome? See section on repeats.
- (G + C) content: In mammalian genomes, there is a positive correlation between gene density and (G + C) content.
- CpG islands

The mouse genome. Nature 420,

What are CpG islands?

- CpG islands are short stretches of DNA with a higher frequency of the CG sequence.
- CpG dinucleotides are rare in mammalian DNA.
- DNA methylation only occurs at CpG sites.
- Methylated cytosines may be converted to thymine by deamination over the course of evolution: CpG → TpG.
- CpG islands are usually not methylated.

What are CpG islands?

Definition from Gardiner-Garden & Frommer:

- at least 200 bases long
- G+C content: > 50%
- observed CpG / expected CpG ratio: >= 0.6

Definition from Takai & Jones:

- longer than 500 bp
- G+C content: > 55%
- observed CpG / expected CpG ratio: >= 0.65

With the second definition, the CpG islands are more likely to be associated with the 5' regions of genes, and most Alu elements are excluded. There are about 29,000 such regions in the human genome.
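The three criteria can be checked for a single sequence window as sketched below. The slide does not spell out how the expected CpG count is computed; the sketch assumes the commonly used formula, expected CpG = (number of C × number of G) / window length. The function names and the example sequences are made up for illustration, and real island finders additionally slide the window along the genome.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (seq must not be empty)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio, assuming expected = (#C * #G) / length."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def looks_like_cpg_island(seq, min_len=200, min_gc=0.50, min_ratio=0.60):
    """Apply the length, G+C and obs/exp criteria; defaults follow the first definition."""
    return (len(seq) >= min_len
            and gc_content(seq) > min_gc
            and cpg_obs_exp(seq) >= min_ratio)


print(looks_like_cpg_island("CG" * 100))              # True: CpG-rich window
print(looks_like_cpg_island("C" * 100 + "G" * 100))   # False: G+C rich, but only one CpG
# Stricter thresholds in the spirit of the second definition:
print(looks_like_cpg_island("CG" * 100, min_len=501, min_gc=0.55, min_ratio=0.65))  # False: too short
```

The second example shows why the ratio criterion matters: high (G+C) content alone does not make a CpG island.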

CpG islands & Genes

- CpG islands located in the promoter regions of genes can play important roles in gene silencing.
- Housekeeping genes
  - Almost all housekeeping genes are associated with at least one CpG island.
  - The CpG islands start 5' of the transcription start site and cover one or more exons and introns.
- Tissue-specific genes
  - About 40% of tissue-specific genes are associated with islands.
  - The position of these islands is not as strongly biased toward the transcription start site as in the housekeeping genes.
- Not all CpG islands are associated with genes. Ioshikhes & Zhang determined features that discriminate promoter-associated from non-associated CpG islands.
- There are methylation-prone and methylation-resistant CpG islands. Feltus et al. found patterns that discriminate methylation-prone from methylation-resistant CpG islands.

(G + C) content

The overall distribution of local (G + C) content is significantly different between the mouse (blue) and human (red) genomes. In human, 1.4% of the windows have (G + C) > 56% and 1.3% have (G + C) < 33%. Such extreme deviations are absent in the mouse genome. The reason for this difference is unknown → I am not going to ask you for this! Both species have 75-80% of genes residing in the (G+C)-richest half of their genome (see below). Mouse shows similar extremes of gene density despite being less extreme in (G+C) content. The mouse genome. Nature 420,

CpG islands

Please assign the two chromosomes to human or mouse and explain.

(G+C) content and density of CpG islands show more variability in human (red) than mouse (blue) chromosomes. a, The (G+C) content for each of the mouse chromosomes is relatively similar, whereas human chromosomes show more variation; chromosomes 16, 17, 19 and 22 have higher (G+C) content, and chromosome 13 lower (G+C) content. b, Similarly, the density of CpG islands is relatively homogeneous for all mouse chromosomes and more variable in human, with the same exceptions. Note that the mouse and human chromosomes are matched by chromosome number, not by regions of conserved synteny. The mouse genome. Nature 420,

Repeats

All mammals have essentially the same 4 classes of transposable elements:

1. LINE: autonomous long interspersed nucleotide elements
2. SINE: LINE-dependent, short RNA-derived short interspersed nucleotide elements
3. LTR: retrovirus-like elements with long terminal repeats
4. DNA transposons: move by a cut-and-paste mechanism of DNA sequence

Classes 1 - 3 procreate by reverse transcription of an RNA intermediate.

Interspersed repeats

Lineage-specific repeats make up 32.4% of the mouse genome vs. 24.4% of the human genome. The mouse genome. Nature 420,

Interspersed repeats

Mouse lacks ancestral repeats; they comprise only 5% of the mouse genome vs. 22% of the human genome. Median divergence levels of 18 subfamilies of interspersed repeats that were active shortly before the human-rodent speciation indicate an approximately twofold higher average substitution rate in mouse than in human. This is an average. Currently, the substitution rate per year in mouse is probably fivefold higher than in human.

Comparison of ancestral repeats to their consensus sequence also allows an estimate of the rate of occurrence of small (<50 bp) insertions and deletions. Both species show a net loss of nucleotides. The overall loss due to small indels in ancestral repeats is at least twofold higher in mouse than in human. (This contributes ca. 1-2% to the smaller size of the mouse genome.) The mouse genome. Nature 420,

Density of interspersed repeat classes
In both species, there is a strong increase in SINE density and a decrease in L1 density with increasing (G+C) content, with the latter particularly marked in the mouse. Another notable contrast is that in mouse, overall interspersed repeat density gradually decreases 2.5-fold with increasing (G+C) content, whereas in human the overall repeat density remains quite uniform. The mouse genome. Nature 420,

Similar repeats accumulate in orthologous locations
Contrast in the genomic distribution of LINEs and SINEs: whereas LINEs are strongly biased towards (A + T)-rich regions, SINEs are strongly biased towards (G + C)-rich regions. Are (A + T) and (G + C) content truly causative factors or merely reflections of an underlying biological process? Interpretation of the analysis: SINE density is influenced by genomic factors that are correlated with (G + C) content but that are distinct from (G + C) content per se. Please assign the following chromosomes to mouse or human and explain. The mouse genome. Nature 420,

Comparison of mouse and human gene sets
Approximately 99% of mouse genes have a homologue in the human genome. For 96%, the homologue lies within a similar conserved syntenic interval in the human genome. For 80% of mouse genes, the best match in the human genome in turn has its best match against that same mouse gene. These are termed 1:1 orthologues.

For less than 1% of the predicted mouse genes there was no homologous predicted human gene. Genes that seem to be mouse-specific may correspond to human genes that are still missing due to the incompleteness of the human genome sequence. What could be the reason for the remaining 1% of genes without homologue? De novo gene addition in the mouse lineage and gene deletion in the human lineage have not significantly altered the gene repertoire.

Gene ontology annotations
Gene ontology (GO) annotations for mouse and human proteins. The GO terms assigned to mouse (blue) and human (red) proteins based on sequence matches to InterPro domains are grouped into approximately a dozen categories. These categories fall within each of the larger ontologies of cellular component (a), molecular function (b) and biological process (c). In general, mouse has a similar percentage of proteins compared with human in most categories. The apparently significant difference between the number of mouse and human proteins in the translational apparatus category of the cellular component ontology may be due to ribosomal protein pseudogenes incorrectly assigned as genes in mouse. What protein properties can be assigned by GO? The mouse genome. Nature 420,

Evolution of orthologues
Two measures:

- percentage of amino acid identity
- KA / KS ratio

Orthologues generally have lower values for KA / KS, e.g. < 0.05, because the proteins are subject to relatively strong purifying selection. The mouse genome. Nature 420,

Purifying selection Domain prediction with SMART:
Domains are under greater purifying selection than regions not containing domains. This is consistent with the hypothesis that domains are under greater structural and functional constraints than unstructured, domain-free regions. Also, domain families with enzymatic activity were found to have a lower KA / KS ratio than non-enzymatic domains.

Summary

* The mouse genome is about 14% smaller than the human genome. The difference probably reflects a higher rate of deletion in mouse.
* Over 90% of the mouse and human genomes can be partitioned into corresponding regions of conserved synteny (segments in which the gene order in the most recent common ancestor has been conserved in both species).
* At the nucleotide level, ca. 40% of the human genome can be aligned to the mouse genome. These sequences seem to represent most of the orthologous sequences that remain in both lineages from the common ancestor. The rest was probably deleted in one or both genomes.
* The neutral substitution rate has been roughly half a nucleotide substitution per site since the divergence of the species. About twice as many of these substitutions have occurred in mouse as in human.

Summary

* The proportion of small ( bp) segments in the mammalian genome that is under (purifying) selection is ca. 5%, i.e. much higher than can be explained by protein-coding sequences alone. → The genome contains many additional features (UTRs, regulatory elements, non-protein-coding genes, chromosomal structural elements) under selection for biological function!
* The mammalian genome is evolving in a non-uniform manner, with various measures of divergence showing substantial variation across the genome.
* Mouse and human genomes each seem to contain ca protein-coding genes. The proportion of mouse genes with a single identifiable orthologue in the human genome is ca. 80%. The proportion of mouse genes without any homologue currently detectable in the human genome (and vice versa) is < 1%.

What to prepare? V8 – protein phylogeny

Traditional phylogenetic tree reconstruction is based on the analysis (e.g. level of conservation of amino acids) of individual genes/proteins. Genetic distance is defined as # mismatches / # matches. Sequence conservation depends on physico-chemical properties of amino acids (and on genome context such as G+C content).

In the last lecture (mouse vs. man) we saw that many more genomic elements than just the genes are conserved between related species. Therefore, genome rearrangement studies that are based on genome-wide analysis of gene orders rather than on individual genes may provide a more general picture of evolution. In the future, both approaches should probably be combined.

Reconstruction of phylogenetic trees from WG data
1. Phylogeny reconstruction as an optimization problem? Attempt to reconstruct an evolutionary scenario with a minimum number of permitted evolutionary events (e.g. duplications, insertions, deletions, inversions, transpositions) on a tree → all known approaches are NP-hard. Also, no automated tool exists so far.
2. Estimate leaf-to-leaf distances (based on some metric) between all genomes. Then use a standard distance-based method such as neighbour-joining to construct the tree. Such approaches are quite fast but cannot recover the ancestral gene order.
   - 2a. Breakpoint phylogeny (Blanchette & Sankoff) for the special case in which the genomes all have the same set of genes, and each gene appears once. Use the breakpoint distance as distance matrix.

Which evolutionary method allows you to reconstruct ancestral gene order?

Reversal distance problem
Although the reversal distance for a pair of genomes can be computed in polynomial time (Hannenhalli & Pevzner 1999 and others), its use in studies of multiple genome rearrangements was somewhat limited, since it was not clear how to combine pairwise rearrangement scenarios into a multiple rearrangement scenario. In particular, Caprara (1997) demonstrated that even the simplest version of the Multiple Genome Rearrangement Problem, the Median Problem, is NP-hard. Therefore, this line of research was abandoned for a while in favor of the breakpoint analysis approach (see Blanchette & Sankoff). The existing tools BPAnalysis and GRAPPA use the so-called breakpoint distance to derive rearrangement scenarios.

Breakpoint phylogeny

When each genome has the same set of genes and each gene appears exactly once, a genome can be described by a (circular or linear) ordering (= permutation) of these genes. Each gene has either positive (gi) or negative (-gi) orientation.

Given 2 genomes G and G' on the same set of genes, a breakpoint in G is defined as an ordered pair of genes (gi, gj) such that gi and gj appear consecutively in that order in G, but neither (gi, gj) nor (-gj, -gi) appears consecutively in that order in G'. The breakpoint distance between two genomes is simply the number of breakpoints between that pair of genomes.

The breakpoint score of a tree in which each node is labelled by a signed ordering of genes is then the sum of the breakpoint distances along the edges of the tree.
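The breakpoint distance and the breakpoint score of a tree can be sketched as follows. The sketch assumes linear genomes written as lists of signed integers (gene 3 in negative orientation is -3) and does not add any end markers; the function names and the example genomes are made up for illustration.

```python
def breakpoint_distance(g1, g2):
    """Number of adjacent pairs of g1 that are not conserved in g2."""
    conserved = set()
    for a, b in zip(g2, g2[1:]):
        conserved.add((a, b))    # same adjacency, same strand
        conserved.add((-b, -a))  # same adjacency read from the other strand
    return sum((a, b) not in conserved for a, b in zip(g1, g1[1:]))


def breakpoint_score(tree_edges, genomes):
    """Sum of breakpoint distances over the edges of a labelled tree."""
    return sum(breakpoint_distance(genomes[u], genomes[v]) for u, v in tree_edges)


genomes = {
    "A": [1, 2, 3, 4],
    "G1": [1, -3, -2, 4],  # A with the segment 2..3 reversed
    "G2": [-2, -1, 3, 4],  # A with the segment 1..2 reversed
}
print(breakpoint_distance(genomes["A"], genomes["G1"]))           # 2
print(breakpoint_score([("A", "G1"), ("A", "G2")], genomes))      # 3
```

A reversal in the interior of a linear genome creates two breakpoints; the reversal in G2 touches the end of the genome and therefore creates only one under this end-marker-free convention.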

Distance matrices for 11 species
The number of breakpoints indicates that many of the gene orders seem to be random permutations of each other (random genomes with n genes would have n – 0.5 breakpoints with each other, on average). (Figure: three distance matrices: number of breakpoints; minimal inversion distance (Hannenhalli & Pevzner); combined inversion/transposition distance.) Blanchette, Sankoff, J Mol Evol (1999)

Tree inference
Compare 3 criteria for optimum tree topology in the light of theories of metazoan evolution: (a) Neighbour-joining (b) Fitch-Margoliash (c) minimum breakpoint. (a) and (b) operate on the genome data as reduced to the breakpoint distance matrix. (c) is based on the gene orders themselves.

Tree from neighbor-joining analysis
(a) Neighbour joining disrupts the deuterostomes by grouping ART with the human genome, and disrupts the molluscs. (b) The Fitch-Margoliash routine minimizes the sum of squared differences between distance matrix entries and the total path length in the tree between two species, each divided by the square of the matrix entry. It gives a worse grouping than (a): the rapidly evolving lineages, NEM, snails, and ECH are grouped together, thus completely disrupting both the CHO+ECH grouping and the MOL grouping. Blanchette, Sankoff, J Mol Evol (1999)
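The Fitch-Margoliash criterion described above can be written down directly. The following is a minimal sketch that only evaluates the criterion for one given tree with given branch lengths; it does not search over trees or optimize branch lengths. The tree encoding (nested dictionaries), the function names, and the three-species example are assumptions made for illustration.

```python
def path_lengths(tree, leaves):
    """Path length between every pair of leaves.

    tree maps each node to a dict {neighbour: branch length} (undirected).
    """
    def walk(start):
        dist = {start: 0.0}
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt, length in tree[node].items():
                if nxt not in dist:
                    dist[nxt] = dist[node] + length
                    stack.append(nxt)
        return dist

    result = {}
    for a in leaves:
        from_a = walk(a)
        for b in leaves:
            if a < b:
                result[(a, b)] = from_a[b]
    return result


def fitch_margoliash_score(observed, tree):
    """Sum over species pairs of (d - p)^2 / d^2.

    d is the distance matrix entry, p the path length in the tree.
    Keys of `observed` are pairs (a, b) with a < b.
    """
    leaves = sorted({name for pair in observed for name in pair})
    fitted = path_lengths(tree, leaves)
    return sum((d - fitted[pair]) ** 2 / d ** 2 for pair, d in observed.items())


# Hypothetical 3-species star tree with internal node "X".
TREE = {
    "A": {"X": 1.0},
    "B": {"X": 2.0},
    "C": {"X": 3.0},
    "X": {"A": 1.0, "B": 2.0, "C": 3.0},
}
OBSERVED = {("A", "B"): 3.0, ("A", "C"): 4.0, ("B", "C"): 4.0}
```

In this example only the pair (B, C) is misfit (observed 4, path length 5), so the score is 1/16.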

Tree from minimal breakpoint analysis (BPA)
A minimum breakpoint tree is one in which (a) a genome is reconstructed for each ancestral node, (b) the number of breakpoints is calculated for each pair of nodes directly connected by a branch of the tree, (c) the sum is taken over all branches, where the sum is minimal over all possible trees. Problematic: all possible trees on the set of given data genomes need to be evaluated. The median problem is analogous to the travelling salesman problem. Blanchette et al. didn't want to question the original 3 models solely on the basis of this data. All trees not consistent with any of the 3 models were discarded, leaving 105 trees! Blanchette, Sankoff, J Mol Evol (1999)

Drawbacks of breakpoint analysis: costly + ambiguous
Let us consider a simple example: Suppose that the genomes G1, G2, and G3 evolved from the ancestral genome A = by one reversal each such that G1 = G2 = G3 = Searching for the breakpoint median will produce 4 optimal solutions: A, but also G1, G2, and G3. If the median is A, then we have two breakpoints on each edge of the tree for a total of six. But if the median is any of the three genomes, we also get a total of 6 = breakpoints. Therefore, the breakpoint median fails to unambiguously identify the ancestor.

Multiple Genome Rearrangement Problem
Find a phylogenetic tree describing the most „plausible“ rearrangement scenario for multiple species. The genomic distance in the case of genome rearrangement is defined in terms of (1) reversals, (2) translocations, (3) fusions, and (4) fissions, which are the most common rearrangement events in multichromosomal genomes. The special case of three genomes (m = 3) is called the Median Problem: given the gene order of three unichromosomal genomes G1, G2, and G3, find the ancestral genome A which minimizes the total reversal distance d(A, G1) + d(A, G2) + d(A, G3).

Multiple Genome Rearrangement Problem
The breakpoint analysis attempts to solve the Median Problem by minimizing the breakpoint distance instead of the reversal distance. However, the breakpoint distance, in contrast to the reversal distance, does not correspond to a minimum number of rearrangement events! As a result, the breakpoint median recovered by breakpoint analysis rarely corresponds to the ancestral median, the genome that minimizes the overall number of rearrangements in the evolutionary scenario. New approach: Given a set of m permutations (existing genomes) of order n, find a tree T with the m permutations as leaf nodes and assign permutations (ancestral genomes) to internal nodes such that D(T) is minimized, where D(T) is the sum of reversal distances over all edges of the tree.

New algorithm
Aim: among all possible reversals for each of the three genomes, identify good reversals. A good reversal ρ in a genome G1 is a reversal that brings the genome closer to the ancestral genome. But since the ancestor is unknown, it is unclear how to find good reversals. Instead: assume that reversals that reduce both the reversal distance between G1 and G2 and the reversal distance between G1 and G3 are likely to be good reversals. With Δ(ρ) as the overall reduction in these two reversal distances, the reversal ρ is good if Δ(ρ) = 2.

New algorithm
Iteratively carry on these good rearrangements until the genomes G1, G2, and G3 are transformed into an identical genome, hoping that this is the most likely „ancestral median“. When we are dealing with multichromosomal genomes and with four different types of rearrangements, ambiguous situations may occur too.
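The notion of a good reversal can be made concrete for very small genomes. The sketch below is not the Bourque & Pevzner implementation: it replaces the polynomial-time reversal distance by an exhaustive breadth-first search, which is only feasible for a handful of genes, and all names and the example genomes are hypothetical.

```python
from collections import deque


def reversals(genome):
    """Yield every genome reachable from `genome` by one signed reversal."""
    n = len(genome)
    for i in range(n):
        for j in range(i, n):
            segment = tuple(-g for g in reversed(genome[i:j + 1]))
            yield genome[:i] + segment + genome[j + 1:]


def reversal_distance(source, target):
    """Exact reversal distance by breadth-first search (small genomes only)."""
    source, target = tuple(source), tuple(target)
    steps = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            return steps[current]
        for nxt in reversals(current):
            if nxt not in steps:
                steps[nxt] = steps[current] + 1
                queue.append(nxt)
    raise ValueError("genomes do not contain the same genes")


def good_reversals(g1, g2, g3):
    """Results of all reversals of g1 that reduce d(g1,g2) + d(g1,g3) by 2."""
    g1 = tuple(g1)
    before = reversal_distance(g1, g2) + reversal_distance(g1, g3)
    good = []
    for candidate in reversals(g1):
        after = reversal_distance(candidate, g2) + reversal_distance(candidate, g3)
        if before - after == 2:
            good.append(candidate)
    return good


# Hypothetical genomes, each one reversal away from the ancestor (1, 2, 3, 4).
G1, G2, G3 = (1, -2, 3, 4), (1, 2, -3, 4), (1, 2, 3, -4)
```

For these genomes, undoing the reversal of gene 2 in G1 brings G1 one step closer to both G2 and G3, so the ancestor (1, 2, 3, 4) appears among the results of the good reversals.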

Ambiguities again possible
E.g. G1 = G2 = G3 = The parsimony principle does not allow the evolutionary scenario to be reconstructed unambiguously. If the ancestor coincides with G1, then a reversal occurred on the way to G2, and a fission occurred on the way to G3. One can as well start with G2 or G3 as the ancestor. This kind of ambiguity does not exist for unichromosomal genomes because, there, it is impossible to find 3 genomes that would all be within one reversal of each other.

Summary
Breakpoint analysis (BPA) is a robust technique for small rearrangement problems, but it suffers from ambiguity between different optimal solutions. Although complexity could be dramatically reduced by algorithmic improvements (e.g. GRAPPA), the method is still too expensive for more than 10 genomes. The heuristic algorithm by Bourque & Pevzner minimizes reversal distance instead of breakpoint distance. (Recall from lecture 5 that (number of breakpoints) / 2 was not the optimal lower bound for the reversal distance.) It runs more efficiently, can be applied to much larger problems, and provides only one or a few solutions. Analogy to conformational search in some energy landscape ... The problem remains what is the correct way to identify the biologically correct (= true) evolutionary trees: by minimizing the breakpoint distance, the reversal distance, or something else?

What to prepare? V8 – protein phylogeny

Phylogenetic Prediction (of single genes)
A phylogenetic analysis of a family of related nucleic acid or protein sequences is a determination of how the family might have been derived during evolution. Placing the sequences as outer branches on a tree, the evolutionary relationships among the sequences are depicted. 26. Lecture WS 2003/04 Bioinformatics III

3 main approaches in single-gene phylogeny
- maximum parsimony
- distance
- maximum likelihood
Popular programs: PHYLIP (phylogenetic inference package – J Felsenstein), PAUP (phylogenetic analysis using parsimony – Sinauer Assoc)

Concept of evolutionary trees
An evolutionary tree is a 2-dimensional graph showing evolutionary relationships among organisms or, in the case of sequences, among certain genes from separate organisms. The length of the branches reflects the number of sequence changes. Often: assume uniform rate of mutations (molecular clock hypothesis). [Figure: a rooted tree and an unrooted tree for sequences A, B, C, and D, with nodes and branches labelled.]

Methods for Single-Gene Phylogeny
(1) Choose set of related sequences. (2) Obtain multiple sequence alignment. (3) Is there strong sequence similarity? Yes: maximum parsimony methods. No: go to (4). (4) Is there clearly recognizable sequence similarity? Yes: distance methods. No: maximum likelihood methods. (5) Analyze how well data support prediction. Given a set of closely related genomes which strategy do you apply ...?

Maximum Parsimony Method
Method predicts the evolutionary tree that minimizes the number of steps required to generate the observed variation in the sequences. Step 0 Input: multiple sequence alignment Step 1 For each aligned position, identify phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes. Step 1.5 Continue analysis for every position in the sequence alignment. Step 2 Sequence variations at each site in the alignment are placed at the tips of the trees. Identify the tree (trees) that produce the smallest number of changes overall for all sequence positions. Because all possible trees are examined, method is best suited for sequences that are quite similar + for small number of sequences. It is guaranteed to find the best tree. 26. Lecture WS 2003/04 Bioinformatics III

Example

| Sequence # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | A | A | G | A | G | T | G | C | A |
| 2 | A | G | C | C | G | T | G | C | G |
| 3 | A | G | A | T | A | T | C | C | A |
| 4 | A | G | A | G | A | T | C | C | G |

(Columns are sequence positions 1 to 9.) These are 4 sequences giving 3 possible unrooted trees. [Figure: the 3 possible unrooted trees for position 5, with G or A at the tips and internal nodes.] similar example ... Informative sites: (1) must favor one tree over another (site 5 is informative, but sites 1, 6, 8 are not). (2) To be informative, a site must also have the same sequence character in at least two genomes (only sites 5, 7, and 9 are informative according to this rule). Combining sites 5, 7, and 9, the left tree is the best tree for these 4 sequences.
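The example can be checked with a short script. The sketch below scores the three unrooted four-sequence trees with Fitch's counting rule and lists the informative sites; the data are the four sequences of the example, while the function names and the tree encoding (two pairs of sequence names joined by the internal branch) are assumptions made here.

```python
from collections import Counter

ALIGNMENT = {
    "Seq1": "AAGAGTGCA",
    "Seq2": "AGCCGTGCG",
    "Seq3": "AGATATCCA",
    "Seq4": "AGAGATCCG",
}

# Each unrooted tree is written as the two pairs separated by the inner branch.
TREES = [
    (("Seq1", "Seq2"), ("Seq3", "Seq4")),
    (("Seq1", "Seq3"), ("Seq2", "Seq4")),
    (("Seq1", "Seq4"), ("Seq2", "Seq3")),
]


def columns(alignment):
    """Yield one {sequence name: base} dict per alignment position."""
    names = sorted(alignment)
    for site in zip(*(alignment[name] for name in names)):
        yield dict(zip(names, site))


def is_informative(column):
    """At least two different characters, each present in at least two sequences."""
    counts = Counter(column.values())
    return sum(1 for c in counts.values() if c >= 2) >= 2


def fitch_changes(tree, column):
    """Minimum number of changes needed at one site on one four-sequence tree."""
    def join(x, y):
        common = x & y
        return (common, 0) if common else (x | y, 1)

    (a, b), (c, d) = tree
    left, cost_left = join({column[a]}, {column[b]})
    right, cost_right = join({column[c]}, {column[d]})
    _, cost_inner = join(left, right)
    return cost_left + cost_right + cost_inner


def informative_sites(alignment):
    return [pos for pos, col in enumerate(columns(alignment), start=1)
            if is_informative(col)]


def tree_scores(alignment):
    cols = list(columns(alignment))
    return {tree: sum(fitch_changes(tree, col) for col in cols) for tree in TREES}
```

With these sequences the informative sites come out as 5, 7, and 9, and the tree that groups Seq1 with Seq2 and Seq3 with Seq4 needs the fewest changes overall.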

Where maximum parsimony fails
Parsimony can give misleading information when rates of sequence change vary in the different branches of a tree that are represented by the sequence data. In parsimony analysis rates of change along all branches of the tree are assumed equal. Therefore the tree predicted from parsimony will not be correct. Real tree: 2 long branches in which G has turned to A independently, possibly with some intermediate steps. [Figure: real tree and parsimony tree for Seq1 to Seq4, with G or A at the tips.]

Distance methods
The distance method employs the number of changes between each pair in a group of sequences to produce a phylogenetic tree of the group. The sequence pairs that have the smallest number of sequence changes between them are termed „neighbors“. On a tree, these sequences share a node or common ancestor position and are each joined to that node by a branch. Goal of distance methods: identify the tree that correctly positions neighbors and that also has branch lengths that reproduce the original data as closely as possible. → neighbor-joining algorithm, Fitch-Margoliash algorithm. Finding the closest neighbors among a group of sequences by the distance method is often the first step in producing a multiple sequence alignment. E.g. ClustalW uses the neighbor-joining distance method.

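Example

| sequence | bases |
|---|---|
| A | A C G C G T T G G G C G A T G G C A A C |
| B | A C G C G T T G G G C G A C G G T A A T |
| C | A C G C A T T G A A T G A T G A T A A T |
| D | A C A C A T T G A G T G A T A A T A A T |

Distances between sequences: nAB = 3, nAC = 7, nAD = 8, nBC = 6, nBD = 7, nCD = 3.

Distance table:

|   | A | B | C | D |
|---|---|---|---|---|
| A | - | 3 | 7 | 8 |
| B |   | - | 6 | 7 |
| C |   |   | - | 3 |
| D |   |   |   | - |

[Figure: unrooted tree in which A (branch length 2) and B (branch length 1) join at one internal node, C (branch length 1) and D (branch length 2) join at the other, and the internal branch has length 4.]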
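The distance table of the example can be reproduced by counting mismatches between the aligned sequences, and the branch lengths of the example tree can be checked against it. This is a small sketch; the function names and the way the tree is encoded are assumptions made for illustration.

```python
from itertools import combinations

SEQUENCES = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_table(sequences):
    return {(p, q): hamming(sequences[p], sequences[q])
            for p, q in combinations(sorted(sequences), 2)}


# Branch lengths of the example tree: A and B on one side, C and D on the other.
LEAF_BRANCH = {"A": 2, "B": 1, "C": 1, "D": 2}
INTERNAL_BRANCH = 4
SAME_SIDE = {frozenset("AB"), frozenset("CD")}


def tree_distance(p, q):
    """Path length between two leaves of the example tree."""
    inner = 0 if frozenset((p, q)) in SAME_SIDE else INTERNAL_BRANCH
    return LEAF_BRANCH[p] + LEAF_BRANCH[q] + inner
```

For this data set the tree reproduces every entry of the distance table exactly, which is not generally the case for real sequences.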

Maximum likelihood approach
Method uses probability calculations to find a tree that best accounts for the variation in a set of sequences. Similar to maximum parsimony method in that analysis is performed on each column of a multiple sequence alignment. All trees are considered. Because the rate of appearance of new mutations is very small, the more mutations are needed to fit a tree to the data, the less likely that tree. Start with an evolutionary model of sequence change that provides estimates of rates of substitution of one base for another (transitions and transversions).

| Base | A | C | G | T |
|---|---|---|---|---|
| A | -u(aC+bG+cT) | uaC | ubG | ucT |
| C | ugA | -u(gA+dG+eT) | udG | ueT |
| G | uhA | ujC | -u(hA+jC+fT) | ufT |
| T | uiA | ukC | ulG | -u(iA+kC+lG) |

Each off-diagonal entry gives the rate of substitution from the row base to the column base; each diagonal entry is the negative sum of the other entries in its row.

Maximum likelihood approach
Step 1: Align the set of sequences. Step 2: Examine substitutions in each column for their fit to a set of trees that describe possible phylogenetic relationships among the sequences. Each tree has a certain likelihood based on the series of mutations that are required to give the sequence data. The probability of each tree is the product of the mutation rates in each branch of the tree, each of which is itself the product of the rate of substitution in that branch and the branch length. Advantage of maximum likelihood approach: allows evaluation of trees with variations in mutation rates in different lineages. Can be used for more diverse sequences. Disadvantage: computationally intense.
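The likelihood idea can be illustrated on the smallest possible case: two aligned sequences joined by a single branch. The sketch below does not use the general rate matrix of the lecture; it assumes the much simpler Jukes-Cantor model (all substitution rates equal, all base frequencies 1/4) and finds the branch length with the highest likelihood by a plain grid search. The function names and grid parameters are hypothetical.

```python
import math


def jc69_log_likelihood(seq_a, seq_b, d):
    """Log-likelihood of two aligned sequences at distance d (substitutions per site).

    Assumes the Jukes-Cantor model: every column is an independent site.
    """
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # probability that a base is unchanged
    p_diff = 0.25 - 0.25 * decay   # probability of one specific other base
    log_l = 0.0
    for x, y in zip(seq_a, seq_b):
        # 0.25 is the equilibrium frequency of the base in the first sequence
        log_l += math.log(0.25 * (p_same if x == y else p_diff))
    return log_l


def ml_distance(seq_a, seq_b, step=0.001, max_d=3.0):
    """Grid search for the distance that maximizes the likelihood."""
    grid = [step * k for k in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(seq_a, seq_b, d))


def jc69_closed_form(seq_a, seq_b):
    """Analytical Jukes-Cantor estimate, used here only as a cross-check."""
    p = sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

For two sequences that differ at 3 of 20 positions, the grid search and the closed-form Jukes-Cantor estimate should agree to within the grid spacing. A full maximum likelihood tree search repeats this kind of calculation over all columns, branches, and candidate trees, which is why the method is computationally intense.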

Resolve Incongruences in Phylogeny
Many factors can make it difficult to decide how to handle conflicts in larger sets of molecular data. E.g. two genes with different evolutionary history (e.g. owing to hybridization or horizontal transfer) will necessarily give incongruent pictures while still depicting true histories. Here: compare genome sequence data for 7 Saccharomyces yeast species (S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, S. bayanus, S. castellii, S. kluyveri) plus one outgroup fungus, Candida albicans. Rokas et al. Nature 425, 798 (2003)

Alternative Tree topologies
Rokas et al. Nature 425, 798 (2003) Single-gene data sets generate multiple, robustly supported alternative topologies. Representative alternative trees recovered from analyses of nucleotide data of 106 selected single genes and six commonly used genes are shown. The trees are the 50% majority-rule consensus trees from the genes YBL091C (a), YDL031W (b), YER005W (c), YGL001C (d), YNL155W (e) and YOL097C (f). These 6 genes were selected without consideration of their function. Maybe commonly used, well known genes of important functions provide a better resolution? 26. Lecture WS 2003/04 Bioinformatics III

Explanations?
The alternative phylogenies could have resulted from a number of different scenarios: (1) most genes could have weakly supported most phylogenies and strongly supported only a few alternative trees, (2) most genes could have strongly supported one phylogeny and a few genes strongly supported only a small number of alternatives, (3) there could have been some combinations of these scenarios so that each branch among alternative phylogenies had either weak or strong support depending on the gene. To distinguish between these possibilities, identify all branches recovered during single-gene analyses and record each bootstrap value with respect to the gene and method of analysis. → 8 branches were shared by all three analyses with multiple instances of bootstrap values > 50%. Rokas et al. Nature 425, 798 (2003)

Common Branches The distribution of bootstrap values for the eight prevalent branches recovered from 106 single-gene analyses highlights the pervasive conflict among single-gene analyses. a, Majority-rule consensus tree of the 106 ML trees derived from single-gene analyses. Across all analyses, there were eight commonly observed branches; the five branches in the consensus tree (numbers 1–5; a) and the three branches (numbers 6–8) shown in b. Rokas et al. Nature 425, 798 (2003) 26. Lecture WS 2003/04 Bioinformatics III

Many factors were checked that could lead to incongruence between single-gene phylogenies: - outgroup choice (repeat all analyses without C. albicans) - number of variable sites - number of parsimony-informative sites - gene size - rate of evolution - nucleotide composition - base compositional bias - genome location - gene ontology. Some of these factors were significantly correlated with bootstrap values for some branches, but no parameter can systematically account for or predict the performance of single genes! Rokas et al. Nature 425, 798 (2003)

Can incongruence be overcome?
Although we do not know the cause(s) of incongruence between single-gene phylogenies, the critical question is how this incongruence between single trees might be overcome to arrive at the actual species tree. Can single gene trees be concatenated into one large data set? Do phylogenetic trees from concatenated genomes provide a more robust approach to generating phylogenies than merging individual trees generated from phylogenies of the individual proteins? Rokas et al. Nature 425, 798 (2003)

Concatenation of single genes gives a single tree!
Phylogenetic analyses of the concatenated data set composed of 106 genes yield maximum support for a single tree, irrespective of method and type of character evaluated. Numbers above branches indicate bootstrap values (ML on nucleotides/MP on nucleotides/MP on amino acids). All alternative topologies were rejected. This level of support for a single tree with 5 internal branches is unprecedented. This tree can now be referred to as species tree. Rokas et al. Nature 425, 798 (2003) 26. Lecture WS 2003/04 Bioinformatics III

Convergence on single tree
[Figure panels: branch 3, branch 5.] A minimum of 20 genes is required to recover >95% bootstrap values for each branch of the species tree. Rokas et al. Nature 425, 798 (2003)

Independent evolution?
It has been suggested that nucleotides within a given gene do not evolve independently. Re-sample subset of orthologous nucleotides from the total data set. Only 3000 randomly chosen nucleotide positions (corresponding to less than three concatenated genes) are sufficient to generate single tree with > 95% confidence. This indicates that nucleotides in genes have not evolved independently (because when using complete genes more than 20 genes are necessary to generate single tree). Rokas et al. Nature 425, 798 (2003) 26. Lecture WS 2003/04 Bioinformatics III

Implications for resolution of phylogenies
Unreliability of single-gene data sets stems from the fact that each gene is shaped by a unique set of functional constraints through evolution. Phylogenetic algorithms are sensitive to such constraints. Such problems can be avoided with genome-wide sampling of independently evolving genes. In other cases the amount of sequence information needed to resolve specific relationships will be dependent on the particular phylogenetic history under examination. Branches depicting speciation events separated by long time intervals may be resolved with a smaller amount of data, and those depicting speciation events separated by shorter intervals may be much harder to resolve. Rokas et al. Nature 425, 798 (2003)

Summary
Robust strategies exist for phylogenies built on single-gene comparisons (maximum parsimony, distance, maximum likelihood). Problem of incongruence of phylogenies derived from individual genes: it can be resolved by integrative analysis of multiple (here > 20) genes. It is desirable to combine results from phylogenies constructed from local sequence information with trees constructed from genome rearrangement. The power of genome rearrangement studies is the construction of ancestral genomes. Then one can derive the speed of evolution at different times, dissect mutation biases at different times from the influence of genomic context ... and possibly derive the driving forces of biological evolution.

What to prepare? V9 – gene finding

Introduction
The simplest method of finding DNA sequences that encode proteins is to search for open reading frames (ORFs). In each sequence, there are 6 possible reading frames: 3 starting at positions 1, 2, and 3 and going in the 5' to 3' direction, and 3 starting at positions 1, 2, and 3 and going in the 5' to 3' direction of the complementary strand. In prokaryotic genomes, DNA sequences encoding proteins are transcribed into mRNA, and the mRNA is usually directly translated into proteins without significant modification. Therefore, the longest ORF, running from the first available Met codon (AUG) on the mRNA to the next stop codon in the same reading frame, is usually a good prediction of the protein-encoding region.
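A six-frame ORF scan as described above can be sketched as follows. This is a minimal illustration, not a gene finder: it works on the DNA alphabet (ATG instead of AUG), uses the standard stop codons TAA, TAG, and TGA, reports each ORF from the first ATG in a frame to the next in-frame stop, and uses hypothetical function names and a hypothetical test sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(dna, frame):
    """Yield (start, end) of ORFs in one frame; end is exclusive, stop codon included."""
    start = None
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon == "ATG":
            start = i                      # first available start codon
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3
            start = None                   # look for the next ORF in this frame


def find_orfs(dna, min_codons=2):
    """Scan all six reading frames; return (strand, frame, ORF sequence) tuples."""
    dna = dna.upper()
    found = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for start, end in orfs_in_frame(seq, frame):
                if (end - start) // 3 >= min_codons:
                    found.append((strand, frame, seq[start:end]))
    return found
```

For the short test sequence CCATGAAATGGTAACC the only ORF is ATGAAATGGTAA, found on the forward strand in the third frame (frame index 2). In practice one would also require a minimum length well above two codons, since short ORFs occur by chance.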

Extrinsic and intrinsic methods
Most approaches now combine (a) homology methods = „extrinsic methods“ with (b) gene prediction methods = „intrinsic methods“. Only about half of all genes can be found by homology to other known genes or proteins (this value is of course increasing as more genomes get sequenced and more cDNA/EST sequences become available). In order to determine the remaining 50% of genes, one has to turn to predictive methods. Roughly how many genes can be found by homology methods of gene prediction? Will this ratio change in the future, and why? Mathé et al. Nucl. Acids. Res. 30, 4103 (2002)

Eukaryotic genomes Transcription of protein-encoding regions is initiated at specific promoter sequences, and followed by removal of noncoding sequence (introns) from pre-mRNA by a splicing mechanism. 3 types of posttranscriptional events influence the translation of mRNA into protein and the accuracy of gene prediction: (1) species-dependent codon usage (2) tissue-dependent splice variations (3) mRNA may be edited. 26. Lecture WS 2003/04 Bioinformatics III

Intrinsic Content Sensors for eukaryotic genomes
Characterize „coding“ regions: - nucleotide composition - G+C content (introns are more A/T rich than exons, especially in plants) - codon composition - hexamer frequency (this was found to be the most discriminating variable between coding and non-coding sequences) - base occurrence periodicity ... Hexamer frequency or, more generally, the k-mer composition of coding sequences is the main search tool in many packages. Mathé et al. Nucl. Acids. Res. 30, 4103 (2002)
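A k-mer content sensor can be sketched as a log-odds score between k-mer frequencies in known coding and known non-coding sequences. The sketch below is a simplified illustration, not the scoring scheme of any particular package: it ignores reading frame, uses add-one pseudocounts, and the function names and the tiny training sets (with k = 2 instead of hexamers) are hypothetical.

```python
import math
from collections import Counter


def kmer_counts(sequences, k):
    """Count all overlapping k-mers in a list of training sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def coding_score(seq, coding_counts, noncoding_counts, k, pseudo=1.0):
    """Sum of log-odds (coding vs non-coding) over all k-mers of seq.

    Positive scores mean the k-mer composition looks more like the coding set.
    """
    vocabulary = 4 ** k
    coding_total = sum(coding_counts.values()) + pseudo * vocabulary
    noncoding_total = sum(noncoding_counts.values()) + pseudo * vocabulary
    score = 0.0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        p_coding = (coding_counts[kmer] + pseudo) / coding_total
        p_noncoding = (noncoding_counts[kmer] + pseudo) / noncoding_total
        score += math.log(p_coding / p_noncoding)
    return score


# Hypothetical toy training data; real sensors use k = 6 and many genes.
CODING = kmer_counts(["GCGCGCGC"], k=2)
NONCODING = kmer_counts(["ATATATAT"], k=2)
```

With these toy training sets a GC-rich window scores positive and an AT-rich window scores negative.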

Signal sensors
The basic and natural approach to finding a signal that may represent the presence of a functional site is to search for a match with a consensus sequence, e.g. for promoter regions (TATA box) or the ribosomal binding site on the mRNA. This consensus could be determined from a multiple alignment of functionally related documented sequences. Programs: SPLICEVIEW and SplicePredictor. A more flexible representation of signals is offered by the so-called positional weight matrices (PWMs), which indicate the probability that a given base appears at each position of the signal (again computed from a multiple alignment of functionally related sequences). One can say that a PWM is defined by one classical zero order Markov model per position. The PWM weights can also be optimized by a neural network method. Mathé et al. Nucl. Acids. Res. 30, 4103 (2002)
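Building a positional weight matrix from aligned sites and scanning a sequence with it can be sketched as follows. The aligned sites, the uniform background of 0.25 per base, the pseudocount, and the function names are assumptions made for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(aligned_sites, pseudo=1.0):
    """One probability distribution over A, C, G, T per signal position."""
    pwm = []
    for pos in range(len(aligned_sites[0])):
        column = [site[pos] for site in aligned_sites]
        total = len(column) + pseudo * len(BASES)
        pwm.append({b: (column.count(b) + pseudo) / total for b in BASES})
    return pwm


def log_odds(pwm, window, background=0.25):
    """Log2 odds of a window under the PWM versus a uniform background."""
    return sum(math.log2(pwm[i][base] / background)
               for i, base in enumerate(window))


def best_hit(pwm, sequence):
    """Return (score, start position) of the best-scoring window."""
    width = len(pwm)
    return max((log_odds(pwm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


# Hypothetical aligned sites of a TATA-like signal.
SITES = ["TATAAT", "TATAAT", "TACAAT", "TATACT"]
PWM = build_pwm(SITES)
```

Because each position is scored independently, this corresponds to the zero order Markov model per position mentioned above; dependencies between neighbouring positions are not captured.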

Signal sensors
In order to capture possible dependencies between adjacent positions of a signal, one may use higher order Markov models called weight array models (WAM). These methods assume a fixed length signal. Hidden Markov models further allow for insertions and deletions. Most existing programs use such models to represent and detect - splice sites - branch points - correct intron/exon boundaries and other motifs like - poly(A) sites (in 3'-UTRs) - promoters. Mathé et al. Nucl. Acids. Res. 30, 4103 (2002)

Predict eukaryotic gene structures
One doesn't want to only search for independent exons, but instead identify the whole complex structures of genes! Each consistent pair of detected signals (translation starts and stops, splice sites) defines a potential gene region (intron, exon or coding part of an exon). Since all these potential gene regions can be used to build a gene model, the number of potential gene models grows exponentially with the number of predicted exons! In practice, „correct“ gene structures must satisfy a set of properties: (i) there are no overlapping exons, (ii) coding exons must be frame compatible, (iii) merging two successive coding exons will not generate an in-frame stop at the junction. The number of candidates remains, however, exponential. Mathé et al. Nucl. Acids. Res. 30, 4103 (2002)

Testing the reliability of an ORF prediction
(1) Observation of unusual type of sequence variation found in ORFs: every 3rd base tends to be the same one much more often than by chance alone (Fickett 1982). This property is due to nonrandom use of codons in ORFs and is true for any ORF, regardless of the species. (2) Determine whether the codons in the ORF correspond to those used in other genes of the same organism (codon usage statistic). (3) Translate ORFs into amino acid sequence and compare that to database of protein sequences. If good hits are found, confidence in new predicted ORF rises. How would one test ORF predictions? 26. Lecture WS 2003/04 Bioinformatics III

Neural Network: GRAIL II
Grail II provides analyses of protein-coding regions, poly(A) sites, and promoters; constructs gene models; predicts encoded protein sequences; and provides database searching capabilities. (1) Create list of most likely exon candidates. (2) Evaluate candidates by neural network. Uberbacher, Mural. PNAS, 88, (1991)

Problematic gene start prediction
Detecting a gene as a protein-coding ORF with an ‘open’ start still does not provide full information for gene annotation. Although several procedures for gene start prediction have been described, verification of the actual accuracy of these methods has been hampered by an insufficient number of experimentally validated translation starts and, therefore, a deficit of reliable data for training and testing. In the absence of a reliable computer procedure for gene start prediction, the rule of the ‘longest ORF’ was frequently applied to annotate complete microbial genomes, with the gene start assigned to the 5'-most ATG codon (see Table). Besemer et al. Nucl. Acids. Res. 29, 2607 (2003); Salzberg et al. NAR, 27, 4636 (1999)

Spacer length (B) Graph of probability distribution of spacer length, the sequence between the RBS sequence and the gene start. Besemer et al. Nucl. Acids. Res. 29, 2607 (2003) 26. Lecture WS 2003/04 Bioinformatics III

Ribosome binding sites
Distributions of log-odds scores of RBS sites, as detected by GeneMarkS, in sets of overlapping and non-overlapping genes of (A) B.subtilis, (B) E.coli and (C) M.jannaschii. As can be seen, the overlapping genes, which are likely to be located inside operons, frequently have strong RBS sites. Still, most strong sites of ribosome binding precede the non-overlapping genes (stand alone genes and genes leading operons). This tendency is much more apparent in the case of the archaeal genome of M.jannaschii than in the E.coli and B.subtilis genomes. Besemer et al. Nucl. Acids. Res. 29, 2607 (2003)

Spacer length Distributions of spacer length for two species with strong RBS patterns, B.subtilis and E.coli (solid and dashed lines, respectively), and one species with a strong eukaryotic promoter-like pattern, A.fulgidus (dotted line). The promoter-like pattern of A.fulgidus is localized much further upstream of the start codon than the RBS patterns of B.subtilis and E.coli. Besemer et al. Nucl. Acids. Res. 29, 2607 (2003) 26. Lecture WS 2003/04 Bioinformatics III

Distribution of spacer lengths
(A) Distribution of spacer lengths observed in the B.subtilis genome for two different types of possible RBS hexamers: AGGAGG and AGGTGA. Multiple alignment allows these hexamers to be superimposed. In actual upstream sequences, these hexamers tend to occupy different locations relative to the start codon. This preference may be involved in the precise positioning of the ribosome at the translation initiation site when the 16S rRNA binds to mRNA. The more frequent hexamer was observed on average at a further distance from the gene start than the rare hexamer. Besemer et al. Nucl. Acids. Res. 29, 2607 (2003) 26. Lecture WS 2003/04 Bioinformatics III

Distribution of spacer lengths
(B) Distribution of spacer lengths observed in the M.thermoautotrophicum genome for two different types of RBS hexamers: GGAGGT and GGTGAT. Properties of these hexamers are similar to the two hexamers observed in the B.subtilis genome (A), except that more frequent hexamer is now found on average at a closer distance to the gene start than the rare hexamer. Besemer et al. Nucl. Acids. Res. 29, 2607 (2003) 26. Lecture WS 2003/04 Bioinformatics III

What to prepare? V10 – computational functional genomics

V10: Computational functional genomics
The goal of computational functional genomics is to assign the function, localization, and interactions of genes (proteins) from the genome organisation, homology to other proteins, occurrence in different species ... Methods: - phylogenetic profiles - assignment of function and localization - combination with operon method, rosetta stone method, genome neighborhood method. Name 5 methods of computational functional genomics and explain them.

Assigning protein functions by comparative genome analysis: protein phylogenetic profiles
Hypothesis: functionally linked proteins evolve in a correlated fashion, and, therefore, they have homologs in the same subset of organisms. In general, pairs of functionally linked proteins have no amino acid sequence similarity with each other and, therefore, cannot be linked by conventional sequence-alignment techniques. Phylogenetic profile of a particular protein: a string with n entries, each one bit, where n corresponds to the number of genomes. The presence of a homolog to a given protein in the i-th genome is indicated by an entry of 1 at the i-th position. If no homolog is found, the entry is 0. Variation: assign 1/E-value from BLAST to distinguish levels of similarity. Pellegrini et al. PNAS 96, 4285 (1999)

Protein phylogenetic profiles
Illustrate method for hypothetical case of four fully sequenced genomes (from E. coli, Saccharomyces cerevisiae, Haemophilus influenzae, and Bacillus subtilis) in which we focus on seven proteins (P1-P7). For each E. coli protein, a profile is constructed, indicating which genomes code for homologs of the protein. We next cluster the profiles to determine which proteins share the same profiles. Proteins with identical (or similar) profiles are boxed to indicate that they are likely to be functionally linked. Boxes connected by lines have phylogenetic profiles that differ by one bit and are termed neighbors. Pellegrini et al. PNAS 96, 4285 (1999) 26. Lecture WS 2003/04 Bioinformatics III
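The clustering step can be sketched with bit tuples. In the sketch below each profile has one bit per genome other than the query genome; the seven profiles are invented for illustration and are not read off the original figure, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical profiles: one bit per genome (1 = homolog present).
PROFILES = {
    "P1": (1, 1, 0),
    "P2": (0, 1, 1),
    "P3": (1, 0, 1),
    "P4": (1, 1, 0),
    "P5": (1, 1, 1),
    "P6": (1, 0, 1),
    "P7": (0, 1, 1),
}


def group_by_profile(profiles):
    """Proteins with identical profiles end up in the same box."""
    groups = defaultdict(list)
    for protein, profile in profiles.items():
        groups[profile].append(protein)
    return dict(groups)


def hamming(p, q):
    """Number of genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))


def neighbour_profiles(groups):
    """Pairs of distinct profiles that differ by exactly one bit."""
    return [(p, q) for p, q in combinations(sorted(groups), 2)
            if hamming(p, q) == 1]
```

Proteins that share a box are candidates for a functional link, and boxes connected as neighbours are weaker candidates. With real data one would use many more genomes, since three bits give only eight possible profiles.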

3 phylogenetic profiles for E.coli proteins
Proteins with phylogenetic profiles in the neighborhood of ribosomal protein RL7 (A), flagellar structural protein FlgL (B), and histidine biosynthetic protein His5 (C). All proteins with profiles identical to the query proteins are shown in the double boxes. All the proteins with profiles that differed by one bit are shown in the single boxes. Proteins in bold participate in the same complex or pathway as the query protein. Proteins in italics participate in a different but related complex or pathway. Proteins with identical profiles are shown within the same box. Single lines between boxes represent a one-bit difference between the two profiles. Homologous proteins are connected by a dashed line or are indented. Each protein is labeled by a four-digit E. coli gene number, a SwissProt gene name, and a brief description. Note that proteins within a box or in boxes connected by a line have similar functions. Proteins in the double boxes in A, B, and C have 11, 6, and 10 ones, respectively, in their phylogenetic profiles, of a possible 16 for the 17 genomes presently sequenced. Pellegrini et al. PNAS 96, 4285 (1999) 26. Lecture WS 2003/04 Bioinformatics III

results from phylogenetic profile analysis
The phylogenetic profile of a protein describes the presence or absence of homologs in organisms. Proteins that make up multimeric structural complexes are likely to have similar profiles. Also, proteins that are known to participate in a given biochemical pathway are likely to be neighbors in the space of phylogenetic profiles. Proteins that are functionally linked are far more likely to be neighbors in profile space than randomly selected proteins. However, only a fraction of all possible neighbors is found within a group. Therefore, not all functionally linked proteins have similar profiles; they may fall into multiple clusters in profile space. Interestingly, hypothetical proteins are also more likely to be neighbors than random proteins, suggesting that many hypothetical proteins are part of uncharacterized pathways or complexes. Pellegrini et al. PNAS 96, 4285 (1999)

Localizing proteins in the cell from their phylogenetic profiles
Observation: proteins localized to a given organelle by experiments tend to share a characteristic phylogenetic distribution of their homologs – the phylogenetic profile. Marcotte et al. PNAS 97, (2000) 26. Lecture WS 2003/04 Bioinformatics III

Phylogenetic profile of yeast proteins
(A) The mean phylogenetic profiles (horizontal bars of 31 elements) of yeast proteins experimentally localized to different cellular locations. Each profile shows the distribution among genomes of homologs of proteins from one subcellular location. Plasma Mb, plasma membrane. Colors express the average degree of sequence similarity of proteins in that organelle to their sequence homologs in the indicated genomes, with red indicating greater average similarity and blue indicating less. (B) A tree of the observed relationships among the yeast proteins from different subcellular compartments. Overlaid on the tree is our interpretation of the relationships, showing ellipses clustering compartments thought to be derived from the progenitor of mitochondria (orange ellipse) and of the eukaryote nucleus (yellow ellipse). A distance matrix was calculated of pairwise Euclidian distances between the mean phylogenetic profiles (A) of proteins known to be localized in each compartment. A tree was generated from this matrix by the neighbor-joining method implemented in PHYLIP 3.5C. Marcotte et al. PNAS 97, (2000) 26. Lecture WS 2003/04 Bioinformatics III

Classification scheme
The scheme by which proteins are classified into mitochondrial or nonmitochondrial cellular localizations. Each horizontal bar is a phylogenetic profile; that for the protein of interest x0 is compared with the mean profiles for mitochondrial and nonmitochondrial proteins to determine its localization. In this example, the protein of interest is assigned to the mitochondrion because the query protein's phylogenetic profile more closely resembles the mean profile of mitochondrial proteins than the mean profile of cytosolic proteins. Marcotte et al. PNAS 97, (2000) 26. Lecture WS 2003/04 Bioinformatics III
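A minimal sketch of the comparison step, assuming that "more closely resembles" is measured by Euclidean distance to the class mean profile (the slide does not state the metric used for classification, so this is an assumption). All profile values are hypothetical.

```python
import math

def mean_profile(profiles):
    """Element-wise mean of a list of equally long profiles."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(query, class_means):
    """Return the label whose mean profile is closest to the query profile."""
    return min(class_means, key=lambda label: euclidean(query, class_means[label]))

# Hypothetical training profiles (similarity scores in four genomes).
mitochondrial = [[0.9, 0.8, 0.1, 0.0], [0.7, 0.9, 0.2, 0.1]]
nonmitochondrial = [[0.1, 0.2, 0.8, 0.9], [0.0, 0.1, 0.9, 0.7]]

class_means = {
    "mitochondrial": mean_profile(mitochondrial),
    "nonmitochondrial": mean_profile(nonmitochondrial),
}

x0 = [0.8, 0.7, 0.3, 0.1]  # hypothetical protein of interest
label = classify(x0, class_means)
print(label)
```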

Inference of protein function and protein lineages in Mycobacterium tuberculosis based on prokaryotic genome organization
One difference between prokaryotic and eukaryotic genomes is the organization of the prokaryotic genome into multi-gene units, known as operons. Prokaryotic operon organization enables the highly controlled co-expression of multiple genes, by transcribing them together onto a single transcript. The encoded proteins of common operons often have related functions, form common complexes, or participate in shared biochemical pathways. Strong, Mallick, Pellegrini, Thompson, Eisenberg Genome Biology :R59

Prokaryotic operon organization
(a) Prokaryotic operon organization. Genes A, B, and C are transcribed together onto a single polycistronic transcript, which is then translated to produce three separate proteins. Proteins originating from genes of a common operon often have similar functions, interact physically through protein-protein interactions, or participate in shared biochemical pathways. (b) Functional Linkages based on the Operon method. Genes A, B and C are 'linked' if the intergenic nucleotide distance between pairs of adjacent genes is less than or equal to the specified threshold. In this case the distance between gene A and B, and the distance between gene B and C is less than the hypothetical distance threshold, thereby allowing links between all possible sets of genes. Strong, Mallick, Pellegrini, Thompson, Eisenberg Genome Biology :R59 26. Lecture WS 2003/04 Bioinformatics III
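The linking rule in (b) might be sketched as follows. The gene coordinates are hypothetical, the `Gene` class and `operon_links` function are illustrative names, and requiring adjacent genes to lie on the same strand is an assumption made here (it matches the idea of a run of genes in one orientation).

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Gene:
    name: str
    start: int   # first base of the gene
    end: int     # last base of the gene
    strand: str  # "+" or "-"

def operon_links(genes, threshold_bp=100):
    """Link all genes within a run of adjacent same-strand genes whose
    intergenic distances are each <= threshold_bp."""
    ordered = sorted(genes, key=lambda g: g.start)
    if not ordered:
        return []
    runs, current = [], [ordered[0]]
    for prev, nxt in zip(ordered, ordered[1:]):
        gap = nxt.start - prev.end - 1  # intergenic distance in bp
        if nxt.strand == prev.strand and gap <= threshold_bp:
            current.append(nxt)
        else:
            runs.append(current)
            current = [nxt]
    runs.append(current)
    # Within a run, every pair of genes is linked.
    return [(a.name, b.name) for run in runs for a, b in combinations(run, 2)]

genes = [
    Gene("A", 100, 400, "+"),
    Gene("B", 450, 900, "+"),
    Gene("C", 950, 1300, "+"),
    Gene("D", 2000, 2500, "+"),  # too far from C
    Gene("E", 2550, 3000, "-"),  # opposite strand
]
links = operon_links(genes, threshold_bp=100)
print(links)
```

Raising `threshold_bp` links more gene pairs, which mirrors the trade-off between coverage and false positives discussed for the distance threshold.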

Prokaryotic operon organization
Although the operon structure has been well studied at the biochemical level in microorganisms such as E. coli, genome-wide operon organization in pathogenic organisms, such as M. tuberculosis, remains largely unknown. One can exploit the conservation of certain genetic elements present in many prokaryotic organisms, including M. tuberculosis, to learn about operon structure and gene function: the -10 and -35 bp promoter elements, ribosome binding sites (RBS), and the 5' and 3' untranslated regions (UTR). Strong, Mallick, Pellegrini, Thompson, Eisenberg Genome Biology :R59

Independent vs. consecutive transcription
Schematic representation of the minimum genetic requirements for adjacent genes that are transcribed independently and those transcribed together as a single operon. Cases 1, 2 and 3 depict instances where gene A and gene B are transcribed independently as distinct transcriptional units, while Case 4 depicts genes organized into a common operon. The minimum requirement for genes of a common operon is only a RBS, while Case 3 emphasizes the numerous genetic elements required if gene A and gene B are organized into separate transcription units Strong, Mallick, Pellegrini, Thompson, Eisenberg Genome Biology :R59 26. Lecture WS 2003/04 Bioinformatics III

Conservation of Swissprot annotation
Swissprot-keyword recovery scores as a function of combined intergenic distances between pairs of genes in a run. All gene members of a run (bordered on each side by genes in opposite orientations) were linked and given a value equal to the combined intergenic distances between them. While the keyword recovery of genes linked by a combined intergenic distance less than 150 bp is fairly high (34-52%), it is apparent that as the total intergenic distance increases above 150 bp, there is a decrease in keyword recovery. At combined intergenic distances above 250 bp the keyword recovery is comparable to that of randomly linked genes. Draw the plot! Strong et al.,Genome Biology (2003) 4:R59 26. Lecture WS 2003/04 Bioinformatics III

Combine computational methods of functional assignment
Four methods for functional assignment were used: (1) the Operon method (intergenic distance criterion); (2) Rosetta Stone (RS): genes A and B have a common function if a fused gene AB is found in any other organism; (3) Phylogenetic Profile (PP); (4) the Conserved Gene Neighbor (GN) method: identify genes that are in close proximity in multiple genomes. The figure shows keyword recovery scores for the Operon method alone and in combination with the RS, PP, and GN methods. Notice that combining the Operon method with any of RS, PP, or GN has a dramatic effect on the keyword recovery, with the best score resulting from a combination of the 100 bp Operon, RS and PP methods. Strong et al., Genome Biology (2003) 4:R59

Determine operon distance threshold
Keyword recovery and maximum false positive fraction scores as the Operon distance threshold increases from 0 bp to 300 bp. Notice the decrease in the keyword recovery and the increase in maximum false positive fraction as the distance threshold increases. Draw the plot! Strong et al.,Genome Biology (2003) 4:R59 26. Lecture WS 2003/04 Bioinformatics III

Verify predictions on known examples
Comparison of the genomic organization of the leucine biosynthesis genes in M. tuberculosis and S. pombe. (a) Genomic organization of the leuC and leuD genes of M. tuberculosis. (b) S. pombe alpha-isopropylmalate isomerase, containing both the leuC and leuD coding regions in a single fusion gene. This example illustrates the power of the Rosetta Stone, Phylogenetic Profile, Gene Neighbor and Operon methods to infer a functional linkage, in this case one that is already established. Strong et al.,Genome Biology (2003) 4:R59 26. Lecture WS 2003/04 Bioinformatics III

Inference of protein function
Inference of M. tuberculosis protein function and operon organization based on multiple method overlap. (a) Inference of an operon encoding members involved in thiamine biosynthesis. (b) Operon inference for a region possibly involved in RNA degradation. (c) Functional links and operon inference for a region likely to be involved in cell wall metabolism. In these cases, inferences are made for the functions of uncharacterized genes by their functional linkages to genes of known function. Strong et al.,Genome Biology (2003) 4:R59 26. Lecture WS 2003/04 Bioinformatics III

Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages
Find computational approaches for finding gene and protein interactions to complement and extend experimental approaches such as: - synthetic lethal and suppressor screens - yeast two-hybrid experiments - high-throughput mass spectrometry interaction assays. Approach followed here: phylogenetic profiles Date & Marcotte Nat Biotech 21, 1055 - 1062 (2003) 26. Lecture WS 2003/04 Bioinformatics III

Identify novel cellular systems
Top: Using computational genetics, the genome-wide protein network of an organism is reconstructed. Middle: Suitable candidate clusters that contain three or more linked proteins, at least 50% of which are uncharacterized, are selected for further evaluation. Bottom: Such core clusters are then extended to include operon partners and other proteins that are naturally linked with the protein cluster. Thick boxes and lines indicate proteins in the core cluster; thin boxes and lines indicate proteins extending the core cluster. Shaded boxes represent homologs; thick gray lines represent links to operon partners. Date & Marcotte Nat Biotech 21, 1055 - 1062 (2003)

Metric of phylogenetic profile similarity
The mutual information MI(A,B) measures the similarity of a pair of phylogenetic profiles A and B. MI(A,B) is maximal when there is complete covariance between the occurrences of the genes A and B, and tends to 0 as variation decreases or the gene occurrences vary independently: $MI(A,B) = H(A) + H(B) - H(A,B)$. Here $H(A) = -\sum_a p(a) \ln p(a)$ represents the marginal entropy of the probability distribution $p(a)$ of gene A occurring among the organisms in the reference database, and $H(A,B) = -\sum_a \sum_b p(a,b) \ln p(a,b)$ represents the entropy of the joint probability distribution $p(a,b)$ of occurrences of genes A and B across the set of reference organisms.
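As a rough illustration of this score, the sketch below estimates the probabilities by counting symbols across the reference genomes and uses the natural logarithm (both are assumptions of this example). The profiles are hypothetical; the function accepts any discrete symbols, not only 0/1.

```python
import math
from collections import Counter

def entropy(counts, n):
    """Shannon entropy (natural log) from a collection of counts summing to n."""
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def mutual_information(a, b):
    """MI(A,B) = H(A) + H(B) - H(A,B) for two equally long discrete profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    n = len(a)
    h_a = entropy(Counter(a).values(), n)
    h_b = entropy(Counter(b).values(), n)
    h_ab = entropy(Counter(zip(a, b)).values(), n)
    return h_a + h_b - h_ab

# Hypothetical binary profiles over four genomes.
mi_identical = mutual_information([1, 0, 1, 0], [1, 0, 1, 0])    # complete covariance
mi_independent = mutual_information([1, 1, 0, 0], [1, 0, 1, 0])  # independent occurrences
mi_constant = mutual_information([1, 1, 1, 1], [1, 0, 1, 0])     # no variation in A
print(mi_identical, mi_independent, mi_constant)
```

For strictly binary profiles the score cannot exceed ln 2; profiles with more than two symbol levels can reach larger values.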

The inherent information in phylogenetic profiles can be seen from the distributions of scores from comparisons of all possible protein pairs in each of seven organisms. Pairwise comparisons of actual phylogenetic profiles (solid lines) show significantly more similar profiles (indicated by larger mutual information values) than pairwise comparisons of shuffled profiles (dashed lines). Mutual information scores MI between shuffled profiles exceed 0.7 at a rate of 1 in $10^7$ pairs, whereas scores between actual profiles are greater than 1.2, indicating that scores above 0.7 are statistically likely to indicate legitimate functional linkages between pairs of genes. Date & Marcotte Nat Biotech 21, 1055 - 1062 (2003)

Mutual information scores plotted versus pathway similarity on a linear scale show increasing trends. The solid and dashed lines represent analytical curves fit to the data of b by least squares. Scores of indicate approximately 35–50% accurate predictions by this test, higher scores approach 100% functional accuracy. For comparison, the percentage of proteins that share no pathways in common shows a decreasing trend as mutual information values increase (inset). The accuracies of experimentally determined protein interactions from large-scale yeast two-hybrid screens (14% and 44%) and mass spectrometry experiments (27% and 76%) are shown with the dot-dashed horizontal lines. As in b, each point represents the average values of 1,000 pairs of proteins. Date & Marcotte Nat Biotech 21, 1055 - 1062 (2003)

Predicted genome-wide protein networks for yeast
Proteins are represented as vertices, and derived functional linkages are shown as lines connecting the corresponding proteins. All linkages with scores above a mutual information value of 0.75 are drawn, essentially by modeling the linkages as springs that pull functionally linked proteins together on the page. (Thus, the lengths of the lines are not meaningful, only the connections). Groups of proteins sharing functional links are seen to cluster together, representing portions of genetic or functional networks. Systems in gray circles are labeled with their corresponding functions. Date & Marcotte Nat Biotech 21, 1055 - 1062 (2003) 26. Lecture WS 2003/04 Bioinformatics III

Clusters representing potentially new pathways
Clusters representing potentially new pathways selected from reconstructions of genome-wide interaction networks of four different organisms. Boxes with thicker borders, and bold lines denote the cluster core. Each cluster was extended to include operon partners, as well as secondarily linked proteins that are naturally grouped with the proteins in the cluster but with a mutual information value less than the selected threshold; these are represented by dotted lines and boxes with thinner borders. Thick red lines represent connections between genes in an operon, whereas colored boxes represent homologous proteins. All selected core clusters are composed of proteins, at least 50% of which lack precise functional assignments. Boxes with dashed outlines represent such uncharacterized proteins. Date & Marcotte Nat Biotech 21, 1055 - 1062 (2003) 26. Lecture WS 2003/04 Bioinformatics III

Phylogenetic profiles for new gene clusters
The genes corresponding to proteins within a cluster show similar patterns of presence and absence, indicated by red and blue squares, respectively, among the 57 genomes, labeled across the top. The intensity of red denotes the degree of homology between the protein labeled at the left with the best matching protein sequence of the corresponding genome. Deeper red indicates stronger sequence similarity, blue indicates no detectable similarity (BLAST E-value  1). Date & Marcotte Nat Biotech 21, 1055 - 1062 (2003) 26. Lecture WS 2003/04 Bioinformatics III

Why can one still find entirely new systems?
In well-characterized systems like yeast, ca. 90% of the uncharacterized proteins are linked in networks to proteins of known function. Most uncharacterized proteins therefore appear to be additional components of known systems. The few characterized proteins of the novel cellular systems detected here seem to be strongly biased towards metabolic functions that occur commonly as more or less discrete systems within cells, which can easily be coinherited or horizontally transferred. The presented analysis seems an ideal way of discovering such systems. Of course, it cannot indicate the precise biological function of these systems. In traditional biology, biological knowledge was extended gradually along known sets of pathways, rather than by sampling all pathways evenly. Again, the presented approach allows new discoveries. Date & Marcotte Nat Biotech 21, 1055 - 1062 (2003)

Summary
Computational functional annotation of genes may be based on (a) annotation by homology to genes with known function in other organisms, or (b) combination of several, relatively search techniques as presented today. Proteins often have multiple functions! We need to detect all of them. The search techniques under (b) are biology-driven. This area is still in the exploratory phase. Soon certain rules will emerge and allow more sophisticated computational techniques to be applied: a job for computer scientists/bioinformaticians.

What to prepare? V11 – SNPs

V11: Single Nucleotide Polymorphism
Some people have blue eyes, some are great artists or athletes, and others are afflicted with a major disease before they are old. Many of these kinds of differences among people have a genetic basis - alterations in the DNA. Sometimes the alterations involve a single base pair and are shared by many people. Such single base pair differences are called "single nucleotide polymorphisms", or SNPs for short. Nonetheless many SNPs, perhaps the majority, do not produce physical changes in people with affected DNA. (On average, SNPs occur in the human population more than 1 % of the time. Since only 3-5% of the genome code for proteins, most SNPs are found outside of coding regions. Those within a coding region are of course of particular interest.) Why then are genetic scientists eager to identify as many SNPs as they can, distributed on all 23 human chromosomes? 26. Lecture WS 2003/04 Bioinformatics III

Reasons for studying SNPs
(1) Even SNPs that do not themselves change protein expression and cause disease may be close on the chromosome to deleterious mutations. Because of this proximity, SNPs may be shared among groups of people with harmful but unknown mutations and serve as markers for them. Such markers help unearth the mutations and accelerate efforts to find therapeutic drugs. (2) Analyzing shifts in SNPs among different groups of people will help population geneticists to trace the evolution of the human race down through the millennia and to unravel the connections between widely dispersed ethnic groups and races. (3) Most human sequence variation is attributable to SNPs, with the rest attributable to insertions or deletions of one or more bases, repeat length polymorphisms and rearrangements.

The SNP Consortium
These reasons motivated a number of pharmaceutical and technology companies and academic sequencing centers to join forces to identify thousands of SNPs. The task is smaller than sequencing the whole human genome. Four major centers for genetics were involved: Cold Spring Harbor Lab, Sanger Centre, Wash Univ St. Louis, and Whitehead/MIT Center for Genome Research. The fruits of this research are made available in the database dbSNP. Main publication: A map of human genome sequence variation containing 1.42 million SNPs, The international SNP Map Working Group (41 authors) Nature 409, 928 (2001). In this work, single base differences were detected using two validated algorithms: Polybayes and the neighbourhood quality standard (NQS).

POLYBAYES
Developed and tested with EST clones from 10 genomic clones that are aligned against a fragment of the finished sequence of human (less than 1 error per bp). The task is to identify SNPs from the genomic sequences of multiple individuals (e.g. 10 genomic clones). First organize the sequences: (1) fragment clustering; (2) identification of paralogues (inclusion of sequences representing highly similar regions duplicated elsewhere in the genome may give rise to false SNP predictions); (3) multiple alignment of sequences; (4) analysis of differences among sequences (e.g. using Polybayes). Marth et al. Nature Gen. 23, 452 (1999)

Application of the POLYBAYES procedure to EST data
a, Regions of known human repeats in a genomic sequence are masked. b, Matching human ESTs are retrieved from dbEST and traces are re-called. c, Paralogous ESTs are identified and discarded. d, Alignments of native EST reads are screened for candidate variable sites. e, An STS is designed for the verification of a candidate SNP. f, The uniqueness of the genomic location is determined by sequencing the STS in CHM1 (homozygous DNA). g, The presence of a SNP is analysed by sequencing the STS from pooled DNA samples. Marth et al. Nature Gen. 23, 452 (1999)

Paralogue identification
Identify paralogous sequences by determining whether the number of mismatches observed between the genomic reference sequence and a matching EST is consistent with polymorphic variation, as opposed to sequence difference between duplicated chromosomal locations, taking into account sequence quality. Observation: paralogous sequences exhibit a pair-wise dissimilarity rate higher than $P_{PAR} = 0.02$ (2%), compared with the average pair-wise polymorphism rate $P_{POLY,2} = 0.001$ (0.1%). In a pair-wise match of length $L$ we therefore expect $L \times P_{POLY,2}$ mismatches due to polymorphism, versus $L \times P_{PAR}$ mismatches due to paralogous difference. In both cases, an additional number, $E$, of mismatches is expected to arise from sequencing errors. Expect $D_{NAT} = L \times P_{POLY,2} + E$ or $D_{PAR} = L \times P_{PAR} + E$ mismatches. Marth et al. Nature Gen. 23, 452 (1999)

Paralogue identification
The probability of observing $d$ discrepancies in the pairwise alignment is approximated by a Poisson distribution, with parameter $\lambda = D_{NAT}$ for $Model_{NAT}$ and $\lambda = D_{PAR}$ for $Model_{PAR}$. In the absence of reliable a priori knowledge of the expected proportions of native versus paralogous ESTs, uninformed (flat) priors were used. With flat priors, Bayes' rule gives the posterior probability that the EST represents native sequence as $P_{NAT} = P(Model_{NAT}|d) = \frac{P(d|Model_{NAT})}{P(d|Model_{NAT}) + P(d|Model_{PAR})}$. ESTs that scored above a cutoff value, $P_{NAT,MIN}$, were considered native; sequences scoring below the threshold were declared paralogous. Marth et al. Nature Gen. 23, 452 (1999)

Paralogue discrimination
Example probability distributions for a matching sequence with (hypothetical) uniform base quality values of 20, in pair-wise alignment with base-perfect genomic anchor sequence (quality values 40), over a length of 250 bp. $P_{POLY,2} = 0.001$, $P_{PAR} = 0.02$, $E = 2.525$, so $D_{NAT} = 250 \times 0.001 + 2.525 = 2.775$ and $D_{PAR} = 250 \times 0.02 + 2.525 = 7.525$. Note: the error rate E is quite similar to the frequency of true polymorphisms $D_{NAT}$! If the posterior probability, $P_{NAT}$, is higher than $P_{NAT,MIN}$, the EST is considered native; otherwise, it is considered paralogous. Marth et al. Nature Gen. 23, 452 (1999)
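The example can be reproduced with a short calculation. This is a sketch under stated assumptions: the expected number of sequencing errors is taken as the alignment length times the sum of the two per-base error probabilities (which reproduces E = 2.525 for quality values 20 and 40), the two models are compared with flat priors, and all function names are illustrative.

```python
import math

def phred_error(q):
    """Error probability for a Phred quality value q: 10^(-q/10)."""
    return 10 ** (-q / 10)

def poisson_pmf(d, lam):
    """Probability of observing d events for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** d / math.factorial(d)

def expected_mismatches(length, q_est, q_anchor, p_poly=0.001, p_par=0.02):
    """Return (E, D_NAT, D_PAR) for a pair-wise match of the given length."""
    e = length * (phred_error(q_est) + phred_error(q_anchor))  # assumption, see lead-in
    return e, length * p_poly + e, length * p_par + e

def p_native(d, length, q_est, q_anchor):
    """Posterior probability of the native model given d mismatches (flat priors)."""
    _, d_nat, d_par = expected_mismatches(length, q_est, q_anchor)
    like_nat = poisson_pmf(d, d_nat)
    like_par = poisson_pmf(d, d_par)
    return like_nat / (like_nat + like_par)

e, d_nat, d_par = expected_mismatches(250, q_est=20, q_anchor=40)
print(e, d_nat, d_par)
for d in (2, 5, 12):
    print(d, p_native(d, 250, 20, 40))
```

With these numbers, an EST showing only a couple of mismatches scores well above a cutoff such as 0.75, while one with a dozen mismatches scores close to zero.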

Paralogue discrimination
Distribution of the posterior probability values, $P_{NAT}$, calculated for 1,954 cluster members from real EST data anchored to ten genomic clone sequences. The bimodal distribution indicates that one can distinguish between less accurate sequences that nevertheless originate from the same underlying genomic location, and more accurate sequences with high-quality discrepancies that are likely to be paralogous. Using $P_{NAT,MIN} = 0.75$, 23% of the cluster members were declared as paralogous. Marth et al. Nature Gen. 23, 452 (1999)

SNP detection
The Polybayes algorithm identifies polymorphic locations by evaluating the likelihood of nucleotide heterogeneity within (perpendicular) cross-sections of a multiple alignment, i.e. single nucleotide positions. Each of the nucleotides $S_1, ..., S_N$, in a cross-section of $N$ sequences, $R_1, ..., R_N$, can be any of the four DNA bases, for a total of $4^N$ nucleotide permutations. The likelihood $P(S_i|R_i)$ that a nucleotide $S_i$ is A, C, G, or T is estimated from the error probability $P_{Error,i}$ obtained from the base quality value: $(1 - P_{Error,i})$ is assigned to the called base, and $P_{Error,i}/3$ to each of the three uncalled bases. In the absence of likelihood estimates, insertions and deletions are not considered. Marth et al. Nature Gen. 23, 452 (1999)

SNP detection in multiple alignments
Each heterogeneous (polymorphic) permutation is classified according to its nucleotide multiplicity, the specific variation, and the distribution of alleles. The value $P_{POLY} = 0.003$ (1 polymorphic site in 333 bp) was used as the total a priori probability that a site is polymorphic. This total $P_{POLY}$ has to be distributed among the heterogeneous permutations to assign a prior probability $P_{Prior}(S_1, ..., S_N)$ to each permutation. Here: assign equal values to different variation types. $P_{Prior} = (1 - P_{POLY})/4$ is assigned to each of the four non-polymorphic permutations, corresponding to a uniform base composition, $P_{Prior}(S_i)$. Marth et al. Nature Gen. 23, 452 (1999)

SNP detection in multiple alignments
The Bayesian posterior probability of a particular nucleotide permutation is calculated considering the $4^N$ different permutations (for N positions) as the set of conflicting models. The Bayesian posterior probability of a SNP, $P_{SNP}$, is the sum of posterior probabilities of all heterogeneous permutations observed in the cross section. The computation is performed with a recursive algorithm. A site within a multiple alignment is reported as a candidate SNP if the corresponding posterior probability exceeds a set threshold value, $P_{SNP,MIN}$. Marth et al. Nature Gen. 23, 452 (1999)
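To make the calculation concrete, here is a brute-force sketch that enumerates all permutations of a small cross-section instead of using a recursive algorithm. It makes two simplifying assumptions that are not from the slides: the total polymorphic prior (taken as 0.003) is spread uniformly over all heterogeneous permutations, and the uniform base composition is dropped as a constant factor. The function names are illustrative, and the approach is only feasible for a handful of sequences.

```python
import itertools

BASES = "ACGT"

def base_likelihoods(called, p_error):
    """Likelihood of each true base given the called base and its error probability."""
    return {b: (1 - p_error) if b == called else p_error / 3 for b in BASES}

def snp_posterior(calls, p_errors, p_poly=0.003):
    """Posterior probability that a cross-section of called bases is polymorphic."""
    n = len(calls)
    if n < 2:
        raise ValueError("need at least two sequences in the cross-section")
    likes = [base_likelihoods(c, e) for c, e in zip(calls, p_errors)]
    prior_mono = (1 - p_poly) / 4          # each of the four uniform permutations
    prior_hetero = p_poly / (4 ** n - 4)   # assumption: flat over heterogeneous ones
    total = 0.0
    hetero = 0.0
    for perm in itertools.product(BASES, repeat=n):
        weight = 1.0
        for base, lk in zip(perm, likes):
            weight *= lk[base]
        if len(set(perm)) == 1:
            weight *= prior_mono
        else:
            weight *= prior_hetero
            hetero += weight
        total += weight
    return hetero / total

q40 = 1e-4  # error probability for quality value 40
print(snp_posterior("AAGG", [q40] * 4))            # two well-supported alleles
print(snp_posterior("AAAA", [q40] * 4))            # no variation
print(snp_posterior("AAAG", [q40, q40, q40, 0.1])) # single low-quality discrepancy
```

The third case shows why base quality matters: one discrepant base with a high error probability is explained more readily by a sequencing error than by a polymorphism.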

SNP detection in multiple alignments
- SNP density is relatively constant across the autosomes. - two exonic SNPs per gene are estimated - density of SNPs in exons (1 SNP per 1.08 kb) is higher than in the genome as a whole; this reflects the fact that sequencing efforts focus on exonic regions. The SNP Consortium, Nature 409, 928 (2001) 26. Lecture WS 2003/04 Bioinformatics III

Analysis of nucleotide diversity
Describing the underlying pattern of nucleotide diversity requires a polymorphism survey performed at high density, in a single, defined population sample, and analyzed with a uniform set of tools. Analyze 4.5 M passing sequence reads by genomic alignment using the NQS. The set contains 1.2 billion aligned bases and 920,752 heterozygous positions. Measure nucleotide sequence variation using the normalized measure of heterozygosity ($\pi$), representing the likelihood that a nucleotide position will be heterozygous when compared across two chromosomes selected randomly from a population. The SNP Consortium, Nature 409, 928 (2001)

Nucleotide diversity by chromosome
The autosomes are quite similar to one another. The most striking difference is the lower diversity of the sex chromosomes X and Y. This may be explained by a lower effective population size ($N_e$) and a lower mutation rate ($\mu$). The SNP Consortium Nature 409, 928 (2001)

Distribution of heterozygosity
Draw the distribution of heterozygosity a, The genome was divided into contiguous bins of 200,000 bp based on chromosome coordinates, they are randomly shuffled, and the number of high-quality bases examined and heterozygosity calculated for each. The heterozygosities are quite different b, Heterozygosity was calculated across contiguous 200,000-bp bins on Chromosome 6. The blue lines represent the values within which 95% of regions fall: 2.0 10-4. Red, bins falling outside this range. The SNP Consortium Nature 409, 928 (2001) 26. Lecture WS 2003/04 Bioinformatics III

Distribution of heterozygosity
One measure of the spread in the data is the coefficient of variation (CV), the ratio of the standard deviation ($\sigma$) to the mean ($\mu$) of the heterozygosity $\pi$ of each individual read. (Each nucleotide position has its own $\pi$ value; from these, compute the average ($\mu$) and standard deviation ($\sigma$) for each read.) For the observed data, the CV ($\sigma_{observed}/\mu_{observed}$) was 1.93, considerably larger than would be expected if every base had uniform diversity, corresponding to a Poisson sampling process ($\sigma_{Poisson}/\mu_{Poisson} = 1.73$). This high variability can be expected because both biochemical and evolutionary forces cause diversity to be nonuniform across the genome. The SNP Consortium Nature 409, 928 (2001)
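For reference, the coefficient of variation is simple to compute. The sketch below uses the population standard deviation and a made-up list of values; neither the data nor the function name comes from the paper.

```python
import statistics

def coefficient_of_variation(values):
    """Ratio of the (population) standard deviation to the mean."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("CV is undefined for a mean of zero")
    return statistics.pstdev(values) / mean

# Hypothetical heterozygosity-like values: most entries show no variation.
values = [0, 0, 0, 4]
cv = coefficient_of_variation(values)
print(cv)
```

A larger CV than the sampling model predicts means the values are more unevenly spread than chance alone would produce.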

Distribution of heterozygosity
Biological factors may include rates of mutation and recombination at each locus. The figure shows that heterozygosity is correlated with the GC content for each read, reflecting the high frequency of CpG to TpG mutations arising from deamination of 5-methylcytosine. Population genetic forces are likely to be even more important. Each locus has its own history, with samples at some loci tracing back to a recent common ancestor, and other loci describing more ancient genealogies. The SNP Consortium Nature 409, 928 (2001)

What to prepare? V12 – pharmacogenomics
26. Lecture WS 2003/04 Bioinformatics III

Haplotypes
a The diagram shows 5 haplotypes. 12 SNPs are localized in order along the chromosome. The letters on the top indicate groups of SNPs that have perfect pairwise linkage disequilibrium (LD) with one another, and the numbers on the bottom indicate each of the 12 SNPs. SNP 9 is the causal variant, which in this simple example determines drug response: allele C results in a therapeutic response, whereas allele G results in an adverse reaction. In this example, the selection of just one SNP from each of the groups A–E would be sufficient to fully represent all of the haplotype diversity. Each haplotype can be identified by just five tagging SNPs (tSNPs), and the causal variant would be tagged even if it were not itself typed. So, tSNP profiles that are highlighted predict an adverse reaction to the medicine. Normally, LD patterns are not so clear-cut and statistical methods are required to select appropriate sets of tSNPs. Goldstein et al. Nature Rev. Gen. 4, 937 (2003)
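For the clear-cut case described above, tSNP selection can be sketched by grouping SNPs whose alleles split the haplotypes in exactly the same way and picking one SNP per group. The five haplotype strings below are hypothetical (they are not the ones in the diagram), constructed so that SNP 9 carries C in three haplotypes and G in two; the function names are illustrative.

```python
def canonical_pattern(column):
    """Relabel alleles by order of first appearance, so that two SNPs in
    perfect LD (same split of the haplotypes) get the same pattern."""
    seen = {}
    return tuple(seen.setdefault(allele, len(seen)) for allele in column)

def ld_groups(haplotypes):
    """Group 1-based SNP positions that share the same canonical pattern."""
    groups = {}
    for index, column in enumerate(zip(*haplotypes)):
        groups.setdefault(canonical_pattern(column), []).append(index + 1)
    return list(groups.values())

# Hypothetical haplotypes, 12 SNPs each.
haplotypes = [
    "ACTGACTACGTA",
    "ACTATGAACGTA",
    "GTCGAGAACGTA",
    "GTCATCTCGTTA",
    "GTCGACTCGTCG",
]

groups = ld_groups(haplotypes)
tags = [group[0] for group in groups]  # one tSNP per group
tag_profiles = [tuple(h[t - 1] for t in tags) for h in haplotypes]
print(groups)
print(tags)
print(tag_profiles)
```

Here the causal SNP 9 falls in the same group as SNP 8, so typing SNP 8 alone is enough to predict the allele at SNP 9 in this population.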

Haplotypes
b The diagram depicts the same 12 SNPs, but with different associations among them, as might happen in a different population group. Because patterns of LD are different, some patients would be misclassified if the same five tSNPs were used and interpreted in the same way. Using the same SNP profiles as defined in population A, haplotype profiles 1, 2 and 3 are predicted to have allele C at the causal SNP 9 (a therapeutic response), whereas haplotype profiles 4 and 5 are predicted to have an adverse response. However, because the pattern of association has changed, the new haplotype patterns 6 and 7 that occur in population B are misclassified. New diagram – find possible tSNPs. Goldstein et al. Nature Rev. Gen. 4, 937 (2003)

Lessons from cloned mendelian genes
HGMD lists mutations in 1222 genes associated with human diseases and traits. In-frame amino acid substitutions are the most frequent. Less than 1% are found in regulatory regions. These data provide overwhelming support for the notion that mendelian clinical phenotypes are associated primarily with alterations in the normal coding sequence of proteins. Botstein & Risch, Nature Gen. 33, 228 (2003) 26. Lecture WS 2003/04 Bioinformatics III

Criteria for amino acid replacements
Distinguish (1) biochemical severity of missense changes, and (2) location and/or context of the altered amino acid in the protein sequence. A useful guide is the Grantham scale: categorize codon replacements into classes of increasing chemical dissimilarity between the encoded amino acids: conservative, moderately conservative, moderately radical, radical, and "stop" or nonsense. There is a clear relationship between the severity of amino acid replacement and the likelihood of clinical observation. Botstein & Risch, Nature Gen. 33, 228 (2003)

Clinical severity increases with severity of AA substitution
Purple bars represent the ratio of frequencies of the indicated class of change compared to conservative changes for functional human genes compared to pseudogenes. Orange bars represent the ratio of the likelihood of clinical observation for a conservative change versus the indicated class of change. A nonsense change is 9 times more likely to present clinically than a conservative amino acid substitution. For the other changes, the ratios are 3, 2.3, and 1.8. The same trend exists for the relative abundance of the different types of substitutions found in SNPs from human genes as compared with their abundance in pseudogenes. Evolution selects against radical changes! Botstein & Risch, Nature Gen. 33, 228 (2003)

Clinical significance correlates with degree of cross-species evolutionary conservation
An obvious way to measure the importance of a particular amino acid is its conservation across species. The figure shows that the disease probability decreases monotonically with the number of amino acid differences among species. In simple terms: if evolution tolerates substitutions at a position between species, the amino acid at that position is unlikely to be crucial. The figure plots relative risks (log odds ratios) for the observed versus the expected number of amino acid changes. Purple: severe diseases. Orange: milder disease mutations (G6PD). Botstein & Risch, Nature Gen. 33, 228 (2003)

large-scale SNP discovery projects
Two strategies: "map-based" or "sequence-based". It is unclear which one will be more effective. The private sequencing effort has reported 2.1 million SNPs (Venter et al. 2001) and the public SNP consortium has identified 1.4 million SNPs (Sachidanandam et al. 2001). Rates of false positives (10-15%) are modest. Rates of false negatives (undetected SNPs) are more problematic. Neither collection was based on the sequences of many individuals; as a result, many lower-frequency (< 10%) SNPs were not detected, especially those that are specific to a single population. Botstein & Risch, Nature Gen. 33, 228 (2003)

fine-scale SNP discovery projects
Study A analyzed 313 genes (720 kb of genomic sequence) for 84 ethnically diverse individuals. Only 2% (or 6% excluding singletons) of the SNPs identified are in dbSNP, suggesting that there exist many more SNPs than the roughly 1.2 million unique SNPs in dbSNP. Study B analyzed 65% of the unique sequence of chromosome 21 for 10 individuals. SNPs were identified, which extrapolates to > 6.4 million SNPs for the whole genome. Only 45% of the SNPs in dbSNP were found in this study. Conclusion: the number of SNPs in the human genome (defined by a rare-allele frequency of 1% or greater in at least one population) is likely to be > 15 million. Note: there are only genes. Botstein & Risch, Nature Gen. 33, 228 (2003)

fine-scale SNP discovery projects
The alternative to the "map-based" strategy is based on genes and sequence. Here, genotyping focuses on SNPs identified in coding regions that alter or terminate the amino acid sequence, disrupt splice sites, or occur in promoter regions. The table shows that we expect – such gene-related SNPs. Based on results from cloned Mendelian diseases, one can prioritize amino acid replacements according to (a) the severity of the alteration, and (b) the degree of evolutionary conservation. Botstein & Risch, Nature Gen. 33, 228 (2003)
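
The two prioritization criteria, (a) severity and (b) conservation, could be combined as in the following sketch. The ranking rule, the field names `severity` and `species_differences`, and the example variants are all hypothetical; Botstein & Risch do not prescribe a specific scoring scheme here.

```python
# Severity classes in increasing order of severity.
SEVERITY_RANK = {
    "conservative": 0,
    "moderately conservative": 1,
    "moderately radical": 2,
    "radical": 3,
    "nonsense": 4,
}

def prioritize(variants):
    """Sort variants so that the most severe changes come first; ties are
    broken in favour of positions with fewer amino acid differences among
    species (i.e. more strongly conserved positions)."""
    return sorted(
        variants,
        key=lambda v: (-SEVERITY_RANK[v["severity"]], v["species_differences"]),
    )

# Hypothetical candidate SNPs.
candidates = [
    {"id": "snp_a", "severity": "conservative", "species_differences": 0},
    {"id": "snp_b", "severity": "nonsense", "species_differences": 3},
    {"id": "snp_c", "severity": "radical", "species_differences": 0},
    {"id": "snp_d", "severity": "radical", "species_differences": 4},
]
print([v["id"] for v in prioritize(candidates)])
```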

Can disease-associated alleles be predicted from sequence?
The main feature that distinguishes a map-based approach from a sequence-based approach to genome-wide association studies is the degree to which functional variants can be predicted on the basis of sequence in, for example, coding and/or conserved regions of the genome. Table 1 showed that, for Mendelian phenotypes, most diseases are the result of changes that cause loss or alteration of the encoded proteins. < 1% of listed mutations occur in regulatory regions (these would be more difficult to predict from sequence). The greatest risk of a disease phenotype is associated with splice-site mutations, deletions and insertions. Botstein & Risch, Nature Gen. 33, 228 (2003)

Structure of Membrane Transporters
Transmembrane (TM) helices: 25-residue-long stretches, purely hydrophobic; prediction accuracy > 90%. Typically, TM helices align to form a pore. External domains are very variable in size. The figure shows predicted secondary structures of two representative membrane transporters from the ABC and SLC superfamilies. The transmembrane topology is schematically rendered. Leabman et al. PNAS 100, 5896 (2003)
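
Because TM helices are long hydrophobic stretches, a simple sliding-window hydropathy scan already finds many of them. The sketch below uses the Kyte-Doolittle hydropathy scale with a window of 19 residues and a threshold of 1.6. The slide does not name a prediction method, so this choice is an assumption, and the function names and test sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values per amino acid.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def mean_hydropathy(seq, window=19):
    """Mean hydropathy of every window of the given length."""
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]

def candidate_tm_segments(seq, window=19, threshold=1.6):
    """Merge overlapping windows whose mean hydropathy reaches the threshold
    into (start, end) segments, with end exclusive."""
    segments = []
    for start, mean in enumerate(mean_hydropathy(seq, window)):
        if mean >= threshold:
            end = start + window
            if segments and start <= segments[-1][1]:
                segments[-1] = (segments[-1][0], end)
            else:
                segments.append((start, end))
    return segments

# Artificial sequence: a 25-residue leucine stretch flanked by charged residues.
toy_protein = "D" * 10 + "L" * 25 + "K" * 10
print(candidate_tm_segments(toy_protein))
```

The reported segment extends a few residues beyond the hydrophobic core because windows that only partly overlap it can still exceed the threshold.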

Aims of SNP scan
Analyze 247 DNA samples from an ethnically diverse collection (100 European Americans, 100 African Americans, 30 Asians, 10 Mexicans, 7 Pacific Islanders). Identify SNPs. Aim 1: determine the levels and patterns of genetic diversity (a) in different ethnic groups, (b) in different transporter families, and (c) across different structural regions of membrane transporters. Aim 2: combine population-genetic and phylogenetic analysis to identify amino acid residues and protein domains that may be important for human fitness. Infer functional consequences of amino acid substitution. To identify polymorphisms, screen all exons plus bp of flanking intronic sequence. Leabman et al. PNAS 100, 5896 (2003)

Analysis of Nucleotide Diversity
On average, genetic variation in membrane transporters (nucleotide diversity, π) is similar to that in other genes. Next: study nucleotide diversity in TM domains and in loop domains. Leabman et al. PNAS 100, 5896 (2003)
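
Nucleotide diversity can be computed from aligned sequences as the average number of pairwise differences per site. This standard definition is assumed here, since the slide does not spell it out; the function name and the toy alignment are hypothetical.

```python
from itertools import combinations

def nucleotide_diversity(aligned_seqs):
    """Average number of pairwise differences per site for a set of
    aligned, equal-length sequences."""
    n = len(aligned_seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    total_differences = sum(
        sum(a != b for a, b in zip(s, t))
        for s, t in combinations(aligned_seqs, 2)
    )
    n_pairs = n * (n - 1) // 2
    return total_differences / (n_pairs * length)

# Toy alignment: three sequences, four sites.
print(nucleotide_diversity(["AAAA", "AAAT", "AATT"]))
```

Restricting the alignment to TM-domain or loop positions before calling the function gives region-specific values that can then be compared.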

Variation across structural regions
As expected, amino acid diversity (π_NS) is significantly lower in TM domains than in loops. This is consistent with the observation that TM domains are evolutionarily more conserved than loops, and suggests that there are constraints on the TM domains of transporters. EC: evolutionarily conserved. EU: evolutionarily unconserved. The agreement suggests that constraints on structural regions of proteins (e.g. TM domains) occur across both long and short evolutionary distances for this set of proteins. Leabman et al. PNAS 100, 5896 (2003)

ABC and SLC superfamilies
The ABC and SLC superfamilies of transporters have evolved to transport structurally diverse biological molecules. The TMDs of both superfamilies contain residues and structural domains responsible for substrate specificity. Only the loops of the ABC transporters contain ATP-binding domains. Observation: amino acid diversity (π_NS) is extremely low in the TM domains of ABC transporters, much lower than in the TM domains of SLC family members. What could be the advantage of combining SNP detection with TM prediction? Leabman et al. PNAS 100, 5896 (2003)

Paralogue identification
Predicted secondary structures of two representative membrane transporters (BSEP and CNT1) from the ABC and SLC superfamilies showing positions of nonsynonymous SNPs (leading to amino acid mutations). The transmembrane topology schematic was rendered by using the program TOPO. Nonsynonymous amino acid changes are shown in red. Leabman et al. PNAS 100, 5896 (2003)

Evolutionary conservation
Surprisingly, the extent of amino acid diversity did not parallel evolutionary conservation: the fraction of EU residues in the TM domains of the ABC superfamily is significantly higher than in the TM domains of the SLC superfamily. This implies that a protein segment (the TM domains of ABC transporters) is more constrained within humans than across species, which may be related to substrate properties. For the SLC superfamily, NS-EC is significantly lower than NS-EU, both for the TM domains and for the loops. For the TM domains of the ABC superfamily, NS-EC ~ NS-EU. This may reflect special functional demands on the TM domains of this superfamily. Again: variation among humans does not always parallel phylogenetic variation! Leabman et al. PNAS 100, 5896 (2003)