Genomics Irena Artamonova Second European School of Bioinformatics Nijmegen, January 22, 2005.

Complete genomes

Brief calculation Approximately 233 complete genomes with about 3000 genes in each on average, i.e. roughly 700,000 genes. Almost all genes are new and unstudied. In a lab: investigation of the function of one gene requires at least one postdoc-year. Hurrah! We have work for all molecular biologists for thousands of years, right now!

We have a new “complete genome”. What can we do with it now (in silico)? (outline of the lecture) Gene recognition Prediction of regulation of gene expression Functional annotation of proteins Metabolic reconstruction Study of genome evolution Main differences: Prokaryotes and Eukaryotes

Gene recognition I. Prokaryotes Projection of known genes Genome comparisons Finding long ORFs Using DNA statistics Identification of gene starts Size of a prokaryotic genome: Pathogenic bacteria: from < 1 Mb and 600 genes Free-living bacteria: up to 6-9 Mb, 9000 genes E.g., Escherichia coli: 4.6 Mb, 4400 genes

Mapping “known” genes BLASTx: //www.ncbi.nlm.nih.gov/BLAST/ A lot of information when a close genome is well-studied. But it happens rarely. Problems: choice of thresholds, fine mapping of start positions in other cases. No perfect solutions.

Using long ORFs –What minimal length is functional? –Which Met is the start? ORFs in a fragment of the K. pneumoniae genome
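The long-ORF idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it scans the forward strand only, treats ATG as the sole start codon (real prokaryotic genes may also start with GTG or TTG), and the `min_codons` threshold is a hypothetical parameter that reflects exactly the "what minimal length is functional?" question on the slide.

```python
# Sketch: list forward-strand ORFs that start with ATG and end at a stop codon.
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=100):
    """Return (start, end) pairs, 0-based, end exclusive, stop codon included."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG in the currently open frame
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs
```

Taking the first ATG is one arbitrary answer to "which Met is the start?"; the later slides on gene starts describe better evidence.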

Use of DNA statistics in gene recognition Frequencies of codons differ from frequencies of non-coding triplets: frequencies of amino acids (and their codons); frequencies of dipeptides; frequencies of synonymous codons (genome-specific, correlate with tRNA concentration).

Coding potential A function measuring whether a genomic fragment is coding or non-coding, based on its DNA statistics. We can calculate coding potential for ORFs or for a sliding window. “Sliding window” technique: Scan the DNA sequence with a sliding window of fixed size. Calculate the coding potential for each window position and plot it above the sequence (horizontal axis). Choose the window size so as to minimize random noise.
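One possible coding potential, sketched below, is the mean log-odds of in-frame triplets under a "coding" versus a "non-coding" frequency table. The lecture does not specify the function, so this particular choice, the pseudocount, and the names `triplet_freqs`, `coding_potential` and `scan` are assumptions made for illustration.

```python
import math
from collections import Counter

def triplet_freqs(seqs, pseudo=1.0):
    """Frequencies of in-frame triplets in training sequences (with pseudocounts)."""
    counts = Counter()
    for s in seqs:
        for i in range(0, len(s) - 2, 3):
            counts[s[i:i + 3]] += 1
    total = sum(counts.values()) + pseudo * 64
    return lambda t: (counts[t] + pseudo) / total

def coding_potential(window, f_cod, f_non):
    """Mean log-odds (coding vs non-coding) per triplet, window read in frame 0."""
    n = len(window) // 3
    score = 0.0
    for i in range(0, 3 * n, 3):
        t = window[i:i + 3]
        score += math.log(f_cod(t) / f_non(t))
    return score / n

def scan(seq, f_cod, f_non, size=96, step=3):
    """Slide a fixed-size window along seq; return (position, potential) pairs."""
    return [(p, coding_potential(seq[p:p + size], f_cod, f_non))
            for p in range(0, len(seq) - size + 1, step)]
```

A smaller `size` gives finer resolution but a noisier curve, which is the trade-off between the 96 nt and 48 nt windows shown for E. coli.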

Selection of window size for sliding window E. coli: 96nt window 48nt window

Exact mapping of gene start positions Prokaryotes: the starting methionine is preceded by a ribosome-binding site (the so-called Shine-Dalgarno box, any part of GGAGGA). Extension of the nucleotide alignment with the orthologous region from a related genome: mutation patterns in the coding region differ from those in the intergenic region.
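A minimal sketch of the Shine-Dalgarno check: look in the region upstream of a candidate start codon for the longest part of GGAGGA. The minimum accepted length (`min_len=4`) and the function name `sd_hit` are assumptions for illustration; real gene finders also weigh the spacing between the box and the start codon.

```python
SD = "GGAGGA"  # Shine-Dalgarno box as given on the slide

def sd_hit(upstream, min_len=4):
    """Longest substring of SD (length >= min_len) found in the upstream region.

    Returns (0-based position, matched word), or None if nothing is found.
    """
    for k in range(len(SD), min_len - 1, -1):      # try longest words first
        for j in range(len(SD) - k + 1):
            word = SD[j:j + k]
            pos = upstream.find(word)
            if pos != -1:
                return pos, word
    return None
```

Applied to the 24 nucleotides preceding ATG in two Bacillus subtilis genes from this lecture, dnaN contains the full box while pabB contains only a five-letter part.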

rbsD in enterobacteria
Sty AGGGTTACACTGCGGC-CAGCGAAACGTTTCGCTAGTGGAGCAGAAAAATGAAGAAAGGC
Sen AGGGTTACACTGCGGC-CAGCGAAACGTTTCGCTAGTGGAGCAGAAAAATGAAGAAAGGC
Stm GGGGTTACACTGCGGC-CAGCGAAACGTTTCGCTAGTGGAGCAGAAAAATGAAGAAAGGC
Eco AGGATTAAACTGTGGGTCAGCGAAACGTTTCGCTGATGGAGAA-AAAAATGAAAAAAGGC
Ype TTTTCTAAACTCCTTGTTAGCGAAACGTTTCGCTCTTGGAGTA-GATCATGAAAAAAGGT
         ** ***       ****************  ***** *  *  ***** *****

Sty ACCGTACTCAACTCTGAAATCTCGTCGGTCATTTCCCGTCTGGGGCATACTGATACTCTG
Sen ACCGTACTCAACTCTGAAATCTCGTCGGTCATTTCCCGTCTGGGGCATACTGATACTCTG
Stm ACCGTACTCAACTCTGAAATCTCGTCGGTCATTTCCCGTCTGGGGCATACTGATACTCTG
Eco ACCGTTCTTAATTCTGATATTTCATCGGTGATCTCCCGTCTGGGACATACCGATACGCTG
Ype GTATTACTGAACGCTGATATTTCCGCGGTTATCTCCCGTCTGGGCCATACCGATCAGATT
        * ** **  **** ** **  **** ** *********** ***** ***    *

Pattern of nucleotide changes in protein-coding regions: pdxB in enterobacteria
Sty TCGCTCG--CAGCGGAAAGAGGATTACGCCCTTCGCCTGGAGGCTGTGCAGGGGC---GCCGGAGATGGGATGCATAATT
Stm TCGCTCG--CAGCGGAAAGAGGATTACGCCCTTCGCCTGGAGGCTGTGCAGGGGC---GCCGGAGATGGGATGCATAATT
Sen TCGCTCG--CAGCGGAAAGAGGATTACGCCCTTCGCCTGGAGGCTGTGCAGGGGC---GCCGGAGATGGGATGCATAATT
Eco TTGCCCG--TGCCAGACGGCAGATTATCTCCCTGACCTGGTGGTTGCCCAGGAGGAGGGCCGGAAATAGGTTGTATCATT
Kpn ----CGG--TGGCGCAGTGCCTGATGGG-CCTCGCCCTGGAGGACGGTCTGGCAT---ATCAGCAAGGGGGTGCGTCATG
Ype TTGTTAGAACAGGGGAAAACGGTAAACAGTGTGGCATTAGATGTCGGTTATAGCT-----CCGCCTCTGCTTTTATCGCC
          *        *                     * *  *  *              * *     *  *   *

Sty AATTATCCTTTAAC----------CATAAATCTGAGCAATA-TATGCTTGGCGGCCAGATTATGGC--ACACTTGTCCGG
Stm AATTATCCTTTAAC----------CATAAATCTGAGCAATA-TATGCCTGGCGGCCAGATTATGGC--ACACTTGTCCGG
Sen AATTATCCTTTAAC----------CATAAATCTGAGCAATA-TATGCCTGGCGGCCAGATTATGGC--ACACTTGTCCGG
Eco ACGTATCCTTATAC----------CTGAAATCTTCGCAAG--TATGCCTGGCCGCGAGATTATGGC--ACACTTGTCCGG
Kpn ATTCATCCTTTCGATATCGCGGTGCTGGAACCAGGTGATGAGTATGCCTGGCGGCCAGATTATGGC--ACACTTCCCCAG
Ype ATGTTTCAGCAAATAT--------CGGGTACCA-CGCCTGAGCGTTTCCGGCGGGGCAATAGTGGCTTATACTAAGCCCC
    *    **                 *    * *            *    *** *    **  ****  * ***   **

Sty TTAACTCTCGTT-CTCAAACAG------GTACGACAGTC--GTGAAAATTCTCGTTGATGAAAATATGCCTTACGCCCGC
Stm TTAACTCTCGTT-CTCAAACAG------GTACGACAGTC--GTGAAAATTCTCGTTGATGAAAATATGCCTTACGCCCGC
Sen TTAACTCTCGTT-CTCAAACAG------GTACGACAGTC--GTGAAAATTCTCGTTGATGAAAATATGCCTTACGCCCGC
Eco TTAACTCTCGT--CTCATACAG------GTAACACAAAC--GTGAAAATCCTTGTTGATGAAAATATGCCTTATGCCCGC
Kpn TTAACTCTCGTT-CTCAGACAG------GTACTGAACT---GTGAAAATCCTCGTTGATGAAAATATGCCCTATGCCCGT
Ype CTGTTTTTCATCTGTATGGCAGTTCGCTGTCGGAGAGTAAAGTGAAAATTCTGGTTGATGAAAATATGCCGTACGCTGAG
     *   * ** *   *    ***      **     *     ******** ** ***************** ** **
                                             123123123123123123123123123123123123123

Operons Majority of genes in prokaryotes are transcribed in operons. Some examples of operons in eukaryotes: C.elegans Ideas for de novo prediction of operon structure are trivial: Small distance between adjacent genes Co-orientation (lie on the same strand) More reliability when these features are conserved in different species Additional arguments: Similar functional annotations of adjacent genes Observed co-expression Known average operon length
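The two basic features named above (small intergenic distance and co-orientation) are enough for a first-pass grouping, sketched below. The 100 bp gap threshold and the tuple layout `(name, start, end, strand)` are assumptions for illustration; conservation across species and functional annotation would be added as further evidence.

```python
def predict_operons(genes, max_gap=100):
    """Group adjacent co-oriented genes separated by at most max_gap bp.

    genes: list of (name, start, end, strand) tuples sorted by start coordinate.
    Returns a list of operons, each a list of gene names.
    """
    operons = []
    for gene in genes:
        if operons:
            prev = operons[-1][-1]
            same_strand = gene[3] == prev[3]
            close = gene[1] - prev[2] <= max_gap
            if same_strand and close:
                operons[-1].append(gene)   # extend the current operon
                continue
        operons.append([gene])             # start a new transcription unit
    return [[g[0] for g in op] for op in operons]
```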

Training for a completely new genome For all the methods already discussed we need some initial knowledge about genes in the genome (DNA statistics, minimal ORF length, etc.), from known genes or their very close orthologs. When we have no information at all, we use an iterative process with initial parameters from very long ORFs (and/or distant orthologs with reconstructed structure) as genes, and regions with no ORFs as intergenic regions.

Gene recognition II. Eukaryotes Specifics: Exon-intron structure 9-10 coding exons per gene on average (human), ~5 exons (insects) Average length of internal exons is 120-130 nucleotides Very long introns (>10Kb) are frequent, may be as long as > 1 Mb There are no Shine-Dalgarno sequences (the Kozak rule can be used instead, but it is much weaker) => ORFs and “sliding window” techniques are inapplicable!

The rat chymotrypsin gene Inapplicability of the “sliding window” technique for eukaryotic genomes Nothing (intergenic region)

Search for “known” genes BlastX is reliable only for large exons (short introns are treated as long deletions). What can we use instead? Splicing signals! “Spliced alignment” is an alignment of a DNA fragment with a sequence coding for a homologous protein. Unlike standard alignments, it is allowed to contain non-penalized long “deletions” flanked by splicing signals (that is, introns). BLAT, ProFrame, TWINSCAN

Spliced alignments of genomic sequences VISTA (www-gsd.lbl.gov/vista/): human-dog-mouse

HMM (Hidden Markov Model) Definition: An HMM is a 5-tuple (Q, V, p, A, E), where: Q is a finite set of states, |Q| = N; V is a finite set of observation symbols per state, |V| = M; p is the vector of initial state probabilities; A is the matrix of state transition probabilities, with entries $a_{st} \equiv P(x_i = t \mid x_{i-1} = s)$ for each s, t ∈ Q; E is the emission probability matrix, $e_{sk} \equiv P(v_k \text{ at time } t \mid q_t = s)$. Property: Emissions and transitions depend on the current state only and not on the past. Output: Only emitted symbols are observable by the system, but not the underlying random walk between states -> “hidden”
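To make the definition concrete, here is a small sketch of the Viterbi algorithm, the standard way to recover the most probable hidden state path from the observed symbols. The argument names follow the 5-tuple above (states for Q, p, A, E); the dictionary layout is an assumption of this sketch, and gene finders such as GENSCAN use far richer generalized models.

```python
import math

def viterbi(obs, states, p, A, E):
    """Most probable state path for obs, computed in log space.

    p[s]: initial probability, A[s][t]: transition s -> t,
    E[s][x]: probability that state s emits symbol x.
    """
    V = [{s: math.log(p[s]) + math.log(E[s][obs[0]]) for s in states}]
    back = []
    for x in obs[1:]:
        row, ptr = {}, {}
        for t in states:
            # best predecessor of state t at this position
            best = max(states, key=lambda s: V[-1][s] + math.log(A[s][t]))
            row[t] = V[-1][best] + math.log(A[best][t]) + math.log(E[t][x])
            ptr[t] = best
        V.append(row)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):           # trace the pointers back to the start
        path.append(ptr[path[-1]])
    return path[::-1]
```

With a toy two-state model (one GC-rich state, one AT-rich state, both hypothetical), the decoded path switches state where the composition of the sequence changes.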

HMM-based Gene Finding GENSCAN (Burge 1997) FGENESH (Solovyev 1997) HMMgene (Krogh 1997) GENIE (Kulp 1996) GENMARK (Borodovsky & McIninch 1993) VEIL (Henderson, Salzberg, & Fasman 1997)

GenScan Overview Developed by Chris Burge (Burge 1997), in the research group of Samuel Karlin, Dept of Mathematics, Stanford Univ. Characteristics: –Designed to predict complete gene structures Introns and exons, Promoter sites, Polyadenylation signals –Incorporates: Descriptions of transcriptional, translational and splicing signal Length distributions (Explicit State Duration HMMs) Compositional features of exons, introns, intergenic, C+G regions –Larger predictive scope Deal with partial and complete genes Multiple genes separated by intergenic DNA in a sequence Consistent sets of genes on either/both DNA strands Based on a general probabilistic model of genomic sequences composition and gene structure

GenScan Architecture It is based on Generalized HMM (GHMM) Model both strands at once –Other models: Predict on one strand first, then on the other strand –Avoids prediction of overlapping genes on the two strands (rare) Each state may output a string of symbols (according to some probability distribution). Explicit intron/exon length modeling Special sensors for Cap-site and TATA-box Advanced splice site sensors

Regulation Less than 5% of the human genome sequence is protein-coding. What is the role of the remaining DNA? It has been suggested that a much larger part of the human genome encodes the regulatory machinery. Processes whose regulation we try to predict: Transcription (DNA → RNA) Splicing (pre-mRNA → mRNA) Translation (mRNA → protein)

Two types of analysis of regulation Prediction of regulatory signal Identification of the signal Finding new sites Signal is an ideal “site” or a set of ALL observed sites Site is a representative of the signal in the genome

Deriving the signal ab initio I. Ubiquitous (necessary) signals Examples: promoters of transcription, ribosome-binding signal, acceptor and donor splicing sites, stop codon, polyadenylation signal We know many examples and some biological characteristics (and landmarks) Often short (4-6 nucleotides)

Re-alignment approaches Initial alignment by a biological landmark –start of transcription for promoters –start codon for ribosome binding sites –exon-intron boundary for splicing sites Fix the width of the sliding window and the expected signal size Derive the signal (the most frequent word) within a sliding window Repeat for other parameters, select the best set Re-align anchoring on the signal Identify the signal positions (with non-uniform nucleotide frequencies)

Gene starts of Bacillus subtilis
dnaN  ACATTATCCGTTAGGAGGATAAAAATG
gyrA  GTGATACTTCAGGGAGGTTTTTTAATG
serS  TCAATAAAAAAAGGAGTGTTTCGCATG
bofA  CAAGCGAAGGAGATGAGAAGATTCATG
csfB  GCTAACTGTACGGAGGTGGAGAAGATG
xpaC  ATAGACACAGGAGTCGATTATCTCATG
metS  ACATTCTGATTAGGAGGTTTCAAGATG
gcaD  AAAAGGGATATTGGAGGCCAATAAATG
spoVC TATGTGACTAAGGGAGGATTCGCCATG
ftsH  GCTTACTGTGGGAGGAGGTAAGGAATG
pabB  AAAGAAAATAGAGGAATGATACAAATG
rplJ  CAAGAATCTACAGGAGGTGTAACCATG
tufA  AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ  TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA  CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM  AGATCATTTAGGAGGGGAAATTCAATG

dnaN  ACATTATCCGTTAGGAGGATAAAAATG
gyrA  GTGATACTTCAGGGAGGTTTTTTAATG
serS  TCAATAAAAAAAGGAGTGTTTCGCATG
bofA  CAAGCGAAGGAGATGAGAAGATTCATG
csfB  GCTAACTGTACGGAGGTGGAGAAGATG
xpaC  ATAGACACAGGAGTCGATTATCTCATG
metS  ACATTCTGATTAGGAGGTTTCAAGATG
gcaD  AAAAGGGATATTGGAGGCCAATAAATG
spoVC TATGTGACTAAGGGAGGATTCGCCATG
ftsH  GCTTACTGTGGGAGGAGGTAAGGAATG
pabB  AAAGAAAATAGAGGAATGATACAAATG
rplJ  CAAGAATCTACAGGAGGTGTAACCATG
tufA  AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ  TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA  CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM  AGATCATTTAGGAGGGGAAATTCAATG
cons. aaagtatataagggagggttaataATG
num.  001000000000110110000000111
      760666658967228106888659666
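The "cons." and "num." rows can be computed column by column: take the most frequent nucleotide and how many sequences carry it (the two "num." rows read vertically as a two-digit count). A minimal sketch, using four of the sequences above as input; the function name is hypothetical, and ties between equally frequent nucleotides are broken arbitrarily here.

```python
from collections import Counter

def consensus_with_counts(seqs):
    """Most frequent nucleotide per column and the number of sequences with it."""
    cons, counts = [], []
    for column in zip(*seqs):              # iterate over alignment columns
        base, n = Counter(column).most_common(1)[0]
        cons.append(base)
        counts.append(n)
    return "".join(cons), counts

starts = [
    "ACATTATCCGTTAGGAGGATAAAAATG",  # dnaN
    "GTGATACTTCAGGGAGGTTTTTTAATG",  # gyrA
    "TCAATAAAAAAAGGAGTGTTTCGCATG",  # serS
    "CAAGCGAAGGAGATGAGAAGATTCATG",  # bofA
]
cons, counts = consensus_with_counts(starts)
```

Anchored on the start codon, only the ATG columns are fully conserved; the Shine-Dalgarno box is smeared over several columns because its distance from ATG varies, which is what motivates the re-alignment step.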

dnaN ACATTATCCGTTAGGAGGATAAAAATG gyrA GTGATACTTCAGGGAGGTTTTTTAATG serS TCAATAAAAAAAGGAGTGTTTCGCATG bofA CAAGCGAAGGAGATGAGAAGATTCATG csfB GCTAACTGTACGGAGGTGGAGAAGATG xpaC ATAGACACAGGAGTCGATTATCTCATG metS ACATTCTGATTAGGAGGTTTCAAGATG gcaD AAAAGGGATATTGGAGGCCAATAAATG spoVC TATGTGACTAAGGGAGGATTCGCCATG ftsH GCTTACTGTGGGAGGAGGTAAGGAATG pabB AAAGAAAATAGAGGAATGATACAAATG rplJ CAAGAATCTACAGGAGGTGTAACCATG tufA AAAGCTCTTAAGGAGGATTTTAGAATG rpsJ TGTAGGCGAAAAGGAGGGAAAATAATG rpoA CGTTTTGAAGGAGGGTTTTAAGTAATG rplM AGATCATTTAGGAGGGGAAATTCAATG cons. tacataaaggaggtttaaaaat num. 0000000111111000000001 5755779156663678679890

Positional information content before and after re-alignment

Deriving of the signal II. Transcription regulation Transcription factors binding sites Usually longer (10-20 nts or more) Relatively small sample: only several sites in a genome at all, very few examples are known Often have some symmetry Conserved among species Experimental studies are not sufficient: they define only the regulatory region

Why are TFBSs palindromes? Examples Prokaryotes Eukaryotes

Use of symmetry DNA-binding factors and their signals: Co-operative homogeneous (Palindromes; Repeats); Co-operative non-homogeneous (Cassettes); Others. RNA signals: special conserved secondary structure

Regulation of transcription in eukaryotes

Signal, consensus
codB      CCCACGAAAACGATTGCTTTTT
purE      GCCACGCAACCGTTTTCCTTGC
pyrD      GTTCGGAAAACGTTTGCGTTTT
purT      CACACGCAAACGTTTTCGTTTA
cvpA      CCTACGCAAACGTTTTCTTTTT
purC      GATACGCAAACGTGTGCGTCTG
purM      GTCTCGCAAACGTTTGCTTTCC
purH      GTTGCGCAAACGTTTTCGTTAC
purL      TCTACGCAAACGGTTTCGTCGG
consensus    ACGCAAACGTTTTCGT

Pattern
codB      CCCACGAAAACGATTGCTTTTT
purE      GCCACGCAACCGTTTTCCTTGC
pyrD      GTTCGGAAAACGTTTGCGTTTT
purT      CACACGCAAACGTTTTCGTTTA
cvpA      CCTACGCAAACGTTTTCTTTTT
purC      GATACGCAAACGTGTGCGTCTG
purM      GTCTCGCAAACGTTTGCTTTCC
purH      GTTGCGCAAACGTTTTCGTTAC
purL      TCTACGCAAACGGTTTCGTCGG
consensus    ACGCAAACGTTTTCGT
pattern      aCGmAAACGtTTkCkT

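Frequency matrix f(b,j): frequency of nucleotide b at position j. Information content: $I = \sum_j \sum_b f(b,j)\,\log\frac{f(b,j)}{p(b)}$, where p(b) is the background frequency of nucleotide b.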
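The information content formula translates directly into code. The sketch below assumes a base-2 logarithm (so the result is in bits) and a uniform background p(b) = 0.25 unless another one is supplied; both are choices of this example, not stated in the lecture. Nucleotides with zero frequency contribute nothing, following the usual convention 0·log 0 = 0.

```python
import math
from collections import Counter

def information_content(sites, background=None):
    """I = sum_j sum_b f(b,j) * log2(f(b,j) / p(b)) for equal-length aligned sites."""
    background = background or {b: 0.25 for b in "ACGT"}
    total = 0.0
    for column in zip(*sites):
        n = len(column)
        for base, count in Counter(column).items():
            f = count / n                      # f(b, j)
            total += f * math.log2(f / background[base])
    return total
```

A fully conserved column contributes 2 bits under the uniform background, and a column with all four nucleotides equally frequent contributes 0.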

Positional weight matrix (PWM)
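A PWM can be sketched as a table of per-position log-odds weights, with the weight of a candidate site being the sum of the weights of its nucleotides. The exact weight formula used in the lecture is not preserved in this transcript, so the log-odds with pseudocounts below is one common formulation, chosen here as an assumption; `build_pwm`, `site_weight` and `best_site` are hypothetical names.

```python
import math

def build_pwm(sites, pseudo=0.5, background=0.25):
    """Per-position log-odds weights from aligned sites, with pseudocounts."""
    n = len(sites)
    pwm = []
    for column in zip(*sites):
        pwm.append({
            b: math.log((column.count(b) + pseudo) / (n + 4 * pseudo) / background)
            for b in "ACGT"
        })
    return pwm

def site_weight(pwm, word):
    """Weight of a candidate site: sum of positional weights."""
    return sum(col[b] for col, b in zip(pwm, word))

def best_site(pwm, seq):
    """0-based start of the highest-scoring window in seq."""
    k = len(pwm)
    return max(range(len(seq) - k + 1),
               key=lambda i: site_weight(pwm, seq[i:i + k]))
```

In practice every window above a chosen threshold is reported as a candidate site, which is where the false positives discussed later come from.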

Sequence logo

Greedy algorithms (MEME) Find a signal among all k-words (assuming that we know the signal length). Considering all k-words is too time-consuming (k~16). So initially we consider only k-words that are present in the fragments. For each k-word construct a matrix of “sites”: an alignment of the best “copies” of the k-word from every sequence fragment. Select the best k-word. What is the measure for comparison of matrices? Information content!

Greedy algorithms. Cont’d Select the k-word with maximal information content Problem. We considered only k-words from our sequences => may select not the signal (the consensus word), but only its best representative in our sample Solution. For each k-word from the sample construct PWM and reconstruct the frequency matrix based on it. Repeat until stabilization of the matrix. Use the consensus of this matrix.

Limitation of greedy algorithms We start from k-words in our sequences and increase the information content at each step => we find a local (not global) maximum of the functional. We need an alternative algorithm that will not be “greedy”!

Gibbs sampler Let A be a signal (set of sites), and I(A) be its information content. At each step a new site is selected in one sequence with probability $P \propto \exp[I(A_{\mathrm{new}})]$. For each candidate site the total time of occupation is computed. (Note that the signal changes all the time.)
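The sampling step can be sketched as follows. This is a toy version under stated assumptions: one site per sequence, a known site length k, a uniform background in the information content, and a fixed number of steps; the names `info` and `gibbs` and the default parameters are hypothetical. Occupation time is tracked with a simple counter per sequence, and the most occupied position is reported.

```python
import math
import random
from collections import Counter

def info(sites):
    """Information content in bits, uniform background."""
    total = 0.0
    for col in zip(*sites):
        for c in Counter(col).values():
            f = c / len(col)
            total += f * math.log2(f / 0.25)
    return total

def gibbs(seqs, k, steps=200, seed=0):
    """Return, for each sequence, the site start with the longest occupation time."""
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - k + 1) for s in seqs]
    occupation = [Counter() for _ in seqs]
    for _ in range(steps):
        i = rng.randrange(len(seqs))                 # sequence to resample
        cands = list(range(len(seqs[i]) - k + 1))
        weights = []
        for p in cands:
            trial = pos[:i] + [p] + pos[i + 1:]
            sites = [s[q:q + k] for s, q in zip(seqs, trial)]
            weights.append(math.exp(info(sites)))    # P ~ exp[I(A_new)]
        pos[i] = rng.choices(cands, weights=weights)[0]
        for j, q in enumerate(pos):                  # accumulate occupation time
            occupation[j][q] += 1
    return [c.most_common(1)[0][0] for c in occupation]
```

Because a worse site can be chosen with non-zero probability, the walk can leave a local maximum, which is the point of the method compared with the greedy search.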

Recognition of signals I. Ubiquitous signals Consensus Pattern (consensus with degenerate positions) Positional weight matrix (PWM, or profile) Weight of the site: Logical rules Neural networks

Neural networks: architecture 4 × k input neurons (sensors), each responsible for observing a particular nucleotide at a particular position OR 2 × k neurons (one discriminates between purines and pyrimidines, the other between A/T and G/C) One or more layers of hidden neurons One output neuron

Neural networks: architecture. II Each neuron is connected to all neurons of the next layer. Each connection is ascribed a numerical weight. A neuron: sums the inputs at incoming connections; compares the total with the threshold (or transforms it according to a fixed function); if the threshold is passed, excites the outgoing connections (resp. sends the modified value).

Training of the neural network Sites and non-sites from the training sample are presented one by one. The output neuron produces the prediction. The connection weights increase if the prediction is correct and decrease if it’s incorrect. Networks differ by architecture, particulars of the signal processing, the training schedule

Recognition of signals II. Regulation of transcription Neural networks don’t work: they need training, and there are too few examples. PWM is ok, but gives too many false positive predictions => we need rules to select the true sites among those predicted. Many genomes are available => comparative approach: –Consistency filtering –Phylogenetic footprinting –Phylogenetic shadowing

Definition of orthologs Orthologous genes: –the result of speciation –the “same” role in the cell Paralogous genes: –the result of duplication –keep common biochemical function Example: gluconate and idonate kinases (Figure labels: Duplication, Speciation; Genome 1: A1, B1; Genome 2: A2, B2)

Consistency filtering Basic assumption. Regulons (sets of co-regulated genes) are conserved => True sites occur upstream of orthologous genes; false sites are scattered at random. We need to check that the transcription factors themselves are true orthologs (BBH and COGs are not sufficient; check conservation of the DNA-binding domain and of the core pathway) and have exactly the same specificity (similar binding sites), and then compare the genes (and whole operons) downstream of the predicted sites.

The basic procedure Set of known sites → Profile → Genome 1, Genome 2, …, Genome N

Accounting for the operon structure

Tryptophan operons

Closely related genomes: Phylogenetic footprinting Regulatory sites are more conserved than non-coding regions in general and are often seen as conserved islands in alignments of gene upstream regions.

Low conservation

High conservation

Another variation. Phylogenetic shadowing Idea. Instead of distant orthologs use very close orthologs, but from multiple (very close) species. True sites would look like islands of strongly conserved columns on multiple alignment. Need to sequence orthologous upstream regions from a series of close genomes (e.g., from many different primates) and analyze their multiple alignment

RNA regulation. Riboswitches mRNA has two alternative conformations of its leader region: one of them blocks expression. Two main cases (prokaryotes): a terminator interrupts transcription, or a special structure blocks the ribosome-binding site. Eukaryotes: blocking of a splicing site. Riboswitches are RNA signals stabilized by a small molecule

Example of the secondary structure of a riboswitch Capitals: invariant (absolutely conserved) positions. Lower case letters: strongly conserved positions. Obligatory base pairs are set in bold. Degenerate positions: R = A or G; Y = C or U; K = G or U; B = not A; V = not U. N: any nucleotide. X: any nucleotide

Importance of prediction of RNA regulation as a bioinformatics problem The phenomenon was discovered by means of bioinformatics. The RNA signal is strongly conserved (on the sequence level, not only as secondary structure) => well-predictable (no “false positive” predictions). The share of regulation of this type is considerable (~ 5% of all genes for some species)

Assignment of function based on homology We want to characterize a new gene. What is the function of the product? The first step: BlastP. The best case: we obtain a hit with known function. Have we obtained functional information about our gene? Similarity ≠ homology: the E-value is a measure of statistical significance (non-randomness) of similarity.

Definition of orthologs Orthologous genes: –the result of speciation –the “same” role in the cell Paralogous genes: –the result of duplication –keep common biochemical function Example: gluconate and idonate kinases (Figure labels: Duplication, Speciation; Genome 1: A1, B1; Genome 2: A2, B2)

Orthologs or paralogs? The best proof is a phylogenetic tree, but it’s too time-consuming. We use BBH - Bidirectional Best Hit. COGs – Clusters of orthologous genes (//www.ncbi.nlm.nih.gov/COGs/new) (prokaryotes) or KOGs (eukaryotes)
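The BBH criterion is simple to state in code: gene a from genome 1 and gene b from genome 2 form a pair when b is the best hit of a and a is the best hit of b. The sketch below assumes similarity hits have already been computed and are supplied as dictionaries of `(gene, score)` lists; that input layout and the scores in the usage example are hypothetical.

```python
def bidirectional_best_hits(hits_ab, hits_ba):
    """Pairs (a, b) that are each other's best-scoring hit.

    hits_ab[a] is a list of (gene_in_genome_B, score); hits_ba is the reverse search.
    """
    best_ab = {a: max(h, key=lambda x: x[1])[0] for a, h in hits_ab.items() if h}
    best_ba = {b: max(h, key=lambda x: x[1])[0] for b, h in hits_ba.items() if h}
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Hypothetical scores for the A1/B1 vs A2/B2 situation from the ortholog definition.
hits_12 = {"A1": [("A2", 200), ("B2", 80)], "B1": [("A2", 90), ("B2", 60)]}
hits_21 = {"A2": [("A1", 200), ("B1", 90)], "B2": [("A1", 80), ("B1", 60)]}
pairs = bidirectional_best_hits(hits_12, hits_21)
```

In this made-up example only A1/A2 is recovered; B1 and B2 each prefer a paralog, which shows why BBH is "fast and dirty" rather than a substitute for a phylogenetic tree.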

Search for orthologs (fast and dirty)

Assignment of a new gene to a specific functional system. I Positional clustering Operon: co-transcription of several genes (usual for prokaryotes, rare for eukaryotes, e.g. Caenorhabditis elegans). Genes are transcribed together, and so under exactly the same conditions => they are functionally related

Assignment of a new gene to a specific functional system. II Genes are not in the same operon, but in the same locus: horizontal transfer Divergon: a regulatory signal influences both the direct and the complementary strands (usually with opposite effects) regulatory site(s) gene (operon) on (+) strand gene (operon) on (-) strand

Measure of positional closeness Let us use a measure of positional neighborhood: the fraction of divergent genomes in which our genes are closely located. Servers that predict functional dependence: ERGO (//www.cordis.lu/ergo/ ), STRING (//string.embl.de/, may be described at the proteomics day): implementation and visualization of ALL the techniques related to this area

Eukaryotic case: domain shuffling Compression of biochemical functions into single molecules Prokaryotes: all enzymatic activities carried out by separate proteins Fungi: FAS1 gene encodes activities 3 and 4; FAS2 gene encodes activities 1, 2 and 5-7 Animals: all activities encoded by fatty-acid synthase

Genomic structure of fatty-acid synthase from rat

Protein domains InterPro: www.ebi.ac.uk/interpro/ Pfam: http://www.sanger.ac.uk/Software/Pfam/

Co-regulation Genes that are distant in the genome, but are regulated similarly. Very similar to the case of operons But it’s hard to work with computationally. A lot of manual analysis is necessary.

Co-expression If the expression of two genes changes consistently in response to changing conditions or in time => they are functionally related Microarray data analysis: a special area of bioinformatics (Transcriptomics session)

Protein-protein interactions Evidence of physical interaction is direct proof that the proteins function together in one cellular system. Will be discussed in detail at the Proteomics session

Phylogenetic profiling Usually a functional system is present or absent in a genome as a whole (or this holds for a separate subsystem) => If we have many distant complete genomes, we can compare patterns of occurrence (phylogenetic profiles) for individual genes. This is rather weak evidence, but useful in combination with other techniques. The converse situation is also interesting: genes with complementary phylogenetic profiles may have identical function (non-orthologous displacement: paralogs, specificity changes or really different structure).
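Both uses of phylogenetic profiles reduce to comparing presence/absence strings. In the sketch below a profile is a string of 0/1 characters, one per genome, and the Hamming distance is used for comparison; the distance measure, the tolerance `max_dist` and the example profiles are assumptions for illustration only.

```python
def profile_distance(p, q):
    """Hamming distance between two presence/absence profiles of equal length."""
    return sum(a != b for a, b in zip(p, q))

def similar_and_complementary(profiles, gene, max_dist=0):
    """Genes whose profile matches that of `gene`, and genes with the complementary one."""
    target = profiles[gene]
    n = len(target)
    others = [(g, p) for g, p in profiles.items() if g != gene]
    similar = [g for g, p in others if profile_distance(p, target) <= max_dist]
    complementary = [g for g, p in others if profile_distance(p, target) >= n - max_dist]
    return similar, complementary

# Hypothetical profiles over six genomes: 1 = gene present, 0 = absent.
profiles = {"x": "110100", "y": "110100", "z": "001011", "w": "111100"}
result = similar_and_complementary(profiles, "x")
```

Here y would be a candidate member of the same functional system as x, and z a candidate for non-orthologous displacement.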

Combining of methods Each individual type of evidence is rather weak => we need to combine methods in every case. BlastP => general biochemical function Positional clustering and/or domain shuffling and/or phylogenetic profiling => assignment to a functional system Metabolic reconstruction => gaps in this system Try to place the product of our gene into each gap => (if we are lucky) exact biochemical function and exact position in the metabolic pathway

Archaeal shikimate-kinase Chorismate biosynthesis pathway (E. coli)

Pectin utilization E. chrysanthemi

… and transport of oligogalacturonates E. chrysanthemi Y. pestis K. pneumoniae

YpaA: riboflavin transport 5 predicted TM segments => potential transporter. Regulatory RFN-element => co-regulation with genes of riboflavin metabolism => transport of riboflavin or one of its precursors. S. pyogenes, E. faecalis, Listeria: have ypaA, no genes of riboflavin biosynthesis => transport of riboflavin. So, prediction: YpaA is a riboflavin transporter (Gelfand et al., 1999). Verification: YpaA imports riboflavin (genetic analysis, Kreneva et al., 2000); YpaA is regulated by riboflavin (microarray expression analysis, Lee et al., 2001; direct verification, Winkler et al., 2002).

Genome evolution. Repeats More than 45% of the human genome is repetitive DNA. A. Smith: “The best algorithm of gene prediction is to mask the repeats, and the rest will be genes!” Genome-specific classes of repeats are unique markers of genome post-speciation evolution (did humans appear due to special repeats?!) Too many repeats => finding them is a computational task. Repeats influence gene recognition, similarity search and other genomic analyses. Mask repeats first!

RepeatMasker www.repeatmasker.org/

Duplications in genomes. Example of a locus with internal duplications: MAGE-A locus, human X chromosome. Gene labels: MAGEA9a, LW-1a, FAM11a, LW-1b, MAGEA9b, … 2 Mb …, MAGEA4, GABRE, MAGEA5, MAGEA10, GABRA3, GABRQ, MAGEA6, TRAG3a, MAGEA2a, MAGEA12, CSAGE, MAGEA2b, TRAG3b, MAGEA3 (repeat I, repeat II), … 6 genes …, MAGEA1, …, MAGE8

Duplications The main problem caused by duplications: assembly of newly sequenced genomes. No universal solution: every group uses its own algorithm and software. Human genome: the number of duplications changes from one release to another. The two initial versions (Int. consortium, Celera) differed significantly with respect to duplications

Synteny groups Human chromosomes cut into > 100 pieces and reassembled become a reasonable facsimile of the mouse chromosomes

Rearrangements as a unit of genome evolution Rearrangements of alfalfa and garden pea Transforming alfalfa into pea

Whole genome duplication in yeast Kellis M, Birren BW, Lander ES. (2004) Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature. 428:617-24

Thank you! The BioSapiens project is funded by the European Commission within its FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health", contract number LHSG-CT-2003-503265.
