Tools for Comparative Sequence Analysis
Ivan Ovcharenko, Lawrence Livermore National Laboratory
A set of problems:
1. Browsing genomes using synteny links
2. Aligning sequences to vertebrate genomes
3. Aligning sequences to identify evolutionarily conserved regions
4. Assigning function to regulatory elements
5. Decoding gene regulation using microarray data
zPicture: Dynamic Alignment of Megabase-long Sequences and Genomes
zPicture: Automated sequence extraction and gene annotation. I. Ovcharenko, G. Loots, R.C. Hardison, W. Miller, and Lisa Stubbs, Genome Research, 14(3), (2004)
[Slide example: extracted FASTA sequence with header ">hg16_dna range=chr16:" (mixed-case human genomic sequence, omitted here), shown with the extracted gene annotation: > SLC6A, > CESR, > CES, < CES, < CESR, < FLJ, each listed as UTR and exon features.]
Automated sequence and gene annotation extraction chr16:55,400,000-…
zPicture: a dynamic and interactive alignment visualization tool. Dynamic rotation from pip plots to smooth plots. Interactive parameter changes.
zPicture: dynamic annotation
zPicture: dynamic selection of conservation parameters (minimum length / minimum percent identity): 100 bp / 70%, 500 bp / 85%
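The length and percent cutoffs on this slide can be illustrated with a small sliding-window scan over a pairwise alignment. This is only a sketch under stated assumptions: the function name find_ecrs, the window-union rule, and the treatment of gap columns as mismatches are illustrative choices, not zPicture's documented algorithm.

```python
def find_ecrs(aln_a, aln_b, min_len=100, min_identity=0.70):
    """Return merged (start, end) alignment-column intervals formed by the
    union of all min_len-wide windows whose identity is >= min_identity.

    aln_a, aln_b: two rows of a pairwise alignment ('-' marks a gap).
    Assumption: a column counts as a match only if both bases are equal
    and neither is a gap.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(aln_a)
    if n < min_len:
        return []
    match = [int(a == b and a != "-")
             for a, b in zip(aln_a.upper(), aln_b.upper())]
    ecrs = []
    count = sum(match[:min_len])  # matches in the first window
    for start in range(n - min_len + 1):
        if start > 0:
            # Slide the window by one column.
            count += match[start + min_len - 1] - match[start - 1]
        if count / min_len >= min_identity:
            end = start + min_len
            if ecrs and start <= ecrs[-1][1]:
                ecrs[-1][1] = end      # extend the previous interval
            else:
                ecrs.append([start, end])
    return [tuple(e) for e in ecrs]
```

Calling find_ecrs(a, b, min_len=100, min_identity=0.70) and find_ecrs(a, b, min_len=500, min_identity=0.85) on the same alignment mimics switching between the two parameter sets on the slide.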
zPicture: Aligning complete microbial genomes. Mycobacterium leprae vs. Mycobacterium tuberculosis. Conservation of genes: 97% of non-hypothetical genes are conserved; 20% of hypothetical genes are conserved.
rVista 2.0: Identification of Evolutionarily Conserved Transcription Factor Binding Sites
rVista Identification of Evolutionarily Conserved Transcription Factor Binding Sites
Human ACTTTCCTACATCTATCTATA
      |||||::|||||||:||||||
Mouse ACTTTGATACATCTCTCTATA

Human ACTTTGATACATCTATCTATA
      ||||||||||||||:||||||
Mouse ACTTTGATACATCTCTCTATA

Human -----GATACATCTATCTATA
           |||||
Mouse ACTTTGATAC

Human ACTTTGATACATCTATCTATA
      |||||
Mouse ACTTT
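The human/mouse alignments on this slide suggest a simple check: is a candidate binding site aligned without gaps and sufficiently identical in both species? The sketch below illustrates that idea only. The function name site_is_conserved, the 80% identity cutoff, and the choice of columns 5-10 as the "site" are assumptions for illustration, not rVista's published criteria.

```python
def site_is_conserved(aln_ref, aln_other, start, end, min_identity=0.8):
    """Check a candidate site at alignment columns [start, end).

    Returns True only if both rows cover the site, neither row has a gap
    inside it, and the fraction of identical columns is >= min_identity.
    """
    ref = aln_ref[start:end].upper()
    other = aln_other[start:end].upper()
    width = end - start
    if len(ref) < width or len(other) < width:
        return False                      # site runs past the alignment
    if "-" in ref or "-" in other:
        return False                      # site is not fully aligned
    matches = sum(r == o for r, o in zip(ref, other))
    return matches / width >= min_identity


# Sequences taken from the slide; the site columns are hypothetical.
mouse = "ACTTTGATACATCTCTCTATA"
human_diverged = "ACTTTCCTACATCTATCTATA"
human_conserved = "ACTTTGATACATCTATCTATA"
mouse_truncated = "ACTTT" + "-" * 16     # mouse aligns only to the first 5 bases

print(site_is_conserved(human_conserved, mouse, 5, 10))
print(site_is_conserved(human_diverged, mouse, 5, 10))
print(site_is_conserved(human_conserved, mouse_truncated, 5, 10))
```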
zPicture-rVista 2.0 interconnection
ECR Browser: Tool for Browsing Genome Conservation Profiles
Grab ECR :: direct access to a conserved element
Genome Alignment: Align your sequence to a vertebrate genome
Genome Alignment AC146831
Genome alignment: Output page
ECR Browser contains rVista portal
eShadow: Phylogenetic Shadowing of Closely Related Species
eShadow: Phylogenetic Shadowing
Phylogenetic shadowing on multiple (10-14) primate sequences: Apo-B, Plasminogen, LXR-alpha, CETP (Boffelli et al., Science, 2003)
CREME: Using Microarray Data to Decode Genome Regulation
TFBS in Promoter ECRs of RefSeq genes: ~13k RefSeq loci; ~8k conserved promoters; 414 TRANSFAC PWMs; ~3M predicted TFBS
TFBS in Promoter ECRs of RefSeq genes
Testing motif abundances:
- Identify enriched motifs in a gene set relative to a background set.
- Take into account the length of promoters.
Filtering similar PWMs. TRANSFAC contains many redundancies:
- Different PWMs for the same TF.
- Similar PWMs for TFs from the same family.
Filtering strategy:
- For two PWMs that tend to co-occur in a very small window (4 bp), remove the less enriched one.
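The two steps above (length-aware enrichment, then removal of redundant PWMs) can be sketched as follows. The slides do not state which statistic CREME uses, so the Poisson tail test here is a swapped-in stand-in. The function names, the 50% co-occurrence cutoff, and the data layout are likewise assumptions for illustration.

```python
import math
from collections import defaultdict


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), using log-space terms."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    below = sum(math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
                for i in range(k))
    return max(0.0, 1.0 - below)


def motif_enrichment(hits_fg, bp_fg, hits_bg, bp_bg):
    """Length-aware enrichment p-value for one PWM.

    The background hit rate per base pair sets the expected hit count in
    the gene-set promoters, so longer promoters are expected to carry
    more hits.
    """
    rate = hits_bg / bp_bg
    return poisson_upper_tail(hits_fg, rate * bp_fg)


def filter_similar(sites, pvalue, window=4, min_overlap=0.5):
    """Drop the less enriched PWM of any pair that tends to co-occur.

    sites:  {pwm: [(promoter_id, position), ...]}
    pvalue: {pwm: enrichment p-value}
    A PWM is removed when at least min_overlap (assumed cutoff) of its
    sites lie within `window` bp of a site of a more enriched PWM.
    """
    names = sorted(sites, key=lambda n: pvalue[n])   # most enriched first
    removed = set()
    for i, a in enumerate(names):
        if a in removed:
            continue
        by_prom = defaultdict(list)
        for prom, pos in sites[a]:
            by_prom[prom].append(pos)
        for b in names[i + 1:]:
            if b in removed or not sites[b]:
                continue
            close = sum(
                any(abs(pos - q) <= window for q in by_prom.get(prom, ()))
                for prom, pos in sites[b]
            )
            if close / len(sites[b]) >= min_overlap:
                removed.add(b)
    return [n for n in names if n not in removed]
```

Computing the tail as one minus a sum loses precision for extremely small p-values; that is acceptable for a sketch but would need a direct upper-tail sum in real use.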
Human Cell Cycle: 16 enriched PWMs; 1089 modules; 336 genes (Whitfield et al.); significant modules; 5 coherently expressed; E2F, NFY, CREB…
Human Cell Cycle: DELTAEF1, EVI1, GR: 11 genes, p=0.01
Validation on a known module, NFAT-AP1:
- 10 known genes containing multiple regulatory elements. In all of them, NFAT is upstream of AP1.
- CREME reported only the correct module (p=0.01).
- CREME identified the correct orientation of the TFBS.
- The module was identified even after adding 10 random promoters to the gene set.
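The ordering constraint in this validation (NFAT upstream of AP1) can be expressed as a small scan over predicted sites. This is an illustrative sketch, not CREME's implementation. The function name ordered_module_hits, the 200 bp span limit, and the convention that smaller positions are further upstream are all assumptions.

```python
def ordered_module_hits(promoters, first, second, max_span=200):
    """Return genes whose promoter has a `first` site upstream of a
    `second` site, with the two sites at most max_span bp apart.

    promoters: {gene: [(tf_name, position), ...]}
    Assumption: positions increase in the direction of transcription,
    so a smaller position is further upstream.
    """
    hits = []
    for gene, sites in promoters.items():
        firsts = [pos for tf, pos in sites if tf == first]
        seconds = [pos for tf, pos in sites if tf == second]
        if any(0 < s - f <= max_span for f in firsts for s in seconds):
            hits.append(gene)
    return hits


# Hypothetical promoters: only geneA has NFAT upstream of AP1 within range.
example = {
    "geneA": [("NFAT", 100), ("AP1", 130)],
    "geneB": [("AP1", 100), ("NFAT", 130)],
    "geneC": [("NFAT", 10), ("AP1", 900)],
}
print(ordered_module_hits(example, "NFAT", "AP1"))
```

Swapping the `first` and `second` arguments tests the opposite orientation, which is one way to ask whether a module shows a consistent site order across a gene set.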
Colleagues and collaborators
Institutions: Lawrence Livermore National Laboratory; UC Berkeley; Stanford; Lawrence Berkeley National Laboratory; Pennsylvania State University
People: Gaby Loots, Lisa Stubbs, Roded Sharan, Asa Ben-Hur, Ross Hardison, Webb Miller, Marcelo Nobrega, Dario Boffelli, Sha Hammond