Finding genes by comparing genomes. Roderic Guigó i Serra, IMIM/UPF/CRG, Barcelona.


1 finding genes by comparing genomes roderic guigó i serra imim/upf/crg, barcelona

2 number of genes on chromosome 22
initial annotation        545     Dunham et al., 1999
genscan + RT-PCR          590     Das et al., 2001
genscan + microarrays     730     Shoemaker et al., 2001
reviewed annotation       726     chr22 team, Sanger, 2001
mouse shotgun data        +20     (our data)
geneid predictions        794
genscan predictions       1128

3 number of genes in the human genome
Consortium                30,000-40,000     2001
Celera                    27,000-38,000     2001
Consortium + Celera       50,000            Hogenesch et al., 2001
DB searches               65,000-75,000     Wright et al., 2001
Human Genome Sciences     90,000-120,000    Haseltine, 2001

4 sequence conservation and coding function

5

6 comparative gene prediction
rosetta (Batzoglou et al., 2000)
cem (Bafna and Huson, 2000)
sgp1 (Wiehe et al., 2000)
twinscan (Korf et al., 2001)
slam (Pachter et al., 2001)
doublescan (Meyer and Durbin, 2002)
sgp2 (Parra et al., 2003)

7 comparative gene prediction
1. THE GENE PREDICTION IS THE RESULT OF THE SEQUENCE ALIGNMENT
Given two homologous genomic sequences, infer the exonic structure in each sequence that maximizes the score of the alignment of the resulting amino acid sequences. This problem is usually solved through a complex extension of the classical dynamic programming algorithm for sequence alignment.
Blayo et al., 2002
Pedersen and Scharl, 2002
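As a point of reference for the "classical dynamic programming algorithm" mentioned on this slide, here is a minimal sketch of global alignment scoring (Needleman-Wunsch) for two plain sequences. It is not the extended algorithm of the cited papers, which also has to choose exon boundaries in both genomic sequences; the function name and the scoring values (match, mismatch, gap) are illustrative assumptions.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b (linear gap penalty).

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

The comparative variants on this slide keep the same recurrence idea but enlarge the state space so that each cell also records whether the current position is inside an exon or an intron in each sequence.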

8 comparative gene prediction
2. GENE PREDICTION AND SEQUENCE ALIGNMENT ARE PRODUCED SIMULTANEOUSLY
Given two homologous genomic sequences, pair hidden Markov models (pair HMMs) for sequence alignment and generalized HMMs (GHMMs) for gene prediction are combined into so-called generalized pair HMMs.
progen: Novichkov et al., 2001
slam: Pachter et al., 2001
doublescan: Meyer and Durbin, 2002

9 comparative gene prediction
3. GENE PREDICTION IS SEPARATED FROM SEQUENCE ALIGNMENT
First, an alignment between two homologous genomic sequences is obtained using a generic sequence alignment program, such as tblastx, sim4 or glass. Then, gene structures are predicted that are compatible with this alignment, meaning that predicted exons fall in the aligned regions.
rosetta: Batzoglou et al., 2000
cem: Bafna and Huson, 2000
sgp-1: Wiehe et al., 2001

10 comparative gene prediction
4. GENE PREDICTION IS (EVEN MORE) SEPARATED FROM SEQUENCE ALIGNMENT
This approach does not require the comparison of two homologous genomic sequences. Rather, a query sequence from a target genome is compared against a collection of sequences from a second (informant, reference) genome, and the results of the comparison are used to modify the scores of the exons produced by an underlying "ab initio" gene prediction algorithm.
twinscan: Korf et al., 2001
sgp-2: Parra et al., 2003

11 syntenic gene prediction (sgp2)
[figure labels: Query Sequence; tblastx HSPs; HSPs Projections; geneid Exons; SGP Exons]
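The idea in this figure, that tblastx HSPs projected onto the query sequence raise the scores of overlapping geneid exons, can be sketched as below. This is a toy illustration under stated assumptions, not the scoring function used by sgp2: the `Interval` type, the `rescore_exon` name, the proportional-overlap bonus and the `weight` value are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Interval:
    """A scored segment on the query sequence (coordinates inclusive)."""
    start: int
    end: int
    score: float


def rescore_exon(exon: Interval, hsps: Iterable[Interval],
                 weight: float = 0.1) -> float:
    """Return the exon score plus a bonus from overlapping HSPs.

    Assumption (hypothetical): each HSP contributes its score scaled by
    the fraction of the HSP that overlaps the exon, times `weight`.
    """
    bonus = 0.0
    for hsp in hsps:
        overlap = min(exon.end, hsp.end) - max(exon.start, hsp.start) + 1
        if overlap > 0:
            bonus += hsp.score * overlap / (hsp.end - hsp.start + 1)
    return exon.score + weight * bonus
```

For example, an exon at 100-199 with score 2.0 and a single HSP at 150-249 with score 50.0 shares 50 of the HSP's 100 positions, so with `weight=0.1` the rescored value is 2.0 + 0.1 * 25.0 = 4.5. Exons with no HSP support keep their ab initio score.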

12 programs based on mouse-human genome sequence comparisons improve gene predictions
Accuracy on human chromosome 22
            sensitivity   specificity
genscan     0.79          0.46
twinscan    0.80          0.62
SGP         0.79          0.66
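For readers unfamiliar with how these two measures are usually computed in gene prediction, here is a minimal sketch at the nucleotide level. It assumes the common convention in this field, where sensitivity is TP / (TP + FN) and specificity is TP / (TP + FP); the slide does not state at which level (nucleotide, exon) its figures were computed, and the function names and exon representation are illustrative.

```python
from typing import Iterable, Set, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive coordinates


def coding_positions(exons: Iterable[Exon]) -> Set[int]:
    """Expand exon intervals into the set of coding nucleotide positions."""
    positions: Set[int] = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_accuracy(predicted: Iterable[Exon],
                        annotated: Iterable[Exon]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the nucleotide level.

    sensitivity = TP / annotated coding nucleotides
    specificity = TP / predicted coding nucleotides
    """
    pred = coding_positions(predicted)
    true = coding_positions(annotated)
    tp = len(pred & true)
    sensitivity = tp / len(true) if true else 0.0
    specificity = tp / len(pred) if pred else 0.0
    return sensitivity, specificity
```

Under this convention a program that over-predicts, as genscan does in the table, can keep its sensitivity while its specificity drops, because the extra predicted nucleotides only enlarge the denominator of the second ratio.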

13 how accurate are the sgp predictions nucleotide level

14 how accurate are the sgp predictions exon level

15 gene prediction programs predict a large number of genes
predictions in the mouse genome
                                        TWINSCAN   SGP
total                                   48462      47055
novel                                   17562      21942
multiexonic, long, no low complexity    3171       4543
  human ts / human sgp                  954        2983
    intron aligned                      317        1052
    human ts / human sgp                637        1931
  orphans                               2217       1560
away from an ensembl
  intron aligned                        231        857
  human ts / human sgp                  482        1706
  orphans                               1971       1417

16 and a large number of novel genes...
predictions in the mouse genome
                                        TWINSCAN   SGP
total                                   48462      47055
novel                                   17562      21942
multiexonic, long, no low complexity    3171       4543
  human ts / human sgp                  954        2983
    intron aligned                      317        1052
    human ts / human sgp                637        1931
  orphans                               2217       1560
away from an ensembl
  intron aligned                        231        857
  human ts / human sgp                  482        1706
  orphans                               1971       1417

17 ...with exons...
predictions in the mouse genome
                                        TWINSCAN   SGP
total                                   48462      47055
novel                                   17562      21942
multiexonic                             10987      12158
long, no low complexity                 3171       4543
  human ts / human sgp                  954        2983
    intron aligned                      317        1052
    human ts / human sgp                637        1931
  orphans                               2217       1560
away from an ensembl
  intron aligned                        231        857
  human ts / human sgp                  482        1706
  orphans                               1971       1417

18 that look fine proteins
predictions in the mouse genome
                                        TWINSCAN   SGP
total                                   48462      47055
novel                                   17562      21942
multiexonic                             10987      12158
long, no low complexity                 3171       4543
  human ts / human sgp                  954        2983
    intron aligned                      317        1052
    human ts / human sgp                637        1931
  orphans                               2217       1560
away from an ensembl
  intron aligned                        231        857
  human ts / human sgp                  482        1706
  orphans                               1971       1417

19 almost every mouse gene has the human orthologue counterpart
predictions in the mouse genome
                                        TWINSCAN   SGP
total                                   48462      47055
novel                                   17562      21942
multiexonic                             10987      12158
long, no low complexity                 3171       4543
  human ts / human sgp                  954        2983
    intron aligned                      317        1052
    human ts / human sgp                637        1931
  orphans                               2217       1560

20 orthologous human mouse genes have conserved exonic structure
|1b
chr1_2213  MSTNICSFKDRCVSILCCKFCKQVLSSRGMKAVLLADTEIDLFSTDIPPTNAVDFTGRCY
           **** *:*******************************:************:*** ****
chr1_1808  MSTNNCTFKDRCVSILCCKFCKQVLSSRGMKAVLLADTDIDLFSTDIPPTNTVDFIGRCY
|1b
|2b |3a
chr1_2213  FTKICKCKLKDIACLKCGNIVGYHVIVPCSSCLLSCNNRHFWMFHSQAVYDINRLDSTGV
           ** *********************************** ***********.*****:***
chr1_1808  FTGICKCKLKDIACLKCGNIVGYHVIVPCSSCLLSCNNGHFWMFHSQAVYGINRLDATGV
|2b |3a
chr1_2213  NVLLRGNLPEIEESTDEDVLNISAEECIR
           *:** ***** **.***:.*:***** **
chr1_1808  NLLLWGNLPETEECTDEETLEISAEEYIR

21 orthologous human mouse genes have conserved exonic structure
85% of the orthologous pairs have identical number of exons
91% of the orthologous exons have identical length
99.5% of the orthologous exons have identical phase
there are a few cases of intron insertion/deletion (22)
U12 introns appear to be strongly conserved between human and mouse
non-canonical GC-AG introns are less conserved
data on 1506 human/mouse refseq orthologues
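The three comparisons reported on this slide (number of exons, exon length, exon phase) can be sketched for a single orthologous pair as follows. The representation is an assumption made for illustration: each gene is given as a list of coding exon lengths in transcript order, and "phase" is taken here to mean the number of coding nucleotides preceding the exon, modulo 3. The slides do not define phase, so other conventions are possible.

```python
from typing import Dict, List


def exon_phases(lengths: List[int]) -> List[int]:
    """Phase of each exon: coding nucleotides before it, modulo 3 (assumed convention)."""
    phases = []
    preceding = 0
    for length in lengths:
        phases.append(preceding % 3)
        preceding += length
    return phases


def compare_structures(human: List[int], mouse: List[int]) -> Dict[str, object]:
    """Compare exon count, and per-exon length and phase, for one orthologous pair."""
    result: Dict[str, object] = {"same_exon_count": len(human) == len(mouse)}
    if result["same_exon_count"]:
        # Per-exon comparison is only meaningful when exons pair up one to one.
        result["same_length"] = [h == m for h, m in zip(human, mouse)]
        result["same_phase"] = [
            h == m for h, m in zip(exon_phases(human), exon_phases(mouse))
        ]
    return result
```

An in-frame insertion of three nucleotides in one exon changes that exon's length but leaves the phase of every downstream exon unchanged, which is one way phase can be more conserved than length, as in the 99.5% versus 91% figures above.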

22 we will target genes with conserved intron positions
|2a
chr10_1592  LGSETCCNSHTSLQTSGVPDGSNNNSALIFITALQKMFTGFLLVNKSSCKLNPCWEKVQV
                                                *.     ****:** ** ******
chr19_1200  ------------------------------------MRCSQEPVNKSACKSNPRWEKVQV
|1a
chr10_1592  SSLYKLTDNCVNLQPLKRKEKKATLITLLSFTLHLLSSLAALRWDVNLPVNAVRKWMVQE
            *************************** ***:***************************
chr19_1200  SSLYKLTDNCVNLQPLKRKEKKATLITPLSFALHLLSSLAALRWDVNLPVNAVRKWMVQG
|3b
chr10_1592  GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
            ************************************************************
chr19_1200  GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
|2b |4b
chr10_1592  VCLYGV-LNSKVCQLQKVYILINTPVAWRSEGLADRWLPRKAQQASHLQHLVVGAREQAQ
            .****.       : :********************** ************.**..*
chr19_1200  ACLYGENTAGPGLHSRKVYILINTPVAWRSEGLADRWLLRKAQQASHLQHLSAGATRAVQ
|3c

23 sequence conservation and coding function

24 orthologous splice sites are more conserved than expected solely from their splicing function

25

26 prediction of splice sites

27 we will target genes with conserved intron positions

28 the final pools
predictions in the mouse genome
                                        TWINSCAN   SGP
total                                   48462      47055
novel                                   17562      21942
multiexonic                             10987      12158
long, no low complexity                 3171       4543
  human ts / human sgp                  954        2983
    intron aligned                      317        1052
    human ts / human sgp                637        1931
  orphans                               2217       1560

29 rtpcr: targeting conserved intron positions
|2a
chr10_1592  LGSETCCNSHTSLQTSGVPDGSNNNSALIFITALQKMFTGFLLVNKSSCKLNPCWEKVQV
                                                *.     ****:** ** ******
chr19_1200  ------------------------------------MRCSQEPVNKSACKSNPRWEKVQV
|1a
chr10_1592  SSLYKLTDNCVNLQPLKRKEKKATLITLLSFTLHLLSSLAALRWDVNLPVNAVRKWMVQE
            *************************** ***:***************************
chr19_1200  SSLYKLTDNCVNLQPLKRKEKKATLITPLSFALHLLSSLAALRWDVNLPVNAVRKWMVQG
|3b
chr10_1592  GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
            ************************************************************
chr19_1200  GQELEISISGGCLTFMGKSSSNSVITALLMAEELHHYDNFFYSCEPKSSLLFLLSRAVIE
|2b |4b
chr10_1592  VCLYGV-LNSKVCQLQKVYILINTPVAWRSEGLADRWLPRKAQQASHLQHLVVGAREQAQ
            .****.       : :********************** ************.**..*
chr19_1200  ACLYGENTAGPGLHSRKVYILINTPVAWRSEGLADRWLLRKAQQASHLQHLSAGATRAVQ
|3c

30 rt-pcr on 12 normal mouse adult tissues, and direct sequencing of the amplimers
pool             predictions   tested   positive   success rate
intron aligned   1428          214      133        62%
similar          2125          38       4          11%
orphan           3425          63       2          3%
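The success rates on this slide are the fraction of tested predictions that gave a positive RT-PCR result. As a small arithmetic check, the sketch below recomputes them, assuming the run-together digits of the transcript are read as tested/positive counts of 214/133, 38/4 and 63/2 for the three pools; that reading is an inference, chosen because it is the one that reproduces the stated percentages.

```python
# (tested, positive) per pool, as read from the slide's table (assumed parsing).
pools = {
    "intron aligned": (214, 133),
    "similar": (38, 4),
    "orphan": (63, 2),
}


def success_rate(tested: int, positive: int) -> float:
    """Percentage of tested predictions confirmed by RT-PCR."""
    return 100.0 * positive / tested


# Rounded to the nearest percent, as on the slide.
rates = {name: round(success_rate(t, p)) for name, (t, p) in pools.items()}
```

With these counts the intron-aligned pool is confirmed roughly six times more often than the similar pool and about twenty times more often than the orphan pool, which is the motivation for targeting conserved intron positions.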

31 rt-pcr on 12 normal mouse adult tissues, and direct sequencing of the amplimers

32 about 1000 human genes not in ensembl
low support by ESTs: 34% match EST sequences
low representation in other vertebrate genomes: 33% have sequence matches in fish genomes
restricted expression patterns

33

34

35

36

37 limitations: sensitivity of the procedure
                       twinscan   ensembl   sgp2
initial predictions    48464      23026     48451
multiexonic genes      36831      17565     38979
                       25320      16368     16952     21184
                       69%        94%       97%       54%
ortholog pairs         24743                30927
                       21099      15355     16757     19831
                       85%        87%       95%       64%
intron aligned         17271                18056
                       16337      13709     15112     15977
                       94%        78%       86%       88%

38 specificity of the prediction can be improved: Ka/Ks ratio

39 further work
scale the procedure: try to find rtpcr evidence for (almost) every human gene not yet confirmed
intronless genes
human specific gene families (if any)
genes with non-canonical splicing

40 selenoproteins
Selenoproteins are proteins that incorporate selenocysteine (Sec), the 21st amino acid.
Function: mostly redox enzymes
Distribution: 3 domains of life
Number: 22 families in mammals

41 selenoproteins
UGA (STOP) is the codon for Sec
There is a tRNA(Sec) whose anticodon (UCA) pairs with the UGA codon
Recoding:
1. RNA structure: the SECIS element
2. SECIS binding proteins

42 selenoproteins

43 the SECIS element. computational search for selenoproteins dSelG SECIS Pattern

44 using geneid to search for selenoproteins
1. Predict SECIS (PatScan)
2. Gene prediction with
   1. TGA in-frame
   2. SECIS
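The two conditions on this slide, an in-frame TGA plus a predicted SECIS element, can be illustrated with a toy scan over a single-exon, forward-strand sequence. This is a sketch under stated assumptions, not what geneid does (geneid assembles multi-exon genes): the function name is hypothetical, exactly one in-frame TGA is read through as Sec, the ORF ends at the next stop codon, and a candidate is kept only if a SECIS start position (supplied by the caller, for example from a PatScan run) lies at or after the end of the ORF, since eukaryotic SECIS elements sit in the 3' UTR.

```python
from typing import List, Sequence, Tuple

STOPS = {"TAA", "TAG", "TGA"}


def selenoprotein_candidates(seq: str, secis_starts: Sequence[int]
                             ) -> List[Tuple[int, int, int]]:
    """Return (orf_start, orf_end, tga_position) for ORFs with one in-frame TGA.

    Coordinates are 0-based; orf_end is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    candidates = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        sec_pos = None  # position of the TGA read as selenocysteine
        pos = start + 3
        while pos + 3 <= len(seq):
            codon = seq[pos:pos + 3]
            if codon == "TGA" and sec_pos is None:
                sec_pos = pos          # first in-frame TGA: read through
            elif codon in STOPS:
                orf_end = pos + 3      # true end of the ORF
                has_secis = any(s >= orf_end for s in secis_starts)
                if sec_pos is not None and has_secis:
                    candidates.append((start, orf_end, sec_pos))
                break
            pos += 3
    return candidates
```

For instance, `selenoprotein_candidates("ATGGCTTGAGCTTAACCCC", [17])` reports one candidate starting at 0, ending at 15, with the recoded TGA at position 6, while the same call with no SECIS positions reports nothing.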

45 genome wide search in drosophila
SECIS predicted               35876
SECIS thermo assessment       1220
Genes predicted               12194
Predicted selenoproteins      (4)
Real selenoproteins           3

46 dSelG

47 dSelM

48 dSelG and dSelM: experimental verification

49 dSelM has selenoprotein homologues in vertebrates

50 COMPARATIVE GENE PREDICTION
IMIM/UPF/CRG: Genís Parra, Josep F. Abril, Roderic Guigó
University of Geneva: Manolis Dermitzakis, Alexandre Reymond, Robert Lyle, Catherine Ucla, Stylianos Antonarakis
GlaxoSmithKline: Pankaj Agarwal
University of Oxford: Chris Ponting
Washington University: Evan Keibler, Michael Brent
SELENOPROTEINS
IMIM/UPF/CRG: Sergi Castellano
Universitat de Barcelona: Montserrat Corominas, Florenci Serras, Marta Morey, Sergi Bertran
University of Lincoln: Vadim Gladishev, Gregory Kruikov
Harvard University: Marla Berry, Nadia Morozova

