Gene prediction. Roderic Guigó i Serra, IMIM/UPF/CRG.


1 gene prediction: Roderic Guigó i Serra, IMIM/UPF/CRG

2 number of genes in chromosome 22
    initial annotation         545    Dunham et al., 1999
    genscan + RT-PCR           590    Das et al., 2001
    genscan + microarrays      730    Shoemaker et al., 2001
    reviewed annotation        726    chr22 team, Sanger, 2001
    mouse shotgun data         +20    (our data)
    geneid predictions         794
    genscan predictions       1128

3 number of genes in human genome
    Consortium               30,000-40,000     2001
    Celera                   27,000-38,000     2001
    Consortium + Celera      50,000            Hogenesch et al., 2001
    DB searches              65,000-75,000     Wright et al., 2001
    Human Genome Sciences    90,000-120,000    Haseltine, 2001

4 decoding the genome
    ACTCAGCCCCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGGTGAGAAGCGCAGTCGGGGGCACGGGGATGAGCTCAGGGGCCTCTAGAAAGAT
    [the slide continues with several thousand more bases of raw sequence]
    the human genome sequence

5 the amino acid sequence of the proteins
    QIKDLLVSSSTDLDTTLVLVNAIYFKGMW
    KTAFNAEDTREMPFHVTKQESKPVQMMCM
    NNSFNVATLPAEKMKILELPFASGDLSML
    VLLPDEVSDLERIEKTINFEKLTEWTNPN
    TMEKRRVKVYLPQMKIEEKYNLTSVLMAL
    GMTDLFIPSANLTGISSAESLKISQAVHG
    AFMELSEDGIEMAGSTGVIEDIKHSPESE
    QFRADHPFLFLIKHNPTNTIVYFGRYWSP

6 gene structure: exons, introns, upstream regulatory element, downstream regulatory element, promoter

7 from DNA to RNA

8 from RNA to protein

9 molecular mechanism

10 Prediction of splice sites

11 accuracy of gene prediction programs

12

13

14 comparative gene prediction
    rosetta (Batzoglou et al., 2000)
    cem (Bafna and Huson, 2000)
    sgp1 (Wiehe et al., 2000)
    twinscan (Korf et al., 2001)
    slam (Pachter et al., 2001)
    sgp2 (Guigó et al., in preparation)

15 syntenic gene prediction (sgp2)
    diagram tracks: query sequence, tblastx HSPs, geneid exons, HSP projections, SGP exons
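The track names on this slide (tblastx HSPs, HSP projections, geneid exons, SGP exons) suggest a two-step idea: collapse the overlapping HSPs onto the query sequence, then use that projection to adjust the score of each ab initio exon. The Python sketch below is only an illustration of that idea. The class names, the rule of keeping the maximum HSP score at each position, the overlap-weighted bonus, and the `weight` parameter are all assumptions made for the example, not the published sgp2 scoring scheme.

```python
from dataclasses import dataclass


@dataclass
class HSP:
    start: int    # 1-based inclusive coordinate on the query sequence
    end: int
    score: float  # alignment score reported for the HSP


@dataclass
class Exon:
    start: int
    end: int
    score: float  # score assigned by the ab initio predictor


def project_hsps(hsps):
    """Collapse overlapping HSPs into non-overlapping segments.

    Each position keeps the maximum score of the HSPs covering it
    (an assumed projection rule). Returns (start, end, score) tuples.
    """
    # Every HSP start, and the position just past every HSP end, is a breakpoint.
    points = sorted({h.start for h in hsps} | {h.end + 1 for h in hsps})
    segments = []
    for left, right in zip(points, points[1:]):
        covering = [h.score for h in hsps if h.start <= left and h.end >= right - 1]
        if not covering:
            continue  # gap between HSPs
        best = max(covering)
        # Merge with the previous segment when adjacent and equally scored.
        if segments and segments[-1][1] == left - 1 and segments[-1][2] == best:
            segments[-1] = (segments[-1][0], right - 1, best)
        else:
            segments.append((left, right - 1, best))
    return segments


def rescore_exon(exon, projection, weight=0.2):
    """Add a similarity bonus proportional to the overlap with projected HSPs."""
    bonus = 0.0
    for seg_start, seg_end, seg_score in projection:
        overlap = min(exon.end, seg_end) - max(exon.start, seg_start) + 1
        if overlap > 0:
            bonus += seg_score * overlap / (seg_end - seg_start + 1)
    return exon.score + weight * bonus


if __name__ == "__main__":
    projection = project_hsps([HSP(10, 50, 100.0), HSP(40, 80, 60.0)])
    print(projection)
    print(rescore_exon(Exon(51, 65, 1.0), projection, weight=0.5))
```

In this toy setting an exon that lies under conserved sequence gains score, while an exon with no HSP support keeps its ab initio score unchanged.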

16 benchmarking sgp2 - accuracy (scimog, mit)

17 predicting “novel” genes in the human genome
    tracks: golden path annotations; additional blastn matches to ENSEMBL + REFSEQ; tblastx; geneid exons; tblastx; sgp genes
    data: Golden Path, Oct 7, 2000 freeze, RepeatMasked; TraceDB, as of February 2001

18 “novel” genes?
    48,890 genic regions (known genes or similar)
    15,489 genes longer than 100 aa predicted by sgp
    13,302 non-redundant predictions
    8,416 supported by tblastx hits to mouse 1.5
    3,331 predicted genes with at least two exons supported by tblastx hits
    + 719 predicted genes supported by tblastx hits covering at least 75% of the prediction
    = 4,050 supported sgp predictions
    25% of them not overlapping genscan predictions
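The last step of this funnel combines two support criteria: at least two exons supported by tblastx hits, or tblastx hits covering at least 75% of the prediction. The Python sketch below shows one way such a filter could be written. The `Prediction` fields and function names are hypothetical, and reading the 719 coverage-supported genes as disjoint from the 3,331 multi-exon ones is an assumption based only on the two counts summing to 4,050.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    name: str
    supported_exons: int     # exons overlapped by at least one tblastx hit
    covered_fraction: float  # fraction of the prediction covered by tblastx hits


def count_supported(predictions, min_exons=2, min_coverage=0.75):
    """Return (multi-exon supported, coverage-only supported, total supported)."""
    multi_exon = sum(p.supported_exons >= min_exons for p in predictions)
    coverage_only = sum(
        p.supported_exons < min_exons and p.covered_fraction >= min_coverage
        for p in predictions
    )
    return multi_exon, coverage_only, multi_exon + coverage_only


if __name__ == "__main__":
    toy = [
        Prediction("p1", supported_exons=3, covered_fraction=0.40),
        Prediction("p2", supported_exons=1, covered_fraction=0.90),
        Prediction("p3", supported_exons=0, covered_fraction=0.10),
    ]
    print(count_supported(toy))
    # The two counts on the slide add up to the reported total.
    print(3331 + 719)
```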

19 validation of predictions
    EST identity                 18%
    NR similarity                31%
    CDD (NCBI)                   24%
    Mouse ESTs                   28%
    Rat ESTs                     19%
    Tetraodon                    15%
    at least one of the above    56%

20 Experimental validation

21 human genome vs. mouse TraceDB (chr22, chr21)

22 human genome vs. mouse assemblies
                    SN    SP    CC    SNe   SPe   SNSP  ME    WE
    chr22.assem.    0.87  0.65  0.75  0.69  0.54  0.62  0.14  0.33
    chr22.shot.     0.82  0.66  0.72  0.63  0.54  0.58  0.20  0.31
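The column headings are not defined on the slide. Assuming they follow the conventions commonly used in gene-finder benchmarks (SN and SP as nucleotide-level sensitivity and specificity, CC as the correlation coefficient, SNe and SPe as their exact-exon counterparts, SNSP as the mean of SNe and SPe, ME and WE as the fractions of missed and wrong exons), they could be computed as in the Python sketch below. Function names are hypothetical and exons are represented as `(start, end)` tuples for illustration.

```python
import math


def nucleotide_accuracy(tp, fp, tn, fn):
    """Nucleotide-level SN, SP and CC from coding/non-coding base counts."""
    sn = tp / (tp + fn)  # fraction of true coding bases that were predicted
    sp = tp / (tp + fp)  # fraction of predicted coding bases that are true
    cc = (tp * tn - fn * fp) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    return sn, sp, cc


def exon_accuracy(predicted, annotated):
    """Exon-level SNe, SPe, SNSP, ME and WE from (start, end) exon tuples."""
    predicted, annotated = set(predicted), set(annotated)
    exact = predicted & annotated  # exons with both boundaries correct

    def overlaps_any(exon, others):
        return any(exon[0] <= o[1] and o[0] <= exon[1] for o in others)

    sne = len(exact) / len(annotated)
    spe = len(exact) / len(predicted)
    snsp = (sne + spe) / 2
    # Missed exons: annotated exons not overlapped by any prediction.
    me = sum(not overlaps_any(a, predicted) for a in annotated) / len(annotated)
    # Wrong exons: predicted exons not overlapping any annotated exon.
    we = sum(not overlaps_any(p, annotated) for p in predicted) / len(predicted)
    return sne, spe, snsp, me, we


if __name__ == "__main__":
    print(nucleotide_accuracy(tp=80, fp=20, tn=880, fn=20))
    print(exon_accuracy(
        predicted=[(1, 100), (200, 310), (800, 900)],
        annotated=[(1, 100), (200, 300), (400, 500), (600, 700)],
    ))
```

Under these definitions the SNSP column is consistent with the table: for chr22.assem., (0.69 + 0.54) / 2 rounds to 0.62. The sketch does not guard against empty exon sets or zero denominators.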

23 testing novel predictions experimentally
                      chr22   chr21
    predicted           776     420
    known              -655    -326
    low complexity      -25      -5
    short               -26     -11
    intronless          -19     -34
                         45      36
    In total 81 predictions. For 40 of them, adjacent exon pairs were selected for RT-PCR.

24 preliminary results
                                     N     Success rate
    Positive controls
      refseq                         78    96%
      Known tissue specific genes    20    25%
      Low expressing genes           13    Not ready
      Twinscan with EST support            Not ready
    Test sets
      Twinscan                             Not ready
      SGP                            40    28%

25 acknowledgments
    IMIM-UPF-CRG, Barcelona: Josep F. Abril, Genís Parra, Roderic Guigó
    GlaxoSmithKline, King of Prussia: Pankaj Agarwal
    Max Planck Institute for Chemical Ecology, Jena: Thomas Wiehe
    Whitehead Institute/MIT Center for Genome Research, Cambridge: Gwen Acton, Dan Brown, Kerstin
    Mouse Sequence Consortium

