1 1 Gene Predictor. Date: 20/11/2003. Implemented by: Zohar Idelson. Supervisor: Dr. Yizhar Lavner. Winter - Summer 2003.

2 2 Genomic Signal Processing: Genomic signal processing is a relatively new field in bioinformatics in which signal processing algorithms and methods are used to study functional structures in DNA. An appropriate mapping of the DNA sequence into one or more numerical sequences enables the use of many digital signal processing tools. [Diagram: DNA segment (atgcggatttgccgtcgatgtc…) → Gene Predictor → Gene]

3 3 DNA Basics: DNA in eukaryotes is organized in chromosomes. The DNA in each chromosome can be read as a discrete signal over the alphabet {a,t,c,g} (for example: atgatcccaaatggaca …). In exons (protein-coding regions), during the biological construction of amino-acid chains, those letters are read as triplets (codons). Every codon signals which amino acid to build (there are 20 amino acids). There are 6 ways of translating the DNA signal into a codon signal, called the reading frames (3 offsets * 2 directions). Every gene starts with a start codon and ends with a stop codon. An exon cannot contain more than one stop codon. Non-coding regions behave much more randomly than genes, and most of the DNA is non-coding. Genes can be detected through statistical regularities such as codon usage, nucleotide usage and periodicity, and through database comparison.

4 4 Organisms: Classified into two types. Eukaryotes: contain a membrane-bound nucleus and organelles (plants, animals, fungi, …). Prokaryotes: lack a true membrane-bound nucleus and organelles (single-celled, includes bacteria). Not all single-celled organisms are prokaryotes!

5 5 Cells: A complex system enclosed in a membrane. Organisms are unicellular (bacteria, baker's yeast) or multicellular. Humans: 60 trillion cells, 320 cell types. [Figure: example animal cell. Source: www.ebi.ac.uk/microarray/biology_intro.htm]

6 6 DNA Basics – cont.: DNA in eukaryotes is organized in chromosomes.

7 7 Chromosomes: In eukaryotes, the nucleus contains one or several double-stranded DNA molecules organized as chromosomes. Humans: 22 pairs of autosomes and 1 pair of sex chromosomes. [Figure: human karyotype. Source: http://avery.rutgers.edu/WSSP/StudentScholars/Session8/Session8.html]

8 8 www.biotec.or.th/Genome/whatGenome.html

9 9 What is DNA? DNA: deoxyribonucleic acid. A single-stranded molecule (oligomer, polynucleotide) is a chain of nucleotides. 4 different nucleotides: Adenine (A), Cytosine (C), Guanine (G), Thymine (T).

10 10 Nucleotide Bases: Purines (A and G). Pyrimidines (C and T). The difference is in the base structure. [Image source: www.ebi.ac.uk/microarray/biology_intro.htm]

11 11 DNA

12 12

13 13 Genome: The chromosomal DNA of an organism. The number of chromosomes and the genome size vary quite significantly from one organism to another. Genome size and number of genes do not necessarily determine organism complexity.

14 14 Genome Comparison
ORGANISM | CHROMOSOMES | GENOME SIZE | GENES
Homo sapiens (Humans) | 23 | 3,200,000,000 | ~30,000
Mus musculus (Mouse) | 20 | 2,600,000,000 | ~30,000
Drosophila melanogaster (Fruit Fly) | 4 | 180,000,000 | ~18,000
Saccharomyces cerevisiae (Yeast) | 16 | 14,000,000 | ~6,000
Zea mays (Corn) | 10 | 2,400,000,000 | ???

15 15

16 16 DNA Basics – cont.: The DNA in each chromosome can be read as a discrete signal over the alphabet {a,t,c,g} (for example: atgatcccaaatggaca …).

17 17 DNA Basics – cont.: In genes (protein-coding regions), during the construction of proteins from amino acids, these nucleotides (letters) are read as triplets (codons). Every codon signals one amino acid for the protein synthesis (there are 20 amino acids).

18 18 DNA Basics – cont.: There are 6 ways of translating the DNA signal into a codon signal, called the reading frames (3 offsets * 2 directions). [Figure: reading frames illustrated on …CATTGCCAGT…]

19 19 DNA Basics – cont.: Start codon: ATG. Stop codons: TAA, TGA, TAG. [Diagram: a gene on the sequence …CATTGCCAGT…, structured as exon, intron, exon]

20 20 The Problem: Given unannotated DNA, find the genes. In practice, find the exons and their reading frame (RF). Smaller-scale problem: given some annotated DNA of a creature, find the exons in unannotated DNA of the same creature. [Diagram: atgcggatttgccgtcgatgtc… → Gene Predictor → Exon]

21 21 Solution Scheme: Work with window analysis. Find parameters that give a good prediction on annotated DNA (of the same organism), that is, learn how to distinguish exon regions from non-exon regions. Extract those parameters from the unannotated DNA and use the discrimination rule to predict. Almost all methods shown here fit this scheme.

22 22 Creatures in the Project: C. elegans and S. cerevisiae (yeast).

23 23 Existing Methods: Many methods rely on the pseudo-periodicity of 3 in genes. For that we define: u_b, the binary indicator series for base b; U_b, the STFT of u_b. N, the window size, is in the hundreds (exon sizes are of order 10^1 … 10^3 in S. cerevisiae). The windows overlap. There is a connection between the DFT at frequency k = N/3 and nucleotide usage.

24 24 Calculating the DFT of a DNA sequence*
S(n) = ATCGTACAGCTGCAAAGCATAGATTCGGTCACAGTTG…
u_A(n) = 1000010100000111001010100000001010000
u_T(n) = 01001000001…
u_C(n) = 0010001001…
u_G(n) = 000100001001…
*Silverman and Linsker 1986; Voss 1992
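The indicator mapping on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the function names are hypothetical, the DFT is evaluated by direct summation, and the window length is assumed to be a multiple of 3 so that k = N/3 is an integer.

```python
import cmath

BASES = "ATCG"

def indicator_sequences(dna):
    """Return {base: [0/1, ...]} with a 1 wherever that base occurs."""
    dna = dna.upper()
    return {b: [1 if ch == b else 0 for ch in dna] for b in BASES}

def dft_coefficient(u, k):
    """k-th DFT coefficient of the finite sequence u (direct summation)."""
    n_total = len(u)
    return sum(x * cmath.exp(-2j * cmath.pi * k * n / n_total)
               for n, x in enumerate(u))

def power_at_one_third(dna):
    """Sum over the four bases of |U_b[N/3]|^2 for one window."""
    n_total = len(dna)
    if n_total % 3:
        raise ValueError("window length must be a multiple of 3")
    u = indicator_sequences(dna)
    return sum(abs(dft_coefficient(u[b], n_total // 3)) ** 2 for b in BASES)

if __name__ == "__main__":
    seq = "ATCGTACAGCTGCAAAGCATAGATTCGGTCACAGTTG"
    print("".join(map(str, indicator_sequences(seq)["A"])))
    print(power_at_one_third("ATG" * 30))  # toy window with perfect period 3
```

A perfectly 3-periodic toy window concentrates its energy at k = N/3, while a window made of a single repeated base has none there.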

25 25 Spectrogram: A way of showing the amplitudes of U_A, U_C, U_G and U_T together, through a linear transform to RGB. Magnitude is represented by brightness. Finding exons visually: look for bright horizontal lines, usually at k = N/3. [Figure axes: position (nucleotides) versus frequency, with N/3 marked]

26 26 Spectrogram – cont. DNA of C. Elegans chr. III versus totally random DNA

27 27 Power Spectrum: The difference between gene and non-gene regions is about one order of magnitude. Used at k = N/3.

28 28 IIR Anti-Notch Filtering: An IIR anti-notch filter aims to find "peaks" at a chosen frequency. [Diagram: all-pass filter → anti-notch filter]
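One common way to build an anti-notch filter from an all-pass section is H(z) = (1 - A(z))/2 with a second-order all-pass A(z). The sketch below assumes that construction, a pole radius r = 0.99 and a target frequency of 2π/3 (period 3); none of these values are stated on the slide, and the function name is hypothetical.

```python
import math

def anti_notch(x, w0=2 * math.pi / 3, r=0.99):
    """Filter x with H(z) = (1 - A(z)) / 2, where A(z) is a 2nd-order all-pass.

    A(z) = (r^2 - c z^-1 + z^-2) / (1 - c z^-1 + r^2 z^-2), c = (1 + r^2) cos(w0),
    so H(z) = (1 - r^2)/2 * (1 - z^-2) / (1 - c z^-1 + r^2 z^-2).
    The gain is 1 at w0 and falls off quickly away from it.
    """
    c = (1 + r * r) * math.cos(w0)
    g = (1 - r * r) / 2
    y = []
    x1 = x2 = y1 = y2 = 0.0  # delayed input and output samples
    for xn in x:
        yn = g * (xn - x2) + c * y1 - r * r * y2
        y.append(yn)
        x2, x1 = x1, xn
        y2, y1 = y1, yn
    return y

if __name__ == "__main__":
    period3 = [math.cos(2 * math.pi * n / 3) for n in range(3000)]
    period5 = [math.cos(2 * math.pi * n / 5) for n in range(3000)]
    print(max(abs(v) for v in anti_notch(period3)[-300:]))  # passed
    print(max(abs(v) for v in anti_notch(period5)[-300:]))  # strongly attenuated
```

In a gene-prediction setting the input x would be an indicator sequence such as u_G(n), and the output energy would be tracked along the sequence instead of computing a windowed DFT.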

29 29 Optimized Spectral Content Measure (OSCM): Find good coefficients (a, g, t) for high differentiation between exons and introns. C is ignored because of its linear dependency on the rest. Ar, Tr, Gr are generated from a random DNA sequence, or from introns. [Performance figure]

30 30 OSCM Example. [Figure annotations: direction mistake; good forward detection; good reverse detection]

31 31 OSCM Justification: In genes, the 4 complex variables A, T, C, G are not entirely random and tend to lie near a specific angle (phase). In introns, the phase values seem to be purely random. Those characteristic angles enable us to detect the reading frame as well.

32 32 Distribution of the phase of the DFT at frequency 1/3 in the genes of S. cerevisiae. [Figure: argument distributions for all experimental genes in all chromosomes of S. cerevisiae; panels: arg(A), arg(T), arg(C), arg(G). Panel statistics in transcript order: angular mean = 0.3556, angular deviation = 0.4016; angular mean = -2.6862, angular deviation = 0.8416; angular mean = -1.3734, angular deviation = 0.7903; angular mean = 2.7962, angular deviation = 0.5723]

33 33 Distribution of the phase of the DFT at frequency 1/3 in the introns of S. cerevisiae. [Figure: argument distributions for non-coding regions in all chromosomes of S. cerevisiae; panels: arg(A), arg(T), arg(C), arg(G)]

34 34 Fourier Spectra and Position Asymmetry: f(b,i) is the frequency of the base b in codon position i, i = 1, 2, 3.
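The link between f(b,i) and the spectrum can be checked numerically. For a window of N/3 whole codons, the DFT of the indicator sequence at k = N/3 equals (N/3) * sum over i of f(b,i) * exp(-j*2*pi*(i-1)/3), so the peak vanishes when a base is spread evenly over the three codon positions. The sketch below illustrates that identity; the function names are hypothetical and the slide's own formula is not preserved in the transcript.

```python
import cmath

def position_frequencies(dna, base):
    """f(b, i): fraction of codons whose position i (i = 1, 2, 3) holds `base`."""
    n_codons = len(dna) // 3
    counts = [0, 0, 0]
    for n, ch in enumerate(dna[: 3 * n_codons]):
        if ch == base:
            counts[n % 3] += 1
    return [c / n_codons for c in counts]

def dft_from_frequencies(freqs, n_codons):
    """Rebuild U_b[N/3] from f(b,1), f(b,2), f(b,3)."""
    return n_codons * sum(f * cmath.exp(-2j * cmath.pi * i / 3)
                          for i, f in enumerate(freqs))

if __name__ == "__main__":
    seq = "ATGGCAATTACGGATTAA"
    f_a = position_frequencies(seq, "A")
    print(f_a, abs(dft_from_frequencies(f_a, len(seq) // 3)))
```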

35 35 Genes versus Introns
 | Coding regions (genes and exons) | Introns and intergenic spacers
Magnitude | large | small
Phase | narrow distribution | randomly distributed
[Figures: distribution of the DFT of T at 1/3 frequency; distribution of the DFT of G at 1/3 frequency. Data taken from S. cerevisiae, chr. IV]

36 36 Finding Reading Frame (OSCM Phase): The phase θ is concentrated around θ1, θ2 and θ3, corresponding to each reading frame. The optimization lowers the variance of θ. θ is transformed to a color, so the reading frame can be derived at a glance.
Color | Reading frame
Red | 1
Green | 2
Blue | 3

37 37 New Methods in This Project: Linear prediction. Classification by clustering (CC). Classification by compression ratios.

38 38 Linear Prediction: Create a walk from the indicator sequences. For each window, find the LP coefficients. Look for differences in correlation through the pole map, the frequency response and the prediction error. No new findings with this method.

39 39 Classification by Clustering: Recall that the DFT at frequency k = N/3 is strongly correlated with gene locations and reading frames (as shown in part A). Here we'll attempt to use it to discriminate exons from the rest, in a 6D space. Learning phase: clustering. Classification phase: fuzzy KNN.

40 40 Classification by Clustering, Clustering Stage: Example. [Figure, from left to right: C, G and T. S. cerevisiae, 5th chromosome]

41 41 Classification by Clustering. [Diagram: DNA (… atcgtgactagc …, with a marked start position) → indicator sequences u_T, u_C, u_G → DFT at k = N/3 of each indicator → new sample (T, C, G) → tested as RF = 1 directly and as RF = 2, 3 after rotations of +120° and -120° → max → threshold → outputs: Exon? and the reading frame (if it's an exon)]

42 42 Classification Rule: Fuzzy KNN: create a fuzzy membership function and choose the class with the highest score. A fuzzy clustering iteration is added to the LBG algorithm. Two methods for classifying gene/non-gene: (1) add up the gene scores and the non-gene scores, and the larger sum wins; (2) the maximal centroid score wins. The 2nd method is used (better performance). Score sums are used for the reading frame: the reading frame with the maximal sum wins.
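A fuzzy KNN score in the usual inverse-distance form can be sketched as follows. This is a simplified illustration under stated assumptions: neighbours carry crisp labels rather than fuzzy memberships, the points stand in for the centroids produced by the clustering stage, and the defaults K = 7 and m = 2 are borrowed from the results slide. The function name and the toy 2D data are hypothetical.

```python
import math

def fuzzy_knn(sample, points, labels, k=7, m=2.0, eps=1e-12):
    """Return {label: membership} for `sample` from its k nearest labelled points.

    Each neighbour votes with weight 1 / d^(2/(m-1)); memberships sum to 1.
    """
    nearest = sorted(
        ((math.dist(sample, p), lab) for p, lab in zip(points, labels)),
        key=lambda pair: pair[0],
    )[:k]
    power = 2.0 / (m - 1.0)
    weights = [(1.0 / (d ** power + eps), lab) for d, lab in nearest]
    total = sum(w for w, _ in weights)
    scores = {}
    for w, lab in weights:
        scores[lab] = scores.get(lab, 0.0) + w / total
    return scores

if __name__ == "__main__":
    points = [(1.0, 0.0), (1.1, 0.1), (0.9, -0.1), (1.0, 0.2),
              (0.0, 0.0), (0.1, 0.1), (-0.1, 0.0), (0.0, -0.1)]
    labels = ["exon"] * 4 + ["non-exon"] * 4
    scores = fuzzy_knn((0.95, 0.0), points, labels)
    print(max(scores, key=scores.get), scores)
```

Choosing the label with the largest summed membership corresponds to the "max sum wins" variant; the "max centroid score wins" variant would compare the single largest weight per class instead.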

43 43 Results: Creature: S. cerevisiae. Learning was done on the 5th chromosome. Parameters: K = 7 and m = 2 for fuzzy KNN; true exon: 50% exon; thresh = 1. Total: only 4.6% of true exons weren't detected at all.
chromosome # | # missed | # exons | f_n_exons | rf_true | f_n | f_p
1 | 9 | 102 | 0.0882 | 0.9574 | 0.4524 | 0.1037
2 | 18 | 381 | 0.0472 | 0.9685 | 0.4735 | 0.0821
3 | 11 | 155 | 0.071 | 0.9551 | 0.4618 | 0.0917
4 | 21 | 725 | 0.029 | 0.9654 | 0.4615 | 0.0821
6 | 6 | 120 | 0.05 | 0.9762 | 0.4247 | 0.1102
7 | 13 | 504 | 0.0258 | 0.9647 | 0.4749 | 0.0821
8 | 12 | 263 | 0.0456 | 0.9671 | 0.4716 | 0.103
9 | 8 | 200 | 0.04 | 0.9476 | 0.452 | 0.1091
10 | 10 | 341 | 0.0293 | 0.9723 | 0.4719 | 0.1005
11 | 23 | 327 | 0.0703 | 0.9641 | 0.4816 | 0.0822
12 | 25 | 486 | 0.0514 | 0.9722 | 0.4759 | 0.0973
13 | 16 | 438 | 0.0365 | 0.9607 | 0.4682 | 0.0885
14 | 15 | 378 | 0.0397 | 0.9616 | 0.4597 | 0.1041
15 | 16 | 514 | 0.0311 | 0.9665 | 0.4644 | 0.0904
16 | 20 | 442 | 0.0452 | 0.9662 | 0.4744 | 0.0824
Total | 223 | 5376 | | | |

44 44 CC - Example

45 45 CC - Improving: Instead of deciding for each reading frame separately and then deciding which reading frame “won”, we can replicate the centroids for the other reading frames, so that the classification rule determines [exon / non-exon] and [reading frame] at the same time. This is supposed to give a fairer competition between the reading frames.

46 46 Classification by Compression Rates: In forward coding, 3 different codon sequences are created. For reverse coding, all the DNA is first complemented and then treated like forward coding (the results will also be reversed). At the end of this stage we have 6 codon series. [Figure: nucleotides ('A', 'C', 'T', 'G'), A T C G A T C G T A C G C A T G C A T G C A T G C A T G A A A A, mapped to codons (0..63), 60…11829]
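The step from nucleotides to six codon series might look like the sketch below. Two things here are assumptions rather than facts from the slides: the digit order A=0, C=1, T=2, G=3 (taken from the order in which the figure lists the nucleotides) and the use of the reverse complement for the three reverse frames, which is the standard reading of the opposite strand. All names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
DIGIT = {"A": 0, "C": 1, "T": 2, "G": 3}  # assumed order, not confirmed by the slides

def codon_index(codon):
    """Map a 3-letter codon to an integer in 0..63 (base-4 number)."""
    a, b, c = (DIGIT[ch] for ch in codon)
    return 16 * a + 4 * b + c

def codon_series(dna, offset):
    """Codon indices of `dna` read in steps of 3 starting at `offset`."""
    return [codon_index(dna[i:i + 3]) for i in range(offset, len(dna) - 2, 3)]

def six_codon_series(dna):
    """Return {(strand, frame): codon indices} for the 6 reading frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    return {
        (strand, frame + 1): codon_series(seq, frame)
        for strand, seq in (("+", dna), ("-", reverse))
        for frame in range(3)
    }

if __name__ == "__main__":
    for key, series in six_codon_series("ATGAAATAG").items():
        print(key, series)
```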

47 47 The Idea: If we have a dictionary of the words (= codon sequences) that are popular in exons but not popular in non-exons, then good compression will be achieved in exons and will not be achieved in introns. So we need a good dictionary and a good compression algorithm.

48 48 Building the Dictionary: Aim: the output dictionary is expected to hold short words that are popular in exons. Uses the LZW algorithm. Input: all exons of the learnt chromosome. Initial dictionary: all codons. A restriction is added on the length of words entered into the dictionary. Output I: a dictionary with words that appeared in exons. Output II: the code of the exons under that dictionary.

49 49 LZW: Encoding
1) accum ← first input letter
2) If dict.Find(accum) == false:
   1) dict.Add(accum)
   2) code.Add(index)
   3) accum ← accum(end)
   4) Return to (2)
3) Else:
   1) index = dict.Findwhere(accum)
   2) accum.Add(next letter from input)
   3) Return to (2)
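The encoding loop above can be written as a short runnable sketch. It assumes the input symbols are codon indices (0..63), that the initial dictionary holds every single codon, and that the optional `max_word_len` argument plays the role of the word-length restriction mentioned for the dictionary; the function and parameter names are hypothetical.

```python
def lzw_encode(symbols, alphabet_size=64, max_word_len=None):
    """LZW-encode a sequence of integer symbols.

    Returns (dictionary, code): the dictionary maps word tuples to indices,
    and code is the list of emitted dictionary indices.
    """
    dictionary = {(s,): s for s in range(alphabet_size)}
    code = []
    accum = ()
    for sym in symbols:
        candidate = accum + (sym,)
        if candidate in dictionary:
            accum = candidate          # keep growing the current word
            continue
        code.append(dictionary[accum])  # emit the longest known word
        if max_word_len is None or len(candidate) <= max_word_len:
            dictionary[candidate] = len(dictionary)
        accum = (sym,)                 # restart from the last letter
    if accum:
        code.append(dictionary[accum])
    return dictionary, code

if __name__ == "__main__":
    symbols = [1, 2, 1, 2, 1, 2, 1]
    dictionary, code = lzw_encode(symbols, alphabet_size=4)
    print(code, len(code) / len(symbols))
```

The ratio of code length to input length gives one possible compression rate per window: windows rich in dictionary words produce short codes.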

50 50 Dictionary Pruning: The output LZW dictionary is a tree (trie). Aim: keep the most popular words, but don't allow undesired redundancy. Method: go over every level of the tree (starting with the maximum-length words) and take a predefined number of popular words; pass the number of appearances (from the output code) to the parents, either as the sum of all children or as the sum of the untaken ones. More variations: multiply by the entropy.

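51 51 Using Entropy for Better Pruning
Example 1: parent [31 45 1] with children [31 45 1 60], [31 45 1 30], [31 45 1 13], [31 45 1 31], 6 appearances each: 24*log(4) = 48
Example 2: parent [31 45 1] with a single child [31 45 1 30], 40 appearances: 40*log(1) = 0
Example 3: parent [31 45 1] with children [31 45 1 60], [31 45 1 30], [31 45 1 13], [31 45 1 31] and unequal appearance counts (1 1 20 1 1 2 2 in the transcript): 20*(-1)*[5/6*log(5/6) + 2*1/24*log(1/24) + 1/16*log(1/16)] = 20*0.8513 = 17.0255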
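The first two examples on this slide are consistent with scoring a parent word by the total appearance count of its children multiplied by the base-2 entropy of how those appearances are split. The sketch below implements that reading; it is an interpretation of the figure, and the function name is hypothetical.

```python
import math

def entropy_weighted_count(child_counts):
    """Total child appearances times the base-2 entropy of their distribution."""
    total = sum(child_counts)
    if total == 0:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in child_counts if c > 0)
    return total * entropy

if __name__ == "__main__":
    print(entropy_weighted_count([6, 6, 6, 6]))  # four equally likely continuations
    print(entropy_weighted_count([40]))          # a single continuation
```

A parent whose continuations are evenly spread scores high, so it is worth keeping as a word of its own; a parent that is always followed by the same codon scores zero, because the longer child word already covers it.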

52 52 Compression Rates Classification
1. Input: DNA of a chromosome and a gene-based dictionary
2. 6 codon sequences for the 6 different reading frames
3. Compressing with the gene-based dictionary
4. 6 compression-rate vectors
5. Rf_wins = Argmax{compress_rate(rf), thresh}; Lowerthresh = Argmax{compress_rate(rf), lower-thresh}; Too_much_stops = 1 if the window has more than 1 stop codon
6. 6 binary vectors + post-processing data
7. Post processing
8. 6 binary vectors: the final classification

53 53 Post Processing: Lower-threshold technique: tag as true every window that lies between close, already-tagged windows, if its value is larger than the lower threshold. Stop-codon count in the window: more than one means it is not an exon window (this window is larger than the analysis window size).
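Both post-processing rules can be sketched in a few lines. The sketch assumes that a higher compression rate means a more exon-like window, that "close" means at most `max_gap` untagged windows between two tagged ones, and that stop codons are counted in a given reading frame of the window's own sequence; `max_gap` and all function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}

def count_stop_codons(window, frame=0):
    """Number of in-frame stop codons in a DNA window."""
    return sum(window[i:i + 3] in STOP_CODONS
               for i in range(frame, len(window) - 2, 3))

def fill_gaps(rates, thresh, lower_thresh, max_gap):
    """Tag windows above thresh, then rescue in-between windows above lower_thresh."""
    tagged = [r > thresh for r in rates]
    hits = [i for i, t in enumerate(tagged) if t]
    for left, right in zip(hits, hits[1:]):
        if right - left - 1 <= max_gap:          # the two tagged windows are close
            for i in range(left + 1, right):
                if rates[i] > lower_thresh:
                    tagged[i] = True
    return tagged

def veto_by_stops(tagged, windows, frame=0):
    """Untag every window that contains more than one in-frame stop codon."""
    return [t and count_stop_codons(w, frame) <= 1
            for t, w in zip(tagged, windows)]

if __name__ == "__main__":
    rates = [0.5, 0.4, 0.42, 0.5, 0.1, 0.1, 0.1, 0.1, 0.5]
    print(fill_gaps(rates, thresh=0.457, lower_thresh=0.35, max_gap=2))
    print(veto_by_stops([True, True], ["ATGTAATGA", "ATGAAATAG"]))
```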

54 54 Compression Rates: Example

55 55 Stop Codons Usage. [Figure: 100,000 bases of the 2nd chromosome; the plotted value is 1 where there is at most one stop codon in the window]

56 56 Post Processing: Stop-codon Usage: Stop-codon usage cleans up many potential false positives without damaging any success measure. Hence a lower principal threshold can be chosen, which gives better performance. [Figure label: without stop codon usage]

57 57 Compression Rates: Results: Learnt chromosome = 1st, window size = 100c, dictionary size = 1381 (32 codons, branching = 3). After choosing the best configuration, going over all the chromosomes:
# | THRESH | # miss | # exons | f_n_exons | rf_true | f_n | f_p
2 | 0.457 | 18 | 384 | 0.046875 | 0.93866 | 0.13809 | 0.10442
3 | 0.457 | 6 | 155 | 0.03871 | 0.92234 | 0.16098 | 0.10015
4 | 0.457 | 27 | 730 | 0.036986 | 0.93809 | 0.14014 | 0.084271
5 | 0.457 | 9 | 261 | 0.034483 | 0.92723 | 0.13763 | 0.090556
6 | 0.457 | 5 | 120 | 0.041667 | 0.92495 | 0.14274 | 0.13909
7 | 0.457 | 14 | 505 | 0.027723 | 0.93927 | 0.14733 | 0.12053
8 | 0.457 | 16 | 267 | 0.059925 | 0.93362 | 0.14538 | 0.15057
9 | 0.457 | 9 | 200 | 0.045 | 0.92458 | 0.13816 | 0.13161
10 | 0.457 | 11 | 343 | 0.03207 | 0.93447 | 0.12411 | 0.12222
11 | 0.457 | 23 | 333 | 0.069069 | 0.93712 | 0.14575 | 0.07833
12 | 0.457 | 32 | 494 | 0.064777 | 0.9405 | 0.13654 | 0.14106
13 | 0.457 | 18 | 441 | 0.040816 | 0.92814 | 0.14338 | 0.11051
14 | 0.457 | 10 | 377 | 0.026525 | 0.93434 | 0.15475 | 0.15044
15 | 0.457 | 23 | 520 | 0.044231 | 0.9357 | 0.14578 | 0.089995
16 | 0.457 | 15 | 444 | 0.033784 | 0.93657 | 0.13794 | 0.12039
total | | 236 | 5574 | 0.0423394 | | |

58 58 Compression Rates: Improving: Use a non-exon dictionary, or prune the exon dictionary taking common non-exon words into account. Adaptive dictionary: when detecting an exon, use its common words to update the current dictionary.

