
Temporal Chaos Game Representation (TCGR) for DNA/RNA Sequence Visualization


1 Temporal Chaos Game Representation (TCGR) for DNA/RNA Sequence Visualization
Margaret H. Dunham, Donya Quick, Yuhang Wang, Monnie McGee, Jim Waddle, and Yu Meng
Southern Methodist University, Dallas, Texas 75275
mhd@engr.smu.edu
8/29/06

2 Outline
- Introduction
  - CGR/FCGR
  - miRNA
  - Motivation
  - Research Objective
- TCGR
- EMM
- miRNA Prediction using TCGR/EMM
- Conclusion/Future Work


4 Chaos Game Representation (CGR)
Scatter plot showing occurrence of patterns of nucleotides.
University of the Basque Country, http://insilico.ehu.es/genomics/my_words/

5 Frequency CGR (FCGR)
Shows the frequencies of oligonucleotides using a color scheme normalized to the distribution of frequency of occurrence of associated patterns.

6 Chaos Game Representation (CGR)
- 2D technique to visually see the distribution of subpatterns
- Our technique is based on the following:
  - Generate totals for each subpattern
  - Scale totals to a [0,1] range. (Note scaling can be a problem)
  - Convert range to red/blue
    - 0-0.5: White to Blue
    - 0.5-1: Blue to Red
[Figure: FCGR grids; each cell is divided into quadrants labeled A, C (top row) and G, U (bottom row).]
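The three steps on this slide (total, scale, colour) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and dividing by the largest total is only one possible scaling scheme, which the slide itself flags as a problem area.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGU"

def subpattern_totals(seq, k):
    """Count every length-k subpattern of seq, including overlapping occurrences."""
    seq = seq.upper().replace("T", "U")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Report all 4**k patterns so that absent patterns show up with a total of 0.
    return {"".join(p): counts.get("".join(p), 0)
            for p in product(ALPHABET, repeat=k)}

def scale_totals(totals):
    """Scale totals to [0, 1] by dividing by the largest total (assumed scheme)."""
    peak = max(totals.values())
    return {p: (c / peak if peak else 0.0) for p, c in totals.items()}

def to_rgb(v):
    """Map v in [0, 1] to a colour: 0 is white, 0.5 is blue, 1 is red."""
    if v <= 0.5:
        t = v / 0.5                      # white -> blue
        return (round(255 * (1 - t)), round(255 * (1 - t)), 255)
    t = (v - 0.5) / 0.5                  # blue -> red
    return (round(255 * t), 0, round(255 * (1 - t)))
```

A single very frequent pattern pushes every other cell toward white under this scaling, which may be the kind of problem the slide's note refers to.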

7 FCGR
a) Nucleotides
  A C
  G T
b) Dinucleotides
  AA AC CA CC
  AG AT CG CT
  GA GC TA TC
  GG GT TG TT
c) Trinucleotides
  AAA AAC ACA ACC CAA CAC CCA CCC
  AAG AAT ACG ACT CAG CAT CCG CCT
  AGA AGC ATA ATC CGA CGC CTA CTC
  AGG AGT ATG ATT CGG CGT CTG CTT
  GAA GAC GCA GCC TAA TAC TCA TCC
  GAG GAT GCG GCT TAG TAT TCG TCT
  GGA GGC GTA GTC TGA TGC TTA TTC
  GGG GGT GTG GTT TGG TGT TTG TTT
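The grids on this slide follow a recursive quadrant rule: the first letter of a pattern selects one of four quadrants, the second letter selects a sub-quadrant within it, and so on. The sketch below reproduces that layout. The function names are hypothetical, and the quadrant assignment (A top-left, C top-right, G bottom-left, T bottom-right) is read off the slide's grids rather than taken from any published code.

```python
from itertools import product

# (row, col) offset of each base within a 2x2 block, as read off the slide.
QUADRANT = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}

def fcgr_cell(pattern):
    """Return the (row, col) of a pattern in the 2**k x 2**k FCGR grid."""
    row = col = 0
    for base in pattern.upper().replace("U", "T"):
        r, c = QUADRANT[base]
        # Each further letter halves the cell size, so double and add the offset.
        row, col = 2 * row + r, 2 * col + c
    return row, col

def fcgr_grid(k):
    """Build the full grid of pattern labels for sub-pattern length k."""
    size = 2 ** k
    grid = [[None] * size for _ in range(size)]
    for p in product("ACGT", repeat=k):
        pattern = "".join(p)
        r, c = fcgr_cell(pattern)
        grid[r][c] = pattern
    return grid
```

Calling `fcgr_grid(2)` yields the dinucleotide grid in panel b), and `fcgr_grid(3)` yields the 8 x 8 trinucleotide grid in panel c).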

8 FCGR Example
Homo sapiens, all mature miRNA, patterns of length 3
[Figure: FCGR with the patterns UUC and GUG labeled.]

9 miRNA
- Short (20-25 nt) sequence of noncoding RNA
- Single strand
- Previously assumed to be garbage
- Impact/prevent translation of mRNA
- Conserved across species (sometimes)
- Reduce protein levels without impacting mRNA levels
- Bind to target areas in mRNA; the problem is that this binding is not perfect (particularly in animals)
- mRNA may have multiple (nonoverlapping) binding sites for one miRNA

10 miRNA Functions
- Causes some cancers
- Embryo development
- Cell differentiation
- Cell death
- Prevents the production of a protein that causes lung cancer
- Control brain development in zebrafish
- Associated with HIV

11 miRNA Research Issues
- Predict/find miRNA
- Predict miRNA targets
- Identify miRNA functions
- Identify how miRNAs work

12 Motivation
2000 bp flanking upstream region, mir-258.2 in C. elegans
[Figure panels: a) All 2000 bp; b) First 240 bp; c) Last 240 bp]

13 Research Objectives
- Identify, develop, and implement algorithms which can be used for identifying potential miRNA functions.
- Create an online tool which can be used by other researchers to apply our algorithms to new data.


15 Temporal CGR (TCGR)
- Temporal version of Frequency CGR
  - In our context, temporal means the starting location of a window
- 2D array
  - Each row represents counts for a particular window in the sequence
    - First row: first window
    - Last row: last window
    - We start successive windows at the next character location
  - Each column represents the counts for the associated pattern in that window
    - Initially we have assumed the order of patterns is alphabetic
  - Size of TCGR depends on sequence length and subpattern length
- As sequence lengths vary, we only examine complete windows
- We only count patterns completely contained in each window

16 TCGR Example
Sequence: acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga
Moving window, counts:
  Window      A    C    G    T
  Pos 0-8     2    3    3    1
  Pos 1-9     1    3    3    2
  …
  Pos 34-42   2    4    2    1
Moving window, scaled values:
  Window      A    C    G    T
  Pos 0-8     0.4  0.6  0.6  0.2
  Pos 1-9     0.2  0.6  0.6  0.4
  …
  Pos 34-42   0.4  0.8  0.4  0.2
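The construction described on the previous slide can be sketched in a few lines and checked against this example. The function names are hypothetical. The scaling step divides by the largest count anywhere in the array, which is an assumption: it reproduces the slide's scaled values (the largest count for this sequence is 5), but the slides do not state the scaling rule.

```python
from itertools import product

def tcgr(seq, window, k, alphabet="ACGT"):
    """Return (patterns, rows): one row of pattern counts per complete window."""
    seq = seq.upper()
    # Columns are the 4**k patterns in alphabetic order.
    patterns = ["".join(p) for p in product(sorted(alphabet), repeat=k)]
    column = {p: j for j, p in enumerate(patterns)}
    rows = []
    # Successive windows start at the next character; only complete windows are used.
    for start in range(len(seq) - window + 1):
        w = seq[start:start + window]
        row = [0] * len(patterns)
        # Only patterns completely contained in the window are counted.
        for i in range(window - k + 1):
            j = column.get(w[i:i + k])
            if j is not None:
                row[j] += 1
        rows.append(row)
    return patterns, rows

def scale_by_global_max(rows):
    """Divide every count by the largest count in the array (assumed scaling)."""
    peak = max((max(r) for r in rows), default=0)
    return [[c / peak if peak else 0.0 for c in r] for r in rows]
```

For the 43-character sequence above with a window of 9 and patterns of length 1, this gives 35 rows (windows 0 through 34) and 4 columns.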

17 TCGR Example (cont’d)
[Figure: TCGRs for sub-patterns of length 1, 2, and 3]

18 TCGR Example (cont’d)
  Window 0:  Pos 0-8    acgtgcacg
  Window 1:  Pos 1-9    cgtgcacgt
  Window 17: Pos 17-25  tccggaacc
  Window 18: Pos 18-26  ccggaacca
  Window 34: Pos 34-42  ccacgtcga
[Figure: TCGR with columns A, C, G, T]

19 TCGR: Viruses miRNA (Window=9; Pattern=1, 2, 3)
[Figure: TCGRs for Epstein Barr, Human Cytomegalovirus, Kaposi sarcoma Herpesvirus, and Mouse Gammaherpesvirus, each shown for Pattern=1, Pattern=2, and Pattern=3]

20 TCGR: Mature miRNA (Window=5; Pattern=3)
[Figure: TCGRs for All Mature, Mus musculus, Homo sapiens, and C. elegans, with the patterns ACG, CGC, GCG, and UCG labeled]


22 EMM Overview
- Time-varying discrete first-order Markov model
- Nodes are clusters of real-world states.
- Learning continues during the prediction phase.
- Learning:
  - Transition probabilities between nodes
  - Node labels (centroid of cluster)
  - Nodes are added and removed as data arrives

23 EMM Definition
Extensible Markov Model (EMM): at any time t, an EMM consists of a Markov chain (MC) with a designated current node, Nn, and algorithms to modify it, where the algorithms include:
- EMMCluster, which defines a technique for matching between input data at time t + 1 and existing states in the MC at time t.
- EMMIncrement, which updates the MC at time t + 1 given the MC at time t and the clustering measure result at time t + 1.
- EMMDecrement, which removes nodes from the EMM when needed.

24 EMM Cluster
- Find the closest node to the incoming event.
- If none is “close”, create a new node.
- The label of a cluster is the centroid of its members.
- O(n)

25 EMM Increment
Input vectors: <18,10,3,3,1,0,0>, <17,10,2,3,1,0,0>, <16,9,2,3,1,0,0>, <14,8,2,3,1,0,0>, <14,8,2,3,0,0,0>, <18,10,3,3,1,1,0>
[Figure: the EMM after each input, growing from a single node N1 to nodes N1, N2, and N3, with transition probabilities such as 1/3, 2/3, 1/2, and 1/1 on the arcs]
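The clustering and increment steps from the last two slides can be sketched together as a small class. This is an illustration under stated assumptions, not the authors' code: the class and method names are hypothetical, Euclidean distance and a fixed `threshold` stand in for the unspecified notion of “close”, and EMMDecrement is left out.

```python
import math

class EMM:
    """Minimal extensible Markov model sketch (assumed distance and threshold)."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed meaning of "close"
        self.centroids = []         # node labels (cluster centroids)
        self.sizes = []             # number of members per node
        self.counts = {}            # (from_node, to_node) -> observed transitions
        self.current = None         # designated current node

    def cluster(self, x):
        """Return the nearest node within the threshold, or None. O(n) scan."""
        best, best_d = None, self.threshold
        for i, c in enumerate(self.centroids):
            d = math.dist(c, x)
            if d <= best_d:
                best, best_d = i, d
        return best

    def increment(self, x):
        """Absorb one input vector: match or create a node, then count the arc."""
        node = self.cluster(x)
        if node is None:
            self.centroids.append([float(v) for v in x])
            self.sizes.append(1)
            node = len(self.centroids) - 1
        else:
            # Running-mean update keeps the label equal to the centroid of members.
            n = self.sizes[node] + 1
            self.centroids[node] = [c + (v - c) / n
                                    for c, v in zip(self.centroids[node], x)]
            self.sizes[node] = n
        if self.current is not None:
            key = (self.current, node)
            self.counts[key] = self.counts.get(key, 0) + 1
        self.current = node
        return node

    def transition_prob(self, i, j):
        """Estimated probability of moving from node i to node j."""
        out = sum(c for (a, _), c in self.counts.items() if a == i)
        return self.counts.get((i, j), 0) / out if out else 0.0
```

Because the counts are updated on every call to `increment`, learning can continue while the model is being used for prediction, as the overview slide describes.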

26 Research Objectives
- Identify, develop, and implement algorithms which can be used for identifying potential miRNA functions.
- Create an online tool which can be used by other researchers to apply our algorithms to new data.
Our approach:
1. Represent a potential miRNA sequence with a TCGR sequence of count vectors.
2. Create an EMM using count vectors for known miRNA (miRNA stem loops, miRNA targets).
3. Predict an unknown sequence to be miRNA (miRNA stem loop, miRNA target) based on the normalized product of transition probabilities along the clustering path in the EMM.
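Step 3 of the approach can be sketched as a scoring function over a clustering path. The slides do not say how the product is normalized; the sketch below assumes a geometric mean (the product raised to one over the number of transitions), so that long and short sequences are comparable. The function names and the dictionary layout of `trans_prob` are hypothetical.

```python
import math

def path_score(path, trans_prob):
    """Geometric mean of transition probabilities along a path of node ids.

    trans_prob maps (from_node, to_node) to a probability. The geometric mean
    is an assumed normalization, not one stated in the slides.
    """
    steps = list(zip(path, path[1:]))
    if not steps:
        return 0.0
    probs = [trans_prob.get(step, 0.0) for step in steps]
    # A single unseen transition makes the whole product zero.
    if min(probs) == 0.0:
        return 0.0
    # Sum logs rather than multiplying, to avoid underflow on long paths.
    return math.exp(sum(math.log(p) for p in probs) / len(probs))

def predict(path, trans_prob, cutoff):
    """Classify as positive when the score reaches the chosen cutoff."""
    return path_score(path, trans_prob) >= cutoff
```

Under this scheme, a sequence whose path uses any transition never seen in training scores exactly 0.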


28 Prediction of miRNA Precursors [1]
- Predicted occurrence of pre-miRNA segments from a set of hairpin sequences
- No assumptions about biological function or conservation across species.
- Used SVMs to differentiate the structure of hairpin segments that contained pre-miRNAs from those that did not.
- Sensitivity of 93.3%
- Specificity of 88.1%
- No report of false positives
[1] C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.

29 Preliminary Test Data [1]
- Positive Training: This dataset consists of 163 human pre-miRNAs with lengths of 62-119.
- Negative Training: This dataset was obtained from protein coding regions of human RefSeq genes. As these are from coding regions, it is likely that there are no true pre-miRNAs in this data. This dataset contains 168 sequences with lengths between 63 and 110 characters.
- Positive Test: This dataset contains 30 pre-miRNAs.
- Negative Test: This dataset contains 1000 randomly chosen sequences from coding regions.
[1] C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.

30 TCGRs for Xue Training Data
[Figure: TCGRs in two groups, labeled POSITIVE and NEGATIVE]

31 TCGRs for Xue Test Data
[Figure: TCGRs in two groups, labeled POSITIVE and NEGATIVE]

32 Predictive Probabilities with Xue’s Data
  EMM       Test Data  Mean     Std Dev   Max      Min
  Negative  Test-Neg   0        0         0        0
  Negative  Test-Pos   0        0         0        0
  Negative  Train-Neg  0.37963  0.050085  0.91256  0.2945
  Negative  Train-Pos  0        0         0        0
  Positive  Test-Neg   0        0         0        0
  Positive  Test-Pos   0.25894  0.18701   0.42075  0
  Positive  Train-Neg  0        0         0        0
  Positive  Train-Pos  0.38926  0.048439  0.91155  0.32209

33 Preliminary Test Results
- Positive EMM
  - Cutoff Probability = 0.3
  - False Positive Rate = 0%
  - True Positive Rate = 66%
- Test results could be improved by meta classifiers combining multiple positive and negative classifiers together.
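The two rates on this slide follow from applying a cutoff to per-sequence scores. A minimal sketch is shown below; the function name is hypothetical, and treating a score equal to the cutoff as positive is an assumption.

```python
def rates_at_cutoff(pos_scores, neg_scores, cutoff):
    """Return (true positive rate, false positive rate) at a score cutoff."""
    # A sequence is called positive when its score reaches the cutoff.
    tp = sum(s >= cutoff for s in pos_scores)
    fp = sum(s >= cutoff for s in neg_scores)
    tpr = tp / len(pos_scores) if pos_scores else 0.0
    fpr = fp / len(neg_scores) if neg_scores else 0.0
    return tpr, fpr

# Illustrative scores only; these are not values from the study.
example_pos = [0.42, 0.35, 0.0]
example_neg = [0.0, 0.0]
example_tpr, example_fpr = rates_at_cutoff(example_pos, example_neg, 0.3)
```

With a positive test set of 30 sequences, a true positive rate of 66% would correspond to roughly 20 sequences scoring at or above the cutoff.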


35 Conclusion/Future Work
This is ongoing research. Results, although promising, are preliminary.

36 Future Research
1. Obtain all known mature miRNA sequences for a species, initially the 119 C. elegans miRNAs.
2. Create TCGR count vectors for each sequence and each sub-pattern length (1, 2, 3, 4, 5).
3. Train EMMs using this data for each sub-pattern length. Thus five EMMs will be created.
4. Obtain negative data (much as Xue did in his research) from coding regions for C. elegans.
5. Train EMMs using this negative data for each sub-pattern length. Thus five more EMMs will be created.
6. Construct a meta-classifier based on the combined results of prediction from each of these ten EMMs.
7. Apply the EMM classifier to the existing ~75 × 10^6 base pairs of non-exonic sequence in the C. elegans genome to search for miRNAs. Note: all 119 validated C. elegans miRNAs are contained in the non-exonic part of the genome, and thus the first pass of the algorithm will be tested for its ability to detect all 119 validated miRNAs.
8. Validate the prediction of novel miRNAs using molecular biology.

