Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine Chenghai Xue, Fei Li, Tao He,


1 Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine
Chenghai Xue, Fei Li, Tao He, Guo-Ping Liu, Yanda Li, and Xuegong Zhang
CISC 841 Bioinformatics Nehar

2 Background: miRNAs
- Single-stranded RNAs, ~20-25 nucleotides long, that play a regulatory role in gene expression.
- Transcribed as a long primary miRNA (pri-miRNA) that has a hairpin structure.
- The pri-miRNA is processed by the nuclear RNase III Drosha into a ~60-70 nt pre-miRNA.
- The pre-miRNA is actively transported from the nucleus to the cytoplasm by Exportin-5.
- It is then cleaved into the ~20-25 nt mature miRNA.

3 Background: The 'hairpin loop'
- A sequence of nucleotides in which two segments can base-pair with each other, but the segment between them cannot.

4 Background: The 'hairpin loop'
The sequence ---CCTGCXXXXXXXGCAGG--- forms the hairpin structure:

    ---C G---
       C G
       T A
       G C
       C G
      X   X
      X   X
       X X X
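The pairing on this slide can be checked mechanically. The following is a minimal sketch, assuming plain Watson-Crick pairing on the DNA alphabet used in the example; the function names are illustrative and do not come from the presentation.

```python
# Watson-Crick complements for the DNA alphabet used on the slide.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def stem_pairs(left_arm, right_arm):
    """Pair the left arm (read 5'->3') with the right arm (read 3'->5').

    Returns the list of base pairs if the arms are fully complementary,
    otherwise an empty list.
    """
    if reverse_complement(left_arm) != right_arm:
        return []
    return list(zip(left_arm, reversed(right_arm)))


# The slide's example: two 5-nt arms around a 7-nt unpaired loop (X = any base).
sequence = "CCTGC" + "X" * 7 + "GCAGG"
left_arm, loop, right_arm = sequence[:5], sequence[5:12], sequence[12:]
pairs = stem_pairs(left_arm, right_arm)
print(pairs)
```

The pairs come out in the same top-to-bottom order as the stem drawn on the slide, with the seven X positions left over as the loop.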

5 Background: The 'hairpin loop'
- A sequence of nucleotides in which two segments can base-pair with each other, but the segment between them cannot.
- The pre-miRNA hairpin is an important secondary structure for identifying miRNAs.
- Since mature miRNAs are very short (~20 nt), sequence alignment is not very useful for identifying them.
- The solution is to make use of the hairpin structure of the pre-miRNA.

6 The problem
- Many sequence segments fold into a similar stem-loop hairpin structure.
- Existing methods for identifying miRNAs must therefore use comparative genomics information in addition to structural features. An example: filter out hairpins that are not conserved in related species.
- This implies an inability to identify miRNAs without close known homologues.
- Furthermore, comparative genomics approaches cannot be applied to species that have no closely related sequenced species.

7 Proposed solution
- Ab initio (from first principles) classification of real pre-miRNAs versus "pseudo" pre-miRNAs, i.e. sequences that have the hairpin structure but are not pre-miRNAs.
- Build a set of novel features that combine local structure and sequence information of pre-miRNA stem-loops.
- Use an SVM to classify hairpins as pre-miRNA or pseudo pre-miRNA.

8 The datasets
- Sets of human pre-miRNA and pseudo-miRNA hairpins were collected to train the SVM and evaluate its performance.
- Human pre-miRNAs were downloaded from the miRNA registry database. Only pre-miRNAs without multiple loops were considered (193, about 93% of the database).
- Pseudo and candidate miRNA hairpins: segments that have a stem-loop structure similar to pre-miRNAs but are not pre-miRNAs.
  - The CODING dataset and the CONSERVED-HAIRPIN dataset.

9 The Coding dataset
- Collected from protein-coding regions.
- Used as negative samples in training and validation of the classifier.
- Length distribution kept identical to that of the pre-miRNAs.
- Criteria for selection:
  - A minimum of 18 base pairings on the stem of the hairpin.
  - A maximum free energy of secondary structure of -15 kcal/mol.
  (These numbers correspond to the limits observed for genuine human pre-miRNAs.)
- 8,494 pre-miRNA-like hairpins in this dataset.
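The two selection criteria can be written as a simple filter. This is a sketch only: it assumes each candidate comes with a dot-bracket structure string and a free energy in kcal/mol already computed by some secondary-structure prediction tool, and the function and constant names are hypothetical. It does not check the additional single-loop requirement.

```python
# Thresholds taken from the slide.
MIN_BASE_PAIRS = 18        # minimum number of base pairs on the stem
MAX_FREE_ENERGY = -15.0    # kcal/mol; more negative means more stable


def count_base_pairs(dot_bracket):
    """Each '(' in dot-bracket notation opens exactly one base pair."""
    return dot_bracket.count("(")


def is_pre_mirna_like(dot_bracket, free_energy):
    """Return True if a hairpin passes both selection criteria."""
    enough_pairs = count_base_pairs(dot_bracket) >= MIN_BASE_PAIRS
    stable_enough = free_energy <= MAX_FREE_ENERGY
    return enough_pairs and stable_enough


# Made-up candidates for illustration (structure, free energy).
long_stable = ("(" * 20 + "." * 6 + ")" * 20, -22.4)
short_stem = ("(" * 10 + "." * 6 + ")" * 10, -18.0)
long_unstable = ("(" * 20 + "." * 6 + ")" * 20, -9.0)

for structure, energy in (long_stable, short_stem, long_unstable):
    print(count_base_pairs(structure), energy, is_pre_mirna_like(structure, energy))
```

Note that "maximum of -15 kcal/mol" means the energy must be at or below -15, which is why the comparison uses `<=`.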

10 The Conserved-hairpin dataset
- Extracted from the genome region at positions 56,000,001-57,000,000 on human chromosome 19 (UCSC database).
- Used as a candidate dataset to evaluate the classifier.
- 2,444 hairpins from sequences conserved between human and mouse.
- Most hairpins are likely to be pseudo-miRNAs. In fact, there are only 3 known miRNAs in this dataset.

11 Training and Test sets
- For the classification experiments, one training set and two test sets were built from the 3 datasets.
- TR-C: Training set.
  - 163 human pre-miRNAs (positive samples) out of the 193 human pre-miRNAs.
  - 168 pseudo pre-miRNAs (negative samples) from the Coding dataset.
- TE-C: Test set 1.
  - The remaining 30 human pre-miRNAs and 1000 pseudo pre-miRNAs (avoiding those in TR-C).
- Conserved-hairpin dataset: Test set 2.
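A split with these sizes could be produced as follows. This is a sketch under assumptions: the slide does not say how the samples were chosen, so random selection with a fixed seed is used here, and integer indices stand in for the actual hairpin records.

```python
import random

# Dataset sizes from the slides; integer ids stand in for real records.
N_REAL = 193       # human pre-miRNAs without multiple loops
N_CODING = 8494    # pseudo hairpins in the Coding dataset

rng = random.Random(0)  # fixed seed so the split is reproducible

# Positive samples: 163 for training, the remaining 30 for testing.
real_ids = list(range(N_REAL))
rng.shuffle(real_ids)
tr_c_pos, te_c_pos = real_ids[:163], real_ids[163:]

# Negative samples: 168 for training, then 1000 for testing drawn
# from the hairpins that were not used in training.
coding_ids = list(range(N_CODING))
tr_c_neg = rng.sample(coding_ids, 168)
used = set(tr_c_neg)
remaining = [i for i in coding_ids if i not in used]
te_c_neg = rng.sample(remaining, 1000)

print(len(tr_c_pos), len(tr_c_neg), len(te_c_pos), len(te_c_neg))
```

Drawing the test negatives only from `remaining` is what keeps TE-C free of hairpins already seen in TR-C.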

12 Two further test sets
- The SVM trained on the previous sets is applied to two further test sets.
- Cross-Species test set
  - 581 pre-miRNAs from 11 species.
- Updated test set
  - A new batch of reported human miRNAs.
  - Includes 39 non-redundant pre-miRNAs without multiple loops.

13 Local contiguous structure-sequence features
- Local sequence features are important in pre-miRNAs.
- The authors claim that the distribution of local sub-structures (i.e. continuously paired or unpaired structures) in pre-miRNAs is significantly different from that in pseudo pre-miRNAs.
- Use a combination of local structure and sequence information to classify real vs. pseudo miRNA hairpins.
- Focus on the information in 3 adjacent nucleotides (triplet elements).
- "(" and ")" mean paired at the 5'-end and the 3'-end respectively; "." means unpaired. The paper does not distinguish between the 5' and 3' cases, so only "(" and "." are used in the features.

14 Structure-sequence features
- There are 8 possible structure compositions for each triplet: "(((", "((.", "(..", and so on.
- Taking the middle nucleotide (U, C, G or A) into account gives 4 x 8 = 32 structure-sequence combinations.

15 Structure-sequence features
- For example, "U(((" means the middle nt is U and all three nts are paired.
- Count the appearances of each triplet element to get a 32-dimensional feature vector (normalized).
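The feature construction can be sketched in a few lines. Several details here are assumptions rather than statements from the slides: the triplet window is slid over every interior position of the hairpin, the vector is normalized so that its entries sum to 1, and the function and variable names are illustrative.

```python
from itertools import product

NUCLEOTIDES = "ACGU"
# The 8 structure triplets over the two symbols "(" (paired) and "." (unpaired).
STRUCTURES = ["".join(p) for p in product("(.", repeat=3)]
# The 32 feature labels, e.g. "U(((" = middle nt U, all three nts paired.
FEATURES = [nt + s for nt in NUCLEOTIDES for s in STRUCTURES]


def triplet_features(sequence, dot_bracket):
    """Return the 32-dimensional normalized triplet-element vector."""
    if len(sequence) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    # No 5'/3' distinction: treat ")" the same as "(".
    structure = dot_bracket.replace(")", "(")
    counts = dict.fromkeys(FEATURES, 0)
    # Slide a window of 3 positions; i is the middle nucleotide.
    for i in range(1, len(sequence) - 1):
        key = sequence[i] + structure[i - 1:i + 2]
        if key in counts:  # skips any character outside A, C, G, U
            counts[key] += 1
    total = sum(counts.values())
    # Assumed normalization: divide by the total count (all zeros if empty).
    return [counts[f] / total if total else 0.0 for f in FEATURES]


# Toy hairpin: 3 paired bases, a 3-nt loop, 3 paired bases.
vector = triplet_features("GGGAAACCC", "(((...)))")
for label, value in zip(FEATURES, vector):
    if value > 0:
        print(label, round(value, 3))
```

For the toy hairpin, the seven interior positions each yield a different triplet element, so seven entries of the vector are non-zero and equal.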

16 SVM Classification
- The SVM classifier is trained on TR-C and applied to the test sets.
- On TE-C, 28/30 human pre-miRNAs and 881/1000 pseudo-miRNAs are correctly identified.
- On the Conserved-hairpin set, 2174/2444 structures are classified as pseudo-miRNAs.
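The counts on this slide translate directly into rates. The short calculation below uses only the numbers reported above; the variable names are illustrative.

```python
def rate(correct, total):
    """Fraction of correctly handled examples."""
    return correct / total


# Test set with 30 real pre-miRNAs and 1000 pseudo pre-miRNAs.
sensitivity = rate(28, 30)              # real pre-miRNAs recognised
specificity = rate(881, 1000)           # pseudo pre-miRNAs rejected
accuracy = rate(28 + 881, 30 + 1000)    # both classes together

# Conserved-hairpin set: fraction of hairpins rejected as pseudo.
conserved_rejected = rate(2174, 2444)

print(f"sensitivity        {sensitivity:.1%}")
print(f"specificity        {specificity:.1%}")
print(f"accuracy           {accuracy:.1%}")
print(f"conserved rejected {conserved_rejected:.1%}")
```

Because the pseudo hairpins outnumber the real ones by more than 30 to 1 in this test set, the overall accuracy is dominated by the specificity.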

17 SVM Classification
- The triplet elements reflect contiguous fine structures and sequence composition. For instance, "(((" indicates stacking of paired bases and "..." indicates bulge loops.
- The success of the classifier shows that these features reflect intrinsic characteristics of pre-miRNAs.
- "(((" appears at a higher frequency in pre-miRNAs, and "..." appears more often in pseudo miRNAs.

18 SVM Classification
Figure: average frequency of triplets in the training dataset.

19 SVM Classification
- The triplet elements reflect contiguous fine structures and sequence composition. For instance, "(((" indicates stacking of paired bases and "..." indicates bulge loops.
- The success of the classifier shows that these features reflect intrinsic characteristics of pre-miRNAs.
- "(((" appears at a higher frequency in pre-miRNAs, and "..." appears more often in pseudo miRNAs.
- These observations can be linked to the stability of the secondary structure. Stacking of more continuously paired nts decreases the free energy, so pre-miRNAs are more stable.

20 SVM Classification
- Sequence information
  - The frequency of the same triplet structure varies with the middle nt, both within real pre-miRNAs and between real and pseudo miRNAs.

21 SVM Classification
Figure: average frequency of triplets in the training dataset.

22 SVM Classification across species
- The classifier trained on human data was applied to other species (Cross-Species test set).
- Good performance in identifying true pre-miRNAs.
- 581 known pre-miRNAs from 11 species; 90.9% overall accuracy.

23 SVM Classification across species

24 Conclusion
- Ab initio methods for distinguishing true pre-miRNAs from pre-miRNA-like hairpin structures are very important.
- The triplet-SVM classifier captures fine-grained sequence-structure characteristics.
- About 90% accuracy on human data.
- Up to 90% accuracy on 11 other species (including plants and a virus) without using comparative genomics information.
- The current specificity of about 89% is not enough for genome-wide applications.
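A rough calculation shows why a specificity of about 89% is too low for genome-wide use. It takes the Conserved-hairpin dataset from the earlier slides (2,444 hairpins, 3 known miRNAs) as a stand-in for a genomic region, and assumes for illustration that every other hairpin is a pseudo-miRNA and that all 3 known miRNAs are detected. Both assumptions are simplifications, not claims from the presentation.

```python
def expected_false_positives(n_pseudo, specificity):
    """Pseudo hairpins expected to be wrongly called real pre-miRNAs."""
    return n_pseudo * (1.0 - specificity)


def precision(true_positives, false_positives):
    """Fraction of positive calls that are actually real."""
    return true_positives / (true_positives + false_positives)


N_HAIRPINS = 2444     # Conserved-hairpin dataset (1 Mb of chromosome 19)
N_KNOWN_REAL = 3      # known miRNAs in that dataset
SPECIFICITY = 0.89    # approximate value from the conclusion slide

false_pos = expected_false_positives(N_HAIRPINS - N_KNOWN_REAL, SPECIFICITY)
best_case_precision = precision(N_KNOWN_REAL, false_pos)

print(f"expected false positives: {false_pos:.0f}")
print(f"best-case precision:      {best_case_precision:.1%}")
```

Under these assumptions, false positives outnumber true hits by roughly 90 to 1 in a single megabase, so nearly every positive call would need independent confirmation.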

