miRNA Discovery and Prediction Algorithms

1 miRNA Discovery and Prediction Algorithms
George Michopoulos

2 microRNAs
- What are they?
- Why do we care about them?
- How do we discover them?
  - Biological Methods
  - Computational Methods
- What limitations do these methods have?

3 What is microRNA?
Biogenesis:
- Much like mRNA, a microRNA is transcribed from DNA.
- The process starts with a microRNA precursor, which folds into a characteristic hairpin shape.
- The hairpin travels out of the nucleus, where enzymes process it into a short, single-stranded mature microRNA.
Function:
- MicroRNAs regulate gene expression by binding to protein-encoding mRNAs.
- The mature microRNA is incorporated into a large protein complex called RISC (RNA-Induced Silencing Complex).
- Through base pairing, in which complementary sequences on the microRNA bind the corresponding mRNA much like a lock and key, microRNAs inhibit protein translation.
- Antagomirs, chemically engineered oligonucleotides, can be used to silence endogenous microRNAs.
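
The base-pairing idea can be illustrated with a small sketch. It assumes the common simplification that a target site is a perfect Watson-Crick match to nucleotides 2-8 of the mature miRNA (the "seed"); the slide itself only says that complementary sequences bind, so the seed rule and the function names here are illustrative assumptions rather than anything from the talk.

```python
# Minimal sketch: find candidate target sites by perfect complementarity
# to the miRNA seed (nt 2-8). Assumption: only Watson-Crick pairs count.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA string (A, C, G, U only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna.upper()))


def seed_sites(mirna: str, mrna: str) -> list[int]:
    """Return 0-based start positions in mrna that pair with miRNA nt 2-8."""
    seed = mirna.upper()[1:8]          # nucleotides 2-8, counted from the 5' end
    site = reverse_complement(seed)    # what the mRNA must read, 5' to 3'
    mrna = mrna.upper()
    return [
        i
        for i in range(len(mrna) - len(site) + 1)
        if mrna[i:i + len(site)] == site
    ]
```

Real target prediction also weighs site context and conservation, which this sketch ignores.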

4 miRNA structure Small non-coding RNAs ~22-25 bases long
Characterized by their hairpin precursors, composed of the mature, the loop, and the star miRNA

5 miRNA biogenesis
- Transcribed in the nucleus
- The pri-miRNA hairpin gets cut by the Drosha enzyme
- The pre-miRNA then either degrades into miRNA naturally or gets cleaved by the Dicer enzyme
- The miRNA is then bound by an Argonaute protein into an RNA-induced silencing complex
- The complex then binds target mRNA and cleaves it

6 Why do we care?
miRNAs regulate protein expression, including proteins involved in:
- Cancer – inhibit proteins responsible for controlling proliferation
- Neural development – linked to schizophrenia
- Cardiac development – linked to cardiomyopathies
- DNA methylation and histone modification – can alter the expression of target genes

7 Why do we care?
- Antagomirs, chemically engineered oligonucleotides that silence endogenous microRNA, could be used as a therapy for such diseases
- Non-coding RNAs account for a significant portion of the genome, so their homology can be used as a tool to assess phylogeny

8 Detection and Discovery
Biological Methods:
- Can use RT-PCR and qPCR for individual miRNAs
- Can use microarrays to detect multiple miRNAs
Computational Methods:
- Mining deep-sequencing data and using predictive algorithms to detect miRNA characteristics and compare potential sequences to homologs
  - Bentwich et al. (2005)
  - miRAlign: Wang et al. (2005)
  - miRDeep: Friedländer et al. (2008)
  - miRDeep2: Friedländer et al. (2011)
“MicroRNA Discovery Process
Screen for hairpin structures. MicroRNAs are known to exist in hairpin structures. We mined the entire human genome, over 6 billion DNA bases, for such structures and identified approximately 11 million hairpin structures.
Select microRNA candidates. Using our proprietary algorithms we selected a set of several thousand likely microRNA candidates from the 11 million hairpin structures we had previously identified.
Detect by microarray. In order to biologically detect the existence of the candidate microRNAs, we developed a proprietary microarray technology designed to detect the expression of microRNAs in tissues. Using these microarrays we have, to date, detected approximately 1,500 expressed microRNA candidates.
Biologically validate by sequencing or qRT-PCR. In order to further confirm their existence, microRNA candidates that were detected by microarray are biologically validated using either our proprietary sequencing technique or a qRT-PCR technique.” –Rosetta Genomics

9 RT-PCR
- Reverse transcription polymerase chain reaction, not to be confused with real-time PCR (qPCR)
- The desired RNA is reverse transcribed and the resulting cDNA is amplified using qPCR
- Useful for detecting very low copy numbers of RNA molecules; the oldest method, and not specific to miRNA

10 Northern Blotting
- Measures levels of RNA expression using probes with partial homology
- This picture shows a northern blot that has detected 4/5 of the shown microRNAs
- Lower sensitivity, but higher specificity than RT-PCR
- Fewer false positives

11 Microarray Detection
- Microarrays were first used to detect miRNAs in 2004, by different groups
- Probes can be developed and the chip then ordered through companies (Barad et al.)
- Alternatively, everything can be developed and put together using amine-binding slides and an array printer (Miska et al.)
- Far more efficient for large-scale discovery, but limited by the need for prior sequence data for probe development
- Each probe was modified with a free amino group linked to its 5' terminus through a 6-carbon spacer (IDT) and was printed onto amine-binding slides (CodeLink, Amersham Biosciences)

12 Barad et al. (2004)
- Took known miRNA sequences
- Created DNA chips with probes complementary to those sequences
- Hybridized miRNA samples onto the chips
- Performed clustering analysis
- Used mirMASA to confirm findings
- Found that the microarray method has higher sensitivity and specificity than previous miRNA identification methods
- Designed two DNA chips, prepared by Agilent using their SurePrint technology, that contained all the known human miRNA sequences
- Confirmed their sequences using mirMASA technology and then compared the findings to Sempere et al., who conducted Northern blots to test for expression
- mirMASA is a fluorescence-based solution hybridization method. It uses a specific capture-oligo for each targeted miRNA, covalently coupled onto color-coded microspheres (beads), and a detection-oligo labeled with biotin
- Provided strong evidence for higher sensitivity and specificity of the microarray in comparison with the methods then in use for detection and characterization of miRNA

13 Useful Programs: RNAfold
RNAfold is a program in the ViennaRNA Package (the “Vienna Package”). It takes in RNA sequences and calculates their minimum free energy (MFE) secondary structure, outputting the predicted structure and its free energy.
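
RNAfold reports the predicted structure in dot-bracket notation, where matching parentheses mark paired bases and dots mark unpaired ones. As a rough illustration of how such output can be consumed downstream, the sketch below converts a dot-bracket string into a list of base pairs. The function name is hypothetical and is not part of the Vienna tools.

```python
def base_pairs(dot_bracket: str) -> list[tuple[int, int]]:
    """Return sorted (i, j) index pairs (0-based) encoded by a dot-bracket string."""
    stack: list[int] = []
    pairs: list[tuple[int, int]] = []
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)                  # remember the opening position
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))   # close the most recent open base
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)
```

For a hairpin such as `((..))`, this yields the two stem pairs `(0, 5)` and `(1, 4)`.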

14 Useful Programs: ClustalW
ClustalW is a multiple sequence alignment tool that is frequently used to compare homologous sequences across species, or to compare families of genes. It aligns every pair of input sequences, builds a guide tree from the pairwise scores, and then uses that tree to progressively build the multiple alignment.
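
The pairwise stage can be sketched with a textbook Needleman-Wunsch global alignment score. This is only an illustration of the idea; the match, mismatch and gap values are arbitrary assumptions and do not reproduce ClustalW's actual scoring scheme or its guide-tree step.

```python
def global_alignment_score(
    a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2
) -> int:
    """Return the optimal global alignment score of a and b (linear gap cost)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

Scores like this, computed for every pair of sequences, are the kind of input from which a guide tree can be built.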

15 Bentwich et al. (2005)
- Computationally scanned the whole human genome for hairpin structures using the Vienna package
- Annotated all hairpins for conserved, repetitive and protein-coding regions
- Scored hairpins by thermodynamic stability and structural features, using a method (PalGrade) that detects a large percentage of known microRNAs while selecting a relatively small portion of all genome hairpins
- Determined the expression of computationally predicted microRNAs by a high-throughput microRNA microarray in several tissues, according to the microarray strategy outlined in Barad et al.
- Validated the sequence of predicted microRNAs that gave high signals on the microarray by using a specific biotinylated capture oligonucleotide, designed for the predicted microRNA to be cloned, to ‘fish out’ the complementary sequences from the microRNA-enriched libraries, which are then amplified, cloned and sequenced
In some cases, the cloning and sequencing method resulted in sequencing of similar microRNAs that were slightly different in sequence from the microRNA originally sought. We also carried out sequence validation on 69 bioinformatically predicted microRNAs, which were not present on the microarray but are located adjacent to microRNAs that were successfully sequenced, resulting in more new sequences called adjacent microRNAs.

16 Bentwich et al. (2005)
- Scanning the entire human genome identified 11 million hairpins, including 86% of known microRNA precursors
- After microarray sampling, the 359 expressed microRNAs were subjected to confirmation by sequencing
- Successfully cloned and sequenced 89 human microRNA genes that do not appear in the microRNA registry
- Using UCSC BlastZ alignment and ClustalW, found that fifty-three of these are located in two large non-conserved clusters
  - One is on chromosome 19, is only expressed in the placenta, and was the largest microRNA cluster ever reported. This cluster comprises 43 new predicted microRNAs, which all show similarity to a neighboring miRNA family specifically expressed in human embryonic stem cells
  - The other cluster is on the X chromosome, and its miRNAs are only expressed in the testis
- Homology analysis showed that both clusters are conserved only in chimpanzees and possibly rhesus monkeys
- The fact that the primate-specific clusters described here are specifically expressed in developmental tissues supports the notion that microRNAs may have a key role in the evolutionary process and in the evolved complexity of higher mammals

17 miRAlign: Wang et al. (2005)
- A novel genome-wide computational approach to detect miRNAs in animals based on both sequence and structure alignment
- Uses RNAfold to test secondary structures, then CLUSTALW to perform pairwise alignment, unique algorithms to confirm the miRNA’s position on the stem-loop, and finally RNAforester to conduct pairwise structure alignment
First, all the known pre-miRNAs in the training set are used as queries in a BLAST search against the genome with a sensitive BLAST parameter setting (word length 7 and E-value cutoff 10). Next, sequence segments of the potential regions are cut from the genome with 70 nt of flanking sequence at each end and scanned by a 100 nt sliding window with a step of 10 nt.
(1) Secondary structure prediction: The secondary structures of both strands of the candidate are predicted by RNAfold, and only the strands with MFE lower than −20 kcal/mol are kept for further analysis.
(2) Pairwise sequence alignment: The strands of the candidate sequences that pass Step 1 are aligned pairwise to all the ∼22 nt known miRNA sequences in the training set. A sequence similarity score (mature_seq_sim) between the candidate and each known mature miRNA is calculated by CLUSTALW.
(3) miRNA’s position on the stem–loop structure: Three properties for the ∼22 nt miRNA’s position on the stem–loop structure, derived from the miRNA reference set, are considered by miRAlign for each of the potential homologue pairs: (a) the ∼22 nt potential miRNA sequence should not be located on the terminal loop of the hairpin structure; (b) the potential miRNA should be located on the same arm of the stem–loop structure as its known homologs; and (c) the position of the potential miRNA sequence on the stem–loop structure should not differ too much from that of its known homologs.
(4) RNA secondary structure alignment: RNAforester is used to compute pairwise structure alignment. This conducts a local alignment using the RIBOSUM85-60 substitution matrix.
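
The window scan and the Step 1 energy filter can be sketched as follows. The window size (100 nt), step (10 nt) and threshold (−20 kcal/mol) come from the description above; everything else is an assumption made for illustration. In particular, `mfe_of` is a hypothetical callable standing in for an RNAfold call, and the sketch scans one strand only.

```python
from typing import Callable, Iterator


def sliding_windows(
    seq: str, size: int = 100, step: int = 10
) -> Iterator[tuple[int, str]]:
    """Yield (start, window) pairs; a sequence shorter than size is yielded whole."""
    if len(seq) <= size:
        yield 0, seq
        return
    # Note: a tail shorter than one step past the last full window is not covered.
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


def stable_windows(
    seq: str,
    mfe_of: Callable[[str], float],   # hypothetical wrapper around a folding tool
    threshold: float = -20.0,
) -> list[tuple[int, str]]:
    """Keep windows whose minimum free energy is below the threshold (kcal/mol)."""
    return [(start, w) for start, w in sliding_windows(seq) if mfe_of(w) < threshold]
```

Passing the folding function in as a parameter keeps the sketch runnable without any external tool installed.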

18 miRAlign: Wang et al. (2005)
- miRAlign outperforms BLAST search in both sensitivity and selectivity, and furthermore, nearly all the known miRNAs found by BLAST can also be detected by miRAlign
- The average number of false positives is 7.1 for BLAST and 0.9 for miRAlign
- The algorithm depends on pre-existing data to search against, so it is only useful for finding miRNAs that are closely related to previously annotated ones

19 miRDeep: Friedländer et al. (2008)
- Suite of Perl scripts
- Uses a probabilistic model of miRNA biogenesis to score compatibility of the position and frequency of sequenced RNA with the secondary structure of the miRNA precursor
- Used NCBI megablast to align deep sequencing reads to the genome, and discarded anything that wasn’t a perfect match
- Then discarded reads that aligned to more than five positions in the genome
The vast majority of known mature miRNA reads align to five positions or fewer (unpublished results), and by discarding reads that align ubiquitously, vast numbers of alignments can be disregarded. Further, C. elegans and human reads that overlapped with positions (on either strand) annotated by the UCSC database as rRNA, scRNA, snRNA, snoRNA or tRNA were discarded, as were reads that had perfect alignments to these types of noncoding RNA in the Rfam database.
Potential precursors that did not fold into a hairpin, or that had reads aligning to them in a way that was inconsistent with Dicer processing, were discarded. This was done by a combinatorial investigation of structure and signature:
- First, the position of the potential mature miRNA sequence was defined as the position of the most abundant read sequence aligning to the potential precursor sequence.
- Second, the potential star sequence was defined as the sequence base pairing to the potential mature sequence, correcting for the 2-nt 3′ overhangs.
- Third, the loop was defined as the sequence between the potential mature and star sequence.
- Fourth, the potential mature-loop-star structure should form an unbifurcated hairpin, with a minimum of 14 base pairings between the mature and the star sequence.
- Fifth, for each read it was tested whether it aligned to the potential precursor in a manner consistent with the signature expected from Dicer processing.
Any remaining precursor candidates were then scored probabilistically, using the model described on the next slide.
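
Two of the filters above are easy to sketch. The first keeps reads with perfect matches at no more than five genomic positions; the second checks that a dot-bracket structure is a single, unbifurcated stem-loop. The input formats and function names are assumptions for illustration and do not reflect miRDeep's actual file formats. The pair count here is taken over the whole hairpin, which is a simplification of the mature/star requirement.

```python
from collections import Counter
from typing import Iterable


def keep_reads(alignments: Iterable[tuple[str, int]], max_loci: int = 5) -> set[str]:
    """alignments holds one (read_id, mismatches) tuple per genomic hit.

    A read is kept if it has at least one perfect hit and at most
    max_loci perfect hits in total.
    """
    perfect_hits = Counter(
        read_id for read_id, mismatches in alignments if mismatches == 0
    )
    return {read_id for read_id, n in perfect_hits.items() if n <= max_loci}


def is_unbifurcated_hairpin(dot_bracket: str, min_pairs: int = 14) -> bool:
    """True if every '(' precedes every ')' and there are enough base pairs."""
    last_open = dot_bracket.rfind("(")
    first_close = dot_bracket.find(")")
    if last_open == -1 or first_close == -1 or last_open > first_close:
        return False  # no stem at all, or more than one stem-loop
    n_open = dot_bracket.count("(")
    # Simplification: counts all pairs in the hairpin, not only mature/star pairs.
    return n_open == dot_bracket.count(")") and n_open >= min_pairs
```

Both checks are cheap, which is why it makes sense to apply them before any probabilistic scoring.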

20 Algorithm for P(sequence is a precursor)
score = log( P(pre | data) / P(bgr | data) )

The probability of the sequence being a precursor is given by Bayes’ theorem:
P(pre | data) = P(data | pre) P(pre) / P(data)
P(pre | data) = P(abs | pre) P(rel | pre) P(sig | pre) P(star | pre) P(nuc | pre) P(pre) / P(data)

The same holds for the probability of the sequence being a background hairpin:
P(bgr | data) = P(data | bgr) P(bgr) / P(data)
P(bgr | data) = P(abs | bgr) P(rel | bgr) P(sig | bgr) P(star | bgr) P(nuc | bgr) P(bgr) / P(data)

- P(pre) is the prior probability that a potential precursor is actually a miRNA precursor. P(bgr) is the prior probability that a potential precursor is a non-miRNA background hairpin, and is equal to 1 − P(pre).
- abs is the estimated minimum free energy of the potential precursor. P(abs|pre) is the probability that a real miRNA precursor would have the value abs. P(abs|bgr) is the probability that a non-miRNA background hairpin would have the value abs.
- rel is equal to 1 if the potential precursor sequence is energetically stable, 0 otherwise. P(rel|pre) is the probability that a real miRNA precursor has the value rel. P(rel|bgr) is the probability that a background precursor has the value rel.
- sig is the number of reads in the deep-sequencing sample that align to the potential precursor sequence in a manner consistent with Dicer processing (see above). P(sig|pre) is the probability that a real miRNA precursor has the value sig in the deep-sequencing sample. P(sig|bgr) is the probability that a background hairpin has the value sig in the deep-sequencing sample.
- star is equal to 0 if the potential precursor sequence has no reads that represent a putative star sequence, and 1 otherwise. P(star|pre) is the probability that a real miRNA precursor has the value star in the deep-sequencing sample. P(star|bgr) is the probability that a background hairpin has the value star in the deep-sequencing sample.
- nuc is an (optional) binary variable. It is 0 if nt 2–8 from the 5′ end of the putative mature miRNA are not conserved in any other metazoan, and 1 otherwise. P(nuc|pre) is the probability that a real miRNA precursor has the value nuc. P(nuc|bgr) is the probability that a background hairpin has the value nuc.

In the above, we are assuming independence between abs, rel, sig, star and nuc.
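
Because P(data) appears in both posteriors, it cancels in the ratio, and under the independence assumption the score becomes a sum of per-feature log-likelihood ratios plus the log prior odds. The sketch below shows that arithmetic. The dictionary inputs, the function name and the choice of natural logarithm are assumptions made for illustration; how each likelihood is estimated is not covered here.

```python
import math

FEATURES = ("abs", "rel", "sig", "star", "nuc")


def log_odds_score(
    p_given_pre: dict[str, float],
    p_given_bgr: dict[str, float],
    prior_pre: float,
) -> float:
    """Return log( P(pre | data) / P(bgr | data) ) assuming independent features.

    p_given_pre[f] is P(f | pre) and p_given_bgr[f] is P(f | bgr) for the
    observed value of feature f. Features missing from p_given_pre are
    skipped, which covers the optional nuc term.
    """
    if not 0.0 < prior_pre < 1.0:
        raise ValueError("prior_pre must lie strictly between 0 and 1")
    # Log prior odds: P(pre) / P(bgr), with P(bgr) = 1 - P(pre).
    score = math.log(prior_pre) - math.log(1.0 - prior_pre)
    for name in FEATURES:
        if name in p_given_pre:
            score += math.log(p_given_pre[name]) - math.log(p_given_bgr[name])
    return score
```

A positive score means the evidence favors a real precursor over a background hairpin; a negative score means the opposite.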

21 miRDeep: Friedländer et al. (2008)
Of the 555 known human mature miRNA sequences, 213 were present in the data set. Of these, 154 (72%) were successfully recovered by miRDeep. The total estimated number of false positives was 6 ± 2 This pipeline is much more efficient at finding microRNA expression from deep-sequencing than the previous methods
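
The recovery figure is simple arithmetic, shown here as a quick check using the counts quoted on the slide. The variable names are chosen for this example only.

```python
known_in_dataset = 213   # known mature miRNAs present in the data set
recovered = 154          # of those, recovered by miRDeep

sensitivity = recovered / known_in_dataset
print(f"recovered {sensitivity:.1%} of the known miRNAs present in the data")
```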

22 miRDeep2: Friedländer et al. (2011)
Analyzing data from seven animal species representing the major animal clades, miRDeep2 identified miRNAs with an accuracy of 98.6–99.9% and reported hundreds of novel miRNAs. The new package includes many more options and graphical outputs that make the software more accessible.

23 miRDeep2: Friedländer et al. (2011)

24 miRDeep2: Friedländer et al. (2011)

25 miRDeep2: Friedländer et al. (2011)

26 miRDeep2: Friedländer et al. (2011)
Relative to miRDeep1:
- Performs excision by scanning the genome for stacks of reads, where a stack is one or more reads that map to the exact same 5′ and 3′ positions in the genome
- When identifying miRNAs in data from sea squirts, known to harbor large numbers of non-canonical miRNAs, the first version of miRDeep only reports 46 known and 31 novel miRNAs. In contrast, miRDeep2 reports 313 known and 127 novel ones
- Can detect anti-sense miRNAs (+/-)
- Supports single or multiple mismatches
- Performs substantially better on the human data, reporting 186 known and 36 novel miRNAs (compared to 154 known and 10 novel in the initial publication)
- More accurate detection of lowly abundant miRNAs
- Faster; analyzed 30 million RNAs in less than 5 h and with 3 GB memory
- More intuitive interface for biologists
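
The notion of a read stack can be sketched as a simple grouping of mapped reads by their exact coordinates. The tuple layout and the inclusion of chromosome and strand in the key are assumptions for illustration; the slide only specifies identical 5′ and 3′ positions.

```python
from collections import defaultdict
from typing import Iterable

# Assumed layout of one mapped read: (read_id, chrom, strand, start, end)
Read = tuple[str, str, str, int, int]


def read_stacks(reads: Iterable[Read]) -> dict[tuple[str, str, int, int], list[str]]:
    """Group read ids by identical chromosome, strand, start and end."""
    stacks: dict[tuple[str, str, int, int], list[str]] = defaultdict(list)
    for read_id, chrom, strand, start, end in reads:
        stacks[(chrom, strand, start, end)].append(read_id)
    return dict(stacks)
```

A read that is shifted by even one nucleotide at either end falls into a different stack under this definition.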

27 Beyond miRDeep2
Remaining challenges in identifying miRNAs and detecting their expression levels:
- miRBase, the primary database used as a source for miRNA annotations today, is far from pristine
- It is hard to tell whether detected novel miRNAs actually have a biological function; establishing that will take a lot of biological experimentation
- Algorithms still have room for improvement in terms of accessibility and efficiency
miRDeep2’s algorithms make the assumption that all miRNAs in miRBase represent genuine miRNA loci, while all loci that are not in miRBase are non-miRNA loci. While this assumption almost certainly does not hold in all cases, the miRBase public database arguably sets the best standard for which loci represent miRNA genes and which do not. Thus falsely annotated miRNAs in miRBase which are not reported by miRDeep2 will cause the accuracy of the algorithm to be underestimated, as will genuine novel miRDeep2 miRNAs which are not in miRBase. Conversely, the presence of genuine miRNAs which are not detected by miRDeep2 and are not in miRBase will cause an overestimation of the accuracy.

28 Questions?

29 References
- Barad, O., Meiri, E., Avniel, A., Aharonov, R., Barzilai, A., Bentwich, I., Einav, U., et al. (2004). MicroRNA expression detected by oligonucleotide microarrays: System establishment and expression profiling in human tissues. Genome Research, doi: /gr
- Bentwich, I., Avniel, A., Karov, Y., Aharonov, R., Gilad, S., Barad, O., Barzilai, A., et al. (2005). Identification of hundreds of conserved and nonconserved human microRNAs. Nature Genetics, 37(7), doi: /ng1590
- Friedländer, M. R., Chen, W., Adamidi, C., Maaskola, J., Einspanier, R., Knespel, S., & Rajewsky, N. (2008). Discovering microRNAs from deep sequencing data using miRDeep. Nature Biotechnology, 26(4), doi: /nbt1394
- Friedländer, M. R., Mackowiak, S. D., Li, N., Chen, W., & Rajewsky, N. (2011). miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades. Nucleic Acids Research, doi: /nar/gkr688
- Krüger, J., & Rehmsmeier, M. (2006). RNAhybrid: microRNA target prediction easy, fast and flexible. Nucleic Acids Research, 34(Web Server issue), W doi: /nar/gkl243
- Miska, E. A., Alvarez-Saavedra, E., Townsend, M., Yoshii, A., Sestan, N., Rakic, P., Constantine-Paton, M., et al. (2004). Microarray analysis of microRNA expression in the developing mammalian brain. Genome Biology, 5(9), R68. doi: /gb r68
- Wang, X., Zhang, J., Li, F., Gu, J., He, T., Zhang, X., & Li, Y. (2005). MicroRNA identification based on sequence and structure alignment. Bioinformatics (Oxford, England), 21(18), doi: /bioinformatics/bti562

