Searching genomes for noncoding RNA CS374 Leticia Britos 10/03/06.

1

2 Searching genomes for noncoding RNA CS374 Leticia Britos 10/03/06

3 DNA to RNA, and genes. DNA: ~3x10^9 bases long in humans; contains ~22,000 genes. RNA: carries the “message” for “translating”, or “expressing”, one gene. Diagram: transcription, translation, folding (steps numbered 1 to 3, with the label “easy”).

4 “Structural genes encode proteins and regulatory genes produce non-coding RNA” F. Jacob and J. Monod (1961)

5 Where are the genes? Gene Finding

6 Where are the genes? Gene Finding. Sequence signals in the diagram: atg, tga, ggtgag, cagatg, cagttg, caggcc, ggtgag. In humans: ~22,000 genes, ~1.5% of human DNA.

7 An expanding universe of noncoding RNA: rRNA (structure/function of ribosomes); tRNA (translation); snRNA (RNA splicing, telomere maintenance); snoRNA (chemical modification of rRNA); miRNA (translational regulation); gRNA (mRNA editing); tmRNA (degradation of defective proteins); riboswitches (translational and transcriptional regulation); ribozymes (autocatalytic RNA); RNAi (gene regulation by dsRNA)

8 Exciting times for the RNA world (and for Stanford)

9 How to find ncRNAs? Sequence signals in the diagram: atg, tga, ggtgag, caggtg, cagatg, cagttg, caggcc, ggtgag.

10

11 Riboswitches. Gene diagram (5’ to 3’): promoter, 5’ UTR, exons, introns, 3’ UTR; coding vs. noncoding regions.

12 Sequence conservation is not enough

13 Secondary structure is not enough

14 Noncoding RNA signals in the genome are not as strong as the signals for protein-coding genes. Look for structure in evolutionarily conserved sequences.

15 Identify new instances of a given ncRNA family in a genome

16 Existing algorithms: CMSearch, RSEARCH, ERPIN

17 Example: finding 5S RNAs in a 1.6 Mb genome. RSEARCH: 6.5 h; FastR: 103 s.

18 FastR

19 What is a database filter? A computational procedure that takes a DB as input and outputs a subset of it. Requirements: the object being searched for remains in the DB after filtering (sensitivity); the filtered DB is significantly smaller; the filtering operation is fast (efficiency).

20 Problem: given an RNA sequence with known structure (the query), find homologous sequences in an RNA DB. DB: AGAGCGUAUCGAUUUAGAGAGCUAUAGCUAGAGAGGAGA; UUAUAGCGCGCAUAUAGGACAAACAGUCUCUAUGGGGAC; AUUCCGGGAACAUAGUAUAGGCGACGGAUUAGCUAGCCAA; AUCGCGCUAUAGCUAGCGAGGACAGCUAUAGCUAGCGAG; AUAUCGGGCUGUGGACACUAUACGAUCGAAUCUAGCUAU; AUCGCGCUAUAGCUAGCGAGGACAGCUAUAGCUAGCGAG; AUAUCGGGCUGUGGACACUAUACGAUCGAAUCUAGCUAU. QUERY: AUCGCGCUAUAGCUAGCGAGGACAGCUAUAGCUAGCGAG

21 Solution. Stage 1 (filter): filter the DB. Stage 2 (align): align the selected sequences in the DB with the query and determine the best alignments.

22 Filtering: sequence alone is not sufficient; structure alone is not sufficient.

23 Filter using both sequence and structural features

24 Structural features: (k,w)-stacks. Example: in AUUCCGGGAACAUAGUAUAGGCGACGGAUUAGCUAGCCAA, substring a at positions 3-6 (UCCG) and substring a’ at positions 25-28 (CGGA, the reverse complement of a).

25 Definition of a (k,w)-stack: a pair of substrings a, a’, each of length at least k, that are at most w bases apart. Example: in AUUCCGGGAACAUAGUAUAGGCGACGGAUUAGCUAGCCAA, a and a’ each have length 4 and are d = 18 bases apart. Is a,a’ a (4,18)-stack? Yes. A (4,20)-stack? Yes. A (4,9)-stack? No (d > 9). A (3,20)-stack? Yes.
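The definition on this slide can be checked mechanically. The following is a minimal sketch, not code from the talk or from FastR: it assumes that d counts the bases strictly between the two substrings, that a’ must be the exact reverse complement of a (Watson-Crick pairs only, no G-U wobble), and that positions are 1-based and inclusive. The function names are made up for illustration.

```python
def reverse_complement(rna):
    """Reverse complement of an RNA string (Watson-Crick pairs only)."""
    comp = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(comp[base] for base in reversed(rna))


def is_kw_stack(seq, a_start, a_end, b_start, b_end, k, w):
    """True if seq[a_start..a_end] and seq[b_start..b_end] (1-based,
    inclusive) form a (k,w)-stack under the assumptions stated above."""
    a = seq[a_start - 1:a_end]
    b = seq[b_start - 1:b_end]
    d = b_start - a_end - 1  # bases strictly between the two substrings
    return (
        len(a) >= k
        and len(a) == len(b)
        and b == reverse_complement(a)
        and 0 <= d <= w
    )


SEQ = "AUUCCGGGAACAUAGUAUAGGCGACGGAUUAGCUAGCCAA"

# The four questions from the slide, for a = 3-6 and a' = 25-28.
answers = {
    (k, w): is_kw_stack(SEQ, 3, 6, 25, 28, k, w)
    for (k, w) in [(4, 18), (4, 20), (4, 9), (3, 20)]
}
```

With these assumptions, only the (4,9) case fails, because the 18 bases between a and a’ exceed w = 9.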

26 Use of (k,w)-stacks as filters in the search for ncRNAs: if we use a (7,70)-stack filter, we eliminate 90% of the DB from consideration.

27 Structural features: nested (k,w,l)-stacks. Example: AUUCCGGGAACAUAGUAUAGGCGACGGAUUAGCUAGCCAA (position labels on the slide: 3, 6, 25, 28; 12, 16, 14, 18)

28 Structural features: parallel (k,w,l)-stacks. Example: AUUCCGGGAACAUAGUAUAGGCGACGGAUUAGCUAGCCAA (position labels on the slide: 3, 6, 25, 28; 12, 16, 14, 18; 32, 35, 34, 36)

29 multiloop Structural features: multiloop (k,w,l)-stacks

30 Filtering criteria: nested stacks, parallel stacks, multiloop stacks

31 Filtering algorithm, step 1: build a hash table of k-mers in the DB (k = 4). Sequence: AUUCCGGGAACAUAUUCUAGGCGACGGAUUAGAAUGCCAA. Table (k-mer: index): AUUC: 1

32 Filtering algorithm, step 1: build a hash table of k-mers in the DB (k = 4). Sequence: AUUCCGGGAACAUAUUCUAGGCGACGGAUUAGAAUGCCAA. Table (k-mer: index): AUUC: 1; UUCC: 2

33 Filtering algorithm, step 1: build a hash table of k-mers in the DB (k = 4). Sequence: AUUCCGGGAACAUAUUCUAGGCGACGGAUUAGAAUGCCAA. Table (k-mer: index): AUUC: 1; UUCC: 2; UCCG: 3

34 Filtering algorithm, step 1: build a hash table of k-mers in the DB (k = 4). Sequence: AUUCCGGGAACAUAUUCUAGGCGACGGAUUAGAAUGCCAA. Table (k-mer: index): AUUC: 1; UUCC: 2; UCCG: 3; CCGG: 4

35 Filtering algorithm, step 1: build a hash table of k-mers in the DB (k = 4). Sequence: AUUCCGGGAACAUAUUCUAGGCGACGGAUUAGAAUGCCAA. Table (k-mer: index): AUUC: 1, 14; UUCC: 2; UCCG: 3; CCGG: 4

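36 Filtering algorithm, step 2: identify (k,w)-stacks. Sequence: AUUCCGGGAACAUAUUCUAGGCGACGGAUUAGAAUGCCAA. Table (k-mer: index): AUUC: 1, 14; UUCC: 2; UCCG: 3; CCGG: 4. For a k-mer such as AUUC, look up its reverse complement (GAAU), found at index n.

37 Filtering algorithm, step 2: identify (k,w)-stacks. Sequence: AUUCCGGGAACAUAUUCUAGGCGACGGAUUAGAAUGCCAA. Table (k-mer: index): AUUC: 1, 14; UUCC: 2; UCCG: 3; CCGG: 4. The reverse complement GAAU is found at index n: is the distance d ≤ w?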
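Steps 1 and 2 of the filtering algorithm can be sketched in a few lines. This is an illustrative sketch, not the FastR implementation: the function names are invented, indices are 1-based to match the slides, d is assumed to count the bases between the end of the k-mer and the start of its reverse complement, and only exact Watson-Crick reverse complements are considered.

```python
from collections import defaultdict


def reverse_complement(rna):
    """Reverse complement of an RNA string (Watson-Crick pairs only)."""
    comp = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(comp[base] for base in reversed(rna))


def build_kmer_index(seq, k):
    """Step 1: hash table mapping each k-mer to its 1-based start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i + 1)
    return index


def find_kw_stacks(seq, k, w):
    """Step 2: report (i, n) where the k-mer starting at i has its reverse
    complement starting at n, with at most w bases in between."""
    index = build_kmer_index(seq, k)
    stacks = []
    for kmer, starts in index.items():
        partner_starts = index.get(reverse_complement(kmer), [])
        for i in starts:
            for n in partner_starts:
                d = n - i - k  # bases between position i+k-1 and position n
                if 0 <= d <= w:
                    stacks.append((i, n))
    return sorted(stacks)


SEQ = "AUUCCGGGAACAUAUUCUAGGCGACGGAUUAGAAUGCCAA"
index = build_kmer_index(SEQ, 4)
stacks_w20 = find_kw_stacks(SEQ, 4, 20)
```

On the slide's sequence, `index["AUUC"]` holds positions 1 and 14 and `index["GAAU"]` holds position 32, so with w = 20 the pair (14, 32) passes the d ≤ w test while (1, 32) does not.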

38 Filtering algorithm, step 3: compute complex stacks (nested, parallel, multiloop) using DP.

39 Result of stage 1 (filtering): AGAGCGUAUCGAUUUAGAGAGCUAUAGCUAGAGAGGAGA; UUAUAGCGCGCAUAUAGGACAAACAGUCUCUAUGGGGAC; AUUCCGGGAACAUAGUAUAGGCGACGGAUUAGCUAGCCA; AUCGCGCUAUAGCUAGCGAGGACAGCUAUAGCUAGCGAG; AUAUCGGGCUGUGGACACUAUACGAUCGAAUCUAGCUAU; AUCGCGCUAUAGCUAGCGAGGACAGCUAUAGCUAGCGAG; AUAUCGGGCUGUGGACACUAUACGAUCGAAUCUAGCUAU

40 Solution. Stage 1 (filter): filter the DB. Stage 2 (align): align the selected sequences in the DB with the query and determine the best alignments.

41 Possible ways to align RNAs: 1. sequence to sequence; 2. structure to structure; 3. sequence to structure

42 RNA sequence-structure alignment. Query s[1..m]: AGAGCGUAUCGAUUUAGAGAGCUAUAGCUAGAGAGGAGA, with S (set of base pairings). DB (filtered) t[1..n]: UUAUAGCGCGCAUAUAGGACAAACAGUCUCUAUGGGGAC

43 The secondary structure of the query is represented by a binarized tree. Rule 1: when i and j are paired, node i-j has a single child, (i+1)-(j-1).

44 The secondary structure of the query is represented by a binarized tree. Rule 2: when j is unpaired, node i-j has a single child, i-(j-1).

45 The secondary structure of the query is represented by a binarized tree. Rule 3: when j is paired but not to the leftmost base (j pairs with some k > i), node i-j has two children, i-(k-1) and k-j.

46 The secondary structure of the query is represented by a binarized tree (diagram labels: i, j)

61 Final binary tree

62 The secondary structure of the query is represented by a binarized tree
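The three rules for building the binarized tree translate directly into a recursive procedure. The sketch below is an illustration under assumptions, not the talk's code: the structure is given in dot-bracket notation (the slides only speak of a set of base pairings S), it contains no pseudoknots, positions are 1-based, and the node kinds "pair", "unpaired", and "branch" are names chosen here for Rules 1, 2, and 3.

```python
from dataclasses import dataclass


def parse_dot_bracket(structure):
    """Map each paired position (1-based) to its partner."""
    pairs, stack = {}, []
    for pos, ch in enumerate(structure, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            left = stack.pop()
            pairs[left] = pos
            pairs[pos] = left
    return pairs


@dataclass
class Node:
    i: int
    j: int
    kind: str        # "pair" (Rule 1), "unpaired" (Rule 2), "branch" (Rule 3)
    children: tuple  # child nodes; None marks an empty interval


def build_tree(i, j, pairs):
    """Binarized tree for query interval i..j (inclusive)."""
    if i > j:
        return None
    k = pairs.get(j)
    if k is None:  # Rule 2: j is unpaired
        return Node(i, j, "unpaired", (build_tree(i, j - 1, pairs),))
    if k == i:     # Rule 1: i and j are paired
        return Node(i, j, "pair", (build_tree(i + 1, j - 1, pairs),))
    # Rule 3: j pairs with k > i, so split into i..k-1 and k..j
    return Node(i, j, "branch",
                (build_tree(i, k - 1, pairs), build_tree(k, j, pairs)))


structure = "(..)(..)"  # two adjacent hairpins, made up for illustration
root = build_tree(1, len(structure), parse_dot_bracket(structure))
```

For the made-up structure above, the root is a "branch" node over 1..8 whose children are the "pair" nodes 1..4 and 5..8.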

63 Alignment algorithm

64 Three node types: black node; white node / one child; white node / two children. For each, compute the optimal alignment between the query (with structure v) and substring (i-j) of the DB.

65 Optimal alignment: A[i,j,v], for DB substring i-j and query structure node v

66 Alignment algorithm: three cases (black node; white node / one child; white node / two children)

67 Alignment algorithm (black node; white node / one child; white node / two children). Diagram labels: i-j, i+1, j-1; alignment score for pairing and structure.

68 Alignment algorithm (black node; white node / one child; white node / two children). Diagram labels: i-j, j-1; alignment score for pairing.

69 Alignment algorithm (black node; white node / one child; white node / two children). Diagram labels: i, k-1, k, j; sliding k.
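The table A[i,j,v] can be filled with a recursion that has one case per node type. What follows is a deliberately simplified sketch and not the recurrence or scoring used in FastR: the score constants are arbitrary, a matched pair is scored only on whether the two DB bases can pair, gaps cost a flat penalty, the query structure is assumed to be pseudoknot-free dot-bracket, and a tree node v is represented by its query interval (qi, qj) rather than by an explicit tree.

```python
from functools import lru_cache

# Arbitrary illustrative scores (assumptions, not values from the talk).
GAP, MATCH, MISMATCH, PAIR, BAD_PAIR = -1, 1, -1, 2, -1
CAN_PAIR = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"),
            ("G", "U"), ("U", "G")}


def align_score(query, structure, target):
    """Best score for aligning the structured query against all of target."""
    pairs, stack = {}, []
    for pos, ch in enumerate(structure):  # 0-based positions
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            left = stack.pop()
            pairs[left], pairs[pos] = pos, left

    @lru_cache(maxsize=None)
    def A(i, j, qi, qj):
        """Query interval qi..qj (inclusive) vs. target[i:j] (half-open)."""
        if qi > qj:                       # empty query: insert all DB bases
            return GAP * (j - i)
        k = pairs.get(qj)
        if k is None:                     # one child: qj is unpaired
            best = A(i, j, qi, qj - 1) + GAP              # delete query base
            if j > i:
                sub = MATCH if query[qj] == target[j - 1] else MISMATCH
                best = max(best,
                           A(i, j - 1, qi, qj - 1) + sub,  # align qj to DB base
                           A(i, j - 1, qi, qj) + GAP)      # insert DB base
            return best
        if k == qi:                       # one child: qi and qj are paired
            best = A(i, j, qi + 1, qj - 1) + 2 * GAP       # delete the pair
            if j > i:
                best = max(best,
                           A(i + 1, j, qi, qj) + GAP,      # insert DB base left
                           A(i, j - 1, qi, qj) + GAP)      # insert DB base right
            if j - i >= 2:
                ok = (target[i], target[j - 1]) in CAN_PAIR
                best = max(best, A(i + 1, j - 1, qi + 1, qj - 1)
                           + (PAIR if ok else BAD_PAIR))
            return best
        # two children: slide the split point m over the DB substring
        return max(A(i, m, qi, k - 1) + A(m, j, k, qj)
                   for m in range(i, j + 1))

    return A(0, len(target), 0, len(query) - 1)


exact = align_score("GGGAAACCC", "(((...)))", "GGGAAACCC")
one_insert = align_score("GGGAAACCC", "(((...)))", "GGGAAAACCC")
```

In this toy scoring, an identical hairpin scores 3 pairs x 2 plus 3 matches x 1, and one extra DB base costs a single gap. A real search would also scan over DB substrings and use tuned substitution and gap scores.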

70 Validation: known instances of ncRNAs (tRNA, 5S rRNA, ribozymes, riboswitches) are inserted in a random sequence (1 Mb). Filtering and alignment algorithms are applied (with different values of k, l, w, %GC).

71 Validation: filtering

72 Validation: filtering and alignment

73 Results: riboswitches. Gene diagram (5’ to 3’): promoter, 5’ UTR, exons, introns, 3’ UTR; coding vs. non-coding regions.

74 Riboswitches

75

76 Riboswitch families

77 Results: new riboswitches

78

79 Real motivation of bioinformaticists: “We design novel filters and show that they dominate the HMM filters of Weinberg and Ruzzo…” (world domination)

80
