Genomic Repeat Visualisation Using Suffix Arrays Nava Whiteford Department of Chemistry University of Southampton


1 Genomic Repeat Visualisation Using Suffix Arrays Nava Whiteford Department of Chemistry University of Southampton new@soton.ac.uk

2 Repeat Visualisation Using Suffix Arrays: The Analysis; Artificial Sequences; Genomic Sequences; The Algorithm; Larger Sequences; Non-genomic Sequences

3 The repeatscore plot A sliding window is run over the entire sequence to divide it into all substrings of a given length (in this case 2). ATGCATATA: AT TG GC CA AT TA AT TA
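The sliding-window step can be sketched in a few lines of Python. This is only an illustration of the idea on the slide; the function name `sliding_windows` is an assumption, not something taken from the presentation.

```python
def sliding_windows(sequence, k):
    """Return every length-k substring of `sequence`, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# The example sequence and window length used on the slide.
print(sliding_windows("ATGCATATA", 2))
```

A sequence of length n yields n - k + 1 windows, so the 9-base example gives 8 substrings of length 2.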

14 The repeatscore plot A sliding window is run over the entire sequence to divide it into all substrings of a given length (in this case 2). ATGCATATA: AT TG GC CA AT TA AT TA. AT occurs 3 times; TG occurs 1 time; GC occurs 1 time; CA occurs 1 time; TA occurs 2 times.

15 The repeatscore plot A sliding window is run over the entire sequence to divide it into all substrings of a given length (in this case 2). ATGCATATA: AT TG GC CA AT TA AT TA. AT occurs 3 times; TG occurs 1 time; GC occurs 1 time; CA occurs 1 time; TA occurs 2 times.

No. occurrences (r)   No. sequences that occur r times
1                     3
2                     1
3                     1
4                     0
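The two counting steps on this slide (occurrences per substring, then the number of substrings sharing each occurrence count) can be sketched as follows. The function name `repeat_histogram` is hypothetical, and the naive dictionary-based counting is an assumption chosen for clarity; it is not the suffix-array method introduced later in the talk.

```python
from collections import Counter


def repeat_histogram(sequence, k):
    """Count each length-k substring, then count how many distinct
    substrings share each occurrence count r."""
    occurrences = Counter(
        sequence[i:i + k] for i in range(len(sequence) - k + 1)
    )
    histogram = Counter(occurrences.values())
    return occurrences, histogram


occurrences, histogram = repeat_histogram("ATGCATATA", 2)
for r in range(1, 5):
    # Counter returns 0 for occurrence counts that never appear.
    print(r, histogram[r])
```

For the slide's example, three substrings occur once (TG, GC, CA), one occurs twice (TA) and one occurs three times (AT), which is the column of the table for substring length 2.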

16 The repeat-score plot

Number of        Sub-string length
occurrences      1    2    3    4    5
1                2    3    5    6    5
2                0    1    1    0    0
3                1    1    0    0    0
4                1    0    0    0    0
5                0    0    0    0    0
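Repeating the count for every substring length gives the matrix above. The sketch below rebuilds it for the example sequence ATGCATATA; the function name `repeatscore_matrix` and its parameters are assumptions made for illustration, and the counting is the naive approach rather than the suffix-array method.

```python
from collections import Counter


def repeatscore_matrix(sequence, max_length, max_occurrences):
    """matrix[r - 1][k - 1] = number of distinct length-k substrings
    that occur exactly r times in `sequence`."""
    matrix = [[0] * max_length for _ in range(max_occurrences)]
    for k in range(1, max_length + 1):
        counts = Counter(
            sequence[i:i + k] for i in range(len(sequence) - k + 1)
        )
        for r in counts.values():
            if r <= max_occurrences:  # ignore counts beyond the plotted range
                matrix[r - 1][k - 1] += 1
    return matrix


for row in repeatscore_matrix("ATGCATATA", max_length=5, max_occurrences=5):
    print(row)
```

Each row is one occurrence count and each column one substring length, so the nested list can be handed directly to an image-plotting routine.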

17 The repeat-score plot The resulting matrix is then plotted as an image:

18 Repeatscore plots of Artificial Sequences Small repeats Reverse strand is also included
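The slide does not say how the reverse strand is included. One plausible reading, sketched here as an assumption, is that substrings of the reverse complement are counted alongside those of the forward strand. The function names are hypothetical, and the two strands are counted separately so that no window spans the join between them.

```python
from collections import Counter

# Complement table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse strand read in its own 5'-to-3' direction."""
    return sequence.translate(COMPLEMENT)[::-1]


def both_strand_counts(sequence, k):
    """Count length-k substrings on the forward and reverse strands."""
    counts = Counter()
    for strand in (sequence, reverse_complement(sequence)):
        counts.update(strand[i:i + k] for i in range(len(strand) - k + 1))
    return counts


print(reverse_complement("ATGCATATA"))
print(both_strand_counts("ATGCATATA", 2))
```

Under this reading a substring that is its own reverse complement, such as AT, is counted once on each strand for every position at which it appears.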

19 Random Sequences

20 DNA Sequences “The language of life” Composed of four different bases A, T, G and C Sequences range in size from 2000bp to 670 billion bp.

21 Small Genomic Sequences Lambda Phage

22 Small Genomic Sequences Lambda Phage Random Sequence

23 E. coli

24

25 Sequences coding for rRNA; known inter-genic repeat elements

26 E. coli

27 Repeats in Genomic Sequences

28 A linear-time algorithm The plots shown would take hours to construct using traditional methods. The algorithms used would not scale linearly. It is not feasible to create these plots on large sequences unless more advanced algorithms are used.

29 The suffix array Original string: banana$ All suffixes: banana$ anana$ nana$ ana$ na$ a$

30 The suffix array Original string: banana$ All suffixes: banana$ anana$ nana$ ana$ na$ a$ In sorted order: a$ ana$ anana$ banana$ na$ nana$
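A suffix array stores the start positions of the suffixes in sorted order rather than the suffixes themselves. The sketch below builds one by sorting directly. This is an assumption made for brevity and is far slower than linear time, so it should not be read as the construction algorithm used for the plots in this talk.

```python
def suffix_array(text):
    """Start positions of all suffixes of `text`, sorted lexicographically.

    Naive construction: sorting by slices compares whole suffixes, which is
    fine for a toy string but does not scale to genomes.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "banana$"
for start in suffix_array(text):
    print(start, text[start:])
```

The output also contains the one-character suffix `$` (position 6), which sorts first because `$` precedes the letters in ASCII. The remaining entries follow the sorted order shown on the slide.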

31 Generating the repeatscore plot a$ ana$ anana$ banana$ na$ nana$
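The slide text does not spell out how the plot is derived from the sorted suffixes. The sketch below relies on a standard property of suffix arrays: all suffixes that share the same first k characters sit next to each other in sorted order, so the occurrence count of each length-k substring is the length of a run of equal prefixes. The function name is hypothetical, the suffix array is again built naively, and the sequence is used without the `$` terminator so that the terminator never appears inside a counted substring.

```python
from collections import Counter
from itertools import groupby


def histogram_from_suffix_array(text, k):
    """Number of distinct length-k substrings occurring r times, for each r,
    read off from runs of equal prefixes in the sorted suffixes."""
    order = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    # Suffixes shorter than k cannot contribute a length-k substring.
    prefixes = [text[i:i + k] for i in order if i + k <= len(text)]
    return Counter(len(list(run)) for _, run in groupby(prefixes))


print(histogram_from_suffix_array("banana", 2))
print(histogram_from_suffix_array("ATGCATATA", 2))
```

For "banana" the length-2 prefixes in sorted order are an, an, ba, na, na, so two substrings occur twice and one occurs once. One pass over the sorted suffixes is enough for each substring length.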

33 Whole human genome

34

35

36 Human Chromosome 18

37 Arabidopsis thaliana chromosome 1, coding region

38 Fibonacci derived sequences

39 Gallus gallus chromosome 20

40 Application to other sequences Analysing writing styles Finding plagiarised text Any sequence that may contain motif based, language like structure.

41 Shakespeare

42 Text document containing the text “The quick brown fox jumped over the lazy dog” 16 times.
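The same counting applies to ordinary text if each character is treated as a symbol. The sketch below rebuilds a test document like the one described. Joining the copies with a single space is an assumption, as the slide does not say how they were separated.

```python
from collections import Counter

phrase = "The quick brown fox jumped over the lazy dog"
document = " ".join([phrase] * 16)  # assumed separator: one space

k = len(phrase)
counts = Counter(document[i:i + k] for i in range(len(document) - k + 1))

# The full phrase is recovered as a substring that occurs 16 times.
print(counts[phrase])
```

Every window of this length occurs 15 or 16 times, so the artificial repeat shows up as a concentration of long substrings at a high occurrence count.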

43 “On the Economy of Machinery and Manufactures” by Charles Babbage with artificial repeat inserted 16 times.

44

45 Conclusion This new visualisation technique can highlight repeat structure in sequences. In genomic sequences this may be useful in generating annotation. There are applications in other areas worth pursuing. Our next step is to allow the repeatscore plot to be easily interrogated by a user in order to better understand the repeat structure.

