Presentation is loading. Please wait.

Presentation is loading. Please wait.

Finding repeat pattern in human genome by TEIRESIAS algorithm Xiaojun Hu.

Similar presentations


Presentation on theme: "Finding repeat pattern in human genome by TEIRESIAS algorithm Xiaojun Hu."— Presentation transcript:

1 Finding repeat pattern in human genome by TEIRESIAS algorithm Xiaojun Hu

2 Human Genome

3 TEIRESIAS Algorithm

4 Define Pattern Σ(Σ{ ‘. ’ } * )Σ Σ: the alphabet of residues, like ACTG in DNA, the set of all amino acids ‘. ’ : wild card character Pattern: Any string that begins and ends with a residue, and contains an arbitrary combination of residues and ‘. ’ characters

5 Define Pattern The pattern is defined by three parameters The pattern is defined by three parameters L: the minimal number of literals L: the minimal number of literals W: the pattern length W: the pattern length K: the minimal repeat times K: the minimal repeat times Example: Example: Pattern: A.CH..E Pattern: A.CH..E L = 4, W= 7 pattern pattern

6 Define Pattern P: pattern P: pattern ‘ A.CH..E ’ ‘ A.CH..E ’ S: a set of sequences S: a set of sequences S={LFAADCHFFEDTR, LKLALCHESESDR, AFAGCHADELFT} AFAGCHADELFT} Ls(P) = {(i,j) | sequence s i matches P at offset j} Ls(P) = {(i,j) | sequence s i matches P at offset j} Ls( ‘ A.CH..E ’ ) = {(1,4), (2,4), (3,3)}

7 Problem Definition Given a set S={s 1,s 2, …, s n } of input sequences and parameters L, W, K, find all maximal patterns that have support at least K repeat times. Given a set S={s 1,s 2, …, s n } of input sequences and parameters L, W, K, find all maximal patterns that have support at least K repeat times.

8 TEIRESIAS Algorithm The algorithm contains two steps: Scanning and convolution The algorithm contains two steps: Scanning and convolution 1. Scanning the sequences in the input set S and locating all elementary patterns with support at least K. 1. Scanning the sequences in the input set S and locating all elementary patterns with support at least K. 2. Combining together the elementary pattern to recover the maximal pattern 2. Combining together the elementary pattern to recover the maximal pattern

9 Scanning Process

10 Convolution Process

11 Prefix(P) is the prefix sub pattern of P that has exactly (L-1) residues Prefix(P) is the prefix sub pattern of P that has exactly (L-1) residuesExample: Prefix( ‘ F.ASTS ’ ) = ‘ F.A ’ if L = 3 Suffix(P) is the suffix subpattern of P with exactly (L-1) residues Suffix(P) is the suffix subpattern of P with exactly (L-1) residuesExample: Prefix( ‘ F.A … S ’ ) = ‘ A..S ’ if L = 3

12 Convolution Process P, Q are the arbitrary patterns with at least L residues each P, Q are the arbitrary patterns with at least L residues each R denotes a new pattern R denotes a new pattern Q ’ denotes what remains of Q after the prefix(Q) is thrown away Q ’ denotes what remains of Q after the prefix(Q) is thrown away Φ denotes the empty string Φ denotes the empty string

13 Convolution Process Forward extension Forward extension If Suffix(P) = Prefix(Q), P is extended to right with Q ’ (Pattern Q leftover except prefix) If Suffix(P) = Prefix(Q), P is extended to right with Q ’ (Pattern Q leftover except prefix) Example: ‘ F.AS ’ convolute ‘ AST ’ to form ‘ F.AST ’ Example: ‘ F.AS ’ convolute ‘ AST ’ to form ‘ F.AST ’ Backward extension Backward extension If Prefix(P) = Suffix(Q), P is extended to the left with Q ’ (Pattern Q leftover except suffix) If Prefix(P) = Suffix(Q), P is extended to the left with Q ’ (Pattern Q leftover except suffix) Example: ‘ F.AS ’ convolute ‘ TF.A ’ to form ‘ TF.AS ’ Example: ‘ F.AS ’ convolute ‘ TF.A ’ to form ‘ TF.AS ’

14 Application 1: Finding repeat pattern in ChrM Define the elementary pattern: Define the elementary pattern:L=W=5 Scan the ChrM sequence Scan the ChrM sequence Option 1. Generate all possible elementary patterns (4^5 = 1024), implement in R Option 2. Use Hash function to count elementary patterns (1017), implement in Perl

15 Application 1: Finding repeat pattern in ChrM

16 Scan the sequence Scan the sequence Save the pattern Offset (index) Save the pattern Offset (index) ChrM: G 1 A 2 T 3 CACAGGTCT … Pattern: GATCAOffset: 1 Pattern: ATCACOffset: 2 Pattern: TCACAOffset: 3 Pattern: TCACAOffset: 3

17 Application 1: Finding repeat pattern in ChrM Convolution Convolution Simple repeat pattern offset example: Repeat Pattern1: G 1 ATCACAG Offset: 1 Repeat Pattern1: G 1 ATCACAG Offset: 1 Repeat Pattern2: G 10 ATCACAG Offset: 10 Repeat Pattern2: G 10 ATCACAG Offset: 10 Repeat Pattern3: G 20 ATCACAG Offset: 20 Repeat Pattern3: G 20 ATCACAG Offset: 20 Elementary pattern and their offset: Elementary pattern and their offset: GATCA 1 10 20 GATCA 1 10 20 ATCAC 2 11 21 ATCAC 2 11 21 TCACA 3 12 22 TCACA 3 12 22 CACAG 4 13 23 CACAG 4 13 23 From the offset, if the pattern forward extend 1 base, the offset increase 1 From the offset, if the pattern forward extend 1 base, the offset increase 1

18 Application 1: Finding repeat pattern in ChrM Repeat Pattern: GATC 4 ACAG Elementary pattern and their offset: CACAG 4 13 23 TCACA 3 12 22 ATCAC 2 11 21 GATCA 1 10 20 GATCA 1 10 20 From the offset, if the pattern backward extend 1 base, the offset decrease 1 From the offset, if the pattern backward extend 1 base, the offset decrease 1

19 Application 1: Finding the repeat pattern in ChrM Convolution Convolution Pattern: CACAACAA, length 8, repeat 5 times, offset: 6399 7724 12487 12930 15824 Pattern: CACAACAA, length 8, repeat 5 times, offset: 6399 7724 12487 12930 15824 L=W=5, K=5 L=W=5, K=5 Elementary patterns and their offset: Elementary patterns and their offset: ACAAC 492 1712 3303 3400 3445 3806 4040 4073 5057 5396 6065 6197 6400 7185 7725 8643 8709 8768 9198 9668 10040 10134 10300 10839 10897 10900 10944 11263 11335 11997 12031 12178 12250 12488 12545 12554 12741 12858 12931 13380 13746 13775 14331 14478 14691 15364 15680 15767 15825 16079 ACAAC 492 1712 3303 3400 3445 3806 4040 4073 5057 5396 6065 6197 6400 7185 7725 8643 8709 8768 9198 9668 10040 10134 10300 10839 10897 10900 10944 11263 11335 11997 12031 12178 12250 12488 12545 12554 12741 12858 12931 13380 13746 13775 14331 14478 14691 15364 15680 15767 15825 16079 CAACA 1110 1122 2214 2399 2430 2969 3304 3762 3807 3940 4041 4074 4698 5581 6198 6401 6549 6598 7186 7726 8347 8641 8663 8710 9199 9354 9560 9791 10041 10086 10617 10716 10780 10835 10895 10898 11264 11333 11954 11998 12032 12179 12251 12489 12829 12932 13747 13776 13861 13906 14188 14651 14822 15121 15681 15826 16077 16281 CAACA 1110 1122 2214 2399 2430 2969 3304 3762 3807 3940 4041 4074 4698 5581 6198 6401 6549 6598 7186 7726 8347 8641 8663 8710 9199 9354 9560 9791 10041 10086 10617 10716 10780 10835 10895 10898 11264 11333 11954 11998 12032 12179 12251 12489 12829 12932 13747 13776 13861 13906 14188 14651 14822 15121 15681 15826 16077 16281 Forward extension: Forward extension: Offset difference = pattern length (5) – pattern suffix length (4) = 1 Offset difference = pattern length (5) – pattern suffix length (4) = 1 Repeat position (repeat times) =20 >5, accept Repeat position (repeat times) =20 >5, accept New pattern: ACAACA (elementary pattern ACAAC forward extends 1 base A) New pattern: ACAACA (elementary pattern ACAAC forward extends 1 base A) New Offset: 3303 3806 4040 4073 6400 7185 7725 8709 9198 10040 10897 12031 12178 12250 12488 12931 13746 13775 15680 15825 New Offset: 3303 3806 4040 4073 6400 7185 7725 8709 9198 10040 10897 12031 12178 12250 12488 12931 13746 13775 15680 15825

20 Application 1: Finding the repeat pattern in ChrM Convolution Convolution patterns and their offset: ACAACA 3303 3806 4040 4073 6400 7185 7725 8709 9198 10040 10897 12031 12178 12250 12488 12931 13746 13775 15680 15825 AACAA 240 283 362 646 1123 2400 2766 2920 2970 3302 3721 4039 4699 5173 5287 5395 6196 6402 7727 8508 8642 8664 8691 9667 10299 10420 10781 10896 10899 10943 11334 11573 12030 12249 12416 12490 12553 12740 12933 13388 13395 13745 13774 13777 13862 14029 14189 14193 14573 14658 15363 15603 15679 15682 15827 15867 16078 16282 Forward extension: Offset difference = new pattern length (6) – pattern suffix length (4) = 2 Offset difference = new pattern length (6) – pattern suffix length (4) = 2 Repeat position (repeat times) =8 >5, accept Repeat position (repeat times) =8 >5, accept New pattern: ACAACAA (pattern ACAAC forward extends 1 base A) New pattern: ACAACAA (pattern ACAAC forward extends 1 base A) New Offset: 6400 7725 10897 12488 12931 13775 15680 15825 New Offset: 6400 7725 10897 12488 12931 13775 15680 15825

21 Application 1: Finding the repeat pattern in ChrM Convolution Convolution patterns and their offset: ACAACAA 6400 7725 10897 12488 12931 13775 15680 15825 CACAA 1011 1052 1619 3156 3805 3810 3991 4072 5280 5680 5938 6399 7184 7724 8053 8456 8563 8708 8767 8916 9106 10133 10838 10985 11262 11357 11996 12001 12217 12487 12544 12930 13218 13379 13546 13942 14330 14624 15419 15581 15824 16431 Backward extension: Offset difference = pattern length (5) – pattern suffix length (4) = 1 Offset difference = pattern length (5) – pattern suffix length (4) = 1 Repeat position (repeat times) =5 >=5, accept Repeat position (repeat times) =5 >=5, accept New pattern: CACAACAA (pattern ACAACAA backward extends 1 base C) New pattern: CACAACAA (pattern ACAACAA backward extends 1 base C) New Offset: 6399 7724 12487 12930 15824 New Offset: 6399 7724 12487 12930 15824 Forward and backward extension finished, repeat times >=K, report it Forward and backward extension finished, repeat times >=K, report it

22 Application 1: Finding the repeat pattern in ChrM PatternLength Repeat times Positions CACAACA711 3805 4072 6399 7184 7724 8708 11262 11996 12487 12930 15824 ACCCCAT710 535 638 4391 5437 6584 7491 13692 14340 14629 14814 CCCACTA714 15 110 2282 3739 4195 5705 7113 7403 9058 11216 11267 12361 13624 13681 CACCCTA710 1086 2219 11816 11871 12970 13147 13882 14391 15701 16268 CCCCCAT712 4880 7956 8029 8407 11428 12085 12238 12387 12433 14426 14589 16191 CCCTCTA710 726 1192 2068 3513 8275 8284 8608 9086 11980 13785 CCTAGCA711 4198 6357 9061 11744 12988 13132 13306 13468 13925 15320 15647 CCTCCTA712 3322 3610 4168 9279 9729 10092 10952 12095 12985 13990 15335 15491 AGCCCTA710 1098 6924 10290 10353 10695 11969 13069 13558 13837 15317 Pattern report is saved in the ChrM_pattern.txt file. Here is a part of result(L=5, K=10).

23 Application 1: Finding the repeat pattern in ChrM

24 Application 2: Finding the repeat pattern in Chr1 Define the elementary pattern: Define the elementary pattern:L=W=17 Scan the Chr1 sequence Scan the Chr1 sequence Problem: Problem: 1. Ch1 file size: 252,194,721 bytes, very large, can ’ t been read by R 2. All possible elementary patterns (4^17 = 17,179,869,184), over the R memory limitation 2. All possible elementary patterns (4^17 = 17,179,869,184), over the R memory limitation

25 Application 2: Finding the repeat pattern in Chr1 Solution: Solution: Use another computer language, like C or perl Use another computer language, like C or perl The C program gcat and the scanning result were got from Dr. Steven. Parallel program were used for running the program. The elementary patterns over 2000 repeat times are used for convolution. The C program gcat and the scanning result were got from Dr. Steven. Parallel program were used for running the program. The elementary patterns over 2000 repeat times are used for convolution. I also wrote perl script: scan.pl I also wrote perl script: scan.pl Usage: perl scan.pl Usage: perl scan.pl

26 Application 2: Finding the repeat pattern in Chr1 Convolution Convolution Problem: The index file is too large, it can ’ t been read by R Problem: The index file is too large, it can ’ t been read by R My Solution: My Solution: perl convolute.pl perl convolute.pl It ’ s working, but still very slow. The PVM (parralle virtual machine) techniques should be used for increasing speed. It ’ s working, but still very slow. The PVM (parralle virtual machine) techniques should be used for increasing speed.

27 Application 2: Finding the repeat pattern in Chr1 Some of Pattern Found in the Chr1 Pattern PatternATCTCAGCTCACTGCAACCTCTGCCTCCCGGGTTCAAGCGATTCTATCCTCCCACCTCAGCCTCCCAAAGTGCTGGGATTACAGG Pattern length 4540 Repeat times 4042 CAATCTCAGCTCACTGCAACCTCCACCTCCCAGGTTCAAG4049 GCTCACTGCAACCTCCGCCTCCCAGGTTCAAGCAATTCTC40108 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC3645 CCCACCTCAGCCTCCCAAAGTGCTGGGATTATAGG3566 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATTA3442 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGC3442 CCCACCTCAGCCTCCCAAAGTGCTGAGATTACAG3454 CAATCTCAGCTCACTGCAACCTCCACCTCCTGGG3462

28 There are two more patterns, namely, AGCAATTCTCCTGCCTCA: 2858 CTCCTGCCTCAGCCTCC: 14168 can convolute with gctcactgcaacctccgcctcccaggttcaagcaattctc : 108 to form the maximal pattern GCTCACTGCAACCTCCGCCTCCCAGGTTCAAGCAA TTCTCCTGCCTCAGCCTCC : 46 There are two more patterns, namely, AGCAATTCTCCTGCCTCA: 2858 CTCCTGCCTCAGCCTCC: 14168 can convolute with gctcactgcaacctccgcctcccaggttcaagcaattctc : 108 to form the maximal pattern GCTCACTGCAACCTCCGCCTCCCAGGTTCAAGCAA TTCTCCTGCCTCAGCCTCC : 46

29 Application 2: Finding the repeat pattern in Chr1 Discussion Discussion 1. For the human genome pattern discovery, the high performance computer and parallel program should be used. R is a not good tools for solving this problem. 2. The algorithm should be carefully implemented for the better performance, like reducing the loop time. 2. The algorithm should be carefully implemented for the better performance, like reducing the loop time.

30 Reference Isodore Rigoutsos, Aris Floratos Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm, Bioinformatics, 1998, Vol(14) no.1:55-67 Isodore Rigoutsos, Aris Floratos Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm, Bioinformatics, 1998, Vol(14) no.1:55-67 Isodore Rigoutsos et al. Short blocks from the noncoding parts of the human genome have instances within nearly all known genes and relate to biological process, PNAS, 2006, Vol(103) no.17:6605-6610 Isodore Rigoutsos et al. Short blocks from the noncoding parts of the human genome have instances within nearly all known genes and relate to biological process, PNAS, 2006, Vol(103) no.17:6605-6610

31


Download ppt "Finding repeat pattern in human genome by TEIRESIAS algorithm Xiaojun Hu."

Similar presentations


Ads by Google