Spectrum-Based De Novo Repeat Detection in Genomic Sequences. Do Huy Hoang.


1 Spectrum-Based De Novo Repeat Detection in Genomic Sequences. Do Huy Hoang

2 Outline
- Introduction
  - What is a repeat?
  - Why study repeats?
  - Related work
- SAGRI
  - Algorithm
  - Analysis
  - Evaluation

3 Introduction

4 What is a repeat? (Definition)
- [General]: Nucleotide sequences that occur multiple times within a genome.
- [CompBio]: Given a genome sequence S, find a string P that occurs at least twice in S (allowing some errors).

5 What is a repeat? (Function)
- Motifs: very short repeats (10-20 bp); transcription factor binding sites
- Long and short interspersed elements (LINE, SINE): jumping genes
- Genes and pseudogenes
- Tandem repeats: simple short sequence repeats, $A_n$, $(CGG)_n$

6 Why study repeats? (1)
- Eukaryotic genomes contain many repeats. E.g., the human genome is 50% repeats.
- Repeats are believed to play an important role in evolution and disease. E.g., Alu elements are particularly prone to recombination. Insertions of Alu repeats inactivate genes in patients with hemophilia and neurofibromatosis (Kazazian, 1998; Deininger and Batzer, 1999).
- Repeats are important to chromatin structure. Most TEs in mammals seem to be silenced by methylation. Alu sequences are a major target for histone H3-Lys9 methylation in humans (Kondo and Issa, 2003). Heterochromatin is known to contain many SINE and LINE repeats.

7 Why study repeats? (2)
- Repeats complicate sequence assembly and genome comparison. Many people remove repeats before they analyze the genome.
- Repeats set hurdles for microarray probe signal analysis. The probe signal may be inaccurate if the probe sequence overlaps with repeat regions.
- Repeats may contribute to human diversity more than genes do.
- Repeats can be used as a DNA fingerprint.

8 Steps in repeat finding
- Repeat library (RepeatMasker)
- De novo repeat discovery (two steps):
  - Identification of repeats
  - Classification of repeats

9 SAGRI algorithm

10 Algorithm outline
Input: a text G
- FindHit phase: finds all candidate second occurrences of repeat regions
  ACGACGCGATTAACCCTCGACGTGATCCTC
- Validation phase: uses the hits from phase 1 to find all pairs of repeats
  ACGACGCGATTAACCCTCGACGTGATCCTC

11 Spectrum-based repeat finder
What is a spectrum? Given a string G, its spectrum is the set of all its k-mers.
E.g., k = 3, G = ACGACGCTCACCCT
The spectrum is ACC, ACG, CAC, CCC, CCT, CGA, CGC, CTC, GAC, GCT, TCA.
- CTC is a k-mer occurring at position 7.
- ACG is a k-mer occurring at positions 1 and 4.
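The spectrum example on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not code from the talk: the function name `spectrum_positions` is hypothetical, and it records the 1-based start positions of each k-mer to match the slide's numbering.

```python
from collections import defaultdict


def spectrum_positions(g: str, k: int) -> dict[str, list[int]]:
    """Map each k-mer of g to the 1-based positions where it starts."""
    positions = defaultdict(list)
    for start in range(len(g) - k + 1):
        positions[g[start:start + k]].append(start + 1)  # slides count from 1
    return dict(positions)


occ = spectrum_positions("ACGACGCTCACCCT", 3)
print(sorted(occ))  # the spectrum: the distinct 3-mers in sorted order
print(occ["CTC"])   # [7]
print(occ["ACG"])   # [1, 4]
```

The keys of the dictionary form the spectrum; the position lists are what a repeat finder needs in order to locate the earlier occurrence of a k-mer.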

12 Observation 1: How to find candidate regions containing repeats?
Two repeat regions should share some k-mers. E.g., the following repeats share CGA:
ACGACGCGATTAACCCTCGACGTGATCCTC

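13 Feasible extension (bud)
G = ACGACGTGATTAACCCTCGACGTGATCCTC, with i marking the second occurrence of CGA.
Given the spectrum S for G[1..i-1], the k-mer CGA at position i can be extended by one base when the resulting next k-mer is in S:
- A: not feasible (GAA is not in S)
- C: feasible (GAC is in S)
- G: not feasible (GAG is not in S)
- T: feasible (GAT is in S)
C and T are feasible extensions!
Note: T is called a fooling probe!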
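A small sketch of the feasible-extension test on this slide's example. It assumes that a base b is a feasible extension of a k-mer when the k-mer formed by the last k-1 bases followed by b is in the spectrum, which matches the example but is not spelled out on the slide. The helper names are hypothetical.

```python
def spectrum(g: str, k: int) -> set[str]:
    """Set of all k-mers of g."""
    return {g[i:i + k] for i in range(len(g) - k + 1)}


def feasible_extensions(s: set[str], kmer: str) -> list[str]:
    """Bases b such that kmer[1:] + b is a k-mer of the spectrum s."""
    return [b for b in "ACGT" if kmer[1:] + b in s]


G = "ACGACGTGATTAACCCTCGACGTGATCCTC"
i = 18                      # 1-based position of the second CGA
S = spectrum(G[:i - 1], 3)  # spectrum of G[1..i-1]
current = G[i - 1:i + 2]    # the 3-mer at position i
print(current, feasible_extensions(S, current))  # CGA ['C', 'T']
```

Here C is the extension that follows the earlier copy of CGA, while T is feasible only because GAT happens to occur elsewhere in the prefix, which is why the slide calls it a fooling probe.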

14 Observation 2
A path of feasible extensions may be a repeat.
Example: G = ACGACGCTATCGATGCCCTC
The spectrum S for G[1..10] is ACG, CGA, CGC, CTA, GAC, GCT, TAT.
Starting from position 11, there exists a path of feasible extensions: CGA-C-G-C.
This path corresponds to the length-6 substring at position 2 (CGACGC). It has one mismatch compared with the length-6 substring at position 11 (CGATGC).
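The example can be checked with a short depth-first search over feasible extensions. This is an illustrative sketch under stated assumptions, not the talk's DetectRepSeq: the function `extension_paths` is hypothetical, it assumes k >= 2, and it prunes a path as soon as it has more than `max_mismatch` mismatches against the text at the current position.

```python
def spectrum(g: str, k: int) -> set[str]:
    """Set of all k-mers of g."""
    return {g[i:i + k] for i in range(len(g) - k + 1)}


def extension_paths(s, seed, text, max_mismatch):
    """Paths of feasible extensions that start with seed, have len(text)
    bases, and differ from text in at most max_mismatch positions."""
    k = len(seed)
    results = []

    def walk(path, mismatches):
        if len(path) == len(text):
            results.append((path, mismatches))
            return
        expected = text[len(path)]  # base of the current copy at this step
        for b in "ACGT":
            if path[-(k - 1):] + b not in s:
                continue            # not a feasible extension
            cost = mismatches + (b != expected)
            if cost <= max_mismatch:
                walk(path + b, cost)

    walk(seed, 0)
    return results


G = "ACGACGCTATCGATGCCCTC"
S = spectrum(G[:10], 3)  # spectrum of G[1..10]
current = G[10:16]       # length-6 substring at position 11: CGATGC
paths = extension_paths(S, current[:3], current, max_mismatch=1)
print(paths)             # [('CGACGC', 1)]
```

The single surviving path, CGACGC, is the substring at position 2 and differs from CGATGC in one base, as stated on the slide.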

15 Phase 1: FindHit()
Algorithm:
  Input: a text G
  Initialize the empty spectrum S
  For i = 1 to n   /* we maintain the invariant that S is a spectrum for G[1..i-1] */
    Let x be the k-mer at position i
    If x exists in S, run DetectRepSeq(S, i)
    Insert x into S
Note: DetectRepSeq(S, i) looks for a repeat occurring at position i.
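The FindHit loop translates almost directly into Python. The sketch below is a hedged illustration: `find_hits` and the `detect_rep_seq` callback are hypothetical names, and the stand-in detector used in the example only reports the position i, whereas the DetectRepSeq of the talk searches for a repeat starting there.

```python
def find_hits(g: str, k: int, detect_rep_seq) -> list:
    """Scan g left to right, growing the spectrum one k-mer at a time."""
    s = set()                            # spectrum of the prefix scanned so far
    hits = []
    for i in range(1, len(g) - k + 2):   # 1-based start positions of k-mers
        x = g[i - 1:i - 1 + k]           # the k-mer at position i
        if x in s:                       # x was seen earlier: candidate repeat
            hits.extend(detect_rep_seq(s, g, i))
        s.add(x)
    return hits


# Stand-in detector (assumption): just report where a k-mer re-occurs.
hits = find_hits("ACGACGCTCACCCT", 3, lambda s, g, i: [i])
print(hits)  # [4]
```

Because the spectrum only ever contains k-mers to the left of position i, each hit is a candidate second occurrence of a repeat, never the first.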

16-23 [Figure, animated over slides 16 to 23: DetectRepSeq(S(18), 18) on G = ACGAAGTGATTAACCCTCGACGCGATCC…, with pointers Ref and Curr. The spectrum S(18) shown is AAC, AAG, ACC, ACG, AGT, ATT, CCC, CCT, CGA, CTC, GAA, GAT, GTG, TAA, TCG, TGA, TTA. A tree of feasible extensions rooted at CGA is grown step by step; each node is labeled with a base and a count, and branches that reach a count of 3 are marked with *.]

24 Other details
- Extend backward
- Stop backtracking after h steps

25 Validation phase
- Decompose the hits into a set of k-mers and index all the locations of these k-mers.
- For each pair of locations of a k-mer w in the hits, do a BLAST extension.
- Use an auxiliary data structure to avoid double checking.
- Report the pairs whose length exceeds our threshold.
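The validation steps can be sketched as follows. This is a simplified illustration, not the talk's implementation: instead of a real BLAST extension it uses a greedy ungapped extension with a fixed mismatch budget, the auxiliary structure is a plain set that only removes duplicate reports, and the names `extend_pair`, `validate`, `seeds`, `max_mismatch` and `min_length` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def extend_pair(g, p, q, k, max_mismatch):
    """Greedy ungapped extension of the seed g[p:p+k] == g[q:q+k]
    (0-based, p < q): extend right, then left, until the mismatch
    budget is used up. Returns (start_p, start_q, length)."""
    left_p, left_q = p, q
    right_p, right_q = p + k, q + k  # exclusive right ends
    mismatches = 0
    while right_q < len(g):
        cost = int(g[right_p] != g[right_q])
        if mismatches + cost > max_mismatch:
            break
        mismatches += cost
        right_p, right_q = right_p + 1, right_q + 1
    while left_p > 0:
        cost = int(g[left_p - 1] != g[left_q - 1])
        if mismatches + cost > max_mismatch:
            break
        mismatches += cost
        left_p, left_q = left_p - 1, left_q - 1
    return left_p, left_q, right_p - left_p


def validate(g, seeds, k, max_mismatch, min_length):
    """Index the locations of the seed k-mers, extend every pair of
    locations, and report extended pairs of at least min_length."""
    index = defaultdict(list)
    for start in range(len(g) - k + 1):
        kmer = g[start:start + k]
        if kmer in seeds:
            index[kmer].append(start)
    reported = set()  # each extended pair is reported once
    for locations in index.values():
        for p, q in combinations(locations, 2):
            pair = extend_pair(g, p, q, k, max_mismatch)
            if pair[2] >= min_length:
                reported.add(pair)
    return sorted(reported)


G = "ACGACGCGATTAACCCTCGACGTGATCCTC"
pairs = validate(G, {"CGA", "GAC"}, k=3, max_mismatch=1, min_length=8)
print(pairs)  # [(1, 17, 9)]
```

In 1-based terms the reported pair is G[2..10] = CGACGCGAT and G[18..26] = CGACGTGAT, which differ in one base. Both seeds lead to the same extended pair, which is why some form of duplicate suppression is needed.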

26 Analysis

27
- How to find most repeats? Avoid false negatives.
- How to get better speed? Avoid false positives.

28 How do we choose k? (1)
- If k is too big, the k-mers are too specific and we may miss some repeats.
- If k is too small, the k-mers cannot help us differentiate repeats from non-repeats.
- For repeats of length 50 and similarity > 0.9, we found that $k \approx \log_4 n + 2$ is good enough.

29 How do we choose k? (2)
A random k-mer matches one of n chosen k-mers:
Pr(a k-mer re-occurs by chance in a sequence of length n) (analogous to throwing n balls into $4^k$ bins) $= 1 - (1 - 4^{-k})^n \approx 1 - \exp(-n/4^k)$.
We require $1 - \exp(-n/4^k) \le \epsilon_1$; hence $k \ge \log_4 n + \log_4(1/\epsilon_1)$.
If we set $\epsilon_1 = 1/16$, then $k \ge \log_4 n + 2$.
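The choice of k can be checked numerically. The sketch below reads the bound on this slide as k >= log_4 n + log_4(1/eps1), which is an assumption about symbols lost from the transcript; it is the reading that gives log_4 n + 2 for eps1 = 1/16. The function names are hypothetical, and the smallest k is found with integer powers rather than floating-point logarithms to avoid rounding at exact powers of 4.

```python
import math


def reoccurrence_probability(n: int, k: int) -> float:
    """Exact balls-in-bins probability 1 - (1 - 4^-k)^n."""
    return 1.0 - (1.0 - 4.0 ** -k) ** n


def reoccurrence_approximation(n: int, k: int) -> float:
    """The approximation 1 - exp(-n / 4^k)."""
    return 1.0 - math.exp(-n / 4.0 ** k)


def smallest_k(n: int, eps1: float) -> int:
    """Smallest integer k with 4^k >= n / eps1,
    i.e. k >= log_4 n + log_4(1 / eps1)."""
    k = 1
    while 4 ** k * eps1 < n:
        k += 1
    return k


n = 4 ** 10                  # a sequence of about one million bases
k = smallest_k(n, 1 / 16)
print(k)                     # 12, i.e. log_4 n + 2
print(reoccurrence_probability(n, k) <= 1 / 16)  # True
```

Since 1 - exp(-x) <= x, requiring n / 4^k <= eps1 is enough to keep the re-occurrence probability below eps1.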

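30 The occurrence of false negatives (missed repeats) (1)
A pair of repeats of length L, with m mismatches.
The probability of a preserved k-mer in the repeat is [equation not preserved in the transcript].
M is the number of nonnegative integer solutions to [equation not preserved] subject to [constraint not preserved].
[Figure: a repeat of length L in which the mismatches (X) separate segments of lengths $x_1, x_2, \ldots, x_{m+1}$.]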

31 The occurrence of false negatives (missed repeats) (2)
It is easy to see that M is the coefficient of $x^{L-m}$ in [expression not preserved in the transcript]. Hence [equation not preserved].
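The equations on slides 30 and 31 did not survive in the transcript, so the following is one plausible reconstruction rather than the authors' formulas. It assumes the m mismatches are placed uniformly at random among the L positions, so the matching segments satisfy $x_1 + \dots + x_{m+1} = L - m$, and no k-mer is preserved exactly when every $x_i \le k - 1$. Under that assumption M is the coefficient of $x^{L-m}$ in $(1 + x + \dots + x^{k-1})^{m+1}$, inclusion-exclusion gives $M = \sum_j (-1)^j \binom{m+1}{j} \binom{L - jk}{m}$, and the probability of a preserved k-mer is $1 - M / \binom{L}{m}$. The function names are hypothetical.

```python
from itertools import product
from math import comb


def count_no_preserved_kmer(L: int, m: int, k: int) -> int:
    """M by inclusion-exclusion: solutions of x_1 + ... + x_{m+1} = L - m
    with 0 <= x_i <= k - 1 (assumed reconstruction, see the lead-in)."""
    return sum((-1) ** j * comb(m + 1, j) * comb(L - j * k, m)
               for j in range(m + 2) if L - j * k >= m)


def count_brute_force(L: int, m: int, k: int) -> int:
    """The same count by direct enumeration, for checking small cases."""
    return sum(1 for xs in product(range(k), repeat=m + 1)
               if sum(xs) == L - m)


def preserved_kmer_probability(L: int, m: int, k: int) -> float:
    """Probability that some k-mer survives m uniformly placed mismatches."""
    return 1 - count_no_preserved_kmer(L, m, k) / comb(L, m)


print(count_no_preserved_kmer(10, 2, 4), count_brute_force(10, 2, 4))  # 3 3
print(preserved_kmer_probability(50, 5, 12))  # length 50, similarity 0.9
```

The brute-force counter is there only to check the closed form on small inputs; it enumerates k^(m+1) tuples and is not meant for realistic parameters.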

32 Criterion for path termination (1)
Instead of fixing the number of mismatches, we may want to fix the percentage of mismatches, say, 10%.
Then the pruning strategy is length dependent.
If the strings under consideration have length r, we allow $\delta(r)$ mismatches.

33 Criterion for path termination (2)
Let q be the mismatch probability and r be the length of the string.
Probability that a string has s mismatches: $P_q(s) = \binom{r}{s} q^s (1 - q)^{r - s}$.
For a threshold $\epsilon$ (say, 0.01), we set $\delta(r) = \max\{\, 2 \le s \le r - 2 \mid P_q(s) > \epsilon \,\} + 2$.
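A small sketch of this length-dependent threshold. It assumes the missing probability on this slide is the binomial probability of s mismatches among r positions, and it returns 2 when no s in the range qualifies; both the fallback and the function names are assumptions, not part of the talk.

```python
from math import comb


def mismatch_probability(r: int, s: int, q: float) -> float:
    """Binomial probability of exactly s mismatches among r positions."""
    return comb(r, s) * q ** s * (1 - q) ** (r - s)


def allowed_mismatches(r: int, q: float, eps: float) -> int:
    """max{2 <= s <= r - 2 : P_q(s) > eps} + 2.
    Assumption: if no s qualifies, the maximum is taken to be 0."""
    qualifying = [s for s in range(2, r - 1)
                  if mismatch_probability(r, s, q) > eps]
    return max(qualifying, default=0) + 2


print(allowed_mismatches(20, 0.1, 0.01))  # 7
```

For r = 20 and q = 0.1, the probability of 5 mismatches is about 0.032 and that of 6 mismatches is about 0.009, so the largest qualifying s is 5 and the threshold is 7.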

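34 Control of false positives (1)
Two typical cases [figure not preserved in the transcript].
The probability of (case 1)/(case 2) is [relation lost] $2 \cdot 4^{-[\ldots]}$ (exponent lost in the transcript).
P(case 1 or case 2) is small.
For example: 4 errors, q = 0.1, k = 12, P(case 1) = $1.77 \times 10^{-8}$.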

35 Evaluation
Comparison with other programs

36 Programs
- EulerAlign by Zhang and Waterman
- PALS by Edgar and Myers
- REPuter by Kurtz et al.
- SAGRI

37 Measurement
- Count Ratio (CR): the number of repeat pairs that share more than 50% with a reference pair, divided by the number of reference pairs.
- Shared Repeat Region (SRR): the ratio of the found region to the reference region.

38 Simulated data
Conclusion from simulated data: the result is consistent with the analysis.

39 Genome data
- M.gen (0.6 Mbp): organism with the smallest genome; lives in the primate genital and respiratory tracts
- C.tra (1 Mbp): lives inside the cells of humans
- A.ful (2.1 Mbp): found in high-temperature oil fields
- E.coli (4 Mbp): an important bacterium that lives in the lower intestines of mammals
- Human chr22, p20M to p21M (1 Mbp)

40 Use the CR and SRR ratios to measure
Cross validation:
- G/H = 1, H/G < 1 ⇒ G “outperforms” H
- G/H < 1, H/G = 1 ⇒ H “outperforms” G
- G/H < 1, H/G < 1 ⇒ G, H are complementary
- G/H = 1, H/G = 1 ⇒ G, H are similar


43 Questions and answers


45 H. H. Do, K. P. Choi, F. P. Preparata, W. K. Sung, L. Zhang. Spectrum-based de novo repeat detection in genomic sequences. Journal of Computational Biology, 15(5): , June 2008

