Efficient Probe Selection in Micro-array Design Speaker: Cindy Y. Li Joint work with: Leszek Gąsieniec, Paul Sant, and.


1 Efficient Probe Selection in Micro-array Design
Speaker: Cindy Y. Li
Joint work with: Leszek Gąsieniec, Paul Sant, and Prudence Wong
Special thanks go to: David Peleg
Algorithmics Group, Dept. of Computer Science, University of Liverpool
http://www.csc.liv.ac.uk/~cindy

2 Talk Overview
Background: Microarrays & Hybridization
Problem Statement
Our Approach
Experimental Work
Conclusion

3 Hybridization Process
DNA: 5'...TGTGCTTGACAACATAGTTG...3'
Short DNA fragment: 3'-CTACGGACCGAT-5'
A single-stranded DNA probe is linked to an enzyme and allowed to base pair (hybridize) with the mRNA. After a series of washes, only fragments that are hybridized with the target mRNA remain.

4 Tool: DNA Microarrays
A labeled DNA/RNA mixture is flushed over an array of short DNA fragments
Laser activation of fluorescent labels

5 Talk Overview
Background: Microarrays & Hybridization
Problem Statement
Our Approach
Experimental Work
Conclusion

6 Probe concept
A probe is a substring of a gene, which acts as its fingerprint (a.k.a. signature)
Probes are relatively short DNA sequences. Usually, a probe is about 20-25 base pairs long.
For example:
DNA: ...TGTGCTTGGCAACATAGATAGATGC...
Probe: TGCTTGGCAACATAGATAGA

7 Finding unique probes
We are interested in finding a single unique probe (or a small group of unique probes) for each gene
The search process should be both time and space efficient
[Figure: probes P1 to P5 matched against genes G1 to G4]

8 Finding unique probes
Given a database S of gene sequences
For each sequence g in S, try to find a single probe P which hybridizes only with g
If P cross-hybridizes with some other sequences in S (i.e., P has a close occurrence in S), then find a small set of probes that uniquely identifies g
Sometimes multiple probes are required due to the error-prone wet lab environment

9 The use of probes
The uniqueness of probes allows us to identify the genes taking part in the experiment in the wet lab
That is, from the trace (green color) of a number of probes on the microarray, we can identify precisely which genes were involved in the experiment

10 Finding Unique Probes: Performance Measure
Each gene in the database S should be uniquely identified by the smallest possible number of probes
The search for probes should be time/space efficient
The time of the search for probes should be "fairly" independent of the length of the probes
All probes should be far (in Hamming distance) from each other
Probes should satisfy some extra conditions (e.g., related to the hybridization process)
Naive approach: scan through the whole length-n genome for every length-m probe and determine whether the Hamming distance is large enough, which takes O(mn^2) time. For example, this takes 72 hours for the S. pombe genome of length 7.1 x 10^6 bps, and is thus impractical for large genomes.
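The naive scan can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code used in the talk: the names `hamming` and `naive_unique_probes` are hypothetical, and `d` stands for an assumed Hamming-distance threshold below which two sequences are considered able to cross-hybridize.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def naive_unique_probes(genes: list[str], m: int, d: int) -> dict[int, str]:
    """For each gene, return the first length-m substring whose Hamming
    distance to every length-m substring of every other gene exceeds d.

    Roughly n candidate positions, each compared against roughly n
    background positions at cost m, gives the O(m * n^2) behaviour.
    """
    probes = {}
    for i, gene in enumerate(genes):
        for start in range(len(gene) - m + 1):
            candidate = gene[start:start + m]
            far_from_all = all(
                hamming(candidate, other[j:j + m]) > d
                for k, other in enumerate(genes) if k != i
                for j in range(len(other) - m + 1)
            )
            if far_from_all:
                probes[i] = candidate
                break  # one probe per gene is enough for this sketch
    return probes
```

Genes for which no candidate is far enough from the background simply do not appear in the returned dictionary.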

11 Previous Work: Approaches based on
Suffix array and fast pattern matching [Li F. and Stormo G., 2001]
BLAST to avoid cross-hybridization [Rouillard J. M., Herbert C. J. and Zuker M., 2002]
Longest common substrings [Rahmann S., 2002]
Various filtering techniques [Lockhart D. J. et al., 1996]
Methods based on the pigeonhole principle [Lee W. H. and Sung W. K., 2003]
etc.

12 Previous Work: The probe selection criteria
No single base exceeds 50% of the probe size
The length of any contiguous As and Ts or Cs and Gs is less than 25% of the probe size
(G+C)% is between 40% and 60% of the probe
Sensitivity: no self-complementarity within the probe sequence
Homogeneity: melting temperature not too low or too high
Specificity: probes are unique to each gene
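The three composition criteria above can be checked with a few lines of Python. This is a minimal sketch under one assumed reading of the second criterion, namely that any run made up only of A/T bases, or only of C/G bases, must be shorter than 25% of the probe. The function name `passes_composition_criteria` is hypothetical, and the sensitivity, homogeneity, and specificity criteria are not covered here.

```python
import re


def passes_composition_criteria(probe: str) -> bool:
    """Check base frequency, A/T and C/G run length, and GC content."""
    m = len(probe)
    if m == 0:
        return False
    # No single base exceeds 50% of the probe size.
    if max(probe.count(base) for base in "ACGT") > 0.5 * m:
        return False
    # Assumed reading: the longest run drawn only from {A, T} or only
    # from {C, G} must be shorter than 25% of the probe size.
    runs = re.findall(r"[AT]+|[CG]+", probe)
    if max((len(run) for run in runs), default=0) >= 0.25 * m:
        return False
    # (G+C)% between 40% and 60%.
    gc_fraction = (probe.count("G") + probe.count("C")) / m
    return 0.40 <= gc_fraction <= 0.60
```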

13 Previous Work: Test data
Genome name       E. coli     S. pombe     Yeast
Genome length     4,752,411   13,149,651   8,783,280
Number of genes   5,253       521          5,888

14 Previous Work: Test data
Total length: 8,783,280
Total # of genes: 5,888

15 Previous Work
Methods: Li and Stormo (BIBE, 2000); Rouillard et al. (Bioinformatics, 2002); Rahmann (WABI, 2002); Lee and Sung (CSB, 2003)
E. coli: 23-bps, 1.5 days; 50-bps, 31 minutes
Yeast: 24-bps, 4 days; 50-bps, 1 day; 50-bps, 49 minutes
Neurospora crassa: 25-bps, 4 hours; 50-bps, 3.5 hours
# of probes (per method, in the order above): Top 10; All probes; Top 50; All probes

16 Talk Overview
Background: Microarrays & Hybridization
Problem Statement
Our new alternative approach
  - main observations
  - the algorithm
Experimental work
Conclusion

17 Main Observations
In general, randomness helps!
80% of "randomly" chosen probe candidates (chosen by our algorithm) satisfy the probe selection criteria related to the hybridization process [this suggests that random sequences are more likely to hybridize properly]
The expected Hamming distance between two randomly chosen sequences of length n over a 4-letter alphabet is ~ 3n/4 [this suggests that randomly chosen probes will be far from each other]
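The 3n/4 figure follows from linearity of expectation: two independent, uniformly random bases differ with probability 3/4, so n positions contribute 3n/4 mismatches on average. The short Python sketch below confirms this exactly for small n by enumerating all pairs of sequences; the function name `mean_hamming_distance` is hypothetical, and the uniform, independent base model is an assumption.

```python
from fractions import Fraction
from itertools import product


def mean_hamming_distance(n: int) -> Fraction:
    """Exact mean Hamming distance over all ordered pairs of length-n
    sequences over {A, C, G, T} (4^n * 4^n pairs, so keep n small)."""
    seqs = list(product("ACGT", repeat=n))
    total = sum(
        sum(x != y for x, y in zip(a, b))
        for a in seqs
        for b in seqs
    )
    return Fraction(total, len(seqs) ** 2)


for n in (1, 2, 3):
    # Each value equals 3n/4 exactly.
    print(n, mean_hamming_distance(n), Fraction(3 * n, 4))
```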

18 An interesting observation
In general, fragments of DNA sequences representing genes are more deterministic (contain more organized information) compared to the rest of the sequence.
In contrast, the best probes (signatures) representing genes are very likely to be random or almost random!

19 The Algorithm
(*) For every gene g in the database S:
a) generate a random base-pair sequence of length m
b) find the closest length-m substring P in gene g
c) check P against the good probe criteria [80% pass this test]; if P does not pass the criteria, go to a)
d) check P for cross-hybridization [98% pass this test]:
   compare P with every length-m substring Q in the other sequences S - {g}
   if H(P,Q) > d for every such Q, P is chosen as the probe for g; go to (*)
   otherwise, P can possibly cross-hybridize, and we must generate another length-m random sequence; go to a)

20 The algorithm
(*) For every gene g in the database S:
a) generate a random base-pair sequence R of length m
b) find the closest length-m substring P in gene g
c) check P against the good probe criteria; if P does not pass the criteria, go to a)
[Figure: random sequence R aligned with its closest substring P inside gene g]

21 The algorithm
d) check P for cross-hybridization:
   compare P with every length-m substring Q in the other sequences (S - {g})
   if H(P,Q) > d for every such Q, P is chosen as the probe for g; go to (*)
   otherwise, P can possibly cross-hybridize, and we must generate another length-m random sequence; go to a)
[Figure: P from gene g is compared against the background sequences g1, g2, ..., gi. P is far from g1 and from g2, but H(P,Q) < d for a substring Q of gi, so another length-m random sequence is generated]
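Steps a) to d) can be sketched in Python as follows. This is an illustrative reading of the slides, not the authors' implementation: the names `select_probe`, `windows`, and `passes_criteria` are hypothetical, the cap `max_tries` is an added assumption so that the loop always terminates, and the cross-hybridization check is done by a plain scan rather than by whatever faster search the talk relies on.

```python
import random


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def windows(seq: str, m: int):
    """All length-m substrings of seq, left to right."""
    return (seq[i:i + m] for i in range(len(seq) - m + 1))


def select_probe(genes, g, m, d, passes_criteria, rng, max_tries=1000):
    """Try to find a probe for genes[g]; return None if none is found."""
    if len(genes[g]) < m:
        return None  # gene too short to contain a length-m probe
    for _ in range(max_tries):
        # a) generate a random sequence of length m
        r = "".join(rng.choice("ACGT") for _ in range(m))
        # b) find the closest length-m substring P in gene g
        p = min(windows(genes[g], m), key=lambda w: hamming(r, w))
        # c) check P against the probe criteria
        if not passes_criteria(p):
            continue
        # d) P must be farther than d from every background substring Q
        if all(hamming(p, q) > d
               for k, other in enumerate(genes) if k != g
               for q in windows(other, m)):
            return p
    return None


genes = ["ACGTACGTAC", "TTTTTTTTTT"]
probe = select_probe(genes, 0, 4, 2, lambda p: True, random.Random(0))
print(probe)  # some length-4 substring of genes[0]
```

Passing the random generator in as `rng` keeps runs reproducible, and passing the criteria in as a function keeps the hybridization-related checks separate from the search itself.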

22 Talk Overview
Background: Microarrays & Hybridization
Problem Statement
Algorithm
Experimental Work
Conclusion

23 Experimental Work
For Yeast:
1.80% of genes have no probes (duplicated / very similar / too short)
Apart from that:
98.0% of genes need only one probe
1.5% of genes need two probes
0.5% of genes need three probes
Similar results for the E. coli genome

24 Talk Overview
Background: Microarrays & Hybridization
Problem Statement
Algorithm
Experimental Work
Conclusion

25 Conclusion
Almost all (98%) genes can be uniquely identified by a single probe; the others need at most three probes
Our method is:
  Suitable for large-scale probe design
  Fairly independent of the length of probes
  Both time and space efficient
  Useful in the design of fault-tolerant systems of probes

26 Ongoing Work
Distinguish multiple targets in a sample
[Figure: probes P1, P2, P1', P2' and genes g1, g2, g3]

27 Questions?

28 Thank You!
Presented by Cindy Y. Li

29 Self-complementarity
Probe 5' TTTCAGTAATAAAAGATTTCTGT 3'
Probe 3' TGTCTTTAGAAAAATTAGACTTT 5'

30 Melting Temperature
T_M can be used as a parameter to evaluate probe hybridization behavior
T_M is calculated for each probe following SantaLucia et al., 1996, from:
ΔH, the sum of the nearest-neighbor enthalpy changes
ΔS, the sum of the nearest-neighbor entropy changes
R, the gas constant (1.987 cal deg^-1 mol^-1)
C_T, the total molar concentration of strands
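To show how these four quantities combine, the sketch below uses a commonly cited two-state nearest-neighbor form, T_M = ΔH / (ΔS + R ln(C_T / 4)), which applies to non-self-complementary duplexes. That this is the exact expression used in the talk is an assumption, as are the function name `melting_temperature_celsius`, the units chosen for the inputs, and the example values, which are purely illustrative.

```python
import math

R = 1.987  # gas constant in cal/(K*mol), the value quoted on the slide


def melting_temperature_celsius(delta_h_kcal: float,
                                delta_s_cal: float,
                                c_t: float) -> float:
    """Assumed two-state form: T_M = dH / (dS + R * ln(C_T / 4)).

    delta_h_kcal: summed nearest-neighbor enthalpy changes, kcal/mol
    delta_s_cal:  summed nearest-neighbor entropy changes, cal/(K*mol)
    c_t:          total molar strand concentration, mol/L
    """
    kelvin = 1000.0 * delta_h_kcal / (delta_s_cal + R * math.log(c_t / 4.0))
    return kelvin - 273.15


# Illustrative inputs only, not measured values for any real probe.
print(round(melting_temperature_celsius(-150.0, -400.0, 1e-6), 1))
```

Under this form, raising the strand concentration raises T_M, which is one reason probes are compared at a fixed C_T.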

31 Melting Temperature: thermodynamic stability / nearest neighbour
Nearest-neighbour values (kcal/mol), rows A, T, C, G:
A: -1.2, -0.9, -1.5
T: -0.9, -1.2, -1.5, -1.7
C: -1.5, -2.1, -2.8
G: -1.5, -2.3, -2.1
Example: TTT C A G TAATTAAAAA G ATTT C T G T, with terms -1.2, -1.5, -1.7 kcal/mol

