Information Theoretical Probe Selection for Hybridisation Experiments

1 Information Theoretical Probe Selection for Hybridisation Experiments
Ralf Herwig et al. Bioinformatics, Vol.16, No. 10, 2000 Summarized by Sun Kim SNU Biointelligence Lab.

2 Introduction (1/2)
Oligonucleotide fingerprinting
- The fingerprint is characteristic of the individual clone.
- Probes should be informative for the clone sequences, in the sense that all different genes can be distinguished by their fingerprints.
- Probes should occur within the clone sequences with considerable frequency.
- Probes should not be too similar to each other.
Probe selection according to high frequencies
- Leads to the agglomeration of probes that are highly similar to each other.
- The gain in information is not significantly increased when selecting such probes.
(C) 2001, SNU Biointelligence Lab, http://bi.snu.ac.kr/

3 Introduction (2/2)
This paper
- An information theoretical approach → probe design based on entropy maximization.
- Performance of the probes is assessed with respect to clustering sequences, by evaluating pairwise similarities of their fingerprints.

4 System and Methods

5 Data Preparation (1/2)
- Training set → probe set
- Test set → binary fingerprints
  $x_{ij} = 1$ if probe $j$ or its reverse complementary sequence matches clone sequence $i$, and $x_{ij} = 0$ otherwise.
- Five different test sets
- 685 different cDNA sequences from the GenBank database
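The fingerprint definition above can be illustrated with a minimal Python sketch. It assumes exact substring matching on upper-case A/C/G/T strings; the function names and the toy clones and probes are hypothetical and not taken from the paper.

```python
# Complement table for DNA bases (assumes upper-case A/C/G/T only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def fingerprint(clone: str, probes: list[str]) -> list[int]:
    """Binary fingerprint of one clone.

    Entry j is 1 if probe j or its reverse complement occurs in the clone
    (exact match, an assumption of this sketch), and 0 otherwise.
    """
    return [
        1 if (p in clone or reverse_complement(p) in clone) else 0
        for p in probes
    ]


# Toy data, made up for illustration only.
clones = ["ACGTTGCA", "GGGCCCAA"]
probes = ["ACG", "TTG", "GGC"]
fingerprints = [fingerprint(c, probes) for c in clones]
```

In the second toy clone the probe TTG is matched only through its reverse complement CAA, which is the case the definition explicitly covers.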

6 Data Preparation (2/2)
Noise
- Introduced by flipping the respective amount of digits of the binary fingerprints.
Parameter
- 20% of true positives → 0
- 20% of true negatives → 1
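A minimal sketch of this noise model in Python follows. It assumes the flips are applied per fingerprint and that the number of flipped digits is the rounded fraction of ones and zeros; the slide does not state either detail, and `add_noise` and its parameter names are hypothetical.

```python
import random


def add_noise(bits, fn_rate=0.2, fp_rate=0.2, rng=None):
    """Return a noisy copy of a binary fingerprint.

    A fraction fn_rate of the 1-digits (true positives) is set to 0 and a
    fraction fp_rate of the 0-digits (true negatives) is set to 1.
    Applying the rates per fingerprint is an assumption of this sketch.
    """
    rng = rng or random.Random()
    ones = [i for i, b in enumerate(bits) if b == 1]
    zeros = [i for i, b in enumerate(bits) if b == 0]
    noisy = list(bits)
    for i in rng.sample(ones, round(fn_rate * len(ones))):
        noisy[i] = 0  # true positive becomes a false negative
    for i in rng.sample(zeros, round(fp_rate * len(zeros))):
        noisy[i] = 1  # true negative becomes a false positive
    return noisy


clean = [1] * 10 + [0] * 10
noisy = add_noise(clean, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the simulated noise reproducible across runs.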

7 Clustering
Sequential k-means algorithm
- Uses mutual information as a pairwise similarity measure for the binary fingerprints.
- Sequentially assigns each data point to the most similar cluster centroid from a set of previously calculated cluster centroids; the centroid is then updated by the data point.
Enriched by heuristics and algorithmic parameters
- Allows the merging of clusters and the introduction of new clusters in each step of the clustering process.
- No need to fix the number of different clusters in advance.
Simulation pipeline → Herwig et al. (1999)
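The similarity measure named above can be sketched as the standard empirical mutual information between two binary fingerprints. This covers only the pairwise measure, not the sequential k-means procedure or its heuristics, and the estimator used in the paper may differ; `mutual_information` is a hypothetical name.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Empirical mutual information (in bits) of two equal-length binary vectors."""
    n = len(x)
    joint = Counter(zip(x, y))  # counts of (x_k, y_k) pairs
    px = Counter(x)             # marginal counts for x
    py = Counter(y)             # marginal counts for y
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written with counts to avoid extra divisions
        mi += p_ab * log2(c * n / (px[a] * py[b]))
    return mi


identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Identical balanced fingerprints share one full bit of information, while the second pair is statistically independent and shares none.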

8 Validation of Clustering (1/2)
True clustering → $T = (t_{ij})$, with $t_{ij} = 1$ if sequences $i$ and $j$ belong to the same cluster, and $t_{ij} = 0$ otherwise
Calculated clustering → $C$
2x2 contingency table to measure clustering quality

9 Validation of Clustering (2/2)
Measure → Jaccard coefficient
Perfect clustering: J(C,T) = 1
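The formula itself did not survive in this transcript. As a hedged illustration, the sketch below uses the standard pair-counting form of the Jaccard coefficient, n11 / (n11 + n10 + n01), computed from the 2x2 contingency table over all sequence pairs. The function name and the label-list input format are assumptions.

```python
from itertools import combinations


def jaccard(calculated, true):
    """Pair-counting Jaccard coefficient between two clusterings.

    Both arguments are lists of cluster labels, one per sequence.
    n11: pairs joined in both clusterings
    n10: pairs joined only in the calculated clustering
    n01: pairs joined only in the true clustering
    """
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(true)), 2):
        same_c = calculated[i] == calculated[j]
        same_t = true[i] == true[j]
        if same_c and same_t:
            n11 += 1
        elif same_c:
            n10 += 1
        elif same_t:
            n01 += 1
    denom = n11 + n10 + n01
    # If no pair is joined in either clustering, the two agree trivially.
    return n11 / denom if denom else 1.0


perfect = jaccard([0, 0, 1, 1], [5, 5, 7, 7])
merged = jaccard([0, 0, 0, 0], [0, 0, 1, 1])
```

Cluster labels only need to be consistent within each clustering, so the first call scores 1 even though the label values differ.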

10 Algorithm and Implementation (1/3)
The fingerprints obtained with a single probe partition the N sequences into two subsets:
- those sequences that match the probe sequence or its reverse complementary sequence,
- and those that do not.
The amount of information of the probe w.r.t. the set of sequences → entropy
  $H = -\sum_{k \in \{0,1\}} p_k \log p_k$, where $p_k$ is the proportion of sequences that fall in the respective subset.
Maximal when the subsets are of equal size.
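The entropy of a single probe can be computed directly from its column of the fingerprint matrix. The sketch below uses base-2 logarithms, which is an assumption; `probe_entropy` is a hypothetical name.

```python
from math import log2


def probe_entropy(column):
    """Entropy (in bits) of the two-way partition induced by one probe.

    column[i] is 1 if the probe (or its reverse complement) matches
    sequence i, and 0 otherwise.
    """
    n = len(column)
    p_match = sum(column) / n
    h = 0.0
    for p in (p_match, 1.0 - p_match):
        if p > 0:  # the term for an empty subset contributes nothing
            h -= p * log2(p)
    return h


balanced = probe_entropy([1, 1, 0, 0])
skewed = probe_entropy([1, 0, 0, 0])
```

A probe that splits the sequences into equal halves reaches the maximum of 1 bit, and a probe that matches every sequence or none carries no information.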

11 Algorithm and Implementation (2/3)
The number of possible fingerprints increases as $2^p$ with the number $p$ of probes.
Screening all possibilities is computationally infeasible → approximation suggested by R. Mott (Meier-Ewert, 1994).

12 Algorithm and Implementation (3/3)
Approximation
1. Find the probe that best partitions the set of known sequences into two groups.
2. Find the second probe that, together with the previously selected one, best partitions the training set into four groups.
3. Find the next probe that, together with the previously selected ones, best partitions the training set.
4. Stop if the number of selected probes exceeds a given threshold or if each partition contains only one sequence.
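The greedy steps above can be sketched as follows. The sketch assumes that "partitions best" means maximal entropy of the joint partition, takes a precomputed binary match matrix as input, and leaves out the probe filters; `select_probes` and `partition_entropy` are hypothetical names.

```python
from collections import Counter
from math import log2


def partition_entropy(keys):
    """Entropy (in bits) of the partition that groups equal keys together."""
    n = len(keys)
    return -sum((c / n) * log2(c / n) for c in Counter(keys).values())


def select_probes(matrix, max_probes):
    """Greedy entropy-maximizing probe selection.

    matrix[i][j] is 1 if candidate probe j matches training sequence i.
    Returns the indices of the selected probes in order of selection.
    """
    n_seq = len(matrix)
    n_probes = len(matrix[0])
    keys = [()] * n_seq  # joint fingerprint of each sequence so far
    selected = []
    # Stop at the threshold or once every sequence is in its own group.
    while len(selected) < max_probes and len(set(keys)) < n_seq:
        best_j, best_h = None, -1.0
        for j in range(n_probes):
            if j in selected:
                continue
            h = partition_entropy([k + (row[j],) for k, row in zip(keys, matrix)])
            if h > best_h:
                best_j, best_h = j, h
        if best_j is None:  # no candidate probes left
            break
        selected.append(best_j)
        keys = [k + (row[best_j],) for k, row in zip(keys, matrix)]
    return selected


toy = [[1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
chosen = select_probes(toy, max_probes=3)
```

On the toy matrix the first two probes already separate all four sequences, so the loop stops before the threshold is reached.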

13 Parameter
- LEN: length of probes. Default = 8
- N_GC: minimal number of G+C in each probe. Default = 2
- COMP: minimal complexity of probes. Default = 0.5
- OVL: maximal length of a common stretch of base pairs shared by any two probes. Default = 6
- SEL: number of probes to be determined. Default = 200
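As a rough illustration of how these parameters might act as filters on candidate probes, the sketch below checks LEN, N_GC and OVL. COMP is stored but not applied, because the slide does not define the complexity measure. Whether OVL also considers reverse complements is not stated, so only the forward strand is compared. All function and class names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeParams:
    LEN: int = 8       # length of probes
    N_GC: int = 2      # minimal number of G+C in each probe
    COMP: float = 0.5  # minimal complexity (not applied in this sketch)
    OVL: int = 6       # maximal common stretch shared by any two probes
    SEL: int = 200     # number of probes to be determined


def gc_count(probe):
    """Number of G and C bases in the probe."""
    return sum(base in "GC" for base in probe)


def longest_common_stretch(a, b):
    """Length of the longest common substring of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for k, cb in enumerate(b, start=1):
            cur.append(prev[k - 1] + 1 if ca == cb else 0)
        best = max(best, max(cur))
        prev = cur
    return best


def is_admissible(candidate, chosen, params=ProbeParams()):
    """Check a candidate against LEN, N_GC and OVL (forward strand only)."""
    if len(candidate) != params.LEN:
        return False
    if gc_count(candidate) < params.N_GC:
        return False
    return all(longest_common_stretch(candidate, p) <= params.OVL for p in chosen)


ok = is_admissible("ACGTACGT", [])
too_similar = is_admissible("ACGTACGT", ["CGTACGTA"])
```

The second call is rejected because the two probes share the 7-base stretch CGTACGT, which exceeds the default OVL of 6.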

14 Trained Probe Sequences

15 Results: Frequency of Probes

16 Results: Comparison of Probe Sets (1/2)
Comparing probe sets, tested on human data and rodent data.

17 Results: Comparison of Probe Sets (2/2)
Comparing probe sets, trained on human, rodent, and plant sequences respectively.

18 Results: Variation of Algorithmic Parameters

19 Conclusion
- Probe selection based on entropy.
- Dependent on the training set: the training set should be chosen as close to the organism under analysis as possible.
- Good hybridization quality can be achieved, e.g. by G+C-rich probes.
- The proposed algorithm can be applied to any experiment in which texts (sequences) are characterized by words (probes) → might it be used to select characteristic keywords?

