Computational Biology Group. Class prediction of tumor samples Supervised Clustering Detection of Subgroups in a Class.


1 Computational Biology Group

2 Class prediction of tumor samples Supervised Clustering Detection of Subgroups in a Class

3 [Table] Training Set and Test Sample: rows are genes (Gene 1 … Gene 20,000), columns are samples (Sample 1 … Sample 100), and each sample carries a Class Label of 0 or 1; the cells hold signed integer expression values (e.g. Gene 1: -214, -139, -76, -135, -106, -138, -72, …). Beside the training set stands a Test Sample X whose label, 1 OR 0, is unknown.

4 Class prediction: assign unknown samples to already known classes. For this procedure we use a set of samples for which we already know the class they belong to. The task is the identification of proper molecular markers (genes) that are able to discriminate one class from the other. There are thousands of genes: some only add noise, while others may be able to discriminate these classes.

5 How do we discover these interesting genes?

6 …Some Definitions  A Training Dataset D, which has M samples and N genes  Arbitrarily we assign the label 0 to the samples that belong to the first class and the label 1 to the samples that belong to the second class (if the dataset has more than two classes: One Class vs All, or One Class vs Another Class)  For each gene g there is an expression vector Eg = (e1, e2, …, em), one value per sample  Eg′ is the sorted vector of Eg  Based on Eg′ we construct the labeling vector Vg = (v1, v2, …, vm): vi is the class label of the sample that contributed the i-th smallest expression value
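
A minimal sketch of the Eg′/Vg construction in Python (the helper name and the toy data are hypothetical; only the sorting rule comes from the slide):

    def labeling_vector(expressions, labels):
        """expressions: the M expression values of one gene (one per sample).
        labels: the M class labels (0 or 1), in the same sample order.
        Returns Eg' (the sorted expression vector) and Vg (the labels of
        the samples read off in that sorted order)."""
        order = sorted(range(len(expressions)), key=lambda i: expressions[i])
        eg_sorted = [expressions[i] for i in order]
        vg = [labels[i] for i in order]
        return eg_sorted, vg

    # Toy example with M = 6 samples:
    eg, vg = labeling_vector([96, 10, 455, 78, 601, 25], [1, 1, 0, 0, 0, 1])
    print(eg)  # [10, 25, 78, 96, 455, 601]
    print(vg)  # [1, 1, 0, 1, 0, 0]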

7

8 Eg = (10, 25, 53, 78, 96, 122, 154, 198, 221, 256, 318, 455, 487, 503, 556, 601, 621, 647, 733, 785) Vg = (1 1 1 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0)

Region | representation | P
R1 | 1, 1, 1 | 3/8 = 0.375
R2 | 0, 0, 0, 0, 0, 0, 0, 0, 0 | 9/12 = 0.75

Threshold value: P accepted = 0.7. Since P(R2) = 0.75 ≥ 0.7, gene g is an informative gene, which belongs to Category I, and it is a Class Zero Classifier.
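
The region score from this slide is straightforward to compute: P is the length of the homogeneous run at one end of Vg divided by the total number of samples carrying that label. A sketch (the function name and threshold handling are my own; the numbers reproduce the slide's example):

    def end_region_p(vg, threshold=0.7):
        def run_length(seq):
            label, n = seq[0], 0
            for v in seq:
                if v != label:
                    break
                n += 1
            return label, n

        left_label, left_run = run_length(vg)          # region R1
        right_label, right_run = run_length(vg[::-1])  # region R2
        p_left = left_run / vg.count(left_label)
        p_right = right_run / vg.count(right_label)
        # the gene is informative if either end region reaches the threshold
        informative = p_left >= threshold or p_right >= threshold
        return (left_label, p_left), (right_label, p_right), informative

    vg = [1,1,1,0,0,1,0,1,1,1,1,0,0,0,0,0,0,0,0,0]
    print(end_region_p(vg))
    # ((1, 0.375), (0, 0.75), True) -> R2 wins: a Class Zero Classifier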

9 Informative Genes  Category I: there is a homogeneous region on only one side  Category II: there is a homogeneous region on each side, from opposite classes  Category III: there is a homogeneous region on each side, from the same class

10 Class Prediction  Weighted Voting  Class Zero Classifier: assignment of weight W0, with W0 > 0  Class One Classifier: assignment of weight W1, with W1 < 0  The magnitudes of W0, W1 are inversely proportional to the number of vectors there are in each class of classifiers

11 Class Prediction Eg = (10, 25, 53, 78, 96, 122, 154, 198, 221, 256, 318, 455, 487, 503, 556, 601, 621, 647, 733, 785) If E(S1) > 455, S1 falls inside the homogeneous region R2, so gene g can vote for S1: S1 takes one vote for class 0. If E(S2) < 455, S2 falls in the mixed region, so gene g cannot vote for S2: S2 = ?

12  …for each Test Set Sample there are different informative genes that can predict its Class  Let n be the number of informative genes for this sample: if the weighted sum of their n votes is positive, this sample belongs to Class Zero  If the weighted sum is negative, this sample belongs to Class One ….So after this procedure, for all samples in the test set…
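
A hedged sketch of the whole voting step. The slides do not give the exact weight formula, so I assume W0 = +1/n0 and W1 = -1/n1 (n0, n1: the numbers of class-zero and class-one classifiers), which is one way to read "inversely proportional"; each classifier is stored with the threshold and side that define its homogeneous region:

    def predict(sample, class0_genes, class1_genes):
        """sample: dict mapping gene -> expression value.
        class0_genes / class1_genes: lists of (gene, threshold, side); a
        gene votes only if the sample falls inside its homogeneous region."""
        def in_region(value, threshold, side):
            return value > threshold if side == "above" else value < threshold

        w0 = 1.0 / len(class0_genes) if class0_genes else 0.0   # W0 > 0
        w1 = -1.0 / len(class1_genes) if class1_genes else 0.0  # W1 < 0

        score = 0.0
        for gene, thr, side in class0_genes:
            if gene in sample and in_region(sample[gene], thr, side):
                score += w0
        for gene, thr, side in class1_genes:
            if gene in sample and in_region(sample[gene], thr, side):
                score += w1
        if score == 0.0:
            return None       # no informative gene could vote for this sample
        return 0 if score > 0 else 1

    # Gene g from slide 11: votes for class 0 when expression is above 455
    print(predict({"g": 601}, [("g", 455, "above")], []))   # 0
    print(predict({"g": 120}, [("g", 455, "above")], []))   # None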

13

14 How can we use the information that different samples are characterized by different informative genes?

15 Distances between samples  Two samples are considered more closely related than two others if they share a larger number of informative genes that assign them to the same class.

16 Computation of distances between samples s0(i,j) = c0(i,j) / t0(i,j) and s1(i,j) = c1(i,j) / t1(i,j) are the measures of the similarity between sample i and sample j. c0(i,j) and c1(i,j) are the numbers of the common informative genes that classify samples i, j in class 0 and 1, respectively. t0(i,j) and t1(i,j) are the numbers of the total informative genes that classify samples i, j in class 0 and 1, respectively.
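
A sketch of the pairwise distance under one plausible reading of the lost formula: similarity = (c0 + c1) / (t0 + t1), distance = 1 - similarity, with "common" and "total" taken as set intersection and union. All of these combination choices are assumptions; only the four counts themselves come from the slide:

    def distance(i, j, genes0, genes1):
        """genes0[i] / genes1[i]: the sets of informative genes that assign
        sample i to class 0 / class 1."""
        c0 = len(genes0[i] & genes0[j])   # common class-0 informative genes
        c1 = len(genes1[i] & genes1[j])   # common class-1 informative genes
        t0 = len(genes0[i] | genes0[j])   # total class-0 informative genes
        t1 = len(genes1[i] | genes1[j])   # total class-1 informative genes
        if t0 + t1 == 0:
            return 1.0                    # nothing shared: maximal distance
        return 1.0 - (c0 + c1) / (t0 + t1)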

17

18  Construction of a distance matrix, with pairwise distances between all samples  For the visualization of the results we used the MEGA2 software, which constructs dendrograms  UPGMA Tree (Unweighted Pair Group Method using Arithmetic averages) = Hierarchical Clustering  Neighbor-Joining Tree
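
For readers without MEGA2: UPGMA is exactly average-linkage hierarchical clustering, so SciPy can reproduce that tree from the distance matrix (neighbor-joining is not in SciPy; phylogenetics packages such as Biopython offer it). A sketch with toy data:

    import numpy as np
    from scipy.spatial.distance import squareform
    from scipy.cluster.hierarchy import linkage, dendrogram

    n = 5                                         # toy number of samples
    rng = np.random.default_rng(0)
    d = rng.random((n, n))
    d = (d + d.T) / 2.0                           # symmetrize
    np.fill_diagonal(d, 0.0)                      # zero self-distances

    condensed = squareform(d)                     # SciPy wants condensed form
    tree = linkage(condensed, method="average")   # "average" = UPGMA
    layout = dendrogram(tree, no_plot=True,
                        labels=[f"Sample {k+1}" for k in range(n)])
    print(layout["ivl"])                          # leaf order of the dendrogram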

19  If we merge the Training Set and the Test Set into a new DataSet with all samples, we expect that the test-set samples belonging to a particular class will be placed near, or generally in the same group as, the training-set samples that belong to the same class.

20

21

22 Detection of Subgroups of Samples  Using the supervised-clustering method described above, we were able to detect subgroups of samples  …from the training set  …and from the test set

23 When we apply this algorithm to the Training Set, the classification itself carries no new information, because it follows the pre-existing labels (this is a consequence of the selection of informative genes: the regions are completely homogeneous, so a priori the classification of the Training Set will be consistent with the labels). …However, we can use this method in order to detect subgroups of samples inside a certain class.

24 The Neighbor-Joining Tree of the LN DataSet (Training Set)

25 Neighbor-Joining unrooted Tree of the Test Set, from West et al. To the right of the purple lines is the ER+ class, and to the left the ER- class; ind2 is misclassified.

26 Results from various datasets

27 Some problems of these methods  In some datasets there are only a few genes with homogeneous regions whose P value is larger than 0.7, and the weighted-voting method does not work very well when the P value is low  There is a need for enough samples from both classes in the training set

28 Detection of Groups of Genes that are able to classify a sample  We tried to determine the properties of a group of genes (pairs, triplets, …) that is able to predict the class of a sample  We do not consider as a “group” the set of “good” informative genes that are each able to classify a sample with a high percentage of success…

29  Instead, we look for groups of genes that individually are not such good class predictors  …however, they can predict the class precisely when they form a group  So the prediction strength is a property of the group, and not a property of the genes that constitute it

30 Detection of pairs of genes Eg1 = (78, 96, 122, 154, 198, 221, 256, 318, 455, 487, 503, 556, 601, 621, 647, 733, 785) Eg2 = (285, 305, 227, 512, 756, 820, 839, 872, 896, 907, 921, 965, 971, 985, 992, 995) Vg1 = (0 0 1 0 1 1 1 1 0 0 1 1 0 0 0 0 0) Vg2 = (0 1 1 0 1 1 0 1 0 1 0 0 0 0 0 0) Ig1 = (12, 16, 18, 13, 1, 20, 19, 6, 4, 15, 3, 8, 11, 10, 9, 14, 17) Ig2 = (17, 16, 18, 14, 1, 20, 9, 6, 10, 15, 3, 8, 12, 13, 19, 4) (Ig1, Ig2: the indices of the samples, in the sorted order of expression) P accepted > 0.5 (if P < 0.5 then it is impossible to find a pair of genes)

31  This means that not a single sample from class Zero can simultaneously have an expression value below (or above) a threshold A for g1 and below (or above) a threshold B for g2  This is possible only for samples from class One
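
A sketch of testing one candidate threshold pair (A for g1, B for g2). The quadrant the pair carves out must contain samples of only one class; its P value is then the fraction of that class it captures, mirroring the single-gene case (the function and parameter names are mine):

    def quadrant_p(e1, e2, labels, A, side_a, B, side_b):
        """e1, e2: expression values of genes g1 and g2 (one per sample).
        side_a, side_b: 'below' or 'above'. Returns (class, P) when the
        quadrant is homogeneous, otherwise None."""
        def hit(value, threshold, side):
            return value < threshold if side == "below" else value > threshold

        inside = [lab for v1, v2, lab in zip(e1, e2, labels)
                  if hit(v1, A, side_a) and hit(v2, B, side_b)]
        if not inside or len(set(inside)) != 1:
            return None                   # empty or mixed: not a classifier
        cls = inside[0]
        return cls, len(inside) / labels.count(cls)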

32  The procedure of Class Prediction is similar to the procedure described above for single genes  We do exactly the same for the detection of triplets of genes (P accepted > 1/3 ≈ 0.333)

33 Some advantages of this approach  We can utilize the samples from one class for the construction of classifiers for the other class; this is important when the number of samples in one class is small  The accepted P value is lower in this case, so we detect more genes that can potentially form pairs of genes  It allows for more complex biological mechanisms, which might be closer to reality

34 Thank you !!!!

