Identifying Ethnic Origins with A Prototype Classification Method Fu Chang Institute of Information Science Academia Sinica 2788-3799 ext. 1819


1 Identifying Ethnic Origins with A Prototype Classification Method Fu Chang Institute of Information Science Academia Sinica 2788-3799 ext. 1819 fchang@iis.sinica.edu.tw

2 DNA Polymorphism. One of two or more alternate forms (alleles) of a chromosomal locus that differ in nucleotide sequence or have variable numbers of repeated nucleotide units.

3 Genetic Marker. A segment of DNA with an identifiable physical location on a chromosome, whose inheritance can be followed. A marker can be a gene, or it can be some section of DNA with no known function. Because DNA segments that lie near each other on a chromosome tend to be inherited together, markers are often used as indirect ways of tracking the inheritance pattern of a gene that has not yet been identified, but whose approximate location is known.

4 Objective of Our Research. Genetic markers are known to contain information about individuals' inherited origins. We wish to use this information to model and infer ethnic identities. Moreover, we wish to identify the minimal amount of information needed for this purpose.

5 Construction of Feature Vectors. Each human chromosome has two copies, one inherited from the male parent and the other from the female parent. We denote one of the two copies as the u-copy and the other as the v-copy. Labeling a copy as u or v implies nothing about its parental origin.

6 Feature Vectors (Cont'd). For the l-th marker, the data we collect from its two copies are denoted u_l and v_l, l = 1, 2, …, L. We thus form two vectors out of the L markers: U = (u_1, u_2, …, u_L) and V = (v_1, v_2, …, v_L).

7 Feature Vectors (Cont'd). We derive a feature vector from the two vectors. When u_l and v_l assume only two values (1 if a certain feature has been found and 0 otherwise), we obtain the feature vector as F_1(U, V) = U + V.

8 Feature Vectors (Cont'd). If, on the other hand, u_l and v_l denote the number of times a marker repeats itself, we first have to transform the multi-valued U into a binary-valued U_b. The l-th component of U expands to N_l components of U_b, with a 1 at the position given by the value of u_l and 0 elsewhere, where N_l is the number of possible values of u_l. We transform V into V_b in the same way.

9 Feature Vectors (Cont'd). The feature vector is then obtained as F_2(U, V) = U_b + V_b. For example, if U = (1, 2), V = (3, 2), N_1 = 3, and N_2 = 4, we obtain U_b = (1, 0, 0, 0, 1, 0, 0), V_b = (0, 0, 1, 0, 1, 0, 0), and F_2(U, V) = (1, 0, 1, 0, 2, 0, 0). Conversely, from F_2(U, V) together with N_1 = 3 and N_2 = 4, we can immediately infer that U = (1, 2) and V = (3, 2), up to the arbitrary u/v labeling of the two copies.
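The construction of F_2 can be sketched in a few lines of Python. This is only an illustration, assuming that the values of the l-th marker are coded as integers 1, …, N_l; the function names `one_hot_sum` and `decode` are hypothetical and do not come from the presentation.

```python
def one_hot_sum(u, v, n_values):
    """Return F_2(U, V) = U_b + V_b as a flat list of counts.

    Assumes marker l takes integer values 1..n_values[l].
    """
    features = []
    for u_l, v_l, n_l in zip(u, v, n_values):
        block = [0] * n_l          # the N_l binary components for this marker
        block[u_l - 1] += 1        # contribution of the u-copy
        block[v_l - 1] += 1        # contribution of the v-copy
        features.extend(block)
    return features


def decode(features, n_values):
    """Recover the unordered pair of values at each marker from F_2."""
    pairs, start = [], 0
    for n_l in n_values:
        block = features[start:start + n_l]
        values = [k + 1 for k, count in enumerate(block) for _ in range(count)]
        pairs.append(tuple(values))
        start += n_l
    return pairs


f2 = one_hot_sum((1, 2), (3, 2), (3, 4))
print(f2)                   # [1, 0, 1, 0, 2, 0, 0]
print(decode(f2, (3, 4)))   # [(1, 3), (2, 2)]
```

The decoded pairs are unordered: the first marker carries the values 1 and 3, the second carries 2 twice, which matches U = (1, 2), V = (3, 2) once the arbitrary u/v labeling is fixed.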

10 Distance Measure. The L_2-distance between two feature vectors F = (f_1, f_2, …, f_L) and G = (g_1, g_2, …, g_L) is defined as d(F, G) = sqrt( (f_1 - g_1)^2 + (f_2 - g_2)^2 + … + (f_L - g_L)^2 ).

11 Prototype Classification Method. The proposed method is basically a clustering method. Unlike many other clustering methods, ours constructs homogeneous clusters. It starts with a set of training samples labeled with their ethnic identity. A learning algorithm then determines the number as well as the locations of the prototypes, where prototypes are defined as cluster centers.

12 Learning Algorithm. The algorithm consists of two loops. The outer loop decides whether all clusters are homogeneous and, when they are not, identifies those ethnic types for which we want to build more prototypes. The inner loop computes the prototype locations for the number of prototypes specified by the outer loop.

13 Fuzzy C-Means (FCM). We use the FCM clustering technique in the inner loop to compute the prototype locations. FCM assigns, to each sample s and a given cluster center C, a grade of membership that varies inversely with the distance between s and C. The cluster center is the weighted average of all samples, with the grades of membership serving as the weights. This technique relies on an iterative process to find the locations of the cluster centers.

14 Updating Formulas. Samples: x_1, x_2, …, x_N. Prototypes: p_1, p_2, …, p_I. Grades of membership: u_ij.
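The updating formulas themselves did not survive in this transcript. As a rough illustration, the sketch below uses the textbook fuzzy c-means updates, which may differ from the exact form on the slide. The fuzzifier `m`, the index convention `u[i][j]` (prototype i, sample j), the stopping tolerance, and all function names are assumptions introduced here.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def memberships(samples, prototypes, m=2.0):
    """u[i][j]: grade of membership of sample j in the cluster of prototype i."""
    u = [[0.0] * len(samples) for _ in prototypes]
    for j, x in enumerate(samples):
        # Weights fall off with distance; the floor avoids division by zero.
        inv = [max(sq_dist(x, p), 1e-12) ** (-1.0 / (m - 1.0)) for p in prototypes]
        total = sum(inv)
        for i, w in enumerate(inv):
            u[i][j] = w / total    # memberships of one sample sum to 1
    return u


def update_prototypes(samples, u, m=2.0):
    """Each prototype becomes the weighted average of all samples."""
    dim = len(samples[0])
    new = []
    for row in u:
        weights = [g ** m for g in row]
        total = sum(weights)
        new.append([sum(w * x[k] for w, x in zip(weights, samples)) / total
                    for k in range(dim)])
    return new


def fcm(samples, init_prototypes, m=2.0, n_iter=100, tol=1e-6):
    """Alternate the two updates until the prototypes stop moving."""
    prototypes = [list(p) for p in init_prototypes]
    for _ in range(n_iter):
        u = memberships(samples, prototypes, m)
        new = update_prototypes(samples, u, m)
        shift = max(abs(a - b) for p, q in zip(prototypes, new) for a, b in zip(p, q))
        prototypes = new
        if shift < tol:
            break
    return prototypes, memberships(samples, prototypes, m)


samples = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
prototypes, u = fcm(samples, init_prototypes=[[0.0, 0.0], [10.0, 10.0]])
print(prototypes)
```

In the presentation's setting, the samples would be the feature vectors of one ethnic type and the number of prototypes would be supplied by the outer loop; how the initial prototypes are seeded is not specified beyond what the slides state.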

15 Special Control: Futile Samples. To ensure that the construction process terminates, we apply a special control. If an unabsorbed sample s is used as the seed for generating new C-prototypes, we check whether this addition produces any empty range. The range of a prototype p is defined as the set of samples of the same class type that find p as their nearest prototype. If some prototype range is empty, we declare s futile and restore all the old C-prototypes.
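The empty-range check can be illustrated with a small sketch. It assumes that "nearest prototype" means nearest among all prototypes of every class, and that each prototype is stored as a (vector, class label) pair. Both choices, and the function names, are assumptions made for this illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance; it ranks neighbours the same as the L_2-distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def prototype_ranges(samples, labels, prototypes):
    """For each prototype, list the indices of the samples in its range.

    prototypes is a list of (vector, class_label) pairs. A sample belongs to
    the range of its nearest prototype only if their class labels agree.
    """
    ranges = [[] for _ in prototypes]
    for j, (x, label) in enumerate(zip(samples, labels)):
        nearest = min(range(len(prototypes)),
                      key=lambda i: sq_dist(x, prototypes[i][0]))
        if prototypes[nearest][1] == label:
            ranges[nearest].append(j)
    return ranges


def has_empty_range(samples, labels, prototypes):
    """True if at least one prototype attracts no sample of its own class."""
    return any(not r for r in prototype_ranges(samples, labels, prototypes))


samples = [(0, 0), (1, 0), (10, 10)]
labels = ["A", "A", "B"]
prototypes = [((0, 0), "A"), ((10, 10), "B"), ((50, 50), "A")]
print(prototype_ranges(samples, labels, prototypes))  # [[0, 1], [2], []]
print(has_empty_range(samples, labels, prototypes))   # True
```

In this toy case the third prototype has an empty range, so a seed sample that led to it would be declared futile and the previous prototypes restored.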

16 Termination of Learning Process. The process terminates when all samples are either absorbed or declared futile. A sample x is absorbed if there is a prototype p of the same class type as x such that x is closer to p than to every other prototype q.
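A minimal sketch of the absorption test and the termination condition follows. The inequality on the slide was lost in the transcript, so the code assumes the reading "the nearest prototype of the same class is strictly closer than every prototype of a different class". The data layout and function names are hypothetical.

```python
import math


def l2_distance(f, g):
    """L_2 (Euclidean) distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))


def is_absorbed(sample, label, prototypes):
    """prototypes is a list of (vector, class_label) pairs."""
    same = [l2_distance(sample, p) for p, c in prototypes if c == label]
    other = [l2_distance(sample, q) for q, c in prototypes if c != label]
    if not same:
        return False               # no prototype of the sample's class yet
    return not other or min(same) < min(other)


def learning_finished(samples, labels, prototypes, futile):
    """True when every sample is absorbed or its index is in the futile set."""
    return all(j in futile or is_absorbed(x, c, prototypes)
               for j, (x, c) in enumerate(zip(samples, labels)))


prototypes = [((0, 0), "A"), ((10, 10), "B")]
print(is_absorbed((1, 1), "A", prototypes))  # True
print(is_absorbed((9, 9), "A", prototypes))  # False
```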

17 Progress of Prototype Construction

18 Selection of Genetic Markers. {C_1, C_2, …, C_i, …}: the collection of ethnic groups. {m_1, m_2, …, m_j, …}: all possible values that can be obtained from marker m. The two copies of the same marker are viewed as two independent samples.

19 Information Gain. |C_i| = the number of samples whose ethnic type is C_i; |m_j| = the number of samples whose m-marker assumes value m_j; |C_i ∩ m_j| = the number of samples whose ethnic type is C_i and whose marker m assumes value m_j.

20 Selection of Markers. The information gain is the difference between the entropy and the conditional entropy. It measures how much the uncertainty about class types is reduced by the information carried by the marker m. We rank all markers by this metric and form the set consisting of the top-n markers.
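The ranking step can be sketched as follows, using the standard definition of information gain, H(C) - H(C | m), computed from the counts described on the previous slide. The base-2 logarithm, the data layout (one list of marker values per marker, aligned with the list of class labels, with the two copies of a marker entered as two separate samples), and the function names are assumptions made for this illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given by raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(labels, values):
    """Entropy of the class labels minus their entropy given the marker value."""
    n = len(labels)
    h_class = entropy(Counter(labels).values())
    joint = Counter(zip(values, labels))       # counts of (m_j, C_i) pairs
    classes = set(labels)
    h_cond = 0.0
    for value, n_value in Counter(values).items():
        class_counts = [joint[(value, c)] for c in classes]
        h_cond += (n_value / n) * entropy(class_counts)
    return h_class - h_cond


def top_n_markers(labels, marker_columns, n):
    """Rank marker names by information gain and keep the n best."""
    ranked = sorted(marker_columns,
                    key=lambda name: information_gain(labels, marker_columns[name]),
                    reverse=True)
    return ranked[:n]


labels = ["A", "A", "B", "B"]
markers = {"m1": [1, 1, 2, 2],   # separates the two classes perfectly
           "m2": [1, 2, 1, 2]}   # carries no information about the class
print(top_n_markers(labels, markers, 1))  # ['m1']
```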

21 Experiment Results. When the population groups have distinct ethnic origins, the number of prototypes and the test accuracy are relatively stable with respect to n, as long as n exceeds a threshold. On the other hand, when the groups are highly admixed, both the number of prototypes and the test accuracy become unstable.

22 Datasets

23 DATASET A

24 DATASET B

25 DATASET C

26 DATASET D

27 Separability of Population Groups. Interclass variation: Let there be N populations, each containing samples with L markers. The interclass variation is defined as where μ and σ stand for mean and standard deviation, respectively. Alternatives: training error, percentage of overlapping clusters.

28 Interclass Variations
Dataset    Interclass Variation
A          3026.2
B          2821.9
C          2618.9
D          2468.4

29 Results for Dataset A

30 Results for Dataset B

31 Results for Dataset C

32 Results for Dataset D

