Feature Generation and Cluster-based Feature Selection.


1 Feature Generation and Cluster-based Feature Selection

2 Feature Generation • A case study – Some data has no obvious features (sequence data) • Feature generation method • Feature selection

3 Virus

4 Virus Recombination 4

5 Genome Distance 5

6 Previous Work on Genome Distance Measurement – Multiple Sequence Alignment – Gene Content Based – Data Compression Based

7 Complete Composition Vector Composition information embedded in one genome sequence – R: genome sequence of length L – Elements: {A,G,C,T} – K: maximum pattern length – f(a_1 a_2 … a_k): appearance probability of the string a_1 a_2 … a_k in R

8 Composition Value – Second Order Markov Model 8

9 Composition Value ATATCTATATACT f(ATA)=3/(13-3+1)=3/11, f(AT)=4/(13-2+1)=4/12, f(TA)=4/(13-2+1)=4/12, f(T)=6/13. Expected appearance probability q(ATA)=f(AT)*f(TA)/f(T)=13/54. CCV(ATA)=(f(ATA)-q(ATA))/q(ATA)=19/143
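To make this arithmetic concrete, here is a minimal Python sketch of the composition value under the slide's definitions (function names are mine, not from the slides):

```python
# Minimal sketch: f(s) = count(s) / (L - len(s) + 1); q and CCV follow the
# second-order Markov model from the previous slides.

def f(seq, s):
    """Appearance probability of pattern s in seq."""
    count = sum(1 for i in range(len(seq) - len(s) + 1) if seq[i:i + len(s)] == s)
    return count / (len(seq) - len(s) + 1)

def ccv_value(seq, s):
    """Composition value (f - q) / q for a k-mer s with k >= 3."""
    q = f(seq, s[:-1]) * f(seq, s[1:]) / f(seq, s[1:-1])  # expected probability
    return (f(seq, s) - q) / q

seq = "ATATCTATATACT"
print(f(seq, "ATA"))          # 3/11 ~ 0.2727
print(ccv_value(seq, "ATA"))  # 19/143 ~ 0.1329
```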

10 Complete Composition Vector Using the complete composition vector to represent the whole genome – String selection

11 String Selection Score Function (Relative Entropy) 11

12 Pair-wise Evolution Distance Given two sequences R and S – Their composition vectors CCV(R) and CCV(S) – Euclidean distance d(R,S) between the two vectors

13 HIV-1 Genotyping Identify HIV-1 genotypes – Three major groups: M (further categorized into 9 subtypes: A-D, F-H, J, K), O, N – Recombinant strains

14 Neighbor Joining on 42 Pure Subtype Strains

15 Genotyping Classifier – Mean-Classifier Leave-One-Out Cross Validation Independent Testing 15

16 Genotyping Classifier – Mean-Classifier Leave-One-Out Cross Validation Independent Testing 16

17 Genotyping Results Top 500 Strings by Relative Entropy – Pure Subtype Prediction: LOOCV Accuracy on Training Dataset 100%, Independent Test 100% – Myers et al. 2005: 96.4% – Oliveira et al. 2005: 99.2% – Rozanov et al. 2004: 99.5%

18 Genotyping Results Recombinant Accuracy 87.3% 18

19 Feature Selection Two feature selection categories: – Selecting the topmost features with the most individual discriminatory power – Selecting a group of features with the most overall discriminatory power

20 Feature Selection Examples of feature selection algorithms – F-test: compute a score for each feature and select the topmost features with the highest scores – SFS/SFFS: iteratively include one more feature at a time, then try removing each feature already in the selected pool to see which remaining combination performs best
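As a hedged illustration of the F-test idea (ranking features by an individual score), the sketch below computes a simple F-like ratio of between-class to within-class variance per feature and keeps the top-scoring ones; function names and the toy data are mine:

```python
import numpy as np

def f_scores(X, y):
    """F-like score per feature: between-class variance / within-class variance."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    k, n = len(classes), len(y)                   # degrees of freedom: k-1 and n-k
    return (between / (k - 1)) / (within / (n - k))

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = np.repeat([0, 1, 2], 20)
X[y == 1, 0] += 3.0                               # make feature 0 informative
print(np.argsort(f_scores(X, y))[::-1][:3])       # feature 0 should rank first
```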

21 Cluster-based Feature Selection Cluster-based feature selection 21 Team A: Team B:

22 Cluster-based Feature Selection 22

23 Cluster-based Feature Selection  Perform feature clustering  Choose from each cluster the topmost features which have the most powerful individual discriminatory power 23

24 Discrimination Power Vector

Class means:
            Class 1     Class 2     Class 3
Feature 1   Mean1(1)    Mean1(2)    Mean1(3)
Feature 2   Mean2(1)    Mean2(2)    Mean2(3)

Discrimination power vector (absolute differences of class means):
            Class 2 - Class 1       Class 3 - Class 1       Class 3 - Class 2
Feature 1   |Mean1(2) - Mean1(1)|   |Mean1(3) - Mean1(1)|   |Mean1(3) - Mean1(2)|
Feature 2   |Mean2(2) - Mean2(1)|   |Mean2(3) - Mean2(1)|   |Mean2(3) - Mean2(2)|

25 Cluster-based Feature Selection  Perform feature clustering  Choose from each cluster the topmost features which have the most powerful individual discriminatory power 25
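A minimal sketch of this cluster-based procedure, assuming scikit-learn's KMeans is available: features are clustered by their discrimination power vectors (absolute pairwise differences of class means, as on slide 24), and the feature with the strongest individual score is taken from each cluster. The cluster count and the scoring proxy are illustrative choices, not prescribed by the slides.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

def discrimination_power_vectors(X, y):
    """One row per feature: |mean_c2 - mean_c1| for every pair of classes."""
    classes = np.unique(y)
    means = np.vstack([X[y == c].mean(axis=0) for c in classes])   # (n_classes, n_features)
    cols = [np.abs(means[j] - means[i]) for i, j in combinations(range(len(classes)), 2)]
    return np.column_stack(cols)                                   # (n_features, n_pairs)

def cluster_based_selection(X, y, n_clusters=5, per_cluster=1):
    dpv = discrimination_power_vectors(X, y)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(dpv)
    score = dpv.sum(axis=1)            # simple proxy for individual discriminatory power
    selected = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        best = members[np.argsort(score[members])[::-1][:per_cluster]]
        selected.extend(best.tolist())
    return sorted(selected)

rng = np.random.default_rng(1)
X = rng.normal(size=(90, 20))
y = np.repeat([0, 1, 2], 30)
X[y == 2, 3] += 2.0
print(cluster_based_selection(X, y))
```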

26 FMDV Genotyping Datasets – Foot-and-Mouth Disease Virus (7 subtypes) Genotyper – Linear kernel support vector machine (SVM) genotyper – Mean-genotyper

27 Genotyping Results 27

28 Experiments of Cluster Based Method

29 Linear Regression

30 Application Credit Approval (Credit Limit) x_1 = Score, x_2 = Salary, x_3 = Status Y = 0.5x_1 + 100 Y = 0.5x_1 + x_2/100 + 100 Y = 0.5x_1 + x_2/100 + 20x_3 + 100
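A tiny illustration of the third model above (the coefficients are the slide's toy values; the function name is mine):

```python
def credit_limit(score, salary, status):
    """Y = 0.5*x1 + x2/100 + 20*x3 + 100, as on the slide."""
    return 0.5 * score + salary / 100 + 20 * status + 100

print(credit_limit(score=700, salary=50000, status=1))  # 350 + 500 + 20 + 100 = 970
```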

31 Linear Fitting the Data • We want to fit a linear function to an observed set of points X = [x_1, x_2, x_3, …, x_n] with associated labels Y = [y_1, y_2, y_3, …, y_n]. o After we fit the function, we can use it to predict a new y for a new x. • Find the function that minimizes the sum (or the average) of squared distances between the actual y's in the training set and the predicted ones (Least Square). The fitted line y = ax + b is used as the predictor for the data points (x_i, y_i).

32 Linear Fitting the Data • We want to fit a linear function to an observed set of points X = [x_1, x_2, x_3, …, x_n] with associated labels Y = [y_1, y_2, y_3, …, y_n]. o After we fit the function, we can use it to predict a new y for a new x. • Find the function that minimizes the sum (or the average) of squared distances between the actual y's in the training set and the predicted ones (Least Square). The fitted line y = ax + b is used as the predictor for the data points (x_i, y_i).

33 Linear Function Linear least square fitting with X in R^2 General form: 1D case (X = R): a line 2D case (X = R^2): a plane

34 Loss function Suppose target labels come from a set Y – Binary classification: Y = {0, 1} – Regression: Y = R (the real numbers) A loss function L(ŷ, y) maps decisions to costs: – it defines the penalty for predicting ŷ when the true value is y. Standard choice for classification: 0/1 loss (same as misclassification error). Standard choice for regression: squared loss L(ŷ, y) = (ŷ - y)^2.
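A tiny Python illustration of these two standard loss choices (function names are mine):

```python
def zero_one_loss(y_pred, y_true):
    """0/1 loss for classification: 1 if misclassified, 0 otherwise."""
    return 0 if y_pred == y_true else 1

def squared_loss(y_pred, y_true):
    """Squared loss for regression."""
    return (y_pred - y_true) ** 2

print(zero_one_loss(1, 0), squared_loss(2.5, 3.0))  # 1 0.25
```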

35 Linear Fitting the Data Minimize the error function E(a,b) = Σ_i (a x_i + b - y_i)^2 over the data points (x_i, y_i).

36 Example Use least squares to find the equation of the line that best approximates the points (2,0), (-1,1) and (0,2). We want to find the best line y = ax + b to fit the three points.

37 Example Minimize the error function E(a,b) = (2a+b)^2 + (-a+b-1)^2 + (b-2)^2. To find the optimal a, b, set the derivatives w.r.t. a and b equal to zero: 10a+2b+2=0 and a+3b-3=0, giving a=-3/7 and b=8/7; therefore y = -3/7 x + 8/7.

38 Example Use least squares to find the equation of the line y = ax + b that best approximates the points (1,6), (2,5), (3,7) and (4,10). Result: y = 1.4x + 3.5.
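Both worked examples can be checked numerically; a minimal sketch using numpy's least-squares solver (the helper name is mine):

```python
import numpy as np

def fit_line(points):
    """Least-squares fit of y = a*x + b to a list of (x, y) points."""
    x, y = np.array(points, dtype=float).T
    A = np.column_stack([x, np.ones_like(x)])        # design matrix [x, 1]
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a, b

print(fit_line([(2, 0), (-1, 1), (0, 2)]))           # (-3/7, 8/7) ~ (-0.4286, 1.1429)
print(fit_line([(1, 6), (2, 5), (3, 7), (4, 10)]))   # (1.4, 3.5)
```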

39 Approaches to solve Ax ≈ b – Normal equations: quick and dirty – QR: standard in libraries, uses an orthogonal decomposition – SVD: a decomposition which also gives an indication of how linearly independent the columns are – Conjugate gradient: no decompositions, good for large sparse problems

40 Ax=b For example, for the points (1,6), (2,5), (3,7) and (4,10) with the model y = ax + b: A = [[1,1],[2,1],[3,1],[4,1]], x = [a, b]^T, b = [6, 5, 7, 10]^T. Normal system of equations, general form: A^T A x = A^T b.

41 Least Square Ax=b ⇒ A^T A x = A^T b ⇒ x = (A^T A)^{-1} A^T b
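A hedged numpy sketch of solving the same example via the normal equations (in practice QR or SVD, covered next, are preferred for numerical stability):

```python
import numpy as np

# Data points (1,6), (2,5), (3,7), (4,10); model y = a*x + b.
A = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0],
              [4.0, 1.0]])
b = np.array([6.0, 5.0, 7.0, 10.0])

# Normal equations: A^T A x = A^T b  =>  x = (A^T A)^{-1} A^T b
x = np.linalg.solve(A.T @ A, A.T @ b)
print(x)  # [1.4, 3.5]
```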

42 QR Factorization • A matrix Q is said to be orthogonal if its columns are orthonormal, i.e. Q^T Q = I. • Orthogonal transformations preserve the Euclidean norm, since ||Qv||^2 = (Qv)^T (Qv) = v^T Q^T Q v = v^T v = ||v||^2. • Orthogonal matrices can transform vectors in various ways, such as rotations or reflections, but they do not change the Euclidean length of the vector. Hence, they preserve the solution to a linear least squares problem.

43 QR Factorization • If A is an m×n matrix with linearly independent columns, then A can be decomposed as A=QR, where Q is an m×n matrix whose columns form an orthonormal basis for the column space of A and R is a nonsingular upper triangular matrix, with (q_i, q_j) = 0 for i ≠ j and |q_i| = 1.

44 Gram-Schmidt Orthonormalization Process Linearly independent set a = {a_1, a_2, …, a_n}, with (β_i, β_j) = 0 for i ≠ j
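A minimal classical Gram-Schmidt sketch producing the Q and R factors (kept simple for clarity; modified Gram-Schmidt or Householder reflections are preferred numerically):

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt: A = Q R with orthonormal columns in Q."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]   # projection coefficient onto q_i
            v -= R[i, j] * Q[:, i]        # subtract the component along q_i
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R

A = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
Q, R = gram_schmidt_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(2)))  # True True
```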

45 QR Factorization

46

47 We get A=UR, where U=(θ_1, θ_2, θ_3, …, θ_n). Then,

48 Example of QR Factorization

49 Example of QR Factorization Applying the Gram-Schmidt process to compute the QR decomposition 1st Step: 2nd Step: 3rd Step:

50 Example of QR Factorization 4th Step: 5th Step: 6th Step:

51 Therefore, A=QR QR decomposition is widely used in computer codes to find the eigenvalues of a matrix, to solve linear systems, and to find least squares approximations.

52 Least square solution using QR Decomposition The least squares solution of Xb = y is b̂ = (X^T X)^{-1} X^T y. Let X = QR. Then X^T X = R^T Q^T Q R = R^T R. Therefore, b̂ = (R^T R)^{-1} R^T Q^T y = R^{-1} Q^T y.

53 1. Least Square: Ax=b ⇒ A^T A x = A^T b 2. A=QR ⇒ R^T Q^T Q R x = R^T Q^T b 3. Q^T Q = I ⇒ R^T R x = R^T Q^T b 4. Rx = Q^T b

54 Least square solution using QR Decomposition • Running Time 1. QR factorization of A: A = QR (2mn^2 flops) 2. Form d = Q^T b (2mn flops) 3. Solve Rx = d by back substitution (n^2 flops) Cost for large m, n: 2mn^2 flops
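A sketch of these three steps using numpy's QR and, as an assumption, scipy's triangular solver for the back substitution, on the same four example points:

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]])
b = np.array([6.0, 5.0, 7.0, 10.0])

Q, R = np.linalg.qr(A)          # 1. QR factorization (reduced: Q is m-by-n)
d = Q.T @ b                     # 2. form d = Q^T b
x = solve_triangular(R, d)      # 3. solve Rx = d by back substitution
print(x)                        # [1.4, 3.5]
```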

55 QR for Least Square

56 Singular Value Decomposition A=USV^T – The singular values are the diagonal entries of the S matrix and are arranged in descending order – The singular values are always real (non-negative) numbers – If A is a real matrix, U and V are also real

57 Singular Value Decomposition The SVD of an m-by-n matrix A is given by the formula A = U S V^T, where: U is an m-by-m matrix of the orthonormal eigenvectors of A A^T, that is, U^T = U^{-1}; V^T is the transpose of an n-by-n matrix containing the orthonormal eigenvectors of A^T A, that is, V^T = V^{-1}; S is an m-by-n diagonal matrix of the singular values, which are the square roots of the eigenvalues of A^T A.
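A short numpy illustration of the factorization; note that numpy returns the singular values as a 1-D array, so the sketch embeds them in an m-by-n S to check A = U S V^T:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 3.0], [1.0, 1.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(s)                           # singular values, in descending order
S = np.zeros(A.shape)
S[:len(s), :len(s)] = np.diag(s)   # embed singular values in an m-by-n S
print(np.allclose(U @ S @ Vt, A))  # True: A = U S V^T
print(np.allclose(s**2, np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]))  # squares of s = eigenvalues of A^T A
```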

58 SVD for Least Square
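As a hedged sketch, the least-squares solution can be obtained from the SVD as x = V S⁺ U^T b (the pseudo-inverse applied to b), again on the four example points:

```python
import numpy as np

A = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]])
b = np.array([6.0, 5.0, 7.0, 10.0])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
x = Vt.T @ ((U.T @ b) / s)        # x = V S^{-1} U^T b (pseudo-inverse)
print(x)                          # [1.4, 3.5]
print(np.linalg.pinv(A) @ b)      # same result via numpy's pinv
```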

59

60

61

62 Example

63

64 Conjugate Gradient Solving the linear equation system Ax = b. Problem: the dimension n is too big, or there is not enough time for Gaussian elimination. Iterative methods are used to get an approximate solution. A is known, square, symmetric, positive-definite. Definition of an iterative method: given a starting point, take steps that hopefully converge to the right solution x.

65 Conjugate Gradient We from now on assume we want to minimize the quadratic function f(x) = ½ x^T A x - b^T x + c. This is equivalent to solving the linear problem Ax = b. There are generalizations to general functions.

66 Background for Gradient Methods The min(max) problem: But we learned in calculus how to solve that kind of question!

67 Directional Derivatives

68 Directional Derivatives : In general direction…

69 Directional Derivatives

70 The Gradient: Definition in the plane (R^2)

71 The Gradient: Definition

72 The Gradient Properties The gradient defines a (hyper)plane approximating the function infinitesimally

73 The Gradient properties By the chain rule: (important for later use)

74 The Gradient Properties Proposition 1: the directional derivative is maximal when choosing the direction of the gradient, and minimal when choosing the opposite direction (intuitive: the gradient points in the direction of greatest change)

75 The Gradient Properties We found the best INFINITESIMAL DIRECTION at each point. Looking for the minimum: a "blind man" procedure. How can we derive the way to the minimum using this knowledge?

76 Steepest Descent Steepest descent algorithm: Data: a starting point x_0 Step 0: set i = 0 Step 1: if the gradient vanishes, stop; else, compute the search direction h_i = -∇f(x_i) Step 2: compute the step-size λ_i that minimizes f(x_i + λ h_i) Step 3: set x_{i+1} = x_i + λ_i h_i, i = i + 1, and go to step 1
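For the quadratic function of slide 65, f(x) = ½ x^T A x - b^T x, the gradient is Ax - b and the exact line-search step size has the closed form λ = r^T r / (r^T A r) with r = b - Ax; a minimal Python sketch of the procedure above (names are mine):

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=1000):
    """Minimize f(x) = 0.5 x^T A x - b^T x for symmetric positive-definite A."""
    x = x0.astype(float)
    for _ in range(max_iter):
        r = b - A @ x                    # residual = negative gradient
        if np.linalg.norm(r) < tol:      # Step 1: stop at a critical point
            break
        lam = (r @ r) / (r @ (A @ r))    # Step 2: exact line-search step size
        x = x + lam * r                  # Step 3: move along the search direction
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(steepest_descent(A, b, np.zeros(2)))   # ~ [1/11, 7/11]
print(np.linalg.solve(A, b))                 # exact solution for comparison
```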

77 Steepest Descent

78 The steepest descent finds a critical point and a local minimum. Implicit step-size rule: actually we reduced the problem to finding the minimum of a one-dimensional function along the search direction. There are extensions that give the step-size rule in a discrete sense (Armijo).

79 Conjugate Gradient We from now on assume we want to minimize the quadratic function f(x) = ½ x^T A x - b^T x + c. This is equivalent to solving the linear problem Ax = b. There are generalizations to general functions.

80 Conjugate Gradient What is the problem with steepest descent? We can repeat the same directions over and over… Conjugate gradient takes at most n steps.

81 Conjugate Gradient We say that two non-zero vectors u and v are conjugate (with respect to A) if u^T A v = 0. If A = I, orthogonal vectors are a special case of conjugate vectors.

82 Conjugate Gradient Search directions – the conjugate search directions should span the whole space

83 Conjugate Gradient Given a search direction, how do we calculate the step size? (as before, by exact line search)

84 Conjugate Gradient How do we find the search directions? We want that after n steps the error will be 0:

85 Conjugate Gradient Here is an idea: if then: So if,

86 Conjugate Gradient So we look for coefficients such that: A simple calculation shows this holds if we take the directions A-conjugate (A-orthogonal)

87 Conjugate Gradient We have to find an A-conjugate basis. We can do a "Gram-Schmidt" process, but we should be careful since it is an O(n³) process, starting from some series of vectors

88 Conjugate Gradient So for an arbitrary choice of starting vectors we gain nothing. Luckily, we can choose them so that the conjugate direction calculation is O(m), where m is the number of non-zero entries in A. The correct choice is:

89 Conjugate Gradient So the conjugate gradient algorithm for minimizing f: Data: Step 0: Step 1: Step 2: Step 3: Step 4: and repeat n times.

90 Conjugate Gradient Algorithm Start with an initial trial point → find the search direction → find the step size that minimizes f along it → check: is it the optimum? If yes, stop; if no, find the next search direction and repeat.
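A minimal sketch of the standard conjugate gradient iteration for a symmetric positive-definite A, using the textbook update formulas (variable names are mine, not necessarily the slides'):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10):
    """Solve Ax = b (A symmetric positive-definite); at most n steps in exact arithmetic."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x                         # residual
    d = r.copy()                          # first search direction = residual
    for _ in range(n):
        if np.linalg.norm(r) < tol:
            break
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)        # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * Ad            # updated residual
        beta = (r_new @ r_new) / (r @ r)  # makes the next direction A-conjugate
        d = r_new + beta * d
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))           # ~ [1/11, 7/11], reached in at most 2 steps
```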

