Feature Generation and Cluster-based Feature Selection.


1 Feature Generation and Cluster-based Feature Selection

2 Feature Generation • A case study – Some data has no obvious features (sequence data) • Feature generation method • Feature selection

3 Virus

4 Virus Recombination 4

5 Genome Distance 5

6 Previous Work on Genome Distance Measurement – Multiple Sequence Alignment – Gene Content Based – Data Compression Based

7 Complete Composition Vector Composition information embedded in one genome sequence – R: genome sequence of length L – Elements: {A,G,C,T} – K: maximum pattern length – f(a_1 a_2 … a_k): appearance probability of the string a_1 a_2 … a_k in R

8 Composition Value – Second Order Markov Model 8

9 Composition Value ATATCTATATACT f(ATA)=3/(13-3+1)=3/11, f(AT)=4/(13-2+1)=4/12, f(TA)=4/(13-2+1)=4/12, f(T)=6/13. Expected appearance probability q(ATA)=f(AT)*f(TA)/f(T)=13/54. CCV(ATA)=(f(ATA)-q(ATA))/q(ATA)=19/143
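To make this arithmetic concrete, here is a minimal Python sketch of the composition value under the slide's definitions (function names are mine, not from the slides):

```python
# Minimal sketch: f(s) = count(s) / (L - len(s) + 1); q and CCV follow the
# second-order Markov model from the previous slides.

def f(seq, s):
    """Appearance probability of pattern s in seq."""
    count = sum(1 for i in range(len(seq) - len(s) + 1) if seq[i:i + len(s)] == s)
    return count / (len(seq) - len(s) + 1)

def ccv_value(seq, s):
    """Composition value (f - q) / q for a k-mer s with k >= 3."""
    q = f(seq, s[:-1]) * f(seq, s[1:]) / f(seq, s[1:-1])  # expected probability
    return (f(seq, s) - q) / q

seq = "ATATCTATATACT"
print(f(seq, "ATA"))          # 3/11 ~ 0.2727
print(ccv_value(seq, "ATA"))  # 19/143 ~ 0.1329
```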

10 Complete Composition Vector Using the complete composition vector to represent the whole genome – String selection

11 String Selection Score Function (Relative Entropy) 11

12 Pair-wise Evolution Distance Given two sequences R and S – Their composition vectors CCV(R) and CCV(S) – Euclidean distance d(R,S) between the two vectors

13 HIV-1 Genotyping Identify HIV-1 genotypes – Three major groups: M (further categorized into 9 subtypes: A-D, F-H, J, K), O, N – Recombinant strains

14 Neighbor Joining on 42 Pure Subtype Strains

15 Genotyping Classifier – Mean-Classifier Leave-One-Out Cross Validation Independent Testing 15

16 Genotyping Classifier – Mean-Classifier Leave-One-Out Cross Validation Independent Testing 16

17 Genotyping Results Top 500 Strings by Relative Entropy – Pure Subtype Prediction: LOOCV Accuracy on Training Dataset 100%, Independent Test 100% – Myers et al. 2005: 96.4% – Oliveira et al. 2005: 99.2% – Rozanov et al. 2004: 99.5%

18 Genotyping Results Recombinant Accuracy 87.3% 18

19 Feature Selection Two feature selection categories: – Selecting the topmost features with the most individual discriminatory power – Selecting a group of features with the most overall discriminatory power

20 Feature Selection Examples of feature selection algorithms – F-test: compute a score for each feature and select the topmost features with the highest scores – SFS/SFFS: iteratively include one more feature at a time, then try removing each feature already in the selected pool to see which remaining combination performs best
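As a hedged illustration of the F-test idea (ranking features by an individual score), the sketch below computes a simple F-like ratio of between-class to within-class variance per feature and keeps the top-scoring ones; function names and the toy data are mine:

```python
import numpy as np

def f_scores(X, y):
    """F-like score per feature: between-class variance / within-class variance."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    k, n = len(classes), len(y)                   # degrees of freedom: k-1 and n-k
    return (between / (k - 1)) / (within / (n - k))

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = np.repeat([0, 1, 2], 20)
X[y == 1, 0] += 3.0                               # make feature 0 informative
print(np.argsort(f_scores(X, y))[::-1][:3])       # feature 0 should rank first
```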

21 Cluster-based Feature Selection Cluster-based feature selection 21 Team A: Team B:

22 Cluster-based Feature Selection 22

23 Cluster-based Feature Selection  Perform feature clustering  Choose from each cluster the topmost features which have the most powerful individual discriminatory power 23

24 Discrimination Power Vector

Class means:
            Class 1     Class 2     Class 3
Feature 1   Mean1(1)    Mean1(2)    Mean1(3)
Feature 2   Mean2(1)    Mean2(2)    Mean2(3)

Discrimination power vector (absolute differences of class means):
            Class 2 - Class 1       Class 3 - Class 1       Class 3 - Class 2
Feature 1   |Mean1(2) - Mean1(1)|   |Mean1(3) - Mean1(1)|   |Mean1(3) - Mean1(2)|
Feature 2   |Mean2(2) - Mean2(1)|   |Mean2(3) - Mean2(1)|   |Mean2(3) - Mean2(2)|

25 Cluster-based Feature Selection  Perform feature clustering  Choose from each cluster the topmost features which have the most powerful individual discriminatory power 25
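A minimal sketch of this cluster-based procedure, assuming scikit-learn's KMeans is available: features are clustered by their discrimination power vectors (absolute pairwise differences of class means, as on slide 24), and the feature with the strongest individual score is taken from each cluster. The cluster count and the scoring proxy are illustrative choices, not prescribed by the slides.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

def discrimination_power_vectors(X, y):
    """One row per feature: |mean_c2 - mean_c1| for every pair of classes."""
    classes = np.unique(y)
    means = np.vstack([X[y == c].mean(axis=0) for c in classes])   # (n_classes, n_features)
    cols = [np.abs(means[j] - means[i]) for i, j in combinations(range(len(classes)), 2)]
    return np.column_stack(cols)                                   # (n_features, n_pairs)

def cluster_based_selection(X, y, n_clusters=5, per_cluster=1):
    dpv = discrimination_power_vectors(X, y)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(dpv)
    score = dpv.sum(axis=1)            # simple proxy for individual discriminatory power
    selected = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        best = members[np.argsort(score[members])[::-1][:per_cluster]]
        selected.extend(best.tolist())
    return sorted(selected)

rng = np.random.default_rng(1)
X = rng.normal(size=(90, 20))
y = np.repeat([0, 1, 2], 30)
X[y == 2, 3] += 2.0
print(cluster_based_selection(X, y))
```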

26 FMDV Genotyping Datasets – Foot-and-Mouth Disease Virus (7 subtypes) Genotyper – Linear kernel support vector machine (SVM) genotyper – Mean-genotyper

27 Genotyping Results 27

28 Experiments of Cluster Based Method

29 Linear Regression

30 Application Credit Approval (Credit Limit) x_1 = Score, x_2 = Salary, x_3 = Status Y = 0.5x_1 + 100 Y = 0.5x_1 + x_2/100 + 100 Y = 0.5x_1 + x_2/100 + 20x_3 + 100
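A tiny illustration of the third model above (the coefficients are the slide's toy values; the function name is mine):

```python
def credit_limit(score, salary, status):
    """Y = 0.5*x1 + x2/100 + 20*x3 + 100, as on the slide."""
    return 0.5 * score + salary / 100 + 20 * status + 100

print(credit_limit(score=700, salary=50000, status=1))  # 350 + 500 + 20 + 100 = 970
```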

31 Linear Fitting the Data • We want to fit a linear function to an observed set of points X = [x_1, x_2, x_3, …, x_n] with associated labels Y = [y_1, y_2, y_3, …, y_n]. o After we fit the function, we can use it to predict a new y for a new x. • Find the function that minimizes the sum (or the average) of squared distances between the actual y's in the training set and the predicted ones (Least Square). The fitted line y = ax + b is used as the predictor for the data points (x_i, y_i).

32 Linear Fitting the Data • We want to fit a linear function to an observed set of points X = [x_1, x_2, x_3, …, x_n] with associated labels Y = [y_1, y_2, y_3, …, y_n]. o After we fit the function, we can use it to predict a new y for a new x. • Find the function that minimizes the sum (or the average) of squared distances between the actual y's in the training set and the predicted ones (Least Square). The fitted line y = ax + b is used as the predictor for the data points (x_i, y_i).

33 Linear Function Linear least square fitting with X in R^2 General form: 1D case (X = R): a line 2D case (X = R^2): a plane

34 Loss function Suppose target labels come from a set Y – Binary classification: Y = {0, 1} – Regression: Y = R (the real numbers) A loss function L(ŷ, y) maps decisions to costs: – it defines the penalty for predicting ŷ when the true value is y. Standard choice for classification: 0/1 loss (same as misclassification error). Standard choice for regression: squared loss L(ŷ, y) = (ŷ - y)^2.
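A tiny Python illustration of these two standard loss choices (function names are mine):

```python
def zero_one_loss(y_pred, y_true):
    """0/1 loss for classification: 1 if misclassified, 0 otherwise."""
    return 0 if y_pred == y_true else 1

def squared_loss(y_pred, y_true):
    """Squared loss for regression."""
    return (y_pred - y_true) ** 2

print(zero_one_loss(1, 0), squared_loss(2.5, 3.0))  # 1 0.25
```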

35 Linear Fitting the Data Minimize the error function E(a,b) = Σ_i (a x_i + b - y_i)^2 over the data points (x_i, y_i).

36 Example Use least squares to find the equation of the line that best approximates the points (2,0), (-1,1) and (0,2). We want to find the best line y = ax + b to fit the three points.

37 Example Minimize the error function E(a,b) = (2a+b)^2 + (-a+b-1)^2 + (b-2)^2. To find the optimal a, b, set the derivatives w.r.t. a and b equal to zero: 10a+2b+2=0 and a+3b-3=0, giving a=-3/7 and b=8/7; therefore y = -3/7 x + 8/7.

38 Example Use least squares to find the equation of the line y = ax + b that best approximates the points (1,6), (2,5), (3,7) and (4,10). Result: y = 1.4x + 3.5.
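Both worked examples can be checked numerically; a minimal sketch using numpy's least-squares solver (the helper name is mine):

```python
import numpy as np

def fit_line(points):
    """Least-squares fit of y = a*x + b to a list of (x, y) points."""
    x, y = np.array(points, dtype=float).T
    A = np.column_stack([x, np.ones_like(x)])        # design matrix [x, 1]
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a, b

print(fit_line([(2, 0), (-1, 1), (0, 2)]))           # (-3/7, 8/7) ~ (-0.4286, 1.1429)
print(fit_line([(1, 6), (2, 5), (3, 7), (4, 10)]))   # (1.4, 3.5)
```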

39 Approaches to solve Ax ≈ b – Normal equations: quick and dirty – QR: standard in libraries, uses an orthogonal decomposition – SVD: a decomposition which also gives an indication of how linearly independent the columns are – Conjugate gradient: no decompositions, good for large sparse problems

40 Ax=b For example, for the points (1,6), (2,5), (3,7) and (4,10) with the model y = ax + b: A = [[1,1],[2,1],[3,1],[4,1]], x = [a, b]^T, b = [6, 5, 7, 10]^T. Normal system of equations, general form: A^T A x = A^T b.

41 Least Square Ax=b ⇒ A^T A x = A^T b ⇒ x = (A^T A)^{-1} A^T b
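A hedged numpy sketch of solving the same example via the normal equations (in practice QR or SVD, covered next, are preferred for numerical stability):

```python
import numpy as np

# Data points (1,6), (2,5), (3,7), (4,10); model y = a*x + b.
A = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0],
              [4.0, 1.0]])
b = np.array([6.0, 5.0, 7.0, 10.0])

# Normal equations: A^T A x = A^T b  =>  x = (A^T A)^{-1} A^T b
x = np.linalg.solve(A.T @ A, A.T @ b)
print(x)  # [1.4, 3.5]
```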

42 QR Factorization • A matrix Q is said to be orthogonal if its columns are orthonormal, i.e. Q^T Q = I. • Orthogonal transformations preserve the Euclidean norm, since ||Qv||^2 = (Qv)^T (Qv) = v^T Q^T Q v = v^T v = ||v||^2. • Orthogonal matrices can transform vectors in various ways, such as rotations or reflections, but they do not change the Euclidean length of the vector. Hence, they preserve the solution to a linear least squares problem.

43 QR Factorization • If A is an m×n matrix with linearly independent columns, then A can be decomposed as A=QR, where Q is an m×n matrix whose columns form an orthonormal basis for the column space of A and R is a nonsingular upper triangular matrix, with (q_i, q_j) = 0 for i ≠ j and |q_i| = 1.

44 Gram-Schmidt Orthonormalization Process Linearly independent set a = {a_1, a_2, …, a_n}, with (β_i, β_j) = 0 for i ≠ j
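A minimal classical Gram-Schmidt sketch producing the Q and R factors (kept simple for clarity; modified Gram-Schmidt or Householder reflections are preferred numerically):

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt: A = Q R with orthonormal columns in Q."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]   # projection coefficient onto q_i
            v -= R[i, j] * Q[:, i]        # subtract the component along q_i
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R

A = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
Q, R = gram_schmidt_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(2)))  # True True
```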

45 QR Factorization

46

47 We get A=UR, where U=(θ_1, θ_2, θ_3, …, θ_n). Then,

48 Example of QR Factorization

49 Example of QR Factorization Applying the Gram-Schmidt process to compute the QR decomposition 1st Step: 2nd Step: 3rd Step:

50 Example of QR Factorization 4th Step: 5th Step: 6th Step:

51 Therefore, A=QR QR decomposition is widely used in computer codes to find the eigenvalues of a matrix, to solve linear systems, and to find least squares approximations.

52 Least square solution using QR Decomposition The least squares solution of Xb = y is b̂ = (X^T X)^{-1} X^T y. Let X = QR. Then X^T X = R^T Q^T Q R = R^T R. Therefore, b̂ = (R^T R)^{-1} R^T Q^T y = R^{-1} Q^T y.

53 1. Least Square: Ax=b ⇒ A^T A x = A^T b 2. A=QR ⇒ R^T Q^T Q R x = R^T Q^T b 3. Q^T Q = I ⇒ R^T R x = R^T Q^T b 4. Rx = Q^T b

54 Least square solution using QR Decomposition • Running Time 1. QR factorization of A: A = QR (2mn^2 flops) 2. Form d = Q^T b (2mn flops) 3. Solve Rx = d by back substitution (n^2 flops) Cost for large m, n: 2mn^2 flops
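A sketch of these three steps using numpy's QR and, as an assumption, scipy's triangular solver for the back substitution, on the same four example points:

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]])
b = np.array([6.0, 5.0, 7.0, 10.0])

Q, R = np.linalg.qr(A)          # 1. QR factorization (reduced: Q is m-by-n)
d = Q.T @ b                     # 2. form d = Q^T b
x = solve_triangular(R, d)      # 3. solve Rx = d by back substitution
print(x)                        # [1.4, 3.5]
```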

55 QR for Least Square

56 Singular Value Decomposition A=USV^T – The singular values are the diagonal entries of the S matrix and are arranged in descending order – The singular values are always real (non-negative) numbers – If A is a real matrix, U and V are also real

57 Singular Value Decomposition The SVD of an m-by-n matrix A is given by the formula A = U S V^T, where: U is an m-by-m matrix of the orthonormal eigenvectors of A A^T, that is, U^T = U^{-1}; V^T is the transpose of an n-by-n matrix containing the orthonormal eigenvectors of A^T A, that is, V^T = V^{-1}; S is an m-by-n diagonal matrix of the singular values, which are the square roots of the eigenvalues of A^T A.
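A short numpy illustration of the factorization; note that numpy returns the singular values as a 1-D array, so the sketch embeds them in an m-by-n S to check A = U S V^T:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 3.0], [1.0, 1.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(s)                           # singular values, in descending order
S = np.zeros(A.shape)
S[:len(s), :len(s)] = np.diag(s)   # embed singular values in an m-by-n S
print(np.allclose(U @ S @ Vt, A))  # True: A = U S V^T
print(np.allclose(s**2, np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]))  # squares of s = eigenvalues of A^T A
```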

58 SVD for Least Square
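As a hedged sketch, the least-squares solution can be obtained from the SVD as x = V S⁺ U^T b (the pseudo-inverse applied to b), again on the four example points:

```python
import numpy as np

A = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]])
b = np.array([6.0, 5.0, 7.0, 10.0])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
x = Vt.T @ ((U.T @ b) / s)        # x = V S^{-1} U^T b (pseudo-inverse)
print(x)                          # [1.4, 3.5]
print(np.linalg.pinv(A) @ b)      # same result via numpy's pinv
```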

59

60

61

62 Example

63

64 Conjugate Gradient Solving the linear equation system Ax = b. Problem: the dimension n is too big, or there is not enough time for Gaussian elimination. Iterative methods are used to get an approximate solution. A is known, square, symmetric, positive-definite. Definition of an iterative method: given a starting point, take steps that hopefully converge to the right solution x.

65 Conjugate Gradient We from now on assume we want to minimize the quadratic function f(x) = ½ x^T A x - b^T x + c. This is equivalent to solving the linear problem Ax = b. There are generalizations to general functions.

66 Background for Gradient Methods The min(max) problem: But we learned in calculus how to solve that kind of question!

67 Directional Derivatives

68 Directional Derivatives : In general direction…

69 Directional Derivatives

70 The Gradient: Definition in the plane (R^2)

71 The Gradient: Definition

72 The Gradient Properties The gradient defines a (hyper)plane approximating the function infinitesimally

73 The Gradient properties By the chain rule: (important for later use)

74 The Gradient Properties Proposition 1: the directional derivative is maximal when choosing the direction of the gradient, and minimal when choosing the opposite direction (intuitive: the gradient points in the direction of greatest change)

75 The Gradient Properties We found the best INFINITESIMAL DIRECTION at each point. Looking for the minimum: a "blind man" procedure. How can we derive the way to the minimum using this knowledge?

76 Steepest Descent Steepest descent algorithm: Data: a starting point x_0 Step 0: set i = 0 Step 1: if the gradient vanishes, stop; else, compute the search direction h_i = -∇f(x_i) Step 2: compute the step-size λ_i that minimizes f(x_i + λ h_i) Step 3: set x_{i+1} = x_i + λ_i h_i, i = i + 1, and go to step 1
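For the quadratic function of slide 65, f(x) = ½ x^T A x - b^T x, the gradient is Ax - b and the exact line-search step size has the closed form λ = r^T r / (r^T A r) with r = b - Ax; a minimal Python sketch of the procedure above (names are mine):

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=1000):
    """Minimize f(x) = 0.5 x^T A x - b^T x for symmetric positive-definite A."""
    x = x0.astype(float)
    for _ in range(max_iter):
        r = b - A @ x                    # residual = negative gradient
        if np.linalg.norm(r) < tol:      # Step 1: stop at a critical point
            break
        lam = (r @ r) / (r @ (A @ r))    # Step 2: exact line-search step size
        x = x + lam * r                  # Step 3: move along the search direction
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(steepest_descent(A, b, np.zeros(2)))   # ~ [1/11, 7/11]
print(np.linalg.solve(A, b))                 # exact solution for comparison
```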

77 Steepest Descent

78 The steepest descent finds a critical point and a local minimum. Implicit step-size rule: actually we reduced the problem to finding the minimum of a one-dimensional function along the search direction. There are extensions that give the step-size rule in a discrete sense (Armijo).

79 Conjugate Gradient We from now on assume we want to minimize the quadratic function f(x) = ½ x^T A x - b^T x + c. This is equivalent to solving the linear problem Ax = b. There are generalizations to general functions.

80 Conjugate Gradient What is the problem with steepest descent? We can repeat the same directions over and over… Conjugate gradient takes at most n steps.

81 Conjugate Gradient We say that two non-zero vectors u and v are conjugate (with respect to A) if u^T A v = 0. If A = I, orthogonal vectors are a special case of conjugate vectors.

82 Conjugate Gradient Search directions – the conjugate search directions should span the whole space

83 Conjugate Gradient Given a search direction, how do we calculate the step size? (as before, by exact line search)

84 Conjugate Gradient How do we find the search directions? We want that after n steps the error will be 0:

85 Conjugate Gradient Here is an idea: if then: So if,

86 Conjugate Gradient So we look for coefficients such that: A simple calculation shows this holds if we take the directions A-conjugate (A-orthogonal)

87 Conjugate Gradient We have to find an A-conjugate basis. We can do a "Gram-Schmidt" process, but we should be careful since it is an O(n³) process, starting from some series of vectors

88 Conjugate Gradient So for an arbitrary choice of starting vectors we gain nothing. Luckily, we can choose them so that the conjugate direction calculation is O(m), where m is the number of non-zero entries in A. The correct choice is:

89 Conjugate Gradient So the conjugate gradient algorithm for minimizing f: Data: Step 0: Step 1: Step 2: Step 3: Step 4: and repeat n times.

90 Conjugate Gradient Algorithm Start with an initial trial point → find the search direction → find the step size that minimizes f along it → check: is it the optimum? If yes, stop; if no, find the next search direction and repeat.
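A minimal sketch of the standard conjugate gradient iteration for a symmetric positive-definite A, using the textbook update formulas (variable names are mine, not necessarily the slides'):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10):
    """Solve Ax = b (A symmetric positive-definite); at most n steps in exact arithmetic."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x                         # residual
    d = r.copy()                          # first search direction = residual
    for _ in range(n):
        if np.linalg.norm(r) < tol:
            break
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)        # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * Ad            # updated residual
        beta = (r_new @ r_new) / (r @ r)  # makes the next direction A-conjugate
        d = r_new + beta * d
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))           # ~ [1/11, 7/11], reached in at most 2 steps
```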

