# Data analysis Lecture 10 Tijl De Bie.


Let’s do some real data analysis
A biologist comes to you and says: “I have some data on breast cancer here. If you analyse it, I will win the Nobel prize.” How to start?

Let’s do some real data analysis
Real data is messy: missing values… → Infer them as the mean of the corresponding feature (this is a basic technique for ‘imputation’). [MATLAB intermezzo]
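The lecture's MATLAB demo is not part of this transcript. As a rough stand-in, here is a minimal Python sketch of mean imputation. The function name `impute_column_means` and the toy matrix are made up for illustration, and missing entries are assumed to be encoded as `None` or NaN.

```python
import math


def impute_column_means(rows):
    """Replace missing entries (None or NaN) by the mean of the observed
    values of the same feature (column)."""

    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    n_features = len(rows[0])
    means = []
    for j in range(n_features):
        observed = [row[j] for row in rows if not is_missing(row[j])]
        if not observed:
            raise ValueError(f"feature {j} has no observed values")
        means.append(sum(observed) / len(observed))

    # Build a new matrix; the input is left untouched.
    return [
        [means[j] if is_missing(v) else v for j, v in enumerate(row)]
        for row in rows
    ]


# Toy data (hypothetical): 3 samples, 2 features, one missing entry per feature.
X = [[1.0, None], [3.0, 4.0], [float("nan"), 8.0]]
X_filled = impute_column_means(X)
```

Note that mean imputation leaves each feature's mean unchanged but shrinks its variance, which is one reason it is only a basic technique.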

Let’s do some real data analysis
What now?? Let’s visualize the data! How?? 9-dimensional! → Principal Component Analysis (PCA) [MATLAB intermezzo]

Mathematical intermezzo: PCA
Two views: variance maximization and error minimization. Both are solved using an eigenvalue problem. Do not forget to centre the data (subtract from each feature its mean in the dataset).
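To make the eigenvalue view concrete, the following is a minimal sketch in plain Python, not the lecture's MATLAB code. It centres the data, builds the covariance matrix, and extracts the leading eigenvectors by power iteration with deflation. Power iteration is a choice made here to avoid external libraries; all function names and the toy data are hypothetical.

```python
import math


def center(rows):
    """Subtract from each feature its mean in the dataset."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Covariance matrix (1/n) * Xc^T Xc of already centred data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
            for i in range(d)]


def top_eigenvector(C, iters=500):
    """Power iteration: repeatedly apply C and renormalise.
    Assumes the start vector is not orthogonal to the leading eigenvector."""
    d = len(C)
    w = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        v = [sum(C[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            break
        w = [x / norm for x in v]
    # Rayleigh quotient w^T C w gives the eigenvalue (variance along w).
    eigenvalue = sum(w[i] * sum(C[i][j] * w[j] for j in range(d))
                     for i in range(d))
    return w, eigenvalue


def pca(rows, k=2):
    """Return the top-k principal directions and the k-dimensional projection."""
    Xc = center(rows)
    C = covariance(Xc)
    d = len(C)
    components = []
    for _ in range(k):
        w, lam = top_eigenvector(C)
        components.append(w)
        # Deflation: remove the variance already explained by w.
        C = [[C[i][j] - lam * w[i] * w[j] for j in range(d)] for i in range(d)]
    projected = [[sum(r[j] * w[j] for j in range(d)) for w in components]
                 for r in Xc]
    return components, projected


# Toy data (hypothetical): most variance along feature 0, some along feature 1.
X = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
components, projected = pca(X, k=2)
```

Each row of `projected` holds the 2-D coordinates one would plot to visualize the data. In practice a library eigensolver or SVD routine would replace the hand-written power iteration.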

Looks interesting… Could we perhaps predict the label from the data?
I.e., find a rule that says when a cancer is benign and when it’s malignant (important for therapy and more!) Classification! [MATLAB intermezzo]

Mathematical intermezzo: LSR/FDA
Least Squares Regression (LSR): solved by means of a system of linear equations, $Xw \approx y$. Misfit: $\|Xw - y\|^2$, the squared error (equal to the mean squared error up to division by the number of samples). Fisher Discriminant Analysis (FDA): the same thing, if the labels $y$ are $-1/1$.
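As an illustration of LSR used as a classifier with labels -1/1, here is a minimal sketch in plain Python. It solves the normal equations (X^T X) w = X^T y by Gaussian elimination, which is one standard way to obtain the least-squares solution. The constant 1 appended to every sample (acting as a bias term), the function names, and the toy data are assumptions made for this example, not taken from the lecture.

```python
def solve_linear_system(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w


def least_squares(X, y):
    """Minimise ||Xw - y||^2 via the normal equations (X^T X) w = X^T y."""
    d = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(d)] for i in range(d)]
    Xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(d)]
    return solve_linear_system(XtX, Xty)


def squared_error(X, y, w):
    """The misfit ||Xw - y||^2."""
    return sum((sum(wi * xi for wi, xi in zip(w, r)) - t) ** 2
               for r, t in zip(X, y))


def predict_label(w, x):
    """Classify by the sign of the linear score w . x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


# Toy data (hypothetical): one feature plus a constant 1 as bias term.
X = [[1.0, 1.0], [2.0, 1.0], [8.0, 1.0], [9.0, 1.0]]
y = [-1, -1, 1, 1]
w = least_squares(X, y)
predictions = [predict_label(w, x) for x in X]
misfit = squared_error(X, y, w)
```

On this toy set the fitted rule separates the two classes. On real data the accuracy should be measured on samples that were held out from the fit.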

Could there be more? Perhaps there are more than 2 clusters?
Cancers requiring different treatments? Let’s cluster the data! 2 clusters (benign vs malignant)? More clusters (other cancer types)? [MATLAB intermezzo]
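The transcript does not say which clustering algorithm the MATLAB demo used. As one common possibility, the sketch below implements k-means (Lloyd's algorithm) in plain Python. The choice of k-means, the initialisation from the first k points (used here only to keep the example reproducible), and the toy data are all assumptions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=100):
    """Lloyd's algorithm: alternate between assigning each point to its
    nearest centroid and moving each centroid to the mean of its points."""
    # Assumption: initialise with the first k points (deterministic, not robust).
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(iters):
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Toy data (hypothetical): two well-separated groups of three points each.
points = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0],
          [10.0, 11.0], [1.0, 0.0], [11.0, 10.0]]
centroids, assignment = kmeans(points, k=2)
```

Trying a larger `k` is done by changing the `k` argument. The result depends on the initialisation, so in practice one would use several random restarts and compare the within-cluster squared distances.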
