2 Let’s do some real data analysis
A biologist comes to you and says: “I have some data on breast cancer here; if you analyse it, I will win the Nobel prize.”
How to start??
3 Let’s do some real data analysis
Real data is messy:
  Missing values… Infer them as the mean of the corresponding feature (this is a basic technique for ‘imputation’)
[MATLAB intermezzo]
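The MATLAB demo itself is not reproduced in these notes. As a stand-in, here is a minimal Python sketch of mean imputation using only the standard library. It assumes missing values are encoded as NaN and that every feature has at least one observed value; the function name `impute_column_means` and the toy data are hypothetical, not taken from the breast cancer dataset.

```python
import math


def impute_column_means(rows):
    """Replace each NaN by the mean of the observed values of the same feature."""
    n_features = len(rows[0])
    means = []
    for j in range(n_features):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        # Assumption: each feature has at least one observed value.
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if math.isnan(value) else value for j, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical toy data: 3 samples, 2 features, NaN marks a missing entry.
toy = [[1.0, 4.0], [float("nan"), 6.0], [3.0, float("nan")]]
filled = impute_column_means(toy)
```

The feature means are computed from the observed entries only, so the imputed values do not shift the mean of any feature.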
4 Let’s do some real data analysis
What now??
Let’s visualize the data!
How?? It is 9-dimensional!
Principal Component Analysis (PCA)
[MATLAB intermezzo]
5 Mathematical intermezzo: PCA
Two views:
  Variance maximization
  Error minimization
Solved using an eigenvalue problem
Do not forget to centre the data (subtract from each feature its mean in the dataset)
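The recipe on this slide (centre, then solve an eigenvalue problem) can be sketched in plain Python. This is an illustrative sketch, not the MATLAB demo: it assumes the eigenvalue problem is posed on the sample covariance matrix and solves it with power iteration plus deflation, a swapped-in method chosen to keep the example dependency-free. All function names and the 3-dimensional toy data are hypothetical.

```python
import math
import random


def centre(rows):
    """Subtract from each feature its mean over the dataset."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]


def covariance(centred):
    """Sample covariance matrix of already-centred data."""
    n, d = len(centred), len(centred[0])
    return [[sum(row[i] * row[j] for row in centred) / (n - 1)
             for j in range(d)] for i in range(d)]


def top_eigenpair(matrix, iterations=500, seed=0):
    """Power iteration: dominant eigenvalue and eigenvector of a symmetric PSD matrix."""
    d = len(matrix)
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]  # random start, fixed seed
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # the matrix maps v to zero; nothing more to find
            break
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * mv[i] for i in range(d))  # Rayleigh quotient
    return eigenvalue, v


def pca(rows, n_components=2):
    """Return the leading principal directions and the projected (centred) data."""
    centred = centre(rows)
    cov = covariance(centred)
    d = len(cov)
    components = []
    for _ in range(n_components):
        eigenvalue, v = top_eigenpair(cov)
        components.append(v)
        # Deflation: remove the direction just found, so the next run finds the next one.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    projected = [[sum(row[j] * c[j] for j in range(d)) for c in components]
                 for row in centred]
    return components, projected


# Hypothetical toy data: most variance along axis 0, then axis 1, very little along axis 2.
toy = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 1.0, 0.0],
       [0.0, -1.0, 0.0], [0.0, 0.0, 0.1], [0.0, 0.0, -0.1]]
components, projected = pca(toy, n_components=2)
```

The sign of each component is arbitrary, so a 2-D scatter plot of `projected` may come out mirrored compared with another implementation. For the 9-dimensional data the same call would give the two coordinates to plot.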
6 Looks interesting…
Could we perhaps predict the label from the data? I.e., find a rule that says when a cancer is benign and when it’s malignant (important for therapy and more!)
Classification!
[MATLAB intermezzo]
7 Mathematical intermezzo: LSR/FDA
Least Squares Regression (LSR)
  Solved by means of a system of linear equations
  Xw = y (approx)
  Misfit: ||Xw - y||^2, the squared error (the mean squared error up to a factor 1/n)
Fisher Discriminant Analysis:
  The same thing, if the labels y are -1/1
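To make "solved by means of a system of linear equations" concrete, here is a hedged Python sketch. It assumes the system in question is the normal equations (X^T X) w = X^T y, and it appends a constant column of ones to X as a bias term; both choices are assumptions, not something the slide states. The helper names and the one-feature toy data with -1/1 labels are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def least_squares(X, y):
    """Minimise ||Xw - y||^2 via the normal equations (X^T X) w = X^T y."""
    d = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(d)] for i in range(d)]
    xty = [sum(row[i] * target for row, target in zip(X, y)) for i in range(d)]
    return solve_linear_system(xtx, xty)


def predict_labels(X, w):
    """Classify by the sign of the linear score x . w."""
    return [1 if sum(xi * wi for xi, wi in zip(row, w)) >= 0 else -1 for row in X]


# Hypothetical toy data: one feature plus a constant 1 (bias), labels coded as -1/1.
X = [[1.0, 1.0], [2.0, 1.0], [4.0, 1.0], [5.0, 1.0]]
y = [-1, -1, 1, 1]
w = least_squares(X, y)
predictions = predict_labels(X, w)
misfit = sum((sum(xi * wi for xi, wi in zip(row, w)) - t) ** 2 for row, t in zip(X, y))
```

The normal equations assume X^T X is invertible; with redundant features the elimination would hit a zero pivot, and a regularised or pseudo-inverse solution would be needed instead.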
8 Could there be more?
Perhaps there are more than 2 clusters? Cancers requiring different treatments?
Let’s cluster the data!
  2 clusters? (Benign vs malignant?)
  More clusters? (Other cancer types?)
[MATLAB intermezzo]
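The slide does not say which clustering algorithm the demo uses. As one common possibility, here is a minimal k-means sketch in plain Python. The choice of k-means, the initialisation from the first k points, and the toy data are all assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm: alternate between assigning points and updating centres."""
    # Assumption: the first k points serve as initial centres (deterministic, not robust).
    centres = [list(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centre.
        labels = [min(range(k), key=lambda c: squared_distance(p, centres[c]))
                  for p in points]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centres.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centres.append(centres[c])  # keep an empty cluster's centre
        if new_centres == centres:  # converged: centres no longer move
            break
        centres = new_centres
    return labels, centres


# Hypothetical toy data: two well-separated groups in 2-D.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centres = kmeans(points, k=2)
```

Running the same function with k = 3 or more is one way to explore the "more clusters?" question, though k-means always returns k groups, so a separate criterion is needed to judge whether the extra clusters are meaningful.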