Let’s do some real data analysis
  A biologist comes to you and says: “I have some data on breast cancer here. If you analyse it, I will win the Nobel prize.”
  How to start?
Let’s do some real data analysis
  Real data is messy: missing values…
  Infer each missing value as the mean of the corresponding feature (a basic technique for ‘imputation’).
  [MATLAB intermezzo]
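The mean imputation described above can be sketched as follows. This is a minimal illustration in plain Python, not the MATLAB demo from the lecture; the function name `impute_column_means`, the use of `None`/NaN as the missing-value marker, and the toy numbers are all assumptions made for the example.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (an assumed convention for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_column_means(rows):
    """Return a copy of rows with each missing entry replaced by the mean
    of the observed values of the same feature (column)."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if not is_missing(r[j])]
        if not observed:
            raise ValueError(f"feature {j} has no observed values")
        means.append(sum(observed) / len(observed))
    return [[means[j] if is_missing(v) else v for j, v in enumerate(r)]
            for r in rows]


# Toy data, made up for illustration (not the breast cancer dataset).
data = [
    [1.0, 2.0, None],
    [3.0, None, 6.0],
    [5.0, 4.0, 9.0],
]
filled = impute_column_means(data)
```

The input is left untouched and a filled copy is returned, so the imputed and raw data can still be compared afterwards.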
Let’s do some real data analysis
  What now? Let’s visualize the data!
  How? 9-dimensional!
  Principal Component Analysis (PCA)
  [MATLAB intermezzo]
Mathematical intermezzo: PCA
  Two views:
    Variance maximization
    Error minimization
  Solved as an eigenvalue problem (on the covariance matrix of the data).
  Do not forget to centre the data (subtract from each feature its mean over the dataset).
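The PCA recipe above (centre, then solve an eigenvalue problem) can be sketched in plain Python. This is only an illustration under stated assumptions: the eigenvectors are found by power iteration with deflation, which is one simple choice and not necessarily what the MATLAB demo uses; all function names and the toy data are hypothetical.

```python
import math
import random


def center(rows):
    """Subtract from each feature its mean over the dataset."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centred data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenpair(C, iters=500, seed=0):
    """Largest eigenvalue and eigenvector of a symmetric PSD matrix,
    by power iteration from a random start."""
    d = len(C)
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    Cv = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(v[i] * Cv[i] for i in range(d)), v


def pca(rows, k=2):
    """Project rows onto the k directions of largest variance."""
    centered = center(rows)
    C = covariance(centered)
    d = len(C)
    components, variances = [], []
    for _ in range(k):
        lam, v = top_eigenpair(C)
        components.append(v)
        variances.append(lam)
        # Deflation: remove the direction just found from the matrix.
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    scores = [[sum(r[j] * comp[j] for j in range(d)) for comp in components]
              for r in centered]
    return scores, components, variances


# Toy 3-D data, made up for illustration: it varies mostly along feature 0,
# a little along feature 1, and not at all along feature 2.
rows = [[12.0, 5.0, 1.0], [8.0, 5.0, 1.0], [10.0, 6.0, 1.0],
        [10.0, 4.0, 1.0], [10.0, 5.0, 1.0]]
scores, components, variances = pca(rows, k=2)
```

Each row of `scores` is a 2-D point that can be plotted directly; `variances` holds the variance captured by each component, largest first. The sign of a component is arbitrary, so a plot may come out mirrored.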
Looks interesting…
  Could we perhaps predict the label from the data? I.e., find a rule that says when a cancer is benign and when it’s malignant (important for therapy and more!).
  Classification!
  [MATLAB intermezzo]
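Mathematical intermezzo: LSR/FDA
  Least Squares Regression (LSR):
    Find $w$ such that $Xw \approx y$.
    Misfit: $\|Xw - y\|^2$, the squared error (the mean squared error up to a factor of $1/n$ for $n$ samples).
    Solved by means of a system of linear equations (the normal equations $X^\top X w = X^\top y$).
  Fisher Discriminant Analysis (FDA):
    The same thing (up to a scaling of $w$), if the labels $y$ are coded as -1/+1 and the data are centred.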
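As a rough sketch of least squares used for classification: the code below builds and solves the normal equations (X^T X) w = X^T y by Gaussian elimination, then predicts a label from the sign of x·w. It is an illustration in plain Python, not the lecture's MATLAB code. The function names, the constant-1 bias feature (used here instead of centring the data), and the toy numbers are assumptions.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def least_squares(X, y):
    """Minimise ||Xw - y||^2 via the normal equations (X^T X) w = X^T y."""
    d = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(d)] for i in range(d)]
    Xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(d)]
    return solve_linear_system(XtX, Xty)


def misfit(X, y, w):
    """Squared error ||Xw - y||^2."""
    return sum((sum(wi * xi for wi, xi in zip(w, r)) - t) ** 2
               for r, t in zip(X, y))


def predict_label(w, x):
    """Classify by the sign of the linear score x . w."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


# Toy data, made up for illustration: one feature plus a constant bias
# feature, with labels coded as -1/+1.
X = [[1.0, 1.0], [2.0, 1.0], [4.0, 1.0], [5.0, 1.0]]
y = [-1, -1, 1, 1]
w = least_squares(X, y)
predictions = [predict_label(w, x) for x in X]
error = misfit(X, y, w)
```

Solving the normal equations directly is fine for a handful of features such as the nine here; for many or strongly correlated features, a QR- or SVD-based solver is usually the numerically safer choice.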
Could there be more?
  Perhaps there are more than 2 clusters? Cancers requiring different treatments?
  Let’s cluster the data!
    2 clusters? (Benign vs malignant?)
    More clusters? (Other cancer types?)
  [MATLAB intermezzo]
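The slide does not say which clustering method the MATLAB demo uses. As one common possibility, here is a minimal k-means sketch (Lloyd's algorithm) in plain Python. The choice of k-means, the function names, the random initialisation from data points, and the toy data are all assumptions for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate between assigning each point to its
    nearest centroid and moving each centroid to the mean of its points."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


# Toy data, made up for illustration: two well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]]
labels, centroids = kmeans(points, k=2)
```

Running the same function with `k=3` or more is how one would probe the "more clusters?" question. Cluster numbering is arbitrary, and on less cleanly separated data the result can depend on the initial centroids, so several seeds are worth trying.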