
1 Clustering Features in High-Throughput Proteomic Data. Richard Pelikan (or what's left of him). BIOINF 2054, April 29, 2005.

2 Outline
- Brief overview of clinical proteomics
- What do we intend to achieve with machine learning?
- Modelling profiles through mixture models
- Evaluation
- Conclusions

3 What is proteomics anyway?
Proteomics – the study of proteins and how they affect one's state of health.
- Think genomics, but with proteins instead of genes.
- It may be much more difficult to map the human proteome than it was to map the human genome.
- A relatively new field of research. Lots of techniques, lots of ideas, only 25 hours in a day.

4 Why is proteomics useful?
Primary reason: efficient, early detection and diagnosis of disease.
- Invasive techniques such as biopsies are relatively high-risk. Not to mention, expensive!
- Proteomic profiling allows for a non- or minimally-invasive way of detecting a malady in a patient.
- More affordable (for now), allowing for more opportunities for screening.
Alternative reason: prediction of response or non-response to a treatment.
- Oftentimes, getting the treatment is worse than simply living with the disease.
- Allows for a screening process to determine which treatment is best for a particular patient.

5 OK, I'm interested. How does proteomics work? It's Spectrometry, my dear Watson.
[Instrument diagram: chip spots, laser, lens, vacuum tube, detector, and spectral view.]

6 Proteomic Profiles
Some examples from pancreatic cancer patients. In this dataset:
- 57 healthy patients (controls)
- 59 cancerous patients (cases)
The dataset is from UPCI.
[Figure: example control and case spectra.]

7 Feature Reduction
Proteomic profiles can have anywhere from 15,000 to 370,000 intensities reported.
- The pancreatic dataset has 60,270 m/z values.
- Too many for a statistical ML model to parameterize each intensity.
The goal of feature reduction is to select the parts of profiles which are most informative about class membership.
- Feature = an individual intensity measurement.
- Some features may be redundant.
- Some features may be noise.
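A minimal sketch of one such filtering step, using scikit-learn's univariate F-test to keep the most class-informative intensities (the random data and the choice of k = 1,000 are illustrative assumptions, not from the slides):

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.random((116, 60270))   # 116 profiles, 60,270 m/z intensities, as in the pancreatic dataset
y = rng.integers(0, 2, 116)    # 0 = control, 1 = case

# Score each m/z intensity independently against the class labels
# and keep the 1,000 highest-scoring features.
selector = SelectKBest(score_func=f_classif, k=1000)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (116, 1000)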

8 Feature Construction
As opposed to the feature filtering approaches above, a new set of features can be constructed to represent the profiles. Techniques such as Principal Component Analysis (PCA) or Independent Component Analysis (ICA) are suited to this task. PCA finds projections of the high-dimensional proteomic data into a low-dimensional subspace. The variance retained in the projection is maximal, so there is greater dispersion between classes in which a decision boundary can be drawn. An additional benefit of PCA is that it identifies orthogonal sets of correlated features and constructs new features (components) that are uncorrelated, yet explain most of the variance in the data.
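A minimal sketch of PCA-based feature construction with scikit-learn (the data shape mirrors the pancreatic dataset; the choice of 10 components is an assumption):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((116, 60270))

# Project onto the 10 directions of maximal variance; the resulting
# components are orthogonal, hence uncorrelated new features.
pca = PCA(n_components=10)
X_components = pca.fit_transform(X)
print(X_components.shape)             # (116, 10)
print(pca.explained_variance_ratio_)  # fraction of variance each component retains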

9 Creating clustered relations

10 Mixture Models
Let X = {x_1, …, x_n} be a set of n datapoints. Assume each x is generated from a mixture of m components M = {c_1, …, c_m}, so that

P(x) = \sum_{j=1}^{m} P(c_j)\, P(x \mid c_j).

This is a mixture model with m components.

11 Determining component responsibility
Using Bayes' theorem,

P(c_j \mid x) = \frac{P(c_j)\, P(x \mid c_j)}{\sum_{k=1}^{m} P(c_k)\, P(x \mid c_k)}.

- Interpret P(c_j) as the prior probability of component j being "turned on".
- Interpret P(x \mid c_j) as a basis function which describes the behavior of x given by the component c_j.
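A minimal numpy sketch of this computation (function and variable names are hypothetical; the priors and per-component likelihoods are taken as inputs):

import numpy as np

def responsibilities(priors, likelihoods):
    """P(c_j | x_i) for each datapoint and component, via Bayes' theorem.

    priors      -- shape (m,), the P(c_j)
    likelihoods -- shape (n, m), the P(x_i | c_j)
    """
    joint = likelihoods * priors                   # P(c_j) * P(x_i | c_j)
    return joint / joint.sum(axis=1, keepdims=True)

# Two components, three datapoints
priors = np.array([0.5, 0.5])
likelihoods = np.array([[0.9, 0.1], [0.2, 0.6], [0.4, 0.4]])
print(responsibilities(priors, likelihoods))       # each row sums to 1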

12 Component Responsibility = Clustering
Idea: use the component responsibilities as features.

13 Changing the basis functions
Easy thing to do: say x is generated from a mixture of m Gaussians, i.e. each basis function is

P(x \mid c_j) = \mathcal{N}(x; \mu_j, \Sigma_j).

Plug it back into the mixture model equation:

P(x) = \sum_{j=1}^{m} P(c_j)\, \mathcal{N}(x; \mu_j, \Sigma_j).

This is the "Mixture of Gaussians" model.
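A sketch of the resulting density, using scipy's Gaussian pdf as the basis function (the weights, means, and covariances below are illustrative assumptions):

import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component mixture in 2 dimensions
weights = np.array([0.3, 0.7])               # P(c_j)
means = [np.zeros(2), np.array([3.0, 3.0])]  # mu_j
covs = [np.eye(2), 2.0 * np.eye(2)]          # Sigma_j

def mixture_pdf(x):
    # P(x) = sum_j P(c_j) * N(x; mu_j, Sigma_j)
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

print(mixture_pdf(np.array([1.0, 1.0])))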

14 Mixture of Gaussians
Computation of the posterior P(c_j \mid x) is dependent on \mu_j and \Sigma_j.
- May not assign proper "credit" to the jth component.
Solution: incorporate a hidden indicator variable z_j, which indicates whether or not x was generated by component c_j.
Interpretation: z_j = 1 if component c_j generated x, and z_j = 0 otherwise.

15 Mixture of Gaussians & EM Algorithm
Since z is unknown, we can use the EM algorithm: in the E-step, we compute the expected values of z under the current parameters, which maximize the observed-data likelihood (ODL). In the M-step, we calculate the most likely values for the parameters of the m components.

16 Mixture of Gaussians: M-Step
Mean:

\mu_j = \frac{\sum_{i=1}^{n} P(c_j \mid x_i)\, x_i}{\sum_{i=1}^{n} P(c_j \mid x_i)}

(Co)Variance:

\Sigma_j = \frac{\sum_{i=1}^{n} P(c_j \mid x_i)\,(x_i - \mu_j)(x_i - \mu_j)^\top}{\sum_{i=1}^{n} P(c_j \mid x_i)}
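A minimal numpy sketch of one EM iteration implementing these updates (names are hypothetical; no convergence loop or numerical safeguards):

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, weights, means, covs):
    n, m = X.shape[0], len(weights)
    # E-step: responsibilities r[i, j] = P(c_j | x_i)
    r = np.column_stack([weights[j] * multivariate_normal.pdf(X, means[j], covs[j])
                         for j in range(m)])
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate each component from its responsibility-weighted data
    nj = r.sum(axis=0)                                   # effective count per component
    weights = nj / n
    means = [(r[:, j] @ X) / nj[j] for j in range(m)]
    covs = [((X - means[j]).T * r[:, j]) @ (X - means[j]) / nj[j] for j in range(m)]
    return weights, means, covs

# One iteration on toy 2-D data with 2 components
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
w, mu, S = em_step(X, np.array([0.5, 0.5]), [X[0], X[50]], [np.eye(2), np.eye(2)])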

17 Slight modification…
Assume that the Gaussian components are all hyperspherical, that is,

\Sigma_j = \sigma^2 I,

and let z_c be a hard assignment: z_c = 1 for the component closest to x, and 0 for all others.
The result? The K-means algorithm.
The features? The values where I(c \mid x) = 1, i.e. each sample's cluster indicator.
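One plausible reading as code (hypothetical data; the m/z features are clustered by their behavior across profiles, and each profile is then summarized by its mean intensity per cluster; that summary step is an assumption, not stated in the slides):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((116, 60270))   # profiles x m/z intensities

# Cluster the features: each m/z intensity is a point in R^116,
# described by its values across the training profiles.
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X.T)

# Hard indicators I(c|x): one-hot membership of each feature in its cluster
indicators = np.eye(10)[km.labels_]          # (60270, 10)

# Per-profile summary: mean intensity within each feature cluster
X_clustered = X @ indicators / indicators.sum(axis=0)
print(X_clustered.shape)                     # (116, 10)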

18 ML Factor Analysis
Now, let x be a linear combination of j factors z = {z_1, …, z_j}, plus some noise u:

x = \Lambda z + u

- Columns of \Lambda represent sources from which x is generated.
- This is "normal" factor analysis.
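A sketch using scikit-learn's maximum-likelihood factor analysis (the data is hypothetical and reduced in width for speed):

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.random((116, 500))    # a reduced profile matrix, for illustration

fa = FactorAnalysis(n_components=5, random_state=0)
Z = fa.fit_transform(X)       # posterior mean of the factors z for each profile
print(Z.shape)                # (116, 5)
print(fa.components_.shape)   # (5, 500): the loading matrix (Lambda transposed)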

19 Mixture of Factor Analyzers
Let x be generated from the z factors, but allow the factors to spread across m loading matrices \Lambda_1, …, \Lambda_m. Here, the component c_j is something of an indicator variable, so we search for the expectation E_{c,z}[c_j, z \mid x]. The features are then computed as the weighted posteriors of c_j conditioned on x, with P(z \mid x) as the weight.
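Fitting an MFA is beyond a short sketch, but given already-fitted parameters the feature computation might look like this (all names are hypothetical; concatenating the responsibility-weighted factor posteriors per component is an assumption):

import numpy as np
from scipy.stats import multivariate_normal

def mfa_features(X, priors, means, lambdas, psis):
    """Responsibility-weighted factor posteriors from a fitted MFA.

    priors[j]  = P(c_j); means[j] = mu_j (d,)
    lambdas[j] = Lambda_j (d, k); psis[j] = diagonal noise variances (d,)
    """
    resp, feats = [], []
    for pj, mu, L, psi in zip(priors, means, lambdas, psis):
        cov = L @ L.T + np.diag(psi)          # marginal covariance of x under c_j
        resp.append(pj * multivariate_normal.pdf(X, mu, cov))
        # Posterior mean of the factors: E[z | x, c_j] = Lambda_j^T cov^{-1} (x - mu_j)
        feats.append((X - mu) @ np.linalg.inv(cov) @ L)
    R = np.column_stack(resp)
    R /= R.sum(axis=1, keepdims=True)         # responsibilities P(c_j | x)
    # Weight each component's factor posterior by its responsibility
    return np.column_stack([R[:, [j]] * feats[j] for j in range(len(priors))])

# Toy usage: two hypothetical analyzers in R^4 with 2 factors each
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
L1, L2 = rng.normal(size=(4, 2)), rng.normal(size=(4, 2))
F = mfa_features(X, [0.5, 0.5], [np.zeros(4), np.ones(4)], [L1, L2],
                 [np.ones(4), np.ones(4)])
print(F.shape)  # (6, 4): m components x k factors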

20 Evaluation
Step 1: Divide data into training/testing sets.
Step 2: Compute clustered features on the training set.
Step 3: Reduce the samples in the testing set to the appropriate clusters.
Step 4: Classify the samples using an SVM.
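A sketch of the four steps (hypothetical data; mixture-of-Gaussians responsibilities stand in for the clustered features, fit on the training set only):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((116, 50))     # profiles after an initial reduction, for illustration
y = rng.integers(0, 2, 116)

# Step 1: train/test split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 2: fit the mixture on training data; responsibilities are the new features
gmm = GaussianMixture(n_components=5, random_state=0).fit(X_tr)
F_tr = gmm.predict_proba(X_tr)   # P(c_j | x) per training sample

# Step 3: project test samples onto the same clusters
F_te = gmm.predict_proba(X_te)

# Step 4: classify with an SVM
clf = SVC(kernel="linear").fit(F_tr, y_tr)
print("test accuracy:", clf.score(F_te, y_te))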

21 Mixture of Gaussians

22 K-means

23 Mixture of Factor Analyzers

24 Summary & Comparison
- PCA is given as a baseline for "good performance".
- Mixture of Gaussians does well, but its behavior as features are added remains unclear.
- K-means is somewhat competitive.
- MFA is likely too complicated for this task.

25 Conclusions
There are many ways you can cluster features in order to discover regulated sources.
- Sources can be examined for domain-specific importance.
- Choosing the number of sources is an open problem.
Still, the performance of these techniques was not substantially better than simple PCA.
- Save yourself time and effort: go with a simple model.

