Challenges in Creating an Automated Protein Structure Metaserver


Lawrence Wisne, CS 273

The Problem
Given a set of servers running prediction algorithms and their results on test data, is it possible to automate the choice of a "best" server for unknown sequences?
- There does not yet exist a structure prediction algorithm that gives consistently accurate results.
- While not providing any new answers, an algorithm that can successfully answer the question above could add a great degree of consistency to structure predictions.

Solution Outline
1. Download the results of the CASP6 competition.
2. Isolate a small subset S of structure prediction servers such that, given the correct choice of server, the worst result over the CASP6 target sequences is minimized.
3. Link each amino acid target sequence with the server that gives the best result for that sequence.
4. Isolate the shared characteristics of the sequences linked with each server.
Note that this requires that the number of targets optimally linked with each server in S is large enough that characteristics of these sequences can be observed.

Picking an Optimal Set of Servers
To evaluate the quality of a given server's prediction of sequence i, we use a relative property, not an absolute one: the ranking R_{s,i} of server s's prediction on sequence i among CASP6 participants.
Ideally, we would pick our set of servers S such that max_i(min_{s∈S} R_{s,i}) is minimized. This is difficult even for a fixed |S|: with N candidate servers, there are (N choose |S|) possible subsets.

A Decent Approximation
To approximate an optimal subset S, use the following greedy algorithm:
1. For each server, count the number of targets for which the server ranked within the top t, for some threshold t.
2. Add the server with the largest count to S, and remove from consideration the targets for which that server was in the top t.
3. Repeat until S reaches the desired size.
Using this algorithm with t = 5 and |S| = 6, the worst result, given correct server prediction, had a rank of 13, and the mean rank was about 2.
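The greedy steps above can be sketched directly; the demo ranking table is synthetic, standing in for the real CASP6 ranks:

```python
def greedy_server_subset(ranks, t, size):
    """Greedy sketch of the slide's approximation: repeatedly add the server
    that ranks within the top t on the most still-uncovered targets."""
    n_targets = len(next(iter(ranks.values())))
    uncovered = set(range(n_targets))
    chosen = []
    while len(chosen) < size and uncovered:
        # Pick the server covering the most uncovered targets within rank t.
        best = max((s for s in ranks if s not in chosen),
                   key=lambda s: sum(ranks[s][i] <= t for i in uncovered))
        chosen.append(best)
        uncovered -= {i for i in uncovered if ranks[best][i] <= t}
    return chosen

# Toy ranking table: demo[s][i] is server s's rank on target i.
demo = {"a": [1, 1, 9], "b": [9, 9, 1], "c": [2, 9, 9]}
```

This is the classic greedy heuristic for set cover, so it inherits that problem's approximation behavior: good in practice, but not guaranteed optimal.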

Linking Servers with Targets
Now we can link each target sequence with the server in our subset S that produces optimal results. In the case of t = 5 and |S| = 6, the smallest group of targets linked with a server had size 8, and the largest had size 32.
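The linking step is a simple argmin over the chosen subset; a minimal sketch (the rank table shape is assumed, as above):

```python
from collections import defaultdict

def link_targets(ranks, subset):
    """Group targets by the subset server achieving the lowest rank on each."""
    n_targets = len(next(iter(ranks.values())))
    groups = defaultdict(list)
    for i in range(n_targets):
        groups[min(subset, key=lambda s: ranks[s][i])].append(i)
    return dict(groups)
```

The resulting groups are the training classes for the learning and clustering approaches on the following slides.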

A Reduced (but Still Very Difficult) Problem
Find the common characteristics of a set of input strings representing amino acid sequences. The main methods attempted were machine learning and clustering.

The Machine Learning Approach
Given a training set and a set of features present in each member of the set, weight the features so that future input instances are handled optimally. Sounds great, but what are the "features" of a string?
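One crude answer to "what are the features of a string?" is its amino-acid composition plus its length; this is a hedged sketch of an assumed featurization, not the presentation's actual method:

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def composition_features(seq):
    """Map a sequence to a fixed-length vector: the relative frequency of
    each amino acid, followed by the sequence length."""
    counts = Counter(seq)
    n = len(seq)
    return [counts[a] / n for a in ALPHABET] + [n]
```

Fixed-length vectors like this are what standard learners expect, though composition alone discards all positional information, which is part of why the problem stays hard.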

The Clustering Approach
OK, so why can't we just group the strings according to some characteristic? An edit-distance metric (e.g., Smith-Waterman) may sound good in principle, but there are problems:
- The alphabet is too large.
- String sizes are too varied.
- Most metrics are thrown off by size differences, and normalization has its problems as well.
- Scoring patterns are (at best) very subtle.
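To make the normalization problem concrete, here is a minimal sketch using plain Levenshtein distance (a simpler stand-in for Smith-Waterman) with naive length normalization:

```python
def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalized_distance(a, b):
    """Naive length normalization: divide by the longer length. Sequences of
    very different lengths still dominate the score, illustrating the
    normalization problem noted above."""
    return edit_distance(a, b) / max(len(a), len(b))
```

Under this normalization, a short string compared against a much longer one scores near 1 regardless of any shared motif, which is exactly the failure mode the slide describes.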

So, Where Do We Go from Here?
It is quite possible that a way to solve the reformulated problem exists, but better domain-specific knowledge may be necessary:
- To create a richer set of features for learning, we can use the various physical properties of amino acids to replace the raw alphabet.
- Alternately, it may be possible to alter the match/mismatch scores to account for physical properties.
- More sample cases would be very helpful: the size differentials in the strings made certain metrics almost useless, and more cases would open the possibility of comparing only like-sized strings.
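The alphabet-replacement idea can be sketched with a reduced property alphabet; the 5-class grouping below is illustrative (a common coarse physico-chemical split), not a specific scheme from the presentation:

```python
# Hypothetical 5-class reduced alphabet by coarse physico-chemical property.
PROPERTY_CLASS = {
    "A": "h", "V": "h", "L": "h", "I": "h", "M": "h", "F": "h", "W": "h",  # hydrophobic
    "S": "p", "T": "p", "N": "p", "Q": "p", "Y": "p", "C": "p",            # polar
    "D": "n", "E": "n",                                                    # acidic (negative)
    "K": "b", "R": "b", "H": "b",                                          # basic (positive)
    "G": "s", "P": "s",                                                    # special
}

def reduce_alphabet(seq):
    """Re-express a sequence over the 5-letter property alphabet, shrinking
    the 20-letter alphabet that the distance metrics struggled with."""
    return "".join(PROPERTY_CLASS[a] for a in seq)
```

Shrinking the alphabet this way makes edit-distance-style comparisons more forgiving of biochemically conservative substitutions, directly addressing the "alphabet too large" problem from the clustering slide.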