PROTEIN STRUCTURE SIMILARITY CALCULATION AND VISUALIZATION CMPS 561-FALL 2014 SUMI SINGH SXS5729.


Protein Structure
Primary structure: the sequence of amino acids (for example, RPDFCLEPPYAGACRARIIRYFYNAKAGLCQ). On its own it is not enough for functional prediction.
Tertiary structure (3D structure): formed by the 3D folding pattern of the protein. It is what makes the protein functional.

Comparing protein 3D structures to get functional insight
Compare the structures of two DIFFERENT proteins.
[Figure: structure of 1QLQ; structure of 4HHB]

Significance of comparing protein 3D structures
Structural similarity between two proteins suggests functional similarity.
Predict binding sites.
Predict drug interactions.

Structural elements represented by a quintuple of features
Labels represent the primary structure (amino acid sequence).
Theta represents orientation.
Length represents size/scale.
Theta and length together describe the tertiary (3D) structure.

Structural alphabet (key) generation
Assign labels to the amino acids in a triple.
Perform rule-based label arrangement.
Calculate the angle and length.
Form the quintuple.
[Figure: triangle with vertices Label 1, Label 2, and Label 3, sides d12, d13, and d23, angle θ1, and representative length (D)]
The result is a mapping from structure space into a unique key (integer space).

Output of the key generation system
For every protein, millions of keys are generated, each representing some special feature. The protein structure is represented and stored as unique KEY-COUNT pairs.

Learning goals

Becoming familiar with a complex research problem and the process of solving it, including reading and understanding published research papers and using them in problem solving.
Implementing the algorithm(s) in parallel and demonstrating the speedup from serial to parallel.
Visualizing the output.

Task Outline

Calculate pairwise similarity between two proteins, implemented in PARALLEL (moduleA)
Similarity computation:
Jaccard coefficient, which takes sets (unique keys, count={0,1}) as its arguments.
Jaccard-Tanimoto coefficient, which takes multisets (count>1) as its arguments.
[Figure: structure of 1QLQ and the TSR key-count set representing it; structure of 4HHB and the TSR key-count set representing it]
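
The two coefficients can be sketched in Python as follows. This is a minimal sketch under assumptions: each protein is a dict mapping an integer key to its count, and the multiset version uses one common generalization of Jaccard (sum of minimum counts over sum of maximum counts). The slides do not state the exact Jaccard-Tanimoto formula, so check it against the referenced paper before relying on this. The function names and sample keys are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient on key sets; counts are ignored (treated as 0/1)."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 0.0


def multiset_jaccard(a, b):
    """Multiset generalization: sum of min counts / sum of max counts.

    ASSUMPTION: this is one common form, not necessarily the one
    defined in the paper the slides refer to.
    """
    keys = set(a) | set(b)
    shared = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    total = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return shared / total if total else 0.0


# Hypothetical key-count data, not taken from any real protein file.
protein_p = {101: 2, 205: 1, 333: 4}
protein_q = {101: 1, 333: 2, 999: 3}

print(jaccard(protein_p, protein_q))           # 2 shared keys / 4 keys = 0.5
print(multiset_jaccard(protein_p, protein_q))  # (1+0+2+0) / (2+1+4+3) = 0.3
```

A key that is missing from one protein simply contributes a count of 0 on that side, which is why `dict.get(k, 0)` is used.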

Input to moduleA
All input files will be given as key-count pairs.
Keys are integers, each representing a unique structural feature.
Every key listed for a given protein has a corresponding count >= 1.
Some keys may be present in one protein and absent in another, because they represent features unique to that protein.
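
Reading such a file might look like the sketch below. The on-disk layout is an assumption: the slides only say the files are text files of key-count pairs, so this sketch guesses one whitespace-separated `key count` pair per line. `parse_key_counts` is a hypothetical helper name.

```python
def parse_key_counts(lines):
    """Build a {key: count} dict from lines of the form 'key count'.

    ASSUMPTION: one whitespace-separated pair per line; blank lines
    are skipped. Adjust the split if the real files use another
    delimiter.
    """
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key_text, count_text = line.split()
        key, count = int(key_text), int(count_text)
        if count < 1:
            raise ValueError(f"count for key {key} must be >= 1, got {count}")
        # Repeated keys are summed (an assumption; the slides imply
        # each key appears once per protein).
        counts[key] = counts.get(key, 0) + count
    return counts


sample = ["101 2", "205 1", "", "333 4"]
print(parse_key_counts(sample))
```

With a real file, pass the open file object directly, since iterating over it yields lines: `parse_key_counts(open(path))`, where `path` is whatever file name you were given.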

Output from moduleA
You will be given a set of proteins and have to calculate the all-by-all pairwise similarity between them.
Display or write the pairwise similarities between the protein files as a lower triangular matrix for comparison purposes.
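
Because every pair can be scored independently, the all-by-all computation splits naturally into parallel tasks. The sketch below shows that decomposition with Python's standard `concurrent.futures` module. It is only an illustration: threads will not speed up CPU-bound pure-Python code, so a real solution would use processes or the Hadoop MapReduce option mentioned in the slides. The names `lower_triangular` and `key_set_jaccard` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def key_set_jaccard(a, b):
    """Plain Jaccard coefficient on the key sets of two key-count dicts."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


def lower_triangular(proteins, similarity, max_workers=4):
    """Return (names, rows); rows[i][j] is the similarity of names[i]
    and names[j] for j <= i (diagonal included)."""
    names = sorted(proteins)
    pairs = [(i, j) for i in range(len(names)) for j in range(i + 1)]

    def score(pair):
        i, j = pair
        return similarity(proteins[names[i]], proteins[names[j]])

    # Each pair is an independent task; map() returns results in order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        values = list(pool.map(score, pairs))

    rows = [[] for _ in names]
    for (i, _), value in zip(pairs, values):
        rows[i].append(value)
    return names, rows


# Hypothetical toy input, not real protein data.
toy = {"A": {1: 1, 2: 1}, "B": {2: 1, 3: 1}, "C": {1: 1, 2: 1}}
names, rows = lower_triangular(toy, key_set_jaccard)
for name, row in zip(names, rows):
    print(name, " ".join(f"{v:.3f}" for v in row))
```

Only the pairs with j <= i are computed, which roughly halves the work compared with filling the full symmetric matrix.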

Input to moduleB (the visualization module) and its output
The all-by-all pairwise similarity calculated in moduleA will be used as input to moduleB.
The output should be a connectivity graph (as shown in the next slide) between all proteins. Each edge must display the similarity value.
Preferably, the length of each edge is weighted by the similarity value between the two proteins it connects.
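
One way to hand the matrix to a graph tool is to flatten it into an edge list. The sketch below writes a CSV with `Source`, `Target`, and `Weight` columns, the layout Gephi's spreadsheet import accepts for edge tables. It assumes the matrix is stored as lower triangular rows with the diagonal included. `edges_csv`, the `min_similarity` cutoff, and the node names are hypothetical.

```python
import csv
import io


def edges_csv(names, rows, min_similarity=0.0):
    """Turn lower triangular rows into CSV text with one edge per line.

    ASSUMPTION: rows[i][j] holds the similarity of names[i] and
    names[j] for j <= i. Pairs at or below min_similarity get no edge.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for i, row in enumerate(rows):
        for j in range(i):  # skip the diagonal: no self-loops
            if row[j] > min_similarity:
                writer.writerow([names[i], names[j], row[j]])
    return buffer.getvalue()


# Hypothetical node names and similarity values.
node_names = ["P1", "P2", "P3"]
matrix = [[1.0], [0.25, 1.0], [0.0, 0.5, 1.0]]
print(edges_csv(node_names, matrix), end="")
```

The weight is stored on each edge, so the similarity value can be shown as an edge label; how a layout translates weight into drawn edge length depends on the layout algorithm chosen in the visualization tool.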

Construct structural similarity graph (moduleB)
Method for finding the global structural connectivity between proteins that contain a specific domain of interest.

Final system
The final system should integrate moduleA and moduleB.
Given a set of proteins, it should find the all-by-all similarity between them, display the lower triangular similarity matrix, and construct the similarity graph.

What do you get from me?

1. Training protein structure (key-count) files with their precalculated similarity values, both Jaccard and Jaccard-Tanimoto (around 50 proteins). You can use these to evaluate your system.
2. Test set (50 proteins): only key-count pairs, no similarity values.
3. All the files will be text files.
4. The time I took to calculate the all-by-all similarity on the test and training sets using an optimized serial algorithm, for comparison with your parallel implementation.
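
The serial timing in item 4 is the baseline for reporting speedup. A minimal sketch of that comparison follows, using the standard definitions speedup = serial time / parallel time and efficiency = speedup / number of workers. The helper names and the timings in the example are made up for illustration.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def speedup(serial_seconds, parallel_seconds):
    """How many times faster the parallel run is than the serial run."""
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds, parallel_seconds, workers):
    """Speedup per worker; 1.0 would be ideal linear scaling."""
    return speedup(serial_seconds, parallel_seconds) / workers


# Hypothetical timings: 10 s serial, 2.5 s on 8 workers.
print(speedup(10.0, 2.5))        # 4.0
print(efficiency(10.0, 2.5, 8))  # 0.5
```

Measure both runs on the same data set and the same machine, otherwise the ratio is not meaningful.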

You can use Hadoop MapReduce for moduleA.
Visualization can be done in Gephi.
Information on Jaccard and Jaccard-Tanimoto can be found in the following paper:
Lower triangular matrix: