BIOINFORMATIC ALGORITHMS Ryan Tinsley, Brandon Lile, May 9th, 2014

Bioinformatics
What? Why?
Remains a large frontier
Goals:
– Organize and serve data
– Develop tools to analyze it
– Interpret results

Clustering
The goal of clustering is to take large amounts of data and group them into similarity classes. Viewing and analyzing vast amounts of biological data as a whole can be challenging; the data are easier to interpret once they are partitioned into clusters of similar data points. Clustering arose from researchers wanting to know the functions of newly sequenced genes. Comparing a new gene's sequence to already known DNA sequences often does not reveal the gene's function. In fact, for 40% of sequenced genes, function cannot be determined by comparison with the sequences of known genes alone.

Microarrays
Biologists infer gene functions by using microarrays. A microarray is a grid of DNA segments of known sequence that is used to test and map DNA fragments. Microarrays measure the expression levels of genes under varying conditions; the expression level of a gene is estimated by measuring the amount of mRNA for that gene. Microarray data are transformed into an intensity matrix, which allows biologists to find correlations between genes. This is where clustering comes in handy. Each gene's row of the intensity matrix is plotted as a point in N-dimensional space, and clustering builds a distance matrix holding the distance between every pair of gene points. Genes that are close together tend to share characteristics, so the outcome is a set of clusters of genes with related functionality.
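As a rough illustration of the distance matrix step, the sketch below computes pairwise distances between the rows of a small intensity matrix. The expression values are made up, and the choice of Euclidean distance is an assumption; the slides do not name a metric.

```python
import math

def distance_matrix(intensity):
    """Return the symmetric matrix of Euclidean distances between rows.

    Each row of `intensity` holds one gene's expression levels across
    N conditions, i.e. one point in N-dimensional space.
    """
    n = len(intensity)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(intensity[i], intensity[j])
            d[i][j] = d[j][i] = dist
    return d

# Hypothetical expression levels for four genes under three conditions.
intensity = [
    [10.0, 8.0, 10.0],
    [10.0, 0.0, 9.0],
    [4.0, 8.5, 3.0],
    [9.5, 0.5, 8.5],
]
d = distance_matrix(intensity)
```

In this toy data, genes 1 and 3 end up much closer to each other than genes 0 and 1, which is the kind of signal a clustering step would pick up.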

[Figure: microarray image. Green: control. Red: experimental cell. Yellow: in both samples. Black: in neither sample. Each box represents one gene's expression over time.]
Microarrays can also be expressed as tables.

Hierarchical Clustering
Hierarchical clustering is a method that seeks to build a hierarchy of clusters. There are two approaches, agglomerative and divisive. Agglomerative clustering starts with every element in its own cluster and iteratively joins clusters together; the time complexity of the general algorithm is O(n^3). Divisive clustering starts with one cluster and iteratively divides it into smaller clusters; with an exhaustive search it is O(2^n), which is even worse. However, for some special cases, optimally efficient agglomerative methods (of complexity O(n^2)) are known: SLINK[1] for single-linkage and CLINK[2] for complete-linkage clustering.

Hierarchical clustering is often used to reveal evolutionary history. Under the minimum-distance rule, the distance between two clusters is the smallest distance between any pair of their elements. Under the average-distance rule, it is the average distance over all pairs of their elements.
The example shown is a good clustering: it satisfies both the homogeneity and separation principles.

H.C. Algorithm
HierarchicalClustering(d, n)
    Form n clusters, each with one element
    Construct a graph T by assigning one vertex to each cluster
    while there is more than one cluster
        Find the two closest clusters C1 and C2
        Merge C1 and C2 into a new cluster C with |C1| + |C2| elements
        Compute the distance from C to all other clusters
        Add a new vertex C to T and connect it to vertices C1 and C2
        Remove the rows and columns of d corresponding to C1 and C2
        Add a row and column to d corresponding to the new cluster C
    return T
Distance between clusters:
$d_{\min}(C, C^*) = \min_{x \in C,\, y \in C^*} d(x, y)$
$d_{\mathrm{avg}}(C, C^*) = \frac{1}{|C^*|\,|C|} \sum_{x \in C,\, y \in C^*} d(x, y)$
Squared Error Distortion:
$d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2$
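The agglomerative procedure above can be sketched in Python as follows. This is a minimal illustration, not the presenters' code: the function name, the `linkage` parameter, and the example matrix are assumptions, and cluster distances are recomputed from the original matrix on each pass instead of updating rows and columns of d in place.

```python
def hierarchical_clustering(d, linkage="min"):
    """Agglomerative clustering over a symmetric distance matrix `d`.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where clusters are frozensets of element indices. The history encodes
    the tree T from the pseudocode.
    """
    def cluster_distance(a, b):
        pairs = [d[x][y] for x in a for y in b]
        if linkage == "min":
            return min(pairs)            # d_min: closest pair of elements
        return sum(pairs) / len(pairs)   # d_avg: mean over all pairs

    clusters = [frozenset([i]) for i in range(len(d))]
    merges = []
    while len(clusters) > 1:
        # Find the two closest clusters C1 and C2.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = cluster_distance(clusters[i], clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        c1, c2 = clusters[i], clusters[j]
        merges.append((c1, c2, dist))
        # Replace C1 and C2 with the merged cluster C.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(c1 | c2)
    return merges

# Hypothetical distances between four elements.
d = [
    [0, 1, 6, 7],
    [1, 0, 5, 8],
    [6, 5, 0, 2],
    [7, 8, 2, 0],
]
merges = hierarchical_clustering(d)
avg_merges = hierarchical_clustering(d, linkage="avg")
```

Recomputing distances keeps the sketch short at the price of extra work per iteration; a version that maintains d as in the pseudocode would avoid that.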

K-Means Clustering and Lloyd's Algorithm
K-Means clustering is a technique for partitioning a set of N points into K clusters. The 1-Means clustering problem is easy, but the problem becomes very difficult (NP-hard) for more than one center. An efficient heuristic for K-Means clustering is the Lloyd algorithm. In its original formulation, Lloyd's algorithm takes a continuous geometric region as input rather than a discrete set of points; for K-Means the same iteration is applied to the data points. Lloyd's algorithm is fast, but in each iteration it moves many data points, which does not necessarily lead to better convergence. A more conservative alternative is Progressive Greedy K-Means clustering, which moves one point at a time, and only when the move improves the overall clustering cost. A comparison between the two concluded that Lloyd's method is more efficient both in run time and in the best squared error distortion (SED) found.

K-Means Clustering: Lloyd Algorithm
Arbitrarily assign the k cluster centers
while the cluster centers keep changing
    Assign each data point to the cluster C_i corresponding to the closest cluster representative (center) (1 ≤ i ≤ k)
    After the assignment of all data points, compute new cluster representatives according to the center of gravity of each cluster; that is, the new cluster representative of every cluster C is (∑ v) / |C| over all v in C
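A minimal Python sketch of the Lloyd iteration on a discrete point set is given below. The function name, the `max_iter` safeguard, the handling of empty clusters, and the sample points are assumptions added for the example.

```python
import math

def lloyd_kmeans(points, centers, max_iter=100):
    """Lloyd iteration on a discrete set of points.

    `centers` is the initial (arbitrary) list of k cluster centers.
    Returns (centers, assignment), where assignment[i] is the index of
    the cluster that points[i] belongs to.
    """
    centers = [tuple(c) for c in centers]
    assignment = []
    for _ in range(max_iter):  # assumed safeguard against long runs
        # Assign each point to the closest cluster center.
        assignment = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Recompute each center as the center of gravity of its cluster.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # empty cluster keeps its center
        if new_centers == centers:
            break  # centers stopped changing
        centers = new_centers
    return centers, assignment

# Hypothetical two-dimensional data with two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centers, assignment = lloyd_kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

Because every point is reassigned in each pass, one iteration can move many points at once, which is the behaviour the slides contrast with the greedy variant.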

Progressive “Greedy” K-Means
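The body of this slide did not survive in the transcript, so the following is only a sketch of the idea stated earlier, that the progressive greedy method moves one point at a time. It assumes the clustering cost is the squared error distortion measured against each cluster's own center of gravity, and that the single most improving move is taken in each round; the function names and sample data are hypothetical.

```python
import math

def distortion(points, assignment, k):
    """Mean squared distance from each point to its own cluster's centroid."""
    total = 0.0
    for c in range(k):
        members = [p for p, a in zip(points, assignment) if a == c]
        if not members:
            continue  # an empty cluster contributes nothing
        center = tuple(sum(x) / len(members) for x in zip(*members))
        total += sum(math.dist(p, center) ** 2 for p in members)
    return total / len(points)

def progressive_greedy_kmeans(points, assignment, k):
    """Repeatedly apply the single point move that lowers the cost the most."""
    assignment = list(assignment)
    while True:
        current = distortion(points, assignment, k)
        best_gain, best_move = 0.0, None
        for i in range(len(points)):
            original = assignment[i]
            for c in range(k):
                if c == original:
                    continue
                assignment[i] = c  # try moving point i to cluster c
                gain = current - distortion(points, assignment, k)
                if gain > best_gain:
                    best_gain, best_move = gain, (i, c)
            assignment[i] = original  # undo the trial move
        if best_move is None:
            return assignment  # no single move improves the cost
        i, c = best_move
        assignment[i] = c

# Hypothetical data, starting from a deliberately poor partition.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
result = progressive_greedy_kmeans(points, [0, 1, 0, 1], k=2)
final_cost = distortion(points, result, 2)
```

Each accepted move strictly lowers the cost, so the loop cannot revisit a partition and must stop, though it may stop at a local optimum.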

K-Means Execution Example

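Distance Graph
A distance matrix can be turned into a distance graph. Genes are represented as vertices, and two genes are joined by an edge when the distance between them is below a chosen threshold. With ideal clusters the result is a clique graph, in which every connected component is a clique. Real data produce corrupted cliques, with some edges missing and some extra edges present; the CAST algorithm is a heuristic for this problem.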
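To make the distance graph concrete, the sketch below joins two genes whenever their distance is below a threshold and then checks whether every connected component is a clique. The threshold name `theta`, the helper names, and the example matrix are assumptions; this is not the CAST algorithm itself.

```python
def distance_graph(d, theta):
    """Adjacency sets: genes i and j are joined when d[i][j] < theta."""
    n = len(d)
    return {
        i: {j for j in range(n) if j != i and d[i][j] < theta}
        for i in range(n)
    }

def connected_components(graph):
    """Return the connected components as a list of vertex sets."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(graph[v] - comp)
        seen |= comp
        components.append(comp)
    return components

def is_clique_graph(graph):
    """True when every vertex is adjacent to all others in its component."""
    return all(
        graph[v] == comp - {v}
        for comp in connected_components(graph)
        for v in comp
    )

# Hypothetical distances between four genes.
d = [
    [0, 1, 6, 7],
    [1, 0, 5, 8],
    [6, 5, 0, 2],
    [7, 8, 2, 0],
]
tight = distance_graph(d, theta=3)    # edges 0-1 and 2-3
loose = distance_graph(d, theta=5.5)  # adds edge 1-2, giving a path 0-1-2-3
```

With the tighter threshold the graph splits into two cliques; the looser threshold yields a single component that is not a clique, a small instance of the corruption the slide mentions.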

Other Clustering Uses
Clustering has a few more applications. It can be applied to highly homologous sequences to reduce the size of large protein databases, which reduces complexity and speeds up comparisons. It can be used to evaluate protein-protein interaction networks: the structure of protein interactions is represented as a graph, and clustering then looks for clusters in that graph. It can also improve protein structure prediction by merging the predictions made by a large number of alternative conformation models.

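BLAST
Need a way to compare newly sequenced genes with previously characterized genes
Exact local alignment takes quadratic O(nm) time, which is too slow for searching large databases
A heuristic approach is used instead: it is much faster in practice, with O(nm) runtime only as a worst case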

BLAST cont…
Step 1: Removal of low-complexity/repetitive regions
– DUST
Step 2: Splitting the amino acid or nucleotide sequence into words (short subsequences)
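Step 2 can be pictured as sliding a fixed-size window over the query. The sketch below is an illustration only: the function name and the word length `w` are assumptions, and it is not taken from any BLAST implementation.

```python
def split_into_words(sequence, w):
    """Return every overlapping length-w word together with its start offset."""
    return [(i, sequence[i:i + w]) for i in range(len(sequence) - w + 1)]

# Hypothetical nucleotide query split into words of length 3.
words = split_into_words("GATTACA", 3)
```

A query of length n yields n - w + 1 overlapping words, so a query shorter than w yields none.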

BLAST cont…
Step 3: All words scoring above the threshold are added to the query word list and placed in an efficient search tree

BLAST cont…
Step 4: DB queried for matches
Step 5: Verification/extension of matches
– Ungapped BLAST
– Gapped BLAST
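Steps 4 and 5 can be sketched as a seed lookup followed by an ungapped extension. The code below is a simplified illustration and not how any BLAST program is implemented: it matches words exactly through a dictionary instead of a search tree, scores with an assumed +1 match and -1 mismatch in place of a substitution matrix, and uses a hypothetical `x_drop` cutoff to stop extending.

```python
def find_seeds(query, subject, w):
    """Exact word hits as (query_offset, subject_offset) pairs."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits

def extend_ungapped(query, subject, i, j, w, match=1, mismatch=-1, x_drop=2):
    """Extend a seed in both directions without gaps.

    Extension in a direction stops once the running score falls more
    than x_drop below the best score seen in that direction.
    Returns (score, query_start, query_end) for the best extension.
    """
    def extend(qi, sj, step):
        score = best = 0
        length = best_len = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = extend(i + w, j + w, 1)
    left_score, left_len = extend(i - 1, j - 1, -1)
    total = w * match + left_score + right_score
    return total, i - left_len, i + w + right_len

# Hypothetical query and database sequence sharing the region GATTACA.
query = "CCGATTACAGG"
subject = "TTGATTACATT"
seeds = find_seeds(query, subject, 4)
score, start, end = extend_ungapped(query, subject, 2, 2, 4)
```

Gapped extension would replace the second function with a banded dynamic programming step, which is omitted here to keep the sketch short.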

BLAST Verification
[Figure panels: Ungapped, Gapped]
Step 6: Verify that matches meet the threshold and return ranked results

BLAST Implementations
NCBI:
– blastn
– blastp
– blastx
– etc.
mpiBLAST
ScalaBLAST
BLAT
Future of BLAST

Questions?