2-D and 3-D Coordinates For M-Mers And Dynamic Graphics For Representing Associated Statistics By Daniel B. Carr George Mason University.


Overview Background Encoding and self-similar coordinates Examples Rendering software – GLISTEN Closing remarks

Background Task – Visualize statistics indexed by a sequence of letters Letter-Indexing –Nucleotides: AAGTAC –Amino Acids: KTLPLCVTL –Terminology: blocks of m letters called m-mers Statistics: counts or likelihoods for –Short DNA sequence motifs for transcription factor binding: gene regulation –Peptide docking on immune system molecules

Graphical Design Goals Provide an overview and selective focus Use geometric structures to –Organize statistics –Reveal patterns –Provide cognitive accessibility Incorporate scientific knowledge in layout choices –Enhance patterns and simplify comparisons

Common Practice - Tables Published tables – a linear list –Sorted by values of a statistic –Indexing letter sequences shown as row labels –Only a few of the thousands to millions of items are shown

Common Practice - Graphics 1-D histograms – some examples –Nucleotides: Distribution of promoters by distance upstream from the start codon –Amino acids: Sequence alignment logo plots are one variant Docking counts by position Cell-colored matrices? –More commonly used for microarray data and correlation matrices

Graphical Encoding Ideas: Use Points For M-Mers Represent m-mers using coordinates –A point stands for an m-mer –A glyph at the point represents statistics for that m-mer, for example through point color, size, or shape Challenge –The number of possible letter sequences grows exponentially with sequence length –Display space is limited

Self-Similar Coordinates Self-similarity helps us keep oriented –Parallel coordinate plots are increasingly familiar Coordinates from 3-D geometry –4 Nucleotides => tetrahedron –20 Amino acids: icosahedron face centers; familiar coordinates => hemisphere Two kinds of self-similarity –At different scales => fractals –At the same scale => shells, surfaces

Self-Similarity At Different Scales: Nucleotide Example Represent each 6-mer as a 3-D point –(4 nucleotides)^6 = 4096 points Attractor: tetrahedron vertices –A=(1,1,1), C=(1,-1,-1), G=(-1,1,-1), T=(-1,-1,1) Computation: –Hexamer position weights: 2^(5,4,3,2,1,0)/63 –ACGTTC -> (.555, .270, .206)
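The computation on this slide can be sketched as a weighted sum of tetrahedron vertices. The vertices and the 2^(5,...,0)/63 weights come from the slide; the function name `mer_to_point` is a hypothetical choice for illustration, and this is a minimal sketch rather than the GLISTEN implementation.

```python
# Tetrahedron vertices from the slide, one per nucleotide.
VERTICES = {
    "A": (1, 1, 1),
    "C": (1, -1, -1),
    "G": (-1, 1, -1),
    "T": (-1, -1, 1),
}


def mer_to_point(mer):
    """Map a nucleotide m-mer to a 3-D point inside the tetrahedron.

    The first letter gets the largest weight, 2^(m-1); the last gets 2^0.
    Weights are divided by their sum, 2^m - 1 (63 for hexamers).
    """
    m = len(mer)
    total = 2 ** m - 1
    sums = [0, 0, 0]
    for i, letter in enumerate(mer):
        weight = 2 ** (m - 1 - i)
        for axis, coord in enumerate(VERTICES[letter]):
            sums[axis] += weight * coord
    return tuple(s / total for s in sums)


x, y, z = mer_to_point("ACGTTC")
print(f"{x:.4f} {y:.4f} {z:.4f}")  # 35/63, 17/63, 13/63
```

For ACGTTC the integer sums are 35, 17 and 13, which agrees with the slide's (.555, .270, .206) to three digits. Because the leading letter dominates, all hexamers sharing a prefix land in the same sub-tetrahedron, which is the self-similarity at different scales.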

Application: Gene Regulation Studies Cluster genes based on –Gene expression levels in different situations –Other criteria such as gene family For each cluster look in gene regulation regions for recurrent nucleotide patterns –Over expressed m-mers: potential transcription factor docking sites Show frequencies (or multinomial likelihoods)

Nucleotides Example: Yeast Gene Regulation 29 Genes in a cluster –YBL072c –YDL130w –YDR025w –… –YCL054w Sliding hexamer window 300 letters upstream from open reading frames –300 ATATGA –299 TATGAG –298 ATGAGT –297 TGAGTA

Statistics Number of genes with hexamer –TTTTTC 22 –GAAAAA 21 –TTTTTT 19 –AAAAAT 19 –TTTTCA 18 –ATTTTT 17 Total number of appearances, etc.
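The sliding window and the "number of genes with hexamer" statistic can be sketched as follows. The function names and the toy sequences are hypothetical (they are not yeast data); only the windowing idea and the per-gene counting come from the slides.

```python
from collections import Counter


def windows(upstream, m=6):
    """Yield every length-m window of an upstream sequence, left to right."""
    for start in range(len(upstream) - m + 1):
        yield upstream[start:start + m]


def genes_with_mer(upstream_by_gene, m=6):
    """For each m-mer, count the genes whose upstream region contains it."""
    counts = Counter()
    for sequence in upstream_by_gene.values():
        # set() ensures a gene is counted once per m-mer,
        # however many times the m-mer appears in that gene.
        counts.update(set(windows(sequence, m)))
    return counts


# Hypothetical toy cluster; real input would be 300 letters per gene.
toy_cluster = {
    "geneA": "ATATGAGTA",
    "geneB": "TTATGAGTC",
    "geneC": "ATATGACCC",
}
counts = genes_with_mer(toy_cluster)
print(list(windows("ATATGAGTA")))
print(counts.most_common(3))
```

Dropping the `set()` call would give the total number of appearances instead, the other statistic the slide mentions.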

Extensions 2-D version (projected gasket) –10-mers => 1024 x 1024 pixel display Wild card and dimer counts –TACC……GGAA Include more scientific knowledge –Special representations for known transcription factors More interactivity –Filtering for regions upstream –Mouseovers, etc.
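One way to read the 2-D version is as the 3-D tetrahedron coordinates with the z axis dropped, which sends the four vertices to the corners of a square. That projection is an assumption made here, not something the slide states. Under it, each 10-mer lands on its own pixel of a 1024 x 1024 grid, since 4^10 = 1024 x 1024. The names `XY_BITS` and `mer_to_pixel` are hypothetical.

```python
from itertools import product

# Bits (sign + 1) / 2 of the x and y vertex coordinates for
# A=(1,1,1), C=(1,-1,-1), G=(-1,1,-1), T=(-1,-1,1).
# Dropping z is an ASSUMED projection, not taken from the slides.
XY_BITS = {"A": (1, 1), "C": (1, 0), "G": (0, 1), "T": (0, 0)}


def mer_to_pixel(mer):
    """Return (column, row) of an m-mer on a 2^m x 2^m pixel grid.

    Reading the letters as binary digits, most significant first, is
    equivalent to the weighted sum with weights 2^(m-1), ..., 2^0,
    shifted and scaled from [-1, 1] to integer pixel indices.
    """
    col = row = 0
    for letter in mer:
        bit_x, bit_y = XY_BITS[letter]
        col = 2 * col + bit_x
        row = 2 * row + bit_y
    return col, row


# Every 5-mer gets a distinct pixel on a 32 x 32 grid.
pixels = {mer_to_pixel("".join(p)) for p in product("ACGT", repeat=5)}
print(len(pixels))  # 1024
print(mer_to_pixel("ACGTTCAAGT"))
```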

Self-Similarity At Different Scales: Amino Acids Sequence Coordinates Represent each 3-mer as a 3-D point –(20 amino acids)^3 = 8000 points Attractor: icosahedron face centers –Let x1=.539, x2=.873, x3=1.412 –A=(x1,x3,0), C=(0,x1,x3), … Y=(-x3,0,-x1) Computation: –Position weights: 3.8^(2,1,0) scaled to sum to 1 –Letters HIT => (-1.26, -1.08, .180)
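A minimal sketch of the amino-acid version follows. The constants x1, x2, x3 and the weight base 3.8 are from the slide. The slide shows the letter assignment only for A, C and Y, so the table `KNOWN` below holds just those three; the assignment of the other 17 letters is elided in the source and is not guessed here. Building the 20 face centers as (±x2, ±x2, ±x2) plus the cyclic permutations of (±x1, ±x3, 0) is an assumption based on standard icosahedron geometry.

```python
from itertools import product

X1, X2, X3 = 0.539, 0.873, 1.412  # constants from the slide


def icosahedron_face_centers():
    """Return 20 face-center points built from x1, x2, x3 (assumed layout)."""
    centers = [(sx * X2, sy * X2, sz * X2)
               for sx, sy, sz in product((1, -1), repeat=3)]
    for s1, s3 in product((1, -1), repeat=2):
        a, b = s1 * X1, s3 * X3
        # Three cyclic arrangements, matching A, C and Y on the slide.
        centers += [(a, b, 0.0), (0.0, a, b), (b, 0.0, a)]
    return centers


def position_weights(m, base=3.8):
    """Weights base^(m-1), ..., base^0, scaled to sum to 1."""
    raw = [base ** (m - 1 - i) for i in range(m)]
    total = sum(raw)
    return [r / total for r in raw]


# Only the letters shown on the slide; the rest are elided there.
KNOWN = {"A": (X1, X3, 0.0), "C": (0.0, X1, X3), "Y": (-X3, 0.0, -X1)}


def mer_to_point(mer, letter_xyz=KNOWN, base=3.8):
    """Weighted sum of the letters' face centers, first letter heaviest."""
    weights = position_weights(len(mer), base)
    return tuple(
        sum(w * letter_xyz[letter][axis] for w, letter in zip(weights, mer))
        for axis in range(3)
    )


centers = icosahedron_face_centers()
print(len(centers), position_weights(3))
print(mer_to_point("ACY"))
```

Reproducing the slide's HIT example would require the full 20-letter table passed as `letter_xyz`.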

Graphical Encoding Ideas: Paths Use paths connecting m-mer points to represent longer sequences –Path features such as thickness and color can encode statistics indexed by the concatenated m-mers –Can reuse the m-mers, keeping a common framework –Three 3-mers -> two-segment path -> 9-mer Challenges –Overplotting, path ambiguity, prime sequence lengths –Using translucent triangles for triples is poor, etc.

Letter x Position Coordinates And Paths Merits –Few points and simple structure 20 amino acids by 9 positions = 180 points Challenges –Path overplotting =>filtering –Avoiding path interpretation ambiguity in higher dimensional tables => 3-D layouts

Self-Similarity At The Same Scale: Amino Acids Coordinates Each point represents a letter and position pair –9-mers: 20 letters x 9 positions = 180 points Geometry: icosahedron face centers –Let x1=.539, x2=.873, x3=1.412 –A=(x1,x3,0), C=(0,x1,x3), … Y=(-x3,0,-x1) Use scale factor for a given position –Scale factors for 9-mers: 2.2, 2.4, 2.6, …, 3.6 –A1 => 2.2*(x1,x3,0), C2 => 2.4*(0,x1,x3) Problem: overplotting of paths
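The shell layout can be sketched as one scaled copy of the face-center geometry per position, with a peptide drawn as a path through its nine (letter, position) points. The scale factors are passed in as a list because the slide gives them only partially: the default below assumes a uniform step of 0.2 starting at 2.2, which ends at 3.8 for nine positions rather than the 3.6 shown on the slide. The letter table again holds only the three letters the slide spells out, and the function names are hypothetical.

```python
X1, X3 = 0.539, 1.412  # constants from the slide

# Only the letter assignments shown on the slide; the rest are elided.
LETTER_XYZ = {"A": (X1, X3, 0.0), "C": (0.0, X1, X3), "Y": (-X3, 0.0, -X1)}

# ASSUMED scale factors: 2.2, 2.4, ... in steps of 0.2 for nine positions.
SCALES = [round(2.2 + 0.2 * i, 1) for i in range(9)]


def shell_point(letter, position, scales=SCALES, letter_xyz=LETTER_XYZ):
    """Point for a (letter, position) pair; position is 1-based."""
    scale = scales[position - 1]
    return tuple(scale * coord for coord in letter_xyz[letter])


def peptide_path(peptide, scales=SCALES, letter_xyz=LETTER_XYZ):
    """Ordered list of shell points visited by a peptide, one per position."""
    return [shell_point(letter, pos, scales, letter_xyz)
            for pos, letter in enumerate(peptide, start=1)]


print(shell_point("A", 1))  # A1 => 2.2 * (x1, x3, 0)
print(shell_point("C", 2))  # C2 => 2.4 * (0, x1, x3)
print(len(peptide_path("ACYACYACY")))
```

Since every shell shares the same directions, a given letter lies on one ray from the origin at all positions, which is the self-similarity at the same scale.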

Self-Similarity At The Same Scale: Amino Acids Example Each point represents a letter and position pair –9-mers: 20 letters x 9 positions = 180 points Geometry: hemisphere –Amino acid: longitude, Position: latitude –Amino acid ordering Group by chemical properties: hydrophobic, etc. Order to minimize path length in given application –Include gaps for perceptual grouping Path overplotting still a problem, need filtering
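A hemisphere layout along these lines might look like the sketch below: the amino acid selects the longitude and the position selects the latitude. The slide does not give the letter order, the gap placement, or the latitude range, so the alphabetical order, the equal spacing, and the 10 to 80 degree band used here are all placeholder assumptions.

```python
import math

# Placeholder order: alphabetical one-letter codes, NOT grouped by
# chemical property. A real layout would reorder and insert gaps.
LETTER_ORDER = list("ACDEFGHIKLMNPQRSTVWY")


def hemisphere_point(letter, position, letter_order=LETTER_ORDER,
                     n_positions=9, lat_min_deg=10.0, lat_max_deg=80.0):
    """Unit-sphere point for a (letter, position) pair; position is 1-based.

    Longitude comes from the letter's slot in letter_order; latitude is
    interpolated between lat_min_deg and lat_max_deg (assumed values).
    """
    slot = letter_order.index(letter)
    lon = 2.0 * math.pi * slot / len(letter_order)
    frac = (position - 1) / (n_positions - 1)
    lat = math.radians(lat_min_deg + frac * (lat_max_deg - lat_min_deg))
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


points = {(letter, pos): hemisphere_point(letter, pos)
          for letter in LETTER_ORDER for pos in range(1, 10)}
print(len(points))  # 20 letters x 9 positions = 180
```

Gaps for perceptual grouping could be added by placing `None` entries in a custom `letter_order`, which leaves those longitude slots empty.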

Peptide Docking Example Immune system molecules combine with peptides to form a complex recognized by T-cell receptors –Problems: Failure to dock foreign peptides; docking with “self” peptides Molecule-specific databases of docking peptides –MHCPEP 1997, Brusic, Rudy, and Harrison –Human leukocyte antigen (HLA) A2, class 1 molecule Small: about 500 peptides of 20^9 ≈ ½ trillion possibilities Mostly 9-mers (483) Positions related to asymmetric docking groove

Peptide Docking Interests Which amino acids appear in which position? Characterize the space of docking, not-docking, unknown Prediction of unknowns Focused questions Is there a docking peptide in a key protein common to all 23 HIV strains?

Docking Statistics Number of the 483 peptides with the amino acid in position 2 (plot; amino-acid axis labels: M Q P S T F V A L G I K R H E D C W N Y) Cells from the collection of all 4-position tables: 126 tables of potentially 20^4 = 160,000 cells each –G4 F5 V6 F7: 35 –L2 A7 A8 V9: 29 –…
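The collection of 4-position tables can be sketched by enumerating every choice of 4 positions out of 9 (126 choices) and counting the letter combinations that occur at those positions. The function names and the three toy peptides are hypothetical and are not entries from the docking database.

```python
from collections import Counter
from itertools import combinations


def position_tables(peptides, k=4):
    """Build one count table per k-subset of positions (1-based).

    Each table maps a tuple of k letters to the number of peptides
    carrying those letters at the chosen positions.
    """
    length = len(peptides[0])
    tables = {pos: Counter()
              for pos in combinations(range(1, length + 1), k)}
    for peptide in peptides:
        for pos, table in tables.items():
            table[tuple(peptide[p - 1] for p in pos)] += 1
    return tables


def cell_label(positions, letters):
    """Format a cell the way the slide does, e.g. 'G4 F5 V6 F7'."""
    return " ".join(f"{l}{p}" for l, p in zip(letters, positions))


# Hypothetical toy 9-mers for illustration only.
toy_peptides = ["KLAGFVFTV", "ALMGFVFAV", "GLYDGMEHL"]
tables = position_tables(toy_peptides)
print(len(tables))  # C(9, 4) = 126
top = max(
    ((count, pos, letters)
     for pos, table in tables.items()
     for letters, count in table.items()),
    key=lambda cell: cell[0],
)
print(cell_label(top[1], top[2]), top[0])
```

Only observed cells are stored, so each table stays far smaller than its 20^4 potential cells when the database holds a few hundred peptides.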

Graphics Software GLISTEN –Geometric Letter-Indexed Statistical Table Encoding –Swap out coordinates at will with tables unchanged –NSF research: second generation version in progress Available partial alternatives –CrystalVision ftp:// –GGobi

Hemisphere Plot Versus Parallel Coordinate Plots PC plots are –Better for the many scientists preferring flatland –Straightforward to publish –Ambiguous when connecting non-adjacent axes Hemisphere plots –3-D curvature reduces line ambiguity and provides a general framework for tables involving non-adjacent positions –3-D provides more neighbor options to group amino acids based on chemical properties: non-polar, etc.

Closing Remarks Docking applications are still evolving –New procedures for inference and better databases Graphics still need work –More scientific structure –Work on cognitive optimization GLISTEN can address many other applications

Graphics Reference Lee, et al. 2002, “The Next Frontier for Bio- and Cheminformatics Visualization,” IEEE Computer Graphics and Applications, Sept/Oct pp,

Related Scientific References (1) Spellman, et al. “Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization,” Molecular Biology of the Cell. Vol 9, pp Keles, van der Laan, and Eisen “Identification of regulatory elements using a feature selection method.” Bioinformatics, Vol. 18. No 9. pp

Related Scientific References (2) Segal, Cummings, and Hubbard “Relating Amino Acid Sequences to Phenotypes: Analysis of Peptide-Binding Data,” Biometrics 57, pp