Scalable Visual Comparison of Biological Trees and Sequences

Slides:



Advertisements
Similar presentations
1 TreeJuxtaposer side by side comparison of evolutionary trees.
Advertisements

1 TreeJuxtaposer: Scalable Tree Comparison using Focus+Context with Guaranteed Visibility Tamara Munzner Univ. British Columbia François Guimbretière Univ.
SequenceJuxtaposer: Fluid Navigation For Large-Scale Sequence Comparison in Context James Slack *, Kristian Hildebrand *†, Tamara Munzner * and Katherine.
Indexing DNA Sequences Using q-Grams
A General Algorithm for Subtree Similarity-Search The Hebrew University of Jerusalem ICDE 2014, Chicago, USA Sara Cohen, Nerya Or 1.
Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Searching on Multi-Dimensional Data
Scalable Visualization with Accordion Drawing Tamara Munzner University of British Columbia Department of Computer Science joint work with James Slack,
CAP4730: Computational Structures in Computer Graphics Visible Surface Determination.
 Aim in building a phylogenetic tree is to use a knowledge of the characters of organisms to build a tree that reflects the relationships between them.
1 Scalable Visual Comparison of Biological Trees and Sequences Tamara Munzner University of British Columbia Department of Computer Science Imager.
1 Scalable Visual Comparison of Biological Trees and Sequences Tamara Munzner University of British Columbia Department of Computer Science Imager.
1 Information Visualization at UBC Tamara Munzner University of British Columbia.
University of British Columbia Department of Computer Science Tamara Munzner Interactive Visualization of Evolutionary Trees and Gene Sequences February.
1 Information Visualization with Accordion Drawing Tamara Munzner University of British Columbia.
University of British Columbia Department of Computer Science Tamara Munzner Visualization: From Pixels to Insight March 3, 2007 UBC CS TechTrek.
1 Presented by Jean-Daniel Fekete. 2  Motivation  Mélange [Elmqvist 2008] Multiple Focus Regions.
Composite Rectilinear Deformation for Stretch and Squish Navigation James Slack, Tamara Munzner University of British Columbia November 1, 2006 Imager.
Bounding Volume Hierarchy “Efficient Distance Computation Between Non-Convex Objects” Sean Quinlan Stanford, 1994 Presented by Mathieu Brédif.
1 Scalable Visual Comparison of Biological Trees and Sequences Tamara Munzner University of British Columbia Department of Computer Science Imager.
DEPARTMENT OF COMPUTER SCIENCE SOFTWARE ENGINEERING, GRAPHICS, AND VISUALIZATION RESEARCH GROUP 15th International Conference on Information Visualisation.
B + -Trees (Part 1). Motivation AVL tree with N nodes is an excellent data structure for searching, indexing, etc. –The Big-Oh analysis shows most operations.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Partitioned Rendering Infrastructure with Stable Accordion Navigation James Slack MSc Thesis Presentation April 4, 2005.
1 PRISAD: A Partitioned Rendering Infrastructure for Scalable Accordion Drawing James Slack, Kristian Hildebrand, Tamara Munzner University of British.
Extending TreeJuxtaposer Nicholas Chen Maryam Farboodi May 9, 2006.
10/11/2001CS 638, Fall 2001 Today Kd-trees BSP Trees.
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.
Navigating and Browsing 3D Models in 3DLIB Hesham Anan, Kurt Maly, Mohammad Zubair Computer Science Dept. Old Dominion University, Norfolk, VA, (anan,
Improving Min/Max Aggregation over Spatial Objects Donghui Zhang, Vassilis J. Tsotras University of California, Riverside ACM GIS’01.
Mouse Genome Sequencing
BLAST: A Case Study Lecture 25. BLAST: Introduction The Basic Local Alignment Search Tool, BLAST, is a fast approach to finding similar strings of characters.
VAST 2011 Sebastian Bremm, Tatiana von Landesberger, Martin Heß, Tobias Schreck, Philipp Weil, and Kay Hamacher Interactive-Graphics Systems TU Darmstadt,
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Binary Space Partitioning Trees Ray Casting Depth Buffering
- Laboratoire d'InfoRmatique en Image et Systèmes d'information
Coherent Hierarchical Culling: Hardware Occlusion Queries Made Useful Jiri Bittner 1, Michael Wimmer 1, Harald Piringer 2, Werner Purgathofer 1 1 Vienna.
Photo-realistic Rendering and Global Illumination in Computer Graphics Spring 2012 Hybrid Algorithms K. H. Ko School of Mechatronics Gwangju Institute.
Where We Stand At this point we know how to: –Convert points from local to window coordinates –Clip polygons and lines to the view volume –Determine which.
Copyright OpenHelix. No use or reproduction without express written consent1.
1 An Efficient Optimal Leaf Ordering for Hierarchical Clustering in Microarray Gene Expression Data Analysis Jianting Zhang Le Gruenwald School of Computer.
Concept Relationship Editor: A visual interface to support the assertion of synonymy relationships between taxonomic classifications Paul Craig & Jessie.
Clustering [Idea only, Chapter 10.1, 10.2, 10.4].
An Evaluation of Pan & Zoom and Rubber Sheet Navigation with and without an Overview Dmitry Nekrasovski, Adam Bodnar, Joanna McGrenere, François Guimbretière,
Spatial Data Management
Lesson: Sequence processing
Real-Time Soft Shadows with Adaptive Light Source Sampling
STBVH: A Spatial-Temporal BVH for Efficient Multi-Segment Motion Blur
Distance Computation “Efficient Distance Computation Between Non-Convex Objects” Sean Quinlan Stanford, 1994 Presentation by Julie Letchner.
So far we have covered … Basic visualization algorithms
Real-Time Ray Tracing Stefan Popov.
PowerSchool for Parents
Real-time Wall Outline Extraction for Redirected Walking
3D Object Representations
© University of Wisconsin, CS559 Fall 2004
And now for something completely different . . .
Orthogonal Range Searching and Kd-Trees
Hierarchical clustering approaches for high-throughput data
Craig Schroeder October 26, 2004
Indexing and Hashing Basic Concepts Ordered Indices
Structural / Functional Site Diagramming
Data Structures Introduction
Motion Graphs Davey Krill May 3, 2006.
Using Animation and Multimedia
Standards learning goals - syllabus lecture notes – the current .ppt
Applying principles of computer science in a biological context
B-Trees.
CSE 326: Data Structures Lecture #14
Clustering.
Presentation transcript:

Scalable Visual Comparison of Biological Trees and Sequences Tamara Munzner University of British Columbia Department of Computer Science Imager

Outline Accordion Drawing TreeJuxtaposer SequenceJuxtaposer PRISAD information visualization technique TreeJuxtaposer tree comparison SequenceJuxtaposer sequence comparison PRISAD generic accordion drawing framework

Accordion Drawing rubber-sheet navigation guaranteed visibility stretch out part of surface, the rest squishes borders nailed down Focus+Context technique integrated overview, details old idea [Sarkar et al 93], [Robertson et al 91] guaranteed visibility marks always visible important for scalability new idea [Munzner et al 03]

Guaranteed Visibility marks are always visible easy with small datasets 4

Guaranteed Visibility Challenges hard with larger datasets reasons a mark could be invisible

Guaranteed Visibility Challenges hard with larger datasets reasons a mark could be invisible outside the window AD solution: constrained navigation

Guaranteed Visibility Challenges hard with larger datasets reasons a mark could be invisible outside the window AD solution: constrained navigation underneath other marks AD solution: avoid 3D

Guaranteed Visibility Challenges hard with larger datasets reasons a mark could be invisible outside the window AD solution: constrained navigation underneath other marks AD solution: avoid 3D smaller than a pixel AD solution: smart culling

Guaranteed Visibility: Small Items Naïve culling may not draw all marked items GV no GV Guaranteed visibility of marks No guaranteed visibility

Guaranteed Visibility: Small Items Naïve culling may not draw all marked items GV no GV Guaranteed visibility of marks No guaranteed visibility

Outline Accordion Drawing TreeJuxtaposer tree comparison information visualization technique TreeJuxtaposer tree comparison SequenceJuxtaposer sequence comparison PRISAD generic accordion drawing framework

Phylogenetic/Evolutionary Tree To do so you will build this phylogenetic tree with leaves are species, in this case frogs, and internal nodes are inferred speciation points where one species become two following a mutation for example. M Meegaskumbura et al., Science 298:379 (2002)

Common Dataset Size Today While most publication are repporting result on small trees (60 leaves and around 120 nodes) like the one shown here, Phylogenetic trees are used by biologists to depict and understand the ancestral relationships between species, for example these frogs. The leaves of the tree are existing species, and the interior nodes are bifurcation points where an ancestral species split into two younger species through a specific mutation. These interior nodes representing common ancestors must be inferred from available data. M Meegaskumbura et al., Science 298:379 (2002)

Future Goal: 10M node Tree of Life Animals Plants You are here Several group are tackling a vastly more ambitious problem: building the tree of life the tree of all species known on eath as shown here. This tree is roughly 10M leaves. Its structure is still and active area of research which made it to the front page of science last June. The entire tree of life which represent all the species leaving on the planet contains about 10M leaves. Notice that mammals is only a tiny fraction of the tree. Still an open problem – even primates aren’t clear, everything is still hypothesized. Protists Fungi David Hillis, Science 300:1687 (2003)

Paper Comparison: Multiple Trees focus In the process of constructing these trees biologist often need to compare them side by side, a task which proves difficult with today’s tools, even for tree as small as the 5000 node such as the one shown here. So they often switch to a paper/highlighter interface. As you might imagine this approach is both tedious and error prone. One of the major step for step in building this tree is to compared side by side to possible candidates to understand the difference and infer with one is the most probable tree. While difference tool like MacClade do exist for small tree, they do not scale for even medium data set forcing user to rely on large hand made collage and print-out. As you might imagine this setting is not only tedious but also error prone. Comparing trees is hard, and there are no good tools for the job. This biologist taped together many pieces of paper and manually highlighted similar vs. different leaves, showing the need for tools that show both details and their surrounding context. The task becomes rapidly intractable for even medium side because of the the current tools are not design for large size dataset. This in turn force the biologist to do manual comparison, a laborious and error prone task. context

TreeJuxtaposer side by side comparison of evolutionary trees [video] video/software downloadable from http://olduvai.sf.net/tj

TJ Contributions first interactive tree comparison system automatic structural difference computation guaranteed visibility of marked areas scalable to large datasets 250,000 to 500,000 total nodes all preprocessing subquadratic all realtime rendering sublinear scalable to large displays (4000 x 2000) introduced guaranteed visibility, accordion drawing

Structural Comparison rayfinned fish rayfinned fish salamander lungfish frog salamander mammal frog bird turtle crocodile snake Even for simple tree like the two shown here, identifying structural difference can be quite difficult. lizard lizard snake crocodile turtle mammal lungfish bird

Matching Leaf Nodes rayfinned fish rayfinned fish salamander lungfish frog salamander mammal frog bird turtle crocodile snake While matching leaf nodes is straightforward, lizard lizard snake crocodile turtle mammal lungfish bird

Matching Leaf Nodes rayfinned fish rayfinned fish salamander lungfish frog salamander mammal frog bird turtle crocodile snake lizard lizard snake crocodile turtle mammal lungfish bird

Matching Leaf Nodes rayfinned fish rayfinned fish salamander lungfish frog salamander mammal frog bird turtle crocodile snake lizard lizard snake crocodile turtle mammal lungfish bird

Matching Interior Nodes rayfinned fish rayfinned fish salamander lungfish frog salamander mammal frog bird turtle crocodile snake Matching internal nodes is far more complicated especially when the internal node do not have labels as it is the case in many of our dataset. Here I show as simple case, matching layout imply a matching sub-tree lizard lizard snake crocodile turtle mammal lungfish bird

Matching Interior Nodes rayfinned fish rayfinned fish salamander lungfish frog salamander mammal frog bird turtle crocodile snake but one has to remember that order is not important while comparing topology Leaf match: labels Interior node match difficult Often unlabelled Siblings unordered lizard lizard snake crocodile turtle mammal lungfish bird

Matching Interior Nodes rayfinned fish rayfinned fish salamander lungfish frog salamander mammal frog bird turtle crocodile snake and of course proximity can be confounding too. While bird and crocodile are close to each other here their relationship is in fact quite different in this two trees. lizard lizard snake crocodile turtle bird lungfish mammal

Matching Interior Nodes rayfinned fish rayfinned fish salamander lungfish frog salamander mammal frog ? bird turtle crocodile snake As one consider nodes further from the leaves the situation becomes more and more complicated. lizard lizard snake crocodile turtle mammal lungfish bird

Previous Work tree comparison RF distance [Robinson and Foulds 81] perfect node matching [Day 85] creation/deletion [Chi and Card 99] leaves only [Graham and Kennedy 01] Several metrics have been proposed to to address this problem such as the Robinson and Foulds metric, but none of them completely address the tree comparison problem. They often require tree with the same set of leaves and are aggregate measure providing one number for any given tree and are not well adapted to interactive exploration such as linked highlighting of corresponding nodes as you have seen in the video RF aggregate measure for a tree perfect match intead of best match creation and delection not movement Leaves not interior nodes.

Similarity Score: S(m,n) B C D E F A C B D F E n m Our work address these two shortcomings. First we extend the similarity measure to trees with different sets of leaves by using the set of leaves below a given node. In this example we compute S(M,N) by first computing the set of leaves below m and n noted L(m) and L(n) respectively. We then compute their intersection and they union. Computing the ratio of the cardinality of the two sets provides us with the reuslt. This metric can vary from 0 to one when the two nodes share the exact same set of leaves. Extend the previous measure from perfect match to best match make it better for interactive. Let’s look at an example. Here are 2 tree T1 and T2 and I wish to understand how the subtree rooted at M and the subtree rooted at n are related. The first we compute the similarity score. If L of M is the set of leave under M and L(N) the set of leaves under N, the similarity is given by the formual: cardinal of the intersection of this 2 set to the cardinality of the union of this two sets. in out case L(M) is the set E,F and LN of n is the set D,F,E the the intersection of this 2 set will be ef and the union DEF, and the similarity will be 2/3. This measure extend the previously proposed measure since it does work even in the case where the set of leaves are not identical.

Best Corresponding Node A B C D E F A C B D F E computable in O(n log2 n) linked highlighting 2/6 1/3 1/2 2/3 BCN(m) = n 1/2 m By itself the similarity measure is an aggregate measure, but because it does not require the 2 subtrees to share the same set of nodes. We can now compute the similarity score between a given node in T1 say M and all node in T2 (I did that on the right and you can see that n has the highest score.). Now pick the node with the best score as the Best Coresponding node for M. As described here it might seems like a very expensive proposition, quadratic in the number of similarity score computations. We provide an algorithm for computing the BCN very efficiently and I will refer you to the paper for more details. In our system the BCN function is used as the basis of our linked highlighting. While we won’t have time to go over it here, we proposed a efficient O(nlog2n) evaluation algorythm to compute the BCN of each node between 2 tree as a rapide preprocessing path. This information is then use to provide structural link highlighting. and will we refer you to our paper for details. Once one has the best correponding node for every node in T2 it is then trivial to implement linked highlighting as hown in the video.

Marking Structural Differences B C D E F A C B D F E Matches intuition n m Once we can compute BCN efficiently, it is a simple matter to define structural difference: They are nodes such as M for which the Similarity score between them and their best corresponding node is different from one. As shown in this picture, it is important to note that even though we started from a metric ignoring the topology, our system deliver precise topology difference.

Outline Accordion Drawing TreeJuxtaposer SequenceJuxtaposer PRISAD information visualization technique TreeJuxtaposer tree comparison SequenceJuxtaposer sequence comparison PRISAD generic accordion drawing framework

Genomic Sequences multiple aligned sequences of DNA now commonly browsed with web apps zoom and pan with abrupt jumps previous work Ensembl [Hubbard 02], UCSC Genome Browser [Kent 02], NCBI [Wheeler 02] investigate benefits of accordion drawing showing focus areas in context smooth transitions between states guaranteed visibility for globally visible landmarks

SequenceJuxtaposer comparing multiple aligned gene sequences provides searching, difference calculation [video] video/software downloadable from http://olduvai.sf.net/tj SequenceJuxtaposer allows fluid navigation of whole genomes, where the overall Context of the genome is visible even when regions of interest are being Examined in-depth. Another video…

Searching search for motifs results marked with guaranteed visibility protein/codon search regular expressions supported results marked with guaranteed visibility

Differences explore differences between aligned pairs slider controls difference threshold in realtime results marked with guaranteed visibility

SJ Contributions fluid tree comparison system showing multiple focus areas in context guaranteed visibility of marked areas thresholded differences, search results scalable to large datasets 2M nucleotides all realtime rendering sublinear

Outline Accordion Drawing TreeJuxtaposer SequenceJuxtaposer PRISAD information visualization technique TreeJuxtaposer tree comparison SequenceJuxtaposer sequence comparison PRISAD generic accordion drawing framework

Goals of PRISAD generic AD infrastructure efficiency correctness tree and sequence applications PRITree is TreeJuxtaposer using PRISAD PRISeq is SequenceJuxtaposer using PRISAD efficiency faster rendering: minimize overdrawing smaller memory footprint correctness rendering with no gaps: eliminate overculling

PRISAD Navigation generic navigation infrastructure application independent uses deformable grid split lines Grid lines define object boundaries horizontal and vertical separate Independently movable

Split line hierarchy data structure supports navigation, picking, drawing two interpretations linear ordering hierarchical subdivision A B C D E F Hierarchy for picking, partitioning, rendering, navigation, caching

PRISAD Architecture world-space discretization preprocessing initializing data structures placing geometry screen-space rendering frame updating analyzing navigation state drawing geometry

World-space Discretization interplay between infrastructure and application

Laying Out & Initializing application-specific layout of dataset non-overlapping objects initialize PRISAD split line hierarchies objects aligned by split lines 6 8 1 2 3 4 5 9 7 4 5 4 2 A A C C A T T T

Gridding each geometric object assigned its four encompassing split line boundaries A A C C A T T T

Mapping PRITree mapping initializes leaf references 8 4 9 5 6 3 2 1 bidirectional O(1) reference between leaves and split lines 8 4 9 5 6 3 2 1 Map Split line Leaf index 6 8 2 5 9 6 8 2 5 9 3 4 1 2 5 4 1 3 7

Screen-space Rendering control flow to draw each frame The only slide with progressive rendering

Partitioning partition object set into bite-sized ranges using current split line screen-space positions required for every frame subdivision stops if region smaller than 1 pixel or if range contains only 1 object 1 2 3 4 5 [1,2] [3,4] [5] { [1,2], [3,4], [5] } Queue of ranges

Seeding reordering range queue result from partition marked regions get priority in queue drawn first to provide landmarks 1 2 3 4 5 [1,2] [3,4] [5] { [1,2], [3,4], [5] } { [3,4], [5], [1,2] } ordered queue

Drawing Single Range each enqueued object range drawn according to application geometry selection for trees aggregation for sequences

PRITree Range Drawing select suitable leaf in each range draw path from leaf to the root ascent-based tree drawing efficiency: minimize overdrawing only draw one path per range 1 2 { [3,4], [5], [1,2] } 3 [3,4] 4 5

Rendering Dense Regions correctness: eliminate overculling bad leaf choices would result in misleading gaps efficiency: maximize partition size to reduce rendering too much reduction would result in gaps Intended rendering Partition size too big

Rendering Dense Regions correctness: eliminate overculling bad leaf choices would result in misleading gaps efficiency: maximize partition size to reduce rendering too much reduction would result in gaps Intended rendering Partition size too big

PRITree Skeleton guaranteed visibility of marked subtrees during progressive rendering first frame: one path per marked group full scene: entire marked subtrees

PRISeq Range Drawing: Aggregation aggregate range to select box color for each sequence random select to break ties [1,4] [1,4] A A C C A A T T T T T T T C T

PRISeq Range Drawing collect identical nucleotides in column form single box to represent identical objects attach to split line hierarchy cache lazy evaluation draw vertical column { A:[1,1], T:[2,3] } 1 A A 2 T T 3 T

PRISAD Performance PRITree vs. TreeJuxtaposer (TJ) synthetic and real datasets complete binary trees lowest branching factor regular structure star trees highest possible branching factor

InfoVis Contest Benchmarks two 190K node trees directly compare TJ and PT

OpenDirectory benchmarks two 480K node trees too large for TJ

PRITree Rendering Time Performance a closer look at the fastest rendering times TreeJuxtaposer renders all nodes for star trees branching factor k leads to O(k) performance InfoVis 2003 Contest dataset 5x rendering speedup

Detailed Rendering Time Performance TreeJuxtaposer valley from overculling PRITree handles 4 million nodes in under 0.4 seconds TreeJuxtaposer takes twice as long to render 1 million nodes

Memory Performance linear memory usage for both applications 4-5x more efficient for synthetic datasets 1GB difference for InfoVis contest comparison marked range storage changes improve scalability

Performance Comparison PRITree vs. TreeJuxtaposer detailed benchmarks against identical TJ functionality 5x faster, 8x smaller footprint handles over 4M node trees PRISeq vs. SequenceJuxtaposer 15x faster rendering, 20x smaller memory size 44 species * 17K nucleotides = 770K items 6400 species * 6400 nucleotides = 40M items

Future Work future work editing and annotating datasets PRISAD support for application specific actions logging, replay, undo, other user actions develop process or template for building applications

PRISAD Contributions infrastructure for efficient, correct, and generic accordion drawing efficient and correct rendering screen-space partitioning tightly bounds overdrawing and eliminates overculling first generic AD infrastructure PRITree renders 5x faster than TJ PRISeq renders 20x larger datasets than SJ

Joint Work TreeJuxtaposer SequenceJuxtaposer PRISAD François Guimbretière, Serdar Taşiran, Li Zhang, Yunhong Zhou SIGGRAPH 2003 SequenceJuxtaposer James Slack, Kristian Hildebrand, Katherine St.John German Conference on Bioinformatics 2004 PRISAD James Slack, Kristian Hildebrand IEEE InfoVis Symposium 2005

Open Source software freely available from http://olduvai.sourceforge.net SequenceJuxtaposer olduvai.sf.net/sj TreeJuxtaposer olduvai.sf.net/tj requires Java and OpenGL GL4Java bindings now, JOGL version coming soon papers, talks, videos also from http://www.cs.ubc.ca/~tmm The project is open source and freely available under...