Genomic Data Manipulation Thinking about data visually

Curtis Huttenhower Harvard School of Public Health Department of Biostatistics

The usual suspects
Bar plot = discrete # of discrete values
Stripchart = discrete # of small # of continuous values
Boxplot = discrete # of large # of continuous values
Histogram = discretized bins of counts
Density plot = continuous interpolation of counts
Scatter plot = pairs of continuous values
Line plot = function of continuous values
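The difference between a histogram (discretized bins of counts) and a density plot (continuous interpolation of counts) can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not anything taken from the slides: the function names `histogram_counts` and `gaussian_kde`, the bandwidth of 0.5, and the toy values are all assumptions made up for the example.

```python
import math

def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * n_bins
    for v in values:
        # The maximum value would index one past the end, so clamp it
        # into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

def gaussian_kde(values, bandwidth):
    """Return a function that estimates density as a sum of Gaussian bumps."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm

    return density

# Hypothetical measurements, invented for illustration only.
values = [1.0, 1.2, 1.9, 2.1, 2.2, 2.8, 3.5, 4.0, 4.1, 5.0]
counts = histogram_counts(values, n_bins=4)
density = gaussian_kde(values, bandwidth=0.5)

# Text rendering: one row of '#' per bin, then a few density samples.
for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
for x in (1.0, 2.0, 3.0, 4.0, 5.0):
    print(f"density at {x}: {density(x):.3f}")
```

The histogram depends on the choice of bin count and the density estimate on the choice of bandwidth, so both views are worth trying with more than one setting.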

Small changes, big differences
Boxplots can be decorated as...
Beeswarm plots = mashup of boxplot + stripchart
Violin plots = mashup of boxplot + density plot
Scatter plots can be decorated as...
Sunflower plot = mashup of scatter + histogram
2D density plot = mashup of scatter + density

Fig. 3. Comparison of Sargasso Sea scaffolds to Crenarchaeal clone 4B7. Predicted proteins from 4B7 and the scaffolds showing significant homology to 4B7 by tBLASTx are arrayed in positional order along the x and y axes. Colored boxes represent BLASTp matches scoring at least 25% similarity and with an e value of better than 1e-5. Black vertical and horizontal lines delineate scaffold borders.
J C Venter et al. Science 2004;304:66-74. Published by AAAS.

Only one of many ways to think about DNA sequence data...

(Almost) everything can be clustered into a tree, even DNA sequences
Fig. 7. Phylogenetic tree of rhodopsin-like genes in the Sargasso Sea data along with all homologs of these genes in GenBank. The sequences are colored according to the type of sample in which they were found: blue, cultured species; yellow, sequences from uncultured organisms in other environmental samples; and red, sequences from uncultured species in the Sargasso Sea. The tree was divided into what we propose are distinct subfamilies of sequences, which are labeled on the right. The tree was constructed as follows: (i) All homologs of halorhodopsin were identified in the predicted proteins from the Sargasso Sea assemblies using BLASTp searches with representatives of previously identified halorhodopsin-like protein families as query sequences. (ii) All sequences greater than 75 amino acids in length were aligned to each other using CLUSTALw, and a neighbor-joining phylogenetic tree was inferred using the protdist and neighbor programs of Phylip.
J C Venter et al. Science 2004;304:66-74. Published by AAAS.
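To make the idea of clustering sequences into a tree concrete, here is a minimal sketch. It is not the method from the figure: instead of Phylip's protdist and neighbor-joining, it uses a much simpler stand-in, average-linkage (UPGMA-style) clustering on the fraction of mismatched positions. The sequence names and the toy pre-aligned sequences are invented for illustration, and the function names are hypothetical.

```python
from itertools import combinations

def hamming_fraction(a, b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def leaves(tree):
    """List leaf names under a tree given as a name or a nested (left, right) tuple."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def average_linkage_tree(seqs):
    """Repeatedly merge the two clusters with the smallest mean pairwise distance."""
    clusters = list(seqs)  # every cluster starts as a single sequence name

    def dist(c1, c2):
        pairs = [(a, b) for a in leaves(c1) for b in leaves(c2)]
        return sum(hamming_fraction(seqs[a], seqs[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append((c1, c2))  # the merged pair becomes one internal node
    return clusters[0]

# Hypothetical pre-aligned toy sequences, invented for illustration only.
seqs = {
    "seqA": "ACGTACGTAC",
    "seqB": "ACGTACGTAA",
    "seqC": "TCGAACGGAC",
    "seqD": "TCGAACGGTC",
}
tree = average_linkage_tree(seqs)
print(tree)  # nested tuples; each tuple is one internal node of the tree
```

Real phylogenetic inference needs a proper alignment and an evolutionary distance model; the sketch only shows why any set of items with pairwise distances can be arranged into a tree.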

Aerobic, microaerobic and anaerobic communities
But not every tree is a clustering

Model of microbial biomarkers
Why are networks so popular in biology?

Don’t be afraid to get creative when representing data!
Fast and Furious 6 (!?!) Man of Steel Hunger Games Iron Man 3 Thor Hunger Games Avengers Dark Knight Rises Twilight XXVII

Wordles

Looking at data – it’s not just fun, it’s important, too!
Anscombe's quartet
Four 11-pair datasets with the same... X mean, X variance, Y mean, Y variance, correlation, and regression coefficients
μ(x)=9 σ²(x)=11 μ(y)=7.5 σ²(y)≈4.1 ρ=0.816 y=3+0.5x
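The shared summary statistics are easy to check directly. The sketch below uses the standard published values of Anscombe's quartet and computes, for each of the four datasets, the means, sample variances, Pearson correlation, and least-squares line; the helper name `summarize` is an assumption introduced for this example.

```python
import math
from statistics import mean, variance

# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Return means, sample variances, correlation, and least-squares fit of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "var_x": variance(x),
        "mean_y": my,
        "var_y": variance(y),
        "corr": sxy / math.sqrt(sxx * syy),
        "slope": slope,
        "intercept": my - slope * mx,
    }

for name, (x, y) in quartet.items():
    s = summarize(x, y)
    print(f"{name:>3}: mean_x={s['mean_x']:.2f} var_x={s['var_x']:.2f} "
          f"mean_y={s['mean_y']:.2f} var_y={s['var_y']:.2f} "
          f"corr={s['corr']:.3f} fit: y={s['intercept']:.2f}+{s['slope']:.3f}x")
```

The four rows of output agree to about two decimal places, which is exactly why the numbers alone cannot distinguish the datasets; only plotting them reveals how different they are.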
