Advancing Science with DNA Sequence Sequence Clustering MGM Workshop January 30, 2011 Reducing Search Space in Protein and DNA/RNA Sequence Analysis Denis.


1 Advancing Science with DNA Sequence Sequence Clustering MGM Workshop January 30, 2011 Reducing Search Space in Protein and DNA/RNA Sequence Analysis Denis Kaznadzey, Prokaryotic Super Program

2 Advancing Science with DNA Sequence Sequence Clustering Outline Classification of Sequences General Problem of Clustering Distance Measures Ab Initio Clustering and Pledging Performance Considerations Case Study: Transcriptomics Introduction to Protein Clustering

3 Advancing Science with DNA Sequence Classification as Research Tool To deal with a huge variety of individual objects: -Classify into groups of essentially similar objects -When new data arrives, assign objects to existing groups -Classify ‘leftovers’ -Occasionally review the entire classification Problem: What is ‘essentially similar’? Finding properties that are important (ontological relevancy). Does classification reflect reality in any way?

4 Advancing Science with DNA Sequence Classification Ways to classify objects: -Spectral methods -Parametric decomposition -Clustering

5 Advancing Science with DNA Sequence Sequence Data Abundance In modern biology, the most abundant type of data is sequence: DNA (genomic, metagenomic, environmental samples (16S rDNA)); RNA (cDNA libraries; RNA-Seq); derived proteins. How to compare sequences? - Criteria depend on the application, e.g. GC content vs. order of bases.

6 Advancing Science with DNA Sequence Sequence Clustering Select Applications in Genomic Sciences: Genome Assembly: Binning, Scaffolding; Transcriptomics: RNA-Seq (read) clustering; Protein Function and Evolution studies: Protein families; Phylogenetic profiling: OTUs

7 Advancing Science with DNA Sequence Clustering is Crucial for MetaGenomics METAGENOMICS: Thousands of samples, hundreds of millions of reads per sample, trillions of base pairs, billions of genes: impossible to observe/analyze individually. Clustering becomes a strict requirement: - Find what classes of sequences are seen - Analyze classes rather than individual sequences

8 Advancing Science with DNA Sequence MetaGenomics Analysis Tasks Primary tasks: Assess diversity Find genes Predict functions Predict pathways Estimate capabilities Based on sequence comparison.

9 Advancing Science with DNA Sequence Sequence Clustering Outline Classification of Sequences General Problem of Clustering Distance Measures Ab Initio Clustering and Pledging Performance Considerations Case Study: Transcriptomics Introduction to Protein Clustering

10 Advancing Science with DNA Sequence Clustering in General -Any Clustering is based on the Distance in some Metric -Initial clustering is based on pair-wise distances -Subsequent classification is based on distances from objects to clusters: Pledging

11 Advancing Science with DNA Sequence Sequence Clustering Outline Classification of Sequences General Problem of Clustering Distance Measures Ab Initio Clustering and Pledging Performance Considerations Case Study: Transcriptomics Introduction to Protein Clustering

12 Advancing Science with DNA Sequence Similarity Metrics What is “similar”: The similarity measure should reflect “reality” as well as possible. This “reality” depends on the application: Assembly: find identical substrings. Orthology detection: identify homologous proteins across species. Functional prediction: identify proteins with similar evolutionarily conserved motifs. Measure is: Identity percentage; Substitution matrix based; Match to HMM or PSSM
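The simplest of the measures listed on this slide, identity percentage, can be sketched as follows. This is a minimal illustration that assumes the two sequences have already been aligned to equal length with `-` as the gap character; the function name and the choice to count gap columns in the denominator are assumptions, not something the slides specify.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of identical columns between two pre-aligned sequences.

    Assumption: both strings come from the same pairwise alignment,
    so they have equal length and use '-' for gaps.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    # Ignore columns where both sequences have a gap.
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    # A column counts as a match only if both residues are equal and not gaps.
    matches = sum(1 for x, y in columns if x == y and x != "-")
    return 100.0 * matches / len(columns)


# 4 matching columns out of 6 aligned columns.
print(round(percent_identity("ACGT-A", "ACCTGA"), 1))
```

Other conventions (for example, dividing by the shorter sequence length) give different numbers for the same alignment, which is one reason the threshold has to be chosen per application.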

13 Advancing Science with DNA Sequence Similarity Measure Computing similarity measure: -Edit distance or (ungapped) statistics P-value: BLAST, FASTA, needle, water, etc. -Adjusted edit distance through progressive alignment: Clustal, MUSCLE, T-Coffee -k-mer statistics: CD-HIT, USEARCH, MUSCLE -Suffix trees (and probabilistic suffix trees): MUMmer, REPuter, CLUSEQ -Suffix arrays: Bowtie, BWT -Position-specific scoring matrix: PSI-BLAST, IMPALA -Hidden Markov Models: HMMER, HHsearch/HHpred, SAM
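To make the k-mer statistics entry concrete, here is a minimal sketch of an alignment-free similarity: the Jaccard index of the two sequences' k-mer sets. It is an illustration only and is not how CD-HIT, USEARCH, or MUSCLE compute their scores; the function names and the default `k=3` are assumptions.

```python
def kmers(seq: str, k: int) -> set:
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a: str, b: str, k: int = 3) -> float:
    """Shared k-mers divided by all distinct k-mers of the two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


print(kmer_jaccard("ACGTACGT", "ACGTACGT"))  # identical sequences
print(kmer_jaccard("AAAA", "CCCC"))          # no shared k-mers
# Different sequences can share the same k-mer set: the measure is fast
# because it ignores where in the sequence each k-mer occurs.
print(kmer_jaccard("ACGTAC", "CGTACG"))
```

No alignment is computed, which is what makes this family of measures practical for large data sets, at the cost of sensitivity.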

14 Advancing Science with DNA Sequence Sequence Clustering Outline Classification of Sequences General Problem of Clustering Distance Measures Ab Initio Clustering and Pledging Performance Considerations Case Study: Transcriptomics Introduction to Protein Clustering

15 Advancing Science with DNA Sequence Assembling Clusters There is a HUGE variety of clustering methods (clustering / classification is a very elaborate methodology): Linkage-based (average linkage, complete linkage, single linkage); Hierarchy-based; Fitting function-based (k-means); Non-linear classifiers (SOM, etc.); Greedy methods (iterative, suboptimal)

16 Advancing Science with DNA Sequence Linkage-Based Clustering Average linkage Single linkage Complete linkage
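The three linkage rules differ only in how the distance between two clusters is derived from the pairwise distances of their members. A minimal sketch, using hypothetical points on a line and absolute difference as the distance (both are assumptions made for illustration):

```python
from itertools import product


def linkage_distance(cluster_a, cluster_b, dist, mode):
    """Distance between two clusters under a given linkage rule."""
    pairwise = [dist(a, b) for a, b in product(cluster_a, cluster_b)]
    if mode == "single":      # closest pair of members
        return min(pairwise)
    if mode == "complete":    # farthest pair of members
        return max(pairwise)
    if mode == "average":     # mean over all cross-cluster pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage mode: {mode}")


def absolute_difference(x, y):
    return abs(x - y)


left, right = [0, 1], [3, 6]
for mode in ("single", "complete", "average"):
    print(mode, linkage_distance(left, right, absolute_difference, mode))
```

For these two clusters the cross-cluster distances are 3, 6, 2 and 5, so the three rules give 2, 6 and 4.0 respectively.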

17 Advancing Science with DNA Sequence Hierarchical Clustering -Build a tree representation of relationships -Cut the branches using some quantitative criteria

18 Advancing Science with DNA Sequence Building the Tree Criteria: More similar sequences appear at closer branches. This goal is not achievable for practical distance measures. [Figure: four objects A, B, C, D with pairwise distances 1, 4, 3, 4, 2, 2; alternative leaf orders ABCD and ABDC] Solutions: -Approximation methods: neighbor joining, UPGMA -Search for the optimal tree by explicit criteria (maximum parsimony, maximum likelihood, etc.)

19 Advancing Science with DNA Sequence Suboptimal Tree Building Neighbor joining (corresponds to single-linkage clustering): -Order edges by distance -Join in order from short to long, merging branches as needed. Unweighted Pair Group Method with Arithmetic Mean (UPGMA) (corresponds to average-linkage clustering): For every pair of clusters (A, B), starting with all singletons: -Compute the average of distances between every object in A and every object in B -Merge the clusters with the smallest average distance
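The UPGMA steps on this slide can be sketched directly. This is a naive version that recomputes every average at each step, written for readability rather than speed; the four labels and their distances are hypothetical values chosen for the example, not the ones from the slide figure.

```python
from itertools import combinations


def upgma(labels, dist):
    """Naive UPGMA.

    dist maps frozenset({a, b}) -> distance between objects a and b.
    Returns the merges in order as (cluster_a, cluster_b, average_distance).
    """
    clusters = [(label,) for label in labels]  # start with singletons
    merges = []

    def average(a, b):
        total = sum(dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        merges.append((a, b, average(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical distances between four objects.
d = {
    frozenset("AB"): 1, frozenset("CD"): 2,
    frozenset("AC"): 4, frozenset("AD"): 4,
    frozenset("BC"): 3, frozenset("BD"): 5,
}
for a, b, avg in upgma("ABCD", d):
    print(a, b, avg)
```

With these distances A and B merge first (1.0), then C and D (2.0), and finally the two pairs merge at the average of their four cross distances (4.0).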

20 Advancing Science with DNA Sequence Global Fitting-Function Based k-means clustering: -Pre-define the number of clusters -Find a distribution so that the sum of distances to the means is minimal -Computationally hard -Heuristics are used; application-specific heuristics may be efficient
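One widely used heuristic for this problem is Lloyd's iteration: assign each point to its nearest mean, recompute the means, and repeat until nothing changes. The slides do not name a specific heuristic, so this is a swapped-in standard one, sketched here for one-dimensional points with hypothetical starting centers.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's heuristic for k-means on numbers.

    The number of clusters is fixed in advance by len(centers).
    Returns (final_centers, groups).
    """
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Update step: each center moves to the mean of its group
        # (an empty group keeps its previous center).
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break  # converged: assignments can no longer change
        centers = updated
    return centers, groups


centers, groups = kmeans_1d([1, 2, 3, 10, 11, 12], centers=[0, 5])
print(centers, groups)
```

The result depends on the starting centers and is only a local optimum, which is why the slide calls the exact problem computationally hard and points to heuristics.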

21 Advancing Science with DNA Sequence Non-Linear Methods Self-Organizing Maps: “self-learning” method A neural network trained using unsupervised learning to produce a low-dimensional, discretized representation of the input space

22 Advancing Science with DNA Sequence Pledging Based on distance to cluster -Representative -Set of representatives (all at extreme) -Other measure, may be unrelated to the initial one (profile, model)
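Pledging to a single representative per cluster can be sketched as follows. The cluster names, the Hamming distance, and the threshold are all hypothetical; as the slide notes, a real pipeline might use a set of representatives or an unrelated measure such as a profile or model instead.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def pledge(obj, representatives, dist, threshold):
    """Return the id of the closest cluster representative,
    or None if no representative is within the threshold."""
    best_id, best_distance = None, None
    for cluster_id, representative in representatives.items():
        d = dist(obj, representative)
        if best_distance is None or d < best_distance:
            best_id, best_distance = cluster_id, d
    if best_distance is not None and best_distance <= threshold:
        return best_id
    return None  # a 'leftover' to be clustered later


reps = {"cluster_1": "ACGTACGT", "cluster_2": "TTTTGGGG"}
print(pledge("ACGTACGA", reps, hamming, threshold=2))  # one mismatch to cluster_1
print(pledge("CCCCCCCC", reps, hamming, threshold=2))  # too far from both
```

Only one distance per cluster is computed for each new object, instead of one per existing object, which is the point of pledging.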

23 Advancing Science with DNA Sequence Sequence Clustering Outline Classification of Sequences General Problem of Clustering Distance Measures Ab Initio Clustering and Pledging Performance Considerations Case Study: Transcriptomics Introduction to Protein Clustering

24 Advancing Science with DNA Sequence Performance Considerations Distance computing is harder than clustering (bundled solutions: BLASTCLUST, CD-HIT, UCLUST, CLUSEQ). For large data sets only k-mer and suffix array measures are practical. However, incremental/greedy approaches can be used to avoid computing the entire distance matrix. This makes the use of sensitive similarity measures possible. For boolean distance, iterative similarity detection is possible (no off-the-shelf implementations). Binning: pre-clustering by rough and fast methods. [Figure: 33 objects, 528 pairs; after binning into 4 groups, 127 pairs]
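The savings from binning and from greedy clustering can both be shown with a short sketch. The all-pairs count for 33 objects is 528, as on the slide. The group sizes (5, 9, 9, 10) are an assumption: the slide gives only the totals, and these sizes are one split of 33 objects that reproduces 127 pairs. The greedy routine is a generic illustration of the incremental idea, not the algorithm of any named tool.

```python
from math import comb


def pairs_all(n: int) -> int:
    """Distances needed for a full matrix over n objects."""
    return comb(n, 2)


def pairs_binned(group_sizes) -> int:
    """Distances needed when only objects in the same bin are compared."""
    return sum(comb(size, 2) for size in group_sizes)


def greedy_cluster(items, dist, threshold):
    """Compare each item only to cluster representatives, in order.

    The first item that fits no existing cluster becomes a new representative.
    Returns (clusters, number_of_distance_computations).
    """
    representatives, clusters, comparisons = [], [], 0
    for item in items:
        for i, rep in enumerate(representatives):
            comparisons += 1
            if dist(item, rep) <= threshold:
                clusters[i].append(item)
                break
        else:  # no representative was close enough
            representatives.append(item)
            clusters.append([item])
    return clusters, comparisons


print(pairs_all(33), pairs_binned([5, 9, 9, 10]))

values = [1, 2, 3, 50, 51, 52, 100, 101]
clusters, used = greedy_cluster(values, lambda a, b: abs(a - b), threshold=5)
print(clusters, used, "of", pairs_all(len(values)), "possible comparisons")
```

Because far fewer distances are computed, each one can afford to be a slower, more sensitive measure.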

25 Advancing Science with DNA Sequence Single Linkage is Fast Time- and space-efficient clustering method: transitive closure-based. Requires ‘boolean’ distances (two sequences can be linked or not linked). Requires the number of nodes to be known. Space ~ NodesNo; Run-time (worst) ~ EdgesNo * AveClustSize; Run-time (average) ~ EdgesNo * log2(AveClustSize)
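Transitive closure over boolean links can be sketched with a disjoint-set (union-find) structure. This is a standard technique used here for illustration; the slides do not say which data structure the author's implementation uses, so its cost may differ from the run-time estimates above. Like the method on the slide, it needs the number of nodes up front and stores one entry per node.

```python
def single_linkage(n_nodes, edges):
    """Group nodes 0..n_nodes-1 into clusters connected by boolean links.

    edges is an iterable of (a, b) pairs meaning 'a and b are linked'.
    Returns a sorted list of clusters, each a sorted list of node ids.
    """
    parent = list(range(n_nodes))  # space proportional to the node count

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    clusters = {}
    for node in range(n_nodes):
        clusters.setdefault(find(node), []).append(node)
    return sorted(clusters.values())


print(single_linkage(6, [(0, 1), (1, 2), (3, 4)]))
# Adding a single link between the two groups merges them completely.
print(single_linkage(6, [(0, 1), (1, 2), (3, 4), (2, 3)]))
```

The second call shows why the method is sensitive to spurious links: one extra edge is enough to fuse two otherwise separate clusters.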

26 Advancing Science with DNA Sequence Single Linkage is Prone to Aggregation Single-linkage clustering killer: CLUSTER AGGREGATION. In large clusters, even a small number of random links leads to huge conglomerates.

27 Advancing Science with DNA Sequence Sequence Clustering Outline Classification of Sequences General Problem of Clustering Distance Measures Ab Initio Clustering and Pledging Performance Considerations Case Study: Transcriptomics Introduction to Protein Clustering

28 Advancing Science with DNA Sequence Case Study: RNA-Seq Pipeline Goals: 1. Compute transcript structures 2. Compute expression profiles (“virtual”). [Figure: Reads/EST db; Reads/EST clusters; Reads/clones attributed to particular source/condition; Counting reads originating from different sources; Source/condition-specific expression profiles]

29 Advancing Science with DNA Sequence RNAseq Analysis Solutions Source: bioinfo.org, Macquarie University, Sydney

30 Advancing Science with DNA Sequence RNAseq Clustering Approach Outline: 1. Detect identities (common segments): compute similarities, select the “good” ones 2. Merge sequences into groups with shared segments: SINGLE LINKAGE. Outcome: the single biggest cluster contains more than 60% of all sequences (selection by better similarity does not help). What causes aggregation and how to fight it?

31 Advancing Science with DNA Sequence Aggregation in RNA-Seq Clustering “Bad” identities: -Pieces of vector constructs / adaptors -Repeats -Redundant sequences -Spurious matches (short infrequent repeats) -Chimeras (if pre-amplification is used)

32 Advancing Science with DNA Sequence Similarities Selection Computing ‘boolean’ distances: Threshold-based, with additional rules (match arrangement): % identity + length + arrangement
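A boolean link rule combining identity, length, and arrangement might look like the sketch below. Everything specific in it is an assumption: the `Match` fields, the threshold values, and the arrangement rule itself, which accepts a match only if it runs to the end of one of the two sequences on each side (an end overlap or a containment). The slide's own rule was shown in a figure that did not survive in the transcript.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """One pairwise local match; coordinates are 0-based, end-exclusive."""
    identity: float   # percent identity over the matched region
    length: int       # length of the matched region
    q_start: int
    q_end: int
    q_len: int
    s_start: int
    s_end: int
    s_len: int


def is_link(m: Match, min_identity=95.0, min_length=50, max_overhang=10) -> bool:
    """Decide whether a match counts as a link (hypothetical thresholds)."""
    if m.identity < min_identity or m.length < min_length:
        return False
    # Unmatched sequence that BOTH sequences have beyond the match, per side.
    left_overhang = min(m.q_start, m.s_start)
    right_overhang = min(m.q_len - m.q_end, m.s_len - m.s_end)
    # A match buried in the middle of both sequences (e.g. a shared repeat)
    # leaves long overhangs on both sides and is rejected.
    return left_overhang <= max_overhang and right_overhang <= max_overhang


end_overlap = Match(98.0, 100, 400, 500, 500, 0, 100, 600)
internal_repeat = Match(98.0, 100, 200, 300, 500, 200, 300, 600)
print(is_link(end_overlap), is_link(internal_repeat))
```

A rule of this kind discards high-identity matches that come from shared internal segments, one of the sources of spurious links listed earlier.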

33 Advancing Science with DNA Sequence Trimming / Masking Fighting aggregation: -Vector / adapter trimming: Lucy, Figaro, etc. – integrated in many assembly suites (Newbler, Velvet, AMOS, CLCbio, etc.) -Low complexity detection / masking: SEG, DUST, FastQC, WindowMasker, etc. – often integrated in search tools

34 Advancing Science with DNA Sequence Repeat Elimination Regular (tandem) repeats: Pre-search masking: Based on structure (IMEx, SRF) or on database (TRDB) Post-search detection based on similarity properties (multiple parallel threads)

35 Advancing Science with DNA Sequence Repeat Elimination Irregular (long) repeats: Database based: RepeatMasker De-novo: RepeatScout, orrb, PILER, etc. (Require genome as input, construct database)

36 Advancing Science with DNA Sequence Detecting Chimeras Detecting chimeric sequences: Abundance-based: Perseus, UCHIME. Chimeras undergo fewer amplification cycles, so chimera segments in their native arrangements are more frequent. Specific to 16S: ChimeraSlayer, Bellerophon. Chimera ‘arms’ are closer to their originating clades than the entire chimera is.

37 Advancing Science with DNA Sequence Detecting Chimeras Similarity coverage-based: Mira assembler

38 Advancing Science with DNA Sequence Detecting Chimeras Similarity graph topology-based: dchim [Figure panels: Alignment view; Connectivity view]

39 Advancing Science with DNA Sequence Sequence Clustering Outline Classification of Sequences General Problem of Clustering Distance Measures Ab Initio Clustering and Pledging Performance Considerations Case Study: Transcriptomics Introduction to Protein Clustering

40 Advancing Science with DNA Sequence Protein Clustering Direct similarity measure by edit distance is not sensitive enough for evolutionarily distant species. Position-specific scoring matrices and profile HMMs provide better sensitivity, but are SLOW. Similar problems as for RNA-Seq/EST clustering, but their causes are harder to fight. No ‘one fits all’ solution: manual tuning and curation are required for comprehensive results, especially at a large scale. The results of clustering are precious; they are kept as databases (Pfam, COGs, KOGs, eggNOG).

41 Advancing Science with DNA Sequence Protein Clustering at JGI Functional annotation of metagenome genes through protein clusters (IMG): -Build a set of functionally homogeneous clusters of similar proteins for annotated genomes -Build an HMM for each cluster, compose a model database -Pledge metagenome proteins to clusters by matching to models -Cluster unpledged proteins, build models, update the model database

42 Advancing Science with DNA Sequence Protein Clustering Use of protein clusters reduces search space, but adds another level of indirection, which is a source of errors, and adds complexity that consumes effort. However, for proteins, which form dense relationship networks, clustering is a great tool. Konstantinos Mavrommatis will elaborate on protein clustering techniques.

43 Advancing Science with DNA Sequence Thank you!

