Robert Edgar Independent scientist

Robert Edgar Independent scientist robert@drive5.com www.drive5.com

• Data reduction
• Make tractable for downstream analysis
• Read dereplication & error-correction
• Metagenomics
• Identify protein families de novo
• Community sequencing: identify OTUs

• Challenges
• USEARCH solutions

[Figure: from environmental sample to reads. Labels: bacterial chromosome, 16S gene, primers, 16S segments, environmental sample with bacteria, PCR, amplified segments, biological sequences, reads.]
Chimeric artifacts are formed from ≥2 biological sequences during PCR.

• Error correction
• Chimeras
• Big problem with 16S / 18S / ITS
• Covered this morning: UCHIME
• Other PCR errors
• Sequencer error
• Bad base calls, indels, homopolymers
• Cluster at 97% (3% radius)
• One cluster = one OTU = one species (maybe!)

Bigger dot = more reads.
Radius 3%: 3% = species.
Centroid: ideally should be most abundant = most likely to be biological.
Differs from rep. seq. due to: sequencing error, biological variation.

Which OTU? Ambiguous assignments

Abundant sequences <3% different (2% in the figure). Arbitrary choice of OTU rep. seq. Outliers create spurious OTU(s).

Full-length 16S gene (~1500nt)

Next-gen reads of hypervariable region (~300nt) Variation greater in short region, may be > 3%.

Variation between populations Healthy Diseased

Paralogs and segmental duplications: a duplication of the 16S gene on the bacterial chromosome that has diverged > 3% gives two OTUs for one species.

Alignment variation and defining % identity

Two programs (Program A and Program B) align the same pair of sequences differently:

G A T T A C A - -
G A A T T A A C A
3 diffs or 5 diffs?

G A - T T A - C A
G A A T T A A C A
No diffs or 2 diffs?

Different programs produce different results from the same algorithm & same input data because alignments and %id definition vary. This can bias validation, e.g. Schloss & Westcott (2011) AEM.
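The ambiguity above can be made concrete with a small sketch that counts differences in an alignment under different conventions. The function name `count_diffs` and its two flags are hypothetical, chosen for illustration; they do not describe how any particular program defines % identity.

```python
def count_diffs(row_a, row_b, count_internal_gaps=True, count_terminal_gaps=True):
    """Count differing columns between two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # Columns where both rows have a letter delimit the overlapping region;
    # gaps outside that region are treated as terminal gaps.
    both = [i for i, (a, b) in enumerate(zip(row_a, row_b)) if a != "-" and b != "-"]
    first, last = both[0], both[-1]
    diffs = 0
    for i, (a, b) in enumerate(zip(row_a, row_b)):
        if a == "-" or b == "-":
            terminal = i < first or i > last
            if terminal and count_terminal_gaps:
                diffs += 1
            elif not terminal and count_internal_gaps:
                diffs += 1
        elif a != b:
            diffs += 1
    return diffs


target = "GAATTAACA"
# First alignment: mismatches plus terminal gaps.
print(count_diffs("GATTACA--", target, count_terminal_gaps=False))  # 3
print(count_diffs("GATTACA--", target))                             # 5
# Second alignment: no mismatches, two internal gaps.
print(count_diffs("GA-TTA-CA", target, count_internal_gaps=False))  # 0
print(count_diffs("GA-TTA-CA", target))                             # 2
```

The same pair of sequences yields 0, 2, 3 or 5 differences depending on the alignment and on whether gaps count, which is why a "97%" threshold is not comparable across programs.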

Hard to define an OTU or an optimal set of OTUs.
[Figure: phylogenetic tree of sequences A, B, C with pairwise distances 1.5%, 4% and 2.5%.]

Hard to define an OTU or an optimal set of OTUs. Optimal OTUs per Schloss & Westcott’s MCC measure can be non-monophyletic.
[Figure: tree of sequences A, B, C.]

• OTUs are hacks
• Do not exist in nature
• Cannot be defined and validated robustly
• But can still be useful!

• One program, one binary
• Suite of high-throughput algorithms
• Search, clustering, dereplication, chimera detection…
• Orders of magnitude faster than BLAST
• Free for academic use (32-bit)

• Sort sequences
• Greedy list removal

Database is held in RAM for fast access. Typical state: one database sequence per cluster (centroid). Cluster assignments are written sequentially to file, not stored in RAM.

Initial state: empty database = no clusters. Input sequences are processed in file order.

Next input sequence is searched against the database. USEARCH algorithm: very fast database search (>>BLAST).

Hit: input sequence is assigned to the cluster & discarded. Record written to output file(s). Optional: alignment, other info.

No hit: query is added to the database and becomes centroid of a new cluster.
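The hit / no-hit loop described on these slides can be sketched in a few lines. This is only an illustration under stated assumptions: identity here is `1 - edit_distance / longer length`, the database is scanned linearly, and the first centroid above the threshold wins. The real USEARCH search heuristics and identity definition are not reproduced, and the names `edit_distance`, `identity` and `greedy_cluster` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match or substitution
        prev = cur
    return prev[-1]


def identity(a, b):
    """Assumed identity definition: 1 - edits / length of the longer sequence."""
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def greedy_cluster(seqs, threshold=0.97):
    """Process sequences in input order; each one joins a cluster or founds one."""
    centroids = []      # the "database": one sequence per cluster
    assignments = []    # (input index, cluster index), in input order
    for idx, seq in enumerate(seqs):
        hit = None
        for c_idx, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                hit = c_idx          # hit: assign to this cluster
                break
        if hit is None:              # no hit: query becomes a new centroid
            centroids.append(seq)
            hit = len(centroids) - 1
        assignments.append((idx, hit))
    return centroids, assignments


centroids, assignments = greedy_cluster(
    ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"], threshold=0.8)
print(centroids)
print(assignments)
```

Because a centroid is always the first member of its cluster encountered, reordering the input list can change both the centroids and the number of clusters.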

• Very fast
• Input order matters
• Centroid is always first member found
• How to sort?

Longest sequences are typically outliers and tend to split OTUs.

Centroid: CENTROID -------
Seq1:     CENTROIDINSERTED
Seq2:     CENTROIDTERMINAL

If you don’t sort by length, fragments can become centroids and member sequences may have many differences.

Most abundant sequence is likely to be biological & a good choice of centroid

• If read errors are rare:
• Abundance = size of dereplication cluster
• If read errors are common:
• Have a circular problem:
• Abundances need clustering, but
• Clustering needs abundances.
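When read errors are rare, abundance can be read directly off the dereplication step. A minimal sketch, assuming dereplication means collapsing exactly identical full-length reads (the function name `dereplicate` is hypothetical):

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads; return (sequence, abundance), most abundant first."""
    return Counter(reads).most_common()


reads = ["ACGT", "ACGT", "ACGA", "ACGT", "TTTT", "ACGA"]
for seq, abundance in dereplicate(reads):
    print(seq, abundance)
```

With frequent read errors most reads become singletons under this exact-match rule, which is the circular problem stated above.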

Biological sequence  G A T G A C G T C A   A G T C A T A G G
Read 1               G A T T A C G T C A - A G T C A A A G G
Read 2               G A T G A C G A C A - A G T C A T A G -
Read 3               G G T G A C G T C A A A G - C A T A G G
Consensus            G A T G A C G T C A   A G T C A T A G G

Calculate consensus sequence. UCLUST can do this for each cluster.
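A consensus of this kind can be sketched as a per-column majority vote over the aligned reads, dropping columns where the winning symbol is a gap. This is an illustrative simplification, not a description of how UCLUST computes its consensus; the function name `consensus` is hypothetical, and ties are resolved arbitrarily (first symbol seen).

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority symbol per alignment column; gap-majority columns are dropped."""
    out = []
    for column in zip(*aligned_reads):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != "-":
            out.append(symbol)
    return "".join(out)


# The three aligned reads from the slide.
reads = [
    "GATTACGTCA-AGTCAAAGG",  # Read 1
    "GATGACGACA-AGTCATAG-",  # Read 2
    "GGTGACGTCAAAG-CATAGG",  # Read 3
]
print(consensus(reads))  # GATGACGTCAAGTCATAGG
```

Each read carries its own errors, but no error is shared by a majority of reads in any column, so the vote recovers the biological sequence from the slide.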

Dereplicate: sort by length & run UCLUST. Longest sequences are centroids in first round. They tend to be outliers & split a natural OTU.

Find consensus sequences. Consensus sequences converge on most abundant sequence in cluster, most likely to be a correct amplicon sequence. Common for two clusters to converge on same consensus sequence: merges an OTU that was split in first round.

Before taking consensus… …after.

Consensus sequences ≈ denoised amplicons. Amplicon abundance ≈ cluster size. Circular problem solved.
Filter chimeras. Abundances needed by de novo UCHIME as well.

Sort by abundance. Run UCLUST at 97%. Centroid is final OTU.

Assign reads to OTUs: USEARCH at 97%. Outliers need special treatment: can be assigned to closest OTU, or reclustered at 97%. Most reads match an OTU.
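The assignment step can be sketched as a best-hit search with a threshold, collecting reads below the threshold as outliers for separate handling. The identity used here is a deliberate simplification (fraction of matching positions between equal-length, pre-trimmed sequences) and is an assumption of this sketch, as are the names `positional_identity` and `assign_reads`; USEARCH itself aligns the sequences.

```python
def positional_identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-trimmed sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def assign_reads(reads, otus, threshold=0.97):
    """Assign each read to its closest OTU if identity >= threshold, else mark as outlier."""
    assigned = {i: [] for i in range(len(otus))}
    outliers = []
    for read in reads:
        scores = [positional_identity(read, otu) for otu in otus]
        best = max(range(len(otus)), key=scores.__getitem__)
        if scores[best] >= threshold:
            assigned[best].append(read)
        else:
            outliers.append(read)   # candidates for closest-OTU assignment or reclustering
    return assigned, outliers


otus = ["ACGTACGTAC", "TTTTTTTTTT"]
reads = ["ACGTACGTAA", "TTTTTTTTTT", "GGGGGGGGGG"]
assigned, outliers = assign_reads(reads, otus, threshold=0.8)
print(assigned)
print(outliers)
```

The toy threshold of 0.8 is used only because the example sequences are ten bases long; the slides use 97%.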

• Python script, runs multiple USEARCH steps
• Very fast and highly scalable
• 10^6 reads in minutes on a laptop
• Ad hoc, but good biological results
• Other algorithms are also ad hoc
• Average linkage “standard” but not justified by theory
• Does not address read error correction, other challenges

• Technical issues
• Clustering threshold for error correction
• 97% seems to work well so far
• But can merge distinct amplicons…
• …degrades abundance estimate
• Higher threshold might be better if read errors rare
• Minimum cluster size threshold
• Clusters <4 reads discarded after error-correction step
• Rare species / false-positive trade-off
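The minimum cluster size rule is simple to state in code. A minimal sketch, assuming cluster sizes are already available as a mapping from a cluster label to its read count (the labels and the function name `filter_small_clusters` are hypothetical):

```python
def filter_small_clusters(cluster_sizes, min_size=4):
    """Keep clusters with at least min_size reads; smaller ones are discarded."""
    return {label: n for label, n in cluster_sizes.items() if n >= min_size}


sizes = {"cluster1": 120, "cluster2": 4, "cluster3": 3, "cluster4": 1}
print(filter_small_clusters(sizes))
```

Raising `min_size` removes more error-derived clusters but also more genuinely rare species, which is the trade-off noted in the last bullet.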

• Not like QIIME or mothur
• Not a complete suite of analysis tools
• Not "packaged" specifically for 16S
• Lower-level algorithms
• Typically used by "pipelines"
• Multiple steps
• Typical step is USEARCH command or file conversion
• Implemented by scripts (bash, perl, Python...).

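[Comparison table; cell contents not present in the transcript.]
Columns (tool, author): USEARCH (Edgar), QIIME (Knight), mothur (Schloss), Pyronoise (Quince), Perseus (Quince), ESPRIT (Sun)
Rows (task): reads to OTUs; filtered reads to OTUs; phylotype; err. correction; chimera filter (ref db); chimera filter (de novo); compare pops. (UNIFRAC); diversity (α, β)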
