BARCODE Quality Assessment: Frequency Matrix Approach Mark Y Stoeckle, Rockefeller University Kevin C R Kerr, Royal Ontario Museum PLoS ONE August 2012.

Presentation transcript:

BARCODE Quality Assessment: Frequency Matrix Approach Mark Y Stoeckle, Rockefeller University Kevin C R Kerr, Royal Ontario Museum PLoS ONE August 2012 e43992

Potential BARCODE errors Taxonomic mislabeling Pseudogenes Sequencing Error

Hypothesis: Sequencing Errors = Rare Variants

Avian BARCODEs: large, representative dataset 11K records/2.7K species (27% known)

Most 1st, 2nd positions >99.9% conserved

Most aa positions >99.9% conserved

Working definitions for rare variants:
VERY LOW FREQUENCY VARIANT (VLF): nt, aa in <0.1% seqs at a given position
SINGLETON VLF: in 1 indiv/species
SHARED VLF: in ≥2 indiv/species
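The working definitions above can be sketched in code. This is a minimal illustration, not the authors' implementation: the function names, the `(species, sequence)` record layout, and the assumption of equal-length, gap-free aligned sequences are all hypothetical. Only the 0.1% threshold and the singleton/shared distinction come from the slide.

```python
from collections import Counter, defaultdict

VLF_THRESHOLD = 0.001  # "<0.1% seqs at a given position" (from the slide)


def frequency_matrix(seqs):
    """Return one Counter per alignment column: symbol -> count."""
    length = len(seqs[0])
    return [Counter(s[i] for s in seqs) for i in range(length)]


def find_vlfs(records, threshold=VLF_THRESHOLD):
    """Classify very low frequency variants (VLFs).

    records: list of (species, aligned_sequence) tuples (hypothetical layout),
             all sequences the same length.
    Returns {(position, symbol, species): "singleton" | "shared"}.
    """
    seqs = [seq for _, seq in records]
    matrix = frequency_matrix(seqs)
    n = len(seqs)

    # Count how many individuals of each species carry each VLF.
    carriers = defaultdict(int)
    for species, seq in records:
        for pos, sym in enumerate(seq):
            if matrix[pos][sym] / n < threshold:
                carriers[(pos, sym, species)] += 1

    # Singleton: 1 individual per species; shared: 2 or more.
    return {
        key: ("singleton" if count == 1 else "shared")
        for key, count in carriers.items()
    }
```

Note that with a 0.1% cutoff, a variant seen once only counts as a VLF when the dataset holds more than 1,000 sequences at that position, so the approach presupposes a large dataset such as the avian one described earlier.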

Spatial distribution of VLFs
Singleton VLFs: (mostly) seq error; concentrated at ends of segment
Shared VLFs: (mostly) biological; relatively evenly distributed

Sliding window analysis nVLFs
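A sliding window count of VLF positions along the barcode segment might look like the following sketch. The window size and step are hypothetical parameters; the slide does not state the values used.

```python
def sliding_window_counts(positions, seq_length, window=50, step=1):
    """Count VLF positions falling in each window [start, start + window).

    positions:  0-based alignment positions where a VLF was observed
                (repeat a position once per VLF observed there).
    seq_length: length of the aligned barcode segment.
    window, step: hypothetical defaults, not taken from the slides.
    """
    counts = []
    for start in range(0, seq_length - window + 1, step):
        end = start + window
        counts.append(sum(1 for p in positions if start <= p < end))
    return counts
```

Plotting these counts against window start position would show whether VLFs cluster at the ends of the segment or are spread evenly.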

Calculating Error Rate
Most (94%) 2nd positions >99.9% conserved
(Nearly) all 2nd position seq errors are VLFs
[…] 2nd pos si-nVLFs (probable errors) / ([…] 2nd pos/BC x 10,760 BC) = 8 x […] errors/base pair > […]
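The calculation on this slide appears to divide the number of probable errors by the total number of 2nd codon positions examined. A minimal sketch of that arithmetic follows. The function and parameter names are hypothetical, and the numbers in the usage note are placeholders, since the counts did not survive in the transcript.

```python
def error_rate_per_bp(n_probable_errors, positions_per_barcode, n_barcodes):
    """Estimate sequencing errors per base pair.

    n_probable_errors:     count of singleton VLFs treated as probable errors
    positions_per_barcode: number of positions examined per barcode
    n_barcodes:            number of barcode records examined
    """
    total_positions = positions_per_barcode * n_barcodes
    if total_positions <= 0:
        raise ValueError("total number of positions must be positive")
    return n_probable_errors / total_positions
```

For example, with placeholder inputs of 10 probable errors across 1,000 barcodes of 200 examined positions each, `error_rate_per_bp(10, 200, 1000)` gives 5e-05 errors per base pair. Because some singleton VLFs are likely biological, a rate computed this way is an upper limit.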

Limitations, Observations
First seq error assessment for BARCODEs
Some singleton nVLFs likely biological, so calc rate is upper limit
~3% BARCODEs ≥ 1 error (av 1.7/BARCODE)
Seq errors unlikely to affect species ID
Increased apparent intraspecific variation

Applications: 1. Compare database quality (error bars = 95% CI)
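When comparing error rates between databases, each rate is a proportion with its own sampling uncertainty. The slide does not say how its 95% confidence intervals were computed; the sketch below uses the Wilson score interval as one common choice for small proportions, which is an assumption rather than the authors' stated method.

```python
import math


def wilson_interval(n_errors, n_positions, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%).

    n_errors:    observed count of probable errors
    n_positions: total number of positions examined
    Returns (lower, upper) bounds on the error proportion.
    """
    if n_positions <= 0:
        raise ValueError("n_positions must be positive")
    p = n_errors / n_positions
    denom = 1 + z * z / n_positions
    center = (p + z * z / (2 * n_positions)) / denom
    half = (
        z
        * math.sqrt(p * (1 - p) / n_positions + z * z / (4 * n_positions ** 2))
        / denom
    )
    # Clamp to [0, 1] to guard against floating point overshoot.
    return max(0.0, center - half), min(1.0, center + half)
```

Two databases whose intervals do not overlap would differ in estimated error rate by more than sampling noise alone suggests.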

2. Highlight BARCODEs w probable errors--annotate? Annotated Homer Simpson Portrait

3. BONUS: cryptic pseudogenes flagged by multiple SHARED VLFs Alder flycatcher (Empidonax alnorum)

Fuscous flycatcher (Cnemotriccus fuscatus) Canada goose (Branta canadensis) Cryptic pseudogenes uncommon: 0.1% BARCODEs

Conclusions Avian BARCODE sequence quality high and improving Frequency matrix potential utility for sequence database QA Caution on studies involving rare variants

Acknowledgments: Natural Sciences and Engineering Research Council of Canada; Jesse Ausubel, Alan Baker, Jan Lifjeld, Arild Johnsen, Per Ericson, Carla Dove, Gary Graves, Pablo Tubaro, Dario Litjmaer