26th International Mammalian Genome Conference 2012 Bioinformatics Workshop Sunday, October 21, 2012 09.00 – 12.00 Location: Tarpon #IMGC2012.


IMGS 2012 Bioinformatics Workshop: RNA Seq using Galaxy
Presentation transcript:

26th International Mammalian Genome Conference 2012 Bioinformatics Workshop Sunday, October 21, 2012, 09.00 – 12.00 Location: Tarpon #IMGC2012 Wi-Fi: twgroup / password: group5500

IMGS 2012 Bioinformatics Workshop Deanna Church, NCBI Carol Bult, The Jackson Laboratory

Tutorial Resources Galaxy – Genome Analysis for Biologists – NCBI 1000 Genomes Browser – Genome Reference Consortium –

Schedule
9-10 am: Intro. Genome Assembly Basics. Alignment Basics.
am: Getting Stuff Done. File formats (sequences, alignments, annotations).
am: Doing stuff. Typical RNA-Seq workflow. RNA Seq in Galaxy. Differential Gene Expression with RNA Seq data.

Assembly Basics 19 Oct 2012

Some assembly required…

Restrict and make libraries (2, 4, 8, 10, 40, 150 kb). End-sequence all clones and retain pairing information (“mate-pairs”). Each end sequence is referred to as a read. Find sequence overlaps. WGS contig tails. WGS: Sanger Reads. Overlap-Layout-Consensus.
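The "find sequence overlaps" step can be illustrated with a minimal sketch. It reports the longest exact suffix-prefix overlap between two reads; real assemblers tolerate sequencing errors and use paired-end information, so this is a teaching simplification and not the method used for any assembly discussed here. The function name `overlap` and the `min_length` parameter are hypothetical.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`.

    Returns 0 when no overlap of at least `min_length` bases exists.
    """
    start = 0
    while True:
        # Look for the first min_length bases of b somewhere in a.
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        # If the rest of a from this position is a prefix of b, it is an overlap.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


if __name__ == "__main__":
    print(overlap("TTACGT", "CGTACCGT"))
```

Because the scan moves left to right through `a`, the first hit it accepts is the longest overlap.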

Alignable trace count in frameshift window vs control in Opossum: 51nt window, >95% identity. 23,894 genes. 452 models with >1 exon, sym. best hit, and one frameshift. 334 cases have 3 or fewer hits. Alexander Souvorov, NCBI

Fragmented genomes tend to have fewer frameshifts. Alexander Souvorov, NCBI

Fragmented genomes tend to have more partial models Alexander Souvorov, NCBI

BAC insert BAC vector Shotgun sequence Assemble Fold sequence Gaps deeper sequence coverage rarely resolves all gaps GAPS “finishers” go in to manually fill the gaps, often by PCR Clone based assemblies

Scaffold N50 by chromosome
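N50 is the length L such that scaffolds of length L or longer contain at least half of the total assembled bases. A minimal sketch of the calculation follows; the function name `n50` and the example lengths are illustrative only and are not taken from the slides.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half the total is N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


if __name__ == "__main__":
    # Made-up scaffold lengths for illustration.
    print(n50([80, 70, 50, 40, 30, 20]))
```

Computing the value per chromosome only requires grouping scaffold lengths by chromosome before calling the function.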

7 May 2010 Spanned Gaps by Assembly

Church et al., 2011 PLoS Biology

NCBI36 (hg18) GRCh37 (hg19)

NCBI35 (hg17) GRCh37 (hg19) AL AL

Build sequence contigs based on contigs defined in TPF (Tiling Path File). Check for orientation consistencies Select switch points Instantiate sequence for further analysis Switch point Consensus sequence

NCBI36

nsv (nstd68) Submitted on NCBI35 (hg17)

NCBI35 (hg17) Tiling Path GRCh37 (hg19) Tiling Path Gap Inserted Moved approximately 2 Mb distal on chr15 NC_ (chr15) NC_ (chr15) Removed from assembly Added to assembly HG-24

Sequences from haplotype 1 Sequences from haplotype 2 Old Assembly model: compress into a consensus New Assembly model: represent both haplotypes

AC AC AC AC AC AC AC AC NCBI36 NC_ (chr4) Tiling Path Xue Y et al, 2008 TMPRSS11E TMPRSS11E2 GRCh37 NC_ (chr4) Tiling Path AC AC AC AC AC AC TMPRSS11E GRCh37 : NT_ (UGT2B17 alternate locus) AC AC AC AC AC TMPRSS11E2 nsv (nstd37)

GRCh37 (hg19): 7 alternate haplotypes at the MHC. Alternate loci released as: FASTA, AGP, alignment to chromosome. UGT2B17, MHC, MAPT

Assembly (e.g. GRCh37.p2) Primary Assembly Non-nuclear assembly unit (e.g. MT) ALT 1 ALT 2 ALT 3 ALT 4 ALT 5 ALT 9 ALT 6 ALT 7 ALT 8 PAR Patches … Genomic Region (MHC) Genomic Region (UGT2B17) Genomic Region (MAPT) Genomic Region (ABO) Genomic Region (SMA) Genomic Region (PECAM1)

MHC (chr6) Chr 6 representation (PGF) Alt_Ref_Locus_2 (COX)

Richa Agarwala Eugene Yaschenko

GenBank Data Archives: data in a common format; data in a single location (and mirrored); most quality checked prior to deposition; robust data tracking mechanism (accession.version); data owned by submitter
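The accession.version convention can be shown with a short sketch: the accession stays fixed for a record, and the version number after the final dot increases whenever the sequence changes. The helper name `split_accession` and the identifiers in the example are hypothetical and are not real records.

```python
def split_accession(identifier):
    """Split an 'accession.version' identifier into (accession, version)."""
    accession, dot, version = identifier.rpartition(".")
    if not dot or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return accession, int(version)


def same_record(first, second):
    """True when two identifiers are versions of the same record."""
    return split_accession(first)[0] == split_accession(second)[0]


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    print(split_accession("EXAMPLE_000001.2"))
    print(same_record("EXAMPLE_000001.1", "EXAMPLE_000001.2"))
```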

Data tracking ABC J1 Gaps Phase Length Date FP Oct-2009 FP Oct-2010 FP Nov-2010

Mouse chrX: 35,000,000-36,000,000

X MGSCv3 MGSCv36

Unique Identification NC_ chrX in MGSCv36 List of scaffolds and gaps (AGP) List of components and gaps (AGP)
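An AGP file is a tab-delimited table with nine columns in which each row is either a component or a gap; component types N and U mark gaps. The sketch below reads the first six columns under that assumption. The class name `AgpRow`, the function `parse_agp`, and every identifier in `EXAMPLE_AGP` are made up for illustration and do not come from a real assembly.

```python
from typing import NamedTuple, Optional


class AgpRow(NamedTuple):
    obj: str                     # object being built, e.g. a chromosome or scaffold
    obj_beg: int                 # 1-based start on the object
    obj_end: int                 # 1-based end on the object
    part_number: int
    component_type: str
    component_id: Optional[str]  # set for component rows
    gap_length: Optional[int]    # set for gap rows

    @property
    def is_gap(self):
        return self.component_type in ("N", "U")


def parse_agp(text):
    """Parse AGP text into rows, skipping comment and blank lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        gap = f[4] in ("N", "U")
        rows.append(AgpRow(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                           None if gap else f[5],
                           int(f[5]) if gap else None))
    return rows


# Hypothetical three-row AGP: component, gap, component.
EXAMPLE_AGP = "\n".join([
    "# made-up example",
    "\t".join(["chrDemo", "1", "1000", "1", "F", "COMP_A.1", "1", "1000", "+"]),
    "\t".join(["chrDemo", "1001", "1100", "2", "N", "100", "scaffold", "yes", "paired-ends"]),
    "\t".join(["chrDemo", "1101", "1600", "3", "F", "COMP_B.1", "1", "500", "+"]),
])

if __name__ == "__main__":
    for row in parse_agp(EXAMPLE_AGP):
        print(row.obj, row.obj_beg, row.obj_end, "gap" if row.is_gap else row.component_id)
```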

hg19 GRCh37 mm8 MGSCv37 NCBIM37 danRer5 Zv7 What’s in a name?

Assemblies with the same name aren’t always the same chr21:8,913,216-9,246,964

Assemblies with the same name aren’t always the same Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX

hg19 GRCh37 GRCh37.p2 GCA_ Assembly Database to the rescue GCA_

GRCh37 hg19

Assembly (e.g. GRCh37.p5) GCA_ /GCF_ Primary Assembly GCA_ / GCF_ ALT 1 GCA_ / GCF_ ALT 2 GCA_ / GCF_ ALT 3 GCA_ / GCF_ ALT 4 GCA_ / GCF_ ALT 5 GCA_ / GCF_ ALT 6 GCA_ / GCF_ ALT 7 GCA_ / GCF_ ALT 8 GCA_ / GCF_ ALT 9 GCA_ / GCF_ Patches GCA_ GCF_ Non-nuclear assembly unit (e.g. MT) GCA_ / GCF_

GenBank vs RefSeq. GenBank: submitter owned; redundant; updated rarely; INSDC. RefSeq: RefSeq owned; non-redundant; curated; not INSDC. BRCA1 in GenBank: 83 genomic records, 31 mRNA records, 27 protein records. BRCA1 in RefSeq: 3 genomic records, 5 mRNA records, 1 RNA record, 5 protein records.

Sequence Alignments Basics

Hypothesis

The biological basis of sequence alignment is evolution. Sequences that share a common ancestor are homologous. Sequence similarity is evidence of homology. Sequences, genes, etc. are either homologous or not; there is no “percent homology”.

Homology. Orthologous sequences: common ancestor; speciation. Paralogous sequences: gene duplication within a species (lineage-specific expansion).

Alignment to NR -> Homology Alignment to an Assembly -> Mapping

Global and local alignments. Optimal global alignment (Needleman-Wunsch): sequences align essentially from end to end. Optimal local alignment (Smith-Waterman): sequences align only in small, isolated regions. References: Needleman and Wunsch (1970), J. Mol. Biol. 48; Smith and Waterman (1981), J. Mol. Biol. 147.
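The difference between the two approaches shows up clearly in a scoring-only sketch of each dynamic programme. The scoring values (match +1, mismatch -1, linear gap -1) and the function names are assumptions chosen for illustration; production aligners use substitution matrices and affine gap penalties.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch style score: both sequences aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against nothing costs i gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman style score: best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at zero lets an alignment restart anywhere.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    print(global_score("ACGT", "AGT"))
    print(local_score("TTTTACGTTTTT", "GGGACGGGG"))
```

The only structural differences are the initial row and column (gap penalties versus zeros), the floor at zero, and where the final score is read from (last cell versus the maximum cell).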

Hashing methods. Query sequence: MVRRLPERTSTPACE. Word size = 3 (configurable). Words: MVR VRR RRL RLP LPE PER ERT RTS TST STP TPA PAC ACE. References: Wilbur & Lipman (1983), PNAS 80; Lipman & Pearson (1985), Science 227; Pearson & Lipman (1988), PNAS 85.
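The word decomposition on this slide can be reproduced with a few lines of code. The sketch below splits the query into overlapping words and builds a lookup table from each word to its positions, which is the basic idea behind hashing-based search; the function names `words` and `word_index` are hypothetical.

```python
from collections import defaultdict


def words(sequence, word_size=3):
    """All overlapping words of length `word_size`, in order."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


def word_index(sequence, word_size=3):
    """Map each word to the list of 0-based positions where it starts."""
    index = defaultdict(list)
    for position, word in enumerate(words(sequence, word_size)):
        index[word].append(position)
    return dict(index)


if __name__ == "__main__":
    print(" ".join(words("MVRRLPERTSTPACE")))
```

A search then looks up each word of a database sequence in the table instead of comparing every pair of positions.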

Fonseca et al., 2012

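Sensitivity vs. Specificity. Sensitivity = fraction of actual positives that are correctly identified (true positives, TP). Specificity = fraction of actual negatives that are correctly identified (true negatives, TN). Confusion matrix (actual vs. predicted): actual positive, predicted positive = TP; actual positive, predicted negative = FN; actual negative, predicted positive = FP; actual negative, predicted negative = TN. Sensitivity = TP/(TP+FN). Specificity = TN/(TN+FP).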
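As a worked example of the two formulas, the sketch below computes both measures from confusion-matrix counts. The counts are invented for illustration and do not come from any aligner benchmark.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that were identified: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that were identified: TN / (TN + FP)."""
    return tn / (tn + fp)


if __name__ == "__main__":
    # Invented counts: 90 TP, 10 FN, 80 TN, 20 FP.
    print(sensitivity(tp=90, fn=10))
    print(specificity(tn=80, fp=20))
```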

Aligner technology specific? Gapped vs. ungapped alignments? Spliced alignments (cDNAs/RNA-Seq) Can use paired-end data?

Ruffalo et al., 2012

Li and Homer, 2010

Indels have correct and consistent alignment in reads after multiple sequence local realignment. DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. Phase 1: NGS data processing. Highlighted as one of the major methodological advances of the 1000 Genomes Pilot Project!

CDC27

Richa Agarwala MHC Alternate locus Alignment to chr6

Mouse Ren1 chr1 (NC_ ): NM_ : transcript from C57BL/6J NM_ : transcript from FVB/N

CEPH: A=1.000 G=0 APOL1

YRI: A= G= Multiple submissions Frequency Data 1000G Suspect Sudmant et al., 2010