26th International Mammalian Genome Conference 2012 Bioinformatics Workshop Sunday, October 21 – Location: Tarpon #IMGC2012
IMGS 2012 Bioinformatics Workshop Deanna Church, NCBI Carol Bult, The Jackson Laboratory
Tutorial Resources Galaxy – Genome Analysis for Biologists – NCBI 1000 Genomes Browser – Genome Reference Consortium –
Schedule. 9-10 am: Intro (Genome Assembly Basics; Alignment Basics). am: Getting Stuff Done (file formats: sequences, alignments, annotations). am: Doing Stuff (typical RNA-Seq workflow; RNA-Seq in Galaxy; differential gene expression with RNA-Seq data).
Assembly Basics 19 Oct 2012
Some assembly required…
WGS: Sanger reads. Restrict and make libraries of 2, 4, 8, 10, 40, and 150 kb. End-sequence all clones and retain the pairing information (“mate-pairs”); each end sequence is referred to as a read. Find sequence overlaps between reads to build WGS contigs (figure labels: WGS contig, tails). Assembly approach: Overlap-Layout-Consensus.
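The overlap-finding step can be illustrated with a minimal sketch. It assumes error-free reads on a single strand and merges them greedily by longest suffix-prefix overlap; this is a simplification for illustration, not the full assembly procedure used for Sanger WGS data. The function names and the toy reads are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the longest exact overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps: the reads stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical toy reads, chosen so that neighbours overlap by 4 bases.
toy_reads = ["ATGGCGT", "GCGTACC", "TACCTTA"]
contigs = greedy_assemble(toy_reads)
```

Real assemblers also handle sequencing errors, reverse complements, repeats, and mate-pair constraints, none of which this sketch attempts.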
Alignable trace count in frameshift window vs. control in opossum (51 nt window, >95% identity): 23,894 genes; 452 models with >1 exon, sym. best hit, and one frameshift; 334 cases have 3 or fewer hits. Alexander Souvorov, NCBI
Fragmented genomes tend to have fewer frameshifts. Alexander Souvorov, NCBI
Fragmented genomes tend to have more partial models Alexander Souvorov, NCBI
Clone-based assemblies: BAC insert in BAC vector; shotgun sequence; assemble. Gaps: deeper sequence coverage rarely resolves all gaps; “finishers” go in to manually fill the gaps, often by PCR. Figure labels: fold sequence, gaps.
Scaffold N50 by chromosome
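For readers unfamiliar with the metric, N50 is the length L such that scaffolds of length L or longer contain at least half of the total assembled bases. A minimal sketch, with hypothetical scaffold lengths:

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first scaffold where the running sum reaches half the total.
        if running * 2 >= total:
            return length
    return 0  # empty input


# Hypothetical lengths (in kb): total 290, half 145; 80 + 70 = 150 reaches it.
example_n50 = n50([80, 70, 50, 40, 30, 20])
```

Computing it per chromosome only requires grouping scaffold lengths by chromosome before calling the function.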
7 May 2010 Spanned Gaps by Assembly
Church et al., 2011 PLoS Biology
NCBI36 (hg18) GRCh37 (hg19)
NCBI35 (hg17) GRCh37 (hg19) AL AL
Build sequence contigs based on contigs defined in the TPF (Tiling Path File): check for orientation consistencies; select switch points; instantiate sequence for further analysis. Figure labels: switch point, consensus sequence.
NCBI36
nsv (nstd68) Submitted on NCBI35 (hg17)
Figure: NCBI35 (hg17) tiling path, NC_ (chr15), vs. GRCh37 (hg19) tiling path, NC_ (chr15). Labels: gap inserted; moved approximately 2 Mb distal on chr15; removed from assembly; added to assembly; HG-24.
Sequences from haplotype 1; sequences from haplotype 2. Old assembly model: compress into a consensus. New assembly model: represent both haplotypes.
Figure (Xue Y et al., 2008; nsv (nstd37)): NCBI36 NC_ (chr4) tiling path, with TMPRSS11E and TMPRSS11E2; GRCh37 NC_ (chr4) tiling path, with TMPRSS11E; GRCh37 NT_ (UGT2B17 alternate locus), with TMPRSS11E2.
GRCh37 (hg19): 7 alternate haplotypes at the MHC. Alternate loci (UGT2B17, MHC, MAPT) released as: FASTA; AGP; alignment to chromosome.
Assembly (e.g. GRCh37.p2): Primary Assembly; Non-nuclear assembly unit (e.g. MT); ALT 1 through ALT 9; PAR; Patches; … Genomic regions: MHC, UGT2B17, MAPT, ABO, SMA, PECAM1.
MHC (chr6) Chr 6 representation (PGF) Alt_Ref_Locus_2 (COX)
Richa Agarwala Eugene Yaschenko
GenBank data archives: data in a common format; data in a single location (and mirrored); most quality checked prior to deposition; robust data tracking mechanism (accession.version); data owned by submitter.
Data tracking: ABC J1. Columns: Gaps, Phase, Length, Date. Rows: FP (Oct-2009); FP (Oct-2010); FP (Nov-2010).
Mouse chrX: 35,000,000-36,000,000
X: MGSCv3, MGSCv36
Unique identification: NC_ is chrX in MGSCv36. It is defined by a list of scaffolds and gaps (AGP); each scaffold is defined by a list of components and gaps (AGP).
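An AGP file is a tab-delimited description of how an object (a chromosome or scaffold) is built from components and gaps. The sketch below reads the commonly used nine-column layout, in which a component type of N or U marks a gap line and column 6 then holds the gap length instead of a component identifier. The object and component names in the example are hypothetical, not real accessions.

```python
from typing import NamedTuple, Optional


class AgpRow(NamedTuple):
    obj: str                  # object being described (chromosome or scaffold)
    obj_beg: int              # 1-based start on the object
    obj_end: int              # 1-based inclusive end on the object
    part: int                 # part number along the object
    comp_type: str            # e.g. F (finished), W (WGS contig), N/U (gap)
    is_gap: bool
    comp_id: Optional[str]    # component identifier (non-gap lines only)
    gap_length: Optional[int] # gap length (gap lines only)


def parse_agp(text: str):
    """Parse AGP text into a list of AgpRow records, skipping comments."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        is_gap = f[4] in ("N", "U")
        rows.append(AgpRow(
            f[0], int(f[1]), int(f[2]), int(f[3]), f[4], is_gap,
            None if is_gap else f[5],
            int(f[5]) if is_gap else None,
        ))
    return rows


# Hypothetical records: two clones separated by a 500 bp gap.
EXAMPLE = "\n".join([
    "# hypothetical AGP records",
    "\t".join(["chrDemo", "1", "1000", "1", "F", "CLONE_A.1", "1", "1000", "+"]),
    "\t".join(["chrDemo", "1001", "1500", "2", "N", "500", "scaffold", "yes", "paired-ends"]),
    "\t".join(["chrDemo", "1501", "2300", "3", "F", "CLONE_B.2", "201", "1000", "-"]),
])
rows = parse_agp(EXAMPLE)
total_gap = sum(r.gap_length for r in rows if r.is_gap)
```

Because every component is named by accession.version, the same AGP always instantiates the same sequence, which is what makes the identification unique.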
What’s in a name? hg19, GRCh37, mm8, MGSCv37, NCBIM37, danRer5, Zv7
Assemblies with the same name aren’t always the same: Zv7 chr21:8,913,216-9,246,964; Mouse Build 36 chrX.
hg19 GRCh37 GRCh37.p2 GCA_ Assembly Database to the rescue GCA_
GRCh37, hg19
Assembly (e.g. GRCh37.p5): GCA_ / GCF_. Each assembly unit also has its own GCA_ / GCF_ pair: Primary Assembly; ALT 1 through ALT 9; Patches; Non-nuclear assembly unit (e.g. MT).
GenBank vs. RefSeq. GenBank: submitter owned; redundant; updated rarely; INSDC; BRCA1 has 83 genomic records, 31 mRNA records, 27 protein records. RefSeq: RefSeq owned; non-redundant; curated; not INSDC; BRCA1 has 3 genomic records, 5 mRNA records, 1 RNA record, 5 protein records.
Sequence Alignment Basics
Hypothesis
The biological basis of sequence alignment is evolution. Sequences that share a common ancestor are homologous – Sequence similarity is evidence of homology – Sequences, genes, etc. are homologous or not; there is no “percent homology”
Homology. Orthologous sequences – Common ancestor; speciation. Paralogous sequences – Gene duplication within a species (lineage-specific expansion)
Alignment to NR -> Homology. Alignment to an Assembly -> Mapping.
Global and local alignments. Optimal global alignment (Needleman-Wunsch): sequences align essentially from end to end. Optimal local alignment (Smith-Waterman): sequences align only in small, isolated regions. References: Needleman and Wunsch (1970), J. Mol. Biol. 48; Smith and Waterman (1981), J. Mol. Biol. 147.
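The difference between the two dynamic-programming approaches can be seen in a short score-only sketch. The scoring scheme (match +1, mismatch -1, linear gap penalty -1) is an assumption chosen for illustration; real aligners use substitution matrices and affine gap penalties. The global version charges for unaligned ends, while the local version resets negative scores to zero and reports the best-scoring region anywhere in the matrix.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch style score: both sequences aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # leading gaps are penalized
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]  # score in the bottom-right cell


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman style score: best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)  # no penalty for skipping a prefix
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best  # best cell anywhere in the matrix


# Hypothetical sequences sharing the core "GATTACA" with unrelated flanks.
x, y = "TTTGATTACATTT", "CCGATTACACC"
```

With these inputs the local score reflects only the shared core, whereas the global score is pulled down by the mismatched flanks.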
Hashing methods. Query sequence: MVRRLPERTSTPACE. Word size = 3 (configurable). Words: MVR, VRR, RRL, RLP, LPE, PER, ERT, RTS, TST, STP, TPA, PAC, ACE. References: Wilbur & Lipman (1983), PNAS 80; Lipman & Pearson (1985), Science 227; Pearson & Lipman (1988), PNAS 85.
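The word decomposition on this slide can be reproduced in a few lines. The sketch below splits a sequence into overlapping words, builds a lookup table from word to positions, and uses it to find exact word matches (seeds) between a query and a subject. It illustrates the general idea only; the function names are hypothetical and no particular published tool is implied.

```python
def words(seq: str, k: int = 3):
    """All overlapping words of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(seq: str, k: int = 3):
    """Map each word to the list of 0-based positions where it starts."""
    index = {}
    for pos, word in enumerate(words(seq, k)):
        index.setdefault(word, []).append(pos)
    return index


def seed_hits(query: str, subject: str, k: int = 3):
    """Return (query_pos, subject_pos) pairs for every exact word match."""
    index = build_index(subject, k)
    return [(qpos, spos)
            for qpos, word in enumerate(words(query, k))
            for spos in index.get(word, [])]


slide_words = words("MVRRLPERTSTPACE")
hits = seed_hits("LPERT", "MVRRLPERTSTPACE")
```

Hits that fall on the same diagonal (constant difference between subject and query positions) suggest a region worth extending with a full alignment.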
Fonseca et al., 2012
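Sensitivity vs. Specificity. Sensitivity = fraction of actual positives that are correctly identified (true positives, TP). Specificity = fraction of actual negatives that are correctly identified (true negatives, TN). Confusion matrix (rows: actual; columns: predicted): actual positives are TP if predicted positive and FN if predicted negative; actual negatives are FP if predicted positive and TN if predicted negative. Sensitivity = TP/(TP+FN); Specificity = TN/(TN+FP)
Choosing an aligner: Technology specific? Gapped vs. ungapped alignments? Spliced alignments (cDNAs/RNA-Seq)? Can use paired-end data?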
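As a worked example of the two formulas, the sketch below computes both values from confusion-matrix counts. The counts are hypothetical and only serve to show the arithmetic.

```python
def sensitivity(tp: int, fn: int) -> float:
    """TP / (TP + FN): fraction of actual positives that were found."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """TN / (TN + FP): fraction of actual negatives correctly rejected."""
    return tn / (tn + fp)


# Hypothetical counts: 100 actual positives and 100 actual negatives.
tp, fn, fp, tn = 90, 10, 20, 80
sens = sensitivity(tp, fn)   # 90 / 100
spec = specificity(tn, fp)   # 80 / 100
```

For read alignment, a more sensitive aligner finds more of the true read placements, while a more specific one reports fewer false placements; tuning usually trades one against the other.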
Ruffalo et al., 2012
Li and Homer, 2010
Phase 1: NGS data processing. Indels have correct and consistent alignment in reads after multiple sequence local realignment. Highlighted as one of the major methodological advances of the 1000 Genomes Pilot Project! DePristo, M., Banks, E., Poplin, R. et al. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.
CDC27
MHC alternate locus: alignment to chr6. Richa Agarwala
Mouse Ren1 chr1 (NC_ ): NM_ : transcript from C57BL/6J NM_ : transcript from FVB/N
CEPH: A=1.000 G=0 APOL1
YRI: A= G= Multiple submissions Frequency Data 1000G Suspect Sudmant et al., 2010