26th International Mammalian Genome Conference 2012 Bioinformatics Workshop Sunday, October 21, 2012 09.00 – 12.00 Location: Tarpon #IMGC2012.


1 26th International Mammalian Genome Conference 2012 Bioinformatics Workshop Sunday, October 21, 2012 09.00 – 12.00 Location: Tarpon Room @IMGC2012 #IMGC2012 Wi-Fi: twgroup / password: group5500

2 IMGS 2012 Bioinformatics Workshop Deanna Church, NCBI Carol Bult, The Jackson Laboratory

3 Tutorial Resources
Galaxy – https://main.g2.bx.psu.edu/
Genome Analysis for Biologists – http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/
NCBI 1000 Genomes Browser – http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/
Genome Reference Consortium – http://genomereference.org/

4 Schedule
9-10 am: Intro (Genome Assembly Basics, Alignment Basics)
10-11 am: Getting Stuff Done (File formats: sequences, alignments, annotations)
11 am-12 pm: Doing Stuff (Typical RNA-Seq workflow, RNA-Seq in Galaxy, Differential Gene Expression with RNA-Seq data)

5 Assembly Basics 19 Oct 2012

6 Some assembly required…

7 WGS: Sanger Reads. Restrict and make libraries (2, 4, 8, 10, 40, 150 kb). End-sequence all clones and retain pairing information (“mate-pairs”); each end sequence is referred to as a read. Find sequence overlaps. WGS contig tails. Overlap-Layout-Consensus
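The "find sequence overlaps" step can be illustrated with a minimal sketch. It assumes exact suffix-to-prefix matches between error-free reads, which real assemblers do not require; the function names and the `min_length` parameter are illustrative and not taken from the slides.

```python
from itertools import permutations


def overlap(a: str, b: str, min_length: int = 3) -> int:
    """Return the length of the longest suffix of `a` that is a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_length` bases exists.
    """
    start = 0
    while True:
        # Look for the first `min_length` bases of b somewhere in a.
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        # Check whether the rest of a, from that position, is a prefix of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def overlap_pairs(reads, min_length: int = 3):
    """Map each ordered pair of reads to its overlap length (overlaps > 0 only)."""
    return {
        (a, b): n
        for a, b in permutations(reads, 2)
        if (n := overlap(a, b, min_length)) > 0
    }
```

The resulting pairs are the edges of an overlap graph; laying out reads along those edges and calling a consensus are the later steps, which this sketch does not attempt.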

8 http://schatzlab.cshl.edu/teaching/2010/Lecture%203%20-%20Graphs%20and%20Genomes.pdf

9 Alignable trace count in frameshift window vs control in Opossum: 51 nt window, >95% identity. 23,894 genes; 452 models with >1 exon, sym. best hit, and one frameshift; 334 cases have 3 or fewer hits. Alexander Souvorov, NCBI

10 Fragmented genomes tend to have fewer frameshifts. Alexander Souvorov, NCBI

11 Fragmented genomes tend to have more partial models Alexander Souvorov, NCBI

12 Clone-based assemblies: BAC insert, BAC vector. Shotgun sequence. Assemble. Fold sequence. Gaps: deeper sequence coverage rarely resolves all gaps; “finishers” go in to manually fill the gaps, often by PCR.

13 Scaffold N50 by chromosome
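For reference, scaffold N50 is the length L such that scaffolds of length L or longer contain at least half of the total assembled bases. A minimal sketch of the calculation follows; the function name and the example lengths are illustrative and are not data from the slide.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half the total.
        if running >= half_total:
            return length
    raise ValueError("n50() requires at least one length")
```

Computing it per chromosome, as on the slide, just means calling the function once per chromosome on that chromosome's scaffold lengths.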

14 7 May 2010 Spanned Gaps by Assembly

15 Church et al., 2011 PLoS Biology http://genomereference.org

16 NCBI36 (hg18) GRCh37 (hg19)

17 NCBI35 (hg17) GRCh37 (hg19) AL139246.20 AL139246.21

18 Build sequence contigs based on contigs defined in TPF (Tiling Path File). Check for orientation consistencies Select switch points Instantiate sequence for further analysis Switch point Consensus sequence

19 NCBI36

20 nsv832911 (nstd68) Submitted on NCBI35 (hg17)

21 NCBI35 (hg17) Tiling Path. GRCh37 (hg19) Tiling Path. Gap inserted. Moved approximately 2 Mb distal on chr15. NC_000015.8 (chr15), NC_000015.9 (chr15). Removed from assembly. Added to assembly. HG-24

22 Sequences from haplotype 1 Sequences from haplotype 2 Old Assembly model: compress into a consensus New Assembly model: represent both haplotypes

23 AC074378.4 AC079749.5 AC134921.2 AC147055.2 AC140484.1 AC019173.4 AC093720.2 AC021146.7 NCBI36 NC_000004.10 (chr4) Tiling Path Xue Y et al, 2008 TMPRSS11E TMPRSS11E2 GRCh37 NC_000004.11 (chr4) Tiling Path AC074378.4 AC079749.5 AC134921.1 AC147055.2 AC093720.2 AC021146.7 TMPRSS11E GRCh37 : NT_167250.1 (UGT2B17 alternate locus) AC074378.4 AC140484.1 AC019173.4 AC226496.2 AC021146.7 TMPRSS11E2 nsv532126 (nstd37)

24 GRCh37 (hg19) http://genomereference.org 7 alternate haplotypes at the MHC. Alternate loci released as: FASTA, AGP, Alignment to chromosome. UGT2B17, MHC, MAPT

25 Assembly (e.g. GRCh37.p2) Primary Assembly Non-nuclear assembly unit (e.g. MT) ALT 1 ALT 2 ALT 3 ALT 4 ALT 5 ALT 9 ALT 6 ALT 7 ALT 8 PAR Patches … Genomic Region (MHC) Genomic Region (UGT2B17) Genomic Region (MAPT) Genomic Region (ABO) Genomic Region (SMA) Genomic Region (PECAM1)

26 MHC (chr6) Chr 6 representation (PGF) Alt_Ref_Locus_2 (COX)

27 Richa Agarwala Eugene Yaschenko

28

29 GenBank Data Archives  Data in a common format  Data in a single location (and mirrored)  Most quality checked prior to deposition  Robust data tracking mechanism (accession.version)  Data owned by submitter

30 Data tracking ABC14-1065514J1 GapsPhaseLengthDate FP565796.111 21-Oct-2009 FP565796.210 14-Oct-2010 FP565796.330 07-Nov-2010
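The accession.version scheme makes it easy to tell which record is the most recent: the accession stays fixed and the integer after the dot increases with each sequence update. A small sketch, with hypothetical helper names, shows how such identifiers might be split and compared.

```python
def split_accession(identifier: str):
    """Split 'FP565796.3' into ('FP565796', 3)."""
    accession, dot, version = identifier.rpartition(".")
    if not dot or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return accession, int(version)


def latest_version(identifiers):
    """Return the highest-versioned identifier among records of one accession."""
    parsed = [(split_accession(i), i) for i in identifiers]
    accessions = {acc for (acc, _), _ in parsed}
    if len(accessions) != 1:
        raise ValueError("identifiers belong to different accessions")
    # Compare on the integer version, not on the string.
    return max(parsed, key=lambda item: item[0][1])[1]
```

Comparing versions as integers matters: as strings, ".9" would sort after ".21".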

31 Mouse chrX: 35,000,000-36,000,000

32 X MGSCv3 MGSCv36

33 Unique Identification NC_000086.6 chrX in MGSCv36 List of scaffolds and gaps (AGP) List of components and gaps (AGP)

34 hg19 GRCh37 mm8 MGSCv37 NCBIM37 danRer5 Zv7 What’s in a name?

35

36 Assemblies with the same name aren’t always the same chr21:8,913,216-9,246,964

37 Assemblies with the same name aren’t always the same Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX
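One practical consequence is that a coordinate string is only meaningful together with the exact assembly it refers to. The sketch below parses a region string of the form shown on the slide and keeps the assembly label attached to it; the `Region` type, the `parse_region` function, and the second assembly label in the usage note are hypothetical.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    assembly: str
    chrom: str
    start: int
    end: int


_REGION = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


def parse_region(text: str, assembly: str) -> Region:
    """Parse 'chr21:8,913,216-9,246,964' and tag it with an assembly label."""
    match = _REGION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised region string: {text!r}")
    # Thousands separators are dropped before converting to integers.
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if start > end:
        raise ValueError("region start is greater than region end")
    return Region(assembly, match["chrom"], start, end)
```

With this, `parse_region("chr21:8,913,216-9,246,964", "Zv7")` and the same string tagged with a different assembly label compare as unequal, which is the behaviour wanted when the underlying sequences differ.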

38 hg19 GRCh37 GRCh37.p2 GCA_000001405.1 Assembly Database to the rescue GCA_000001405.3

39 http://www.ncbi.nlm.nih.gov/genome/assembly GRCh37 hg19

40

41 Assembly (e.g. GRCh37.p5): GCA_000001405.6 / GCF_000001405.17
Primary Assembly: GCA_000001305.1 / GCF_000001305.13
ALT 1: GCA_000001315.1 / GCF_000001315.1
ALT 2: GCA_000001325.1 / GCF_000001325.2
ALT 3: GCA_000001335.1 / GCF_000001335.1
ALT 4: GCA_000001345.1 / GCF_000001345.1
ALT 5: GCA_000001355.1 / GCF_000001355.1
ALT 6: GCA_000001365.1 / GCF_000001365.2
ALT 7: GCA_000001375.1 / GCF_000001375.1
ALT 8: GCA_000001385.1 / GCF_000001385.1
ALT 9: GCA_000001395.1 / GCF_000001395.1
Patches: GCA_000005045.5 / GCF_000005045.4
Non-nuclear assembly unit (e.g. MT): GCA_000006015.1 / GCF_000006015.1

42 GenBank vs RefSeq
GenBank: submitter owned; redundancy; updated rarely; INSDC. BRCA1: 83 genomic records, 31 mRNA records, 27 protein records.
RefSeq: RefSeq owned; non-redundant; curated; not INSDC. BRCA1: 3 genomic records, 5 mRNA records, 1 RNA record, 5 protein records.

43 Sequence Alignments Basics

44 Hypothesis

45 The biological basis of sequence alignment is evolution Sequences that share a common ancestor are homologous – Sequence similarity is evidence of homology – Sequences, genes, etc. are homologous or not, there is no “percent homology”

46 Homology. Orthologous sequences – Common ancestor; speciation. Paralogous sequences – Gene duplication within a species (lineage-specific expansion). http://www.nature.com/nrd/journal/v2/n8/box/nrd1152_BX2.html

47 Alignment to NR -> Homology Alignment to an Assembly -> Mapping

48

49 Global and local alignments
Optimal global alignment (Needleman-Wunsch): sequences align essentially from end to end
Optimal local alignment (Smith-Waterman): sequences align only in small, isolated regions
References: Needleman and Wunsch (1970). J. Mol. Biol. 48, 443-453. Smith and Waterman (1981). J. Mol. Biol. 147, 195-197.
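The difference between the two approaches can be seen in a small dynamic-programming sketch that returns only the optimal score. The scoring values (+1 match, -1 mismatch, -1 per gap position) and the function name are assumptions made for illustration; practical tools use substitution matrices and affine gap penalties.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Optimal alignment score: Needleman-Wunsch (global) or Smith-Waterman (local)."""
    # First row: a global alignment pays for leading gaps, a local one does not.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diagonal, prev[j] + gap, curr[j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never go below 0.
                score = max(score, 0)
                best = max(best, score)
            curr.append(score)
        prev = curr
    # Global: score of aligning both sequences end to end. Local: best cell anywhere.
    return best if local else prev[-1]
```

The only differences are the zero floor, the zero-initialised borders, and where the final score is read from; those three changes are what let a local alignment ignore unrelated flanking sequence.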

50

51 http://en.wikipedia.org/wiki/Sequence_alignment

52 Hashing methods. Query sequence: MVRRLPERTSTPACE. Word size = 3 (configurable). Words: MVR VRR RRL RLP LPE PER ERT RTS TST STP TPA PAC ACE. References: Wilbur & Lipman (1983), PNAS 80, 726-730. Lipman & Pearson (1985), Science 227, 1435-1441. Pearson & Lipman (1988), PNAS 85, 2444-2448.
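The word decomposition on this slide can be reproduced in a few lines. The sketch below splits a query into overlapping words and builds a lookup table from each word to its positions, which is the basic idea behind hashing methods; the function names are illustrative and do not correspond to any particular tool.

```python
from collections import defaultdict


def words(sequence: str, word_size: int = 3):
    """Return all overlapping words of length `word_size`, in order."""
    return [sequence[i:i + word_size] for i in range(len(sequence) - word_size + 1)]


def build_word_index(sequence: str, word_size: int = 3):
    """Map each word to the list of 0-based positions where it starts."""
    index = defaultdict(list)
    for position, word in enumerate(words(sequence, word_size)):
        index[word].append(position)
    return dict(index)
```

Looking up each word of a second sequence in this table gives candidate match positions without comparing every pair of positions directly.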

53

54

55

56 http://wwwdev.ebi.ac.uk/fg/hts_mappers/ Fonseca et al., 2012

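57 Sensitivity vs. Specificity
Sensitivity = fraction of actual positives that are identified as positive (true positives, TP)
Specificity = fraction of actual negatives that are identified as negative (true negatives, TN)
Confusion matrix (rows: actual; columns: predicted):
Actual positives: TP (predicted positive), FN (predicted negative)
Actual negatives: FP (predicted positive), TN (predicted negative)
Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP)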
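The two formulas translate directly into code. The counts in the usage note are made-up numbers for illustration, not results from any aligner benchmark.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of actual positives that were predicted positive: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of actual negatives that were predicted negative: TN / (TN + FP)."""
    return tn / (tn + fp)
```

For example, with 90 true positives and 10 false negatives, `sensitivity(90, 10)` gives 0.9; with 80 true negatives and 20 false positives, `specificity(80, 20)` gives 0.8.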

58 Aligner technology specific? Gapped vs. ungapped alignments? Spliced alignments (cDNAs/RNA-Seq) Can use paired-end data?

59 Ruffalo et al., 2012

60 Li and Homer, 2010

61 Indels have correct and consistent alignment in reads after multiple sequence local realignment. DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. Phase 1: NGS data processing. Highlighted as one of the major methodological advances of the 1000 Genomes Pilot Project!

62 http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes CDC27

63

64 Richa Agarwala MHC Alternate locus Alignment to chr6

65 Mouse Ren1 chr1 (NC_000067.6): 133350674-133360320 NM_031192.3: transcript from C57BL/6J NM_031193.2: transcript from FVB/N

66

67 http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes CEPH: A=1.000 G=0 APOL1

68 YRI: A=0.5852 G=0.4148 Multiple submissions Frequency Data 1000G Suspect Sudmant et al., 2010

