hg19 (GRCh37) vs. hg38 (GRCh38) Human Genome Reference Comparison


hg19 (GRCh37) vs. hg38 (GRCh38) Human Genome Reference Comparison Zuotian Tatum Department of Human Genetics Leiden University Medical Center

Timeline
GRCh37:
- First release: Feb 27, 2009
- Latest patch: Jun 28, 2013 (p13)
GRCh38:
- First release: Dec 24, 2013
- Latest patch: Oct 14, 2014 (p1)
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/data/

Content
GRCh37.p13:
- Total bases: 3.23 Billion (2.99 Billion without N)
- N50: 46 Million
- Number of alternative loci: 9
- Non-nuclear genome: No
GRCh38.p2:
- Total bases: 3.21 Billion (3.05 Billion without N)
- N50: 67 Million
- Number of alternative loci: 261
- Non-nuclear genome: Yes
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/data/
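The N50 figures above summarize assembly contiguity. As a rough illustration of how such a value is derived, here is a minimal Python sketch. The function name `n50` and the toy lengths are assumptions made for this example; the slide does not say whether its N50 refers to contigs or scaffolds, so the sketch simply takes a list of sequence lengths.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total number of bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy lengths, made up for illustration (not taken from either assembly).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))  # 70: the two longest sequences cover 150 of 300 bases
```

A higher N50, as reported for GRCh38.p2, means that half of the assembled bases sit in longer continuous sequences.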

UCSC tracks for GRCh38 UCSC RefSeq available since April 2014. Ensembl regulatory build available since September 2014. dbSNP 141 available since October 2014. ENCODE and FANTOM5 track hubs are still not available (Nov 2014).

New in GRCh38 release Three new sequence files, in addition to the standard assembly files: - GCA_000001405.15_GRCh38_top-level.fna.gz - GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz - GCA_000001405.15_GRCh38_full_analysis_set.fna.gz The analysis set files are created to avoid false mapping in NGS alignment pipelines.

GCA_000001405.15_GRCh38_top-level.fna.gz
All the top-level objects in the full assembly:
- Chromosomes
- Unlocalized scaffolds
- Unplaced scaffolds
- Alternate locus scaffolds
- Mitochondrial genome
The sequence identifiers are International Sequence Database Collaboration (INSDC) accession.versions and the definition lines are GenBank style. No sequences have been hard-masked.

GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
- Chromosomes from the GRCh38 Primary Assembly unit.
  Note: the two PAR regions on chrY have been hard-masked with Ns. The chromosome Y sequence provided therefore has the same coordinates as the GenBank sequence but it is not identical to the GenBank sequence. Similarly, duplicate copies of centromeric arrays and WGS on chromosomes 5, 14, 19, 21 & 22 have been hard-masked with Ns.
- Mitochondrial genome from the GRCh38 non-nuclear assembly unit.
- Unlocalized scaffolds from the GRCh38 Primary Assembly unit.
- Unplaced scaffolds from the GRCh38 Primary Assembly unit.
- Epstein-Barr virus (EBV) sequence.
  Note: the EBV sequence is not part of the genome assembly but is included in the analysis set as a sink for alignment of reads that are often present in sequencing samples.
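Hard-masking keeps coordinates intact because each masked base is replaced by a single N. The sketch below illustrates the idea on a toy string. The function name `hard_mask` and the example coordinates are assumptions for illustration only; they are not the real PAR coordinates and this is not the procedure used to build the analysis set.

```python
def hard_mask(sequence, regions):
    """Replace every base in the given regions with 'N'.

    Regions are 0-based, half-open (start, end) pairs. The output has the
    same length as the input, so downstream coordinates are unchanged.
    """
    masked = list(sequence)
    for start, end in regions:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"region ({start}, {end}) is out of bounds")
        masked[start:end] = "N" * (end - start)
    return "".join(masked)


# Toy sequence and a made-up region, for illustration only.
print(hard_mask("ACGTACGTAC", [(2, 5)]))  # ACNNNCGTAC
```

Because the masked copy no longer matches any read, an aligner can place reads only on the unmasked copy, which is the point of masking the duplicated regions.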

GCA_000001405.15_GRCh38_full_analysis_set.fna.gz = GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz + alt-scaffolds from the GRCh38 ALT_REF_LOCI_* assembly units

Alt-loci add complexity to RNASeq quantification

Ideogram of GRCh38.p2

RNASeq quantification
- Fragments (reads) per kilobase per million mapped reads (FPKM/RPKM) values are used to quantify gene expression
- Unique mapping only: analysis tools do not distinguish allelic duplication from paralogous duplication
- Non-overlapping gene regions
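For reference, RPKM is commonly computed as the read count for a gene, scaled by gene length in kilobases and by the total number of mapped reads in millions. The sketch below uses that standard definition; the function name `rpkm` and the example numbers are assumptions for illustration and do not come from the slides.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads.

    RPKM = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Made-up example: 500 reads on a 2,000 bp gene, 10 million mapped reads.
print(rpkm(500, 2_000, 10_000_000))  # 25.0
```

Both the read count and the gene length enter the formula, so anything that changes how reads are assigned to a gene, or which length is used for it, changes the reported value.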

To understand the effect of alt-loci on RNASeq quantification
Compare alignment of the chromosome 6 MHC region between
- hg19 full set with 7 alt-loci
- hg38 analysis set without alt-loci
Sequence content is largely unchanged between hg19 and hg38.

Mapping/alignment for RNASeq (hg19: with alt loci; hg38: without alt loci)

                       hg19          hg38
mapped                 14,655,299    14,704,427
mappedDiffChr          4,959         4,017
mappedPairProper       14,639,261    14,690,090
mappedPairProperPct    92.62         92.94
total                  15,805,561    15,805,561
totalSplice            5,060,829     5,078,133
unmapped               1,150,262     1,101,134
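The counts on this slide can be checked against each other: mapped plus unmapped reads should give the total, and mappedPairProperPct should equal mappedPairProper divided by the total. The short Python sketch below does that arithmetic with the numbers from the slide. The dictionary layout and the helper name `summarize` are assumptions for this example.

```python
# Counts copied from the slide; the layout is only for this sketch.
stats = {
    "hg19": {"mapped": 14_655_299, "unmapped": 1_150_262,
             "mappedPairProper": 14_639_261},
    "hg38": {"mapped": 14_704_427, "unmapped": 1_101_134,
             "mappedPairProper": 14_690_090},
}


def summarize(counts):
    """Return (total reads, percent properly paired) for one alignment."""
    total = counts["mapped"] + counts["unmapped"]
    pct = round(100 * counts["mappedPairProper"] / total, 2)
    return total, pct


for build, counts in stats.items():
    total, pct = summarize(counts)
    print(build, total, pct)
# hg19 15805561 92.62
# hg38 15805561 92.94
```

Both alignments start from the same 15,805,561 reads, and about 49,000 more reads map to the hg38 analysis set than to the hg19 full set.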

Effect of alt loci in RNASeq alignments Gene RPKM (hg38)

Distribution of RPKM difference

Major Histocompatibility complex region on chromosome 6

HLA-A: hg19 full set, D1 on chr6, chr6_mann_hap4, chr6_qb1_hap6 and chr6_dbb_hap3

HLA-A: hg19 full set, chr6 (D1, D2, D3) vs. hg38 analysis set (D1, D2, D3)

HLA-C: hg19 full set (D1, D2, D3) vs. hg38 analysis set (D1, D2, D3)

HLA-DRA: hg19 full set (D1, D2, D3) vs. hg38 analysis set (D1, D2, D3)

Major Histocompatibility complex region on chromosome 6 Class III

MHC Class III
- 700 kb stretch, 60 genes
- The most gene-dense region of the human genome
- > 14% coding
- ~ 72% transcribed
- Highly conserved
- Only a few genes have a clearly defined and proven function

TNF hg19 full set – chr6 D1.control D1.treated hg38 analysis set – chr6 D1.control D1.treated

Highly variant immune regions retiled

LILRA3 moved to alt-loci in hg38
hg19: LILRB2, LILRA3, LILRA5
hg38: LILRB2, LILRA5

Phantom LILRA3

LILRA3 in hg19 LILRB5 Intergenic LILRB3 LILRA4

Gene length calculation
We need gene length for calculating RPKM.
- If alignment uses alt loci: RPKM would be artificially lowered for alt loci genes.
- If alignment does not use alt loci: remove alt loci annotations from the official set.
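One way to remove alt loci annotations is to filter the annotation file by sequence name before counting. The sketch below is a minimal Python example under two assumptions: the annotation is tab-separated GTF text with the sequence name in the first column, and alt scaffolds can be recognized by a `_alt` suffix, as in UCSC-style hg38 naming. The function name `drop_alt_annotations` and the toy records are hypothetical.

```python
def drop_alt_annotations(gtf_lines, alt_suffix="_alt"):
    """Keep GTF records whose sequence name does not end with alt_suffix.

    Comment lines (starting with '#') are kept unchanged.
    """
    kept = []
    for line in gtf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        seqname = line.split("\t", 1)[0]
        if not seqname.endswith(alt_suffix):
            kept.append(line)
    return kept


# Toy records with hypothetical sequence and gene names.
toy_gtf = [
    "#toy annotation",
    'chrN\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "geneA";',
    'chrN_EXAMPLE_alt\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "geneA";',
]
for record in drop_alt_annotations(toy_gtf):
    print(record)
```

With the alt records removed, each gene is counted and measured only on the primary assembly, which matches an alignment made against the analysis set without alt loci.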

Need a more comprehensive approach to genome variation.
- Assembly model is neither haploid nor diploid
- Analysis tools penalize reads mapping to > 1 location and do not distinguish allelic duplication from paralogous duplication
- A graph structure is a natural way to represent a population-based genome assembly

Conclusions
- RPKM values are highly correlated between hg19 and hg38.
- The analysis set is preferred for expression analysis.
- Additional analysis may be performed to use the alt-loci separately.
- Annotations for hg38 are still lacking and need contributions from the community.
- Modeling of genome variability in the population needs to improve.

Questions?