hg19 (GRCh37) vs. hg38 (GRCh38) Human Genome Reference Comparison


1 hg19 (GRCh37) vs. hg38 (GRCh38) Human Genome Reference Comparison
Zuotian Tatum Department of Human Genetics Leiden University Medical Center

2 Timeline
GRCh37:
- First release: Feb 27, 2009
- Latest patch: Jun 28, 2013 (p13)
GRCh38:
- First release: Dec 24, 2013
- Latest patch: Oct 14, 2014 (p1)

3 Content
GRCh37.p13:
- Total bases: 3.23 Billion (2.99 Billion without N)
- N50: 46 Million
- Number of alternative loci: 9
- Non-nuclear genome: No
GRCh38.p2:
- Total bases: 3.21 Billion (3.05 Billion without N)
- N50: 67 Million
- Number of alternative loci: 261
- Non-nuclear genome: Yes
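The N50 figures above can be reproduced from a list of sequence lengths. The following is a minimal sketch of the usual N50 definition (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total bases). The slide does not say whether contig or scaffold N50 is meant, and the toy lengths are illustrative, not taken from either assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length that pushes the running sum to half the total is the N50.
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


# Toy lengths (hypothetical): total = 20, half = 10, and 6 + 5 = 11 >= 10.
toy_lengths = [2, 3, 4, 5, 6]
print(n50(toy_lengths))  # 5
```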

4 UCSC tracks for GRCh38
- UCSC RefSeq available since April 2014.
- Ensembl regulatory build available since September 2014.
- dbSNP 141 available since October 2014.
- ENCODE and FANTOM5 track hubs are still not available (Nov 2014).

5 New in GRCh38 release
Three new sequence files, in addition to the standard assembly files:
- GCA_ _GRCh38_top-level.fna.gz
- GCA_ _GRCh38_no_alt_analysis_set.fna.gz
- GCA_ _GRCh38_full_analysis_set.fna.gz
The analysis set files are created to avoid false mapping in NGS alignment pipelines.

6 GCA_ _GRCh38_top-level.fna.gz
All the top-level objects in the full assembly:
- Chromosomes
- Unlocalized scaffolds
- Unplaced scaffolds
- Alternate locus scaffolds
- Mitochondrial genome
The sequence identifiers are International Sequence Database Collaboration (INSDC) accession.versions and the definition lines are GenBank style. No sequences have been hard-masked.

7 GCA_ _GRCh38_no_alt_analysis_set.fna.gz
- Chromosomes from the GRCh38 Primary Assembly unit.
  Note: the two PAR regions on chrY have been hard-masked with Ns. The chromosome Y sequence provided therefore has the same coordinates as the GenBank sequence but is not identical to it. Similarly, duplicate copies of centromeric arrays and WGS on chromosomes 5, 14, 19, 21 & 22 have been hard-masked with Ns.
- Mitochondrial genome from the GRCh38 non-nuclear assembly unit.
- Unlocalized scaffolds from the GRCh38 Primary Assembly unit.
- Unplaced scaffolds from the GRCh38 Primary Assembly unit.
- Epstein-Barr virus (EBV) sequence.
  Note: The EBV sequence is not part of the genome assembly but is included in the analysis set as a sink for alignment of reads that are often present in sequencing samples.
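Hard-masking replaces bases with N without shifting any coordinates, which is why the masked chrY keeps GenBank coordinates while no longer being identical in sequence. A minimal sketch of the idea follows; the function name, the toy sequence, and the interval are hypothetical and do not reflect the actual PAR coordinates.

```python
def hard_mask(sequence, regions):
    """Replace each 0-based, half-open (start, end) region with Ns.

    The output has the same length as the input, so downstream
    coordinates are unaffected by the masking.
    """
    masked = list(sequence)
    for start, end in regions:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"region ({start}, {end}) is out of bounds")
        masked[start:end] = "N" * (end - start)
    return "".join(masked)


# Toy sequence and interval (hypothetical, not real PAR coordinates).
toy = "ACGTACGTAC"
print(hard_mask(toy, [(2, 5)]))  # ACNNNCGTAC
```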

8 GCA_000001405.15_GRCh38_full_analysis_set.fna.gz =
GCA_ _GRCh38_no_alt_analysis_set.fna.gz + alt-scaffolds from the GRCh38 ALT_REF_LOCI_* assembly units
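The relationship between the two analysis sets can be illustrated by splitting FASTA record names into primary and alt groups. This is a sketch only: it assumes alt scaffolds can be recognised by a name ending in `_alt`, which is an assumption about naming that should be checked against the files actually in use, and the record names in the toy FASTA are invented.

```python
import io


def read_fasta_names(handle):
    """Yield the first word of every FASTA header line."""
    for line in handle:
        if line.startswith(">"):
            yield line[1:].split()[0]


def split_alt(names, alt_suffix="_alt"):
    """Partition names into (primary, alt); the suffix is an assumed convention."""
    primary = [name for name in names if not name.endswith(alt_suffix)]
    alt = [name for name in names if name.endswith(alt_suffix)]
    return primary, alt


# Toy FASTA with invented record names.
toy_fasta = io.StringIO(
    ">chr6 toy primary\nACGT\n"
    ">chrM toy mito\nACGT\n"
    ">chrEBV toy decoy\nACGT\n"
    ">chr6_TOYv1_alt toy alt scaffold\nACGT\n"
)
primary, alt = split_alt(list(read_fasta_names(toy_fasta)))
print(primary)  # ['chr6', 'chrM', 'chrEBV']
print(alt)      # ['chr6_TOYv1_alt']
```

Under that assumption, the `primary` list corresponds to the no-alt analysis set and `primary + alt` to the full analysis set.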

9 Alt-loci add complexity to RNASeq quantification

10 Ideogram of GRCh38.p2

11 RNASeq quantification
- Fragments (reads) per kilobase per million mapped reads (FPKM/RPKM) values to quantify gene expression
- Unique mapping only: analysis tools do not distinguish allelic duplication from paralogous duplication
- Non-overlapping gene regions
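For reference, RPKM can be computed as read count × 10^9 / (gene length in bp × total mapped reads). The sketch below uses that standard formula; the function name and the example numbers are illustrative and not taken from the slides.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Illustrative numbers: 100 reads on a 2 kb gene in a 10 million read library.
print(rpkm(100, 2_000, 10_000_000))  # 5.0
```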

12 To understand the effect of alt-loci on RNASeq quantification
Compare alignment of the chromosome 6 MHC region between:
- hg19 full set with 7 alt-loci
- hg38 analysis set without alt-loci
Sequence content is largely unchanged between hg19 and hg38.

13 Mapping/alignment for RNASeq
                      hg19          hg38
mapped                14,655,299    14,704,427
mappedDiffChr         4,959         4,017
mappedPairProper      14,639,261    14,690,090
mappedPairProperPct   92.62         92.94
total                 15,805,561    15,805,561
totalSplice           5,060,829     5,078,133
unmapped              1,150,262     1,101,134
hg19: with alt loci; hg38: without alt loci
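The mapping statistics are internally consistent, which a short check makes explicit: mapped plus unmapped equals the total in both columns, and the proper-pair percentage is the proper-pair count divided by the total. The dictionary layout and function name are illustrative; the numbers are copied from the slide.

```python
TOTAL_READS = 15_805_561

stats = {
    "hg19": {"mapped": 14_655_299, "unmapped": 1_150_262, "mappedPairProper": 14_639_261},
    "hg38": {"mapped": 14_704_427, "unmapped": 1_101_134, "mappedPairProper": 14_690_090},
}


def proper_pair_pct(entry, total=TOTAL_READS):
    """Percentage of all reads that are mapped in a proper pair."""
    return 100 * entry["mappedPairProper"] / total


for build, entry in stats.items():
    # Mapped and unmapped reads should account for every read.
    assert entry["mapped"] + entry["unmapped"] == TOTAL_READS
    print(build, round(proper_pair_pct(entry), 2))
# hg19 92.62
# hg38 92.94
```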

14 Effect of alt loci in RNASeq alignments
Gene RPKM (hg38)

15 Distribution of RPKM difference

16 Major Histocompatibility complex region on chromosome 6

17 HLA-A
[Figure panels, sample D1: hg19 full set, chr6; hg19 full set, chr6_mann_hap4; hg19 full set, chr6_qbl_hap6; hg19 full set, chr6_dbb_hap3]

18 HLA-A
[Figure panels: hg19 full set, chr6 (D1, D2, D3); hg38 analysis set (D1, D2, D3)]

19 HLA-C
[Figure panels: hg19 full set (D1, D2, D3); hg38 analysis set (D1, D2, D3)]

20 HLA-DRA
[Figure panels: hg19 full set (D1, D2, D3); hg38 analysis set (D1, D2, D3)]

21 Major Histocompatibility complex region on chromosome 6
Class III

22 MHC Class III
- 700 kb stretch, 60 genes
- The most gene-dense region of the human genome
- > 14% coding
- ~ 72% transcribed
- Highly conserved
- Only a few have clearly defined and proven function

23 TNF
[Figure panels: hg19 full set, chr6 (D1.control, D1.treated); hg38 analysis set, chr6 (D1.control, D1.treated)]

24 Highly variant immune regions retiled

25 LILRA3 moved to alt-loci in hg38
[Figure labels as extracted: LILRB, LILRA, LILRA5; hg38: LILRB, LILRA5]

26 Phantom LILRA3

27 LILRA3 in hg19
[Figure labels: LILRB5, Intergenic, LILRB3, LILRA4]

28 Gene length calculation
We need gene length for calculating RPKM.
- If alignment uses alt loci: RPKM would be artificially lowered for alt loci genes.
- If alignment does not use alt loci: remove alt loci annotations from the official set.
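One plausible mechanism for the lowered RPKM, given the unique-mapping-only rule mentioned earlier in the deck, is that reads matching both the primary copy and an alt copy of a gene become multi-mapped and are discarded. The sketch below illustrates that effect with invented read hits and locus names; it is not the pipeline used for the slides.

```python
def count_unique(read_hits, gene_loci):
    """Count reads that map to exactly one location, and that location is in gene_loci."""
    return sum(
        1 for hits in read_hits
        if len(hits) == 1 and hits[0] in gene_loci
    )


# Hypothetical locus names for one gene on the primary assembly and on an alt scaffold.
PRIMARY = "chr6:geneX"
ALT = "chr6_alt:geneX"

# Four reads from geneX aligned to a reference without alt loci: all map uniquely.
hits_without_alt = [[PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY]]

# The same reads aligned to a reference with an alt copy: three now map twice.
hits_with_alt = [[PRIMARY, ALT], [PRIMARY, ALT], [PRIMARY, ALT], [PRIMARY]]

print(count_unique(hits_without_alt, {PRIMARY}))    # 4
print(count_unique(hits_with_alt, {PRIMARY, ALT}))  # 1
```

Because RPKM is proportional to the read count for a fixed gene length and library size, the drop from 4 to 1 counted reads in this toy case would lower the gene's RPKM by the same factor.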

29 Need more comprehensive approach to genome variation.
- Assembly model is neither haploid nor diploid
- Analysis tools:
  - penalize reads mapping to > 1 location
  - do not distinguish allelic duplication from paralogous duplication
- A graph structure is a natural way to represent a population-based genome assembly

30 Conclusions
- RPKM values are highly correlated between hg19 and hg38.
- The analysis set is preferred for expression analysis.
- Additional analysis may be performed to use the alt-loci separately.
- Annotations for hg38 are still lacking and need contributions from the community.
- Modeling of genome variability in the population needs to improve.

31 Questions?

