Variation and Functional Genomics
2 of 51 Overview of Talk SNPs and InDels Larger structural variants (CNVs) Phenotype data Individual genomes HapMap variations and genotypes Locus Specific Databases LRGs
3 of 51 Genomic Diversity
  SNPs (Single Nucleotide Polymorphisms): base pair substitutions
  InDels: insertions/deletions (frameshifts)
  Occur about once in every 300 bp (human)
  ~3 billion base pairs in mammalian genomes!
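As a rough back-of-the-envelope illustration of the two figures on this slide, the sketch below divides the genome size by the quoted variant spacing. It assumes variants are spread evenly along the genome, which is a simplification; the function name is made up for this example.

```python
# Rough estimate only: assumes variant sites are evenly spaced at the quoted rate.
GENOME_SIZE_BP = 3_000_000_000  # ~3 billion base pairs (figure from the slide)
BP_PER_VARIANT = 300            # about one SNP/InDel per 300 bp in human (from the slide)


def expected_variant_sites(genome_size_bp: int, bp_per_variant: int) -> int:
    """Return the number of variant sites expected at a uniform density."""
    return genome_size_bp // bp_per_variant


estimated_sites = expected_variant_sites(GENOME_SIZE_BP, BP_PER_VARIANT)
print(f"{estimated_sites:,} expected variant sites")  # 10,000,000 expected variant sites
```

In other words, the quoted density corresponds to roughly ten million variant sites in a genome of this size.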
4 of 51 Functional Consequences
  Type: Consequence
  SNPs in coding areas that alter aa sequence: Cause of most monogenic disorders, e.g. Cystic fibrosis (CFTR), Hemophilia (F8)
  SNPs in coding areas that don’t alter aa sequence: May affect splicing
  SNPs in promoter or regulatory regions: May affect the level, location or timing of gene expression
  SNPs in other regions: No direct known impact on phenotype; useful as markers
5 of 51 Sequence Polymorphisms: Effects
  Cause disease (a SNP in clotting factor IX that creates a stop codon: haemophilia)
  Increase disease risk (a SNP in the LDL receptor that reduces its efficiency: high cholesterol)
  Affect drug response (2 million hospitalized patients suffer serious adverse drug reactions, of which more than 100,000 are fatal*)
9 of 51 Small Scale Sequence Variants
  Most SNPs and InDels are imported from dbSNP (rs……):
    Imported data: alleles, flanking sequences, population frequencies
    Calculated data: position, transcript effect
  For human also:
    HGMD (Human Gene Mutation Database)
    HGVS (Human Genome Variation Society)
    Affymetrix and Illumina variations
    Ensembl-called SNPs (from aligned individual genomes)
  For mouse, rat, dog and chicken also:
    Sanger- and Ensembl-called SNPs (other strains/breeds)
10 of 51 SNPs and InDels in Ensembl
  Non-synonymous: In coding sequence, resulting in an aa change
  Synonymous: In coding sequence, not resulting in an aa change
  Frameshift: In coding sequence, resulting in a frameshift
  Stop lost: In coding sequence, resulting in the loss of a stop codon
  Stop gained: In coding sequence, resulting in the gain of a stop codon
  Essential splice site: In the first 2 or the last 2 base pairs of an intron
  Splice site: 1-3 bp into an exon or 3-8 bp into an intron
  Upstream: Within 5 kb upstream of the 5'-end of a transcript
  Regulatory region: In regulatory region annotated by Ensembl
  5' UTR: In 5' UTR
  Intronic: In intron
  3' UTR: In 3' UTR
  Downstream: Within 5 kb downstream of the 3'-end of a transcript
  Intergenic: More than 5 kb away from a transcript
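The position-based rows of the consequence table can be sketched as a small classifier. This is an illustration only, not Ensembl's implementation: it assumes a forward-strand transcript, 1-based inclusive coordinates, and a hypothetical function name and argument layout. It does not attempt the coding, UTR or regulatory categories, which need CDS and regulatory annotation.

```python
FLANK_BP = 5_000  # "within 5 kb" threshold from the table


def classify_by_position(pos, tx_start, tx_end, introns):
    """Classify a variant position relative to one forward-strand transcript.

    introns is a list of (start, end) tuples, 1-based and inclusive.
    Only the position-based categories from the table are handled.
    """
    if pos < tx_start:
        return "Upstream" if tx_start - pos <= FLANK_BP else "Intergenic"
    if pos > tx_end:
        return "Downstream" if pos - tx_end <= FLANK_BP else "Intergenic"

    # Inside an intron: measure how deep the position is from the nearer end.
    for i_start, i_end in introns:
        if i_start <= pos <= i_end:
            depth = min(pos - i_start + 1, i_end - pos + 1)
            if depth <= 2:
                return "Essential splice site"
            if depth <= 8:
                return "Splice site"  # 3-8 bp into the intron
            return "Intronic"

    # Inside an exon: the 3 bp next to any intron boundary count as splice site.
    for i_start, i_end in introns:
        if i_start - 3 <= pos < i_start or i_end < pos <= i_end + 3:
            return "Splice site"

    return "Exonic (coding or UTR; needs CDS annotation to refine)"


# Hypothetical transcript spanning 10,000-20,000 with one intron.
print(classify_by_position(12_001, 10_000, 20_000, [(12_000, 13_000)]))
```

A reverse-strand transcript would swap the upstream and downstream labels, which this sketch leaves out for brevity.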
11 of 51 Small Scale Sequence Variants Ensembl Region in Detail View Colour-coded SNPs and InDels Legend
12 of 51 Polymorphisms in Ensembl Chicken Chimp Cow Dog Human Mouse Rat Platypus Tetraodon Zebrafish Plants (Rice, Arabidopsis, Grapevine, Brachypodium) Yeast Fly Mosquito Plasmodium falciparum
13 of 51 CNV in human Structural variants track
14 of 51 Phenotype Data Genome-wide association data 159 annotations from EGA from NHGRI
15 of 51
16 of 51 Somatic Variations: COSMIC
17 of 51 Population Data in Ensembl
18 of 51 Population Data Variation tab: Population genetics
19 of 51 Variation Tab Flanking sequence Population genetics and LD plots Disease relationships (human) EGA, GWAS, HapMap, Clinical/LSDB Ancestral alleles
20 of 51 Variation Views
  View variations drawn on the sequence: Gene tab: Sequence link; Transcript tab: Exons, cDNA, protein links
  View a table of variations for each transcript: Gene tab: Variation Table
  View variations drawn along a transcript: Gene tab: Variation Image
21 of 51 Comparison Views Human, Mouse, Rat, Dog and Cow have individual or strain comparisons: Comparison Image link at the left of the Transcript tab.
22 of 51 SNP Effect Calculator Click on Manage your data at the left of any page. Follow the link to “SNP Effect Predictor”. Paste in variation positions and alleles
23 of 51 SNP Effect Calculator Location, variation name in Ensembl, and consequence on amino acid sequence are returned.
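The "paste in variation positions and alleles" step can be prepared programmatically. The sketch below assumes a whitespace-separated layout of chromosome, start, end, alleles (reference/alternative) and strand; that layout, the `Variant` type and the coordinates are assumptions for illustration, so the format the tool actually accepts should be checked on its input page.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """One variant to submit; field names are illustrative, not an Ensembl type."""
    chrom: str
    start: int
    end: int
    ref: str
    alt: str
    strand: str = "+"


def to_input_line(v: Variant) -> str:
    """Format a variant as 'chrom start end ref/alt strand' (assumed layout)."""
    return f"{v.chrom} {v.start} {v.end} {v.ref}/{v.alt} {v.strand}"


# Hypothetical positions and alleles, for illustration only.
variants = [
    Variant("5", 140532, 140532, "T", "C"),
    Variant("12", 1017956, 1017956, "T", "A"),
]
lines = [to_input_line(v) for v in variants]
print("\n".join(lines))
```

The printed lines can then be pasted into the form in one go.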
24 of 51 Ensembl Variation SNPs and InDels Larger structural variants (CNVs) Phenotype data Individual genomes (human) HapMap variations and genotypes Locus Specific Databases LRGs
25 of 51 Sequencing Individuals Venter and Watson genomes 1000 genomes project HapMap
26 of 51 First diploid genomes for human
  “The Diploid Genome Sequence of an Individual Human” PLoS Biology 5: (2007)
  “The Complete Genome of an Individual by Massively Parallel DNA Sequencing” Nature 452: (2008)
  “Accurate Whole Human Genome Sequencing Using Reversible Terminator Chemistry” Nature 456:53-59 (2008)
  “The Diploid Genome Sequence of an Asian Individual” Nature 456:60-65 (2008)
  Craig Venter: Sequence & analysis ongoing since 2003
  Jim Watson: 454 technology (7.4x), 100 million unpaired reads (25 billion bp), $1,000,000
27 of 51 Reference Sequence
  The Human Genome Project gave the “average” DNA sequence of a small number of people.
  This helps us find out how a human develops and works
  Does not show us the DNA differences between different humans
  Does not reflect the major alleles
28 of 51 1000 Genomes Project 1000 genomes track in Region in Detail
29 of 51 HapMap A multi-country effort to identify and catalogue genetic similarities and differences in people. Collaboration among scientists and funding agencies from Japan, the United Kingdom, Canada, China, Nigeria, and the United States. All of the information generated by the project released into the public domain.
30 of 51 HapMap (phase III)
  Genotypes from 1115 individuals from 11 populations:
  ASW: African ancestry in Southwest USA (71)
  CEU: Utah residents with Northern and Western European ancestry from the CEPH collection (162)
  CHB: Han Chinese in Beijing, China (70)
  CHD: Chinese in Metropolitan Denver, Colorado (70)
  GIH: Gujarati Indians in Houston, Texas (83)
  JPT: Japanese in Tokyo, Japan (82)
  LWK: Luhya in Webuye, Kenya (83)
  MEX: Mexican ancestry in Los Angeles, California (71)
  MKK: Maasai in Kinyawa, Kenya (171)
  TSI: Toscani in Italia (77)
  YRI: Yoruba in Ibadan, Nigeria (163)
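A quick tally of the per-population counts, copied exactly as they appear on this slide, is sketched below. As listed, the counts sum to 1103, which is 12 fewer than the 1115 in the slide text; the source of the difference is not clear from the slide itself.

```python
# Sample counts per HapMap phase III population, as listed on the slide.
population_counts = {
    "ASW": 71, "CEU": 162, "CHB": 70, "CHD": 70, "GIH": 83, "JPT": 82,
    "LWK": 83, "MEX": 71, "MKK": 171, "TSI": 77, "YRI": 163,
}

STATED_TOTAL = 1115  # total quoted on the slide

listed_total = sum(population_counts.values())
largest = max(population_counts, key=population_counts.get)

print(f"{len(population_counts)} populations, {listed_total} individuals listed")
print(f"difference from stated total: {STATED_TOTAL - listed_total}")
print(f"largest sample: {largest} ({population_counts[largest]})")
```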
31 of 51 Haplotyping A haplotype is a set of SNPs (on average ~25 kb) found to be statistically associated on a single chromatid and which therefore tend to be inherited together over time. Haplotyping involves grouping subjects by haplotypes.
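Grouping subjects by haplotype can be pictured with a small sketch. It assumes each subject's phased alleles at a few linked SNPs are already available as a string; the subject names and allele strings are invented for illustration, and real haplotyping would first need phasing of genotype data.

```python
from collections import defaultdict

# Hypothetical phased alleles at three linked SNPs, one chromosome copy per subject.
subject_haplotypes = {
    "subject_1": "ACG",
    "subject_2": "ATG",
    "subject_3": "ACG",
    "subject_4": "GTA",
}


def group_by_haplotype(haplotypes):
    """Return a dict mapping each haplotype string to the subjects carrying it."""
    groups = defaultdict(list)
    for subject, haplotype in haplotypes.items():
        groups[haplotype].append(subject)
    return dict(groups)


groups = group_by_haplotype(subject_haplotypes)
for haplotype, subjects in groups.items():
    print(haplotype, subjects)
```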
32 of 51 Locus specific databases (LSDB)
  Databases that focus on one gene or one disease
    e.g. p53, ABO, collagen
    e.g. Albinism, cystic fibrosis, Alzheimer’s disease
  User communities:
    Research groups: disease and function driven
    Clinicians: driven by genetic testing of patients
33 of 51 LSDBs >1000 on the Human Genome Variation Society website
34 of 51 LSDB examples
35 of 51 Why is it difficult to merge these data?
  Historical reasons. LSDBs sometimes:
    Use sequences which do not start at Methionine
    Use transcript coordinates, not genomic
    Use a different transcript for reporting mutations
  Regularly changes with new assemblies/gene builds
  It may contain minor alleles or rare alleles
  It may be inaccurate
  Missing genes (e.g. no α-haemoglobin: thalassaemia)
  Mixture of sequences from different individuals
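The gap between transcript coordinates and genomic coordinates can be made concrete with a small conversion sketch. It assumes a forward-strand transcript whose exons are given as 1-based inclusive genomic intervals in 5' to 3' order; the function name and the exon coordinates are hypothetical.

```python
def transcript_to_genomic(tx_pos, exons):
    """Map a 1-based transcript position to a genomic position.

    exons is a list of (start, end) genomic intervals, 1-based inclusive,
    ordered 5' to 3' on the forward strand.
    """
    if tx_pos < 1:
        raise ValueError("transcript positions start at 1")
    remaining = tx_pos
    for start, end in exons:
        exon_length = end - start + 1
        if remaining <= exon_length:
            return start + remaining - 1
        remaining -= exon_length  # skip this exon and the intron after it
    raise ValueError("position lies beyond the end of the transcript")


# Hypothetical two-exon transcript: 100 bp exon, intron, 50 bp exon.
exons = [(1000, 1099), (2000, 2049)]
print(transcript_to_genomic(101, exons))  # first base of the second exon
```

The same transcript position maps to a different genomic position whenever a different transcript or assembly is used, which is one reason a stable reference sequence helps when exchanging mutation data.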
36 of 51 Ensembl and LRGs Define an exchange format for LRGs with the NCBI Create an LRG website Create a pipeline for receiving the data and creating an LRG Extend e! databases to store LRGs Develop an API to query LRGs and associated annotation Consult with the LSDBs to develop useful visualisation tools Build displays for LRG data and annotation
37 of 51 EGA: Repository for genotype data
38 of 51 Sequences Differing from the Reference Common coordinate system for reporting mutations and variation data (stable sequence) Locus Reference Genomic (LRG) Ensembl displays LRGs Project in collaboration with the NCBI and GEN2PHEN Extension of the RefSeq gene project View and Request LRGs here:
39 of 51 Locus Reference Genomic LRG = Genomic sequence for reporting mutations (containing transcript) * Often differs from the reference assembly
40 of 51 LRGs in the Browser LRG transcripts and underlying sequence can be viewed. LRG_13 All LRGs
41 of 51 Variations Team Fiona Cunningham Pontus Larsson Will McLaren Graham Ritchie
42 of 51 Functional Genomics (Wikipedia): Functional genomics is a field of molecular biology that attempts to make use of the vast wealth of data produced by genomic projects (such as genome sequencing projects) to describe gene (and protein) functions and interactions. In Ensembl: Regulatory build using ENCODE project information Promoters and Enhancers from CisRED and VISTA FlyReg features (for Drosophila)
43 of 51 ENCODE: Encyclopedia Of DNA Elements Where are the promoter, enhancer, and other regulatory regions of the human genome? Pilot project showed: Use chromatin accessibility and histone modification analysis to predict TSS 14 June 2007, Nature
44 of 51 Regulatory Build
  CTCF-binding sites
  DNase I hypersensitive sites
  TF binding sites
  These are “core features”
  Overlapping methylation sites expand these regions.
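The idea of a core feature being expanded by overlapping sites can be sketched with simple interval arithmetic. This is an illustration under stated assumptions, not the Ensembl regulatory build algorithm: intervals are (start, end) pairs with inclusive ends, only features that overlap the original core are used, and the function name and coordinates are hypothetical.

```python
def expand_core_feature(core, candidates):
    """Extend a core (start, end) interval to cover candidates that overlap it.

    Only candidates overlapping the original core are considered; features
    that merely overlap an already-extended region are ignored (an assumption).
    """
    core_start, core_end = core
    start, end = core
    for cand_start, cand_end in candidates:
        if cand_start <= core_end and cand_end >= core_start:
            start = min(start, cand_start)
            end = max(end, cand_end)
    return start, end


# Hypothetical core feature and nearby sites.
core_feature = (100, 200)
nearby_sites = [(50, 120), (190, 260), (300, 400)]
expanded = expand_core_feature(core_feature, nearby_sites)
print(expanded)  # (50, 260)
```

The third site does not touch the core, so it leaves the region unchanged.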
45 of 51 The Regulation Tab
46 of 51 How to get there?
47 of 51 The Location Tab
48 of 51 BioMart
49 of 51 There are other sets… Sequence motifs determined by experimental and prediction tools. VISTA Enhancer Set Tissue-specific enhancers. Tested experimentally. Nucleic Acids Res January; 35(Database issue): D88–D92.
50 of 51 Gene Regulation Summary
  Homo sapiens
    DNase I hypersensitivity, CTCF binding sites, TF binding sites (core features)
    Histone modification data
    MeDIP-chip methylation data for 17 human tissues and cell lines
    VISTA Enhancer Assay (http://enhancer.lbl.gov)
    cisRED motifs (www.cisred.org)
    miRanda microRNA target prediction
    Expression Quantitative Trait Loci (eQTL) from the Sanger Institute
  Mus musculus
    DNase I hypersensitivity sites (ES cells)
    Histone modifications for ES, MEF, and NPC cells
    cisRED motifs (www.cisred.org)
  Danio rerio
    ZFMODELS-enhancers
  Drosophila melanogaster
    REDfly TFBSs
    BioTIFFIN
    REDfly CRMs
51 of 51 Functional Genomics eFG Ian Dunham Nathan Johnson Daniel Sobral Andy Yates ENCODE Steven Wilder Damian Keefe