(Human) Genomics BIOM/PHAR206 – 05/19/2014 Olivier Harismendy, PhD Division of Genome Information Sciences Department of Pediatrics Moores UCSD Cancer Center
UCSC Genome Browser: isPCR; BLAT; LiftOver; Track types (BED minimum, BED extended, WIG); Track Display and Shuffle; Browser Navigation; Custom Session (Export Figure); Custom Tracks
BED Track Formats
Header: space-separated parameters
– name=
– description=
– type= - Defines the track type. The track type attribute is required for BAM, BED detail, bedGraph, bigBed, bigWig, broadPeak, narrowPeak, Microarray, VCF and WIG tracks.
– visibility= - 0 - hide, 1 - dense, 2 - full, 3 - pack, and 4 - squish.
– color= - Defines the main color for the annotation track.
– itemRgb=On
– colorByStrand= - Sets colors for + and - strands, in that order.
– useScore=
– group=
– priority= - When the group attribute is set, defines the display position of the track relative to other tracks.
– db= - When set, indicates the specific genome assembly for which the annotation data is intended.
– offset= - Defines a number to be added to all coordinates in the annotation track. The default is "0".
– maxItems= - Defines the maximum number of items the track can contain.
– url= - Defines a URL for an external link associated with this track.
– htmlUrl= - Defines a URL for an HTML description page to be displayed with this track.
– bigDataUrl= - Defines a URL to the data file for BAM, bigBed, bigWig or VCF tracks.
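The header parameters above are plain key=value pairs separated by spaces, with double quotes around values that contain spaces. A minimal Python sketch of reading such a line follows; the function name `parse_track_line` and the example header are made up for illustration and are not part of any UCSC tool.

```python
import shlex

def parse_track_line(line):
    """Split a 'track' header line into a dict of attribute -> value (all strings)."""
    # shlex.split keeps quoted values such as description="Example peaks" together
    tokens = shlex.split(line)
    if not tokens or tokens[0] != "track":
        raise ValueError("not a track line")
    attrs = {}
    for token in tokens[1:]:
        key, _, value = token.partition("=")
        attrs[key] = value
    return attrs

# Hypothetical header, for illustration only
header = 'track name="myPeaks" description="Example peaks" visibility=2 color=255,0,0'
attrs = parse_track_line(header)
print(attrs["name"], attrs["visibility"], attrs["color"])
```

Values are kept as strings; converting visibility or color to numbers is left to the caller.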
BED Track Formats
For intervals
Header: space-separated configuration parameters
– chrom - The name of the chromosome.
– chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
– chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature.
– name - Defines the name of the BED line.
– score - A score between 0 and 1000.
– strand - Defines the strand - either '+' or '-'.
– thickStart - The starting position at which the feature is drawn thickly.
– thickEnd - The ending position at which the feature is drawn thickly.
– itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0).
– blockCount - The number of blocks (exons) in the BED line.
– blockSizes - A comma-separated list of the block sizes.
– blockStarts - A comma-separated list of block starts.
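Because chromStart is 0-based and chromEnd is excluded, a feature's length is simply chromEnd minus chromStart. The sketch below illustrates this on the first six BED fields; `BedRecord`, `parse_bed6` and the example line are hypothetical names and data chosen for the illustration.

```python
from collections import namedtuple

BedRecord = namedtuple("BedRecord", "chrom start end name score strand")

def parse_bed6(line):
    """Parse the first six BED fields; only chrom, start and end are mandatory."""
    fields = line.split()
    name = fields[3] if len(fields) > 3 else "."
    score = int(fields[4]) if len(fields) > 4 else 0
    strand = fields[5] if len(fields) > 5 else "."
    return BedRecord(fields[0], int(fields[1]), int(fields[2]), name, score, strand)

def feature_length(rec):
    # 0-based start, excluded end: no "+ 1" needed
    return rec.end - rec.start

def to_one_based(rec):
    # First and last base of the feature in 1-based, inclusive coordinates
    return rec.start + 1, rec.end

# Hypothetical BED line, for illustration only
rec = parse_bed6("chr19 49304700 49304850 peak1 500 +")
print(feature_length(rec), to_one_based(rec))
```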
WIG track format
#150 base wide bar graph at arbitrarily spaced positions,
#threshold line drawn at y=11.76
#autoScale off viewing range set to [0:25]
#priority = 10 positions this as the first graph
#Note, one-relative coordinate system in use for this format
track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10
variableStep chrom=chr19 span=150
49304701 10.0
49304901 12.5
49305401 15.0
49305601 17.5
49305901 20.0
49306081 17.5
49306301 15.0
49306691 12.5
49307871 10.0
#200 base wide points graph at every 300 bases, 50 pixel high graph
#autoScale off and viewing range set to [0:1000]
#priority = 20 positions this as the second graph
#Note, one-relative coordinate system in use for this format
track type=wiggle_0 name="fixedStep" description="fixedStep format" visibility=full autoScale=off viewLimits=0:1000 color=0,200,100 maxHeightPixels=100:50:20 graphType=points priority=20
fixedStep chrom=chr19 start=49307401 step=300 span=200
1000
900
800
700
600
500
400
300
200
100
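WIG positions are one-relative while BED intervals are zero-based, so converting between the two requires subtracting 1 from each start. The following Python sketch expands variableStep and fixedStep data into BED-style intervals; the function name `wig_to_intervals` is an assumption made for this example, and only the declaration options shown on the slide (chrom, span, start, step) are handled.

```python
def wig_to_intervals(lines):
    """Expand WIG data lines into (chrom, start0, end0, value) with 0-based, end-excluded coordinates."""
    records = []
    mode = chrom = None
    span = step = 1
    pos = 0
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith("track"):
            continue  # skip comments and the track header
        if line.startswith(("variableStep", "fixedStep")):
            parts = line.split()
            mode = parts[0]
            opts = dict(p.split("=", 1) for p in parts[1:])
            chrom = opts["chrom"]
            span = int(opts.get("span", 1))
            if mode == "fixedStep":
                pos = int(opts["start"])
                step = int(opts.get("step", 1))
            continue
        if mode == "variableStep":
            pos_text, value_text = line.split()
            start, value = int(pos_text), float(value_text)
        elif mode == "fixedStep":
            start, value = pos, float(line)
            pos += step
        else:
            raise ValueError("data line before a variableStep/fixedStep declaration")
        # one-relative WIG start -> zero-based BED start
        records.append((chrom, start - 1, start - 1 + span, value))
    return records

example = [
    "variableStep chrom=chr19 span=150",
    "49304701 10.0",
    "fixedStep chrom=chr19 start=49307401 step=300 span=200",
    "1000",
    "900",
]
intervals = wig_to_intervals(example)
print(intervals)
```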
Specific Tracks of interest: UCSC genes; RefSeq Genes; RepeatMasker; Conservation; TF motif predictions; dbSNP; ENCODE; Roadmap
Custom Sessions: Create an account; Customize the tracks displayed; Add your own track (limited in size and time); Save and Share
Table Browser: Subset (gene, region, genome); Output (BED or FASTA); Intersection; Filters
ENCODE / Roadmap Tracks: Track search; Cell Types / Tissue Types; Raw; Peaks; HMM
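UNIX commands
head
more (press q to exit)
cat
– Example: cat file
– Example: cat file1 file2
grep
– grep -v 'expression' (print the lines that do not match)
– grep -A 1 'expression' (also print 1 line after each match)
– grep -B 2 'expression' (also print 2 lines before each match)
– Example: grep -v '#' file.txt to remove comments
Expression metacharacters
– $ end of line
– ^ beginning of line
– [AB] A or B
– . any single character
– * zero or more repeats of the preceding character
– Example: 'CDKN.*' or 'chr[1-7]'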
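The same metacharacters can be tried out in Python's `re` module, which behaves like grep for the simple patterns used here (the two regex dialects differ in other details). This is an illustrative sketch: the `grep` helper and the input lines are made up for the example.

```python
import re

# Hypothetical input lines (tab-separated chromosome and gene name)
lines = ["# header comment", "chr1\tCDKN2C", "chr9\tCDKN2A", "chr17\tTP53", "chr7\tEGFR"]

def grep(pattern, lines, invert=False):
    """Return lines containing a match; invert=True mimics grep -v."""
    regex = re.compile(pattern)
    return [ln for ln in lines if bool(regex.search(ln)) != invert]

no_comments = grep(r"^#", lines, invert=True)   # drop lines starting with '#'
cdkn_genes = grep(r"CDKN", lines)               # any line containing CDKN
low_chroms = grep(r"^chr[1-7]\t", lines)        # chr1..chr7 only
ends_with_53 = grep(r"53$", lines)              # '$' anchors the end of the line

print(no_comments, cdkn_genes, low_chroms, ends_with_53, sep="\n")
```

Note that 'chr[1-7]' on its own would also match chr17, because the pattern only requires one digit after "chr"; the trailing tab in the sketch prevents that.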
DNA variants (Sequence differences) Highly Similar Genomes Phenotypic Differences (Physical traits) Human Genetic Variation
Variant Types Frazer et al. 2009 Rahim, Harismendy et al (2008)
Variants from an individual genome: Within any given individual there are ~4 million genetic variants encompassing ~12 Mb.
Variants from multiple genomes: Within a given individual the majority of variants are common.
Next Generation DNA analysis
Whole genome sequencing
– Mutations (coding and non-coding)
– Translocations
– Copy Number Variants
Whole Exome Sequencing
– Mutations (coding)
– ~Copy number variants (trisomy, gene amplifications)
Gene Panel
– Mutations (coding)
Variant Frequencies
Common genetic variants – second allele present at greater than 3% frequency
Rare genetic variants – present at less than 3% frequency, and commonly at very low frequencies
Private variants – in limited families or single individuals
Map of Genetic Variation
Relationships between common SNPs in the human genome
Frazer et al (2007)
HapMap Project: genotyped ~3.1 million SNPs in 270 individuals
– 90 Yoruba in Ibadan, Nigeria (YRI)
– 90 European descent in Utah, USA (CEU)
– 45 Han Chinese in Beijing, China (CHB)
– 45 Japanese in Tokyo, Japan (JPT)
VCF format
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=
##phasing=partial
##INFO=
##FILTER=
##FORMAT=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
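To make the column layout concrete, here is a minimal Python sketch that splits one VCF data line into its fixed fields, the INFO key/value pairs and the per-sample genotype fields. The function names are invented for this example; real VCF files are tab-delimited, and a dedicated parser is preferable for production use.

```python
def parse_vcf_record(line, sample_names):
    """Parse one VCF data line into a dict; sample_names come from the #CHROM header line."""
    fields = line.split()
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = fields[:9]
    info_dict = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        info_dict[key] = value if sep else True  # flags such as DB carry no value
    keys = fmt.split(":")
    samples = {
        name: dict(zip(keys, text.split(":")))
        for name, text in zip(sample_names, fields[9:])
    }
    return {
        "chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
        "alt": alt.split(","), "qual": None if qual == "." else float(qual),
        "filter": flt, "info": info_dict, "samples": samples,
    }

def alt_allele_count(record):
    """Count non-reference alleles across all sample genotypes (GT field)."""
    count = 0
    for sample in record["samples"].values():
        alleles = sample["GT"].replace("|", "/").split("/")
        count += sum(1 for a in alleles if a not in ("0", "."))
    return count

# First data line of the example above
record = parse_vcf_record(
    "20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
    "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.",
    ["NA00001", "NA00002", "NA00003"],
)
print(record["info"]["AF"], alt_allele_count(record))
```

In the genotype field, "|" marks phased and "/" unphased alleles; three non-reference alleles out of six agree with AF=0.5 in the INFO column.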
Linkage Disequilibrium (LD)
Given two biallelic sites, four combinations can be observed, with the following distributions.
SNP 1 = A/G, SNP 2 = A/C
SNP1-SNP2   Case r² = 1   Case r² = 0
AA          70            25
AC          0             25
GA          0             25
GC          30            25
LD measures the level of correlation between SNPs
LD is the consequence of recombination at preferential sites
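The r² statistic can be computed directly from the four haplotype counts: with D = p(AA) - p(A at SNP1) × p(A at SNP2), r² = D² / [p1(1 - p1) p2(1 - p2)]. The Python sketch below applies this standard definition to the two cases on the slide, assuming that in the r² = 0 case all four haplotypes are equally frequent (25 each); the function name is chosen for the illustration.

```python
def r_squared(n_aa, n_ac, n_ga, n_gc):
    """r² between two biallelic SNPs from haplotype counts (SNP1 allele, then SNP2 allele)."""
    total = n_aa + n_ac + n_ga + n_gc
    p_hap = n_aa / total               # frequency of the A-A haplotype
    p1 = (n_aa + n_ac) / total         # frequency of allele A at SNP 1
    p2 = (n_aa + n_ga) / total         # frequency of allele A at SNP 2
    denom = p1 * (1 - p1) * p2 * (1 - p2)
    if denom == 0:
        raise ValueError("r² is undefined when a SNP is monomorphic")
    d = p_hap - p1 * p2                # disequilibrium coefficient D
    return d * d / denom

perfect_ld = r_squared(70, 0, 0, 30)    # only AA and GC haplotypes observed
no_ld = r_squared(25, 25, 25, 25)       # assumed equal counts: alleles are independent
print(round(perfect_ld, 6), no_ld)
```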
LD Bin structure example
LD bin = groups of SNPs with r² ≥ 0.8
The majority of common SNPs are in LD bins in the human genome
Genotypes of a set of ~500,000 “tag SNPs” provide information (r² ≥ 0.8) regarding a large fraction (90%) of all 8 million common SNPs present in humans.
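One simple way to picture how tag SNPs arise is a greedy grouping on pairwise r² values. The slide does not say which binning algorithm was used, so the Python sketch below is only an illustrative assumption: it repeatedly picks the SNP in LD (r² ≥ 0.8) with the most remaining SNPs and makes it the tag of a new bin. SNP names and r² values are invented.

```python
def greedy_ld_bins(snps, r2, threshold=0.8):
    """Group SNPs into bins; r2 maps frozenset({a, b}) -> r², missing pairs count as 0."""
    def linked(a, b):
        return a == b or r2.get(frozenset((a, b)), 0.0) >= threshold

    remaining = list(snps)
    bins = []
    while remaining:
        # the tag is the SNP that captures the most remaining SNPs (first one wins ties)
        tag = max(remaining, key=lambda s: sum(linked(s, t) for t in remaining))
        members = [t for t in remaining if linked(tag, t)]
        bins.append((tag, members))
        remaining = [t for t in remaining if t not in members]
    return bins

# Hypothetical pairwise r² values
r2 = {
    frozenset(("snp1", "snp2")): 0.90,
    frozenset(("snp1", "snp3")): 0.85,
    frozenset(("snp2", "snp3")): 0.95,
    frozenset(("snp4", "snp5")): 0.30,
}
bins = greedy_ld_bins(["snp1", "snp2", "snp3", "snp4", "snp5"], r2)
print(bins)
```

In this sketch every bin member is in LD with its tag, but members are not guaranteed to be in LD with each other; five SNPs are summarised by three tags.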
GWAS principle Tests if common SNPs tagging an interval in the human genome are “associated” with a disease From phenotype to genotype http://www.mpg.de
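A basic form of the association test compares allele counts between cases and controls for one SNP. The sketch below uses a chi-square test on the 2x2 allele table as one common choice (the slide does not name a specific test); the counts are invented, and the p-value uses the closed form for one degree of freedom.

```python
import math

def allelic_association(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # survival function of the chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    odds_ratio = (case_alt * control_ref) / (case_ref * control_alt)
    return chi2, p_value, odds_ratio

# Hypothetical counts: 1000 cases and 1000 controls, two alleles each
chi2, p_value, odds_ratio = allelic_association(600, 1400, 500, 1500)
print(chi2, p_value, odds_ratio)
```

With these made-up counts the allele is enriched in cases (odds ratio above 1), yet the p-value is far from the stringent level needed when testing SNPs genome-wide.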
GWAS results
WTCCC (2007); PR interval
The large number of tests requires a low p-value threshold (5×10⁻⁸)
Sample size determines which variant frequencies and effect sizes can be detected (power)
Q1 2011: 221 traits, 1319 studies, >4000 associated SNPs
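The 5×10⁻⁸ threshold is conventionally explained as a Bonferroni correction: a 0.05 significance level divided by roughly one million independent tests. The figure of one million is the usual rule of thumb and is an assumption here, as are the function names and p-values in this sketch.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests

def significant_hits(p_values, threshold):
    """Keep only the SNPs whose p-value is below the threshold."""
    return {snp: p for snp, p in p_values.items() if p < threshold}

# Assumed: about one million independent common-variant tests
threshold = bonferroni_threshold(0.05, 1_000_000)

# Hypothetical association p-values
hits = significant_hits({"snpA": 3e-9, "snpB": 4e-4, "snpC": 6e-8}, threshold)
print(threshold, hits)
```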
GWAS highlights
Many genes/loci not previously known to be involved in the diseases studied
Newly identified pathways suggest that molecular sub-phenotypes of common diseases may exist
Many common diseases have the same associated genes, suggesting similar etiologies
GWAS limitations
– Genetic
Small effect sizes: only explain a small fraction (1-25%) of the heritability
Missing heritability can be hiding in
– Rare variants with large effects
– Epistasis (Gene x Gene interactions)
– Gene x Environment interaction (overlooked in heritability studies)
– Clinical
Limited prognostic value: classic markers (family history, lifestyle) work better
Limited by ethnicity
– Functional
Proxy SNPs are not the functional ones
Genes associated by proximity: variants are mostly outside
Cell type and condition unknown
Clinical Data Collected (figure labels: Days after Dx; Patients; Decreasing intrinsic sensitivity)
Molecular Data Collected
Molecule  Method                  Measured entity                  Data
RNA       microarrays             15,000 transcripts               Expression levels
RNA       RNA-Seq                 All known and novel transcripts  Expression levels, isoform quantification, editing, novel transcripts, fusion transcripts
DNA       microarrays             100k to 1M SNP                   Copy Number Aberrations, LoH, Polymorphisms
DNA       Sanger Sequencing       30 M base pairs                  Coding Mutations
DNA       whole exome sequencing  50 M base pairs                  Coding Mutations, Copy Number Aberrations
DNA       whole genome            3 billion base pairs             Coding and Regulatory Mutations, Copy Number Aberrations, Rearrangements
DNA       Methylation Array       450,000 CpG                      Methylation levels
DNA       Methylation Array       27,000 CpG                       Methylation levels