MEME homework: probability of finding GAGTCA at a given position in the yeast genome, based on a background model of A = 0.3, T = 0.3, G = 0.2, C = 0.2.

Presentation transcript:

P(GAGTCA at a given position) = (0.2)(0.3)(0.2)(0.3)(0.2)(0.3) = 2.16 x 10^-4
This is the probability that ANY given 6mer will be this sequence by chance.

How many instances within 1,000 bp upstream of 6,000 genes?
Number of 6mers per 1,000 bp: 1000 bp - 5 bp (to account for the 6mer start position) = 995 6mers per gene upstream region
6000 genes * 995 = 5.97 x 10^6 possible 6mers in total
Expected number that match your sequence: 2.16 x 10^-4 x 5.97 x 10^6 = 1290 sites

BUT … the site can also occur as the reverse complement (i.e. the site is on the other strand) = 2X possible sites (because our bg model has A = T and G = C, the reverse complement TGACTC has the same probability) = 2580 expected matches
1
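
The homework arithmetic can be checked with a short script. This is a minimal sketch: the function names (`kmer_probability`, `expected_matches`) are made up for illustration, and it assumes every position is drawn independently from the background model, as the slide does.

```python
# Background nucleotide frequencies given on the slide.
BACKGROUND = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}


def kmer_probability(kmer, background=BACKGROUND):
    """Probability of seeing `kmer` at one fixed position (independent bases)."""
    p = 1.0
    for base in kmer:
        p *= background[base]
    return p


def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))


def expected_matches(kmer, n_regions, region_length, both_strands=True):
    """Expected number of chance matches across all upstream regions."""
    windows = region_length - len(kmer) + 1   # 1000 - 5 = 995 start positions
    total = n_regions * windows               # 6000 * 995 = 5.97 x 10^6
    expected = total * kmer_probability(kmer)
    rc = reverse_complement(kmer)
    if both_strands and rc != kmer:
        # A match to the reverse complement is a site on the other strand.
        expected += total * kmer_probability(rc)
    return expected


motif = "GAGTCA"
print(kmer_probability(motif))                                  # about 2.16e-4
print(expected_matches(motif, 6000, 1000, both_strands=False))  # about 1290
print(expected_matches(motif, 6000, 1000))                      # about 2580
```

The doubling is exact here only because the background model gives A and T the same frequency, and likewise G and C; with an asymmetric background the two strands would contribute different amounts.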

An alternative approach: Phylogenetic footprinting
Rather than look at multiple, different regulatory regions from one species, look at a single regulatory region across orthologous sequences from many species. Hypothesis: functional regions of the genome will be conserved more than 'nonfunctional' regions, due to selection. Therefore, simply look for regions of sequence that are conserved above background.
2
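
One way to make "conserved above background" concrete is a sliding-window scan over a multiple alignment. The sketch below is illustrative only: the function names, the window size, the threshold and the toy alignment are all assumptions, and real footprinting tools use evolutionary models rather than a simple identity count.

```python
def column_is_conserved(column):
    """True when every sequence has the same non-gap base in this column."""
    return "-" not in column and len(set(column)) == 1


def conserved_windows(alignment, window=10, threshold=0.9):
    """Return (start, fraction) for windows whose identical-column fraction
    is at least `threshold`. `alignment` is a list of equal-length strings."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    flags = [column_is_conserved([row[i] for row in alignment])
             for i in range(length)]
    hits = []
    for start in range(length - window + 1):
        fraction = sum(flags[start:start + window]) / window
        if fraction >= threshold:
            hits.append((start, fraction))
    return hits


# Toy alignment of one region from three hypothetical species.
toy_alignment = [
    "ACGTACGTAA",
    "ACGTACGTTT",
    "ACGTACGTCC",
]
print(conserved_windows(toy_alignment, window=4, threshold=1.0))
```

In practice the threshold would be set relative to the conservation level seen in presumably nonfunctional sequence, which is what "background" means on the slide.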

Simplest case: stretches of very highly conserved sequence
Kellis et al. "Sequencing and comparison of yeast species to identify genes and regulatory elements"
Sequenced 4 closely related Saccharomyces genomes & identified conserved sequences in multiple alignments of orthologous sequences from the four species.
3

Incorporating evolutionary models can improve motif finding
[Figure: information profile, bits at each motif position]
Remember that evolution acts on functionally important base pairs … Also remember from our motif finding exercise that not all contiguous base pairs are equally important (information content).
4

Incorporating evolutionary models can improve motif finding
Remember that evolution acts on functionally important base pairs … Also remember from our motif finding exercise that not all contiguous base pairs are equally important (information content).
Moses et al. "Position specific variation in the rate of evolution in transcription factor binding sites"
Rate of evolution within a motif is inversely related to the information content: high-information positions are more conserved … important base pairs evolve slower
5
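
As a reminder of what "information content" measures, the sketch below computes bits per position from a set of aligned binding sites. It assumes the common definition with a uniform background (2 + sum of p * log2 p over the four bases), not the A/T-rich background used in the homework. The function names and the toy sites are made up for illustration.

```python
import math


def information_content(column_counts):
    """Bits of information for one motif position, uniform background.

    `column_counts` maps each base to how often it occurs at this position.
    The result ranges from 0 (all bases equally likely) to 2 (one base only).
    """
    bases = "ACGT"
    total = sum(column_counts.get(b, 0) for b in bases)
    bits = 2.0
    for b in bases:
        p = column_counts.get(b, 0) / total
        if p > 0:
            bits += p * math.log2(p)
    return bits


def information_profile(sites):
    """Per-position information content for equal-length aligned sites."""
    profile = []
    for i in range(len(sites[0])):
        counts = {}
        for site in sites:
            counts[site[i]] = counts.get(site[i], 0) + 1
        profile.append(information_content(counts))
    return profile


# Toy sites: position 0 is invariant, position 1 is split between two bases,
# and position 2 uses all four bases equally.
toy_sites = ["GAA", "GAC", "GTG", "GTT"]
print(information_profile(toy_sites))
```

Under the observation quoted from Moses et al., positions with high values in such a profile are the ones expected to change most slowly between species.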

Multiple motif finding methods now work on multiple alignments of regulatory regions of coregulated genes.
Sinha et al. "PhyME: A probabilistic algorithm for finding motifs in sets of orthologous sequences"
Moses et al. "Monkey: identification of transcription factor binding sites in multiple alignments using a binding site-specific evolutionary model"
Siddharthan et al. "PhyloGibbs: A Gibbs sampling motif finder that incorporates phylogeny"
Wang & Stormo (PhyloCon) "Combining phylogenetic data with co-regulated genes to identify regulatory motifs"
Prakash et al. (OrthoMEME) "Motif discovery in heterogeneous sequence data"
Given:
1) a group of regulatory regions of coregulated genes
2) orthologs of each region, in the form of multiple alignments
6

Keep in mind that the relevant evolutionary models are specific to what one is looking for (TF binding sites, ncRNA, etc.)
Moses et al. "Position specific variation in the rate of evolution in transcription factor binding sites"
Rate of evolution within a motif is inversely related to the information content: high-information positions are more conserved … important base pairs evolve slower
7

VISTA suite for visualizing conservation in global alignments
Frazer et al. "VISTA: computational tools for comparative genomics"
Pre-computed multiple global alignments of mammalian genomes, visualized by conservation level.
-- Uses the BLAT local alignment tool to find seeds of high sequence similarity; these seeds are then used for global single- or multiple-genome alignment
8

Which species to compare?
Balance between:
-- species closely related enough that:
   1) There's enough similar sequence to get confident pairwise alignments
   2) The sequences of interest and their corresponding functions have been conserved
-- species distantly enough related that:
   1) nonfunctional sequence has had time to diverge
10

The above approaches have focused on using similarity/conservation to identify important regions of the genome …
A large focus in genomics is understanding the differences between genome sequences and what accounts for the vast diversity of phenotypes within a population:
-- Analysis of single nucleotide polymorphisms (SNPs) within populations
-- Analysis of variation in gene expression within and between populations
-- Analysis of quantitative trait loci (QTLs) accounting for differences in gene expression
11

Connecting phenotype to genotype
-- Large variations in size, shape, health, etc. in human populations
-- Much of that variation has to do with disease susceptibility
-- A major goal of genetics (and now genomics) is understanding the consequences of genetic variation.
A major effort in genomics is to identify and annotate SNPs in human populations, and to identify those related to disease.
~2800 disease-associated genes are known, mostly from positional cloning & mapping studies.
Done by linkage analysis: following the pattern of marker inheritance in families with heritable diseases
12

Array-based methods of SNP detection & Haplotype mapping
Each base-pair position on human chromosome 21 is interrogated 8 times (4 in forward & 4 in reverse orientations):
GGAGATGAGTTC G ATTACTCTTAGG
GGAGATGAGTTC A ATTACTCTTAGG
GGAGATGAGTTC T ATTACTCTTAGG
GGAGATGAGTTC C ATTACTCTTAGG
1.7 x 10^8 oligos in total on eight Affy wafers were used to identify SNPs on human chromosome 21 from 21 different individuals.
14
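
The "8 probes per position" idea can be sketched as follows. The 25-mer layout (12 flanking bases on each side of the interrogated base) is taken from the example oligos on the slide. The function names, and the choice to build the reverse-orientation probes as reverse complements of the forward ones, are assumptions made for illustration and not a description of the actual array design.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def probes_for_position(reference, pos, flank=12):
    """Return 8 probes interrogating reference[pos]: 4 forward, 4 reverse.

    Each forward probe keeps the flanking sequence fixed and substitutes one
    of A, C, G, T at the central position. The sample base is called from
    whichever probe hybridizes best.
    """
    if pos - flank < 0 or pos + flank >= len(reference):
        raise ValueError("position too close to the sequence end for this flank")
    left = reference[pos - flank:pos]
    right = reference[pos + 1:pos + flank + 1]
    forward = [left + base + right for base in "ACGT"]
    reverse = [reverse_complement(probe) for probe in forward]
    return forward + reverse


# The 25 bases shown on the slide, with G taken as the reference base.
reference = "GGAGATGAGTTCGATTACTCTTAGG"
for probe in probes_for_position(reference, 12):
    print(probe)
```

Tiling every position of a chromosome this way is what drives the very large oligo counts quoted on the slide.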

[Figure: haplotype map. Each row = single SNP; each column = one Ch 21; blue = major allele; yellow = minor allele]
Much of the chromosomal variation is explained by relatively limited haplotype diversity.
80% of the haplotype structure can be captured with only 10% of the SNPs in that block (need only 2 SNPs to type).
Haplotype length can vary from a few kb to megabases.
15

Phenotypic variation (including disease susceptibility) is often linked to copy-number changes. This is especially true of numerous types of cancer, where local amplifications and translocations increase the copy number of cell proliferation regulators, etc. 17

Amplifications in breast cancer lines increase the copy number of specific regulators. 18
