Comparative Genomics I: Tools for comparative genomics

Ross Hardison, Penn State University
James Taylor, Courant Institute, New York University
Major collaborators:
  Webb Miller, Francesca Chiaromonte, Laura Elnitski, David King, et al., PSU
  David Haussler, Jim Kent, Univ. California at Santa Cruz
  Ivan Ovcharenko, Lawrence Livermore National Lab
CSH, Nov. 11, 2006

Major goals of comparative genomics
Identify all DNA sequences in a genome that are functional
  Selection to preserve function
  Adaptive selection
Determine the biological role of each functional sequence
Elucidate the evolutionary history of each type of sequence
Provide bioinformatic tools so that anyone can easily incorporate insights from comparative genomics into their research

Three major classes of evolution
Neutral evolution
  Acts on DNA with no function
  Genetic drift allows some random mutations to become fixed in a population
Purifying (negative) selection
  Acts on DNA with a conserved function
  Signature: Rate of change is significantly slower than that of neutral DNA
  Sequences with a common function in the species examined are under purifying (negative) selection
Darwinian (positive) selection
  Acts on DNA in which changes benefit an organism
  Signature: Rate of change is significantly faster than that of neutral DNA

Ideal case for interpretation
[Figure: similarity plotted against position along the chromosome, relative to the level of neutral DNA]
Negative (purifying) selection: Exonic segments coding for regions of a polypeptide with common function in two species.
Positive (adaptive) selection: Exonic segments coding for regions of a polypeptide in which change is beneficial to one of the two species.

DCODE.org Comparative Genomics: Align your own sequences
blastZ multiZ and TBA

zPicture interface for aligning sequences

Automated extraction of sequence and annotation

Pre-computed alignment of genomes
blastZ for pairwise alignments
multiZ for multiple alignment
  Human, chimp, mouse, rat, chicken, dog
  Also multiple fly, worm, yeast genomes
Organize local alignments: chains and nets
All-against-all comparisons
  High sensitivity and specificity
Computer cluster at UC Santa Cruz
  1024 CPUs, Pentium III
  Job takes about half a day
Results available at the UCSC Genome Browser and the Galaxy server
Webb Miller, Jim Kent, David Haussler
Schwartz et al., 2003, blastZ, Genome Research
Blanchette et al., 2004, TBA and multiZ, Genome Research

Genome-wide local alignment chains
Human: 2.9 Gb assembly. Mask interspersed repeats, break into 300 segments of 10 Mb.
blastZ: Each segment of human is given the opportunity to align with all mouse sequences.
Run blastZ in parallel for all human segments.
Collect all local alignments above threshold.
Organize local alignments into a set of chains based on position in assembly and orientation.
[Figure labels: Human, Mouse; Level 1 chain, Level 2 chain, Net]
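The segment-and-chain steps above can be sketched in a few lines. This is a minimal illustration only, not the UCSC pipeline: the `Block` record, the function names, and the scoring (sum of block scores, no gap penalties) are assumptions made here, and minus-strand blocks are assumed to be given in reverse-complemented query coordinates so that a simple collinearity test applies.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    """One local alignment (hypothetical minimal record, half-open coordinates)."""
    t_start: int   # start in the target genome (e.g. human)
    t_end: int     # end in the target genome
    q_start: int   # start in the query genome (e.g. mouse)
    q_end: int     # end in the query genome
    strand: str    # '+' or '-'
    score: float


def split_into_segments(length: int, seg_size: int) -> List[Tuple[int, int]]:
    """Break an assembly of `length` bases into consecutive segments."""
    return [(s, min(s + seg_size, length)) for s in range(0, length, seg_size)]


def best_chain(blocks: List[Block]) -> List[Block]:
    """Highest-scoring chain of collinear, same-strand blocks (O(n^2) DP)."""
    if not blocks:
        return []
    ordered = sorted(blocks, key=lambda b: (b.t_start, b.q_start))
    best = [b.score for b in ordered]   # best chain score ending at block i
    prev = [-1] * len(ordered)          # back-pointer to the previous block
    for i, b in enumerate(ordered):
        for j in range(i):
            a = ordered[j]
            collinear = a.t_end <= b.t_start and a.q_end <= b.q_start
            if a.strand == b.strand and collinear and best[j] + b.score > best[i]:
                best[i] = best[j] + b.score
                prev[i] = j
    # Trace back from the best-scoring chain end.
    i = max(range(len(ordered)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]


segments = split_into_segments(2_900_000_000, 10_000_000)
toy_blocks = [
    Block(0, 100, 0, 100, "+", 50.0),
    Block(150, 250, 160, 260, "+", 40.0),
    Block(120, 140, 500, 520, "+", 10.0),   # out of order in the query
    Block(300, 400, 300, 400, "-", 30.0),   # opposite orientation
]
chain = best_chain(toy_blocks)
print(len(segments), [(b.t_start, b.t_end) for b in chain])
```

Splitting exactly 2.9 Gb into 10 Mb pieces gives 290 segments, in line with the slide's round figure of 300. Each segment is an independent job, which is what makes the parallel blastZ run possible.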

Comparative genomics to find functional sequences
[Figure: find common sequences with blastZ and multiZ, then identify functional sequences]
Species compared: Human, Mouse, Rat
Genome sizes labeled in the figure, in million base pairs (Mbp): 2,900; 2,400; 2,500; 1,200
Common to all mammals: 1000 Mbp
Functional sequences: ~145 Mbp
Also in birds: 72 Mb
Papers in Nature from the mouse, rat and chicken genome consortia, 2002, 2004

Use measures of alignment quality to discriminate functional from nonfunctional DNA
Compute a conservation score adjusted for the local neutral rate
Score S for a 50 bp region R is the normalized fraction of aligned bases that are identical
  Subtract the mean for aligned ancestral repeats in the surrounding region
  Divide by the standard deviation
  S = (p - m) / sqrt(m(1 - m)/n)
p = fraction of aligned sites in R that are identical between human and mouse
m = average fraction of aligned sites that are identical in aligned ancestral repeats in the surrounding region
n = number of aligned sites in R
Waterston et al., Nature
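As a worked example of the score, the sketch below subtracts the neutral mean m from the observed identity p and divides by the binomial standard deviation sqrt(m(1 - m)/n). The function names and the example numbers (p = 0.90, m = 0.65) are illustrative assumptions, not values from the slides.

```python
import math
from typing import Tuple


def fraction_identical(human: str, mouse: str) -> Tuple[float, int]:
    """Return (p, n) for two aligned strings; '-' marks a gap column."""
    pairs = [(h, q) for h, q in zip(human.upper(), mouse.upper())
             if h != "-" and q != "-"]
    n = len(pairs)
    if n == 0:
        raise ValueError("no aligned sites in this window")
    matches = sum(h == q for h, q in pairs)
    return matches / n, n


def conservation_score(p: float, m: float, n: int) -> float:
    """S = (p - m) / sqrt(m * (1 - m) / n)."""
    if n <= 0 or not 0.0 < m < 1.0:
        raise ValueError("need n > 0 and 0 < m < 1")
    return (p - m) / math.sqrt(m * (1.0 - m) / n)


# A 50 bp window that is 90% identical where nearby ancestral repeats
# average 65% identity (made-up numbers).
s = conservation_score(p=0.90, m=0.65, n=50)
print(round(s, 2))

p_toy, n_toy = fraction_identical("ACGT-A", "ACCTTA")
print(p_toy, n_toy)
```

Because m is taken from ancestral repeats near the window, the same raw identity can give a high S in a fast-evolving region and an unremarkable S in a slow-evolving one.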

Decomposition of conservation score into neutral and likely-selected portions
[Figure: frequency of the S score for all 50 bp windows in the human genome. Curves: All DNA; Neutral DNA (ARs); Likely selected DNA, at least 5-6%]
S is the conservation score adjusted for variation in the local substitution rate.
From the distribution of S scores in ancestral repeats (mostly neutral DNA), one can compute the probability that a given alignment could result from the locally adjusted neutral rate.
Waterston et al., Nature
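One simple way to turn the neutral distribution into a probability is an empirical tail fraction: the share of ancestral-repeat windows that score at least as high as the window of interest. The slides do not say how the probability was actually computed, so the approach, the function name, and the add-one correction below are all assumptions for illustration.

```python
from bisect import bisect_left
from typing import Sequence


def neutral_tail_probability(score: float, neutral_scores: Sequence[float]) -> float:
    """Estimate P(S >= score) under neutrality from ancestral-repeat scores."""
    ordered = sorted(neutral_scores)
    at_least = len(ordered) - bisect_left(ordered, score)
    # Add-one correction so the estimate is never exactly zero.
    return (at_least + 1) / (len(ordered) + 1)


# Made-up S scores for windows inside ancestral repeats.
toy_neutral = [-2.0, -1.0, -0.5, 0.0, 0.0, 0.5, 1.0, 1.5, 2.0]
print(neutral_tail_probability(1.5, toy_neutral))
print(neutral_tail_probability(5.0, toy_neutral))
```

A window whose score is rarely reached by ancestral repeats gets a small probability and is a candidate for selection. With real data the neutral sample would hold many thousands of windows rather than nine.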

DNA sequences of mammalian genomes
Human: 2.9 billion bp, “finished”
  High quality, comprehensive sequence, very few gaps
Mouse, rat, dog, opossum, chicken, frog, etc.
About 40% of the human genome aligns with mouse
  This is conserved, but not all is under selection.
About 5-6% of the human genome has been under purifying selection since the rodent-primate divergence
About 1.2% codes for protein
The 4 to 5% of the human genome that is under selection but does not code for protein should have:
  Regulatory sequences
  Non-protein coding genes (UTRs and noncoding RNAs)
  Other important sequences
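The percentages above translate into base pairs as follows. This is a back-of-the-envelope sketch that assumes, as the slide's subtraction implies, that all protein-coding sequence is counted within the selected fraction. The function name is hypothetical.

```python
from typing import Tuple


def selected_breakdown(genome_mbp: float, selected_fraction: float,
                       coding_fraction: float) -> Tuple[float, float, float]:
    """Return (selected, coding, noncoding selected) in Mbp."""
    selected_mbp = genome_mbp * selected_fraction
    coding_mbp = genome_mbp * coding_fraction
    return selected_mbp, coding_mbp, selected_mbp - coding_mbp


GENOME_MBP = 2900  # human genome size used in the slides
for fraction in (0.05, 0.06):
    selected, coding, noncoding = selected_breakdown(GENOME_MBP, fraction, 0.012)
    print(f"{fraction:.0%} selected: {selected:.0f} Mbp total, "
          f"{coding:.0f} Mbp coding, {noncoding:.0f} Mbp noncoding "
          f"({noncoding / GENOME_MBP:.1%} of the genome)")
```

At 5% the selected total comes to 145 Mbp, matching the ~145 Mbp quoted earlier in the deck. The noncoding share works out to 3.8% to 4.8%, which the slide rounds to 4 to 5%.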

Leverage many species to improve accuracy and resolution of signals for constraint
ENCODE multi-species alignment group Margulies et al., 2007

Score multi-species alignments for features associated with function
Multiple alignment scores
  Margulies et al. (2003) Genome Research 13:
  Binomial, parsimony
PhastCons
  Siepel et al. (2005) Genome Research 15:
  Phylogenetic Hidden Markov Model
  Posterior probability that a site is among the most highly conserved sites
GERP
  Cooper et al. (2005) Genome Research 15:
  Genomic Evolutionary Rate Profiling
  Measures constraint as rejected substitutions = nucleotide substitution deficits

phastCons: Likelihood of being constrained
Phylogenetic Hidden Markov Model
Posterior probability that a site is among the most highly conserved sites
Allows for variation in rates along lineages
c is “conserved” (constrained)
n is “nonconserved” (aligns but is not clearly subject to purifying selection)
Siepel et al. (2005) Genome Research 15:
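The posterior probability of the conserved state can be illustrated with a toy two-state HMM and the forward-backward algorithm. This is not phastCons itself: in a real phylo-HMM the per-column likelihoods come from phylogenetic models fitted to the alignment, whereas here they are simply passed in, and the transition and start probabilities are arbitrary assumed values.

```python
from typing import List, Sequence


def posterior_conserved(lik_c: Sequence[float], lik_n: Sequence[float],
                        stay_c: float = 0.9, stay_n: float = 0.95,
                        start_c: float = 0.3) -> List[float]:
    """P(state = c | all columns) for each alignment column.

    lik_c[i] and lik_n[i] are the likelihoods of column i under the
    conserved and nonconserved models (assumed to be given).
    """
    if len(lik_c) != len(lik_n):
        raise ValueError("likelihood lists must have equal length")
    n = len(lik_c)
    if n == 0:
        return []
    emit = list(zip(lik_c, lik_n))                         # index 0 = c, 1 = n
    trans = ((stay_c, 1 - stay_c), (1 - stay_n, stay_n))   # trans[from][to]
    start = (start_c, 1 - start_c)

    # Forward pass, rescaled at each column to avoid underflow.
    fwd: List[List[float]] = []
    for i in range(n):
        if i == 0:
            row = [start[s] * emit[0][s] for s in (0, 1)]
        else:
            row = [emit[i][s] * sum(fwd[i - 1][r] * trans[r][s] for r in (0, 1))
                   for s in (0, 1)]
        total = sum(row)
        fwd.append([x / total for x in row])

    # Backward pass, also rescaled.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        row = [sum(trans[s][r] * emit[i + 1][r] * bwd[i + 1][r] for r in (0, 1))
               for s in (0, 1)]
        total = sum(row)
        bwd[i] = [x / total for x in row]

    # Combine and normalize per column.
    posteriors = []
    for i in range(n):
        c = fwd[i][0] * bwd[i][0]
        nc = fwd[i][1] * bwd[i][1]
        posteriors.append(c / (c + nc))
    return posteriors


# Five columns that favor the conserved model, then five that do not.
toy_post = posterior_conserved([0.9] * 5 + [0.1] * 5, [0.1] * 5 + [0.9] * 5)
print([round(x, 3) for x in toy_post])
```

Because neighboring columns share information through the transition probabilities, a single poorly conserved column inside a conserved stretch lowers the posterior less than it would if each column were scored alone.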

Larger genomes have more of the constrained DNA in noncoding regions
Siepel et al. 2005, Genome Research

Some constrained introns are editing complementary regions: GRIA2
Siepel et al. 2005, Genome Research

3’UTRs can be highly constrained over large distances
3’ UTRs contain RNA processing signals, miRNA targets, other regions subject to constraints
Siepel et al. 2005, Genome Research

Ultraconserved elements = UCEs
At least 200 bp with no interspecies differences
Bejerano et al. (2004) Science 304:
481 UCEs with no changes among human, mouse and rat
Also conserved out to dog and chicken
More highly conserved than the vast majority of coding regions
Most do not code for protein
  Only 111 out of 481 overlap with protein-coding exons
  Some are developmental enhancers.
  Nonexonic UCEs tend to cluster in introns or in the vicinity of genes encoding transcription factors regulating development
  88 are more than 100 kb away from an annotated gene; may be distal enhancers
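The definition above (at least 200 bp with no differences) amounts to scanning a multiple alignment for long runs of identical, ungapped columns. The sketch below assumes equal-length aligned strings with '-' for gaps. The function name and the toy sequences are made up, and this is not the procedure used by Bejerano et al.

```python
from typing import List, Sequence, Tuple


def perfectly_conserved_runs(aligned_seqs: Sequence[str],
                             min_len: int = 200) -> List[Tuple[int, int]]:
    """Half-open column ranges where every sequence has the same base."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")
    runs = []
    run_start = None
    for col in range(length + 1):          # one extra step closes a final run
        identical = False
        if col < length:
            bases = {s[col].upper() for s in aligned_seqs}
            identical = len(bases) == 1 and bases.isdisjoint({"-", "N"})
        if identical:
            if run_start is None:
                run_start = col
        else:
            if run_start is not None and col - run_start >= min_len:
                runs.append((run_start, col))
            run_start = None
    return runs


# Toy three-species alignment; a short threshold keeps the example readable.
toy_alignment = [
    "ACGTACGTTTGA",   # "human"
    "ACGTACGATTGA",   # "mouse", differs at column 7
    "ACGTACGTTTGA",   # "rat"
]
print(perfectly_conserved_runs(toy_alignment, min_len=5))
```

With the default `min_len=200` the same function applies the slide's threshold to alignment columns. Mapping those columns back to genome coordinates is left out of this sketch.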

GO category analysis of UCE-associated genes
Genes in which a coding exon overlaps a UCE
  91 Type I genes
  RNA binding and modification
  Transcriptional regulation
Genes in the vicinity of a UCE (no overlap of coding exons)
  211 Type II genes
  Developmental regulators
Bejerano et al. (2004) Science

Intronic UCE in SOX6 enhances expression in melanocytes in transgenic mice
[Figure labels: UCEs; Tested UCEs]
Pennacchio et al.,

The most stringently conserved sequences in eukaryotes are mysteries
Yeast MATa2 locus
  Most conserved region in 4 species of yeast
  100% identity over 357 bp
  Role is not clear
Vertebrate UCEs
  More constrained than exons in vertebrates
  Noncoding UCEs are not detectable outside chordates, whereas coding regions are
    Were they fast-evolving prior to vertebrate/invertebrate divergence?
    Are they chordate innovations? Where did they come from?
  Role of many is not clear; need for 100% identity over 200 bp is not obvious for any
What molecular process requires strict invariance for at least 200 nucleotides?
  One possibility: Multiple, overlapping functions

Finding and analyzing genome data
NCBI Entrez
Ensembl/BioMart
UCSC Table Browser
Galaxy

Browsers vs Data Retrieval
Browsers are designed to show selected information on one locus or region at a time.
  UCSC Genome Browser
  Ensembl
Browsers run on top of databases that record vast amounts of information.
Sometimes you need to retrieve one type of information for many genomic intervals or genome-wide.
Access this by querying the tables in the databases or “data marts”
  UCSC Table Browser
  EnsMart or BioMart
  Entrez at NCBI

Retrieve all the protein-coding exons in humans

Galaxy: Data retrieval and analysis
Data can be retrieved from multiple external sources, or uploaded from the user’s computer
Hundreds of computational tools
  Data editing
  File conversion
  Operations: union, intersection, complement …
  Compute functions on data
  Statistics
  EMBOSS tools for sequence analysis
  PHYLIP tools for molecular evolutionary analysis
  PAML to compute substitutions per site
Add your own tools

Galaxy via Table Browser: coding exons

Retrieve human mutations

Find exons with human mutations: Intersection
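The intersection step keeps the exons that overlap at least one mutation. Outside Galaxy, the same operation can be sketched as below, assuming BED-style records (chromosome, 0-based start, exclusive end). The function names and coordinates are made up for illustration, and the nested scan is kept simple rather than fast.

```python
from collections import defaultdict
from typing import List, Tuple

Interval = Tuple[str, int, int]   # (chrom, start, end), half-open


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def exons_with_mutations(exons: List[Interval],
                         mutations: List[Interval]) -> List[Interval]:
    """Return the exons that overlap at least one mutation."""
    by_chrom = defaultdict(list)
    for chrom, start, end in mutations:
        by_chrom[chrom].append((start, end))
    hits = []
    for chrom, start, end in exons:
        if any(overlaps(start, end, m_start, m_end)
               for m_start, m_end in by_chrom.get(chrom, [])):
            hits.append((chrom, start, end))
    return hits


toy_exons = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 100, 200)]
toy_mutations = [("chr1", 150, 151), ("chr2", 200, 201), ("chr1", 400, 401)]
print(exons_with_mutations(toy_exons, toy_mutations))
```

With half-open coordinates, a mutation starting exactly at an exon's end position does not count as inside that exon, which is why only the first toy exon is returned.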

Compute length using “expression”

Statistics on exon lengths

Plot a histogram of exon lengths

Distribution of (human mutation) exon lengths

What is that really long exon? Sort by length

SACS has an 11kb exon
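The last few steps (compute length, summarize, bin for a histogram, sort by length) can be reproduced on a small table as a rough sketch. The rows below are made-up placeholders, not real exon coordinates, and the column layout (chrom, start, end, name) is an assumption about the exported data.

```python
import statistics
from collections import Counter

# Made-up exon records: (chrom, start, end, name), half-open coordinates.
toy_rows = [
    ("chr1", 1000, 1150, "geneA_exon1"),
    ("chr1", 2000, 2090, "geneA_exon2"),
    ("chr13", 5000, 16000, "geneB_exon1"),
    ("chr2", 300, 480, "geneC_exon1"),
]

# "Compute" step: append length = end - start as a new column.
with_length = [row + (row[2] - row[1],) for row in toy_rows]
lengths = [row[-1] for row in with_length]

# Summary statistics on exon lengths.
summary = {
    "count": len(lengths),
    "mean": statistics.mean(lengths),
    "median": statistics.median(lengths),
    "max": max(lengths),
}

# Histogram counts in 100 bp bins, keyed by the lower edge of each bin.
BIN_SIZE = 100
histogram = Counter((length // BIN_SIZE) * BIN_SIZE for length in lengths)

# Sort by length, longest first, to find the outlier.
longest_first = sorted(with_length, key=lambda row: row[-1], reverse=True)

print(summary)
print(sorted(histogram.items()))
print(longest_first[0])
```

A single very long exon pulls the mean far above the median, which is the kind of outlier that sorting by length exposes.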
