[BejeranoWinter12/13] 1 MW 11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim Notwell & Harendra Guturu CS173 Lecture 12:

Presentation transcript:

[BejeranoWinter12/13] 1 MW 11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim Notwell & Harendra Guturu CS173 Lecture 12: Chains & Nets, Conservation & Function

[BejeranoWinter12/13] 2 Announcements: HW2 is due today, as are project assignments. The lecture on the coming Monday, 2/25, has been moved to LK101 (the building next door; we'll post instructions).

[BejeranoWinter12/13] 3 Inferring Genomic Mutations From Alignments of Genomes

[BejeranoWinter12/13] 4 Terminology
Orthologs: genes related via speciation (e.g. C, M, H3).
Paralogs: genes related through duplication (e.g. H1, H2, H3).
Homologs: genes that share a common origin (e.g. C, M, H1, H2, H3).
[Figure: a species tree with a gene tree drawn inside it, starting from a single ancestral gene; events are labeled speciation, duplication, and loss.]

[BejeranoWinter12/13] 5
What? Compare whole genomes: compare two genomes, within (intra) species or between (inter) species, or compare a genome to itself.
Why? Comparison reveals functional and neutral regions. Homologous regions most often have similar functions. Modification of functional regions can reveal disease susceptibility and adaptation.
How?

6 Every Genome is Different DNA replication is imperfect: differences arise between individuals of the same species, and even between the cells of an individual.
[Figure: the sequence ...ACGTACGACTGACTAGCATCGACTACGA... is passed from chicken to egg to chicken, with changes (TT, CAT) marked. In the "junk" region "anything goes"; in the "functional" region many changes are not tolerated.]

Sequence Alignment
Unaligned sequences:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Aligned sequences:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Similarity is often measured using "%id", or percent identity:
%id = number of matching bases / number of alignment columns
where every alignment column is a match, a mismatch, or an indel, and an indel is an insertion or deletion (telling the two apart requires an outgroup).
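The %id definition above can be sketched in a few lines of Python. This is an illustration only: the function name `percent_identity` and the variable names are assumptions, not part of the lecture, and a column counts as a match only when both rows carry the same base (a gap never matches).

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Return matching columns / total alignment columns (a fraction in [0, 1])."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same number of columns")
    if not row_a:
        raise ValueError("alignment is empty")
    # A column is a match when both rows hold the same base; "-" marks an indel.
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return matches / len(row_a)


# The two aligned rows from the slide (hypothetical variable names).
row_1 = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
row_2 = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"
print(f"%id = {100 * percent_identity(row_1, row_2):.1f}")
```

Note that the denominator is the number of alignment columns, not the length of either sequence, so adding gap columns lowers %id.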

[BejeranoWinter12/13] 8 What to expect from genome comparisons? [Figure axes: human, lizard.] Objective: find local alignment blocks that are likely homologous (share a common origin). O(mn): examine the full matrix using DP (dynamic programming). O(m+n): heuristics based on seeding + extension, which trade sensitivity for speed.
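A minimal sketch of the seeding + extension idea, assuming exact k-mer seeds and gapless extension that stops at the first mismatch. Real aligners such as blastz use more tolerant seeds and score-based extension, so the functions `find_seeds` and `extend_seed` below are hypothetical simplifications.

```python
from collections import defaultdict


def find_seeds(target: str, query: str, k: int) -> list[tuple[int, int]]:
    """Return (target_pos, query_pos) pairs where an exact k-mer is shared."""
    index = defaultdict(list)
    for t in range(len(target) - k + 1):
        index[target[t:t + k]].append(t)      # hash every target k-mer once
    seeds = []
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], ()):
            seeds.append((t, q))
    return sorted(seeds)


def extend_seed(target: str, query: str, t: int, q: int, k: int) -> tuple[int, int, int]:
    """Grow a seed left and right without gaps while the bases keep matching.

    Returns (target_start, query_start, block_length).
    """
    start_t, start_q = t, q
    while start_t > 0 and start_q > 0 and target[start_t - 1] == query[start_q - 1]:
        start_t -= 1
        start_q -= 1
    end_t, end_q = t + k, q + k
    while end_t < len(target) and end_q < len(query) and target[end_t] == query[end_q]:
        end_t += 1
        end_q += 1
    return start_t, start_q, end_t - start_t


seeds = find_seeds("ACGTAC", "GTACG", 3)
block = extend_seed("ACGTAC", "GTACG", 2, 0, 3)
```

Only k-mers that occur in both sequences are ever examined, which is where the speed comes from; homologous regions without an exact shared k-mer are missed, which is the sensitivity cost.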

9 “Raw” (B)lastz track (no longer displayed) Protease Regulatory Subunit 3 Alignment = homologous regions

[BejeranoWinter12/13] 10 Chaining co-linear alignment blocks [Figure axes: human, lizard.] Objective: find local alignment blocks that are likely homologous (share a common origin). Chaining strings together co-linear blocks in the target genome to which we are comparing. In the chain display, gaps are drawn as double lines when there is unalignable sequence in the other species, and as single lines when there isn't.

11 Reference genome perspective, The Use of an Outgroup
[Figure: outgroup, human, and mouse sequences drawn as segments A B C D E, with an additional segment B' marked. In the human browser the human sequence is implicit and the mouse chains are drawn against it; in the mouse browser the mouse sequence is implicit and the human chains are drawn against it.]

12 Gap Types: Single vs Double sided
[Figure: ancestral, human, and mouse sequences drawn as segments A B C D E, with an additional segment B' marked. In the human browser the human sequence is implicit and the mouse chains are drawn against it; in the mouse browser the mouse sequence is implicit and the human chains are drawn against it.]

[BejeranoWinter12/13] 13 Conservation Track Documentation

[BejeranoWinter12/13] 14 Chains join together related local alignments Protease Regulatory Subunit 3 likely ortholog likely paralogs shared domain?

[BejeranoWinter12/13] 15 Note: repeats are a nuisance [Figure axes: human, mouse.] If, for example, human and mouse each have 10,000 copies of the same repeat, we will obtain, and need to output, 10^8 alignments of all these copies to each other. Note that for the sake of this comparison interspersed repeats and simple repeats are equal nuisances. Also note that simple repeats, but not interspersed repeats, violate the assumption that similar sequences are homologous.
Solution:
1 Discover all repetitive sequences in each genome.
2 Mask them when doing genome to genome comparison.
3 Chain your alignments.
4 Add back to the alignments only repeat matches that lie within pre-computed chains.
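A small sketch of the arithmetic and of the masking step, assuming repeats have already been discovered and marked in lowercase (soft-masking, the convention used in UCSC genome downloads). The helper `unmasked_kmers` is hypothetical; it only shows how masked bases can be kept out of seeding.

```python
from typing import Iterator

# 10,000 copies in each genome pair up in every combination.
copies_per_genome = 10_000
pairwise_repeat_alignments = copies_per_genome ** 2   # 10^8


def unmasked_kmers(soft_masked: str, k: int) -> Iterator[tuple[int, str]]:
    """Yield (position, k-mer) only for k-mers with no lowercase (masked) base."""
    for i in range(len(soft_masked) - k + 1):
        kmer = soft_masked[i:i + k]
        if kmer.isupper():            # any lowercase base means "inside a repeat"
            yield i, kmer


# Lowercase "gt" stands in for an annotated repeat in this toy sequence.
usable = list(unmasked_kmers("ACgtACGT", 3))
```

Seeding only from the unmasked k-mers avoids the quadratic blow-up, and repeat matches can be restored later inside chains that were built from unique sequence.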

[BejeranoWinter12/13] 16 Chains
A chain is a sequence of gapless aligned blocks, where there must be no overlaps of blocks' target or query coords within the chain.
Within a chain, target and query coords are monotonically non-decreasing (i.e. always increasing or flat).
Double-sided gaps are a new capability (blastz can't do that) that allow extremely long chains to be constructed.
Not just orthologs, but paralogs too, can result in good chains. But that's useful!
Chains should be symmetrical: e.g. swap human-mouse -> mouse-human chains, and you should get approx. the same chains as if you chain swapped mouse-human blastz alignments.
Chained blastz alignments are not single-coverage in either target or query unless some subsequent filtering (like netting) is done.
Chain tracks can contain massive pileups when a piece of the target aligns well to many places in the query. Common causes of this include insufficient masking of repeats and high-copy-number genes (or paralogs).
[Angie Hinrichs, UCSC wiki]
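The chain properties listed above can be checked mechanically. The sketch below assumes a hypothetical `Block` record of (target start, query start, size) on a single strand; it is not the UCSC chain file format, and strand handling is left out.

```python
from typing import NamedTuple


class Block(NamedTuple):
    t_start: int   # start in the target genome
    q_start: int   # start in the query genome
    size: int      # gapless, so the length is the same in both genomes


def is_valid_chain(blocks: list[Block]) -> bool:
    """True if blocks never overlap and both coordinates never decrease."""
    for prev, nxt in zip(blocks, blocks[1:]):
        if nxt.t_start < prev.t_start + prev.size:
            return False
        if nxt.q_start < prev.q_start + prev.size:
            return False
    return True


def gap_kinds(blocks: list[Block]) -> list[str]:
    """Classify the gap between consecutive blocks as none, single, or double."""
    kinds = []
    for prev, nxt in zip(blocks, blocks[1:]):
        t_gap = nxt.t_start - (prev.t_start + prev.size)
        q_gap = nxt.q_start - (prev.q_start + prev.size)
        if t_gap > 0 and q_gap > 0:
            kinds.append("double")     # unaligned sequence in both genomes
        elif t_gap > 0 or q_gap > 0:
            kinds.append("single")     # extra sequence in one genome only
        else:
            kinds.append("none")
    return kinds


chain = [Block(0, 0, 10), Block(15, 10, 5), Block(30, 25, 5)]
```

For this toy chain the first gap has extra sequence only in the target (single-sided) and the second has unaligned sequence in both genomes (double-sided).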

[BejeranoWinter12/13] 17 Before and After Chaining

[BejeranoWinter12/13] 18 Chaining Algorithm
Input: blocks of gapless alignments from blastz.
Dynamic program based on the recurrence relationship:
$\mathrm{score}(B_i) = \max_{j<i} \left( \mathrm{score}(B_j) + \mathrm{match}(B_i) - \mathrm{gap}(B_i, B_j) \right)$
Uses Miller's KD-tree algorithm to minimize which parts of the dynamic programming graph to traverse. Timing is O(N log N), where N is the number of blocks (which is in the hundreds of thousands).
See [Kent et al, 2003] "Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes".
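The recurrence can be illustrated with a plain quadratic dynamic program. This sketch deliberately omits the KD-tree speed-up, so it runs in O(N^2) rather than O(N log N). The block tuple layout, the base case (a chain may start at any block), and the gap cost `linear_gap_cost` are assumptions made for the example, not the scoring used by Kent et al.

```python
def linear_gap_cost(t_gap: int, q_gap: int) -> int:
    """Hypothetical gap penalty: one point per unaligned base on either side."""
    return t_gap + q_gap


def chain_blocks(blocks, gap_cost=linear_gap_cost):
    """Best-scoring chain over blocks given as (t_start, q_start, size, match_score).

    Returns (best_score, indices of the chained blocks in order).
    Assumes every block has size > 0 and lies on the same strand.
    """
    order = sorted(range(len(blocks)), key=lambda i: (blocks[i][0], blocks[i][1]))
    score, back = {}, {}
    for pos, i in enumerate(order):
        t_i, q_i, _, match_i = blocks[i]
        score[i], back[i] = match_i, None          # base case: chain starts at block i
        for j in order[:pos]:                      # every candidate predecessor
            t_j, q_j, size_j, _ = blocks[j]
            t_gap = t_i - (t_j + size_j)
            q_gap = q_i - (q_j + size_j)
            if t_gap < 0 or q_gap < 0:
                continue                           # would overlap or go backwards
            candidate = score[j] + match_i - gap_cost(t_gap, q_gap)
            if candidate > score[i]:
                score[i], back[i] = candidate, j
    if not score:
        return 0, []
    node = max(score, key=score.get)
    best_score, path = score[node], []
    while node is not None:                        # walk the back-pointers
        path.append(node)
        node = back[node]
    return best_score, path[::-1]


blocks = [(0, 0, 10, 10), (12, 11, 10, 10), (5, 40, 10, 10), (30, 30, 10, 10)]
best_score, best_chain = chain_blocks(blocks)
```

In this toy input the first two blocks are nearly adjacent in both genomes and chain together, while the distant fourth block costs more in gap penalty than its match score adds.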

[BejeranoWinter12/13] 19 Netting Alignments Commonly, multiple mouse alignments can be found for a particular human region, particularly for coding regions. The net finds the best mouse match for each human region. The highest-scoring chains are used first. Lower-scoring chains fill in gaps within higher-scoring chains, inducing a natural hierarchy.
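A toy sketch of the "highest score first, lower scores fill the gaps" idea, working base by base on a short target. It is an assumption-laden illustration: the function `net_coverage` and its input layout are hypothetical, and the real netter works on chain spans, tracks nesting levels, and applies size thresholds, none of which are modeled here.

```python
def net_coverage(chains, target_length):
    """Assign each target base to the best-scoring chain that aligns it.

    chains: list of (name, aligned_target_intervals, score), where the intervals
    are half-open (start, end) pairs covered by the chain's gapless blocks.
    Returns a list with one chain name (or None) per target base.
    """
    owner = [None] * target_length
    # Highest-scoring chains claim bases first.
    for name, intervals, _ in sorted(chains, key=lambda c: c[2], reverse=True):
        for start, end in intervals:
            for pos in range(start, end):
                if owner[pos] is None:     # lower-scoring chains only fill what is left
                    owner[pos] = name
    return owner


toy_chains = [
    ("A", [(0, 4), (8, 12)], 100),   # strong chain with a gap at 4..8
    ("B", [(3, 9)], 50),             # weaker chain overlapping A and spanning its gap
]
owners = net_coverage(toy_chains, 12)
```

Each target base ends up with at most one owner, which is the single-coverage property of a net in the target; the weaker chain B survives only inside the gap of A.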

[BejeranoWinter12/13] 20 Net highlights rearrangements A large gap in the top level of the net is filled by an inversion containing two genes. Numerous smaller gaps are filled in by local duplications and processed pseudo-genes.

[BejeranoWinter12/13] 21 A Rearrangement Hot Spot Rearrangements are not evenly distributed. Roughly 5% of the genome is in hot spots of rearrangements such as this one. This 350,000 base region is between two very long chains on chromosome 7.

[BejeranoWinter12/13] 22 Nets attempt to capture the ortholog (they also hide everything else)

[BejeranoWinter12/13] 23 Retroposed Genes and Pseudogenes Pseudogenes ("dead genes"): genomic sequences that resemble (and originated from) genes, but that no longer make proteins. Retrogenes ("retrotranscribed"): protein-coding RNA that was reverse transcribed and inserted back into the genome. The RNA can be grabbed at any stage (partial/full transcript, before/during/after all introns are spliced).

[BejeranoWinter12/13] 24 Useful in finding pseudogenes Ensembl and Fgenesh++ automatic gene predictions are confounded by numerous processed pseudogenes. The domain structure of the resulting predicted protein must be interesting! [Figure label: gene pred.]

[BejeranoWinter12/13] 25 Nets/chains can reveal retrogenes (and when they jumped in!)

[BejeranoWinter12/13] 26 Nets
A net is a hierarchical collection of chains, with the highest-scoring non-overlapping chains on top, and their gaps filled in where possible by lower-scoring chains, for several levels.
A net is single-coverage for target but not for query. Because it's single-coverage in the target, it's no longer symmetrical.
The netter has two outputs, one of which we usually ignore: the target-centric net in query coordinates. The reciprocal best process uses that output: the query-referenced (but target-centric / target single-cov) net is turned back into component chains, and then those are netted to get single coverage in the query too; the two outputs of that netting are reciprocal-best in query and target coords. Reciprocal-best nets are symmetrical again.
Nets do a good job of filtering out massive pileups by collapsing them down to (usually) a single level.
GB: for human inspection always prefer looking at the chains!
[Angie Hinrichs, UCSC wiki]

[BejeranoWinter12/13] 27 Before and After Netting

[BejeranoWinter12/13] 28 Convert / LiftOver "LiftOver chains" are actually chains extracted from nets, or chains filtered by the netting process. LiftOver – batch utility
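A minimal sketch of how a position can be carried from one assembly to another through the gapless blocks of a chain. This is not the LiftOver tool or its file format: the function `lift_position` and the block layout are hypothetical, and strand, chromosome names, and partially mapped intervals are ignored.

```python
from typing import Optional


def lift_position(pos: int, blocks: list[tuple[int, int, int]]) -> Optional[int]:
    """Map a target coordinate to the query through (t_start, q_start, size) blocks.

    Returns None when the position falls in a gap between blocks, i.e. it has
    no aligned counterpart in the other assembly.
    """
    for t_start, q_start, size in blocks:
        if t_start <= pos < t_start + size:
            return q_start + (pos - t_start)   # same offset inside a gapless block
    return None


toy_blocks = [(100, 500, 50), (200, 560, 30)]
lifted = [lift_position(p, toy_blocks) for p in (120, 160, 210)]
```

Because the blocks are gapless, the offset inside a block is preserved; positions in chain gaps cannot be converted.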

[BejeranoWinter12/13] 29 What nets can’t show, but chains will

[BejeranoWinter12/13] 30 Same Region… same in all the other fish

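31 Drawbacks: inversions are not handled optimally.
[Figure: a forward-strand chr1 region (> > >) interrupted by a reverse-strand segment (< < <). The Chains track shows both a reverse-strand chr1 chain and a reverse-strand chr5 chain over that segment; the Nets track shows only the chr5 chain there.]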

Drawbacks High copy number genes can break orthology 32

[BejeranoWinter12/13] 33 Self Chain reveals paralogs (self net is meaningless)

[BejeranoWinter12/13] 34 Conservation and Function

[BejeranoWinter12/13] 35 Evolution = Mutation + Selection Mistakes can happen during DNA replication. Mistakes are oblivious to DNA segment function. But then selection kicks in.
[Figure: the sequence ...ACGTACGACTGACTAGCATCGACTACGA... is passed from chicken to egg to chicken, with changes (TT, CAT) marked. In the "junk" region "anything goes"; in the "functional" region many changes are not tolerated.]
Conservation implies function! (But what function?)

[BejeranoWinter12/13] 36 Vertebrates: what to sequence? [Figure: vertebrate tree from Human Molecular Genetics, 3rd Edition, with a "you are here" marker, the added species Opossum, Lizard, and Stickleback, and ranges labeled "too close", "sweet spot", and "too far".] Which species to compare to? Too close and sequence under purifying selection will be largely indistinguishable from sequence evolving at the neutral rate. Too far and many functional orthologs will diverge beyond our ability to accurately align them.

[BejeranoWinter12/13] 37 Searching Near And Far Search too near (e.g. human to chimp or orang above) and you cannot distinguish neutral sequence from sequence under purifying selection. Search further (e.g. mouse) and the two distributions pull apart. But now you've lost younger functional sequences born after the split. I.e., conservation implies function, but lack of conservation does NOT imply lack of function!

[BejeranoWinter12/13] 38 Phylogenetic Shadowing [Figure: the same vertebrate tree, with a "you are here" marker, Opossum, Lizard, and Stickleback, and the "too close" range highlighted.] "Too close" can actually be a boon if you have enough closely related genomes.

PhastCons Conserved Elements [BejeranoWinter12/13]

Distant homologies [BejeranoWinter12/13] 40 When species diverge too much (e.g. chicken and beyond above), confident alignments can no longer be detected at the DNA level. E.g.: all SPI1 and SLC39A13 exons are there in chicken & fish.

[BejeranoWinter12/13] 41 Distant homologies search strategies Here it is much better to search a gene model from species A (e.g. human) against the genome of species B (e.g. chicken). This is a search of amino acids, in all their possible codons, against a gene with unknown exon-intron structure (e.g. TBLASTN, translated BLAT).
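The core of a translated search is comparing protein sequence against every possible reading of the DNA. The sketch below shows only that first ingredient, a six-frame translation with the standard genetic code; it is not how TBLASTN or BLAT are implemented, the function names are hypothetical, and intron handling is not modeled.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate consecutive codons; unknown codons (e.g. with N) become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def reverse_complement(dna: str) -> str:
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_translations(dna: str) -> dict[str, str]:
    """Translate all three offsets on both strands, keyed '+1'..'+3', '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCC")
```

Because several codons encode the same amino acid, matching at the protein level tolerates synonymous DNA changes that would break a nucleotide-level seed.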

[BejeranoWinter12/13] 42 Distant homologies Find the most distantly related genes using gene models in both species:
1 Search the amino acid sequences against each other (e.g. using BLASTP).
2 Map your hits back to the two respective genomes, anchored on the amino acid alignment (respecting any exon-intron gene body structure change).
3 Examine co-linear homology of flanking genes to try and tell orthologs from paralogs.

[BejeranoWinter12/13] 43 RNA homology searches
1 Define a mathematical construct that describes potential homologs.
2 Go search for them (efficiently!).
3 Examine genomic context.

[BejeranoWinter12/13] 44 Enhancer remote homologs Enhancers, and gene regulatory sequences in general, are the most challenging to search for: Individual binding sites are very flexible. Gaps between binding sites may evolve (semi) neutrally, making DNA alignment seeding particularly frail. Binding site gain/loss and shuffling may or may not be allowed; we need a better understanding of the underlying logic.

Exceptionally Old Enhancers Exist [BejeranoWinter12/13] 45 But how many of these really exist?

[BejeranoWinter12/13] 46 Ultraconservation: No known function requires this much conservation [Figure labels: CDS, ncRNA, TFBS, seq., ?]

“Gene” Finding III: Comparative Genomics [BejeranoWinter12/13] 47

[BejeranoWinter12/13] 48 The challenge: map code to output [Figure labels: genome (3*10^9 letters), cells, person.] Ultimately we sequence genomes, and study their function in detail, to understand genome to phenotype relationships. Minus side: genomic contribution to disease. Plus side: adaptation and speciation. To be continued…