[BejeranoWinter12/13] 1 MW 11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim Notwell & Harendra Guturu CS173 Lecture 11:


1 http://cs173.stanford.edu [BejeranoWinter12/13] MW 11:00-12:15 in Beckman B302. Prof: Gill Bejerano. TAs: Jim Notwell & Harendra Guturu. CS173 Lecture 11: Repeats II, Mutations

2 Announcements: TA HW1 Comments

3 http://cs173.stanford.edu [BejeranoWinter12/13] 3 TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA CATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC AGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC CGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT AGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG ATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA AAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA TTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG ATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT TTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT CTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG AACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC ATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA AAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA GCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAA CTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGA TAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTT GGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTTGCGAAGTT CTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGT TTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATAC 
CTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCT TGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTA AGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGA GTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACA GCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAAC CAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAA CACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTG GTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTC TCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAAT GCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCT TGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTT TCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCT ATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTT TCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGA GATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTA TCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTT CATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTT CAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAA TAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGT ATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATAAAG

4 Transcription

5 Transcription Regulation: Chromatin / Proteins; DNA / Proteins; Extracellular signals

6 Repeats

7 Repeats are sequences that repeat many times in the genome. Cumulatively they take up a whopping half of the genome. They come in two major, very different flavors: I and II.

8 I. Interspersed Repeats: elements that get a copy of themselves out of the genome and into a new location.

9 II. Simple Repeats. Every possible motif of mono-, di-, tri- and tetranucleotide repeats is vastly overrepresented in the human genome. These are called microsatellites (e.g. AAAAAAAAA, CACACACAC, CAACAACAA). Longer repeating units are called minisatellites; the really long ones are called satellites. They are highly polymorphic in the human population and highly heterozygous in a single individual. As a result, microsatellites are used in paternity testing, forensics, and the inference of demographic processes. There is no clear definition of how many repetitions make a simple repeat, nor of how imperfect the different copies can be. They are highly variable between species: e.g., using the same search criteria, the mouse and rat genomes have 2-3 times more microsatellites than the human genome. They are also longer in mouse and rat.
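A minimal sketch of how perfect microsatellites could be located in a sequence. Since the slide stresses that there is no agreed definition, the thresholds here (motif length 1 to 4, at least 3 perfect copies) and the function name `find_microsatellites` are assumptions made purely for illustration.

```python
import re

def find_microsatellites(seq, max_unit=4, min_copies=3):
    """Return (start, unit, copies) for each perfect tandem run in seq.

    max_unit and min_copies are illustrative thresholds, not a standard.
    """
    # Lazy {1,N}? tries the shortest repeating unit first, so a run of
    # A's is reported with unit "A" rather than "AA".
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits

# The three example repeats from the slide
for example in ("AAAAAAAAA", "CACACACAC", "CAACAACAA"):
    print(example, find_microsatellites(example))
```

Imperfect copies, which the slide notes are common, are not handled; that would need an approximate-matching approach rather than a plain regular expression.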

10 DNA Replication

11 Simple Repeats Create Funky DNA Structures

12 These Bumps Give the DNA Polymerase Hiccups

13 Expandable Repeats and Disease

14 Restriction Enzymes. Restriction enzymes recognize and make a cut within specific DNA sequences, known as restriction sites. A restriction site is usually a 4-6 base pair palindromic sequence. Restriction enzymes are naturally found in different types of bacteria, which use them to protect themselves from foreign DNA. Many have been isolated and are sold for use in lab work. [Figure: a blunt-end cut and a sticky-end cut.]
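"Palindromic" here means that the site reads the same as its own reverse complement. The sketch below checks that property and locates sites on one strand. It uses the EcoRI recognition sequence GAATTC as an example; the toy sequence and the function names are assumptions for illustration only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def is_palindromic(site):
    """A DNA palindrome equals its own reverse complement."""
    return site == reverse_complement(site)

def find_sites(seq, site):
    """Return 0-based start positions of every occurrence of site in seq."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions

print(is_palindromic("GAATTC"))                      # EcoRI site
print(find_sites("AAGAATTCTTGAATTCA", "GAATTC"))     # toy sequence
```

Because the site is palindromic, scanning one strand is enough: every occurrence on the opposite strand sits at the same position.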

15 DNA Fingerprint Basics. DNA fragments of different size will be produced by a restriction enzyme that cuts at the points shown by the arrows.

16 DNA fragments are then separated based on size using gel electrophoresis.
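The two steps (cutting at restriction sites, then separating fragments by size) can be sketched as follows. This is a single-strand toy model: the sequences, the `cut_offset` parameter and the function names are illustrative assumptions, and a gel is reduced to a list of fragment lengths.

```python
def digest(seq, site, cut_offset):
    """Cut seq at every occurrence of site, cut_offset bases into the site."""
    fragments = []
    previous_cut = 0
    start = seq.find(site)
    while start != -1:
        cut = start + cut_offset
        fragments.append(seq[previous_cut:cut])
        previous_cut = cut
        start = seq.find(site, start + 1)
    fragments.append(seq[previous_cut:])
    return fragments

def gel_pattern(fragments):
    """Fragment lengths, longest first (long fragments migrate the least)."""
    return sorted((len(f) for f in fragments), reverse=True)

# Two hypothetical individuals; B has lost the second GAATTC site.
individual_a = "AAGAATTCTTGAATTCA"
individual_b = "AAGAATTCTTGACTTCA"
print(gel_pattern(digest(individual_a, "GAATTC", 1)))
print(gel_pattern(digest(individual_b, "GAATTC", 1)))
```

A single base difference that destroys a site changes the banding pattern, which is what makes the patterns usable as a fingerprint.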

17 DNA Fingerprinting can be used in paternity testing or murder cases.

18 There are Tracks for it

19 Interspersed vs. Simple Repeats. From an evolutionary point of view, transposons and simple repeats are very different. Different instances of the same transposon share common ancestry (but not necessarily a direct common progenitor). Different instances of the same simple repeat most often do not.

20 Genome Content, Genome Function (DONE). Transcripts: protein coding genes, non-coding RNAs. Gene regulatory elements: promoters, enhancers, repressors, insulators. Epigenomics: nucleosomes, open chromatin, histone modifications. Repeats: interspersed repeats / mobile elements, simple repeats.

21 Categories are NOT mutually exclusive. We already discussed repeat instances that became coding exons or enhancers. There are known genomic loci that code for protein coding exons and act as enhancers. Ditto for non-coding RNA + enhancer. There are bidirectional exons: coding in both directions, coding and anti-sense, or both non-coding.

22 [The raw genomic sequence from slide 3 is shown again.]

23 Comparative Genomics. “Nothing in Biology Makes Sense Except in the Light of Evolution” (Theodosius Dobzhansky). [Figure: species tree over human, mouse, rat, chimp, chicken, fugu, zfish, dog, tetra, opossum, cow, macaque, platypus; axis t.]

24 The genome is constantly replicated. Every cell holds 2 copies of all its DNA = its genome. The human body is made of ~$10^{13}$ cells. All originate from a single cell through repeated cell divisions. [Figure: egg, cell division, chicken; cell, genome = all DNA, DNA strings = chromosomes; chicken (DNA) ≈ $10^{13}$ copies of egg (DNA).]
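As a back-of-the-envelope illustration of "repeated cell divisions", the sketch below computes how many rounds of doubling are needed to reach a given cell count. It assumes, purely for simplicity, that every cell divides in every round and that no cell dies; real development is not that regular.

```python
import math

def min_doubling_rounds(target_cells):
    """Smallest k with 2**k >= target_cells, assuming synchronous doubling."""
    return math.ceil(math.log2(target_cells))

rounds = min_doubling_rounds(10**13)
print(rounds)
```

Under these assumptions a few dozen rounds suffice, yet the total number of genome copies made along the way is on the order of the final cell count, which is why replication mistakes have so many opportunities to occur.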

25 Evolution = Mutation + Selection. Mistakes can happen during DNA replication. Mistakes are oblivious to DNA segment function. But then selection kicks in. This has bad implications (disease) and good implications (adaptation). [Figure: the egg sequence ...ACGTACGACTGACTAGCATCGACTACGA... is copied into the chicken sequence ...ACGTACGACTGACTAGCATCGACTACGA..., with mutations TT and CAT marked; in the functional segment many changes are not tolerated, in the junk segment “anything goes”.]
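A toy simulation of the idea that mutation is blind to function while selection is not. Everything here is an assumption made for illustration: the choice of which half of the sequence is "functional", the rule that every mutation in the functional part is rejected, and the function name `evolve`.

```python
import random

def evolve(seq, functional, n_mutations, rng):
    """Apply random substitutions; selection rejects hits in `functional`.

    functional is a half-open (start, end) interval of protected positions.
    Returns the descendant sequence and the number of accepted mutations.
    """
    bases = "ACGT"
    descendant = list(seq)
    start, end = functional
    accepted = 0
    for _ in range(n_mutations):
        # Mutation is oblivious to function: any position can be hit.
        pos = rng.randrange(len(descendant))
        new_base = rng.choice([b for b in bases if b != descendant[pos]])
        if start <= pos < end:
            continue  # selection: the mutant does not survive
        descendant[pos] = new_base  # junk: "anything goes"
        accepted += 1
    return "".join(descendant), accepted

ancestor = "ACGTACGACTGACTAGCATCGACTACGA"
descendant, accepted = evolve(ancestor, (0, 14), 50, random.Random(0))
print(ancestor)
print(descendant, accepted)
```

After many generations the protected segment is unchanged while the rest has drifted, which is the signal comparative genomics looks for.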

26 Mutation

27 Chromosomal (i.e. big) Mutations. Five types exist: deletion, inversion, duplication, translocation, nondisjunction.

28 Deletion: due to breakage, a piece of a chromosome is lost.

29 Inversion: a chromosome segment breaks off, flips around backwards, and reattaches.

30 Duplication: occurs when a genomic region is repeated.

31 Whole Genome Duplication at the Base of the Vertebrate Tree. [Figure labels: Xen.Laevis, WGD.]

32 Translocation: involves two chromosomes that aren’t homologous. Part of one chromosome is transferred to another chromosome.

33 Nondisjunction: failure of chromosomes to separate during meiosis. Causes a gamete to have too many or too few chromosomes. Disorders: Down Syndrome (three 21st chromosomes), Turner Syndrome (single X chromosome), Klinefelter’s Syndrome (XXY chromosomes).

34 Genomic (i.e. small) Mutations. Six types exist: substitution (e.g. G → T), deletion, insertion, inversion, duplication, translocation.
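The six small mutation types can each be written as a simple string operation. The sketch below is one possible reading, with 0-based half-open coordinates and function names chosen for illustration; inversion is modelled as replacing a segment by its reverse complement.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def substitution(seq, pos, base):
    return seq[:pos] + base + seq[pos + 1:]

def deletion(seq, start, end):
    return seq[:start] + seq[end:]

def insertion(seq, pos, piece):
    return seq[:pos] + piece + seq[pos:]

def inversion(seq, start, end):
    # The segment is flipped, so it is read from the opposite strand.
    flipped = seq[start:end].translate(COMPLEMENT)[::-1]
    return seq[:start] + flipped + seq[end:]

def duplication(seq, start, end):
    # Tandem duplication: the copy lands right after the original.
    return seq[:end] + seq[start:end] + seq[end:]

def translocation(seq, start, end, dest):
    # Move a segment; dest is a position in the sequence without the segment.
    piece = seq[start:end]
    rest = seq[:start] + seq[end:]
    return rest[:dest] + piece + rest[dest:]

seq = "AGGCCTC"
print(substitution(seq, 3, "A"), deletion(seq, 3, 4), insertion(seq, 3, "G"))
print(inversion(seq, 0, 3), duplication(seq, 0, 3), translocation(seq, 0, 3, 4))
```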

35 Example: Human-Chimp Genomic Differences. [Table: number of events by type. Rows: nucleotide substitutions; indels < 10 Kb; microinversions < 100 Kb; deletions/duplications; microinversions > 100 Kb; pericentric inversions; fusion.]

36 Inferring Genomic Mutations From Alignments of Genomes

37 A gene tree evolves with respect to a species tree. By “gene” we mean any piece of DNA. [Figure: a gene tree drawn inside a species tree, with speciation, duplication and loss events marked.]

38 Terminology. Orthologs: genes related via speciation (e.g. C, M, H3). Paralogs: genes related through duplication (e.g. H1, H2, H3). Homologs: genes that share a common origin (e.g. C, M, H1, H2, H3). [Figure: gene tree inside a species tree, descending from a single ancestral gene, with speciation, duplication and loss events marked.]
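One common way to make these definitions operational is to label every internal node of a gene tree with the event that created it and to classify two genes by the event at their last common ancestor. The sketch below does that on a hypothetical tree (not the one in the slide's figure): a duplication followed by a human/mouse speciation in each copy. The tuple encoding and the gene names are assumptions for illustration.

```python
def leaves(node):
    """Set of gene names below a node; a leaf is a plain string."""
    if isinstance(node, str):
        return {node}
    _event, left, right = node
    return leaves(left) | leaves(right)

def relation(tree, a, b):
    """Classify two distinct genes by the event at their last common ancestor."""
    if a == b or not {a, b} <= leaves(tree):
        raise ValueError("need two distinct genes from the tree")
    node = tree
    while True:
        event, left, right = node
        if {a, b} <= leaves(left):
            node = left
        elif {a, b} <= leaves(right):
            node = right
        else:
            # a and b split at this node, so it is their last common ancestor
            return "orthologs" if event == "speciation" else "paralogs"

# Hypothetical gene tree: (event, left subtree, right subtree)
tree = ("duplication",
        ("speciation", "human_A", "mouse_A"),
        ("speciation", "human_B", "mouse_B"))

print(relation(tree, "human_A", "mouse_A"))
print(relation(tree, "human_A", "human_B"))
print(relation(tree, "human_A", "mouse_B"))
```

All four genes in this tree are homologs, since they descend from a single ancestral gene.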

39 Typical Molecular Distances. If they were only evolving neutrally: To which is H1 closer in sequence, H2 or H3? To which H is M closest? And C? (Selection may change distances.) [Figure: gene tree inside a species tree, descending from a single ancestral gene, with speciation, duplication and loss events marked.]

40 Gene trees and even species trees are figments of our (scientific) imagination. Species trees and gene trees can be wrong. All we really have are extant observations, and fossils. [Figure: the same gene tree and species tree, with the extant leaves marked Observed and the rest marked Inferred.]

41 Gene Families

42 Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition: Given two strings $x = x_1 x_2 \ldots x_M$, $y = y_1 y_2 \ldots y_N$, an alignment is an assignment of gaps to positions $0, \ldots, M$ in $x$, and $0, \ldots, N$ in $y$, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.

43 Scoring Function
Sequence edits on AGGCCTC: Mutations → AGGACTC; Insertions → AGGGCCTC; Deletions → AGG.CTC
Scoring function: Match: $+m$; Mismatch: $-s$; Gap: $-d$
Score $F = (\#\,\text{matches}) \times m - (\#\,\text{mismatches}) \times s - (\#\,\text{gaps}) \times d$
Alternative definition: minimal edit distance. “Given two strings x, y, find minimum # of edits (insertions, deletions, mutations) to transform one string to the other.” Cost of edit operations needs to be biologically inspired (e.g. DEL length). Solve via Dynamic Programming.
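A minimal sketch of the scoring function and of the dynamic program that maximizes it (the classic global alignment recurrence, often attributed to Needleman and Wunsch). The default values m = s = d = 1 and the function names are assumptions for illustration; a biologically inspired scheme would, for example, charge gaps by length differently.

```python
def score_alignment(row_x, row_y, m=1, s=1, d=1):
    """Score F of a given gapped alignment (two rows of equal length)."""
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total -= d          # gap
        elif a == b:
            total += m          # match
        else:
            total -= s          # mismatch
    return total

def global_alignment_score(x, y, m=1, s=1, d=1):
    """Best achievable F over all alignments of x and y, by dynamic programming."""
    rows, cols = len(x) + 1, len(y) + 1
    # F[i][j] = best score aligning the first i letters of x with the first j of y
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = -i * d
    for j in range(1, cols):
        F[0][j] = -j * d
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = F[i - 1][j - 1] + (m if x[i - 1] == y[j - 1] else -s)
            F[i][j] = max(diagonal, F[i - 1][j] - d, F[i][j - 1] - d)
    return F[-1][-1]

print(global_alignment_score("AGGCCTC", "AGGACTC"))   # one mutation
print(global_alignment_score("AGGCCTC", "AGGCTC"))    # one deletion
```

The table has (M+1) × (N+1) cells and each is filled in constant time, so the running time is O(MN).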

44 Are two sequences homologous?
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Given an (optimal) alignment between two genome regions, you can ask: what is the probability that they are (not) related by homology? Note that (when known) the answer is a function of the molecular distance between the two (e.g., between two species). [Figure: DP matrix.]

45 Chaining Alignments. Chaining highlights homologous regions between genomes (it bridges the gulf between syntenic blocks and base-by-base alignments). Local alignments tend to break at transposon insertions, inversions, duplications, etc. Global alignments tend to force non-homologous bases to align. Chaining is a rigorous way of joining together local alignments into larger structures. [Figures: dot plots; DP matrix.]

46 “Raw” (B)lastz track (no longer displayed). Protease Regulatory Subunit 3. Alignment = homologous regions.

47 Chains & Nets: How they’re built. 1: Blastz one genome to another. Local alignment algorithm; finds short blocks of similarity.
Hg18: AAAAAACCCCCAAAAA
Mm8: AAAAAAGGGGG
Hg18.1-6 + AAAAAA aligns to Mm8.1-6 + AAAAAA
Hg18.7-11 + CCCCC aligns to Mm8.1-5 - CCCCC
Hg18.12-16 + AAAAA aligns to Mm8.1-5 + AAAAA
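The toy blocks on this slide can be reproduced with exact string matching on both strands. The sketch below is not blastz; it only looks up where a given word occurs in the query, reporting 1-based inclusive coordinates. Reading "Mm8.1-5 -" as coordinates on the reverse-complemented Mm8 sequence is an assumption of this sketch, as are the function names.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def strand_hits(word, query):
    """Return (start, end, strand) for every exact occurrence of word.

    Coordinates are 1-based and inclusive; for the "-" strand they are
    measured on the reverse complement of query.
    """
    hits = []
    for strand, text in (("+", query), ("-", reverse_complement(query))):
        start = text.find(word)
        while start != -1:
            hits.append((start + 1, start + len(word), strand))
            start = text.find(word, start + 1)
    return hits

mm8 = "AAAAAAGGGGG"
print(strand_hits("AAAAAA", mm8))   # Hg18.1-6
print(strand_hits("CCCCC", mm8))    # Hg18.7-11
print(strand_hits("AAAAA", mm8))    # Hg18.12-16
```

The CCCCC block only matches on the minus strand, which is why a local aligner has to search both orientations.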

48 Chains & Nets: How they’re built. 2: “Chain” alignment blocks together. Links blocks that preserve order and orientation. Not single coverage in either species.
Hg18: AAAAAACCCCCAAAAA
Mm8: AAAAAAGGGGGAAAAA
Mm8 chains shown against Hg18 (AAAAAACCCCCAAAAA): Mm8.1-6 +, Mm8.7-11 -, Mm8.12-16 +, Mm8.12-15 +, Mm8.1-5 +

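49 Another Chain Example. [Figure: an ancestral sequence with blocks A B C D E, and the human and mouse sequences derived from it, with an additional block B’. In the human browser the human sequence is implicit and the mouse chains are drawn against it; in the mouse browser the mouse sequence is implicit and the human chains are drawn against it.]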

50 The Use of an Outgroup. [Figure: an outgroup sequence with blocks A B C D E, alongside the human and mouse sequences, with an additional block B’. In the human browser the human sequence is implicit and the mouse chains are drawn against it; in the mouse browser the mouse sequence is implicit and the human chains are drawn against it.]

51 Chains join together related local alignments. Protease Regulatory Subunit 3. [Figure annotations: likely ortholog; likely paralogs; shared domain?]

52 Chains. A chain is a sequence of gapless aligned blocks, where there must be no overlaps of blocks' target or query coords within the chain. Within a chain, target and query coords are monotonically non-decreasing (i.e. always increasing or flat). Double-sided gaps are a new capability (blastz can't do that) that allow extremely long chains to be constructed. Not just orthologs, but paralogs too, can result in good chains, but that's useful! Chains should be symmetrical: e.g. swap human-mouse -> mouse-human chains, and you should get approximately the same chains as if you chain swapped mouse-human blastz alignments. Chained blastz alignments are not single-coverage in either target or query unless some subsequent filtering (like netting) is done. Chain tracks can contain massive pileups when a piece of the target aligns well to many places in the query. Common causes of this include insufficient masking of repeats and high-copy-number genes (or paralogs). [Angie Hinrichs, UCSC wiki]
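The structural rules in this definition (gapless blocks, no overlaps, coordinates that never go backwards) can be checked mechanically. The sketch below uses a hypothetical block representation, a tuple (t_start, t_end, q_start, q_end) with half-open coordinates; it is not the UCSC chain file format.

```python
def is_valid_chain(blocks):
    """Check the structural rules of a chain on a list of aligned blocks."""
    for t_start, t_end, q_start, q_end in blocks:
        # Gapless: a block covers the same number of bases on both sides.
        if t_end <= t_start or t_end - t_start != q_end - q_start:
            return False
    for previous, current in zip(blocks, blocks[1:]):
        # No overlaps, and coordinates never decrease in target or query.
        if current[0] < previous[1] or current[2] < previous[3]:
            return False
    return True

print(is_valid_chain([(0, 6, 0, 6), (11, 16, 11, 16)]))   # simple chain
print(is_valid_chain([(0, 6, 0, 6), (10, 15, 20, 25)]))   # double-sided gap
print(is_valid_chain([(0, 6, 10, 16), (6, 11, 0, 5)]))    # query goes backwards
```

The second example has a gap on both the target and the query side between its two blocks, which is the "double-sided gap" the text describes.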

53 Before and After Chaining

54 Chaining Algorithm. Input: blocks of gapless alignments from blastz. Dynamic program based on the recurrence relationship: $\mathrm{score}(B_i) = \max_{j<i}\big(\mathrm{score}(B_j) + \mathrm{match}(B_i) - \mathrm{gap}(B_i, B_j)\big)$. Uses Miller’s KD-tree algorithm to minimize which parts of the dynamic programming graph to traverse. Timing is $O(N \log N)$, where $N$ is the number of blocks (which is in the hundreds of thousands).
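The recurrence can be evaluated directly with a quadratic-time dynamic program. The sketch below does exactly that and deliberately omits the KD-tree speedup, so it runs in O(N²) rather than O(N log N). The block tuple (t_start, t_end, q_start, q_end, match_score), the linear gap cost and the function name are all assumptions for illustration; a block may also start a chain on its own.

```python
def best_chain(blocks, gap_cost=1):
    """Return (score, chain) for the highest-scoring chain of blocks.

    Each block is (t_start, t_end, q_start, q_end, match_score), with
    half-open coordinates. gap(B_i, B_j) is modelled as gap_cost times the
    number of unaligned bases between the two blocks on both sequences.
    """
    blocks = sorted(blocks)              # by target start
    if not blocks:
        return 0, []
    score = [block[4] for block in blocks]   # a block alone is a chain
    back = [None] * len(blocks)
    for i, (t_i, _, q_i, _, match_i) in enumerate(blocks):
        for j in range(i):
            _, t_j_end, _, q_j_end, _ = blocks[j]
            # B_j may precede B_i only if it ends before B_i on both sequences
            if t_j_end <= t_i and q_j_end <= q_i:
                gap = gap_cost * ((t_i - t_j_end) + (q_i - q_j_end))
                candidate = score[j] + match_i - gap
                if candidate > score[i]:
                    score[i], back[i] = candidate, j
    end = max(range(len(blocks)), key=score.__getitem__)
    chain = []
    k = end
    while k is not None:                 # follow back-pointers to the start
        chain.append(blocks[k])
        k = back[k]
    return score[end], chain[::-1]

blocks = [(0, 6, 0, 6, 60), (6, 11, 30, 35, 50), (11, 16, 11, 16, 50)]
print(best_chain(blocks))
```

In this toy input the middle block lies far away in the query, so the best chain skips it and pays a small gap cost instead.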

