Very important to know the difference between the trees!

Species tree vs. gene/protein tree
Very important to know the difference between the trees! The trees can be very different, since genes can have their own histories.
a. A gene tree is based on a set of orthologous genes (i.e. related by a common ancestor). Often (but certainly not always) the gene tree is similar to the species tree.
b. A species tree is meant to represent the historical relationship between species. Want to build it on characters that reflect time since divergence.
In the genomic age, often use as many genes as possible (hundreds to thousands) to generate a species tree: Phylogenomics

Phylogenomics: Using whole-genome information to reconstruct the Tree of Life
Several approaches:
1. Concatenate many gene sequences and treat them as one: use a ‘super matrix’ of variable sequence characters.
2. Construct many separate trees, one for each gene, and then compare: often construct a ‘super tree’ that is built from all single trees.
3. Incorporate non-sequence characters like synteny, intron structure, etc.
The goal is to use many different numbers and types of characters to avoid being misled about the relationship between species. It is now recognized that different regions of the genome can have distinct histories.
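
Approach 1 (the ‘super matrix’) can be sketched in a few lines. This is only a minimal illustration, assuming each gene has already been aligned and is stored as a dictionary from species name to aligned sequence; the function name, the toy sequences, and the choice to pad missing genes with gap characters are assumptions, not part of the slides.

```python
def concatenate_alignments(gene_alignments):
    """Join per-gene alignments into one supermatrix.

    gene_alignments: list of dicts mapping species name -> aligned sequence.
    A species missing from a gene is padded with '-' for that gene (an
    illustrative choice, not the only way to handle missing data).
    """
    species = sorted({sp for aln in gene_alignments for sp in aln})
    parts = {sp: [] for sp in species}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one gene alignment must have equal length")
        (length,) = lengths
        for sp in species:
            parts[sp].append(aln.get(sp, "-" * length))
    return {sp: "".join(chunks) for sp, chunks in parts.items()}


# Toy data: two genes, with gene_b missing from one species.
gene_a = {"human": "ATGC", "chimp": "ATGA", "mouse": "ATTA"}
gene_b = {"human": "GGC", "mouse": "GAC"}
supermatrix = concatenate_alignments([gene_a, gene_b])
for name, row in supermatrix.items():
    print(name, row)
```

The resulting rows would then be handed to a tree-building program as if they were a single long gene.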

A few other key basic concepts:
Selection acts on phenotypes, based on their fitness cost/advantage, to affect the population frequencies of the underlying genotypes.
In the case of DNA sequence:
Neutral substitutions = no effect on fitness, so not acted on by selection
* Given a ~constant mutation rate, can convert the # of substitutions into time of divergence since speciation = molecular clock theory.
Deleterious substitutions = fitness cost
* These are removed by purifying (negative) selection
Advantageous substitutions = fitness advantage
* These alleles are enriched for through adaptive (positive) selection
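
The molecular clock idea can be put into numbers. If two lineages differ by K substitutions per site, and each lineage accumulates neutral substitutions at a constant rate r per site per year, the time since they split is roughly K / (2r), because both lineages have been changing since the split. The sketch below assumes an illustrative rate and toy sequences (neither is from the slides), and the simple proportion of differing sites makes no correction for multiple substitutions at the same site.

```python
def p_distance(seq1, seq2):
    """Proportion of aligned sites that differ (no multiple-hit correction)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return differences / len(seq1)


def divergence_time(substitutions_per_site, rate_per_site_per_year):
    """Years since divergence under a strict molecular clock: t = K / (2r)."""
    return substitutions_per_site / (2.0 * rate_per_site_per_year)


# Toy aligned sequences and an assumed rate, for illustration only.
k = p_distance("ACGTACGTAC", "ACGTACGTAT")
assumed_rate = 1e-9  # substitutions per site per year (hypothetical value)
print(k, divergence_time(k, assumed_rate))
```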

Evolutionary genomics relies on one or more quality genome sequences. The quality of a genome sequence can dramatically affect evolutionary interpretations: bad genome = bad evolutionary inference. Therefore, it’s important to know what makes a good genome sequence.

Anatomy of a Genome Project
Sequencing
* De novo vs. ‘resequencing’
* Sanger WGS versus ‘next generation’ sequencing
* High versus low sequence coverage
Assembly
* Draft assembly
* Gap closure
Annotation
* Gene, intron, RNA prediction
* De novo vs. homology-based prediction
* Assessing confidence
Comparison
* Comparing gene content, lineage specific gene loss, gain, emergence
* Comparing genome structure (chromosomes, breakpoints, etc)
* Comparing evolutionary rates of change (rates of amino-acid, nucleotide substitution)

Sequencing Approaches Old school: Sanger Whole Genome Shotgun (WGS)

The coverage of a genome = average coverage across all base pairs
Overlapping sequencing ‘reads’ are assembled into a ‘contig’. [Figure: stacked reads, labelled ‘10-fold representation at this point’ and ‘2-fold representation at this point’]
8 to >10-fold is typically considered high coverage; 1-3-fold is considered low coverage.
** Even high average coverage can include ‘gaps’ (i.e. regions with NO coverage)
See the Lander-Waterman formula (a Poisson model that incorporates the number and length of reads, size of genome, coverage, and amount of overlap between reads).
For 500 Mb target, 600 bp read length, 5X coverage: ~29k gaps; 10X coverage: 393 gaps
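
The gap estimates above can be approximated with the Lander-Waterman expectation that N reads at coverage c leave about N·e^(−cσ) contigs (and so roughly that many gaps), where σ = 1 − T/L and T is the minimum overlap needed to join two reads. The sketch below assumes T = 0 by default, since the slide does not state which overlap was used; with that assumption, a 500 Mb target and 600 bp reads give roughly 28,000 gaps at 5X and roughly 380 at 10X, the same order as the figures quoted above. Function and parameter names are illustrative.

```python
import math


def coverage(num_reads, read_len, genome_size):
    """Average coverage = total sequenced bases / genome size."""
    return num_reads * read_len / genome_size


def expected_gaps(genome_size, read_len, cov, min_overlap=0):
    """Lander-Waterman expected number of contigs (~ number of gaps).

    min_overlap is the number of bases two reads must share to be joined;
    0 is an assumption made here for simplicity.
    """
    num_reads = cov * genome_size / read_len
    sigma = 1.0 - min_overlap / read_len
    return num_reads * math.exp(-cov * sigma)


for read_len in (600, 100):
    for cov in (5, 10):
        gaps = expected_gaps(500e6, read_len, cov)
        print(f"{read_len} bp reads, {cov}X coverage: about {gaps:,.0f} gaps")
```

Shorter reads leave more gaps at the same coverage because more reads are needed, and each read end is another chance for a break.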

Sequencing Approaches
Advantages of Sanger Whole Genome Shotgun (WGS)
* High quality sequence data
* Individual sequence reads are long (~1,000 bp)
* WGS is less work than map-based sequencing
Disadvantages of Sanger Whole Genome Shotgun (WGS)
* Still a lot of processing involved
* Sanger sequencing is expensive and slow (gel-based sequencing)
* True WGS sequencing requires good sequencing to get assembly to work

Sequencing Approaches
New technology: ‘Next generation’ sequencing
Includes ‘454’, Illumina/Solexa, SOLiD, and other types of sequencing
Advantages:
* New technology is much cheaper per genome
* Generates a huge amount of sequence per run
Disadvantages:
* Has a higher sequencing error rate per base pair
* Generates short reads (100 - 200 bp) - more challenging assembly
* Generates a huge amount of sequence (and massive data files) per run

Several different ‘next-generation’ sequencing methods
454: emulsion sequencing per well: > 500 bp read length
SOLiD: emulsion amplification, bead attachment to solid surface; ligation-based sequencing interrogates each base in 2 ligation reactions

Solexa (Illumina) Sequencing >100 bp read length

Next-generation (deep) sequencing
* Very high (>100X) coverage
* Much cheaper per bp covered
* Rapid improvements in technology (including single-molecule sequencing)
But:
* Much higher error rate (~1%)
* Short reads cause assembly challenges
* Some require prior amplification
* Sequence-specific bias in sequencing efficiency
For 500 Mb target, 100 bp read length, 5X coverage: ~168k gaps; 10X coverage: 2,250 gaps

Sanger WGS de novo assembly: ‘reads’ → ‘contigs’ → ‘scaffold’
Goal is to have no gaps & complete scaffolds (chromosomes)
Challenges:
* Some regions are difficult to sequence through (centromeres, heterochromatin, etc)
* Repetitive regions make assembly difficult/ambiguous

‘Next-generation sequencing’ assembly
De novo assembly
** Short read length a real challenge
OR
Matching to a ‘reference’ genome
* paired-end reads
Challenges: can have lots of gaps, miss any new sequence not in the reference, repetitive regions not sequenced well, can totally miss structural rearrangements

Genome Annotation: predicting genetic features
‘Simplest’ predictions: Open Reading Frames (ORFs)
- De novo predictions: based on expectation of what ORFs should look like
GCGCTTACGTTATCTGCAATATGTCTTCGAACGATTCGAACGATACCG
ACAAGCAACATACACGTCTGGATCCTACCGGTGTGGACGACGCCTACA
TTCCTCCGGAGCAGCCGGAAACAAAGCACCATCGCTTTAAAATCTCTA
AAGACACCCTGAGAAACCACTTTATCGCTGCGGCCGGTGAGTTCTGCG
GCACATTCATGTTTTTATGGTGCGCTTACGTTATCTGCAATGTCGCTA
ACCATGATGTCGCATAACCGAGATGAGATGATAAAAAA
Features of ORFs used in computational predictions:
* Start with ATG
* End with stop codon (e.g. TAA)
* Should be in one frame (i.e. length divisible by 3 for each codon)
* Have a size range (max. size can be >10 kb, min. size can be 30 bp; median is probably ~few kb depending on organism)
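
The ORF features listed above translate almost directly into a naive scanner. The following is a minimal sketch, not a real gene finder: it assumes the standard stop codons (TAA, TAG, TGA), scans only the three forward-strand frames, ignores introns, and uses the 30 bp minimum from the slide as its default. The function name and return format are illustrative choices.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_len=30):
    """Return (start, end, sequence) for forward-strand ORFs.

    An ORF here starts at the first ATG seen in a frame, runs codon by
    codon, and ends at the first in-frame stop codon (stop included).
    """
    dna = "".join(dna.split()).upper()  # drop spaces/line breaks
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if end - start >= min_len:
                    orfs.append((start, end, dna[start:end]))
                start = None
    return orfs


# The example sequence from the slide, split as it appears there.
slide_seq = (
    "GCGCTTACGTTATCTGCAATATGTCTTCGAACGATTCGAACGATACCG"
    "ACAAGCAACATACACGTCTGGATCCTACCGGTGTGGACGACGCCTACA"
    "TTCCTCCGGAGCAGCCGGAAACAAAGCACCATCGCTTTAAAATCTCTA"
    "AAGACACCCTGAGAAACCACTTTATCGCTGCGGCCGGTGAGTTCTGCG"
    "GCACATTCATGTTTTTATGGTGCGCTTACGTTATCTGCAATGTCGCTA"
    "ACCATGATGTCGCATAACCGAGATGAGATGATAAAAAA"
)
for start, end, seq in find_orfs(slide_seq):
    print(start, end, len(seq))
```

A real annotation pipeline would also scan the reverse complement and weigh codon usage and length statistics, which is where the ‘expectation of what ORFs should look like’ comes in.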

Genome Annotation: predicting genetic features
‘Simplest’ predictions: Open Reading Frames (ORFs)
- De novo predictions: based on expectation of what ORFs should look like
GCGCTTACGTTATCTGCAATATGTCTTCGAACGATTCGAACGTCGAAT
AGACGATGAGACGAGATAGAGCGAGCAAAAGGTAGGATACCGACAA
GCAACATACACGTCTGGATCCTACCGGTGTGGACGACGCCTACATTCC
TCCGGAGCAGCCGGAAACAAAGCACCATCGCTTTAAAATCTCTAAAGA
CACCCTGAGAAACCACTTTATCGCTGCGGCCGGTGAGTTCTGCGGCAC
ATTCATGTTTTTATGGTGCGCTTACGTTATCTGCAATGTCGCTAACCA
TGATGTCGCATAACCGAGATGAGATGATAAAAAA
Many ORFs have introns - splice junction signals are short and variable = difficult to predict.
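
One way to see why short splice signals are hard to use on their own is to count how many places could match them. The sketch below assumes the canonical rule that introns begin with GT and end with AG (this rule is not stated on the slide) and lists every GT...AG span above a minimum length; the function name and the length cutoff are illustrative.

```python
def candidate_introns(dna, min_len=20):
    """List (start, end) spans that begin with GT and end with AG.

    Only the two-base canonical signals are checked, so most candidates
    returned for a real sequence will be false positives.
    """
    dna = "".join(dna.split()).upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptor_ends = [i + 2 for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return [(d, a) for d in donors for a in acceptor_ends if a - d >= min_len]


# Small toy sequence: one GT...AG span of at least 6 bases.
print(candidate_introns("AAGTCCCCCCAGTT", min_len=6))
```

On a sequence of a few hundred bases this typically returns many candidate spans, which is why gene predictors combine splice signals with other evidence such as reading frame and homology.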

Genome Annotation: predicting genetic features
‘Simplest’ predictions: Open Reading Frames (ORFs)
- De novo predictions: based on expectation of what ORFs should look like
- Homology-based assignments: find sequences homologous to known ORFs/proteins
Homologous protein: Met Ser Ser Gln Asp Ser Asn Asp Ser Asp Lys Gln …
Translated sequence: Met Ser Ser Asn Asp Ser Asn (before the intron) … Asp Thr Asp Lys Gln .. (after the intron)
GCGCTTACGTTATCTGCAATATGTCTTCGAACGATTCGAACGTCGAAT
AGACGATGAGACGAGATAGAGCGAGCAAAAGGTAGGATACCGACAA
GCAACATACACGTCTGGATCCTACCGGTGTGGACGACGCCTACATTCC
TCCGGAGCAGCCGGAAACAAAGCACCATCGCTTTAAAATCTCTAAAGA
CACCCTGAGAAACCACTTTATCGCTGCGGCCGGTGAGTTCTGCGGCAC
ATTCATGTTTTTATGGTGCGCTTACGTTATCTGCAATGTCGCTAACCA
TGATGTCGCATAACCGAGATGAGATGATAAAAAA
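
The homology comparison on this slide can be reproduced by translating the two coding stretches of the sequence and lining the result up against the known protein. The sketch below assumes the standard genetic code, takes the two exon strings directly from the slide sequence (the bases before and after the inserted stretch), and uses one-letter amino-acid codes; the function names and the simple position-by-position identity score are illustrative, and real homology searches use alignment tools rather than this direct comparison.

```python
from itertools import product

# Standard genetic code, built in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a coding sequence codon by codon (length must be a multiple of 3)."""
    if len(dna) % 3 != 0:
        raise ValueError("coding sequence length must be divisible by 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def identity(protein_a, protein_b):
    """Fraction of identical positions between two equal-length proteins."""
    if len(protein_a) != len(protein_b):
        raise ValueError("proteins must have the same length for this simple comparison")
    matches = sum(1 for a, b in zip(protein_a, protein_b) if a == b)
    return matches / len(protein_a)


exon1 = "ATGTCTTCGAACGATTCGAAC"  # from the slide sequence, before the intron
exon2 = "GATACCGACAAGCAA"        # from the slide sequence, after the intron
predicted = translate(exon1 + exon2)
known = "MSSQDSNDSDKQ"           # Met Ser Ser Gln Asp Ser Asn Asp Ser Asp Lys Gln
print(predicted, known, identity(predicted, known))
```

Because the predicted and known proteins agree at most positions on both sides of the intron, the match supports both the gene call and the placement of the splice junction.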

Genome Annotation: predicting genetic features ‘Simplest’ predictions: Open Reading Frames (ORFs) - De novo predictions: based on expectation of what ORFs should look like - Homology-based assignments: find sequences homologous to known ORFs/proteins - Matches to cDNA library or RNA transcripts from sequencing RNA transcript GCGCTTACGTTATCTGCAATATGTCTTCGAACGATTCGAACGTCGAAT AGACGATGAGACGAGATAGAGCGAGCAAAAGGTAGGATACCGACAA GCAACATACACGTCTGGATCCTACCGGTGTGGACGACGCCTACATTCC TCCGGAGCAGCCGGAAACAAAGCACCATCGCTTTAAAATCTCTAAAGA CACCCTGAGAAACCACTTTATCGCTGCGGCCGGTGAGTTCTGCGGCAC ATTCATGTTTTTATGGTGCGCTTACGTTATCTGCAATGTCGCTAACCA TGATGTCGCATAACCGAGATGAGATGATAAAAAA …

Genome Annotation: predicting genetic features
Other Predictions:
* Open Reading Frames (ORFs)
* Non-coding RNAs (tRNAs, rRNA, other small RNAs, miRNAs, etc.)
* Regulatory elements (ENCODE project)
* Transposable elements (TEs)
* Origins of DNA replication

In what ways can a bad genome sequence affect the following comparisons?
* Genome size, organization (chromosomes/plasmids), structure
* Gene/ncRNA content: number of genes, duplicates, size of gene families, etc
* Sequence differences related to: gene evolution, regulatory evolution
* RNA & protein abundance across species, for all RNAs/proteins