Very important to know the difference between the trees!

Species tree vs. gene/protein tree
Trees can be very different, since genes can have their own histories.
a. A gene tree is based on a set of orthologous genes (i.e. genes in different species that descend from a single gene in their common ancestor).
   Often (but certainly not always) the gene tree is similar to the species tree.
b. A species tree is meant to represent the historical relationships between species.
   Want to build it on characters that reflect time since divergence.
   In the genomic age, often use as many genes as possible (hundreds to thousands) to generate a species tree: phylogenomics.

Phylogenomics: Using whole-genome information to reconstruct the Tree of Life
Several approaches:
1. Concatenate many gene sequences and treat them as one: use a ‘super matrix’ of variable sequence characters.
2. Construct many separate trees, one for each gene, and then compare them: often construct a ‘super tree’ that is built from all the single trees.
3. Incorporate non-sequence characters like synteny, intron structure, etc.
The goal is to use a large number and many types of characters to avoid being misled about the relationships between species.
It is now recognized that different regions of the genome can have distinct histories.
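The concatenation approach (approach 1) can be sketched in a few lines. This is a minimal illustration, not a production pipeline: it assumes each gene has already been aligned, the taxon names and sequences are made-up toy data, and padding absent genes with '?' is one common convention for missing data rather than the only option.

```python
def concatenate_alignments(gene_alignments):
    """Concatenate per-gene alignments into one 'super matrix'.

    gene_alignments: list of dicts mapping taxon name -> aligned sequence.
    A taxon that lacks a gene is padded with '?' for that gene's columns.
    """
    # Collect every taxon that appears in at least one gene alignment.
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one gene alignment must have equal length")
        gene_len = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, "?" * gene_len))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Hypothetical toy alignments (not real data).
gene_a = {"human": "ATGC", "mouse": "ATGT", "yeast": "ATCC"}
gene_b = {"human": "GGA", "yeast": "GCA"}  # mouse is missing this gene

matrix = concatenate_alignments([gene_a, gene_b])
for taxon, seq in matrix.items():
    print(taxon, seq)
```

Every row of the resulting matrix has the same length, so it can be handed to a tree-building method as if it were a single long gene.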

A few other key basic concepts:
Selection acts on phenotypes, based on their fitness cost/advantage, to affect the population frequencies of the underlying genotypes.
In the case of DNA sequence:
* Neutral substitutions = no effect on fitness, so they are not acted on by selection.
  Given a ~constant mutation rate, can convert the number of substitutions into time of divergence since speciation = molecular clock theory.
* Deleterious substitutions = fitness cost. These are removed by purifying (negative) selection.
* Advantageous substitutions = fitness advantage. These alleles are enriched through adaptive (positive) selection.
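The molecular clock idea can be made concrete with a small calculation. The sketch below assumes a strict clock (one constant neutral rate on both lineages); the function name and the numbers are hypothetical and chosen only for illustration.

```python
def divergence_time(subs_per_site, rate_per_site_per_year):
    """Estimate time since two lineages split under a strict molecular clock.

    subs_per_site: neutral substitutions per site separating the two sequences.
    rate_per_site_per_year: assumed constant substitution rate per lineage.
    Substitutions accumulate independently on BOTH lineages, hence the factor 2.
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    return subs_per_site / (2.0 * rate_per_site_per_year)


# Hypothetical values, not measurements from any real species pair.
distance = 0.02   # substitutions per site between the two sequences
rate = 1e-9       # substitutions per site per year, per lineage

t = divergence_time(distance, rate)
print(f"Estimated divergence time: {t:.2e} years")
```

With these placeholder values the estimate is about 10 million years. In practice the rate must be calibrated (for example from fossils), and rate variation between lineages weakens the strict-clock assumption.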

Evolutionary genomics relies on one or more quality genome sequences
Quality of a genome sequence can dramatically affect evolutionary interpretations.
Bad genome = bad evolutionary inference.
Therefore, it’s important to know what makes a good genome sequence.

Anatomy of a Genome Project
Sequencing
* De novo vs. ‘resequencing’
* Sanger WGS versus ‘next generation’ sequencing
* High versus low sequence coverage
Assembly
* Draft assembly
* Gap closure
Annotation
* Gene, intron, RNA prediction
* De novo vs. homology-based prediction
* Assessing confidence
Comparison
* Comparing gene content, lineage-specific gene loss, gain, emergence
* Comparing genome structure (chromosomes, breakpoints, etc.)
* Comparing evolutionary rates of change (rates of amino-acid, nucleotide substitution)

Sequencing Approaches
Old school: Sanger Whole Genome Shotgun (WGS)

The coverage of a genome = average coverage across all base pairs
Overlapping sequencing ‘reads’ are assembled into a ‘contig’.
[Figure: overlapping reads in a contig, with one position labeled as 10-fold representation and another as 2-fold representation]
* 8- to >10-fold is typically considered high coverage
* 1- to 3-fold is considered low coverage
** Even high average coverage can include ‘gaps’ (i.e. regions with NO coverage).
See the Lander-Waterman formula (based on a Poisson distribution; incorporates the number and length of reads, size of genome, coverage, and amount of overlap between reads).
For a 500 Mb target with 600 bp read length: ~29k gaps at 5X coverage; 393 gaps at 10X coverage.
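The gap estimates can be approximated with a simplified form of the Lander-Waterman model. The sketch below assumes reads land uniformly at random and ignores the minimum overlap needed to join two reads, so its numbers come out close to, but not identical to, the ones quoted on the slide. The function names are illustrative, and the 100 bp case is included only as a typical short-read length.

```python
import math


def coverage(num_reads, read_len, genome_size):
    """Average coverage c = N * L / G."""
    return num_reads * read_len / genome_size


def expected_gaps(genome_size, read_len, cov):
    """Approximate expected number of gaps as N * exp(-c).

    N = c * G / L is the number of reads. Under a Poisson model, exp(-c) is
    the chance that a read is not overlapped by a later read, which ends a
    contig. Simplification: the minimum detectable overlap is ignored.
    """
    num_reads = cov * genome_size / read_len
    return num_reads * math.exp(-cov)


def fraction_uncovered(cov):
    """Expected fraction of bases with zero coverage, exp(-c)."""
    return math.exp(-cov)


genome = 500e6  # 500 Mb target, as in the example above
for read_len in (600, 100):
    for cov in (5, 10):
        gaps = expected_gaps(genome, read_len, cov)
        print(f"{read_len} bp reads, {cov}X: ~{gaps:,.0f} gaps, "
              f"{fraction_uncovered(cov):.4%} of bases uncovered")
```

Under these simplifications the model gives roughly 28,000 and 380 gaps for 600 bp reads at 5X and 10X, and roughly 168,000 and 2,270 gaps for 100 bp reads. Doubling coverage cuts the gap count by about two orders of magnitude, while shortening the reads multiplies it.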

Advantages of Sanger Whole Genome Shotgun (WGS)
* High quality sequence data
* Individual sequence reads are long (~1,000 bp)
* WGS is less work than map-based sequencing
Disadvantages of Sanger Whole Genome Shotgun (WGS)
* Still a lot of processing involved
* Sanger sequencing is expensive and slow (gel-based sequencing)
* True WGS sequencing requires good sequencing to get assembly to work

New technology: ‘Next generation’ sequencing
Includes ‘454’, Illumina/Solexa, SOLiD, and other types of sequencing.
Advantages:
* New technology is much cheaper per genome
* Generates a huge amount of sequence per run
Disadvantages:
* Has a higher sequencing error rate per base pair
* Generates short reads ( bp) - more challenging assembly
* Generates a huge amount of sequence (and massive data files) per run

Several different ‘next-generation’ sequencing methods
* 454: emulsion sequencing per well; >500 bp read length
* SOLiD: emulsion amplification, bead attachment to solid surface; ligation-based sequencing interrogates each base in 2 ligation reactions

Solexa (Illumina) Sequencing >100 bp read length

Next-generation (deep) sequencing
* Very high (>100X) coverage
* Much cheaper per bp covered
* Rapid improvements in technology (including single-molecule sequencing)
But:
* Much higher error rate (~1%)
* Short reads cause assembly challenges
* Some require prior amplification
* Sequence-specific bias in sequencing efficiency
For a 500 Mb target with 100 bp read length: ~168k gaps at 5X coverage; 2,250 gaps at 10X coverage.

Sanger WGS de novo assembly
[Figure: ‘reads’ are assembled into ‘contigs’, which are linked into a ‘scaffold’]
Goal is to have no gaps & complete scaffolds (chromosomes).
Challenges:
* Some regions are difficult to sequence through (centromeres, heterochromatin, etc.)
* Repetitive regions make assembly difficult/ambiguous

‘Next-generation sequencing’ de novo assembly
** Short read length a real challenge

OR: Matching to a ‘reference’ genome
* Paired-end reads
Challenges: can have lots of gaps, miss any new sequence not in the reference, repetitive regions are not sequenced well, can totally miss structural rearrangements.
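Two of these challenges (repeats and sequence absent from the reference) show up even in a toy read mapper. The sketch below is a deliberately naive exact-match mapper using a k-mer index. Real aligners tolerate mismatches and use far more efficient indexes; the reference and reads here are invented for illustration.

```python
def build_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def map_read(read, reference, index, k):
    """Return all reference positions where the read matches exactly.

    The first k bases of the read are used as a seed to look up candidate
    positions, and each candidate is then checked over the full read length.
    """
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


# Hypothetical reference containing the repeat TTACAGGA twice.
reference = "ATGGCGTTACAGGATCCTTACAGGAACTG"
K = 4
index = build_index(reference, K)

unique_hits = map_read("GGATCCTT", reference, index, K)   # one placement
repeat_hits = map_read("TTACAGGA", reference, index, K)   # ambiguous placement
novel_hits = map_read("CCCCGGGG", reference, index, K)    # not in the reference

print("unique read:", unique_hits)
print("repeat read:", repeat_hits)
print("novel read: ", novel_hits)
```

The read that falls inside the repeat maps to two positions, so its true origin is ambiguous, and the read carrying sequence absent from the reference is silently dropped. Those are the two failure modes listed above.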

Genome Annotation: predicting genetic features
‘Simplest’ predictions: Open Reading Frames (ORFs)
- De novo predictions: based on expectation of what ORFs should look like

GCGCTTACGTTATCTGCAATATGTCTTCGAACGATTCGAACGATACCG
ACAAGCAACATACACGTCTGGATCCTACCGGTGTGGACGACGCCTACA
TTCCTCCGGAGCAGCCGGAAACAAAGCACCATCGCTTTAAAATCTCTA
AAGACACCCTGAGAAACCACTTTATCGCTGCGGCCGGTGAGTTCTGCG
GCACATTCATGTTTTTATGGTGCGCTTACGTTATCTGCAATGTCGCTA
ACCATGATGTCGCATAACCGAGATGAGATGATAAAAAA

Features of ORFs used in computational predictions:
* Start with ATG
* End with a stop codon (e.g. TAA)
* Should be in one frame (i.e. length divisible by 3 for each codon)
* Have a size range (max. size can be >10 kb, min. size can be 30 bp; median is probably ~a few kb depending on organism)
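The ORF features listed above translate almost directly into a scanner. The following is a minimal sketch under stated assumptions: it looks only at the three forward-strand reading frames, it ignores introns, and the 30 bp default minimum is taken from the size range above. A real gene finder would also scan the reverse complement and score candidates statistically.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_len=30):
    """Find simple ORFs on the forward strand.

    Returns a list of (start, end, frame) tuples, with end exclusive and the
    stop codon included in the ORF. An ORF begins at the first ATG seen in a
    frame and ends at the next in-frame stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if end - start >= min_len:
                    orfs.append((start, end, frame))
                start = None  # resume searching for the next ATG
    return orfs


# Hypothetical test sequence: 2 bp of padding, ATG, nine GCT codons, TAA, 2 bp.
example = "CC" + "ATG" + "GCT" * 9 + "TAA" + "GG"
orfs = find_orfs(example)
print(orfs)
```

Because start and stop are found in the same frame, every reported ORF automatically has a length divisible by 3.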

GCGCTTACGTTATCTGCAATATGTCTTCGAACGATTCGAACGTCGAAT
AGACGATGAGACGAGATAGAGCGAGCAAAAGGTAGGATACCGACAA
GCAACATACACGTCTGGATCCTACCGGTGTGGACGACGCCTACATTCC
TCCGGAGCAGCCGGAAACAAAGCACCATCGCTTTAAAATCTCTAAAGA
CACCCTGAGAAACCACTTTATCGCTGCGGCCGGTGAGTTCTGCGGCAC
ATTCATGTTTTTATGGTGCGCTTACGTTATCTGCAATGTCGCTAACCA
TGATGTCGCATAACCGAGATGAGATGATAAAAAA

Many ORFs have introns. Splice junction signals are short and variable, which makes introns difficult to predict.

- Homology-based assignments: find sequences homologous to known ORFs/proteins
[Figure: two amino-acid sequences aligned to the genomic sequence above]
Met Ser Ser Gln Asp Ser Asn Asp Ser Asp Lys Gln …
Met Ser Ser Asn Asp Ser Asn … Asp Thr Asp Lys Gln ..
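The homology idea can be illustrated by translating a candidate ORF and comparing it with a known protein. This sketch uses the standard genetic code and a plain position-by-position identity with no gaps. It assumes the intron-free start of the ORF from the earlier example sequence and reads the first amino-acid row on the slide as the known homolog. Real homology searches use alignment tools that handle gaps and score substitutions.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate DNA in frame 0 to one-letter amino acids, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3].upper(), "X")  # X = unrecognized codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def percent_identity(a, b):
    """Ungapped percent identity over the length of the shorter sequence."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / n


# First 12 codons of the intron-free example ORF (starting at its ATG).
orf_start = "ATGTCTTCGAACGATTCGAACGATACCGACAAGCAA"
predicted = translate(orf_start)

# Assumed known homolog: Met Ser Ser Gln Asp Ser Asn Asp Ser Asp Lys Gln.
known = "MSSQDSNDSDKQ"

identity = percent_identity(predicted, known)
print(predicted, known, f"{identity:.1f}% identical")
```

The predicted peptide differs from the assumed homolog at two of twelve positions, the kind of high but imperfect match that supports calling the ORF a homolog of the known protein.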

- Matches to cDNA library or RNA transcripts from sequencing
[Figure: an RNA transcript aligned to the genomic sequence above]

Other Predictions:
* Open Reading Frames (ORFs)
* Non-coding RNAs (tRNAs, rRNA, other small RNAs, miRNAs, etc.)
* Regulatory elements (ENCODE project)
* Transposable elements (TEs)
* Origins of DNA replication

In what ways can a bad genome sequence affect the following:
Comparisons of:
* Genome size, organization (chromosomes/plasmids), structure
* Gene/ncRNA content: number of genes, duplicates, size of gene families, etc.
* Sequence differences related to: gene evolution, regulatory evolution
* RNA & protein abundance across species, for all RNAs/proteins
