Concepts and methods in genome assembly and annotation
B. Franz LANG, Département de Biochimie. Office: H307-15. E-mail:
BCM-2002
Outline
1. What is genome assembly?
2. What is genome annotation?
3. Annotating protein coding genes and introns
4. Prediction of RNA genes
1. What is genome assembly?
Stitching together sequence reads into contigs (up to complete chromosome size), which is required for identifying complete genes and annotating them.
Assembly also provides information on genome architecture (linear or circular chromosomes, their number, etc.).
Contigs may be up to millions of nucleotides long.
An average read coverage above 10-fold is required for decent assemblies.
Long reads, or paired-end reads from long DNA fragments, permit linking contigs that are bordered by repeat regions (either true links, or scaffolds in which the missing sequence of predicted length is written as NNNNN…).
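The coverage requirement can be checked with simple arithmetic: average coverage is the total number of sequenced bases divided by the genome size. A minimal sketch in Python; the function names and the example numbers are illustrative assumptions, not part of the lecture.

```python
import math


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average read coverage C = N * L / G."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical example: 100,000 reads of 100 nt on a 1 Mbp genome
print(mean_coverage(100_000, 100, 1_000_000))  # 10.0
```

Average coverage says nothing about how evenly the reads are spread, so regions with locally low coverage may still break an assembly.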
Genome assembly at a higher level
Genomes may be DNA or RNA (double- or single-stranded).
An organism may have several genomes (e.g., in eukaryotes, those of organelles).
Genomes may consist of more than one physical unit (chromosomes).
Chromosomes may be circular, monomeric linear, or linear with the unit directly repeated several times (e.g., the product of rolling circle replication, which appears as circular-mapping in sequence assembly!).
Circular-mapping concatamers: only the replicative form is truly circular.
How genome assembly of real (dirty) data works
Given sequence read information (Sanger, Illumina, PacBio …), an algorithm is required to combine more or less perfectly overlapping sequences into a genome sequence.
Overlap-join procedures. Slow, but they tolerate error-prone sequencing technologies like 454, whose errors may in turn be carried into the assembly (e.g., frameshifts with 454). Examples of software: Phrap, Consed, Newbler, MIRA, Celera.
Eulerian algorithms based on graphs. Very fast, but they require reads that are nearly free of sequence error or variation. Huge datasets (Illumina) can be processed. An important feature is the use of sequence coverage across the graph to remove assembled regions that stem from experimental error, contaminant reads from other genomes, etc. Examples of software: Velvet, SOAPdenovo, ABySS, ALLPATHS, SPAdes.
Graph algorithms for assembly
P. Compeau, P. Pevzner & G. Tesler (2011) Nature Biotechnology 29:
(a) Sequence. (b) Traditional assembly: walk through a Hamiltonian cycle. (c) Variant after splitting the reads into short k-mers (e.g., k = 3). (d) Modern de Bruijn graph, which finds the sequence more quickly via an Eulerian cycle.
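The de Bruijn idea in panel (d) can be sketched in a few lines: every distinct k-mer becomes an edge between its (k-1)-mer prefix and suffix, and the sequence is read off an Eulerian path. This is a toy illustration that assumes error-free reads, a linear target and no repeats longer than k-2; the function names and the example reads are assumptions, and production assemblers add error correction, coverage tracking and repeat handling.

```python
from collections import defaultdict


def kmers(reads, k):
    """Set of distinct k-mers found in the reads."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def de_bruijn(kmer_set):
    """Each k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(list)
    for kmer in sorted(kmer_set):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: visit every edge exactly once."""
    in_deg = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_deg[node] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_deg.get(n, 0) == 1),
        min(nodes),
    )
    remaining = {n: list(s) for n, s in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path of the de Bruijn graph."""
    path = eulerian_path(de_bruijn(kmers(reads, k)))
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical overlapping reads
print(assemble(["ATGGCG", "GGCGTG", "CGTGCA"], k=4))
```

With a k that is too small, repeated (k-1)-mers create branches and several Eulerian paths become possible, which is the graph-level picture of a repeat conflict.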
How genome assembly of real (dirty) data works
Given sequence read information (Sanger, Illumina, PacBio …), an algorithmic approach is required to:
Discard information from contaminating DNA, primers and adapters
– If contamination is at a low level, a sequence coverage cut-off will resolve the issue
Resolve repeat regions of all kinds that constitute assembly conflicts
– Mobile genetic elements and other short repeated DNA segments
– Segmental genome duplication
– Diploid, aneuploid … genomes with sequence differences between alleles (SNPs, ‘snips’)
– Whole genome duplication followed by genetic drift of one copy, or its partial loss
This requires sequence from large DNA fragments, chromosome size mapping, and other physical or biological genome information.
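The coverage cut-off can be illustrated at the k-mer level: k-mers that are seen only rarely (low-level contaminants, sequencing errors) are dropped, and only well-supported k-mers are kept. A minimal sketch; the function names, k and the threshold are illustrative assumptions, and assemblers typically apply such cut-offs inside the assembly graph rather than on raw counts alone.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, min_count):
    """Keep k-mers whose count reaches the coverage cut-off."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


# Hypothetical data: five reads from the genome, one low-level contaminant read
reads = ["ACGTAC"] * 5 + ["ACGTTC"]
counts = kmer_counts(reads, k=4)
print(sorted(solid_kmers(counts, min_count=3)))
```

A cut-off of this kind only works when the contaminant is rare; a contaminant sequenced as deeply as the target genome has to be separated by other means.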
How genome assembly of real data works
Resolve chromosome architecture (multiple genomes and chromosomes; linear, circular, or circular-mapping concatamers).
This issue usually needs manual input from an expert who has additional molecular information.
2. What is genome annotation?
Finding, and precisely predicting the position of, all genes, other genetic elements, insertion elements and repeats on a given genome sequence.
Species may contain more than one genome (e.g., nuclear, mitochondrial, chloroplast, virus/phage, plasmid …).
The genetic code and gene expression signals may differ from one genome to another; this needs information on gene expression at the RNA and/or protein level.
Genes may be contiguous, disrupted by introns, or discontinuous (trans-spliced or in pieces).
Based on comparative gene/intron predictions (gene models, bioinformatic inference); information on transcript and protein sequences and other biological facts (e.g., from enzymatic or genetic studies) is usually required.
List of features at the sequence level (e.g., GenBank submission file)
Genetic maps
What is genome annotation?
Example: Partial GenBank annotation of a mitochondrial genome (rRNA gene with introns and a predicted protein coding sequence)
COMMENT     Complete mitochondrial genome.
FEATURES             Location/Qualifiers
     source
                     /organism="Glomus irregulare"
                     /organelle="mitochondrion"
                     /mol_type="genomic DNA"
                     /strain="DAOM "
                     /type="genomic"
     gene
                     /gene="rnl"
     rRNA            join( , , , , , , )
                     /gene="rnl"
                     /product="large subunit ribosomal RNA"
     exon
                     /gene="rnl"
                     /number=1
     intron
                     /gene="rnl"
                     /note="Group IA3"
                     /number=1
     exon
                     /gene="rnl"
                     /number=2
     intron
                     /gene="rnl"
                     /note="Group IA3"
                     /number=2
     gene
                     /gene="orf202"
     CDS
                     /gene="orf202"
                     /codon_start=1
                     /transl_table=4
                     /product="hypothetical protein"
                     /translation="MKSPNPQPALSSIQREILVGGLLGDLSIYRAKVTHNARLYVQQG
                     SVHKEYLNHLYSVFQNLCSSEPKWSLSLDKRSNTTYETLRFNSRSLPCFNYYRDVFYP
                     EGVKIVPANIGELLTARGLAYWSMDDGYKDRGNFRLATQSFSRNDVLLLIKLLKDNFS
                     LDCSLNTVKSTQYRIYVRANSMVQFRALVSPYFHPSMLYKLQ"
     exon
                     /gene="rnl"
                     /number=3
     intron
                     /gene="rnl"
                     /note="Group IB"
                     /number=3
… and so on …
What is genome annotation?
Example: Genetic maps of two mitochondrial genomes
Effect of completeness, sequencing error, and assembly artifacts on genome annotation
Incomplete genome assembly (‘draft genome’): annotation of genes and genetic elements is somewhat incomplete, but a draft still works for the bulk of gene identification, expression studies and comparisons.
Systematic sequence error (technology-specific):
– 454: the number of nucleotides in homopolymer runs is often incorrect. This causes difficulties in genome assembly at these sites, and potential frameshifts in protein coding genes, which may therefore remain unidentified.
– Sanger: difficulties in resolving snap-back structures; termination and/or slippage at long homopolymers. Same consequences as above, but less severe and less frequent in genes.
– Illumina: uncertain sequence at certain motifs such as GGCNN; this seems to be less of a problem with the latest technology. Error prediction and correction is possible.
– Ion Torrent, Pacific Biosciences: overall high error rate, which may to some degree be corrected by using very deep coverage (this fails if polymorphic sites/SNPs are of interest, because errors and SNPs are hard to distinguish).
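Because homopolymer runs are the typical sites of 454 (and, less severely, Sanger) length errors, listing them in a contig shows where frameshifts should be suspected. A minimal sketch; the function name and the default threshold of 5 nucleotides are assumptions chosen for illustration.

```python
import re


def homopolymer_runs(seq, min_len=5):
    """Return (start, base, length) for every run of one base with length >= min_len."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    # One base followed by at least (min_len - 1) copies of itself
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0)))
            for m in pattern.finditer(seq.upper())]


# Hypothetical contig fragment
print(homopolymer_runs("ACGAAAAAAGTTTTTC"))  # [(3, 'A', 6), (10, 'T', 5)]
```

Start positions are 0-based; an ORF that ends shortly after such a run, with a similar ORF continuing in another frame, is a candidate for a sequencing-induced frameshift.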
3. Annotation of protein coding genes and introns
First, one needs to know, or infer, the genetic code.
Translate Open Reading Frames (ORFs), i.e., stretches that start with a known initiation codon (ATG, GTG …) and are not interrupted by a stop codon.
ORFs may be given a functional identity by sequence comparison to known genes.
Protein sequence data can be used to confirm that translation actually occurs, and to identify the genetic code.
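The dependence of ORF prediction on the genetic code can be shown with a small ORF scanner. In the sketch below the standard code is built from the usual TCAG-ordered amino acid string, and translation table 4 is derived from it by reading TGA as tryptophan. The function names, the restriction to the forward strand and to ATG starts, and the example sequence are assumptions made to keep the illustration short.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codon_table(overrides=None):
    """Standard code, optionally modified (e.g., TGA -> W for table 4)."""
    table = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AA)}
    table.update(overrides or {})
    return table


STANDARD = codon_table()
TABLE_4 = codon_table({"TGA": "W"})


def find_orfs(seq, table, starts=("ATG",), min_codons=3):
    """Forward-strand ORFs as (start, end, protein); end is exclusive and includes the stop."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon in starts:
                    start = i
            elif table.get(codon, "X") == "*":
                if (i - start) // 3 >= min_codons:
                    protein = "".join(table.get(seq[j:j + 3], "X")
                                      for j in range(start, i, 3))
                    orfs.append((start, i + 3, protein))
                start = None
    return orfs


# Hypothetical sequence with an in-frame TGA codon
seq = "ATGAAATGATTTTAA"
print(find_orfs(seq, STANDARD, min_codons=2))
print(find_orfs(seq, TABLE_4, min_codons=2))
```

With the standard code the ORF stops at TGA, whereas with table 4 the same codon is translated and the ORF extends to TAA, which is why a wrongly assumed code fragments or truncates gene predictions.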
3. Annotation of protein coding genes and introns
Transcription data for the gene region, as well as the presence of regulatory elements, help to confirm the prediction (in bacteria: ribosomal binding site at the 5’ end, terminator sequence at the 3’ end, upstream promoters …).
If these genes contain introns, exons may be identified in two ways:
– By comparing the gene region with transcript sequences (which do not contain introns)
– By inferring the exon-intron structure from sequence similarities of exons, from intron features such as conserved splice site motifs, and from any other feature known to define a gene in a given group of organisms (‘gene models’ and ‘intron models’)
3. Annotation of protein coding genes and introns
If genes contain introns, exon/intron boundaries in eukaryotic nuclear genes may be identified by conserved splice site motifs (intron models). For other intron types, use the respective models.
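As a toy illustration of an intron model for nuclear genes, the sketch below lists candidate introns that begin with the conserved GT dinucleotide and end with AG. The function name, the length limits and the example sequence are assumptions; real gene finders score much richer splice site profiles and check the reading frame of the spliced product.

```python
def candidate_introns(seq, min_len=20, max_len=1000):
    """Return (start, end) of GT...AG spans within the length limits; end is exclusive."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return [(d, a) for d in donors for a in acceptors if min_len <= a - d <= max_len]


# Hypothetical gene fragment: exon, short GT...AG intron, exon
seq = "CCC" + "GT" + "C" * 6 + "AG" + "CCC"
for start, end in candidate_introns(seq, min_len=4, max_len=50):
    print(start, end, seq[start:end])
```

On real sequence, GT and AG dinucleotides are so frequent that this yields many candidates, which is why transcript evidence or exon similarity is needed to pick the correct ones.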
3. Annotation of protein coding genes and introns
M. Yandell and D. Ence (2012) Nature Reviews Genetics 13: 329
4. Prediction of structured RNA genes: a comparison of RNAmotif with ERPIN
Features of structured RNAs:
– primary sequence conservation
– secondary structure
– tertiary interactions
– site-wise conservation may be highly variable, and is often not similar enough to be found with Blast
… an example from RNase P RNA follows …
Mitochondrial RNase P RNA is highly conserved in pairing P4, the reactive center of the molecule, with respect to its bacterial counterparts. Yet even the conserved sequence motifs (red) vary too much in most of the known genes to be identified with Blast.
Examples of rnpB gene sequences in yeast mitochondria.
How to search most effectively for mitochondrial RNase P RNAs?
Method 1: search for the conserved primary sequence motif only, using regular expressions
Mitochondrial primary consensus sequence: most conservation is close to P4
P4: helical interaction
As it turns out, primary sequence conservation is weak, and only ~50% of the currently known sequences are found with this information.
Corresponding regular expression: [AT]G[GA]NAA[GA]T[TC][ATC][GT][GA] … A[CT][AU]NAAN[ATC][TC][AC][GAT][GT][CT]TTA[GAT]
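A search with this expression can be sketched with Python's `re` module. The two motif halves are copied from the slide; N is expanded to any nucleotide and U is read as T so that DNA and RNA input behave the same. The spacing between the halves is elided on the slide, so `max_gap` is a purely hypothetical parameter, as are the function names and the test sequence.

```python
import re

# Motif halves as given on the slide
MOTIF_5P = "[AT]G[GA]NAA[GA]T[TC][ATC][GT][GA]"
MOTIF_3P = "A[CT][AU]NAAN[ATC][TC][AC][GAT][GT][CT]TTA[GAT]"


def to_regex(motif):
    """Expand N to any nucleotide, read U as T, and allow overlapping matches."""
    dna = motif.replace("U", "T").replace("N", "[ACGT]")
    return re.compile("(?=(%s))" % dna)


def find_motif_pairs(seq, max_gap=1000):
    """Return (start of 5' motif, end of 3' motif) for pairs at most max_gap apart."""
    seq = seq.upper().replace("U", "T")
    hits5 = [(m.start(1), m.end(1)) for m in to_regex(MOTIF_5P).finditer(seq)]
    hits3 = [(m.start(1), m.end(1)) for m in to_regex(MOTIF_3P).finditer(seq)]
    return [(s5, e3) for s5, e5 in hits5 for s3, e3 in hits3
            if 0 <= s3 - e5 <= max_gap]


# Hypothetical target: both motif halves separated by 30 nucleotides of filler
target = "CC" + "AGGCAAGTTAGG" + "C" * 30 + "ACACAAGATAGGCTTAG" + "CC"
print(find_motif_pairs(target, max_gap=100))
```

A regular expression either matches or does not, so a single deviating position removes a true gene from the results; this is consistent with the limited sensitivity reported for this method.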
How to search most effectively for mitochondrial RNase P RNAs?
Method 2: use conserved primary sequence plus secondary structure, united in a structural profile that is translated into an RNAmotif ‘descriptor’
Structured sequence profile including P4 helical region (using more sequences than in the primary sequence example)
Translation of this complex structural motif into an RNAmotif descriptor
parms                                                           ### finds mt RNase P RNAs
    wc += gu;                                                   ### permits global GU
descr
    ss(len=20)                                                  ### 20 flanking nucleotides
    ss(len=5, seq="[GAT][AT]G[GAT]A$")                          ### ss 5' to structure
    h5(len=3, seq="A[GA][GA]",mispair=1,ends='mm')              ### P4-1
    ss(len=1,seq="T$")                                          ### T bulge
    h5(len=5,seq="[TC][ATC][GAT][GAT].$",mispair=2,ends='mm')   ### P4-2
    ss(minlen=50, maxlen=1000,seq="[AC]C[ATC].[GA]A$")          ### P4 loop
    h3 (seq="[ATC][ATC][ATC][GAT][GT]$")                        ### P4-1'
    h3 (seq="[GTC][TC]T$")                                      ### P4-2'
    ss(len=1,seq="A")                                           ### universal A
    ss(len=20)                                                  ### 20 flanking nt
It finds four false positives in a collection of 9 mtDNAs with RNase P RNA, and misses one solution: lack of both sensitivity and specificity.
How to search most effectively for mitochondrial RNase P RNAs?
Method 3: use conserved primary sequence plus secondary structure, united in a training set in which all known sequences are aligned and accompanied by a corresponding structure line, to be used for ERPIN searches.
Translate the structural alignment into the ERPIN format … however, it is ‘a bit’ cryptic …
The GDE editor helps here, with color coding, coupled to a tool that translates the alignment into ERPIN format
ERPIN then calculates RNA primary and secondary structure profiles from the sequence alignment, and matches them to the target sequence. It is a probabilistic search that takes nucleotide frequencies into account. Much of the algorithm’s efficiency stems from the use of user-defined, precisely delimited structural elements that can be searched individually or in combination, and from the option to use a defined search order (‘search strategy’).
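The idea of a profile derived from an alignment can be illustrated for the primary sequence part alone. The sketch below builds a log-odds position weight matrix and scans a target with it; it is a simplified stand-in and not ERPIN's implementation, which additionally scores base pairs in helices and handles gaps. The function names, the pseudocount, the uniform background and the example alignment are all assumptions.

```python
import math


def build_profile(alignment, background=None, pseudocount=1.0):
    """Log-odds score per position and base, from ungapped aligned sequences."""
    background = background or {b: 0.25 for b in "ACGT"}
    profile = []
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment]
        scores = {}
        for base in "ACGT":
            freq = (column.count(base) + pseudocount) / (len(column) + 4 * pseudocount)
            scores[base] = math.log2(freq / background[base])
        profile.append(scores)
    return profile


def best_hit(profile, target):
    """Return (position, score) of the best-scoring window, or None if there is none."""
    width = len(profile)
    best = None
    for i in range(len(target) - width + 1):
        window = target[i:i + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows with ambiguous bases
        score = sum(col[base] for col, base in zip(profile, window))
        if best is None or score > best[1]:
            best = (i, score)
    return best


# Hypothetical training alignment and target sequence
training = ["GGTTCGA", "GGTTCAA", "GGTTCGA", "AGTTCGA"]
print(best_hit(build_profile(training), "CCCCGGTTCGACCCC"))
```

Unlike a regular expression, a profile of this kind tolerates a deviating position at the cost of a lower score, which is why deviant sequences can still be found.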
ERPIN results
Note the E-values: the number of matches to a given structural motif that are expected to occur by chance in a target database of given size and nucleotide composition. Values of 1e-2 and smaller can already be considered ‘safe’ matches, although solutions close to 1e+1 might also be considered.
Results are much superior to those of RNAmotif: few if any false positives, and some degree of sequence variance is tolerated, so deviant sequences are found.
A more recent and even more powerful probabilistic approach is Infernal (Sean Eddy). It uses primary sequence plus covariance/HMM-like inferences, which provide slightly better E-values than ERPIN. A large variety of specific search models, useful for genome annotation, is available from the public Rfam database.
Rfam 11.0: 10 years of RNA families. S.W. Burge, J. Daub, R. Eberhardt, J. Tate, L. Barquist, E.P. Nawrocki, S.R. Eddy, P.P. Gardner, A. Bateman. Nucleic Acids Research (2012)
How much structural conservation is required for meaningful ERPIN searches?
Example: T-stem plus T-loop of tRNAs, to find matches with an E-value better than 5e-2.
Results: few if any false positives, even in large datasets.