

Table S1. List of all features tested by the decision tree.

Feature number | Feature name | Feature description

1 | Average number of exons | Average number of exons in the transcripts of the gene where the indel is located. Motivation: pseudogenes can tolerate frameshift (FS) indels and have only one exon.

2 | Number of unaffected transcripts | For the affected gene, count the number of transcripts that are not affected by the indel.

3 | Fraction of unaffected transcripts | Number of unaffected transcripts divided by the total number of transcripts for the affected gene.

4 | Average relative indel location | For each affected transcript of the affected gene, calculate the relative indel position as the position of the indel on the coding sequence divided by the length of the coding sequence. Then average the relative indel position over all affected transcripts of the gene.

5 | Maximum relative indel location | Relative indel position calculated as in feature 4. Take the maximum relative indel location across all affected transcripts of the gene.

6 | Minimum relative indel location | Relative indel position calculated as in feature 4. Take the minimum relative indel location across all affected transcripts of the gene.

7 | Average relative indel location to the center of the coding sequence | |0.5 - average relative indel location|

8 | Average number of overlapping residues between new protein and original protein | For each affected transcript, count the number of overlapping identical amino acid residues between the newly translated protein that results from the indel (see note a) and the original protein. Then average the number of overlapping residues over all affected transcripts.

9 | Fraction of mRNA decay (i.e., nonsense-mediated decay [11-12] or nonstop mRNA decay [13]) | Percentage of transcripts of the affected gene with nonsense-mediated decay (NMD; see note b) or nonstop mRNA decay (see note c).
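The transcript-count and relative-location features above reduce to a few lines of arithmetic. The following is a minimal sketch, not the authors' implementation: the function name `relative_indel_features`, the input layout (one `(indel position on CDS, CDS length)` pair per affected transcript), and the dictionary keys are all assumptions made for illustration.

```python
from statistics import mean


def relative_indel_features(affected, n_total_transcripts):
    """Sketch of the transcript-count and relative-location features.

    affected: list of (indel_position_on_cds, cds_length) pairs, one per
        affected transcript (hypothetical input layout).
    n_total_transcripts: total number of transcripts of the affected gene.
    """
    # Relative indel position = position on the CDS / CDS length.
    rel = [pos / length for pos, length in affected]
    avg = mean(rel)
    n_unaffected = n_total_transcripts - len(affected)
    return {
        "n_unaffected": n_unaffected,
        "frac_unaffected": n_unaffected / n_total_transcripts,
        "avg_rel": avg,
        "max_rel": max(rel),
        "min_rel": min(rel),
        # Distance of the average location from the CDS center.
        "dist_center": abs(0.5 - avg),
    }


if __name__ == "__main__":
    # Two affected transcripts out of four, both with a 300 bp CDS.
    print(relative_indel_features([(30, 300), (150, 300)], 4))
```

Whether positions are counted from 0 or 1 is not stated in the table, so the sketch simply uses whatever convention the caller supplies.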

Table S1 (continued)

Feature number | Feature name | Feature description

10 | Fraction of all functional domains (pFam, super family, signal peptide, Seg, ncoils, Tmhmm, etc.) affected due to indel | Functional domains of each protein are downloaded from Ensembl [26]. For each affected transcript, calculate the percentage of all functional domains annotated by Ensembl (including pFam domains, super family domains, signal peptides, and all other domains) that are lost from the newly translated protein due to the indel. Then average this fraction over all affected transcripts.

11 | Fraction of all pFam domains affected due to indel | Same as 10, but restricted to pFam domains.

12 | Fraction of all super family domains affected due to indel | Same as 10, but restricted to super family domains.

13 | Fraction of all signal peptide domains affected due to indel | Same as 10, but restricted to signal peptide domains.

14 | Fraction of affected conserved DNA bases | Fraction of conserved nucleotide positions affected by the indel (see note d). The conservation score of each DNA base is obtained from PhyloP [29]. A high positive score indicates the base is conserved, a negative score indicates positive selection, and a score of 0 represents neutral selection. In this study, DNA bases with conservation scores >= 1 are treated as conserved bases. For the affected gene, we calculate the percentage of conserved DNA bases in affected regions relative to the total number of conserved DNA bases of the gene.

15 | Minimum distance of indel to the exon boundary of all affected transcripts | For all affected transcripts, calculate the minimum distance of the indel to the exon boundary.

16 | Number of paralogous genes | For each gene affected by the indel, count how many paralogous genes it has. The paralog information is downloaded from Ensembl [26].

17 | Ka/Ks [27] | Ka/Ks is an indicator of the selective pressure on the gene.
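The conserved-base feature can be illustrated with a short sketch. It assumes per-base PhyloP scores are already available as a mapping from position to score and that the affected positions have been determined separately; the function name, the input layout, and the choice to return 0.0 for a gene with no conserved bases are assumptions, not details given in the table.

```python
def fraction_affected_conserved(phylop_scores, affected_positions, threshold=1.0):
    """Sketch of the 'fraction of affected conserved DNA bases' feature.

    phylop_scores: dict mapping base position -> PhyloP score for the gene
        (hypothetical input layout).
    affected_positions: iterable of positions affected by the indel.
    threshold: bases with score >= threshold count as conserved (1.0 in the table).
    """
    conserved = {pos for pos, score in phylop_scores.items() if score >= threshold}
    if not conserved:
        # Assumption: a gene with no conserved bases gets a fraction of 0.
        return 0.0
    return len(conserved & set(affected_positions)) / len(conserved)


if __name__ == "__main__":
    scores = {1: 2.0, 2: 0.5, 3: 1.0, 4: -1.2, 5: 3.1}
    # Conserved bases are 1, 3 and 5; positions 3 and 5 are affected.
    print(fraction_affected_conserved(scores, {3, 4, 5}))
```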

Table S1 (continued)

Feature number | Feature name | Feature description

18 | Maximum fraction of lost conserved amino acids of all affected transcripts at a 25% threshold | For each transcript, find the percentage of conserved amino acids lost due to the indel. Then take the maximum percentage over all affected transcripts. To calculate conservation scores, we followed the SIFT method for choosing sequences [2] by searching a database of proteins from vertebrate genomes. The SIFT procedure generates a protein sequence alignment; conservation values were calculated for each position [28] and then ranked. We counted u, the number of positions above the 25th percentile (so ¼ of the positions were ignored and deemed not conserved). We then counted how many of these positions were affected by the indel and termed this v. The fraction of conserved positions that were affected, v/u, was calculated for each transcript, and the maximum value over all transcripts was used in the decision tree.

19 | Maximum fraction of lost conserved amino acids of all affected transcripts at a 50% threshold | Similar to 18, except that positions above the 50th percentile were considered (so half the positions were ignored).

20 | Maximum fraction of lost conserved amino acids of all affected transcripts at a 75% threshold | Similar to 18, except that positions above the 75th percentile were considered.

Notes to Table S1

a. Alternative translation start site: If the loss-of-function variant is near the beginning of the protein, translation could be initiated by a downstream in-frame AUG [30]. In our study, if the indel is in the first 25 codons (first 75 bp of translated cDNA) or within the 5th percentile of the coding sequence length, we looked for the next downstream in-frame start codon to translate the new protein. Note that this is a relaxed threshold compared with the one proposed in [30], i.e., the first 30 bp of translated cDNA. We relaxed the threshold because we found that a significant portion of neutral indels occur in the beginning regions, after the first 30 bp. If the indel is not in the first 25 codons or the 5th percentile of the coding sequence length, SIFT indel translates from the beginning of the transcript until it reaches a stop codon.

b. Nonsense-mediated decay (NMD) is a cellular mechanism of mRNA surveillance that detects nonsense mutations and prevents the expression of truncated or erroneous proteins [11-12]. Based on [11], there is no NMD under the following two conditions:
1) If the last coding exon is flanked by only one 3’ UTR exon, and the premature termination codon is in the last exon or in the last 50 nucleotides of the second-to-last exon;
2) If the last coding exon is flanked by more than one 3’ UTR exon, and the premature termination codon is in the last 50 coding nucleotides of the last coding exon.
See Figure S1. Reference [11] had another rule for transcripts containing more than two 3’ UTRs. However, we observed that the stop codons in the Ensembl gene annotation did not follow this particular rule, so we eliminated it and simply followed rule 2 if there was more than one 3’ UTR.

c. Eukaryotic mRNAs that do not contain a termination codon are rapidly degraded [13].

d. The procedure to identify DNA regions affected by an indel is described in Figures S2 and S3.
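The threshold in note a, which decides whether translation restarts at a downstream in-frame start codon, can be written as a small predicate. This is a sketch under stated assumptions: the function name is hypothetical, `indel_pos` is taken to be a 1-based position on the translated cDNA, and "within the 5th percentile of the coding sequence length" is read as `indel_pos <= 0.05 * cds_length`. The search for the downstream start codon itself is not shown.

```python
def uses_downstream_start(indel_pos, cds_length, codon_window=25, length_fraction=0.05):
    """Return True if the indel falls in the early region described in note a.

    indel_pos: 1-based position of the indel on the translated cDNA (assumption).
    cds_length: length of the coding sequence in bp.
    codon_window: number of leading codons considered (25 codons = 75 bp).
    length_fraction: leading fraction of the CDS considered (5th percentile).
    """
    within_codons = indel_pos <= 3 * codon_window
    within_fraction = indel_pos <= length_fraction * cds_length
    return within_codons or within_fraction


if __name__ == "__main__":
    print(uses_downstream_start(60, 900))    # inside the first 75 bp
    print(uses_downstream_start(100, 900))   # outside both windows
    print(uses_downstream_start(100, 3000))  # inside the first 5% of a long CDS
```

Keeping the two windows as parameters makes it easy to compare against the stricter 30 bp threshold mentioned in the note.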

Supplementary Table 2. Performance of the final decision tree using the four features with protein conservation scores calculated from alignments with all vertebrate species, or from alignments where species that have indels at that location have been removed. The latter set of alignments tests that there is no bias introduced in our scores.

Protein conservation score | Sensitivity (%) | Specificity (%) | Precision (%) | Accuracy (%)

Scores derived from alignments that use all vertebrate species (final method) | | | |

Scores derived from alignments where sequence(s) from species that possess the indel have been removed | | | |

Figure S1. Rules for determining whether a transcript undergoes nonsense-mediated decay (NMD).

[Figure labels]
Case 1: 1 UTR directly flanking CDS. Labels: "50 bp from intron end"; "Stops allowed anywhere here and there is NO NMD"; "1 UTR".
Case 2: ≥2 UTRs. Labels: "50 bp from termination codon"; "Stops allowed within 50 bp of the termination codon (NO NMD)"; "2 UTRs".

Figure S2. Rules for identifying affected DNA bases in a gene for an indel located in an alternatively spliced exon. This process is used for the calculation of the feature “Fraction of affected conserved DNA bases” (feature 14).

To identify affected regions in a given gene, we did the following:
1) Create a universal transcript by taking the union of all transcript isoforms. Name this U_universal;
2) Create a union of all unaffected transcript isoforms. Name this U_unaffected;
3) If there are affected transcripts without mRNA decay, take the union of all positions before the indel, as these are functional. Name this U_partially-functional;
4) Take the union of U_unaffected and U_partially-functional to get all the functional expressed regions. Name this F;
5) Subtract F from the universal transcript to get the affected regions (A = U_universal - F).

[Figure example] A single gene with four transcript isoforms (Tx1, Tx2, Tx3, Tx4). The indel is located in an alternatively spliced exon.
Step 1) Create the universal transcript, U_universal.
Step 2) Obtain U_unaffected. Merge regions from unaffected transcripts (Tx1, Tx2, Tx3).
Step 3) Obtain U_partially-functional. In this example, U_partially-functional is empty and contains no positions.
Step 4) Let F be the union of positions in U_unaffected and U_partially-functional.
Step 5) Get affected regions A, where A = U_universal - F.
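The five steps of Figure S2 map directly onto set operations. The sketch below assumes each transcript is represented as a set of exonic base positions on the forward strand, so that "before the indel" means a smaller coordinate; the function name, argument names, and toy coordinates are illustrative assumptions rather than the authors' code.

```python
def affected_regions(transcripts, affected, decayed, indel_pos):
    """Sketch of the Figure S2 procedure using Python sets.

    transcripts: dict mapping transcript name -> set of exonic positions
        (hypothetical representation, forward strand assumed).
    affected: set of transcript names that contain the indel.
    decayed: subset of `affected` predicted to undergo mRNA decay.
    indel_pos: position of the indel.
    """
    # Step 1: universal transcript = union of all isoforms.
    u_universal = set().union(*transcripts.values())
    # Step 2: union of all unaffected isoforms.
    u_unaffected = set().union(
        *(pos for name, pos in transcripts.items() if name not in affected)
    )
    # Step 3: positions before the indel in affected isoforms without decay.
    u_partial = set().union(
        *({p for p in transcripts[name] if p < indel_pos}
          for name in affected if name not in decayed)
    )
    # Step 4: all functional expressed regions.
    functional = u_unaffected | u_partial
    # Step 5: affected regions A = U_universal - F.
    return u_universal - functional


if __name__ == "__main__":
    # Toy gene: Tx4 carries an alternatively spliced exon at positions 21-25.
    txs = {
        "Tx1": set(range(1, 11)),
        "Tx2": set(range(1, 11)) | set(range(31, 41)),
        "Tx3": set(range(31, 41)),
        "Tx4": set(range(1, 11)) | set(range(21, 26)) | set(range(31, 41)),
    }
    print(sorted(affected_regions(txs, {"Tx4"}, {"Tx4"}, indel_pos=23)))
```

For large genes, interval arithmetic would be more economical than per-base sets, but sets keep the correspondence with the five steps easy to read.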

Figure S3. An example of the procedure for identifying affected DNA bases for an indel located at the end of a gene. See Figure S2 for method details.

[Figure example] A single gene with four transcript isoforms (Tx1, Tx2, Tx3, Tx4); the indel is located at the end of the gene.
Step 1) Create the universal transcript, U_universal.
Step 2) Merge regions from unaffected transcripts (Tx1, Tx2, Tx4) to obtain U_unaffected.
Step 3) Merge regions from transcripts that contain the indel but do not undergo mRNA decay to obtain U_partially-functional.
Step 4) F is the union of U_unaffected and U_partially-functional.
Step 5) Get affected regions A, where A = U_universal - F.