Transcript modeling. Brent lab.

Presentation transcript:

1 Transcript modeling Brent lab

2 Overview Of Entertainment
- Gene prediction: Jeltje van Baren
- Improving gene prediction with tiling arrays: Aaron Tenney
- Validating predicted genes: Laura Langton

3 How gene finders work in 3 easy steps
A computational gene finder annotates a sequence by:
1. Identifying valid gene predictions
2. Assigning a probability to each gene prediction
3. Selecting the gene prediction with the highest probability
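The three steps can be sketched as a small Python function. This is only an illustration under stated assumptions: the names `best_prediction`, `is_valid` and `probability` are hypothetical, and a real gene finder does not enumerate candidates one by one (the deck names the Viterbi algorithm for the selection step).

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical representation: a prediction is a tuple of (start, end) exon coordinates.
Prediction = Tuple[Tuple[int, int], ...]


def best_prediction(
    candidates: Iterable[Prediction],
    is_valid: Callable[[Prediction], bool],
    probability: Callable[[Prediction], float],
) -> Optional[Prediction]:
    """Return the valid candidate with the highest probability, or None."""
    # Step 1: keep only the valid gene predictions.
    valid = [c for c in candidates if is_valid(c)]
    if not valid:
        return None
    # Steps 2 and 3: score each valid prediction and select the best one.
    return max(valid, key=probability)
```

Brute-force enumeration like this only works for toy inputs; it is shown to make the three steps explicit, not as a practical search strategy.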

4 Defining valid gene predictions
- Start codon
- Stop codon
- Canonical splice sites
- No in-frame stops
TGTCCATGACCGGAGTCTACAGTTAAACGGAGTATGATGTCATTACTAGTACCATCAGGATCGTCAATACCTACAGATTACCTAATACC
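The four validity rules on this slide can be checked mechanically. The sketch below is a minimal illustration, assuming forward-strand exons given as 0-based half-open `(start, end)` pairs that cover only coding sequence. The function name `is_valid_gene` and the toy sequence in the usage note are hypothetical and not taken from the slides.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_valid_gene(seq: str, exons: List[Tuple[int, int]]) -> bool:
    """Check start codon, stop codon, GT/AG introns and absence of in-frame stops."""
    if not exons:
        return False
    # Canonical splice sites: every intron starts with GT and ends with AG.
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        intron = seq[left_end:right_start]
        if len(intron) < 4 or not (intron.startswith("GT") and intron.endswith("AG")):
            return False
    # Splice the exons together and read the result codon by codon.
    cds = "".join(seq[start:end] for start, end in exons)
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    has_start = codons[0] == "ATG"
    has_stop = codons[-1] in STOP_CODONS
    no_inframe_stop = not any(c in STOP_CODONS for c in codons[1:-1])
    return has_start and has_stop and no_inframe_stop
```

For example, with the toy sequence `"CC" + "ATGGCC" + "GTAAAG" + "TTTTAA" + "GG"` and exons `[(2, 8), (14, 20)]`, the intron is `GTAAAG` and the spliced coding sequence reads `ATG GCC TTT TAA`, so the check passes. A strict check like this would reject genes with uncommon splice sites or in-frame stop codons.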

5 Assigning probabilities to valid gene predictions
Probabilities based on “sequence submodels” trained on examples of real genes
TGTCCATGACCGGAGTCTACAGTTAAACGGAGTATGATGTCATTACTAGTACCATCAGGATCGTCAATACCTACAGATTACCTAATACC

6 Picking optimal gene prediction
Viterbi algorithm
TGTCCATGACCGGAGTCTACAGTTAAACGGAGTATGATGTCATTACTAGTACCATCAGGATCGTCAATACCTACAGATTACCTAATACC
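The slide names the Viterbi algorithm without showing it. Below is a generic textbook sketch of Viterbi decoding for a hidden Markov model in log space. It is not the Brent lab gene finder: the function signature is an assumption made for illustration, and the two-state model in the usage note is invented.

```python
from typing import Dict, List, Sequence, Tuple

LogTable = Dict[str, Dict[str, float]]


def viterbi(
    observations: Sequence[str],
    states: Sequence[str],
    log_start: Dict[str, float],
    log_trans: LogTable,
    log_emit: LogTable,
) -> Tuple[List[str], float]:
    """Return the most probable state path and its log probability."""
    if not observations:
        return [], 0.0
    # best[t][s]: log probability of the best path that ends in state s at position t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back: List[Dict[str, str]] = [{}]
    for obs in observations[1:]:
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]
```

As a toy usage, take two invented states, "E" (GC-rich) and "N" (AT-rich), with emission probabilities 0.4/0.1, a 0.9 probability of staying in a state and equal start probabilities, all passed as logarithms. Decoding `"AAAGGGGGGAAA"` labels the run of Gs as "E" and the flanks as "N". Because the search is dynamic programming over positions and states, it finds the optimal path without enumerating every candidate.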

7 Adding external information
Alignment
.....ATGACTGGGGT-TACAGTTAA.....GTACGATGT-ATTGCT                            GATAACCTAA....
     ||||| || || |||||||||     ||| ||||| ||| ||                            ||| ||||||
TGTCCATGACCGGAGTCTACAGTTAAACGGAGTATGATGTCATTACTAGTACCATCAGGATCGTCAATACCTACAGATTACCTAATACC

8 Types of external information
Diagram labels: DNA sequence, Aligned transcripts, Evolutionary conservation, Tiling array data, Gene predictions
- Conservation: D. erecta and D. pseudoobscura
- Transcripts: ESTs and mRNA
- Tiling arrays: Affymetrix and Aaron

9 PASA
Figure labels: Assembly 1, Assembly 2
Brian J. Haas et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. NAR : 5654–5666.

10 Adding the cDNA information
||||| |||||| |||| |||||
TGTCCATGACCGGAGTCTACAGTTAAACGGAGTATGATGTCATTACTAGTACCATCAGGATCGTCAATACCTACAGATTACCTAATACC

11 Creating an annotation
We can use PASA to “update” gene predictions
Figure labels: Assembly 1, Assembly 2, Alternative splice, Gene prediction

12 Difficult: dscam

13 Difficult: dscam

14 Quirks
- In-frame stop codons (selenocysteines?)
- Uncommon splice sites
  - Non GT/AG: GC/AG or AT/AC
- Genome rearrangements
- Dicistronic genes (androcam)
- Trans-splicing (mod(mdg4))
- More?

15 Storing the data
Diagram labels: DNA sequence, Evolutionary conservation, Tiling array data, Gene predictions, GenomeDB (Brian Koebbe), PASA clusters, Manual annotation, Genome annotation, Aligned transcripts

16 Tiling arrays Aaron Tenney

17 Goal
- Combine computational gene finding and tiling array analysis
  - Improve prediction accuracy on protein coding genes
  - Predict different forms of genes in different hybridization conditions

18 Tiling arrays complement other information sources
- Tiling arrays vs. DNA sequence
  - No explicit use of sequence, so not as biased by genes in the training set
  - Easier to find atypical novel genes (odd splice sites or codon usage)
- Tiling arrays vs. evolutionary conservation
  - Much conserved sequence is not transcribed
  - Tiling arrays will help sift out conserved but non-transcribed sequence
- Tiling arrays vs. aligned ESTs
  - Similar to information from aligned ESTs
  - Less biased to high copy number transcripts and 3’ ends
  - More complete view of transcriptome

19 Tiling arrays complement other information sources

20 Challenges
- Most literature on analysis of oligonucleotide arrays is about expression arrays
  - Sets of probes designed to query specific genes
- Analysis of tiling arrays is different
  - Determining which probes are hybridizing instead of estimating expression levels
  - Looser probe design criteria, noisier data

21 Low level analysis questions
- Individual probe intensities
- Normalization
- Probe sequence specific corrections
- Cross hybridization

22 Data integration questions
- Adding tiling arrays to information we already use
- Resolution vs. noise reduction tradeoff
- Sequence representation / feature functions
- Modeling entities of interest in the genome
  - Protein and non protein coding genes
  - Non-genes, “Dark matter”
- Correlations to DNA / conservation / EST signals
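The resolution vs. noise reduction tradeoff can be illustrated with a sliding-window median over probe intensities. This is only a sketch: the slides do not say which smoothing method was used, and the function name `smooth` and the intensity values in the usage note are invented for illustration.

```python
from statistics import median
from typing import List, Sequence


def smooth(intensities: Sequence[float], half_window: int) -> List[float]:
    """Replace each probe intensity by the median of its neighborhood.

    The neighborhood spans half_window probes on each side and is
    truncated at the ends of the array.
    """
    n = len(intensities)
    return [
        median(intensities[max(0, i - half_window):min(n, i + half_window + 1)])
        for i in range(n)
    ]
```

Take the invented signal `[0, 0, 9, 0, 0, 0, 0, 5, 5, 5, 0, 0, 0, 0]`, which contains a single-probe spike (9) and a three-probe feature (5, 5, 5). With `half_window=1` the spike is removed and the three-probe feature survives. With `half_window=3` both are flattened to zero: a wider window suppresses more noise but can no longer resolve short features such as small exons.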

23 Validation experiments Laura Langton

24 Prediction Validation
- Which predictions to validate?
  - Filter predictions for exon overlap with existing experimental evidence
    - Evidence = mRNAs and PASA clustered ESTs
  - Classify predictions into 3 major categories:
    - Known
    - Partially verified
    - Novel
  - Feature of interest = splice sites
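One way to express the three categories in code is to count how many of a prediction's introns (splice-site pairs) already appear in the evidence set. The slides do not give the exact rule, so the criterion below (all introns supported, some, or none) is an assumption. The intron encoding and the name `classify_prediction` are also hypothetical.

```python
from typing import Iterable, Set, Tuple

# Hypothetical encoding: (chromosome, donor position, acceptor position).
Intron = Tuple[str, int, int]


def classify_prediction(
    predicted_introns: Iterable[Intron],
    evidence_introns: Set[Intron],
) -> str:
    """Classify a prediction by how many of its introns have evidence."""
    introns = list(predicted_introns)
    if not introns:
        # No splice sites to check, so the prediction falls outside the three categories.
        return "single exon"
    supported = sum(1 for intron in introns if intron in evidence_introns)
    if supported == len(introns):
        return "known"
    if supported > 0:
        return "partially verified"
    return "novel"
```

Here `evidence_introns` would be built from the aligned mRNAs and PASA clustered ESTs. Exact-match comparison of intron coordinates is a simplification of the exon-overlap filter described on the slide.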

25 Known Gene

26 Known Gene

27 Partial

28 Novel

29 Novel

30 Categories not currently tested
- Alternative splices
- Single exon genes
- UTRs
- Structural disagreements

31 Structural disagreements

32 Alternate Splice

33 Experimental Validation - RT PCR
- Design primers to span one or more unverified introns.
- Reverse transcribe RNA from whole fly or cell lines.
- PCR 650 bp amplicons.
- Directly sequence.
- Align resulting ESTs to genome.

34 Sequence Data

35 RT - PCR
Diagram labels: Our EST data [RTDB], DNA sequence, Aligned ESTs, Evolutionary conservation, Tiling array data, Gene predictions

36 RT Database
- Types of data in RTDB (examples)
  - Traces, reads, quality values
  - Primers, amplicons
  - Predictions, genome version
  - Experiment information
- Accessible to collaborators
- Schema available on request
Charles Comstock

37 Preliminary results
- Novel: 176 predictions tested, 51% hit rate
- Partial: 442 predictions tested, 74% hit rate