How can we find genes? Search for them. Look them up.



How do I get from this… >mouse_ear_cress_1080 GAAATAATCAATGGAATATGTAGAGGTCTCCTGTACCTTCACAGAGATTCTAGGCTGAGAGCAGTGCATATAGATATCTTT CGTACTCATCTGCTTTTTCTGGTCTCCATCACAAAAGCCAACTAGGTAATCATATCAATCTCTCTTTACCGTTTACTCGAC CTTTTCCAATCAGGTGCT TCTGGTGTGTCTACTACTATCAGTTTTAGGTCTTTGTATACCTGATCTTATCTGCTACTG AGGCTTGTAAAAGTGATTAAAACTGTGACATTTACTCTAAGAGAAGTAACCTGTTTGATGCATTTCCCTAATATACCGGTG TGGAAAAGTGTAGGTATCTGTACTCAGCTGAAATGGTGGACGATTTTGAAGAAGATGAACTCTCATTGACTGAAAGCGGGT TGAAGAGTGAAGATGGCGTTATTATCGAGATGAATGTCTCCTGGATGCTTTTATTATCATGTTTGGGAATTTACCAAGGGA GAGGTATCAGAATCTATCTTAGAAGGTTACATTTAGCTCAAGCTTGCATCAACATCTTTACTTAGAGCTCTACGGGTTTTA GTGTGTTTGAAGTTTCTTAACTCCTAGTATAATTAGAATCTTCTGCAGCAGACTTTAGAGTTTTGGGATGTAGAGCTAACC AGAGTCGGTTTGTTTAAACTAGAATCTTTTTATGTAGCAGACTTGTTCAGTACCTGAATACCAGTTTTAAATTACCGTCAG ATGTTGATCTTGTTGGTAATAATGGAGAAACGGAAGAATAATTAGACGAAACAAACTCTTTAAGAACGTATCTTTCAGTTT TCCATCACAAATTTTCTTACAAGCTACAAAAATCGAACTATATATAACTGAACCGAATTTAAACCGGAGGGAGGGTTTGAC TTTGGTCAATCACATTTCCAATGATACCGTCGTTTGGTTTGGGGAAGCCTCGTCGTACAAATACGACGTCGTTTAAGGAAA GCCCTCCTTAACCCCAGTTATAAGCTCAAAGTTGTACTTGACCTTTTTAAAGAAGCACGAAACGAAAAACCCTAAAATTCC CAAGCAGAGAAAGAGAGACAGAGCAAGTACAGATTTCAACTAGCTCAAGATGATCATCCCTGTTCGTTGCTTTACTTGTGG AAAGGTTGATATTTTCCCCTTCGCTTTGGTCTTATTTAGGGTTTTACTCCGTCTTTATAGGGTTTTAGTTACTCCAAATTT GGCTAAGAAGAGATCTTTACTCTCTGTATTTGACACGAATGTTTTTAATCGGTTGGATACATGTTGGGTCGATTAGAGAAA TAAAGTATTGAGCTTTACTAAGCTTTCACCTTGTGATTGGTTTAGGTGATTGGAAACAAATGGGATCAGTATCTTGATCTT CTCCAGCTCGACTACACTGAAGGGTAAGCTTACAATGATTCTCACTTCTTGCTGCTCTAATCATCATACTTTGTGTCAAAA AGAGAGTAATTGCTTTGCGTTTTAGAGAAATTAGCCCAGATTTCGTATTGGGTCTGTGAAGTTTCATATTAGCTAACACAC TTCTCTAATTGATAACAGAAGCTATAAAATAGATTTGCTGATGAAGGAGTTAGCTTTTTATAATCTTCTGTGTTTGTGTTT TACTGTCTGTGTCATTGGAAGAGACTATGTCCTGCCTATATAATCTCTATGTGCCTATCTAGATTTTCTATACAATTGATA TTTGATAGAAGTAGAAAGTAAGACTTAAGGTCTTTTGATTAGACTTGTGCCCATCTACATGATTCTTATTGGACTAATCAT TCTTTGTGTGAAAATAGAATACTTTGTCTGAACATGAGAGAATGGTTCATAATACGTGTGAAGTATGGGATTAGTTCAACA 
ATTTCGCTATTGGAGAAGCAAACCAAGGGTTAATCGTTTATAGGGTTAAGCTAATGCTCTGCTCTTTATATGTTATTGGAA CAGACTATTGTTGTGCCTATCTTGTTTAGTTGTAGATTCTATCTCGACTGTTATAAGTATGACTGAAGGCTTGATGACTTA TGATTCTCTTTACACCTGTAGAAGGATTTAAGCTTGGTGTCTAGATATTCAATCTGTGTTGGTTTTGTCTTTCTTTTGGCT CTTAGTGTTGTTCAATCTCCTCAATAGGTATGAAGTTACAATATCCTTATTATTTTGCAGGGACGCACTTGATGCACTCCA GCTAGTCAGATACTGCTGCAGGCGTATGCTAATGACCTTGCATCAACATCTTTACTTAGAGCTCTACGGGTTTTAGTGTGT

…to this?

Meaning?

Mathematical Tools (Code; statistics)

Comparative Tools (Database searches)

What do we know about genes?
Expressed (Transcribed)
– Transcriptional start & termination sites (TXSS, TXTS)
– Transcription artefacts (cDNA & ESTs)
Regulated
– Promoters (TATAAA)
– Transcription Factor Binding Sites
– CpG (Cytosine methylation)
Meaningful (Translated)
– 3n basepairs
– Codon usage
– Translational start & stop/termination codons (TLSS, TLTS)
– Translation artefacts (proteins)
Spliced
– Splice sites (GT-AG)
Derived (Homology: Paralogy/Orthology)
– Search for known genes, proteins (BLAST)
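Two of the signals listed above, the TATAAA promoter motif and codon usage, can be looked for with very little code. The following is a minimal sketch, not a gene finder: the function names `find_motif` and `codon_usage` and the toy sequence are hypothetical and chosen only for illustration, and only the forward strand is examined.

```python
from collections import Counter


def find_motif(seq, motif):
    """Return 0-based start positions of every occurrence of motif in seq.

    Overlapping occurrences are reported too.
    """
    seq, motif = seq.upper(), motif.upper()
    hits = []
    pos = seq.find(motif)
    while pos != -1:
        hits.append(pos)
        pos = seq.find(motif, pos + 1)
    return hits


def codon_usage(cds):
    """Count codons in a coding sequence, read in frame from its first base.

    Trailing bases that do not fill a complete triplet are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


# Hypothetical toy sequence, invented for this example.
toy = "GGTATAAAGGCATGGCTGCTTAA"
print(find_motif(toy, "TATAAA"))
print(codon_usage("ATGGCTGCTTAA"))
```

A hit for TATAAA is only a hint: the motif also occurs by chance, which is why such signals are usually combined with other evidence.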

How might this knowledge help to find genes?
Predict genes
– Look for potential starts and stops.
– Connect them into open reading frames (ORFs).
– Filter for “correct” length & codon usage.
Search databases
– Known genes: UniGene
– Known proteins: UniProt
Use transcript evidence
– cDNA
– ESTs
– proteins

Operating computationally
Go to beginning of sequence → start SCAN
If ATG → register putative TLSS; then
– Move in 3-steps & count steps (= COUNTS)
– If 3-step = (TAA or TAG or TGA) → register putative TLTS
– If register → evaluate COUNTS (= triplets)
   If COUNTS < minimum → discard; then go behind ATG above and start SCAN
   If COUNTS > maximum → discard; then go behind ATG above and start SCAN
   If minimum < COUNTS < maximum → record as GENE with TLSS, TLTS; then go behind ATG above and start SCAN
Arrive at end of sequence → stop SCAN
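The scan described above can be sketched in a few lines of Python. This is one possible reading of the slide, under stated assumptions: COUNTS is taken to be the number of codons from the ATG up to, but not including, the stop codon; only the forward strand is scanned; and the name `scan_orfs` and its default limits are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def scan_orfs(seq, min_codons=2, max_codons=10_000):
    """Scan the forward strand for ATG...stop open reading frames.

    Returns a list of (tlss, tlts_end, counts) tuples, where tlss is the
    0-based index of the ATG, tlts_end is the index just past the stop
    codon, and counts is the number of codons before the stop
    (assumption: the ATG is counted, the stop codon is not).
    """
    seq = seq.upper()
    genes = []
    pos = 0
    while True:
        start = seq.find("ATG", pos)  # register putative TLSS
        if start == -1:
            break  # arrived at end of sequence: stop SCAN
        counts = 0
        # Move in 3-steps from the ATG and count the steps.
        for i in range(start, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:  # register putative TLTS
                if min_codons < counts < max_codons:
                    genes.append((start, i + 3, counts))
                break
            counts += 1
        pos = start + 3  # go behind this ATG and resume the SCAN
    return genes


# Hypothetical toy sequence: ATG GCT GCT GCT TAA, padded on both sides.
print(scan_orfs("CCATGGCTGCTGCTTAAGG", min_codons=2, max_codons=100))
```

Because the scan resumes directly behind each ATG, a downstream in-frame ATG yields a second, shorter ORF that ends at the same stop codon. An ATG with no in-frame stop before the end of the sequence is silently dropped.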

Annotation workflow (diagram labels):
– Get/Generate sequence
– Mathematical evidence
– Biological evidence
– Construct gene models
– Browse results
– Browse in context
– Find gene families
– Analyze large data sets

Annotation Cheat Sheet
A. DNA Subway
– Open existing project or generate new (Red square)
– Run RepeatMasker
– Generate evidence (Predictions, BLAST searches)
– Synthesize evidence into gene models (Apollo)
– Browse results locally and in context (Phytozome)
– Conduct functional analysis (link from Browser)
– Prospect for gene family (Yellow Line from Browser)
B. Apollo
– Select region that holds biological gene evidence
– Optimize work space and zoom to region (View tab)
– Expand all tiers (Tiers tab)
– Drag evidence item(s) onto workspace (mouse)
– Edit to match biol. evidence (right-click item for tools)
– Record what was done in Annotation Info Editor
– Assess necessity to build alternative model(s)
– Upload model(s) to DNA Subway (File tab)

Predictors (mathematical evidence)
Utilize predominantly mathematical (statistical) methods.
Search for patterns
– Some score starts, stops, splice sites (GenScan).
– Some score nucleotides (Augustus, FGenesH).
Few incorporate EST data and/or known genes/proteins.
Require optimization for each new species (training).
Accuracy:
– False positives (scoring non-genes as genes): 5%–50%.
– False negatives (missed genes): 5%–40%.
– Weak at, or incapable of, determining first and last exons and UTRs.
Specific for gene models (spliced genes, non-spliced genes).
Specialty predictors (tRNAscan-SE, RepeatMasker).
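To make the two accuracy figures concrete, here is a small sketch of how false-positive and false-negative fractions could be computed when a predictor's output is compared with a trusted reference annotation. The slide does not define the denominators, so the choice here is an assumption: false positives are divided by the number of predictions, and false negatives by the number of reference genes. The function name and all coordinates are hypothetical, and genes count as matching only when start and end agree exactly, which is stricter than most real evaluations.

```python
def prediction_error_rates(predicted, reference):
    """Return (false_positive_fraction, false_negative_fraction).

    predicted and reference are iterables of (start, end) gene coordinates.
    Assumption: a prediction is correct only if it matches a reference
    gene exactly.
    """
    predicted, reference = set(predicted), set(reference)
    false_pos = predicted - reference  # non-genes scored as genes
    false_neg = reference - predicted  # genes the predictor missed
    fp_fraction = len(false_pos) / len(predicted) if predicted else 0.0
    fn_fraction = len(false_neg) / len(reference) if reference else 0.0
    return fp_fraction, fn_fraction


# Hypothetical coordinates, invented for this example.
reference = [(100, 400), (900, 1500), (2000, 2600), (3000, 3300)]
predicted = [(100, 400), (900, 1500), (2700, 2900), (3000, 3300), (5000, 5200)]
fp, fn = prediction_error_rates(predicted, reference)
print(f"false positives: {fp:.0%}, false negatives: {fn:.0%}")
```

With these toy numbers, two of five predictions are spurious and one of four reference genes is missed.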

Search tools (biological evidence)
Search sequence (molecules; tangible) databases:
– Known genes
– Known proteins
– cDNAs & ESTs
Utilize alignment methods (BLAST, BLAT).
Reliability:
– Good in determining gene locations and general gene structures.
– Weak in exactly determining exon/intron borders.
– Unlikely to correctly determine TXSS and TXTS.
– Should be used with cDNA/EST from same species as genome.

Sequence & course material repository
Don’t open items, save them to your computer!!
– Annotation (sequences & evidence)
– Manuals (DNA Subway, Apollo, JalView)
– Presentations (.ppt files)
– Prospecting (sequences)
– Readings (Bioinformatics tools, splicing, etc.)
– Worksheets (Word docs, handouts, etc.)
– BCR-ABL (temporary; not course-related)