1 DNA Classifications with Self-Organizing Maps (SOMs) Thanakorn Naenna Mark J. Embrechts Robert A. Bress May 2003 IEEE International Workshop on Soft.

Presentation transcript:

1 DNA Classifications with Self-Organizing Maps (SOMs)
Thanakorn Naenna, Mark J. Embrechts, Robert A. Bress
May 2003, IEEE International Workshop on Soft Computing in Industrial Applications

2 Presentation Outline
- Introduction to DNA Splice Junctions
- Data Collection
- Introduction to SOMs
- SOM for DNA Splice Junction Classification
- Results
- Conclusions

4 Human genome in a nutshell
- Human: 23 pairs of chromosomes
- Chromosomes → thousands of genes
- Gene → information: exons; comments: introns
- Splice junctions are like /* comment flags */ in C code
- Exons and introns → codons
- Codon → bases

5 DNA Splice Junctions
- DNA → billions of nucleotides (A, C, G, T)
- Genes → sequences of nucleotides coding for amino acids (exons) that are often interrupted by non-coding nucleotides (introns)
- <0.1% of human DNA is made up of exons
- 99% of splice junctions have the same motif:
  – Exon to intron: GT
  – Intron to exon: AG
- Example (labels on the slide: Splice Junction, Exon, Intron): ….GTGAAGGTTAA AGATGTAGAT GT ATTG…
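The GT and AG motifs above can be located with a simple scan. The sketch below is only an illustration: the function name `candidate_splice_sites` is made up for this example, and it returns every GT and AG position, most of which are not real junctions in genomic DNA. Telling the real ones apart is the classification problem the later slides address.

```python
def candidate_splice_sites(seq):
    """Return (donors, acceptors): 0-based start positions of every GT
    (possible exon-to-intron site) and every AG (possible intron-to-exon
    site) in seq. These are candidates only, not confirmed junctions."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


# The fragment shown on the slide, with spaces and ellipses removed.
fragment = "GTGAAGGTTAAAGATGTAGATGTATTG"
donors, acceptors = candidate_splice_sites(fragment)
print("GT at:", donors)
print("AG at:", acceptors)
```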

6 Data Collection: HTML Browser + Perl scripts
BioBrowser → Download HTML → ExtractLinks() → Download HTML data → ExtractData() → TranslateData()
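The slide names the pipeline steps but does not show their code, and the original scripts were written in Perl. As a rough idea of what a link-extraction step such as ExtractLinks() might do, here is a Python stand-in using only the standard library. The class name `LinkCollector`, the function `extract_links`, and the sample page are all hypothetical; what ExtractData() and TranslateData() do is not described in the slides, so they are not sketched.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Hypothetical stand-in for the ExtractLinks() step on the slide."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Made-up page content, for illustration only.
page = '<html><body><a href="/entry/1">one</a> <a href="/entry/2">two</a></body></html>'
print(extract_links(page))
```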

8 DNA Splice Junction (Cont.)
- A complete gene is made up of different exons
- Splice junction identification aids in the discovery of new genes
- The dataset used for this study is made up of 1,424 sequences
- Data were created ab initio from GenBank
- Each sequence is 32 nucleotides long, covering the region from -15 to +15 nucleotides around the splice junction
- Example (labels on the slide: Intron, Splice Junction, Exon): …T GTAAG G AG ACGA GTT …
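A 32-nucleotide sample can be read as 15 nucleotides, the 2-nucleotide motif, and 15 more nucleotides. The sketch below cuts such a window and turns it into a numeric vector. Two things here are assumptions, not taken from the slides: the exact alignment of the window around the motif, and the encoding. The demo slide later lists 160 features, which would match 32 positions × 5 indicator values, so the sketch uses a five-symbol alphabet (A, C, G, T plus a catch-all N); the authors' actual encoding is not stated.

```python
ALPHABET = "ACGTN"  # assumption: fifth slot catches any other symbol


def junction_window(seq, motif_start, flank=15):
    """Return the window of `flank` bases, the 2-base motif starting at
    motif_start, and `flank` more bases; None if it does not fit in seq."""
    start, end = motif_start - flank, motif_start + 2 + flank
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]


def one_hot(window):
    """Encode each base as len(ALPHABET) indicator values (0 or 1)."""
    features = []
    for base in window.upper():
        idx = ALPHABET.find(base)
        if idx < 0:
            idx = ALPHABET.index("N")  # unknown symbols share the N slot
        features.extend(1 if i == idx else 0 for i in range(len(ALPHABET)))
    return features


# Synthetic sequence, for illustration only.
seq = "A" * 15 + "GT" + "C" * 15
window = junction_window(seq, 15)
print(len(window), len(one_hot(window)))
```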

9 Self-Organizing Maps (SOM) Network
- Unsupervised learning neural network
- Projects high-dimensional input data onto a two-dimensional output map
- Preserves the topology of the input data
- Visualizes structures and clusters of the data
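To make the points above concrete, here is a minimal sketch of the standard online SOM training loop in plain Python: find the best-matching unit for each sample, then pull that neuron and its grid neighbours toward the sample. The grid size, the linear decay schedules, the Gaussian neighbourhood, and all function names are choices made for this illustration; the slides do not say which settings or software the authors used.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return (row, col) of the neuron whose weight vector is closest to x."""
    best, best_d = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(data, rows=4, cols=4, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols SOM on a list of equal-length numeric vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighbourhood radius shrinks
        for x in rng.sample(data, len(data)):    # visit samples in random order
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += lr * h * (x[i] - w[i])
    return weights


# Toy data: two small clusters in 3-D (made up for illustration).
data = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [0.0, 0.0, 0.1],
        [1.0, 0.9, 1.0], [0.9, 1.0, 0.9], [1.0, 1.0, 0.9]]
som = train_som(data)
print(best_matching_unit(som, [0.0, 0.0, 0.0]))
print(best_matching_unit(som, [1.0, 1.0, 1.0]))
```

Because neighbours of the winning neuron are updated together, nearby inputs tend to end up on nearby map cells, which is the topology-preserving property listed on the slide.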

10 Use of SOM for DNA Splice Junction Classification
- DNA training set → SOM → SOM classification map (U-Matrix Map with regions A, B, C)
- DNA test set → Model SOM → Classification
- Classification
  – Class A: intron to exon
  – Class B: exon to intron
  – Class C: no transition
- Neuron identification methods
  – Highest frequency class
  – Closest neuron
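The slide names two neuron identification methods without defining them. The sketch below shows one plausible reading, stated as an assumption: each neuron takes the most frequent class among the training samples it wins ("highest frequency class"), and a neuron that wins no training sample borrows the label of the labelled neuron closest to it in weight space ("closest neuron"). The function names and the toy map are invented for this example.

```python
from collections import Counter


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def bmu(weights, x):
    """weights maps (row, col) -> weight vector; return the closest neuron."""
    return min(weights, key=lambda pos: sq_dist(weights[pos], x))


def label_neurons(weights, samples, labels):
    """Assign a class to every neuron from labelled training samples."""
    hits = {pos: Counter() for pos in weights}
    for x, y in zip(samples, labels):
        hits[bmu(weights, x)][y] += 1
    # Highest-frequency class for neurons that won at least one sample.
    neuron_label = {pos: c.most_common(1)[0][0] for pos, c in hits.items() if c}
    # Assumed fallback: unlabelled neurons copy the closest labelled neuron.
    labelled = dict(neuron_label)
    for pos in weights:
        if pos not in labelled:
            nearest = min(labelled, key=lambda p: sq_dist(weights[p], weights[pos]))
            neuron_label[pos] = labelled[nearest]
    return neuron_label


def classify(weights, neuron_label, x):
    """Classify a test vector by the label of its best-matching neuron."""
    return neuron_label[bmu(weights, x)]


# Toy map and toy training data (made up for illustration).
toy_weights = {(0, 0): [0.0, 0.0], (0, 1): [0.4, 0.4], (1, 0): [1.0, 1.0]}
toy_samples = [[0.1, 0.0], [0.0, 0.1], [0.05, 0.05], [0.9, 1.0]]
toy_labels = ["A", "A", "C", "B"]
toy_map = label_neurons(toy_weights, toy_samples, toy_labels)
print(toy_map)
print(classify(toy_weights, toy_map, [0.8, 0.8]))
```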

11 The U-matrix of the DNA Training Set
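For readers unfamiliar with the U-matrix: it shows, for each map cell, how far that neuron's weight vector lies from those of its grid neighbours, so cluster borders show up as ridges of high values. The sketch below uses one common simplified form (mean Euclidean distance to the up, down, left, and right neighbours); it is not necessarily the variant used to produce the figure on this slide, and the function name and toy grid are made up.

```python
import math


def u_matrix(weights):
    """weights[r][c] is a weight vector; return a rows x cols grid holding
    the mean distance from each neuron to its 4-connected grid neighbours."""
    rows, cols = len(weights), len(weights[0])
    umat = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            umat[r][c] = sum(dists) / len(dists) if dists else 0.0
    return umat


# Toy 1 x 3 map: the third neuron is far from the first two.
toy_grid = [[[0.0, 0.0], [0.0, 1.0], [5.0, 1.0]]]
print(u_matrix(toy_grid))
```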

12 SOM Results for DNA Splice Junction Data
[Figures: the U-matrix of the DNA training set, with regions A, B and C; confusion matrix of the 424-sequence DNA test set]

13 Conclusions
- SOM is effective in DNA splice junction classification
- SOM is a powerful visualization tool for high-dimensional data

14 Demo with Analyze Code
- 800 training data, 324 test data (160 features)
- 96% correct overall classification on test data
- Confusion Matrix
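The overall classification rate quoted above is the share of test samples on the diagonal of the confusion matrix. The sketch below shows how both are computed for the three classes A, B, and C. The labels in the example are toy values invented for illustration; they are not the authors' results, and the function names are not from the Analyze code.

```python
CLASSES = ["A", "B", "C"]  # intron to exon, exon to intron, no transition


def confusion_matrix(true_labels, predicted_labels, classes=CLASSES):
    """matrix[i][j] counts samples of true class i predicted as class j."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def overall_accuracy(matrix):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Toy labels, for illustration only.
toy_true = ["A", "A", "B", "B", "C", "C"]
toy_pred = ["A", "B", "B", "B", "C", "A"]
toy_cm = confusion_matrix(toy_true, toy_pred)
for name, row in zip(CLASSES, toy_cm):
    print(name, row)
print("overall accuracy:", overall_accuracy(toy_cm))
```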

GATCAATGAGGTGGACACCAGAGGCGGGGACTTGTAAATAACACTGGGCTGTAGGAGTGA TGGGGTTCACCTCTAATTCTAAGATGGCTAGATAATGCATCTTTCAGGGTTGTGCTTCTA TCTAGAAGGTAGAGCTGTGGTCGTTCAATAAAAGTCCTCAAGAGGTTGGTTAATACGCAT GTTTAATAGTACAGTATGGTGACTATAGTCAACAATAATTTATTGTACATTTTTAAATAG CTAGAAGAAAAGCATTGGGAAGTTTCCAACATGAAGAAAAGATAAATGGTCAAGGGAATG GATATCCTAATTACCCTGATTTGATCATTATGCATTATATACATGAATCAAAATATCACA CATACCTTCAAACTATGTACAAATATTATATACCAATAAAAAATCATCATCATCATCTCC ATCATCACCACCCTCCTCCTCATCACCACCAGCATCACCACCATCATCACCACCACCATC ATCACCACCACCACTGCCATCATCATCACCACCACTGTGCCATCATCATCACCACCACTG TCATTATCACCACCACCATCATCACCAACACCACTGCCATCGTCATCACCACCACTGTCA TTATCACCACCACCATCACCAACATCACCACCACCATTATCACCACCATCAACACCACCA CCCCCATCATCATCATCACTACTACCATCATTACCAGCACCACCACCACTATCACCACCA CCACCACAATCACCATCACCACTATCATCAACATCATCACTACCACCATCACCAACACCA CCATCATTATCACCACCACCACCATCACCAACATCACCACCATCATCATCACCACCATCA CCAAGACCATCATCATCACCATCACCACCAACATCACCACCATCACCAACACCACCATCA CCACCACCACCACCATCATCACCACCACCACCATCATCATCACCACCACCGCCATCATCA TCGCCACCACCATGACCACCACCATCACAACCATCACCACCATCACAACCACCATCATCA CTATCGCTATCACCACCATCACCATTACCACCACCATTACTACAACCATGACCATCACCA CCATCACCACCACCATCACAACGATCACCATCACAGCCACCATCATCACCACCACCACCA CCACCATCACCATCAAACCATCGGCATTATTATTTTTTTAGAATTTTGTTGGGATTCAGT ATCTGCCAAGATACCCATTCTTAAAACATGAAAAAGCAGCTGACCCTCCTGTGGCCCCCT TTTTGGGCAGTCATTGCAGGACCTCATCCCCAAGCAGCAGCTCTGGTGGCATACAGGCAA CCCACCACCAAGGTAGAGGGTAATTGAGCAGAAAAGCCACTTCCTCCAGCAGTTCCCTGT GATCAATGAGGTGGACACCAGAGGCGGGGACTTGTAAATAACACTGGGCTGTAGGAGTGA TGGGGTTCACCTCTAATTCTAAGATGGCTAGATAATGCATCTTTCAGGGTTGTGCTTCTA TCTAGAAGGTAGAGCTGTGGTCGTTCAATAAAAGTCCTCAAGAGGTTGGTTAATACGCAT GTTTAATAGTACAGTATGGTGACTATAGTCAACAATAATTTATTGTACATTTTTAAATAG CTAGAAGAAAAGCATTGGGAAGTTTCCAACATGAAGAAAAGATAAATGGTCAAGGGAATG GATATCCTAATTACCCTGATTTGATCATTATGCATTATATACATGAATCAAAATATCACA CATACCTTCAAACTATGTACAAATATTATATACCAATAAAAAATCATCATCATCATCTCC ATCATCACCACCCTCCTCCTCATCACCACCAGCATCACCACCATCATCACCACCACCATC ATCACCACCACCACTGCCATCATCATCACCACCACTGTGCCATCATCATCACCACCACTG 
TCATTATCACCACCACCATCATCACCAACACCACTGCCATCGTCATCACCACCACTGTCA TTATCACCACCACCATCACCAACATCACCACCACCATTATCACCACCATCAACACCACCA CCCCCATCATCATCATCACTACTACCATCATTACCAGCACCACCACCACTATCACCACCA CCACCACAATCACCATCACCACTATCATCAACATCATCACTACCACCATCACCAACACCA CCATCATTATCACCACCACCACCATCACCAACATCACCACCATCATCATCACCACCATCA CCAAGACCATCATCATCACCATCACCACCAACATCACCACCATCACCAACACCACCATCA CCACCACCACCACCATCATCACCACCACCACCATCATCATCACCACCACCGCCATCATCA TCGCCACCACCATGACCACCACCATCACAACCATCACCACCATCACAACCACCATCATCA CTATCGCTATCACCACCATCACCATTACCACCACCATTACTACAACCATGACCATCACCA CCATCACCACCACCATCACAACGATCACCATCACAGCCACCATCATCACCACCACCACCA CCACCATCACCATCAAACCATCGGCATTATTATTTTTTTAGAATTTTGTTGGGATTCAGT ATCTGCCAAGATACCCATTCTTAAAACATGAAAAAGCAGCTGACCCTCCTGTGGCCCCCT TTTTGGGCAGTCATTGCAGGACCTCATCCCCAAGCAGCAGCTCTGGTGGCATACAGGCAA CCCACCACCAAGGTAGAGGGTAATTGAGCAGAAAAGCCACTTCCTCCAGCAGTTCCCTGT THE END
