Groningen Bioinformatics Centre. Yang Li and Rainer Breitling. Dagstuhl seminar, March 2007. Analyzing genome tiling microarrays for the detection of novel expressed genes.

Presentation transcript:

Analyzing genome tiling microarrays for the detection of novel expressed genes
Yang Li and Rainer Breitling, Groningen Bioinformatics Centre
Dagstuhl seminar, March 2007
Preliminary version, 23 Feb 2007

Outline
Introduction to tiling arrays
Published research on exon finding
Our data set
Machine learning for exon finding
Results

Background
Genomic tiling array: probes are designed to blanket an entire genomic region of interest and are used to detect the presence or absence of transcription.
Tiling: a sequence of probes spanning a genomic region is called a “tile path” or a “tiling”.

Two types of tiling array construction:
1) Oligonucleotide tiling array
2) Tiling array constructed using PCR products
Trends in Genetics 2005, v21, 466

Detection of transcription
1) Discovery of novel genes
2) Discovery of novel non-coding RNAs
3) Alternative splicing study
Advantages:
1) The sensitivity of microarrays enables rare transcripts to be detected.
2) The parallel nature of the arrays enables numerous samples and genomic sequences to be analyzed.
3) The experimental design is not dependent on current genome annotations.

Recent Research
Surprising amounts of genomic ‘dark matter’
More than 50% of animal genomes may be transcribed
Novel protein-coding genes
Novel non-coding genes (rRNA, tRNA, snoRNA, miRNA…)
Antisense transcripts
Alternative isoforms and gene ‘extensions’
Leaky transcription
Technical noise/artifacts

Exon-intron discriminators
Kampa et al.: Hodges–Lehmann estimator (pseudo-median)
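A minimal Python sketch of the pseudo-median named on this slide, assuming the one-sample Hodges–Lehmann form (the median of all pairwise Walsh averages). The sliding-window helper, the `half_window` parameter and the example values are illustrative assumptions and are not taken from Kampa et al.

```python
from itertools import combinations_with_replacement
from statistics import median


def hodges_lehmann(values):
    """One-sample Hodges-Lehmann estimator: median of all Walsh averages
    (x_i + x_j) / 2 with i <= j."""
    walsh = [(a + b) / 2 for a, b in combinations_with_replacement(values, 2)]
    return median(walsh)


def smoothed_signal(positions, signal, half_window):
    """Pseudo-median of the signal over all probes lying within
    half_window bp of each probe (hypothetical windowing scheme)."""
    smoothed = []
    for p in positions:
        window = [s for q, s in zip(positions, signal) if abs(q - p) <= half_window]
        smoothed.append(hodges_lehmann(window))
    return smoothed


# A single outlier (100) barely moves the pseudo-median, unlike the mean.
print(hodges_lehmann([1, 2, 3, 4, 100]))
```

The estimator is quadratic in the window size, which is acceptable for windows of a few probes but would need a faster algorithm for long windows.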

Exon-intron discriminators
Schadt et al.: PCA
1. Probes are separated into 15 kb sliding windows
2. Calculate the robust principal components (between-sample correlation matrix)
3. Calculate the Mahalanobis distance of each probe from the center of the data in the first two dimensions of the principal component score (PCS)
4. Decide on exon vs. intron
5. Assign probes to transcriptional units
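Step 3 of the list above can be sketched in Python as follows. This is a simplified illustration that assumes the first two principal component scores per probe are already available and uses the ordinary sample mean and covariance rather than a robust estimate. The function name and the example scores are hypothetical.

```python
from statistics import mean


def mahalanobis_2d(scores):
    """Mahalanobis distance of each (pc1, pc2) point from the centre
    of the point cloud, using the sample covariance matrix."""
    xs = [s[0] for s in scores]
    ys = [s[1] for s in scores]
    mx, my = mean(xs), mean(ys)
    n = len(scores)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in scores) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in scores:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T for a 2x2 matrix
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(max(d2, 0.0) ** 0.5)
    return distances


scores = [(0, 0), (2, 0), (0, 2), (2, 2), (1, 1)]
print(mahalanobis_2d(scores))
```

Turning these distances into exon or intron calls (step 4) needs a cutoff, which the slide does not specify.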

Exon-intron discriminators
Our collaborators’ approach (Andrew Fraser and Tom Gingeras):
Use negative bacterial controls to calculate an intensity threshold corresponding to a 5% false positive rate in a given region.
Apply these intensity thresholds to generate positive probe maps, which are then joined together using two parameters: maxgap, the maximal distance between two positive probes, and minrun, the minimal size of a transfrag.
A minrun of 40 (two positive probes) or 80 (three positive probes) is a good starting point for these parameters.
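The thresholding and joining procedure described on this slide could look roughly like the Python sketch below. Several details are assumptions made for illustration only: the threshold is taken as an empirical quantile of the control intensities, the gap is measured between probe start positions, and a transfrag runs from the first positive probe start to the end of the last positive probe (`probe_len` of 25 bp). The function names are hypothetical.

```python
def control_threshold(control_intensities, fpr=0.05):
    """Intensity exceeded by at most a fraction `fpr` of the negative controls."""
    ranked = sorted(control_intensities)
    k = int((1 - fpr) * len(ranked))
    return ranked[min(k, len(ranked) - 1)]


def call_transfrags(starts, intensities, threshold, maxgap, minrun, probe_len=25):
    """Join positive probes into transfrags.

    Consecutive positive probes are merged when their start positions are
    at most `maxgap` bp apart; runs shorter than `minrun` bp are dropped.
    """
    positive = sorted(s for s, v in zip(starts, intensities) if v > threshold)
    runs = []
    for s in positive:
        if runs and s - runs[-1][1] <= maxgap:
            runs[-1][1] = s          # extend the current run
        else:
            runs.append([s, s])      # start a new run
    return [(first, last + probe_len) for first, last in runs
            if last + probe_len - first >= minrun]


starts = [0, 25, 50, 75, 100, 125, 150]
intensities = [10, 200, 210, 10, 10, 300, 10]
print(call_transfrags(starts, intensities, threshold=100, maxgap=25, minrun=40))
```

In this toy example the isolated positive probe at position 125 is discarded because a single 25 bp probe is shorter than the minrun of 40.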

About our tiling data
Affymetrix C. elegans Tiling 1.0R Array
Genome-wide gene expression: Chr I~V, Chr X and Chr M (mitochondrion)
Resolution: on average 25 bp
Negative bacterial controls
Samples: 21 samples across development (plus mutant)
Probes: 2,942,364 PM/MM pairs

About tiling data
Developmental time course (table of sample numbers per strain and stage)
Stages: L2, L3, L4, young adult, gravid adult, total
Strains: N, smg-1*
* smg-1: deficient in nonsense-mediated decay

Examples
LAP-1 (ZK353.6)
Genomic Position: III: bp
LAP-1 is expressed throughout the life cycle. While there appears to be marginally less LAP-1 message at 2 h and 40 h, corresponding to early L1 and young adult stages respectively, LAP-1 appears to be constitutively expressed. Densitometric analysis of LAP-1 expression compared to the housekeeping gene ama-1 shows some variation in LAP-1 expression, but this appears to be unrelated to moulting.

Example (figures): probe intensity, with intron and exon probes marked

Example 2 (figures): probe intensity

General impression (figures): Chr III, 2866 genes

PCA (figure)

Methods: machine learning
Aim
Find the most effective (correct) machine learning method that distinguishes between true exons and true introns
Find the simplest (fastest, most intuitive) method that achieves this task

Methods: machine learning
Main challenge
True exons and true introns are not known:
Annotated exons may be unexpressed
Annotated introns may be novel transcripts
Our approach
Ignore the problem and optimize supervised performance
Assumption
True novel transcripts will be similar to known ones

Methods: machine learning
1. Classification and regression tree (CART)
Binary recursive partitioning
Advantages:
Easy to understand
Easy to implement
Computationally cheap
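To make "binary recursive partitioning" concrete, here is a minimal Python sketch of a single partitioning step on one feature, assuming Gini impurity as the split criterion (a common choice for classification trees, not confirmed by the slides). A full tree would apply the same search recursively to each resulting subset and over all features. The feature values and labels (1 = exon, 0 = intron) are made up.

```python
def gini(labels):
    """Gini impurity of a set of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best split 'value <= threshold'.

    Returns (None, impurity_of_unsplit_data) if no split reduces impurity.
    """
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical median probe intensities with exon (1) / intron (0) labels.
print(best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))
```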

Methods: machine learning
2. Support vector machines (SVM)
(Figure: scatter plot of two classes; denotes +1, denotes 0)
How would you classify this data?

Maximum Margin
(Figure: denotes +1, denotes 0)
The classifier with the maximum margin is the ideal one.
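The idea of the margin can be illustrated with a short Python sketch that computes the geometric margin of a given linear classifier and compares two candidate separating lines. It does not train an SVM. The classes are coded as +1 and -1 here, which is the usual convention for this formula and an assumption relative to the slide's labels. The points and candidate lines are invented.

```python
def margin(w, b, points, labels):
    """Smallest signed distance of any point to the line w.x + b = 0.

    Labels must be +1 or -1. A positive result means every point lies on
    its correct side of the line.
    """
    norm = sum(wi * wi for wi in w) ** 0.5
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


points = [(0, 0), (0, 1), (2, 0), (2, 1)]
labels = [-1, -1, 1, 1]

# Both lines separate the classes, but the line x = 1 keeps a larger margin.
print(margin((1, 0), -1.0, points, labels))   # line x = 1
print(margin((1, 0), -0.5, points, labels))   # line x = 0.5
```

A maximum-margin classifier is the separating line for which this quantity is largest.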

Evaluation
Receiver Operating Characteristic curve (ROC curve)
(Figure: ROC. x-axis: False Positive Rate (1 - specificity); y-axis: True Positive Rate (sensitivity))

The Area Under an ROC Curve (AUC)
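As a small worked illustration of the ROC curve and its AUC, the Python sketch below sweeps a threshold over classifier scores, collects (false positive rate, true positive rate) points, and integrates them with the trapezoidal rule. The scores and labels (1 = exon, 0 = intron) are invented, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by lowering the threshold over the distinct scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
print(auc(roc_points(scores, labels)))
```

An AUC of 1 means every exon probe scores above every intron probe, while 0.5 corresponds to random ranking.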

Selection of informative features: intensities
(Figure panels: Raw, Normalized; Mean, Median, Max, Max_1)
Features: pm.i, pm.1, pm_1, pm.2, pm_2, mm.i, mm.1, mm_1, mm.2, mm_2

Selection of informative features: correlation
(Figure panels: Raw, Normalized; Pearson, Spearman)
Features: pm1, pm-1, mm1, mm-1

Selection of informative features
Summary
Almost all reasonable features are informative
No striking difference between mean and median, but they seem better than max and max_1
CC (correlation coefficient) is also informative; no striking difference between Pearson and Spearman
Quantile normalization doesn’t improve the result
Decision
Median and CC (Pearson) of non-normalized data are used to generate features
GC content or melting temperature can also be informative
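One way the chosen features might be computed is sketched below in Python. It rests on an interpretation that the slides do not spell out: the median is taken over a probe's PM intensities across samples, and the CC features are Pearson correlations, across samples, between a probe and its immediate neighbors along the genome. The data layout, the function names and the choice to return 0.0 for constant profiles are all assumptions.

```python
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def probe_features(pm):
    """pm: list of probes in genomic order, each a list of PM intensities per sample."""
    features = []
    for i, row in enumerate(pm):
        features.append({
            "median": median(row),
            # correlation with the previous / next probe; None at the array ends
            "cc_prev": pearson(row, pm[i - 1]) if i > 0 else None,
            "cc_next": pearson(row, pm[i + 1]) if i < len(pm) - 1 else None,
        })
    return features


pm = [[1, 2, 3], [2, 4, 6], [3, 2, 1]]  # three probes, three samples (made up)
for f in probe_features(pm):
    print(f)
```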

Selection of informative features: neighbors
(Figures: CART; SVM and CART)

Selection of informative features
ANOVA results (figure): Neighbours, MM, CC.PM, CC.MM, Tm

Results

Example tree (figure)

AUC ~ (expression level) (figure)

AUC ~ length(exon) (figure)

AUC ~ Tm (figure)

AUC ~ probe position within exon (figure)

AUC ~ (other factors): expression, exon length, melting temperature, relative position

Can minrun and maxgap improve the results?
(Figures: maxgap = 1, minrun = 3)

Maxgap and minrun optimization
(Tables with columns: Minrun/maxgap, Maxgap/minrun, thres, ccr, fpr, tpr)
(Figure: 1 - maxgap, 2 - minrun. Order: minrun/maxgap)

Maxgap and minrun conclusion
A minrun of 0 and a maxgap of 1 give the best overall result for our classifier.
Minrun and maxgap have minimal influence on the results if the classifier already uses neighboring probe information.

Future work
Joining of transfrags into transcriptional units (genes)
Differential gene expression between developmental stages and strains (ANOVA)
Detection of alternative splicing (ANOVA)

Acknowledgements
Yang Li and Ritsert Jansen, Groningen Bioinformatics Centre
Andrew Fraser, Wellcome Trust Sanger Institute, Cambridge
Tom Gingeras, Affymetrix, Santa Clara
Jan Kammenga, Nematology, Wageningen University