Groningen Bioinformatics Centre Yang Li and Rainer Breitling Dagstuhl seminar, March 2007 Analyzing genome tiling microarrays for the detection of novel.


1 Groningen Bioinformatics Centre
  Analyzing genome tiling microarrays for the detection of novel expressed genes
  Yang Li and Rainer Breitling
  Dagstuhl seminar, March 2007
  Preliminary version, 23 Feb 2007

2 Outline
  Introduction to tiling arrays
  Published research on exon finding
  Our data set
  Machine learning for exon finding
  Results

3 Background
  Genomic tiling array: probes are designed to blanket an entire genomic region of interest and are used to detect the presence or absence of transcription.
  Tiling: a sequence of probes spanning a genomic region is called a “tile path” or a “tiling”.

4 Two types of tiling array construction
  1) Oligonucleotide tiling array
  2) Tiling array constructed using PCR products
  Source: Trends in Genetics 2005, v21, 466

5 Detection of transcription
  1) Discovery of novel genes
  2) Discovery of novel non-coding RNAs
  3) Alternative splicing study
  Advantages:
  1) The sensitivity of microarrays enables rare transcripts to be detected.
  2) The parallel nature of the arrays enables numerous samples and genomic sequences to be analyzed.
  3) The experimental design is not dependent on current genome annotations.

6 Recent Research

7 Recent Research
  Surprising amounts of genomic ‘dark matter’
  More than 50% of animal genomes may be transcribed
  Novel protein-coding genes
  Novel non-coding genes (rRNA, tRNA, snoRNA, miRNA…)
  Antisense transcripts
  Alternative isoforms and gene ‘extensions’
  Leaky transcription
  Technical noise/artifacts

8 Exon-intron discriminators: Kampa et al.
  Hodges–Lehmann estimator (pseudo-median)
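The slide only names the estimator, so the following is a minimal sketch of the textbook one-sample Hodges–Lehmann estimator (the median of all pairwise averages), not a reconstruction of how Kampa et al. applied it. The function name `pseudo_median` and the toy values are illustrative assumptions.

```python
from itertools import combinations_with_replacement
from statistics import median


def pseudo_median(values):
    """One-sample Hodges-Lehmann estimator.

    Returns the median of all pairwise (Walsh) averages (a + b) / 2,
    taken over every pair with i <= j, including each value with itself.
    """
    if not values:
        raise ValueError("need at least one value")
    walsh = [(a + b) / 2 for a, b in combinations_with_replacement(values, 2)]
    return median(walsh)


# Toy probe-level values (hypothetical): one high outlier among low signals.
print(pseudo_median([1, 2, 10]))
```

Because it averages pairs before taking the median, the estimator is less sensitive to a single outlying probe than the mean, while using more of the data than the plain median.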

9 Exon-intron discriminators: Schadt et al. (PCA)
  1. Probes are separated into 15 kb sliding windows
  2. Calculate robust principal components (from the between-sample correlation matrix)
  3. Calculate the Mahalanobis distance of each probe from the center of the data in the first two dimensions of the principal component scores (PCS)
  4. Decide on exon vs. intron
  5. Assign probes to transcriptional units
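Step 3 can be illustrated with a small sketch. It assumes the first two principal component scores of each probe in a window are already available, and it uses the ordinary mean and covariance rather than the robust estimates the slide mentions. The function name `mahalanobis_2d` is hypothetical.

```python
from math import sqrt


def mahalanobis_2d(scores):
    """Mahalanobis distance of each (pc1, pc2) point from the data center.

    Uses the plain sample mean and covariance (a simplification: the slide
    refers to robust principal components).
    """
    n = len(scores)
    if n < 3:
        raise ValueError("need at least three points")
    mx = sum(x for x, _ in scores) / n
    my = sum(y for _, y in scores) / n
    sxx = sum((x - mx) ** 2 for x, _ in scores) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in scores) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in scores) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("degenerate covariance matrix")
    distances = []
    for x, y in scores:
        dx, dy = x - mx, y - my
        # d^T S^-1 d, with S^-1 = (1/det) * [[syy, -sxy], [-sxy, sxx]]
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(sqrt(max(d2, 0.0)))
    return distances


# Hypothetical window: four background-like probes and one distant probe.
window = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
d = mahalanobis_2d(window)
print(d.index(max(d)))  # index of the probe farthest from the center
```

A probe far from the center of its window would then be a candidate exon probe in step 4; the cutoff used for that decision is not given on the slide.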

10 Exon-intron discriminators: our collaborators’ approach (Andrew Fraser and Tom Gingeras)
  Use negative bacterial controls to calculate an intensity threshold corresponding to a 5% false positive rate in a given region
  Apply these intensity thresholds to generate positive probe maps, which are then joined together using two parameters:
    maxgap, the maximal distance between two positive probes
    minrun, the minimal size of a transfrag
  A minrun of 40 (two positive probes) or 80 (three positive probes) is a good starting point for these parameters
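The thresholding and joining steps above can be sketched as follows. This is an illustration under stated assumptions, not the collaborators' implementation: the threshold is taken as an empirical quantile of the control intensities, distances are measured between probe positions in bp, and a transfrag's size is the distance from its first to its last positive probe. The function names and the 40 bp probe spacing in the example are hypothetical.

```python
def control_threshold(control_intensities, fpr=0.05):
    """Intensity that at most a fraction `fpr` of negative controls exceed."""
    ranked = sorted(control_intensities)
    k = min(len(ranked) - 1, int((1 - fpr) * len(ranked)))
    return ranked[k]


def call_transfrags(positions, intensities, threshold, maxgap, minrun):
    """Join positive probes into transfrags.

    A probe is positive if its intensity exceeds `threshold`. Consecutive
    positive probes no more than `maxgap` apart are merged; merged runs
    shorter than `minrun` (last position minus first position) are dropped.
    """
    positive = [p for p, x in sorted(zip(positions, intensities)) if x > threshold]
    frags = []
    for p in positive:
        if frags and p - frags[-1][1] <= maxgap:
            frags[-1][1] = p  # extend the current transfrag
        else:
            frags.append([p, p])  # start a new transfrag
    return [(start, end) for start, end in frags if end - start >= minrun]


# Hypothetical probes spaced 40 bp apart.
positions = [0, 40, 80, 120, 160, 200, 240]
intensities = [9, 9, 1, 1, 9, 1, 9]
print(call_transfrags(positions, intensities, threshold=5, maxgap=40, minrun=40))
print(call_transfrags(positions, intensities, threshold=5, maxgap=80, minrun=40))
```

With this spacing, a minrun of 40 requires at least two positive probes and a minrun of 80 at least three, in line with the starting values quoted on the slide.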

11 About our tiling data
  Affymetrix C. elegans Tiling 1.0R Array
  Genome-wide gene expression: Chr I–V, Chr X and Chr M (mitochondrion)
  Resolution: 25 bp on average
  Negative bacterial controls
  Samples: 21 samples across development (plus mutant)
  Probes: 2,942,364 PM/MM pairs

12 About tiling data: developmental time course (sample number)
  strain   L2  L3  L4  Young adult  Gravid adult  total
  N2        2   2   3       3            3          13
  smg-1*    -   -   3       2            3           8
  * smg-1: deficient in nonsense-mediated decay

13 Examples: LAP-1 (ZK353.6)
  Genomic position: III:8401845..8399119 bp
  LAP-1 is expressed throughout the life cycle. While there appears to be marginally less LAP-1 message at 2 h and 40 h, corresponding to early L1 and young adult stages respectively, LAP-1 appears to be constitutively expressed. Densitometric analysis of LAP-1 expression compared to the housekeeping gene ama-1 shows some variation in LAP-1 expression, but this appears to be unrelated to moulting.

14 Example (figure: probe intensity; intron, exon)

15 Example

16 Example 2 (figure: probe intensity)

17 Example 2

18 General impression (Chr III, 2866 genes)

19 General impression

20 General impression

21 PCA

22 Methods: machine learning
  Aim
    Find the most effective (correct) machine learning method that distinguishes between true exons and true introns
    Find the simplest (fastest, most intuitive) method that achieves this task

23 Methods: machine learning
  Main challenge
    True exons and true introns are not known:
      annotated exons may be unexpressed
      annotated introns may be novel transcripts
  Our approach
    Ignore the problem and optimize supervised performance
  Assumption
    True novel transcripts will be similar to known ones

24 Methods: machine learning
  1. Classification and regression tree (CART): binary recursive partitioning
  Advantages:
    Easy to understand
    Easy to implement
    Computationally cheap
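Binary recursive partitioning repeatedly looks for the single cut on one feature that best separates the classes. The sketch below shows that one step for a single numeric feature, using Gini impurity as the split criterion. The criterion, the function names and the toy data are assumptions; the slides do not say which settings were used.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Find the threshold on one feature that minimizes weighted Gini impurity.

    Returns (threshold, impurity); threshold is None if no split improves
    on the unsplit data.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best


# Hypothetical median intensities: introns (0) low, exons (1) high.
print(best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))
```

A full tree applies the same search to each resulting subset and across all features until a stopping rule is met.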

25 Methods: machine learning
  2. Support vector machines (SVM)
  (figure: two classes of points, denoted +1 and 0) How would you classify this data?

26 2. Support vector machines (SVM)
  (figure: two classes of points, denoted +1 and 0) How would you classify this data?

27 2. Support vector machines (SVM)
  (figure: two classes of points, denoted +1 and 0) How would you classify this data?

28 2. Support vector machines (SVM)
  (figure: two classes of points, denoted +1 and 0) How would you classify this data?

29 Maximum Margin
  (figure: two classes of points, denoted +1 and 0)
  The classifier with the maximum margin is the ideal one.
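To make the maximum-margin idea concrete, the sketch below computes the geometric margin of a candidate linear separator, that is, the smallest signed distance from any training point to the line. It assumes the two classes are coded as +1 and -1 (the usual SVM convention, which differs from the labels shown in the figure). The function name and the toy points are illustrative.

```python
from math import sqrt


def geometric_margin(w, b, points, labels):
    """Smallest signed distance y * (w.x + b) / ||w|| over all points.

    Positive only if every point lies on the correct side of the line;
    labels are assumed to be +1 or -1.
    """
    norm = sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


# Two hypothetical clusters separated along the first axis.
points = [(0, 0), (0, 1), (2, 0), (2, 1)]
labels = [-1, -1, 1, 1]
print(geometric_margin((1, 0), -1.0, points, labels))  # line x = 1
print(geometric_margin((1, 0), -0.5, points, labels))  # line x = 0.5
```

Both lines separate the toy data, but the one halfway between the clusters has the larger margin, which is the separator a linear SVM would prefer.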

30 Evaluation: Receiver Operating Characteristic curve (ROC curve)
  (figure: ROC curve; x-axis: false positive rate (1-specificity); y-axis: true positive rate (sensitivity))

31 The Area Under an ROC Curve (AUC)
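As a minimal sketch of how an ROC curve and its AUC can be computed from classifier scores: the threshold is swept from high to low, each step records one (false positive rate, true positive rate) point, and the area is obtained with the trapezoid rule. Function names and the toy scores are assumptions for illustration.

```python
def roc_points(scores, labels):
    """ROC curve as a list of (fpr, tpr) points, labels being 0 or 1."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need both positive and negative examples")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        # Handle tied scores together so they yield a single ROC point.
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear ROC curve (trapezoid rule)."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical exon scores for four probes; 1 = exon, 0 = intron.
print(auc(roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])))
```

An AUC of 1.0 means every exon probe scores above every intron probe, and 0.5 corresponds to random ranking.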

32 Selection of informative features – intensities
  (figure panels: raw and normalized data; mean, median, max, max_1)
  Features: pm.i, pm.1, pm_1, pm.2, pm_2, mm.i, mm.1, mm_1, mm.2, mm_2

33 Selection of informative features – correlation
  (figure panels: raw and normalized data; Pearson and Spearman)
  Features: pm1, pm-1, mm1, mm-1

34 Selection of informative features
  Summary
    Almost all reasonable features are informative
    No striking difference between mean and median, but both seem better than max and max_1
    The correlation coefficient (CC) is also informative; no striking difference between Pearson and Spearman
    Quantile normalization doesn’t improve the result
  Decision
    Median and CC (Pearson) of non-normalized data are used to generate features
    GC content or melting temperature can also be informative
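The chosen features can be sketched as follows, assuming each probe has one intensity per sample and that the correlation features compare a probe's profile across samples with that of its immediate neighbors on the genome. The data layout, function names and feature keys are hypothetical, not taken from the slides.

```python
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def probe_features(pm):
    """Per-probe features from pm[i][s], the intensity of probe i in sample s.

    For each probe: the median intensity across samples, and the Pearson
    correlation with the previous and next probe (None at the ends).
    """
    features = []
    for i, row in enumerate(pm):
        features.append({
            "median": median(row),
            "cc_prev": pearson(row, pm[i - 1]) if i > 0 else None,
            "cc_next": pearson(row, pm[i + 1]) if i + 1 < len(pm) else None,
        })
    return features


# Three hypothetical probes measured in three samples.
for f in probe_features([[1, 2, 3], [2, 4, 6], [3, 2, 1]]):
    print(f)
```

Probes inside the same expressed exon would be expected to rise and fall together across samples, which is why the neighbor correlation can carry information beyond intensity alone.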

35 Selection of informative features – neighbors (CART)

36 Selection of informative features – neighbors (SVM, CART)

37 Selection of informative features: ANOVA results (Neighbours, MM, CC.PM, CC.MM, Tm)

38 Results

39 Example tree

40 AUC ~ (expression level)

41 AUC ~ length(exon)

42 AUC ~ Tm

43 AUC ~ probe position within exon

44 AUC ~ (other factors): expression, exon length, melting temperature, relative position

45 Can minrun and maxgap improve the results? (maxgap = 1, minrun = 3)

46 Can minrun and maxgap improve the results? (minrun = 3, maxgap = 1)

47 Maxgap and minrun optimization (figure panels: minrun/maxgap, maxgap/minrun)
  thres = 0.936, ccr = 0.806, fpr = 0.009, tpr = 0.464

48 Maxgap and minrun optimization (figure panels: minrun/maxgap, maxgap/minrun)
  thres = 0.718, ccr = 0.850, fpr = 0.030, tpr = 0.627

49 Maxgap and minrun optimization (figure panels: minrun/maxgap, maxgap/minrun)
  thres = 0.500, ccr = 0.856, fpr = 0.059, tpr = 0.700

50 Maxgap and minrun optimization (figure panels: minrun/maxgap, maxgap/minrun)
  thres = 0.300, ccr = 0.815, fpr = 0.216, tpr = 0.851
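The quantities reported on these slides can be computed from classifier scores as in the sketch below. It assumes that "thres" is a cutoff on the classifier score and that "ccr" stands for the correct classification rate (overall accuracy); neither is spelled out on the slides. The function name and the toy data are illustrative.

```python
def rates_at_threshold(scores, labels, thres):
    """Return ccr, fpr and tpr when probes with score > thres are called exons.

    labels: 1 for exon, 0 for intron. ccr is taken to mean the fraction of
    correctly classified probes (an assumption).
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        called = score > thres
        if called and label:
            tp += 1
        elif called and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return {
        "ccr": (tp + tn) / len(labels),
        "fpr": fp / (fp + tn),
        "tpr": tp / (tp + fn),
    }


# Hypothetical scores for four probes.
scores = [0.9, 0.8, 0.4, 0.2]
labels = [1, 0, 1, 0]
print(rates_at_threshold(scores, labels, 0.5))
print(rates_at_threshold(scores, labels, 0.3))
```

Lowering the threshold trades a higher true positive rate for a higher false positive rate, which is the pattern visible across the four slides.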

51 Maxgap and minrun optimization

52 Maxgap and minrun optimization
  1 - maxgap, 2 - minrun
  Order: minrun/maxgap

53 Maxgap and minrun conclusion
  A minrun of 0 and a maxgap of 1 give the best overall result for our classifier
  Minrun and maxgap have minimal influence on the results if the classifier already uses neighboring probe information

54 Future work
  Joining of transfrags into transcriptional units (genes)
  Differential gene expression between developmental stages and strains (ANOVA)
  Detection of alternative splicing (ANOVA)

55 Acknowledgements
  Yang Li and Ritsert Jansen, Groningen Bioinformatics Centre
  Andrew Fraser, Wellcome Trust Sanger Institute, Cambridge
  Tom Gingeras, Affymetrix, Santa Clara
  Jan Kammenga, Nematology, Wageningen University

