Spanish Inquisition Final Project Week 2 - 4/29/09 Breast Cancer Gene Expression Data Leon Kay, Yan Tran, Chris Thomas



Weka Filtering
Used CFS with BestFirst search.
Reduced the number of attributes from 1544 to 125.
CFS stands for Correlation-based Feature Selection.
Basic hypothesis: “A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other.” [1]
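The hypothesis above can be turned into a score. As a minimal sketch: Hall [1] scores a subset of k features with merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where r_cf is the mean feature-class correlation and r_ff is the mean feature-feature correlation. The function names below are hypothetical, and using absolute Pearson correlation as the correlation measure is an assumption made for simplicity; it may not match what Weka computes internally.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / (sx * sy)


def cfs_merit(features, target):
    """Merit of a feature subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff).

    features: list of feature columns (each a list of numbers)
    target:   class column encoded as numbers (assumption: binary 0/1 here)
    """
    k = len(features)
    if k == 0:
        return 0.0
    # Mean feature-class correlation: rewards predictive features.
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    # Mean feature-feature correlation: penalizes redundant features.
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)
```

With this score, adding an exact copy of a feature already in the subset leaves the merit unchanged (r_ff rises to 1), while adding a feature that is uncorrelated with the class lowers it.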

CFS Algorithm - Searching
Any search algorithm can be plugged into CFS. The author describes three: forward selection, backward elimination, and best first. All three are essentially greedy heuristic searches, which avoids evaluating every possible feature subset.
“Best first can start with either no features or all features. In the former, the search progresses forward through the search space adding single features; in the latter the search moves backward through the search space deleting single features. To prevent the best first search from exploring the entire feature subset search space, a stopping criterion is imposed. The search will terminate if five consecutive fully expanded subsets show no improvement over the current best subset.” [1]
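The forward variant of best-first search with the five-expansion stopping rule can be sketched as follows. This is an illustrative sketch, not Weka's implementation: `best_first_forward`, `merit`, and `max_stale` are hypothetical names, and `merit` stands for any subset-scoring function, such as a CFS merit.

```python
import heapq
from itertools import count


def best_first_forward(n_features, merit, max_stale=5):
    """Forward best-first search over subsets of feature indices.

    merit: callable taking a frozenset of indices and returning a number.
    Stops after max_stale consecutive expanded subsets yield no improvement
    over the best subset found so far, or when nothing is left to expand.
    """
    start = frozenset()
    best_subset, best_merit = start, merit(start)
    tie = count()  # tie-breaker so the heap never has to compare frozensets
    open_heap = [(-best_merit, next(tie), start)]  # max-heap via negated merit
    seen = {start}
    stale = 0
    while open_heap and stale < max_stale:
        _, _, subset = heapq.heappop(open_heap)  # most promising open subset
        improved = False
        # Fully expand the subset: try adding each single missing feature.
        for f in range(n_features):
            if f in subset:
                continue
            child = subset | {f}
            if child in seen:
                continue
            seen.add(child)
            m = merit(child)
            heapq.heappush(open_heap, (-m, next(tie), child))
            if m > best_merit:
                best_subset, best_merit = child, m
                improved = True
        stale = 0 if improved else stale + 1
    return best_subset, best_merit
```

Unlike plain forward selection, the open list lets the search back up to an earlier, less promising subset when the current path stops improving, which is why a stopping rule is needed at all.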

CFS Algorithm Visual Diagram [1]

Accuracy (Error Rate) of algorithms before and after applying CFS/BestFirst filtering
Columns: Before*, After**, Error Rate Reduction
Rows: J48, Bagging (J48), Boosting (J48), Random Forests, SMO (SVM)
[table values not preserved in this transcript]
* From Week 1 - all 1544 attributes
** After applying CFS/BestFirst filtering, 125 attributes

ROC – Receiver Operating Characteristic
ROC graphs “depict the tradeoff between hit rates and false alarm rates of classifiers” [2]
“one point in ROC space is better than another if it is to the northwest (tp rate is higher, fp rate is lower, or both) of the first” [2]
The Area Under the Curve (AUC) condenses a ROC curve into a single number between 0 and 1 that can be used to compare classifiers: higher is better, and 0.5 corresponds to random guessing.
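As a rough illustration of how an AUC value can be obtained, the sketch below uses the rank interpretation of AUC: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function names are hypothetical, and the one-vs-rest treatment of a multi-class problem is an assumption about how per-class values could be produced, not a description of what Weka reports.

```python
def auc(scores, labels):
    """AUC as P(positive score > negative score); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Compare every positive with every negative (quadratic, fine for small data).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(class_scores, true_classes):
    """AUC per class, treating that class as positive and all others as negative.

    class_scores: dict mapping class name -> list of per-sample scores
    true_classes: list of the true class name of each sample
    """
    return {
        c: auc(scores, [t == c for t in true_classes])
        for c, scores in class_scores.items()
    }
```

For example, `auc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0])` evaluates to 0.75, because three of the four positive-negative pairs are ranked correctly.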

ROC Data – Area under Curve
Columns (classifiers): J48, Bagging (J48), Boosting (J48), Random Forests, SMO (SVM)
Rows (subtypes): Basal-like, Claudin-low, HER2+/ER, Luminal A, Luminal B, Normal Breast-like
[AUC values not preserved in this transcript]

Example ROC – Random Forests

MeV Analysis Initial Hierarchical Clustering

Analyze the Cluster

FLJ13710 and GATA3
Expressed at low levels in basal-like samples.
Highly expressed in luminal samples.

GATA3
GATA3 levels are a known indicator of breast cancer prognosis. (Basal-like has a worse prognosis than luminal.)
GATA3 is associated with estrogen receptor alpha, which is often highly expressed in the early stages of breast cancer.

FLJ13710
Mentioned in a paper on finding prognostic signatures for breast cancer.
We could not find any in-depth studies of this gene.

References
1) Mark Hall, “Correlation-based Feature Selection for Machine Learning”
2) Tom Fawcett, “An introduction to ROC analysis”, doi: /j.patrec – enter into doi: /j.patrec http://dx.doi.org/
3) Wilson, Brian J., Giguère, Vincent. “Meta-analysis of human cancer microarrays reveals GATA3 is integral to the estrogen receptor alpha pathway”, Molecular Cancer 2008, 7:49. cancer.com/content/7/1/49
4) Hayashi, SI., et al. “The expression and function of estrogen receptor alpha and beta in human breast cancer and its clinical application”
5) “Suppl. Table 2: List of probe sets significantly differentially expressed between luminal cell lines and basal cell lines. Probe sets are ordered according to decreasing DS (discriminating score).”
6) Carrivick, L., et al. “Identification of Prognostic Signatures in Breast Cancer Microarray Data using Bayesian Techniques.”