Statistical, Computational, and Informatics Tools for Biomarker Analysis: Methodology Development at the Data Management and Coordinating Center of the Early Detection Research Network


Statistical, Computational, and Informatics Tools for Biomarker Analysis: Methodology Development at the Data Management and Coordinating Center of the Early Detection Research Network

EDRN ORGANIZATIONAL STRUCTURE
Early Detection Research Network
An “infrastructure” for supporting collaborative research on molecular, genetic and other biomarkers in human cancer detection and risk assessment.
[Organization chart: 18 Laboratories; 8 Centers; CDCP; 2 Laboratories; NIST; Chair: David Sidransky; Chair: Bernard Levin]

Early Detection Research Network INFRASTRUCTURE: BIOREPOSITORY
- Specimens with matching controls and epidemiological data
- Infrastructure to provide preneoplastic tissues: prostate, lung, ovarian, colon, breast

Early Detection Research Network INFRASTRUCTURE: LABORATORY CAPACITY
- Capability in high-throughput molecular and biochemical assays
- Ability to respond to evolving technologies for EDRN needs
- Extensive experience and scale-up ability in proteomics and molecular assays
- Outstanding infrastructure for handling multiple assays and validation requests

Early Detection Research Network INFRASTRUCTURE: DATA STORAGE AND MINING
- Outstanding track record in biomarker research
- Statistical and data mining technology
- Statistical and predictive models for multiple biomarkers
- Novel statistical methods to interpret high-throughput data

Early Detection Research Network INFRASTRUCTURE: DATA EXCHANGE AND SHARING
- Improving informatics and information flow
- Network web sites: public web site, secure web site
- Early Detection Research Network Exchange (ERNE)
- Standardizing of data reporting: CDEs developed

Early Detection Research Network (EDRN) INFORMATICS AND INFORMATION FLOW

EARLY DETECTION RESEARCH NETWORK COLLABORATION
Contact one of the EDRN Principal Investigators to serve as a sponsor for an application. Three types of collaborative opportunities are available:
- Type A: Novel research ideas complementing ongoing EDRN efforts; one year of funding at $100,000
- Type B: Sharing of tools, technology and resources; no time limit
- Type C: Participation in the EDRN meetings and workshops
For details on how to apply, see How To Become an Associate Member.

DMCC Statisticians
- Margaret Pepe, Lead of Methodology Group
- Ziding Feng, Principal Investigator
- Yinsheng Qu
- Mary Lou Thompson
- Mark Thornquist
- Yutaka Yasui

Biomarker Lab Collaborators at Eastern Virginia Medical School
- Bao-Ling Adam
- John Semmes
- George Wright

Focus of Presentation
- Design: Phase Structure for Biomarker Research
- Analysis: Statistical Methods for Biomarker Discovery from High-Dimensional Data Sets

Design: Phase Structure for Biomarker Research
- The three-phase structure for therapeutic trials is well established
- The structure promotes coherent, thorough, efficient development
- A similar structure needs to be developed for biomarker research

Biomarker Development
- Categorize the process into 5 phases
- Define objectives for each phase
- Define ideal study designs, evaluation, and criteria for proceeding further
- Standardize the process to promote efficiency and rigor

The Details of Study Design
- Specific Aims
- Subject/Specimen Selection
- Outcome Measures
- Evaluation of Results
- Sample Size Calculations
- Limitations / Pitfalls

Specific Aims
Phase 1
- Identify leads for potentially useful biomarkers
- Prioritize these leads
Phase 2
- Determine the sensitivity and specificity or ROC curve for the clinical biomarker assay in discriminating clinical cancer from controls

Specimen Selection: Cases
Phase 1
- Cancers that are ultimately serious if not treated early, but treatable in early stage
- Spectrum of sub-types
- Collected at diagnosis
Phase 2: same criteria as for phase 1
- Wide spectrum of cases
- Clinical specimen at diagnosis
- From target screening population

Specimen Selection: Controls
Phase 1
- Non-cancer tissue, same organ, same patient
- Normal tissue, non-cancer patient
- Benign growth tissue, non-cancer patient
Phase 2
- From potential target population for screening

Outcome Measures
Phase 1
- True positive and false positive rates (binary result)
- True positive rate at threshold yielding acceptable false positive rate
- ROC curve
Phase 2
- Results of clinical biomarker assay
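
As an illustration of the second outcome measure, the sketch below computes the empirical true positive rate at the lowest threshold whose false positive rate stays within an acceptable limit. The function name `tpr_at_fpr`, the "positive when strictly above the threshold" rule, and the example values are assumptions made for this sketch and are not taken from the slides.

```python
def tpr_at_fpr(cases, controls, max_fpr):
    """Empirical TPR at the lowest threshold whose FPR does not exceed max_fpr.

    Assumption: a subject tests positive when its marker value is strictly
    above the threshold, and higher values indicate cancer.
    """
    best_tpr = 0.0
    # Candidate thresholds: every distinct control value.
    for threshold in sorted(set(controls)):
        fpr = sum(v > threshold for v in controls) / len(controls)
        if fpr <= max_fpr:
            tpr = sum(v > threshold for v in cases) / len(cases)
            best_tpr = max(best_tpr, tpr)
    return best_tpr


# Hypothetical marker values for four cases and five controls.
cases = [2.1, 3.0, 0.5, 2.8]
controls = [0.1, 0.4, 1.0, 2.5, 0.3]
print(tpr_at_fpr(cases, controls, max_fpr=0.2))
```

Because lowering the threshold can only raise both rates, taking the largest TPR among the admissible thresholds is the same as evaluating the TPR at the lowest admissible threshold.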

Evaluation of Results
Phase 1
- Algorithms select and prioritize markers that best distinguish tumor from non-tumor tissue
- Initial exploratory studies need confirmation with new validation specimens
Phase 2
- ROC curves
- ROC regression to determine if characteristics of cases and/or characteristics of controls affect the biomarker’s discriminatory capacity

Sample Size
Phase 1
- Should be large enough that very promising biomarkers are likely to be selected for phase 2 development
Phase 2
- Based on confidence intervals for the TPR or FPR, or confidence intervals for the ROC curve at selected critical points
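
One simple way to turn the phase 2 criterion into a number is to ask how many cases are needed for a confidence interval on the TPR to reach a target half-width. The sketch below uses the normal-approximation (Wald) interval, which is an assumption of this example; the slides do not say which interval the EDRN look-up tables use, and the function name `cases_needed` is hypothetical.

```python
import math
from statistics import NormalDist


def cases_needed(expected_tpr, half_width, confidence=0.95):
    """Number of cases so that a Wald interval for the TPR has the given half-width.

    Assumption: normal approximation to the binomial, n = z^2 * p * (1 - p) / d^2.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n = z * z * expected_tpr * (1 - expected_tpr) / half_width ** 2
    return math.ceil(n)


# Anticipated TPR of 0.80, estimated to within +/- 0.10 at 95% confidence.
print(cases_needed(0.80, 0.10))
```

The same formula applied to the FPR gives the number of controls; because the two targets usually differ, the resulting numbers of cases and controls need not be equal.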

Findings: Sample Size Estimation
- For phase 1 microarray experiments, use of ROC curves is more efficient than comparing means
- For phase 2 studies, equal numbers of cases and controls are often not optimally efficient
- Sample size calculations and look-up tables are now on the EDRN website

1. Pepe et al. Phases of biomarker development for early detection of cancer. Journal of the National Cancer Institute 93(14):1054–61.
2. Pepe et al. “Elements of Study Design for Biomarker Development.” In Tumor Markers, Diamandis, Fritsche, Lilja, Chan, and Schwartz, eds. AACC Press, Washington, DC.
3. Pepe. “Statistical Evaluation of Diagnostic Tests & Biomarkers.” Oxford U. Press, 2003.

Selecting Differentially Expressed Genes from Microarray Experiments
Lead: Margaret Pepe
Context:
- Gene expression arrays for $n_D$ tumor tissues and $n_C$ normal tissues
- $Y_{ig}$ = logarithm of relative intensity at gene $g$ for tissue $i$
- For which genes is $Y_{ig}$ different in some/most cases from the normals?
- How many tissues, $n_D$ and $n_C$, should be evaluated in these experiments?
- Illustrated with ovarian cancer data

Statistical Measures for Gene Selection
- Typically a two-sample t-test is used for each gene
- We argue that sensitivity and specificity are more directly relevant for cancer biomarker research
- Focus attention on high specificity (or high sensitivity)
- Use the partial area under the ROC curve to rank genes, instead of the t-test
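
A minimal sketch of ranking genes by partial AUC follows. It uses the empirical estimator that counts case/control pairs in which the case exceeds one of the highest-valued controls, which restricts the ROC curve to false positive rates up to `max_fpr` (the high-specificity region). The function names, the data layout (one row per tissue, one column per gene), and the toy values are assumptions for illustration, not details from the presentation.

```python
import math


def partial_auc(cases, controls, max_fpr):
    """Empirical area under the ROC curve for FPR in [0, max_fpr] (no tie handling)."""
    # Number of controls allowed to test positive; the epsilon guards float round-off.
    k = math.floor(max_fpr * len(controls) + 1e-9)
    top_controls = sorted(controls, reverse=True)[:k]
    pairs = sum(d > c for d in cases for c in top_controls)
    return pairs / (len(cases) * len(controls))


def rank_genes(case_rows, control_rows, max_fpr):
    """Return gene indices ordered from largest to smallest partial AUC."""
    n_genes = len(case_rows[0])
    scores = []
    for g in range(n_genes):
        cases = [row[g] for row in case_rows]
        controls = [row[g] for row in control_rows]
        scores.append(partial_auc(cases, controls, max_fpr))
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)


# Toy data: 3 tumor and 5 normal tissues, 2 genes; gene 1 is over-expressed in tumors.
tumor = [[0.1, 3.0], [0.2, 4.0], [0.0, 5.0]]
normal = [[0.3, 1.0], [0.1, 2.0], [0.2, 0.5], [0.0, 0.1], [0.4, 0.2]]
print(rank_genes(tumor, normal, max_fpr=0.2))
```

The largest value the statistic can take is `max_fpr` itself, so scores are only comparable between genes evaluated at the same `max_fpr`.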

Example Gene Rank (among 100 genes) gene #5gene #97 t-test104 partial AUC331

Sample Sizes for Gene Discovery Studies
- Traditional calculations are based on statistical hypothesis testing
- These are exploratory studies and need new methods
- Propose to base calculations on the probability that a differentially expressed gene will rank high among all genes
- Use computer simulation for sample size calculations

With 50 tumor and 50 normal tissues we can be 83.6% sure that the top 30 genes will rank in the top 100 in the experiment.
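
The simulation idea can be sketched as follows: generate expression data with a known set of truly shifted genes, rank all genes, and record how often the truly shifted genes land within the top of the ranking. Everything specific here is an assumption made for illustration (Gaussian data with unit variance, a common shift for the true genes, and a plain mean-difference statistic standing in for the partial AUC), so the sketch will not reproduce the 83.6% figure quoted above.

```python
import random


def prob_top_rank(n_cases, n_controls, n_genes, n_true, shift, top_k, n_sim, seed=0):
    """Estimate P(all truly shifted genes rank within the top_k) by simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        scores = []
        for g in range(n_genes):
            mu = shift if g < n_true else 0.0  # the first n_true genes are truly shifted
            case_mean = sum(rng.gauss(mu, 1.0) for _ in range(n_cases)) / n_cases
            ctrl_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n_controls)) / n_controls
            scores.append(case_mean - ctrl_mean)  # stand-in ranking statistic
        ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
        top = set(ranked[:top_k])
        if all(g in top for g in range(n_true)):
            hits += 1
    return hits / n_sim


# Small hypothetical configuration; increase n_sim for a stable estimate.
print(prob_top_rank(n_cases=20, n_controls=20, n_genes=30, n_true=2,
                    shift=1.0, top_k=5, n_sim=50))
```

Repeating the call over a grid of `n_cases` and `n_controls` values gives the kind of look-up table from which a sample size can be read off for a target probability.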

Pepe et al. Selecting differentially expressed genes from microarray experiments. Biometrics (in press)

Summary
The methods we developed for selecting genes and calculating sample sizes are more appropriate for the purpose of diagnosis and early detection.

Analysis: Statistical Methods for Biomarker Discovery from High-Dimensional Data Sets
- Method development was motivated by SELDI data from John Semmes/George Wright at Eastern Virginia Medical School
- Data consist of protein intensities at tens of thousands of mass/charge points on each of 297 individuals
- Developed three approaches to biomarker discovery: wavelets, boosted decision trees, and automated peak identification

The EVMS prostate cancer biomarker project
- Prostate cancer patients: N=99 early-stage, N=98 late-stage
- Normal controls: N=96
- Serum samples for proteomic analysis by Surface Enhanced Laser Desorption/Ionization (SELDI)
- Goal: To discover protein signals that distinguish cancers from normals

An example of SELDI output 48,000 mass/charge points (  200K Da)

The design of the biomarker analysis
- Groups: Normal (N=96), PCa-early (N=99), PCa-late (N=98)
- Training data: 167 PCa (84 early, 83 late) vs. 81 Normal
- Test data (blinded): 30 PCa, 15 Normal

Wavelet Analysis
Lead: Yinsheng Qu
Steps in the wavelet analysis:
- Represent the original data plot with a set of wavelets (dimension reduction)
- Determine those wavelets that distinguish between subgroups (information criterion)
- Define discriminating functions based on the distinguishing wavelets (Fisher discrimination)
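
The first step, representing a spectrum by wavelet coefficients, can be illustrated with a plain Haar transform. This is only a sketch of the transform itself: the function name `haar_dwt` is hypothetical, the input length is assumed to be a power of two, and the selection of distinguishing coefficients by an information criterion and the Fisher discrimination step are not shown.

```python
import math


def haar_dwt(signal):
    """Full orthonormal Haar wavelet transform of a power-of-two-length sequence.

    Returns [overall approximation, coarsest detail, ..., finest details].
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Detail coefficients capture local differences; coarser levels go in front.
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        # Approximation coefficients are scaled local averages.
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details


# A flat stretch of spectrum produces zero detail coefficients.
print(haar_dwt([4, 2, 5, 5]))
```

Flat regions of a spectrum yield detail coefficients near zero, which is why many coefficients can be discarded before any group comparison is made.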

Three Group Classification: Normal, Cancer, BPH
12,352 mass spectrum data points were reduced to 3,420 Haar wavelet coefficients, of which 17 coefficients distinguish between the three groups. Two classification functions were generated.
[Table: truth vs. predicted class (Normal, Cancer, BPH); only the entries 0, 3, 8 survive]

Qu Y et al. Data reduction using discrete wavelet transform in discriminant analysis with very high dimension. Biometrics, in press.

Boosted Decision Tree Method
Lead: Yinsheng Qu/Yutaka Yasui
- This method combines multiple weak learners into a very accurate classifier
- It can be used in cancer detection
- It can also be used in identification of tumor markers
- Using this method we can separate controls, BPH, and PCa without error in the test set

Outline of boosting decision tree
- The combined classifier is a committee with the decision stumps, the base classifiers, as its members. It makes decisions by weighted majority vote.
- The base classifiers are constructed on weighted examples: misclassified examples receive higher weights in the next round.
- The 2nd stump’s specialty is to correct the 1st stump’s mistakes, the 3rd stump’s specialty is to correct the 2nd stump’s mistakes, and so on.
- The combined classifier, with dozens or even hundreds of decision stumps, will be accurate.
- The boosting technique is resistant to overfitting.

Classifier 2: A boosted decision stump classifier with 21 peaks (potential markers)

The Boosting procedure
- $Y_i \in \{\text{cancer}, \text{normal}\} = \{1, -1\}$, $f_m(x_i) \in \{1, -1\}$
- Initial weights ($m = 1$): $w_i = 1$ ($i = 1, \dots, N$). Choose the first peak and threshold $c$.
- For $m = 1$ to $M$:
  - $w_i \leftarrow w_i \exp\{\alpha_m I(\text{incorrect})\}$, where $\alpha_m = \ln((1 - \text{err})/\text{err})$ and err is the classification error rate at the current stage
  - normalize the weights so they sum to $N$
  - choose a peak and $c$ ($i$-th subject with weight $w_i$)
- Final classifier: $f(x) = \sum_{m=1}^{M} \alpha_m f_m(x)$; $f(x_i) > 0 \Rightarrow$ the $i$-th subject is classified as cancer
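
The procedure above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the exhaustive stump search, the clipping of the error rate to keep the logarithm finite, the names `fit_stump`, `boost`, and `score`, and the toy data are all assumptions. The stage coefficient is called `alpha` in the code.

```python
import math


def fit_stump(X, y, w):
    """Pick the (peak index, threshold c, sign) with the smallest weighted error."""
    best = None
    for j in range(len(X[0])):
        for c in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[j] > c else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi) / sum(w)
                if best is None or err < best[0]:
                    best = (err, j, c, sign)
    return best


def boost(X, y, n_rounds):
    """Return a committee of (alpha, peak index, threshold, sign) decision stumps."""
    n = len(X)
    w = [1.0] * n  # initial weights w_i = 1
    committee = []
    for _ in range(n_rounds):
        err, j, c, sign = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # assumption: keep the logarithm finite
        alpha = math.log((1 - err) / err)
        committee.append((alpha, j, c, sign))
        # Up-weight misclassified subjects, then renormalize so the weights sum to N.
        w = [wi * math.exp(alpha) if (sign if row[j] > c else -sign) != yi else wi
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi * n / total for wi in w]
    return committee


def score(committee, x):
    """f(x): weighted vote of the stumps; positive means classified as cancer."""
    return sum(alpha * (sign if x[j] > c else -sign) for alpha, j, c, sign in committee)


# Toy single-peak data that no single stump can classify correctly.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1, 1, -1, -1, 1, 1]
committee = boost(X, y, n_rounds=3)
print([1 if score(committee, x) > 0 else -1 for x in X])
```

In the toy data, subjects with intermediate intensities are normal and those at either extreme are cancer, so the committee needs several stumps with different thresholds to represent the pattern.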

When to stop iteration?
- Minimal margin: the minimum of $y_i f(x_i)$ over all $N$ subjects
- The minimal margin in the training sample measures how well the two classes are separated by the classifier.
- Even after the classifier reaches zero error on the training sample, further iterations that still increase the minimal margin can improve prediction in future samples.
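
A stopping rule based on the minimal margin might look like the sketch below. The helper names and the tolerance are hypothetical, and the margin is left unnormalized as on the slide; dividing f(x) by the sum of the stage coefficients is a common normalization that the slide does not mention.

```python
def minimal_margin(labels, scores):
    """Smallest y_i * f(x_i); a positive value means every subject is classified correctly."""
    return min(y * f for y, f in zip(labels, scores))


def should_continue(margin_history, tolerance=1e-6):
    """Keep boosting while the latest round still raised the minimal margin."""
    if len(margin_history) < 2:
        return True
    return margin_history[-1] > margin_history[-2] + tolerance


# Labels in {1, -1} and hypothetical classifier scores f(x_i) after one round.
labels = [1, -1, 1]
scores = [2.0, -0.5, 0.1]
print(minimal_margin(labels, scores))
```

In use, the minimal margin would be appended to `margin_history` after each boosting round and iteration would stop once `should_continue` returns False.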

Qu et al. Boosted Decision Tree Analysis of SELDI Mass Spectral Serum Profiles Discriminates Prostate Cancer from Non-Cancer Patients. Clinical Chemistry. In press.
Adam et al. Serum Protein Fingerprinting Coupled with a Pattern Matching Algorithm that Distinguishes Prostate Cancer from Benign Prostate Hyperplasia and Healthy Men. Cancer Research. 62:

Summary
- Wavelets approach: Does not require peak identification (black-box classification)
- Boosting decision tree: Requires peak identification first. Useful for both classification and protein mass identification

Final Summary
- The methods developed in the past two years are mainly for Phase 1 and 2 studies, reflecting the current needs of EDRN.
- EDRN DMCC statisticians are working on key design and analysis issues in early detection research.
- More work remains to be done (e.g., in classification, consider the mislabeling of prostate cancer as BPH; examine gene-by-environment interactions).