Data Mining Consultant GlaxoSmithKline: US Pharma IT

Presentation transcript:

Integrating Discovery, Development, and Commercial Data into Data Mining
Jennifer Sloan, Data Mining Consultant, GlaxoSmithKline: US Pharma IT
15 September 2004

Data Mining Definition
Data Mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid and accurate predictions.

Data Mining is a tool that allows us to:
- Identify problematic areas
- Control process variability
- Make concrete decisions on business needs
- Develop a model which can aid in future business decisions

Commercial Data
- Analyzing Multivariate Data
- Managing Data Usage
- Model Building

Multivariate Data Sets
- Data are multivariate in nature
- Large data sets containing multiple criteria within each observation
- Comparing multiple vectors is nearly impossible without reducing to a single point

Here we view 5-dimensional information on one observation. Each point represents a prescriber, and the color represents a Market Share increase or decrease. Overlapping distributions make this difficult to interpret, and further analysis is required. Over 200K observations are represented in this graph.

The same observations are shown, but now two-way interactions between the variables help us determine which variables are affecting market shifts. This leads to constructing models that predict prescriber behavior.

Drug Development

Drug Development Issues
- Adverse Event Reporting System (AERS)
- Over 2 million AE reports, covering approximately 2000 drugs and biologics, submitted to the FDA since 1968
- Creates an extremely complicated matrix of data
- Recently, data mining methods have helped address this issue with the development of a method used to examine large databases for associations between drugs and AEs

Data Mining Algorithm
- Multi-Item Gamma Poisson Shrinker (MGPS)
- Developed by William DuMouchel (AT&T)
- Through statistical modeling, this Empirical Bayesian method identifies higher-than-expected reporting relationships of drug-event combinations
- Automated, web-based system with rapid drill-down capability
- MGPS runs using all event terms and drugs in the AERS database and produces results for all drug-event combinations
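
The idea of a higher-than-expected reporting relationship can be illustrated with a minimal sketch. It computes the raw relative reporting ratio for each drug-event pair: the observed report count divided by the count expected if drug and event were reported independently. This is not MGPS itself, because the empirical Bayes shrinkage that MGPS applies to such ratios is omitted. The report format, the toy data, and the name `relative_reporting_ratios` are assumptions made for illustration.

```python
from collections import Counter

def relative_reporting_ratios(reports):
    """Return {(drug, event): observed / expected} for each pair seen.

    reports: list of (drug, event) tuples, one per report (hypothetical format).
    The expected count assumes drug and event are reported independently:
        E = (reports naming the drug) * (reports naming the event) / (all reports)
    """
    total = len(reports)
    pair_counts = Counter(reports)
    drug_counts = Counter(drug for drug, _ in reports)
    event_counts = Counter(event for _, event in reports)
    ratios = {}
    for (drug, event), observed in pair_counts.items():
        expected = drug_counts[drug] * event_counts[event] / total
        ratios[(drug, event)] = observed / expected
    return ratios

# Toy data, invented for illustration only.
reports = (
    [("drug_a", "rash")] * 8
    + [("drug_a", "nausea")] * 2
    + [("drug_b", "rash")] * 2
    + [("drug_b", "nausea")] * 8
)
ratios = relative_reporting_ratios(reports)
```

A ratio above 1 means the pair is reported more often than independence would predict. Raw ratios become unstable when counts are small, which is the usual motivation for shrinking them.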

MGPS: Significance
- Handles complex stratification (age, gender, year of report; > 945 categories)
- Performs complex computations in a minimal amount of time: much MORE EFFICIENT
Real World Example:

Membership: PhRMA-FDA Working Group
- Chair: June Almenoff (GSK)
- FDA Involvement
- Involved PhRMA companies: Abbott, Allergan, AstraZeneca, Bristol-Myers Squibb, GlaxoSmithKline, Johnson & Johnson, Lilly, Merck, Novartis, Schering-Plough, Pfizer, Roche, Wyeth

Drug Discovery

SCAM: Statistical Classification of Activities of Molecules
- Recursive partitioning customized for chemistry
- Creates a structure activity relationship (SAR) model
- Handles large numbers of descriptors (> 1 million)

SCAM: Data Structure
[Figure: binary (0/1) descriptor matrix with one row per compound, labeled with biological activities Y1, Y2, Y3, Y4, ..., Yn. Dimension annotations on the figure: > 2 million and > 100K.]
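
A binary matrix of this size is mostly zeros, so storing it densely is impractical. One common option, assumed here rather than taken from the slides, is to keep for each compound only the indices of the descriptors that are present. The compound names, descriptor indices, and the helper `split_on` below are hypothetical.

```python
# Each compound maps to the set of descriptor indices whose value is 1.
# All data here is invented for illustration.
compounds = {
    "cmpd_1": {0, 2, 4, 5},
    "cmpd_2": {0, 2, 5, 6},
    "cmpd_3": {1, 3},
}

def split_on(descriptor, compounds):
    """Partition compound names by presence or absence of one descriptor."""
    present = [name for name, feats in compounds.items() if descriptor in feats]
    absent = [name for name, feats in compounds.items() if descriptor not in feats]
    return present, absent

present, absent = split_on(2, compounds)
```

Membership tests on sets are cheap, which matters because a recursive partitioning method evaluates many candidate descriptors at every node.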

SCAM’s Recursive Partitioning

[Figure: one split of the tree. Parent node: ave = 0.34, sd = 0.81. Splitting on "Feature" gives two child nodes: n = 1614, ave = 0.29, sd = 0.73; and n = 36, ave = 2.60, sd = 0.9. rP = 2.03E-70, aP = 1.30E-66.]

AID, Automatic Interaction Detection, is a simple statistical method for finding interactions in large data sets. It is recursive. One variable and a split point are selected to divide the data set into two or more groups. Each group is divided in turn. When the groups become too small or are homogeneous, the splitting is stopped. A tree results, and the sub-grouping rules are read off the tree.

NB: It is possible to separate off different groups that are following different mechanisms. There are serious multiple testing questions. Also at issue is the statistic to use for splitting.

$$ t = \frac{\text{Signal}}{\text{Noise}} = \frac{2.60 - 0.29}{0.734 \sqrt{\frac{1}{36} + \frac{1}{1614}}} = 18.68 $$
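
The split statistic on this slide can be checked from the summary numbers alone. The sketch below assumes the noise term is the pooled standard deviation of the two child groups times the square root of the summed reciprocal group sizes, which is the standard two-sample t statistic; the function name `pooled_t` is hypothetical.

```python
import math

def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Two-sample t statistic using a pooled standard deviation.

    Returns (t, pooled_sd) computed from group sizes, means, and SDs.
    """
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    pooled_sd = math.sqrt(pooled_var)
    noise = pooled_sd * math.sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / noise, pooled_sd

# Child-node summaries as given on the slide.
t, pooled_sd = pooled_t(36, 2.60, 0.9, 1614, 0.29, 0.73)
print(round(pooled_sd, 3), round(t, 2))
```

With the slide's values the pooled SD comes out at about 0.734 and t at approximately 18.68, matching the figures shown.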

SCAM Tree
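
The recursive procedure described for AID can be sketched as a generic binary recursive partitioner: at each node, try every binary descriptor, keep the split with the largest absolute t statistic, and stop when groups are too small or no split is strong enough. This is an illustration under stated assumptions, not SCAM's actual implementation. The thresholds `min_size` and `min_t`, the toy data, and the function names are hypothetical, and no multiple-testing adjustment is applied.

```python
import math
from statistics import mean, stdev

def t_stat(a, b):
    """Pooled two-sample t statistic for two lists of activities (len >= 2 each)."""
    na, nb = len(a), len(b)
    var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    diff = mean(a) - mean(b)
    if var == 0:
        # No within-group spread: any mean difference is a perfect separation.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(var * (1 / na + 1 / nb))

def grow(rows, descriptors, min_size=2, min_t=2.0):
    """rows: list of (descriptor_set, activity). Returns a nested dict tree.

    min_size must be at least 2 so that each group has a defined SD.
    """
    node = {"n": len(rows), "ave": mean(y for _, y in rows)}
    best = None
    for d in descriptors:
        has = [y for feats, y in rows if d in feats]
        lacks = [y for feats, y in rows if d not in feats]
        if len(has) < min_size or len(lacks) < min_size:
            continue  # groups too small to split on this descriptor
        t = abs(t_stat(has, lacks))
        if best is None or t > best[0]:
            best = (t, d)
    if best is None or best[0] < min_t:
        return node  # leaf: no admissible or sufficiently strong split
    _, d = best
    rest = [x for x in descriptors if x != d]
    node["split_on"] = d
    node["present"] = grow([r for r in rows if d in r[0]], rest, min_size, min_t)
    node["absent"] = grow([r for r in rows if d not in r[0]], rest, min_size, min_t)
    return node

# Toy data, invented for illustration only.
rows = [
    ({"ring", "amine"}, 2.5),
    ({"ring"}, 2.7),
    ({"ring", "halogen"}, 2.4),
    ({"amine"}, 0.2),
    ({"halogen"}, 0.4),
    ({"amine", "halogen"}, 0.3),
]
tree = grow(rows, ["ring", "amine", "halogen"])
```

On this toy data the root splits on the descriptor that separates the active from the inactive compounds, and both children become leaves because further splits would produce groups below `min_size`.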

Advantages of SCAM
- Works for complex situations, mixtures, and interactions
- Output is easy to understand and explain
- High statistical power
- Produces a valid answer

SCAM Drawbacks
- Data greedy
- Only one view of the data
- Binary descriptors may be too “crude”
- Disposition of outliers is difficult
- Highly correlated variables may be obscured
- Higher order interactions may be masked
There are of course disadvantages to any algorithmic method used to analyze large data sets.

Concluding Remarks
- Data Mining enables us to efficiently handle LARGE amounts of data
- Data Mining allows us to perform analyses IN REAL TIME
- Data Mining covers a wide array of topics in the drug industry, and its benefits are plentiful

References
- Almenoff, June S., et al. “Disproportionality Analysis Using Empirical Bayes Data Mining: A Tool for the Evaluation of Drug Interactions in the Post-Marketing Setting.” Pharmacoepidemiology and Drug Safety, 12, 517-521 (2003).
- Donahue, Rafe. “An Overview of Data Mining in Drug Development and Marketing.” http://home.earthlink.net/~rafedonahue. May 2003.
- Hawkins, D.M. and G.V. Kass. “Automatic Interaction Detection.” Topics in Applied Multivariate Analysis, ed. Hawkins (1982).
- Hawkins, D.M., S.S. Young and A. Rusinko. “Analysis of a Large Structure-Activity Data Set Using Recursive Partitioning.” QSAR, 16, 296-302 (1997).