
1 Identifying Unproven Cancer Treatments on the Health Web: Addressing Accuracy, Generalizability, and Scalability. Center for Health Informatics and Bioinformatics, NYU Langone Medical Center. Yin Aphinyanaphongs MD, PhD; Lawrence Fu, PhD; Constantin Aliferis MD, PhD. Medinfo, Copenhagen, Denmark, 08/19/2013.

2 Our Vision: A Safer Health Web. We envision a health web that provides safe, reliable, validated medical information to health consumers. We focus on the lowest-quality websites and on building classifiers to identify them. Our initial pilot targeted unproven treatments in cancer.

3 Cancer, Health Consumers, and the Health Web. In the US, 65% of cancer patients searched for unproven treatments, and 12% purchased at least one. 83% of cancer patients used at least one unproven treatment. Patients with a dire prognosis are more susceptible to using these unproven treatments. Routine cancer search topics return sites with non-conventional treatments, and these sites are of variable quality: 23% discouraged the use of conventional medicine, 15% discouraged adherence to physician advice, and 26% provided anecdotal experiences. Schmidt and Ernst found that nearly 10% of their sample had potentially harmful or definitely harmful material. Impact of bad advice: death; financial costs with no benefit; reduction in efficacy of known treatments; delayed access to proven therapies.

4 How Is the Health Web Policed Now? Manual efforts: rating agencies (HONCode) and self ratings. Operation Cure.all: an effort led by the Federal Trade Commission in 1997 and 1999, in collaboration with 80 agencies and organizations from 25 countries, which identified approximately 800 to 1,200 web sites purporting to cure a variety of diseases.

5 How Do We Get to a Safer Health Web? Automation. In our prior work, we built models that successfully identified known unproven cancer treatments on the Internet (where an unproven treatment is known and the goal is to identify new websites that may market it; Area Under the Curve = 0.93). Some questions left to consider: What happens with unknown unproven treatments (i.e., treatments that are not known a priori)? Is it possible to build models that will identify these unknown unproven treatments? What methods can scale a high-performing machine learning model to billions of web pages? This paper addresses these limitations of the prior work.

6 Where Can Machine Learning Generalize and Scale?
1. Preprocessing
2. Generalizability
3. Model Building
4. Model Application

7 Solution 1: MapReduce for Pre-processing
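The MapReduce pattern the slide refers to can be illustrated with a minimal in-process sketch (the real system runs on Hadoop across many machines; the corpus and function names here are hypothetical). Mappers emit per-document term counts, a shuffle step groups them by term, and reducers aggregate:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Mapper: emit a (term, 1) pair for every token in one document."""
    for term in text.lower().split():
        yield term, 1

def reduce_phase(term, counts):
    """Reducer: sum the counts for one term across all documents."""
    return term, sum(counts)

def run_job(corpus):
    """Drive the shuffle step that Hadoop performs between map and reduce."""
    grouped = defaultdict(list)
    for doc_id, text in corpus.items():
        for term, count in map_phase(doc_id, text):
            grouped[term].append(count)
    return dict(reduce_phase(t, c) for t, c in grouped.items())

# Toy two-document corpus (hypothetical data).
corpus = {"d1": "cancer treatment cure", "d2": "unproven cancer cure cure"}
print(run_job(corpus))  # term counts: cancer 2, treatment 1, cure 3, unproven 1
```

Because map and reduce are pure functions over independent keys, the same logic parallelizes across a cluster, which is what yields the speedup reported on the computational-performance slide.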

8 Solution 2: Feature Selection. Feature selection identifies a subset of relevant features for use in model construction that ideally maintains or improves predictivity relative to the full feature set. Generalized Local Learning - Parents Children (GLL-PC) is one such algorithm: under several assumptions, it identifies the smallest subset of features that gives optimal classification performance. Image: causal-graph-based analysis of genome-wide association data in rheumatoid arthritis.
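To make the idea concrete, here is a deliberately simple univariate filter over binary term features. This is NOT the GLL-PC algorithm (which additionally tests conditional independence relative to other features to find a Markov-blanket-like set); it only illustrates the general shape of a filter: score each feature, keep the top k. All data and names are hypothetical:

```python
def select_features(X, y, k):
    """Rank binary term features by the absolute difference in their
    document frequency between positive and negative classes and keep
    the top k. A simple univariate filter, not GLL-PC."""
    n_features = len(X[0])
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(n_features):
        p = sum(r[j] for r in pos) / len(pos)   # frequency in positives
        q = sum(r[j] for r in neg) / len(neg)   # frequency in negatives
        scores.append((abs(p - q), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
print(select_features(X, y, 1))  # [0]
```

The payoff reported later in the deck, a drop from 9,187 features to 96 with slightly better AUC, comes from a much stronger causally motivated selector, but the interface (data in, small feature index set out) is the same.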

9 Area Under the Curve. Receiver Operating Characteristic (ROC) curves and the Area Under the Curve. Centor RM. The use of ROC curves and their analyses. Med Decis Making 1991;11(2):102-106.

10 Area Under the Curve as a Measure of Performance
Perfect separation: Area Under the Curve ~ 1.0
Okay separation: Area Under the Curve ~ 0.8
Random separation: Area Under the Curve ~ 0.5
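The AUC values above have a direct probabilistic reading: the AUC equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one (the normalized Mann-Whitney U statistic). A small sketch with made-up scores:

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs where the
    positive example outranks the negative one; ties count half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

print(auc([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))  # 1.0  (perfect separation)
print(auc([0.9, 0.2], [0.8, 0.3]))            # 0.5  (random-level separation)
```

This is why an AUC of 0.5 corresponds to random separation and 1.0 to perfect separation on the slide.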

11 Gold Standard
8 quack treatments identified by
Applied to Google, appending "cancer" and "treatment."
Top 30 results for each treatment labeled by the authors. Two authors labeled 191 of 240 web pages as making unproven claims or not (inter-rater reliability: kappa 0.76).
Excluded:
- not found (404 response code) and error pages
- no-content pages
- non-English pages
- password-protected pages
- PDF pages
- redirect pages
- pages where the actual treatment text does not appear in the document
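The kappa of 0.76 quoted above is Cohen's kappa, which corrects raw agreement for the agreement two raters would reach by chance. A minimal sketch; the label vectors below are hypothetical, not the study's actual annotations:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Inter-rater agreement corrected for chance:
    kappa = (p_observed - p_expected) / (1 - p_expected)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels (1 = page makes unproven claims).
a = [1, 1, 0, 1, 0, 0, 1, 0]
b = [1, 1, 0, 1, 0, 1, 1, 0]
print(round(cohens_kappa(a, b), 2))  # 0.75
```

A kappa in the 0.6-0.8 range is conventionally read as substantial agreement, which supports using the two-rater labels as a gold standard.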

12 Build 8 Independent Training/Testing Sets

Category                      Train Size   Test Size
Insulin Potentiation Therapy  146          18+ / 7-
Cure for All Cancers          147          16+ / 8-
Mistletoe                     145           8+ / 18-
Metabolic Therapy             153          11+ / 7-
Macrobiotic Diet              148           4+ / 19-
ICTH                          162           5+ / 4-
Krebiozen                     151          10+ / 10-
Cellular Health               154           9+ / 8-

13 Experimental Design
8-fold Leave-One-Treatment-Out cross-validation.
Classifiers: linear Support Vector Machines, L1-regularized logistic regression, L2-regularized logistic regression. Classifiers optimized over cost and regularization coefficient, respectively.
Text pre-processed with the term frequency - inverse document frequency (TF-IDF) weighting scheme.

14 Results - Generalizability to Unknown Unproven Treatments (Area Under the Curve)

Category                      Train  Test       Linear SVM  L1 LogReg  L2 LogReg
Insulin Potentiation Therapy  146    18+ / 7-   0.96        0.97
Cure for All Cancers          147    16+ / 8-   0.89        0.85
Mistletoe                     145     8+ / 18-  0.90        0.88       0.90
Metabolic Therapy             153    11+ / 7-   0.97        0.99       0.96
Macrobiotic Diet              148     4+ / 19-  1.0         0.94       1.0
ICTH                          162     5+ / 4-   1.0         0.94       1.0
Krebiozen                     151    10+ / 10-  0.86        0.75       0.90
Cellular Health               154     9+ / 8-   0.89        0.91       0.93

15 Results - Feature Selection

Feature Set / Classifier                       Number of Features   Area Under the Curve
All Features / Linear SVM                      9,187                0.946*
Generalized Local Learning - Parents Children  96                   0.974*

* Performance computed with 8-fold nested cross-validation; example documents are held out randomly, without regard to the selected treatment.
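The starred AUCs rely on nested cross-validation, where hyper-parameter selection happens only inside each outer training fold so the outer test fold never influences model choice. A minimal, generic sketch; the `fit`/`score` callables and the toy data are hypothetical stand-ins, not the study's actual classifiers:

```python
import random

def kfold(n, k, seed=0):
    """Yield (train_idx, test_idx) splits for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def nested_cv(data, params, fit, score, k=4):
    """Outer folds estimate performance; inner folds (run on the outer
    training set only) pick the hyper-parameter."""
    outer_scores = []
    for tr, te in kfold(len(data), k):
        def inner_score(p):
            total = 0.0
            for itr, ite in kfold(len(tr), k, seed=1):
                model = fit([data[tr[i]] for i in itr], p)
                total += score(model, [data[tr[i]] for i in ite])
            return total
        best = max(params, key=inner_score)
        model = fit([data[i] for i in tr], best)
        outer_scores.append(score(model, [data[i] for i in te]))
    return sum(outer_scores) / len(outer_scores)

# Toy usage: the "model" is just a constant label parameter.
data = [(i, i % 2) for i in range(24)]
fit = lambda rows, p: p
score = lambda model, rows: sum(r[1] == model for r in rows) / len(rows)
est = nested_cv(data, params=[0, 1], fit=fit, score=score)
```

The key design point is that `best` is recomputed per outer fold, so the reported average is an unbiased estimate of the whole pipeline, selection included.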

16 Computational Performance Point Estimates

Corpus Preparation   Number of Documents   Speed (point estimate)
Single Threaded      200,000               1,300 seconds
Hadoop/MapReduce     200,000               450 seconds

Model Building                                 Number of Features   Speed (point estimate)
All Features                                   9,187                0.20 seconds
Generalized Local Learning - Parents Children  96                   0.0069 seconds

Model Application                              Number of Documents   Number of Features   Speed (point estimate)
All Features                                   145,000               9,187                98.6 seconds
Generalized Local Learning - Parents Children  145,000               96                   6 seconds

17 Future Work
Larger sample size.
Build classification models in other languages.
Interval (rather than point) estimates of computational performance.
Address the adversarial nature of this application.
Further explore labeled sample size versus the ability to identify low-quality websites on the full web.
Efficient collection of web pages.

18 Conclusions
Generalization to unknown unproven treatments: evidence that training in a production environment will need only a small number of labeled documents.
Scalability: MapReduce for pre-processing; feature selection speeds up both model building and model application.

19 References
Aphinyanaphongs Y, Aliferis C. Text categorization models for identifying unproven cancer treatments on the web. MEDINFO 2007.
Metz J. A multi-institutional study of Internet utilization by radiation oncology patients. Int J Radiat Oncol Biol Phys. 2003;56(4):1201-1205.
Richardson M, Sanders T, Palmer J. Complementary/alternative medicine use in a comprehensive cancer center and the implications for oncology. J Clin Oncol. 2000.
Schmidt K. Assessing websites on complementary and alternative medicine for cancer. Ann Oncol. 2004;15(5):733-742.
Kim DY, Lee HR, Nam EM. Assessing cancer treatment related information online: unintended retrieval of complementary and alternative medicine web sites. Eur J Cancer Care. 2009;18(1):64-68.
Hainer M, Tsai N, Komura S. Fatal hepatorenal failure associated with hydrazine sulfate. Ann Intern Med. 2000.
Bromley J. Life-threatening interaction between complementary medicines: cyanide toxicity following ingestion of amygdalin and vitamin C. Ann Pharmacother. 2005;39(9):1566-1569.
Sparreboom A. Herbal remedies in the United States: potential adverse interactions with anticancer agents. J Clin Oncol. 2004;22(12):2489-2503.
"Operation Cure.all" Targets Internet Health Fraud. Federal Trade Commission, 24 Jun 1999.
Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Commun ACM. 2008;51(1):107-113.
Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD. Local causal and Markov blanket induction for causal discovery and feature selection for classification part I: algorithms and empirical evaluation. J Mach Learn Res. 2010;11:171-234.

20 Thank You. NYU Langone Medical Center, Center for Health Informatics and Bioinformatics. Center for Translational Sciences Institute Grant 1UL1RR029893 from the National Center for Research Resources, National Institutes of Health. Collaborators: Lawrence Fu, Constantin Aliferis. Contact: Yin Aphinyanaphongs.
