Data, Databases, and Discovery Andy Novobilski, PhD UT Chattanooga Computer Science Nuts and Bolts Research Methods Symposium UT College of Medicine Chattanooga.


1 Data, Databases, and Discovery
Andy Novobilski, PhD
UT Chattanooga Computer Science
Nuts and Bolts Research Methods Symposium
UT College of Medicine Chattanooga
September 29, 2006

2 An Introduction to Knowledge Discovery
Data Collection
Data Validation
Preprocessing of Data
Mining the Data
Comparing Methods

3 Data Collection …
Paper or Electronic?
  – Fingernet
Continuous or Discrete?
And the Understatement of the Year …
  Health Insurance Portability and Accountability Act of 1996
  The HIPAA website http://www.hipaa.org/ links to the government’s website http://aspe.hhs.gov/admnsimp/ which states “Administrative Simplification in the Health Care Industry”

4 … And Raw Storage …
Alphanumeric Data
  – Excel Worksheets
  – Comma/Tab Delimited Text Files
  – XML: The Extensible Markup Language http://www.xml.com/
Binary Data
  – Images
      GIF, BMP, EPS
  – Streaming Data
      HL7 - http://www.hl7.org/ (http://en.wikipedia.org/wiki/HL7)
      DICOM - http://medical.nema.org/

5 … Stored in a Relational Manner
Relational Databases
  – Inexpensive
      MS Access
  – Expensive
      MS SQL Server, Oracle, Sybase, …
  – Free (sort of … open source)
      MySQL, PostgreSQL
Licensing Varies by Usage

6 Data Validation
Patient 002 is a …
  – Pregnant Male (hit the 9 instead of 0)
  – With Ice Water in His Veins (misplaced decimal)
  – Who Might or Might Not Smoke (missing data)

Id   Gender  Age  Months Pregnant  Temp  Smoker
001  M       55   0                98.3  Yes
002  M       55   9                9.82  .
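The three problems with Patient 002 can be caught with simple rule checks. The following is a minimal sketch, not the presenter's method: the field names, the `validate` helper, and the plausible temperature range are all assumptions made for illustration.

```python
# Hypothetical plausibility limits for body temperature in degrees Fahrenheit,
# chosen only for illustration.
TEMP_RANGE_F = (90.0, 110.0)

def validate(record):
    """Return a list of human-readable problems found in one patient record."""
    problems = []
    # Cross-field check: a male patient should not have months pregnant > 0.
    if record["gender"] == "M" and (record["months_pregnant"] or 0) > 0:
        problems.append("male patient with months pregnant > 0")
    # Range check: catches a misplaced decimal such as 9.82 instead of 98.2.
    temp = record["temp"]
    if temp is None:
        problems.append("missing temperature")
    elif not TEMP_RANGE_F[0] <= temp <= TEMP_RANGE_F[1]:
        problems.append("temperature outside plausible range")
    # Completeness check: smoker status must be present and recognized.
    if record["smoker"] not in ("Yes", "No"):
        problems.append("missing or invalid smoker status")
    return problems

# The two rows from the slide; the missing smoker value is stored as None.
patients = [
    {"id": "001", "gender": "M", "age": 55, "months_pregnant": 0, "temp": 98.3, "smoker": "Yes"},
    {"id": "002", "gender": "M", "age": 55, "months_pregnant": 9, "temp": 9.82, "smoker": None},
]

report = {p["id"]: validate(p) for p in patients}
for patient_id, problems in report.items():
    print(patient_id, problems or "ok")
```

Flagging records rather than silently correcting them leaves the decision about each suspicious value to someone who can check the original chart.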

7 Preprocessing the Data
Clean-up
  – Out of Scope vs. Out of Family
Feature Extraction
  – Data Aggregation
Feature Transformation
  – Normalization
  – Principal Component Analysis
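As an illustration of the normalization step, here is a minimal sketch of two common rescalings, min-max scaling and z-score standardization. The slide does not say which form of normalization was intended, so both are shown as assumptions, and the sample ages are made up.

```python
from statistics import mean, pstdev

def min_max(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Shift and scale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

ages = [25, 40, 55, 70]  # made-up sample data
scaled = min_max(ages)
standardized = z_score(ages)
print(scaled)
print(standardized)
```

Either transformation puts features measured on different scales (age in years, temperature in degrees) on a comparable footing before mining.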

8 Turning Data into Information
Data Mining …
  – Clustering
  – Decision Trees
  – Neural Networks
  – Bayesian Networks

9 Clustering
K-Means
[Figure: data points labeled Y and N]
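K-means alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The sketch below is one plain-Python rendering of that loop (Lloyd's algorithm); the sample points and the choice of starting centroids are assumptions for illustration, not data from the talk.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Cluster points around the given starting centroids; return (centroids, clusters)."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    return centroids, clusters

# Made-up two-dimensional data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

The result depends on the starting centroids, so in practice the algorithm is usually run several times from different starting points.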

10 Decision Trees
Division of Data Based on Information Gain
White Box
[Figure: decision tree splitting on Age, Smoker, and Gender (M/F), with Y/N leaves]
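Information gain is the drop in label entropy obtained by splitting the data on a feature; a tree builder picks the feature with the largest gain at each node. The following sketch computes it for a made-up four-row data set; the rows, field names, and helper functions are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, feature, target):
    """Entropy of the target before the split minus the weighted entropy after it."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Made-up records: here smoking separates the outcomes perfectly, gender not at all.
rows = [
    {"smoker": "Y", "gender": "M", "ill": "Y"},
    {"smoker": "Y", "gender": "F", "ill": "Y"},
    {"smoker": "N", "gender": "M", "ill": "N"},
    {"smoker": "N", "gender": "F", "ill": "N"},
]

gains = {f: information_gain(rows, f, "ill") for f in ("smoker", "gender")}
best_feature = max(gains, key=gains.get)
print(gains, best_feature)
```

Because each split is an explicit test on a named feature, the resulting tree can be read and checked by a clinician, which is what makes it a white box.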

11 Neural Networks
Functional Approximation to Data
  – Black Box
  – Most Common is Feed Forward, Back Propagation
Considerations in Training the Network
  – Many Types of Neural Networks
  – Difficulties with Discrete Data
  – Missing Data Requires Careful Consideration
[Figure: Case Data, Forecast]
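To make "feed forward, back propagation" concrete, here is a minimal sketch of a network with one hidden layer trained by per-example gradient descent on squared error. Everything specific in it is an assumption for illustration: the layer sizes, learning rate, epoch count, seed, and the toy data (two binary inputs whose target is their logical OR).

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_weights(n_in, n_hidden, seed=0):
    """Small random weights; the extra weight in each row is the bias."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    return w_hidden, w_out

def forward(x, w_hidden, w_out):
    """Feed-forward pass: return the hidden activations and the single output."""
    xb = x + [1.0]  # append constant bias input
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, xb))) for ws in w_hidden]
    output = sigmoid(sum(w * v for w, v in zip(w_out, hidden + [1.0])))
    return hidden, output

def loss(data, w_hidden, w_out):
    """Mean squared error over the data set."""
    return sum((forward(x, w_hidden, w_out)[1] - y) ** 2 for x, y in data) / len(data)

def train(data, w_hidden, w_out, epochs=2000, rate=0.5):
    """Back propagation: update the weights in place after each example."""
    for _ in range(epochs):
        for x, y in data:
            hidden, out = forward(x, w_hidden, w_out)
            # Error signal at the output, then propagated back to each hidden unit.
            delta_out = (out - y) * out * (1.0 - out)
            delta_hidden = [delta_out * w_out[j] * h * (1.0 - h) for j, h in enumerate(hidden)]
            for j, v in enumerate(hidden + [1.0]):
                w_out[j] -= rate * delta_out * v
            for j, d in enumerate(delta_hidden):
                for i, v in enumerate(x + [1.0]):
                    w_hidden[j][i] -= rate * d * v

# Toy data: the target is the logical OR of the two inputs.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
w_hidden, w_out = init_weights(n_in=2, n_hidden=3)
loss_before = loss(data, w_hidden, w_out)
train(data, w_hidden, w_out)
loss_after = loss(data, w_hidden, w_out)
print(loss_before, loss_after)
```

The learned weights carry no direct clinical meaning, which is why the slide calls this approach a black box.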

12 Bayesian Networks
Belief Networks
  – White Box
Causal Orientation
Beliefs are Updated Based on New Information
Nodes Can Serve as Both Evidence and Query Points
Handles Missing Data Gracefully
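A two-node network is enough to show belief updating. The sketch below assumes a single causal link from smoking to illness and answers queries by enumerating the joint distribution; the structure and every probability in it are hypothetical numbers chosen for illustration, not values from the talk or the cited study.

```python
from itertools import product

# Hypothetical parameters for a network Smoker -> Ill.
P_SMOKER = 0.3
P_ILL_GIVEN_SMOKER = {True: 0.2, False: 0.05}

def joint(smoker, ill):
    """Joint probability of one complete assignment to both nodes."""
    p_smoker = P_SMOKER if smoker else 1.0 - P_SMOKER
    p_ill = P_ILL_GIVEN_SMOKER[smoker] if ill else 1.0 - P_ILL_GIVEN_SMOKER[smoker]
    return p_smoker * p_ill

def query(target, evidence=None):
    """P(target is True | evidence), summing over every node left unobserved."""
    evidence = evidence or {}
    numerator = denominator = 0.0
    for smoker, ill in product([True, False], repeat=2):
        world = {"smoker": smoker, "ill": ill}
        if any(world[name] != value for name, value in evidence.items()):
            continue  # inconsistent with what was observed
        p = joint(smoker, ill)
        denominator += p
        if world[target]:
            numerator += p
    return numerator / denominator

p_ill_unknown = query("ill")                           # smoker status missing
p_ill_if_smoker = query("ill", {"smoker": True})       # smoker as evidence
p_smoker_if_ill = query("smoker", {"ill": True})       # smoker as query
print(p_ill_unknown, p_ill_if_smoker, p_smoker_if_ill)
```

The same node appears as evidence in one query and as the query point in another, and a missing value is handled by summing over it rather than by discarding the record.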

13 An Example
Novobilski, Andrew, F. Fesmire, D. Sonnemaker. "Mining Bayesian Networks to Forecast Adverse Outcomes Related to Acute Coronary Syndrome." The 17th International FLAIRS Conference, 2004.

14 Comparing Models – The ROC Curve
The Receiver Operating Characteristic (ROC) Curve
  – Plots the Percentage of True Positives against the Percentage of False Positives as the Cutoff Value is varied from everyone classified as ill to everyone classified as healthy.
  – The area under the curve provides a consistent measure of model fitness that varies between 0 and 100 (percent).
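The curve can be traced by sweeping the cutoff across the model's scores and recording the true-positive and false-positive rates at each step. This is a minimal sketch under assumptions: the scores and labels are made up, a patient is classified as ill when the score is at or above the cutoff, and the summary number is the trapezoidal area under the curve scaled to 0 through 100.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per cutoff.

    labels are True for ill patients. The sweep runs from a cutoff above every
    score (everyone classified healthy) down to the lowest score (everyone
    classified ill), which is the slide's sweep traversed in reverse.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    cutoffs = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for cutoff in cutoffs:
        true_pos = sum(1 for s, ill in zip(scores, labels) if s >= cutoff and ill)
        false_pos = sum(1 for s, ill in zip(scores, labels) if s >= cutoff and not ill)
        points.append((false_pos / negatives, true_pos / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as points ordered by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Made-up model scores and true outcomes for six patients.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [True, True, False, True, False, False]

points = roc_points(scores, labels)
auc_percent = 100.0 * area_under_curve(points)
print(points)
print(auc_percent)
```

Because the area does not depend on any single cutoff, it allows classifiers that produce scores on different scales to be compared on equal terms.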

15 An Illustration
[Figure labels: Healthy, Cutoff Value, Ill]

16 Comparing Multiple Classifiers

17 In Summary …
A Process to Consider …
  – Collect, Validate, Preprocess, Mine, Compare
Excellent Software is Available
  – Both Commercial and Open Source
Sample Data Is Available

18 Thank You!
Questions and/or Comments are Welcome …
Dr. Andy Novobilski
UT Chattanooga Computer Science
615 McCallie Ave., Dept. 2302
Chattanooga, TN 37403
(423) 425-4202
Andy-Novobilski@utc.edu
http://www.utc.edu/Faculty/Andy-Novobilski

