
Slide 1 DSCI 4520/5240: Data Mining Fall 2013 – Dr. Nick Evangelopoulos Lecture 1: Introduction to Data Mining Some slide material based on: Groth; Han.


1 slide 1 DSCI 4520/5240: Data Mining Fall 2013 – Dr. Nick Evangelopoulos Lecture 1: Introduction to Data Mining Some slide material based on: Groth; Han and Kamber; Cerrito; SAS Education

2 slide 2 DSCI 4520/5240 DATA MINING ITDS Résumé Book ITDS majors (BCIS/DS), please send your résumé to melody.white@unt.edu so that we can include it in the ITDS Résumé Book we send to our corporate partners for hiring/co-op consideration. Make sure your résumé is formatted per UNT standards. Sample résumés: https://unt.optimalresume.com/

3 slide 3 DSCI 4520/5240 DATA MINING Data (and the lack thereof) “It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.” (Sir Arthur Conan Doyle: Sherlock Holmes, "A Scandal in Bohemia") http://www.dilbert.com/2012-12-05/

4 slide 4 DSCI 4520/5240 DATA MINING Data (and the lack thereof) http://www.dilbert.com/2012-12-05/

5 slide 5 DSCI 4520/5240 DATA MINING Nobel Laureate Calls Data Mining "A Must" In a January 1999 interview with ComputerWorld, Dr. Penzias (who won the 1978 Nobel Prize in Physics and was vice president and chief scientist at Bell Laboratories) named large-scale data mining from very large databases as the key application for corporations in the next few years. In response to ComputerWorld's age-old question, "What will be the killer applications in the corporation?", Dr. Penzias replied: "Data mining." He then added: "Data mining will become much more important and companies will throw away nothing about their customers because it will be so valuable. If you're not doing this, you're out of business."

6 slide 6 DSCI 4520/5240 DATA MINING What Is Data Mining?
Data mining (knowledge discovery in databases):
• A process of identifying hidden patterns and relationships within data (Groth)
Data mining:
• Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

7 slide 7 DSCI 4520/5240 DATA MINING Motivation: “Necessity is the Mother of Invention”
Data explosion problem
• Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
Problem: We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
• Data warehousing and on-line analytical processing
• Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

8 slide 8 DSCI 4520/5240 DATA MINING Data Deluge: electronic point-of-sale data, hospital patient registries, catalog orders, bank transactions, remote sensing images, tax returns, airline reservations, credit card charges, stock trades, OLTP, telephone calls

9 slide 9 DSCI 4520/5240 DATA MINING Data Mining, circa 1963: IBM 7090, 600 cases. “Machine storage limitations restricted the total number of variables which could be considered at one time to 25.”

10 slide 10 DSCI 4520/5240 DATA MINING Business Decision Support
• Database Marketing
  – Target marketing
  – Customer relationship management
• Credit Risk Management
  – Credit scoring
• Fraud Detection
• Healthcare Informatics
  – Clinical decision support

11 slide 11 DSCI 4520/5240 DATA MINING Required Expertise
• Domain
• Data
• Analytical Methods

12 slide 12 DSCI 4520/5240 DATA MINING Multidisciplinary [Diagram labels: Databases, Statistics, Pattern Recognition, KDD, Machine Learning, AI, Neurocomputing, Data Mining]

13 slide 13 DSCI 4520/5240 DATA MINING What Is Data Mining?
• IT: Complicated database queries
• ML: Inductive learning from examples
• Stat: What we were taught not to do

14 slide 14 DSCI 4520/5240 DATA MINING Comparing Statistics to Data Mining (from Cerrito 2006)

15 slide 15 DSCI 4520/5240 DATA MINING Comparing Statistics to Data Mining (from Cerrito 2006)

16 slide 16 DSCI 4520/5240 DATA MINING Predictive Modeling [Figure labels: Inputs, Cases, Target]

17 slide 17 DSCI 4520/5240 DATA MINING Types of Targets
• Supervised Classification
  – Event/no event (binary target)
  – Class label (multiclass problem)
• Regression
  – Continuous outcome
• Survival Analysis
  – Time-to-event (possibly censored)
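The target types on this slide can be made concrete with a small Python sketch. The toy target columns, the `SurvivalTarget` class, and the helper `describe_target` are illustrative assumptions and not part of the lecture; the helper only inspects the values to suggest which kind of modeling problem a target implies.

```python
from dataclasses import dataclass


@dataclass
class SurvivalTarget:
    """Time-to-event target; `observed` is False when the case is censored."""
    time: float
    observed: bool


def describe_target(values):
    """Return a rough label for the modeling problem a target column implies."""
    if all(isinstance(v, SurvivalTarget) for v in values):
        return "survival analysis"
    if all(isinstance(v, float) for v in values):
        return "regression"
    # Anything else is treated as a class label
    n_classes = len(set(values))
    return "binary classification" if n_classes == 2 else "multiclass classification"


# Hypothetical target columns, one per problem type
churned = [0, 1, 0, 0, 1]                       # event / no event
segment = ["gold", "silver", "bronze", "gold"]  # class label
spend = [120.5, 80.0, 310.25]                   # continuous outcome
months_to_churn = [SurvivalTarget(14.0, True), SurvivalTarget(24.0, False)]

for name, column in [("churned", churned), ("segment", segment),
                     ("spend", spend), ("months_to_churn", months_to_churn)]:
    print(name, "->", describe_target(column))
```

In practice the target type is a modeling decision rather than something inferred from the data, so this check is only a teaching aid.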

18 slide 18 DSCI 4520/5240 DATA MINING Why Data Mining? Potential Applications
Database analysis and decision support
• Market analysis and management
  – Target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
• Risk analysis and management
  – Forecasting, customer retention, improved underwriting, quality control, competitive analysis
• Fraud detection and management
Other Applications
• Text mining (newsgroups, email, documents) and Web analysis
• Intelligent query answering

19 slide 19 DSCI 4520/5240 DATA MINING Market Analysis and Management (1)
Where are the data sources for analysis?
• Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
• Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
Cross-market analysis
• Associations/correlations between product sales
• Prediction based on the association information
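Associations between product sales are commonly summarized with support, confidence, and lift. The sketch below computes them in plain Python; the baskets and item names are made-up illustrative data, and the function names are assumptions rather than anything defined in the lecture.

```python
# Hypothetical point-of-sale baskets (illustrative data only)
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) from basket counts."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Confidence relative to the consequent's baseline frequency."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


print(support({"diapers", "beer"}, baskets))
print(confidence({"diapers"}, {"beer"}, baskets))
print(lift({"diapers"}, {"beer"}, baskets))
```

A lift above 1 suggests the two products sell together more often than their individual frequencies would predict, which is the kind of association information the slide refers to using for prediction.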

20 slide 20 DSCI 4520/5240 DATA MINING Market Analysis and Management (2)
Customer profiling
• Data mining can tell you what types of customers buy what products (clustering or classification)
Identifying customer requirements
• Identifying the best products for different customers
• Use prediction to find what factors will attract new customers
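As one way to picture clustering for customer profiling, here is a minimal k-means sketch in plain Python. The customer points, the two features, and the starting centroids are all hypothetical, and k-means is only one of many clustering methods; the lecture does not prescribe a specific algorithm here.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical customers: (annual income in $1000s, monthly spend in $100s)
customers = [(30, 2), (35, 3), (32, 2), (90, 9), (95, 10), (88, 8)]
centroids, clusters = kmeans(customers, centroids=[(30, 2), (90, 9)])

for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

Each resulting cluster can then be described by its centroid, giving a rough profile of the customers it contains.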

21 slide 21 DSCI 4520/5240 DATA MINING Corporate Analysis and Risk Management
Finance planning and asset evaluation
• Cash flow analysis and prediction
• Contingent claim analysis to evaluate assets
• Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
Resource planning
• Summarize and compare the resources and spending
Competition
• Monitor competitors and market directions
• Group customers into classes and apply a class-based pricing procedure
• Set pricing strategy in a highly competitive market

22 slide 22 DSCI 4520/5240 DATA MINING Other Applications
Sports
• IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat
Astronomy
• JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
Internet Web Surf-Aid
• IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.

23 slide 23 DSCI 4520/5240 DATA MINING On the News: Rexer Analytics Annual Data Mining survey The 2013 survey will become available in Fall 2013 (stay tuned)

24 slide 24 DSCI 4520/5240 DATA MINING Rexer Analytics 2011 Survey Overview
SURVEY & PARTICIPANTS: 52-item survey of data miners, conducted on-line in 2011. Participants: 1,319 data miners from over 60 countries.
FIELDS & GOALS: CRM/Marketing has been the #1 field for the past five years. “Improving the understanding of customers”, “retaining customers” and other CRM goals continue to be the primary goals.
ALGORITHMS: Decision trees, regression, and cluster analysis continue to form the top three algorithms for most data miners. A third of data miners currently use text mining and another third plan to do so in the future.
TOOLS: R continued its rise this year and is now being used by close to half of all data miners (47%). R users prefer it for being free, open source, and having a wide variety of algorithms. STATISTICA is selected as the primary data mining tool by 17% of data miners. STATISTICA, KNIME, Rapid Miner and Salford Systems received the strongest satisfaction ratings.
ANALYTIC CAPABILITY AND SUCCESS MEASUREMENT: Only 12% of corporate respondents rate their company as having very high analytic sophistication. Measures of analytic success: Return on Investment (ROI), and predictive validity or accuracy of their models. Challenges to measuring success: user cooperation and data availability/quality.

25 slide 25 DSCI 4520/5240 DATA MINING Where Data Miners Work Data Mining is everywhere! Data miners also report working in Non-profit (6%), Hospitality / Entertainment / Sports (3%), Military / Security (3%), and Other (9%). © 2012 Rexer Analytics

26 slide 26 DSCI 4520/5240 DATA MINING The Algorithms Data Miners use © 2012 Rexer Analytics

27 slide 27 DSCI 4520/5240 DATA MINING The positive impact of Data Mining
In the 5th Annual Survey (2011) of Rexer Analytics (1,319 participant data miners from over 60 countries), data miners shared examples of situations where data mining is having a positive impact on society. The five areas mentioned most often were:
• Health / Medical Progress
• Business Improvements
• Personalized Communications & Marketing
• Fraud Detection
• Environmental

28 slide 28 DSCI 4520/5240 DATA MINING The rise of Text Mining © 2012 Rexer Analytics
[Chart categories: Text Miners, Plan to Start Text Mining, No Plans to Conduct Text Mining; values recovered: 34%, 33%]
Text Material
• Customer / market surveys: 38%
• Blogs and other social media: 33%
• E-mail or other correspondence: 27%
• News articles: 25%
• Scientific or technical literature: 23%
• Web-site feedback: 22%
• Online forums or review sites: 21%
• Contact center notes or transcripts: 16%
• Employee surveys: 15%
• Insurance claims or underwriting notes: 15%
• Medical records: 11%
• Point of service notes or transcripts: 10%

29 slide 29 DSCI 4520/5240 DATA MINING Data Mining Software © 2012 Rexer Analytics
The average data miner reports using 4 software tools. R is used by the most data miners (47%).
[Chart columns: Overall, Corporate, Consultants, Academics, NGO / Gov’t]

30 slide 30 DSCI 4520/5240 DATA MINING Satisfaction with Data Mining Tools © 2012 Rexer Analytics [Chart scale: Extremely Satisfied to Extremely Dissatisfied]

31 slide 31 DSCI 4520/5240 DATA MINING Measuring Analytic Success © 2012 Rexer Analytics
Question: Please share your best practices concerning how you measure analytic project performance / success. (text box provided for response)
[Bar chart; axis: Number of respondents, 0 to 60. Categories as listed:]
• Model Performance (Accuracy, F, ROC, AUC, Lift)
• Financial Performance (ROI, etc.)
• Performance in Control or Other Group
• Feedback from User / Client / Management
• Cross-Validation
[Bar values as extracted: 53, 43, 35, 29, 14]
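Some of the model-performance measures named on this slide can be computed directly from a list of predictions. The sketch below shows accuracy, the F measure (taken here to mean F1, which is an assumption), and lift in the top-scored fraction of cases. The labels, scores, and the 0.5 cutoff are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels coded 1 (event) and 0 (no event)."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    return tp, fp, fn, tn


def accuracy(actual, predicted):
    """Share of cases classified correctly."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(actual, predicted):
    """Harmonic mean of precision and recall."""
    tp, fp, fn, _ = confusion_counts(actual, predicted)
    return 2 * tp / (2 * tp + fp + fn)


def lift_at(actual, scores, fraction):
    """Event rate in the top-scored fraction divided by the overall event rate."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(fraction * len(ranked)))
    top_rate = sum(a for _, a in ranked[:top_n]) / top_n
    overall_rate = sum(actual) / len(actual)
    return top_rate / overall_rate


# Hypothetical holdout cases: true outcome and model score
actual = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.3, 0.2, 0.1, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed cutoff of 0.5

print(accuracy(actual, predicted))
print(f1_score(actual, predicted))
print(lift_at(actual, scores, 0.2))
```

Accuracy and F1 depend on the chosen cutoff, while lift depends only on how the model ranks the cases.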

32 slide 32 DSCI 4520/5240 DATA MINING Overcoming Data Mining challenges
In the four annual data miner surveys, these key challenges have been identified by data miners more than any others:
• Dirty Data
• Explaining Data Mining to Others
• Unavailability of Data / Difficult Access to Data

33 slide 33 DSCI 4520/5240 DATA MINING Data Mining: A KDD Process
• Data mining: the core of the knowledge discovery process.
[Diagram labels: Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation]

34 slide 34 DSCI 4520/5240 DATA MINING Steps of a KDD Process
Learning the application domain
• Relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of effort!)
Data reduction and transformation
• Find useful features, dimensionality/variable reduction, invariant representation
Choosing data mining algorithms
• Summarization, classification, regression, association, clustering
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
• Visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
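The steps above can be sketched as a chain of small functions. This is only a toy illustration under stated assumptions: the records, field names, the 100-unit spend threshold, and the function names are all hypothetical, and the "mining" step is a simple summarization rather than a real algorithm.

```python
from collections import Counter

# Hypothetical raw transaction records (illustrative only)
raw = [
    {"customer": "a", "region": "TX", "amount": "120.50"},
    {"customer": "b", "region": "TX", "amount": None},      # missing value
    {"customer": "a", "region": "TX", "amount": "120.50"},  # exact duplicate
    {"customer": "c", "region": "OK", "amount": "40.00"},
    {"customer": "d", "region": "TX", "amount": "300.00"},
]


def clean(records):
    """Data cleaning: drop records with missing values and exact duplicates."""
    seen, kept = set(), []
    for record in records:
        if None in record.values():
            continue
        key = tuple(sorted(record.items()))
        if key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept


def select(records, region):
    """Data selection: keep only the task-relevant records."""
    return [r for r in records if r["region"] == region]


def transform(records):
    """Transformation: bin the numeric amount into a spend level."""
    return [{**r, "level": "high" if float(r["amount"]) >= 100 else "low"} for r in records]


def mine(records):
    """Data mining (here just summarization): count cases per spend level."""
    return Counter(r["level"] for r in records)


def evaluate(pattern, min_count=2):
    """Pattern evaluation: keep patterns seen at least `min_count` times."""
    return {level: count for level, count in pattern.items() if count >= min_count}


patterns = evaluate(mine(transform(select(clean(raw), "TX"))))
print(patterns)
```

Even in this toy version, most of the code deals with cleaning and preparing the data, which echoes the slide's remark about where the effort goes.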

35 slide 35 DSCI 4520/5240 DATA MINING Data Mining and Business Intelligence
[Pyramid diagram; arrow label: Increasing potential to support business decisions]
Layers, top to bottom:
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA

