Lecture-2 Bscshelp.com
- Why Data Mining and What Kinds of Data Can Be Mined?
- Potential Applications

- Huge volumes of data are available: from terabytes to petabytes
- Data collection and data availability
  - Automated data collection tools, database systems, Web, computerized society
- Major sources of abundant data
  - Business: Web, e-commerce, transactions, stocks, …
  - Science: remote sensing, bioinformatics, scientific simulation, …
  - Society and everyone: news, digital cameras, YouTube
  - Medical data, demographic data, financial data, and marketing data
- We are drowning in data, but starving for knowledge!
- “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets

- Data analysis and decision support
  - Market analysis and management
    - Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  - Risk analysis and management
    - Forecasting, customer retention, quality control, competitive analysis
  - Fraud detection and detection of unusual patterns (outliers)
- Other applications
  - Text mining (newsgroups, email, documents) and Web mining
  - Stream data mining
  - Bioinformatics and bio-data analysis

- Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
  - Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
  - Determine customer purchasing patterns over time
- Cross-market analysis: find associations/correlations between product sales, and predict based on such associations
- Customer profiling: what types of customers buy what products (clustering or classification)
- Customer requirement analysis
  - Identify the best products for different groups of customers
  - Predict what factors will attract new customers
- Provision of summary information
  - Multidimensional summary reports

- Finance planning and asset evaluation
  - Cash flow analysis and prediction
- Resource planning
  - Summarize and compare the resources and spending
- Competition
  - Monitor competitors and market directions
  - Group customers into classes and apply a class-based pricing procedure
  - Set pricing strategy in a highly competitive market

- Approaches: clustering and model construction for frauds, outlier analysis
- Applications: health care, retail, credit card services, telecommunications
  - Money laundering: suspicious monetary transactions
  - Telecommunications: phone-call fraud
    - Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
  - Retail industry
    - Analysts estimate that 38% of retail shrink is due to dishonest employees
  - Anti-terrorism
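To make "patterns that deviate from an expected norm" concrete, here is a minimal sketch that flags call durations whose z-score is unusually large. The function name `flag_unusual_calls`, the threshold of 2.0, and the sample durations are all assumptions chosen for illustration; a real fraud model would also use the call destination and the time of day or week.

```python
from statistics import mean, stdev

def flag_unusual_calls(durations, threshold=2.0):
    """Return indices of calls whose duration z-score exceeds the threshold."""
    mu = mean(durations)
    sigma = stdev(durations)
    if sigma == 0:
        # All calls have the same duration, so nothing deviates from the norm.
        return []
    return [i for i, d in enumerate(durations) if abs(d - mu) / sigma > threshold]

# Hypothetical call durations in minutes; the last call is far from the rest.
call_minutes = [3, 4, 2, 5, 3, 4, 3, 2, 4, 95]
suspicious = flag_unusual_calls(call_minutes)
```

With a single extreme value the mean and standard deviation are themselves pulled toward it, so a median-based score may be a more robust choice on real data.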

- Approaches: clustering and classification
- Applications:
  - Automated diagnosis
  - Discovery of disease trends
  - Prediction of epidemics
  - Discovering causes for certain conditions
  - Patient data retrieval

- Data mining is a multidisciplinary field, borrowing from various areas including database technology, machine learning, statistics, pattern recognition, information retrieval, neural networks, knowledge-based systems, artificial intelligence, high-performance computing, and data visualization.

[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Other Disciplines, and Visualization]

- Database systems research focuses on the creation, maintenance, and use of databases for organizations and end-users.
- Database systems are well known for their high scalability in processing very large, relatively structured data sets.
- Many data mining tasks need to handle large data sets or even real-time, fast streaming data.
- So, data mining can make good use of scalable database technologies to achieve high efficiency and scalability on large data sets.
- A data warehouse integrates data originating from multiple sources and various timeframes.

- Machine learning investigates how computers can learn (or improve their performance) based on data.
- A main research area is for computer programs to automatically learn to recognize complex patterns and make intelligent decisions based on data.
- Supervised learning, unsupervised learning, semi-supervised learning, and active learning are some classic problems in machine learning that are highly related to data mining.

Supervised learning:
- Basically a synonym for classification
- The supervision in the learning comes from the labeled examples in the training data set
- E.g., the postal code recognition problem
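As a rough illustration of learning from labeled examples, the sketch below uses a 1-nearest-neighbor classifier, one of the simplest classification methods. The feature vectors, the labels, and the function name `nearest_neighbor_predict` are made up for illustration; real postal code recognition would work on features extracted from digit images.

```python
import math

def nearest_neighbor_predict(training, query):
    """Predict the label of `query` as the label of the closest training example."""
    _, closest_label = min(training, key=lambda pair: math.dist(pair[0], query))
    return closest_label

# Hypothetical labeled training set: (feature vector, class label).
training = [
    ((1.0, 1.0), "digit-1"),
    ((1.2, 0.8), "digit-1"),
    ((5.0, 5.0), "digit-7"),
    ((4.8, 5.3), "digit-7"),
]

# The class labels in `training` are the "supervision": the prediction for a
# new example is always one of the labels seen during training.
prediction = nearest_neighbor_predict(training, (1.1, 0.9))
```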

Unsupervised learning:
- Essentially a synonym for clustering
- The learning process is unsupervised since the input examples are not class labeled.
- We may use clustering to discover classes within the data.
- E.g., an unsupervised learning method can take, as input, a set of images of handwritten digits. Suppose that it finds 10 clusters of data. These clusters may correspond to the 10 distinct digits 0 to 9, respectively. However, since the training data are not labeled, the learned model cannot tell us the semantic meaning of the clusters found.
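The point that clusters carry no semantic meaning can be seen in a small k-means sketch. This is one common clustering algorithm, used here as an assumption since the slide does not name a method; the 2-D points and the choice of starting centroids are invented for illustration.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: returns (final centroids, cluster index for each point)."""
    centroids = list(centroids)
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignment

# Hypothetical unlabeled data forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignment = k_means(points, centroids=[points[0], points[3]])
```

The result only says "cluster 0" and "cluster 1". Nothing in the output says what either group represents, which mirrors the handwritten digits example.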

Semi-supervised learning:
- A class of machine learning techniques that make use of both labeled and unlabeled examples when learning a model
- Labeled examples are used to learn class models, and unlabeled examples are used to refine the boundaries between classes.
- For a two-class problem, we can think of the set of examples belonging to one class as the positive examples and those belonging to the other class as the negative examples.
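One simple way to picture "unlabeled examples refine the boundary" is self-training with class means, sketched below for a two-class, one-dimensional problem. Self-training is an assumed technique here, not one named in the lecture, and the function names and data values are hypothetical.

```python
def class_means(examples):
    """Mean feature value per class from (value, label) pairs."""
    totals = {}
    for value, label in examples:
        total, count = totals.get(label, (0.0, 0))
        totals[label] = (total + value, count + 1)
    return {label: total / count for label, (total, count) in totals.items()}

def boundary(means):
    """Midpoint between the two class means (two-class, 1-D case only)."""
    return (means["positive"] + means["negative"]) / 2

def self_train(labeled, unlabeled):
    """Pseudo-label each unlabeled value with the nearest class mean, then refit."""
    means = class_means(labeled)
    pseudo = [(x, min(means, key=lambda label: abs(x - means[label]))) for x in unlabeled]
    return class_means(labeled + pseudo)

# One labeled example per class, plus unlabeled values (all hypothetical).
labeled = [(1.0, "negative"), (9.0, "positive")]
unlabeled = [2.0, 3.0, 2.5, 8.0, 8.5, 7.5]

before = boundary(class_means(labeled))          # boundary from labeled data only
after = boundary(self_train(labeled, unlabeled)) # boundary refined by unlabeled data
```

Here the boundary moves from 5.0 to 5.1875 once the unlabeled values are taken into account.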

Active learning:
- It lets users play an active role in the learning process.
- An active learning approach can ask a user (e.g., a domain expert) to label an example, which may be from a set of unlabeled examples or synthesized by the learning program.
- The goal is to optimize the model quality by actively acquiring knowledge from human users, given a constraint on how many examples they can be asked to label.
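The labeling budget can be sketched with uncertainty sampling, a common active learning strategy that is assumed here rather than taken from the lecture. The learner fits a 1-D threshold and always asks about the unlabeled value nearest its current boundary. The `expert` function stands in for a human domain expert, and all names and values are hypothetical; the sketch also assumes the two classes can be separated by a single threshold.

```python
def active_learn(pool, oracle, seed_labeled, budget):
    """Uncertainty sampling for a 1-D threshold classifier.

    Returns the learned threshold and the list of values the oracle was asked about.
    """
    labeled = dict(seed_labeled)  # value -> True/False label
    unlabeled = [x for x in pool if x not in labeled]
    queried = []

    def threshold():
        # Midpoint between the largest known negative and smallest known positive.
        highest_negative = max(x for x, y in labeled.items() if not y)
        lowest_positive = min(x for x, y in labeled.items() if y)
        return (highest_negative + lowest_positive) / 2

    for _ in range(budget):
        if not unlabeled:
            break
        t = threshold()
        # The example nearest the current boundary is the least certain one.
        x = min(unlabeled, key=lambda v: abs(v - t))
        unlabeled.remove(x)
        labeled[x] = oracle(x)  # ask the (simulated) expert for a label
        queried.append(x)
    return threshold(), queried

pool = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

def expert(x):
    """Hypothetical stand-in for a human expert: values of 7 or more are positive."""
    return x >= 7

learned_threshold, queried = active_learn(pool, expert, {0: False, 10: True}, budget=3)
```

With only three questions (about 5, 7, and 6) the learner places the threshold at 6.5, between the true classes.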

- Statistics studies the collection, analysis, interpretation or explanation, and presentation of data.
- Data mining has an inherent connection with statistics.
- A statistical model is a set of mathematical functions that describe the behavior of the objects in a target class in terms of random variables and their associated probability distributions.
- Statistical models are widely used to model data and data classes.
- For example, in data mining tasks like data characterization and classification, statistical models of target classes can be built.
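A minimal sketch of "statistical models of target classes": fit one normal distribution per class and classify a new value by whichever class gives it the higher density. The normal distribution is an assumption made for this example, and the spending figures and function names are hypothetical.

```python
from statistics import NormalDist

def fit_class_models(samples_by_class):
    """Fit one normal distribution per class from its sample values."""
    return {
        label: NormalDist.from_samples(values)
        for label, values in samples_by_class.items()
    }

def most_likely_class(models, x):
    """Pick the class whose fitted density is highest at x."""
    return max(models, key=lambda label: models[label].pdf(x))

# Hypothetical monthly spending (arbitrary units) for two customer classes.
spending = {
    "budget": [20, 25, 22, 30, 28],
    "premium": [80, 95, 90, 85, 100],
}
models = fit_class_models(spending)
predicted = most_likely_class(models, 27)
```

Each fitted `NormalDist` characterizes its class (mean and spread), and comparing densities turns the same models into a simple classifier.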

- Information retrieval (IR) is the science of searching for documents or information in documents.
- Documents can be text or multimedia, and may reside on the Web.
- Traditional information retrieval differs from database systems in two ways. Information retrieval assumes that:
  1. the data under search are unstructured;
  2. the queries are formed mainly by keywords, which do not have complex structures (unlike SQL queries in database systems).
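A keyword query over unstructured text can be sketched with an inverted index, a standard IR data structure that the slide does not name explicitly. The documents, the function names, and the "all keywords must match" rule are assumptions for illustration.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word.strip(".,;:!?")].add(doc_id)
    return index

def keyword_search(index, query):
    """Return ids of documents containing every keyword in the query."""
    keywords = query.lower().split()
    if not keywords:
        return set()
    results = set(index.get(keywords[0], set()))
    for word in keywords[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical unstructured documents keyed by id.
documents = {
    1: "Data mining finds patterns in large data sets.",
    2: "Information retrieval searches unstructured documents.",
    3: "Text mining combines data mining with information retrieval.",
}
index = build_inverted_index(documents)
hits = keyword_search(index, "data mining")
```

Unlike an SQL query, the query here is just a bag of keywords with no schema, joins, or conditions.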

- Increasingly large amounts of text and multimedia data have been accumulated and made available online due to the fast growth of the Web and applications such as digital libraries, digital governments, and health care information systems.
- Their effective search and analysis have raised many challenging issues in data mining.
- Therefore, text mining and multimedia data mining, integrated with information retrieval methods, have become increasingly important.

- Pattern recognition is the study of methods and algorithms for putting data objects into categories.
- Pattern recognition is an application of machine learning.
- Pattern recognition systems are in many cases trained from labeled "training" data (supervised learning), but when no labeled data are available, other algorithms can be used to discover previously unknown patterns (unsupervised learning).

- An artificial neural network (ANN), often just called a "neural network" (NN), is a mathematical or computational model based on biological neural networks; in other words, it is an emulation of a biological neural system.
- It consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation.
- In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase.
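The adaptive behavior can be illustrated with the smallest possible case: a single artificial neuron trained with the perceptron rule to reproduce logical AND. The perceptron rule, the learning rate, the epoch count, and the training set are assumptions for this sketch; a real ANN connects many such neurons.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of inputs plus bias is positive, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0

def train_perceptron(examples, epochs=20, learning_rate=1.0):
    """Train a single artificial neuron with the perceptron rule."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            # The neuron adapts: weights change only when its output is wrong.
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Training data for logical AND: (inputs, expected output).
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
```

After training, `predict(weights, bias, inputs)` matches the expected output for all four input pairs.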

- Data mining: core of the knowledge discovery process

[Figure: knowledge discovery process, with stages labeled Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, and Pattern Evaluation]

- Learning the application domain
  - Relevant prior knowledge and goals of the application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of the effort!)
- Data reduction and transformation
  - Find useful features, dimensionality/variable reduction
- Choosing functions of data mining
  - Summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation
  - Visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge
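The steps above can be sketched as a chain of small functions. This is only a toy walk-through under stated assumptions: the record layout, the field names, and the choice of summarization as the mining function are all hypothetical.

```python
from statistics import mean

def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if None not in r.values()]

def select(records, fields):
    """Data selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records, field):
    """Transformation: rescale a numeric field to the range [0, 1]."""
    values = [r[field] for r in records]
    low, high = min(values), max(values)
    span = (high - low) or 1  # avoid dividing by zero when all values are equal
    return [{**r, field: (r[field] - low) / span} for r in records]

def mine(records, group_field, value_field):
    """Mining step (summarization): average value per group."""
    groups = {}
    for r in records:
        groups.setdefault(r[group_field], []).append(r[value_field])
    return {g: mean(v) for g, v in groups.items()}

# Hypothetical raw records; record 3 has a missing value and is dropped by clean().
raw = [
    {"id": 1, "region": "north", "spend": 100},
    {"id": 2, "region": "south", "spend": 300},
    {"id": 3, "region": "north", "spend": None},
    {"id": 4, "region": "south", "spend": 500},
    {"id": 5, "region": "north", "spend": 200},
]
prepared = transform(select(clean(raw), ["region", "spend"]), "spend")
summary = mine(prepared, "region", "spend")
```

Most of the code is preparation rather than mining, which is in line with the note that cleaning and preprocessing can take the majority of the effort.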

- Data mining may generate thousands of patterns; not all of them are interesting
- Interestingness measures
  - A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures
  - Objective (data-driven): based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective (user-driven): based on the user's beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.
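Support and confidence, the two objective measures named above, can be computed directly from transaction data. The sketch below uses their usual definitions for an association rule; the shopping baskets and the function names are hypothetical.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with the antecedent that also contain the consequent.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)

# Hypothetical market baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]
rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
```

For the rule bread -> milk, 3 of the 5 baskets contain both items (support 0.6), and 3 of the 4 baskets with bread also contain milk (confidence 0.75).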

- What Kinds of Patterns Can Be Mined?

June 21, 2016
