CES 514 – Data Mining Spring 2010 Sonoma State University.

Course Details:
- Instructor: Bala Ravikumar (Ravi)
- Tel: (707)
- Office: Darwin Hall 116 I
- Course Web Page
- Lecture time: 6 to 8:45 PM, Wednesdays
- Room: Salazar Hall 2003
- Office hours: M 9 – 10, T 11 – 12, W 5 – 6

Prerequisites
- Basic probability and statistics (probability distribution, random variable, conditional probability, etc.)
- Algorithms and data structures (sorting, hashing, binary trees, algorithm design techniques)
- Programming in a high-level language (Java, Python, Matlab, C#, …)
- Linear algebra (vectors, linear independence, matrix rank, Gaussian elimination, etc.)

These topics will be reviewed. However, it will help to spend some time on your own familiarizing yourself with them.

Textbook
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press.
Web site for the text:
This book’s focus is on web data mining.

Additional references
- Mining the Web, S. Chakrabarti, MKP.
- Data Mining, Witten and Frank, MKP.
- The Elements of Statistical Learning, Hastie, Tibshirani, and Friedman, Springer-Verlag.
- Web Data Mining: Exploring Hyperlinks, Contents and Usage Data, Bing Liu, Springer-Verlag.
- Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Pearson/Addison Wesley.

Overlapping fields
- Statistics
- Artificial intelligence (machine learning)
- Databases and information retrieval
- Natural language processing
- Algorithm design and analysis

Grading
- Quiz: 10%
- Homework: 25%
- Midterm: 15%
  - One midterm, in-class, open book/notes?
- Final exam: 25%
  - In-class or take-home?
- Project: 25%
  - Individual, design and implementation

Example Projects from Fall 2005 and 2007
- Strategy for predicting the winner in a game similar to Jai Alai.
- Handwritten character recognition.
- Classifying the type of disease based on some test results.
- Classification of (junk vs. useful, personal vs. business vs. family, etc.)
- Classification of questions in a multiple-choice test based on the responses of students.
- Identifying the author from a sample text.
- Implementing an association rule mining algorithm.
- Implementing a visualization algorithm that provides various options for viewing the data.
- Classifying mushrooms as edible or poisonous based on a number of attributes, such as color, length of the stem, width, etc.
- Classifying web sites based on content.

The project is done individually and is semester-long: implement, test, write a paper, present in class.

Today’s lecture
- Overview of the course
- Chapter 1 of the text

Overview of Topics
- Web data organization
- Web search
- Classification (supervised learning)
- Clustering (unsupervised learning)
- Association rule mining
- Language models for information retrieval
- Vector space models
- SVM and other tools
- LSI and tools from linear algebra
- Link analysis
- Other applications – e.g. bioinformatics

What is data mining?
- Data mining is also called knowledge discovery.
- Data mining is the extraction of useful patterns from data sources, e.g., databases, texts, web, images, etc.
- Patterns must be valid, novel, potentially useful, and understandable.
- Our focus will be on text data (in particular, the web).

Some sample problems in Data Mining
- Extract useful knowledge from the vast data and information available on the web (e.g., tagging web sites, labeling images, predicting the needs of a web surfer from the pattern of clicks).
- Using the financial record of a person, determine the risk involved in giving a loan. (The decision could be yes or no. More generally, it could be the type of loan: interest rate, duration, etc.)
- Movie (book, etc.) recommendation based on prior choices.
- Prediction of weather, traffic patterns, the outcome of an event, etc.
- From the items recorded at the check-out counter of a supermarket, determine any correlation between items being sold (used to decide which ones to put on sale).
- Study and understanding of social networks.
- Ranking web pages according to significance.

Classic data mining tasks
- Classification: mining patterns that can classify future (new) data into known classes.
- Association rule mining: mining any rule of the form X → Y, where X and Y are sets of data items.
- Clustering: identifying similar groups in the data.
- Regression analysis
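As a rough illustration of an association rule X → Y, the sketch below computes the two measures usually attached to such a rule, support and confidence. The slide does not define these measures, and the transactions and item names are invented for the example.

```python
# Hypothetical market-basket transactions (not from the slides).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Estimate of P(Y | X): support of X and Y together divided by support of X."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

# Rule {bread} -> {milk}
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

A real association rule miner would also have to search for the item sets worth testing; this sketch only scores one rule that is given by hand.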

Classic data mining tasks (contd)
- Sequential pattern mining: a sequential rule A → B says that event A will be immediately followed by event B with a certain confidence.
- Deviation detection: discovering the most significant changes in data.
- Data visualization: using graphical methods to show patterns in data.

Why is data mining important?
- Computerization of businesses produces huge amounts of data.
  - How to make best use of the data?
  - Knowledge discovered from data can be used for competitive advantage.
- Online businesses generate even larger data sets.
  - Online retailers (e.g., amazon.com) are largely driven by data mining.
  - Web search engines are information retrieval and data mining companies.

Why is data mining necessary?
- Make use of your data assets.
- There is a big gap between stored data and knowledge, and the transition won’t occur automatically.
- Many interesting things you want to find cannot be found using database queries:
  - “Find me people likely to buy my products”
  - “Who are likely to respond to my promotion?”
  - “Which movies should be recommended to each customer?”

Why data mining now?
- The data is abundant.
- Computing power is not an issue.
- Data mining tools are available.
- The competitive pressure is very strong.
  - Almost every company is doing (or has to do) it.
- Socio-political exigencies
  - Detecting terrorist activities
- New technologies
  - Streaming data, mobile computing, wireless networks

Related fields
- Data mining is a multidisciplinary field:
  - Machine learning/artificial intelligence
  - Statistics
  - Databases
  - Information retrieval
  - Visualization
  - Natural language processing
  - Game theory, etc.

Data mining applications
- Marketing: customer profiling and retention, identifying potential customers, market segmentation.
- Engineering: identifying causes of problems in products.
- Scientific data analysis: weather prediction, financial data analysis, image analysis, etc.
- Fraud detection: identifying credit card fraud, intrusion detection.
- Text and web: a huge number of applications …
- Bioinformatics: structure prediction, classification, microarray analysis, etc.
- Any application that involves a large amount of data …

Structural descriptions
- Example: if-then rules

Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Reduced               None
Young           Hypermetrope            No           Normal                Soft
Pre-presbyopic  Hypermetrope            No           Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
…               …                       …            …                     …

If tear production rate = reduced then recommendation = none
Otherwise, if age = young and astigmatic = no then recommendation = soft
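The two if-then rules on this slide can be read as an ordered rule list in which the first matching rule decides. A minimal sketch follows; the function name and the "undecided" fallback are assumptions, since the slide shows only the first two rules.

```python
def recommend_lenses(age, astigmatic, tear_rate):
    """Apply the slide's two rules in order; the first rule that matches wins."""
    if tear_rate == "reduced":
        return "none"
    if age == "young" and astigmatic == "no":
        return "soft"
    # The slide gives no further rules, so other cases are left open here.
    return "undecided"

print(recommend_lenses("young", "no", "reduced"))      # first rule fires
print(recommend_lenses("young", "no", "normal"))       # second rule fires
print(recommend_lenses("presbyopic", "yes", "normal")) # neither rule covers this row
```

The last call corresponds to the table row whose recommendation is "Hard"; covering it would need rules beyond the two shown.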

Classification vs. association rules
- Classification rule: predicts the value of a given attribute (the classification of an example)
  - If outlook = sunny and humidity = high then play = no
- Association rule: predicts the value of an arbitrary attribute (or combination)
  - If temperature = cool then humidity = normal
  - If humidity = normal and windy = false then play = yes
  - If outlook = sunny and play = no then humidity = high
  - If windy = false and play = no then outlook = sunny and humidity = high
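To make the distinction concrete, the sketch below scores one rule of each kind against a handful of records. The records are invented for illustration and `rule_counts` is a hypothetical helper; only the rule texts come from the slide.

```python
# Hypothetical weather records, made up for this example.
records = [
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": False, "play": "no"},
    {"outlook": "sunny", "temperature": "cool", "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "rainy", "temperature": "cool", "humidity": "normal", "windy": True, "play": "no"},
    {"outlook": "overcast", "temperature": "hot", "humidity": "high", "windy": False, "play": "yes"},
]

def rule_counts(antecedent, consequent, records):
    """Return (covered, correct): records matching the 'if' part,
    and how many of those also match the 'then' part."""
    covered = [r for r in records if all(r[k] == v for k, v in antecedent.items())]
    correct = [r for r in covered if all(r[k] == v for k, v in consequent.items())]
    return len(covered), len(correct)

# Classification rule: the consequent is always the class attribute "play".
classification = rule_counts({"outlook": "sunny", "humidity": "high"}, {"play": "no"}, records)

# Association rule: the consequent may be any attribute, here "humidity".
association = rule_counts({"temperature": "cool"}, {"humidity": "normal"}, records)
print(classification, association)
```

Both kinds of rule are evaluated the same way; what differs is which attributes are allowed on the right-hand side.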

A decision tree for this problem

Predicting CPU performance
- Example: 209 different computer configurations
- Attributes: cycle time in ns (MYCT), main memory in Kb (MMIN, MMAX), cache in Kb (CACH), channels (CHMIN, CHMAX), performance (PRP)
- Linear regression function: PRP = w0 + w1 MYCT + w2 MMIN + w3 MMAX + w4 CACH + w5 CHMIN + w6 CHMAX, where w0 … w6 are the fitted coefficients
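A linear regression function of this kind is a constant plus a weighted sum of the attributes. The sketch below shows only the prediction step; the weights, intercept and sample machine are placeholders chosen to keep the arithmetic easy, not the coefficients fitted on the 209 configurations.

```python
FEATURES = ["MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX"]

def predict_prp(config, weights, intercept):
    """Linear model: intercept plus the weighted sum of the six attributes."""
    return intercept + sum(weights[name] * config[name] for name in FEATURES)

# Placeholder values (assumptions, not the fitted model from the slide).
weights = {"MYCT": 0.0, "MMIN": 0.01, "MMAX": 0.005, "CACH": 0.5, "CHMIN": 0.0, "CHMAX": 1.0}
machine = {"MYCT": 100, "MMIN": 1000, "MMAX": 4000, "CACH": 16, "CHMIN": 1, "CHMAX": 8}

prp = predict_prp(machine, weights, intercept=-10.0)
print(prp)
```

Fitting the weights from data (for example by least squares) is a separate step that this sketch does not attempt.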

Spam filter software Given below are the % of occurrences of a few select words in spam and genuine messages: A decision list may be used to identify spam.
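A decision list is an ordered sequence of tests in which the first test that fires decides the label. The sketch below shows the idea for word percentages; the words, thresholds and default label are invented and do not come from the slide.

```python
# Hypothetical decision list: (word, threshold in % of words, label).
DECISION_LIST = [
    ("free", 1.0, "spam"),
    ("meeting", 0.5, "genuine"),
    ("money", 0.8, "spam"),
]
DEFAULT_LABEL = "genuine"  # assumed label when no test fires

def word_percentages(message):
    """Percentage of the message's words accounted for by each distinct word."""
    words = message.lower().split()
    if not words:
        return {}
    return {w: 100.0 * words.count(w) / len(words) for w in set(words)}

def classify(message):
    """Walk the list top to bottom; the first test that fires decides."""
    pct = word_percentages(message)
    for word, threshold, label in DECISION_LIST:
        if pct.get(word, 0.0) > threshold:
            return label
    return DEFAULT_LABEL

print(classify("free money now"))
print(classify("project meeting at noon"))
```

Because order matters, moving the "meeting" test above "free" would change how a message containing both words is labeled.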

Text mining
- Data mining on text
  - Due to a huge amount of online texts on the Web and other sources
  - Text contains a huge amount of information of any imaginable type!
  - A major direction and tremendous opportunity!
- Main topics
  - Text classification and clustering
  - Information retrieval
  - Information extraction
  - Opinion mining and summarization

Example: Opinion Mining
- The Web has dramatically changed the way that people express their opinions.
- People can post their opinions on almost anything at review sites, Internet forums, discussion groups, blogs, etc.
- Product reviews
- Benefits of review analysis
  - Potential customer: no need to read many reviews
  - Product manufacturer: market intelligence, product benchmarking

Feature Based Analysis & Summarization
- Extracting product features (called Opinion Features) that have been commented on by customers.
- Identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative.
- Summarizing and comparing results.
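The three steps above can be sketched with a deliberately simplified stand-in: a fixed keyword list in place of feature extraction and a tiny word lexicon in place of sentence-level opinion classification. The feature list, lexicon and sample reviews are assumptions for illustration, not the method the slides refer to.

```python
import re

# Hypothetical feature keywords and opinion lexicon.
FEATURES = {"picture": ["picture", "pictures"], "battery": ["battery"]}
POSITIVE = {"amazing", "great", "good"}
NEGATIVE = {"hazy", "blurry", "poor"}

def summarize(reviews):
    """Group opinion sentences by product feature and polarity."""
    summary = {f: {"positive": [], "negative": []} for f in FEATURES}
    for review in reviews:
        # Split into sentences after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", review.strip()):
            words = set(re.findall(r"[a-z']+", sentence.lower()))
            score = len(words & POSITIVE) - len(words & NEGATIVE)
            if score == 0:
                continue  # not treated as an opinion sentence
            polarity = "positive" if score > 0 else "negative"
            for feature, keywords in FEATURES.items():
                if words & set(keywords):
                    summary[feature][polarity].append(sentence)
    return summary

reviews = [
    "The pictures coming out of this camera are amazing. The battery is fine.",
    "The pictures come out hazy if your hands shake.",
]
result = summarize(reviews)
for feature, groups in result.items():
    print(feature, len(groups["positive"]), len(groups["negative"]))
```

Counting sentences per feature and polarity gives the kind of "Positive: 12 / Negative: 2" summary that a feature-based review digest reports.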

An example

GREAT Camera., Jun 3, 2004
Reviewer: jprice174 from Atlanta, Ga.
I did a lot of research last year before I bought this camera... It kinda hurt to leave behind my beloved nikon 35mm SLR, but I was going to Italy, and I needed something smaller, and digital. The pictures coming out of this camera are amazing. The 'auto' feature takes great pictures most of the time. And with digital, you're not wasting film if the picture doesn't come out. …

Summary:
Feature1: picture
Positive: 12
- The pictures coming out of this camera are amazing.
- Overall this is a good camera with a really good picture clarity.
- …
Negative: 2
- The pictures come out hazy if your hands shake even for a moment during the entire process of taking a picture.
- Focusing on a display rack about 20 feet away in a brightly lit room during day time, pictures produced by this camera were blurry and in a shade of orange.
Feature2: battery life
…

Visual Comparison
[Figure: summary of reviews of Digital camera 1, and comparison of reviews of Digital camera 1 and Digital camera 2, showing positive (+) and negative (_) opinions for the features Picture, Battery, Size, Weight and Zoom.]

Information retrieval – Ch 1 Boolean query
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Unstructured data in 1680 (Sec. 1.1)
- Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia.
- Why is that not the answer?
  - Slow (for large corpora)
  - NOT Calpurnia is non-trivial
  - Other operations (e.g., find the word Romans near countrymen) not feasible
  - Ranked retrieval (best documents to return): later lectures
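For comparison with the indexed approach, a grep-like linear scan can be sketched as follows. It checks whole documents rather than lines, and the document names and texts are invented stand-ins, not Shakespeare's plays.

```python
def linear_scan(docs, must_have, must_not_have):
    """Read every document in full and test the Boolean condition on its words."""
    hits = []
    for name, text in docs.items():
        words = set(text.lower().split())
        if all(t in words for t in must_have) and not any(t in words for t in must_not_have):
            hits.append(name)
    return hits

# Hypothetical miniature collection.
docs = {
    "play_a": "brutus speaks to caesar while calpurnia waits",
    "play_b": "brutus and caesar meet",
    "play_c": "caesar alone",
}
matches = linear_scan(docs, ["brutus", "caesar"], ["calpurnia"])
print(matches)
```

Every query re-reads the entire collection, which is the "slow for large corpora" objection; an index built once in advance avoids that repeated work.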

Term-document incidence (Sec. 1.1)
- Matrix entry: 1 if play contains word, 0 otherwise
- Query: Brutus AND Caesar BUT NOT Calpurnia
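A term-document incidence matrix of this kind can be built in a few lines. The sketch below uses whitespace tokenization and three invented "plays"; `incidence_matrix` is a hypothetical helper name.

```python
def incidence_matrix(docs):
    """Map each term to a 0/1 list with one entry per document,
    in the order given by the returned list of document names."""
    doc_names = list(docs)
    tokenized = [set(docs[name].lower().split()) for name in doc_names]
    vocabulary = sorted(set().union(*tokenized))
    matrix = {term: [int(term in tokens) for tokens in tokenized] for term in vocabulary}
    return doc_names, matrix

# Invented stand-in documents.
docs = {
    "play_a": "brutus caesar calpurnia",
    "play_b": "brutus caesar",
    "play_c": "caesar",
}
doc_names, matrix = incidence_matrix(docs)
for term, row in matrix.items():
    print(term, row)
```

Each row is the 0/1 vector for one term, which is the form the Boolean query operates on.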

Incidence vectors (Sec. 1.1)
- So we have a 0/1 vector for each term.
- To answer the query, take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them.
- vector(Brutus) AND vector(Caesar) AND complement(vector(Calpurnia)) = answer vector
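The bitwise AND over incidence vectors can be sketched with Python integers as bit vectors. The play list and 0/1 values below are an assumption: they follow the textbook's Chapter 1 example as recalled, are not given in this transcript, and should be checked against the book.

```python
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest", "Hamlet", "Othello", "Macbeth"]

# One bit per play; the leftmost bit corresponds to the first play in PLAYS.
VECTORS = {"brutus": 0b110100, "caesar": 0b110111, "calpurnia": 0b010000}

def boolean_and_not(include, exclude, n_docs=len(PLAYS)):
    """AND the vectors of the included terms with the complements of the excluded ones."""
    mask = (1 << n_docs) - 1  # keeps each complement within n_docs bits
    result = mask
    for term in include:
        result &= VECTORS[term]
    for term in exclude:
        result &= ~VECTORS[term] & mask
    return result

def matching_plays(bits):
    """Translate the answer vector back into play titles."""
    n = len(PLAYS)
    return [PLAYS[i] for i in range(n) if bits >> (n - 1 - i) & 1]

answer = boolean_and_not(["brutus", "caesar"], ["calpurnia"])
print(format(answer, "06b"), matching_plays(answer))
```

The mask matters because Python integers are unbounded: without it, complementing a vector would yield a negative number rather than a six-bit pattern.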