© 2006 Intelliware Development Inc. An Introduction to Data Mining Concepts Tim Eapen and B.C. Holmes Intelliware Development.

Presentation transcript:


Agenda
Introduction to data mining; the typical steps; what we were trying to accomplish; Bayesian categorization; an example; data clustering; k-means clustering; interesting conclusions; other stuff; Java and data mining.

What is Data Mining?
Data mining is the discovery of useful information from data. Data mining touches on many of the same problems as machine learning and artificial intelligence. This is a huge topic, and we can't hope to do more than just touch on it today.

Some Crazy Examples
Here are some interesting examples of useful information gleaned from data. Diapers and beer: people who buy diapers are also likely to buy beer. Put potato chips in between them and the sales of all three items go up. Google AdWords: "digital cameras" is worth more than "digital camera". Airline traveler behaviours. Amazon.ca: "other people who bought this DVD liked such-and-such".

The Data Mining Process
[Diagram of the steps: Gather the Data, Cleanse, Extract the Good Stuff, Identify Patterns, Vet the Results]

What We Were Trying to Accomplish
Tim, Tom and I were working on the WhatAmITaking.com project. WhatAmITaking.com is a wiki / repository that collects information about medications. The data is all available from public sources, including a government drug reference database, Wikipedia, open-license publications available through the (U.S.) National Institutes of Health, and news articles. Concept: we want to use data mining techniques on publications and news. First steps: we wanted to try to emulate Google News-style categorization and topic correlation.

But Along the Way…
We learned some interesting things about the field of data mining.

News: Obtaining News
How do we get news? We need to build a bot or a web crawler that goes out to a large number of web sites and GETs the interesting content. Nice addition: look for links to other pieces of news. Some complications: there's a "good Internet citizen" standard (the robots.txt file standard) that should be respected. If the site has a robots.txt file that says "bots keep out", you shouldn't crawl the site. How do you determine what's a story and what's not? That's a hard problem: too big a topic for this presentation.
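As a rough illustration of respecting robots.txt, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the bot name `NewsFinderBot` and the example.com URLs are made up for the example; the talk does not say how the authors' crawler handled this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would fetch it from the
# site's /robots.txt URL (for example with set_url() followed by read()).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "NewsFinderBot" and the URLs below are placeholders for illustration.
print(allowed(ROBOTS_TXT, "NewsFinderBot", "http://example.com/news/story.html"))
print(allowed(ROBOTS_TXT, "NewsFinderBot", "http://example.com/private/draft.html"))
```

The crawler would call a check like `allowed` before each GET and skip any URL for which it returns False.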

Data Cleansing
You would not believe how bad some news sites are with respect to their content: poor formatting, bad encoding problems, and clear problems related to converting the content from another format (e.g. Word). Two interesting word-related cleansing problems: the US spelling versus British spelling problem, and root words. Some of it looks deliberately obfuscated.

Extracting Interesting Stuff
Your typical web page news article has a lot of extra stuff on it: banner ads, menus, links to related stories, navigation widgets, etc. Almost every discussion of word manipulation problems talks about stop words: words that are so common they provide no significant meaning in the analysis of text, such as "the", "he", "she", "said", "it", etc.
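A minimal sketch of stop-word filtering in Python. The stop-word list here is a tiny illustrative one (it extends the slide's examples with a few assumed extras), and the tokenizer is a deliberately simple regular expression.

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "he", "she", "said", "it", "a", "an", "and", "of", "to", "in"}


def interesting_words(text: str) -> list[str]:
    """Lower-case the text, split it into words, and drop the stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


print(interesting_words("He said the puck hit the net"))
```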

Two Interesting Topics
Categorization: I know what the groups are, and I want to assign a group to any particular data point. E.g.: news is categorized as Sports, Health, Finance, World News, National, etc. Data clustering: I have a lot of data, and I want to find some mechanism for finding meaningful groups. E.g.: news events.

Bayesian Analysis
A Delightful Example

The Problem
Given a random news article, how can we determine what category it belongs to? The categories: HEALTH, SPORTS, TECHNOLOGY, BUSINESS, NEWS, ENTERTAINMENT.

In Light of New Evidence…
Do some detective work! Start off with a hypothesis. Collect evidence. The evidence will be either consistent or inconsistent with a given hypothesis. As more evidence is accumulated, the degree of belief in the initial hypothesis will change. A hypothesis with a very high degree of belief may be accepted as true. Likewise, a hypothesis with a very low degree of belief may be considered false. How do we measure this degree of belief?

Bayes' Theorem
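The formulas on these slides did not survive in the transcript. For reference, the standard statement of Bayes' Theorem, written with H for a hypothesis and E for the evidence (the letters the cookie example uses later), is:

```latex
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)},
\qquad
P(E) = \sum_{i} P(E \mid H_i)\, P(H_i)
```

Here P(H) is the prior degree of belief in the hypothesis, P(E | H) is how likely the evidence is if the hypothesis holds, and P(H | E) is the updated degree of belief. The sum runs over the competing hypotheses H_i, assumed to be mutually exclusive and exhaustive.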

An Edible Example
Bowl #1: 10 chocolate chip cookies and 30 oatmeal cookies. Bowl #2: 20 chocolate chip cookies and 20 oatmeal cookies.

State a Hypothesis
Little Johnny picks a bowl at random. Little Johnny picks a cookie at random. The cookie turns out to be an oatmeal cookie. How probable is it that Johnny picked the cookie out of bowl #1?

Consider the Evidence
Probability of selecting an oatmeal cookie given Johnny chooses bowl #1: P(E|H1) = 30/40 = 0.75. Probability of selecting an oatmeal cookie given Johnny chooses bowl #2: P(E|H2) = 20/40 = 0.5.

An Edible Example
Bayes' Theorem gives the following result: P(H1|E) = P(E|H1) P(H1) / (P(E|H1) P(H1) + P(E|H2) P(H2)) = (0.75 × 0.5) / (0.75 × 0.5 + 0.5 × 0.5) = 0.6. Notice that initially the prior probability that the cookie came from bowl #1 was P(H1) = 0.5. In light of evidence E, the probability that the cookie came from bowl #1 increased to P(H1|E) = 0.6.
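The cookie arithmetic can be checked with a few lines of Python. This sketch assumes bowl #1 holds 10 chocolate chip and 30 oatmeal cookies and bowl #2 holds 20 of each, which is the assignment consistent with the stated result of 0.6. Exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

# Prior: Johnny picks either bowl with equal probability.
prior = {"bowl1": Fraction(1, 2), "bowl2": Fraction(1, 2)}

# Likelihood of drawing an oatmeal cookie from each bowl.
likelihood = {"bowl1": Fraction(30, 40), "bowl2": Fraction(20, 40)}

# P(E): total probability of drawing an oatmeal cookie.
evidence = sum(prior[h] * likelihood[h] for h in prior)

# Bayes' Theorem: P(H|E) = P(E|H) P(H) / P(E).
posterior = {h: prior[h] * likelihood[h] / evidence for h in prior}

print(float(posterior["bowl1"]))  # 0.6
```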

Back to Our Problem…
Given a random news article, how can we determine what category it belongs to? OF COURSE WE CAN!!! USE BAYESIAN ANALYSIS

Naïve Bayes Classifier
To categorize a news article, use a Naïve Bayes classifier: a simple probabilistic classifier based on some naïve independence assumptions. It can be trained. Naïve probabilistic model: the probability model for a classifier is conditional. Given a news article with n words, let C represent a category of news (e.g. Health), and let Fi represent the frequency with which the i-th word appears in articles from category C.

Naïve Probabilistic Model
We can express our probability model using Bayes' Theorem. Solving this is difficult, so we make some simplifying assumptions: the denominator is constant, and we naively assume that each feature (word frequency) Fi is conditionally independent of every other feature Fj (i ≠ j).

Naïve Probabilistic Model
Problems with our assumptions: words have context. Assuming that the frequency (Fi) of word i is independent of the frequency (Fj) of word j is untrue. For example, the words "War" and "Afghanistan" are more likely to appear in the same article than the words "War" and "Tuna". Benefits of our assumptions: it simplifies our math and our algorithm.

Naïve Probabilistic Model
We can approximate the probability that an article belongs to category C as the prior probability that the article belongs to that category multiplied by the product of the individual word frequencies for that category.
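The slide formulas are not preserved in the transcript. Written out in the notation of the slides (C for the category, F_i for the frequency of the i-th word), the standard Naïve Bayes model that matches the description above is:

```latex
p(C \mid F_1, \dots, F_n)
  = \frac{p(C)\, p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)}
  \;\propto\; p(C) \prod_{i=1}^{n} p(F_i \mid C)
```

The first step is Bayes' Theorem. The second drops the denominator, which is the same for every category, and applies the conditional-independence assumption to turn the joint likelihood into a product of per-word terms.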

A Simple Algorithm for Classifying an Article
Given a random article with n words, to classify the article into one of several possible categories, do the following. For each possible category, calculate the probability that the article belongs to that category by considering the prior probability and the word frequencies. Then classify the article as belonging to the category with the highest probability.

A Simple Example
Consider this very simple article: … hockey puck. For simplicity, consider that there are only two possible categories: Sports and News.

A Simple Example …
Consider the following word frequencies:

Word    Category  Frequency
hockey  Sports    98%
puck    Sports    96%
hockey  News      2%
puck    News      4%

1. Let C = Sports: p(C) = 0.5, p(F1|C) = 0.98 and p(F2|C) = 0.96, so p(C|F1,F2) ∝ 0.5 × 0.98 × 0.96 = 0.4704
2. Let C = News: p(C) = 0.5, p(F1|C) = 0.02 and p(F2|C) = 0.04, so p(C|F1,F2) ∝ 0.5 × 0.02 × 0.04 = 0.0004

Sports has the higher score, so the article is classified as Sports. (Because the constant denominator is dropped, the two scores do not add up to 1.)
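The hockey/puck calculation can be sketched in Python as follows. The frequency table and the 0.5 priors come from the slide; the function names are made up, and this is not the Classifier4J implementation. It also assumes every word of the article appears in the table (an unknown word would raise a KeyError).

```python
import math

# Word frequencies per category, taken from the slide's table.
WORD_FREQ = {
    "Sports": {"hockey": 0.98, "puck": 0.96},
    "News": {"hockey": 0.02, "puck": 0.04},
}

# Prior probability of each category (0.5 each, as on the slide).
PRIOR = {"Sports": 0.5, "News": 0.5}


def score(words, category):
    """Prior times the product of the word frequencies for this category."""
    return PRIOR[category] * math.prod(WORD_FREQ[category][w] for w in words)


def classify(words):
    """Return the best-scoring category and all (unnormalized) scores."""
    scores = {c: score(words, c) for c in PRIOR}
    return max(scores, key=scores.get), scores


best, scores = classify(["hockey", "puck"])
print(best, scores)
```

With long articles the product of many small numbers can underflow; summing logarithms of the frequencies instead of multiplying them is a common workaround.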

Gathering the Evidence
So where do the frequencies we use come from? To perform Bayesian analysis, it is important to have a large corpus of articles. This corpus is what we use to determine the word frequencies used in categorizing a given article. This corpus would grow over time. This corpus is what we use to train our Bayesian classifier.
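One way to turn a corpus into priors and word frequencies is sketched below. The toy corpus is invented, and "frequency" is taken to mean the fraction of a category's articles that contain the word; the slides do not pin down the exact definition, so treat that as an assumption.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (category, article text) pairs.
CORPUS = [
    ("Sports", "hockey puck goal"),
    ("Sports", "hockey season opener"),
    ("News", "election results tonight"),
    ("News", "hockey arena funding vote"),
]


def train(corpus):
    """Return (priors, freqs) estimated from labelled articles."""
    article_counts = Counter()          # articles per category
    word_counts = defaultdict(Counter)  # articles containing each word, per category
    for category, text in corpus:
        article_counts[category] += 1
        for word in set(text.lower().split()):
            word_counts[category][word] += 1

    total = sum(article_counts.values())
    priors = {c: n / total for c, n in article_counts.items()}
    freqs = {
        c: {w: n / article_counts[c] for w, n in words.items()}
        for c, words in word_counts.items()
    }
    return priors, freqs


priors, freqs = train(CORPUS)
print(priors["Sports"], freqs["Sports"]["hockey"], freqs["News"]["hockey"])
```

As the corpus grows, calling `train` again (or updating the counters incrementally) refreshes the frequencies.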

What We Actually Did
The first step was to gather a corpus of articles. This corpus would be used to train our Bayesian classifier. We initially started by gathering 5000 articles. The number of articles in the corpus would grow over time. We built a simple little NewsFinder utility that would regularly go to and gather articles. Google has seven categories of news: world, Canada, health, business, science, sports and entertainment.

Bayesian Classifier
We started with an open-source package from SourceForge called Classifier4J. We created a SimpleClassifier. This classifier has an instance of our Bayesian classifier, which does all the Bayesian analysis for us. The classifier also has a WordDataSource: a simple map that correlates a frequency with a given word in a given category. We used our corpus of articles to train our classifier (fill up our word data source).

Issues to Consider
Making sure that the corpus was clean: this was part of cleansing the data as we gathered it. We had to actually tweak Classifier4J because the algorithm wasn't correct.

Clustering
What is a Cluster, anyway?

Data Clustering
Data clustering is the process of taking points in some n-dimensional space and grouping them into understandable groups. That's kind of math-y sounding. How does that relate to news? This is the fundamental question: deciding on good measures is the key success criterion. I want to defer the answer for now. There are two fundamental approaches. Centroid: guess certain centres of clusters, and iteratively refine them. Hierarchical: assume that each point is a cluster, and iteratively merge them until good clusters emerge.

Another Key Consideration
The field of data mining spends a lot of time thinking about one special problem: often, there's too much data to fit into memory, so any algorithm that tries to cluster information must cope with data that does not fit into memory. I'm not going to say too much about this problem.

k-Means Algorithm
One of the fundamental centroid-based algorithms is called the k-means algorithm. Assume you have a number of points of data and you want to cluster these points into some number of clusters (k). You don't really need to know what the clusters represent, just some arbitrary number of clusters.

Step One: Pick k=3 objects

Step Two: Create initial groupings. Groups are based on distance from the initial points.

Step Three: Find the centres/means

Step Four: Re-jig the clusters

Repeat until the clusters don't change
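The steps above can be sketched in Python roughly as follows. The sample points, the random seed and the choice of Euclidean distance are assumptions made for the example. Like any basic k-means, the result depends on which starting objects are picked, so the clusters found are not guaranteed to be the best possible.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Basic k-means on a list of coordinate tuples.

    Returns (centres, assignment), where assignment[i] is the index of the
    cluster that points[i] belongs to.
    """
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # Step one: pick k objects
    assignment = None
    for _ in range(max_iter):
        # Steps two and four: group each point with its nearest centre.
        nearest = [
            min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points
        ]
        if nearest == assignment:  # Repeat until the clusters don't change
            break
        assignment = nearest
        # Step three: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centre
                centres[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centres, assignment


# Made-up 2-D points forming three loose groups.
POINTS = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 8.5), (8.5, 9), (1, 9), (0.5, 8), (1.5, 8.5)]

centres, assignment = kmeans(POINTS, k=3)
for c, centre in enumerate(centres):
    print(centre, [p for p, a in zip(POINTS, assignment) if a == c])
```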

But How Do You Decide on k?
A key question to ask is: how many clusters is the right number? Try a bunch of different values, and map distance.
[Chart: distance for k = 1, 2, 3, 4, 5]
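A sketch of "try a bunch of values and map distance" in Python. The slide does not say which distance is plotted; this example assumes the sum of squared distances from each point to its cluster centre, and the sample points are invented. A common way to read such a chart is to look for the value of k beyond which the distance stops dropping sharply.

```python
import math
import random


def kmeans_cost(points, k, seed=0, max_iter=100):
    """Run a basic k-means and return the sum of squared distances
    from each point to the centre of its cluster."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        nearest = [
            min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points
        ]
        if nearest == assignment:
            break
        assignment = nearest
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centres[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return sum(math.dist(p, centres[a]) ** 2 for p, a in zip(points, assignment))


# Made-up 2-D points forming three loose groups.
POINTS = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 8.5), (8.5, 9), (1, 9), (0.5, 8), (1.5, 8.5)]

# Try k = 1..5 and tabulate the distance measure for each.
costs = {k: kmeans_cost(POINTS, k) for k in range(1, 6)}
for k, cost in costs.items():
    print(k, round(cost, 2))
```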

Converting from Words to Points
One idea: there are about 100,000,000 English words. Consider an n-dimensional space, where n = 100,000,000. The frequency of a particular word in an article can be considered a distance in one dimension of the n-dimensional space.
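A small Python sketch of the idea, using a three-word vocabulary instead of millions of dimensions. The two "articles" are invented, raw word counts stand in for frequencies, and Euclidean distance is an assumed choice of measure.

```python
import math
from collections import Counter


def word_vector(text, vocabulary):
    """Map an article to a point: one coordinate per vocabulary word,
    whose value is how often that word occurs in the article."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]


article_a = "hockey puck hockey"
article_b = "hockey election"

# The vocabulary defines the dimensions of the space.
vocabulary = sorted(set(article_a.split()) | set(article_b.split()))

point_a = word_vector(article_a, vocabulary)
point_b = word_vector(article_b, vocabulary)
distance = math.dist(point_a, point_b)

print(vocabulary, point_a, point_b, distance)
```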

Unintuitive Conclusions
When dealing with points in n-dimensional space, where n is very large (say > 100), most pairs of points are about the same distance apart: nearly every distance is close to the average distance.
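This effect is easy to see with random data. The Python sketch below draws random points in a unit cube (an assumption made purely for illustration; real article vectors are not uniformly random) and reports how much the pairwise distances vary relative to their average. The ratio shrinks as the number of dimensions grows.

```python
import math
import random
import statistics


def relative_spread(dim, n_points=50, seed=0):
    """Standard deviation of all pairwise distances divided by their mean,
    for n_points random points in the dim-dimensional unit cube."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return statistics.pstdev(dists) / statistics.mean(dists)


for dim in (2, 100, 1000):
    print(dim, round(relative_spread(dim), 3))
```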

Determining a Good Measuring Stick
So how do you deal with the problem of large dimensional spaces? Try to determine a smaller set of interesting dimensions. Try this: pick an article. In that article, try to find 25 interesting words. What's interesting? Try 10 of the most common words in the article (excluding stop words). Pick 10 of the most significant classification words (e.g. certain words are strongly correlated with health articles; find the 10 most strongly correlated that also have a high frequency of occurrence in the article). Pick 5 unusual words. Now you've got a measuring stick. Measure other articles according to this measuring stick, and figure out the distance.
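A partial Python sketch of the measuring-stick idea. It covers only the "most common words, excluding stop words" part; the classification words and unusual words would need a trained classifier and a reference corpus, so they are left out. The stop-word list, the sample articles and the stick size of 3 are all invented for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list.
STOP_WORDS = {"the", "he", "she", "said", "it", "a", "is", "in", "was", "against"}


def tokens(text):
    """Lower-cased words of the text with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def measuring_stick(article, size=10):
    """The most common non-stop words of the reference article."""
    return [w for w, _ in Counter(tokens(article)).most_common(size)]


def measure(article, stick):
    """Coordinates of an article along the measuring stick's dimensions."""
    counts = Counter(tokens(article))
    return [counts[w] for w in stick]


reference = "The flu vaccine is in short supply. Doctors said the vaccine protects against flu."
other = "The hockey team said the season opener was a sellout."

stick = measuring_stick(reference, size=3)
distance = math.dist(measure(reference, stick), measure(other, stick))
print(stick, distance)
```

Articles that share many of the stick's words with the reference article end up close to it; unrelated articles, like the hockey one here, end up far away.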

Java and Data Mining
There are a few (but not many) Java initiatives relating to data mining. Bayesian classifier: Classifier4J. We used this initially, and discovered that the algorithm wasn't correctly implemented. Weka: created by a number of data mining professors. The same group has published a data mining book with some references to Weka (but it's a heavy math book). YALE (Yet Another Learning Environment). There's a Java Community Process effort around coming up with a consistent Java API for data mining: JSR 73 and JSR 247 (javax.datamining).

Other Topics (Use Wikipedia)
w-shingling. Concept mining.

Crazy Ideas that Might Make Interesting Experiments
Could you perform data mining on code? What if you parsed camel-case variable and class names and performed text clustering on classes? Could you find interesting relationships between classes? In different projects? What could you learn if you tried to perform clustering on a bunch of open source web frameworks? How much similarity and/or difference do they have?
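As a starting point for such an experiment, here is a hedged Python sketch that splits camel-case identifiers into words. The regular expression is one possible choice, and the class names are simply ones mentioned earlier in the talk. The resulting words could then be fed into the same word-frequency machinery used for articles.

```python
import re


def split_camel_case(name):
    """Split an identifier such as 'WordDataSource' into lower-case words.

    Runs of capitals (acronyms) are kept together, so 'parseHTMLPage'
    becomes ['parse', 'html', 'page'].
    """
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


for name in ("WordDataSource", "SimpleClassifier", "NewsFinder", "parseHTMLPage"):
    print(name, split_camel_case(name))
```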