Data Mining Chapter 9 - Moving on: Applications and Beyond
Kirk Scott

So-called machine learning is a broad topic with many ramifications, and data mining is just an applied subset of this overall field. The book says the algorithms aren't "abstruse or complicated," but they're also not "completely obvious and trivial."

The book identifies the challenge of the future as lying in the realm of applications. In this sense, data mining has something in common with database management systems: for some people, the interesting part is figuring out how to apply the techniques to a given problem.

The book notes that the source of these applications is people working in the problem domains. People specializing in data mining will continue to develop new algorithms, but this doesn't happen in a vacuum; much of the real, interesting work will come out of applications.

9.1 Applying Data Mining
The book lists the "Top 10" data mining algorithms. These are given in Table 9.1, shown on the following overhead. Recall that number 1, C4.5, was for decision tree induction. Notice also that the majority of these algorithms are for classification.

Progress in Data Mining
There is a pitfall in applying algorithms to data sets, comparing results, and drawing broad conclusions about what is best in certain problem domains: reasoning from the specific to the general, without further information, is not necessarily correct.

Even statistically significant differences in outcomes may not be important in practice. Quite often, simple methods get reasonably good results, while complicated methods have their own shortcomings, including computational cost.
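
As a rough illustration of how well simple methods can do, here is a minimal sketch (in Python with scikit-learn, not the book's Weka toolkit) comparing a one-level decision stump, roughly in the spirit of 1R, against a fully grown tree on a bundled sample dataset. The dataset and learners are arbitrary choices for illustration, and the exact numbers will vary; the point is only that the gap is often smaller than you might expect.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "one-level stump": DecisionTreeClassifier(max_depth=1, random_state=0),
    "fully grown tree": DecisionTreeClassifier(random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validated accuracy
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```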

Something to always keep in mind: there may just be a lot of noise in the data, or there may just be a lot of statistical variation. There are limits on the ability to draw inferences from data.

Also, training sets are, by definition, historical; they can't perfectly reflect new data in a changing world.

Another point to consider: recall that some classification schemes give probabilities that an instance falls into a class. However, in reality, classification categories might not be mutually exclusive; there may be data points in the training set which are partially one and partially another.

However, the training set is considered to have instances that are rigidly classified as one or the other; the training set doesn't reflect probability. Thus, the training set on which you are basing your inferences already contains inaccuracies. You might think of this as conceptual noise resulting from forcing an instance into one class.

The book admits that tweaking (picking parameters) can affect performance. As a result, small empirical differences in data mining results do not necessarily reflect actual differences in the quality of the algorithms; in application, one might have been more successfully tweaked than another.

Another interesting point on the Occam's Razor/Epicurus divide: complicated methods may be harder to criticize than simple ones, but that fact alone doesn't make them better. The authors still favor "all else being equal, simpler is better."

9.2 Learning from Massive Datasets
The basic constraints are computational space and time. Data stream methods are unaffected by this (more later). In other types of algorithms, implementation techniques like hashing, caching, indexing, and other data structures may be critical to practicality.

Massive data sets typically imply large numbers of instances. Any algorithm with time complexity greater than linear will eventually be swamped by massive data. Depending on the algorithm, too many attributes may also render it impractical, because the computational complexity grows with the dimension of the problem space.

General ways to adapt to large data sets:
- Train on a sample or subset of the data only.
- Do a parallel implementation of the algorithm in a multiprocessor environment.
- Invent new algorithms...

Training on samples or subsets can give you as good a result as training on the whole set. The law of diminishing returns says that after a certain point, more instances don't give significant increases in accuracy.

Training on Samples
There are two ways of looking at this:
- If a problem is simple, a small data set may encapsulate all there is to know about it.
- If the problem is complex but the data mining algorithm is simple, the algorithm may max out on its predictive power no matter how many training instances there are.
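
A minimal sketch of the diminishing-returns effect, assuming an arbitrary bundled dataset and learner chosen only for illustration: train the same learner on progressively larger subsets of the training data and watch the test accuracy typically level off.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n in (50, 100, 200, 400, 800, len(X_train)):
    # Fit on the first n training instances only, then score on the held-out test set.
    model = DecisionTreeClassifier(random_state=0).fit(X_train[:n], y_train[:n])
    print(f"{n:5d} training instances -> test accuracy {model.score(X_test, y_test):.3f}")
```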

Parallelization
Algorithms like nearest neighbor, tree formation, etc. can be parallelized. Figuring out how to parallelize is only part of the problem, though: parallelization is no defense against combinatorial explosion. If the complexity is exponential but the growth in the number of processors is linear, you eventually lose.
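
A minimal sketch of the parallelization idea for nearest neighbor, on a made-up dataset: split the training instances across worker processes, let each find its local nearest neighbour, and reduce the partial results. This is illustrative only; it speeds up the linear scan but does nothing about the underlying complexity.

```python
import math
from concurrent.futures import ProcessPoolExecutor

def nearest_in_chunk(args):
    """Return (distance, label) of the nearest training instance within one chunk."""
    chunk, query = args
    return min((math.dist(x, query), label) for x, label in chunk)

def parallel_1nn(training_data, query, workers=4):
    """Split the training set across processes and reduce the partial winners."""
    size = math.ceil(len(training_data) / workers)
    chunks = [training_data[i:i + size] for i in range(0, len(training_data), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(nearest_in_chunk, [(chunk, query) for chunk in chunks])
    return min(partials)[1]          # label of the overall nearest neighbour

if __name__ == "__main__":
    # Toy labelled data: two numeric attributes, an even/odd class label.
    data = [((float(i), float(i % 7)), "even" if i % 2 == 0 else "odd")
            for i in range(10_000)]
    print(parallel_1nn(data, (42.2, 0.1)))
```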

New Algorithms
This is where research comes in, and in general the sky is the limit. In some situations (tree building, for example) there is a provable floor on complexity. Even here, new methods may be simpler and still approximate the solutions given by deterministic methods.

A side note on this: virtually everything we've looked at has been a heuristic anyway, including greedy tree formation. Exhaustive search would give genuinely optimal results; everything else is essentially an approximation approach.

Another aspect to improving algorithm performance: a lot of the data in a set, both instances and attributes, may be redundant. Simply finding ways to throw out useless data may improve performance. This idea will recur in the next section.

9.3 Data Stream Learning
For data streams, the overriding assumption is that each instance will be examined at most one time. The model of the data is updated incrementally based on the incoming instance, and then the instance is discarded.
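
A minimal sketch of that one-pass discipline, using a running mean and variance (Welford's method) as the "model": each value updates a constant-size summary and is then discarded. The sensor_stream generator is a made-up stand-in for a real feed.

```python
import random

def sensor_stream(n=100_000):
    """Stand-in for an unbounded stream of sensor readings."""
    for _ in range(n):
        yield random.gauss(20.0, 2.5)

class RunningStats:
    """Constant-memory summary of a numeric stream (Welford's method)."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0   # m2 = sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n if self.n > 1 else 0.0

stats = RunningStats()
for reading in sensor_stream():
    stats.update(reading)       # examine the instance once, keep only the summary

print(stats.n, round(stats.mean, 3), round(stats.variance, 3))
```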

An example of an application area is sensor readings: as long as the sensor is active, the readings just keep on coming. It seems that things like Web transactions might be another example.

Both time and space are issues. Discarding instances saves space. The examination of each instance also has to be bounded in time, with an average rate of examination no less than the arrival rate of instances. This constraint rules out major changes or reorganization of the model.

Modifications to the model resulting from an instance either have to be counted within the examination time, or they have to occur infrequently enough that they are averaged out over the examination time of many instances.

You may come up with ways of throwing out unneeded data, but the goal is not to throw out data simply because you couldn't handle it fast enough. (Although stay tuned for a later comment on throwing out data.)

Algorithms That Are Directly Suited to Data Streams
- Naïve Bayes
- Perceptrons (a streaming update is sketched after this list)
- Multilayer neural networks
- Rules with exceptions (although you can't simply accumulate unlimited exceptions)
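
For example, the perceptron in the list above is naturally incremental: each instance triggers at most one weight update and can then be thrown away. A minimal sketch, with a tiny made-up stream of labelled instances:

```python
def perceptron_update(weights, bias, x, y, rate=0.1):
    """One streaming update: predict, correct the weights only on a mistake,
    then the instance can be discarded."""
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    prediction = 1 if activation >= 0 else -1
    if prediction != y:                       # mistake-driven update
        weights = [w + rate * y * xi for w, xi in zip(weights, x)]
        bias += rate * y
    return weights, bias

weights, bias = [0.0, 0.0], 0.0
stream = [([2.0, 1.0], 1), ([-1.5, -0.5], -1), ([1.0, 3.0], 1), ([-2.0, -1.0], -1)]
for x, y in stream:                           # each instance seen exactly once
    weights, bias = perceptron_update(weights, bias, x, y)
print(weights, bias)
```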

The book notes that other kinds of algorithms can be adapted to data streams, and it spends some time explaining how this might be done with trees. The details are obscure and not important at this late stage in the semester.

The key insight about throwing out instances is this: will you lose important data if you throw out instances? In an unending data stream, if information or a pattern is significant, it will recur. So in the long run, throwing out an instance doesn't hurt the model that results from the algorithm.

9.4 Incorporating Domain Knowledge
The overall topic here is metadata, that is, data about the data. How to put this to use is an open area of research. There can be various kinds of relationships between attributes, including semantic, causal, and functional relationships.

Semantic relationships can be summarized in this way: if one attribute is included in a rule, another should also be included. Informally, in the problem domain, this means that these two attributes aren't (fully) meaningful without each other. Somehow this could be included as a condition in a data mining scheme.

The idea of causality also comes from the problem domain. The point is that if causality exists, the data mining scheme should be able to detect it. This causality may run through a chain of multiple attributes: A -> B -> C -> ...

Functional dependency in data mining refers to the same concept as in database design. The point with respect to metadata is that if a functional dependency is already known, it's not productive to have data mining "discover" it.

On the one hand, there may be ways of applying data mining to normalization. On the other hand, if that's not your purpose, functional dependencies that are mined will tend to have high confidence and most likely high support. These associations will end up outweighing other, new associations that an algorithm might mine.
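
One simple way to act on this metadata is to filter out mined rules that merely restate a known dependency before presenting results. A minimal sketch with a made-up rule representation (antecedent attribute set, consequent attribute, confidence) and a made-up known dependency:

```python
# Metadata: known functional dependency zip_code -> city.
known_dependencies = {frozenset({"zip_code"}): "city"}

mined_rules = [
    (frozenset({"zip_code"}), "city", 0.99),
    (frozenset({"age", "income"}), "credit_risk", 0.71),
]

def is_known(rule):
    """True if the rule just restates a dependency we already knew."""
    antecedent, consequent, _ = rule
    return any(lhs <= antecedent and rhs == consequent
               for lhs, rhs in known_dependencies.items())

novel_rules = [rule for rule in mined_rules if not is_known(rule)]
print(novel_rules)   # only the rule that is not a restatement of the dependency
```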

How should metadata be represented? A straightforward approach is to list what you already know about the data set in the form of rules. Logical deduction schemes can then produce other rules resulting from the ones you already know.

The data mining scheme works with instances to produce new rules that you didn't know before. The bodies of rules, merged together, give the sum total of knowledge gleaned about the problem.
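
A minimal sketch of the deduction step: known rules plus known facts are forward-chained to a fixed point, so the deduced conclusions can sit alongside whatever the mining scheme discovers. The rule atoms here are made up for illustration.

```python
# Rules as (frozenset of premises, conclusion); facts as a set of atoms.
rules = [
    (frozenset({"urban_branch"}), "high_transaction_volume"),
    (frozenset({"high_transaction_volume", "weekend"}), "extra_staffing"),
]
facts = {"urban_branch", "weekend"}

changed = True
while changed:                      # forward chaining until nothing new is deduced
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)
            changed = True

print(facts)   # now also contains the deduced conclusions
```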

9.5 Text Mining
Compare text mining to standard data mining. In standard data mining, data sets roughly parallel database tables: there are identifiable instances and well-defined attributes. Text is emphatically not structured in this way.

An interesting comparison: data mining is supposed to find information about data where that information was not known. By definition, text is different: in text, the information is out in the open, in the form of language. It is simply not in a form suitable for easy computerized analysis.

Data mining can be said to have as its goal the acquisition of "actionable" information. Based on a training set you can classify or cluster future instances, for example, and in a derivative way you can make decisions that earn money, etc. Another goal of data mining is to develop a data model; again, this is out in the open with text.

There are several applications of text mining:
- Text summarization, document classification, and clustering
- Language identification and authorship ascription
- Assigning key descriptive phrases to documents
- Metadata, entity, and information extraction

Document classification can be done based on the (count of) occurrences of words in the document. There is a feature extraction aspect to this problem: very frequent words don't help classify, and very infrequent words don't either, but there is still an overwhelming number of words in the middle that have value.
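
A minimal sketch of that feature-extraction step, using scikit-learn's CountVectorizer (a tooling choice for illustration, not something from the book): words appearing in more than 80% of the toy documents, or in fewer than two of them, are dropped before counting.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the algorithm builds a decision tree from the training data",
    "the training data is split and a tree is grown on each split",
    "the match ended in a draw after extra time",
    "extra time could not separate the two teams in the match",
]

# max_df drops words that are too frequent to discriminate;
# min_df drops words too rare to generalize from.
vectorizer = CountVectorizer(max_df=0.8, min_df=2)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray())
```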

More complex methods step up from counting words alone. Context, word order, grammatical constructs, etc., all affect meaning. At the very least, phrases might be mined instead of words, and natural language processing, syntax, and semantics may come into play.

Document classification may be done with predefined classes; document clustering doesn't have predefined classes. Mining techniques can also be used to identify the language of a document: n-grams (n-letter sequences) correlate highly with different languages, and n = 3 is usually sufficient for this.
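
A minimal sketch of trigram-based language identification with made-up, far-too-small reference texts: each language gets a ranked profile of character trigrams, and a query document is assigned to the language whose profile it most closely matches by rank difference.

```python
from collections import Counter

def trigram_profile(text, top=300):
    """Ranked character trigrams of a text, the usual profile for language ID."""
    text = " " + " ".join(text.lower().split()) + " "
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [gram for gram, _ in counts.most_common(top)]

def rank_distance(profile, reference):
    """Sum of rank differences; trigrams missing from the reference get a fixed penalty."""
    penalty = len(reference)
    return sum(abs(i - reference.index(g)) if g in reference else penalty
               for i, g in enumerate(profile))

# Toy reference "corpora"; real profiles are trained on large amounts of text.
references = {
    "english": trigram_profile("the quick brown fox jumps over the lazy dog and then the cat"),
    "spanish": trigram_profile("el rapido zorro marron salta sobre el perro perezoso y luego el gato"),
}

query = trigram_profile("the dog and the fox")
print(min(references, key=lambda lang: rank_distance(query, references[lang])))
```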

Authorship ascription is done by counting common (stylistic) words, not the content words which define classifications. A more complex approach would again do more than just count words.

Assignment of key phrases to a document corresponds to the problem of assigning subject headings in a library catalog. You start with established sets of phrases with defined meanings, and the goal is to assign one or more of these phrases to the document.

Metadata extraction is a related idea with further ramifications. Is it possible to find specific information, like author and title, automatically? Can you extract useful identifying key words and phrases? Note that the ability to do this may result in "actionable" information.

The next step is entity extraction. Not only do you want to extract the more obvious things like author and title; you also want to identify any entities that are mentioned in the document.

How do you identify entities? You can look them up in reference resources like dictionaries, lists of names, etc. You may rely on simple cues like capitalization or titles of address. You may search with regular expressions or use simple grammars for expressions.
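
A minimal sketch combining two of those tactics, a capitalization heuristic and a hypothetical gazetteer lookup, on a made-up sentence:

```python
import re

text = ("Peter Smith flew from New York to London last week "
        "to meet engineers from IBM and Google.")

# Hypothetical gazetteer: known names taken from a reference list.
gazetteer = {"IBM", "Google", "London"}

# Capitalized word sequences as candidate entities (a rough heuristic;
# it will also pick up sentence-initial words).
candidates = re.findall(r"(?:[A-Z][a-z]+|[A-Z]{2,})(?:\s+(?:[A-Z][a-z]+|[A-Z]{2,}))*", text)

entities = set(candidates) | {name for name in gazetteer if name in text}
print(entities)
```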

Information extraction refers to a situation similar to reading information from a form. Certain documents may be limited in their scope and expected to contain a given set of data items. If the informational items can be extracted, it may then be possible to infer rules or relationships involving them.

Text mining is complex, and there are many potential applications. Ultimately, the algorithms fall into the realm of natural language processing, and the current state of the art in computerized text processing still falls well short of human abilities and understanding.

9.6 Web Mining
This is like a specialized area of text mining, since the basic content of pages is predominantly text. There are two key differences:
- Pages have internal markup, which defines structure.
- Pages have links in and out, which help classify them by content and value.

In fact, large amounts of Web content are tabular, and presumably suited to standard data mining. However, the tabular nature is only reflected in the HTML presentation. A data mining technique called wrapper induction can be used to try to infer the tabular relationships based on formatting.

Wrapper induction can be extended:
- To accommodate changes in formatting
- To recognize when formats that differ superficially really indicate the same structure
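
A minimal sketch of the underlying idea, assuming a toy page where every record shares the same HTML formatting: one labelled example row is generalized into a pattern by keeping the markup and turning the field values into capture groups, which is the flavour of wrapper that wrapper induction learns automatically from examples.

```python
import re

def induce_wrapper(example_row, field_values):
    """Generalize one labelled example row into a regex: keep the markup
    literally and turn each field value into a capture group."""
    pattern = re.escape(example_row)
    for value in field_values:
        pattern = pattern.replace(re.escape(value), "(.*?)", 1)
    return re.compile(pattern)

page = ("<tr><td>C4.5</td><td>classification</td></tr>"
        "<tr><td>k-means</td><td>clustering</td></tr>"
        "<tr><td>Apriori</td><td>association rules</td></tr>")

wrapper = induce_wrapper("<tr><td>C4.5</td><td>classification</td></tr>",
                         ["C4.5", "classification"])
print(wrapper.findall(page))
# [('C4.5', 'classification'), ('k-means', 'clustering'), ('Apriori', 'association rules')]
```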

Page rank is an important concept in Web mining, though it is only tangentially related to standard data mining. Page rank is a measure of how "good" a match a page is to a certain search request, based on links in and out of pages.

These are the general principles for page ranking:
- Many links to page x suggest that x is good.
- A link from y to x is a good indicator if there are few links out of y.
- A link from y to x is a good indicator if y itself is rated highly.

Having the ranks of pages depend on the ranks of other pages is somewhat circular. However, iterative algorithms can run through networks of links, converging on solutions. The stopping condition is when the change in ranking values between iterations falls below a certain threshold.

Some relevant factoids: the authors guess that for ranks normalized between 0 and 1, the threshold for the stopping condition is in the range of 10^-9 to 10^-12. An early experiment, when the Web was simpler, converged after about 50 iterations.

Rumor has it that Google's page-ranking program(s) cycle through the entire Web in a matter of days, and that this process is repeated every few weeks.

There is a practical problem with this approach. There may be Web pages with no in-links or no out-links; these are known as page rank sinks. If the algorithm is caught in a sink, it won't terminate or converge.

The algorithms are given probabilistic parameters. In searching the Web there is a small probability of "teleporting", that is, jumping to another random page. The parameter is tunable, and its size affects the speed and accuracy of convergence.
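
A minimal sketch of this iterative ranking on a made-up four-page "web": rank is passed along outgoing links, a damping factor models the small teleport probability, sink pages spread their rank everywhere, and iteration stops when the largest change falls below a tolerance. Real search-engine ranking is vastly larger and more elaborate.

```python
def pagerank(links, damping=0.85, tol=1e-10):
    """Minimal power-iteration page ranking. links[p] lists the pages p links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    while True:
        new_rank = {p: (1.0 - damping) / n for p in pages}   # teleport share
        for p, outs in links.items():
            if outs:                                  # share rank over outgoing links
                for q in outs:
                    new_rank[q] += damping * rank[p] / len(outs)
            else:                                     # rank sink: spread everywhere
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        if max(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            return new_rank
        rank = new_rank

toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}   # D has no out-links
print(pagerank(toy_web))
```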

9.7 Adversarial Situations
Spam filtering is a classic example of this. You can write algorithms to identify spam; the point is that spam senders will track the algorithms and try to outwit them. This is a see-saw problem.

Similar kinds of problems: trying to trick page-ranking algorithms, and trying to confound computer security programs that detect intrusions and breaches based on transaction patterns. There are many applications of data mining which, while not exactly adversarial, are driven by security rather than money.

Data mining may be used to detect fraud, money laundering, and other criminal activity. It is used to flag airline passengers for additional screening. The government uses it to detect terrorists or terror-related activity. This can be based on any obtainable record, whether from a computer, a cell phone, or some other source, that can be mined.

There are serious ethical, legal, privacy, constitutional, and civil liberties questions about these practices. Data mining can also be applied in a collaborative/adversarial way.

The authors cite robo-soccer: independent agents "learn" to act independently and in concert with other agents to win a game or achieve some goal against other collections of agents. The book also cites a project designed to determine the author of a document where the style had presumably been intentionally altered.

9.8 Ubiquitous Data Mining
Ubiquitous data mining stems from the concept of ubiquitous computing. More factoids: the Web contains 10-20 billion documents, roughly 50 terabytes. Suppose you combined the content of the Web with the contents or actions of every device with an electronic brain (cell phone, iPod, etc.) and data mined across it all.

In a more limited sphere, the authors cite a specific example: programs that follow a user's actions with the goal of "learning" what tasks the user performs when using an operating system, in essence learning how to customize the OS's behavior to the user's desires.

The authors state that data mining will be ubiquitous in the future, and doubtless this is true. The remaining questions:
- Who wins, and who makes money?
- Will this make us more the masters, or the (willing) slaves, of technology?
- Do you want to play the game? Will you have a choice?

The End