By Matt Goliber and Jim Hougas Data Mining and Knowledge Discovery.


1 By Matt Goliber and Jim Hougas Data Mining and Knowledge Discovery

2 What is Data Mining?
Not like gold or diamond mining
Mining of knowledge from data
Important to many different fields
A part of Knowledge Discovery in Databases (KDD)

3 The Process of Knowledge Discovery
Raw data -> Data Warehouse -> Patterns -> KNOWLEDGE!
Raw data to Data Warehouse: data cleaning and integration
Data Warehouse to Patterns: data transformation, selection, and mining
Patterns to KNOWLEDGE!: pattern evaluation and knowledge presentation
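
The stages on this slide can be sketched as a chain of small functions. This is only a toy illustration: the records, the function names, and the 50% threshold are all assumptions made for the example, not part of the presentation.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
raw_a = [("ann", "bagel"), ("ann", "cream cheese"), ("bob", None)]
raw_b = [("bob", "bagel"), ("cat", "bagel"), ("cat", "cream cheese")]

def clean(records):
    """Data cleaning: drop records that have a missing field."""
    return [r for r in records if all(v is not None for v in r)]

def integrate(*sources):
    """Data integration: merge cleaned sources into one 'warehouse' list."""
    return [r for src in sources for r in src]

def transform(warehouse):
    """Transformation and selection: build one basket of items per customer."""
    baskets = {}
    for customer, item in warehouse:
        baskets.setdefault(customer, set()).add(item)
    return baskets

def mine(baskets):
    """Mining: count how many baskets contain each item."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(patterns, n_baskets, min_share=0.5):
    """Pattern evaluation: keep items found in at least min_share of baskets."""
    shares = {item: count / n_baskets for item, count in patterns.items()}
    return {item: share for item, share in shares.items() if share >= min_share}

warehouse = integrate(clean(raw_a), clean(raw_b))
baskets = transform(warehouse)
knowledge = evaluate(mine(baskets), len(baskets))
print(knowledge)
```

Each function corresponds to one arrow in the diagram, so a stage can be swapped out without touching the others.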

4 Why is Data Mining useful?
We are data rich but information poor
-Internet
-Intelligence
Humans often lack the ability to comprehend and manage the immense amount of available and sometimes seemingly unrelated data

5 How long has this idea been around?
Late 1960s and early 1970s
Stanford’s Meta-DENDRAL (1970-76)
-Extension of DENDRAL
Doug Lenat with AM (1976)

6 Meta-DENDRAL
Extension of the DENDRAL (1965) program
-One of the first expert systems
-Interpreted mass spectra
Meta-DENDRAL took the mass spectra of compounds of known 3-D structure and formulated rules about the interpretation of the spectra
Came up with known rules and some new ones!

7 Sample Mass Spec
ethyl 3-oxo-3-phenylpropanoate (ethyl benzoylacetate)

8 AM
Doug Lenat, 1976
The name stands for nothing; it stands alone
AM was given sets, bags, ordered sets, and lists
AM was also given operations to perform on these data sets
-Union, intersection, etc.
Came up with ideas about counting, addition, multiplication, prime numbers, and Goldbach’s conjecture
AM thought that these were all uninteresting
Liked maximally divisible numbers though…
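
Two of the concepts named on this slide are easy to check by brute force. The sketch below is not how AM worked (AM searched for concepts using heuristics); it only illustrates the concepts themselves. It assumes that a maximally divisible number is one with more divisors than every smaller positive integer, and the function names are made up for the example.

```python
def divisor_count(n):
    """Number of positive divisors of n (brute force)."""
    return sum(1 for d in range(1, n + 1) if n % d == 0)

def maximally_divisible(limit):
    """Numbers up to limit with more divisors than every smaller number."""
    best = 0
    result = []
    for n in range(1, limit + 1):
        count = divisor_count(n)
        if count > best:
            best = count
            result.append(n)
    return result

def is_prime(n):
    """Trial-division primality test."""
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def goldbach_pair(n):
    """Return one pair of primes summing to the even number n, or None."""
    for p in range(2, n // 2 + 1):
        if is_prime(p) and is_prime(n - p):
            return (p, n - p)
    return None

print(maximally_divisible(60))   # [1, 2, 4, 6, 12, 24, 36, 48, 60]
print(goldbach_pair(28))         # (5, 23)
```

Goldbach's conjecture says every even number greater than 2 has such a pair; the code can only confirm individual cases, not prove the conjecture.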

9 What next?
Not a whole lot…
Databases were not prevalent enough, so there was no great demand
Did benefit from machine learning research
Beginning of the 1990s: “The next area…”
-Ranked as one of the most promising research areas (NSF)
-Information explosion
Early commercial systems
-Farm Journal
-GM

10 Next Generation Techniques
Decision Trees
–Each branch is a classification question
–Allows businesses to segment customers, products, and sales regions
–Questions organize the data
Rule Induction
–All patterns are pulled from the data
–Accuracy and significance are then attached to each pattern
–These help the user know how strong a pattern is and how likely it is to occur again
–Ex: If bagels are purchased, then cream cheese is purchased 90% of the time, and this pattern occurs in 3% of all shopping baskets
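
The two numbers in the bagel example can be computed directly from a list of baskets. This is a minimal sketch on made-up data, so it does not reproduce the 90% and 3% figures. It assumes "accuracy" means the share of bagel baskets that also contain cream cheese, and that "occurs in" refers to the share of all baskets containing both items; `rule_stats` is a hypothetical helper name.

```python
def rule_stats(baskets, antecedent, consequent):
    """Return (accuracy, frequency) for the rule 'antecedent -> consequent'.

    accuracy:  of the baskets containing the antecedent, the share that
               also contain the consequent.
    frequency: the share of all baskets that contain both items.
    """
    with_antecedent = [b for b in baskets if antecedent in b]
    with_both = [b for b in with_antecedent if consequent in b]
    accuracy = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    frequency = len(with_both) / len(baskets) if baskets else 0.0
    return accuracy, frequency

# Toy shopping baskets, invented for illustration only.
baskets = [
    {"bagels", "cream cheese"},
    {"bagels", "cream cheese", "coffee"},
    {"bagels"},
    {"milk"},
    {"coffee", "cream cheese"},
    {"bagels", "cream cheese"},
    {"milk", "coffee"},
    {"bread"},
]

accuracy, frequency = rule_stats(baskets, "bagels", "cream cheese")
print(f"If bagels then cream cheese: accuracy {accuracy:.0%}, "
      f"found in {frequency:.1%} of baskets")
```

A rule can be very accurate yet rare, or common yet unreliable, which is why both numbers are reported together.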

11 Decision Trees vs. Rule Induction
Decision Trees
–Always one and only one rule covers an instance
Rule Induction
–Many rules may cover the same instance, or
–no rule may cover an instance
Example
–Decision Trees use height and shoe size to determine size of person
–Rule Induction uses one or the other

12 Examples of Significant Developments
Stock Market Advances (1991)
–Astrophysicists Doyne Farmer and Norman Packard
–Prediction Company could predict stock market trends
Bell Atlantic (1996)
–Consumer phone buying trends
–Rule Induction
Advanced Scout (1997)
–Inderpal Bhandari assists NBA coaches
–Rule Induction
Persuade 400,000 undecided voters (2004)
–MoveOn attempts to influence the election
–Decision Tree

13 Challenges
Large Data Sets with High Complexity
- One or the other is currently possible, but not both
Expensive
- Costs of Bell Atlantic (Experts are needed)
- Cost for a two-day course in Las Vegas ($1,300)
- Software ($100,000)

14 Research
DARPA
–Defense Advanced Research Projects Agency
–ACLU claims this is an invasion of privacy
–Decision Tree
Uncovering Terrorists in public chat rooms
–Tracks the times that messages are sent
Advanced Scout
–Bhandari is working on Advanced Scout for the NHL
–Rule Induction

15 Current State
Out of the Lab
–Into Fortune 500 companies
Automate Model Scoring
–Fingers are currently crossed in hopes that scoring by IT personnel is done correctly

16 Future States
Utilizing Company Warehouses
–Data miners must take advantage of the million-dollar warehouse that a company builds
Effort Knob
–Low for quick model, high for quality model
Computed Target Columns
–User could create a new target variable
–Ex: finance information that a business has

17 Sources
http://web.media.mit.edu/~haase/thesis/node54.html#SECTION00711000000000000000
http://smi-web.stanford.edu/projects/history.html#METADENDRAL
http://www.cs.cf.ac.uk/Dave/AI2/node151.html
http://64.233.161.104/search?q=cache:Q6eMD9tEKwIJ:www.cosc.brocku.ca/Offerings/4P79/Week12.ppt+meta-dendral&hl=en
http://laurel.actlab.utexas.edu/~cynbe/muq/muf3_21.html
http://64.233.161.104/search?q=cache:yft0cQ5tZJQJ:www.cs.uwaterloo.ca/~shallit/Talks/cct.ps+%22fundamental+theorem+of+arithmetic%22+computer+data+mining+prove&hl=en
http://mathworld.wolfram.com/GoldbachConjecture.html
http://www.quantlet.com/mdstat/scripts/csa/html/node202.html
http://www.thearling.com
http://www.wired.com
http://www.dmreview.com
http://www.ebscohost.com
http://www.thearling.com/text/dmtechniques/dmtechniques.htm
http://www.aaai.org/Library/Magazine/Vol13/13-03/vol13-03.html
Data Mining: Concepts and Techniques. Han J. and Kamber M.

