Data Mining: Potentials and Challenges Rakesh Agrawal IBM Almaden Research Center.


1 Data Mining: Potentials and Challenges Rakesh Agrawal IBM Almaden Research Center

2 Thesis
  Data mining has started to live up to its promise in the commercial world, particularly in applications involving structured data.
  Promising data mining applications in non-conventional domains are beginning to emerge, involving a combination of structured and unstructured data.
  Investment in data mining research can have a large payoff.

3 Outline
  Examples of some promising non-conventional data mining applications and technologies
  Some hurdles we need to cross

4 Identifying Social Links Using Association Rules Input: Crawl of about 1 million pages

5 Website Profiling using Classification Input: Example pages for each category during training

6 Discovering Trends Using Sequential Patterns & Shape Queries Input: i) patent database ii) shape of interest

7 Discovering Micro-communities Frequently co-cited pages are related. Pages with large bibliographic overlap are related.
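The two relatedness signals on slide 7 can be illustrated with a small sketch. The toy link graph, the function names, and the use of Jaccard similarity as the measure of "bibliographic overlap" are all assumptions made for illustration; the slide does not say how overlap was measured.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy link graph: page -> set of pages it links to (cites).
links = {
    "p1": {"a", "b", "c"},
    "p2": {"a", "b"},
    "p3": {"b", "c"},
    "p4": {"d"},
}

def cocitation_counts(links):
    """Count, for each pair of cited pages, how many pages cite both."""
    counts = defaultdict(int)
    for cited in links.values():
        for x, y in combinations(sorted(cited), 2):
            counts[(x, y)] += 1
    return dict(counts)

def bibliographic_overlap(links, p, q):
    """Jaccard overlap of the outgoing link sets of pages p and q (assumed measure)."""
    union = links[p] | links[q]
    return len(links[p] & links[q]) / len(union) if union else 0.0

cocited = cocitation_counts(links)                       # frequently co-cited pairs
overlap_p1_p2 = bibliographic_overlap(links, "p1", "p2")  # citing pages with shared references
```

In this toy graph, pages "a" and "b" are cited together by two pages, and "p1" and "p2" share two of their three distinct references. A real crawl would need thresholds on both measures, which the slide does not specify.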

8 Technical Chasms
  Privacy concerns?
    – Privacy-preserving data mining
  Data for data mining?
    – Data mining over compartmentalized databases

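9 Inducing Classifiers over Privacy Preserved Numeric Data
  [Diagram: original records (30 | 25K | …, 50 | 40K | …) each pass through a Randomizer, yielding randomized records (65 | 50K | …, 35 | 60K | …); for example, an age of 30 becomes 65 (30+35). Labels in the diagram: Alice’s age, Alice’s salary, John’s age. The randomized values feed Reconstruct Age Distribution and Reconstruct Salary Distribution, whose outputs go to the Decision Tree Algorithm, which produces the Model.]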

10 Reconstruction Algorithm
  f_X^0 := Uniform distribution
  j := 0
  repeat
      f_X^{j+1}(a) := Bayes’ Rule
      j := j+1
  until (stopping criterion met)
  Converges to maximum likelihood estimate. – D. Agrawal & C.C. Aggarwal, PODS 2001.
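The loop on slide 10 can be sketched as follows. The slide names the update only as "Bayes' Rule", so the discretized update used here, the Gaussian noise model, the bin midpoints, and the stopping rule (total change below a tolerance) are all assumptions for illustration rather than the presenter's exact procedure.

```python
import math
import random

def gaussian_density(y, sigma):
    """Density of zero-mean Gaussian noise with standard deviation sigma."""
    return math.exp(-(y * y) / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))

def reconstruct(observed, midpoints, sigma, max_iters=100, tol=1e-6):
    """Estimate the probability mass of the original values at each bin midpoint."""
    n_bins = len(midpoints)
    estimate = [1.0 / n_bins] * n_bins  # f_X^0 := uniform distribution
    # Likelihood of each randomized value w given each candidate original value a.
    likelihood = [[gaussian_density(w - a, sigma) for a in midpoints] for w in observed]
    for _ in range(max_iters):
        updated = [0.0] * n_bins
        for row in likelihood:
            # Bayes' rule: posterior over bins for this one observation.
            denom = sum(l * p for l, p in zip(row, estimate))
            for k in range(n_bins):
                updated[k] += row[k] * estimate[k] / denom
        # Average the per-observation posteriors to get f_X^{j+1}.
        updated = [u / len(observed) for u in updated]
        change = sum(abs(u - e) for u, e in zip(updated, estimate))
        estimate = updated
        if change < tol:  # assumed stopping criterion
            break
    return estimate

# Hypothetical data: ages clustered in two groups, then randomized with additive noise.
rng = random.Random(0)
true_ages = [rng.uniform(20, 30) for _ in range(500)] + [rng.uniform(50, 60) for _ in range(500)]
sigma = 20.0
observed = [x + rng.gauss(0.0, sigma) for x in true_ages]
midpoints = [25, 35, 45, 55, 65, 75]
estimate = reconstruct(observed, midpoints, sigma)
```

Only the randomized values and the noise distribution are used by `reconstruct`; the original ages never reach it, which is the point of the approach. The returned list is a probability distribution over the bin midpoints.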

11 Works Well

12 Accuracy vs. Randomization

13 Discovering frequent itemsets
  Breach level = 50%.

  Soccer: s_min = 0.2%
  Itemset Size   True Itemsets   True Positives   False Drops   False Positives
  1              266             254              12            31
  2              217             195              22            45
  3              48              43               5             26

  Mailorder: s_min = 0.2%
  Itemset Size   True Itemsets   True Positives   False Drops   False Positives
  1              65              65               0             0
  2              228             212              16            28
  3              22              18               4             5
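The column names on slide 13 can be read as a set comparison between the itemsets that are truly frequent and those found by mining the randomized data. The sketch below assumes that reading; the function name and the example itemsets are hypothetical.

```python
def compare_itemsets(true_frequent, discovered):
    """Compare truly frequent itemsets with those found on randomized data."""
    true_frequent = set(true_frequent)
    discovered = set(discovered)
    return {
        "true_itemsets": len(true_frequent),
        "true_positives": len(true_frequent & discovered),   # frequent and found
        "false_drops": len(true_frequent - discovered),      # frequent but missed
        "false_positives": len(discovered - true_frequent),  # found but not frequent
    }

# Hypothetical example itemsets.
true_frequent = {frozenset({"milk"}), frozenset({"bread"}), frozenset({"milk", "bread"})}
discovered = {frozenset({"milk"}), frozenset({"milk", "bread"}), frozenset({"beer"})}
result = compare_itemsets(true_frequent, discovered)
```

Under this reading, true positives and false drops always sum to the number of true itemsets.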

14 Computation over Compartmentalized Databases

15 Some Hard Problems
  Past may be a poor predictor of future
    – Abrupt changes
    – Wrong training examples
  Reliability and quality of data
  Actionable patterns (principled use of domain knowledge?)
  Over-fitting vs. not missing the rare nuggets
  Richer patterns
  Simultaneous mining over multiple data types
  When to use which algorithm?
  Automatic, data-dependent selection of algorithm parameters

16 Summary
  Data mining has shown promise, but we need further research to realize its full potential.
  "We stand on the brink of great new answers, but even more, of great new questions." -- Matt Ridley

