Presentation on theme: "PAKDD Panel: What Next Ramakrishnan Srikant. What Next Electronic Commerce –Catalog Integration (WWW 2001, with R. Agrawal) –Searching with Numbers (WWW."— Presentation transcript:
Intuition Use affinity information in new catalog. –Products in same category are similar. Bias Naïve Bayes classifier to incorporate this information. –Accuracy boost depends on match between two categorizations. –Use tuning set to determine weight given to affinity information.
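The slide above does not give the formula, so the following is only a minimal sketch of one way to bias a Naïve Bayes classifier with affinity information. It assumes the per-class log-likelihoods for every product in one source category are already computed. It also assumes the biased prior takes the form |C| × (votes for C + 1)^w, where the votes come from a first, unbiased pass. The add-one smoothing and the names `rebias_with_affinity` and `pick_weight` are illustrative, not taken from the talk.

```python
import math
from collections import Counter

def rebias_with_affinity(log_likelihoods, class_sizes, weight):
    """Classify all products of ONE source category together.

    log_likelihoods: list of {cls: log P(product | cls)}, one dict per product.
    class_sizes:     {cls: number of training products in cls}.
    weight:          w >= 0; w = 0 reproduces plain Naive Bayes.
    """
    total = sum(class_sizes.values())
    base_prior = {c: math.log(n / total) for c, n in class_sizes.items()}

    # Pass 1: plain Naive Bayes predictions for every product in the category.
    first_pass = [max(ll, key=lambda c: ll[c] + base_prior[c])
                  for ll in log_likelihoods]
    votes = Counter(first_pass)

    # Pass 2: boost classes that attracted many products from this category.
    # Assumed form of the prior: |C| * (votes + 1) ** w, computed in log space.
    biased_prior = {c: math.log(class_sizes[c]) + weight * math.log(votes[c] + 1)
                    for c in class_sizes}
    return [max(ll, key=lambda c: ll[c] + biased_prior[c])
            for ll in log_likelihoods]

def pick_weight(candidates, log_likelihoods, class_sizes, true_labels):
    """Choose the weight that is most accurate on a labeled tuning set."""
    def accuracy(w):
        preds = rebias_with_affinity(log_likelihoods, class_sizes, w)
        return sum(p == t for p, t in zip(preds, true_labels)) / len(true_labels)
    return max(candidates, key=accuracy)

# Toy data: two products clearly belong to A, the third is borderline.
lls = [{"A": -1.0, "B": -5.0}, {"A": -1.0, "B": -5.0}, {"A": -3.0, "B": -2.8}]
sizes = {"A": 10, "B": 10}
print(rebias_with_affinity(lls, sizes, 0))  # plain Naive Bayes
print(rebias_with_affinity(lls, sizes, 1))  # borderline product follows its neighbours
```

With weight 0 the borderline product goes to B. With weight 1 it is pulled into A along with the other products of its source category. That is the "products in same category are similar" intuition at work.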
Yahoo & Google 5 slices of the hierarchy: Autos, Movies, Outdoors, Photography, Software –Typical match: 69%, 15%, 3%, 3%, 1%, …. Merging Yahoo into Google –30% fewer errors (14.1% absolute difference in accuracy) Merging Google into Yahoo –26% fewer errors (14.3% absolute difference) Open Problems: SVM, Decision Tree,...
Data Extraction is hard Synonyms for attribute names and units. –"lb" and "pounds", but no "lbs" or "pound". Attribute names are often missing. –No "Speed", just "MHz Pentium III" –No "Memory", just "MB SDRAM" 850 MHz Intel Pentium III 192 MB RAM 15 GB Hard Disk DVD Recorder: Included; Windows Me 14.1 inch display 8.0 pounds
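To make the difficulty concrete, here is a small sketch that pulls (number, unit) pairs out of the sample specification without trying to recover attribute names. The regular expression and the tiny `UNIT_SYNONYMS` table are assumptions made for illustration. A real synonym list would be much larger and, as the slide notes, still incomplete.

```python
import re

# Hypothetical synonym table; real product data needs far more entries.
UNIT_SYNONYMS = {"lb": "pounds", "lbs": "pounds", "pound": "pounds"}

# A number (optionally with a decimal part) followed by an alphabetic token.
NUMBER_UNIT = re.compile(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)")

def extract_numbers(text):
    """Return a list of (value, unit) pairs found in free-form product text."""
    pairs = []
    for number, unit in NUMBER_UNIT.findall(text):
        # Map known unit synonyms to one spelling; leave unknown units as-is.
        pairs.append((float(number), UNIT_SYNONYMS.get(unit.lower(), unit)))
    return pairs

spec = ("850 MHz Intel Pentium III 192 MB RAM 15 GB Hard Disk "
        "DVD Recorder: Included; Windows Me 14.1 inch display 8.0 pounds")
print(extract_numbers(spec))
```

The output carries units but no attribute names such as "Speed" or "Memory", which is exactly the gap the slide describes.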
Why does it work? Conjecture: If we get a close match on numbers, it is likely that we have correctly matched attribute names. Non-overlapping attributes: –Memory: 64 - 512 MB, Disk: 10 - 40 GB Correlations: –Memory: 64 - 512 MB, Disk: 10 - 100 GB still fine.
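The conjecture can be illustrated with a brute-force sketch, which is not the algorithm from the talk. It treats the query as a bare list of numbers, tries every assignment of those numbers to a record's attributes, and keeps the record and assignment with the smallest total relative distance. The function name `best_match`, the relative-distance cost, and the toy records are assumptions. Attribute values are assumed to be non-zero.

```python
from itertools import permutations

def best_match(query, records):
    """Find the record and attribute assignment closest to a bare list of numbers.

    query:   list of numbers with no attribute names attached.
    records: list of {attribute_name: numeric_value} dicts (values non-zero).
    Returns (cost, record_index, attribute_names_in_query_order).
    """
    best = None
    for idx, record in enumerate(records):
        # Try every way of assigning the query numbers to distinct attributes.
        for perm in permutations(record.items(), len(query)):
            cost = sum(abs(q - v) / abs(v) for q, (_, v) in zip(query, perm))
            if best is None or cost < best[0]:
                best = (cost, idx, tuple(name for name, _ in perm))
    return best

# Toy catalog: memory in MB, disk in GB; the numeric ranges overlap between 64 and 100.
records = [
    {"memory": 64, "disk": 10},
    {"memory": 128, "disk": 20},
    {"memory": 512, "disk": 100},
]

print(best_match([20, 128], records))   # numbers given in "disk, memory" order
print(best_match([100, 500], records))  # 100 falls in both ranges
```

In the second query, 100 lies inside both the memory and the disk range. The closest overall match still assigns it to disk, because memory and disk sizes are correlated in the records. That is the "still fine" case on the slide.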
Some Hard Problems Past may be a poor predictor of future –Abrupt changes Reliability and quality of data –Wrong training examples Simultaneous mining over multiple data types Richer patterns
Privacy Preserving Data Mining Have your cake and mine it too! –Preserve privacy at the individual level, but still build accurate models. Challenges –Privacy Breaches –Clustering & Associations –Privacy-sensitive Security Applications Opportunities –Web Demographics –Inter-Enterprise Data Mining –Privacy-sensitive Security Applications