File Classification in self-* storage systems Michael Mesnier, Eno Thereska, Gregory R. Ganger, Daniel Ellard, Margo Seltzer.

Presentation transcript:

File Classification in self-* storage systems Michael Mesnier, Eno Thereska, Gregory R. Ganger, Daniel Ellard, Margo Seltzer

Introduction
- A self-* infrastructure needs information about its users, applications, and policies
- This information is not readily provided, and the system cannot depend on users to provide it
- So? It must be learned

Self-* storage systems
- File classification is a sub-problem of the larger self-* infrastructure
- Key idea: get hints from the attributes creators associate with their files, such as file size, file names, and lifetimes
- Once intentions are determined, policy decisions can be made
- Result: better file organization and performance
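To make these "hints" concrete, here is a minimal sketch (in Python; the specific attribute names are illustrative choices of mine, not taken from the paper) of extracting creation-time attributes from a file:

```python
import os

def extract_attributes(path, uid, mode):
    """Turn creation-time metadata into features a classifier can use.

    The features here (extension, name prefix, directory depth, owner,
    permission bits) are illustrative; the paper's hints include names,
    sizes, and lifetimes observed in traces.
    """
    name = os.path.basename(path)
    stem, ext = os.path.splitext(name)
    return {
        "extension": ext.lstrip(".").lower(),   # e.g. "log", "tmp", "c"
        "name_prefix": stem.split(".")[0][:8],  # leading name component
        "dir_depth": path.count(os.sep),        # how deep in the tree
        "owner": uid,                           # numeric user id
        "mode": oct(mode & 0o777),              # permission bits
    }

print(extract_attributes("/home/alice/src/main.c", uid=1001, mode=0o644))
```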

Classifying Files
- Current practice: rule-of-thumb policy selection, which is generic and not optimized
- Better: distinguish classes of files and apply finer-grained policies
- Policies are ideally assigned at file creation, so classes must be determined at creation time
- The self-* system must learn this association, either 1) from traces or 2) on a running file system

So, how?
- Create a model that classifies files based on some of their attributes: name, owner, permissions
- Irrelevant attributes must be filtered out
- The classifier must learn rules to do so from a training set
- Then inference happens on newly created files
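A minimal sketch of this train-then-infer loop, using scikit-learn's DecisionTreeClassifier on a toy training set (the library choice and the data are assumptions of mine, not from the paper):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy training set: attribute dicts and the class to predict
# (here, whether the file turned out to be short-lived).
X_raw = [
    {"extension": "tmp", "owner": 1001, "mode": "0o644"},
    {"extension": "c",   "owner": 1001, "mode": "0o644"},
    {"extension": "o",   "owner": 1002, "mode": "0o644"},
    {"extension": "log", "owner": 0,    "mode": "0o600"},
]
y = ["short-lived", "long-lived", "short-lived", "long-lived"]

vec = DictVectorizer()                # one-hot encodes categorical attributes
X = vec.fit_transform(X_raw)

clf = DecisionTreeClassifier().fit(X, y)    # learning: induce the rules

# Inference: predict the class of a newly created file.
new_file = vec.transform([{"extension": "tmp", "owner": 0, "mode": "0o600"}])
print(clf.predict(new_file))
```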

The right model
- The model must be: scalable, dynamic, cost-sensitive (aware of mis-prediction costs), and human-interpretable
- Model selected: decision trees
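Decision trees meet the interpretability requirement because the learned rules can be printed and read directly by a human. For example, with scikit-learn's export_text (my choice of tooling, not the paper's):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: feature 0 is "name looks temporary", feature 1 is "owned by root".
X = [[0, 1], [1, 0], [1, 1], [0, 0]]
y = ["long-lived", "short-lived", "short-lived", "long-lived"]

clf = DecisionTreeClassifier().fit(X, y)
print(export_text(clf, feature_names=["is_tmp", "is_root_owned"]))
# Prints something like:
# |--- is_tmp <= 0.50
# |   |--- class: long-lived
# |--- is_tmp >  0.50
# |   |--- class: short-lived
```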

ABLE: Attribute-Based Learning Environment
1. Obtain traces
2. Build a decision tree
3. Make predictions
- The tree is grown top-down, splitting the samples until the leaves hold files with similar attributes or all attributes are used
- After the tree is created, it can be queried
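The top-down splitting loop can be sketched as follows. This is an ID3-style outline under my own simplifications (it consumes attributes in a fixed order, whereas a real learner picks the best split at each level), not ABLE's exact algorithm:

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(samples, labels, attributes):
    """Recursively split until a leaf is pure or attributes run out."""
    if len(set(labels)) == 1:          # all samples agree: pure leaf
        return labels[0]
    if not attributes:                 # no attributes left: majority leaf
        return majority(labels)
    attr = attributes[0]               # a real learner would pick the
    rest = attributes[1:]              # best split, e.g. by information gain
    tree = {attr: {}}
    for value in set(s[attr] for s in samples):
        subset = [(s, l) for s, l in zip(samples, labels) if s[attr] == value]
        sub_s, sub_l = zip(*subset)
        tree[attr][value] = build_tree(list(sub_s), list(sub_l), rest)
    return tree

samples = [{"ext": "tmp", "owner": "alice"},
           {"ext": "c",   "owner": "alice"},
           {"ext": "tmp", "owner": "root"}]
labels = ["short-lived", "long-lived", "short-lived"]
print(build_tree(samples, labels, ["ext", "owner"]))
```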

Tests
- Run against traces from several systems (DEAS03, EECS03, CAMPUS, LAB) to verify that the approach is workload-independent
- The control: the MODE algorithm, which places all files in a single cluster
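A "single cluster" baseline can be reconstructed in a few lines; note that the exact semantics of MODE are my reading of the slide (every file receives the same, most common prediction):

```python
from collections import Counter

class ModeBaseline:
    """Predict the most common training class for every file."""
    def fit(self, labels):
        self.mode = Counter(labels).most_common(1)[0][0]
        return self
    def predict(self, n):
        return [self.mode] * n

baseline = ModeBaseline().fit(["short-lived", "short-lived", "long-lived"])
print(baseline.predict(2))   # ['short-lived', 'short-lived']
```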

Results
- Prediction results are quite good: 90%-100% accuracy claimed
- Files cluster clearly by their attributes
- The authors predict that a model's ruleset will converge over time

Benefits of incremental learning
- Dynamically refines the model as new samples become available
- Generally better than one-shot learners, which sometimes perform poorly
- The rulesets of incremental learners are smaller
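Many off-the-shelf decision-tree implementations cannot be updated in place, so one simple way to approximate incremental refinement is periodic retraining on all samples seen so far. This is a stand-in sketch of mine, not the system's actual incremental learner:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

class IncrementalTree:
    """Refine the model as samples arrive, by periodic retraining.

    scikit-learn trees have no incremental update, so this rebuilds
    the tree from all samples observed so far every `retrain_every`
    observations.
    """
    def __init__(self, retrain_every=100):
        self.X, self.y = [], []
        self.retrain_every = retrain_every
        self.vec, self.clf = DictVectorizer(), None

    def observe(self, attrs, label):
        self.X.append(attrs)
        self.y.append(label)
        if len(self.y) % self.retrain_every == 0:
            self.clf = DecisionTreeClassifier().fit(
                self.vec.fit_transform(self.X), self.y)

    def predict(self, attrs):
        if self.clf is None:
            return None                 # no model trained yet
        return self.clf.predict(self.vec.transform([attrs]))[0]
```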

On accuracy
- More attributes mean a greater chance of over-fitting
- More rules -> smaller compression ratios, losing the compression benefits
- Predictive models can make false predictions, which can hurt performance (e.g., files that should be in RAM are placed on disk instead)
- Solution: cost functions that penalize errors and create a deliberately biased tree
- System goals will need to be translated into these costs
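One common way to realize such a cost function (an assumption on my part; the paper's mechanism may differ) is to weight the classes asymmetrically, so the induced tree is biased away from the expensive mistake:

```python
from sklearn.tree import DecisionTreeClassifier

# Suppose wrongly demoting a hot file to disk costs 10x more than
# wrongly keeping a cold file in RAM. Weighting the "hot" class
# biases the induced tree toward predicting "hot" when unsure.
clf = DecisionTreeClassifier(class_weight={"hot": 10.0, "cold": 1.0})

X = [[1, 0], [0, 1], [1, 1], [0, 0]]    # toy feature vectors
y = ["hot", "cold", "hot", "cold"]
clf.fit(X, y)
print(clf.predict([[1, 0]]))            # ['hot']
```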

Conclusion
- Decision trees provide prediction accuracies in the 90% range
- The models are adaptable via incremental learning
- Continued work: integration into the self-* infrastructure

Questions?