File Classification in self-* storage systems
Michael Mesnier, Eno Thereska, Gregory R. Ganger, Daniel Ellard, Margo Seltzer

Introduction
- Self-* infrastructure needs information about:
  - Users
  - Applications
  - Policies
- This information is not readily provided, and the system cannot depend on users to supply it
- So it must be learned

Self-* storage systems
- A sub-problem of the self-* infrastructure
- Key: get hints from what creators associate with their files
  - File size
  - File names
  - Lifetimes
- Once intentions are determined, decisions can be made
- Results: better file organization and performance

Classifying Files
- Current: rule-of-thumb policy selection
  - Generic, not optimized
- Better: distinguish classes of files
  - Finer-grained policies
  - Ideally assigned at file creation
- Classes must therefore be determined at creation
- Self-* must learn this association from:
  1) traces
  2) a running file system

So, how?
- Create a model that classifies files based on (some) attributes
  - Name
  - Owner
  - Permissions
- Irrelevant attributes must be filtered out
- The classifier must learn rules to do so
  - Based on a training set
- Then inference happens
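
As a rough illustration of the attributes listed on this slide, the sketch below turns a file's name, owner, and permissions into categorical features that a classifier could consume. The function name `creation_attributes` and the particular features chosen (extension, dotfile flag, two permission bits) are assumptions made for this example; the slides do not say how attributes are encoded.

```python
import stat

def creation_attributes(name, owner_uid, mode):
    """Turn attributes known at file creation into categorical features.

    The feature set here is hypothetical, chosen only for illustration.
    """
    base, dot, ext = name.rpartition(".")
    return {
        # Extension only counts if there is something before the dot.
        "extension": ext.lower() if dot and base else "",
        "is_dotfile": name.startswith("."),
        "owner": owner_uid,
        "owner_writable": bool(mode & stat.S_IWUSR),
        "world_readable": bool(mode & stat.S_IROTH),
    }

features = creation_attributes("notes.TXT", 1001, 0o644)
print(features)
```

Deciding which of these features are irrelevant is left to the learner, which matches the slide's point that the classifier must learn its own filtering rules.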

The right model
- The model must be:
  - Scalable
  - Dynamic
  - Cost-sensitive (mis-prediction cost)
  - Interpretable (by humans)
- Model selected: decision trees

ABLE
- Attribute-Based Learning Environment
  1. Obtain traces
  2. Build a decision tree
  3. Make predictions
- The tree is built top down, until all attributes are used
- Samples are split until the files in each leaf have similar attributes
- After the tree is created, querying begins
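
The build-then-query flow on this slide can be sketched with a small top-down decision tree over categorical attributes. This is not ABLE's implementation: the entropy-based (ID3-style) choice of split attribute is an assumption, and the training samples and class labels such as "write-only" are made up for the example.

```python
from collections import Counter
import math

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def build_tree(samples, attributes):
    """samples: list of (attribute_dict, label). Returns a nested dict or a label."""
    labels = [label for _, label in samples]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the leaf is pure or every attribute has been used.
    if len(set(labels)) == 1 or not attributes:
        return majority

    def split_entropy(attr):
        groups = {}
        for feats, label in samples:
            groups.setdefault(feats[attr], []).append(label)
        return sum(len(g) / len(samples) * entropy(g) for g in groups.values())

    # Assumed split criterion: pick the attribute with the lowest weighted entropy.
    best = min(attributes, key=split_entropy)
    remaining = [a for a in attributes if a != best]
    branches = {}
    for feats, label in samples:
        branches.setdefault(feats[best], []).append((feats, label))
    return {
        "attribute": best,
        "default": majority,  # used for attribute values never seen in training
        "children": {v: build_tree(subset, remaining)
                     for v, subset in branches.items()},
    }

def predict(tree, feats):
    while isinstance(tree, dict):
        tree = tree["children"].get(feats[tree["attribute"]], tree["default"])
    return tree

# Hypothetical training samples standing in for trace data.
training = [
    ({"extension": "log", "owner": "daemon"}, "write-only"),
    ({"extension": "log", "owner": "alice"}, "write-only"),
    ({"extension": "c", "owner": "alice"}, "read-mostly"),
    ({"extension": "c", "owner": "bob"}, "read-mostly"),
    ({"extension": "tmp", "owner": "bob"}, "short-lived"),
]
tree = build_tree(training, ["extension", "owner"])
print(predict(tree, {"extension": "log", "owner": "bob"}))
```

In this toy data the extension alone separates the classes, so the tree never splits on owner, which is one way an irrelevant attribute ends up filtered out.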

Tests
- Run on several systems, to make sure the approach is workload-independent
  - DEAS03
  - EECS03
  - CAMPUS
  - LAB
- The control: the MODE algorithm, which places all files in a single cluster
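
A control of this kind can be sketched as a predictor that ignores attributes entirely. The slides say only that MODE places all files in a single cluster; reading that as "always predict the most common class seen in training" is an assumption, and the labels below are invented for the example.

```python
from collections import Counter

class ModePredictor:
    """Baseline that predicts one class for every file (assumed reading of MODE)."""

    def fit(self, labels):
        self.mode = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, _features=None):
        # Attributes are ignored on purpose: every file lands in the same class.
        return self.mode

def accuracy(predictions, truth):
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)

# Hypothetical labels, not taken from the traces named on the slide.
train_labels = ["small", "small", "large", "small"]
test_labels = ["small", "large", "small", "small", "large"]

baseline = ModePredictor().fit(train_labels)
preds = [baseline.predict() for _ in test_labels]
print(accuracy(preds, test_labels))
```

An attribute-based model is only worth its complexity if it beats this kind of baseline on the same data.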

Results
- Prediction results are quite good
  - 90% to 100% accuracy claimed
- Clustering of files by attributes is clear
- Prediction: a model's ruleset will converge over time

Benefits of incremental learning
- Dynamically refines the model as samples become available
- Generally better than one-shot learners
  - One-shot learners sometimes perform poorly
- Rulesets of incremental learners are smaller
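
The difference between the two learning styles can be shown with a deliberately simplified learner. The sketch below is not a decision tree and not ABLE: it keeps one rule per attribute value, and `freeze_after` is a hypothetical parameter that turns it into a one-shot learner by stopping updates after an initial batch. The sample stream is invented to show a workload in which a new file type appears late.

```python
from collections import Counter, defaultdict

class IncrementalRuleLearner:
    """Keeps one rule per attribute value and refines it with every new sample."""

    def __init__(self):
        self.counts = defaultdict(Counter)
        self.overall = Counter()

    def update(self, value, label):
        self.counts[value][label] += 1
        self.overall[label] += 1

    def predict(self, value):
        # Fall back to the overall most common label for unseen values.
        seen = self.counts.get(value) or self.overall
        return seen.most_common(1)[0][0] if seen else None

def run(stream, freeze_after=None):
    """Predict each sample before learning from it; return the accuracy."""
    learner = IncrementalRuleLearner()
    correct = 0
    for i, (value, label) in enumerate(stream):
        correct += learner.predict(value) == label
        if freeze_after is None or i < freeze_after:
            learner.update(value, label)
    return correct / len(stream)

# Hypothetical stream: ".c" files first, then ".log" files appear.
stream = [("c", "read-mostly")] * 4 + [("log", "write-only")] * 6

print(run(stream))                   # incremental: keeps learning
print(run(stream, freeze_after=4))   # one-shot: frozen after the first 4 samples
```

On this stream the frozen learner keeps applying its old rule to the new file type, while the incremental learner recovers after a single mistake.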

On accuracy
- More attributes = greater chance of over-fitting
  - More rules -> smaller ratios
  - Loses compression benefits
- Predictive models can make false predictions
  - Can impact performance
  - e.g., things that should be in RAM are placed on disk instead
- Solution: cost functions
  - Penalize errors
  - Create a biased tree
  - System goals will need to be translated into cost functions
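
One way to picture a cost function biasing predictions is to let a leaf pick the label with the lowest expected penalty instead of the most common one. This is a sketch under assumptions: the slides do not say how ABLE applies costs, and the function name `min_cost_label`, the "hot"/"cold" labels, and the penalty values are all hypothetical.

```python
def min_cost_label(leaf_counts, cost):
    """Pick the label with the lowest expected mis-prediction cost.

    leaf_counts: {actual_label: number of training files in this leaf}
    cost: {(predicted, actual): penalty}; missing pairs cost 0.
    """
    total = sum(leaf_counts.values())

    def expected_cost(predicted):
        return sum(n / total * cost.get((predicted, actual), 0.0)
                   for actual, n in leaf_counts.items())

    return min(leaf_counts, key=expected_cost)

# Hypothetical leaf: most files are cold, a few are hot.
leaf = {"cold": 7, "hot": 3}

# Assumed penalties: sending a hot file to disk is ten times worse
# than keeping a cold file in RAM.
biased = {("cold", "hot"): 10.0, ("hot", "cold"): 1.0}
uniform = {("cold", "hot"): 1.0, ("hot", "cold"): 1.0}

print(min_cost_label(leaf, biased))
print(min_cost_label(leaf, uniform))
```

With uniform penalties the majority label wins; with the biased penalties the leaf predicts "hot" even though hot files are the minority, which is the kind of bias the slide describes. Choosing the penalty values is where system goals would have to be translated into numbers.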

Conclusion
- These trees provide prediction accuracies in the 90% range
- Adaptable via incremental learning
- Continued work: integration into the self-* infrastructure

Questions?
