File Classification in Self-* Storage Systems
Michael Mesnier, Eno Thereska, Gregory R. Ganger, Daniel Ellard, Margo Seltzer
Introduction
- Self-* infrastructure needs information about:
  - Users
  - Applications
  - Policies
- This information is not readily provided, and the system cannot depend on users to provide it
- So? It must be learned
Self-* Storage Systems
- A sub-problem of the self-* infrastructure
- Key: obtain hints from what creators associate with their files:
  - File size
  - File names
  - Lifetimes
- Once intentions are determined, decisions can be made
- Result: better file organization and performance
Classifying Files
- Current practice: rule-of-thumb policy selection
  - Generic, not optimized
- Better: distinguish classes of files
  - Enables finer-grained policies
  - Ideally assigned at file creation
- A self-* system must learn this association, either from
  1) traces, or
  2) a running file system
So, How?
- Create a model that classifies a file based on some of its attributes:
  - Name
  - Owner
  - Permissions
- Irrelevant attributes must be filtered out
  - The classifier learns rules to do so from a training set
- Inference then happens at file creation
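The attribute-filtering step above can be sketched with a standard information-gain ranking: attributes that do not separate the classes score near zero and can be dropped. This is a toy illustration, not the paper's actual implementation; the attribute names (`ext`, `owner`) and lifetime classes are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(samples, attr, label_key="cls"):
    """Reduction in class entropy from splitting the samples on one attribute."""
    base = entropy([s[label_key] for s in samples])
    groups = {}
    for s in samples:
        groups.setdefault(s[attr], []).append(s[label_key])
    split = sum(len(g) / len(samples) * entropy(g) for g in groups.values())
    return base - split

# Hypothetical training set: file-creation attributes -> lifetime class.
files = [
    {"ext": "tmp", "owner": "alice", "cls": "short-lived"},
    {"ext": "tmp", "owner": "bob",   "cls": "short-lived"},
    {"ext": "c",   "owner": "alice", "cls": "long-lived"},
    {"ext": "c",   "owner": "bob",   "cls": "long-lived"},
]

# Rank attributes by information gain; irrelevant ones score near zero.
for attr in ("ext", "owner"):
    print(attr, round(info_gain(files, attr), 3))  # ext 1.0, owner 0.0
```

Here `ext` perfectly predicts the class (gain 1.0) while `owner` carries no signal (gain 0.0), so a learned ruleset would keep the former and discard the latter.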
The Right Model
- The model must be:
  - Scalable
  - Dynamic
  - Cost-sensitive (accounts for misprediction cost)
  - Interpretable (by humans)
- Model selected: decision trees
ABLE
- Attribute-Based Learning Environment
  1. Obtain traces
  2. Build a decision tree
  3. Make predictions
- The tree is built top-down until all attributes are used
  - Samples are split until each leaf holds files with similar attributes
- After the tree is built, querying begins
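The top-down induction described above can be sketched as a minimal ID3-style builder: split recursively until a leaf is pure or attributes run out, then answer queries by walking the tree. This is a simplified sketch, not ABLE's actual implementation; it takes attributes in a fixed order rather than choosing the best split, and the toy training set is hypothetical.

```python
from collections import Counter

def majority(samples, label_key="cls"):
    """Most common class among the samples."""
    return Counter(s[label_key] for s in samples).most_common(1)[0][0]

def build_tree(samples, attrs, label_key="cls"):
    """Top-down induction: split until leaves are pure or attributes run out."""
    labels = {s[label_key] for s in samples}
    if len(labels) == 1 or not attrs:
        return majority(samples, label_key)      # leaf: predicted class
    attr = attrs[0]                              # naive fixed order (real systems pick the best split)
    branches = {}
    for s in samples:
        branches.setdefault(s[attr], []).append(s)
    return (attr,
            {v: build_tree(g, attrs[1:], label_key) for v, g in branches.items()},
            majority(samples, label_key))        # fallback for unseen values

def predict(tree, sample):
    """Walk internal (attr, branches, fallback) nodes down to a leaf label."""
    while isinstance(tree, tuple):
        attr, branches, fallback = tree
        tree = branches.get(sample[attr], fallback)
    return tree

# Hypothetical training set: file-creation attributes -> lifetime class.
files = [
    {"ext": "tmp", "owner": "alice", "cls": "short-lived"},
    {"ext": "tmp", "owner": "bob",   "cls": "short-lived"},
    {"ext": "c",   "owner": "alice", "cls": "long-lived"},
    {"ext": "c",   "owner": "bob",   "cls": "long-lived"},
]

tree = build_tree(files, ["ext", "owner"])
print(predict(tree, {"ext": "tmp", "owner": "carol"}))  # short-lived
```

A new file's attributes are queried against the finished tree at creation time, which is what lets the system assign a class (and hence a policy) before any access history exists.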
Tests
- Run against several systems to show the approach is workload-independent:
  - DEAS03
  - EECS03
  - CAMPUS
  - LAB
- Control: the MODE algorithm, which places all files in a single cluster
Results
- Prediction results are quite good: 90%-100% accuracy claimed
- Files cluster clearly by their attributes
- A model's ruleset is predicted to converge over time
Benefits of Incremental Learning
- Dynamically refines the model as samples become available
- Generally better than one-shot learners
  - One-shot learners sometimes perform poorly
- Rulesets of incremental learners are smaller
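The contrast above can be illustrated with a toy incremental classifier that refines per-attribute class counts one sample at a time, falling back to the overall mode (the control from the tests) for unseen attribute values. This is only an illustrative sketch of the incremental idea, not the tree-based learner the slides describe; all names and classes here are hypothetical.

```python
from collections import Counter, defaultdict

class IncrementalClassifier:
    """Toy incremental learner: per-attribute-value class counts,
    refined one sample at a time rather than retrained from scratch."""

    def __init__(self, attr):
        self.attr = attr
        self.counts = defaultdict(Counter)   # attr value -> class counts
        self.overall = Counter()             # class counts over all samples

    def update(self, sample, label):
        """Fold one new labeled sample into the model."""
        self.counts[sample[self.attr]][label] += 1
        self.overall[label] += 1

    def predict(self, sample):
        c = self.counts.get(sample[self.attr])
        if c:
            return c.most_common(1)[0][0]
        # MODE-style fallback: most common class seen so far.
        return self.overall.most_common(1)[0][0]

model = IncrementalClassifier("ext")
stream = [
    ({"ext": "tmp"}, "short-lived"),
    ({"ext": "c"},   "long-lived"),
    ({"ext": "tmp"}, "short-lived"),
]
for sample, label in stream:       # samples arrive over time
    model.update(sample, label)

print(model.predict({"ext": "tmp"}))  # short-lived
```

The point of the sketch: each `update` is cheap and local, so the model keeps improving as the file system runs, whereas a one-shot learner is frozen at whatever its original training window happened to contain.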
On Accuracy
- More attributes mean a greater chance of over-fitting
- More rules mean smaller compression ratios
  - The ruleset loses its compression benefit
- Predictive models can make false predictions, which can hurt performance
  - e.g., files that should be in RAM are placed on disk instead
- Solution: cost functions
  - Penalize errors to create a deliberately biased tree
  - System goals must be translated into the cost function
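The cost-function idea above can be sketched as cost-sensitive leaf labeling: instead of taking the majority class at a leaf, pick the label that minimizes expected misclassification cost. The cost matrix below is a hypothetical example (demoting a hot file to disk is assumed far more expensive than the reverse), not figures from the paper.

```python
from collections import Counter

# Hypothetical misclassification-cost matrix: COST[true_class][predicted_class].
COST = {
    "hot":  {"hot": 0, "cold": 10},  # hot file sent to disk: expensive mistake
    "cold": {"hot": 1, "cold": 0},   # cold file kept in RAM: cheap mistake
}

def cheapest_label(labels):
    """Pick the leaf label minimizing expected misclassification cost,
    rather than the plain majority class."""
    counts = Counter(labels)

    def expected_cost(pred):
        return sum(n * COST[true][pred] for true, n in counts.items())

    return min(COST, key=expected_cost)

# Majority voting would say "cold" (3 vs 2), but the biased tree keeps "hot",
# because one misplaced hot file costs more than several misplaced cold ones.
print(cheapest_label(["hot", "hot", "cold", "cold", "cold"]))  # hot
```

Translating system goals into the model then amounts to choosing the entries of `COST`: the relative penalties encode which misplacements the storage system can and cannot afford.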
Conclusion
- Decision trees provide prediction accuracies in the 90% range
- Adaptable via incremental learning
- Continued work: integration into the self-* infrastructure
Questions?