
1 SCAN-Lite: Enterprise-wide analysis on the cheap
Craig Soules, Kimberly Keeton, Brad Morrey
© 2009 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

2 Enterprise information management
Search, Clustering, Provenance, Classification, IT Trending, Virus scanning
[Diagram: Metadata Server]

3 Enterprise information management
Data is duplicated across machines! Duplicate analysis is wasted work.
[Diagram: Metadata Server]

4 Issues
Analysis programs conflict on clients
−Contend for system resources (memory, disk)
Clients repeat work
−Duplicate files on multiple clients
Client foreground workloads are impacted
−Work exceeds available idle time on busy clients

5 Approaches
Reduce resource contention
[Diagram: Client]

6 Approaches
Avoid duplicate work
[Diagram: Clients]

7 Approaches
Leverage duplication to balance client load
−Delay analysis to identify all duplicates
[Diagram: Clients, Global Scheduler]

8 Solutions
Local scheduler
−Coordinates analyses to reduce resource contention
−Up to 60% improvement
Global scheduler
−Identifies duplicates to remove work
−Balances load
−40% reduction in impact on foreground tasks

9 Local scheduling
Traditionally, analyses are separate programs
−Scheduling left to the operating system, potentially at different times
−Each program identifies files to scan
−Each program opens and reads file data
[Diagram: Analysis Programs, Disk]

10 Unified local scheduling
Each analysis routine is a separate thread
Control thread manages shared tasks
−Identifies files to scan, and opens/reads file data
Shared memory buffer distributes file data
[Diagram: Disk, Control Thread, Analysis Plugins, Shared Memory]
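The unified design above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: all names (`control_thread`, `unified_scan`) are ours, and SHA-1 stands in for arbitrary per-plugin analysis. One control thread reads each chunk of file data once and fans it out over per-plugin queues, so N analysis routines share a single disk pass.

```python
import hashlib
import queue
import threading

def control_thread(chunks, plugin_queues):
    """Read data once and distribute it to every analysis plugin."""
    for chunk in chunks:
        for q in plugin_queues:
            q.put(chunk)
    for q in plugin_queues:
        q.put(None)  # sentinel: no more data

def make_plugin(name, q, results):
    """Each analysis routine runs in its own thread over the shared data."""
    def run():
        h = hashlib.sha1()  # placeholder for a real analysis routine
        while True:
            chunk = q.get()
            if chunk is None:
                break
            h.update(chunk)
        results[name] = h.hexdigest()
    return threading.Thread(target=run)

def unified_scan(chunks, plugin_names):
    queues = [queue.Queue() for _ in plugin_names]
    results = {}
    threads = [make_plugin(n, q, results) for n, q in zip(plugin_names, queues)]
    for t in threads:
        t.start()
    control_thread(chunks, queues)
    for t in threads:
        t.join()
    return results

results = unified_scan([b"file", b"data"], ["scanner_a", "scanner_b"])
print(results)
```

Because every plugin consumes the same chunk stream, the disk is read once regardless of how many routines run.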

11 Local scheduling performance
Ran a fitness test using 7 analysis routines
−42 data sets, each containing files of a fixed size
−Ran both approaches over each data set
−Calculated per-file elapsed scan time
−Dual-core 2.8 GHz P4 Xeon, 4 GB RAM, 70 GB RAID 1
Seven-at-once
−Run each analysis routine separately, at the same time
Unified
−SCAN-Lite's unified local scheduling approach

12 Elapsed time vs. CPU time
Original fitness test used CPU time
−Gave less variable performance curves for modeling
Disk contention shows up in elapsed time
−CPU time is multiplexed; elapsed time is not
[Diagram: App 1 and App 2 timelines comparing the sum of CPU times against the max and sum of elapsed times; annotation: "This is very bad"]

13 Local scheduling results
17%–60% improvement
Small random I/Os interact worse than larger ones
Seven-at-once benefits from deep disk queues, but this hurts foreground apps

14 Global scheduler
Two goals:
−Reduce additional work from duplicate files
−Utilize duplication to schedule work to the "best" client
Two-phase scanning
−Phase one: identify duplicate files using content hashing
−Phase two: analyze one copy at the appropriate client
−Delaying between phases one and two provides opportunity for additional duplication and deletion
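Phase one amounts to grouping files by content hash. A minimal sketch, with hypothetical names and in-memory file contents standing in for real client uploads:

```python
import hashlib
from collections import defaultdict

def phase_one(client_files):
    """Map content hash -> list of (client, path) pairs holding that content."""
    groups = defaultdict(list)
    for client, files in client_files.items():
        for path, data in files.items():
            digest = hashlib.sha1(data).hexdigest()
            groups[digest].append((client, path))
    return groups

groups = phase_one({
    "laptop-1": {"/a.doc": b"quarterly report", "/b.txt": b"notes"},
    "blade-7": {"/x.doc": b"quarterly report"},
})

# Phase two would schedule one scan per hash group; groups with more
# than one holder are the duplicates where work can be eliminated.
duplicated = [g for g in groups.values() if len(g) > 1]
print(duplicated)
```

Each duplicate group gives the scheduler a choice of clients, which is what the load-balancing phase exploits.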

15 Traditional scanning
[Diagram: Clients, Server]

16 Phase one: Duplicate detection
[Diagram: Clients, Server]

17 Phase two: Scheduling
[Diagram: Clients, Server]

18 When to schedule
Clients upload hashes each scheduling period
The freshness delay specifies a deadline by which new data must be analyzed
[Timeline: scheduling periods within the freshness window — scheduling immediately gives one placement option; scheduling just before the deadline gives three options]

19 How to schedule
Scheduling is a bin-packing problem
−Files are balls, clients are bins
−Size of bins is available idle time
−Color of balls/bins equates to location of duplicates
−Size of balls is time required for analysis
[Diagram: files packed into clients A–D by idle time]

20 How to schedule
We use a greedy heuristic for scheduling
−Considers idle time and machine priorities
−See paper for details
[Diagram: files packed into clients A–D by idle time]
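The balls-into-bins framing can be sketched as a worst-fit greedy loop. This is our simplified illustration, not the paper's heuristic (which also weighs machine priorities and ordering): each file is placed on whichever client holding a copy has the most idle time remaining.

```python
def greedy_schedule(files, idle_time):
    """files: list of (file_id, scan_cost, holders); idle_time: client -> hours.

    Returns the chosen client per file and each client's remaining idle time.
    """
    remaining = dict(idle_time)
    assignment = {}
    # Place expensive scans first, while idle time is still plentiful.
    for fid, cost, holders in sorted(files, key=lambda f: -f[1]):
        # Worst-fit: pick the holder with the most remaining idle time.
        best = max(holders, key=lambda c: remaining[c])
        assignment[fid] = best
        remaining[best] -= cost
    return assignment, remaining

files = [
    ("report.doc", 5, ["A", "B"]),   # duplicated on clients A and B
    ("notes.txt", 2, ["A"]),
    ("slides.ppt", 4, ["B", "C"]),
]
assignment, remaining = greedy_schedule(files, {"A": 8, "B": 6, "C": 3})
print(assignment)
```

Duplicated files like `report.doc` give the scheduler a choice of bins; single-copy files are pinned to their one holder.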

21 Work ahead
Start by scheduling all work that meets freshness
Schedule additional work on still-idle machines
−Any remaining idle time can be used for additional work
−We refer to this as work ahead
[Diagram: files packed into clients A–D by idle time]

22 Two-phase scanning: Trade-offs
[Diagram: Clients — two-phase cost vs. one-phase cost]

23 Two-phase scanning: Trade-offs
[Diagram: Clients — two-phase cost vs. one-phase cost]

24 Two-phase scanning: Trade-offs
If the cost of hashing exceeds the additional work from duplicates, then one-phase scanning is better
Analysis of hashing costs using SHA-1 indicates that 3% data duplication is the minimum
−Do we see that in practice?
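The break-even point above comes from a simple comparison. The throughput numbers below are illustrative assumptions, not the paper's measurements: if SHA-1 hashing is roughly 33x cheaper per byte than running the full analysis suite, two-phase scanning wins once duplication exceeds ~3%.

```python
hash_cost_per_mb = 3.0        # ms per MB, assumed cost of SHA-1 hashing
analysis_cost_per_mb = 100.0  # ms per MB, assumed cost of full analysis

# One-phase: analyze every copy. Two-phase: hash every copy, then
# analyze one copy per duplicate group. With duplication fraction d,
# two-phase saves d * analysis cost but pays the full hashing cost,
# so it wins when d exceeds hash_cost / analysis_cost.
break_even = hash_cost_per_mb / analysis_cost_per_mb
print(f"break-even duplication: {break_even:.0%}")

def one_phase(total_mb):
    return total_mb * analysis_cost_per_mb

def two_phase(total_mb, dup_fraction):
    return (total_mb * hash_cost_per_mb
            + total_mb * (1 - dup_fraction) * analysis_cost_per_mb)

# At the ~10% duplication observed in the enterprise datasets,
# two-phase scanning is cheaper than one-phase scanning.
print(two_phase(1000, 0.10) < one_phase(1000))
```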

25 Duplication in enterprise data
Examined two data sources:
−100 user home directories from a central server
−12 user productivity machines
Both datasets showed ~10% duplication
−Even more with system files, email servers, sharepoints, etc.
This is sufficient duplication for work reduction
[Diagram: data set with 2+1 copies per hash = 4/7 duplication]

26 Global scheduling policies
Traditional
−One-phase scanning, scan all copies
Rand
−Two-phase scanning, random scheduling
BestPlace
−Two-phase scanning, greedy scheduling
BestPlaceTime
−Two-phase scanning, greedy scheduling + work ahead
Opt
−Unreplicated data only, delayed + work ahead

27 Metrics
Total Work
−Total elapsed time spent on analysis and hashing
Client Impact
−Time spent that exceeded client idle time
[Diagram: client timeline showing total work, idle time, and client impact]

28 Metrics
Metrics are calculated for each day and summed over the entire simulation period
[Diagram: client timeline showing total work, idle time, and client impact]

29 Experimental setup
Implemented a simulator to test a variety of machine configurations and scheduling policies
−Config: 50 high-priority blades, 50 low-priority laptops
−Blades were modeled after: dual-core 2.8 GHz P4 Xeon, 4 GB RAM, 70 GB RAID 1
−Laptops were modeled after: 2 GHz Pentium M, 1.5 GB RAM, 60 GB SATA
Simulated 30 days
−Daily creation rates and layouts from traced workloads
−Freshness of 3 days, scheduling period of 1 day

30 Total work
Removing duplicate work reduces the total work done
Preferring faster blade machines over laptops increases their total work but reduces client impact
Doing work ahead of the freshness delay means analyzing some files that would otherwise have been deleted

31 Client impact
Less work means less impact
Choosing the best place helps hit the idle time targets, reducing average client impact
By doing work ahead of the freshness deadline, SCAN-Lite takes better advantage of idle time
[Chart annotations: 40% improvement; theoretical Opt is only 8% better than BestPlaceTime]

32 Summary
Reducing local scanning interference is critical
−17%–60% improvement from reduced contention
Two-phase scanning reduces analysis overheads
−Reduces total work to near single-copy costs
−Reduced client impact by up to 40% on our workload

33 Future work
This is an initial system for reducing analysis costs
−Many improvements remain!
Vary freshness delays
−Different applications may have different requirements
Provide freshness and scan priorities to clients
−Could prioritize scan order to not exceed client idle times
Try more workloads
−May need better bin-packing algorithms

34 Summary
Ever-increasing number of analyses in the enterprise
−Search, provenance, trending, clustering, classification, etc.
Local scheduling to reduce resource contention on clients
−Up to 60% performance improvement
Two-phase scanning to reduce work and balance load
−Delay analysis to identify duplicate work
−Global scheduling to balance load
−Reduced client impact by up to 40% on our workload

35 Getting a handle on enterprise data
Unstructured information growing at XX% per year
Increasing number of needs for metadata
−eDiscovery
−Worker productivity and search
−IT trending and historical analysis
Lots of different analyses to perform
−Term vectors, fingerprints, feature vectors, usage statistics, etc.
Data is spread across file servers, web servers, email servers, laptops, desktops, backups, etc.

36 Where to perform analysis?
On backups?
−Not all data is backed up, encrypted, or utilized
On idle servers?
−Requires data migration strategies; may break privacy
On end nodes?
−May interrupt foreground workloads, frustrating users
All solutions want to minimize work and balance load to reduce required resources

37 The problems
Most analysis tools run in isolation
−Tools compete for resources locally, creating interference
Replicated data creates replicated work
−Tools produce the same results in multiple locations
Machines have different characteristics
−Creation rates, performance, idle time, etc.
Goal: perform analysis at the best time and place

38 Best place and time?
[Diagram: available time on machines A–D across time]

39 Solution: Improve scheduling
Local scheduler to coordinate analysis tasks
−Single resource controller to prevent competition
Global scheduler to single-instance analysis
−Centralizes the decision of when and where to analyze

40 Local scheduling
Prefetch thread reads data from disk once
Analysis routines run in separate parallel threads
Shared memory buffer distributes data to routines
[Diagram: Files → Prefetch Thread → Producer/Consumer Buffer → Analysis Threads]

41 Traditional: One-phase scanning
[Diagram labels: Client (Apps, Files, Analysis Programs); Server (Metadata Store); flow: Metadata]

42 SCAN-Lite: Two-phase scanning
[Diagram labels: Client (Apps, Files, Analysis Plugins, Local Scheduler, Fitness Test, Performance Models, Idle Time Estimation); Server (Metadata Store, Global Scheduler); flows: Hashes, Metadata, Utilization Statistics]

43 Global scheduling
Time is broken into scheduling periods based on some freshness delay (the maximum time until data must be scanned)
At the start of each scheduling period, the global scheduler picks which client will scan which data
First, schedule data that has met its freshness delay
−Idle time, priorities, worst-fit, and ordering
Second, schedule any possible additional data
−Work-ahead

44 Idle time, priorities, and worst-fit
For a given piece of data:
−Choose the set of machines that have available idle time (if none, then choose all machines)
−From that, choose the machines with the highest priority
−From that, choose the machine with the most idle time (if none, choose the machine with the least client impact)
[Diagram: machines at priorities P1 and P2 with idle time and work assigned]
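The selection cascade above can be sketched as a short filter chain. Function and variable names here are ours, not the paper's: keep machines with idle time left (falling back to all machines), narrow to the highest-priority subset, then take the one with the most idle time, or the least accumulated client impact if none of them has idle time remaining.

```python
def pick_machine(candidates, idle_left, priority, impact):
    """Worst-fit machine selection with priority classes and impact fallback."""
    # Step 1: machines with available idle time; if none, consider all.
    with_idle = [m for m in candidates if idle_left[m] > 0] or list(candidates)
    # Step 2: keep only the highest-priority machines.
    top = max(priority[m] for m in with_idle)
    best_prio = [m for m in with_idle if priority[m] == top]
    # Step 3: most idle time wins; otherwise least client impact so far.
    if any(idle_left[m] > 0 for m in best_prio):
        return max(best_prio, key=lambda m: idle_left[m])
    return min(best_prio, key=lambda m: impact[m])

machines = ["blade-1", "blade-2", "laptop-1"]
idle = {"blade-1": 2.0, "blade-2": 5.0, "laptop-1": 9.0}
prio = {"blade-1": 2, "blade-2": 2, "laptop-1": 1}  # blades preferred
impact = {"blade-1": 0.0, "blade-2": 0.0, "laptop-1": 0.0}
print(pick_machine(machines, idle, prio, impact))
```

Note the laptop loses despite having the most idle time, because the priority filter runs before the worst-fit step.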

45 Ordering
There is still a problem:
[Diagram: machines at priorities P1 and P2 with idle time; assignment order P2, P1, P2, P1]

46 Ordering
Assign each piece of data a number based on the number of machines at each priority class
Order all data by its ordering number
[Diagram: machines at P1 and P2 with idle time; priority ordering P3 > P2 > P1 with example ordering numbers]

47 Work ahead
Once all data that has met its freshness delay has been scheduled, assign additional data to any machines with available idle time
[Diagram: machines at priorities P1 and P2 with idle time and work assigned]

48 How to schedule
First, schedule any work that will meet its freshness deadline during this scheduling period
Second, schedule any additional work that will fit within the remaining idle time of clients

49 Local scheduling results

50 Local "performance improvements"
What happens when one or more analysis routines try to "improve performance"?
For example, using direct I/O to reduce memory footprint, and thus impact on client workloads
Seven Direct
−Analysis programs implement direct I/O
Unified Direct
−SCAN-Lite implements direct I/O

51 Local scheduling with direct I/O
