DryadLINQ: Computer Vision (among other things) on a cluster. ECCV AC workshop, 14th June, 2008. Michael Isard, Microsoft Research, Silicon Valley.


1 DryadLINQ: Computer Vision (among other things) on a cluster
ECCV AC workshop, 14th June, 2008
Michael Isard, Microsoft Research, Silicon Valley

2 Parallel programming, yada yada
Intel claims we will all have many-core, etc.
“This algorithm is easily parallelizable”
  – Not “we implemented a parallel version”
Historically, low-latency fine-grain parallelism
  – Shared-memory SMP (threads, locks, etc.)
  – MPI (finite-element analysis, etc.)
But also data-parallel!
  – We have lots of data now (video, the web)
  – But most people still use their laptops/toy data
  – Even “big” systems use tens of computers

3 Why do people use Matlab?
Parallel programming is tedious and complex
  – Distributed programming is even worse
  – Perl scripts, manual management of data, …
Matlab is easy (or at least popular)
  – Relatively few high-level constructs
  – System “does the right thing”
  – Programmers are willing to put up with a lot
We want a similarly low barrier to entry
  – Familiar languages, legacy codebase, etc.

4 What are we doing?
When single-computer processing runs out of steam
  – Web-scale processing of terabytes of data
      – Infeasible without a big cluster
  – Network log-mining, machine learning
      – Multi-week job → 4 hours on 250 computers
      – 1-hour iteration → 3.5 minutes on 4 computers

5 A typical data-intensive query
Ulfar’s most frequently visited web pages:

    var logentries =
        from line in logs
        where !line.StartsWith("#")
        select new LogEntry(line);
    var user =
        from access in logentries
        where access.user.EndsWith(@"\ulfar")
        select access;
    var accesses =
        from access in user
        group access by access.page into pages
        select new UserPageCount("ulfar", pages.Key, pages.Count());
    var htmAccesses =
        from access in accesses
        where access.page.EndsWith(".htm")
        orderby access.count descending
        select access;

6 Steps in the query
Go through logs and keep only lines that are not comments. Parse each line into a LogEntry object.

    var logentries =
        from line in logs
        where !line.StartsWith("#")
        select new LogEntry(line);

Go through logentries and keep only entries that are accesses by ulfar.

    var user =
        from access in logentries
        where access.user.EndsWith(@"\ulfar")
        select access;

Group ulfar’s accesses according to what page they correspond to. For each page, count the occurrences.

    var accesses =
        from access in user
        group access by access.page into pages
        select new UserPageCount("ulfar", pages.Key, pages.Count());

Sort the pages ulfar has accessed according to access frequency.

    var htmAccesses =
        from access in accesses
        where access.page.EndsWith(".htm")
        orderby access.count descending
        select access;

7 Serial execution
For each line in logs, do…

    var logentries =
        from line in logs
        where !line.StartsWith("#")
        select new LogEntry(line);

For each entry in logentries, do…

    var user =
        from access in logentries
        where access.user.EndsWith(@"\ulfar")
        select access;

Sort entries in user by page. Then iterate over the sorted list, counting the occurrences of each page as you go.

    var accesses =
        from access in user
        group access by access.page into pages
        select new UserPageCount("ulfar", pages.Key, pages.Count());

Re-sort entries in accesses by page frequency.

    var htmAccesses =
        from access in accesses
        where access.page.EndsWith(".htm")
        orderby access.count descending
        select access;
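The serial plan on this slide can be sketched with plain loops. This is an illustrative single-machine sketch, not DryadLINQ code: the `SerialPlan` class, the `TopHtmPages` method, and the simplified log format (`<user> <page>` separated by one space) are assumptions made for the example and are simpler than the `LogEntry` parsing used in the talk.

```csharp
using System;
using System.Collections.Generic;

public static class SerialPlan
{
    // Hypothetical simplified log line: "<user> <page>"; lines starting with '#' are comments.
    public static List<KeyValuePair<string, int>> TopHtmPages(
        IEnumerable<string> logs, string userSuffix)
    {
        // Steps 1 and 2: for each line, drop comments, parse, keep only the chosen user.
        var pages = new List<string>();
        foreach (string line in logs)
        {
            if (line.StartsWith("#")) continue;
            string[] fields = line.Split(' ');
            if (fields.Length < 2) continue;
            if (fields[0].EndsWith(userSuffix)) pages.Add(fields[1]);
        }

        // Step 3: sort by page, then count runs of equal pages in a single pass.
        pages.Sort(StringComparer.Ordinal);
        var counts = new List<KeyValuePair<string, int>>();
        int i = 0;
        while (i < pages.Count)
        {
            int j = i;
            while (j < pages.Count && pages[j] == pages[i]) j++;
            counts.Add(new KeyValuePair<string, int>(pages[i], j - i));
            i = j;
        }

        // Step 4: keep .htm pages and re-sort by descending count.
        counts.RemoveAll(kv => !kv.Key.EndsWith(".htm"));
        counts.Sort((a, b) => b.Value.CompareTo(a.Value));
        return counts;
    }
}
```

Every step walks the whole dataset on one machine, which is the cost the parallel plan is meant to spread out.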

8 Parallel execution

    var logentries =
        from line in logs
        where !line.StartsWith("#")
        select new LogEntry(line);
    var user =
        from access in logentries
        where access.user.EndsWith(@"\ulfar")
        select access;
    var accesses =
        from access in user
        group access by access.page into pages
        select new UserPageCount("ulfar", pages.Key, pages.Count());
    var htmAccesses =
        from access in accesses
        where access.page.EndsWith(".htm")
        orderby access.count descending
        select access;
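The diagram for this slide is not part of the transcript, so the following is only one plausible two-stage plan, not necessarily the plan DryadLINQ generates. It assumes the log is already split into partitions, counts pages independently per partition, then merges the partial counts. PLINQ (`AsParallel`) stands in for the cluster; `ParallelPlan`, `CountPartition`, and the `<user> <page>` line format are hypothetical names and simplifications for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParallelPlan
{
    // Stage 1 (independent per partition): filter, parse, and pre-count pages.
    static Dictionary<string, int> CountPartition(
        IEnumerable<string> partition, string userSuffix)
    {
        return partition
            .Where(line => !line.StartsWith("#"))
            .Select(line => line.Split(' '))
            .Where(f => f.Length >= 2 && f[0].EndsWith(userSuffix))
            .GroupBy(f => f[1])
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static List<KeyValuePair<string, int>> TopHtmPages(
        IEnumerable<IEnumerable<string>> partitions, string userSuffix)
    {
        // Partitions may be processed concurrently; no partition needs another's data.
        var partials = partitions
            .AsParallel()
            .Select(p => CountPartition(p, userSuffix))
            .ToList();

        // Stage 2: merge partial counts by page, then filter and sort once.
        return partials
            .SelectMany(d => d)
            .GroupBy(kv => kv.Key, kv => kv.Value)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Sum()))
            .Where(kv => kv.Key.EndsWith(".htm"))
            .OrderByDescending(kv => kv.Value)
            .ToList();
    }
}
```

Pre-counting inside each partition keeps the merge stage small: it receives one count per page per partition rather than every log entry.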

9 Linear Regression

    Vectors x = input(0), y = input(1);

    Matrices xx = x.PairwiseOuterProduct(x);
    OneMatrix xxs = xx.Sum();

    Matrices yx = y.PairwiseOuterProduct(x);
    OneMatrix yxs = yx.Sum();

    OneMatrix xxinv = xxs.Map(a => a.Inverse());
    OneMatrix A = yxs.Map(xxinv, (a, b) => a.Multiply(b));
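Assuming the method names mean what they suggest (the transcript does not define them), and writing x_i and y_i for the i-th input and output vectors, the program appears to compute the ordinary least-squares solution of y ≈ Ax:

```latex
\begin{aligned}
\min_A \sum_i \lVert y_i - A x_i \rVert^2
\;\Longrightarrow\;
A \sum_i x_i x_i^{\mathsf T} &= \sum_i y_i x_i^{\mathsf T} \\
A &= \Bigl(\sum_i y_i x_i^{\mathsf T}\Bigr)\Bigl(\sum_i x_i x_i^{\mathsf T}\Bigr)^{-1}
\end{aligned}
```

Under that reading, `xxs` is the sum of x_i x_iᵀ, `yxs` is the sum of y_i x_iᵀ, `xxinv` is the inverse of `xxs`, and `A` is the product `yxs` times `xxinv`. The two sums are the only steps that touch the full dataset, and each is a data-parallel map followed by a reduction.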

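10 Execution Graph
[Figure: execution graph for the linear regression. Inputs X[0], X[1], X[2] and Y[0], Y[1], Y[2]; vertices X×Xᵀ, Y×Xᵀ, two Σ (sum) vertices, [ ]⁻¹ (inverse), and * (multiply); output A.]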

11 DryadLINQ
Programmer writes sequential C# code
  – Rich type system, libraries, modules, loops…
  – System can figure out data-parallelism
      – Sees declarative expression plans
      – Full control of high-level optimizations
      – Traditional parallel-database tricks

12 Dryad execution engine
General-purpose execution environment for distributed, data-parallel applications
  – Concentrates on throughput, not latency
  – Assumes private data center
Automatic management of scheduling, distribution, fault tolerance, etc.
Well tested over two years on clusters of thousands of computers
Andrew Birrell, Mihai Budiu, Dennis Fetterly, Michael Isard, Yuan Yu

13 Job = Directed Acyclic Graph
[Figure labels: Processing vertices; Channels (file, pipe, shared memory); Inputs; Outputs]

14 Scheduler state machine
Scheduling a DAG
  – Vertex can run anywhere once all its inputs are ready
      – Constraints/hints place it near its inputs
  – Fault tolerance
      – If A fails, run it again
      – If A’s inputs are gone, run upstream vertices again (recursively)
      – If A is slow, run another copy elsewhere and use output from whichever finishes first
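The first two rules (run a vertex once all its inputs are ready; if it fails, run it again) can be sketched as a small single-threaded simulation. This is not Dryad's scheduler: `Vertex`, `DagScheduler`, and `maxAttempts` are hypothetical names for the example, and the sketch deliberately leaves out placement hints, re-running upstream vertices when inputs are lost, and speculative duplicate execution.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class Vertex
{
    public string Name;
    public List<Vertex> Inputs = new List<Vertex>();
    public Func<bool> Run;   // returns false to simulate a failed execution
    public bool Done;
}

public static class DagScheduler
{
    // Runs each vertex once all of its inputs are done, retrying failed vertices.
    // Returns the order in which vertices completed.
    public static List<string> Execute(IEnumerable<Vertex> vertices, int maxAttempts = 3)
    {
        var pending = vertices.ToList();
        var attempts = new Dictionary<Vertex, int>();
        var completed = new List<string>();

        while (pending.Count > 0)
        {
            // A vertex is ready when every one of its inputs has finished.
            var ready = pending.Where(v => v.Inputs.All(i => i.Done)).ToList();
            if (ready.Count == 0)
                throw new InvalidOperationException("No runnable vertex: cycle or missing input.");

            foreach (var v in ready)
            {
                attempts.TryGetValue(v, out int n);
                if (n >= maxAttempts)
                    throw new InvalidOperationException(v.Name + " failed too many times.");
                attempts[v] = n + 1;

                // On failure the vertex simply stays pending and is retried next round.
                if (v.Run())
                {
                    v.Done = true;
                    completed.Add(v.Name);
                    pending.Remove(v);
                }
            }
        }
        return completed;
    }
}
```

Because a vertex only depends on the completed outputs of its inputs, re-running it after a failure needs no coordination with the rest of the graph.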

15 Static/dynamic optimizations
Static optimizer builds execution graph
Dynamic optimizer mutates running graph
  – Picks number of partitions when size is known
  – Builds aggregation trees based on locality

16 LINQ
Constructs/type system in .NET v3.5
Operators to manipulate datasets
  – Data elements are arbitrary .NET types
Traditional relational operators
  – Select, Join, Aggregate, etc.
Extensible
  – Add new operators
  – Add new implementations

17 DryadLINQ
Automatically distribute a LINQ program
Few Dryad-specific extensions
  – Same source program runs on single-core through multi-core up to cluster
Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey

18 A complete DryadLINQ program

    public class LogEntry
    {
        public string user;
        public string ip;
        public string page;

        // Parse one space-separated log line into its fields.
        public LogEntry(string line)
        {
            string[] fields = line.Split(' ');
            this.user = fields[8];
            this.ip = fields[9];
            this.page = fields[5];
        }
    }

    public class UserPageCount
    {
        public string user;
        public string page;
        public int count;

        public UserPageCount(string user, string page, int count)
        {
            this.user = user;
            this.page = page;
            this.count = count;
        }
    }

    DryadDataContext ddc = new DryadDataContext("fs://logfile");
    DryadTable<string> logs = ddc.GetTable<string>();

    var logentries =
        from line in logs
        where !line.StartsWith("#")
        select new LogEntry(line);
    var user =
        from access in logentries
        where access.user.EndsWith(@"\ulfar")
        select access;
    var accesses =
        from access in user
        group access by access.page into pages
        select new UserPageCount("ulfar", pages.Key, pages.Count());
    var htmAccesses =
        from access in accesses
        where access.page.EndsWith(".htm")
        orderby access.count descending
        select access;

    htmAccesses.ToDryadTable("fs://results");

19 DryadLINQ: From LINQ to Dryad
Automatic query plan generation
Distributed query execution by Dryad
[Figure labels: LINQ query; Query plan (select, where, logs); Dryad]

    var logentries =
        from line in logs
        where !line.StartsWith("#")
        select new LogEntry(line);

20 How does it work?
Sequential code “operates” on datasets
But really just builds an expression graph
  – Lazy evaluation
When a result is retrieved
  – Entire graph is handed to DryadLINQ
  – Optimizer builds efficient DAG
  – Program is executed on cluster
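Standard LINQ shows the same lazy behaviour on a single machine, which makes the idea easy to try out. The sketch below uses only ordinary .NET LINQ, not DryadLINQ; `LazyDemo` and its return tuple are hypothetical names for the example. It counts how often the filter runs before and after a result is retrieved, and prints the expression tree that an `IQueryable` query carries, which is the kind of structure a query provider can inspect and translate.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyDemo
{
    // Returns the number of filter calls before and after retrieving a result,
    // plus a textual form of the query's expression tree.
    public static (int before, int after, string plan) Run()
    {
        int calls = 0;
        var data = new List<int> { 1, 2, 3, 4 };

        // Building the query only records the operators; nothing executes yet.
        IEnumerable<int> evens = data.Where(x => { calls++; return x % 2 == 0; });
        int before = calls;

        // Retrieving a result forces evaluation of the whole pipeline.
        evens.ToList();
        int after = calls;

        // With IQueryable the query is kept as an expression tree rather than compiled code.
        IQueryable<int> q = data.AsQueryable().Where(x => x % 2 == 0);
        string plan = q.Expression.ToString();

        return (before, after, plan);
    }
}
```

In this sketch the filter has not run at all when the query is built and runs once per element when `ToList` is called. The deck describes DryadLINQ as taking the whole expression graph at that point and executing it on the cluster.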

21 Terasort
10 billion 100-byte records (10^12 bytes)
240 computers, 960 disks
  – 349 secs
      – Comparable with record

    public struct TeraRecord : IComparable<TeraRecord>
    {
        public const int RecordSize = 100;
        public const int KeySize = 10;
        public byte[] content;

        // Order records by their first KeySize bytes.
        public int CompareTo(TeraRecord rec)
        {
            for (int i = 0; i < KeySize; i++)
            {
                int cmp = this.content[i] - rec.content[i];
                if (cmp != 0) return cmp;
            }
            return 0;
        }

        public static TeraRecord Read(DryadBinaryReader rd)
        {
            TeraRecord rec;
            rec.content = rd.ReadBytes(RecordSize);
            return rec;
        }

        public static int Write(DryadBinaryWriter wr, TeraRecord rec)
        {
            return wr.WriteBytes(rec.content);
        }
    }

    class Terasort
    {
        public static void Main(string[] args)
        {
            DryadDataContext ddc = new DryadDataContext(@"file://\\svc-yuanbyu-00\dryad\terasort");
            DryadTable<TeraRecord> records = ddc.GetPartitionedTable<TeraRecord>("sherwood-sort2.pt");
            var q = records.OrderBy(x => x);
            q.ToDryadPartitionedTable("sherwood-sort2.pt");
        }
    }

22 Machine Learning in DryadLINQ
[Figure labels: Dryad; DryadLINQ; Large Vector; Machine learning; Data analysis]
Kannan Achan, Mihai Budiu

23 Linear Regression Code

    Vectors x = input(0), y = input(1);

    Matrices xx = x.PairwiseOuterProduct(x);
    OneMatrix xxs = xx.Sum();

    Matrices yx = y.PairwiseOuterProduct(x);
    OneMatrix yxs = yx.Sum();

    OneMatrix xxinv = xxs.Map(a => a.Inverse());
    OneMatrix A = yxs.Map(xxinv, (a, b) => a.Multiply(b));

24 Expectation Maximization
160 lines
3 iterations shown

25 Computer vision
Ongoing
  – Epitomes, features for image search, …
Anecdotal evidence
  – Nebojsa Jojic, Anitha Kannan
      – Tutorial from Mihai
      – Anitha implemented Probabilistic Image Map algorithm in an afternoon

26 Continuing research
Application-level research
  – What can we write with DryadLINQ?
System-level research
  – Performance, usability, etc.
Lots of interest from learning/vision researchers
