Parasol Architecture A mild case of scary asynchronous system stuff
Initial Design Goals
–Handle huge batches of jobs on large clusters: 100,000 blastz jobs on 1,000 CPUs.
–Error tolerant: transient network glitches, compute node failures, software failures. Allow easy restart once bugs are fixed.
–Easy to check the status of jobs: jobs are bundled into batches rather than tracked individually.
–Sharing the cluster between users.
–As robust as possible: simple, and leveraging earlier work on jabba, a ‘job babysitter’ for the Condor scheduler.
Technical Considerations
–Very busy networks complicate things: messages may be dropped, so there has to be retry logic.
–Retry logic means it is not instant to figure out that a machine is down.
–A design where a central scheduler communicated with only one cluster node at a time would be too slow.
–Multiple threads/processes can lead to hard-to-debug race conditions.
Process/Thread Architecture Parasol processes/threads (circles) and message flow (arrows). All processes reside on the scheduling machine except for the node processes. A spoke process can send messages to any node. The hub, spoke, and heartbeat are all threads of a hub process.
Node Process
–Runs as root; forks and changes to the job's user to run each job.
–Keeps a list of the last 10 jobs it has finished as well as the ones (one per CPU) it is working on.
–Responds to job-start, job-kill, and job-status-query messages.
–Sends job-end and job-status messages; the job-end message includes the error code.
–Stores stderr in a local file, which it will send to the hub on request.
Para client process
The para client manages batches of jobs through the hub. It is designed to catch jobs that may have run into problems of any sort and give the user a chance to rerun them after the problem is fixed. The major input to para is a job list. Each job can have checks associated with it that run before and after the job itself. Initially para reads the job list and transforms it into a job database. The central routine of para, paraCycle, reads the job database, queries the hub to see what jobs are running and waiting, looks at the results file to see what jobs are finished, performs output checks on the finished jobs, sends unsubmitted jobs or jobs that need to be rerun to the hub, updates the database in memory, and writes it back out. The database is in a comma-delimited text format with one job per line. The job database keeps track of the timing and status of each job submission. The code to read and write this database was generated with AutoSql. para will avoid loading the hub with more than 100,000 jobs at a time, and will only submit failed jobs three times before giving up on them. Para is a direct descendant of the “jabba” wrapper we put around the Condor scheduler.
Hub Process
–Where the rubber really meets the road: the most complex part of the system.
–Multithreaded around a central message queue.
–(Talk goes into chalk-talk mode here.)
Chalk talk outline
–Message queue synchronization.
–Revisit the architecture diagram.
–Message passing: UDP between processes; a message queue between threads of the hub.
–The main thread eats messages from the queue and sends messages to spokes and clients.
–The other threads are so simple that it is easy to see they write nothing the main thread uses, other than the message queue.
–The main thread is designed to respond to any one message quickly, deferring longer work to a spoke.
–Heartbeat messages trigger status checks and cleanup.
–Main data structures: machine, user, batch, job.