The future of AliEn
P. Saiz
Experiment Support, CERN IT Department, CH-1211 Geneva 23, Switzerland

CERN IT Department, CH-1211 Geneva 23, Switzerland. 27 Mar 2013, Pablo Saiz, ALICE offline week

Table of contents
Current status
Ongoing work
Future plans
Summary

AliEn
File Catalogue
  –LFN to PFN mapping
  –Metadata
  –700 M entries
TaskQueue
  –Job execution model
  –Package management
  –50K concurrent jobs
File transfers
Used by ALICE and PANDA

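As a rough illustration of the LFN to PFN mapping, the sketch below models a catalogue as two dictionaries: one from logical file name (LFN) to a GUID, and one from GUID to its physical replicas (PFNs). The class name, paths, and storage hosts are hypothetical, and the real AliEn catalogue is a database whose layout is not reproduced here.

```python
from dataclasses import dataclass, field
import uuid


@dataclass
class ToyCatalogue:
    """Toy in-memory catalogue: LFN -> GUID -> list of PFNs (illustrative only)."""
    lfn_to_guid: dict = field(default_factory=dict)
    guid_to_pfns: dict = field(default_factory=dict)

    def register(self, lfn: str, pfn: str) -> str:
        """Register a physical replica for a logical name and return its GUID."""
        # Reuse the GUID if the LFN is already known, otherwise create one.
        guid = self.lfn_to_guid.setdefault(lfn, str(uuid.uuid4()))
        self.guid_to_pfns.setdefault(guid, []).append(pfn)
        return guid

    def resolve(self, lfn: str) -> list:
        """Return all known physical replicas for a logical file name."""
        guid = self.lfn_to_guid.get(lfn)
        return list(self.guid_to_pfns.get(guid, []))
```

Registering the same LFN twice with different PFNs yields one GUID with two replicas, which is the property a client relies on when choosing the closest copy.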
AliEn versions
v2-19: Current version of ALICE
  –With plenty of patches
v2-20: Current version of PANDA
  –JSON, removal of PackMan, Catalogue layout
v2-21: Development release
  –GUIDless catalogue
After a release has been adopted, database changes go into the next release

AliEn improvements not yet used by ALICE
Catalogue structure
  –InnoDB tables, foreign keys, numeric id
  –2-day downtime, or creating a 1-week hybrid version
Removal of PackMan service
  –Clients can handle package installation by themselves
JSON communication
  –Backward incompatible; requires full redeployment
File popularity
  –Requires changes in the Central Services

Current work
File Catalogue
jAliEn
Popularity
Classads
Trust Model
Priority
Price
AliEn/PoD
VO to VO

File Catalogue
Investigate File Catalogue using file system
  –Using all features from real file system: user, quotas, ...
Prototype of AliEn creating entries on FS:
  –700M entries in the ALICE catalogue
  –Ext4 not up to the challenge → reiserfs
  –One entry per file, one entry per directory
Locking, simultaneous clients, booking entries, backups
  –Prototype was discontinued
File catalogue without GUID
  –See Miguel’s presentation

jAliEn
Already used in production:
  –Managing productions
  –Data transfers
  –Data cleanup
Server part for the web interface
Need to:
  –Improve the ROOT plugin
  –Integrate into FITS

Other improvements
TaskQueue improvements:
  –Store diffs between original and final JDL
  –Remove Classad library
  –Retrial mechanism
Separation of price and priority
  –Priority: select user
  –Price: sort among the jobs of the same user
More worker node platforms: SLC6, Fedora

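One possible reading of the price/priority split is sketched below: priority decides which user runs next, and price orders only the jobs of that user. The `Job` fields, the "higher value wins" convention, and the tie-breaking rules are assumptions made for illustration; they are not taken from the TaskQueue code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    job_id: int
    user: str
    price: float  # assumed: higher price runs earlier within one user's jobs


def next_job(jobs, user_priority):
    """Pick the next job: choose the user by priority, then the job by price."""
    waiting_users = {job.user for job in jobs}
    if not waiting_users:
        return None
    # Step 1 (priority): select the waiting user with the highest priority.
    # Ties are broken by user name so the result is deterministic.
    user = max(waiting_users, key=lambda u: (user_priority.get(u, 0), u))
    # Step 2 (price): sort only among that user's jobs.
    # Ties are broken in favour of the lower (older) job id.
    own_jobs = [job for job in jobs if job.user == user]
    return max(own_jobs, key=lambda job: (job.price, -job.job_id))
```

Keeping the two steps separate means a user cannot jump ahead of other users by raising prices; price only reorders that user's own queue.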
File Popularity
Developed by A. Abramyan and N. Manukyan
Requires patches in central services v2-19
Frequency of file access:
  –Including errors
  –File types
Identify:
  –In-demand files → increase replicas
  –Other files → decrease replicas
  –Broken files

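The classification above could be sketched roughly as follows, assuming a hypothetical access log of `(lfn, success)` records. The function name and both thresholds are arbitrary placeholders, not values used by the popularity service.

```python
from collections import Counter


def classify_files(access_log, hot_threshold=3, error_ratio=0.5):
    """Classify files from (lfn, success) access records.

    Returns a dict mapping each LFN to 'increase', 'decrease', or 'broken'.
    """
    accesses = Counter()
    errors = Counter()
    for lfn, success in access_log:
        accesses[lfn] += 1
        if not success:
            errors[lfn] += 1

    decision = {}
    for lfn, count in accesses.items():
        if errors[lfn] / count > error_ratio:
            decision[lfn] = "broken"      # most accesses fail
        elif count >= hot_threshold:
            decision[lfn] = "increase"    # in demand: add replicas
        else:
            decision[lfn] = "decrease"    # rarely read: fewer replicas
    return decision
```

Counting failed accesses alongside successful ones is what lets the same pass flag broken files, which matches the "including errors" point on the slide.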
Other contributions
AliEn trust model
  –Define service/user trust, and schedule jobs/storage accordingly
  –Sergio Guinez, TALCA
AliEn/PoD integration
  –Interactive analysis on the grid
  –Cinzia Luzzi
VO to VO submission
  –Submit jobs from one VO to another, output visible in both
  –PANDA colleagues

PANDA GRID/AliEn developers
Link

Future work
Testing Framework
Job Brokering
User credentials
Scaling up

Testing framework
Create environment to test new approaches
Up to now:
  –BITS & FITS (functionality tests)
  –PANDA (becoming a mature GRID)
  –Development VO: ALICE_TEST
      Set up and running for one year
      Used for some train analyses
      Users have different priorities

Development environment I
[Diagram: two environments, each with its own FC, TQ, SEs and CEs]

Environment I
One-way catalogue synchronization
  –Take snapshot of catalogue
Duplicate small percentage of jobs
  –5-10% of TQ
Jobs get executed twice
  –Easy to check output
  –Duplication of work
  –Setting up a new SE that will be erased
Test of the full-scale catalogue

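A minimal sketch of how a small, reproducible fraction of jobs might be picked for duplication, and how the two outputs could then be compared. Hashing the job id is an assumption chosen here so that the same job is always selected or skipped; the function names and the checksum dictionaries are hypothetical.

```python
import hashlib


def should_duplicate(job_id: int, percent: float = 5.0) -> bool:
    """Deterministically select roughly `percent`% of job ids for duplication."""
    digest = hashlib.sha256(str(job_id).encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10000  # bucket in 0..9999
    return bucket < percent * 100


def compare_outputs(production: dict, test: dict) -> list:
    """Return the names of output files whose checksums differ or are missing."""
    names = set(production) | set(test)
    return sorted(n for n in names if production.get(n) != test.get(n))
```

Because selection depends only on the job id, a resubmitted job lands in the same bucket, and an empty list from `compare_outputs` means the test environment reproduced the production output.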
Development Environment II
[Diagram: one environment with FC, TQ, SEs and CEs; a second with FC, TQ and CEs only]

Using VO to VO submission
  –Once the plugin becomes available…
New VO with only CE
  –Easier to set up
  –Using same SE as ALICE
If jobs fail, reschedule them
Does not test the full catalogue

Alternative Job Brokering
Two-level broker:
  –Broker dispatches batches of jobs to CM
  –CM distributes among worker nodes
  –Bigger dependency on vobox
  –Reduce load on central services
New job optimizer:
  –Groups jobs together
      Ideally, with the same input
  –Send group to the JobAgent

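The grouping step of such an optimizer might look like the sketch below, which batches jobs that share exactly the same set of input files. The `(job_id, input_files)` representation and the `max_group_size` cap are assumptions for illustration; the slide does not specify how groups are formed or sized.

```python
from collections import defaultdict


def group_jobs(jobs, max_group_size=10):
    """Group (job_id, input_files) pairs that share exactly the same input set."""
    by_input = defaultdict(list)
    for job_id, input_files in jobs:
        # frozenset makes the key independent of the order of the input list.
        by_input[frozenset(input_files)].append(job_id)

    groups = []
    for inputs, job_ids in by_input.items():
        # Split large groups so a single JobAgent is not handed too much work.
        for start in range(0, len(job_ids), max_group_size):
            groups.append((sorted(inputs), job_ids[start:start + max_group_size]))
    return groups
```

Each resulting tuple is one unit that could be sent to a JobAgent, so the input files are fetched once per group rather than once per job.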
User credentials
gLExec
Propagate user credentials to worker node
Sign JDL and changes
  –Traceability
As already presented by S. Schreiner:
http://indico.cern.ch/getFile.py/access?contribId=58&sessionId=9&resId=0&materialId=slides&confId=111325

Factor 1000 scale up…
Number of sites: 80
  –SETI, BOINC, …
Opportunistic sites (without vobox)
Number of nodes: 50K jobs → 50M jobs
  –Amazon has 0.5M servers [1]
Decentralized job brokering
Amount of information: 30 PB → 30 EB
  –One tenth of the world’s info! [2]
I/O bottleneck
Number of files: 700M → 700B
  –Default ext4, max 4B
[1] [2]

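The arithmetic behind the factor 1000 figures can be checked with a short sketch. It assumes decimal units (1 PB = 10^15 bytes), one inode per file, and the ext4 limit of 2^32 inodes per filesystem that follows from its 32-bit inode numbers; directories, which also need one entry each, are ignored. The names `current`, `scaled`, and `filesystems_needed` are illustrative.

```python
FACTOR = 1000

current = {
    "concurrent jobs": 50_000,            # 50K
    "stored data (bytes)": 30 * 10**15,   # 30 PB, decimal units assumed
    "catalogue files": 700_000_000,       # 700M
}

scaled = {name: value * FACTOR for name, value in current.items()}

# ext4 uses 32-bit inode numbers, so one filesystem holds at most 2**32 inodes.
EXT4_MAX_INODES = 2**32


def filesystems_needed(n_files: int, max_inodes: int = EXT4_MAX_INODES) -> int:
    """Minimum number of filesystems if every file takes one inode."""
    return -(-n_files // max_inodes)  # ceiling division on integers
```

Under these assumptions today's 700M files fit in a single ext4 filesystem, while 700 billion files would need at least 163 of them, which is why a file-system-backed catalogue does not scale by a factor of 1000 without sharding.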
Factor 1:1000 scale up
It will require quite some tuning…
Luckily, a factor of 10 is not even in question
  –And that’s more than enough for the expected increase in resources

After more than 13 years…

Summary
AliEn can handle current load
  –80 sites, 50K concurrent jobs, 700 M files
An increase by a factor of 10 should be easy
Plenty of areas for research/improvement
  –Catalogue
  –Job distribution
  –jAliEn
AliEn needs a new project leader
  –Thank you for the last 13 years!