Running User Jobs In the Grid without End User Certificates - Assessing Traceability Anand Padmanabhan CyberGIS Center for Advanced Digital and Spatial.

Slides:



Advertisements
Similar presentations
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks MyProxy and EGEE Ludek Matyska and Daniel.
Advertisements

4/2/2002HEP Globus Testing Request - Jae Yu x Participating in Globus Test-bed Activity for DØGrid UTA HEP group is playing a leading role in establishing.
ANTHONY TIRADANI AND THE GLIDEINWMS TEAM glideinWMS in the Cloud.
Auditing Concepts.
Building Campus HTC Sharing Infrastructures Derek Weitzel University of Nebraska – Lincoln (Open Science Grid Hat)
Security Q&A OSG Site Administrators workshop Indianapolis August Doug Olson LBNL.
OSG Area Coordinators Meeting Security Team Report Mine Altunay 04/02/2014.
DoD Information Technology Security Certification and Accreditation Process (DITSCAP) Phase III – Validation Thomas Howard Chris Pierce.
OSG Area Coordinators Meeting Security Team Report Mine Altunay 05/15/2013.
Information Security Policies and Standards
Public Works and Government Services Canada Travaux publics et Services gouvernementaux Canada Password Management for Multiple Accounts Some Security.
WLCG Security TEG, risks and Identity Management David Kelsey GridPP28, Manchester 18 Apr 2012.
Network security policy: best practices
SCD FIFE Workshop - GlideinWMS Overview GlideinWMS Overview FIFE Workshop (June 04, 2013) - Parag Mhashilkar Why GlideinWMS? GlideinWMS Architecture Summary.
OSG Area Coordinators Meeting Security Team Report Mine Altunay 01/29/2014.
SSC2 and Update on Multi-user Pilot Jobs Framework Mingchao Ma, STFC – RAL HEPSysMan Meeting 20/06/2008.
DIRAC Web User Interface A.Casajus (Universitat de Barcelona) M.Sapunov (CPPM Marseille) On behalf of the LHCb DIRAC Team.
Information Systems Security Computer System Life Cycle Security.
CILogon OSG CA Mine Altunay Jim Basney TAGPMA Meeting Pittsburgh May 27, 2015.
TeraGrid Science Gateways: Scaling TeraGrid Access Aaron Shelmire¹, Jim Basney², Jim Marsteller¹, Von Welch²,
OSG Site Provide one or more of the following capabilities: – access to local computational resources using a batch queue – interactive access to local.
Project Management Methodology Project Closing. Project closing stage Must be performed for all projects, successfully completed or shut off by management.
Campus Grids Report OSG Area Coordinator’s Meeting Dec 15, 2010 Dan Fraser (Derek Weitzel, Brian Bockelman)
Blueprint Meeting Notes Feb 20, Feb 17, 2009 Authentication Infrastrusture Federation = {Institutes} U {CA} where both entities can be empty TODO1:
PanDA Multi-User Pilot Jobs Maxim Potekhin Brookhaven National Laboratory Open Science Grid WLCG GDB Meeting CERN March 11, 2009.
Use of Condor on the Open Science Grid Chris Green, OSG User Group / FNAL Condor Week, April
OSG Area Coordinators Meeting Security Team Report Mine Altunay 04/3/2013.
Evolution of the Open Science Grid Authentication Model Kevin Hill Fermilab OSG Security Team.
OSG Security Review Mine Altunay December 4, 2008.
EMI is partially funded by the European Commission under Grant Agreement RI Argus Policies Tutorial Valery Tschopp - SWITCH EGI TF Prague.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks David Kelsey RAL/STFC,
Mine Altunay July 30, 2007 Security and Privacy in OSG.
Pilot Jobs John Gordon Management Board 23/10/2007.
Open Science Grid (OSG) Introduction for the Ohio Supercomputer Center Open Science Grid (OSG) Introduction for the Ohio Supercomputer Center February.
Summary of AAAA Information David Kelsey Infrastructure Policy Group, Singapore, 15 Sep 2008.
GLIDEINWMS - PARAG MHASHILKAR Department Meeting, August 07, 2013.
Running persistent jobs in Condor Derek Weitzel & Brian Bockelman Holland Computing Center.
Trusted Virtual Machine Images a step towards Cloud Computing for HEP? Tony Cass on behalf of the HEPiX Virtualisation Working Group October 19 th 2010.
VO Privilege Activity. The VO Privilege Project develops and implements fine-grained authorization to grid- enabled resources and services Started Spring.
EMI INFSO-RI Argus Policies in Action Valery Tschopp (SWITCH) on behalf of the Argus PT.
Ian D. Alderman Computer Sciences Department University of Wisconsin-Madison Condor Week 2008 End-to-end.
DTI Mission – 29 June LCG Security Ian Neilson LCG Security Officer Grid Deployment Group CERN.
OSG Site Admin Workshop - Mar 2008Using gLExec to improve security1 OSG Site Administrators Workshop Using gLExec to improve security of Grid jobs by Alain.
LCG Support for Pilot Jobs John Gordon, STFC GDB December 2 nd 2009.
Security Policy Update WLCG GDB CERN, 14 May 2008 David Kelsey STFC/RAL
WLCG Authentication & Authorisation LHCOPN/LHCONE Rome, 29 April 2014 David Kelsey STFC/RAL.
OSG Area Coordinators Meeting Security Team Report Mine Altunay 02/13/2012.
JSPG Update David Kelsey MWSG, Zurich 31 Mar 2009.
VOX Project Tanya Levshina. 05/17/2004 VOX Project2 Presentation overview Introduction VOX Project VOMRS Concepts Roles Registration flow EDG VOMS Open.
Ian Collier, STFC, Romain Wartel, CERN Maintaining Traceability in an Evolving Distributed Computing Environment Introduction Security.
Shibboleth Use at the National e-Science Centre Hub Glasgow at collaborating institutions in the Shibboleth federation depending.
Sicherheitsaspekte beim Betrieb von IT-Systemen Christian Leichtfried, BDE Smart Energy IBM Austria December 2011.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI EGI CSIRT Procedure for Compromised Certificates and Central Security Emergency.
RISK MANAGEMENT IN THE PUBLIC SECTOR CONVERGING MULTIPLE STAKEHOLDER’S EXPECTATIONS Organised by National Treasury Presented by WELEKAZI DUKUZA CEREBRO.
EMI is partially funded by the European Commission under Grant Agreement RI Argus Policies Tutorial Valery Tschopp (SWITCH) – Argus Product Team.
OSG Security: Updates on OSG CA & Federated Identities Mine Altunay, PhD OSG Security Team OSG AHM March 24, 2015.
Campus Grid Technology Derek Weitzel University of Nebraska – Lincoln Holland Computing Center (HCC) Home of the 2012 OSG AHM!
OSG VO Security Policies and Requirements Mine Altunay OSG Security Team July 2007.
Why you should care about glexec OSG Site Administrator’s Meeting Written by Igor Sfiligoi Presented by Alain Roy Hint: It’s about security.
New OSG Virtual Organization Security Training OSG Security Team.
3 Compute Elements are manageable By hand 2 ? We need middleware – specifically a Workload Management System (and more specifically, “glideinWMS”) 3.
Condor Week May 2012No user requirements1 Condor Week 2012 An argument for moving the requirements out of user hands - The CMS experience presented.
Auditing Concepts.
OGF PGI – EDGI Security Use Case and Requirements
Operating a glideinWMS frontend by Igor Sfiligoi (UCSD)
LCG/EGEE Incident Response Planning
The New Virtual Organization Membership Service (VOMS)
Leigh Grundhoefer Indiana University
WMS Options: DIRAC and GlideIN-WMS
Internal Control Internal control is the process designed and affected by owners, management, and other personnel. It is implemented to address business.
Presentation transcript:

Running User Jobs In the Grid without End User Certificates - Assessing Traceability Anand Padmanabhan CyberGIS Center for Advanced Digital and Spatial Studies National Center for Supercomputing Applications University of Illinois at Urbana-Champaign OSG Security Team OSG AHM 2015

March 24, 2015 Motivation X.509 user certificates expose users to complexities of authN/authZ systems  First hurdle for every new Grid user  Represents a significant barrier to promoting ease of use and discourages potential users End user certificates are however the primary way to authenticate on Grids

March 24, 2015 Changing Trust Paradigm 3

March 24, 2015 Changing Trust Paradigm OLD TRUST model  Sites only trust user’s with a certificate  VOs do not play a role in this trust relationship  This is inefficient NEW TRUST model  Sites trust VOs to provide the users’ information when needed  Sites only know which VO the user belongs to  VO trusts its user to behave properly and takes responsibility for its user’s actions  Sites don’t know the identity of the user upfront  If a problem is noticed with a user job, sites will  contact the responsible VO  expect VOs to identify and ban the problem user  If a problem persists, the site can ban the entire VO

March 24, 2015 Why we needed the New Trust Model? What was wrong with the Old Model The old model did not recognize the important role a VO played in trust relationships. It puts the burden of proving identities on the users.  VO already knows its users’ identities and could provide it to sites. Biggest beneficiary of the new model is the end user.  If site did not need to establish trust with each user, then user did not need to deal with proving its identity and role.  Users do not need to obtain certs, create proxies, etc  Hides complexities of authN/authZ systems  Promotes ease of use

March 24, 2015 Does the New Trust Model Meet Our Security Needs? Needs for knowing a user’s identity or distinguishing a user from one another with access tokens  Fine-grain access privileges: Alice needs a different execution environment than Bob  Accountability: Holding users responsible for their actions on the grid Most VO members (limited exceptions like software installers exist) need the same access privileges  We do not hence need fine-grain privileges for most jobs Accountability – tracing a malicious job to its owner is the main reason for using certificates  Sites typically do not care who a user is unless a problem occurs We hence evaluate glideinWMS for traceability

March 24, 2015 Hypothesis Pilot job framework (e.g. GlideinWMS) already collect sufficient amount of data about users and jobs. So, it already has traceability information.  This will lead to improved usability  Existing pilot job framework are well positioned to support this model Goals  Research: Can traceability be achieved through different technical means?  If so, what are the security risks of using these means?

March 24, 2015 GlideinWMS Architecture

March 24, 2015 GlideinWMS Workflow GlideinWMS acts as a shield to protect users form complexities of the job submission on the Grid User accesses the Frontend and submits a job Frontend communicates with the Factory and requests Glideins (i.e. Pilot jobs) Glideins checks if the worker node suitable for the user job and then starts the user job Glidein job is a parent process to the actual user job and watches over the user job throughout its lifetime  GlideinWMS framework is documented at

March 24, 2015 GlideinWMS Details A worker node can run multiple Glideins and user jobs simultaneously An individual Glidein only starts a single job at a time  A Glidein can however run a sequence of jobs one after the other  Jobs may belong to different users All HTCondor logs associated with the Glidein are written on the worker nodes and are transferred back to the Glidein factory after the Glidein completes execution Glideins typically run for hours

March 24, 2015 Experiment Setup The security team submitted jobs on OSG sites without end user certificates and evaluated if the jobs can be traced back to the submitter Our exercises labeled a random (non- malicious) job as malicious Conducted searches in both directions  Tracing a malicious job back to a user  Tracing a user to find all of her jobs We paid special focus to additional risks from the lack of certificates

March 24, 2015 Tracing Jobs Site admin identifies a problem job on a worker node at the site Identify the Glidein process that started the problem job  Use the standard error and out, Starter and Startd logs associated with Glidein  Search associated timelines Identify the VO that owns the Glidein and the problem job  Use Glidein DN and contact Factory operator With the HTCondor job id of the Glidein, look at the StartdLogs belonging to that Glidein instance  Find frontend where the job originated  HTCondor jobid at frontend Contact VO operating frontend to check frontend log, history files, node login logs, etc to determine user

March 24, 2015 Challenges against Traceability Attacker overwrites the Glidein log files  The user job runs in the same user account as the Glidein job, so a malicious user could hijack the Glidein infrastructure, overwrite the log files, and no useful info returned to the factory (i.e. fake logs).  Likelihood: Moderate. Requires some insider information  Mitigation: Two potential venues: 1) getting more information back to the Schedd/Frontend; and 2) making sure we always do the UID switch.  Currently investigating solutions

March 24, 2015 Results Traceability study was carefully conducted on three frontend/VOs and access to Fermilab (FNAL) resources (which previously required certificates) was enabled  OSG-XSEDE frontend (OSG VO)  CHTC frontends (GLOW VO)  Will present experiences today  HCC frontends (HCC VO) Careful drills were conducted and recommendations presented to FNAL security team Recommendations were accepted and implemented  Successfully opened up the use of opportunistic resources from FNAL to a number of users Process we went through is documented on twiki  E.g.

March 24, 2015 Findings GlideinWMS has shown to possess significant tracing capabilities  System can identify a unique owner for a grid job at a worker node for a given timeframe Some corner cases could make tracing more challenging  Likelihood is however low and mitigations have been recommended Has profound affects on trust relationships  Site trusts VO and VO trusts its users

March 24, 2015 What to do if my VO is Interested Defined a process for VOs to submit jobs under the new trust model  Prospective VO applies to the Security Team  Security Team evaluates them against a criteria  Example Questions  How do you manage your users; how do you vet their identity, give permissions and assign roles to users. How do you document and log your users activities.  What access control mechanisms do you employ to ensure only authorized VO members can access VO services?  What is the architecture of the glidein setup at your VO?  Does your VO operate more than one submit node (e.g. PI controlled submit node) and use flocking, if so how do you trace users between these systems  Do you have a centralized system for collecting logs?  Do you have a policy of how access is revoked?  If a VO demonstrates that they can manage their users effectively and take responsibility for them, they switch to operate under new trust model

March 24, 2015 OSG Procedures and Policies Old and new model co-exist Many VOs prefer to switch to the new model  Easier on new users Some VOs will continue to use the old model OSG leaves the decision up to the VOs  OSG provides the infrastructure and services to enable both models

March 24, 2015 Questions? Thank you 18