Presentation on theme: "Growing the Site Reliability Team at LinkedIn: Hiring is Hard"— Presentation transcript:
1. Growing the Site Reliability Team at LinkedIn: Hiring is Hard
Here to tell you why hiring is hard
How to make it not so hard
And how to know if it’s still working
Greg Leffler
Manager, Site Reliability
https://linkedin.com/in/gleffler
2. Who am I?
Site Reliability Manager (New York)
MS in Industrial/Organizational Psychology
Responsible for the interview process for SREs
  Took on this responsibility as an IC, so the process originated from the bottom up
Team grew 10x from August 2011 to May 2014
3. Who are SREs at LinkedIn?
5 sites, 2 countries
1000+ SW engineers
8th busiest website in the world
10k+ prod machines per DC: 2 DCs today, +1 in 2014
300+ RESTful services, 300MM+ members
Services with 99th-percentile latencies as low as 10 ms
3 time zones (PT, ET, IST)
4. What matters for a great company?
Funding?
Good idea?
Execution?
Product?
People.
Notes: They all matter, but people matter the most. Great people can make up for weaknesses in other areas. Great people can help you make your plans better. Great people are what make a company great.
5. Obligatory LinkedIn culture plug
Talent is our #1 operating priority.
Our culture is what sets us apart.
We are committed to supporting the career transformation of our employees.
Transparency is encouraged and emphasized at every company all-hands meeting
  These occur every other week
Our commitment to our employees is emphasized in how we behave
Everyone is encouraged to do interviews! Yes, everyone.
  ~60% of SREs participate in the interview process
Notes: The 60% statistic is cited from HireIn. The talent principles continue beyond the three shown on the slide:
We will be transparent in our programs to ensure trust, quality and effectiveness. If something is wrong, we will fix it.
We will strive for consistency but distinguish programs where business or market necessitates.
We will provide opportunities for high-performing employees to achieve top rewards.
We believe in recognition that goes beyond cash and embeds itself in our day-to-day interactions with our employees.
We strive to create a working environment where people want to come and stay.
6. What do we want from SREs?
Excited about LinkedIn and the SRE role
  We have the luxury of being picky
Fit our culture and embody our values
  These matter. If you haven’t set them or can’t articulate them, you need to do that first
Have the skills needed to do the job
  These also matter. You need to know what these are before you screen for them
AND NOTHING ELSE
Notes: Not skills that are ‘nice to have’, not somebody you’d want to hang out with on the weekend, not somebody who has won a programming contest, not somebody who went to an elite school.
7. These don’t always work
Coding puzzles
“Fermi problems”
Algorithm design questions
If you were a zebra, what pattern would your stripes have?
Homework
Personality tests
Trivia (quick, which signal is #7 in RHEL 6.4 on x86?)
Notes:
First two: you can buy a book and study.
Aside: coding is important to LinkedIn, but we test it pragmatically by having candidates do actual SRE coding tasks before they even come onsite. We don’t bring in people who can’t code.
Algorithm design: likely not appropriate except in very limited situations where it is part of the role. REALLY think about whether they need it.
Zebra question: what’s the right answer? Really? You sure?
Homework: CIO magazine surveyed self-appointed “expert” devs; only 8% would even consider doing homework.
Personality tests: not correlated very well with performance in technical fields.
Trivia: the signal is SIGBUS; the number differs on other architectures. Would you know that if I hadn’t said x86 in the question, or told you just now?
Coding puzzle article: https://www.linkedin.com/today/post/article/ i-m-sorry-i-won-t-do-your-take-home-coding-exercise?trk=object-title
8. Here’s why
Industrial Psychology has figured this out already
  Schmidt & Hunter, 1998
  The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings
Even if they hadn’t, you should collect your own data
  And not rely on hunches or cargo cults
Further reading in the notes on this slide.
Notes: Industrial psych has been around for almost 100 years at this point, coming into its own with the development of the assembly line and Frederick Taylor’s principles of “scientific management”. Industrial Psychology applies the scientific method to hiring, screening, motivation, and everything else that happens at work.
Scientists have been studying how work works for a long time and have published their findings. They apply across industries and to nearly every work situation that we’ve been able to dream up, including studies of space shuttle pilots, elite military units, etc.
Further reading: Google “Hawthorne Works” and read about the Hawthorne Effect.
Anderson, N., Ones, D. S., Sinangil, H. K., & Viswesvaran, C. (Eds.). (2002). Handbook of industrial, work and organizational psychology, Volume 1: Personnel psychology. Thousand Oaks, CA: Sage Publications Ltd.
Campbell, J. P., Gasser, M. B., & Oswald, F. L. (1996). The substantive nature of job performance variability. In K. R. Murphy (Ed.), Individual differences and behavior in organizations (pp. 258–299). San Francisco: Jossey-Bass.
Guion, R. M. (1998). Assessment, measurement and prediction for personnel decisions. Mahwah, NJ: Lawrence Erlbaum Associates.
Muchinsky, P. M. (Ed.). (2002). Psychology Applied to Work. Wadsworth Publishing Company. (This is a textbook but is very well written.)
Sackett, P. R., & Wilk, S. L. (1994). Within group norming and other forms of score adjustment in pre-employment testing. American Psychologist, 49,
9. What does work?
Good funnel at the start
Realistic job previews
Structured interviews
Situational judgment tests
Notes: These have the highest standalone validity (i.e., they are the best predictors) as well as high incremental validity (adding them to other predictors also helps). They also have high face validity, which means candidates like them and they look like they actually evaluate things that are relevant to the job.
10. The LinkedIn SRE funnel
Sourcing/screening
Recruiter prescreen
Operationally-focused phone screen (TPS 1)
Code-focused phone screen (TPS 2)
By the time candidates are onsite, we expect they will pass. 82%
Notes: Only a combined pass rate is available for the TPSes because of the way the data is stored. Numbers are for 2013. 24%
11. Realistic job preview
An RJP is a way to show candidates what they’ll do on a day-to-day basis. It gives them a chance to put themselves in the shoes of an SRE at LinkedIn and, more importantly, a chance to figure out whether they like doing it or not.
12. Structured Interview
A structured interview is different from most interviews in one very key way: it is structured. Every candidate gets the same questions, so you can compare apples to apples.
13. Situational Judgment Test
Situational judgment tests ask candidates how they would react in a given situation. SREs, devops folks, or star-ops have the keys to the kingdom and need to be trusted. Their judgment is a HUGE part of their qualifications for the role, so SJTs are the way to assess it as early as possible.
14. How do we implement these?
Live Troubleshooting (realistic job preview)
Systems Internals, Web Architecture (structured interviews)
Triage & Investigation (situational judgment test)
Host Manager (structured interview for culture and role fit)
Lunch (not an interview… or is it?)
Notes: This is LinkedIn’s interview process. This is everything we do onsite. We also do some phone screens; see earlier slides.
15. Live Troubleshooting
Here’s a broken service (in EC2)
Fix it
(As realistic as it gets)
No ‘man voldemort’
You are probably the first person in the world to troubleshoot the exact situation in question
16. Technical Modules
Added structure and scoring guidelines
  Scoring guidelines are what matter
  Consistency is the only way you can scientifically prove whether these are working
  High number of interviewers = need to be able to compare results
Notes: These modules are systems internals and web architecture/design.
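To make "structure and scoring guidelines" concrete, here is a minimal sketch of a scorecard that only accepts scores for a module's fixed question set on a fixed scale, so results from different interviewers stay comparable. Everything in it is an assumption for illustration: the talk does not describe LinkedIn's actual scorecard, scale, or question identifiers, so `SCALE`, `Module`, and `Scorecard` are hypothetical names.

```python
from dataclasses import dataclass, field

# Hypothetical anchored scale; the talk does not state the real one.
SCALE = (1, 2, 3, 4)


@dataclass(frozen=True)
class Module:
    name: str
    questions: tuple  # fixed question IDs asked of every candidate


@dataclass
class Scorecard:
    module: Module
    interviewer: str
    scores: dict = field(default_factory=dict)  # question ID -> score

    def validate(self):
        """Reject scorecards that skip, add, or mis-scale a question."""
        missing = set(self.module.questions) - set(self.scores)
        extra = set(self.scores) - set(self.module.questions)
        bad = {q: s for q, s in self.scores.items() if s not in SCALE}
        if missing or extra or bad:
            raise ValueError(
                f"missing={sorted(missing)} extra={sorted(extra)} out_of_scale={bad}"
            )

    def total(self):
        """Sum of scores, computed only for a complete, valid scorecard."""
        self.validate()
        return sum(self.scores.values())


# Usage with made-up question IDs.
internals = Module("systems_internals", ("q1", "q2", "q3"))
card = Scorecard(internals, "interviewer_a", {"q1": 3, "q2": 4, "q3": 2})
print(card.total())
```

Rejecting incomplete scorecards is one possible design choice: it keeps every stored total comparable across interviewers, at the cost of making interviewers cover every question.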
17. Triage & Investigation Module
Situational Judgment and Triage
  It’s your first day on call and the NOC calls to say the site is on fire. Here’s the alert board. What do you look at first? Why?
Assesses standard troubleshooting/investigation ability
  “The CEO calls you and says ‘the site is slow’. What do you do?”
  “Disk is full. You delete a file but df still shows the disk being full. What’s wrong?”
Notes: You’ll notice all of these are still situational judgment questions. They aren’t technical trivia and don’t require “a-ha” moments. You can reason through them even if you haven’t been in that exact situation before.
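For the "disk is full" question, one common cause (offered as an illustration, not necessarily the answer the interviewers look for) is that a process still holds the deleted file open: removing the name does not free the blocks until the last open descriptor is closed. The sketch below reproduces that on a POSIX system; the function name `demo_deleted_but_open` is made up for this example.

```python
import os
import tempfile


def demo_deleted_but_open(size_bytes=1024 * 1024):
    """Write a file, delete its name while it is still open, and inspect it."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * size_bytes)
        os.unlink(path)      # like rm: removes the directory entry only
        info = os.fstat(fd)  # the inode is still reachable through the open descriptor
        return {
            "name_exists": os.path.exists(path),
            "link_count": info.st_nlink,
            "size_still_held": info.st_size,
        }
    finally:
        os.close(fd)         # only after this can the kernel release the space


print(demo_deleted_but_open())
```

On a live host, a tool such as `lsof +L1` lists open files whose link count has dropped to zero, which is one way to find the process holding the space.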
18. Results of implementing changes
Happier candidates
  In fact, no unhappy candidates
  “The troubleshooting module was the most fun I’ve ever had in an interview”
  “I thought the troubleshooting module was hard but I learned so much”
Happier interviewers
  Some hesitation at first
  Live Troubleshooting is stressful for the interviewer too!
  Solve with training and apprenticeships
Notes: The candidate statements do not contradict each other. Especially for this role, we want candidates who find the interviews hard, but who then find that intellectually stimulating and want to come work for us!
Also, even candidates we don’t accept should have a good experience. That is partly selfish (they might have a friend who’d be great), but it is for them too: there is no need to be mean to people. It’s not our culture.
We trained interviewers, but we also provided practice sessions with other employees, and we have a voluntary apprenticeship program where people can shadow an experienced interviewer for any module.
19. Data, data, data
We’re collecting scores from each module
  Correlating them to performance ratings
  Re-evaluating the utility of each module
If a module doesn’t predict performance, get rid of it
  This is hard, especially with things people ‘need’
  However, if there’s no correlation, it is worthless.
Notes: Don’t mention the actual LI numbers here; use a nominal scale.
Analysis to occur after a full year of data due to power issues.
If power is interesting to you, come talk to me at office hours.
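As a rough sketch of the analysis described on this slide, the code below correlates module scores with later performance ratings and estimates how many hires are needed before such a correlation can be detected reliably, which is the "power" concern. The data is invented for illustration, and the sample-size formula is the standard Fisher z approximation, not necessarily the method used at LinkedIn; `pearson_r` and `hires_needed` are hypothetical helper names.

```python
import math
from statistics import NormalDist, mean


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


def hires_needed(expected_r, alpha=0.05, power=0.80):
    """Approximate sample size to detect a nonzero correlation of expected_r
    with a two-sided test, using the Fisher z transformation."""
    if not 0 < abs(expected_r) < 1:
        raise ValueError("expected_r must be nonzero and strictly between -1 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) / math.atanh(abs(expected_r))) ** 2 + 3)


# Invented example data: one module score and one later rating per hire.
module_scores = [1, 2, 2, 3, 3, 4, 4, 5]
performance = [2, 1, 3, 3, 4, 3, 5, 5]

print(f"r = {pearson_r(module_scores, performance):.2f}")
print(f"hires needed to detect r = 0.3: {hires_needed(0.3)}")
```

With a handful of hires per quarter, a modest true correlation is easy to miss, which is consistent with waiting for a full year of data before judging a module.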
20. How to make your process better
Make talent your first priority
Implement the good stuff from I/O psych
  Realistic job previews
  Situational judgment tests
  STRUCTURED interviews
Collect data on interview performance (module scores)
  Correlate this to job performance!
Re-evaluate your process
21. Want to experience it for real?
We’re hiring. See me afterwards.
Office hours are at 2 pm
  Any hiring or culture related questions are fair game