I recently interviewed for an SRE position. I spent a full week learning (or refreshing my memory on) the subjects and topics that could come up in such an interview. I'll try to lay out the topics I covered and the resources I used.
What is an SRE?
Having spent the last 2 years working in a DevOps role, I've often felt that DevOps and SRE were two slightly different implementations of the same ideas. The former felt like a set of general principles, whereas the latter is a clear and detailed model (pre-dating DevOps), with a set of rules and guidelines. Google developed the SRE model and explained it in the SRE book. The underlying ideas are simple, but powerful:
- Develop tools and systems that reduce toil and repetitive work for engineers
- Automate everything, or as much as possible (deployments, maintenance, tests, scaling, mitigation)
- Monitor everything
- Think scalable from the start
- Build resilient-enough architectures
- Handle change and risk through SLAs, SLOs and SLIs
- Learn from outages
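To make the SLO idea concrete, here is a minimal sketch of the arithmetic that turns an availability SLO into a downtime allowance. The function name, the 30-day window and the sample SLO values are my own assumptions for illustration, not figures from the SRE book.

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Return the downtime allowance, in minutes, implied by an availability SLO.

    `slo` is a fraction (0.999 means "99.9% available").
    `window_days` is an assumed measurement window.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


if __name__ == "__main__":
    # Illustrative SLO targets only.
    for slo in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(slo)
        print(f"{slo:.2%} over 30 days -> {minutes:.1f} min of downtime")
```

Each extra "nine" divides the allowance by ten, which is why the target has to be weighed against the cost of reaching it.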
If you haven't yet read the SRE book, I strongly urge you to do so. There's even a free online version available. If you do not have the time, then maybe have a look at the "What is 'Site Reliability Engineering'?" interview with Ben Treynor (Google VP Engineering) for a general introduction.
According to the SRE book, an SRE should spend at most half of their time on "ops" work, and the rest doing development.
Google places a 50% cap on the aggregate "ops" work for all SREs—tickets, on-call, manual tasks, etc. [...] An SRE team must spend the remaining 50% of its time actually doing development. Source
Some skills are thus paramount to an SRE:
- coding / software development
- system administration and automation
- scalable system design
- system troubleshooting
Consequently, each of these areas of expertise can be (and often is) the subject of an interview.
Coding / Software development interview
I've found that the reference resource to prepare for a coding interview, especially when targeting companies like Amazon, Google, Microsoft, Yahoo, etc., is Cracking the Coding Interview, by Gayle Laakmann McDowell. This book is a real trove of advice (technical or not) and example exercises (with the associated solutions).
Even though it is aimed at software developer interviews, I still covered the following topics listed in the Must Know section of the book:
- Linked list
- Hash table
- Binary tree
- associated Big-O time and memory complexity for common operations (search, insert, delete, etc.).
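As an illustration of how complexity follows from structure, here is a minimal singly linked list sketch in Python. The class and method names are my own, and the docstrings give the usual worst-case costs.

```python
from dataclasses import dataclass
from typing import Any, Iterator, Optional


@dataclass
class Node:
    value: Any
    next: Optional["Node"] = None


class LinkedList:
    def __init__(self) -> None:
        self.head: Optional[Node] = None

    def push_front(self, value: Any) -> None:
        """O(1): the new node simply becomes the head."""
        self.head = Node(value, self.head)

    def __iter__(self) -> Iterator[Any]:
        node = self.head
        while node is not None:
            yield node.value
            node = node.next

    def contains(self, value: Any) -> bool:
        """O(n): in the worst case every node is visited."""
        return any(v == value for v in self)

    def delete(self, value: Any) -> bool:
        """O(n) to find the node, O(1) to unlink it. True if a node was removed."""
        prev, node = None, self.head
        while node is not None:
            if node.value == value:
                if prev is None:
                    self.head = node.next
                else:
                    prev.next = node.next
                return True
            prev, node = node, node.next
        return False
```

For comparison, Python's built-in dict and set are hash tables, so a membership test such as `x in some_set` takes O(1) time on average instead of O(n).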
I found Data structures and Algorithms using Python and C++ to be useful (albeit a bit lengthy) when dealing with these data structures for the first time. This presentation gives a short, to-the-point, no-nonsense introduction to these data structures.
- Binary search
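For reference, a minimal iterative binary search over a sorted sequence could look like the sketch below. The function name and the convention of returning -1 on a miss are my own choices.

```python
from typing import Sequence


def binary_search(items: Sequence[int], target: int) -> int:
    """Return the index of `target` in sorted `items`, or -1 if absent.

    Runs in O(log n) time: each iteration halves the search interval.
    """
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1   # target can only be in the right half
        else:
            hi = mid - 1   # target can only be in the left half
    return -1
```

In production Python code, the standard library's `bisect` module covers the same need. Writing it by hand is mostly useful as interview practice, where off-by-one errors on `lo` and `hi` are the classic trap.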
I also had a look at https://github.com/adicu/interview_help to practice on some real-life interview questions, and at https://github.com/nryoung/algorithms to read Python implementations of common data structures and algorithms.
Scalable system design interview
This was my favorite subject to work on, as an apparently simple question such as "Design the bit.ly service" hides unexpected depths of complexity. Being able to design a scalable system implies knowing about:
- load balancing
- micro-service architecture
- CAP theorem
- consistency patterns
- availability patterns
- asynchronism patterns
The main idea is to be able to identify the bottlenecks of the architecture and to size it with an appropriate number of machines, using some "back-of-the-envelope" calculations, while keeping it robust and fault tolerant.
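Here is a rough sketch of what such a back-of-the-envelope calculation might look like for a bit.ly-style service. Every input (URLs per month, read/write ratio, record size, retention, peak factor, per-server throughput) is a made-up assumption for illustration. In an interview, you would state your own and adjust them with the interviewer.

```python
import math

SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000


def estimate(
    new_urls_per_month: int,
    reads_per_write: int,
    bytes_per_record: int,
    retention_years: int,
    peak_factor: float,
    reads_per_sec_per_server: int,
) -> dict:
    """Derive rough traffic, storage and server-count figures from assumptions."""
    writes_per_sec = new_urls_per_month / SECONDS_PER_MONTH
    reads_per_sec = writes_per_sec * reads_per_write
    total_records = new_urls_per_month * 12 * retention_years
    storage_tb = total_records * bytes_per_record / 1e12
    # Size for peak traffic, not for the average.
    servers = math.ceil(reads_per_sec * peak_factor / reads_per_sec_per_server)
    return {
        "writes_per_sec": writes_per_sec,
        "reads_per_sec": reads_per_sec,
        "storage_tb": storage_tb,
        "read_servers": servers,
    }


if __name__ == "__main__":
    # Hypothetical inputs: 100M new URLs/month, 100 reads per write,
    # 500 bytes per record, 5 years of retention, 3x peak, 1000 reads/s/server.
    for name, value in estimate(100_000_000, 100, 500, 5, 3.0, 1000).items():
        print(f"{name}: {value:,.1f}")
```

With these inputs, the service sees roughly 39 writes and 3,900 reads per second on average and stores about 3 TB over five years. The read servers would need to be multiplied further to tolerate failures.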
The most useful resources I found to prepare were:
- Scalability lecture given at Harvard
- Latency Numbers Every Programmer Should Know
- The System Design Primer (I suggest you follow the links after each section for an in-depth follow-up)
- this great step-by-step walkthrough on design questions, by HiredInTech
- Scaling up to your first 10 million users, talk given by Joel Williams of AWS
- Crack the design interview
- When to use NoSQL vs SQL
System troubleshooting interview
To be able to automate the administration of a system, one should first know that system in depth, and in a lot of cases it will be GNU/Linux. If you have time, I strongly suggest reading The Linux Programming Interface. Note that this is a large book (my version has 1556 pages) focusing on an old version of the Linux kernel (2.6.x). Fear not! You'll still gain a vast knowledge about how a GNU/Linux system operates. For a quicker tour, you could have a look at the Linux Kernel Internals blog. You'll also find interesting SRE interview questions/answers in this SRE interview questions blogpost.
Mastering the mentioned tools (ngrep, etc.) gave me some good debugging chops that I have applied in production many times.
Netflix has also written a very nice and thorough blogpost on performance troubleshooting: Linux Performance Analysis in 60,000 Milliseconds, detailing what to check in case of a performance issue.
Wait, there's more
Technical knowledge is one thing, but SRE being a relatively new activity, I also wanted to get real-life feedback from real-life SREs. To that end, I watched the following (great) talks:
- Case Study: Adopting SRE Principles at StackOverflow, by Tom Limoncelli of Stack Exchange
- Love DevOps? Wait until you meet SRE, by Nick Wright, from Atlassian
- Panel: training new SREs, with Katie Ballinger (CircleCI), Saravanan Loganathan (Yahoo), Rita Lu (Google), Craig Sebenik (Matterport), Andrew Widdowson (Google)