Preparing for the SRE interview

Apr 20, 2017

I recently interviewed for an SRE position. I spent a full week learning (or refreshing my memory on) the subjects that could be covered in such an interview. I'll try to lay out the topics I covered and the resources I used.

What is an SRE?

Having spent the last 2 years working as a DevOps engineer, I've often felt that DevOps and SRE were two slightly different implementations of the same ideas. The former feels like a set of general principles, whereas the latter is a clear and detailed model (pre-dating DevOps), with a set of rules and guidelines. Google developed the SRE model and explained it in the SRE book. The underlying ideas are simple, but powerful:

Develop tools and systems that reduce toil and repetitive work for engineers
Automate everything, or as much as possible (deployments, maintenance, tests, scaling, mitigation)
Monitor everything
Think scalable from the start
Build resilient-enough architectures
Handle change and risk through SLAs, SLOs and SLIs
Learn from outages
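
To make the SLO idea a bit more concrete, here is a minimal sketch of the error budget arithmetic described in the SRE book: an availability SLO implies a fixed amount of allowed unavailability over a time window. The function names, the 30-day window and the example figures are my own assumptions for illustration, not something taken from the book.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    `slo` is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent).

    Here the SLI is assumed to be the ratio good_events / total_events.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_failures = (1.0 - slo) * total_events
    actual_failures = total_events - good_events
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    print(f"{error_budget_minutes(0.999):.1f} minutes of downtime allowed")
    print(f"{budget_remaining(0.999, 999_500, 1_000_000):.0%} of the budget left")
```

With a 99.9% SLO over 30 days, the budget comes out at about 43.2 minutes. Once it is spent, the team is expected to favor stability over shipping new changes.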

If you haven't read the SRE book yet, I strongly urge you to do so. There's even a free online version available. If you do not have the time, have a look at the interview What is 'Site Reliability Engineering'? with Ben Treynor (Google VP Engineering) for a general introduction.

According to the SRE book, an SRE should spend at most half of their time on "ops" work, and the rest doing development.

Google places a 50% cap on the aggregate "ops" work for all SREs—tickets, on-call, manual tasks, etc. [...] An SRE team must spend the remaining 50% of its time actually doing development. Source

Some skills are thus paramount to an SRE:

coding / software development
system administration and automation
scalable system design
system troubleshooting

Consequently, each of these areas of expertise can be (and often is) the subject of an interview.

Coding / Software development interview

I've found that the reference resource to prepare for a coding interview, especially when targeting companies like Amazon, Google, Microsoft, Yahoo, etc., is Cracking the Coding Interview, by Gayle Laakmann McDowell. This book is a real trove of advice (technical or not) and example exercises (with the associated solutions).

Even though it targets software developer interviews, I still covered the following topics listed in the Must Know section of the book:

Data structures:

Linked list
Stack
Queue
Heap
Hash table
Binary tree
along with the associated Big-O time and memory complexity of common operations (search, insert, delete, etc.).
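
As a quick refresher, here is a short sketch of how several of these structures map to Python, with the usual complexities noted in comments. The `LinkedList` class and the sample values are hypothetical and only there for illustration; the stack, queue, heap and hash table use the standard library.

```python
import heapq
from collections import deque


class Node:
    """One cell of a singly linked list."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next_node = next_node


class LinkedList:
    """Minimal singly linked list (illustrative, not production code)."""

    def __init__(self):
        self.head = None

    def push_front(self, value):
        # O(1): the new node simply becomes the head.
        self.head = Node(value, self.head)

    def __contains__(self, value):
        # O(n): we may have to walk the whole list.
        node = self.head
        while node is not None:
            if node.value == value:
                return True
            node = node.next_node
        return False


# Stack: append and pop at the end of a list are O(1) amortized.
stack = []
stack.append(1)
stack.append(2)
top = stack.pop()  # 2 (last in, first out)

# Queue: deque gives O(1) appends and pops at both ends.
queue = deque()
queue.append("a")
queue.append("b")
first = queue.popleft()  # "a" (first in, first out)

# Heap: heapq maintains a list as a binary min-heap, push and pop are O(log n).
heap = []
for value in (5, 1, 3):
    heapq.heappush(heap, value)
smallest = heapq.heappop(heap)  # 1

# Hash table: dict lookups, inserts and deletes are O(1) on average.
counts = {}
for word in ("sre", "ops", "sre"):
    counts[word] = counts.get(word, 0) + 1

# Linked list: O(1) insertion at the head, O(n) search.
linked = LinkedList()
linked.push_front("x")
linked.push_front("y")
```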

I found Data structures and Algorithms using Python and C++ to be useful (albeit a bit lengthy) when dealing with these data structures for the first time. This presentation gives a short but to-the-point, no-nonsense introduction to these data structures.

Algorithms:

Mergesort
Quicksort
Binary search
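
For reference, here is a compact sketch of these three algorithms in Python. These are textbook versions written for readability: the quicksort in particular is a simple variant that builds new lists rather than partitioning in place, which costs extra memory.

```python
def merge_sort(items):
    """Return a sorted copy of items. O(n log n) time, O(n) extra space."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])
    # Merge the two sorted halves.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def quicksort(items):
    """Return a sorted copy of items. O(n log n) on average, O(n^2) worst case."""
    if len(items) <= 1:
        return list(items)
    pivot = items[len(items) // 2]
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return quicksort(smaller) + equal + quicksort(larger)


def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if absent. O(log n)."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1
```

Being able to write these from memory, and to explain why binary search requires sorted input, covers a good share of the warm-up questions.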

I also had a look at https://github.com/adicu/interview_help to practice on some real-life interview questions, and at https://github.com/nryoung/algorithms to read Python implementations of common data structures and algorithms.

Scalable system design interview

This was my favorite subject to work on, as an apparently simple question such as "Design the bit.ly service" hides unexpected depths of complexity. Being able to design a scalable system implies knowing about:

DNS
load balancing
micro-service architecture
CAP theorem
consistency patterns
availability patterns
databases
caching
asynchronism patterns
etc

The main idea is to be able to identify the architecture's bottlenecks and to size it with an appropriate number of machines, using some "back-of-the-envelope" calculations, while keeping it robust and fault tolerant.
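
Here is a sketch of what such a back-of-the-envelope calculation could look like for a bit.ly-like service. Every number below (traffic, read/write ratio, record size, per-server capacity, peak factor) is a made-up assumption for the sake of the exercise. In an interview you would state your own and adjust them with the interviewer.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def estimate_capacity(
    new_urls_per_month,
    reads_per_write,
    bytes_per_record,
    retention_years,
    requests_per_server,
    peak_factor,
):
    """Rough sizing of a URL shortener; all inputs are assumptions."""
    # Average write and read rates, assuming a 30-day month.
    writes_per_second = new_urls_per_month / (30 * SECONDS_PER_DAY)
    reads_per_second = writes_per_second * reads_per_write
    # Total storage if every record is kept for the whole retention period.
    storage_bytes = new_urls_per_month * 12 * retention_years * bytes_per_record
    # Servers needed to absorb peak read traffic (no redundancy included).
    servers = math.ceil(reads_per_second * peak_factor / requests_per_server)
    return {
        "writes_per_second": writes_per_second,
        "reads_per_second": reads_per_second,
        "storage_bytes": storage_bytes,
        "servers": servers,
    }


if __name__ == "__main__":
    result = estimate_capacity(
        new_urls_per_month=100_000_000,  # assumption
        reads_per_write=100,             # assumption: read-heavy workload
        bytes_per_record=500,            # assumption
        retention_years=5,               # assumption
        requests_per_server=1_000,       # assumption
        peak_factor=2,                   # assumption: peak is twice the average
    )
    for name, value in result.items():
        print(f"{name}: {value:,.0f}")
```

With these inputs, the estimate is roughly 39 writes and 3,900 reads per second, 3 TB of storage over five years, and 8 servers at peak before adding any spare capacity for failures. The exact figures matter less than showing how each one follows from a stated assumption.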

The most useful resources I found to prepare were:

Scalability lecture given at Harvard
Latency Numbers Every Programmer Should Know
The System Design Primer (I suggest you follow the links after each section for an in-depth follow-up)
this great step-by-step walkthrough on design questions, by HiredInTech
Scaling up to your first 10 million users, talk given by Joel Williams of AWS
Crack the design interview
When to use NoSQL vs SQL

System troubleshooting interview

To be able to automate the administration of a system, one should first know that system in depth, and in a lot of cases it will be GNU/Linux. If you have time, I strongly suggest reading The Linux Programming Interface. Note that this is a large book (my version has 1556 pages) focusing on an old version of the Linux kernel (2.6.x). Fear not! You'll still gain a vast knowledge about how a GNU/Linux system operates. For a quicker tour, you could have a look at the Linux Kernel Internals blog. You'll also find interesting SRE interview questions/answers in this SRE interview questions blogpost.

Julia Evans, also known as b0rk, has written some absolutely fantastic beginner-friendly resources about troubleshooting and networking. I strongly recommend having a look at:

Mastering the tools mentioned (strace, tcpdump, netstat, lsof, ngrep, etc.) gave me some good debugging chops that I have applied in production many times.

Netflix has also written a very nice and thorough blogpost on performance troubleshooting: Linux Performance Analysis in 60,000 Milliseconds, detailing what to check in case of a performance issue.
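
The first check in that post is `uptime`, to see whether the load averages are going up or down. As a small illustration, here is a sketch that reads the same three numbers from text in the `/proc/loadavg` format and compares the 1-minute and 15-minute values. The function names and the sample line are mine, and this is only a toy version of that first step, not a replacement for the tools listed in the post.

```python
def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg content."""
    fields = text.split()
    one, five, fifteen = (float(field) for field in fields[:3])
    return one, five, fifteen


def load_trend(one, fifteen):
    """Compare the 1-minute and 15-minute averages to see where load is heading."""
    if one > fifteen:
        return "rising"
    if one < fifteen:
        return "falling"
    return "steady"


if __name__ == "__main__":
    # Hypothetical sample line; on a Linux host you would read /proc/loadavg.
    sample = "0.20 0.18 0.12 1/80 11206"
    one, five, fifteen = parse_loadavg(sample)
    print(one, five, fifteen, load_trend(one, fifteen))
```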

Wait, there's more

Technical knowledge is one thing, but since SRE is a relatively new discipline, I also wanted to get real-life feedback from real-life SREs. To that end, I watched the following (great) talks:

Case Study: Adopting SRE Principles at StackOverflow, by Tom Limoncelli of Stack Exchange
Love DevOps? Wait until you meet SRE, by Nick Wright, from Atlassian
Panel: training new SREs, with Katie Ballinger (CircleCI), Saravanan Loganathan (Yahoo), Rita Lu (Google), Craig Sebenik (Matterport), Andrew Widdowson (Google)

Oh and one last thing...

I'm super excited to announce I'm joining @datadoghq as an SRE ! pic.twitter.com/Ji1JJQLJ4x
- Balthazar Rouberol (@brouberol) April 19, 2017