Google SRE and Facebook / Meta Production engineer preparation guide

Jun 20, 2021

I will mainly focus on the SRE interview preparation part and share the resources I used to get offers from these two companies. The specific interviews are the SRE-SE interview (System Engineer pipeline) for Google and the Production Engineer interview for Facebook.

Interview rounds

FAANG companies divide interviews into three phases: the recruiter call, the phone screen, and the on-site. Screening starts with the recruiter call.

Facebook

There are eight interviews overall. For the recruiter call, expect around 15 simple questions on Linux, coding, and networking, each with a one- or two-word answer. The phone screen has one coding and one Linux interview. Last but not least, the on-site consists of system design, coding, networking, troubleshooting/Linux, and behavioral interviews.

Google

For Google, there are two pipelines to SRE: SRE-SWE and SRE-SE. The difference is that SWE has two rounds of coding with Leetcode-style questions. In contrast, SE has a round of Linux internals and a practical coding question working with files. I decided to pick the SRE-SE pipeline.

On the recruiter call, you should expect around five in-depth multiple-choice questions on Linux. The phone screen is a 45-minute round that covers both Linux and coding. The on-site is five interviews: non-abstract large system design or NALSD (unique to Google), Googleyness or behavioral, troubleshooting, Linux internals, and coding.

Recruiter call

For Facebook, if you have enough experience as an SRE, you should be able to answer enough questions to pass. Otherwise, brush up on networking (TCP vs. UDP, TCP control bits, etc.), Linux (process states, typical commands, etc.), and coding (know your Big O; you should be pretty comfortable with time/space complexity analysis).

For Google, you need to be pretty comfortable with Linux before taking the call, so look at the Linux preparation section.

Phone Screen

Coding

You should expect file operations and easy/medium Leetcode questions. The most common data structures/algorithms tested are arrays, hash tables, binary search, sorting, and heaps (how and where to use them). Graphs, trees, and backtracking are unlikely. Dynamic programming is entirely out of the picture.

Remember to think of edge cases and ask clarifying questions. For example: What is the file’s encoding? Should I handle parsing issues such as X and Y? What is the typical file size?
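
To make the file-handling style of question a bit more concrete, here is a minimal Python sketch. The log format (one record per line, with the client address as the first whitespace-separated field) and the function name top_clients are assumptions made up for this example, not an actual interview question. It touches the points above: an explicit encoding, tolerance for malformed input, line-by-line reading for large files, and a heap for the top-k step.

```python
import heapq
from collections import Counter


def top_clients(path, k=3, encoding="utf-8"):
    """Return the k most frequent first fields in a log file.

    Assumed (hypothetical) format: one record per line, with the client
    address as the first whitespace-separated field.
    """
    counts = Counter()
    # errors="replace" keeps going on undecodable bytes instead of raising.
    # Iterating over the file object reads one line at a time, so a large
    # file is never loaded into memory in one go.
    with open(path, encoding=encoding, errors="replace") as handle:
        for line in handle:
            fields = line.split()
            if not fields:  # skip blank or whitespace-only lines
                continue
            counts[fields[0]] += 1
    # nlargest keeps a heap of size k: O(n log k) rather than a full sort.
    # Sorting on (count, key) makes ties deterministic.
    return heapq.nlargest(k, counts.items(), key=lambda item: (item[1], item[0]))
```

In an interview, the clarifying questions decide which of these choices matter. For instance, if the file is guaranteed to be small, the heap is unnecessary and a plain sort is easier to explain.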

resources:

LeetCode (leetcode.com)
Grokking the Coding Interview: Patterns for Coding Questions (www.educative.io)

Linux

For the Linux part, you should have a pretty decent understanding of Linux internals (do NOT memorize). For each topic, you should know why it is implemented this way and how it is used.

For example, why are we using virtual memory instead of physical? How is virtual memory translated to physical memory?
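
As a rough way to reason about the translation question, here is a toy Python sketch. It assumes a single-level page table stored in a dict, which is a simplification for illustration: real x86-64 hardware walks multi-level page tables in the MMU and caches results in the TLB. The names translate and PageFault are made up for this example.

```python
PAGE_SIZE = 4096  # 4 KiB pages, a common default on x86-64 Linux


class PageFault(Exception):
    """Raised when a virtual page has no mapping in the toy page table."""


def translate(virtual_address, page_table):
    """Translate a virtual address using a toy single-level page table.

    page_table maps virtual page number -> physical frame number.
    """
    # The high bits select the page, the low bits are the offset inside it.
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    if page_number not in page_table:
        # On a real system the kernel would handle this fault, for example
        # by loading the page from disk or by sending SIGSEGV.
        raise PageFault(f"no mapping for virtual page {page_number}")
    return page_table[page_number] * PAGE_SIZE + offset
```

With page_table = {0: 5, 1: 2}, virtual address 4100 falls in page 1 at offset 4, so it maps to physical address 2 * 4096 + 4 = 8196. The offset is never translated, which is why two processes can use the same virtual address and still end up in different physical frames.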

These are the topics that you should cover:

Virtual memory (Paging, demand paging, anonymous vs. file-backed memory, shared memory, page faults, dirty pages, page cache, swapping, memory mapping, memory protection, memory layout, overcommit, TLB, MMU, OOM, PSI)
Signals (Know key signals such as SIGTERM, SIGTSTP, SIGCHLD, SIGKILL, SIGSEGV, signal handlers, signal masking, default handlers, tracing signals)
Processes (Exec/Fork, Zombie/Orphan, Interruptible and uninterruptible Sleep, Runqueue/Scheduler latency, Completely Fair Scheduler and other scheduler policies, Preemption, Context switching, CPU registers and caches, userspace threads/lightweight threads/coroutines)
Interprocess communication and Locking (Advantages / Disadvantages of each approach, a rough idea of how it’s implemented, what are the system calls)
Networking stack (which part is in the kernel, which part is handled by userspace libraries, common syscalls, sockfs)
Control groups (what it is, how it works) and namespaces (unlikely but good to cover)
System calls (you should know the 12 key system calls, how they are initiated, CPU protection rings, mode switch, userspace vs. kernel space)
Tracing (strace, ltrace, ptrace, perf, user-space tracing, kernel tracing)
Virtual file system or VFS (pseudo-file-systems such as proc, sockfs, pipefs and how shared memory integrates with VFS, file descriptors, open file descriptions table, inodes, NFS, LVM, software RAID, capabilities, extended file attributes, a bit about ACLs and SELinux, SSD vs HDD, path resolving in Linux)
Linux boot process (unlikely but good to cover, rather high-level. BIOS -> MBR -> GRUB -> kernel -> init -> userspace)
Main responsibilities of the kernel and init (remember to ask the questions, e.g., why do we need a kernel?)
Interrupts (what events cause interrupts, interrupt context vs. thread context, how interrupts are executed, top half and bottom half, and a bit about interrupt masking)
Common Linux tools (iostat, vmstat, top, pidstat, uname, touch, rm, cd, kill, iotop, mount, df, du, lsof, etc. you should know when they are used, how they are used, and what is the output)
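
Several of these topics meet in the proc pseudo-file-system, where process states such as zombie or uninterruptible sleep are visible as plain text. The sketch below parses one line in the /proc/<pid>/stat layout (pid, command name in parentheses, state, parent pid, then further fields). The function name parse_stat and the sample line in the note after it are made up for illustration.

```python
STATE_NAMES = {
    "R": "running or runnable",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep (usually waiting on I/O)",
    "T": "stopped by a signal",
    "Z": "zombie (exited, not yet reaped by its parent)",
}


def parse_stat(stat_line):
    """Extract pid, command name, state and parent pid from a stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ")" characters, so split around the LAST closing parenthesis.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren])
    comm = stat_line[open_paren + 1:close_paren]
    rest = stat_line[close_paren + 1:].split()
    state, ppid = rest[0], int(rest[1])
    return {
        "pid": pid,
        "comm": comm,
        "state": state,
        "ppid": ppid,
        "state_name": STATE_NAMES.get(state, "other"),
    }
```

For a made-up line such as "1234 (my (odd) proc) Z 1 1234 1234 0 -1 4194304", this reports state Z with parent pid 1. Being able to explain why a naive split on spaces would break here is the kind of "how is it used" understanding the list above is after.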

Also, you should avoid getting too deep into the internals. Knowing what kinds of data structures are used or how a particular feature is implemented is not important. You are not expected to contribute code to the kernel, write drivers, or kernel modules.

resources:

https://landley.net/writing/memory-faq.txt
https://blog.ndk.name/preparing-for-the-sre-technical-interview/
https://man7.org/linux/man-pages/man7/signal.7.html
https://chrisdown.name/2018/01/02/in-defence-of-swap.html
https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
learnlinuxconcepts.blogspot.com
https://man7.org/linux/man-pages/man7/sysvipc.7.html
https://www.slideshare.net/JrmeKehrli/linux-and-java-understanding-and-troubleshooting
https://www.youtube.com/watch?v=7aONIVSXiJ8
The Linux Programming Interface: A Linux and UNIX System Programming Handbook (Skim through code sections)
Linux Kernel Development by Robert Love
Operating System Concepts (book)

On-site

When you reach on-site, the recruiter will provide you with a document that covers the topics of the interviews and what you should expect. Treat it just as another source of info. The document isn’t going to be comprehensive at all.

Coding and Linux Internals

These are the same as the phone screens, but a bit harder and more in-depth.

Network

Google, unfortunately, doesn’t ask any network-related questions, so this is just for Facebook. This interview has a bit less weight than the other interviews, so if you are tight on time, put more effort into the others.

The interview is pretty easy and simple. You are meant to be leading the interview, and the questions are quite open-ended so that they can cover what you know. Expect things like: what happens when you type facebook.com into the browser? (I talked non-stop for roughly 30 minutes on this question.) You should at least cover the basics, which are DNS, TCP, and HTTP. But you can talk about DHCP, SLAAC, IPv4, IPv6, BGP, OSPF, iBGP, NAT, QUIC, UDP, ICMP, and on and on.

Also, you should know at least one network protocol in depth (e.g., TCP or HTTP) and a bit about troubleshooting common network problems and the tools for it (ping, mtr, traceroute, arp, ip, route, netstat, iperf, etc.).
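
One way to rehearse the DNS, TCP, and HTTP basics is to do the three steps by hand. The Python sketch below is a deliberately stripped-down illustration, not what a browser actually does: it speaks plain HTTP over one socket and skips TLS, caching, redirects, and everything else worth talking about in the interview. The function name fetch_status_line is made up for this example.

```python
import socket


def fetch_status_line(host, port=80, path="/"):
    """Resolve a name, open a TCP connection, and send one HTTP request."""
    # Step 1, name resolution: for a hostname this goes through the
    # resolver (hosts file, DNS); a numeric address is returned as-is.
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(5)
        # Step 2, TCP: connect() returns once the three-way handshake is done.
        sock.connect(sockaddr)
        # Step 3, HTTP: a request is just text sent over that connection.
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        response = b""
        # Read until the server closes the connection.
        while chunk := sock.recv(4096):
            response += chunk
    # The first line of the response is the status line.
    return response.split(b"\r\n", 1)[0].decode("ascii")
```

Each step maps to a tool from the list above when something goes wrong: name resolution problems show up before any packet reaches the server, a hanging connect() points at routing or firewalls, and a bad status line means the network path is fine and the problem is higher up.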

resources:

TCP/IP illustrated Volume 1 (Only read if you have enough time; otherwise, go through the Stanford course)
Stanford introduction to computer networking CS144 (really recommended, concisely covers most important topics)

Behavioral

I didn’t spend much time on behavioral preparation, but I can give you some tips. Facebook and Google behavioral interviews are completely different things.

At Facebook, the interviewer goes through your past experiences and asks basic questions such as why Facebook, why production engineering, and stuff like that. It helps to go through the example questions your recruiter provides and think of relevant past experiences beforehand. The interview is meant to make sure you are a decent, functioning person.

At Google, the interviewer asks mainly hypothetical questions, which are quite abstract and ambiguous. You are expected to ask clarifying questions to come up with a reasonable answer. My suggestion is to brush up on agile (focus on the concept, not the names, such as breaking up a project into small deliverable sub-tasks) and don’t get hung up on the specifics. Also, get familiar with OKRs and take a look at rework. You can find most of the questions asked online as well.

System design

This is just for Facebook. Google does a non-abstract version which is different from a typical system design interview.

The system design for the Production engineering role is focused on infra tools similar to Kubernetes, Jenkins, BitTorrent, etc., so my suggestion is to study how these systems are implemented to draw inspiration for your own design.

The most important things in the interview are:

Ask clarifying questions (how many queries? what is the data size? how many deploys? etc.)
Think about edge cases and what’s going to happen when things fail (concurrency issues, power outages, etc.)
Make sure your system doesn’t have a single point of failure, and it’s scalable.

There are many examples available on system design interviews and how they work. However, they are mainly for SWE positions and don’t focus on infra tools. The nature of the interviews is the same, though, so they are still somewhat useful.

You should know some basics: etcd/ZooKeeper, distributed locking, concurrency issues, isolation levels, queues, S3. For practice, you can design a job scheduler.
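
If you pick the job scheduler as a practice problem, it can help to start from the smallest possible core and then ask what breaks. The Python sketch below is that core only, under loud assumptions: one process, all state in memory, and timestamps passed in as plain numbers. The class and method names are made up for this example.

```python
import heapq
import itertools


class JobScheduler:
    """Single-process toy: hand out jobs in order of their scheduled time."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so jobs with the same time come out in submit order.
        self._counter = itertools.count()

    def submit(self, run_at, name):
        """Schedule job `name` to become due at time `run_at`."""
        heapq.heappush(self._heap, (run_at, next(self._counter), name))

    def due(self, now):
        """Pop and return the names of all jobs scheduled at or before `now`."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            _, _, name = heapq.heappop(self._heap)
            ready.append(name)
        return ready
```

Everything the interview cares about is what this sketch leaves out. The heap lives in one machine's memory, so that machine is a single point of failure; a job popped by due() is lost if the worker crashes before finishing it; and two scheduler instances would need something like etcd/ZooKeeper-style locking to avoid running the same job twice.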

resources:

My other blogpost for E5+ system design preparation

NALSD

This interview is unique to Google, and there are limited resources available for it. The design is pretty low-level, and you are expected to come up with a bill of materials.

Same as any other system design interview, you are given a vague question. You should ask clarifying questions and come up with numbers regarding storage/network/IOPS, then come up with an estimate as to how many machines are required and the main bottleneck of the system.
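
The estimation step can be practiced as simple arithmetic. The Python sketch below assumes three resources (storage, IOPS, network) and a single machine type; the function name, parameter names, and every number in the usage note are invented for illustration and are not real hardware figures or an actual interview question.

```python
import math


def machines_needed(total_storage_tb, peak_iops, peak_egress_gbps,
                    disk_tb_per_machine, iops_per_machine, nic_gbps_per_machine):
    """Estimate the machine count per resource and report the bottleneck."""
    per_resource = {
        "storage": math.ceil(total_storage_tb / disk_tb_per_machine),
        "iops": math.ceil(peak_iops / iops_per_machine),
        "network": math.ceil(peak_egress_gbps / nic_gbps_per_machine),
    }
    # The resource that demands the most machines is the bottleneck and
    # sets the size of the fleet.
    bottleneck = max(per_resource, key=per_resource.get)
    return per_resource[bottleneck], bottleneck, per_resource
```

For example, with 500 TB of data, 200,000 peak IOPS, and 30 Gbps of egress on machines that each offer 48 TB of disk, 5,000 IOPS, and a 10 Gbps NIC, storage needs 11 machines, IOPS needs 40, and network needs 3. The fleet is therefore 40 machines and the bottleneck is IOPS, which is the cue to discuss faster disks or caching rather than more storage.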

To be truly successful, you should know a simple sharding strategy and how the assignment is done, SSTables and memtables, write-ahead logging, and SLOs (what kinds of SLIs are appropriate for each type of system).
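
To see how write-ahead logging, memtables, and SSTables fit together, here is a toy Python sketch of the write path. It is a teaching model under stated assumptions, not how any real storage engine is implemented: the log and the sorted runs are plain in-memory lists standing in for files on disk, there is no compaction, and the class name TinyLSM is made up.

```python
import bisect


class TinyLSM:
    """Toy key-value store: write-ahead log -> memtable -> sorted runs."""

    def __init__(self, memtable_limit=2):
        self.wal = []         # stand-in for an append-only log file on disk
        self.memtable = {}    # recent writes, held in memory
        self.sstables = []    # immutable sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.wal.append((key, value))  # 1. log first, so a crash can be replayed
        self.memtable[key] = value     # 2. then apply the write in memory
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the memtable out as one sorted, immutable run.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.wal = []  # flushed entries no longer need replaying

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest run wins
            # (key,) sorts just before any (key, value) pair, so bisect_left
            # lands on the entry for `key` if the run contains it.
            index = bisect.bisect_left(table, (key,))
            if index < len(table) and table[index][0] == key:
                return table[index][1]
        return None
```

The design point to draw out is that every write is a sequential append plus an in-memory update, which is cheap in IOPS, while a read may have to check several runs. That asymmetry feeds directly into the storage and IOPS numbers you estimate for the bill of materials.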

resources:

https://sre.google/workbook/non-abstract-design/
https://www.educative.io/courses/grokking-adv-system-design-intvw
https://sre.google/classroom/
https://research.fb.com/publications/rocksdb-evolution-of-development-priorities-in-a-key-value-store-serving-large-scale-applications/
My Path to Site Reliability Management (danrl.com)
https://www.youtube.com/watch?v=swfurPw8c6A

Troubleshooting

The troubleshooting interviews at Google and Facebook are completely different. Facebook focuses on practical, open-ended issues that you probably encounter during your day-to-day tasks, such as latency problems (e.g., a database is running slow). Google comes up with weird, specific scenarios that practically never happen (the interviewer himself agreed). So I will break it up into two parts.

Facebook:

The interview is like Dungeons and Dragons: you query the interviewer, and he provides you with an answer (for example, you say you are going to run top, and he tells you that you see high load but low CPU utilization). He is very interested in your thought process and how you go about troubleshooting, and less in solving the problem itself.

The most useful resource for tackling real production problems is Brendan Gregg’s work. So watch his YouTube videos and read his book and his blog. Also, go through the topics provided by your recruiter. Try to come up with some problems yourself and try to debug those. For example, you could have a latency problem caused by a noisy neighbor that resides in the same rack and takes the whole bandwidth.

Google:

The format of the interview is still querying the interviewer and getting a response. Still, most of the time, they will provide you with a command output instead of telling you the most important piece of information.

To be honest, I still don’t know how you can prepare for this interview. My suggestion is to talk to a Google SRE about this type of interview to find out more beforehand. While Facebook’s questions were open-ended, Google asked multiple specific questions that you either could or couldn’t answer, without much room to explore.

Also, Brendan's work was basically useless for this interview as none of the questions covered practical troubleshooting scenarios. For example, they could give you a question where the system doesn’t boot because some specific file isn’t in the right format!

Negotiation tips

Here are some tips so you can avoid the mistakes I made:

1. DO NOT accept an offer without competing offers.

2. DO NOT believe anything the recruiter says. They will deceive you and twist words. The recruiter is not your friend.

3. DO NOT believe for a second that these or any other companies want or care about paying fairly. They want to pay the minimum for the level.

4. Be prepared to walk away.

Final Notes

I believe it’s crucial to get a feel for what you should expect during the interview and what kinds of signals you need to give. So I strongly recommend scheduling mock interviews with peers or, much more preferably, actual Google or Facebook employees.

The interview prep took me roughly 3 months, putting in 30–40 hours a week.

Overall resources:

https://leetcode.com/discuss/interview-experience/707265/Facebook-Apple-Amazon-or-Production-Engineer-EE-SRE-SysDE-or-London-or-May-2020-Offer
https://interviewthoughts.quora.com/My-Site-Reliability-Engineer-Interview-with-Google-Dublin
SRE Interview Questions (syedali.net)
mxssl/sre-interview-prep-guide (github.com)
rishiloyola/SRE-Interviews (github.com)
Interview Preparation (prepfully.com)
https://fabrizio2210.medium.com/how-i-get-a-job-at-google-as-sre-83d44aef7859
https://pramp.com


Written by Underpaid