Meta Production Engineer System Design Preparation Guide

Mar 17, 2022

I have read all of these resources multiple times, some of them more than five times, until I understood the trade-offs each system makes and why, how it fails, how it reacts to network and hardware issues, how it scales, and where its bottlenecks are. If you do the same, you can get E5+ as well.

Notes:

The most important signals in an E5 interview are talking about trade-offs, being able to lead the interview, and mentioning a few rare scenarios.
The most important component to understand fully is ZooKeeper, the beating heart of every major distributed system. I would dare say that if you are designing a system and ZooKeeper isn’t in it, you are doing something wrong.
Fully understand these concepts and how they work: consistent hashing and static sharding, membership protocols and gossip-based protocols, Paxos-based systems, leaderless vs. leader-based systems, all the different consistency levels (sequential, causal, eventual, etc.), logical and physical clocks, leader election with ZooKeeper, push vs. pull, failure domains, and storage engines.
When designing, think about these things: concurrency, failures, scale, bottlenecks, and rare scenarios (copy file -> what if the disk is full?).
This is a massive undertaking that should prepare you to overwhelm the interviewer with sheer knowledge. But that isn’t possible in 40 minutes :( so do a few mocks to make sure you give the right signals.
If you are short on time, just read: the ZooKeeper paper, the Twine paper, the RAS paper, the LAD video, and the Optimal Workload Placement blog post.
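
To make one of the concepts in the notes above concrete, here is a minimal sketch of the difference between static sharding and consistent hashing when a node is added. It is an illustration under assumptions, not how any particular system in the resources below implements it: the node names, the key set, the use of MD5 as the hash, and the choice of 100 virtual nodes per server are all made up for the example.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def static_shard(key: str, nodes: list[str]) -> str:
    """Static sharding: owner = hash(key) mod number of nodes."""
    return nodes[stable_hash(key) % len(nodes)]


class HashRing:
    """Consistent hashing ring with virtual nodes (the vnode count is an arbitrary choice)."""

    def __init__(self, nodes: list[str], vnodes: int = 100) -> None:
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (stable_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def owner(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


def moved_fraction(before, after, keys) -> float:
    """Fraction of keys whose owner differs between two placement functions."""
    moved = sum(1 for k in keys if before(k) != after(k))
    return moved / len(keys)


# Hypothetical fleet: four nodes growing to five.
keys = [f"key-{i}" for i in range(10_000)]
old_nodes = ["node-a", "node-b", "node-c", "node-d"]
new_nodes = old_nodes + ["node-e"]

static_moved = moved_fraction(
    lambda k: static_shard(k, old_nodes), lambda k: static_shard(k, new_nodes), keys
)
old_ring, new_ring = HashRing(old_nodes), HashRing(new_nodes)
ring_moved = moved_fraction(old_ring.owner, new_ring.owner, keys)

print(f"static sharding moved {static_moved:.0%} of keys")
print(f"consistent hashing moved {ring_moved:.0%} of keys")
```

Going from four nodes to five, the modulo scheme should remap roughly 80% of keys (a key keeps its owner only when its hash mod 4 equals its hash mod 5), while the ring should remap roughly one fifth of them, and only onto the new node. That is the kind of trade-off worth saying out loud in the interview: static sharding is simpler to reason about, but resizing it is expensive.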

Resources

These are the ones I found most useful:

DDIA (Designing Data-Intensive Applications)

Grokking the system design interview — Advanced

Owl (distributing files across a fleet)

https://engineering.fb.com/2022/07/14/data-infrastructure/owl-distributing-content-at-meta-scale/

Dynamo http://www.cs.cornell.edu/courses/cs5414/2017fa/papers/dynamo.pdf

Cassandra

https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf

Gossip Protocol

“SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol”

https://arxiv.org/abs/1707.00788

Kafka

http://notes.stephenholiday.com/Kafka.pdf

Chubby https://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf

GFS https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf

HDFS

https://storageconference.us/2010/Papers/MSST/Shvachko.pdf

BigTable https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf

ZooKeeper

https://www.usenix.org/legacy/event/atc10/tech/full_papers/Hunt.pdf

SpannerDB

https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf

Raft

https://raft.github.io/raft.pdf

Paxos

https://lamport.azurewebsites.net/pubs/paxos-simple.pdf

LogDevice

https://logdevice.io/docs/Writepath.html

Borg

https://research.google/pubs/pub43438/

Twine

Twine: A Unified Cluster Management System for Shared Infrastructure (Meta Research, research.facebook.com)

Shard manager

Shard Manager: A Generic Shard Management Framework for Geo-distributed Applications (research.facebook.com)

RAS

https://research.facebook.com/publications/ras-continuously-optimized-region-wide-datacenter-resource-allocation/

Data distribution

Facebook post (www.facebook.com; login required to view)

Failure Domains

Making Facebook self-healing: Automating proactive rack maintenance (engineering.fb.com)

https://engineering.fb.com/2020/09/08/data-center-engineering/fault-tolerance-through-optimal-workload-placement/

System Design Interview

Production Engineer

Written by Underpaid
