Meta Production Engineer System Design Preparation Guide

Underpaid
Mar 17, 2022

I have read all of these resources multiple times, some of them more than five times, until I understood the trade-offs each system made and why, how the systems fail, how they react to network and hardware issues, how they scale, and where the bottlenecks are. If you do the same, you can get E5+ as well.

Notes:

  • The most important signals for an E5 interview are talking about trade-offs, being able to lead the interview, and mentioning a few rare scenarios.
  • The most important component to fully understand is ZooKeeper, the beating heart of every major distributed system. I would dare say that if you are designing a system and ZooKeeper isn’t in it, you are doing something wrong.
  • Fully understand these concepts and how they work: consistent hashing and static sharding, membership and gossip-based protocols, Paxos-based systems, leaderless vs. leader-based systems, the different consistency levels (sequential, causal, eventual, etc.), logical and physical clocks, leader election with ZooKeeper, push vs. pull, failure domains, and storage engines.
  • When designing, think about these things: concurrency, failures, scale, bottlenecks, and rare scenarios (copy file -> what if the disk is full?).
  • This is a massive undertaking that should prepare you to overwhelm the interviewer with sheer knowledge. But that isn’t possible in 40 minutes :( so do a few mocks to make sure you give the right signals.
  • If you are short on time just read: Zookeeper paper, Twine paper, RAS paper, LAD Video, Optimal Workload placement Blog.
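
To make the first concept on that list concrete, here is a minimal sketch of a consistent hash ring with virtual nodes, roughly in the spirit of what the Dynamo and Cassandra papers describe. The class name `HashRing`, the `vnodes` parameter, and the choice of SHA-256 are my own assumptions for illustration, not taken from any of the papers or from a real library.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (a 64-bit integer)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent hash ring; each node owns `vnodes` points."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            if point in self._owner:
                continue  # extremely unlikely collision: keep the existing owner
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        """Return the node owning the first ring point clockwise from the key."""
        if not self._points:
            raise LookupError("ring is empty")
        i = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[i]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}

ring.add("node-d")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to node-d")
```

The trade-off worth talking about in an interview: with static sharding (`hash(key) % N`), changing `N` reshuffles most keys, while on the ring only the keys that now land on the new node's points move. The cost is extra metadata (the ring itself) and uneven load if you use too few virtual nodes.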

Resources

These are the ones I found most useful:

DDIA (Designing Data-Intensive Applications)

Grokking the system design interview — Advanced

Owl (distributing files across a fleet)

https://engineering.fb.com/2022/07/14/data-infrastructure/owl-distributing-content-at-meta-scale/

Dynamo http://www.cs.cornell.edu/courses/cs5414/2017fa/papers/dynamo.pdf

Cassandra

https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf

Gossip Protocol

“SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol”

https://arxiv.org/abs/1707.00788

Kafka

http://notes.stephenholiday.com/Kafka.pdf

Chubby https://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf

GFS https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf

HDFS

https://storageconference.us/2010/Papers/MSST/Shvachko.pdf

BigTable https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf

Zookeeper

https://www.usenix.org/legacy/event/atc10/tech/full_papers/Hunt.pdf

SpannerDB

https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf

Raft

https://raft.github.io/raft.pdf

Paxos

https://lamport.azurewebsites.net/pubs/paxos-simple.pdf

LogDevice

https://logdevice.io/docs/Writepath.html

Borg

https://research.google/pubs/pub43438/

Twine

Shard manager

RAS

https://research.facebook.com/publications/ras-continuously-optimized-region-wide-datacenter-resource-allocation/

Data distribution

Failure Domains

https://engineering.fb.com/2020/09/08/data-center-engineering/fault-tolerance-through-optimal-workload-placement/
