I have read all of these resources multiple times, some more than five times, until I understood the trade-offs they made and why, how the systems fail, how they react to network and hardware issues, how they scale, and where the bottlenecks are. If you do the same, you can get E5+ as well.
Notes:
- The most important signals for an E5 interview are talking about trade-offs, being able to lead the interview, and mentioning a few rare scenarios.
- The most important component that you have to fully understand is ZooKeeper, the beating heart of every major distributed system. I would dare say that if you are designing a system and ZooKeeper isn’t in it, you are doing something wrong.
- Fully understand these concepts and how they work: consistent hashing and static sharding, membership protocols and gossip-based protocols, Paxos-based systems, leaderless vs. leader-based systems, all the different consistency levels (sequential, causal, eventual, etc.), logical and physical clocks, leader election with ZooKeeper, push vs. pull, failure domains, storage engines.
- When designing, think about these things: concurrency, failures, scale, bottlenecks, rare scenarios (copy file -> what if the disk is full?).
- This is a massive undertaking that should prepare you to overwhelm the interviewer with sheer knowledge. But you can’t show all of it in 40 minutes :( so do a few mocks to make sure you give the right signals.
- If you are short on time, just read: the ZooKeeper paper, the Twine paper, the RAS paper, the LAD video, and the Optimal Workload Placement blog.
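To make the "consistent hashing and static sharding" item from the notes concrete, here is a minimal sketch of the trade-off that usually comes up in interviews: how many keys change owner when you add one node. Everything in it (the `HashRing` class, `static_shard`, the node names, 100 virtual nodes per server, md5 as the hash) is my own illustrative assumption, not taken from any of the papers below.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a stable 64-bit integer (md5 is used only to spread keys)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each physical node owns many points on the ring to even out load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def lookup(self, key):
        # Owner is the first virtual node clockwise from the key, wrapping around.
        idx = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


def static_shard(key, nodes):
    """Static sharding: owner is simply hash(key) mod number of nodes."""
    return nodes[_hash(key) % len(nodes)]


def moved_fraction(before, after, keys):
    """Fraction of keys whose owner differs between two placement functions."""
    return sum(1 for k in keys if before(k) != after(k)) / len(keys)


keys = [f"user:{i}" for i in range(10_000)]
old_nodes = ["a", "b", "c", "d"]
new_nodes = old_nodes + ["e"]  # scale out by one node

ring_old = HashRing(old_nodes)
ring_new = HashRing(new_nodes)

static_moved = moved_fraction(
    lambda k: static_shard(k, old_nodes),
    lambda k: static_shard(k, new_nodes),
    keys,
)
ring_moved = moved_fraction(ring_old.lookup, ring_new.lookup, keys)

print(f"static sharding moved {static_moved:.0%} of keys")
print(f"consistent hashing moved {ring_moved:.0%} of keys")
```

Going from 4 to 5 nodes, modulo sharding should reshuffle roughly four fifths of the keys, while the ring should move only about one fifth, and only onto the new node. That is the trade-off to say out loud: static sharding is simpler and gives exact balance, but resharding is expensive; consistent hashing makes membership changes cheap at the cost of some imbalance and extra bookkeeping.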
Resources
These are the ones I found most useful:
DDIA (Designing Data-Intensive Applications)
Grokking the system design interview — Advanced
Owl (distributing files across a fleet)
https://engineering.fb.com/2022/07/14/data-infrastructure/owl-distributing-content-at-meta-scale/
Dynamo http://www.cs.cornell.edu/courses/cs5414/2017fa/papers/dynamo.pdf
Cassandra
https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf
Gossip Protocol
“SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol”
https://arxiv.org/abs/1707.00788
Kafka
http://notes.stephenholiday.com/Kafka.pdf
Chubby https://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf
GFS https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
HDFS
https://storageconference.us/2010/Papers/MSST/Shvachko.pdf
BigTable https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
ZooKeeper
https://www.usenix.org/legacy/event/atc10/tech/full_papers/Hunt.pdf
Spanner
https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf
Raft
https://raft.github.io/raft.pdf
Paxos
https://lamport.azurewebsites.net/pubs/paxos-simple.pdf
LogDevice
https://logdevice.io/docs/Writepath.html
Borg
https://research.google/pubs/pub43438/
Twine
Shard manager
RAS
Data distribution
Failure Domains