System Design · The Long Read

The Forces,
Not the Parts

Most people learn system design as a glossary. DNS, sharding, idempotency, a hundred terms with no connective tissue. Here is the other way: a small set of forces, and the moves engineers make to relieve them. Learn the pressures and the catalog falls out on its own. We start with one box, push until it breaks, and by the end we have designed YouTube together.

Boyan Balev 30 min read Junior to Staff The Trenches Field Manual

Field manual · build the smallest thing, then earn every box

The whole arc, in one picture assembles as you read

One box on the left. Every component to its right earns its place by relieving a pressure that box could not survive. We add them one at a time, and we name the force each one answers. Cool nodes are moves we make. Warm labels are the forces that pushed us.

one server, holding everything

Read it left to right. We will earn every node, in order, and you will know exactly why each one is there. Nothing here is magic, and nothing here is memorized.

The gap between a junior engineer and a senior one is not years. It is how they reason about systems. A junior reaches for the right noun. A senior reaches for the right tradeoff. When you can name what a system actually needs and then choose deliberately, the interview round stops being a memory test and becomes a conversation, and your day job stops being a pile of tools and becomes a sequence of decisions you can defend.

Every concept in this guide is an answer to a specific pressure. Statelessness answers "how do I add a second server without breaking logins." Caching answers "this read is too slow and the database is on fire." A message queue answers "one slow dependency keeps taking down a flow that does not need it." Learn the pressure first and the technique is obvious. Learn the technique first and you are carrying a glossary you will freeze on under stress, because you will not know which term the moment in front of you is asking for.

So we will build a system the way you actually build one in real life. Start with the smallest thing that works. Push on it until it breaks. Name the force that broke it. Apply the smallest move that fixes it. Repeat. By the end the whole picture above will be assembled, and you will have watched every piece arrive for a reason you can state in one sentence.

You do not study system design by collecting parts. You study the forces, and the parts arrive exactly when you need them.

00 / The FrameHow to think before you draw

Before a single box goes on the whiteboard, there is an order of operations. It works in interviews because it works in real life, and it is the difference between a design that grows cleanly and one that gets bolted together in a panic.

The mistake under pressure is to start drawing. Resist it. Drawing first means you commit to a shape before you understand the problem, and then you spend the rest of the session defending a shape you chose blind. The same five steps that pass a system design round are the steps a careful engineer runs in their head before touching any infrastructure.

Requirements, in two flavors. Functional requirements are the features, the "users should be able to" statements. Non functional requirements are the qualities: scalability, latency, availability, durability, consistency. The functional list is usually short and obvious. The non functional list is where the real engineering lives, because that is where the tradeoffs hide, and tradeoffs are the entire job.
Core entities. The nouns that get stored and exchanged. A video, a user, an order. You do not need the full schema yet. You need to know what the system is fundamentally about, because the entities decide what your data tier has to protect.
The API. The contract with the outside world. Derive it straight from the functional requirements, roughly one endpoint per feature. This is the surface everyone downstream depends on, and it is the first thing that becomes expensive to change.
High level design. The simplest set of components that satisfies the features. Not fast, not scaled, not fault tolerant. Just correct. You are proving the idea works before you make it survive the real world.
Deep dives. Now walk the non functional list one item at a time and evolve the design until each quality is met. This is the fun part, and it is where senior signal shows up, because this is where you trade one thing for another out loud.

Notice the shape. You earn the right to talk about scale only after the thing is correct. People who lead with "I would use Kafka and shard the database" before they know what the system does are performing competence, not demonstrating it.

Learn to count before you scale

There is one more skill that lives inside the Frame, and almost nobody teaches it: rough arithmetic. You cannot reason about scale if you cannot estimate it, and estimation is mostly knowing a handful of numbers and being willing to round hard. A day has about 86,400 seconds, which you should round to 100,000, or ten to the fifth. That single move turns "one million writes a day" into roughly ten writes per second on average, and if you assume traffic peaks at five times the average, you are designing for about fifty writes per second. That is a small number. Knowing it is small stops you from over engineering, which is the most common and most expensive mistake there is.

The other numbers worth carrying in your head are the latencies, because they decide where work is allowed to happen. They span many orders of magnitude, and the whole craft of performance is keeping hot work near the fast end of this ladder.

Numbers every engineer should know log scale, one frame

Latency, drawn on a logarithmic axis so a single picture can hold the gap between a CPU cache and a trip across the ocean. Each step right is roughly ten times slower than the one before. This is why "just put it in memory" and "just hit the database" are not the same sentence.

From an L1 cache hit to a packet crossing the Atlantic is a factor of about one hundred million. Every caching decision in this article is an attempt to stay on the left side of this chart.

The useful intuition: memory is a thousand times faster than disk, disk is faster than the network within a datacenter, and crossing the planet costs you more than everything else combined. Where data lives is a latency decision before it is anything else.

01 / The First ForkConsistency or availability

The very first deep dive, almost every time, is a single question that decides the personality of your whole system. It comes from the CAP theorem, and it is taught badly almost everywhere.

CAP is usually presented as "pick two of three: consistency, availability, partition tolerance." That framing is misleading and it makes people memorize a triangle instead of understanding a choice. In any real system spread across a network, partitions are not optional. Cables get cut, packets drop, a switch reboots, two nodes briefly lose sight of each other. Partition tolerance is a fact you live with, not a slider you set. So the triangle collapses to a line.

The real choice is between the other two, and only during a partition. Consistency means every read returns the most recent write. Availability means the system always answers, even if it cannot promise the answer is the very latest. When a partition splits your nodes, a write lands on one side, and a read arrives on the other side, you have exactly two options. Refuse to answer rather than serve data you cannot vouch for, or answer with what you have and accept it might be stale. That is the whole decision, and it is worth feeling rather than reciting.

CAP, lived not memorized

A partition splits your nodes. A write lands on Node A. Now a read arrives at Node B, which cannot reach A. What does B do? Toggle the choice and watch the answer it gives.

Here is the part that separates the strong answer from the weak one. Almost no real system makes this choice once and globally. Different operations choose differently. A bank transfer and a permission check demand consistency, because a stale answer is a security hole or lost money. A content feed and a view count are perfectly happy being eventually consistent, because nobody is harmed if the like count is a few seconds behind. The senior answer is never "I will use eventual consistency." It is "these operations need strong consistency and here is exactly why, these others do not and here is why they are safe."

02 / One BoxStart with a single server

Every complex system started simple. So do we. One server holds the app, the database, the cache, everything, and it serves a small group of happy users.

A user types app.demo.com and hits enter. Their browser asks DNS, the domain name system, which maps that human readable name to an IP address. Think of DNS as the phone book of the internet: you know the name, DNS gives you the number. It hands back your server's address. Now the client has a location, opens a connection, sends an HTTP request, and your server does the work and sends back a response. HTML for a browser, JSON for a mobile app. That is the entire loop, and for a real product with a real but modest user base, it is genuinely enough. Do not let anyone shame you out of shipping it.

The request, end to end

The client asks DNS where the server lives, gets an address back, then talks to the server directly. Watch one request travel the whole loop.

This is ideal for a small user base. It is also fragile in three specific ways, and naming those three weaknesses gives us the entire roadmap for the rest of this guide.

That single box has exactly three weaknesses, and almost everything that follows is a cure for one of them.

It has a ceiling. You can only buy so much RAM and CPU for one machine. Past a point you physically cannot grow it, no matter how much money you throw at it.
It has no twin. If it dies, the whole product dies with it. There is nothing to fall back to, so a single hardware fault is a total outage.
It remembers things. It holds sessions, carts, and login state in local memory. That memory feels harmless now, and it is the single hidden reason scaling is hard, as we are about to see.

03 / The Ceiling and the TwinScaling, and the catch nobody mentions

There are two ways to grow. One is honest and limited. The other is unlimited but demands a discipline most tutorials skip right past, and that discipline is the whole game.

Vertical scaling, scaling up, means a bigger box. More RAM, more CPU, faster disk, all on the same machine. It is the simplest move and it genuinely works for low and moderate traffic, often for far longer than people expect. Do not sneer at it. The catch is the two weaknesses it cannot fix: hardware has a real maximum, so the ceiling is still there, and a single box still has no twin, so a crash is still a complete outage. Vertical scaling buys you headroom, not resilience.

Horizontal scaling, scaling out, means more boxes. Three servers instead of one, sharing the load behind something that spreads requests across them. This is the path past the ceiling. It gives you fault tolerance, because a death leaves survivors who keep serving. And it scales almost linearly, because growing means adding another identical machine. It is the answer for anything large, and it is what every system you admire actually does.

Here is the catch nobody mentions when they say "just add servers." You can only add servers if the servers are interchangeable. And they are only interchangeable if they are stateless, meaning no single server holds something that only it knows.

Why state decides everything

A stateless server remembers nothing about you between requests. Every request carries what the server needs, usually a token the client holds. Flip the switch and watch what the load balancer is suddenly allowed to do.

When state lives on a specific server, that user is chained to it. The load balancer's hands are tied, the servers are not really interchangeable, and if that one box dies the session dies with it and the user gets logged out. Move the state into a shared store like Redis and every server becomes identical. Now the balancer routes freely, you add and remove servers without losing anyone's session, and you get real fault tolerance. So before you ask "how do I scale this," ask "what state is each server holding, and where should it actually live."

The load bearing idea

Horizontal scaling is not really about adding machines. It is about making your machines interchangeable, and statelessness is what makes them interchangeable. Almost every hard scaling problem you will ever meet is a state problem wearing a costume.

04 / The Traffic CopLoad balancers, and how they actually decide

Once you have twins, something has to decide which twin handles each request. That something is a load balancer, and it sits in front of your fleet as the single front door everyone walks through.

People list load balancing algorithms like trivia, as if you need to recite all seven on demand. The useful way to hold them is by the question each one answers. You almost never pick an algorithm in the abstract. You recognize the force you are facing and reach for the valve that relieves it. Play with the simulator below, then read the strategies as answers to situations rather than as a menu.

Load balancer, decision by decision live simulator

Pick a strategy, then send requests and watch where they land and why. The point is not the animation. It is feeling how each strategy spreads load differently, so you can match the algorithm to the situation in front of you.

Round robin hands requests out in a rotation: server one, then two, then three, then back to one. The simplest thing that works when every server is roughly identical.

The strategies, by the question each one answers.

Round robin. Hand requests out in strict rotation. Best when every server is roughly identical and every request costs about the same. It is the boring default, and the boring default is correct more often than people admit.
Least connections. Send the next request to whoever has the fewest open connections right now. Best when sessions vary wildly in length, because raw rotation would happily pile three long lived connections onto one box while another sits idle.
Least response time. Favor the server that is both responsive and lightly loaded, blending latency with connection count. Best when servers differ in capability and you care most about the slowest requests.
Weighted. Give bigger machines bigger shares. Best when your fleet is not uniform, say one box with sixty four gigabytes of RAM standing next to two with sixteen. You hand the strong server proportionally more work.
IP hash. Hash the client address so a given client always lands on the same server. Useful when a server holds something about that client. Notice the tension though: this fights statelessness, and if you find yourself reaching for it, that is usually a sign the state belongs in a shared store instead of being pinned to a machine.
Consistent hashing. The grown up version of IP hash, and important enough that we give it its own section later. Place servers and clients on a ring and route each client to the nearest server clockwise. You keep stickiness, but adding or removing a server only reshuffles a thin slice of clients instead of all of them.
Geographic. Route each user to the server physically closest to them. Best for global products where the latency of crossing the planet, the right edge of that chart above, is the headline cost.

How does the balancer know a server died? Health checks. It pings each server on a schedule and keeps a live list of who is up. The moment a check fails a few times in a row, it stops routing there until the check passes again. This same trick, a heartbeat plus a live roster, reappears for almost every component in this guide, which is a good sign you are learning a force and not a fact.

And now an uncomfortable observation that sets up the next idea. We added the load balancer to remove a single point of failure, and the load balancer is now itself a single point of failure. If it dies, nobody reaches anything. We have moved the fragility, not removed it.

05 / The LensSingle points of failure, and the math of staying up

A single point of failure is any one component that takes the whole system down when it stops. Treat it as a lens you point at every box on the board, asking the same question each time: what happens if exactly this dies?

It is worth stating why these matter, because it is more than "things break." A single point of failure hurts you three ways at once. It hurts reliability, because one death is a full outage, which is usually a direct loss of money and trust. It hurts scalability, because the choke point caps the whole system no matter how much you grow around it. And it hurts security, because an attacker who finds the one fragile node can take everything down by overwhelming just that node, which is a cheap attack for a large payoff.

The cures are a small, reusable set, and you will apply the same three to load balancers, app servers, databases, and queues alike. Redundancy: run more than one of the thing, so a death leaves a survivor. Two load balancers, and if one dies the other carries the traffic. Health checks and monitoring: watch each component continuously so you learn it failed in seconds, not when a customer emails you. Self healing: when you detect a death, automatically replace it with a fresh instance, so there is no human standing in the critical path at three in the morning.

There is a deeper version of this idea that quietly reshapes how you build, and it is pure arithmetic. In a synchronous system where each step waits on the next, your end to end reliability is the product of every dependency's reliability. Chain services that are each up ninety nine percent of the time and the chain is not ninety nine percent. Each link multiplies, and the product only ever goes down. Watch what happens as the chain grows.

Reliability is a product, not an average drag the slider

Set how reliable each individual service is, then watch the end to end uptime fall as you chain more of them together synchronously. The lesson is brutal and simple: chaining dependencies multiplies their failure, so the fix is to stop chaining.

per service uptime 99.00%

Translate the percentages into time and they get visceral. Three nines, ninety nine point nine percent, is about eight hours and forty five minutes of downtime a year. Four nines is about fifty two minutes. Five nines is around five minutes. Now notice that a chain of ten services at four nines each lands you back at roughly three nines overall, turning fifty two minutes of downtime a year into almost nine hours. This is why decoupling is not a luxury. The synchronous chain is a reliability tax you pay on every request, and the next section is how you stop paying it.

06 / The BufferMessage queues, or how to stop one slow thing from breaking everything

A user places an order. In a synchronous world you call inventory, then payment, then notifications, all in real time, all before the user hears back. You have built a chain of dependencies, and the chart you just saw says the chain is only as reliable as the product of its links.

If the notification service is slow or down, the entire order fails, even though sending an email is the least important part of placing an order. That is absurd, and a message queue is how you fix it. Instead of calling each service directly and waiting, you publish an event, order.placed, and a broker holds it durably. Inventory, payment, and notifications each pick it up and process it on their own schedule. If notifications is down, its copy of the message simply waits in the queue and gets handled when the service recovers. The order still goes through, because the customer's success no longer depends on the email being sent this instant.

Synchronous chain versus the queue

Knock the notification service offline and compare the two designs. In the chain, the whole flow fails because the last call fails. With a queue, the work simply waits and drains later, and the order succeeds.

notification service is

The trade is real and worth naming out loud. You gain resilience and decoupling. You give up the simplicity of a direct call and immediate consistency, and you take on a broker to run and monitor. Brokers like Kafka, RabbitMQ, and SQS let you fan one message out to many consumers, route by topic, or deliver point to point. Reach for a queue when a slow or flaky dependency must not be allowed to fail the main path.

The catch that queues hand you: idempotency

Asynchronous systems retry. The network drops a confirmation, the consumer crashes after doing the work but before acknowledging it, the broker redelivers to be safe. This means a message can arrive more than once, and most queues honestly promise "at least once" delivery, not "exactly once." So your handlers must be idempotent, which means applying the same operation many times leaves the system in the same state as applying it once.

The classic example is charging a card. "Charge fifty dollars" is not idempotent: run it twice and the customer pays a hundred. "Charge fifty dollars for order 12345, and ignore it if order 12345 was already charged" is idempotent, because the second attempt sees the order is settled and does nothing. The usual mechanism is an idempotency key: the caller attaches a unique id, the handler records which ids it has processed, and a repeat is a no op. This is also why REST gives GET, PUT, and DELETE idempotent semantics while POST is not. Idempotency is the discipline that makes "at least once" safe, and once you internalize it, distributed systems stop feeling so frightening.

07 / THE DATA TIERSQL or NoSQL is the wrong question

As traffic grows we split the web tier from the data tier so each scales on its own load. Then we pick a database, and the usual debate is framed as old versus new, or relational versus web scale. That framing is noise. The real question hides one level down.

The question is not which family is better. It is: what guarantees does this particular data need. Ask that, and the choice usually makes itself. Relational databases are built around ACID, and these four letters are worth understanding for real, not just recognizing on a slide.

Atomicity. A transaction is all or nothing. Moving money debits one account and credits another, and either both land or neither does. You never strand a balance halfway between two accounts.
Consistency. Every committed transaction leaves the database in a valid state. Constraints, foreign keys, and rules hold at all times. You cannot write a row that breaks them, the database refuses.
Isolation. Concurrent transactions do not see each other's half finished work. Two writers at the same instant get a clean result, as if they had run one after the other.
Durability. Once committed, it survives. A crash one millisecond later does not lose it, because it is already on disk. Committed means permanent.

NoSQL databases trade some of these guarantees for scale and flexibility, and that trade is a feature for the right workload, not a defect. Dropping the rigid schema and relaxing consistency is precisely what lets them spread across many machines so cleanly. They come in four shapes, and each shape is really a different answer to "what does my access pattern look like."

Document stores (MongoDB). Records are JSON like documents with flexible nested structure. A user, their profile, and their recent orders can live in one document you fetch in a single read.
Wide column stores (Cassandra). Tables with dynamic columns, tuned for enormous write volume and predictable reads at massive scale. The shape behind time series and event firehoses.
Key value stores (Redis, Memcached). The simplest model there is: a key, a value, blistering speed, often entirely in memory. The workhorse of caches, sessions, and rate limiters.
Graph stores (Neo4j). First class entities and the relationships between them. When the question is "who is connected to whom, and how," as in social graphs or recommendations, this shape answers in one hop what relational joins answer in ten.

So the decision was never SQL versus NoSQL. It is: does this data need ACID. Money, inventory, orders, anything where a partial write is a small disaster, belongs in a relational database. Activity feeds, product catalogs, telemetry, real time analytics where slightly stale is perfectly fine, often fit NoSQL better. And most serious systems run both, each holding the slice of the workload it suits. The data tier is now also a single point of failure, exactly as the lens in section 05 warned. Curing that is the next move.

08 / SPREADING THE DATAReplication, read replicas, and the sharding tax

One database box has the same three weaknesses as one server box. It has a ceiling, it has no twin, and now it also holds the one thing you truly cannot afford to lose. We relieve each pressure with a different move, in order.

The first move is replication. Keep one primary that takes writes and one or more replicas that copy from it. If the primary dies, a replica is promoted and the system survives. That cures the missing twin and gives durability a second home. The catch is the same CAP tradeoff from section 01, now made physical: replicas lag the primary by milliseconds to seconds, so a read from a replica can be slightly stale. For a bank balance that matters, so you read from the primary. For a follower count it does not, so you read from a replica and save the primary's capacity.

That observation leads straight to the second move. Most systems read far more than they write, often by ten or a hundred to one. So point all reads at the read replicas and reserve the primary for writes. You have just multiplied read capacity without touching the write path, and you did it by accepting a little staleness exactly where staleness is harmless. This is the cheapest large win in the whole data tier, and it costs you nothing but the discipline to decide per query.

Replicas solve reads. They do nothing for a write ceiling, because every write still funnels through one primary, and one machine can only absorb so many. When writes themselves outgrow a single box, you reach for the heaviest move in the article: sharding, splitting the data across many primaries so each owns a slice. Sharding is a tax, not a gift. Cross shard queries get slow or impossible, transactions across shards get painful, and rebalancing is a project. Do not shard until a named write pressure forces it. But when you must, how you place keys decides whether the next resize is a shrug or an outage.

Placing keys on the ringinteractive

Naive hashing assigns a key with hash(key) mod N, where N is the node count. It is balanced and simple, until N changes. Consistent hashing places nodes and keys on a ring and walks clockwise to the next node. Add or remove a node, and watch how many keys have to move under each scheme.

This is the whole reason consistent hashing exists. Under hash mod N, changing N reassigns nearly every key at once, which for a cache means a near total miss storm the instant you add a node, exactly when you were already under load. On the ring, adding a node only steals keys from its immediate neighbor, so roughly one in N keys moves and the rest stay put. Real deployments place each physical node at many points on the ring, called virtual nodes, so the slices stay even. Same idea as IP hash from section 04, grown up enough to survive resizing.

Indexes: the other half of database speed

Spreading data across machines handles volume. Most slow queries on a single machine are a different problem with a simpler fix. Without an index, finding a row means scanning the whole table, which is fine at a thousand rows and ruinous at fifty million. An index is a sorted lookup structure, usually a B tree, that turns a full scan into a direct jump, taking a query from seconds to milliseconds. The cost is honest and worth stating: every index you add makes writes a little slower, because each insert and update must also maintain the index, and indexes take disk space. So you index the columns you filter and join on, not every column out of habit. The senior instinct is the same as everywhere else in this article. Add the index when a named slow query demands it, measure, and stop. Speed is not free, it is bought with write cost, and you should know the price you are paying.

Replicas buy read scale by spending freshness. Sharding buys write scale by spending simplicity. Every move in the data tier is a purchase, and the senior skill is knowing the price before you pay it.

09 / SPEED FOR FRESHNESSCaching is one trade, made in many places

Caching shows up in your browser, your CDN, your app server, and your database, and it can feel like a dozen unrelated topics. It is one topic wearing four hats. Every cache makes the identical trade: speed now, in exchange for the risk that the copy is slightly out of date.

You keep a copy of data somewhere faster and closer to where it is needed, and you accept that the copy might trail the source of truth. Once that single idea clicks, you stop memorizing "Redis versus CDN" as separate flashcards and start asking the only two questions that matter. Where is the bottleneck, and how stale can this data afford to be. The answers place the cache for you.

The cache stack, by distance and latencyinteractive

Click any layer. The chart shows where a read could be answered and how long it takes. Every layer above the database trades a little freshness for a large drop in latency, and the closer to the user a layer sits, the faster it is and the staler it is allowed to be.

Browser cache

On the device itself. Static assets and responses that rarely change.

~0 ms

CDN edge

A server geographically near the user. Static content and popular files.

~10 to 50 ms

Application cache (Redis)

In front of the database. Hot read heavy queries that need millisecond answers.

~1 to 5 ms

Database query cache

Inside the database. The last cheap stop before the disk.

~5 to 30 ms

Database (source of truth)

Always correct, never the fastest. Everything above exists to avoid coming here.

slowest

Click a layer to see what answering a read there costs, and what it risks.

The vocabulary that covers almost every caching question, and it is small. A TTL (time to live) is how long a copy may live before it expires and must be refetched. Cache aside means the app checks the cache, and on a miss reads the database and fills the cache for next time, which is the default pattern. Write through writes to cache and database together so the two never drift, paying latency on every write for that safety. Write back writes to the cache first and the database soon after, which is faster but risks losing the most recent writes if the cache dies before it flushes. Notice the through line: the shared Redis that made statelessness possible in section 03 is this same box, doing this same job. The ideas keep reinforcing each other.

The failure mode worth knowing by name

Caches have one classic way of turning into a weapon against you, and it ties straight back to consistent hashing. A popular key expires, and in the same instant a thousand requests miss the cache and stampede the database all at once, each trying to recompute the same value. This is the thundering herd, also called a cache stampede, and it can take down a database that was comfortably idle a second earlier. The cures are small and worth carrying: stagger expirations by adding a little randomness to each TTL so keys do not all die together, or let only the first request recompute while the rest briefly wait on it, a trick called request coalescing or a mutex lock. The lesson generalizes past caches. Anything that makes many clients act in lockstep, synchronized retries, synchronized expirations, synchronized cron jobs, is a load spike waiting to happen, and a pinch of randomness is often the entire fix.

10 / THE CONTRACTAPI design, where mistakes are expensive forever

Your API is a contract with every client that depends on it. Once it ships, changing it is not a code edit, it is a coordinated migration that touches everyone downstream, some of whom you have never met. Design it like it is permanent, because for your callers it very nearly is.

Three styles dominate, and once more the right one falls out of the force you are actually under, not out of fashion.

REST. Resource based, stateless, built on the plain HTTP verbs. GET to read, POST to create, PUT or PATCH to update, DELETE to remove. Simple, cacheable, easy to reason about and document. The right default for public APIs, mobile clients, and anywhere a stable documented surface matters more than squeezing out every round trip.
GraphQL. One endpoint, and the client asks for exactly the fields it wants, no more and no less. It kills the over fetching and under fetching that forces REST clients to make three calls to paint one screen. Reach for it when many different clients with different data needs hit the same backend, or when your UIs are deeply nested.
gRPC. A high performance framework over HTTP/2 using protocol buffers, with streaming built in. Browsers do not speak it natively, which is exactly why it shines for internal service to service traffic where raw speed and tight contracts are the whole point and no browser is involved.

The protocol underneath shapes what the API can even do, so it is worth seeing the layers honestly. Plain request and response over HTTP fits REST and hands you status codes for free. WebSockets hold a two way connection open so the server can push without being asked, which is what live chat, presence, and real time feeds need. gRPC over HTTP/2 is for fast internal calls. And below all of those sit the two transport choices everything is built on. TCP guarantees delivery and ordering through a handshake and retransmission, which is why anything where correctness is non negotiable, payments, auth, loading a web page, rides on it. UDP is faster and lighter and makes no promise that a packet arrives or arrives in order, which is exactly right for video calls and games, where a packet that shows up late is worse than useless and is better dropped than resent.

What good REST actually looks like

Model your domain as resources and name them with plural nouns. GET /api/v1/products returns the collection, GET /api/v1/products/123 returns one, and GET /api/v1/products/123/reviews reads a nested resource. Let the verb carry the action, never the URL. No /getProducts, no /deleteUser, because the method already says what you are doing and baking it into the path is how you end up with a thousand bespoke endpoints nobody can cache.

Return the right status codes, because callers branch on them. The 200 range means success, with 201 for "created" and 204 for "no content." The 400 range means the client erred, so 400 for a malformed request, 401 for not authenticated, 403 for authenticated but not allowed, 404 for not found. The 500 range means the server erred, and the client is not at fault. Support filtering, sorting, and pagination through query parameters, because you almost never want to return fifty thousand rows in one response. And version your API from the first day with a /v1 prefix, so the day you must make a breaking change, old clients keep working against the old version while new clients move to the next one, and nobody's app breaks on a Tuesday.

GraphQL has its own grain and its own footguns. One endpoint, queries to read and mutations to write, and a typed schema that is the contract. It returns HTTP 200 even when something failed, so errors travel inside an errors field in the body rather than in the status line, which trips up anyone expecting REST conventions. Keep schemas modular, cap query depth so a client cannot nest itself into a denial of service, and use input types for mutations. The flexibility that makes GraphQL pleasant for clients is the same flexibility that lets a careless query melt your backend, so the guardrails are not optional.

11 / THE DOORSWho you are, and what you may do

Two different questions, constantly confused for one. Authentication asks who you are. Authorization asks what you are allowed to do. You cannot answer the second until you have answered the first, and conflating them is the root of a surprising share of security bugs.

Before untangling them, name the pieces, because the names alone trip up experienced engineers. A JWT is a token format, not an authentication method. A bearer token is a pattern, "whoever holds this token gets in," and a JWT is the most common kind of bearer token. OAuth 2 is an authorization framework, not a login method. Single sign on is a user experience goal, not a protocol. Keeping these four straight is genuinely half the battle, because most confusion in this area is people using one word for two different things.

Authentication, by evolution

The ladder runs from simple and weak to modern and strong, and seeing it in order makes the modern choices obvious. Basic auth sends a reversible encoding of username and password on every single request, which is unsafe without HTTPS and rarely used in production. Digest auth improved on it with hashing and is now dated. API keys are random strings a client sends with each request, simple but with no built in expiry and no embedded identity, so the server has to look each one up. Then comes the fork that matters.

Session based auth stores the session on the server and hands the client a cookie that points at it. It is stateful: the server must remember every active session, usually in Redis, which is the exact statelessness tension from section 03 wearing an auth costume. Token based auth, usually a JWT, hands the client a signed object that carries the user id, an expiry, and claims like roles. The server verifies the signature locally, with no database lookup, which is why it scales so cleanly. In practice you issue two tokens: a short lived access token for API calls, and a long lived refresh token kept in an HTTP only cookie, used to mint fresh access tokens so the user is not forced to log in every fifteen minutes. Short access token life limits the damage if one leaks, long refresh token life keeps the experience smooth. The two token split is itself a tradeoff between security and convenience, made deliberately.

On top of these sit the protocols people actually name. OAuth 2 lets one app act on a service on your behalf, handing over a scoped token instead of your password, which is how you grant an app access to your calendar without giving it your Google password. OpenID Connect adds an identity layer on top of OAuth 2, returning an ID token that states who you are, and that is what powers "sign in with Google." Single sign on uses these underneath so you authenticate once and reach many services. None of this is new machinery, it is the token ideas above, composed.

Authorization, by model

Role based (RBAC). Users get roles, roles carry permissions. Admin, editor, viewer. The most common model and the correct default for most systems. Think repository access on a code host.
Attribute based (ABAC). Decisions come from attributes of the user, the resource, and the environment. "Allow if the user is in HR and the resource is internal and it is during work hours." More expressive, more complex, easier to tangle yourself into rules that contradict.
Access control list (ACL). Each resource carries its own list of who may do what to it. This is how a shared document grants specific people specific rights. Precise and user centric, harder to scale to hundreds of millions of objects, but clearly workable since the largest products on earth do exactly this.

Real systems combine them. The token carries identity and claims, the authorization model decides what those claims are allowed to touch. The token is the mechanism, the model is the policy, and keeping that line clear keeps the system reasoned about rather than guessed at.

Guarding the doors

An API is a set of doors into your system, and unguarded doors get walked through. The defenses are a checklist of specific forces, not a single product you buy. Rate limiting caps requests per user, per IP, and overall, which blunts brute force and absorbs abuse. CORS controls which browser origins may call your API at all. Parameterized queries stop SQL and NoSQL injection by never letting user input become query structure, only query data. CSRF tokens stop a logged in user's browser from being tricked into firing a hidden request they never intended. Output escaping stops cross site scripting, where an attacker stores a script that later runs in another user's browser. Firewalls and private networks keep internal tools off the public internet entirely. Each one locks a specific door against a specific attack. You do not need to fear them all at once, you need to recognize which door each one closes, and then close the ones your system actually exposes.

12 / THE CAPSTONEWe design YouTube, and every force shows up

Now we run the frame from section 00 end to end on a real and famous problem, and watch every force we named earn its place. This is the payoff. Not new tricks, just the same principles arriving exactly when the design needs them, in the order the pressures appear.

Requirements. Functional is short: users upload videos, and users watch and stream videos. Then we ask about scale and get honest numbers to design against.

~12 / s

uploads, from 1M a day

100M

daily active viewers

256 GB

max single video size

availability over consistency

Non functional follows from those numbers: availability over consistency, support for very large uploads and streams, low latency playback even on a weak connection, and the scale to match a hundred million viewers. Notice that the very first non functional decision is CAP, exactly as in section 01. When someone uploads in Germany, does a viewer in America need that video the same second? No. A few seconds to a few minutes of propagation is completely fine. So we choose availability over consistency, and video upload becomes eventually consistent. Everyone keeps watching everything else while a new upload spreads. One honest question about the product, answered, and the personality of the whole system is set.

Core entities. A user, and a video. The video splits immediately into two very different things that want to be stored in two very different places: the raw bytes you watch, which are enormous, and the metadata, the title, description, and status, which is tiny. We keep them separate from the very first sketch, because pretending they are one thing is the mistake that haunts the rest of the design.

API and high level design. An upload endpoint and a watch endpoint. The raw bytes go to blob storage like S3, which is cheap and effectively infinite. The metadata goes to a separate relational database. Correct, and naive, which is exactly what section 00 asked for first. Now the deep dives break it, one quality at a time, and each fix is something we already built earlier in this article.

The upload pathanimated

A 256 GB file cannot ride through an API gateway, which caps a request body at a few megabytes. So the client uploads in chunks straight to S3, and your servers never touch the bytes. Watch the chunks flow direct.

The flow: the client posts metadata first and gets back a set of presigned URLs. It then uploads chunks directly to S3 using multipart upload, so the bytes never pass through your servers and the gateway size cap never applies. When S3 finishes, it fires a notification to a worker, which marks the video uploaded and records its location. We never trust the client to tell us it finished, because a client can lie or simply crash mid upload. We let storage itself confirm. That single instinct, never trust the client for state that must be true, is pure senior reflex, and it shows up far beyond video.

Streaming is the harder half, and it hides a lovely subtlety. Downloading a ten gigabyte file before the first frame plays is unacceptable, so we chunk again, but differently, and the difference is the whole point. The upload chunks were large, five to ten megabytes, sized to minimize request overhead. The streaming chunks are tiny, two to ten second clips cut at video keyframes, sized for fast playback start and quick switching. Same word, chunk, two completely different jobs, two completely different sizes. That is why we chunk twice, and noticing that you must is exactly the kind of thing that separates a memorized answer from an understood one.

Then we transcode. Each chunk is re encoded into several resolutions, from 4K down to 240p, so a viewer on a weak connection can be handed a smaller stream instead of a stalled one. We store these back in S3 and write a manifest file that lists, per resolution, the ordered chunk URLs. On playback the client fetches the manifest, then pulls chunks at the resolution its current bandwidth can sustain, reassessing every few seconds. This is adaptive bitrate: start a video in 4K on home wifi, walk out to a weak cellular signal, and the next chunks quietly drop to 720p without the video ever stopping. All of it, the chunking, the manifest, the adaptive fetch, is what streaming protocols like HLS and DASH handle for you. Drag the bandwidth below and watch the player choose.

Adaptive bitrate, liveinteractive

Move the slider to change the viewer's available bandwidth. The player picks the highest resolution the connection can sustain, chunk by chunk, the same decision the real client makes continuously.

22.0 Mbps

The ladder is not magic, it is a table of bitrate floors. 4K wants roughly 18 Mbps, 1080p around 6, 720p around 3, 480p around 1.2, and 240p well under that. The player keeps a buffer, watches its download speed, and steps down the instant the buffer is at risk and back up when there is headroom. Buffering happens when the player guesses too high and the buffer drains to empty, which is why a conservative ladder beats a greedy one for perceived quality.

The streaming pathanimated

Chunk, transcode into a ladder of resolutions, cache the popular ones near the viewer, and let the client adapt to its bandwidth chunk by chunk.

The CDN here is section 09 again, no new idea required. Popular chunks and the manifest sit on edge servers near the viewer, so a request from New York is answered in New York instead of crossing the country to S3. Caching where it is hot, exactly the trade we already named, applied to the most bandwidth hungry workload in the design.

Does it scale? Walk the bottlenecks.

This is the section 05 lens pointed at the finished design, component by component, and the satisfying part is that every answer traces back to something we already built. The video service is stateless, so it scales horizontally to any size, which is section 03. The API gateway doubles as a load balancer, which is section 04. S3 is effectively infinite, so the raw bytes are never the constraint. The metadata is tiny, about a kilobyte per video, so a million videos a day is roughly 365 GB a year, comfortable on a single relational instance for years, and if we ever outgrow that we shard by video id, which is section 08. The chunker and transcoders are stateless workers that scale out under CPU and memory pressure, section 03 again. The CDN is the genuinely expensive part, and that is a budget decision, not an architecture flaw. Every scaling story in this enormous system is a principle from earlier in this one article, arriving exactly where the pressure demanded it.

And the level ladder is the honest part to say out loud, because it is what the room is really measuring. A mid level engineer reaches chunking and CDNs with some prompting. A senior arrives at multipart upload and the need to chunk twice without being led there, and handles transcoding and adaptive bitrate when nudged. A staff engineer gets there proactively, splits bytes from metadata on instinct, and might teach the interviewer something about failure handling in the transcode pipeline or back pressure when the upload firehose outruns the workers. The difference was never vocabulary. It is how early, and how independently, the right tradeoff surfaces. Which is the entire thesis of this piece, now demonstrated on something real.

Good system design is choosing the right tradeoff for the circumstance you are in. Not reciting the catalog. Naming what the system actually needs, then deciding deliberately, out loud, with the reason attached.

13 / THE META LESSONWhat you actually carry out of here

You will not be asked to recite these definitions, on the job or in the room. You will be handed a pressure and asked which release valve fits, and why. That is the only skill underneath all of this, and three habits separate the people who have it from the people who memorized a list.

First, they decide per operation, not per system. CAP is chosen one endpoint at a time. SQL and NoSQL live side by side, each on the workload it suits. Read replicas serve the queries that tolerate staleness and the primary serves the ones that do not. Caches go only where a read is genuinely hot. The lazy answer reaches for one global setting and applies it everywhere. The strong answer says "this operation needs strong consistency and here is the reason, this one does not and here is the reason," and means both halves.

Second, they trace every component back to the force that justifies it. If you cannot say what pressure a box on the board relieves, the box should not be on the board. A load balancer answers "I have interchangeable twins now." A queue answers "one slow dependency must not be allowed to fail the whole flow." A read replica answers "reads dwarf writes and most reads tolerate lag." A CDN answers "this viewer is far from the source." When you can narrate the why for every node, the design defends itself, and the conversation stops being a quiz.

Third, they respect simple solutions, which is the rarest and most senior habit of the three. Vertical scaling is real and serves moderate load honestly. A single relational database with good indexes runs enormous products for years. The instinct to reach immediately for the most distributed, most fault tolerant, most fashionable option is the same instinct that builds a system nobody can operate at three in the morning. The senior move is to add complexity only when a named pressure forces it, and not one component sooner, because every box you add is a box that can fail, page someone, and need to be understood.

That is the whole craft, and it fits in one line. Start with the smallest thing that works, push on it until it breaks, name the force that broke it, apply the smallest move that relieves it, and repeat. Do that honestly and the giant diagram at the top of this page stops being something you memorize and becomes something you could have derived yourself, on a whiteboard, from nothing but the pressures. Which, when you strip away the vocabulary, is exactly what being good at system design has always meant.

The whole article as a lookup table

Every force in this piece, paired with the smallest move that relieves it. If you remember nothing else, remember the shape: a pressure on the left, a deliberate move on the right, and a reason connecting them.

one box, traffic grows→scale up, then out

servers not interchangeable→make them stateless

which twin handles this?→load balancer

one component kills the system→redundancy, health checks

chained reliability decays→stop chaining, go async

slow dependency fails the flow→message queue

queue redelivers messages→idempotent handlers

data needs all or nothing→ACID, relational

data needs spread and scale→NoSQL by shape

reads dwarf writes→read replicas

writes outgrow one box→shard, pay the tax

resize remaps every key→consistent hashing

this read is too slow→cache by distance

popular key expires at once→jitter the TTL

over and under fetching→GraphQL grain

who are you, what may you do→tokens plus a model

file too big for the gateway→chunked direct to blob

weak connection, no stalls→adaptive bitrate

The Forces,Not the Parts