The Art of System Decomposition

A First-Principles Guide to System Decomposition

Boyan Balev, Software Engineer

This article is an argument, not a survey. It argues that most production systems are over-decomposed, and that restraint is the undervalued skill in system design. You may disagree. That’s fine. But consider the case before dismissing it.

The pattern is predictable: a startup reads about Netflix’s architecture, decides they need microservices, splits their Django app into twelve containers, and then wonders why a feature that used to take a day now takes a sprint. They’ve traded the complexity they understood for complexity they don’t.

Decomposition is not a measure of sophistication. It’s a trade: you exchange one set of properties for another. The sophistication lies in knowing which trades serve your specific context, and having the discipline to resist trades that don’t.


The Properties You’re Trading Away

Before splitting anything, understand what you’re giving up. This isn’t about resisting change; it’s about making informed decisions rather than following trends.

A monolithic application has properties that distributed systems cannot replicate:

Property | Single Process | Distributed
Function call latency | ~100 nanoseconds | 1–200 milliseconds
Data consistency | ACID transactions | Eventual consistency or complex coordination
Debugging | Stack traces, breakpoints | Distributed tracing, correlation IDs
Deployment | Atomic (all or nothing) | Coordinated rollouts
Failure modes | Process crash | Partial failures, network partitions
Refactoring | IDE support, type checking | Contract versioning, backward compatibility

That latency gap isn’t rounding error. A function call completes in roughly 100 nanoseconds; a network call takes milliseconds, at least four orders of magnitude slower. Chain four services together, add serialization overhead, and you’ve introduced 10–50ms of latency before any business logic runs.

Concrete example: A checkout flow that was a single database transaction completing in 50ms becomes a saga spanning six services. Happy path: 400ms. Timeout scenarios: user sees a spinner for 30 seconds, then an error message that doesn’t explain whether their order went through. The support team now handles hundreds of “did my order go through?” tickets per day.


What Boundaries Actually Solve

Boundaries exist to solve specific problems. Not to look modern. Not to match what Netflix does. Not because a conference speaker said so.

Team Autonomy

When 50 engineers modify the same codebase, merge conflicts and coordination overhead dominate. Boundaries let teams own domains and deploy independently, but only if boundaries align with team structure. Boundaries that cross team ownership create more coordination, not less.

Independent Scaling

Your search service needs 20 replicas during Black Friday. Your admin panel needs one. Without boundaries, they scale together, wasting resources or creating bottlenecks. But this only matters if your components actually have different scaling profiles.

Failure Isolation

A memory leak in image processing shouldn’t crash payment processing. Boundaries contain blast radius, but only if the services don’t share critical dependencies. A shared database, a shared cache, or synchronous call chains undermine this isolation entirely.

Technology Flexibility

Your ML team needs Python with GPU access. Your API team prefers Go. Boundaries enable heterogeneous stacks when you genuinely need them. But polyglot architectures also mean polyglot hiring, polyglot debugging, and polyglot deployment pipelines.

Notice the pattern: each benefit solves a specific, nameable problem. “Better separation of concerns” is not specific. “The payments team needs to deploy without waiting for the catalog team’s two-week release cycle” is specific.

Anti-Patterns: Boundaries That Create Problems

Some boundaries don’t solve problems; they create them:

The Entity Service: Splitting by data entity (UserService, ProductService, OrderService) creates a distributed database with network overhead. Every operation requires choreographing calls across entities that naturally belong together.

The Layer Service: Separating “frontend API,” “business logic,” and “data access” into different services. Every user request now traverses three network hops to accomplish what a single process could do with function calls.

The Premature Extraction: Extracting a service because “we might need to scale it independently someday.” Unless you have evidence of different scaling needs today, you’ve added network complexity for a hypothetical future that may never arrive.


The Four-Question Framework

Before any split, these questions help determine whether decomposition provides value. They’re a starting point for analysis, not a mechanical decision rule. Context matters, and reasonable engineers can weigh these factors differently.

Question | If Yes | If No
1. Will different teams own these components? | Service boundary may help team autonomy | Same engineers maintain both sides. You've just added network hops
2. Do they need different release cadences? | Independent deployment provides value | They always deploy together. You've created a distributed monolith
3. Do they have different resource profiles? | Independent scaling reduces costs | Similar profiles. You're just running more containers
4. Should failure in one NOT take down the other? | Isolation is possible (if no shared dependencies) | Shared database/dependencies mean illusory isolation

General Guidance

If you answered “no” to all four questions, a service boundary probably doesn’t help. A well-structured module within a monolith provides the same logical separation without distributed systems complexity. But also consider external constraints: regulatory requirements, data residency, security isolation, or M&A readiness can override this framework. Start with the four questions, then factor in your specific constraints.

When External Constraints Override the Framework

Some situations require service boundaries regardless of the four questions:

Regulatory compliance: PCI-DSS requires cardholder data environments to be isolated. HIPAA mandates specific access controls. SOC 2 auditors want clear security boundaries. These are legal requirements that override architectural preferences.

Data residency: GDPR requires EU citizen data to stay in the EU. If you serve global customers, you may need geographically separated services. A monolith can’t easily satisfy “this data never leaves Frankfurt.”

Security isolation: Some components handle fundamentally different trust levels. A service processing untrusted user uploads should be isolated from your core database. Defense in depth sometimes requires network boundaries.

M&A readiness: Companies anticipating acquisition or divestiture benefit from clean service boundaries. Selling a business unit is easier when that unit’s systems can be separated cleanly.


The Modular Monolith: The Path Nobody Talks About

The industry presents a false choice: chaotic monolith or enlightened microservices.

The practical middle ground, the one successful companies actually use before they hit genuine scale, is the modular monolith: a single deployment unit with strict internal boundaries.

Modular Monolith Structure:

├── orders/
│   ├── api/           <- Public interface (other modules call this)
│   ├── internal/      <- Implementation details (private)
│   ├── repository/    <- Data access (encapsulated)
│   └── types.go       <- Domain types
├── payments/
│   ├── api/
│   ├── internal/
│   └── ...
└── inventory/
    └── ...

Rules:
1. Modules communicate through api/ interfaces only
2. No imports of internal/ across module boundaries
3. Each module owns its database tables
4. Violations fail the build (use linting/architecture tests)

Enforcing Boundaries in Code

The module structure means nothing without enforcement. Here’s how to make boundaries real:

Go: Use internal/ directories. The Go toolchain enforces that packages under internal/ can only be imported by code rooted at the parent of that internal/ directory, so cross-module imports of implementation details fail to compile.
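
To make that concrete, here is a minimal sketch of what a module’s public surface can look like in Go, following the layout above. The package path and type names (orders/api, Service, PlaceOrderRequest) are illustrative, not a prescribed contract:

// orders/api/api.go — the only package other modules are allowed to import.
package api

import "context"

// PlaceOrderRequest and OrderSummary form the orders module's public contract.
type PlaceOrderRequest struct {
    CustomerID string
    Items      []LineItem
}

type LineItem struct {
    SKU      string
    Quantity int
}

type OrderSummary struct {
    OrderID    string
    TotalCents int64
}

// Service is implemented inside orders/internal; callers only ever see this interface.
type Service interface {
    PlaceOrder(ctx context.Context, req PlaceOrderRequest) (OrderSummary, error)
}

The payments module may import orders/api; an import of orders/internal from outside the orders/ tree simply won’t compile.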

Java: Use ArchUnit or jMolecules to write tests that verify architectural rules:

@ArchTest
static final ArchRule ordersDoNotDependOnPaymentInternals =
    noClasses()
        .that().resideInAPackage("..orders..")
        .should().dependOnClassesThat()
        .resideInAPackage("..payments.internal..");

TypeScript/JavaScript: Use eslint-plugin-boundaries or dependency-cruiser:

// .dependency-cruiser.js
module.exports = {
  forbidden: [{
    name: 'no-cross-module-internals',
    from: { path: '^src/([^/]+)/' },
    to: { path: '^src/(?!$1)[^/]+/internal/' }
  }]
};

What You Get

Modular Monolith | Microservices
Logical separation enforced by tooling | Logical separation enforced by network
Function calls (~100ns) | Network calls (1–200ms)
ACID transactions | Eventual consistency
Stack traces for debugging | Distributed tracing
Atomic deployment | Coordinated rollouts
Single runtime to operate | N runtimes to operate
One CI/CD pipeline | N pipelines, or a complex monorepo setup

The Extraction Principle

When a module genuinely needs independence (when you can answer “yes” to the framework questions or have external constraints), extraction follows the existing boundary. The hard architectural work is already done. The api/ interface becomes your service contract. The internal/ implementation moves to a new repository.
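
As a sketch of what that extraction can look like (reusing the illustrative orders/api interface from earlier, with an assumed module path and constructor), the new service is mostly a transport wrapper around the existing contract:

// orders/cmd/server/main.go — a hypothetical HTTP wrapper around the existing
// api.Service interface. The logic under orders/internal moves unchanged; only
// the transport layer is new.
package main

import (
    "encoding/json"
    "log"
    "net/http"

    "example.com/shop/orders/api"              // assumed module path
    "example.com/shop/orders/internal/service" // assumed implementation package
)

func main() {
    var svc api.Service = service.New() // assumed constructor returning api.Service

    http.HandleFunc("/orders", func(w http.ResponseWriter, r *http.Request) {
        var req api.PlaceOrderRequest
        if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }
        summary, err := svc.PlaceOrder(r.Context(), req)
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        json.NewEncoder(w).Encode(summary)
    })

    log.Fatal(http.ListenAndServe(":8080", nil))
}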

Teams that start with microservices rarely get boundaries right on the first try. They spend years fixing distributed boundaries because refactoring across services requires versioning, backward compatibility, and coordinated deployments. Teams that start with modular monoliths can refactor freely, then extract once they understand the domain.


Data Ownership: Where Decomposition Gets Real

Abstract service boundaries are easy. Architects draw them on whiteboards all day. Data ownership is where decomposition gets real, and where most migrations fail.

The microservices promise was independent databases for independent services. The reality is that data has gravity: business processes span domains, reports need cross-domain joins, and consistency requirements don’t care about your service boundaries.

The Trade-off You Cannot Escape

Approach | What You Get | What You Pay
Shared database | Strong consistency, direct queries, simple joins | Schema coupling, team coordination, shared failure domain
API calls | Clear ownership, independent schemas | Runtime dependency, latency, availability coupling
Data replication | Read independence, no runtime coupling | Eventual consistency, sync infrastructure, storage costs
CQRS with projections | Optimized read models, cross-domain views | Complexity, eventual consistency, projection maintenance

None of these options is free. You’re choosing between data coupling and runtime coupling. Pick the coupling that hurts less for your specific situation. Often, the honest answer is “shared database with schema governance” rather than pretending you’ve achieved data independence.

Decision Framework: When to Share, When to Split

Share a database when:

  • Same team owns both services (coordination cost is low)
  • Strong consistency is non-negotiable (financial transactions)
  • The services are in the same bounded context
  • You’re not ready to solve distributed data problems
  • The primary access patterns involve cross-domain queries

Separate databases when:

  • Different teams with different release cycles
  • Independent scaling of storage matters
  • Eventual consistency is genuinely acceptable
  • The services represent different business domains
  • Failure isolation is a real requirement

The Outbox Pattern: Making Separation Work

When you do separate databases, you need reliable event propagation. The naive approach fails:

1. Write to database    ✓
2. Publish event        ✗ (network blip)
Result: Data saved, event lost, downstream services never know

You can’t fix this with “try harder to publish.” Network failures are inevitable, and you can’t make a database write and a message publish atomic without distributed transactions.

The outbox pattern makes event publishing transactional:

BEGIN;
  INSERT INTO orders (id, customer_id, total) VALUES (...);
  INSERT INTO outbox (event_type, payload, created_at)
    VALUES ('OrderCreated', '{"orderId": "..."}', NOW());
COMMIT;
┌─────────────────────┐     ┌─────────────────┐     ┌──────────────┐
│   Order Service     │     │  Outbox Relay   │     │    Kafka     │
│  ┌───────────────┐  │     │                 │     │              │
│  │ orders table  │  │     │  polls outbox   │────▶│  publishes   │
│  ├───────────────┤  │     │  marks as sent  │     │   events     │
│  │ outbox table  │──┼────▶│  retries failed │     │              │
│  └───────────────┘  │     └─────────────────┘     └──────────────┘
└─────────────────────┘

A separate process polls the outbox and publishes events. Events may duplicate, so consumers must be idempotent. But you never lose events. Debezium CDC is the modern alternative: it reads the database transaction log directly.
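
For illustration, a minimal sketch of that relay loop in Go, assuming a Postgres-style database/sql setup and an outbox table with id and sent_at columns in addition to the ones shown above; the Publisher interface stands in for whatever broker client you use:

// A relay process: polls unsent outbox rows, publishes them, marks them sent.
// Events can still be delivered twice (a crash between publish and the UPDATE),
// which is why consumers must dedupe, typically on an event ID.
package outbox

import (
    "context"
    "database/sql"
    "log"
    "time"
)

// Publisher stands in for your broker client (Kafka, SNS, etc.).
type Publisher interface {
    Publish(ctx context.Context, eventType string, payload []byte) error
}

func relayOnce(ctx context.Context, db *sql.DB, pub Publisher) error {
    rows, err := db.QueryContext(ctx,
        `SELECT id, event_type, payload FROM outbox
         WHERE sent_at IS NULL ORDER BY created_at LIMIT 100`)
    if err != nil {
        return err
    }
    defer rows.Close()

    for rows.Next() {
        var (
            id        int64
            eventType string
            payload   []byte
        )
        if err := rows.Scan(&id, &eventType, &payload); err != nil {
            return err
        }
        if err := pub.Publish(ctx, eventType, payload); err != nil {
            return err // row stays unsent and is retried on the next poll
        }
        if _, err := db.ExecContext(ctx,
            `UPDATE outbox SET sent_at = NOW() WHERE id = $1`, id); err != nil {
            return err
        }
    }
    return rows.Err()
}

// RunRelay polls on a fixed interval until the context is cancelled.
func RunRelay(ctx context.Context, db *sql.DB, pub Publisher) {
    ticker := time.NewTicker(500 * time.Millisecond)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            if err := relayOnce(ctx, db, pub); err != nil {
                log.Printf("outbox relay: %v", err)
            }
        }
    }
}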


Bounded Contexts: Finding Natural Seams

Domain-Driven Design’s most practical gift is the bounded context: a boundary within which terms have consistent meaning.

Consider “Customer” across an e-commerce company:

Context | What “Customer” Means
Sales | Contact info, cart, purchase history
Support | Ticket history, SLA tier, satisfaction score
Billing | Payment methods, invoices, credit limit
Marketing | Segments, preferences, campaign engagement

Attempting one Customer entity for all contexts produces an incoherent mess: hundreds of fields, tangled validation rules, changes in one area breaking another.

This is why forced data sharing fails. When the Sales team says “customer” and the Billing team says “customer,” they mean different things. This isn’t a bug. It reflects how the business actually works.

Finding Context Boundaries

  • Listen to the language. When different teams use the same word differently, you’ve found a context boundary.
  • Watch the data flows. Data that changes together belongs together.
  • Map change patterns. When Sales requests changes, which code changes? If the answers don’t overlap across teams, you’ve found natural boundaries.

E-commerce Context Map:

┌─────────────────────────────────────────────────────────────────────┐
│                          Sales Context                               │
│  Customer, Order, Cart, Checkout                                     │
└─────────────────────────────────────────────────────────────────────┘

                              ▼ OrderPlaced event
┌─────────────────────────────────────────────────────────────────────┐
│                       Fulfillment Context                            │
│  Shipment, Package, Carrier, TrackingNumber                          │
└─────────────────────────────────────────────────────────────────────┘

                              ▼ ShipmentDispatched event
┌─────────────────────────────────────────────────────────────────────┐
│                         Billing Context                              │
│  Invoice, PaymentMethod, Receivable, CreditMemo                      │
└─────────────────────────────────────────────────────────────────────┘

“Order” appears in all three, meaning something different each time. Don’t try to unify them. Let each context have its own representation, connected by events at the boundaries.
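
A small sketch of what “connected by events at the boundaries” can look like in Go: Billing never imports the Sales context’s types; it consumes the event and translates it into its own model. All type names and the net-30 policy here are illustrative:

// Billing's own model. It never imports the Sales context's Order or
// Customer types; it consumes the OrderPlaced event and translates it.
package billing

import "time"

// OrderPlaced mirrors the event contract published by the Sales context.
type OrderPlaced struct {
    OrderID    string
    CustomerID string
    TotalCents int64
    PlacedAt   time.Time
}

// Invoice is Billing's representation; fields Sales never thinks about
// (due date, receivable status) live only here.
type Invoice struct {
    InvoiceID   string
    OrderRef    string // a reference, not a shared entity
    CustomerID  string
    AmountCents int64
    DueDate     time.Time
    Status      string // "open", "paid", "written_off"
}

// FromOrderPlaced is the translation at the boundary.
func FromOrderPlaced(e OrderPlaced) Invoice {
    return Invoice{
        InvoiceID:   "inv-" + e.OrderID,           // illustrative ID scheme
        OrderRef:    e.OrderID,
        CustomerID:  e.CustomerID,
        AmountCents: e.TotalCents,
        DueDate:     e.PlacedAt.AddDate(0, 0, 30), // assumed net-30 terms
        Status:      "open",
    }
}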


Integration Patterns: Where Architectures Fail

Splitting services is the easy part. Integration is where architectures fail. Not in the decomposition, but in the coupling that follows.

The Synchronous Trap

Synchronous call chains create tight runtime coupling:

User Request → Service A → Service B → Service C → Service D

Problem | Impact
Any service down | Entire request fails
Latency | A + B + C + D + network overhead
Slow service | Slows everything
Debugging | Spans four systems
Cascading failures | One timeout causes upstream timeouts

This runtime coupling is often worse than the code coupling you were trying to escape. Code coupling is at least visible in import statements and gives you stack traces; runtime coupling hides in HTTP clients and only reveals itself during outages.

Asynchronous Integration

When synchronous is acceptable:

  • User-facing flows requiring immediate response
  • Queries that need current data
  • Simple request/response patterns with few hops
  • Idempotent operations with proper retry logic

When asynchronous is preferable:

  • Fire-and-forget operations
  • Fan-out to multiple consumers
  • Operations that can tolerate seconds of delay
  • Decoupling producer/consumer lifecycles
  • Operations where the producer shouldn’t fail if the consumer is down

Sagas: Distributed Transactions

When a business process spans services, database transactions can’t help. You need a saga: a sequence of local transactions coordinated through events or an orchestrator.

Choreography (decentralized): Each service listens for events and publishes its own.

Orders ──OrderCreated──▶ Payments ──PaymentReceived──▶ Inventory
   ▲                         │                              │
   └────PaymentFailed────────┴──────StockReserved───────────┘

Orchestration (centralized): A coordinator controls the flow, making the process explicit.

                     ┌─────────────────┐
                     │   Orchestrator  │
                     │  (Order Saga)   │
                     └────────┬────────┘

       ┌──────────────────────┼──────────────────────┐
       ▼                      ▼                      ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Payments   │     │  Inventory   │     │   Shipping   │
└──────────────┘     └──────────────┘     └──────────────┘

Approach | Advantages | Disadvantages
Choreography | Scales better, no single point of coordination | Hard to understand full flow, debugging across services, implicit coupling through events
Orchestration | Explicit flow, easier to understand and debug, centralized error handling | Central coordinator is a dependency, can become bottleneck, logic centralized in one place

The critical insight: Every saga step needs a compensating action. If payment succeeds but inventory fails, you refund the payment. Design compensation into every service from the start, not as an afterthought. Services must be idempotent, and timeouts should trigger compensation rather than indefinite waiting.
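
For illustration, a compact sketch of an orchestrated saga with compensation in Go; the structure and error handling are an assumption rather than a reference implementation:

// A minimal orchestrator: run each local transaction in order; on failure,
// run the compensations for everything that already succeeded, in reverse.
package saga

import (
    "context"
    "fmt"
)

// Step pairs an action with the compensation that undoes it.
// Both must be idempotent so they can be retried safely.
type Step struct {
    Name       string
    Action     func(ctx context.Context) error
    Compensate func(ctx context.Context) error
}

func Run(ctx context.Context, steps []Step) error {
    var completed []Step
    for _, s := range steps {
        if err := s.Action(ctx); err != nil {
            for i := len(completed) - 1; i >= 0; i-- {
                if cerr := completed[i].Compensate(ctx); cerr != nil {
                    // A failed compensation needs human or automated follow-up;
                    // surface both errors rather than swallowing either.
                    return fmt.Errorf("step %q failed (%v); compensation %q also failed: %w",
                        s.Name, err, completed[i].Name, cerr)
                }
            }
            return fmt.Errorf("saga aborted at step %q: %w", s.Name, err)
        }
        completed = append(completed, s)
    }
    return nil
}

An order saga would register charge-payment, reserve-stock, and create-shipment as steps, with refund, release-stock, and cancel-shipment as their compensations; a timeout in any step should surface as an error so compensation runs instead of waiting indefinitely.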


Resilience Patterns: What Distribution Demands

Distributed systems fail in ways monoliths don’t. Network partitions, partial failures, cascading timeouts. These require explicit handling, not hope.

Circuit Breakers

When a dependency fails, continuing to call it makes things worse. Threads block, timeouts accumulate, failure cascades upstream. Your entire system slows down because one dependency is struggling.

    ┌─────────┐     failures     ┌───────────┐     timeout     ┌─────────────┐
    │  CLOSED │────────────────▶│   OPEN    │────────────────▶│  HALF-OPEN  │
    │         │                  │           │                  │             │
    │ Normal  │                  │ Fail-fast │                  │ Test one    │
    │ traffic │◀─────────────────│ (no calls)│◀─────────────────│ request     │
    └─────────┘    success       └───────────┘     failure      └─────────────┘

State | Behavior
Closed | Normal operation. Track failure rate.
Open | After threshold failures, stop calling dependency. Return fallback immediately.
Half-Open | After timeout, allow one test request. Success closes; failure keeps open.

Use libraries like Resilience4j (Java), Polly (.NET), opossum (Node.js), or gobreaker (Go). Complement circuit breakers with bulkhead isolation: isolate thread/connection pools for different dependencies so a slow payment gateway doesn’t exhaust resources needed for inventory checks.
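
For example, a minimal sketch using gobreaker in Go; the thresholds and the payments-gateway wrapper are illustrative and should be tuned from your own error budgets:

// One breaker per downstream dependency, so a struggling payments gateway
// can't consume the capacity needed for other calls (pair this with
// bulkheaded connection pools).
package payments

import (
    "fmt"
    "io"
    "net/http"
    "time"

    "github.com/sony/gobreaker"
)

var gatewayBreaker = gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name:    "payments-gateway",
    Timeout: 30 * time.Second, // how long the breaker stays open before half-open
    ReadyToTrip: func(counts gobreaker.Counts) bool {
        return counts.ConsecutiveFailures >= 5 // illustrative threshold
    },
})

func ChargeCard(client *http.Client, url string) ([]byte, error) {
    result, err := gatewayBreaker.Execute(func() (interface{}, error) {
        resp, err := client.Get(url)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        if resp.StatusCode >= 500 {
            return nil, fmt.Errorf("payments gateway returned %d", resp.StatusCode)
        }
        body, err := io.ReadAll(resp.Body)
        if err != nil {
            return nil, err
        }
        return body, nil
    })
    if err != nil {
        // Either the call failed or the breaker is open and failing fast:
        // return a fallback or queue the charge for later instead of blocking.
        return nil, err
    }
    return result.([]byte), nil
}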

Timeouts and Retries

Every external call needs a timeout. Without one, a hung dependency hangs your service.

Example failure mode: A third-party payment provider goes slow instead of down. No errors, just 60-second response times that back up every checkout request, exhausting thread pools. The symptom looks like high load, but the cause is a slow dependency.

Pattern | Purpose | Configuration
Exponential backoff | Prevent retry storms | Base delay × 2^attempt (cap at max)
Jitter | Prevent synchronized retries | Add random 0–30% to delay
Idempotency keys | Ensure retries are safe | UUID per operation, server dedupes
Retry budgets | Limit system-wide retry load | Max 20% of requests can be retries

Timeout guidance: Connect timeout of 1–3 seconds (fast fail if unreachable), read timeout based on p99 latency × 2, and total timeout less than your SLA requirements.
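
Putting those pieces together, a sketch in Go of a retrying client with a per-attempt timeout, exponential backoff with jitter, and an idempotency key; the header name, limits, and timeout values are illustrative assumptions:

package httpclient

import (
    "context"
    "fmt"
    "io"
    "math/rand"
    "net/http"
    "time"
)

// postWithRetry sends an idempotent request with a per-attempt timeout,
// exponential backoff, and jitter. The caller supplies an idempotency key
// so the server can dedupe retried operations.
func postWithRetry(ctx context.Context, client *http.Client, url, idempotencyKey string) ([]byte, error) {
    const (
        maxAttempts = 4
        baseDelay   = 200 * time.Millisecond
        maxDelay    = 5 * time.Second
    )

    var lastErr error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        // Per-attempt timeout; base it on roughly 2x the dependency's p99 latency.
        attemptCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
        req, err := http.NewRequestWithContext(attemptCtx, http.MethodPost, url, nil)
        if err != nil {
            cancel()
            return nil, err
        }
        req.Header.Set("Idempotency-Key", idempotencyKey) // assumed server-side dedupe

        resp, err := client.Do(req)
        if err == nil && resp.StatusCode < 500 {
            // 2xx–4xx: stop retrying (a 4xx won't be fixed by trying again).
            body, readErr := io.ReadAll(resp.Body)
            resp.Body.Close()
            cancel()
            return body, readErr
        }
        if err == nil {
            resp.Body.Close()
            lastErr = fmt.Errorf("server returned %d", resp.StatusCode)
        } else {
            lastErr = err
        }
        cancel()

        // Exponential backoff capped at maxDelay, plus 0–30% jitter to
        // avoid synchronized retry storms.
        delay := baseDelay * time.Duration(1<<attempt)
        if delay > maxDelay {
            delay = maxDelay
        }
        delay += time.Duration(rand.Int63n(int64(delay) * 3 / 10))

        select {
        case <-ctx.Done():
            return nil, ctx.Err()
        case <-time.After(delay):
        }
    }
    return nil, fmt.Errorf("all %d attempts failed, last error: %w", maxAttempts, lastErr)
}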


Platform Considerations

If your organization has a mature platform team providing self-service infrastructure, the operational tax of additional services is lower. If you’re a small team doing everything yourself, each service needs CI/CD, monitoring, alerting, security scanning, and on-call rotations.

Service meshes (Istio, Linkerd) help with 20+ services, polyglot environments, or zero-trust requirements. They’re overkill for fewer than 10 services with simple networking needs. Don’t adopt one to follow best practices; adopt one when you have the specific problems it solves.

The Organizational Reality

Architecture decisions don’t happen in a vacuum. Real factors that influence decomposition choices include:

Team politics: Sometimes microservices adoption is driven by organizational dynamics rather than technical needs. A VP building an empire, teams wanting autonomy, or engineers wanting modern tech on their resumes. These aren’t invalid reasons, but be honest about them.

Hiring dynamics: Some engineers won’t join a company running a “boring monolith.” Whether this is rational or not, it affects your ability to hire. Factor it into your decisions.

Platform maturity: A platform team that provides observability, deployment pipelines, and service templates dramatically reduces the cost of running additional services. Without that platform, each new service is a burden.

Acknowledge these factors. Pretending architecture is purely technical leads to decisions that fail for non-technical reasons.


Evolution Patterns

The Strangler Fig

When evolving a mature system, don’t rewrite in place. Build new implementation alongside old, gradually redirecting traffic. Shadow traffic first (read-only, compare results), then handle a subset of writes, then all traffic while keeping the old system as backup. The old system “withers” as the new one proves itself. Use feature flags, monitor both systems, and have a documented rollback plan.
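
As a sketch of the shadow-traffic step, a small Go proxy that keeps the legacy system authoritative while mirroring reads to the new implementation and logging mismatches; the hostnames and comparison logic are illustrative:

// A read-only shadow proxy: the legacy system stays authoritative while the
// new implementation receives a copy of each read so results can be compared.
package main

import (
    "bytes"
    "io"
    "log"
    "net/http"
    "time"
)

const (
    legacyBase = "http://legacy.internal:8080" // assumed hostnames
    newBase    = "http://new-orders.internal:8080"
)

func main() {
    shadowClient := &http.Client{Timeout: 2 * time.Second}

    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        if r.Method != http.MethodGet {
            // This sketch covers only the read-only shadow phase; writes keep
            // flowing to the legacy system through the existing path.
            http.Error(w, "read-only shadow proxy", http.StatusMethodNotAllowed)
            return
        }

        // 1. Serve the user from the legacy system.
        legacyResp, err := http.Get(legacyBase + r.URL.RequestURI())
        if err != nil {
            http.Error(w, "legacy backend unavailable", http.StatusBadGateway)
            return
        }
        defer legacyResp.Body.Close()
        legacyBody, _ := io.ReadAll(legacyResp.Body)

        // 2. Mirror the read to the new system; its failures never reach the user.
        go func(uri string, want []byte) {
            resp, err := shadowClient.Get(newBase + uri)
            if err != nil {
                log.Printf("shadow call failed: %v", err)
                return
            }
            defer resp.Body.Close()
            got, _ := io.ReadAll(resp.Body)
            if !bytes.Equal(got, want) {
                log.Printf("shadow mismatch for %s", uri)
            }
        }(r.URL.RequestURI(), legacyBody)

        w.WriteHeader(legacyResp.StatusCode)
        w.Write(legacyBody)
    })

    log.Fatal(http.ListenAndServe(":8000", nil))
}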

The Independence Test

Before any deployment, ask: “Can I deploy without coordinating with any other team?”

Coupling Source | Resolution
Shared code | Duplicate or version as library
Shared database | The boundary might be wrong
API contract | Implement backward compatibility
Shared configuration | Extract to service discovery
Shared deployment | The services aren't really independent

True service independence means deploying without waiting for anyone. If you’re still scheduling coordinated releases, you haven’t achieved what decomposition promised. You’ve created a distributed monolith with extra steps.


The Architecture Evolution Path

Most successful systems follow this trajectory:

Stage | Characteristics | Appropriate When
1. Monolith | Single deployment, simple operations, fast development | Most startups, small teams, new products
2. Modular Monolith | Clear internal boundaries, extractable modules, single deployment | Growing teams, established products
3. Selective Extraction | Extract services where specific needs exist, keep rest in monolith | Specific pain that extraction solves
4. Distributed System | Multiple services with clear ownership and boundaries | Large organizations, high scale, justified complexity

The Practical Takeaway

Start with a modular monolith. Extract services when you feel specific pain that extraction solves, not before. The cost of premature decomposition (distributed complexity, operational overhead, debugging difficulty) exceeds the cost of delayed extraction (refactoring later). Most systems need far fewer services than their architects imagine.


Measuring What Matters

Before any decomposition, establish baselines: deployment frequency, lead time for changes, MTTR, change failure rate, P95 latency, and infrastructure cost. Document the specific problems you’re trying to solve. “Orders team blocked by Catalog team releases” is measurable. “Better architecture” is not.

After decomposition, measure the same things. If the metrics don’t improve, the decomposition didn’t help, regardless of how clean the architecture looks on a whiteboard.


A Cautionary Tale

A Series B fintech with 15 engineers split their Django monolith into 23 services over 18 months. Deployment time went from 15 minutes to 2 hours. Feature development slowed 40%. Infrastructure costs tripled. They still coordinated releases. They never hit the scale that would have justified it.

The company was acquired two years later. The acquiring company migrated everything back to a monolith.


Summary: The Principles That Matter

Boundaries solve coordination problems. If you don’t have coordination problems (different teams, different deployment cadences, different scaling needs), you don’t need boundaries. You need good code organization.

Every pattern solves a specific problem. Outbox solves reliable event publishing. Circuit breakers solve cascade failures. Sagas solve distributed transactions. If you don’t have the problem, don’t adopt the pattern.

Technology changes faster than domains. Code organized around business concepts survives technology migrations. Code organized around technical layers doesn’t. Orders, Customers, and Payments will outlive your current framework.

Start where debugging is easy. Get the boundaries right in a monolith where you have stack traces and breakpoints. Extract to services once the boundaries are proven. Debugging distributed systems is hard. Don’t make it harder by figuring out boundaries at the same time.

Measure the real costs. Before and after any decomposition, measure: deployment frequency, incident recovery time, developer productivity, infrastructure costs, and system latency. If the numbers don’t improve, the decomposition didn’t help.

The goal is working software. Not impressive architecture. Not best practices compliance. Software that works, that your team can maintain, that your users rely on. Sometimes that’s microservices. Sometimes it’s a monolith. Usually it’s something in between, shaped by your specific context.


The discipline isn’t in the decomposition. It’s in the restraint to decompose only where it helps.