Outbox Pattern: Guarantee DB-to-Broker Atomicity

Starting a new article on the outbox pattern — a simple idea that saved one of my teams from a lot of late-night firefighting.

What the outbox buys you

  • Atomicity between state change and the intent to publish: write your domain state and an "outbox" record in the same DB transaction.
  • Decoupling of the actual publish: a separate process reads the outbox and sends events. If the broker is down, you keep the message around and retry.
  • Simpler failure handling: no distributed two-phase commit, no cross-service transactions. You get at-least-once delivery and can make consumers idempotent.

I’ve used both a transactional outbox (app writes to an outbox table + a poller publishes) and CDC-based approaches (Debezium streaming DB changes into Kafka). Both work — the choice depends on operational constraints. For smaller teams or when you control the DB and app, transactional outbox is low-friction. For large, polyglot environments where you can't change app code easily, CDC can be better.

Here's the core idea in plain terms (with a minimal code sketch after the list):

  • In a single DB transaction, insert/update your business rows AND insert a new row into the outbox table with the event payload and metadata.
  • Commit.
  • A separate outbox-poller reads unprocessed outbox rows, publishes them to the broker, and marks them as published (or deletes/archives them).
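
In code, the first two steps look roughly like this. A minimal sketch, assuming PostgreSQL and psycopg2; the orders table, the place_order function, and the outbox columns (id, aggregate_id, event_type, payload, attempts, created_at, published_at) are illustrative, not a prescribed schema.

```python
# Transactional write sketch, assuming PostgreSQL + psycopg2.
# Illustrative outbox columns: id, aggregate_id, event_type, payload,
# attempts, created_at, published_at.
import json
import uuid

def place_order(conn, customer_id: str, total_cents: int) -> str:
    order_id = str(uuid.uuid4())
    event = {
        "event_type": "OrderPlaced",
        "order_id": order_id,
        "customer_id": customer_id,
        "total_cents": total_cents,
    }
    with conn:  # one transaction: commits on success, rolls back on error
        with conn.cursor() as cur:
            # 1) Domain write
            cur.execute(
                "INSERT INTO orders (id, customer_id, total_cents) VALUES (%s, %s, %s)",
                (order_id, customer_id, total_cents),
            )
            # 2) Outbox insert in the SAME transaction
            cur.execute(
                "INSERT INTO outbox (id, aggregate_id, event_type, payload) "
                "VALUES (%s, %s, %s, %s)",
                (str(uuid.uuid4()), order_id, event["event_type"], json.dumps(event)),
            )
    return order_id
```

If the transaction rolls back, the outbox row disappears with it, which is exactly the point: there is never an event for a change that didn't happen.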

A few lessons I've learned the hard way

  • Make consumers idempotent. The outbox gives you at-least-once delivery, so design for duplicates (see the consumer sketch after this list).
  • Keep outbox payloads small or reference external blobs. We once used an outbox with 100KB messages and our replication/backup costs spiked.
  • Index your outbox for efficient polling (published_at or a status flag, plus created_at). We had a poller doing full table scans until we added an index.
  • Be explicit about ordering guarantees. If you need strict ordering, additional constraints (partitioning key + sequence number) are necessary.
  • Monitor publish lag. The message is durable in the DB, but if your poller falls behind you still have a growing backlog.
  • Have a plan for poison messages: payloads that always fail publishing or always fail consumers. Move them to a DLQ after some retries.
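
For the idempotency point above, one common approach is to record each event id under a unique constraint and skip the work when the insert hits a conflict. A sketch under the same PostgreSQL/psycopg2 assumptions; processed_events and apply_side_effect are illustrative names, not part of the pattern itself.

```python
# Consumer-side idempotency sketch: a primary key on processed_events.event_id
# turns a redelivered event into a no-op. Assumes PostgreSQL + psycopg2.

def apply_side_effect(cur, payload: dict) -> None:
    # Placeholder for the real business logic triggered by the event.
    pass

def handle_event(conn, event_id: str, payload: dict) -> None:
    with conn:  # one transaction covers the dedup record and the side effect
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO processed_events (event_id) VALUES (%s) "
                "ON CONFLICT (event_id) DO NOTHING",
                (event_id,),
            )
            if cur.rowcount == 0:
                return  # already processed this event, nothing to do
            apply_side_effect(cur, payload)
```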

A few implementation details worth highlighting

  • FOR UPDATE SKIP LOCKED: lets multiple poller instances claim rows in parallel without stepping on each other (see the poller sketch after this list).
  • attempts column: helps us detect poison messages and move them to a dead-letter table after N attempts.
  • published_at vs delete: I prefer marking published_at instead of deleting right away. It helps auditing and troubleshooting. You can archive or compact older rows asynchronously.
  • Partitioning: if you have millions of outbox rows, partition the table by time or by aggregate type to keep performance predictable.
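
Putting those details together, the poller loop ends up looking roughly like this. Again a sketch under the PostgreSQL/psycopg2 assumptions; MAX_ATTEMPTS, BATCH_SIZE, and the publish callable are illustrative.

```python
# Poller sketch: claim a batch of unpublished rows with FOR UPDATE SKIP LOCKED so
# several poller instances can run side by side, stamp successes with published_at,
# and count failures so poison messages can later be moved to a dead-letter table.

MAX_ATTEMPTS = 5
BATCH_SIZE = 100

def poll_once(conn, publish) -> int:
    """Publish one batch of outbox rows; returns how many were published."""
    published = 0
    with conn:  # claimed rows stay locked until this transaction commits
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, event_type, payload FROM outbox "
                "WHERE published_at IS NULL AND attempts < %s "
                "ORDER BY created_at "
                "LIMIT %s "
                "FOR UPDATE SKIP LOCKED",
                (MAX_ATTEMPTS, BATCH_SIZE),
            )
            for row_id, event_type, payload in cur.fetchall():
                try:
                    publish(event_type, payload)  # e.g. a thin Kafka producer wrapper
                    cur.execute(
                        "UPDATE outbox SET published_at = now() WHERE id = %s",
                        (row_id,),
                    )
                    published += 1
                except Exception:
                    # Record the failed attempt; a separate job can move rows that
                    # reach MAX_ATTEMPTS into a dead-letter table.
                    cur.execute(
                        "UPDATE outbox SET attempts = attempts + 1 WHERE id = %s",
                        (row_id,),
                    )
    return published
```

Run this in a loop or on a schedule per poller instance; SKIP LOCKED means two instances never fight over the same batch, they simply skip each other's locked rows.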

When to prefer transactional outbox

  • You control the app and DB and want a low-friction, well-understood solution.
  • You want the simplest path to guarantee the event is tied to the DB transaction.

Checklist before you ship an outbox

  • Ensure your DB transaction encompasses both the domain write and the outbox insert.
  • Add an index on (published_at, created_at), and on (aggregate_id) if you query by aggregate.
  • Implement idempotency on consumers (idempotency keys, unique constraints).
  • Monitor poller lag and outbox size (a minimal check is sketched after this checklist).
  • Handle poison messages (attempts + DLQ).
  • Keep payloads small or store large blobs externally.
  • Consider GDPR: do you need to purge personal data from outbox rows?
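
For the monitoring item, most of what you need comes from the outbox itself: the backlog count and the age of the oldest unpublished row. A minimal check, using the same illustrative schema as the sketches above:

```python
# Lag check sketch: backlog size plus age (in seconds) of the oldest unpublished
# row, assuming the illustrative outbox schema above (PostgreSQL + psycopg2).

def outbox_lag(conn):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT count(*), "
            "       coalesce(extract(epoch FROM now() - min(created_at)), 0) "
            "FROM outbox WHERE published_at IS NULL"
        )
        backlog, oldest_age_seconds = cur.fetchone()
    return backlog, oldest_age_seconds  # export both as gauges and alert on thresholds
```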

Closing thoughts

The outbox pattern isn't glamorous, but it's one of those engineering practices that earns its keep. It forces you to acknowledge the reality of distributed systems — that side effects can fail independently — and gives you a pragmatic, testable way to make your system more reliable.

I publish these kinds of notes on gazar.dev and in my "Monday by Gazar" newsletter — if you follow them, you'll see the practical follow-up with code and infra diagrams. And if you’re dealing with a tricky outbox problem right now, tell me about it — I’ll share what I’d do.
