Javathoughts

Scaling Java-Based Real-Time Systems: The Hidden Tradeoffs of Event-Driven Design

Author: Javed Shaikh
[Figure: Apache Kafka, Spring Boot, and RabbitMQ, the tools behind event-driven Java systems]

Everyone Knows Event-Driven Architecture Is Scalable. Almost Nobody Talks About What Happens Next.

Loose coupling. Scalability. Independent deployments. High throughput.

These are the words you'll find in every architecture blog, every conference talk, every engineering job description. And they're all true.

But very few people discuss what actually happens after you adopt event-driven architecture in production.

Building an event-driven system isn't difficult. Operating one at scale is.

In this article, we'll explore the engineering and operational tradeoffs every Java developer should understand before introducing Kafka, RabbitMQ, or any messaging platform into a production system.


1. Start With the Fundamentals: Event, Command, and Message

These three terms are used interchangeably — and that confusion alone causes real architectural mistakes.

[Figure: Event vs Command vs Message, the three building blocks of event-driven systems]

📢 Event

An event is a fact. Something that already happened, immutable and irreversible. OrderPlaced, PaymentProcessed, UserRegistered. The producer doesn't care who consumes it — it fires and forgets.

🎯 Command

A command is an instruction. It tells a specific service to do something and expects a result. ProcessPayment, ReserveInventory, SendEmail. Unlike events, commands have a single intended recipient.

📦 Message

A message is the envelope. The transport container that carries either an event or a command through a broker. It has headers (correlation IDs, schema version, timestamps) and a body (the actual payload).

| Concept | Purpose | Direction | Knows the Recipient? |
| --- | --- | --- | --- |
| Event | Record a fact | One → Many | No |
| Command | Instruct an action | One → One | Yes |
| Message | Transport envelope | Any | Infrastructure concern |

The danger: When your OrderPlaced event starts carrying fields like notifyCustomer: true or reserveStock: true, you've turned an event into a disguised command. You've just coupled your producer to every consumer — exactly what EDA is supposed to prevent.
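To make the distinction concrete, here is a minimal sketch in plain Java. The type names, fields, and header keys (messageId, correlationId, payloadType) are illustrative assumptions, not a standard. Note that the event carries only facts, with no flags telling consumers what to do.

```java
import java.time.Instant;
import java.util.Map;
import java.util.UUID;

// An event: a past-tense fact with no knowledge of who will consume it.
record OrderPlaced(String orderId, String customerId, long totalCents, Instant occurredAt) {}

// A command: an imperative instruction addressed to one specific service.
record ProcessPayment(String orderId, long amountCents) {}

// A message: the transport envelope. Headers are infrastructure metadata, the body is the payload.
record Message<T>(Map<String, String> headers, T body) {
    // Wrap any payload (event or command) with hypothetical standard headers.
    static <T> Message<T> wrap(T body, String correlationId) {
        Map<String, String> headers = Map.of(
                "messageId", UUID.randomUUID().toString(),
                "correlationId", correlationId,
                "payloadType", body.getClass().getSimpleName());
        return new Message<>(headers, body);
    }
}

class EnvelopeDemo {
    public static void main(String[] args) {
        Message<OrderPlaced> event = Message.wrap(
                new OrderPlaced("order-1", "cust-9", 4_999, Instant.now()), "corr-abc");
        Message<ProcessPayment> command = Message.wrap(
                new ProcessPayment("order-1", 4_999), "corr-abc");
        System.out.println(event.headers().get("payloadType"));   // prints OrderPlaced
        System.out.println(command.headers().get("payloadType")); // prints ProcessPayment
    }
}
```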


2. Event-Driven Architecture Is NOT Event Sourcing

These patterns are confused constantly, even by experienced engineers.

[Figure: Event-Driven Architecture vs Event Sourcing, two different patterns with different goals]

Event-Driven Architecture (EDA) is about communication. Services talk to each other by publishing and consuming events via a broker. Your state still lives in a normal database.

Event Sourcing (ES) is about storage. Instead of persisting the current state of an entity, you store every state change as an event. To get the current state, you replay all events from the beginning.

| Dimension | Event-Driven Architecture | Event Sourcing |
| --- | --- | --- |
| Core concern | Inter-service communication | State persistence |
| Where state lives | Each service's own DB | The event log |
| Primary benefit | Decoupling and scalability | Full audit trail, time travel |
| Primary cost | Operational complexity | Event schema evolution, replay cost |
| Typical tooling | Kafka, RabbitMQ, SNS/SQS | EventStoreDB, Axon Framework |

You can use both together — but they solve different problems. Most teams only need EDA.
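The "replay to get current state" idea behind Event Sourcing can be sketched in a few lines. The account events below are hypothetical and the log is an in-memory list. A real event store would also deal with snapshots and versioning, which this sketch ignores.

```java
import java.util.List;

// Hypothetical account events. In Event Sourcing these are the only thing persisted.
interface AccountEvent {}
record Deposited(long cents) implements AccountEvent {}
record Withdrawn(long cents) implements AccountEvent {}

class AccountProjection {
    // Rebuild the current balance by replaying every event from the start of the log.
    static long replay(List<AccountEvent> log) {
        long balance = 0;
        for (AccountEvent e : log) {
            if (e instanceof Deposited d) {
                balance += d.cents();
            } else if (e instanceof Withdrawn w) {
                balance -= w.cents();
            }
        }
        return balance;
    }

    public static void main(String[] args) {
        List<AccountEvent> log = List.of(new Deposited(10_000), new Withdrawn(2_500), new Deposited(500));
        System.out.println(replay(log)); // prints 8000
    }
}
```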


3. Why Companies Adopt Event-Driven Architecture

The benefits are real. Here's why serious engineering organizations made the shift:

Loose Coupling: Service A publishes an event and moves on. If Service B is slow or down, A doesn't care. No direct dependency, no cascading failures.

Scalability: Kafka partitions let you parallelize consumption horizontally. More consumers = more throughput, up to the partition count.

Independent Deployments: Services with well-defined event contracts can be deployed, scaled, and released independently. This is the organizational superpower.

Fault Tolerance: Events are durable. If a consumer crashes mid-processing, the events wait in Kafka and are reprocessed on recovery.

| Company | Use Case | Scale |
| --- | --- | --- |
| Netflix | Recommendation pipeline, encoding, content state | ~700 billion events/day |
| Uber | Ride lifecycle, driver location, fraud detection | Millions of trips/day |
| Amazon | Order processing, fulfillment inventory sync | Billions of events/day |
| Banking (JPMorgan, Goldman) | Trade settlement, compliance, real-time risk | Microsecond latency |

4. The Hidden Tradeoffs 🔥

This is the part nobody's conference talk covers.


4.1 Eventual Consistency

When a customer places an order, they expect to see it immediately. But in an event-driven system, the read model is updated asynchronously — after the event travels through Kafka and is processed by a consumer. That can take 50–500ms. The customer refreshes and sees nothing.

How companies handle it: Netflix updates the UI optimistically in the browser before the backend confirms. Amazon shows the order confirmation immediately while processing happens in the background. The key is designing your UX around eventual consistency, not against it.

The rule: If a user action requires immediate, strongly consistent feedback, EDA may not be the right fit for that specific flow.


4.2 Duplicate Events and Idempotency

Kafka's default delivery guarantee is at-least-once. That means the same event can arrive at your consumer more than once — due to retries, rebalancing, or broker failover. Without protection, a customer can be charged twice for a single order.

How to handle it: Every consumer must be idempotent. Track a unique eventId in a processed_events table. Before processing, check if you've already handled this event. If yes, skip it safely.

This isn't optional. It's a fundamental discipline every Kafka consumer must have.
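A minimal sketch of that check, assuming an in-memory set as a stand-in for the processed_events table. The class and method names are hypothetical. In a real service the duplicate check and the business write would share one database transaction, with a unique constraint on eventId.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class IdempotentPaymentConsumer {
    // Stand-in for the processed_events table (a real one has a unique key on eventId).
    private final Set<String> processedEventIds = new HashSet<>();
    // Stand-in for the business state: total charged per order.
    private final Map<String, Long> chargedCentsByOrder = new HashMap<>();

    // Returns true if the event was applied, false if it was a duplicate and skipped.
    boolean onOrderPlaced(String eventId, String orderId, long amountCents) {
        if (!processedEventIds.add(eventId)) {
            return false; // already handled: skip without charging again
        }
        chargedCentsByOrder.merge(orderId, amountCents, Long::sum);
        return true;
    }

    long chargedCents(String orderId) {
        return chargedCentsByOrder.getOrDefault(orderId, 0L);
    }

    public static void main(String[] args) {
        IdempotentPaymentConsumer consumer = new IdempotentPaymentConsumer();
        boolean first = consumer.onOrderPlaced("evt-1", "order-1", 4_999);
        boolean second = consumer.onOrderPlaced("evt-1", "order-1", 4_999); // redelivery
        System.out.println(first + " " + second + " " + consumer.chargedCents("order-1"));
        // prints true false 4999
    }
}
```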


4.3 Message Ordering

Kafka guarantees ordering within a partition, not across partitions. If events for the same entity land on different partitions (which is what happens when records are produced without a key), a consumer can process AddressUpdated v3 before AddressUpdated v1.

The fix: Use the entity's ID (userId, orderId) as the Kafka partition key. All events for the same entity then route to the same partition, preserving order, as long as the partition count does not change. The tradeoff: parallelism is capped at the partition count, and a single very active key can turn its partition into a hotspot.

This is exactly what Uber does — rider ID is the partition key for all ride-state events.
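Why a key preserves order can be shown with a toy partitioner. This is a sketch, not Kafka's implementation: Kafka's default partitioner hashes the serialized key with murmur2, while String.hashCode() is used here only as a stand-in. The point is that the same key always maps to the same partition for a fixed partition count.

```java
class KeyPartitioner {
    // Map a key to a partition index in [0, partitionCount).
    // String.hashCode() stands in for Kafka's murmur2 hash (an assumption made for brevity).
    static int partitionFor(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        int first = partitionFor("user-42", 6);
        int second = partitionFor("user-42", 6);
        System.out.println(first == second); // prints true: same key, same partition
    }
}
```

Changing the partition count changes the mapping, which is one reason to size partitions generously up front.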


4.4 The Saga Pattern — Distributed Transactions

A single user action often spans multiple services: place order → charge payment → reserve inventory → send notification. In a monolith, that's one database transaction. In an event-driven system, there's no such thing.

Enter the Saga.

[Figure: Choreography-based Saga, showing the happy path and compensating transactions for distributed workflows]

Choreography-based: Each service reacts to events from the previous step. Order Service emits OrderPlaced, Payment Service consumes it and emits PaymentProcessed, Inventory consumes that and emits InventoryReserved. No central controller.

Orchestration-based: A central Saga Orchestrator (built with Spring Statemachine, Temporal, or Conductor) explicitly instructs each step and tracks overall progress.

When something fails midway, you need compensating transactions — explicit actions that undo what was done. If inventory reservation fails after payment, you must trigger a refund. This logic must be designed upfront, not retrofitted.

Choose choreography for simple 2–3 step flows. Choose orchestration when the workflow has complex branching or you need central visibility into saga state.
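Here is a minimal sketch of the orchestration variant with compensating transactions, in plain Java. SagaStep and SagaOrchestrator are hypothetical names. Real orchestrators such as Temporal or Conductor also persist saga state and handle timeouts, which this in-memory version does not.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// One saga step: an action plus the compensating action that undoes it.
record SagaStep(String name, Runnable action, Runnable compensation) {}

class SagaOrchestrator {
    // Runs steps in order. On failure, compensates the completed steps in reverse order.
    static boolean run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.action().run();
                completed.push(step);
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) {
                    completed.pop().compensation().run();
                }
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        boolean ok = run(List.of(
                new SagaStep("charge-payment", () -> log.add("charged"), () -> log.add("refunded")),
                new SagaStep("reserve-inventory",
                        () -> { throw new IllegalStateException("out of stock"); },
                        () -> {})));
        System.out.println(ok + " " + log); // prints false [charged, refunded]
    }
}
```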


4.5 The Dual-Write Problem and the Outbox Pattern

This is one of the most critical — and most glossed-over — tradeoffs in EDA.

Your service writes to a database and publishes to Kafka. These are two separate operations. What if the DB write succeeds but Kafka is temporarily unavailable? The event is lost. What if the event is published but the DB transaction then rolls back? You've told the world about something that didn't happen.

The Outbox Pattern solves this. Instead of writing to Kafka directly, write the event to an outbox table in the same database transaction as your business data. A separate relay process reads unpublished events and delivers them to Kafka. If the relay fails, it retries, so a committed event is never lost and a rolled-back one is never published. The relay can deliver the same event more than once, which is why consumers still need to be idempotent.

[Figure: The Transactional Outbox Pattern, with atomic writes and CDC-based event publishing]

In production, teams use Debezium for CDC-based relay — it reads the database's transaction log and streams changes to Kafka without any polling overhead.
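The mechanics can be sketched without a database or a broker. In this hypothetical in-memory version, a synchronized method stands in for the database transaction, a list stands in for the outbox table, and the publish callback stands in for the Kafka producer. It is a polling-style relay, not a CDC one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

class OutboxSketch {
    record OutboxRow(String eventId, String payload) {}

    private final Map<String, String> orders = new LinkedHashMap<>(); // stand-in for the orders table
    private final List<OutboxRow> outbox = new ArrayList<>();         // stand-in for the outbox table

    // Stand-in for one DB transaction: the order row and the outbox row are written together.
    synchronized void placeOrder(String orderId) {
        orders.put(orderId, "PLACED");
        outbox.add(new OutboxRow("evt-" + orderId, "OrderPlaced:" + orderId));
    }

    // Relay: publish pending rows in order. A row is removed only after publish returns.
    // If publish throws, the row stays and is retried on the next run (at-least-once delivery).
    synchronized int relay(Consumer<OutboxRow> publish) {
        int published = 0;
        while (!outbox.isEmpty()) {
            try {
                publish.accept(outbox.get(0));
            } catch (RuntimeException brokerUnavailable) {
                break;
            }
            outbox.remove(0);
            published++;
        }
        return published;
    }

    synchronized int pending() {
        return outbox.size();
    }

    public static void main(String[] args) {
        OutboxSketch service = new OutboxSketch();
        service.placeOrder("order-1");
        service.relay(row -> { throw new IllegalStateException("broker down"); });
        System.out.println(service.pending()); // prints 1: the event survived the outage
        service.relay(row -> System.out.println("published " + row.payload()));
        System.out.println(service.pending()); // prints 0
    }
}
```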


4.6 The Inbox Pattern

The consumer-side mirror of the Outbox. Before processing an incoming event, write its ID to an inbox table in the same transaction as your business logic. On the next delivery of the same event (duplicate), the ID already exists — skip it atomically and safely.

Together, Outbox + Inbox give you reliable, exactly-once-effective processing even in an at-least-once delivery world.


4.7 Consumer Lag

Consumer lag is how many messages are sitting in Kafka waiting to be processed. During a traffic spike — Black Friday, a viral moment, a batch job — lag can grow faster than your consumers can drain it.

Unmonitored lag leads to SLA breaches, stale data, cascading slowdowns, and in the worst case, Kafka retention rolling past unconsumed messages and deleting them permanently.

What Uber does: Auto-scale consumer pods via Kubernetes HPA using consumer lag as the scaling metric. When ride-requests lag exceeds a threshold, new consumer pods spin up automatically.

Monitor consumer lag as a first-class metric. Alert on it. Treat a growing lag spike the same way you'd treat rising error rates.
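For reference, lag is simple arithmetic over offsets: per partition, it is the log end offset minus the group's committed offset. The sketch below assumes you already have both maps (for example from your monitoring stack). Treating a missing committed offset as 0 is a simplification, and the method names are hypothetical.

```java
import java.util.Map;

class ConsumerLag {
    // Total lag = sum over partitions of (log end offset - committed offset).
    static long totalLag(Map<Integer, Long> endOffsets, Map<Integer, Long> committedOffsets) {
        long total = 0;
        for (Map.Entry<Integer, Long> e : endOffsets.entrySet()) {
            // Simplification: a partition with no committed offset counts from 0.
            long committed = committedOffsets.getOrDefault(e.getKey(), 0L);
            total += Math.max(0, e.getValue() - committed);
        }
        return total;
    }

    // Alert when lag crosses a threshold chosen from your SLA.
    static boolean shouldAlert(long lag, long threshold) {
        return lag > threshold;
    }

    public static void main(String[] args) {
        long lag = totalLag(Map.of(0, 1_500L, 1, 900L), Map.of(0, 1_200L, 1, 900L));
        System.out.println(lag);                   // prints 300
        System.out.println(shouldAlert(lag, 100)); // prints true
    }
}
```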


4.8 Dead Letter Queues (DLQ)

What happens when a message consistently fails to process? If you keep retrying indefinitely, that message blocks every subsequent message in the same partition — a single bad event can freeze your entire consumer group.

[Figure: Kafka DLQ and Retry, showing retry topics with exponential backoff followed by a dead letter queue]

The pattern: Retry with exponential backoff (e.g., 1s → 2s → 4s → 8s). After exhausting retries, route the message to a dead letter topic (DLT), Kafka's version of a Dead Letter Queue. Monitor the DLT and alert your on-call engineer immediately when anything lands there.

Spring Kafka provides this out of the box with DefaultErrorHandler and DeadLetterPublishingRecoverer. Configure it for every consumer — it's not optional.
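To show the control flow those classes implement for you, here is a library-free sketch. It is not Spring Kafka code: the class name is hypothetical, a list stands in for the DLT, and the delays are recorded rather than slept so the example runs instantly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class RetryThenDeadLetter {
    final List<String> deadLetters = new ArrayList<>(); // stand-in for the DLT topic
    final List<Long> delaysMs = new ArrayList<>();      // records the backoff schedule instead of sleeping

    // Try once, then retry up to maxRetries times with doubling delays. Give up to the DLT.
    boolean process(String record, Consumer<String> handler, int maxRetries, long initialDelayMs) {
        long delay = initialDelayMs;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                handler.accept(record);
                return true;
            } catch (RuntimeException e) {
                if (attempt < maxRetries) {
                    delaysMs.add(delay); // a real consumer would wait this long before retrying
                    delay *= 2;
                }
            }
        }
        deadLetters.add(record);
        return false;
    }

    public static void main(String[] args) {
        RetryThenDeadLetter consumer = new RetryThenDeadLetter();
        consumer.process("bad-record", r -> { throw new IllegalStateException("boom"); }, 4, 1_000);
        System.out.println(consumer.delaysMs);    // prints [1000, 2000, 4000, 8000]
        System.out.println(consumer.deadLetters); // prints [bad-record]
    }
}
```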


4.9 Schema Evolution

Your event schema is a public contract. Once consumers are depending on it, changing it carelessly breaks production systems.

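Add a required field with no default? A consumer that has upgraded to the new schema can no longer read older events that lack the field, and a strict deserializer on a consumer still using the old schema may reject the field it doesn't recognize.

The discipline:

  • Use Avro or Protobuf with a Schema Registry (Confluent Schema Registry is the most common choice)
  • Default to BACKWARD compatibility: consumers on the new schema must be able to read events written with the previous schema. If old consumers must also read new events, use FULL
  • New fields must always have default values
  • Never rename or remove fields without a multi-version migration plan and deprecation period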

Treat event schemas exactly like public APIs. They need versioning, changelogs, and sunset timelines.
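Why defaults matter can be illustrated without Avro. The sketch below models an event as a plain map and a hypothetical v2 schema that added a currency field with a default. It mimics what schema resolution does conceptually and is not the Avro or Schema Registry API.

```java
import java.util.HashMap;
import java.util.Map;

class OrderPlacedReader {
    // Defaults declared by the hypothetical v2 schema for fields that older events do not carry.
    private static final Map<String, Object> V2_DEFAULTS = Map.of("currency", "USD");
    // Fields the v2 reader knows about. Anything else in the event is ignored.
    private static final String[] V2_FIELDS = {"orderId", "totalCents", "currency"};

    // Read any event through the v2 view: missing fields get their default, unknown fields are dropped.
    static Map<String, Object> readAsV2(Map<String, Object> rawEvent) {
        Map<String, Object> view = new HashMap<>(V2_DEFAULTS);
        for (String field : V2_FIELDS) {
            if (rawEvent.containsKey(field)) {
                view.put(field, rawEvent.get(field));
            }
        }
        return view;
    }

    public static void main(String[] args) {
        // A v1 event written before the currency field existed.
        Map<String, Object> v1Event = Map.of("orderId", "o1", "totalCents", 100L);
        System.out.println(readAsV2(v1Event).get("currency")); // prints USD
    }
}
```

Without the default, the v2 reader would have nothing to put in currency for old events, which is exactly the failure a compatibility check is meant to catch before deployment.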


4.10 Distributed Tracing and Correlation IDs

A user reports a failed order. You have log output scattered across six services, half a million lines, and no way to find the relevant ones.

The fix: Assign a correlationId at the entry point of every user request. Propagate it in Kafka message headers across every service boundary. Include it in every log line via MDC. Now a single grep correlationId=abc-123 across your ELK stack reconstructs the entire story.

[Figure: Distributed Tracing, with the correlation ID propagated through Kafka and visualized in Jaeger/Zipkin]

Use OpenTelemetry with Micrometer Tracing in Spring Boot. Integrate with Jaeger or Zipkin to get a visual timeline of every span across every service — including the async Kafka hops. This transforms debugging from archaeology into a 10-second search.
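The propagation rule itself is small enough to sketch in plain Java. The header name, the class, and the log format here are assumptions for illustration. In a Spring service, the ID would typically travel in Kafka record headers and be placed in the logging MDC rather than concatenated by hand.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

class CorrelationIds {
    static final String HEADER = "correlationId"; // hypothetical header name

    // At the entry point: reuse the incoming ID if present, otherwise mint a new one.
    static String resolve(Map<String, String> incomingHeaders) {
        String existing = incomingHeaders.get(HEADER);
        return (existing == null || existing.isBlank()) ? UUID.randomUUID().toString() : existing;
    }

    // When publishing downstream: copy the ID into the outgoing message headers.
    static Map<String, String> outgoingHeaders(String correlationId) {
        Map<String, String> headers = new HashMap<>();
        headers.put(HEADER, correlationId);
        return headers;
    }

    // Prefix every log line with the ID so one search finds the whole flow.
    static String logLine(String correlationId, String message) {
        return "correlationId=" + correlationId + " " + message;
    }

    public static void main(String[] args) {
        String id = resolve(Map.of(HEADER, "abc-123"));
        System.out.println(logLine(id, "payment processed")); // prints correlationId=abc-123 payment processed
        System.out.println(outgoingHeaders(id));              // prints {correlationId=abc-123}
    }
}
```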


4.11 Event Replay

This is the superpower most teams don't appreciate until they need it desperately.

A bug in your Analytics Service corrupted three months of aggregated data. In a traditional system, that data is gone. In an event-driven system with Kafka as a durable log, you reset the consumer's offset to the beginning, replay every OrderPlaced event, and rebuild the read model from scratch.

This only works if your events are retained long enough. Kafka defaults to 7 days. For replay capability, configure retention of 30–365 days, or use Confluent's Tiered Storage for practically unlimited retention.


4.12 Backpressure

A batch job dumps 10 million events into Kafka in 5 minutes. Your consumer can handle 50,000 per minute, so draining that backlog takes 200 minutes even if nothing else arrives. Consumer lag explodes. JVM memory pressure rises. GC pauses lengthen. Other consumer groups on the same broker start suffering.

The levers: Tune max.poll.records to limit how many events your consumer fetches per poll cycle. Use batch listeners to process records in controlled chunks. Use max.poll.interval.ms to give your consumer enough time to finish a batch before Kafka considers it dead and triggers a rebalance.

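At the infrastructure level, use Kubernetes HPA to scale consumers in response to lag metrics. HPA does not know about Kafka lag on its own, so the metric has to be exposed through an external metrics source such as KEDA or the Prometheus Adapter, and consumers beyond the partition count will sit idle.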
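The effect of capping the batch size can be sketched with a plain queue. This is an illustration of the idea behind max.poll.records, not Kafka client code. BoundedPoller and its methods are hypothetical names, and the drain-time helper ignores new arrivals.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class BoundedPoller {
    // Take at most maxPollRecords from the backlog, mirroring what max.poll.records caps per poll.
    static List<String> poll(Queue<String> backlog, int maxPollRecords) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < maxPollRecords && !backlog.isEmpty()) {
            batch.add(backlog.remove());
        }
        return batch;
    }

    // Minutes needed to drain a backlog at a fixed rate, rounded up, ignoring new arrivals.
    static long minutesToDrain(long backlogSize, long recordsPerMinute) {
        return (backlogSize + recordsPerMinute - 1) / recordsPerMinute;
    }

    public static void main(String[] args) {
        Queue<String> backlog = new ArrayDeque<>(List.of("e1", "e2", "e3", "e4", "e5"));
        System.out.println(poll(backlog, 2));                      // prints [e1, e2]
        System.out.println(backlog.size());                        // prints 3
        System.out.println(minutesToDrain(10_000_000L, 50_000L));  // prints 200
    }
}
```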


4.13 Poison Messages

A malformed JSON record lands in your topic. Your consumer throws a JsonParseException on every attempt. Kafka doesn't remove failed messages — it keeps presenting the same record. Your consumer group grinds to a halt on that partition.

The fix: Classify errors. Transient failures (database timeout, network blip) should be retried. Non-transient failures (bad JSON, schema violation) should be immediately routed to the DLQ without burning retry attempts. Spring Kafka's addNotRetryableExceptions handles this distinction cleanly.
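The classification step on its own looks roughly like this. It is a plain-Java sketch of the decision, not the Spring Kafka API. The exception types chosen as non-retryable are illustrative assumptions, and IllegalArgumentException stands in for a parse or schema failure.

```java
import java.util.Set;
import java.util.concurrent.TimeoutException;

class ErrorClassifier {
    enum Action { RETRY, DEAD_LETTER }

    // Exception types treated as permanent failures. The choice of types is illustrative.
    private static final Set<Class<? extends Exception>> NOT_RETRYABLE =
            Set.of(IllegalArgumentException.class, UnsupportedOperationException.class);

    // Permanent failures go straight to the DLQ. Everything else is assumed transient and retried.
    static Action classify(Exception e) {
        for (Class<? extends Exception> type : NOT_RETRYABLE) {
            if (type.isInstance(e)) {
                return Action.DEAD_LETTER;
            }
        }
        return Action.RETRY;
    }

    public static void main(String[] args) {
        System.out.println(classify(new IllegalArgumentException("malformed JSON"))); // prints DEAD_LETTER
        System.out.println(classify(new TimeoutException("database timeout")));      // prints RETRY
    }
}
```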


4.14 Testing

Event-driven systems are notoriously difficult to test because everything is asynchronous, distributed, and time-dependent.

Use @EmbeddedKafka (from spring-kafka-test) to run an in-process Kafka broker in unit and integration tests. Combine it with Awaitility to assert asynchronous outcomes without flaky Thread.sleep calls.

Use Testcontainers for integration tests against a real Kafka instance running in Docker. This catches broker-specific behavior that EmbeddedKafka can miss.

Test your DLQ paths explicitly — send malformed messages and verify they land in the DLT. Test duplicate delivery — send the same event twice and verify your consumer handles it idempotently. Most teams only test the happy path.
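The "poll until true or time out" idea behind asynchronous assertions can be sketched in plain Java. This is a hand-rolled stand-in to show the mechanism, not Awaitility's API, and the Await class is hypothetical. In real tests, prefer the library.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

class Await {
    // Poll the condition until it is true or the timeout elapses. Returns whether it became true.
    static boolean until(BooleanSupplier condition, long timeoutMs, long pollIntervalMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and give up
                return false;
            }
        }
    }

    public static void main(String[] args) {
        AtomicBoolean consumed = new AtomicBoolean(false);
        // Simulates a consumer that finishes processing a little later on another thread.
        new Thread(() -> consumed.set(true)).start();
        System.out.println(until(consumed::get, 2_000, 10)); // prints true
    }
}
```

Unlike a fixed sleep, the test returns as soon as the condition holds and only waits the full timeout when something is actually wrong.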


4.15 Operational Complexity

Step back and look at what you're signing up for:

| Capability | Monolith | Event-Driven System |
| --- | --- | --- |
| Deployment | 1 artifact | N services + Kafka cluster |
| Debugging | Stack trace | Distributed trace across services |
| Transactions | DB ACID | Saga + compensating transactions |
| Schema changes | DB migration | Schema Registry + compatibility rules |
| Data consistency | Immediate | Eventually consistent |
| On-call complexity | Low | High |

None of this makes EDA wrong. It makes it a trade. Know what you're trading before you commit.


5. Operational Maturity — Kafka Is Just the Start

Running Kafka in production is the easy part. Here's what you actually need around it:

| Component | Purpose | Tool |
| --- | --- | --- |
| Schema Registry | Contract enforcement | Confluent Schema Registry |
| Consumer lag monitoring | SLA enforcement | Prometheus + Grafana |
| Distributed tracing | Debug async flows | Jaeger / Zipkin / Tempo |
| Log aggregation | Correlated log search | ELK / Loki |
| DLQ monitoring | Poison message alerting | Custom or Conduktor |
| Alerting | On-call notifications | PagerDuty / OpsGenie |
| Runbooks | Incident response | Confluence / Notion |

Alert on consumer lag. Alert on DLQ arrivals. Document your retry policies. Write runbooks for common failure scenarios before you need them at 3 AM.


6. It's Also an Organizational Problem

The biggest reason EDA projects fail isn't Kafka configuration. It's people.

When 10 teams publish to a shared cluster with no governance, topics get named inconsistently, schemas change without warning, consumers break silently, and nobody knows who owns what.

What good governance looks like:

Naming: <domain>.<aggregate>.<past-tense-verb>, for example orders.order.placed, payments.payment.processed, inventory.stock.reserved. Consistent, discoverable, searchable.

Versioning: Breaking changes require a new schema version. Old schemas are deprecated with a minimum 6-month sunset period. Never modify a published contract in place.

Ownership: Every topic has an owning team. Schema changes require the owner's approval and at least one consuming team's sign-off.

Documentation: Every event needs a page answering: what triggered it, what it means, what it does not mean, and who is consuming it today.

Platform Engineering: At sufficient scale, you need a dedicated team that owns the Kafka cluster, provides self-service topic provisioning, enforces naming and schema standards via CI/CD gates, and runs architecture reviews for new event domains.
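A naming rule like the one above is easy to enforce in a CI/CD gate. The sketch below checks only the shape of a topic name, and the exact character rules (lowercase letters, digits, hyphens) are an assumption. A regex cannot tell whether the last segment is really a past-tense verb.

```java
import java.util.regex.Pattern;

class TopicNames {
    // <domain>.<aggregate>.<past-tense-verb>: exactly three lowercase segments separated by dots.
    // Only the shape is checked. Tense has to be caught in review.
    private static final Pattern SHAPE =
            Pattern.compile("[a-z][a-z0-9-]*\\.[a-z][a-z0-9-]*\\.[a-z][a-z0-9-]*");

    static boolean isValid(String topic) {
        return SHAPE.matcher(topic).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("orders.order.placed")); // prints true
        System.out.println(isValid("OrderPlacedTopic"));    // prints false
    }
}
```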


Conclusion

Event-driven architecture delivers on its promises — if you're prepared for what comes with them.

The scalability, loose coupling, and independent deployments are real. So are the operational complexity, the eventual consistency challenges, the schema governance burden, and the organizational coordination overhead.

Before you add Kafka to your next system, answer these honestly:

  1. Is your team operationally ready? Monitoring, alerting, DLQs, runbooks, on-call?
  2. Have you solved schema governance? Who owns events? How do contracts evolve?
  3. Can your users tolerate eventual consistency? Is your UX designed for it?
  4. Do you have the testing discipline? Idempotency tests, DLQ tests, replay tests?

The biggest challenge in event-driven architecture isn't learning Kafka.

It's learning how to build, operate, and govern distributed systems as an organization.

Kafka is just the pipe. The hard part is everything else.


If you're building event-driven systems in Java and hit a tradeoff not covered here, I'd love to hear about it. The production war stories are always the most educational.