Why LinkedIn Is Moving Beyond Kafka: Lessons From 400K Topics and 32 Trillion Events
By Javed Shaikh
🔥 The Hook: When Good Enough Isn't Anymore
Kafka didn't fail. LinkedIn just outgrew it.
Imagine building a highway system so efficient that it becomes the gold standard for every major city in the world. Now imagine your own city growing so fast that even a 100-lane version of that highway is jammed. That is exactly what happened to LinkedIn and Apache Kafka.
Kafka, created at LinkedIn and open-sourced in 2011, revolutionized how we think about event-driven architectures. It made the distributed log mainstream. But recently, the engineering world noticed a subtle shift. LinkedIn is migrating away from the very system it created, moving toward a next-generation system: Northguard.
Why? Because when you hit extreme hyperscale, the physics of distributed systems fundamentally change. We are talking about:
- 32 trillion events flowing per day
- 17 Petabytes of data processed daily
- Over 400,000 topics and thousands of brokers
At this scale, the assumptions that made Kafka brilliant started to crack. Let's explore why even the king of event streaming has its limits, and what the future of data looks like.
🏗️ The Rise of Kafka: The Backbone of the Data Economy
Before Kafka, data integration was a messy web of point-to-point connections. If Service A needed data from Service B, you built a custom pipeline. It was brittle, unscalable, and a nightmare to operate.
Kafka solved this by introducing a brilliantly simple concept: the distributed, append-only log. It acted as a massive, high-throughput buffer. Producers wrote to the log; consumers read from the log at their own pace.
Kafka's partition model allowed it to scale horizontally. Need more throughput? Add more partitions. Need fault tolerance? Replicate those partitions across brokers. It was elegant, performant, and soon became the central nervous system for modern enterprises, from Netflix to Uber.
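
To make that concrete, here is what the model looks like from application code. This is a minimal sketch using the confluent-kafka Python client; the broker address, topic name, and payload are placeholders:

```python
# Minimal producer/consumer sketch using the confluent-kafka Python client.
# "localhost:9092" and "page-views" are placeholder values.
from confluent_kafka import Producer, Consumer

producer = Producer({"bootstrap.servers": "localhost:9092"})
# Producers append to the log; the message key determines the target partition.
producer.produce("page-views", key="user-42", value=b'{"page": "/jobs"}')
producer.flush()  # block until the broker acknowledges the write

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "analytics",          # consumers in a group share partitions
    "auto.offset.reset": "earliest",  # start from the beginning of the log
})
consumer.subscribe(["page-views"])

msg = consumer.poll(timeout=5.0)      # read at the consumer's own pace
if msg is not None and msg.error() is None:
    print(msg.partition(), msg.offset(), msg.value())
consumer.close()
```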

But as LinkedIn's user base and feature set exploded, the traffic on that highway became unfathomable.
📈 When Scale Breaks Assumptions
In system design, scale exposes every hidden bottleneck. What works for 10,000 requests per second will melt your servers at 10 million.
For LinkedIn, the challenge wasn't just the sheer volume of messages (though 32 trillion a day is staggering). The real challenge was metadata.
In Kafka, every topic has partitions, and every partition has replicas. The cluster must keep track of where every replica lives, who the "leader" of each partition is, and what the synchronization status is. As LinkedIn's architecture evolved into microservices, engineers created more and more topics.
They hit 400,000 topics. Multiply that by partitions and replicas, and you are dealing with millions of metadata points that must be continuously updated and propagated across the cluster.
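
A quick back-of-envelope calculation shows how fast this compounds. The per-topic partition count and replication factor below are assumptions for illustration, not LinkedIn's actual configuration:

```python
# Back-of-envelope metadata estimate; per-topic counts are illustrative.
topics = 400_000
partitions_per_topic = 8      # assumed average, for illustration only
replication_factor = 3        # a common Kafka default

partitions = topics * partitions_per_topic   # 3,200,000 partitions
replicas = partitions * replication_factor   # 9,600,000 replicas

# Each replica carries state the cluster must track and propagate:
# its host broker, leader/follower role, and in-sync status.
print(f"{partitions:,} partitions -> {replicas:,} replica entries")
# 3,200,000 partitions -> 9,600,000 replica entries
```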
⚠️ The Cracks in the System
Let's look at the specific breaking points LinkedIn encountered.
1. The Metadata Bottleneck
In classic Kafka (pre-KRaft), a single controller node manages cluster metadata and stores it in ZooKeeper. When a broker fails, the controller must reassign leadership for thousands of partitions. At LinkedIn's scale, a single broker failure could trigger an avalanche of metadata updates. The controller became a massive, single point of failure and a performance bottleneck. The system spent more time thinking about its own state than moving data.
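
To feel the size of that avalanche, here is a toy estimate, reusing the assumed numbers from the sketch above:

```python
# Toy estimate of the metadata storm from one broker failure.
# Cluster dimensions are assumed for illustration.
brokers = 3_000
partitions = 3_200_000

# Leader replicas are spread roughly evenly, so one broker leads about:
leaders_per_broker = partitions // brokers           # ~1,066 partitions

# The controller must elect a new leader for each of them, then push
# the updated metadata to every surviving broker.
updates_to_fan_out = leaders_per_broker * (brokers - 1)
print(f"{leaders_per_broker:,} leader elections, "
      f"~{updates_to_fan_out:,} metadata updates to propagate")
# 1,066 leader elections, ~3,196,934 metadata updates to propagate
```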

2. The Rigidity of Partitions
Kafka ties a partition to a specific disk on a specific broker. If a topic suddenly goes viral and receives a massive spike in traffic, the underlying disk holding that partition can max out on IOPS. Because data is tightly coupled to the broker's local storage, fixing a "hot spot" means copying massive amounts of data from one machine to another.
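
The hot spot usually starts with routing. Kafka's default partitioner hashes the message key (murmur2 in the Java client), so a single viral key maps to a single partition on a single disk. Here is a simplified sketch of that behavior, using Python's built-in hash() purely for illustration:

```python
# Simplified sketch of key-based partition routing.
# Kafka's default partitioner uses murmur2 hashing; plain hash() here
# only illustrates the coupling between key, partition, and disk.
from collections import Counter

NUM_PARTITIONS = 8

def pick_partition(key: str) -> int:
    return hash(key) % NUM_PARTITIONS

# A "viral" key dominates the traffic...
events = ["viral-post"] * 9_000 + [f"user-{i}" for i in range(1_000)]
load = Counter(pick_partition(k) for k in events)
print(load.most_common(3))
# One partition absorbs ~90% of the writes, so its broker's disk
# maxes out on IOPS while the others sit mostly idle.
```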
3. Rebalancing Chaos
Imagine trying to move a 10-lane highway while traffic is flowing at 150 mph. When LinkedIn needed to add brokers to scale up, Kafka had to physically move gigabytes or terabytes of partition data to the new nodes to balance the load. This rebalancing process is highly I/O intensive, often causing latency spikes and degraded performance for producers and consumers.
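
The cost is easy to underestimate. Here is a rough sketch of how much data a modest expansion forces across the network; the cluster sizes are assumptions for illustration:

```python
# Rough estimate of the data moved when expanding a Kafka cluster.
# All figures are assumed for illustration.
retained_data_tb = 5_000   # total partition data on disk across the cluster
old_brokers = 100
new_brokers = 20

# To even out the load, roughly new/(old + new) of the data must move
# onto the new brokers -- over the same network serving live traffic.
fraction_moved = new_brokers / (old_brokers + new_brokers)
data_moved_tb = retained_data_tb * fraction_moved
print(f"~{data_moved_tb:,.0f} TB to copy ({fraction_moved:.0%} of the cluster)")
# ~833 TB to copy (17% of the cluster)
```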

🚀 Enter Northguard: The Evolution of Streaming
LinkedIn realized they couldn't just patch Kafka. They needed a paradigm shift. They built Northguard, their next-generation streaming system.
If Kafka is a physical highway where lanes are permanently painted on the ground, Northguard is a dynamic, magnetic levitation system where tracks assemble themselves in real-time based on weight and speed.
Key Innovations of Northguard:
- Segment-Based Architecture: Instead of tying a massive partition to a single broker's disk, Northguard breaks partitions into smaller, immutable "segments." These segments are distributed dynamically across the cluster. If a broker gets hot, Northguard doesn't move an entire partition; it simply stops writing to the segment on the hot broker and starts a new segment on a cold one (see the sketch after this list). Instant relief.
- Distributed, Sharded Metadata: Goodbye, single controller. Northguard shards its metadata. Instead of one node knowing everything, the responsibility is distributed using consensus protocols (similar to Raft). Metadata capacity now scales with the cluster instead of being capped by a single node.
- Storage and Compute Separation: By separating the brokers (compute/serving) from the underlying storage, Northguard can scale them independently. If you need more storage, you add storage nodes. If you need more throughput, you add compute nodes.
- Auto-Balancing: Rebalancing is no longer a chaotic, manual terror. Because segments are small and storage is decoupled, balancing the cluster is continuous, automatic, and invisible to the user.
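
Here is a toy sketch of the segment-rolling idea from the first bullet. This is not Northguard's actual code or API; the classes and the hot-node check are invented purely to illustrate the mechanism:

```python
# Toy sketch of segment-based write routing. Hypothetical design,
# inferred from the description above; not Northguard's real API.
from dataclasses import dataclass, field

@dataclass
class Segment:
    node: str
    records: list = field(default_factory=list)
    sealed: bool = False

class SegmentedPartition:
    """A partition modeled as a chain of small, immutable segments."""

    def __init__(self, nodes: list[str], is_hot) -> None:
        self.nodes = nodes
        self.is_hot = is_hot                      # callback: node -> bool
        self.segments = [Segment(node=nodes[0])]

    def append(self, record: bytes) -> None:
        active = self.segments[-1]
        if self.is_hot(active.node):
            # No data is copied: seal the segment on the hot node and
            # start a fresh one on a cold node. Instant relief.
            active.sealed = True
            cold = next(n for n in self.nodes if not self.is_hot(n))
            active = Segment(node=cold)
            self.segments.append(active)
        active.records.append(record)

# Usage: broker-1 saturates, so new writes roll over to broker-2.
hot_nodes: set[str] = set()
p = SegmentedPartition(["broker-1", "broker-2"], lambda n: n in hot_nodes)
p.append(b"event-1")          # written to a segment on broker-1
hot_nodes.add("broker-1")     # broker-1's disk maxes out
p.append(b"event-2")          # a new segment starts on broker-2
print([(s.node, s.sealed, len(s.records)) for s in p.segments])
# [('broker-1', True, 1), ('broker-2', False, 1)]
```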

🌉 Xinfra: The Unsung Hero of the Migration
You don't just hit the "off" switch on Kafka when 32 trillion events depend on it. Rewriting thousands of microservices to use a new system's client would take a decade.
To solve this, LinkedIn built an ingenious abstraction layer called Xinfra.
Xinfra acts as a universal proxy. To the microservices, Xinfra looks, talks, and acts exactly like a standard Kafka cluster. The services use standard Kafka APIs. But under the hood, Xinfra decides whether to route that traffic to legacy Kafka clusters or the new Northguard clusters.
This allowed LinkedIn to perform "dual-writes" and slowly, safely drain traffic from Kafka to the new infrastructure without a single product engineer having to rewrite their code.
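
In spirit, the routing looks something like this heavily simplified sketch. None of this is Xinfra's real code; the class and client interfaces are hypothetical:

```python
# Heavily simplified sketch of a Kafka-compatible routing proxy.
# Hypothetical interfaces; not Xinfra's real code.

class RoutingProxy:
    """Speaks the Kafka produce API to clients; routes per-topic behind it."""

    def __init__(self, kafka_client, northguard_client, migrated_topics):
        self.kafka = kafka_client
        self.northguard = northguard_client
        self.migrated = migrated_topics    # topics fully drained from Kafka
        self.dual_write = set()            # topics mid-migration

    def produce(self, topic: str, value: bytes) -> None:
        # Product code keeps calling the standard Kafka API, unchanged.
        if topic in self.migrated:
            self.northguard.produce(topic, value)
        elif topic in self.dual_write:
            # During migration, write to both systems so consumers can be
            # cut over and validated before Kafka is drained.
            self.kafka.produce(topic, value)
            self.northguard.produce(topic, value)
        else:
            self.kafka.produce(topic, value)
```

The key design choice: the proxy owns the topic-to-cluster mapping, so migration state lives in one place instead of being scattered across thousands of services.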

⚖️ Kafka vs. Northguard: The Hyperscale Showdown
| Feature | Apache Kafka (Classic) | Northguard (LinkedIn's Next-Gen) |
|---|---|---|
| Architecture | Monolithic Partitions tied to Brokers | Segmented Partitions, Decoupled Storage |
| Metadata Management | Centralized (ZooKeeper / single controller) | Decentralized / Sharded Metadata |
| Scaling | Add brokers, manually rebalance terabytes | Auto-balances segments instantly |
| Hotspot Handling | High pain (requires moving large partitions) | Low pain (roll to a new segment on a cold node) |
| Ideal Use Case | 99% of companies (startups to enterprises) | Hyperscalers (trillions of events, 100K+ topics) |

💡 The Real Lesson: Don't Prematurely Optimize
Reading this, you might be thinking: "Oh no, is Kafka obsolete? Do I need to rip it out?"
Absolutely not.
Kafka didn't fail. It successfully carried LinkedIn from a growing startup to a global giant processing 32 trillion events a day. Most companies will never, ever hit the scale where Kafka breaks.
The real engineering lesson here is about evolutionary architecture. You build for the scale you have, plus a reasonable runway. When you outgrow the foundational technology, you don't panic; you abstract (like Xinfra) and evolve (like Northguard).
🔮 The Future of Streaming Systems
LinkedIn's move signals where the industry is heading. We are moving away from rigid, stateful brokers toward:
- Serverless Streaming: Developers shouldn't know what a partition or a broker is. They should just say, "Here is a topic, give me throughput."
- Decoupled Compute and Storage: The era of tying data to a specific machine's hard drive is ending. Cloud-native architectures demand independent scaling and durability.
- Self-Healing Infrastructure: Systems must auto-balance and resolve hot spots continuously without human intervention or performance degradation.

🎯 Conclusion
LinkedIn created Kafka to solve the data integration nightmare of the 2010s. Now, they are pioneering the solutions for the hyperscale nightmares of the 2020s.
By breaking monolithic partitions into segments, sharding metadata, and introducing a seamless proxy layer, they've built a system that can handle the unimaginable weight of the modern data economy.
Kafka changed the world. Now, systems like Northguard are showing us what comes next.
What are your thoughts on decoupled storage in streaming systems? Have you ever hit the partition limit in Kafka? Let's discuss in the comments!
