gRPC vs Kafka: Upgrading Service Communication

There was a period when engineering teams believed Apache Kafka could solve every challenge in service communication. From service-to-service messaging to retries, scalability, and asynchronous workflows, Kafka appeared indispensable. However, as systems grew in complexity, limitations in this approach surfaced, not because of Kafka itself but because of how it was used. This realization led to a pivotal shift, not towards REST or WebSockets, but towards gRPC.

Overreliance on Kafka: The Initial Setup

The context for this change began within a loan servicing platform for banks and non-banking financial companies. The architecture included distinct microservices for:

  • Loan creation
  • Disbursement
  • Eligibility assessments
  • Moratorium and rate-of-interest changes
  • Installment recalculation
  • Notifications
  • Credit bureau reporting
  • Accounting and ledger management

Kafka was initially adopted to ensure decoupled, scalable service interactions. The structure: whenever Service A completed a task, it published an event to Kafka. Services B, C, and D consumed these events and responded accordingly.
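
In code, that pattern looked roughly like the sketch below, using the standard Java Kafka client. This is a minimal illustration only; the loan.disbursed topic, consumer group, and class names are hypothetical, not the platform's actual ones.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class LoanEventFlowSketch {

        // Service A: publish an event once a task (e.g. disbursement) completes.
        static void publishDisbursedEvent(String loanId, String payloadJson) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("loan.disbursed", loanId, payloadJson));
            }
        }

        // Services B, C, D: each consumer group polls the topic and reacts to events.
        static void consumeDisbursedEvents() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka:9092");
            props.put("group.id", "notification-service");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("loan.disbursed"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // React to the event, e.g. trigger a notification for record.value().
                    }
                }
            }
        }
    }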

Where Kafka Fell Short

While this design appeared robust, issues emerged quickly:

  • Failure Detection: If NotificationService crashed, no alert was raised. Its events simply sat in the topic, unconsumed and unnoticed.
  • Backpressure Problems: Delays in AccountingService consumption resulted in outdated ledger entries.
  • Silent Failures: Services could fail unnoticed without proactive log monitoring or tickets.
  • Lost Retries: After repeated consumer failures, an event was pushed to the dead-letter queue, where it could easily be overlooked.
  • Immediate Responses Needed: Kafka could not guarantee real-time results for scenarios needing instant feedback, such as checking disbursement status or eligibility.

The root issue was not Kafka’s capabilities, but rather using Kafka for scenarios it wasn’t intended to handle.

The Turning Point

A critical moment occurred when a high-value loan customer did not receive a notification post-disbursement. The event had been published to Kafka, but the NotificationService was down, resulting in no retries, no alerts, and support teams left in the dark. It became clear that directly invoking the required service—rather than relying on Kafka—offered more certainty and visibility.

Exploring gRPC for Direct Service Communication

Re-evaluating the ecosystem, the team decided to explore gRPC more thoroughly. Key insights included:

  • Built on HTTP/2, enabling multiplexed binary communication over a single connection.
  • Utilization of Protocol Buffers (protobuf), delivering small and efficient payloads.
  • Strongly typed contracts ensured clarity and safety.
  • Native support for bi-directional streaming.
  • Unified .proto files enabled seamless code generation across languages like Java and Go.
  • More straightforward, real-time interaction without brokers, polling, or unnecessary intermediaries.

A proof-of-concept demonstrated responses from EligibilityService in under 10ms. This speed and clarity prompted broader adoption.
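
For illustration, a synchronous eligibility call of that kind might look roughly like the following with grpc-java. The EligibilityServiceGrpc stub and the CheckEligibilityRequest/CheckEligibilityResponse messages are assumed to be generated by protoc from a hypothetical eligibility .proto contract; the names and the 200ms deadline are illustrative, not the team's actual contract.

    import java.util.concurrent.TimeUnit;

    import io.grpc.ManagedChannel;
    import io.grpc.ManagedChannelBuilder;

    public class EligibilityClientSketch {
        public static void main(String[] args) {
            // One long-lived HTTP/2 channel to the eligibility service.
            ManagedChannel channel = ManagedChannelBuilder
                    .forAddress("eligibility-service", 9090)
                    .usePlaintext()
                    .build();

            // Blocking stub generated from the hypothetical eligibility contract.
            EligibilityServiceGrpc.EligibilityServiceBlockingStub stub =
                    EligibilityServiceGrpc.newBlockingStub(channel)
                            .withDeadlineAfter(200, TimeUnit.MILLISECONDS);

            CheckEligibilityRequest request = CheckEligibilityRequest.newBuilder()
                    .setLoanId("LN-1234")
                    .build();

            // Synchronous request/response: the caller gets an answer or an explicit error.
            CheckEligibilityResponse response = stub.checkEligibility(request);
            System.out.println("Eligible: " + response.getEligible());

            channel.shutdown();
        }
    }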

Replacing and Retaining: The Split Approach

Transitioned to gRPC:

  • Loan to EligibilityService (immediate response required)
  • Loan to NotificationService (retry logic simplified, fallback enabled; see the retry sketch after this list)
  • Loan to AccountingService (ensuring ledger consistency, with rollback)
  • Loan to ScheduleService (real-time amortization schedules)
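
As a rough sketch of the simplified retry logic on the NotificationService path, grpc-java can apply a per-method retry policy through the channel's service config. The service name, attempt count, and backoff values below are assumptions for illustration, not the team's actual settings.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import io.grpc.ManagedChannel;
    import io.grpc.ManagedChannelBuilder;

    public class NotificationChannelSketch {
        static ManagedChannel buildChannelWithRetries() {
            // Transparent retries for transient failures (e.g. UNAVAILABLE).
            Map<String, Object> retryPolicy = new HashMap<>();
            retryPolicy.put("maxAttempts", 3.0);
            retryPolicy.put("initialBackoff", "0.1s");
            retryPolicy.put("maxBackoff", "1s");
            retryPolicy.put("backoffMultiplier", 2.0);
            retryPolicy.put("retryableStatusCodes", List.of("UNAVAILABLE"));

            // Apply the policy to all methods of the hypothetical notification service.
            Map<String, Object> methodConfig = new HashMap<>();
            methodConfig.put("name", List.of(Map.of("service", "loan.NotificationService")));
            methodConfig.put("retryPolicy", retryPolicy);

            return ManagedChannelBuilder.forAddress("notification-service", 9090)
                    .defaultServiceConfig(Map.of("methodConfig", List.of(methodConfig)))
                    .enableRetry()
                    .usePlaintext()
                    .build();
        }
    }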

Retained Kafka for:

  • Audit logging (fire-and-forget, replay capability)
  • Data lake ingestion (bulk event analytics)
  • Fan-out scenarios (one service publishing to many consumers)
  • Bulk asynchronous workflows needing resilience and buffering

Kafka was shifted to the areas where it excels and was no longer forced into roles that require synchronous or real-time responses.
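
For the retained paths such as audit logging, publishes stay fire-and-forget. A minimal sketch, assuming a hypothetical loan.audit topic and string-serialized JSON payloads:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class AuditLogPublisherSketch {
        private final KafkaProducer<String, String> producer;

        // kafkaProps is expected to carry bootstrap.servers and String serializers.
        public AuditLogPublisherSketch(Properties kafkaProps) {
            this.producer = new KafkaProducer<>(kafkaProps);
        }

        // Fire-and-forget: the caller never blocks on the broker acknowledgement;
        // the callback only records delivery failures for later investigation or replay.
        public void publishAuditEvent(String loanId, String auditJson) {
            producer.send(new ProducerRecord<>("loan.audit", loanId, auditJson),
                    (metadata, exception) -> {
                        if (exception != null) {
                            System.err.println("Audit publish failed for " + loanId + ": " + exception);
                        }
                    });
        }
    }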

What Improved After the Switch

Debugging Simplicity

Kafka demanded examination of multiple log sources—producers, consumers, dead-letter queues, and retry logic. gRPC brought back direct, immediate feedback: success or explicit failure.
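
To illustrate that explicit-failure point, a blocking gRPC call either returns a response or throws a StatusRuntimeException carrying a status code right at the call site. The sketch below reuses the hypothetical stub and message types from the earlier eligibility example.

    import io.grpc.Status;
    import io.grpc.StatusRuntimeException;

    public class ExplicitFailureSketch {
        // Reuses the hypothetical stub and message types from the earlier eligibility sketch.
        static void checkWithExplicitFailure(
                EligibilityServiceGrpc.EligibilityServiceBlockingStub stub,
                CheckEligibilityRequest request) {
            try {
                CheckEligibilityResponse response = stub.checkEligibility(request);
                System.out.println("Eligible: " + response.getEligible());
            } catch (StatusRuntimeException e) {
                // Failure surfaces immediately at the call site as a status code
                // such as DEADLINE_EXCEEDED or UNAVAILABLE, rather than as a stuck event.
                Status.Code code = Status.fromThrowable(e).getCode();
                System.err.println("Eligibility check failed with status " + code);
            }
        }
    }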

  • Latency Reduction: Synchronous gRPC calls achieved 10–30ms response times, eliminating the delays introduced by Kafka's polling consumers, which made them ideal for eligibility checks and balance validations.
  • Reduced Infrastructure Complexity: Kafka required elaborate ecosystem maintenance (Zookeeper/KRaft, partitioning, retention, consumer monitoring). gRPC lightened the load, needing only service deployment management. Alert fatigue drastically dropped for the infra team.

Lessons Learned

Kafka remains a powerful technology, but choosing it blindly leads to confusion. Not every problem is best solved asynchronously or with event-driven architecture. The migration to gRPC brought not just performance gains, but far greater architectural clarity.

Decisions now hinge on thoughtful consideration:

  • Is replay or fan-out necessary?
  • Is the workload async or sync?
  • Is a real-time response essential?

When rapid, transparent interactions are crucial, gRPC is selected. Kafka is reserved for situations demanding replayability or fan-out patterns.

Conclusion

Kafka continues to power significant portions of system infrastructure but is no longer the default for every scenario. The adoption of gRPC granted greater speed, transparency, and control for service-to-service communication, emphasizing the value of using a tool in alignment with its strengths.

Employing each technology for its intended use case, rather than for its popularity, ultimately elevated the system's reliability and maintainability. The evolution from Kafka overuse to a gRPC-enabled hybrid fostered an environment where architectural decisions are rooted in necessity, not legacy or trend.
