Revolutionizing Real-Time Ads Data: Zomato’s Flink SQL Transformation

Jump to

Zomato’s Data Platform and Ads teams recently embarked on a mission to overhaul their real-time data streaming architecture, targeting the persistent bottlenecks and inefficiencies in their ads feedback system. By rethinking their approach and leveraging the latest in Flink SQL technology, they not only achieved dramatic reductions in system state size and infrastructure costs but also set a new standard for reliability and scalability in real-time data processing.

The Real-Time Challenge: Managing Massive State Pipelines

For Zomato, real-time feedback is the backbone of its advertising ecosystem. The platform must process enormous volumes of user interactions-clicks, impressions, and conversions-without missing a beat. However, legacy pipelines, burdened by vast states sometimes exceeding 150 GB, were slowing down recovery, risking state loss, and driving up memory usage. These issues threatened the accuracy of ad reporting and the trust of restaurant partners who rely on precise, real-time campaign metrics.

Key Issues with the Previous System

  • State Overload: Deduplication logic required storing user IDs for long periods, leading to ballooning state sizes and frequent instability.
  • Checkpointing Struggles: Larger states resulted in unwieldy checkpoint files, making recovery from failures slow and unreliable.
  • Memory Spikes: High memory consumption during peak loads often led to system failures.
  • State Loss: When the system lost state, ad counters would reset, causing data inaccuracies, over-delivery of ads, and incorrect billing.

The Power of Reconciliation in Real-Time Data

To address these pain points, Zomato introduced a reconciliation mechanism. This process acts as a safety net, periodically verifying and correcting campaign data. If a job fails or misses events, reconciliation ensures that the final ad counts remain accurate, protecting both business outcomes and partner trust.

How Reconciliation Works

  • Background Validation: Runs at set intervals to cross-check and update campaign metrics.
  • Data Integrity: Ensures no single point of failure can corrupt or lose critical data.
  • Financial Accuracy: Prevents over- or under-delivery of ads, safeguarding both Zomato and its partners.

Ads at Zomato: A Data-Driven Engine

Zomato’s advertising platform is intricately tied to its data infrastructure. Ads are tailored using machine learning models that analyze millions of data points to determine the right audience, timing, and placement. Metrics such as impressions, clicks, conversions, and ROI are tracked in real time, making the reliability of the feedback system paramount.

Ad Types and Metrics

  • Performance and Awareness Ads: Designed to meet diverse partner objectives.
  • Targeted Campaigns: Based on user behavior, cuisine preferences, and spending patterns.
  • Key Metrics: Impressions, clicks, conversions, and ROI.

Legacy System Limitations and the Need for Change

The original feedback loop, built on Flink 1.8 and Java, maintained massive state for deduplication and aggregation. This architecture struggled with:

  • Handling Late Events: User actions reported late required extended state retention, further increasing state size.
  • Manual Data Fixes: State loss led to incorrect deduplication and required manual intervention.
  • Over-Delivery Risks: Inaccurate data could result in partners being overcharged or misinformed.

Designing a Robust New Feedback Loop

The new architecture was engineered for reliability, scalability, and maintainability. Key innovations included:

Incremental Counting

  • Switch from Total to Incremental Counts: Instead of storing total daily counts, the new system publishes only incremental updates, minimizing the impact of state loss.
  • Recon Job for Accuracy: A separate job periodically fetches and updates campaign totals, acting as a fail-safe for any missed events.

State Size Reduction

  • Shorter Deduplication TTL: By reducing the deduplication window from 24 hours to just 2 hours, the system drastically cut down on state retention needs.
  • Late Event Handling: Late events are now managed by the reconciliation job, freeing the main pipeline from excessive state storage.

Migration to Flink SQL

  • Upgrade to Flink 1.17: Moving from Java to Flink SQL unlocked performance improvements and simplified code maintenance.
  • Efficient State Management: Flink SQL handles state more efficiently, reducing complexity and risk.

Technical Architecture: Two-Pronged Approach

Real-Time Attribution Flow

  • User Interactions: Events from the app are streamed via Kafka.
  • Flink SQL Processing: The new job aggregates and deduplicates data in real time, sending incremental counts to Redis for billing.

Recon ETL Flow

  • Raw Event Fetching: The recon job pulls data from the Data Lake on S3.
  • Periodic Updates: At fixed intervals, it processes late or missed events and updates Redis, ensuring campaign data is always accurate.

Key Implementation Concepts

  • TUMBLE Windows: Fixed 60-second intervals group and process events, ensuring timely aggregation.
  • User-Defined Functions (UDFs): Custom logic for deduplication, leveraging efficient caching.
  • Watermarking: Handles late events by allowing a buffer period, ensuring accuracy without sacrificing performance.

Seamless Migration and Validation

The transition was carefully managed through phased deployment:

  • Shadow Mode: The new system ran alongside the old, allowing for extensive validation and troubleshooting.
  • Gradual Cutover: After thorough testing, the new feedback loop was fully adopted, ensuring a smooth, disruption-free migration.

Post-Migration Outcomes

  • 99% State Size Reduction: State shrank from 150 GB to just 500 MB, dramatically improving system stability.
  • $3,000 Monthly Savings: Infrastructure costs dropped significantly due to optimized resource usage.
  • Zero Downtime: The new system has operated flawlessly since launch.
  • Enhanced Flexibility: Experimenting with new business logic is now easier and faster, supporting rapid innovation.

Conclusion: Setting a New Standard in Real-Time Data Streaming

Zomato’s overhaul of its real-time ads feedback system demonstrates the transformative power of modern data engineering. By embracing Flink SQL, automated reconciliation, and smarter state management, the team not only solved critical bottlenecks but also created a foundation for future innovation. This journey stands as a blueprint for any organization seeking to optimize real-time data pipelines for performance, reliability, and cost efficiency.

Read more such articles from our Newsletter here.

Leave a Comment

Your email address will not be published. Required fields are marked *

You may also like

Software development team collaborating on AI adoption strategies

Empowering Developer Teams: A Strategic Guide to AI Adoption

Artificial intelligence is rapidly transforming the landscape of software development, but successful adoption within engineering teams requires more than just access to the latest tools. Organizations often encounter skepticism among

Categories
Scroll to Top