Scaling Analytics at Canva: Processing 25 Billion Events Daily

Overview of Canva’s Event-Driven Analytics Infrastructure

Canva, the popular online design platform, handles an extraordinary volume of user activity every day—processing an estimated 25 billion events. Each user action, from opening a design to clicking a button, generates valuable data. For product teams, this data is not just about tracking numbers; it’s about gaining actionable, real-time insights into user behavior, feature performance, and engagement trends.

Rather than relying on static dashboards, Canva treats analytics as a core part of its infrastructure. Every event is captured, structured, and routed to the right destination, ensuring that product decisions are informed by reliable, up-to-the-minute data.

The Three Pillars of Canva’s Event Pipeline

Canva’s analytics pipeline is built on three foundational stages: structuring, collecting, and distributing events. Each stage is engineered for scalability, reliability, and long-term maintainability.

Structuring Events: Schema Governance for Future-Proof Data

In many organizations, analytics pipelines start with quick implementations, leading to inconsistent event types and undocumented formats. This can create confusion when metrics suddenly change and the reasons are unclear.

Canva avoids these pitfalls by enforcing strict schema governance from the outset. Every event—whether a page view, template click, or any other user action—must conform to a well-defined Protobuf schema. These schemas act as long-term contracts, ensuring that data remains consistent and interpretable over time.

The schemas are designed to be both forward- and backward-compatible. New consumers must handle events created by older clients, and older consumers must process events from newer clients. Breaking changes, such as removing required fields or altering data types, are strictly prohibited. If a fundamental change is needed, engineers introduce an entirely new schema version, preserving access to historical data and keeping analytics queries reliable.
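The compatibility rules above can be sketched as a simple check. This is an illustrative model only: schemas are represented as plain field-to-type dicts, whereas the real pipeline works with generated Protobuf descriptors.

```python
def is_compatible(old_schema: dict, new_schema: dict) -> bool:
    """A new schema version may add fields, but must never remove an
    existing field or change its type -- either would break consumers
    that still read events written under the old schema."""
    for field, ftype in old_schema.items():
        if field not in new_schema:
            return False          # removed field: breaking change
        if new_schema[field] != ftype:
            return False          # changed type: breaking change
    return True

# Hypothetical "design opened" event, not Canva's actual schema.
design_opened_v1 = {"design_id": "string", "opened_at": "int64"}
design_opened_v2 = {"design_id": "string", "opened_at": "int64",
                    "entry_point": "string"}       # additive: allowed
design_opened_bad = {"design_id": "int64", "opened_at": "int64"}  # type change
```

A change that fails this check would instead be introduced as an entirely new schema, as described above, so historical queries keep working.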

To automate and enforce these schema rules, Canva developed Datumgen, a tool that extends standard code generation. Datumgen generates TypeScript definitions for frontend clients, Java definitions for backend services, and SQL schemas for the Snowflake data warehouse. It also powers a live Event Catalog, where team members can browse event types, fields, and routing details.

Each event schema is assigned two owners: a technical owner (typically the engineer who implemented the event logic) and a business owner (often a data scientist familiar with the product context). Every field includes clear, human-written documentation, which is displayed in both Snowflake and the Event Catalog, ensuring that everyone understands what each data point represents.

Collecting Events: Unified Client Architecture and Asynchronous Processing

Handling billions of events across multiple platforms and devices is a significant challenge. Most companies address this by building separate analytics SDKs for each platform, which can lead to maintenance headaches and inconsistencies.

Canva takes a different approach. Every frontend platform—iOS, Android, and web—uses the same TypeScript analytics client, running inside a WebView shell. A minimal native layer collects platform-specific metadata, such as device type and OS version, while everything else—event structure, queueing, and retries—is managed in a single shared codebase.

This unified architecture delivers several advantages:

  • Reduced maintenance: Engineers fix bugs and add features in one place, not three.
  • Consistent schemas: Event definitions remain uniform across platforms.
  • Unified instrumentation: Feature tracking is consistent, reducing duplication and drift.
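The shared client's core responsibilities, local queueing, batched flushes, and retries, can be sketched as follows. The `transport` callable stands in for the HTTP call to the ingestion endpoint; all names here are hypothetical, not Canva's actual API.

```python
import time
from collections import deque

class AnalyticsClient:
    """Minimal sketch of a shared analytics client: events queue
    locally, flush in batches, and retry on transient failures."""

    def __init__(self, transport, batch_size=50, max_retries=3):
        self.transport = transport      # stand-in for the HTTP send
        self.batch_size = batch_size
        self.max_retries = max_retries
        self.queue = deque()

    def track(self, event: dict) -> None:
        self.queue.append(event)
        if len(self.queue) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        batch = [self.queue.popleft() for _ in range(len(self.queue))]
        for attempt in range(self.max_retries):
            try:
                self.transport(batch)
                return
            except ConnectionError:
                time.sleep(0.05 * 2 ** attempt)   # exponential backoff
        self.queue.extendleft(reversed(batch))    # re-queue for a later flush
```

Because this logic lives in one shared codebase, a fix to the retry policy immediately applies to iOS, Android, and web alike.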

When events leave the client, they are sent to a central ingestion endpoint. Here, each event is validated against the expected schema. Invalid events—those with missing, malformed, or incorrect fields—are immediately dropped, acting as a firewall against bad data.
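The validation step can be illustrated with a minimal sketch. Here schemas are dicts mapping field names to Python types; the real endpoint validates against the generated Protobuf definitions instead.

```python
def validate(event: dict, schema: dict) -> bool:
    """Act as a firewall: reject events with missing fields or wrong
    types before they can pollute downstream data."""
    return all(
        field in event and isinstance(event[field], ftype)
        for field, ftype in schema.items()
    )

# Hypothetical page-view schema for illustration.
PAGE_VIEW = {"user_id": str, "page": str, "ts_ms": int}

incoming = [
    {"user_id": "u1", "page": "/home", "ts_ms": 1700000000000},
    {"user_id": "u2", "page": "/home"},                     # missing ts_ms: dropped
    {"user_id": 3, "page": "/home", "ts_ms": 1700000000001} # wrong type: dropped
]
accepted = [e for e in incoming if validate(e, PAGE_VIEW)]
```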

Valid events are asynchronously pushed into Amazon Kinesis Data Streams (KDS), which serves as the ingestion buffer. This decoupling ensures that the ingestion endpoint remains fast and responsive, isolating it from downstream processing delays.

The ingestion worker retrieves events from KDS and performs enrichment tasks that the client cannot handle, such as geolocation based on IP, device fingerprinting, and timestamp correction for clock drift. Once enriched, events are forwarded to a second KDS stream, ready for routing and distribution.
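One of those enrichment tasks, timestamp correction for clock drift, can be sketched like this. The approach shown (estimating skew from the gap between the client's send time and the server's receive time, ignoring network latency) is an illustrative simplification, not necessarily Canva's exact method.

```python
def correct_clock_drift(event: dict, server_recv_ms: int,
                        client_sent_ms: int) -> dict:
    """Shift the event timestamp by the estimated client clock skew.
    If the client's clock ran behind the server's, skew is positive
    and the event time moves forward accordingly."""
    skew = server_recv_ms - client_sent_ms
    corrected = dict(event)              # leave the original untouched
    corrected["ts_ms"] = event["ts_ms"] + skew
    return corrected
```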

This staging approach keeps enrichment logic separate from ingestion, preventing slow lookups or third-party calls from impacting frontend performance. It also isolates faults: if enrichment fails or lags, new events can still enter the pipeline.

Distributing Events: Decoupled Routing for Scalability and Reliability

A common failure in analytics pipelines is not losing data, but delivering it too slowly. When dashboards, personalization engines, or real-time triggers lag, it often traces back to tight coupling between ingestion and delivery.

Canva avoids this by splitting the pipeline cleanly. Once events are enriched, they flow into a decoupled router service. The router’s job is to match each event against predefined routing rules and deliver it to the appropriate downstream consumers—without letting any one consumer slow down the others.

This decoupling is crucial. If routing and delivery were tightly coupled, a slow consumer could block all others, or a schema mismatch could cause cascading retries. Decoupling allows Canva to scale efficiently, supporting both real-time and batch delivery as needed.
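Rule-based routing of this kind can be sketched as a list of predicate-to-destination pairs. The rules and destination names below are illustrative only, not Canva's actual configuration.

```python
# Each rule pairs a predicate with a destination; an event can match
# several rules and fan out to several consumers independently.
ROUTES = [
    (lambda e: True, "snowflake"),                           # warehouse everything
    (lambda e: e["type"].startswith("design."), "kinesis"),  # real-time consumers
    (lambda e: e["type"] == "billing.updated", "sqs"),       # subset-only service
]

def route(event: dict) -> list[str]:
    """Return every destination whose rule matches the event."""
    return [dest for pred, dest in ROUTES if pred(event)]
```

Because each destination is written to independently, a backlog on one (say, the SQS consumer) never delays delivery to the others.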

Canva delivers analytics events to several key destinations, each optimized for specific use cases:

  • Snowflake (via Snowpipe Streaming): Used for dashboards, metrics, and A/B test results. While latency is not critical, reliability and schema stability are paramount.
  • Kinesis: Supports real-time backend systems for personalization, recommendations, and usage tracking, thanks to its high-throughput, parallel read capabilities and replay support.
  • SQS Queues: Ideal for services that only care about a subset of event types, offering simplicity and low maintenance.

This multi-destination setup allows each consumer to choose the trade-off that best fits its needs: speed, volume, simplicity, or cost.

Canva’s platform guarantees “at-least-once” delivery. This means an event may be delivered more than once, but will never be silently dropped. Each consumer is responsible for deduplication, ensuring data integrity.
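Consumer-side deduplication can be sketched as tracking seen event IDs. A plain in-memory set is used here for illustration; a production consumer would typically use a bounded or TTL-based store.

```python
class DedupingConsumer:
    """Under at-least-once delivery, duplicates are possible; this
    consumer skips any event whose ID it has already processed."""

    def __init__(self):
        self.seen: set[str] = set()
        self.processed: list[dict] = []

    def handle(self, event: dict) -> bool:
        if event["event_id"] in self.seen:
            return False                 # duplicate: skip
        self.seen.add(event["event_id"])
        self.processed.append(event)
        return True
```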

Optimizing Infrastructure for Cost and Performance

As Canva’s event volume grew, so did infrastructure costs. The original MVP used AWS SQS and SNS, which were easy to set up and scale, but eventually accounted for 80% of operating expenses.

The team evaluated alternatives:

  • Amazon MSK (Managed Kafka): Offered a 40% cost reduction but required significant operational overhead.
  • Kinesis Data Streams (KDS): Delivered an 85% cost reduction compared to SQS/SNS, with only a modest latency increase (10–20ms). Combined with the compression work described below, the switch ultimately cut messaging costs by a factor of 20.

Kinesis charges by volume, not message count, making compression a powerful lever for cost savings. Canva batches hundreds of events at a time and compresses them using ZSTD, a fast and efficient algorithm. This approach achieves a 10x compression ratio on typical analytics data, with only a 100ms overhead per batch. The result: $600,000 in annual savings, with no impact on speed or accuracy.
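The batching-and-compression idea can be sketched with the standard library. Note the hedge: zlib stands in for ZSTD here (Python's stdlib has no ZSTD codec), and the JSON payload is illustrative, but the principle is the same, field names repeat across every event in a batch, so batches compress far better than individual messages.

```python
import json
import zlib

def compress_batch(events: list[dict]) -> bytes:
    """Serialize a batch of events into one payload and compress it;
    one compressed blob per batch is what gets written to the stream."""
    payload = json.dumps(events).encode("utf-8")
    return zlib.compress(payload, level=6)

# 500 structurally similar events -- highly compressible.
batch = [{"type": "design.opened", "user_id": f"u{i}", "ts_ms": 1700000000000 + i}
         for i in range(500)]
raw = json.dumps(batch).encode("utf-8")
packed = compress_batch(batch)
ratio = len(raw) / len(packed)
```

Since Kinesis bills on bytes written rather than message count, every point of compression ratio translates directly into cost savings.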

Kinesis is not without its challenges. While average latency is around 7ms, tail latency can spike above 500ms when shards approach their write limits. To protect frontend response times, Canva implemented a fallback to SQS whenever KDS is slow or throttled. This keeps p99 response times under 20ms and costs less than $100/month to maintain. The SQS fallback also serves as a disaster recovery mechanism, ensuring the system remains resilient even during KDS outages.
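The fallback logic can be sketched as a simple try-then-divert wrapper. `kds_put` and `sqs_send` are stand-ins for the respective AWS SDK calls; the exception types and timeout value are illustrative.

```python
def send_with_fallback(record: dict, kds_put, sqs_send,
                       timeout_ms: int = 20) -> str:
    """Try Kinesis first; if the shard is slow or throttled, divert
    the record to SQS so frontend response times stay bounded.
    Returns which path the record took."""
    try:
        kds_put(record, timeout_ms=timeout_ms)
        return "kds"
    except (TimeoutError, ConnectionError):
        sqs_send(record)          # fallback path, also covers KDS outages
        return "sqs"
```

Records landing on the SQS path are later replayed into the main pipeline, which is what lets the same mechanism double as disaster recovery.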

Key Takeaways for Scalable Analytics

Canva’s event collection pipeline demonstrates the value of fundamentals: strict schemas, decoupled services, typed clients, smart batching, and resilient infrastructure. The architecture is not overly complex or experimental—it is designed for reliability, cost efficiency, and long-term clarity.

For any organization looking to scale their analytics, the lesson is clear: focus on building for reliability, cost, and maintainability. This approach transforms billions of events into actionable insights, empowering product teams to make informed, data-driven decisions.
