How Notion Scaled PostgreSQL with Database Sharding: A Comprehensive Case Study

As user bases grow, scaling databases becomes a critical challenge for many organizations. Notion, a popular productivity platform, faced this issue in mid-2020 when its monolithic PostgreSQL database struggled to handle increasing loads. This case study explores how Notion implemented database sharding to overcome these challenges, detailing their strategy, migration process, and key lessons learned.

The Database Scaling Challenge

Notion serves as an all-in-one workspace for users to take notes, manage calendars, build spreadsheets, and more. For five years, its monolithic PostgreSQL database supported rapid growth across several orders of magnitude. However, as the user base expanded, the database began showing signs of strain:

  • Frequent CPU spikes
  • Increased on-call requests
  • PostgreSQL’s VACUUM process failing to reclaim disk space from dead rows

These issues impacted performance and signaled that the existing architecture was no longer sustainable. Notion’s engineering team recognized the need for a scalable solution to address immediate problems while preparing for future growth. After evaluating options, they decided to implement database sharding—a horizontal scaling technique that distributes data across multiple database instances.

Sharding Strategy and Design Decisions

Choosing What to Shard

Notion’s data model revolves around blocks—individual items like paragraphs or images on a page. Blocks can contain other blocks, creating a hierarchical structure. The Block table was an obvious choice for sharding due to its central role in the data model. However, since the Block table was connected to other tables, sharding it alone would result in complex cross-shard queries.

To avoid these inefficiencies, the team decided to shard all tables connected to the Block table. This ensured that related data remained together within the same shard, eliminating the need for cross-shard queries.

Selecting a Sharding Key

Notion users typically work within a single workspace, with each block belonging to a specific workspace. The team chose Workspace ID as the sharding key to group all blocks and related data for a workspace into the same shard. This approach minimized data fetching from multiple shards during operations.

Determining the Number of Shards

The team settled on 480 logical shards distributed across 32 physical databases, with each physical database hosting 15 logical shards. The number 480 was chosen because it is divisible by many numbers, providing flexibility for scaling up or down while keeping shards evenly distributed. For example, they could scale from 32 to 40 or 48 physical hosts incrementally.
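The routing this arrangement implies can be sketched in a few lines. The shard counts (480 logical, 32 physical, 15 per host) come from the article; the modulo mapping and function names below are illustrative assumptions, not Notion's actual implementation:

```python
from uuid import UUID

LOGICAL_SHARDS = 480   # divisible by 2, 3, 4, 5, 6, 8, 10, 12, 15, 16, ...
PHYSICAL_DBS = 32
SHARDS_PER_DB = LOGICAL_SHARDS // PHYSICAL_DBS  # 15 logical shards per host

def logical_shard(workspace_id: UUID) -> int:
    """Map a workspace UUID to one of the 480 logical shards."""
    return workspace_id.int % LOGICAL_SHARDS

def physical_db(shard: int) -> int:
    """Pack logical shards contiguously onto physical hosts."""
    return shard // SHARDS_PER_DB
```

Because 480 has so many divisors, the same logical shards can later be redistributed evenly over 40, 48, or 60 hosts by changing only `PHYSICAL_DBS`, without re-hashing any workspace.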

Implementation and Migration Process

Double-Write Phase

To migrate data without disrupting users, Notion adopted a double-write strategy. New data was written simultaneously to both the old monolithic database and the new sharded setup. Instead of writing directly to both databases—which could lead to inconsistencies—they used an audit log system. This log recorded all writes in the old database and applied them separately to the new sharded databases, ensuring consistency even if issues arose during migration.
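The audit-log flavor of double-writing can be sketched as follows. This is a minimal illustration using SQLite in place of PostgreSQL; the table names, schema, and function signatures are assumptions for the example, not Notion's actual system:

```python
import json
import sqlite3

def write_block(old_db: sqlite3.Connection, block_id: str, payload: dict) -> None:
    """Apply the write to the monolith and append it to an audit log
    in the same transaction, so no write can be silently lost."""
    with old_db:  # one transaction covers both statements
        old_db.execute(
            "INSERT OR REPLACE INTO blocks (id, data) VALUES (?, ?)",
            (block_id, json.dumps(payload)),
        )
        old_db.execute(
            "INSERT INTO audit_log (block_id, payload, applied) VALUES (?, ?, 0)",
            (block_id, json.dumps(payload)),
        )

def replay_audit_log(old_db: sqlite3.Connection, shard_writer) -> int:
    """Catch-up pass: apply unapplied log entries to the sharded side
    in commit order, then mark each one applied."""
    rows = old_db.execute(
        "SELECT rowid, block_id, payload FROM audit_log "
        "WHERE applied = 0 ORDER BY rowid"
    ).fetchall()
    for rowid, block_id, payload in rows:
        shard_writer(block_id, json.loads(payload))
        with old_db:
            old_db.execute(
                "UPDATE audit_log SET applied = 1 WHERE rowid = ?", (rowid,)
            )
    return len(rows)
```

The key property is that the log lives in the old database's own transaction, so the sharded copies can lag or fail and still be brought back in sync by replaying unapplied entries.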

Data Migration

Moving existing data required significant computational resources. The team developed a script to transfer data from the old database to the new shards using a machine with 96 CPUs. Despite this powerful setup, it took three days to complete the migration due to the sheer volume of data.
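A backfill of this shape, keyed per workspace and fanned out across many workers, might look like the sketch below. The function names, the fetch/write callbacks, and the use of a thread pool are illustrative assumptions; the article only tells us the job was parallelized on a 96-CPU machine:

```python
from concurrent.futures import ThreadPoolExecutor

def copy_workspace(workspace_id, fetch, write) -> int:
    """Copy all rows belonging to one workspace from the monolith
    to its destination shard; returns the number of rows moved."""
    rows = fetch(workspace_id)
    write(workspace_id, rows)
    return len(rows)

def backfill(workspace_ids, fetch, write, workers: int = 96) -> int:
    """Fan the per-workspace copies out across a pool of workers.
    Because the sharding key is the workspace ID, each unit of work
    touches exactly one destination shard."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda ws: copy_workspace(ws, fetch, write),
                            workspace_ids))
```

Partitioning the work by workspace is what makes the parallelism safe here: two workers never write to the same shard row, so no cross-worker coordination is needed.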

Verification

To ensure accuracy after migration, Notion implemented two verification methods:

  1. Verification script: records with randomly sampled UUIDs were fetched from both the old and new databases and compared.
  2. Dark reads: for certain requests, data was fetched from both databases and compared off the critical path, without affecting user-facing responses.

These measures allowed the team to identify discrepancies before fully transitioning to the new setup.
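Both checks reduce to fetching the same record from each side and comparing. A minimal sketch, with all function names and signatures assumed for illustration:

```python
import random

def verify_sample(all_ids, fetch_old, fetch_new, sample_size: int = 1000):
    """Offline check: compare randomly sampled records between the
    old and new databases; returns the IDs that disagree."""
    sample = random.sample(all_ids, min(sample_size, len(all_ids)))
    return [i for i in sample if fetch_old(i) != fetch_new(i)]

def dark_read(record_id, fetch_old, fetch_new, report):
    """Online check: serve the user from the old database, but also
    read the shard and flag any mismatch without affecting the response."""
    result = fetch_old(record_id)
    if fetch_new(record_id) != result:
        report(record_id)
    return result
```

The sampling script gives broad coverage cheaply, while dark reads exercise the exact query paths real traffic uses, which catches bugs that row-level comparison alone can miss.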

Switching Over

The final switchover involved only a few minutes of downtime for users while transitioning entirely to the sharded setup. For most users, this appeared as a brief service interruption despite significant changes occurring behind the scenes.

Lessons Learned

Start Early

One major takeaway was that migrations should begin before performance issues become critical. Waiting until the database struggled added complexity and stress during implementation.

Aim for Zero Downtime

Although downtime during the switchover was minimal, the team concluded that zero downtime would have been achievable with further refinement of the catch-up scripts.

Choose Sharding Keys Wisely

Using Workspace ID as a sharding key proved effective in grouping related data together and avoiding cross-shard queries. However, combining primary and partition keys into one column could have simplified application code.

Outcome

The sharding process resulted in noticeable performance improvements for Notion’s users:

  • Faster response times
  • Improved handling of high loads
  • Enhanced scalability for future growth

Despite challenges during implementation, Notion’s engineering team successfully transitioned their monolithic PostgreSQL database into a scalable architecture capable of supporting their expanding user base.

Conclusion

Notion’s approach to sharding its PostgreSQL database offers valuable insights into tackling scalability challenges in modern applications. By carefully planning their strategy—choosing what to shard, selecting appropriate keys, and ensuring smooth migration—they achieved significant performance gains while minimizing disruption for users.

For organizations facing similar challenges, this case study highlights the importance of proactive planning and thoughtful execution when scaling databases horizontally.
