Inside Uber’s MySQL Infrastructure: Scaling for Global Operations

Uber’s data infrastructure relies heavily on its MySQL fleet, which comprises over 2,300 independent clusters. At this scale, the main challenge is operating the control plane while ensuring zero downtime and preserving data integrity. In recent years, Uber has raised MySQL fleet availability from 99.9% to 99.99% through a series of optimizations and a complete overhaul of the control plane architecture.
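To put the availability figures in perspective, the short calculation below converts an availability percentage into a yearly downtime budget. It is plain arithmetic rather than anything taken from Uber's tooling, and the function name is made up for illustration.

```python
def downtime_budget_minutes(availability_pct: float, days: float = 365.0) -> float:
    """Minutes of downtime allowed over `days` at the given availability."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * days * 24 * 60


if __name__ == "__main__":
    for pct in (99.9, 99.99):
        print(f"{pct}% availability -> {downtime_budget_minutes(pct):.1f} min/year")
```

Moving from 99.9% to 99.99% shrinks the yearly budget from about 525.6 minutes (roughly 8.8 hours) to about 52.6 minutes.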

Architecture Overview

Uber’s MySQL fleet consists of multiple clusters, each containing numerous nodes. The system is divided into two primary flows:

  1. Data Flow: Facilitates interactions between clients/services and MySQL clusters.
  2. Control Flow: Manages the lifecycle of clusters, including provisioning, maintenance, and decommissioning.

The MySQL infrastructure at Uber comprises several key components:

  • Control plane
  • Data plane
  • Discovery plane
  • Observability
  • Change data capture and data warehouse ingestion
  • Backup and restore systems

The Control Plane

At the heart of Uber’s MySQL infrastructure lies the control plane, a state-based system consisting of multiple components and services. The technology manager orchestrates these components and is responsible for publishing the goal state of the cluster to Odin, Uber’s in-house management platform for stateful technologies.
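A state-based system of this kind is often built around comparing a published goal state with the observed state and deriving the actions that close the gap. The sketch below illustrates that idea only. `NodeSpec` and `plan_actions` are hypothetical names, and nothing here reflects Odin's actual interface.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical description of one MySQL node in a cluster."""
    host: str
    role: str  # "primary" or "replica"


def plan_actions(goal: set[NodeSpec], actual: set[NodeSpec]) -> list[str]:
    """Return the actions needed to move the actual state toward the goal state."""
    missing = sorted(goal - actual, key=lambda n: n.host)
    extra = sorted(actual - goal, key=lambda n: n.host)
    actions = [f"add {n.role} on {n.host}" for n in missing]
    actions += [f"remove {n.role} on {n.host}" for n in extra]
    return actions
```

When the goal and actual states match, the plan is empty, which is what makes the approach safe to run repeatedly.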

Key Functions of the Technology Manager

  • Cluster management
  • Primary failover operations
  • Node lifecycle management
  • Balanced placement across geographical locations
  • Database-specific operations

The Controller

A recent addition to the control plane, the controller acts as an external observer for all MySQL clusters. It monitors signals from the database and other control plane components, evaluating rules and taking action when violations occur.
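The rule-evaluation loop can be pictured roughly as follows. This is a minimal sketch under stated assumptions: the signal names, the 300-second lag threshold, and the action names are all invented for illustration and are not taken from Uber's controller.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    """Hypothetical rule: a predicate over signals plus the action to take."""
    name: str
    violated: Callable[[dict], bool]
    action: str


# Assumed example rules; real rules and thresholds are not given in the article.
RULES = [
    Rule("primary_unreachable",
         lambda s: not s.get("primary_healthy", True),
         "trigger_emergency_failover"),
    Rule("replica_lag_high",
         lambda s: s.get("replica_lag_seconds", 0) > 300,
         "raise_alert"),
]


def evaluate(signals: dict, rules: list = RULES) -> list:
    """Return the actions for every rule whose condition is violated."""
    return [rule.action for rule in rules if rule.violated(signals)]
```

Keeping rules as data rather than hard-coded branches makes it straightforward to add or tune checks without touching the evaluation loop.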

Critical Workflows

The control plane uses workflows powered by Cadence, Uber's open-source workflow orchestration engine, to run complex, long-running tasks. These workflows are asynchronous and event-driven, which gives them durability, fault tolerance, and scalability.

Primary Failover

Uber uses a single-primary, multiple-replica topology: the primary node handles all writes, which are propagated to the replicas through standard MySQL binlog replication. A primary failover is either graceful or emergency, depending on the health of the existing primary node.
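The choice between the two failover modes can be sketched as a small decision function. The assumptions here are that a healthy primary leads to a graceful failover, an unhealthy one to an emergency failover, and that the replica with the lowest replication lag is promoted. The article does not describe Uber's candidate selection, so that last part is purely illustrative, as are all names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical view of a replica as seen by the control plane."""
    host: str
    lag_seconds: float


def plan_failover(primary_healthy: bool, replicas: list[Replica]) -> tuple[str, str]:
    """Pick the failover mode and the replica to promote."""
    if not replicas:
        raise ValueError("no replica available to promote")
    mode = "graceful" if primary_healthy else "emergency"
    # Assumption: prefer the replica that is least behind the primary.
    candidate = min(replicas, key=lambda r: r.lag_seconds)
    return mode, candidate.host
```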

Node Replacement

This critical process involves moving a MySQL node and all its data from one host to another without affecting database users. It ensures the MySQL fleet remains agile and protected from infrastructure disruptions.

Schema Changes

The control plane automates schema changes through a self-serve workflow that uses either MySQL’s instant ALTER or Percona’s pt-online-schema-change (pt-osc) to apply updates safely and without blocking. The workflow is integrated with Uber’s CI/CD pipelines, giving developers full control over their schema while keeping it aligned with deployed code.
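As a rough illustration of the two paths, the helper below builds either an `ALTER TABLE ... ALGORITHM=INSTANT` statement or a `pt-online-schema-change` invocation. Whether a given change qualifies for the instant algorithm depends on the MySQL version and the operation, so the sketch takes that decision as an input (`instant_ok`). The function name and the way the commands are assembled are assumptions, not Uber's workflow.

```python
from __future__ import annotations


def build_schema_change(db: str, table: str, alter_clause: str,
                        instant_ok: bool) -> list[str]:
    """Return the command (as an argv list) that would apply the change."""
    if instant_ok:
        # MySQL rejects the statement if the change cannot be done instantly.
        sql = f"ALTER TABLE `{db}`.`{table}` {alter_clause}, ALGORITHM=INSTANT"
        return ["mysql", "-e", sql]
    # Fall back to an online copy-and-swap with Percona's pt-online-schema-change.
    return ["pt-online-schema-change", "--alter", alter_clause,
            f"D={db},t={table}", "--execute"]
```

For example, `build_schema_change("trips", "riders", "ADD COLUMN note VARCHAR(64)", instant_ok=False)` yields a pt-online-schema-change command line that could be passed to `subprocess.run`. The database and table names are invented.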

Data Plane

A MySQL node at Uber consists of several containers running on a single host:

  • Database container
  • Worker container
  • Metrics container
  • Health prober
  • Backup container

Discovery Plane

The discovery plane simplifies client interaction with MySQL clusters by providing an abstraction over Uber’s hardware infrastructure. It consists of three major components:

  1. Reverse proxy
  2. Pooling service
  3. Standard client

Observability and Monitoring

Uber’s MySQL infrastructure has a robust observability ecosystem that collects system metrics, logs, and health data from containers and clusters. Alerts are configured to detect a range of failures so that the MySQL database-as-a-service team can keep the fleet healthy.

Change Data Capture and Backup

For change data capture, Uber uses Storagetapper to read changes from the MySQL binlog and stream them to Apache Kafka. Backup and restore are fully automated and built on Percona XtraBackup, maintaining a recovery point objective (RPO) of 4 hours and a recovery time objective (RTO) ranging from a few minutes to hours, depending on data size.
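One simple way to reason about the 4-hour RPO is to check whether the most recent successful backup is still recent enough. The check below is a hypothetical monitoring sketch. The function name and the idea of alerting on backup age are assumptions rather than a description of Uber's tooling.

```python
from datetime import datetime, timedelta, timezone

RPO = timedelta(hours=4)  # recovery point objective stated in the article


def within_rpo(last_backup_at: datetime, now: datetime,
               rpo: timedelta = RPO) -> bool:
    """True if the latest backup is recent enough to satisfy the RPO."""
    return now - last_backup_at <= rpo


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(within_rpo(now - timedelta(hours=3), now))
```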

Conclusion

Uber’s MySQL infrastructure forms the backbone of many critical services across the company. The control plane provides a reliable, scalable, and high-performance MySQL platform, abstracting operational complexities and allowing Uber to serve numerous customers and use cases efficiently.
