Slack is widely recognized as a messaging app, but beneath the surface, it operates as a sophisticated, real-time collaboration platform. With millions of users engaging simultaneously and thousands of messages transmitted every second, Slack’s system architecture is engineered to deliver seamless interactivity and reliability at a global scale.
Origins and Early Design
Slack’s technical foundation has roots in an unexpected place: the development team originally built a browser-based MMORPG called Glitch. Although Glitch did not achieve commercial success, the internal communication tool created for its development became the basis for Slack. This legacy influenced two critical architectural principles:
- Separation of Concerns: Slack divided responsibilities early on, with a dedicated channel server for real-time messaging and a web application for business logic, storage, and authentication.
- Push-First Mentality: Instead of relying on traditional request-response models, Slack adopted a push-based approach using WebSockets, making real-time updates a core feature rather than an optimization.
Initial Architecture
Web Application (Hacklang)
- Managed authentication, permissions, storage, API endpoints, and session management.
- Written in Hacklang, a typed PHP dialect, enabling rapid iteration and gradual introduction of type safety.
Channel Server (Java)
- Handled WebSocket connections, real-time message broadcasting, typing indicators, and message ordering.
- Ensured messages were delivered instantly to all connected clients.
This division allowed Slack to move quickly in its early days, but as usage grew, challenges emerged. The monolithic web application became harder to test and deploy, while the channel server’s stateful design complicated scaling and recovery.
Persistent Messaging as a Core Principle
Unlike traditional chat systems such as IRC, Slack treats every message as important and persistent. This means messages are not only delivered in real time but are also recorded, indexed, and retrievable at any time. The system guarantees:
- Messages do not disappear unexpectedly.
- Message order remains consistent across all clients.
- All users see the same conversation history.
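To achieve this, Slack implements a practical version of atomic broadcast, which aims for three properties: validity (a message that is sent is eventually delivered), integrity (each message is delivered at most once, and only if it was actually sent), and total order (every client sees messages in the same sequence). Guaranteed consensus is provably impossible in a fully asynchronous distributed system where even one process can fail, so Slack’s architecture is designed to recover gracefully from inconsistencies and failures rather than to rule them out.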
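One common way to approximate total order is to have a single sequencer per channel stamp each message with a gapless sequence number, and to have clients apply messages strictly in that order. The sketch below illustrates the idea only; `ChannelSequencer` and `ClientView` are hypothetical names, and nothing here is taken from Slack's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ChannelSequencer:
    """Hypothetical per-channel sequencer: stamps messages with gapless numbers."""
    next_seq: int = 1
    log: list = field(default_factory=list)

    def append(self, sender: str, text: str) -> dict:
        msg = {"seq": self.next_seq, "sender": sender, "text": text}
        self.log.append(msg)  # the log doubles as the retrievable history
        self.next_seq += 1
        return msg


@dataclass
class ClientView:
    """Hypothetical client that applies messages strictly in sequence order."""
    applied: list = field(default_factory=list)
    pending: dict = field(default_factory=dict)
    expected_seq: int = 1

    def receive(self, msg: dict) -> None:
        if msg["seq"] < self.expected_seq:
            return  # duplicate delivery: already applied (integrity)
        self.pending[msg["seq"]] = msg
        # Apply every buffered message that is next in line (total order).
        while self.expected_seq in self.pending:
            self.applied.append(self.pending.pop(self.expected_seq))
            self.expected_seq += 1


sequencer = ChannelSequencer()
m1 = sequencer.append("ana", "deploy at 3?")
m2 = sequencer.append("raj", "works for me")
m3 = sequencer.append("ana", "ok, scheduling")

in_order, shuffled = ClientView(), ClientView()
for m in (m1, m2, m3):
    in_order.receive(m)
for m in (m3, m1, m1, m2):  # out of order, with one duplicate
    shuffled.receive(m)

# Both clients end up with the same conversation history.
same_history = in_order.applied == shuffled.applied
```

A client that detects a gap it cannot fill from buffered messages would, in a fuller design, refetch the missing range from the persisted log.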
Evolution of the Message Send Flow
Original Flow
- Messages were sent directly from the client to the channel server, which broadcast them and acknowledged receipt before persisting them.
- This provided low latency but introduced risks: if the server crashed before persistence, messages could be lost despite being displayed as sent.
Modern Flow
- The client now sends messages to the web app via HTTP POST.
- The web app logs the message for persistence and indexing before invoking the channel server for real-time broadcast.
- This approach improves crash safety and ensures that messages are either stored or the sender receives a clear failure notification.
- Because persistence now happens in the web app, channel servers hold less state and are easier to scale and maintain. Mobile clients also benefit, since they no longer need a persistent WebSocket connection to send messages.
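The persist-then-broadcast ordering of the modern flow can be sketched as follows. This is a minimal illustration under stated assumptions: `MessageStore`, `ChannelServer`, and `handle_post_message` are hypothetical stand-ins, not Slack's real components or API.

```python
class StorageError(Exception):
    """Raised when a message could not be written durably."""


class MessageStore:
    """Stand-in for the durable store; a real system would write to a database."""

    def __init__(self, healthy: bool = True):
        self.healthy = healthy
        self.rows = []

    def persist(self, channel: str, text: str) -> dict:
        if not self.healthy:
            raise StorageError("write failed")
        row = {"id": len(self.rows) + 1, "channel": channel, "text": text}
        self.rows.append(row)
        return row


class ChannelServer:
    """Stand-in for the real-time fan-out tier; it keeps no durable state."""

    def __init__(self):
        self.delivered = []

    def broadcast(self, row: dict) -> None:
        self.delivered.append(row)


def handle_post_message(store, channel_server, channel, text):
    """Hypothetical web-app handler for the HTTP POST that sends a message."""
    try:
        row = store.persist(channel, text)       # 1. persist first
    except StorageError as exc:
        return {"ok": False, "error": str(exc)}  # sender sees a clear failure
    channel_server.broadcast(row)                # 2. only then fan out
    return {"ok": True, "id": row["id"]}


server = ChannelServer()
sent = handle_post_message(MessageStore(), server, "#general", "hello")
failed = handle_post_message(MessageStore(healthy=False), server, "#general", "lost?")
```

Because the broadcast happens only after a successful write, a message is never shown to other users unless it is also in the store, which is the crash-safety property the original flow lacked.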
Session Initialization and Flannel
For small teams, starting a Slack session is straightforward. However, at enterprise scale, initializing sessions became a bottleneck due to the massive payloads required for large organizations. Originally, the system would assemble a complete snapshot of team data for every session start, leading to latency and potential failures.
To address this, Slack introduced Flannel, a geo-distributed microservice that:
- Maintains a pre-warmed, in-memory cache of team metadata.
- Listens to real-time events to keep the cache updated.
- Serves session data locally from regional replicas, reducing latency and backend load.
This shift transformed session startup from a compute-heavy operation to a cache-driven process, improving reliability and scalability.
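The cache-driven startup described above can be sketched as an in-memory cache that is loaded once and then kept current by events. The class, the event shapes (`user_joined`, `user_left`), and the `load_snapshot` callback are all assumptions made for illustration; Flannel's real interfaces are not described in this article.

```python
class TeamMetadataCache:
    """Hypothetical edge cache in the spirit of Flannel; all names are illustrative."""

    def __init__(self, load_snapshot):
        self._load_snapshot = load_snapshot  # expensive call to the backend
        self._teams = {}
        self.backend_loads = 0               # counts how often we hit the backend

    def prewarm(self, team_id: str) -> None:
        self._teams[team_id] = self._load_snapshot(team_id)
        self.backend_loads += 1

    def apply_event(self, team_id: str, event: dict) -> None:
        team = self._teams.get(team_id)
        if team is None:
            return  # team is not cached in this region; nothing to update
        if event["type"] == "user_joined":
            team["users"][event["user_id"]] = event["name"]
        elif event["type"] == "user_left":
            team["users"].pop(event["user_id"], None)

    def start_session(self, team_id: str) -> dict:
        if team_id not in self._teams:
            self.prewarm(team_id)  # cold path: fall back to the backend once
        team = self._teams[team_id]
        return {"team_id": team_id, "users": dict(team["users"])}


def load_snapshot(team_id: str) -> dict:
    # Assumed backend call that assembles the full team snapshot.
    return {"users": {"U1": "ana"}}


cache = TeamMetadataCache(load_snapshot)
cache.prewarm("T1")
cache.apply_event("T1", {"type": "user_joined", "user_id": "U2", "name": "raj"})
session_a = cache.start_session("T1")
session_b = cache.start_session("T1")
```

Both sessions are served from memory, so the backend snapshot is assembled once rather than once per session start.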
Scaling Challenges and Trade-Offs
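Operating at Slack’s scale means designing for both high throughput and resilience to failure. One visible symptom at scale is message duplication, which typically occurs when a client retries a send because the acknowledgment was delayed or lost in the network. Slack uses idempotency keys to identify and suppress duplicate messages: the client attaches the same key to every retry of a given send, so the server can recognize the repeat and store the message only once.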
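A minimal sketch of idempotency-key deduplication follows, assuming the client generates one key per user action and reuses it on every retry. `MessageLog` and its `send` method are hypothetical; a production system would also expire old keys rather than keep them forever.

```python
import uuid


class MessageLog:
    """Hypothetical server-side log that suppresses retried sends."""

    def __init__(self):
        self.messages = []
        self._seen = {}  # idempotency key -> message already stored for it

    def send(self, idempotency_key: str, channel: str, text: str) -> dict:
        if idempotency_key in self._seen:
            # Retry of a send we already accepted: return the original result.
            return self._seen[idempotency_key]
        msg = {"id": len(self.messages) + 1, "channel": channel, "text": text}
        self.messages.append(msg)
        self._seen[idempotency_key] = msg
        return msg


message_log = MessageLog()
key = str(uuid.uuid4())  # generated once per user action, reused on retry
original = message_log.send(key, "#general", "shipping now")
retried = message_log.send(key, "#general", "shipping now")  # e.g. ack was lost
```

The retry returns the same stored message instead of creating a second one, so the sender gets a consistent answer and the channel shows a single copy.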
On the backend, Slack leverages:
- Kafka for durable message queuing, acting as the system’s ledger.
- Redis for fast, in-flight job data, supporting rapid processing.
This separation balances durability and speed, enabling intelligent retry logic and concurrency control to prevent message loss or disorder.
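The split between a durable ledger and a fast in-flight store can be illustrated with in-memory stand-ins. The classes below are not Kafka or Redis clients and do not mirror either library's API; they only model the roles described above, and the retry policy (`max_attempts`) is an assumption.

```python
class DurableLog:
    """Stand-in for the Kafka role: an append-only list of jobs addressed by offset."""

    def __init__(self):
        self.entries = []

    def append(self, job: str) -> int:
        self.entries.append(job)
        return len(self.entries) - 1


class InFlightStore:
    """Stand-in for the Redis role: fast, volatile bookkeeping for jobs in progress."""

    def __init__(self):
        self.attempts = {}

    def claim(self, offset: int) -> int:
        self.attempts[offset] = self.attempts.get(offset, 0) + 1
        return self.attempts[offset]

    def clear(self, offset: int) -> None:
        self.attempts.pop(offset, None)


def run_worker(log, in_flight, handler, max_attempts=3):
    """Process jobs in log order, retrying transient failures a bounded number of times."""
    done, dead = [], []
    for offset, job in enumerate(log.entries):
        while True:
            attempt = in_flight.claim(offset)
            try:
                done.append(handler(job))
                in_flight.clear(offset)
                break
            except RuntimeError:
                if attempt >= max_attempts:
                    dead.append(offset)  # give up; the job is still in the log
                    in_flight.clear(offset)
                    break
    return done, dead


calls = {"n": 0}


def flaky_handler(job: str) -> str:
    calls["n"] += 1
    if job == "index:42" and calls["n"] < 3:
        raise RuntimeError("transient failure")
    return job.upper()


job_log = DurableLog()
job_log.append("notify:7")
job_log.append("index:42")
done, dead = run_worker(job_log, InFlightStore(), flaky_handler)
```

Losing the in-flight store costs only attempt counters; the jobs themselves remain in the log and can be replayed, which is the durability and speed trade-off the two systems divide between them.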
Conclusion
Slack’s architecture is intentionally complex in areas where precision and reliability are paramount, such as real-time messaging and session consistency. By pushing complexity to the system’s edges and keeping the core streamlined, Slack delivers a robust, scalable platform capable of supporting billions of daily messages. The system continues to evolve, adapting to new scaling challenges and user needs, with each architectural decision reflecting a careful balance between performance, reliability, and simplicity.