Vitess: Powering Big MNCs to Billion-Dollar Savings Through Scalability and Efficiency
This post dives deep into Vitess, an open-source framework built at YouTube to solve common database problems that companies face as they scale out.
Prerequisites
Understanding of basic system design terminology such as sharding.
Basic understanding of databases.
Table of Contents:
Introduction
Features of Vitess
Architecture
Use Cases
Introduction
Vitess is a database solution for deploying, scaling and managing large clusters of open-source database instances. It currently supports MySQL and Percona Server for MySQL. It's architected to run as effectively in a public or private cloud architecture as it does on dedicated hardware. It combines and extends many important SQL features with the scalability of a NoSQL database. Vitess can help you with the following problems:
Scaling a SQL database by allowing you to shard it, while keeping application changes to a minimum.
Migrating from bare-metal or VMs to a private or public cloud.
Deploying and managing a large number of SQL database instances.
What the heck is Vitess, and why can using it be a boon to companies?
Let's understand this with an example.
Birth of Vitess
Keshav started a platform where users can upload videos that anyone on the platform can watch. He only wanted to test out his idea, so he kept things simple and stored all the data in a single MySQL database.
As users adopted the platform, the number of transactions per second (TPS) grew. Keshav's team first scaled the database vertically, but hardware limits eventually made vertical scaling the bottleneck. The team then noticed that read requests outnumbered write requests.
So they moved to a leader-follower (or master-slave) architecture. In simpler terms, the database is replicated across multiple nodes (servers). The leader node is the only one in the cluster that serves write queries, and the follower nodes serve read queries. This strategy solved the problem for the time being.
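The following minimal Python sketch makes the leader-follower idea concrete by showing the routing rule: writes go to the leader, and reads rotate across followers. The `ReplicatedCluster` class and the host names are hypothetical. Classifying a query by checking whether it starts with SELECT is a deliberate simplification, because a real router parses the SQL.

```python
import itertools


class ReplicatedCluster:
    """Hypothetical router: writes go to the leader, reads rotate across followers."""

    def __init__(self, leader: str, followers: list[str]) -> None:
        self.leader = leader
        self.followers = followers
        # Round-robin iterator so reads are spread evenly across followers.
        self._next_follower = itertools.cycle(followers)

    def route(self, sql: str) -> str:
        # Naive classification for illustration only: treat SELECT as a read.
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self.followers:
            return next(self._next_follower)
        # Writes (and reads when there are no followers) go to the leader.
        return self.leader


cluster = ReplicatedCluster("db-leader", ["db-follower-1", "db-follower-2"])
print(cluster.route("INSERT INTO videos (title) VALUES ('cat')"))  # db-leader
print(cluster.route("SELECT * FROM videos"))  # db-follower-1
print(cluster.route("select * from videos"))  # db-follower-2
```

This sketch ignores one trade-off. MySQL replication is asynchronous by default, so a read routed to a follower may briefly miss a write that the leader just accepted.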
Over time, the data grew too large for a single database to handle, so they sharded the database.
Next, the company expanded to multiple regions. The team added another layer of sharding at the region level on top of the existing topology, which made the setup even more complex.
TL;DR: the architecture became so complex that even a single node failure could require long deployment timelines. The team realized how unmanageable the situation had become.
And TADA, they built Vitess.
One more reveal: the company is none other than YouTube.
Features
The framework provides many features out of the box. To keep this post concise, I will group them into three categories:
Improvements to MySQL
Connection Pooling:
Manages a set of long-lived connections shared between requests, reducing overhead and latency associated with establishing connections.
Vitess automatically manages the database connection pool, reducing memory pressure and latency for database access.
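A connection pool takes only a few lines of Python to sketch. This is a simplified illustration, not how Vitess implements pooling. `ConnectionPool` and the `open_connection` factory are hypothetical, and a plain dictionary stands in for a real MySQL connection.

```python
import contextlib
import queue
from typing import Callable, Generic, Iterator, TypeVar

T = TypeVar("T")


class ConnectionPool(Generic[T]):
    """Fixed-size pool: connections are created once and reused across requests."""

    def __init__(self, factory: Callable[[], T], size: int) -> None:
        self._idle: "queue.Queue[T]" = queue.Queue(maxsize=size)
        for _ in range(size):
            # Pay the connection set-up cost up front, not on every request.
            self._idle.put(factory())

    @contextlib.contextmanager
    def connection(self, timeout: float = 1.0) -> Iterator[T]:
        # Wait for a free connection; raises queue.Empty if none frees up in time.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Hand the connection back for the next request instead of closing it.
            self._idle.put(conn)


created = []


def open_connection() -> dict:
    # Hypothetical stand-in for an expensive MySQL handshake.
    conn = {"id": len(created)}
    created.append(conn)
    return conn


pool = ConnectionPool(open_connection, size=2)
for _ in range(10):
    with pool.connection() as conn:
        pass  # run a query with conn here
print(len(created))  # 2
```

Ten requests reuse the same two connections, so the handshake cost is paid twice instead of ten times. Because the pool has a fixed size, it also caps how many connections the database ever has to hold open.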
Query Limits:
Provides tunable limits to protect the database against expensive queries.
Includes time-based limits, result limits, disallow lists, and rejection of nondeterministic queries.
Vitess enables enforcement and tuning of these limits, ensuring database stability and performance.
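The kinds of checks listed above can be sketched as a small policy object. Everything in this sketch is hypothetical and simplified. `QueryLimits`, its thresholds, and the string-matching rules illustrate the idea and do not reflect Vitess's actual configuration. The sketch omits a time-based limit, because enforcing one requires cancelling a query while it is running.

```python
from dataclasses import dataclass, field


class QueryRejected(Exception):
    """Raised when a query violates one of the configured limits."""


@dataclass
class QueryLimits:
    # Illustrative thresholds, not Vitess defaults.
    max_rows: int = 10_000
    disallowed: list[str] = field(default_factory=list)

    def check_before(self, sql: str) -> None:
        # Collapse whitespace and case so "drop   table" matches "DROP TABLE".
        normalized = " ".join(sql.upper().split())
        for pattern in self.disallowed:
            if pattern.upper() in normalized:
                raise QueryRejected(f"matches disallow list entry {pattern!r}")
        # UPDATE/DELETE with LIMIT but no ORDER BY may affect different rows on
        # different servers, so this naive check treats it as nondeterministic.
        tokens = normalized.split()
        is_update_or_delete = bool(tokens) and tokens[0] in ("UPDATE", "DELETE")
        if is_update_or_delete and "LIMIT" in tokens and "ORDER BY" not in normalized:
            raise QueryRejected("nondeterministic: LIMIT without ORDER BY")

    def check_rows(self, rows_returned: int) -> None:
        # Result limit: refuse to hand back oversized result sets.
        if rows_returned > self.max_rows:
            raise QueryRejected(f"{rows_returned} rows exceeds limit of {self.max_rows}")


limits = QueryLimits(max_rows=1_000, disallowed=["DROP TABLE"])
limits.check_before("SELECT id FROM videos WHERE owner_id = 42")  # allowed
try:
    limits.check_before("DELETE FROM videos LIMIT 10")
except QueryRejected as err:
    print("rejected:", err)
```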
Sharding
Mostly Transparent Sharding:
Vitess abstracts sharding complexities, allowing applications to interact with VTGate as a unified MySQL instance.
The sharding strategy still has to be defined in a VSchema, so applications are not completely shielded from the concept of sharding.
Queries spanning multiple shards may encounter higher latencies, especially for cross-shard transactions.
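The sketch below shows what a VSchema ultimately enables: it maps a sharding-key value to a keyspace ID, and then maps that keyspace ID to a shard. Vitess names shards by the keyspace-ID range they cover, for example `-80` and `80-`, and the sketch follows that naming. However, `keyspace_id` uses a truncated SHA-256 digest as a stand-in for Vitess's real hash vindex, and the `customer_id` column is hypothetical.

```python
import hashlib


def keyspace_id(customer_id: int) -> bytes:
    # Hypothetical stand-in for a hash vindex: maps the sharding-key value to an
    # 8-byte keyspace ID. Vitess's real "hash" vindex uses a different function.
    return hashlib.sha256(customer_id.to_bytes(8, "big")).digest()[:8]


def parse_range(shard: str) -> tuple[bytes, bytes]:
    # Shard names encode a [start, end) range of keyspace IDs in hex:
    # "-80" covers everything below 0x80..., "80-" everything from 0x80... up.
    start, end = shard.split("-")
    return bytes.fromhex(start), bytes.fromhex(end)


def shard_for(kid: bytes, shards: list[str]) -> str:
    for shard in shards:
        start, end = parse_range(shard)
        # Byte strings compare lexicographically, matching keyspace-ID order.
        if kid >= start and (not end or kid < end):
            return shard
    raise LookupError("no shard covers this keyspace ID")


for customer_id in (1, 2, 3):
    print(customer_id, shard_for(keyspace_id(customer_id), ["-80", "80-"]))
```

The application only ever sends `customer_id`. The mapping from value to shard lives in configuration, which is why the application knows that sharding exists but does not have to route queries itself.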
Moving Data Around:
Vitess facilitates resharding with minimal downtime, simplifying the process of moving data between shards.
Resharding workflow handles replication, progress checkpointing, data validation, and traffic redirection.
Supports moving tables between databases, reducing the significance of sharding key choices.
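One piece of a resharding workflow is checking that the new shards exactly cover the old one and deciding where each row lands. This hypothetical sketch reuses the hex range naming of Vitess shards, splitting `-80` into `-40` and `40-80`. The real workflow also streams ongoing changes, checkpoints progress, validates data, and switches traffic, none of which is modeled here.

```python
def parse_range(shard: str) -> tuple[bytes, bytes]:
    # "40-80" -> (b"\x40", b"\x80"); an empty side means unbounded.
    start, end = shard.split("-")
    return bytes.fromhex(start), bytes.fromhex(end)


def in_range(kid: bytes, shard: str) -> bool:
    start, end = parse_range(shard)
    return kid >= start and (not end or kid < end)


def covers_exactly(source: str, targets: list[str]) -> bool:
    # The target ranges must tile the source range with no gaps or overlaps.
    src_start, src_end = parse_range(source)
    cursor = src_start
    for start, end in sorted((parse_range(t) for t in targets), key=lambda r: r[0]):
        if start != cursor:
            return False
        cursor = end
    return cursor == src_end


def plan_split(source: str, targets: list[str], rows: dict[int, bytes]) -> dict[str, list[int]]:
    """Assign each row (row id -> keyspace ID) of the source shard to one target."""
    if not covers_exactly(source, targets):
        raise ValueError(f"{targets} do not exactly cover {source}")
    plan: dict[str, list[int]] = {t: [] for t in targets}
    for row_id, kid in rows.items():
        target = next((t for t in targets if in_range(kid, t)), None)
        if target is None:
            raise ValueError(f"row {row_id} is outside shard {source}")
        plan[target].append(row_id)
    return plan


print(plan_split("-80", ["-40", "40-80"], {1: b"\x10" * 8, 2: b"\x45" * 8, 3: b"\x3f" * 8}))
# {'-40': [1, 3], '40-80': [2]}
```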
Reliability
Simple Vitess Architecture:
Users interact with stateless VTGate nodes, which route queries to shards composed of multiple tablets.
VTTablet watches over local MySQL instances and ensures their health.
Automatic reparenting in case of leader tablet failure ensures high reliability.
Failure Handling:
Failures of VTGate and VTTablet hosts are handled gracefully, with replacements orchestrated by the compute infrastructure.
Vitess detects MySQL process failures and stops routing queries to the affected tablets.
Automatic reparenting in case of leader tablet failure ensures continuous operation.
While Vitess handles the sharding topology, acquiring and deploying compute resources is outside its scope.
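The leader election step behind automatic reparenting can be viewed in simplified form: among the healthy replicas, promote the one that has applied the most changes. The `Tablet` type and the integer `position` field are hypothetical simplifications. MySQL tracks replication progress as GTID sets, and Vitess's real reparenting logic weighs more factors than this sketch does.

```python
from dataclasses import dataclass


@dataclass
class Tablet:
    alias: str
    healthy: bool
    # Simplification: an integer replication position. MySQL really tracks
    # this as a GTID set, which is not always comparable this simply.
    position: int


def elect_new_leader(replicas: list[Tablet]) -> Tablet:
    """Pick the healthy replica that has applied the most changes."""
    candidates = [t for t in replicas if t.healthy]
    if not candidates:
        raise RuntimeError("no healthy replica can be promoted")
    # The most advanced replica minimizes the writes lost with the old leader.
    return max(candidates, key=lambda t: t.position)


replicas = [
    Tablet("zone1-0000000101", healthy=True, position=90),
    Tablet("zone1-0000000102", healthy=False, position=120),
    Tablet("zone2-0000000201", healthy=True, position=100),
]
print(elect_new_leader(replicas).alias)  # zone2-0000000201
```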
Now, let's talk about the architecture of Vitess and what sets it apart from other approaches. The sections above used terms like VTGate and VTTablet, which are core to the Vitess architecture.
Architecture of Vitess
Vitess is a cluster management solution for sharded MySQL databases. In this section we will incrementally show what the Vitess architecture looks like. First let us consider just two of the most critical Vitess components.
VTGate:
A set of stateless nodes serving as a user-facing frontend.
Users can send queries to any VTGate host, which analyzes each query to identify the shards involved.
Forwards queries to the relevant shard(s) based on analysis.
VTTablet:
A component deployed alongside MySQL instances within shards.
Responsible for monitoring local MySQL health, enforcing query limits, and reporting back to VTGate.
A tablet consists of two components, a MySQL instance and a VTTablet instance, which are always co-deployed on a single host.
Additionally, a metadata store called the topology service is integral to the architecture. It stores crucial information such as sharding details, tablet configurations, and shard leadership. This lets it answer key questions: how each keyspace is sharded, how the tablets are configured, and which tablet leads each shard.
The topology service in Vitess is a strongly consistent store designed to hold small amounts of metadata, such as sharding details and other essential information. It is designed as a plugin, with implementations on top of ZooKeeper, etcd, and Consul. VTGate interacts with the topology service, reading and writing information as needed.
Importantly, the topology service is not part of the hot path for queries; instead, it is accessed during node startup and periodically updated as metadata changes. Shard routing information required for queries is cached by VTGate, ensuring efficient query processing.
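The following sketch shows how the topology service stays off the query hot path: an in-memory cache is filled at startup and refreshed in the background. `RoutingCache`, the `load_routes` function, and the timed refresh are assumptions made for illustration. The mechanism VTGate actually uses to learn about changes may differ.

```python
import time
from typing import Callable

Routes = dict[str, list[str]]  # keyspace -> shard names


class RoutingCache:
    """Serves shard routing from memory; the topology service is read only on load/refresh."""

    def __init__(self, load: Callable[[], Routes], refresh_seconds: float = 30.0) -> None:
        self._load = load
        self._refresh_seconds = refresh_seconds
        self._routes = load()  # read once at startup
        self._loaded_at = time.monotonic()

    def shards(self, keyspace: str) -> list[str]:
        # Hot path: an in-memory lookup with no call to the topology service.
        return self._routes[keyspace]

    def maybe_refresh(self) -> None:
        # Meant to be called off the hot path, e.g. from a background timer.
        if time.monotonic() - self._loaded_at >= self._refresh_seconds:
            self._routes = self._load()
            self._loaded_at = time.monotonic()


def load_routes() -> Routes:
    # Hypothetical stand-in for reading the shard layout from the topology service.
    return {"commerce": ["-"], "customer": ["-80", "80-"]}


cache = RoutingCache(load_routes, refresh_seconds=30.0)
print(cache.shards("customer"))  # ['-80', '80-']
```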
Vitess extends its architecture beyond a single failure zone by supporting cross-region deployments. It introduces the concept of cells, which are groups of hosts that each form a separate failure domain. For example, AWS regions can be used as Vitess cells, enabling robust, fault-tolerant deployments across multiple regions.
Life of a Read and a Write
When a user submits a query to a VTGate node in Vitess, the following scenarios can occur:
Read Only, Single Shard: VTGate forwards the query to any serving replica within the shard. The database processes the query, and VTGate returns the result to the user.
Read+Write, Single Shard: Similar to the previous case, but the query must be directed to the shard's leader. In cross-cell deployments, this may require cross-cell communication.
Read Only, Multiple Shards: VTGate routes the read-only query to replicas in all involved shards. It gathers results from each shard, combines them, and returns them to the user.
Read+Write, Multiple Shards: Similar to the previous case, but queries must be forwarded to shard leaders.
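The four cases above reduce to one rule per shard: contact the leader for writes and any serving replica for reads, then combine the results. `Shard`, `plan_query`, and `scatter_gather` are hypothetical names used for illustration. The sketch assumes that the query is allowed to read from replicas, and it leaves out query parsing, result ordering, and concurrency.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Shard:
    name: str
    leader: str
    replicas: list[str]


def plan_query(shards: list[Shard], is_write: bool) -> list[str]:
    """Pick one tablet per involved shard (simplified)."""
    targets = []
    for shard in shards:
        if is_write or not shard.replicas:
            # Writes must go to the shard's leader; reads fall back to it
            # only when the shard has no replicas.
            targets.append(shard.leader)
        else:
            # Any serving replica can answer a read.
            targets.append(random.choice(shard.replicas))
    return targets


def scatter_gather(targets: list[str], run: Callable[[str], list[tuple]]) -> list[tuple]:
    """Run the query on each target tablet and concatenate the rows."""
    rows: list[tuple] = []
    for tablet in targets:  # a real router would query shards concurrently
        rows.extend(run(tablet))
    return rows


shards = [Shard("-80", "leader-a", ["replica-a1", "replica-a2"]), Shard("80-", "leader-b", ["replica-b1"])]
print(plan_query(shards[:1], is_write=True))  # read+write, single shard
print(plan_query(shards, is_write=False))  # read only, multiple shards
```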
Fan-out queries to multiple shards incur latency and availability penalties. Latency is bound by the slowest shard, and availability decreases with each additional shard involved. Vitess also supports cross-shard transactions, but it uses a slow two-phase commit (2PC) protocol for them, so it's advisable to minimize their use.
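These fan-out penalties can be expressed as numbers. Assume shards fail independently and each one is available with probability p. A query that touches n shards then succeeds with probability p^n, and its latency is the maximum of the per-shard latencies. The hypothetical helper below illustrates that arithmetic.

```python
def fan_out_stats(latencies_ms: list[float], shard_availability: float) -> tuple[float, float]:
    """Latency and availability of one query that must touch every listed shard."""
    # The query finishes only when the slowest shard has answered.
    latency = max(latencies_ms)
    # Every shard must be up; assumes shard failures are independent.
    availability = shard_availability ** len(latencies_ms)
    return latency, availability


latency, availability = fan_out_stats([5, 7, 30, 6], shard_availability=0.999)
print(f"{latency} ms, {availability:.2%} available")  # 30 ms, 99.60% available
```

With four shards at 99.9% availability each, the query's availability drops to about 99.60%. Its latency is set by the 30 ms shard, even though the other three shards answer in under 10 ms.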
The Vitess architecture goes much deeper than this post can cover. To learn more, check out the official Vitess documentation.
Use Cases
Vitess, an open-source database clustering system for horizontal scaling of MySQL, finds application in various scenarios where scalability, reliability, and manageability are critical. Here are some key use cases:
Scalable Web Applications: Vitess is ideal for web applications experiencing rapid growth in user base and data volume. It enables horizontal scaling of MySQL databases, allowing applications to handle increased traffic without compromising performance.
Microservices Architecture: In a microservices environment where each service has its own database, Vitess helps in managing a large number of databases efficiently. It provides a unified layer for database operations, ensuring consistent performance and scalability across services.
Multi-Tenant SaaS Platforms: Software as a Service (SaaS) platforms serving multiple tenants can benefit from Vitess by isolating tenant data in separate shards while maintaining centralized management and scalability. This allows SaaS providers to scale their infrastructure based on demand from different tenants.
High-Traffic E-Commerce Platforms: E-commerce platforms experiencing high traffic volumes, especially during peak seasons or promotional events, require a scalable database solution to handle increased transactions and user activity. Vitess enables these platforms to scale their databases dynamically to meet demand spikes.
Real-Time Analytics: Vitess can be used for real-time analytics applications where data needs to be ingested, processed, and analyzed in real-time. By horizontally scaling MySQL databases, Vitess allows for faster data processing and analytics queries, enabling businesses to make timely decisions based on insights.
Content Management Systems (CMS): CMS platforms managing a large volume of content and user-generated data can leverage Vitess for horizontal scaling of their databases. This ensures high availability and performance even during peak usage periods, providing a seamless experience for content creators and consumers.
Data Replication and Disaster Recovery: Vitess supports data replication and disaster recovery by replicating each shard's data to replicas across cells and geographical regions. This keeps data available and intact in case of hardware failures, network outages, or other disasters.
Cloud-Native Applications: Vitess is well-suited for cloud-native applications deployed on Kubernetes or other container orchestration platforms. It integrates seamlessly with containerized environments, enabling automated deployment, scaling, and management of MySQL databases in cloud environments.
These are just a few examples of the diverse range of use cases where Vitess can be applied to address scalability, reliability, and performance challenges in modern applications and infrastructures.
Several MNCs have recognized the potential of Vitess.
YouTube, Booking.com, BlaBlaCar, Square, Shopify, GitHub, Slack, and HubSpot are among them.
Shopify published a very interesting post on its engineering blog describing how Vitess solved its database scaling issues.
Check it out: Horizontally scaling the Rails backend of Shop app with Vitess
Follow me on LinkedIn for more tech bytes, deep dives and random content related to Software Engineering :D🚀