CAP Theorem For System Design Interview
The CAP theorem is a fundamental result in distributed system design: a distributed data store can provide at most two of the following three guarantees at the same time: Consistency, Availability, and Partition Tolerance. Because network partitions cannot be prevented in practice, the real design decision is which of the other two properties to sacrifice when a partition occurs.
Consistency (C)
Consistency in the context of the CAP theorem refers to linearizability (sometimes called atomic consistency): every read returns the most recent successfully completed write, regardless of which node the client connects to. In a consistent system, once a write operation is acknowledged, all subsequent reads reflect that write. This is typically achieved by updating and synchronizing all replicas before a write is considered complete. A client reading from any node is guaranteed to receive the most recent, successfully written data.
Example: Imagine a bank account balance. If a deposit is made and the system is strongly consistent, any subsequent query for that balance, from any ATM or online banking portal, will immediately reflect the new, higher balance. If a read were to occur on a node that hadn't yet received the update, the system would either block the read until it's consistent or return an error, rather than serving stale data.
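The bank-balance behavior can be sketched with a toy in-memory store (illustrative only, not a real database client): a write is acknowledged only after every replica has applied it, so a subsequent read from any replica sees the new value.

```python
# Toy sketch of strong consistency: acknowledge a write only after
# ALL replicas have applied it, so no replica can serve stale data.

class StronglyConsistentStore:
    def __init__(self, replica_count=3):
        # Each replica is a simple key -> value dict.
        self.replicas = [{} for _ in range(replica_count)]

    def write(self, key, value):
        # Synchronously apply the write to every replica before acking.
        for replica in self.replicas:
            replica[key] = value
        return "ack"

    def read(self, key, replica_id):
        # A read from any replica returns the latest acknowledged write.
        return self.replicas[replica_id].get(key)

store = StronglyConsistentStore()
store.write("balance", 150)  # the deposit is acked only once all replicas agree
print([store.read("balance", i) for i in range(3)])  # [150, 150, 150]
```

The cost of this guarantee is latency and availability: if any replica is unreachable, the write in this sketch cannot complete at all, which is exactly the tension the rest of this section explores.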
Availability (A)
Availability means that every request to a non-failing node receives a non-error response, even if other nodes in the system are down or experiencing network issues. An available system ensures that clients can always read or write data as long as there is a functional node to serve the request. This often means the system prioritizes responsiveness over strict consistency: a client might receive slightly stale data in exchange for a fast response.
Example: Consider an e-commerce website that needs to remain operational 24/7. If a product inventory system prioritizes availability, even if a few database servers go offline, the website will continue to display product listings and allow customers to place orders. There might be a slight delay in updating inventory counts across all servers, leading to a brief period where an item might show as "in stock" even if it's just been purchased by another user on a different server. However, the system remains responsive and functional for the vast majority of users.
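The inventory scenario can be sketched the same way (a toy simulation, not real infrastructure): an availability-first store acknowledges writes to whichever replicas are reachable and serves reads from any live replica, so a replica that missed an update briefly returns stale stock counts instead of an error.

```python
# Toy sketch of availability-first behavior: never block or error on a
# live replica, even if replication has not caught up yet.

class AvailableStore:
    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.down = set()  # indices of replicas currently offline

    def write(self, key, value):
        # Apply the write only to reachable replicas; ack immediately.
        for i, replica in enumerate(self.replicas):
            if i not in self.down:
                replica[key] = value
        return "ack"

    def read(self, key, replica_id):
        # Serve the request as long as the replica is up -- no blocking.
        if replica_id in self.down:
            raise ConnectionError("replica offline")
        return self.replicas[replica_id].get(key)

store = AvailableStore()
store.write("stock:widget", 5)
store.down.add(2)                  # replica 2 goes offline
store.write("stock:widget", 4)     # a purchase; replica 2 misses it
store.down.remove(2)               # replica 2 comes back, not yet synced
print(store.read("stock:widget", 0))  # 4 (fresh)
print(store.read("stock:widget", 2))  # 5 (stale, but served instantly)
```

Replica 2's stale "5 in stock" answer is exactly the "item shows as in stock after someone else bought it" window described above; a background anti-entropy or repair process would later bring it back in sync.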
Partition Tolerance (P)
Partition Tolerance means the system continues to operate despite network partitions. A network partition occurs when communication between nodes is disrupted, effectively splitting the system into multiple isolated sub-systems that can't communicate with each other. In a partition-tolerant system, even if nodes are unable to communicate, each side of the partition remains operational and can continue to process requests independently. This is a critical consideration in distributed systems, as network failures are inevitable.
Example: Imagine a distributed database spread across multiple data centers. If the network link between two data centers goes down, forming a partition, a partition-tolerant system would allow the applications in each data center to continue operating and serving requests independently. This might mean that data written in one data center during the partition won't be immediately visible in the other, but the system as a whole doesn't become completely unavailable.
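The two-data-center example can be sketched as follows (a toy model with hypothetical names): each side keeps serving its local copy during the partition, writes diverge, and state converges only once replication resumes.

```python
# Toy sketch of partition tolerance: two data centers keep operating
# independently while the cross-DC link is down, then reconverge.

class DataCenter:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)

def replicate(src, dst):
    # Normally runs continuously; during a partition it simply cannot.
    dst.data.update(src.data)

east, west = DataCenter("east"), DataCenter("west")
east.write("user:42", "alice")
replicate(east, west)              # cross-DC link up: west sees the write

# The link between data centers goes down (network partition).
east.write("user:42", "alice-v2")  # east keeps operating independently
print(west.read("user:42"))        # 'alice' -- west serves its own copy

replicate(east, west)              # partition heals, state converges
print(west.read("user:42"))        # 'alice-v2'
```

Note that both sides stayed responsive throughout; the price was that west served an older value until the link was restored, which is the C-versus-A choice the next section makes concrete.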
CAP Trade-offs: Cassandra vs. MySQL
The CAP theorem forces a critical design decision: which two properties to prioritize when a partition occurs.
Cassandra (AP System: Availability and Partition Tolerance)
Cassandra is a distributed NoSQL database designed to be highly available and partition tolerant. It achieves this by sacrificing strong consistency.
How it works: Cassandra employs an eventually consistent model. When a write occurs, data is written to multiple nodes, but the system doesn't wait for all replicas to acknowledge the write before responding to the client. This allows for high availability and ensures the system remains operational even during network partitions. If a node becomes unavailable or a partition occurs, other nodes can still serve requests.
Trade-off: The trade-off is consistency. During a network partition, data written to one side of the partition might not be immediately visible on the other side. Reads might return stale data until the partition is healed and the data eventually propagates to all replicas. Cassandra provides tunable consistency levels, allowing developers to choose a balance between consistency and availability for specific operations (e.g., quorum reads/writes for stronger consistency, or one-node reads/writes for higher availability).
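Tunable consistency comes down to simple arithmetic: with N replicas, writing to W of them and reading from R of them guarantees the read set overlaps the write set whenever R + W > N, so at least one replica in every read quorum holds the latest write. The sketch below models this in plain Python (real Cassandra exposes the same idea through per-query consistency levels such as ONE and QUORUM):

```python
# Toy sketch of quorum reads/writes: R + W > N guarantees that a read
# quorum overlaps the most recent write quorum.

N = 3
# Each replica stores (value, version); all start with the old value.
replicas = [{"key": ("old", 1)} for _ in range(N)]

def write(value, version, W):
    # A write succeeds once W replicas have acknowledged it.
    for replica in replicas[:W]:
        replica["key"] = (value, version)

def read(R):
    # Read from R replicas and keep the highest-versioned value seen.
    responses = [replica["key"] for replica in replicas[-R:]]
    return max(responses, key=lambda pair: pair[1])[0]

write("new", 2, W=2)       # only replicas 0 and 1 have the new value
print(read(R=1))           # 'old' -- a single-replica read can be stale
print(read(R=2))           # 'new' -- quorum: R + W = 4 > 3, overlap guaranteed
```

Choosing W=1, R=1 maximizes availability and latency at the cost of stale reads; W=2, R=2 (quorum on both sides) buys read-your-writes behavior while still tolerating one replica failure.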
Use case: Ideal for applications that require high uptime and throughput, where some eventual consistency is acceptable, such as user activity feeds, sensor data collection, or real-time analytics.
MySQL (CP System: Consistency and Partition Tolerance; availability is limited during partitions)
Traditional relational databases like MySQL, when deployed in a distributed setup (e.g., using replication with master-slave or multi-master configurations), typically prioritize Consistency and Partition Tolerance over absolute Availability during network partitions.
How it works: In a highly consistent MySQL setup, especially with strong replication or a distributed transaction manager, if a network partition occurs, the system will often choose to halt operations on one side of the partition or declare itself unavailable to prevent data inconsistency. For example, if a master node is partitioned from its slaves, the slaves might become read-only or stop processing writes to ensure data integrity.
Trade-off: While MySQL is fundamentally designed for strong consistency in a single instance, scaling it into a distributed system often means that during a network partition, some parts of the system might become temporarily unavailable or operate in a degraded mode (e.g., read-only) to maintain data consistency across all nodes. This means that if the network splits and a node cannot guarantee that it has the most up-to-date information, it might refuse to serve a request rather than provide potentially inconsistent data.
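This refuse-rather-than-lie behavior can be sketched as a quorum check (a toy model, not MySQL's actual replication protocol): a node accepts writes only while it can reach a strict majority of the cluster, and errors out once a partition isolates it.

```python
# Toy sketch of CP behavior: a node that cannot reach a majority of its
# peers rejects writes rather than risk serving inconsistent data.

class CPNode:
    def __init__(self, cluster_size=3):
        self.cluster_size = cluster_size
        self.reachable_peers = cluster_size - 1  # healthy cluster to start
        self.data = {}

    def has_quorum(self):
        # This node plus its reachable peers must form a strict majority.
        return 1 + self.reachable_peers > self.cluster_size // 2

    def write(self, key, value):
        if not self.has_quorum():
            raise RuntimeError("no quorum: refusing write during partition")
        self.data[key] = value
        return "ack"

node = CPNode()
node.write("account:7", 100)       # healthy cluster: write accepted
node.reachable_peers = 0           # a partition isolates this node
try:
    node.write("account:7", 200)
except RuntimeError as e:
    print(e)                       # no quorum: refusing write during partition
```

Contrast this with the Cassandra sketch above: here the isolated node gives up availability (the write fails) to guarantee that no client ever observes a balance the rest of the cluster would disagree with.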
Use case: Suitable for applications requiring strong data integrity and ACID properties, such as financial transactions, inventory management where exact counts are critical, or any system where even temporary data inconsistency is unacceptable. While highly available in normal operation, achieving global availability across partitions often comes at the cost of consistency unless very complex and expensive distributed transaction mechanisms are employed.
In essence, when designing a distributed system, the CAP theorem guides your architectural choices by forcing you to explicitly acknowledge the trade-offs between these three crucial properties.