In distributed systems, the CAP theorem highlights the compromise that exists between Consistency, Availability, and Partitions. It posits that if a system is partitioned, it can exclusively ensure the simultaneous delivery of only two out of the three features.
Eric Brewer, a computer science professor at U.C. Berkeley, originally presented this theorem, which is also referred to as Brewer’s theorem.
To clarify, what does the CAP Theorem exactly entail?
According to the theorem, a distributed system that comprises multiple computers would be advantageous if it ensures a cohesive state. Additionally, this cooperative system must possess the ability to handle data across all nodes, be it virtual or physical. Apache Iceberg has been proven to be one such system.
Our investigation will delve into the CAP theorem and differentiate between NoSQL and SQL databases, while also exploring the optimal situations and methods to create each database type to gain a comprehensive comprehension of its functionality in system architecture.
Understanding Consistency, Availability, and Partitioning
To begin with, have a glance at this illustration of a basic distributed system that has only two nodes.
It is likely that the nodes are interconnected, implying that data is probably being stored on one node and retrieved from another by the user’s system. This leads to the question, what should be the next course of action?
The CAP system is precisely defined.
To understand the implementation of the CAP theorem in system design, let us examine a specific scenario.
While working on the footage, a video editor retrieves it from one database or node and then stores the edited version on another database or node.
In this instance, CAP offers three distinct characteristics to manipulate:
- Consistent Availability and Reliability (CA database)
- Consistent Partitioning and Availability (CP Database)
- Available Partitioning and the Ability to Accept Read and Write Operations (AP Database)
Later on, we’ll delve deeper into each of these attributes.
Before we proceed to examine those applications, let us first carefully examine each of these characteristics individually to understand why the CAP theorem is valuable.
If all users are capable of reading and writing data dependably from any node in the system, the distributed system is considered to be consistent.
All nodes should receive any new data, and the read operation must return identical data to all users in a consistent distributed system.
To facilitate our discussion, let us suppose that you have made modifications to your online purchase after submitting it. If you cancel your purchase at a later time, the customer service representative you communicate with will be informed about the changes made to your order.
The evidence shows that the written data resulting from the updates was consistent across all linked nodes. Therefore, when the new representative attempted to access your data, it returned the most recent written information.
Availability alludes to the ability of a distributed network to keep transmitting data, even if certain individual nodes experience malfunctions. The overall functioning of the network must remain unaffected, even if specific components are not performing correctly.
In a distributed network with availability, user systems need not be concerned with the specifics of the data they receive.
Even if multiple servers are down, the customer service team will still be able to retrieve the required data if you have modified an online order.
Despite the network’s emphasis on ensuring availability, there is no assurance that the received data is current and reflects the most recent writing activity.
The CAP theorem affirms that a distributed system must exhibit partition tolerance to handle temporary disconnections between its nodes. This implies that the system must continue to run and permit read/write activities, even if some nodes are unavailable, to guarantee that the distributed system can still operate during such an interruption.
To maintain the intended level of database integrity, ensuring partition tolerance is almost imperative. As a result, linked systems can only ever employ two of the three database characteristics (Consistency, Availability, and Partition tolerance), and this is the primary rationale for this limitation.
The network can be configured to offer either data integrity or accessibility to all users, depending on the type of node failure and the data requirements of the user systems. Thus, users are provided access to the data they need while also maintaining the integrity of the data.
Let us examine how the user system utilises these database features.
The following diagram demonstrates the basic stages involved in accessing a database. The CAP model, which is widely used in various industries, follows a sequential approach that integrates two of these elements.
Consistency and availability (CA database)
For a networked database, it is crucial for data to be consistent and accessible across all nodes. Nonetheless, when there is no partition tolerance, it means that if a node fails, the stored data on it will become inaccessible. This is a significant problem that can result in data loss.
As node failures are inevitable in any networked system, CA databases become largely unnecessary.
The upcoming sections will cover various SQL databases that support CA, including PostgreSQL.
Let us examine the characteristics of the CAP theorem and how they interact in various contexts.
Consistency and partition tolerance (CP Database)
Databases that give more weightage to consistency and partition tolerance across all network nodes will disable any nodes that are inconsistent during instances of partition or node failure. This guarantees the network’s continued stability and security even when individual nodes are not operating correctly.
To guarantee data correctness across all active nodes, it is crucial to replicate the data identically. Generally, the data on the primary nodes is duplicated so that backup nodes can take over in case of a malfunction. However, during primary node repairs, write operations are delayed as availability is not the top priority.
NoSQL databases, including CP databases, have the ability to update independently of the primary node asynchronously. This feature makes them increasingly popular among organizations looking for an efficient database management system (DBMS). MongoDB is one of the most commonly used NoSQL DBMSs.
Availability and partition tolerance (AP Database)
During instances of partition or node failure in an Availability Partition (AP) database, the top priority is to preserve the system’s availability. This is necessary to ensure that the network can function even if one of the nodes is offline. Nevertheless, it must be noted that the stored data on these disconnected nodes may not represent the most recent information available.
With the AP database model, NoSQL databases such as Apache Cassandra and MongoDB can be accessed on all nodes simultaneously.
After the partition has been resolved, users can synchronise their data to ensure consistency.
Databases: Definition and Creation
Before delving into the specifics of how each DBMS manages each type of database, let’s first understand what they are and their general functioning.
The two primary database types are ACID and BASE. Defining your database correctly is essential and it highly depends on the nature of your data.
Consistency and ACID/BASE
ACID refers to:
Atomicity– Refers to the transaction criteria of a database management system.
Consistency– Indicates that all nodes within the network share the same accurate database.
Isolation– Refers to the segregation of distinct user systems to prevent duplication of data.
Durability– Pertains to the inherent capability of a DBMS to create data copies in the event of an error.
BASE Basic Availability, Soft State, and Eventually Consistent (BASE) is a feature that is unique to non-relational databases. As partitions could result in discrepancies between nodes, these databases can’t be considered as being compliant with Atomicity, Consistency, Isolation, and Durability (ACID).
In the end, consistency is key to ensure that data remains accessible to the end user, as previously mentioned. Though it could come at a cost, it’s a necessary measure to maintain data accessibility.
Let’s now compare SQL with NoSQL. Both relational and non-relational databases require specific languages to interact with them.
SQL and NoSQL Databases
SQL, or Structured Query Language, is a programming language explicitly crafted for use with relational databases. It’s often utilized to execute CRUD (create, read, update, and delete) operations, which are the most fundamental and vital functions of database management. In summary, the acronym CRUD is a helpful way to remember SQL’s functions.
Whilst SQL may seem limiting as it requires data to follow specific structures or parameters, this characteristic has numerous practical uses.
However, this necessitates careful planning prior to the database’s deployment.
NoSQL stands for “Not Only Structured Query Language” and is primarily utilized as a query language for non-relational databases.
Unlike SQL databases, NoSQL databases do not require predefined schemas due to the flexibility of their architecture.
We can use this data to understand the importance of various types of databases.
Utilizing PostgreSQL for CA’s Data Warehouse
PostgreSQL is a database management system (DBMS) that provides users with connectivity to databases by utilizing structured query language.
Because it adheres to the ACID (Atomicity, Consistency, Isolation, Durability) standards, PostgreSQL is well-suited for use in applications such as banking, where data consistency and availability are crucial.
Furthermore, PostgreSQL enables the use of foreign keys, which simplifies communication between databases.
Nevertheless, this requires careful planning as data is structured based on specific tables.
Employing MongoDB for CP Databases
If you are looking for a NoSQL database management system that emphasizes document-oriented data, MongoDB is an ideal option. Unlike SQL databases, MongoDB doesn’t demand extensive pre-planning since it’s constructed to be schema-less.
The CAP (Consistency, Availability, Partition tolerance) theorem is a crucial tool for big data analysis, offering flexibility in accommodating changes and updates. Therefore, it’s essential for anyone or any organization that wants to remain ahead of the curve in their respective fields.
MongoDB utilizes documents for storage and BSON (Binary JSON) files for query processing. By transforming them to JSON files, users can access the BSON files and modify the database underneath. This has already been established.
Despite having a BASE (Basic Availability, Soft-state, Eventual consistency) design originally, MongoDB has since integrated some aspects of the ACID (Atomicity, Consistency, Isolation, Durability) standards.
Although there are lingering concerns and the database doesn’t conform strictly to the ACID principles, this could be beneficial in specific situations.
Cassandra – Databases for Advanced Placement
Cassandra is frequently employed in managing AP (Availability-Partition) databases that prioritize availability over consistency.
Different from databases that use a primary and secondary node setup to handle multiple node failures, this system operates as a peer-to-peer network where all participants are accountable for managing potential node failures.
When a node in a Cassandra system fails, the system automatically generates replicas of the data that was stored on the failed node onto other nodes within the system. This automated process depends on the replication factor configured by the user during the initial setup of the system.
If the replication factor is 2, the nodes will replicate data in a clockwise direction to an additional n+1 nodes from the initial node.
This functionality guarantees the consistency of data over time, but it may take some time to make the most up-to-date information available.
For businesses of all scales, the CAP theorem is a crucial factor in data management. These technologies can minimize duplicate data entry and prevent errors caused by human mistakes when implemented effectively.
It’s worth mentioning that users have a range of database structures and administration systems to choose from so they can pick the most appropriate solution for their specific requirements. Additionally, the type of data that needs handling and the volume of operations should be taken into account when selecting a suitable database structure and administration system.