Demystifying Paxos: Python implementation and visualization
This article covers the Paxos algorithm: what it is, where it is used, and how it compares with the alternatives. We then dive into a simple implementation with visualization, followed by some interesting test cases to solidify the learning.
INTRODUCTION
In computer science, consensus is a fundamental problem. Consider a group of people (or machines) with no designated leader: can they still arrive at a single agreed-upon value? Say I want a group of machines to agree on the value 42. Initially this might seem simple: one machine proposes a number and all the others accept it. But what happens if two machines propose different values simultaneously? You might think the value that receives the majority of votes wins. And what if every machine proposes a value at the same time? It is tempting to say that the value with the most votes wins, but that is incorrect: having the most votes is not the same as having a majority of the votes. There could be three proposers, each receiving 30% of the votes; each would then end up with a distinct "consensus" value within its own smaller group.
When Leslie Lamport first proposed the Paxos algorithm, its presentation was so complicated that it was rejected by almost all reviewers. Even when it was finally published in 1998 in the ACM Transactions on Computer Systems — a premier computer science research journal — wider adoption remained elusive. Today, almost all distributed systems courses begin with the FLP impossibility result and then move on to the Paxos algorithm. In 2001, Lamport published the same algorithm in a simpler form — ‘Paxos Made Simple.’ This version made the intuition behind the algorithm easier to grasp, and its adoption finally took off.
This article uses the approach from that second paper, supplemented with test cases to make the algorithm easier to understand. A deep understanding of this algorithm is fundamental to comprehending how most distributed systems stay coherent. Moreover, a simpler alternative, Raft, has found its way into most major distributed systems built in the last decade.
PROBLEM STATEMENT
A set of peer nodes that can communicate with each other must agree on a single consensus value.
REQUIREMENT BEFORE USING THE ALGORITHM
At least a majority of nodes must be up and connected to each other. So with 10 nodes, at least 6 must be up and connected. The ‘connected’ part matters: distributed nodes may split into network-partitioned groups of, say, 3–3–3–1, and in that case consensus is not possible.
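As a quick sanity check (illustrative only, not part of the implementation later in this article), the quorum size for n nodes is simply floor(n/2) + 1:

def quorum_size(n: int) -> int:
    # Smallest group of nodes that still forms a majority.
    return n // 2 + 1

print(quorum_size(10))  # 6 -> with 10 nodes, at least 6 must be up and connected
print(quorum_size(5))   # 3
print(quorum_size(3))   # 2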
ALGORITHM
Phase 1: Prepare
Initiate Prepare: A proposer node decides to initiate a consensus round. It generates a unique proposal number (typically higher than any it has used before) and sends a prepare request to all other nodes (or a subset, if optimizations such as hierarchical Paxos are used).
Receive Prepare: Upon receiving a prepare request, an acceptor checks whether the proposal number is higher than any proposal it has previously responded to. If it is, the acceptor promises not to accept any lower-numbered proposals and sends a promise back to the proposer. The promise includes the highest-numbered proposal the acceptor has already accepted, if any.
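Here is a minimal sketch of the acceptor side of this phase. The class and field names (Phase1Acceptor, promised_number, accepted) are illustrative and are not the PaxosNode class used later in this article.

class Phase1Acceptor:
    def __init__(self):
        self.promised_number = 0  # highest proposal number this acceptor has promised
        self.accepted = None      # (number, value) of the highest-numbered proposal accepted so far, if any

    def on_prepare(self, proposal_number):
        # Promise only if the incoming number is higher than anything promised before.
        if proposal_number > self.promised_number:
            self.promised_number = proposal_number
            # The promise carries back whatever this acceptor has already accepted.
            return {"promised": True, "accepted": self.accepted}
        return {"promised": False}

a = Phase1Acceptor()
print(a.on_prepare(5))  # {'promised': True, 'accepted': None}
print(a.on_prepare(3))  # {'promised': False} -- a lower-numbered prepare arrives too late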
Phase 2: Propose (Accept)
Send Propose: Once the proposer has received promises from a majority of nodes, it sends a propose request (also known as an accept request). The value in this request is either the value of the highest-numbered proposal reported as accepted in the promises (so that a value that may already have been chosen is preserved) or, if no acceptor reported an accepted value, a new value of the proposer's choosing.
Receive Propose: Acceptors receive the propose request. If they have not promised a higher-numbered proposal in the meantime, they accept it by sending an acknowledgment back to the proposer and potentially to all other nodes.
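A companion sketch for this phase, again with illustrative names that are not part of the implementation below: the proposer's value-selection rule and the acceptor's accept rule.

def choose_value(promises, my_value):
    # Adopt the value of the highest-numbered proposal reported as already accepted;
    # only if no acceptor reported one may the proposer use its own value.
    accepted = [p["accepted"] for p in promises if p.get("promised") and p.get("accepted")]
    if accepted:
        _, value = max(accepted, key=lambda nv: nv[0])
        return value
    return my_value

def on_accept(acceptor_state, proposal_number, value):
    # acceptor_state is a dict like {"promised_number": 0, "accepted": None}.
    # Accept unless a promise was made to a higher-numbered proposal in the meantime.
    if proposal_number >= acceptor_state["promised_number"]:
        acceptor_state["promised_number"] = proposal_number
        acceptor_state["accepted"] = (proposal_number, value)
        return True
    return False

# One promise reports a previously accepted (number=3, value="42"),
# so the proposer must re-propose "42" even though it wanted "7".
promises = [{"promised": True, "accepted": (3, "42")}, {"promised": True, "accepted": None}]
print(choose_value(promises, "7"))  # -> "42"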
Phase 3: Learning
Learning and Confirmation: Once the proposer receives acceptances from a majority of nodes, it considers the value chosen. The proposer, or another designated learner, then broadcasts this decision to all nodes or learners in the system.
Finalizing Consensus: Each node updates its state based on the learned value and moves forward with this agreed-upon value until a new consensus round is initiated.
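A rough sketch of the learning step, assuming each acceptor reports the (proposal number, value) pairs it accepts. This mirrors the acceptance_counts dictionary used in the PaxosNode implementation later, but the function itself is only illustrative.

from collections import Counter

def learn(accept_notifications, total_nodes):
    # accept_notifications: iterable of (proposal_number, value) pairs reported by acceptors.
    counts = Counter(accept_notifications)
    majority = total_nodes // 2 + 1
    for (number, value), count in counts.items():
        if count >= majority:
            return value  # chosen: a majority accepted this (number, value) pair
    return None           # not chosen yet

print(learn([(1, "42"), (1, "42"), (1, "42")], total_nodes=5))  # -> "42"
print(learn([(1, "42"), (2, "7")], total_nodes=5))              # -> None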
INTUITION
This is a tricky concept. People often struggle to visualize the different aspects of the algorithm, which makes it unclear what should be simplified to aid intuition. Here is an attempt to explain. The algorithm operates in two stages: promise and accept. In the first phase, at least a majority of nodes promise to take part in this round (identified by its proposal number) and report anything they have already accepted. In the second phase, at least a majority accept the proposed value, say 42. Although its name suggests that the prepare phase merely readies the nodes for voting, in reality it also flushes out any previous unsuccessful consensus attempts. Paxos allows a consensus attempt to fail so it can be tried again; this is why it is slow and, in some very rare though mathematically possible cases, may never reach consensus.
A common confusion arises after a majority of nodes have responded to the leader's prepare for a value (say 42): why isn't consensus achieved at that point? Why is the additional Accept phase needed?
It is needed because, after the majority of nodes have made their promise, the leader does not know whether any of them have since moved on (for example, by promising a higher-numbered proposal). The second, Accept phase ensures that the leader declares a consensus only after the recorded value has been accepted by a majority of nodes. This is similar to how a two-phase commit is executed in database transactions: all the data in a transaction is written, but the commit is not declared until the write is complete. If the commit does not happen, the writes are discarded; if it does, the data is declared committed. Similarly, if the promise phase succeeds but the accept phase does not, the proposed value is discarded. The discarding happens implicitly through the next proposal round.
Paxos prioritizes consistency over availability. It may take a long time to achieve consensus, but since there is no single point of failure — as there is no declared leader — it generally handles the failure of nodes more effectively.
IMPLEMENTATION: BASICS
The core of the algorithm is implemented inside the PaxosNode class. The number of PaxosNode objects created corresponds to the number of nodes participating in consensus building. A simple test case directs one of the nodes to act as a leader and attempt to generate consensus. The visualization below explains the flow of messages. However, there is common confusion regarding ‘when has consensus been reached?’ The algorithm is designed such that a leader initiates an effort to build consensus. Logically, consensus is reached when this ‘leader’ confirms that consensus has been achieved. In practice, this occurs after the leader has committed to its local storage that 42 is the final consensus value, akin to the commit phase of a two-phase commit.
A valid concern against this approach is, ‘what if the leader crashes?’ An even more extreme scenario is, ‘what if all nodes crash and then recover?’ Will they be able to communicate and determine what the consensus value was?
Yes, they will. When the acceptor nodes accept a value, they persist it in their local storage. It is assumed that this local persistent storage remains available across node crashes. Thus, if all nodes recover and confer about the last accepted consensus, a majority will affirm that it was 42, enabling recovery. Some implementations incorporate an additional ‘broadcast’ message and declare that consensus is reached when the majority of nodes have received this broadcast. Although this is not technically accurate, it simplifies the understanding of the process. This approach distinguishes between the achievement of consensus and the storage of the consensus result. It also reduces the storage requirements at each node, as the local storage of an accepted value might need to be discarded later if that consensus attempt fails. However, a broadcast will never fail once consensus has been reached; the message simply broadcasts the result.
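A minimal sketch of that persistence idea. The helper names and the file layout here are hypothetical, not part of the implementation below: each acceptor writes its accepted value to local disk, and after a full restart the surviving value is whichever one a majority of nodes report.

import json
from collections import Counter

def persist_accepted(node_id, proposal_number, value, path_prefix="paxos_state_"):
    # Hypothetical helper: write the accepted (number, value) pair to this node's local disk.
    with open(f"{path_prefix}{node_id}.json", "w") as f:
        json.dump({"proposal_number": proposal_number, "value": value}, f)

def recover_consensus(node_ids, total_nodes, path_prefix="paxos_state_"):
    # Hypothetical helper: after all nodes restart, read each node's persisted value and
    # adopt any value that a majority of them report.
    reported = []
    for node_id in node_ids:
        try:
            with open(f"{path_prefix}{node_id}.json") as f:
                reported.append(json.load(f)["value"])
        except FileNotFoundError:
            pass  # this node had not accepted anything before crashing
    if not reported:
        return None
    value, count = Counter(reported).most_common(1)[0]
    return value if count >= total_nodes // 2 + 1 else None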
Another enhancement to this model is that when a node receives broadcast messages, it also forwards these messages to other nodes. This prevents the leader from being overwhelmed with the sole responsibility of communicating the consensus to all nodes.
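A sketch of that relay idea (the helper and the example topology are hypothetical, not part of the class below): each node forwards a broadcast it has not seen before to its peers, so the fan-out is shared instead of resting entirely on the leader.

def relay_broadcast(node, peers, value, seen):
    # Forward a consensus broadcast once per node, then let the peers forward it further.
    if node in seen:
        return
    seen.add(node)
    print(f"node {node} learned consensus value {value}")
    for peer in peers.get(node, []):
        relay_broadcast(peer, peers, value, seen)

# Example topology: the leader (0) tells nodes 1 and 2; they relay to 3 and 4.
peers = {0: [1, 2], 1: [3], 2: [4], 3: [], 4: []}
relay_broadcast(0, peers, "42", seen=set())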
IMPLEMENTATION: CLASSES
To implement it, let's declare the following classes.
NodeState (Enum): Represents the states a node can be in, such as “Up,” “Down,” or “Blocked.” This helps in managing and checking the node’s availability and operational status.
NodeRole (Enum): Defines roles a node can assume within the Paxos protocol, specifically “Leader” and “Acceptor.” This categorization aids in role-specific operations during the consensus process.
Message (Dataclass): A simple structure for messages exchanged between nodes, carrying essential information like sender and receiver IDs, and the content of the message.
Proposal (Dataclass): Represents a proposal in the Paxos protocol, containing the node ID of the proposer, a unique proposal number, and the value proposed for consensus.
ConsensusValue: A class encapsulating the value agreed upon in the consensus process, providing a structured way to handle consensus data.
PaxosLogEntry (Dataclass): Defines the structure of a log entry, which includes detailed information about actions taken during the Paxos process, such as the nodes involved, their roles and states, the action performed, and the value associated with the action.
PaxosLogger: Manages logging for Paxos operations, storing entries, and providing functionality to save logs to a CSV file. This class facilitates tracking and debugging the Paxos process over time.
PaxosNode: Represents a node participating in the Paxos protocol. It includes mechanisms for sending and receiving proposals and promises, managing state changes, and logging activities. Each node keeps track of proposals it has received and maintains counts of acceptances to determine when consensus is reached.
Nodes: A container for managing multiple PaxosNode instances, facilitating operations like adding nodes and possibly managing their interaction at a higher level.
RunPaxos: A simulation class to execute and test the Paxos algorithm under various conditions. It includes methods to simulate basic consensus operations and handle leader failures, helping to test and validate the Paxos implementation in a controlled environment.
We start by declaring the basic data structures we are going to use in the code.
import csv
from enum import Enum
from dataclasses import dataclass
import logging
from datetime import datetime
# Enum to represent the possible states of a node within the Paxos consensus process.
class NodeState(Enum):
UP = "Up"
DOWN = "Down"
BLOCKED = "Blocked"
# Enum to define the roles that nodes can assume in the Paxos algorithm (either Leader or Acceptor).
class NodeRole(Enum):
LEADER = "Leader"
ACCEPTOR = "Acceptor"
# Class to act as collection of all nodes in the ring.
class Nodes:
def __init__(self, logger):
self.nodes = {}
self.logger = logger
    def add_node(self, node_id, number_of_nodes, role=NodeRole.ACCEPTOR):
        # Minimal body (the full version, with extra bookkeeping, is in the linked repository):
        # create the node and register it in the collection.
        self.nodes[node_id] = PaxosNode(node_id, role, self.logger, number_of_nodes)
        return self.nodes[node_id]
# Dataclass to encapsulate the information for messages exchanged between nodes in the Paxos protocol.
@dataclass
class Message:
sender_id: int
receiver_id: int
content: str
# Dataclass to represent a proposal made by a node in the Paxos consensus process, including its unique identifier and proposed value.
@dataclass
class Proposal:
node_id: int
proposal_number: int
value: str # Assuming proposals carry a value that's being agreed upon
# Abstract the consensus value in its own class
class ConsensusValue:
def __init__(self, data) -> None:
self.data = data
def __str__(self) -> str:
#return f"ConsensusValue({self.data})"
return f"({self.data})"
This is followed by the schema of the log entries and the logger class itself. Having a separate schema makes it easy to see what the content of the log file will look like.
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(message)s')
logger = logging.getLogger()
@dataclass
class PaxosLogEntry:
round: int
timestamp: datetime
from_node_id: int
from_node_role: str
from_node_state: str
to_node_id: int
to_node_role: str
to_node_state: str
action: str
action_value: str
consensus_value: str
consensus_reached: bool
class PaxosLogger:
def __init__(self, round, filename="paxosresult.csv"):
self._round = round
self.filename = filename
self.entries = []
@property
def round(self):
return self._round
@round.setter
def round(self, value):
if isinstance(value, int) and value >= 0:
self._round = value
else:
            raise ValueError("Consensus round must be a non-negative integer")
def record_log(self, from_node_id, from_node_role, from_node_state, to_node_id, to_node_role, to_node_state, action, action_value, consensus_value, consensus_reached):
entry = PaxosLogEntry(
round=self.round,
timestamp=datetime.now(),
from_node_id=from_node_id,
from_node_role=from_node_role,
from_node_state=from_node_state,
to_node_id=to_node_id,
to_node_role=to_node_role,
to_node_state=to_node_state,
action=action,
action_value=str(action_value),
consensus_value=str(consensus_value),
consensus_reached=consensus_reached
)
self.entries.append(entry)
def save_to_csv(self):
with open(self.filename, 'w', newline='') as file:
writer = csv.writer(file)
writer.writerow(['Round', 'Timestamp', 'From Node ID', 'From Node Role', 'From Node State', 'To Node ID', 'To Node Role', 'To Node State', 'Action', 'Action Value', 'Consensus Value', 'Consensus Reached'])
for entry in self.entries:
writer.writerow([
entry.round,
entry.timestamp.strftime("%Y-%m-%d %H:%M:%S"), # Ensuring timestamp is formatted properly
entry.from_node_id, entry.from_node_role, entry.from_node_state,
entry.to_node_id, entry.to_node_role, entry.to_node_state, entry.action, entry.action_value,
entry.consensus_value, entry.consensus_reached
])
def __enter__(self):
return self
def __exit__(self, exc_type, exc_val, exc_tb):
self.save_to_csv()
Here is a rough idea of what the log looks like. Note that the ConsensusReached flag is false in all rows, which means that for Round 1 consensus has not yet been reached. Such a log file makes it easy to trace the flow of messages between the nodes. In a real-world environment this is harder, because the nodes run on different machines, so at some point there has to be a common logger to which all nodes ship their logs and which merges them. Those intricacies are not covered here because we are focused on understanding the algorithm with a simple implementation.
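Once a run has produced paxosresult.csv, the message flow of a round can be inspected with a few lines of standard-library code (illustrative, not part of the implementation), using the column names written by PaxosLogger:

import csv

with open("paxosresult.csv") as f:
    for row in csv.DictReader(f):
        if row["Round"] == "1":
            print(f'{row["Timestamp"]}  node {row["From Node ID"]} -> node {row["To Node ID"]}: '
                  f'{row["Action"]} ({row["Action Value"]}), consensus reached: {row["Consensus Reached"]}')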
This is followed by the PaxosNode class itself. It acts as both leader and acceptor: send_prepare(), receive_promise() and send_accept() are its actions as leader, while receive_prepare(), send_promise(), receive_accept() and receive_broadcast() are its actions as acceptor. Letting one class play both roles keeps the code compact, but it can make the implementation a little harder to follow.
class PaxosNode:
def __init__(self, node_id, role, logger, node_count):
self.node_id = node_id
self.totalnodecount = node_count
self.state = NodeState.UP
        self.role = role  # Use the role passed in (acceptor by default; a node can be created as leader)
self.last_consensus = None
self.consensus_reached = False # Set when a new consensus is reached and immediately reset after that.
self.received_promises = set()
        self.logger = logger  # An instance of the custom PaxosLogger class, not Python's built-in logger.
self.acceptance_counts = {} # Track acceptance counts for each proposal.
def set_consensus(self, value):
if self.last_consensus != value:
self.last_consensus = value
self.consensus_reached = True
self.log_state_change()
            # Immediately reset the flag so the 'consensus reached' event is logged exactly once
self.reset_consensus_reached()
def reset_consensus_reached(self):
        # Clear the flag and log the updated state
self.consensus_reached = False
self.log_state_change()
def log_state_change(self):
# Log the change of state
self.logger.record_log(
from_node_id=self.node_id,
from_node_role=self.role.value,
from_node_state=self.state.value,
to_node_id=self.node_id,
to_node_role=self.role.value,
to_node_state=self.state.value,
action="ConsensusStateUpdate",
action_value=str(self.last_consensus),
consensus_value=str(self.last_consensus),
consensus_reached=self.consensus_reached
)
def send_prepare(self, nodes, proposal):
if self.state == NodeState.UP:
for node in nodes.values():
if node.state == NodeState.UP:
node.receive_prepare(proposal)
self.logger.record_log(
from_node_id=self.node_id,
from_node_role=self.role.value,
from_node_state=self.state.value,
to_node_id=node.node_id,
to_node_role=node.role.value,
to_node_state=node.state.value,
action="PrepareSend",
action_value=str(proposal.proposal_number),
consensus_value=str(self.last_consensus) if self.last_consensus else "None",
consensus_reached=False
)
else:
# Log that the message is not sent because the destination node is down
self.logger.record_log(
from_node_id=self.node_id,
from_node_role=self.role.value,
from_node_state=self.state.value,
to_node_id=node.node_id,
to_node_role=node.role.value,
to_node_state="DOWN",
action="PrepareNotSent",
action_value=str(proposal.proposal_number),
consensus_value=str(self.last_consensus) if self.last_consensus else "None",
consensus_reached=False
)
else:
# Log that no messages are sent because the source node is down
self.logger.record_log(
from_node_id=self.node_id,
from_node_role=self.role.value,
from_node_state="DOWN",
to_node_id="N/A",
to_node_role="N/A",
to_node_state="N/A",
action="PrepareNotSent",
action_value="N/A",
consensus_value="N/A",
consensus_reached=False
)
print(f"Node {self.node_id} is DOWN. No prepare messages sent.")
def receive_prepare(self, proposal):
if self.state == NodeState.UP:
# Log receiving a prepare request
self.logger.record_log(
from_node_id=proposal.node_id,
from_node_role=self.role.value, # Assuming you may not know the role here unless you maintain more state or look it up
from_node_state=self.state.value, # Same as role, adjust if you have access to this info
to_node_id=self.node_id,
to_node_role=self.role.value, # Assuming roles are stored as enum and logging their string representation
to_node_state=self.state.value,
action="PrepareReceive",
action_value="None", # Adjust if there's specific value related to prepare you want to log
consensus_value=str(self.last_consensus) if self.last_consensus else "None",
consensus_reached=False
)
print(f"Node {self.node_id} received prepare from {proposal.node_id}")
# Log decision to send a promise based on state
if self.last_consensus is not None:
self.send_promise(proposal)
# Optionally log that we are sending a promise with an existing consensus value
self.logger.record_log(
from_node_id=self.node_id,
from_node_role=self.role.value,
from_node_state=self.state.value,
to_node_id=proposal.node_id,
to_node_role=self.role.value, # Adjust based on actual knowledge
to_node_state=self.state.value,
action="PromiseSend",
action_value="ExistingConsensus", # Indicate this promise is based on existing consensus
consensus_value=str(self.last_consensus),
consensus_reached=False
)
else:
self.send_promise(proposal)
# Optionally log that we are sending a standard promise
self.logger.record_log(
from_node_id=self.node_id,
from_node_role=self.role.value,
from_node_state=self.state.value,
to_node_id=proposal.node_id,
to_node_role='Unknown',
to_node_state='Unknown',
action="PromiseSend",
action_value="NewPromise", # Indicate this is a new promise without existing consensus
consensus_value="None",
consensus_reached=False
)
def send_promise(self, proposal):
if self.state == NodeState.UP:
# Add the proposal number to the set of received promises
self.received_promises.add(proposal.proposal_number)
# Check if there was a consensus in the current or a previous round
consensus_achieved = self.last_consensus is not None
consensus_value = self.last_consensus if consensus_achieved else "None"
# Log sending a promise
self.logger.record_log(
from_node_id=self.node_id,
from_node_role=self.role.value, # Assuming role is stored as an enum
from_node_state=self.state.value,
to_node_id=proposal.node_id,
to_node_role='Unknown', # This could be enhanced if you have a way to look up the role
to_node_state='Unknown', # This could also be enhanced if you have node state information available
action="PromiseSend",
action_value=str(proposal.proposal_number),
consensus_value=str(consensus_value),
consensus_reached=consensus_achieved
)
# Respond to the leader with the promise and the last known consensus if it exists
# This could be sent back via a network message or other means depending on system architecture
return {
'proposal_number': proposal.proposal_number,
'node_id': self.node_id,
'last_consensus': consensus_value,
'consensus_achieved': consensus_achieved
}
def receive_promise(self, proposal):
# Track promises received
self.received_promises.add(proposal.proposal_number)
# Log the reception of a promise
self.logger.record_log(
from_node_id=proposal.node_id,
from_node_role=self.role.value, # Role might not be known; update if possible
from_node_state=self.state.value, # State might not be known; update if possible
to_node_id=self.node_id,
to_node_role=self.role.value,
to_node_state=self.state.value,
action="PromiseReceive",
action_value=str(proposal.proposal_number),
consensus_value=str(proposal.value if proposal.value else "None"),
consensus_reached=False
)
print(f"Node {self.node_id} received promise from {proposal.node_id} with proposal number {proposal.proposal_number} and value {proposal.value if proposal.value else 'None'}")
# Decide on the action based on promises received
self.decide_on_promises_received(proposal)
def decide_on_promises_received(self, proposal):
        # Simplified placeholder: a full implementation would tally the accepted values
        # reported in the promises and re-adopt the value of the highest-numbered one.
        # Here we only count the entries collected in received_promises and check for a majority.
majority_needed = self.totalnodecount // 2 + 1
consensus_count = {}
for received_proposal in self.received_promises:
value = received_proposal
if value not in consensus_count:
consensus_count[value] = 0
consensus_count[value] += 1
if consensus_count[value] >= majority_needed:
self.logger.record_log(
from_node_id=self.node_id,
from_node_role=self.role.value,
from_node_state=self.state.value,
to_node_id=self.node_id,
to_node_role=self.role.value,
to_node_state=self.state.value,
action="ConsensusAchieved",
action_value=str(value),
consensus_value=str(value),
consensus_reached=True
)
print(f"Consensus previously reached on value {value}, re-adopting this consensus.")
# Optionally, re-broadcast this consensus or take further actions
return
print("No consensus reached previously, proceeding with current proposals.")
def send_accept(self, nodes, proposal):
if self.state == NodeState.UP:
for node in nodes.values():
if node.state == NodeState.UP:
# Log before sending accept to each node
self.logger.record_log(
from_node_id=self.node_id,
from_node_role=self.role.value, # Assuming role is stored as an enum
from_node_state=self.state.value,
to_node_id=node.node_id,
to_node_role=node.role.value, # Assuming you have access to node's role
to_node_state=node.state.value,
action="AcceptSend",
action_value=str(proposal.proposal_number),
consensus_value=str(proposal.value),
consensus_reached=False # This might be updated only after all acceptances
)
node.receive_accept(proposal)
def receive_accept(self, proposal):
if self.state == NodeState.UP:
key = (proposal.proposal_number, proposal.value) # Key by both proposal number and value
# Increment the acceptance count for the given proposal and value
if key in self.acceptance_counts:
self.acceptance_counts[key] += 1
else:
self.acceptance_counts[key] = 1
# Determine majority
            majority = self.totalnodecount // 2 + 1  # Simple majority: floor(n/2) + 1
consensus_reached = self.acceptance_counts[key] >= majority
# Log the reception of an accept message
self.logger.record_log(
from_node_id=proposal.node_id,
from_node_role=self.role.value, # Placeholder until role information is available or derived
from_node_state=self.state.value, # Placeholder until state information can be obtained
to_node_id=self.node_id,
to_node_role=self.role.value,
to_node_state=self.state.value,
action="AcceptReceive",
action_value=str(proposal.value),
consensus_value=str(proposal.value),
consensus_reached=consensus_reached
)
print(f"Node {self.node_id} received accept from {proposal.node_id} with proposal {proposal.proposal_number} and value {proposal.value}, consensus reached: {consensus_reached}")
# Update last consensus if majority is reached
            if consensus_reached:
                # Record the value via set_consensus so the state change is logged consistently
                self.set_consensus(proposal.value)
def receive_broadcast(self, proposal):
if self.state == NodeState.UP:
# This is the point when we assume that consensus has been achieved.
self.set_consensus(proposal.value)
# Log the reception of a broadcast message
self.logger.record_log(
from_node_id=proposal.node_id,
from_node_role=self.role.value, # Role might not be known; adjust if more information is available
from_node_state=self.state.value, # State might not be known; adjust similarly
to_node_id=self.node_id,
to_node_role=self.role.value,
to_node_state=self.state.value,
action="BroadcastReceive",
action_value=str(proposal.value),
consensus_value=str(proposal.value),
consensus_reached=True # Assuming broadcast implies consensus reached
)
print(f"Node {self.node_id} received broadcast from {proposal.node_id} with proposal {proposal.proposal_number} and value {proposal.value}")
self.last_consensus = proposal.value
    def go_down(self):
        # Simulate a crash: mark the node as unavailable so it stops sending and receiving.
        # (The remaining helper methods, such as recovery, are in the full implementation linked at the end.)
        self.state = NodeState.DOWN
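To see how the pieces fit together, here is a minimal driver in the spirit of the RunPaxos test class described earlier. The exact RunPaxos API lives in the linked repository; this is only a sketch that wires up the classes defined above and pushes the value 42 through prepare, accept, and broadcast.

if __name__ == "__main__":
    node_count = 5
    with PaxosLogger(round=1) as paxos_logger:
        nodes = Nodes(paxos_logger)
        nodes.add_node(0, node_count, role=NodeRole.LEADER)
        for i in range(1, node_count):
            nodes.add_node(i, node_count)
        leader = nodes.nodes[0]

        proposal = Proposal(node_id=leader.node_id, proposal_number=1, value="42")
        leader.send_prepare(nodes.nodes, proposal)   # Phase 1: collect promises
        leader.send_accept(nodes.nodes, proposal)    # Phase 2: ask acceptors to accept the value
        for node in nodes.nodes.values():            # Phase 3: broadcast the outcome
            node.receive_broadcast(proposal)

        print({node_id: str(node.last_consensus) for node_id, node in nodes.nodes.items()})
    # Leaving the 'with' block writes the message log to paxosresult.csv.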
A link to the full implementation is given at the end of the blog, but first it is worth touching on the most valuable aspect of this exercise.
A discussion of potential improvements, optimizations, and test cases for such an algorithm is a great way to train the mind in the distributed-systems way of thinking about consensus, commit, eventual consistency, latency, throughput, availability, transactions, node membership, replication, and so on. Some ideas worth pondering:
- What if some nodes are slow? How does that affect the time taken to reach consensus?
- What if all nodes hold information about all other nodes and a leader is designated in this datastore? Does it start to feel like the Raft protocol?
- How can an external storage source make things easier, e.g., node membership managed through ZooKeeper?
- Design a case where consensus is never achieved. How would you prevent such a situation? (A minimal sketch of one such ‘dueling proposers’ livelock follows this list.)
- Do all nodes even need to send messages? What if a set of, say, 100 nodes is divided into groups of 51 and 49? Attempt consensus only within the group of 51 and, if one node is slow or unresponsive, replace it with a node from the group of 49. Does this approach make things faster? What are the pitfalls?
- Instead of all nodes sending messages to all other nodes, is it better to create a ring where each node only sends messages to the next two and previous two nodes, in the order in which they were added to the ring? Have you heard of the Chord protocol for distributed hash tables?
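For the ‘consensus is never achieved’ question above, here is a tiny standalone sketch (independent of the PaxosNode class) of the classic dueling-proposers livelock: two proposers keep pre-empting each other's prepare phase, so no accept ever reaches a majority.

class TinyAcceptor:
    def __init__(self):
        self.promised_number = 0  # highest proposal number this acceptor has promised

    def prepare(self, n):
        if n > self.promised_number:
            self.promised_number = n
            return True
        return False

    def accept(self, n):
        # Accept only proposals numbered at least as high as the latest promise.
        return n >= self.promised_number

acceptors = [TinyAcceptor() for _ in range(3)]
majority = len(acceptors) // 2 + 1

n_a = 1
[a.prepare(n_a) for a in acceptors]                        # A prepares with n=1; all promise
n_b = 2
[a.prepare(n_b) for a in acceptors]                        # B prepares with n=2 before A's accept arrives
print(sum(a.accept(n_a) for a in acceptors) >= majority)   # False: A's accept(1) is rejected
n_a = 3
[a.prepare(n_a) for a in acceptors]                        # A retries with n=3 before B's accept arrives
print(sum(a.accept(n_b) for a in acceptors) >= majority)   # False: B's accept(2) is rejected
# ...and so on forever, unless proposers back off randomly or defer to a distinguished leader.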
PAXOS vs RAFT
Nodes Required to Achieve Consensus:
Both Paxos and Raft require a majority of nodes (more than half) to participate and agree in order to achieve consensus. This ensures that there is agreement even in the presence of node failures.
Throughput and Latency:
Paxos: Can have lower throughput due to its complex multi-phase consensus process, potentially increasing latency especially in configurations where nodes are geographically dispersed.
Raft: Generally offers higher throughput and lower latency compared to Paxos due to its more streamlined approach, where a single leader manages log entries, simplifying coordination and communication.
Complexity:
Paxos is known for its complexity, which can make it difficult to understand and implement correctly. This complexity arises from its multiple roles and phases, each requiring careful management of state and coordination.
Raft is designed to be simpler and easier to understand, with clearer division into roles and fewer states to manage, making it more accessible for implementation and debugging.
Time Taken to Reach Consensus:
Paxos: The time to reach consensus can vary widely depending on the number of rounds needed to resolve conflicts among proposals, particularly when nodes propose different values simultaneously.
Raft: Tends to reach consensus more quickly in typical scenarios due to its structured leader election and log replication process. The leader simplifies decision-making, which speeds up the consensus.
Number of Messages Required to Reach Consensus:
Paxos: Requires a higher number of messages per consensus decision, especially as the number of participant nodes increases. Each phase of the consensus (prepare and accept) involves multiple rounds of messages across all nodes.
Raft: Reduces the number of messages needed by having a single leader handle most of the coordination and communication. Messages are primarily between the leader and other nodes for log replication and heartbeats to maintain authority.
An interesting next step for any student is to implement the same algorithm with truly distributed nodes. The nodes can communicate with each other through REST APIs or TCP. It will require a good membership implementation so that new nodes can be added and all other nodes learn about them; this can be built natively or with tools like ZooKeeper or etcd to manage membership. Testing such a setup is the hardest part, because you cannot control the order of messages sent by each node, particularly when multiple leaders try to build consensus at the same time. In industry, this is solved by placing a transport layer beneath all nodes and blocking all messages to a particular node; a node then only sends or receives messages, in order, when you release the lock.
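One way to picture that transport-layer trick is a per-node inbox that a test can freeze and release. Nothing here comes from the implementation above; it is only a sketch of the testing idea.

from collections import deque

class BlockingTransport:
    # Test harness sketch: every message goes through a per-node queue, and a test can block
    # a node's delivery to control exactly when, and in what order, it sees messages.
    def __init__(self):
        self.inboxes = {}      # node_id -> deque of pending messages
        self.blocked = set()   # node_ids whose delivery is currently frozen

    def register(self, node_id):
        self.inboxes[node_id] = deque()

    def send(self, to_node_id, message):
        self.inboxes[to_node_id].append(message)

    def block(self, node_id):
        self.blocked.add(node_id)

    def release(self, node_id):
        self.blocked.discard(node_id)

    def deliver_pending(self, node_id, handler):
        # No-op while the node is blocked; otherwise hand queued messages to the node's handler in order.
        if node_id in self.blocked:
            return
        while self.inboxes[node_id]:
            handler(self.inboxes[node_id].popleft())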
CONCLUSION
This article walked through the Paxos algorithm: its intuition, message flow, and implementation. We discussed potential improvements, interesting test cases, and tricky situations. The implementation keeps all nodes as objects in a single process, which makes it easy to log every message and step through the algorithm. We concluded with additional food for thought and a comparison between Paxos and Raft.
REFERENCES
1. David Mazieres, Professor of Computer Science at Stanford University, who first taught me the algorithm in his class.
2. Full implementation code: https://github.com/sidscrazy/BlogCode/tree/main/Paxos
3. The official Raft consensus algorithm page
4. Leslie Lamport, "The Part-Time Parliament" (the original Paxos paper), ACM Transactions on Computer Systems, 1998.