Explore YugabyteDB fault tolerance on local database cluster

Vladimir Novick
InstructorVladimir Novick
Share this video with your friends

Social Share Links

Send Tweet

YugabyteDB can automatically handle failures and therefore provides high availability . You will create YSQL tables with a replication factor (RF) of 3 that allows a fault tolerance of 1. This means the cluster will remain available for both reads and writes even if one node fails. However, if another node fails bringing the number of failures to two, then writes will become unavailable on the cluster in order to preserve data consistency.

In this lesson we will see how fault tolerance is tested by using workload sample app and by stopping nodes in the local cluster.

Written instructions to get up and running.

Vladimir Novick: [0:00] Let's start a new YugaByte cluster with three nodes. The Replication Factor of the cluster will be 3, which will allow for a tolerance of 1. If I go to YugaByte dashboard, you will see the Replication Factor is 3. It means if one of the nodes will be down, our database will continue to work.

[0:30] To test the workload of the cluster, we need to download a sample app provided by YugaByte to test the amount of reads and writes of the database. To run the SqlInserts workload, we'll paste the Java JAR command provided by YugaByte in application docs.

[0:53] If we go back to our console and go to the Tablet Servers, you'll see that there is amount of reads per second and amount of writes per second. You see the reads, writes numbers are quite high.

[1:13] We can see that they have even load on our Tablet Servers distributed among the nodes. Let's remove one of the nodes by running yb-ctl remove_node 3. Now, if we'll run the yb-ctl status, we will see that node 3 is stopped. If we go back to our admin UI, we'll still see that our nodes are alive, but the time since heartbeat is rising.

[2:00] As soon as it will rise to 60 seconds, we will see this node as dead. Indeed, we have a dead node here. If we look at the number of reads, number of writes, it's . We still have reads and writes happening on the database. Our cluster has replication factor of 3, meaning we can tolerate one failure, that's why you can see write happening here.

[2:38] Let's remove another node. If we go back to our admin UI, we're still seeing the time since heartbeat is rising. After 60 seconds has passed, we see this node is dead. Now, we will see writes close to and still some reads. If we go to the logs, we will see Timed out, time out on write.

[3:17] Let's start node number 2. We'll see how the cluster is back to life. You can immediately see that reads and writes rises up. Let's start node 3 to get our cluster back. Now, it's back and running.