Well the problem is obvious in mobile devices, but any asymmetric netsplit will ...

danielmewes · on July 20, 2015

Ah, I have to check how we behave in that case. Thanks for the pointers.

danielmewes · on July 20, 2015

If I understand that scenario correctly, it assumes that there is a node that can successfully send messages to the other nodes in the cluster, but cannot receive any responses.

Since RethinkDB uses TCP connections, this shouldn't usually happen (since the TCP acknowledgements wouldn't get through either). The exception might be a layer 5 router / firewall somewhere in the network that allows the TCP connection to work, but only passes the data stream through in one direction. RethinkDB is partially protected against this case, because we use bidirectional heartbeats on top of the same TCP connection that is used for Raft traffic. The heartbeat usually ensures that the other host is still alive and reachable. In this case, the node that cannot receive any responses from the remaining cluster would get a heartbeat timeout after a couple of seconds, and disconnect from the remaining servers. This is turn should limit further damage and allow the Raft cluster to proceed.

Please let me know if I'm missing something else.

Edit: There might still be some problems with non-transitive connectivity, where one node can talk to only parts of the cluster. We have built in some protection against this, but don't always handle that case well.