totem: Drop invalid join msg in operational state
According to the totem paper, if a processor
receives a join message in the operational state and if the
receiver's identifier is in the join message's fail list,
then the join message should be ignored.
Applying this validation of join messages avoids unnecessary
switches from operational state to gather state, which in the worst
case prevent rings from ever merging, as in the following scenario.
1. Initially, there is a single ring containing three nodes, say
ring(A,B,C).
2. The network partitions A from B and, at the same time, C goes down.
3. Node A sends a join message with proclist:A,B,C. faillist:NULL.
Node B sends a join message with proclist:A,B,C. faillist:NULL.
4. Both A and B hit the consensus timeout because of the network
partition.
5. The network between A and B remerges.
6. Node A sends a join message with proclist:A,B,C. faillist:B,C. and
creates ring(A).
Node B sends a join message with proclist:A,B,C. faillist:A,C. and
creates ring(B).
7. Say the join message with proclist:A,B,C. faillist:A,C sent
by node B is received by node A because the network remerged.
8. Node A shifts to gather state and sends out a modified join message
with proclist:A,B,C. faillist:B. Such a join message prevents
A and B from merging.
9. Node A hits the consensus timeout (caused by waiting for node C) and
sends a join message with proclist:A,B,C. faillist:B,C again.
The same thing happens on node B, so A and B loop forever
through steps 7, 8 and 9.
The paper also says: "If a processor receives a join message in the
operational state and if the sender's identifier is in the receiver's
my_proclist and the join message's ring_seq is less than the receiver's
ring sequence number, then it ignores the join message too." So this
patch applies both validations of join messages together.
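The two checks can be sketched roughly as below. This is only an
illustration under simplifying assumptions, not the code in this patch:
struct join_msg, struct receiver_state, MAX_NODES, id_in_list() and
join_ignored_in_operational() are hypothetical names, and node
identifiers are reduced to plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 16 /* hypothetical limit for this sketch */

/* Hypothetical, simplified join message (not the real wire format). */
struct join_msg {
	uint32_t sender_id;
	uint64_t ring_seq;
	uint32_t fail_list[MAX_NODES];
	size_t fail_count;
};

/* Hypothetical view of the receiver while it is in operational state. */
struct receiver_state {
	uint32_t my_id;
	uint64_t my_ring_seq;
	uint32_t my_proclist[MAX_NODES];
	size_t my_proc_count;
};

/* Linear search for a node identifier in a list of identifiers. */
bool id_in_list(uint32_t id, const uint32_t *list, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		if (list[i] == id) {
			return true;
		}
	}
	return false;
}

/*
 * Returns true when a join message received in operational state
 * should be dropped without shifting to gather state.
 */
bool join_ignored_in_operational(const struct receiver_state *rx,
				 const struct join_msg *join)
{
	assert(rx != NULL && join != NULL);
	assert(join->fail_count <= MAX_NODES);
	assert(rx->my_proc_count <= MAX_NODES);

	/* Check 1: the receiver is in the join message's fail list. */
	if (id_in_list(rx->my_id, join->fail_list, join->fail_count)) {
		return true;
	}

	/*
	 * Check 2: the sender is in the receiver's my_proclist and the
	 * join message's ring_seq is older than the receiver's.
	 */
	if (id_in_list(join->sender_id, rx->my_proclist, rx->my_proc_count) &&
	    join->ring_seq < rx->my_ring_seq) {
		return true;
	}

	return false;
}
```

In step 7 above, node A (operational in ring(A)) receives B's join
message with faillist:A,C. Check 1 matches, so in this sketch A would
stay operational instead of shifting to gather state.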
Signed-off-by: Jason <huzhijiang@gmail.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>