From: Yunkai Zhang:
Today I observed one of the reasons corosync runs into the
FAILED TO RECEIVE state.

There were five nodes (A, B, C, D, E) in my test, and I limited the rate
of incoming UDP packets on node C with these iptables commands:
iptables -A INPUT -i eth0 -p udp -m limit --limit 10000/s
--limit-burst 1 -j ACCEPT
iptables -A INPUT -i eth0 -p udp -j DROP

After about an hour, node C had missed some MCAST messages.
Its state was as follows:
==state of C node==
my_aru:0x805
my_high_seq_received:0xC2C
my_aru_count:7

=>received MCAST message with seq:806 from node B
=>enter *message_handler_mcast*
=>add this message to regular_sort_queue
...
=>enter *update_aru* function
=> range = (my_high_seq_received - my_aru)
= (0xC2C - 0x805)
= 1063
=> if range>1024, do nothing and return directly.
==END==
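
The arithmetic in the trace above can be checked with a small sketch. The helper names aru_range and aru_update_skipped are hypothetical and do not exist in totemsrp.c; the sketch assumes 32-bit unsigned sequence numbers and only reproduces the subtraction and the 1024 comparison described in the trace.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of totemsrp.c): the gap that update_aru
 * computes between the highest sequence number seen and my_aru.
 */
static inline uint32_t aru_range(uint32_t my_high_seq_received, uint32_t my_aru)
{
	return my_high_seq_received - my_aru;
}

/*
 * Hypothetical helper: returns 1 when the range check described above
 * would make update_aru return without touching my_aru.
 */
static inline int aru_update_skipped(uint32_t my_high_seq_received, uint32_t my_aru)
{
	return aru_range(my_high_seq_received, my_aru) > 1024;
}
```

With the values from node C, aru_range(0xC2C, 0x805) is 0x427, which is 1063 in decimal, so the check fires and my_aru stays at 0x805.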

According to this logic, once (my_high_seq_received - my_aru) > 1024,
my_aru is no longer updated, even though corosync can still receive
MCAST messages retransmitted by other nodes.

But at that time, my_aru_count was only 7. So corosync on node C
would stay in this state until my_aru_count reached
fail_to_recv_const (the default value is 2500). That is a long time
for corosync, and it is simply wasted.

To solve this issue, maybe we can enlarge the range condition in the
update_aru function? Or we could just drop the range check entirely;
that seems harmless, because fail_to_recv_const already bounds this
situation.
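
The effect of dropping the check can be illustrated with a simplified model. This is only a sketch: struct model_state, MODEL_QUEUE_SIZE, model_update_aru and model_aru_after_retransmit are hypothetical names, and a plain bool array stands in for the real sort queue. It assumes that sequence numbers 0x806 through 0x810 have been retransmitted to node C while later ones are still missing.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_QUEUE_SIZE 4096	/* hypothetical size, large enough for 0xC2C */

/* Hypothetical stand-in for the few fields update_aru reads. */
struct model_state {
	uint32_t my_aru;
	uint32_t my_high_seq_received;
	bool received[MODEL_QUEUE_SIZE];	/* received[seq] is true once seq arrived */
};

/* Advance my_aru over the contiguous run of received messages. */
static void model_update_aru(struct model_state *s, bool range_guard)
{
	uint32_t range = s->my_high_seq_received - s->my_aru;
	uint32_t i;

	if (range_guard && range > 1024) {
		return;		/* the early return discussed above */
	}
	for (i = 1; i <= range; i++) {
		if (!s->received[s->my_aru + i]) {
			break;	/* first gap: stop here */
		}
	}
	s->my_aru += i - 1;
}

/* Replay node C's state, then deliver retransmits for 0x806..0x810. */
static uint32_t model_aru_after_retransmit(bool range_guard)
{
	static struct model_state s;
	uint32_t seq;

	s = (struct model_state){ .my_aru = 0x805, .my_high_seq_received = 0xC2C };
	for (seq = 0x806; seq <= 0x810; seq++) {
		s.received[seq] = true;
	}
	model_update_aru(&s, range_guard);
	return s.my_aru;
}
```

In this model, model_aru_after_retransmit(true) leaves my_aru at 0x805 despite the retransmits, while model_aru_after_retransmit(false) advances it to 0x810, which is the behavior the proposal is after.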

Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
(cherry picked from commit e48ddf99a67754dea056a54f404f3638cf829b9c)

Steven Dake, 14 years ago
Commit 98f49b2e18
1 changed file with 0 additions and 3 deletions

exec/totemsrp.c (+0 -3)

@@ -2354,9 +2354,6 @@ static void update_aru (
 	}
 
 	range = instance->my_high_seq_received - instance->my_aru;
-	if (range > 1024) {
-		return;
-	}
 
 	my_aru_saved = instance->my_aru;
 	for (i = 1; i <= range; i++) {