|
@@ -0,0 +1,348 @@
|
|
|
|
|
+.\"/*
|
|
|
|
|
+.\" * Copyright (c) 2012 Red Hat, Inc.
|
|
|
|
|
+.\" *
|
|
|
|
|
+.\" * All rights reserved.
|
|
|
|
|
+.\" *
|
|
|
|
|
+.\" * Authors: Christine Caulfield <ccaulfie@redhat.com>
|
|
|
|
|
+.\" * Fabio M. Di Nitto <fdinitto@redhat.com>
|
|
|
|
|
+.\" *
|
|
|
|
|
+.\" * This software licensed under BSD license, the text of which follows:
|
|
|
|
|
+.\" *
|
|
|
|
|
+.\" * Redistribution and use in source and binary forms, with or without
|
|
|
|
|
+.\" * modification, are permitted provided that the following conditions are met:
|
|
|
|
|
+.\" *
|
|
|
|
|
+.\" * - Redistributions of source code must retain the above copyright notice,
|
|
|
|
|
+.\" * this list of conditions and the following disclaimer.
|
|
|
|
|
+.\" * - Redistributions in binary form must reproduce the above copyright notice,
|
|
|
|
|
+.\" * this list of conditions and the following disclaimer in the documentation
|
|
|
|
|
+.\" * and/or other materials provided with the distribution.
|
|
|
|
|
+.\" * - Neither the name of the MontaVista Software, Inc. nor the names of its
|
|
|
|
|
+.\" * contributors may be used to endorse or promote products derived from this
|
|
|
|
|
+.\" * software without specific prior written permission.
|
|
|
|
|
+.\" *
|
|
|
|
|
+.\" * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
|
|
|
|
+.\" * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
|
|
|
+.\" * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
|
|
|
+.\" * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
|
|
|
|
|
+.\" * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
|
|
|
|
|
+.\" * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
|
|
|
|
|
+.\" * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
|
|
|
|
|
+.\" * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
|
|
|
|
|
+.\" * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
|
|
|
|
|
+.\" * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
|
|
|
|
|
+.\" * THE POSSIBILITY OF SUCH DAMAGE.
|
|
|
|
|
+.\" */
|
|
|
|
|
+.TH VOTEQUORUM 5 2012-01-24 "corosync Man Page" "Corosync Cluster Engine Programmer's Manual"
|
|
|
|
|
+.SH NAME
|
|
|
|
|
+votequorum \- Votequorum Configuration Overview
|
|
|
|
|
+.SH OVERVIEW
|
|
|
|
|
+The votequorum service is part of the corosync project. This service can be optionally loaded
|
|
|
|
|
+into the nodes of a corosync cluster to avoid split-brain situations.
|
|
|
|
|
+It does this by having a number of votes assigned to each system in the cluster and ensuring
|
|
|
|
|
+that cluster operations are allowed to proceed only when a majority of the votes is present.
|
|
|
|
|
+The service must be loaded into all nodes or none. If it is loaded into a subset of cluster nodes,
|
|
|
|
|
+the results will be unpredictable.
|
|
|
|
|
+.PP
|
|
|
|
|
+The following corosync.conf extract will enable votequorum service within corosync:
|
|
|
|
|
+.PP
|
|
|
|
|
+.nf
|
|
|
|
|
+quorum {
|
|
|
|
|
+ provider: corosync_votequorum
|
|
|
|
|
+}
|
|
|
|
|
+.fi
|
|
|
|
|
+.PP
|
|
|
|
|
+votequorum reads its configuration from corosync.conf. Some values can be changed at runtime; others
|
|
|
|
|
+are only read at corosync startup. It is very important that those values are consistent
|
|
|
|
|
+across all the nodes participating in the cluster or votequorum behavior will be unpredictable.
|
|
|
|
|
+.PP
|
|
|
|
|
+votequorum requires an expected_votes value to function; this can be provided in two ways.
|
|
|
|
|
+The number of expected votes will be automatically calculated when the nodelist { } section is
|
|
|
|
|
+present in corosync.conf or expected_votes can be specified in the quorum { } section. Lack of
|
|
|
|
|
+both will disable votequorum. If both are present at the same time,
|
|
|
|
|
+the quorum.expected_votes value will override the one calculated from the nodelist.
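+.PP
+As an illustrative sketch of the override case (the addresses and the value 5 are
+arbitrary assumptions, not recommendations), the following extract lists 3 nodes but sets
+expected_votes explicitly, so votequorum would use 5 rather than the value calculated from the nodelist:
+.nf
+
+quorum {
+ provider: corosync_votequorum
+ expected_votes: 5
+}
+
+nodelist {
+ node {
+ ring0_addr: 192.168.1.1
+ }
+ node {
+ ring0_addr: 192.168.1.2
+ }
+ node {
+ ring0_addr: 192.168.1.3
+ }
+}
+.fi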
|
|
|
|
|
+.PP
|
|
|
|
|
+Example (no nodelist) of an 8 node cluster (each node has 1 vote):
|
|
|
|
|
+.nf
|
|
|
|
|
+
|
|
|
|
|
+quorum {
|
|
|
|
|
+ provider: corosync_votequorum
|
|
|
|
|
+ expected_votes: 8
|
|
|
|
|
+}
|
|
|
|
|
+.fi
|
|
|
|
|
+.PP
|
|
|
|
|
+Example (with nodelist) of a 3 node cluster (each node has 1 vote):
|
|
|
|
|
+.nf
|
|
|
|
|
+
|
|
|
|
|
+quorum {
|
|
|
|
|
+ provider: corosync_votequorum
|
|
|
|
|
+}
|
|
|
|
|
+
|
|
|
|
|
+nodelist {
|
|
|
|
|
+ node {
|
|
|
|
|
+ ring0_addr: 192.168.1.1
|
|
|
|
|
+ }
|
|
|
|
|
+ node {
|
|
|
|
|
+ ring0_addr: 192.168.1.2
|
|
|
|
|
+ }
|
|
|
|
|
+ node {
|
|
|
|
|
+ ring0_addr: 192.168.1.3
|
|
|
|
|
+ }
|
|
|
|
|
+}
|
|
|
|
|
+.fi
|
|
|
|
|
+.SH SPECIAL FEATURES
|
|
|
|
|
+.PP
|
|
|
|
|
+.B two_node: 1
|
|
|
|
|
+.PP
|
|
|
|
|
+Enables two node cluster operations (default: 0).
|
|
|
|
|
+.PP
|
|
|
|
|
+The "two node cluster" is a use case that requires special consideration.
|
|
|
|
|
+With a standard two node cluster, each node with a single vote, there
|
|
|
|
|
+are 2 votes in the cluster. Using the simple majority calculation
|
|
|
|
|
+(50% of the votes + 1) to calculate quorum, the quorum would be 2.
|
|
|
|
|
+This means that both nodes would always have
|
|
|
|
|
+to be alive for the cluster to be quorate and operate.
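+.PP
+As a rough illustration, assuming the simple majority calculation is applied with
+integer division (an assumption; this page only states the rule informally), the quorum
+for the expected_votes values used in the examples on this page would be:
+.nf
+
+expected_votes    quorum
+      2             2
+      3             2
+      5             3
+      8             5
+.fi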
|
|
|
|
|
+.PP
|
|
|
|
|
+When two_node: 1 is enabled, quorum is artificially set to 1.
|
|
|
|
|
+.PP
|
|
|
|
|
+Example configuration 1:
|
|
|
|
|
+
|
|
|
|
|
+.nf
|
|
|
|
|
+quorum {
|
|
|
|
|
+ provider: corosync_votequorum
|
|
|
|
|
+ expected_votes: 2
|
|
|
|
|
+ two_node: 1
|
|
|
|
|
+}
|
|
|
|
|
+.fi
|
|
|
|
|
+
|
|
|
|
|
+.PP
|
|
|
|
|
+Example configuration 2:
|
|
|
|
|
+
|
|
|
|
|
+.nf
|
|
|
|
|
+quorum {
|
|
|
|
|
+ provider: corosync_votequorum
|
|
|
|
|
+ two_node: 1
|
|
|
|
|
+}
|
|
|
|
|
+
|
|
|
|
|
+nodelist {
|
|
|
|
|
+ node {
|
|
|
|
|
+ ring0_addr: 192.168.1.1
|
|
|
|
|
+ }
|
|
|
|
|
+ node {
|
|
|
|
|
+ ring0_addr: 192.168.1.2
|
|
|
|
|
+ }
|
|
|
|
|
+}
|
|
|
|
|
+.fi
|
|
|
|
|
+.PP
|
|
|
|
|
+NOTES: enabling two_node: 1 automatically enables wait_for_all. It is
|
|
|
|
|
+still possible to override wait_for_all by explicitly setting it to 0.
|
|
|
|
|
+If more than 2 nodes join the cluster, the two_node option is
|
|
|
|
|
+automatically disabled.
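+.PP
+A sketch of the override mentioned in the notes above, keeping two_node enabled while
+explicitly disabling wait_for_all (shown only to illustrate the syntax; whether this is
+appropriate depends on the cluster):
+.nf
+
+quorum {
+ provider: corosync_votequorum
+ expected_votes: 2
+ two_node: 1
+ wait_for_all: 0
+}
+.fi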
|
|
|
|
|
+.PP
|
|
|
|
|
+.B wait_for_all: 1
|
|
|
|
|
+.PP
|
|
|
|
|
+Enables Wait For All (WFA) feature (default: 0).
|
|
|
|
|
+.PP
|
|
|
|
|
+The general behaviour of votequorum is to switch a cluster from inquorate to quorate
|
|
|
|
|
+as soon as possible. For example, in an 8 node cluster, where every node has 1 vote,
|
|
|
|
|
+expected_votes is set to 8 and quorum is (50% + 1) 5. As soon as 5 (or more) nodes
|
|
|
|
|
+are visible to each other, the partition of 5 (or more) becomes quorate and can
|
|
|
|
|
+start operating.
|
|
|
|
|
+.PP
|
|
|
|
|
+When WFA is enabled, the cluster will be quorate for the first time
|
|
|
|
|
+only after all nodes have been visible at least once at the same time.
|
|
|
|
|
+.PP
|
|
|
|
|
+This feature has the advantage of avoiding some startup race conditions, at the cost
|
|
|
|
|
+that all nodes need to be up at the same time at least once before the cluster
|
|
|
|
|
+can operate.
|
|
|
|
|
+.PP
|
|
|
|
|
+A common startup race condition based on the above example is that as soon as 5
|
|
|
|
|
+nodes become quorate, with the other 3 still offline, the remaining 3 nodes will
|
|
|
|
|
+be fenced.
|
|
|
|
|
+.PP
|
|
|
|
|
+It is very useful when combined with last_man_standing (see below).
|
|
|
|
|
+.PP
|
|
|
|
|
+Example configuration:
|
|
|
|
|
+.nf
|
|
|
|
|
+
|
|
|
|
|
+quorum {
|
|
|
|
|
+ provider: corosync_votequorum
|
|
|
|
|
+ expected_votes: 8
|
|
|
|
|
+ wait_for_all: 1
|
|
|
|
|
+}
|
|
|
|
|
+.fi
|
|
|
|
|
+.PP
|
|
|
|
|
+.B last_man_standing: 1
|
|
|
|
|
+/
|
|
|
|
|
+.B last_man_standing_window: 10000
|
|
|
|
|
+.PP
|
|
|
|
|
+Enables Last Man Standing (LMS) feature (default: 0).
|
|
|
|
|
+Tunable last_man_standing_window (default: 10 seconds, expressed in ms).
|
|
|
|
|
+.PP
|
|
|
|
|
+The general behaviour of votequorum is to set expected_votes and quorum
|
|
|
|
|
+at startup (unless modified by the user at runtime, see below) and use
|
|
|
|
|
+those values during the whole lifetime of the cluster.
|
|
|
|
|
+.PP
|
|
|
|
|
+For example, in an 8 node cluster where each node has 1 vote, expected_votes
|
|
|
|
|
+is set to 8 and quorum to 5. This condition allows a total failure of 3
|
|
|
|
|
+nodes. If a 4th node fails, the cluster becomes inquorate and it will
|
|
|
|
|
+stop providing services.
|
|
|
|
|
+.PP
|
|
|
|
|
+Enabling LMS allows the cluster to dynamically recalculate expected_votes
|
|
|
|
|
+and quorum under specific circumstances. It is essential to enable
|
|
|
|
|
+WFA when using LMS in High Availability clusters.
|
|
|
|
|
+.PP
|
|
|
|
|
+Using the above 8 node cluster example, with LMS enabled the cluster can retain
|
|
|
|
|
+quorum and continue operating while losing, in a cascade fashion, up to 6 nodes with
|
|
|
|
|
+only 2 remaining active.
|
|
|
|
|
+.PP
|
|
|
|
|
+Example chain of events:
|
|
|
|
|
+.nf
|
|
|
|
|
+1) cluster is fully operational with 8 nodes.
|
|
|
|
|
+ (expected_votes: 8 quorum: 5)
|
|
|
|
|
+
|
|
|
|
|
+2) 3 nodes die, cluster is quorate with 5 nodes.
|
|
|
|
|
+
|
|
|
|
|
+3) after last_man_standing_window timer expires,
|
|
|
|
|
+ expected_votes and quorum are recalculated.
|
|
|
|
|
+ (expected_votes: 5 quorum: 3)
|
|
|
|
|
+
|
|
|
|
|
+4) at this point, 2 more nodes can die and
|
|
|
|
|
+ cluster will still be quorate with 3.
|
|
|
|
|
+
|
|
|
|
|
+5) once again, after last_man_standing_window
|
|
|
|
|
+ timer expires expected_votes and quorum are
|
|
|
|
|
+ recalculated.
|
|
|
|
|
+ (expected_votes: 3 quorum: 2)
|
|
|
|
|
+
|
|
|
|
|
+6) at this point, 1 more node can die and
|
|
|
|
|
+ cluster will still be quorate with 2.
|
|
|
|
|
+
|
|
|
|
|
+7) one more last_man_standing_window timer expires.
|
|
|
|
|
+ (expected_votes: 2 quorum: 2)
|
|
|
|
|
+.fi
|
|
|
|
|
+.PP
|
|
|
|
|
+NOTES: In order for the cluster to downgrade automatically from 2 nodes
|
|
|
|
|
+to a 1 node cluster, the auto_tie_breaker feature must also be enabled (see below).
|
|
|
|
|
+If auto_tie_breaker is not enabled, and one more failure occurs, the
|
|
|
|
|
+remaining node will not be quorate. LMS does not work with asymmetric voting
|
|
|
|
|
+schemes; each node must have exactly 1 vote.
|
|
|
|
|
+.PP
|
|
|
|
|
+Example configuration 1:
|
|
|
|
|
+.nf
|
|
|
|
|
+
|
|
|
|
|
+quorum {
|
|
|
|
|
+ provider: corosync_votequorum
|
|
|
|
|
+ expected_votes: 8
|
|
|
|
|
+ last_man_standing: 1
|
|
|
|
|
+}
|
|
|
|
|
+.fi
|
|
|
|
|
+.PP
|
|
|
|
|
+Example configuration 2 (increase timeout to 20 seconds):
|
|
|
|
|
+.nf
|
|
|
|
|
+
|
|
|
|
|
+quorum {
|
|
|
|
|
+ provider: corosync_votequorum
|
|
|
|
|
+ expected_votes: 8
|
|
|
|
|
+ last_man_standing: 1
|
|
|
|
|
+ last_man_standing_window: 20000
|
|
|
|
|
+}
|
|
|
|
|
+.fi
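+.PP
+Example configuration 3 (an illustrative sketch of LMS with WFA explicitly enabled, as
+advised above for High Availability clusters; the values are assumptions carried over
+from the 8 node example):
+.nf
+
+quorum {
+ provider: corosync_votequorum
+ expected_votes: 8
+ wait_for_all: 1
+ last_man_standing: 1
+}
+.fi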
|
|
|
|
|
+.PP
|
|
|
|
|
+.B auto_tie_breaker: 1
|
|
|
|
|
+.PP
|
|
|
|
|
+Enables Auto Tie Breaker (ATB) feature (default: 0).
|
|
|
|
|
+.PP
|
|
|
|
|
+The general behaviour of votequorum allows the simultaneous failure of up
|
|
|
|
|
+to 50% - 1 node, assuming each node has 1 vote.
|
|
|
|
|
+.PP
|
|
|
|
|
+When ATB is enabled, the cluster can suffer up to 50% of the nodes failing
|
|
|
|
|
+at the same time, in a deterministic fashion. The cluster partition, or the
|
|
|
|
|
+set of nodes that are still in contact with the node that has the lowest
|
|
|
|
|
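+nodeid, will remain quorate. The other nodes will be inquorate.
+For example, if an 8 node cluster splits into two partitions of 4 nodes each,
+only the partition that contains the lowest nodeid remains quorate.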
|
|
|
|
|
+.PP
|
|
|
|
|
+NOTES: For ATB to work, the lowest nodeid in the cluster needs to be known.
|
|
|
|
|
+corosync can automatically generate a nodeid or it can be overridden manually.
|
|
|
|
|
+If nodeids are not known at startup, ATB will automatically enable WFA.
|
|
|
|
|
+WFA will guarantee that all nodeids in the cluster are known before ATB can
|
|
|
|
|
+operate correctly.
|
|
|
|
|
+.PP
|
|
|
|
|
+Example configuration 1:
|
|
|
|
|
+.nf
|
|
|
|
|
+
|
|
|
|
|
+quorum {
|
|
|
|
|
+ provider: corosync_votequorum
|
|
|
|
|
+ expected_votes: 8
|
|
|
|
|
+ auto_tie_breaker: 1
|
|
|
|
|
+}
|
|
|
|
|
+
|
|
|
|
|
+.fi
|
|
|
|
|
+This will also automatically enable WFA.
|
|
|
|
|
+.PP
|
|
|
|
|
+Example configuration 2:
|
|
|
|
|
+.nf
|
|
|
|
|
+
|
|
|
|
|
+quorum {
|
|
|
|
|
+ provider: corosync_votequorum
|
|
|
|
|
+ auto_tie_breaker: 1
|
|
|
|
|
+}
|
|
|
|
|
+
|
|
|
|
|
+nodelist {
|
|
|
|
|
+ node {
|
|
|
|
|
+ ring0_addr: 192.168.1.1
|
|
|
|
|
+ nodeid: 1
|
|
|
|
|
+ }
|
|
|
|
|
+ node {
|
|
|
|
|
+ ring0_addr: 192.168.1.2
|
|
|
|
|
+ nodeid: 2
|
|
|
|
|
+ }
|
|
|
|
|
+ node {
|
|
|
|
|
+ ring0_addr: 192.168.1.3
|
|
|
|
|
+ nodeid: 3
|
|
|
|
|
+ }
|
|
|
|
|
+ node {
|
|
|
|
|
+ ring0_addr: 192.168.1.4
|
|
|
|
|
+ nodeid: 4
|
|
|
|
|
+ }
|
|
|
|
|
+}
|
|
|
|
|
+
|
|
|
|
|
+.fi
|
|
|
|
|
+This will allow ATB to work without WFA, as all nodeids are known at startup.
|
|
|
|
|
+.SH VARIOUS NOTES
|
|
|
|
|
+.PP
|
|
|
|
|
+* WFA / LMS / ATB can be combined.
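+.PP
+As a sketch, a single quorum { } section enabling all three features for the 8 node
+example used on this page might look like the following (the values are illustrative
+assumptions, not tuned recommendations):
+.nf
+
+quorum {
+ provider: corosync_votequorum
+ expected_votes: 8
+ wait_for_all: 1
+ last_man_standing: 1
+ last_man_standing_window: 10000
+ auto_tie_breaker: 1
+}
+.fi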
|
|
|
|
|
+.PP
|
|
|
|
|
+* In order to change the default votes for a node there are two options:
|
|
|
|
|
+.nf
|
|
|
|
|
+
|
|
|
|
|
+1) nodelist:
|
|
|
|
|
+
|
|
|
|
|
+nodelist {
|
|
|
|
|
+ node {
|
|
|
|
|
+ ring0_addr: 192.168.1.1
|
|
|
|
|
+ quorum_votes: 3
|
|
|
|
|
+ }
|
|
|
|
|
+ ....
|
|
|
|
|
+}
|
|
|
|
|
+
|
|
|
|
|
+2) quorum section (deprecated):
|
|
|
|
|
+
|
|
|
|
|
+quorum {
|
|
|
|
|
+ provider: corosync_votequorum
|
|
|
|
|
+ expected_votes: 2
|
|
|
|
|
+ votes: 2
|
|
|
|
|
+}
|
|
|
|
|
+
|
|
|
|
|
+.fi
|
|
|
|
|
+In the event that both nodelist and quorum { votes: } are defined, the value
|
|
|
|
|
+from the nodelist will be used.
|
|
|
|
|
+.PP
|
|
|
|
|
+* Only votes, quorum_votes, expected_votes and two_node can be changed at runtime. Everything else
|
|
|
|
|
+requires a cluster restart.
|
|
|
|
|
+.SH BUGS
|
|
|
|
|
+No known bugs at the time of writing. The authors are from outer space. Deal with it.
|
|
|
|
|
+.SH "SEE ALSO"
|
|
|
|
|
+.BR corosync (8),
|
|
|
|
|
+.BR corosync.conf (5),
|
|
|
|
|
+.BR corosync-quorumtool (8),
|
|
|
|
|
+.BR votequorum_overview (8)
|
|
|
|
|
+.PP
|