- .\"/*
- .\" * Copyright (c) 2012-2014 Red Hat, Inc.
- .\" *
- .\" * All rights reserved.
- .\" *
- .\" * Authors: Christine Caulfield <ccaulfie@redhat.com>
- .\" * Fabio M. Di Nitto <fdinitto@redhat.com>
- .\" *
- .\" * This software licensed under BSD license, the text of which follows:
- .\" *
- .\" * Redistribution and use in source and binary forms, with or without
- .\" * modification, are permitted provided that the following conditions are met:
- .\" *
- .\" * - Redistributions of source code must retain the above copyright notice,
- .\" * this list of conditions and the following disclaimer.
- .\" * - Redistributions in binary form must reproduce the above copyright notice,
- .\" * this list of conditions and the following disclaimer in the documentation
- .\" * and/or other materials provided with the distribution.
- .\" * - Neither the name of the MontaVista Software, Inc. nor the names of its
- .\" * contributors may be used to endorse or promote products derived from this
- .\" * software without specific prior written permission.
- .\" *
- .\" * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
- .\" * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- .\" * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- .\" * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
- .\" * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
- .\" * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
- .\" * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
- .\" * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
- .\" * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
- .\" * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
- .\" * THE POSSIBILITY OF SUCH DAMAGE.
- .\" */
.TH VOTEQUORUM 5 2012-01-24 "corosync Man Page" "Corosync Cluster Engine Programmer's Manual"
.SH NAME
votequorum \- Votequorum Configuration Overview
.SH OVERVIEW
The votequorum service is part of the corosync project. It can optionally be loaded
into the nodes of a corosync cluster to avoid split-brain situations.
It does this by assigning a number of votes to each system in the cluster and ensuring
that cluster operations are allowed to proceed only when a majority of the votes is present.
The service must be loaded into either all nodes or none. If it is loaded into only a
subset of the cluster nodes, the results will be unpredictable.
.PP
The following corosync.conf extract will enable the votequorum service within corosync:
.PP
.nf
quorum {
    provider: corosync_votequorum
}
.fi
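.PP
Once corosync has been restarted with this configuration, the loaded quorum provider
and the current quorum state can be inspected with
.BR corosync-quorumtool (8);
for example (the exact output depends on the corosync version):
.nf
# corosync-quorumtool -s
.fi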
.PP
votequorum reads its configuration from corosync.conf. Some values can be changed at
runtime, while others are only read at corosync startup. It is very important that those
values are consistent across all the nodes participating in the cluster, or votequorum
behavior will be unpredictable.
.PP
votequorum requires an expected_votes value to function; this can be provided in two ways.
The number of expected votes is calculated automatically when the nodelist { } section is
present in corosync.conf, or expected_votes can be specified in the quorum { } section. If
neither is present, votequorum is disabled. If both are present at the same time,
the quorum.expected_votes value overrides the one calculated from the nodelist.
.PP
Example (no nodelist) of an 8 node cluster (each node has 1 vote):
.nf
quorum {
    provider: corosync_votequorum
    expected_votes: 8
}
.fi
.PP
Example (with nodelist) of a 3 node cluster (each node has 1 vote):
.nf
quorum {
    provider: corosync_votequorum
}

nodelist {
    node {
        ring0_addr: 192.168.1.1
    }
    node {
        ring0_addr: 192.168.1.2
    }
    node {
        ring0_addr: 192.168.1.3
    }
}
.fi
.SH SPECIAL FEATURES
.PP
.B two_node: 1
.PP
Enables two node cluster operations (default: 0).
.PP
The "two node cluster" is a use case that requires special consideration.
With a standard two node cluster, where each node has a single vote, there
are 2 votes in the cluster. Using the simple majority calculation
(50% of the votes + 1) to calculate quorum, the quorum would be 2.
This means that both nodes would always have
to be alive for the cluster to be quorate and operate.
.PP
Enabling two_node: 1 sets quorum artificially to 1.
Example configuration 1:
.nf
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}
.fi
.PP
Example configuration 2:
.nf
quorum {
    provider: corosync_votequorum
    two_node: 1
}

nodelist {
    node {
        ring0_addr: 192.168.1.1
    }
    node {
        ring0_addr: 192.168.1.2
    }
}
.fi
.PP
NOTES: enabling two_node: 1 automatically enables wait_for_all. It is
still possible to override wait_for_all by explicitly setting it to 0,
as shown in the example below.
If more than 2 nodes join the cluster, the two_node option is
automatically disabled.
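.PP
For example, the following configuration (a sketch combining only the options
described above) keeps two_node behaviour but opts out of the implied
wait_for_all:
.nf
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
    wait_for_all: 0
}
.fi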
.PP
.B wait_for_all: 1
.PP
Enables the Wait For All (WFA) feature (default: 0).
.PP
The general behaviour of votequorum is to switch a cluster from inquorate to quorate
as soon as possible. For example, in an 8 node cluster where every node has 1 vote,
expected_votes is set to 8 and quorum is (50% + 1) 5. As soon as 5 (or more) nodes
are visible to each other, the partition of 5 (or more) becomes quorate and can
start operating.
.PP
When WFA is enabled, the cluster will be quorate for the first time
only after all nodes have been visible at least once at the same time.
.PP
This feature has the advantage of avoiding some startup race conditions, at the cost
that all nodes need to be up at the same time at least once before the cluster
can operate.
.PP
A common startup race condition, based on the above example, is that as soon as 5
nodes become quorate, while the other 3 are still offline, those 3 offline nodes
will be fenced.
.PP
WFA is very useful when combined with last_man_standing (see below).
Example configuration:
.nf
quorum {
    provider: corosync_votequorum
    expected_votes: 8
    wait_for_all: 1
}
.fi
.PP
.B last_man_standing: 1
/
.B last_man_standing_window: 10000
.PP
Enables the Last Man Standing (LMS) feature (default: 0).
Tunable last_man_standing_window (default: 10 seconds, expressed in ms).
.PP
The general behaviour of votequorum is to set expected_votes and quorum
at startup (unless modified by the user at runtime, see below) and use
those values during the whole lifetime of the cluster.
.PP
Take for example an 8 node cluster where each node has 1 vote: expected_votes
is set to 8 and quorum to 5. This condition allows a total failure of up to 3
nodes. If a 4th node fails, the cluster becomes inquorate and it will
stop providing services.
.PP
Enabling LMS allows the cluster to dynamically recalculate expected_votes
and quorum under specific circumstances. It is essential to enable
WFA when using LMS in High Availability clusters.
.PP
Using the above 8 node cluster example, with LMS enabled the cluster can retain
quorum and continue operating by losing, in a cascading fashion, up to 6 nodes,
with only 2 remaining active.
.PP
Example chain of events:
.nf
1) cluster is fully operational with 8 nodes.
   (expected_votes: 8 quorum: 5)
2) 3 nodes die, cluster is quorate with 5 nodes.
3) after the last_man_standing_window timer expires,
   expected_votes and quorum are recalculated.
   (expected_votes: 5 quorum: 3)
4) at this point, 2 more nodes can die and the
   cluster will still be quorate with 3.
5) once again, after the last_man_standing_window
   timer expires, expected_votes and quorum are
   recalculated.
   (expected_votes: 3 quorum: 2)
6) at this point, 1 more node can die and the
   cluster will still be quorate with 2.
7) after one more last_man_standing_window timer
   expires, expected_votes and quorum are recalculated.
   (expected_votes: 2 quorum: 2)
.fi
.PP
NOTES: In order for the cluster to downgrade automatically from 2 nodes
to a 1 node cluster, the auto_tie_breaker feature must also be enabled (see below).
If auto_tie_breaker is not enabled and one more failure occurs, the
remaining node will not be quorate. LMS does not work with asymmetric voting
schemes; each node must vote 1. LMS is also incompatible with quorum devices:
if last_man_standing is specified in corosync.conf, then the quorum device
will be disabled.
.PP
Example configuration 1:
.nf
quorum {
    provider: corosync_votequorum
    expected_votes: 8
    last_man_standing: 1
}
.fi
.PP
Example configuration 2 (increase timeout to 20 seconds):
.nf
quorum {
    provider: corosync_votequorum
    expected_votes: 8
    last_man_standing: 1
    last_man_standing_window: 20000
}
.fi
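.PP
Since it is essential to enable WFA when using LMS in High Availability clusters
(see above), a combined configuration could look like this (a sketch using only
the options documented on this page):
.nf
quorum {
    provider: corosync_votequorum
    expected_votes: 8
    wait_for_all: 1
    last_man_standing: 1
}
.fi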
.PP
.B auto_tie_breaker: 1
.PP
Enables the Auto Tie Breaker (ATB) feature (default: 0).
.PP
The general behaviour of votequorum allows the simultaneous failure of up to
50% - 1 of the nodes, assuming each node has 1 vote.
.PP
When ATB is enabled, the cluster can suffer up to 50% of the nodes failing
at the same time, in a deterministic fashion. By default, the cluster
partition containing the node with the lowest nodeid will remain quorate.
The other nodes will be inquorate. This behaviour can be changed by also
specifying
.PP
.B auto_tie_breaker_node: lowest|highest|<list of node IDs>
.PP
The default is 'lowest'; 'highest' is similar in that if the current set of
nodes contains the highest nodeid, then it will remain quorate. Alternatively,
it is possible to specify a particular node ID or a list of node IDs that will
be required to maintain quorum. If a (space-separated) list is given, the
nodes are evaluated in order: if the first node is present, it will
be used to determine the quorate partition; if that node is not in either
half (i.e. it was not in the cluster before the split), the second node ID
will be checked, and so on. ATB is incompatible with quorum devices:
if auto_tie_breaker is specified in corosync.conf, then the quorum device
will be disabled.
.PP
Example configuration 1:
.nf
quorum {
    provider: corosync_votequorum
    expected_votes: 8
    auto_tie_breaker: 1
    auto_tie_breaker_node: lowest
}
.fi
.PP
Example configuration 2:
.nf
quorum {
    provider: corosync_votequorum
    expected_votes: 8
    auto_tie_breaker: 1
    auto_tie_breaker_node: 1 3 5
}
.fi
.PP
.B allow_downscale: 1
.PP
Enables the Allow Downscale (AD) feature (default: 0).
.PP
THIS FEATURE IS INCOMPLETE AND CURRENTLY UNSUPPORTED.
.PP
The general behaviour of votequorum is to never decrease expected votes or quorum.
.PP
When AD is enabled, both expected votes and quorum are recalculated when
a node leaves the cluster in a clean state (normal corosync shutdown process),
down to the configured expected_votes.
Example use case:
.PP
.nf
1) N node cluster (where N is any value higher than 3)
2) expected_votes set to 3 in corosync.conf
3) only 3 nodes are running
4) admin needs to increase processing power and adds 10 nodes
5) internal expected_votes is automatically set to 13
6) minimum expected_votes is 3 (from configuration)
   - up to this point this is standard votequorum behavior -
7) once the work is done, admin wants to remove nodes from the cluster
8) using an ordered shutdown the admin can reduce the cluster size
   automatically back to 3, but not below 3, where normal quorum
   operation will work as usual.
.fi
.PP
Example configuration:
.nf
quorum {
    provider: corosync_votequorum
    expected_votes: 3
    allow_downscale: 1
}
.fi
.PP
allow_downscale implicitly enables EVT (see below).
.PP
.B expected_votes_tracking: 1
.PP
Enables the Expected Votes Tracking (EVT) feature (default: 0).
.PP
Expected Votes Tracking stores the highest-seen value of expected votes on disk and
uses it as the minimum value for expected votes in the absence of any higher authority
(e.g. a currently quorate cluster). This is useful for when a group of nodes becomes
detached from the main cluster and, after a restart, could have enough votes to provide
quorum; this can happen after using allow_downscale.
.PP
Note that even if the in-memory version of expected_votes is reduced, e.g. by removing
nodes or by using corosync-quorumtool, the stored value will still be the highest value
seen; it never gets reduced.
.PP
The value is held in the file /var/lib/corosync/ev_tracking, which can be deleted if you
really do need to reduce the expected votes for any reason, for example because the node
has been moved to a different cluster.
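.PP
Example configuration (a minimal sketch enabling only EVT, with the same
expected_votes value as the earlier examples):
.nf
quorum {
    provider: corosync_votequorum
    expected_votes: 3
    expected_votes_tracking: 1
}
.fi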
.SH VARIOUS NOTES
.PP
* WFA / LMS / ATB / AD can be combined with each other.
.PP
* In order to change the default number of votes for a node, there are two options:
.nf
1) nodelist:

nodelist {
    node {
        ring0_addr: 192.168.1.1
        quorum_votes: 3
    }
    ....
}

2) quorum section (deprecated):

quorum {
    provider: corosync_votequorum
    expected_votes: 2
    votes: 2
}
.fi
.PP
In the event that both the nodelist and quorum { votes: } are defined, the value
from the nodelist will be used.
.PP
* Only votes, quorum_votes, expected_votes and two_node can be changed at runtime, as
shown below. Everything else requires a cluster restart.
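.PP
For example, assuming the -e and -v/-n options of
.BR corosync-quorumtool (8)
(see its man page for the authoritative syntax), runtime changes could look like:
.nf
# raise the expected votes of the whole cluster to 8
corosync-quorumtool -e 8

# give 2 votes to the node with nodeid 1
corosync-quorumtool -v 2 -n 1
.fi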
.SH BUGS
No known bugs at the time of writing. The authors are from outer space. Deal with it.
.SH "SEE ALSO"
.BR corosync (8),
.BR corosync.conf (5),
.BR corosync-quorumtool (8),
.BR corosync-qdevice (8),
.BR votequorum_overview (3)