.\"/*
.\" * Copyright (c) 2009-2010 Red Hat, Inc.
.\" *
.\" * All rights reserved.
.\" *
.\" * Author: Jan Friesse (jfriesse@redhat.com)
.\" * Author: Steven Dake (sdake@redhat.com)
.\" *
.\" * This software licensed under BSD license, the text of which follows:
.\" *
.\" * Redistribution and use in source and binary forms, with or without
.\" * modification, are permitted provided that the following conditions are met:
.\" *
.\" * - Redistributions of source code must retain the above copyright notice,
.\" * this list of conditions and the following disclaimer.
.\" * - Redistributions in binary form must reproduce the above copyright notice,
.\" * this list of conditions and the following disclaimer in the documentation
.\" * and/or other materials provided with the distribution.
.\" * - Neither the name of the Red Hat, Inc. nor the names of its
.\" * contributors may be used to endorse or promote products derived from this
.\" * software without specific prior written permission.
.\" *
.\" * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
.\" * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
.\" * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
.\" * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
.\" * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
.\" * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
.\" * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
.\" * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
.\" * THE POSSIBILITY OF SUCH DAMAGE.
.\" */
.TH "SAM_OVERVIEW" 8 "21/05/2010" "corosync Man Page" "Corosync Cluster Engine Programmer's Manual"
.SH NAME
.P
sam_overview \- Overview of the Simple Availability Manager
.SH OVERVIEW
.P
The SAM library provides a tool to check the health of an application.
The main purpose of SAM is to restart a local process when it fails to respond
to a healthcheck request within a configured time interval.
.P
During \fBsam_initialize(3)\fR, a duplicate copy of the process is created using
the \fBfork(2)\fR system call. This duplicate process contains the logic
for executing the SAM server. The SAM server is responsible for requesting
healthchecks from the active process and for controlling the lifecycle of the
active process when it fails. If the active process fails to respond to the
healthcheck request sent by the SAM server, it is sent a user-configurable
signal (default SIGTERM) to request shutdown of the application. After a
configured time interval, the process is forcibly killed with a SIGKILL
signal. Once the active process terminates, the SAM server creates a new
active process.
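.P
The warn-then-kill sequence described above can be sketched with plain POSIX
calls. This is only an illustration of the idea, not the SAM server's actual
code: the names \fIstop_with_escalation\fR and \fIdemo_stop_child\fR, the
10 ms polling step and the 200 ms grace period are all assumptions made for
the example.
.nf

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of SAM): ask `pid` to stop with
 * `warn_signal`, wait up to `grace_ms` milliseconds, then send SIGKILL.
 * Returns the signal that ended the child, 0 if it exited on its own,
 * or -1 on error.
 */
static int stop_with_escalation(pid_t pid, int warn_signal, int grace_ms)
{
    const struct timespec tick = { 0, 10 * 1000 * 1000 }; /* 10 ms */
    int status = 0;
    int waited_ms = 0;
    pid_t r;

    if (kill(pid, warn_signal) == -1)
        return -1;

    /* Poll for termination during the grace period. */
    while ((r = waitpid(pid, &status, WNOHANG)) == 0 && waited_ms < grace_ms) {
        nanosleep(&tick, NULL);
        waited_ms += 10;
    }
    if (r == 0) {
        /* Grace period expired: force termination and reap the child. */
        if (kill(pid, SIGKILL) == -1)
            return -1;
        r = waitpid(pid, &status, 0);
    }
    if (r == -1)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/*
 * Demo: fork a child that either ignores or honours SIGTERM, then stop it.
 * The disposition is set before fork() so the child inherits it without a race.
 */
static int demo_stop_child(int child_ignores_warning)
{
    void (*old)(int);
    pid_t pid;

    old = signal(SIGTERM, child_ignores_warning ? SIG_IGN : SIG_DFL);
    pid = fork();
    if (pid == 0) {
        for (;;)
            pause(); /* child: idle until a signal terminates it */
    }
    signal(SIGTERM, old); /* parent: restore its own disposition */
    if (pid == -1)
        return -1;
    return stop_with_escalation(pid, SIGTERM, 200);
}
```

.fi
.P
A child that honours the warning is ended by SIGTERM. A child that ignores it
is ended by SIGKILL once the grace period has run out.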
.P
The Simple Availability Manager is meant to be used in conjunction with the
cpg service. Used together, they make it possible to restart a cpg process
that fails healthchecking during operation.
.P
The main features of SAM include:
.RS
.IP \(bu 3
A configurable recovery policy.
.IP \(bu 3
A configurable time interval for health check operations.
.IP \(bu 3
A notification via signal before recovery action is taken.
.IP \(bu 3
A mechanism to indicate to the application the number of times an active
process has been created by the SAM server.
.IP \(bu 3
Both application driven health checking and event driven health checking.
.RE
.SH Initializing SAM
.P
The SAM library is initialized by \fBsam_initialize(3)\fR.
\fBsam_initialize(3)\fR may only be called once per process. Calling it more
than once has undefined results and is neither recommended nor tested.
.SH Setting warning callback
.P
A user-configurable signal (default \fISIGTERM\fR) is sent to the application
when a recovery action is planned. The signal can be changed with
\fBsam_warn_signal_set(3)\fR. The application can use the \fBsignal(2)\fR
system call to install a handler for this signal.
.P
There are no special constraints on which SAM APIs may be called in a warning
callback. After \fItime_interval\fR expires, a SIGKILL signal is sent to the
active process to force its termination.
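.P
As an illustration, the application side of the warning could look like the
sketch below. It uses \fBsigaction(2)\fR rather than \fBsignal(2)\fR, and the
names \fIinstall_warning_handler\fR and \fIconsume_warning\fR are
hypothetical helpers made up for this example, not SAM functions.
.nf

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler; sig_atomic_t can be written safely in signal context. */
static volatile sig_atomic_t warning_pending = 0;

static void on_warning(int signo)
{
    (void)signo;
    warning_pending = 1; /* only record the event; react in the main loop */
}

/* Install the handler for the warning signal. Returns 0 on success. */
static int install_warning_handler(int warn_signal)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_warning;
    sigemptyset(&sa.sa_mask);
    return sigaction(warn_signal, &sa, NULL);
}

/* Called from the main loop: returns 1 once for each recorded warning. */
static int consume_warning(void)
{
    if (!warning_pending)
        return 0;
    warning_pending = 0;
    return 1;
}
```

.fi
.P
The main loop can poll \fIconsume_warning()\fR and flush state or close
connections before the forced termination arrives.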
.SH Registering the active process
.P
The active process is registered with SAM by calling \fBsam_register(3)\fR.
This function should be called only once in a process. After a recovery
action is taken, the new active process begins execution at the line of code
that follows the call to \fBsam_register(3)\fR in the user process.
.SH Enabling event driven healthchecking
.P
Two types of healthchecking are available to the user. In the first model, the
user application sends healthchecks on its own during normal operation. It is
never asked to healthcheck, and if the active process doesn't send a
healthcheck within the time interval, the process is restarted.
.P
A more useful mechanism for healthchecking is event driven healthchecking.
Because this model is directed by the SAM server, it isn't necessary to guess
or add timers to the active process to signal that a healthcheck operation is
successful. To use event driven healthchecking, call the
\fBsam_hc_callback_register(3)\fR function.
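.P
The request and reply pattern behind event driven healthchecking can be
sketched with a pair of pipes. This is a simplified model and not SAM's
internal protocol: the function names, the one-byte messages, and the
convention that the callback returns 0 when healthy are assumptions of the
example.
.nf

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Callback type assumed for this sketch: return 0 when healthy. */
typedef int (*hc_callback_fn)(void);

/* Server side: ask the application for a healthcheck. Returns 0 on success. */
static int hc_send_request(int req_wr)
{
    return write(req_wr, "?", 1) == 1 ? 0 : -1;
}

/* Application side: consume one request and reply with the callback result. */
static int hc_answer(int req_rd, int rep_wr, hc_callback_fn cb)
{
    char req;
    char reply;

    if (read(req_rd, &req, 1) != 1)
        return -1;
    reply = (cb() == 0) ? 'y' : 'n';
    return write(rep_wr, &reply, 1) == 1 ? 0 : -1;
}

/*
 * Server side: wait up to timeout_ms for the reply.
 * Returns 1 if healthy, 0 if unhealthy or timed out, -1 on error.
 */
static int hc_wait_reply(int rep_rd, int timeout_ms)
{
    struct pollfd pfd = { .fd = rep_rd, .events = POLLIN };
    char reply;
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return -1;
    if (n == 0)
        return 0; /* no answer within the interval */
    if (read(rep_rd, &reply, 1) != 1)
        return 0;
    return reply == 'y';
}

static int always_healthy(void) { return 0; }
static int always_failing(void) { return 1; }

/* One round trip in a single process; cb == NULL models a hung application. */
static int demo_round_trip(hc_callback_fn cb, int timeout_ms)
{
    int req[2];
    int rep[2];
    int result = -1;

    if (pipe(req) == -1)
        return -1;
    if (pipe(rep) == -1) {
        close(req[0]);
        close(req[1]);
        return -1;
    }
    if (hc_send_request(req[1]) == 0) {
        if (cb != NULL)
            hc_answer(req[0], rep[1], cb);
        result = hc_wait_reply(rep[0], timeout_ms);
    }
    close(req[0]);
    close(req[1]);
    close(rep[0]);
    close(rep[1]);
    return result;
}
```

.fi
.P
Because the server decides when to ask, the application only has to supply
the callback. A missing or negative answer within the timeout is what would
lead to a recovery action.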
.SH Quorum integration
.P
SAM has special policies (\fISAM_RECOVERY_POLICY_QUORUM_QUIT\fR and \fISAM_RECOVERY_POLICY_QUORUM_RESTART\fR)
for integration with the quorum service. These policies change SAM behaviour in two respects.
.RS
.IP \(bu 3
A call to \fBsam_start(3)\fR blocks until corosync becomes quorate.
.IP \(bu 3
The user-selected recovery action is taken immediately after loss of quorum.
.RE
.SH Storing user data
.P
Sometimes data needs to be stored so that it survives between instances of the
process. Files or databases can be used for this, but a much simpler in-memory
solution is provided by the \fBsam_data_store(3)\fR, \fBsam_data_restore(3)\fR
and \fBsam_data_getsize(3)\fR functions.
.SH Confdb integration
.P
SAM has a policy flag for confdb system integration (\fISAM_RECOVERY_POLICY_CONFDB\fR).
If a process is registered with this flag, a new confdb object PROCESS_NAME:PID is created with the following
keys:
.RS
.IP \(bu 3
\fIrecovery\fR \- either quit or restart, depending on the policy
.IP \(bu 3
\fIhc_period\fR \- period of health checking in milliseconds
.IP \(bu 3
\fIhc_last\fR \- last known time (GMT, in milliseconds) at which a health check was received
.IP \(bu 3
\fIstate\fR \- state of the process (one of registered, started, failed, waiting for quorum)
.RE
.P
The object is automatically deleted if the process exits with health checking stopped.
.P
Confdb integration with the corosync watchdog can be used in an implicit or an explicit way.
.P
The implicit way is achieved by setting the recovery policy to QUIT and letting the process exit while health
checking is started. If this happens, the object is not deleted and the corosync watchdog takes the required action.
.P
The explicit way is useful when the developer can handle some non-fatal failures of the application.
This mode is achieved by setting the policy to RESTART and using SAM the same way as without confdb integration.
If a real failure needs to be reported (for example too many restarts in total or per second), the application
can call \fBsam_mark_failed(3)\fR and let the corosync watchdog take the required action.
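.P
A restart limit for the explicit mode could be decided as in the sketch
below. It assumes the instance id reported at registration is 1 for the first
start and grows by one per restart. \fIrestart_budget_exceeded\fR and
\fImax_restarts\fR are hypothetical names chosen for the example.
.nf

```c
#include <stdbool.h>

/*
 * Hypothetical policy helper (not part of SAM): decide whether the process
 * has been restarted more often than the application allows.
 * Assumption: instance_id is 1 on the first start, 2 after one restart, etc.
 */
static bool restart_budget_exceeded(unsigned int instance_id,
                                    unsigned int max_restarts)
{
    if (instance_id == 0)
        return false; /* not registered yet; nothing to judge */
    return (instance_id - 1) > max_restarts;
}
```

.fi
.P
When the helper returns true, the application would call
\fBsam_mark_failed(3)\fR instead of continuing, leaving the next step to the
corosync watchdog.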
.SH BUGS
.SH "SEE ALSO"
.BR sam_initialize (3),
.BR sam_data_getsize (3),
.BR sam_data_restore (3),
.BR sam_data_store (3),
.BR sam_finalize (3),
.BR sam_mark_failed (3),
.BR sam_start (3),
.BR sam_stop (3),
.BR sam_register (3),
.BR sam_warn_signal_set (3),
.BR sam_hc_send (3),
.BR sam_hc_callback_register (3)