README.amf 9.7 KB

  1. AMF B.01.01 Implementation
  2. --------------------------
  3. This patch contains the basis of the AMF B.01.01 service targeted for release
  4. in Wilson (1.0). It is a work in progress and incomplete at this time.
  5. What does AMF do?
  6. -----------------
  7. The AMF has three major duties:
  8. * issuing instantiate, terminate, and cleanup operations for components
  9. * assigning component service instances (CSIs) to components
  10. * detecting component faults and executing recovery actions
  11. The AMF starts and stops the processes that make up a component. A service
  12. unit (SU) contains multiple components. A service group contains multiple SUs.
  13. An SU is the unit of redundancy used to implement high availability.
  14. Components are started and stopped using the component life cycle (CLC)
  15. operations. The AMF specification is exceedingly clear about which CLC
  16. operations occur for which component types, and openais implements the full
  17. CLC operations for all of the various component types.
  18. If a component is not sa-aware, the only level of high availability that
  19. can be applied to the application is through execution of the CLC interfaces.
  20. A special component, called a proxy component, can be used to present an
  21. sa-aware component to AMF to manage a non-sa-aware component. This would be
  22. useful, for example, to implement a healthcheck operation which runs some
  23. operation of the unmodified application service.
  24. Components that are sa-aware have been written specifically to the AMF
  25. interfaces. These components provide the most support for high availability
  26. for application developers.
  27. When an sa-aware component is registered, service instances are assigned
  28. to the component once the service unit is available to take service. This
  29. service instance specifies whether the component is ACTIVE or STANDBY. The
  30. component is directed by the AMF to enter either ACTIVE or STANDBY states
  31. and then executes its assigned operational mode. The number of CSIs assigned
  32. to a component is determined by a reduction process with 6 levels of
  33. reduction. The AMF provides a very clear definition of what is required
  34. with several examples for each reduction level.
  35. The AMF detects faults through the use of a healthcheck operation. The user
  36. specifies healthcheck keys and timing parameters in a configuration file.
  37. This configuration is then used by the application developer to register
  38. a healthcheck operation in the AMF. The healthcheck operation can be started
  39. or stopped. Once started, the AMF will periodically send a request to the
  40. component to determine its level of health. The AMF reacts to negative
  41. healthchecks or failed healthchecks by executing a recovery policy.
  42. The recovery policy attempts to restart components first. When components
  43. are restarted and fail a certain number of times within a timeout period, the
  44. entire service unit is restarted. When SUs on one node are restarted and fail
  45. a certain number of times within a timeout period, the service unit is failed
  46. over to a standby service unit.
  47. Currently openais implements most of what is described above.
  48. How to configure AMF
  49. --------------------
  50. The AMF specification doesn't define a configuration file format. It does specify many
  51. configuration options, which are mostly implemented in openais. The
  52. configuration file specifies the service groups, service units, service
  53. instances, recovery configuration options, and information describing where
  54. components and CLI (command line interface) tools are located.
  55. There are several configuration options which are used to control the component
  56. life cycle (CLC) of the component. These configuration options are:
  57. in the group section:
  58. clccli_path=/home/sdake/amfb-dec/test
  59. The path to the CLC CLI applications.
  60. binary_path=/home/sdake/amfb-dec/test
  61. The path to the components.
  62. in the unit section:
  63. bn=testamf1
  64. The bn parameter specifies the binary name of the application that should be
  65. run by the instantiation script. The instantiate script may already know this
  66. name, so this parameter is optional.
  67. instantiate=clc_cli_script
  68. The instantiate parameter specifies the CLC-CLI binary program to be run to
  69. instantiate a component. An instantiation starts the processes representing
  70. the component.
  71. terminate=clc_cli_script
  72. The terminate parameter specifies the CLC-CLI binary program to be run to
  73. terminate a component. A terminate CLC stops the processes representing
  74. the component gracefully, allowing them to shut down properly.
  75. cleanup=clc_cli_script
  76. The cleanup parameter specifies the CLC-CLI binary program to be run to
  77. clean up a component. A cleanup CLC terminates the processes representing
  78. the component abruptly.
  79. There are several options to describe the component recovery escalation
  80. policies. These are:
  81. component_restart_probation=100000
  82. This specifies the period, in milliseconds, over which component restarts are
  83. counted in escalation level 0 (only restart components).
  84. component_restart_max=4
  85. This specifies the number of component restarts allowed within the
  86. component_restart_probation period before escalating from level 0 to level 1.
  87. unit_restart_probation=200000
  88. This specifies the period, in milliseconds, over which unit restarts are
  89. counted in escalation level 1 (restart entire SU).
  90. unit_restart_max=6
  91. This specifies the number of unit restarts allowed within the
  92. unit_restart_probation period before escalating from level 1 to level 2.
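The way a restart limit and a probation period interact can be illustrated with a small Python sketch. This is not the openais implementation: the class name RestartEscalation and the sliding-window bookkeeping are assumptions made for illustration, and only the four option values come from the text above.

```python
from collections import deque


class RestartEscalation:
    """Counts restarts inside a probation window (hypothetical helper)."""

    def __init__(self, restart_max, probation_ms):
        self.restart_max = restart_max
        self.probation_ms = probation_ms
        self.restarts = deque()  # timestamps (ms) of recent restarts

    def record_restart(self, now_ms):
        """Record a restart; return True when the next level should be entered."""
        self.restarts.append(now_ms)
        # Assumption: restarts older than the probation period no longer count.
        while self.restarts and now_ms - self.restarts[0] > self.probation_ms:
            self.restarts.popleft()
        return len(self.restarts) >= self.restart_max


# Option values taken from the configuration options described above.
component_level = RestartEscalation(restart_max=4, probation_ms=100000)
unit_level = RestartEscalation(restart_max=6, probation_ms=200000)

# Four component restarts, 10 seconds apart, all fall inside the 100 second
# window, so the fourth one triggers escalation from level 0 to level 1.
results = [component_level.record_restart(t) for t in (0, 10000, 20000, 30000)]
print(results)
```

Under these assumptions, restarts that are spread further apart than the probation period never accumulate, so the component keeps being restarted at level 0.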
  93. The AMF will execute an N+M reduction process based upon the number of service
  94. instances specified in the configuration file and 4 configuration options
  95. at the group level:
  96. preferred-active-units=3
  97. This is the preferred number of units that should be active.
  98. maximum-active-instances=3
  99. This is the maximum number of active CSIs that can be assigned to a component.
  100. preferred-standby-units=2
  101. This is the preferred number of units that should be standby.
  102. maximum-standby-instances=4
  103. This is the maximum number of standby CSIs that can be assigned to a component.
  104. A service instance is specified only as a name. If there are 4 SIs, the
  105. reduction process will execute as per the AMF specification to assign the proper
  106. number of active and standby CSIs to components currently registered. This
  107. is a little buggy at the moment.
  108. serviceinstance {
  109. name = siaa
  110. }
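To give a feel for how the active-side options bound an assignment, here is a deliberately simplified Python sketch. It is not the six-level reduction process from the AMF specification and not the openais code. It only deals service instances out in round-robin order, and it treats maximum-active-instances as a cap per unit, which is a simplification. The function name spread_active and the SI names other than siaa are hypothetical.

```python
def spread_active(service_instances, preferred_active_units, maximum_active_instances):
    """Deal SIs out to active units in turn, honoring the cap (illustrative only)."""
    assignments = [[] for _ in range(preferred_active_units)]
    unassigned = []
    for index, si in enumerate(service_instances):
        # Try each unit once, starting from the round-robin position.
        for offset in range(preferred_active_units):
            unit = assignments[(index + offset) % preferred_active_units]
            if len(unit) < maximum_active_instances:
                unit.append(si)
                break
        else:
            # Every unit is already at its cap.
            unassigned.append(si)
    return assignments, unassigned


# 4 SIs, preferred-active-units=3, maximum-active-instances=3
active, leftover = spread_active(["siaa", "siab", "siac", "siad"], 3, 3)
print(active, leftover)
```

The real reduction process also has to place standby assignments and to cope with fewer registered units than preferred, which this sketch ignores.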
  111. Failure detection occurs through the healthcheck option. The healthcheck
  112. options are:
  113. key
  114. The name of the healthcheck parameter.
  115. period
  116. The number of milliseconds to wait before issuing a new healthcheck.
  117. maximum_duration
  118. The maximum amount of time to wait for a healthcheck to complete before
  119. declaring a failure.
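The roles of the three healthcheck options can be sketched in Python as follows. This is an illustration rather than the openais logic: the Healthcheck class, the example values, the choice of milliseconds for maximum_duration, and the assumption that requests are issued at a fixed interval are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Healthcheck:
    key: str               # name of the healthcheck
    period: int            # ms to wait before issuing a new healthcheck
    maximum_duration: int  # time allowed for completion (assumed to be ms)

    def request_times(self, start_ms: int, count: int) -> List[int]:
        """Times at which the first `count` requests would be issued."""
        return [start_ms + i * self.period for i in range(count)]

    def is_failure(self, sent_ms: int, answered_ms: Optional[int]) -> bool:
        """A missing or late response counts as a failed healthcheck."""
        if answered_ms is None:
            return True
        return answered_ms - sent_ms > self.maximum_duration


# Hypothetical values, not taken from any shipped configuration file.
check = Healthcheck(key="example_key", period=1000, maximum_duration=350)
print(check.request_times(0, 3))
print(check.is_failure(sent_ms=1000, answered_ms=1200))
print(check.is_failure(sent_ms=2000, answered_ms=2400))
```

A failure reported by is_failure would be the point at which the recovery policy is invoked.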
  120. The example programs
  121. --------------------
  122. First the openais test programs should be installed. When openais is
  123. compiled, a file called openais-instantiate is created in the exec directory.
  124. Copy this to the test directory:
  125. exec# cp openais-instantiate ../test
  126. Set execute permissions for the clc_cli_script:
  127. exec# cd ../test
  128. test# chmod +x ../clc_cli_script
  129. IMPORTANT NOTE:
  130. Within the amf stanza, the mode variable should be set to enabled. This option
  131. defaults to off, and the default configuration file turns it off as well,
  132. to avoid confusing openais users who are interested in using AIS without
  133. the alpha-quality AMF.
  134. example openais.conf:
  135. amf {
  136. mode: enabled
  137. }
  138. The following two paths must be set in the groups.conf file:
  139. clccli_path=/home/sdake/amfb-l/test
  140. binary_path=/home/sdake/amfb-l/test
  141. If these are not set, the path to the clc_cli_script and component binaries
  142. cannot be determined and AMF will not instantiate the testamf1 binary.
  143. Once aisexec is run using the default configuration file, 5 service units
  144. will be instantiated. The testamf1 C code is used for all 5 SUs and for
  145. both comp_a and comp_b in each SU. The testamf1 program determines its component
  146. name at start time from the saAmfComponentNameGet API call. The result is
  147. that 10 processes will be started by AMF.
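The containment of components in service units and of service units in a service group, and the resulting process count, can be modeled with a short Python sketch. The class names and the group and unit names are hypothetical and are not openais data structures. Only comp_a, comp_b, testamf1, and the count of 5 SUs come from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    name: str
    binary: str


@dataclass
class ServiceUnit:
    name: str
    components: List[Component] = field(default_factory=list)


@dataclass
class ServiceGroup:
    name: str
    units: List[ServiceUnit] = field(default_factory=list)

    def process_count(self) -> int:
        # One process per component, as in the testamf1 example.
        return sum(len(unit.components) for unit in self.units)


group = ServiceGroup(
    name="example_group",  # hypothetical name
    units=[
        ServiceUnit(
            name=f"unit_{i}",  # hypothetical names
            components=[
                Component("comp_a", "testamf1"),
                Component("comp_b", "testamf1"),
            ],
        )
        for i in range(1, 6)
    ],
)
print(group.process_count())  # prints 10
```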
  148. The testamf1 processes will be assigned CSIs after they execute an
  149. saAmfComponentRegister operation. Note that this operation causes the presence
  150. state of the testamf1 component to be set to INSTANTIATED, as required by the
  151. AMF specification. Service instances and their names are defined in the configuration file.
  152. The testamf1 program reports an error via saAmfErrorReport after 10
  153. healthchecks. This results in openais calling the cleanup handler, which, for
  154. an sa-aware component, is the CLC_CLI_CLEANUP command. This causes the cleanup
  155. operation of the clc_cli_script to be run. This cleanup command then reads the
  156. PID of the process, which was stored in /var/run at startup of the testamf1
  157. program, and executes a kill -9 on that PID. Custom cleanup operations can
  158. be executed by modifying the clc_cli_script program.
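The cleanup step described above (read a stored PID, then kill -9 it) can be sketched in Python. This is not the clc_cli_script itself. The function name cleanup and the pid file name are hypothetical, and the demonstration uses a temporary directory and a throwaway child process instead of /var/run and testamf1. It assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys
import tempfile


def cleanup(pid_file):
    """Read a PID from pid_file and send SIGKILL, like `kill -9`."""
    with open(pid_file) as handle:
        pid = int(handle.read().strip())
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return False  # the process was already gone
    return True


# Demonstration against a throwaway child process and a temporary pid file.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
with tempfile.TemporaryDirectory() as run_dir:
    pid_file = os.path.join(run_dir, "example.pid")  # hypothetical file name
    with open(pid_file, "w") as handle:
        handle.write(str(child.pid))
    killed = cleanup(pid_file)
exit_status = child.wait()  # negative signal number when killed by a signal
print(killed, exit_status)
```

A custom cleanup operation would add its own steps, such as removing stale files, before or after the kill.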
  159. After this is done 4 times (configurable) the entire service
  160. unit is terminated and restarted. Once this happens 6 times, the code
  161. escalates to level 2, which is currently unimplemented.
  162. Currently working:
  163. component register, healthcheck start and stop, csi assignment, n+m with
  164. all 6 reduction levels, error report, amf response, terminate, cleanup and
  165. restart escalation levels 0-1, single node (multinode not tested),
  166. setting presence and operational state of components internally, initial
  167. assignment of n+m csis based upon configuration options and fully
  168. following AIS AMF B spec.
  169. Not working or not tested:
  170. escalation levels 2-3 (switchover/failover), protection group tracking,
  171. protection groups in general, any other model besides n+m, AMF B
  172. specified reassignment of CSIs to terminated and restarted components,
  173. support for proxied or non-sa-aware components. The state machine for n+m
  174. needs a lot of work after initial start. Timeout periods to reduce
  175. escalation level for escalation policies are unimplemented.
  176. Any feedback appreciated.
  177. Keep in mind this is very early code and may have many bugs which I'd
  178. be happy to have reported :).