AMF B.02.01 Implementation
--------------------------
The implementation of AMF in openais is directed by the specification
SAI-AIS-AMF-B.02.01, see http://www.saforum.org/specification/.

What does AMF do?
-----------------
The AMF has several major duties:
* issuing instantiate, terminate, and cleanup operations for components
* assigning component service instances to components
* executing recovery and repair actions on fault reports delivered
  by components (fault detection is a responsibility of all entities
  in the system)

An AMF user has to provide instantiate and cleanup commands and a
configuration file, in addition to the binaries that represent the actual
components.

To start a component, AMF executes the instantiate command, which starts the
processes that are part of the component. AMF can stop the component
abruptly by running the cleanup command.

A service unit (SU) contains multiple components, represents a
"useable service", and is configured to execute on an AMF node. The AMF node
is mapped in the configuration to a CLM node, which is "an operating system
instance". An SU is the smallest part that can be instantiated in a redundant
manner and can therefore be viewed as the unit of redundancy.

A service group (SG) contains multiple SUs. The SG is the unit that implements
high availability by managing its contained service units. An SG can be
configured to execute different redundancy policies.

An application contains multiple SGs and multiple service instances (SIs).
An SI represents the workload for an SU. An SI consists of one or more
component service instances (CSIs).

A CSI represents the workload of a component. The CSI is configured to include
a list of name/value pairs through which the user can express the workload.
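
The containment relationships described above can be pictured as nested C
structures. This is only an illustrative sketch: every type, field and
function name below (amf_app, amf_sg, amf_csi_attr_get and so on) is
hypothetical and is not taken from the openais sources or from amf.conf.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical types for illustration; not the openais data structures. */
struct amf_csi_attr { const char *name; const char *value; };

struct amf_csi {                       /* workload of one component */
    const char *name;
    const struct amf_csi_attr *attrs;  /* name/value pairs */
    size_t n_attrs;
};

struct amf_si {                        /* workload of one SU */
    const char *name;
    const struct amf_csi *csis;
    size_t n_csis;
};

struct amf_comp { const char *name; };

struct amf_su {                        /* unit of redundancy */
    const char *name;
    const char *amf_node;              /* AMF node, mapped to a CLM node */
    const struct amf_comp *comps;
    size_t n_comps;
};

struct amf_sg { const char *name; const struct amf_su *sus; size_t n_sus; };

struct amf_app {
    const char *name;
    const struct amf_sg *sgs; size_t n_sgs;
    const struct amf_si *sis; size_t n_sis;
};

/* Total number of components in all SUs of all SGs of an application. */
size_t amf_app_comp_count(const struct amf_app *app)
{
    size_t total = 0;
    for (size_t g = 0; g < app->n_sgs; g++)
        for (size_t u = 0; u < app->sgs[g].n_sus; u++)
            total += app->sgs[g].sus[u].n_comps;
    return total;
}

/* Look up a CSI attribute by name; returns NULL if it is not set. */
const char *amf_csi_attr_get(const struct amf_csi *csi, const char *name)
{
    for (size_t i = 0; i < csi->n_attrs; i++)
        if (strcmp(csi->attrs[i].name, name) == 0)
            return csi->attrs[i].value;
    return NULL;
}
```

In this model an application with one SG holding two SUs of two components
each yields a component count of 4, and a component reads its workload
parameters through amf_csi_attr_get().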

The AMF specification defines several types of components and is explicit
about which CLC (component life cycle) operations occur for which
component types.

If a component is not SA-aware, the only level of high availability that
can be applied to the application is through execution of the CLC interfaces.
A special component, called a proxy component, can be used to present an
SA-aware component to AMF in order to manage a non-SA-aware component. This is
useful, for example, to implement a healthcheck operation which runs some
operation of the unmodified application service.

Components that are SA-aware have been written specifically to the AMF
interfaces. These components provide the most support for high availability
for application developers.

When an SA-aware component has been instantiated, it has to register within a
certain time. After a successful registration, AMF assigns workload to the
component by making callbacks once the service unit is available to take service.
There is one callback for each CSI assignment. Each CSI assignment has
an associated HA state which indicates how the component shall act.
The HA state can be ACTIVE, STANDBY, QUIESCED or QUIESCING.
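
How a component might react to one CSI-assignment callback can be sketched
as follows. The enum and function names are hypothetical stand-ins (a real
component receives the HA state through the AMF callback interface), and the
mapping of each state to an action is an assumption based on the usual
meaning of the four states, not text from this README.

```c
/* Hypothetical names for illustration; not the AMF API types. */
enum ha_state { HA_ACTIVE, HA_STANDBY, HA_QUIESCED, HA_QUIESCING };

enum comp_action {
    ACT_PROVIDE_SERVICE,   /* ACTIVE: serve the workload */
    ACT_PREPARE_TAKEOVER,  /* STANDBY: be ready to take over */
    ACT_IDLE,              /* QUIESCED: workload has been given up */
    ACT_DRAIN              /* QUIESCING: finish ongoing work, accept no new */
};

struct csi_assignment { const char *csi_name; enum ha_state ha; };

/* Called once per CSI assignment: record it and pick the behaviour. */
enum comp_action on_csi_set(struct csi_assignment *a,
                            const char *csi_name, enum ha_state ha)
{
    a->csi_name = csi_name;
    a->ha = ha;
    switch (ha) {
    case HA_ACTIVE:    return ACT_PROVIDE_SERVICE;
    case HA_STANDBY:   return ACT_PREPARE_TAKEOVER;
    case HA_QUIESCED:  return ACT_IDLE;
    case HA_QUIESCING: return ACT_DRAIN;
    }
    return ACT_IDLE;  /* unknown value: stay passive */
}
```

A component that holds several CSIs would keep one csi_assignment record per
CSI, since each assignment carries its own HA state.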

The number of CSIs assigned to a component and the setting of their HA state
are determined by AMF. In the configuration, the operator specifies the preferred
assignment of workload to the defined SUs. The configuration also specifies
limits for how much work each SU can execute. If the preferred distribution
of workload cannot be met due to problems in the cluster, AMF executes a
reduction procedure with 6 levels of reduction. The purpose of the reduction
procedure is to come as close as possible to the preferred configuration without
violating any limits for how much workload an SU can handle. The reduction
procedure continues until there are no SUs in-service in the SG.

AMF supports fault detection through a healthcheck API. The user
specifies healthcheck keys and timing parameters in the configuration file.
The application developer then uses this configuration to register
a healthcheck operation with the AMF. The healthcheck operation can be started
or stopped. Once started, the AMF periodically sends a request to the
component to determine its level of health. Optionally, AMF can be configured to
instead expect the component to report its health periodically.

The AMF reacts to negative or failed healthchecks by executing
a recovery policy.
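
The two healthcheck variants can be modelled roughly as below. This is a
sketch under stated assumptions: the struct, its field names (period_ms,
max_duration_ms) and the decision rule are hypothetical and only illustrate
the idea of a negative answer versus a missing answer; they are not the
openais implementation.

```c
/* Hypothetical model; names and fields are illustrative assumptions. */
enum hc_invocation { HC_AMF_INVOKED, HC_COMPONENT_INVOKED };
enum hc_answer { HC_NO_ANSWER, HC_OK, HC_NEGATIVE };

struct healthcheck {
    enum hc_invocation invocation;
    unsigned period_ms;        /* interval between healthchecks */
    unsigned max_duration_ms;  /* time allowed to answer a request */
    int started;               /* healthchecks can be started or stopped */
};

/*
 * Decide whether the recovery policy should run.
 * elapsed_ms is the time since AMF sent its request (AMF-invoked) or
 * since the component last reported (component-invoked).
 */
int healthcheck_needs_recovery(const struct healthcheck *hc,
                               enum hc_answer answer, unsigned elapsed_ms)
{
    unsigned deadline;

    if (!hc->started)
        return 0;                       /* stopped: nothing to supervise */
    if (answer == HC_NEGATIVE)
        return 1;                       /* component reported bad health */
    if (answer == HC_OK)
        return 0;
    deadline = (hc->invocation == HC_AMF_INVOKED)
                   ? hc->max_duration_ms  /* reply to a request is overdue */
                   : hc->period_ms;       /* periodic report is overdue */
    return elapsed_ms > deadline;
}
```

The point of the sketch is that both a negative answer and an overdue answer
lead to the same place: the recovery policy.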

The AMF specification also includes an API for reporting errors with a
recommended recovery action. AMF will not take a weaker recovery action than
what is recommended but may take a stronger action based on the recovery
escalation policy.

There is a recovery escalation policy for the recommendations:
- component restart
- component failover

When AMF receives a recommendation to restart a component, the recovery policy
attempts to restart the component first. When the component has been restarted
and has failed a certain number of times within a timeout period, the entire
service unit is restarted. When the SU has been restarted a certain number of
times within a certain timeout period, the SU is failed over to a standby SU.
If AMF fails over too many service units out of the same node in a given time
period as a consequence of error reports with either component restart or
component failover recommended recovery actions, the AMF escalates the
recovery to an entire node fail-over.
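
The escalation ladder described above can be sketched as a small counter
machine. The sketch is a simplification under explicit assumptions: the
timeout windows are left out, a counter is reset whenever the next level is
taken, and all names (struct escalation, escalate, the *_max fields) are
hypothetical rather than openais identifiers.

```c
/* Hypothetical model of the escalation ladder; time windows are omitted. */
enum recovery {
    RECOVER_COMP_RESTART,
    RECOVER_SU_RESTART,
    RECOVER_SU_FAILOVER,
    RECOVER_NODE_FAILOVER
};

struct escalation {
    unsigned comp_restart_max;  /* component restarts before SU restart */
    unsigned su_restart_max;    /* SU restarts before SU failover */
    unsigned su_failover_max;   /* SU failovers before node failover */
    unsigned comp_restarts;
    unsigned su_restarts;
    unsigned su_failovers;
};

/* Return the recovery action for one more error report. */
enum recovery escalate(struct escalation *e)
{
    if (e->comp_restarts < e->comp_restart_max) {
        e->comp_restarts++;
        return RECOVER_COMP_RESTART;
    }
    e->comp_restarts = 0;           /* level exhausted: go one step up */
    if (e->su_restarts < e->su_restart_max) {
        e->su_restarts++;
        return RECOVER_SU_RESTART;
    }
    e->su_restarts = 0;
    if (e->su_failovers < e->su_failover_max) {
        e->su_failovers++;
        return RECOVER_SU_FAILOVER;
    }
    return RECOVER_NODE_FAILOVER;
}
```

With limits of 2 component restarts and 3 SU restarts, this particular model
answers the first two reports with a component restart, the third with an SU
restart, and the twelfth with an SU failover. The exact counting in a real
AMF depends on the timeout windows that the sketch ignores.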

What is currently implemented ?
-------------------------------
SA-aware components can be instantiated and assigned load according to the
configuration specified in amf.conf. Other types of components are currently
not supported. The processes of instantiation and assignment of workload are
both simplified compared to the requirements in the AMF specification.

Service units represented by their components can be configured to execute
on different nodes. AMF supports initial start of the cluster as well as adding
a node to the cluster after the initial start. AMF also handles a node
leaving the cluster by failing over the workload to standby service units.

Healthchecks are implemented as specified with only a few details missing.

The error report API is implemented, but AMF ignores the recommended
recovery action; instead it always tries to recover by 'component restart'.

The error escalation mechanism up to SU failover is also implemented as
specified with a few simplifications.

Only redundancy model N+M is (partly) implemented.

You can find a detailed list of what is NOT implemented later in the README.

How to configure AMF
--------------------
The AMF specification doesn't specify a configuration file format. It does,
however, describe many configuration options, which are specified formally in
SAI-Overview-B.02.01 chapters 4.5 - 4.11. The Overview can also be retrieved
from http://www.saforum.org/specification/.

An implementation-specific feature of openais is to implement the configuration
options in a file called amf.conf. There is a man page in the /man directory
which describes the syntax of amf.conf and which configuration options
are currently supported.

The example programs
--------------------
First the openais example programs should be installed. When compiling openais,
a file called openais-instantiate is created in the exec directory. Copy this
file to a test directory of your own:

    mkdir /tmp/aisexample
    exec# cp openais-instantiate /tmp/aisexample

Also copy the script which implements the instantiate, terminate and cleanup
operations to your test directory:

    exec# cp ../test/clc_cli_script /tmp/aisexample/clc_cli_script

Set execute permissions for the clc_cli_script:

    exec# chmod +x /tmp/aisexample/clc_cli_script

Copy the binary to be used for all components:

    exec# cp ../test/testamf1 /tmp/aisexample/testamf1

Copy the AMF example configuration files from the openais/conf directory to
your test directory:

    exec# cp ../conf/*amf_example.conf /tmp/aisexample

Set environment variables to the names of the configuration files:

    setenv OPENAIS_AMF_CONFIG_FILE /tmp/aisexample/amf_example.conf
    setenv OPENAIS_MAIN_CONFIG_FILE /tmp/aisexample/openais_amf_example.conf

You have to specify the host on which you would like to execute the AMF example.
Open the file 'amf_example.conf' and replace the line:

    saAmfNodeClmNode=p01

in the following section in the cluster configuration:

    safAmfNode = AMF1 {
        saAmfNodeSuFailOverProb=2000
        saAmfNodeSuFailoverMax=2
        saAmfNodeClmNode=p01
    }

Replace p01 with the name of your host.
(You can obtain the name of your host by typing the command 'hostname' in a
shell.)

Modify the following rows of 'openais_amf_example.conf' so that they match your
user and group:

    aisexec {
        user: nisse
        group: users
    }

(One way to obtain your user and group is to type the command 'id' in a shell.)

Start aisexec with the command:

    ./aisexec

aisexec will run in the background.

Once aisexec is running with the example configuration file, 2 service units
will be instantiated. The testamf1 C code is used for both component A
and component B of both SUs. The testamf1 program determines its
component name at start time from the saAmfComponentNameGet() API call.
The result is that 4 processes are started by AMF.

Each testamf1 process first tries to register a bad component name and
thereafter registers the name returned from saAmfComponentNameGet().
The testamf1 processes are assigned CSIs after they execute a
saAmfComponentRegister() API call. Note that a successful registration causes
the state of the component and service units to be set to INSTANTIATED as
required by the AMF specification. The service instances and their names are
defined within the configuration file.

The component of type saAmfCSTypeName = B which has the active HA state,
in this case safComp=B,safSu=SERVICE_X_1,safSg=RAID,safApp=APP-1,
reports an error via saAmfErrorReport() after exactly 10 healthchecks.
The healthcheck period is configured to 1 second, so one error report is sent
every 10 seconds.
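
The counting behaviour described above can be sketched in a few lines. This
is not the testamf1 source: the struct and function names are hypothetical,
and the only detail taken from this README is that the limit comes from the
good_health_limit CSI attribute and that the healthcheck period is 1 second.

```c
/* Hypothetical sketch of "report an error at the Nth healthcheck". */
struct health_counter {
    unsigned good_health_limit;  /* e.g. 10 in the example configuration */
    unsigned count;              /* healthchecks seen since the last report */
};

/* Call once per healthcheck; returns 1 when an error should be reported. */
int healthcheck_tick(struct health_counter *hc)
{
    hc->count++;
    if (hc->count >= hc->good_health_limit) {
        hc->count = 0;           /* a restarted process starts counting anew */
        return 1;
    }
    return 0;
}

/* Seconds between two error reports for a given healthcheck period. */
unsigned error_report_interval_s(unsigned good_health_limit, unsigned period_s)
{
    return good_health_limit * period_s;
}
```

With a limit of 10 and a period of 1 second, error_report_interval_s()
gives the 10 seconds between error reports mentioned above.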

This results in openais calling the cleanup handler, which for
an SA-aware component is the CLC_CLI_CLEANUP command. This causes the cleanup
operation of the clc_cli_script to be run. The cleanup command then reads the
PID of the process, which was stored in /var/run (or /tmp) at startup of the
testamf1 program, and executes a kill -9 on that PID. Custom cleanup
operations can be executed by modifying the clc_cli_script script.

After this is done 2 times (configurable), the entire service
unit is terminated and restarted due to the error escalation mechanism. Once
this happens 3 times (also configurable), the code escalates to level 2 and a
failover of the SU takes place. After this, testamf1 makes no more error
reports and nothing will happen until some problem is recognized (for example,
the process of one of the components stops executing).

The states of the cluster and its contained entities can be obtained by issuing
the following command in the shell:

    pkill -USR2 ais

Some notes:
-----------
In the example, testamf1 sends an error report at the 10th healthcheck.
This is controlled by the safCSIAttr = good_health_limit in the
file amf_example.conf and can be changed as you like.

The file openais_amf_example.conf specifies logging to stderr.
If you would like to follow the execution of the AMF in openais more closely,
debug printouts can be enabled.

Example:

    logging {
        fileline: off
        to_stderr: yes
        to_file: no
        logfile: /tmp/openais.log
        debug: off
        timestamp: on
        logger {
            ident: AMF
            debug: on
            tags: enter|leave|trace1|trace2|trace3|trace4|trace6
        }
    }

Setting 'debug: on' in the outer logging section generally produces many
printouts from all other parts of openais.

Run the example on a cluster with 2 nodes
-----------------------------------------
It is easy to run the example on more than one node.

Modify the file openais_amf_example.conf:

<1>
Replace the following line:

    bindnetaddr: 127.0.0.0

bindnetaddr specifies the address which the openais Executive should bind to.
This address should always end in zero. If the local interface over which
traffic should be routed is 192.168.5.92, set bindnetaddr to 192.168.5.0.
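
The rule above (keep the first three octets, end in zero) can be written as
a small helper. This is a sketch that assumes a 255.255.255.0 netmask, which
matches the 192.168.5.92 example but may not match every network; the
function name bindnetaddr_for is hypothetical and not part of openais.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Derive a bindnetaddr value from an IPv4 interface address by zeroing
 * the last octet (assumes a /24 network). Returns 0 on success and -1
 * if the input is not a dotted-quad IPv4 address.
 */
int bindnetaddr_for(const char *ip, char *out, size_t outlen)
{
    unsigned a, b, c, d;
    char tail;

    /* The trailing %c catches garbage after the fourth octet. */
    if (sscanf(ip, "%u.%u.%u.%u%c", &a, &b, &c, &d, &tail) != 4)
        return -1;
    if (a > 255 || b > 255 || c > 255 || d > 255)
        return -1;
    snprintf(out, outlen, "%u.%u.%u.0", a, b, c);
    return 0;
}
```

Calling bindnetaddr_for("192.168.5.92", buf, sizeof buf) fills buf with
"192.168.5.0", the value used in the example above.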

Modify amf_example.conf like this:

<1>
Remove the comment character '#' from the following lines:

    # safAmfNode = AMF2 {
    # saAmfNodeSuFailOverProb=2000
    # saAmfNodeSuFailoverMax=2
    # saAmfNodeClmNode=p02
    # }

and replace p02 with the name of your second machine.

<2>
Locate the following two lines:

    saAmfSUHostedByNode=AMF1
    # saAmfSUHostedByNode=AMF2

Replace them with:

    # saAmfSUHostedByNode=AMF1
    saAmfSUHostedByNode=AMF2

Feedback
--------
Any feedback is appreciated.

Keep in mind that only parts of the functionality are implemented. Reports of
bugs or behaviour not compliant with the AMF specification within the
implemented parts are greatly appreciated :-).

What is currently NOT implemented ?
-----------------------------------
The following list specifies all chapters of the AMF specification which
currently are NOT fully implemented. The deviations from the specification are
described briefly, except in those cases where none of the requirements in the
chapter are implemented.

Chapter:                                Deviation:
---------                               ----------
3.3.1.2 Administrative State            Not supported (always UNLOCKED).
3.3.1.4 Readiness State                 State STOPPING is not supported.
3.3.1.5 Service Unit’s HA State ...     State QUIESCING is not supported.
3.3.2.2 Operational State               AMF does not detect errors in the
                                        following cases:
                                        • A command used by the Availability
                                          Management Framework to control the
                                          component life cycle returned an
                                          error or did not return in time.
                                        • The component fails to respond in
                                          time to an Availability Management
                                          Framework's callback.
                                        • The component responds to an
                                          Availability Management Framework's
                                          state change callback
                                          (SaAmfCSISetCallbackT) with an error.
                                        • If the component is SA-aware, and it
                                          does not register with the
                                          Availability Management Framework
                                          within the preconfigured time-period
                                          after its instantiation.
                                        • If the component is SA-aware, and it
                                          unexpectedly unregisters with the
                                          Availability Management Framework.
                                        • The component terminates unexpectedly.
                                        • When a fail-over recovery operation
                                          performed at the level of the service
                                          unit or the node containing the
                                          service unit triggers an abrupt
                                          termination of the component.
3.3.2.3 Readiness State                 State STOPPING is not supported.
3.3.2.4 Component’s HA State per ...    State QUIESCING is not supported.
3.3.3.1 Administrative State            Not supported (always UNLOCKED).
3.3.5 Service Group States              Administrative state is not supported
                                        (always UNLOCKED).
3.3.6.1 Administrative State            Not supported (always UNLOCKED).
3.3.6.2 Operational State               None of the rules for transition between
                                        states are implemented.
3.3.7 Application States                Administrative state is not supported
                                        (always UNLOCKED).
3.3.8 Cluster States                    Administrative state is not supported
                                        (always UNLOCKED).
3.5.1 Combined States for Pre-Inst....  Only Administrative state = UNLOCKED is
                                        supported.
3.5.2 Combined States for Non-Pre-I...  Not supported.
3.6 Component Capability Model          Configuration of capability model is
                                        ignored. AMF expects all components to
                                        have the x_active_or_y_standby
                                        capability.
3.7.2 2N Redundancy Model               Not supported.
3.7.3.1 Basics                          Spare service units cannot be handled
                                        properly.
3.7.3.3 Configuration                   • Ordered list of service units for a
                                          service group: Not supported
                                          (the order is unpredictable).
                                        • Ordered list of SIs: Neither ranking
                                          nor dependencies among SIs are
                                          supported. SIs are assigned to SUs in
                                          any order.
                                        • Auto-adjust option: Not supported.
                                          Auto-adjust is never done.
3.7.3.5.1 Handling of a Node Failure..  Not supported.
3.7.3.6 An Example of Auto-adjust       Not supported.
3.7.4 N-Way Redundancy Model            Not supported.
3.7.5 N-Way Active Redundancy Model     Not supported.
3.7.6 No Redundancy Model               Not supported.
3.7.7 The Effect of Administrative...   Not supported.
3.9 Dependencies Among SIs, Compone..   Not supported.
3.11 Component Monitoring               • Passive Monitoring: Not supported.
                                        • External Active Monitoring:
                                          Not supported.
3.12.1.1 Error Detection                AMF does not support a component
                                        reporting an error for another
                                        component.
3.12.1.2 Restart                        • AMF does not support terminating
                                          components through the terminate
                                          call-back or the TERMINATE command.
                                        • AMF does not consider component
                                          instantiation-level at restart.
                                        • The configuration option
                                          disableRestart is not supported.
3.12.1.3 Recovery                       • Component or Service Unit Fail-Over:
                                          • Component fail-over is not
                                            implemented.
                                          • Only SU fail-over is implemented and
                                            the only way to trigger that case is
                                            by error escalation.
                                        • Node Switch-Over: Not implemented.
                                        • Node Fail-Over: Not implemented.
                                        • Node Fail-Fast: Not implemented.
                                        • The configuration option
                                          recoveryOnFailure is not handled,
                                          i.e. is never evaluated.
3.12.1.4 Repair                         • The configuration attribute for
                                          automatic repair is not evaluated.
                                        • The administrative operation
                                          SA_AMF_ADMIN_REPAIRED is not
                                          implemented.
                                        • Repair after component fail-over
                                          is not implemented.
                                        • Node leave while performing
                                          automatic repair of that node
                                          is not implemented.
                                        • Service unit failover recovery:
                                          Is implemented except that an attempt
                                          to repair is always done
                                          (configuration attribute is not
                                          evaluated).
                                        • Repair after Node Switch-Over,
                                          Fail-Over or Fail-Fast
                                          is not implemented.
3.12.1.5 Recovery Escalation            The recommended recovery action is not
                                        evaluated at the reception of an error
                                        report.
3.12.2.1 Recommended Recovery Action    The recommended recovery action is
                                        never evaluated. Recovery action
                                        SA_AMF_COMPONENT_RESTART is always
                                        assumed.
3.12.2.2 Escalations of Levels 1 and 2  Is implemented with the following
                                        exception:
                                        • The configuration attribute
                                          component_restart_max is compared to
                                          the restart counter of the component
                                          that has reported the error instead of
                                          against the sum of all restart
                                          counters of all components within
                                          the SU.
3.12.2.3 Escalation of Level 3          Not implemented.
4.2 CLC-CLI's Environment Variables     Translation of non-printable Unicode
                                        characters is not supported.
4.4 INSTANTIATE Command                 • AMF does not evaluate the exit code of
                                          the INSTANTIATE command as described
                                          in the specification.
                                        • AMF does not supervise whether an
                                          SA-aware component registers itself
                                          within the configured time limit.
                                        As a consequence, none of the recovery
                                        actions described are implemented.
4.5 TERMINATE Command                   Not supported.
4.6 CLEANUP Command                     AMF does not evaluate the exit code of
                                        the CLEANUP command and thus does not
                                        implement any recovery action.
4.7 AM_START Command                    Not supported.
4.8 AM_STOP Command                     Not supported.
5 Proxied Component Management          Not implemented.
7 Administrative API                    Not implemented.
8 Basic Operational Scenarios           Not implemented.
9 Alarms and Notifications              Not implemented.
Appendix A: Implementation of CLC ..    CLC-interfaces are partly implemented
                                        for SA-aware components.
                                        The terminate operation,
                                        saAmfComponentTerminateCallback(),
                                        is never called.
                                        No CLC-interfaces are implemented for
                                        any other type of component.
Appendix B: API functions in Unre....   AMF does not verify that the rules
                                        described are fulfilled.

Which functions of the AMF API are currently NOT implemented ?
--------------------------------------------------------------
Function                                        Deviation
--------                                        ---------
saAmfComponentUnregister()                      Is implemented in the library
                                                but not in aisexec.
saAmfPmStart()                                  Is implemented in the library
                                                but not in aisexec.
saAmfPmStop()                                   Is implemented in the library
                                                but not in aisexec.
saAmfHealthcheckStart()                         This function takes a parameter
                                                of type SaAmfRecommendedRecoveryT.
                                                The value of this parameter is
                                                supposed to specify what kind of
                                                recovery AMF should execute if
                                                the component fails a health
                                                check. AMF does not read the
                                                value of this parameter but
                                                instead always tries to recover
                                                the component by a component
                                                restart.
void (*SaAmfCSIRemoveCallbackT)()               AMF will never make a call-back
                                                to this function.
void
(*SaAmfComponentTerminateCallbackT)()           AMF will never make a call-back
                                                to this function.
void
(*SaAmfProxiedComponentInstantiateCallbackT)()  AMF will never make a call-back
                                                to this function.
void
(*SaAmfProxiedComponentCleanupCallbackT)()      AMF will never make a call-back
                                                to this function.
saAmfProtectionGroupTrack()                     Is implemented in the library
                                                but not in aisexec.
saAmfProtectionGroupTrackStop()                 Is implemented in the library
                                                but not in aisexec.
void (*SaAmfProtectionGroupTrackCallbackT)()    AMF will never make a call-back
                                                to this function.
saAmfProtectionGroupNotificationFree()          Not implemented.
saAmfComponentErrorReport()                     This function takes a parameter
                                                of type SaAmfRecommendedRecoveryT.
                                                The value of this parameter is
                                                supposed to specify what kind of
                                                recovery AMF should execute for
                                                the reported error. AMF does not
                                                read the value of this parameter
                                                but instead always tries to
                                                recover the component by a
                                                component restart.
saAmfComponentErrorClear()                      Is implemented in the library
                                                but not in aisexec.