

  1. Copyright (c) 2002-2004 MontaVista Software, Inc.
  2. All rights reserved.
  3. This software licensed under BSD license, the text of which follows:
  4. Redistribution and use in source and binary forms, with or without
  5. modification, are permitted provided that the following conditions are met:
  6. - Redistributions of source code must retain the above copyright notice,
  7. this list of conditions and the following disclaimer.
  8. - Redistributions in binary form must reproduce the above copyright notice,
  9. this list of conditions and the following disclaimer in the documentation
  10. and/or other materials provided with the distribution.
  11. - Neither the name of the MontaVista Software, Inc. nor the names of its
  12. contributors may be used to endorse or promote products derived from this
  13. software without specific prior written permission.
  14. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
  15. AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  16. IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  17. ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
  18. LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
  19. CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
  20. SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
  21. INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
  22. CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
  23. ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
  24. THE POSSIBILITY OF SUCH DAMAGE.
  25. -------------------------------------------------------------------------------
  26. This file provides a map for developers to understand how to contribute
  27. to the openais project. The purpose of this document is to prepare a
  28. developer to write a service for openais, or understand the architecture
  29. of openais.
  30. The following is described in this document:
  31. * all files, purpose, and dependencies
  32. * architecture of openais
  33. * taking advantage of virtual synchrony
  34. * adding libraries
  35. * adding services
  36. -------------------------------------------------------------------------------
  37. all files, purpose, and dependencies.
  38. -------------------------------------------------------------------------------
  39. *----------------*
  40. *- AIS INCLUDES -*
  41. *----------------*
  42. include/ais_amf.h
  43. -----------------
  44. Definitions for AMF interface.
  45. include/ais_ckpt.h
  46. ------------------
  47. Definitions for CKPT interface.
  48. include/ais_clm.h
  49. -----------------
  50. Definitions for CLM interface.
  51. include/ais_msg.h
  52. -----------------
Definitions that specify how the libraries and the executive communicate,
including message identifiers, message request data, and message response
data.
  56. include/ais_types.h
  57. -------------------
  58. Base type definitions for AIS interface.
  59. include/list.h
  60. -------------
  61. Doubly linked list inline implementation.
  62. include/queue.h
  63. ---------------
  64. FIFO queue inline implementation.
  65. depends on list.
  66. include/sq.h
  67. ------------
Sort queue where items are ordered by sequence number. Because no sorting
is needed, inserting a new element is O(1). Inline implementation.
  70. depends on list.
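The O(1) insert can be pictured with a minimal sketch, assuming a fixed-size
window of slots indexed by sequence number. All names below (sq_sketch,
sq_sketch_put, sq_sketch_get_head, SQ_SKETCH_SIZE) are hypothetical and are
not the interface of include/sq.h.

```c
#include <string.h>

#define SQ_SKETCH_SIZE 8    /* hypothetical window size */

struct sq_sketch {
    unsigned int head_seqid;        /* lowest sequence number in the window */
    int in_use[SQ_SKETCH_SIZE];     /* 1 if the slot holds an item */
    int items[SQ_SKETCH_SIZE];      /* payload, an int for brevity */
};

static void sq_sketch_init (struct sq_sketch *sq, unsigned int head_seqid)
{
    memset (sq, 0, sizeof (*sq));
    sq->head_seqid = head_seqid;
}

/* Store an item in the slot chosen by its sequence number: O(1), no sort. */
static int sq_sketch_put (struct sq_sketch *sq, unsigned int seqid, int item)
{
    unsigned int slot;

    if (seqid < sq->head_seqid ||
        seqid - sq->head_seqid >= SQ_SKETCH_SIZE) {
        return (-1);    /* outside the window */
    }
    slot = seqid % SQ_SKETCH_SIZE;
    sq->in_use[slot] = 1;
    sq->items[slot] = item;
    return (0);
}

/* Remove the head item only if it has arrived, so delivery stays in order. */
static int sq_sketch_get_head (struct sq_sketch *sq, int *item)
{
    unsigned int slot = sq->head_seqid % SQ_SKETCH_SIZE;

    if (sq->in_use[slot] == 0) {
        return (-1);    /* gap: the head item has not been received yet */
    }
    *item = sq->items[slot];
    sq->in_use[slot] = 0;
    sq->head_seqid += 1;
    return (0);
}
```

Items may be put in any order; sq_sketch_get_head only succeeds once the
next expected sequence number is present.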
  71. *---------------*
  72. * AIS LIBRARIES *
  73. *---------------*
  74. lib/clm.c
  75. ---------
  76. CLM user library linked into user application.
  77. lib/amf.c
  78. ---------
  79. AMF user library linked into user application.
  80. lib/ckpt.c
  81. ----------
  82. CKPT user library linked into user application.
  83. lib/evt.c
  84. ----------
  85. EVT user library linked into user application.
  86. lib/util.c
  87. ----------
  88. Utility functions used by all libraries.
  89. *-----------------*
  90. *- AIS EXECUTIVE -*
  91. *-----------------*
  92. exec/amf.{h|c}
  93. -------------
  94. Server side implementation of Availability Management Framework (AMF API).
exec/ckpt.{h|c}
---------------
Server side implementation of Checkpointing (CKPT API).
exec/clm.{h|c}
--------------
Server side implementation of Cluster Membership (CLM API).
exec/evt.{h|c}
--------------
Server side implementation of Event Service (EVT API).
  101. exec/gmi.{h|c}
  102. --------------
  103. group messaging interface supporting reliable totally ordered group multicast
  104. using ring topology. Supports extended virtual synchrony delivery semantics
  105. with strong membership guarantees.
  106. depends on aispoll.
  107. depends on queue.
  108. depends on sq.
  109. depends on list.
  110. exec/handlers.h
  111. ---------------
  112. Functional specification of a service that connects into AIS executive.
  113. If all functions are implemented, new services can easily be added.
  114. exec/main.{h|c}
  115. --------------
  116. Main dispatch functionality and global data types used to connect AIS
  117. services into one component.
  118. exec/mempool.{h|c}
  119. ------------------
  120. Memory pool implementation that supports preallocated memory blocks to
  121. avoid OOM errors.
  122. exec/parse.{h|c}
  123. ----------------
  124. Parsing functions for parsing /etc/ais/groups.conf and
  125. /etc/ais/network.conf into internally used data structures.
  126. exec/aispoll.{h|c}
  127. ------------------
Poll abstraction with support for a nearly unlimited number of poll
handlers and timer handlers.
  130. depends on tlist.
  131. exec/print.{h|c}
  132. ----------------
Logging implementation meant to replace syslog. syslog has the nasty side
effect of causing a signal every time a message is logged.
  135. exec/tlist.{h|c}
  136. -----------------
  137. Timer list interface for supporting timer addition, removal, expiry, and
  138. determination of timeout period left for next timer to expire.
  139. depends on list.
  140. exec/log/print.{h|c}
  141. --------------------
  142. Prototype implementation of logging to syslog without using syslog C
  143. library call.
  144. loc
  145. ---
  146. Counts the lines of code in the AIS implementation.
  147. -------------------------------------------------------------------------------
  148. architecture of openais
  149. -------------------------------------------------------------------------------
  150. The openais project is a client server architecture. Libraries implement the
  151. SA Forum APIs and are linked into the end-application. Libraries request
  152. services from the ais executive. The ais executive uses the group messaging
  153. protocol to provide cluster communication between multiple processors (nodes).
  154. Once the group makes a decision, a response is sent to the library, which then
  155. responds to the user API.
--------------------------------------------------
|  AIS CLM, AMF, CKPT, EVT library (openais.a)   |
--------------------------------------------------
|           Interprocess Communication           |
--------------------------------------------------
|               openais Executive                |
|                                                |
|  ---------  ---------  ---------  ---------    |
|  | AMF   |  | CLM   |  | CKPT  |  | EVT   |    |
|  |Service|  |Service|  |Service|  |Service|    |
|  ---------  ---------  ---------  ---------    |
|                                                |
|         -----------  -----------               |
|         | Group   |  | Poll    |               |
|         |Messaging|  |Interface|               |
|         |Interface|  -----------               |
|         -----------                            |
|                                                |
--------------------------------------------------
          Figure 1: openais Architecture
  176. Every application that intends to use openais links with the libais library.
  177. This library uses IPC, or more specifically BSD unix sockets, to communicate
  178. with the executive. The library is a small program responsible only for
  179. packaging the request into a message. This message is sent, using IPC, to
  180. the executive which then processes it. The library then waits for a response.
  181. The library itself contains very little intelligence. Some utility services
  182. are provided:
  183. * create a connection to the executive
  184. * send messages to the executive
  185. * retrieve messages from the executive
  186. * Queue message for out of order delivery to library (used for async calls)
  187. * Poll on a fd
  188. * request the executive send a dummy message to break out of dispatch poll
  189. * create a handle instance
  190. * destroy a handle instance
  191. * get a reference to a handle instance
  192. * release a reference to a handle instance
When a library connects, it sends the service type in a message. The
service type is stored and used later to look up both the library message
handlers and the executive message handlers for that service.
  196. Every message sent contains an integer identifier, which is used to index
  197. into an array of message handlers to determine the correct message handler
  198. to execute.
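A minimal sketch of this kind of table dispatch follows. The handler names,
the req_header_sketch type, and dispatch_sketch are hypothetical; the real
handler tables are defined by each service in the executive.

```c
#include <stddef.h>

struct req_header_sketch {
    int size;
    int id;     /* integer identifier used as the table index */
};

typedef int (*handler_fn_sketch) (void *msg);

/* Hypothetical handlers; each returns a distinct value for illustration. */
static int handle_activate_poll_sketch (void *msg) { (void)msg; return (0); }
static int handle_trackstart_sketch (void *msg) { (void)msg; return (1); }
static int handle_trackstop_sketch (void *msg) { (void)msg; return (2); }

/* One table per service; the message id selects the entry. */
static handler_fn_sketch clm_handlers_sketch[] = {
    handle_activate_poll_sketch,
    handle_trackstart_sketch,
    handle_trackstop_sketch
};

static int dispatch_sketch (handler_fn_sketch *handlers, int handlers_count,
    void *msg)
{
    struct req_header_sketch *header = msg;

    /* Reject identifiers that fall outside the table. */
    if (header->id < 0 || header->id >= handlers_count) {
        return (-1);
    }
    return (handlers[header->id] (msg));
}
```

The bounds check matters because the identifier arrives over IPC and should
not be trusted as an array index.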
When a library sends a message via IPC, the message is delivered to the
library message handler for the service specified in the service type.
The library message handler is responsible for sending the message via the
group messaging interface to all other processors (nodes) in the system via
the API gmi_mcast(). In this way, the library handlers are also very simple,
containing no more logic than what is required to repackage the message into
an executive message and send it via the group messaging interface.
  206. The group messaging interface sends the message according to the extended
  207. virtual synchrony model. The group messaging interface also delivers the
  208. message according to the extended virtual synchrony model. This has several
  209. advantages which are described in the virtual synchrony section. One
  210. advantage that must be described now is that messages are self-delivered;
  211. if a node sends a message, that same message is delivered back to that
  212. node.
  213. When the executive message is delivered, it is processed by the executive
  214. message handler. The executive message handler contains the brains of
  215. AIS and is responsible for making all decisions relating to the request
  216. from the libais library user.
  217. -------------------------------------------------------------------------------
  218. taking advantage of virtual synchrony
  219. -------------------------------------------------------------------------------
  220. definitions:
  221. processor: a system responsible for executing the virtual synchrony model
  222. configuration: the list of processors under which messages are delivered
  223. partition: one or more processors leave the configuration
  224. merge: one or more processors join the configuration
  225. group messaging: sending a message from one sender to many receivers
Virtual synchrony is a model for group messaging. This is often confused
with particular implementations of virtual synchrony. Try to focus on
what virtual synchrony provides, not how it provides it, unless interested
in working on the group messaging interface of openais.
  230. Virtual synchrony provides several advantages:
  231. * integrated membership
  232. * strong membership guarantees
  233. * agreed ordering of delivered messages
  234. * same delivery of configuration changes and messages on every node
  235. * self-delivery
  236. * reliable communication in the face of unreliable networks
  237. * recovery of messages sent within a configuration where possible
  238. * use of network multicast using standard UDP/IP
Integrated membership allows the group messaging interface to give
configuration change events to the API services. This is obviously beneficial
to the cluster membership service (and its respective API), but is also
helpful to other services as described later.
Strong membership guarantees allow a distributed application to make decisions
based upon the configuration (membership). Every service in openais registers
a configuration change function. This function is called whenever a
configuration change occurs. The information passed is the current processors,
the processors that have left the configuration, and the processors that have
joined the configuration. This information is then used to make decisions
within a distributed state machine. One example usage is that an AMF component
running on a specific processor has left the configuration, so failover actions
must now be taken with the new configuration (and known components).
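The failover example can be sketched as follows, assuming a hypothetical
component table keyed by node address. component_sketch and
confchg_failover_sketch are illustrative names only, and only the left list
is examined here.

```c
#include <netinet/in.h>

struct component_sketch {
    struct in_addr node;    /* processor the component runs on */
    int needs_failover;     /* set when that processor leaves */
};

/*
 * Mark every component whose processor appears in the left list.
 * Returns the number of components marked.
 */
static int confchg_failover_sketch (
    struct component_sketch *components, int component_count,
    struct sockaddr_in *left_list, int left_list_entries)
{
    int i;
    int j;
    int marked = 0;

    for (i = 0; i < component_count; i++) {
        for (j = 0; j < left_list_entries; j++) {
            if (components[i].node.s_addr ==
                left_list[j].sin_addr.s_addr) {
                components[i].needs_failover = 1;
                marked += 1;
                break;
            }
        }
    }
    return (marked);
}
```

Because every processor is given the same left list, every processor marks
the same components.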
  252. Virtual synchrony requires that messages may be delivered in agreed order.
  253. FIFO order indicates that one sender and one receiver agree on the order of
  254. messages sent. Agreed ordering takes this requirement to groups, requiring that
  255. one sender and all receivers agree on the order of messages sent.
Consider a lock service. The service is responsible for arbitrating locks
between multiple processors in the system. With FIFO ordering, this is very
difficult because requests for a lock made at about the same time by two
separate processors may arrive at the receivers in different orders. Agreed
ordering ensures that all the processors are delivered the messages in the
same order. In this case the first lock message will always be from processor
X, while the second lock message will always be from processor Y. Hence the
first request is always honored by all processors, and the second request is
rejected (since the lock is taken). This is how race conditions are avoided
in distributed systems.
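A minimal sketch of the lock example, assuming each processor simply replays
the delivered request stream. The names lock_sketch, lock_deliver_sketch, and
lock_replay_sketch are hypothetical; openais does not ship this lock service.

```c
#define LOCK_FREE_SKETCH (-1)

struct lock_sketch {
    int owner;      /* processor id holding the lock, or LOCK_FREE_SKETCH */
};

/* Apply one delivered lock request. Returns 1 if granted, 0 if rejected. */
static int lock_deliver_sketch (struct lock_sketch *lock, int processor)
{
    if (lock->owner == LOCK_FREE_SKETCH) {
        lock->owner = processor;
        return (1);
    }
    return (0);
}

/*
 * Replay a delivered request stream on one processor and return the
 * resulting owner. With agreed ordering every processor sees the same
 * stream, so every processor computes the same owner.
 */
static int lock_replay_sketch (const int *requests, int count)
{
    struct lock_sketch lock = { LOCK_FREE_SKETCH };
    int i;

    for (i = 0; i < count; i++) {
        (void)lock_deliver_sketch (&lock, requests[i]);
    }
    return (lock.owner);
}
```

If two processors were delivered the requests in different orders, the
replay would name different owners, which is exactly the race that agreed
ordering rules out.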
  266. Every processor is delivered a configuration change and messages within a
  267. configuration in the same order. This ensures that any distributed state
  268. machine will make the same decisions on every processor within the
  269. configuration. This also allows the configuration and the messages to be
  270. considered when making decisions.
Virtual synchrony requires that every node is delivered messages that it
sends. This enables the logic to be placed in one location (the handler
for the delivery of the group message) instead of two separate places. This
also allows messages that are sent to be ordered in the stream of other
messages within the configuration.
  276. Certain guarantees are required of virtually synchronous systems. If
  277. a message is sent, it must be delivered by every processor unless that
  278. processor fails. If a particular processor fails, a configuration change
  279. occurs creating a new configuration under which a new set of decisions
  280. may be made. This implies that even unreliable networks must reliably
  281. deliver messages. The implementation in openais works on unreliable as
  282. well as reliable networks.
  283. Every message sent must be delivered, unless a configuration change occurs.
  284. In the case of a configuration change, every message that can be recovered
  285. must be recovered before the new configuration is installed. Some systems
  286. during partition won't continue to recover messages within the old
  287. configuration even though those messages can be recovered. Virtual synchrony
  288. makes that impossible, except for those members that are no longer part
  289. of a configuration.
Finally, virtual synchrony takes advantage of hardware multicast to avoid
duplicated packets and scale to large transmit rates. On a 100 Mbit network,
openais can approach wire speeds depending on the number of messages queued
for a particular processor.
  294. What does all of this mean for the developer?
  295. * messages are delivered reliably
  296. * messages are delivered in the same order to all nodes
  297. * configuration and messages can both be used to make decisions
  298. -------------------------------------------------------------------------------
  299. adding libraries
  300. -------------------------------------------------------------------------------
  301. The first stage in adding a library to the system is to develop the library.
  302. Library code should follow these guidelines:
  303. * use SA Forum coding style for APIs to aid in debugging
  304. * implement all library code within one file named after the api.
  305. examples are ckpt.c, clm.c, amf.c.
  306. * use parallel structure as much as possible between different APIs
  307. * make use of utility services provided by the library
  308. * if something is needed that is generic and useful by all services,
  309. submit patches for other libraries to use these services.
  310. * use the reference counting handle manager for handle management.
  311. ------------------
  312. Version checking
  313. ------------------
struct saVersionDatabase {
    int versionCount;
    SaVersionT *versionsSupported;
};
  318. The versionCount number describes how many entries are in the version database.
  319. The versionsSupported member is an array of SaVersionT describing the acceptable
  320. versions this API supports.
  321. An api developer specifies versions supported by adding the following C
  322. code to the library file:
/*
 * Versions supported
 */
static SaVersionT clmVersionsSupported[] = {
    { 'A', 1, 1 },
    { 'a', 1, 1 }
};
static struct saVersionDatabase clmVersionDatabase = {
    sizeof (clmVersionsSupported) / sizeof (SaVersionT),
    clmVersionsSupported
};
  334. After this is specified, the following API is used to check versions:
  335. SaErrorT
  336. saVersionVerify (
  337. struct saVersionDatabase *versionDatabase,
  338. const SaVersionT *version);
An example usage of this is
    SaErrorT error;
    error = saVersionVerify (&clmVersionDatabase, version);
where version is a pointer to an SaVersionT passed into the API.
error is set to SA_OK if the version is valid as specified in the
version database.
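A minimal sketch of how such a check might work, assuming an exact match on
release code, major, and minor is required. The Sketch* types and error
values below are stand-ins so the example compiles on its own; they are not
the definitions from ais_types.h, and the real saVersionVerify may apply
different matching rules.

```c
#include <stddef.h>

/* Stand-in types and values, hypothetical. */
typedef struct {
    char releaseCode;
    unsigned char major;
    unsigned char minor;
} SketchVersionT;

enum sketch_error {
    SKETCH_OK,
    SKETCH_ERR_VERSION
};

struct sketchVersionDatabase {
    int versionCount;
    SketchVersionT *versionsSupported;
};

/* Scan the database for an entry equal to the requested version. */
static enum sketch_error sketchVersionVerify (
    struct sketchVersionDatabase *versionDatabase,
    const SketchVersionT *version)
{
    int i;

    if (version == NULL) {
        return (SKETCH_ERR_VERSION);
    }
    for (i = 0; i < versionDatabase->versionCount; i++) {
        const SketchVersionT *v = &versionDatabase->versionsSupported[i];

        if (v->releaseCode == version->releaseCode &&
            v->major == version->major &&
            v->minor == version->minor) {
            return (SKETCH_OK);
        }
    }
    return (SKETCH_ERR_VERSION);
}
```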
  345. ------------------
  346. Handle Instances
  347. ------------------
  348. Every handle instance is stored in a handle database. The handle database
  349. stores instance information for every handle used by libraries. The system
  350. includes reference counting and is safe for use in threaded applications.
  351. The handle database structure is:
struct saHandleDatabase {
    unsigned int handleCount;
    struct saHandle *handles;
    pthread_mutex_t mutex;
    void (*handleInstanceDestructor) (void *);
};
handleCount is the number of handles
handles is an array of handles
mutex is a pthread mutex used to mutually exclude access to the handle db
handleInstanceDestructor is a callback that is called when the handle
should be freed because its reference count has dropped to zero.
  363. The handle database is defined in a library as follows:
static void clmHandleInstanceDestructor (void *);
static struct saHandleDatabase clmHandleDatabase = {
    .handleCount = 0,
    .handles = 0,
    .mutex = PTHREAD_MUTEX_INITIALIZER,
    .handleInstanceDestructor = clmHandleInstanceDestructor
};
  371. There are several APIs to access the handle database:
  372. SaErrorT
  373. saHandleCreate (
  374. struct saHandleDatabase *handleDatabase,
  375. int instanceSize,
  376. int *handleOut);
Creates an instance of size instanceSize in the handleDatabase parameter,
returning the handle number in handleOut. The handle instance reference
count starts at the value 1.
  380. SaErrorT
  381. saHandleDestroy (
  382. struct saHandleDatabase *handleDatabase,
  383. unsigned int handle);
  384. Destroys further access to the handle. Once the handle reference count
  385. drops to zero, the database destructor is called for the handle. The handle
  386. instance reference count is decremented by 1.
  387. SaErrorT
  388. saHandleInstanceGet (
  389. struct saHandleDatabase *handleDatabase,
  390. unsigned int handle,
  391. void **instance);
Gets the instance specified by handle from the handleDatabase and returns
it in the instance parameter. If the handle is valid SA_OK is returned,
otherwise an error is returned. This is used to ensure a handle is
valid. Every get call increases the reference count on a handle instance
by one.
  397. SaErrorT
  398. saHandleInstancePut (
  399. struct saHandleDatabase *handleDatabase,
  400. unsigned int handle);
  401. Decrements the reference count by 1. If the reference count indicates
  402. the handle has been destroyed, it will then be removed from the database
  403. and the destructor called on the instance data. The put call takes care
  404. of freeing the handle instance data.
  405. Create a data structure for the instance, and use it within the libraries
  406. to store state information about the instance. This information can be
  407. the handle, a mutex for protecting I/O, a queue for queueing async messages
  408. or whatever is needed by the API.
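The reference counting rules above can be sketched with a simplified
database. This is a minimal sketch, assuming a fixed-size table and no mutex
(the real database is protected by a pthread mutex); every name ending in
_sketch is hypothetical and is not the saHandle* implementation.

```c
#include <stdlib.h>

#define SKETCH_MAX_HANDLES 16

struct handle_sketch {
    int in_use;
    int destroyed;      /* set by destroy; blocks further gets */
    int ref_count;
    void *instance;
};

struct handle_db_sketch {
    struct handle_sketch handles[SKETCH_MAX_HANDLES];
    void (*destructor) (void *);
};

/* Example destructor that only counts how often it is called. */
static int destructor_calls_sketch;
static void count_destructor_sketch (void *instance)
{
    (void)instance;
    destructor_calls_sketch += 1;
}

/* Allocate an instance; the reference count starts at 1. */
static int handle_create_sketch (struct handle_db_sketch *db, int size,
    unsigned int *handle_out)
{
    unsigned int i;

    for (i = 0; i < SKETCH_MAX_HANDLES; i++) {
        struct handle_sketch *h = &db->handles[i];

        if (h->in_use == 0) {
            h->instance = calloc (1, (size_t)size);
            if (h->instance == NULL) {
                return (-1);
            }
            h->in_use = 1;
            h->destroyed = 0;
            h->ref_count = 1;
            *handle_out = i;
            return (0);
        }
    }
    return (-1);
}

/* Drop one reference; the last put runs the destructor and frees. */
static int handle_put_sketch (struct handle_db_sketch *db, unsigned int handle)
{
    struct handle_sketch *h;

    if (handle >= SKETCH_MAX_HANDLES || db->handles[handle].in_use == 0) {
        return (-1);
    }
    h = &db->handles[handle];
    h->ref_count -= 1;
    if (h->ref_count == 0) {
        if (db->destructor != NULL) {
            db->destructor (h->instance);
        }
        free (h->instance);
        h->instance = NULL;
        h->in_use = 0;
    }
    return (0);
}

/* Take a reference, unless the handle is invalid or already destroyed. */
static int handle_get_sketch (struct handle_db_sketch *db, unsigned int handle,
    void **instance)
{
    struct handle_sketch *h;

    if (handle >= SKETCH_MAX_HANDLES || db->handles[handle].in_use == 0 ||
        db->handles[handle].destroyed) {
        return (-1);
    }
    h = &db->handles[handle];
    h->ref_count += 1;
    *instance = h->instance;
    return (0);
}

/* Block further gets and drop the reference taken at create time. */
static int handle_destroy_sketch (struct handle_db_sketch *db,
    unsigned int handle)
{
    if (handle >= SKETCH_MAX_HANDLES || db->handles[handle].in_use == 0 ||
        db->handles[handle].destroyed) {
        return (-1);
    }
    db->handles[handle].destroyed = 1;
    return (handle_put_sketch (db, handle));
}
```

The point of the split between destroy and put is that a thread still
holding a reference from an earlier get keeps the instance alive until its
own put.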
  409. -----------------------------------
  410. communicating with the executive
  411. -----------------------------------
A service connection is created with the following API:
  413. SaErrorT
  414. saServiceConnect (
  415. int *fdOut,
  416. enum req_init_types init_type);
  417. The fdOut parameter specifies the address where the file descriptor should
  418. be stored. This file descriptor should be stored within an instance structure
  419. returned by saHandleCreate.
  420. The init_type parameter specifies the service number to use when connecting.
  421. A message is sent to the executive with the function:
  422. SaErrorT
  423. saSendRetry (
  424. int s,
  425. const void *msg,
  426. size_t len,
  427. int flags);
the s parameter is the socket to use, obtained from saServiceConnect
the msg parameter is a pointer to the message to send to the service
the len parameter is the length of the message to send
the flags parameter is the flags to use with the sendmsg system call
  432. A message is received from the executive with the function:
  433. SaErrorT
  434. saRecvRetry (
  435. int s,
  436. void *msg,
  437. size_t len,
  438. int flags);
the s parameter is the socket to use, obtained from saServiceConnect
the msg parameter is a pointer to the buffer that receives the message
the len parameter is the length of the message to receive
the flags parameter is the flags to use with the recvmsg system call
  443. A message is sent using io vectors with the following function:
  444. SaErrorT saSendMsgRetry (
  445. int s,
  446. struct iovec *iov,
  447. int iov_len);
  448. the s member is the socket to use retrieved with saServiceConnect
  449. the iov is an array of io vectors to send
  450. iov_len is the number of iovectors in iov
  451. Waiting for a file descriptor using poll systemcall is done with the api:
  452. SaErrorT
  453. saPollRetry (
  454. struct pollfd *ufds,
  455. unsigned int nfds,
  456. int timeout);
  457. where the parameters are the standard poll parameters.
  458. Messages can be received out of order searching for a specific message id with:
  459. SaErrorT
  460. saRecvQueue (
  461. int s,
  462. void *msg,
  463. struct queue *queue,
  464. int findMessageId);
where s is the socket to receive from
where msg is the address to receive the message to
where queue is the queue in which messages are stored if they don't match
where findMessageId is used to determine if a message matches (if its id is
equal, it is received; if it isn't equal, it is stored in the queue)
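A minimal sketch of this receive-or-queue behavior, assuming every message
starts with a size/id header and that size covers the whole message. The
fixed-size queue_sketch, header_sketch, and recv_queue_sketch are
hypothetical; the real saRecvQueue uses the queue from include/queue.h and
also triggers the activate poll message, which is omitted here.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

struct header_sketch {
    int size;   /* total message size, header included */
    int id;
};

#define SKETCH_QUEUE_MAX 8
#define SKETCH_MSG_MAX 64

struct queue_sketch {
    char msgs[SKETCH_QUEUE_MAX][SKETCH_MSG_MAX];
    int count;
};

/* Receive exactly len bytes or fail. */
static int recv_all_sketch (int s, void *buf, size_t len)
{
    return (recv (s, buf, len, MSG_WAITALL) == (ssize_t)len ? 0 : -1);
}

/*
 * Read messages until one with id find_id arrives. Messages with any
 * other id are stored in the queue for the dispatch function.
 */
static int recv_queue_sketch (int s, void *msg, struct queue_sketch *queue,
    int find_id)
{
    struct header_sketch header;
    char buf[SKETCH_MSG_MAX];
    size_t rest;

    for (;;) {
        if (recv_all_sketch (s, &header, sizeof (header)) != 0) {
            return (-1);
        }
        if (header.size < (int)sizeof (header) ||
            header.size > SKETCH_MSG_MAX) {
            return (-1);    /* malformed or too large for this sketch */
        }
        memcpy (buf, &header, sizeof (header));
        rest = (size_t)header.size - sizeof (header);
        if (rest > 0 &&
            recv_all_sketch (s, buf + sizeof (header), rest) != 0) {
            return (-1);
        }
        if (header.id == find_id) {
            memcpy (msg, buf, (size_t)header.size);
            return (0);
        }
        if (queue->count == SKETCH_QUEUE_MAX) {
            return (-1);    /* queue full */
        }
        memcpy (queue->msgs[queue->count], buf, (size_t)header.size);
        queue->count += 1;
    }
}
```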
  470. An API can activate the executive to send a dummy message with:
  471. SaErrorT
  472. saActivatePoll (int s);
  473. This is useful in dispatch functions to cause poll to drop out of waiting
  474. on a file descriptor when a connection is finalized.
  475. Looking at the lib/clm.c file is invaluable for showing how these APIs
  476. are used to communicate with the executive.
  477. ----------
  478. messages
  479. ----------
  480. Please follow the style of the messages. It makes debugging much easier
  481. if parallel style is used.
  482. An init message should be added to req_init_types.
  483. enum req_init_types {
  484. MESSAGE_REQ_CLM_INIT,
  485. MESSAGE_REQ_AMF_INIT,
  486. MESSAGE_REQ_CKPT_INIT,
  487. MESSAGE_REQ_CKPT_CHECKPOINT_INIT,
  488. MESSAGE_REQ_CKPT_SECTIONITERATOR_INIT
  489. };
Every library request message is defined in ais_msg.h and should follow this
style. These are the request CLM message identifiers:
  492. enum req_clm_types {
  493. MESSAGE_REQ_CLM_TRACKSTART = 1,
  494. MESSAGE_REQ_CLM_TRACKSTOP,
  495. MESSAGE_REQ_CLM_NODEGET
  496. };
  497. These are the response CLM message identifiers:
  498. enum res_clm_types {
  499. MESSAGE_RES_CLM_TRACKCALLBACK = 1,
  500. MESSAGE_RES_CLM_NODEGET,
  501. MESSAGE_RES_CLM_NODEGETCALLBACK
  502. };
  503. index 0 of the message is special and is used for the activate poll message in
  504. every API. That is why req_clm_types and res_clm_types starts at 1.
  505. This is a request message header which should start every request message:
  506. struct req_header {
  507. int size;
  508. int id;
  509. };
  510. There is also a response message header which should start every response message:
  511. struct res_header {
  512. int size;
  513. int id;
  514. SaErrorT error;
  515. };
  516. the error parameter is used to pass errors from the executive to the library,
  517. including SA_ERR_TRY_AGAIN for flow control, which is described later.
  518. This is described later:
  519. struct message_source {
  520. struct conn_info *conn_info;
  521. struct in_addr in_addr;
  522. };
  523. This is the MESSAGE_REQ_CLM_TRACKSTART message id above:
  524. struct req_clm_trackstart {
  525. struct message_header header;
  526. SaUint8T trackFlags;
  527. SaClmClusterNotificationT *notificationBufferAddress;
  528. SaUint32T numberOfItems;
  529. };
  530. The saClmClusterTrackStart api should create this message and send it to the
  531. executive.
  532. responses should be of:
  533. struct res_clm_trackstart
  534. ----------------------------------------------------------------------
  535. Using one file descriptor for async and sync requests at the same time
  536. ----------------------------------------------------------------------
  537. A library may include async events but must also be able to handle
  538. sync request/responses on the same fd. This is achieved via the
  539. saRecvQueue() api call.
1. First have a look at lib/amf.c::saAmfInitialize.
This function creates a queue to store responses that are not to be
handled by the synchronous function, but instead meant to be handled by
the dispatch (async) function.
    /*
     * An inq is needed to store async messages while waiting for a
     * sync response
     */
    error = saQueueInit (&amfInstance->inq, 512, sizeof (void *));
    if (error != SA_OK) {
        goto error_put_destroy;
    }
2. Next have a look at lib/amf.c::saAmfProtectionGroupTrackStart.
This function must ensure that it gets a particular response, even when
it may receive a request for a dispatch (async call). To solve this,
the function queues the message on amfInstance->inq. It will only
return a message in &req_amf_protectiongrouptrackstart once a message
with MESSAGE_RES_AMF_PROTECTIONGROUPTRACKSTART defined in header->id of
the response is received.
    error = saSendRetry (amfInstance->fd,
        &req_amf_protectiongrouptrackstart,
        sizeof (struct req_amf_protectiongrouptrackstart),
        MSG_NOSIGNAL);
    if (error != SA_OK) {
        goto error_unlock;
    }
    ^^^^^^ This code sends the request
    error = saRecvQueue (amfInstance->fd, &message,
        &amfInstance->inq, MESSAGE_RES_AMF_PROTECTIONGROUPTRACKSTART);
    ^^^^^^^^ This is the API which waits for a particular
    response. It will wait until a message with the header id
    MESSAGE_RES_AMF_PROTECTIONGROUPTRACKSTART is received. Any other
    message is queued on the inq for the dispatch function to read.
3. Finally have a look at the lib/amf.c::saAmfDispatch function.
    saQueueIsEmpty(&amfInstance->inq, &empty);
    if (empty == 0) {
        /*
         * Queue is not empty, read data from queue
         */
        saQueueItemGet (&amfInstance->inq, (void *)&queue_msg);
        msg = *queue_msg;
        memcpy (&dispatch_data, msg, msg->size);
        saQueueItemRemove (&amfInstance->inq);
    } else {
        /*
         * Queue empty, read response from socket
         */
        error = saRecvRetry (amfInstance->fd, &dispatch_data.header,
            sizeof (struct message_header), MSG_WAITALL | MSG_NOSIGNAL);
        if (error != SA_OK) {
            goto error_unlock;
        }
        if (dispatch_data.header.size > sizeof (struct message_header)) {
            error = saRecvRetry (amfInstance->fd,
                &dispatch_data.data,
                dispatch_data.header.size - sizeof (struct message_header),
                MSG_WAITALL | MSG_NOSIGNAL);
            if (error != SA_OK) {
                goto error_unlock;
            }
        }
    }
  605. This code basically checks if the queue is empty, then reads from the
  606. queue if there is a request, otherwise it reads from the socket.
  607. You might ask why doesn't the poll (not shown) block if there are
  608. messages in the queue but none in the socket. It doesn't block because
  609. every time a saRecvQueue queues a message, it sends a request to the
  610. executive (activate poll) which then sends a dummy message back to the
  611. library (activate poll) which keeps poll from blocking. The dummy
  612. message is ignored by the dispatch function.
Not a great approach (the activate poll stuff). I have an idea to fix
it though. Before a poll is ever done, the inq could be checked to see
if it is empty. If there are messages on the inq, the dispatch function
would not call poll, but instead go straight to dispatching the queued
messages.
Fortunately most of this activate poll mess is hidden from the library
developer in saRecvQueue (this does the activate poll stuff). The
developer simply has to be aware that the activate poll message is
coming and ignore it appropriately.
  622. ------------
  623. some notes
  624. ------------
  625. * Avoid doing anything tricky in the library itself. Let the executive
  626. handler do all of the work of the system. minimize what the API does.
  627. * Once an api is developed, it must be added to the makefile. Just add
  628. a line for the file to EXECOBJS build line.
  629. * protect I/O send/recv with a mutex.
  630. * always look at other libraries when there is a question about how to
  631. do something. It has likely been thought out in another library.
  632. -------------------------------------------------------------------------------
  633. adding services
  634. -------------------------------------------------------------------------------
Services are defined by service handlers and messages described in
include/ais_msg.h. These two pieces of information are used by the executive
to dispatch the correct messages to the correct recipients.
  638. -------------------------------
  639. the service handler structure
  640. -------------------------------
  641. A service is added by filling in the structures declared in exec/handlers.h. The
  642. structures are a little daunting:
  643. struct libais_handler {
  644. int (*libais_handler_fn) (struct conn_info *conn_info, void *msg);
  645. int response_size;
  646. int response_id;
  647. int gmi_prio;
  648. };
  649. The response_size, response_id, and gmi_prio for a library handler are used for flow
  650. control. A response message will be sent to the library of the size response_size,
  651. with the header id of response_id if the gmi priority queue gmi_prio is full. This is
  652. used for flow control so that the executive isn't responsible for queueing a lot
  653. of messages.
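A sketch of how such a check might look, assuming the executive can
tell whether each gmi priority queue is full. The names
flow_control_check, struct res_header_sketch and
struct libais_handler_sketch are hypothetical, and the value used for
SKETCH_ERR_TRY_AGAIN is a placeholder. Only the three flow control
fields come from the libais_handler structure above:

```c
#include <assert.h>
#include <string.h>

#define SKETCH_ERR_TRY_AGAIN 6	/* placeholder value, an assumption */
#define SKETCH_PRIO_COUNT 4	/* recovery, high, med, low */

struct res_header_sketch {	/* hypothetical response header */
	int size;
	int id;
	int error;
};

struct libais_handler_sketch {	/* flow control fields of libais_handler */
	int response_size;
	int response_id;
	int gmi_prio;
};

/*
 * queue_full[prio] is nonzero when that gmi priority queue is full.
 * Returns 1 and fills *res with a try-again response when the request
 * must be refused, 0 when the library handler may run.
 */
int flow_control_check (const struct libais_handler_sketch *handler,
	const int *queue_full, struct res_header_sketch *res)
{
	assert (handler != NULL && queue_full != NULL && res != NULL);
	assert (handler->gmi_prio >= 0 &&
		handler->gmi_prio < SKETCH_PRIO_COUNT);

	if (!queue_full[handler->gmi_prio]) {
		return (0);
	}
	memset (res, 0, sizeof (*res));
	res->size = handler->response_size;
	res->id = handler->response_id;
	res->error = SKETCH_ERR_TRY_AGAIN;
	return (1);
}
```

The point of carrying response_size and response_id in the handler
table is visible here: the refusal can be built without calling the
handler function at all.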
  654. struct service_handler {
  655. struct libais_handler *libais_handlers;
  656. int libais_handlers_count;
  657. int (**aisexec_handler_fns) (void *msg);
  658. int aisexec_handler_fns_count;
  659. int (*confchg_fn) (
  660. struct sockaddr_in *member_list, int member_list_entries,
  661. struct sockaddr_in *left_list, int left_list_entries,
  662. struct sockaddr_in *joined_list, int joined_list_entries);
  663. int (*libais_init_fn) (struct conn_info *conn_info, void *msg);
  664. int (*libais_exit_fn) (struct conn_info *conn_info);
  665. int (*aisexec_init_fn) (void);
  666. };
  667. libais_handlers are the handler functions for the library and also describe the flow
  668. control information required.
  669. libais_handlers_count is the number of entries in libais_handlers.
  670. aisexec_handler_fns are a list of functions that are dispatched by the
  671. group messaging interface when a message is delivered by the group messaging
  672. interface.
  673. aisexec_handler_fns_count is the number of functions in the aisexec_handler_fns
  674. list.
  675. confchg_fn is called every time a configuration change occurs.
  676. libais_init_fn is called every time a library connection is initialized.
  677. libais_exit_fn is called every time a library connection is terminated by
  678. the executive.
  679. aisexec_init_fn is called once during startup to initialize service specific
  680. data.
  681. ---------------------------
  682. look at a service handler
  683. ---------------------------
  684. A typical declaration of a full service is done in a file exec/service.c.
  685. Looking at exec/clm.c:
  686. struct libais_handler clm_libais_handlers[] =
  687. {
  688. { /* 0 */
  689. .libais_handler_fn = message_handler_req_lib_activatepoll,
  690. .response_size = sizeof (struct res_lib_activatepoll),
  691. .response_id = MESSAGE_RES_LIB_ACTIVATEPOLL,
  692. .gmi_prio = GMI_PRIO_RECOVERY
  693. },
  694. { /* 1 */
  695. .libais_handler_fn = message_handler_req_clm_trackstart,
  696. .response_size = sizeof (struct res_clm_trackstart),
  697. .response_id = MESSAGE_RES_CLM_TRACKSTART,
  698. .gmi_prio = GMI_PRIO_RECOVERY
  699. },
  700. { /* 2 */
  701. .libais_handler_fn = message_handler_req_clm_trackstop,
  702. .response_size = sizeof (struct res_clm_trackstop),
  703. .response_id = MESSAGE_RES_CLM_TRACKSTOP,
  704. .gmi_prio = GMI_PRIO_RECOVERY
  705. },
  706. { /* 3 */
  707. .libais_handler_fn = message_handler_req_clm_nodeget,
  708. .response_size = sizeof (struct res_clm_nodeget),
  709. .response_id = MESSAGE_RES_CLM_NODEGET,
  710. .gmi_prio = GMI_PRIO_RECOVERY
  711. }
  712. };
  714. static int (*clm_aisexec_handler_fns[]) (void *) = {
  715. message_handler_req_exec_clm_nodejoin
  716. };
  717. struct service_handler clm_service_handler = {
  718. .libais_handlers = clm_libais_handlers,
  719. .libais_handlers_count = sizeof (clm_libais_handlers) / sizeof (struct libais_handler),
  720. .aisexec_handler_fns = clm_aisexec_handler_fns,
  721. .aisexec_handler_fns_count = sizeof (clm_aisexec_handler_fns) / sizeof (clm_aisexec_handler_fns[0]),
  722. .confchg_fn = clmConfChg,
  723. .libais_init_fn = message_handler_req_clm_init,
  724. .libais_exit_fn = clm_exit_fn,
  725. .aisexec_init_fn = clmExecutiveInitialize
  726. };
  727. If a library sends a message with id 0, message_handler_req_lib_activatepoll
  728. is called by the executive. If a message id of 1 is sent,
  729. message_handler_req_clm_trackstart is called.
  730. When a message is sent via the group messaging interface with the id of 0,
  731. message_handler_req_exec_clm_nodejoin is called.
  732. Whenever a new connection occurs from a library, message_handler_req_clm_init
  733. is called.
  734. Whenever a connection is terminated by the executive, clm_exit_fn is called.
  735. On startup, clmExecutiveInitialize is called.
  736. This service handler is exported via exec/clm.h as follows:
  737. extern struct service_handler clm_service_handler;
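The id-to-handler mapping described above amounts to an array index.
A sketch of that dispatch with a bounds check, using cut-down
hypothetical types (struct handler_sketch, dispatch_lib_message and
the two counting handlers) rather than the real openais ones:

```c
#include <assert.h>
#include <stddef.h>

struct conn_info;	/* opaque here; the real one is shown later */

struct handler_sketch {
	int (*fn) (struct conn_info *conn_info, void *msg);
};

int calls[2];	/* records which sketch handler ran */

int handler_zero (struct conn_info *conn_info, void *msg)
{
	(void)conn_info;
	(void)msg;
	calls[0]++;
	return (0);
}

int handler_one (struct conn_info *conn_info, void *msg)
{
	(void)conn_info;
	(void)msg;
	calls[1]++;
	return (0);
}

/* Look up the handler by message id; reject ids outside the table. */
int dispatch_lib_message (const struct handler_sketch *handlers,
	int handlers_count, int id, struct conn_info *conn_info, void *msg)
{
	assert (handlers != NULL);

	if (id < 0 || id >= handlers_count) {
		return (-1);
	}
	return (handlers[id].fn (conn_info, msg));
}
```

Because the id is used directly as an index, the order of entries in
the handler array has to match the message ids defined for the
service.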
  738. --------------
  739. flow control
  740. --------------
  741. The group messaging interface includes flow control so that it doesn't send
  742. too many messages when the network is completely full. But the library can
  743. still send messages to the executive much faster than the executive can send
  744. them over gmi. So the library relies on the group messaging flow control to
  745. control flow of messages sent from the library. If the gmi queues are full,
  746. no more messages may be sent, so the executive in main.c automatically detects
  747. this scenario and returns an SA_ERR_TRY_AGAIN error.
  748. gmi_prio is set to GMI_PRIO_RECOVERY in the example above because none of those
  749. messages use flow control. For now, use this priority if no flow control is
  750. needed (because no messages are sent via the group messaging interface). Without
  751. flow control, the executive will assert when it runs out of storage space. Make
  752. sure the gmi_prio matches the priority of the message sent in the libais handler
  753. function.
  754. When a library gets SA_ERR_TRY_AGAIN, the library may either retry, or return this
  755. error to the user if the error is allowed by the API definitions. The gmi_prio is
  756. critical to this determination, because it may be possible to queue on other
  757. priority queues, but not the particular priority queue the user wants to queue upon.
  758. The other information is critical to ensuring that the library reads the correct
  759. message and size of message. Make sure the libais_handler matches the messages
  760. you are using in the handler function.
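On the library side, the retry option could be a bounded loop. A
sketch, assuming a hypothetical send callback and placeholder error
values. The names send_fn, send_with_retry, flaky_send, SKETCH_OK and
SKETCH_ERR_TRY_AGAIN are for illustration only and are not openais
names:

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_OK 1		/* placeholder values, an assumption */
#define SKETCH_ERR_TRY_AGAIN 6

typedef int (*send_fn) (void *ctx);

/*
 * Call do_send up to max_tries times while it reports try-again.
 * Returns the last result, so the caller may still pass try-again on
 * to the user if the API definition allows it.
 */
int send_with_retry (send_fn do_send, void *ctx, int max_tries)
{
	int res = SKETCH_ERR_TRY_AGAIN;
	int i;

	assert (do_send != NULL && max_tries > 0);

	for (i = 0; i < max_tries && res == SKETCH_ERR_TRY_AGAIN; i++) {
		res = do_send (ctx);
	}
	return (res);
}

/* Test double: reports try-again until the failure budget is used up. */
int flaky_send (void *ctx)
{
	int *remaining_failures = ctx;

	if (*remaining_failures > 0) {
		(*remaining_failures)--;
		return (SKETCH_ERR_TRY_AGAIN);
	}
	return (SKETCH_OK);
}
```

A real library would probably wait briefly between attempts so the
gmi queue has a chance to drain; that is left out to keep the sketch
short.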
  761. ----------------------
  762. service handler list
  763. ----------------------
  764. Then the service handler is linked into the executive by adding an include
  765. for the clm.h to the main.c file and including the service in the service
  766. handlers array:
  767. /*
  768. * All service handlers in the AIS
  769. */
  770. struct service_handler *ais_service_handlers[] = {
  771. &clm_service_handler,
  772. &amf_service_handler,
  773. &ckpt_service_handler,
  774. &ckpt_checkpoint_service_handler,
  775. &ckpt_sectioniterator_service_handler
  776. };
  777. and including the definition (it is included already above).
  778. Make sure:
  779. #define AIS_SERVICE_HANDLERS_COUNT 5
  780. is defined to the number of entries in ais_service_handlers.
  781. Within the main.h file is a list of the service types in the enum:
  782. enum socket_service_type {
  783. SOCKET_SERVICE_INIT,
  784. SOCKET_SERVICE_CLM,
  785. SOCKET_SERVICE_AMF,
  786. SOCKET_SERVICE_CKPT,
  787. SOCKET_SERVICE_CKPT_CHECKPOINT,
  788. SOCKET_SERVICE_CKPT_SECTIONITERATOR
  789. };
  790. SOCKET_SERVICE_CLM = service handler 0, SOCKET_SERVICE_AMF = service
  791. handler 1, etc.
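The hard-coded count and the order of the enum are easy to get out of
step. A sketch of one way to keep them aligned, using sizeof for the
count and subtracting one to skip SOCKET_SERVICE_INIT. The placeholder
struct body and the helper service_to_handler_index are hypothetical,
and the subtraction is inferred from the mapping stated above rather
than copied from main.c:

```c
#include <assert.h>

struct service_handler { int unused; };	/* stand-in for the real struct */

struct service_handler clm_service_handler;
struct service_handler amf_service_handler;
struct service_handler ckpt_service_handler;
struct service_handler ckpt_checkpoint_service_handler;
struct service_handler ckpt_sectioniterator_service_handler;

struct service_handler *ais_service_handlers[] = {
	&clm_service_handler,
	&amf_service_handler,
	&ckpt_service_handler,
	&ckpt_checkpoint_service_handler,
	&ckpt_sectioniterator_service_handler
};

/* Derived from the array, so it cannot drift from the list above. */
#define AIS_SERVICE_HANDLERS_COUNT \
	(sizeof (ais_service_handlers) / sizeof (ais_service_handlers[0]))

enum socket_service_type {
	SOCKET_SERVICE_INIT,
	SOCKET_SERVICE_CLM,
	SOCKET_SERVICE_AMF,
	SOCKET_SERVICE_CKPT,
	SOCKET_SERVICE_CKPT_CHECKPOINT,
	SOCKET_SERVICE_CKPT_SECTIONITERATOR
};

/* Returns the index into ais_service_handlers, or -1 if there is none. */
int service_to_handler_index (enum socket_service_type service)
{
	int index = (int)service - 1;

	if (index < 0 || index >= (int)AIS_SERVICE_HANDLERS_COUNT) {
		return (-1);
	}
	return (index);
}
```

When a new service is added, its enum value and its entry in
ais_service_handlers need to be added at the same position.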
  792. -------------------------
  793. the conn_info structure
  794. -------------------------
  795. Information about a particular connection is stored in the connection
  796. information structure.
  797. struct conn_info {
  798. int fd; /* File descriptor for this connection */
  799. int active; /* Does this file descriptor have an active connection */
  800. char *inb; /* Input buffer for non-blocking reads */
  801. int inb_nextheader; /* Next message header starts here */
  802. int inb_start; /* Start location of input buffer */
  803. int inb_inuse; /* Bytes currently stored in input buffer */
  804. struct queue outq; /* Circular queue for outgoing requests */
  805. int byte_start; /* Byte to start sending from in head of queue */
  806. enum socket_service_type service;/* Type of service so dispatch knows how to route message */
  807. struct saAmfComponent *component; /* Component for which this connection relates to TODO shouldn't this be in the ci structure */
  808. int authenticated; /* Is this connection authenticated? */
  809. struct list_head conn_list;
  810. struct ais_ci ais_ci; /* libais connection information */
  811. };
  812. This structure is daunting, but don't worry it rarely needs to be manipulated.
  813. The only two members that should ever be accessed by a service are service
  814. (which is set during the library init call) and ais_ci which is used to store
  815. connection specific information.
  816. The connection specific information is:
  817. struct ais_ci {
  818. struct sockaddr_un un_addr; /* address of AF_UNIX socket, MUST BE FIRST IN STRUCTURE */
  819. union {
  820. struct aisexec_ci aisexec_ci;
  821. struct libclm_ci libclm_ci;
  822. struct libamf_ci libamf_ci;
  823. struct libckpt_ci libckpt_ci;
  824. } u;
  825. };
  826. If adding a service, a new structure should be defined in main.h and added
  827. to the union u in ais_ci. This union can then be used to access connection
  828. specific information and maintain state securely.
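As a sketch of that step, a hypothetical service called foo might add
its own structure to the union like this. struct libfoo_ci, its
members, struct conn_info_sketch and libfoo_ci_init are invented for
illustration, and the existing union members are reduced to a single
placeholder:

```c
#include <assert.h>
#include <string.h>
#include <sys/un.h>

struct libclm_ci { int placeholder; };	/* stand-in for existing members */

struct libfoo_ci {			/* hypothetical new service state */
	int tracking_enabled;
	unsigned int requests_seen;
};

struct ais_ci {
	struct sockaddr_un un_addr;	/* MUST BE FIRST IN STRUCTURE */
	union {
		struct libclm_ci libclm_ci;
		struct libfoo_ci libfoo_ci;	/* added for the new service */
	} u;
};

struct conn_info_sketch {	/* only the member a service should touch */
	struct ais_ci ais_ci;
};

/* Called from the new service's init handler to reset connection state. */
void libfoo_ci_init (struct conn_info_sketch *conn_info)
{
	assert (conn_info != NULL);
	memset (&conn_info->ais_ci.u.libfoo_ci, 0,
		sizeof (struct libfoo_ci));
}
```

Since the members share storage, a connection should only ever use the
union member that belongs to its own service.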
  829. ------------------------------
  830. sending responses to the api
  831. ------------------------------
  832. A message is sent to the library from the executive message handler using
  833. the function:
  834. extern int libais_send_response (struct conn_info *conn_info, void *msg,
  835. int mlen);
  836. conn_info is passed into the library message handler or stored in the
  837. executive message. This member describes the connection to send the response.
  838. msg is the message to send
  839. mlen is the length of the message to send
  840. Keep in mind that struct res_message should be at the beginning of the response
  841. message so that it follows the style used in the rest of openais.
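A sketch of building such a response. The layout of
struct res_header_sketch, the response struct res_foo_get and the
recording stub send_response_stub (which stands in for
libais_send_response and takes the same parameters) are all
assumptions made for illustration. Check include/ais_msg.h for the
real header definition:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct conn_info;	/* opaque in this sketch */

struct res_header_sketch {	/* assumed header fields */
	int size;
	int id;
	int error;
};

struct res_foo_get {		/* hypothetical response message */
	struct res_header_sketch header;	/* header comes first */
	int value;
};

char last_sent[64];	/* what the stub "sent" */
int last_sent_len;

/* Stand-in with the same parameter list as libais_send_response. */
int send_response_stub (struct conn_info *conn_info, void *msg, int mlen)
{
	(void)conn_info;
	assert (mlen >= 0 && (size_t)mlen <= sizeof (last_sent));
	memcpy (last_sent, msg, (size_t)mlen);
	last_sent_len = mlen;
	return (0);
}

/* Fill in the header, then the body, then hand the whole struct over. */
int respond_foo_get (struct conn_info *conn_info, int id, int value)
{
	struct res_foo_get res;

	memset (&res, 0, sizeof (res));
	res.header.size = sizeof (struct res_foo_get);
	res.header.id = id;
	res.value = value;
	return (send_response_stub (conn_info, &res, sizeof (res)));
}
```

Keeping the header as the first member means the library can read the
size and id before it knows which response structure follows.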
  842. --------------------------------------------
  843. deferring response to an executive message
  844. --------------------------------------------
  845. The source structure is used to store information about the source of a
  846. message so a later executive message can respond to a library request. In
  847. a library handler, the source field should be set up with:
  848. msg.source.conn_info = conn_info;
  849. msg.source.in_addr.s_addr = this_ip.sin_addr.s_addr;
  850. gmi_mcast (msg)
  851. In this case conn_info is passed into the library message handler.
  852. Then the executive message handler determines if this processor is responsible
  853. for responding:
  854. if (req_exec_amf_componentregister->source.in_addr.s_addr ==
  855. this_ip.sin_addr.s_addr) {
  856. libais_send_response ();
  857. }
  858. Not pretty, but it works :)
  859. Update: the source address of a message is now passed into the exec handler message
  860. which can be used instead of recording the source in the source.in_addr field.
  861. Eventually the source.in_addr will be removed so consider using the source_addr
  862. passed into the function handler.
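Both variants of the "is this my request" test come down to one
address comparison. A sketch, in which struct message_source_sketch
and the two function names are hypothetical. The fields mirror the
source.conn_info and source.in_addr usage described above:

```c
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>

struct conn_info;	/* opaque in this sketch */

struct message_source_sketch {
	struct conn_info *conn_info;	/* meaningful on the sending node */
	struct in_addr in_addr;		/* address of the sending node */
};

/* Older style: compare the address recorded in the message itself. */
int source_is_local (const struct message_source_sketch *source,
	const struct sockaddr_in *this_ip)
{
	assert (source != NULL && this_ip != NULL);
	return (source->in_addr.s_addr == this_ip->sin_addr.s_addr);
}

/* Newer style: compare the source_addr passed to the exec handler. */
int source_addr_is_local (const struct in_addr *source_addr,
	const struct sockaddr_in *this_ip)
{
	assert (source_addr != NULL && this_ip != NULL);
	return (source_addr->s_addr == this_ip->sin_addr.s_addr);
}
```

Either way, the conn_info pointer carried in the message is only valid
on the processor that sent it, which is why the address check has to
come before any attempt to respond.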
  863. ----------------------------
  864. sending messages using gmi
  865. ----------------------------
  866. To send a message to every processor and the local processor for self
  867. delivery according to virtual synchrony semantics use:
  868. #define GMI_PRIO_RECOVERY 0
  869. #define GMI_PRIO_HIGH 1
  870. #define GMI_PRIO_MED 2
  871. #define GMI_PRIO_LOW 3
  872. int gmi_mcast (
  873. struct gmi_groupname *groupname,
  874. struct iovec *iovec,
  875. int iov_len,
  876. int priority);
  877. groupname is a global and should always be aisexec_groupname
  878. An example usage of this function is:
  879. struct req_exec_clm_nodejoin req_exec_clm_nodejoin;
  880. struct iovec req_exec_clm_iovec;
  881. int result;
  882. req_exec_clm_nodejoin.header.size =
  883. sizeof (struct req_exec_clm_nodejoin);
  884. req_exec_clm_nodejoin.header.id = MESSAGE_REQ_EXEC_CLM_NODEJOIN;
  885. memcpy (&req_exec_clm_nodejoin.clusterNode, &thisClusterNode,
  886. sizeof (SaClmClusterNodeT));
  887. req_exec_clm_iovec.iov_base = &req_exec_clm_nodejoin;
  888. req_exec_clm_iovec.iov_len = sizeof (req_exec_clm_nodejoin);
  889. result = gmi_mcast (&aisexec_groupname, &req_exec_clm_iovec, 1,
  890. GMI_PRIO_HIGH);
  891. Notice the priority field. Priorities are used when determining which
  892. queued messages to send first. Higher priority messages (on one processor)
  893. are sent before lower priority messages.
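A sketch of that selection rule on one processor. It assumes that a
lower number means a higher priority, so GMI_PRIO_RECOVERY is serviced
first. That ordering is read off the defines above and is not
confirmed by the gmi source. The function next_prio_to_send and the
constant PRIO_COUNT_SKETCH are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

#define GMI_PRIO_RECOVERY 0
#define GMI_PRIO_HIGH 1
#define GMI_PRIO_MED 2
#define GMI_PRIO_LOW 3
#define PRIO_COUNT_SKETCH 4	/* number of priority levels above */

/*
 * queued[prio] is the number of messages waiting at that priority.
 * Returns the priority to service next, or -1 if nothing is queued.
 */
int next_prio_to_send (const int *queued)
{
	int prio;

	assert (queued != NULL);

	for (prio = GMI_PRIO_RECOVERY; prio < PRIO_COUNT_SKETCH; prio++) {
		if (queued[prio] > 0) {
			return (prio);
		}
	}
	return (-1);
}
```

Under this rule a steady stream of high priority messages can delay
low priority ones on the same processor, which is worth keeping in
mind when choosing a priority for a new message.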
  894. -----------------
  895. library handler
  896. -----------------
  897. Every library handler has the prototype:
  898. static int message_handler_req_clm_init (struct conn_info *conn_info,
  899. void *message);
  900. The start of the handler function should look something like this:
  901. int message_handler_req_clm_trackstart (struct conn_info *conn_info,
  902. void *message)
  903. {
  904. struct req_clm_trackstart *req_clm_trackstart =
  905. (struct req_clm_trackstart *)message;
  906. /* package up library handler message into executive message */
  907. }
  908. This assigns the void *message to a structure that can be used by the
  909. library handler.
  910. The conn_info field indicates which connection the response should be sent to.
  911. Use the tricks described in deferring a response to the executive handler to
  912. have the executive handler respond to the message.
  913. Avoid doing anything tricky in a library handler. Do all the work in the
  914. executive handler at first. If later, it is possible to optimize, optimize
  915. away.
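A sketch of the packaging step for a hypothetical foo service. The
message structures, the header layout, the exec message id of 0 and
gmi_mcast_stub are all invented for illustration. The stub takes the
same parameters as gmi_mcast and simply captures the message instead
of sending it:

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/uio.h>

#define GMI_PRIO_HIGH 1

struct conn_info;				/* opaque in this sketch */
struct gmi_groupname { char groupname[16]; };	/* layout assumed */
struct req_header_sketch { int size; int id; };	/* layout assumed */

struct req_lib_foo_set {	/* hypothetical message from the library */
	struct req_header_sketch header;
	int value;
};

struct req_exec_foo_set {	/* hypothetical message to every node */
	struct req_header_sketch header;
	struct {
		struct conn_info *conn_info;
		struct in_addr in_addr;
	} source;
	int value;
};

struct gmi_groupname aisexec_groupname;
struct sockaddr_in this_ip;
struct req_exec_foo_set last_mcast;	/* captured by the stub */

/* Stub with the gmi_mcast parameter list; copies the single iovec. */
int gmi_mcast_stub (struct gmi_groupname *groupname, struct iovec *iovec,
	int iov_len, int priority)
{
	(void)groupname;
	(void)priority;
	assert (iov_len == 1 && iovec[0].iov_len == sizeof (last_mcast));
	memcpy (&last_mcast, iovec[0].iov_base, sizeof (last_mcast));
	return (0);
}

int message_handler_req_foo_set (struct conn_info *conn_info, void *message)
{
	struct req_lib_foo_set *req_lib_foo_set =
		(struct req_lib_foo_set *)message;
	struct req_exec_foo_set req_exec_foo_set;
	struct iovec iovec;

	/* Copy the request and record where the response must go. */
	memset (&req_exec_foo_set, 0, sizeof (req_exec_foo_set));
	req_exec_foo_set.header.size = sizeof (req_exec_foo_set);
	req_exec_foo_set.header.id = 0;	/* hypothetical exec message id */
	req_exec_foo_set.source.conn_info = conn_info;
	req_exec_foo_set.source.in_addr.s_addr = this_ip.sin_addr.s_addr;
	req_exec_foo_set.value = req_lib_foo_set->value;

	iovec.iov_base = &req_exec_foo_set;
	iovec.iov_len = sizeof (req_exec_foo_set);
	return (gmi_mcast_stub (&aisexec_groupname, &iovec, 1,
		GMI_PRIO_HIGH));
}
```

The library handler does no real work here. It only repackages the
request, so the state change happens in the executive handler on every
node.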
  916. -------------------
  917. executive handler
  918. -------------------
  919. Every executive handler has the prototype:
  920. static int message_handler_req_exec_clm_nodejoin (void *message,
  921. struct in_addr *source_addr);
  922. The start of the handler function should look something like this:
  923. static int message_handler_req_exec_clm_nodejoin (void *message,
  924. struct in_addr *source_addr)
  925. {
  926. struct req_exec_clm_nodejoin *req_exec_clm_nodejoin = (struct req_exec_clm_nodejoin *)message;
  927. /* do real work of executing request, this is done on every node */
  928. }
  929. The conn_info structure is not available. If it is needed, it can be stored
  930. in the message sent by the library message handler in a source structure.
  931. The message field contains the message sent by the library handler.
  932. The source_addr field contains the source ip address of the processor that
  933. multicast the message.
  934. --------------------
  935. the libais_init_fn
  936. --------------------
  937. This function is responsible for authenticating the connection. If it is
  938. not properly implemented, no further communication to the executive on that
  939. connection will work. Copy the init function from some other service
  940. changing what looks obvious.
  941. --------------------
  942. the libais_exit_fn
  943. --------------------
  944. This function is called every time a service connection is disconnected by
  945. the executive. Free memory, change structures, or whatever work needs to
  946. be done to clean up.
  947. If the exit_fn couldn't complete because it is waiting for some event, it may
  948. return -1, which will allow the executive to make some forward progress. Then
  949. exit_fn will be called again. Return 0 when the exit was completed. This is
  950. most useful when the group messaging protocol should be used to queue a message,
  951. but the queue is full. In this case, waiting a few more seconds may open up the
  952. queue, so return -1, and then the executive will try again to call exit_fn. Do
  953. NOT return -1 forever or the ais executive will spin.
  954. If -1 is returned, ENSURE that the state of the library hasn't changed so much that
  955. exit_fn cannot be called again. If exit_fn returns -1, it WILL be called again
  956. so expect it in the code.
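A sketch of an exit function written to be called more than once. The
state structure, try_queue_leave and foo_exit_fn are hypothetical, and
the signature is simplified for the sketch (the real libais_exit_fn
takes a struct conn_info *). Each completed step is recorded so that a
repeated call resumes instead of redoing work:

```c
#include <assert.h>
#include <stddef.h>

struct foo_conn_state {		/* hypothetical per-connection state */
	int resources_freed;	/* 1 once local cleanup has run */
	int leave_queued;	/* 1 once the leave message is queued */
};

/* Stand-in for queueing on gmi: fails while *queue_full is nonzero. */
int try_queue_leave (const int *queue_full)
{
	return (*queue_full ? -1 : 0);
}

/*
 * Returns -1 to ask the executive to call again later, 0 when the
 * exit has completed.
 */
int foo_exit_fn (struct foo_conn_state *state, const int *queue_full)
{
	assert (state != NULL && queue_full != NULL);

	if (!state->resources_freed) {
		/* free memory, unlink structures, etc., exactly once */
		state->resources_freed = 1;
	}
	if (!state->leave_queued) {
		if (try_queue_leave (queue_full) != 0) {
			return (-1);	/* queue full, retry later */
		}
		state->leave_queued = 1;
	}
	return (0);
}
```

Guarding each step with a flag is one simple way to satisfy the rule
above that state must not change so much that exit_fn cannot be called
again.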
  957. ----------------
  958. the confchg_fn
  959. ----------------
  960. This function is called whenever a configuration change occurs. Some
  961. services may not need this function, while others may. This is a good way
  962. to sync up joining nodes with the current state of the information stored
  963. on a particular processor.
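A sketch of a confchg_fn for a hypothetical foo service, using the
parameter list from struct service_handler. The two globals and the
decision to flag a sync whenever a node joins are assumptions for
illustration. A real service would multicast its state at that point:

```c
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>

int foo_member_count;	/* hypothetical service state */
int foo_sync_needed;	/* set when a joining node needs our state */

int foo_confchg_fn (
	struct sockaddr_in *member_list, int member_list_entries,
	struct sockaddr_in *left_list, int left_list_entries,
	struct sockaddr_in *joined_list, int joined_list_entries)
{
	(void)member_list;
	(void)left_list;
	(void)joined_list;
	assert (member_list_entries >= 0);
	assert (left_list_entries >= 0);
	assert (joined_list_entries >= 0);

	/* Remember the size of the new configuration. */
	foo_member_count = member_list_entries;

	/* Only joins require pushing state to other processors. */
	if (joined_list_entries > 0) {
		foo_sync_needed = 1;
	}
	return (0);
}
```

A service that keeps no replicated state can leave the function body
nearly empty.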
  964. -------------------------------------------------------------------------------
  965. Final comments
  966. -------------------------------------------------------------------------------
  967. GDB is your friend, especially the "where" command. But it stops execution.
  968. This has a nasty side effect of killing the current configuration. In this
  969. case GDB may become your enemy.
  970. printf is your friend when GDB is your enemy.
  971. If stuck, ask on the mailing list, send your patches. A lot of time has been
  972. spent designing openais, and even more time debugging it. There are people
  973. that can help you debug problems, especially around things like message
  974. delivery.
  975. Submit patches early to get feedback, especially around things like parallel
  976. style. Parallel style is very important to ensure maintainability by the
  977. openais community.
  978. If this document is wrong or incomplete, complain so we can get it fixed
  979. for other people.
  980. Have fun!