OpenStack RPC Guarantees
Here, we describe the required guarantees of OpenStack RPC implementations.
The statements made here are derived from logical analysis and hypothesis; experiments have not been conducted. I may have made mistakes, and I welcome criticism through the comments. Better yet, I welcome proof that anything I say here is wrong (or right!)
OpenStack utilizes two messaging primitives: CAST and CALL. A CAST is sent and is not expected to return a value; the message is sent out into the ether and will either succeed or fail. A CALL message is more complex, and a reply is expected.
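The shape of the two primitives can be sketched with a hypothetical client; the class and method names below are illustrative and are not the actual OpenStack RPC API:

```python
# Sketch of the two RPC primitives. All names are illustrative.
class FakeTransport:
    """In-memory stand-in for a messaging bus, for illustration only."""
    def __init__(self):
        self.sent = []

    def deliver(self, msg):
        self.sent.append(msg)
        # Pretend the consumer handled the message and produced a reply.
        return {"handled": msg["method"]}


class RpcClient:
    def __init__(self, transport):
        self.transport = transport

    def cast(self, method, **kwargs):
        # Fire-and-forget: no reply is expected or returned.
        self.transport.deliver({"method": method, "args": kwargs})
        return None

    def call(self, method, **kwargs):
        # Conceptually blocks until the consumer replies with a value.
        return self.transport.deliver({"method": method, "args": kwargs})
```

The difference is entirely in the reply path: `cast` discards any result, while `call` returns (or blocks waiting for) one.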
When a CALL is received by the RPC driver, a message is sent, and the RPC driver blocks until a reply returns. As messaging occurs asynchronously, the blocking is performed entirely on the client/publisher. If the thread blocks too long without receiving a reply, a Timeout exception is raised. If the remote CALL generates an exception, it is serialized over the wire and re-raised by the RPC driver back to the caller. Each caller is responsible for properly handling any exception that might be raised.
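The remote-exception round trip can be sketched as follows, assuming a hypothetical driver that flattens exceptions into plain data for the wire and rebuilds them on the caller's side; `serialize_remote_error` and `deserialize_remote_error` are illustrative names, not actual OpenStack internals:

```python
# Sketch of how a driver might serialize a remote exception and
# re-raise it on the caller. All names here are illustrative; this is
# not the actual OpenStack RPC implementation.
import traceback


class RemoteError(Exception):
    """Raised by the caller's RPC driver when the remote CALL failed."""
    def __init__(self, exc_type, value, tb):
        super().__init__(f"Remote error: {exc_type}: {value}")
        self.exc_type = exc_type
        self.value = value
        self.tb = tb


def serialize_remote_error(exc):
    # Consumer side: flatten the exception into plain wire data.
    return {"class": type(exc).__name__,
            "value": str(exc),
            "tb": traceback.format_exc()}


def deserialize_remote_error(data):
    # Caller side: rebuild and re-raise the failure.
    raise RemoteError(data["class"], data["value"], data["tb"])


# Simulated round trip: the consumer fails, the caller sees RemoteError.
try:
    raise ValueError("bad flavor id")        # failure inside the consumer
except ValueError as e:
    wire_payload = serialize_remote_error(e)

try:
    deserialize_remote_error(wire_payload)   # back on the caller
except RemoteError as err:
    caught = err
```

The caller cannot receive the original exception object, only a reconstruction, which is why it must be prepared to handle a generic remote-error type rather than the consumer's concrete exception classes.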
It is necessary to expect failure. Drivers should always attempt to reconnect to their broker and/or peers. Messages destined for any consumer/worker are superfluous if they cannot be delivered within a reasonable time frame. This applies to both CAST and CALL. A timeout is set globally and can be manipulated per-message to adjust the TTL of messages. Best efforts to ensure delivery should be made until the expiration of the TTL. If delivery fails, a Timeout exception should be raised.
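A TTL-bounded best-effort delivery loop might look like the following sketch; the `try_deliver` callback and the retry interval are assumptions for illustration, not part of any actual driver:

```python
# Sketch of best-effort delivery bounded by a TTL. The transport
# callback and retry interval are illustrative assumptions.
import time


class Timeout(Exception):
    """Raised when a message's TTL expires before delivery succeeds."""


def send_with_ttl(try_deliver, message, ttl, retry_interval=1.0):
    """Retry delivery until it succeeds or the TTL expires."""
    deadline = time.monotonic() + ttl
    while time.monotonic() < deadline:
        if try_deliver(message):
            return True
        # Back off, but never sleep past the deadline.
        time.sleep(min(retry_interval, max(0.0, deadline - time.monotonic())))
    raise Timeout(f"could not deliver within {ttl}s: {message!r}")
```

The key property is that the publisher gets a definite answer by the deadline: either the message was handed off, or a Timeout exception surfaces to the caller.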
Atomic actions and Idempotency
The RPC driver does not seem to be the place to implement atomic actions. The RPC drivers should simply raise exceptions; idempotency and/or atomic operations should be the responsibility of the caller.
Surviving Restarts / Durability
If a publisher sends a CAST but the message has not yet been received by a consumer, then upon restart of that consumer, the message should still be pending delivery until the consumer consumes it, unless the message has reached the expiration of its TTL and the publisher has received a Timeout exception. Failing to implement such TTLs is dangerous. For example, instance launches that cannot succeed due to a failure of nova-scheduler should not remain queued until the scheduler returns (beyond a reasonable timeout), as this will inhibit the operation of autoscaling applications that may continuously attempt to launch new machine images. Currently, the AMQP-based drivers in OpenStack do not sufficiently support this TTL mechanism.
If a publisher attempts to send a CAST, but the message has not been received by a consumer, then upon restart of the publisher, attempts to deliver the message should resume once the publisher process is restored. Alternatively, messages may be independently queued and their delivery guaranteed by a third party, as in the case of a centralized AMQP broker. The ZeroMQ driver does not currently provide this guarantee.
If a publisher attempts to send a CALL, but the message has not been received by a consumer, then upon restart of the publisher, it is recommended that no attempt be made to deliver the message once the publisher process is restored. This is because the purpose of the CALL was either to receive a return value from the consumer, or to block in a chain of events to prevent a race condition. In the former case, the return value of the call will be lost, as the state of the publisher has been discarded and the operational stack no longer exists. In the latter case, the chain of events will be broken regardless, due to the loss of state and stack. Those looking for better guarantees should utilize an event/actor-based model around CAST. While supporting a guarantee here is not recommended, it does not appear to be particularly dangerous; it is simply unnecessary and inefficient. Currently, the AMQP implementations in OpenStack guarantee delivery of a CALL even if the publisher has expired.
RPC in OpenStack is doing fairly well, but it could do better. This applies to the ZeroMQ, Qpid, and RabbitMQ drivers. Hopefully, documents and blogs like this one will help drive improvements and innovations in these drivers, and assist those who might seek to write a new driver.