RPC Guarantees in OpenStack

July 6th 2012, ewindisch

OpenStack RPC Guarantees

Here, we describe the required guarantees of OpenStack RPC implementations.

The statements made here are derived from logical analysis and hypothesis; no experiments have been conducted. I might have made mistakes. I welcome criticism through the comments and, better yet, proof that anything I say here is wrong (or right!).

Missing Messages

OpenStack utilizes two messaging primitives, CAST and CALL. A CAST is sent and is not expected to return a value; the message is sent out into the ether and will succeed or fail. A CALL message is more complex, and a reply is expected.

A CALL is received by the RPC driver, a message is sent, and the RPC driver blocks until the reply arrives. As messaging occurs asynchronously, the blocking is performed entirely on the client/publisher. If the thread is blocked too long and no reply is received, a Timeout Exception is raised. If the remote CALL generates an Exception, it is serialized over the wire and re-raised by the RPC driver, back to the caller. Each caller is responsible for properly handling any Exception that might be raised.
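The CALL and CAST semantics described above can be sketched with an in-process queue standing in for the transport. This is only an illustration under that assumption, not the code of any OpenStack driver; the names `call`, `cast`, `consumer`, `RPCTimeout`, and `RemoteError` are hypothetical.

```python
import queue
import threading


class RPCTimeout(Exception):
    """Raised on the publisher when no reply arrives in time (hypothetical name)."""


class RemoteError(Exception):
    """Carries an exception that the consumer serialized (hypothetical name)."""


def consumer(requests, handlers):
    """Serve requests until a None sentinel is received."""
    while True:
        msg = requests.get()
        if msg is None:
            break
        method, args, reply_q = msg
        try:
            reply = {"result": handlers[method](*args)}
        except Exception as exc:
            # Serialize the failure instead of letting it kill the consumer.
            reply = {"error": (type(exc).__name__, str(exc))}
        if reply_q is not None:  # CAST messages carry no reply queue
            reply_q.put(reply)


def cast(requests, method, *args):
    """Fire and forget: no reply is expected."""
    requests.put((method, args, None))


def call(requests, method, *args, timeout=1.0):
    """Send a request and block the publisher until a reply or the timeout."""
    reply_q = queue.Queue()
    requests.put((method, args, reply_q))
    try:
        reply = reply_q.get(timeout=timeout)
    except queue.Empty:
        raise RPCTimeout(f"no reply to {method!r} within {timeout}s") from None
    if "error" in reply:
        name, text = reply["error"]
        raise RemoteError(f"{name}: {text}")  # re-raised back to the caller
    return reply["result"]
```

Note that only the publisher blocks: the consumer never waits on the caller, and a CAST that fails remotely is silently dropped in this sketch.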

Service Availability

It is necessary to expect failure. Drivers should always attempt to reconnect to their broker and/or peers. Messages destined for any consumer/worker are superfluous if they cannot be delivered within a reasonable time frame. This applies to both CAST and CALL. A timeout is set globally and can be manipulated per-message to adjust the TTL of messages. Best efforts to ensure delivery should be made, until the expiration of the TTL. If delivery fails, a Timeout Exception should be raised.

Atomic Actions and Idempotency

The RPC driver does not seem to be the place to implement atomic actions. The RPC drivers should simply raise Exceptions; idempotency and/or atomic operations should be handled by the caller.
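One way application code might provide idempotency above the driver is to tag each request with an identifier and remember the result, so that a retry after a Timeout Exception does not repeat the action. The sketch below assumes such an identifier exists; `IdempotentHandler` and `request_id` are hypothetical names, not part of any OpenStack API.

```python
class IdempotentHandler:
    """Run an action at most once per request id (illustrative only)."""

    def __init__(self, action):
        self._action = action
        self._results = {}  # request_id -> result of the first execution

    def handle(self, request_id, *args):
        # A retried request with a known id returns the stored result
        # instead of performing the action a second time.
        if request_id not in self._results:
            self._results[request_id] = self._action(*args)
        return self._results[request_id]
```

The results are kept in memory here for brevity; surviving a restart would require storing them somewhere durable.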

Surviving Restarts / Durability

If a publisher is sending a CALL or CAST, but the message has not been received by a consumer, then upon restart of said consumer, that message should still be pending delivery until the consumer consumes it, unless the message has reached the expiration of its TTL and the publisher has received a Timeout Exception. Failing to implement such TTLs is dangerous. For example, instance launches that cannot succeed due to a failure of nova-scheduler should not be queued until the return of the scheduler (beyond a reasonable timeout), as this will inhibit the operation of autoscaling applications, which may continuously attempt to launch new machine images. Currently, the AMQP-based drivers in OpenStack do not sufficiently support this TTL mechanism.
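The TTL rule can be sketched as follows: each pending message records its expiry time, and a consumer that comes back after a restart processes only the messages still within their TTL. This is a simplified, hypothetical model (`Message` and `drain` are invented names); it shows the consumer side only and does not show the Timeout Exception raised on the publisher.

```python
import time
from collections import deque


class Message:
    """A queued message with an absolute expiry time (illustrative only)."""

    def __init__(self, payload, ttl, now=None):
        now = time.monotonic() if now is None else now
        self.payload = payload
        self.expires_at = now + ttl

    def expired(self, now=None):
        now = time.monotonic() if now is None else now
        return now >= self.expires_at


def drain(pending, now=None):
    """Empty the queue, returning payloads still within their TTL."""
    live = []
    while pending:
        msg = pending.popleft()
        if not msg.expired(now):
            live.append(msg.payload)
        # Expired messages are dropped: the publisher has already timed out.
    return live
```

With this rule, a launch request queued while the scheduler was down is discarded once its TTL passes, rather than being executed long after the requester gave up.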

If a publisher attempts to send a CAST, but the message has not been received by a consumer, then upon restart of the publisher, attempts to deliver said message should continue once the publisher process is restored. Alternatively, messages may be independently queued, with delivery guaranteed by a third party, as in the case of a centralized AMQP broker. The ZeroMQ driver does not currently provide this guarantee.

If a publisher attempts to send a CALL, but the message has not been received by a consumer, upon restart of the publisher, it is recommended that no attempts to deliver said message should be made upon restoration of the publisher process. This is because the purpose of the CALL was either to receive a return value from the consumer, or to block in a chain of events to prevent a race-condition. In the former, the return value of the call will be lost, as the state of the publisher has been discarded and the operational stack no longer exists. In the latter case, the chain of events will be broken, regardless, due to the loss of state and stack. Those looking for better guarantees should utilize an event-actor based model around CAST. While supporting a guarantee here is not recommended, it does not appear to be particularly dangerous. Instead, it is unnecessary and inefficient. Currently, the AMQP implementations in OpenStack guarantee delivery of a CALL, even if the publisher has expired.
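The differing treatment of CAST and CALL after a publisher restart could look like the sketch below, which assumes the publisher persists undelivered messages to a local file. The file layout and the names `save_outbox` and `recover_outbox` are hypothetical; no OpenStack driver is claimed to work this way.

```python
import json
import os


def save_outbox(path, messages):
    """Persist undelivered messages so they survive a publisher restart."""
    with open(path, "w") as f:
        json.dump(messages, f)


def recover_outbox(path):
    """Return the messages to resend after a publisher restart.

    CASTs are replayed. CALLs are dropped, because the stack that was
    waiting for their reply no longer exists.
    """
    if not os.path.exists(path):
        return []
    with open(path) as f:
        messages = json.load(f)
    return [m for m in messages if m["kind"] == "cast"]
```

A fuller version would also discard recovered CASTs whose TTL has already expired.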

Conclusion

RPC in OpenStack is doing fairly well, but it could do better. This applies to the ZeroMQ, Qpid, and RabbitMQ drivers. Hopefully, having documents/blogs like this will help drive improvements and innovations in these drivers, and assist those that might seek to write a new driver.
