Cody Shepp


Message Queues: Lessons Learned

Published on 4/2/2017

Most larger systems can benefit in some way from the introduction of queueing. Message queues can be used for asynchronous communication, task buffers, and more - but at what cost? In most cases, the answer is additional complexity with regard to operations, monitoring, and troubleshooting. I wanted to share some lessons I've learned for reducing (or at least preparing for) the additional complexity of message queues.

Message names should be past tense events

Message queues, like HTTP APIs, require disciplined attention to semantics. Sloppy practices in this area can result in tight coupling and dependency/deployment hell. The easiest way to prevent messages from turning into RPC calls is to restrict the message context to the domain of the publisher. In general, the publisher shouldn't care about the services that are consuming its messages. Naming messages as past tense events (e.g., AccountCreated or PaymentAccepted) frames the message as a notification to other systems.
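To make the naming idea concrete, here's a minimal sketch in Python. The AccountCreated class and its fields are hypothetical - the point is that the message is a past-tense fact in the publisher's own domain, not an instruction to any particular consumer:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import json
import uuid

# A hypothetical event message, named as a past-tense fact in the
# publisher's domain. The publisher announces what happened; it does
# not tell any specific consumer what to do about it.

@dataclass
class AccountCreated:
    account_id: str
    email: str
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        return json.dumps(self.__dict__)

# Compare with an RPC-style name like "SendWelcomeEmail", which couples
# the publisher to one specific consumer's behavior.
```

Any number of subscribers can react to an AccountCreated event (send a welcome email, provision resources, update analytics) without the publisher knowing or caring.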

Don't conflate message content and message routing

When you write an email, the contents of the email usually depend on the intended recipient. It makes sense to assume that same relationship applies to messages sent over a queue. However, that assumption can quickly lead to duplicated code, duplicated messages, and a giant mess. Learn from my mistakes - message content should be independent of message routing. Don't include the names of publishers or subscribers in message names. Messages should be business objects that can stand on their own without the need for additional context (e.g., a purchase order or a payment). Likewise, routing a message shouldn't require inspecting the message's contents. Keep routing keys, intended recipients, etc. out of the message itself - instead, put that routing-related information in message headers, queue names, or some other queue-specific metadata.
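Here's a toy sketch of that separation. The envelope shape and routing function are my own illustration, not any particular broker's API - the key property is that routing reads only the headers, never the body:

```python
import json

# A minimal sketch of separating routing metadata from message content.
# The body is a self-contained business object; routing hints live in
# headers, so the router never has to parse the payload.

def make_envelope(body: dict, routing_key: str) -> dict:
    return {
        "headers": {
            "routing_key": routing_key,
            "content_type": "application/json",
        },
        "body": json.dumps(body),
    }

def route(envelope: dict, bindings: dict) -> list:
    # Deliver to every queue bound to the envelope's routing key,
    # without ever inspecting envelope["body"].
    return bindings.get(envelope["headers"]["routing_key"], [])
```

Real brokers give you the same split for free - e.g., AMQP routing keys and message headers - so there's no reason to bake recipients into the payload.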

Use data contracts

Having data contracts in place for messages ensures that both the publisher and subscribers understand the data in the same way. By using a library like Google Protocol Buffers, you can also realize performance gains from faster serialization and smaller message size. Protocol buffers have the added benefit of allowing type-safe communication between services written in different languages. Keep these contracts in a separate repo, and make sure to use semantic versioning for all releases. Consider distributing the contracts via a package manager (e.g., NuGet, npm, or Composer).
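A contract for the PaymentAccepted event mentioned earlier might look something like this - the package name and fields are illustrative, not prescriptive:

```protobuf
// payment.proto - a hypothetical contract, kept in its own repo,
// semantically versioned, and shipped as a package.
syntax = "proto3";

package contracts.payments.v1;

message PaymentAccepted {
  string payment_id   = 1;
  string account_id   = 2;
  int64  amount_cents = 3;  // integer cents avoids float rounding issues
  string currency     = 4;  // ISO 4217 code, e.g. "USD"
  int64  accepted_at  = 5;  // Unix epoch seconds
}
```

Each language's generated bindings come from the same .proto file, so a C# publisher and a PHP subscriber agree on field names and types by construction.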

Create a client library

Creating a client library for interacting with message queues is especially helpful if you have multiple systems and teams communicating with each other via queues. The library should provide methods for creating and connecting to queues using a standardized convention. It can also include audit-related functionality (discussed below), which guarantees consistent and reliable metrics. Having this common layer of abstraction in place provides a relatively small surface area for changes if you want to swap out your queue technology down the road.
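The sketch below shows the shape such a library might take, with an in-memory broker standing in for whatever transport you actually use. All names here are hypothetical; the point is that application code talks to QueueClient, so swapping broker technology later only means writing a new adapter behind the same interface:

```python
from collections import defaultdict, deque

# Stand-in for a real broker connection; application code never sees it.
class InMemoryBroker:
    def __init__(self):
        self.queues = defaultdict(deque)

    def push(self, queue: str, payload: bytes):
        self.queues[queue].append(payload)

    def pop(self, queue: str):
        return self.queues[queue].popleft() if self.queues[queue] else None

class QueueClient:
    """Thin client library enforcing one naming convention: <service>.<event>."""

    def __init__(self, broker, service: str):
        self.broker = broker
        self.service = service

    def queue_name(self, event: str) -> str:
        return f"{self.service}.{event}".lower()

    def publish(self, event: str, payload: bytes):
        # Audit hooks (publish counts, timestamps) would live here, so
        # every team gets consistent metrics without extra work.
        self.broker.push(self.queue_name(event), payload)

    def consume(self, event: str):
        return self.broker.pop(self.queue_name(event))
```

Because every team publishes and consumes through the same two methods, conventions (naming, auditing, serialization) are enforced in one place instead of in each service.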

Have audit infrastructure in place

Having visibility into the status of your publishers, subscribers, and queues is absolutely paramount for the successful operation of your system. The more hops your data makes, the more important auditing is - queues that span datacenters are a prime target for intermittent interruptions.

There are two types of data that you can collect to give you visibility into your queues - event data and metrics. Event data is related to specific messages. A publisher might log a "published" event that references a message ID and a timestamp. Metrics, on the other hand, provide summary data like the number of messages consumed by a process in the last 30 seconds.

If you don't have many cycles to spend on auditing infrastructure, at the very least you should log event data. As long as the events have timestamps, you can calculate metrics as you aggregate the event logs.
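Here's a small sketch of that derivation - turning raw timestamped events into a throughput metric. The event shape (message ID, Unix timestamp) is an assumption for illustration:

```python
from collections import Counter

# Deriving a metric from raw event logs: throughput is just a count of
# events per time bucket, so if you log events first, metrics come free.

def throughput(events, bucket_seconds=30):
    """events: iterable of (message_id, unix_ts) tuples.
    Returns {bucket_start_ts: count} for each 30-second window."""
    counts = Counter()
    for _msg_id, ts in events:
        counts[int(ts) // bucket_seconds * bucket_seconds] += 1
    return dict(counts)
```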

If your system spans multiple datacenters, you might want to create an additional "audit" service as a secondary method of tracking message delivery. This service would simply consume all published messages and keep a log of event data - rather than "published" or "consumed" events, the audit service would produce "observed" events. This data can help identify replication/federation problems by giving you the means to determine the last piece of your pipeline that encountered a given message. You can also track latency with greater detail, i.e., how long it takes messages to replicate from Datacenter A to Datacenter B.
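As a sketch of what that audit data buys you, the function below takes "observed" events from audit consumers in two datacenters and produces per-message replication latency plus the messages that stalled at the source. The event shape (message ID, datacenter, timestamp) is assumed for illustration:

```python
# Sketch: analyzing "observed" events from per-datacenter audit services
# to measure replication latency and find messages that never arrived.

def replication_report(observed, source_dc, target_dc):
    """observed: iterable of (message_id, datacenter, unix_ts) tuples.
    Returns ({message_id: latency_seconds}, [message_ids missing at target])."""
    seen = {}
    for msg_id, dc, ts in observed:
        seen.setdefault(msg_id, {})[dc] = ts
    latencies, missing = {}, []
    for msg_id, by_dc in seen.items():
        if source_dc in by_dc and target_dc in by_dc:
            latencies[msg_id] = by_dc[target_dc] - by_dc[source_dc]
        elif source_dc in by_dc:
            # Last seen at the source - a replication/federation suspect.
            missing.append(msg_id)
    return latencies, missing
```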

Create a proxy API

Using a message queue doesn't mean that all of your services need to talk to the queue directly. Maybe you use a language that doesn't have an official library for interacting with your queue of choice, or maybe you'd like to accept messages for another system in a queued fashion, but that other system is a scheduled job that can only make HTTP requests. In these cases, you can make a simple HTTP proxy API that takes the payload from an HTTP request and puts it on a queue. In the other direction, the proxy can subscribe to messages from a queue, and then POST to an HTTP endpoint in some other system to notify that system that a message has arrived.

Do you have more tips about implementing message queues that you'd like to share? Tweet me @cas002 to share the lessons you've learned.