Every time something important happens in your system, submit an event to Riemann. Just handled an HTTP request? Send an event with the time it took. Caught an exception? Send an event with the stacktrace. Small daemons can watch your servers and send events about disk capacity, CPU use, and memory consumption. Every second. It's like top for hundreds of machines at once.
Riemann filters, combines, and acts on flows of events to understand your systems.
Events are just structs. They're sent over Protocol Buffers, and in Riemann are treated as immutable maps. Each event has these (optional) fields:
|host|A hostname, e.g. "api1", "foo.com"|
|service|e.g. "API port 8000 reqs/sec"|
|state|Any string less than 255 bytes, e.g. "ok", "warning", "critical"|
|time|The time of the event, in unix epoch seconds|
|tags|Freeform list of strings, e.g. ["rate", "fooproduct", "transient"]|
|metric|A number associated with this event, e.g. the number of reqs/sec.|
|ttl|A floating-point time, in seconds, that this event is considered valid for. Expired states may be removed from the index.|
In addition to these standard fields, events can contain any number of *custom* fields with string values. See Howto: Custom event attributes for more details.
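As a rough illustration of the fields above, an event inside Riemann can be pictured as a plain Clojure map. Every value below is made up for the example, and `:datacenter` is a hypothetical custom field, not something Riemann defines.

```clojure
;; An illustrative event. All values are invented for this sketch.
(def example-event
  {:host       "api1"
   :service    "API port 8000 reqs/sec"
   :state      "ok"
   :time       1700000000             ; unix epoch seconds
   :tags       ["rate" "fooproduct"]
   :metric     42.5
   :ttl        10.0                   ; valid for ten seconds
   :datacenter "us-east"})            ; hypothetical custom field, string-valued

;; Maps are immutable: assoc returns a new event and leaves the original alone.
(def critical-event (assoc example-event :state "critical"))
```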
The index is a table of the current state of all services tracked by Riemann. Each event is uniquely indexed by its host and service. The index just keeps track of the most recent event for a given (host, service) pair.
Streams run quickly and retain as little state as possible. The index is Riemann's state table; its picture of the world. The dashboard, network clients, and streams can all query the index to see what the system looks like now. The Dashboard is just an HTML view of the index.
Events entered into the index have a :ttl field which indicates how long that event is valid for. Events that sit in the index for longer than their TTL are removed from the index and reinserted into the event streams with state "expired". Instead of polling for failure, just let events expire. Services which fail to check in regularly can be handled by the same state-transition streams you use for error handling.
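A minimal sketch of how indexing and expiry might be wired into a config file. The 5-second reaper interval and the 60-second default TTL are arbitrary choices for illustration, not recommendations.

```clojure
;; Scan the index for expired events every 5 seconds (interval is an assumption).
(periodically-expire 5)

(let [index (index)]
  (streams
    ;; Give events without a :ttl a default of 60 seconds, then index them.
    (default :ttl 60
      index)

    ;; Expired events re-enter the streams with state "expired"; here they are
    ;; only logged, but any stream could handle them.
    (expired
      (fn [event] (info "expired" event)))))
```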
Streams live in the (streams ...) section of your config file. You can think of (streams) as the top level stream, the source from which all events flow. In this documentation, we'll often leave it implied.
Here's a simple stream which sends critical events from any service beginning with "riak" to email@example.com.
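A minimal sketch of such a stream follows. It assumes `email` has been bound to a mailer beforehand; the `:from` address and the exact regular expression are assumptions made for this example.

```clojure
;; Assumed setup: bind `email` to a mailer. The :from address is hypothetical.
(def email (mailer {:from "riemann@example.com"}))

(streams
  ;; Predicate: service name begins with "riak" AND state is "critical".
  (where (and (service #"^riak")
              (state "critical"))
    ;; Child stream: mail every matching event to this address.
    (email "email@example.com")))
```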
The first argument to where is the predicate expression, which where uses to test each event. Events which match the predicate are passed to each of where's children; if they don't match, nothing happens. In this case, there's one child: the (email ...) stream.
You can think of streams like rivers in the real world. Events flow through tributaries and streams, pool in lakes and dams, and are filtered by grates and boulders. Riemann streams aren't really queries; they're more like pipelines which events flow through.
Let's consider a more complex problem. We want to send an email every time a service on a given host starts to report a new state; e.g., goes from reporting "ok", "ok", "ok", ... to "error", "error", "error". But in the event of a cascade failure we don't want to receive hundreds of emails--let's limit it to five per hour, and if more transitions occur, we'll roll them all up into the last email for that hour.
Since streams are just functions which take events as arguments, they're composable. We'll chain together four streams to solve the problem.
We need to detect state transitions for each service independently, so we'll start by splitting the event stream with (by [:host :service]).
The (by) stream is like a river delta, or a fuel manifold. It splits the event stream into n distinct streams, each one independent. We get one fork for each unique host and service, so all the events from host 1 and service 1 flow together.
Since we haven't given (by) any child streams yet, it'll just drop every event on the floor. Most streams can take any number of children, so it's common to see something like (by [:host :service] (some-stream ...) (some-other-stream ...) ...).
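To make the forking concrete, here is a small sketch of `by` with two children. The children are placeholders chosen for illustration: one logs each event and the other prints it.

```clojure
(streams
  (by [:host :service]
    ;; A separate copy of each child is created for every distinct
    ;; [host service] pair, so forks do not share state.
    (fn [event] (info "seen" (:host event) (:service event)))
    prn))
```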
For each one of those independent streams, we'll detect state transitions with (changed :state). Whenever an event arrives with a different state than the previous one, (changed :state) passes on the new event to its children. Otherwise, nothing happens.
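A sketch of the two streams combined so far. The `{:init "ok"}` option is an assumption added for the example: it tells `changed` to treat "ok" as the starting state, so the first "ok" event from a new service is not reported as a transition.

```clojure
(streams
  (by [:host :service]
    ;; One `changed` instance per fork, each remembering its own last :state.
    (changed :state {:init "ok"}
      ;; Placeholder child: print each transition.
      prn)))
```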
With those state transitions, we'll roll up more than 5 events per 3600 seconds. In a given hour, rollup will allow four events to pass through immediately. Then it aggregates all successive events in a buffer, which is passed on at the end of the hour.
Rollup's child is an (email) stream, which emails events it receives to Delacroix. I think there's a problem in the sim units today!
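Putting the four streams together might look like the sketch below. The mailer setup and both addresses are hypothetical stand-ins; only the shape of the pipeline is taken from the description above.

```clojure
;; Assumed setup: bind `email` to a mailer. The :from address is hypothetical.
(def email (mailer {:from "riemann@example.com"}))

(streams
  ;; 1. Fork the stream for each distinct host and service.
  (by [:host :service]
    ;; 2. Pass on an event only when its :state differs from the previous one.
    (changed :state
      ;; 3. At most 5 emails per 3600 seconds; the overflow is buffered
      ;;    and delivered together at the end of the hour.
      (rollup 5 3600
        ;; 4. Mail whatever rollup lets through (address is hypothetical).
        (email "delacroix@example.com")))))
```

Because each stream only knows about its children, reordering or swapping a layer (for example, replacing the email child with a logger while testing) does not require touching the others.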
To learn more, see the howto articles on streams, and check out the Streams API for details.
Clients can query the index for particular events using a simple query language. Dashboard views are each powered by a single query. Queries can be applied on the index of past events, but can also tap into the stream of events going to the index in real-time.
Query messages return a list of matching events. See the query tests for tons of examples, or read the full grammar.
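A few query strings, shown here through a client call. This sketch assumes the separate riemann-clojure-client library and its `riemann.client` namespace; the host and port are placeholders, and the exact return types may differ between client versions.

```clojure
(require '[riemann.client :as r])

;; Hypothetical connection details.
(def c (r/tcp-client {:host "127.0.0.1" :port 5555}))

;; Everything currently critical.
@(r/query c "state = \"critical\"")

;; Services whose name starts with "riak" (% is a wildcard for =~).
@(r/query c "service =~ \"riak%\"")

;; Combine predicates with and/or; match on tags with `tagged`.
@(r/query c "host = \"api1\" and tagged \"rate\" and metric > 0.5")
```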
"Riemann" usually refers to the JVM process that handles events, i.e.
bin/riemann. Internally, that process has several distinct
servers which are specified in the configuration file.
The TCP and UDP servers listen on port 5555 for TCP connections and UDP datagrams. They accept a stream of protocol buffer messages containing events (or queries, control messages, etc). Those events are then applied to a tree of streams. The TCP server also supports querying the index for the current state of the system. The UDP protocol is much faster but lossy. The TCP protocol is slower, but provides reliable delivery and acknowledgement of receipt.
The websockets server (ws-server) accepts HTTP websockets connections. This server streams events which match a given query to clients.
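In a config file, the servers described above might be declared roughly as follows. Binding to 127.0.0.1 is an assumption for the example; the websockets server is left on its default port.

```clojure
(let [host "127.0.0.1"]            ; assumed bind address
  (tcp-server {:host host :port 5555})
  (udp-server {:host host :port 5555})
  (ws-server  {:host host}))
```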