Observability

"Observability" is the generic term we use to describe all the characteristics of an application that help us to observe its behavior in production, reason about problems, develop hypotheses, and verify fixes. Broadly, the specific features which together provide observability are:

Logging
Tracing
Metrics

Each of these areas is a topic unto itself. In addition, the concept of "labels" is important, as they are used for logging and metrics (and probably in the future for tracing as well).

All observability functionality in Rust is contained in the cheburashka crate. Probably 95% of this functionality is just the re-exporting of various other third-party crates, with the remaining 5% being Elastio-specific helpers and defaults that helps us ensure all of our various projects have the same behavior.

Logging

Probably the most important, and most basic, aspect of observability is logging. If there is no logging, we have no visibility into the behavior of our application in production.

We use tracing from the Tokio project as the basis for our logging, however internally we avoid taking explicit dependencies on tracing, instead we re-export the tracing crate from cheburashka as cheburashka::logging. Check out the RustDoc comments on cheburashka::logging for a comprehensive review of the functionality available. In this handbook we'll just call out a few key features, and also Elastio internal conventions.

Log levels

Like most logging frameworks, tracing and therefore cheburashka has the concept of log levels. They are:

trace
debug
info
warn
error

There is an option to have trace events compiled into the binary only in debug builds, but we don't use that.

By default, the log level info and above (meaning also warn and error) are visible on the console when running code that's instrumented with log events. This can be controlled by setting a different filter expression in the ELASTIO_LOG environment variable, or by programmatically configuring logging with a call to cheburashka::logging::config_cli or cheburashka::logging::config_server.

There can be no absolute rules about which log level is appropriate, but we do have a few conventions:

Use error only for errors that might be of interest to humans. For example if there's some code that retries after a failed network operation, it would not make sense to log an initial failure as error when the code will automatically retry. In that case, you should log the retriable errors as debug, and only if the retry limit is reached or the error is determined to not be retryable would you log something at the error level.
warn should indicate an event that is out of the ordinary but not necessarily indicative of a failure. For example, if an optional function is unavailable because of insufficient permissions that might be information you communicate as a warning. Another example would be an error for which there is a known fall-back but it's important that the fact that the error occurred is communicated.
info is for logging information that occurs on the "happy" path, which is at the level of detail that a normal user would be interested in seeing (yes, "normal user" here is entirely subjective). For example, if a backup of the C:\ drive completed successfully, that might be a useful info message, but you would not want to report every successful block read at the info level because that will spam the log and make it impossible to navigate. To put it another way, if the log level is info, the log output should be enough to tell what is going on at a high level, without scrolling through pages and pages of repetitive log events.

In our log browsing tools, we will typically start with a filter of info or higher so get a quick overview of what's going on in that log. Bear this in mind before you log anything at info.
debug is easily the most common log level for us. It is everything that doesn't fit in any of the other levels. While you should still take care not to generate thousands of debug log events for some small operation, there are less restrictions on verbosity at this level.
trace is generally reserved for excessively verbose logging which you would only conceivably want to enable for a small region of the code at a time, when debugging a specific problem. To continue the backup example above, you might log every block offset you read from C:\ at the trace level.

Log output on the console

When writing a CLI, you often need to print some output for the user on stdout. Typically in Rust you'd use println! for this, but the problem is that your println! output will not also be captured in the log, so you really need a way to generate log messages which should be printed to the console.

In cheburashka we have versions of all of the log event macros (event!, debug!, info!, etc) which start with print_ and always generate log messages with the target console. If cheburashka logging is configured with config_cli, the default behavior is to display only console events, and only those at info or higher. Of course both of those can be overridden.

This means that where you'd normally reach for println! to output something, you should use print_info! (or some other log level as appropriate) instead. By default this will render on the console without time stamps or other decoration, so it should look and feel like println! output, except that if full log output is enabled these log events will take on the form of all other log output. It also means that when we're running in a server with logs output to JSON, this will include all of your print_* events as regular log events with all of the associated metadata.

You should not ever use println! unless there's a compelling reason why using the print_ macros doesn't suit your use case.

use doc::cheburashka; // Nasy hack
use cheburashka::logging::prelude::*;

fn main() {
// Don't do this!  WRONG!
println!("Something happened!");

// Do this instead
print_info!("Something happened!");
}

Structured Logging

In many other languages, the way one composes a log message is by formatting a string with placeholders like %s or {} standing for variables, which are then expanded into the value of the variable at runtime. So if you're logging the number of widgets you frobulated, you might do:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasy hack
use cheburashka::logging::prelude::*;

let widgets = vec![1];
// Old way of doing things:
info!("Frobulated {} widgets", widgets.len());
}

This is okay but it has a few problems. First, what if we want to see how many occurrences of this particular log event happened? We could write a regex that matches it, but that will be brittle if we change the log message in the future and anyway doesn't scale to hundreds or thousands of messages. Second, what if we want to find all cases where we frobulated 10 or more widgets? That's probably also possible with a regex but it's not easy. Also, our logging system needs to store many unique strings, all variations on the message with different expansions of the placeholder. Presumably it uses compression but still that increases the per-message overhead a lot.

Fortunately there's a better way, a pattern that's called "structured logging". In this pattern, which is the pattern we observe in almost all cases (see below for the exception case), you do something like this:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasy hack
use cheburashka::logging::prelude::*;

let widgets = vec![1];
// New hotness:
info!(widgets_count = widgets.len(), "Frobulated widgets");
}

We're reporting the same event, but this time our log event has an unparameterized message "Frobulated widgets", and an integer field widgets_count which is equal to the number of widgest frobulated. Assuming this log event gets sent to a log system like Loki or LogStash which also supports structured logging, we can even query our logs for all cases where widgets_count > 10, or we can count how many times the "Frobulated widgets" event was generated. This is way more powerful, and also in the Rust implementation a bit more efficient since it avoids the heap allocation and string formatting (of course there's other overhead associated with structured logging but tracing's implementation is very well optimized).

Make sure to review the tracing docs for all of the various tricks one can do. Just remember that we use cheburashka::logging instead of tracing, but the capabilities are the same.

This works by default for any type for which tracing has a suitable implementation of its Value trait, which in practice is almost everything you'd want to log. There are a few tips for making use of this:

Structured Logging Types

For integral types of all kinds, they can be used as in the example above.

For &str values you can also just pass them as-is:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasy hack
use cheburashka::logging::prelude::*;

// Works fine with &str
let my_field: &str = "foo"; // of course you wouldn't log a const like this it's just an example

info!(value = my_field, "This is a &str");
}

However String isn't supported. I'm not sure why, it seems like an odd limitation (GitHub issue). Fortunately you can always log anything that has either a Display or a Debug implementation by using the % or ? sigils, respectively. String implements both, but Display is the one that produces the exact contents of the string:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasy hack
use cheburashka::logging::prelude::*;

// Logging a `String` is a bit more tricky...
let my_field: String = "foo".to_string();

// info!(value = my_field, "This is a String"); <-- Won't compile; no implementation for `Value` for `String`
info!(value = %my_field, "This is a String"); // <-- This works because of the `%` sigil
}

This trick can also be used to log &Path and PathBuf, which don't have a (lossless) cast to strings:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasy hack
use cheburashka::logging::prelude::*;
use std::path::PathBuf;

let my_field: PathBuf = PathBuf::from("foo");

info!(path = %my_field.display(), "This is a PathBuf"); // <-- This works because of the `%` sigil
}

Note here that you can't just use %my_field as we did with the string types, but Path has a method display() which returns a Display impl that will render the path properly.

Finally there's another special case for logging errors (by which I mean implementations of std::error::Error). Errors implement Display so you might be tempted to use the % sigil and log the error message that way, but that's not correct. Tracing subscribers have specific support for Error implementations, and subscribers could (and definitely will in the future) log the entire Error including nested causes and backtraces to some error reporting system like Sentry. If you just grab the top level error message when logging, you'll miss out on that.

Unfortunately, it's pretty awkward to put Errors in the the form of a &dyn Error + 'static that tracing expects. So we have a helper in cheburashka called log_error (and log_dyn_error and log_boxed_error variations as well) which perform what amounts to a compile-time cast to make sure the error gets captured by tracing as an error type:

use doc::cheburashka; // Nasy hack
use cheburashka::logging::prelude::*;
use std::fs::File;

fn main() -> anyhow::Result<()> {
File::create("/tmp/foo.json")
  .map_err(|e| {
  error!(err = log_error(&e), "Failed to create the file");

  e
  })?;
Ok(())
}

Note that by convention we use the field name err when logging errors.

When not to use structured logging

The one exception to the mandate to use structured logging is when using the print_ macros to print to the console and log at the same time. Because these events by default are rendered on the console without any decorations, any fields that you add to the log event will not be visible unless the console output has been configured as the full log output. In this case, and in this case only, you should use the only style string formatting technique to generate log messages. Note that you do not need to use format! for this; all of the log event macros include formatting support built in.

Spans

Often one needs to log the same fields multiple times, eg:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasy hack
use cheburashka::logging::prelude::*;
use std::path::{Path, PathBuf};

fn fallible_operation(path: &Path) -> Result<(), std::io::Error> {
  // do something that might fail
  // ...
Ok(())
}

fn do_stuff(path: PathBuf) {
  info!(path = %path.display(), "Going to do something...");

  if let Err(e) = fallible_operation(&path) {
     error!(err = log_error(&e), path = %path.display(), "Failed to do something with path");
  } else {
     info!(path = %path.display(), "Successfully did something with path");
  }
}
}

In this example, three log events are reported, all with the same path field. This is reptitive, and difficult to maintain. If you add another argument to this function, will you remember to add that argument to all three log events?

tracing and therefore cheburashka have the concept of a span which addresses this. A span is a set of name/value pairs like a log event, except all events that are generated within the context of a span inherit that span's fields. To the above example would be re-written as:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasy hack
use cheburashka::logging::prelude::*;
use std::path::{Path, PathBuf};

fn fallible_operation(path: &Path) -> Result<(), std::io::Error> {
  // do something that might fail
  // ...
Ok(())
}

fn do_stuff(path: PathBuf) {
  let span = info_span!("In do_stuff", path = %path.display());
  let _guard = span.enter();
  info!("Going to do something...");

  if let Err(e) = fallible_operation(&path) {
     error!(err = log_error(&e), "Failed to do something with path");
  } else {
     info!("Successfully did something with path");
  }
}
}

All of these three log events will be in the span "In do_stuff", with the path field attached. Any log events generated while the _guard is in scope will automatically be in the span.

Async gotcha

Note that the above code won't work for async functions:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasy hack
use cheburashka::logging::prelude::*;
use std::path::{Path, PathBuf};

async fn fallible_operation(path: &Path) -> Result<(), std::io::Error> {
  // do something that might fail
  // ...
Ok(())
}

async fn do_stuff(path: PathBuf) {
  let span = info_span!("In do_stuff", path = %path.display());
  let _guard = span.enter();
  info!("Going to do something...");

  if let Err(e) = fallible_operation(&path).await { // <- span exited here
     error!(err = log_error(&e), "Failed to do something with path");
  } else {
     info!("Successfully did something with path");
  }
}
}

The reason why is too complex to explain here but suffice it to say, it won't work. There is an extention trait available which will add an instrument method to all Futures, that will solve this problem:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasy hack
use cheburashka::logging::prelude::*;
use cheburashka::logging::futures::Instrument;
use std::path::{Path, PathBuf};

async fn fallible_operation(path: &Path) -> Result<(), std::io::Error> {
  // do something that might fail
  // ...
Ok(())
}

async fn do_stuff(path: PathBuf) {
  let span = info_span!("In do_stuff", path = %path.display());

  async move {
    info!("Going to do something...");

    if let Err(e) = fallible_operation(&path).await {
       error!(err = log_error(&e), "Failed to do something with path");
    } else {
       info!("Successfully did something with path");
    }
  }.instrument(span).await
}
}

Ugly and cumbersome, yes, but async does tend to make things more complex sometimes. Fortunately there's (sometimes) a better way...

Automatic Instrumentation

A very common pattern in instrumenting code with structured logging is logging an event when a function is entered, possibly including its arguments, then logging an event when the function is exited successfully, or logging an error if the call fails. That's such a common pattern that tracing has a proc macro that does it for you! And, it works on both regular and async functions! Behold:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasy hack
use cheburashka::logging::prelude::*;
use std::path::{Path, PathBuf};

// NOTE: this struct is not `Debug`
pub struct MyStruct{
  inner: PathBuf
};

fn fallible_operation(path: &Path) -> Result<(), std::io::Error> {
  // do something that might fail
  // ...
Ok(())
}

#[instrument(skip(x), fields(path = %x.inner.display()))]
fn do_stuff(x: MyStruct) {
  info!("Going to do something...");

  if let Err(e) = fallible_operation(&x.inner) {
     error!(err = log_error(&e), "Failed to do something with path");
  } else {
     info!("Successfully did something with path");
  }
}
}

In this case, skip is used because MyStruct cannot be logged as a field directly; instead a custom field called path is added to the span, using the result of x.inner.display(). This will automatically create a span with the name of the method, and ensure all code in the method is in this span. It doesn't matter if do_stuff is async; this still works.

If the method you annotate is fallible (meaning it returns Result<...>) there's another helpful option:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasy hack
use cheburashka::logging::prelude::*;
use std::path::{Path, PathBuf};

fn fallible_operation(path: &Path) -> Result<(), std::io::Error> {
  // do something that might fail
  // ...
Ok(())
}

#[instrument(err)]
fn do_stuff(path: PathBuf) -> Result<(), std::io::Error> {
  info!("Going to do something...");

  if let Err(e) = fallible_operation(&path) {
     error!(err = log_error(&e), "Failed to do something with path");
  } else {
     info!("Successfully did something with path");
  }

  Ok(())
}
}

The err option in instrument tells it to log an error log event if this function returns an error. It doesn't matter how control leaves the function, whether a return statement or a ? on a failed result, the error will be caught and logged.

Now, you might be tempted to run off and put #[instrument(err)] on every function you write from now on so that you can always know when things fail. Resist that urge, and use this capability wisely and with intent. We do not want to spam logs with a bunch of error events every time one thing goes wrong. Think carefully about where logical boundaries in your code are, and log errors only as they cross those boundaries. There are no hard and fast rules here, but see the chapter on errors for some thoughts.

Labels

In the section about logging, particularly structured logging, we described how to generate log events that have named fields logged along with a message. Up to this point, which fields were logged was hard-coded at the event or span level. But even with the span feature allowing you to put fields on all events in a span, the definition of those fields is manual, and if you want to use the same fields in many places in code, you must copy-paste. That's fine for ad-hoc log events, but there are cases where you want more consistency.

In a later section we'll talk about metrics, but for now just accept that metrics have labels on them similar conceptually to the fields we log with structured logging. We have tools like Loki and Grafana, which let us switch between a graphical view of metrics, and the log events that were logged with the same labels as those metrics. This only works if we are consistent in making sure the labels we attach to our metrics and the fields we log in our log events line up.

In cases like this, cheburashka introduces the concept of labels, in the form of the LabelSet trait and a proc macro which generates LabelSet impls on arbitrary structs. Labels are a concept implemented directly in cheburashka; the underlying tracing crate doesn't have any concept of labels, just the arbitrary fields we've already discussed.

The doc comments in cheburashka::labels are very extensive, so go read them to learn about the concept and its implementation. The key point to take away is that given a set of labels, you can create a log span with those labels, and you can create metrics that require those labels, with compile-time checking that the labels applied to the metrics are the correct ones. This provides a convenient way to define labels in code and use them in multiple places without duplication of effort or manual work.

Metrics

At a high level, metrics are just time-series data about the behavior of your code. There are many ways to create, report, and store metrics. Our approach is based on what is built in Prometheus.

For useful background, read these sections of the Prometheus docs:

Note that we don't actually use Prometheus internally. We find that VictoriaMetrics is a better implementation, with higher performance and better support for high cardinality labels. However VictoriaMetrics is another implementation of the same Prometheus concepts, and the Prometheus docs are better, so in practice the fact that we don't actually use Prometheus doesn't change how we instrument our code to collect metrics.

The metrics implementation is in cheburashka::metrics. It's based on the prometheus crate, however it adds quite a bit of additional convenience functionality on top.

The most significant improvement is the support for labeled metrics, which means metrics which take a specific type of LabelSet defined at compile time, so that attempting to apply the wrong set of labels to the metric is a compile error. You are strongly urged to use this in favor of the regular metrics types re-exported by cheburashka from prometheus.

The doc comments in cheburashka::metrics cover all of this in detail.

Elastio Engineering Handbook