Observability
"Observability" is the generic term we use to describe all the characteristics of an application that help us to observe its behavior in production, reason about problems, develop hypotheses, and verify fixes. Broadly, the specific features which together provide observability are:
- Logging
- Tracing
- Metrics
Each of these areas is a topic unto itself. In addition, the concept of "labels" is important, as they are used for logging and metrics (and probably in the future for tracing as well).
All observability functionality in Rust is contained in the cheburashka
crate. Probably 95% of this functionality is
just the re-exporting of various other third-party crates, with the remaining 5% being Elastio-specific helpers and
defaults that helps us ensure all of our various projects have the same behavior.
Logging
Probably the most important, and most basic, aspect of observability is logging. If there is no logging, we have no visibility into the behavior of our application in production.
We use tracing
from the Tokio project as the basis for our logging, however
internally we avoid taking explicit dependencies on tracing
, instead we re-export the tracing
crate from
cheburashka
as cheburashka::logging
. Check out the RustDoc comments on cheburashka::logging
for a comprehensive
review of the functionality available. In this handbook we'll just call out a few key features, and also Elastio
internal conventions.
Log levels
Like most logging frameworks, tracing
and therefore cheburashka
has the concept of log levels. They are:
trace
debug
info
warn
error
There is an option to have trace
events compiled into the binary only in debug builds, but we don't use that.
By default, the log level info
and above (meaning also warn
and error
) are visible on the console when running
code that's instrumented with log events. This can be controlled by setting a different filter expression in the
ELASTIO_LOG
environment variable, or by programmatically configuring logging with a call to
cheburashka::logging::config_cli
or cheburashka::logging::config_server
.
There can be no absolute rules about which log level is appropriate, but we do have a few conventions:
-
Use
error
only for errors that might be of interest to humans. For example if there's some code that retries after a failed network operation, it would not make sense to log an initial failure aserror
when the code will automatically retry. In that case, you should log the retriable errors asdebug
, and only if the retry limit is reached or the error is determined to not be retryable would you log something at theerror
level. -
warn
should indicate an event that is out of the ordinary but not necessarily indicative of a failure. For example, if an optional function is unavailable because of insufficient permissions that might be information you communicate as a warning. Another example would be an error for which there is a known fall-back but it's important that the fact that the error occurred is communicated. -
info
is for logging information that occurs on the "happy" path, which is at the level of detail that a normal user would be interested in seeing (yes, "normal user" here is entirely subjective). For example, if a backup of theC:\
drive completed successfully, that might be a usefulinfo
message, but you would not want to report every successful block read at theinfo
level because that will spam the log and make it impossible to navigate. To put it another way, if the log level isinfo
, the log output should be enough to tell what is going on at a high level, without scrolling through pages and pages of repetitive log events.In our log browsing tools, we will typically start with a filter of
info
or higher so get a quick overview of what's going on in that log. Bear this in mind before you log anything atinfo
. -
debug
is easily the most common log level for us. It is everything that doesn't fit in any of the other levels. While you should still take care not to generate thousands ofdebug
log events for some small operation, there are less restrictions on verbosity at this level. -
trace
is generally reserved for excessively verbose logging which you would only conceivably want to enable for a small region of the code at a time, when debugging a specific problem. To continue the backup example above, you might log every block offset you read fromC:\
at thetrace
level.
Log output on the console
When writing a CLI, you often need to print some output for the user on stdout
. Typically in Rust you'd use
println!
for this, but the problem is that your println!
output will not also be captured in the log, so you really
need a way to generate log messages which should be printed to the console.
In cheburashka
we have versions of all of the log event macros (event!
, debug!
, info!
, etc) which start with
print_
and always generate log messages with the target console
. If cheburashka
logging is configured with
config_cli
, the default behavior is to display only console
events, and only those at info
or higher. Of course
both of those can be overridden.
This means that where you'd normally reach for println!
to output something, you should use print_info!
(or some
other log level as appropriate) instead. By default this will render on the console without time stamps or other
decoration, so it should look and feel like println!
output, except that if full log output is enabled these log
events will take on the form of all other log output. It also means that when we're running in a server with logs
output to JSON, this will include all of your print_*
events as regular log events with all of the associated
metadata.
You should not ever use println!
unless there's a compelling reason why using the print_
macros doesn't suit your
use case.
use doc::cheburashka; // Nasy hack
use cheburashka::logging::prelude::*;
fn main() {
// Don't do this! WRONG!
println!("Something happened!");
// Do this instead
print_info!("Something happened!");
}
Structured Logging
In many other languages, the way one composes a log message is by formatting a string with placeholders like %s
or
{}
standing for variables, which are then expanded into the value of the variable at runtime. So if you're logging
the number of widgets you frobulated, you might do:
#![allow(unused)] fn main() { use doc::cheburashka; // Nasy hack use cheburashka::logging::prelude::*; let widgets = vec![1]; // Old way of doing things: info!("Frobulated {} widgets", widgets.len()); }
This is okay but it has a few problems. First, what if we want to see how many occurrences of this particular log event happened? We could write a regex that matches it, but that will be brittle if we change the log message in the future and anyway doesn't scale to hundreds or thousands of messages. Second, what if we want to find all cases where we frobulated 10 or more widgets? That's probably also possible with a regex but it's not easy. Also, our logging system needs to store many unique strings, all variations on the message with different expansions of the placeholder. Presumably it uses compression but still that increases the per-message overhead a lot.
Fortunately there's a better way, a pattern that's called "structured logging". In this pattern, which is the pattern we observe in almost all cases (see below for the exception case), you do something like this:
#![allow(unused)] fn main() { use doc::cheburashka; // Nasy hack use cheburashka::logging::prelude::*; let widgets = vec![1]; // New hotness: info!(widgets_count = widgets.len(), "Frobulated widgets"); }
We're reporting the same event, but this time our log event has an unparameterized message "Frobulated widgets"
, and
an integer field widgets_count
which is equal to the number of widgest frobulated. Assuming this log event gets sent
to a log system like Loki or LogStash which also supports structured logging, we can even query our logs for all cases
where widgets_count > 10
, or we can count how many times the "Frobulated widgets"
event was generated. This is way
more powerful, and also in the Rust implementation a bit more efficient since it avoids the heap allocation and string
formatting (of course there's other overhead associated with structured logging but tracing
's implementation is very
well optimized).
Make sure to review the tracing
docs for all of the various
tricks one can do. Just remember that we use cheburashka::logging
instead of tracing
, but the capabilities are the
same.
This works by default for any type for which tracing
has a suitable implementation of its
Value
trait, which in practice is almost everything you'd
want to log. There are a few tips for making use of this:
Structured Logging Types
For integral types of all kinds, they can be used as in the example above.
For &str
values you can also just pass them as-is:
#![allow(unused)] fn main() { use doc::cheburashka; // Nasy hack use cheburashka::logging::prelude::*; // Works fine with &str let my_field: &str = "foo"; // of course you wouldn't log a const like this it's just an example info!(value = my_field, "This is a &str"); }
However String
isn't supported. I'm not sure why, it seems like an odd limitation (GitHub issue).
Fortunately you can always log anything that has either a
Display
or a Debug
implementation by using the %
or ?
sigils, respectively. String
implements both, but
Display
is the one that produces the exact contents of the string:
#![allow(unused)] fn main() { use doc::cheburashka; // Nasy hack use cheburashka::logging::prelude::*; // Logging a `String` is a bit more tricky... let my_field: String = "foo".to_string(); // info!(value = my_field, "This is a String"); <-- Won't compile; no implementation for `Value` for `String` info!(value = %my_field, "This is a String"); // <-- This works because of the `%` sigil }
This trick can also be used to log &Path
and PathBuf
, which don't have a (lossless) cast to strings:
#![allow(unused)] fn main() { use doc::cheburashka; // Nasy hack use cheburashka::logging::prelude::*; use std::path::PathBuf; let my_field: PathBuf = PathBuf::from("foo"); info!(path = %my_field.display(), "This is a PathBuf"); // <-- This works because of the `%` sigil }
Note here that you can't just use %my_field
as we did with the string types, but Path
has a method display()
which
returns a Display
impl that will render the path properly.
Finally there's another special case for logging errors (by which I mean implementations of std::error::Error
).
Errors implement Display
so you might be tempted to use the %
sigil and log the error message that way, but that's
not correct. Tracing subscribers have specific support for Error
implementations, and subscribers could (and
definitely will in the future) log the entire Error
including nested causes and backtraces to some error reporting
system like Sentry. If you just grab the top level error message when logging, you'll miss out on that.
Unfortunately, it's pretty awkward to put Error
s in the the form of a &dyn Error + 'static
that tracing
expects.
So we have a helper in cheburashka
called log_error
(and log_dyn_error
and log_boxed_error
variations as well)
which perform what amounts to a compile-time cast to make sure the error gets captured by tracing
as an error type:
use doc::cheburashka; // Nasy hack use cheburashka::logging::prelude::*; use std::fs::File; fn main() -> anyhow::Result<()> { File::create("/tmp/foo.json") .map_err(|e| { error!(err = log_error(&e), "Failed to create the file"); e })?; Ok(()) }
Note that by convention we use the field name err
when logging errors.
When not to use structured logging
The one exception to the mandate to use structured logging is when using the print_
macros to print to the console and
log at the same time. Because these events by default are rendered on the console without any decorations, any fields
that you add to the log event will not be visible unless the console output has been configured as the full log output.
In this case, and in this case only, you should use the only style string formatting technique to generate log messages.
Note that you do not need to use format!
for this; all of the log event macros include formatting support built in.
Spans
Often one needs to log the same fields multiple times, eg:
#![allow(unused)] fn main() { use doc::cheburashka; // Nasy hack use cheburashka::logging::prelude::*; use std::path::{Path, PathBuf}; fn fallible_operation(path: &Path) -> Result<(), std::io::Error> { // do something that might fail // ... Ok(()) } fn do_stuff(path: PathBuf) { info!(path = %path.display(), "Going to do something..."); if let Err(e) = fallible_operation(&path) { error!(err = log_error(&e), path = %path.display(), "Failed to do something with path"); } else { info!(path = %path.display(), "Successfully did something with path"); } } }
In this example, three log events are reported, all with the same path
field. This is reptitive, and difficult to
maintain. If you add another argument to this function, will you remember to add that argument to all three log events?
tracing
and therefore cheburashka
have the concept of a span which addresses this. A span is a set of name/value
pairs like a log event, except all events that are generated within the context of a span inherit that span's fields.
To the above example would be re-written as:
#![allow(unused)] fn main() { use doc::cheburashka; // Nasy hack use cheburashka::logging::prelude::*; use std::path::{Path, PathBuf}; fn fallible_operation(path: &Path) -> Result<(), std::io::Error> { // do something that might fail // ... Ok(()) } fn do_stuff(path: PathBuf) { let span = info_span!("In do_stuff", path = %path.display()); let _guard = span.enter(); info!("Going to do something..."); if let Err(e) = fallible_operation(&path) { error!(err = log_error(&e), "Failed to do something with path"); } else { info!("Successfully did something with path"); } } }
All of these three log events will be in the span "In do_stuff"
, with the path
field attached. Any log events
generated while the _guard
is in scope will automatically be in the span.
Async gotcha
Note that the above code won't work for async functions:
#![allow(unused)] fn main() { use doc::cheburashka; // Nasy hack use cheburashka::logging::prelude::*; use std::path::{Path, PathBuf}; async fn fallible_operation(path: &Path) -> Result<(), std::io::Error> { // do something that might fail // ... Ok(()) } async fn do_stuff(path: PathBuf) { let span = info_span!("In do_stuff", path = %path.display()); let _guard = span.enter(); info!("Going to do something..."); if let Err(e) = fallible_operation(&path).await { // <- span exited here error!(err = log_error(&e), "Failed to do something with path"); } else { info!("Successfully did something with path"); } } }
The reason why is too complex to explain here but suffice it to say, it won't work. There is an extention trait
available which will add an instrument
method to all Future
s, that will solve this problem:
#![allow(unused)] fn main() { use doc::cheburashka; // Nasy hack use cheburashka::logging::prelude::*; use cheburashka::logging::futures::Instrument; use std::path::{Path, PathBuf}; async fn fallible_operation(path: &Path) -> Result<(), std::io::Error> { // do something that might fail // ... Ok(()) } async fn do_stuff(path: PathBuf) { let span = info_span!("In do_stuff", path = %path.display()); async move { info!("Going to do something..."); if let Err(e) = fallible_operation(&path).await { error!(err = log_error(&e), "Failed to do something with path"); } else { info!("Successfully did something with path"); } }.instrument(span).await } }
Ugly and cumbersome, yes, but async
does tend to make things more complex sometimes. Fortunately there's (sometimes)
a better way...
Automatic Instrumentation
A very common pattern in instrumenting code with structured logging is logging an event when a function is entered,
possibly including its arguments, then logging an event when the function is exited successfully, or logging an error if
the call fails. That's such a common pattern that tracing
has a proc macro that does it for you! And, it works on
both regular and async functions! Behold:
#![allow(unused)] fn main() { use doc::cheburashka; // Nasy hack use cheburashka::logging::prelude::*; use std::path::{Path, PathBuf}; // NOTE: this struct is not `Debug` pub struct MyStruct{ inner: PathBuf }; fn fallible_operation(path: &Path) -> Result<(), std::io::Error> { // do something that might fail // ... Ok(()) } #[instrument(skip(x), fields(path = %x.inner.display()))] fn do_stuff(x: MyStruct) { info!("Going to do something..."); if let Err(e) = fallible_operation(&x.inner) { error!(err = log_error(&e), "Failed to do something with path"); } else { info!("Successfully did something with path"); } } }
In this case, skip
is used because MyStruct
cannot be logged as a field directly; instead a custom field called
path
is added to the span, using the result of x.inner.display()
. This will automatically create a span with the
name of the method, and ensure all code in the method is in this span. It doesn't matter if do_stuff
is async
; this
still works.
If the method you annotate is fallible (meaning it returns Result<...>
) there's another helpful option:
#![allow(unused)] fn main() { use doc::cheburashka; // Nasy hack use cheburashka::logging::prelude::*; use std::path::{Path, PathBuf}; fn fallible_operation(path: &Path) -> Result<(), std::io::Error> { // do something that might fail // ... Ok(()) } #[instrument(err)] fn do_stuff(path: PathBuf) -> Result<(), std::io::Error> { info!("Going to do something..."); if let Err(e) = fallible_operation(&path) { error!(err = log_error(&e), "Failed to do something with path"); } else { info!("Successfully did something with path"); } Ok(()) } }
The err
option in instrument
tells it to log an error log event if this function returns an error. It doesn't
matter how control leaves the function, whether a return
statement or a ?
on a failed result, the error will be
caught and logged.
Now, you might be tempted to run off and put #[instrument(err)]
on every function you write from now on so that you
can always know when things fail. Resist that urge, and use this capability wisely and with intent. We do not want to
spam logs with a bunch of error events every time one thing goes wrong. Think carefully about where logical boundaries
in your code are, and log errors only as they cross those boundaries. There are no hard and fast rules here, but see
the chapter on errors for some thoughts.
Labels
In the section about logging, particularly structured logging, we described how to generate log events that have named fields logged along with a message. Up to this point, which fields were logged was hard-coded at the event or span level. But even with the span feature allowing you to put fields on all events in a span, the definition of those fields is manual, and if you want to use the same fields in many places in code, you must copy-paste. That's fine for ad-hoc log events, but there are cases where you want more consistency.
In a later section we'll talk about metrics, but for now just accept that metrics have labels on them similar conceptually to the fields we log with structured logging. We have tools like Loki and Grafana, which let us switch between a graphical view of metrics, and the log events that were logged with the same labels as those metrics. This only works if we are consistent in making sure the labels we attach to our metrics and the fields we log in our log events line up.
In cases like this, cheburashka
introduces the concept of labels, in the form of the LabelSet
trait and a proc macro
which generates LabelSet
impls on arbitrary structs. Labels are a concept implemented directly in cheburashka
; the
underlying tracing
crate doesn't have any concept of labels, just the arbitrary fields we've already discussed.
The doc comments in cheburashka::labels
are very extensive, so go read them to learn about the concept and its
implementation. The key point to take away is that given a set of labels, you can create a log span with those labels,
and you can create metrics that require those labels, with compile-time checking that the labels applied to the metrics
are the correct ones. This provides a convenient way to define labels in code and use them in multiple places without
duplication of effort or manual work.
Metrics
At a high level, metrics are just time-series data about the behavior of your code. There are many ways to create, report, and store metrics. Our approach is based on what is built in Prometheus.
For useful background, read these sections of the Prometheus docs:
Note that we don't actually use Prometheus internally. We find that VictoriaMetrics is a better implementation, with higher performance and better support for high cardinality labels. However VictoriaMetrics is another implementation of the same Prometheus concepts, and the Prometheus docs are better, so in practice the fact that we don't actually use Prometheus doesn't change how we instrument our code to collect metrics.
The metrics implementation is in cheburashka::metrics
. It's based on the prometheus
crate, however it adds quite a
bit of additional convenience functionality on top.
The most significant improvement is the support for labeled metrics, which means metrics which take a specific type of
LabelSet
defined at compile time, so that attempting to apply the wrong set of labels to the metric is a compile
error. You are strongly urged to use this in favor of the regular metrics types re-exported by cheburashka
from
prometheus
.
The doc comments in cheburashka::metrics
cover all of this in detail.