Copyright

The contents of this work are Copyright (C) 2020 Elastio Software, Inc.

This book is made available subject to the terms of the CC BY 4.0 license. All rights reserved.

Foreword

This book is the result of over a year of experimentation within Elastio to find the best way to communicate our approach to software engineering to existing team members and especially new hires. As an early-stage startup working with Rust at a time when that language was undergoing rapid change, best practices went from "best" to "deprecated" very quickly. At the same time, rapid growth in our Engineering team meant new members were joining with no formal Rust experience and no better way to learn the ropes than to submit code in PRs, only to have to rewrite it in response to a torrent of feedback from more experienced members.

It became clear that we needed a better way. The result is this book. We hope it achieves the following goals:

  • Accessible content that's easy to read and understand. English is not the first language for most of our team.
  • Always up to date because updating the book is as easy as editing markdown and making a PR to this repo. Netlify makes this particularly easy.
  • Broadly applicable outside of Elastio. While the primary audience for this book is members of Elastio engineering both new and seasoned, much of the content is derived from practical experience building commercial software at scale and should be useful to readers facing similar challenges in other organizations.

If you find errors, omissions, or obsolete information anywhere in this book, you're encouraged to open a PR. If you're an Elastio engineer, it's literally part of your job, but even if you're not we welcome contributions from our fellow software engineers outside of Elastio.

Introduction

This handbook is only as helpful as we as a team make it. When you find information that's out of date or simply wrong, it's important that you take responsibility for correcting it. The process is very straightforward: branch the repo and submit a PR with the fix.

Elastio engineering team members, including new hires going through initial orientation, are required to contribute to keeping this handbook up to date.

New Hire Orientation

Welcome to Elastio! This chapter covers the steps you need to go through to on-board into the Engineering organization and begin to contribute code to our repos.

A reminder that this is a living document. If you find inaccurate or incomplete information or details that are simply missing, you must take action to improve the handbook by making a branch of this handbook's GitHub repo, making a change, and opening a PR.

Preparing for Onboarding

If you have time before your start day, consider doing some of this prep in advance. If not, you should work through these steps first thing in the morning on your first day:

  1. Complete all on-boarding paperwork with HR. If you're on-boarding in Ukraine, make sure you've signed all documents, including the Elastio IP Rights Assignment doc before starting work. If you were not asked to sign this document, contact @anelson and let him know immediately.

  2. Install a TOTP-compatible token generating app on your phone. Most of the Elastio team use Aegis, but it's up to you what app you use. The only requirement is that it be on a smartphone, and NOT on your laptop or desktop computer. 2FA isn't really 2FA when the second factor is stored on the same computer where you enter the first factor!

  3. Install 1Password.

    IMPORTANT NOTE: The step here is "install", not "set up". In particular, do not try to set up an account with 1Password. Elastio has a corporate account, elastio.1password.com, to which you will be added on your first day. This step is just to install the tools in preparation for your Day 1 onboarding.

    How you do this depends on what OS you're running:

    1. Linux: Install the 1Password browser extension on Chrome or Firefox.

    2. Windows or macOS: Install the 1Password desktop application for Windows or macOS, and then the 1Password Classic (NOT "1Password", but "1Password Classic") browser extension.

    3. Mobile: Install the 1Password mobile app for Android or iOS. This will be useful regardless of which desktop OS you use, as a backup and when you're not at your desk.

  4. Disable your browser's automatic password saving feature. As a matter of policy, Elastio requires that all credentials to access our systems be stored securely in 1Password, and NOT in the Firefox or Chrome or Safari password managers. Either disable this feature entirely, or set up a separate Elastio profile in your browser of choice with password management disabled, and use this profile for all official Elastio work.

    This is very important for the security of our systems. Every engineer is responsible for the security of the account credentials assigned to them; you do not want a breach to be your responsibility!

  5. Think of or generate a unique, secure master password that you will use to access your Elastio 1Password account. Read about how to generate a strong master password, and if you need help coming up with password ideas, use the 1Password secure generator. This is the only password you will need to remember, as all others will be stored in 1Password.

  6. If you are going to use your personal computer instead of an Elastio-provided system, make sure whole disk encryption is enabled. This is a requirement for any system where Elastio intellectual property is downloaded.

Day 1 Onboarding

  1. @anelson or some other manager will create an account for you on our team 1Password instance (elastio.1password.com). If you followed the prep instructions above, you already have a secure master password in mind. If not, you're not off to a great start :) Go back and follow the preparation instructions!

    Once this account has been created, you'll receive an email invitation to your personal email address (US) or your Polytech email (Ukraine). Follow the instructions, and enter your master password when prompted.

    As part of this setup process an Emergency Kit PDF will be generated. Read about the emergency kit and place the PDF in a safe place in your personal cloud storage (for example, email it to yourself, save it in Dropbox, etc). Elastio management cannot recover this for you if you lose it.

    Once this account is set up, log in to it using the 1Password app on your desktop (Windows or macOS) or the 1Password browser extension (Linux), as well as the 1Password mobile app. Confirm everything works correctly before proceeding.

    At this point you should take some time to learn how to use 1password. The intro video is a good starting point. When correctly installed and configured, 1Password will automatically offer to save passwords you type in a browser, and when filling out a form to create a new account it will offer to generate unique, secure passwords for every account.

    IMPORTANT NOTE: ALL credentials to access Elastio systems must be secured in 1Password and nowhere else. For systems where you are prompted to choose a password, always use 1Password to generate a secure, unique password.

  2. You will need a GitHub account to join the Elastio organization. You may use your existing personal account, or create a new one just for Elastio. Either way, you must generate a new secure password for your GitHub account, again using 1Password.

    You must also enable two-factor authentication (2FA) on the GitHub account. Use the Aegis app to scan the QR code and start generating GitHub 2FA codes on your phone. GitHub will also offer you recovery codes, which are single-use passwords that can be used to access your account in the event you lose your 2FA token generator. These recovery codes must be stored in 1Password as well. We recommend storing them in the 'Notes' field for your GitHub account entry in 1Password.

    Be careful not to lose your recovery codes, as without them you'll be unable to log in to your GitHub account if you ever lose access to your phone. Also be careful to store the recovery codes only in 1Password and not on your local system where they are not secure.

  3. Now that 2FA is enabled, @anelson will add your GitHub account to the Elastio org. Confirm you can see the Elastio repos. Check out elastio and verify you can build and run the solution by running cargo test. You will not be able to do this without installing some packages; see the README for details.

  4. If your job in Elastio will involve using SSH to connect to our systems, make sure you have at least one SSH public key associated with your GitHub account. If not, generate one with ssh-keygen. The password for this key pair should be, obviously, generated by and stored in 1Password.

  5. Adam will create a Google Workspace account for you, with a one-time-use password. Log in to Google Workspace with this temporary password, and use 1Password to generate a new password for it. You need to enable 2FA on Google Workspace as well. Once again, use the Aegis app on your phone to add the account to Aegis and generate the initial 2FA token.

    NOTE Google's fiendish attempts to force its paying customers to use their Google Prompt crapware have made it more difficult to on-board using 2FA. You'll have to coordinate this with @anelson, initially set up 2FA using SMS (which is not at all secure; thanks Google!), then once that's set up add another 2FA method via Aegis, and finally delete the SMS method. This is complicated and probably will require a screen share with @anelson.

  6. Now that you have an elastio.com email address, change your 1Password account to use this email address. Verify you can still access 1Password. @anelson can't do this for you; it has to be done from within your 1Password login. You should see a "Change Email" option in the 1P UI on the left side of the screen when viewing your profile page.

  7. Congratulations! You're done with the initial on-boarding steps. Report to your mentor on Slack and start your job-specific on-boarding.

Environment

For developers, you'll need to set up your development environment. There's no Elastio rule about how you do this; it's entirely a matter of personal preference. This chapter contains some advice from various Elastio team members. Take it or leave it.

Windows Setup

Whether you're using a Linux system for your development environment or not, it's very likely you'll need to work with a Linux environment at some point.

If you're on Windows, @anelson suggests using WSL 2, and in particular the Fedora version. This is well tested, performs well, and lets you forget for a few moments that you're on Windows. Make sure you install a decent terminal emulator if you go this route. Alacritty is great; the Windows Terminal app in the MSFT store is pretty good. The console you get running PowerShell or cmd.exe is an abomination, and its use in the presence of other Elastio team members will invite derisive comments at the very least, or merciless ridicule at worst.

The Rust toolchain should be installed on your Windows system, as well as on your Linux VM. Most of the Elastio system supports Windows, although some key components including ScaleZ are Linux-only. That's why you'll need a Linux environment no matter what OS you run.

@olesiastorchak runs Windows; ask her if you get stuck.

Linux Setup

Various team members use various distributions. @anelson's current recommendation is the latest Fedora (Fedora 32 at the time of this writing). Other developers swear by Ubuntu (20.04 as of this writing). This comes down to personal preference, but it also means you're responsible for getting your distro of choice working. If you don't care and just want to get your work done, go with Ubuntu or Fedora.

To build the Elastio Rust codebase on Linux you'll need some dependencies installed. There's a shell script in the elastio/devboxes repo which is used to build our Vagrant build boxes for every supported distro we target (there are a lot of them). You'd be well advised to run that script on your system, or at least read it and install the packages therein. @anelson apologizes that this is a shell script and not an Ansible playbook, he has thus far failed to convert the shell scripting heathens to the virtuous path of Ansible.

macOS Setup

Good luck to you. @vvv uses a Mac; maybe he can help you.

Editors

There are many opinions in the company about which editor to use. A partial list of who prefers what is below. Depending on what editor you prefer, you may consider reaching out to one of your colleagues for advice:

Editor                       Users
(neo)vim                     @anelson
emacs                        @vvv
VS Code                      @Veetaha
CLion or IDEA Rust plugin    @olesiastorchak @JusSergey
Visual Studio                LOLWUT?? Seriously? Who uses that?

GitHub

Elastio uses GitHub for our source code repository, and also as the basis for our software development process. This chapter is specifically concerned with the use of GitHub for managing source code. For details about how we use GitHub in our development process, see the chapter in Part III.

Repositories

The vast majority of the Elastio code is in a monorepo, elastio/elastio. However there are still many other repos. Below is a partial list:

Repo           Purpose
elastio        Elastio monorepo; almost all (or all?) Rust code lives here
handbook       Source code for this book that you're reading now
elastio-snap   Linux change tracking device driver, forked from datto-bd
drivers        Windows change tracking and virtual disk drivers. Originally supposed to contain the Linux driver as well; probably should be renamed by now
elastio-ui     Elastio UI project. This may eventually merge with another repo.
ci             Elastio CI infrastructure scripts
devboxes       Scripts which build our Vagrant build, dev, and test boxes
packaging      Scripts and spec files for Elastio packages for various Linux distros plus Windows
wishmachine    Our internal QA orchestration project

CI

We use GitHub Actions for CI/CD. As part of our corporate GitHub account we have several thousand minutes per month on GitHub's hosted runners, as well as our own fleet of self-hosted runners.

When possible we prefer to use GitHub's hosted runners, as they're less work for us to maintain and have every imaginable development tool already installed. This is what's used to build the UI, Wish Machine, and the parts of the back-end that can be built and tested with simple Docker containers.

Sadly not all of our solution is that simple. GitHub runners have some limitations that make them unsuitable for key parts of our stack:

  • Total of 14GB of SSD-backed storage. This is woefully inadequate for building our Rust codebase.
  • Linux runners are only available as Ubuntu VMs. We can easily test RHEL, Fedora, etc userlands with Docker, but without the corresponding kernel component the tests are meaningless.
  • Cache functionality sucks, and builds on the underpowered VMs are slow.

Therefore, much of our Rust code builds on GitHub Actions Runners running in our own infrastructure. We have servers in Hetzner which are dedicated to this purpose. The way we organize the runners is very specific to Elastio's needs.

The runners themselves run not in VMs but on bare metal Ubuntu 20.04 servers. Each server has multiple runners, enough to ensure all runners together fully utilize the CPU and I/O resources of each server.

The CI jobs that we run on our self-hosted runner do not actually do any building in the runners directly. Rather, we maintain a selection of Vagrant boxes for each supported target platform, and we run the builds within those vagrant boxes on each runner. This is a very unusual configuration and it's a bit awkward to maintain, but it lets us build and test all of our components, including kernel drivers which would be very irresponsible to load directly in a build runner but are perfectly safe to use within a VM.

The elastio/devboxes repo contains the scripts that build these VMs. From time to time you may need to submit PRs to this repo if your code requires access to a new library or you add a new dependency that should be pre-installed to minimize build times. See the chapter on Vagrant for more details on how these boxes work and how you can use them in your own workflows.

Build on Push vs Build on PR

Currently we don't do any CI builds on pushes in a non-master branch. We may revisit that in the future, and if you have a strong opinion please speak up. But for now understand that until you open a PR, your code is not being built by CI.

In cases where you want to get CI looking over your code but you're not ready to have colleagues look at your work, open a Draft PR and don't request any reviewers. You can be assured no one will bother looking at a Draft PR that they haven't been specifically requested to review.

Workflow

This chapter contains guidance on how we use Git in general and GitHub in particular in our day-to-day engineering work.

Branching

With rare exceptions, all work should be done in branches, not master. Branches should be named with a prefix indicating the type of work done, and the GitHub issue ID the work pertains to. For example:

  • feat/100-implement-frobber-support
  • bug/200-gonkolator-tainted-on-refresh
  • scrap/experiment-with-webasm-port

If there is a GitHub ID associated with the work you're doing, it should be in the branch name. Some developers are lazy about this, but @anelson is strict about it.
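For example, starting work on a feature tracked by a hypothetical issue #100 looks like this:

git checkout -b feat/100-implement-frobber-support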

The prefix on the branch name should indicate the type of work being done:

  • feat - A new feature
  • bug - A bug fix. Sometimes the line between the two is blurry. Use your best judgment.
  • scrap - Work on spikes or prototypes where there is no intention to merge the code back into master.

If you think your task doesn't fit into these categories and a different prefix would better clarify your intentions, then go ahead and follow your best judgment. But ignoring this convention out of laziness or failure to read this section of the handbook will likely get your PR rejected.

Branching off of branches

In rare cases, you need to do some work relative not to the code in master, but someone else's branch. This should be avoided because it has the potential for some truly disastrous Git calamities. If you must do this, make sure you coordinate carefully with whoever is working in your upstream branch, and make sure they know that if they force-push their branch you will have your vengeance, in this life or the next.

Organizing your commits in a branch

When you are not yet ready to submit your branch to a Pull Request, you can do whatever you want in that branch. Force push, merge commits, cherry picking, whatever awful contortions Git will let you perform, you're free to do. We have no rules around this.

@anelson tries to maintain fastidious notes on his work by frequently committing code with meaningful comments on each commit. His reasoning is that this helps to prepare the pull request by having all of the details about the progress that leads up to the final result, close at hand in the commit messages.

Some other members of the team are believers in the "fix", "fix 2", "fix fix fix", "fix 100" school of meaningless commit messages. What you do in the privacy of your own branch is your business (subject to the caveat above if you're sharing the branch with someone else, in which case, as a courtesy to your colleagues, please at least make an effort to be decent).

All of this changes, however, when you're ready to submit your branch to a PR for review by your colleagues. See the next section.

Pull Requests

PRs should be created when a branch is ready to merge, or created as a Draft PR if the code is not ready for merging but you'd like to solicit some feedback (or, alternatively, when you want to trigger CI builds on your branch).

When a PR is ready for someone to review, you are expected to provide a meaningful description. At the very least your description text must explain:

  • What is the purpose of the PR
  • What work was performed. The level of detail needed here depends on the part of the code you're working in, but in general this should be detailed enough to serve as a guide as the reviewer is reading through the diff.
  • Links to the GitHub issue(s) that merging this PR will resolve. These must be in the form "Closes #XXX" where XXX is the issue number. Each issue must be alone on a separate line, with its own "Closes " prefix. This is required so that GitHub will automatically close those issues when the PR is merged.
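For illustration, a minimal description for a hypothetical PR meeting these requirements might read:

Implement frobber support

Adds the frobber module and wires it into the gonkolator refresh path.
Most of the diff is new code in the frobber module; changes elsewhere
are limited to the new hook points.

Closes #100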

@anelson's approach is to use the squashed commit message as the PR description. If you've squashed commits down to one commit before creating the PR, GitHub will automatically use that commit as the description.

Preparing a branch for a PR

Before submitting a PR for review, you must squash all commits into a single commit with a meaningful commit message. If you follow @anelson's approach described above, you'll do this before you open the PR so that the PR description is automatically the same as the commit message. If not, you'll have to compose the description text separately, which seems wasteful.

Squashing commits serves two purposes:

  • It makes it easier to know which of the commits added after the PR is created need to be re-reviewed.
  • It is a reminder to the developer to write a single, coherent, comprehensive commit message describing the work done.

If you do not squash your commits before you request a review, the reviewer is within their rights to refuse to perform the review until you fix this mistake. If your reviewer is @anelson you're practically guaranteed to have your review rejected.

The best way to perform this squashing, and clean up your commit message(s) at the same time, is to use interactive rebase (git rebase --interactive). If you're not familiar with this, read up on it before your first attempt. It can be intimidating the first time or two, particularly if your git-fu is not strong.
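As a sketch, the typical flow before opening a PR (assuming your branch is based on master) looks like this:

git rebase --interactive master
# In the editor that opens, keep `pick` on the first commit and change the
# rest to `squash` (or `fixup`), then write one coherent commit message.
git push --force-with-lease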

Note that as with most rules, there's an exception. In some cases, organizing your changes into multiple commits serves to better present your changes to the reviewer. For example, a big refactor that happens in multiple steps might make sense to present with one commit for each of the steps, letting the reviewer review a step at a time or collapse those steps into the final result. Use your best judgment as to whether your particular changes should be presented this way. Good faith deviation from the squash rule will not be rejected.

Rebasing

Up until the time when you have requested a review, you can rebase whenever you want. After you have requested a review, you should never rebase, unless you must do so to resolve a conflict with master. The reason why rebasing after a review starts is considered bad (to say nothing of rude) is because after a rebase, GitHub's PR UI will no longer allow a reviewer to look only at changes made since the last review. This is hugely disruptive, particularly for large PRs with a lot of changes.

That said, if your branch has a conflict with master, GitHub cannot build your PR (because PR builds are done by doing a local trial merge with master first), and you will not be able to merge even if your PR is approved. In this case you have no choice but to rebase and force push, but realize that your reviewing colleagues may secretly (or not-so-secretly) hate you as a result.

Making changes based on PR comments

Normally you'll submit an initial PR with one squashed commit for review, and then as you get feedback in the review you'll need to make changes. These changes should NOT be squashed or force-pushed. Each round of changes should be in a separate commit, with a meaningful summary of what changed.

This is really important for the reviewers. See the above section "Rebasing" about why this is so important.

Merging to master

When merging to master after a successful approved PR, always use a fast-forward merge and always squash the commit. Most of our repositories are configured to require this, such that GitHub won't allow you to perform the merge incorrectly.

If you have followed the guidance above about squashing and rebasing for your PR this will be automatic, but even if for some reason you did not, this merge guidance applies. Specifically, do not put extraneous merge commits into master. The purpose of this principle is to ensure the change log on master is a log of features implemented, bugs fixed, etc. Not arcane minutiae reflecting which team members merged which branches into which other branches during development.

Note that by default GitHub will generate a commit message that is a concatenation of all of your commits since the initial PR. This will include commits containing changes based on PR feedback. Most of the time, there's no informational value in these commits, so you should edit them out of the final commit message before merging to master. Obviously if any of these commits make material changes to the resulting code, that guidance doesn't apply, and you should include all relevant details about those changes in the final commit.

Commit messages

Adhere to the standard git commit message convention when writing commit messages. This is absolutely required for the commits merging PRs into master, but it is a good practice for all commit messages. Even if you are pretty sure a commit message will end up getting squashed into another message, it's quite useful when doing your interactive rebase to be able to scan the list of commits and understand immediately which one does what.
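For example, a commit message following this convention (details invented for illustration) might look like:

Fix gonkolator taint on refresh

The gonkolator was refreshed before the frobber released its lock,
leaving the gonkolator tainted. Take the lock before starting the
refresh.

Closes #200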

Never break the build on merge

You may do whatever you want in your branches during development, including breaking the build in spectacular ways. However, a branch cannot be merged into master until the CI server has both successfully built the branch, and successfully built the branch with an experimental merge into master. You don't have to do anything to verify this; the CI system pushes the results of both builds into the PR. All you have to do is make sure you don't merge until these are both green.

This is important not only to avoid disrupting the work of your colleagues if they merge your changes from master into their feature branches, but it also allows workflows using git bisect to search the history to identify when a particular bug was introduced.

Vagrant Boxes

Our use of Vagrant boxes to perform builds has a happy side-effect: you can easily reproduce platform-specific build errors on your local system by running the exact same Vagrant box that CI runs.

The README in the devboxes repo has detailed instructions.

In particular, you must be in a Linux environment, and KVM must be installed and compatible with your hardware. Note that KVM will not work in WSL2 because WSL 2 is already a VM and you can't run KVM inside another hypervisor. For the same reason this won't work on EC2 instances.

A summary:

In the elastio/elastio repo, in the cli/vagrantboxes/ directory, run vagrant up BOXNAME to start a Vagrant box called BOXNAME. To see all the possible box names, run vagrant status. As of this writing the output is:

Current machine states:

amazon2-amd64-build             shutoff (libvirt)
amazon2-amd64-test              not created (libvirt)
debian8-amd64-build             not created (libvirt)
debian8-amd64-test              not created (libvirt)
debian9-amd64-build             not created (libvirt)
debian9-amd64-test              not created (libvirt)
debian10-amd64-build            running (libvirt)
debian10-amd64-test             not created (libvirt)
debian11-amd64-build            not created (libvirt)
debian11-amd64-test             not created (libvirt)
centos7-amd64-build             not created (libvirt)
centos7-amd64-test              not created (libvirt)
centos8-amd64-build             not created (libvirt)
centos8-amd64-test              not created (libvirt)
fedora31-amd64-build            not created (libvirt)
fedora31-amd64-test             not created (libvirt)
fedora34-amd64-build            not created (libvirt)
fedora34-amd64-test             not created (libvirt)
fedora35-amd64-build            not created (libvirt)
fedora35-amd64-test             not created (libvirt)
ubuntu2004-amd64-build          not created (libvirt)
ubuntu2004-amd64-test           not created (libvirt)
windows2019-amd64-build         not created (libvirt)
windows2019-amd64-test          not created (libvirt)
windows2019-amd64-build-desktop not created (libvirt)
windows2019-amd64-test-desktop  not created (libvirt)

This environment represents multiple VMs. The VMs are all listed
above with their current state. For more information about a specific
VM, run `vagrant status NAME`.

The -build boxes are used for building and have all of our development tools and dependencies installed. The -test boxes are more or less clean, and are intended for testing release installation.

Running vagrant up will download the box from object storage to your local system. This normally doesn't take very long, but the Windows boxes are very heavy and might take hours.

Once vagrant up has completed, you have the VM running locally. You can use vagrant ssh to open a shell in a given box, or vagrant rsync to sync your local source tree into the box. Sadly there's a very nasty Windows bug that Hashicorp are ignoring, where rsync won't work after the first sync.
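Putting it together, a typical session with one of the boxes listed above might look like this (pick the box name that matches your target):

vagrant up debian10-amd64-build      # download (first time only) and boot the box
vagrant rsync debian10-amd64-build   # sync your local source tree into the box
vagrant ssh debian10-amd64-build     # open a shell inside the box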

Coding Conventions

Each language has its own conventions and idioms. However some principles are universal.

In general, code should be written with the goal of maximizing readability and clarity. Usually this means following the same style as the existing code, if any. Coding conventions exist to resolve some ambiguities with that guidance.

If the language supports a strict mode or an equivalent of -Wall which reports all warnings, this mode should be used. In the strictest mode, the code should compile without warnings. If there are warnings that are in fact intentional, they should be suppressed only where they occur with a #pragma or some other construct, and not disabled categorically.

If the compiler supports a mode that treats warnings as errors and fails the build, that mode should be enabled.
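In Rust, for example, this is commonly achieved by promoting all warnings to errors when building in CI:

RUSTFLAGS="-D warnings" cargo build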

Specific guidance in this book notwithstanding, code should follow the conventions applicable to that code. For example, a Linux device driver and a Mac library might both be written in C or C++, but it’s likely there are very different style conventions used by idiomatic code on those respective platforms.

Some guidance for specific languages follows:

Rust

See the Rust coding conventions chapter for details.

C

The main part of the C code in our project is the elastio-snap kernel module. It is good practice for kernel modules to use the Linux kernel coding style, and we stick to this practice.

Our repo is a fork of Dattobd, which already follows this code style. However, we will not change existing code lines that don't match the style just for the sake of style; we only do so when the functionality changes.

The same rules apply to any user-space libraries and utilities developed in C.

C++

C++ developers, to fill out

Go

Our team’s Go coding standards are simple and easily applied. All .go files are to be renamed .rs and modified until they compile cleanly under the Rust compiler.

Javascript

Thankfully Javascript is no longer needed, as Rust and C++ can be compiled to WebAssembly :) If for some reason WebAssembly isn’t an option, then TypeScript will have to do.

Tools we Use

We use a variety of tools and services in Elastio. Depending on your role you may need to use some or all of them:

  • Chat: Slack
  • Hosting: Hetzner and AWS
  • Code coverage: codecov.io
  • Dashboards, visualizing metrics: Grafana Cloud (elastio.grafana.net)
  • Centralized Logging: Loki (hosted on Grafana Cloud)
  • Metrics: VictoriaMetrics self-hosted in Hetzner
  • Project Planning: ZenHub
  • Centralized error reporting: Sentry.io
  • Email/Calendar: Google Workspace

Generally, deploying from master or a release candidate is no different from deploying a release in terms of the steps that need to be taken.

Note that distros built from the release, release-candidate, and nightly branches are placed on the https://repo.elastio.com server, whereas distros built from development branches (master and feature branches) are stored on the https://repo.elastio.us server. You should update the download link according to your desired branch.

Deploy from master

Linux

  1. Deploying the Elastio CLI
sudo /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/elastio/elastio-stack/master/scripts/install-elastio.sh) $0 $1" -- -b master

-b here stands for branch

  2. Deploy / Update the Red Stack (if needed)

If you are outside the target AWS account, run aws configure to connect to the AWS account where the Red Stack will be deployed.

Set the environment variable to pull artifacts from master:

export ELASTIO_ARTIFACTS_SOURCE=ci:master

Then run

elastio stack deploy --tenant-name dev.staging.elastio.us

--tenant-name for internal use is dev.staging.elastio.us.

  3. Create a vault
elastio vault deploy <example_name>

Please note: the vault deploy command will require an ECS service-linked role in the account. It's created per account and may already be present. If not, you need to run

aws iam create-service-linked-role --aws-service-name ecs.amazonaws.com

Windows

  1. Deploying the Elastio CLI and drivers
Set-ExecutionPolicy Bypass -Scope Process -Force; iex ((New-Object System.Net.WebClient).DownloadString('https://repo.elastio.us/master/cli/install.ps1'))
Set-ExecutionPolicy Bypass -Scope Process -Force; iex ((New-Object System.Net.WebClient).DownloadString('https://repo.elastio.us/master/drivers/install.ps1'))
  2. Deploy / Update the Red Stack (if needed)

If you are outside the target AWS account, run aws configure to connect to the AWS account where the Red Stack will be deployed.

Set the environment variable to pull artifacts from master:

$env:ELASTIO_ARTIFACTS_SOURCE="ci:master"

Then run

elastio stack deploy --tenant-name dev.staging.elastio.us

--tenant-name for internal use is dev.staging.elastio.us or your own.

  3. Create a vault
elastio vault deploy <example_name>

Please note: the vault deploy command will require an ECS service-linked role in the account. It's created per account and may already be present. If not, you need to run

aws iam create-service-linked-role --aws-service-name ecs.amazonaws.com

Deploy from release-candidate

Linux

  1. Deploying the Elastio CLI
sudo /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/elastio/elastio-stack/master/scripts/install-elastio.sh) $0 $1" -- -b release-candidate

-b here stands for branch

  2. Deploy / Update the Red Stack (if needed)

If you are outside the target AWS account, run aws configure to connect to the AWS account where the Red Stack will be deployed.

Set the environment variable to pull artifacts from the release-candidate branch:

export ELASTIO_ARTIFACTS_SOURCE=release-candidate

Then run

elastio stack deploy --tenant-name dev.staging.elastio.us

--tenant-name for internal use is dev.staging.elastio.us.

  3. Create a vault
elastio vault deploy <example_name>

Please note: the vault deploy command will require an ECS service-linked role in the account. It's created per account and may already be present. If not, you need to run

aws iam create-service-linked-role --aws-service-name ecs.amazonaws.com

Windows

  1. Deploying the Elastio CLI and drivers
Set-ExecutionPolicy Bypass -Scope Process -Force; iex ((New-Object System.Net.WebClient).DownloadString('https://repo.elastio.com/release-candidate/cli/install.ps1'))
Set-ExecutionPolicy Bypass -Scope Process -Force; iex ((New-Object System.Net.WebClient).DownloadString('https://repo.elastio.com/release-candidate/drivers/install.ps1'))
  2. Deploy / Update the Red Stack (if needed)

If you are outside the target AWS account, run aws configure to connect to the AWS account where the Red Stack will be deployed.

Set the environment variable to pull artifacts from the release-candidate branch:

$env:ELASTIO_ARTIFACTS_SOURCE="release-candidate"

Then run

elastio stack deploy --tenant-name dev.staging.elastio.us

--tenant-name for internal use is dev.staging.elastio.us or your own.

  3. Create a vault
elastio vault deploy <example_name>

Please note: the vault deploy command will require an ECS service-linked role in the account. It's created per account and may already be present. If not, you need to run

aws iam create-service-linked-role --aws-service-name ecs.amazonaws.com

Setup

Private registry cloudsmith.io

cloudsmith.io is a service that provides functionality similar to crates.io but is used for publishing crates to a private registry.

Some of our crates are published there to prevent them from being publicly available.

Getting access to crates from cloudsmith

Security is important! Everyone on our team needs to be invited to the private repository at cloudsmith.io. Each team member has their own credentials for accessing build dependencies stored on cloudsmith.io.

Use Elastio's link to log in to cloudsmith.io.

After signing in, you have to enable 2FA; otherwise, you won't be able to view an entitlement token.

There are two types of configuration for access to the registry, each with its own security settings. Actively used repositories employ the sparse protocol, while old and legacy repositories are still configured to use the old, git-based protocol. To determine which protocol a particular repository uses, check the .cargo/config.toml file. If you see a registry URL starting with sparse+https, the new sparse protocol is in use.

Example of configuration of the private registry:

[registries.elastio-private]
credential-provider = "cargo:token"
index               = "sparse+https://cargo.cloudsmith.io/elastio/private/"

Authentication with the sparse protocol

To authenticate with the modern sparse protocol, you can use either an API key or an entitlement token.

An API key can be obtained from your user settings page on cloudsmith.io.

To use the API key for authentication, define the CARGO_REGISTRIES_ELASTIO_PRIVATE_TOKEN environment variable:

export CARGO_REGISTRIES_ELASTIO_PRIVATE_TOKEN=API_KEY

or run cargo login --registry elastio-private in the project's top directory and provide the token.

cd elastio/
cargo login --registry elastio-private

Note: If you use the cargo login command, your token will be saved in the $CARGO_HOME/credentials.toml file in clear text!


To obtain the Entitlement token, go to the "Repositories" page, then navigate to the elastio/private repo. In the left sidebar, find Entitlement Tokens, copy the secret key (TOKEN) from the line with your account name, and use it.

To use the entitlement token, follow the same process, except that instead of the API_KEY value, the string Token TOKEN should be used:

export CARGO_REGISTRIES_ELASTIO_PRIVATE_TOKEN='Token XXXXXXXXXXX'

# or

cargo login --registry elastio-private
    Updating `elastio-private` index
please paste the token found on https://cargo.cloudsmith.io/elastio/private/me below
Token XXXXXXXXXXXXXX  <<< - Your input is here
       Login token for `elastio-private` saved

Authentication with the legacy protocol

Authentication using the legacy protocol can be performed with the Entitlement token only!

The next step is to set up git credentials using TOKEN:

git config --global credential.helper store
echo "https://token:TOKEN@dl.cloudsmith.io" >> ~/.git-credentials

Congratulations! You have access to use cloudsmith.io.

Extra info

For additional information, please refer to:

  • https://help.cloudsmith.io/docs/cargo-registry#registry-setup
  • https://doc.rust-lang.org/cargo/reference/config.html#credentials

Coding Conventions

Basis

Unless otherwise specified below, our Rust coding convention is identical to the following resources:

Note that some of the guidance in the API Guidelines document is specifically applicable to the public APIs of libraries. This doesn't necessarily apply to all of the code we write, although if we deviate from the API guidelines it should be deliberate and for a good reason.

Guiding Principle

The purpose of having a style guide and conventions is to produce consistent, readable, clear, and maintainable code. There will be situations in which a strict adherence to style conventions will be counter-productive to these objectives. Fortunately, Elastio team members are highly intelligent, educated, rational people who are capable of making subjective judgement calls based on the best information available to them.

Simply put, use your best judgement. If a rule seems to make code worse, then don't follow it (but be prepared to defend your decision in code review, and to reverse it if you're wrong).

Basics

All code should be formatted with the rustfmt utility with the default settings. This can be made automatic in most editors. Code that isn’t formatted this way will fail the CI build.

Code should compile without any warnings, and CI builds will treat warnings as errors.

All code should pass clippy static analysis with the default lints enabled. Failure to pass any of these will fail the CI build. If a particular clippy lint is not applicable, it should be disabled with the finest granularity possible (e.g., at the variable, statement, or function level, or at most the module level; never at the crate level).
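For instance, a lint that doesn't apply can be allowed at the single function it fires on, with a comment justifying the exception (the function here is invented for illustration):

// This mirrors a C API we wrap, so the argument count is fixed
#[allow(clippy::too_many_arguments)]
fn raw_frobulate(_a: u32, _b: u32, _c: u32, _d: u32, _e: u32, _f: u32, _g: u32, _h: u32) {
    // ...
}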

Using use

use is used to import modules or specific types into the current scope.

For large code files this can get complicated, so the goal of our conventions is to ensure we import other types in a way that maximizes readability and clarity.

Grouping

There should be one single block of use statements at the beginning of the module to which they apply. There should be no white space between the use statements. They should be sorted alphabetically, though as long as you adhere to the preceding rules rustfmt will handle sorting them automatically.
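For illustration, a conforming import block (crate names are hypothetical) looks like:

use anyhow::Context;
use serde::Serialize;
use std::fmt;
use widget::bar;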

Importing a module

In the majority of cases, it's preferable to import a module, not the item(s) in that module.

For example, if you want to call foo in the bar module in the widget crate, you should do this:

use widget::bar;

// ...

bar::foo();

This makes it very clear that foo is not an item in the current module but is imported from elsewhere.

Importing a type

Exceptions to the above rule do exist. Remember that the goal is clarity. If you're implementing Debug, it's okay to use std::fmt::Debug; we all know what Debug is and won't need to scratch our heads to figure out what it means.

Another exception is types that are used frequently throughout a particular crate. For example in the RocksDB wrapper crate, there's a trait BinaryStr that is used all over in various modules. There's not much value in qualifying it as ops::BinaryStr, as its origin and purpose are readily apparent to anyone working in the code.

A third and very common exception is when calling methods on a trait. Rust requires that the trait be in scope in order to call methods on it, and in these cases there is no reasonable alternative but to explicitly use the trait itself.
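A minimal sketch of the trait-method case:

use std::io::Read;

fn slurp(mut r: impl Read) -> std::io::Result<String> {
    let mut s = String::new();
    // read_to_string is a method of the Read trait; calling it requires the
    // trait to be in scope, so importing the trait directly is justified
    r.read_to_string(&mut s)?;
    Ok(s)
}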

Lastly, it's often the case that the name of a type is sufficient to make it clear where it's from. For example, Serialize and Deserialize are obviously from the serde crate. MyCrateError is obviously an error type from my_crate, and does not need to be qualified as my_crate::MyCrateError for clarity. This is a matter of good judgement and may be questioned in code review.

Wildcard imports

Wildcard imports are forbidden except under specific circumstances:

  1. When a crate exposes a prelude module for convenience. An example of this is the Rayon crate, which has rayon::prelude specifically to provide a single place to import all of the helper traits and extensions to existing traits. Especially if this module is called prelude it's apparent by convention what it's for, and thus a wildcard import of that module can be used whenever it makes sense.

  2. When bringing into scope types whose actual location are obvious from context or not relevant to understanding the code. This is obviously a very subjective standard, and you should be prepared to defend this assessment in a code review if you rely on it for justification of a wildcard import.

  3. In a lib.rs file when re-exporting items from private modules. It's almost always good practice to organize code into logical modules, even if those modules themselves are not public. In that case one must re-export the relevant types, and there's usually not any descriptive or clarifying value in explicitly listing each of the items one is re-exporting.
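As a sketch of the third case, with invented module names:

// lib.rs: the modules stay private, but their items are re-exported
// at the crate root
mod config;
mod frobber;

pub use config::*;
pub use frobber::*;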

In general, wildcard imports obscure the origin of the types being used and make it harder to understand the code at a glance. Especially since tools like rust-analyzer and IntelliJ can automatically generate import statements for specific types as they are used, there's no excuse for lazily pulling in everything in a module.

Error Handling

How to represent, handle, and raise errors is a big topic. See the Errors chapter.

Private documentation

We host docs for our private crates here.

Due to some limitations of docs.rs for private registries, the search bar doesn't work. In order to search crates you need to use the Releases button at the top right corner and find crates using Ctrl+F in the browser. One day we will make a PR into docs.rs to fix this issue. Also, docs for some crates may not render because currently we move files generated by build.rs out of $OUT_DIR. We plan to fix that too, but for now you need to build docs for such crates manually.

Tutorials

There are also some useful tutorials and write-ups made by Elastio members. Check out:

Guide to Commonly Used Crates

Cheburashka

Cheburashka is the observability framework used by our other crates. It provides standardized logging, metrics, labeling, and (in the future) tracing capabilities, mostly by re-exporting third-party crates. See the Cheburashka README and the Rust docs for more information.

Matroskin

Matroskin is the Rust gRPC framework which we use to generate Rust code from protobuf specifications. Internally it mostly re-uses tonic and prost, with some additional customizations specific to our use cases. See the README in matroskin/ for more details.

Yozhik

Yozhik provides a high-level Rust abstraction over our Windows and Linux change tracking drivers. In both cases, these drivers allow us to take point-in-time snapshots of block devices and quickly determine which blocks on the device changed since our last snapshot. This is fundamental to our ability to back up systems in cases where agentless snapshots are not possible.

Elastio-agentless

Elastio-agentless is the equivalent of Yozhik for systems that provide APIs to take snapshots of disks and read changed blocks from them without running any code on the system being snapped. This includes Amazon EBS volumes, Azure Page Blobs, and VMware ESXi disks.

Binary-Ids

Binary-ids is a library that provides macros and interfaces for the creation of ID types that consist of opaque fixed-length byte sequences. This was originally conceived as a way to work with SHA hashes more conveniently, but now also supports ULIDs.

The Rust docs provide much more information about how this works.

Kolobok

Kolobok is the crate that implements the file and block device ingest. That is, it takes input from either Yozhik or Elastio-agentless, does a lot of computation, and in the end ingests the data on the source files/disks into Scalez-Stor.

See the Rust docs for more.

Elastio-rocks

We use RocksDB internally for metadata storage in ScaleZ. The existing rust-rocksdb crate was forked and slightly modified by Elastio to make some internal fields public. On top of that we build elastio-rocks, which provides a lot of functionality missing from the original rust-rocksdb: an idiomatic Rust interface, metrics, logging, async IO, etc. It's extensively documented in the Rust code.

Scalez-KV

Scalez-KV is a Rust native key/value store abstraction layered on top of Elastio-rocks. It lets service developers define a storage layer in terms of tables, where each table is a Rust struct. This also defines traits with which arbitrary Rust types can be used as keys or values, with high performance serialization.

Scalez-Stor

Scalez-Stor is the storage service and API responsible for storing all backup system metadata in RocksDB. It's built on top of Scalez-KV and provides an even higher-level interface, as well as a gRPC API which other components of the system use to talk to it.

Scalez-Stor-cli

The executable is called s0. This is primarily an internal testing tool, which lets us run ingest benchmarks and run the scalez-stor server whenever we need it.

Elastio-cli

The official command line interface to all Elastio code. The executable is called elastio on Linux, and elastio.exe on Windows. With this we can invoke yozhik, elastio-agentless, scalez-stor, and more.

Xtask

This is a dev CLI that is used only during development and not intended for production distribution. You can invoke it via the cargo alias. Use it to run codegen, install the code formatting and generation git pre-commit hook, etc. Run cargo xtask to see documentation on the available scripts.

Errors

Rust's try operator (?) and support for sum types (e.g., Result, which can be either Ok or Err) make for powerful and easy-to-reason-about error handling. Unfortunately, as of this writing in late 2020, there is no consensus solution for representing and reporting errors in crates. The existing std::result::Result and std::error::Error types provide the bare bones for representing successful and failed operations, but they're not adequate on their own.

In the first year or so of Elastio, we used the snafu crate, which is an opinionated implementation of error handling in Rust that seemed to work well. Many of our crates still use Snafu today, because there's no reason to migrate. However, recently momentum has been building behind two crates which together are fast becoming the de facto Right Way(tm) for doing things: anyhow and thiserror.

anyhow provides a universal Error type which can wrap any std::error::Error implementation and provide some useful operations. It's the equivalent of throws Exception in Java. We use anyhow when building CLIs that need to be able to handle various kinds of errors from different crates and have no need to wrap those errors in a specific error type. In some cases we also use anyhow when writing tests, where the alternative would be Box<dyn Error>, which is nasty.

thiserror is used at Elastio when a library crate needs to expose an error enum as part of its API. Where historically we've used snafu, newer crates (and any crates created in the future) use thiserror to build a CrateNameError enum.

This chapter describes the best practices we've evolved with these crates, and should be followed unless there's a good reason to do something different.

Official Docs

Most of the details of how to use anyhow and thiserror are covered in their respective docs. None of that will be repeated here, so before reading the rest of this chapter make sure you've reviewed the official docs for both crates and have a good understanding of how they work in general (and, in particular, how anyhow differs from thiserror and under what circumstances one should use one versus the other).

Legacy Snafu

As noted above, many of our biggest and most important crates were written before thiserror was clearly the way forward for library error reporting. Those crates use snafu instead. There's nothing wrong with snafu, and it's similar in design to thiserror in many ways. Those crates that use snafu will continue to do so, and if you find yourself needing to add or modify the error variants in those crates, you too must use snafu. Porting existing error handling to thiserror without a compelling reason is not a good use of engineering time.

Having said that, starting a new crate in 2020 or beyond and using snafu to build the error type is also not a good use of engineering time.

Error variant naming

Each variant of the error enum is obviously an error, because it's a variant of the error enum. Thus, the word Error should not appear in the names of the variants, as it's redundant. So IoError is a bad name for a variant; Io is good.

Error messages

thiserror

When defining an error type with thiserror, it's easy to define what the error message should be for an error:

use thiserror::Error;

#[derive(Error, Debug)]
pub enum Error {
    #[error("IO error")]
    Io {
        source: std::io::Error,
    },
}

You might be tempted to try to make another error type's message part of your error, like this:

use thiserror::Error;

#[derive(Error, Debug)]
pub enum Error {
    // WRONG, DON'T DO THIS
    #[error("IO error: {source}")]
    Io {
        source: std::io::Error,
    },
}

You might even find this pattern in our code still. However, the latest best practice is to report the error which caused a particular error by returning it from the std::error::Error::source() function, which for thiserror means a field named source or marked with the #[source] or #[from] attributes. Why?

Because it's often not at all useful to include the error message of an inner error. Maybe that error is itself just a higher-level error type, and the real cause of the error is nested three or more errors deep. Or maybe you need to know the details of all of the errors in the chain from root cause on up to your error type in order to understand what really happened.

Therefore, instead of forcing the nested error's message into your error message, you should rely on crates like anyhow (or color-eyre) to pretty-print error types at the point where they are communicated to users (printed to the console in the case of a CLI, or written to the log in the case of a server).

anyhow

When reporting errors with anyhow, the principle is the same but the mechanics are slightly different. Since anyhow::Error is a universal container for any error type, you do not use anyhow to define strongly typed errors. Instead you wrap arbitrary error types in anyhow. But sometimes you have a situation where you have an error e, and you want to report it as an anyhow::Error but with some additional context to help clarify the error. This is the wrong way:

fn main() {
    use anyhow::anyhow;

    // DO NOT DO THIS
    let e = std::io::Error::last_os_error();
    // Assume `e` contains an error you want to report
    anyhow!(format!("Got an error while trying to frobulate the gonkolator: {}", e));
}

This has the same problem as the thiserror example above. You're losing all of the information in e other than its message. Maybe it had its own source with valuable context, or a backtrace that would have clarified where this happened. Instead you should use anyhow to wrap the error in some context:

fn main() {
    // This is the right way
    let e = std::io::Error::last_os_error();
    // Assume `e` contains an error you want to report
    anyhow::Error::new(e).context("Got an error while trying to frobulate the gonkolator");
}

Using anyhow context or with_context

The above example uses context on an anyhow::Error. anyhow also has a Context trait which adds context and with_context to arbitrary Result types, to make it easier to wrap possible errors in context information.

Be advised that in this case you should avoid allocating strings when calling context. For example:

use anyhow::Context;

fn get_username() -> String {
    // ...
    "foo".to_owned()
}

fn get_host() -> String {
    // ...
    "foo".to_owned()
}

fn main() -> anyhow::Result<()> {
    // WRONG: the message string is built even when File::create succeeds
    std::fs::File::create("/tmp/f00")
        .context(format!("Error creating file foo for user {} on host {}", get_username(), get_host()))?;

    // RIGHT: the closure only runs if File::create fails
    std::fs::File::create("/tmp/f00")
        .with_context(|| format!("Error creating file foo for user {} on host {}", get_username(), get_host()))?;

    Ok(())
}

By passing a closure to with_context, you defer the evaluation of format! so that it only runs if File::create actually fails. On success you skip all of this computation, the associated heap allocation, and the calls to get_username() and get_host().

error module

Each library crate should have an error module. This should define a thiserror-based error enum named CrateNameError, where CrateName is the pascal case representation of the crate's name.

Many Rust crates and the Rust std lib use an error representation called Error, but this leads to problems with code clarity when dealing with multiple different crates' error types.

The error module should also define a type alias called Result, which aliases to std::result::Result with the default error type set to the crate's error type, e.g.:

#![allow(unused)]
fn main() {
// In `error.rs`
pub enum CrateNameError { /* ... */ }

pub type Result<T, E = CrateNameError> = std::result::Result<T, E>;
}

If necessary (and only if necessary!) the root of each library crate may expose the error module publicly. As of now the only reason this would be necessary is to expose the Snafu-generated context selectors to other crates, which is only needed when macros generate code that needs to use those error types. That's an edge case; in general error should be private.

In all cases, the error enum and Result type alias should be re-exported from the root of the crate, e.g.:

// In lib.rs

// The `error` module should not be public except for the edge case described above
mod error;

// but the Error type and the Result type alias should be part of the public API in the root module
pub use error::{Result, CrateNameError};

// And all funcs should use this public re-export

// Note this is using crate::Result, not crate::error::Result
pub fn some_public_func() -> Result<()> {
  todo!()
}

Other modules in the crate should use crate::CrateNameError, NOT crate::error::CrateNameError. This keeps crate code consistent with external callers, and it's also less typing.
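As a brief sketch of what that looks like from another module in the same crate (the module and function names here are invented for illustration):

// In some other module of the crate, e.g. `src/frobulator.rs`

// Use the root re-exports, not `crate::error::...`
use crate::{CrateNameError, Result};

pub fn frobulate() -> Result<()> {
    // ... on failure, return a `CrateNameError` variant as appropriate
    Ok(())
}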

Note that when using Snafu, Snafu's context selectors should NOT be re-exported this way; when they are referenced from other modules in the crate, those modules should use crate::error and refer to the context selectors as error::SomeErrorKind.

Using ensure

Anyhow provides the ensure! macro to test a condition and fail with an error if it is false. Prefer it to explicit if or match expressions unless those genuinely add clarity.
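A minimal sketch of the pattern (the function and message are invented for illustration):

#![allow(unused)]
fn main() {
use anyhow::ensure;

fn set_frobulation_level(level: u32) -> anyhow::Result<()> {
    // Equivalent to an explicit `if level > 100 { return Err(...) }`, but more compact
    ensure!(level <= 100, "frobulation level must be at most 100, got {}", level);

    Ok(())
}
}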

Observability

"Observability" is the generic term we use to describe all the characteristics of an application that help us to observe its behavior in production, reason about problems, develop hypotheses, and verify fixes. Broadly, the specific features which together provide observability are:

  • Logging
  • Tracing
  • Metrics

Each of these areas is a topic unto itself. In addition, the concept of "labels" is important, as they are used for logging and metrics (and probably in the future for tracing as well).

All observability functionality in Rust is contained in the cheburashka crate. Probably 95% of this functionality is just the re-exporting of various other third-party crates, with the remaining 5% being Elastio-specific helpers and defaults that help us ensure all of our various projects have the same behavior.

Logging

Probably the most important, and most basic, aspect of observability is logging. If there is no logging, we have no visibility into the behavior of our application in production.

We use tracing from the Tokio project as the basis for our logging; however, internally we avoid taking explicit dependencies on tracing and instead re-export the tracing crate from cheburashka as cheburashka::logging. Check out the RustDoc comments on cheburashka::logging for a comprehensive review of the functionality available. In this handbook we'll just call out a few key features, and also Elastio internal conventions.

Log levels

Like most logging frameworks, tracing and therefore cheburashka has the concept of log levels. They are:

  • trace
  • debug
  • info
  • warn
  • error

There is an option to have trace events compiled into the binary only in debug builds, but we don't use that.

By default, log levels info and above (meaning also warn and error) are visible on the console when running code that's instrumented with log events. This can be controlled by setting a different filter expression in the ELASTIO_LOG environment variable, or by programmatically configuring logging with a call to cheburashka::logging::config_cli or cheburashka::logging::config_server.
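Since cheburashka re-exports tracing, the filter expression presumably follows tracing's EnvFilter directive syntax, though you should verify that against the cheburashka::logging docs. A hypothetical example:

#![allow(unused)]
fn main() {
// Hypothetical: debug and above everywhere, but only warn and above from the `hyper` crate.
// The directive syntax here is an assumption based on tracing's EnvFilter.
std::env::set_var("ELASTIO_LOG", "debug,hyper=warn");
}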

There can be no absolute rules about which log level is appropriate, but we do have a few conventions:

  • Use error only for errors that might be of interest to humans. For example, if there's some code that retries after a failed network operation, it would not make sense to log the initial failure as error when the code will automatically retry. In that case, you should log the retriable errors as debug, and only if the retry limit is reached or the error is determined to not be retryable would you log something at the error level (see the sketch after this list).

  • warn should indicate an event that is out of the ordinary but not necessarily indicative of a failure. For example, if an optional feature is unavailable because of insufficient permissions, that might be information you communicate as a warning. Another example would be an error for which there is a known fall-back, but where it's important to communicate that the error occurred.

  • info is for logging information that occurs on the "happy" path, which is at the level of detail that a normal user would be interested in seeing (yes, "normal user" here is entirely subjective). For example, if a backup of the C:\ drive completed successfully, that might be a useful info message, but you would not want to report every successful block read at the info level because that will spam the log and make it impossible to navigate. To put it another way, if the log level is info, the log output should be enough to tell what is going on at a high level, without scrolling through pages and pages of repetitive log events.

    In our log browsing tools, we will typically start with a filter of info or higher to get a quick overview of what's going on in that log. Bear this in mind before you log anything at info.

  • debug is easily the most common log level for us. It is everything that doesn't fit in any of the other levels. While you should still take care not to generate thousands of debug log events for some small operation, there are fewer restrictions on verbosity at this level.

  • trace is generally reserved for excessively verbose logging which you would only conceivably want to enable for a small region of the code at a time, when debugging a specific problem. To continue the backup example above, you might log every block offset you read from C:\ at the trace level.
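To make the error-versus-debug convention above concrete, here is a minimal sketch of the retry pattern (try_fetch and the retry loop are hypothetical; log_error is the cheburashka helper covered later in this chapter):

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasty hack
use cheburashka::logging::prelude::*;

fn try_fetch() -> Result<(), std::io::Error> {
    // hypothetical network operation that might fail
    Ok(())
}

fn fetch_with_retries(max_attempts: u32) -> Result<(), std::io::Error> {
    for attempt in 1..=max_attempts {
        match try_fetch() {
            Ok(()) => return Ok(()),
            Err(e) if attempt < max_attempts => {
                // Retriable failure: a human doesn't need to see this, so log at debug
                debug!(err = log_error(&e), attempt, "Fetch failed; retrying");
            }
            Err(e) => {
                // Retry budget exhausted: now it's a real failure worth an error event
                error!(err = log_error(&e), "Fetch failed after all retries");
                return Err(e);
            }
        }
    }

    unreachable!("max_attempts must be at least 1")
}
}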

Log output on the console

When writing a CLI, you often need to print some output for the user on stdout. Typically in Rust you'd use println! for this, but the problem is that println! output will not also be captured in the log, so you really need a way to generate log messages that are also printed to the console.

In cheburashka we have versions of all of the log event macros (event!, debug!, info!, etc) which start with print_ and always generate log messages with the target console. If cheburashka logging is configured with config_cli, the default behavior is to display only console events, and only those at info or higher. Of course both of those can be overridden.

This means that where you'd normally reach for println! to output something, you should use print_info! (or some other log level as appropriate) instead. By default this will render on the console without time stamps or other decoration, so it should look and feel like println! output, except that if full log output is enabled these log events will take on the form of all other log output. It also means that when we're running in a server with logs output to JSON, this will include all of your print_* events as regular log events with all of the associated metadata.

You should never use println! unless there's a compelling reason why the print_ macros don't suit your use case.

use doc::cheburashka; // Nasty hack
use cheburashka::logging::prelude::*;

fn main() {
    // Don't do this!  WRONG!
    println!("Something happened!");

    // Do this instead
    print_info!("Something happened!");
}

Structured Logging

In many other languages, the way one composes a log message is by formatting a string with placeholders like %s or {} standing for variables, which are then expanded into the value of the variable at runtime. So if you're logging the number of widgets you frobulated, you might do:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasty hack
use cheburashka::logging::prelude::*;

let widgets = vec![1];
// Old way of doing things:
info!("Frobulated {} widgets", widgets.len());
}

This is okay, but it has a few problems. First, what if we want to see how many occurrences of this particular log event happened? We could write a regex that matches it, but that will be brittle if we change the log message in the future, and it doesn't scale to hundreds or thousands of messages. Second, what if we want to find all cases where we frobulated 10 or more widgets? That's probably also possible with a regex, but it's not easy. Finally, our logging system needs to store many unique strings, all variations on the message with different expansions of the placeholder. Presumably it uses compression, but even so this increases the per-message overhead a lot.

Fortunately there's a better way, a pattern that's called "structured logging". In this pattern, which is the pattern we observe in almost all cases (see below for the exception case), you do something like this:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasty hack
use cheburashka::logging::prelude::*;

let widgets = vec![1];
// New hotness:
info!(widgets_count = widgets.len(), "Frobulated widgets");
}

We're reporting the same event, but this time our log event has an unparameterized message "Frobulated widgets", and an integer field widgets_count which is equal to the number of widgets frobulated. Assuming this log event gets sent to a log system like Loki or LogStash which also supports structured logging, we can even query our logs for all cases where widgets_count > 10, or count how many times the "Frobulated widgets" event was generated. This is way more powerful, and in the Rust implementation also a bit more efficient, since it avoids the heap allocation and string formatting (of course there's other overhead associated with structured logging, but tracing's implementation is very well optimized).

Make sure to review the tracing docs for all of the various tricks one can do. Just remember that we use cheburashka::logging instead of tracing, but the capabilities are the same.

This works by default for any type for which tracing has a suitable implementation of its Value trait, which in practice is almost everything you'd want to log. There are a few tips for making use of this:

Structured Logging Types

Integral types of all kinds can be used as in the example above.

For &str values you can also just pass them as-is:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasty hack
use cheburashka::logging::prelude::*;

// Works fine with &str
let my_field: &str = "foo"; // of course you wouldn't log a const like this; it's just an example

info!(value = my_field, "This is a &str");
}

However, String isn't supported. I'm not sure why; it seems like an odd limitation (GitHub issue). Fortunately you can always log anything that has either a Display or a Debug implementation by using the % or ? sigils, respectively. String implements both, but Display is the one that produces the exact contents of the string:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasty hack
use cheburashka::logging::prelude::*;

// Logging a `String` is a bit more tricky...
let my_field: String = "foo".to_string();

// info!(value = my_field, "This is a String"); <-- Won't compile; no implementation for `Value` for `String`
info!(value = %my_field, "This is a String"); // <-- This works because of the `%` sigil
}

This trick can also be used to log &Path and PathBuf, which don't have a (lossless) conversion to strings:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasty hack
use cheburashka::logging::prelude::*;
use std::path::PathBuf;

let my_field: PathBuf = PathBuf::from("foo");

info!(path = %my_field.display(), "This is a PathBuf"); // <-- This works because of the `%` sigil
}

Note here that you can't just use %my_field as we did with the string types, because Path doesn't implement Display directly; instead it has a method display() which returns a Display impl that will render the path properly.

Finally, there's another special case for logging errors (by which I mean implementations of std::error::Error). Errors implement Display, so you might be tempted to use the % sigil and log the error message that way, but that's not correct. Tracing subscribers have specific support for Error implementations, and subscribers could (and definitely will in the future) log the entire Error, including nested causes and backtraces, to some error reporting system like Sentry. If you just grab the top-level error message when logging, you'll miss out on that.

Unfortunately, it's pretty awkward to put Errors in the form of the &dyn Error + 'static that tracing expects. So we have a helper in cheburashka called log_error (plus log_dyn_error and log_boxed_error variations) which performs what amounts to a compile-time cast to make sure the error gets captured by tracing as an error type:

use doc::cheburashka; // Nasty hack
use cheburashka::logging::prelude::*;
use std::fs::File;

fn main() -> anyhow::Result<()> {
    File::create("/tmp/foo.json").map_err(|e| {
        // Log with the `err` field so subscribers capture the full error, then propagate it
        error!(err = log_error(&e), "Failed to create the file");
        e
    })?;

    Ok(())
}

Note that by convention we use the field name err when logging errors.

When not to use structured logging

The one exception to the mandate to use structured logging is when using the print_ macros to print to the console and log at the same time. Because these events are by default rendered on the console without any decorations, any fields you add to the log event will not be visible unless the console output has been configured to show full log output. In this case, and in this case only, you should use the old-style string formatting technique to generate log messages. Note that you do not need to use format! for this; all of the log event macros have formatting support built in.
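For example, a minimal sketch:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasty hack
use cheburashka::logging::prelude::*;

let widgets = vec![1];
// Formatting support is built into the macro, so no `format!` is needed
print_info!("Frobulated {} widgets", widgets.len());
}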

Spans

Often one needs to log the same fields multiple times, e.g.:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasty hack
use cheburashka::logging::prelude::*;
use std::path::{Path, PathBuf};

fn fallible_operation(path: &Path) -> Result<(), std::io::Error> {
  // do something that might fail
  // ...
  Ok(())
}

fn do_stuff(path: PathBuf) {
  info!(path = %path.display(), "Going to do something...");

  if let Err(e) = fallible_operation(&path) {
     error!(err = log_error(&e), path = %path.display(), "Failed to do something with path");
  } else {
     info!(path = %path.display(), "Successfully did something with path");
  }
}
}

In this example, three log events are reported, all with the same path field. This is repetitive and difficult to maintain. If you add another argument to this function, will you remember to add it to all three log events?

tracing and therefore cheburashka have the concept of a span, which addresses this. A span is a set of name/value pairs like a log event, except that all events generated within the context of a span inherit that span's fields. So the above example would be rewritten as:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasty hack
use cheburashka::logging::prelude::*;
use std::path::{Path, PathBuf};

fn fallible_operation(path: &Path) -> Result<(), std::io::Error> {
  // do something that might fail
  // ...
  Ok(())
}

fn do_stuff(path: PathBuf) {
  let span = info_span!("In do_stuff", path = %path.display());
  let _guard = span.enter();
  info!("Going to do something...");

  if let Err(e) = fallible_operation(&path) {
     error!(err = log_error(&e), "Failed to do something with path");
  } else {
     info!("Successfully did something with path");
  }
}
}

All of these three log events will be in the span "In do_stuff", with the path field attached. Any log events generated while the _guard is in scope will automatically be in the span.

Async gotcha

Note that the above code won't work for async functions:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasty hack
use cheburashka::logging::prelude::*;
use std::path::{Path, PathBuf};

async fn fallible_operation(path: &Path) -> Result<(), std::io::Error> {
  // do something that might fail
  // ...
  Ok(())
}

async fn do_stuff(path: PathBuf) {
  let span = info_span!("In do_stuff", path = %path.display());
  let _guard = span.enter();
  info!("Going to do something...");

  if let Err(e) = fallible_operation(&path).await { // <- problem: the span stays entered across this .await
     error!(err = log_error(&e), "Failed to do something with path");
  } else {
     info!("Successfully did something with path");
  }
}
}

The full explanation is beyond the scope of this handbook, but in short: the _guard keeps the span entered across the .await, so the span incorrectly remains active on the thread while the task is suspended, and when the task resumes (possibly on another thread) the span is not entered there. There is an extension trait available which adds an instrument method to all Futures and solves this problem:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasty hack
use cheburashka::logging::prelude::*;
use cheburashka::logging::futures::Instrument;
use std::path::{Path, PathBuf};

async fn fallible_operation(path: &Path) -> Result<(), std::io::Error> {
  // do something that might fail
  // ...
  Ok(())
}

async fn do_stuff(path: PathBuf) {
  let span = info_span!("In do_stuff", path = %path.display());

  async move {
    info!("Going to do something...");

    if let Err(e) = fallible_operation(&path).await {
       error!(err = log_error(&e), "Failed to do something with path");
    } else {
       info!("Successfully did something with path");
    }
  }.instrument(span).await
}
}

Ugly and cumbersome, yes, but async does tend to make things more complex sometimes. Fortunately there's (sometimes) a better way...

Automatic Instrumentation

A very common pattern in instrumenting code with structured logging is logging an event when a function is entered, possibly including its arguments, then logging an event when the function is exited successfully, or logging an error if the call fails. That's such a common pattern that tracing has a proc macro that does it for you! And, it works on both regular and async functions! Behold:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasty hack
use cheburashka::logging::prelude::*;
use std::path::{Path, PathBuf};

// NOTE: this struct is not `Debug`
pub struct MyStruct {
  inner: PathBuf
}

fn fallible_operation(path: &Path) -> Result<(), std::io::Error> {
  // do something that might fail
  // ...
  Ok(())
}

#[instrument(skip(x), fields(path = %x.inner.display()))]
fn do_stuff(x: MyStruct) {
  info!("Going to do something...");

  if let Err(e) = fallible_operation(&x.inner) {
     error!(err = log_error(&e), "Failed to do something with path");
  } else {
     info!("Successfully did something with path");
  }
}
}

In this case, skip is used because MyStruct cannot be logged as a field directly (it implements neither Debug nor Display); instead a custom field called path is added to the span, using the result of x.inner.display(). The macro automatically creates a span with the name of the function and ensures all code in the function runs inside that span. It doesn't matter if do_stuff is async; this still works.

If the method you annotate is fallible (meaning it returns Result<...>) there's another helpful option:

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasty hack
use cheburashka::logging::prelude::*;
use std::path::{Path, PathBuf};

fn fallible_operation(path: &Path) -> Result<(), std::io::Error> {
  // do something that might fail
  // ...
  Ok(())
}

#[instrument(err)]
fn do_stuff(path: PathBuf) -> Result<(), std::io::Error> {
  info!("Going to do something...");

  if let Err(e) = fallible_operation(&path) {
     error!(err = log_error(&e), "Failed to do something with path");
  } else {
     info!("Successfully did something with path");
  }

  Ok(())
}
}

The err option tells instrument to log an error event if the function returns an error. It doesn't matter how control leaves the function, whether via a return statement or a ? on a failed result; the error will be caught and logged.

Now, you might be tempted to run off and put #[instrument(err)] on every function you write from now on so that you can always know when things fail. Resist that urge, and use this capability wisely and with intent. We do not want to spam logs with a bunch of error events every time one thing goes wrong. Think carefully about where logical boundaries in your code are, and log errors only as they cross those boundaries. There are no hard and fast rules here, but see the chapter on errors for some thoughts.

Labels

In the section about logging, particularly structured logging, we described how to generate log events that have named fields logged along with a message. Up to this point, which fields were logged was hard-coded at the event or span level. But even with the span feature allowing you to put fields on all events in a span, the definition of those fields is manual, and if you want to use the same fields in many places in code, you must copy-paste. That's fine for ad-hoc log events, but there are cases where you want more consistency.

In a later section we'll talk about metrics, but for now just accept that metrics have labels on them similar conceptually to the fields we log with structured logging. We have tools like Loki and Grafana, which let us switch between a graphical view of metrics, and the log events that were logged with the same labels as those metrics. This only works if we are consistent in making sure the labels we attach to our metrics and the fields we log in our log events line up.

In cases like this, cheburashka introduces the concept of labels, in the form of the LabelSet trait and a proc macro which generates LabelSet impls on arbitrary structs. Labels are a concept implemented directly in cheburashka; the underlying tracing crate doesn't have any concept of labels, just the arbitrary fields we've already discussed.

The doc comments in cheburashka::labels are very extensive, so go read them to learn about the concept and its implementation. The key point to take away is that given a set of labels, you can create a log span with those labels, and you can create metrics that require those labels, with compile-time checking that the labels applied to the metrics are the correct ones. This provides a convenient way to define labels in code and use them in multiple places without duplication of effort or manual work.
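To illustrate the intent only (the derive name and field syntax below are assumptions, not the real cheburashka API; the cheburashka::labels doc comments are authoritative):

#![allow(unused)]
fn main() {
use doc::cheburashka; // Nasty hack
// Hypothetical: assume `labels` exposes a derive that generates the LabelSet impl
use cheburashka::labels::LabelSet;

// A label set shared by log spans and metrics, so the two always line up
#[derive(LabelSet)]
struct BackupLabels {
    volume: String,
    host: String,
}
}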

Metrics

At a high level, metrics are just time-series data about the behavior of your code. There are many ways to create, report, and store metrics. Our approach is modeled on Prometheus.

For useful background, read the relevant sections of the Prometheus docs.

Note that we don't actually use Prometheus internally. We find that VictoriaMetrics is a better implementation, with higher performance and better support for high cardinality labels. However VictoriaMetrics is another implementation of the same Prometheus concepts, and the Prometheus docs are better, so in practice the fact that we don't actually use Prometheus doesn't change how we instrument our code to collect metrics.

The metrics implementation is in cheburashka::metrics. It's based on the prometheus crate, however it adds quite a bit of additional convenience functionality on top.

The most significant improvement is support for labeled metrics: metrics which take a specific type of LabelSet defined at compile time, so that attempting to apply the wrong set of labels to a metric is a compile error. You are strongly urged to prefer these over the regular metrics types cheburashka re-exports from prometheus.
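For flavor, here is what this looks like with the underlying prometheus crate directly (this is prometheus's own API, with runtime-checked labels rather than cheburashka's compile-time checking):

#![allow(unused)]
fn main() {
use prometheus::{IntCounterVec, Opts, Registry};

let registry = Registry::new();

// A counter with a `volume` label; with plain prometheus, nothing stops you
// from passing the wrong label names at runtime
let backups = IntCounterVec::new(
    Opts::new("backups_total", "Number of backups performed"),
    &["volume"],
).unwrap();
registry.register(Box::new(backups.clone())).unwrap();

backups.with_label_values(&["C:"]).inc();
}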

The doc comments in cheburashka::metrics cover all of this in detail.

Versioning and Publishing Crates

When Elastio started we did almost all of our work in a single monorepo called elastio. As of this writing in mid 2021 the bulk of our work still happens in the monorepo, but we are gradually moving away from this as parts of our codebase mature enough to justify being moved into their own repos.

As soon as we started doing this we had to address the problem of dependencies between crates in different repos. Everything in one repo is easy; you just reference the other crates by their path. Once the repos split you need to start to version crates and publish them to a cargo registry where the code in other repos can access them.

While we plan to eventually publish most of our data plane and red stack code on crates.io, we're not there yet, so we're using a private Cargo registry hosted on Cloudsmith (which actually supports dozens of technologies, including npm, Docker, and Debian, but we're focused here on cargo). We have a Cloudsmith org elastio, and a single private registry private where all of our private artifacts (including non-cargo artifacts) live. When you went through on-boarding you should have received access to this system; you can access it at https://cloudsmith.io/~elastio/repos/private/packages/.

There are many approaches to Rust crate management, but we've settled on a very opinionated, Elastio-specific approach which works for us. This is mostly automated in the elastio/xtask-core repo, which contains both a library crate xtask-core for integration into other repo-specific automations, and a binary cargo-xelastio which can be installed with cargo install and invoked as cargo xelastio ... through the magic of cargo's extension mechanism. cargo-xelastio has a subcommand publish which implements our publication logic. You won't ever invoke this manually; it's built into our CI workflows. However, you need to know how it works and how to use it to make a release.

Semver Principles

We follow semver when versioning our crates. You probably know at least a little bit about semver, but there are some less-known concepts that we also rely on which you need to understand.

In short, versions are major.minor.patch, typically starting at 0.1.0 for a new crate. Fixing a bug means incrementing the patch number, e.g. 1.2.3 => 1.2.4. Adding a new capability to the API without making any breaking changes to the existing public interface means incrementing the minor number, e.g. 1.2.3 => 1.3.0. Note that some breaking changes can be unobvious:

  • items moving from pub to non-pub and vice-versa;
  • items changing their kind, i.e. from a struct to an enum;
  • additions and removals of region parameters (i.e. lifetimes) to and from an item's declaration;
  • additions and removals of (possibly defaulted) type parameters to and from an item's declaration;
  • changes to the variance of type and region parameters;
  • additions and removals of enum variants (although additions can be non-breaking changes when tagged as non_exhaustive);
  • additions and removals of enum variant- or struct fields;
  • changes from tuple structs or variants to struct variants and vice-versa;
  • changes to a function or method's constness;
  • additions and removals of a self-parameter on methods;
  • additions and removals of (possibly defaulted) trait items;
  • correct handling of "sealed" traits;
  • changes to the unsafety of a trait;
  • type changes of all toplevel items, as well as associated items in inherent impls and trait definitions;
  • additions and removals of inherent impls or methods contained therein;
  • additions and removals of trait impls.

Breaking changes must always increment the major number, even if the change itself doesn't feel "major", for example renaming a struct or adding an argument to a public method; e.g. 1.2.3 => 2.0.0. It's important to adhere to these rules because cargo assumes you do: if you ask for version 1.2.3 of a crate, but version 1.2.10 is available, cargo will assume it should use 1.2.10 unless you have explicitly told it not to. If 1.2.10 contains a breaking change, your downstream deps will break suddenly without their authors understanding why, and they will invent creative nicknames for you involving anatomically improbable contortions and/or livestock.

Semver versions can also contain two other optional components: a pre-release version and build metadata. The pre-release version follows a - and looks like 1.2.3-alpha or 1.2.3-beta.3 or 1.2.3-foo.bar. No assumption is made about the meaning or structure of the pre-release version; however, it is assumed that a given major.minor.patch version without any pre-release component takes priority over the same version with a pre-release component, and that given two pre-releases of the same version, the one whose pre-release component comes later in lexicographic sort order takes priority.

For example: 1.2.3 will be chosen over 1.2.3-alpha, while 1.2.3-bravo will take priority over 1.2.3-alpha, and 1.2.3-zulu will win over 1.2.3-oscar.whiskey.3.
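You can check this precedence yourself with the semver crate (the same one cargo uses); a quick sketch:

#![allow(unused)]
fn main() {
use semver::Version;

let release = Version::parse("1.2.3").unwrap();
let alpha = Version::parse("1.2.3-alpha").unwrap();
let bravo = Version::parse("1.2.3-bravo").unwrap();

// A plain version takes priority over any pre-release of it...
assert!(release > alpha);
// ...and between two pre-releases, the later sort order wins
assert!(bravo > alpha);
}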

It's also important to note that, within pre-release versions, no assumptions are made about stability. In other words, if from 1.2.3 to 1.2.4 you make a breaking change to the public interface, a plague of locusts will be visited upon your house. On the other hand, if you make two releases of 1.2.3-dev and the second release is in absolutely no way compatible with the first one, no one can complain because you have made it explicit with -dev that this is a pre-release version and not subject to the same stability guarantees.

Build metadata is a bit different in that it never influences version resolution. Two versions with the same major, minor, patch, and pre-release but with different build metadata are considered identical. Cargo even has a bug whereby if two such crates, differing only by their build metadata, exist on a registry, cargo shits the bed. Build metadata follows a + and takes the same form as the pre-release version.

Versioning in Workspaces

Within a workspace (meaning within a Git repo in our case), all crates always have the same version. This is a really important simplifying assumption. For an example of how complicated things get when this assumption doesn't hold, look at the Tokio tracing project. It's a nightmare and not one we wish to live with.

So this means that if you have a workspace with awesome-core, awesome-lib and awesome-cli, and you make a small bug fix to awesome-cli, you will be releasing a new patch version of all three of those crates, even though only awesome-cli changed. In practice this isn't a big deal; actual published crates are very lightweight (tens of KB typically) and cargo generally is smart enough to automatically pick up the latest patch release of crates unless you deliberately pin your dependency to a specific version.

There's a command cargo xelastio version which will determine the version of all crates in the workspace and print it to stdout, or fail with an error if not all crates have the same version.

Versioning and Publishing in master

The code in master is never assumed to be suitable for release. The master version of the crates always has a pre-release version dev. Every successful build on master is published to Cloudsmith with this pre-release modifier, and also with build metadata containing the git commit hash and the commit date in the form $hash.YYYY-MM-DD (so a full version looks something like 1.2.4-dev+abc1234.2021-06-15, where the hash and date are illustrative).

Note that due to the aforementioned cargo bug (#7180), we only keep the most recent master version with a given major.minor.patch published on Cloudsmith; at publish time the previous versions are automatically deleted. In practice this doesn't matter, because build metadata is never used to resolve dependencies, so even if you tried to depend on an older version with different build metadata, cargo would ignore that metadata when resolving the dependency.

So, you might ask, why bother embedding the build metadata at all? Because it's nonetheless exposed as the crate version, which means we can use the native Cargo crate version as a log or telemetry label, or for other diagnostic purposes, and it always includes information about the date and commit the code came from. This is helpful information and costs us nothing to maintain, since it's built into the version structure used by cargo.

Versioning and Publishing of releases

While it's possible to write code that depends on a -dev version of some other crate, you should never make any assumptions about the stability of a -dev version. We publish dev versions to Cloudsmith because it's sometimes useful to do experimental work on two crates in two repos at the same time and still be able to refer to the crate in the other repo, but you should never release a crate that has a dev dependency, because it's almost guaranteed to end badly.

Thus, after you've made some changes and landed them on master, and it's time for other crates in the ecosystem to be able to use the new version, you need to make a proper release. You do this by manually running a GitHub Actions workflow, typically called release. Whether you use the GitHub UI or the gh CLI, you will be asked to provide a value for a parameter bumpLevel. This is a string, either major, minor, or patch, and tells the release process which component of the version should be incremented. Think carefully about this, and bear in mind the Semver rules.

Whichever bump level you pick, the following will happen:

  • The current master is taken as the starting point
  • If there is a prerelease version like dev it's removed
  • If the bumpLevel is patch: if there was a prerelease version, no further change is made (in effect the version goes from 1.2.3-dev to 1.2.3); if there was no prerelease version, the patch number is incremented
  • If the bumpLevel is something else, either the minor or major version is incremented in accordance with Semver
  • All crates are packaged and published to Cloudsmith
  • The changes with the new versions are committed to master with a comment about preparing for release
  • A tag vX.Y.Z is created, where X.Y.Z is the released version (never with prerelease version or build metadata)
  • A new dev version is made by incrementing the patch level of the released version and adding a dev prerelease. E.g., if we just released 1.2.3, master is modified so all crates now have the version 1.2.4-dev.
  • That change is also committed to master with a comment about preparing the next dev release
  • All changes to git are pushed to GitHub

Deviations from the Standard

This section describes the standard process which is implemented in xtask-core and cargo-xelastio. Not all repos will use this exact process; if you're using a repo that has a different publishing approach you probably know about it already. If you're not sure ask your teamlead. If you are the teamlead, ask @anelson. If you are @anelson then God help us!

Scrum Background

Scrum is a lightweight framework that helps people, teams and organizations generate value through adaptive solutions for complex problems.

In a nutshell, Scrum requires a Scrum Master to foster an environment where:

  1. A Product Owner orders the work for a complex problem into a Product Backlog.
  2. The Scrum Team turns a selection of the work into an Increment of value during a Sprint.
  3. The Scrum Team and its stakeholders inspect the results and adjust for the next Sprint.
  4. Repeat

The Scrum framework is purposefully incomplete, only defining the parts required to implement Scrum theory. Scrum is built upon by the collective intelligence of the people using it. Rather than provide people with detailed instructions, the rules of Scrum guide their relationships and interactions. Various processes, techniques and methods can be employed within the framework. Scrum wraps around existing practices or renders them unnecessary. Scrum makes visible the relative efficacy of current management, environment, and work techniques, so that improvements can be made.

Scrum is founded on empiricism and lean thinking. Empiricism asserts that knowledge comes from experience and making decisions based on what is observed. Lean thinking reduces waste and focuses on the essentials.

Scrum employs an iterative, incremental approach to optimize predictability and to control risk. Scrum engages groups of people who collectively have all the skills and expertise to do the work and share or acquire such skills as needed.

Scrum combines four formal events for inspection and adaptation within a containing event, the Sprint. These events work because they implement the empirical Scrum pillars of transparency, inspection, and adaptation.

Transparency

The emergent process and work must be visible to those performing the work as well as those receiving the work. With Scrum, important decisions are based on the perceived state of its three formal artifacts. Artifacts that have low transparency can lead to decisions that diminish value and increase risk.

Transparency enables inspection. Inspection without transparency is misleading and wasteful.

Inspection

The Scrum artifacts and the progress toward agreed goals must be inspected frequently and diligently to detect potentially undesirable variances or problems. To help with inspection, Scrum provides cadence in the form of its five events.

Inspection enables adaptation. Inspection without adaptation is considered pointless. Scrum events are designed to provoke change.

Adaptation

If any aspects of a process deviate outside acceptable limits or if the resulting product is unacceptable, the process being applied or the materials being produced must be adjusted. The adjustment must be made as soon as possible to minimize further deviation.

Adaptation becomes more difficult when the people involved are not empowered or self-managing. A Scrum Team is expected to adapt the moment it learns anything new through inspection.

Scrum Values

Successful use of Scrum depends on people becoming more proficient in living five values:

Commitment, Focus, Openness, Respect, and Courage

The Scrum Team commits to achieving its goals and to supporting each other. Their primary focus is on the work of the Sprint to make the best possible progress toward these goals. The Scrum Team and its stakeholders are open about the work and the challenges. Scrum Team members respect each other to be capable, independent people, and are respected as such by the people with whom they work. The Scrum Team members have the courage to do the right thing, to work on tough problems.

These values give direction to the Scrum Team with regard to their work, actions, and behavior. The decisions that are made, the steps taken, and the way Scrum is used should reinforce these values, not diminish or undermine them. The Scrum Team members learn and explore the values as they work with the Scrum events and artifacts. When these values are embodied by the Scrum Team and the people they work with, the empirical Scrum pillars of transparency, inspection, and adaptation come to life building trust.

Scrum Team

The fundamental unit of Scrum is a small team of people, a Scrum Team. The Scrum Team consists of one Scrum Master, one Product Owner, and Developers. Within a Scrum Team, there are no sub-teams or hierarchies. It is a cohesive unit of professionals focused on one objective at a time, the Product Goal.

Scrum Teams are cross-functional, meaning the members have all the skills necessary to create value each Sprint. They are also self-managing, meaning they internally decide who does what, when, and how.

The Scrum Team is small enough to remain nimble and large enough to complete significant work within a Sprint, typically 10 or fewer people. In general, we have found that smaller teams communicate better and are more productive. If Scrum Teams become too large, they should consider reorganizing into multiple cohesive Scrum Teams, each focused on the same product. Therefore, they should share the same Product Goal, Product Backlog, and Product Owner.

The Scrum Team is responsible for all product-related activities from stakeholder collaboration, verification, maintenance, operation, experimentation, research and development, and anything else that might be required. They are structured and empowered by the organization to manage their own work. Working in Sprints at a sustainable pace improves the Scrum Team’s focus and consistency.

The entire Scrum Team is accountable for creating a valuable, useful Increment every Sprint. Scrum defines three specific accountabilities within the Scrum Team: the Developers, the Product Owner, and the Scrum Master.

Developers

Developers are the people in the Scrum Team that are committed to creating any aspect of a usable Increment each Sprint.

The specific skills needed by the Developers are often broad and will vary with the domain of work. However, the Developers are always accountable for: - Creating a plan for the Sprint, the Sprint Backlog; - Instilling quality by adhering to a Definition of Done; - Adapting their plan each day toward the Sprint Goal; and, - Holding each other accountable as professionals.

Product Owner

The Product Owner is accountable for maximizing the value of the product resulting from the work of the Scrum Team. How this is done may vary widely across organizations, Scrum Teams, and individuals.

The Product Owner is also accountable for effective Product Backlog management, which includes: - Developing and explicitly communicating the Product Goal; - Creating and clearly communicating Product Backlog items; - Ordering Product Backlog items; and, - Ensuring that the Product Backlog is transparent, visible and understood.

The Product Owner may do the above work or may delegate the responsibility to others. Regardless, the Product Owner remains accountable. For Product Owners to succeed, the entire organization must respect their decisions. These decisions are visible in the content and ordering of the Product Backlog, and through the inspectable Increment at the Sprint Review.

The Product Owner is one person, not a committee. The Product Owner may represent the needs of many stakeholders in the Product Backlog. Those wanting to change the Product Backlog can do so by trying to convince the Product Owner.

Scrum Master

The Scrum Master is accountable for establishing Scrum as defined in the Scrum Guide. They do this by helping everyone understand Scrum theory and practice, both within the Scrum Team and the organization.

The Scrum Master is accountable for the Scrum Team’s effectiveness. They do this by enabling the Scrum Team to improve its practices, within the Scrum framework.

Scrum Events

The Sprint is a container for all other events. Each event in Scrum is a formal opportunity to inspect and adapt Scrum artifacts. These events are specifically designed to enable the transparency required. Failure to operate any events as prescribed results in lost opportunities to inspect and adapt. Events are used in Scrum to create regularity and to minimize the need for meetings not defined in Scrum.

Optimally, all events are held at the same time and place to reduce complexity.

The Sprint

Sprints are the heartbeat of Scrum, where ideas are turned into value. They are fixed length events of two weeks to create consistency. A new Sprint starts immediately after the conclusion of the previous Sprint. All the work necessary to achieve the Product Goal, including Sprint Planning, Daily Scrums, Sprint Review, and Sprint Retrospective, happen within Sprints.

During the Sprint: - No changes are made that would endanger the Sprint Goal; - Quality does not decrease; - The Product Backlog is refined as needed; and, - Scope may be clarified and renegotiated with the Product Owner as more is learned.

Sprints enable predictability by ensuring inspection and adaptation of progress toward a Product Goal at least every calendar month. When a Sprint’s horizon is too long the Sprint Goal may become invalid, complexity may rise, and risk may increase. Shorter Sprints can be employed to generate more learning cycles and limit risk of cost and effort to a smaller time frame. Each Sprint may be considered a short project.

Various practices exist to forecast progress, like burn-downs, burn-ups, or cumulative flows. While proven useful, these do not replace the importance of empiricism. In complex environments, what will happen is unknown. Only what has already happened may be used for forward-looking decision making. A Sprint could be cancelled if the Sprint Goal becomes obsolete. Only the Product Owner has the authority to cancel the Sprint.

Sprints contain all Scrum Ceremonies, which are defined in the respective chapter.

Scrum Ceremonies

As defined in the Scrum Guide, we adopt 4 key Scrum Ceremonies, included in every sprint.

Sprint Planning

Sprint Planning initiates the Sprint by laying out the work to be performed for the Sprint. The resulting plan is created by the collaborative work of the entire Scrum Team. The attendees should be prepared to discuss the most important Product Backlog items and how they map to the Product Goal for the Sprint. The Scrum Team may also invite other people to attend Sprint Planning to provide advice. Sprint Planning is timeboxed to a 45-minute per-team meeting for a two-week Sprint.

Before Sprint Planning, developers are kindly asked to pick the tasks for the upcoming Sprint from the Product Backlog and add them to the Sprint Backlog. In addition, to promote deeper task analysis and more precise planning, we ask participants to think through and list all possible risks and dependencies for the issue in question in the comments of the respective issue before Sprint Planning. It is also highly desirable to include a short description of how the task is going to be executed in the comments as well. The description should be short and precise, and describe the logic of the solution the developer anticipates.

Sprint Planning for Elastio teams is normally held on the first Monday of the Sprint. Additionally, teams are asked to hold an Estimation Session at the beginning of the sprint to make sure all planned work is estimated.

Sprint Planning addresses the following topics:

  1. Why is this Sprint valuable?

The Product Owner proposes how the product could increase its value and utility in the current Sprint. The whole Scrum Team then collaborates to define a Sprint Goal that communicates why the Sprint is valuable to stakeholders. The Sprint Goal must be finalized prior to the end of Sprint Planning.

  2. What can be Done this Sprint?

Through discussion with the Product Owner, the Developers select items from the Product Backlog to include in the current Sprint. The Scrum Team may refine these items during this process, which increases understanding and confidence. Selecting how much can be completed within a Sprint may be challenging. However, the more the Developers know about their past performance, their upcoming capacity, and their Definition of Done, the more confident they will be in their Sprint forecasts.

  3. How will the chosen work get done?

For each selected Product Backlog item, the Developers plan the work necessary to create an Increment that meets the Definition of Done. This is often done by decomposing Product Backlog items into smaller work items of one day or less. How this is done is at the sole discretion of the Developers. No one else tells them how to turn Product Backlog items into Increments of value. The Sprint Goal, the Product Backlog items selected for the Sprint, plus the plan for delivering them are together referred to as the Sprint Backlog.

Daily Scrum

The purpose of the Daily Scrum is to inspect progress toward the Sprint Goal and adapt the Sprint Backlog as necessary, adjusting the upcoming planned work. The Daily Scrum is a 15-minute event for the Developers of the Scrum Team. To reduce complexity, it is held at the same time and place every working day of the Sprint. If the Product Owner or Scrum Master are actively working on items in the Sprint Backlog, they participate as Developers. Daily Scrums help improve communications, identify impediments, promote quick decision-making, and consequently eliminate the need for other meetings.

Daily Scrum at Elastio is done for each team separately, as agreed within each team. Scrum of Scrums is held daily at 18.30.

Please note: Daily Scrum is currently replaced by Sprint Retrospective on the first Monday of a Sprint, and by Sprint Review on the last Friday of the Sprint.

Sprint Review

The purpose of the Sprint Review is to inspect the outcome of the Sprint and determine future adaptations. The Scrum Team presents the results of their work to key stakeholders and progress toward the Product Goal is discussed. All tasks that can be presented as a Demo should be showcased as a part of Sprint Review, otherwise progress can be shown another way. During the event, the Scrum Team and stakeholders review what was accomplished in the Sprint and what has changed in their environment. Based on this information, attendees collaborate on what to do next. The Product Backlog may also be adjusted to meet new opportunities. The Sprint Review is a working session and the Scrum Team should avoid limiting it to a presentation.

Every participant should complete a list of tasks to be showcased at Sprint Review here.

The Sprint Review is the second to last event of the Sprint and is timeboxed to 2 hours for Elastio. It is normally held on the last Friday of the Sprint at 17.00.

Sprint Retrospective

The purpose of the Sprint Retrospective is to plan ways to increase quality and effectiveness. The Scrum Team inspects how the last Sprint went with regards to individuals, interactions, processes, tools, and their Definition of Done. Inspected elements often vary with the domain of work. Assumptions that led them astray are identified and their origins explored. The Scrum Team discusses what went well during the Sprint, what problems it encountered, and how those problems were (or were not) solved.

As a result of the Sprint Retrospective, the retrospective document should be completed for the previous Sprint.

The Sprint Retrospective concludes the Sprint. It is timeboxed to an hour and is held on the first Monday of the following Sprint at 18.00.

Issues Estimation

At Elastio we use Planning Poker to do estimations. At the beginning of each sprint, the team should make sure that all tasks planned for the sprint are estimated.

Bugs and issues that occur in the middle of the sprint do not get estimated, for two reasons: not to take time from the team's work, and not to distort the team's velocity. Any urgent bug or other work that comes up during the sprint and requires engineering attention is, by its nature, not predictable. Such tasks should not count towards the team's velocity, because velocity measures how much of the work planned and accepted into scope at the start of the sprint a team can execute. For example, in a hypothetical (and dysfunctional) scenario where half of a team's engineering capacity is spent dealing with urgent bugs as they come up, that team's velocity should be 50% lower as a result of all of that unplanned work. If management is unhappy with the lower velocity, then management should address the large volume of unplanned work items.

We use the Fibonacci-style sequence 1, 2, 3, 5, 8, 13, 21, 40 for our estimations. Each team should have a set of reference issues to make their estimation process easier and more transparent. Reference issues are normally defined when estimation via story points is first adopted. In some cases, such as rapid team growth, or as the team settles into the estimation process, the reference issues might be redefined. However, it is not recommended to do so more often than once in 6 months, as the new reference issues might make past velocity data irrelevant.

Planning poker (also called Scrum poker) helps agile teams estimate the time and effort needed to complete each initiative on their product backlog. The technique gets its name from the physical cards participants originally used. These cards, which look like playing cards, estimate the number of story points for each backlog story or task up for discussion.

Each team has a session of Planning Poker at the beginning of the sprint; the whole team (or, in rare cases, the subset of people working on the same functionality/crate) is needed to estimate. Normally, one member of the team explains what work needs to be done in the scope of the issue, then the team members are asked to put down the number of story points for the task. When all team members have placed their votes, the cards are revealed.

If the estimates for an issue differ a lot, the team members who placed the biggest and smallest estimates should explain the reasoning behind them. As a result of this discussion, the team should reach a unified decision about the number of story points the issue is worth. If that's not possible, the assignee for the issue has the final word.

A story point is a metric used in agile project management and development to estimate the difficulty of implementing a given user story; it is an abstract measure of the effort required to implement it. In simple terms, a story point is a number that tells the team about the difficulty level of the story, where difficulty covers the complexity, risks, and effort involved.

How we Use GitHub and ZenHub with Scrum

Organizing Epics

We represent the addition of a new feature or some other unit of work spanning a single sprint as an epic.

In some cases, if the work will span multiple teams, we use the multi-level epic feature in ZenHub to make one high-level epic with sub-epics for each team. However, this is as far as we go in nesting epics; while it's possible in ZenHub to have sub-sub-epics and even sub-sub-sub-epics, down that way lies madness, so we limit ourselves to just two levels.

Labels

We use GitHub labels to help classify and organize issues and epics.

Here are the labels we use:

  • project/* - Identifies which project (aka which team) an issue pertains to. Values are data-plane, blue-stack, red-stack, scalez. Sometimes an issue will be tagged with multiple projects if it pertains to more than one, but we try to avoid that interdependency when we can.

  • crate/* - Identifies the crate (or really, "crate family") that an issue pertains to. We don't always use this, it depends on the issue whether or not it makes sense to capture this. Bugs should always be tagged with the crate the bug is in, if known. New features or additional functionality should only be tagged with a crate if the issue explicitly pertains to that crate. For example, an issue to add a capability to our logging framework cheburashka would be tagged crate/cheburashka, but an epic that adds some new telemetry feature would not be tagged that way even though some of the work will touch the cheburashka crate.

  • phase/* - This is an experimental convention. It might not be retained, or it might be migrated to a release or milestone instead. This captures what phase of the work leading up to and including the Elastio MVP this item pertains to. In general, work in earlier phases takes priority over work in later phases, so this is a convenient handle for organizing work by coarse-grained priority.

    Phases are things like alpha-1, alpha-2, beta-1, mvp, and post-mvp

  • bug - This is a bug. Sometimes the line between "bug" and "feature" is not obvious, but nonetheless sometimes we want to be able to view and manage all defects separately from all other work. A bug that isn't tagged this way might get lost in the backlog grooming shuffle.

  • ci - An item that pertains not to the product itself but to our CI/CD infrastructure (this includes our dev and test AWS accounts).

  • groomed - Once an issue has been completely processed during grooming and doesn't require any more consideration, we put this label on it so we know to skip it next time we groom. This label can also be removed from an issue if a material change is made or if a question is raised that requires further discussion.

Sprints

While ZenHub has recently introduced support for sprints as an explicit entity in their data model, we have not migrated to this feature yet. We still use Milestones for each sprint. At the start of the sprint (usually in the evening of the first day, after planning activities) all tasks in scope for the sprint are assigned to that sprint's milestone. If tasks are not completed within a sprint and are still in scope after the next sprint's planning, they are assigned to the next sprint's milestone.