04 June, 2021

Signal safety number privacy issues

Credit

Affected application versions (known as of 6/4/21)

  • 5.13.0 and below (iOS)
  • 5.3.0 and below (macOS)
  • 5.3.0 and below (Windows)
  • 5.10.8 and below (Android)
  • 5.3.0 and below (Linux)

Initial test versions (5/13/21)

  • 5.11.0 and below (iOS)
  • 5.1.0 and below (macOS)
  • 5.1.0 and below (Windows)
  • 5.9.7 and below (Android)
  • 5.1.0 and below (Linux)

Intro

Signal provides a free, cross-platform private messenger app. Folks in all kinds of unsafe situations rely on Signal, a highly visible and popular app endorsed by the security and privacy professional communities. Journalists rely on Signal to ensure confidential communication with their sources.

What privacy guarantees does one really have though if you can't prove the identity of who you're communicating with?

The problem

Mid-May, I got a new phone. At the time, I understood that with *any change* to the device or installation of either party in a chat with message history, the Signal chat "safety number" changes. This used to be reflected in the Signal support documentation, but (following an involved email back-and-forth with the Signal team over the course of a month) it no longer is.

When a safety number changes, Signal shows a message to both parties in the conversation. The most recent alert I recall seeing prior to this adventure (which I believe was initially received April 14, about a month before I changed phones) looks like this:


Expecting similar alerts to be sent out to my existing chat threads upon phone changeover, I messaged a few of my more recent chats.


Signal has a pretty convenient iOS device transfer method to help migrate everything over (I later discovered it not only transfers your chat threads and settings, it also transfers your key material) by simply scanning a QR code on your new device using the old device. It worked beautifully. But then - nada más. I went to check docs to see if I had missed something obvious.

In the Signal user support documentation:


Also, from some old Signal blog posts announcing the most recent safety number feature updates to date (2016, 2017), it seemed like my contacts should have been alerted when I changed phones.

What *are* safety numbers anyway?

As far as I know, the idea of safety numbers as implemented in Signal doesn't have a publicly available product-level or technical specification, unlike some of the other algorithm and protocol components.

Backing up a few steps for folks who aren't familiar (anyone who knows Signal can obviously skip this), here's a bit more on how safety numbers work.

Let's say there are two participants in a Signal chat, Alice and Bob. This chat has a single unique safety number which both parties can check in the app.

This number is a human-readable representation of Alice and Bob's shared public key material. Slightly more technically put, it's a combination of Alice and Bob's individual cryptographic fingerprints in decimal format. There's also a QR code version in the mobile app so it's easy for folks to compare.
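As a rough conceptual sketch only (this is *not* Signal's actual derivation, and the key file names are hypothetical), a safety number behaves like a shared digest computed over both parties' public identity keys:

# NOT Signal's real algorithm: just the concept of a human-comparable digest
# of both parties' public identity keys, rendered as decimal digits.
$ alice=$(sha256sum alice_identity.pub | head -c 14)   # hypothetical key files
$ bob=$(sha256sum bob_identity.pub | head -c 14)
# Both sides compute the same combined number (given an agreed ordering), and
# it changes whenever either party's key material changes.
$ printf '%020d%020d\n' "0x$alice" "0x$bob"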

Here's a safety number in the Android app:



The safety number for Alice and Bob *should eventually* change whenever Alice *or* Bob changes their Signal installation (example: uninstalling and reinstalling the app to prevent a third party from seeing the chat history on Alice's phone). This allows Alice and Bob to verify the privacy of their communication over Signal as desired.

Alice and Bob can also mark their safety number as verified, in a way that is supposed to make sending messages to their chat always require manual approval right in the UI.

Again, however, there had been no safety number alerts in the UI in any of my existing chat threads. We later found out that it's an intentional product decision by Signal staff not to have device "quick start" transfers cause safety number changes, even though this is inconsistent both with the app's behavior on every other kind of device or installation change and with the documentation.

Sesame documentation

Signal enables users to have multiple devices on the same account through the Sesame device and session management algorithm. The need for users to verify each other to get any real privacy guarantee gets mentioned multiple times in the Sesame doc.

X3DH documentation

Here's another callout about the requirement for identity checking.

Initial Investigation

To reproduce the issue, I transferred Signal to my old device again. John Jackson observed that the safety number for our chat stayed the same before and after transfer, where he should have been alerted and the safety number should have changed:

Before

After


I additionally checked the chat safety number on another device associated with my Signal account and saw it had also not changed there:

Desktop device

Further Investigation: App Uninstallation

Later, with Rob, Sick Codes, and others, we observed that safety numbers also did not change for multiple user pairs across all device types Signal currently supports (Linux, macOS, Android, iOS, Windows) when at least one party deleted the app and reinstalled it on one of their linked devices. This meant the issue was not isolated to communicating with my user account.

Requesting a CVE ID and notifying vendor

At this point we felt we had two fairly well-tested cases where the documentation did not reflect the observed behavior:

  • chat safety numbers not changing on device transfer
  • chat safety numbers not changing on Signal uninstall/reinstall on same device

We requested a CVE and emailed security@signal.org a first draft we proposed for the CVE follow-on writeup with this summary:

Missing cryptographic step in Signal Private Messenger ("Signal") across multiple platforms allows an attacker to masquerade as a victim Signal user to the victim's contacts. Signal at time of writing does not rotate the safety key ("safety number") between a user pair upon re-installation of the application, nor on transfer of application data from one device to another using a method such as iOS Quick Start, despite clear indication in the Signal documentation this must occur in order to let the user's contacts know the user's device or installation has changed. Failure of key rotation results in lack of non-repudiation of communications and indeterminate potential for impersonation and man-in-the-middle attacks.

Safety number verification

Then, I started to wonder if safety numbers actually changed under any circumstances.

Rob and I determined that the "you have not marked person as verified" functionality also did not force a safety number change when Signal was uninstalled and reinstalled on Rob's device, most likely due to the same underlying issue that was causing safety numbers not to change on uninstallation and reinstallation:

Prior to verification

Following verification

Following Rob deleting and reinstalling the app

Clearing data then uninstalling does actually cause safety number change on reinstall

Sick Codes and I proved the chat safety number did actually change upon clearing data in some flavours of the desktop app. Clearing my data and then uninstalling Signal on macOS caused me to lose all chat threads and contents of group chats on the macOS device. Chat threads were all still present across my other devices until I resynced my phone to the cleared desktop app, at which point all messages from before clearing my data were gone from my phone as well.

Before clearing data on desktop app

After clearing data

I accidentally went through a CAPTCHA loop twice trying to get everything put back together properly, but it was worth it to prove safety number changes can happen sometimes if you really try.

Vendor communications

Some time passed before Signal requested further information. We provided many of the screenshots included here and detailed some ways we thought the issues could be abused to cause Signal users harm. We felt it was clear that the app behaviour and documentation did not match at the time we reported, but Signal staff surprisingly said they were unable to reproduce.

After several more emails spaced out over multiple days, Signal staff requested a call. As everyone mentioned in this blog works full time elsewhere and this is our side gig, it was very kind of Signal staff to agree to take a call after Pacific Time working hours, but the call was ultimately not very useful from our perspective. We arrived at deadlock, and pretty much nothing else productive came from our long email chain. Eventually, Signal staff stopped replying after telling us they planned to update the customer docs and not change anything else. We later found that after dismissing our report, Signal had not only updated the customer docs but had started rolling out patches for the issues.

After the call, we provided Signal via email with what we considered to be our ideal timeline and outcome, though as far as we understand it was not taken into consideration. We would have liked to see disclosure at each step by Signal in order to show accountability to users and the security and privacy communities. We felt this request was appropriate considering the Signal app is a privacy-critical product, people in unsafe situations may rely on it for communications, and it seemed to us that some of Signal's privacy guarantees weren't being met.

Mobile apps' version bump

We tried to obtain independent verification from Ax Sharma this past week, but he unfortunately wasn't able to reproduce the lack of safety number change on reinstallation on iOS and Android on Friday (4 June). This tipped us off that something might have quietly changed in the Signal codebase, since we had been able to reproduce the issue consistently across multiple user/device matchups for several weeks beforehand.

Turns out, there is a freshly baked new version out in the Android and iOS app stores as of Friday. Even if unintentionally included, it seems we have our fix for at least the uninstall/reinstall safety number change issue on mobile:

Android

iOS

Safety number now changes on mobile for uninstall / reinstall!

Where before you could uninstall the app, wait awhile, reinstall it, and just get dropped back into all your chats with all message history available, now you have to follow a registration workflow, the chat safety number changes, and the party on the other end gets a UI alert. Further, on iOS any previous messages are no longer available on app reinstallation.

From uninstalling and reinstalling on iOS the evening of 4 June:

Before (seen on John's phone)

New prompt on iOS

After (seen on Kelly's phone)

After (seen on John's phone)

Desktop Electron app

Upon learning from Ax that uninstall/reinstall now showed a safety number change and proving that for ourselves, we wanted to see if the same was true for the desktop app, which is more or less the same Electron/NodeJS codebase across macOS, Linux, and Windows. We noticed that despite a similar version bump, the desktop app on macOS still did not produce a safety number update or show alerts after uninstall/reinstall for a chat:

Before uninstall/reinstall

Version info

I was unable to start the reinstalled 5.3.0 dmg I had saved in my downloads and had to grab 5.4.0:

After uninstall/reinstall

Device transfer

I also wanted to see if the device transfer issue was fixed. I got an interesting interstitial which wouldn't let me complete the transfer until I updated the Signal version on my old device, but even after updating, my contacts didn't see safety number changes.

This time, as well as doing the transfer, I verified the safety number for my chat thread with John was the same from both of my own devices. John, or anyone else communicating with me, is still unable to tell anything has changed at all after transfer from my old phone to my new one or vice versa.

Before (new phone)

After (back on old phone again)

Why are we doing this?

We don't want anyone to get hurt by way of trusting privacy guarantees which may be more conditional than they appear from the docs!

If, for example, Bob notices the chat safety number with Alice has changed, and then Alice sends a bunch of suspect-sounding messages or asks to meet in person (and Bob has never met Alice in person before), Bob should be wary. If Alice is, for example, forced to provide a device passcode or unlock their device with their fingerprint or face, Alice's device could be cloned over to a new device by way of the quick transfer functionality without Alice's consent, and the messages could be coming from the cloned device rather than Alice's actual device.

Timeline

  • 12 May 2021: vulnerability discovered
  • 13 May 2021: CVE requested from Mitre
  • 13 May 2021: vendor notified via security@ email address
  • 14 May 2021: vendor requested additional information
  • 14 May 2021: researchers responded
  • 15 May 2021: vendor requested additional information
  • 15 May 2021: researchers responded
  • 18 May 2021: researchers requested response
  • 18 May 2021: vendor denied vulnerability
  • 19 May 2021: researchers responded
  • 22 May 2021: vendor requested video call
  • 24 May 2021: video call with vendor engineering manager, Kelly Kaoudis, and John Jackson to discuss
  • 25 May 2021: researchers provided sketch of ideal timeline for disclosure to vendor
  • 27 May 2021: vendor notified researchers of planned support page update and lack of plans to mitigate vulnerability or lack of clarity in technical documentation
  • 29 May 2021: researchers discover additional issue
  • 2 June 2021: vendor notified researchers of support page update 
  • 2 June 2021: researchers requested vendor preferred timeline for issue remediation for both initial and second vulnerabilities
  • 2 June 2021: researchers approached Ax Sharma for independent verification and possible writeup
  • 4 June 2021: Ax Sharma unable to reproduce by uninstalling and reinstalling Signal on Android and iOS
  • 4 June 2021: researchers determine safety numbers now change in Android and iOS when uninstalling and reinstalling Signal, but not on macOS nor when performing device transfer between two iOS devices (original device transfer issue remains unpatched)

References

- Signal on GitHub
- Wayback 2021-05-22: Signal Support Center Safety Number Documentation
- Wayback 2021-06-04: Signal Support Center Safety Number Documentation
- Wayback 2021-05-04: Signal blog "Safety Number Updates"
- Wayback 2021-05-04: Signal blog "Verified Safety Number Updates"
- Wayback 2021-06-04: Signal blog "iOS Device Transfer"
- Wayback 2021-06-03: Signal Sesame Specification
- Wayback 2021-06-03: Signal X3DH Specification

Additional Thanks To

Ax Sharma, Amber Harrington, and exp1r3d for their independent assistance testing!

04 April, 2021

Adventures in Systemd-Linux-Dockerland

I am just recently starting to use Docker again after a roughly six-year hiatus (I used it regularly ~2014-2015). I do not recommend following the actions I document here in any way, shape, or form, but wanted to chronicle them in case it's educational for others, or in case I don't use Docker again until 2030 and want to know what worked in 2021.

Constructive suggestions or corrections (so I know for the future) are welcome! There are likely better or more efficient ways to do all of the following.

Intro

The proper, real, Docker setup documentation for Arch may be found here. If you actually want to use Docker on Arch, I would recommend trying that. You might also want to get a peep at the systemctl and systemd manpages.

Disclaimer: some of the following may be universal across Linux distributions, but as I run Arch, some of it is likely specific to Arch and perhaps also my own setup.

How I got Docker running as described here, from zero, in Arch, differs rather a bit from how people ought to do it. I was going too fast and assumed I still knew stuff. I also barely tolerate systemd, though I've gotten more accustomed to it in recent years.

I hope this illustrates some potential pitfalls of not properly following official documentation for the thing you want to run or the documentation for running said thing on your own distro when you actually want to get something done fast.

What I did, with explanations:

Took a wild guess here and:

$ sudo pacman -S docker

According to this, you need a loop module and then you'll install Docker from the Snap store. Needing loop tracked with my previous experience with Docker, so I didn't question it. I don't use Snap (as the link recommends), given I have Pacman already in Arch, so I skipped that part of the tutorial and went on to the next bits.

We need Docker running in order to create and interact with containers. One can run Docker a couple of different ways: with systemd (systemctl), or just by calling dockerd directly. I went the systemd route this time.
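(For reference, the non-systemd route is simply running the daemon yourself; a minimal sketch:)

# The direct alternative to systemd: launch the daemon in a spare terminal
# (it needs root) and leave it running in the foreground.
$ sudo dockerd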

$ sudo tee /etc/modules-load.d/loop.conf <<< "loop"
$ sudo modprobe loop    # load the loop module now; the conf file loads it on future boots
$ sudo systemctl start docker.service

The above call to the docker.service unit with sudo is how the tutorial recommended starting Docker, but after trying it, I felt this didn't make sense for my objective. With the sudo-prepended call to systemctl we're, as far as I understand, affecting the root user's environment rather than the current user's, even though the start command will not, to my current knowledge, cause docker.service to automatically run during future sessions. Running `sudo systemctl enable docker.service` instead would do that.

According to the systemctl manpage, `systemctl start docker` and `systemctl start docker.service` are equivalent. If a unit suffix is not specified, systemctl will just add the appropriate one for us (.service by default), but sudo adds a dimension of weirdness here: it overrides systemctl's usual behaviour of prompting for your password. `systemctl start docker` without sudo would have done what I actually wanted to do: simply use systemctl to manage dockerd in a way that is localised to the booted-in user session on my machine. When sudo executes a command in this form, we use it to act as the superuser (root). This, I believe, implies I'd have to also use sudo to initialise, run, etc. any future Docker containers as well, which wasn't the outcome I wanted.
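To summarise the distinction as I understand it, using the commands discussed above:

# The tutorial's version: the whole command runs as root, so no polkit prompt.
$ sudo systemctl start docker.service
# What I actually wanted: the same system unit, but polkit prompts for my own
# password rather than sudo escalating the command.
$ systemctl start docker
# And to have dockerd start automatically on future boots:
$ sudo systemctl enable docker.service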

Back to Docker. At this point, Docker Was Not Up.

`systemctl status docker.service` showed the service had failed to start. I got cranky, so went looking for docker.service. The following is what I got by default when I installed the docker package on Arch. As shown here under the Requires section, docker.socket is a dependency of docker.service:

$ cat /usr/lib/systemd/system/docker.service
    [Unit]
    Description=Docker Application Container Engine
    Documentation=https://docs.docker.com
    After=network-online.target docker.socket firewalld.service
    Wants=network-online.target
    Requires=docker.socket

    [Service]
    Type=notify
    # the default is not to use systemd for cgroups because the delegate issues still
    # exists and systemd currently does not support the cgroup feature set required
    # for containers run by docker
    ExecStart=/usr/bin/dockerd -H fd://
    ExecReload=/bin/kill -s HUP $MAINPID
    LimitNOFILE=1048576
    # Having non-zero Limit*s causes performance problems due to accounting overhead
    # in the kernel. We recommend using cgroups to do container-local accounting.
    LimitNPROC=infinity
    LimitCORE=infinity
    # Uncomment TasksMax if your systemd version supports it.
    # Only systemd 226 and above support this version.
    #TasksMax=infinity
    TimeoutStartSec=0
    # set delegate yes so that systemd does not reset the cgroups of docker containers
    Delegate=yes
    # kill only the docker process, not all processes in the cgroup
    KillMode=process
    # restart the docker process if it exits prematurely
    Restart=on-failure
    StartLimitBurst=3
    StartLimitInterval=60s

    [Install]
    WantedBy=multi-user.target

So I tried:

$ systemctl start docker.socket

This worked, in that docker.socket came up successfully, but the change in system state did not help me run the command from the tutorial I was still trying to follow at this point, `sudo systemctl start docker.service`. Starting Docker either with `systemctl start docker` or with dockerd should, I believe, start the socket automatically for you. You may also notice that if you `systemctl stop docker` but still have docker.socket active, the docker.socket unit can start docker again.

Since docker is a socket-activated daemon as installed by default on Arch, I could have enabled the docker.socket unit and then just used that as a base for my containers in the future. In this mode, systemd would listen on the socket in question and start up docker when I start containers. This style of usage is meant to be less resource-intensive since docker daemon would then only run as needed. We could also go full socketception and make socket-activated on-demand container units, if we have containers we want to reuse, and then use systemctl to control them.
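Here's a sketch of that socket-activated setup, had I gone that route (the sequence is my assumption based on the behaviour described above):

# Enable only the socket unit; systemd then listens on the Docker socket and
# starts docker.service on demand when a client first connects.
$ sudo systemctl enable --now docker.socket
$ systemctl status docker    # inactive (dead), TriggeredBy: docker.socket
$ sudo docker ps             # first client connection triggers activation
$ systemctl status docker    # now active (running)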

But still, all I really wanted to do was `systemctl start docker` and just use docker by itself (no sudo, no extra systemctl units) after that, so I tried to fix my environment up again with:

$ systemctl disable --now docker
$ systemctl disable --now docker.socket


So that we can have networking in our containers, it appears Docker will automatically create a docker0 bridge in the DOWN state for us when we start it as our own user account using systemctl:

$ sudo ip link
[sudo] password for user:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback ...
2: enp0s25: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether ...
3: wlp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DORMANT group default qlen 1000
    link/ether ...

$ systemctl status docker
○ docker.service - Docker Application Container Engine
     Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
     Active: inactive (dead)
TriggeredBy: ○ docker.socket
       Docs: https://docs.docker.com

$ systemctl start docker
==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ====
Authentication is required to start 'docker.service'.
Authenticating as: user
Password:
==== AUTHENTICATION COMPLETE ====

$ systemctl status docker            
● docker.service - Docker Application Container Engine
     Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
     Active: active (running) since ...; 1h 36min ago
TriggeredBy: ● docker.socket
       Docs: https://docs.docker.com
   Main PID: 2881 (dockerd)
      Tasks: 20 (limit: 19044)
     Memory: 155.5M
        CPU: 3.348s
     CGroup: /system.slice/docker.service
             ├─2881 /usr/bin/dockerd -H fd://
             └─2890 containerd --config /var/run/docker/containerd/containerd.toml --log-level info

$ sudo ip link           
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback ...
2: enp0s25: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether ...
3: wlp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DORMANT group default qlen 1000
    link/ether ...
6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
    link/ether ...

At this point, I had Docker itself working as my own user account. I thought I was ready to pull an image down and try to make a container. 

$ docker pull sickcodes/docker-osx:auto

I was wrong! 

Trying to build a headless container from the docker-osx:auto image as specified in the doc I was following did not fully work:

$ docker run -it \
    --device /dev/kvm \
    -p 50922:10022 \
    sickcodes/docker-osx:auto

I kept getting LibGTK errors, which it turned out were not due to anything being wrong with the container, but rather to an assortment of packages I still needed to install and a few missing group memberships for my user. I got stuck here for a while trying to figure out all the different things I didn't have yet, from a combination of the Arch documentation, the docker-osx documentation, and the rest of the internet. It's plausible you might encounter a similar error trying to run docker-osx if you don't have xhost available; this stumped me for rather a while, since at first I figured the issue was just xhost as described in the docker-osx troubleshooting docs.

Mini disclaimer: I use yay typically, not pacman, but wanted to provide pacman in the previous example since it's more commonly known. I don't recall which of the following packages are in AUR versus standard repositories, but here is what I think I ended up needing:

$ yay -S qemu libvirt dnsmasq virt-manager bridge-utils flex bison iptables-nft edk2-ovmf

I then set up the additional daemons I needed (actually enabling them this time so they'd be available on future boots) and added my user to the libvirt, kvm, and docker groups.

$ sudo systemctl enable --now libvirtd    # start now *and* on future boots
$ sudo systemctl enable --now virtlogd
$ echo 1 | sudo tee /sys/module/kvm/parameters/ignore_msrs    # ignore unhandled guest MSR reads/writes
$ sudo modprobe kvm
$ lsmod | grep kvm    # confirm the kvm module (plus kvm_intel or kvm_amd) is loaded
$ sudo usermod -aG libvirt user    # "user" here is your own username
$ sudo usermod -aG kvm user
$ sudo usermod -aG docker user     # group changes take effect at next login


Just in case the LibGTK error really *was* an xhost problem after all that, I figured I'd follow the troubleshooting documentation I was using as well.

$ yay -S xorg-xhost
$ xhost +    # note: xhost + disables X access control entirely


Finally, I was able to create a container and boot into it for the first time using:

$ docker run -it \
    --device /dev/kvm \
    -p 50922:10022 \
    sickcodes/docker-osx:auto

Do note this run command makes you a fresh, brand new container every time you run it. You'll be able to see all of yours with `docker ps --all`.

So that I can install things in my container and reuse it, I'll (after this) use `docker start -ai <CONTAINER ID>` or `docker start -ai <NAME>` instead of another round of `docker run -it`, but you may wish to stick to the run command if you want a new container each time you start up.
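For example (the container name here is a hypothetical Docker-generated one):

# List all containers, running or stopped, to find the one to reuse.
$ docker ps --all
# Reattach to that existing container instead of creating a fresh one.
$ docker start -ai relaxed_maxwell    # or: docker start -ai <CONTAINER ID>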

I also ran into a small snag when I decided to update my system and then suddenly couldn't start containers anymore with either `docker run -it` or `docker start` following a kernel upgrade ("docker: Error response from daemon: failed to create endpoint dazzling_ptolemy on network bridge: failed to add the host (veth1e8eb9b) <=> sandbox (veth73c911f) pair interfaces: operation not supported"), which was exciting, but fixable with just a reboot.

The cause of this issue is a mismatch between the running kernel - the kernel still running from before the upgrade - and the most recent kernel modules on disk, which match the new, upgraded kernel instead of the running one. On reboot, we boot into our freshly updated Linux version, which matches our most recent modules, so Docker can once again load kernel modules which match the running kernel.
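One quick way to spot this state before rebooting (paths assume the usual Arch layout):

$ uname -r                # the kernel you are still running
$ ls /usr/lib/modules/    # the module tree(s) on disk, matching the upgraded kernel
# If these don't match, module loads (such as those Docker needs to create
# veth pairs) can fail until you reboot into the new kernel.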

Thus: Docker is either exactly as terrible as I recall or, worse, I had forgotten just about everything useful I used to know about it, but I think from here on out I'll be mostly okay.

24 March, 2021

Resume antipatterns

I get to look at a lot of resumes as part of my job. Each of these represents a thinking, feeling human being, but after a while it is very hard not to get picky about what makes it easier for people who don't have much time to read your document to digest it and get the information they need.

This blog post is not about the CV. That is a different document. Though there might be commonalities like publications and work experience, they should be represented differently. 

It is also specific to United States tech culture, and particularly the intersection of infosec and software development. What is perfectly acceptable in Germany or Vietnam (or CS academia, or claims adjusting, or ...) doesn't come across the same way here, and vice versa.

Looking at your resume, I'm trying to see a small number of things:

  • How well you communicate
  • Your experience in (languages|technologies|methodologies) as listed in the job description
  • Where you got that experience (startup? bootcamp? big tech companies?) so that I have a better idea about your familiarity with various styles of infrastructure and process and what will be different for you with our infrastructure and process
  • The sizes and types of teams you worked on previously
  • Whether you have technical leadership experience if you applied to a more senior role

You might consider having a peer in your field with more experience take a look at your resume before sending it in to see if they are able to grok your experience and aren't put off by formatting.

Specific peeves

None of these are from any particular resume or meant to call anyone out; I just want to talk about patterns I've noticed over the last couple of years.


Progress bars or star ratings

I don't know what "90% Javascript" means. Putting that stat right next to "50% threat modelling" makes me wonder whether you feel you are better overall at Javascript than at threat modelling, what such a comparison even means, and how to compare these two things at all.

It's better to just list Javascript and threat modelling as skills since then the reader can judge for themselves. If you've contributed to a bunch of core Javascript libraries or something, list a few of the most popular ones.

To people who don't know John, the following doesn't say anything useful about what he can do. (John is also trolling me.)

For approximately the same reason one should not plot disparate data on the same chart unless it can be graphed on the same axes at the same scale, putting progress bars or star ratings next to each other is confusing to the reader.

Overly personal details

Some flavour is good. If you are able, get a second opinion from someone outside your social group(s) before including something not super directly related to your field.

Including one or two minor and quasi-professional things at the very end of the document, especially something somewhat common in your field (CTF participation, ham radio licence, etc.), is great. It makes it easier to remember who you are when my team and I have been looking at resumes for an hour and yours is somewhere in the middle of the stack. If you choose to include something from this category, position it below items like publications/talks and certifications in the document. Do not include factoids like your taste in TV or music unless you were a session musician on the album in question or something.

Do not include your spouse's occupation, or your parents' occupations. This may be done sometimes elsewhere (and please do if you need to elsewhere) but it's not customary in this field, or country. Same goes for religion, children's ages, and street address. That kind of stuff is kind of uncomfortable to learn about someone one doesn't know without any other context, and not stuff that would come up in the workplace anyway unless you're close to your coworkers.

Design

This comes with the caveat that I don't have experience working in or hiring for design/UX/UI/marketing/growth hacking/etc, but for software engineering and infosec, please keep the funky tables, graphics, pie charts, and flourishes for other documents.


Source: https://commons.wikimedia.org/wiki/File:Server-side_websites_programming_languages.PNG

I just want to read this thing quickly and get a sense for the work you've done and what you know and whether you'd be a helpful pair of hands to have about for the projects my team needs to work on. Graphics are usually just something I have to puzzle through and interpret.

More than one or maybe two fonts will just make it hard to read continuously across the various areas of your resume. Fancy typography has got to be first and foremost readable for use in a resume.

As a very hyperbolic example, consider the following typography:

Source: https://www.metalsucks.net/2019/09/25/completely-unreadable-band-logo-of-the-week-win-a-grab-bag-of-metal-goodies-85/

Logos and other pictures

Source: https://www.cvplaza.com/cv-basics/logo-picture-on-a-cv/

Especially don't include logos if they don't quite render in the place you wanted in your PDF or would get mangled when your Word document comes through the recruiting system.

It is not customary in the tech industry in the United States to include a picture of yourself. Adding a headshot is customary in a lot of places, but I want to be able to consider your written accomplishments without bias toward or against the way you look.

Certifications

Some infosec people really don't like them. Some do. Some jobs you can't get without a CISSP. If you have enough work experience to show you can do the things I am looking for, it pretty much doesn't matter whether you have certs for plain old tech jobs (government jobs and so forth are not the same). Bug bounty contributions, open source contributions, or CVEs are a lot more interesting than whether you are good at taking tests.

If you're not applying for an entry-level role, don't include entry-level certificates like CEH if you don't also have something like OSCP to counterbalance them. 

However, if you are applying for an entry-level role, literally anything you can say on your resume to show you're at least interested in the work, even if you don't have experience, will help me understand that. CEH and similar are great in this case.

Extremely long resumes

I've seen folks with 20+ years' experience doing a wide variety of things fit all that in two pages or less just fine. Most resumes are one to two pages. You can do it. I believe in you.




04 February, 2021

Oncall dysfunctions

The main idea of having an oncall, as I understand it, is to share the burden of interruptish operations work so no one human on the team is a single point of failure. I have so far encountered several kinds of dysfunction in implementations of "oncall rotation" and would like to provide some ideas for other folks in similar situations on how to get to a better place.

What ownership of code or services means to a team seems to be a general contributing factor in oncall dysfunction. Not everyone needs to have 100% of the context for everything, but enough needs to be written down that someone with low context on the problem can quickly get pointed in the right direction to fix it.

This post is a) not perfect, b) limited by my own experience, and c) loosely inspired by this. Some or all of this may not be new information to the reader; this is not a new topic but I wanted to take a whack at writing about it.

Assumptions in this post:  

  • if you carry the pager for it, you should be able to fix it
  • nobody on a team is exempt from oncall if the team has an oncall rotation
  • the team is larger than one person 
  • the team owns things which have existing customers 

I felt like I had a lot of thoughts on this topic after tweeting about it, so here ya go. 

Everything is on Fire

Possible symptoms: Time cannot be allocated for necessary upkeep, perhaps due to management expectations or a tight release schedule for other work, leading to a situation where multiple aspects of a service or product are broken at once, or at least one feature is continually broken. The team is building on quicksand and not meeting their previous agreements with customers.

Ideas: A relatively reasonable way out of this situation is to push back on product or whoever in management sets the schedule, and make time during normal working hours to clean up the service. Even if the amount of time the engineers can negotiate to allocate to cleaning up technical debt and fixing things that are on fire is minimal, some is better than none. 

While spending less than 100% of the team's time on net new work could result in delays to the release schedule, if the codebase is a collapsing house of cards, every new feature just worsens the situation. Adding more code to code that isn't well hardened / fails in mysterious ways just means more possibility for weirdness in production.  

The right amount of upkeep work enables the team to meet customer service agreements while still leaving plenty of time for net new work. The balance of upkeep work to new work will vary by team, but if everything is really on fire across multiple products, getting to, say, 85% new to 15% upkeep may require multiple weeks' worth of 75% or even 100% upkeep, with no net new additions during that time.

On the other hand, even if there's some backlog of stuff to fix, that is okay (perhaps even unavoidable) as long as agreed upon objectives are met and it's possible to prioritize fixes for critical bugs. If the team does not have SLAs but customers are Not Pleased, consider setting SLAs as a team and socializing them among your customers (perhaps delivered with context to adjust expectations like "to deliver <feature> by <date>, we need to reduce <product> service for <time period> to <specifics of availability/capacity/latency/etc> in order to focus on improving the foundation so we can deliver <feature> on time and at an <availability/capacity/latency/etc> of <metric>").

When everything is on fire long enough, it becomes normal for everything to be on fire, and so folks don't necessarily fix things until they get paged for whatever it is. One possibility for a team of at least three is to have a secondary oncall for a period of time until things have quietened down. The person "up" on the secondary rotation dedicates at least part of their oncall period to working on the next tech-debt-backlog item(s) and serves as a reserve set of helping hands/second pair of eyes if the primary oncall is overwhelmed.

The next step could be to write runbooks or FAQs to avoid the situation turning into Permanent Oncall over time as individuals with context leave.

Permanent Oncall (Siloing)

Possible symptoms: The team has an oncall rotation, but it doesn't reflect the team's reality. The answer to asking whoever is oncall "what's wrong with <thing on fire>" most of the time is "I need <single point of failure person's name> to take a look", not "I will check the metrics and the docs and let you know by <date and time> if I need help". When the SPOF is out sick or on vacation, the oncall might attempt to fix the issue, but there isn't enough documentation for the issue to be fixed quickly. The team would add to the documentation if much of it existed, but starting from scratch for everything is too much of a burden for any one oncall. When the SPOF comes back, they may return to find everything on fire.

This can happen when a team has a few experienced folks, suddenly hires a lot, and is not intentional about sharing context and onboarding. It can also happen when context and documentation haven't been kept up over the life cycle of the team, and suddenly the folks who don't know some domain of the team's work outnumber those who do.

Ideas: Having a working agreement that includes a direct set of expectations regarding who owns what (the entire team should own everything the team owns if everyone on the team shares in oncall; the number of individuals with context on any one thing the team is responsible for, for each thing the team is responsible for, should be at least half the team, if the team size > 2) and how things are to be documented and shared can help here. 

If the team is to have a working agreement that actually reflects reality, the EM/PM/tech lead cannot dictate it, but can help shape it to meet their expectations and customer expectations. Even one person on the team feeling like their voice didn't get heard in the process of creating a working agreement could lead to that person becoming more dissatisfied with the team and eventually leaving. Working agreements can help with evening out the load between team members, or at least serve as a place where assumptions get codified so everyone is mostly on the same page about what should happen.

Some teams only work on one project at a time to try to prevent this (but this may be an unrealistic thing to try for teams with many projects or many customers). It can be hard to build context as an engineer on a project you have never gone into the codebase for if you are not the kind of person who learns best from reading things. If the entire team is not going to work on the exact same workstream(s) all the time, it is crucial to have good documentation to get to a place where oncall has some information on common failure modes in the service and common places to look for things contributing to those failures. Mixing and matching tasks across team members is hard at first, but if everyone is oncall for all the things, this is going to happen anyway and it's better to do it earlier so that people have context when it's crucial. Going the other way and limiting the oncall rotation for any service to just the segment of the team who wrote it or know it best is just another, more formal variant on the Permanent Oncall problem and is also best avoided.

Lots of folks do not enjoy writing documentation or may feel like it takes them too long to document stuff for it to be useful, but oncall is not a shared burden if there is not enough sharing of context. When the SPOF leaves because they're burnt out, the team is going to have to hustle to catch up or will transition into No Experts Here.

An alternative to having extensive runbooks, if they aren't something your team is willing or able to dedicate time to keeping up, might be regularly scheduled informal discussions about how the items the team owns work, combined with the oncall documenting in FAQs what the issue was and how they fixed it whenever something new to them breaks. An FAQ will serve most of the purpose of a runbook, but may not cover the intended functionality of the system.

Firefighters

Sometimes, teams have an oncall rotation, but one often more senior person swoops in and tries to fix every problem within a domain. This paralyzes the other team members in situations where the swooping firefighter team member is not available and encourages learned helplessness and siloing. This is an edge case of Permanent Oncall where the SPOF actually likes and may actively encourage the situation.

Sometimes, the firefighter doesn't trust the oncall or doesn't feel they are adequately capable of fixing the problem. This can be incredibly frustrating as the oncall, especially when there is adequate documentation for the system and its failure modes and the oncall believes they can come to a working fix in an acceptable amount of time. It is also likely the firefighter hoards their own knowledge and is not writing down enough of what they know.

Sometimes the firefighter is just a little overenthusiastic about a good bug, and management doesn't care how the problem gets solved. As I become more experienced in software engineering, I am finding areas within teams I have been part of where (despite having confidence the oncall will eventually be able to solve the problem) I am susceptible to offering a little too much advice when a thing I have been primarily responsible for building, or know a lot about, breaks and I happen to notice around the same time the oncall does. In cases like this I am working to write down more of what I know for the team.

Actively seeking out the advice of a teammate as the oncall is likely not an example of this pattern nor of Permanent Oncall in general unless every single bug in a system requires the same person or couple of folks to fix it.

No Experts Here

Possible symptoms: The team inherited something (service, set of flaky cron jobs...) secondhand and it is unreliable. Maybe it's a proof of concept that needs to be hardened; maybe it's an old app which has no existing documentation but is critical to some aspect of the business and someone has to be woken up at 3 am to at the very least kick the system so it gets back into a slightly less broken state. The backlog for this project is just bug reports and technical debt. Nobody takes responsibility and ownership for this secondhand service is not prioritized. Management is prioritizing work on other projects. This is an edge case of Everything is On Fire.

Ideas: Perhaps one or two individuals do a deep dive into the problem service, start a runbook or FAQs based on observations, and then present their findings to the team. As oncalls encounter issues, they will then have a place to add their findings and how they fixed the problem(s). The deep dive doesn't have to result in a perfect understanding of the service, just a starting point that the oncall can use to inform themselves and/or build on.

The team as a whole needs to be on board with the idea that as owners of <thing>, they must know it well enough to fix it, or it will always be broken. If only part of the team is on board, this turns into Permanent Oncall for those folks, which is also not ideal. If nobody has time and mental space for <thing>, it needs to be transferred to a group who do have the time and space to develop proper knowledge of it, or it needs to be deprecated and spun down.

No Oncall

Possible symptoms: Everything that breaks is someone else's problem. The team does not carry a pager, but has agreements with customers about the level of production service that will be provided.

Someone else without good docs or context (perhaps an SRE or a production engineer?) is in charge of fixing things in production in order to keep agreements with customers. Perhaps this person is oncall for many different teams and does not have the time to devote to gaining enough context on each of them.

Ideas: Some things that may help this situation and help the SRE out (if having people who don't carry the pager is, for some deeply ingrained cultural reason, unavoidable): the team and the SRE come to a working agreement specifying when and how much context to share, when the SRE will pull someone from the team in to help, who they will pull in at what points, and so on. If you must have someone not on the team oncall for the product, it may be useful to have the team run a "shadow" oncall so that the situation does not turn into Permanent Oncall for any one or two individuals.

Newbies

When and how folks get added to the rotation are also critical to making oncall a healthier and more equitable experience. Expecting someone to just magically know how to fix something they don't understand and have never worked on is a great way to make that person want to leave the team. Have a newer person shadow a few different folks on the existing rotation before expecting them to respond to production issues ("shadowing" == both the newer person and the experienced person get paged if something breaks, the experienced person explains their process for understanding and fixing the issue to the newer person as they go, and the newer person is able to ask questions before, during, and after the issue).

Conclusion

It is hopefully not news that software engineering is a collaborative profession that requires communication; being intentional about when and how agreements with customers happen, and how customer expectations get adjusted so they can be met, is crucial. It may be the case that the service(s) the team owns are noncritical enough that there doesn't need to be an oncall rotation outside business hours at all, and fixing issues as they can be prioritized is fine, but taking care to prioritize bugs and production problems at least some of the time is still necessary. It may also be the case that the team is distributed enough that oncall can "follow the sun" and be arranged to track business hours for all team members. There are lots of ways to learn of and fix bugs in a timely way to meet customer expectations, and no one way is perfect.