grack.com

Blog

There Is Life Before Main in Rust

Disclosures

🧠 This post is 100% human-written. Claude was used for feedback and to assist with the linker symbol diagram. Cursor was used for feedback and to ensure examples were compilable.

The author of this post is deeply interested in the topic of life-before-main: he is the author of the ctor crate, and the creator of the linktime project that we’ll be using in the examples below.

Every Rust binary has one thing in common: fn main(). If you come from the C world, that might be more familiar as int main(argc, argv). Some platforms might obfuscate it a bit more, but under the hood, every binary has an entrypoint.

We’re going to discuss what happens before main and what interesting things we can do there. In addition, we’ll be showing some novel techniques for mutable data that aren’t in common use in the Rust ecosystem today.

This post is a deep dive into some technical details of how Rust source becomes a Rust binary. Some background knowledge may be helpful to the reader, including:

Before main

What might not be familiar to most developers is how you get into the main function. You see, under the hood for every language is the runtime. C has one: the C runtime that you might recognize as libc. Rust also has its own runtime: the Rust standard library. And because C is the lingua franca of runtimes for most executable code 1, Rust builds its own runtime atop of C’s, effectively building its own higher-level abstraction encapsulating C’s.

A runtime is a bit fuzzy to define. It’s both the executable code that lives on disk and compilable headers and libraries used at compile time. But the purpose of a runtime is always the same: integrating developer code with the platform’s operating system.

There’s an entire ecosystem of processing that happens before the function you declared as main starts up. C uses this to configure allocation, file access, thread-local storage and other C runtime services. Rust uses this time to configure parts of its own language and runtime. Specifically, Rust has infrastructure to handle panics and unwinding. Rust also needs to translate the C-style program arguments 2 into its own std::env::args interface. The machinery for all this is visible in the Rust compiler project.

Runtimes make use of this pre-main phase because it guarantees (1) running before user code, and (2) a single-threaded, highly-consistent and predictably-ordered environment, which allow for reliable and deterministic initialization.

By not taking advantage of this environment, you are missing out on a very useful bootstrapping phase. We’ll see later on in this post how we can build some useful primitives making use of life before main.

Entry Points

A binary starts when the operating system’s loader 3 - the part of the OS that loads the binary into memory and sets up the environment - hands off control. The runtime is responsible for accepting the hand-off from the loader. There’s a platform-specific hook on every OS that accepts the hand-off - to some extent this is the real main. On Linux, the entry point is stored in the e_entry field of the ELF header, and by default, the linker places the address of a symbol named _start there. A similar hook exists on Windows, and boots the executable in a function named _WinMainCRTStartup. At this point the C runtime has a chance to configure itself, and the way that all runtimes do this is via initialization functions.

In early iterations of runtimes, bootstrapping was a static tree of function calls: initialize file I/O, initialize the allocator, etc. As runtimes became more complex, this tree of function calls became more complex, and binary sizes increased to absorb more C runtime functionality that they may or may not need.

Over time, linkers developed the ability to discard unused code before even writing the binary to disk (including unused parts of the C runtime), and with that came a need for a replacement for the static init call trees.

The most popular method 4 of declaring init code came from GCC: __attribute__((constructor)). The way this worked was to place a list of init functions into a contiguous chunk of the binary on disk. When the C runtime started, it could walk through each of these functions and call them, allowing various bits of the C runtime to request initialization without strongly coupling subsystems, and allowing the linker to jettison unused subsystems, init code and all.

Eventually the need for constructor ordering became important enough that constructors could be given a priority and run in a specific order, allowing the runtime to initialize subsystems before and after each other. E.g., the memory allocation (malloc) subsystem might be needed for buffered file I/O.

On most platforms 5, the linker was called in to do the priority work: each platform ended up with a way to prioritize the order in which data gets written to sections, which allowed for the C runtime to end up with a well-ordered list of function pointers 6.

We can even build an example of this by hand in Rust using the #[unsafe(link_section = "...")] attribute (try it in the Rust Playground):

/// Linux example: the modern glibc runtime uses `.init_array` to hold function
/// pointers, and a numeric suffix allows them to be ordered. Note that priorities
/// less than or equal to 100 are reserved for the runtime itself, so any code that
/// wants to use the C runtime must use a priority of 101 or higher.

// On Linux, `.init_array` holds _function pointers_, not functions.
// We can convert a function to a function pointer with one of the below
// blocks which is equivalent to this:
//
// #[used] // <-- without this, Rust might decide the init function is unused and remove it
// #[unsafe(link_section = ".init_array.NNNNN")] // <-- the section where we place the function pointer
// static INIT_ARRAY_FN_PTR: extern "C" fn() 
//     = function; // <-- the function pointer data: we assign the function to it
//
// extern "C" fn function() { ... } // <-- the function itself

#[used]
#[unsafe(link_section = ".init_array.101")]
static INIT_FN_FIRST: extern "C" fn() = const {
    extern "C" fn init() {
        println!("Initializing (first!)");
    }
    init
};

#[used]
#[unsafe(link_section = ".init_array.201")]
static INIT_FN_SECOND: extern "C" fn() = const {
    extern "C" fn init() {
        println!("Initializing (second!)");
    }
    init
};

fn main() {
    println!("Main!")
}

The examples in this post will work on Linux and various BSDs, but are not designed to be cross-platform examples. For example, macOS has start and stop symbols, but they are named differently 7. Windows does not support start and stop symbols, but has a set of rules for sorting sections that is effectively equivalent.

Because platforms are so widely variable, we’ll be introducing the ctor and link-section crates (from the linktime project) as a way to abstract away platform-specific differences and hide the general complexity of linker work.

The excellent inventory and linkme are two other very popular crates built on the same principles, but have limitations 8 that make them less suitable for the examples in this post.

If you’d like to learn more, the link-section crate contains a detailed report on platform-specific behaviour.

The ctor crate is designed to handle all of the boilerplate of registering constructors in a cross-platform way. This allows us to simplify our examples above to:

use ctor::ctor;

#[ctor(unsafe, priority = 101)]
fn init1() {
    println!("Initializing (first)!");
}

#[ctor(unsafe, priority = 201)]
fn init2() {
    println!("Initializing (second)!");
}

fn main() {
    println!("Main!")
}

Note that neither example explicitly calls the init functions. The linker organized them in a way that the C runtime called them for us!

Sections and Linker Scripts

The process in which constructors are linked isn’t mysterious, though. In fact, compilers allow you to name the location in the binary (on most platforms called a “section”) you want to place any of your data and/or code. And by extension, and as we saw above, Rust allows this as well. The challenge, as we will see, is making use of this organizational feature.

Linkers have been the key to C’s ability to target any form of binary for some time. Most linkers allow for developers to provide linker scripts - text files that live alongside your source code (which is compiled to object files) and instruct the linker on how those object files are assembled. Using a linker script, a single C file might become a Linux executable, or a block of raw assembly that lives in the boot sector of a hard drive.

Linker scripts also allow for defining virtual symbols - that is, symbols that don’t exist in any source file but can be used by C code to access pointers to the underlying data in the loaded binary.

Linker scripts are a complex topic and beyond the scope of this post, but we can easily find examples of them in the wild:

// Adapted from https://wiki.osdev.org/Linker_Scripts
SECTIONS
{
  .text.start (_KERNEL_BASE_) : {
    startup.o( .text )
  }

  .text : ALIGN(CONSTANT(MAXPAGESIZE)) {
_TEXT_START_ = .;
    *(.text)
_TEXT_END_ = .;
  }

  .data : ALIGN(CONSTANT(MAXPAGESIZE)) {
_DATA_START_ = .;
    *(.data)
_DATA_END_ = .;
  }
}

In the above example, the virtual symbols _TEXT_START_ and _TEXT_END_ are explicitly defined to point to the beginning and end of the .text section, respectively. The period in _TEXT_START_ = .; is a special syntax that refers to a location counter that resolves roughly to the current output address in the binary.

Linker Symbols

This trips up most developers that encounter it for the first time, but the linker is setting the address of the start and end symbols, and therefore where the static with the same name is placed, and not setting the value of symbols that are pointers. That is to say: the start and stop symbols aren’t a *const Type. The start and stop symbols carry no data themselves and are used for their addresses only! The section consists of the range of data between the start (inclusive) and stop (exclusive) symbols.

Specifying start and end symbols for every section can be complex and tedious in linker scripts, so many linkers 9 eventually gained a feature where they could automatically define symbols bounding all sections in the executable. E.g., for GNU toolchains, a section named MY_SECTION will automatically have symbols __start_MY_SECTION and __stop_MY_SECTION defined. macOS has a similar pattern where it synthesizes a section$start and section$end symbol for each section.

In the GNU linker, those sections not explicitly defined in the linker script are called “orphan sections” 10. One important thing to note: if (and only if!) a section’s name is compatible with a C symbol name, the linker will automatically define a _start- and _stop-prefixed symbol for the section. In the example you’ll see below, the section name our_strings that we used works, but if we had chosen our.strings or .our_strings it would not have!

You’ll see in the example below that the start and stop symbols are MaybeUninit<()>. The boundary symbols contain no data, and only their address is significant.

The ideal Rust type for these would be an “opaque external type” (this would be implemented by the extern_types feature). As these are not currently implemented in Stable Rust, MaybeUninit is a stand-in. It signifies to the compiler that the data is uninitialized, and generally not safe to read via reference. Since taking a &raw const pointer to a static item is always valid, however, we can still safely capture its address without ever reading its value.

Try it in the Rust Playground:

use std::mem::MaybeUninit;

#[used]
#[unsafe(link_section = "our_strings")]
static FIRST_STRING: &'static str = "Hello, ";

#[used]
#[unsafe(link_section = "our_strings")]
static SECOND_STRING: &'static str = "world!";

// Note: these are not pointers. Instead, the linker has placed
// the boundary symbols STATIC_STRING_START and STATIC_STRING_END at
// the start and end of the section!
unsafe extern "C" {
    #[link_name = "__start_our_strings"]
    static STATIC_STRING_START: MaybeUninit<()>;
    #[link_name = "__stop_our_strings"]
    static STATIC_STRING_END: MaybeUninit<()>;
}

fn main() {
    let strings: &'static [&'static str] = unsafe {
        // SAFETY: get the addresses of the start and end symbols without
        // reading them.
        let start = &raw const STATIC_STRING_START as *const &'static str;
        let end = &raw const STATIC_STRING_END as *const &'static str;
        std::slice::from_raw_parts(start, end.offset_from(start) as usize)
    };
    
    // "Hello, world!"
    println!("String: {}", strings.join(""));
}

The link-section crate is designed to abstract away the details of these linker sections and convert them into traditional Rust slices with all standard slice operations available. We can use it to simplify the example above to:

use link_section::{in_section, section};

#[section(typed)]
static OUR_STRINGS: link_section::TypedSection<&'static str>;

#[in_section(OUR_STRINGS)]
static FIRST_STRING: &'static str = "Hello, ";

#[in_section(OUR_STRINGS)]
static SECOND_STRING: &'static str = "world!";

fn main() {
    println!("String: {}", OUR_STRINGS.join(""));
}

In these examples we’re submitting items to the link section in a single module within a single crate, but that’s not a requirement. In fact, the power of link sections is that you can submit items to a link section from any crate that contributes code to a binary - the linker will gather them all together just before writing the final binary.

Dependency Injection

The registration pattern we’re about to build is Dependency Injection by another name. This is a well-known pattern: frameworks like Dagger and Spring are built on the same principle that consumers of registration data should not be coupled to the providers of that data. A provider registers data at its definition site, a consumer simply reads the registry.

What’s somewhat different with linker sections versus traditional DI is that in DI the framework often needs to walk the module graph or scan loaded classes at startup to discover both providers and consumer sites. With linker sections, this magic is handled when the binary is written. The linker is the one that gathers all of the provider data and makes it trivially available to the consumer.

The example below uses a link_section::section to register CLI subcommands and is an instance of this pattern. More complex projects like Turbopack use this pattern to register string-pool constants, and the registration machinery used for serialization/deserialization and turbotask incremental compilation functions. A hypothetical webserver could make use of this pattern to register routes and middleware that are automatically collected at build time. The core mechanism is the same: the contributors place data into a shared registration system from any crate in the dependency tree, and the consumer reads the collected data without having to know where it was provided from.

Using Sections for Registration

One advantage we have in doing work before main is that it is well-behaved. No threads are running unless we start them. This means we are able to avoid the complexity of locks and other synchronization primitives in many cases, and that we can explicitly split our writable and immutable phase of our data’s lifecycle clearly: before and after main. And because of that, accessing data in the running program can become both simpler and more efficient by avoiding the need to acquire and release locks.

First, we’ll define our subcommand, a const constructor function, and a #[section] to collect them:

use std::collections::VecDeque;
use std::path::Path;

use link_section::{in_section, section};

struct CliSubcommand {
    is_default: bool,
    name: &'static str,
    description: &'static str,
    f: fn(&Path, &[String]),
}

impl CliSubcommand {
    const fn new(name: &'static str,
                 description: &'static str, 
                 f: fn(&Path, &[String])) -> Self {
        Self { is_default: false, name, description, f }
    }

    const fn new_default(name: &'static str,
                        description: &'static str, 
                        f: fn(&Path, &[String])) -> Self {
        Self { is_default: true, name, description, f }
    }
}

#[section(typed)]
static CLI_SUBCOMMANDS: link_section::TypedSection<CliSubcommand>;

Then we’ll register subcommands - these can live anywhere in your code:

mod list {
    #[in_section(CLI_SUBCOMMANDS)]
    static CLI_SUBCOMMAND_LIST: CliSubcommand = 
        CliSubcommand::new("list", "List all items", |_exe, _args| {
            println!("Listing all items");
        });
}

mod add {
    #[in_section(CLI_SUBCOMMANDS)]
    static CLI_SUBCOMMAND_ADD: CliSubcommand = 
        CliSubcommand::new("add", "Add a new item", |_exe, _args| {
            println!("Adding a new item");
        });
}

mod help {
    #[in_section(CLI_SUBCOMMANDS)]
    static CLI_SUBCOMMAND_HELP: CliSubcommand = 
        CliSubcommand::new_default("help", "Show help", |exe, _args| {
            println!("Usage: {} <subcommand> [options]", exe.display());
            println!();
            println!("Subcommands:");
            for subcommand in CLI_SUBCOMMANDS {
                println!("  {}: {}", subcommand.name, subcommand.description);
            }
        });
}

And then in our main function we can dynamically dispatch to any registered subcommand without ever having to know what they are or where they live. It only needs to be able to see the CLI_SUBCOMMANDS section definition:

fn main() {
    let mut args: VecDeque<String> = std::env::args().collect();
    let exe = args.pop_front().expect("No executable name provided");
    let exe = Path::new(&exe);
    let subcommand_name = args.pop_front().unwrap_or_default();
    let rest: Vec<String> = args.into();

    // Try to find the subcommand by name
    for cmd in CLI_SUBCOMMANDS {
        if cmd.name == subcommand_name {
            (cmd.f)(exe, &rest);
            return;
        }
    }
    // If no subcommand was found, fall back to the default subcommand
    for cmd in CLI_SUBCOMMANDS {
        if cmd.is_default {
            (cmd.f)(exe, &rest);
            return;
        }
    }
}

Running the code above works as you’d expect:

$ ./cli
Usage: ./cli <subcommand> [options]

Subcommands:
  list: List all items
  add: Add a new item
  help: Show help

$ ./cli list
Listing all items

Beyond Immutable Data

This section deals with some more advanced topics. Familiarity with Rust Atomics and Locks, or at least reading the first chapter on the basics of Rust concurrency, is recommended!

The example above assumes that the linked data is immutable. But that’s only half the power of using linker organization for data. Mutability in global static data is a common problem with well-known solutions in standard Rust. We could potentially use Rust’s built-in tools for interior mutability like mutexes, or atomic types, for example. Each of those comes with some runtime cost. If they are “uncontended” they aren’t expensive, but they are not necessarily free. 11

But what if we want to minimize the overhead of runtime data access? Immutable data is trivial: Rust allows safe concurrent access to immutable data by default 12. Rust has strict requirements for mutable data, however. There are two requirements to safely mutate data: (1) the modifications must be done in a thread-safe manner, and (2), there must never be more than one reference to the data if a mutable reference exists.

At the beginning of this post, we mentioned that life-before-main is a useful place to bootstrap because no threads are running unless we start them. And the solution to (1) is trivial if the data is currently accessible to a single thread only! We don’t need to do anything atomically; we only need to ensure that all of the changes we make to that data “happen before” any reads to the data. In a single-threaded environment, “happens before” is automatic 13. This means that we can mutate data in a link-section before main and it will be safe to access, lock free, from any thread after main.

The resolution for (2) is similar: as long as we only ever take a mutable reference (and only a mutable reference!) before main, there will never be more than one reference to the data when a mutable reference exists.

The pre-main environment satisfies both (1) and (2), without needing to reach for locks or other synchronization primitives.

There’s also one additional gotcha with linker sections we need to be very careful with: the slice that contains all of the items in a section is an alias to the static item that lives in the section. The rules about aliasing apply to both the slice and the static item, and you must ensure that static items are placed in UnsafeCell to safely mutate them from the slice 14. Rust does not allow a static item to be modified through other means. With static items that aren’t wrapped in an UnsafeCell, LLVM may consider itself free to cache, reorder or otherwise make assumptions about the data itself. UnsafeCell itself is not Sync, so you’ll need to add your own wrapper types on top of this!

Note that in the example below, we’re now using MaybeUninit<SyncUnsafeCell<...>> for the boundary symbols and SyncUnsafeCell<...> for the items.

Because we’re planning on sorting the slice, we need to tell Rust that the slice items are not immutable so the data doesn’t end up in read-only memory. By using a type that includes UnsafeCell - a semantic signal that Rust uses to indicate interior mutability - the Rust compiler will then know to place that data in a part of the binary that can be written.

On some platforms (Windows in particular), omitting this from the data items will cause segmentation faults when trying to sort the slice. On other platforms (AIX for example), section mutability is part of a section’s identifier, so the boundary symbols’ mutability needs to match the section’s mutability!

Let’s walk through an example of how we might otherwise do something like this. We’re going to build a string interning pool defined entirely at link-time, and add a wrinkle: we want to be able to sort the slice of interned strings at runtime so we can quickly intern a string by value if needed via binary search (try it in the Rust Playground):

use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
#[cfg(debug_assertions)]
use std::sync::atomic::{AtomicBool, Ordering};

/// Nightly Rust offers a built-in `SyncUnsafeCell`. This is a minimal
/// reimplementation of that: 
/// <https://doc.rust-lang.org/std/cell/struct.SyncUnsafeCell.html>
#[repr(transparent)]
struct SyncUnsafeCell<T: ?Sized>(UnsafeCell<T>);
// SAFETY: safety burden of UnsafeCell is placed entirely on the user
unsafe impl<T: ?Sized + Sync> Sync for SyncUnsafeCell<T> {}

macro_rules! intern_string {
    ($name:ident, $string:literal) => {
        #[allow(unused)]
        const $name: &'static str = const {
            // This is not a common pattern, but it's entirely valid
            // to nest static items inside of const blocks. 
            // You can think of this as a way to hide the symbols
            // in a completely anonymous namespace.
            const VALUE: &str = $string;
            // Safety note: this static must _never_ be used. This is
            // purely a submission to the linker and _any_ access to it
            // may be UB.
            #[used]
            #[unsafe(link_section = "our_strings")]
            static ITEM: SyncUnsafeCell<&'static str> = 
                SyncUnsafeCell(UnsafeCell::new(VALUE));
            VALUE
        };
    };
}

intern_string!(WORLD, "world");
intern_string!(EXCLAMATION, "!");
intern_string!(HELLO, "hello");
intern_string!(FROM, "from");
intern_string!(RUST, "Rust");

unsafe extern "C" {
    #[link_name = "__start_our_strings"]
    static STATIC_STRING_START: MaybeUninit<SyncUnsafeCell<()>>;
    #[link_name = "__stop_our_strings"]
    static STATIC_STRING_END: MaybeUninit<SyncUnsafeCell<()>>;
}

/// Debug check to make sure the slice is sorted once and only once. This _could_
/// be enabled in release mode without any major performance impact, but we have enough
/// guarantees in place. Note that atomic access _does_ establish some memory ordering
/// guarantees, but the soundness guarantees are upheld with or without this atomic check.
#[cfg(debug_assertions)]
static SLICE_IS_SORTED: AtomicBool = AtomicBool::new(false);

// Implementation note: this function must not be called before `SORT_STRINGS_CTOR` has
// run.
fn interned_strings() -> &'static [&'static str] {
    // We use Acquire/Release pairing as a double-initialization check
    #[cfg(debug_assertions)]
    debug_assert!(SLICE_IS_SORTED.load(Ordering::Acquire), "Oh no! Slice was not sorted!");

    // SAFETY: we are calling this after main and we can guarantee that no
    // mutable reference is still alive. Since we know that no other code
    // is running before main, and that `SORT_STRINGS_CTOR` will run before main,
    // we can guarantee creating these slices is safe as 1) the sort "happens-before"
    // any access and 2) the mutable reference has been closed before any read-reference
    // access (satisfying aliasing XOR mutability requirement).
    let strings: &'static [&'static str] = unsafe {
        let start = &raw const STATIC_STRING_START as *const &'static str;
        let end = &raw const STATIC_STRING_END as *const &'static str;
        std::slice::from_raw_parts(start, end.offset_from(start) as usize)
    };
    strings
}

// Implementation note: this function assumes the slice has been sorted. See
// the guarantee above on `interned_strings` for reasoning.
fn maybe_intern_string(s: impl AsRef<str>) -> Option<&'static str> {
    let s = s.as_ref();
    let strings = interned_strings();
    strings.binary_search(&s).ok().map(|index| strings[index])
}

// SAFETY: We use the reserved `.init_array.0` priority because we do not
// access any C runtime functions (sort_unstable does not allocate) and we
// want to run before all other code. `.init_array.101` would work in our
// case, but this prevents other early-init code from accidentally running
// in the wrong order. `SLICE_IS_SORTED` is a debug check to make sure that
// doesn't happen. Note that all early-init code is tagged with `unsafe` so
// it always needs to be aware of safety guarantees of all APIs it touches.
#[used]
#[unsafe(link_section = ".init_array.0")]
static SORT_STRINGS_CTOR: extern "C" fn() = const {
    extern "C" fn sort_strings() {
        // We use Acquire/Release pairing as a double-initialization check
        #[cfg(debug_assertions)]
        debug_assert!(!SLICE_IS_SORTED.load(Ordering::Acquire), "Oh no! Sorted twice?!?");
        // SAFETY: we are calling this before main and we can guarantee that no
        // reference from `interned_strings` exists yet because we know no other
        // threads will be running, and we're not calling `interned_strings` yet.
        let strings: &mut [&'static str] = unsafe {
            // SAFETY: the bounds markers are not mutable, but we can safely
            // cast them to mutable pointers because we know the data behind
            // them is stored within `UnsafeCell` which is Rust's way of 
            // giving us interior mutability.
            let start = &raw const STATIC_STRING_START as *mut &'static str;
            let end = &raw const STATIC_STRING_END as *mut &'static str;
            std::slice::from_raw_parts_mut(start, end.offset_from(start) as usize)
        };
        strings.sort_unstable();
        #[cfg(debug_assertions)]
        SLICE_IS_SORTED.store(true, Ordering::Release);
    }
    sort_strings
};

fn main() {
    for (i, s) in interned_strings().iter().enumerate() {
        println!("[{i}]: {s}");
    }

    println!("{}, {}{}", HELLO, WORLD, EXCLAMATION);
    println!(
        "{}, {}{}",
        maybe_intern_string("hello").unwrap(),
        maybe_intern_string("world").unwrap(),
        maybe_intern_string("!").unwrap()
    );
}

The above example is pretty heavy (and a bit thicker thanks to the generous commentary), but it’s a good example of how much boilerplate crates like ctor and link-section can save you.

The equivalent using those crates can make use of the TypedMutableSection and a ctor to ensure the items are sorted before main. Note that the requirements for TypedMutableSection are that the items must be const - the reason is that the mutable section uses a similar style of code to the manually-implemented example above.

//! String interning pool using `ctor` and `link-section`.
use ctor::ctor;
use link_section::{in_section, section};

#[section(mutable)]
static INTERNED_STRINGS: link_section::TypedMutableSection<&'static str>;

#[in_section(INTERNED_STRINGS)]
const WORLD: &'static str = "world";

#[in_section(INTERNED_STRINGS)]
const EXCLAMATION: &'static str = "!";

#[in_section(INTERNED_STRINGS)]
const HELLO: &'static str = "hello";

#[in_section(INTERNED_STRINGS)]
const FROM: &'static str = "from";

#[in_section(INTERNED_STRINGS)]
const RUST: &'static str = "Rust";

#[ctor(unsafe)]
fn sort_strings() {
    let strings: &mut [&'static str] = unsafe { INTERNED_STRINGS.as_mut_slice() };
    strings.sort_unstable();
}

fn maybe_intern_string(s: impl AsRef<str>) -> Option<&'static str> {
    let s = s.as_ref();
    let strings = INTERNED_STRINGS.as_slice();
    strings.binary_search(&s).ok().map(|index| strings[index])
}

fn main() {
    for (i, s) in INTERNED_STRINGS.iter().enumerate() {
        println!("[{i}]: {s}");
    }

    println!("{}, {}{}", HELLO, WORLD, EXCLAMATION);
    println!(
        "{}, {}{}",
        maybe_intern_string("hello").unwrap(),
        maybe_intern_string("world").unwrap(),
        maybe_intern_string("!").unwrap()
    );
}

This particular example isn’t impossible without link sections, of course. What we get from the patterns we’ve discussed in this post are three things: (1) the guaranteed aggregation of tagged items, with all data pre-allocated and contiguous in memory; (2) the ability to distribute registrations anywhere in the code; and (3) a guaranteed count of the items in the section.

One major benefit that falls out of the three advantages above is that link sections require no allocations. If we were to rewrite this without link sections we’d be allocating a HashMap, Vec or other data structure, and potentially resizing it a number of times as we gather items (because we don’t actually know how many items we’ll have until runtime!).

The second major benefit that falls out is the Inversion of Control. The dependency graph for a traditional “collection” approach looks like below, with shared types deeply nested in the dependency graph, modules depending on that shared types module, and then a collector module that depends on all those modules to collect their types:

Shared Types

Module A

Module B

Module C

Collector

The change doesn’t seem large, but there is a large impact: the collector can now live anywhere and no longer needs to care what modules are contributing data:

(anywhere!)

Shared Types

Collector

Module A

Module B

Module C

And of course, we aren’t just limited to slices: you’ll find that there are analogues to many data-structures with link-time support in the scattered-collect crate:

  • Scattered*Slice: Various Vec-like structures that provide slices (and optionally sorting).
  • ScatteredMap/ScatteredSet: An analogue to HashMap/HashSet that provides hashed key-to-value lookup with some minimal pre-main initialization.

But Seriously: When Not to Use This

Link-time computation is fun and powerful, and it’s not always the right tool for the job. There is often a non-link-time equivalent: manually collecting data in crates that have visibility into each crate that wishes to contribute data. This can be inconvenient at times - instead of the contributors seeing a single contribution point “upstream” in a core crate, a “collector” crate with lots of crate references is required to collect them all.

Dead-code elimination becomes challenging: the link-section crate (and the linkme equivalent) both decorate all items using #[used], so the linker is disallowed from pruning unused data. Figuring out how to make link-time collection and dead-code elimination work well together is a complex problem beyond the scope of this post. For smaller bits of data like interned string atoms this might not be a problem, but if a program wants to intern larger chunks of data like chunks of raw JSON/JavaScript, or extensive data structures, this may add up to a lot of dead-code that may be difficult to identify.

Pre-main constructor functions have limitations: they cannot panic 15, Rust does not guarantee that all stdlib functions are available, and the order that the initialization functions are called within a given priority level is not guaranteed and highly platform-dependent. With careful planning, these limitations may be worked around but life-before-main may not be correct for subtle and difficult-to-debug reasons.

At this time, Miri is not fully compatible with all pre-main constructors and link-section constructions: it has a very basic view of pre-main execution, and does not model link sections at all. This may improve over time, but as of the time of writing, LLVM sanitizers (ASan, TSan, and others) are the recommended way to test for undefined behaviour.

The Inversion of Control pattern also has a cost: it makes it potentially harder to audit all the places that contribute data to a link section.

In reality, many widely-deployed and heavily-used Rust programs already rely on pre-main functionality: the ctor, link-section, inventory and linkme crates are used by many downstream crates today.

Briefly, on WASM

The examples above omitted a fairly important platform, though for good reason. WASM does not currently support linker sections natively because of an inconvenient choice many years back (51088 and 52353 for more details). Instead of allowing the #[link_section] annotation to place items in a true code section, the items are placed in a WASM custom section which is inaccessible to the WASM code itself!

The linktime crates do support WASM and have an emulation workaround that makes the approaches work for WASM binaries, but the author of this post hopes to make a suggestion in the near future on how proper WASM support could be added!

Conclusions

You can do a lot before main, and the benefits of doing so are pretty significant for certain cases. It’s a highly-ordered, highly-controllable environment that lets you more confidently do a lot of work without locks, atomics and other synchronization primitives. Link sections give you arbitrary aggregation and co-location of related data across your whole binary without awkward crate dependency order. In a lot of cases you can even avoid allocations completely which helps keep you away from one of the worst allocator sins: churn of allocations leading to fragmentation.

For further reading, check out the various crates discussed in this post:

  • ctor: Module initialization functions that run before main
  • dtor: Not discussed in this post, but the shutdown analogue to ctor
  • link-section: Linker-managed typed (slices) and untyped sections, with mutability support.
  • scattered-collect: Linker-managed higher-level collections: slices, sorted slices, maps

Thanks

Thanks to my lovely wife Mia, Benjamin Woodruff and Luke Sandberg, as well as @ssokolow for their feedback and review. This post would not have been possible without their help.


  1. Go is a notable exception in that it avoids the C runtime on some platforms, only using libc as an ABI stability boundary on platforms that require it e.g., Apple’s libSystem.dylib and OpenBSD’s libc. 

  2. On Windows, these are DOS-style (which were in turn derived from CP/M-style) arguments. 

  3. Before the loader runs, the program is just some bytes on a disk and the loader (which can be the kernel itself or a user-space system component like ld.so on Linux) maps those bytes into memory and hands off control. 

  4. The most popular method… in the humble author’s opinion. 

  5. macOS does not support this. The C runtime does its own initialization and then just runs every user constructor function in the order the linker saw them. 

  6. AIX has a special symbol naming convention for constructor functions: the sinit prefix, followed by a hexadecimal priority value. 

  7. This will be discussed later in the post, but macOS synthesizes a section$start and section$end symbol for each section instead of a __start_ and __stop_ symbol. 

  8. linkme creates distributed slices, but does not currently support WASM, and does not support mutable section data required to sort a section. inventory supports WASM, but requires a ctor-like function per item in the section. 

  9. The Windows linker does not support this feature, but instead defines an overall sort order for symbols that is effectively equivalent. 

  10. Orphan sections have a complicated algorithm for placement. 

  11. For example, an atomic value must always be re-read, and that may incur use of CPU cache which is pretty numerous these days, but definitely not infinite. 

  12. As long as it’s Sync. Sync means that it’s safe to share a reference to the data between threads. 

  13. This is a complex topic, and Rust Atomics and Locks is the best resource for learning more. Starting a new thread means that all the previous writes “happen before” anything on the new thread, but we’ll leave the proof of this for the reader (or possibly a future post). 

  14. Even taking a reference to a mutable static is disallowed by default in Rust 2024! 

  15. Well, more accurately they can panic but they shouldn’t panic. In fact you’ll get a double panic: thread '<unnamed>' panicked at ...: pre-main panic! and then immediately thread '<unnamed>' panicked at ...: panic in a function that cannot unwind. 

Deriving a Bit-Twiddling Hack: Signed Integer Overflow

As a thin layer on top of assembly language, C’s integer arithmetic APIs have always been minimal, effectively just mapping the underlying assembly opcodes to the C arithmetic operators. In addition, while unsigned arithmetic can safely overflow in C, signed arithmetic overflow is considered undefined behaviour, and UB can end up in heartache for C developers.

More modern languages like Rust have a much richer integer API, however. By default, using the standard addition and subtraction operators in debug mode, integer APIs will panic! on overflow or underflow. The same operators will wrap in two’s-complement mode in release mode, though the behaviour is considered defined. If the developer wants to specify carrying, wrapping, checked, or saturating operations, APIs for each of these modes are available.

We don’t have these convenient APIs available in C yet (see the epilogue for some nuance), but it would be great to have them. Given that signed arithmetic overflow is undefined behaviour, can we build a function with the following C signature that works?

bool will_add_overflow(int32_t a, int32_t b);

All modern machines use two’s complement representation for negative integers, but C was developed when computing was experimenting with other forms of representing signed integers. In this post, we will safely assume that all processors we’re targetting on are not using one of the alternative forms. If you find yourself programming for esoteric machines, you may wish to consult your local manual or guru.

A valid, quick-and-dirty solution to this would be to use integer promotion and add 64-bit signed integers instead, checking to see if the result is within range of an int16_t:

bool will_add_overflow_64bit(int32_t a, int32_t b) {
    // a and b are promoted to 64-bit signed integers
    int64_t result = (int64_t)a + b;
    if (result < INT32_MIN || result > INT32_MAX) {
        return true;
    }
    return false;
}

No undefined behaviour! It also has the advantage of being easily read and obviously correct. But this required sign-extending two 32-bit numbers and performing two 64-bit additions:

will_add_overflow_64bit:
        movsxd  rax, edi
        movsxd  rcx, esi
        add     rcx, rax
        movsxd  rax, ecx
        cmp     rax, rcx
        setne   al
        ret

We can do better by taking advantage of the fact that on two’s-complement machines, addition is bitwise-identical between signed and unsigned numbers so long as you ignore carry, overflow, underflow and any other flags. In addition, the C specification (C99 6.3.1.3 ¶2) guarantees that the bit pattern will be preserved on a two’s-complement system.

We know that unsigned overflow is not UB, and we know that we can only overflow if a > 0 and b > 0, and we can only underflow if a < 0 and b < 0. If either a or b is zero, we’re safe. We also know that adding two positive integers must result in a positive result if no overflow occurred. For two negative integers, the result must also be negative. If we find that the sign of the sum does not match the sign expected, we’ve wrapped around!

bool will_add_overflow_if(int32_t a, int32_t b) {
    // Explicitly convert to uint32_t and then back
    int32_t c = (int32_t)((uint32_t)a + (uint32_t)b);
    if (a > 0 && b > 0 && c < 0) {
        return true;
    }
    if (a < 0 && b < 0 && c >= 0) {
        return true;
    }
    return false;
}

And we get a fairly hefty assembly representation:

will_add_overflow_if:
        lea     ecx, [rsi + rdi]
        test    edi, edi
        jle     .LBB2_3
        test    esi, esi
        jle     .LBB2_3
        mov     al, 1
        test    ecx, ecx
        jns     .LBB2_3
        ret
.LBB2_3:
        test    esi, edi
        sets    dl
        test    ecx, ecx
        setns   al
        and     al, dl
        ret

This is arguably a bit worse, as now we have a branch in the mix. But we can start to see a pattern here:

abcresult
> 0> 0< 0true
< 0< 0>= 0true

In two’s-complement, the expression x < 0 is equivalent to the expression (x & 0x80000000) == 0x80000000. Similarly, x >= 0 is equivalent to (x & 0x80000000) == 0.

Let’s create a NEG macro with the above expression and reproduce our pseudo-truth table in code. Note that we’ll also collapse the if statements into a single boolean expression so we can eliminate those branches:

bool will_add_overflow_expression(int32_t a_, int32_t b_) {
    // Explicitly work with uint32_t in this function
    uint32_t a = (uint32_t)a_, b = (uint32_t)b_;
    uint32_t c = (uint32_t)a + (uint32_t)b;
    #define NEG(x) (((uint32_t)(x) & 0x80000000) == 0x80000000)
    return ((!NEG(a) && !NEG(b) && NEG(c)) 
        || (NEG(a) && NEG(b) && !NEG(c)));
    #undef NEG
}

This is looking better, but because we’re using short-circuiting logic, those branches are still there: we still have a jump!

will_add_overflow_expression:
        mov     eax, esi
        or      eax, edi
        setns   dl
        mov     ecx, esi
        add     ecx, edi
        sets    al
        and     al, dl
        test    edi, edi
        jns     .LBB3_3
        test    al, al
        jne     .LBB3_3
        test    esi, esi
        sets    dl
        test    ecx, ecx
        setns   al
        and     al, dl
.LBB3_3:
        ret

We can get rid of the branches by using non-short-circuiting bitwise operators:

bool will_add_overflow_bitwise(int32_t a_, int32_t b_) {
    uint32_t a = (uint32_t)a_, b = (uint32_t)b_;
    uint32_t c = (uint32_t)a + (uint32_t)b;
    #define NEG(x) (((uint32_t)(x) & 0x80000000) == 0x80000000)
    return ((!NEG(a) & !NEG(b) & NEG(c)) 
        | (NEG(a) & NEG(b) & !NEG(c)));
    #undef NEG
}

And now it’s starting to look pretty compact (though we can do better):

will_add_overflow_bitwise:
        lea     ecx, [rsi + rdi]
        mov     eax, esi
        or      eax, edi
        and     esi, edi
        xor     eax, esi
        not     eax
        and     eax, ecx
        xor     eax, esi
        shr     eax, 31
        ret

Notice that the assembly gives us a bit of a hint here that repeated use of our macro isn’t actually necessary. The sign bit we’re interested in isn’t tested until the end of the function! Because we’re testing the same bit in every part of the expression, and bits in a given position only interact with other bits in the same position, we can pull that bit test out of the whole expression:

bool will_add_overflow_bitwise_2(int32_t a_, int32_t b_) {
    uint32_t a = (uint32_t)a_, b = (uint32_t)b_;
    uint32_t c = (uint32_t)a + (uint32_t)b;
    #define NEG(x) (((uint32_t)(x) & 0x80000000) == 0x80000000)
    return NEG((~a & ~b & c) | (a & b & ~c));
    #undef NEG
}

We can also make use of the knowledge that testing the sign bit is the same as an unsigned shift right:

bool will_add_overflow_bitwise_3(int32_t a_, int32_t b_) {
    uint32_t a = (uint32_t)a_, b = (uint32_t)b_;
    uint32_t c = (uint32_t)a + (uint32_t)b;
    return ((~a & ~b & c) | (a & b & ~c)) >> 31;
}

Not too bad! But let’s revisit the truth table and instead use the value of the sign bit directly. What we see is that a and b need to be the same value, and c needs to be the opposite value:

abc
110
001

This truth table shows that what we ultimately want to test is this:

(a == 1 && b == 1 && c == 0) || (a == 0 && b == 0 && c == 1)

… but with a bit of work, we can simplify this down to two shorter expression candidates:

(a == b) && (a == !c)
(c == !a) && (c == !b)

For bit twiddling like we’re doing here, xor (^) can work like a “not-equals” operator (outputs 1 iff the inputs are 0,1 or 1,0), which means we can re-write our two expressions like so:

~(a ^ b) & (c ^ a)
(c ^ a) & (c ^ b)

By looking at those two options, is there a hint that one might be cheaper to implement? Let’s plug both into the compiler and see what we get!

bool will_add_overflow_optimized_a(int32_t a_, int32_t b_) {
    uint32_t a = (uint32_t)a_, b = (uint32_t)b_;
    uint32_t c = (uint32_t)a + (uint32_t)b;
    return (~(a ^ b) & (c ^ a)) >> 31;
}

bool will_add_overflow_optimized_b(int32_t a_, int32_t b_) {
    uint32_t a = (uint32_t)a_, b = (uint32_t)b_;
    uint32_t c = (uint32_t)a + (uint32_t)b;
    return ((c ^ a) & (c ^ b)) >> 31;
}

And the resulting compiled versions:

will_add_overflow_optimized_a:
        lea     eax, [rsi + rdi]
        xor     eax, edi
        mov     ecx, edi
        xor     ecx, esi
        not     ecx
        and     eax, ecx
        shr     eax, 31
        ret
will_add_overflow_optimized_b:
        lea     eax, [rsi + rdi]
        xor     edi, eax
        xor     eax, esi
        and     eax, edi
        shr     eax, 31
        ret

We have a clear winner here: the compiler can do a much better job with (c ^ a) & (c ^ b). This is most likely because of the common sub-expression and the removal of the bitwise-not operator.

We can also confirm that there’s no known undefined behaviour by compiling it with clang’s -fsanitize=undefined feature. No UB warnings are printed, which means no UB was detected!

Epilogue

While this is the fastest we can get with bog-standard C99, this isn’t necessarily the best we can do.

Rust makes use of the compiler intrinsics to access the overflow flag of the processor directly:

pub fn add(a: i32, b: i32) -> bool {
    a.checked_add(b).is_none()
}
example::add:
    add     edi, esi
    seto    al
    ret

It turns out that both GCC and LLVM have C intrinsics that you can use. While they are non-portable to some compilers, they drastically simplify the assembly output!

bool will_add_overflow_intrinsic(int32_t a, int32_t b) {
    int32_t result;
    return __builtin_add_overflow(a, b, &result);
}

And, just like with the Rust compiler above, this generates optimal assembly!

will_add_overflow_intrinsic:
        add     edi, esi
        seto    al
        ret

Not to worry about this being so deeply compiler-specific for now, however. This will be standardized in C23 with the addition of the functions in the stdckdint.h header.

A full suite of tests to explore the solutions is available on Godbolt or as a Gist.

Bootstrapping a language

I was hoping to make more progress on self-hosting my scripting language (Kalos) this week, but I’m running out of steam because I think I coded myself into a corner. This is not a post where I’ve got everything figured out, but instead I’m taking a few moments to re-hash where things are at and figure out a plan for the future.

My original plan for this language was to offer a Python-like experience with minimal resource requirements: it should be able to run on an AVR, on bare metal, or even as part of a DOS executable. I had originally planned to support compilation on those devices, and even built a zero-allocation parser. The runtime is lightweight, integers are a configurable size, and strings are even optional. I believe it’ll be a great option on lower-end devices where Python is just too heavy.

Over the last week I started work to extract a small piece of the parser that deals with KIDL (example of KIDLE below), the part of the language that glues it to your C code. This is currently part of the existing parser, and the only piece that I could think of to carve off on the slow march to true self-hosting.

idl {
    prefix "kalos_module_idl_";
    dispatch name;

    module builtin {
        fn println(s: string) = kalos_idl_compiler_println;
        fn print(s: string) = kalos_idl_compiler_print;
        fn log(s: string) = kalos_idl_compiler_log;
    }
    ...
}

Where I’m stuck now is that the language is almost good enough to parse itself, but I’m finding lots of corner cases and papercut bugs that make it less-than-ideal. For example, I’m finding that the parser as-is is not quite good enough to handle function calls that are deeply nested in expressions.

I also ended up hacking in some dynamic-dispatch objects that help with not having classes, but that’s not a long-term thing that I want to support in lieu of a proper object/class system.

Eventually I’ll have to commit to rewriting the whole parser in Kalos, but as the title of the post suggests, I’m stuck in a potential energy well where the next steps are going to be difficult. The current parser is written in C and – while the code is pretty clean – it’s a lot of work to make changes to it. It’s going to take some time and effort to add support for things like classes, tear-off functions, etc, and being allocation-free doesn’t make any of this easy.

My mistake was being too ambitious and going right for C as the bootstrap, rather than something higher level. I should have started with Python as the bootstrap!

So I need to gather enough energy to choose and work on one of the following paths:

  1. Commit to rewriting the parser in Kalos, maybe after adding support for hashtables/dicts to the language. The language spec is still small enough that I could port the current parser. Debugging is very difficult, and you need to ensure that you’re running the code while developing it to discover if you’ve accidentally stepped on one of the many landmines. Once I have the parser/compiler in a higher-level language like Kalos itself, these landmines will be much easier to fix!
  2. Rewrite the parser in Python, knowing that I’ll have to rewrite it in Kalos later. This might not be terrible because the language is supposed to be python-like and there might be a mechanical translation route available. The thought of writing more code to throw away doesn’t fill me with a lot of joy, but it might just be what I have to do.
  3. Scrap the hand-rolled parser and switch to something like Lemon. We’re already using the amazing re2c to write the lexer, so adding another tool isn’t a bad idea. Again, we’re putting in a bunch of effort knowing that this will be tossed away later, but maybe there’s a middle ground like just having Lemon build an AST, then have Kalos script generate the bytecode?

Hacking Bluetooth to Brew Coffee from GitHub Actions: Part 3 - GitHub Actions

This is the last part of a three-part series covering the odyssey of getting a new coffeemaker, learning BTLE and how it works, reverse-engineering the Bluetooth interface and Android applications for the coffeemaker, writing a Rust-based CLI interface, and finally, hooking it all up to a GitHub actions bot that lets you brew a coffee just by filing an issue!

In part 2 we got the command-line application working, and now it’s time to connect the dots and build a secure, yet web-accessible interface.

We could choose a standard web host, add some sort of authentication on top of it, build the right web tooling to integrate with the nice command-line application we built, and all the associated security so random people can’t brew coffee. But as you’ve guessed from the title of these posts, we’re going hook this command-line app into a private GitHub repo as our “interface”.

Making use of GitHub issues for automating weird things isn’t new, but I think this is the first time you can make coffee from it!

Getting Started

Here’s our goal:

  1. We want to allow users to brew a coffee from a GitHub issue, which will be pre-populated from a number of pre-defined templates
  2. The issue will contain part of the command-line that we want to run, and we’ll need to validate that it’s reasonable and correct, and that nobody is trying to inject any sort of “funny business” to break/backdoor the runner
  3. We don’t want coffee brewers to have to chase down the status of the brewing operation, so we’re going to make use of issue comments as our basic UI. The user will be able to follow the progress of their coffee inside of the issue, and get a notification when it’s done.

This is what the user will see just before they brew the coffee:

The first question you might have is how we’re going to talk to a Bluetooth coffeemaker from GitHub’s system. This part turns out to be pretty easy: we can use GitHub self-hosted runners as a backdoor into the coffeemaker’s physical location! By running this on a computing device that has a Bluetooth radio in proximity to the coffeemaker, we can send commands to it in response to events occurring in a repo. Conveniently the Raspberry Pi 3 Model B and Pi 4 both support Bluetooth, but in our case we’re going to be using a spare MacBook that’s kicking around.

First thing, we need to create a new runner on GitHub for our project, and then set up the runner on the MacBook:

curl -O -L https://github.com/actions/runner/releases/download/${version}/actions-runner-osx-x64-${version}.tar.gz
tar xzf ./actions-runner-osx-x64-${version}.tar.gz
./config.sh --url https://github.com/mmastrac/brew-a-coffee --token ${token}
./run.sh

GitHub actions are pretty flexible and we have a huge number of events that can trigger them. In our case, we want the creation of a new issue to trigger a run, so our trigger becomes:

on:
  issues:
    types: [opened]

We’ll pull in the create-or-update-comment action from peter-evans for updating the user about the status of their coffee:

steps:
  - name: Add comment
    uses: peter-evans/create-or-update-comment@v2

And once the coffee is brewed or the process has failed for some other reason, we’ll want to close that issue, so we’re going to pull in peter-evans/close-issue for this:

  - name: Close issue
    if: always()
    uses: peter-evans/close-issue@v2

The actual brewing part will be pretty easy as well, but it’s going to require use to fetch the text of the issue and use that to create the command-line to run.

Let’s take a look at the event information that GitHub provides to us in $GITHUB_EVENT_PATH. There’s a lot that GitHub provides for us in this file, and this particular one is trimmed down significantly:

{
    "action": "opened",
    "issue": {
        ...
        "body": "This is my issue body comment\r\n",
        ...
        "title": "This is my issue title!",
        ...
    }
    ...

jq is one the best tools for integrating JSON APIs with shell scripts, so we’ll make use of that. We’ll create a small test JSON file called test.json that contains just the interesting subset of what’s available in the file at $GITHUB_EVENT_PATH:

{
    "action": "opened",
    "issue": {
        "body": "This is my issue body comment\r\n",
        "title": "This is my issue title!"
    }
}

First, we can test extraction of the issue body:

$ jq -r '.issue.body' < test.json
This is my issue body comment

$

That worked, but we’ve got some extra whitespace there. We can trim that with another jq command, gsub. By replacing leading or trailing whitespace (gsub("^\\s+|\\s+$";"")) with nothing, we can get just the text of the comment:

$ jq -r '.issue.body|gsub("^\\s+|\\s+$";"")' < test.json
This is my issue body comment
$

Better!

Extracting the Command-Line

Now what we want to do is allow the user to specify the command-line in the issue, but ensure that they can’t run anything nefarious on the runner. We developed a command-line cappuccino recipe in part 2 that we ran like this:

cargo run -- brew --beverage cappuccino --coffee 40 --milk 100 --taste extrastrong

So let’s extract out everything past the hyphens and make that the required input in our newly-filed issues:

brew --beverage cappuccino --coffee 40 --milk 100 --taste extrastrong

To work on extraction, we’ll update the issue.body field in our test.json file to this partial command-line:

{
    "action": "opened",
    "issue": {
        "body": "brew --beverage cappuccino --coffee 40 --milk 100 --taste extrastrong\r\n"
    }
}

Since we are creating a valid partial command-line for our brewing app, we can make use of that fact that we know the exact structure. In this case, we know we want it to:

  1. Start with the subcommand brew
  2. Next, contain the beverage to brew with --beverage <something>
  3. Finally, contain a list of beverage parameters which are limited to coffee, milk, hotwater, taste, and temperature. Each parameter is separated from its value by a space (ie: --coffee 100), and is either a number or an enumeration value (ie: --taste strong).

We can then build a regular expression that will be limited to just the arguments we’re allowing here. We’ll use the \w character class as it’s a close match to the values required by our parameters.

We could go further in validating the --beverage parameter, or the values for the ingredients, but we know that those are carefully checked in the application and we’ll let the application handle the validation:

^brew --beverage \w+( --(coffee|milk|taste|hotwater|temperature) \w+)*$

Now we can put it all together and extract the command-line like so (note that we have to escape the \w patterns in regular expression):

CMDLINE=$(jq -r '.issue.body|gsub("^\\s+|\\s+$";"")|select(test("^brew --beverage \\w+( --(coffee|milk|taste|hotwater|temperature) \\w+)*$"))' < $GITHUB_EVENT_PATH)

And that’s probably the only tricky part of the process. Now we can build our GitHub action, piece-by-piece.

Building the Workflow

First, the preamble that tells GitHub where and when to run the action, and what permissions it has:

name: Brew

on:
  issues:
    types: [opened]

jobs:
  build:
    runs-on: self-hosted
    permissions:
      issues: write

steps:

Our first step will drop a comment into the issue so the user knows things are happening

      - name: Add initial comment
        uses: peter-evans/create-or-update-comment@v2
        id: comment
        with:
          issue-number: $
          body: ' - [X] Getting ready to brew your ☕️!'

We’ll then install the longshot executable from cargo and let them know it was done:

      - name: Install longshot
        run: cargo install --root /tmp/longshot -- longshot 
      - name: Update comment
        uses: peter-evans/create-or-update-comment@v2
        with:
          issue-number: $
          comment-id: $
          body: ' - [X] Installed the `longshot` executable'

Next, we’ll process the requested brew operation using the jq incantation from earlier. This step will create a body.md that we’ll use to update the comment, as well as cmdline.txt that will be used to execute our brewing operation later on:

      - name: Process the request
        run: |
          OUTFILE=body.md
          CMDLINE=$(
            jq -r '
              .issue.body |
              gsub("^\\s+|\\s+$";"") |
              select(
                test("^brew --beverage \\w+( --(coffee|milk|taste|hotwater|temperature) \\w+)*$")
              )' < $GITHUB_EVENT_PATH
          )
          echo Command-line we parsed was: $CMDLINE
          if [[ "$CMDLINE" == "" ]]; then
            echo " - [X] Couldn't parse the command line from your comment? 🤔" > $OUTFILE
            exit 1
          fi
          echo -n ' - [X]' Running brew command: \`$CMDLINE\` > $OUTFILE
          echo ' [(Log here)](https://github.com/'${GITHUB_REPOSITORY}'/actions/runs/'${GITHUB_RUN_ID}')' >> $OUTFILE
          echo "/tmp/longshot/bin/longshot $CMDLINE --device-name $" > cmdline.txt

We then update the comment with body.md:

      - name: Update comment
        uses: peter-evans/create-or-update-comment@v2
        with:
          issue-number: $
          comment-id: $
          body-file: body.md

And run the brewing command:

      - name: Brew coffee
        run: |
          echo '<details><summary>Log</summary><pre>' > log.md
          sh -c "`cat cmdline.txt`" | tee -a log.md
          echo '</pre></details>' >> log.md
          echo '✅ Success!' >> log.md
      - name: Update comment on success
        uses: peter-evans/create-or-update-comment@v2
        with:
          issue-number: $
          comment-id: $
          body-file: log.md

Finally, we’ll log a message to the comment on an error, and close the issue unconditionally:

      - name: Update comment on failure
        if: failure()
        uses: peter-evans/create-or-update-comment@v2
        with:
          issue-number: $
          comment-id: $
          body: |
            <br>
            ❌ Failed! Please check the log for the reason.
      - name: Close issue
        if: always()
        uses: peter-evans/close-issue@v2

And with all those steps, we can get ourselves a coffee from GitHub!

While you can’t access my private repository that I’m using to brew us coffee at home, you can definitely try out the example repo that I’ve set up here which uses the command-line interface’s simulator and runs on GitHub’s action runners instead:

https://github.com/mmastrac/brew-a-coffee-demo

To recap, we:

Follow me on Mastadon for more updates on this adventure!

Hacking Bluetooth to Brew Coffee from GitHub Actions: Part 2 - Reverse Engineering

This is part 2 of a three-part series covering the odyssey of getting a new coffeemaker, learning BTLE and how it works, reverse-engineering the Bluetooth interface and Android applications for the coffeemaker, writing a Rust-based CLI interface, and finally, hooking it all up to a GitHub actions bot that lets you brew a coffee just by filing an issue!

In part 1 we got our coffeemaker brewing using a sniffed command that we logged from the actual application, and then sent to the coffeemaker using a small Rust program. However, we don’t really understand the language we’re speaking yet, we’re just repeating the application-to-device babbling we’ve snooped.

Understanding the Packets

Now that we know that we can send a request, we want to understand what the format of the request looks like. The first thing we want to do is understand what a packet is. A packet is a chunk of data of a defined length, in contrast to a stream of data that continues indefinitely. Packets are used throughout most communication technologies and are a fundamental way of describing discrete communication messages.

When dealing with embedded devices, packets will almost always have a header, and sometimes a footer. The header and footer are called the framing of the packet, and they delimit it so we can identify exactly where it starts and stop.

Inside the header and footer might be things like start-of-packet, or end-of-packet markers, and a length for framing. There may also be additional metadata like a checksum to detect corruption.

Why is this framing important? Devices will often use framing to help recover from corruption. If you lose or corrupt a byte anywhere in the packet, you can often recover synchronization quickly by just restarting the packet parsing at the next byte that looks like a start byte.

Here’s a few packets we captured being sent from the coffeemaker to the application while asking it to brew a coffee, and then waiting for it to finish cleaning:

0d0575f0c4d5                             # Some sort of status request
0d1483f007010100410900be02030c001c0206dc # Brew a cappuccino
0d0883f00702062f41                       # Cancel brewing

d00783f0010064d9                         # Response to brew/stop request
d012750f02040100400700000000000000d621   # Status response
d012750f04050100400c030900000000001cf0   # Status response
d012750f000000000000036400000000009080   # Status response

What information can we glean from this? First of all, the first byte is always 0d or d0 (13 or 240 in decimal), suggesting this is a start-of-packet byte that varies depending on the direction of communication. That’s one byte probably identified!

+> 0d 0575f0c4d5
+> 0d 1483f007010100410900be02030c001c0206dc
+> 0d 0883f00702062f41
|
+> d0 0783f0010064d9
+> d0 12750f02040100400700000000000000d621
+> d0 12750f04050100400c030900000000001cf0
+> d0 12750f000000000000036400000000009080
|
+--------------------------------------- Start of packet (0x0d or 0xd0)

Next, the second byte of the packet seems to vary depending on the length of the packet, and it corresponds exactly with the change in packet size. This is highly likely to be a length, and from what we can see here in a couple of the packets we captured earlier, it would be the length of the packet not including the start-of-packet byte.

   v---5 bytes--v
0d 05 75 f0 c4 d5
   v------7 bytes-----v
d0 07 83 f0 01 00 64 d9
   v------------------18 bytes (0x12)------------------v
d0 12 75 0f 02 04 01 00 40 07 00 00 00 00 00 00 00 d6 21
^  ^
|  +------------------------------------------------------ Length of packet
+--------------------------------------------------------- Start of packet (0xd0)

We can’t glean much about the rest of the packet yet, but we’re getting some of the framing nailed down here. Time to pull out some more analysis tools.

There are three approaches we can use to understand the binary language of Delonghi’s ECAM machines:

  1. We can disassemble the firmware of the coffeemaker and understand what it expects and what it sends, or
  2. We can observe the application’s communication with the coffeemaker over a period of time, changing one or two things at a time and seeing what changes in the protocol, or
  3. We can disassemble the application that controls the coffeemaker and understand its inputs and outputs.

The firmware of the machine itself would be the ideal place for us to look, but according to some various coffeemaker-hacking forums, the controllers are PIC-based, and disassembling/dumping PIC firmware somewhat tricky.

In addition, a disadvantage to disassembling microcontroller firmware is that due to size constraints, it’s far less likely for text strings to have survived the compilation process to give us hints as to what’s going on. Finding leftover snippets of logging or “debug” print statements are gold for the reverse engineer, and we’d like to use that as a signpost to guide our future work.

Observing the application’s communication directly is definitely an option. This is inconvenient as we saw from the HCI snooping adventures earlier on, and we might not know how to perturb the system enough to fully understand most of the fields we receive.

The best option we’re left with is disassembling the application itself and looking for hints as to what it’s doing, hopefully for some symbols that give us names, or text strings that may give us context.

Disassembling the Delonghi APK

We’re going to disassemble the Delonghi APK to learn more about how we can automate our caffeine fix.

Android applications are shipped in APK (Android Package Kit) format, and we’re going to download a few historical versions of the APK from APK Pure, a site that archives older versions of shipped applications. Getting a few different versions is a good idea, as developers will sometimes forget to enable obfuscation in some versions. If we’re lucky enough to get a version of the application without obfuscation, we can get the internal names for constants and fields.

In the past, APK decompilation was somewhat tricky. As Android is Java-based, but doesn’t use Java’s bytecode directly, either you’d need to learn how to understand smali, or you’d use dex2jar to convert the app to a faux-Java JAR file and use standard Java analysis tools to reverse engineer it. Jadx is a new Java analysis tool, which is far easier to use and much more powerful than the older tools.

Let’s open the APK in Jadx. Once it has decompiled that app, the first thing we’ll notice is that there are package names here. This is great news and suggests that even if the application is obfuscated, it’s not obfuscated fully and we’ll be able to learn how it ticks.

When we dig into some of the classes, we see method and field names, showing us a pretty clear representation of the original source code. Even better news!

After decompiling, our first goal should be to conclusively identify the framing of the packets and answer the questions we raised earlier on. With some reading through the source, we can identify the source of one of the packets we saw being sent to the machine…

0d 05 75 f0 c4 d5

… as coming from here:

public static byte[] getByteMonitorMode(int i) {  
    String str = TAG;  
    DLog.m188e(str, "getByteMonitorMode  dataN" + i);  
    byte[] bArr = new byte[6];  
    bArr[0] = 0xd;                            // ** 0d
    bArr[1] = 5;                              // ** 05
    if (i == 0) {  
        bArr[2] = DATA_0_ANSWER_ID;  
    } else if (i == 1) {  
        bArr[2] = DATA_1_ANSWER_ID;  
    } else if (i == 2) {  
        bArr[2] = 117;                        // ** 75
    }  
    bArr[3] = 0xf0;                           // ** f0
    int checksum = checksum(bArr);  
    bArr[4] = (byte) ((checksum >> 8) & 255); // ** c4 
    bArr[5] = (byte) (checksum & 255);        // ** d5
    return bArr;  
}

We can see the canonical source for every byte in that packet above in this function. 0d is the start_of_packet header. 05 is the length, and it looks like it’s just hardcoded here since the packet is always the same length. We can also see that the last few digits are a checksum, and if we dig into the checksum function, how it is calculated:

public static int checksum(byte[] bArr) {  
    int i = 7439;  
    for (int i2 = 0; i2 < bArr.length - 2; i2++) {  
        int i3 = (((i << 8) | (i >>> 8)) & 65535) ^ (bArr[i2] & 255);  
        int i4 = i3 ^ ((i3 & 255) >> 4);  
        int i5 = i4 ^ ((i4 << 12) & 65535);  
        i = i5 ^ (((i5 & 255) << 5) & 65535);  
    }  
    return i & 65535;  
}

Great! checksum looks like it could be one of the CRC family of functions, but we don’t necessarily have to fully understand it yet if we have its implementation. We now have all the framing necessary to construct any packet:

d0 LE (data) C1 C2
^  ^         ^--^-- Our checksum bytes   
+  +--------------- Length of packet, minus start-of-packet
+------------------ Start of packet (0xd0 or 0x0d depending on direction)

Now we can start to guess at the meaning of the rest of the bytes. Byte 2 appears to be a command ID. Byte 3 is a constant, and scanning the rest of the file suggests that it’s always 0x0f or 0xf0, depending on the command. That leaves the remainder of the packet for the command payload, if it’s used for the command.

0d 05 75 f0 c4 d5
^  ^  ^  ^  ^--^-- Our checksum
|  |  |  +-------- Always 0xf0 or 0x0f
|  |  +----------- The command ID (0x75 = monitor mode 2)
|  +-------------- The packet length
+----------------- Start of packet (0x0d)

Since we have a single spot for calculating the packet checksum, and we know that every request requires a checksum to be calculated, we can figure out all the places in the app that create request packets to get a better idea of what we can ask the machine to do by looking for the callers of checksum:

This is very interesting! There’s a lot of commands here, and each of them has a reasonably-well-defined name that we can use to understand what its function might be. We’ll have to start working through them one-by-one. With a bit of work, we can assemble a table of these commands:

enum EcamRequestId {
    SetBtMode = 17,
    MonitorV0 = 96,
    MonitorV1 = 112,
    MonitorV2 = 117,
    BeverageDispensingMode = 131,
    AppControl = 132,
    ParameterRead = 149,
    ParameterWrite = 144,
    ParameterReadExt = 161,
    StatisticsRead = 162,
    Checksum = 163,
    ProfileNameRead = 164,
    ProfileNameWrite = 165,
    RecipeQuantityRead = 166,
    RecipePriorityRead = 168,
    ProfileSelection = 169,
    RecipeNameRead = 170,
    RecipeNameWrite = 171,
    SetFavoriteBeverages = 173,
    RecipeMinMaxSync = 176,
    PinSet = 177,
    BeanSystemSelect = 185,
    BeanSystemRead = 186,
    BeanSystemWrite = 187,
    PinRead = 210,
    SetTime = 226,
}

Some of these request IDs are guesses based on the surrounding code context, and some of them are defined in enumerations in the application source. It’s a pretty good start for us to get going on figuring out how to brew our own beverage from scratch.

Revisiting the brew command

From just the bytes sent across the connection it’s difficult to understand exactly how the application is creating the packet to brew a coffee. However, in the disassembly we find a function dispenseBeveragePacket that appears to construct the packet we saw before:

if (arrayList != null) {  
    Iterator<ParameterModel> it3 = arrayList.iterator();  
    loop1: while (true) {  
        i4 = 0;  
        while (it3.hasNext()) {  
            next = it3.next();  
            if (next.getId() < 23 || next.getId() == 28) {  
                if (bool.booleanValue() || i != 200 || next.getId() != 2) {  
                    i6 = i6 + 2 + i4;  
                    bArr[i6 + 6] = (byte) next.getId();  
                    if (Utils.isTwoBytesShort(next.getId())) {  
                        bArr[i6 + 7] = (byte) (next.getDefValue() >> 8);  
                        bArr[i6 + 8] = (byte) next.getDefValue();  
                        i4 = 1;  
                    }  
                }  
            }  
        }  
        bArr[i6 + 7] = (byte) next.getDefValue();  
    }  
    i5 = i4;  
}

If we clean it up a bit to some Java pseudocode, it looks like this:

int index = 6;
for (ParameterModel param in params) {
    if (param.getId() < CLEAN_TYPE || param.getId() == ACCESSORIO) {
        array[index++] = param.getId();
        if (Utils.isTwoBytesShort(param.getId())) {
            array[index++] = param.getDefValue() >> 8;   // upper 8 bytes
            array[index++] = param.getDefValue() & 0xff; // 
        } else {
            array[index++] = param.getDefValue() & 0xff;
        }
    }
}

So, from this pseudocode we see that a beverage is constructed from a list of ingredients (which with some further investigation, we find in the disassembled source as IngredientsId below), and an associated one or two byte value (param.getDefValue() above). Digging through the source for which ingredients are one or two bytes doesn’t yield much fruit, but maybe we can understand what’s going on by investigating further.

Where do we go next? We find that there are two commands that, based on their name, seem to be related to the application UI used for brewing: RecipeQuantityRead and RecipeMinMaxSync.

Let’s try sending these commands to the machine! To do this, it’s time to re-visit our Rust code.

First, let’s create a function that will add the packet framing (header, length and checksum) to any payload we want to send:

pub fn checksum(buffer: &[u8]) -> [u8; 2] {
    let mut i: u16 = 7439;
    for x in buffer {
        let i3 = ((i << 8) | (i >> 8)) ^ (*x as u16);
        let i4 = i3 ^ ((i3 & 255) >> 4);
        let i5 = i4 ^ (i4 << 12);
        i = i5 ^ ((i5 & 255) << 5);
    }

    [(i >> 8) as u8, (i & 0xff) as u8]
}

fn packetize(buffer: &[u8]) -> Vec<u8> {
    let mut out = [&[
        0x0d,
        (buffer.len() + 3).try_into().expect("Packet too large"),
    ], buffer].concat();
    out.extend_from_slice(&checksum(&out));
    out
}

async fn run_with_peripheral(peripheral: Peripheral, characteristic: Characteristic) -> Result<(), Box<dyn std::error::Error>> {
    let data = packetize(/* data */);
    peripheral.write(&characteristic, data, WriteType::WithoutResponse);
    Ok(())
}

Now we can start sending some example packets and exploring the responses. Here’s two test packet’s we’ll send (in pseudo-code and byte form):

Packet                             Bytes
RecipeInfo(profile=1, beverage=7)  0d 07 a6f0 01 07 75c2
RecipeMinMaxInfo(beverage=7)       0d 06 b0f0 07 6af4

Let’s look at what comes back, in raw byte form (some spaces added to help the reader visualize the packet):

RecipeInfo(profile=1, beverage=7):       
d0 17 a6f0 01 07 
↳ 0100410900be02030c001b0419011c02
↳ a2cd

RecipeMinMaxInfo(beverage=7):
d0 2c b0f0 07
↳ 010014004100b409003c00be038402000305
↳ 18010101190101010c0000001c000200
↳ 1b000404
↳ d03c

The first part of the response packets appear to be the machine echoing back the input. This makes sense, as the application will need a way to match up responses to requests.

For the remainder of the packet, we have the advantage of the decompilation above. We know the list of ingredients from the IngredientsId enumeration in the decompiled source, and if we match up the type of beverages with the ingredients, it makes a lot of sense that it’s what we’re seeing here:

RecipeInfo response:

01·0041•09·00be•02·03•0c·00•1b·04•19·01•1c·02
---+--- ---+--- --+-- --+-- --+-- --+-- --+--  
   |       |      |     |     |     |     ╰---- Accessorio=2
   |       |      |     |     |     |          
   |       |      |     |     |     ╰---------- Visible=1
   |       |      |     |     |                
   |       |      |     |     ╰---------------- IndexLength=4
   |       |      |     |                      
   |       |      |     ╰---------------------- Inversion=0
   |       |      |                            
   |       |      ╰---------------------------- Taste=3
   |       |                                   
   |       ╰----------------------------------- Milk=190
   |                                           
   ╰------------------------------------------- Coffee=65

RecipeMinMaxInfo response:

01·0014·0041·00b4•09·003c·00be·0384•02·00·03·05•
--------+-------- --------+-------- -----+-----  
        |                 |              ╰------- Taste: 0<=3<=5
        |                 |                      
        |                 ╰---------------------- Milk: 60<=190<=900
        |                                        
        ╰---------------------------------------- Coffee: 20<=65<=180
> 18·01·01·01•19·01·01·01•0c·00·00·00•1c·00·02·00•
  -----+----- -----+----- -----+----- -----+-----  
       |           |           |           ╰------- Accessorio: 0<=2<=0
       |           |           |                   
       |           |           ╰------------------- Inversion: 0<=0<=0
       |           |                               
       |           ╰------------------------------- Visible: 1<=1<=1
       |                                           
       ╰------------------------------------------- Programmable: 1<=1<=1
> 1b¡00¡04¡04
  -----+-----  
       ╰------- IndexLength: 0<=4<=4

From reading the application source, these packets seem to represent the current settings for the beverage, and the minimum and maximum ranges for each of the parameters. We can also start to guess at what the lengths of each of the ingredients’ parameter values are: most are a single byte, but a handful seem to be reliably two bytes wide and they all seem to deal with liquids.

Let’s confirm our intuition here by looking at the recipes for a latte and hot water:

Latte recipe:

01·003c•09·01f4•02·03•0c·00•1b·04•19·01•1c·02
---+--- ---+--- --+-- --+-- --+-- --+-- --+--  
   |       |      |     |     |     |     ╰---- Accessorio=2
   |       |      |     |     |     |          
   |       |      |     |     |     ╰---------- Visible=1
   |       |      |     |     |                
   |       |      |     |     ╰---------------- IndexLength=4
   |       |      |     |                      
   |       |      |     ╰---------------------- Inversion=0
   |       |      |                            
   |       |      ╰---------------------------- Taste=3
   |       |                                   
   |       ╰----------------------------------- Milk=500
   |                                           
   ╰------------------------------------------- Coffee=60

Hot water recipe:

0f·00fa•19·01•1c·01
---+--- --+-- --+--  
   |      |     ╰---- Accessorio=1
   |      |          
   |      ╰---------- Visible=1
   |                 
   ╰----------------- HotWater=250

We can see that the pattern continues here: liquid amounts are two bytes long, while everything else is a single byte.

If you’re curious about every recipe this machine can make, a gist is available with a dump of the recipes.

We can now put all the pieces together and build our own brewing command with a recipe that we’re going to write from scratch: a large cappuccino!

83 f0 07 01 01 00 78 09 01 77 02 04 02
^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^- Preparation mode = PREPARE
|  |  |  |  |  |  |  |  |  |  +--+---- Taste (02) = strong (value 4)
|  |  |  |  |  |  |  +--+--+---------- Milk (09) = 375
|  |  |  |  +--+--+------------------- Coffee (01) = 120
|  |  |  +---------------------------- Trigger=START
|  |  +------------------------------- Beverage=Cappuccino (value 7)
|  +---------------------------------- Always 0xf0
+------------------------------------- Beverage dispense request   

Let’s send that to the machine and see what happens:

Success! We have the most complicated and important part of communication with a coffeemaker working: brewing the coffee.

There’s a bit of work required to turn this into a full application, but you can find that pre-written in my longshot project on GitHub. You don’t need to have this coffeemaker, as it includes a simulator mode that will allow you to brew a virtual coffee and test out the packet parsing and generation code.

Continue reading… In part three of the series we’ll hook this up to GitHub Actions so that we can automate coffee brewing from the browser!

Further reading for this post

Follow me on Mastadon for more updates on this adventure!