A multi-entry CFG design conundrum

Background and bytecode design

The ZJIT compiler compiles Ruby bytecode (YARV) to machine code. It starts by transforming the stack machine bytecode into a high-level graph-based intermediate representation called HIR.

We use a more or less typical1 control-flow graph (CFG) in HIR. We have a compilation unit, Function, which has multiple basic blocks, Block. Each block contains multiple instructions, Insn. HIR is always in SSA form, and we use the variant of SSA with block parameters instead of phi nodes.

Where it gets weird, though, is our handling of multiple entrypoints. See, YARV handles default positional parameters (but not default keyword parameters) by embedding the code to compute the defaults inside the callee bytecode. Callers are then responsible for figuring out at what offset in the bytecode they should start running the callee, depending on the number of arguments they provide.2

In the following example, we have a function that takes two optional positional parameters a and b. If neither is provided, we start at offset 0000. If just a is provided, we start at offset 0005. If both are provided, we can start at offset 0010.

$ ruby --dump=insns -e 'def foo(a=compute_a, b=compute_b) = a + b'
...
== disasm: #<ISeq:foo@-e:1 (1,0)-(1,41)>
local table (size: 2, argc: 0 [opts: 2, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 2] a@0<Opt=0> [ 1] b@1<Opt=5>
0000 putself ( 1)
0001 opt_send_without_block <calldata!mid:compute_a, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0003 setlocal_WC_0 a@0
0005 putself
0006 opt_send_without_block <calldata!mid:compute_b, argc:0, FCALL|VCALL|ARGS_SIMPLE>
0008 setlocal_WC_0 b@1
0010 getlocal_WC_0 a@0[Ca]
0012 getlocal_WC_0 b@1
0014 opt_plus <calldata!mid:+, argc:1, ARGS_SIMPLE>[CcCr]
0016 leave [Re]
$

(See the jump table debug output: [ 2] a@0<Opt=0> [ 1] b@1<Opt=5>)

Unlike in Python, where default arguments are evaluated at function creation time, Ruby computes the default values at function call time. For this reason, embedding the default code inside the callee makes a lot of sense; we have a full call frame already set up, so any exception handling machinery or profiling or ... doesn't need special treatment.

Since the caller knows what arguments it is passing, and often to what function, we can efficiently support this in the JIT. We just need to know what offset in the compiled callee to call into. The interpreter can also call into the compiled function, which has a stub that dispatches to the appropriate entry block.

This has led us to design the HIR to support multiple function entrypoints. Instead of having just a single entry block, as most control-flow graphs do, each of our functions now has an array of function entries: one for the interpreter, at least one for the JIT, and more for default parameter handling. Each of these entry blocks is separately callable from the outside world.

Here is what the (slightly cleaned up) HIR looks like for the above example:

Optimized HIR:
fn foo@tmp/branchnil.rb:4:
bb0():
 EntryPoint interpreter
 v1:BasicObject = LoadSelf
 v2:BasicObject = GetLocal :a, l0, SP@5
 v3:BasicObject = GetLocal :b, l0, SP@4
 v4:CPtr = LoadPC
 v5:CPtr[CPtr(0x16d27e908)] = Const CPtr(0x16d282120)
 v6:CBool = IsBitEqual v4, v5
 IfTrue v6, bb2(v1, v2, v3)
 v8:CPtr[CPtr(0x16d27e908)] = Const CPtr(0x16d282120)
 v9:CBool = IsBitEqual v4, v8
 IfTrue v9, bb4(v1, v2, v3)
 Jump bb6(v1, v2, v3)
bb1(v13:BasicObject):
 EntryPoint JIT(0)
 v14:NilClass = Const Value(nil)
 v15:NilClass = Const Value(nil)
 Jump bb2(v13, v14, v15)
bb2(v27:BasicObject, v28:BasicObject, v29:BasicObject):
 v65:HeapObject[...] = GuardType v27, HeapObject[class_exact*:Object@VALUE(0x1043aed00)]
 v66:BasicObject = SendWithoutBlockDirect v65, :compute_a (0x16d282148)
 Jump bb4(v27, v66, v29)
bb3(v18:BasicObject, v19:BasicObject):
 EntryPoint JIT(1)
 v20:NilClass = Const Value(nil)
 Jump bb4(v18, v19, v20)
bb4(v38:BasicObject, v39:BasicObject, v40:BasicObject):
 v69:HeapObject[...] = GuardType v38, HeapObject[class_exact*:Object@VALUE(0x1043aed00)]
 v70:BasicObject = SendWithoutBlockDirect v69, :compute_b (0x16d282148)
 Jump bb6(v38, v39, v70)
bb5(v23:BasicObject, v24:BasicObject, v25:BasicObject):
 EntryPoint JIT(2)
 Jump bb6(v23, v24, v25)
bb6(v49:BasicObject, v50:BasicObject, v51:BasicObject):
 v73:Fixnum = GuardType v50, Fixnum
 v74:Fixnum = GuardType v51, Fixnum
 v75:Fixnum = FixnumAdd v73, v74
 CheckInterrupts
 Return v75

If you're not a fan of text HIR, here is an embedded clickable visualization of HIR thanks to our former intern Aiden porting Firefox's Iongraph:

(You might have to scroll sideways and down and zoom around. Or you can open it in its own window.)

Each entry block also comes with block parameters which mirror the function's parameters. These get passed in (roughly) the System V ABI registers.

This is kind of gross. We have to handle these blocks specially in reverse post-order (RPO) graph traversal. And, recently, I ran into an even worse case when trying to implement the Cooper-style "engineered" dominator algorithm: when we walk up a block's dominator chain, the walk is not guaranteed to converge. All non-entry blocks are dominated by all entry blocks, which are only dominated by themselves. There is no one "start block". So what is there to do?
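For a concrete picture of why that walk diverges, here is a minimal Rust-flavored sketch of the intersect step from the Cooper, Harvey, and Kennedy paper (not ZJIT's actual code; it assumes idoms and reverse post-order numbers stored in arrays indexed by block id):

// Sketch of the "intersect" helper from the engineered dominator algorithm.
// It assumes a single root whose idom is itself and whose RPO number is the
// smallest, so the two fingers always meet. With several entry blocks that
// are each their own idom, a finger can get stuck on an entry block and spin
// in place instead of converging.
fn intersect(idom: &[usize], rpo: &[usize], mut a: usize, mut b: usize) -> usize {
    while a != b {
        while rpo[a] > rpo[b] {
            a = idom[a]; // if `a` is a non-root entry block, idom[a] == a: infinite loop
        }
        while rpo[b] > rpo[a] {
            b = idom[b];
        }
    }
    a
}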

The design conundrum

Approach 1 is to keep everything as-is, but handle entry blocks specially in the dominator algorithm too. I'm not exactly sure what would be needed, but it seems possible. Most of the existing block infra could be left alone, but it's not clear how much this would "spread" within the compiler. What else in the future might need to be handled specially?

Approach 2 is to synthesize a super-entry block and make it a predecessor of every interpreter and JIT entry block. Within this approach there are two ways to do it: one (2.a) is to fake it and report some non-existent block. Another (2.b) is to actually make a block and a new quasi-jump instruction. In this approach, we would either need to synthesize fake block arguments for the JIT entry block parameters or add some kind of new LoadArg<i> instruction that reads the i-th argument passed in.

(suggested by Iain Ireland, as seen in the IBM COBOL compiler)

Approach 3 is to duplicate the entire CFG per entrypoint. This would return us to having one entry block per CFG at the expense of code duplication. It handles the problem pretty cleanly, but I think I want the duplication to be opt-in instead of having it be the only way we support multiple entrypoints. What if it increases memory use too much? The specialization would probably make the generated code faster, though.

(suggested by Ben Titzer)

None of these approaches feels great to me. The most likely candidate is 2.b, where we add LoadArg instructions. That also gives us the flexibility to add full specialization later without forcing it.
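To make 2.b a bit more concrete, here is a rough sketch, entirely hypothetical and not ZJIT's actual instruction set, of what the new pieces might look like:

// Hypothetical sketch of approach 2.b. A synthetic super-entry block would be
// the lone CFG root; the real entry blocks become its successors, and JIT
// entries read their incoming arguments explicitly instead of taking block
// parameters.
struct BlockId(usize);

enum Insn {
    // ...existing instructions elided...
    /// Read the i-th argument passed in by the caller.
    LoadArg { index: usize },
    /// Quasi-jump from the synthetic super-entry to a real entry block.
    EntryJump { target: BlockId },
}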

Cameron Zwarich also notes that this is an analogue of the common problem people have when implementing the reverse, postdominators: functions often have multiple return instructions in the IR. He notes that the usual solution is to transform them into branches to a single return instruction.

Do you have this problem? What does your compiler do?

  1. We use extended basic blocks (EBBs), but this doesn't matter for this post. It makes dominators and predecessors slightly more complicated (now you have dominating instructions), but that's about it as far as I can tell. We'll see how they fare in the face of more complicated analysis later.

  2. Keyword parameters have some mix of caller/callee presence checks in the callee because they are passed unordered. The caller handles simple constant defaults whereas the callee handles anything that may raise. Check out Kevin Newton's awesome overview.

January 22, 2026 · https://bernsteinbear.com/blog/multiple-entry/
The GDB JIT interface

GDB is great for stepping through machine code to figure out what is going on. It uses debug information under the hood to present you with a tidy backtrace and also determine how much machine code to print when you type disassemble.

This debug information comes from your compiler. Clang, GCC, rustc, etc. all produce debug data in a format called DWARF and then embed that debug information inside the binary (ELF, Mach-O, ...) when you pass -ggdb or equivalent.

Unfortunately, this means that by default, GDB has no idea what is going on if you break in a JIT-compiled function. You can step instruction-by-instruction and whatnot, but that's about it. This is because the current instruction pointer is nowhere to be found in any of the existing debug info tables from the host runtime code, so your terminal is filled with ???. See this example from the V8 docs:

#8 0x08281674 in v8::internal::Runtime_SetProperty (args=...) at src/runtime.cc:3758
#9 0xf5cae28e in ?? ()
#10 0xf5cc3a0a in ?? ()
#11 0xf5cc38f4 in ?? ()
#12 0xf5cbef19 in ?? ()
#13 0xf5cb09a2 in ?? ()
#14 0x0809e0a5 in v8::internal::Invoke (...) at src/execution.cc:97

Fortunately, there is a JIT interface to GDB. If you implement a couple of functions in your JIT and run them every time you finish compiling a function, you can get the debugging niceties for your JIT code too. See again a V8 example:

#6 0x082857fc in v8::internal::Runtime_SetProperty (args=...) at src/runtime.cc:3758
#7 0xf5cae28e in ?? ()
#8 0xf5cc3a0a in loop () at test.js:6
#9 0xf5cc38f4 in test.js () at test.js:13
#10 0xf5cbef19 in ?? ()
#11 0xf5cb09a2 in ?? ()
#12 0x0809e1f9 in v8::internal::Invoke (...) at src/execution.cc:97

Unfortunately, the GDB docs are somewhat sparse. So I went spelunking through a bunch of different projects to try and understand what is going on.

The big picture (and the old interface)

GDB expects your runtime to expose a function called __jit_debug_register_code and a global variable called __jit_debug_descriptor. GDB automatically sets an internal breakpoint on this function, if it exists. Then, when you compile code, you call this function from your runtime.

In slightly more detail:

  1. Compile a function in your JIT compiler. This gives you a function name, maybe other metadata, an executable code address, and a code size
  2. Generate an entire ELF/Mach-O/... object in-memory (!) for that one function, describing its name, code region, maybe other DWARF metadata such as line number maps
  3. Write a jit_code_entry linked list node that points at your object ("symfile")
  4. Link it into the __jit_debug_descriptor linked list
  5. Call __jit_debug_register_code, which gives GDB control of the process so it can pick up the new function's metadata
  6. Optionally, break into (or crash inside) one of your JITed functions
  7. At some point, later, when your function gets GCed, unregister your code by editing the linked list and calling __jit_debug_register_code again
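For reference, the two symbols and the linked-list node have a small, documented C layout. Declared from a Rust runtime, they might look roughly like this (a sketch based on the GDB documentation, not ZJIT's actual code):

// The C-compatible structures GDB expects. The symbol names must match
// exactly, because GDB finds them by name in your process.
#[repr(C)]
pub struct JitCodeEntry {
    next_entry: *mut JitCodeEntry,
    prev_entry: *mut JitCodeEntry,
    symfile_addr: *const u8, // the in-memory ELF/Mach-O object ("symfile")
    symfile_size: u64,
}

#[repr(C)]
pub struct JitDescriptor {
    version: u32,     // must be 1
    action_flag: u32, // 0 = no action, 1 = register, 2 = unregister
    relevant_entry: *mut JitCodeEntry,
    first_entry: *mut JitCodeEntry,
}

#[no_mangle]
pub static mut __jit_debug_descriptor: JitDescriptor = JitDescriptor {
    version: 1,
    action_flag: 0,
    relevant_entry: std::ptr::null_mut(),
    first_entry: std::ptr::null_mut(),
};

#[no_mangle]
#[inline(never)]
pub extern "C" fn __jit_debug_register_code() {
    // Intentionally empty: GDB sets a breakpoint here and reads the
    // descriptor when it fires.
}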

This is why you see compiler projects such as V8 including large swaths of code just to make object files.

Because this is a huge hassle, GDB also has a newer interface that does not require making an ELF/Mach-O/...+DWARF object.

Custom debug info (the new interface)

This new interface requires writing a binary format of your choice. You make the writer and you make the reader. Then, when you are in GDB, you load your reader as a shared object.

The reader must implement the interface specified by GDB:

GDB_DECLARE_GPL_COMPATIBLE_READER;
extern struct gdb_reader_funcs *gdb_init_reader (void);
struct gdb_reader_funcs
{
 /* Must be set to GDB_READER_INTERFACE_VERSION. */
 int reader_version;

 /* For use by the reader. */
 void *priv_data;

 gdb_read_debug_info *read;
 gdb_unwind_frame *unwind;
 gdb_get_frame_id *get_frame_id;
 gdb_destroy_reader *destroy;
};

The read function pointer does the bulk of the work and is responsible for matching code ranges to function names, line numbers, and more.

Here are some details from Sanjoy Das.

Only a few runtimes implement this interface. Most of them stub out the unwind and get_frame_id function pointers.

I think it also requires at least the reader to declare that it is GPL-compatible via the macro GDB_DECLARE_GPL_COMPATIBLE_READER.

Since I wrote about the perf map interface recently, I have it on my mind. Why can't we reuse it in GDB?

Adapting to the Linux perf interface

I suppose it would be possible to try and upstream a patch to GDB to support the Linux perf map interface for JITs. After all, why shouldn't it be able to automatically pick up symbols from /tmp/perf-...? That would be great baseline debug info for "free".

In the meantime, maybe it is reasonable to create a re-usable custom debug reader:

  • When registering code, write the address and name to /tmp/perf-... as you normally would
  • Write the filename as the symfile (does this make /tmp the magic number?)
  • Have the debug info reader just parse the perf map file

It would be less flexible than what either the DWARF or the custom readers support: it would only be able to handle a symbol name and code region. No embedding source code for GDB to display in your debugger. But maybe that is okay for a partial solution?
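Such a reader would not have much parsing to do. Here is a sketch of the per-line work in Rust, as a standalone illustration rather than the actual GDB plugin (which has to be built as a shared object against GDB's reader API):

// Parse one line of /tmp/perf-PID.map: "START SIZE symbolname",
// where START and SIZE are hex without the leading 0x and the symbol
// name is everything after the second space.
fn parse_perf_map_line(line: &str) -> Option<(u64, u64, &str)> {
    let mut parts = line.splitn(3, ' ');
    let start = u64::from_str_radix(parts.next()?, 16).ok()?;
    let size = u64::from_str_radix(parts.next()?, 16).ok()?;
    let name = parts.next()?;
    Some((start, size, name))
}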

Update: Here is my small attempt at such a plugin.

The n-squared problem

V8 notes in their GDB JIT docs that because the JIT interface is a linked list and we only keep a pointer to the head, we get O(n²) behavior. Bummer. This becomes especially noticeable since they register additional code objects not just for functions, but also trampolines, cache stubs, etc.

Garbage collection

Since GDB expects the code pointer in your symbol object file not to move, you have to make sure to have a stable symbol file pointer and stable executable code pointer. To make this happen, V8 disables its moving GC.

Additionally, if your compiled function gets collected, you have to make sure to unregister the function. Instead of doing this eagerly, ART treats the GDB JIT linked list as a weakref and periodically removes dead code entries from it.

December 30, 2025 · https://bernsteinbear.com/blog/gdb-jit/
Load and store forwarding in the Toy Optimizer

Another entry in the Toy Optimizer series.

A long, long time ago (two years!) CF Bolz-Tereick and I made a video about load/store forwarding and an accompanying GitHub Gist about load/store forwarding (also called load elimination) in the Toy Optimizer. I said I would write a blog post about it, but never found the time--it got lost amid a sea of large life changes.

It's a neat idea: do an abstract interpretation over the trace, modeling the heap at compile-time, eliminating redundant loads and stores. That means it's possible to optimize traces like this:

v0 = ...
v1 = load(v0, 5)
v2 = store(v0, 6, 123)
v3 = load(v0, 6)
v4 = load(v0, 5)
v5 = do_something(v1, v3, v4)

into traces like this:

v0 = ...
v1 = load(v0, 5)
v2 = store(v0, 6, 123)
v5 = do_something(v1, 123, v1)

(where load(v0, 5) is equivalent to *(v0+5) in C syntax and store(v0, 6, 123) is equivalent to *(v0+6)=123 in C syntax)

This indicates that we were able to eliminate two redundant loads by keeping around information about previous loads and stores. Let's get to work making this possible.

The usual infrastructure

We'll start off with the usual infrastructure from the Toy Optimizer series: a very stringly-typed representation of a trace-based SSA IR and a union-find rewrite mechanism.

This means we can start writing some new optimization pass and our first test:

def optimize_load_store(bb: Block):
    opt_bb = Block()
    # TODO: copy an optimized version of bb into opt_bb
    return opt_bb

def test_two_loads():
    bb = Block()
    var0 = bb.getarg(0)
    var1 = bb.load(var0, 0)
    var2 = bb.load(var0, 0)
    bb.escape(var1)
    bb.escape(var2)
    opt_bb = optimize_load_store(bb)
    assert bb_to_str(opt_bb) == """\
var0 = getarg(0)
var1 = load(var0, 0)
var2 = escape(var1)
var3 = escape(var1)"""

This test is asserting that we can remove duplicate loads. Why load twice if we can cache the result? Let's make that happen.

Caching loads

To do this, we'll model the heap at compile-time. When I say "model", I mean that we will have an imprecise but correct abstract representation of the heap: we don't (and can't) have knowledge of every value, but we can know for sure that some addresses have certain values.

For example, if we have observed a load from object O at offset 8, v0 = load(O, 8), we know that the SSA value v0 is at heap[(O, 8)]. That sounds tautological, but it's not: future loads can make use of this information.

def get_num(op: Operation, index: int=1):
    assert isinstance(op.arg(index), Constant)
    return op.arg(index).value

def optimize_load_store(bb: Block):
    opt_bb = Block()
    # Stores things we know about the heap at... compile-time.
    # Key: an object and an offset pair acting as a heap address
    # Value: a previous SSA value we know exists at that address
    compile_time_heap: Dict[Tuple[Value, int], Value] = {}
    for op in bb:
        if op.name == "load":
            obj = op.arg(0)
            offset = get_num(op, 1)
            load_info = (obj, offset)
            previous = compile_time_heap.get(load_info)
            if previous is not None:
                op.make_equal_to(previous)
                continue
            compile_time_heap[load_info] = op
        opt_bb.append(op)
    return opt_bb

This pass records information about loads and uses the result of a previous cached load operation if available. We treat the pair of (SSA value, offset) as an address into our abstract heap.

That's great! If you run our simple test, it should now pass. But what happens if we store into that address before the second load? Oops...

def test_store_to_same_object_offset_invalidates_load():
    bb = Block()
    var0 = bb.getarg(0)
    var1 = bb.load(var0, 0)
    var2 = bb.store(var0, 0, 5)
    var3 = bb.load(var0, 0)
    bb.escape(var1)
    bb.escape(var3)
    opt_bb = optimize_load_store(bb)
    assert bb_to_str(opt_bb) == """\
var0 = getarg(0)
var1 = load(var0, 0)
var2 = store(var0, 0, 5)
var3 = load(var0, 0)
var4 = escape(var1)
var5 = escape(var3)"""

This test fails because we are incorrectly keeping around var1 in our abstract heap. We need to get rid of it and not replace var3 with var1.

Invalidating cached loads

So it turns out we have to also model stores in order to cache loads correctly. One valid, albeit aggressive, way to do that is to throw away all the information we know at each store operation:

def optimize_load_store(bb: Block):
    opt_bb = Block()
    compile_time_heap: Dict[Tuple[Value, int], Value] = {}
    for op in bb:
        if op.name == "store":
            compile_time_heap.clear()
        elif op.name == "load":
            # ...
        opt_bb.append(op)
    return opt_bb

That makes our test pass--yay!--but at great cost. It means any store operation throws away everything we know about previous loads. In our world where we frequently read from and write to objects, this is what we call a huge bummer.

For example, a store to offset 4 on some object should never interfere with a load from a different offset on the same object1. We should be able to keep our load from offset 0 cached here:

def test_store_to_same_object_different_offset_does_not_invalidate_load():
    bb = Block()
    var0 = bb.getarg(0)
    var1 = bb.load(var0, 0)
    var2 = bb.store(var0, 4, 5)
    var3 = bb.load(var0, 0)
    bb.escape(var1)
    bb.escape(var3)
    opt_bb = optimize_load_store(bb)
    assert bb_to_str(opt_bb) == """\
var0 = getarg(0)
var1 = load(var0, 0)
var2 = store(var0, 4, 5)
var3 = escape(var1)
var4 = escape(var1)"""

We could try instead checking if our specific (object, offset) pair is in the heap and only removing cached information about that offset and that object. That would definitely help!

def optimize_load_store(bb: Block):
    opt_bb = Block()
    compile_time_heap: Dict[Tuple[Value, int], Value] = {}
    for op in bb:
        if op.name == "store":
            load_info = (op.arg(0), get_num(op, 1))
            if load_info in compile_time_heap:
                del compile_time_heap[load_info]
        elif op.name == "load":
            # ...
        opt_bb.append(op)
    return opt_bb

It makes our test pass, too, which is great news.

Unfortunately, this runs into problems due to aliasing: it's entirely possible that our compile-time heap could contain a pair (v0, 0) and a pair (v1, 0) where v0 and v1 are the same object (but not known to the optimizer). Then we might run into a situation where we incorrectly cache loads because the optimizer doesn't know our abstract addresses (v0, 0) and (v1, 0) are actually the same pointer at run-time.

This means that we are breaking the rules of abstract interpretation: our abstract interpreter has to correctly model all possible outcomes at run-time. So we should instead pick some tactic in between clearing all information (correct but over-eager) and clearing only exact matches of object+offset (incorrect).

The concept that will help us here is the alias class. It is a name for a way to efficiently partition objects in your abstract heap into completely disjoint sets. Writes to any object in one class never affect objects in another class.

Our very scrappy alias classes will be just based on the offset: each offset is a different alias class. If we write to any object at offset K, we have to invalidate all of our compile-time offset K knowledge--even if it's for another object. This is a nice middle ground, and it's possible because our (made up) object system guarantees that distinct objects do not overlap, and also that we are not writing out-of-bounds.2

So let's remove all of the entries from compile_time_heap where the offset matches the offset in the current store:

def optimize_load_store(bb: Block):
    opt_bb = Block()
    compile_time_heap: Dict[Tuple[Value, int], Value] = {}
    for op in bb:
        if op.name == "store":
            offset = get_num(op, 1)
            compile_time_heap = {
                load_info: value
                for load_info, value in compile_time_heap.items()
                if load_info[1] != offset
            }
        elif op.name == "load":
            # ...
        opt_bb.append(op)
    return opt_bb

Great! Now our test passes.

This concludes the load optimization section of the post. We have modeled enough of loads and stores that we can eliminate redundant loads. Very cool. But we can go further.

Caching stores

Stores don't just invalidate information. They also give us new information! Any time we see an operation of the form v1 = store(v0, 8, 5) we also learn that load(v0, 8) == 5! Until it gets invalidated, anyway.

For example, in this test, we can eliminate the load from var0 at offset 0:

def test_load_after_store_removed():
    bb = Block()
    var0 = bb.getarg(0)
    bb.store(var0, 0, 5)
    var1 = bb.load(var0, 0)
    var2 = bb.load(var0, 1)
    bb.escape(var1)
    bb.escape(var2)
    opt_bb = optimize_load_store(bb)
    assert bb_to_str(opt_bb) == """\
var0 = getarg(0)
var1 = store(var0, 0, 5)
var2 = load(var0, 1)
var3 = escape(5)
var4 = escape(var2)"""

Making that work is thankfully not very hard; we need only add that new information to the compile-time heap after removing all the potentially-aliased info:

def optimize_load_store(bb: Block):
    opt_bb = Block()
    compile_time_heap: Dict[Tuple[Value, int], Value] = {}
    for op in bb:
        if op.name == "store":
            offset = get_num(op, 1)
            compile_time_heap = # ... as before ...
            obj = op.arg(0)
            new_value = op.arg(2)
            compile_time_heap[(obj, offset)] = new_value # NEW!
        elif op.name == "load":
            # ...
        opt_bb.append(op)
    return opt_bb

This makes the test pass. It makes another test fail, but only because--oops--we now know more. You can delete the old test because the new test supersedes it.

Now, note that we are not removing the store. This is because we have nothing in our optimizer that keeps track of what might have observed the side-effects of the store. What if the object got escaped? Or someone did a load later on? We would only be able to remove the store (continue) if we could guarantee it was not observable.

In our current framework, this only happens in one case: someone is doing a store of the exact same value that already exists in our compile-time heap. That is, either the same constant, or the same SSA value. If we see this, then we can completely skip the second store instruction.

Here's a test case for that, where we have gained information from the load instruction that we can then use to get rid of the store instruction:

def test_load_then_store():
    bb = Block()
    arg1 = bb.getarg(0)
    var1 = bb.load(arg1, 0)
    bb.store(arg1, 0, var1)
    bb.escape(var1)
    opt_bb = optimize_load_store(bb)
    assert bb_to_str(opt_bb) == """\
var0 = getarg(0)
var1 = load(var0, 0)
var2 = escape(var1)"""

Let's make it pass. To do that, first we'll make an equality function that works for both constants and operations. Constants are equal if their values are equal, and operations are equal if they are the identical (by address/pointer) operation.

def eq_value(left: Value|None, right: Value) -> bool:
    if isinstance(left, Constant) and isinstance(right, Constant):
        return left.value == right.value
    return left is right

This is a partial equality: if two operations are not equal under eq_value, it doesn't mean that they are different, only that we don't know that they are the same.

Then, after that, we need only check if the current value in the compile-time heap is the same as the value being stored. If it is, wonderful. No need to store. continue and don't append the operation to opt_bb:

def optimize_load_store(bb: Block):
    opt_bb = Block()
    compile_time_heap: Dict[Tuple[Value, int], Value] = {}
    for op in bb:
        if op.name == "store":
            obj = op.arg(0)
            offset = get_num(op, 1)
            store_info = (obj, offset)
            current_value = compile_time_heap.get(store_info)
            new_value = op.arg(2)
            if eq_value(current_value, new_value): # NEW!
                continue
            compile_time_heap = # ... as before ...
            # ...
        elif op.name == "load":
            load_info = (op.arg(0), get_num(op, 1))
            if load_info in compile_time_heap:
                op.make_equal_to(compile_time_heap[load_info])
                continue
            compile_time_heap[load_info] = op
        opt_bb.append(op)
    return opt_bb

This makes our load-then-store test pass, and it makes other tests pass too, like eliminating a store after another store!

def test_store_after_store():
    bb = Block()
    arg1 = bb.getarg(0)
    bb.store(arg1, 0, 5)
    bb.store(arg1, 0, 5)
    opt_bb = optimize_load_store(bb)
    assert bb_to_str(opt_bb) == """\
var0 = getarg(0)
var1 = store(var0, 0, 5)"""

Unfortunately, this only works if the values--constants or SSA values--are known to be the same. If we store different values, we can't optimize. In the live stream, we left this as an exercise for the viewer:

@pytest.mark.xfail
def test_exercise_for_the_reader():
    bb = Block()
    arg0 = bb.getarg(0)
    var0 = bb.store(arg0, 0, 5)
    var1 = bb.store(arg0, 0, 7)
    var2 = bb.load(arg0, 0)
    bb.escape(var2)
    opt_bb = optimize_load_store(bb)
    assert bb_to_str(opt_bb) == """\
var0 = getarg(0)
var1 = store(var0, 0, 7)
var2 = escape(7)"""

We would only be able to optimize this away if we had some notion of a store being dead. In this case, that is a store in which the value is never read before being overwritten.

Removing dead stores

TODO, I suppose. I have not gotten this far yet. If I get around to it, I will come back and update the post.

In the real world

This small optimization pass may seem silly or fiddly--when would we ever see something like this in a real IR?--but it's pretty useful. Here's the Ruby code that got me thinking about it again some years later for ZJIT:

class C
  def initialize
    @a = 1
    @b = 2
    @c = 3
  end
end

CRuby has a shape system and ZJIT makes use of it, so we end up optimizing this code (if it's monomorphic) into a series of shape checks and stores. The HIR might end up looking something like the mess below, where I've annotated the shape guards (can be thought of as loads) and stores with asterisks:

fn initialize@tmp/init.rb:3:
# ...
bb2(v6:BasicObject):
 v10:Fixnum[1] = Const Value(1)
 v31:HeapBasicObject = GuardType v6, HeapBasicObject
* v32:HeapBasicObject = GuardShape v31, 0x400000
* StoreField v32, :@a@0x10, v10
 WriteBarrier v32, v10
 v35:CShape[0x40008e] = Const CShape(0x40008e)
* StoreField v32, :_shape_id@0x4, v35
 v16:Fixnum[2] = Const Value(2)
 v37:HeapBasicObject = GuardType v6, HeapBasicObject
* v38:HeapBasicObject = GuardShape v37, 0x40008e
* StoreField v38, :@b@0x18, v16
 WriteBarrier v38, v16
 v41:CShape[0x40008f] = Const CShape(0x40008f)
* StoreField v38, :_shape_id@0x4, v41
 v22:Fixnum[3] = Const Value(3)
 v43:HeapBasicObject = GuardType v6, HeapBasicObject
* v44:HeapBasicObject = GuardShape v43, 0x40008f
* StoreField v44, :@c@0x20, v22
 WriteBarrier v44, v22
 v47:CShape[0x400090] = Const CShape(0x400090)
* StoreField v44, :_shape_id@0x4, v47
 CheckInterrupts
 Return v22

If we had store-load forwarding in ZJIT, we could get rid of the intermediate shape guards; they would know the shape from the previous StoreField instruction. If we had dead store elimination, we could get rid of the intermediate shape writes; they are never read. (And the repeated type guards checking that the object is still a heap object are just silly and need to be removed eventually.)

This is on the roadmap and will make object initialization even faster than it is right now.

Wrapping up

Thanks for reading the text version of the video that CF and I made a while back. Now you know how to do load/store elimination on traces.

I think this does not need too much extra work to get it going on full CFGs; a block is pretty much the same as a trace, so you can do a block-local version without much fuss. If you want to go global, you need dominator information and gen-kill sets.

Maybe check out the implementations in other compilers.

Maybe I will touch on this in a future post...

Thank you

Thank you to CF, who walked me through this live on a stream two years ago! This blog post wouldn't be possible without you.

  1. In this toy optimizer example, we are assuming that all reads and writes are the same size and different offsets don't overlap at all. This is often the case for managed runtimes, where object fields are pointer-sized and all reads/writes are pointer-aligned.

  2. We could do better. If we had type information, we could also use that to make alias classes. Writes to a List will never overlap with writes to a Map, for example. This requires your compiler to have strict aliasing--if you can freely cast between types, as in C, then this tactic goes out the window.

    This is called Type-based alias analysis (PDF).

December 24, 2025 · https://bernsteinbear.com/blog/toy-load-store/
ZJIT is now available in Ruby 4.0

Originally published on Rails At Scale.

ZJIT is a new just-in-time (JIT) Ruby compiler built into the reference Ruby implementation, YARV, by the same compiler group that brought you YJIT. We (Aaron Patterson, Aiden Fox Ivey, Alan Wu, Jacob Denbeaux, Kevin Menard, Max Bernstein, Maxime Chevalier-Boisvert, Randy Stauner, Stan Lo, and Takashi Kokubun) have been working on ZJIT since the beginning of this year.

In case you missed the last post, we're building a new compiler for Ruby because we want to both raise the performance ceiling (bigger compilation unit size and SSA IR) and encourage more outside contribution (by becoming a more traditional method compiler).

It's been a long time since we gave an official update on ZJIT. Things are going well. We're excited to share our progress with you. We've done a lot since May.

In brief

ZJIT is compiled by default--but not enabled by default--in Ruby 4.0. Enable it by passing the --zjit flag, setting the RUBY_ZJIT_ENABLE environment variable, or calling RubyVM::ZJIT.enable after starting your application.

It's faster than the interpreter, but not as fast as YJIT. Yet. But we have a plan, and we have some more specific numbers below. The TL;DR is we have a great new foundation and now need to pull out all the Ruby-specific stops to match YJIT.

We encourage you to experiment with ZJIT, but maybe hold off on deploying it in production for now. This is a very new compiler. You should expect crashes and wild performance degradations (or, perhaps, improvements). Please test locally, try to run CI, etc, and let us know what you run into on the Ruby issue tracker (or, if you don't want to make a Ruby Bugs account, we would also take reports on GitHub).

State of the compiler

To underscore how much has happened since the announcement of being merged into CRuby, we present to you a series of comparisons:

Side-exits

Back in May, we could not side-exit from JIT code into the interpreter. This meant that the code we were running had to continue to have the same preconditions (expected types, no method redefinitions, etc) or the JIT would safely abort. Now, we can side-exit and use this feature liberally.

For example, we gracefully handle the phase transition from integer to string; a guard instruction fails and transfers control to the interpreter.

def add x, y
 x + y
end

add 3, 4
add 3, 4
add 3, 4
add "three", "four"

This enables running a lot more code!

More code

Back in May, we could only run a handful of small benchmarks. Now, we can run all sorts of code, including passing the full Ruby test suite, the test suite and shadow traffic of a large application at Shopify, and the test suite of GitHub.com! Also a bank, apparently.

Back in May, we did not optimize much; we only really optimized operations on fixnums (small integers) and method sends to the main object. Now, we optimize a lot more: all sorts of method sends, instance variable reads and writes, attribute accessor/reader/writer use, struct reads and writes, object allocations, certain string operations, optional parameters, and more.

For example, we can constant-fold numeric operations. Because we also have a (small, limited) inliner borrowed from YJIT, we can constant-fold the entirety of add down to 3--and still handle redefinitions of one, two, Integer#+, ...

def one
 1
end

def two
 2
end

def add
 one + two
end

Register spilling

Back in May, we could not compile many large functions due to limitations of our backend that we borrowed from YJIT. Now, we can compile absolutely enormous functions just fine. And quickly, too. Though we have not been focusing specifically on compiler performance, we compile even large methods in under a millisecond.

C methods

Back in May, we could not even optimize calls to built-in C methods. Now, we have a feature similar to JavaScriptCore's DOMJIT, which allows us to emit inline HIR versions of certain well-known C methods. This allows the optimizer to reason about these methods and their effects (more on this in a future post) much more... er, effectively.

For example, Integer#succ, which is defined as adding 1 to an integer, is a C method. It's used in Integer#times to drive the while loop. Instead of emitting a call to it, our C method "inliner" can emit our existing FixnumAdd instruction and take advantage of the rest of the type inference and constant-folding.

fn inline_integer_succ(fun: &mut hir::Function,
                       block: hir::BlockId,
                       recv: hir::InsnId,
                       args: &[hir::InsnId],
                       state: hir::InsnId) -> Option<hir::InsnId> {
    if !args.is_empty() { return None; }
    if fun.likely_a(recv, types::Fixnum, state) {
        let left = fun.coerce_to(block, recv, types::Fixnum, state);
        let right = fun.push_insn(block, hir::Insn::Const { val: hir::Const::Value(VALUE::fixnum_from_usize(1)) });
        let result = fun.push_insn(block, hir::Insn::FixnumAdd { left, right, state });
        return Some(result);
    }
    None
}

Fewer C calls

Back in May, the machine code ZJIT generated called a lot of C functions from the CRuby runtime to implement our HIR instructions in LIR. We have pared this down significantly and now "open code" the implementations in LIR.

For example, GuardNotFrozen used to call out to rb_obj_frozen_p. Now, it requires that its input is a heap-allocated object and can instead do a load, a test, and a conditional jump.

fn gen_guard_not_frozen(jit: &JITState,
                        asm: &mut Assembler,
                        recv: Opnd,
                        state: &FrameState) -> Opnd {
    let recv = asm.load(recv);
    // It's a heap object, so check the frozen flag
    let flags = asm.load(Opnd::mem(64, recv, RUBY_OFFSET_RBASIC_FLAGS));
    asm.test(flags, (RUBY_FL_FREEZE as u64).into());
    // Side-exit if frozen
    asm.jnz(side_exit(jit, state, GuardNotFrozen));
    recv
}

More teammates

Back in May, we had four people working full-time on the compiler. Now, we have more internally at Shopify--and also more from the community! We have had several interested people reach out, learn about ZJIT, and successfully land complex changes. For this reason, we have opened up a chat room to discuss and improve ZJIT.

A cool graph visualization tool

You have to check out our intern Aiden's integration of Iongraph into ZJIT. Now we have clickable, zoomable, scrollable graphs of all our functions and all our optimization passes. It's great!

Try zooming (Ctrl-scroll), clicking the different optimization passes on the left, clicking the instruction IDs in each basic block (definitions and uses), and seeing how the IR for the below Ruby code changes over time.

class Point
  attr_accessor :x, :y
  def initialize x, y
    @x = x
    @y = y
  end
end

P = Point.new(3, 4).freeze

def test = P.x + P.y

More

...and so, so many garbage collection fixes.

There's still a lot to do, though.

To do

We're going to optimize invokeblock (yield) and invokesuper (super) instructions, each of which behaves similarly, but not identically, to a normal send instruction. These are pretty common.

We're going to optimize setinstancevariable in the case where we have to transition the object's shape. This will help normal @a = b situations. It will also help @a ||= b, but I think we can even do better with the latter using some kind of value numbering.

We only optimize monomorphic calls right now--cases where a method send only sees one class of receiver while being profiled. We're going to optimize polymorphic sends, too. Right now we're laying the groundwork (a new register allocator; see below) to make this much easier. It's not as much of an immediate focus, though, because most (high 80s, low 90s percent) of sends are monomorphic.

We're in the middle of re-writing the register allocator after reading the entire history of linear scan papers and several implementations. That will unlock performance improvements and also allow us to make the IRs easier to use.

We don't handle phase changes particularly well yet; if your method call patterns change significantly after your code has been compiled, we will frequently side-exit into the interpreter. Instead, we would like to use these side-exits as additional profile information and re-compile the function.

Right now we have a lot of traffic to the VM frame. JIT frame pushes are reasonably fast, but with every effectful operation, we have to flush our local variable state and stack state to the VM frame. The instances in which code might want to read this reified frame state are rare: frame unwinding due to exceptions, Binding#local_variable_get, etc. In the future, we will instead defer writing this state until it needs to be read.

We only have a limited inliner that inlines constants, self, and parameters. In the fullness of time, we will add a general-purpose method inlining facility. This will allow us to reduce the number of polymorphic sends, do some branch folding, and reduce the overall number of method sends.

We only support optimizing positional parameters, required keyword parameters, and optional parameters right now, but we will work on optimizing optional keyword arguments as well. Most of this work is in marshaling the complex Ruby calling convention into one coherent form that the JIT can understand.

Performance

We have public performance numbers for a selection of macro- and micro-benchmarks on rubybench. Here is a screenshot of what those per-benchmark graphs look like. The Y axis is speedup multiplier vs the interpreter and the X axis is time. Higher is better:

(Figure: a line chart of ZJIT performance on railsbench, shown as a speedup multiplier over the interpreter, improving over time, passing interpreter performance, and catching up to YJIT.)

You can see that we are improving performance on nearly all benchmarks over time. Some of this comes from optimizing in a similar way as YJIT does today (e.g. specializing ivar reads and writes), and some of it comes from optimizing in a way that takes advantage of ZJIT's high-level IR (e.g. constant folding, branch folding, more precise type inference).

We are using both raw time numbers and also our internal performance counters (e.g. number of calls to C functions from generated code) to drive optimization.

Try it out

While Ruby now ships with ZJIT compiled into the binary by default, it is not enabled by default at run-time. Due to performance and stability, YJIT is still the default compiler choice in Ruby 4.0.

If you want to run your test suite with ZJIT to see what happens, you absolutely can. Enable it by passing the --zjit flag, setting the RUBY_ZJIT_ENABLE environment variable, or calling RubyVM::ZJIT.enable after starting your application.

On YJIT

We devoted a lot of our resources this year to developing ZJIT. While we did not spend much time on YJIT (outside of a great allocation speed up), YJIT isn't going anywhere soon.

Thank you

This compiler was made possible by contributions to your PBS station open source project from programmers like you. Thank you!

  • Aaron Patterson
  • Abrar Habib
  • Aiden Fox Ivey
  • Alan Wu
  • Alex Rocha
  • Andre Luiz Tiago Soares
  • Benoit Daloze
  • Charlotte Wen
  • Daniel Colson
  • Donghee Na
  • Eileen Uchitelle
  • Etienne Barrie
  • Godfrey Chan
  • Goshanraj Govindaraj
  • Hiroshi SHIBATA
  • Hoa Nguyen
  • Jacob Denbeaux
  • Jean Boussier
  • Jeremy Evans
  • John Hawthorn
  • Ken Jin
  • Kevin Menard
  • Max Bernstein
  • Max Leopold
  • Maxime Chevalier-Boisvert
  • Nobuyoshi Nakada
  • Peter Zhu
  • Randy Stauner
  • Satoshi Tagomori
  • Shannon Skipper
  • Stan Lo
  • Takashi Kokubun
  • Tavian Barnes
  • Tobias Lutke

(via a lightly touched up git log --pretty="%an" zjit | sort -u)

December 24, 2025 · https://bernsteinbear.com/blog/launch-zjit/
How to annotate JITed code for perf/samply

Brief one today. I got asked "does YJIT/ZJIT have support for [Linux] perf?"

The answer is yes, and it also works with samply (including on macOS!), because both understand the perf map interface.

This is the entirety of the implementation in ZJIT1:

fn register_with_perf(iseq_name: String, start_ptr: usize, code_size: usize) {
    use std::io::Write;
    let perf_map = format!("/tmp/perf-{}.map", std::process::id());
    let Ok(file) = std::fs::OpenOptions::new().create(true).append(true).open(&perf_map) else {
        debug!("Failed to open perf map file: {perf_map}");
        return;
    };
    let mut file = std::io::BufWriter::new(file);
    let Ok(_) = writeln!(file, "{start_ptr:x} {code_size:x} zjit::{iseq_name}") else {
        debug!("Failed to write {iseq_name} to perf map file: {perf_map}");
        return;
    };
}

Whenever you generate a function, append a one-line entry consisting of

START SIZE symbolname

to /tmp/perf-{PID}.map. Per the Linux docs linked above,

START and SIZE are hex numbers without 0x.

symbolname is the rest of the line, so it could contain special characters.

You can now happily run perf record your_jit [...] or samply record your_jit [...] and have JIT frames be named in the output. We hide this behind the --zjit-perf flag to avoid file I/O overhead when we don't need it.

There is also the JIT dump interface

Perf map is the older way to interact with perf; a newer, more complicated way involves generating a "dump" file and then running perf inject on it.

  1. We actually use {:#x}, which I noticed today is wrong. {:#x} leaves in the 0x, and it shouldn't; instead use {:x}.

December 18, 2025 · https://bernsteinbear.com/blog/jit-perf-map/
A catalog of side effects

Optimizing compilers like to keep track of each IR instruction's effects. An instruction's effects vary wildly from having no effects at all, to writing a specific variable, to completely unknown (writing all state).

This post can be thought of as a continuation of What I talk about when I talk about IRs, specifically the section talking about asking the right questions. When we talk about effects, we should ask the right questions: not what opcode is this? but instead what effects does this opcode have?

Different compilers represent and track these effects differently. I've been thinking about how to represent these effects all year, so I have been doing some reading. In this post I will give some summaries of the landscape of approaches. Please feel free to suggest more.

Some background

Internal IR effect tracking is similar to the programming language notion of algebraic effects in type systems, but compilers keep track of finer-grained effects internally. Effects such as "writes to a local variable", "writes to a list", or "reads from the stack" indicate what instructions can be re-ordered, duplicated, or removed entirely.

For example, consider the following pseudocode for some made-up language that stands in for a snippet of compiler IR:

# ...
v = some_var[0]
another_var[0] = 5
# ...

The goal of effects is to communicate to the compiler if, for example, these two IR instructions can be re-ordered. The second instruction might write to a location that the first one reads. But it also might not! This is about knowing if some_var and another_var alias--if they are different names that refer to the same object.

We can sometimes answer that question directly, but often it's cheaper to compute an approximate answer: could they even alias? It's possible that some_var and another_var have different types, meaning that (as long as you have strict aliasing) the Load and Store operations that implement these reads and writes by definition touch different locations. And if they look at disjoint locations, there need not be any explicit order enforced.

Different compilers keep track of this information differently. The null effect analysis gives up and says "every instruction is maximally effectful" and therefore "we can't re-order or delete any instructions". That's probably fine for a first stab at a compiler, where you will get a big speed up purely based on strength reductions. Over-approximations of effects should always be valid.

But at some point you start wanting to do dead code elimination (DCE), or common subexpression elimination (CSE), or loads/store elimination, or move instructions around, and you start wondering how to represent effects. That's where I am right now. So here's a catalog of different compilers I have looked at recently.

There are two main ways I have seen to represent effects: bitsets and heap range lists. We'll look at one example compiler for each, talk a bit about tradeoffs, then give a bunch of references to other major compilers.

We'll start with Cinder, a Python JIT, because that's what I used to work on.

Cinder

Cinder tracks heap effects for its high-level IR (HIR) in instr_effects.h. Pretty much everything happens in the memoryEffects(const Instr& instr) function, which is expected to know everything about what effects the given instruction might have.

The data representation is a bitset encoding of a lattice called AliasClass, defined in alias_class.h. Each bit in the bitset represents a distinct location in the heap: reads from and writes to each of these locations are guaranteed not to affect any of the other locations.

Here is the X-macro that defines it:

#define HIR_BASIC_ACLS(X) \
 X(ArrayItem) \
 X(CellItem) \
 X(DictItem) \
 X(FuncArgs) \
 X(FuncAttr) \
 X(Global) \
 X(InObjectAttr) \
 X(ListItem) \
 X(Other) \
 X(TupleItem) \
 X(TypeAttrCache) \
 X(TypeMethodCache)

enum BitIndexes {
#define ACLS(name) k##name##Bit,
 HIR_BASIC_ACLS(ACLS)
#undef ACLS
};

Note that each bit implicitly represents a set: ListItem does not refer to a specific list index, but the infinite set of all possible list indices. It's any list index. Still, every list index is completely disjoint from, say, every entry in a global variable table.

(And, to be clear, an object in a list might be the same as an object in a global variable table. The objects themselves can alias. But the thing being written to or read from, the thing being side effected, is the container.)

Like other bitset lattices, it's possible to union the sets by or-ing the bits. It's possible to query for overlap by and-ing the bits.

class AliasClass {
 // The union of two AliasClass
 AliasClass operator|(AliasClass other) const {
 return AliasClass{bits_ | other.bits_};
 }

 // The intersection (overlap) of two AliasClass
 AliasClass operator&(AliasClass other) const {
 return AliasClass{bits_ & other.bits_};
 }
};

If this sounds familiar, it's because (as the repo notes) it's a similar idea to Cinder's type lattice representation.

Like other lattices, there is both a bottom element (no effects) and a top element (all possible effects):

#define HIR_OR_BITS(name) | k##name

#define HIR_UNION_ACLS(X) \
 /* Bottom union */ \
 X(Empty, 0) \
 /* Top union */ \
 X(Any, 0 HIR_BASIC_ACLS(HIR_OR_BITS)) \
 /* Memory locations accessible by managed code */ \
 X(ManagedHeapAny, kAny & ~kFuncArgs)

Union operations naturally hit a fixpoint at Any and intersection operations naturally hit a fixpoint at Empty.

All of this together lets the optimizer ask and answer questions such as:

  • where might this instruction write?
  • (because CPython is reference counted and incref implies ownership) where does this instruction borrow its input from?
  • do these two instructions' write destinations overlap?

and more.

Let's take a look at an (imaginary) IR version of the code snippet in the intro and see what analyzing it might look like in the optimizer. Here is the fake IR:

v0: Tuple = ...
v1: List = ...
v2: Int[5] = ...
# v = some_var[0]
v3: Object = LoadTupleItem v0, 0
# another_var[0] = 5
StoreListItem v1, 0, v2

You can imagine that LoadTupleItem declares that it reads from the TupleItem heap and StoreListItem declares that it writes to the ListItem heap. Because tuple and list pointers cannot be cast into one another and therefore cannot alias, these are disjoint heaps in our bitset. So ListItem & TupleItem == 0, and these memory operations can never interfere! They can (for example) be re-ordered arbitrarily.
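As a sketch of the check (in Rust rather than Cinder's C++, and with made-up bit assignments), the disjointness question is a single and:

// Toy version of the alias-class bitset: each bit is a disjoint heap.
const TUPLE_ITEM: u64 = 1 << 0;
const LIST_ITEM: u64 = 1 << 1;
const GLOBAL: u64 = 1 << 2;

// Two effect sets may interfere only if their bits overlap.
fn may_interfere(a: u64, b: u64) -> bool {
    a & b != 0
}

fn main() {
    // LoadTupleItem reads TupleItem; StoreListItem writes ListItem.
    assert!(!may_interfere(TUPLE_ITEM, LIST_ITEM)); // safe to reorder
    // A write to a union of heaps that includes TupleItem does overlap.
    assert!(may_interfere(TUPLE_ITEM, TUPLE_ITEM | LIST_ITEM | GLOBAL));
}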

In Cinder, these memory effects could in the future be used for instruction re-ordering, but they are today mostly used in two places: the refcount insertion pass and DCE.

DCE involves first finding the set of instructions that need to be kept around because they are useful/important/have effects. So here is what the Cinder DCE isUseful looks like:

bool isUseful(Instr& instr) {
 return instr.IsTerminator() || instr.IsSnapshot() ||
 (instr.asDeoptBase() != nullptr && !instr.IsPrimitiveBox()) ||
 (!instr.IsPhi() && memoryEffects(instr).may_store != AEmpty);
}

There are some other checks in there but memoryEffects is right there at the core of it!

Now that we have seen the bitset representation of effects and an implementation in Cinder, let's take a look at a different representation and an implementation in JavaScriptCore.

JavaScriptCore

I keep coming back to How I implement SSA form by Fil Pizlo, one of the significant contributors to JavaScriptCore (JSC). In particular, I keep coming back to the Uniform Effect Representation section. This notion of "abstract heaps" felt very... well, abstract. Somehow more abstract than the bitset representation. The pre-order and post-order integer pair as a way to represent nested heap effects just did not click.

It didn't make any sense until I actually went spelunking in JavaScriptCore and found one of several implementations--because, you know, JSC is six compilers in a trenchcoat[citation needed].

DFG, B3, DOMJIT, and probably others all have their own abstract heap implementations. We'll look at DOMJIT mostly because it's a smaller example and also illustrates something else that's interesting: builtins. We'll come back to builtins in a minute.

Let's take a look at how DOMJIT structures its abstract heaps: a YAML file.

DOM:
 Tree:
 Node:
 - Node_firstChild
 - Node_lastChild
 - Node_parentNode
 - Node_nextSibling
 - Node_previousSibling
 - Node_ownerDocument
 Document:
 - Document_documentElement
 - Document_body

It's a hierarchy. Node_firstChild is a subheap of Node is a subheap of... and so on. A write to any Node_nextSibling is a write to Node is a write to ... Sibling heaps are unrelated: Node_firstChild and Node_lastChild, for example, are disjoint.

To get a feel for this, I wired up a simplified version of ZJIT's bitset generator (for types!) to read a YAML document and generate a bitset. It generated the following Rust code:

mod bits {
 pub const Empty: u64 = 0u64;
 pub const Document_body: u64 = 1u64 << 0;
 pub const Document_documentElement: u64 = 1u64 << 1;
 pub const Document: u64 = Document_body | Document_documentElement;
 pub const Node_firstChild: u64 = 1u64 << 2;
 pub const Node_lastChild: u64 = 1u64 << 3;
 pub const Node_nextSibling: u64 = 1u64 << 4;
 pub const Node_ownerDocument: u64 = 1u64 << 5;
 pub const Node_parentNode: u64 = 1u64 << 6;
 pub const Node_previousSibling: u64 = 1u64 << 7;
 pub const Node: u64 = Node_firstChild | Node_lastChild | Node_nextSibling | Node_ownerDocument | Node_parentNode | Node_previousSibling;
 pub const Tree: u64 = Document | Node;
 pub const DOM: u64 = Tree;
 pub const NumTypeBits: u64 = 8;
}

It's not a fancy X-macro, but it's a short and flexible Ruby script.

Then I took the DOMJIT abstract heap generator--also funnily enough a short Ruby script--modified the output format slightly, and had it generate its int pairs:

mod bits {
 /* DOMJIT Abstract Heap Tree.
 DOM<0,8>:
 Tree<0,8>:
 Node<0,6>:
 Node_firstChild<0,1>
 Node_lastChild<1,2>
 Node_parentNode<2,3>
 Node_nextSibling<3,4>
 Node_previousSibling<4,5>
 Node_ownerDocument<5,6>
 Document<6,8>:
 Document_documentElement<6,7>
 Document_body<7,8>
 */
 pub const DOM: HeapRange = HeapRange { start: 0, end: 8 };
 pub const Tree: HeapRange = HeapRange { start: 0, end: 8 };
 pub const Node: HeapRange = HeapRange { start: 0, end: 6 };
 pub const Node_firstChild: HeapRange = HeapRange { start: 0, end: 1 };
 pub const Node_lastChild: HeapRange = HeapRange { start: 1, end: 2 };
 pub const Node_parentNode: HeapRange = HeapRange { start: 2, end: 3 };
 pub const Node_nextSibling: HeapRange = HeapRange { start: 3, end: 4 };
 pub const Node_previousSibling: HeapRange = HeapRange { start: 4, end: 5 };
 pub const Node_ownerDocument: HeapRange = HeapRange { start: 5, end: 6 };
 pub const Document: HeapRange = HeapRange { start: 6, end: 8 };
 pub const Document_documentElement: HeapRange = HeapRange { start: 6, end: 7 };
 pub const Document_body: HeapRange = HeapRange { start: 7, end: 8 };
}

It already comes with a little diagram, which is super helpful for readability.

An empty range represents empty heap effects: if the start and end are the same number, there are no effects. There is no single Empty value, but any empty range could be normalized to HeapRange { start: 0, end: 0 }.

Maybe this was obvious to you, dear reader, but this pre-order/post-order thing is about nested ranges! Seeing the output of the generator laid out clearly like this made it make a lot more sense for me.

What about checking overlap? Here is the implementation in JSC:

namespace WTF {
// Check if two ranges overlap assuming that neither range is empty.
template<typename T>
constexpr bool nonEmptyRangesOverlap(T leftMin, T leftMax, T rightMin, T rightMax)
{
 ASSERT_UNDER_CONSTEXPR_CONTEXT(leftMin < leftMax);
 ASSERT_UNDER_CONSTEXPR_CONTEXT(rightMin < rightMax);

 return leftMax > rightMin && rightMax > leftMin;
}

// Pass ranges with the min being inclusive and the max being exclusive.
template<typename T>
constexpr bool rangesOverlap(T leftMin, T leftMax, T rightMin, T rightMax) {
 ASSERT_UNDER_CONSTEXPR_CONTEXT(leftMin <= leftMax);
 ASSERT_UNDER_CONSTEXPR_CONTEXT(rightMin <= rightMax);

 // Empty ranges interfere with nothing.
 if (leftMin == leftMax)
 return false;
 if (rightMin == rightMax)
 return false;

 return nonEmptyRangesOverlap(leftMin, leftMax, rightMin, rightMax);
}
}

class HeapRange {
 bool overlaps(const HeapRange& other) const {
 return WTF::rangesOverlap(m_begin, m_end, other.m_begin, other.m_end);
 }
}

(See also How to check for overlapping intervals and Range overlap in two compares for more fun.)

While bitsets are a dense representation (you have to hold every bit), they are very compact and they are very precise. You can hold any number of combinations of 64 or 128 bits in a single register. The union and intersection operations are very cheap.

With int ranges, it's a little more complicated. An imprecise union of a and b can take the maximal range that covers both a and b. To get a more precise union, you have to keep track of both. In the worst case, if you want efficient arbitrary queries, you need to store your int ranges in an interval tree. So what gives?
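
To see that trade-off concretely, here's a rough Ruby illustration (bit positions and range bounds come from the generated code above; the ranges_overlap? helper is mine): a bitset union stays exact, while a single covering range over-approximates.

# Bitset union: exact. Node_firstChild | Document_body keeps only those two
# bits, so the union still does not overlap Node_lastChild.
node_first_child = 1 << 2
document_body    = 1 << 0
node_last_child  = 1 << 3
union_bits = node_first_child | document_body
p union_bits.anybits?(node_last_child)  # => false

# One covering range: imprecise. Widening [0, 1) and [7, 8) into [0, 8) now
# overlaps Node_lastChild's [3, 4) even though neither input did.
def ranges_overlap?(a, b) = a.begin < b.end && b.begin < a.end
union_range      = 0...8
last_child_range = 3...4
p ranges_overlap?(union_range, last_child_range)  # => true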

I asked Fil: if both bitsets and int ranges answer the same question, why use int ranges? He said that it's more flexible long-term: bitsets get expensive as soon as you need over 128 bits (you might need to heap allocate them!) whereas ranges have no such ceiling. But doesn't holding sequences of ranges require heap allocation? Well, despite Fil writing this in his SSA post:

The purpose of the effect representation baked into the IR is to provide a precise always-available baseline for alias information that is super easy to work with. [...] you can have instructions report that they read/write multiple heaps [...] you can have a utility function that produces such lists on demand.

It's important to note that this doesn't actually involve any allocation of lists. JSC does this very clever thing where they have "functors" that they pass in as arguments that compress/summarize what they want out of an instruction's effects.

Let's take a look at how the DFG (for example) uses these heap ranges in analysis. The DFG is structured in such a way that it can make use of the DOMJIT heap ranges directly, which is neat.

Note that AbstractHeap in the example below is a thin wrapper over the DFG compiler's own DOMJIT::HeapRange equivalent:

class AbstractHeapOverlaps {
public:
 AbstractHeapOverlaps(AbstractHeap heap)
 : m_heap(heap)
 , m_result(false)
 {
 }

 void operator()(AbstractHeap otherHeap) const
 {
 if (m_result)
 return;
 m_result = m_heap.overlaps(otherHeap);
 }

 bool result() const { return m_result; }

private:
 AbstractHeap m_heap;
 mutable bool m_result;
};

bool writesOverlap(Graph& graph, Node* node, AbstractHeap heap)
{
 NoOpClobberize noOp;
 AbstractHeapOverlaps addWrite(heap);
 clobberize(graph, node, noOp, addWrite, noOp);
 return addWrite.result();
}

clobberize is the function that calls these functors (noOp or addWrite in this case) for each effect that the given IR instruction node declares.

I've pulled some relevant snippets of clobberize, which is quite long, that I think are interesting.

First, some instructions (constants, here) have no effects. There's some utility in the def(PureValue(...)) call but I didn't fully understand it.

Then there are some instructions that conditionally have effects depending on the use types of their operands.1 Taking the absolute value of an Int32 or a Double is effect-free but otherwise looks like it can run arbitrary code.

Some run-time IR guards that might cause side exits are annotated as such--they write to the SideState heap.

Local variable instructions read specific heaps indexed by what looks like the local index but I'm not sure. This means accessing two different locals won't alias!

Instructions that allocate can't be re-ordered, it looks like; they both read and write the HeapObjectCount. This probably limits the amount of allocation sinking that can be done.

Then there's CallDOM, which is the builtins stuff I was talking about. We'll come back to that after the code block.

template<typename ReadFunctor, typename WriteFunctor, typename DefFunctor, typename ClobberTopFunctor>
void clobberize(Graph& graph, Node* node, const ReadFunctor& read, const WriteFunctor& write, const DefFunctor& def)
{
 // ...

 switch (node->op()) {
 case JSConstant:
 case DoubleConstant:
 case Int52Constant:
 def(PureValue(node, node->constant()));
 return;

 case ArithAbs:
 if (node->child1().useKind() == Int32Use || node->child1().useKind() == DoubleRepUse)
 def(PureValue(node, node->arithMode()));
 else
 clobberTop();
 return;

 case AssertInBounds:
 case AssertNotEmpty:
 write(SideState);
 return;

 case GetLocal:
 read(AbstractHeap(Stack, node->operand()));
 def(HeapLocation(StackLoc, AbstractHeap(Stack, node->operand())), LazyNode(node));
 return;

 case NewArrayWithSize:
 case NewArrayWithSizeAndStructure:
 read(HeapObjectCount);
 write(HeapObjectCount);
 return;

 case CallDOM: {
 const DOMJIT::Signature* signature = node->signature();
 DOMJIT::Effect effect = signature->effect;
 if (effect.reads) {
 if (effect.reads == DOMJIT::HeapRange::top())
 read(World);
 else
 read(AbstractHeap(DOMState, effect.reads.rawRepresentation()));
 }
 if (effect.writes) {
 if (effect.writes == DOMJIT::HeapRange::top()) {
 if (Options::validateDFGClobberize())
 clobberTopFunctor();
 write(Heap);
 } else
 write(AbstractHeap(DOMState, effect.writes.rawRepresentation()));
 }
 ASSERT_WITH_MESSAGE(effect.def == DOMJIT::HeapRange::top(), "Currently, we do not accept any def for CallDOM.");
 return;
 }
 }
}

(Remember that these AbstractHeap operations are very similar to DOMJIT's HeapRange with a couple more details--and in some cases even contain DOMJIT HeapRanges!)

This CallDOM node is the way for the DOM APIs in the browser--a significant chunk of the builtins, which are written in C++--to communicate what they do to the optimizing compiler. Without any annotations, the JIT has to assume that a call into C++ could do anything to the JIT state. Bummer!

But because, for example, Node.firstChild annotates what memory it reads from and what it doesn't write to, the JIT can optimize around it better--or even remove the access completely. It means the JIT can reason about calls to known builtins the same way that it reasons about normal JIT opcodes.

(Incidentally it looks like it doesn't even make a C call, but instead is inlined as a little memory read snippet using a JIT builder API. Neat.)

Last, we'll look at Simple, which has a slightly different take on all of this.

Simple

Simple is Cliff Click's pet Sea of Nodes (SoN) project to try and showcase the idea to the world--outside of a HotSpot C2 context.

This one is a little harder for me to understand but it looks like each translation unit has a StartNode that doles out different classes of memory nodes for each alias class. Each IR node then takes data dependencies on whatever effect nodes it might use.

Alias classes are split up based on the paper Type-Based Alias Analysis (PDF): "Our approach is a form of TBAA similar to the 'FieldTypeDecl' algorithm described in the paper."

The Simple project is structured into sequential implementation stages and alias classes come into the picture in Chapter 10.

Because I spent a while spelunking through other implementations to see how other projects did this, here is a list of the projects I looked at. Mostly, they use bitsets.

Other implementations

HHVM

HHVM, a JIT for the Hack language, also uses a bitset for its memory effects. See for example: alias-class.h and memory-effects.h.

HHVM has a couple places that use this information, such as a definition-sinking pass, alias analysis, DCE, store elimination, refcount opts, and more.

If you are wondering why the HHVM representation looks similar to the Cinder representation, it's because some former HHVM engineers such as Brett Simmers also worked on Cinder!

Android ART

(note that I am linking an ART fork on GitHub as a reference, but the upstream code is hosted on googlesource)

Android's ART Java runtime also uses a bitset for its effect representation. It's a very compact class called SideEffects in nodes.h.

The side effects are used in loop-invariant code motion, global value numbering, write barrier elimination, scheduling, and more.

.NET/CoreCLR

CoreCLR mostly uses a bitset for its SideEffectSet class. This one is interesting though because it also splits out effects specifically to include sets of local variables (LclVarSet).

V8

V8 is also about six completely different compilers in a trenchcoat.

Turboshaft uses a struct in operations.h called OpEffects which is two bitsets for reads/writes of effects. This is used in value numbering as well as a bunch of other small optimization passes they call "reducers".

Maglev also has this thing called NodeT::kProperties in its IR nodes that looks like a bitset and is used in its various reducers. It has effect query methods on it such as can_eager_deopt and can_write.

Until recently, V8 also used Sea of Nodes as its IR representation, which also tracks side effects more explicitly in the structure of the IR itself.

Guile

Guile Scheme looks like it has a custom tagging scheme type thing.

Conclusion

Both bitsets and int ranges are perfectly cromulent ways of representing heap effects for your IR. The Sea of Nodes approach is also probably okay since it powers HotSpot C2 and (for a time) V8.

Remember to ask the right questions of your IR when doing analysis.

Thank you

Thank you to Fil Pizlo for writing his initial GitHub Gist and sending me on this journey and thank you to Chris Gregory, Brett Simmers, and Ufuk Kayserilioglu for feedback on making some of the explanations more helpful.

  1. This is because the DFG compiler does this interesting thing where they track and guard the input types on use vs having types attached to the input's own def. It might be a clean way to handle shapes inside the type system while also allowing the type+shape of an object to change over time (which it can do in many dynamic language runtimes).

Tue, 11 Nov 2025 00:00:00 +0000 November 11, 2025 https://bernsteinbear.com/blog/compiler-effects/?utm_source=rss https://bernsteinbear.com/blog/compiler-effects/
Walking around the compiler

Walking around outside is good for you.[citation needed] A nice amble through the trees can quiet inner turbulence and make complex engineering problems disappear.

Vicki Boykis wrote a post, Walking around the app, about a more proverbial stroll. In it, she talks about constantly using your production application's interface to make sure the whole thing is cohesively designed with few rough edges.

She also talks about walking around other parts of the implementation of the application, fixing inconsistencies, complex machinery, and broken builds. Kind of like picking up someone else's trash on your hike.

That's awesome and universally good advice for pretty much every software project. It got me thinking about how I walk around the compiler.

What does your output look like?

There's a certain class of software project that transforms data--compression libraries, compilers, search engines--for which there's another layer of "walking around" you can do. You have the code, yes, but you also have non-trivial output.

By non-trivial, I mean an output that scales along some quality axis instead of something semi-regular like a JSON response. For compression, it's size. For compilers, it's generated code.

You probably already have some generated cases checked into your codebase as tests. That's awesome. I think golden tests are fantastic for correctness and for helping people understand. But that isolated understanding may not scale to more complex examples.

How does your compiler handle, for example, switch-case statements in loops? Does it do the jump threading you expect it to? Maybe you're sitting there idly wondering while you eat a cookie, but maybe that thought would only have occurred to you while you were scrolling through the optimizer.

An example

Say you are CF Bolz-Tereick and you are paging through PyPy IR. You notice some IR that looks like:

v0 = ...
v1 = float_abs v0
...
v2 = float_abs v1
...
v3 = float_abs v2

"Huh", you say to yourself, "surely the optimizer can reason that running float_abs on the result of float_abs is redundant!"

But some quirk in your optimizer means that it does not. Maybe it used to work, or maybe it never did. But this little stroll revealed a bug with a quick fix (adding a new peephole optimization function):

def optimize_FLOAT_ABS(self, op):
    v = get_box_replacement(op.getarg(0))
    arg_op = self.optimizer.as_operation(v)
    if arg_op is not None and arg_op.getopnum() == rop.FLOAT_ABS:
        self.make_equal_to(op, v)
    else:
        return self.emit(op)

Now, thankfully, your IR looks much better:

v0 = ...
v1 = float_abs v0
...
...

and you can check this in as a tidy test case:

def test_abs_abs_no(self):
    ops = """
    [f1]
    f2 = float_abs(f1)
    f3 = float_abs(f2)
    escape_f(f3)
    """
    expected = """
    [f1]
    f2 = float_abs(f1)
    escape_f(f2)
    """
    self.optimize_loop(ops, expected)

Fun fact: this was my first exposure to the PyPy project. CF walked me through fixing this bug1 live at ECOOP 2022! I had a great time.

Internal state

If checking (and, later, testing) your assumptions is tricky, that may be a sign that your library does not expose enough of its internal state to developers. That lack of visibility is a usability impediment: it prevents you from immediately checking your assumptions or suspicions.

For an excellent source of inspiration, see Kate's tweets about program internals.

Even if it does provide a flag like --zjit-dump-hir to print to the console, maybe this is hard to run from a phone2 or a friend's computer. For that, you may want friendlier tools.

Mechanical sympathy and the compiler explorer

The right kind of tool invites exploration.

Matthew Godbolt built the first friendly compiler explorer tool I used, the Compiler Explorer ("Godbolt"). It lets you type programs in many different languages into your web browser and immediately see the compiled result. It will even execute your programs, within reason.

This is a powerful tool:

  1. The feedback is near-instant and live updates on key-up.
  2. There is no fussing with the command line and file watching.
  3. Where possible, it highlights slices of source and compiled result to indicate what regions produced what output.
  4. It's open source and you can add your own compiler.

This combination lowers the barrier to check things tremendously.

Now, sometimes you want the reverse: a Compiler Explorer-like thing in your terminal or editor so you don't have to break flow. I unfortunately have not found a comparable tool.

In addition to the immediate effects of being able to spot-check certain inputs and outputs, continued use of these tools builds long-term intuition about the behavior of the compiler. It builds mechanical sympathy.

I haven't written a lot about mechanical sympathy other than my grad school statement of purpose (PDF) and a few brief internet posts, so I will leave you with that for now.

Every function is special

Your compiler likely compiles some applications and you can likely get access to the IR for the functions in that application.

Scroll through every function's optimized IR. If there are too many, maybe the top N functions' IRs. See what can be improved. Maybe you will see some unexpected patterns. Even if you don't notice anything in May, that could shift by August because of compiler advancements or a cool paper that you read in the intervening months.

One time I found a bizarre reference counting bug that was causing copy-on-write and potential memory issues by noticing that some objects that should have been marked "immortal" in the IR were actually being refcounted. The bug was not in the compiler, but far away in application setup code--and yet it was visible in the IR.

Love your tools

My conclusion is similar to Vicki's.

Put some love into your tools. Your colleagues will notice. Your users will notice. It might even improve your mood.

Acknowledgements

Thank you to CF for feedback on the post.

  1. The actual fix that checks for float_abs(float_abs(x)) and rewrites to float_abs(x).

  2. Just make sure to log off and touch grass.

Tue, 23 Sep 2025 00:00:00 +0000 September 23, 2025 https://bernsteinbear.com/blog/walking-around/?utm_source=rss https://bernsteinbear.com/blog/walking-around/
Linear scan with lifetime holes

In my last post, I explained a bit about how to retrofit SSA onto the original linear scan algorithm. I went over all of the details for how to go from low-level IR to register assignments--liveness analysis, scheduling, building intervals, and the actual linear scan algorithm.

Basically, we made it to 1997 linear scan, with small adaptations for allocating directly on SSA.

This time, we're going to retrofit lifetime holes.

Lifetime holes

Lifetime holes come into play because a linearized sequence of instructions is not a great proxy for storing or using metadata about a program originally stored as a graph.

According to Linear Scan Register Allocation on SSA Form (PDF, 2010):

The lifetime interval of a virtual register must cover all parts where this register is needed, with lifetime holes in between. Lifetime holes occur because the control flow graph is reduced to a list of blocks before register allocation. If a register flows into an else-block, but not into the corresponding if-block, the lifetime interval has a hole for the if-block.

Lifetime holes come from Quality and Speed in Linear-scan Register Allocation (PDF, 1998) by Traub, Holloway, and Smith. Figure 1, though not in SSA form, is a nice diagram for understanding how lifetime holes may occur. Unfortunately, the paper contains a rather sparse plaintext description of their algorithm that I did not understand how to apply to my concrete allocator.

Thankfully, other papers continued this line of research in (at least) 2002, 2005, and 2010. We will piece snippets from those papers together to understand what's going on.

Let's take a look at the sample IR snippet from Wimmer2010 to illustrate how lifetime holes form:

16: label B1(R10, R11):
18: jmp B2($1, R11)
 # vvvvvvvvvv #
20: label B2(R12, R13)
22: cmp R13, $1
24: branch lessThan B4() else B3()

26: label B3()
28: mul R12, R13 -> R14
30: sub R13, $1 -> R15
32: jump B2(R14, R15)

34: label B4()
 # ^^^^^^^^^^ #
36: add R10, R12 -> R16
38: ret R16

Virtual register R12 is not used between positions 28 and 34. For this reason, Wimmer's interval building algorithm assigns it the interval [[20, 28), [34, ...)]. Note how the interval has two disjoint ranges with space in the middle.

Our simplified interval building algorithm from last time gave us--in the same notation--the interval [[20, ...)] (well, [[20, 36)] in our modified snippet). This simplified interval only supports one range with no lifetime holes.

Ideally we would be able to use the physical register assigned to R12 for another virtual register in this empty slot! For example, maybe R14 or R15, which have short lifetimes that completely fit into the hole.

Another example is a control-flow diamond. In this example, B1 jumps to either B3 or B2, which then merge at B4. Virtual register R0 is defined in B1 and only used in one of the branches, B3. It's also not used in B4--if it were used in B4, it would be live in both B2 and B3!

Once we schedule it, the need for lifetime holes becomes more apparent:

0: label B1:
2: R0 = loadi $123
4: blt iftrue: →B3, iffalse: →B2

6: label B2:
8: R1 = loadi $456
10: R2 = add R1, $1
12: jump →B4

14: label B3:
16: R3 = mul R0, $2
18: jump →B4

20: label B4:
22: ret $5

Since B2 gets scheduled (in this case, arbitrarily) before B3, there's a gap where R0--which is completely unused in B2--would otherwise take up space in our simplified interval form. Let's fix that by adding some lifetime holes.

Even though we are adding some gaps between ranges, each interval still gets assigned one location for its entire life. It's just that in the gaps, we get to put other smaller intervals, like lichen growing between bricks.

To get lifetime holes, we have to modify our interval data structure a bit.

Finding lifetime holes

Our interval currently only supports a single range:

class Interval
 attr_reader :range
 def initialize = raise
 def add_range(from, to) = raise
 def set_from(from) = raise
end

We can change this to support multiple ranges by changing just one character!!!

class Interval
 attr_reader :ranges
 def initialize = raise
 def add_range(from, to) = raise
 def set_from(from) = raise
end

Har har. Okay, so we now have an array of Range instead of just a single Range. But now we have to implement the methods differently.

We'll start with initialize. The start state of an interval is an empty array of ranges:

class Interval
 def initialize
 @ranges = []
 end
end

Because we're iterating backwards through the blocks and backwards through instructions in each block, we'll be starting with instruction 38 and working our way linearly backwards until 16.

This means that we'll see later uses before earlier uses, and uses before defs. In order to keep the @ranges array in sorted order, we need to add each new range to the front. This is O(n) in an array, so use a deque or linked list. (Alternatively, push to the end and then reverse them afterwards.)

We keep the ranges in sorted order because it makes keeping them disjoint easier, as we'll see in add_range and set_from. Let's start with set_from since it's very similar to the previous version:

class Interval
 def set_from(from)
 if @ranges.empty?
 # @ranges is empty when we don't have a use of the vreg
 @ranges << Range.new(from, from)
 else
 @ranges[0] = Range.new(from, @ranges[0].end)
 end
 assert_sorted_and_disjoint
 end
end

add_range has a couple more cases, but we'll go through them step by step. First, a quick check that the range is the right way 'round:

class Interval
 def add_range(from, to)
 if to <= from
 raise ArgumentError, "Invalid range: #{from} to #{to}"
 end
 # ...
 end
end

Then we have a straightforward case: if we don't have any ranges yet, add a brand new one:

class Interval
 def add_range(from, to)
 # ...
 if @ranges.empty?
 @ranges << Range.new(from, to)
 return
 end
 # ...
 end
end

But if we do have ranges, this new range might be totally subsumed by the existing first range. This happens if a virtual register is live for the entirety of a block and also used inside that block. The uses that cause an add_range don't add any new information:

class Interval
 def add_range(from, to)
 # ...
 if @ranges.first.cover?(from..to)
 assert_sorted_and_disjoint
 return
 end
 # ...
 end
end

Another case is that the new range has a partial overlap with the existing first range. This happens when we're adding ranges for all of the live-out virtual registers; the range for the predecessor block (say [4, 8]) will abut the range for the successor block (say [8, 12]). We merge these ranges into one big range (say, [4, 12]):

class Interval
 def add_range(from, to)
 # ...
 if @ranges.first.cover?(to)
 @ranges[0] = Range.new(from, @ranges.first.end)
 assert_sorted_and_disjoint
 return
 end
 # ...
 end
end

The last case is the case that gives us lifetime holes and happens when the new range is already completely disjoint from the existing first range. That is also a straightforward case: put the new range in at the start of the list.

class Interval
 def add_range(from, to)
 # ...
 # TODO(max): Use a linked list or deque or something to avoid O(n) insertions
 @ranges.insert(0, Range.new(from, to))
 assert_sorted_and_disjoint
 # ...
 end
end
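
The snippets above lean on assert_sorted_and_disjoint, which I haven't shown. Here's a minimal sketch of roughly what such a check might look like (treating range ends as exclusive; not necessarily the real implementation):

class Interval
  # Sanity check (a sketch): every range runs forwards, ranges are sorted by
  # start, and consecutive ranges do not overlap.
  def assert_sorted_and_disjoint
    @ranges.each do |r|
      raise "Backwards range #{r}" if r.end < r.begin
    end
    @ranges.each_cons(2) do |left, right|
      if right.begin < left.end
        raise "Ranges overlap or are out of order: #{left} then #{right}"
      end
    end
  end
end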

This is all fine and good. I added this to the register allocator to test out the lifetime hole finding but kept the rest the same (changed the APIs slightly so the interval could pretend it was still one big range). The tests passed. Neat!

I also verified that the lifetime holes were what we expected. This means our build_intervals function works unmodified with the new Interval implementation. That makes sense, given that we copied the implementation off of Wimmer2010, which can deal with lifetime holes.

Now we would like to use this new information in the register allocator.

Modified linear scan

It took a little bit of untangling, but the required modifications to support lifetime holes in the register assignment phase are not too invasive. To get an idea of the difference, I took the original Poletto1999 (PDF) algorithm and rewrote it in the style of the Mossenbock2002 (PDF) algorithm.

For example, here is Poletto1999:

LinearScanRegisterAllocation
active ← {}
foreach live interval i, in order of increasing start point
 ExpireOldIntervals(i)
 if length(active) = R then
 SpillAtInterval(i)
 else
 register[i] ← a register removed from pool of free registers
 add i to active, sorted by increasing end point

ExpireOldIntervals(i)
foreach interval j in active, in order of increasing end point
 if endpoint[j] >= startpoint[i] then
 return
 remove j from active
 add register[j] to pool of free registers

SpillAtInterval(i)
spill ← last interval in active
if endpoint[spill] > endpoint[i] then
 register[i] ← register[spill]
 location[spill] ← new stack location
 remove spill from active
 add i to active, sorted by increasing end point
else
 location[i] ← new stack location

And here it is again, reformatted a bit. The implicit unhandled and handled sets that don't get names in Poletto1999 now get names. ExpireOldIntervals is inlined and SpillAtInterval gets a new name:

LINEARSCAN()
unhandled ← all intervals in increasing order of their start points
active ← {}; handled ← {}
free ← set of available registers
while unhandled ≠ {} do
 cur ← pick and remove the first interval from unhandled
 //----- check for active intervals that expired
 for each interval i in active do
 if i ends before cur.beg then
 move i to handled and add i.reg to free

 //----- collect available registers in f
 f ← free

 //----- select a register from f
 if f = {} then
 ASSIGNMEMLOC(cur) // see below
 else
 cur.reg ← any register in f
 free ← free - {cur.reg}
 move cur to active

ASSIGNMEMLOC(cur: Interval)
spill ← last interval in active
if spill.end > cur.end then
 cur.reg ← spill.reg
 spill.location ← new stack location
 move spill from active to handled
 move cur to active
else
 cur.location ← new stack location

Now we can pick out all of the bits of Mossenbock2002 that look like they are responsible for dealing with lifetime holes.

For example, the algorithm now has a fourth set, inactive. This set holds intervals whose holes contain the current interval's start position. These intervals are assigned registers that are potential candidates for the current interval to live in (more on this in a sec).

I say potential candidates because in order for them to be a home for the current interval, an inactive interval has to be completely disjoint from the current interval. If they overlap at all--in any of their ranges--then we would be trying to put two virtual registers into one physical register at the same program point. That's a bad compile.
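
That disjointness test is just a walk over the two sorted range lists. Here's a sketch of how it might look on our Ruby Interval (the overlaps? method is mine; it assumes both intervals keep their @ranges sorted and disjoint, with exclusive ends):

class Interval
  # Sketch: does any range of self overlap any range of other? Since both
  # lists are sorted and disjoint, a merge-style walk is enough.
  def overlaps?(other)
    i = j = 0
    while i < @ranges.size && j < other.ranges.size
      a = @ranges[i]
      b = other.ranges[j]
      return true if a.begin < b.end && b.begin < a.end
      # Whichever range ends first cannot overlap anything later in the
      # other list, so advance past it.
      if a.end <= b.end
        i += 1
      else
        j += 1
      end
    end
    false
  end
end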

We have to do a little extra bookkeeping in ASSIGNMEMLOC because now one physical register can be assigned to more than one interval that is still in the middle of being processed (active and inactive sets). If we choose to spill, we have to make sure that all conflicting uses of the register (intervals that overlap with the current interval) get reassigned locations.

LINEARSCAN()
unhandled ← all intervals in increasing order of their start points
active ← {}; handled ← {}
inactive ← {}
free ← set of available registers
while unhandled ≠ {} do
 cur ← pick and remove the first interval from unhandled
 //----- check for active intervals that expired
 for each interval i in active do
 if i ends before cur.beg then
 move i to handled and add i.reg to free
 else if i does not overlap cur.beg then
 move i to inactive and add i.reg to free
 //----- check for inactive intervals that expired or become reactivated
 for each interval i in inactive do
 if i ends before cur.beg then
 move i to handled
 else if i overlaps cur.beg then
 move i to active and remove i.reg from free

 //----- collect available registers in f
 f ← free
 for each interval i in inactive that overlaps cur do f ← f - {i.reg}

 //----- select a register from f
 if f = {} then
 ASSIGNMEMLOC(cur) // see below
 else
 cur.reg ← any register in f
 free ← free - {cur.reg}
 move cur to active

ASSIGNMEMLOC(cur: Interval)
spill ← heuristic: pick some interval from active or inactive
if spill.end > cur.end then
 r = spill.reg
 conflicting = set of active or inactive intervals with register r that
 overlap with cur
 move all intervals in conflicting to handled
 assign memory locations to them
 cur.reg ← r
 move cur to active
else
 cur.location ← new stack location

Note that this begins to depart from strictly linear (time) linear scan: the inactive set is bounded not by the number of physical registers but instead by the number of virtual registers. Mossenbock2002 notes that the size of the inactive set is generally very small, though, so "linear in practice".

EDIT: After re-reading Wimmer2010, I noticed that they say:

[...] introduced non-linear parts. Two of them are highlighted in Figure 6 where the set of inactive intervals is iterated. The set can contain an arbitrary number of intervals since it is not bound by the number of physical registers. Testing the current interval for intersection with all of them can therefore be expensive.

When the lifetime intervals are created from code in SSA form, this test is not necessary anymore: All intervals in inactive start before the current interval, so they do not intersect with the current interval at their definition. They are inactive and thus have a lifetime hole at the current position, so they do not intersect with the current interval at its definition. SSA form therefore guarantees that they never intersect [7], making the entire loop that tests for intersection unnecessary.

Unfortunately, splitting of intervals leads to intervals that no longer adhere to the SSA form properties because it destroys SSA form. Therefore, the intersection test cannot be omitted completely; it must be performed if the current interval has been split off from another interval.

Which indicates to me that we may actually be able to leave off that loop over the inactive intervals after all? Unclear. I'll have to come back and mess with this later.

I left out the parts about register weights that are heuristics to improve register allocation. They are not core to supporting lifetime holes. You can add them back in if you like.

Here is a text diff to make it clear what changed:

diff --git a/tmp/lsra b/tmp/lsra-holes
index e9de35b..de79a63 100644
--- a/tmp/lsra
+++ b/tmp/lsra-holes
@@ -1,6 +1,7 @@
 LINEARSCAN()
 unhandled ← all intervals in increasing order of their start points
 active ← {}; handled ← {}
+inactive ← {}
 free ← set of available registers
 while unhandled ≠ {} do
 cur ← pick and remove the first interval from unhandled
@@ -8,9 +9,18 @@ while unhandled ≠ {} do
 for each interval i in active do
 if i ends before cur.beg then
 move i to handled and add i.reg to free
+ else if i does not overlap cur.beg then
+ move i to inactive and add i.reg to free
+ //----- check for inactive intervals that expired or become reactivated
+ for each interval i in inactive do
+ if i ends before cur.beg then
+ move i to handled
+ else if i overlaps cur.beg then
+ move i to active and remove i.reg from free

 //----- collect available registers in f
 f ← free
+ for each interval i in inactive that overlaps cur do f ← f - {i.reg}

 //----- select a register from f
 if f = {} then
@@ -23,10 +33,10 @@ while unhandled ≠ {} do
 ASSIGNMEMLOC(cur: Interval)
-spill ← last interval in active
+spill ← heuristic: pick some interval from active or inactive
 if spill.end > cur.end then
- cur.reg ← spill.reg
- spill.location ← new stack location
- move spill from active to handled
+ r = spill.reg
+ conflicting = set of active or inactive intervals with register r that
+ overlap with cur
+ move all intervals in conflicting to handled
+ assign memory locations to them
+ cur.reg ← r
 move cur to active
 else
 cur.location ← new stack location

This reformatting and diffing made it much easier for me to reason about what specifically had to be changed.

There's just one thing left after register assignment: resolution and SSA deconstruction.

Resolution and SSA destruction

I'm pretty sure we can actually just keep the resolution the same. In our resolve function, we are only making sure that the block arguments get parallel-moved into the block parameters. That hasn't changed.

Wimmer2010 says:

Linear scan register allocation with splitting of lifetime intervals requires a resolution phase after the actual allocation. Because the control flow graph is reduced to a list of blocks, control flow is possible between blocks that are not adjacent in the list. When the location of an interval is different at the end of the predecessor and at the start of the successor, a move instruction must be inserted to resolve the conflict.

That's great news for us: we don't do splitting. An interval, though it has lifetime holes, still only ever has one location for its entire life. So once an interval begins, we don't need to think about moving its contents.

So I was actually overly conservative in the previous post, which I have amended!

Fixed intervals and register constraints?

Mossenbock2002 also tackles register constraints with this notion of "fixed intervals"--intervals that have been pre-allocated physical registers.

Since I eventually want to use "register hinting" from Wimmer2005 and Wimmer2010, I'm going to ignore the fixed interval part of Mossenbock2002 for now. It seems like they work nicely together.

Wrapping up

We added lifetime holes to our register allocator without too much effort. This better maps the graph-like nature of the IR onto the linear sequence of instructions and should get us some better allocation for short-lived virtual registers.

Maybe next time we will add interval splitting, which will help us a) address ABI constraints more cleanly in function calls and b) remove the dependence on reserving a scratch register.

Sun, 24 Aug 2025 00:00:00 +0000 August 24, 2025 https://bernsteinbear.com/blog/linear-scan-lifetime-holes/?utm_source=rss https://bernsteinbear.com/blog/linear-scan-lifetime-holes/
Liveness analysis with Datalog

After publishing Linear scan register allocation on SSA, I had a nice call with Waleed Khan where he showed me how to Datalog. He thought it might be useful to try implementing liveness analysis as a Datalog problem.

We started off with the Wimmer2010 CFG example from that post, sketching out manually which variables were live out of each block: R10 out of B1, R12 out of B2, etc.

The graph from Wimmer2010 has come back! Remember, we're using block arguments instead of phis, so B1(R10, R11) defines R10 and R11 before the first instruction in B1.

Then we tried to formulate liveness as a Datalog relation.

Liveness is normally (at least for me) defined in terms of two relations: live-in and live-out. Live-out is "what is needed" from all of the successors of a block and live-in is the "what is needed" summary for a block. So, in fake math notation:

live-out(b) = union(live-in(s) for each successor s of b)
live-in(b) = (live-out(b) + used(b)) - defined(b)

where each of the component parts of that expression represent sets of variables:

  • used(b) is the set of variables referenced as in-operands to instructions in a block
  • defined(b) is the set of variables defined by instructions in a block

We ended up computing the live-in sets for blocks in the register allocator post but then using the live-out sets instead. So today let's compute both live-in and live-out sets with Datalog!

Datalog

Datalog is a logic programming language. It probably looks and feels different from every other programming language you have used... except for maybe SQL. It might feel similar to SQL, except SQL has a certain order to it that Datalog does not.

We'll be using Souffle here because Waleed mentioned it and also I learned a bit about it in my databases class.

The thing you do first is define your relations, which are what Datalog calls tables.

In this case, if we want to compute liveness information, we have to know information about what a block uses, defines, and what successors it has.

First, the thing you have to know about Datalog is that it's kind of like the opposite of array programming. We're going to express things about sets by expressing facts about individual items in a set.

For example, we're not going to say "this block B4 uses [R10, R12, R16]". We're going to say three separate facts: "B4 uses R10", "B4 uses R12", "B4 uses R16". You can think about it like each relation being a database table where each parameter is a column name.

Here are the relations for block uses, block defs, and which blocks follow other blocks:

// liveness.dl
.decl block_use(block:symbol, var:symbol)
.decl block_def(block:symbol, var:symbol)
.decl block_succ(succ:symbol, pred:symbol)

Where symbol here means string.

We can then embed some facts inline. For example, this says "A defines R0 and R1 and uses R0":

block_def("A", "R0").
block_def("A", "R1").
block_use("A", "R0").

You can also provide facts as a TSV, but this file format is so irritating to construct manually (and has given me silently wrong answers in Souffle before) that I am not doing that for this example.

You can, for your edification, manually encode all the use/def/successor facts from the previous post into Souffle--or you can copy this chunk into your file:

// liveness.dl
// ...
block_def("B1", "R10").
block_def("B1", "R11").
block_use("B1", "R11").

block_def("B2", "R12").
block_def("B2", "R13").
block_use("B2", "R13").

block_def("B3", "R14").
block_def("B3", "R15").
block_use("B3", "R12").
block_use("B3", "R13").
block_use("B3", "R14").
block_use("B3", "R15").

block_def("B4", "R16").
block_use("B4", "R16").
block_use("B4", "R10").
block_use("B4", "R12").

block_succ("B2", "B1").
block_succ("B3", "B2").
block_succ("B2", "B3").
block_succ("B4", "B2").

We can declare our live-in and live-out relations similarly to our use/def/succ relations. We mark them as being .output so that Souffle presents us with the results.

// liveness.dl
// ...
.decl live_out(block:symbol, var:symbol)
.output live_out
.decl live_in(block:symbol, var:symbol)
.output live_in

Now it's time to define our relations. You may notice that the Souffle definitions look very similar to our earlier definitions. This is no mistake; Datalog was created for dataflow and graph problems.

We'll start with live-out:

// liveness.dl
// ...
live_out(b, v) :- block_succ(s, b), live_in(s, v).

We read this left to right as "a variable v is live-out of block b if block s is a successor of b and v is live-in to s". The :- defines the left side in terms of the right side. The comma between block_succ and live_in means it's a conjunction--and.

Where's the union? Well, remember what I said about array programming? We're not thinking in terms of sets. We're thinking one program variable at a time. As Souffle executes our relations, live_out will incrementally build up a table.

It's also a little weird to program in this style because s wasn't textually defined anywhere like a parameter or a variable. You kind of have to think of s as a connector, a binder, a foreign key--what have you. It's a placeholder. (I don't know how to explain this well. Sorry.)

Then we can define live-in. This on the surface looks more complicated but I think that is only because of Souffle's choice of syntax.

// liveness.dl
// ...
live_in(b, v) :- (live_out(b, v) ; block_use(b, v)), !block_def(b, v).

It reads as "a variable v is live-in to b if it is either live-out of b or used in b, and not defined in b". The semicolons are disjunctions--or--and the exclamation points negations--not.

These relations look endlessly mutually recursive but you have to keep in mind that we're not running functions on data, exactly. We're declaratively expressing definitions of rules--relations. block_use(b, v) in the body of live_in is not calling a function but instead making a query--is the row (b, v) in the table block_use? Datalog builds the tables until saturation.
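
If it helps to picture what "until saturation" means, here is a tiny Ruby sketch of naive evaluation for just these two rules--this is not how Souffle works internally, just the mental model, with a few of the facts from above filled in:

require "set"

# A few of the facts from above, as [block, var] and [succ, pred] tuples.
block_use  = Set[["B4", "R10"], ["B2", "R13"]]
block_def  = Set[["B1", "R10"], ["B2", "R13"], ["B4", "R16"]]
block_succ = Set[["B4", "B2"], ["B2", "B1"]]

live_in  = Set.new
live_out = Set.new

# Keep applying both rules until neither table grows: that is saturation.
loop do
  before = live_in.size + live_out.size
  # live_out(b, v) :- block_succ(s, b), live_in(s, v).
  block_succ.each do |s, b|
    live_in.each { |s2, v| live_out << [b, v] if s2 == s }
  end
  # live_in(b, v) :- (live_out(b, v) ; block_use(b, v)), !block_def(b, v).
  (live_out | block_use).each do |b, v|
    live_in << [b, v] unless block_def.include?([b, v])
  end
  break if live_in.size + live_out.size == before
end

p live_in.to_a   # includes ["B4", "R10"] and ["B2", "R10"]
p live_out.to_a  # includes ["B2", "R10"] and ["B1", "R10"]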

Now we can run Souffle! We tell it to dump to standard output with -D- but you could just as easily have it dump each output relation in its own separate file in the current directory by specifying -D..

$ souffle -D- liveness.dl
---------------
live_in
block var
===============
B2 R10
B3 R10
B3 R12
B3 R13
B4 R10
B4 R12
===============
---------------
live_out
block var
===============
B1 R10
B2 R10
B2 R12
B2 R13
B3 R10
===============
$

That's neat. We got nicely formatted tables and it only took us two lines of code! Let's compare to our Ruby code from the previous post to underscore the point:

def analyze_liveness
 order = post_order
 gen, kill = compute_initial_liveness_sets(order)
 live_in = Hash.new 0
 changed = true
 while changed
 changed = false
 for block in order
 block_live = block.successors.map { |succ| live_in[succ] }.reduce(0, :|)
 block_live |= gen[block]
 block_live &= ~kill[block]
 if live_in[block] != block_live
 changed = true
 live_in[block] = block_live
 end
 end
 end
 live_in
end

This is because we have separated the iteration-to-fixpoint bit from the main bit of the dataflow analysis: the equation. If we let Datalog do the data movement for us, we can work on defining the rules--and only the rules.

This is probably why, in the fullness of time, many static analysis and compiler tools end up growing some kind of embedded (partial) Datalog engine. Call it Scholz's tenth rule.

Souffle also has the ability to compile to C++, which gives you two nice things:

  1. you can probably get faster execution
  2. you can use it from an existing C++ program

I don't have any experience with this API.

This is when Waleed mentioned offhandedly that he had heard about some embedded Rust datalog called Ascent.

Rust

The front page of the Ascent website is a really great sell if you show up thinking "gee, I wish I had Datalog to use in my Rust program". Right out the gate, you get reasonable-enough Datalog syntax via a proc macro.

For example, here is the canonical path example for Souffle:

.decl edge(x:number, y:number)
.decl path(x:number, y:number)

path(x, y) :- edge(x, y).
path(x, y) :- path(x, z), edge(z, y).

and in Ascent:

ascent! {
 relation edge(i32, i32);
 relation path(i32, i32);

 path(x, y) <-- edge(x, y);
 path(x, z) <-- edge(x, y), path(y, z);
}

Super.

We weren't sure if the Souffle liveness would port cleanly to Rust, but it sure did! It even lets you use your own datatypes instead of just i32 (which the front-page example uses).

use ascent::ascent;

#[derive(Clone, PartialEq, Eq, Hash, Copy)]
struct BlockId(i32);

impl std::fmt::Debug for BlockId {
 fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
 write!(f, "B{}", self.0)
 }
}

#[derive(Clone, PartialEq, Eq, Hash, Copy)]
struct VarId(i32);

impl std::fmt::Debug for VarId {
 fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
 write!(f, "R{}", self.0)
 }
}

ascent! {
 relation block_use(BlockId, VarId);
 relation block_def(BlockId, VarId);
 relation block_succ(BlockId, BlockId); // (succ, pred)
 relation live_out(BlockId, VarId);
 relation live_in(BlockId, VarId);
 live_out(b, v) <-- block_succ(s, b), live_in(s, v);
 live_in(b, v) <-- (live_out(b, v) | block_use(b, v)), !block_def(b, v);
}
fn main() {
 // ...
}

Notice how we don't have an input or output annotation like we did in Datalog. That's because this is designed to be embedded in an existing program, which probably doesn't want to deal with the disk (or at least wants to read/write in its own format).

Ascent lets us give it some vectors of data and then at the end lets us read some vectors of data too.

// ...
fn main() {
 let mut prog = AscentProgram::default();
 let b1 = BlockId(1);
 let b2 = BlockId(2);
 let b3 = BlockId(3);
 let b4 = BlockId(4);
 let r10 = VarId(10);
 let r11 = VarId(11);
 let r12 = VarId(12);
 let r13 = VarId(13);
 let r14 = VarId(14);
 let r15 = VarId(15);
 let r16 = VarId(16);
 prog.block_def = vec![
 (b1, r10),
 (b1, r11),
 (b2, r12),
 (b2, r13),
 (b3, r14),
 (b3, r15),
 (b4, r16),
 ];
 prog.block_succ = vec![
 (b2, b1),
 (b3, b2),
 (b2, b3),
 (b4, b2),
 ];
 prog.block_use = vec![
 (b1, r11),
 (b2, r13),
 (b3, r12),
 (b3, r13),
 (b3, r14),
 (b3, r15),
 (b4, r10),
 (b4, r12),
 (b4, r16),
 ];
 prog.run();
 println!("live out: {:?}", prog.live_out);
 println!("live in: {:?}", prog.live_in);
}

Then we need only run cargo add ascent and cargo run--both of which worked with zero issues--and see the results.

$ cargo run
 Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.02s
 Running `target/debug/liveness`
live out: [(B2, R12), (B2, R13), (B2, R10), (B1, R10), (B3, R10)]
live in: [(B3, R12), (B3, R13), (B4, R10), (B4, R12), (B2, R10), (B3, R10)]
$

It's not a fancy looking table, but it's very close to my program, which is neat.

This is similar to embedding Souffle in C++ and then calling the C++. One difference, though, is the Souffle process has two steps. It's a slight build system complication. But this isn't meant to be a Datalog comparison post!

More?

Can we model all of linear scan this way? Maybe. I'm new to all this stuff.

Ascent also seems to support lattices, which means we can use it to do abstract interpretation on some cool domains.

Maxime Chevalier-Boisvert and I prototyped loupe, an interprocedural type analysis in Rust. We had to build our own iterate-to-fixpoint engine, which was non-trivial. I wonder how it would look to build something similar on top of Ascent.

I kind of want to check out Frank McSherry's datatoad.

Wrapping up

That's all for now, folks. Just a couple Datalog snippets. Happy hacking.

Wed, 13 Aug 2025 00:00:00 +0000 August 13, 2025 https://bernsteinbear.com/blog/liveness-datalog/?utm_source=rss https://bernsteinbear.com/blog/liveness-datalog/
Linear scan register allocation on SSA

Much of the code and education that resulted in this post happened with Aaron Patterson.

The fundamental problem in register allocation is to take an IR that uses virtual registers (as many as you like) and rewrite it to use a finite number of physical registers and stack space1.

This is an example of a code snippet using virtual registers:

add R1, R2 -> R3
add R1, R3 -> R4
ret R4

And here is the same example after it has been passed through a register allocator (note that Rs changed to Ps):

add Stack[0], P0 -> P1
add Stack[0], P1 -> P0
ret

Each virtual register was assigned a physical place: R1 to the stack, R2 to P0, R3 to P1, and R4 also to P0 (since we weren't using R2 anymore).

People use register allocators like they use garbage collectors: it's an abstraction that can manage your resources for you, maybe with some cost. When writing the back-end of a compiler, it's probably much easier to have a separate register-allocator-in-a-box than manually managing variable lifetimes while also considering all of your different target architectures.

How do JIT compilers do register allocation? Well, "everyone knows" that "every JIT does its own variant of linear scan"2. This bothered me for some time because I've worked on a couple of JITs and still didn't understand the backend bits.

There are a couple different approaches to register allocation, but in this post we'll focus on linear scan of SSA.

I started reading Linear Scan Register Allocation on SSA Form (PDF, 2010) by Wimmer and Franz after writing A catalog of ways to generate SSA. Reading alone didn't make a ton of sense--I ended up with a lot of very frustrated margin notes. I started trying to implement it alongside the paper. As it turns out, though, there is a rich history of papers in this area that it leans on really heavily. I needed to follow the chain of references!

For example, here is a lovely explanation of the process, start to finish, from Christian Wimmer's Master's thesis (PDF, 2004).

LINEAR_SCAN
 // order blocks and operations (including loop detection)
 COMPUTE_BLOCK_ORDER
 NUMBER_OPERATIONS
 // create intervals with live ranges
 COMPUTE_LOCAL_LIVE_SETS
 COMPUTE_GLOBAL_LIVE_SETS
 BUILD_INTERVALS
 // allocate registers
 WALK_INTERVALS
 RESOLVE_DATA_FLOW
 // replace virtual registers with physical registers
 ASSIGN_REG_NUM
 // special handling for the Intel FPU stack
 ALLOCATE_FPU_STACK

There it is, all laid out at once. It's very refreshing when compared to all of the compact research papers.

I didn't realize that there were more than one or two papers on linear scan. So this post will also incidentally serve as a bit of a survey or a history of linear scan--as best as I can figure it out, anyway. If you were in or near the room where it happened, please feel free to reach out and correct some parts.

Some example code

Throughout this post, we'll use an example SSA code snippet from Wimmer2010, adapted from phi-SSA to block-argument-SSA. Wimmer2010's code snippet is between the arrows and we add some filler (as alluded to in the paper):

label B1(R10, R11):
jmp B2($1, R11)
 # vvvvvvvvvv #
label B2(R12, R13)
cmp R13, $1
branch lessThan B4()

label B3()
mul R12, R13 -> R14
sub R13, $1 -> R15
jump B2(R14, R15)

label B4()
 # ^^^^^^^^^^ #
add R10, R12 -> R16
ret R16

Virtual registers start with R and are defined either with an arrow or by a block parameter.

Because it takes a moment to untangle the unfamiliar syntax and draw the control-flow graph by hand, I've also provided the same code in graphical form. Block names (and block parameters) are shaded with grey.

We have one entry block, B1, that is implied in Wimmer2010. Its only job is to define R10 and R11 for the rest of the CFG.

Then we have a loop between B2 and B3 with an implicit fallthrough. Instead of doing that, we instead generate a conditional branch with explicit jump targets. This makes it possible to re-order blocks as much as we like.

The contents of B4 are also just to fill in the blanks from Wimmer2010 and add some variable uses.

Our goal for the post is to analyze this CFG, assign physical locations (registers or stack slots) to each virtual register, and then rewrite the code appropriately.

For now, let's rewind the clock and look at how linear scan came about.

In the beginning

Linear scan register allocation (LSRA) has been around for a while. It's neat because it does the actual register assignment part of register allocation in one pass over your low-level IR. (We'll talk more about what that means in a minute.)

It first appeared in the literature in tcc: A System for Fast, Flexible, and High-level Dynamic Code Generation (PDF, 1997) by Poletto, Engler, and Kaashoek. (Until writing this post, I had never seen this paper. It was only on a re-read of the 1999 paper (below) that I noticed it.) In this paper, they mostly describe a staged variant of C called 'C (TickC), for which a fast register allocator is quite useful.

Then came a paper called Quality and Speed in Linear-scan Register Allocation (PDF, 1998) by Traub, Holloway, and Smith. It adds some optimizations (lifetime holes, binpacking) to the algorithm presented in Poletto1997.

Then came the first paper I read, and I think the paper everyone refers to when they talk about linear scan: Linear Scan Register Allocation (PDF, 1999) by Poletto and Sarkar. In this paper, they give a fast alternative to graph coloring register allocation, especially motivated by just-in-time compilers. In retrospect, it seems to be a bit of a rehash of the previous two papers.

Linear scan (1997, 1999) operates on live ranges instead of virtual registers. A live range is a pair of integers [start, end) (end is exclusive) that begins when the register is defined and ends when it is last used. This means that there is an assumption that the order for instructions in your program has already been fixed into a single linear sequence! It also means that you have given each instruction a number that represents its position in that order.

This may or may not be a surprising requirement depending on your compilers background. It was surprising to me because I often live in control flow graph fantasy land where blocks are unordered and instructions sometimes float around. But if you live in a land of basic blocks that are already in reverse post order, then it may be less surprising.

In non-SSA-land, these live ranges are different from the virtual registers: they represent the lifetime of each version of a virtual register. For an example, consider the following code snippet:

... -> a
add 1, a -> b
add 1, b -> c
add 1, c -> a
add 1, a -> d

There are two definitions of a and they each live for different amounts of time:

                a  b  c  a  d
... -> a        |              <- the first a
add 1, a -> b   v  |
add 1, b -> c      v  |
add 1, c -> a         v  |     <- the second a
add 1, a -> d            v  |

In fact, the ranges are completely disjoint. It wouldn't make sense for the register allocator to consider variables, because there's no reason the two as should necessarily live in the same physical register.

In SSA land, it's a little different: since each virtual register only has one definition (by, uh, definition), live ranges are an exact 1:1 mapping with virtual registers. We'll focus on SSA for the remainder of the post because this is what I am currently interested in. The research community seems to have decided that allocating directly on SSA gives more information to the register allocator3.

Linear scan starts at the point in your compiler process where you already know these live ranges--that you have already done some kind of analysis to build a mapping.

In this blog post, we're going to back up to the point where we've just built our SSA low-level IR and have yet to do any work on it. We'll do all of the analysis from scratch.

Part of this analysis is called liveness analysis.

Liveness analysis

The result of liveness analysis is a mapping of BasicBlock -> Set[Instruction] that tells you which virtual registers (remember, since we're in SSA, instruction==vreg) are alive (used later) at the beginning of the basic block. This is called a live-in set. For example:

B0:
... -> R12
... -> R13
jmp B1

B1:
mul R12, R13 -> R14
sub R13, 1 -> R15
jmp B2

B2:
add R14, R15 -> R16
ret R16

We compute liveness by working backwards: a variable is live from the moment it is backwardly-first used until its definition.

In this case, at the end of B2, nothing is live. If we step backwards to the ret, we see a use: R16 becomes live. If we step once more, we see its definition--R16 no longer live--but now we see a use of R14 and R15, which become live. This leaves us with R14 and R15 being live-in to B2.

This live-in set becomes B1's live-out set because B1 is B2's predecessor. We continue in B1. We could continue backwards linearly through the blocks. In fact, I encourage you to do it as an exercise. You should have a (potentially empty) set of registers per basic block.

It gets more interesting, though, when we have branches: what does it mean when two blocks' live-in results merge into their shared predecessor? If we have two blocks A and B that are successors of a block C, the live-in sets get unioned together.

(Figure: a control-flow graph where block C branches to two successor blocks, A and B.)

That is, if there were some register R0 live-in to B and some register R1 live-in to A, both R0 and R1 would be live-out of C. They may also be live-in to C, but that entirely depends on the contents of C.

Since the total number of virtual registers is nonnegative and is finite for a given program, it seems like a good lattice for an abstract interpreter. That's right, we're doing AI.

In this liveness analysis, we'll:

  1. compute a summary of what virtual registers each basic block needs to be alive (gen set) and what variables it defines (kill set)
  2. initialize all live-in sets to 0
  3. do an iterative dataflow analysis over the blocks until the live-in sets converge

We store gen, kill, and live-in sets as bitsets, using some APIs conveniently available on Ruby's Integer class.
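
If the Integer-as-bitset idiom is unfamiliar, here is a quick standalone illustration (not part of the allocator) of the operations the code below relies on:

# Each virtual register number gets one bit in a plain Integer.
live = 0
live |= (1 << 12)            # mark R12 live
live |= (1 << 13)            # mark R13 live
live &= ~(1 << 13)           # kill R13
p live.anybits?(1 << 12)     # => true, R12 is still live
p live.anybits?(1 << 13)     # => false

# Decode the set back into register numbers when you want to inspect it.
regs = (0...live.bit_length).select { |i| live[i] == 1 }
p regs                       # => [12]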

Like most abstract interpretations, it doesn't matter what order we iterate over the collection of basic blocks for correctness, but it does matter for performance. In this case, iterating backwards (post_order) converges much faster than forwards (reverse_post_order):

class Function
  def compute_initial_liveness_sets order
    # Map of Block -> what variables it alone needs to be live-in
    gen = Hash.new 0
    # Map of Block -> what variables it alone defines
    kill = Hash.new 0
    order.each do |block|
      block.instructions.reverse_each do |insn|
        out = insn.out&.as_vreg
        if out
          kill[block] |= (1 << out.num)
        end
        insn.vreg_ins.each do |vreg|
          gen[block] |= (1 << vreg.num)
        end
      end
      block.parameters.each do |param|
        kill[block] |= (1 << param.num)
      end
    end
    [gen, kill]
  end

  def analyze_liveness
    order = post_order
    gen, kill = compute_initial_liveness_sets(order)
    # Map from Block -> what variables are live-in
    live_in = Hash.new 0
    changed = true
    while changed
      changed = false
      for block in order
        # Union-ing all the successors' live-in sets gives us this block's
        # live-out, which is a good starting point for computing the live-in
        block_live = block.successors.map { |succ| live_in[succ] }.reduce(0, :|)
        block_live |= gen[block]
        block_live &= ~kill[block]
        if live_in[block] != block_live
          changed = true
          live_in[block] = block_live
        end
      end
    end
    live_in
  end
end

We could also use a worklist here, and it would be faster, but eh. Repeatedly iterating over all blocks is fine for now.
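
For reference, a minimal sketch of that worklist variant might look something like this (it assumes a block.predecessors accessor, which this post doesn't otherwise show):

class Function
  # Sketch of the worklist variant mentioned above; only re-examine the
  # predecessors of blocks whose live-in set actually changed.
  def analyze_liveness_with_worklist
    order = post_order
    gen, kill = compute_initial_liveness_sets(order)
    live_in = Hash.new 0
    worklist = order.dup
    until worklist.empty?
      block = worklist.shift
      live_out = block.successors.map { |succ| live_in[succ] }.reduce(0, :|)
      block_live = (live_out | gen[block]) & ~kill[block]
      if live_in[block] != block_live
        live_in[block] = block_live
        # Assumption: blocks know their predecessors
        worklist.concat(block.predecessors)
      end
    end
    live_in
  end
end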

The Wimmer2010 paper skips this liveness analysis entirely by assuming some computed information about your CFG: where loops start and end. It also requires all loop blocks be contiguous. Then it makes variables defined before a loop and used at any point inside the loop live for the whole loop. By having this information available, it folds the liveness analysis into the live range building, which we'll instead do separately in a moment.

The Wimmer approach sounded complicated and finicky. Maybe it is, maybe it isn't. So I went with a dataflow liveness analysis instead. If it turns out to be the slow part, maybe it will matter enough to learn about this loop tagging method.

For now, we will pick a schedule for the control-flow graph.

Scheduling

In order to build live ranges, you have to have some kind of numbering system for your instructions, otherwise a live range's start and end are meaningless. We can write a function that fixes a particular block order (in this case, reverse post-order) and then assigns each block and instruction a number in a linear sequence. You can think of this as flattening or projecting the graph:

class Function
  def number_instructions!
    @block_order = rpo
    number = 16 # just so we match the Wimmer2010 paper
    @block_order.each do |blk|
      blk.number = number
      number += 2
      blk.instructions.each do |insn|
        insn.number = number
        number += 2
      end
      blk.to = number
    end
  end
end

A couple of interesting things to note: blocks get numbers too, not just instructions, and everything counts up by two, starting at 16 just so we match the Wimmer2010 paper.

Even though we have extra instructions, the result looks very similar to the example in the paper:

16: label B1(R10, R11):
18: jmp B2($1, R11)
 # vvvvvvvvvv #
20: label B2(R12, R13)
22: cmp R13, $1
24: branch lessThan B4() else B3()

26: label B3()
28: mul R12, R13 -> R14
30: sub R13, $1 -> R15
32: jump B2(R14, R15)

34: label B4()
 # ^^^^^^^^^^ #
36: add R10, R12 -> R16
38: ret R16

Since we're not going to be messing with the order of the instructions within a block anymore, all we have to do going forward is make sure that we iterate through the blocks in @block_order.

Finally, we have all that we need to compute live ranges.

Live ranges

We'll more or less copy the algorithm to compute live ranges from the Wimmer2010 paper. We'll have two main differences:

  • We're going to compute live ranges, not live intervals (as they do in the paper)
  • We're going to use our dataflow liveness analysis, not the loop header thing

I know I said we were going to be computing live ranges. So why am I presenting you with a function called build_intervals? That's because early in the history of linear scan (Traub1998!), people moved from having a single range for a particular virtual register to having multiple disjoint ranges. This collection of multiple ranges is called an interval and it exists to free up registers in the context of branches.

For example, in our IR snippet (above), R12 is defined in B2 as a block parameter, used in B3, and then not used again until some indeterminate point in B4. (Our example uses it immediately in an add instruction to keep things short, but pretend the second use is some time away.)

The Wimmer2010 paper creates a lifetime hole between 28 and 34, meaning that the interval for R12 (called i12) is [[20, 28), [34, ...)]. Interval holes are not strictly necessary--they exist to generate better code. So for this post, we're going to start simple and assume 1 interval == 1 range. We may come back later and add additional ranges, but that will require some fixes to our later implementation. We'll note where we think those fixes should happen.

BUILDINTERVALS
for each block b in reverse order do
  live = union of successor.liveIn for each successor of b
  for each phi function phi of successors of b do
    live.add(phi.inputOf(b))
  for each opd in live do
    intervals[opd].addRange(b.from, b.to)
  for each operation op of b in reverse order do
    for each output operand opd of op do
      intervals[opd].setFrom(op.id)
      live.remove(opd)
    for each input operand opd of op do
      intervals[opd].addRange(b.from, op.id)
      live.add(opd)
  for each phi function phi of b do
    live.remove(phi.output)
  if b is loop header then
    loopEnd = last block of the loop starting at b
    for each opd in live do
      intervals[opd].addRange(b.from, loopEnd.to)
  b.liveIn = live

Anyway, here is the mostly-copied annotated implementation of BuildIntervals from the Wimmer2010 paper:

class Function
  def build_intervals live_in
    intervals = Hash.new { |hash, key| hash[key] = Interval.new }
    @block_order.reverse_each do |block|
      # live = union of successor.liveIn for each successor of b
      # this is the *live out* of the current block since we're going to be
      # iterating backwards over instructions
      live = block.successors.map { |succ| live_in[succ] }.reduce(0, :|)
      # for each phi function phi of successors of b do
      #   live.add(phi.inputOf(b))
      live |= block.out_vregs.map { |vreg| 1 << vreg.num }.reduce(0, :|)
      each_bit(live) do |idx|
        opd = vreg idx
        intervals[opd].add_range(block.from, block.to)
      end
      block.instructions.reverse.each do |insn|
        out = insn.out&.as_vreg
        if out
          # for each output operand opd of op do
          #   intervals[opd].setFrom(op.id)
          intervals[out].set_from(insn.number)
        end
        # for each input operand opd of op do
        #   intervals[opd].addRange(b.from, op.id)
        insn.vreg_ins.each do |opd|
          intervals[opd].add_range(block.from, insn.number)
        end
      end
    end
    intervals.default_proc = nil
    intervals.freeze
  end
end

Another difference is that since we're using block parameters, we don't really have this phi.inputOf thing. That's just the block argument.

The last difference is that since we're skipping the loop liveness hack, we don't modify a block's live set as we iterate through instructions.

I know we said we're building live ranges, so our Interval class only has one Range on it. This is Ruby's built-in range, but it's really just being used as a tuple of integers here.

class Interval
  attr_reader :range

  def add_range(from, to)
    if to <= from
      raise ArgumentError, "Invalid range: #{from} to #{to}"
    end
    if !@range
      @range = Range.new(from, to)
      return
    end
    @range = Range.new([@range.begin, from].min, [@range.end, to].max)
  end

  def set_from(from)
    @range = if @range
      if @range.end <= from
        raise ArgumentError, "Invalid range: #{from} to #{@range.end}"
      end
      Range.new(from, @range.end)
    else
      # This happens when we don't have a use of the vreg
      # If we don't have a use, the live range is very short
      Range.new(from, from)
    end
  end

  def ==(other)
    other.is_a?(Interval) && @range == other.range
  end
end

Note that there's some implicit behavior happening here:

  • If we haven't initialized a range yet, we build one automatically
  • If we have a range, add_range builds the smallest range that overlaps with the existing range and incoming information
  • If we have a range, set_from may shrink it

For example, if we have [1, 5) and someone calls add_range(7, 10), we end up with [1, 10). There's no gap in the middle.

And if we have [1, 7) and someone calls set_from(3), we end up with [3, 7).
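
Concretely, using the class above (remember, the Range is just a pair of endpoints that we treat as half-open):

# The two examples from the prose above:
i = Interval.new
i.add_range(1, 5)
i.add_range(7, 10)
i.range # => 1..10 (no gap in the middle)

j = Interval.new
j.add_range(1, 7)
j.set_from(3)
j.range # => 3..7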

After figuring out from scratch some of these assumptions about what the interval/range API should and should not do, Aaron and I realized that there was some actual code for add_range in a different, earlier paper: Linear Scan Register Allocation in the Context of SSA Form and Register Constraints (PDF, 2002) by Mossenbock and Pfeiffer.

ADDRANGE(i: Instruction; b: Block; end: integer)
  if b.first.n <= i.n <= b.last.n then range ← [i.n, end[ else range ← [b.first.n, end[
  add range to interval[i.n] // merging adjacent ranges

Unfortunately, many other versions of this PDF look absolutely horrible (like bad OCR) and I had to do some digging to find the version linked above.

Finally we can start thinking about doing some actual register assignment. Let's return to the 90s.

Linear scan

Because we have faithfully kept 1 interval == 1 range, we can re-use the linear scan algorithm from Poletto1999 (which looks, at a glance, to be the same as 1997).

I recommend looking at the PDF side by side with the code. We have tried to keep the structure very similar.

LinearScanRegisterAllocation
active ← {}
foreach live interval i, in order of increasing start point
  ExpireOldIntervals(i)
  if length(active) = R then
    SpillAtInterval(i)
  else
    register[i] ← a register removed from pool of free registers
    add i to active, sorted by increasing end point

ExpireOldIntervals(i)
foreach interval j in active, in order of increasing end point
  if endpoint[j] >= startpoint[i] then
    return
  remove j from active
  add register[j] to pool of free registers

SpillAtInterval(i)
spill ← last interval in active
if endpoint[spill] > endpoint[i] then
  register[i] ← register[spill]
  location[spill] ← new stack location
  remove spill from active
  add i to active, sorted by increasing end point
else
  location[i] ← new stack location

Note that unlike in many programming languages these days, {} in the algorithm description represents a set, not a (hash-)map.

In our Ruby code, we represent active as an array:

class Function
  def ye_olde_linear_scan intervals, num_registers
    if num_registers <= 0
      raise ArgumentError, "Number of registers must be positive"
    end
    free_registers = Set.new 0...num_registers
    active = [] # Active intervals, sorted by increasing end point
    assignment = {} # Map from Interval to PReg|StackSlot
    num_stack_slots = 0
    # Iterate through intervals in order of increasing start point
    # TODO(max): Build a deque for intervals, pushing to the front, so we
    # automatically get this in sorted order
    sorted_intervals = intervals.sort_by { |_, interval| interval.range.begin }
    sorted_intervals.each do |_vreg, interval|
      # expire_old_intervals(interval)
      active.select! do |active_interval|
        if active_interval.range.end > interval.range.begin
          true
        else
          operand = assignment.fetch(active_interval)
          raise "Should be assigned a register" unless operand.is_a?(PReg)
          free_registers.add(operand.name)
          false
        end
      end
      if active.length == num_registers
        # spill_at_interval(interval)
        # Pick an interval to spill. Picking the longest-lived active one is
        # a heuristic from the original linear scan paper.
        spill = active.last
        # In either case, we need to allocate a slot on the stack.
        slot = StackSlot.new(num_stack_slots)
        num_stack_slots += 1
        if spill.range.end > interval.range.end
          # The last active interval ends further away than the current
          # interval; spill the last active interval.
          assignment[interval] = assignment[spill]
          raise "Should be assigned a register" unless assignment[interval].is_a?(PReg)
          assignment[spill] = slot
          active.pop # We know spill is the last one
          # Insert interval into already-sorted active
          insert_idx = active.bsearch_index { |i| i.range.end >= interval.range.end } || active.length
          active.insert(insert_idx, interval)
        else
          # The current interval ends further away than the last active
          # interval; spill the current interval.
          assignment[interval] = slot
        end
      else
        reg = free_registers.min
        free_registers.delete(reg)
        assignment[interval] = PReg.new(reg)
        # Insert interval into already-sorted active
        insert_idx = active.bsearch_index { |i| i.range.end >= interval.range.end } || active.length
        active.insert(insert_idx, interval)
      end
    end
    [assignment, num_stack_slots]
  end
end

Internalizing this took us a bit. It is mostly a three-state machine:

  • have not been allocated
  • have been allocated a register
  • have been allocated a stack slot

We would like to come back to this and incrementally modify it as we add lifetime holes to intervals.

I finally understood, very late in the game, that Poletto1999 linear scan assigns only one location per virtual register. Ever. It's not that every virtual register gets a shot in a register and then gets moved to a stack slot--that would be interval splitting and hopefully we get to that later--if a register gets spilled, it's in a stack slot from beginning to end.

I only found this out accidentally after trying to figure out a bug (that wasn't a bug) due to a lovely sentence in Optimized Interval Splitting in a Linear Scan Register Allocator (PDF, 2005) by Wimmer and Mossenbock:

However, it cannot deal with lifetime holes and does not split intervals, so an interval has either a register assigned for the whole lifetime, or it is spilled completely.

Also,

In particular, it is not possible to implement the algorithm without reserving a scratch register: When a spilled interval is used by an instruction requiring the operand in a register, the interval must be temporarily reloaded to the scratch register

Also,

Additionally, register constraints for method calls and instructions requiring fixed registers must be handled separately

Marvelous.

Let's take a look at the code snippet again. Here it is before register allocation, using virtual registers:

16: label B1(R10, R11):
18: jmp B2($1, R11)
 # vvvvvvvvvv #
20: label B2(R12, R13)
22: cmp R13, $1
24: branch lessThan B4() else B3()

26: label B3()
28: mul R12, R13 -> R14
30: sub R13, $1 -> R15
32: jump B2(R14, R15)

34: label B4()
 # ^^^^^^^^^^ #
36: add R10, R12 -> R16
38: ret R16

Let's run it through register allocation with incrementally decreasing numbers of physical registers available. We get the following assignments:

  • 4 registers {R10: P0, R11: P1, R12: P1, R13: P2, R14: P3, R15: P2, R16: P0}
  • 3 registers {R10: Stack[0], R11: P1, R12: P1, R13: P2, R14: P0, R15: P2, R16: P0}
  • 2 registers {R10: Stack[0], R11: P1, R12: Stack[1], R13: P0, R14: P1, R15: P0, R16: P0}
  • 1 register {R10: Stack[0], R11: P0, R12: Stack[1], R13: P0, R14: Stack[2], R15: P0, R16: P0}

Some other things to note:

  • If you have a register free, choosing which register to allocate is a heuristic! It is tunable. There is probably some research out there that explores the space.

    In fact, you might even consider not allocating a register greedily. What might that look like? I have no idea.

  • Spilling the interval with the furthest endpoint is a heuristic! You can pick any active interval you want. In Register Spilling and Live-Range Splitting for SSA-Form Programs (PDF, 2009) by Braun and Hack, for example, they present the MIN algorithm, which spills the interval with the furthest next use.

    This requires slightly more information and takes slightly more time than the default heuristic but apparently generates much better code.

  • Also, block ordering? You guessed it. Heuristic.

Here is an example "slideshow" I generated by running linear scan with 2 registers. Use the arrow keys to navigate forward and backward in time4.

Resolving SSA

At this point we have register assignments: we have a hash table mapping intervals to physical locations. That's great but we're still in SSA form: labelled code regions don't have block arguments in hardware. We need to write some code to take us out of SSA and into the real world.

We can use a modified Wimmer2010 as a great starting point here. It handles more than we need to right now--interval splitting--but we can simplify.

RESOLVE
for each control flow edge from predecessor to successor do
  for each interval it live at begin of successor do
    if it starts at begin of successor then
      phi = phi function defining it
      opd = phi.inputOf(predecessor)
      if opd is a constant then
        moveFrom = opd
      else
        moveFrom = location of intervals[opd] at end of predecessor
    else
      moveFrom = location of it at end of predecessor
    moveTo = location of it at begin of successor
    if moveFrom ≠ moveTo then
      mapping.add(moveFrom, moveTo)
  mapping.orderAndInsertMoves()

Because we don't split intervals, we know that every interval live at the beginning of a block is either:

  • live across an edge between two blocks and therefore has already been placed in a location by assignment/spill code
  • beginning its life at the beginning of the block as a block parameter and therefore needs to be moved from its source location

For this reason, we only handle the second case in our SSA resolution. If we added lifetime holes or interval splitting, we would have to go back to the full Wimmer SSA resolution.

This means that we're going to iterate over every outbound edge from every block. For each edge, we're going to insert some parallel moves.

class Function
  def resolve_ssa intervals, assignments
    # ...
    @block_order.each do |predecessor|
      outgoing_edges = predecessor.edges
      num_successors = outgoing_edges.length
      outgoing_edges.each do |edge|
        mapping = []
        successor = edge.block
        edge.args.zip(successor.parameters).each do |moveFrom, moveTo|
          if moveFrom != moveTo
            mapping << [moveFrom, moveTo]
          end
        end
        # predecessor.order_and_insert_moves(mapping)
        # TODO: order_and_insert_moves
      end
    end
    # Remove all block parameters and arguments; we have resolved SSA
    @block_order.each do |block|
      block.parameters.clear
      block.edges.each do |edge|
        edge.args.clear
      end
    end
  end
end

This already looks very similar to the RESOLVE function from Wimmer2010. Unfortunately, Wimmer2010 basically shrugs off orderAndInsertMoves with an "eh, it's already in the literature" comment.

A brief and frustrating parallel moves detour

What's not made clear, though, is that this particular subroutine has been the source of a significant number of bugs in the literature. Only recently did some folks roll through and suggest (proven!) fixes.

This sent us on a deep rabbit hole of trying to understand what bugs occur, when, and how to fix them. We implemented both the Leroy and the Boissinot algorithms. We found differences between Boissinot2009, Boissinot2010, and the SSA book implementation following those algorithms. We found Paul Sokolovsky's implementation with bugfixes. We found Dmitry Stogov's unmerged pull request to the same repository to fix another bug.

We looked at Benoit Boissinot's thesis again and emailed him some questions. He responded! And then he even put up an amended version of his algorithm in Rust with tests and fuzzing.

All this is to say that this is still causing people grief and, though I understand page limits, I wish parallel moves were not handwaved away.

We ended up with this implementation, which passes all of the tests from Sokolovsky's repository as well as the example from Boissinot's thesis (though, as we discussed in the email, the example solution in the thesis is incorrect5).

# copies contains an array of [src, dst] arrays
def sequentialize copies
  ready = [] # Contains only destination regs ("available")
  to_do = [] # Contains only destination regs
  pred = {} # Map of destination reg -> what reg is written to it (its source)
  loc = {} # Map of reg -> the current location where the initial value of reg is available ("resource")
  result = []

  emit_copy = -> (src, dst) {
    # We add an arrow here just for clarity in reading this algorithm because
    # different people do [src, dst] and [dst, src] depending on if they prefer
    # Intel or AT&T
    result << [src, "->", dst]
  }

  # In Ruby, loc[x] is nil if x not in loc, so this loop could be omitted
  copies.each do |(src, dst)|
    loc[dst] = nil
  end

  copies.each do |(src, dst)|
    loc[src] = src
    if pred.key? dst # Alternatively, to_do.include? dst
      raise "Conflicting assignments to destination #{dst}, latest: #{[dst, src]}"
    end
    pred[dst] = src
    to_do << dst
  end

  copies.each do |(src, dst)|
    if !loc[dst]
      # All destinations that are not also sources can be written to immediately (tree leaves)
      ready << dst
    end
  end

  while !to_do.empty?
    while b = ready.pop
      a = loc[pred[b]] # a in the paper
      emit_copy.(a, b)
      # pred[b] is now living at b
      loc[pred[b]] = b
      if to_do.include?(a)
        to_do.delete a
      end
      if pred[b] == a && pred.include?(a)
        ready << a
      end
    end

    if to_do.empty?
      break
    end

    dst = to_do.pop
    if dst != loc[pred[dst]]
      emit_copy.(dst, "tmp")
      loc[dst] = "tmp"
      ready << dst
    end
  end
  result
end
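
For example (using plain strings as register names), the classic swap gets a temporary, and the four-copy example from Boissinot's thesis--which shows up again in footnote 5--comes out in the order we worked out by hand:

# A swap needs a temporary:
sequentialize([["x", "y"], ["y", "x"]])
# => [["x", "->", "tmp"], ["y", "->", "x"], ["tmp", "->", "y"]]

# The parallel copy from Boissinot's thesis (see footnote 5):
sequentialize([["a", "b"], ["b", "c"], ["c", "a"], ["c", "d"]])
# => [["c", "->", "d"], ["b", "->", "c"], ["a", "->", "b"], ["d", "->", "a"]]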

Leroy's algorithm, which is shorter, passes almost all the tests--in one test case, it uses one more temporary variable than Boissinot's does. We haven't spent much time looking at why.

def move_one i, src, dst, status, result
  return if src[i] == dst[i]
  status[i] = :being_moved
  for j in 0...(src.length) do
    if src[j] == dst[i]
      case status[j]
      when :to_move
        move_one j, src, dst, status, result
      when :being_moved
        result << [src[j], "->", "tmp"]
        src[j] = "tmp"
      end
    end
  end
  result << [src[i], "->", dst[i]]
  status[i] = :moved
end

def leroy_sequentialize copies
  src = copies.map { it[0] }
  dst = copies.map { it[1] }
  status = [:to_move] * src.length
  result = []
  status.each_with_index do |item, i|
    if item == :to_move
      move_one i, src, dst, status, result
    end
  end
  result
end
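
For what it's worth, running it on the same swap from above produces the same three moves:

leroy_sequentialize([["x", "y"], ["y", "x"]])
# => [["x", "->", "tmp"], ["y", "->", "x"], ["tmp", "->", "y"]]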

Back to SSA resolution

Whatever algorithm you choose, you now have a way to parallel move some registers to some other registers. You have avoided the "swap problem".

class Function
  def resolve_ssa intervals, assignments
    # ...
    # predecessor.order_and_insert_moves(mapping)
    sequence = sequentialize(mapping).map do |(src, _, dst)|
      Insn.new(:mov, dst, [src])
    end
    # TODO: insert the moves!
    # ...
  end
end

That's great. You can generate an ordered list of instructions from a tangled graph. But where do you put them? What about the "lost copy" problem?

As it turns out, we still need to handle critical edge splitting. Let's consider what it means to insert moves at an edge between blocks A -> B when the surrounding CFG looks a couple of different ways.

  • Case 1: A -> B
  • Case 2: A -> B and A -> C
  • Case 3: A -> B and D -> B
  • Case 4: A -> B and A -> C and D -> B

These are the four (really, three) cases we may come across.

In Case 1, if we only have two neighboring blocks A and B, we can insert the moves into either block. It doesn't matter: at the end of A or at the beginning of B are both fine.

In Case 2, if A has two successors, then we should insert the moves at the beginning of B. That way we won't be mucking things up for the edge A -> C.

In Case 3, if B has two predecessors, then we should insert the moves at the end of A. That way we won't be mucking things up for the edge D -> B.

Case 4 is the most complicated. There is no extant place in the graph we can insert moves. If we insert in A, we mess things up for A -> C. If we insert in B, we mess things up for D -> B. Inserting in C or D doesn't make any sense. What is there to do?

As it turns out, Case 4 is called a critical edge. And we have to split it.

We can insert a new block E along the edge A -> B and put the moves in E! That way they still happen along the edge without affecting any other blocks. Neat.

In Ruby code, that looks like:

class Function
  def resolve_ssa intervals, assignments
    num_predecessors = Hash.new 0
    @block_order.each do |block|
      block.edges.each do |edge|
        num_predecessors[edge.block] += 1
      end
    end
    # ...
    # predecessor.order_and_insert_moves(mapping)
    sequence = ...
    # If we don't have any moves to insert, we don't have any block to
    # insert
    next if sequence.empty?
    if num_predecessors[successor] > 1 && num_successors > 1
      # Make a new interstitial block
      b = new_block
      b.insert_moves_at_start sequence
      b.instructions << Insn.new(:jump, nil, [Edge.new(successor, [])])
      edge.block = b
    elsif num_successors > 1
      # Insert into the beginning of the block
      successor.insert_moves_at_start sequence
    else
      # Insert into the end of the block... before the terminator
      predecessor.insert_moves_at_end sequence
    end
    # ...
  end
end

Adding a new block invalidates the cached @block_order, so we also need to recompute that.

We could also avoid that by splitting critical edges earlier, before numbering. Then, when we arrive in resolve_ssa, we can clean up branches to empty blocks!

(See also Nick's post on critical edge splitting, which also links to Faddegon's thesis, which I should at least skim.)
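
If we went that route, the up-front pass might look something like this sketch (not the code we actually run; it assumes a blocks accessor and reuses the Block/Edge/Insn shapes from above):

class Function
  # Sketch only: split every critical edge before numbering so that
  # resolve_ssa never has to invent new blocks. Assumes a `blocks`
  # accessor; new_block, Edge, and Insn are the shapes used above.
  def split_critical_edges!
    num_predecessors = Hash.new 0
    blocks.each do |block|
      block.edges.each { |edge| num_predecessors[edge.block] += 1 }
    end
    blocks.each do |block|
      next unless block.edges.length > 1 # this block has >1 successor...
      block.edges.each do |edge|
        next unless num_predecessors[edge.block] > 1 # ...and its target has >1 predecessor
        # Critical edge: route it through a new empty block that just
        # forwards the original block arguments.
        mid = new_block
        mid.instructions << Insn.new(:jump, nil, [Edge.new(edge.block, edge.args.dup)])
        edge.args.clear
        edge.block = mid
      end
    end
  end
end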

And that's it, folks. We have gone from virtual registers in SSA form to physical locations. Everything's all hunky-dory. We can just turn these LIR instructions into their very similar looking machine equivalents, right?

Not so fast...

Calls

You may have noticed that the original linear scan paper does not mention calls or other register constraints. I didn't really think about it until I wanted to make a function call. The authors of later linear scan papers definitely noticed, though; Wimmer2005 writes the following about Poletto1999:

When a spilled interval is used by an instruction requiring the operand in a register, the interval must be temporarily reloaded to the scratch register. Additionally, register constraints for method calls and instructions requiring fixed registers must be handled separately.

Fun. We will start off by handling calls and method parameters separately, we will note that it's not amazing code, and then we will eventually implement the later papers, which handle register constraints more naturally.

We'll call this new function handle_caller_saved_regs after register allocation but before SSA resolution. We do it after register allocation so we know where each virtual register goes but before resolution so we can still inspect the virtual register operands.

Its goal is to do a couple of things:

  • Insert special push and pop instructions around call instructions to preserve virtual registers that are used on the other side of the call. We only care about preserving virtual registers that are stored in physical registers, though; no need to preserve anything that already lives on the stack.
  • Do a parallel move of the call arguments into the ABI-specified parameter registers. We need to do a parallel move in case any of the arguments happen to already be living in parameter registers. (We're really getting good mileage out of this function.)
  • Make sure that the value returned by the call in the ABI-specified return register ends up in the location allocated to the output of the call instruction.

We'll also remove the call operands since we're placing them in special registers explicitly now.

class Function
  def handle_caller_saved_regs intervals, assignments, return_reg, param_regs
    @block_order.each do |block|
      x = block.instructions.flat_map do |insn|
        if insn.name == :call
          survivors = intervals.select { |_vreg, interval|
            interval.survives?(insn.number)
          }.map(&:first).select { |vreg|
            assignments[intervals[vreg]].is_a?(PReg)
          }
          mov_input = insn.out
          insn.out = return_reg

          ins = insn.ins.drop(1)
          raise if ins.length > param_regs.length

          insn.ins.replace(insn.ins.first(1))

          mapping = ins.zip(param_regs).to_h
          sequence = sequentialize(mapping).map do |(src, _, dst)|
            Insn.new(:mov, dst, [src])
          end

          survivors.map { |s| Insn.new(:push, nil, [s]) } +
            sequence +
            [insn, Insn.new(:mov, mov_input, [return_reg])] +
            survivors.map { |s| Insn.new(:pop, nil, [s]) }.reverse
        else
          insn
        end
      end
      block.instructions.replace(x)
    end
  end
end

(Unfortunately, this sidesteps the less-fun bit of calls in ABIs where arguments after the sixth are expected on the stack. It also completely ignores ABI size constraints.)

Now, you may have noticed that we don't do anything special for the incoming params of the function we're compiling! That's another thing we have to handle. Thankfully, we can handle it with yet another parallel move (wow!) at the end of resolve_ssa.

class Function
  def resolve_ssa intervals, assignments
    # ...
    # We're typically going to have more param regs than block parameters
    # When we zip the param regs with block params, we'll end up with param
    # regs mapping to nil. We filter those away by selecting for tuples
    # that have a truthy second value
    # [[x, y], [z, nil]].select(&:last) (reject the second tuple)
    mapping = param_regs.zip(entry_block.parameters).select(&:last).to_h
    sequence = sequentialize(mapping).map do |(src, _, dst)|
      Insn.new(:mov, dst, [src])
    end
    entry_block.insert_moves_at_start(sequence)
  end
end

Again, this is yet another kind of thing where some of the later papers have much better ergonomics and also much better generated code.

But this is really cool! If you have arrived at this point with me, we have successfully made it to 1997 and that is nothing to sneeze at. We have even adapted research from 1997 to work with SSA, avoiding several significant classes of bugs along the way.

Validation by abstract interpretation

We have just built an enormously complex machine. Even out of the gate, with the original linear scan, there is a lot of machinery. It's possible to write tests that spot-check sample programs of all shapes and sizes, but it's very difficult to anticipate every possible edge case that will appear in the real world.

Even if the original algorithm you're using has been proven correct, your implementation may have subtle bugs due to (for example) having slightly different invariants or even transcription errors.

We have all these proof tools at our disposal: we can write an abstract interpreter that verifies properties of one graph, but it's very hard (impossible?) to scale that to sets of graphs.

Maybe that's enough, though. In one of my favorite blog posts, Chris Fallin writes about writing a register allocation verifier based on abstract interpretation. It can verify one concrete LIR function at a time. It's fast enough that it can be left on in debug builds. This means that a decent chunk of the time (tests, CI, maybe a production cluster) we can get a very clear signal that every register assignment that passes through the verifier satisfies some invariants.
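
To make the idea concrete, here is a minimal sketch of the kind of check such a verifier performs, for a single straight-line block only--this is not Chris Fallin's verifier, just an illustration reusing the vreg_ins/out/assignment shapes from earlier. A real verifier also seeds this state from predecessors and intersects it at control-flow joins.

class Function
  # Sketch only: check that every use reads a physical location that
  # currently holds the right virtual register, within one block.
  def verify_block(block, intervals, assignment)
    holds = {} # physical location -> vreg whose value it currently holds
    block.instructions.each do |insn|
      insn.vreg_ins.each do |vreg|
        loc = assignment.fetch(intervals[vreg])
        unless holds[loc] == vreg
          raise "at #{insn.number}: expected #{loc} to hold #{vreg}, found #{holds[loc].inspect}"
        end
      end
      out = insn.out&.as_vreg
      holds[assignment.fetch(intervals[out])] = out if out
    end
  end
end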

Furthermore, we are not limited to Real World Code. With the advent of fuzzing, one can imagine an always-on fuzzer that tries to break the register allocator. A verifier can then catch bugs that come from exploring this huge search space.

Some time after finding Chris's blog post, I also stumbled across the very same thing in V8!

I find this stuff so cool. I'll also mention Boissinot's Rust code again because it does something similar for parallel moves.

See also

It's possible to do linear scan allocation in reverse, at least on traces without control-flow. See for example The Solid-State Register Allocator, the LuaJIT register allocator, and Reverse Linear Scan Allocation is probably a good idea. By doing linear scan this way, it is also possible to avoid computing liveness and intervals. I am not sure if this works on programs with control-flow, though.

Wrapping up

We built a register allocator that works on SSA. Hopefully next time we will add features such as lifetime holes, interval splitting, and register hints.

The full Ruby code listing is not (yet?) publicly available under the Apache 2 license.

UPDATE: See the post on lifetime holes.

Thanks

Thanks to Waleed Khan and Iain Ireland for giving feedback on this post.

  1. It's not just about registers, either. In 2016, Facebook engineer Dave legendarily used linear-scan register allocation to book meeting rooms.

  2. Well. As I said on one of the social media sites earlier this year, "All AOT compilers are alike; each JIT compiler is fucked up in its own way."

    JavaScript:

    Java:

    Python:

    Ruby:

    • YJIT uses linear scan
    • ZJIT uses more or less the same backend, so also linear scan

    PHP:

    Lua:

  3. Linear Scan Register Allocation in the Context of SSA Form and Register Constraints (PDF, 2002) by Mossenbock and Pfeiffer notes:

    Our allocator relies on static single assignment form, which simplifies data flow analysis and tends to produce short live intervals.

    Register allocation for programs in SSA-form (PDF, 2006) by Hack, Grund, and Goos notes that interference graphs for SSA programs are chordal and can be optimally colored in quadratic time.

    SSA Elimination after Register Allocation (PDF, 2008) by Pereira and Palsberg notes:

    One of the main advantages of SSA based register allocation is the separation of phases between spilling and register assignment.

    Cliff Click (private communication, 2025) notes:

    It's easier. Got it already, why lose it [...] spilling always uses use/def and def/use edges.

  4. This is inspired by Rasmus Andersson's graph coloring visualization that I saw some years ago.

  5. The example in the thesis is to sequentialize the following parallel copy:

    • a → b
    • b → c
    • c → a
    • c → d

    The solution in the thesis is:

    1. c → d (c now lives in d)
    2. a → c (a now lives in c)
    3. b → a (b now lives in a)
    4. d → b (why are we copying c to b?)

    but we think this is incorrect. Solving manually, Aaron and I got:

    1. c → d (because d is not read from anywhere)
    2. b → c (because c is "freed up"; now in d)
    3. a → b (because b is "freed up"; now in c)
    4. d → a (because c is now in d, so d → a is equivalent to old_c → a)

    which is what the code gives us, too.

Wed, 13 Aug 2025 00:00:00 +0000 August 13, 2025 https://bernsteinbear.com/blog/linear-scan/?utm_source=rss https://bernsteinbear.com/blog/linear-scan/