A Further Look at the Erlang JIT

admin
15 Dec 2023

This post is a follow-up to "A First Look at the Erlang JIT" and digs deeper into the implementation details.

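While writing things in machine code (assembler) gives you a lot of freedom, it comes at the cost of having to invent almost everything yourself, and there is no clever compiler to help you catch mistakes. For example, if you call a function one way and the function expects another, the best case is that you crash the operating system process, and the worst case is that you spend hours chasing a heisenbug.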

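Conventions are therefore always front of mind when writing assembler, so we need to spell out a few of the ones we have chosen before moving on.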

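The most important of these concern registers. We based ours on the system calling convention, which makes it easier to call C code. The table below lists the SystemV convention used on Linux. The registers differ on other systems such as Windows, but the principle is the same.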

Register	Name	Callee save	Purpose
RDI	ARG1	no	First argument
RSI	ARG2	no	Second argument
RDX	ARG3	no	Third argument
RCX	ARG4	no	Fourth argument
R8	ARG5	no	Fifth argument
R9	ARG6	no	Sixth argument
RAX	RET	no	Function return value

So if we want to call a C function with two arguments, we put the first in ARG1 and the second in ARG2 before the call, and find the result in RET when the function returns.
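As a rough illustration of the two-argument case, the sketch below models the three registers involved as fields of a plain struct and performs the call by hand. `Regs`, `add_two`, and `call_two_args` are hypothetical names made up for this example; they are not part of the JIT.

```cpp
#include <cstdint>

// Hypothetical model of the argument and return registers. The field
// names mirror the aliases in the table above.
struct Regs {
    std::uint64_t ARG1 = 0; // RDI
    std::uint64_t ARG2 = 0; // RSI
    std::uint64_t RET = 0;  // RAX
};

// An ordinary function taking two arguments (illustration only).
static std::uint64_t add_two(std::uint64_t a, std::uint64_t b) {
    return a + b;
}

// Mimics the calling sequence: the caller has already placed the
// arguments in ARG1 and ARG2, and the result comes back in RET.
static void call_two_args(Regs &r,
                          std::uint64_t (*fn)(std::uint64_t, std::uint64_t)) {
    r.RET = fn(r.ARG1, r.ARG2);
}
```

Setting `ARG1` to 2 and `ARG2` to 40 and then running `call_two_args(r, add_two)` leaves 42 in `r.RET`.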

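Besides saying which registers are used for passing arguments, calling conventions also say which registers retain their value across function calls. These are called "callee save" registers, because the called function must save and restore them if it modifies them.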

In these registers we keep commonly used data that C code rarely (if ever) changes, which spares us from saving and restoring them when we call C code:

Register	Name	Callee save	Purpose
RBP	active_code_ix	yes	Active code index
R13	c_p	yes	Current process
R15	HTOP	yes	Top of the current process's heap
R14	FCALLS	yes	Reduction counter
RBX	registers	yes	BEAM register structure

We also keep the current process's stack pointer in the stack pointer register, RSP, which allows the use of call and ret instructions in Erlang code.

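The downside is that we can no longer call arbitrary C code, since it may assume a much larger stack. This requires us to swap back and forth between a "C stack" and the "Erlang stack".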

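In my previous post we called a C function, timeout, without any of this stack swapping. That is how it used to work before we changed how the stack works, but it is still quite simple, as shown below: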

void BeamModuleAssembler::emit_timeout() {
    /* Swap to the C stack. */
    emit_enter_runtime();

    /* Call the 'timeout' C function.
     *
     * runtime_call compiles down to a single 'call'
     * instruction in optimized builds, and has a few
     * assertions in debug builds to prevent mistakes
     * like forgetting to switch stacks. */
    a.mov(ARG1, c_p);
    runtime_call<1>(timeout);

    /* Swap back to the Erlang stack. */
    emit_leave_runtime();
}

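Swapping stacks is very cheap because of a trick we use when setting up registers: by allocating the registers structure on the C stack, we can compute the address of that stack from registers. This avoids having to reserve a precious callee save register, and is much faster than saving the address somewhere in memory.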
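The idea can be sketched in plain C++. The layout below is an assumption made purely for illustration: `CFrame`, `Registers`, and `c_stack_from_registers` are hypothetical names, and the real structures have different fields. The point is only that if the register structure sits at a fixed offset inside a frame on the C stack, a pointer to it is enough to recover the frame's address with a single subtraction.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the BEAM register structure.
struct Registers {
    std::uint64_t x[4];
};

// Hypothetical frame that lives on the C stack. The register
// structure sits at a fixed, compile-time-known offset inside it.
struct CFrame {
    std::uint64_t saved_state[2];
    Registers registers;
};

// Recover the frame address from a pointer to the register structure.
// No extra register or memory slot is needed to remember it.
static CFrame *c_stack_from_registers(Registers *regs) {
    auto addr = reinterpret_cast<std::uintptr_t>(regs);
    return reinterpret_cast<CFrame *>(addr - offsetof(CFrame, registers));
}
```

Because the offset is a constant, generated code could fold the whole computation into a single address calculation relative to the register that already holds the pointer.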

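With the conventions out of the way, we can start looking at code again. This time let's pick a larger instruction, test_heap, which allocates heap memory: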

void BeamModuleAssembler::emit_test_heap(const ArgVal &Needed,
                                         const ArgVal &Live) {
    const int words_needed = (Needed.getValue() + S_RESERVED);
    Label after_gc_check = a.newLabel();

    /* Do we have enough free space already? */
    a.lea(ARG2, x86::qword_ptr(HTOP, words_needed * sizeof(Eterm)));
    a.cmp(ARG2, E);
    a.jbe(after_gc_check);

    /* No, we need to GC.
     *
     * Switch to the C stack, and update the process
     * structure with our current stack (E) and heap
     * (HTOP) pointers so the C code can use them. */
    emit_enter_runtime<Update::eStack | Update::eHeap>();

    /* Call the GC, passing how many words we need and
     * how many X registers we use. */
    a.mov(ARG2, imm(words_needed));
    a.mov(ARG4, imm(Live.getValue()));

    a.mov(ARG1, c_p);
    load_x_reg_array(ARG3);
    a.mov(ARG5, FCALLS);
    runtime_call<5>(erts_garbage_collect_nobump);
    a.sub(FCALLS, RET);

    /* Swap back to the Erlang stack, reading the new
     * values for E and HTOP from the process structure. */
    emit_leave_runtime<Update::eStack | Update::eHeap>();

    a.bind(after_gc_check);
}
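The check at the top of emit_test_heap can be modelled in plain C++ as follows. This is a sketch under stated assumptions: `needs_gc` is a hypothetical helper, addresses are treated as plain integers, `words_needed` is taken to already include S_RESERVED, and the free space is assumed to be the gap between the heap top (HTOP) and the stack pointer (E).

```cpp
#include <cstddef>
#include <cstdint>

using Eterm = std::uint64_t; // one 8-byte word, as on a 64-bit system

// Returns true when the garbage collector must run before the
// allocation can proceed (hypothetical helper for illustration).
static bool needs_gc(std::uintptr_t htop, std::uintptr_t e,
                     std::size_t words_needed) {
    // Mirrors: lea ARG2, [HTOP + words_needed * sizeof(Eterm)]
    std::uintptr_t new_top = htop + words_needed * sizeof(Eterm);
    // Mirrors: cmp ARG2, E followed by jbe after_gc_check.
    // jbe is an unsigned "below or equal", so the GC is skipped
    // whenever new_top <= e.
    return new_top > e;
}
```

An allocation that fits exactly (`new_top == e`) does not trigger a collection, matching the "below or equal" condition of the jump.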

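While emit_test_heap is not too complicated, it still emits a fair amount of code: since all instructions are emitted directly into their modules, small inefficiencies like this quickly bloat them. Besides using more RAM, this wastes precious instruction cache, so we have spent a lot of time and effort on reducing code size.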

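Our most common way of reducing code size is to break out as much of an instruction as possible into a globally shared part. Let's see how that technique applies here: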

void BeamModuleAssembler::emit_test_heap(const ArgVal &Needed,
                                         const ArgVal &Live) {
    const int words_needed = (Needed.getValue() + S_RESERVED);
    Label after_gc_check = a.newLabel();

    a.lea(ARG2, x86::qword_ptr(HTOP, words_needed * sizeof(Eterm)));
    a.cmp(ARG2, E);
    a.jbe(after_gc_check);

    a.mov(ARG4, imm(Live.getValue()));

    /* Call the global "garbage collect" fragment. */
    fragment_call(ga->get_garbage_collect());

    a.bind(after_gc_check);
}

/* This is the global part of the instruction. Since we
 * know it will only be called from the module code above,
 * we're free to assume that ARG4 is the number of live
 * registers and that ARG2 is (HTOP + bytes needed). */
void BeamGlobalAssembler::emit_garbage_collect() {
    /* Convert ARG2 to "words needed" by subtracting
     * HTOP and dividing it by 8.
     *
     * This saves us from having to explicitly pass
     * "words needed" in the module code above. */
    a.sub(ARG2, HTOP);
    a.shr(ARG2, imm(3));

    emit_enter_runtime<Update::eStack | Update::eHeap>();

    /* ARG2 and ARG4 have already been set earlier. */
    a.mov(ARG1, c_p);
    load_x_reg_array(ARG3);
    a.mov(ARG5, FCALLS);
    runtime_call<5>(erts_garbage_collect_nobump);
    a.sub(FCALLS, RET);

    emit_leave_runtime<Update::eStack | Update::eHeap>();

    a.ret();
}
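The arithmetic that lets the shared fragment recover "words needed" from ARG2 can be checked in isolation. The two helpers below are hypothetical and treat addresses as plain integers; they assume 8-byte words, as the division by 8 in the fragment implies.

```cpp
#include <cstddef>
#include <cstdint>

// What the module code leaves in ARG2: HTOP plus the bytes needed.
static std::uintptr_t arg2_from_module(std::uintptr_t htop,
                                       std::size_t words_needed) {
    return htop + words_needed * 8;
}

// What the shared fragment does: subtract HTOP, then shift right by 3
// (divide by 8) to get back the word count.
static std::size_t words_from_arg2(std::uintptr_t arg2,
                                   std::uintptr_t htop) {
    return static_cast<std::size_t>((arg2 - htop) >> 3);
}
```

Since the byte count is always a multiple of 8, the shift loses nothing, and the round trip returns the original word count.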

Although we had to write about as much code as before, the part that is copied into each module is much smaller.
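A back-of-the-envelope way to see the saving is to compare the two totals directly. The function names below are hypothetical, and any byte counts fed into them are made up to show the shape of the arithmetic; they are not measurements of the real JIT.

```cpp
#include <cstddef>

// Total bytes when every call site carries the full instruction body.
static std::size_t inline_total(std::size_t sites,
                                std::size_t body_bytes) {
    return sites * body_bytes;
}

// Total bytes when each site keeps only a short stub and the rest
// lives once in a globally shared fragment.
static std::size_t shared_total(std::size_t sites,
                                std::size_t stub_bytes,
                                std::size_t fragment_bytes) {
    return sites * stub_bytes + fragment_bytes;
}
```

With purely illustrative numbers, 1000 call sites with a 60-byte inline body total 60000 bytes, while a 20-byte stub plus one 50-byte fragment total 20050 bytes. The fragment's cost is paid once, so the saving grows with the number of call sites.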

In our next post we will take a break from the implementation details and look at the history behind this JIT.
