Why do I get a segfault on a movaps instruction when using the unstable naked functions feature?


assembly rust


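I know my code uses a lot of unsafe inline assembly, but I still want to know why it segfaults only in release mode. I tried lower optimization levels, but it only runs at opt-level 1.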

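The fault occurs in the inlined `t_return`. The release-mode disassembly is listed at the end of this post, with the faulting instruction marked `; Here`.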

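Update 2: After the first answer and more testing, I found that the problem is that `movaps` requires `[rsp+128]` to be 16-byte aligned, but `rsp` is not aligned on entry to `guard` (into which `t_return` is inlined).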

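More details are in my own answer.

That usually indicates that you violated the ABI and misaligned the stack pointer. Debug mode probably doesn't auto-vectorize the copy of whatever is being copied here.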

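`movaps` requires its memory operand to be 16-byte aligned, unlike `movups`. The compiler uses `movaps` because it is more efficient on older CPUs. The ABI guarantees 16-byte stack alignment at every call site (so RSP % 16 == 8 on entry to a function, after `call` has pushed the return address), which means the compiler gets 16-byte alignment for its locals for free. (ABI guarantees like this are exactly what a compiler may assume without checking, for efficiency.)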

Note that this copies 32 bytes from `[rdx+rsi+40]` into stack memory, so whatever that stack memory contained before the stores execute is irrelevant.
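To make the `movups`/`movaps` distinction concrete, here is a hedged sketch of the same kind of 32-byte copy written with SSE2 intrinsics from `std::arch::x86_64`: unaligned 16-byte loads (`_mm_loadu_si128`, `movups`/`movdqu`-style) from an arbitrary source, then aligned 16-byte stores (`_mm_store_si128`, `movaps`/`movdqa`-style) into a buffer that `#[repr(align(16))]` forces onto a 16-byte boundary. The names `Aligned32` and `copy32` are illustrative, the exact instructions the compiler emits may differ, and it assumes an x86-64 target (where SSE2 is always available).

```rust
use std::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_store_si128};

/// 32-byte buffer the compiler must place at a 16-byte-aligned address,
/// like the stack slot at [rsp+112, rsp+144) in the question's listing.
#[repr(C, align(16))]
struct Aligned32([u8; 32]);

/// Copy 32 bytes from a possibly unaligned source into an aligned destination.
fn copy32(src: &[u8]) -> Aligned32 {
    assert!(src.len() >= 32, "need at least 32 source bytes");
    let mut dst = Aligned32([0; 32]);
    // SAFETY: `src` has at least 32 readable bytes, and unaligned loads have no
    // alignment requirement. `dst` is 16-byte aligned thanks to repr(align(16)),
    // which the aligned store requires: an aligned store to a misaligned address
    // raises #GP, which Linux reports as SIGSEGV.
    unsafe {
        let lo = _mm_loadu_si128(src.as_ptr() as *const __m128i);
        let hi = _mm_loadu_si128(src.as_ptr().add(16) as *const __m128i);
        let d = dst.0.as_mut_ptr() as *mut __m128i;
        _mm_store_si128(d, lo);
        _mm_store_si128(d.add(1), hi);
    }
    dst
}

fn main() {
    let data: Vec<u8> = (0u8..64).collect();
    // Start at an odd offset so the source is misaligned for 16-byte access.
    let out = copy32(&data[3..35]);
    assert_eq!(&out.0[..], &data[3..35]);
    println!("destination address % 16 = {}", out.0.as_ptr() as usize % 16);
}
```

The alignment the aligned stores rely on comes from the type here; in the question's code it comes from the ABI's promise about RSP, which is why breaking that promise breaks the store.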

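After further testing, it turns out that the original implementation is flawed, and so is mine. The problem is using the next 8 bytes above `f` to store the function pointer to `guard`.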

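When the spawned function returns, `ret` pops the top of the stack into `%rip`, so `guard` runs. Unlike a normal function call, where the caller is responsible for stack alignment, nothing here maintains the ABI's alignment: `guard` starts with RSP 8 bytes away from where a real `call` would have left it, so its stack slots that the compiler assumes are 16-byte aligned are not. When `movaps` is later used on one of them, the program segfaults.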
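The alignment arithmetic can be checked without running any assembly. The sketch below assumes the SysV x86-64 rule that a function is entered with RSP % 16 == 8; it lays out the question's initial stack in a byte buffer and replays the chain of `ret` instructions, printing whether each function is entered with a correctly aligned RSP. `F`, `SKIP` and `GUARD` are hypothetical stand-in addresses, and the `SKIP` slot models one possible fix (a naked function whose body is only `ret`), which is an assumption and not taken from the question or its answers.

```rust
/// One stack slot (return address / function pointer) on x86-64.
const SLOT: usize = 8;

// Hypothetical stand-in "code addresses"; no real functions are called.
const F: u64 = 0xF0;
const SKIP: u64 = 0x5C; // models a naked function whose body is only `ret`
const GUARD: u64 = 0x6D;

/// Write `entries` upward from the initial RSP, which sits 32 bytes below the
/// end of `stack` after that end is rounded down to a multiple of 16.
/// Returns the initial RSP (an address inside `stack`).
fn build_stack(stack: &mut [u8], entries: &[u64]) -> usize {
    let base = stack.as_ptr() as usize;
    // Vec<u8> only guarantees 1-byte alignment, so align the top explicitly.
    let top = (base + stack.len()) & !15;
    let rsp = top - 32;
    assert!(rsp >= base && rsp + entries.len() * SLOT <= top, "entries must fit below top");
    for (i, entry) in entries.iter().enumerate() {
        let off = rsp - base + i * SLOT;
        stack[off..off + SLOT].copy_from_slice(&entry.to_ne_bytes());
    }
    rsp
}

/// Replay `count` `ret` instructions starting at `rsp`: each pops the 8-byte
/// slot at RSP and jumps there. Returns (target, RSP on entry to target).
fn simulate_rets(stack: &[u8], mut rsp: usize, count: usize) -> Vec<(u64, usize)> {
    let base = stack.as_ptr() as usize;
    let mut entered = Vec::with_capacity(count);
    for _ in 0..count {
        let off = rsp - base;
        let mut bytes = [0u8; SLOT];
        bytes.copy_from_slice(&stack[off..off + SLOT]);
        rsp += SLOT; // `ret` pops the return address
        entered.push((u64::from_ne_bytes(bytes), rsp));
    }
    entered
}

/// SysV x86-64: `call` executes with RSP % 16 == 0, so the callee sees RSP % 16 == 8.
fn entry_aligned(rsp_on_entry: usize) -> bool {
    rsp_on_entry % 16 == 8
}

fn main() {
    let mut stack = vec![0u8; 4096];

    // Layout from the question: [rsp] = f, [rsp+8] = guard.
    let rsp = build_stack(&mut stack, &[F, GUARD]);
    for (target, entry) in simulate_rets(&stack, rsp, 2) {
        println!("original: {:#x} entered, aligned = {}", target, entry_aligned(entry));
    }

    // Same initial RSP with a ret-only trampoline slot between f and guard.
    let rsp = build_stack(&mut stack, &[F, SKIP, GUARD]);
    for (target, entry) in simulate_rets(&stack, rsp, 3) {
        println!("with trampoline: {:#x} entered, aligned = {}", target, entry_aligned(entry));
    }
}
```

With the question's layout, `f` is entered correctly aligned but `guard` is not. With the extra slot, the trampoline itself is entered misaligned, which is harmless because it executes only `ret` and never touches an aligned stack slot, and `guard` is then entered with the alignment the ABI promises.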

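Did you try the code in a debugger and check where the segfault happens? Are you sure about the use of `#[naked]`? I can't tell whether the caller sets up the arguments correctly.

@泽拉: Yes, I accidentally posted this question while I was still editing it. Whether or not `[rsp+128]` is `0x0`, it loads without error. The fault is probably because `rsp` is not aligned.

Did you compare the assembly generated in non-release mode with the assembly generated in release mode?

Am I right that the fault is that the stack is not 16-byte aligned, so `movaps` fails?

@austraras: That's my guess, yes. If RSP is 16-byte aligned, then RSP+128 is also aligned. You can check the value of the RSP register in a debugger. (The low hex digit of its value should be `0`.)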
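If attaching a debugger is inconvenient, RSP can also be read at runtime. This is a minimal sketch assuming an x86-64 target and a toolchain with the stable `asm!` macro (not the `llvm_asm!` used in the question's code); `current_rsp` is an illustrative name. Keep in mind that the ABI only fixes RSP's alignment at `call` boundaries, so a value read in the middle of an arbitrary function need not end in `0`; the check is meaningful at an instruction like the faulting `movaps`, where the compiler assumed `rsp` was 16-byte aligned.

```rust
use std::arch::asm;

/// Read the current stack pointer (x86-64 only).
#[inline(always)]
fn current_rsp() -> u64 {
    let rsp: u64;
    // SAFETY: only copies RSP into a general-purpose register; no memory or
    // stack access and no flags are affected.
    unsafe {
        asm!("mov {}, rsp", out(reg) rsp, options(nomem, nostack, preserves_flags));
    }
    rsp
}

fn main() {
    let rsp = current_rsp();
    // A 16-byte-aligned RSP has 0 as its lowest hex digit.
    println!("rsp = {:#x}, low hex digit = {:x}", rsp, rsp & 0xf);
}
```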

//! green-threads is a toy implementation on user-space threads in non-preemptive multitasking.
//! This implementation is mostly guided by cfsamson's tutorial:
//! https://cfsamson.gitbook.io/green-threads-explained-in-200-lines-of-rust/green-threads.
#![deny(missing_docs)]
#![feature(llvm_asm)]
#![feature(naked_functions)]

use std::collections::VecDeque;
use std::ptr;

const DEFAULT_STACK_SIZE: usize = 1024 * 1024 * 2;
static mut RUNTIME: usize = 0;

/// Runtime schedule and switch threads. current is the id of thread which is currently running.
pub struct Runtime {
    queue: VecDeque<Task>,
    current: Task,
}

/// ThreadContext contains the registers marked as "callee-saved" (preserved across calls)
/// in the specification of x86-64 architecture. They contain all the information
/// we need to resume a thread.
#[derive(Debug, Default)]
#[repr(C)]
struct ThreadContext {
    rsp: u64,
    r15: u64,
    r14: u64,
    r13: u64,
    r12: u64,
    rbx: u64,
    rbp: u64,
}

struct Task {
    stack: Vec<u8>,
    ctx: ThreadContext,
}

impl Task {
    fn new() -> Self {
        Task {
            stack: vec![0_u8; DEFAULT_STACK_SIZE],
            ctx: ThreadContext::default(),
        }
    }
}

impl Runtime {
    /// Initialize with a base thread.
    pub fn new() -> Self {
        let base_thread = Task::new();

        Runtime {
            queue: VecDeque::new(),
            current: base_thread,
        }
    }

    /// This is cheating a bit, but we need a pointer to our Runtime
    /// stored so we can call yield on it even if we don't have a
    /// reference to it.
    pub fn init(&self) {
        unsafe {
            let r_ptr: *const Runtime = self;
            RUNTIME = r_ptr as usize;
        }
    }

    /// start the runtime
    pub fn run(&mut self) {
        while self.t_yield() {}
    }

    fn t_return(&mut self) -> bool {
        if self.queue.len() == 0 {
            return false;
        }

        let mut next = self.queue.pop_front().unwrap();
        std::mem::swap(&mut next, &mut self.current);

        unsafe {
            switch(&mut next.ctx, &self.current.ctx);
        }

        self.queue.len() > 0
    }

    fn t_yield(&mut self) -> bool {
        if self.queue.len() == 0 {
            return false;
        }

        let mut next = self.queue.pop_front().unwrap();
        std::mem::swap(&mut next, &mut self.current);
        self.queue.push_back(next);

        unsafe {
            let last = self.queue.len() - 1;
            switch(&mut self.queue[last].ctx, &self.current.ctx);
        }
        // Prevents compiler from optimizing our code away on Windows.
        self.queue.len() > 0
    }

    /// spawn a function to be executed by runtime
    pub fn spawn(&mut self, f: fn()) {
        let mut available = Task::new();

        let size = available.stack.len();
        let s_ptr = available.stack.as_mut_ptr();

        unsafe {
            // put the f to the 16 bytes aligned position.
            ptr::write(s_ptr.offset((size - 32) as isize) as *mut u64, f as u64);
            // put guard in the next 8-byte slot above f, so the `ret` at the end of f jumps to it.
            ptr::write(s_ptr.offset((size - 24) as isize) as *mut u64, guard as u64);

            available.ctx.rsp = s_ptr.offset((size - 32) as isize) as u64;
        }

        self.queue.push_back(available);
    }
}

fn guard() {
    unsafe {
        let rt_ptr = RUNTIME as *mut Runtime;
        (*rt_ptr).t_return();
    }
}

/// yield_thread is a helper function that lets us call yield from an arbitrary place in our code.
pub fn yield_thread() {
    unsafe {
        let rt_ptr = RUNTIME as *mut Runtime;
        (*rt_ptr).t_yield();
    };
}

#[naked]
#[inline(never)]
unsafe fn switch(old: *mut ThreadContext, new: *const ThreadContext) {
    llvm_asm!("
        mov     %rsp, 0x00($0)
        mov     %r15, 0x08($0)
        mov     %r14, 0x10($0)
        mov     %r13, 0x18($0)
        mov     %r12, 0x20($0)
        mov     %rbx, 0x28($0)
        mov     %rbp, 0x30($0)
        mov     0x00($1), %rsp
        mov     0x08($1), %r15
        mov     0x10($1), %r14
        mov     0x18($1), %r13
        mov     0x20($1), %r12
        mov     0x28($1), %rbx
        mov     0x30($1), %rbp
        ret
        "
    :
    :"r"(old), "r"(new)
    :
    : "volatile", "alignstack"
    );
}

fn main() {
    let mut runtime = Runtime::new();
    runtime.init();
    runtime.spawn(|| {});
    runtime.run();
}

    mov qword ptr [rsp + 144], rdi
    movups  xmm0, xmmword ptr [rdx + rsi + 56]
; Here
    movaps  xmmword ptr [rsp + 128], xmm0
    movups  xmm0, xmmword ptr [rdx + rsi + 40]
    movaps  xmmword ptr [rsp + 112], xmm0