World's fastest VM-exit

The most common way to detect the presence of a hypervisor (apart from the CPUID HV presence bit) is to perform some kind of timing attack. All you have to do is time a sequence of instructions that does not cause a VM-exit, and then time a sequence that does. A big difference between those two times means that a hypervisor is most likely present. I won’t go into detail about how exactly that works and why, because there are already dozens of articles about it.
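
For reference, the presence bit mentioned above is bit 31 of ECX in CPUID leaf 1. The snippet below is only a minimal sketch of that check; the function names are made up for illustration and are not part of any library.

```cpp
#include <cassert>
#include <cstdint>

// CPUID leaf 1 returns feature flags in ECX; bit 31 is the
// "hypervisor present" bit and reads as 0 on physical hardware.
constexpr uint32_t kHypervisorPresentBit = 1u << 31;

// Hypothetical helper: takes the ECX value returned by CPUID leaf 1.
bool HypervisorBitSet(uint32_t leaf1Ecx)
{
    return (leaf1Ecx & kHypervisorPresentBit) != 0;
}

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
// Queries the real CPU with the MSVC intrinsic.
bool HypervisorBitSetOnThisCpu()
{
    int regs[4] = {};  // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return HypervisorBitSet(static_cast<uint32_t>(regs[2]));
}
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
// Same query using the GCC/Clang helper from <cpuid.h>.
bool HypervisorBitSetOnThisCpu()
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;  // leaf 1 not supported
    return HypervisorBitSet(ecx);
}
#endif
```

A hypervisor can simply clear this bit in its CPUID handler, which is why timing checks are used in addition to it.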

While thinking about this concept, I wondered about one thing: is it possible to optimize a VM-exit so much that the time difference falls within the margin of error? (Well, I knew it wouldn’t, but I still wanted to know how far I could bring it down.)

As an example, this is the simple test program that I am going to use. While it is not ideal, since it runs in user mode and all sorts of things can mess up the timings, it works fine most of the time and is good enough for the purpose of this article.

printf("Testing with no exit...\n");
int32_t registers[4];
std::vector<uint64_t> no_exit_time;
for (auto x = 0; x < CYCLES_PER_TEST; x++)
{
    uint64_t time = 0;
    for (int i = 0; i < 100; i++)
    {
        auto t1 = __rdtsc();
        auto t2 = __rdtsc();
        time += t2 - t1;
    }

    no_exit_time.push_back(time);
}

printf("No exit median: %llu\n", GetMedian(no_exit_time));

printf("Testing with exit...\n");
std::vector<uint64_t> exit_time_list;
for (auto x = 0; x < CYCLES_PER_TEST; x++)
{
    uint64_t time = 0;
    for (int i = 0; i < 100; i++)
    {
        auto t1 = __rdtsc();
        __cpuid(registers, 0);
        auto t2 = __rdtsc();
        time += t2 - t1;
    }

    exit_time_list.push_back(time);
}

printf("With exit median: %llu\n", GetMedian(exit_time_list));
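
The listing calls GetMedian, which is not shown in the article. One possible implementation is sketched below; it is a stand-in written for illustration, not necessarily what the original program used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the article's GetMedian helper.
// Takes the vector by value so the caller's samples keep their order.
uint64_t GetMedian(std::vector<uint64_t> samples)
{
    assert(!samples.empty() && "median of an empty sample set is undefined");

    // Partially sort so that the upper-middle element sits at `mid`
    // and everything before it is less than or equal to it.
    const auto mid = samples.begin() + samples.size() / 2;
    std::nth_element(samples.begin(), mid, samples.end());
    const uint64_t upper = *mid;
    if (samples.size() % 2 == 1)
        return upper;

    // Even count: average the two middle elements without overflowing.
    const uint64_t lower = *std::max_element(samples.begin(), mid);
    return lower + (upper - lower) / 2;
}
```

The median is a sensible choice here because a single interrupt or context switch during one sample can inflate that sample enormously, and a mean would carry that outlier along.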

This is what the program will output on bare metal with no hypervisor.

screenshot

And this is what the program will output when a custom-made hypervisor with a typical VM-exit implementation in C is loaded (note that RDTSC exiting is disabled).

screenshot

As you can see, when we use CPUID (which causes VM-exit unconditionally under Intel VT-x), the total time is significantly higher than on bare metal.
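
A detector would turn those two medians into a verdict. The sketch below does it with a simple ratio test; the function name and the threshold parameter are assumptions for illustration, and a real threshold has to be calibrated per machine, because CPUID costs cycles on bare metal too and the ratio is never 1.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: reports a likely hypervisor when the median measured
// around the exiting instruction is at least `ratioThreshold` times the
// baseline median. The threshold is an assumption chosen by the caller.
bool LooksVirtualized(uint64_t noExitMedian, uint64_t withExitMedian, double ratioThreshold)
{
    assert(noExitMedian > 0 && ratioThreshold > 1.0);
    const double ratio = static_cast<double>(withExitMedian) /
                         static_cast<double>(noExitMedian);
    return ratio >= ratioThreshold;
}
```

With the test program above, the call would look like LooksVirtualized(GetMedian(no_exit_time), GetMedian(exit_time_list), threshold).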

Optimizations

I am using a slightly modified version of the SimpleVisor project for this. The VM-exit handler looks something like this.

AsmVtxEntry PROC
push    rcx
lea     rcx, [rsp+8h]

call    RtlCaptureContext

sub     rsp, 20h
jmp     VtxEntryHandler
AsmVtxEntry ENDP

void VTX::VtxEntryHandler(PCONTEXT context)
{
    context->Rcx = *reinterpret_cast<UINT64*>(reinterpret_cast<ULONG64>(context) - sizeof(context->Rcx));

    const PSHV_VP_DATA vpData = reinterpret_cast<PSHV_VP_DATA>(reinterpret_cast<ULONG64>(context + 1) - KERNEL_STACK_SIZE);

    SHV_VP_STATE guestContext;
    guestContext.GuestEFlags = VtxRead(GUEST_RFLAGS);
    guestContext.GuestRip = VtxRead(GUEST_RIP);
    guestContext.GuestRsp = VtxRead(GUEST_RSP);
    guestContext.ExitReason = VtxRead(VM_EXIT_REASON) & 0xFFFF;
    guestContext.VpRegs = context;
    guestContext.ExitVm = FALSE;
    guestContext.VpData = vpData;
    guestContext.AdditionalData = GetProcessorAdditionalData(GetCurrentProcessorIndex(&guestContext));
    guestContext.AdditionalData->DoNotAdvanceRip = false;

    HandleExit(&guestContext);

    context->Rsp += sizeof(context->Rcx);
    context->Rip = reinterpret_cast<UINT64>(VtxResume);

    AsmRestoreContext(context, nullptr);
}

void VTX::HandleExit(const PSHV_VP_STATE vpState)
{
    switch (vpState->ExitReason)
    {
    case EXIT_REASON_CPUID:
        HandleCPUID(vpState);
        break;
    
    /* more code */

    default:
        KeBugCheck(INVALID_DRIVER_HANDLE);
        break;
    }

    if (vpState->AdditionalData->DoNotAdvanceRip)
        return;

    vpState->GuestRip += VtxRead(VM_EXIT_INSTRUCTION_LEN);
    __vmx_vmwrite(GUEST_RIP, vpState->GuestRip);
}

void VTX::HandleCPUID(const PSHV_VP_STATE vpState)
{
    int registers[4];
    __cpuidex(registers, static_cast<int>(vpState->VpRegs->Rax), static_cast<int>(vpState->VpRegs->Rcx));

    vpState->VpRegs->Rax = registers[0];
    vpState->VpRegs->Rbx = registers[1];
    vpState->VpRegs->Rcx = registers[2];
    vpState->VpRegs->Rdx = registers[3];
}

As you can imagine, this code generates quite a lot of instructions and requires us to capture the thread’s context so that we can restore it later. What if we rewrote it directly in assembly, so that we would not have to do any of that and could keep the instruction count as low as possible, while still handling CPUID properly?

Well, let’s try it…

This is the code that I wrote within a few minutes. It reads the VM-exit reason; if it is not CPUID, it continues as usual to the C handler, and if it is, it advances the guest RIP and executes CPUID directly. In theory, this is as minimal as it can get while still allowing the hypervisor to run and the virtualized system to operate normally.

.CONST

GUEST_RIP EQU 0000681eh
VM_EXIT_INSTRUCTION_LEN EQU 0000440ch
VMEXIT_REASON EQU 4402h
EXIT_REASON_CPUID EQU 0Ah

.CODE ; procedures belong in the code segment, not in .CONST

AsmAdvanceGuestRIP PROC
push rcx
push rax
push r11

mov rcx, GUEST_RIP
xor rax, rax
vmread rax, rcx

mov rcx, VM_EXIT_INSTRUCTION_LEN
xor r11, r11
vmread r11, rcx

add rax, r11

mov rcx, GUEST_RIP
vmwrite rcx, rax

pop r11
pop rax
pop rcx

ret
AsmAdvanceGuestRIP ENDP

AsmHandleCPUID PROC
call AsmAdvanceGuestRIP

pop r9
pop r8
popf

cpuid

vmresume
AsmHandleCPUID ENDP

AsmVtxEntry PROC
pushf
push r8
push r9

mov r8, VMEXIT_REASON
xor r9, r9
vmread r9, r8
and r9, 0FFFFh

cmp r9, EXIT_REASON_CPUID
je AsmHandleCPUID

pop r9
pop r8
popf

; Capture context, C handler here
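
To make the control flow of the listing easier to follow, here is a plain C++ model of it. This is only a sketch: the struct and function names are hypothetical, the VMCS is replaced by an ordinary struct, and nothing here executes VMX instructions.

```cpp
#include <cassert>
#include <cstdint>

// Value taken from the assembly listing above.
constexpr uint64_t kExitReasonCpuid = 0x0A;

// Hypothetical plain-data model of the three VMCS fields the fast path touches.
struct VmcsModel
{
    uint64_t exitReason;          // VMEXIT_REASON (4402h)
    uint64_t guestRip;            // GUEST_RIP (681Eh)
    uint64_t exitInstructionLen;  // VM_EXIT_INSTRUCTION_LEN (440Ch)
};

// Mirrors AsmVtxEntry and AsmHandleCPUID. Returns true when the fast path
// handled the exit, false when the C handler would have to take over.
bool TryFastPath(VmcsModel& vmcs)
{
    // and r9, 0FFFFh: keep only the basic exit reason (bits 15:0).
    if ((vmcs.exitReason & 0xFFFF) != kExitReasonCpuid)
        return false;  // not CPUID, fall through to the C handler

    // AsmAdvanceGuestRIP: GUEST_RIP += VM_EXIT_INSTRUCTION_LEN.
    vmcs.guestRip += vmcs.exitInstructionLen;
    return true;       // the real code executes CPUID and VMRESUME here
}
```

The model leaves out the register saving and restoring, which is the part the assembly version keeps to three pushes and pops instead of a full context capture.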

And the result is… a time difference that is 18.5% lower.

screenshot

So yes, the VM-exit time improved measurably, but on the other hand, even this super minimal exit implementation still takes significantly longer than a single instruction does on bare metal. Not really surprising, but I wanted to try it anyway.
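
For anyone repeating the experiment, a percentage like the one above can be derived from the three medians. The helper below is a sketch that assumes the figure refers to how much the gap between the "with exit" and "no exit" medians shrank; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: percentage by which the exit overhead (median with
// exit minus median without exit) shrank between two hypervisor builds.
double OverheadReductionPercent(uint64_t noExit, uint64_t exitBefore, uint64_t exitAfter)
{
    assert(exitBefore > noExit && exitAfter >= noExit);
    const double before = static_cast<double>(exitBefore - noExit);
    const double after  = static_cast<double>(exitAfter - noExit);
    return (before - after) / before * 100.0;
}
```

Subtracting the no-exit median first removes the cost of the two RDTSC instructions themselves, so the percentage describes the exit path only.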