World's fastest VM-exit
The most common way to detect the presence of a hypervisor (apart from the CPUID HV presence bit) is to perform some kind of timing attack. All you have to do is time a sequence of instructions that does not cause a VM-exit, and then time it again with an instruction that does. A big difference between the two measurements means that a hypervisor is most likely present. I won’t go into detail about how exactly that works and why, because there are already dozens of articles about it.
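For completeness, the presence bit mentioned above is bit 31 of ECX returned by CPUID leaf 1. The following is only a minimal sketch of how it could be read; the names HypervisorBitSet and HypervisorBitReported are made up for illustration and do not come from any project discussed here. Since a hypervisor that intercepts CPUID is free to clear this bit, it is a hint at best, which is why timing checks are used in the first place.

```cpp
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#define HV_HAVE_CPUID 1
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HV_HAVE_CPUID 1
#else
#define HV_HAVE_CPUID 0
#endif

// Bit 31 of ECX from CPUID leaf 1 is the bit hypervisors use to announce themselves.
// (Hypothetical helper name, used only in this sketch.)
constexpr bool HypervisorBitSet(std::uint32_t leaf1Ecx)
{
    return ((leaf1Ecx >> 31) & 1u) != 0;
}

// Executes CPUID leaf 1 and checks the presence bit.
// Returns false on builds where CPUID is not available.
inline bool HypervisorBitReported()
{
#if HV_HAVE_CPUID && defined(_MSC_VER)
    int regs[4] = {};
    __cpuid(regs, 1);                       // regs = { EAX, EBX, ECX, EDX }
    return HypervisorBitSet(static_cast<std::uint32_t>(regs[2]));
#elif HV_HAVE_CPUID
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) == 0)
        return false;                       // leaf 1 not supported
    return HypervisorBitSet(ecx);
#else
    return false;
#endif
}
```

Calling HypervisorBitReported() from a test program gives a quick first answer before any timing is done; the pure HypervisorBitSet() part is split out so the bit logic can be checked without executing CPUID.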
While thinking about this concept, I wondered about one thing: Is it possible to optimize VM-exit handling so much that the time difference falls within the margin of error? (Well, I knew it wouldn’t be, but I still wanted to know how far I could bring it down.)
As an example, this is a simple test program that I am going to use. While it is not ideal, since it runs in usermode and all sorts of things can happen to mess up the timings, most of the time it works fine and is good enough for the purpose of this article.
    printf("Testing with no exit...\n");
    int32_t registers[4];
    std::vector<uint64_t> no_exit_time;
    for (auto x = 0; x < CYCLES_PER_TEST; x++)
    {
        uint64_t time = 0;
        for (int i = 0; i < 100; i++)
        {
            auto t1 = __rdtsc();
            auto t2 = __rdtsc();
            time += t2 - t1;
        }
        no_exit_time.push_back(time);
    }
    printf("No exit median: %llu\n", GetMedian(no_exit_time));

    printf("Testing with exit...\n");
    std::vector<uint64_t> exit_time_list;
    for (auto x = 0; x < CYCLES_PER_TEST; x++)
    {
        uint64_t time = 0;
        for (int i = 0; i < 100; i++)
        {
            auto t1 = __rdtsc();
            __cpuid(registers, 0);
            auto t2 = __rdtsc();
            time += t2 - t1;
        }
        exit_time_list.push_back(time);
    }
    printf("With exit median: %llu\n", GetMedian(exit_time_list));
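The listing above uses GetMedian and CYCLES_PER_TEST without showing them. What follows is only a guess at what such helpers might look like, not the code that was actually used; in particular, the value of CYCLES_PER_TEST is an arbitrary placeholder.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Placeholder value; the number of samples actually collected is not stated.
constexpr int CYCLES_PER_TEST = 1000;

// Returns the median of the samples, or 0 for an empty list.
// The vector is taken by value so the caller's samples keep their order.
std::uint64_t GetMedian(std::vector<std::uint64_t> samples)
{
    if (samples.empty())
        return 0;

    const std::size_t mid = samples.size() / 2;

    // Partially sort so that samples[mid] holds the value it would have
    // in a fully sorted list, with everything before it being <= it.
    std::nth_element(samples.begin(), samples.begin() + mid, samples.end());
    const std::uint64_t upper = samples[mid];

    if (samples.size() % 2 == 1)
        return upper;

    // Even count: average the two middle values without overflowing.
    const std::uint64_t lower =
        *std::max_element(samples.begin(), samples.begin() + mid);
    return lower + (upper - lower) / 2;
}
```

The median is a reasonable choice for this kind of measurement because a handful of samples disturbed by interrupts or context switches will not shift it the way they would shift an average.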
This is what the program will output on bare metal with no hypervisor.
And this is what the program will output when a custom-made hypervisor with a typical VM-exit implementation in C is loaded (note that RDTSC exiting is disabled).
As you can see, when we use CPUID (which causes VM-exit unconditionally under Intel VT-x), the total time is significantly higher than on bare metal.
Optimizations
I am using a slightly modified SimpleVisor project for this. The VM-exit handler looks something like this.
    AsmVtxEntry PROC
        push rcx
        lea rcx, [rsp+8h]
        call RtlCaptureContext
        sub rsp, 20h
        jmp VtxEntryHandler
    AsmVtxEntry ENDP

    void VTX::VtxEntryHandler(PCONTEXT context)
    {
        context->Rcx = *reinterpret_cast<UINT64*>(reinterpret_cast<ULONG64>(context) - sizeof(context->Rcx));
        const PSHV_VP_DATA vpData = reinterpret_cast<PSHV_VP_DATA>(reinterpret_cast<ULONG64>(context + 1) - KERNEL_STACK_SIZE);

        SHV_VP_STATE guestContext;
        guestContext.GuestEFlags = VtxRead(GUEST_RFLAGS);
        guestContext.GuestRip = VtxRead(GUEST_RIP);
        guestContext.GuestRsp = VtxRead(GUEST_RSP);
        guestContext.ExitReason = VtxRead(VM_EXIT_REASON) & 0xFFFF;
        guestContext.VpRegs = context;
        guestContext.ExitVm = FALSE;
        guestContext.VpData = vpData;
        guestContext.AdditionalData = GetProcessorAdditionalData(GetCurrentProcessorIndex(&guestContext));
        guestContext.AdditionalData->DoNotAdvanceRip = false;

        HandleExit(&guestContext);

        context->Rsp += sizeof(context->Rcx);
        context->Rip = reinterpret_cast<UINT64>(VtxResume);
        AsmRestoreContext(context, nullptr);
    }

    void VTX::HandleExit(const PSHV_VP_STATE vpState)
    {
        switch (vpState->ExitReason)
        {
        case EXIT_REASON_CPUID:
            HandleCPUID(vpState);
            break;
        /* more code */
        default:
            KeBugCheck(INVALID_DRIVER_HANDLE);
            break;
        }

        if (vpState->AdditionalData->DoNotAdvanceRip)
            return;

        vpState->GuestRip += VtxRead(VM_EXIT_INSTRUCTION_LEN);
        __vmx_vmwrite(GUEST_RIP, vpState->GuestRip);
    }

    void VTX::HandleCPUID(const PSHV_VP_STATE vpState)
    {
        int registers[4];
        __cpuidex(registers, static_cast<int>(vpState->VpRegs->Rax), static_cast<int>(vpState->VpRegs->Rcx));
        vpState->VpRegs->Rax = registers[0];
        vpState->VpRegs->Rbx = registers[1];
        vpState->VpRegs->Rcx = registers[2];
        vpState->VpRegs->Rdx = registers[3];
    }
As you can imagine, this code generates quite a lot of instructions and requires us to capture the thread’s context so that we can restore it later. What if we rewrote it in assembly directly, so that we would not have to do any of that and could keep the instruction count as low as possible, while still handling CPUID properly?
Well, let’s try it…
This is code that I wrote within a few minutes. It reads the VM-exit reason; if it’s not CPUID, it continues as usual to the C handler, and if it is, it executes CPUID directly and moves the guest RIP forward past the instruction. In theory, this is as minimal as it can get, while still allowing the hypervisor to run and the underlying virtualized system to operate normally.
    .CONST

    GUEST_RIP               EQU 0000681eh
    VM_EXIT_INSTRUCTION_LEN EQU 0000440ch
    VMEXIT_REASON           EQU 4402h
    EXIT_REASON_CPUID       EQU 0Ah

    AsmAdvanceGuestRIP PROC
        push rcx
        push rax
        push r11
        mov rcx, GUEST_RIP
        xor rax, rax
        vmread rax, rcx
        mov rcx, VM_EXIT_INSTRUCTION_LEN
        xor r11, r11
        vmread r11, rcx
        add rax, r11
        mov rcx, GUEST_RIP
        vmwrite rcx, rax
        pop r11
        pop rax
        pop rcx
        ret
    AsmAdvanceGuestRIP ENDP

    AsmHandleCPUID PROC
        call AsmAdvanceGuestRIP
        pop r9
        pop r8
        popf
        cpuid
        vmresume
    AsmHandleCPUID ENDP

    AsmVtxEntry PROC
        pushf
        push r8
        push r9
        mov r8, VMEXIT_REASON
        xor r9, r9
        vmread r9, r8
        and r9, 0FFFFh
        cmp r9, EXIT_REASON_CPUID
        je AsmHandleCPUID
        pop r9
        pop r8
        popf
        ; Capture context, C handler here
And the result is… an 18.5% lower time difference.
So yes, the VM-exit time improved enough for the change to be measurable. On the other hand, even with this super minimal exit implementation, the time it takes is still significantly higher than what a single instruction takes on bare metal. Not really surprising, but I wanted to try it anyway.
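For anyone who wants to reproduce the comparison, here is a rough sketch of how such a percentage could be derived from the medians printed by the test program. It assumes that the figure describes how much the gap between the "with exit" and "no exit" medians shrank from the C handler to the assembly handler; the function names are made up for this sketch.

```cpp
#include <cstdint>

// Extra cycles attributable to the exiting instruction.
// Saturates at 0 in case noise makes the "with exit" median the smaller one.
constexpr std::uint64_t ExitOverhead(std::uint64_t noExitMedian,
                                     std::uint64_t withExitMedian)
{
    return withExitMedian > noExitMedian ? withExitMedian - noExitMedian : 0;
}

// Percentage by which the overhead shrank between two handler implementations.
// A negative result means the second implementation is slower.
constexpr double OverheadReductionPercent(std::uint64_t overheadBefore,
                                          std::uint64_t overheadAfter)
{
    if (overheadBefore == 0)
        return 0.0; // nothing to reduce, and avoids dividing by zero
    return 100.0 *
           (static_cast<double>(overheadBefore) - static_cast<double>(overheadAfter)) /
           static_cast<double>(overheadBefore);
}
```

Feeding it the overheads measured with the C handler and with the assembly handler on the same machine gives the relative improvement. Comparing overheads rather than raw "with exit" medians keeps the cost of the two RDTSC instructions out of the result.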