Saturday, July 11, 2020

Hyper-V #0x2 - Hypercalls part 2

Laziness and mistakes

Hi again, and sorry for the significant delay after the last post! The weather in Estonia was weirdly nice for weeks (very sunny and warm) - this, combined with my laziness, resulted in much lower productivity. Now that the weather is back to its regular self (rainy and chilly), I hope to move on with the writing and cover hypercalls from the Hyper-V side.

But before I get to it, I must mend a massive oversight that I made last time. I highly recommended Alex Ionescu's blog posts, but I forgot to mention another researcher whose blog posts I also highly praise for anyone getting into Hyper-V. This person is Saar Amar (@AmarSaar). His blog post "First Steps in Hyper-V Research" (https://msrc-blog.microsoft.com/2018/12/10/first-steps-in-hyper-v-research/) is a definite must-read for anyone starting, and there are many other writings from him on the MSRC blog. Sorry for forgetting you!

Before beginning notes

  • I will be covering all opcodes etc. from the Intel CPU's point of view. The logic on AMD should be quite similar (hopefully).
  • I will only use the x64 architecture - who uses 32-bit for virtualization anyway?

Hypercalls from Hyper-V

Logically, the Hyper-V side of the hypercall flow is the essential one - Hyper-V is the thing that actually does all the work. So it makes sense to go through the basic logic of how handling works on the hypervisor side:

  0. VM exit is triggered by the vmcall opcode (covered in previous posts).
  1. Hyper-V VM exit handler is triggered & CPU context is stored.
  2. Hyper-V handles the hypercall that was requested.
  3. Hyper-V restores the CPU context and resumes the VM.

I will now go through these steps and show how/where each of them is done in Hyper-V.

1. Hyper-V VM exit handler is triggered & CPU context is stored

The Hyper-V VM exit handler is the function to which code execution is directed right after the vmcall opcode (or any other reason the VM was exited - this is not only a "vmcall handler"). It can be easily found in two different ways:
  1. Look for code that is storing all the register values one by one somewhere.

  2. The handler function's location is written into the VMCS (Virtual-Machine Control Structure) at field 0x6C16 (HOST_RIP). In the picture above, rcx will contain a pointer to the handler function.

You might wonder where the stack pointer points in such a situation (if you run "k" in Windbg inside the handler function, it displays an empty stack trace). This value is also taken from the VMCS, but from field 0x6C14 (HOST_RSP).

But as said before, the VM exit handler function is universal, so how does Hyper-V determine the reason for the exit (it could also be a memory exception, an MSR read/write, a CPUID operation, etc.)? This can once again be found in the VMCS, at field 0x4402 (the exit reason). For a vmcall exit, the value in that field is 0x12 (18). Based on this, it's easy to find the location where the value is read and look at the branching based on it (my current version's path is shown):

READING VM EXIT REASON


[SOME LESS IMPORTANT CODE]


AND HERE IS THE HYPERCALL HANDLING CODE


2. Hyper-V handles the hypercall that was required

This was already covered in previous posts, but I will quickly go through it again. The hypercall number in the hypercall input value/structure is used as an index into an array inside Hyper-V at hv+0xC00000. The array elements are 0x18 (24) bytes large, and the first 8 bytes are the pointer to the handling function. The rest of the structure is, of course, not public, and I'm not sure about its overall layout, so I will not make any guesses before I'm more certain (it hasn't been that important yet).

This being said, if you want to quickly know which hypercalls have a handler and which have no actual functionality behind them, just determine the handler of hypercall 0 and compare the others to it. Hypercall 0 is an invalid hypercall, so any other hypercall entry that points to the same handler is also not in use.
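As a sketch, the index arithmetic and the "compare against hypercall 0's handler" trick can be expressed like this (assuming a raw dump of the table; the helper names are my own, and the layout beyond the first 8 bytes of each entry is unknown, as noted above):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HC_TABLE_OFFSET 0xC00000ULL  /* hypercall table, relative to the hv image base */
#define HC_ENTRY_SIZE   0x18ULL      /* 24 bytes per entry; first 8 = handler pointer */

/* Offset of hypercall `index`'s entry from the hv image base. */
static uint64_t hc_entry_offset(uint64_t index)
{
    return HC_TABLE_OFFSET + index * HC_ENTRY_SIZE;
}

/* Given a raw dump of the table, report whether hypercall `index`
 * shares its handler with hypercall 0 (i.e., is not implemented). */
static int hc_is_invalid(const uint8_t *table, uint64_t index)
{
    uint64_t invalid_handler, handler;
    memcpy(&invalid_handler, table, sizeof(invalid_handler));         /* entry 0 */
    memcpy(&handler, table + index * HC_ENTRY_SIZE, sizeof(handler));
    return handler == invalid_handler;
}
```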


3. Hyper-V restores CPU context and resumes the VM

After finishing the hypercall handling, execution should return to the VM. For that, 2 things must happen. First, the CPU context has to be restored, and then VM execution must be continued with the vmresume opcode. In the current version of Hyper-V, there are only 4 instances of the vmresume opcode in the hypervisor executable, so it's not tricky to find out which one is used in the hypercall handling flow. After that, it's easy to see that Hyper-V restores the original CPU context before returning:




Reading VMCS fields

Since a lot of essential values are kept in VMCS fields, the question arises: how do we read these values? Windbg itself does not seem to allow it, and while I'm quite sure that the MS-internal hvexts.dll extension provides this capability, it does not help anyone outside. So we have to make our own tool for it. Luckily, it's not hard. I use the exact same method as in the kernel to inject code into Hyper-V. I take the end of the .text segment, where no more code exists on the last executable page:


As you can see, after the code there are more than 0x500 bytes of free memory that are still executable and can also easily be filled by the debugger. To verify it on the live system as well, run "u hv+0x347A82; db hv+0x347A82" in Windbg; the result confirms that the memory is available but contains nothing important:


After that, it's easy to generate some code to read the value from VMCS:
push rax            ; save the registers we clobber
push rcx
mov rax, 0x6C16     ; VMCS field encoding to read
vmread rcx, rax     ; rcx = value of VMCS field 0x6C16
int3                ; break into the debugger to read rcx
pop rcx             ; restore the registers
pop rax
int3                ; break again to return to the original rip

This code will save the used registers, read field 0x6C16 from the VMCS, break into the debugger (allowing you to read the value), restore the registers, and break again (to allow returning to the original instruction pointer).

So the process is:
  1. Write the code to some executable location. For example, in the version where I'm testing:
    eb hv+0x347B00 0x50 0x51 0x48 0xC7 0xC0 0x16 0x6C 0x00 0x00 0x0F 0x78 0xC1 0xCC 0x59 0x58 0xCC
  2. When you want to read the 0x6C16 value, point rip to the injected code (also storing the original rip)
  3. Execute until first int3
  4. Read the rcx (contains the VMCS field value)
  5. Execute until the second int3
  6. Restore original rip

Steps from 2-6 can all be combined into:
r $t1=rip; r rip=hv+0x347B00; gh; r rcx; gh; r rip = $t1

What now

Hopefully, this post made things a bit clearer about hypercall handling on Hyper-V's own side and about how it's possible to read VMCS values using Windbg. It's also clear that there are a lot more details to all of this, so I will reference a couple of additional resources that can help with reversing or with generally understanding how hypervisors work in such places. I will also try to add some new tools to this series' GitHub repo, to make VMCS reading more comfortable.

Resources:
  • https://rayanfam.com/topics/hypervisor-from-scratch-part-5/
    Sina & Shahriar's hypervisor series is an excellent read overall, but regarding VM exit handling and similar topics, part 5 is the most valuable.
  • "Intel® 64 and IA-32 Architectures Software Developer's Manual"
    Well... to understand as much as possible, you have to dig through this a bit. Volume 3C is probably the most reasonable place for hypervisor-related topics.

Sunday, May 10, 2020

Hyper-V #0x1 - Hypercalls part 1

Hypercalls logic

So let's get going with slightly more exciting things than the simple setup, and let's start with hypercalls. Hypercalls can be thought of as syscalls, but not from userland -> kernel; rather kernel -> hypervisor. In essence, they are precisely that, but as always, the devil is in the details.

NB: All my descriptions are based on the x64 architecture. The main logic on x86 is similar, but of course, there are differences in register names, and in many cases, x86 needs 2 registers to do the same thing that one does in x64.

The kernel triggers a hypercall with the vmcall opcode (on AMD CPUs, the opcode is vmmcall), which transfers execution to the hypervisor. Now there is the question of the parameters that are passed with the hypercall. This is actually well described by Microsoft in the "Hypervisor Top-Level Functional Specification" document [https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs] in chapter 3. But I will try to cover it in smaller pieces:


  • Every hypercall has 3 "parameters" (besides extended fast hypercalls, which kind of have a lot more - those will be covered separately)
  • The first parameter (in RCX, of course) contains the primary information about the hypercall. It's a 64-bit value with the following structure:


    The description of each field is as follows:


    All in all, the first 2 critical fields are "Call Code" and "Fast". The "Call Code" is the name/number/code of the hypercall that is triggered in the hypervisor. So, from the previous blog post, the last example for hypercall 0xB meant that the handler is called when this "Call Code" is 0xB.
    The second relevant field is "Fast", which defines how the parameters are passed to the hypervisor.
    The rest of the fields are not unimportant, of course, but will not be covered for now.
  • The second parameter (RDX) can contain one of two things, depending on the "Fast" field value in the first parameter:

      0) This means a slow hypercall (or at least not a "fast" one): the second parameter is a pointer to the actual input parameter buffer. One important thing is that the pointer is not a virtual memory pointer but a pointer to physical memory (only from the VM's point of view - it's actually a Guest Physical Address that is translated via SLAT, but from the kernel's point of view it's still physical memory).
      1) This means a fast hypercall: no memory pointers etc. are passed (well, it still CAN be a pointer, but that would be silly), and the second parameter is the first half of the overall input to the hypercall.
  • The third parameter  (R8) can also contain one of two things, depending on the "Fast" field value in the first parameter:

      0) The third parameter is a pointer to the actual output parameter buffer. This is also a pointer to physical (not really) memory.
      1) The third parameter is the second part of the overall input to the hypercall.
  • So in the case of a "Fast" hypercall, the input size for the hypercall is 128 bits (RDX+R8), and there is no output buffer/value (besides the general one that is covered next). In the case of a "Slow" hypercall, there are both input and output buffers, both passed to the hypervisor as pointers to physical memory.
  • Value directly returned by the hypercall (RAX) is a 64-bit value and also has a structure:



    In most cases, the only important part is the "Result" field, which returns 0 if the hypercall succeeded or an error number if it failed.
Additional details:
  • Input/output pointers to physical memory have to point to the beginning of a page.
  • If the input/output buffers are larger than one page, then the pages of the multi-page buffer obviously have to follow each other.
  • Some hypercalls can not only return an error code but can also cause an exception to be triggered in the kernel during execution. So if you are fuzzing hypercalls, add a __try/__except - otherwise, BSOD.
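To make the layout of the input value concrete, here is a minimal sketch of extracting the main fields from the 64-bit value with plain shifts and masks (the bit positions mirror the ones used for the Windbg breakpoints in the monitoring section; the helper names are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Field extraction from the 64-bit hypercall input value (RCX). */
static uint16_t hc_call_code(uint64_t in) { return (uint16_t)(in & 0xFFFF); }       /* bits 0-15  */
static int      hc_fast(uint64_t in)      { return (int)((in >> 16) & 0x1); }       /* bit 16     */
static uint16_t hc_rep_count(uint64_t in) { return (uint16_t)((in >> 32) & 0xFFF); }/* bits 32-43 */
static uint16_t hc_rep_start(uint64_t in) { return (uint16_t)((in >> 48) & 0xFFF); }/* bits 48-59 */

/* Build an input value for a simple (non-rep) hypercall. */
static uint64_t hc_build(uint16_t call_code, int fast)
{
    return (uint64_t)call_code | ((uint64_t)(fast & 1) << 16);
}
```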

Making hypercall

Alex Ionescu had an excellent blog post about this, but it's currently down, so I will add my own version. If he restores it in the future, definitely give it a read!!!

But anyway, let's make a hypercall ourselves. First, you need code execution in the kernel - it's not possible to make hypercalls directly from userland. The most natural solution is to write a driver and then use it as a proxy. This way, you can write the fuzzing (or any other) logic in userland.

I will not cover the basics of writing Windows drivers - mainly because I have never been a driver developer and really don't know much about it, and also because there are many much better tutorials and guides.

But I believe I can still cover the "making a hypercall" part of the driver. To describe this better, I will first break the functionality into 2 pieces and then describe the implementation of each:
  1. Making hypercall
  2. Preparing input/output buffers

1. Making hypercall

While it is entirely possible to just add vmcall/vmmcall opcodes to your driver to trigger the hypercalls, it is not necessary. The Windows kernel exports a function called HvlInvokeHypercall that is a wrapper around the call to the nt!HvcallCodeVa trampoline. This function is not included in any header files (as far as I know) and should be declared something like this when used:

extern "C" NTSYSAPI ULONGLONG NTAPI HvlInvokeHypercall(ULONGLONG InputValue, ULONGLONG InputPa, ULONGLONG OutputPa);

Of course, the return type and the first parameter's type should be replaced with types that allow easier access to the sub-elements. For example:

typedef struct _HV_X64_HYPERCALL_INPUT
{
    UINT32 callCode : 16;
    UINT32 Fast : 1;
    UINT32 varHdrSize : 9;
    UINT32 dontCare1 : 5;
    UINT32 isNested : 1;
    UINT32 repCount : 12;
    UINT32 dontCare2 : 4;
    UINT32 repStart : 12;
    UINT32 dontCare3 : 4;
} HV_X64_HYPERCALL_INPUT, *PHV_X64_HYPERCALL_INPUT;

typedef struct _HV_X64_HYPERCALL_OUTPUT
{
    HV_STATUS result;
    UINT16 dontCare1;
    UINT32 repsCompleted : 12;
    UINT32 dontCare2 : 20;
} HV_X64_HYPERCALL_OUTPUT, *PHV_X64_HYPERCALL_OUTPUT;

extern "C" NTSYSAPI HV_X64_HYPERCALL_OUTPUT NTAPI HvlInvokeHypercall(HV_X64_HYPERCALL_INPUT InputValue, ULONGLONG InputPa, ULONGLONG OutputPa);

After this, the fast hypercall (not the extended one) can be made simply by calling this function with almost no additional effort.

But of course, this is not enough, since we need buffers and their physical addresses for "slow" hypercalls.


2. Preparing input/output buffers

This is in many parts based on Alex Ionescu's blog post. I used to do it a bit differently, but his approach is better. 

So first, we allocate some pages for the buffer (input or output, it doesn't matter) with the function MmAllocatePartitionNodePagesForMdlEx. This function is not publicly documented but allows us to get exactly the kind of allocation that is needed. Its definition is:

PMDL MmAllocatePartitionNodePagesForMdlEx (
    _In_ PHYSICAL_ADDRESS LowAddress,
    _In_ PHYSICAL_ADDRESS HighAddress,
    _In_ PHYSICAL_ADDRESS SkipBytes,
    _In_ SIZE_T TotalBytes,
    _In_ MEMORY_CACHING_TYPE CacheType,
    _In_ MM_NODE_NUMBER_ZERO_BASED IdealNode,
    _In_ ULONG Flags,
    _In_opt_ PVOID PartitionObject
    );

  • The first three parameters describe the range in which to look for the allocation (if you want more info, search for the MiAllocatePagesForMdl function - its first parameters have the same underlying meaning, but there is more information about them). We just take the range 0 to ~0 and also skip 0.
  • TotalBytes is what it sounds like - how large the allocation should be. Round the allocation size up to the page size.
  • CacheType is also what it sounds like (take a look at https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/ne-wdm-_memory_caching_type if needed). Just use MmCached for ease :)
  • IdealNode contains the NUMA node number from which the allocation should be made. I doubt this will matter for almost anyone (if it does for you, I really would like to know your machine setup), but just in case, use the current CPU's NUMA node via the KeGetCurrentNodeNumber() function.
  • Flags determine some additional conditions. A couple of them are quite mandatory. First, MM_ALLOCATE_REQUIRE_CONTIGUOUS_CHUNKS does what the name says and forces the pages to follow each other instead of being scattered around. The MM_ALLOCATE_FULLY_REQUIRED flag makes sure the function only returns success when the entire allocation succeeds. The final flag, MM_DONT_ZERO_ALLOCATION, just helps speed things up, so the OS will not fill the allocation with zeros before returning.
  • PartitionObject - the memory partition from which the pages are taken. Since it's optional, NULL works well.
So in the end, the call can be something like this:

PMDL pmdl;
PHYSICAL_ADDRESS low, high;
low.QuadPart = 0;
high.QuadPart = ~0ULL;
pmdl = MmAllocatePartitionNodePagesForMdlEx(low, high, low, ROUND_TO_PAGES(size), MmCached, KeGetCurrentNodeNumber(), MM_ALLOCATE_REQUIRE_CONTIGUOUS_CHUNKS | MM_ALLOCATE_FULLY_REQUIRED | MM_DONT_ZERO_ALLOCATION, NULL);

Now, if pmdl is not NULL, the allocation worked, and we have an MDL structure that points to our buffer. To get a virtual memory pointer to it (so we can read/write), we use the typical MmGetSystemAddressForMdlSafe function (the MdlMappingNoExecute flag is reasonable because we do not need to execute code from there). And to get the physical address, there is the MmGetMdlPfnArray function (just don't forget to shift by the page size). All in all, the code for both (without error handling) is:

vmPtr = MmGetSystemAddressForMdlSafe(pmdl, MdlMappingNoExecute);
pmPtr = *MmGetMdlPfnArray(pmdl) << PAGE_SHIFT;

Now you can use vmPtr for reading/writing to/from the buffer and pmPtr as hypercall parameter/-s.
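As a tiny sanity check of the PFN shift and of the page-alignment requirement mentioned earlier (assuming the standard 4 KiB pages, i.e. PAGE_SHIFT = 12):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* First PFN of an MDL -> physical address of the buffer start. */
static uint64_t pfn_to_pa(uint64_t pfn)
{
    return pfn << PAGE_SHIFT;
}

/* Hypercall input/output pointers must point to the beginning of a page. */
static int is_page_aligned(uint64_t pa)
{
    return (pa & (PAGE_SIZE - 1)) == 0;
}
```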


Monitoring Hypercalls

As said in the last blog post, it's easy to monitor hypercalls via Windbg. You just have to use hardware breakpoints, not software ones (at least on the trampoline). Now let's build on that and see how it's easiest to figure out what common hypercalls might do, based on the public symbol information for the kernel and the already-covered knowledge about hypercalls.


Breaking on hypercalls and getting basic info

I use the same breakpoint as in the last post, but I add some extra info that will be displayed on every hypercall:

ba e 1 poi(nt!HvcallCodeVa) ".printf \"Hypercall 0x%X\\n  fast: 0x%X\\n\", rcx & 0xFFFF, (rcx >> 16) & 0x1; gh"

This simple breakpoint will now display the hypercall's number and its Fast flag. We can extend it to display even more, to see how often the other fields are used:

ba e 1 poi(nt!HvcallCodeVa) ".printf \"Hypercall 0x%X\\n  Fast: 0x%X\\n  Variable header size: 0x%X\\n  Is nested: 0x%X\\n  Rep Count: 0x%X\\n  Rep Start Index: 0x%X\\n\", rcx & 0xFFFF, (rcx >> 16) & 0x1, (rcx >> 17) & 0x1FF, (rcx >> 26) & 0x1, (rcx >> 32) & 0xFFF, (rcx >> 48) & 0xFFF; gh"

Now, to generate some guesses about the meaning of some hypercalls, we should load symbols and also display call stacks. This can give some preliminary information.

ba e 1 poi(nt!HvcallCodeVa) ".printf \"Hypercall 0x%X\\n  Fast: 0x%X\\n  Variable header size: 0x%X\\n  Is nested: 0x%X\\n  Rep Count: 0x%X\\n  Rep Start Index: 0x%X\\n\", rcx & 0xFFFF, (rcx >> 16) & 0x1, (rcx >> 17) & 0x1FF, (rcx >> 26) & 0x1, (rcx >> 32) & 0xFFF, (rcx >> 48) & 0xFFF; k; gh"

Now let's take one example with hypercall 0x4E:



Since nt!HvcallInitiateHypercall is the same as HvlInvokeHypercall, we know that it's merely a wrapper around the hypercall trampoline. But it's called by the function nt!HvlpCreateRootVirtualProcessor, so it can be deduced that this hypercall executes the functionality in the hypervisor that creates (or is at least connected to) the root virtual processor.

With a lot of hypercalls, there is the possibility to understand their use via the kernel. There is also a good chance that by reverse-engineering these kernel functions, you can figure out the structure of the input and output buffers, which makes reversing the hypervisor much easier. Public symbols give an excellent head start on such things.


Annoying non-stop hypercalls

If you let Windows boot up and then try to use this approach, you quite quickly run into a problem. Windows regularly makes hypercalls on its own, and it's quite hard to do anything on the system to trigger some new type of hypercall (handling guest VMs, for example).



You can, of course, add conditions to the breakpoints, but these are checked only after the debugger has already stopped execution, so this is still not usable. My approach here has been a bit weird, and I don't know how many others use the same method, but it works well. I look for a block of memory in the Windows kernel that is not used but is already executable (the most accessible location is the end of the .text segment of the Windows kernel - in the current version, there is almost half a page of zeroed-out executable memory, and it's always there). I then write some small assembly (https://defuse.ca/online-x86-assembler.htm is a godsend for doing this without any other tools - shout out to https://twitter.com/defusesec) that redirects some hypercalls directly to the trampoline and only breaks on the others. Finally, I redirect nt!HvcallCodeVa to this small assembly stub and can now ignore the annoying hypercalls.

To do this step-by-step (in reality, it makes sense to automate it):
  1. Find out the location of the trampoline:
    1: kd> dq nt!HvcallCodeVa L1
    fffff801`4ed73328  fffff801`4dfc0000
  2. Write assembly code that skips some of the annoying ones (this example skips fast hypercalls):
    test rcx, 0x10000             ; check the "Fast" bit (bit 16)
    jnz skip                      ; fast hypercall -> go straight to the trampoline
    int 3                         ; slow hypercall -> break into the debugger
    skip:
    mov rax, 0xfffff8014dfc0000   ; address of the original trampoline
    jmp rax
  3. Inject compiled machine code to nt+0x350C00 (in the current version this location is executable but not used)
    eb nt+0x350C00 0x48 0xF7 0xC1 0x00 0x00 0x01 0x00 0x75 0x01 0xCC 0x48 0xB8 0x00 0x00 0xFC 0x4D 0x01 0xF8 0xFF 0xFF 0xFF 0xE0
  4. Overwrite nt!HvcallCodeVa
    eq nt!HvcallCodeVa nt+0x350C00
  5. Now Windbg breaks when a slow hypercall is made and skips the fast ones.
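Generating this stub can easily be automated. Here is a rough sketch in C that emits the same 22 bytes for an arbitrary trampoline address (the opcode bytes are the ones from step 3; `make_filter_stub` is a made-up helper name, and writing the immediate with memcpy assumes a little-endian host, which holds on x64):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Emit: test rcx, 0x10000 / jnz skip / int3 / skip: mov rax, imm64 / jmp rax
 * `out` must have room for 22 bytes; returns the stub length. */
static size_t make_filter_stub(uint8_t *out, uint64_t trampoline)
{
    static const uint8_t head[] = {
        0x48, 0xF7, 0xC1, 0x00, 0x00, 0x01, 0x00, /* test rcx, 0x10000 (Fast bit) */
        0x75, 0x01,                               /* jnz skip                     */
        0xCC,                                     /* int3 - break on slow calls   */
        0x48, 0xB8                                /* mov rax, imm64               */
    };
    static const uint8_t tail[] = { 0xFF, 0xE0 }; /* jmp rax */
    memcpy(out, head, sizeof(head));
    memcpy(out + sizeof(head), &trampoline, sizeof(trampoline)); /* little-endian imm64 */
    memcpy(out + sizeof(head) + sizeof(trampoline), tail, sizeof(tail));
    return sizeof(head) + sizeof(trampoline) + sizeof(tail);
}
```

Feeding it the trampoline address from step 1 reproduces exactly the byte sequence injected in step 3.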


Conclusion

Hopefully, this information opens up some options or ideas for people who are starting hypercall research. In the next couple of weeks, I will also add some scripts (or write a separate plugin - I haven't decided yet) to display some of this information more comfortably and to automatically generate machine code that helps with filtering. I will also continue this series, next time focusing on the hypervisor side of the hypercalls. Take care!


Sunday, May 3, 2020

Hyper-V #0x0 - Research setup

There has been some interest in and questions about starting Hyper-V research, so I thought of writing a little about it (also, it's been a long time since the last blog post). I will probably do it in quite small pieces.

Also, I will hopefully be covering the topic at INFILTRATE (https://infiltratecon.com/) in October, so I will not go deep into my research for now. But I hope to cover a lot of things needed to get going for anyone starting.

I will start with the most straightforward topic - my research environment and how to do the simplest things when starting. This information is available in MULTIPLE other places, but since my goal is to eventually have an entire series on Hyper-V, I think it makes sense to start from the very, very basics.

My setup

My main machine is running Linux, so all my Windows-based tools and targets run in virtual machines. I use VMware Workstation and I'm quite happy with it. In the past, I used VirtualBox, but things really seem to work better and faster with VMware, plus I can easily use the same tool if I need to connect to some ESX or ESXi servers. It's also effortless to set up connections between VMs for kernel/hypervisor debugging purposes via emulated serial ports or virtual networks (this is not VMware-specific - all the other main virtualization tools also make it easy).

For Hyper-V research, I use 3 VMs:
  1. Debugging machine that contains Windbg, IDA, and development tools. 
  2. Target machine with the latest Windows build and everything set up so it could be debugged.
  3. Target machine with the latest Windows Slow ring build and everything set up so it can be debugged. "Why?" you might ask. Mainly because Microsoft has this sentence on the Hyper-V bug bounty page: "If you are submitting a vulnerability for Hyper-V on Windows 10, then the vulnerability must reproduce on the recent WIP slow builds to qualify for a bounty" [https://www.microsoft.com/en-us/msrc/bounty-hyper-v]


Setting everything up

I'm going to explain everything from a multiple-VMs & VMware Workstation point of view, but almost the same approach works with other virtualization tools. Also, when your main OS is Windows, you don't need a separate debugging VM but can use the host OS for that (quite a bit faster, too).

First things

  1. Install Windows on debugging VM and Windbg
  2. Install Windows on target VM (I would start with only one VM)
  3. Update all that is needed & install any other thing you like
  4. In the virtualization tool, enable Intel VT-x/EPT (or AMD-V/RVI when running on an AMD machine) and IOMMU virtualization on the target machine. This is needed to actually run Hyper-V. In VMware, it's in the VM settings, Hardware tab, under the "Processors" config.
  5. Since you almost certainly want to debug the hypervisor AND the kernel, you need two different channels, so add a serial port between the VMs and a virtual network.
  6. Activate Hyper-V on target VM via "Windows features"

Target VM conf

Because I use a lot of snapshotting on my target machine, I prefer serial-based communication. NET-based debugging is a lot faster, but there is some session management in NET-based connections; because of that, I have had a lot of problems where Windbg and the target can't talk to each other after a VM revert. The NET-based link has also had stability problems.
In the case of the hypervisor, however, serial-based communication does not seem to work. So I usually use serial for kernel debugging and NET for the hypervisor.

Setting the debugger for the kernel (over serial COM1):

 bcdedit /dbgsettings serial debugport:1 baudrate:115200
 bcdedit /debug on

Setting the debugger for the hypervisor (over the network - write down the key generated by the first command):

 bcdedit /hypervisorsettings NET HOSTIP:???.???.???.??? PORT:50000
 bcdedit /set hypervisordebug on
 bcdedit /set hypervisorlaunchtype auto

Debuggers running

When you now set up 2 Windbg sessions, one for each connection, you should see something like this:


If the hypervisor connection does not happen, some things to check:
  • The firewall does not block windbg from opening the port
  • Both VMs can actually reach each other over the network (maybe the virtual network is configured in a way that VMs can't access other VMs, etc.)
  • That Hyper-V is actually installed. If you enabled Hyper-V in "Windows features" BEFORE enabling the VM's virtualization features, then Windows only installs the Hyper-V client components but not the hypervisor itself.

Setup works

If both debuggers get a connection, then you have everything running. You can run some simple commands in the hypervisor debugger to test it. For example, ?hv to get the base address of the hypervisor.
From that moment on, things get a bit trickier with the hypervisor. Microsoft has not made the hypervisor debug symbols public, and the hvexts.dll extension for Windbg is available only to MS partners. So from now on, everything in the hypervisor has to be found by reversing the code. Luckily, it is not as big a piece of code as the kernel itself, and a lot of kernel-side components (used to talk to the hypervisor or via the hypervisor) do have symbol information.

I will quickly show one example with the hypercalls that I will cover more deeply in the next blog post. 

Hypercall from the kernel side

Hypercalls are made by the kernel via a location defined by the pointer at nt!HvcallCodeVa. This is likely because the opcodes for switching to the hypervisor differ between Intel and AMD. Now, if you take a look at the code referenced by that location, the result is:
1: kd> u poi(nt!HvcallCodeVa) L18
fffff804`0cdf0000 0f01c1          vmcall
fffff804`0cdf0003 c3              ret
fffff804`0cdf0004 8bc8            mov     ecx,eax
fffff804`0cdf0006 b811000000      mov     eax,11h
fffff804`0cdf000b 0f01c1          vmcall
fffff804`0cdf000e c3              ret
fffff804`0cdf000f 488bc1          mov     rax,rcx
fffff804`0cdf0012 48c7c111000000  mov     rcx,11h
fffff804`0cdf0019 0f01c1          vmcall
fffff804`0cdf001c c3              ret
fffff804`0cdf001d 8bc8            mov     ecx,eax
fffff804`0cdf001f b812000000      mov     eax,12h
fffff804`0cdf0024 0f01c1          vmcall
fffff804`0cdf0027 c3              ret
fffff804`0cdf0028 488bc1          mov     rax,rcx
fffff804`0cdf002b 48c7c112000000  mov     rcx,12h
fffff804`0cdf0032 0f01c1          vmcall
fffff804`0cdf0035 c3              ret

In my case, the CPU is an Intel one, so the opcode for switching is vmcall. You can see multiple different vmcalls here; this will be covered in the future. What is important is that when some piece of code wants to make a hypercall, it almost always does it via such a trampoline. Also, the nt!HvcallInitiateHypercall function is often used (not always).

To intercept such calls, you want to stop execution right at the vmcall instruction, so the first reaction is to use the bp poi(nt!HvcallCodeVa) command in the kernel debugger. But this would be a mistake, because the kernel debugger also runs under the control of the hypervisor, and this area of code is protected by the hypervisor. So you can't write anything there, not even a 0xCC byte :)

The solution is luckily easy - hardware breakpoints are still allowed:
ba e 1 poi(nt!HvcallCodeVa)
or (depending on where you want to break)
ba r 8 nt!HvcallCodeVa

Now you can break and study how and from where such calls are made. For the overall structure and an excellent description of how to make hypercalls with a custom driver, I would highly recommend Alex Ionescu's blog posts about the Hyper-V bridge. But as of now, it seems that his blog is down. I do hope it comes back up soon and will still contain that post :)
If not, I will add that information to my own hypercall post, but I don't think I could explain it as well.

Hypercall from hypervisor side

Hypercalls are number-based, just like syscalls. This number, in turn, ends up being used as an index into an array of 0x18-byte-sized elements located at hv+0xC00000. The address of the handler function is in the first 8 bytes of each element. So, for example, to break into the handling function of hypercall 0xB, you could run this command in the hypervisor debugger: bp poi(hv+0xC00000+0xb*0x18)
This hypercall happens very often and should be good for testing.
This was very basic info on how to track hypercalls from both the kernel and hypervisor sides. I hope it helps anyone starting on this topic.


This is all for now; I will try to follow up soon with a post where I cover hypercalls a lot more deeply.

Friday, January 24, 2020

Let's continue with corpus distillation

My previous post was about the very beginnings of mutation-based fuzzing. Now I continue with a description of how I collect and filter the files that I use as a base set for mutations. The technique I use is most commonly called "corpus distillation." The toolset I use to implement it is written by myself and released (and updated along with this blog post) at https://github.com/FoxHex0ne/Rehepapp.
Now, I must mention that this toolset uses a somewhat alternative way to achieve code coverage monitoring, and it's not suitable for all situations. A more generic but slower method (for closed source, and only after the optimization phase) is binary instrumentation. Also, the Intel PT feature will probably allow even more generic and faster tools - I think this is the future, and at some point, I hope to get to use it for such things.

Corpus distillation

So let's start by describing what corpus distillation means in the current context. The first word, "corpus," is Latin for "body," but in many settings it means a broad collection of some kind of items (for example, a "text corpus" is an extensive collection of texts). In the current context, it means a large set of inputs (files). The "distillation" part of the term means that we will not use the entire corpus but a distilled subset of it. We get this subset by minimizing the corpus in such a way that the resulting set achieves the same code coverage as the original corpus.

To describe this more visually, let's imagine 10 inputs (INPUT0 to INPUT9) and a targeted program with 6 functionalities (F0 to F5) that inputs can trigger. Now let's describe which functionalities each input causes to be executed in the targeted program:
  • INPUT0 causes F1, F5
  • INPUT1 causes F1, F2
  • INPUT2 causes F1, F3, F5
  • INPUT3 causes F0, F1, F4
  • INPUT4 causes F0, F1, F2
  • INPUT5 causes F0, F2, F4
  • INPUT6 causes F1, F2
  • INPUT7 causes F1, F2
  • INPUT8 causes F1, F2
  • INPUT9 causes F1, F2

Of course, we could fuzz all the possible inputs, but that would be less effective. In the current case, half of the inputs only trigger functionalities F1 and F2, so by fuzzing the entire set, we would spend 50% of the time fuzzing almost the same things. This is not effective, especially if the original corpus contains hundreds of thousands or even millions of inputs. In such a case, the share of inputs covering only the most common functionality (for example, text in a doc file) and nothing more will not be 50% but much higher. To reduce such repetition, the "distillation" part comes in: we remove every element from the original corpus/set that we can without diminishing the coverage of the entire set. In the current example, we can minimize the set to just INPUT2 and INPUT5, because together they cover as much functionality as the original 10 inputs.
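The selection just described is a greedy set cover, and a toy version of it over the example inputs takes only a few lines. This is my own minimal sketch, not Rehepapp's actual minimizer:

```python
# Toy corpus distillation over the INPUT0-INPUT9 example: greedily pick
# inputs until their combined coverage matches the whole corpus.
coverage = {
    "INPUT0": {"F1", "F5"}, "INPUT1": {"F1", "F2"},
    "INPUT2": {"F1", "F3", "F5"}, "INPUT3": {"F0", "F1", "F4"},
    "INPUT4": {"F0", "F1", "F2"}, "INPUT5": {"F0", "F2", "F4"},
    "INPUT6": {"F1", "F2"}, "INPUT7": {"F1", "F2"},
    "INPUT8": {"F1", "F2"}, "INPUT9": {"F1", "F2"},
}

def distill(coverage):
    remaining = set().union(*coverage.values())  # functionality not yet covered
    chosen = []
    while remaining:
        # pick the input that adds the most not-yet-covered functionality
        best = max(sorted(coverage), key=lambda i: len(coverage[i] & remaining))
        chosen.append(best)
        remaining -= coverage[best]
    return chosen

print(distill(coverage))  # ['INPUT2', 'INPUT5']
```

The same greedy loop works unchanged when the sets contain basic block addresses instead of toy functionality labels.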

Now, in most programs, we can't separate functionality into neat units like this, and even if we could, the execution of the same functionality could differ based on the input. So we do code coverage analysis based on the basic blocks inside the targeted program. Basic blocks (https://en.wikipedia.org/wiki/Basic_block gives a longer/better description) are blocks of code with no branching inside, so if a block starts executing, it will continue until its end (unless an exception happens).

With a large corpus of inputs, we can record the basic block coverage that each of them causes in the targeted program and then distill (minimize) the input set to a smaller size that still has the same coverage. By doing this, we increase the effectiveness of the fuzzing.

How much can we reduce? For me, it has usually been over a 500x reduction of the original set. I can't recall exact numbers anymore, but I have typically had corpora of 1M to 3M files and minimized sets of 1500 to 3500 with the more common filetypes (pdf, doc, xls, ppt...).

Corpus distillation steps

As described, corpus distillation is quite a simple concept, but as with most things, the hardship rests in the implementation. To perform the entire corpus distillation and generate a fuzzing base set for mutations, you have to go through these steps:
  1. Collecting the corpus.
  2. Recording the coverage for every corpus element.
  3. Minimizing the corpus based on coverage.

Step 1: Collecting the corpus

You need a lot of proper inputs, and in the current case, that means files. So where do you go if you need to find something? Google, of course!
Let's say you want a huge corpus of pdf files. You google "filetype:pdf" and Google responds with nothing. So you google "filetype:pdf something" and Google says:


Great, right? Just download some part of these 195M files and you are golden? Not that easy: while Google says it has that many results, it does not want to give you all of them:

So you can't get them that easily. My solution has been to do alphabetic searches: I start with "filetype:pdf A", "filetype:pdf B", ... "filetype:pdf Z", "filetype:pdf AA", "filetype:pdf AB", and so on.

Now, since nobody in their right mind would do such searches manually, you have to automate this. And that is where another roadblock gets in the way - Google does not like you automating such searches. You could use the "Custom Search JSON API", but it has somewhat substantial limitations. So what are the options, unless you want to teach a deep neural network to solve Google's "are you human" challenges?
I simply avoided the protection mechanism by making sure that any two searches have at least 48 seconds between them (this value has changed a couple of times and might depend on other things as well). But this also heavily limited the number of searches that I could do in a day (only 1800). And while this is enough if you have a couple of weeks and want less than a million files, it's still not the best.
The way around this, in turn, is to have multiple IPs from which to do such searches. You might have a lot of friends who can run small scripts on some of their machines (if they trust you). Or you could rent a lot of the cheapest VMs with public IPs from some cloud. F1-micro from Google Cloud used to work well for this (dirt cheap and had a public IP by default).
NB: I must say that I'm not 100% sure that renting VMs from Google to bypass Google's own search rate limitation is allowed under the terms of service. So I'm not recommending doing it; I'm only saying that it works, and you will still be paying Google for it, so morally it's not that bad, I hope...
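The alphabetic query scheme and the pacing described above can be sketched in a few lines. Only the query generation is shown here; the actual Google scraping lives in Rehepapp's Google.py and is more involved:

```python
# Sketch: generate the alphabetic search queries "filetype:pdf A" ...
# "filetype:pdf Z", "filetype:pdf AA", ... The ~48 s pacing between
# searches (to stay under Google's rate limit) is shown as a comment.
import itertools
import string

def search_queries(filetype: str):
    """Yield search phrases with ever-longer uppercase suffixes."""
    for length in itertools.count(1):
        for letters in itertools.product(string.ascii_uppercase, repeat=length):
            yield f"filetype:{filetype} {''.join(letters)}"

for query in itertools.islice(search_queries("pdf"), 27):
    print(query)
    # time.sleep(48)  # pacing that kept my scraper under the rate limit
```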

Additional notes for step 1

  • The code for doing searches from Google can be found at https://github.com/FoxHex0ne/Rehepapp/blob/master/Scripts/Libs/Google.py.
  • In the collection phase, don't collect the files themselves, only links to them - especially if you are collecting via multiple machines. You can still download them to check their magic signatures or other aspects, but don't store them. It's much simpler to later merge the URLs, or gather them right away into a single database. I usually do exactly that and store the hash and size along with the URL.
  • When distributing the downloading tasks to multiple machines, always use search ranges. For example, one machine does all the searches with phrases that start with "A", the second with "B", etc. If you use a central server that organizes it all, things can be fine-tuned even further.
  • There will be a lot of collisions in the search results, both from the same URLs appearing multiple times and from the same files being hosted at different URLs. This ought to be accounted for.
  • If you decide not to download entire files when collecting the URLs, still download the first couple of bytes to check the filetype magic value. Google returns results based on the extension, which means some of the returned files are not actually of the correct type.
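The magic-value check from the last note can look like this. The signatures below are standard file magics; the partial download is sketched with urllib and a Range header (servers that ignore Range simply send more bytes, which is fine):

```python
# Sketch: fetch only the first bytes of a URL and compare them against
# the expected file signature, as described in the notes above.
import urllib.request

MAGICS = {
    "pdf": b"%PDF",
    "zip": b"PK\x03\x04",        # docx/xlsx/pptx are ZIP containers
    "doc": b"\xd0\xcf\x11\xe0",  # legacy OLE2 Office files
}

def has_magic(first_bytes: bytes, filetype: str) -> bool:
    return first_bytes.startswith(MAGICS[filetype])

def url_has_magic(url: str, filetype: str) -> bool:
    req = urllib.request.Request(url, headers={"Range": "bytes=0-15"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return has_magic(resp.read(16), filetype)

print(has_magic(b"%PDF-1.7\n", "pdf"))   # True
print(has_magic(b"<html><body>", "pdf"))  # False
```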

Step 2: Calculating coverage

As mentioned at the beginning, my method of calculating code coverage differs from most other tools. The fastest way to calculate code coverage is to inject tracing code into the executable's binary while compiling (one example is the SanitizerCoverage feature). This enables tracing not only at the basic block level but at the edge level - meaning coverage is measured based on the branching path to a basic block, not the block alone. This is almost always the best and fastest solution, but it can't be used on closed-source software.
For closed-source software, binary instrumentation is the most common method, and in addition, Intel PT could be even better. Both of them are much larger topics, and I will not try to describe them here. I will only say that binary instrumentation is, in many cases, quite slow (not always), and I think the future is in Intel PT or other hardware-supported features that might pop up.

My own solution has the following steps:
  1. Preparation
    1. Use IDA Pro or a similar tool to record the locations of all the basic blocks in the executable and the libraries it uses.
    2. Inside the analyzed files, insert 0xCC (software breakpoint) at the beginning of every basic block.
  2. Main loop
    1. Start the targeted program as debuggable.
    2. Wait for a software breakpoint to be triggered (if none has happened for X seconds, stop the program).
    3. When a software breakpoint happens, record the location, restore the original value of the byte, and continue execution (don't forget to decrement the instruction pointer).
    4. GOTO 2.2
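The plant-and-restore bookkeeping from these steps can be sketched on a plain byte buffer. The real Tracer.exe does this through the Windows debugging API on a live process; this is only a simulation of the idea:

```python
# Simulate 0xCC planting/restoring on a byte buffer instead of live
# process memory. Block offsets stand in for the IDA-generated list.
code = bytearray(b"\x48\x89\x5c\x24\x08\x48\x83\xec\x20\x90\x90\xc3")
basic_blocks = [0, 5, 9]   # offsets of basic block starts
originals = {}

# Preparation step 1.2: plant software breakpoints at block starts
for off in basic_blocks:
    originals[off] = code[off]
    code[off] = 0xCC

# Main loop steps 2.2-2.3: on a hit, record it and restore the byte
hits = []
def on_breakpoint(off):
    hits.append(off)            # record coverage
    code[off] = originals[off]  # restore the original byte
    # (in the real debugger: decrement RIP and continue execution)

for off in basic_blocks:        # pretend every block was executed once
    on_breakpoint(off)

print(sorted(hits))             # [0, 5, 9]; buffer is back to original
```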
Such a solution means that every basic block that is executed will trigger a software breakpoint once per execution. With every triggered breakpoint, the targeted program stops, context is switched to the debugger, information is recorded, target program memory is written, and the execution context is switched back. This is a LOT of extra CPU cycles for recording a single basic block, even though it happens only once per basic block during a single execution. To reduce this overhead, I use the following approach:
  1. Download 100 inputs(files)
  2. Run code tracing with these 100 files and all the basic block triggers
  3. Find 5 files that have the best coverage out of these 100
  4. Remove all the basic blocks triggers that were covered by these 5
  5. Download 1000 new inputs 
  6. Run code tracing with these 1000 files and the remaining basic block triggers
  7. Find 45 files that have the best coverage out of these 1000
  8. Remove all the basic blocks triggers that were covered by these 45
At the end of such filtering, you are left with 50 files that you can keep for your final set, plus a reduced basic block list. For most software, this is enough to get practically real-time execution for almost any input used later. The only exceptions are inputs that trigger rare functionality, and hitting those is already a good thing.
Just to be sure, I usually do another such filtering round with 10000 new files, from which I take 100. You could also keep going further.


Additional notes for step 2

  • When I wrote "find X files that have the best coverage out of these Y", I did not mean that you just choose the X best right away. Instead, you pick the single best one, dismiss the basic blocks covered by it, then select the next best without counting the blocks already covered by the first, and so on.
  • The more filtering steps you do, the faster the later executions are. But it's not reasonable to go crazy with it, because it could reduce the quality of your final set (too many similar inputs).

Step 3: Minimizing

At this point, you should have a vast corpus and coverage information for each element in it. If you have used my tools and my approach, then you already have some files in the final set as well. What is left is to calculate the rest of it. This seems easy, but in actuality, there is no way to find the BEST solution without trying all the options (it is the set cover problem, which is NP-hard). So the easiest, and in my opinion good enough, solution is this loop:

  1. Find one input with the best coverage from the set
  2. Remove all basic blocks from all the coverages that are covered by this one input.
  3. Put this one input into the final set.
  4. If there are any inputs left that cover some basic blocks not yet removed then go to step 1
  5. All done
This does not give the best solution, of course, but in my experience a good enough one.


Additional notes for step 3

  • Depending on the size of the corpus and how many basic blocks there are per input on average, this calculation could take a very long time. In the case of millions of inputs, even with the optimized list of basic blocks (using the methods described in step 2), it could take days.
  • When coverage info is stored in a database, writing the minimizer becomes a bit easier. It might not be as fast as a finely tuned tool written just for that, but a lot of time can be saved on development. And since you will calculate coverage only once per input type, a 2-day vs. 3-day difference is not so terrible.
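As a minimal sketch of what database-backed coverage storage might look like (the schema here is invented for illustration; Rehepapp's own format differs):

```python
# Minimal sketch of storing per-input coverage in SQLite and querying
# the corpus-wide unique block count (schema is my own illustration).
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE inputs(id INTEGER PRIMARY KEY, url TEXT, hash TEXT);
    CREATE TABLE coverage(input_id INTEGER, block INTEGER);
""")

def record(url, sha, blocks):
    cur = db.execute("INSERT INTO inputs(url, hash) VALUES (?, ?)", (url, sha))
    db.executemany("INSERT INTO coverage VALUES (?, ?)",
                   [(cur.lastrowid, b) for b in blocks])

record("http://example.com/a.pdf", "aa11", [0x1000, 0x1040])
record("http://example.com/b.pdf", "bb22", [0x1000, 0x2000, 0x2080])

# unique blocks covered by the whole corpus
(total,) = db.execute("SELECT COUNT(DISTINCT block) FROM coverage").fetchone()
print(total)  # 4
```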

Rehepapp toolset


So I have made my own toolset for such coverage analysis. It still has a lot of issues that I will try to fix as time goes on, but this is sadly not a priority for now. Anyway, I will now list all the tools in this toolset and what they can be used for:
  • Tracer.exe - This does actual coverage recording. It takes as an input a list of basic blocks (generated by the IDA script) and records all basic blocks that are reached.
  • Analyzer.exe + IDA_GetBasicBlocks.py - They can be used to create a list of basic blocks from exe/dll files of some application.
  • CoverageCommon.exe - Tool to find common basic blocks covered by multiple coverage recordings.
  • CoverageMain.exe - When combined with Tracer.exe, it can be used to record coverage information for multiple files.
  • CoverageMinimizer.exe - Find X best coverages from the list of coverages.
  • CoveragePlanter.exe - Plants software breakpoints to the exe/dll files based on basic block file. Has to be run before running Tracer.exe.
  • CoverageReducer.exe - Removes basic blocks from the existing list based on coverage recordings.
  • Server.jar - HTTP server that can be used for distributed file collecting and coverage testing.
  • BotCollector.py - Used for distributed file collecting.
  • BotCoverage.py - Used for distributed coverage testing.

I must say that it's a bit of a long story to show the entire typical usage of Rehepapp here, so I will probably do something a little different and try to create a demo video on how to best use my tools. If so, I will add a link to it here later.

Tuesday, January 7, 2020

Let's get things going with basics of file parsers fuzzing

So, one of my 2020 New Year's promises was to actually start writing in my blog. At first, I thought about writing about my current Hyper-V research, but because there has been some interest from people just entering the fuzzing scene, I decided to write about that instead. So let's get going; I will not start from the extreme basics, but close to it. I will also connect the explanations and descriptions with the newer version of my own fuzzer, which I released last week ( https://github.com/FoxHex0ne/Vanapagan ).


Primary loop

So let us talk about fuzzing. I will start with a description of the primary fuzzing loop and how I have used it. In essence, my fuzzing loop for file parser fuzzing has been something like this (presented as minimized Python code for additional weirdness):

try:
    while True:
        clearRegistriesAndTemporaryFiles()
        originalFile = takeRandomFileFromCorrectFilesSet()
        (mutatedFile, mutationsInfo) = mutateFile(originalFile)
        crash = executeTargetedProgramWithInputFile(mutatedFile)
        if crash is not None:
            logCrash(originalFile, mutatedFile, mutationsInfo, crash)
except:
    restart()

This is an overly simplified version of course, but I believe this is enough for an explanation. So let's start breaking everything apart:

  • Outer try-except block and restart() 
    This is to catch all kinds of crap that might sooner or later happen while running a fuzzing loop for a long time. Since I'm running a lot of fuzzing VMs, I usually just prefer that the machine restarts if something weird happens. After the restart, the fuzzing simply continues. During the debugging phase, I disable restarting and just re-raise the exception to see what happened.
  • while True
    Well... it's a fuzzing loop you know :)
  • clearRegistriesAndTemporaryFiles()
    This is the first part that becomes a bit more interesting. It's not needed with all targets, but still with quite a lot of them. This function clears up temporary files or registry values that would otherwise influence the fuzzing process. For example, in the case of the main MS Office programs (Word, Excel, PowerPoint), safe mode can be activated when the previous loop killed the main process (some registry values have to be cleared to avoid this). Or in the case of Foxit PDF Reader, a temporary file is created in the temp directory, and it is of course not deleted when the reader is killed off by the fuzzer (over time, this would result in larger disk usage and a slowdown of the reader's startup).
  • originalFile = takeRandomFileFromCorrectFilesSet()
    I'm a big fan of mutation-based fuzzing. This is because I'm too lazy to write smart fuzzers for complex protocols and specifications (I also have a full-time job and university stuff currently, so I can redirect the blame to lack of time). So what I do is choose a set of correct inputs/files as a basis. In this line, I just select a random file from such a set.
  • (mutatedFile, mutationsInfo) = mutateFile(originalFile)
    Now comes the mutation part. No matter what mutators you use (I have lately mainly used bit flipping and special values - 0x00, 0xFF, 0xFE, 0x80, 0x7F, etc.), it's useful to also keep track of the mutations that were made. Of course, mutations can be deduced by comparing the original and mutated files, but this takes extra effort. Also, in some cases, the file is changed when opened by the targeted program (Excel & Word), and this creates additional confusion. All in all, as a result of the mutation functionality, you are left with the mutated file and the mutation info.
  • crash = executeTargetedProgramWithInputFile(mutatedFile)
    In this part, the targeted executable is actually executed, and the mutated file is provided to it as an input. This part has many moving parts and has to do multiple things. First, it should function as a debugger (or control an actual debugger) to detect crashes inside the process. For example, if an invalid memory read/write is detected, this functionality should return information about the crash (location, register values, stack, etc.). Also, if the CPU usage of the process falls to 0%, it makes sense to stop the process. Besides that, there should be an overall time limit for a single process execution.
  • if crash not null:
                logCrash(originalFile, mutatedFile, mutationsInfo, crash)

    This should already be quite clear from the existing context. If a crash is detected, it should be logged somewhere along with the input that caused the crash and some initial analysis. Since the input is in the form of a mutated file, the mutation description and the original file should also be stored (less work later).

And this is the main fuzzing loop that I, and almost everyone else, have used in one form or another.


"Vanapagan" fuzzer

So, in addition to this blog post, I have also made public the newer version of my simple Vanapagan fuzzer, which is written in Python (and not the nice Python 3, but the now unsupported 2.7). Based on this, I will describe how the fuzzing loop explained before carries over to my actual fuzzer. Keep in mind that this fuzzer is a very simple one - it just mutates the input and executes the unmodified target program. In many situations this is a rather slow way to do it - there are multiple ways to speed up execution or to test only certain parts of the main program. But I believe that is something that comes a bit later, and the goal of this post is to show the basics. So here come an overall comment and the main pieces of the Vanapagan fuzzer:



Overall usage:

  • It uses the WinAppDbg, pywin32, and psutil libraries - the first for debugging and the rest for some overall process monitoring (CPU usage, etc.)
  • It is configurable by a JSON conf file
  • I usually set it up, add autologin, automatic running, full page heap, and restarting in case of an exception. Then I let it run for a week or so and evaluate the results.


Pieces:

  • fuzz.py
    Main loop logic and also a file to run when fuzzing
  • Vanapagan\Detector\WinBasic.py
    It contains the class that does the actual debugging work and crash detection. It uses WinAppDbg library for that.
  • Vanapagan\Loging\*.py
    There are two classes currently for logging any findings. One logs the results to the local filesystem (or network shared) directory and the second one can be used to log to an FTP server.
  • Vanapagan\Mutator\*.py
    There is one base class and two implementations for file input manipulation. One does the bit flipping, and the other places special value bytes into the input.
  • Vanapagan\Utils\*.py
    Just some helper functions used by the main fuzzing loop
  • Vanapagan\CrashReport.py
    Data class for the crash reports, used by the detector and logging classes.
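As an illustration of what such a mutator does, here is a sketch of a rate-based bit flipper. This is my own simplified version, not the actual Vanapagan\Mutator class:

```python
# Illustrative rate-based bit flipper (not Vanapagan's implementation):
# one change per `rate` bytes, each change flipping a random bit of a
# distinct random byte, while logging the mutations for later triage.
import random

def flip_bits(data: bytes, rate: int, rng=random):
    out = bytearray(data)
    n_changes = max(1, len(data) // rate)
    positions = rng.sample(range(len(data)), n_changes)  # distinct bytes
    log = []
    for pos in positions:
        bit = rng.randrange(8)
        out[pos] ^= 1 << bit
        log.append((pos, bit))  # the "mutationsInfo" from the main loop
    return bytes(out), log

mutated, log = flip_bits(b"\x00" * 100_000, rate=4_000)
print(len(log))  # 25 changes: 100000 // 4000
```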


Configuration example:

I will use a fuzzing configuration for Excel as an example because it needs most of the functionality that exists in my fuzzer. It should also bring out all the little things you have to think about when setting the fuzzer up.

{
  "name": "excel32",

  "input": ".\\input",
  "retry": 1,
  "binaries": ["excel.exe"],
  "executable": "C:\\Program Files (x86)\\Microsoft Office\\root\\Office16\\excel.exe",
  "regsToDelete": ["HKCU\\Software\\Microsoft\\Office\\16.0\\Excel\\Resiliency"],
  "filesToDelete": [
    "C:\\Users\\jaanus\\AppData\\Local\\Temp\\*",
    "C:\\Users\\jaanus\\AppData\\Roaming\\Microsoft\\Office\\Recent\\*",
    "C:\\Users\\jaanus\\AppData\\Roaming\\Microsoft\\Windows\\Recent\\*"
  ],
  "windowToInteract": "Microsoft Excel",
  "windowToInteractKey": "y",
  "logNullCrashes": true,
  "maxWait": 90,
  "restartWhenException": false,
  "restartWhenLoop": false,

  "mutators": [
    {
      "type": "FileBitFlipping",
      "rate": 40000,
      "weight": 2
    },
    {
      "type": "FileByteValues",
      "rate": 60000
    }
  ],
  "logging": [
    {
      "type": "FtpsLoging",
      "host": "10.20.30.40",
      "username": "bot",
      "password": "s3cr3t",
      "dir": "/BOT/excel32"
    },
    {
      "type": "FilesystemLoging",
      "dir": ".\\crashes\\"
    }
  ]
}




So let's get going with different values in it:

  • name - Name of the fuzzer conf; not actually used anywhere.
  • input - Directory where the input set is located.
  • retry - How many times an input is tested against the target. Usually, 1 is enough, but in some cases, the target might crash for reasons other than the input itself. With such targets, the retry count should be higher, so in case of a crash, the input is retested to make certain that the crash was caused by it.
  • binaries - List of binaries that are monitored for CPU usage. This is usually not needed but could be useful if the target program indirectly executes some other applications and waits for them. In such cases, the primary program's CPU usage is not the only one you should monitor. In the current example, it's "excel.exe" itself, so it actually has no effect.
  • executable - Target program location.
  • regsToDelete - Registry values to delete before every execution. In the current case, it removes the possibility of a "safe mode" startup for Excel and the recovery of old files.
  • filesToDelete - Files to delete before every execution.
  • windowToInteract - Window to send the keypress (windowToInteractKey value) to. In the current case, it's for the "do you want to repair this file" dialog.
  • windowToInteractKey - Keypress that is sent to the window described in the previous point.
  • logNullCrashes - Whether to also store crashes that happen on the NULL page.
  • maxWait - Maximum execution time of the target application. It is triggered when CPU usage never falls to 0.
  • restartWhenException - Whether the fuzzer should restart the computer if an exception happens. Useful during real fuzzing, but should be turned off when debugging.
  • restartWhenLoop - Whether the computer should restart once the fuzzer has looped through the entire input set.
  • mutators - What mutations are used:
    • Type "FileBitFlipping" - Random bit flipping
    • Type "FileByteValues" - Special byte, word and dword values
    • rate - How many bytes per change. So if the first mutator's rate value is 40,000 and the file size is 1,000,000 bytes, that results in 1,000,000 / 40,000 = 25 changes.
    • weight - The mutator used in each execution is chosen more or less randomly, and the weight value gives the mutator its importance. In the current configuration, there are 2 mutators and the weight of the first one is 2. This means that 2/3 of the time the first one is used and 1/3 of the time the second one.
  • logging - Which result logging backends are used:
    • Type "FtpsLoging" - Logging to FTP server
    • Type "FilesystemLoging" - Logging to the local filesystem
    • All other values should be quite self-explaining



How I use it all:
  1. Prepare crash collecting FTP server
  2. Prepare VM (install & update OS, Office, Python and libraries)
  3. Enable full page heap for Excel
  4. Set up fuzzer (input set, conf file, etc.)
  5. Test fuzzer without restarting ("restartWhenException" should be "false") until it works fine
  6. Activate restarting in fuzzer conf
  7. Activate autologon ( https://docs.microsoft.com/en-us/sysinternals/downloads/autologon )
  8. Add fuzzer execution into "run" registry key
  9. Clone & run as many VMs as possible
  10. Check up on VMs after a day just in case
  11. Wait a week
  12. Check whether any good crashes were found
      NO> Cry a bit and change the fuzzing conf or pick a new target
      YES> Analyse them and leave the fuzzers running for another week


Why mutation-based fuzzing

So I thought that I should also explain why and how I have chosen mutation-based fuzzing and not some other form. The main reason is the "results for time spent" calculation. Very dumb fuzzing can't find any results these days, and while smart fuzzing is WAY better, it also takes exponentially more time to develop. This is especially important since I'm not actually doing such research for a living (at least not as a main income), and I don't want to waste a year of free time on writing a smart fuzzer that might not even work that well. Mutation-based fuzzing is a good compromise - it's still interesting because you have to find a good initial set of inputs to start mutating (it's not that useful to choose it randomly), it does not take that long to prepare, and it still gives results. Of course, it does not always work - for example, it's harder to fuzz a browser JavaScript engine or syscalls this way.
Overall, I try to put maximum effort into collecting the best input set possible, one that covers a wide range of functionality inside the targeted program. This, combined with a stable fuzzing process, means that I can skip other optimization options and still get results without sinking in too much manual work time. The results, of course, vary heavily with such an approach, but I have still been quite successful doing it.
Also, because this approach does not demand the constant manual work that other types of targets do, it leaves room to do various types of research concurrently.



Other notes

Some additional things are useful to know before the actual fuzzing (I use fuzzing in a Windows environment as an example, but similar points apply to all OSes and settings):

  • Full page heap
    Activate full page heap for the process you are fuzzing. Look up its description on the MS gflags page; it's pretty much mandatory when fuzzing applications that use the Windows heap manager. Without it, you might still catch crashes, but they will happen in quite random places.
  • Crash info gathering
    Storing crashes in the local filesystem works well when testing on a single machine. It is way less fun when running tens or hundreds of VMs. I have mostly used an FTP server or network drive to gather all crashes to a single location.
  • Initial analysis
    This is not important if your target program crashes very rarely. But it becomes critical when there are a lot of crashes. In that case, you want to do at least some initial filtering and categorization of the found issues. I, for example, skip NULL page crashes and sort the rest by crash location (even this is not enough if the target program is very broken - then you have to start separating by stack trace, but I haven't needed that).
  • Disable sleep
    I have forgotten a LOT of times to disable OS sleep on new VMs. Because the fuzzer works without user input, the OS sometimes decides that there is no user activity and goes to sleep. This is not good, of course...
  • GUI actions
    "File seems broken, do you want to try to fix it?" and other such dialog boxes. You usually want your fuzzer to say "hell yes" to them.



Ending comments

Everything covered in this post was as basic and introductory as possible. Fuzzing programs using such simple tools is not that effective resource-wise and not at all fancy. It still works if you have a good initial set of files, but even then, there are more effective ways to do the actual fuzzing part. At the same time, I have been on the MSRC top researchers list for 4 years now, mainly thanks to this tool and this approach. So it DOES WORK, but if I did such research full time, or my livelihood depended on it, I would look for ways to speed up fuzzing (I will write about this in the future as well).

In my next blog post, I will cover my way of collecting the input sets for mutation-based fuzzing. I will use my Rehepapp tool as an example and will also release some new utilities for it. I hope to get it done before the end of January because it connects strongly to the current post. But no promises...