Friday, January 24, 2020

Let's continue with corpus distillation

My previous post was about the very basics of mutation-based fuzzing. Now I continue with a description of how I collect and filter the files that I use as a base set for mutations. The technique I use is most commonly called "corpus distillation." The toolset I use to implement this technique is written by me and released (and updated along with this blog post) at https://github.com/FoxHex0ne/Rehepapp.
Now I must mention that this toolset uses a somewhat alternative way to achieve code coverage monitoring, and it's not suitable for all situations. A more generic but slower method is binary instrumentation (slower compared to my approach on closed-source targets, and only after the optimization phase described below). Also, the Intel PT feature will probably allow even more generic and faster tools - I think this is the future, and at some point, I hope to use it for such things.

Corpus distillation

Let's start by describing what corpus distillation means in the current context. The first word, "corpus," is the Latin word for "body," but in many settings it means a large collection of something (for example, a "text corpus" is an extensive collection of texts). In the current context, it means a large set of inputs (files). The "distillation" part of the term, in turn, means that we will not use the entire corpus but a distilled part of it. We do this by minimizing the corpus in such a way that the resulting set causes the same code coverage as the original corpus.

To describe this more visually, let's imagine 10 inputs (INPUT0 to INPUT9) and the targeted program that has 6 functionalities (F0 to F5) that can be triggered by inputs. Now let's describe what data causes what functionalities to be executed in the targeted program:
  • INPUT0 causes F1, F5
  • INPUT1 causes F1, F2
  • INPUT2 causes F1, F3, F5
  • INPUT3 causes F0, F1, F4
  • INPUT4 causes F0, F1, F2
  • INPUT5 causes F0, F2, F4
  • INPUT6 causes F1, F2
  • INPUT7 causes F1, F2
  • INPUT8 causes F1, F2
  • INPUT9 causes F1, F2

Of course, we could start fuzzing all the possible inputs, but that would be less effective. In the current case, half of the inputs only trigger functionalities F1 and F2, so by fuzzing the entire set, we would spend 50% of the time fuzzing almost the same things. This is inefficient, especially if the original corpus contains hundreds of thousands or even millions of inputs. In such a case, the share of inputs covering only the most common functionality (for example, text in a doc file) and nothing more will not be 50% but much, much higher. To reduce such repetition, the "distillation" part comes in: we remove every element from the original corpus/set that we can without diminishing the coverage of the entire set. In the current example, we can minimize the set to just INPUT2 and INPUT5, because together they cover as much functionality as the original 10 inputs.
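The toy example above boils down to a greedy set cover. A minimal sketch of the idea (the data is the table above; this is an illustration, not Rehepapp code):

```python
# Toy corpus from the example above: input name -> functionalities it triggers.
corpus = {
    "INPUT0": {"F1", "F5"},
    "INPUT1": {"F1", "F2"},
    "INPUT2": {"F1", "F3", "F5"},
    "INPUT3": {"F0", "F1", "F4"},
    "INPUT4": {"F0", "F1", "F2"},
    "INPUT5": {"F0", "F2", "F4"},
    "INPUT6": {"F1", "F2"},
    "INPUT7": {"F1", "F2"},
    "INPUT8": {"F1", "F2"},
    "INPUT9": {"F1", "F2"},
}

def distill(corpus):
    """Greedy set cover: keep picking the input that adds the most new coverage."""
    remaining = set().union(*corpus.values())  # functionality not yet covered
    selected = []
    while remaining:
        # Pick the input covering the most still-uncovered functionality.
        best = max(sorted(corpus), key=lambda name: len(corpus[name] & remaining))
        if not corpus[best] & remaining:
            break
        selected.append(best)
        remaining -= corpus[best]
    return selected

print(distill(corpus))  # ['INPUT2', 'INPUT5']
```

Running this picks INPUT2 first (3 new functionalities) and then INPUT5 (the remaining 3), matching the hand-derived result above.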

Now, in most programs, we can't cleanly separate the functionalities being invoked, and even if we could, the execution of the same functionality could differ based on the input. So we do code coverage analysis based on the basic blocks inside the targeted program. Basic blocks (https://en.wikipedia.org/wiki/Basic_block gives a longer/better description) are blocks of code with no branching inside, so if a block starts executing, it will run until its end (unless an exception happens).

With a large corpus of inputs, we can record basic block coverage that each of them causes in the targeted program and then distill (minimize) input set to a smaller size that still has the same coverage. By doing this, we increase the effectiveness of the fuzzing.

How much can we reduce? For me, it has usually been over a 500x reduction of the original set. I can't recall exact numbers anymore, but I have typically had corpora of 1M to 3M files and minimized sets of 1500 to 3500 with the more common filetypes (pdf, doc, xls, ppt...).

Corpus distillation steps

As described, corpus distillation is quite a simple concept, but as with most things, the hardship lies in the implementation. To do the entire corpus distillation and generate a fuzzing base set for mutations, you have to go through these steps:
  1. Collecting the corpus.
  2. Recording the coverage for every corpus element.
  3. Minimizing the corpus based on coverage.

Step 1: Collecting the corpus

You need a lot of proper inputs, and in the current case, that means files. So where do you go when you need to find something? Google, of course!
Let's say you want to get a huge corpus of pdf files. You google "filetype:pdf" and Google responds with nothing. So you google "filetype:pdf something" and Google reports around 195M results.

Great, right? Just download some part of those 195M files, and you are golden? Not that easy: while Google says it has that many results, it does not want to give you all of them - you can only page through a small fraction.

So you can't get them that easily. My solution has been to do alphabetic searches: I search "filetype:pdf A", "filetype:pdf B", ... "filetype:pdf Z", "filetype:pdf AA", "filetype:pdf AB", and so on.

Now, since nobody in their right mind will do such searches manually, you have to automate this. And that is where another roadblock gets in the way - Google does not like you automating such searches. You could use the "Custom Search JSON API", but it has somewhat substantial limitations. So what are the options, unless you want to teach a deep neural network to solve Google's "are you human" challenges?
I personally avoided the protection mechanism simply by making sure that consecutive searches were at least 48 seconds apart (this value has changed a couple of times and might depend on other things as well). But this also heavily limited the number of searches I could do in a day (only 1800). And while this is enough if you have a couple of weeks and want fewer than a million files, it's still not the best.
The way around this, in turn, is to have multiple IPs to search from. You might have a lot of friends who can run small scripts on some of their machines (if they trust you). Or you could rent a lot of the cheapest VMs with public IPs from some cloud. F1-micro from Google Cloud used to work well for this (dirt cheap and had a public IP by default).
NB: I must say that I'm not 100% sure that renting VMs from Google to bypass Google's own search rate limiting is allowed under the terms of service. So I'm not recommending it, I'm only saying that it works - and you will still be paying Google for it, so morally it's not that bad, I hope...
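The alphabetic searching and pacing described above are easy to script. A minimal sketch of just the query generation and rate limiting (the do_search function is a placeholder for whatever scraping code you use; the real thing lives in the Google.py script from the repo):

```python
import itertools
import string
import time

def search_phrases():
    """Yield 'A', 'B', ..., 'Z', 'AA', 'AB', ... for alphabetic searches."""
    for length in itertools.count(1):
        for combo in itertools.product(string.ascii_uppercase, repeat=length):
            yield "".join(combo)

def run_searches(do_search, max_queries, delay_seconds=48):
    """Call do_search('filetype:pdf <phrase>') at most max_queries times, paced."""
    for i, phrase in enumerate(search_phrases()):
        if i >= max_queries:
            break
        if i > 0:
            time.sleep(delay_seconds)  # stay under Google's rate limiting
        do_search("filetype:pdf " + phrase)
```

At 48 seconds per query this tops out at 86400 / 48 = 1800 queries per machine per day, which is where the limit mentioned above comes from.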

Additional notes for step 1

  • The code for doing searches from Google can be found at https://github.com/FoxHex0ne/Rehepapp/blob/master/Scripts/Libs/Google.py.
  • In the collection phase, don't collect the files themselves, only links to them - especially if you are collecting via multiple machines. You can still download them to check their magic signatures or other aspects, but don't store them. It's much simpler to later merge the URLs, or to gather them right away into a single database. I usually do exactly that and store the hash and size along with the URL.
  • When distributing the downloading tasks to multiple machines, always use search ranges. For example, one machine does all the searches with phrases that start with "A", a second with "B", etc. If you are using a central server that organizes it all, things can be tuned even finer.
  • There will be a lot of collisions in the search results - both the same addresses coming up repeatedly and the same files served from different addresses. This ought to be accounted for.
  • If you decide not to download entire files when collecting the URLs, still download the first couple of bytes to check the filetype magic value. Google returns results based on extension, which means some of the returned files are not actually of the correct type.
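The "URL plus hash and size in a single database, deduplicated" advice can be sketched with SQLite (the table layout here is my own illustration, not what Rehepapp uses):

```python
import sqlite3

def open_url_db(path):
    db = sqlite3.connect(path)
    # UNIQUE drops repeated URLs; the hash lets us spot the same file
    # served from different addresses.
    db.execute("""CREATE TABLE IF NOT EXISTS urls (
                      url  TEXT UNIQUE,
                      hash TEXT,
                      size INTEGER)""")
    return db

def add_url(db, url, sha256, size):
    """Insert a search result; silently ignore URLs we have already seen."""
    db.execute("INSERT OR IGNORE INTO urls VALUES (?, ?, ?)", (url, sha256, size))
    db.commit()

def unique_files(db):
    """One URL per distinct file content (same hash from many hosts counts once)."""
    return db.execute(
        "SELECT hash, MIN(url), size FROM urls GROUP BY hash").fetchall()
```

The UNIQUE constraint handles repeated addresses and the GROUP BY handles the same file appearing under different addresses, which are exactly the two collision types noted above.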

Step 2: Calculating coverage

As mentioned at the beginning, my method of calculating code coverage is different from most other tools. The fastest way to calculate code coverage is to inject tracing code into the binary while compiling (one example is the SanitizerCoverage feature). This enables tracing not only at the basic-block level but at the edge level - meaning coverage is measured based on the branching path into the basic block, not the block alone. This is almost always the best and fastest solution, but it can't be used on closed-source software.
For closed-source software, binary instrumentation is the most common method, and Intel PT could be even better. Both are much larger topics, and I will not try to describe them here. I will only say that binary instrumentation is in many cases quite slow (not always), and I think the future is in Intel PT or other hardware-supported features that might pop up.

My own solution consists of a preparation phase and a main loop:
  1. Preparation
    1. Use IDA Pro or a similar tool to record the locations of all the basic blocks in the executable and the libraries it uses.
    2. In the analyzed binaries, write 0xCC (software breakpoint) at the beginning of every basic block.
  2. Main loop
    1. Start the targeted program under a debugger.
    2. Wait for a software breakpoint to be triggered (if none has happened for X seconds, stop the program).
    3. When a software breakpoint happens, record the location, restore the original value of the byte, and continue execution (don't forget to decrement the instruction pointer).
    4. GOTO 2.2
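Step 1.2 (planting the breakpoints) amounts to patching a 0xCC byte at every basic-block start while remembering the original bytes so the tracer can restore them one by one. A sketch over a file on disk, assuming the offsets come from the IDA script and ignoring the PE file-offset vs. virtual-address mapping (in Rehepapp, CoveragePlanter.exe does this for real):

```python
def plant_breakpoints(path, offsets):
    """Overwrite the first byte of every basic block with 0xCC (INT3).

    Returns {offset: original_byte} so the tracer can restore each byte
    after its breakpoint has fired once.
    """
    with open(path, "r+b") as f:
        data = bytearray(f.read())
        originals = {off: data[off] for off in offsets}
        for off in offsets:
            data[off] = 0xCC
        f.seek(0)
        f.write(data)
    return originals
```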
Such a solution means that every executed basic block triggers a software breakpoint once per execution. With every triggered breakpoint, the targeted program stops, context is switched to the debugger, information is recorded, target program memory is written, and the execution context is switched back. That is a LOT of extra CPU cycles for recording a single basic block, even though it happens only once per basic block during a single execution. To mitigate this, I use the following approach:
  1. Download 100 inputs (files).
  2. Run code tracing with these 100 files and all the basic-block triggers.
  3. Find the 5 files with the best coverage out of these 100.
  4. Remove all the basic-block triggers that were covered by these 5.
  5. Download 1000 new inputs.
  6. Run code tracing with these 1000 files and the remaining basic-block triggers.
  7. Find the 45 files with the best coverage out of these 1000.
  8. Remove all the basic-block triggers that were covered by these 45.
At the end of such filtering, you are left with 50 files that you can keep for your final set, plus a reduced basic-block list. For most software, this is enough to get practically real-time execution for almost any input used later. The only exception is inputs that trigger quite rare functionality - and finding those is a good thing anyway.
Just to be sure, I usually do another such filtering round with 10000 new files, from which I take 100. You could also keep going further.
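A single filtering round from the list above, done greedily over the recorded coverages, could look like this (an illustration of the idea, not Rehepapp's actual code):

```python
def filter_round(remaining_triggers, coverages, keep):
    """One filtering round: greedily keep the `keep` files adding the most
    still-instrumented blocks, then remove their blocks from the trigger set
    so later tracing runs faster.

    coverages: {filename: set of basic-block addresses that file triggered}
    Returns (kept filenames, reduced trigger set).
    """
    kept = []
    remaining = set(remaining_triggers)
    for _ in range(keep):
        # Rank by how many still-instrumented blocks each file hits.
        best = max(coverages, key=lambda f: len(coverages[f] & remaining))
        if not coverages[best] & remaining:
            break  # nothing new left to gain
        kept.append(best)
        remaining -= coverages[best]
    return kept, remaining
```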


Additional notes for step 2

  • When I wrote "Find the X files with the best coverage out of these Y", I did not mean that you just choose the X best right away. Instead, take the best one, dismiss the basic blocks covered by it, then select the next best without counting the blocks covered by the first one, and so on.
  • The more filtering steps you do, the faster the later executions become. But it's not reasonable to go crazy with it, because it can reduce the quality of your final set (too many similar inputs).

Step 3: Minimizing

At this point, you should have a vast corpus and coverage information for each element in it. If you have used my tools and approach, you already have some files in the final set as well. What is left is to calculate the rest. This seems easy, but in actuality there is no way to find the BEST solution without trying all the options (it is essentially the set cover problem, which is NP-hard). So the easiest, and in my opinion good enough, solution is this loop:

  1. Find the one input with the best coverage in the set.
  2. Remove all basic blocks covered by that input from all the coverage recordings.
  3. Put that input into the final set.
  4. If any inputs are left that cover basic blocks not yet removed, go to step 1.
  5. All done.
This does not give the best solution, of course, but in my experience a good enough one.


Additional notes for step 3

  • Depending on the size of the corpus and how many basic blocks each input covers on average, this calculation can take a very long time. In the case of millions of inputs, it could take days even with the optimized list of basic blocks (using the methods described in step 2).
  • Storing the coverage info in a database makes writing the minimizer a bit easier. It might not be as fast as a fine-tuned tool written just for this, but a lot of development time is saved. And since you will calculate coverage only once per input type, a 2-day vs. 3-day difference is not so terrible.

Rehepapp toolset


So, I have made my own toolset for such coverage analysis. It still has a lot of issues that I will try to fix as time goes on, but sadly this is not a priority right now. Anyway, I will now list all the tools in this toolset and what they can be used for:
  • Tracer.exe - This does actual coverage recording. It takes as an input a list of basic blocks (generated by the IDA script) and records all basic blocks that are reached.
  • Analyzer.exe + IDA_GetBasicBlocks.py - They can be used to create a list of basic blocks from exe/dll files of some application.
  • CoverageCommon.exe - Tool to find common basic blocks covered by multiple coverage recordings.
  • CoverageMain.exe - When combined with Tracer.exe, it can be used to record coverage information for multiple files.
  • CoverageMinimizer.exe - Find X best coverages from the list of coverages.
  • CoveragePlanter.exe - Plants software breakpoints into the exe/dll files based on the basic block file. Has to be run before running Tracer.exe.
  • CoverageReducer.exe - Removes basic blocks from the existing list based on coverage recordings.
  • Server.jar - HTTP server that can be used for distributed file collecting and coverage testing.
  • BotCollector.py - Used for distributed file collecting.
  • BotCoverage.py - Used for distributed coverage testing.

It would be a rather long story to show the entire typical usage of Rehepapp here, so I will probably do something a little different and try to create a demo video on how to best use my tools. If so, I will add a link to it here later.

Tuesday, January 7, 2020

Let's get things going with basics of file parsers fuzzing

So, one of my 2020 new year promises was to actually start writing this blog. At first I thought about writing about my current Hyper-V research, but because there has been some interest from people just entering the fuzzing scene, I decided to write about that instead. So let's get going - I will not start from the extreme basics, but close to it. I will also connect the explanations and descriptions with the newer version of my own fuzzer, which I released last week ( https://github.com/FoxHex0ne/Vanapagan ).


Primary loop

So let us talk about fuzzing. I will start with a description of the primary fuzzing loop and how I have used it. In essence, my fuzzing loop for file parser fuzzing has been something like this (presented in minimized Python code for additional weirdness):

try:
    while True:
        clearRegistriesAndTemporaryFiles()
        originalFile = takeRandomFileFromCorrectFilesSet()
        (mutatedFile, mutationsInfo) = mutateFile(originalFile)
        crash = executeTargetedProgramWithInputFile(mutatedFile)
        if crash is not None:
            logCrash(originalFile, mutatedFile, mutationsInfo, crash)
except:
    restart()

This is an overly simplified version of course, but I believe this is enough for an explanation. So let's start breaking everything apart:

  • Outer try-except block and restart() 
    This is to catch all kinds of crap that might sooner or later happen while running a fuzzing loop for a long time. Since I'm running a lot of fuzzing VMs, I usually just prefer that the machine restarts if something weird happens. After the restart, the fuzzing simply continues. During the debugging phase, I disable restarting and just let the exception propagate to see what happened.
  • while True
    Well... it's a fuzzing loop you know :)
  • clearRegistriesAndTemporaryFiles()
    This is the first part that gets a bit more interesting. It's not needed with all targets, but it is with quite a lot of them. This function clears temporary files or registry values that would otherwise influence the fuzzing process. For example, with the main MS Office programs (Word, Excel, Powerpoint), safe mode can be activated when the last loop killed the main process (some registry values have to be cleared to avoid this). Or, in the case of Foxit PDF reader, a temporary file is created in the temp directory and is of course not deleted when the reader is killed off by the fuzzer (over time this would result in growing disk usage and slower reader startup).
  • originalFile = takeRandomFileFromCorrectFilesSet()
    I'm a big fan of mutation-based fuzzing. This is because I'm too lazy to write smart fuzzers for complex protocols and specifications (I also have a full-time job and university stuff currently so I can redirect blame for lack of time). So what I do, is I choose a set of correct inputs/files as a basis. In this line, I just select a random file from such a set.
  • (mutatedFile, mutationsInfo) = mutateFile(originalFile)
    Now comes the mutation part. No matter what mutators you use (I have lately used mainly bit flipping and special values - 0x00, 0xFF, 0xFE, 0x80, 0x7F, etc.), it's useful to also keep track of the mutations that were made. Of course, mutations can be deduced by comparing the original and mutated files, but that takes extra effort. Also, in some cases the file is changed when opened by the targeted program (Excel & Word), which creates additional confusion. All in all, the mutation functionality leaves you with the mutated file and the mutation info.
  • crash = executeTargetedProgramWithInputFile(mutatedFile)
    In this part, the targeted executable is actually executed and the mutated file is provided to it as input. This part has many moving pieces and has to do multiple things. First, it should function as a debugger (or control an actual debugger) to detect crashes inside the process. For example, if an invalid memory read/write is detected, this functionality should return information about the crash (location, register values, stack, etc.). Also, if the CPU usage of the process falls to 0%, it makes sense to stop the process. Besides that, there should be an overall time limit for a single process execution.
  • if crash is not None: logCrash(originalFile, mutatedFile, mutationsInfo, crash)
    This should already be quite clear from the context. If a crash is detected, it should be logged somewhere along with the input that caused it and some initial analysis. Since the input is a mutated file, the mutation description and the original file should also be stored (less work later).
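As an illustration of the clearRegistriesAndTemporaryFiles() step, the file half could look like this (the registry half would use winreg on Windows and is omitted; the paths are placeholders for values like the Temp directory from the config example below):

```python
import glob
import os

def clear_temporary_files(patterns):
    """Delete leftover files matching each glob pattern before the next run.

    E.g. patterns = ["C:\\Users\\<user>\\AppData\\Local\\Temp\\*"] in a real setup.
    """
    for pattern in patterns:
        for path in glob.glob(pattern):
            try:
                if os.path.isfile(path):
                    os.remove(path)
            except OSError:
                pass  # a locked file is not worth crashing the fuzzing loop over
```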

And this is the main fuzzing loop that I, and almost everyone else, have used in one form or another.


"Vanapagan" fuzzer

So, in addition to this blog post, I have also made public the newer version of my simple Vanapagan fuzzer, written in Python (not the nice Python 3, but the now-unsupported 2.7). Based on it, I will describe how the fuzzing loop explained above carries over to my actual fuzzer. Keep in mind that this fuzzer is a very simple one - it just mutates the input and executes the unmodified target program. In many situations this is a rather slow way to do it - there are multiple ways to speed up execution or to test only certain parts of the main program. But I believe that comes a bit later, and the goal of this post is to show the basics. So here come the overall comments and the main pieces of the Vanapagan fuzzer:



Overall usage:

  • It uses the WinAppDbg, pywin32 and psutil libraries - the first for debugging and the rest for overall process monitoring (CPU usage, etc.)
  • It is configured with a JSON conf file
  • I usually set it up, add autologin, automatic running, full page heap and restarting in case of exceptions. Then I let it run for a week or so and evaluate the results.


Pieces:

  • fuzz.py
    Main loop logic and also a file to run when fuzzing
  • Vanapagan\Detector\WinBasic.py
    It contains the class that does the actual debugging work and crash detection. It uses WinAppDbg library for that.
  • Vanapagan\Loging\*.py
    There are two classes currently for logging any findings. One logs the results to the local filesystem (or network shared) directory and the second one can be used to log to an FTP server.
  • Vanapagan\Mutator\*.py
    There is one base class and two implementations for file input manipulation. One does bit flipping and the other places special byte values into the input.
  • Vanapagan\Utils\*.py
    Just some helper functions used by the main fuzzing loop
  • Vanapagan\CrashReport.py
    Data class for the crash reports, used by the detector and logging classes.
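As an illustration of the bit-flipping mutator idea (this is my own sketch, not Vanapagan's actual code; the rate semantics follow the config explanation below):

```python
import random

def flip_bits(data, rate=40000):
    """Flip roughly one random bit per `rate` bytes of input.

    A 1,000,000-byte file with rate=40,000 gets 25 bit flips.
    """
    mutated = bytearray(data)
    changes = max(1, len(mutated) // rate)
    # Sample without replacement so every flip lands on a distinct bit.
    for bit in random.sample(range(len(mutated) * 8), changes):
        mutated[bit // 8] ^= 1 << (bit % 8)
    return bytes(mutated)
```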


Configuration example:

I will use fuzzing configuration for Excel as an example because it needs most of the functionality existing in my fuzzer. It should also bring out all the little things that you kind of have to think about when setting fuzzer up.

{
  "name": "excel32",

  "input": ".\\input",
  "retry": 1,
  "binaries": ["excel.exe"],
  "executable": "C:\\Program Files (x86)\\Microsoft Office\\root\\Office16\\excel.exe",
  "regsToDelete": ["HKCU\\Software\\Microsoft\\Office\\16.0\\Excel\\Resiliency"],
  "filesToDelete": [
    "C:\\Users\\jaanus\\AppData\\Local\\Temp\\*",
    "C:\\Users\\jaanus\\AppData\\Roaming\\Microsoft\\Office\\Recent\\*",
    "C:\\Users\\jaanus\\AppData\\Roaming\\Microsoft\\Windows\\Recent\\*"
  ],
  "windowToInteract": "Microsoft Excel",
  "windowToInteractKey": "y",
  "logNullCrashes": true,
  "maxWait": 90,
  "restartWhenException": false,
  "restartWhenLoop": false,

  "mutators": [
    {
      "type": "FileBitFlipping",
      "rate": 40000,
      "weight": 2
    },
    {
      "type": "FileByteValues",
      "rate": 60000
    }
  ],
  "logging": [
    {
      "type": "FtpsLoging",
      "host": "10.20.30.40",
      "username": "bot",
      "password": "s3cr3t",
      "dir": "/BOT/excel32"
    },
    {
      "type": "FilesystemLoging",
      "dir": ".\\crashes\\"
    }
  ]
}




So let's get going with different values in it:

  • name - Name of the fuzzer conf; not actually used anywhere.
  • input - Directory where the input set is located.
  • retry - How many times an input is tested against the target. Usually 1 is enough, but in some cases the target might crash for reasons other than the input itself. With such targets, retry should be higher so that, in case of a crash, the input is retested to make certain the crash was caused by it.
  • binaries - List of binaries that are monitored for CPU usage. This is usually not needed, but can be useful if the target program indirectly executes other applications and waits for them. In such cases, the primary program's CPU usage is not the only one you should monitor. In the current example, it's "excel.exe" itself, so it has no extra effect.
  • executable - Target program location.
  • regsToDelete - Registry values to delete before every execution. In the current case, it's for removing the possibility of a "safe mode" startup for excel or the recovery of old files.
  • filesToDelete - Files to delete before every execution.
  • windowToInteract - Window for sending keypress (windowToInteractKey value) to. In the current case, it's for the "do you want to repair this file" dialog.
  • windowToInteractKey - Keypress that is sent to the window described in the previous point.
  • logNullCrashes - Whether to also store crashes that happen in the NULL page.
  • maxWait - Maximum execution time of the target application. It is triggered when CPU usage never falls to 0.
  • restartWhenException - Should fuzzer restart the computer if an exception happens. Useful during real fuzzing but should be turned off when debugging.
  • restartWhenLoop - If fuzzer has looped through all input set, should the computer restart.
  • mutators - What mutations are used:
    • Type "FileBitFlipping" - Random bit flipping
    • Type "FileByteValues" - Special byte, word and dword values
    • rate - How many bytes there are per single change. So if the first mutator's rate is 40,000 and the file size is 1,000,000 bytes, it results in 1,000,000 / 40,000 = 25 changes.
    • weight - The mutator used in each execution is chosen somewhat randomly, and the weight value gives a mutator its importance. In the current configuration there are 2 mutators and the weight of the first one is 2. This means the first one is used 2/3 of the time and the second one 1/3 of the time.
  • logging - What results logging are used:
    • Type "FtpsLoging" - Logging to FTP server
    • Type "FilesystemLoging" - Logging to the local filesystem
    • All other values should be quite self-explanatory
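The weight mechanics can be reproduced with a weighted random choice. A sketch of the idea (not Vanapagan's actual code; I assume a missing weight defaults to 1, which matches the 2/3 vs. 1/3 split above):

```python
import random

def pick_mutator(mutators):
    """Pick one mutator config, honoring its 'weight' (default 1)."""
    weights = [m.get("weight", 1) for m in mutators]
    return random.choices(mutators, weights=weights, k=1)[0]

# The two mutators from the config example above.
mutators = [
    {"type": "FileBitFlipping", "rate": 40000, "weight": 2},
    {"type": "FileByteValues", "rate": 60000},
]
```

Over many executions, the FileBitFlipping mutator gets picked about twice as often as FileByteValues.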



How I use it all:
  1. Prepare crash collecting FTP server
  2. Prepare VM (install & update OS, Office, Python and libraries)
  3. Enable full page heap for Excel
  4. Set up fuzzer (input set, conf file, etc.)
  5. Test fuzzer without restarting ("restartWhenException" should be "false") until it works fine
  6. Activate restarting in fuzzer conf
  7. Activate autologon ( https://docs.microsoft.com/en-us/sysinternals/downloads/autologon )
  8. Add fuzzer execution into "run" registry key
  9. Clone & run as many VMs as possible
  10. Check up on VMs after a day just in case
  11. Wait a week
  12. Check whether any good crashes were found
      NO> Cry a bit and change the fuzzing conf or pick a new target
      YES> Analyze them and leave the fuzzers running for another week


Why mutation-based fuzzing

So, I thought I should also explain why and how I chose mutation-based fuzzing over other forms. The main reason is the "results for time spent" calculation. Very dumb fuzzing can't find any results these days, and while smart fuzzing is WAY better, it also takes vastly more time to develop. This is especially important since I'm not doing such research for a living (at least not as a main income), and I don't want to waste a year of free time writing a smart fuzzer that might not even work that well. Mutation-based fuzzing is a good compromise - it's still interesting because you have to find a good initial set of inputs to start mutating (choosing it randomly is not that useful), it does not take that long to prepare, and it still gives results. Of course, it does not always work - for example, it's harder to fuzz a browser javascript engine or syscalls like that.
Overall, I try to put maximum effort into collecting the best possible input set, one that covers a wide range of functionality inside the targeted program. Combined with a stable fuzzing process, this means I can skip other optimization options and still get results without sinking in too much manual work. The results, of course, vary heavily with such an approach, but I have still been quite successful with it.
Also, because this approach does not demand the constant manual work that other types of targets do, it leaves the possibility of doing several types of research concurrently.



Other notes

Some additional things are useful to know before actually fuzzing (I use fuzzing in a Windows environment as an example, but similar points apply to all OSes and settings):

  • Full page heap
    Activate full page heap for the process you are fuzzing. See the description on the MS gflags page; it's pretty much mandatory when fuzzing applications that use the Windows heap manager. Without it, you might still catch crashes, but they will happen in quite random places.
  • Crash info gathering
    Storing crashes in the local filesystem works well when testing on a single machine. It is way less fun when running tens or hundreds of VMs. I have mostly used an FTP server or network drive to gather all crashes to a single location.
  • Initial analysis
    This is not important if your target program crashes very rarely, but it becomes critical when there are a lot of crashes. In that case, you want at least some initial filtering and categorization of the found issues. I, for example, skip NULL page crashes and sort the rest by crash location (this is not enough if the target program is very broken and you have to start separating by stack trace, but I haven't needed that).
  • Disable sleep
    I have forgotten a LOT of times to disable sleep on new VMs. Because the fuzzer works without user input, the OS sometimes decides there is no user activity and goes to sleep. This is not good, of course...
  • GUI actions
    "File seems broken, do you want to try to fix it?" and other such dialog boxes. You usually want your fuzzer to say "hell yes" to them.



Ending comments

Everything covered in this post was as basic and introductory as possible. Fuzzing programs with such simple tools is not that effective resource-wise and not at all fancy. It still works if you have a good initial set of files, but even then, there are more effective ways to do the actual fuzzing part. At the same time, I have been on the MSRC top researchers list for 4 years now, mainly thanks to this tool and approach. So it DOES WORK, but if I did such research full time, or my livelihood depended on it, I would look for ways to speed up the fuzzing (I will write about this in the future too).

In my next blog post, I will cover my way of collecting the input sets for mutation-based fuzzing. I will use my Rehepapp tool as an example and will also release some new utilities for it. I hope to get it done before the end of January because it connects strongly to the current post. But no promises...