First, I must mention that this toolset uses a somewhat unconventional way of monitoring code coverage, and it's not suitable for all situations. A more generic method is binary instrumentation, though for closed-source targets it is slower than my approach (at least once the optimization phase described below has been done). Also, the Intel PT feature will probably enable even more generic and faster tools - I think that is the future, and at some point, I hope to use it for such things.
Corpus distillation
So let's start by describing what corpus distillation means in the current context. The word "corpus" is Latin for "body," but in many settings it means a broad collection of something (for example, a "text corpus" is an extensive collection of texts). In the current context, it means a large set of inputs (files). The "distillation" part of the term, in turn, means that we will not use the entire corpus but only a distilled part of it: we minimize the corpus in such a way that the resulting set produces the same code coverage as the original corpus.
To describe this more visually, let's imagine 10 inputs (INPUT0 to INPUT9) and a targeted program with 6 functionalities (F0 to F5) that can be triggered by inputs. Here is which functionalities each input causes to be executed in the targeted program:
- INPUT0 causes F1, F5
- INPUT1 causes F1, F2
- INPUT2 causes F1, F3, F5
- INPUT3 causes F0, F1, F4
- INPUT4 causes F0, F1, F2
- INPUT5 causes F0, F2, F4
- INPUT6 causes F1, F2
- INPUT7 causes F1, F2
- INPUT8 causes F1, F2
- INPUT9 causes F1, F2
Of course, we could start fuzzing all of these inputs, but that would be less effective. In this example, half of the inputs only trigger functionalities F1 and F2, so by fuzzing the entire set, we would spend 50% of the time fuzzing almost the same things. This is inefficient, especially if the original corpus contains hundreds of thousands or even millions of inputs. In such a case, the share of inputs covering only the most common functionality (for example, plain text in a doc file) and nothing more will not be 50% but much higher. To reduce such repetition, the "distillation" part comes in: we remove elements from the original corpus/set wherever we can do so without diminishing the coverage of the entire set. In the current example, we can minimize the set to just INPUT2 and INPUT5 because together they cover as much functionality as the original 10 inputs.
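To make this concrete, here is a minimal Python sketch using the example data above; it represents each input as a set of functionalities and checks that the distilled pair INPUT2 + INPUT5 covers everything the full 10-input corpus covers:

```python
# Coverage of the example corpus: each input maps to the functionalities it triggers.
coverage = {
    "INPUT0": {"F1", "F5"},
    "INPUT1": {"F1", "F2"},
    "INPUT2": {"F1", "F3", "F5"},
    "INPUT3": {"F0", "F1", "F4"},
    "INPUT4": {"F0", "F1", "F2"},
    "INPUT5": {"F0", "F2", "F4"},
    "INPUT6": {"F1", "F2"},
    "INPUT7": {"F1", "F2"},
    "INPUT8": {"F1", "F2"},
    "INPUT9": {"F1", "F2"},
}

# Union of everything the full corpus can trigger.
full_coverage = set().union(*coverage.values())

# Coverage of the distilled set from the text.
distilled_coverage = coverage["INPUT2"] | coverage["INPUT5"]

print(sorted(full_coverage))                # ['F0', 'F1', 'F2', 'F3', 'F4', 'F5']
print(distilled_coverage == full_coverage)  # True: 2 inputs cover as much as 10
```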
Now in most programs, we can't cleanly separate the functionalities being called, and even if we could, execution of the same functionality could differ based on the input. So instead we do code coverage analysis based on the basic blocks inside the targeted program. Basic blocks (https://en.wikipedia.org/wiki/Basic_block gives a longer/better description) are blocks of code with no branching inside, so if a block starts executing, it will continue until its end (unless an exception happens).
With a large corpus of inputs, we can record the basic block coverage that each of them causes in the targeted program and then distill (minimize) the input set to a smaller size that still has the same coverage. By doing this, we increase the effectiveness of the fuzzing.
How much can we reduce? For me, it has usually been over a 500x reduction of the original set. I can't give exact numbers from back then anymore, but I typically had corpora of 1M to 3M files and minimized sets of 1500 to 3500 for the more common filetypes (pdf, doc, xls, ppt...).
Corpus distillation steps
As described, corpus distillation is quite a simple concept, but as with most things, the hard part is the implementation. To do the entire corpus distillation and generate a fuzzing base set for mutations, you have to go through these steps:
- Collecting the corpus.
- Recording the coverage for every corpus element.
- Minimizing the corpus based on coverage.
Step 1: Collecting the corpus
You need a lot of proper inputs, and in the current case, it means files. So where do you go if you need to find something? Google, of course!
Let's say you want to get a huge corpus of pdf files. You google "filetype:pdf" and Google responds with nothing. So you google "filetype:pdf something" and Google reports around 195M results.
Great, right? Just download some part of those 195M files and you are golden? Not that easy: while Google says it has that many results, it does not want to give you all of them.
So you can't get them that easily. My solution has been to do alphabetic searches: I start with "filetype:pdf A", "filetype:pdf B", ..., "filetype:pdf Z", "filetype:pdf AA", "filetype:pdf AB", and so on.
Now, since nobody in their right mind will do such searches manually, you have to automate this. And that is where another roadblock gets in the way - Google does not like you automating such searches. You could use the "Custom Search JSON API", but it has fairly substantial limitations. So what are the options, unless you want to teach a deep neural network to solve Google's "are you human" challenges?
I personally simply avoided the protection mechanism by making sure that every two searches were at least 48 seconds apart (this value has changed a couple of times and might depend on other things too). But this also heavily limited the number of searches I could do in a day (only 1800). And while this is enough if you have a couple of weeks and want fewer than a million files, it's still not ideal.
The way around this, in turn, is to have multiple IPs from which to do such searches. You might have a lot of friends who can run small scripts on some of their machines (if they trust you). Or you could rent a lot of the cheapest VMs with public IPs from some cloud. F1-micro from Google Cloud used to work well for this (dirt cheap and had a public IP by default).
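As a rough illustration of the throttled alphabetic search loop described above, here is a small Python sketch. The `run_search` function is a hypothetical placeholder (the actual search logic lives in Google.py from the toolset); only the query generation and the 48-second spacing between searches come from the text:

```python
import itertools
import string
import time

SEARCH_DELAY = 48  # seconds between two searches, per the rate-limit workaround above


def query_terms():
    """Yield 'A'..'Z', then 'AA', 'AB', ... for the alphabetic search strategy."""
    for length in itertools.count(1):
        for letters in itertools.product(string.ascii_uppercase, repeat=length):
            yield "".join(letters)


def run_search(query):
    """Hypothetical placeholder: run one Google search and return result URLs."""
    raise NotImplementedError("plug in your own search code here, e.g. Google.py")


def collect_links(filetype="pdf", max_queries=1800):
    links = set()
    for i, term in enumerate(query_terms()):
        if i >= max_queries:  # ~1800 searches per day fit into a 48 s spacing
            break
        links.update(run_search(f"filetype:{filetype} {term}"))
        time.sleep(SEARCH_DELAY)
    return links
```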
NB: I must say that I'm not 100% sure that renting VMs from Google to bypass Google's own search rate limiting is allowed under the terms of service. So I'm not recommending doing it; I'm only saying that it works, and since you will still be paying Google for it, morally it's not that bad, I hope...
Additional notes for step 1
- The code for doing searches from Google can be found at https://github.com/FoxHex0ne/Rehepapp/blob/master/Scripts/Libs/Google.py.
- In the collection phase, don't collect the files themselves, only links to them, especially if you are collecting via multiple machines. You can still download them to check their magic signatures or other aspects, but don't store them. It's much simpler to later merge the URLs, or to gather them right away into a single database. I usually do exactly that and store the hash and size along with the URL (see the sketch after this list).
- When distributing the downloading tasks to multiple machines, always use search ranges. For example, one machine does all the searches with phrases that start with "A", a second with "B", etc. If you are using a central server that organizes it all, then things can be tuned even more finely.
- There will be a lot of collisions in the search results, both from the same addresses and from the same files at different addresses. This has to be accounted for.
- If you decide not to download entire files when collecting the URLs, still download the first couple of bytes to check the filetype magic value. Google returns results based on the extension, which means some of the returned files are not actually of the correct type.
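A minimal sketch of the kind of single database mentioned above, assuming SQLite and a simple table with the URL as the unique key so that duplicate links and duplicate files (same hash) are easy to filter out; the schema and column names are just an illustration, not the toolset's actual format:

```python
import sqlite3

con = sqlite3.connect("corpus_links.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS links (
        url   TEXT PRIMARY KEY,   -- the same URL reported twice is stored once
        hash  TEXT,               -- e.g. SHA-256 of the file, for cross-URL duplicates
        size  INTEGER             -- file size in bytes
    )
""")


def add_link(url, file_hash=None, size=None):
    # INSERT OR IGNORE silently drops URLs we have already seen.
    con.execute("INSERT OR IGNORE INTO links (url, hash, size) VALUES (?, ?, ?)",
                (url, file_hash, size))
    con.commit()
```

Later, something like `SELECT MIN(url) FROM links GROUP BY hash` can keep one URL per unique file.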
Step 2: Calculating coverage
As mentioned at the beginning, my method of calculating code coverage differs from most other tools. The fastest way to get code coverage is to inject tracing code into the executable's binary at compile time (one example is the SanitizerCoverage feature). This enables tracing not only at the basic-block level but at the edge level - meaning coverage is measured based on the branching path into a basic block, not the block alone. This is almost always the best and fastest solution, but it can't be used on closed-source software.
For closed-source software, binary instrumentation is the most common method, and Intel PT could be even better. Both are much larger topics, and I will not try to describe them here. I will only say that binary instrumentation is in many cases quite slow (not always), and I think the future is in Intel PT or other hardware-supported features that might appear.
My own solution consists of a preparation phase and a main loop:
- Preparation
- Use IDA Pro or some other such tool to record the locations of all the basic blocks in the executable and the libraries it uses.
- Inside the analyzed files, insert 0xCC (software breakpoint) at the beginning of every basic block (see the planting sketch after this list).
- Main loop
- Start the targeted program under the debugger.
- Wait for a software breakpoint to be triggered (if none has happened for X seconds, stop the program).
- If a software breakpoint is hit, record the location, restore the original value of the byte, and continue execution (don't forget to decrement the instruction pointer).
- Go back to waiting for the next breakpoint.
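Below is a minimal Python sketch of the planting step (conceptually what CoveragePlanter.exe does), under the assumption that the basic-block list has already been translated into file offsets within a copy of the binary; the file names and offsets in the example are made up:

```python
def plant_breakpoints(binary_path, offsets, patched_path, map_path):
    """Write 0xCC at every basic-block start and save the original bytes,
    so the debugger can restore each byte after its breakpoint fires."""
    with open(binary_path, "rb") as f:
        data = bytearray(f.read())

    original = {}                 # offset -> original byte
    for off in offsets:
        original[off] = data[off]
        data[off] = 0xCC          # INT3 software breakpoint

    with open(patched_path, "wb") as f:
        f.write(data)

    # Keep the offset/original-byte map; it is needed to restore bytes at runtime.
    with open(map_path, "w") as f:
        for off, byte in sorted(original.items()):
            f.write(f"{off:#x} {byte:#04x}\n")


# Example (hypothetical file names and offsets):
# plant_breakpoints("target.dll", [0x1010, 0x1040, 0x10a2],
#                   "target_patched.dll", "target_bb_map.txt")
```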
Such a solution means that every basic block that gets executed will trigger a software breakpoint once per execution. With every triggered breakpoint, the targeted program stops, context switches to the debugger, the information is recorded, the target program's memory is patched back, and the execution context is switched back again. This is a LOT of extra CPU cycles for recording a single basic block, even though it happens only once per basic block during a single execution. To reduce this overhead, I used the following approach:
- Download 100 inputs(files)
- Run code tracing with these 100 files and all the basic block triggers
- Find 5 files that have the best coverage out of these 100
- Remove all the basic block triggers that were covered by these 5
- Download 1000 new inputs
- Run code tracing with these 1000 files and the remaining basic block triggers
- Find 45 files that have the best coverage out of these 1000
- Remove all the basic block triggers that were covered by these 45
At the end of such filtering, you are left with 50 files that you can keep for your final set, plus a reduced basic block list. For most software, this is enough to get practically real-time execution for almost any input used later. The only exceptions are inputs that trigger some quite rare functionality, and those still hitting breakpoints is actually a good thing.
Just to be sure, I usually do another such filtering round with 10000 new files, from which I take out 100. You could also keep going further.
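Here is a small Python sketch of one such filtering round, assuming coverage has already been recorded as a set of basic-block addresses per file; it greedily picks the `k` best files (in the iterative sense clarified in the notes below) and returns both the chosen files and the reduced trigger list:

```python
def filtering_round(coverages, k):
    """coverages: dict mapping file name -> set of basic-block addresses it hit.
    Returns (selected_files, remaining_blocks_to_keep_planting)."""
    all_blocks = set().union(*coverages.values())
    covered = set()
    selected = []

    for _ in range(k):
        # Pick the file adding the most blocks not yet covered by earlier picks.
        best = max(coverages, key=lambda f: len(coverages[f] - covered))
        if not coverages[best] - covered:
            break                      # nothing new left to gain
        selected.append(best)
        covered |= coverages[best]

    # Breakpoints already covered by the selected files need no further planting.
    remaining = all_blocks - covered
    return selected, remaining


# Example of the first round from the text:
# selected, remaining = filtering_round(coverages_of_100_files, k=5)
```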
Additional notes for step 2
- When I wrote "find X files that have the best coverage out of these Y", I did not mean that you simply choose the X best right away. I meant that you take the single best one, dismiss the basic blocks covered by it, then select the next best without counting the blocks already covered by the first one, and so on.
- The more filtering steps you do, the faster the later executions are. But it's not reasonable to go too crazy with it, because it could reduce the quality of your final set (too many similar inputs).
Step 3: Minimizing
At this point, you should have a vast corpus and coverage information for each element in it. If you have used my tools and my approach, then you already have some files in the final set as well. What is left is to calculate the rest of it. This seems easy, but in reality there is no way to find the BEST solution without trying all the options (it is essentially the set cover problem, which is NP-hard). So the easiest, and in my opinion good enough, solution is to use a loop like this:
- Find the one input with the best coverage in the remaining set.
- Remove all the basic blocks covered by this input from all the coverage records.
- Put this input into the final set.
- If any inputs remain that cover basic blocks not yet removed, go back to the first step.
- All done.
This does not give the optimal solution, of course, but in my experience a good enough one.
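For reference, here is a minimal Python version of that loop, operating on the same kind of `file -> set of basic blocks` mapping as the earlier sketches (the names are mine, not the toolset's):

```python
def minimize_corpus(coverages):
    """Greedy distillation: repeatedly take the input covering the most
    not-yet-covered basic blocks until nothing new can be added."""
    remaining = {f: set(blocks) for f, blocks in coverages.items()}
    final_set = []

    while remaining:
        # Step 1: the input with the best (largest remaining) coverage.
        best = max(remaining, key=lambda f: len(remaining[f]))
        new_blocks = remaining.pop(best)
        if not new_blocks:
            break                      # step 4: no input adds new blocks, so stop
        final_set.append(best)         # step 3: keep it in the final set

        # Step 2: remove its blocks from every other coverage record.
        for blocks in remaining.values():
            blocks -= new_blocks

    return final_set
```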
Additional notes for step 3
- Depending on the size of the corpus and the average number of basic blocks per input, this calculation can take a very long time. In the case of millions of inputs, it can take days even with the optimized basic block list (using the methods described in step 2).
- Storing the coverage info in a database makes writing the minimizer a bit easier (a possible layout is sketched below). It might not be as fast as a finely tuned tool written just for that, but a lot of time is saved on development. And since you calculate coverage only once per input type, a 2-day vs. 3-day difference is not so terrible.
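A possible database layout for the coverage info, again assuming SQLite; this is only an illustration of the idea, not the format my tools actually use:

```python
import sqlite3

con = sqlite3.connect("coverage.db")
con.executescript("""
    CREATE TABLE IF NOT EXISTS inputs (
        id   INTEGER PRIMARY KEY,
        path TEXT UNIQUE              -- the corpus file this coverage belongs to
    );
    CREATE TABLE IF NOT EXISTS hits (
        input_id INTEGER REFERENCES inputs(id),
        block    INTEGER,             -- basic block address (module offset)
        PRIMARY KEY (input_id, block)
    );
""")

# The minimizer can then pull each input's coverage as a set:
def coverage_of(input_id):
    rows = con.execute("SELECT block FROM hits WHERE input_id = ?", (input_id,))
    return {block for (block,) in rows}
```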
Rehepapp toolset
So I have made my own toolset for such coverage analysis. It still has a lot of issues that I will try to fix as time allows, but sadly this is not a priority for now. Anyway, I will now list all the tools in this toolset and what they can be used for:
- Tracer.exe - This does actual coverage recording. It takes as an input a list of basic blocks (generated by the IDA script) and records all basic blocks that are reached.
- Analyzer.exe + IDA_GetBasicBlocks.py - They can be used to create a list of basic blocks from exe/dll files of some application.
- CoverageCommon.exe - Tool to find common basic blocks covered by multiple coverage recordings.
- CoverageMain.exe - When combined with Tracer.exe, it can be used to record coverage information for multiple files.
- CoverageMinimizer.exe - Find X best coverages from the list of coverages.
- CoveragePlanter.exe - Plants software breakpoints into the exe/dll files based on the basic block file. Has to be run before running Tracer.exe.
- CoverageReducer.exe - Removes basic blocks from the existing list based on coverage recordings.
- Server.jar - HTTP server that can be used for distributed file collecting and coverage testing.
- BotCollector.py - Used for distributed file collecting.
- BotCoverage.py - Used for distributed coverage testing.
I must say it would be a bit of a long story to show the entire typical usage of Rehepapp here, so I will probably do something a little different and try to create a demo video on how to best use my tools. If so, I will add a link to it here later.