By Souheil Moghnie, NortonLifeLock with Kostya Serebryany, Google, Rohit Shambhuni, Autodesk and Adith Sudhakar, VMware
We are continuing our Focus on Fuzzing blog series with a quick overview of the different types of fuzzers. Understanding the taxonomy of fuzzing can help when thinking about selecting the right fuzzing tool for your project and determining whether a given fuzzing effort is the most effective approach for meeting your specific testing needs. It will also set the stage for our upcoming blog posts.
This post provides context for those new to fuzzing, along with information that experienced fuzz testers may still find useful.
Let’s start with the different types of fuzzers. According to a commonly accepted framework published by Microsoft, fuzzers can be loosely classified along three dimensions: 1) knowledge of the input format; 2) knowledge of the target application structure; and 3) method of generating new inputs.
Knowledge of Input Format
Does the fuzzer have any awareness of the expected input data and its format? If not, it is considered a “dumb” fuzzer. In “dumb fuzzing,” input data is generated or mutated randomly without awareness of the expected input data and its format, e.g. by flipping some number of bits at random locations within an input.
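As a rough illustration of the bit-flipping idea, here is a minimal Python sketch of a dumb mutator. The function name `flip_random_bits` and the sample HTTP request line are our own illustrative choices and are not taken from any particular fuzzer.

```python
import random

def flip_random_bits(data: bytes, n_flips: int, rng: random.Random) -> bytes:
    """Flip n_flips randomly chosen bits in data, with no knowledge of its format."""
    if not data:
        return data
    buf = bytearray(data)
    for _ in range(n_flips):
        pos = rng.randrange(len(buf) * 8)  # pick any bit position in the input
        buf[pos // 8] ^= 1 << (pos % 8)    # flip that bit
    return bytes(buf)

rng = random.Random(0)  # fixed seed so that runs are reproducible
seed = b"GET /index.html HTTP/1.1\r\n"
mutated = flip_random_bits(seed, 3, rng)
```

Note that the mutator never looks at what the bytes mean. The same bit may also be picked twice, in which case the two flips cancel out.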
In contrast, in “smart fuzzing,” input data is generated or mutated semi-randomly with awareness of the expected format, such as encodings (e.g. base-64 encoding); file formats (e.g. PE headers); relations (offsets, checksums, lengths, etc.); IP addresses; or even grammar. Today, “smart fuzzing” is more often referred to as structure-aware fuzzing.
However, smart versus dumb is not a binary, either/or distinction. The more useful question is where a given fuzzing technique falls between the two extremes. Fuzzers span a spectrum from the “dumbest” all the way to the “smartest,” with a wide range in between.
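To make the idea of format awareness concrete, here is a small Python sketch under stated assumptions. The record layout (a 2-byte big-endian length, a payload, then a CRC32 of the payload) is a hypothetical format invented for this example, as are the function names. The mutator changes only the payload and then recomputes the length and checksum, so the mutated input still gets past the parser's validity checks.

```python
import random
import struct
import zlib

def pack_record(payload: bytes) -> bytes:
    """Hypothetical format: 2-byte big-endian length, payload, CRC32 of payload."""
    return (struct.pack(">H", len(payload)) + payload
            + struct.pack(">I", zlib.crc32(payload)))

def unpack_record(record: bytes) -> bytes:
    """Parse a record, raising ValueError when the length or checksum is wrong."""
    if len(record) < 6:
        raise ValueError("record too short")
    (length,) = struct.unpack(">H", record[:2])
    if len(record) != 2 + length + 4:
        raise ValueError("length mismatch")
    payload = record[2:2 + length]
    (crc,) = struct.unpack(">I", record[2 + length:])
    if crc != zlib.crc32(payload):
        raise ValueError("bad checksum")
    return payload

def smart_mutate(record: bytes, rng: random.Random) -> bytes:
    """Mutate only the payload, then recompute the length and checksum fields."""
    payload = bytearray(unpack_record(record))
    if payload:
        payload[rng.randrange(len(payload))] = rng.randrange(256)
    payload.append(rng.randrange(256))  # also grow the payload by one byte
    return pack_record(bytes(payload))

original = pack_record(b"hello")
mutated_record = smart_mutate(original, random.Random(1))
```

A format-unaware bit flip in the same record would usually be rejected at the checksum comparison, so the code behind that check would rarely be exercised.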
It is a common misconception that smart fuzzers are always better than dumb fuzzers. While that holds for some use cases, a purely random fuzzer can sometimes be a better choice than a more complex smart fuzzer. We will elaborate on the efficacy of each type in subsequent blog posts.
Knowledge of Target Application Structure
The second area to consider is how familiar the fuzzer is with the structure of the code being tested (as opposed to the data being fuzzed, covered in the previous section). Most people involved in security testing are familiar with the terms white box and black box, and the same concepts can be applied to fuzzing. Black-box and white-box fuzzing can be described as follows:
- Black-box fuzzing: channeling of altered or generated data without knowledge or verification of which code branches were covered or not
- White-box fuzzing: some define it as channeling of input data with visibility into the structure of the code being fuzzed to maximize code coverage. Others define it as “performing a dynamic symbolic execution to collect constraints on inputs gathered from predicates in branch statements.”
You’ll sometimes hear the term “gray-box fuzzing,” which is often used to describe fuzzing done with a greater focus on maximizing code coverage. To avoid confusion, we use the term “coverage-guided fuzzing” to refer to both types.
With coverage-guided fuzzing, code coverage is the key metric to be maximized. It is used to ensure that generated inputs touch diverse parts of the code. Coverage can be collected either by instrumenting the source code (e.g. AFL-gcc or LLVM SanitizerCoverage) or the binary (via QEMU, Intel PT, PIN, etc.). How the collected coverage is then used also varies from fuzzer to fuzzer.
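The basic feedback loop can be sketched in a few lines of Python. This is a toy illustration only, not how AFL or any other real fuzzer is implemented: the `target` function is a made-up program that reports the branches it reached by hand, standing in for real instrumentation, and all names are our own.

```python
import random

def target(data: bytes) -> set:
    """Toy target that reports which branches it reached (manual instrumentation)."""
    covered = {"entry"}
    if len(data) > 0 and data[0] == ord("F"):
        covered.add("F")
        if len(data) > 1 and data[1] == ord("U"):
            covered.add("FU")
            if len(data) > 2 and data[2] == ord("Z"):
                covered.add("FUZ")
    return covered

def mutate(data: bytes, rng: random.Random) -> bytes:
    """Replace one random byte with a random value (data must be non-empty)."""
    buf = bytearray(data)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)

def coverage_guided_fuzz(seed: bytes, iterations: int, rng: random.Random):
    """Keep every input that reaches a branch not seen before."""
    corpus = [seed]
    seen = set(target(seed))
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        coverage = target(candidate)
        if not coverage <= seen:      # candidate reached at least one new branch
            seen |= coverage
            corpus.append(candidate)  # keep it as a basis for future mutations
    return corpus, seen

corpus, seen = coverage_guided_fuzz(b"AAAA", 200_000, random.Random(0))
```

Because inputs that reach a new branch are kept and mutated further, the fuzzer can discover the nested `F`, `U`, `Z` checks one byte at a time. A black-box fuzzer applying the same mutation to the original seed would need to get all three bytes right without any feedback.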
Given its importance and exceptional efficacy, we will dedicate an entire blog post to coverage-guided fuzzing (stay tuned).
Input Production Scheme
The last category has to do with how the data used for fuzzing is created. That is, are we generating new data from scratch, or are we using existing data and mutating it? In the former approach, the data is created anew (e.g. by randomly or semi-randomly generating bytes to create a file), whereas in the latter, existing data is used to produce the fuzzed data (e.g. by flipping bits in a file, or by appending or prepending data).
In other words, the two types can be summarized as follows:
- Generation – each subsequent iteration’s input data is produced independently of any previous input, and is typically based on a model of the input format
- Mutation – modification of existing input data is done according to certain patterns
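The two approaches listed above can be contrasted in a short Python sketch. The toy arithmetic grammar, the depth limit of 6, and the function names are assumptions made for illustration: `generate` builds each input from scratch using a model of the format, while `mutate` derives a new input from an existing one.

```python
import random

# Hypothetical toy grammar for arithmetic expressions, used only for illustration.
GRAMMAR = {
    "<expr>": [["<term>", "+", "<expr>"], ["<term>"]],
    "<term>": [["<digit>"], ["(", "<expr>", ")"]],
    "<digit>": [[d] for d in "0123456789"],
}

def generate(symbol: str, rng: random.Random, depth: int = 0) -> str:
    """Generation: build an input from scratch by expanding grammar rules."""
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol, emitted as-is
    rules = GRAMMAR[symbol]
    # Past the depth limit, pick the shortest rule so that expansion terminates.
    rule = min(rules, key=len) if depth >= 6 else rng.choice(rules)
    return "".join(generate(s, rng, depth + 1) for s in rule)

def mutate(seed: str, rng: random.Random) -> str:
    """Mutation: derive a new input by inserting, deleting, or replacing one character."""
    pos = rng.randrange(len(seed) + 1)
    op = rng.choice(["insert", "delete", "replace"])
    ch = rng.choice("0123456789+()")
    if op == "insert" or not seed:
        return seed[:pos] + ch + seed[pos:]
    pos = min(pos, len(seed) - 1)
    if op == "delete":
        return seed[:pos] + seed[pos + 1:]
    return seed[:pos] + ch + seed[pos + 1:]

rng = random.Random(0)
generated = generate("<expr>", rng)  # produced independently of any earlier input
mutated = mutate("(1+2)+3", rng)     # produced from an existing input
```

In this sketch, generated inputs always conform to the grammar, whereas mutated inputs may or may not remain well formed.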
Traditional fuzzing is simply described as the use of fuzzers that operate by channeling malformed and corrupted data to an application or service entry point. Modern fuzzers continue to improve security by generating valid as well as invalid inputs of highly structured types and continuing to reach deeper layers of the system under test.
Up Next
We’ll be referring to many of these terms as we continue to explore fuzzing practices. Stay tuned for our next post, which will take a closer look at coverage-guided fuzzing.