17 October 2019

When Results Are All That Matters: Consequences

by Andreas Zeller and Sascha Just; with Kai Greshake

In our previous post "When Results Are All That Matters: The Case of the Angora Fuzzer", we reported our findings when investigating the Angora fuzzer [1]. If you have not read that post yet, you should stop here and read our write-up first. There, we focus on our findings and problems that surprised us when experimenting with Angora.
In this article, we have collected some suggestions to advance the field of fuzzing and have a long-term impact on the reliability of software.

1. Science is about insights, not products.

To ensure scientific progress, we need to know which technique works, how, and under which circumstances. We write papers to document such insights so that the next generation of researchers, as well as the non-scientific world, can build on them. The value of a paper comes from the impact of its insights.

2. Scientists and companies can create tools.

It is fun to build a tool, and if it works well, all the better. Typically, this will involve not one single magical technique, but a multitude of techniques working together. Tools will have to succeed on the market, though, and will be evaluated not based on their insights, but their effectiveness.

Evaluating tools for their effectiveness can be part of a scientific approach. However, evaluation settings should

be fair and thus not be defined by tool authors; and
avoid overspecialization and thus involve tests not previously known to tool authors.

In other words, the only way to obtain reliable performance comparisons is by independent assessment. Other communities do this through specific tool contests that operate on secret benchmarks created for this very purpose. And of course, tools need to be available for evaluation in the first place. It is nice to see the security community adopt such practices, for example artifact evaluation.

3. Combinations of techniques must be assessed individually.

If results depend on a larger set of novel processing steps, the contribution of each must be assessed individually, for instance by replacing each processing step with a naive approach and assessing the impact of the change. All decisions affecting performance must be well motivated and documented.

Without assessing the impact of each step individually, one can still have a great tool, but the insight into what makes it great will be very limited. As an analogy: We know that Usain Bolt is a record-shattering sprinter; the scientific insight is to find out why.
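The step-by-step replacement described above can be sketched as a small ablation harness. This is a toy illustration only: the step names `generate` and `filter`, the list-of-integers data, and the "number of distinct values" score are all hypothetical stand-ins for a real tool's processing steps and its evaluation metric.

```python
from typing import Callable, Dict, List

Step = Callable[[List[int]], List[int]]


def run_pipeline(steps: Dict[str, Step], data: List[int]) -> int:
    """Apply the steps in order; the score is the number of distinct values left."""
    for step in steps.values():
        data = step(data)
    return len(set(data))


def ablate(novel: Dict[str, Step], naive: Dict[str, Step], data: List[int]) -> Dict[str, int]:
    """Score the full pipeline, then swap in one naive step at a time."""
    report = {"<all novel>": run_pipeline(novel, list(data))}
    for name in novel:
        variant = dict(novel)        # copy keeps the original step order
        variant[name] = naive[name]  # replace exactly one step
        report[name] = run_pipeline(variant, list(data))
    return report


# Hypothetical "novel" steps and their naive (identity) replacements.
novel_steps: Dict[str, Step] = {
    "generate": lambda xs: xs + [2 * x for x in xs],
    "filter": lambda xs: [x for x in xs if x % 3 != 0],
}
naive_steps: Dict[str, Step] = {
    "generate": lambda xs: xs,
    "filter": lambda xs: xs,
}

report = ablate(novel_steps, naive_steps, [1, 2, 3, 4])
for name, score in report.items():
    print(f"{name}: {score}")
```

Each row of the report shows the score when only that step is replaced by its naive version. Comparing the rows against the full pipeline indicates which steps actually carry the result and which ones contribute little or even hurt.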

4. Document your hypotheses, experiments, and results.

Good scientific practice mandates that experiments and their results be carefully documented. This helps others (but also yourself!) in assessing and understanding the decisions in the course of your project. If you make some design decision, such as a parameter setting, after examining how your software runs on some example, it is important that the motivation for this design decision can be traced back to the experiment and its result.

If this sounds like a lot of work, that's because it is. We're talking about the scientific method, not some fiddling around with parameters until we reach the desired result on a benchmark. Fortunately, there are great means to help you with these tasks. Jupyter Notebooks [8], for instance, allow you to collect your hypotheses (in natural language), your experiment design, its results (in beautiful and interactive graphs, among others), and your next refinement step, allowing anyone (as well as yourself) to understand how a specific result came to be. Be sure to place your notebooks (and code) under version control from day one, and throw in some tests and assertions for quality assurance. Control your environment carefully to make results reproducible for anyone.

5. Having benchmarks to compare tools and approaches is helpful, but brings risks.

Benchmarks are helpful means to assess the performance of tools. However, they bring two risks.

First, there is the risk of having researchers focus on the benchmark rather than insights. It is nice to have a well-performing tool, but its scientific value comes from the insights that explain its performance.
Second, benchmarks bring the risk of researchers knowingly or unknowingly optimizing their tools for this very benchmark. We have seen this with compilers, databases, mobile phones, fault localization, machine learning, and now fuzzing. To mitigate the risk of overspecialization, tool performance should be compared on programs they have not seen before.

A benchmark like LAVA-M is representative of detecting buffer overflows during input processing, but of very little else. As the LAVA creators state themselves, "LAVA currently injects only buffer overflows into programs" and "A significant chunk of future work for LAVA involves making the generated corpora look more like the bugs that are found in real programs." [3].

It has been shown that optimizing against the artificial LAVA bugs, such as 4-byte string triggers, lets even very naive approaches yield impressive results [2]. The conceptual match between the features injected by LAVA and those features exploited by fuzzers such as Angora is striking.

The question of what makes a good benchmark for fuzzers and test generation is still open. One possible alternative to LAVA-M is Google's fuzzing test suite which contains a diverse set of programs with real bugs [5]. Michael Hicks has compiled excellent guidelines on how to evaluate fuzzers [4, 6].

6. Researchers must resist the temptation of optimizing their tools towards a specific benchmark.

While developing an approach, it is only natural to try it out on some examples to assess its performance, such that results may guide further refinement. The risk of such guidance, however, is that development may result in overspecialization – i.e., an approach that works well on a benchmark, but not on other programs. As a result, one will get a paper without impact and a tool that nobody uses.

Every choice during implementation has to be questioned: "Will this solve a general problem that goes way beyond my example?" One should take that choice only with a positive, well-motivated answer, possibly involving other experts who would be asked in the abstract. We recommend that during implementation, only a very small set of examples should be used for guidance; the evaluation should later be run on the full benchmark.
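One way to enforce this discipline is to fix the guidance set up front and keep everything else out of sight until evaluation. The following is a minimal sketch under stated assumptions: the function `split_subjects`, the subject names, and the seed value are hypothetical and not taken from any existing benchmark or tool.

```python
import random
from typing import List, Tuple


def split_subjects(subjects: List[str], n_dev: int, seed: int) -> Tuple[List[str], List[str]]:
    """Pick a small, fixed development set; the rest stays unseen until evaluation."""
    rng = random.Random(seed)    # fixed seed, so the split itself is reproducible
    shuffled = sorted(subjects)  # sort first, so the input order does not matter
    rng.shuffle(shuffled)
    return shuffled[:n_dev], shuffled[n_dev:]


# Hypothetical benchmark subjects.
benchmark = ["prog_a", "prog_b", "prog_c", "prog_d", "prog_e", "prog_f"]
dev_set, held_out = split_subjects(benchmark, n_dev=2, seed=2019)

print("guidance during implementation:", dev_set)
print("untouched until the final evaluation:", held_out)
```

Recording the seed and the resulting split in the evaluation plan makes it possible to check later that design decisions were guided by the development set only.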

Good scientific practice mandates creating a research and evaluation plan with a clear hypothesis well before the evaluation, and possibly even before the implementation. This helps to avoid being too biased towards one's own approach. Note that the point of the evaluation is not to show that an approach works, but to precisely identify the circumstances under which it works and the circumstances under which it does not work.

Papers should investigate those situations and clearly report them. Again, papers are about insights, not competition.

7. It is nice to have tools discovering vulnerabilities...

...especially as these vulnerabilities have a value on their own. However, vulnerabilities do not follow statistical distribution rules (hint: otherwise it would be easier to find them). Having a tool find a number of vulnerabilities in one program, therefore, is not necessarily a good predictor of its ability to find bugs in another program.

In any case, the process through which vulnerabilities were found must be carefully documented and made fully reproducible; for random-driven approaches such as fuzzers, one thus needs to log and report random seeds. Obviously, one must be careful not to optimize tools towards given vulnerabilities.
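Logging seeds costs only a few lines. The sketch below is a toy stand-in for a fuzzing campaign, not the interface of any real fuzzer: `run_campaign` and its parameters are hypothetical, and the "inputs" are just random byte strings. The point is that the seed is chosen explicitly, stored with the results, and sufficient to replay the run.

```python
import json
import random
import secrets
from typing import Optional


def run_campaign(seed: Optional[int] = None, n_inputs: int = 3, length: int = 8) -> dict:
    """Generate random inputs and return them together with the seed that produced them."""
    if seed is None:
        seed = secrets.randbits(32)  # fresh seed, but recorded rather than hidden
    rng = random.Random(seed)        # private generator; no hidden global state
    inputs = [
        bytes(rng.randrange(256) for _ in range(length)).hex()
        for _ in range(n_inputs)
    ]
    return {"seed": seed, "inputs": inputs}


first = run_campaign()
print(json.dumps(first))  # this record goes into the experiment log

# Replaying with the logged seed reproduces exactly the same inputs.
replay = run_campaign(seed=first["seed"])
assert replay == first
```

A real campaign has further sources of nondeterminism (timing, threads, the environment), so the seed is necessary for reproduction but may not be sufficient on its own.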

For fuzzing tools, the technical challenge is to find inputs that cover a wide range of behavior across the program and not only during input processing and error handling. Let us remind you that during testing, executing a location is a necessary condition for finding a bug in that very location. Since we are still far from reaching satisfying results in covering functionality, improvements in code coverage are important achievements regardless of bugs being found.

8. What does this mean for reviewers and authors?

Papers must clearly show how the insights of the paper contribute to the result, both in terms of motivation as well as in evaluation.

In many cases, it will be hard to describe all the details of all the necessary steps in the paper. Therefore, it will be necessary to supply an artifact that allows not only for reproduction but also for applying the approach to subjects not seen before; again, all design decisions in the code must be motivated and documented. This is tedious; this is rigorous; this is how science works.

Reviewers should be aware that an approach is not simply "better" because it performs well on a benchmark or because it found new bugs. Approaches have a long-term impact not only through performance, but also through innovation, generality, and simplicity. Researchers are selected as reviewers because the community trusts them to assess such qualities. Tool performance that is achieved through whatever means has little scientific value.

Having said that, conference organizers should create forums for tool builders and tool users to discuss lessons learned. Such exchanges can be extremely fruitful for scientific progress, even if they may not be subject to rigorous scientific assessment. Tool contests with clear and fair rules would allow assessing the benefits and drawbacks of current approaches, and would again foster and guide discussions on where the field should be going. A contest like Rode0day [7] could serve as a starting point.

Conclusion

Having tools is good, and having tools that solve problems is even better. As scientists, however, we also must understand what works, what does not, and why. As tools and vulnerabilities come and go, it is these insights that have the longest impact. Our papers, our code, and our processes, therefore, must all be shaped to produce, enable, assess, and welcome such insights. This is the long-term path by which we as scientists can help to make software more reliable and more secure.

Acknowledgments. Marcel Böhme, Cas Cremers, Thorsten Holz, Mathias Payer, and Ben Stock provided helpful feedback on earlier revisions of this post. Thanks a lot!

References

[1] P. Chen and H. Chen, "Angora: Efficient Fuzzing by Principled Search." 2018 IEEE Symposium on Security and Privacy (IEEE S&P), San Francisco, CA, 2018, pp. 711-725.
[2] Of bugs and baselines
[3] B. Dolan-Gavitt et al., "LAVA: Large-Scale Automated Vulnerability Addition," 2016 IEEE Symposium on Security and Privacy (S&P), San Jose, CA, 2016, pp. 110-121.
[4] George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. 2018. Evaluating Fuzz Testing. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS '18). ACM, New York, NY, USA, 2123-2138.
[5] Google's fuzzing test suite
[6] Michael Hicks, Evaluating Empirical Evaluations (for Fuzz Testing)
[7] Rode0day
[8] Project Jupyter