How Semgrep Cut Taint Analysis Time by 75%¶
Ch12.096 How Semgrep Cut Taint Analysis Time by 75%¶
📊 Level ⭐⭐ | 4.0KB |
entities/how-semgrep-cut-taint-analysis-time-by-75.md
How Semgrep Cut Taint Analysis Time by 75%¶
Markdown Content: Semgrep Pro Engine 1.158.0 and onwards is now shipping a redesigned taint analysis engine resulting in up to 75% speedups on full scans. This is the second in a three part series on improving the performance of Semgrep.
In the previous blog post we discussed a new continuous profiler we released for OCaml, called Pyro Caml. The motivation behind building it was so we could improve the performance of Semgrep, whose core analysis engine is written in OCaml. In this blog post we’ll explore how we used it to validate where we thought our biggest bottleneck was, and how doing something once instead of twice is a great way to improve the performance of your programs. Specifically, the 95th percentile of Semgrep scan times went from 10 minutes, to 7 minutes 30 seconds, and our P99 went from a very noisy ~45 minutes on average, to a much more consistent 35 minutes. Additionally, the number of scans reaching the max allowable scan time dropped significantly.
Motivation: Why Taint Analysis Was Costing a Third of Our CPU Time
¶
How Taint Analysis Works: Sources, Sinks, and Data Flow
¶
The Semgrep engine has something called a matching engine that runs a set of rules . There’s a bunch of great explanations of matching out there, but in short, Semgrep will take patterns like do_thing( … ) and flag code such as do_thing(arg1, arg2) or do_thing(x, y). This is useful for finding all sorts of vulnerabilities, but it was supercharged when we released support for taint analysis, all the way back in 2021. This would let you find code patterns such as:
Specifically, you could write patterns that track the flow of data through a program, and flag places they might “taint” in that flow, such as tracking user input flowing into an arbitrary eval function, like in the above example.
This first pass of taint analysis was only intra-procedural, meaning that within a file we could usually detect how the data flows. If the user input flowed into some function that was defined in another file though, and that other function had an eval, we wouldn’t detect it. This is called interfile analysis, for obvious reasons.
Many users had been asking for exactly this, and so two year
→ 原文存档