Avoiding The Needless Multiplication Of Forms
Static analysis in LLVM/Clang
Dr Andrew Moss, 2020-11-19
Call path grep
Reviewing pieces of code requires an understanding of both details and context. We need to understand what the code does - but also we need to know when that happens, and what is being operated on. This typically means tracing the call paths to find out where functions are called from. Sometimes we do it the other way around and chase the calls that a piece of code makes to find out what they do.
Normally I would do this with grep so I have some basic functions defined in my shell:
findc ()
{
    find . -name '*c' -type f -exec grep -Hn "$1" \{\} \;
}
findch ()
{
    find . -name '*[ch]' -type f -exec grep -Hn "$1" \{\} \;
}
findh ()
{
    find . -name '*h' -type f -exec grep -Hn "$1" \{\} \;
}
The basic approach is a loop of:
- Grep for a function name
- Filter out the interesting uses
- Read those bits of code and check what they call / where they are called from
- Repeat until happy
I've watched other people do this in IDEs and they seem to follow the same process, but using the "Go to definition" or "Go to references" buttons instead. I prefer to do it in the shell because the process is basically a tree-search - scrolling back through a terminal gives a record of the visited points, which is harder to do mentally using an IDE.
Normally I would keep notes while I do this, which look like sketches that join together sets of possible traces through the code with notes on context (i.e. which state is used and how it propagates). Typically they are messy partial views of pieces of code and they're made to be thrown away after use, so they look something like this:
Understanding the context and extracting the relevant part of the state is non-trivial. But walking the call-graph to find the parts of traces is quite easy. I've wanted a tool to do this for a while so it seems like a nice toy to write against the LLVM code.
Matching function names works well with regular expressions, so that is the starting point, with functionality for matching calling sequences added as a simple extension to the regex syntax. Each regex matches a function name, and > matches a call between functions. It can be used like this:
| Example expression | Matches |
| --- | --- |
| .* | Any function in the program |
| .*>.* | Any caller/callee pair |
| foo.*>mutex_.* | Any caller whose name starts with foo that calls a function whose name starts with mutex_ |
| main>+uv_.* | Any path from main to a function whose name starts with uv_ |
| init_x>+init_y>+.*alloc | Any call-path from init_x through init_y to a function that ends with alloc |
So the new addition > acts like a binary operator (syntactically) between the regexes that match function names. It can be modified as >+ to indicate one or more calls between the functions.
The anchors on function names are implicit so .* can be used to match substrings within a name. One other useful construct is an anchor for the paths to indicate entry points in the program:
| Example expression | Matches |
| --- | --- |
| ^.*>+foo | Any path from an entry to foo |
For the regular expression match we will use a variation on Thompson's original regex matching code because it is very simple to implement. The extension we need is for the call operator, which just iterates over successor nodes in a DFS of the call-graph. Adding it to the regex matching code means interleaving graph-walking steps with matching steps as we scan the regex.
The LLVM interface for building and using the call-graph is straight-forward:
llvm::LLVMContext C;
llvm::Module WP("wholeprogram", C);
llvm::Linker linker(WP);
// ... merge modules as shown in the previous post ...
llvm::CallGraph cg(WP);
// ... Iterating over the externally visible functions (possible entry points) ...
auto *outside = cg.getExternalCallingNode();
for (auto entry: *outside) {
    auto *F = entry.second->getFunction();
    // ...
}
// ... or iterating over all functions ...
for (auto &N: cg)
    // ...
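As a small aside that is not in the post: before writing any matching code it can be useful to eyeball the graph that LLVM has built, and CallGraph can print itself in a textual form:
// Dump the whole call-graph; each node lists the functions it calls.
cg.print(llvm::outs());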
We can wrap up the graph-walking parts into a single function to make the matcher implementation simpler:
void matchSuccs(llvm::CallGraphNode* node, std::vector<std::string> &path, std::set<llvm::CallGraphNode*> &visited,
                bool orMany, char *regex) {
    for (auto succ: *node) {
        if (visited.find(succ.second) == visited.end()) {
            match(succ.second, path, visited, orMany, regex);
        }
    }
}
The interface encodes what we need to perform a DFS on the graph:
- The visited set is used to mark nodes so that we terminate on loops in the call-graph. It can be cleared between DFS passes (i.e. as we iterate over each starting point in the call-graph); a sketch of such a driver loop follows this list.
- The path just records the trace that we used to get to this point in the graph. It is used to output results when the whole regex has been matched.
- The code assumes that match will add nodes to the visited set to guarantee termination.
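The driver that kicks off the search isn't shown in the post, but given the pieces above it could look something like the sketch below. runQuery is a hypothetical name, the pattern is assumed to arrive as a plain char*, and a leading ^ selects the entry points as described in the table earlier; match is the routine defined in the next listing.
// A hypothetical driver: run one DFS per starting point, clearing the visited
// set each time, and use the ^ anchor to restrict starting points to entries.
void runQuery(llvm::CallGraph &cg, char *pattern) {
    std::vector<std::string> path;
    std::set<llvm::CallGraphNode*> visited;

    if (*pattern == '^') {
        // Anchored: only externally visible functions can start a path.
        for (auto entry : *cg.getExternalCallingNode()) {
            visited.clear();
            match(entry.second, path, visited, false, pattern + 1);
        }
    } else {
        // Unanchored: any function in the program can start a path.
        for (auto &N : cg) {
            visited.clear();
            // N.second is a std::unique_ptr<CallGraphNode> in recent LLVM versions.
            match(N.second.get(), path, visited, false, pattern);
        }
    }
}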
With this in place the matcher code is a simple extension of Thompson's code - we just implement a DFS walk in the same style and glue it into the matcher:
void match(llvm::CallGraphNode* node, std::vector<std::string> &path, std::set<llvm::CallGraphNode*> &visited,
           bool orMany, char *regex) {
    auto *F = node->getFunction();
    if (!F || !F->hasName()) // null nodes get inserted as calls from any external node
        return;
    std::string name = F->getName().str();
    visited.insert(node);
    if (*regex==0) {
        emitResult(path);
        llvm::outs() << "\n";
        return; // continue matching >+ as suffix?
    }
    path.push_back(name);
    if (*regex=='>') {
        if (regex[1]=='+')
            matchSuccs(node, path, visited, true, regex+2);
        else
            matchSuccs(node, path, visited, false, regex+1);
        path.pop_back();
        return;
    }
    int m = matchNodeName(regex, name.c_str());
    if (m<0 && !orMany) {
        path.pop_back();
        return;
    }
    if (m>0) {
        if (regex[m]==0) {
            emitResult(path);
            llvm::outs() << "\n";
        }
        else
            matchSuccs(node, path, visited, false, regex+m); // path-step, overwrite orMany flag
    }
    if (orMany)
        matchSuccs(node, path, visited, true, regex);
    path.pop_back();
}
We have to check whether the node that we are traversing corresponds to a function or not (the call-graph that LLVM builds has extra nodes with meta-data inserted). If we have hit the end of the regex then it was a match, so output the result. If we hit the > operator then walk the graph; this recursively calls match on the new nodes with the char *regex advanced to the point for the node name. The path is updated before we recurse and cleared when control-flow returns - it is just a simple stack recording the trace.
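The emitResult helper isn't shown; judging from the space-separated traces in the output further down, a minimal version could be as simple as this (the newline after each trace is added by match):
// Print the recorded trace as space-separated function names.
void emitResult(const std::vector<std::string> &path) {
    const char *sep = "";
    for (const auto &name : path) {
        llvm::outs() << sep << name;
        sep = " ";
    }
}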
All of the string-matching parts of the regex are handled inside matchNodeName. Each of the procedures allows backtracking - if we match then the result is output as a side-effect and control returns. This means that processing the >+ modifier on the operator just means trying the following regex at each possible successor as we walk over the tree - handled by recursing back into matchSuccs at the current node.
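matchNodeName itself is not shown above either, so the following is just a sketch of the shape it could take: a Kernighan/Pike-style recursive matcher restricted to the literal, . and * constructs used in the examples, with the implicit anchors described earlier. The return value is chosen to fit the checks in match: the length of the leading name segment of the regex (everything up to the next > or the end of the pattern) on a successful match, and -1 otherwise.
static bool matchHere(const char *re, const char *text);

// c* has already been seen; try the rest of the segment against every suffix of text.
static bool matchStar(char c, const char *re, const char *text) {
    do {
        if (matchHere(re, text))
            return true;
    } while (*text != 0 && (*text++ == c || c == '.'));
    return false;
}

static bool matchHere(const char *re, const char *text) {
    if (*re == 0 || *re == '>')             // end of this name segment:
        return *text == 0;                  // implicit anchor, the name must be used up
    if (re[1] == '*')
        return matchStar(re[0], re + 2, text);
    if (*text != 0 && (*re == '.' || *re == *text))
        return matchHere(re + 1, text + 1);
    return false;
}

// Returns the length of the leading name segment (up to '>' or the end of the
// pattern) if it matches the whole function name, or -1 if it does not.
int matchNodeName(const char *regex, const char *name) {
    int len = 0;
    while (regex[len] != 0 && regex[len] != '>')
        ++len;
    return matchHere(regex, name) ? len : -1;
}
Treating > as an end-of-segment marker is what lets the same pattern string be shared between the string matcher and the graph walker.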
If the netdata test case has been prepared then there is a script to call the analyser in the test case directory:
$ callgrep/build.sh && tests/netdata/callgrep.sh 'main>+rrd.*'
main appconfig_set appconfig_section_create callocz fatal_int netdata_cleanup_and_exit rrdhost_cleanup_all
main appconfig_set appconfig_section_create callocz fatal_int netdata_cleanup_and_exit rrdhost_cleanup_all rrdhost_delete_charts
main appconfig_set appconfig_section_create callocz fatal_int netdata_cleanup_and_exit rrdhost_cleanup_all rrdhost_delete_charts rrdset_delete_custom
main appconfig_set appconfig_section_create callocz fatal_int netdata_cleanup_and_exit rrdhost_cleanup_all rrdhost_cleanup_charts
main appconfig_set appconfig_section_create callocz fatal_int netdata_cleanup_and_exit rrdhost_cleanup_all rrdhost_cleanup_charts rrdset_delete_obsolete_dimensions
main appconfig_set appconfig_section_create callocz fatal_int netdata_cleanup_and_exit rrdhost_cleanup_all rrdhost_cleanup_charts rrdset_save
...
The backtracking search is not particularly efficient; the usual fix would be to perform the NFA-to-DFA construction to compile state-tracking information into the search. But the constant factors are quite low, and the ratio between CPU speed and call-graph size on a modern machine is high enough that the tool is instantaneous on my use-cases, so it is not worth the effort to optimize.
The call-graph that LLVM generates is the statically defined call-graph for the program so it misses many call paths. Any program that uses callbacks (or other dynamic despatch constructs) will have call sequences that are not found. Using "The Trick" to specialize this for small sets of possible targets is the subject of the next part...