# Avoiding The Needless Multiplication Of Forms

Dr Andrew Moss

2020-11-17

## Making some simple toys

Once upon a time, long long ago, I took a look at using clang for source-level static analysis. It was in the middle of a crazy research project that I will say very little about and we wanted to analyse some large industrial code bases to try to build some models.
At the time the choices available were:
1. Write a C parser, build models from the AST.
2. Extract AST from gcc.
3. Extract AST from clang.
4. Build something on-top of SUIF.
I can't remember why we decided SUIF was not a viable option (this was about seven years ago, and it might have been a lack of familiarity combined with a complex build environment). Working inside / on-top of gcc was rapidly ruled out, for reasons that anybody who has looked at the gcc source code will understand.
This left a choice between writing a C parser from scratch and using clang. Writing a C parser is a reasonable amount of work, but not insane. Clang was very new and interesting, and the -dump-ast output initially looked very promising, so we went down that road.
It did not go well.
At the time there were two options available for building an analyser on-top of clang:
• libclang, a "stable" but limited external API.
• an internal API to access the AST.
We tried libclang and extracted some of what we needed, but ran into some serious issues that we could not work around. I can't remember the exact details, but roughly: we could not tell the difference between function definitions and declarations (and they were hidden behind USRs), and, weirdly, we could not tell the difference between the l.h.s and r.h.s of assignments. Both of these completely broke what we wanted to do.
The internal interface would have given us the information that we needed, but it was not forward-compatible between point-releases and looked pretty awful to work with. So I wrote a C parser and we moved on with our lives. The project ended up changing its requirements twice after that ("small thing, the code-base will now be template-heavy C++ instead of C, will that be an issue?", "that C++ code-base? it turns out it's actually in Java, is that ok?") and was consigned to history.

## History (about two years ago)

Five years have gone by and it seems like the right time to take another look and re-evaluate clang as a front-end. People seem to have used it a lot for analysis work since then; surely it has improved?
The internal interface still lacks forward-compatibility but it seems to be stabilizing (guesswork based on the release notes). It is called libtooling now and seems a lot saner to work with. There are still some design choices that are baffling (no base class for AST nodes, access through overloaded visitor interfaces only) but it looks reasonable to work with.
My first attempt to build LLVM / Clang did not go well. I grabbed the code from github and tried to configure / build it. The configure/make approach is no longer supported and it now relies on cmake (grumble mumble grumble). The first attempt took about 3 hours and used 28GB (!!?!) of disk-space. This was only LLVM, as the cmake configure step had missed the clang source tree slapped into the LLVM source tree.
Attempt two reached about 60% and then failed with a horrific error message about sort not being an available function deep in clang! Experience suggests that trying to "fix" this would be pointless as it is probably the build-environment that is the problem. Switching to clang-3.7 instead of gcc 5.4 hit exactly the same problem (after another few hours).

## Bang-up-to-date

So... another couple of years went by. I finally have some time to look at some interesting code and so I dug out LLVM for another attempt.
Wow. Things have changed. The API has stabilized, and looks mature and usable. Building "out of tree" is now a thing. All in all it looks simple enough to pick up and play with. This is huge.
So I started writing some toys to see how it behaves. As yet, nothing complicated as I'm still exploring the API and assimilating a view of the architecture.

## Toy 1

It's easy to convert C code into LLVM's intermediate representation: clang has a switch to emit the IR for a translation unit:
```shell
clang $COPTS file.c -emit-llvm -o file.ll
```

These intermediate files can be parsed directly into LLVM modules:

```cpp
llvm::SMDiagnostic Diag;
llvm::LLVMContext C;
std::unique_ptr<llvm::Module> M = llvm::parseIRFile(filename, Diag, C);
```

So the first thing that I started playing with was walking through the function declarations to see what is inside the translation unit. Any functions that are defined are listed, even if the declarations are elsewhere in the program. So toy number 1 will just simulate linking a set of translation units together and then work out which function declarations are internal (the code is defined) or external (probably a library call). The code to handle merging the translation units is already in LLVM, and like everything else it is exposed through a simple API:

```cpp
// Headers for IR parsing, verification and linking.
#include <iostream>
#include <memory>
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Linker/Linker.h"
#include "llvm/Support/ManagedStatic.h"
#include "llvm/Support/SourceMgr.h"

int main(int argc, char **argv) {
  if (argc < 2) {
    std::cout << "usage: modlist <IR files>...\n";
    return 1;
  }
  llvm::SMDiagnostic Diag;
  llvm::LLVMContext C;
  llvm::Module WP("wholeprogram", C);
  llvm::Linker linker(WP);
  for (int i = 1; i < argc; i++) {
    std::unique_ptr<llvm::Module> M = llvm::parseIRFile(argv[i], Diag, C);
    bool broken_debug_info = false;
    if (M.get() == nullptr ||
        llvm::verifyModule(*M, &llvm::errs(), &broken_debug_info)) {
      llvm::errs() << "error: module not valid\n";
      return 1;
    }
    if (broken_debug_info) {
      llvm::errs() << "caution: debug info is broken\n";
      return 1;
    }
    linker.linkInModule(std::move(M));
  }
  for (auto &F : WP.functions()) {
    if (F.isDeclaration())
      std::cout << "Outside " << F.getName().str() << " "
                << F.getLinkage() << std::endl;
  }
  for (auto &F : WP.functions()) {
    if (!F.isDeclaration())
      std::cout << "Inside " << F.getName().str() << " "
                << F.getLinkage() << std::endl;
  }
  llvm::llvm_shutdown();
}
```

This code just takes a list of .ll files as arguments and combines them into a single module to resolve the function declarations:

```shell
clang++ $($LLVMCONFIG --cppflags --ldflags) -std=c++14 -g main.cc $($LLVMCONFIG --system-libs --libs all)
./a.out file1.ll file2.ll ...
```
## Realistic test-case

I wanted to try this on something non-trivial, so I pulled the code from the last place that I worked (Netdata) and used it as a test-case. The build-system is a bit of an ugly mess of autotools / shell-scripts, so it has some real-world warts to try out.
The obvious way to recover the list of translation units would be to try to parse the makefile. But this approach is flawed in several ways:
• The makefile is automatically generated.
• The generation depends on the system configuration.
• There are side-effects from shell scripts that execute first to check/install dependencies.
Trying to parse this would result in a simulator for something at least as complex as autotools. That is too much work for too little reward. It makes more sense to just run the build-system for real, record a trace of what it does, and then transform that trace into the information that we need.

```shell
./netdata-installer.sh --install $BASE/install --dont-wait --disable-telemetry
make clean
make --trace >>../make.trace
grep ^gcc make.trace >make.units
python3 common.py >build.sh
```
The python script is used to convert the list of gcc commands back into a list of units. To do this it just finds the common prefix and removes it from each command (the prefix is the command name and the initial arguments that encode the machine configuration).
```python
#!/usr/bin/env python3
import os.path

lines = list(open("make.units").readlines())
n = min([len(l) for l in lines])
for i in range(n):
    prefixes = set([l[:i] for l in lines])
    if len(prefixes) == 1:
        longest = i
prefix = lines[0][:longest]
stems = [l[longest:] for l in lines]
units = []
for s in stems:
    words = s.split(" ")
    for w in words:
        if w[-2:] == '.c':
            units.append(w)
dirs = set([os.path.dirname(u) for u in units])
for d in sorted(dirs):
    print(f"mkdir -p ../units/{d}")
for u in units:
    print(f"clang-11 {prefix[4:]} -g -S -emit-llvm {u} -o ../units/{u[:-2]}.ll")
```
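As an aside, the standard library already has the common-prefix trick: `os.path.commonprefix` computes the longest common string prefix of a list, character by character (despite its name it is not path-aware, which is exactly what we want here). A minimal sketch with hypothetical traced commands (the real input is the `make.units` file above):

```python
import os.path

# Hypothetical sample of traced compiler invocations.
lines = [
    "gcc -O2 -Ivendor src/main.c",
    "gcc -O2 -Ivendor src/util.c",
]

# Longest common string prefix: the command plus the shared flags.
prefix = os.path.commonprefix(lines)

# What remains after the prefix is the per-unit part of each command.
stems = [l[len(prefix):] for l in lines]

print(prefix)  # shared command + flags
print(stems)   # per-unit remainders
```

This would replace the hand-rolled `for i in range(n)` loop in the script, with the same result on this input: the prefix ends at the first character where the commands diverge.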
Once we know the list of units and the compiler switches that were needed to make them compile, we can rebuild the set of units and emit the .ll files into a matching tree.

## Combination

With the intermediate code and a list of the units, we can run the toy over the code to generate the output. Easy :)
```shell
#!/usr/bin/env bash
BASE=$(dirname $0)
pushd $BASE >/dev/null
../../funclist/a.out $(cat units.list)
popd >/dev/null
```
Running over the latest tree from Netdata indicates that it works:
```
...
Outside strcat 0
Outside statvfs 0
Outside semctl 0
Outside llvm.experimental.vector.reduce.add.v8i64 0
Outside gettimeofday 0
Outside llvm.experimental.vector.reduce.or.v4i64 0
Outside llvm.experimental.vector.reduce.or.v16i64 0
Outside llvm.experimental.vector.reduce.or.v8i64 0
Outside pthread_getaffinity_np 0
Outside pthread_setaffinity_np 0
Inside print_build_info 0
Inside get_netdata_execution_path 0
Inside create_needed_dir 0
Inside clean_directory 0
Inside become_user 0
...
```

## Next steps

So, a very basic proof of concept that shows:
• accessing the LLVM API.
• solving a trivial problem.
• applying it to real code.
Next time I will describe something more interesting: a tool for matching call-sequences in the program against regular expressions.