Once upon a time, long long ago, I took a look at using clang for source-level static analysis. It was in the middle of a crazy research project that I will say very little about and we wanted to analyse some large industrial code bases to try to build some models.

I can't remember why we decided SUIF was not a viable option (this is about seven years ago and it might have been a lack of familiarity and a complex build environment). Working inside / on-top of gcc was rapidly ruled out for reasons as anybody who has looked at the gcc source code would understand.

This left a choice of write a C parser from scratch or using clang. Writing a C parser is a reasonable amount of work, but not insane. Clang was very new and interesting, and the -dump-astoutput initially looked very interesting so we went down that road.

We tried libclang and extracted some of what we needed, but ran into some serious issues that we could not work around. I can't remember the exact details, but roughly we could not tell the difference between function definitions and declarations (and they were hidden behind USRs), and weirdly, we could not tell the difference between the l.h.s and r.h.s of assignments. Both of which broke what we wanted to do completely.

The internal interface would have given us the information that we needed, but it was not forward-compatible between point-releases and looked pretty awful to work with. So I wrote a C parser and we moved on with our lives. The project ended up changing its requirements twice after that ("small thing, the code-base will now be template-heavy C++ instead of C, will that be an issue?", "that C++ code-base? it turns out its actually in java, is that ok?") and was consigned to history.

Five years have gone by and it seems like the right time to take another look and re-evaluate clang as a front-end. People seem to have used it a lot since then for analysis work, surely it has improved?

The internal interface still lacks forward-compatability but it seems to be stabilizing (guesswork based on the release-notes). It is called libtooling now and seems a lot more sane to work with. There are still some design-choices that are baffling (no base-class for AST nodes, access through overloaded visitor interfaces only) but it looks reasonable to work with.

My first attempt to build LLVM / Clang did not go well. I grabbed the code from github and tried to configure / build it. The configure/make approach is no longer supported and it now relies on cmake (grumble mumble grumble). The first attempt took about 3 hours and used 28GB (!!?!) of disk-space. This was only LLVM as the cmake configure step had missed the clang source tree slapped into the LLBVM source tree.

Attempt two reached about 60% and then failed with a horrific error message about sort not being an available function deep in clang! Experience suggests that trying to "fix" this would be pointless as it is probably the build-environment that is the problem. Switching to clang-3.7 instead of gcc 5.4 hit exactly the same problem (after another few hours).

Wow. Things have changed. The API has stabilized, and looks mature and usable. Building "out of tree" is now a thing. All in all it looks simple enough to pick up and play with. This is huge.

llvm::SMDiagnostic Diag; llvm::LLVMContext C; std::unique_ptr<llvm::Module> M = llvm::parseIRFile(filename, Diag, C);

So the first thing that I started playing with was walking through the function declarations to see what is inside the translation unit. Any functions that are defined are listed, even if the declarations are elsewhere in the program. So toy number 1 will just simulate linking a set of translation units together and then working out which function declarations are are internal (the code is defined) or external (probably a library call).

int main(int argc, char **argv) { if (argc < 2) { std::cout << "usage: modlist <IR files>...\n"; return 1; } llvm::SMDiagnostic Diag; llvm::LLVMContext C; llvm::Module WP("wholeprogram",C); llvm::Linker linker(WP); for(int i=1; i<argc; i++) { std::unique_ptr<llvm::Module> M = llvm::parseIRFile(argv[i], Diag, C); bool broken_debug_info = false; if (M.get() == nullptr || llvm::verifyModule(*M, &llvm::errs(), &broken_debug_info)) { llvm::errs() << "error: module not valid\n"; return 1; } if (broken_debug_info) { llvm::errs() << "caution: debug info is broken\n"; return 1; } linker.linkInModule(std::move(M)); } for (auto &F: WP.functions()) { if (F.isDeclaration()) std::cout << "Outside " << F.getName().str() << " " << F.getLinkage() << std::endl; } for (auto &F: WP.functions()) { if (!F.isDeclaration()) std::cout << "Inside " << F.getName().str() << " " << F.getLinkage() << std::endl; } llvm::llvm_shutdown(); }

I wanted to try this on something non-trivial so I pulled the code from the last place that I worked ( Netdata ) and used it as a test-case. The build-system is a bit of an ugly mess of autotools / shell-scripts so it has some real-world warts to try out.

The makefile is automatically generated.
The generation depends on the system configuration.
There are side-effects from shell scripts that execute first to check/install dependencies.

Trying to parse this would result in a simulator for something at least as complex as autotools. This is too much work for too-little reward. It makes more sense to just run the build-system for real and record a trace of what it does, then transform this into the info that we need.

The python script is used to convert the list of gcccommands back into a list of units. To do this it just finds the common prefix and removes it from each command (this is the command and the initial arguments that encode the machien configuration).

#!/usr/bin/env python3 import os.path lines = list(open("make.units").readlines()) n = min([len(l) for l in lines]) for i in range(n): prefixes = set([ l[:i] for l in lines ]) if len(prefixes)==1: longest = i prefix = lines[0][:longest] stems = [ l[longest:] for l in lines ] units = [] for s in stems: words = s.split(" ") for w in words: if w[-2:]=='.c': units.append(w) dirs = set([ os.path.dirname(u) for u in units ]) for d in sorted(dirs): print(f"mkdir -p ../units/{d}") for u in units: print(f"clang-11 {prefix[4:]} -g -S -emit-llvm {u} -o ../units/{u[:-2]}.ll")

... Outside strcat 0 Outside statvfs 0 Outside semctl 0 Outside llvm.experimental.vector.reduce.add.v8i64 0 Outside gettimeofday 0 Outside llvm.experimental.vector.reduce.or.v4i64 0 Outside llvm.experimental.vector.reduce.or.v16i64 0 Outside llvm.experimental.vector.reduce.or.v8i64 0 Outside pthread_getaffinity_np 0 Outside pthread_setaffinity_np 0 Inside print_build_info 0 Inside get_netdata_execution_path 0 Inside create_needed_dir 0 Inside clean_directory 0 Inside become_user 0 ...

Markdown syntax is based on RST: 
*italics* **bold** are inline styles. 
Paragraphs wrap

until blank lines separate. Other inline 
styles are :code:`x=y+z`, :shell:`ls` and 
:math:`x^n=y^n+z^n`. Links are 
`here `_ and blocks 
require blank lines

.. code::

  x <- f(y)
  // Until indent change

* Bullets are not indented
* Listed between blank lines

Markdown syntax is based on RST: italics bold are inline styles. Paragraphs wrap

until blank lines separate. Other inline styles are x=y+z lsand \\((x^n=y^n+z^n\\)) Links are here and blocks require blank lines

x <- f(y) // Until indent change

Bullets are not indented
Listed between blank lines

Avoiding The Needless Multiplication Of Forms

Static analysis in LLVM/Clang

Making some simple toys

History (about two years ago)

Bang-up-to-date

Toy 1

Realistic test-case

Combination

Next steps

Comments