Avoiding The Needless Multiplication Of Forms
Automated Testing In Docker
Dr Andrew Moss
The first batch of student assignments has been graded, so this is the penultimate post in this series. Today I want to write a little about the experience of testing via Docker. The final post will come some time later, as I want to build a queuing system for tests - the long-term idea is that students will have access to the testing facility before their submission, rather than only a teacher using it before grading.
Battery consumption on the mac.
Using Docker on the Mac rapidly makes the Terminal application the most energy-intensive process on the machine. Opening it up inside Activity Monitor shows that the VBoxHeadless process keeps running and consuming power in-between test runs. The simple solution is to suspend and resume the VM around test sessions:
$ VBoxManage controlvm default pause
$ VBoxManage controlvm default resume
Underspecification in the assignment.
Students seem to have a natural genius for spotting parts of an assignment that are underspecified or ambiguous; then, as a weakly coordinated group, they spread out among the possibilities that are just within the specification, and their work samples the boundary of what is allowed. It's like watching a particle filter explore environmental constraints. Anyway, the compiler assignment they were given this year forgot to mention some things:
- The output is weakly defined; there is no split between what should go to stdout, which filenames the dot representation should be saved under, etc.
- The filenames in the submission tarball are not defined; the reference to the previous assignment is ambiguous and depends on finding abbreviations that make sense to an English speaker. If ass1-int was the name of the first directory then the second should be ass2-comp, but this depends on cultural experience of abbreviation. Amazingly, one student in the batch did actually deduce this from the context, but amongst the others were reasonable estimates of ass2-int, ass1-int and ass2.
- A problem that occurs in both assignments is that the gold standard of "make a working compiler for this language" has been weakened to support a range of passing grades. The output of the various target grades is not strictly monotonic, so the output for a student targeting a grade E is not compatible with a grade D, for example.
- The "easier" result for a low grade is harder to check. This was not obvious while writing the assignment, as the focus is upon what they need to do (code) in order to get a passing mark. Because the output for a grade E on the second assignment is not executable - other than by a human reading the CFG in the final pdf file - it requires more knowledge to evaluate whether the traces in the testcase are properly shown in the output. Some students forgot edge labels, making their CFGs ambiguous; others missed blocks in the CFG or confused the ordering. It is not obvious how to make the "easier result" easier to validate for the student. This point requires some thought.
False positives on plagiarism.
One of the onerous duties of a teacher is that we need to be aware of plagiarism and spot when work has been duplicated. We must also avoid false positives, as the process for handling student plagiarism does not handle abort cleanly (people are humans and tend to get pissy when accused of something they have not done). In this batch it looked like there was a clear case of copying. Very clear-cut. Almost too easy to spot.
Experience indicates that the more obvious something is, the harder we have to work to verify that it is true. So in this case the individual tests had to be rerun manually and the code examined against them to check what was happening. Luckily I have a high background level of disbelief, so after checking minutely within the testing environment it was time to dismantle it, then manually unpack and stage the submissions to verify that the testing environment had not magically made their code look the same. In this case this level of technical paranoia was justified: the testing environment had indeed altered their code to make it look the same. Bad evil Docker.
So what happened? Recall that the file-system inside a container is a union-fs that stacks parts of the FS tree together from more basic pieces. Docker tries to be smart about this, so it keeps a cache of those previously built pieces. Rerunning the same build instructions over the same source filenames (which have been reloaded with a fresh submission) does not propagate dependencies - Docker is not make. It does not check whether the source has changed; it has a different set of rules for when to rebuild dependencies and when to reuse the cached FS layer. Bad Docker.
When tools misbehave we just turn them off, and at least Docker is flexible:
$ docker build --no-cache -t current_test .
This disables all caching and ensures that we build the container only from the submitted work and the pieces of the base image - never caching state from another submission. Phew, narrowly avoided disaster.
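As I understand the cache rules (and they have shifted between Docker versions), the key for reusing a RUN layer is the parent layer plus the instruction text. A hypothetical fragment, not the actual assignment Dockerfile, shows where the risk sits; whether the ADD line busts the cache depends on the version's content checksumming, which is exactly why --no-cache sidesteps the question entirely:

```dockerfile
# Illustrative sketch only - image name and paths are assumptions.
FROM ubuntu:14.04
# The submission is staged into the image...
ADD submission /work
# ...but this RUN line has identical text for every student, so without
# --no-cache Docker may reuse the layer built from a previous submission.
RUN cd /work && make
```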
As far as I can tell, Docker may be flaky about processing the return-codes of RUN commands within build scripts. Or not. The caching issue above may have contributed to executing the wrong executable. It is hard to pin down an exact failure, and the various bug-reports against Docker for this type of issue have been closed, so this could have been a ghost issue that disappeared with the caching fix above. Either way it does seem safer to write out the return-codes as part of the test results rather than using them to decide whether the test continues:
RUN ./comp /testcases/1.lua >redirect.dot; echo $? >/testresults/1.retcode
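The pattern generalises per testcase. A stand-alone sketch of the idea, with `false` standing in for the student compiler and a temp directory standing in for /testresults:

```shell
# Run a test, keeping its exit status as data rather than control flow.
outdir=$(mktemp -d)
false > "$outdir/1.dot" 2>/dev/null   # stand-in for: ./comp /testcases/1.lua >redirect.dot
echo $? > "$outdir/1.retcode"         # the failure is recorded, not fatal
cat "$outdir/1.retcode"               # prints 1
```

The build carries on regardless of the test outcome, and the grading step reads the .retcode files afterwards.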
Well, it ain't quick. But on the other hand it is much faster than typing in the tests manually: each run takes about 15 seconds at the moment, while doing it by hand (scraping the various filenames together and handling every variation in the underspecified assignment) takes many, many minutes. It is still too slow for something I would stick behind a CGI gateway and give students direct access to. One speed-up will come from smashing the various RUN commands together into a single shell command separated by semicolons, but it will still be slow enough to need some kind of batch system. Also, having up to 50 students hitting the testing server at the same time may introduce a latency of 10-15 minutes.
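The folding would look something along these lines (testcase names assumed), producing one cached layer instead of one per testcase:

```dockerfile
# One layer for all testcases; a failing test still cannot abort the
# build because only the recorded return-codes matter.
RUN ./comp /testcases/1.lua >1.dot; echo $? >/testresults/1.retcode; \
    ./comp /testcases/2.lua >2.dot; echo $? >/testresults/2.retcode
```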
There are generally two ways that student submissions fail to terminate cleanly: either they crash or they hang. Programs that seg-fault are shown in the build output. I need to find a way to record this automatically; at the moment it is a manual step. Something very nice, and normally hard to do, is recovering a stack-trace for the failure.
Running an interactive container on the current_test image allows installing gdb, manually flipping on debug info in the student makefile, and then replicating the crash inside a debugger. This allowed me to email stack traces to students who submitted crashing code. I am suitably happy with this :) I also need to think of a way to automate this process.
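One possible automation, sketched as a debug variant of the image - the tag, binary path and testcase name are assumptions, and it presumes the student makefile has already been switched to build with -g:

```dockerfile
FROM current_test
RUN apt-get update && apt-get install -y gdb
# Batch mode: run to the crash, print a backtrace, no manual stepping.
# The trailing `true` keeps a non-zero gdb exit from aborting the build.
RUN gdb -batch -ex run -ex bt --args /work/comp /testcases/1.lua \
    > /testresults/1.backtrace 2>&1; true
```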
Hanging code is a bit harder; I suspect that I will find a CPU quota option in Docker somewhere and put a CPU time-limit in the assignment specification. This will interact with the performance issues above and affect the maximum latency of a submission queue.
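Until that option turns up, coreutils `timeout` inside the container gives a wall-clock limit. A stand-alone sketch, with a busy loop standing in for a hanging submission:

```shell
# Kill the (stand-in) hanging program after one second; timeout reports
# exit status 124 when the limit fires, which can go into a .retcode file.
timeout 1 sh -c 'while :; do :; done'
echo $?   # prints 124
```

Newer Docker releases also expose --cpu-quota and --cpu-period on docker run, though a CPU-share limit is not quite the same thing as a wall-clock limit on a hung test.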
Grading student work is quite a subjective process that depends on an objective testing of the artefact they submit. The testing process is quite error-prone and fragile, much more so than many teachers would like to admit. For the testing process to really be objective it must be made robust; reproducibility of the testing used to inform a grading decision is vital to minimising the subjectivity of that decision. My initial experiences from trialling this approach suggest it has great potential, and I think that when students get access to the testing mechanism it could benefit them greatly.