COMP0233: Research Software Engineering With Python




When programming, it is very important to know that the code we have written does what it was intended to do. Unfortunately, this step is often skipped in scientific programming, especially when developing code for our own personal work.

Researchers sometimes check that their code behaves correctly by manually running it on some sample data and inspecting the results. However, it is much better and safer to automate this process, so the tests can be run often -- perhaps even after each new commit! This not only reassures us that the code behaves as it should at any given moment, it also gives us more flexibility to change it, because we have a way of knowing when we have broken something by accident.

In this chapter, we will mostly look at how to write unit tests, which check the behaviour of small parts of our code. We will work with a particular framework for Python code, but the principles we discuss are general. We will also look at how to use a debugger to locate problems in our code, and services that simplify the automated running of tests.

A few reasons not to do testing

Sensibility                      | Sense
It's boring                      | Maybe
Code is just a one-off throwaway | As with most research codes
No time for it                   | A bit more code, a lot less debugging
Tests can be buggy too           | See above
Not a professional programmer    | See above
Will do it later                 | See above

A few reasons to do testing

  • laziness: testing saves time
  • peace of mind: tests (should) ensure code is correct
  • runnable specification: best way to let others know what a function should do and not do
  • reproducible debugging: debugging that happened and is saved for later reuse
  • code structure / modularity: since we may have to call parts of the code independently during the tests
  • ease of modification: since results can be tested

Not a panacea

Trying to improve the quality of software by doing more testing is like trying to lose weight by weighing yourself more often. - Steve McConnell

  • Testing won't correct buggy code
  • Testing will tell you where the bugs are...
  • ... if the test cases cover the bugs

If the test cases do not cover the bugs, things can go horribly wrong. One example of this is the Therac-25 radiation therapy machine.

Tests at different scales

Level of test       | Area covered by test
Unit testing        | smallest logical block of work (often < 10 lines of code)
Component testing   | several logical blocks of work together
Integration testing | all components together / whole program

Always start at the smallest scale!
If a unit test is too complicated, go smaller.
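
As a minimal sketch of the difference in scale, assume two hypothetical functions, `kinetic_energy` and `total_energy` (neither comes from the course material). The first test checks one small unit in isolation; the second checks two units working together. Plain `assert` statements are used so the sketch does not depend on any particular test framework.

```python
def kinetic_energy(mass, speed):
    """Kinetic energy of one particle (hypothetical example function)."""
    return 0.5 * mass * speed**2


def total_energy(particles):
    """Sum the kinetic energy over (mass, speed) pairs."""
    return sum(kinetic_energy(m, v) for m, v in particles)


def test_kinetic_energy_unit():
    # Unit test: one small logical block, checked in isolation.
    assert kinetic_energy(2.0, 3.0) == 9.0


def test_total_energy_component():
    # Component test: several logical blocks working together.
    assert total_energy([(2.0, 3.0), (4.0, 1.0)]) == 11.0


# Called directly so the sketch runs on its own.
test_kinetic_energy_unit()
test_total_energy_component()
```

If the component test fails while the unit test passes, the problem is more likely in how the blocks are combined than in the smallest block itself, which is one reason to start at the smallest scale.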

Legacy code hardening

  • Very difficult to create unit tests for existing code
  • Instead we make a regression test
  • Run program as a black box:
      • set up input
      • run program
      • read output
      • check output against expected result
  • Does not test correctness of code
  • Checks that the code is as wrong on day N as it was on day 0
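
The black-box steps above might be sketched as follows. This is only an illustration under stated assumptions: `LEGACY_SOURCE` is a stand-in for a real legacy program, and the helper `run_legacy`, the file names, and the reference value are all hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for a legacy program we cannot easily unit test (hypothetical).
LEGACY_SOURCE = """
import sys
values = [float(line) for line in open(sys.argv[1])]
print(f"{sum(values) / len(values):.3f}")
"""


def run_legacy(workdir, input_text):
    """Set up the input, run the program as a black box, read its output."""
    script = Path(workdir) / "legacy.py"
    data = Path(workdir) / "input.txt"
    script.write_text(LEGACY_SOURCE)
    data.write_text(input_text)
    result = subprocess.run(
        [sys.executable, str(script), str(data)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def test_regression():
    # Output recorded on "day 0". It serves only as a reference;
    # it is not proof that the program is correct.
    expected = "2.000"
    with tempfile.TemporaryDirectory() as workdir:
        actual = run_legacy(workdir, "1.0\n2.0\n3.0\n")
    assert actual == expected


# Called directly so the sketch runs on its own.
test_regression()
```

The test never looks inside the program. It only detects that the output has changed relative to the recorded reference, which is exactly the limited guarantee a regression test provides.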

Testing vocabulary

  • fixture: input data
  • action: function that is being tested
  • expected result: the output that should be obtained
  • actual result: the output that is obtained
  • coverage: proportion of all possible paths in the code that the tests take
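
To make the vocabulary concrete, here is a small sketch in which each term is labelled in a comment. The function `mean` is a hypothetical example, not part of the course code.

```python
def mean(values):
    """Arithmetic mean of a non-empty list (hypothetical example)."""
    return sum(values) / len(values)


def test_mean():
    fixture = [1.0, 2.0, 3.0, 6.0]  # fixture: the input data
    expected = 3.0                  # expected result: what we should get
    actual = mean(fixture)          # action: the call being tested,
                                    # which yields the actual result
    assert actual == expected


# Called directly so the sketch runs on its own.
test_mean()
```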

Branch coverage:

if energy > 0:
    # Do this
    # Do that
    pass

Is there a test for both energy > 0 and energy <= 0?
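
A minimal sketch of tests that take both branches, assuming a hypothetical function `energy_sign` that stands in for the code inside the `if`:

```python
def energy_sign(energy):
    """Classify an energy value (hypothetical function for illustration)."""
    if energy > 0:
        return "positive"
    return "non-positive"


def test_positive_branch():
    # Takes the `energy > 0` branch.
    assert energy_sign(1.5) == "positive"


def test_non_positive_branch():
    # Takes the other branch, including the boundary value 0.
    assert energy_sign(0) == "non-positive"
    assert energy_sign(-2.0) == "non-positive"


# Called directly so the sketch runs on its own.
test_positive_branch()
test_non_positive_branch()
```

With only the first test, every line inside the `if` would run, yet the case where the condition is false would never be exercised. Checking the boundary value 0 also guards against accidentally writing `>=` instead of `>`.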