
How the RPN calculator app is tested

Once upon a time, my RPN calculators were a toy that I had developed while idling in airport lounges. As soon as they were packaged as a mobile app, and the number of flavors increased (11C, 15C, 16C), I was forced to migrate to a more "professional" process.

At first sight, this seems to downgrade the project from "pet project" to "boring" status. But developing testing and release processes can be fun and instructive in itself, and having a solid self-test infrastructure actually made development easier, with far less manual testing required.

When a new version of the calculator "engine" is released, even if the change is a single character, a battery of automated steps is executed when I type make.

While not part of the make script, there is a script that keeps testing the code in a loop, in random order and with random concurrency (i.e. N is random in make -j N), in order to catch race conditions and timing-dependent changes in behavior. It runs on a Raspberry Pi 3 (my poor Raspberry Pi 1 rev. A did not have enough RAM to run the 15C test) and on any computer that happens to be idling around.

Here is a sample unit test:

UT.push(function () {
        var m = H.machine;

        printf("Test 16");
        // Run these keystrokes as a program. The CHS and EXP mnemonics are
        // reconstructed from the description below (the test computes e^-1.5).
        mt1x([
                "1", ".", "5", "CHS", "EXP",
        ], function () {
                if (Math.round(m.x * 1000000) !== 223130) {
                        ut_return("exp " + m.x);
                        return;
                }
                ut_return("");  // empty message signals success (assumed convention)
        });

        return UTDELAY;
});

This particular test exercises the e^x function by calculating e^-1.5. As a side effect, it covers a lot of unrelated code. Since all RPN flavors (12C, 11C, 15C) have the exponential function, this test applies to all of them.
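The magic constant in the test can be checked independently. This snippet is not part of the app; it just reproduces the rounding that the test performs:

```javascript
// e^-1.5 scaled by 10^6 and rounded, exactly as the unit test does.
// Math.exp(-1.5) is 0.22313016..., so the rounded result is 223130,
// the constant hard-coded in "Test 16".
var expected = Math.round(Math.exp(-1.5) * 1000000);
```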

The test framework has an "assembler" of sorts that interprets mnemonics like ENTER, EXP, STO 0, etc. and converts them into keystrokes. In the example above, the code is run as a program by the mt1x() helper function. In other tests, the script is executed in interactive mode, in order to simulate typing.
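A minimal sketch of such a mnemonic-to-keystroke translator follows. The names KEYMAP and asm are hypothetical, and the real engine's tables are certainly richer; the sketch only assumes that a shifted function like e^x expands to a modifier key plus a function key, as suggested by the "g" modifier seen in the records later on:

```javascript
// Hypothetical mnemonic table: each mnemonic maps to one or more keystrokes.
// Shifted functions (e.g. EXP = "g" modifier + e^x key) expand to two keys.
var KEYMAP = {
        "ENTER": ["enter"],
        "CHS":   ["chs"],
        "EXP":   ["g", "ex"],   // modifier + function key (illustrative)
        "STO":   ["sto"]        // takes an argument keystroke, e.g. STO 0
};

// Translate a list of mnemonics into keystrokes.
// Anything not in the table (digits, ".", register numbers) passes through.
function asm(mnemonics) {
        var keys = [];
        mnemonics.forEach(function (m) {
                if (KEYMAP[m]) {
                        keys = keys.concat(KEYMAP[m]);
                } else {
                        keys.push(m);
                }
        });
        return keys;
}
```

For example, asm(["1", ".", "5", "CHS", "EXP"]) would yield the six keystrokes the engine actually consumes.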

Since the calculator always executes a program asynchronously (execute one instruction, call setTimeout() to schedule the next, sleep, rinse and repeat), the unit test framework must provide asynchronous tools as well. Simpler tests can be synchronous and return the result directly, instead of using the UTDELAY/ut_return() scheme.
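The asynchronous contract can be sketched roughly like this; ut_return and UTDELAY come from the text above, but run_test and the single-pending-callback scheme are simplified stand-ins, not the framework's actual code:

```javascript
var UTDELAY = {};          // sentinel: "result will arrive later via ut_return()"
var pending = null;        // completion callback of the test in flight

// Called by asynchronous tests when they finish (empty message = success).
function ut_return(msg) {
        var cb = pending;
        pending = null;
        if (cb) cb(msg);
}

// Run one test function; accepts both styles.
function run_test(test, done) {
        pending = done;
        var res = test();
        if (res !== UTDELAY) {
                // Synchronous test: the result was returned directly.
                pending = null;
                done(res);
        }
        // Otherwise the test scheduled work (e.g. via setTimeout) and
        // will report through ut_return() when the machine goes idle.
}
```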

One "error" I committed when I began writing unit and regression tests was not adopting a ready-made framework. I am still in two minds about this. On the one hand, I had to write a lot of boilerplate code, and the whole scheme is a bit ugly. On the other hand, Javascript frameworks of any kind seem to fade in and out every week, so there is no guarantee that a framework adopted in 2011 would still exist today.

Recording behavior

The introduction of "fast mode" (circa end of 2015) brought a number of subtle bugs. For example, the calculator should display "running" while the program is running, but it should display the X register when paused (PSE) — and show "running" again when the pause ends.

This kind of mid-flight behavior is difficult to check in simple unit tests, because it is time-dependent. And this user-facing behavior must be verified in a dozen calculator modes: interactive × running a program; normal × fast mode; the 15C's integral and root-finding functions (which can call each other within a program).

Observing this case, I felt that many other situations needed closer examination as well. For example, the calculator "flickers" the LCD when an operation is keyed in, to simulate the real calculator's behavior. (The real calculator's CPU is slow, so it flickers out of necessity, but it ends up being useful feedback for the user, because it signals that some operation was actually carried out.) I was never perfectly sure that flickering was still OK after a change in the LCD code. That uncertainty translated into more manual testing; flickering might malfunction in some particular operation that I did not test.

The solution I found was to create a number of checkpoints in the engine and record essential variables at each one. Below you can see the latter part of the "Test 16" record:

  421   R   004 new    | 1.             |      | 1.5  
  421   R   004 lcd    | 1.5            |      | 1.5  
  421   R   004 pos    | 1.5            |      | 1.5  
  421   R   005        | 1.5            |      | 1.5  05
  523   R   005 pre    | 1.5            |      | 1.5  
  523   R   005 new    | 1.5            |      | -1.5  
  523   R   005 lcd    |-1.5            |      | -1.5  
  523   R   005 pos    |-1.5            |      | -1.5  
  523   R   006        |-1.5            |      | -1.5  16
  627   R   006 pre    |-1.5            |      | -1.5  
  627   R   006 mod    |-1.5            |g     | -1.5  
  628   R   006 new    |-1.5            |g     | 0.223130160  
  628   R   006 lcd    | 0.223          |g     | 0.223130160  
  628   R   006 mod    | 0.223          |      | 0.223130160  
  628   R   006 pos    | 0.223          |      | 0.223130160  
  628   R   007        | 0.223          |      | 0.223130160  43.22
  733   R   007 pre    | 0.223          |      | 0.223130160  
  733   R       posp   | 0.223          |      | 0.223130160  
  733   R              | 0.223          |      | 0.223130160  43.33.00
  733   I       new    | 0.223          |      | 0.223130160     v
   v    v    v   v         v               v            v        |
   |    |    |   |         |               |            |       log
   |    |    |   |         |               |            +-- X reg
   |    |    |   |         |               +---- modifier in LCD
   |    |    |   |         +---- LCD contents
   |    |    |   +---- checkpoint (new LCD content, pre-instr, etc.)
   |    |    +---- program instruction pointer
   |    +---- machine status (R=Running, I=Interactive, etc.)
   +------ timestamp in milliseconds

The meaning of each item matters for post-mortem analysis, but the main objective is to detect unforeseen changes from a known-good version to a candidate new version.

In the example above, suppose a regression removes the modifier "g" from the LCD when e^x is keyed. This defect would not affect the result, so the unit test would still pass. Yet it is a bug, and its detection would depend on manual testing (or, more probably, on some attentive user willing to take the time to write a bug report).

Naturally, changes in the timestamp are expected and therefore not counted as differences. Also, parts of some tests are marked as "mutable", so e.g. changes in the X register are expected for these parts and ignored. The side-by-side comparison between test runs is carried out by a diff algorithm.
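Before diffing, each record has to be normalized so that expected differences don't show up. A sketch of that normalization, assuming the field layout of the dump above (the function name and the exact record shape are illustrative, not the app's actual code):

```javascript
// Hypothetical record shape, mirroring the columns of the dump:
// [timestamp, status, instr_ptr, checkpoint, lcd, modifier, x_reg]
// Timestamps always differ between runs, so field 0 is dropped;
// records in a "mutable" section have their X register masked as well.
function normalize(record, mutable) {
        var r = record.slice(1);      // drop the timestamp
        if (mutable) {
                r[5] = "*";           // mask the X register
        }
        return r.join("|");           // canonical line fed to the diff
}
```

Two runs are then compared line by line over these canonical strings.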

One interesting finding: the floating-point value (register X) may change from one architecture to another, even running exactly the same code with the same version of the same JS engine. Transcendental functions like asin() and log() do yield different results on different architectures and/or CPUs, especially for "difficult" arguments, e.g. log(x) with x ≈ 1. The difference shows up only around the 17th significant digit, but some test programs and formulas amplify the initial difference, so the records are compared only up to the 12th digit.
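A 12-digit comparison is easy to express with toPrecision(); this helper (same12 is a hypothetical name, since the real comparison happens over the text records) illustrates the idea:

```javascript
// Compare two floating-point values up to 12 significant digits,
// ignoring architecture-dependent noise in the digits beyond that.
function same12(a, b) {
        return a.toPrecision(12) === b.toPrecision(12);
}
```

With this rule, two values differing only around the 16th digit compare equal, while a genuine regression in the first 12 digits is still caught.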

This is not just a theoretical problem. I did find a numeric instability in a financial formula employed by the calculator: it greatly amplified a small architecture-dependent difference in pow(), which depends on log(). The final result differed beyond the 7th digit across architectures, and also differed from a real HP-12C unit. BTW, the solution was to implement a custom pow() that uses an alternate formula when the base is almost equal to 1.0.
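The post doesn't show the custom pow(), but a common remedy for this class of instability, sketched here under the assumption that the troublesome term has the compound-interest shape (1 + i)^n, is to keep the small rate i and go through log1p() instead of forming 1 + i and calling pow(). The name pow1p is illustrative, not the app's actual function:

```javascript
// Hypothetical alternate formula for bases near 1: never form 1 + i
// explicitly. Math.log1p(i) computes log(1 + i) without the cancellation
// that would otherwise destroy i's low-order digits when i is tiny.
function pow1p(i, n) {
        return Math.exp(n * Math.log1p(i));
}
```

The caller would dispatch to a formula like this when the base is within some small threshold of 1.0, and fall back to the regular pow() otherwise.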

A test run accumulates a lot of data: 200MB in JSON format. Converted to the human-readable format shown above, it translates to almost a million lines. Yet it is a valuable asset, because it crystallizes the collective trust (my manual tests + happy users) in a given version of the calculator. The JSON database is stored and version-controlled along with the source code.