2010/03/21

Optimizing with Psyco and Cython, this time for real

My last experience with Cython was just a test, driven by sheer curiosity. But, guess what, I ended up needing some help from Cython in the most recent simulation I created.

The performance problem arose in the calendar spread simulation. I conceived a different stop strategy for this spread, based on the option values themselves instead of on a market level. This implies that option values must be calculated for every day in every round.

Just because of that, this particular simulation was taking 45 seconds per scenario, instead of the usual 7-8 seconds that all other spreads were taking. A simple run with

```shell
python -m cProfile ./simulacal.py
```

showed that all the excess CPU time was in the Black-Scholes module functions. It was kind of a surprise, because a) the Black-Scholes functions are simple, closed-form formulas; and b) I expected that Psyco had already optimized them to the limit.
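The same kind of report can also be produced programmatically with cProfile and pstats, which is handy when you want to profile just one part of a program. A minimal sketch (hot() here is a made-up stand-in for the Black-Scholes functions that dominated my profile):

```python
import cProfile
import io
import pstats

def hot(n):
    # stand-in for the code that dominates the profile
    total = 0.0
    for i in range(1, n):
        total += 1.0 / i
    return total

profiler = cProfile.Profile()
profiler.enable()
hot(100000)
profiler.disable()

# sort by cumulative time, like reading cProfile's report top-down
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)   # print the top 10 entries
report = stream.getvalue()
```

The resulting report text lists hot() among the top consumers, which is exactly how the Black-Scholes functions showed up in my run.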

Moreover, a particular function, N(x), which returns the cumulative normal distribution, was consuming half of the calculation's CPU time. The code is:

```python
def N(x):
    if x < 0:
        return 1 - N(-x)
    n = 1.0 - 0.5 * (1.0 + \
                     N.kd1 * x + \
                     N.kd2 * x ** 2.0 + \
                     N.kd3 * x ** 3.0 + N.kd4 * x ** 4.0 + \
                     N.kd5 * x ** 5.0 + N.kd6 * x ** 6.0 \
                    ) ** -16.0
    return n

N.kd1 = 0.0498673470
N.kd3 = 0.0032776263
N.kd5 = 0.0000488906
N.kd2 = 0.0211410061
N.kd4 = 0.0000380036
N.kd6 = 0.0000053830
```

This is a polynomial approximation for the cumulative normal distribution, hence the kd[n] coefficients.
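For the curious, the approximation can be checked against math.erf, since N(x) = (1 + erf(x/√2))/2. This is a hedged sketch, not the simulation's actual code; the coefficients are taken from the listing above, just stored in a tuple instead of as function attributes:

```python
import math

# the kd[n] coefficients from the N(x) listing above
KD = (0.0498673470, 0.0211410061, 0.0032776263,
      0.0000380036, 0.0000488906, 0.0000053830)

def N(x):
    # polynomial approximation, equivalent to the function above
    if x < 0:
        return 1.0 - N(-x)
    p = 1.0 + sum(k * x ** (i + 1) for i, k in enumerate(KD))
    return 1.0 - 0.5 * p ** -16.0

def N_exact(x):
    # reference value via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# the approximation is good to roughly 1.5e-7 over the real line
for x in (-3.0, -1.0, 0.0, 0.5, 1.0, 2.0, 3.0):
    assert abs(N(x) - N_exact(x)) < 2e-7
```

That accuracy is plenty for option pricing, and the closed polynomial form is what makes the function such a good candidate for static compilation.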

Indeed, Psyco *does* optimize this quite well. Running the calendar spread simulation without Psyco takes 4 whole minutes, six times slower. So, Psyco's performance gain for this particular simulation (6x) was almost twice the gain observed in past simulations (3.6x).

Nevertheless, this purely mathematical function should be faster. Time to try Cython. I created a "bsfast" module with the base Black-Scholes formulae. The Cython version for N(x) is:

```cython
cdef double kd1 = 0.0498673470
cdef double kd3 = 0.0032776263
cdef double kd5 = 0.0000488906
cdef double kd2 = 0.0211410061
cdef double kd4 = 0.0000380036
cdef double kd6 = 0.0000053830

cdef double N(double x):
    if x < 0:
        return 1 - N(-x)
    cdef double n = 1.0 - 0.5 * pow(1 + \
                        kd1 * x + \
                        kd2 * pow(x, 2.0) + \
                        kd3 * pow(x, 3.0) + kd4 * pow(x, 4.0) + \
                        kd5 * pow(x, 5.0) + kd6 * pow(x, 6.0), \
                        -16.0)
    return n
```

Cython converts this into C code. The more you declare static types and avoid calling Python functions, the more direct and fast the generated C code will be, and the less it will be polluted (and slowed down) by conversions to and from Python objects. The N(x) function translated well, as we can see (or not) below:

```c
__pyx_v_n = (1.0 - (0.5 * pow(
    ((((((1 + (__pyx_v_6bsfast_kd1 * __pyx_v_x))
        + (__pyx_v_6bsfast_kd2 * pow(__pyx_v_x, 2.0)))
        + (__pyx_v_6bsfast_kd3 * pow(__pyx_v_x, 3.0)))
        + (__pyx_v_6bsfast_kd4 * pow(__pyx_v_x, 4.0)))
        + (__pyx_v_6bsfast_kd5 * pow(__pyx_v_x, 5.0)))
        + (__pyx_v_6bsfast_kd6 * pow(__pyx_v_x, 6.0))),
    (-16.0))));
```

Optimizing N(x) alone lowered the CPU time from 44 to 29 seconds.

Optimizing the remaining functions (not shown here) lowered it further to 19 seconds. At that point, I thought nothing else could be done.

The replacement of the "vanilla" functions with the fast versions is carried out by a new function in the blackscholes.py module. The module's client calls patch_for_fast() if it needs better-than-usual performance:

```python
def patch_for_fast():
    global Call, Put
    import bsfast
    teste = testes[0][0:5]
    assert(abs(Call(*teste) - bsfast.Call(*teste)) < 0.000000001)
    assert(abs(Put(*teste) - bsfast.Put(*teste)) < 0.000000001)
    Call = bsfast.Call
    Put = bsfast.Put
```
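The pattern is: import the compiled module, sanity-check it against the pure-Python reference on a known test case, then rebind the module-level names. A self-contained toy version of the same shape (the two Call functions here are hypothetical stand-ins, since bsfast needs compiling):

```python
import math

def Call(x):
    # stand-in for the pure-Python blackscholes.Call (hypothetical)
    return math.exp(x) - 1.0

def fast_Call(x):
    # stand-in for the compiled bsfast.Call (hypothetical)
    return math.expm1(x)

def patch_for_fast():
    # same shape as patch_for_fast() above: verify agreement on a
    # test case before swapping the slow implementation out
    global Call
    teste = 0.25
    assert abs(Call(teste) - fast_Call(teste)) < 1e-9
    Call = fast_Call

patch_for_fast()
```

The up-front assertion is cheap insurance: if a future edit makes the Cython module drift from the Python reference, the patch fails loudly instead of silently producing wrong prices.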

Finally, I found a performance bug in my simulation: both profit-locking and stop-loss functions were calculating options' values, duplicating the same work.

Solving this problem shrank the CPU time to 13 seconds, which is compatible with the CPU time of the other simulations (taking into account that the calendar simulation is more complex than the others). Mission accomplished.

UPDATE: After a hint from Fredrik Johansson, I replaced pow() with simple multiplications in the Cython version of the N() function. This simple change lowered the simulation time even further, from 13 to 9 seconds, a further ~30% reduction!

The newest version of N(x) looks like this:

```cython
cdef double N(double x):
    if x < 0:
        return 1 - N(-x)
    cdef double xp, p
    xp = x      # x ** 1
    p = 1.0 + kd1 * xp
    xp *= x     # x ** 2
    p += kd2 * xp
    xp *= x     # x ** 3
    p += kd3 * xp
    xp *= x     # x ** 4
    p += kd4 * xp
    xp *= x     # x ** 5
    p += kd5 * xp
    xp *= x     # x ** 6
    p += kd6 * xp
    p *= p      # ** 2
    p *= p      # ** 4
    p *= p      # ** 8
    p *= p      # ** 16
    return 1.0 - 0.5 * (1.0 / p)  # ** -16
```

The Python version of N(x) (which is optimized by Psyco) also benefits from the equivalent change (avoiding the ** operator). Running the simulation without the Cython fast functions, it used to take 27 seconds and now takes only 22.
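For completeness, here is a sketch of that equivalent pure-Python rewrite side by side with the ** form, with the coefficients kept in a tuple for brevity (the original stores them as attributes of N). The two forms compute the same polynomial, the rewrite just builds the powers of x incrementally and replaces p ** -16 with four squarings and one division:

```python
KD = (0.0498673470, 0.0211410061, 0.0032776263,
      0.0000380036, 0.0000488906, 0.0000053830)

def N_pow(x):
    # original form: ** operator with a -16.0 exponent
    if x < 0:
        return 1.0 - N_pow(-x)
    p = 1.0 + sum(k * x ** (i + 1) for i, k in enumerate(KD))
    return 1.0 - 0.5 * p ** -16.0

def N_mul(x):
    # rewritten form: multiplications only, as in the Cython version
    if x < 0:
        return 1.0 - N_mul(-x)
    xp = x
    p = 1.0 + KD[0] * xp
    for k in KD[1:]:
        xp *= x          # next power of x
        p += k * xp
    p *= p               # p ** 2
    p *= p               # p ** 4
    p *= p               # p ** 8
    p *= p               # p ** 16
    return 1.0 - 0.5 * (1.0 / p)

# both forms agree to well below the approximation's own error
for x in (-2.0, -0.5, 0.0, 0.7, 1.5, 3.0):
    assert abs(N_pow(x) - N_mul(x)) < 1e-12
```

The win comes from replacing generic floating-point exponentiation, which goes through log/exp internally, with a handful of plain multiplications that the CPU executes directly.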