Plugging leaks in Python

Python applications do leak memory. Generally this is not the fault of the language itself but of application bugs, and applications written in any language (JavaScript, Ruby, etc.) can and will suffer from similar issues. This kind of problem may actually be harder to debug than in C/C++, because you don't have a tool like Valgrind that promptly shows the leaks.

Recent versions of Python have a true garbage collector that breaks cyclical references, but you may still leak a lot of memory by keeping object references in forsaken corners of your code.
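A minimal sketch of such a forsaken corner, assuming a made-up Session class and a made-up module-level list named _recent (neither comes from the application discussed here). A weak reference is used as a probe because it does not keep the object alive:

```python
import gc
import weakref


class Session(object):
    """Hypothetical application object, used only for this illustration."""


_recent = []   # hypothetical forsaken corner: a module-level list


def handle_request():
    s = Session()
    _recent.append(s)        # forgotten bookkeeping keeps s reachable
    return weakref.ref(s)    # weak reference: does not keep s alive


probe = handle_request()
gc.collect()
alive_before = probe() is not None   # True: the list still holds the object

del _recent[:]                       # drop the forgotten reference
gc.collect()
alive_after = probe() is not None    # False: nothing refers to it anymore
print("alive before: %s, after: %s" % (alive_before, alive_after))
```

No cycle is involved here, so the garbage collector has nothing to fix: the object is simply still referenced.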

Another common cause of memory leaks is the presence of a __del__ method in a class, which (in Python versions before 3.4) prevents the garbage collector from breaking cycles that involve instances of that class. More often than not, people implement __del__ without knowing exactly why, or what the consequences are. The uncollected object then keeps references to others, which keep references to still others, and suddenly 90% of your object pool is tied down.

Unfortunately, my application was leaking so much memory in this manner that it became sluggish after half an hour of use. So I had to hunt down which objects were not being freed, and why. I managed to improve the situation a lot by breaking references "manually" (setting all references to other objects to None when the class had an explicit unload method), until I found the real culprit: three classes that had __del__ methods for no actual reason.
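Breaking references by hand could look roughly like the sketch below. The Document and View classes are hypothetical stand-ins, not the classes of the application described above; only the unload pattern is the point. The collector is switched off during the demo to show that plain reference counting frees the object once the cycle is broken:

```python
import gc
import weakref


class Document(object):
    """Hypothetical class; only the unload pattern matters here."""

    def __init__(self):
        self.view = View(self)      # Document -> View

    def unload(self):
        # Break both directions of the cycle by hand.
        self.view.document = None
        self.view = None


class View(object):
    """Hypothetical class that points back at its Document."""

    def __init__(self, document):
        self.document = document    # View -> Document: closes the cycle


gc.disable()    # keep the cycle collector out of the picture for the demo
doc = Document()
probe = weakref.ref(doc.view)
doc.unload()
freed_without_gc = probe() is None   # True: the View is already gone
gc.enable()
print(freed_without_gc)
```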

The technique I put together (with the help of a lot of Googling) exploits some features of the garbage collector module (gc).

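import gc

gc.collect()  # force a collection first
objects = gc.get_objects()
objects_id = {}  # IDs of every object alive right now
for o in objects:
    objects_id[id(o)] = True
# gc.garbage now holds what the collector could not free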

In this code, I force a garbage collection, so I won't see collectable cyclic references, and then I get the complete pool of active objects. There will be several thousand of them at minimum, since everything the collector tracks can be found in the pool: functions, methods, modules, class instances, etc.
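To get a first impression of what the pool contains, the objects can be counted by type. This is only a sketch; the helper name count_objects_by_type is made up for illustration:

```python
import gc
from collections import Counter


def count_objects_by_type(limit=10):
    """Return the `limit` most common type names among gc-tracked objects."""
    gc.collect()   # discard collectable cycles first
    counts = Counter(type(o).__name__ for o in gc.get_objects())
    return counts.most_common(limit)


for name, count in count_objects_by_type():
    print("%-30s %d" % (name, count))
```

Running it twice, some minutes apart, and comparing the counts is one way to spot which types keep piling up.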

The gc.garbage list holds the objects that gc could not collect because it did not know how to break the cycle of references. This typically happens when a class has a __del__ method, which means the developer was supposed to clear the references by hand but did not. It is a very good place to start searching for leaks.
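The effect can be probed with a small experiment. The Finalized class and the del_cycle_outcome helper below are hypothetical. The result depends on the interpreter: on Python 3.4 and later the cycle is collected despite __del__, while older versions are expected to leave it in gc.garbage, so the sketch reports what happened instead of assuming it:

```python
import gc
import weakref


class Finalized(object):
    """Hypothetical class with a __del__ method, used only for this demo."""

    def __init__(self):
        self.other = None

    def __del__(self):
        pass


def del_cycle_outcome():
    """Build a two-object cycle of Finalized instances and report its fate."""
    a, b = Finalized(), Finalized()
    a.other, b.other = b, a          # a <-> b reference cycle
    probe = weakref.ref(a)           # lets us check whether a was freed
    target_id = id(a)
    del a, b                         # drop the only outside references
    gc.collect()
    if probe() is None:
        return "collected"
    if any(id(g) == target_id for g in gc.garbage):
        return "in gc.garbage"
    return "still alive"


print(del_cycle_outcome())
```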

But my application was also keeping objects alive by non-cyclical references, and I needed to find who was keeping these references. In order to do that, I wrote the following code:

import gc
from types import ModuleType, FunctionType, BufferType  # Python 2 names

verbose = 0

# Uses objects_id, the dict of object IDs built in the previous snippet.

def show_referrers(initial_object, backrefs, level):
    cold_trail = True
    lines = []

    for o in gc.get_referrers(initial_object):
        bump = 0
        if id(o) in backrefs:
            # cyclical reference to an object of the trail
            continue
        elif id(o) not in objects_id:
            # object created within this very routine
            continue

        if isinstance(o, (type, ModuleType, FunctionType)):
            # dead end, but at least we are 100% sure
            # this trail does not lead to a cycle
            # lines.append("  "*(level+1) + str(type(o)) +
            #              " " + str(o)[0:80])
            cold_trail = False
            continue

        if isinstance(o, BufferType):
            # uninteresting to print, but must be followed
            pass
        else:
            lines.append("  "*(level+1) + str(type(o)) + " " +
                         str(o)[0:80])
            bump = 1

        if len(backrefs) < 8:
            # extend the trail with this referrer before going deeper
            backrefs_new = backrefs + [id(o)]
            referrers_are_cold_trails, referrers_lines = \
                show_referrers(o, backrefs_new, level+bump)
            lines.extend(referrers_lines)
            cold_trail = cold_trail and referrers_are_cold_trails

    if cold_trail:
        # our introspection was worthless because
        # it only led to cyclical refs
        lines = []

    return (cold_trail, lines)

for o in gc.get_objects():
    print o
    if verbose >= 2:
        # compare by identity, not with "in", to sidestep custom __eq__
        if any(o is g for g in gc.garbage):
            print o
            print "    In gc.garbage (possible cause: " \
                  "presence of __del__ method)"
            cold_trail, lines = show_referrers(o, [id(o)], 1)
            for line in lines:
                print line

It is centered around the gc.get_referrers() function, which returns the objects that hold references to a given object. Since the primary reference is most likely held by a list or a dictionary, we then need to find what refers to that list or dict, and so on.
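The two-hop walk described above can be tried in isolation. In this sketch the Payload class and the registry dict are invented for the example; gc.get_referrers() is the only real API involved:

```python
import gc


class Payload(object):
    """Hypothetical class standing in for a leaked object."""


registry = {"items": [Payload()]}    # dict -> list -> Payload
leaked = registry["items"][0]

# First hop: the list is among the direct referrers of the object.
referrers = gc.get_referrers(leaked)
held_by_list = any(r is registry["items"] for r in referrers)

# Second hop: the dict is among the referrers of that list.
list_owners = gc.get_referrers(registry["items"])
held_by_registry = any(r is registry for r in list_owners)

print("%s %s" % (held_by_list, held_by_registry))
```

The first call also reports the namespace holding the name leaked, which is one reason a real walk has to filter its results.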

Of course, one object may be referred to by many others, and some references end up forming a cycle. Such cycles may be ignored, because if they were the problem the GC would have resolved them (except in cases where there is a __del__). What keeps the object alive is always a non-cyclic reference. So my code tries to detect and ignore referral paths that lead only to a cycle, calling each one a "cold trail".
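That a plain cycle is harmless can be checked directly. The sketch below uses a hypothetical Peer class without a __del__ method and a weak reference as a probe. Automatic collection is disabled so that only the explicit gc.collect() call can free the cycle:

```python
import gc
import weakref


class Peer(object):
    """Hypothetical class without a __del__ method."""


def make_cycle():
    a, b = Peer(), Peer()
    a.peer, b.peer = b, a       # a <-> b reference cycle
    return weakref.ref(a)       # a and b go out of scope on return


gc.disable()                         # avoid an automatic collection mid-demo
probe = make_cycle()
alive_before = probe() is not None   # True: reference counting cannot free it
gc.collect()
alive_after = probe() is not None    # False: the cycle collector can
gc.enable()
print("alive before: %s, after: %s" % (alive_before, alive_after))
```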

When the ultimate referrer to an object is a module or a function, this fact may or may not be relevant. In my case it was not, so I commented out the code that annotates such objects. If the referrer you are looking for is not printed, try enabling this annotation too.

I used object IDs in objects_id and backrefs since "object in list" may give wrong answers if some involved class implements a custom __eq__. And, due to the low-level nature of these operations, I felt more comfortable using IDs, as if they were C++ pointers.
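A deliberately extreme illustration of the __eq__ problem, using a made-up Agreeable class that claims to be equal to everything:

```python
class Agreeable(object):
    """Hypothetical class whose __eq__ claims equality with everything."""

    def __eq__(self, other):
        return True

    __hash__ = object.__hash__   # keep instances hashable on Python 3


a, b = Agreeable(), Agreeable()
trail = [a]

found_by_equality = b in trail                       # True, yet b is not in the list
found_by_identity = id(b) in [id(x) for x in trail]  # False, which is correct
print("%s %s" % (found_by_equality, found_by_identity))
```

Comparing IDs is only safe while the objects involved are still alive, since an ID can be reused after its object is freed.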