A look at a slow memory growth case and how recently-added features in Devel::MAT version 0.38 can help get to the bottom of it.

Slow Growth Over Time

Given a process that grows a small amount of memory over time, it can sometimes be difficult to look for the cause of this with only one heapdump file containing just one frozen moment in time. The size of the overall process could be large as compared to the small amount that it grows by. Unless the process is very old and thus has accumulated a lot of growth, it won't easily become apparent where the problem is.

Alternatively, we can find all the SVs in a heap dump that appear to be memory leaks, and see if we can identify some common pattern among them which may point towards an explanation.

The nature of a leaking SV is one that gets allocated at some time, and thereafter is never touched - it always hangs around, having fundamentally the same type at the same address. A newly-added script in the Devel-MAT distribution looks for SVs matching this pattern. To run this, we will need a sequence of heapdump files taken from the process.

Automated Heapdumps

The case we're looking at today involves a suspected leak in bom-feed-distributor, because the process grows in memory usage over time. This was observed by periodic inspection of the VSZ field in the output of the ps utility.

To get some heapdumps out of this program we start by editing its main script, adding a line to load Devel::MAT::Dumper configured to write a heapdump file on receipt of a SIGQUIT signal. This signal handler is designed specifically for this purpose, because because it doesn't stop the process once it's finished, but instead resumes running normally after the file is written.

use Devel::MAT::Dumper qw( -dump_at_SIGQUIT -file /tmp/bom-feed-distributor-NNN.pmat );

Whenever we send a SIGQUIT signal we'll get a new serial-numbered heapdump file in /tmp. We can automate this to run regularly, and so leave it running overnight to collect up some files for analysis later:

$ PID=$(pgrep -f bom-feed-distributor)

$ while sleep $PID; do kill -QUIT 30829; done

Differential Analysis

Now that we have a collection of .pmat files we can apply the pmat-leakreport tool to analyse them. It needs at least three heapdump files to work with, but the more it has the better its analysis will be. It works by comparing the first two files to look for every newly-allocated SV that appears in the second that wasn't in the first. These become the "leak candidates". Then for every subsequent file it looks to see if any of those candidates get released or changed in type - in either case it turns out that candidate wasn't a leak after all. Once it has finished invalidating cases by looking at the remaining files, whatever is still left as a candidate is printed in a list. These are usually good places to continue investigating.

$ pmat-leakreport -1 bom-feed-distributor-2*.pmat >leakreport.txt

We'll take the entire series of ten files numbered 20 to 29, as they'll be sufficiently far from the startup of the process that it should be settled in nicely, so as to minimise any false positives. We direct the output to a file in case the leak report is quite long - we'll be wanting to refer back to it repeatedly. Generating it is computationally-intensive, so it's nice to keep it on disk to refer back to.

Looking at the start of this file we can see a number of SVs that might make interesting candidates to investigate:

$ head -10 leakreport-2.txt 
LEAK[1] SCALAR(UV,NV,PV) at 0x40ed0d0
LEAK[1] SCALAR(UV) at 0x5ed26d8
LEAK[1] REF() at 0x5f76ec8
LEAK[1] SCALAR(PV) at 0x5ecbf88
LEAK[1] SCALAR(UV) at 0x5e6e9d8
LEAK[1] REF() at 0x5fad540
LEAK[1] REF() at 0x5e59228
LEAK[1] REF() at 0x5fac498
LEAK[1] SCALAR(UV) at 0x5fd15d8
LEAK[1] SCALAR(UV) at 0x5f18998

Lets start by picking one of these and seeing what it is. Since these SVs are ones that have remained undisturbed until the end of the sequence of .pmat files, we can load the latest one into the pmat shell and try to identify it:

$ pmat bom-feed-distributor-29.pmat 
Perl memory dumpfile from perl 5.14.2
Heap contains 266212 objects
pmat> identify 0x5f76ec8
REF() at 0x5f76ec8 is:
└─not found

Well this is already somewhat exciting. A leaked SV that identify isn't able to find. This may be because it is no longer referenced from anywhere but still has a nonzero refcount, or because it's referenced by something that Devel::MAT::Dumper can't see.

Unidentifiable SVs

Lets try to work out what this reference is, to see if we can learn more about it:

pmat> show 0x5f76ec8
REF() at 0x5f76ec8 with refcount 1
  size 24 bytes
  RV=HASH(14)=Date::Utility at 0x5f60f88

It's a Date::Utility instance. Lets take a look at another SV and see if we can find a pattern:

pmat> identify 0x5ed26d8
SCALAR(UV) at 0x5ed26d8 is:
└─value {month} of HASH(14)=Date::Utility at 0x5f02008, which is:
  └─the referrant of REF() at 0x60c9730, which is:
    └─not found

pmat> identify 0x40ed0d0
SCALAR(UV,NV,PV) at 0x40ed0d0 is:
└─value {epoch} of HASH(14)=Date::Utility at 0x5ef57b8, which is:
  └─the referrant of REF() at 0x5f7f910, which is:
    └─not found

A pattern begins to emerge. Most of these leaking SVs are related to Date::Utility instances. Perhaps there are quite a few in this file - we should count the total of them. A newly-added ability in Devel::MAT 0.38 allows us to do this, by using the --count option to the find command. We ask it to count the number of SVs blessed into this package:

pmat> find --count blessed Date::Utility
Total: 308 SVs

That's quite a few instances of a utility object class to have hanging around a server instance. We can use the previous files to discover the history here; using the --quiet option to the pmat shell to reduce unrelated noise in the output:

$ for F in bom-feed-distributor-*.pmat; do echo -n "$F: "; pmat --quiet $F find --count blessed Date::Utility; done
bom-feed-distributor-00.pmat: Total: 22 SVs
bom-feed-distributor-01.pmat: Total: 45 SVs
bom-feed-distributor-02.pmat: Total: 301 SVs
bom-feed-distributor-03.pmat: Total: 301 SVs
bom-feed-distributor-04.pmat: Total: 301 SVs
bom-feed-distributor-05.pmat: Total: 301 SVs
bom-feed-distributor-06.pmat: Total: 301 SVs
bom-feed-distributor-07.pmat: Total: 301 SVs
bom-feed-distributor-29.pmat: Total: 308 SVs

It seems that these objects accumulate fairly early on in program startup, over the first few hours, and thereafter remain relatively stable at just over 300.

Clues In Source Code

At this point, I'll admit I got stuck trying to think of what might be going on, so I asked around to see if anyone else had ideas. Tom Molesworth observed the following in Date/Utility.pm line 50:

my $lru = tie %popular, 'Tie::Hash::LRU', 300;

Here we see a cache implemented by the Tie::Hash::LRU package, which keeps 300 objects around. That sounds like a very likely cause of the issue we see here. Furthermore, this package uses an XS-based implementation and so may well have Perl object references stored somewhere inside C-based code where Devel::MAT::Dumper can't see. This would explain why identify was unable to provide any further information about them.

Here we reach an eventual conclusion that is perhaps not quite as exciting as in previous articles. We've discovered that, although the memory requirement of this program does slowly increase initially, it does eventually reach a stable level after it has been running for some time. These Date::Utility instances do hang around for a while, but intentionally so as part of a caching mechanism designed to reduce object churn for commonly used values. This is not actually a memory leak as such, but instead an effect of object caching.

Additionally, we've seen that there are limits to how far Devel::MAT::Dumper can see. Specifically, that things like XS-implemented modules can confuse it by breaking visibility of the relationship between SVs that live on the heap, causing us not to see certain details.

Finally, it's worth repeating the fact that sometimes when analysing a problem and you get stuck, talking it over with co-workers and colleagues can often reveal new hints of idea that can help resolve the issue.