Randomness [n]: an invention to catch out programmers
...not just because Allinea DDT makes it so (we're too modest to claim that) but because random things tend to happen more often when you have highly parallel or multithreaded applications. The more parallelism you have, the more chance that something apparently random and bad will happen whilst you're watching it.
When bad things happen, you can fix them: you have a reproducer, pure gold for a programmer. When bad things don't happen, "works for me" is a lousy response to give to a user - the user is not "me" and has no instructions on how to become "me".
I'd like to recount a bug story from when we were developing the scalability in Allinea DDT. The machine was booted, and we debugged 100,000 cores: it worked. It worked every time. Awesome - no-one had done that before!
Then, a curious thing happened: it just stopped working. The machine had been "up" for a few days and we couldn't debug those jobs any more.
A Schroedinbug is something that works until a bug is first observed and thereafter never works again. This machine was used by quantum physicists, so perhaps this was their joke?
The problem was, everything seemed to work every time at 1,000 cores. There was a bug and we couldn't reproduce it except at large scale. We were lucky - we had unlimited access to our scalable debugger and ran at the scale of the problem. Allinea DDT debugged itself. Debugging a debugger is not as implausible as it sounds - and eating your own dog food is easier when it tastes good (dog owners, do not try this at home).
Our favourite parallel stack view and variable comparisons in Allinea DDT led us to the problem: we were looking for the wrong processes. A system library that we relied on had a fatal flaw.
The process ID counter is independent on each physical machine in a supercomputer, increasing every time a process starts. It wraps around to 1 (actually the first available positive integer) when it reaches 32767. However, whenever this wrap-around happened, a bug in a system library was triggered: that library produced wrong answers, which Allinea DDT then used.
With a bug as unlikely as this - what's the chance that the counter is wrapping round for any host in a job? I'll leave you to calculate how many debugging sessions at only 32 processes would be needed to witness the bug with even 50% probability. Run at 100,000 cores and it's not just likely - it's inevitable - that you see the problem.
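For anyone who would rather not do the sum by hand, here is a rough sketch in Python. It assumes a deliberately simple model that is not part of the story itself: each host's counter sits at a uniformly random position, every launched process consumes one PID, and debugging sessions are independent of each other. The function names and the layout of 8 processes per host are illustrative only.

```python
import math

PID_MAX = 32767  # the wrap-around point quoted above


def wrap_probability(procs_per_host, hosts, pid_max=PID_MAX):
    """Chance that at least one host's PID counter wraps during one launch.

    Assumption: the counter on each host is uniformly distributed, so
    launching k processes crosses the wrap point with probability k / pid_max.
    """
    per_host = min(1.0, procs_per_host / pid_max)
    return 1.0 - (1.0 - per_host) ** hosts


def sessions_needed(p_single, target=0.5):
    """Smallest number of independent sessions that reaches the target probability."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single))


# 32 processes on a single host
p_small = wrap_probability(procs_per_host=32, hosts=1)
print(sessions_needed(p_small))  # 710 under this model

# 100,000 processes, assumed here to be spread as 8 per host over 12,500 hosts
p_large = wrap_probability(procs_per_host=8, hosts=12500)
print(round(p_large, 3))
```

Under these assumptions the 32-process case needs several hundred sessions to reach even odds, while a single launch at 100,000 cores is already far more likely than not to hit a wrap on some host.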
We could be forgiven for being a little smug - it's system library 0, Allinea DDT 1 - but there's a sting in the tail of this story.
One of our test cases failed recently - a test case for a function identifying the descendant processes of a process. It failed rarely, less than 1 in 1,000 runs. With Allinea DDT, it was simple enough to reproduce - we ran the test case many times - all in parallel - in a great big debug session. Sure enough the test case failed in seconds.
Under the debugger, an assertion in the test stopped the one process showing the issue at exactly the point of failure. The test case and its assertion were priceless. They pinpointed the problem - and narrowed down the search for a cause. Now we could use the debugger to explain why the problem had happened, not just where.
Two minutes later. The Eureka moment.
Guess what: Process ID wrap-round wasn't handled correctly. This time, it was our fault!
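To illustrate the kind of trap involved, here is a minimal Python sketch of a descendant lookup that makes no assumption about PID ordering. It is not the Allinea DDT code, and the details of the actual fix are not given here. The `descendants` function and the `(pid, parent_pid)` snapshot format are hypothetical, chosen only for the example.

```python
from collections import defaultdict, deque


def descendants(root_pid, process_table):
    """Return the set of PIDs descended from root_pid.

    process_table is an iterable of (pid, parent_pid) pairs, a hypothetical
    snapshot format used only for this sketch. The walk follows parent links,
    so it never relies on a child having a larger PID than its parent.
    """
    children = defaultdict(list)
    for pid, ppid in process_table:
        children[ppid].append(pid)

    found = set()
    queue = deque([root_pid])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child != root_pid and child not in found:
                found.add(child)
                queue.append(child)
    return found


# A parent started just before the wrap; some children started just after it,
# so they have smaller PIDs than their parent.
snapshot = [(32766, 1), (32767, 32766), (300, 32766), (301, 300), (500, 1)]
print(sorted(descendants(32766, snapshot)))  # [300, 301, 32767]
```

Any shortcut along the lines of "children always have higher PIDs than their parent" would miss processes 300 and 301 in this snapshot.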
The moral of the story? We are all capable of creating bugs, however embarrassing, but solving them quickly is what matters. Having good test cases and a great debugger helps!