Here at RJMetrics we pride ourselves on providing one of the leading hosted solutions in business analytics. One of our product’s highlights is the ability to replicate and warehouse our clients’ data on our own servers. Replication is not an easy task: it is driven by a set of very complex algorithms spanning a number of different modules. Such algorithms are prone to latent bugs hidden away in edge cases, just waiting for the right time to reveal themselves.
Why logging is not enough
No matter how good a developer you are, bugs are inevitable, and the best way to deal with them is to always expect them. That is why every good software engineer should provide some means of error tracking. Traditionally this is achieved through logging, also known as tracing. Logging has the benefit of allowing the development team to go back at any given time and trace the exact code path an operation took.
Logging is not a silver bullet, though. For it to be useful, an alerting mechanism has to be in place that makes it obvious to the development team that something is wrong and that someone has to attend to it. Without a good reporting mechanism, bugs in fundamental areas of the system are never discovered and resolved, and any further enhancements are made on the false assumption that the core of the system is reliable.
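To make that concrete, here is a minimal sketch (in Python, purely for illustration; the post doesn’t show our actual code) of pairing a plain log file with an alerting handler, so errors are surfaced to the team instead of silently accumulating. The notify_team function is a hypothetical stand-in for whatever channel an alert goes to.

```python
import logging

def notify_team(message: str) -> None:
    # Hypothetical stand-in for whatever alerting channel a team uses
    # (chat webhook, pager, ticket system, etc.).
    print(f"ALERT: {message}")

class AlertingHandler(logging.Handler):
    """Forward ERROR-and-above log records to an alerting channel,
    so failures get noticed instead of sitting silently in a log file."""

    def emit(self, record: logging.LogRecord) -> None:
        notify_team(self.format(record))

logger = logging.getLogger("replication")
logger.setLevel(logging.INFO)
logger.addHandler(logging.FileHandler("replication.log"))  # full trace for later debugging
alert_handler = AlertingHandler(level=logging.ERROR)       # alerts only on failures
alert_handler.setFormatter(logging.Formatter("%(asctime)s %(name)s: %(message)s"))
logger.addHandler(alert_handler)

try:
    raise RuntimeError("replication job failed for table 'orders'")
except RuntimeError:
    logger.exception("unexpected error during replication")
```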
The problem with email alerts
Usually development teams handle error reporting by injecting email alerts wherever an exception is caught. These emails often go to a single recipient or to a mailing list. The problem with alerts going to a single person is a complete lack of transparency: the rest of the team has no clue about these issues unless they are forwarded or delegated.
Email alerts are not any more useful when sent to a whole set of developers, because they provide no accountability: everyone on that mailing list will just expect everyone else to attend to them. Furthermore, such alerts eventually get lost in someone’s inbox, with no good way to go back to them or track them.
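For reference, the catch-and-email pattern being described usually looks something like the following sketch. The addresses, SMTP relay, and job object are hypothetical; the point is only that the alert disappears into an inbox with no owner and no tracking attached.

```python
import smtplib
import traceback
from email.message import EmailMessage

def email_alert(subject: str, body: str) -> None:
    # Hypothetical addresses and local relay; only the shape of the pattern matters.
    msg = EmailMessage()
    msg["From"] = "alerts@example.com"
    msg["To"] = "dev-team@example.com"
    msg["Subject"] = subject
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

def run_replication_job(job):
    try:
        job.run()  # hypothetical job object
    except Exception:
        # The alert fires, but nothing records who owns it or whether it was resolved.
        email_alert("Replication job failed", traceback.format_exc())
        raise
```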
Our take on error reporting
That exact scenario was playing out when I first started working here. One of my first assignments was to replace those emails, which were annoying at best, with a much better system that would allow us to be more aggressive and proactive in resolving bugs. The initial solution was:
1. Provide a web interface through which at any time we could see any errors that occurred in the replication process
2. The interface would include detailed information, including a full stack trace, exception messages, and the frequency at which an issue has been occurring
3. Reporting would occur for every distinct combination of replication job type and construct (a rough sketch of how such records might be kept follows this list).
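As an illustration of that last point, here is one way such per-(job type, construct) records could be kept, assuming a SQLite-backed store. The table name, columns, and record_error helper are illustrative guesses rather than our actual schema.

```python
import sqlite3
import traceback
from datetime import datetime, timezone

# Hypothetical schema: one row per distinct (job type, construct) combination,
# holding the latest stack trace and a running count of occurrences.
conn = sqlite3.connect("replication_errors.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS replication_errors (
        job_type    TEXT NOT NULL,
        construct   TEXT NOT NULL,
        message     TEXT,
        stack_trace TEXT,
        occurrences INTEGER NOT NULL DEFAULT 1,
        first_seen  TEXT NOT NULL,
        last_seen   TEXT NOT NULL,
        PRIMARY KEY (job_type, construct)
    )
""")

def record_error(job_type: str, construct: str, exc: Exception) -> None:
    """Upsert an error record so a web interface can show both the details
    and how often each distinct issue has been happening."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute("""
        INSERT INTO replication_errors
            (job_type, construct, message, stack_trace, first_seen, last_seen)
        VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT (job_type, construct) DO UPDATE SET
            message     = excluded.message,
            stack_trace = excluded.stack_trace,
            occurrences = occurrences + 1,
            last_seen   = excluded.last_seen
    """, (job_type, construct, str(exc), traceback.format_exc(), now, now))
    conn.commit()
```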
You might rightfully ask, “How was that any better than emailing everyone on the team?” It wasn’t much different, but it allowed us to make our error reporting even more useful by taking small incremental steps, with no significant effort from our development team.
Our next step was to dedicate one person in every sprint cycle to attend to all the issues being reported, making sure that false positives were filtered out and critical issues were resolved. This was already working much better than the email alert system, but it still did not provide enough transparency to the rest of the team: unless we all took the time to check, we wouldn’t know how many errors had been generated and how many of them had been resolved.
The solution was to integrate our error tracking and reporting mechanism with FogBugz. New tickets get created when replication job errors occur, and they are handled just like any other high-priority tickets in our sprint. We can now track replication job errors through our familiar ticket tracking system, and even get to use our own product to analyze the bugs in our system more intelligently.
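Here is a minimal sketch of what automated case creation can look like, assuming a FogBugz instance that exposes the BugzScout intake endpoint (scoutSubmit.asp). The URL, project, and area names below are hypothetical, and this is not necessarily the exact integration we built.

```python
import urllib.parse
import urllib.request

# Hypothetical FogBugz URL; BugzScout groups submissions that share the same
# Description into a single case and bumps its occurrence count.
FOGBUGZ_SCOUT_URL = "https://example.fogbugz.com/scoutSubmit.asp"

def file_replication_error(job_type: str, construct: str, stack_trace: str) -> None:
    """Open (or bump) a FogBugz case for a distinct replication job error."""
    fields = {
        "ScoutUserName": "Replication Monitor",  # FogBugz user the case is opened by (assumed)
        "ScoutProject": "Data Replication",      # assumed project name
        "ScoutArea": "Job Errors",               # assumed area name
        "Description": f"Replication error: {job_type} / {construct}",  # dedup key
        "Extra": stack_trace,                    # full details attached to the case
        "ForceNewBug": "0",                      # reuse an existing open case if one matches
    }
    data = urllib.parse.urlencode(fields).encode("utf-8")
    with urllib.request.urlopen(FOGBUGZ_SCOUT_URL, data=data) as response:
        response.read()
```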
Conclusion
In just a few sprint cycles we were able to bring the number of replication job errors down from affecting around 20 clients at any given time to close to none. We uncovered and resolved some really obscure bugs, and we can now be confident that when something goes wrong we will address it in a timely manner, before it affects any of our clients.