Reifying problems in our software: a short story

Yet another lesson from a large project with legacy code. Have you ever seen “known issues” in software that you have worked on? And how many times have you seen them “solved” with “quick fixes” or “workarounds”?

Intro

Sometimes there are errors that are known, and we have to learn to live with them. Why? Maybe because we do not have time to fix them, or we think that they are temporary or not relevant enough, or getting to the root cause would be too difficult! In the meantime, though, we have to keep our system running and making $$$.

A real-life example

We were working on an order import system. The responsibility of this piece of software was to read orders from external marketplaces (like Amazon or eBay) and create records of those orders in our system. This basically meant incorporating them into the rest of the application flow and triggering the whole fulfillment process for each order, shipping included.

The first version of this order importing process was naive and had little error handling. Later on, we discovered many of the problems that could happen because of that. Some of them were “solvable” and others were not (like catalog items being out of sync).

Approach to solving the problem

The natural approach, and the one we initially chose, was to start fixing the problems right away. Soon, though, the time needed to put together a fix for each error or inconsistency became too high, because we had to dig through many log events, long stack traces, and inconsistent data, and then use those pieces of information to reconstruct in our minds what could have happened. We tried to improve log events and exception reports, but at some point we saw that some of the problems we were having were not exceptional behavior, but rather part of a flawed process. We had to recognize that even if the results were not always as expected, they were a part of our process, and getting rid of them completely would take a lot of time.

So we realized that, if we wanted to continue with this process, and these “problems” were going to be a natural part of it, we had to introduce a new concept into our model that would let us work with them. This new concept should enable us to detect whether or not we were having a problem of a particular sort, record it, and treat it somehow.

That’s when the concept of an “unfulfillable order” appeared (we called it UFO :-)). Was it something that we knew from the beginning? No. Was it something that would have existed if we had built a better order importer from the beginning? Probably not! But given the situation at that time, we had to do something about it. So we acknowledged that the problem had its own “entity” and that it had to be modeled just like any other concept in the domain.
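To make the idea more concrete, here is a minimal Python sketch of what “detect, record, and treat” could look like. It is not the code from our project: the names (UnfulfillableOrder, UfoReason, import_order, create_order), the fields, and the shape of the incoming order are all assumptions made for illustration. The only problem category taken from the story is the out-of-sync catalog.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class UfoReason(Enum):
    # Hypothetical categories; only the catalog case comes from the story.
    CATALOG_OUT_OF_SYNC = "catalog_out_of_sync"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class UnfulfillableOrder:
    """A first-class record of an order we could not import (a "UFO")."""
    marketplace: str
    external_order_id: str
    reason: UfoReason
    detail: str
    detected_at: datetime


class CatalogItemNotFound(Exception):
    """Raised when an order references a SKU missing from our catalog."""


def create_order(raw_order: dict, catalog: set[str]) -> dict:
    # Stand-in for the real order creation logic (assumed, not from the story).
    for sku in raw_order["skus"]:
        if sku not in catalog:
            raise CatalogItemNotFound(sku)
    return {"order_id": raw_order["id"], "skus": list(raw_order["skus"])}


def import_order(raw_order: dict, marketplace: str, catalog: set[str]):
    """Return the created order, or an UnfulfillableOrder describing why not."""
    try:
        return create_order(raw_order, catalog)
    except CatalogItemNotFound as exc:
        reason, detail = UfoReason.CATALOG_OUT_OF_SYNC, f"missing SKU {exc}"
    except Exception as exc:  # anything we have not classified yet
        reason, detail = UfoReason.UNKNOWN, repr(exc)
    return UnfulfillableOrder(
        marketplace=marketplace,
        external_order_id=str(raw_order.get("id", "?")),
        reason=reason,
        detail=detail,
        detected_at=datetime.now(timezone.utc),
    )
```

The point of the sketch is that a failed import produces a value with a name and a reason instead of just a stack trace in a log, so it can be stored and queried like any other entity.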

Starting to see some benefits

After building the unfulfillable order model, our code became much simpler because every incoming order started falling into one of these buckets:

  • Successful orders
  • Unfulfillable orders

That gave us a lot of visibility because now we could count those orders, group them by day and/or marketplace, and analyze patterns among the product data attached to them (this was actually helpful to identify errors associated with some particular products or shipping countries).
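As an illustration of the kind of analysis this enables, the following Python sketch groups UFO records by day and marketplace, and by shipping country. The record shape (UfoRecord and its fields) is hypothetical; in practice these would more likely be queries against the UFO table.

```python
from collections import Counter
from datetime import date
from typing import Iterable, NamedTuple


class UfoRecord(NamedTuple):
    # Hypothetical subset of the columns a UFO record might carry.
    marketplace: str
    detected_on: date
    shipping_country: str


def count_by_day_and_marketplace(records: Iterable[UfoRecord]) -> Counter:
    """Count UFOs per (day, marketplace) pair."""
    return Counter((r.detected_on, r.marketplace) for r in records)


def count_by_country(records: Iterable[UfoRecord]) -> Counter:
    """Count UFOs per shipping country, to spot country-specific errors."""
    return Counter(r.shipping_country for r in records)
```

Calling count_by_country(records).most_common(3), for example, would list the three shipping countries with the most unfulfillable orders.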

We also built some metrics and alerts on top of those records. For instance, if the UFO percentage exceeded the 5% threshold, it required our attention.
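A check like that can be very small. The sketch below assumes the percentage is computed over some window of imported orders (for example, one day), which is a detail the story does not specify. The function names are made up for illustration; only the 5% threshold comes from our setup.

```python
UFO_ALERT_THRESHOLD = 0.05  # the 5% threshold mentioned above


def ufo_rate(successful: int, unfulfillable: int) -> float:
    """Fraction of imported orders that ended up as UFOs."""
    total = successful + unfulfillable
    return unfulfillable / total if total else 0.0


def needs_attention(
    successful: int,
    unfulfillable: int,
    threshold: float = UFO_ALERT_THRESHOLD,
) -> bool:
    """True when the UFO rate is strictly above the threshold."""
    return ufo_rate(successful, unfulfillable) > threshold
```

Whether the alert fires at exactly 5% or only above it is a design choice; this sketch fires only when the rate exceeds the threshold.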

We spent less than a week building the UFO table plus the logic to create those records, and then we were able to quickly identify and fix related bugs. Without this tracking data, it would have been impossible to analyze the impact of each bug, but now we could be sure that our focus was on the most critical ones.

Having done this, we started to work on reducing the number of UFOs by making our order importer more fault-tolerant (actually what we all wanted from the very beginning!). We took this very seriously, and we created a cheer to end each daily standup meeting (“UFOs to Zero!” with a clap at the end). Although it sounds silly, it really helped us stay focused on the goal!

Conclusions

  • Software should always reflect reality. Sometimes problems gain so much presence that they become part of the system itself. But we can build logic to auto-recover from those problems and learn from them.
  • If we have well-known problems in our system, and we can identify them, it’s beneficial to keep track of them in the code. A comment, a log message, a record in a database: something that gives “entity” to the problem so that you can see it. This means allowing yourself to set aside the idealistic vision of “I shouldn’t build software to track the errors we have, I’d rather fix them”.
  • As software evolves and you find the root cause of a problem, you can remove the lines of code and other pieces of software related to it (once you are sure the problem is fixed), as you would in a regular refactor.
  • It is useful to track the problems in the project history, including when something started to fail, when the failure was identified, how long it lived in the system, and when we got rid of it.