Recently I got pulled into a project to help get a feature that was mostly finished and needed to go through a “final QA round” before being ready for release. I felt that this wouldn’t require much of my time, but as you can imagine, things didn’t quite go as expected. The QA round found about a dozen errors in the new feature that I eventually divided into two classifications: requirements SNAFUs and code quality issues.
The requirements SNAFUs were the sorts of problems where the original programmer built what was explicitly asked for, but QA took the one of everything approach trying all sorts of cases that weren’t specified at all. These sorts of problems can be impactful from a time consumption perspective but aren’t that difficult to fix. The code quality issues are much more pernicious.
Digging into the code itself I quickly found an interesting fact. There were two fields, the currentPlanId and the activePlan, that were being mutated in various portions of the application, generally together. There wasn’t any clear distinction between the active plan and the current plan in the code, and at one point the currentPlanId was being set to the id from the active plan, sort of implying it’s the same thing but with poor naming. There were other places where one or both of them would mutate, and I went about tracing what caused the two to diverge or converge.
On initial page load the two would be different, with the active plan being blank, then when an item was selected on the drop down the two could converge, depending on what was selected. I went and started looking for the tests covering this to see if there would be any clarification of the scenarios that were going on and turned up none. At this point I let others know of my findings and that while the problem seemed minor, there was a bigger quality problem under the hood of the system.
The first code change I made was a relatively minor one affecting when a particular button should show up; adding a special case and another test case started behaving. So far so good. Then I started tweaking the functions that were setting currentPlanId and activePlan. By this point I had managed to figure that current was a chronological state and active was a UI state, but it still wasn’t immediately clear how the system was making decisions about which plan was current. This obscured information seemed to be intertwined with the cause of a lot of the remaining broken cases.
I followed the webservice calls back through various layers microservices to where I knew the information had to be coming from and made an intriguing discovery. The way the frontend was deciding which plan was current was incorrectly being based on the timing between two different web service calls. I started digging around trying to find the right way to set all of this information and started to become clear that the initial architecture was missing a layer to coordinate the requests at the initial page load.
That got everything trending in the right direction. I still want to find some time to work through some additional unit tests and leave the code in a good state rather than just a better state.