We had a big all-hands off-site meeting. In the runup to it, there were two days that most teams had left out of their planning process, so we ended up running a company wide bug bash to close out all of the existing bugs. Even those pesky minor bugs that are never really a priority. It felt like over those two days my team closed out an entire three week sprint’s worth of bugs by points. After the fact I went back and counted and found that the feeling was right. Some of the bugs had previously been pointed and some were quick estimates I put on them after the fact, but it was just a slew of 1-3 point tickets that all got closed out.
Dealing with ~30 tickets in two days was a furious endeavour for the team. It was really satisfying to deal with all of those tickets that had built up. I know they had been weighing on my mind in the sense that the list just seemed to keep growing and I wasn’t sure what to do about it.
This brought me to another interesting question: why were we so much more effective in this than in a normal sprint? Are we underestimating the stories relative to the bugs? That seems the most obvious answer. Is the low end of the point spectrum too compressed so the difference between a two point task and a three point task isn’t sufficiently granular? I spoke with some of the other members of my team about it and there was some speculation that since we each just took tickets related to what we already knew we just got more done, but that shouldn’t account for all of the difference. There was also a possibility that since all of the pieces were strongly independent we had less communication lag.
Maybe I should just be happy that all of those bugs got dealt with. But, I’d really love to find a way to bring that efficiency to our normal processes. If anyone has any ideas, please share in the comments.
My team has been doing more pair programming recently so I picked up a copy of Pair Programming Illuminated. I had never done a significant amount of pair programming before and while I felt I understood the basics, I was hoping to ramp up on some of the nuances of the practice.
It covers why you should be pair programming, convincing management that you should be able to pair program, the physical environment for local pairing, and common social constructs around different kinds of pairs. All of this is useful information, to varying degrees. Since the book was written in 2003, some of the specifics of the physical environment section didn’t age well – advising the use of 17” monitors most obviously. Both of the evangelizing sections seemed to cover the same ground, and did not seem to be written in a way to try and convince someone who is not already open to the concept. Neither section seemed to be written to the person who isn’t already in favor of doing pair programming. There were lots of references to studies, and some personal anecdotes, but none of it stuck in a way that felt like it would change someone’s mind.
The social aspects were interesting, however most of the section was stuff that felt obvious. If you have two introverts working together then they need to work differently than if you have two extroverts working together. A lot of the time the tips were common sense, and didn’t seem like it was necessary to write it down in the book. I would have liked to see more discussion of getting someone to vocalize more and clearly what they’re thinking about.
I feel like I’m better equipped to do pair programming because of having read this, but I also feel like a long blog post would have been just as good a resource and much more focused. I don’t know what else I would have wanted to fill out the rest of the book.
I recently ran into an interesting issue with an application running in a container. It would fire off a bunch of parallel web requests (~50) and sometimes would get but not process the results in a timely manner. This was despite the application performance monitoring we were using saying the CPU usage during the request stayed very low. After a ton of investigation, I found out a few very important facts that contradicted some assumptions I had made about how containers and the JVM interact.
- We had been running the containers in marathon with a very low CPU allocation (0.5) since they didn’t regularly do much computation. This isn’t a hard cap on resource usage of the container. Instead it is used by Mesos to decide which physical host should run the container and it influences the scheduler of the host machine. More information available on this in this blog post.
- The number of processors the runtime reports is the number of processors the host node has. It doesn’t have anything to do with a CPU allocation made to the container. This impacts all sorts of under the hood optimizations the runtime makes including thread pool sizes and JIT resources allocated. Check out this presentation for more information on this topic.
- Mesos can be configured with different isolation modes that control how the system behaves when containers begin to contest for resources. In my case this was configured to let me pull against future CPU allocation up to a certain point.
This all resulted in the service firing off all of the web requests on independent threads which burned through the CPU allocation for the current time period and the next. So then the results came back and weren’t processed. Immediately we changed the code to only fire off a maximum number of requests at a time. In the longer term we’re going to change how we are defining the number of threads but since that has a larger impact it got deferred until later when we could measure the impact more carefully.
Recently I got pulled into a project to help get a feature that was mostly finished and needed to go through a “final QA round” before being ready for release. I felt that this wouldn’t require much of my time, but as you can imagine, things didn’t quite go as expected. The QA round found about a dozen errors in the new feature that I eventually divided into two classifications: requirements SNAFUs and code quality issues.
The requirements SNAFUs were the sorts of problems where the original programmer built what was explicitly asked for, but QA took the one of everything approach trying all sorts of cases that weren’t specified at all. These sorts of problems can be impactful from a time consumption perspective but aren’t that difficult to fix. The code quality issues are much more pernicious.
Digging into the code itself I quickly found an interesting fact. There were two fields, the currentPlanId and the activePlan, that were being mutated in various portions of the application, generally together. There wasn’t any clear distinction between the active plan and the current plan in the code, and at one point the currentPlanId was being set to the id from the active plan, sort of implying it’s the same thing but with poor naming. There were other places where one or both of them would mutate, and I went about tracing what caused the two to diverge or converge.
On initial page load the two would be different, with the active plan being blank, then when an item was selected on the drop down the two could converge, depending on what was selected. I went and started looking for the tests covering this to see if there would be any clarification of the scenarios that were going on and turned up none. At this point I let others know of my findings and that while the problem seemed minor, there was a bigger quality problem under the hood of the system.
The first code change I made was a relatively minor one affecting when a particular button should show up; adding a special case and another test case started behaving. So far so good. Then I started tweaking the functions that were setting currentPlanId and activePlan. By this point I had managed to figure that current was a chronological state and active was a UI state, but it still wasn’t immediately clear how the system was making decisions about which plan was current. This obscured information seemed to be intertwined with the cause of a lot of the remaining broken cases.
I followed the webservice calls back through various layers microservices to where I knew the information had to be coming from and made an intriguing discovery. The way the frontend was deciding which plan was current was incorrectly being based on the timing between two different web service calls. I started digging around trying to find the right way to set all of this information and started to become clear that the initial architecture was missing a layer to coordinate the requests at the initial page load.
That got everything trending in the right direction. I still want to find some time to work through some additional unit tests and leave the code in a good state rather than just a better state.