Bug Bash

We had a big all-hands off-site meeting. In the run-up to it, there were two days that most teams had left out of their planning process, so we ended up running a company-wide bug bash to close out all of the existing bugs, even those pesky minor bugs that are never really a priority. It felt like over those two days my team closed out an entire three-week sprint's worth of bugs by points. After the fact I went back and counted and found that the feeling was right. Some of the bugs had previously been pointed and some were quick estimates I put on them after the fact, but it was a slew of 1-3 point tickets that all got closed out.

Dealing with ~30 tickets in two days was a furious endeavour for the team, but it was really satisfying to clear out all of those tickets that had built up. They had certainly been weighing on my mind; the list just seemed to keep growing and I wasn't sure what to do about it.

This brought me to another interesting question: why were we so much more effective in this than in a normal sprint? Are we underestimating the stories relative to the bugs? That seems the most obvious answer. Is the low end of the point spectrum too compressed, so the difference between a two-point task and a three-point task isn't sufficiently granular? I spoke with some of the other members of my team about it, and there was some speculation that since we each took tickets related to what we already knew, we simply got more done, but that shouldn't account for all of the difference. There was also the possibility that since all of the pieces were strongly independent we had less communication lag.

Maybe I should just be happy that all of those bugs got dealt with. But I'd really love to find a way to bring that efficiency to our normal processes. If anyone has any ideas, please share in the comments.


Java Containers on Mesos

I recently ran into an interesting issue with an application running in a container. It would fire off a bunch of parallel web requests (~50) and sometimes would receive the results but not process them in a timely manner, despite the application performance monitoring we were using saying the CPU usage during the request stayed very low. After a ton of investigation, I found out a few very important facts that contradicted some assumptions I had made about how containers and the JVM interact.

  1. We had been running the containers in Marathon with a very low CPU allocation (0.5) since they didn’t regularly do much computation. This isn’t a hard cap on the container’s resource usage. Instead it is used by Mesos to decide which physical host should run the container, and it influences the scheduler of the host machine. More information is available in this blog post.
  2. The number of processors the runtime reports is the number of processors the host node has. It doesn’t have anything to do with the CPU allocation made to the container. This impacts all sorts of under-the-hood optimizations the runtime makes, including thread pool sizes and JIT resources allocated (see the sketch after this list). Check out this presentation for more information on this topic.
  3. Mesos can be configured with different isolation modes that control how the system behaves when containers begin to contest for resources. In my case this was configured to let me pull against future CPU allocation up to a certain point.
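
To make point 2 concrete, here is a minimal sketch of what we saw, at least with the JVM versions we were running: inside a container that Marathon has capped at 0.5 CPUs, the runtime still reports the host's full core count, and anything sized from that number is sized for the whole host.

```scala
object CpuReport extends App {
  // Inside the container this still prints the host's core count, not anything
  // related to the 0.5 CPU allocation. Default ForkJoin pools, Scala's global
  // ExecutionContext, and JIT compiler threads all size themselves from it.
  val reported = Runtime.getRuntime.availableProcessors()
  println(s"availableProcessors = $reported")
}
```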

This all resulted in the service firing off all of the web requests on independent threads, which burned through the CPU allocation for the current time period and the next. So the results came back and weren’t processed. Immediately we changed the code to only fire off a maximum number of requests at a time. In the longer term we’re going to change how we define the number of threads, but since that has a larger impact it got deferred until we could measure the impact more carefully.
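
The interim fix looked roughly like the sketch below; `fetch` and `Result` are stand-ins for our actual HTTP client code, but the batching idea is the same: only a handful of requests are in flight at once.

```scala
import scala.concurrent.{ExecutionContext, Future}

case class Result(body: String)
def fetch(url: String)(implicit ec: ExecutionContext): Future[Result] =
  Future.successful(Result(s"response from $url")) // stand-in for the real HTTP call

// Fire requests in small batches instead of all ~50 at once, so the container
// never burns through its whole CPU allocation handling the responses.
def fetchInBatches(urls: Seq[String], batchSize: Int = 5)
                  (implicit ec: ExecutionContext): Future[Seq[Result]] =
  urls.grouped(batchSize).foldLeft(Future.successful(Vector.empty[Result])) {
    (accFuture, batch) =>
      accFuture.flatMap { acc =>
        Future.traverse(batch)(fetch).map(acc ++ _)
      }
  }
```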

Snipe Hunt

Recently I got pulled into a project to help get a feature that was mostly finished through a “final QA round” before it was ready for release. I felt that this wouldn’t require much of my time, but as you can imagine, things didn’t quite go as expected. The QA round found about a dozen errors in the new feature, which I eventually divided into two classifications: requirements SNAFUs and code quality issues.

The requirements SNAFUs were the sorts of problems where the original programmer built what was explicitly asked for, but QA took the “one of everything” approach, trying all sorts of cases that weren’t specified at all. These sorts of problems can eat up time but aren’t that difficult to fix. The code quality issues are much more pernicious.

Digging into the code itself, I quickly found an interesting fact. There were two fields, the currentPlanId and the activePlan, that were being mutated in various portions of the application, generally together. There wasn’t any clear distinction between the active plan and the current plan in the code, and at one point the currentPlanId was being set to the id from the active plan, sort of implying they were the same thing with poor naming. There were other places where one or both of them would mutate, and I went about tracing what caused the two to diverge or converge.

On initial page load the two would be different, with the active plan being blank; then, when an item was selected in the drop-down, the two could converge, depending on what was selected. I went looking for the tests covering this, hoping for some clarification of the scenarios involved, and turned up none. At this point I let others know of my findings: while the problem seemed minor, there was a bigger quality problem under the hood of the system.

The first code change I made was a relatively minor one affecting when a particular button should show up; after adding a special case, another test case started behaving. So far so good. Then I started tweaking the functions that were setting currentPlanId and activePlan. By this point I had managed to figure out that current was a chronological state and active was a UI state, but it still wasn’t immediately clear how the system was making decisions about which plan was current. This obscured information seemed to be intertwined with the cause of a lot of the remaining broken cases.
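
For anyone trying to follow along, here is a hypothetical sketch of the state as I eventually understood it. The types and extra fields are illustrative, not the actual code, but the split is the important part.

```scala
import java.time.LocalDate

// Illustrative only: "current" is a chronological fact about the data, while
// "active" is purely what the user has selected in the UI.
case class Plan(id: String, startDate: LocalDate)

case class PlanState(
  currentPlanId: Option[String], // the plan in effect right now (chronological)
  activePlan: Option[Plan]       // the plan selected in the drop-down (UI state)
)

// On initial page load activePlan is blank, so the two differ; selecting the
// in-effect plan in the drop-down makes them converge.
```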

I followed the web service calls back through various layers of microservices to where I knew the information had to be coming from and made an intriguing discovery. The way the frontend was deciding which plan was current was incorrectly based on the timing between two different web service calls. I started digging around trying to find the right way to set all of this information, and it started to become clear that the initial architecture was missing a layer to coordinate the requests at the initial page load.
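
The missing layer amounts to something like the sketch below: kick off both calls, wait for both, and decide once from the combined data, rather than letting whichever response lands last win. The function names are stand-ins for the real service calls, not the actual API.

```scala
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical stand-ins for the two independent calls whose relative timing
// was deciding which plan looked "current".
def loadPageState(fetchPlanIds: () => Future[Seq[String]],
                  fetchCurrentPlanId: () => Future[Option[String]])
                 (implicit ec: ExecutionContext): Future[(Seq[String], Option[String])] = {
  // Start both requests, then combine the results in one place so the decision
  // no longer depends on which response arrives first.
  val plansF   = fetchPlanIds()
  val currentF = fetchCurrentPlanId()
  for {
    planIds   <- plansF
    currentId <- currentF
  } yield (planIds, currentId)
}
```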

That got everything trending in the right direction. I still want to find some time to work through some additional unit tests and leave the code in a good state rather than just a better state.

Being a Wizard

A somewhat obscure question got asked in a chat channel at work that I knew the answer to, which helped out another engineer. The question wasn’t anything abnormal; it was about a weird error message coming from an internal library. Searching through the library’s code wasn’t immediately helpful since the unique part of the error message didn’t appear in the code. The reason I knew the answer wasn’t because it was easy, but because I had spent an hour investigating it the day before.

Sometimes when you see someone have an apparently impressive insight, that doesn’t necessarily mean they are better than you; they may just have had an experience which makes the answer obvious to them. This applies to all sorts of other technical activities. During the hackathon I did a similar thing. One of the other devs on the team was integrating the portion of the code I was working on and having trouble. It was immediately obvious to me why, because I had put in the time earlier to figure it out the hard way. Your mind is a powerful pattern matching system. It immediately recognizes this:

 

[Image: happycat] Or this: [Image: ftc]

 

If you think back to when you first started learning calculus, the terminology and symbols were complicated and foreign, but after a while you gained a certain familiarity with them, and eventually they became second nature.

You may go to work and make some business web app in one particular technology stack, but there are all sorts of concepts that go with it that aren’t specific to the business or the tech stack. You’re synthesizing things like design patterns, test-driven development, RESTful web services, algorithms, or just the HTTP stack and everything that goes with it. These are all transferable skills that can help you “cast a spell” and jump past a problem.

When I sat down to learn Scala, it wasn’t that big a task since most of the language features had equivalents I was familiar with in other languages. That let me skip forward to the nuances of those implementations and the few language features I was less familiar with. Getting experience with those ideas in the abstract let me jump ahead on the learning curve and look like a wizard. Some of the common feelings of impostor syndrome are really the worry of being found out, like the wizard behind the curtain.

[Image: wizard behind the curtain]

Book Chat: Zero Bugs and Program Faster

Zero Bugs and Program Faster by Kate Thompson is a book that’s hard to describe. None of it is a really novel way of looking at creating software, but it covers all of the things you would expect to hear about how to do programming well. It’s a breezy and fun read that is divided into enough small sections that you can read it in however much time you have available.

The book is structured in two parts. The first part is a series of short vignettes about programming. Some are more direct, like the chapter on ACID; some are more abstract, like the chapter entitled “The Many Sides of the Elephant.” I appreciated the dual chapters of “Do It Now” and “Do It Later,” which are about how you can’t always do it now but you shouldn’t always defer it either. None of it was a mind-shattering revelation, but it was all solid advice about programming.

The second part is extracts from various programs used to demonstrate a lot of different ideas. The code samples in this part were generally significantly older, mostly in assembly or C. The low-level nature of the examples made it more difficult for me to appreciate. Seeing Altair assembly from the 70s that’s notable for being clever and concise won’t help me build a better web service today.

If you are the sort of person who reads lots of programming books, you will appreciate the book, but you may not get much new from it. If you aren’t the kind of person who reads lots of programming books, some of the more oblique points may be lost on you. I don’t have anything bad to say about it, but I don’t know who I would recommend the book to.

Mongo Play Evolutions

I ran into an odd situation with some Play Framework evolutions for MongoDB and hope to save the next person in this situation some time. I got two messages from it that I wasn’t really expecting. The first was “!!! WARNING! This script contains DOWNS evolutions that are likely destructives” and the second was seemingly more helpful: “mongev: Run with -Dmongodb.evolution.applyProdEvolutions=true and -Dmongodb.evolution.applyDownEvolutions=true if you want to run them automatically, including downs (be careful, especially if your down evolutions drop existing data)”. The big issue I had was that I couldn’t tell why it felt it should be running the downs portion of the evolution at all.

Some digging in the logs showed it wanted to run the down for evolution 71 and the up for evolution 71 as well. This was when I got really confused: why would it attempt to run both the down and the up of the same evolution? I spent a while digging through the code looking at how it decided which evolutions to run, and it turns out it compares the saved content of the evolution that was run at some point in the past with the current content of the evolution. So it recognized that the current 71 evolution was different from the previously applied evolution 71 and was attempting to fix this by running the saved down evolution and then the new up evolution.
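
The decision logic behaves roughly like this sketch. This is my paraphrase of the behavior I observed, not the plugin's actual code:

```scala
case class Evolution(revision: Int, up: String, down: String)

// For each evolution on disk: never applied means run its up; applied but
// changed means run the *saved* down, then the new up; unchanged means skip.
def scriptsToRun(applied: Map[Int, Evolution], onDisk: Map[Int, Evolution]): Seq[String] =
  onDisk.toSeq.sortBy(_._1).flatMap { case (rev, current) =>
    applied.get(rev) match {
      case None                                => Seq(current.up)
      case Some(prev) if prev.up != current.up => Seq(prev.down, current.up)
      case Some(_)                             => Nil
    }
  }
```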

The environment was set up to not run down evolutions, since that usually meant you had screwed up somewhere. We had accidentally deployed a snapshot version to the environment a while back, which is where the unexpected behavior came from. We ended up fixing the problem by breaking the evolution into two different evolutions so there was no down to be run.

Software Performance Economics

There is lots of information out there about how software performance impacts your ability to sell, but I haven’t seen much information about the cost of building higher-performing software. What does it cost to go back and add a caching tier after the fact? What does that cost scale with? I would think it is related to the complexity of usage of whatever it was trying to cache, but I’d love to see hard data. Quantifying the relationship seems like it would be an interesting place for research. Today’s post is mostly on the performance side of the equation; next week I’ll be looking at the scalability side.

There is plenty of discussion in the community of the economics of software construction related to the cost of building quality in versus dealing with the fallout from low quality later. But there is little discussion of the economics of building a high performance solution as opposed to a low performance one. I think this is because low performance is perceived as a design defect and low quality as an implementation defect.

It seems like there are a couple of points at play. There is some fixed cost for building a feature in an application. It has some minimum level of performance, say an n^2 algorithm in the number of customers in the system. When you have ten customers, n^2 is fine; with ten thousand you might have a problem; at a million it is looking ugly. But if you started with n^2, it is now a question of what it costs to get from n^2 to n log n or even better. Premature optimization has a bad reputation, since it is assumed (for good reason) that a better-performing solution would cost more, either in initial cost or in long-term maintenance.

If it takes four developers a month to build something that works at an n^2 performance level, would it take another week to improve it to run at n log n, or would it take another month? And if you wanted to go directly to n log n or better, how would that impact the results?

Imagine writing a sort implementation. Writing either a bubble sort or a quicksort implementation is easy enough, since they’re both out there and well known enough that you can just look them up by name. There are so many available options that, outside of school, most people will never write one. On the other hand, for a more complex sorting situation, maybe there aren’t so many published options to choose from at various levels of optimization. Maybe you need a sort that can take prior sorting actions into account; you’d end up with a selection sort that runs in n^2 time. I had to use this previously, and in the specific case I was working on, the runtime was fine since n was capped at 100, but in other circumstances this would be a severe bottleneck. If n was expected to be truly large, what could I have done? There are concurrent sorting options out there that may have been able to apply more resources to the problem. If n was truly ridiculously large, there are ways to sort data that doesn’t fit into memory too. But this example is still in a well-defined and well-researched problem domain.

The original solution to my specific case was a hideous optimization that approximated the correct answer. This approximation eventually caused trouble when an item that should have been included in the result wasn’t. Once we instead described it in terms of a sort, it afforded us the opportunity to rebuild the solution. We rebuilt it in significantly less time than the original solution (two weeks versus four weeks), since we ended up with a simpler algorithm doing the same thing. Most of the time during the rewrite was spent on research, trying to figure out how to solve the problem; during the initial implementation it was mostly spent on testing and bug fixing. This is the only time I’ve been involved in this sort of rewrite at the application level, which I suspect is a significant outlier.

This post is unfortunately more questions than answers. I don’t have the data to make any good conclusions on this front. But it is clear that building the most efficient implementation regardless of the scale of the business need isn’t the right choice. Understanding the situation is key to deciding what level of performance you need to hit. Not everything is as straightforward as sorting algorithms, but most interesting problems have some level of researched algorithms available for them.

Postscript for anyone who’s interested in the specific problem I was working on: we were trying to figure out what order events should be done in, but some of them couldn’t start until after a specific time, rather than after some other set of events. So we needed to pick the first thing to occur, then the second, etc. Most sorts don’t work that way, but that’s how we ended up with a selection sort: it always selects the first element, then the second element, so we could compute what time the next item would be starting at.
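
Here is a hypothetical sketch of that selection-style scheduling; the names and fields are illustrative rather than our production code. Each pass scans the remaining events for the one that can start earliest given the current clock, which is what makes it n^2.

```scala
case class Event(name: String, durationMinutes: Long, notBefore: Option[Long] = None)

// Repeatedly select the event that can start earliest from what's left, then
// advance the clock past it; an event's earliest possible start is the later
// of the current clock and its notBefore constraint.
def schedule(events: Seq[Event], start: Long = 0L): Seq[(Long, Event)] = {
  @annotation.tailrec
  def loop(clock: Long, remaining: Seq[Event], acc: Vector[(Long, Event)]): Vector[(Long, Event)] =
    if (remaining.isEmpty) acc
    else {
      val next     = remaining.minBy(e => math.max(clock, e.notBefore.getOrElse(0L)))
      val startsAt = math.max(clock, next.notBefore.getOrElse(0L))
      loop(startsAt + next.durationMinutes, remaining.filterNot(_ eq next), acc :+ (startsAt -> next))
    }
  loop(start, events, Vector.empty)
}
```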

Funny Bug Story

I ran into a funny bug and thought that others would see the humor in it. We have multiple QA environments set up: QA1.domain.com, QA2.domain.com, etc. Each environment is running a slightly different version of the web application, some of which are configured slightly differently. One of the least used environments, QA9, got a new version installed and immediately, when anyone attempted to log in, they got redirected back to the login screen. This environment is used for testing hotfixes going out to a production environment for a client who is very picky about taking new versions and was willing to pay for the privilege. This deployment had two changes in it. One was a change (done by my team) to the login code and the other was a feature change in code that clearly wasn’t being executed.

Obviously, at this point we got a call to go investigate what happened. Simultaneously QA9 was rolled back to the last known good version. Our change was already in several other QA environments without incident, so it seemed like it should have been good. We started looking into whether our change depended on some other change that wasn’t in that single environment. We also started considering whether there was some configuration difference in that environment that could have caused the problem. There didn’t seem to be any dependency issues, and we had tested the particular configuration case that was represented by that environment for the feature. For extra fun in investigating, the system didn’t record the login attempts as failed or otherwise, or even log any errors.

By the time we finished this exercise, the rollback had finished and things got weird. The problem didn’t go away, even though the code went back. This got us thinking about things that could persist across versions: cookies, something funky in the database, etc. We cleared cookies and then the problem magically went away. So we logged in to start looking around at the cookies and didn’t find anything out of the ordinary.

A little later one of the QA engineers mentioned that immediately after they had cleared their cookies they could log in, but later they tried to log in again and couldn’t. This immediately piqued our interest. The code that had supposedly caused the problem wasn’t there anymore, the cookies had been cleared, yet it broke again. Things had gotten weirder. We investigated the cookies on that browser; there was one session cookie with the QA9.domain.com domain set and another session cookie with the .domain.com domain set. That was unexpected.

We went and checked out the session cookie code that was deployed to that environment; there was nothing about setting the .domain.com version of the cookie. The code didn’t even look like it would support setting that cookie at all. We spent a while thinking about how this could happen. We tried to get that code to set that cookie and couldn’t. We started asking the QA engineer what he had done between being able to log in and being unable to log in. He listed out a couple of different things; the notable one was that he had logged into a different QA environment.

We went and checked out the session cookie code that was running in that other QA environment and found a whole bunch of new code about changing the domain. BINGO! The other environments had started setting .domain.com cookies, which were conflicting with the cookie for this environment. We reached out to the team that made the change and they rolled it back in those environments. This immediately led to an even funnier problem: now every other environment had the same issue. But once everyone cleared their cookies it all got back to normal, for now.
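
If it helps to picture it, here is a minimal sketch of why the stray cookie mattered; the hostnames and cookie names are made up to match the environments above. A host-only cookie set by QA9 and a Domain=.domain.com cookie set by another environment both match qa9.domain.com, so the browser sends two session cookies and the server reads whichever it happens to see first.

```scala
object StrayCookieDemo extends App {
  case class StoredCookie(name: String, value: String, domain: String, hostOnly: Boolean)

  // Simplified version of the browser's domain-matching rule: a host-only
  // cookie goes back only to the exact host that set it, while a
  // Domain=.domain.com cookie goes to every subdomain of domain.com.
  def sentTo(host: String, jar: Seq[StoredCookie]): Seq[StoredCookie] =
    jar.filter { c =>
      if (c.hostOnly) host == c.domain
      else host == c.domain.stripPrefix(".") || host.endsWith(c.domain)
    }

  val jar = Seq(
    StoredCookie("SESSION", "qa9-session", "qa9.domain.com", hostOnly = true),
    StoredCookie("SESSION", "other-env-session", ".domain.com", hostOnly = false)
  )

  // Both cookies match, so a request to QA9 carries two SESSION cookies and
  // the login check reads whichever one it happens to see first.
  sentTo("qa9.domain.com", jar).foreach(println)
}
```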

This little bit of irony felt like it needed to be shared. Breaking one system with a stray cookie from another environment made for an interesting day. Share any funny bugs you’ve run into in the comments.