Software Performance Economics

There is lots of information out there about how software performance impacts your ability to sell, but I haven’t seen much about the cost of building higher-performing software. What does it cost to go back and add a caching tier after the fact? What scales that cost? I would think it is related to how complex the usage of whatever you are trying to cache is, but I’d love to see hard data. Quantifying that relationship seems like an interesting place for research. Today’s post is mostly on the performance side of the equation; next week I’ll be looking at the scalability side.

There is plenty of discussion in the community of the economics of software construction related to the cost of building quality in versus dealing with the fallout from low quality later. But there is little discussion of the economics of building a high performance solution as opposed to a low performance one. I think this is because low performance is perceived as a design defect and low quality as an implementation defect.

It seems like there are a couple of points at play. There is some fixed cost for building a feature in an application, and it comes with some baseline level of performance, say an algorithm that is n^2 in the number of customers in the system. When you have ten customers, n^2 is fine; with ten thousand you might have a problem; at a million it is looking ugly. If you started with n^2, it is now a question of what it costs to get from n^2 to n log n or even better. Premature optimization has a bad reputation, since it is assumed (for good reason) that a better-performing solution will cost more, either in initial cost or in long-term maintenance.
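
To put rough numbers on that intuition, here is a quick back-of-the-envelope comparison of raw operation counts at those customer counts (ignoring constant factors, which matter a great deal in practice):

    import math

    # Rough operation counts for n^2 vs. n log n, ignoring constant factors.
    for n in (10, 10_000, 1_000_000):
        quadratic = n ** 2
        linearithmic = n * math.log2(n)
        ratio = quadratic / linearithmic
        print(f"n={n:>9,}  n^2={quadratic:>15,}  n log n={linearithmic:>12,.0f}  ratio={ratio:,.0f}x")

At ten customers the difference is a rounding error; at a million it is four orders of magnitude.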

If it takes four developers a month to build something that works at an n^2 performance level, would it take another week to improve it to run at n log n, or would it take another month? What if you wanted to go directly to n log n or better from the start; how would that change the cost?

Imagine writing a sort implementation. Writing either a bubble sort or a quicksort is easy enough, since both are well known enough that you can just look them up by name. There are so many available options that, outside of school, most people will never write one. On the other hand, for a more complex sorting situation, there may not be so many published options at various levels of optimization. Say you need a sort that can take prior sorting actions into account; you’d end up with a selection sort that runs in n^2 time. I had to build one of these previously, and in the specific case I was working on the runtime was fine, since n was capped at 100, but in other circumstances it would be a severe bottleneck. If n were expected to be truly large, what could I have done? There are concurrent sorting options out there that can apply more resources to the problem, and if n were ridiculously large there are ways to sort data that doesn’t fit into memory, too. But this example is still in a well-defined and well-researched problem domain.

The original solution to my specific case was a hideous optimization that approximated the correct answer. That approximation eventually caused trouble when an item that should have been included in the result wasn’t. Once we described the problem in terms of a sort instead, it gave us the opportunity to rebuild the solution. We rebuilt it in significantly less time than the original (two weeks versus four weeks), since we ended up with a simpler algorithm doing the same thing. Most of the time during the rewrite was spent on research, figuring out how to solve the problem; during the initial implementation it was mostly spent on testing and bug fixing. This is the only time I’ve been involved in this sort of rewrite at the application level, so I suspect it is a significant outlier.

This post is unfortunately more questions than answers. I don’t have the data to make any good conclusions on this front. But it is clear that building the most efficient implementation regardless of the scale of the business need isn’t the right choice. Understanding the situation is key to deciding what level of performance you need to hit. Not everything is as straightforward as sorting algorithms, but most interesting problems have some level of researched algorithms available for them.

Postscript for anyone interested in the specific problem I was working on: we were trying to figure out what order events should occur in, but some of them couldn’t start until after a specific time, rather than after some other set of events. So we needed to pick the first thing to occur, then the second, and so on. Most sorts don’t work that way, which is how we ended up with a selection sort: it always selects the first element, then the second, so at each step we could compute what time the next item would be starting at.
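
For the curious, here is a minimal sketch of that kind of selection-style sort. The names and fields (Event, earliest_start, duration, schedule) are hypothetical; the real system had more going on, but the shape is the same: at each step, pick the event that can occur earliest given the current time, then advance the clock.

    from dataclasses import dataclass

    @dataclass
    class Event:
        name: str
        earliest_start: float  # the event cannot start before this time
        duration: float

    def schedule(events, start_time=0.0):
        """Selection-style sort: repeatedly pick the event that can occur
        earliest given the current time, then advance the clock. O(n^2)."""
        remaining = list(events)
        ordered = []
        now = start_time
        while remaining:
            # The next event is the one whose actual start time
            # (constrained by its earliest_start) is soonest.
            next_event = min(remaining, key=lambda e: max(now, e.earliest_start))
            remaining.remove(next_event)
            actual_start = max(now, next_event.earliest_start)
            ordered.append((next_event.name, actual_start))
            now = actual_start + next_event.duration
        return ordered

    print(schedule([Event("a", 5, 2), Event("b", 0, 3), Event("c", 1, 1)]))
    # [('b', 0.0), ('c', 3.0), ('a', 5.0)]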

Funny Bug Story

I ran into a funny bug and thought that others would see the humor in it. We have multiple QA environments set up: QA1.domain.com, QA2.domain.com, etc. Each environment runs a slightly different version of the web application, some of which are configured slightly differently. One of the least used environments, QA9, got a new version installed, and immediately afterwards anyone who attempted to log in got redirected back to the login screen. This environment is used for testing hotfixes going out to a production environment for a client who is very picky about taking new versions and was willing to pay for the privilege. The deployment had two changes in it: one was a change (made by my team) to the login code, and the other was a feature change in code that clearly wasn’t being executed.

Obviously, at this point we got a call to go investigate what happened. Simultaneously, QA9 was rolled back to the last known good version. Our change was already in several other QA environments without incident, so it seemed like it should have been fine. We started looking into whether our change depended on some other change that wasn’t in that one environment, and whether there was some configuration difference there that could have caused the problem. There didn’t seem to be any dependency issues, and we had tested the particular configuration case that environment represented. For extra fun, the system didn’t record the login attempts as failed or otherwise, or log any errors at all.

By the time we finished this exercise, the rollback had completed and things got weird. The problem didn’t go away, even though the code had gone back. That got us thinking about things that could persist across versions: cookies, something funky in the database, etc. We cleared cookies and the problem magically went away. So we logged in to start looking around at the cookies and didn’t find anything out of the ordinary.

A little later, one of the QA Engineers mentioned that immediately after clearing their cookies they could log in, but when they tried to log in again later they couldn’t. This immediately piqued our interest: the code that had supposedly caused the problem wasn’t there anymore, the cookies had been cleared, yet it broke again. Things had gotten weirder. We investigated the cookies in that browser; there was one session cookie with the QA9.domain.com domain set and another session cookie with the .domain.com domain set. That was unexpected.

We went and checked out the session cookie code that was deployed to that environment – nothing about setting the .domain.com version of the cookie. The code didn’t even look like it would support setting that cookie at all. We spent a while thinking about how this could happen. We tried to get that code to set that cookie and couldn’t. We started asking the QA Engineer what he had done between being able to log in and being unable to log in. He listed out a couple of different things; the notable one was that he logged into a different QA environment.

We checked out the session cookie code that was running in that other QA environment and found a whole bunch of new code about changing the domain. BINGO! The other environments had started setting .domain.com cookies, which conflicted with the cookie for this environment. We reached out to the team that made the change and they rolled it back in those environments, which immediately led to an even funnier problem: now every other environment had the same issue. But once everyone cleared their cookies, things got back to normal, for now.
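
If you haven’t hit this before, the gist is that a cookie set with an explicit Domain attribute of .domain.com gets sent to every subdomain, right alongside any host-only cookie a subdomain set for itself. A tiny illustration using Python’s standard library (the cookie name and values are made up; the real cookie wasn’t literally called “session”):

    from http.cookies import SimpleCookie

    # Host-only cookie: no Domain attribute, so the browser only sends it
    # back to the exact host that set it, QA9.domain.com.
    host_only = SimpleCookie()
    host_only["session"] = "qa9-session-id"

    # Domain cookie set by another environment: the .domain.com scope means
    # the browser sends it to every subdomain of domain.com, including QA9.
    shared = SimpleCookie()
    shared["session"] = "other-env-session-id"
    shared["session"]["domain"] = ".domain.com"

    print(host_only.output())  # Set-Cookie: session=qa9-session-id
    print(shared.output())     # Set-Cookie: session=other-env-session-id; Domain=.domain.com

A request to QA9.domain.com now carries both cookies, and the server has to guess which “session” value to trust, which is exactly the kind of guess that bounces you back to the login screen.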

This little bit of irony felt like it needed to be shared. Breaking one system with a stray cookie from another environment made for an interesting day. Share any funny bugs you’ve run into in the comments.

Book Chat: Software Estimation

I just finished re-reading Software Estimation: Demystifying the Black Art by Steve McConnell. I was inclined to go back to it after an item at work came in at roughly 500% over our estimate. We thought it would be done two weeks after it started, even with the team only working on it part-time; it wasn’t finished until twelve weeks had passed. A lot has changed in my life and in the development world since I first read the book back in 2008. At the time I was doing contracting work, some of it firm fixed price, so accurate estimates were much more immediately valuable to the company, both in terms of calendar time and man-hours. Today I’m doing software product work, where estimates aren’t as immediately impactful on the bottom line but are still necessary to coordinate timing between teams and launch new features.

I wanted to understand how the item had gotten so badly underestimated. With McConnell’s help, I identified a couple of things we did wrong. First, when we broke the item down into its component parts, we missed several. Second, we got sidetracked onto other tasks several times while working on it. Third, even though we had not done similar tasks before, we assumed it would be easy because it was not conceptually difficult. Each of these points became apparent after the fact, but I was looking to understand what we could have done better beforehand.

The three points that caused our delay are interrelated. We missed subtasks during the initial breakdown because we took the conceptual simplicity to mean we fully understood the problem. The longer we spent on the task, the greater the odds that something would come up that needed attention, causing us to spend more calendar time on the task.

While most of the book was on entire project-level estimation, some of the tips and techniques can be applied to smaller tasks. Some of these techniques are immediately applicable to the item we had estimated poorly. The first technique is to understand the estimate’s purpose. In this case, it was so that we could let other teams that depended on us know when to expect the results. The second technique is to understand how the support burden impacts the team’s ability to do work. There was a much higher than normal support burden on our team during this timeframe due to items other teams were working on that would need our input and assistance; we should have anticipated that. The third technique is estimating each individual part separately, which would allow individual errors to cancel each other out. This usage of the Law of Large Numbers would help to remove random error, although it still leaves in systemic error (like our missing some of the composite parts). Since we did one estimate for the item, we had all of the error in one direction. Using these three techniques, we probably would have gotten down to 100 to 200% over the estimate.
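
To make the error-cancellation point concrete, here is a toy simulation; the error distribution is made up purely for illustration. Each part’s estimate is off by a random factor, and summing many independent estimates pulls the total proportionally closer to the truth than a single lump-sum guess would be.

    import random
    import statistics

    random.seed(42)

    def average_relative_error(n_parts, trials=10_000):
        """Each part truly takes 1 unit of work; each part's estimate is off
        by a random factor between 0.5x and 1.5x. Returns the typical
        relative error of the summed estimate."""
        errors = []
        for _ in range(trials):
            estimate = sum(random.uniform(0.5, 1.5) for _ in range(n_parts))
            errors.append(abs(estimate - n_parts) / n_parts)
        return statistics.mean(errors)

    for parts in (1, 5, 10, 25):
        print(f"{parts:>2} part(s): average error {average_relative_error(parts):.1%}")

Random error shrinks roughly with the square root of the number of parts, but systemic error, like leaving parts out of the breakdown entirely, does not cancel at all.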

100% over the estimate is still bad. The biggest issue with our estimate was that we didn’t stop and fully discuss the task itself, because we thought we had a good enough understanding of the problem. Based on what happened as we went on, it seems that someone on the team had thought of most of the challenges we encountered, but nobody had thought of all of them, nor had we effectively shared our individual observations. As an example, the person who realized that each test case would be extremely time-consuming to run was not the person who recognized the need for an additional set of test cases, so we did not realize how time-consuming testing the item would be until we were already testing. We had not stopped to discuss the assumptions that went into the estimate, since we had all agreed closely on what the estimate was. If we had discussed those assumptions before generating our individual estimates, we could have arrived at a much better final estimate.

McConnell discusses the Wideband Delphi method as a way to avoid this sort of problem. It requires an upfront discussion of the assumptions, and if there is initial agreement on the estimate, someone plays devil’s advocate to ensure that the estimators can defend their estimates against criticism. The group takes the average of the generated estimates, and if everyone unanimously agrees on it, the average is accepted as the estimate. If not, the process continues iteratively until the group converges on an accepted answer. McConnell cites data showing that this method reduces estimation error by 40%. He also suggests using the technique when you are less familiar with whatever it is you are trying to estimate. He presents other data showing that 20% of the time the initial range of estimates does not include the final result, but that in a third of those cases the process moves outside its initial range and towards the final result.

Going forward we’re going to make sure to discuss the assumptions going into the more complex estimates. We probably won’t ever do a full Wideband Delphi, but on larger estimates we will definitely be sure to discuss our assumptions even if we are in agreement as to the answer the first time around. That seems like the best balance between time invested and accuracy for us.

NuGet Conversion Part 3

This is a follow-up to my earlier posts about the NuGet conversion project I’ve been working on. Since my last post, my team put out the initial version of the package and updated the rest of the system to reference it. This magnified some of the other existing problems with the system; there is a lot of leftover app code in some older parts of the system, which makes it harder to tell whether all of the dependencies will resolve in a way that works correctly.

My team ended up running into some diamond dependency problems around versions of the newtonsoft.json serializer. There were three different major versions of newtonsoft.json in the system; one was the newest version in its major version line, which also happened to be the newest major version. The larger development team already had a couple of versions in the GAC, which caused some complications, since the application would pick up a version from the GAC that was not necessarily the one you expected. Since some versions of newtonsoft.json do not maintain semantic versioning, combining that versioning issue with the app code resulted in code that appeared correct at compile time but failed when the app code was compiled at runtime. Unfortunately, the portions of the application using app code were all older and had few unit tests associated with them, if any.

When the larger development team was using ilmerge, they had been instructing it not to merge the newtonsoft.json dll, because it was assumed to be in the GAC. My team debated adding the newer versions of newtonsoft.json to the GAC to work around that deployment issue, but decided instead that we wanted to move away from using the GAC altogether, and this seemed like as good a place to start as any. Unfortunately, that meant we had to add literally hundreds of copies of the newtonsoft.json dll to source control, since source control holds a working copy of every one of the various services and batch jobs. Previously, the larger development team had been acting as if every developer had the entire running system locally. The long-term modularization effort will invalidate that assumption, but in the short term we need to let each team understand their own dependencies as they move out of the larger monolith.

As part of this project, we ended up pushing the upgrade to the newest version of newtonsoft.json to most of the application, since as we converted everyone to consume our package, they took on its dependency on newtonsoft.json. We also spotted an odd change in how xml documents get serialized to json. Previously the xml elements <foo/> and <foo></foo> both serialized to the same json content of foo:null, but now <foo></foo> serializes to foo:"". It’s not a big change, but it’s hard to tell what this sort of thing will impact, given the test coverage challenges in parts of the application.

My team seems to have hit the end of the road with this package. While we’ve got more packages to create, we expect most of them to be less complex, since they are only used in tens of projects rather than hundreds and they don’t pull in other third-party dependencies. The whole ordeal turned out to be much more complex than we expected. While we reap some tangible benefits now, like being able to control the release cadence of the package, most of the benefits are still further down the line. The intangible benefit is that we get to find and confront all of the technical debt we’ve accrued over the years and do something to modernize the older portions of the application.