Software Ownership vs Caretaking

Recently, I was trying to describe my team’s relationship to a particular piece of software within our company, and I ended up settling on the term “caretaking.” We didn’t have ownership of this software in the traditional sense, where we would be responsible for its content and correctness, yet we were in the terrible position of being the first people called whenever it ran into problems.

The software itself was in weird but decent shape: a lot of old battle-tested code that could benefit from some modernization and better automated testing (it had a couple of high-level integration tests, but they were finicky and would only run in one particular QA environment). It was in that zombie state where we didn’t make any changes to it since it met the requirements, but nobody would look at it and say it was nice code. Nobody currently on the team had any particular experience with it; a former team member had worked on it before joining, and the code just sort of followed that person along, with nobody ever really pushing back, since it didn’t run into problems. This particular code is even slated to be replaced, as a new team – we’ll call them “Triangle” – is centralizing all of the responsibilities in this application’s domain into a couple of different services. We were caretaking – nominally responsible for it for now, but not paying too much attention to it.

This was all fine, right up until a serious security bug was found in this code. Our security team was rightfully saying it needed to be resolved ASAP. We wanted Triangle to just replace the code now, since that was going to happen anyway and we were already booked on other time-critical activities. Triangle wasn’t inclined to get pulled in and commit to something on a deadline, which was a completely fair reaction on their part.

We took a look at our options for resolving the problem and realized we had three. The first was a short-term fix, which would prevent the security exploit but would likely cause other problems in the medium term. The second was to gut the code and rebuild it without the exploit in place; this was rejected pretty quickly because it seemed wasteful to gut it for this fix only to replace it again soon after, especially given that Triangle was planning to rebuild it on a different tech stack. The third was a nuanced tweak to the existing behavior that looked like it would fix the security issue, though it was unclear whether it would have other negative side effects. We went with the third approach, since the first would likely come back and cost us more time in the long term.

We also decided at this point that we were going to actively take ownership of the code in the medium term. We didn’t gut the existing solution; instead, I applied some iterative refactoring to safely create the seams we needed to test the software to our satisfaction. The existing code was one big method, about 1,500 lines. I started by extracting some methods from it, then converting those methods into a couple of classes. I did it this way so that the available tools handled most of the refactoring, which let me be confident in the correctness of the refactored code even with the limited integration tests available. We felt that this little bit of unit testing would enable us to make the change we needed with confidence that we didn’t break any of the broader test cases.
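
To make the shape of that concrete, here’s a minimal sketch of the sequence in Python. The actual code wasn’t Python, and the domain and names here (process_order, OrderPricer) are invented for illustration; the point is the mechanics of extract method, then extract class, each step small enough for refactoring tooling to verify.

```python
# Before: a condensed stand-in for the one big 1,500-line method.
def process_order(order):
    # ... hundreds of unrelated lines ...
    total = 0
    for item in order["items"]:
        total += item["price"] * item["qty"]
    if order.get("coupon"):
        total *= 0.9
    # ... hundreds more lines ...
    return total

# Step 1: extract a method, creating a named seam in the big method.
def calculate_total(items, coupon):
    total = sum(item["price"] * item["qty"] for item in items)
    return total * 0.9 if coupon else total

# Step 2: move the extracted methods onto a class that can be unit
# tested without the finicky, QA-environment-only integration tests.
class OrderPricer:
    COUPON_DISCOUNT = 0.9

    def calculate_total(self, items, coupon):
        total = sum(item["price"] * item["qty"] for item in items)
        return total * self.COUPON_DISCOUNT if coupon else total

# The new seam makes a focused unit test trivial.
def test_coupon_applies_discount():
    items = [{"price": 10.0, "qty": 2}]
    assert abs(OrderPricer().calculate_total(items, coupon=True) - 18.0) < 1e-9
```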

We ended up fixing the problem and one little corner of the codebase. Applying some TLC to this portion of the codebase was satisfying, since every previous look at it, whenever it did something odd, had been a negative experience everyone wanted to avoid. Overall, I only spent about a day cleaning it up and making the basic workflow conceptually clean. The software got much better, and it just took the will to make the changes. The effort was lower than expected, and it built a good bit of confidence for anything else in this area. It was a rewarding little win.

The pride of ownership in fully tackling a problem to make the system better, rather than continually avoiding it, has lots of benefits. You actually make things better, which can be motivating to continue tackling more problems. Make something a little better, then a little better, then a little better again, and all of a sudden you’ve changed the tenor of the system. In this instance, tackling the problem head-on proved easier than we thought. We are still caretaking a bunch of other code, a good bit of which is slated to be replaced as well, but if push comes to shove we’ll hopefully go ahead and take full ownership of that code too.

Book Chat: Don’t Make Me Think Revisited

Don’t Make Me Think Revisited by Steve Krug is a usability and user experience book. It’s a series of well-illustrated rules for getting the basics of usability into a web site. There is also an excellent chapter on how to run your own usability test on the cheap. The revisited edition includes some information about working on mobile apps and mobile web sites as well. The rules for creating a good navigation system were a great way to codify things you know but may have struggled to express.

Krug’s guidance on running your own cheap usability study comes down to using fewer users and a simpler reporting plan. Instead of recruiting 20 users and generating a giant report, you get two or three users and pick a couple of important things that can actually be fixed. You don’t need a report with hundreds of issues, since once you fix the first few, the rest of it might not apply anymore. Once you run one study and fix those issues, iterate and try again. It’s almost like A/B testing, except it works at smaller scales.

The information on mobile testing goes beyond the usual advice of “make the buttons bigger” and “put less on each page.” Krug’s advice on deep linking on mobile is obvious – navigating to a full-site URL on a mobile device needs to work – but nobody seems to get it right. And, even though this is minor, the number one thing I got from the mobile chapter was a way to combine a webcam and a clip-on light into a mount that lets you see what the user’s hands are doing while they’re using the app. That’s way more useful than conventional video recording equipment, where you would mount the phone in a fixed rig and the user wouldn’t hold it normally. Plus it only takes an hour to put together and costs about $30.

Krug’s rules for a navigation system basically boil down to the obvious. Keep the navigation consistent from page to page. Make it clear where you are in the application. Put the things most people want in easy-to-find places. Ensure that every page has a name and that the name is in a consistent place, which maybe isn’t as immediately intuitive as the other rules; however, when I went and checked some sites, most seemed to do this. Putting all of these rules together amounts to an explanation of tabbed design and why it became so common.

Overall, design is not one of my strong interests, but this is a good primer on the topic to help keep you from doing anything too silly. I certainly feel like I can put Krug’s advice into practice. There are also lots of references to other books if you want to dig deeper.

Hiring Randomness

I ran across this article on companies’ interest in interviewing different archetypes of programmers, e.g., the “Academic programmer” or the “Enterprise programmer.” I had two big takeaways from the article. First, all of the companies were different, so none of the archetypes appealed to everyone. Second, the archetype that received the most attention wasn’t described in terms of technical abilities, but in terms of the applicant’s interest in product development. The typical software engineer only hears interview feedback about themselves in relation to an individual employer, as opposed to hearing how different companies consider the same set of people, so this was an interesting discussion to read.

The first insight, that different companies want different things, matches my intuition from years of moving around the industry. Each company is looking for a different mix of skills and weighs technical versus soft skills differently. It’s really no different from the idea that different people look for different things in employers. When a job posting goes up, it describes the technical and soft skills the role wants, which filters the incoming candidates to people who think they match. So even on the inside, as an interviewer, you won’t see the candidates who read the description of what you say you want and removed themselves from your potential hiring pool.

The second insight was more interesting: the archetype described as being about building product, more so than building good software, was the one that got the most positive response. This might make sense when you look at the audience viewing the archetypes in this particular example: small startups that need product. That archetype seems great, but why did it draw so much more interest than any of the others? I can see why most companies would want it, but it isn’t clear why there is so much less interest in some of the other archetypes, who I feel like I would rather work with.

There are some other interesting thoughts buried in the grid of results. The difference between the “child prodigy” and the “strong junior” archetypes is interesting: they both represent the same sort of talent, just with a different story, so why should opinions of the two differ significantly? Which company rejected every profile but the enterprise programmer? Why would you take the “experienced and rusty” archetype over the “technical programmer” archetype? Taken together, it seems like there is more at play in these decisions than just the background being presented. It also seems that each company rejected about half of the candidates.

This led me to reconsider some of the steps I’ve used in hiring in the past. The shape of the hiring funnel reinforces some of the built-in problems with attracting talent. The funnel always seems too wide, pulling in candidates you can’t work with, and the resume screen ends up throwing out the wheat with the chaff. The article’s advice to programmers is to spend more time on each application and personalize it. That’s good advice for the individual, but it doesn’t resolve the issue on the company side: everyone seems to be throwing away talent that some other company would be eager to have. That half is up to those of us on the hiring side of the table, to keep an open mind about backgrounds and find the talent that is interested.

Software Performance Economics Part 2

Last time I discussed the performance side of software economics: how having a faster application can directly impact income, and how little data there is on the cost of building a faster solution. This time I’ll explore the scalability side of the equation and how better scalability holds down operational costs.

With Moore’s law finally breaking down, scaling will become an even larger concern than it is today. Where I work today, we’ve ridden Moore’s law to solve a lot of scaling problems: the hardware was getting faster and cheaper quickly enough to absorb our growth for years. That allowed us to invest heavily in building out new features and dominating the market. Now it’s catching up to us, and I doubt we’re the only company with this issue.

In general, you need a certain base level of users to justify the fixed cost of building a system, while variable costs like servers and storage go up as the user base grows. Most algorithms scale memory and/or CPU at a rate above linear, so, theoretically, at some point growth hurts your ability to actually make a profit. Yet as an industry we continue to scale up and seek out network effects to get even bigger. How do these two seemingly contradictory ideas fit together?
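
As a toy model of that tension (all numbers invented for illustration), suppose revenue is linear in users while compute cost grows like n log n; profit then climbs, peaks, and eventually goes negative:

```python
import math

def profit(users, revenue_per_user=1.00, cost_unit=0.06):
    """Toy model: linear revenue minus superlinear (n log n) compute cost."""
    revenue = revenue_per_user * users
    cost = cost_unit * users * math.log2(max(users, 2))
    return revenue - cost

# Profit rises into the tens of thousands of users, then collapses:
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} users -> profit {profit(n):>10,.0f}")
```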

In part, the cost of individual resources is so low that it doesn’t matter unless you’ve chosen a scale-up architecture rather than a scale-out one. The other part is that even in a scale-up situation, the hardware you can buy today will generally still take you to millions of users, as long as you are willing to pay for it.

This is where the idea that it is easier to get a 10x improvement than a 10% one comes from. Changing the runtime of a solution is easy enough until you hit the lower bound of its complexity class. Instead of seeking scalability with the same algorithms, you can rebuild the infrastructure and algorithms to be more efficient in their runtime, e.g., improving from n^2 to n log n, or you can scale against a slower-growing resource and change where your bottleneck is, e.g., scaling against memory instead of CPU. The problem is that, most of the time, the thing you are scaling with is your only option. In many business problems you’re scaling against the needs of the business itself: if you need to send emails, your resource usage is going to scale with the number of emails being sent, and there isn’t a way around that. If you’re at the theoretical bounds, you need to solve the business problem through other, more clever means.
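
A generic illustration of both moves (my example, not from the original discussion): checking a list for duplicates pair-by-pair is n^2, sorting first drops it to n log n, and a hash set drops the CPU cost to roughly linear by scaling against memory instead.

```python
def has_duplicates_quadratic(items):
    # n^2: compare every pair of items
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicates_sorted(items):
    # n log n: after sorting, any duplicates sit next to each other
    ordered = sorted(items)
    return any(a == b for a, b in zip(ordered, ordered[1:]))

def has_duplicates_hashed(items):
    # ~n CPU time bought by spending ~n memory: the bottleneck moves to RAM
    return len(set(items)) < len(items)
```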

Those more clever means usually involve things like caching or estimating. Caching lets you avoid an entire computation with its own complexity bound; it spends memory to save CPU, but is sometimes the best option. Estimating, meanwhile, is all about understanding the degree of precision truly required in a calculation. If the question is “How many players are online right now?”, you need to know whether saying “a million” is good enough, versus needing to know it was exactly 1,002,783. For a counter showing the real-time player total for monitoring purposes, a rough count might well be fine, and then you can use a well-known option like HyperLogLog, which counts unique entries in far less than linear space.
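
To show the flavor of it, here’s a stripped-down HyperLogLog sketch in Python. This is a simplified illustration (real implementations add bias and small-range corrections that this omits), but it captures the trick: a few kilobytes of registers instead of storing a million player IDs.

```python
import hashlib

class HyperLogLog:
    """Simplified HyperLogLog: estimates distinct items in O(m) space."""

    def __init__(self, p=14):
        self.p = p              # 2^p registers; more registers, less error
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash; the first p bits pick a register, the rest are sampled
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # standard estimator: harmonic mean of registers with a bias constant
        alpha = 0.7213 / (1 + 1.079 / self.m)
        return int(alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers))

hll = HyperLogLog()
for player_id in range(1_000_000):
    hll.add(player_id)
print(hll.count())  # within roughly 1% of 1,000,000
```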

The n^2 algorithm is about 50,000 times more expensive than the n log n solution at n = 1,000,000, and about 430,000 times more expensive at n = 10,000,000. Regardless of how cheap resources are, that will add up pretty quickly. Using an approximation can be an amazing way to cut back resource utilization, but it requires a more nuanced understanding of how the system is going to be used, which can create other sorts of fixed costs.
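
Those multipliers are just the ratio of the two cost functions, which is easy to sanity-check:

```python
import math

def cost_ratio(n):
    # (n^2 work) / (n log2 n work) for the same n
    return (n * n) / (n * math.log2(n))

print(f"{cost_ratio(1_000_000):,.0f}")   # ~50,000
print(f"{cost_ratio(10_000_000):,.0f}")  # ~430,000
```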

The moral of the story is that more data should eventually push you to go back and reconsider how the system is built. This is true for scale-out architectures as well as scale-up ones. Resource consumption will always carry some overhead, and as the system gets bigger that overhead is only going to grow as a proportion of the whole.