Service Creation Overhead Followup

I previously mentioned the new service we were spinning up and the discussion about the overhead involved. Having gotten the initial version of the service out into production, I feel like I have some answers now. The overhead wasn't that bad, but it could have been lower.

Setting up the repo was as easy as expected. The tool for setting up the CI jobs was quite helpful, although we didn't know about a lot of the configuration options available to us. We initially configured things with the options we were familiar with, but found ourselves going back to make a couple of tweaks to the initial configuration. The code generators worked out great and saved a ton of time getting started.

The environment configuration didn't work out as well as expected. The idea was that the new service would pick up defaults for essentially all of its needed configuration, which would reduce the time we would need to spend figuring it out ourselves. This worked out reasonably well in the development environment. In the integration environment we ran into problems because the default configuration was missing some required elements. As a result we didn't have any port mappings set up, so nothing could talk to our container. We burned a couple of hours sorting out this problem. Then, when we went to the preproduction environment, we found its port mapping settings were different from the lower environments and needed to be set up differently again. Here we ended up burning even more time, since the service isn't exposed externally and we needed to figure out how to troubleshoot the problem differently.

In the end I still think spinning up the new service on this short timeframe was the right thing to do – we would have had to learn this stuff eventually when building a new service. Doing it all on a tight timeline was unfortunate, but getting the services factored correctly is what matters most.

Productivity and Maintainability

As I alluded to previously, the project I'm currently on has some very aggressive and fixed deadlines. The project got scoped down as much as possible, but there have still been some trade-offs favoring productivity over maintainability. Some of that happened as a team decision – we discussed ways to compress the schedule and found some shortcuts to produce the needed functionality faster. But as new requirements came up in the process, those shortcuts ended up causing much more work overall.

We knew that the shortcuts would cause more total work, but we thought the additional work could be deferred until after the initial deadline. The change in requirements meant that some of the automated testing that had been deferred became an issue. The initial manual testing plus the retest after the changes probably wasn't much less effort than building the automated tests in the first place. This specific shortcut wasn't responsible for much of the time we shaved off the schedule, but when you're doing similar things all over, the total cut was significant.

The key part of this tradeoff is that it assumed we would go back after the initial deadline and fill in the missing bits as we intended, instead of moving on to something else (which is appearing more and more likely). Product management has a request for an additional feature in another existing application. This request has a similarly tight deadline but is much smaller in scope (15 points) than the initial request (~120 points for the scoped-down version). Picking up this request would push us further from the decision to defer portions of the first one, and we'd be less likely to deal with the technical debt we accrued to do the first request in a timely manner.

Since I'm still new at this job, I'm not sure whether this behavior is the norm or, since both of these items relate to the same event, just a one-time thing. If this is a one-off problem then we can survive more or less regardless of how well we deal with it. If this is going to be a regular occurrence without an opportunity to resolve the problem, I feel like I'd need to take a different tack in the future for dealing with the problem of short-term needs.

This feels like the overall challenge of engineering: there are problems that need to be solved in a timeboxed way. We need to build good enough solutions that fit all of the parameters, be they time or cost. We can be productive, leave a disaster in our wake, and still succeed at business objectives, or we can build an amazing monument to software architecture and have the project fail for business reasons. The balance between the two is where this gets really hard. If you've already been cutting corners day to day and the request to move the needle gets made, you have no resources left. The lack of a systematic way to quantify the technical debt of an application makes it hard to convey where we are on that continuum to people who aren't working on the system daily.

Without this information, program-level decisions are hard to make and you end up with awkward mandates from the top that aren't based on the realistic situation on the ground. Schedule pressure causes technical debt, test coverage mandates produce brittle tests written just to ratchet up the coverage number, and mandating technology and architectural decisions results in applications that just don't work right.

The Systems of Software Engineering

This idea came together from two different sources. First, I've been reading The Fifth Discipline, which is about the creation of learning organizations. One of the elements of becoming a learning organization is Systems Thinking, which I had heard of before and which seems like a great idea. Then I went to a local meetup about applying Systems Thinking to software development. We ran through a series of systems diagrams from Scaling Lean & Agile Development. These diagrams helped us understand Systems Thinking by analyzing a domain we were already familiar with in a structured way: we were presented with the nodes of a system diagram and asked to draw the connections between them.

There were several different groups, each looking at the same set of nodes and drawing different conclusions about some of the relationships. A few very adamant people felt that increasing incentives for developers would increase the team's velocity, which was interesting to see since they were mostly scrum masters or software development managers with many years of experience managing software projects and should have seen how that actually works out. If the impact of an incentive plan on a team is that uncertain, it makes me curious what sorts of teams those leaders are running.

Throughout the process we had interesting conversations about how various aspects of the software process were related. Everyone agreed that a strong mentoring process would decrease the number of low-skill developers on the team. There was a discussion of whether it would also directly impact the velocity of the team. Some people were adamant that it would lower velocity, since time spent mentoring was time people weren't building features. It's a reasonable consideration, but it doesn't match my own experiences with mentoring. If you were spending enough time on mentoring activities to lower your velocity significantly, you would be forcing the mentoring relationship instead of letting it happen organically.

The experience made me wonder if we could construct a system diagram that describes some of the other dysfunctions of many organizations. The diagram we built described (1) dysfunctions around hiring large numbers of low-skill developers, (2) using certain kinds of financial incentives, and (3) the push to deliver features above all else. It didn't describe why a lot of organizations end up in siloed configurations, exhibit "not invented here" behavior on technical systems, or fall into many of the other dysfunctions seen across organizations. I made the below diagram with the intent of describing how organizations end up siloed, but without any real experience in this sort of analysis I'm not sure if I'm doing it well.

I made another simple diagram about the causes of “not invented here.”

It feels to me like these diagrams describe dysfunctions that are common in software organizations. Expanding them to try to find the leverage points in the system might yield some greater insights into the problems. Having spent some time thinking about both diagrams, I'm still not sure what other nodes should be there to describe the problems further.

I'm definitely going to do some more reading on Systems Thinking and try to expand on the thoughts behind these diagrams. If you've got more experience with Systems Thinking, I'd love to hear some feedback on these charts.

Service Creation Overhead

At work we started spinning up a new service, and some interested parties expressed concerns about the overhead required to get a new service into production. Among the specific concerns articulated: a repo needed to be made, CI builds set up, configuration done for the various environments, databases created in those environments, etc. Having not deployed a new service at this job, I'm not sure whether the concerns are overblown or not.

We've got a platform set up to help speed the creation of new microservices. The platform can help spin up CI jobs and simplify environment configuration. Creating the repo should just be a couple of clicks to create it, assign the right permissions, and set up the hook for the CI process. I'm optimistic that this process should make it easy, but only two people on the team were part of spinning up a new service the last time the team did it, and neither of them dealt much with this infrastructure.

The project all of this is for needs to be put together and running in production in the next month, so the overhead can’t take up much of the actual schedule. The first version of the service isn’t complicated at all and can probably be cranked out in a week or so of heads-down work. It needs a little bit of front-end work but has been descoped as far as possible to make the timelines easy. We’ve got a couple of code generators to help bootstrap a lot of the infrastructure of the service; we’ve even got a custom yeoman generator to help bootstrap the UI code.

I'm curious whether the concerns were memories of spinning up new services in the world that predates a lot of the platform infrastructure we have today, or a reasonable assessment of the complexity of doing all of this. Either way it raises the question of how much overhead spinning up a new service should have. As little as possible seems reasonable, but the effort to automate this setup, relative to how often you create new services, makes that not as reasonable as it first appears. I think it needs to be easy enough that you create a new service when it makes logical sense in the architecture and the overhead of doing so doesn't enter into the equation.

Cloud Computing Patterns

A colleague recommended this list of application design patterns for cloud computing. A common dictionary of terminology is always valuable, and this is a good collection of the different patterns that can be used in a cloud application. Using the abstract pattern names rather than specific implementation names helps lift the discussion from a single implementation to the more general architectural level. A specific implementation has its own strengths, weaknesses, and biases, but a generic pattern is the purer concept, which lets the initial discussion focus on the goals rather than implementation specifics.

I read through the patterns and found a few I hadn't been aware of, like the timeout-based message processor, community cloud, and data abstractor. The timeout-based message processor is a name for a concept I was familiar with but never had a name for: essentially, a message handling process where you ack the message after processing it, as opposed to on receipt. The community cloud is a pattern for sharing resources across multiple applications and organizations. The data abstractor is a way to handle multiple services having eventually consistent data and still show a single consistent view of the underlying data, by abstracting the data to hide the inconsistencies. None of these three patterns is a mind-blowing solution to a problem I've been having, but they're all interesting tools to have in my toolbox.
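
To make the timeout-based message processor concrete, here's a minimal sketch in Scala. The Queue trait and its methods are hypothetical stand-ins rather than any particular broker's API; the point is just that the ack happens after the handler succeeds, so a crash mid-processing lets the broker redeliver the message once its visibility timeout expires.

```scala
object AckAfterProcessing {
  final case class Message(id: String, body: String)

  // Hypothetical queue client with ack/redelivery semantics (SQS/RabbitMQ-like).
  trait Queue {
    def receive(): Option[Message] // next message, if one is available
    def ack(m: Message): Unit      // acknowledge (delete) a processed message
  }

  def processLoop(queue: Queue, handle: Message => Unit): Unit = {
    while (true) {
      queue.receive().foreach { msg =>
        try {
          handle(msg)    // do the work first...
          queue.ack(msg) // ...and only ack once it succeeded
        } catch {
          case e: Exception =>
            // no ack: the broker will redeliver after the timeout expires
            println(s"processing failed for ${msg.id}: ${e.getMessage}")
        }
      }
    }
  }
}
```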

There are a bunch of different ideas, ranging from common baseline knowledge of cloud computing to solutions for some quite specific problems. The definitions of the various patterns are concise and reference other related patterns, so even if you don't know exactly what you are looking for, you should be able to find your way to it fairly quickly. Overall it's worth checking out if you are working on a distributed application.

Automation and the Definition of Done

I recently listened to this episode of Hanselminutes about including test automation in the definition of done. It reminded me of acceptance test driven development (ATDD), where you define the acceptance criteria as a test that you then build the software to pass. These are both ways to roll the creation of high-level automated tests into the normal software creation workflow. I've always been a fan of doing more further up the test pyramid, but I've never had significant success with it.

The issue I ran into the time I attempted ATDD was that the people writing the stories weren't able to define the tests. We tried using gherkin syntax but kept getting into ambiguous states where terms were used inconsistently or the intent was missing from the tests. I think if we had continued past the ~3 months we tried it, we would have gotten our terminology consistent and gained more experience at it.

At a different job we used gherkin syntax written by the test team; they enjoyed it but produced some highly detailed tests that were difficult to keep up to date with changing requirements. The suite was eventually scrapped because it made the system hard to change, given the number of test cases that needed updating and the way they were (in my opinion) overly specific.

I don't think either of these experiences says the idea isn't the right thing to do, just that the execution is difficult. At my current job we've been trying to backfill some API- and UI-level tests. The intention is to eventually roll this into the normal workflow; I'm not sure how that transition will go, but gaining experience with the related technologies ahead of time seems like a good place to start. We've been writing the API tests using the Robot Framework and the UI tests with Selenium. The two efforts are being led by different portions of the organization, and so far the UI tests seem to be having more success, but neither has provided much value yet since they have been filling in coverage over the more stable portions of the application. Neither effort is far enough along to help regress the application more efficiently yet, but the gains will hopefully be significant.
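
For a sense of the general shape of those UI tests, here's a minimal Selenium sketch in Scala using the standard WebDriver API. The URL, element ids, and page title are hypothetical placeholders, not details from our actual application.

```scala
import org.openqa.selenium.By
import org.openqa.selenium.chrome.ChromeDriver

object LoginSmokeTest {
  def main(args: Array[String]): Unit = {
    val driver = new ChromeDriver() // assumes chromedriver is available on the PATH
    try {
      driver.get("https://app.example.test/login")             // hypothetical URL
      driver.findElement(By.id("username")).sendKeys("demo")   // hypothetical element ids
      driver.findElement(By.id("password")).sendKeys("secret")
      driver.findElement(By.id("login-button")).click()
      assert(driver.getTitle.contains("Dashboard"))            // hypothetical page title
    } finally {
      driver.quit()
    }
  }
}
```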

Like a lot of changes in software engineering, this change is more social than technical. We need to integrate the change into the workflow, learn to use new tools and frameworks, or unify terminology. The question of which to do keeps coming up, but the execution has been lacking in the several attempts I've seen. I don't think it's harder than other social changes I've seen adopted, such as an agile transition, but I do think it's easier to give up on, since having high-level automated testing isn't strictly necessary.

Including automation in the definition of done is a different way to frame the problem of keeping the automation up to date. Saying it's a whole-team responsibility to ensure the automation gets created and stays current makes it harder to back out later. The issue of getting over the hump of the missing automation is still there, but backfilling as you make changes to an application, rather than all up front, may work well.

2016 Year in Review

Looking back at this year's blogging, strictly by the numbers everything is much bigger than last year. Last year there were 63 pageviews by 42 visitors; this year there were 440 views by 359 visitors. Part of the increase is because I have a whole year of posts rather than a couple of months' worth, but that doesn't account for the whole increase.

Last year I outlined some goals for this year.

  • Increase readership
  • Keep up the weekly posts
  • Write more technical posts
  • Write about what I was doing at work some

I feel like I achieved each of these goals. Views were up considerably, for both the new posts and some of the old ones. I hit the weekly post goal, although I went through a good bit of my backlog of post ideas. I got in some more technical posts with the Nuget conversion, the Scala posts, and the anonymous type serialization. The Nuget conversion was what I was doing at work, and I got to talk about that some.

As for some of the specifics of this year, F# Koans and the post on the Javascript Ecosystem both continued to get a good number of views. The post on Modern Agile was a surprise hit, racking up 108 views from a surprising variety of incoming sources: LinkedIn, Slack, and a couple of different tweets.

I've got a couple of goals going forward for 2017.

  • Continued increases in readership
  • Keep up the weekly post cadence
  • Write a longform piece and see what that looks like in this format

I don't have any specific means of accomplishing the readership increase. Last year I added references to the blog in a couple of profiles I have elsewhere, but that seems to have generated only five referrals. Since a quarter of the views came from one post, I'm slightly concerned that readership may go down. The weekly post cadence helps me keep the writing habit, and while there have been some posts that didn't generate any views, that's something I can live with. I'd like to try the longform piece, since a number of the posts I've been writing just point toward other people's ideas without a lot of new thought. To a certain extent the problems I'm fighting and the solutions I'm deploying aren't that different from what other people are doing, so a lot of these ideas are already out there in the world.

Book Chat: Daring Greatly

This is a bit outside the normal material I write about here, but I felt it was something others might appreciate as well. One of my wife's classmates gave her a copy of Daring Greatly and said that it was part of his inspiration to be in their doctoral program. I've always read broadly and tried to find ways to integrate that into my life. I saw this book and wasn't sure if it was the sort of thing for me, but I decided to read it to find out. Maybe a third of the way through I thought to myself, "I've read this book before, but with the word vulnerability replaced with the word authenticity." But I kept reading, and eventually something started resonating with me.

What resonated was the idea that you need to put yourself out there, give more of yourself to the situation, and say the things you are thinking. You have to do that even when it's not easy, because it is important and is part of what separates good from great. The willingness to say something that puts you in a vulnerable position and opens you up to those around you takes a bigger commitment than most people are willing to make. People are willing to say the easy things but not the hard things that require them to push against the structure around them. It's like the Emperor's New Clothes: everyone sees what's going on, but the incentive structure is set up so that no one acknowledges the problem. This book is all about how to structure your own thoughts so you can push against the structure around you.

Portions of the book are written in an abstract sense and others are much more specific. The specific sections contain several manifestos describing how leaders or parents should behave around those they are responsible for. They put out there that you may not be perfect, but that you strive to be better and hope everyone else engages with you in trying to be better. Each of the manifestos describes how the person in that situation can open up to those subordinate to them and truly embrace the position they are in.

I'm not sure if the book is that profound, or if it just found me at a point in my life where I was open to what it was saying. Either way, I felt moved by it. It made me feel that I should push outward and express my opinions more. I had been expressing myself in some domains, but it is hard to put yourself out there in all areas all of the time. Sometimes you just want to wait and let things happen, but sometimes you need to make things happen. Not in the doing sense but in the living sense, where you can't just wait for something to happen but need to do something to move the situation forward.

Mongo Play Evolutions

I ran into an odd situation with some Play Framework evolutions for MongoDB and hope to save the next person in this situation some time. I got two messages I wasn't really expecting. The first was "!!! WARNING! This script contains DOWNS evolutions that are likely destructives" and the second was seemingly more helpful: "mongev: Run with -Dmongodb.evolution.applyProdEvolutions=true and -Dmongodb.evolution.applyDownEvolutions=true if you want to run them automatically, including downs (be careful, especially if your down evolutions drop existing data)". The big issue was that I couldn't tell why it thought it should be running the downs portion of the evolution at all.

Some digging in the logs showed it wanted to run both the down and the up for evolution 71. This is when I got really confused: why would it attempt to run both the down and the up for the same evolution? I spent a while digging through the code to see how it decided which evolutions to run, and it turns out it compares the saved content of the evolution that was run at some point in the past with the current content of the evolution on disk. So it recognized that the current evolution 71 was different from the previously applied evolution 71 and was attempting to fix this by running the saved down evolution and then the new up evolution.
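
A rough sketch of that comparison logic in Scala, based on my reading of it. This is not the plugin's actual code, just the gist: stored evolutions are keyed by revision, and when a stored up script differs from the one on disk, the stored down gets queued ahead of the new up.

```scala
object EvolutionDiff {
  final case class Evolution(revision: Int, up: String, down: String)

  // Given the evolutions previously applied and the ones currently on disk,
  // work out which scripts need to run, in order.
  def scriptsToRun(applied: Seq[Evolution], current: Seq[Evolution]): Seq[String] = {
    val appliedByRev = applied.map(e => e.revision -> e).toMap
    current.flatMap { ev =>
      appliedByRev.get(ev.revision) match {
        case None                         => Seq(ev.up)           // new revision: run its up
        case Some(old) if old.up != ev.up => Seq(old.down, ev.up)  // changed revision: stored down, then new up
        case _                            => Nil                   // unchanged: nothing to run
      }
    }
  }
}
```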

The environment was set up not to run down evolutions, since needing one usually means you screwed up somewhere. We had accidentally deployed a snapshot version to the environment a while back, which is where the unexpected behavior came from. We ended up fixing the problem by breaking the evolution into two separate evolutions so there was no down to run.

Book Chat: The Mikado Method

The Mikado Method describes a way to discover how to accomplish a particular refactoring. The method asks that you first attempt to do what you want naively and identify the problems with that approach. Then you roll the codebase back to its original state, tackle one of those problems, and iterate on the process until you can resolve the problems in a bottom-up fashion, resulting in multiple small refactorings rather than one big one. This strategy means you can merge with or push to the master branch more frequently, since the codebase is regularly in a working state. It avoids the rabbit hole of making changes upon changes and never being sure how close you are to having compiling software with a passing test suite again.

The actual description of the technique and its examples is only about 60 pages of the roughly 200-page book; most of the rest is other tips and tricks for working on refactorings. There is also a rather long appendix on technical debt that expressed some ideas I had been thinking about recently; it describes four techniques for tackling the sources of technical debt.

The four techniques listed are absolve, resolve, solve, and dissolve. Absolve is essentially normalizing the practice and saying it is okay to do things this way; this would be something like lowering the automated test coverage required during a hard scheduled push. Resolve is reverting a change in the current processes or environment that had unintended negative effects, such as getting rid of an internal bug bounty if it was being abused. Solve is changing the incentive schemes in order to bring groups into alignment, for instance putting development teams on call to align their incentives with the operations teams. Dissolve is the sort of radical solution that removes the friction between groups entirely and makes the problem disappear; to continue the previous example, this would be a devops culture where operations and developers are on the same team and there is less distinction between the two. Each of these techniques could be applied to the various sources of technical debt, or even to other sorts of problems.

The actual Mikado technique doesn't seem book-worthy in the sense that it isn't complex enough to warrant an entire book on its own. The other refactoring techniques weren't particularly novel to anyone already familiar with Refactoring Legacy Code or similar material. Overall it was a quick read and enjoyable, but not the sort of thing I would strongly recommend to others.