Book Chat: Effective DevOps

Effective DevOps is about the culture of the DevOps movement. The technical practices that today coincide with DevOps are the result of the culture practices, not the cause. The cause is an underlying culture that is safe, and respectful to those in it, which truly empowers the team to try things to improve the way that work is done and leads to the technical practices associated with DevOps. The book is overall written more from a management perspective than an individual contributor perspective. The book is centered around the four pillars of effective DevOps: Collaboration, Affinity, Tools, and Scaling.

Collaboration is the normal sort of mentoring, and workflow information that would be familiar to most agile or lean practitioners. The Affinity pillar though builds on top of Collaboration with the idea that it takes time and work to forge a group of individuals into a team and explores the requirements to build those bonds. These two pillars lead i7nto the Scaling pillar nicely since, while you can eliminate waste and automate things, at the end of the day the biggest scaling maneuver is in hiring. Hiring renews the importance of the Collaboration and Affinity aspects of this since as you bring new people into the system you must fully integrate them.

The section on the Tools pillar is written in a tool agnostic fashion, wherein it describes categories of tools commonly used to the DevOps. That makes it much more interesting than any other book that is tied to a particular set of technologies since it is focused on the concept not the implementation.

Overall it’s an interesting read. The focus on the social aspects of what’s going on makes it less useful in my day to day activities, but the longer I do this job the more I think that the technical aspect is essentially table stakes to doing the job and everything else is where more long term growth come from.

Advertisements

Book Chat: Beyond Legacy Code

Beyond Legacy Code is a description of nine practices to help improve the value of software. The author directed it not just at developers or engineers, but also at development or IT managers, product managers, project managers, and software customers. That’s a broad array of people who are coming to a problem with a wide set of goals and preconceptions. Eight of the nine practices are pretty normal and obvious items for most software engineers. One however was novel to me: implement the design last.

The basic idea is pretty straight forward – do a sort of bottom up build of components and then compose them into larger and larger units. Then allow the design of the larger pieces to emerge from that. Since you already have all these well written and tested units you can compose them together safely in ways that you understand. It keeps you from reaching for a more complex design pattern that you may not need because you are still working through all of the smaller pieces. I see it as the red-green-refactor mantra in the macro sense.

I had often tried to accomplish this similarly by starting at the top and stubbing out other smaller pieces as I went. This didn’t always work out since the interface for the piece you stubbed out may not have the information it needed to do its work. I have also seen this end up with odd pieces that don’t really make sense outside of the context of what I was working on so I had less reusable components afterwards. Overall it worked fairly well to try to map decompose the problem in the initial pass.

Since reading this book, I’ve tried their bottom up buildout a couple of times. It seems to have taken me significantly longer to do the work, but I think the overall reusability of the subsequent design is better. I feel that with more practice that I should be able to be at least as productive as before. I haven’t had to come back to any of the code I wrote like this in maintenance so I don’t have any data yet on if it delivers on the maintainability ideas that it promises.

I don’t think that book delivers to the full audience of people who are intended as readers, but it feels well directed at Software Engineers, considering the principles and guidelines we use. I don’t know what a large portion of the audience would get from reading this other than a familiarization with the terms used so they could communicate better. I don’t see how it would cause a project manager to reconsider the schedule, or an IT manager to deal with a project differently. Maybe I can’t take their point of view well enough, so I saw large portions of the suggested practices as ‘normal’ but to those roles they would help articulate the value of unit testing. I don’t know of a better modern book for those involved in the management side of software without a software background that is still technical. Classics like Peopleware or The Mythical Man-Month still show most of what you need to do to run a software team from a strictly management perspective and this doesn’t supplant those. Looking at the reviews on Amazon though it seems as though my concerns that this isn’t what non-developers want seems to be unfounded. The consistent praise it is garnering there makes me curious to see what other non-developers I know would think if they read it.

Goals For Senior Engineers

It’s annual review time around my office. Looking back at my personal goals for the last year and trying to come up with some for next year is a good opportunity for reflection. First up are the ones from last year, summarized.

  1. Make creating a new microservice simpler
  2. Lead a major project with a company wide impact
  3. Get the team into a better place regarding technical debt
  4. Mentor some Junior Engineers on the team
  5. Improve on call documentation/processes

I didn’t succeed at the first one, largely because I never really got a chance to try. It’s unfortunate because it would be great for the company in the long term. However, it’s an engineering-centric efficiency improvement for which the benefits are difficult to quantify, so it was never urgent. Not getting this done was disappointing for me personally. I had identified a way to help lots of engineering teams through an idea that’s obvious in hindsight, but I wasn’t able to find the time to get it implemented.

The second one was a pretty big success. The design I had put together got built out and the resulting architecture worked out great. Despite running into some issues with some of the prerequisites to operationalizing it we were able to get the whole thing into production ahead of schedule. On top of that we beat our load, response time, and resource usage targets handily. Overall I’m amazingly pleased with how this one came out.

The third one is an odd one, since by the metrics I set out for myself we nailed it, but by other metrics you could use to measure it we failed. I led some process changes to help us stop accruing technical debt, and start making more time to deal with the debt. This resulted in us closing an amazing number of tickets, three times the target my boss said was ‘too ambitious’ when I set it. He had suggested tracking the number of open technical debt tickets, but I stuck to the number of tickets dealt with since I felt as people saw these tickets got resolved they would be more inclined to file more of them. I was right on that, so while our metrics got worse, our position really got better.

I spent significant time mentoring our two new grads and two other new hires to help them get up to speed, not just on Scala but also on our systems and software engineering in general. The plan was to set up pair programming time and do accomplish the mentoring during that time. Since I’m on the East coast and a morning person and the new grads are on the West coast and not morning people it made things a bit complex at first in terms of scheduling time. Eventually we found a rhythm of me pairing with each of them an hour a week. This ended up helping them out a bunch. They got solutions to their problems and I taught them a lot of little things like IDE shortcuts and how to use some problem solving resources like symbol hound (a search engine that will search for special characters).

The last goal was meant to resolve a problem that had existed from the time I joined the team. Then, the on call rotation was two people trading off with no backup. The reason was that there wasn’t any documentation for how to deal with some of the ‘done’ systems the team supported, i.e., the systems that are feature complete and don’t have any active development on them. Nobody learned anything about them since they never touched them, but it also meant that the new engineers, myself included, couldn’t really go on call since if the system broke they didn’t have any idea what to do. I ended up doing a deep dive into each of the systems and producing docs about how they worked/logged so that if you got an error you had somewhere to start and an idea of what to do next. We opened up the pager rotation and everyone on the team gets the opportunity to be primary. It’s unclear if the documentation is any good because the systems haven’t had any problems.

This all brings me to goals for next year. There is some reorganization going on so I may end up on a new team. With that unknown right now, I’m not sure how I’ll figure out my goals for the coming year. This complicates the issues I have with this topic every year. We do scrum, so the team is supposed to be working on tickets selected by the product owner, so while you can influence them, you don’t have a ton of control over what you build. You can pick goals around how you work, like the mentoring and technical debt goals I had last year, which you can apply to whatever you end up doing. The more project based goals were based on large work items I knew were coming down the pipeline and probably wouldn’t be moved.

If I stay on my current team I have a good idea of what sort of work I’ll be doing and some process items to try to get to. If I move to a new team I’m not sure what exactly I’ll be doing. I also don’t know if what team I end up on will be finalized before I need to submit my official goals. I could write some goals around community involvement, or doing internal training/presentations. Maybe finding a way to automate some new metrics would be worth doing. I’ve always had difficulty finding good individual goals, since I see what we do as a team effort.

Individual goals are difficult to align with the team-based metrics generally used when doing scrum. This creates a sort of tension between individual goals that the team isn’t aware of and the team metrics and goals which are secondary to everyone’s individual performance. I see the goals for Senior Engineers as needing to be more team focused than those of lower lever engineers since they are more big picture oriented. If you are a Senior Engineer, you’ve reached the point where you are trusted to tackle issues independently, which should creates space to focus not just on merely doing, but on doing better.

Strike Teams Part 2

This is a follow up to the original strike team post I did a while back. I’m writing this as the strike team is wrapping up. It’s been an interesting experience. We hit most of our broad goals but did it in a significantly different way than anticipated.

The first big change came early on when we were looking into the proposed changes to an existing piece of software that was going to be a consumer of what we were building. We realized that accommodating it was probably weeks of work by itself. The initial assumption of how much work was needed had been minimal, but apparently that particular solution was untenable to the people who supported the consuming software in the long term. This ended up with them pulling support from the strike team and becoming uninvolved. At the time, the scope and resourcing change was challenging since it threw everything into an uncertain state. I think it ended up being a good thing; we were able to focus on building out the functionality desired without spending a lot of time making changes to one consumer of the service. It does make the follow up more complex since we do need to make sure that work gets done otherwise the whole system will end up in a more complex state.

There was one portion of what we were going to deliver that had a complex deployment step. There were other changes that had to go either before or after this particular piece so we tried to move that deployment as far forward on the project plan as possible. The intention was to deliver it at the end of the first sprint. We encountered some significant scope changes that all unfortunately all had to go before that step. This ended up pushing the actual deployment out until the middle of the third sprint. At this point we had about ten tickets that we did the work for all sitting in this feature branch just waiting for the eventual opportunity to merge the whole thing back to master with nearly 5000 lines of changes. We ended up performing a series of messy merges bringing changes from master back to the feature branch. The final merge from the feature branch back to master was ugly since we wanted to preserve some but not all of the history of that branch. In retrospect we should have seen the merge issues coming and been more proactive about doing more small merges, but lesson learned for later.

Every team member lost some time to dealing with issues brought to them from the team they were on loan from. We had anticipated this during the first sprint but didn’t expect it to continue through the remainder of the strike team. I know I personally got sucked into some production oddities that were occurring and that after the first sprint without me my team missed my presence in code review to the point that they got me back to reviewing basically everything. Both of the other members of the team got pulled back to the their original team for issues that happened. We didn’t fight any of these needs figuring that since they seemed like critical issues for those teams, having their people available was the best overall usage of time, even if it hurt our overall ability to get things done on the strike team.

The team as a whole never really jelled to any significant degree. Everyone kept working the way their team did, which created some minor conflicts. One engineer’s team has a very lax overall structure and he would sort of disappear for days then show up with large amounts of excellent code, but it was frustrating for him to essentially go off the grid, without forewarning. The other came from a team with experience in a different technical stack and vacillated between asking not enough questions and spending too much time stuck and asking too many questions. The three of us were also in three different locations which made it difficult to really all get on the same page and get working together.

Overall the strike team model worked for this, since having representatives from some of the teams that were going to consume this service working on it made sure that we didn’t run into any issues with a misunderstanding of the domain. There were problems setting up a new team, that we should have attacked proactively since setting up any new team needs to initially get organized and form their own norms. The transient nature of the strike team prohibits a lot of identity building which, in my opinion is key to building a good team. Overall based on this experience I think that the strike team model can deal with software projects crossing multiple software team boundaries, but there may be other better ways out there to be found still.

Strike Teams

At work there has been a new practice of starting up strike teams for different projects. The idea is that for projects that require expertise not found on any individual team, you pull in a person or two from multiple different teams to get all of the correct skills on a single team. That’s the pitch of the strike team model, but the hidden downside is that it breaks up the cohesion of the teams that people are pulled from and the created team may not be together long enough to create new cohesion. This post is mostly going to be a chronicle of the issues that I encountered starting up a strike team and what we did to try to resolve the problems.

The first problem was coordinating the various different teams to figure out who was going to be involved. In the case of the strike team I was forming, the two teams contributing resources both wanted to know who the other team was going to contribute before making their own decisions. They also wanted to have a fixed end date to the project before deciding who to contribute. We ended up resolving this issue with a fixed date for whoever they contributed to the team and getting the two of them to discuss the situation amongst themselves.

The second problem was aligning the start date of the strike team. The three teams contributing resources all have different schedules to their individual sprints so it makes coordinating a ‘start’ date for the strike team difficult.  Those who were on teams about to start another sprint wanted the strike team to start now, whereas those with other commitments made weren’t available. We ended up doing a rolling start where as each team finished their sprint they rolled onto the team and the team ramps up as people become available. We did some preparatory work to get everyone up to speed on the goals and challenges of the team so they were aware of what’s going on whenever they were able to join.

The third big problem is more specific to our particular organization, not to the strike team model. As part of setting up the strike team we needed to schedule things like standups and retros for the team. Since the team is split across both coasts, the hours for scheduling these meetings are limited. The conference rooms are also pretty much all booked because every other team beat us to scheduling. We ended up asking IT to rearrange some other non-recurring meetings and managed to get a consistent slot for the standups. The retro slot was more complicated but we managed to get meetings roughly spaced; by not being a stickler for strict week deliminations we managed to find times that worked.

So with all the overhead sorted out we finally get to move on to the real work of the situation which will be a nice change of pace from the administrative aspect.

Book Chat: Pair Programming Illuminated

My team has been doing more pair programming recently so I picked up a copy of Pair Programming Illuminated. I had never done a significant amount of pair programming before and while I felt I understood the basics, I was hoping to ramp up on some of the nuances of the practice.

It covers why you should be pair programming, convincing management that you should be able to pair program, the physical environment for local pairing, and common social constructs around different kinds of pairs. All of this is useful information, to varying degrees. Since the book was written in 2003, some of the specifics of the physical environment section didn’t age well – advising the use of 17” monitors most obviously. Both of the evangelizing sections seemed to cover the same ground, and did not seem to be written in a way to try and convince someone who is not already open to the concept. Neither section seemed to be written to the person who isn’t already in favor of doing pair programming. There were lots of references to studies, and some personal anecdotes, but none of it stuck in a way that felt like it would change someone’s mind.

The social aspects were interesting, however most of the section was stuff that felt obvious. If you have two introverts working together then they need to work differently than if you have two extroverts working together. A lot of the time the tips were common sense, and didn’t seem like it was necessary to write it down in the book. I would have liked to see more discussion of getting someone to vocalize more and clearly what they’re thinking about.

I feel like I’m better equipped to do pair programming because of having read this, but I also feel like a long blog post would have been just as good a resource and much more focused. I don’t know what else I would have wanted to fill out the rest of the book.

The Systems of Software Engineering

This idea came together from two different sources. First, I’ve been reading The Fifth Discipline about the creation of learning organizations. One of the elements of becoming a learning organization is Systems Thinking, which I had heard of before and seems like a great idea. Then I went to a local meetup about applying Systems Thinking to software development. We ran through a series of systems diagrams from Scaling Lean & Agile Development. These diagrams helped us to both understand Systems Thinking by analyzing a domain we were already familiar with in a structured way where we were presented the nodes of the system diagram and asked to draw connections between them.

There were several different groups each looking at the same sets of nodes and drawing different conclusions for some of the relationships. There were a few very adamant people who all felt that an increase of incentives for the developers would increase the velocity of the team, which was interesting to see since they were mostly scrum masters or software development managers with many years of experience doing the management of software projects and should have seen how that does work out. If the impact of an incentive plan on a team is significantly uncertain, it makes me curious about the sorts of teams these other leaders are running.

Through the entire process we had interesting conversations about how various aspects of the software process were related. Everyone agreed that the interaction of a strong mentoring process on the overall system would decrease the number of low skills developers on the team. There was a discussion of whether it would also directly impact the velocity of the team. Some people were adamant that it would lower velocity since the mentoring time was time that people weren’t working on building features. It’s a reasonable consideration, but it doesn’t seem to match with my specific experiences with mentoring. If you were actively spending enough time to lower your velocity significantly on mentoring activities you would be forcing the mentoring relationship instead of letting it happen organically.

The experience made me wonder if we could construct a system diagram that describes some of the other dysfunctions of many organizations. The diagram we built described (1) dysfunctions around hiring large numbers of low skill developers, (2) using certain kinds of financial incentives, and (3) the push for delivering features above all else. It didn’t describe why a lot of organizations end up in siloed configurations, “not invented here” behavior on technical systems, or lots of the other dysfunctions in multiple organizations. I made the below diagram with the intent of describing  the situation about siloed organizations but without any real experience in this sort of analysis I’m not sure if I’m doing it well.

I made another simple diagram about the causes of “not invented here.”

It feels to me like these diagrams describe the dysfunctions that are common in software organizations. Expanding these diagrams to try and find the leverage points in the system might yield some greater insights into the problems. In spending some time thinking about both of the diagrams I’m not sure what other nodes should be there to further describe the problems.

I’m going to definitely do some more reading on Systems Thinking and try to expand on the thoughts behind these diagrams. If you’ve got more experience with System Thinking I’d love to hear some feedback on these charts.

Theories of Technical Debt

There are a couple of different major causes of technical debt even on the best run projects.

  1. Schedule pressure
  2. Changed requirements
  3. Lack of understanding of the domain

You can choose to take on debt strategically to accommodate an aggressive schedule, you can accumulate debt from having things change, or you can collect debt from doing what appeared to be the right thing but turned out not to be once you learned more about the underlying situation. There are plenty of other reasons that technical debt can be acquired, but most of those can be avoided. Changing requirements can’t really be avoided; things can change, that’s the nature of life. Understanding of the domain is a tricky issue, since you can spend more time upfront to better understand the domain but you will still uncover new aspects as you go.

Schedule pressure is the most demanding of the three. You can consciously say that technical debt should be taken on by doing something the expedient way. You can also have implicit schedule pressure that pervades the decision making process. This sort of pervasive pressure causes people to value different things. If leadership discusses the schedule day in, day out, but doesn’t mention quality, it ends up lacking.

Technical debt is fundamentally a lack of quality; not in the defect sense but in the lack of craftsmanship sense. All of those 5,000 line classes got written by other engineers who were doing the best they could within the constraints of the environment at the time. But some engineers look at that and don’t want to touch it afraid of what it is and how hard it is to change. Some other engineers look at it and see a mountain to be climbed, or a wilderness to be civilized. The problem code is something to be taken and broken to your will. Each kind of engineer has a place in the development lifecycle.

If a company needs to hit a product window and is only full of the kind of engineers who see debt as a challenge to be dealt with they might not be able to make that tradeoff. If you only have the engineers who are concerned with maximum velocity but leave behind chaos in their wake you will have trouble as the codebase matures. Every engineer seems to have a range on the spectrum where they are most comfortable. Where in that range they land day to day seems to be controlled by the environment around them.

If you are surrounded by one side or the other you might lean towards that side of the range. The messages management sends out about the quality of software relative to the messages they send out about schedule of the project is another important factor. If they keep hammering home to get more done, people will find corners they think can get cut. What those corners are will differ between people, but they will take on debt in various corners of the codebase. This sort of debt is really insidious since it was never consciously decided on. If you decide that you will defer good error messages, or avoid building out an abstraction now, but you do it explicitly because of the schedule is the right business choice, then since the team discussed and decided to do it, they know as a whole that’s not the best technical solution but is the best business solution, whereas if someone just does it everyone else may not even be aware of the other options.

Fix vs Replace

I was thinking about when to fix a piece of software versus replace it as part of our normal software lifecycle, prompted by a discussion at work of a particular piece of software. The software in question works great, but nobody is confident that they are successful when changes do need to be made. Part of the lack of confidence is because changes are only made once or twice a year, so getting it set up and running locally and testing it is a chore, plus it turns into a complex process with many chances of failure. The other part is it runs on top of another library that nobody is really familiar with; the library isn’t used anywhere else in our stack, so nobody develops any familiarity with it. This particular piece of software is also critical to expanding our business, as it’s the primary way new client data is loaded into the system.

If we wanted to fix it, we could take the existing solid software and enhance it to make the setup and testing easier, or we could clean up the usage of the underlying library so it’s more intuitive. However, the decision was instead made to replace it with something totally new. I wasn’t involved in the decision, but became involved with the original piece of software after the fact to make a change while the replacement is still being developed. It seems like this software could be rehabilitated at first glance, but clearly someone else thought otherwise.

I know in general I’m biased towards fixing existing software. I’ve spent most of my career working on brownfield applications and building oddly shaped pegs to fit back into the oddly shaped holes of those applications. I think that I’ve done this because I enjoy it; building everything from scratch is almost too easy since there are so many fewer constraints involved. I get a different sort of satisfaction from it. I know I’m not the only one who has this particular tendency; the folks at Corgibytes are specializing in this sort of work. I’ve even been nostalgic for a codebase I’ve worked on – not the application but the codebase itself.

I feel like most organizations are biased towards replacing software because it lets you just say the entire thing is bad and try again instead of having to pick a particular thing to do or change. You don’t have to agree on what’s wrong with it, or get into details of what to change. This flexibility of scope leads to the quid pro quo rewrite where a piece of software is replaced with a new version that also contains a major new feature; this concept was introduced to me by Re-Engineering Legacy Software. Re-Engineering Legacy Software describes this as a bargaining tool to enable you to gain acceptance of a plan to do the rewrite, but I’ve always seen the business bring up the rewrite with the idea that they’re unhappy with the team’s ability to change a piece of software and this would clean up the underlying causes of the problems.

That’s the big problem: the current software has problems due to something. And unless you deal with whatever that “something” is, the new software probably is not going to be significantly better than the old. It may not have had the chance to become crufty yet, so it seems better when it’s new, but given a few years goes back to the same sorts of problems you had the last time. You need normal software processes that enable you to create and maintain quality software even as requirements change.

You need feedback into your initial processes of what caused the cruftiness to accumulate. This can be seen as a form of double-loop learning, where the feedback of what happened impacts how you see the world and thereby influences your decision-making process, not just the decisions themselves. If you are accumulating cruft because you put schedule pressure on the initial development resulting in a less modular design, the feedback to the decision-making process would be different than if you are accumulating cruft because the requirements changed radically. To make true long-term improvements, that’s the step you need to take, which sometimes might lead you to fix, and sometimes might lead you to replace.

Modern Agile

I heard some references to Modern Agile and went to investigate further. To me, it’s just a reimagining of values espoused in the Agile Manifesto in a response to how the original manifesto was misused. The four values of the modern agile movement are:

  • Make People Awesome
    • The “people”: here is those making, using, or otherwise impacted by software
  • Make Safety a Prerequisite
    • Mostly psychological safety, but includes physical safety too if that is a concern
  • Experiment and Learn Rapidly
    • Try to find ways to do whatever they do better
  • Deliver Value Continuously
    • This, to me, means the technical practices around continuous delivery. But could mean other things outside of the web application space.
    • This is also breaking work into smaller individually valuable pieces.

At first I wasn’t that enthusiastic about the idea, and it seemed to be splintering the agile community. Then I started thinking about the recent resignations at the Scrum Alliance, and consider that maybe the community was already splintering and this is just people writing down what had already happened. The people for whom orthodox scrum is agile had already disregarded “individuals and interactions over process and tools.” Both LeSS and SAFe give up on “responding to change over following a plan” to varying degrees.  

The “focus on people” is, to me, the only right way to build systems and products, but the diffusion of who you are supposed to be making “awesome” makes this focus less impactful. If you were trying to empower the makers to do awesome you would do things differently than a single-minded focus on the customer, and some of those things are mutually exclusive. Modern agile examples of focus on the customer from Amazon and Apple come with horror stories on the engineering side of the late nights and insane demands – can you really make all the parties named “awesome” at the same time?.

“Safety as a prerequisite” is the opposing force to the focus on the customer and the negatives it can cause. This can be protecting work-life balance and everything else that makes people able to engage with the work. Another aspect of this safety is an openness to discuss problems in a candid way. I don’t know if I’ve ever seen the sort of openness to admitting problems that this is designed to inspire. I’ve been places where small mistakes were admitted to openly, but it’s unclear what would happen with larger mistakes since as far as I know none happened while I was there. I don’t think I’ve ever seen the sort of transparency from above which would make this level of openness either which makes it difficult to reciprocate.

I think experimentation is the key insight of modern agile. The original manifesto implied that if external stimuli did not change, the team did not need to try to change. This implication resulted in a challenge-response feedback loop and sought a steady state where if things around you didn’t change much you didn’t change. Even then there are a lot of organizations where the right thing to do is whatever is specified by the scrum guide and anything else is wrong. This experimentation adds an additional focus on continuously trying to find a way to do better regardless of how well you are already doing.

I see “delivering value continuously” as a continuation of working software over comprehensive documentation. Working software, from the agile manifesto, is the expression of what is most valuable,” so you could say it described this concept in a narrower sense. The focus on “value” rather than “software” is helpful since sometimes software isn’t the right solution – for instance, if you have an internal API that works but nobody uses because they are having trouble with the API, activities other than writing more software would be more valuable.

If we as practitioners wanted to stop and reconsider how we go about doing work, this seems like a good place to start that conversation. I would have liked to have seen it as a more open discussion rather than appearing fully formed from the head of a consultancy. These ideas seem reasonable, but it’s not clear what the next step is in taking up this mantle. I would always try to be on a team that was doing all of these things, but I don’t know if this alone is enough to make a team that I want to work on.