Strike Teams

At work there has been a new practice of starting up strike teams for different projects. The idea is that for projects requiring expertise not found on any single team, you pull in a person or two from several different teams to get all of the right skills in one place. That’s the pitch of the strike team model; the hidden downside is that it breaks up the cohesion of the teams people are pulled from, and the newly formed team may not be together long enough to build cohesion of its own. This post is mostly a chronicle of the issues I encountered starting up a strike team and what we did to try to resolve them.

The first problem was coordinating the various teams to figure out who was going to be involved. In the case of the strike team I was forming, the two teams contributing people each wanted to know who the other team was going to contribute before making their own decision. They also wanted a fixed end date for the project before deciding whom to contribute. We ended up resolving this by fixing an end date for whoever joined the strike team and getting the two teams to discuss the situation amongst themselves.

The second problem was aligning the start date of the strike team. The teams contributing people all ran their sprints on different schedules, which made coordinating a ‘start’ date difficult. Those on teams about to start another sprint wanted the strike team to start immediately, whereas those with existing commitments weren’t available yet. We ended up doing a rolling start: as each team finished its sprint, its people rolled onto the strike team, so the team ramped up as people became available. We did some preparatory work to get everyone up to speed on the goals and challenges of the team, so they knew what was going on whenever they were able to join.

The third big problem was more specific to our particular organization than to the strike team model. As part of setting up the strike team we needed to schedule meetings like standups and retros. Since the team is split across both coasts, the hours available for those meetings are limited, and the conference rooms were pretty much all booked because every other team had beaten us to scheduling. We ended up asking IT to rearrange some other non-recurring meetings and managed to get a consistent slot for the standups. The retro slot was more complicated, but by not being a stickler for strict week boundaries we managed to find roughly evenly spaced times that worked.

So with all the overhead sorted out, we finally get to move on to the real work of the project, which will be a nice change of pace from the administrative side.


Seven More Languages in Seven Weeks

Seven More Languages in Seven Weeks is a continuation of the idea started in Seven Languages in Seven Weeks that by looking at other languages you can expand your understanding of concepts in software engineering. While you may never write production code in any of these languages, looking at the ideas that are available may influence the way you think about problems and provide better idioms for solving them.

This installment brings chapters on Lua, Factor, Elixir, Elm, Julia, MiniKanren, and Idris. Each of these languages is out on the forefront of some part of software engineering. Lua is a scripting language with excellent syntax for expressing data as code. Factor is a stack-based programming language with interesting function composition capabilities. Elixir puts a Ruby-like syntax on the Erlang VM. Elm is reactive functional programming that compiles to JavaScript. Julia targets technical computing with a more user-friendly feel and good parallelism primitives. MiniKanren is a logic programming language and constraint solver built on top of Clojure. Idris is a Haskell descendant bringing in the power of dependent types to provide provably correct functional code.

Overall it was an interesting survey of the variety of programming languages. Some I had done a bit with before (Lua, Elixir), some I had heard of before (Elm, Julia, and Idris), and some I hadn’t even heard of (Factor and MiniKanren). Each chapter is broken into three ‘days’, each a logical chunk of the book to tackle at once. Each day ends with a series of exercises to help make sure you understand what’s being presented.

Since these languages are out on the edge of the programming world, they are evolving fairly quickly. This ended up biting the Elm example code particularly hard: large portions of it had been deprecated in the releases since publication and didn’t work on the current runtime. Compared to the lineup from the original book (Clojure, Haskell, Io, Prolog, Scala, Erlang, and Ruby), you get a much broader variety of languages in the sequel, but nothing with the popularity of Ruby or the legacy install base of Erlang. Since the book was written in 2014, none of these languages has had a massive breakout in popularity and adoption; however, they do seem to do well on lists of languages people want to work with.

Overall it’s an interesting take on where things could be going. I don’t think most of the languages covered have significant mainstream appeal right now, though two of them seem more ready for primetime than the others. Julia definitely has a niche where it could be successful, and I feel like the environment is ripe for something like Elm to surge in popularity since frontend technology seems to be going through constant revisions.

Bug Bash

We had a big all-hands off-site meeting. In the runup to it, there were two days that most teams had left out of their planning process, so we ended up running a company-wide bug bash to close out all of the existing bugs, even those pesky minor ones that are never really a priority. It felt like over those two days my team closed out an entire three-week sprint’s worth of bugs by points. After the fact I went back and counted and found that the feeling was right. Some of the bugs had previously been pointed and some got quick estimates I put on them after the fact, but it was a slew of 1-3 point tickets that all got closed out.

Dealing with ~30 tickets in two days was a furious endeavour for the team, but it was really satisfying to clear out all of those tickets that had built up. I know they had been weighing on my mind; the list just seemed to keep growing and I wasn’t sure what to do about it.

This brought me to another interesting question: why were we so much more effective during the bug bash than in a normal sprint? Are we underestimating the stories relative to the bugs? That seems the most obvious answer. Is the low end of the point spectrum too compressed, so that the difference between a two-point task and a three-point task isn’t sufficiently granular? I spoke with some other members of my team about it, and there was some speculation that we each got more done because we took tickets related to what we already knew, but that shouldn’t account for all of the difference. It’s also possible that since all of the pieces were strongly independent, we had less communication overhead.

Maybe I should just be happy that all of those bugs got dealt with, but I’d really love to find a way to bring that efficiency to our normal process. If anyone has any ideas, please share in the comments.

Java Containers on Mesos

I recently ran into an interesting issue with an application running in a container. It would fire off a bunch of parallel web requests (~50) and sometimes would receive the results but not process them in a timely manner, even though the application performance monitoring we were using said CPU usage stayed very low during the requests. After a ton of investigation, I found a few very important facts that contradicted some assumptions I had made about how containers and the JVM interact.

  1. We had been running the containers in Marathon with a very low CPU allocation (0.5) since they didn’t regularly do much computation. This isn’t a hard cap on the container’s resource usage. Instead it is used by Mesos to decide which physical host should run the container, and it influences the scheduler of the host machine. More information is available in this blog post.
  2. The number of processors the runtime reports is the number of processors the host node has; it has nothing to do with the CPU allocation given to the container. This impacts all sorts of under-the-hood optimizations the runtime makes, including thread pool sizes and the resources allocated to the JIT (see the sketch after this list). Check out this presentation for more information on this topic.
  3. Mesos can be configured with different isolation modes that control how the system behaves when containers start contending for resources. In my case it was configured to let a container borrow against its future CPU allocation, up to a certain point.
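
To make the second point concrete, here is a minimal sketch, assuming a plain Scala app and a hypothetical APP_MAX_THREADS environment variable, of how the reported processor count differs from the container’s allocation and how a pool can be sized from explicit configuration instead:

import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

object PoolSizing extends App {
  // Inside a container this reports the host's core count (e.g. 32),
  // not the 0.5 CPU that Marathon allocated to the container.
  val reportedCores = Runtime.getRuntime().availableProcessors()
  println(s"JVM sees $reportedCores processors")

  // Hypothetical explicit cap: size the pool from configuration rather than
  // from the misleading processor count the runtime reports.
  val configuredParallelism = sys.env.getOrElse("APP_MAX_THREADS", "4").toInt
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(configuredParallelism))
}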

This all meant the service fired off all of the web requests on independent threads, which burned through the CPU allocation for the current time period and the next, so when the results came back there was no CPU left to process them promptly. The immediate fix was to change the code to only fire off a maximum number of requests at a time. In the longer term we’re going to change how we define the number of threads, but since that has a larger impact it got deferred until we could measure it more carefully.
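
As a rough illustration of the immediate fix, here is a sketch of batching the requests; the fetchAllBounded helper and its parameters are hypothetical stand-ins, not our actual code:

import scala.concurrent.{ExecutionContext, Future}

// Hypothetical helper: run at most `maxInFlight` requests at a time instead of
// launching every request on its own thread immediately.
def fetchAllBounded[A, B](items: Seq[A], maxInFlight: Int)(fetch: A => Future[B])
                         (implicit ec: ExecutionContext): Future[Seq[B]] =
  items.grouped(maxInFlight).foldLeft(Future.successful(Vector.empty[B])) { (acc, batch) =>
    acc.flatMap { done =>
      // Each batch runs in parallel; the next batch starts only after this one completes.
      Future.sequence(batch.map(fetch)).map(done ++ _)
    }
  }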

Snipe Hunt

Recently I got pulled into a project to help get a feature that was mostly finished through a “final QA round” before release. I figured this wouldn’t require much of my time, but as you can imagine, things didn’t quite go as expected. The QA round found about a dozen errors in the new feature, which I eventually divided into two classifications: requirements SNAFUs and code quality issues.

The requirements SNAFUs were the sorts of problems where the original programmer built what was explicitly asked for, but QA took the “one of everything” approach and tried all sorts of cases that weren’t specified at all. These problems consume time but aren’t that difficult to fix. The code quality issues are much more pernicious.

Digging into the code itself, I quickly found something interesting. There were two fields, currentPlanId and activePlan, that were being mutated in various parts of the application, generally together. There wasn’t any clear distinction between the active plan and the current plan in the code, and at one point currentPlanId was being set to the id from the active plan, sort of implying they were the same thing with poor naming. There were other places where one or both of them would mutate, so I went about tracing what caused the two to diverge or converge.

On initial page load the two would differ, with the active plan being blank; then, when an item was selected in the dropdown, the two could converge, depending on what was selected. I went looking for the tests covering this behavior, hoping they would clarify the intended scenarios, and turned up none. At this point I let others know that while the visible problem seemed minor, there was a bigger quality problem under the hood of the system.

The first code change I made was a relatively minor one affecting when a particular button should show up; after adding a special case, another test case started behaving. So far so good. Then I started tweaking the functions that were setting currentPlanId and activePlan. By this point I had managed to figure out that current was a chronological state and active was a UI state, but it still wasn’t clear how the system decided which plan was current. That obscured information seemed to be intertwined with the cause of a lot of the remaining broken cases.
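
To make the distinction I eventually worked out concrete, here is a hypothetical sketch of a clearer model; the names and shape are mine, not the actual code:

// Hypothetical clarified model: "current" is a chronological fact about the data,
// while "active" is purely a UI selection. Keeping them separate makes the divergence
// on initial page load (active empty, current populated) explicit.
case class Plan(id: String, name: String)

case class PlanState(
  currentPlan: Option[Plan], // the chronologically current plan from the backend
  activePlan: Option[Plan]   // whatever the user has selected in the dropdown, if anything
)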

I followed the web service calls back through several layers of microservices to where I knew the information had to be coming from and made an intriguing discovery: the way the frontend decided which plan was current was, incorrectly, based on the timing between two different web service calls. I started digging around for the right way to set all of this information, and it became clear that the initial architecture was missing a layer to coordinate the requests made on initial page load.

That got everything trending in the right direction. I still want to find some time to work through some additional unit tests and leave the code in a good state rather than just a better state.

Book Chat: The Pragmatic Programmer

For a long time this had been on my list of books to buy and read, with a note to self to check whether there was already a copy somewhere on my bookshelf before buying one. It felt like a book I had read at some point years ago but didn’t really remember anymore; even the woodworking plane on the cover felt familiar. It is full of the kind of ideas about creating software that you love when you encounter them but that are disappointingly rare in practice. Despite being from the year 2000, it still contains a wealth of great advice on the craft of creating software.

Since it is about the craft of software, not any specific technology, tool, or style, it has aged much better than other books. That timeless quality makes the book like a great piece of hardwood furniture: it may wear a little, but it develops a patina that says these are the ideas that really matter. There is an entire chapter devoted to mastering the basic tools of the trade: your editors and debuggers, as well as the suite of command line tools available for basic automation tasks. While we’ve since developed a number of specialized tools for a lot of these tasks, it is valuable to remember that you don’t need to break out a really big tool to accomplish a small but valuable task.

It’s all about the fundamentals, and mastering these sorts of skills transfers across domains and technical stacks. The book was popular enough that it spawned an entire series – The Pragmatic Bookshelf – and while I have only written about one of those titles, I have read a few more and they’ve all been informative.

About two-thirds of the way through the book I realized that I had indeed read it before – I had borrowed a copy from a coworker at my second job, who had recommended it as a source he had learned a lot from. I remember enjoying it a lot but not really appreciating its timeless quality. That would have been around 2007, so it wouldn’t have seemed as old, especially since things seemed to be moving less quickly then. Or maybe I just feel that way because I didn’t know enough of the old stuff to see it changing.

If you haven’t read it, go do it.


Changing Stacks

I had an odd realization recently – the jobs I changed stacks for seemed to be better than the jobs where I already knew the stack. I’m not sure if that’s a common experience or something unique to mine.

I’ve got a few theories about why this might be the case for me. First, the jobs where I changed stacks were more engaging because there was more to learn. Second, those jobs felt better because they used different hiring practices, looking for underlying talent and skills rather than specific stack experience. Third, and related to the second point, the diverse points of view brought together when so many colleagues are also changing stacks create a better workplace. Fourth, the people who are willing to change stacks are the kind of people who stay open to learning. Lastly, the jobs where I changed stacks may simply have been better by pure chance, given the small sample size. None of these theories is particularly provable. Several of them could be true and working together; they could all be false and it could be something completely different.

I know from talking to some of my coworkers that they all came into their current jobs without particular experience in the Scala/Play Framework/MongoDB stack as well, although most of them came from a much more similar Java stack rather than the C# stack I came from most recently, or any of the other stacks I’ve worked with in the past. They mentioned that we have trouble recruiting in our DC office because of the pay differentials for cleared work in the area; there are lots of places you can go and get a comfortable government contracting job without really stretching yourself and growing. There has also been some discussion about how the stack affects our ability to recruit since it’s not as common as others, and some candidates have expressed reservations since the stack isn’t really growing in popularity. The most recent Stack Overflow Developer Survey corroborates the idea that Scala is shrinking, but only as a relative share of all responses; in absolute terms the change is less clear.

I guess I’ve enjoyed the stack changes I’ve made. Only once did I set out with the intention of changing stacks as part of a job change, and even then I wasn’t looking to go anywhere in particular; I was looking to move away from ColdFusion since it didn’t really mesh with how I like to do development. I don’t intend to make another change any time soon, but I’ll definitely consider another stack change when I do, just because it opens up a wider variety of options and seems to have worked out for me in the past.

Book Chat: Growing Object-Oriented Software Guided By Tests

Growing Object-Oriented Software Guided By Tests is an early text on TDD. Since it was published in 2010, the code samples are fairly dated, but the essence of TDD is there. You need to look past some of the specific listings, since the choice of libraries (JUnit, jMock, and something called Window Licker that I had never heard of) seems to have fallen out of favor. Instead, focus on the listings where the authors show all of the steps and how their code evolved while building out each individual item. It’s sort of as if you are pair programming with the book, in that you see the thought process and the intermediate steps that would never show up in a commit history, a bit like this old post on refactoring but with the code intermixed.

This would have been mind-blowing stuff to me in 2010; however, the march of time has moved three of the five parts of the book into ‘correct but commonly known’ territory. The last two parts cover what people are still having trouble with when doing TDD.

Part 4 of the book really spoke to me. It is a catalog of anti-patterns describing ways the authors had seen TDD go off the rails, with options for dealing with each of those issues. Some of the anti-patterns were architectural, like singletons; some were specific technical ideas, like patterns for creating test data; and some were more social, about how to write tests so they are more readable or produce better failure messages.

Part 5 covers some advanced topics, like how to write tests for threaded or asynchronous code. I haven’t had a chance to try the strategies they show, but they do look better than the ways I have coped with these problems in the past. There is also a great appendix on how to write a Hamcrest matcher, which, when I had to do it in the past, was more difficult the first time than it looks.
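
For a flavor of what that appendix covers, here is a minimal matcher sketched in Scala against the Hamcrest API; the HasLength example is mine, not one from the book:

import org.hamcrest.{Description, TypeSafeMatcher}

// Illustrative matcher: matches strings of an exact length and produces a readable
// failure message, which is most of what writing a custom matcher involves.
class HasLength(expected: Int) extends TypeSafeMatcher[String] {
  override def matchesSafely(item: String): Boolean = item.length == expected

  override def describeTo(description: Description): Unit =
    description.appendText(s"a string of length $expected")

  override def describeMismatchSafely(item: String, mismatch: Description): Unit =
    mismatch.appendText(s"had length ${item.length}")
}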

Overall, if you are doing TDD and running into issues, checking out part 4 of this book could easily help you immediately. Reading parts 1 through 3 is still a great introduction to the topic if you aren’t already familiar with it. I didn’t have a good book on TDD to recommend before, and while this one isn’t amazing in all respects, I would recommend it to someone looking to get started with the ideas.

Session Fun

I’ve been working on getting a new session infrastructure set up for the web application I’m working on. We ended up going with a stateful session stored in MongoDB, along with some endpoints to query the session. This design has a couple of nice aspects – all of the logic about whether a session is active lives inside one specific service, and a session can be terminated if needed.
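
As a rough sketch of the shape of that design, assuming hypothetical field names rather than our actual schema, the session service stores a document like this and keeps the activity check in one place:

import java.time.{Duration, Instant}

// Hypothetical shape of the stored session document; field names are illustrative only.
case class Session(
  sessionId: String,
  userId: String,
  createdAt: Instant,
  lastSeenAt: Instant,
  terminated: Boolean
)

object SessionRules {
  val idleTimeout: Duration = Duration.ofMinutes(30) // illustrative value

  // All of the "is this session still good?" logic lives in one service.
  def isActive(session: Session, now: Instant): Boolean =
    !session.terminated &&
      Duration.between(session.lastSeenAt, now).compareTo(idleTimeout) <= 0
}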

Building a session infrastructure is a fairly common activity, but we are building a session that is used by multiple services, and we can’t roll all of them out simultaneously. So we’ve been building a setup that can process both the new session and the old session as a backwards-compatible intermediate step. This is creating some interesting flow issues. There are two specific issues I want to discuss: (1) how to keep the new session alive during the backwards-compatible period and (2) how to process identity federation.

Maintaining the new session from applications that have not been updated is nuanced. Since, by definition, you haven’t made changes to the applications that haven’t been updated, you can’t add any calls. In our case there was a call made from all of the application frontends to one specific service, so we are piggybacking on that call to keep the session alive. We looked into a couple of other options but didn’t find anything easy. We considered rolling out a heartbeat to the various application frontends, but that would require an extra round of updates, testing, and deploys for code that was likely to be ripped out when we were done.
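
The piggyback itself amounts to something like the following sketch, assuming a hypothetical wrapper around the one call every frontend already makes; the names are illustrative:

import scala.concurrent.{ExecutionContext, Future}

// Hypothetical hook: the endpoint every frontend already calls also refreshes the new
// session, so applications that haven't been updated keep it alive without code changes.
def handleExistingCall[A](sessionId: String)(realWork: => Future[A])
                         (touchSession: String => Future[Unit])
                         (implicit ec: ExecutionContext): Future[A] =
  for {
    _      <- touchSession(sessionId) // bump lastSeenAt on the stored session
    result <- realWork
  } yield result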

The federation flow is extra complex because a federation in the new scheme is not that different from some of the session passing semantics in the old scheme. This mixes together the case where a federation is simply occurring and the case where the new session has timed out while the old session is still valid. It creates an awkward compromise: you’d like to be able to say that if the new session has timed out the entire session is expired, but if you can’t tell the difference between the two cases that isn’t possible. It means the new session’s expiration can’t be any longer than the old session’s. But it also solved the problem of maintaining the new session in applications that haven’t been updated yet.

The one problem solved the other, which was a nice little win.

Continuation Passing Style

I have been doing some work with a library that creates guards around various web endpoints. The guards implement different kinds of authentication and authorization rules, but are all written in a continuation passing style. The idea of continuation passing style is that you give some construct a function to ‘continue’ execution with once it has done its work; if you’ve ever written an event handler, that was a continuation. The usages all look somewhat like

secureActionAsync(parseAs[ModelType]) { (userInfo, model) => user code goes here }

There was some discussion around whether we wanted to do it like that or with a more traditional control flow like

secureActionAsync(parseAs[ModelType]) match {
    case Allowed(userInfo, model) => user code goes here then common post action code
    case Unallowed => common error handling code
}

The obvious issue with the second sample is the need to call the error handling code and the post-action code by hand, which creates an opportunity to forget to do so or to do so incorrectly. The extra calls also distract from the specific user code that is the point of the method.

There were some additional concerns about the testability and debuggability of code in the continuation passing style. The debugging side does have some complexity, but it isn’t any more difficult to work through than a normal map call, which is already common in the codebase. The testability aspect is somewhat more complex. The method can be overridden with a version that always calls the action, but that version still needs to run the common post-action code, which may or may not need to be mocked. If it doesn’t need to be mocked this solution works great; if it does, putting together a helper that sets up the mocks simplifies the issue.
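
For testing, the override we discussed looks roughly like the sketch below; the types and signatures are simplified stand-ins for the real library, not its actual API:

import scala.concurrent.Future

// Simplified, hypothetical stand-ins for the guard library's types.
case class UserInfo(id: String)

trait SecureActions[M] {
  // The production version checks auth and then invokes the continuation.
  def secureActionAsync(block: (UserInfo, M) => Future[String]): Future[String]
}

// Test double: always calls the continuation with canned values, so the user code under
// test runs while the common pre/post action code stays collected in one place.
class AlwaysAllowedActions[M](testUser: UserInfo, testModel: M) extends SecureActions[M] {
  def secureActionAsync(block: (UserInfo, M) => Future[String]): Future[String] =
    block(testUser, testModel) // common post-action code could be run or mocked here
}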

This usage of a continuation helps collect the cross-cutting concern and keeps it all in one place. You could wrap up this concern in other ways, notably with something like AspectJ. The issue with AspectJ is that it is much less accessible than the continuation passing style; for this problem it is like using a bazooka on a fly: it can solve it, but the level of complexity introduced isn’t worth it.