Book Chat: Site Reliability Engineering

Site Reliability Engineering is about the practices and processes Google uses internally to run their infrastructure and services. There are a series of principles and practices espoused for how to run that sort of highly available distributed systems. Some of the practices are obvious, like having a good plan for what to do during an incident; some are more complex, like how to design a system to be resilient to cascading failures.

For those unaware of the Site Reliability Engineering (SRE) team at Google, it is a hybrid operations-software engineering team that isn’t responsible for functionality of a system but is responsible for ensuring that the service meets its uptime requirements. Not all services get a corresponding SRE team, just those with higher business value and reliability needs. By bringing in individuals with the blend of skills that are not as common and giving them this unique mission they are uniquely positioned to solve reliability problems in a systematic way.

The book describes a framework for discussing and measuring the risks of changing a software system. Most incidents are the direct result of a change to the system. The authors argue that necessitates putting the team that is responsible for the reliability of the system into the flow of releases and giving them the ability to influence the rate of change of the underlying service. That allows them to flow information back to the engineers building the system in a structured way. The ability to ‘return the pager’ gives the SRE team leverage that a normal operations team doesn’t have when dealing with an engineering team.

The limits of operational burden on the SRE team are a strong cultural point. The team is engineers and they need to leverage their software engineering skills to automate their jobs so that the number of SREs scales with the complexity of the service not the size of the service. By placing this limit to the amount of manual work the team engages in and the fact that they have a process in place for how to reboot a team that has gotten too deep into manual work builds a strong understanding of what a successful team looks like. The cultural aspect of rebuilding a team is more important than the technical aspect of it since each of these people knows how to do the right thing but their priorities have gotten warped over time.

As someone on the engineering side, there are significant portions of the book that aren’t immediately relevant to what I do. In reading this I may have learned more than I ever really wanted to know about load balancing or distributed consensus protocols. But the sections on effective incident response, post mortems, and culture more than make up for it for me.

The SRE discipline is an interesting hybrid of software engineering and software operations, and it is the only real way to handle the complexities of software systems going forward. The book stressed repeatedly that it takes a special breed to see how to build the systems to enable automation of this sort of work. I can see that in the operations staff I’ve interacted with over the years. A lot of them had a strong “take a ticket, do a ticket” mentality with no thought on to how to make the tickets self-service, or remove the need to perform the task at all. It’s a lot like bringing back the distinction between systems programming and application programming, where there was one kind of engineer that was capable of working at that lower level of the stack and building the pieces other users could work with on top of that.

Overall I enjoyed the book. It brought together the ideas that operations teams shouldn’t be that different from the engineering teams in terms of the sort of culture that makes them effective. The book really covers good software practices from the guise of that lower level of the operational stack. Then again I’m a sucker for the kind of software book that has 5 appendices and 12 pages of bibliography.

Advertisements

Functional Programming Katas

Based upon my success with the F# koans I went looking for some more covering the functional side of Scala programming. I’ve found a couple so far, which I’ve completed with varying levels of success.

First was the Learn FP repo is a github repo to checkout and has some code to fill in to make some tests pass. The exercise asks you to provide the implementations of various type classes for different types. There were some links to other articles about the topics but otherwise it was just code. The first part of this was fairly straightforward; I had some trouble with State and Writer but otherwise persevered until I hit the wall at Free. I ended up breaking down and looking up the completed solutions provided in a different branch to find that IntelliJ indicates that the correct solution doesn’t compile (Turns out the IntelliJ Scala plugin has an entire scala compiler in it, and it’s rough around the edges). That frustrated me for a while but I eventually managed to power through, and thankfully the rest of the exercises didn’t suffer from the same problem.

Next was the Cats tutorial. I had done some of the other exercises here when first learning Scala and that had been pretty helpful. This has a neat interactive website to run the code you fill in, but it makes it harder to experiment more with the code. This seemed like a reasonable place to start to cover a lot of the major type classes in Cats. It has you look at sample code and fill in what it would evaluate to. It was good but I had two issues with it. First, there are multiple blanks to fill in some of the sections and it evaluates all of them as a group and doesn’t provide any feedback helping you know which one you got wrong. Second, it’s a lot of looking at other code and describing what it does, no writing of code in this style yourself. Overall it helped me feel more comfortable with some of the terminology, but didn’t produce that “ah ha” moment I was looking for regarding the bigger picture.

Then I went to the Functional Structures Refactoring Kata, which is an application to be refactored into a more functional style with samples in multiple languages.The authors provide a ‘solution’ repo with refactored code to compare to. The issue I had with this exercise is that other than going to look at the solution there isn’t a real way to tell when you’re done.  Even then, some of the ways they factored their solution are opinion based. While seeing that opinion is interesting they don’t really explain the why of their decisions.

The last tutorial I tried was the Functional Programming in Scala exercises. It’s from the same people as the Cats tutorial above and is based on the exercises in the book Functional Programming in Scala. I managed to get about halfway through it without having read the book. While there is some prose in between exercises, it doesn’t adequately explain all of the concepts. While I’m reading the book I will come back to this and do the rest of the exercises.

Overall I would strongly recommend the Learn FP repo, and recommend the Cats tutorial. I would pass on Functional Structures Refactoring Kata. I’ll hold judgment on Functional Programming in Scala until I can try it with  the book. While these were largely good starts, I still haven’t had that conceptual breakthrough I’m looking for on how to use all of these pieces in practice.

HTTP/2 Multiplexing and Bundling Frontend Assets

This post is the result of two somewhat related ideas coming to mind together to form  a question. How will the adoption of HTTP/2 with support for multiplexing impact the practice of resource bundling for front end assets?

The first idea is HTTP/2 enables multiplexing, which is multiple HTTP requests sharing a single TCP connection. This removes the TCP connection overhead of completing the handshake. This also works around some of the slow start issues with TCP connections so your connection needs to scale up once instead of for every subsequent HTTP request.

The second idea is bundling frontend assets, which is taking multiple smaller javascript or css files and packaging them into one big file of that type. Working with the code base you want to have multiple smaller scripts to make it easier to organize and understand the code. However, the network didn’t work well with moving lots of tiny files around, due to the issues discussed above. This resulted in the bundling practice of combining the assets into larger files before serving them to end users. But, this fights with HTTP caching semantics since these large files change more often, which invalidates caches more often.

So the two ideas clearly impact each other, but since HTTP/2 adoption is still early the outcomes aren’t necessarily sorted out yet. Some brief internet research yields two different opinions on possible outcomes. The first is a suggestion to avoid bundling, except where it makes sense from compression perspective. The second is to do semantic bundling, since there is some overhead to each resource but much less than before. Essentially, you bundle resources that likely to change together to maximize caching.

So there isn’t a strict best practice, and since adoption of HTTP/2 is still in the 20-25% range, we’ve got a while to go before changing your asset delivery plan. If you’re reworking your frontend resource pipeline it might be an interesting consideration anyway, given that you will want to significantly change your bundling strategy in the 2020 timeframe. This little bit of knowledge doesn’t seem to have much practical purpose today, but it will come up in the future.HTTP/2 Multiplexing and Bundling Frontend Assets