Book Chat: Perspectives on Data Science for Software Engineering

Perspectives on Data Science for Software Engineering is a collection of short research papers on using the tools provided by data science to do research into software engineering. It isn’t about the concepts of data science for software engineers, as I thought it would be when I initially picked it up. That mismatch made me put it down on my first attempt, but when I came back around to it I found myself interested not in the data science aspect of it, but the software engineering research aspect.

While none of the individual papers was something I read and immediately knew how I could apply in my own practice, the overall package helped me feel positive about progress in software engineering. Outside of language design, it sometimes feels like most of the software engineering learning we’ve done going as far back as the 70’s and 80’s hasn’t been applied in practice. I think part of the reason is that the research is disconnected from the way software is built in the wild. The research is either hyper-specific (e.g., focusing on a particular kind of software in a single language) or defines problems but not solutions (e.g., the work on code quality metrics). The research isn’t wrong, but it’s missing a step about how to apply the work to what you’re doing.

The only piece in here that felt immediately connected to what I was doing was the one on bug clustering. It showed that the more bugs a file has had, the more likely it is to have more bugs in future iterations. This seems like it may lend some credence to the idea of rewriting a piece of code that has quality problems, effectively wiping the slate clean and starting over.

Overall the book was intellectually stimulating but has no real practical usage for what I do or what I feel would be the average software developer. If your role straddles the practical and academic worlds then this may have more value to you.



Recently I’ve been working on rolling out a Vault implementation at work and migrating all of our existing secrets over to it. Vault is a tool designed to secure secret data and control access to it. It also offers a variety of ways to handle dynamic secrets for things like database credentials. The dynamic database credentials are an interesting security feature; any particular set of database credentials can be shut off at any point if compromised, and they are effectively rotated each time a new instance starts up. It can also act as a certificate authority. This is all built on top of a configurable set of backends and HA clustering setups.

One of the most interesting things is the unsealing process. The system starts sealed, where all of the secrets are inaccessible. The unseal process requires a configurable threshold of key fragments (three of five by default) to be provided to unseal the vault. This is an implementation of Shamir’s Secret Sharing, which is a cool concept. In the enterprise version, it also provides an auto-unsealing mechanism built on top of AWS Key Management Service.
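
To make the idea concrete, here is a minimal sketch of Shamir’s Secret Sharing in Python. This is not Vault’s actual implementation; the function names, the choice of prime, and treating the secret as a single integer are all my own assumptions for illustration. The secret is hidden as the constant term of a random polynomial, each key fragment is a point on that polynomial, and any threshold-sized set of points is enough to interpolate the polynomial back at zero.

```python
import secrets

# Assumption for this sketch: a Mersenne prime larger than any secret we split.
PRIME = 2**127 - 1


def split_secret(secret, shares, threshold):
    """Split an integer secret into `shares` points; any `threshold` recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in the range [0, PRIME)")
    if not 1 <= threshold <= shares:
        raise ValueError("threshold must be between 1 and the number of shares")

    # Random polynomial of degree threshold - 1 with the secret as constant term.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x):
        # Horner's rule, reducing modulo PRIME at every step.
        result = 0
        for coeff in reversed(coeffs):
            result = (result * x + coeff) % PRIME
        return result

    # Each fragment is the point (x, f(x)); x = 0 is never handed out.
    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover_secret(points):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        numerator, denominator = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                numerator = (numerator * -xj) % PRIME
                denominator = (denominator * (xi - xj)) % PRIME
        # pow(d, -1, p) is the modular inverse (Python 3.8+).
        term = yi * numerator * pow(denominator, -1, PRIME)
        secret = (secret + term) % PRIME
    return secret


fragments = split_secret(123456789, shares=5, threshold=3)
# Any three of the five fragments are enough to get the secret back.
recovered = recover_secret([fragments[0], fragments[2], fragments[4]])
```

With fewer fragments than the threshold, the interpolation still returns a number, but it tells you nothing about the real secret, which is what makes the scheme work for unsealing.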

The REST API is pretty good and most major languages have a third-party client available already. The third-party clients have different levels of compatibility with all of the features of the system; since it is a plugin-based system they don’t necessarily support everything. Sadly, the UI also doesn’t support all of the features, which makes doing some basic testing of how the system works more painful.

Vault seems like a very good tool chest for dealing with secrets, but I would like it to be more opinionated about how to do this. I can build my own system on top of it but would like to have integrated support for creating a key of some type and storing it securely. Similarly, its scheme to provide transit encryption requires a lot of work on my side if I wanted to use it. Despite these areas for improvement I’m still excited to get it integrated into our systems.

Where are all the consultants?

I spend a lot of time engaging with programming related content, online and in person. Most of this content is created by people who describe themselves as “consultants” of some variety. Up until recently I had never worked with any consultants of this variety anywhere. I had wondered, where are all the consultants? Recently at work the floodgates opened and a huge wave of consultants appeared to help a couple of teams hit their objectives. I’m talking 40-50 consultants against an existing total engineering team of ~250.

Watching this from the outside was interesting since it seemed like our people spent a lot of time trying to get the consultants up to speed on everything that was going on. This was exacerbated by the consultants not getting access to everything a normal employee would, most notably the wiki; that meant that large quantities of the documentation that would normally just be linked for a new employee had to be exported, and therefore the consultants couldn’t easily contribute back to it either. There were also timezone issues since many of the consultants were in eastern Europe, which left them a limited window to interact with anyone on the US east coast and no reasonable time to interact with those on the US west coast. The remote-only contractor presence was interesting given our unwillingness to start full-time employees as remote. Overall the teams that picked up the consultants seemed to be able to eventually get around the obstacles and get the consultants contributing.

All of this was a matter of idle curiosity about the way the rest of the organization was run until the team I was on was slated to pick up oversight of two new consultants. Fortunately, by the time we had gotten there most of the immediate logistical problems had been solved, and the majority of the basic onboarding documentation had been extracted from the wiki and put into a Google Drive the consultants were able to see. We also had the advantage of picking up US-based consultants, so time zones weren’t an issue. Overall both consultants are very sharp, and experienced in the kinds of technologies we use. But we have them for three months to start with, so we get the whole onboarding overhead but only three months to get the return on investment that comes from it.

This raises three questions in my mind. First, when the consultants are done, how much more will we have gotten done than if we had just done the work ourselves? Second, isn’t the whole process just going to repeat itself with the next big set of deliverables for engineering? Third, is the content I’m seeing generated by these consultants their reaction to other companies that have already gotten themselves into trouble?

The answer to the first question seems like it should be net positive, at least for the consultants my team has, but I think part of that is because the kinks in the system were worked out by others who went first.

I feel like the second question is much more intriguing. It seems like the initial need for the consultants was due to a failure of organic growth in engineering. So the resources we put into finding and vetting the consultants weren’t being put into finding and vetting employees. Therefore, it seems like while we may have gotten more engineering work done in the short term, HR/management resources were spread thinner in terms of doing the long-term recruiting. Even though the consultants were doing great work, it feels like our longer-term ambitions may have been sacrificed to meet present obligations.

The third question is much broader. If the advice being poured out onto the internet and delivered in conference talks and similar is the result of consultants looking at lots of organizations that are already dysfunctional, then it’s possible that it’s biased toward bringing bad to passable versus aiming for great. It strikes me as being like trying to form a psychological theory using just a prison population because that’s who the psychologist happens to treat every day. Since having this thought I haven’t been able to see any common architectural or management mantras that are clearly thought up based on these sorts of situations.
Maybe Tolstoy was right after all: Happy families are all alike, every unhappy family is unhappy in its own way.

Book Chat: The Architecture of Open Source Applications Volume 2

The Architecture of Open Source Applications Volume 2 has writeups describing the internal structure and evolution of nearly two dozen different open source projects, ranging from tools to web servers to web services. This is different from volume one, which didn’t have any web service-like software, which is what I build day to day. It is interesting to see the differences between what I’m doing and how something like MediaWiki powers Wikipedia.

Since each section has a different author, the book doesn’t have a consistent feel to it or even a consistent organization to the sections on each application. It does, however, give some sections the space to spend a lot of time discussing the past of the project to explain how it evolved to the current situation. If looked at from the perspective of a finished product some choices don’t make sense, but the space to explore the history shows that each individual choice was a reasonable response to the challenges being engaged with at the time. The history of MediaWiki is very important to the current architecture, whereas something like SQLAlchemy (a Python ORM) has evolved more around how it adds new modules to enable different databases and their specific idiosyncrasies.

I found the lessons learned that are provided with some of the projects to be the best part of the book. They described the experience of working with codebases over the truly long term. Most codebases I work on are a couple of years old while most of these were over 10 years old as of the writing of the book, and are more than 15 years old now. Seeing an application evolve over longer time periods can truly help validate architectural decisions.

Overall I found it an interesting read, but it treads a fine line between giving you enough context on the application to understand the architecture, and giving you so much context that the majority of the section is on the “what” of the application. I felt that a lot of the chapters dealt too much with the “what” of the application. Some of the systems are also very niche, so it’s not clear how their architecture choices would apply to designing other things in the future, because nobody would really start a new application in that style. If you have an interest in any of the applications listed, check out the site and see the section there, and buy a copy to support their endeavours if you find it interesting.