I recently ran into an interesting issue with an application running in a container. It would fire off a bunch of parallel web requests (~50) and sometimes would get but not process the results in a timely manner. This was despite the application performance monitoring we were using saying the CPU usage during the request stayed very low. After a ton of investigation, I found out a few very important facts that contradicted some assumptions I had made about how containers and the JVM interact.
- We had been running the containers in marathon with a very low CPU allocation (0.5) since they didn’t regularly do much computation. This isn’t a hard cap on resource usage of the container. Instead it is used by Mesos to decide which physical host should run the container and it influences the scheduler of the host machine. More information available on this in this blog post.
- The number of processors the runtime reports is the number of processors the host node has. It doesn’t have anything to do with a CPU allocation made to the container. This impacts all sorts of under the hood optimizations the runtime makes including thread pool sizes and JIT resources allocated. Check out this presentation for more information on this topic.
- Mesos can be configured with different isolation modes that control how the system behaves when containers begin to contest for resources. In my case this was configured to let me pull against future CPU allocation up to a certain point.
This all resulted in the service firing off all of the web requests on independent threads which burned through the CPU allocation for the current time period and the next. So then the results came back and weren’t processed. Immediately we changed the code to only fire off a maximum number of requests at a time. In the longer term we’re going to change how we are defining the number of threads but since that has a larger impact it got deferred until later when we could measure the impact more carefully.
I’ve been working on getting a new session infrastructure set up for the web application I’m working on. We ended up going with a stateful session stored in mongoDB along with some endpoints to query the session with. This design has a couple of nice aspects – all of the logic about if a session is active or not can live inside of one specific service and a session can be terminated if needed.
Building a session infrastructure is a fairly common activity, but we are building a session that is used by multiple services and we can’t roll all of them out simultaneously. So we’ve been building a setup that can process both the new session and the old session as a way to have a backwards compatible intermediate step. This is creating some interesting flow issues. There are two specific issues I wanted to discuss: (1) how to maintain the activity of the new session during backwards compatibility and (2) processing of identity federation.
Maintaining the new session from applications that have not been updated is nuanced. Since, by rule, you haven’t made changes to the applications that haven’t been updated, you can’t add any calls. In our case there was a call made from all of the application frontends to a specific service, so we are using that to piggyback keeping the session alive. We looked into a couple of other options but didn’t find anything easy. We considered rolling out a heartbeat to the various application frontends but that would require an extra round of updates, testing, and deploys for code that was likely to all be ripped out when we were done.
The federation flow is extra complex because a federation in the new scheme is not that different from some of the session passing semantics under the old session scheme. This ends up mixing together the case where there is just a federation occurring and the case where the new session has timed out and the old session is still valid. This creates an awkward compromise; you’d like to be able to say that if the new session has timed out the entire session is expired, but if you can’t tell the difference between the two cases that’s not possible. This means that the new session expiration can’t be any longer than the old session expiration. But it also solved the problem with maintaining the new session while in applications that haven’t been updated yet.
The one problem solved the other problem which was a nice little win.
I previously mentioned the new service we were spinning up and the discussion of the overhead therein. Having finished getting the initial version of the service out into production, I feel like I have some answers now. The overhead wasn’t that bad, but could have been lower.
The repo was easy as expected. The tool for setting up the CI jobs was quite helpful, although we didn’t know about a lot of the configuration options available to us. We initially configured with the options we were familiar with, but found ourselves going back to make a couple of tweaks to the initial configuration. The code generators worked out great and saved a ton of time to get started.
The environment configuration didn’t work out as well as expected. The idea was that the new service would pick up defaults for essentially all of its needed configuration, which would reduce the time we would need to spend figuring it out ourselves. This worked out reasonably well in the development environment. In the integration environment we ran into some problems because the default configuration was missing some required elements. This resulted in us not having any port mappings set up so nothing could talk to our container. We burned a couple of hours on sorting out this problem. But when we went to the preproduction environment we again found its port mapping settings were different from the lower environments and needed to be setup differently. Here we ended up burning even more time since the service isn’t exposed externally and we needed to figure out how to troubleshoot the problem differently.
In the end I still think spinning up the new services on this short timeframe was the right thing to do – we would have had to learn this stuff eventually when building a new service. Doing it all on the tight timeline was unfortunate but the idea of getting the services factored right is the best thing.
At work we started spinning up a new service and concerns were expressed by some interested parties about the overhead required to get a new service into production. Among the specific concerns that were articulated: it needed to have a repo made, CI builds set up, be configured for the various environments, be creating databases in those environments, etc. Having not deployed a new service at this job I’m not sure if the concerns are overblown or not.
We’ve got a platform set up to help speed the creation of new microservices. The platform can help spin up CI jobs and simplify environment configuration. Creating the repo should be a couple of clicks to create it and assign the right permissions then set up the hook for the CI process. I’m optimistic that this process should make it easy, but only two people on the team were part of spinning up a new service the last time the team did it, and neither of them was involved in much of dealing with this infrastructure.
The project all of this is for needs to be put together and running in production in the next month, so the overhead can’t take up much of the actual schedule. The first version of the service isn’t complicated at all and can probably be cranked out in a week or so of heads-down work. It needs a little bit of front-end work but has been descoped as far as possible to make the timelines easy. We’ve got a couple of code generators to help bootstrap a lot of the infrastructure of the service; we’ve even got a custom yeoman generator to help bootstrap the UI code.
I’m curious if the concerns were memories of spinning up new services in the world that predates a lot of the platform infrastructure we have today or if it’s a reasonable understanding in the complexity of doing all of this. But it raises the question of how much overhead you should have for spinning up a new service. As little as possible seems reasonable, but the amount of effort to automate this setup relative to how often you create new services makes that not as reasonable as it first appears. I think it needs to be easy enough that you create a new service when it makes logical sense in the architecture and the overhead of doing so isn’t considered in the equation.