I recently ran into an interesting issue with an application running in a container. It would fire off a bunch of parallel web requests (~50) and would sometimes receive the results but not process them in a timely manner, even though our application performance monitoring showed CPU usage staying very low during the request. After a lot of investigation, I found a few important facts that contradicted assumptions I had made about how containers and the JVM interact.
- We had been running the containers in Marathon with a very low CPU allocation (0.5) since they didn’t regularly do much computation. This isn’t a hard cap on the container’s resource usage. Instead, Mesos uses it to decide which physical host should run the container, and it influences the scheduler of the host machine. More information is available on this in this blog post.
- The number of processors the runtime reports is the number of processors the host node has. It has nothing to do with the CPU allocation made to the container. This impacts all sorts of under-the-hood optimizations the runtime makes, including default thread pool sizes and the resources allocated to JIT compilation. Check out this presentation for more information on this topic.
- Mesos can be configured with different isolation modes that control how the system behaves when containers begin to contend for resources. In my case it was configured to let a container borrow against its future CPU allocation up to a certain point.
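The processor-count point above is easy to verify yourself. A minimal sketch (the class name is mine; on JVMs before the container-awareness changes, this returns the host node's core count regardless of any CPU allocation made to the container):

```java
public class CpuProbe {
    public static void main(String[] args) {
        // On older JVMs this reports the host's core count, not the
        // container's CPU allocation, so anything sized from it (common
        // thread pools, JIT compiler threads) can be far larger than
        // the container's allocation would suggest.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("availableProcessors: " + cores);
    }
}
```

Running this inside a container with a 0.5-CPU allocation on a many-core host makes the mismatch obvious.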
This all resulted in the service firing off all of the web requests on independent threads, which burned through the CPU allocation for both the current time period and the next. When the results came back, there was no allocation left to process them. As an immediate fix, we changed the code to cap the number of requests in flight at any one time. In the longer term we plan to change how we define the number of threads, but since that has a larger impact, we deferred it until we could measure the effect more carefully.
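The immediate fix can be sketched with a fixed-size executor, which bounds how many requests run concurrently instead of spawning one thread per request. The `fetch` method, the pool size of 8, and the request count of 50 are illustrative stand-ins, not the real service code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BoundedRequests {
    // Placeholder for the real web request.
    static String fetch(int id) {
        return "result-" + id;
    }

    public static void main(String[] args) throws Exception {
        // At most 8 requests run at once; the remaining 42 queue up
        // instead of all competing for the CPU allocation simultaneously.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<String>> futures = new ArrayList<>();
        for (int i = 0; i < 50; i++) {
            final int id = i;
            futures.add(pool.submit(() -> fetch(id)));
        }
        for (Future<String> f : futures) {
            System.out.println(f.get()); // process each result as it completes
        }
        pool.shutdown();
    }
}
```

Sizing the pool from the container's CPU allocation rather than `availableProcessors()` is the longer-term change we deferred.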