July 29, 2019
Recently we started experiencing an issue with some of our micro-services. The issue would manifest as an increase in the time taken to start the service, and would appear to occur at random.
We have two sets of systems: our internal systems that we use for testing and development, and our production services. The issue occurred solely on our internal systems, and because the same code ran without problems in production, we were able to rule out the code itself as the cause and concentrate on where the services were running. We initially suspected resource starvation, so we installed Prometheus, an open source monitoring tool, to check. It showed no issues with CPU, memory or disk.
After exhausting a few other possibilities, we ended up looking through log files for any issues we could identify. Having no luck with this, we resorted to some basic debugging by taking a few thread dumps from the process using:
kill -3 2827
This wrote a thread dump of the process to the container's console output, which we could then analyse with the stack trace analyser in JetBrains IntelliJ IDEA. From this we saw that the main thread was stuck in java.security.SecureRandom.getInstance(String).
From this we could quickly tell that the issue lay in the use of Java's secure random number generator. This led us to conclude that the system's available entropy, which this generator depends on, was too low to generate the secure random numbers needed by the security libraries we were using in our services.
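If no IDE is to hand, the same question ("which threads are sitting in a given class?") can be answered with a few lines of code. The following is only a minimal sketch: the class name ThreadDumpScan, the method threadsWithFrame and the sample dump are all made up for illustration, and the parsing assumes the usual HotSpot layout of a quoted thread name followed by indented "at ..." frames.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
class ThreadDumpScan {

    /** Returns the names of threads whose stack has a frame starting with framePrefix. */
    static List<String> threadsWithFrame(String dump, String framePrefix) {
        List<String> matches = new ArrayList<>();
        String current = null;     // thread whose stack is currently being read
        boolean recorded = false;  // so each thread is listed at most once
        for (String line : dump.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("\"")) {
                // Header line such as: "main" #1 prio=5 ... runnable
                int end = trimmed.indexOf('"', 1);
                current = end > 0 ? trimmed.substring(1, end) : null;
                recorded = false;
            } else if (current != null && !recorded
                    && trimmed.startsWith("at " + framePrefix)) {
                matches.add(current);
                recorded = true;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hand-written, heavily shortened sample; NOT the dump from our incident.
        String sample = String.join("\n",
            "\"main\" #1 prio=5 os_prio=0 runnable",
            "   java.lang.Thread.State: RUNNABLE",
            "\tat java.security.SecureRandom.getInstance(SecureRandom.java)",
            "\tat com.example.Service.main(Service.java)",
            "",
            "\"worker-1\" #12 prio=5 os_prio=0 waiting on condition",
            "\tat java.lang.Object.wait(Native Method)");
        System.out.println(threadsWithFrame(sample, "java.security.SecureRandom"));
    }
}
```

Taking several dumps a few seconds apart and running each through a check like this shows whether a thread is stuck in the same place or merely passing through.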
Entropy: lack of order or predictability; gradual decline into disorder.
Whenever encryption is used in software security, a source of random data is needed. This random data is normally sourced from /dev/random or /dev/urandom. How this random data is generated varies between architectures. There are many different ways to generate this data, including hardware and software sources.
The issue we had was related to the lack of entropy on our service host systems. We started to look at ways to increase this, and found there are many suggestions out there about how to generate entropy for random number generators. This one from Cloudflare has to be my favourite: https://blog.cloudflare.com/lavarand-in-production-the-nitty-gritty-technical-details/.
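To see how a host is doing, the kernel's own entropy estimate can be compared with how long the JVM takes to obtain seed material. The sketch below assumes a Linux host, where the estimate is exposed at /proc/sys/kernel/random/entropy_avail. The class and method names are made up for illustration, and whether generateSeed actually blocks depends on the kernel version and on the JVM's securerandom.source setting, so treat the timing as an indication only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.SecureRandom;
import java.util.OptionalInt;

// Hypothetical diagnostic, not part of any library.
class EntropyProbe {
    // Linux-specific: the kernel's estimate of available entropy, in bits.
    static final Path ENTROPY_AVAIL =
        Paths.get("/proc/sys/kernel/random/entropy_avail");

    /** Parses the file content, which is a single integer followed by a newline. */
    static int parseEntropy(String raw) {
        return Integer.parseInt(raw.trim());
    }

    /** Returns the estimate, or empty if the file is missing or unreadable. */
    static OptionalInt readEntropy(Path file) {
        try {
            byte[] bytes = Files.readAllBytes(file);
            return OptionalInt.of(
                parseEntropy(new String(bytes, StandardCharsets.US_ASCII)));
        } catch (IOException | NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    /** Times a request for seed bytes; this call may block when entropy is low. */
    static long timeSeedMillis(int numBytes) {
        long start = System.nanoTime();
        new SecureRandom().generateSeed(numBytes);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        OptionalInt bits = readEntropy(ENTROPY_AVAIL);
        System.out.println(bits.isPresent()
            ? "entropy_avail: " + bits.getAsInt() + " bits"
            : "entropy_avail not readable on this system");
        System.out.println("generateSeed(16) took " + timeSeedMillis(16) + " ms");
    }
}
```

A low estimate combined with a seed request that takes seconds rather than milliseconds would be consistent with the behaviour we saw at start-up.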
As this problem was restricted to our internal test systems we decided that a simpler solution would be best. We installed the haveged software on all the systems experiencing this issue, and since then the slow-startup problem has not recurred.
From this experience we have learnt that entropy is a key requirement for running our services, and it must be considered a critical resource for ensuring a reliable service.
Since encountering this problem we have found that Prometheus can monitor a system's available entropy through the node_entropy_available_bits metric (exposed by its node exporter), and we have set it to alert us if the value drops below a given threshold.
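In Prometheus itself this kind of check would normally be written as an alerting rule. Purely to illustrate the logic, here is a sketch that looks for node_entropy_available_bits in metrics text and compares it with a threshold. The class name, the sample text and the threshold of 200 bits are assumptions made for the example, not the values we use.

```java
import java.util.OptionalDouble;

// Hypothetical check, not part of Prometheus or any client library.
class EntropyMetricCheck {
    static final String METRIC = "node_entropy_available_bits";

    /** Finds an unlabelled metric by exact name in text-format metrics output. */
    static OptionalDouble findMetric(String exposition, String name) {
        for (String line : exposition.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and HELP/TYPE comments
            }
            String[] parts = trimmed.split("\\s+");
            if (parts.length >= 2 && parts[0].equals(name)) {
                try {
                    return OptionalDouble.of(Double.parseDouble(parts[1]));
                } catch (NumberFormatException e) {
                    return OptionalDouble.empty();
                }
            }
        }
        return OptionalDouble.empty();
    }

    /** True when the metric is present and below the threshold (in bits). */
    static boolean shouldAlert(String exposition, double thresholdBits) {
        OptionalDouble value = findMetric(exposition, METRIC);
        return value.isPresent() && value.getAsDouble() < thresholdBits;
    }

    public static void main(String[] args) {
        // Hand-written sample, not real output from our hosts.
        String sample = String.join("\n",
            "# TYPE node_entropy_available_bits gauge",
            "node_entropy_available_bits 142");
        double threshold = 200; // assumed value for illustration only
        System.out.println("alert: " + shouldAlert(sample, threshold));
    }
}
```

This sketch stays silent when the metric is missing altogether; a real setup may well want to alert on that case too, since it usually means the exporter is not reporting.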
We now run the haveged service and monitor the entropy with Prometheus on all our internal infrastructure.
Experienced developer in various languages, currently a product owner of nerd.vision leading the back end architecture.