The Importance of Monitoring Entropy

Ben

September 23, 2019

Recently we started experiencing an issue with some of our micro-services. The issue would manifest as an increase in the time taken to start the service, and would appear to occur at random.

We have two sets of systems: our internal systems that we use for testing and development, and our production services. The issue occurred solely on our internal services. Because the same code ran without problems in production, we were able to rule out the code itself and concentrated on where the services were running. We initially thought the problem was related to resource starvation, and installed Prometheus, an open source monitoring tool for a host of different systems. We were able to see that there were no issues with CPU, memory or disk.

After exhausting a few other possibilities, we ended up looking through log files for any issues we could identify. Having no luck with this, we resorted to some basic debugging by taking a few thread dumps from the process using:

kill -3 2827

This sends SIGQUIT to the JVM (process 2827 in our case), which writes a thread dump to the container's console output. We could then analyse the dump with the stack trace analyser in JetBrains IntelliJ IDEA. From this we saw that the main thread was stuck in java.security.SecureRandom.getInstance(String).

From this information, we were able to quickly tell that the issue was in the usage of Java's secure random number generator. This then led us to the conclusion that the system entropy value, which is an integral part of this generator, was too low to generate the secure random numbers needed by the security libraries we were using in our services.

What is Entropy?

lack of order or predictability; gradual decline into disorder.

Whenever encryption is used in software security, a source of random data is needed. On Linux this random data is normally sourced from /dev/random or /dev/urandom. How this random data is generated varies between architectures. There are many different ways to generate it, including hardware and software sources.
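To make this a little more concrete, here is a minimal sketch of how the kernel's entropy estimate can be read on a Linux host. It assumes the standard /proc/sys/kernel/random/entropy_avail file is present. The function name read_entropy_avail is our own choice for illustration, and this is not something we ran as part of the original investigation.

```python
from pathlib import Path

# Standard location of the Linux kernel's entropy estimate, in bits.
# Assumption: the host is Linux and exposes this procfs entry.
ENTROPY_AVAIL_PATH = Path("/proc/sys/kernel/random/entropy_avail")


def read_entropy_avail(path: Path = ENTROPY_AVAIL_PATH) -> int:
    """Return the entropy estimate stored in the given file, in bits."""
    # The file contains a single integer followed by a newline.
    return int(path.read_text().strip())


if __name__ == "__main__":
    print(f"entropy available: {read_entropy_avail()} bits")
```

The path is a parameter so the function can be pointed at a test file. Treat the number as an estimate only: how it is calculated, and how much it tells you, depends on the kernel version.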

How We Solved Our Issues.

The issue we had was related to the lack of entropy on our service host systems. We started to look at ways to increase this, and found there are many suggestions out there about how to generate entropy for random number generators. This one from Cloudflare has to be my favourite: https://blog.cloudflare.com/lavarand-in-production-the-nitty-gritty-technical-details/.

As this problem was restricted to our internal test systems we decided that a simpler solution would be best. We installed the haveged software on all the systems experiencing this issue, and since then the slow-startup problem has not recurred.

Conclusion

From this experience we have learnt that entropy is a key requirement for running our services, and it must be considered a critical resource for ensuring a reliable service.

Since we encountered this problem we have found that Prometheus, through its node exporter, is able to monitor the available entropy of a system, exposed as the metric node_entropy_available_bits, and we've set it to alert us if it drops below a given threshold.
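The threshold check itself is simple, and can be sketched outside Prometheus as follows. This is only an illustration of the comparison, not our actual alerting setup: in practice the alert is defined as a Prometheus alerting rule. The function names and the threshold of 200 bits are hypothetical, chosen for the example; the input is assumed to be metrics text in the Prometheus exposition format, as served by the node exporter.

```python
from typing import Optional

METRIC_NAME = "node_entropy_available_bits"

# Hypothetical threshold for illustration; pick a value that suits your hosts.
LOW_ENTROPY_THRESHOLD_BITS = 200


def parse_entropy_bits(metrics_text: str) -> Optional[float]:
    """Find the entropy metric in exposition-format text, or return None."""
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        # Assumes the sample carries no labels: "<name> <value>".
        if len(parts) >= 2 and parts[0] == METRIC_NAME:
            return float(parts[1])
    return None


def entropy_is_low(
    metrics_text: str, threshold: float = LOW_ENTROPY_THRESHOLD_BITS
) -> bool:
    """Return True if the metric is present and below the threshold."""
    bits = parse_entropy_bits(metrics_text)
    return bits is not None and bits < threshold


if __name__ == "__main__":
    sample = (
        "# TYPE node_entropy_available_bits gauge\n"
        "node_entropy_available_bits 120\n"
    )
    print(entropy_is_low(sample))
```

A missing metric is deliberately not reported as low entropy here; whether an absent metric should itself raise an alert is a separate decision.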

We now run the haveged service and monitor the entropy with Prometheus on all our internal infrastructure.

