Running a test on a beta cluster in Bell, I noticed two annoying things.
Firstly, some readings from clone instances go missing. On the graph, it looks as if the clone's CPU, memory and volume suddenly drop to zero. That is obviously not true: the statistics entry just wasn't received in time.
Secondly, the live counters seem to freeze from time to time. While it could be due to the VM configuration issue we have with those boxes, I suspected it might be related to stats collection.
I reviewed the code and immediately found a few places where:
- Stats that are expensive to collect (number of XML and JSON responses, number of currently logged-in users) were gathered and then never used.
- Stats lock was taken too early.
- Stats lock was taken although it was not required at all (this is the live counters call, BTW).
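To illustrate the two locking points, here is a minimal sketch, not the actual code. StatsStore, record_sample, publish_live and live_counters are hypothetical names. It assumes the slow part of collection can run before the lock is taken, and that live counters can be published as a whole snapshot, so reading them needs no lock. That last part relies on attribute rebinding being atomic in CPython.

```python
import threading


class StatsStore:
    """Hypothetical stats holder; all names here are illustrative."""

    def __init__(self):
        self._lock = threading.Lock()
        self._history = []   # guarded by _lock
        self._live = {}      # replaced wholesale, never mutated in place

    def record_sample(self, collect):
        # Do the expensive collection first, without holding the lock.
        sample = collect()
        # Hold the lock only for the short append.
        with self._lock:
            self._history.append(sample)

    def history(self):
        # Return a copy so callers never touch the guarded list directly.
        with self._lock:
            return list(self._history)

    def publish_live(self, counters):
        # Rebinding the attribute swaps in a complete snapshot at once.
        self._live = dict(counters)

    def live_counters(self):
        # No lock: a reader sees either the previous snapshot or the new one.
        return self._live
```

With this shape, a slow collector no longer blocks readers of the history, and the live counters call never waits on the stats lock.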
I also updated the clone stats logic so that each clone reading is placed in its proper slot in the list, rather than always in the most recent cell.
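The slot placement can be sketched roughly like this. It is an assumption that readings carry a timestamp and that the list covers fixed-length intervals; place_reading and its parameters are hypothetical names. Empty slots are kept as None here (also an assumption) so that a late or missing reading shows up as a gap rather than a drop to zero.

```python
def place_reading(slots, window_start, interval, timestamp, reading):
    """Put a reading into the slot that covers its timestamp.

    slots        -- list with one cell per interval (None means no data yet)
    window_start -- timestamp at which slots[0] begins, in seconds
    interval     -- length of one slot, in seconds
    Returns True if the reading was stored, False if it falls outside the window.
    """
    index = int((timestamp - window_start) // interval)
    if 0 <= index < len(slots):
        slots[index] = reading
        return True
    return False


# A late reading fills its own slot instead of overwriting the newest one.
slots = [None] * 4
place_reading(slots, 1000, 60, 1130, {"cpu": 40})  # lands in slots[2]
place_reading(slots, 1000, 60, 1010, {"cpu": 35})  # arrives late, lands in slots[0]
```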
In my test instance, I don't see the missing stats anymore. I need to set up a regular cloned performance test to watch for issues like that.