This is a question as much as a discussion; I can't find much detail on the state of play in this area and would really welcome any feedback or corrections. Please don't read this as a negative article, as that is not how it is intended; it's in the optimisation section, but I'm not sure what impact this really has. One thing is for sure: for 99% of systems you probably don't have to worry about what's going on here at all.
Recently I started to wonder how garbage collection events in large heaps affect cache performance. Earlier today I read an interesting article on the Mechanical Sympathy blog (Mechanical Sympathy: CPU flushing article) about the way processor caches work, and it reinforced a feeling I've had for a while: that old generation GCs on very large heaps could cause a lot of cache misses, because each object in the generation has to be marked and swept. This led to more searches, where I dug up this page on Stack Overflow (Stack Exchange: is GC cache friendly).
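To make the mark-and-sweep point concrete, here is a toy sketch of what a mark phase fundamentally is: a graph traversal that chases pointers from the roots and touches every live object. This is not how HotSpot (or any real collector) is implemented, just an illustration of why marking a large live set means visiting a lot of scattered memory:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Toy stand-in for a heap object: just holds references to other objects.
class Node {
    final List<Node> refs = new ArrayList<>();
}

public class MarkSketch {
    // Simplified mark phase: visit every object reachable from the roots.
    // Each visit touches a different cache line, so in the worst case a
    // large live set costs roughly one cache miss per object.
    static int mark(List<Node> roots) {
        Set<Node> marked = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Node> stack = new ArrayDeque<>(roots);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            if (marked.add(n)) {       // "mark bit" set on first visit
                stack.addAll(n.refs);  // pointer-chase into the children
            }
        }
        return marked.size();
    }

    public static void main(String[] args) {
        // Build a chain of 1000 reachable objects plus one unreachable one.
        Node root = new Node();
        Node cur = root;
        for (int i = 0; i < 999; i++) {
            Node next = new Node();
            cur.refs.add(next);
            cur = next;
        }
        new Node(); // unreachable: never marked, so it would be swept
        System.out.println(mark(List.of(root))); // prints 1000
    }
}
```

The key property is that the traversal order is dictated by the object graph, not by memory layout, which is exactly what makes it hard on the cache.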
Reading this, one would assume that as long as the new generation heap is not so large as to overflow the cache, there is some benefit in terms of memory locality.
However, let's look at old generation GCs. As I understand it, with the way caches are implemented on most CPUs today, memory has to be read through the cache in segments (cache lines) of a fixed size. Now let's say we have one million objects (and that may be a small number for a large-heap process) that need to be traversed in order to determine some simple state; will they all fit into the cache? Probably not. What is worse is that traversing them can evict other data from the cache, and I'm not sure how efficiently the processor handles this situation. From what I read in many older articles online, I'm beginning to assume: not that well.
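The cost of that kind of traversal is easy to demonstrate without a GC at all. The sketch below (my own micro-benchmark, with the usual caveats about JIT warm-up and naive timing) walks the same array twice: once in sequential order, which the hardware prefetcher handles well, and once along a random single cycle, which defeats the prefetcher so that once the data exceeds the last-level cache most hops are misses:

```java
import java.util.Random;

public class PointerChase {
    // Follow the "next" pointers for a given number of steps and return
    // the final index. With a random permutation each hop lands on an
    // unpredictable cache line.
    static int chase(int[] next, int steps) {
        int i = 0;
        for (int s = 0; s < steps; s++) {
            i = next[i];
        }
        return i;
    }

    public static void main(String[] args) {
        int n = 1 << 22; // ~16 MB of ints: bigger than most L2 caches

        // Sequential order: next[i] simply points at i + 1.
        int[] sequential = new int[n];
        for (int i = 0; i < n; i++) sequential[i] = (i + 1) % n;

        // Random order: Sattolo's algorithm yields a permutation that is
        // a single cycle, so the chase visits every slot once per lap.
        int[] shuffled = new int[n];
        for (int i = 0; i < n; i++) shuffled[i] = i;
        Random rnd = new Random(42);
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i);
            int t = shuffled[i];
            shuffled[i] = shuffled[j];
            shuffled[j] = t;
        }

        long t0 = System.nanoTime();
        chase(sequential, n);
        long t1 = System.nanoTime();
        chase(shuffled, n);
        long t2 = System.nanoTime();
        System.out.printf("sequential: %d ms, random: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```

On the machines I would expect this to run on, the random walk should be several times slower, even though both loops execute exactly the same number of instructions; the difference is almost entirely time stalled on memory.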
Let's then say that there is some concurrency in this process and it uses all the cores on the box for a parallel GC operation; it may well affect all the per-core caches too. One of the difficulties here is that the timeframes and the amount of concurrency involved make this very difficult to measure. In fact, most systems can only report CPU utilisation as the percentage of time not spent idle. It would be interesting to see it in terms of time stalled waiting for memory.
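Memory-stall time is invisible from Java, but the JVM does at least expose how much wall-clock time the collectors themselves have consumed, through the standard management API. A minimal example (the collector names printed will vary by JVM and GC configuration):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTime {
    public static void main(String[] args) {
        // Allocate some short-lived garbage to provoke young-gen collections.
        for (int i = 0; i < 1_000_000; i++) {
            byte[] junk = new byte[1024];
        }
        // Each MXBean covers one collector (typically one young, one old).
        for (GarbageCollectorMXBean gc :
                ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(),
                    gc.getCollectionTime());
        }
    }
}
```

This gives pause counts and accumulated GC time, but nothing about what those pauses did to the caches of the application threads, which is exactly the blind spot I'm describing.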
My first thought, from a Java perspective, was that G1 might come to the rescue. However, although G1 may not perform an old generation GC in the same way, the heap is still partitioned into old and new regions, and over time the collector is still going to have to traverse all of those objects. Maybe the compartmentalised, region-based model reduces the effect on the cache.
Of course, I am discussing the ideal case here, where the heap has at least 20% spare capacity, so the process is not constantly thrashing to find free space. If it is, all bets are off and the cost of the cache misses would be insignificant by comparison anyway.
Up to now, I've assumed you've only got one such process on a server. However, in most enterprise environments there are normally several such processes running, each with a large heap, all competing for the same shared caches.
Is all this important anyway? I don't know. I have not been able to work out what the real impact is, so this is put forward for discussion rather than as an answer.