This article discusses my opinions on moderate-latency and low-latency applications built on the JVM using JDK 22 and greater. It is based on many years of trying various approaches. For years the usual way to build exceptionally low latency (or more correctly, low GC pause) applications was to avoid memory allocation in the critical parts of the application, and to ensure that any libraries used also minimise memory allocation. We know how long algorithms take to execute, and we can be fairly confident about the time taken to execute code, but we cannot be anywhere near as certain about garbage collection.
What do I mean by the uncertainty of garbage collection? Let’s ignore custom garbage collectors and focus on mainstream OpenJDK only, and by that I mean G1. Over the years garbage collection has improved to the point of being exceptionally impressive: an application can run with a heap several gigabytes in size, and even the user of a UI application would generally be hard pressed to notice any significant delays. However, we’re not talking about everyday applications here, we’re talking about high throughput with latencies in the low single-digit milliseconds or even better. At that level, even the smallest GC pause could make us miss our target. I’m not going to go into the details of how GC works here, but there are both concurrent phases and pause phases (think all application threads potentially stopped). So every GC comes with the possibility of latency, albeit small.
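Rather than guessing, it is worth measuring how much pause time your application actually incurs. The following is a minimal sketch using the JVM's unified logging; the jar name `my-app.jar` and the choice of decorators are just illustrative.

```shell
# Run with G1 (the default collector) and log all GC activity, including
# pause times, to gc.log. The -Xlog syntax works on JDK 9 and later.
java -XX:+UseG1GC -Xlog:gc*:file=gc.log:time,uptime,level,tags -jar my-app.jar
```

Scanning the resulting log for pause times under realistic load gives a much better picture than any theoretical argument about what the collector should be doing.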
Even if we wrote the same low latency code in C++, it would not be an immediate panacea where all memory performance problems suddenly disappear. In fact, quite the opposite: a C++ application or library that continuously allocated and freed heap memory would run into its own issues with the heap becoming fragmented, possibly even more quickly than the Java heap (as the GC also defragments quite frequently). On top of that, heap allocation and deallocation may not be any faster. So even writing in C++ we’d need to apply similar rules to Java, but one immediate advantage for me is that we can directly map memory onto struct objects; we’ll come to this later.
I’ll break the applications up into two groups: moderate latency, where infrequent 20-30 millisecond pauses are acceptable, and extremely low latency, where all pauses need to be avoided. If you are fortunate enough that your application does not need these tough-to-meet latency targets, reading some of the below could still be useful to understand the effects of object creation on GC.
For moderate latency applications we still need to be careful about object creation, to ensure instances are either freed before they leave the young generation, or otherwise held on to for the life of the application. Any that don’t meet one of those two targets will get tenured, and then later freed with far greater difficulty.
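As a small, hypothetical illustration of missing both targets: a per-message object that is parked in a buffer for "a while" before being dropped will survive several young collections, get promoted, and then can only be reclaimed by a far more expensive old-generation cycle. The class and sizes below are mine, purely for the sketch.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical illustration of the anti-pattern: per-message objects that are
// neither short-lived nor permanent, so under load they end up tenured.
public final class RecentTradeBuffer {
    record Trade(String marketId, long priceInHundreds, long quantity) { }

    private static final int MAX_RETAINED = 100_000;
    private final Queue<Trade> recent = new ArrayDeque<>();

    public void onTrade(String marketId, long priceInHundreds, long quantity) {
        // A fresh Trade per message that then lingers in the queue: at high
        // message rates these survive young collections, get promoted, and are
        // only reclaimed later by an old-generation collection.
        recent.add(new Trade(marketId, priceInHundreds, quantity));
        if (recent.size() > MAX_RETAINED) {
            recent.poll();
        }
    }
}
```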
I’ve written the ultimate low latency applications in C++ for embedded devices, and I can tell you the code generally uses a subset of C++, with little to no dynamic memory allocation beyond initial setup. I’ve written a framework for embedded applications that’s been used in thousands of projects, so I’ve seen many different examples. Believe me, whichever way you go, it’s not plain sailing.
Most high performance applications I’ve seen (maybe due to my area of work) are taking in a message, processing it, and producing some output. Maybe there’s a data graph in the middle, maybe a state machine, but nearly always the above holds true. How do we try to avoid GC while doing this?
As messages come in, the raw bytes could be stored within a Java class; most data would be kept opaque in its raw form, and any objects created to represent fields would be created only once and cached in a map; these would generally represent, say, an identifier or similar. Then, accessors could break out parts of the data as needed. Things like conversion to String would be avoided in the “hot” sections, and only used by diagnostic utilities. I’m not saying this covers all cases, but if exceptionally low latency is required, with truly minimal GC collections, it is a common way of achieving it. We’re effectively trading off a bit of known CPU time parsing the byte array against the unknowns of garbage collection.
Let’s take a look at how we could do that in Java. Let’s assume we have a supplier that provides only one type of input, in the following format:
| Offset (bytes) | Field                              |
|----------------|------------------------------------|
| 0..19          | Market Identifier, zero terminated |
| 20..23         | Price in hundreds                  |
| 24..27         | Quantity                           |
| 28..31         | Direction (buy = 1, sell = 2)      |
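Below is a minimal sketch of a flyweight over that 32-byte format, using a ByteBuffer. The class name, the little-endian assumption, and the use of a plain HashMap keyed on a hash of the identifier bytes are all mine, purely for illustration; a production version would use a primitive-keyed map to avoid boxing and would verify the bytes on a hash match.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// A sketch of a flyweight over the 32-byte message described above. It is
// pointed at the raw bytes with wrap() and only decodes fields on demand.
public final class MarketMessageFlyweight {
    private static final int MARKET_ID_OFFSET = 0;
    private static final int MARKET_ID_LENGTH = 20;
    private static final int PRICE_OFFSET = 20;
    private static final int QUANTITY_OFFSET = 24;
    private static final int DIRECTION_OFFSET = 28;
    public static final int MESSAGE_LENGTH = 32;

    // Each distinct market identifier is decoded to a String at most once and
    // then cached, so the hot path does not create new Strings per message.
    private final Map<Integer, String> marketIdCache = new HashMap<>();

    private ByteBuffer buffer;

    // Point the flyweight at the next message; no per-message allocation.
    public void wrap(ByteBuffer incoming) {
        buffer = incoming.order(ByteOrder.LITTLE_ENDIAN); // byte order is an assumption
    }

    public int priceInHundreds() { return buffer.getInt(PRICE_OFFSET); }

    public int quantity() { return buffer.getInt(QUANTITY_OFFSET); }

    public int direction() { return buffer.getInt(DIRECTION_OFFSET); } // 1 = buy, 2 = sell

    // Returns the market identifier, decoding it only the first time a given
    // identifier is seen. Keyed on a simple hash of the bytes to keep the
    // sketch short; the boxed Integer key and the (unlikely) possibility of
    // collisions are trade-offs a real implementation would deal with.
    public String marketId() {
        int hash = 1;
        int length = 0;
        while (length < MARKET_ID_LENGTH && buffer.get(MARKET_ID_OFFSET + length) != 0) {
            hash = 31 * hash + buffer.get(MARKET_ID_OFFSET + length);
            length++;
        }
        final int idLength = length;
        return marketIdCache.computeIfAbsent(hash, h -> {
            byte[] tmp = new byte[idLength];
            buffer.get(MARKET_ID_OFFSET, tmp); // absolute bulk get, JDK 13+
            return new String(tmp, StandardCharsets.US_ASCII);
        });
    }
}
```

The trade-off is exactly the one described above: a little predictable CPU time reading primitives out of the buffer, in exchange for almost no allocation on the hot path.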
Low latency and embedded C++ applications follow many of the same rules as Java, but for embedded, things are even more extreme. Generally speaking, in the embedded domain the use of dynamic allocation outside of setup is heavily discouraged. If you’ve got a mission-critical application on a board with maybe between 32K and 512K of RAM, then you simply cannot rely on the heap not becoming fragmented.
What does this mean? If you look at the libraries I’ve written, TaskManagerIO, SimpleCollections, IoAbstraction, TcMenu, TcUnicodeHelper, they try to allocate what’s needed up front, generally using dynamic allocation only during setup. If a data structure needs to grow in size, it will be allocated at runtime, but the memory will not be deallocated. TaskManagerIO allocates memory in tranches and never deallocates any of it. TcMenu has a code generator that generates all menu structures globally, and in fact on Harvard architecture boards (i.e. program memory is separated from RAM) it also stores the constant data in FLASH.
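To illustrate the grow-only "tranche" idea, here is a minimal sketch written in Java for consistency with the rest of this article; it is not taken from any of the libraries above, and the Task class and tranche size are hypothetical. When the pool is exhausted it allocates another fixed-size block, and nothing is ever handed back to the allocator.

```java
import java.util.ArrayDeque;

// A grow-only pool: memory is acquired in tranches and never released.
public final class TranchePool {
    public static final class Task {
        Runnable work;
        long scheduledAtNanos;
    }

    private static final int TRANCHE_SIZE = 32;
    private final ArrayDeque<Task> free = new ArrayDeque<>();

    public Task acquire() {
        if (free.isEmpty()) {
            // Grow by a whole tranche at once; this is the only place that
            // allocates, and the objects stay with the pool forever.
            for (int i = 0; i < TRANCHE_SIZE; i++) {
                free.push(new Task());
            }
        }
        return free.pop();
    }

    public void release(Task task) {
        task.work = null; // drop references so the pool does not pin other objects
        free.push(task);
    }
}
```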
I spent a lot of time optimizing the library stack, so it can run, although somewhat limited in scope, on a board with 32K of FLASH and 2K of RAM. It has to be responsive. On AVR boards, for example, even the cost of using virtual functions has to be carefully weighed.