When running Spring Boot on Netty (we ran 4.1.69) with asynchronous code, your server might run into memory issues after running for a while. In our case we were using Spring Cloud Gateway (we ran 3.0.5), and after two weeks of uptime we started hitting long garbage collection times. To get around this we made several code fixes; one of them was disabling Netty's thread-local caches for non-Netty threads.
To do this, add the following JVM option:
```properties
-Dio.netty.allocator.useCacheForAllThreads=false
```
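If you cannot change the JVM command line, a possible alternative is to set `io.netty.allocator.useCacheForAllThreads` from code. The sketch below assumes that Netty 4.1 reads this system property only once, when `PooledByteBufAllocator` is first initialized, so the call has to happen at the very start of `main`, before anything touches Netty. The class `NettyAllocatorFlags` and its helper methods are hypothetical names for illustration; passing the JVM option remains the simpler route.

```java
// Hypothetical helper class; names are illustrative, not part of Netty or Spring.
public final class NettyAllocatorFlags {

    // System property consulted by Netty's PooledByteBufAllocator (Netty 4.1.x).
    public static final String USE_CACHE_FOR_ALL_THREADS =
            "io.netty.allocator.useCacheForAllThreads";

    private NettyAllocatorFlags() {
    }

    // Same effect as -Dio.netty.allocator.useCacheForAllThreads=false, but only if
    // this runs before PooledByteBufAllocator's static initializer has read the value.
    public static void disableCacheForNonNettyThreads() {
        System.setProperty(USE_CACHE_FOR_ALL_THREADS, "false");
    }

    // Simplified read-back: an absent property counts as true (the Netty 4.1 default).
    // Netty's own parser is more lenient (it also accepts values such as "yes"/"no").
    public static boolean useCacheForAllThreads() {
        String value = System.getProperty(USE_CACHE_FOR_ALL_THREADS);
        return value == null || Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] args) {
        disableCacheForNonNettyThreads();
        // Only now start the application, e.g.
        // SpringApplication.run(GatewayApplication.class, args); // GatewayApplication is hypothetical
        System.out.println(USE_CACHE_FOR_ALL_THREADS + "=" + useCacheForAllThreads());
    }
}
```

If a static initializer, agent, or library loads Netty before this line runs, the allocator has already captured its default and the property would have no effect, which is why the JVM option is the more reliable choice.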
For an explanation, see: https://github.com/spring-projects/spring-framework/issues/21174
The PooledByteBufAllocator from Netty creates ThreadLocal caches even for non-Netty Threads. These caches quickly move to Old Gen and do not get collected during normal G1 collections.
Why is this a problem?
If any operation executed with subscribeOn(Schedulers.elastic()) causes Netty ByteBuf allocations (say, using a TcpClient or returning WebFlux responses), Netty sets up a new ThreadLocal cache. Whilst the threads from Schedulers.elastic() eventually get reclaimed, the associated cache does not, at least not during normal G1 collections.
Running the application with -Dio.netty.allocator.useCacheForAllThreads=false fixes the above issue by using ThreadLocal caches only in Netty threads. Maybe this should be the default since, whilst it hurts performance a little with the above usage, it fixes the potential leak.
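The mechanism described above can be illustrated with a small, self-contained model. This is not Netty code: `ThreadCacheModel` and `NettyLikeThread` are hypothetical names, the "cache" is only a counter, and the rule (create a per-thread cache for every thread when the flag is true, otherwise only for Netty's own threads) is a simplification of what `PooledByteBufAllocator` does in Netty 4.1. It shows how many short-lived worker threads, such as those from `Schedulers.elastic()`, multiply the number of caches when the flag is left at its default.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified, hypothetical model of the per-thread cache decision; not Netty code.
public final class ThreadCacheModel {

    // Stand-in for Netty's own event-loop thread type (e.g. FastThreadLocalThread).
    static final class NettyLikeThread extends Thread {
        NettyLikeThread(Runnable task) {
            super(task);
        }
    }

    private final boolean useCacheForAllThreads;
    private final AtomicInteger cachesCreated = new AtomicInteger();

    // Lazily creates one "cache" per thread on first allocation, like a ThreadLocal-based cache.
    private final ThreadLocal<Boolean> threadCache =
            ThreadLocal.withInitial(this::createCacheForCurrentThread);

    ThreadCacheModel(boolean useCacheForAllThreads) {
        this.useCacheForAllThreads = useCacheForAllThreads;
    }

    private Boolean createCacheForCurrentThread() {
        // Cache for every thread if the flag is true, otherwise only for Netty-like threads.
        boolean cache = useCacheForAllThreads
                || Thread.currentThread() instanceof NettyLikeThread;
        if (cache) {
            cachesCreated.incrementAndGet();
        }
        return cache;
    }

    // Simulated buffer allocation; returns whether a thread cache backs it.
    boolean allocate() {
        return threadCache.get();
    }

    private static void runAndJoin(Thread thread) {
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for " + thread.getName(), e);
        }
    }

    // One allocation on each of `workerThreads` short-lived plain threads, plus one Netty-like thread.
    public static int simulate(boolean useCacheForAllThreads, int workerThreads) {
        ThreadCacheModel model = new ThreadCacheModel(useCacheForAllThreads);
        for (int i = 0; i < workerThreads; i++) {
            runAndJoin(new Thread(model::allocate));
        }
        runAndJoin(new NettyLikeThread(model::allocate));
        return model.cachesCreated.get();
    }

    public static void main(String[] args) {
        System.out.println("useCacheForAllThreads=true  -> caches created: " + simulate(true, 10));
        System.out.println("useCacheForAllThreads=false -> caches created: " + simulate(false, 10));
    }
}
```

With ten short-lived workers plus one Netty-like thread, the flag set to `true` creates eleven caches, while `false` creates only one. In the real allocator each of those caches holds pooled memory that, per the issue above, may outlive its thread.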
In general, if you want to investigate Netty memory leak issues, you can use the following JVM options:
```properties
-Dio.netty.leakDetection.level=paranoid
-Dio.netty.leakDetection.targetRecords=40
```
If that does not give you enough information, try enabling DEBUG logging for the package reactor.netty.channel.
See also https://netty.io/wiki/reference-counted-objects.html for more information.
Note: after changing the cache flag we ran some performance tests, and the difference was not noticeable.