Canton Network: Critical K8s/JVM Memory Bug Alert

PSA

A bug currently affecting Ubuntu-based k8s deployments in general may also impact Splice deployments that use Helm. (Docker Compose-based deployments are not affected!)

In certain circumstances the JVM no longer takes the container limits into account when determining Java memory limits, which can cause pods to restart when they exceed their k8s memory limits.

More details: The 6.14 Linux kernel removed support for v1 cgroups (https://bugs.launchpad.net/ubuntu/+source/linux-hwe-6.14/+bug/2122368). This removal was reverted, i.e., fixed, in 6.14.0-36. The Java version that has support for v2 cgroups has not been released yet (https://bugs.openjdk.org/browse/JDK-8347811). As a result, container support in Java is broken on all 6.14 versions before 6.14.0-36. Options such as -XX:MaxRAMPercentage=75 -XX:InitialRAMPercentage=75 (the current default in the Splice Helm charts) will not work as intended: the JVM applies a limit based on the resources available on the host system, not on the container.
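To check whether a node falls into the kernel range described above, something like the following sketch may help. The function name kernel_in_affected_range is hypothetical, and it assumes release strings of the form 6.14.0-N-flavor as printed by uname -r on Ubuntu; kernels with a different build numbering scheme are not covered and need to be checked by hand. The kubectl lines in the comments assume that java is on the PATH inside the container and use a placeholder pod name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) if a kernel release string is
# 6.14.0-N with N below 36, the range described in this announcement.
kernel_in_affected_range() {
  local release="$1" build
  case "$release" in
    6.14.0-*) ;;                 # only 6.14.0-N releases are considered
    *) return 1 ;;
  esac
  build="${release#6.14.0-}"     # e.g. "35-generic"
  build="${build%%-*}"           # e.g. "35"
  case "$build" in
    ''|*[!0-9]*) return 1 ;;     # no plain build number, cannot judge
  esac
  [ "$build" -lt 36 ]
}

# Usage on a node:
#   kernel_in_affected_range "$(uname -r)" && echo "kernel is in the affected range"
#
# Inside a pod (placeholder name), compare the heap the JVM picks with the
# container limit; a value close to 75% of the host memory suggests the bug:
#   kubectl exec <participant-pod> -- java -XX:MaxRAMPercentage=75 -XX:+PrintFlagsFinal -version | grep MaxHeapSize
#   kubectl exec <participant-pod> -- java -Xlog:os+container=trace -version
```

The kernel check only narrows things down; the JVM output inside the pod is the more direct signal, since it shows which memory limit the JVM actually detected.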

If you're using a managed k8s service such as GKE or EKS, you will most likely not be impacted, as these currently run on older kernel versions. For example, for GCP see: https://docs.cloud.google.com/kubernetes-engine/docs/how-to/migrate-cgroupv2#transition-plan

Workaround (if you suspect that you might be affected): You can override the default JVM options, replacing the relative memory limits with absolute values. This involves the following steps:
1: In the Helm values YAML for your participant deployment, add defaultJvmOptions: -XX:+UseG1GC -Xms24g -Xmx24g -Dscala.concurrent.context.numThreads=8 -XX:ActiveProcessorCount=8 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/persistent-data
* If you changed the default memory limits for this Helm deployment previously, please adapt -Xms24g -Xmx24g to represent 75% of the memory limits of the container.
2: In the Helm values YAML for your validator deployment, add defaultJvmOptions: -XX:+UseG1GC -Xms6g -Xmx8g -Dscala.concurrent.context.numThreads=8 -XX:ActiveProcessorCount=8 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/persistent-data
* If you changed the default memory limits for this Helm deployment previously, please adapt -Xms6g -Xmx8g to represent 75% of the memory limits of the container.
3: Run helm upgrade to apply your changes.
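
The steps above can be sketched as a small script. This is a rough illustration, not the official procedure: the helper names, the 32 GiB limit, the override file name, and the release and chart placeholders are all assumptions, and it assumes defaultJvmOptions sits at the top level of the values file, as the steps suggest.

```shell
#!/usr/bin/env bash
# Hypothetical helper: 75% of a container memory limit given in whole GiB,
# rounded down.
heap_gib_for_limit() {
  echo $(( $1 * 75 / 100 ))
}

# Hypothetical helper: print a values fragment with absolute heap sizes,
# reusing the remaining JVM options from the steps above.
jvm_values_fragment() {
  local xms="$1" xmx="$2"
  printf 'defaultJvmOptions: "-XX:+UseG1GC -Xms%sg -Xmx%sg -Dscala.concurrent.context.numThreads=8 -XX:ActiveProcessorCount=8 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/persistent-data"\n' "$xms" "$xmx"
}

# Example for a participant with an assumed container limit of 32 GiB:
#   heap="$(heap_gib_for_limit 32)"     # 24
#   jvm_values_fragment "$heap" "$heap" > participant-jvm-override.yaml
#   helm upgrade <release> <chart> -f <your-values.yaml> -f participant-jvm-override.yaml
```

Keeping the override in its own file and passing it as the last -f makes it easy to drop again once the nodes run a fixed kernel, since Helm lets later values files take precedence over earlier ones.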

Avoiding this issue: We recommend against running the latest kernel version in production. Ideally, upgrades of the host system / k8s cluster should follow the same staged progression as Splice upgrades (i.e., first upgrade your DevNet cluster, then TestNet, then MainNet).

@Dev Announcements @Canton Builder
