Canton Network: Critical K8s/JVM Memory Bug Alert


Summary

The Canton Network has issued a critical alert about a bug in Linux kernel 6.14 (versions before 6.14.0-36) that causes the JVM to ignore container memory limits in Ubuntu-based Kubernetes deployments that use Helm. Affected pods can exceed their allocated memory and be restarted. Users are advised either to apply a workaround that overrides the default JVM options with absolute memory values, or to avoid running the affected kernel versions in production.

PSA

A bug currently affecting Ubuntu-based k8s deployments in general may also impact Splice deployments that use Helm. (Docker Compose-based deployments are not affected!)

In certain circumstances, the JVM no longer takes the container limits into account when determining Java memory limits, which can cause pods to restart when they exceed their k8s memory limits.

More details: The 6.14 Linux kernel removed support for v1 cgroups (https://bugs.launchpad.net/ubuntu/+source/linux-hwe-6.14/+bug/2122368). This removal was later reverted, i.e., fixed, in 6.14.0-36. The Java version that supports v2 cgroups has not been released yet (https://bugs.openjdk.org/browse/JDK-8347811). As a result, container support for Java is broken on all 6.14 kernel versions before 6.14.0-36. Options such as -XX:MaxRAMPercentage=75 -XX:InitialRAMPercentage=75 (the current default in the Splice Helm charts) will not work as intended: the JVM applies a limit based on the resources available on the host system, not the container.
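To make the failure mode concrete, the following Python sketch compares the heap ceiling that -XX:MaxRAMPercentage=75 would yield when the JVM sees the container limit with the ceiling it yields when it sees the host's memory instead. The 32 GiB container limit and the 128 GiB host are hypothetical numbers chosen for illustration, and the function is a simplified model of the JVM's sizing, not its actual implementation.

```python
GIB = 1024 ** 3

def max_heap_bytes(visible_memory_bytes: int, max_ram_percentage: float = 75.0) -> int:
    """Approximate the heap ceiling derived from the memory the JVM believes it has."""
    return int(visible_memory_bytes * max_ram_percentage / 100)

# Hypothetical sizes, for illustration only.
container_limit = 32 * GIB
host_memory = 128 * GIB

intended = max_heap_bytes(container_limit)  # container limit is detected
actual = max_heap_bytes(host_memory)        # container limit is ignored

print(f"intended max heap: {intended / GIB:.0f} GiB")  # 24 GiB
print(f"actual max heap:   {actual / GIB:.0f} GiB")    # 96 GiB
print(f"heap may exceed the container limit: {actual > container_limit}")  # True
```

Under these assumed numbers, the heap is allowed to grow to three times the container's memory limit, so the pod would be killed once real usage crosses 32 GiB.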

If you're using a managed k8s cluster such as GKE or EKS, you will most likely not be impacted, as these currently run on older kernel versions. For example, for GCP see: https://docs.cloud.google.com/kubernetes-engine/docs/how-to/migrate-cgroupv2#transition-plan
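If you manage your own nodes, a quick way to check whether one falls in the range described above is to inspect its kernel release string. The sketch below is a minimal example that assumes Ubuntu-style release strings such as "6.14.0-35-generic"; the function name is hypothetical, and kernel flavours that number their builds differently are not handled and should be checked by hand.

```python
import platform
import re

FIXED_ABI = 36  # per the alert, 6.14.0-36 contains the fix

def is_affected_kernel(release: str) -> bool:
    """Return True for 6.14 kernels whose build number is below 36.

    Assumes Ubuntu-style release strings such as "6.14.0-35-generic".
    """
    match = re.match(r"^6\.14\.(\d+)-(\d+)", release)
    if match is None:
        return False  # not a 6.14 kernel, or an unrecognized format
    return int(match.group(2)) < FIXED_ABI

if __name__ == "__main__":
    release = platform.release()  # same string that `uname -r` prints
    status = "in the affected range" if is_affected_kernel(release) else "not in the affected range"
    print(f"{release}: {status}")
```

Run it on the node itself (not inside a container image with a different Python environment in mind): containers share the host kernel, so the release string reported inside a pod is the node's.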

Workaround (if you suspect that you might be affected): You can override the default JVM options, replacing the relative memory limits with absolute values. This involves the following steps:
1: In the Helm values YAML for your participant deployment, add defaultJvmOptions: -XX:+UseG1GC -Xms24g -Xmx24g -Dscala.concurrent.context.numThreads=8 -XX:ActiveProcessorCount=8 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/persistent-data
* If you changed the default memory limits for this Helm deployment previously, please adapt -Xms24g -Xmx24g to represent 75% of the memory limits of the container.
2: In the Helm values YAML for your validator deployment, add defaultJvmOptions: -XX:+UseG1GC -Xms6g -Xmx8g -Dscala.concurrent.context.numThreads=8 -XX:ActiveProcessorCount=8 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/persistent-data
* If you changed the default memory limits for this Helm deployment previously, please adapt -Xms6g -Xmx8g to represent 75% of the memory limits of the container.
3: Run helm upgrade to apply your changes.
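If you have customized the container memory limits, the 75% adjustment mentioned in the steps above can be computed with a small helper. This is a sketch only: the function names are hypothetical, it accepts only whole-number binary quantities such as "32Gi", and it sets -Xms and -Xmx to the same value (as in the participant example), so adjust it if you want a smaller initial heap.

```python
import re

_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}

def parse_memory_limit(limit: str) -> int:
    """Convert a Kubernetes binary-suffix quantity such as "32Gi" to bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti)", limit.strip())
    if match is None:
        raise ValueError(f"unsupported memory quantity: {limit!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]

def heap_flags(limit: str, fraction: float = 0.75) -> str:
    """Return -Xms/-Xmx flags sized to `fraction` of the container limit, in MiB."""
    heap_mib = int(parse_memory_limit(limit) * fraction) // (1024**2)
    return f"-Xms{heap_mib}m -Xmx{heap_mib}m"

# Example with an assumed 32Gi container limit.
print(heap_flags("32Gi"))  # -Xms24576m -Xmx24576m (equivalent to 24g)
```

The resulting flags replace the -Xms/-Xmx pair in the defaultJvmOptions value; the remaining options stay as listed in the steps.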

Avoiding this issue: We recommend against running the latest kernel version in production. Ideally, upgrades of the host system / k8s cluster should be rolled out in stages, similar to Splice upgrades (i.e., first upgrade your DevNet cluster, then TestNet, then MainNet).

