The Cryptographic Wall - Fast External Deterministic Verification of LLM Code Execution
Abstract
The deployment of agentic AI systems in mission-critical environments is fundamentally blocked by a critical security vulnerability: the inability to reliably verify code execution when users lack control over the computational stack. This trust gap renders such systems dangerously unreliable, as Large Language Models (LLMs) are prone to hallucinating outputs and fabricating execution traces that appear plausible but are disconnected from reality. Without a robust verification method, autonomous agentic systems remain largely confined to low-risk applications. In this post, I introduce and validate a novel solution called the “Cryptographic Wall,” a method that exploits the inherent computational difficulty LLMs face in performing chaotic mathematical operations, such as the MD5 hashing algorithm. By challenging an LLM with a simple cryptographic task, a user can achieve near-instantaneous and deterministic verification of a genuine sandbox execution environment. I explore the dual-use implications of this technique, positioning it as a “Trusted Handshake” for legitimate users to establish session integrity, and as a powerful probing tool for security researchers or malicious actors to perform system reconnaissance. This work contributes a practical, high-fidelity methodology to the fields of LLM security and trusted computing, offering a crucial step toward building more reliable and verifiable AI systems.
1. Introduction
The emergence of agentic AI systems capable of autonomously generating and executing code represents a paradigm shift in computing. This advancement introduces a severe security caveat: as the NVIDIA AI red team has noted, all code generated by an LLM must be treated as untrusted output [1].
This discussion, however, focuses not on the generation of code, but on its execution. In many AI architectures, users interact with agents through APIs with no direct control over, or visibility into, the underlying execution stack. This creates a profound trust gap that is a fundamental blocker to deploying agentic AI in high-stakes environments such as finance, healthcare, and defense. Without verifiable execution, users cannot confirm whether an agent’s output is the result of genuine computation or a sophisticated hallucination, rendering these systems dangerously unreliable.
This gap is exacerbated by the well-documented tendency of LLMs to fail in reasoning tasks and simulate code execution traces inaccurately [2], [3]. Traditional verification methods, such as temporal latency analysis—colloquially known as the “sleep test”—are fundamentally flawed. These techniques, which involve injecting a blocking time.sleep() command, are susceptible to network variance and introduce unacceptable latency, degrading the user experience [4].
I introduce and validate a superior, high-fidelity verification method: the Cryptographic Wall. My central thesis is that LLMs, as probabilistic systems optimized for pattern recognition, are computationally incapable of correctly and consistently performing the chaotic and precise mathematical operations that underpin cryptographic functions. By leveraging this fundamental limitation, we can create a simple, deterministic, and near-instantaneous test to verify the presence of a true code execution sandbox.
Section 2 provides a detailed analysis of the uncontrolled execution problem and its associated risks. Section 3 contrasts the incumbent temporal analysis method with my proposed Cryptographic Wall protocol. Section 4 explores the dual-use implications of this technique as both a defensive “Trusted Handshake” and an offensive reconnaissance probe. Section 5 discusses potential countermeasures and directions for future research. Finally, Section 6 concludes by summarizing my contributions and emphasizing the critical need for robust verification mechanisms in the age of agentic AI.
2. The Uncontrolled Execution Problem: Risks and Realities
Understanding the risks associated with unverified LLM code execution is of strategic importance for any organization deploying or interacting with modern AI systems. The gap between an LLM’s claimed action and the grounded reality of its execution can lead to flawed decision-making, security breaches, and a fundamental erosion of trust in AI-powered services.
2.1. Defining the Threat Model: Limited Control Scenarios
The prevailing architectural paradigm for agentic AI involves users interacting with LLM-powered services via APIs. In this model, the user has no control over the hardware, operating system, or runtime environment. This establishes a threat model where the underlying infrastructure must be considered “honest-but-curious” (i.e., the infrastructure provider is assumed to follow protocols but may attempt to observe or exfiltrate data passing through its systems) or potentially compromised, an assumption common in research on trusted execution environments (TEEs) [5]. Within this model, users cannot deploy traditional monitoring tools, making it impossible to independently confirm that code provided in a prompt is executed as intended.
2.2. The Specter of Hallucination: Analyzing Execution Risks
LLMs have demonstrated significant failures in tasks requiring reliable simulation of code execution and logical reasoning. Empirical studies reveal a persistent “ideation-execution gap,” where ideas generated by LLMs score highly in initial evaluations but fail to translate into effective outcomes upon actual execution [3]. In other words, an agent’s plausible description of a plan is no guarantee of its ability to carry that plan out correctly. Models are also prone to specific reasoning errors, including “Input Misread” failures and output construction failures in which the model calculates a correct answer internally but “misreported” it in the final output [2]. These findings motivate a verification method that tests actual execution rather than plausible output, which the Cryptographic Wall provides.
These failures translate into tangible business risks. Decisions based on fabricated data can lead to financial harm and strategic missteps. For industries in high-stakes fields like finance and healthcare, a single hallucinated output could result in serious compliance violations or reputational damage [6]. When an AI’s plausible-sounding falsehoods mislead employees or customers, it erodes brand trust and negates the very productivity gains the technology was intended to deliver.
2.3. The Limits of LLM Generalization in Code
The code generation capabilities of LLMs are often overestimated. Research suggests that a significant portion of their performance stems from data leakage, where solutions to evaluation benchmarks are present in the training data, rather than from true generalization. One analysis concluded that over a third (37.4%) of a model’s performance on coding challenges could be attributed to leakage [7]. This same study notes that LLMs exhibit “diminished performance” on highly complex exercises and “major drops in performance for unpopular languages” [7]. This reliance on memorization makes LLMs particularly unsuited for tasks demanding novel, precise computation. Cryptographic functions, designed to be chaotic and non-pattern-based, represent a class of operations that probabilistic models fundamentally struggle to replicate without actual execution.
These fundamental limitations in control, reasoning, and generalization necessitate the development of robust and novel methodologies for verifying code execution in agentic systems.
3. Methodologies for Verifying Sandbox Execution
A reliable method for verifying code execution is a critical component for building trust in agentic AI systems. Without it, every output remains suspect, undermining the utility of the agent. This section contrasts the incumbent, flawed technique with my proposed cryptographic approach.
3.1. Incumbent Method: Temporal Side-Channel Analysis
The most common technique for differentiating a real sandbox from a token-optimizing LLM is the “Sleep Test,” a rudimentary form of temporal side-channel analysis [4]. The methodology is straightforward: an examiner injects a blocking command, such as import time; time.sleep(5), and measures the response delay. If the system is a true sandbox, its response will be delayed; if it is an LLM merely simulating execution, it will likely return a result almost immediately.
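A minimal sketch of this test is shown below; the send_to_agent function is a hypothetical stand-in for whatever API call delivers the prompt to the target agent and returns its reply:

import time

def sleep_test(send_to_agent, sleep_seconds=5):
    # send_to_agent is a caller-supplied (hypothetical) function that submits
    # a prompt to the agent API and returns its reply as a string.
    prompt = (
        "Execute this code and return its output:\n"
        "import time\n"
        f"time.sleep({sleep_seconds})\n"
        "print('done')"
    )
    start = time.monotonic()
    reply = send_to_agent(prompt)
    elapsed = time.monotonic() - start
    # Heuristic only: a reply arriving well before the sleep could have
    # completed suggests the code was never actually executed.
    return elapsed >= sleep_seconds, elapsed, reply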
Despite its simplicity, this approach has severe limitations:
- High Latency: The verification process itself introduces multi-second delays, which is impractical for real-time applications and degrades the user experience.
- Unreliability: The test is vulnerable to network jitter—unpredictable variations in packet delay—which can create a timing signature from a non-sandboxed LLM that is indistinguishable from the intentional delay of a sleep command, rendering the test unreliable in real-world network conditions.
3.2. Proposed Method: The Cryptographic Wall
I propose a superior method based on a clear hypothesis: LLMs, as probabilistic systems, are fundamentally incapable of correctly and consistently performing chaotic, computationally intensive mathematical operations like cryptographic hashing.
This claim is substantiated by the immense computational expense and mathematical precision required by cryptographic primitives. Operations such as modular division and big-integer arithmetic (math/big.nat.div) are bottlenecks even in highly optimized, dedicated cryptographic implementations used for zero-knowledge proof generation [8]; this underscores how infeasible it is for a probabilistic transformer model, which lacks specialized arithmetic logic units, to correctly simulate such a function on the fly. This is not just a matter of difficulty; it is a fundamental architectural mismatch that the Cryptographic Wall exploits.
This limitation allows for a simple yet powerful verification protocol:
- The user generates a random, high-entropy string locally (e.g., a nonce or a secret key).
- The user injects code into the prompt that instructs the LLM agent to compute and print the MD5 hash of this specific string.
- The user locally computes the correct MD5 hash of the same string and compares it to the agent’s output. A perfect, character-for-character match deterministically verifies that the code was executed in a real computational environment.
An illustration of this technique is provided below:
# 1. User generates a high-entropy string LOCALLY (the short "SecretKey" is used here purely for illustration)
# 2. User sends the following code to the LLM for execution:
import hashlib
random_string = "SecretKey"
md5_hash = hashlib.md5(random_string.encode()).hexdigest()
print(f"The MD5 hash is: {md5_hash}")
# LLM should return: "The MD5 hash is: 0d734a1dc94fe5a914185f45197ea846"
# 3. User locally computes the hash of "SecretKey" and performs a byte-for-byte comparison against the LLM's output. A mismatch is a definitive failure.
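For completeness, the user’s local verification step can be sketched as follows; agent_reply is a hypothetical stand-in for however the agent’s response text is captured:

import hashlib

def verify_agent_reply(secret_string, agent_reply):
    # Recompute the MD5 digest locally and check that it appears verbatim in
    # the agent's reply. Any mismatch is a definitive failure.
    expected = hashlib.md5(secret_string.encode()).hexdigest()
    return expected in agent_reply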
The Cryptographic Wall is superior to temporal analysis due to its deterministic nature, near-instantaneous verification time, and immunity to network latency. An incorrect hash is an unambiguous failure, while a correct one is irrefutable proof of execution. The unambiguous nature of this verification transforms it into a powerful dual-use tool, capable of both securing and probing agentic systems.
4. Dual-Use Implications: Trust Handshake vs. Sandbox Probe
Like many powerful security tools, the Cryptographic Wall is a double-edged sword, giving rise to an emerging cat-and-mouse game in AI security. Its defensive application is about building provably secure systems, while its offensive use represents a new, low-cost frontier for reconnaissance and exploitation. Understanding both use cases is essential for building resilient AI systems.
4.1. Defensive Use Case: The “Trusted Handshake” Protocol
For legitimate users and system operators, the cryptographic challenge can serve as a “Trusted Handshake” at the beginning of any critical session [4]. By issuing a challenge with a unique, session-specific random string, a user can receive definitive proof that they are interacting with an agent capable of genuine code execution. A successful handshake establishes a baseline of trust, confirming the authenticity of the execution environment before any sensitive or high-stakes operations are performed.
This protocol can be situated within a formal trust framework as a practical implementation of a Proof-based trust model. Such models require cryptographic proof of behavior—in this case, proof of correct computation—to establish confidence in an agent’s integrity and capabilities [9].
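A minimal sketch of such a handshake, under the assumption that a caller-supplied send_to_agent function performs the API round trip, might look like this:

import hashlib
import secrets

def trusted_handshake(send_to_agent):
    # send_to_agent is a caller-supplied (hypothetical) function that submits
    # a prompt to the agent and returns its reply as a string.
    nonce = secrets.token_hex(16)  # fresh, session-specific random string
    prompt = (
        "Execute this Python code and return only its output:\n"
        "import hashlib\n"
        f"print(hashlib.md5({nonce!r}.encode()).hexdigest())"
    )
    reply = send_to_agent(prompt)
    expected = hashlib.md5(nonce.encode()).hexdigest()
    # The handshake succeeds only if the expected digest appears in the reply.
    return expected in reply

A failed handshake should terminate the session before any sensitive or high-stakes operation is attempted.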
4.2. Offensive Use Case: Probing for Sandbox Vulnerabilities
Conversely, security researchers and malicious actors (“LLM hackers”) can weaponize this same technique to probe a target system’s sandbox implementation. It has been demonstrated that an LLM like GPT-4 can “autonomously compose and execute selected side-channel attacks when provided with access to physical hardware” [10]. The Cryptographic Wall provides a primitive not just for a human hacker, but for an LLM agent itself to autonomously probe and characterize other AI systems. By sending varied and repeated cryptographic challenges, an attacker agent can conduct reconnaissance without needing direct system access (a minimal probe loop is sketched after the list below) in order to:
- Test for execution consistency across different loads or inputs.
- Characterize the performance of the underlying hardware.
- Potentially identify implementation weaknesses or side-channels for further exploitation.
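Such a probe loop, again assuming a hypothetical send_to_agent helper for the API round trip, could be sketched as follows:

import hashlib
import secrets
import time

def probe_sandbox(send_to_agent, rounds=10):
    # send_to_agent is a caller-supplied (hypothetical) function that submits
    # a prompt to the target agent and returns its reply as a string.
    results = []
    for _ in range(rounds):
        nonce = secrets.token_hex(16)
        prompt = (
            "Execute this Python code and return only its output:\n"
            "import hashlib\n"
            f"print(hashlib.md5({nonce!r}.encode()).hexdigest())"
        )
        start = time.monotonic()
        reply = send_to_agent(prompt)
        elapsed = time.monotonic() - start
        correct = hashlib.md5(nonce.encode()).hexdigest() in reply
        results.append((correct, elapsed))
    # Consistency of correctness and latency across rounds characterizes the
    # target's execution backend.
    return results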
This elevates the threat model significantly, enabling automated, agent-driven mapping of black-box execution environments. The potential for such misuse necessitates a proactive discussion of defensive strategies.
5. Countermeasures and Future Research
Given the dual-use nature of cryptographic verification, it is crucial for system architects to design proactive defenses against malicious probing while exploring more advanced methods for establishing trust. The evolving landscape of AI security demands continuous adaptation and research.
5.1. Mitigating Malicious Probing
System operators can implement several countermeasures to detect and mitigate the use of cryptographic challenges for reconnaissance:
- Rate-Limiting: Throttling or flagging accounts that make frequent or patterned requests containing calls to cryptographic libraries (e.g., hashlib); a minimal flagging sketch follows this list.
- Behavioral Analysis: Deploying monitoring systems to detect anomalous interaction patterns. Sessions that consist exclusively of cryptographic challenges, with no other tasks, may indicate a probing attempt.
- Runtime Attestation with Trusted Execution Environments (TEEs): A more robust, long-term solution is to obviate the need for such external probes. By running agentic workflows within a TEE like Intel SGX, a system can provide a formal, cryptographically signed attestation of the code and its runtime environment to the user [5]. Concrete implementations such as Intel’s protected file system (IPFS) demonstrate that TEEs can provide not just runtime attestation but also secure, attested storage for agentic outputs, as content is “encrypted seamlessly by the trusted library before being written on the media storage” [5]. This shifts from a user-initiated challenge-response model to a system-initiated proof of integrity.
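As an illustration of the rate-limiting countermeasure above, a hedged sketch of a simple prompt filter is given below; the pattern list is illustrative and deliberately non-exhaustive:

import re

# Illustrative, non-exhaustive pattern of common hashing primitives.
CRYPTO_PATTERN = re.compile(r"\b(hashlib|md5|sha1|sha256|sha512|blake2)\b", re.IGNORECASE)

def flag_crypto_probe(prompt):
    # Returns True when a prompt references cryptographic primitives and should
    # be counted toward a per-account rate limit or flagged for review.
    return bool(CRYPTO_PATTERN.search(prompt))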
5.2. Avenues for Future Research
This opens several promising directions for future academic and industry research:
- Advanced Cryptographic Challenges: Investigate the use of more complex challenges that remain computationally inexpensive for the user to verify. These could include simple elliptic curve operations or elements from hash-based proof systems like zk-STARKs, which are known to be post-quantum secure [11], [12]; an iterated-hash variant is sketched after this list.
- Standardized Sandbox Attestation: Develop a standardized protocol for agentic AI systems to attest to their execution environment’s integrity. Such a standard would promote interoperability and create a common security baseline for the industry.
- Monitoring LLM Advancements: Continuously evaluate the capabilities of next-generation LLMs. While it is unlikely that probabilistic models will ever achieve perfect cryptographic accuracy, future architectures may develop a greater ability to approximate such computations, requiring the difficulty of the challenges to be increased accordingly.
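As one modest direction for harder challenges, an iterated-hash variant stays trivial for the verifier to recompute while compounding the precision an LLM would need to fake; the sketch below is an assumption-level illustration rather than a vetted design:

import hashlib

def chained_hash_challenge(seed, iterations=1000):
    # Iterated hashing: still trivial for the verifier to recompute locally,
    # but compounds the precision an LLM would need to fake the final digest.
    digest = seed.encode()
    for _ in range(iterations):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()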
These research avenues will be critical in building a comprehensive framework for verifiable and trusted AI computation.
6. Conclusion
I have addressed the critical problem of unverified code execution in agentic AI systems—a vulnerability that arises from architectures where users have limited control and visibility. I have shown that existing verification methods based on temporal analysis are unreliable and impractical for real-world applications.
As a solution, I proposed the Cryptographic Wall, a high-fidelity, reverse Turing-style test that leverages the fundamental inability of probabilistic LLMs to perform precise, chaotic mathematical computations. This method provides a deterministic, near-instantaneous, and robust mechanism for verifying the presence of a genuine sandbox environment. I further analyzed the dual-use nature of this technique, presenting its defensive application as a “Trusted Handshake” for establishing session integrity and its offensive potential as a probe for system reconnaissance.
The Cryptographic Wall is not merely a technique; it is a foundational principle for verifiable computation in an era of probabilistic machines. As agentic systems become more integrated into society’s critical functions, the ability to distinguish between computation and confabulation will define the line between trusted tools and unpredictable liabilities.
7. References
[1] NVIDIA AI Red Team, “How Code Execution Drives Key Risks in Agentic AI Systems,” NVIDIA Technical Blog, 2024.
[2] Z. Li, et al., “Demystifying Errors in LLM Reasoning Traces: An Empirical Study of Code Execution Simulation,” arXiv preprint, 2024.
[3] C. Si, et al., “The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas,” arXiv preprint, 2025.
[4] A. Kim, “Fake RCE in LLM Applications,” Cyber Defence Magazine, 2025.
[5] G. Russello, G. Mazzeo, L. R. D’Acierno, and A. Hollum, “A Comprehensive Trusted Runtime for WebAssembly with Intel SGX,” arXiv preprint arXiv:2312.09087, 2023.
[6] Cloudsine, “Mitigating LLM Hallucinations and False Outputs in Enterprise Settings,” Cloudsine Blog, 2024.
[7] Á. Barbero Jiménez, “An Evaluation of LLM Code Generation Capabilities Through Graded Exercises,” arXiv preprint arXiv:2410.16292, 2024.
[8] Anonymous, “A Comparative Analysis of zk-SNARKs and zk-STARKs: Theory and Practice,” arXiv preprint, 2024.
[9] Anonymous, “Inter-Agent Trust Models: A Comparative Study of Brief, Claim, Proof, Stake, Reputation and Constraint in Agentic Web Protocol Design,” arXiv preprint, 2024.