Agentic ProbLLMs: Exploiting AI Computer-Use And Coding Agents

New research from Embrace The Red presented at 39C3 reveals critical vulnerabilities in AI agents that can bypass safety controls through probabilistic manipulation. These "Agentic ProbLLMs" attacks demonstrate how malicious actors can exploit the inherent uncertainty in AI decision-making to execute harmful code and manipulate agent behavior.

How the Attack Works

The ProbLLM attack exploits the probabilistic nature of large language models used in coding and computer-use agents. By carefully crafting prompts that appear legitimate but contain subtle malicious instructions, attackers can influence the probability distribution of the model's outputs. The attack leverages the fact that AI agents often operate with temperature settings above zero, introducing variability that can be exploited.

The technique involves feeding agents seemingly normal requests that contain embedded instructions designed to shift the model's probability distribution toward harmful outputs. For example, an attacker might request code generation while including comments or documentation that subtly guide the model toward creating backdoors or security vulnerabilities. The probabilistic nature of the model means it may generate the malicious code while still appearing to follow legitimate instructions.

Real-world implementations show that this attack can bypass traditional safety filters because the malicious content isn't directly present in the prompt but emerges from the model's learned associations. The attack is particularly effective against agents with tool-calling capabilities, as they can execute the generated code directly without human review.

Real-World Implications

The implications for production AI agent deployments are severe. Organizations using autonomous coding agents, DevOps automation, or AI-powered system administration tools face immediate risks. Attackers could potentially gain unauthorized access to systems, exfiltrate data, or establish persistent backdoors through seemingly legitimate agent interactions.

Consider an AI agent responsible for code reviews and deployments. An attacker could submit a pull request with embedded probLLM triggers that cause the agent to approve and deploy malicious code. The attack could also target customer service agents that have access to sensitive systems, using probabilistic manipulation to trick them into performing unauthorized actions.

The research indicates that current safety measures are insufficient against this attack vector. Traditional input validation and output filtering fail because the malicious content emerges from the model's internal probabilistic calculations rather than being directly present in inputs or outputs.

Defensive Measures for AI Agent Operators

Implementing robust defense against ProbLLM attacks requires a multi-layered approach. First, organizations should implement strict temperature controls for production agents, using temperature=0 for critical operations to eliminate probabilistic variability. This reduces the attack surface but requires careful implementation to maintain functionality.

def secure_agent_config(tools, critical_mode=False):
    """Configure agent with security-focused settings"""
    if critical_mode:
        # Zero temperature for deterministic outputs
        llm_config = {
            "temperature": 0,
            "top_p": 1,
            "frequency_penalty": 0,
            "presence_penalty": 0
        }
    else:
        # Minimal temperature for non-critical tasks
        llm_config = {
            "temperature": 0.1,
            "top_p": 0.9,
            "frequency_penalty": 0.1,
            "presence_penalty": 0.1
        }

    # Bind tools with strict validation
    llm_with_tools = llm.bind_tools(
        tools,
        tool_choice="auto",
        strict=True  # Enable strict mode
    )

    return llm_with_tools, llm_config

Second, implement comprehensive output validation before any tool execution. This includes static analysis of generated code, behavior-based detection of anomalous outputs, and sandboxed execution environments for testing agent actions before production deployment.

Third, establish human-in-the-loop validation for critical operations. Even with automated agents, sensitive actions should require human approval. Implement rate limiting and anomaly detection to identify unusual agent behavior patterns that might indicate successful ProbLLM attacks.

Immediate Action Items

Organizations must audit their current AI agent deployments to identify vulnerabilities. Start by inventorying all autonomous agents, their capabilities, and access levels. Implement the temperature controls discussed above and establish monitoring for unusual agent behavior patterns.

Key immediate steps include: - Review and update agent temperature settings, prioritizing deterministic outputs for critical functions - Implement sandboxed execution environments for all agent-generated code - Deploy output validation systems that analyze both code quality and security implications - Establish incident response procedures specifically for AI agent compromise scenarios - Train development teams on ProbLLM attack vectors and defensive coding practices

Regular security assessments should include ProbLLM testing scenarios to validate defensive measures. Consider implementing adversarial testing programs that specifically probe for probabilistic manipulation vulnerabilities.

The research from 39C3 highlights a critical gap in AI agent security that requires immediate attention. As organizations increasingly rely on autonomous AI systems, understanding and defending against probabilistic attacks becomes essential for maintaining security posture. The techniques outlined provide a foundation for securing AI agents against this emerging threat vector.

For detailed technical information and additional defensive strategies, refer to the original research at: https://embracethered.com/blog/posts/2025/39c3-agentic-probllms-exploiting-computer-use-and-coding-agents/

AgentGuard360

Built for agents and humans. Comprehensive threat scanning, device hardening, and runtime protection. All without data leaving your machine.

Coming Soon