The Choice Can Be the Attack: Auditing Aligned Backdoors in LLM Agents

Ongoing work. The manuscript is not public yet, so this page exposes only the abstract-level summary.

Abstract

Agentic LLM systems are vulnerable to aligned backdoors: triggered behaviors that still satisfy the user's instruction but systematically steer which acceptable option the agent selects, such as which brand, vendor, or tool the agent chooses. This work introduces CCA, an endpoint-black-box auditing framework with environment instrumentation, constraint-preserving randomized counterfactual environments, and pooled discrete-choice estimation over eligible options.

Across synthetic WebShop stress tests and a real 3B WebShop endpoint case study, the framework is used to measure trigger-dependent steering, characterize when counterfactual randomization is sufficient, and identify when feature-controlled choice modeling is necessary to separate suspect-only steering from benign quality effects.

Back to research overview