In the world of traditional software development, engineers have long relied on a methodology called “Design by Contract” to build reliable systems. This approach clearly defines what a software component expects (preconditions), what it guarantees (postconditions), and what remains true throughout its execution (invariants). But what happens when we apply these time-tested principles to the newer, more dynamic world of Large Language Models (LLMs)?
In this post, I’ll explain the key ideas behind our research, why they matter, and how they could transform the way we build AI systems.
If you’ve ever worked with LLM APIs (like those from OpenAI, Anthropic, or others), you’ve likely encountered some unexpected behaviors or errors: a request rejected because the prompt exceeded the model’s token limit, a response that wasn’t the valid JSON you asked for, or calls that fail because operations happened in the wrong order.
These are all examples of contract violations - instances where an implicit assumption about how the API should be used or what it will return was broken.
Unlike traditional APIs with clear documentation and static types, LLM APIs often have many implicit “contracts” - unstated conditions that, if violated, lead to errors or unexpected behaviors. The problem is that developers typically discover these contracts through trial and error or by scouring forum posts, rather than through explicit specifications.
An LLM contract is essentially a formal specification of what’s expected when interacting with an LLM: preconditions on the inputs you send, postconditions on the outputs you get back, and constraints on the order in which operations happen.
Contracts make these implicit assumptions explicit and enforceable, allowing developers to catch and handle violations early.
The research identifies several major categories of contracts for LLMs:
Input contracts specify what the LLM expects as input: for example, that a prompt stays within the model’s maximum token limit or that required parameters are present and well formed.
Input contracts are the most common type, accounting for roughly 60% of issues encountered in practice.
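To make this concrete, here is a minimal sketch of checking one common input contract, that a prompt fits the context window, before the API is ever called. It uses the `tiktoken` tokenizer; the `ContractViolation` class and the 4096-token limit are illustrative choices of mine, not part of the papers’ framework.

```python
import tiktoken

# Assumed context limit, mirroring the 4096-token figure used later in this post.
MAX_CONTEXT_TOKENS = 4096


class ContractViolation(Exception):
    """Raised when a contract check fails (the name is illustrative)."""


def check_prompt_fits(prompt: str, model: str = "gpt-4") -> int:
    """Input contract: the prompt must fit within the model's context window."""
    encoding = tiktoken.encoding_for_model(model)
    n_tokens = len(encoding.encode(prompt))
    if n_tokens > MAX_CONTEXT_TOKENS:
        raise ContractViolation(
            f"prompt is {n_tokens} tokens; the limit is {MAX_CONTEXT_TOKENS}"
        )
    return n_tokens
```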
Output contracts specify expectations about the LLM’s responses: for example, that the output parses as valid JSON or conforms to a given schema.
Output contracts represent about 20% of observed issues and are particularly important for downstream processing of LLM outputs.
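As a sketch of what checking an output contract could look like, the snippet below validates a raw model response against a JSON Schema using the `jsonschema` package; the schema itself is an assumed example rather than anything prescribed by the papers.

```python
import json

import jsonschema

# Illustrative schema: the model must return an object with a short "summary" field.
SUMMARY_SCHEMA = {
    "type": "object",
    "properties": {"summary": {"type": "string", "maxLength": 500}},
    "required": ["summary"],
}


def check_summary_output(raw_response: str) -> dict:
    """Output contract: the response parses as JSON and matches the schema."""
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output contract violated: not valid JSON ({exc})") from exc
    jsonschema.validate(instance=parsed, schema=SUMMARY_SCHEMA)  # raises on mismatch
    return parsed
```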
Sequence contracts specify the correct ordering of operations: for example, that certain calls must happen before others across a multi-step interaction.
Sequence contracts are less frequent (about 15% of issues) but crucial for maintaining state and coherence in more complex applications.
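One straightforward way to enforce a sequence contract is a small state machine that tracks which operations have already happened. The class below is a generic sketch with made-up method names, not an API from the papers:

```python
class SequenceContractError(Exception):
    pass


class ChatSession:
    """Enforces a simple ordering contract: configure() -> send()* -> close()."""

    def __init__(self):
        self._state = "created"

    def configure(self, system_prompt: str) -> None:
        if self._state != "created":
            raise SequenceContractError("configure() must be the first call, made exactly once")
        self._system_prompt = system_prompt
        self._state = "ready"

    def send(self, message: str) -> str:
        if self._state != "ready":
            raise SequenceContractError("send() requires a configured, open session")
        # ... the underlying LLM API would be called here ...
        return f"(response to {message!r})"

    def close(self) -> None:
        if self._state != "ready":
            raise SequenceContractError("close() called on a session that is not open")
        self._state = "closed"
```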
The second paper extends this taxonomy with additional categories of contracts.
How do we identify these contracts? The papers describe several approaches:
Developers can write contracts based on their domain expertise and understanding of the LLM. This is precise but labor-intensive.
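For example, a developer who knows their summarization pipeline well might wrap the call in hand-written checks. Here is a minimal sketch using plain assertions and a hypothetical `call_llm` function supplied by the caller:

```python
def summarize(call_llm, document: str) -> str:
    """Summarize a document, guarded by hand-written pre- and postconditions."""
    # Preconditions drawn from domain knowledge
    assert document.strip(), "precondition: document must not be empty"
    assert len(document) < 50_000, "precondition: document too large for a single call"

    summary = call_llm(f"Summarize the following document:\n\n{document}")

    # Postconditions: the summary exists and is much shorter than the input
    assert summary.strip(), "postcondition: summary must not be empty"
    assert len(summary) < len(document), "postcondition: summary must be shorter than the input"
    return summary
```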
More interestingly, the research proposes automated ways to discover contracts:
Static Analysis: Examining library code to find checks or error conditions that indicate contracts. For example, if the OpenAI SDK contains a check like `if len(prompt) > MAX_TOKENS: raise Error("Prompt too long")`, we can infer a contract that “prompt length must not exceed MAX_TOKENS.”
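Here is a very rough sketch of that idea, using Python’s `ast` module to flag `if <condition>: raise ...` patterns in SDK source code; both the heuristic and the sample source are illustrative stand-ins, not the papers’ actual mining tool.

```python
import ast
import textwrap

# A stand-in for real SDK source code being analyzed.
SDK_SOURCE = textwrap.dedent("""
    def complete(prompt, max_tokens=256):
        if len(prompt) > MAX_TOKENS:
            raise ValueError("Prompt too long")
        ...
""")


def mine_contracts(source: str) -> list[str]:
    """Infer crude contracts from `if <condition>: raise ...` guard patterns."""
    contracts = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.If) and any(isinstance(stmt, ast.Raise) for stmt in node.body):
            contracts.append(f"input must NOT satisfy: {ast.unparse(node.test)}")
    return contracts


print(mine_contracts(SDK_SOURCE))
# ['input must NOT satisfy: len(prompt) > MAX_TOKENS']
```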
Dynamic Analysis: Running tests and observing failures to identify boundaries. For instance, gradually increasing the prompt size until the API rejects it reveals the maximum allowed length.
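Below is a sketch of that kind of boundary probing written as a binary search over prompt length. `call_api` is a hypothetical function that raises an exception when the prompt is too long; a real probe would also need to account for rate limits and API costs.

```python
def find_max_prompt_length(call_api, low: int = 1, high: int = 200_000) -> int:
    """Binary-search the largest prompt length the API accepts.

    Assumes `call_api(prompt)` raises for over-long prompts and that `low`
    itself is an accepted length.
    """
    while low < high:
        mid = (low + high + 1) // 2
        try:
            call_api("x " * mid)   # probe with a prompt of roughly `mid` tokens
            low = mid              # accepted: the limit is at least `mid`
        except Exception:
            high = mid - 1         # rejected: the limit is below `mid`
    return low
```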
NLP-Based Mining: Using NLP techniques (including LLMs themselves) to extract contract statements from documentation, forum posts, and stack traces. For example, parsing a statement like “The maximum context length is 4096 tokens” from API docs.
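At its simplest this can be a pattern match over documentation text; the regular expression below is a toy illustration of pulling a token-limit contract out of a sentence like the one quoted above.

```python
import re

DOC_TEXT = "The maximum context length is 4096 tokens, including the completion."

# Toy pattern: look for "maximum context length is <N> tokens" style statements.
pattern = re.compile(r"maximum context length is (\d+) tokens", re.IGNORECASE)

for match in pattern.finditer(DOC_TEXT):
    limit = int(match.group(1))
    print(f"mined contract: prompt_tokens + completion_tokens <= {limit}")
# mined contract: prompt_tokens + completion_tokens <= 4096
```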
Machine Learning Inference: Training models to predict contract conditions from examples of function usage.
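As a rough, toy illustration of the idea (not the papers’ actual models), one could train a small text classifier that maps error messages or usage snippets to contract categories:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, made-up training set: error messages labelled with contract categories.
messages = [
    "This model's maximum context length is 4096 tokens",
    "Invalid request: 'messages' must be a non-empty array",
    "Could not parse the model output as JSON",
    "Response did not match the expected schema",
    "You must create a session before sending messages",
]
labels = ["input", "input", "output", "output", "sequence"]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(messages, labels)

# Predict the likely contract category for a new, unseen message.
print(classifier.predict(["prompt exceeds the maximum number of tokens"]))
```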
This automated mining helps reduce the burden on developers and captures community knowledge about LLM usage.
Once identified, how do contracts get enforced? The papers propose a comprehensive enforcement architecture.
The approach can be integrated with popular frameworks like LangChain or LlamaIndex:
```python
from langchain.llms import OpenAI  # assumed import for the LangChain OpenAI wrapper
from llm_contracts.integrations import ContractLLM
from llm_contracts.contracts import JsonOutputContract, MaxTokensContract

# Illustrative schema that the JSON output must conform to
my_schema = {"type": "object", "required": ["summary"]}

# Create LLM with contracts
contracted_llm = ContractLLM(
    llm=OpenAI(model="gpt-4"),
    contracts=[
        MaxTokensContract(4096),
        JsonOutputContract(schema=my_schema),
    ],
)

# Use normally - contracts are enforced automatically
response = contracted_llm("Generate a summary of this document")
```
The research included several case studies demonstrating the practical benefits:
For a healthcare Q&A system, contracts ensured the model’s answers consistently met the domain-specific requirements the team had defined.
When answering financial questions, contracts verified key properties of each response before it reached users.
For a programming helper, contracts checked the generated code against the developer’s expectations before it was returned.
The research analyzed over 600 instances of LLM API issues and found that most stemmed from input contract violations, with output and sequence violations accounting for most of the remainder.
With proper contracts in place, about 95% of these issues could be caught and many automatically resolved.
Performance impact was minimal: enforcing contracts typically added only 8-15% overhead to API call latency, which is negligible compared to the time saved debugging mysterious failures.
The papers detail several implementation approaches:
A formal language (LLMCL) for specifying preconditions, postconditions, and probabilistic guarantees.
Contracts can be specified with a decorator and assertion-style clauses, as in the following example:
```python
@contract
def generate_summary(title: str, content: str) -> str:
    # Preconditions
    require(len(content) > 0 and len(content) <= MAX_LEN)
    require(title is None or len(title) < 100)

    # Postconditions
    ensure(is_valid_json(output))
    ensure(sentence_count(output.summary) <= 3)

    # Probabilistic postcondition
    ensure_prob(lambda out: title in out.summary, 0.9)

    # Call the LLM here
    result = call_llm_api(title, content)
    return result
```
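The `ensure_prob` clause is the most unusual part: because LLM output is stochastic, a postcondition may only be required to hold with some minimum probability. Below is a hand-rolled sketch of how such a check could be evaluated by repeated sampling; it is my illustration of the concept, not how LLMCL itself implements it.

```python
import random


def check_probabilistic_postcondition(generate, predicate, threshold=0.9, n_samples=20):
    """Estimate how often `predicate(output)` holds across repeated generations.

    `generate()` is assumed to call the LLM once and return its output; the check
    passes if the empirical success rate meets the threshold.
    """
    successes = sum(1 for _ in range(n_samples) if predicate(generate()))
    rate = successes / n_samples
    return rate >= threshold, rate


# Toy usage with a fake generator that mentions the title 95% of the time.
passed, rate = check_probabilistic_postcondition(
    generate=lambda: "My Title: ..." if random.random() < 0.95 else "something else",
    predicate=lambda out: "My Title" in out,
)
print(passed, rate)
```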
The papers also describe techniques to keep enforcement overhead low.
The researchers acknowledge several limitations of the current framework and outline directions for future research.
The contract-based approach to LLM development represents a significant step toward more reliable AI systems. By making implicit assumptions explicit and enforceable, we can catch violations early, handle failures gracefully, and reason with far more confidence about how our systems will behave.
As LLMs become increasingly integrated into critical software systems, frameworks like these will be essential for ensuring they operate reliably, safely, and as intended.
Design by Contract for LLMs brings time-tested software engineering principles to the frontier of AI development. By formalizing the “rules of engagement” for LLMs, we can build systems that fail less, are easier to debug when they do fail, and provide stronger guarantees about their behavior.
The papers suggest we’re moving toward a future where specifying contracts for AI components will be as standard as writing unit tests for traditional software. For developers working with LLMs today, adopting these principles—even informally—can significantly improve reliability and reduce development headaches.
As AI becomes increasingly embedded in our software ecosystem, approaches like this will be crucial for bridging the gap between AI’s inherent probabilistic nature and the deterministic guarantees we expect from reliable software.