Gated Dependencies: Making LLMs Do What You Say
You wrote “ALWAYS search the knowledge base before troubleshooting” in your AI agent’s prompt. It worked in testing. In production, the agent skipped it 43% of the time. Worse, it confidently told callers “I found a guide from the manufacturer” when it had never looked anything up.
This post is about why that happens and a structural fix that actually works.
The Problem: Instructions vs. Incentives
LLMs optimize for the goal, not the process. If a model can reach “help the caller” without making the tool call, your instruction becomes optional friction. The model found a faster path to a helpful response and took it.
“ALWAYS do X” is a constraint on behavior. LLMs comply with constraints inconsistently, especially when the constrained action has a latency cost and the model already has relevant knowledge baked into its weights. The model isn’t being defiant. It’s doing exactly what makes LLMs useful: finding shortcuts to the stated objective.
This is the core tension. The capability that lets an LLM handle novel situations (flexible goal pursuit) is the same one that lets it skip your carefully designed tool-call workflow.
How We Caught It
We built a Langfuse enrichment pipeline that tracks every tool call as a span. Each conversation gets a troubleshootingAttempted boolean: did search_troubleshooting fire during this call?
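The check itself is simple. Here's a minimal sketch of the enrichment logic, assuming each trace's spans arrive as plain dicts; the field names are illustrative, not the exact Langfuse SDK schema:

```python
# Minimal sketch of the enrichment check. Assumes spans are exported
# as plain dicts; "name" is illustrative, not the exact Langfuse schema.
from typing import Iterable

def troubleshooting_attempted(spans: Iterable[dict]) -> bool:
    """True if a search_troubleshooting span fired during this call."""
    return any(span.get("name") == "search_troubleshooting" for span in spans)

def compliance_rate(traces: dict[str, list[dict]]) -> float:
    """Fraction of conversations where the KB search actually ran."""
    flags = [troubleshooting_attempted(spans) for spans in traces.values()]
    return sum(flags) / len(flags) if flags else 0.0
```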
When we started measuring, 43% of work order calls never triggered the knowledge base tool.
We pulled a transcript from one of those calls. The agent told the caller, “I found guidance in the manufacturer’s service manual.” A hallucinated citation. Meanwhile, the KB had three strong matches for the issue (top result: 0.787 similarity) that the agent never retrieved.
The model wasn’t malicious. It had enough general knowledge about the equipment to generate plausible troubleshooting steps. So it did, and it dressed them up with a citation to sound authoritative. The instruction to search first was just a suggestion it could route around.
The Pattern: Gates vs. Instructions
An instruction says: “Always call the tool before proceeding.” The model can rationalize past this. It has general knowledge, the tool call is slow, and the user needs help now. Easy to skip.
A gate says: “The tool result determines which of three paths you take.” The model can’t proceed without it. There’s no downstream behavior available until the gate produces its output.
The structural difference: an instruction adds a constraint the model must choose to follow. A gate creates a dependency the model can’t route around.
Three Ingredients of a Good Gate
1. The upstream step produces something the downstream step consumes.
Not “do X then Y” but “X produces a value that Y branches on.” The downstream step literally can’t execute without the upstream output. If you can describe the flow without mentioning what X returns, it’s an instruction, not a gate.
2. There’s no default path that bypasses the branch.
Every downstream path requires the variable. If there’s a way forward without it, the model will find it. This is the part people get wrong. They add the gate but leave a fallback (“if the tool fails, proceed with your best judgment”), and the model treats the fallback as the default.
3. Each branch has distinct framing.
The language for each path should be different enough that the model can’t blur them together. This prevents the failure mode we saw: hallucinating Path A behavior (citing the KB) while actually on Path B (generating from training data).
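Here's what the three ingredients look like as code. A minimal sketch with hypothetical function and field names; the 0.50 threshold and the tier framing foreshadow the real prompt example below:

```python
from dataclasses import dataclass

@dataclass
class KBResult:
    top_similarity: float
    procedure: str | None

def search_kb(issue: str) -> KBResult:
    # Stub standing in for the real knowledge-base tool call.
    return KBResult(top_similarity=0.787, procedure="manufacturer guidance")

def handle_issue(issue: str, structural_damage: bool) -> str:
    result = search_kb(issue)  # ingredient 1: upstream produces a value

    # Ingredient 2: no default path. Which branch runs is decided by the
    # tool result; there is no "proceed anyway" fallback that skips it.
    if result.top_similarity >= 0.50:
        # Ingredient 3: distinct framing per branch.
        return f"I found guidance in the service manual: {result.procedure}"
    if structural_damage:
        return "Skipping troubleshooting; creating a work order."
    return "Let's try a few things based on common issues with this equipment."
```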
Real Example: Before and After
Here’s what our prompt looked like before the fix:
Step 0: Search Knowledge Base — ALWAYS do this
After identifying the equipment, always call search_troubleshooting.
Step 3: If no procedure was found, generate up to 3 steps based on the equipment and issue.
Two problems. First, the word “always” is carrying the entire enforcement mechanism, and it’s not strong enough. Second, Step 3 gives the model a clean escape hatch: if it skips the search, “no procedure was found” is technically true, and it can jump straight to generating steps from its own knowledge.
Here’s the gated version:
Step 0: Search Knowledge Base — REQUIRED (determines troubleshooting tier)
Call search_troubleshooting. The result determines your approach:
- KB results with similarity >= 0.50 → Tier 1 (KB-guided procedure)
- KB empty + operational issue → Tier 2 (general equipment knowledge)
- KB empty + structural damage → Skip troubleshooting, go to work order
Step 3: Use the tier from Step 0.
Tier 1: "I found guidance in the manufacturer's service manual."
Tier 2: "Let's try a few things based on common issues with this equipment."
The tier variable is the gate. Step 3 can’t execute without it. There’s no fallback that lets the model skip the search. And each tier has its own framing, so the model can’t accidentally use Tier 1 language (“I found guidance”) when it’s actually on Tier 2.
After this change, KB search compliance went from 57% to effectively 100%.
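If you control the harness, you can enforce the same dependency outside the prompt. This sketch is hypothetical (our fix above was prompt-only); it derives the tier from the actual tool result instead of trusting the model's claim:

```python
# Hypothetical harness-side gate: the tier is computed from the real
# tool result, so Tier 1 framing can't appear without a retrieval
# behind it. Names and threshold mirror the prompt above.
SIM_THRESHOLD = 0.50

def derive_tier(search_result: dict | None, structural_damage: bool) -> str:
    if search_result is None:
        # The gate: refuse to proceed if search_troubleshooting never ran.
        raise RuntimeError("search_troubleshooting must run before Step 3")
    if search_result.get("top_similarity", 0.0) >= SIM_THRESHOLD:
        return "tier_1"  # KB-guided procedure
    return "work_order" if structural_damage else "tier_2"
```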
Where This Pattern Already Works
You’ve probably used gated dependencies without naming them.
Data collection gates. “Do not call create_work_order until you have ALL of: customer_id, asset_id, task_id, contact_name, contact_phone.” This is why name collection rarely gets skipped in well-built agents. The API call literally requires those fields. The model can’t route around a missing required parameter.
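In most function-calling APIs, this gate is just the schema's required list. A sketch with a hypothetical create_work_order definition; any API that validates required parameters gives you this for free:

```python
# Hypothetical tool schema: the "required" list is the gate. A call
# missing any of these fields fails validation before it executes.
CREATE_WORK_ORDER = {
    "name": "create_work_order",
    "parameters": {
        "type": "object",
        "properties": {
            "customer_id":   {"type": "string"},
            "asset_id":      {"type": "string"},
            "task_id":       {"type": "string"},
            "contact_name":  {"type": "string"},
            "contact_phone": {"type": "string"},
        },
        "required": [
            "customer_id", "asset_id", "task_id",
            "contact_name", "contact_phone",
        ],
    },
}
```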
Verification gates. “Use the confirmed store name in the work order comment.” The downstream step (writing the comment) consumes the upstream output (the confirmed name). The model has to do the confirmation step because the comment template references it.
Chain-of-thought. This is the same principle at a token level. Chain-of-thought works because the reasoning step produces tokens that the answer step conditions on. The model can’t generate the answer without first generating the reasoning, because each token is conditioned on the previous ones. It’s a gate built into the autoregressive architecture itself.
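In sampling terms, the dependency is visible in the factorization (a loose sketch of the intuition, not a formal claim):

```latex
% Answer tokens condition on the reasoning tokens r generated before them;
% there is no path to the answer that skips generating r.
p(\text{answer} \mid \text{prompt})
  = \sum_{r} p(\text{answer} \mid \text{prompt},\, r)\; p(r \mid \text{prompt})
```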
Limitations
This pattern isn’t foolproof. A model can still hallucinate a tool result, fabricating the output of a call it never made. But that failure mode is much rarer than casually skipping a call. Hallucinating a structured tool response requires more “effort” from the model than just not calling the tool.
Gates add prompt complexity. Each one needs clear branch definitions with distinct framing. For a prompt with many tool calls, gating all of them makes the instructions harder to maintain and reason about.
Use gates for high-value tool calls where skipping has real consequences: knowledge base lookups that prevent hallucinated citations, data validation that prevents bad records, safety checks that prevent dangerous actions. Don’t gate every instruction. If a skipped step just makes the response slightly less polished, an instruction is fine.
Takeaway
When an LLM ignores your “ALWAYS do X” instruction, don’t reach for more emphasis: bold text, capital letters, “I REALLY MEAN IT” in the prompt. None of that changes the structural incentive.
Instead, restructure so the downstream behavior requires X’s output. Make the tool call load-bearing, not decorative. If the model can’t proceed without the result, it won’t skip the call.
The fix isn’t louder instructions. It’s better architecture.
I’m an AI consultant who builds voice agents and harnesses for businesses. If your agents are skipping steps they shouldn’t, let’s talk.