Reasoning Under Distribution Shift: A Behavioral Study of Instruction-Following Models
We examine how instruction-following degrades as input distributions shift away from training conditions, and propose lightweight behavioral probes for early detection.
Instruction-following models demonstrate high compliance on in-distribution benchmarks, yet their behavior degrades in ways that are difficult to predict when deployed in contexts that differ from their training regime. This gap between benchmark capability and deployment reliability creates significant friction when scaling to production.
In this paper, we conduct a behavioral study examining how models degrade under systematic distribution shifts. We find that failure modes are not uniform. Instead, models exhibit distinct, measurable decay trajectories — such as scope creep, semantic anchoring drift, and formatting collapse — long before they fail completely.
We formalize these failure patterns into four lightweight behavioral probes: Instruction Scope Violation (ISV), Semantic Anchoring Drift (SAD), Constraint Adherence Decay (CAD), and Format Collapse Rate (FCR). We demonstrate that these probes can detect the onset of failure in live production environments without requiring model internals, adding negligible latency to the inference pipeline.
1. Introduction
The prevailing paradigm in LLM evaluation assumes that a model's performance on human-curated prompt sets (e.g., MT-Bench, FollowBench) corresponds linearly to its reliability in production. Our empirical deployment data contradicts this. We observe that slight perturbations in lexical register, domain specificity, or constraint count can crater compliance rates for models that otherwise score well on static leaderboards.
Our primary contribution is methodological: we provide a framework for reasoning about, and computationally detecting, these degradation patterns without access to model weights or hidden states. We view this as a prerequisite for building reliable compound AI systems: the orchestrating system must be able to cleanly detect when a component model is entering an unreliable state.
"Alignment is often treated as a binary property: a model is aligned or it is not. In practice, alignment is a gradient, and its structure is learnable."
The remainder of this paper is organized as follows. Section 2 describes our experimental setup and the five distribution shift axes we examine. Section 3 introduces the four behavioral probes and their rationale. Section 4 presents results. Section 5 discusses detection methodology. Section 6 addresses limitations. Section 7 concludes.
2. Experimental Setup
2.1 Model Selection
We evaluated three open-weight model families, selecting one checkpoint per family to control for parameter count:
| Model Family | Checkpoint | Parameters | Context Window |
|---|---|---|---|
| Mistral | Mistral-7B-Instruct-v0.3 | 7.3B | 32k |
| LLaMA | LLaMA-3-8B-Instruct | 8.0B | 8k |
| Phi | Phi-3-mini-4k-instruct | 3.8B | 4k |
All models were evaluated without fine-tuning on our specific probes. We explicitly avoided evaluating models whose internal checkpoints we had access to — our goal is a methodology that generalizes, not results that flatter any particular architecture.
2.2 Distribution Shift Axes
We define distribution shift not as a single variable but as a structured space. We examine five axes:
- Lexical register shift — instructions written in formal, colloquial, technical, or non-native English
- Domain specificity shift — instructions referencing domains underrepresented in typical instruction-tuning data (e.g. legal, clinical, industrial)
- Instruction length shift — very short (≤15 tokens) vs. very long (≥200 tokens) instructions
- Constraint complexity shift — number of simultaneous constraints embedded in a single instruction (1 → 8)
- Temporal framing shift — instructions that reference hypothetical futures, counterfactuals, or nested conditionals
For each axis, we constructed 200 instruction pairs: one in-distribution (calibrated to closely match standard instruction-tuning data characteristics) and one shifted (deviating on the axis of interest while holding all other variables constant).
2.3 Evaluation Protocol
All model outputs were evaluated on four dimensions using our behavioral probes (described in Section 3). Each probe produces a score in $[0, 1]$; for ISV, SAD, and FCR, 0 indicates full compliance and 1 indicates complete failure, while CAD is oriented in reverse (1 indicates full compliance). Scores are averaged across the 200 instruction pairs per axis.
We did not use model-as-judge evaluation. All probe measurements are rule-based and deterministic — a deliberate design choice. Our goal is probes that can be deployed in production monitoring without incurring additional LLM inference costs.
3. Behavioral Probes
This section introduces the four probes. Each is designed to be lightweight: implementable in under 50 lines of Python, requiring no model internals, and applicable post-hoc to any text output.
3.1 Instruction Scope Violation (ISV)
Definition: ISV measures whether a model generates content that falls outside the scope explicitly defined by the instruction.
Formally, let $I$ be the instruction and $R$ the model response. Let $S(I)$ denote the semantic scope defined by $I$ — the set of valid response topics, formats, and constraints. ISV is defined as:

$$\mathrm{ISV}(I, R) = \frac{\left|\{\, s \in \mathrm{seg}(R) : s \notin S(I) \,\}\right|}{\left|\mathrm{seg}(R)\right|}$$

where $\mathrm{seg}(R)$ is the set of semantic segments of $R$ (identified using clause-level chunking), $|\cdot|$ denotes set cardinality, and membership in $S(I)$ is estimated by a binary semantic membership classifier.
A high ISV score indicates that the model is generating content outside the instruction's defined scope — a common early failure signal.
Implementation note: The semantic membership classifier is a fine-tuned DeBERTa-v3-base model. We provide weights in our supplementary repository.
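For illustration, the scoring loop can be sketched as follows, with sentence-level splitting standing in for clause-level chunking and a pluggable callable standing in for the membership classifier (both simplifications are assumptions of this sketch, not our reference implementation):

```python
import re
from typing import Callable

def isv_score(instruction: str, response: str,
              in_scope: Callable[[str, str], bool]) -> float:
    """Fraction of response segments judged outside the instruction's
    scope. `in_scope(instruction, segment)` is a stand-in for the
    fine-tuned membership classifier; sentence splits approximate
    clause-level chunking."""
    segments = [s for s in re.split(r'(?<=[.!?])\s+', response.strip()) if s]
    if not segments:
        return 0.0
    out_of_scope = sum(1 for s in segments if not in_scope(instruction, s))
    return out_of_scope / len(segments)
```

During prototyping, a simple keyword-overlap heuristic can be dropped in as `in_scope` before wiring up the actual classifier.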
3.2 Semantic Anchoring Drift (SAD)
Definition: SAD measures how much the semantic center of mass of the response drifts from the semantic center of the instruction over the course of the response.
This probe is designed to detect a specific failure mode: models that begin a response correctly but drift toward their training distribution as the response lengthens. We observe this most prominently under domain specificity shift — models start in the target domain and drift toward more familiar territory.
$$\mathrm{SAD}_k(I, R) = d_{\cos}\!\left(e(I),\, e(w_k)\right)$$

where $w_k$ is the $k$-th sliding window over the response, $e(I)$ is the sentence embedding of the instruction, $e(w_k)$ is the embedding of the response window $w_k$, and $d_{\cos}$ is cosine distance.
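The window-wise drift computation can be sketched as below, assuming embeddings produced by any sentence encoder (the encoder itself sits outside the probe):

```python
import numpy as np

def sad_curve(instr_emb: np.ndarray, window_embs: list) -> list[float]:
    """Cosine distance between the instruction embedding and each
    response-window embedding; a rising curve indicates drift."""
    def cos_dist(a: np.ndarray, b: np.ndarray) -> float:
        return 1.0 - float(np.dot(a, b) /
                           (np.linalg.norm(a) * np.linalg.norm(b)))
    return [cos_dist(instr_emb, np.asarray(w)) for w in window_embs]
```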
A high SAD score at late response windows indicates drift. We visualize this as a drift curve in Figure 1.

Figure 1. SAD drift curves for the domain specificity shift axis. Phi-3 exhibits the earliest drift onset (window 3–4). Mistral-7B maintains lower drift through window 6 before a sharp increase.
3.3 Constraint Adherence Decay (CAD)
Definition: CAD measures how constraint compliance degrades as the number of simultaneous constraints in an instruction increases.
Let $C(I) = \{c_1, \dots, c_n\}$ be the set of constraints extracted from instruction $I$. For each constraint $c_i$, let $\mathrm{comply}(c_i, R) \in \{0, 1\}$ be a binary compliance function. CAD is defined as the normalized compliance curve over $n$:

$$\mathrm{CAD}(n) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{comply}(c_i, R)$$
We plot $\mathrm{CAD}(n)$ for $n = 1, \dots, 8$ to observe the decay trajectory. In Figure 2, we show that all three model families exhibit near-linear decay once several constraints are stacked, with a steeper drop at higher constraint counts for models with shorter context windows.
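Given per-constraint compliance flags, the curve itself is a one-liner; a sketch (constraint extraction and the binary compliance checks are assumed to exist upstream):

```python
def cad(flags: list[int]) -> float:
    """Normalized compliance over n simultaneous constraints:
    the mean of the binary per-constraint compliance flags."""
    return sum(flags) / len(flags) if flags else 1.0

def cad_curve(flags_by_n: dict[int, list[int]]) -> dict[int, float]:
    """Decay trajectory: CAD(n) for each constraint count n."""
    return {n: cad(flags) for n, flags in sorted(flags_by_n.items())}
```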
Figure 2. Decision tree for constraint complexity risk based on CAD threshold values. Thresholds derived from empirical decay curves across evaluated model families.
3.4 Format Collapse Rate (FCR)
Definition: FCR measures the rate at which models abandon explicitly requested output formats under distribution shift.
This is the simplest of the four probes. If an instruction specifies a format — JSON, numbered list, markdown table, specific heading structure — and the model's response does not conform to that format, FCR increments.
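A minimal sketch of conformance checks for two common requested formats; production validators would cover many more, and the format labels here are illustrative rather than a canonical taxonomy:

```python
import json
import re

def conforms(response: str, fmt: str) -> bool:
    """Minimal conformance checks for two requested formats."""
    if fmt == 'json':
        try:
            json.loads(response)
            return True
        except json.JSONDecodeError:
            return False
    if fmt == 'numbered_list':
        lines = [ln for ln in response.splitlines() if ln.strip()]
        return bool(lines) and all(re.match(r'\s*\d+[.)]\s', ln) for ln in lines)
    raise ValueError(f'unknown format: {fmt}')

def fcr(responses: list[str], requested: list[str]) -> float:
    """Fraction of responses that abandon their requested format."""
    failures = sum(1 for r, f in zip(responses, requested) if not conforms(r, f))
    return failures / len(responses)
```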
FCR is notable because it is the most sensitive probe to lexical register shift. When instructions are written in non-standard English, FCR increases sharply even when constraint count is low. We hypothesize that format instructions are encoded in closer proximity to lexical register features in the instruction-following fine-tuning data — and therefore degrade together.
4. Results
4.1 Aggregate Compliance Scores
The table below reports mean probe scores for each model family under in-distribution (ID) and shifted (SH) conditions. Lower ISV, SAD, FCR and higher CAD indicate better compliance.
| Model | ISV (ID) | ISV (SH) | SAD (ID) | SAD (SH) | CAD (ID) | CAD (SH) | FCR (ID) | FCR (SH) |
|---|---|---|---|---|---|---|---|---|
| Mistral-7B | 0.08 | 0.31 | 0.14 | 0.47 | 0.91 | 0.63 | 0.06 | 0.24 |
| LLaMA-3-8B | 0.06 | 0.27 | 0.11 | 0.41 | 0.93 | 0.68 | 0.04 | 0.19 |
| Phi-3-mini | 0.13 | 0.44 | 0.22 | 0.61 | 0.84 | 0.52 | 0.11 | 0.38 |
Key findings:
- All three models exhibit significant compliance degradation under distribution shift across all four probes.
- The gap between in-distribution and shifted scores is consistently larger for ISV and SAD than for FCR and CAD, suggesting that scope violations and semantic drift are earlier failure signals than format collapse.
- Phi-3-mini shows the largest degradation gap despite competitive in-distribution performance — likely a function of its smaller parameter count and shorter context window.
4.2 Failure Mode Taxonomy
Through qualitative analysis of 120 randomly sampled failure cases, we identify three dominant failure patterns:
Pattern 1: Scope Creep. The model completes the instruction but continues generating beyond it, adding unrequested elaboration that frequently contradicts or dilutes the original instruction's intent. This is the most common failure mode under lexical register shift.
```python
# Example: Instruction with lexical register shift
instruction = "gimme a 3-item list of stuff to check before deploying an ml model"

# In-distribution-like response (compliant):
# 1. Data drift between training and production distributions
# 2. Model calibration on held-out validation set
# 3. Latency and memory footprint under target load conditions

# Shifted response (scope creep observed):
# 1. Data drift
# 2. Model calibration
# 3. Latency
# — Note: this is a simplified checklist. In practice, a robust
# deployment review should also consider [...continues for 340 tokens...]
```
Pattern 2: Domain Retreat. The model's response begins correctly in the target domain but gradually retreats toward more familiar territory. This is the primary manifestation of high SAD scores.
Pattern 3: Constraint Priority Inversion. When facing more constraints than it can hold simultaneously, the model does not drop constraints randomly — it preserves the most recently stated constraint and drops earlier ones. This suggests constraint handling is influenced by recency bias in the attention mechanism.
This finding has direct implications for instruction design: place the most important constraints last, not first, when instruction length approaches the model's effective working constraint capacity.
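One way to operationalize this recommendation when assembling instructions programmatically is to sort constraints so the highest-priority ones appear last (the priority scores here are hypothetical, supplied by the caller):

```python
def assemble_instruction(task: str, constraints: list[tuple[int, str]]) -> str:
    """Place higher-priority constraints later in the instruction,
    exploiting the recency effect described above. `constraints` is a
    list of (priority, text) pairs; larger priority = more important."""
    ordered = [text for _, text in sorted(constraints)]
    return task + '\n' + '\n'.join(f'- {c}' for c in ordered)
```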
5. Detection Methodology
The primary practical contribution of this paper is a detection pipeline for production environments. The pipeline operates in three stages:
Stage 1: Offline Calibration. Run all four probes on a representative sample of in-distribution instruction-response pairs from your specific deployment context. Establish baseline distributions for each probe score. Record the 90th percentile of each probe's score distribution as a soft threshold and the 99th percentile as a hard threshold.
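Stage 1 reduces to percentile computation over the baseline scores; a sketch, with the dict shape chosen to match the routing logic in Stage 3:

```python
import numpy as np

def calibrate(baseline_scores: dict[str, list[float]],
              soft_pct: float = 90.0,
              hard_pct: float = 99.0) -> dict:
    """Derive per-probe soft (90th pct) and hard (99th pct) thresholds
    from in-distribution baseline score distributions."""
    return {
        'soft': {probe: float(np.percentile(vals, soft_pct))
                 for probe, vals in baseline_scores.items()},
        'hard': {probe: float(np.percentile(vals, hard_pct))
                 for probe, vals in baseline_scores.items()},
    }
```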
Stage 2: Online Scoring. For each production response, compute all four probe scores. This adds approximately 40–80ms of latency per response (depending on response length) with no GPU dependency.
Stage 3: Alerting and Routing. Apply the following routing logic:
```python
def route_response(scores: dict, thresholds: dict) -> str:
    """
    Route a response based on probe scores.
    scores: dict with keys 'isv', 'sad', 'cad', 'fcr'
    thresholds: dict with 'soft' and 'hard' sub-dicts
    Returns: 'pass' | 'review' | 'flag'
    """
    # All scores are assumed higher-is-worse. Since CAD is oriented
    # higher-is-better (Section 4.1), pass 1 - CAD here and calibrate
    # its thresholds on 1 - CAD as well.
    hard_violations = sum(
        1 for probe, score in scores.items()
        if score > thresholds['hard'][probe]
    )
    soft_violations = sum(
        1 for probe, score in scores.items()
        if score > thresholds['soft'][probe]
    )
    if hard_violations >= 1:
        return 'flag'    # Immediate review — do not serve response
    elif soft_violations >= 2:
        return 'review'  # Queue for async human review
    else:
        return 'pass'    # Serve response normally
```

6. Limitations
This work has several important limitations that constrain the generality of its findings.
Scope of model families. We evaluate three open-weight models in the 3–8B parameter range. We make no claims about the behavior of larger models, closed-weight models, or models fine-tuned on proprietary instruction data. The degradation patterns we identify may not transfer.
Probe validity. Our probes are behavioral — they measure observable outputs, not underlying causes. A high ISV score is consistent with multiple causal explanations, including attention degradation, instruction representation collapse, and decoding artifacts. We do not disambiguate between these.
Threshold calibration sensitivity. The detection pipeline's utility depends heavily on the quality of offline calibration. Poorly representative calibration samples will produce thresholds that generate excessive false positives or insufficient true positives in production. We recommend calibration sets of at least 500 instruction pairs per deployment context.
Single-turn evaluation. All experiments are conducted on single-turn instruction-response pairs. Multi-turn behavior under distribution shift is a distinct and important problem that we do not address here.
7. Conclusion
Instruction-following models degrade under distribution shift. This is not surprising. What this paper adds is structure: a characterization of how that degradation manifests behaviorally, and a lightweight detection methodology that can be deployed without access to model internals.
The four probes introduced here — ISV, SAD, CAD, and FCR — are not comprehensive. They are starting points. Our intention is that they form the basis of a broader evaluation vocabulary for instruction-following reliability — one grounded in deployment reality rather than benchmark construction.
We are publishing probe implementations, calibration tooling, and the full evaluation dataset in our supplementary repository.
The failure modes documented here are not edge cases. They are the predictable consequences of deploying models in contexts their training data did not represent. Understanding them is not optional. It is the minimum requirement for responsible deployment.