PLUGIN SIGNALS
Feature Engineering Inventory
Repo context → Context pack → Plan artifact → Agent run → Outcome delta → Handoff summary → Next session
A structured representation of a context/planning plugin that helps explain, predict, and optimize outcomes across trials and user sessions. Two scopes, different signal sets. Plugin eval signals measure context quality and downstream agent delta. Plugin product signals measure memory, handoff, and session continuity.
This plugin does not replace the coding agent. It improves the agent's operating context. Its job is to gather relevant repo/session context, optionally maintain a plan artifact, detect milestones and blockers, preserve useful session memory, and generate handoff information so future sessions do not restart from zero.
Context-only
Gathers and packages repo context. No plan artifact. Signals focus on precision, recall, and bloat. Baseline scope — any plugin deployment has at least this role.
Context + planning artifact
Plan is first-class. Created, updated, and tracked through the session. Adds plan drift, milestone detection, and blocker signals on top of context signals.
Context + planning + routing
Plugin also selects tools, delegates sub-tasks, or manages agent topology. Full orchestrator role. All signals apply, plus tool-call delta and routing precision.
Plugin Session Lifecycle — plugin_session › phases[]
Each phase is a distinct plugin responsibility within a single session
pre-session
context_pack
gather repo context
select relevant files
build context payload
pre/mid-session
plan_state
create plan artifact
update mid-session
track phase transitions
mid-session
memory_update
surface prior facts
inject reminders
detect blockers
post-session
handoff_artifact
session-end summary
persist memory facts
next-session primer
measured
downstream_agent_outcome
delta vs baseline
pass rate change
cost delta
config
plugin_scope
context_strategy
plan_artifact_enabled
system_reminders_enabled
agent_id
model_id
Signal tiers
Computed — in the eval harness today; present in session JSONL or meta files
Target — defined in the schema, not yet emitted by the pipeline
Needs X — infrastructure gap; requires a baseline run, annotation, or an LLM judge
Plugin Configuration — the knobs applied to this session
These are the independent variables — the treatment applied to each plugin session. If you're comparing plugin configurations, these must be recorded per-session or you cannot do fair attribution. These exist in the plugin config and are available at session start.
Signal Source Type Status Description Validity / Pitfalls
plugin_scope
config.plugin_scope
Config enum Target
Scope of the plugin this session: context-only / context+plan / context+plan+routing. Determines which signal layers are applicable.
Gates interpretation of all other signals. A plan_drift signal on a context-only session is vacuous.
context_strategy
config.context_strategy
Config enum Extracted
Strategy selected by the plugin for this session: bypass / dump / focused / distill / trace. Different strategies have different expected precision/recall profiles.
Must be recorded to interpret context_precision and context_bloat_score. Dump strategy will always show low precision; that is expected.
plan_artifact_enabled
config.plan_artifact_enabled
Config bool Target
Whether plan creation is active this session. When false, all Layer 3 plan signals are not applicable.
Required gate for Layer 3 signals. Do not compute plan_drift_score when this is false.
system_reminders_enabled
config.system_reminders_enabled
Config bool Target
Whether soft-nudge injections (system reminders) are active this session. Affects reminder_injection_count in Layer 3.
Without this flag, reminder_injection_count is uninterpretable — zero injections might mean reminders are disabled, not that none were needed.
agent_id
config.agent_id
Config str Extracted
Downstream agent receiving the context pack from this plugin session. Required for any cross-agent comparison of plugin effectiveness.
Plugin outcomes are agent-specific. A context pack that helps agent A may not help agent B with a different context window or tool set.
model_id
config.model_id
Config str Target
Model used by the plugin for context building and plan generation. Include the full model string — provider + version.
Separates plugin model quality from downstream agent model quality when attributing outcome deltas.
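The independent variables above might be recorded per session along these lines. This is a minimal sketch: field names come from this inventory, the agent and model strings are placeholders, and plan_signals_applicable is a hypothetical helper illustrating the Layer 3 gate, not part of any defined schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PluginSessionConfig:
    """Per-session treatment record (field names from this inventory)."""
    plugin_scope: str               # "context-only" | "context+plan" | "context+plan+routing"
    context_strategy: str           # "bypass" | "dump" | "focused" | "distill" | "trace"
    plan_artifact_enabled: bool
    system_reminders_enabled: bool
    agent_id: str
    model_id: str                   # full provider + version string

    def plan_signals_applicable(self) -> bool:
        # Layer 3 plan signals are gated on both scope and the plan flag.
        return self.plan_artifact_enabled and self.plugin_scope != "context-only"

cfg = PluginSessionConfig(
    plugin_scope="context+plan",
    context_strategy="focused",
    plan_artifact_enabled=True,
    system_reminders_enabled=False,
    agent_id="agent-a",                    # placeholder id
    model_id="example-provider/model-v1",  # placeholder, not a real model string
)
```

Recording this record at session start, before any signal is computed, is what makes per-configuration attribution fair later.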
Context Behavior — what the plugin selected and what the agent used
Captures what the plugin selected versus what the agent actually used. These signals answer the core context quality question: did the plugin give the agent the right files? High-value signals are context_precision, context_recall, and context_token_cost.

Key tension: high precision + low recall = over-filtering (plugin picked too few files but they were relevant). Low precision + high recall = bloat (plugin grabbed everything including noise). Both are failure modes with different costs.
Plugin session lifecycle phases — context behavior across modes
Optimal
context_pack plan_create agent_run verify handoff
Context-only
context_pack agent_run verify no plan artifact — Layer 3 signals not applicable
Failed / no context
agent_run verify_fail plugin bypassed or errored — baseline condition
Signal Source Type Status Description Validity / Pitfalls
context_files_selected
plugin.context_files_selected
Plugin list[str] Extracted
Files the plugin selected for the context pack before the agent acted. The raw selection list from the plugin's context-building phase.
Foundation for all derived context quality signals. Without this list, precision and recall cannot be computed.
context_files_used_by_agent
harness.context_files_used_by_agent
Harness list[str] Needs session JSONL
Files the agent actually read or modified during the session. Compare with context_files_selected to derive precision and bloat.
Requires parsing agent session JSONL for file access events. Not the same as touched_files — includes reads that did not result in edits.
missed_relevant_files
derived.missed_relevant_files
Derived list[str] Needs artifact
Files proven relevant but omitted by the plugin. Requires ground-truth relevant_files in the task spec or gold patch to compute.
Requires ground-truth annotation. In eval settings, gold patch files provide this. In product settings, needs LLM judge or manual annotation.
context_precision
derived.context_precision
Derived float Computed
Fraction of plugin-selected files actually read or edited by the agent. |selected ∩ agent_used| / |selected|. Supporting signal — always pair with coverage. High precision + low coverage = over-filtering. Prefer first_tool_precision as the primary KPI.
Can look artificially high if plugin selects very few files. A plugin returning 0 files scores 100% precision — always read alongside coverage. Downgraded from ★: use first_tool_precision as the primary context-quality KPI instead.
context_recall
derived.context_recall
Derived float Computed
Fraction of agent-used files that were pre-selected by the plugin. |selected ∩ agent_used| / |agent_used|. Measures how much of what the agent actually needed was pre-covered. Not the same as ground-truth recall — see context_relevant_recall (target) for that.
Computed from session JSONL — no annotation required. High coverage = plugin anticipated the agent's file needs well. Low coverage = agent explored significantly beyond the plugin's selection.
context_relevant_recall
derived.context_relevant_recall
Derived float Target
Ground-truth recall: |selected ∩ relevant| / |relevant|. Fraction of annotated-relevant files the plugin actually selected. Requires a relevant_files ground-truth list from the task YAML or gold patch. Compare with context_recall (which uses agent-used as proxy for relevant) to measure how well agent behavior proxies ground truth.
Requires annotation — not computable from session JSONL alone. Use context_recall as the zero-annotation proxy while building the annotation pipeline. Null when relevant_files is empty.
context_bloat_score
derived.context_bloat_score
Derived float Computed
Fraction of selected files that were never accessed by the agent. |selected − agent_used| / |selected|. High bloat wastes agent context window.
1 − context_precision. High bloat is always costly — unused context occupies token budget without benefit.
context_noise_ratio
derived.context_noise_ratio
Derived float Needs artifact
Percentage of irrelevant chunks in the context pack at the chunk/passage level (not just file level). A file can be selected but only partially relevant.
Finer-grained than bloat_score. Requires chunk-level relevance annotation. Use file-level signals first.
context_token_cost
harness.context_token_cost
Harness int Extracted
Tokens consumed by context injection into the agent prompt. Key cost signal — directly sets the floor for agent input token cost each session.
Primary lever for cost optimization. Pair with context_precision to distinguish cheap-but-useful from cheap-but-useless context packs.
file_selection_latency_ms
harness.file_selection_latency_ms
Harness int Extracted
Time spent building the context pack before the agent starts. Adds to plugin_latency_ms. High values suggest expensive retrieval or embedding.
Direct user-perceived latency cost of the plugin. The latency vs. context quality tradeoff is a core plugin design problem.
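The file-level derivations above (precision, recall, bloat) follow directly from the two file sets. A minimal sketch: context_quality is a hypothetical helper name, and returning None for an empty selection is one way to sidestep the "0 files scores 100% precision" artifact flagged in the pitfalls.

```python
def context_quality(selected, agent_used):
    """Derive the file-level Layer 2 signals.

    selected   -- plugin.context_files_selected, as a set of paths
    agent_used -- harness.context_files_used_by_agent (reads + edits)
    """
    overlap = selected & agent_used
    return {
        # |selected ∩ used| / |selected|; None when nothing was selected.
        "context_precision": len(overlap) / len(selected) if selected else None,
        # |selected ∩ used| / |used|: how much of the agent's need was pre-covered.
        "context_recall": len(overlap) / len(agent_used) if agent_used else None,
        # |selected − used| / |selected| = 1 − context_precision.
        "context_bloat_score": len(selected - agent_used) / len(selected) if selected else None,
    }
```

Note that all three share denominators with each other, which is why precision and bloat always sum to 1 when the selection is non-empty.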
Planning / Decision Support — how the plugin shapes agent decisions
Signals from the plugin's planning and decision-support layer: plan artifacts, milestone detection, blockers, and reminder injections. Only applicable when plugin_scope includes planning (context+plan or context+plan+routing).

Note on plan drift: drift is not always bad. Legitimate task discovery causes benign drift. Flag plan_drift_score only when correlated with downstream failure, not as a standalone quality signal.
Signal Source Type Status Description Validity / Pitfalls
plan_artifact_created
plugin.plan_artifact_created
Plugin bool Needs artifact
Whether plugin created a persistent plan object this session. Only meaningful in context+planning scope.
Gate for all other plan signals. If false, plan_updated_count, plan_drift_score, and plan_phase_transitions are not applicable.
plan_updated_count
plugin.plan_updated_count
Plugin int Needs artifact
Number of times the plan was updated mid-session. Zero updates may indicate the plugin stopped tracking, or the initial plan was correct.
Interpret alongside plan_drift_score. Many updates + low drift = healthy iterative refinement. Few updates + high drift = plan was abandoned silently.
plan_drift_score
derived.plan_drift_score
Derived float Needs artifact
Divergence between the initial plan and the final path taken. 0 = no drift. Computed as edit distance between initial and final plan state.
Drift is not always bad — legitimate discovery causes benign drift. Flag only when correlated with downstream failure. Never use alone as a quality signal.
milestone_detected
plugin.milestone_detected
Plugin bool Target
Whether plugin detected meaningful progress or milestone completion during the session. Leading indicator of session trajectory.
Positive signal when correlated with successful downstream outcome. False + successful outcome may indicate the plugin's milestone detection is too conservative.
blocker_detected
plugin.blocker_detected
Plugin bool Target
Whether plugin flagged a blocking issue during the session. Blockers should correlate with downstream failure or retry events.
Calibrate against downstream outcome. High false-positive blocker rate = plugin is too conservative and interrupts good sessions.
plan_phase_transitions
plugin.plan_phase_transitions
Plugin list[str] Needs artifact
Sequence of plan phase changes observed during the session. Analogous to plannotator phases in the agent — tracks plugin's view of session progress.
Structural, no inference required if the plugin emits phase events. Valuable for understanding plugin behavior across session types.
reminder_injection_count
plugin.reminder_injection_count
Plugin int Target
Number of soft-nudge reminders injected into the agent context during the session. Only meaningful when system_reminders_enabled = true.
High counts may indicate the agent is repeatedly deviating from the plan. Correlate with plan_drift_score and downstream outcome.
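One possible realisation of plan_drift_score as a step-level edit distance, using difflib's similarity ratio over the ordered plan steps. The exact distance metric is an assumption; the inventory only specifies "edit distance between initial and final plan state".

```python
from difflib import SequenceMatcher

def plan_drift_score(initial_plan, final_plan):
    """0.0 = no drift, 1.0 = complete divergence.

    Both arguments are ordered lists of plan-step strings. The score is
    1 minus the SequenceMatcher similarity ratio over the step sequences.
    """
    if not initial_plan and not final_plan:
        return 0.0  # no plan either way: vacuously no drift
    return 1.0 - SequenceMatcher(None, initial_plan, final_plan).ratio()
```

As the pitfalls note says, a high score here is only meaningful when read alongside plan_updated_count and the downstream outcome.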
Downstream Outcome Attribution — what did the plugin actually change
The plugin's value is measured by what it changes in downstream agent outcomes. These signals require a clean baseline run (same agent, same tasks, no plugin) to compute deltas.

Key constraint: system noise can obscure deltas smaller than ~5 percentage points. Run at least 3 trials per condition before drawing conclusions. A single-run delta is not attributable.
Signal Source Type Status Description Validity / Pitfalls
agent_outcome_delta
derived.agent_outcome_delta
Derived float Needs baseline
Pass rate improvement vs baseline (same agent, no plugin). Positive = plugin helped. Computed as pass_at_1 − pass_at_1_baseline.
Requires a clean baseline run. System noise can obscure deltas < 5pp. Run ≥3 trials per condition. A single-run delta is not attributable to the plugin.
pass_at_1
harness.pass_at_1
Harness float Extracted
P(success in first attempt) with plugin active. The primary eval metric for plugin-assisted sessions.
Only interpretable as a quality signal when compared against pass_at_1_baseline from a clean no-plugin run.
pass_at_1_baseline
harness.pass_at_1_baseline
Harness float Needs baseline
P(success in first attempt) without plugin. Run baseline condition separately on the same task set and agent configuration.
Must use the same agent, model, task set, and tool caps. Any difference in those confounds the comparison.
resolve_rate_delta
derived.resolve_rate_delta
Derived float Needs baseline
Percentage point change in fully resolved tasks with vs without plugin. Broader than pass_at_1 if retries are counted.
Distinguish from pass_at_1_delta: resolve_rate includes retry successes. Track both for full picture.
retry_count_delta
derived.retry_count_delta
Derived float Needs baseline
Change in average retries needed with vs without plugin. Negative = plugin reduces retries. Plugin that improves pass@1 but not retry count has narrow impact.
Negative delta is the target. A plugin that reduces retries saves significant cost even if overall resolve rate is similar.
verification_pass_delta
derived.verification_pass_delta
Derived float Needs baseline
Change in test-pass rate attributable to plugin. Finer-grained than resolve_rate_delta — tracks partial improvement where some tests now pass.
Only meaningful in eval contexts with a test harness. Not applicable in pure product sessions.
faithfulness_score
llm-judge.faithfulness_score
LLM-judge float Needs LLM judge
Percentage of agent claims supported by plugin-provided context. High faithfulness = agent stayed grounded in the context pack.
LLM-as-judge can hallucinate judgements. Calibrate against 10% human spot-check before trusting at scale.
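The agent_outcome_delta arithmetic with the three-trial floor enforced can be sketched as follows. The function name is hypothetical; inputs are per-trial pass_at_1 values for each condition.

```python
from statistics import mean

def agent_outcome_delta(plugin_pass, baseline_pass, min_trials=3):
    """pass_at_1 − pass_at_1_baseline, averaged over trials.

    Returns None when either condition has fewer than min_trials runs,
    since a single-run delta is not attributable to the plugin.
    """
    if len(plugin_pass) < min_trials or len(baseline_pass) < min_trials:
        return None
    return mean(plugin_pass) - mean(baseline_pass)
```

Callers should treat None as "insufficient data", not as a zero delta.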
Decision Quality — causal signals on how well the agent navigated the task
These signals are computed directly from the agent session JSONL — no extra annotation required. They distinguish good navigation (direct path to the right file, error handled and recovered) from poor navigation (repeated failed calls, thrashing across files, spinning in a loop). They are causal: a plugin that improves these numbers is provably improving the agent's decision-making, not just its final output.
Signal Source Type Status Description Validity / Pitfalls
first_tool_precision
derived.first_tool_precision
Derived float Needs annotation
Was the agent's first file read/search a relevant file? 1.0 = yes, 0.0 = no. Requires relevant_files in task spec. The fastest proxy for context quality — a plugin that improves this is directly reducing wasted tool calls.
Requires ground-truth annotation in the task YAML. Only applicable when relevant_files is populated. Binary 0/1 — aggregate across many tasks for a stable metric.
time_to_first_correct_file_read
derived.time_to_first_correct_file_read
Derived int (ms) Needs annotation
Milliseconds from session start until the agent first reads a ground-truth relevant file. Requires relevant_files annotation. A delta here directly measures how much the plugin shortened the orientation phase.
Same annotation dependency as first_tool_precision. Sensitive to task complexity — normalize within task family when aggregating. Null if no relevant file was ever read.
error_loop_detected
derived.error_loop_detected
Derived bool Extracted
True if the agent called the same (tool, input) pair 3+ times in a session. A strong indicator of a stuck loop. Computed from session JSONL with no extra annotation.
False positives possible for legitimate repeated operations (e.g. polling). The 3-call threshold is a heuristic; tune it per task type if needed.
error_recovery_turns
derived.error_recovery_turns
Derived int Extracted
Number of turns between the first error recovery milestone and the next verification success. Null if no recovery milestone exists. Measures recovery efficiency — lower is better.
Null for sessions that never hit an error recovery milestone. Compare across sessions only when both have the milestone.
decision_reversal_count
derived.decision_reversal_count
Derived int Extracted
Number of times the agent switched its edit target to a different file. Each A→B file switch = +1. Measures thrashing — a plugin that reduces this is genuinely reducing wasted work.
Benign for naturally multi-file tasks. Combine with total edit count and wallClockMs to contextualize. A high count in a single-file task is a red flag.
tool_error_rate
derived.tool_error_rate
Derived float Extracted
Fraction of all tool calls that returned a non-zero exit code. High error rate = agent is guessing rather than reasoning. Computed from session JSONL exit codes.
Includes expected errors (e.g. a test run that fails as the baseline). Pair with error_recovery_turns to distinguish tolerable vs. pathological error rates.
retry_after_error_count
derived.retry_after_error_count
Derived int Extracted
Number of times a failed tool call was immediately followed by the same call with the same input. "Blind retry" pattern — the agent saw an error and repeated the action unchanged.
Low-count false positives possible if identical retries are intentional (e.g. network retry logic in bash). One or two is normal; five or more in a session warrants inspection.
wrong_path_depth
derived.wrong_path_depth
Derived int Not yet implemented
Number of tool calls made after the agent committed to an incorrect approach, before recovering or failing. Measures depth of wrong turns. Requires "wrong approach" detection heuristic or LLM judge.
Requires identifying when an agent is "on the wrong path" — non-trivial without ground truth. Defer to Phase 2 unless gold trajectory is available.
first_decision_correct
derived.first_decision_correct
Derived bool Needs annotation
Whether the agent's first substantive action (first edit or first tool call) was on a relevant file. Coarser companion to first_tool_precision, with the same relevant_files annotation dependency.
Requires ground-truth annotation. Equivalent to first_tool_precision > 0 — prefer that signal for aggregation. Useful only as a boolean breakdown in task-level analysis.
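The two zero-annotation loop signals above can be derived from the ordered tool-call stream roughly as follows. loop_signals is a hypothetical helper, and the (tool, input, exit_code) triple shape is an assumption about what the session JSONL parse yields.

```python
from collections import Counter

def loop_signals(tool_calls, loop_threshold=3):
    """tool_calls: ordered (tool, input, exit_code) triples from the session.

    Derives error_loop_detected (same (tool, input) pair called
    loop_threshold or more times anywhere in the session) and
    retry_after_error_count (failed call immediately repeated unchanged).
    """
    counts = Counter((tool, inp) for tool, inp, _ in tool_calls)
    error_loop_detected = any(c >= loop_threshold for c in counts.values())

    retry_after_error = 0
    for prev, cur in zip(tool_calls, tool_calls[1:]):
        # blind retry: non-zero exit followed by the identical call
        if prev[2] != 0 and cur[:2] == prev[:2]:
            retry_after_error += 1
    return {
        "error_loop_detected": error_loop_detected,
        "retry_after_error_count": retry_after_error,
    }
```

Both values are per-session; aggregate across tasks before comparing plugin configurations.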
Context Cost / Bloat — what the plugin costs to run
Plugin overhead in latency, tokens, and dollars. A well-designed plugin should reduce net token usage by giving the agent better context upfront, reducing thrashing and retries. Track both plugin-added cost and plugin-saved cost to get the true delta.
Signal Source Type Status Description Validity / Pitfalls
plugin_latency_ms
harness.plugin_latency_ms
Harness int Extracted
Total plugin overhead (context-pack build + plan setup) before the agent starts. Directly adds to user-perceived time-to-first-agent-action.
Latency + quality tradeoff is the core plugin design question. A 2s plugin that improves pass@1 by 10pp is different from a 10s plugin that improves it by 2pp.
context_pack_tokens
harness.context_pack_tokens
Harness int Extracted
Token cost of the context pack injected into the agent prompt. Sets the floor for agent input token cost this session.
Same as context_token_cost in Layer 2 — included here as the primary cost accounting signal for Layer 5. Track alongside tokens_per_task_delta.
tokens_per_task_delta
derived.tokens_per_task_delta
Derived int Needs baseline
Change in total token usage with vs without plugin. Negative = plugin saves tokens (agent thrashes less because context is better). Positive = plugin adds net cost.
The most important cost signal. A plugin that adds 2k tokens upfront but saves 15k in agent exploration is a net win. Never evaluate cost without this delta.
cost_per_task_delta
derived.cost_per_task_delta
Derived float Needs baseline
USD cost delta attributable to plugin overhead plus downstream savings. The bottom-line cost signal.
Requires accurate per-token cost accounting for both plugin model and downstream agent model. Include both in calculation.
tool_calls_per_task_delta
derived.tool_calls_per_task_delta
Derived int Needs baseline
Change in number of agent tool calls with vs without plugin. A good context pack reduces thrashing — fewer redundant reads and grep loops.
Proxy for agent exploration efficiency. Negative delta = plugin reduces thrashing. Positive = plugin may be confusing the agent.
plugin_error_rate
harness.plugin_error_rate
Harness float Extracted
Rate of plugin failures or timeouts. A plugin that fails silently is worse than one that fails loudly — the agent runs without context and may not know it.
High error rate invalidates all other plugin signals for affected sessions. Always filter on plugin_error_rate before computing quality metrics.
context_growth_curve
derived.context_growth_curve
Derived list[int] Needs session JSONL
Token count growth over session turns. A good plugin should flatten this curve — agent needs fewer additional reads because context was pre-loaded.
Compare shape across plugin vs no-plugin baseline. Flatter curve in plugin condition = plugin is substituting for in-session exploration.
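The cost_per_task_delta arithmetic, combining plugin overhead with the downstream agent token change, can be sketched like this. Per-token prices here are illustrative inputs, not real provider rates, and the parameter names are assumptions.

```python
def cost_per_task_delta(
    plugin_model_tokens,        # tokens spent by the plugin model this task
    plugin_usd_per_token,       # illustrative price for the plugin model
    agent_tokens_with_plugin,   # downstream agent tokens, plugin condition
    agent_tokens_baseline,      # downstream agent tokens, no-plugin baseline
    agent_usd_per_token,        # illustrative price for the agent model
):
    """USD delta = plugin overhead cost + downstream agent cost change.

    Negative means the plugin pays for itself: upfront context cost is
    outweighed by reduced agent exploration.
    """
    plugin_overhead = plugin_model_tokens * plugin_usd_per_token
    agent_delta = (agent_tokens_with_plugin - agent_tokens_baseline) * agent_usd_per_token
    return plugin_overhead + agent_delta
```

This is the "2k tokens upfront, 15k saved in exploration" calculation from the tokens_per_task_delta pitfalls note, expressed in dollars and with both models priced separately.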
Context-Mode Runtime — MCP hooks, routing, compaction, think-in-code
Signals specific to the context-mode MCP+hooks runtime — the deeper plugin operating mode where Claude Code hooks intercept raw tool calls, substitute ctx_* commands, maintain a SQLite/FTS5 session store, and enforce routing rules.

All signals in this layer are Target — they require the context-mode runtime to emit structured telemetry. Once emitted, most are computable directly from the hook event log with no annotation.

Six signal families: routing/sandbox (tool interception), retrieval/index (BM25 search quality), continuity/compaction (ctx snapshot and restore), think-in-code (ctx_execute adoption), output compression (verbosity reduction), platform capability (hook availability and degraded-mode detection).
Signal Source Type Status Description Validity / Pitfalls
Routing & Sandbox
tool_interception_rate
runtime.tool_interception_rate
Harness float Target
Fraction of raw tool calls routed through the context-mode hook layer rather than executed directly. 1.0 = full interception; <1.0 = hook missed some calls (hook coverage gap).
Requires hook event log. Low rate may indicate hooks not registered, partial env support, or agent bypassing the MCP layer. Should be 1.0 in a fully-wired context-mode session.
ctx_substitution_rate
runtime.ctx_substitution_rate
Harness float Target
Fraction of intercepted tool calls that were substituted with a ctx_* command (e.g. ctx_read, ctx_search, ctx_execute) rather than passed through unchanged.
High rate = context-mode is actively enriching the session. Low rate = hooks are running but not substituting — check routing rules and command recognition patterns.
blocked_command_rate
runtime.blocked_command_rate
Harness float Target
Fraction of tool calls blocked by routing rules (e.g. disallowed shell commands, out-of-scope writes). Measures sandbox enforcement effectiveness.
High rate on legitimate commands = over-restrictive routing. Low rate = sandbox may not be enforced. Pair with agent outcome delta to check if blocking is helping or hurting.
Retrieval & Index
search_hit_rate
runtime.search_hit_rate
Harness float Target
Fraction of ctx_search calls that returned at least one result. Low rate indicates the FTS5/BM25 index is stale, missing, or queries are poorly formed.
Always compare to context_recall: a high search hit rate with low recall means the index returns results but they are not the right ones.
bm25_reuse_rate
runtime.bm25_reuse_rate
Harness float Target
Fraction of ctx_search calls that were served from the cached BM25 index without re-indexing. Proxy for index freshness management cost.
High reuse rate + low hit rate = stale index. Reindex threshold needs tuning. Track alongside file change rate to calibrate.
fetch_cache_hit_rate
runtime.fetch_cache_hit_rate
Harness float Target
Fraction of ctx_fetch calls served from the session fetch cache. Measures how effectively repeated URL fetches are deduplicated across a session.
Low rate on repeated identical fetches = cache is not keying correctly. Track alongside fetch latency to understand cost savings.
Continuity & Compaction
snapshot_created
runtime.snapshot_created
Harness bool Target
Whether the context-mode runtime created a session snapshot before compaction fired. Gate signal for all compaction-recovery signals.
If false and compaction fired, the agent resumes cold. Track rate across sessions — target is 1.0 on all sessions that reach compaction threshold.
restore_success
runtime.restore_success
Harness bool Target
Whether the agent successfully restored working context from the snapshot after compaction. Null if no compaction event occurred in the session.
Only meaningful when snapshot_created = true and a compaction event is detected. A restore failure = cold restart despite snapshot existing (injection bug).
time_to_first_productive_action_post_compaction
runtime.ttfpa_post_compaction_ms
Harness int (ms) Target
Milliseconds from compaction event to first non-orientation tool call (first file edit or meaningful search). Measures how quickly the agent recovers context after compaction. Null if no compaction occurred.
Compare against no-snapshot baseline (same agent, same session, snapshot disabled). Delta is the plugin's continuity contribution. Sensitive to task complexity — normalize within task family.
compaction_event_count
runtime.compaction_event_count
Harness int Target
Number of context compaction events detected in the session. Zero on short sessions; ≥1 indicates the session reached the model's context limit.
Useful for segmenting sessions: compaction sessions are a distinct analysis cohort where continuity signals matter most.
Think-in-Code
ctx_execute_adoption_rate
runtime.ctx_execute_adoption_rate
Harness float Target
Fraction of agent bash calls replaced by ctx_execute (think-in-code mode). High adoption = agent is using the structured execution path with result capture and deduplication.
Low adoption with hooks active = routing rules for bash substitution are not triggering. Check command pattern matching. Pair with file_read_avoidance_ratio.
file_read_avoidance_ratio
runtime.file_read_avoidance_ratio
Harness float Target
Fraction of files that the agent did not need to read because they were pre-loaded into context by the plugin. Measures how much redundant file I/O the plugin eliminates.
Requires tracking which files were in the context pack vs which the agent attempted to read. High ratio = plugin is successfully front-loading the right files.
Output Compression
assistant_token_delta
runtime.assistant_token_delta
Harness int Target
Change in total assistant output tokens with output compression active vs baseline (no compression). Negative = compression is reducing verbose narration. Positive = compression prompts are adding overhead.
Requires a no-compression baseline on the same task. Token savings must be weighed against any quality loss — track alongside pass_at_1 to confirm compression does not regress outcomes.
verbosity_ratio
runtime.verbosity_ratio
Harness float Target
Ratio of assistant reasoning tokens to action tokens (tool calls + edits). High verbosity = agent is narrating heavily relative to acting. Context-mode's output compression system targets this ratio.
Computed from session JSONL — split assistant messages into reasoning vs tool-call content. A declining trajectory after the first few turns is the target pattern.
Platform Capability
hook_availability
runtime.hook_availability
Harness bool Target
Whether the Claude Code hooks system was available and registered at session start. Gate for all routing and think-in-code signals. False = context-mode degraded to context-only.
Must be recorded per session. Always segment analysis by this flag — mixing hook-available and hook-unavailable sessions masks the plugin's true capability delta.
degraded_mode_active
runtime.degraded_mode_active
Harness bool Target
Whether the plugin detected an unsupported platform and fell back to a degraded operating mode (e.g. no hooks, read-only context pack, no routing enforcement). Sessions in degraded mode should be excluded from full-signal analysis.
Degraded-mode sessions are not comparable to full-capability sessions. Always filter before computing routing, think-in-code, or compression signals.
mcp_server_latency_ms
runtime.mcp_server_latency_ms
Harness int Target
Round-trip latency of MCP server calls from the hook layer. High latency here adds directly to the agent's perceived tool call time. Separate from plugin_latency_ms which covers context-pack build time.
Track p50 and p95 separately — occasional spikes matter more than mean latency since a slow MCP call blocks the entire agent turn.
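The p50/p95 tracking suggested for mcp_server_latency_ms needs nothing beyond the stdlib. latency_summary is a hypothetical helper; the index arithmetic assumes quantiles(n=100), which returns 99 cut points.

```python
from statistics import quantiles

def latency_summary(latencies_ms):
    """p50 and p95 of MCP round-trip latencies, reported separately
    because occasional spikes matter more than the mean."""
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50_ms": cuts[49], "p95_ms": cuts[94]}
```

Compute this per session and watch the p95 trend; a rising p95 with a flat p50 is the spike pattern the pitfalls note warns about.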
Intent fit, memory, and handoff — the product-side plugin signals
These signals apply to multi-turn, user-in-the-loop sessions where plugin memory and handoff matter. Context fit, memory reuse, and session continuity are the dominant questions here; they are largely irrelevant in deterministic one-shot eval runs. Signals in this tab are always null for pure eval runs and should not be collected there.

Memory caveat: stale or contradicting memories surfaced by the plugin can actively harm session quality. Track stale_memory_hit_count alongside repo_memory_hit_count. A plugin that surfaces many memories but most are stale may be worse than one that surfaces fewer but more accurate ones.
Intent / Context Fit — did the plugin understand what was actually needed
Did the plugin select context that matched the actual task intent? Did the context pack reflect the right problem framing, or did it surface related-but-wrong files? These signals are derived post-session and require either a session artifact or an LLM-as-judge pass. They answer a different question than precision/recall: not just "were the right files selected" but "did the plugin understand why."
Signal Source Type Status Description Validity / Pitfalls
context_relevancy_score
derived.context_relevancy_score
Derived float Needs artifact
How well selected context matches the actual task intent (RAGAS-style). Measures semantic alignment between context pack contents and the task description.
RAGAS-style metrics require an embedding model. Can be computed without LLM judge. Best used as a continuous signal rather than binary threshold.
task_intent_match
llm-judge.task_intent_match
LLM-judge float Needs LLM judge
Did the context pack reflect the right problem framing? Requires an LLM judge with access to the task ground truth and the context pack contents.
Requires LLM judge with task ground truth. More expensive than embedding-based relevancy. Use context_relevancy_score as a cheaper proxy first.
instruction_adherence
derived.instruction_adherence
Derived float Needs session JSONL
Did the agent follow the plan structure the plugin provided? Measures alignment between plugin-provided plan and agent's actual action sequence.
Low adherence may mean the plan was wrong (plugin failure) or the agent ignored it (agent failure). Pair with agent_outcome_delta to distinguish.
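One way to derive instruction_adherence from the session JSONL is a longest-common-subsequence alignment between plan steps and the agent's action sequence; a sketch under the assumption that both are available as ordered string lists (the function and argument names are illustrative):

```python
def instruction_adherence(plan_steps, agent_actions):
    """LCS alignment between the plugin's plan and the agent's actual
    action sequence, normalized by plan length to [0, 1]."""
    m, n = len(plan_steps), len(agent_actions)
    # classic dynamic-programming LCS table
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if plan_steps[i] == agent_actions[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / m if m else 0.0
```

Normalizing by plan length means skipped plan steps lower the score, while extra agent actions do not; swap the denominator if the opposite penalty is wanted.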
plan_optimality
llm-judge.plan_optimality
LLM-judge float Needs LLM judge
How good was the plugin's plan vs. the ideal path actually taken? Requires comparing the plugin's initial plan to the successful solution trajectory.
LLM-as-judge can hallucinate. Even top models perform poorly on plan quality evaluation. Requires careful judge design with explicit rubric.
context_fit_by_strategy
derived.context_fit_by_strategy
Derived dict Needs artifact
Precision and recall breakdown per context strategy used. Answers: does focused strategy outperform distill on repo-level tasks?
Requires stratifying sessions by context_strategy. Minimum ~20 sessions per strategy for statistically meaningful comparison.
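The context_relevancy_score computation can be sketched as mean similarity between the task description and each chunk in the context pack. The toy bag-of-words `embed` below is a stand-in so the sketch is self-contained; a real pipeline would use a sentence-embedding model as the RAGAS-style note above says:

```python
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words vector; replace with an embedding model in practice."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def context_relevancy_score(task_description, context_chunks):
    """Mean task-to-chunk similarity over the context pack."""
    task_vec = embed(task_description)
    sims = [cosine(task_vec, embed(chunk)) for chunk in context_chunks]
    return sum(sims) / len(sims) if sims else 0.0
```

Because every chunk contributes to the mean, bloated packs with many off-topic chunks are penalized automatically.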
Memory + Handoff Dynamics — how the plugin preserves and transfers knowledge
Memory and handoff are the plugin's mechanism for avoiding cold-start on every session. These signals measure whether the plugin's memory system is surfacing useful prior facts and whether handoff artifacts enable the next session to resume effectively.

Key caveat: handoff quality depends partly on the downstream agent. Compare against a no-handoff baseline (same agent, no summary injected) before attributing session continuity improvements to the plugin's handoff artifact quality.
Signal Source Type Status Description Validity / Pitfalls
handoff_summary_created
plugin.handoff_summary_created
Plugin bool Target
Whether plugin produced a session-end summary artifact. Gate for all handoff quality signals — if false, handoff_success and prior_session_reuse are not applicable.
Track rate of handoff creation across sessions. Low rate may indicate plugin is timing out or failing at session close.
handoff_success
derived.handoff_success
Derived float Needs next-session data
Whether the next session resumed effectively using the handoff summary. Measured as delta in time-to-first-productive-action vs a no-handoff baseline.
Depends partly on the downstream agent; compare against the no-handoff baseline (same agent, no summary injected). Requires cross-session linking. The most expensive signal to compute, but the highest signal quality.
repo_memory_hit_count
plugin.repo_memory_hit_count
Plugin int Target
Prior session facts successfully reused in the current session. Each hit = a memory item was surfaced and used by the agent.
Interpret alongside stale_memory_hit_count. High total hits but also high stale hits = memory precision problem. Net useful hits = repo_memory_hit_count − stale_memory_hit_count.
stale_memory_hit_count
plugin.stale_memory_hit_count
Plugin int Target
Old or misleading memories surfaced by the plugin. High stale count degrades performance — agent receives outdated context and may act on it.
High stale count is actively harmful, not neutral. Track as a quality regression signal, not just informational.
retrieval_recall
derived.retrieval_recall
Derived float Needs artifact
Percentage of relevant prior memory surfaced by the plugin. Requires annotated ground truth of which prior memories were relevant to this session.
Requires ground-truth annotation across sessions. Start with heuristic proxies (time recency, file overlap) before full annotation.
retrieval_precision
derived.retrieval_precision
Derived float Needs artifact
Percentage of surfaced memory that was actually useful in this session. High precision + low recall = memory is conservative but accurate.
Pair with retrieval_recall. Both together give the full picture of memory quality. Use repo_memory_hit_count as a cheaper proxy while building annotation pipeline.
memory_conflict_resolution
derived.memory_conflict_resolution
Derived bool Needs artifact
Plugin correctly detected and resolved contradicting facts across sessions. Binary: did the plugin surface the more recent/correct version when there was a conflict?
Conflict resolution requires the plugin to maintain versioned memory or recency ordering. Absence of conflict detection means both old and new facts may be surfaced simultaneously.
session_continuity_score
llm-judge.session_continuity_score
LLM-judge float Needs LLM judge
How smoothly did the next session resume from the handoff summary? LLM judge evaluates quality of transition — did agent pick up where previous left off?
LLM judge required. Use handoff_success as a cheaper proxy first. Requires access to both sessions for comparison.
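Once the annotation pipeline exists, retrieval_recall and retrieval_precision reduce to set arithmetic over per-session memory ids. A sketch, assuming the session artifact yields three id sets and treating "used by the agent" as the proxy for "useful" (names are illustrative):

```python
def memory_retrieval_metrics(surfaced_ids, used_ids, relevant_ids):
    """Derived memory-quality signals from per-session id sets.
    surfaced_ids: memories the plugin injected this session
    used_ids:     surfaced memories the agent actually acted on
    relevant_ids: annotated ground-truth relevant memories
    Returns None for a metric whose denominator is empty."""
    recall = (len(surfaced_ids & relevant_ids) / len(relevant_ids)
              if relevant_ids else None)
    precision = (len(used_ids & surfaced_ids) / len(surfaced_ids)
                 if surfaced_ids else None)
    return {"retrieval_recall": recall, "retrieval_precision": precision}
```

Returning None rather than 0.0 for empty denominators keeps "nothing to retrieve" sessions from dragging down aggregates.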
User / Session Success Signals — did the plugin help the user succeed
Final-layer signals measuring whether the plugin improved the user's experience and session outcome. These are the most user-proximate signals in the inventory. Collect everything raw. Interpret nothing at collection time. Pair each signal against a no-plugin baseline before drawing conclusions.
Signal Source Type Status Description Validity / Pitfalls
Outcome & Task Completion
task_completion_rate
harness.task_completion_rate
Harness float Extracted
Percentage of tasks fully resolved with plugin assistance. The primary user-facing quality metric for plugin-assisted sessions.
Only meaningful alongside task_completion_rate_baseline. Never report absolute completion rate without the no-plugin comparator.
session_outcome
derived.session_outcome
Derived enum Needs baseline
How did the session compare to the no-plugin baseline? Values: improved / neutral / regressed. Gating label for all per-session quality inference.
Derive from agent_outcome_delta. A plugin that regresses on more than 5% of sessions needs investigation regardless of aggregate improvement.
output_applied
harness.output_applied
Harness bool Target
Whether user accepted and applied the agent's output. Detected from git state at session close. Strongest quality proxy without asking the user directly.
Objective to collect: a git commit after the session indicates the output was accepted. Track as a rate over sessions, not just as a per-session binary.
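The session_outcome derivation above can be sketched as a thresholded mapping from agent_outcome_delta; the deadband value is an illustrative noise tolerance, not a prescribed constant:

```python
def session_outcome(agent_outcome_delta, deadband=0.02):
    """Map a per-session delta vs the no-plugin baseline to the enum.
    deadband absorbs run-to-run noise; tune per harness."""
    if agent_outcome_delta > deadband:
        return "improved"
    if agent_outcome_delta < -deadband:
        return "regressed"
    return "neutral"
```

A too-small deadband inflates both "improved" and "regressed" counts with noise; calibrate it against repeated no-plugin runs of the same task.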
Time & Effort Deltas
time_to_first_edit_delta
derived.time_to_first_edit_delta
Derived float Needs baseline
Change in seconds to first file modification with vs without plugin. A good context pack helps the agent start faster — negative delta is the target.
Includes plugin_latency_ms in the with-plugin condition. Net delta = (plugin latency + time to first edit with plugin) − (time to first edit without plugin).
user_effort_delta
derived.user_effort_delta
Derived float Needs baseline
Change in user messages needed per task. A plugin that reduces steering effort (fewer corrections, fewer clarifications) is directly improving UX.
Requires session JSONL to count user turns. Compare same tasks across plugin vs no-plugin conditions.
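The net-delta accounting for time_to_first_edit_delta above is simple but easy to get wrong if plugin latency is dropped; a sketch with illustrative argument names, all in seconds:

```python
def time_to_first_edit_delta(plugin_latency_s, ttfe_with_plugin_s,
                             ttfe_baseline_s):
    """Net seconds to first edit, charging the plugin's own latency
    to the with-plugin condition. Negative = faster with plugin."""
    return (plugin_latency_s + ttfe_with_plugin_s) - ttfe_baseline_s
```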
Agent Behavior Signals
agent_hedging_curve
derived.agent_hedging_curve
Derived list Needs session JSONL
Rate of hedged language ("I think", "might be", "probably") per agent turn. A good context pack should reduce hedging — agent is more confident when it has the right files upfront.
Pure regex on session JSONL assistant turns. A declining trajectory in the plugin condition vs flat/rising in baseline is the target pattern.
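The regex pass for agent_hedging_curve can be sketched as below; the phrase list is illustrative and should be extended from observed transcripts, and raw counts can be divided by turn length if a rate per token is preferred:

```python
import re

# illustrative hedge phrases; extend from real transcripts
HEDGE_PATTERN = re.compile(
    r"\b(i think|might be|probably|perhaps|not sure|it seems)\b",
    re.IGNORECASE,
)

def agent_hedging_curve(assistant_turns):
    """Hedged-phrase count per assistant turn, in turn order.
    assistant_turns: list of message strings from the session JSONL."""
    return [len(HEDGE_PATTERN.findall(turn)) for turn in assistant_turns]
```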
Multi-Session & Handoff
prior_session_reuse
plugin.prior_session_reuse
Plugin float Target
Percentage of handoff summary content actually referenced in the next session. Measures whether the previous session's handoff artifact was used, not just produced.
Low reuse may indicate the handoff artifact was produced but not injected correctly, or was injected but not relevant. Distinguish these cases before acting.
per_session_cost_delta
derived.per_session_cost_delta
Derived float Needs baseline
USD cost change per session with plugin vs without. Includes plugin overhead plus downstream savings from reduced agent exploration and retries.
The bottom-line cost signal for the product context. Negative = plugin saves money per session net of its own overhead. This is the target.
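The bottom-line accounting for per_session_cost_delta mirrors the time delta above: charge the plugin's own cost to the with-plugin condition. A sketch with illustrative argument names, all in USD:

```python
def per_session_cost_delta(plugin_cost_usd, agent_cost_with_plugin_usd,
                           agent_cost_baseline_usd):
    """USD delta per session. Negative = plugin pays for itself
    net of its own overhead."""
    return (plugin_cost_usd + agent_cost_with_plugin_usd) - agent_cost_baseline_usd
```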