| Signal | Source | Type | Status | Description | Validity / Pitfalls |
|---|---|---|---|---|---|
| plugin_scope config.plugin_scope | Config | enum | Target | Scope of the plugin this session: context-only / context+plan / context+plan+routing. Determines which signal layers are applicable. | Gates interpretation of all other signals. A plan_drift signal on a context-only session is vacuous. |
| context_strategy config.context_strategy | Config | enum | Extracted | Strategy selected by the plugin for this session: bypass / dump / focused / distill / trace. Different strategies have different expected precision/recall profiles. | Must be recorded to interpret context_precision and context_bloat_score. Dump strategy will always show low precision; that is expected. |
| plan_artifact_enabled config.plan_artifact_enabled | Config | bool | Target | Whether plan creation is active this session. When false, all Layer 3 plan signals are not applicable. | Required gate for Layer 3 signals. Do not compute plan_drift_score when this is false. |
| system_reminders_enabled config.system_reminders_enabled | Config | bool | Target | Whether soft-nudge injections (system reminders) are active this session. Affects reminder_injection_count in Layer 3. | Without this flag, reminder_injection_count is uninterpretable: zero injections might mean reminders are disabled, not that none were needed. |
| agent_id config.agent_id | Config | str | Extracted | Downstream agent receiving the context pack from this plugin session. Required for any cross-agent comparison of plugin effectiveness. | Plugin outcomes are agent-specific. A context pack that helps agent A may not help agent B with a different context window or tool set. |
| model_id config.model_id | Config | str | Target | Model used by the plugin for context building and plan generation. Include the full model string (provider + version). | Separates plugin model quality from downstream agent model quality when attributing outcome deltas. |
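
The gating rules in the table above can be sketched as a small guard layer that returns `None` for signals that are not applicable. This is a minimal illustration, not the plugin's implementation: the `SessionConfig` class and both `gate_*` helpers are hypothetical names, and treating "not applicable" as `None` is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SessionConfig:
    # Field names mirror the config.* signals in the table; the class itself is hypothetical.
    plugin_scope: str  # "context-only" | "context+plan" | "context+plan+routing"
    plan_artifact_enabled: bool
    system_reminders_enabled: bool


def gate_plan_drift(cfg: SessionConfig, raw_score: Optional[float]) -> Optional[float]:
    """Return plan_drift_score only when Layer 3 plan signals apply, else None."""
    if cfg.plugin_scope == "context-only" or not cfg.plan_artifact_enabled:
        return None
    return raw_score


def gate_reminder_count(cfg: SessionConfig, raw_count: Optional[int]) -> Optional[int]:
    """Return reminder_injection_count only when reminders are enabled.

    A zero with reminders disabled is reported as None, so it cannot be
    mistaken for "no reminders were needed".
    """
    if not cfg.system_reminders_enabled:
        return None
    return raw_count
```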
context_precision, context_recall, and context_token_cost.
| Signal | Source | Type | Status | Description | Validity / Pitfalls |
|---|---|---|---|---|---|
| context_files_selected plugin.context_files_selected | Plugin | list[str] | Extracted | Files the plugin selected for the context pack before the agent acted. The raw selection list from the plugin's context-building phase. | Foundation for all derived context quality signals. Without this list, precision and recall cannot be computed. |
| context_files_used_by_agent harness.context_files_used_by_agent | Harness | list[str] | Needs session JSONL | Files the agent actually read or modified during the session. Compare with context_files_selected to derive precision and bloat. | Requires parsing agent session JSONL for file access events. Not the same as touched_files: it includes reads that did not result in edits. |
| missed_relevant_files derived.missed_relevant_files | Derived | list[str] | Needs artifact | Files proven relevant but omitted by the plugin. Requires ground-truth relevant_files in the task spec or gold patch to compute. | Requires ground-truth annotation. In eval settings, gold patch files provide this. In product settings, needs LLM judge or manual annotation. |
| context_precision derived.context_precision | Derived | float | Computed | Fraction of plugin-selected files actually read or edited by the agent: \|selected ∩ agent_used\| / \|selected\|. Supporting signal; always pair with coverage (context_recall). High precision + low coverage = over-filtering. Prefer first_tool_precision as the primary KPI. | Can look artificially high if the plugin selects very few files. A plugin returning 0 files scores 100% precision, so always read alongside coverage. Downgraded from ★: use first_tool_precision as the primary context-quality KPI instead. |
| context_recall derived.context_recall | Derived | float | Computed | Fraction of agent-used files that were pre-selected by the plugin: \|selected ∩ agent_used\| / \|agent_used\|. Also referred to as coverage. Measures how much of what the agent actually needed was pre-covered. Not the same as ground-truth recall; see context_relevant_recall (target) for that. | Computed from session JSONL; no annotation required. High coverage = plugin anticipated the agent's file needs well. Low coverage = agent explored significantly beyond the plugin's selection. |
| context_relevant_recall derived.context_relevant_recall | Derived | float | Target | Ground-truth recall: \|selected ∩ relevant\| / \|relevant\|. Fraction of annotated-relevant files the plugin actually selected. Requires a relevant_files ground-truth list from the task YAML or gold patch. Compare with context_recall (which uses agent-used as proxy for relevant) to measure how well agent behavior proxies ground truth. | Requires annotation; not computable from session JSONL alone. Use context_recall as the zero-annotation proxy while building the annotation pipeline. Null when relevant_files is empty. |
| context_bloat_score derived.context_bloat_score | Derived | float | Computed | Fraction of selected files that were never accessed by the agent: \|selected − agent_used\| / \|selected\|. High bloat wastes agent context window. | Equal to 1 − context_precision. High bloat is always costly: unused context occupies token budget without benefit. |
| context_noise_ratio derived.context_noise_ratio | Derived | float | Needs artifact | Percentage of irrelevant chunks in the context pack at the chunk/passage level (not just file level). A file can be selected but only partially relevant. | Finer-grained than context_bloat_score. Requires chunk-level relevance annotation. Use file-level signals first. |
| context_token_cost ★ harness.context_token_cost | Harness | int | Extracted | Tokens consumed by context injection into the agent prompt. Key cost signal: directly sets the floor for agent input token cost each session. | Primary lever for cost optimization. Pair with context_precision to distinguish cheap-but-useful from cheap-but-useless context packs. |
| file_selection_latency_ms harness.file_selection_latency_ms | Harness | int | Extracted | Time spent building the context pack before the agent starts. Adds to plugin_latency_ms. High values suggest expensive retrieval or embedding. | Direct user-perceived latency cost of the plugin. The latency vs context quality tradeoff is the core plugin design problem. |
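
The set formulas in the table above can be sketched directly with Python sets. This is an illustrative sketch only: the function name `context_file_signals` is hypothetical, and returning `None` for an empty denominator is an assumption of this sketch (a pipeline that follows the table's convention of scoring an empty selection as 100% precision would substitute 1.0 there).

```python
from typing import Dict, Iterable, Optional


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    # Assumption: an empty denominator yields None rather than 0.0 or 1.0.
    return numerator / denominator if denominator else None


def context_file_signals(
    selected: Iterable[str],
    agent_used: Iterable[str],
    relevant: Optional[Iterable[str]] = None,
) -> Dict[str, Optional[float]]:
    """Compute the file-level context signals from the three file lists."""
    sel, used = set(selected), set(agent_used)
    out = {
        "context_precision": _ratio(len(sel & used), len(sel)),
        "context_recall": _ratio(len(sel & used), len(used)),
        "context_bloat_score": _ratio(len(sel - used), len(sel)),
        # Stays None (null) when relevant_files is missing or empty.
        "context_relevant_recall": None,
    }
    if relevant:
        rel = set(relevant)
        out["context_relevant_recall"] = _ratio(len(sel & rel), len(rel))
    return out


signals = context_file_signals(
    selected=["a.py", "b.py", "c.py", "d.py"],
    agent_used=["a.py", "b.py", "e.py"],
    relevant=["a.py", "e.py"],
)
```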
plugin_scope includes planning (context+plan or context+plan+routing).
| Signal | Source | Type | Status | Description | Validity / Pitfalls |
|---|---|---|---|---|---|
| plan_artifact_created plugin.plan_artifact_created | Plugin | bool | Needs artifact | Whether the plugin created a persistent plan object this session. Only meaningful when plugin_scope includes planning (context+plan or context+plan+routing). | Gate for all other plan signals. If false, plan_updated_count, plan_drift_score, and plan_phase_transitions are not applicable. |
| plan_updated_count plugin.plan_updated_count | Plugin | int | Needs artifact | Number of times the plan was updated mid-session. Zero updates may indicate the plugin stopped tracking, or the initial plan was correct. | Interpret alongside plan_drift_score. Many updates + low drift = healthy iterative refinement. Few updates + high drift = plan was abandoned silently. |
| plan_drift_score derived.plan_drift_score | Derived | float | Needs artifact | Divergence between the initial plan and the final path taken. 0 = no drift. Computed as edit distance between initial and final plan state. | Drift is not always bad: legitimate discovery causes benign drift. Flag only when correlated with downstream failure. Never use alone as a quality signal. |
| milestone_detected ★ plugin.milestone_detected | Plugin | bool | Target | Whether the plugin detected meaningful progress or milestone completion during the session. Leading indicator of session trajectory. | Positive signal when correlated with successful downstream outcome. False + successful outcome may indicate the plugin's milestone detection is too conservative. |
| blocker_detected plugin.blocker_detected | Plugin | bool | Target | Whether the plugin flagged a blocking issue during the session. Blockers should correlate with downstream failure or retry events. | Calibrate against downstream outcome. High false-positive blocker rate = plugin is too conservative and interrupts good sessions. |
| plan_phase_transitions plugin.plan_phase_transitions | Plugin | list[str] | Needs artifact | Sequence of plan phase changes observed during the session. Analogous to plannotator phases in the agent; tracks the plugin's view of session progress. | Structural, no inference required if the plugin emits phase events. Valuable for understanding plugin behavior across session types. |
| reminder_injection_count plugin.reminder_injection_count | Plugin | int | Target | Number of soft-nudge reminders injected into the agent context during the session. Only meaningful when system_reminders_enabled = true. | High counts may indicate the agent is repeatedly deviating from the plan. Correlate with plan_drift_score and downstream outcome. |
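
The table describes plan_drift_score as an edit distance between the initial and final plan state. One possible reading is sketched below as a Levenshtein distance over ordered plan steps. Both the representation of a plan as a list of step strings and the normalization by the longer plan length (so the score falls in [0, 1]) are assumptions of this sketch, not something the source specifies.

```python
from typing import Sequence


def plan_drift_score(initial_steps: Sequence[str], final_steps: Sequence[str]) -> float:
    """Normalized Levenshtein distance between two step sequences (0 = no drift)."""
    n, m = len(initial_steps), len(final_steps)
    if max(n, m) == 0:
        return 0.0
    # prev[j] holds the distance between the first i-1 initial steps
    # and the first j final steps.
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            substitution = 0 if initial_steps[i - 1] == final_steps[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,                 # drop a step from the initial plan
                curr[j - 1] + 1,             # insert a step from the final plan
                prev[j - 1] + substitution,  # keep or replace a step
            )
        prev = curr
    return prev[m] / max(n, m)


# One inserted step out of four gives a drift of 0.25.
drift = plan_drift_score(
    ["read", "edit", "test"],
    ["read", "refactor", "edit", "test"],
)
```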

| Signal | Source | Type | Status | Description | Validity / Pitfalls |
|---|---|---|---|---|---|
| agent_outcome_delta ★ derived.agent_outcome_delta | Derived | float | Needs baseline | Pass rate improvement vs baseline (same agent, no plugin). Positive = plugin helped. Computed as pass_at_1 − pass_at_1_baseline. | Requires a clean baseline run. System noise can obscure deltas < 5pp. Run ≥3 trials per condition. A single-run delta is not attributable to the plugin. |
| pass_at_1 ★ harness.pass_at_1 | Harness | float | Extracted | P(success in first attempt) with plugin active. The primary eval metric for plugin-assisted sessions. | Only interpretable as a quality signal when compared against pass_at_1_baseline from a clean no-plugin run. |
| pass_at_1_baseline harness.pass_at_1_baseline | Harness | float | Needs baseline | P(success in first attempt) without plugin. Run baseline condition separately on the same task set and agent configuration. | Must use the same agent, model, task set, and tool caps. Any difference in those confounds the comparison. |
| resolve_rate_delta derived.resolve_rate_delta | Derived | float | Needs baseline | Percentage point change in fully resolved tasks with vs without plugin. Broader than pass_at_1 if retries are counted. | Distinguish from agent_outcome_delta (the pass_at_1 delta): resolve rate includes retry successes. Track both for the full picture. |
| retry_count_delta derived.retry_count_delta | Derived | float | Needs baseline | Change in average retries needed with vs without plugin. Negative = plugin reduces retries. A plugin that improves pass@1 but not retry count has narrow impact. | Negative delta is the target. A plugin that reduces retries saves significant cost even if overall resolve rate is similar. |
| verification_pass_delta derived.verification_pass_delta | Derived | float | Needs baseline | Change in test-pass rate attributable to plugin. Finer-grained than resolve_rate_delta: tracks partial improvement where some tests now pass. | Only meaningful in eval contexts with a test harness. Not applicable in pure product sessions. |
| faithfulness_score llm-judge.faithfulness_score | LLM-judge | float | Needs LLM judge | Percentage of agent claims supported by plugin-provided context. High faithfulness = agent stayed grounded in the context pack. | LLM-as-judge can hallucinate judgements. Calibrate against a 10% human spot-check before trusting at scale. |
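
The baseline rules for agent_outcome_delta (at least 3 trials per condition, deltas under 5pp may be noise) could be enforced as in the sketch below. The function name, the use of a simple mean across trials, and the `within_noise` flag are assumptions made for illustration; pass rates are taken to be fractions in [0, 1].

```python
from statistics import mean
from typing import Dict, Optional, Sequence, Union

MIN_TRIALS = 3         # "Run ≥3 trials per condition"
NOISE_FLOOR_PP = 5.0   # "System noise can obscure deltas < 5pp"


def agent_outcome_delta(
    plugin_pass_at_1: Sequence[float],
    baseline_pass_at_1: Sequence[float],
) -> Optional[Dict[str, Union[float, bool]]]:
    """Return pass_at_1 − pass_at_1_baseline, or None when either side has too few trials."""
    if len(plugin_pass_at_1) < MIN_TRIALS or len(baseline_pass_at_1) < MIN_TRIALS:
        return None  # a single-run delta is not attributable to the plugin
    delta = mean(plugin_pass_at_1) - mean(baseline_pass_at_1)
    return {
        "delta": delta,
        # Flag deltas too small to separate from system noise.
        "within_noise": abs(delta) * 100 < NOISE_FLOOR_PP,
    }


result = agent_outcome_delta([0.60, 0.62, 0.64], [0.50, 0.52, 0.54])
```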

| Signal | Source | Type | Status | Description | Validity / Pitfalls |
|---|---|---|---|---|---|
| first_tool_precision ★ derived.first_tool_precision | Derived | float | Needs annotation | Was the agent's first file read/search a relevant file? 1.0 = yes, 0.0 = no. Requires relevant_files in task spec. The fastest proxy for context quality: a plugin that improves this is directly reducing wasted tool calls. | Requires ground-truth annotation in the task YAML. Only applicable when relevant_files is populated. Binary 0/1; aggregate across many tasks for a stable metric. |
| time_to_first_correct_file_read ★ derived.time_to_first_correct_file_read | Derived | int (ms) | Needs annotation | Milliseconds from session start until the agent first reads a ground-truth relevant file. Requires relevant_files annotation. A delta here directly measures how much the plugin shortened the orientation phase. | Same annotation dependency as first_tool_precision. Sensitive to task complexity; normalize within task family when aggregating. Null if no relevant file was ever read. |
| error_loop_detected ★ derived.error_loop_detected | Derived | bool | Extracted | True if the agent called the same (tool, input) pair 3+ times in a session. A strong indicator of a stuck loop. Computed from session JSONL with no extra annotation. | False positives possible for legitimate repeated operations (e.g. polling). The threshold of 3 is a heuristic; tune it per task type if needed. |
| error_recovery_turns ★ derived.error_recovery_turns | Derived | int | Extracted | Number of turns between the first error recovery milestone and the next verification success. Null if no recovery milestone exists. Measures recovery efficiency; lower is better. | Null for sessions that never hit an error recovery milestone. Compare across sessions only when both have the milestone. |
| decision_reversal_count derived.decision_reversal_count | Derived | int | Extracted | Number of times the agent switched its edit target to a different file. Each A→B file switch = +1. Measures thrashing: a plugin that reduces this is genuinely reducing wasted work. | Benign for naturally multi-file tasks. Combine with total edit count and wallClockMs to contextualize. A high count in a single-file task is a red flag. |
| tool_error_rate derived.tool_error_rate | Derived | float | Extracted | Fraction of all tool calls that returned a non-zero exit code. High error rate = agent is guessing rather than reasoning. Computed from session JSONL exit codes. | Includes expected errors (e.g. a test run that fails as the baseline). Pair with error_recovery_turns to distinguish tolerable vs. pathological error rates. |
| retry_after_error_count derived.retry_after_error_count | Derived | int | Extracted | Number of times a failed tool call was immediately followed by the same call with the same input. "Blind retry" pattern: the agent saw an error and repeated the action unchanged. | Low-count false positives possible if identical retries are intentional (e.g. network retry logic in bash). One or two is normal; five or more in a session warrants inspection. |
| wrong_path_depth derived.wrong_path_depth | Derived | int | Not yet implemented | Number of tool calls made after the agent committed to an incorrect approach, before recovering or failing. Measures depth of wrong turns. Requires "wrong approach" detection heuristic or LLM judge. | Requires identifying when an agent is "on the wrong path", which is non-trivial without ground truth. Defer to Phase 2 unless gold trajectory is available. |
| first_decision_correct derived.first_decision_correct | Derived | bool | Needs annotation | Whether the agent's first substantive action (first edit or first tool call) was on a relevant file. Coarser version of first_tool_precision; useful when relevant_files annotation is available. | Requires ground-truth annotation. Equivalent to first_tool_precision > 0; prefer that signal for aggregation. Useful only as a boolean breakdown in task-level analysis. |
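
Three of the annotation-free signals above (error_loop_detected, tool_error_rate, retry_after_error_count) can be derived from an ordered list of tool-call events. The sketch below assumes each event has already been parsed from the session JSONL into a dict with `tool`, `input`, and `exit_code` keys; that event shape and the function name are hypothetical, since the source does not specify the JSONL schema.

```python
from collections import Counter
from typing import Any, Dict, List


def behavior_signals(calls: List[Dict[str, Any]], loop_threshold: int = 3) -> Dict[str, Any]:
    """Derive loop, error-rate, and blind-retry signals from ordered tool calls."""
    # Assumption: "input" is a hashable value such as a command string.
    keys = [(c["tool"], c["input"]) for c in calls]

    # error_loop_detected: the same (tool, input) pair seen 3+ times in the session.
    error_loop = any(n >= loop_threshold for n in Counter(keys).values())

    # tool_error_rate: fraction of calls with a non-zero exit code.
    errors = sum(1 for c in calls if c["exit_code"] != 0)
    error_rate = errors / len(calls) if calls else None

    # retry_after_error_count: a failed call immediately followed by an identical call.
    blind_retries = sum(
        1
        for prev, k_prev, k_next in zip(calls, keys, keys[1:])
        if prev["exit_code"] != 0 and k_prev == k_next
    )
    return {
        "error_loop_detected": error_loop,
        "tool_error_rate": error_rate,
        "retry_after_error_count": blind_retries,
    }


session = [
    {"tool": "bash", "input": "pytest", "exit_code": 1},
    {"tool": "bash", "input": "pytest", "exit_code": 1},
    {"tool": "read", "input": "a.py", "exit_code": 0},
    {"tool": "bash", "input": "pytest", "exit_code": 0},
]
signals = behavior_signals(session)
```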

| Signal | Source | Type | Status | Description | Validity / Pitfalls |
|---|---|---|---|---|---|
| plugin_latency_ms ★ harness.plugin_latency_ms | Harness | int | Extracted | Total plugin overhead (context-pack build + plan setup) before the agent starts. Directly adds to user-perceived time-to-first-agent-action. | Latency + quality tradeoff is the core plugin design question. A 2s plugin that improves pass@1 by 10pp is different from a 10s plugin that improves it by 2pp. |
| context_pack_tokens ★ harness.context_pack_tokens | Harness | int | Extracted | Token cost of the context pack injected into the agent prompt. Sets the floor for agent input token cost this session. | Same as context_token_cost in Layer 2; included here as the primary cost accounting signal for Layer 5. Track alongside tokens_per_task_delta. |
| tokens_per_task_delta ★ derived.tokens_per_task_delta | Derived | int | Needs baseline | Change in total token usage with vs without plugin. Negative = plugin saves tokens (agent thrashes less because context is better). Positive = plugin adds net cost. | The most important cost signal. A plugin that adds 2k tokens upfront but saves 15k in agent exploration is a net win. Never evaluate cost without this delta. |
| cost_per_task_delta ★ derived.cost_per_task_delta | Derived | float | Needs baseline | USD cost delta attributable to plugin overhead plus downstream savings. The bottom-line cost signal. | Requires accurate per-token cost accounting for both plugin model and downstream agent model. Include both in calculation. |
| tool_calls_per_task_delta derived.tool_calls_per_task_delta | Derived | int | Needs baseline | Change in number of agent tool calls with vs without plugin. A good context pack reduces thrashing: fewer redundant reads and grep loops. | Proxy for agent exploration efficiency. Negative delta = plugin reduces thrashing. Positive = plugin may be confusing the agent. |
| plugin_error_rate harness.plugin_error_rate | Harness | float | Extracted | Rate of plugin failures or timeouts. A plugin that fails silently is worse than one that fails loudly: the agent runs without context and may not know it. | High error rate invalidates all other plugin signals for affected sessions. Always filter on plugin_error_rate before computing quality metrics. |
| context_growth_curve derived.context_growth_curve | Derived | list[int] | Needs session JSONL | Token count growth over session turns. A good plugin should flatten this curve: the agent needs fewer additional reads because context was pre-loaded. | Compare shape across plugin vs no-plugin baseline. Flatter curve in plugin condition = plugin is substituting for in-session exploration. |
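
The two cost deltas above can be worked through with the table's own example (2k tokens added upfront, 15k saved in agent exploration). In this sketch the function names, the split of usage into plugin tokens and agent tokens, and the per-token prices are all hypothetical; real accounting would use the actual rates of the plugin model and the agent model.

```python
def tokens_per_task_delta(plugin_tokens: int, agent_tokens_with: int, agent_tokens_baseline: int) -> int:
    """Total tokens with the plugin minus total tokens without it (negative = savings)."""
    return (plugin_tokens + agent_tokens_with) - agent_tokens_baseline


def cost_per_task_delta(
    plugin_tokens: int,
    agent_tokens_with: int,
    agent_tokens_baseline: int,
    plugin_usd_per_token: float,
    agent_usd_per_token: float,
) -> float:
    """USD delta that prices plugin-model and agent-model tokens separately."""
    with_plugin = plugin_tokens * plugin_usd_per_token + agent_tokens_with * agent_usd_per_token
    baseline = agent_tokens_baseline * agent_usd_per_token
    return with_plugin - baseline


# Plugin spends 2k tokens; the agent then needs 35k instead of 50k.
token_delta = tokens_per_task_delta(2_000, 35_000, 50_000)
# Hypothetical prices: 1e-6 USD/token for the plugin model, 2e-6 for the agent model.
usd_delta = cost_per_task_delta(2_000, 35_000, 50_000, 1e-6, 2e-6)
```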
ctx_* commands, maintain a SQLite/FTS5 session store, and enforce routing rules.
| Signal | Source | Type | Status | Description | Validity / Pitfalls |
|---|---|---|---|---|---|
| Routing & Sandbox | | | | | |
| tool_interception_rate ★ runtime.tool_interception_rate | Harness | float | Target | Fraction of raw tool calls routed through the context-mode hook layer rather than executed directly. 1.0 = full interception; <1.0 = hook missed some calls (hook coverage gap). | Requires hook event log. Low rate may indicate hooks not registered, partial env support, or agent bypassing the MCP layer. Should be 1.0 in a fully-wired context-mode session. |
| ctx_substitution_rate runtime.ctx_substitution_rate | Harness | float | Target | Fraction of intercepted tool calls that were substituted with a ctx_* command (e.g. ctx_read, ctx_search, ctx_execute) rather than passed through unchanged. | High rate = context-mode is actively enriching the session. Low rate = hooks are running but not substituting; check routing rules and command recognition patterns. |
| blocked_command_rate runtime.blocked_command_rate | Harness | float | Target | Fraction of tool calls blocked by routing rules (e.g. disallowed shell commands, out-of-scope writes). Measures sandbox enforcement effectiveness. | High rate on legitimate commands = over-restrictive routing. Low rate = sandbox may not be enforced. Pair with agent_outcome_delta to check if blocking is helping or hurting. |
| Retrieval & Index | | | | | |
| search_hit_rate ★ runtime.search_hit_rate | Harness | float | Target | Fraction of ctx_search calls that returned at least one result. Low rate indicates the FTS5/BM25 index is stale, missing, or queries are poorly formed. | Always compare to context_recall: a high search hit rate with low recall means the index returns results but they are not the right ones. |
| bm25_reuse_rate runtime.bm25_reuse_rate | Harness | float | Target | Fraction of ctx_search calls that were served from the cached BM25 index without re-indexing. Proxy for index freshness management cost. | High reuse rate + low hit rate = stale index. Reindex threshold needs tuning. Track alongside file change rate to calibrate. |
| fetch_cache_hit_rate runtime.fetch_cache_hit_rate | Harness | float | Target | Fraction of ctx_fetch calls served from the session fetch cache. Measures how effectively repeated URL fetches are deduplicated across a session. | Low rate on repeated identical fetches = cache is not keying correctly. Track alongside fetch latency to understand cost savings. |
| Continuity & Compaction | | | | | |
| snapshot_created ★ runtime.snapshot_created | Harness | bool | Target | Whether the context-mode runtime created a session snapshot before compaction fired. Gate signal for all compaction-recovery signals. | If false and compaction fired, the agent resumes cold. Track rate across sessions; target is 1.0 on all sessions that reach compaction threshold. |
| restore_success ★ runtime.restore_success | Harness | bool | Target | Whether the agent successfully restored working context from the snapshot after compaction. Null if no compaction event occurred in the session. | Only meaningful when snapshot_created = true and a compaction event is detected. A restore failure = cold restart despite snapshot existing (injection bug). |
| time_to_first_productive_action_post_compaction ★ runtime.ttfpa_post_compaction_ms | Harness | int (ms) | Target | Milliseconds from compaction event to first non-orientation tool call (first file edit or meaningful search). Measures how quickly the agent recovers context after compaction. Null if no compaction occurred. | Compare against no-snapshot baseline (same agent, same session, snapshot disabled). Delta is the plugin's continuity contribution. Sensitive to task complexity; normalize within task family. |
| compaction_event_count runtime.compaction_event_count | Harness | int | Target | Number of context compaction events detected in the session. Zero on short sessions; ≥1 indicates the session reached the model's context limit. | Useful for segmenting sessions: compaction sessions are a distinct analysis cohort where continuity signals matter most. |
| Think-in-Code | | | | | |
| ctx_execute_adoption_rate ★ runtime.ctx_execute_adoption_rate | Harness | float | Target | Fraction of agent bash calls replaced by ctx_execute (think-in-code mode). High adoption = agent is using the structured execution path with result capture and deduplication. | Low adoption with hooks active = routing rules for bash substitution are not triggering. Check command pattern matching. Pair with file_read_avoidance_ratio. |
| file_read_avoidance_ratio runtime.file_read_avoidance_ratio | Harness | float | Target | Fraction of files that the agent did not need to read because they were pre-loaded into context by the plugin. Measures how much redundant file I/O the plugin eliminates. | Requires tracking which files were in the context pack vs which the agent attempted to read. High ratio = plugin is successfully front-loading the right files. |
| Output Compression | | | | | |
| assistant_token_delta runtime.assistant_token_delta | Harness | int | Target | Change in total assistant output tokens with output compression active vs baseline (no compression). Negative = compression is reducing verbose narration. Positive = compression prompts are adding overhead. | Requires a no-compression baseline on the same task. Token savings must be weighed against any quality loss; track alongside pass_at_1 to confirm compression does not regress outcomes. |
| verbosity_ratio runtime.verbosity_ratio | Harness | float | Target | Ratio of assistant reasoning tokens to action tokens (tool calls + edits). High verbosity = agent is narrating heavily relative to acting. Context-mode's output compression system targets this ratio. | Computed from session JSONL by splitting assistant messages into reasoning vs tool-call content. A declining trajectory after the first few turns is the target pattern. |
| Platform Capability | | | | | |
| hook_availability ★ runtime.hook_availability | Harness | bool | Target | Whether the Claude Code hooks system was available and registered at session start. Gate for all routing and think-in-code signals. False = context-mode degraded to context-only. | Must be recorded per session. Always segment analysis by this flag: mixing hook-available and hook-unavailable sessions masks the plugin's true capability delta. |
| degraded_mode_active runtime.degraded_mode_active | Harness | bool | Target | Whether the plugin detected an unsupported platform and fell back to a degraded operating mode (e.g. no hooks, read-only context pack, no routing enforcement). Sessions in degraded mode should be excluded from full-signal analysis. | Degraded-mode sessions are not comparable to full-capability sessions. Always filter before computing routing, think-in-code, or compression signals. |
| mcp_server_latency_ms runtime.mcp_server_latency_ms | Harness | int | Target | Round-trip latency of MCP server calls from the hook layer. High latency here adds directly to the agent's perceived tool call time. Separate from plugin_latency_ms, which covers context-pack build time. | Track p50 and p95 separately: occasional spikes matter more than mean latency since a slow MCP call blocks the entire agent turn. |
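
Two of the runtime signals above lend themselves to a short sketch: tracking p50 and p95 of mcp_server_latency_ms separately, and reporting tool_interception_rate only for sessions that pass the platform gates. The nearest-rank percentile method and the choice to return `None` for gated-out sessions are assumptions of this sketch, and both function names are hypothetical.

```python
import math
from typing import Optional, Sequence


def percentile(values: Sequence[float], pct: float) -> Optional[float]:
    """Nearest-rank percentile; None for an empty sample."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def tool_interception_rate(
    intercepted_calls: int,
    total_calls: int,
    hook_availability: bool,
    degraded_mode_active: bool,
) -> Optional[float]:
    """Fraction of tool calls routed through the hook layer, gated on platform capability."""
    if not hook_availability or degraded_mode_active or total_calls == 0:
        return None  # not comparable to full-capability sessions
    return intercepted_calls / total_calls


# One slow MCP call barely moves p50 but dominates p95.
latencies_ms = [10, 12, 11, 13, 250, 12, 11, 10, 12, 11]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```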

| Signal | Source | Type | Status | Description | Validity / Pitfalls |
|---|---|---|---|---|---|
| context_relevancy_score ★ derived.context_relevancy_score | Derived | float | Needs artifact | How well selected context matches the actual task intent (RAGAS-style). Measures semantic alignment between context pack contents and the task description. | RAGAS-style metrics require an embedding model. Can be computed without LLM judge. Best used as a continuous signal rather than binary threshold. |
| task_intent_match llm-judge.task_intent_match | LLM-judge | float | Needs LLM judge | Did the context pack reflect the right problem framing? Requires an LLM judge with access to the task ground truth and the context pack contents. | Requires LLM judge with task ground truth. More expensive than embedding-based relevancy. Use context_relevancy_score as a cheaper proxy first. |
| instruction_adherence ★ derived.instruction_adherence | Derived | float | Needs session JSONL | Did the agent follow the plan structure the plugin provided? Measures alignment between plugin-provided plan and agent's actual action sequence. | Low adherence may mean the plan was wrong (plugin failure) or the agent ignored it (agent failure). Pair with agent_outcome_delta to distinguish. |
| plan_optimality llm-judge.plan_optimality | LLM-judge | float | Needs LLM judge | How good was the plugin's plan vs. the ideal path actually taken? Requires comparing the plugin's initial plan to the successful solution trajectory. | LLM-as-judge can hallucinate. Even top models perform poorly on plan quality evaluation. Requires careful judge design with explicit rubric. |
| context_fit_by_strategy derived.context_fit_by_strategy | Derived | dict | Needs artifact | Precision and recall breakdown per context strategy used. Answers: does focused strategy outperform distill on repo-level tasks? | Requires stratifying sessions by context_strategy. Minimum ~20 sessions per strategy for statistically meaningful comparison. |
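
The context_fit_by_strategy breakdown could be built by stratifying per-session records, as sketched below. The session record shape (dicts carrying `context_strategy`, `context_precision`, and `context_recall`) and the `enough_data` flag are assumptions; the threshold of 20 comes from the table's "minimum ~20 sessions per strategy" guidance.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List

MIN_SESSIONS_PER_STRATEGY = 20  # "Minimum ~20 sessions per strategy"


def context_fit_by_strategy(sessions: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Mean precision and recall per context_strategy, flagged when the sample is too small."""
    groups: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for session in sessions:
        groups[session["context_strategy"]].append(session)

    return {
        strategy: {
            "n": len(rows),
            "precision": mean(r["context_precision"] for r in rows),
            "recall": mean(r["context_recall"] for r in rows),
            "enough_data": len(rows) >= MIN_SESSIONS_PER_STRATEGY,
        }
        for strategy, rows in groups.items()
    }


# Synthetic records for illustration only.
sessions = (
    [{"context_strategy": "focused", "context_precision": 0.8, "context_recall": 0.6}] * 20
    + [{"context_strategy": "dump", "context_precision": 0.2, "context_recall": 0.9}] * 2
)
fit = context_fit_by_strategy(sessions)
```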

| Signal | Source | Type | Status | Description | Validity / Pitfalls |
|---|---|---|---|---|---|
| handoff_summary_created plugin.handoff_summary_created | Plugin | bool | Target | Whether the plugin produced a session-end summary artifact. Gate for all handoff quality signals: if false, handoff_success and prior_session_reuse are not applicable. | Track rate of handoff creation across sessions. Low rate may indicate the plugin is timing out or failing at session close. |
| handoff_success ★ derived.handoff_success | Derived | float | Needs next-session data | Whether the next session resumed effectively using the handoff summary. Measured as delta in time-to-first-productive-action vs a no-handoff baseline. | Depends partly on the downstream agent; compare against no-plugin baseline. Requires cross-session linking. Most expensive signal to compute but highest signal quality. |
| repo_memory_hit_count ★ plugin.repo_memory_hit_count | Plugin | int | Target | Prior session facts successfully reused in the current session. Each hit = a memory item was surfaced and used by the agent. | Interpret alongside stale_memory_hit_count. High total hits but also high stale hits = memory precision problem. Net useful hits = repo_memory_hit_count − stale_memory_hit_count. |
| stale_memory_hit_count plugin.stale_memory_hit_count | Plugin | int | Target | Old or misleading memories surfaced by the plugin. High stale count degrades performance: the agent receives outdated context and may act on it. | High stale count is actively harmful, not neutral. Track as a quality regression signal, not just informational. |
| retrieval_recall derived.retrieval_recall | Derived | float | Needs artifact | Percentage of relevant prior memory surfaced by the plugin. Requires annotated ground truth of which prior memories were relevant to this session. | Requires ground-truth annotation across sessions. Start with heuristic proxies (time recency, file overlap) before full annotation. |
| retrieval_precision derived.retrieval_precision | Derived | float | Needs artifact | Percentage of surfaced memory that was actually useful in this session. High precision + low recall = memory is conservative but accurate. | Pair with retrieval_recall. Both together give the full picture of memory quality. Use repo_memory_hit_count as a cheaper proxy while building the annotation pipeline. |
| memory_conflict_resolution derived.memory_conflict_resolution | Derived | bool | Needs artifact | Plugin correctly detected and resolved contradicting facts across sessions. Binary: did the plugin surface the more recent/correct version when there was a conflict? | Conflict resolution requires the plugin to maintain versioned memory or recency ordering. Absence of conflict detection means both old and new facts may be surfaced simultaneously. |
| session_continuity_score llm-judge.session_continuity_score | LLM-judge | float | Needs LLM judge | How smoothly did the next session resume from the handoff summary? LLM judge evaluates quality of transition: did the agent pick up where the previous session left off? | LLM judge required. Use handoff_success as a cheaper proxy first. Requires access to both sessions for comparison. |
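
The memory-quality arithmetic in the table (net useful hits, retrieval_recall, retrieval_precision) can be sketched over sets of memory item IDs. Representing memories as string IDs, and reading "useful" and "relevant" as annotated ID sets, are assumptions of this sketch; the function names are hypothetical and the scores are returned as fractions rather than percentages.

```python
from typing import Dict, Iterable, Optional


def net_useful_hits(repo_memory_hit_count: int, stale_memory_hit_count: int) -> int:
    """Net useful hits = repo_memory_hit_count − stale_memory_hit_count."""
    return repo_memory_hit_count - stale_memory_hit_count


def retrieval_scores(
    surfaced: Iterable[str],
    relevant: Iterable[str],
    useful: Iterable[str],
) -> Dict[str, Optional[float]]:
    """Recall over annotated-relevant memories, precision over surfaced memories."""
    s, rel, use = set(surfaced), set(relevant), set(useful)
    return {
        # Share of relevant prior memory that the plugin surfaced.
        "retrieval_recall": len(s & rel) / len(rel) if rel else None,
        # Share of surfaced memory that was actually useful this session.
        "retrieval_precision": len(s & use) / len(s) if s else None,
    }


scores = retrieval_scores(
    surfaced={"m1", "m2", "m3", "m4"},
    relevant={"m1", "m2", "m5"},
    useful={"m1", "m2"},
)
```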

| Signal | Source | Type | Status | Description | Validity / Pitfalls |
|---|---|---|---|---|---|
| Outcome & Task Completion | | | | | |
| task_completion_rate ★ harness.task_completion_rate | Harness | float | Extracted | Percentage of tasks fully resolved with plugin assistance. The primary user-facing quality metric for plugin-assisted sessions. | Only meaningful alongside task_completion_rate_baseline. Never report absolute completion rate without the no-plugin comparator. |
| session_outcome ★ derived.session_outcome | Derived | enum | Needs baseline | How did the session compare to the no-plugin baseline? Values: improved / neutral / regressed. Gating label for all per-session quality inference. | Derive from agent_outcome_delta. A plugin that regresses on more than 5% of sessions needs investigation regardless of aggregate improvement. |
| output_applied harness.output_applied | Harness | bool | Target | Whether the user accepted and applied the agent's output. Detected from git state at session close. Strongest quality proxy without asking the user directly. | Completely objective. Git commit after session = output was accepted. Track as rate over sessions, not just binary per session. |
| Time & Effort Deltas | | | | | |
| time_to_first_edit_delta ★ derived.time_to_first_edit_delta | Derived | float | Needs baseline | Change in seconds to first file modification with vs without plugin. A good context pack helps the agent start faster; negative delta is the target. | Includes plugin_latency_ms in the with-plugin condition. Net delta = (plugin latency + time to first edit with plugin) − (time to first edit without plugin). |
| user_effort_delta derived.user_effort_delta | Derived | float | Needs baseline | Change in user messages needed per task. A plugin that reduces steering effort (fewer corrections, fewer clarifications) is directly improving UX. | Requires session JSONL to count user turns. Compare same tasks across plugin vs no-plugin conditions. |
| Agent Behavior Signals | | | | | |
| agent_hedging_curve derived.agent_hedging_curve | Derived | list | Needs session JSONL | Rate of hedged language ("I think", "might be", "probably") per agent turn. A good context pack should reduce hedging: the agent is more confident when it has the right files upfront. | Pure regex on session JSONL assistant turns. A declining trajectory in the plugin condition vs flat/rising in baseline is the target pattern. |
| Multi-Session & Handoff | | | | | |
| prior_session_reuse ★ plugin.prior_session_reuse | Plugin | float | Target | Percentage of handoff summary content actually referenced in the next session. Measures whether the previous session's handoff artifact was used, not just produced. | Low reuse may indicate the handoff artifact was produced but not injected correctly, or was injected but not relevant. Distinguish these cases before acting. |
| per_session_cost_delta derived.per_session_cost_delta | Derived | float | Needs baseline | USD cost change per session with plugin vs without. Includes plugin overhead plus downstream savings from reduced agent exploration and retries. | The bottom-line cost signal for the product context. Negative = plugin saves money per session net of its own overhead. This is the target. |
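
Two of the product signals above are simple enough to sketch: agent_hedging_curve as a regex count per assistant turn, and the net time_to_first_edit_delta formula. The phrase list is only the three examples the table gives, returning raw counts per turn (rather than a normalized rate) is a simplification, and the assumption that time-to-first-edit in the plugin condition is measured from agent start (so plugin latency is added on top) is mine.

```python
import re
from typing import List

# Only the three example phrases from the table; a real list would likely be longer.
HEDGE_RE = re.compile(r"\b(?:i think|might be|probably)\b", re.IGNORECASE)


def agent_hedging_curve(assistant_turns: List[str]) -> List[int]:
    """Count hedged phrases in each assistant turn, in session order."""
    return [len(HEDGE_RE.findall(turn)) for turn in assistant_turns]


def time_to_first_edit_delta(
    plugin_latency_ms: int,
    first_edit_with_plugin_s: float,
    first_edit_without_plugin_s: float,
) -> float:
    """Net delta in seconds: (plugin latency + time with plugin) − time without plugin."""
    return (plugin_latency_ms / 1000 + first_edit_with_plugin_s) - first_edit_without_plugin_s


curve = agent_hedging_curve([
    "I think this might be the bug, probably.",
    "Editing parser.py now.",
    "It is probably fine.",
])
net_delta_s = time_to_first_edit_delta(2000, 20.0, 45.0)
```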