Plugin Signals: Context, Planning, Memory & Handoff

Repo context → Context pack → Plan artifact → Agent run → Outcome delta → Handoff summary → Next session
A structured representation of a context/planning plugin that helps explain, predict, and optimize outcomes across trials and user sessions. Two scopes, different signal sets. Plugin eval signals measure context quality and downstream agent delta. Plugin product signals measure memory, handoff, and session continuity.

This plugin does not replace the coding agent. It improves the agent's operating context. Its job is to gather relevant repo/session context, optionally maintain a plan artifact, detect milestones and blockers, preserve useful session memory, and generate handoff information so future sessions do not restart from zero.

Context-only

Gathers and packages repo context. No plan artifact. Signals focus on precision, recall, and bloat. Baseline scope — any plugin deployment has at least this role.

Context + planning artifact

Plan is first-class. Created, updated, and tracked through the session. Adds plan drift, milestone detection, and blocker signals on top of context signals.

Context + planning + routing

Plugin also selects tools, delegates sub-tasks, or manages agent topology. Full orchestrator role. All signals apply, plus tool-call delta and routing precision.

Plugin Session Lifecycle — plugin_session › phases[] Each phase is a distinct plugin responsibility within a single session

pre-session

context_pack
gather repo context
select relevant files
build context payload

pre/mid-session

plan_state
create plan artifact
update mid-session
track phase transitions

mid-session

memory_update
surface prior facts
inject reminders
detect blockers

post-session

handoff_artifact
session-end summary
persist memory facts
next-session primer

measured

downstream_agent_outcome
delta vs baseline
pass rate change
cost delta

config

plugin_scope
context_strategy
plan_artifact_enabled
system_reminders_enabled
agent_id
model_id

Signal tiers Computed in the eval harness today — present in session JSONL or meta files Target defined in the schema, not yet emitted by the pipeline Needs X infrastructure gap — requires baseline run, annotation, or LLM judge

Plugin Configuration — the knobs applied to this session

2 computed 4 target ▼

These are the independent variables — the treatment applied to each plugin session. If you're comparing plugin configurations, these must be recorded per-session or you cannot do fair attribution. These exist in the plugin config and are available at session start.

Signal	Source	Type	Status	Description	Validity / Pitfalls
plugin_scope config.plugin_scope	Config	enum	Target	Scope of the plugin this session: `context-only` / `context+plan` / `context+plan+routing`. Determines which signal layers are applicable.	Gates interpretation of all other signals. A plan_drift signal on a context-only session is vacuous.
context_strategy config.context_strategy	Config	enum	Extracted	Strategy selected by the plugin for this session: `bypass` / `dump` / `focused` / `distill` / `trace`. Different strategies have different expected precision/recall profiles.	Must be recorded to interpret context_precision and context_bloat_score. Dump strategy will always show low precision; that is expected.
plan_artifact_enabled config.plan_artifact_enabled	Config	bool	Target	Whether plan creation is active this session. When false, all Layer 3 plan signals are not applicable.	Required gate for Layer 3 signals. Do not compute plan_drift_score when this is false.
system_reminders_enabled config.system_reminders_enabled	Config	bool	Target	Whether soft-nudge injections (system reminders) are active this session. Affects reminder_injection_count in Layer 3.	Without this flag, reminder_injection_count is uninterpretable — zero injections might mean reminders are disabled, not that none were needed.
agent_id config.agent_id	Config	str	Extracted	Downstream agent receiving the context pack from this plugin session. Required for any cross-agent comparison of plugin effectiveness.	Plugin outcomes are agent-specific. A context pack that helps agent A may not help agent B with a different context window or tool set.
model_id config.model_id	Config	str	Target	Model used by the plugin for context building and plan generation. Include the full model string — provider + version.	Separates plugin model quality from downstream agent model quality when attributing outcome deltas.

Context Behavior — what the plugin selected and what the agent used

3 computed 6 need artifact 1 target ▼

Captures what the plugin selected versus what the agent actually used. These signals answer the core context quality question: did the plugin give the agent the right files? High-value signals are context_precision, context_recall, and context_token_cost.

Key tension: high precision + low recall = over-filtering (plugin picked too few files but they were relevant). Low precision + high recall = bloat (plugin grabbed everything including noise). Both are failure modes with different costs.

Plugin session lifecycle phases — context behavior across modes

Optimal

context_pack→ plan_create→ agent_run→ verify→ handoff

Context-only

context_pack→ agent_run→ verify no plan artifact — Layer 3 signals not applicable

Failed / no context

agent_run→ verify_fail plugin bypassed or errored — baseline condition

Signal	Source	Type	Status	Description	Validity / Pitfalls
context_files_selected plugin.context_files_selected	Plugin	list[str]	Extracted	Files the plugin selected for the context pack before the agent acted. The raw selection list from the plugin's context-building phase.	Foundation for all derived context quality signals. Without this list, precision and recall cannot be computed.
context_files_used_by_agent harness.context_files_used_by_agent	Harness	list[str]	Needs session JSONL	Files the agent actually read or modified during the session. Compare with `context_files_selected` to derive precision and bloat.	Requires parsing agent session JSONL for file access events. Not the same as touched_files — includes reads that did not result in edits.
missed_relevant_files derived.missed_relevant_files	Derived	list[str]	Needs artifact	Files proven relevant but omitted by the plugin. Requires ground-truth `relevant_files` in the task spec or gold patch to compute.	Requires ground-truth annotation. In eval settings, gold patch files provide this. In product settings, needs LLM judge or manual annotation.
context_precision derived.context_precision	Derived	float	Computed	Fraction of plugin-selected files actually read or edited by the agent. `\|selected ∩ agent_used\| / \|selected\|`. Supporting signal — always pair with coverage. High precision + low coverage = over-filtering. Prefer `first_tool_precision` as the primary KPI.	Can look artificially high if plugin selects very few files. A plugin returning 0 files scores 100% precision — always read alongside coverage. Downgraded from ★: use first_tool_precision as the primary context-quality KPI instead.
context_recall derived.context_recall	Derived	float	Computed	Fraction of agent-used files that were pre-selected by the plugin. `\|selected ∩ agent_used\| / \|agent_used\|`. Measures how much of what the agent actually needed was pre-covered. Not the same as ground-truth recall — see `context_relevant_recall` (target) for that.	Computed from session JSONL — no annotation required. High coverage = plugin anticipated the agent's file needs well. Low coverage = agent explored significantly beyond the plugin's selection.
context_relevant_recall derived.context_relevant_recall	Derived	float	Target	Ground-truth recall: `\|selected ∩ relevant\| / \|relevant\|`. Fraction of annotated-relevant files the plugin actually selected. Requires a `relevant_files` ground-truth list from the task YAML or gold patch. Compare with `context_recall` (which uses agent-used as proxy for relevant) to measure how well agent behavior proxies ground truth.	Requires annotation — not computable from session JSONL alone. Use `context_recall` as the zero-annotation proxy while building the annotation pipeline. Null when `relevant_files` is empty.
context_bloat_score derived.context_bloat_score	Derived	float	Computed	Fraction of selected files that were never accessed by the agent. `\|selected − agent_used\| / \|selected\|`. High bloat wastes agent context window.	1 − context_precision. High bloat is always costly — unused context occupies token budget without benefit.
context_noise_ratio derived.context_noise_ratio	Derived	float	Needs artifact	Percentage of irrelevant chunks in the context pack at the chunk/passage level (not just file level). A file can be selected but only partially relevant.	Finer-grained than bloat_score. Requires chunk-level relevance annotation. Use file-level signals first.
context_token_cost ★ harness.context_token_cost	Harness	int	Extracted	Tokens consumed by context injection into the agent prompt. Key cost signal — directly sets the floor for agent input token cost each session.	Primary lever for cost optimization. Pair with context_precision to distinguish cheap-but-useful from cheap-but-useless context packs.
file_selection_latency_ms harness.file_selection_latency_ms	Harness	int	Extracted	Time spent building the context pack before the agent starts. Adds to plugin_latency_ms. High values suggest expensive retrieval or embedding.	Direct user-perceived latency cost of the plugin. Latency + context quality tradeoff is core plugin design problem.

Planning / Decision Support — how the plugin shapes agent decisions

4 need artifact 3 target ▼

Signals from the plugin's planning and decision-support layer: plan artifacts, milestone detection, blockers, and reminder injections. Only applicable when plugin_scope includes planning (context+plan or context+plan+routing).

Note on plan drift: drift is not always bad. Legitimate task discovery causes benign drift. Flag plan_drift_score only when correlated with downstream failure, not as a standalone quality signal.

Signal	Source	Type	Status	Description	Validity / Pitfalls
plan_artifact_created plugin.plan_artifact_created	Plugin	bool	Needs artifact	Whether plugin created a persistent plan object this session. Only meaningful in `context+planning` scope.	Gate for all other plan signals. If false, plan_updated_count, plan_drift_score, and plan_phase_transitions are not applicable.
plan_updated_count plugin.plan_updated_count	Plugin	int	Needs artifact	Number of times the plan was updated mid-session. Zero updates may indicate the plugin stopped tracking, or the initial plan was correct.	Interpret alongside plan_drift_score. Many updates + low drift = healthy iterative refinement. Few updates + high drift = plan was abandoned silently.
plan_drift_score derived.plan_drift_score	Derived	float	Needs artifact	Divergence between the initial plan and the final path taken. 0 = no drift. Computed as edit distance between initial and final plan state.	Drift is not always bad — legitimate discovery causes benign drift. Flag only when correlated with downstream failure. Never use alone as a quality signal.
milestone_detected ★ plugin.milestone_detected	Plugin	bool	Target	Whether plugin detected meaningful progress or milestone completion during the session. Leading indicator of session trajectory.	Positive signal when correlated with successful downstream outcome. False + successful outcome may indicate the plugin's milestone detection is too conservative.
blocker_detected plugin.blocker_detected	Plugin	bool	Target	Whether plugin flagged a blocking issue during the session. Blockers should correlate with downstream failure or retry events.	Calibrate against downstream outcome. High false-positive blocker rate = plugin is too conservative and interrupts good sessions.
plan_phase_transitions plugin.plan_phase_transitions	Plugin	list[str]	Needs artifact	Sequence of plan phase changes observed during the session. Analogous to plannotator phases in the agent — tracks plugin's view of session progress.	Structural, no inference required if the plugin emits phase events. Valuable for understanding plugin behavior across session types.
reminder_injection_count plugin.reminder_injection_count	Plugin	int	Target	Number of soft-nudge reminders injected into the agent context during the session. Only meaningful when `system_reminders_enabled = true`.	High counts may indicate the agent is repeatedly deviating from the plan. Correlate with plan_drift_score and downstream outcome.

Downstream Outcome Attribution — what did the plugin actually change

2 extracted 5 need baseline or LLM judge ▼

The plugin's value is measured by what it changes in downstream agent outcomes. These signals require a clean baseline run (same agent, same tasks, no plugin) to compute deltas.

Key constraint: system noise can obscure deltas smaller than ~5 percentage points. Run at least 3 trials per condition before drawing conclusions. A single-run delta is not attributable.

Signal	Source	Type	Status	Description	Validity / Pitfalls
agent_outcome_delta ★ derived.agent_outcome_delta	Derived	float	Needs baseline	Pass rate improvement vs baseline (same agent, no plugin). Positive = plugin helped. Computed as `pass_at_1 − pass_at_1_baseline`.	Requires a clean baseline run. System noise can obscure deltas < 5pp. Run ≥3 trials per condition. A single-run delta is not attributable to the plugin.
pass_at_1 ★ harness.pass_at_1	Harness	float	Extracted	P(success in first attempt) with plugin active. The primary eval metric for plugin-assisted sessions.	Only interpretable as a quality signal when compared against pass_at_1_baseline from a clean no-plugin run.
pass_at_1_baseline harness.pass_at_1_baseline	Harness	float	Needs baseline	P(success in first attempt) without plugin. Run baseline condition separately on the same task set and agent configuration.	Must use the same agent, model, task set, and tool caps. Any difference in those confounds the comparison.
resolve_rate_delta derived.resolve_rate_delta	Derived	float	Needs baseline	Percentage point change in fully resolved tasks with vs without plugin. Broader than pass_at_1 if retries are counted.	Distinguish from pass_at_1_delta: resolve_rate includes retry successes. Track both for full picture.
retry_count_delta derived.retry_count_delta	Derived	float	Needs baseline	Change in average retries needed with vs without plugin. Negative = plugin reduces retries. Plugin that improves pass@1 but not retry count has narrow impact.	Negative delta is the target. A plugin that reduces retries saves significant cost even if overall resolve rate is similar.
verification_pass_delta derived.verification_pass_delta	Derived	float	Needs baseline	Change in test-pass rate attributable to plugin. Finer-grained than resolve_rate_delta — tracks partial improvement where some tests now pass.	Only meaningful in eval contexts with a test harness. Not applicable in pure product sessions.
faithfulness_score llm-judge.faithfulness_score	LLM-judge	float	Needs LLM judge	Percentage of agent claims supported by plugin-provided context. High faithfulness = agent stayed grounded in the context pack.	LLM-as-judge can hallucinate judgements. Calibrate against 10% human spot-check before trusting at scale.

Decision Quality — causal signals on how well the agent navigated the task

7 extracted from session 2 need annotation ▼

These signals are computed directly from the agent session JSONL — no extra annotation required. They distinguish good navigation (direct path to the right file, error handled and recovered) from poor navigation (repeated failed calls, thrashing across files, spinning in a loop). They are causal: a plugin that improves these numbers is provably improving the agent's decision-making, not just its final output.

Signal	Source	Type	Status	Description	Validity / Pitfalls
first_tool_precision ★ derived.first_tool_precision	Derived	float	Needs annotation	Was the agent's first file read/search a relevant file? 1.0 = yes, 0.0 = no. Requires `relevant_files` in task spec. The fastest proxy for context quality — a plugin that improves this is directly reducing wasted tool calls.	Requires ground-truth annotation in the task YAML. Only applicable when `relevant_files` is populated. Binary 0/1 — aggregate across many tasks for a stable metric.
time_to_first_correct_file_read ★ derived.time_to_first_correct_file_read	Derived	int (ms)	Needs annotation	Milliseconds from session start until the agent first reads a ground-truth relevant file. Requires `relevant_files` annotation. A delta here directly measures how much the plugin shortened the orientation phase.	Same annotation dependency as first_tool_precision. Sensitive to task complexity — normalize within task family when aggregating. Null if no relevant file was ever read.
error_loop_detected ★ derived.error_loop_detected	Derived	bool	Extracted	True if the agent called the same (tool, input) pair 3+ times in a session. A strong indicator of a stuck loop. Computed from session JSONL with no extra annotation.	False positives possible for legitimate repeated operations (e.g. polling). Key=3 is a heuristic — tune threshold per task type if needed.
error_recovery_turns ★ derived.error_recovery_turns	Derived	int	Extracted	Number of turns between the first error recovery milestone and the next verification success. Null if no recovery milestone exists. Measures recovery efficiency — lower is better.	Null for sessions that never hit an error recovery milestone. Compare across sessions only when both have the milestone.
decision_reversal_count derived.decision_reversal_count	Derived	int	Extracted	Number of times the agent switched its edit target to a different file. Each A→B file switch = +1. Measures thrashing — a plugin that reduces this is genuinely reducing wasted work.	Benign for naturally multi-file tasks. Combine with total edit count and wallClockMs to contextualize. A high count in a single-file task is a red flag.
tool_error_rate derived.tool_error_rate	Derived	float	Extracted	Fraction of all tool calls that returned a non-zero exit code. High error rate = agent is guessing rather than reasoning. Computed from session JSONL exit codes.	Includes expected errors (e.g. a test run that fails as the baseline). Pair with `error_recovery_turns` to distinguish tolerable vs. pathological error rates.
retry_after_error_count derived.retry_after_error_count	Derived	int	Extracted	Number of times a failed tool call was immediately followed by the same call with the same input. "Blind retry" pattern — the agent saw an error and repeated the action unchanged.	Low-count false positives possible if identical retries are intentional (e.g. network retry logic in bash). One or two is normal; five or more in a session warrants inspection.
wrong_path_depth derived.wrong_path_depth	Derived	int	Not yet implemented	Number of tool calls made after the agent committed to an incorrect approach, before recovering or failing. Measures depth of wrong turns. Requires "wrong approach" detection heuristic or LLM judge.	Requires identifying when an agent is "on the wrong path" — non-trivial without ground truth. Defer to Phase 2 unless gold trajectory is available.
first_decision_correct derived.first_decision_correct	Derived	bool	Needs annotation	Whether the agent's first substantive action (first edit or first tool call) was on a relevant file. Coarser version of first_tool_precision — useful when relevant_files annotation is available.	Requires ground-truth annotation. Equivalent to first_tool_precision > 0 — prefer that signal for aggregation. Useful only as a boolean breakdown in task-level analysis.

Context Cost / Bloat — what the plugin costs to run

3 extracted 4 need baseline or session JSONL ▼

Plugin overhead in latency, tokens, and dollars. A well-designed plugin should reduce net token usage by giving the agent better context upfront, reducing thrashing and retries. Track both plugin-added cost and plugin-saved cost to get the true delta.

Signal	Source	Type	Status	Description	Validity / Pitfalls
plugin_latency_ms ★ harness.plugin_latency_ms	Harness	int	Extracted	Total plugin overhead (context-pack build + plan setup) before the agent starts. Directly adds to user-perceived time-to-first-agent-action.	Latency + quality tradeoff is the core plugin design question. A 2s plugin that improves pass@1 by 10pp is different from a 10s plugin that improves it by 2pp.
context_pack_tokens ★ harness.context_pack_tokens	Harness	int	Extracted	Token cost of the context pack injected into the agent prompt. Sets the floor for agent input token cost this session.	Same as context_token_cost in Layer 2 — included here as the primary cost accounting signal for Layer 5. Track alongside tokens_per_task_delta.
tokens_per_task_delta ★ derived.tokens_per_task_delta	Derived	int	Needs baseline	Change in total token usage with vs without plugin. Negative = plugin saves tokens (agent thrashes less because context is better). Positive = plugin adds net cost.	The most important cost signal. A plugin that adds 2k tokens upfront but saves 15k in agent exploration is a net win. Never evaluate cost without this delta.
cost_per_task_delta ★ derived.cost_per_task_delta	Derived	float	Needs baseline	USD cost delta attributable to plugin overhead plus downstream savings. The bottom-line cost signal.	Requires accurate per-token cost accounting for both plugin model and downstream agent model. Include both in calculation.
tool_calls_per_task_delta derived.tool_calls_per_task_delta	Derived	int	Needs baseline	Change in number of agent tool calls with vs without plugin. A good context pack reduces thrashing — fewer redundant reads and grep loops.	Proxy for agent exploration efficiency. Negative delta = plugin reduces thrashing. Positive = plugin may be confusing the agent.
plugin_error_rate harness.plugin_error_rate	Harness	float	Extracted	Rate of plugin failures or timeouts. A plugin that fails silently is worse than one that fails loudly — the agent runs without context and may not know it.	High error rate invalidates all other plugin signals for affected sessions. Always filter on plugin_error_rate before computing quality metrics.
context_growth_curve derived.context_growth_curve	Derived	list[int]	Needs session JSONL	Token count growth over session turns. A good plugin should flatten this curve — agent needs fewer additional reads because context was pre-loaded.	Compare shape across plugin vs no-plugin baseline. Flatter curve in plugin condition = plugin is substituting for in-session exploration.

Context-Mode Runtime — MCP hooks, routing, compaction, think-in-code

18 target ▼

Signals specific to the context-mode MCP+hooks runtime — the deeper plugin operating mode where Claude Code hooks intercept raw tool calls, substitute ctx_* commands, maintain a SQLite/FTS5 session store, and enforce routing rules.

All signals in this layer are Target — they require the context-mode runtime to emit structured telemetry. Once emitted, most are computable directly from the hook event log with no annotation.

Six signal families: routing/sandbox (tool interception), retrieval/index (BM25 search quality), continuity/compaction (ctx snapshot and restore), think-in-code (ctx_execute adoption), output compression (verbosity reduction), platform capability (hook availability and degraded-mode detection).

Signal	Source	Type	Status	Description	Validity / Pitfalls
Routing & Sandbox
tool_interception_rate ★ runtime.tool_interception_rate	Harness	float	Target	Fraction of raw tool calls routed through the context-mode hook layer rather than executed directly. 1.0 = full interception; <1.0 = hook missed some calls (hook coverage gap).	Requires hook event log. Low rate may indicate hooks not registered, partial env support, or agent bypassing the MCP layer. Should be 1.0 in a fully-wired context-mode session.
ctx_substitution_rate runtime.ctx_substitution_rate	Harness	float	Target	Fraction of intercepted tool calls that were substituted with a `ctx_*` command (e.g. `ctx_read`, `ctx_search`, `ctx_execute`) rather than passed through unchanged.	High rate = context-mode is actively enriching the session. Low rate = hooks are running but not substituting — check routing rules and command recognition patterns.
blocked_command_rate runtime.blocked_command_rate	Harness	float	Target	Fraction of tool calls blocked by routing rules (e.g. disallowed shell commands, out-of-scope writes). Measures sandbox enforcement effectiveness.	High rate on legitimate commands = over-restrictive routing. Low rate = sandbox may not be enforced. Pair with agent outcome delta to check if blocking is helping or hurting.
Retrieval & Index
search_hit_rate ★ runtime.search_hit_rate	Harness	float	Target	Fraction of `ctx_search` calls that returned at least one result. Low rate indicates the FTS5/BM25 index is stale, missing, or queries are poorly formed.	Always compare to `context_recall`: a high search hit rate with low recall means the index returns results but they are not the right ones.
bm25_reuse_rate runtime.bm25_reuse_rate	Harness	float	Target	Fraction of `ctx_search` calls that were served from the cached BM25 index without re-indexing. Proxy for index freshness management cost.	High reuse rate + low hit rate = stale index. Reindex threshold needs tuning. Track alongside file change rate to calibrate.
fetch_cache_hit_rate runtime.fetch_cache_hit_rate	Harness	float	Target	Fraction of `ctx_fetch` calls served from the session fetch cache. Measures how effectively repeated URL fetches are deduplicated across a session.	Low rate on repeated identical fetches = cache is not keying correctly. Track alongside fetch latency to understand cost savings.
Continuity & Compaction
snapshot_created ★ runtime.snapshot_created	Harness	bool	Target	Whether the context-mode runtime created a session snapshot before compaction fired. Gate signal for all compaction-recovery signals.	If false and compaction fired, the agent resumes cold. Track rate across sessions — target is 1.0 on all sessions that reach compaction threshold.
restore_success ★ runtime.restore_success	Harness	bool	Target	Whether the agent successfully restored working context from the snapshot after compaction. Null if no compaction event occurred in the session.	Only meaningful when `snapshot_created = true` and a compaction event is detected. A restore failure = cold restart despite snapshot existing (injection bug).
time_to_first_productive_action_post_compaction ★ runtime.ttfpa_post_compaction_ms	Harness	int (ms)	Target	Milliseconds from compaction event to first non-orientation tool call (first file edit or meaningful search). Measures how quickly the agent recovers context after compaction. Null if no compaction occurred.	Compare against no-snapshot baseline (same agent, same session, snapshot disabled). Delta is the plugin's continuity contribution. Sensitive to task complexity — normalize within task family.
compaction_event_count runtime.compaction_event_count	Harness	int	Target	Number of context compaction events detected in the session. Zero on short sessions; ≥1 indicates the session reached the model's context limit.	Useful for segmenting sessions: compaction sessions are a distinct analysis cohort where continuity signals matter most.
Think-in-Code
ctx_execute_adoption_rate ★ runtime.ctx_execute_adoption_rate	Harness	float	Target	Fraction of agent bash calls replaced by `ctx_execute` (think-in-code mode). High adoption = agent is using the structured execution path with result capture and deduplication.	Low adoption with hooks active = routing rules for bash substitution are not triggering. Check command pattern matching. Pair with `file_read_avoidance_ratio`.
file_read_avoidance_ratio runtime.file_read_avoidance_ratio	Harness	float	Target	Fraction of files that the agent did not need to read because they were pre-loaded into context by the plugin. Measures how much redundant file I/O the plugin eliminates.	Requires tracking which files were in the context pack vs which the agent attempted to read. High ratio = plugin is successfully front-loading the right files.
Output Compression
assistant_token_delta runtime.assistant_token_delta	Harness	int	Target	Change in total assistant output tokens with output compression active vs baseline (no compression). Negative = compression is reducing verbose narration. Positive = compression prompts are adding overhead.	Requires a no-compression baseline on the same task. Token savings must be weighed against any quality loss — track alongside pass_at_1 to confirm compression does not regress outcomes.
verbosity_ratio runtime.verbosity_ratio	Harness	float	Target	Ratio of assistant reasoning tokens to action tokens (tool calls + edits). High verbosity = agent is narrating heavily relative to acting. Context-mode's output compression system targets this ratio.	Computed from session JSONL — split assistant messages into reasoning vs tool-call content. A declining trajectory after the first few turns is the target pattern.
Platform Capability
hook_availability ★ runtime.hook_availability	Harness	bool	Target	Whether the Claude Code hooks system was available and registered at session start. Gate for all routing and think-in-code signals. False = context-mode degraded to context-only.	Must be recorded per session. Always segment analysis by this flag — mixing hook-available and hook-unavailable sessions masks the plugin's true capability delta.
degraded_mode_active runtime.degraded_mode_active	Harness	bool	Target	Whether the plugin detected an unsupported platform and fell back to a degraded operating mode (e.g. no hooks, read-only context pack, no routing enforcement). Sessions in degraded mode should be excluded from full-signal analysis.	Degraded-mode sessions are not comparable to full-capability sessions. Always filter before computing routing, think-in-code, or compression signals.
mcp_server_latency_ms runtime.mcp_server_latency_ms	Harness	int	Target	Round-trip latency of MCP server calls from the hook layer. High latency here adds directly to the agent's perceived tool call time. Separate from `plugin_latency_ms` which covers context-pack build time.	Track p50 and p95 separately — occasional spikes matter more than mean latency since a slow MCP call blocks the entire agent turn.

Intent fit, memory, and handoff — the product-side plugin signals

These signals apply to multi-turn, user-in-the-loop sessions where plugin memory and handoff matter. Context fit, memory reuse, and session continuity are the dominant questions here — they are largely irrelevant in deterministic one-shot eval runs. Signals in this tab will be always-null for pure eval runs and should not be collected there.

Memory caveat: stale or contradicting memories surfaced by the plugin can actively harm session quality. Track stale_memory_hit_count alongside repo_memory_hit_count. A plugin that surfaces many memories but most are stale may be worse than one that surfaces fewer but more accurate ones.

Intent / Context Fit — did the plugin understand what was actually needed

0 extracted 5 require LLM judge or session artifact ▼

Did the plugin select context that matched the actual task intent? Did the context pack reflect the right problem framing, or did it surface related-but-wrong files? These signals are derived post-session and require either a session artifact or an LLM-as-judge pass. They answer a different question than precision/recall: not just "were the right files selected" but "did the plugin understand why."

Signal	Source	Type	Status	Description	Validity / Pitfalls
context_relevancy_score ★ derived.context_relevancy_score	Derived	float	Needs artifact	How well selected context matches the actual task intent (RAGAS-style). Measures semantic alignment between context pack contents and the task description.	RAGAS-style metrics require an embedding model. Can be computed without LLM judge. Best used as a continuous signal rather than binary threshold.
task_intent_match llm-judge.task_intent_match	LLM-judge	float	Needs LLM judge	Did the context pack reflect the right problem framing? Requires an LLM judge with access to the task ground truth and the context pack contents.	Requires LLM judge with task ground truth. More expensive than embedding-based relevancy. Use context_relevancy_score as a cheaper proxy first.
instruction_adherence ★ derived.instruction_adherence	Derived	float	Needs session JSONL	Did the agent follow the plan structure the plugin provided? Measures alignment between plugin-provided plan and agent's actual action sequence.	Low adherence may mean the plan was wrong (plugin failure) or the agent ignored it (agent failure). Pair with agent_outcome_delta to distinguish.
plan_optimality llm-judge.plan_optimality	LLM-judge	float	Needs LLM judge	How good was the plugin's plan vs. the ideal path actually taken? Requires comparing the plugin's initial plan to the successful solution trajectory.	LLM-as-judge can hallucinate. Even top models perform poorly on plan quality evaluation. Requires careful judge design with explicit rubric.
context_fit_by_strategy derived.context_fit_by_strategy	Derived	dict	Needs artifact	Precision and recall breakdown per context strategy used. Answers: does `focused` strategy outperform `distill` on repo-level tasks?	Requires stratifying sessions by context_strategy. Minimum ~20 sessions per strategy for statistically meaningful comparison.

Memory + Handoff Dynamics — how the plugin preserves and transfers knowledge

5 need next-session data or artifact 3 target ▼

Memory and handoff are the plugin's mechanism for avoiding cold-start on every session. These signals measure whether the plugin's memory system is surfacing useful prior facts and whether handoff artifacts enable the next session to resume effectively.

Key caveat: handoff quality depends partly on the downstream agent. Compare against a no-handoff baseline (same agent, no summary injected) before attributing session continuity improvements to the plugin's handoff artifact quality.

Signal	Source	Type	Status	Description	Validity / Pitfalls
handoff_summary_created plugin.handoff_summary_created	Plugin	bool	Target	Whether plugin produced a session-end summary artifact. Gate for all handoff quality signals — if false, handoff_success and prior_session_reuse are not applicable.	Track rate of handoff creation across sessions. Low rate may indicate plugin is timing out or failing at session close.
handoff_success ★ derived.handoff_success	Derived	float	Needs next-session data	Whether the next session resumed effectively using the handoff summary. Measured as delta in time-to-first-productive-action vs a no-handoff baseline.	Depends partly on the downstream agent — compare against no-plugin baseline. Requires cross-session linking. Most expensive signal to compute but highest signal quality.
repo_memory_hit_count ★ plugin.repo_memory_hit_count	Plugin	int	Target	Prior session facts successfully reused in the current session. Each hit = a memory item was surfaced and used by the agent.	Interpret alongside stale_memory_hit_count. High total hits but also high stale hits = memory precision problem. Net useful hits = repo_memory_hit_count − stale_memory_hit_count.
stale_memory_hit_count plugin.stale_memory_hit_count	Plugin	int	Target	Old or misleading memories surfaced by the plugin. High stale count degrades performance — agent receives outdated context and may act on it.	High stale count is actively harmful, not neutral. Track as a quality regression signal, not just informational.
retrieval_recall derived.retrieval_recall	Derived	float	Needs artifact	Percentage of relevant prior memory surfaced by the plugin. Requires annotated ground truth of which prior memories were relevant to this session.	Requires ground-truth annotation across sessions. Start with heuristic proxies (time recency, file overlap) before full annotation.
retrieval_precision derived.retrieval_precision	Derived	float	Needs artifact	Percentage of surfaced memory that was actually useful in this session. High precision + low recall = memory is conservative but accurate.	Pair with retrieval_recall. Both together give the full picture of memory quality. Use repo_memory_hit_count as a cheaper proxy while building annotation pipeline.
memory_conflict_resolution derived.memory_conflict_resolution	Derived	bool	Needs artifact	Plugin correctly detected and resolved contradicting facts across sessions. Binary: did the plugin surface the more recent/correct version when there was a conflict?	Conflict resolution requires the plugin to maintain versioned memory or recency ordering. Absence of conflict detection means both old and new facts may be surfaced simultaneously.
session_continuity_score llm-judge.session_continuity_score	LLM-judge	float	Needs LLM judge	How smoothly did the next session resume from the handoff summary? LLM judge evaluates quality of transition — did agent pick up where previous left off?	LLM judge required. Use handoff_success as a cheaper proxy first. Requires access to both sessions for comparison.

User / Session Success Signals — did the plugin help the user succeed

1 computed 5 need baseline or session JSONL 2 target ▼

Final-layer signals measuring whether the plugin improved the user's experience and session outcome. These are the most user-proximate signals in the inventory. Collect everything raw. Interpret nothing at collection time. Pair each signal against a no-plugin baseline before drawing conclusions.

Signal	Source	Type	Status	Description	Validity / Pitfalls
Outcome & Task Completion
task_completion_rate ★ harness.task_completion_rate	Harness	float	Extracted	Percentage of tasks fully resolved with plugin assistance. The primary user-facing quality metric for plugin-assisted sessions.	Only meaningful alongside task_completion_rate_baseline. Never report absolute completion rate without the no-plugin comparator.
session_outcome ★ derived.session_outcome	Derived	enum	Needs baseline	How did the session compare to the no-plugin baseline? Values: `improved` / `neutral` / `regressed`. Gating label for all per-session quality inference.	Derive from agent_outcome_delta. A plugin that regresses on more than 5% of sessions needs investigation regardless of aggregate improvement.
output_applied harness.output_applied	Harness	bool	Target	Whether user accepted and applied the agent's output. Detected from git state at session close. Strongest quality proxy without asking the user directly.	Completely objective. Git commit after session = output was accepted. Track as rate over sessions, not just binary per session.
Time & Effort Deltas
time_to_first_edit_delta ★ derived.time_to_first_edit_delta	Derived	float	Needs baseline	Change in seconds to first file modification with vs without plugin. A good context pack helps the agent start faster — negative delta is the target.	Includes plugin_latency_ms in the with-plugin condition. Net delta = (plugin latency + time to first edit with plugin) − (time to first edit without plugin).
user_effort_delta derived.user_effort_delta	Derived	float	Needs baseline	Change in user messages needed per task. A plugin that reduces steering effort (fewer corrections, fewer clarifications) is directly improving UX.	Requires session JSONL to count user turns. Compare same tasks across plugin vs no-plugin conditions.
Agent Behavior Signals
agent_hedging_curve derived.agent_hedging_curve	Derived	list	Needs session JSONL	Rate of hedged language ("I think", "might be", "probably") per agent turn. A good context pack should reduce hedging — agent is more confident when it has the right files upfront.	Pure regex on session JSONL assistant turns. A declining trajectory in the plugin condition vs flat/rising in baseline is the target pattern.
Multi-Session & Handoff
prior_session_reuse ★ plugin.prior_session_reuse	Plugin	float	Target	Percentage of handoff summary content actually referenced in the next session. Measures whether the previous session's handoff artifact was used, not just produced.	Low reuse may indicate the handoff artifact was produced but not injected correctly, or was injected but not relevant. Distinguish these cases before acting.
per_session_cost_delta derived.per_session_cost_delta	Derived	float	Needs baseline	USD cost change per session with plugin vs without. Includes plugin overhead plus downstream savings from reduced agent exploration and retries.	The bottom-line cost signal for the product context. Negative = plugin saves money per session net of its own overhead. This is the target.