Minimatics Evaluation System

Test Socratic coaching adherence across multi-turn conversations.

Quick Start

# Basic test
pixi run python -m minimatics.test_runner_minimatics --max-turns 10

# Test different temperatures
pixi run python -m minimatics.test_runner_minimatics --temperature 0.3 --max-turns 5
pixi run python -m minimatics.test_runner_minimatics --temperature 1.0 --max-turns 5

# Multiple conversations in parallel
for i in {1..3}; do pixi run python -m minimatics.test_runner_minimatics --max-turns 8 & done; wait

# Analyze results
pixi run python minimatics/conversation_analyzer.py --summary

Backend bridge (minimatics router @ 8081)

The conversation runner now proxies every guru turn through the studio-backend Minimatics router (the same /api/minimatics/getGuruResponse path used by the web app). Before running the tests you must:

  1. Start studio-backend locally on port 8081 (see that repo's README – e.g. pnpm dev).
  2. Provide a valid Studio account for automation via the following .env entries:
MINIMATICS_BACKEND_BASE_URL=http://localhost:8081/api
MINIMATICS_USER_EMAIL=<studio login email>
MINIMATICS_USER_PASSWORD=<studio login password>
MINIMATICS_PROJECT_TITLE=GuruBench Automation
# MINIMATICS_BACKEND_TIMEOUT=30  # optional override (seconds)

Each test run automatically logs in, creates a fresh Minimatics project/chat session, and replays the real tool-calling flow end-to-end.

Use --guru-version-id to select the backend guru version for each run (required; the value must match the backend's naming). --system-prompt only supplies a testModePrompt override; it does not select the backend version.

What It Tests

  • Socratic Form: Single open questions (15-40 words)
  • Length Adherence: Concise vs verbose responses
  • Phase Detection: Character → Environment → Story flow
  • Drift Analysis: Performance changes over time

Configuration Groups

Results are grouped by: system_prompt + temperature + model_version

Example output:

📋 GURU (T=0.3, gpt-4.1)  ← Lower temp: more verbose
📋 GURU (T=1.0, gpt-4.1)  ← Higher temp: more concise
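A minimal sketch of how this grouping can be computed from run records. The grouping key mirrors the fields listed above, but the record schema (`system_prompt`, `temperature`, `model_version` as top-level keys) is an assumption for illustration, not the actual report format:

```python
from collections import defaultdict


def config_key(run: dict) -> tuple:
    # Grouping key: system_prompt + temperature + model_version.
    # Field names are assumed, not taken from the real report schema.
    return (run["system_prompt"], run["temperature"], run["model_version"])


def group_runs(runs: list[dict]) -> dict[tuple, list[dict]]:
    # Bucket run records by their configuration key.
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for run in runs:
        groups[config_key(run)].append(run)
    return dict(groups)
```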

Files

  • test_runner_minimatics.py - Main CLI
  • conversation_analyzer.py - Results analysis
  • ../system_prompts_minimatics/ - Guru prompts
  • personas/ - User behavior profiles
  • results/ - JSON reports

Growth Strategy: step-by-step ablation

Objectives (targeted failures to remediate)

  1. Termination non-compliance – the assistant ignores explicit "End the story" signals, and the exit criteria are opaque to the user, i.e., it is hard for the user to predict what "done" looks like.
  2. Redundant clarification loops – the assistant re-asks for context already provided in the opening exchange instead of advancing phases. When the user front-loads context, the assistant should progress through phases without re-asking.
  3. Constraint drift – the assistant drops core narrative constraints (e.g., a teen bullied at school) and prematurely pivots to an unrelated setting (e.g., a forest).

Approach

1. Establish Baseline

  • Pin prompt version, model temperature, and random seed.
  • Audit the current metric suite to confirm it captures the minimum viable behavior for any Guru version: reliably driving a conversation end-to-end and producing a complete Minimatics output.
  • Run n≄30 GuruBench Minimatics conversations; record per-objective pass rates as baseline metrics.
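The per-objective pass-rate bookkeeping can be sketched as follows. The conversation record layout (one boolean flag per objective) is an assumption for illustration, not the analyzer's actual output schema:

```python
def pass_rates(conversations: list[dict], objectives: list[str]) -> dict[str, float]:
    # Per-objective pass rate over a batch of n conversations.
    # Assumes each conversation record carries one boolean per objective
    # (e.g. {"termination": True, "no_redundant_loops": False, ...}).
    if not conversations:
        raise ValueError("need at least one conversation")
    return {
        obj: sum(1 for c in conversations if c.get(obj, False)) / len(conversations)
        for obj in objectives
    }
```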

2. Validate Metrics

  • Confirm each objective maps to ≄1 metric that correlates with human judgment (spot-check sample).
  • Decide scoring type per metric: binary | categorical | continuous.

3. Ablation Phase - One Goal at a Time

For each objective:

  1. Hypothesize – state the expected behavioral change and the metric(s) it should move.
  2. Intervene – apply a minimal prompt delta; snapshot it to a versioned file in system_prompts_minimatics/. (The latest system_prompts_minimatics/.txt becomes the starting point for the new version.)
  3. De-bloat – run prompt_debloat.py --regenerate.
  4. Evaluate – run n≄30 conversations and compare to baseline: pixi run python -m minimatics.test_runner_minimatics --count 30.
  5. Accept/Reject – accept if the target metric reaches ≄90% and no other metric regresses by more than 5 pp from baseline.
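The accept/reject rule can be expressed as a small predicate. This is a sketch under assumptions: metrics are represented as a dict of pass rates in [0, 1], and the names are placeholders:

```python
def accept_ablation(baseline: dict[str, float], candidate: dict[str, float],
                    target_metric: str, threshold: float = 0.90,
                    max_regression_pp: float = 5.0) -> bool:
    # Accept only if the target metric reaches the threshold (default 90%)
    # AND no other metric drops more than max_regression_pp percentage
    # points below its baseline value.
    if candidate[target_metric] < threshold:
        return False
    for metric, base_rate in baseline.items():
        if metric == target_metric:
            continue
        if (base_rate - candidate[metric]) * 100 > max_regression_pp:
            return False
    return True
```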

4. Verify Behavioral Changes

  • After each ablation, validate the objective via its mapped metric(s).
  • Success Criteria: Objective met when its metric achieves ≄90% pass rate (p < 0.05 vs. baseline via proportion z-test).
  • Document effect size and any observed regressions for traceability.
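The proportion z-test named in the success criteria can be computed with the standard library alone. This is an illustrative pooled two-sample test, not necessarily the exact procedure the analyzer implements:

```python
import math


def proportion_ztest(pass_a: int, n_a: int, pass_b: int, n_b: int) -> tuple[float, float]:
    # Pooled two-sample z-test for proportions: compares the pass rate of
    # run B (candidate) against run A (baseline).
    # Returns (z statistic, two-sided p-value).
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF:
    # p = 2 * (1 - Phi(|z|)) = 1 - erf(|z| / sqrt(2))
    p = 1 - math.erf(abs(z) / math.sqrt(2))
    return z, p
```

For example, a jump from 18/30 passes at baseline to 28/30 after an ablation is significant at p < 0.05, while identical pass rates give z = 0 and p = 1.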