# Minimatics Evaluation System
Test Socratic coaching adherence across multi-turn conversations.
## Quick Start

```shell
# Basic test
pixi run python -m minimatics.test_runner_minimatics --max-turns 10

# Test different temperatures
pixi run python -m minimatics.test_runner_minimatics --temperature 0.3 --max-turns 5
pixi run python -m minimatics.test_runner_minimatics --temperature 1.0 --max-turns 5

# Multiple conversations in parallel
for i in {1..3}; do pixi run python -m minimatics.test_runner_minimatics --max-turns 8 & done; wait

# Analyze results
pixi run python minimatics/conversation_analyzer.py --summary
```
## Backend bridge (minimatics router @ 8081)
The conversation runner now proxies every guru turn through the studio-backend
Minimatics router (the same /api/minimatics/getGuruResponse path used by the
web app). Before running the tests you must:
- Start `studio-backend` locally on port `8081` (see that repo's README, e.g. `pnpm dev`).
- Provide a valid Studio account for automation via the following `.env` entries:
```shell
MINIMATICS_BACKEND_BASE_URL=http://localhost:8081/api
MINIMATICS_USER_EMAIL=<studio login email>
MINIMATICS_USER_PASSWORD=<studio login password>
MINIMATICS_PROJECT_TITLE=GuruBench Automation
# MINIMATICS_BACKEND_TIMEOUT=30  # optional override (seconds)
```
Each test run automatically logs in, creates a fresh Minimatics project/chat session, and replays the real tool-calling flow end-to-end.
Use `--guru-version-id` to select the backend guru version per run (required; must match the backend naming).
`--system-prompt` only supplies a `testModePrompt` override; it does not select the backend version.
## What It Tests

- **Socratic Form**: single open question (15-40 words)
- **Length Adherence**: concise vs. verbose responses
- **Phase Detection**: Character → Environment → Story flow
- **Drift Analysis**: performance changes over time
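The "Socratic Form" and "Length Adherence" checks above can be sketched as a single predicate. This is a hypothetical illustration, not the actual implementation in `conversation_analyzer.py`; the function name, open-question heuristic, and word window are assumptions drawn from the description above.

```python
import re

# Hypothetical starters signalling an open (non yes/no) question.
OPEN_STARTERS = ("what", "how", "why", "where", "when", "who", "which")

def is_socratic(turn: str) -> bool:
    """Sketch of the metric: exactly one open question, 15-40 words."""
    # Split after each '?' and keep only the segments that contain one.
    questions = [q for q in re.split(r"(?<=\?)", turn) if "?" in q]
    if len(questions) != 1:           # must be a single question
        return False
    if not 15 <= len(turn.split()) <= 40:  # length-adherence window
        return False
    first_word = questions[0].strip().split()[0].lower().strip('"\'')
    return first_word.startswith(OPEN_STARTERS)  # open, not yes/no

print(is_socratic(
    "What does your character fear most when the school day begins, "
    "and how does that fear shape their choices?"
))
```

A production version would likely also strip markdown and handle multi-sentence turns where the question is embedded mid-turn.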
## Configuration Groups

Results are grouped by: `system_prompt + temperature + model_version`

Example output:

```
GURU (T=0.3, gpt-4.1) → Lower temp: more verbose
GURU (T=1.0, gpt-4.1) → Higher temp: more concise
```
## Files

- `test_runner_minimatics.py` - main CLI
- `conversation_analyzer.py` - results analysis
- `../system_prompts_minimatics/` - guru prompts
- `personas/` - user behavior profiles
- `results/` - JSON reports
## Growth Strategy: step-by-step ablation

### Objectives (targeted failures to remediate)
- **Termination non-compliance**: the assistant ignores explicit "End the story" signals, and the exit criteria are opaque to the user, who cannot predict what "done" looks like.
- **Redundant clarification loops**: the assistant re-asks for context already provided in the opening exchange instead of advancing phases; when the user front-loads context, the assistant should progress through phases without re-asking.
- **Constraint drift**: the assistant dropped core narrative constraints (e.g., a teen bullied at school) and prematurely pivoted to an unrelated setting (e.g., a forest).
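The termination objective lends itself to a simple transcript scan. The sketch below assumes a hypothetical transcript format of `{"role": ..., "content": ...}` dicts; the real GuruBench JSON schema and the exact end-signal phrases may differ.

```python
# Assumed end-of-story phrases a user might send (illustrative only).
END_SIGNALS = ("end the story", "finish the story", "we're done")

def terminates_on_signal(transcript: list[dict]) -> bool:
    """True if the assistant stops probing once the user signals the end.

    Any assistant question ('?') after an explicit end signal counts as
    termination non-compliance.
    """
    signalled = False
    for turn in transcript:
        text = turn["content"].lower()
        if turn["role"] == "user" and any(s in text for s in END_SIGNALS):
            signalled = True
        elif turn["role"] == "assistant" and signalled and "?" in text:
            return False  # kept asking after being told to wrap up
    return True
```

A check like this could back a binary per-conversation metric for the first objective; the other two objectives need semantic comparisons (repeated-question detection, constraint tracking) rather than string matching.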
### Approach

#### 1. Establish Baseline
- Pin prompt version, model temperature, and random seed.
- Audit the current metric suite to confirm it captures the minimum viable behavior for any Guru version: reliably driving a conversation end-to-end and producing a complete Minimatics output.
- Run n≥30 GuruBench Minimatics conversations; record per-objective pass rates as baseline metrics.
#### 2. Validate Metrics

- Confirm each objective maps to ≥1 metric that correlates with human judgment (spot-check a sample).
- Decide scoring type per metric: binary | categorical | continuous.
#### 3. Ablation Phase - One Goal at a Time
For each objective:
1. **Hypothesize**: state the expected behavioral change and the metric(s) it should move.
2. **Intervene**: apply a minimal prompt delta; snapshot it to a versioned file in `system_prompts_minimatics/`. (The latest `.txt` in `system_prompts_minimatics/` becomes the starting point for the new version.)
3. **De-bloat**: run `prompt_debloat.py --regenerate`.
4. **Evaluate**: run n≥30 conversations (`pixi run python -m minimatics.test_runner_minimatics --count 30`); compare to baseline.
5. **Accept/Reject**: accept if the target metric is ≥90% and no other metric regresses >5 pp from baseline.
#### 4. Verify Behavioral Changes

- After each ablation, validate the objective via its mapped metric(s).
- **Success Criteria**: an objective is met when its metric achieves a ≥90% pass rate (p < 0.05 vs. baseline via proportion z-test).
- Document effect size and any observed regressions for traceability.
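The accept/reject decision rests on the proportion z-test mentioned above. A stdlib-only sketch of a one-sided two-proportion z-test (assumed pooled-variance form; the evaluation harness may use a library routine instead):

```python
import math

def two_proportion_z(passes_a: int, n_a: int, passes_b: int, n_b: int):
    """One-sided z-test that variant B's pass rate exceeds baseline A's.

    Returns (z statistic, one-sided p-value) under the pooled-proportion
    normal approximation.
    """
    p_a, p_b = passes_a / n_a, passes_b / n_b
    pooled = (passes_a + passes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability
    return z, p_value

# Illustrative numbers: baseline 18/30 passing vs 28/30 after an ablation.
z, p = two_proportion_z(18, 30, 28, 30)
print(f"z={z:.2f}, significant={p < 0.05}")
```

With n=30 per arm the normal approximation is rough near 0% or 100% pass rates; an exact test (e.g. Fisher's) would be safer at the extremes.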