Minimatics Evaluation System

Test Socratic coaching adherence across multi-turn conversations.

Quick Start

# Basic test
pixi run python -m minimatics.test_runner_minimatics --max-turns 10

# Test different temperatures
pixi run python -m minimatics.test_runner_minimatics --temperature 0.3 --max-turns 5
pixi run python -m minimatics.test_runner_minimatics --temperature 1.0 --max-turns 5

# Multiple conversations in parallel
for i in {1..3}; do pixi run python -m minimatics.test_runner_minimatics --max-turns 8 & done; wait

# Analyze results
pixi run python minimatics/conversation_analyzer.py --summary

Backend bridge (minimatics router @ 8081)

The conversation runner now proxies every guru turn through the studio-backend Minimatics router (the same /api/minimatics/getGuruResponse path used by the web app). Before running the tests you must:

  1. Start studio-backend locally on port 8081 (see that repo's README – e.g. pnpm dev).
  2. Provide a valid Studio account for automation via the following .env entries:
MINIMATICS_BACKEND_BASE_URL=http://localhost:8081/api
MINIMATICS_USER_EMAIL=<studio login email>
MINIMATICS_USER_PASSWORD=<studio login password>
MINIMATICS_PROJECT_TITLE=GuruBench Automation
# MINIMATICS_BACKEND_TIMEOUT=30  # optional override (seconds)

Each test run automatically logs in, creates a fresh Minimatics project/chat session, and replays the real tool-calling flow end-to-end.

Use --guru-version-id to select the backend guru version for each run (required; the value must match the backend's naming). --system-prompt only supplies a testModePrompt override; it does not select the backend version.

What It Tests

  • Socratic Form: Single open questions (15-40 words)
  • Length Adherence: Concise vs verbose responses
  • Phase Detection: Character → Environment → Story flow
  • Drift Analysis: Performance changes over time

Configuration Groups

Results are grouped by: system_prompt + temperature + model_version

Example output:

📋 GURU (T=0.3, gpt-4.1)  ← Lower temp: more verbose
📋 GURU (T=1.0, gpt-4.1)  ← Higher temp: more concise
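A minimal sketch of how this grouping can be computed from run records. The grouping key mirrors the fields listed above, but the record schema (`system_prompt`, `temperature`, `model_version` as top-level keys) is an assumption for illustration, not the actual report format:

```python
from collections import defaultdict


def config_key(run: dict) -> tuple:
    # Grouping key: system_prompt + temperature + model_version.
    # Field names are assumed, not taken from the real report schema.
    return (run["system_prompt"], run["temperature"], run["model_version"])


def group_runs(runs: list[dict]) -> dict[tuple, list[dict]]:
    # Bucket run records by their configuration key.
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for run in runs:
        groups[config_key(run)].append(run)
    return dict(groups)
```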

Files

  • test_runner_minimatics.py - Main CLI
  • conversation_analyzer.py - Results analysis
  • ../system_prompts_minimatics/ - Guru prompts
  • personas/ - User behavior profiles
  • results/ - JSON reports

Growth Strategy: step-by-step ablation

Objectives (targeted failures to remediate)

  1. Termination non-compliance – the assistant ignores explicit "End the story" signals, and the exit criteria are opaque to the user, i.e., it is hard for the user to predict what "done" looks like.
  2. Redundant clarification loops – the assistant re-asks for context already provided in the opening exchange instead of advancing phases. When the user front-loads context, the assistant should progress through phases without re-asking.
  3. Constraint drift – the assistant drops core narrative constraints (e.g., a teen bullied at school) and prematurely pivots to an unrelated setting (e.g., a forest).

Approach

1. Establish Baseline

  • Pin prompt version, model temperature, and random seed.
  • Audit the current metric suite to confirm it captures the minimum viable behavior for any Guru version: reliably driving a conversation end-to-end and producing a complete Minimatics output.
  • Run n≄30 GuruBench Minimatics conversations; record per-objective pass rates as baseline metrics.
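The per-objective pass-rate bookkeeping can be sketched as follows. The conversation record layout (one boolean flag per objective) is an assumption for illustration, not the analyzer's actual output schema:

```python
def pass_rates(conversations: list[dict], objectives: list[str]) -> dict[str, float]:
    # Per-objective pass rate over a batch of n conversations.
    # Assumes each conversation record carries one boolean per objective
    # (e.g. {"termination": True, "no_redundant_loops": False, ...}).
    if not conversations:
        raise ValueError("need at least one conversation")
    return {
        obj: sum(1 for c in conversations if c.get(obj, False)) / len(conversations)
        for obj in objectives
    }
```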

2. Validate Metrics

  • Confirm each objective maps to ≄1 metric that correlates with human judgment (spot-check sample).
  • Decide scoring type per metric: binary | categorical | continuous.

3. Ablation Phase - One Goal at a Time

For each objective:

  1. Hypothesize – state the expected behavioral change and the metric(s) it should move.
  2. Intervene – apply a minimal prompt delta; snapshot it to a versioned file in system_prompts_minimatics/. (The latest system_prompts_minimatics/.txt becomes the starting point for the new version.)
  3. De-bloat – run prompt_debloat.py --regenerate.
  4. Evaluate – run n≄30 conversations and compare to baseline: pixi run python -m minimatics.test_runner_minimatics --count 30.
  5. Accept/Reject – accept if the target metric reaches ≄90% and no other metric regresses by more than 5 pp from baseline.
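The accept/reject rule can be expressed as a small predicate. This is a sketch under assumptions: metrics are represented as a dict of pass rates in [0, 1], and the names are placeholders:

```python
def accept_ablation(baseline: dict[str, float], candidate: dict[str, float],
                    target_metric: str, threshold: float = 0.90,
                    max_regression_pp: float = 5.0) -> bool:
    # Accept only if the target metric reaches the threshold (default 90%)
    # AND no other metric drops more than max_regression_pp percentage
    # points below its baseline value.
    if candidate[target_metric] < threshold:
        return False
    for metric, base_rate in baseline.items():
        if metric == target_metric:
            continue
        if (base_rate - candidate[metric]) * 100 > max_regression_pp:
            return False
    return True
```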

4. Verify Behavioral Changes

  • After each ablation, validate the objective via its mapped metric(s).
  • Success Criteria: Objective met when its metric achieves ≄90% pass rate (p < 0.05 vs. baseline via proportion z-test).
  • Document effect size and any observed regressions for traceability.
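The proportion z-test named in the success criteria can be computed with the standard library alone. This is an illustrative pooled two-sample test, not necessarily the exact procedure the analyzer implements:

```python
import math


def proportion_ztest(pass_a: int, n_a: int, pass_b: int, n_b: int) -> tuple[float, float]:
    # Pooled two-sample z-test for proportions: compares the pass rate of
    # run B (candidate) against run A (baseline).
    # Returns (z statistic, two-sided p-value).
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF:
    # p = 2 * (1 - Phi(|z|)) = 1 - erf(|z| / sqrt(2))
    p = 1 - math.erf(abs(z) / math.sqrt(2))
    return z, p
```

For example, a jump from 18/30 passes at baseline to 28/30 after an ablation is significant at p < 0.05, while identical pass rates give z = 0 and p = 1.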