GuruBench

Test suite and battle arena for evaluating Guru AI assistants (StoryDesk and Minimatics). Uses DeepEval with G-Eval for LLM-as-judge evaluation.

We use G-Eval extensively; read about it here: https://deepeval.com/docs/metrics-llm-evals
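As a rough illustration of the LLM-as-judge pattern G-Eval builds on, here is a hand-rolled sketch (this is not the DeepEval API; the prompt wording, `Score: N` convention, and 0–10 scale are assumptions for illustration):

```python
import re

def build_judge_prompt(criteria: str, user_input: str, actual_output: str) -> str:
    """Assemble an evaluation prompt asking a judge LLM to score a response."""
    return (
        "You are evaluating an assistant response.\n"
        f"Criteria: {criteria}\n"
        f"User input: {user_input}\n"
        f"Assistant output: {actual_output}\n"
        "Reply with 'Score: N' where N is an integer from 0 to 10."
    )

def parse_judge_score(judge_response: str, threshold: float = 0.5):
    """Extract the 0-10 score, normalize to 0-1, and compare to a pass threshold."""
    match = re.search(r"Score:\s*(\d+)", judge_response)
    if not match:
        raise ValueError("judge response contained no score")
    score = min(int(match.group(1)), 10) / 10
    return score, score >= threshold
```

DeepEval's real `GEval` metric also generates evaluation steps from the criteria and handles the judge call itself; the sketch only shows the score-extraction half of the loop.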

🎯 Projects

StoryDesk

  • System Prompts folder: system_prompts/ (guru + story essentials)
  • Tests: basic (JSON format, content structure), core (Story guidance, word count, source reference avoidance), story review (Story readiness, critique assessment), tool (tool call validation currently decoupled)
  • Evaluations: Socratic method, length, guidance quality, story essentials

Minimatics - Story ideation assistant testing

  • System Prompts folder: system_prompts_minimatics/
  • Tests: Synthetic multi-turn conversations
  • Evaluations: Socratic adherence, answer length, story completeness, confirmation detection
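Deterministic evaluations such as answer length can run as plain checks before any LLM judging; a minimal sketch (the word-count bounds here are made-up examples, not the suite's real limits):

```python
def check_answer_length(text: str, min_words: int = 20, max_words: int = 150) -> bool:
    """Pass if the assistant's reply falls inside the allowed word-count window."""
    n = len(text.split())
    return min_words <= n <= max_words
```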

🧪 Running Tests

Dependencies for GuruBench (including the pinned deepeval version) are managed in pyproject.toml.

This repo uses Pixi as the default package manager.

# Install dependencies
pixi install

# Run commands via Pixi
pixi run python --version

Pixi Tasks (Add + Example)

Pixi tasks live in pyproject.toml under the [tool.pixi.tasks] table. Each task maps a name to a command you can run with pixi run <task-name>.

[tool.pixi.tasks]
# Simple task example
minimatics_tests_v2 = "python -m minimatics.test_runner_minimatics --guru-version-id v2 --count 10 --temperature 0.7"

# Run the task
pixi run minimatics_tests_v2

StoryDesk Tests

# Run all tests in verbose mode and store the results in a file to inspect later
pixi run python test_runner.py all --system_prompt <your_new_system_prompt_in_system_prompts_folder> --verbose &> tests_output.txt

# View available tests
pixi run python test_runner.py --stats

# Run specific conversation test
pixi run python test_runner.py vesta_story

# Run all tests with custom system prompt
pixi run python test_runner.py all --system_prompt 180825-baseline.txt

# Run tests by collection
pixi run python test_runner.py --collection core

# Generate fail report (only failed tests, detailed debugging info)
pixi run python test_runner.py --collection sas_story --report-fail

# Run tests in parallel (all at once)
pixi run python test_runner.py --collection sas_story --parallel --report-fail

# Run tests in parallel with limited workers
pixi run python test_runner.py all --parallel 4 --report-fail

Options

Flag                    Description
--verbose, -v           Show detailed output
--report                Generate JSON/Markdown report in test_reports/
--report-fail           Generate report for FAILED tests only (with full debugging info)
--parallel              Run ALL tests in parallel at once
--parallel N            Run tests in parallel with N workers
--system_prompt [file]  Use system prompt from the system_prompts/ folder
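The flag shapes above (an optional value for `--parallel`, an optional file for `--system_prompt`) can be modeled with `argparse`; this is an illustrative sketch, not the runner's actual parser:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented option shapes (illustrative only)."""
    p = argparse.ArgumentParser(prog="test_runner.py")
    p.add_argument("-v", "--verbose", action="store_true", help="Show detailed output")
    p.add_argument("--report", action="store_true", help="Write report to test_reports/")
    p.add_argument("--report-fail", action="store_true", help="Report failed tests only")
    # nargs="?" lets --parallel appear bare (run all at once) or carry a worker count
    p.add_argument("--parallel", nargs="?", const=0, type=int, default=None)
    p.add_argument("--system_prompt", metavar="FILE", default=None)
    return p
```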

StoryDesk Story Essentials Tests

# Test story grading on sample stories
pixi run python test_story_essentials_basic.py

# Test specific story
pixi run python test_story_essentials_basic.py sdf_ep3

Battle Arena Minimatics

# Define a custom Pixi task in pyproject.toml, then run it
pixi run minimatics_tests_v2
# Refer to pyproject.toml for details on the CLI

# Run a single conversation with a cooperative persona, maximum 25 turns
pixi run python -m minimatics.test_runner_minimatics \
  --guru-version-id v2 \
  --persona basic_cooperative.txt \
  --max-turns 25

# Run multiple conversations
pixi run python -m minimatics.test_runner_minimatics \
  --guru-version-id v2 \
  --count 10 \
  --temperature 0.7

Battle Arena StoryDesk (Legacy, moved to studio-backend/python_eval)

Refer to the studio-backend repo, under python_eval.

Analyze Arena Results

# StoryDesk arena analysis
pixi run python conversation_analyzer.py \
  --results_folder story_arena_results \
  --summary

# Minimatics arena analysis  
pixi run python conversation_analyzer.py \
  --results_folder minimatics/results \
  --summary

# Include persona breakdown
pixi run python conversation_analyzer.py \
  --results_folder story_arena_results \
  --summary \
  --persona-analysis

# Save detailed report to JSON
pixi run python conversation_analyzer.py \
  --results_folder story_arena_results \
  --output analysis_report.json \
  --summary

.env parameters

OPENAI_API_KEY=xxx

StoryDesk

GURU_API_KEY=xxx

GURU_API_URL=xxx

MONGODB_URI=xxx (only to pull logs if you use the mongodb_extractor.py tool)

Minimatics backend bridge

MINIMATICS_BACKEND_BASE_URL=http://localhost:8081/api

MINIMATICS_USER_EMAIL=

MINIMATICS_USER_PASSWORD=

MINIMATICS_PROJECT_TITLE=GuruBench Automation

MINIMATICS_BACKEND_TIMEOUT=30 # optional override (seconds)

Use --guru-version-id in the CLI to choose the backend version per run (required).

--system-prompt only supplies a testModePrompt override; it does not select the backend version.
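A sketch of how the optional timeout override might be read at startup (the variable name comes from above; the helper function and fallback behavior are assumptions):

```python
import os

def minimatics_backend_timeout(default: int = 30) -> int:
    """Read MINIMATICS_BACKEND_TIMEOUT from the environment, falling back to 30s."""
    raw = os.environ.get("MINIMATICS_BACKEND_TIMEOUT")
    if raw is None or not raw.strip():
        return default
    return int(raw)
```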