GuruBench

Test suite and battle arena for evaluating Guru AI assistants (StoryDesk and Minimatics). Uses DeepEval with G-Eval for LLM-as-judge evaluation.

We use G-Eval extensively; read about it here: https://deepeval.com/docs/metrics-llm-evals
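As a rough illustration of the LLM-as-judge pattern G-Eval builds on, here is a hand-rolled sketch (this is not the DeepEval API; the prompt wording, `Score: N` convention, and 0–10 scale are assumptions for illustration):

```python
import re

def build_judge_prompt(criteria: str, user_input: str, actual_output: str) -> str:
    """Assemble an evaluation prompt asking a judge LLM to score a response."""
    return (
        "You are evaluating an assistant response.\n"
        f"Criteria: {criteria}\n"
        f"User input: {user_input}\n"
        f"Assistant output: {actual_output}\n"
        "Reply with 'Score: N' where N is an integer from 0 to 10."
    )

def parse_judge_score(judge_response: str, threshold: float = 0.5):
    """Extract the 0-10 score, normalize to 0-1, and compare to a pass threshold."""
    match = re.search(r"Score:\s*(\d+)", judge_response)
    if not match:
        raise ValueError("judge response contained no score")
    score = min(int(match.group(1)), 10) / 10
    return score, score >= threshold
```

DeepEval's real `GEval` metric also generates evaluation steps from the criteria and handles the judge call itself; the sketch only shows the score-extraction half of the loop.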

🎯 Projects

StoryDesk

  • System Prompts folder: system_prompts/ (guru + story essentials)
  • Tests: basic (JSON format, content structure), core (Story guidance, word count, source reference avoidance), story review (Story readiness, critique assessment), tool (tool call validation currently decoupled)
  • Evaluations: Socratic method, length, guidance quality, story essentials

Minimatics - Story ideation assistant testing

  • System Prompts folder: system_prompts_minimatics/
  • Tests: Synthetic multi-turn conversations
  • Evaluations: Socratic adherence, answer length, story completeness, confirmation detection
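Deterministic evaluations such as answer length can run as plain checks before any LLM judging; a minimal sketch (the word-count bounds here are made-up examples, not the suite's real limits):

```python
def check_answer_length(text: str, min_words: int = 20, max_words: int = 150) -> bool:
    """Pass if the assistant's reply falls inside the allowed word-count window."""
    n = len(text.split())
    return min_words <= n <= max_words
```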

🧪 Running Tests

Dependencies for GuruBench (including the pinned deepeval version) are managed in pyproject.toml.

This repo uses Pixi as the default package manager.

# Install dependencies
pixi install

# Run commands via Pixi
pixi run python --version

Pixi Tasks (Add + Example)

Pixi tasks live in pyproject.toml under the [tool.pixi.tasks] table. Each task maps a name to a command you can run with pixi run <task-name>.

[tool.pixi.tasks]
# Simple task example
minimatics_tests_v2 = "python -m minimatics.test_runner_minimatics --guru-version-id v2 --count 10 --temperature 0.7"

# Run the task
pixi run minimatics_tests_v2

StoryDesk Tests

# Run all tests in verbose mode and store the results in a file to inspect later
pixi run python test_runner.py all --system_prompt <your_new_system_prompt_in_system_prompts_folder> --verbose &> tests_output.txt

# View available tests
pixi run python test_runner.py --stats

# Run specific conversation test
pixi run python test_runner.py vesta_story

# Run all tests with custom system prompt
pixi run python test_runner.py all --system_prompt 180825-baseline.txt

# Run tests by collection
pixi run python test_runner.py --collection core

# Generate fail report (only failed tests, detailed debugging info)
pixi run python test_runner.py --collection sas_story --report-fail

# Run tests in parallel (all at once)
pixi run python test_runner.py --collection sas_story --parallel --report-fail

# Run tests in parallel with limited workers
pixi run python test_runner.py all --parallel 4 --report-fail

Options

Flag                    Description
--verbose, -v           Show detailed output
--report                Generate JSON/Markdown report in test_reports/
--report-fail           Generate report for FAILED tests only (with full debugging info)
--parallel              Run ALL tests in parallel at once
--parallel N            Run tests in parallel with N workers
--system_prompt [file]  Use system prompt from the system_prompts/ folder
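The flag shapes above (an optional value for `--parallel`, an optional file for `--system_prompt`) can be modeled with `argparse`; this is an illustrative sketch, not the runner's actual parser:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented option shapes (illustrative only)."""
    p = argparse.ArgumentParser(prog="test_runner.py")
    p.add_argument("-v", "--verbose", action="store_true", help="Show detailed output")
    p.add_argument("--report", action="store_true", help="Write report to test_reports/")
    p.add_argument("--report-fail", action="store_true", help="Report failed tests only")
    # nargs="?" lets --parallel appear bare (run all at once) or carry a worker count
    p.add_argument("--parallel", nargs="?", const=0, type=int, default=None)
    p.add_argument("--system_prompt", metavar="FILE", default=None)
    return p
```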

StoryDesk Story Essentials Tests

# Test story grading on sample stories
pixi run python test_story_essentials_basic.py

# Test specific story
pixi run python test_story_essentials_basic.py sdf_ep3

Battle Arena Minimatics

# Define a custom Pixi task in pyproject.toml, then run it
pixi run minimatics_tests_v2
# Refer to pyproject.toml for details on the CLI

# Run a single conversation with a cooperative persona, maximum 25 turns
pixi run python -m minimatics.test_runner_minimatics \
  --guru-version-id v2 \
  --persona basic_cooperative.txt \
  --max-turns 25

# Run multiple conversations
pixi run python -m minimatics.test_runner_minimatics \
  --guru-version-id v2 \
  --count 10 \
  --temperature 0.7

Battle Arena StoryDesk (Legacy, moved to studio-backend/python_eval)

Refer to the studio-backend repo, under python_eval.

Analyze Arena Results

# StoryDesk arena analysis
pixi run python conversation_analyzer.py \
  --results_folder story_arena_results \
  --summary

# Minimatics arena analysis  
pixi run python conversation_analyzer.py \
  --results_folder minimatics/results \
  --summary

# Include persona breakdown
pixi run python conversation_analyzer.py \
  --results_folder story_arena_results \
  --summary \
  --persona-analysis

# Save detailed report to JSON
pixi run python conversation_analyzer.py \
  --results_folder story_arena_results \
  --output analysis_report.json \
  --summary

.env parameters

OPENAI_API_KEY=xxx

StoryDesk

GURU_API_KEY=xxx

GURU_API_URL=xxx

MONGODB_URI=xxx (only to pull logs if you use the mongodb_extractor.py tool)

Minimatics backend bridge

MINIMATICS_BACKEND_BASE_URL=http://localhost:8081/api

MINIMATICS_USER_EMAIL=

MINIMATICS_USER_PASSWORD=

MINIMATICS_PROJECT_TITLE=GuruBench Automation

MINIMATICS_BACKEND_TIMEOUT=30 # optional override (seconds)

Use --guru-version-id in the CLI to choose the backend version per run (required).

--system-prompt only supplies a testModePrompt override; it does not select the backend version.
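A sketch of how the optional timeout override might be read at startup (the variable name comes from above; the helper function and fallback behavior are assumptions):

```python
import os

def minimatics_backend_timeout(default: int = 30) -> int:
    """Read MINIMATICS_BACKEND_TIMEOUT from the environment, falling back to 30s."""
    raw = os.environ.get("MINIMATICS_BACKEND_TIMEOUT")
    if raw is None or not raw.strip():
        return default
    return int(raw)
```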