# GuruBench
Test suite and battle arena for evaluating Guru AI assistants (StoryDesk and Minimatics). Uses DeepEval with G-Eval for LLM-as-judge evaluation.
We use G-Eval extensively; see the DeepEval docs: https://deepeval.com/docs/metrics-llm-evals
## 🎯 Projects
### StoryDesk

- System prompts folder: `system_prompts/` (guru + story essentials)
- Tests: basic (JSON format, content structure), core (story guidance, word count, source-reference avoidance), story review (story readiness, critique assessment), tool (tool-call validation, currently decoupled)
- Evaluations: Socratic method, length, guidance quality, story essentials
### Minimatics - Story ideation assistant testing

- System prompts folder: `system_prompts_minimatics/`
- Tests: synthetic multi-turn conversations
- Evaluations: Socratic adherence, answer length, story completeness, confirmation detection
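To make "confirmation detection" concrete: the idea is to spot turns where the assistant asks the user to sign off on a finished story idea. A heuristic baseline might look like the sketch below; the cue phrases and function name are hypothetical, not GuruBench's actual detector:

```python
# Hypothetical cue phrases -- the real evaluation uses an LLM judge,
# not a fixed phrase list.
CONFIRMATION_CUES = (
    "shall we lock this in",
    "does this capture your story",
    "are you happy with this version",
)

def is_confirmation_turn(assistant_message: str) -> bool:
    """Return True if the assistant appears to be asking for final sign-off."""
    text = assistant_message.lower()
    return any(cue in text for cue in CONFIRMATION_CUES)
```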
## 🧪 Running Tests
Dependencies for GuruBench (including the pinned deepeval version) are managed in `pyproject.toml`. This repo uses Pixi as the default package manager.

```shell
# Install dependencies
pixi install

# Run commands via Pixi
pixi run python --version
```
### Pixi Tasks

Pixi tasks live in `pyproject.toml` under the `[tool.pixi.tasks]` table. Each task maps a name to a command you can run with `pixi run <task-name>`.

```toml
[tool.pixi.tasks]
# Simple task example
minimatics_tests_v2 = "python -m minimatics.test_runner_minimatics --guru-version-id v2 --count 10 --temperature 0.7"
```

```shell
# Run the task
pixi run minimatics_tests_v2
```
### StoryDesk Tests

```shell
# Run all tests in verbose mode and save the results to a file to inspect later
pixi run python test_runner.py all --system_prompt <your_new_system_prompt_in_system_prompts_folder> --verbose &> tests_output.txt

# View available tests
pixi run python test_runner.py --stats

# Run a specific conversation test
pixi run python test_runner.py vesta_story

# Run all tests with a custom system prompt
pixi run python test_runner.py all --system_prompt 180825-baseline.txt

# Run tests by collection
pixi run python test_runner.py --collection core

# Generate a fail report (only failed tests, with detailed debugging info)
pixi run python test_runner.py --collection sas_story --report-fail

# Run tests in parallel (all at once)
pixi run python test_runner.py --collection sas_story --parallel --report-fail

# Run tests in parallel with a limited number of workers
pixi run python test_runner.py all --parallel 4 --report-fail
```
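The difference between bare `--parallel` and `--parallel N` is just the worker cap. That kind of fan-out can be sketched with `concurrent.futures`; this is an illustration of the pattern, not the repo's actual runner:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for running one named conversation test.
def run_one(test_name: str) -> tuple[str, bool]:
    # The real runner would call the model and the LLM judge here.
    return test_name, True

def run_parallel(test_names, max_workers=None):
    # max_workers=None mirrors bare --parallel (pool decides the limit);
    # an integer mirrors --parallel N.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, test_names))
```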
### Options

| Flag | Description |
|---|---|
| `--verbose`, `-v` | Show detailed output |
| `--report` | Generate a JSON/Markdown report in `test_reports/` |
| `--report-fail` | Generate a report for FAILED tests only (with full debugging info) |
| `--parallel` | Run ALL tests in parallel at once |
| `--parallel N` | Run tests in parallel with N workers |
| `--system_prompt [file]` | Use a system prompt from the `system_prompts/` folder |
### StoryDesk Story Essentials Tests

```shell
# Test story grading on sample stories
pixi run python test_story_essentials_basic.py

# Test a specific story
pixi run python test_story_essentials_basic.py sdf_ep3
```
### Battle Arena: Minimatics

```shell
# Run a custom Pixi task defined in pyproject.toml
# (refer to pyproject.toml for details on the CLI)
pixi run minimatics_tests_v2

# Run a single conversation with a cooperative persona, maximum 25 rounds
pixi run python -m minimatics.test_runner_minimatics \
    --guru-version-id v2 \
    --persona basic_cooperative.txt \
    --max-turns 25

# Run multiple conversations
pixi run python -m minimatics.test_runner_minimatics \
    --guru-version-id v2 \
    --count 10 \
    --temperature 0.7
```
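A battle-arena run is a persona-driven user simulator talking to the assistant for at most `--max-turns` rounds. The loop can be sketched as below; both callables are stand-ins, since GuruBench's real runner talks to the Minimatics backend instead:

```python
def run_arena(user_sim, assistant, max_turns: int):
    """Alternate persona and assistant turns until the persona stops
    or max_turns is reached. Returns the (role, message) transcript."""
    transcript = []
    for _ in range(max_turns):
        user_msg = user_sim(transcript)
        if user_msg is None:  # persona decides the conversation is done
            break
        transcript.append(("user", user_msg))
        transcript.append(("assistant", assistant(transcript)))
    return transcript
```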
### Battle Arena: StoryDesk (legacy, moved to studio-backend/python_eval)

Refer to the `studio-backend` repo, under `python_eval`.
### Analyze Arena Results

```shell
# StoryDesk arena analysis
pixi run python conversation_analyzer.py \
    --results_folder story_arena_results \
    --summary

# Minimatics arena analysis
pixi run python conversation_analyzer.py \
    --results_folder minimatics/results \
    --summary

# Include a per-persona breakdown
pixi run python conversation_analyzer.py \
    --results_folder story_arena_results \
    --summary \
    --persona-analysis

# Save a detailed report to JSON
pixi run python conversation_analyzer.py \
    --results_folder story_arena_results \
    --output analysis_report.json \
    --summary
```
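The analyzer aggregates per-conversation result files into summary stats such as a per-persona breakdown. An illustrative sketch of that kind of aggregation follows; the per-file schema (`{"persona": ..., "passed": ...}`) is an assumption, not the analyzer's actual format:

```python
import json
from pathlib import Path

def summarize(results_folder: str) -> dict:
    """Compute a pass rate per persona from JSON result files in a folder."""
    per_persona: dict[str, list[bool]] = {}
    for path in Path(results_folder).glob("*.json"):
        record = json.loads(path.read_text())
        per_persona.setdefault(record["persona"], []).append(record["passed"])
    return {
        persona: sum(outcomes) / len(outcomes)
        for persona, outcomes in per_persona.items()
    }
```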
## .env Parameters

```shell
OPENAI_API_KEY=xxx
```

### StoryDesk

```shell
GURU_API_KEY=xxx
GURU_API_URL=xxx
MONGODB_URI=xxx  # only needed to pull logs with the mongodb_extractor.py tool
```

### Minimatics backend bridge

```shell
MINIMATICS_BACKEND_BASE_URL=http://localhost:8081/api
MINIMATICS_USER_EMAIL=
```
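Since a missing key only surfaces when the judge model is first called, it can help to fail fast at startup. A minimal sketch, assuming the variables above are exported (or loaded with a tool like python-dotenv) before the tests run; the helper name is mine, not part of GuruBench:

```python
import os

def check_env(required, env=os.environ):
    """Raise early if any required environment variable is missing or empty."""
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required env vars: {', '.join(missing)}")
```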