
PoseMaker Rigger Eval

Overview

Tests the PoseMaker rigger LLM in isolation — verifies it produces correct tool calls from pose descriptions.

Running Tests

# Backend must be running with ENABLE_TEST_ROUTES=true

# Run by collection
python test_runner.py --collection pose_maker_rigger_fresh
python test_runner.py --collection pose_maker_rigger_incremental
python test_runner.py --collection pose_maker_rigger_multi
python test_runner.py --collection pose_maker_rigger_with_reference

# Run a single test
python test_runner.py rigger_standing_arms_crossed
python test_runner.py rigger_multi_conversation

# Tag the run with a label so output goes to a per-label subdir
# (recommended when comparing models or prompt changes):
python test_runner.py --collection pose_maker_rigger_fresh --label opus-4-7

Collections

pose_maker_rigger_fresh — Fresh scene, single figure

  • Fixtures: tests/fixtures/pose_library/
  • Scoring: Position-based — applies both expected + actual tool calls to a virtual stick figure (IK simulator), compares final joint positions (head, hands, feet, elbows, knees) relative to hips. Equal-weight average across the 9 joints.
  • Pass threshold: avg_distance ≤ 0.15 (in the figure's coordinate units; the figure is ~1.31 units tall, so 0.15 ≈ 11% of body height)
  • Source: exported from pose library DB via python scripts/export_pose_library.py, filtered by scripts/verified_poses.json
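
The position-based comparison described above can be sketched roughly as follows. This is an illustration only, not the runner's actual scorer: the joint names, the pose dictionary shape, and the use of plain Euclidean distance are assumptions, and the IK simulation that turns tool calls into joint positions is left out.

```python
import math

# Hypothetical joint names for the 9 scored joints; "hips" is the origin.
SCORED_JOINTS = (
    "head",
    "left_hand", "right_hand",
    "left_foot", "right_foot",
    "left_elbow", "right_elbow",
    "left_knee", "right_knee",
)
PASS_THRESHOLD = 0.15  # threshold quoted in this README


def relative_to_hips(pose):
    """Express each scored joint as an offset from the hips."""
    hips = pose["hips"]
    return {
        joint: tuple(c - h for c, h in zip(pose[joint], hips))
        for joint in SCORED_JOINTS
    }


def avg_distance(expected, actual):
    """Equal-weight mean of per-joint distances (assumed Euclidean)."""
    exp = relative_to_hips(expected)
    act = relative_to_hips(actual)
    total = sum(math.dist(exp[j], act[j]) for j in SCORED_JOINTS)
    return total / len(SCORED_JOINTS)


def passes(expected, actual):
    return avg_distance(expected, actual) <= PASS_THRESHOLD
```

Because positions are taken relative to the hips, moving the whole figure does not change the score; only differences in the pose itself do.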

pose_maker_rigger_incremental — Edits + transitions

  • Fixtures: tests/fixtures/pose_incremental/
  • Scoring: same as fresh
  • Tests: simple edits from neutral (raise arm, look left) + pose-to-pose transitions (confident → arms crossed)

pose_maker_rigger_multi — Multi-figure scenes

  • Fixtures: tests/fixtures/pose_multi_figure/
  • Scoring: Weighted combination of spatial checks + individual pose match:
    • Spatial/structural checks (binary pass/fail, weight=1 each):
      • Figures exist and are positioned (via moveFigure or addFigure with coords)
      • Not overlapping at same position
      • Spacing matches interaction type (conversation=0.4-0.7, hug=0.25-0.5, fight=0.55-0.8, side_by_side=0.35-0.55)
      • faceToward correct (required for face-to-face, optional for same-direction)
      • frame='other:<id>' used for interactions (handshake, hug)
      • Hands within reach distance
    • Pose match checks (continuous 0-1 score, weight=2 each):
      • Each figure's pose compared against library golden using position scorer
      • Uses individualPoseRef (same pose for all) or individualPoseRefs (per-figure)
  • Pass threshold: weighted score ≥ 0.75
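
As a rough illustration of how these weights might combine, here is a minimal sketch. It assumes the final score is a weight-normalized mean and that the spacing ranges are inclusive; neither detail is stated above, and the function names are hypothetical.

```python
# Spacing ranges per interaction type, as listed in this README.
SPACING_RANGES = {
    "conversation": (0.4, 0.7),
    "hug": (0.25, 0.5),
    "fight": (0.55, 0.8),
    "side_by_side": (0.35, 0.55),
}
SPATIAL_WEIGHT = 1    # binary spatial/structural checks
POSE_WEIGHT = 2       # continuous 0-1 pose match scores
PASS_THRESHOLD = 0.75


def spacing_ok(interaction, distance):
    """True if the figure spacing falls in the range (assumed inclusive)."""
    low, high = SPACING_RANGES[interaction]
    return low <= distance <= high


def weighted_score(spatial_checks, pose_scores):
    """Assumed combination: weighted mean over all checks."""
    total_weight = (SPATIAL_WEIGHT * len(spatial_checks)
                    + POSE_WEIGHT * len(pose_scores))
    if total_weight == 0:
        return 0.0
    earned = (SPATIAL_WEIGHT * sum(1 for ok in spatial_checks if ok)
              + POSE_WEIGHT * sum(pose_scores))
    return earned / total_weight


def passes(spatial_checks, pose_scores):
    return weighted_score(spatial_checks, pose_scores) >= PASS_THRESHOLD
```

Under these assumptions, a scene that fails one of four spatial checks but scores 0.9 and 0.8 on its two pose matches would land at about 0.8 and pass.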

pose_maker_rigger_with_reference — Reference-injection A/B tests

  • Fixtures: tests/fixtures/pose_with_reference/
  • Same scoring as fresh, but each fixture has a "references": ["pose-id-1", ...] field. The runner looks each id up in pose_library/, builds {id, name, description, toolCalls} from the canonical fixture, and forwards as referencePoses to the rigger.
  • Lets you measure whether anchoring on a structurally-similar verified pose improves output for poses where the rigger consistently undershoots (sitting, leaning, etc.).
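
The reference lookup described above might look something like this sketch. The helper names are hypothetical, and it assumes each canonical fixture in pose_library/ is a JSON object carrying its own "id", "name", "description" and "toolCalls" fields; the real runner may resolve ids differently.

```python
import json
from pathlib import Path

REFERENCE_FIELDS = ("id", "name", "description", "toolCalls")


def load_library(library_dir):
    """Index canonical fixtures by their (assumed) "id" field."""
    library = {}
    for path in Path(library_dir).glob("*.json"):
        fixture = json.loads(path.read_text())
        if isinstance(fixture, dict) and "id" in fixture:
            library[fixture["id"]] = fixture
    return library


def build_reference_poses(fixture, library):
    """Resolve a fixture's "references" ids into referencePoses entries."""
    reference_poses = []
    for pose_id in fixture.get("references", []):
        canonical = library[pose_id]  # KeyError if the id is not in the library
        reference_poses.append(
            {field: canonical.get(field) for field in REFERENCE_FIELDS}
        )
    return reference_poses
```

A fixture without a "references" field simply yields an empty list, which matches the fresh-scene behaviour.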

Verified Poses

scripts/verified_poses.json is the allowlist — only these ids are exported as fresh-scene fixtures and only these poses appear in the director's reference catalog at runtime.

To add a pose to the verified set:

  1. Render the pose in the FE pose-debug page; confirm tool calls are visually correct
  2. Add the id to scripts/verified_poses.json
  3. Mirror to Mongo: npx ts-node scripts/poseMaker/markVerifiedPoses.ts --apply
  4. Re-export fixtures: python scripts/export_pose_library.py

The markVerifiedPoses.ts script keeps Mongo isVerified in sync with the JSON. Production poseMaker reads isVerified: true to populate the director's reference catalog.

Output Files

After each run, test_results/pose_outputs/<label>/<test_id>.json contains the actual tool calls and scoring details. Paste the actual array into the frontend pose-debug renderer to visualize it.

If you don't pass --label, output goes to pose_outputs/default/.
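
If you want to pull the tool calls out of a result file programmatically, a small helper along these lines may be enough. It is a sketch: the helper names are made up, and it assumes the tool calls live under an "actual" key at the top level of the JSON.

```python
import json
from pathlib import Path

OUTPUT_ROOT = Path("test_results/pose_outputs")


def output_path(test_id, label=None, root=OUTPUT_ROOT):
    """Per-label output file; falls back to the "default" subdir."""
    return Path(root) / (label or "default") / f"{test_id}.json"


def load_actual(test_id, label=None, root=OUTPUT_ROOT):
    """Return the tool calls from a result file (assumes an "actual" key)."""
    data = json.loads(output_path(test_id, label, root).read_text())
    return data["actual"]
```

For example, `output_path("rigger_standing_arms_crossed", "opus-4-7")` resolves to test_results/pose_outputs/opus-4-7/rigger_standing_arms_crossed.json.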

Recomputing Scores Without LLM Calls

Once you have cached test outputs, you can recompute scores under updated thresholds / scoring tweaks without paying for new LLM calls:

python scripts/recompute_pose_scores.py opus-4-7              # one label
python scripts/recompute_pose_scores.py opus-4-7 fable        # diff two labels

Regenerating Goldens

The regenerate workflow is now three steps so you don't pay for LLM calls on a dry run, and you can review results before writing to Mongo:

# 1. Preview which entries would regen (no LLM, no DB writes):
npx ts-node scripts/poseMaker/regeneratePoseLibrary.ts --unverified --dry-run

# 2. Run director→rigger and CACHE results to disk
#    (no DB writes; script prints a runId at the end):
npx ts-node scripts/poseMaker/regeneratePoseLibrary.ts --unverified

# Each cached file is { "expected": [...old tool calls...], "actual": [...new tool calls...] }
# under scripts/poseMaker/.regenCache/<runId>/<id>.json. Paste either
# array into the FE pose-debug renderer.

# 3. Apply approved cached results to Mongo (no LLM):
npx ts-node scripts/poseMaker/regeneratePoseLibrary.ts --apply-from-cache <runId>

Filters: --all | --unverified | --category <cat> | --id <id>. Add --verbose to step 2 for per-pose director output.

The apply step skips broken caches automatically (where actual.length <= 1, which usually means the rigger timed out and the cache only has a resetPose).
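
The broken-cache rule can be expressed as a short check. This Python sketch only mirrors the condition stated above; the actual script is TypeScript, and the function names and tool-call shape here are hypothetical.

```python
def is_broken_cache(entry):
    """A cache entry with at most one tool call is treated as broken."""
    return len(entry.get("actual", [])) <= 1


def applicable_entries(cache):
    """Keep only the cached results that the apply step would not skip."""
    return {
        pose_id: entry
        for pose_id, entry in cache.items()
        if not is_broken_cache(entry)
    }
```

Running the same check over a .regenCache/<runId>/ directory before applying is a quick way to see how many poses would be skipped.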

After applying, re-export test fixtures:

python scripts/export_pose_library.py

Changing Model

In src/testDebug/testDebug.service.ts, edit the RIGGER_MODEL constant at the top of runRigger:

private static readonly RIGGER_MODEL: string = BedrockModelId.CLAUDE_OPUS_4_7;  // or BedrockModelId.CLAUDE_FABLE_5, OpenAIModelId.GPT_5_4

Restart the backend, then re-run the tests with a --label matching the model so outputs don't overwrite each other.

Adding Tests

  • Fresh (verified-allowlist gated): add the pose id to scripts/verified_poses.json, run markVerifiedPoses.ts --apply, then re-export fixtures. The export script regenerates pose_library_index.json automatically and preserves any references field on existing fixtures across re-exports.
  • Incremental/Transition: hand-author JSON in tests/fixtures/pose_incremental/, update incremental_index.json
  • Multi-figure: hand-author JSON in tests/fixtures/pose_multi_figure/, update multi_figure_index.json
  • With-reference A/B: copy a fresh fixture into tests/fixtures/pose_with_reference/, append "references": ["<verified-pose-id>"]. No index file needed — the runner picks up every JSON in the directory.
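
Creating a with-reference fixture from a fresh one can be scripted. The sketch below assumes fixtures are plain JSON objects and that the copy keeps the original file name; both helper names are hypothetical.

```python
import json
from pathlib import Path


def with_references(fixture, reference_ids):
    """Return a copy of the fixture with a "references" field set."""
    updated = dict(fixture)
    updated["references"] = list(reference_ids)
    return updated


def write_reference_fixture(fresh_path, out_dir, reference_ids):
    """Copy a fresh fixture into out_dir, adding the references field."""
    fresh_path = Path(fresh_path)
    fixture = json.loads(fresh_path.read_text())
    out_path = Path(out_dir) / fresh_path.name
    out_path.parent.mkdir(parents=True, exist_ok=True)
    updated = with_references(fixture, reference_ids)
    out_path.write_text(json.dumps(updated, indent=2) + "\n")
    return out_path
```

Here out_dir would be tests/fixtures/pose_with_reference/, and each id passed in should already be in scripts/verified_poses.json.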