PoseMaker Rigger Eval

Overview

Tests the PoseMaker rigger LLM in isolation: verifies that it produces correct tool calls from pose descriptions.

Running Tests

# Backend must be running with ENABLE_TEST_ROUTES=true

# Run by collection
python test_runner.py --collection pose_maker_rigger_fresh
python test_runner.py --collection pose_maker_rigger_incremental
python test_runner.py --collection pose_maker_rigger_multi
python test_runner.py --collection pose_maker_rigger_with_reference

# Run a single test
python test_runner.py rigger_standing_arms_crossed
python test_runner.py rigger_multi_conversation

# Tag the run with a label so output goes to a per-label subdir
# (recommended when comparing models or prompt changes):
python test_runner.py --collection pose_maker_rigger_fresh --label opus-4-7

Collections

`pose_maker_rigger_fresh`: Fresh scene, single figure

Fixtures: tests/fixtures/pose_library/
Scoring: Position-based. Both the expected and the actual tool calls are applied to a virtual stick figure (IK simulator), and the final joint positions (head, hands, feet, elbows, knees) are compared relative to the hips. The score is the equal-weight average distance across the 9 joints.
Pass threshold: avg_distance ≤ 0.15 (in figure-height units; figure is ~1.31 units tall, so 0.15 ≈ 11% of body height)
Source: exported from pose library DB via python scripts/export_pose_library.py, filtered by scripts/verified_poses.json
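
The position-based check described above can be sketched roughly as follows. This is only an illustration, not the runner's actual scorer: the joint names, the dict-of-coordinates layout, and the function names are assumptions, and the IK simulation that produces the joint positions is left out.

```python
import math

# Joint names follow the list in the text; the exact keys are an assumption.
JOINTS = [
    "head",
    "left_hand", "right_hand",
    "left_foot", "right_foot",
    "left_elbow", "right_elbow",
    "left_knee", "right_knee",
]
PASS_THRESHOLD = 0.15  # figure-height units, as stated above


def avg_joint_distance(expected, actual):
    """Equal-weight mean distance over the 9 joints, each taken relative to the hips."""
    total = 0.0
    for joint in JOINTS:
        exp_rel = [e - h for e, h in zip(expected[joint], expected["hips"])]
        act_rel = [a - h for a, h in zip(actual[joint], actual["hips"])]
        total += math.dist(exp_rel, act_rel)
    return total / len(JOINTS)


def passes(expected, actual):
    """True when the average distance is within the pass threshold."""
    return avg_joint_distance(expected, actual) <= PASS_THRESHOLD
```

Because every joint is measured relative to the hips, moving the whole figure does not change the score; only the shape of the pose does.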

`pose_maker_rigger_incremental`: Edits + transitions

Fixtures: tests/fixtures/pose_incremental/
Scoring: same as fresh
Tests: simple edits from neutral (raise arm, look left) + pose-to-pose transitions (confident → arms crossed)

`pose_maker_rigger_multi`: Multi-figure scenes

Fixtures: tests/fixtures/pose_multi_figure/
Scoring: Weighted combination of spatial checks + individual pose match:
Spatial/structural checks (binary pass/fail, weight=1 each):
- Figures exist and are positioned (via moveFigure or addFigure with coords)
- Not overlapping at same position
- Spacing matches interaction type (conversation=0.4-0.7, hug=0.25-0.5, fight=0.55-0.8, side_by_side=0.35-0.55)
- faceToward correct (required for face-to-face, optional for same-direction)
- frame='other:<id>' used for interactions (handshake, hug)
- Hands within reach distance
Pose match checks (continuous 0-1 score, weight=2 each):
- Each figure's pose compared against library golden using position scorer
- Uses individualPoseRef (same pose for all) or individualPoseRefs (per-figure)
Pass threshold: weighted score ≥ 0.75
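
As a rough sketch of how the weights above could combine: the example below assumes the final score is a normalized weighted mean (earned weight divided by possible weight) and that the spacing ranges are inclusive. Neither detail is stated above, and the function names are hypothetical.

```python
# Spacing ranges per interaction type, copied from the list above.
SPACING = {
    "conversation": (0.4, 0.7),
    "hug": (0.25, 0.5),
    "fight": (0.55, 0.8),
    "side_by_side": (0.35, 0.55),
}
SPATIAL_WEIGHT = 1.0  # binary spatial/structural checks
POSE_WEIGHT = 2.0     # continuous pose-match checks
PASS_THRESHOLD = 0.75


def spacing_ok(interaction, distance):
    """Binary spacing check; inclusive bounds are an assumption."""
    low, high = SPACING[interaction]
    return low <= distance <= high


def weighted_score(spatial_checks, pose_scores):
    """spatial_checks: list of bools; pose_scores: list of floats in [0, 1]."""
    earned = SPATIAL_WEIGHT * sum(1 for ok in spatial_checks if ok)
    earned += POSE_WEIGHT * sum(pose_scores)
    possible = SPATIAL_WEIGHT * len(spatial_checks) + POSE_WEIGHT * len(pose_scores)
    return earned / possible if possible else 0.0
```

Under these assumptions, a two-figure scene with five of six spatial checks passing and pose scores of 0.8 and 0.6 yields (5 + 2 × 1.4) / (6 + 4) = 0.78, which clears the 0.75 threshold.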

`pose_maker_rigger_with_reference`: Reference-injection A/B tests

Fixtures: tests/fixtures/pose_with_reference/
Same scoring as fresh, but each fixture has a "references": ["pose-id-1", ...] field. The runner looks each id up in pose_library/, builds {id, name, description, toolCalls} from the canonical fixture, and forwards the result as referencePoses to the rigger.
Lets you measure whether anchoring on a structurally-similar verified pose improves output for poses where the rigger consistently undershoots (sitting, leaning, etc.).
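
The reference lookup described above might look something like this. It is a sketch, not the runner's code: it assumes each canonical fixture is stored as `<id>.json` and carries the four fields under exactly these key names, and `build_reference_poses` is a hypothetical helper.

```python
import json
from pathlib import Path

REFERENCE_FIELDS = ("id", "name", "description", "toolCalls")


def build_reference_poses(fixture, library_dir):
    """Resolve each id in fixture["references"] to a trimmed reference pose."""
    reference_poses = []
    for pose_id in fixture.get("references", []):
        # Assumption: the canonical fixture file is named after the pose id.
        canonical_path = Path(library_dir) / f"{pose_id}.json"
        canonical = json.loads(canonical_path.read_text())
        # Keep only the fields that are forwarded to the rigger.
        reference_poses.append({key: canonical.get(key) for key in REFERENCE_FIELDS})
    return reference_poses
```

A fixture without a references field yields an empty list, so the same helper would also work for plain fresh-scene fixtures.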

Verified Poses

scripts/verified_poses.json is the allowlist — only these ids are exported as fresh-scene fixtures and only these poses appear in the director's reference catalog at runtime.

To add a pose to the verified set:
1. Render the pose in the FE pose-debug page; confirm tool calls are visually correct
2. Add the id to scripts/verified_poses.json
3. Mirror to Mongo: npx ts-node scripts/poseMaker/markVerifiedPoses.ts --apply
4. Re-export fixtures: python scripts/export_pose_library.py

The markVerifiedPoses.ts script keeps Mongo isVerified in sync with the JSON. Production poseMaker reads isVerified: true to populate the director's reference catalog.

Output Files

After each run, test_results/pose_outputs/<label>/<test_id>.json contains the actual tool calls and scoring details. Paste the actual array into the frontend pose-debug renderer to visualize it.

If you don't pass --label, output goes to test_results/pose_outputs/default/.
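
For scripting around these files, the path convention can be captured in a small helper. This is a convenience sketch only; `output_path` is a hypothetical name and not part of the runner.

```python
from pathlib import Path

OUTPUT_ROOT = Path("test_results/pose_outputs")


def output_path(test_id, label=None):
    """Per-test output file; falls back to the 'default' subdir without a label."""
    return OUTPUT_ROOT / (label or "default") / f"{test_id}.json"
```

For example, `output_path("rigger_standing_arms_crossed", "opus-4-7")` points at the file written by a run tagged with `--label opus-4-7`.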

Recomputing Scores Without LLM Calls

Once you have cached test outputs, you can recompute scores under updated thresholds / scoring tweaks without paying for new LLM calls:

python scripts/recompute_pose_scores.py opus-4-7              # one label
python scripts/recompute_pose_scores.py opus-4-7 fable        # diff two labels

Regenerating Goldens

The regenerate workflow is now three steps so you don't pay for LLM calls on a dry run, and you can review results before writing to Mongo:

# 1. Preview which entries would regen (no LLM, no DB writes):
npx ts-node scripts/poseMaker/regeneratePoseLibrary.ts --unverified --dry-run

# 2. Run director→rigger and CACHE results to disk
#    (no DB writes; script prints a runId at the end):
npx ts-node scripts/poseMaker/regeneratePoseLibrary.ts --unverified

# Each cached file is { "expected": [...old tool calls...], "actual": [...new tool calls...] }
# under scripts/poseMaker/.regenCache/<runId>/<id>.json. Paste either
# array into the FE pose-debug renderer.

# 3. Apply approved cached results to Mongo (no LLM):
npx ts-node scripts/poseMaker/regeneratePoseLibrary.ts --apply-from-cache <runId>

Filters: --all | --unverified | --category <cat> | --id <id>. Add --verbose to step 2 for per-pose director output.

The apply step skips broken caches automatically (where actual.length <= 1, which usually means the rigger timed out and the cache only has a resetPose).
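
To see which cached entries the apply step would likely skip before running it, the same rule can be checked by hand. The sketch below is a Python restatement of the `actual.length <= 1` rule, not the TypeScript script's own code; the function names are hypothetical, and the cache layout is the `{ "expected": [...], "actual": [...] }` shape described above.

```python
import json
from pathlib import Path


def is_broken_cache(entry):
    """A cache with at most one actual tool call (typically just resetPose) is broken."""
    return len(entry.get("actual", [])) <= 1


def partition_cache(run_dir):
    """Split the <id>.json files of one run directory into (usable, broken) id lists."""
    usable, broken = [], []
    for path in sorted(Path(run_dir).glob("*.json")):
        entry = json.loads(path.read_text())
        (broken if is_broken_cache(entry) else usable).append(path.stem)
    return usable, broken
```

Pointing `partition_cache` at scripts/poseMaker/.regenCache/<runId>/ gives a quick list of poses worth re-running before step 3.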

After applying, re-export test fixtures:

python scripts/export_pose_library.py

Changing Model

Edit src/testDebug/testDebug.service.ts → RIGGER_MODEL constant at top of runRigger:

private static readonly RIGGER_MODEL: string = BedrockModelId.CLAUDE_OPUS_4_7;  // or BedrockModelId.CLAUDE_FABLE_5, OpenAIModelId.GPT_5_4

Restart the backend, then re-run the tests with a --label matching the model so outputs don't overwrite each other.

Adding Tests

- Fresh (verified-allowlist gated): add the pose id to scripts/verified_poses.json, run markVerifiedPoses.ts --apply, then re-export fixtures. The export script regenerates pose_library_index.json automatically and preserves any references field on existing fixtures across re-exports.
- Incremental/Transition: hand-author JSON in tests/fixtures/pose_incremental/, update incremental_index.json
- Multi-figure: hand-author JSON in tests/fixtures/pose_multi_figure/, update multi_figure_index.json
- With-reference A/B: copy a fresh fixture into tests/fixtures/pose_with_reference/, append "references": ["<verified-pose-id>"]. No index file is needed; the runner picks up every JSON in the directory.
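
The with-reference step (copy a fresh fixture, add a references field) is easy to script. A minimal sketch, assuming the fixture is a single JSON object and the copy keeps the original file name; `make_reference_fixture` is a hypothetical helper, not an existing script.

```python
import json
from pathlib import Path


def make_reference_fixture(fresh_path, out_dir, reference_ids):
    """Copy a fresh fixture into out_dir with a 'references' field added."""
    fresh_path = Path(fresh_path)
    fixture = json.loads(fresh_path.read_text())
    fixture["references"] = list(reference_ids)
    out_path = Path(out_dir) / fresh_path.name
    out_path.write_text(json.dumps(fixture, indent=2))
    return out_path
```

The source fixture is left untouched, so the fresh and with-reference runs can be compared as an A/B pair.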