PoseMaker Rigger Eval
Overview
Tests the PoseMaker rigger LLM in isolation — verifies it produces correct tool calls from pose descriptions.
Running Tests
# Backend must be running with ENABLE_TEST_ROUTES=true
# Run by collection
python test_runner.py --collection pose_maker_rigger_fresh
python test_runner.py --collection pose_maker_rigger_incremental
python test_runner.py --collection pose_maker_rigger_multi
python test_runner.py --collection pose_maker_rigger_with_reference
# Run a single test
python test_runner.py rigger_standing_arms_crossed
python test_runner.py rigger_multi_conversation
# Tag the run with a label so output goes to a per-label subdir
# (recommended when comparing models or prompt changes):
python test_runner.py --collection pose_maker_rigger_fresh --label opus-4-7
Collections
pose_maker_rigger_fresh: Fresh scene, single figure
- Fixtures: tests/fixtures/pose_library/
- Scoring: Position-based. Applies both the expected and actual tool calls to a virtual stick figure (IK simulator), then compares final joint positions (head, hands, feet, elbows, knees) relative to the hips. Equal-weight average across the 9 joints.
- Pass threshold: avg_distance ≤ 0.15 (in figure-height units; the figure is ~1.31 units tall, so 0.15 ≈ 11% of body height)
- Source: exported from the pose library DB via python scripts/export_pose_library.py, filtered by scripts/verified_poses.json
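The position-based scoring described above can be illustrated with a minimal sketch. It is not the project's scorer: the joint names, the dict-of-coordinates input shape, and the function names are assumptions made for illustration, and the IK simulation step that produces the joint positions is left out.

```python
import math

# Assumed joint names; the text lists head, hands, feet, elbows, knees (9 joints).
JOINTS = [
    "head",
    "left_hand", "right_hand",
    "left_foot", "right_foot",
    "left_elbow", "right_elbow",
    "left_knee", "right_knee",
]
PASS_THRESHOLD = 0.15  # figure-height units, as stated in the list above


def avg_joint_distance(expected, actual):
    """Equal-weight mean distance over the 9 joints, each taken relative to the hips."""
    total = 0.0
    for joint in JOINTS:
        exp_rel = [e - h for e, h in zip(expected[joint], expected["hips"])]
        act_rel = [a - h for a, h in zip(actual[joint], actual["hips"])]
        total += math.dist(exp_rel, act_rel)
    return total / len(JOINTS)


def passes(expected, actual):
    """A pose passes when the average distance is at or below the threshold."""
    return avg_joint_distance(expected, actual) <= PASS_THRESHOLD
```

Because every joint is measured relative to the hips, moving the whole figure does not change the score; only the shape of the pose does.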
pose_maker_rigger_incremental: Edits + transitions
- Fixtures: tests/fixtures/pose_incremental/
- Scoring: same as fresh
- Tests: simple edits from neutral (raise arm, look left) plus pose-to-pose transitions (confident → arms crossed)
pose_maker_rigger_multi: Multi-figure scenes
- Fixtures: tests/fixtures/pose_multi_figure/
- Scoring: Weighted combination of spatial checks + individual pose match:
    - Spatial/structural checks (binary pass/fail, weight=1 each):
        - Figures exist and are positioned (via moveFigure or addFigure with coords)
        - Not overlapping at same position
        - Spacing matches interaction type (conversation=0.4-0.7, hug=0.25-0.5, fight=0.55-0.8, side_by_side=0.35-0.55)
        - faceToward correct (required for face-to-face, optional for same-direction)
        - frame='other:<id>' used for interactions (handshake, hug)
        - Hands within reach distance
    - Pose match checks (continuous 0-1 score, weight=2 each):
        - Each figure's pose compared against library golden using position scorer
        - Uses individualPoseRef (same pose for all) or individualPoseRefs (per-figure)
- Pass threshold: weighted score ≥ 0.75
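As a rough illustration of the weighting above, the sketch below combines binary spatial checks (weight 1) with continuous pose-match scores (weight 2). Normalizing by total weight and treating the spacing ranges as inclusive are assumptions, since the text only says "weighted combination"; the function names are hypothetical.

```python
SPATIAL_WEIGHT = 1    # each binary spatial/structural check
POSE_WEIGHT = 2       # each continuous 0-1 pose-match score
PASS_THRESHOLD = 0.75

# Spacing ranges per interaction type, copied from the list above.
SPACING_RANGES = {
    "conversation": (0.4, 0.7),
    "hug": (0.25, 0.5),
    "fight": (0.55, 0.8),
    "side_by_side": (0.35, 0.55),
}


def spacing_ok(interaction, distance):
    """True when the figure spacing falls inside the range (endpoints assumed inclusive)."""
    low, high = SPACING_RANGES[interaction]
    return low <= distance <= high


def weighted_score(spatial_checks, pose_scores):
    """Weighted mean of the checks (normalization by total weight is an assumption)."""
    total_weight = SPATIAL_WEIGHT * len(spatial_checks) + POSE_WEIGHT * len(pose_scores)
    if total_weight == 0:
        return 0.0
    earned = SPATIAL_WEIGHT * sum(1.0 for ok in spatial_checks if ok)
    earned += POSE_WEIGHT * sum(pose_scores)
    return earned / total_weight
```

Under these assumptions, a scene with three of four spatial checks passing and pose scores of 1.0 and 0.5 scores (3 + 2 × 1.5) / (4 + 4) = 0.75, which sits exactly on the pass threshold.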
pose_maker_rigger_with_reference: Reference-injection A/B tests
- Fixtures: tests/fixtures/pose_with_reference/
- Same scoring as fresh, but each fixture has a "references": ["pose-id-1", ...] field. The runner looks each id up in pose_library/, builds {id, name, description, toolCalls} from the canonical fixture, and forwards them as referencePoses to the rigger.
- Lets you measure whether anchoring on a structurally similar verified pose improves output for poses where the rigger consistently undershoots (sitting, leaning, etc.).
Verified Poses
scripts/verified_poses.json is the allowlist — only these ids are exported as fresh-scene fixtures and only these poses appear in the director's reference catalog at runtime.
To add a pose to the verified set:
1. Render the pose in the FE pose-debug page; confirm tool calls are visually correct
2. Add the id to scripts/verified_poses.json
3. Mirror to Mongo: npx ts-node scripts/poseMaker/markVerifiedPoses.ts --apply
4. Re-export fixtures: python scripts/export_pose_library.py
The markVerifiedPoses.ts script keeps Mongo isVerified in sync with the JSON. Production poseMaker reads isVerified: true to populate the director's reference catalog.
Output Files
After each run, test_results/pose_outputs/<label>/<test_id>.json contains the actual tool calls and scoring details. Paste the actual array into the frontend pose-debug renderer to visualize it.
If you don't pass --label, output goes to pose_outputs/default/.
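A small helper along these lines can pull the array out of an output file for pasting. It is a sketch only: the helper names are hypothetical, and it assumes the tool calls sit under a top-level "actual" key, which the text above suggests but does not spell out.

```python
import json
from pathlib import Path


def output_path(test_id, label=None, root="test_results/pose_outputs"):
    """Per-label output path; runs without --label land in the 'default' subdir."""
    return Path(root) / (label or "default") / f"{test_id}.json"


def load_actual_calls(path):
    """Return the tool-call array from one output file."""
    # Assumption: the array is stored under a top-level "actual" key.
    return json.loads(Path(path).read_text())["actual"]
```

For example, `output_path("rigger_standing_arms_crossed", "opus-4-7")` resolves to the file written by a run labelled opus-4-7.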
Recomputing Scores Without LLM Calls
Once you have cached test outputs, you can recompute scores under updated thresholds / scoring tweaks without paying for new LLM calls:
python scripts/recompute_pose_scores.py opus-4-7 # one label
python scripts/recompute_pose_scores.py opus-4-7 fable # diff two labels
Regenerating Goldens
The regenerate workflow is now three steps so you don't pay for LLM calls on a dry run, and you can review results before writing to Mongo:
# 1. Preview which entries would regen (no LLM, no DB writes):
npx ts-node scripts/poseMaker/regeneratePoseLibrary.ts --unverified --dry-run
# 2. Run director→rigger and CACHE results to disk
# (no DB writes; script prints a runId at the end):
npx ts-node scripts/poseMaker/regeneratePoseLibrary.ts --unverified
# Each cached file is { "expected": [...old tool calls...], "actual": [...new tool calls...] }
# under scripts/poseMaker/.regenCache/<runId>/<id>.json. Paste either
# array into the FE pose-debug renderer.
# 3. Apply approved cached results to Mongo (no LLM):
npx ts-node scripts/poseMaker/regeneratePoseLibrary.ts --apply-from-cache <runId>
Filters: --all | --unverified | --category <cat> | --id <id>. Add --verbose to step 2 for per-pose director output.
The apply step skips broken caches automatically (where actual.length <= 1, which usually means the rigger timed out and the cache only has a resetPose).
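The broken-cache rule can be expressed as a short filter. This is a Python sketch of the stated condition, not the TypeScript script's code; the function names and the in-memory dict of entries keyed by pose id are assumptions.

```python
def is_broken_cache(entry):
    """Broken when 'actual' holds at most one tool call (typically only a resetPose)."""
    return len(entry.get("actual", [])) <= 1


def applicable_entries(entries):
    """Keep only the cached results that the apply step would write."""
    return {pose_id: e for pose_id, e in entries.items() if not is_broken_cache(e)}
```

An entry whose actual array is missing is also treated as broken here, which is a choice made for this sketch.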
After applying, re-export test fixtures:
python scripts/export_pose_library.py
Changing Model
Edit the RIGGER_MODEL constant at the top of runRigger in src/testDebug/testDebug.service.ts:
private static readonly RIGGER_MODEL: string = BedrockModelId.CLAUDE_OPUS_4_7; // or BedrockModelId.CLAUDE_FABLE_5, OpenAIModelId.GPT_5_4
Pass a --label matching the model so outputs don't overwrite each other.
Adding Tests
- Fresh (verified-allowlist gated): add the pose id to scripts/verified_poses.json, run markVerifiedPoses.ts --apply, then re-export fixtures. The export script regenerates pose_library_index.json automatically and preserves any references field on existing fixtures across re-exports.
- Incremental/Transition: hand-author JSON in tests/fixtures/pose_incremental/, update incremental_index.json
- Multi-figure: hand-author JSON in tests/fixtures/pose_multi_figure/, update multi_figure_index.json
- With-reference A/B: copy a fresh fixture into tests/fixtures/pose_with_reference/, append "references": ["<verified-pose-id>"]. No index file needed: the runner picks up every JSON in the directory.
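For the with-reference fixtures, the directory scan and reference lookup might look roughly like the sketch below. It is not the runner's code: the function names are hypothetical, and it assumes each canonical fixture is stored as <pose-id>.json in the pose library directory with name, description, and toolCalls keys.

```python
import json
from pathlib import Path


def build_reference_poses(fixture, library_dir):
    """Resolve each id in fixture["references"] to {id, name, description, toolCalls}."""
    refs = []
    for pose_id in fixture.get("references", []):
        # Assumption: the canonical fixture lives at <library_dir>/<pose_id>.json.
        canonical = json.loads((Path(library_dir) / f"{pose_id}.json").read_text())
        refs.append({
            "id": pose_id,
            "name": canonical.get("name"),
            "description": canonical.get("description"),
            "toolCalls": canonical.get("toolCalls"),
        })
    return refs


def load_with_reference_fixtures(fixture_dir, library_dir):
    """Load every JSON in fixture_dir (no index file) and attach referencePoses."""
    loaded = []
    for path in sorted(Path(fixture_dir).glob("*.json")):
        fixture = json.loads(path.read_text())
        fixture["referencePoses"] = build_reference_poses(fixture, library_dir)
        loaded.append(fixture)
    return loaded
```

A fixture without a references field simply gets an empty referencePoses list in this sketch.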