Dialogue System Flow - Complete Architecture
Table of Contents
- Overview
- DB Field Lifecycle Workflow
- Creation Flow (Scene Generation)
- Distribution Flow (Shot Generation)
- Redistribution Flow (Shot Edits)
- Data Structures
- Key Design Principles
- Dialogue Audio V2
- Complete Example
Overview
The dialogue system operates on a source of truth principle where scene-level dialogue (SceneDialogue[]) is canonical and shot-level dialogue (ShotDialogue[]) is derived. This document explains the complete flow from dialogue creation through distribution and redistribution.
Note: After shot breakdown, scene and shot timings share the same seconds-from-scene-start clock.
High-Level Flow
┌─ SCRIPT (text with character dialogue)
└─
  │
  ↓ [1. EXTRACTION]
┌─ SCENE DIALOGUE (SceneDialogue[])
│   • dialogueId, speaker, text
│   • origin, timing, voiceDirection
│   • initial contiguous timing
└─
  │
  ↓ [2. SHOT PLAN + DISTRIBUTION]
┌─ SHOT DIALOGUE (ShotDialogue[])
│   • references SceneDialogue via dialogueId + portion indices
│   • from dialogueDistribution
└─
  │
  ↓ [3. RETIME + REDISTRIBUTE]
┌─ RECONCILED TIMING
│   • scene timing syncs from shots
│   • same-speaker lines stay serial
│   • cross-speaker overlap allowed
└─
  │
  ↓ [4. SHOT EDIT REDISTRIBUTION]
┌─ UPDATED SHOT DIALOGUE
│   • Recalculated when shots change
│   • Keeps scene timing in sync
└─
DB Field Lifecycle Workflow
This is the field-level source-of-truth map for MongoDB storyvideos.scenes[].
generateScenesV2
  ↓
scene.dialogues[] created
  - dialogueId, speaker, text, origin, voiceDirection, dialogueIndex
  - timing.startTime/endTime from contiguous text duration
  - timing.deliverySpeed
  - audios[] empty/missing
  - shots[] still empty

generateSceneShotsFromPlan
  ↓
shot.dialogueDistribution from LLM (temporary)
  ↓ convertDialogueDistribution()
scene.shots[].dialogues[] created from dialogueDistribution
  - dialogueId references scene.dialogues[].dialogueId
  - portion.start/end/text stores the shot substring
  - timing.startOffset is shot-local
  - timing.sceneStartOffset is absolute scene seconds
  - timing.duration is the spoken portion duration
  ↓ retimeAndRedistributeDialogues()
scene.dialogues[].timing reconciled from shot placement
  - same-speaker lines stay sequential
  - first-start monotonic by dialogueIndex: later script lines may overlap earlier lines, but do not begin before earlier lines begin
  - cross-speaker overlap remains allowed
  - overflowSceneCropped marks omitted tail text when the scene is too short
scene.shots[].dialogues[] rewritten from reconciled scene timing
scene.shots[].dialogue display string derived from structured shot.dialogues

shot edits (insert / remove / reorder / merge / timing change)
  ↓ redistributeDialogue()
scene.dialogues[] remains canonical (timing unchanged on this path)
scene.dialogues[].overflowSceneCropped updated from shot placement + scene duration
scene.shots[].dialogues[] and scene.shots[].dialogue are recalculated

dialogue updates
  ↓ updateDialogueText()
scene.dialogues[].text changes
scene.dialogues[].timing.endTime recalculated from existing startTime
scene.dialogues[].audios[].isSelected cleared
scene.shots[].dialogues[] redistributed
scene.dialogues[].overflowSceneCropped refreshed from current shot coverage

dialogue moves
  ↓ moveDialogue()
within scene: same SceneDialogue retimed, reindexed, redistributed
across scenes: SceneDialogue removed from source scene and appended to target scene
source and target scene.shots[].dialogues[] redistributed
target scene.characters updated when the moved speaker is missing

dialogue delete
  ↓ deleteDialogue()
scene.dialogues[].isDeleted = true
deleted dialogue remains stored for history
active dialogue timings are preserved
scene.shots[].dialogues[] redistributed without deleted dialogue
scene.dialogues[].overflowSceneCropped refreshed from current shot coverage

dialogue audio
  ↓ generate/select/update text
scene.dialogues[].audios[] stores TTS history and selected entry
audio selection can restore a text snapshot, then follows updateDialogueText rules
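The reconciliation rules in this map (same-speaker lines stay sequential, starts are monotonic by dialogueIndex, cross-speaker overlap is allowed) can be expressed as a small checker. This is a minimal sketch, not the project's code: `TimedLine`, `speakerKey`, and `findTimingViolations` are hypothetical names, and the tolerance value is an assumption.

```typescript
// Hypothetical minimal shape; the real SceneDialogue carries more fields.
interface TimedLine {
  dialogueIndex: number;
  speakerKey: string; // assumption: any stable per-speaker identifier
  startTime: number;
  endTime: number;
  isDeleted?: boolean;
}

const EPSILON = 1e-6; // assumed tolerance for float comparisons

function findTimingViolations(lines: TimedLine[]): string[] {
  // Soft-deleted lines are ignored; the rest are checked in script order.
  const active = lines
    .filter((line) => !line.isDeleted)
    .sort((a, b) => a.dialogueIndex - b.dialogueIndex);

  const violations: string[] = [];
  const lastEndBySpeaker = new Map<string, number>();
  let latestStart = Number.NEGATIVE_INFINITY;

  for (const line of active) {
    // Rule 1: a later line may overlap earlier lines but must not start before them.
    if (line.startTime < latestStart - EPSILON) {
      violations.push(`line ${line.dialogueIndex} starts before an earlier line`);
    }
    latestStart = Math.max(latestStart, line.startTime);

    // Rule 2: lines from the same speaker stay sequential.
    const previousEnd = lastEndBySpeaker.get(line.speakerKey);
    if (previousEnd !== undefined && line.startTime < previousEnd - EPSILON) {
      violations.push(`line ${line.dialogueIndex} overlaps its own speaker`);
    }
    lastEndBySpeaker.set(line.speakerKey, Math.max(previousEnd ?? line.endTime, line.endTime));
  }
  return violations;
}
```

Cross-speaker overlap never produces a violation here, which mirrors the "cross-speaker overlap remains allowed" rule.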
1. Creation Flow (Scene Generation)
Trigger
User clicks "Generate Scenes" → generateScenesV2()
Step-by-Step Process
User Action: "Generate Scenes"
  ↓
┌─ 1. LLM Call (generateScenesV2.template.ts)
│ Input:
│   • scriptContent (raw text)
│   • characters[] (from story, with assetIds)
│   • environmentSettings[] (from story, with assetIds)
│
│ Prompt Instructions:
│   "Extract dialogue from script. Return:
│    - dialogue: Array of segments
│    - Each segment: speaker, text, deliverySpeed,
│      voiceDirection, order"
└─
  ↓
┌─ 2. LLM Output (Raw JSON)
│ {
│   scenes: [
│     {
│       sceneIndex: 1,
│       title: "Koyal and Cuddles Reunion",
│       characters: [
│         {title: "Koyal", assetId: "char-uuid-1"},
│         {title: "Cuddles", assetId: "char-uuid-2"}
│       ],
│       dialogue: [                      ← Raw LLM output
│         {
│           speaker: {
│             type: "CHARACTER",
│             characterName: "Koyal"
│           },
│           text: "I was on my way to save you...",
│           deliverySpeed: "normal",
│           voiceDirection: "apologetic, tired",
│           order: 1
│         },
│         {
│           speaker: {
│             type: "CHARACTER",
│             characterName: "Cuddles"
│           },
│           text: "You're a bird. You fly. What traffic??",
│           deliverySpeed: "fast",
│           voiceDirection: "sarcastic, annoyed",
│           order: 2
│         }
│       ]
│     }
│   ]
│ }
└─
  ↓
┌─ 3. Backend Processing (workbench.service.ts)
│    convertLLMDialogueToSceneDialogue()
│ Step 3a: Map character names → assetIds
│   characterMap = {
│     "koyal": "char-uuid-1",
│     "cuddles": "char-uuid-2"
│   }
│
│ Step 3b: Generate dialogue UUIDs
│   dialogueId = generateSegmentId() // Backend UUID
│
│ Step 3c: Add assetId to speakers
│   if (speaker.type === 'CHARACTER') {
│     speaker.assetId = characterMap.get(...)
│   }
│
│ Step 3d: Calculate timing (dialogue.utils.ts)
│   recalculateSceneDialogueTimingsContiguous(dialogues)
│   • Loops through dialogues in order
│   • For each dialogue:
│     - Base duration = text.length / CHARS_PER_SECOND
│     - Add pauses for punctuation (., ?, !, etc.)
│     - startTime = previous.endTime
│     - endTime = startTime + duration
└─
  ↓
┌─ 4. Final Scene Data (Saved to MongoDB)
│ scene: {
│   sceneId: "scene-uuid",
│   sceneIndex: 1,
│   title: "Koyal and Cuddles Reunion",
│   characters: [...], // AssetRef[]
│   dialogues: [                         ← CANONICAL DIALOGUE (SceneDialogue[])
│     {
│       dialogueId: "dlg-abc-123",       ← Backend UUID
│       speaker: {
│         title: "Koyal",
│         type: "CHARACTER",
│         assetId: "char-uuid-1"         ← Mapped
│       },
│       text: "I was on my way to save you...",
│       origin: "SCRIPT",
│       timing: {
│         startTime: 0,                  ← Calculated
│         endTime: 3.99,                 ← Calculated
│         deliverySpeed: "normal"
│       },
│       voiceDirection: "apologetic, tired",
│       dialogueIndex: 1
│     },
│     {
│       dialogueId: "dlg-def-456",
│       speaker: {
│         title: "Cuddles",
│         type: "CHARACTER",
│         assetId: "char-uuid-2"
│       },
│       text: "You're a bird. You fly. What traffic??",
│       origin: "SCRIPT",
│       timing: {
│         startTime: 3.99,               ← Sequential
│         endTime: 7.11,
│         deliverySpeed: "fast"
│       },
│       voiceDirection: "sarcastic, annoyed",
│       dialogueIndex: 2
│     }
│   ],
│   shots: []                            ← Empty, shots not generated yet
│ }
└─
Code References
- LLM Prompt: src/promptTemplates/workbenchV2/generateScenesV2.template.ts
- Conversion Logic: src/workbench/workbench.service.ts (convertLLMDialogueToSceneDialogue)
- Timing Calculation: src/shared/dialogue.utils.ts (recalculateSceneDialogueTimingsContiguous)
- UUID Generation: src/shared/dialogue.utils.ts (generateSegmentId)
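Steps 3a and 3c (name-to-assetId mapping) might look roughly like the sketch below. It is illustrative only: `buildCharacterMap` and `resolveSpeaker` are hypothetical helpers, and the case-insensitive, trimmed lookup is an assumption inferred from the lowercase keys shown in Step 3a.

```typescript
type SpeakerType = 'CHARACTER' | 'NARRATOR' | 'VOICEOVER' | 'OFF_SCREEN';

interface AssetRef { title: string; assetId: string }
interface LlmSpeaker { type: SpeakerType; characterName: string }
interface SceneSpeaker { title: string; type: SpeakerType; assetId?: string }

// Assumption: names are matched case-insensitively after trimming.
const normalize = (name: string): string => name.trim().toLowerCase();

function buildCharacterMap(characters: AssetRef[]): Map<string, string> {
  const map = new Map<string, string>();
  for (const character of characters) {
    map.set(normalize(character.title), character.assetId);
  }
  return map;
}

function resolveSpeaker(speaker: LlmSpeaker, characterMap: Map<string, string>): SceneSpeaker {
  const resolved: SceneSpeaker = { title: speaker.characterName, type: speaker.type };
  // Only CHARACTER speakers get an assetId; unknown names stay unmapped.
  if (speaker.type === 'CHARACTER') {
    const assetId = characterMap.get(normalize(speaker.characterName));
    if (assetId !== undefined) resolved.assetId = assetId;
  }
  return resolved;
}

const characterMap = buildCharacterMap([
  { title: 'Koyal', assetId: 'char-uuid-1' },
  { title: 'Cuddles', assetId: 'char-uuid-2' },
]);
const koyal = resolveSpeaker({ type: 'CHARACTER', characterName: 'Koyal' }, characterMap);
```

Leaving `assetId` undefined for unmatched names, rather than throwing, is a design choice of this sketch; the real service may handle that case differently.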
2. Distribution Flow (Shot Generation)
Trigger
User generates shots (KEYS_ONLY mode) → generateShots()
Step-by-Step Process
User Action: "Generate Key Shots"
  ↓
┌─ 1. LLM Call (generateShotsV2.template.ts)
│ Input:
│   • sceneData (with dialogues: SceneDialogue[])
│   • storyData
│   • mode: "KEYS_ONLY"
│
│ Prompt Includes Scene Dialogue:
│   "SCENE DIALOGUE (reference by index):
│   [
│     {
│       segmentIndex: 0,                 ← LLM uses this
│       dialogueIndex: 1,
│       speaker: {...},
│       text: "I was on my way...",
│       timing: {startTime: 0, endTime: 3.99}
│     },
│     {
│       segmentIndex: 1,
│       dialogueIndex: 2,
│       speaker: {...},
│       text: "You're a bird...",
│       timing: {startTime: 3.99, endTime: 7.11}
│     }
│   ]
│
│   NOTE: Distribute dialogue across shots.
│   Use segmentIndex + portionStart/End.
└─
  ↓
┌─ 2. LLM Output (Raw JSON)
│ {
│   key_shots: [
│     {
│       shotId: "K1",
│       cameraAngle: "MEDIUM_TWO_SHOT",
│       duration: 4.5,
│       dialogueDistribution: {          ← Raw LLM output
│         segments: [
│           {
│             segmentIndex: 0,           ← References scene
│             portionStart: 0,           ← Char index
│             portionEnd: 66             ← Full segment
│           }
│         ]
│       }
│     },
│     {
│       shotId: "K2",
│       cameraAngle: "CLOSE_UP",
│       duration: 1.2,
│       dialogueDistribution: { segments: [] }   ← No dialogue
│     },
│     {
│       shotId: "K3",
│       cameraAngle: "MEDIUM_SINGLE",
│       duration: 3.5,
│       dialogueDistribution: {
│         segments: [
│           {
│             segmentIndex: 1,
│             portionStart: 0,
│             portionEnd: 38             ← Full segment (38 chars)
│           }
│         ]
│       }
│     }
│   ]
│ }
└─
  ↓
┌─ 3. Backend Processing (testDebug.service.ts /
│    workbenchSceneGuru.convertDialogueDistribution)
│ For each shot:
│   For each segment in dialogueDistribution:
│     1. Get scene dialogue by segmentIndex
│     2. Create ShotDialogue entry via
│        createShotDialogue(sceneDialogue,
│          portionStart, portionEnd, startOffset)
│
│ Remove raw dialogueDistribution from shot
└─
  ↓
┌─ 4. Final Shot Data (Saved to scene.shots[])
│ shots: [
│   {
│     shotId: "K1",
│     cameraAngle: "MEDIUM_TWO_SHOT",
│     duration: 4.5,
│     dialogues: [                       ← SHOT DIALOGUE (ShotDialogue[])
│       {
│         dialogueId: "dlg-abc-123",     ← Links to scene
│         speaker: {...},                ← Copied from scene
│         portion: { start: 0, end: 66, text: "I was on my way to save you..." },
│         timing: {
│           startOffset: 0,
│           duration: 3.99
│         },
│         voiceDirection: "apologetic, tired"
│       }
│     ]
│   },
│   {
│     shotId: "K2",
│     duration: 1.2,
│     dialogues: []                      ← Visual only
│   },
│   {
│     shotId: "K3",
│     duration: 3.5,
│     dialogues: [
│       {
│         dialogueId: "dlg-def-456",
│         speaker: {...},
│         portion: { start: 0, end: 38, text: "You're a bird. You fly. What traffic??" },
│         timing: {
│           startOffset: 0,
│           duration: 3.12
│         },
│         voiceDirection: "sarcastic, annoyed"
│       }
│     ]
│   }
│ ]
└─
Key Point
Shot dialogue is DERIVED, not independent:
- Shots reference scene dialogue via dialogueId + portion.start/end
- Shots only cache the relevant substring and timing
- Full dialogue text and canonical timing live on the scene
Code References
- LLM Prompt: src/promptTemplates/workbenchV2/generateShotsV2.template.ts
- Shot Generation: src/agenticGuru/workbenchSceneGuru.ts (generateShots)
- Post-processing: src/testDebug/testDebug.service.ts (postProcessGenerateShots)
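The conversion from an LLM dialogueDistribution entry to a shot-level portion can be sketched as follows. This is an assumption-laden illustration rather than the actual convertDialogueDistribution: `toShotPortions` is a hypothetical name, and clamping out-of-range indices and skipping empty portions are defensive choices made for the example.

```typescript
interface SceneLine { dialogueId: string; text: string }
interface DistributionSegment { segmentIndex: number; portionStart: number; portionEnd: number }
interface ShotPortion {
  dialogueId: string;
  portion: { start: number; end: number; text: string };
}

function toShotPortions(sceneLines: SceneLine[], segments: DistributionSegment[]): ShotPortion[] {
  const portions: ShotPortion[] = [];
  for (const segment of segments) {
    const line = sceneLines[segment.segmentIndex];
    if (!line) continue; // assumption: ignore indices the LLM invented

    // Clamp the character range to the canonical text.
    const start = Math.max(0, Math.min(segment.portionStart, line.text.length));
    const end = Math.max(start, Math.min(segment.portionEnd, line.text.length));
    if (end === start) continue; // nothing to say in this shot

    portions.push({
      dialogueId: line.dialogueId, // the link back to the scene-level line
      portion: { start, end, text: line.text.slice(start, end) }, // cached substring
    });
  }
  return portions;
}

const sceneLines: SceneLine[] = [
  { dialogueId: 'dlg-abc-123', text: 'I was on my way to save you...' },
  { dialogueId: 'dlg-def-456', text: "You're a bird. You fly. What traffic??" },
];
const k3 = toShotPortions(sceneLines, [{ segmentIndex: 1, portionStart: 0, portionEnd: 38 }]);
```

Because each portion stores only `dialogueId` plus a character range, the cached `text` can always be rebuilt from the scene-level line.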
3. Redistribution Flow (Shot Edits)
Trigger
User adds filler shot → addFillerShot() → Dialogue must redistribute
Step-by-Step Process
User Action: "Add filler shot between K2 and K3"
  ↓
┌─ 1. Current State (Before Filler)
│ Scene Dialogue (UNCHANGED - source of truth):
│   dialogues: [
│     {dialogueId: "dlg-abc-123", text: "I was...", ...}
│     {dialogueId: "dlg-def-456", text: "You're...", ...}
│   ]
│
│ Shots:
│   K1 (4.5s): [dlg-abc-123: 0-66 chars]
│   K2 (1.2s): []                        ← Visual only
│   K3 (3.5s): [dlg-def-456: 0-38 chars]
│
│ Problem: Shots total 9.2s; dialogue needs 7.11s
│   dlg-def-456 sits entirely in K3 (38/38 chars)
│   Where does the dialogue go once a shot is inserted?
└─
  ↓
┌─ 2. Insert Filler Shot (workbenchSceneGuru.ts:1069)
│ New Shot Array:
│   K1 (4.5s)                            ← Existing
│   K2 (1.2s)                            ← Existing
│   F1 (1.0s)                            ← NEW FILLER
│   K3 (3.5s)                            ← Existing
│
│ Total duration: 10.2s
│ Dialogue duration: 7.11s
│
│ Trigger Redistribution:
│   redistributeDialogueAcrossShots(
│     sceneDialogues: scene.dialogues,
│     shotData: [
│       {shotId: "K1", timing: 4.5},
│       {shotId: "K2", timing: 1.2},
│       {shotId: "F1", timing: 1.0},     ← NEW
│       {shotId: "K3", timing: 3.5}
│     ]
│   )
└─
  ↓
┌─ 3. Redistribution Algorithm (dialogue.utils.ts)
│ Input:
│   • Scene dialogues (canonical dialogue)
│   • Shot timings (available time)
│
│ Algorithm:
│   1. Sort dialogues by dialogueIndex
│   2. Initialize: currentShotIndex = 0,
│      currentShotTimeUsed = 0
│   3. For each dialogue:
│      a. Walk through its text using character indices
│      b. For each shot while text remains:
│         - Calculate how much text fits in remaining
│           shot time (charsPerSecond × time)
│         - Find a word boundary for the break point
│         - Create ShotDialogue for that portion
│         - Update currentShotTimeUsed and index
│
│ Key Logic:
│   while (dialogueCharIndex < dialogueText.length
│          && currentShotIndex < shots.length) {
│     shotTimeRemaining = shot.timing - usedTime
│
│     if (remaining text fits in shotTimeRemaining) {
│       // Dialogue fits entirely in this shot
│       portionEnd = dialogueText.length
│     } else {
│       // Partial segment, break at word boundary
│       charsPerSec = CHARS_PER_SECOND[speed]
│       maxChars = shotTimeRemaining * charsPerSec
│       portionEnd = findWordBoundary(text, maxChars)
│     }
│
│     createShotDialogue(...)
│     currentShotTimeUsed += portionDuration
│   }
└─
  ↓
┌─ 4. New Shot Dialogue (After Redistribution)
│ K1 (4.5s): [dlg-abc-123: 0-66]         ← Full segment
│   "I was on my way to save you, but… traffic..."
│   Duration: 3.99s, fits in 4.5s
│
│ K2 (1.2s): []                          ← Visual only, no dialogue
│   Filler for reaction/establishing
│
│ F1 (1.0s): [dlg-def-456: 0-15]         ← NEW PORTION
│   "You're a bird."
│   Duration: ~0.8s, fits in 1.0s
│
│ K3 (3.5s): [dlg-def-456: 15-38]        ← ADJUSTED
│   "You fly. What traffic??"
│   Duration: ~2.3s, fits in 3.5s
│
│ Result: All dialogue redistributed naturally
│   Breaks at word boundaries
│   No gaps or overlaps
└─
Key Points
- Scene dialogue text and timing do not change on this path; the scene stays the source of truth (only overflowSceneCropped is refreshed)
- Shot dialogue is recalculated from scene dialogue + shot timings
- The algorithm ensures:
  - Linear flow (no dialogue skipped)
  - Word-boundary breaks (natural pacing)
  - No overlaps or gaps
  - Respect for shot timing constraints
Code References
- Redistribution Call: src/agenticGuru/workbenchSceneGuru.ts:1069-1095
- Algorithm: src/shared/dialogue.utils.ts:144-215 (redistributeDialogueAcrossShots)
- Portion Creation: src/shared/dialogue.utils.ts:98-139 (createShotDialoguePortion)
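The greedy walk described in the algorithm box can be sketched as below. This is a simplified illustration, not the code in dialogue.utils.ts: `redistribute` and this `findWordBoundary` signature are hypothetical, punctuation pauses are ignored, and the sketch fills every shot in order, so it will not necessarily reproduce the placements in the worked example (which keeps K2 visual-only).

```typescript
type DeliverySpeed = 'slow' | 'normal' | 'fast';
const CHARS_PER_SECOND: Record<DeliverySpeed, number> = { slow: 12, normal: 18, fast: 26 };

interface Line { dialogueId: string; dialogueIndex: number; text: string; deliverySpeed: DeliverySpeed }
interface Shot { shotId: string; duration: number }
interface Placement {
  shotId: string;
  dialogueId: string;
  portion: { start: number; end: number; text: string };
  timing: { startOffset: number; duration: number };
}

// Returns the end index of the longest word-aligned portion starting at `from`
// that fits in `maxChars`; returns `from` when not even one word fits.
function findWordBoundary(text: string, from: number, maxChars: number): number {
  const limit = from + Math.floor(maxChars);
  if (limit <= from) return from;
  if (limit >= text.length) return text.length;
  const space = text.lastIndexOf(' ', limit - 1);
  return space >= from ? space + 1 : from;
}

function redistribute(lines: Line[], shots: Shot[]): Placement[] {
  const ordered = [...lines].sort((a, b) => a.dialogueIndex - b.dialogueIndex);
  const placements: Placement[] = [];
  let shotIndex = 0;
  let timeUsed = 0; // seconds already spoken inside the current shot

  for (const line of ordered) {
    const charsPerSecond = CHARS_PER_SECOND[line.deliverySpeed];
    let cursor = 0;
    while (cursor < line.text.length && shotIndex < shots.length) {
      const shot = shots[shotIndex];
      const end = findWordBoundary(line.text, cursor, (shot.duration - timeUsed) * charsPerSecond);
      if (end <= cursor) { // no room left: move to the next shot
        shotIndex += 1;
        timeUsed = 0;
        continue;
      }
      const duration = (end - cursor) / charsPerSecond;
      placements.push({
        shotId: shot.shotId,
        dialogueId: line.dialogueId,
        portion: { start: cursor, end, text: line.text.slice(cursor, end) },
        timing: { startOffset: timeUsed, duration },
      });
      timeUsed += duration;
      cursor = end;
      if (cursor < line.text.length) { // line continues in the next shot
        shotIndex += 1;
        timeUsed = 0;
      }
    }
    // Text left over here did not fit in any shot (the overflow case).
  }
  return placements;
}
```

Breaking just after a space keeps each portion ending on a whole word and lets the next portion start on one, so concatenating a line's portions reproduces its text.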
4. Data Structures
Scene Dialogue (Source of Truth)
// Scene-level dialogue segment (source of truth)
export interface SceneDialogue {
  dialogueId: string;
  speaker: {
    title: string; // Character display name
    type: 'CHARACTER' | 'NARRATOR' | 'VOICEOVER' | 'OFF_SCREEN';
    assetId?: string; // Mapped from scene.characters
  };
  text: string; // FULL turn text
  origin: 'SCRIPT' | 'GENERATED';
  timing: {
    startTime: number; // Seconds from scene start
    endTime: number; // Calculated from text + speed
    deliverySpeed: 'slow' | 'normal' | 'fast';
  };
  voiceDirection: string; // TTS guidance
  dialogueIndex: number; // Sequence (1, 2, 3...)
  audios?: DialogueAudioEntry[]; // Generated TTS audio entries (history + active selection)
  isDeleted?: boolean; // Soft-delete flag: true means excluded from UI and timing recalcs
}
// Audio generation entry for a dialogue segment
export interface DialogueAudioEntry {
  assetGenJobId: string; // Reference to AssetGenJob; resolves audioUrl, userId, createdAt
  pitch: 'high' | 'low' | 'none'; // Voice variant used for this generation
  isSelected: boolean; // Only one entry should be true at a time per dialogue
  text: string; // Snapshot of text at generation time (staleness detection)
}
Shot Dialogue (Derived)
// Shot-level dialogue (references SceneDialogue via dialogueId)
export interface ShotDialogue {
  dialogueId: string; // Reference to SceneDialogue.dialogueId
  speaker: SceneDialogue['speaker']; // Copied from scene for convenience
  portion: {
    start: number; // Character index in SceneDialogue.text
    end: number; // Character index in SceneDialogue.text
    text: string; // Cached substring for display
  };
  timing: {
    startOffset: number; // Seconds into the shot when this portion starts
    sceneStartOffset?: number; // Absolute scene seconds (shotStart + startOffset); same clock as SceneDialogue.timing
    duration: number; // How long this portion takes in seconds
  };
  voiceDirection: string; // Copied from scene dialogue
}
Scene vs shot timing: Both layers use the same seconds-from-scene-start clock after shot breakdown. SceneDialogue.timing.startTime/endTime is the canonical range for the full line; each ShotDialogue portion stores startOffset (within its shot) and sceneStartOffset (absolute scene time). Reconciliation code derives scene time from shotStart + startOffset; sceneStartOffset is persisted for eval, export, and other readers that should not re-walk shots.
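To make the shared clock concrete, the sketch below computes sceneStartOffset as shotStart + startOffset and folds the portions of each line into a scene-level range. The names (`sceneRangesFromShots`, `round3`) are hypothetical, and taking the earliest start and latest end per dialogueId is an assumption about how the range is aggregated.

```typescript
interface PortionTiming { dialogueId: string; startOffset: number; duration: number }
interface ShotWithPortions { shotId: string; duration: number; portions: PortionTiming[] }
interface SceneRange { startTime: number; endTime: number }

// Round to milliseconds so accumulated float error does not leak into stored values.
const round3 = (value: number): number => Math.round(value * 1000) / 1000;

function sceneRangesFromShots(shots: ShotWithPortions[]): Map<string, SceneRange> {
  const ranges = new Map<string, SceneRange>();
  let shotStart = 0; // seconds from scene start at which the current shot begins

  for (const shot of shots) {
    for (const portion of shot.portions) {
      const sceneStartOffset = round3(shotStart + portion.startOffset);
      const sceneEnd = round3(sceneStartOffset + portion.duration);
      const previous = ranges.get(portion.dialogueId);
      ranges.set(portion.dialogueId, {
        startTime: previous ? Math.min(previous.startTime, sceneStartOffset) : sceneStartOffset,
        endTime: previous ? Math.max(previous.endTime, sceneEnd) : sceneEnd,
      });
    }
    shotStart += shot.duration;
  }
  return ranges;
}

// Shot layout from the filler-shot example: F1 starts at 5.7s, K3 at 6.7s.
const ranges = sceneRangesFromShots([
  { shotId: 'K1', duration: 4.5, portions: [{ dialogueId: 'dlg-abc-123', startOffset: 0, duration: 3.99 }] },
  { shotId: 'K2', duration: 1.2, portions: [] },
  { shotId: 'F1', duration: 1.0, portions: [{ dialogueId: 'dlg-def-456', startOffset: 0, duration: 0.8 }] },
  { shotId: 'K3', duration: 3.5, portions: [{ dialogueId: 'dlg-def-456', startOffset: 0, duration: 2.3 }] },
]);
```

Persisting the computed offset on each portion is what lets readers such as eval or export skip this walk entirely.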
Timing Constants
// Characters per second for different delivery speeds
const CHARS_PER_SECOND = {
  slow: 12, // Moody, dramatic, deliberate
  normal: 18, // Standard conversation
  fast: 26 // Excited, urgent, quick banter
};
// Pause durations for punctuation (milliseconds)
const PAUSE_DURATIONS_MS = {
  ',': { min: 80, max: 120 }, // Brief pause
  '.': { min: 180, max: 260 }, // Sentence end
  '...': { min: 250, max: 450 }, // Ellipsis, suspense
  '!': { min: 200, max: 320 }, // Exclamation
  '?': { min: 200, max: 320 }, // Question
  '—': { min: 180, max: 300 }, // Em dash
  '\n': { min: 250, max: 500 } // Line break
};
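As a rough illustration of how these constants could combine into contiguous scene timing, the sketch below estimates a duration from character count plus punctuation pauses and chains lines end to start. Several things are assumptions: the function names are hypothetical, only single-character marks are handled, and each pause uses the midpoint of its min/max range (the source does not say how a value is picked), so the results will not necessarily match the 3.99s and 7.11s figures used elsewhere in this document.

```typescript
type DeliverySpeed = 'slow' | 'normal' | 'fast';

const CHARS_PER_SECOND: Record<DeliverySpeed, number> = { slow: 12, normal: 18, fast: 26 };

// Single-character subset of the pause table (milliseconds).
const PAUSE_MS: Record<string, { min: number; max: number }> = {
  ',': { min: 80, max: 120 },
  '.': { min: 180, max: 260 },
  '!': { min: 200, max: 320 },
  '?': { min: 200, max: 320 },
};

function estimateDurationSeconds(text: string, speed: DeliverySpeed): number {
  let pauseMs = 0;
  for (let i = 0; i < text.length; i++) {
    const pause = PAUSE_MS[text.charAt(i)];
    if (pause) pauseMs += (pause.min + pause.max) / 2; // assumption: midpoint of the range
  }
  return text.length / CHARS_PER_SECOND[speed] + pauseMs / 1000;
}

interface UntimedLine { text: string; deliverySpeed: DeliverySpeed }
interface TimedLine extends UntimedLine { startTime: number; endTime: number }

// Contiguous timing: each line starts exactly where the previous one ends.
function timeContiguously(lines: UntimedLine[]): TimedLine[] {
  let cursor = 0;
  return lines.map((line) => {
    const startTime = cursor;
    const endTime = startTime + estimateDurationSeconds(line.text, line.deliverySpeed);
    cursor = endTime;
    return { ...line, startTime, endTime };
  });
}

const timed = timeContiguously([
  { text: 'I was on my way to save you.', deliverySpeed: 'normal' },
  { text: "You're a bird. You fly.", deliverySpeed: 'fast' },
]);
```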
5. Key Design Principles
Single Source of Truth
- Scene dialogue is canonical
- Shot dialogue is derived and recalculable
- Edits to scene dialogue automatically invalidate shot dialogue
Referential Integrity
- Shots reference scene dialogue by dialogueId (UUID)
- Shots cache only their portion substring; the full text lives on the scene
- Always derive from scene
Linear Flow
- Dialogue flows sequentially across shots
- No skipped segments; same-speaker lines never overlap (cross-speaker overlap is allowed after reconciliation)
- dialogueIndex ensures sequence
Timing Independence
- Scene timing: Based on text + delivery speed
- Shot timing: Based on shot duration
- Can recalculate without affecting source
Frontend vs Backend Responsibilities
| Operation | Where | Why |
|---|---|---|
| Dialogue Extraction | Backend (LLM) | Requires AI understanding |
| UUID Generation | Backend | Security + consistency |
| Timing Calculation | Backend | Deterministic, complex |
| Initial Distribution | Backend (LLM) | Cinematic judgment needed |
| Redistribution | Backend | Affects multiple shots |
| Manual Edits | Frontend | User interactivity |
| Validation | Both | Frontend: UX, Backend: Data integrity |
Complete Example
Let's trace a full user journey:
1. User writes script
KOYAL: I was on my way to save you, but… traffic in Ghatkopar was insane.
CUDDLES: You're a bird. You fly. What traffic??
2. Generates scenes
Backend creates:
- dlg-abc-123: "I was on my way..." (0-3.99s, normal delivery)
- dlg-def-456: "You're a bird..." (3.99-7.11s, fast delivery)
3. Generates shots
LLM distributes:
- K1 (4.5s): Full segment 1 - "I was on my way to save you..."
- K2 (1.2s): Visual only - Reaction shot
- K3 (3.5s): Full segment 2 - "You're a bird. You fly. What traffic??"
4. Adds filler F1
Backend redistributes:
- K1 (4.5s): dlg-abc-123 (full) - "I was on my way..."
- K2 (1.2s): (visual) - Reaction shot
- F1 (1.0s): dlg-def-456 (0-15) - "You're a bird."
- K3 (3.5s): dlg-def-456 (15-38) - "You fly. What traffic??"
Result: Natural pacing, no dialogue lost, cinematically sound!
Testing
The dialogue system is tested with 36 test cases across 3 fixtures (sdf3, sentinels1, voc5).
See python_eval/tests/test_definitions/dialogue_operations.py for:
- 7 deterministic tests for dialogue extraction
- 2 deterministic tests for dialogue distribution
- 3 G-Eval tests for semantic quality
Current Status: 100% passing (36/36)
6. Dialogue Audio V2: TTS Generation & Management
Overview
Generates ElevenLabs TTS audio for scene-level dialogue entries. Audio lives at the SceneDialogue level as audios: DialogueAudioEntry[]. Each entry is a lean record: display data (audioUrl, userId, createdAt) is resolved at runtime from the AssetGenJob collection.
API Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | /workbench/generateDialogueAudio | Create TTS job + push audio entry onto dialogue; reconciles dialogue timing with real WAV duration in the same DB write |
| PATCH | /workbench/selectDialogueAudio | Select audio entry + sync dialogue text from snapshot when it differs (atomic) |
| PATCH | /workbench/updateDialogueText | Update text + recalculate timings + redistribute to shots; clears isSelected on all audio entries |
| PATCH | /workbench/deleteDialogue | Soft-delete dialogue + recalculate timings + redistribute to shots |
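The text-update rule (new text, endTime recalculated from the existing startTime, isSelected cleared on every audio entry) can be sketched as a pure function. This is a simplified illustration with a hypothetical `applyTextUpdate` helper: it estimates duration from character count alone and leaves out shot redistribution and persistence.

```typescript
type DeliverySpeed = 'slow' | 'normal' | 'fast';
const CHARS_PER_SECOND: Record<DeliverySpeed, number> = { slow: 12, normal: 18, fast: 26 };

interface AudioEntry { assetGenJobId: string; isSelected: boolean; text: string }
interface EditableLine {
  text: string;
  timing: { startTime: number; endTime: number; deliverySpeed: DeliverySpeed };
  audios?: AudioEntry[];
}

function applyTextUpdate(line: EditableLine, newText: string): EditableLine {
  // Assumption: duration is characters / speed only; punctuation pauses are ignored here.
  const duration = newText.length / CHARS_PER_SECOND[line.timing.deliverySpeed];
  return {
    ...line,
    text: newText,
    // startTime is kept; only endTime moves.
    timing: { ...line.timing, endTime: line.timing.startTime + duration },
    // Existing audio was generated for the old text, so nothing stays selected.
    audios: (line.audios ?? []).map((audio) => ({ ...audio, isSelected: false })),
  };
}

const before: EditableLine = {
  text: 'Old line.',
  timing: { startTime: 2, endTime: 2.5, deliverySpeed: 'normal' },
  audios: [{ assetGenJobId: 'job-1', isSelected: true, text: 'Old line.' }],
};
const after = applyTextUpdate(before, 'x'.repeat(18)); // 18 chars at 18 cps = 1 second
```

Each audio entry keeps its own `text` snapshot, which is what makes it possible to detect that the audio is stale after an edit.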
Soft Delete
Dialogues use the isDeleted: boolean pattern (matching shots/characters). On deletion:
1. dialogue.isDeleted = true is set in the DB
2. Remaining (non-deleted) dialogues get their timings recalculated
3. Dialogue portions are redistributed across shots
4. Deleted dialogues remain in the array for data preservation but are filtered out in queries and UI
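A minimal sketch of the soft-delete pattern follows, assuming in-memory arrays instead of MongoDB updates. `softDelete`, `activeLines`, and `portionsForActiveLines` are hypothetical helpers, and the real redistribution step recomputes portions rather than merely filtering them.

```typescript
interface StoredLine { dialogueId: string; text: string; isDeleted?: boolean }
interface StoredPortion { shotId: string; dialogueId: string }

// Step 1: flag the line instead of removing it, so history is preserved.
function softDelete(lines: StoredLine[], dialogueId: string): StoredLine[] {
  return lines.map((line) =>
    line.dialogueId === dialogueId ? { ...line, isDeleted: true } : line,
  );
}

// Step 4: queries and UI only see lines that are not flagged.
function activeLines(lines: StoredLine[]): StoredLine[] {
  return lines.filter((line) => !line.isDeleted);
}

// Simplified stand-in for step 3: shot portions must not reference deleted lines.
function portionsForActiveLines(portions: StoredPortion[], lines: StoredLine[]): StoredPortion[] {
  const activeIds = new Set(activeLines(lines).map((line) => line.dialogueId));
  return portions.filter((portion) => activeIds.has(portion.dialogueId));
}

const stored = softDelete(
  [
    { dialogueId: 'dlg-abc-123', text: 'I was on my way...' },
    { dialogueId: 'dlg-def-456', text: "You're a bird..." },
  ],
  'dlg-def-456',
);
const visiblePortions = portionsForActiveLines(
  [
    { shotId: 'K1', dialogueId: 'dlg-abc-123' },
    { shotId: 'K3', dialogueId: 'dlg-def-456' },
  ],
  stored,
);
```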
Integration Tests
Located at tests/workbench/dialogueAudio.test.ts: 19 supertest integration tests covering select, update text, and delete endpoints.
Related Files
Core Implementation
- src/workbench/workbench.service.ts - Scene generation with dialogue + dialogue audio V2 endpoints
- src/workbench/workbench.controller.ts - Route handlers for dialogue audio V2 (generate, select, update text, delete)
- src/workbench/workbench.validator.ts - Zod validators for dialogue audio V2 requests
- src/workbench/workbench.router.ts - Route definitions for dialogue audio V2
- src/agenticGuru/workbenchSceneGuru.ts - Shot generation, convertDialogueDistribution, retimeAndRedistributeDialogues
- src/shared/dialogue.utils.ts - Timing and redistribution algorithms
- src/shared/dialogue.types.ts - TypeScript type definitions (SceneDialogue, ShotDialogue, DialogueAudioEntry)
- src/shared/timelineTime.ts - Shared scene-clock helpers (parseShotTimingSeconds, rounding)
- src/shared/dialogue.constants.ts - Timing constants
- src/models/storyVideos.model.ts - MongoDB model methods (pushDialogueAudio, softDeleteDialogue)
Prompt Templates
- src/promptTemplates/workbenchV2/generateScenesV2.template.ts - Scene dialogue extraction
- src/promptTemplates/workbenchV2/generateShotsV2.template.ts - Shot dialogue distribution
Testing
- python_eval/tests/test_definitions/dialogue_operations.py - Eval test definitions
- python_eval/tests/test_definitions/utils/dialogue_validation.py - Deterministic eval helpers
- python_eval/tests/test_definitions/utils/g_eval_metrics.py - G-Eval metrics
- tests/shared/dialogue.utils.test.ts - Redistribution unit tests
- tests/shared/timelineTime.test.ts - Shot timing parse unit tests
- tests/agenticGuru/workbenchSceneGuru.test.ts - Guru reconciliation tests
- tests/workbench/dialogueAudio.test.ts - Supertest integration tests for dialogue audio API
Migration
For existing stories with old dialogue format, see src/shared/dialogue.migration.ts for on-demand migration utilities.