
🎬 Dialogue System Flow - Complete Architecture

📋 Table of Contents

  1. Overview
  2. Creation Flow (Scene Generation)
  3. Distribution Flow (Shot Generation)
  4. Redistribution Flow (Shot Edits)
  5. Data Structures
  6. Key Design Principles
  7. Dialogue Audio V2
  8. Complete Example

Overview

The dialogue system follows a single-source-of-truth principle: scene-level dialogue (SceneDialogue[]) is canonical, and shot-level dialogue (ShotDialogue[]) is derived from it. This document explains the complete flow from dialogue creation through distribution and redistribution.

High-Level Flow

┌─────────────┐
│   SCRIPT    │ (Text with character dialogue)
└──────┬──────┘
       │
       ↓ [1. EXTRACTION]
┌─────────────────────────────────────┐
│   SCENE DIALOGUE (SceneDialogue[])  │
│   • dialogueId, speaker, text       │
│   • origin, timing, voiceDirection  │
│   • Independent of shots            │
└──────┬──────────────────────────────┘
       │
       ↓ [2. DISTRIBUTION]
┌─────────────────────────────────────┐
│   SHOT DIALOGUE (ShotDialogue[])    │
│   • references SceneDialogue via    │
│     dialogueId + portion indices    │
│   • Derived from scene dialogue     │
└──────┬──────────────────────────────┘
       │
       ↓ [3. REDISTRIBUTION]
┌─────────────────────────────────────┐
│   UPDATED SHOT DIALOGUE             │
│   • Recalculated when shots change  │
│   • Fills gaps, maintains flow      │
└─────────────────────────────────────┘

1. Creation Flow (Scene Generation)

Trigger

User clicks "Generate Scenes" → generateScenesV2()

Step-by-Step Process

User Action: "Generate Scenes"
        ↓
┌───────────────────────────────────────────────────────┐
│ 1. LLM Call (generateScenesV2.template.ts)            │
├───────────────────────────────────────────────────────┤
│ Input:                                                │
│   • scriptContent (raw text)                          │
│   • characters[] (from story, with assetIds)          │
│   • environmentSettings[] (from story, with assetIds) │
│                                                       │
│ Prompt Instructions:                                  │
│   "Extract dialogue from script. Return:              │
│    - dialogue: Array of segments                      │
│    - Each segment: speaker, text, deliverySpeed,      │
│      voiceDirection, order"                           │
└───────────────────────────────────────────────────────┘
        ↓
┌───────────────────────────────────────────────────────┐
│ 2. LLM Output (Raw JSON)                              │
├───────────────────────────────────────────────────────┤
│ {                                                     │
│   scenes: [                                           │
│     {                                                 │
│       sceneIndex: 1,                                  │
│       title: "Koyal and Cuddles Reunion",             │
│       characters: [                                   │
│         {title: "Koyal", assetId: "char-uuid-1"},     │
│         {title: "Cuddles", assetId: "char-uuid-2"}    │
│       ],                                              │
│       dialogue: [  ← Raw LLM output                   │
│         {                                             │
│           speaker: {                                  │
│             type: "CHARACTER",                        │
│             characterName: "Koyal"                    │
│           },                                          │
│           text: "I was on my way to save you...",     │
│           deliverySpeed: "normal",                    │
│           voiceDirection: "apologetic, tired",        │
│           order: 1                                    │
│         },                                            │
│         {                                             │
│           speaker: {                                  │
│             type: "CHARACTER",                        │
│             characterName: "Cuddles"                  │
│           },                                          │
│           text:                                       │
│             "You're a bird. You fly. What traffic??", │
│           deliverySpeed: "fast",                      │
│           voiceDirection: "sarcastic, annoyed",       │
│           order: 2                                    │
│         }                                             │
│       ]                                               │
│     }                                                 │
│   ]                                                   │
│ }                                                     │
└───────────────────────────────────────────────────────┘
        ↓
┌───────────────────────────────────────────────────────┐
│ 3. Backend Processing (workbench.service.ts)          │
│    convertLLMDialogueToSceneDialogue()                │
├───────────────────────────────────────────────────────┤
│ Step 3a: Map character names → assetIds               │
│   characterMap = {                                    │
│     "koyal": "char-uuid-1",                           │
│     "cuddles": "char-uuid-2"                          │
│   }                                                   │
│                                                       │
│ Step 3b: Generate dialogue UUIDs                      │
│   dialogueId = generateSegmentId()  // Backend UUID   │
│                                                       │
│ Step 3c: Add assetId to speakers                      │
│   if (speaker.type === 'CHARACTER') {                 │
│     speaker.assetId = characterMap.get(...)           │
│   }                                                   │
│                                                       │
│ Step 3d: Calculate timing (dialogue.utils.ts)         │
│   recalculateSceneDialogueTimings(dialogues)          │
│   • Loops through dialogues in order                  │
│   • For each dialogue:                                │
│     - Base duration = text.length / CHARS_PER_SECOND  │
│     - Add pauses for punctuation (., ?, !, etc.)      │
│     - startTime = previous.endTime                    │
│     - endTime = startTime + duration                  │
└───────────────────────────────────────────────────────┘
        ↓
┌───────────────────────────────────────────────────────┐
│ 4. Final Scene Data (Saved to MongoDB)                │
├───────────────────────────────────────────────────────┤
│ scene: {                                              │
│   sceneId: "scene-uuid",                              │
│   sceneIndex: 1,                                      │
│   title: "Koyal and Cuddles Reunion",                 │
│   characters: [...],  // AssetRef[]                   │
│   dialogues: [  ← CANONICAL DIALOGUE (SceneDialogue[])│
│     {                                                 │
│       dialogueId: "dlg-abc-123",  ← Backend UUID      │
│       speaker: {                                      │
│         title: "Koyal",                               │
│         type: "CHARACTER",                            │
│         assetId: "char-uuid-1"  ← Mapped              │
│       },                                              │
│       text: "I was on my way to save you...",         │
│       origin: "SCRIPT",                               │
│       timing: {                                       │
│         startTime: 0,        ← Calculated             │
│         endTime: 3.99,       ← Calculated             │
│         deliverySpeed: "normal"                       │
│       },                                              │
│       voiceDirection: "apologetic, tired",            │
│       dialogueIndex: 1                                │
│     },                                                │
│     {                                                 │
│       dialogueId: "dlg-def-456",                      │
│       speaker: {                                      │
│         title: "Cuddles",                             │
│         type: "CHARACTER",                            │
│         assetId: "char-uuid-2"                        │
│       },                                              │
│       text: "You're a bird. You fly. What traffic??", │
│       origin: "SCRIPT",                               │
│       timing: {                                       │
│         startTime: 3.99,     ← Sequential             │
│         endTime: 7.11,                                │
│         deliverySpeed: "fast"                         │
│       },                                              │
│       voiceDirection: "sarcastic, annoyed",           │
│       dialogueIndex: 2                                │
│     }                                                 │
│   ],                                                  │
│   shots: []  ← Empty, shots not generated yet         │
│ }                                                     │
└───────────────────────────────────────────────────────┘

Code References

  • LLM Prompt: src/promptTemplates/workbenchV2/generateScenesV2.template.ts
  • Conversion Logic: src/workbench/workbench.service.ts (convertLLMDialogueToSceneDialogue)
  • Timing Calculation: src/shared/dialogue.utils.ts (recalculateSceneDialogueTimings)
  • UUID Generation: src/shared/dialogue.utils.ts (generateSegmentId)
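
Steps 3a to 3c of the backend processing can be sketched as follows. This is an illustration under stated assumptions, not the code in workbench.service.ts: the names convertSketch, LLMDialogue, SceneDialogueDraft and makeId are hypothetical, makeId stands in for generateSegmentId(), and numbering dialogueIndex from the sorted position is an assumption. Timing (step 3d) is left out.

```typescript
// Sketch of steps 3a-3c. Field names follow the diagrams in this section;
// convertSketch, LLMDialogue, SceneDialogueDraft and makeId are illustrative.
type SpeakerType = 'CHARACTER' | 'NARRATOR' | 'VOICEOVER' | 'OFF_SCREEN';
type DeliverySpeed = 'slow' | 'normal' | 'fast';

interface LLMDialogue {
  speaker: { type: SpeakerType; characterName?: string };
  text: string;
  deliverySpeed: DeliverySpeed;
  voiceDirection: string;
  order: number;
}

interface AssetRef {
  title: string;
  assetId: string;
}

interface SceneDialogueDraft {
  dialogueId: string;
  speaker: { title: string; type: SpeakerType; assetId?: string };
  text: string;
  origin: 'SCRIPT';
  deliverySpeed: DeliverySpeed; // timing itself is computed later (step 3d)
  voiceDirection: string;
  dialogueIndex: number;
}

function convertSketch(
  raw: LLMDialogue[],
  characters: AssetRef[],
  makeId: () => string, // stand-in for generateSegmentId()
): SceneDialogueDraft[] {
  // Step 3a: lower-cased character name -> assetId
  const characterMap = new Map<string, string>();
  for (const c of characters) {
    characterMap.set(c.title.trim().toLowerCase(), c.assetId);
  }

  const ordered = [...raw].sort((a, b) => a.order - b.order);
  return ordered.map((d, i): SceneDialogueDraft => {
    const name = d.speaker.characterName ?? d.speaker.type;
    const speaker: SceneDialogueDraft['speaker'] = { title: name, type: d.speaker.type };
    // Step 3c: only CHARACTER speakers are linked to an asset
    if (d.speaker.type === 'CHARACTER') {
      const assetId = characterMap.get(name.trim().toLowerCase());
      if (assetId !== undefined) speaker.assetId = assetId;
    }
    return {
      dialogueId: makeId(), // Step 3b: backend-generated ID
      speaker,
      text: d.text,
      origin: 'SCRIPT',
      deliverySpeed: d.deliverySpeed,
      voiceDirection: d.voiceDirection,
      dialogueIndex: i + 1, // assumption: 1-based position after sorting by order
    };
  });
}
```

Passing the ID generator in as a parameter keeps the sketch free of imports and makes the output deterministic in tests.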

2. Distribution Flow (Shot Generation)

Trigger

User generates shots (KEYS_ONLY mode) → generateShots()

Step-by-Step Process

User Action: "Generate Key Shots"
        ↓
┌───────────────────────────────────────────────────────┐
│ 1. LLM Call (generateShotsV2.template.ts)             │
├───────────────────────────────────────────────────────┤
│ Input:                                                │
│   • sceneData (with dialogues: SceneDialogue[])       │
│   • storyData                                         │
│   • mode: "KEYS_ONLY"                                 │
│                                                       │
│ Prompt Includes Scene Dialogue:                       │
│   "SCENE DIALOGUE (reference by index):               │
│    [                                                  │
│      {                                                │
│        segmentIndex: 0,  ← LLM uses this              │
│        dialogueIndex: 1,                              │
│        speaker: {...},                                │
│        text: "I was on my way...",                    │
│        timing: {startTime: 0, endTime: 3.99}          │
│      },                                               │
│      {                                                │
│        segmentIndex: 1,                               │
│        dialogueIndex: 2,                              │
│        speaker: {...},                                │
│        text: "You're a bird...",                      │
│        timing: {startTime: 3.99, endTime: 7.11}       │
│      }                                                │
│    ]                                                  │
│                                                       │
│    NOTE: Distribute dialogue across shots.            │
│    Use segmentIndex + portionStart/End.               │
└───────────────────────────────────────────────────────┘
        ↓
┌───────────────────────────────────────────────────────┐
│ 2. LLM Output (Raw JSON)                              │
├───────────────────────────────────────────────────────┤
│ {                                                     │
│   key_shots: [                                        │
│     {                                                 │
│       shotId: "K1",                                   │
│       cameraAngle: "MEDIUM_TWO_SHOT",                 │
│       duration: 4.5,                                  │
│       dialogueDistribution: {  ← Raw LLM output       │
│         segments: [                                   │
│           {                                           │
│             segmentIndex: 0,  ← References scene      │
│             portionStart: 0,   ← Char index           │
│             portionEnd: 66     ← Full segment         │
│           }                                           │
│         ]                                             │
│       }                                               │
│     },                                                │
│     {                                                 │
│       shotId: "K2",                                   │
│       cameraAngle: "CLOSE_UP",                        │
│       duration: 1.2,                                  │
│       dialogueDistribution: { segments: [] }          │
│                               ← No dialogue           │
│     },                                                │
│     {                                                 │
│       shotId: "K3",                                   │
│       cameraAngle: "MEDIUM_SINGLE",                   │
│       duration: 3.5,                                  │
│       dialogueDistribution: {                         │
│         segments: [                                   │
│           {                                           │
│             segmentIndex: 1,                          │
│             portionStart: 0,                          │
│             portionEnd: 38  ← Full segment (38 chars) │
│           }                                           │
│         ]                                             │
│       }                                               │
│     }                                                 │
│   ]                                                   │
│ }                                                     │
└───────────────────────────────────────────────────────┘
        ↓
┌───────────────────────────────────────────────────────┐
│ 3. Backend Processing (testDebug.service.ts /         │
│    workbenchSceneGuru.convertDialogueDistribution)    │
├───────────────────────────────────────────────────────┤
│ For each shot:                                        │
│   For each segment in dialogueDistribution:           │
│     1. Get scene dialogue by segmentIndex             │
│     2. Create ShotDialogue entry via                  │
│        createShotDialogue(sceneDialogue,              │
│          portionStart, portionEnd, startOffset)       │
│                                                       │
│ Remove raw dialogueDistribution from shot             │
└───────────────────────────────────────────────────────┘
        ↓
┌───────────────────────────────────────────────────────┐
│ 4. Final Shot Data (Saved to scene.shots[])           │
├───────────────────────────────────────────────────────┤
│ shots: [                                              │
│   {                                                   │
│     shotId: "K1",                                     │
│     cameraAngle: "MEDIUM_TWO_SHOT",                   │
│     duration: 4.5,                                    │
│     dialogues: [  ← SHOT DIALOGUE (ShotDialogue[])    │
│       {                                               │
│         dialogueId: "dlg-abc-123", ← Links to scene   │
│         speaker: {...},  ← Copied from scene          │
│         portion: {                                    │
│           start: 0,                                   │
│           end: 66,                                    │
│           text: "I was on my way to save you..."      │
│         },                                            │
│         timing: {                                     │
│           startOffset: 0,                             │
│           duration: 3.99                              │
│         },                                            │
│         voiceDirection: "apologetic, tired"           │
│       }                                               │
│     ]                                                 │
│   },                                                  │
│   {                                                   │
│     shotId: "K2",                                     │
│     duration: 1.2,                                    │
│     dialogues: []           ← Visual only             │
│   },                                                  │
│   {                                                   │
│     shotId: "K3",                                     │
│     duration: 3.5,                                    │
│     dialogues: [                                      │
│       {                                               │
│         dialogueId: "dlg-def-456",                    │
│         speaker: {...},                               │
│         portion: {                                    │
│           start: 0,                                   │
│           end: 38,                                    │
│           text:                                       │
│             "You're a bird. You fly. What traffic??"  │
│         },                                            │
│         timing: {                                     │
│           startOffset: 0,                             │
│           duration: 3.12                              │
│         },                                            │
│         voiceDirection: "sarcastic, annoyed"          │
│       }                                               │
│     ]                                                 │
│   }                                                   │
│ ]                                                     │
└───────────────────────────────────────────────────────┘

Key Point

Shot dialogue is DERIVED, not independent:

  • Shots reference scene dialogue via dialogueId + portion.start/end
  • Shots only cache the relevant substring and timing
  • Full dialogue text and canonical timing live on the scene

Code References

  • LLM Prompt: src/promptTemplates/workbenchV2/generateShotsV2.template.ts
  • Shot Generation: src/agenticGuru/workbenchSceneGuru.ts (generateShots)
  • Post-processing: src/testDebug/testDebug.service.ts (postProcessGenerateShots)
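
The backend step that turns the LLM's dialogueDistribution into ShotDialogue entries can be sketched like this. It is a simplified illustration, not the real convertDialogueDistribution: the Lite types and convertDistributionSketch are hypothetical, charsPerSecond is assumed to be resolved from deliverySpeed beforehand, durations ignore punctuation pauses, and clamping out-of-range indices is an assumption about how bad LLM output might be handled.

```typescript
// Hypothetical, trimmed-down shapes for illustration only.
interface SceneDialogueLite {
  dialogueId: string;
  text: string;
  charsPerSecond: number; // assumed to be looked up from deliverySpeed
}

interface RawSegment {
  segmentIndex: number; // position in the scene dialogue array sent to the LLM
  portionStart: number; // character index, inclusive
  portionEnd: number;   // character index, exclusive
}

interface ShotDialogueLite {
  dialogueId: string;
  portion: { start: number; end: number; text: string };
  timing: { startOffset: number; duration: number };
}

function convertDistributionSketch(
  scene: SceneDialogueLite[],
  segments: RawSegment[],
): ShotDialogueLite[] {
  const out: ShotDialogueLite[] = [];
  let startOffset = 0; // seconds into the shot
  for (const seg of segments) {
    const source = scene[seg.segmentIndex];
    if (!source) continue; // the LLM referenced a segment that does not exist

    // Clamp indices to the canonical text so a bad range cannot overrun it.
    const start = Math.max(0, Math.min(seg.portionStart, source.text.length));
    const end = Math.max(start, Math.min(seg.portionEnd, source.text.length));
    if (end === start) continue; // empty portion, nothing to speak

    const duration = (end - start) / source.charsPerSecond;
    out.push({
      dialogueId: source.dialogueId, // link back to the scene dialogue
      portion: { start, end, text: source.text.slice(start, end) },
      timing: { startOffset, duration },
    });
    startOffset += duration; // the next portion in this shot starts after this one
  }
  return out;
}
```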

3. Redistribution Flow (Shot Edits)

Trigger

User adds filler shot → addFillerShot() → Dialogue must redistribute

Step-by-Step Process

User Action: "Add filler shot between K2 and K3"
        ↓
┌───────────────────────────────────────────────────────┐
│ 1. Current State (Before Filler)                      │
├───────────────────────────────────────────────────────┤
│ Scene Dialogue (UNCHANGED - source of truth):         │
│   dialogues: [                                        │
│     {dialogueId: "dlg-abc-123", text: "I was...", ...}│
│     {dialogueId: "dlg-def-456", text: "You're...", ...}
│   ]                                                   │
│                                                       │
│ Shots:                                                │
│   K1 (4.5s): [dlg-abc-123: 0-66 chars]                │
│   K2 (1.2s): [] ← Visual only                         │
│   K3 (3.5s): [dlg-def-456: 0-38 chars]                │
│                                                       │
│ Problem: Total 9.2s, but dialogue is 7.11s            │
│          dlg-def-456 only uses 38/38 chars            │
│          Where does rest of dialogue go?              │
└───────────────────────────────────────────────────────┘
        ↓
┌───────────────────────────────────────────────────────┐
│ 2. Insert Filler Shot (workbenchSceneGuru.ts:1069)    │
├───────────────────────────────────────────────────────┤
│ New Shot Array:                                       │
│   K1 (4.5s)  ← Existing                               │
│   K2 (1.2s)  ← Existing                               │
│   F1 (1.0s)  ← NEW FILLER                             │
│   K3 (3.5s)  ← Existing                               │
│                                                       │
│ Total duration: 10.2s                                 │
│ Dialogue duration: 7.11s                              │
│                                                       │
│ Trigger Redistribution:                               │
│   redistributeDialogueAcrossShots(                    │
│     sceneDialogues: scene.dialogues,                  │
│     shotData: [                                       │
│       {shotId: "K1", timing: 4.5},                    │
│       {shotId: "K2", timing: 1.2},                    │
│       {shotId: "F1", timing: 1.0},  ← NEW             │
│       {shotId: "K3", timing: 3.5}                     │
│     ]                                                 │
│   )                                                   │
└───────────────────────────────────────────────────────┘
        ↓
┌───────────────────────────────────────────────────────┐
│ 3. Redistribution Algorithm (dialogue.utils.ts)       │
├───────────────────────────────────────────────────────┤
│ Input:                                                │
│   • Scene dialogues (canonical dialogue)              │
│   • Shot timings (available time)                     │
│                                                       │
│ Algorithm:                                            │
│   1. Sort dialogues by dialogueIndex                  │
│   2. Initialize: currentShotIndex = 0,                │
│                  currentShotTimeUsed = 0              │
│   3. For each dialogue:                               │
│      a. Walk through its text using character indices │
│      b. For each shot while text remains:             │
│         - Calculate how much text fits in remaining   │
│           shot time (charsPerSecond × time)           │
│         - Find a word boundary for the break point    │
│         - Create ShotDialogue for that portion        │
│         - Update currentShotTimeUsed and index        │
│                                                       │
│ Key Logic:                                            │
│   while (dialogueCharIndex < dialogueText.length      │
│          && currentShotIndex < shots.length) {        │
│     shotTimeRemaining = shot.timing - usedTime        │
│                                                       │
│     if (shotTimeRemaining fits remaining text) {      │
│       // Dialogue fits entirely in this shot          │
│       portionEnd = dialogueText.length                │
│     } else {                                          │
│       // Partial segment, break at word boundary      │
│       charsPerSec = CHARS_PER_SECOND[speed]           │
│       maxChars = shotTimeRemaining * charsPerSec      │
│       portionEnd = findWordBoundary(text, maxChars)   │
│     }                                                 │
│                                                       │
│     createShotDialogue(...)                           │
│     currentShotTimeUsed += portionDuration            │
│   }                                                   │
└───────────────────────────────────────────────────────┘
        ↓
┌───────────────────────────────────────────────────────┐
│ 4. New Shot Dialogue (After Redistribution)           │
├───────────────────────────────────────────────────────┤
│ K1 (4.5s): [dlg-abc-123: 0-66]  ← Full segment        │
│   "I was on my way to save you, but… traffic..."      │
│   Duration: 3.99s, fits in 4.5s ✓                     │
│                                                       │
│ K2 (1.2s): []  ← Visual only, no dialogue             │
│   Filler for reaction/establishing                    │
│                                                       │
│ F1 (1.0s): [dlg-def-456: 0-15]  ← NEW PORTION         │
│   "You're a bird."                                    │
│   Duration: ~0.8s, fits in 1.0s ✓                     │
│                                                       │
│ K3 (3.5s): [dlg-def-456: 15-38]  ← ADJUSTED           │
│   "You fly. What traffic??"                           │
│   Duration: ~2.3s, fits in 3.5s ✓                     │
│                                                       │
│ Result: All dialogue redistributed naturally          │
│         Breaks at word boundaries                     │
│         No gaps or overlaps                           │
└───────────────────────────────────────────────────────┘

Key Points

  • Scene dialogue NEVER changes - it's the source of truth
  • Shot dialogue is recalculated from scene segments + shot timings
  • Algorithm ensures:
      • Linear flow (no dialogue skipped)
      • Word boundary breaks (natural pacing)
      • No overlaps or gaps
      • Respects shot timing constraints
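
The "no gaps, no overlaps, nothing skipped" guarantees can be checked mechanically. The helper below is a hypothetical sketch (findCoverageProblems and Portion are not names from the codebase): it assumes the portions are listed in shot order and that each dialogue's text length is known.

```typescript
// Hypothetical checker for the linear-flow guarantees listed above.
interface Portion {
  dialogueId: string;
  start: number; // character index, inclusive
  end: number;   // character index, exclusive
}

function findCoverageProblems(
  textLengths: Map<string, number>, // dialogueId -> SceneDialogue.text.length
  portionsInShotOrder: Portion[],
): string[] {
  const problems: string[] = [];
  const cursor = new Map<string, number>(); // how far each dialogue is covered so far

  for (const p of portionsInShotOrder) {
    const expected = cursor.get(p.dialogueId) ?? 0;
    if (p.start > expected) {
      problems.push(`gap in ${p.dialogueId} at ${expected}-${p.start}`);
    } else if (p.start < expected) {
      problems.push(`overlap in ${p.dialogueId} at ${p.start}-${expected}`);
    }
    cursor.set(p.dialogueId, Math.max(expected, p.end));
  }

  // Every dialogue must be covered up to the end of its text.
  textLengths.forEach((length, id) => {
    const covered = cursor.get(id) ?? 0;
    if (covered < length) problems.push(`${id} stops at ${covered} of ${length}`);
  });
  return problems;
}
```

For the redistributed example above, the portions 0-15 and 15-38 of dlg-def-456 pass this check because the second portion starts exactly where the first one ends.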

Code References

  • Redistribution Call: src/agenticGuru/workbenchSceneGuru.ts:1069-1095
  • Algorithm: src/shared/dialogue.utils.ts:144-215 (redistributeDialogueAcrossShots)
  • Portion Creation: src/shared/dialogue.utils.ts:98-139 (createShotDialoguePortion)
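
The "Key Logic" pseudocode can be turned into a small runnable sketch. This is not the implementation in dialogue.utils.ts: redistributeSketch and its types are hypothetical, durations use characters per second only (no punctuation pauses), the last shot absorbs any leftover text (an assumption), and no rule for keeping a shot visual-only is modelled, so it will not necessarily reproduce the example layout above.

```typescript
// Hypothetical, simplified shapes for illustration.
interface DialogueIn { dialogueId: string; text: string; dialogueIndex: number; charsPerSecond: number }
interface ShotIn { shotId: string; timing: number } // timing = shot duration in seconds
interface Placed {
  shotId: string; dialogueId: string;
  start: number; end: number; text: string;
  startOffset: number; duration: number;
}

// Largest break index within maxChars of `from` that falls just after whitespace.
// Returns `from` when no whole word fits.
function findWordBoundary(text: string, from: number, maxChars: number): number {
  const limit = Math.min(text.length, from + Math.floor(maxChars));
  if (limit >= text.length) return text.length;
  for (let i = limit; i > from; i--) {
    if (/\s/.test(text.charAt(i - 1))) return i;
  }
  return from;
}

function redistributeSketch(dialogues: DialogueIn[], shots: ShotIn[]): Placed[] {
  const placed: Placed[] = [];
  let shotIndex = 0;
  let usedTime = 0; // seconds already spoken in the current shot
  const ordered = [...dialogues].sort((a, b) => a.dialogueIndex - b.dialogueIndex);

  for (const d of ordered) {
    let charIndex = 0;
    while (charIndex < d.text.length && shotIndex < shots.length) {
      const shot = shots[shotIndex];
      const isLastShot = shotIndex === shots.length - 1;
      const remaining = shot.timing - usedTime;
      const fits = (d.text.length - charIndex) / d.charsPerSecond <= remaining;
      // Assumption: the last shot takes whatever is left so no text is dropped.
      const end = fits || isLastShot
        ? d.text.length
        : findWordBoundary(d.text, charIndex, remaining * d.charsPerSecond);

      if (end === charIndex) { // not even one word fits: move to the next shot
        shotIndex++;
        usedTime = 0;
        continue;
      }
      const duration = (end - charIndex) / d.charsPerSecond;
      placed.push({
        shotId: shot.shotId, dialogueId: d.dialogueId,
        start: charIndex, end, text: d.text.slice(charIndex, end),
        startOffset: usedTime, duration,
      });
      usedTime += duration;
      charIndex = end;
      if (!fits && !isLastShot) { // shot is full up to a word boundary
        shotIndex++;
        usedTime = 0;
      }
    }
  }
  return placed;
}
```

Because the function reads only the scene dialogues and shot timings, it can be rerun after any shot edit without touching the canonical data.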

4. Data Structures

Scene Dialogue (Source of Truth)

// Scene-level dialogue segment (source of truth)
export interface SceneDialogue {
  dialogueId: string;
  speaker: {
    title: string;                  // Character display name
    type: 'CHARACTER' | 'NARRATOR' | 'VOICEOVER' | 'OFF_SCREEN';
    assetId?: string;               // Mapped from scene.characters
  };
  text: string;                     // FULL turn text
  origin: 'SCRIPT' | 'GENERATED';
  timing: {
    startTime: number;             // Seconds from scene start
    endTime: number;               // Calculated from text + speed
    deliverySpeed: 'slow' | 'normal' | 'fast';
  };
  voiceDirection: string;          // TTS guidance
  dialogueIndex: number;           // Sequence (1, 2, 3...)
  audios?: DialogueAudioEntry[];   // Generated TTS audio entries (history + active selection)
  isDeleted?: boolean;             // Soft-delete flag: true means excluded from UI and timing recalcs
}

// Audio generation entry for a dialogue segment
export interface DialogueAudioEntry {
  assetGenJobId: string;           // Reference to AssetGenJob; resolves audioUrl, userId, createdAt
  pitch: 'high' | 'low' | 'none'; // Voice variant used for this generation
  isSelected: boolean;             // Only one entry should be true at a time per dialogue
  text: string;                    // Snapshot of text at generation time (staleness detection)
}
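
Two rules in the comments above lend themselves to small helpers: only one audio entry may be selected per dialogue, and the text snapshot allows staleness detection. The functions below are a hypothetical sketch of how those rules could be applied (selectAudio and isSelectedAudioStale are not names from the codebase); the interface is repeated so the block is self-contained.

```typescript
// Repeated from the definition above so the sketch compiles on its own.
interface DialogueAudioEntry {
  assetGenJobId: string;
  pitch: 'high' | 'low' | 'none';
  isSelected: boolean;
  text: string;
}

// Returns a copy of the history in which exactly the given entry is selected.
// If the ID is unknown, nothing is selected.
function selectAudio(entries: DialogueAudioEntry[], assetGenJobId: string): DialogueAudioEntry[] {
  return entries.map((e) => ({ ...e, isSelected: e.assetGenJobId === assetGenJobId }));
}

// The selected audio is stale when the dialogue text has changed since generation.
function isSelectedAudioStale(currentText: string, entries: DialogueAudioEntry[] = []): boolean {
  const selected = entries.filter((e) => e.isSelected)[0];
  return selected !== undefined && selected.text !== currentText;
}
```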

Shot Dialogue (Derived)

// Shot-level dialogue (references SceneDialogue via dialogueId)
export interface ShotDialogue {
  dialogueId: string; // Reference to SceneDialogue.dialogueId
  speaker: SceneDialogue['speaker']; // Copied from scene for convenience
  portion: {
    start: number;    // Character index in SceneDialogue.text
    end: number;      // Character index in SceneDialogue.text
    text: string;     // Cached substring for display
  };
  timing: {
    startOffset: number; // Seconds into the shot when this dialogue starts
    duration: number;    // How long this dialogue takes in seconds
  };
  voiceDirection: string; // Copied from scene dialogue
}
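
To illustrate how a shot portion relates to its scene dialogue, here is a minimal sketch of deriving a ShotDialogue from a character range. It is not the createShotDialoguePortion function referenced earlier; sketchShotPortion and SceneDialogueLite are hypothetical names. It assumes portion.end is exclusive (as in String.prototype.slice), and the speaker type, voice direction, and timing values in the usage are illustrative only.

```typescript
type SpeakerType = 'CHARACTER' | 'NARRATOR' | 'VOICEOVER' | 'OFF_SCREEN';

interface Speaker {
  title: string;
  type: SpeakerType;
  assetId?: string;
}

// Only the SceneDialogue fields this sketch needs.
interface SceneDialogueLite {
  dialogueId: string;
  speaker: Speaker;
  text: string;
  voiceDirection: string;
}

interface ShotDialogue {
  dialogueId: string;
  speaker: Speaker;
  portion: { start: number; end: number; text: string };
  timing: { startOffset: number; duration: number };
  voiceDirection: string;
}

// Hypothetical helper: builds a shot-level portion from a [start, end) range.
function sketchShotPortion(
  scene: SceneDialogueLite,
  start: number,
  end: number,
  timing: { startOffset: number; duration: number },
): ShotDialogue {
  const validRange =
    Number.isInteger(start) && Number.isInteger(end) &&
    start >= 0 && start < end && end <= scene.text.length;
  if (!validRange) {
    throw new RangeError(`Invalid portion ${start}-${end} for ${scene.dialogueId}`);
  }
  return {
    dialogueId: scene.dialogueId,
    speaker: { ...scene.speaker },          // copied for convenience
    portion: { start, end, text: scene.text.slice(start, end) }, // cached substring
    timing: { ...timing },
    voiceDirection: scene.voiceDirection,   // copied from scene dialogue
  };
}

// Usage with illustrative values.
const sceneLine: SceneDialogueLite = {
  dialogueId: 'seg-def-456',
  speaker: { title: 'CUDDLES', type: 'CHARACTER' },
  text: "You're a bird. You fly. What traffic??",
  voiceDirection: 'incredulous',
};
const secondHalf = sketchShotPortion(sceneLine, 15, 38, { startOffset: 0, duration: 2 });
console.log(secondHalf.portion.text); // "You fly. What traffic??"
```

Because the portion stores indices as well as the cached text, the substring can always be regenerated from the scene dialogue if the cache is suspected to be out of date.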

Timing Constants

// Characters per second for different delivery speeds
const CHARS_PER_SECOND = {
  slow: 12,    // Moody, dramatic, deliberate
  normal: 18,  // Standard conversation
  fast: 26,    // Excited, urgent, quick banter
} as const;

// Pause durations for punctuation (milliseconds)
const PAUSE_DURATIONS_MS = {
  ',': { min: 80, max: 120 },     // Brief pause
  '.': { min: 180, max: 260 },    // Sentence end
  '...': { min: 250, max: 450 },  // Ellipsis, suspense
  '!': { min: 200, max: 320 },    // Exclamation
  '?': { min: 200, max: 320 },    // Question
  '—': { min: 180, max: 300 },    // Em dash
  '\n': { min: 250, max: 500 },   // Line break
} as const;
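
One plausible way to combine these constants into a duration estimate is sketched below. This is an illustration under stated assumptions, not the algorithm in dialogue.utils.ts: the function name estimateDurationSeconds is hypothetical, every character (including punctuation and spaces) is assumed to count toward speaking time, and each pause is assumed to use the midpoint of its min/max range.

```typescript
type DeliverySpeed = 'slow' | 'normal' | 'fast';

// Values copied from the constants above.
const CHARS_PER_SECOND: Record<DeliverySpeed, number> = { slow: 12, normal: 18, fast: 26 };

const PAUSE_DURATIONS_MS = new Map<string, { min: number; max: number }>([
  [',', { min: 80, max: 120 }],
  ['.', { min: 180, max: 260 }],
  ['...', { min: 250, max: 450 }],
  ['!', { min: 200, max: 320 }],
  ['?', { min: 200, max: 320 }],
  ['\u2014', { min: 180, max: 300 }], // em dash
  ['\n', { min: 250, max: 500 }],
]);

// Hypothetical helper: speaking time plus midpoint pauses, in seconds.
function estimateDurationSeconds(text: string, speed: DeliverySpeed): number {
  const speakingSeconds = text.length / CHARS_PER_SECOND[speed];
  let pauseMs = 0;
  let i = 0;
  while (i < text.length) {
    // Match a three-dot ellipsis first so it is not counted as three periods.
    const token = text.startsWith('...', i) ? '...' : text.charAt(i);
    const pause = PAUSE_DURATIONS_MS.get(token);
    if (pause) {
      pauseMs += (pause.min + pause.max) / 2; // assumption: use the midpoint
    }
    i += token.length;
  }
  return speakingSeconds + pauseMs / 1000;
}
```

Under these assumptions, estimateDurationSeconds('Hi.', 'normal') is 3/18 s of speech plus a 220 ms sentence-end pause, roughly 0.39 s. A real implementation might instead sample within the range or scale pauses with delivery speed.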

5. Key Design Principles

✅ Single Source of Truth

  • Scene dialogue is canonical
  • Shot dialogue is derived and recalculable
  • Edits to scene dialogue automatically invalidate shot dialogue

✅ Referential Integrity

  • Shots reference scene dialogue by dialogueId (UUID)
  • Never store dialogue text independently of the scene; portion.text is only a cached substring
  • Always derive from scene

✅ Linear Flow

  • Dialogue flows sequentially across shots
  • No overlaps, no gaps, no skipped segments
  • dialogueIndex field ensures sequence
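
The "no overlaps, no gaps, no skipped segments" rule can be expressed as a check over the portions assigned to each shot. The sketch below is a hypothetical validator (findFlowProblems is not a function from the codebase). It assumes half-open [start, end) character ranges and that shots are supplied in playback order; it does not check ordering between different dialogues.

```typescript
interface PortionRef {
  dialogueId: string;
  start: number;
  end: number;
}

interface SceneText {
  dialogueId: string;
  text: string;
}

// Hypothetical validator: returns human-readable problems, empty when the flow is linear.
function findFlowProblems(scene: SceneText[], shots: PortionRef[][]): string[] {
  const problems: string[] = [];
  const covered = new Map<string, number>(); // next expected start per dialogue

  for (const shot of shots) {
    for (const portion of shot) {
      const expected = covered.get(portion.dialogueId) ?? 0;
      if (portion.start > expected) {
        problems.push(`gap in ${portion.dialogueId}: ${expected}-${portion.start}`);
      } else if (portion.start < expected) {
        problems.push(`overlap in ${portion.dialogueId}: ${portion.start}-${expected}`);
      }
      covered.set(portion.dialogueId, Math.max(expected, portion.end));
    }
  }

  // Every scene dialogue must be covered to the end of its text.
  for (const dialogue of scene) {
    const reached = covered.get(dialogue.dialogueId) ?? 0;
    if (reached < dialogue.text.length) {
      problems.push(`${dialogue.dialogueId} covered to ${reached} of ${dialogue.text.length}`);
    }
  }
  return problems;
}
```

A visual-only shot is simply an empty array in shots, so reaction shots do not register as gaps.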

✅ Timing Independence

  • Scene timing: Based on text + delivery speed
  • Shot timing: Based on shot duration
  • Can recalculate without affecting source

✅ Frontend vs Backend Responsibilities

| Operation | Where | Why |
| --- | --- | --- |
| Dialogue Extraction | Backend (LLM) | Requires AI understanding |
| UUID Generation | Backend | Security + consistency |
| Timing Calculation | Backend | Deterministic, complex |
| Initial Distribution | Backend (LLM) | Cinematic judgment needed |
| Redistribution | Backend | Affects multiple shots |
| Manual Edits | Frontend | User interactivity |
| Validation | Both | Frontend: UX, Backend: Data integrity |

Complete Example

Let's trace a full user journey:

1. User writes script

KOYAL: I was on my way to save you, but… traffic in Ghatkopar was insane.
CUDDLES: You're a bird. You fly. What traffic??

2. Generates scenes

Backend creates:

  • seg-abc-123: "I was on my way..." (0-3.99s, normal delivery)
  • seg-def-456: "You're a bird..." (3.99-7.11s, fast delivery)

3. Generates shots

LLM distributes:

  • K1 (4.5s): Full segment 1 - "I was on my way to save you..."
  • K2 (1.2s): Visual only - Reaction shot
  • K3 (3.5s): Full segment 2 - "You're a bird. You fly. What traffic??"

4. Adds filler F1

Backend redistributes:

  • K1 (4.5s): seg-abc-123 (full) - "I was on my way..."
  • K2 (1.2s): (visual) - Reaction shot
  • F1 (1.0s): seg-def-456 (0-15) - "You're a bird."
  • K3 (3.5s): seg-def-456 (15-38) - "You fly. What traffic??"

Result: Natural pacing, no dialogue lost, cinematically sound! 🎬


Testing

The dialogue system is tested with 36 test cases across 3 fixtures (sdf3, sentinels1, voc5).

See python_eval/tests/test_definitions/dialogue_operations.py for:

  • 7 deterministic tests for dialogue extraction
  • 2 deterministic tests for dialogue distribution
  • 3 G-Eval tests for semantic quality

Current Status: ✅ 100% passing (36/36)


6. Dialogue Audio V2: TTS Generation & Management

Overview

Generates ElevenLabs TTS audio for scene-level dialogue entries. Audio lives at the SceneDialogue level as audios: DialogueAudioEntry[]. Each entry is a lean record: display data (audioUrl, userId, createdAt) is resolved at runtime from the AssetGenJob collection.

API Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /workbench/generateDialogueAudio | Create TTS job + push audio entry onto dialogue |
| PATCH | /workbench/selectDialogueAudio | Toggle isSelected on a specific audio entry |
| PATCH | /workbench/updateDialogueText | Update text + recalculate timings + redistribute to shots |
| PATCH | /workbench/deleteDialogue | Soft-delete dialogue + recalculate timings + redistribute to shots |

Soft Delete

Dialogues use the isDeleted: boolean pattern (matching shots/characters). On deletion:

  1. dialogue.isDeleted = true is set in the DB
  2. Remaining (non-deleted) dialogues get their timings recalculated
  3. Dialogue portions are redistributed across shots
  4. Deleted dialogues remain in the array for data preservation but are filtered out in queries and UI
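
The first two deletion steps and the query-time filter can be sketched in memory as follows. This is an illustration, not the softDeleteDialogue model method: softDeleteAndRetime, activeDialogues, and the Estimator type are hypothetical, and the sketch assumes the remaining dialogues are laid out back to back from the start of the scene. Redistribution across shots (step 3) is out of scope here.

```typescript
type DeliverySpeed = 'slow' | 'normal' | 'fast';

interface TimedDialogue {
  dialogueId: string;
  text: string;
  isDeleted?: boolean;
  timing: { startTime: number; endTime: number; deliverySpeed: DeliverySpeed };
}

// Hypothetical: any function that estimates spoken duration in seconds.
type Estimator = (text: string, speed: DeliverySpeed) => number;

// Marks one dialogue as deleted and recalculates timings for the rest.
// Deleted entries stay in the array so no data is lost.
function softDeleteAndRetime(
  dialogues: TimedDialogue[],
  dialogueId: string,
  estimate: Estimator,
): TimedDialogue[] {
  const result: TimedDialogue[] = [];
  let cursor = 0; // seconds from scene start
  for (const dialogue of dialogues) {
    if (dialogue.isDeleted || dialogue.dialogueId === dialogueId) {
      result.push({ ...dialogue, isDeleted: true }); // timing left untouched
      continue;
    }
    const startTime = cursor;
    cursor += estimate(dialogue.text, dialogue.timing.deliverySpeed);
    result.push({ ...dialogue, timing: { ...dialogue.timing, startTime, endTime: cursor } });
  }
  return result;
}

// Filter applied by queries and the UI.
function activeDialogues(dialogues: TimedDialogue[]): TimedDialogue[] {
  return dialogues.filter((dialogue) => !dialogue.isDeleted);
}
```

Keeping deleted entries in place means an undo only needs to clear the flag and rerun the same timing pass.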

Integration Tests

Located at tests/workbench/dialogueAudio.test.ts, with 19 supertest integration tests covering the select, update text, and delete endpoints.


Core Implementation

  • src/workbench/workbench.service.ts - Scene generation with dialogue + dialogue audio V2 endpoints
  • src/workbench/workbench.controller.ts - Route handlers for dialogue audio V2 (generate, select, update text, delete)
  • src/workbench/workbench.validator.ts - Zod validators for dialogue audio V2 requests
  • src/workbench/workbench.router.ts - Route definitions for dialogue audio V2
  • src/agenticGuru/workbenchSceneGuru.ts - Shot operations with redistribution
  • src/shared/dialogue.utils.ts - Timing and redistribution algorithms
  • src/shared/dialogue.types.ts - TypeScript type definitions (SceneDialogue, DialogueAudioEntry)
  • src/shared/dialogue.constants.ts - Timing constants
  • src/models/storyVideos.model.ts - MongoDB model methods (pushDialogueAudio, selectDialogueAudio, softDeleteDialogue)

Prompt Templates

  • src/promptTemplates/workbenchV2/generateScenesV2.template.ts - Scene dialogue extraction
  • src/promptTemplates/workbenchV2/generateShotsV2.template.ts - Shot dialogue distribution

Testing

  • python_eval/tests/test_definitions/dialogue_operations.py - Test definitions
  • python_eval/tests/test_definitions/utils/dialogue_validation.py - Test functions
  • python_eval/tests/test_definitions/utils/g_eval_metrics.py - G-Eval metrics
  • tests/workbench/dialogueAudio.test.ts - Supertest integration tests for dialogue audio API

Migration

For existing stories with old dialogue format, see src/shared/dialogue.migration.ts for on-demand migration utilities.