mantus.ai

Add advanced features when the foundation is solid

Building branching workflows, live audio narration, and Miro integration on proven architecture.

Once the core capture-and-document loop worked reliably, the real features could start. This is the phase where product intuition matters most—knowing which capabilities will multiply the tool's value versus which ones are just nice-to-have.

The foundation from Step 4 gave us three things that made advanced features possible: a clean data model (WorkflowStep[]), reliable browser automation, and generated outputs that could be enhanced without breaking. Each new capability built on what already worked.

Branching workflows: when linear isn't enough

The first advanced feature request came from real usage: "Sometimes there might be two clickable options I want to get into the same workflow. Is this possible?"

The existing linear model couldn't represent this. Each step had at most one predecessor and one successor. A Y-fork where users could take path A or path B required a graph structure, not a simple array.

The solution was elegant: refactor the data model to support graphs, but keep linear workflows as the default case. Most captures would still be straight-line flows. The branching capability would be there when needed.

Three new concepts handled this:

Graph representation. Convert each captured flow into nodes and edges. A linear workflow becomes a degenerate graph—each step connects to the next one. Branching workflows share a common prefix, then diverge.

Automatic merging. Instead of forcing users to manually author graph files, detect shared prefixes automatically. When you capture two flows that start the same way (same URL, same clicks), merge them at the divergence point. The first flow becomes the main path. The second becomes a branch.

Spatial layout. Main flow stays at y=0. Branches get their own lanes at y=-260, y=+260, y=-520, etc. The Miro export positions shapes accordingly, creating a readable flowchart.
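To make the graph representation and the automatic merging concrete, here is a minimal sketch. The field names on WorkflowStep, the node id scheme, and the helper names (toGraph, mergeBranch) are assumptions for illustration, not the tool's actual code. It treats two steps as identical when URL and action match, which is one plausible reading of "same URL, same clicks".

```typescript
// Hypothetical shapes for illustration; the real WorkflowStep likely has more fields.
interface WorkflowStep {
  url: string;
  action: string; // e.g. "click #submit"
}

interface FlowNode {
  id: string;
  step: WorkflowStep;
  lane: number; // 0 = main flow, 1..n = branches in capture order
}

interface FlowEdge {
  from: string;
  to: string;
}

interface FlowGraph {
  nodes: FlowNode[];
  edges: FlowEdge[];
}

// Assumption: two steps are "the same" when URL and action both match.
function sameStep(a: WorkflowStep, b: WorkflowStep): boolean {
  return a.url === b.url && a.action === b.action;
}

// A linear capture becomes a chain: each node points at the next one.
function toGraph(steps: WorkflowStep[]): FlowGraph {
  const nodes: FlowNode[] = steps.map((step, i) => ({ id: `m${i}`, step, lane: 0 }));
  const edges: FlowEdge[] = [];
  for (let i = 1; i < nodes.length; i++) {
    edges.push({ from: `m${i - 1}`, to: `m${i}` });
  }
  return { nodes, edges };
}

// Number of leading steps that two captures have in common.
function sharedPrefixLength(a: WorkflowStep[], b: WorkflowStep[]): number {
  let count = 0;
  for (const step of a) {
    const other = b[count];
    if (other === undefined || !sameStep(step, other)) break;
    count++;
  }
  return count;
}

// Merge a second capture into the main one at the first step that differs.
function mergeBranch(main: WorkflowStep[], branch: WorkflowStep[], lane = 1): FlowGraph {
  const graph = toGraph(main);
  const shared = sharedPrefixLength(main, branch);
  // Without a common prefix there is no fork point to attach the branch to.
  if (shared === 0) return graph;
  let previous = `m${shared - 1}`;
  branch.slice(shared).forEach((step, i) => {
    const id = `b${lane}-${i}`;
    graph.nodes.push({ id, step, lane });
    graph.edges.push({ from: previous, to: id });
    previous = id;
  });
  return graph;
}
```

Merging two login captures that share their first steps produces a single fork: the last shared node ends up with two outgoing edges, and the shared steps appear only once.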

Testing this with a real branching scenario—two different paths through the same login workflow—worked on the first try. The shared login steps appeared once. The divergence showed as a Y-fork with proper lane spacing.
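The lane spacing can be expressed as a small function. This is a sketch under one assumption: the sequence y=-260, y=+260, y=-520 continues by alternating sides and stepping outward by 260 each time. The name laneY and the lane numbering (0 for the main flow, then branches in capture order) are hypothetical.

```typescript
const LANE_SPACING = 260;

// Lane 0 is the main flow; branches alternate above (negative y) and
// below (positive y), moving one spacing further out every two lanes.
function laneY(lane: number): number {
  if (lane === 0) return 0;
  const distance = Math.ceil(lane / 2) * LANE_SPACING;
  return lane % 2 === 1 ? -distance : distance;
}
```

With this numbering, lanes 0 through 3 land at 0, -260, +260, and -520, matching the layout described above.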

Live audio narration: documentation while you demonstrate

The next capability jumped straight to voice. Instead of writing documentation after capturing a workflow, what if you could narrate while you demonstrate?

The first design was wrong: a separate slide deck interface where you'd view screenshots and record audio after the fact. This assumed that "driving the UI and explaining simultaneously is hard."

That assumption was backwards. PMs and designers do this constantly: Loom walkthroughs, demo recordings, user interviews. Explaining while clicking comes naturally; it isn't a burden.

The correct design: start audio recording when you press Enter to begin capturing. Each click becomes a timestamp. When you finish with Ctrl+C, slice the master audio file into per-step segments.
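Turning click timestamps into per-step segments might look like the sketch below. It assumes each step's narration runs from its own click until the next click, with the last step ending when capture stops. The post does not say which side of the click the narration belongs to, so treat that boundary choice, and the names used here, as assumptions.

```typescript
interface AudioSegment {
  step: number; // 1-based step number
  startSec: number;
  endSec: number;
}

// clickTimesSec: seconds since recording started, one entry per captured click.
// stopSec: the moment capture ended (Ctrl+C), also in seconds.
function segmentsFromClicks(clickTimesSec: number[], stopSec: number): AudioSegment[] {
  return clickTimesSec.map((startSec, i) => {
    const next = clickTimesSec[i + 1];
    return { step: i + 1, startSec, endSec: next === undefined ? stopSec : next };
  });
}
```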

Technical choice: Node + ffmpeg, not browser-based recording. Browser MediaRecorder dies on page navigation, and the captured site might reload several times during a workflow. An ffmpeg subprocess survives all of that and produces clean WebM files that play back in modern browsers.

The implementation was straightforward: spawn ffmpeg -f avfoundation -i ":0" when recording starts, push each click's timestamp onto an array, then slice the master file with ffmpeg -ss/-to for each step. The audio files get embedded in both the Markdown README and the HTML site as native <audio controls> elements.
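As a rough illustration of the two ffmpeg invocations, the helpers below only build argument arrays, so they can be inspected without running ffmpeg. The function names and file names are hypothetical. The -y (overwrite) and -c copy (stream copy, no re-encode) flags are additions assumed here for convenience, not something the original implementation is known to use.

```typescript
// Arguments for the long-running recording process.
// inputSpec is the avfoundation device spec, e.g. ":0" for audio device 0.
function recordArgs(inputSpec: string, masterPath: string): string[] {
  return ["-f", "avfoundation", "-i", inputSpec, masterPath];
}

// Arguments for cutting one step's audio out of the master recording.
// -ss and -to are given in seconds, relative to the start of the master file.
function sliceArgs(masterPath: string, startSec: number, endSec: number, outPath: string): string[] {
  return [
    "-y",
    "-i", masterPath,
    "-ss", startSec.toFixed(3),
    "-to", endSec.toFixed(3),
    "-c", "copy",
    outPath,
  ];
}
```

In a Node process, either array could be handed to child_process.spawn("ffmpeg", args).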

One gotcha: macOS mic device detection. Using :0 (device index 0) often picked up AirPods or iPhone mics instead of the actual default input. The fix queries system_profiler SPAudioDataType to find the system default, then matches it against avfoundation's device list.
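The matching half of that fix could look like the following sketch. It assumes the default input's name has already been read from system_profiler, and that the device listing resembles what ffmpeg -f avfoundation -list_devices true -i "" typically prints, with bracketed indices under an "AVFoundation audio devices:" header. The exact listing format can vary between ffmpeg versions, and the function names are hypothetical.

```typescript
interface AudioDevice {
  index: number;
  label: string;
}

// Pull the "[index] name" entries out of the audio section of the listing.
function parseAudioDevices(listing: string): AudioDevice[] {
  const devices: AudioDevice[] = [];
  let inAudioSection = false;
  for (const line of listing.split(/\r?\n/)) {
    if (line.indexOf("AVFoundation audio devices") !== -1) {
      inAudioSection = true;
      continue;
    }
    if (line.indexOf("AVFoundation video devices") !== -1) {
      inAudioSection = false;
      continue;
    }
    if (!inAudioSection) continue;
    // Matches the tail of e.g. "[AVFoundation indev @ 0x7f] [1] MacBook Pro Microphone".
    const m = /\]\s+\[(\d+)\]\s+(.+)$/.exec(line);
    if (m && m[1] !== undefined && m[2] !== undefined) {
      devices.push({ index: parseInt(m[1], 10), label: m[2].trim() });
    }
  }
  return devices;
}

// Return the avfoundation input spec for the system default microphone,
// falling back to ":0" when the name cannot be matched.
function inputSpecFor(defaultDeviceName: string, listing: string): string {
  const wanted = defaultDeviceName.trim().toLowerCase();
  for (const device of parseAudioDevices(listing)) {
    if (device.label.toLowerCase() === wanted) return `:${device.index}`;
  }
  return ":0";
}
```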

Swedish Whisper transcription: from voice to text

Audio narration was useful, but text remained the primary documentation format. The next step added automatic transcription using a Swedish-tuned Whisper model.

Architecture: a long-lived Python subprocess communicating over JSON lines. Load the model once (KBLab/kb-whisper-large), then process audio files one by one. This saved 10-15 seconds of model loading per file, the difference between 20 seconds total and 2 minutes for a multi-step workflow.
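On the Node side, talking JSON lines to a long-lived worker mostly comes down to framing: one JSON object per line out, and reassembling lines from arbitrary stdout chunks on the way back. The sketch below shows only that framing. The message shapes (id, audioPath, text) and the class name are assumptions, not the tool's actual protocol.

```typescript
// Hypothetical message shapes for illustration.
interface TranscribeRequest {
  id: number;
  audioPath: string;
}

interface TranscribeResponse {
  id: number;
  text: string;
}

// One request per line; the worker reads lines from its stdin.
function encodeRequest(request: TranscribeRequest): string {
  return JSON.stringify(request) + "\n";
}

// Stdout arrives in arbitrary chunks, so partial lines are buffered
// until the terminating newline shows up.
class JsonLineDecoder {
  private buffer = "";

  // Feed a raw chunk; returns every message completed by this chunk.
  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an unfinished line, or "" if the chunk ended on "\n".
    this.buffer = lines.pop() ?? "";
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line));
  }
}
```

Because the model stays loaded in the worker, each additional file costs only a write of encodeRequest(...) and a wait for the matching id to come back.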

Idempotency through fingerprinting. Each transcription records the audio file's modification time and size alongside the transcript. Re-running flowdoc transcribe skips files whose fingerprint is unchanged. Re-record one step in a fresh capture session, and only that step gets re-transcribed.
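The skip check itself is small. In this sketch the fingerprint fields mirror what Node's fs.statSync returns (mtimeMs and size). The record shape, the cache keyed by file path, and the function names are assumptions for illustration.

```typescript
interface Fingerprint {
  mtimeMs: number; // modification time, as in fs.Stats.mtimeMs
  size: number;    // file size in bytes, as in fs.Stats.size
}

// Hypothetical stored record: the transcript plus the fingerprint it was made from.
interface TranscriptRecord {
  fingerprint: Fingerprint;
  text: string;
}

// A file needs (re-)transcription when there is no record yet
// or when its modification time or size has changed.
function needsTranscription(current: Fingerprint, previous?: TranscriptRecord): boolean {
  if (!previous) return true;
  return (
    previous.fingerprint.mtimeMs !== current.mtimeMs ||
    previous.fingerprint.size !== current.size
  );
}

// Filter a set of audio files down to the ones that actually need work.
function filesToTranscribe(
  current: Record<string, Fingerprint>,
  cache: Record<string, TranscriptRecord>
): string[] {
  return Object.keys(current).filter((path) => {
    const fingerprint = current[path];
    return fingerprint !== undefined && needsTranscription(fingerprint, cache[path]);
  });
}
```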

The model quality was impressive: technical UI terms, Swedish student names, and conversational phrases were all transcribed accurately. The output appeared in both README.md (as blockquotes under each step) and the HTML site (inline with the audio players).

Miro integration: from documentation to collaboration

The final advanced feature connected to external tools. Miro boards are where product teams actually collaborate. Instead of keeping workflow documentation in isolated files, what if captured flows could become native Miro shapes?

The flowdoc miro command reads a captured workflow and creates rounded rectangles on a Miro board: one shape per step, connected with labeled arrows. No screenshots, no two-way sync, just the structural flow as editable shapes.

Brand styling made it real. Generic blue rectangles looked like a developer tool. Mapping to proper UX flowchart symbols—yellow start circles, blue action rectangles, light blue pages, green decision diamonds—made it something a product team would actually use.

The implementation was pure REST API calls using Node's built-in fetch. Create shapes at calculated positions (x = step * 450), collect the returned shape IDs, then create connectors between adjacent pairs. Rate limiting, error handling, and shape styling were the main complexity.
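The positioning and pairing logic can be sketched without touching the network. The payload field names below follow my understanding of Miro's REST API v2 (shapes and connectors created via POST under /v2/boards/{board_id}/), and should be checked against the current Miro documentation before use. The function names and example ids are hypothetical.

```typescript
// Assumed request body for creating a shape item.
interface ShapePayload {
  data: { content: string; shape: string };
  position: { x: number; y: number };
}

// Assumed request body for creating a connector between two items.
interface ConnectorPayload {
  startItem: { id: string };
  endItem: { id: string };
}

const STEP_SPACING_X = 450;

// One rounded rectangle per step, laid out left to right on the main lane.
function shapePayloads(labels: string[]): ShapePayload[] {
  return labels.map((content, step) => ({
    data: { content, shape: "round_rectangle" },
    position: { x: step * STEP_SPACING_X, y: 0 },
  }));
}

// Given the ids returned by the shape-creation calls (in step order),
// connect each shape to the one that follows it.
function connectorPayloads(shapeIds: string[]): ConnectorPayload[] {
  const connectors: ConnectorPayload[] = [];
  shapeIds.forEach((id, i) => {
    const next = shapeIds[i + 1];
    if (next !== undefined) {
      connectors.push({ startItem: { id }, endItem: { id: next } });
    }
  });
  return connectors;
}
```

Each payload would then be sent with fetch as a JSON body, one request at a time, which is where the rate limiting and error handling come in.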

One detail that mattered: branch workflows automatically used the right symbols. Fork points (nodes with multiple outgoing edges) became green diamonds. The visual language matched what product teams already knew.
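Choosing a symbol from the graph structure might look like this sketch. Fork detection follows the rule above (more than one outgoing edge). Treating a node with no incoming edges as the start, and carrying a page/action kind on each node, are assumptions made for the example. The colour names stand in for whatever brand values the tool actually uses.

```typescript
type SymbolKind = "start" | "decision" | "page" | "action";

// Hypothetical node and edge shapes for illustration.
interface GraphNode {
  id: string;
  kind: "page" | "action";
}

interface GraphEdge {
  from: string;
  to: string;
}

// Placeholder colour names matching the palette described in the text.
const SYMBOL_COLORS: Record<SymbolKind, string> = {
  start: "yellow",
  decision: "green",
  page: "light blue",
  action: "blue",
};

function symbolFor(node: GraphNode, edges: GraphEdge[]): SymbolKind {
  const outgoing = edges.filter((edge) => edge.from === node.id).length;
  const incoming = edges.filter((edge) => edge.to === node.id).length;
  // Fork points: more than one way to leave this node.
  if (outgoing > 1) return "decision";
  // Assumption: the entry point is the node nothing leads into.
  if (incoming === 0) return "start";
  return node.kind;
}
```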

Why this order worked

Each advanced feature built on the previous foundation:

Branching required reliable capture. The graph model only worked because the linear workflow was already solid. Shaky core mechanics would have made branching impossible to debug.

Audio needed stable shutdown. Live recording during capture meant the subprocess cleanup had to be bulletproof. Earlier sessions had already solved the browser close and ffmpeg termination edge cases.

Transcription leveraged audio architecture. The "load model once, stream requests" pattern reused the subprocess communication approach from audio recording.

Miro export used the graph model. Converting workflows to positioned shapes was trivial once the data was already structured as nodes and edges.

The capabilities accumulated naturally. By the end, a single flowdoc capture session produced six different outputs: Markdown README, HTML site, Mermaid flowchart, audio files, transcriptions, and Miro-ready data. Each output served a different use case, but they all came from the same source capture.

The lesson: build advanced features on proven architecture. Don't add complexity until the foundation can support it cleanly.