When n8n stops being the right tool
A post-mortem on rewriting an inbound-email automation pipeline that had grown past the point where a visual workflow was helping.
TL;DR
I had an n8n workflow that ingested messy, human-written emails — multiple attachment types, multiple intents, lots of edge cases — classified them, and wrote to a few downstream systems. It worked, until it didn't. The failures weren't bugs I could spot-fix; they were structural consequences of building a judgment-heavy pipeline inside a visual workflow tool. I rewrote it as a small Python boundary plus a Claude Code orchestrator. Same inputs, same outputs, a tenth of the moving parts, and every failure mode is now mechanically prevented instead of "we'll watch for it." This is what I learned.
The setup
Most of the interesting work I do at the moment looks like this: an inbox somewhere receives emails from humans. The emails are inconsistent — different formats, different intents, sometimes attachments, sometimes not, sometimes language A, sometimes language B. The job of the pipeline is to understand each email and route it correctly: extract structured data, file documents in the right place, update records in the right system, surface ambiguity to a human when it can't decide.
The original implementation was a 53-node n8n workflow. IMAP trigger → branch on attachment type → AI extraction node → enrichment via an HTTP API → classifier → API writes → notification. Three side-flows on a VPS handled subroutines that didn't fit cleanly in n8n. It ran on cron, looked impressive on the canvas, and shipped a working v1 in about two weeks.
For about a month, it was great.
Where it stopped being great
Six recurring failure modes, in roughly the order they showed up:
1. Duplicates on re-mention
A record already in the destination system would get a second copy on the next mention of the same entity. The workflow had a "look up existing record" step, but it was a fuzzy name match — fine when the canonical identifier (a URL, a domain, an email) was extractable, brittle when it wasn't. The fallback path collapsed to "create new," and the duplicate slipped through. You only noticed when a human opened the destination system and saw two of the same thing side by side.
2. Silent drops in the LLM extraction step
When the input was long, the LLM extraction node would return truncated JSON. The downstream parser tolerated the truncation gracefully — it parsed what was parseable and discarded the rest. From the user's point of view, items they could see in the input simply did not appear in the output. There was no log of what was dropped or why.
3. Overconfident defaults on uncertain classifications
The classifier had to pick one of several routes. When it was uncertain, it defaulted to the most common one. That default was the wrong direction: the most common route was also the most expensive one to get wrong (it wrote to a production record store). The correct default for an uncertain classifier is "ask a human," not "guess and write."
4. Silent auth failures on downstream calls
Some of the downstream actions hit third-party endpoints that occasionally required interactive authentication (a one-time password, a token refresh). When those gates were hit, the action failed and the workflow moved on without surfacing a clear signal. Nobody noticed until someone asked for the artifact that should have been produced.
5. No way for a human to disambiguate
The single biggest source of user friction wasn't a bug; it was an absence. When the agent genuinely couldn't decide (multiple plausible entities in one message, missing context, low confidence), there was no channel for the user to answer back. The agent had no inbound surface for disambiguation. Ambiguous emails landed in a "Failed" folder and stayed there until someone reviewed them manually.
6. No replay, no fixtures
Every fix had to be tested by sending a real email through the live pipeline and watching what happened. There was no way to say "here is the exact message that broke; rerun it locally against my fix and tell me if it does the right thing now." Engineering velocity on a pipeline you can't replay is approximately zero.
This isn't an n8n problem
Pause here, because the obvious read of this list is "n8n is bad." It isn't. n8n is genuinely the right tool for a large class of automation problems. Three structural mismatches made it the wrong tool for this one:
The graph stopped being a useful representation. Most of the interesting logic was inside Function nodes containing JavaScript. The 53-node canvas was a façade over ~600 lines of imperative code spread across two dozen invisible-on-first-glance chunks. Once you're maintaining hundreds of lines of code split across hidden boxes, the visual workflow has become a liability rather than an asset. You're paying the cost of a visual tool without getting the benefit.
State lived in the wrong place. The workflow trusted IMAP protocol flags to decide what was new. Any human reading the inbox could flip those flags and silently change the workflow's behaviour. There was no persistent, workflow-owned idempotency layer. Each external write was independently idempotent (or not), depending on which integration node you used.
There was no orchestrator. "Orchestrator" in the agent sense — something that holds the lifecycle of a single unit of work, decides what to do next based on outcomes, and surfaces structured status. n8n nodes pass data downstream; they don't reason about whether the dispatched action succeeded. When something failed mid-graph, the workflow shrugged and moved on.
The diagnosis: this workload needed (a) a real orchestrator with persistent state and replay, (b) honest uncertainty handling with an "I don't know" output, and (c) the ability to ask a human for help inline. None of those are n8n strengths. All of them are basic features of a code-first agent harness.
What I replaced it with
Two processes, one decision point:
                  [ inbound poller (Python) ]
                               │
                               ▼
          materialize raw + parsed + signals to disk
                               │
                               ▼
           [ claude -p "/process-event <basePath>" ]
                               │
                               ▼
classifier skill → { route, confidence, reason, signals_used }
                               │
                               ▼
             hard safety rails (Python, post-LLM)
                               │
                               ▼
       ┌─────────────┬───────────────┬──────────────────┐
       │             │               │                  │
    route A       route B         route C        NeedsAttention
       │             │               │                  │
       ▼             ▼               ▼                  ▼
       …             …               …            ask the human
Concretely:
- A deterministic Python boundary runs on cron. No LLM, no judgment. It fetches new messages, writes each one to inbox/<id>/ as raw artifact + parsed JSON + an atts/ directory, computes a signals dict (allow-list match, attachment types, registry lookups, keyword hits, structural counts), and records the message ID in SQLite for idempotency.
- For each new message it spawns a single claude -p "/process-event <basePath>" invocation. That's the orchestrator. Claude reads the materialized files and runs a classifier skill, which outputs { route, confidence, reason, signals_used }.
- Hard safety rails run after the LLM, in Python. They check confidence threshold, sender allow-list, and sanity between claimed route and the artifacts on disk. Anything that fails the rails is forced to unknown and surfaced for human attention.
- For each accepted route, a route-handler skill wraps the actual work. Skills print a final JSON status line; the orchestrator parses it and updates external state (SQLite + queue label).
Total: ~3000 lines of Python, five skills, one slash command, one cron line.
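To make the boundary step concrete, here is a minimal sketch of what "materialize to disk" could look like using only the standard library. It is not the production code: the function name materialize, the file names raw.eml, parsed.json and signals.json, the allow-list contents and the three signal keys are all illustrative assumptions, and the real signals dict carries more than this.

```python
import json
from email import message_from_bytes, policy
from email.utils import parseaddr
from pathlib import Path

ALLOW_LIST = {"alice@example.com"}  # hypothetical sender allow-list


def materialize(raw: bytes, inbox: Path, msg_id: str) -> Path:
    """Write raw bytes, parsed JSON, attachments and signals under inbox/<msg_id>/."""
    msg = message_from_bytes(raw, policy=policy.default)
    base = inbox / msg_id
    atts = base / "atts"
    atts.mkdir(parents=True, exist_ok=True)
    (base / "raw.eml").write_bytes(raw)

    names = []
    for i, part in enumerate(msg.iter_attachments()):
        # Keep only the last path component of the sender-supplied filename.
        name = Path(part.get_filename() or "").name
        if name in ("", ".."):
            name = f"attachment-{i}"
        (atts / name).write_bytes(part.get_payload(decode=True) or b"")
        names.append(name)

    body = msg.get_body(preferencelist=("plain",))
    parsed = {
        "from": str(msg["From"] or ""),
        "subject": str(msg["Subject"] or ""),
        "message_id": str(msg["Message-ID"] or ""),
        "text": body.get_content() if body else "",
        "attachments": names,
    }
    # Signals are computed in plain code, so they are identical on every replay.
    signals = {
        "sender_allowed": parseaddr(parsed["from"])[1].lower() in ALLOW_LIST,
        "attachment_types": sorted({Path(n).suffix.lower() for n in names}),
        "attachment_count": len(names),
    }
    (base / "parsed.json").write_text(json.dumps(parsed, indent=2))
    (base / "signals.json").write_text(json.dumps(signals, indent=2))
    return base
```

The returned path plays the role of <basePath> in the claude -p invocation: everything the orchestrator needs is already on disk before any model is called.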
How each failure mode is mechanically prevented
This is the part that matters. A rewrite that swaps tools but reproduces the same failures is theatre. Each issue above maps to a specific architectural decision:
Duplicates: deterministic external_id
Every write derives a stable external_id from canonical fields with an ordered fallback chain — primary identifier (URL or domain) → secondary identifier → slugified name. The write call uses an upsert keyed on that ID, so re-running the same input — by accident, replay, or because a flag flipped — updates the existing record instead of creating a duplicate. The "missing primary identifier" edge case still degrades gracefully because the fallback chain produces a deterministic ID for the same input either way.
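A rough sketch of that fallback chain and the upsert, under some loud assumptions: I use an email address as the "secondary identifier", the ID prefixes are made up, and an in-process SQLite table stands in for the real destination system, which is an external API in the actual pipeline.

```python
import re
import sqlite3
from urllib.parse import urlsplit


def external_id(url: str = "", email: str = "", name: str = "") -> str:
    """Derive a stable ID: domain first, then email, then slugified name."""
    if url:
        # Accept bare domains as well as full URLs.
        host = urlsplit(url if "//" in url else "//" + url).hostname or ""
        if host:
            return "domain:" + host.removeprefix("www.")
    if email:
        return "email:" + email.strip().lower()
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    if not slug:
        raise ValueError("no usable identifier; route to a human instead")
    return "name:" + slug


def upsert(db: sqlite3.Connection, ext_id: str, payload: str) -> None:
    """Insert the record, or update it in place when the ID already exists."""
    db.execute(
        "INSERT INTO records (external_id, payload) VALUES (?, ?) "
        "ON CONFLICT(external_id) DO UPDATE SET payload = excluded.payload",
        (ext_id, payload),
    )
```

The point of the design is that the same input yields the same key on every run, so a replay can only ever update a record, never add a second one.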
Silent drops: enumerate every line, log every skip
The extraction step writes one line to an extract.jsonl audit log per unit of input it sees, parsed or dropped. Dropped lines carry a skip_reason. Ten items in, ten items in the log: even if the downstream classifier chooses to skip some, the skip reason is auditable. A daily digest surfaces unusual drop ratios.
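The "one log line per input item" rule is easy to state and easy to get wrong, so here is a small sketch of it. The function name, the entry fields other than skip_reason, and the injected parse callable are assumptions for illustration; the real extraction step is an LLM skill, not a Python callable.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, Optional


def extract_with_audit(
    items: Iterable[str],
    parse: Callable[[str], Optional[dict]],
    log_path: Path,
) -> list[dict]:
    """Parse each item and append exactly one audit line per item, kept or skipped."""
    kept = []
    with log_path.open("a", encoding="utf-8") as log:
        for index, item in enumerate(items):
            entry = {"index": index, "input": item[:200]}
            try:
                result = parse(item)
                if result is None:
                    entry.update(status="skipped", skip_reason="parser returned nothing")
                else:
                    entry.update(status="parsed")
                    kept.append(result)
            except Exception as exc:  # a failed item must still leave a trace
                entry.update(status="skipped", skip_reason=f"{type(exc).__name__}: {exc}")
            log.write(json.dumps(entry) + "\n")
    return kept
```

Because the log write sits outside the try block, a truncated or malformed item can no longer vanish: it costs one line with a reason attached.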
Overconfident defaults: unknown is a first-class output
The classifier rubric defines confidence ≥ 0.95 for "multiple strong signals agree," 0.85–0.94 for "one strong signal, no contradictions," and anything below 0.85 → unknown regardless of best guess. The post-LLM rails enforce the threshold even if the LLM tries to ignore it. Default behaviour under uncertainty is "ping the human," not "guess and write to production."
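As a sketch of what "rails enforce the threshold" can mean in practice, the check below runs on the classifier's output after the LLM has finished. The Verdict type, the route names and the sender_allowed signal key are placeholders of mine; only the 0.85 threshold and the forced unknown route come from the rubric above. The sanity check against artifacts on disk is left out to keep it short.

```python
from dataclasses import dataclass, replace

THRESHOLD = 0.85
KNOWN_ROUTES = {"route_a", "route_b", "route_c"}  # placeholder route names


@dataclass(frozen=True)
class Verdict:
    route: str
    confidence: float
    reason: str


def apply_rails(verdict: Verdict, signals: dict) -> Verdict:
    """Force the route to 'unknown' unless every hard check passes."""
    failures = []
    if verdict.route not in KNOWN_ROUTES:
        failures.append(f"unrecognised route {verdict.route!r}")
    # Also rejects NaN and values above 1.0, which a model can emit.
    if not THRESHOLD <= verdict.confidence <= 1.0:
        failures.append(f"confidence {verdict.confidence} outside [{THRESHOLD}, 1.0]")
    if not signals.get("sender_allowed", False):
        failures.append("sender not on allow-list")
    if failures:
        return replace(verdict, route="unknown", reason="; ".join(failures))
    return verdict
```

The model never gets a vote on whether the rails apply; it can only produce a verdict that either passes them or is downgraded.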
Silent auth failures: needs-attention is structured state
When a downstream call hits an interactive gate it cannot resolve, the route returns needs_attention with a structured reason field. The message is moved to a dedicated queue, the daily digest flags it explicitly, and the orchestrator records the gate in SQLite so the next replay knows what to ask for.
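One way to get that behaviour is to give interactive gates their own exception type and convert it to a status at a single choke point. This is a sketch of the idea, not the actual handler code: InteractiveGate, run_route and the status field names other than needs_attention and reason are hypothetical, and the SQLite write is omitted.

```python
from typing import Callable


class InteractiveGate(Exception):
    """Raised by a route handler when a call needs input only a human can give."""

    def __init__(self, gate: str, detail: str):
        super().__init__(f"{gate}: {detail}")
        self.gate = gate
        self.detail = detail


def run_route(message_id: str, handler: Callable[[], None]) -> dict:
    """Run a handler and always return a structured status, never a bare failure."""
    try:
        handler()
        return {"message_id": message_id, "status": "processed"}
    except InteractiveGate as exc:
        return {
            "message_id": message_id,
            "status": "needs_attention",
            "reason": {"gate": exc.gate, "detail": exc.detail},
        }
    except Exception as exc:
        # Anything unexpected is still reported, with the error attached.
        return {"message_id": message_id, "status": "failed", "error": repr(exc)}
```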
No human-in-the-loop: reply-to-act
For ambiguous messages, the agent replies to the original sender with a clear ask: "Reply with <option A>, <option B>, <option C> or <param>: <value> to continue." The sender replies inline. The inbound poller picks up the reply, matches it via In-Reply-To to the original message's base path, parses the command, and feeds it back into the orchestrator as a forced route or a stored parameter. Subject lines and email threading do the matching — no separate portal, no link tracking. This was the single change that flipped the user experience from "the bot is unreliable" to "the bot is helpful and occasionally asks for confirmation."
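The matching half of that loop can be sketched with the standard-library email parser. Everything named here is an assumption for illustration: the pending mapping (Message-ID of the question the bot sent, mapped to the original base path), the option names, and the rule that only the first unquoted line of the reply counts as the command.

```python
import re
from email import message_from_string, policy
from typing import Optional

OPTIONS = {"a", "b", "c"}  # placeholder option names


def parse_reply(raw: str, pending: dict[str, str]) -> Optional[tuple[str, dict]]:
    """Return (base_path, command) for a reply to a pending question, else None."""
    msg = message_from_string(raw, policy=policy.default)
    parent = str(msg["In-Reply-To"] or "").strip()
    base_path = pending.get(parent)
    if base_path is None:
        return None  # not a reply to anything the bot asked
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body else ""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blanks and quoted text
        if line.lower() in OPTIONS:
            return base_path, {"route": line.lower()}
        match = re.fullmatch(r"(\w+)\s*:\s*(.+)", line)
        if match:
            return base_path, {"param": match.group(1), "value": match.group(2)}
        return None  # first real line is not a command; leave it for a human
    return None
```

Being strict about the first line is deliberate in this sketch: a reply the parser does not understand should stay in the queue for a human to read, not be guessed at.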
No replay: every run is replayable from disk
Because the deterministic boundary materializes every message to inbox/<id>/ before anything else happens, replay is trivial: python scripts/replay.py --eml fixtures/<case>.eml feeds the exact bytes of the original message through your local code. The input is reproduced byte-for-byte, even though the LLM step may not answer identically on every run. Fix the bug, rerun the fixture, verify the new output, ship.
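For a sense of how little a replay script needs to do, here is a stripped-down sketch. It is not the real scripts/replay.py: the content-hash ID, the scratch directory and the injected process callable are my assumptions, and the dry-run lambda in main is a stand-in for the real pipeline entry point.

```python
import argparse
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def replay(eml: Path, process: Callable[[Path], dict]) -> dict:
    """Copy a fixture into a scratch inbox and run the normal pipeline on it."""
    raw = eml.read_bytes()
    # Content-derived ID, so the same fixture always gets the same directory name.
    msg_id = hashlib.sha256(raw).hexdigest()[:16]
    base = Path(tempfile.mkdtemp(prefix="replay-")) / msg_id
    base.mkdir()
    (base / "raw.eml").write_bytes(raw)
    return process(base)


def main(argv=None) -> None:
    parser = argparse.ArgumentParser(description="Replay one .eml fixture locally")
    parser.add_argument("--eml", type=Path, required=True)
    args = parser.parse_args(argv)
    # Stand-in for the real pipeline entry point, which is not shown in this post.
    print(replay(args.eml, lambda base: {"base": str(base), "status": "dry-run"}))
```

Calling main() under the usual if __name__ == "__main__": guard gives the command-line form. The live poller and the replay script differ only in where the bytes come from; everything downstream of the directory is shared.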
What I'd repeat on the next project
A few patterns turned out to be load-bearing well beyond this one project:
Materialize-then-dispatch. Do the deterministic boundary work — protocol I/O, parsing, signals computation — in plain code before the LLM sees anything. The LLM reads from a known directory layout. This decouples "where messages come from" from "how they get processed," and it makes replay free.
Signals + LLM, never one or the other. Pure rules miss novel inputs. Pure LLM hallucinates on ambiguous inputs. Compute a deterministic signals dict in code and pass it to the LLM as ground truth; the LLM weighs and reasons over them. Always both.
Rails after the LLM, not before. Putting rules before the LLM as a gate reinvents the rules-only classifier with extra steps. Putting them after lets you constrain output without constraining input — the LLM can still reason over novel cases, but its conclusions are checked.
claude -p exit codes lie. They reflect Claude's exit, not the dispatched skill's outcome. Every skill in the harness must print a final JSON status line that the orchestrator parses. Otherwise cron thinks everything is fine while every skill silently fails.
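A sketch of the parsing side, assuming the convention is "the last non-empty line of stdout is a JSON object with a status key". The function names and the shape of the failure dict are illustrative; the strictness about the last line is my choice, not something the harness necessarily requires.

```python
import json
from typing import Optional


def final_status(stdout: str) -> Optional[dict]:
    """Return the last non-empty stdout line if it is a JSON object with 'status'."""
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    if not lines:
        return None
    try:
        obj = json.loads(lines[-1])
    except json.JSONDecodeError:
        return None
    if isinstance(obj, dict) and "status" in obj:
        return obj
    return None


def outcome(returncode: int, stdout: str) -> dict:
    """Treat a missing status line as a failure, whatever the exit code says."""
    status = final_status(stdout)
    if status is None:
        return {"status": "failed", "error": f"no status line (exit code {returncode})"}
    return status
```

In use, the orchestrator would feed this the returncode and captured stdout of the subprocess call, so an exit code of 0 with no status line is recorded as a failure and shows up in the digest.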
Names, not IDs. Hardcoded resource IDs from a vendor's UI rot the moment a resource is recreated. Look up by name via the API, cache the mapping locally, refresh on a schedule. One extra request, zero rot.
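The lookup-and-cache pattern might be sketched as follows. NameResolver and fetch_all are hypothetical names; fetch_all stands for whatever "list resources" call the vendor API offers. As an assumption beyond the text, this version also refreshes on a cache miss in addition to the schedule.

```python
import time
from typing import Callable


class NameResolver:
    """Resolve resource names to IDs through a fetcher, caching for ttl seconds."""

    def __init__(self, fetch_all: Callable[[], dict[str, str]], ttl: float = 3600.0):
        self._fetch_all = fetch_all
        self._ttl = ttl
        self._cache: dict[str, str] = {}
        self._loaded_at = float("-inf")  # forces a fetch on first use

    def resolve(self, name: str) -> str:
        stale = time.monotonic() - self._loaded_at > self._ttl
        if stale or name not in self._cache:
            # A miss triggers one refresh, which picks up recreated resources.
            self._cache = dict(self._fetch_all())
            self._loaded_at = time.monotonic()
        return self._cache[name]  # KeyError if the name truly does not exist
```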
Track state externally, not in protocol flags. Email \Seen flags, queue ack states, file-watcher "seen" markers — all of them can be flipped by humans, other clients, or restarts. Maintain idempotency in your own store keyed on a stable identifier.
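A minimal sketch of a workflow-owned idempotency store in SQLite, keyed on the Message-ID. The table and function names are illustrative. The key property is that "have I seen this?" and "mark it seen" are one atomic statement, so no mailbox flag is consulted at any point.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create the table on first use."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS seen ("
        " message_id TEXT PRIMARY KEY,"
        " first_seen TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return db


def claim(db: sqlite3.Connection, message_id: str) -> bool:
    """Return True only the first time a message ID is seen."""
    with db:  # commits on success, rolls back on error
        cur = db.execute(
            "INSERT OR IGNORE INTO seen (message_id) VALUES (?)", (message_id,)
        )
    # rowcount is 1 when the row was inserted, 0 when the key already existed.
    return cur.rowcount == 1
```

The poller would call claim() for every fetched message and dispatch only when it returns True, regardless of what the \Seen flag says.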
Build the human-in-the-loop early. A pipeline that can ask for help is fundamentally different from one that can only succeed or fail. The reply-to-act loop was the highest-leverage feature in this rewrite by a wide margin, and adding it later would have cost more than building it in from the start.
When to choose visual workflows vs code-first agent harnesses
A visual workflow tool is the right choice when:
- The flow is mostly "API → API → API," with little judgment in the middle.
- The data shape is stable and the inputs are clean.
- Non-engineers need to inspect or edit the flow.
- The cost to replace is dominated by API integration glue.
A code-first agent harness is the right choice when:
- The workload requires judgment over messy human input.
- Idempotency, replay, and audit logging matter.
- Failures need to surface to a human in a structured, actionable way.
- The "logic in Function nodes" footprint is starting to exceed the "logic in graph topology" footprint.
I crossed that threshold somewhere around the 30-node mark on this pipeline. By the time the workflow had reached 53 nodes, the visual representation was actively misleading — most of the interesting behaviour was hidden inside three or four Function nodes — and the rebuild had become inevitable.
What the migration cost
In the interest of honesty: about five hours of focused work, end to end.
That number is misleading without context. The reason it wasn't three weeks is that the n8n workflow file was already the design document. The exported JSON describes every node, every connection, every Function node's source code, and every credential reference. Translating "IMAP trigger → branch on attachment type → Function node containing this JavaScript → HTTP write" into "Python poller writes to disk → classifier skill → route handler skill" is a mechanical port, not an architectural exercise. Most of the thinking had already happened — it just lived inside the canvas instead of in a markdown file.
The reply-to-act loop took the largest single bite of those five hours, because the part the n8n JSON couldn't help with — making a bot mailbox that can both send and receive reliably across providers, and matching replies back to their original thread — was genuinely new code with no prior analogue in the workflow.
The reward side: the daily digest now shows zero "Failed (silent)" rows. Everything is either Processed, NeedsAttention (with a structured reason and a /replay command attached), or — very occasionally — Failed with an explicit error message. Trust in the pipeline went from "I check it daily because I don't believe it" to "I check the digest once a week to see what came through."
That's the part I'd write the post for. Visual workflow tools are great until the moment your reliability bar shifts from "works most of the time" to "I have to be able to explain every failure." When that happens, the cheapest thing you can do is rewrite the workflow as code with an LLM doing the judgment in the middle — and if you already have a working n8n version, you're starting from a much better place than a blank file.
If you're considering a similar migration and want a sounding board, I'm at matthias@lifeisapitch.io.