never waste a token¶
Ch01.699 never waste a token¶
📊 Level ⭐⭐ | 4.2KB |
entities/sunilpai.md
never waste a token¶
背景:从 newsletter candidates 提取,2026-06-18 v×c=56 stars=4 通过评分门槛。 URL: https://sunilpai.dev/posts/never-waste-a-token/
核心要点¶
Published Time: 2026-06-15T00:00:00.000Z
Markdown Content: (this post itself is LLM slop, but it tastes alright)
tl;dr - put a durable buffer between your agent and the LLM provider. the provider connection now outlives your process, so a deploy in the middle of a stream doesn’t cost you the tokens you already paid for. and the same buffer that lets a disconnected browser catch back up is the thing that recovers a crashed turn. one log, two readers.
I’ve spent the last few weeks stuck on one question: what happens to an agent when the process running it dies in the middle of a turn?
it goes deep fast. tool calls that may or may not have fired. sub-agents. half-written streams waiting on a human. I’m writing all of that up separately (durable agent loops, coming soon). but one piece of it is small and self-contained enough to pull out on its own:
when your process dies mid-inference, you don’t just lose your place. you lose money.
the problem that’s easy to miss¶
your agent opens a streaming request to a model, and the model starts generating. you’re billed for those output tokens the moment they’re generated. then your process gets replaced. maybe a deploy, maybe an eviction, maybe an OOM.
the usual reassurance is “don’t worry, the state is durable.” and sure, your conversation history survived. but the in-flight HTTP request to the provider did not. it lived in the memory of the process that just died. so when you recover, your only option is to make the call again. you pay for those output tokens a second time.
now make it an agent. a real one does multiple tool calls in a single turn:
user message
→ stream some text
→ tool call → tool result
→ stream more text
→ tool call → tool result
→ stream the answer
every interruption throws away all the output tokens generated so far in that turn. and it scales with the model you actually want to use: output runs $30 per million tokens on gpt-5.5 versus $2 on gpt-5.5-mini, so a flagship retry burns ~15x what a mini one does. the better the model, the more it hurts. deploys happen constantly, evictions happen constantly, and each one that lands on a live stream is money straight out the window.
the happy path hides it. you only see it when you start counting tokens after an incident and the numbers don’t add up.
the move: stop tying the request to the process¶
the reason a crash wastes tokens is that the provider connection lives inside the thing that crashed. so move it out.
put a buffer between the agent and the provider, and make it a separate deployment: its own Worker, its own Durable Object.
when a request comes in, the buffer does three things in order. it resets its state for a fresh stream. it kicks off a background task that drains the provider connection into SQLite. and it immed
评估理由¶
- value=8: Excellent practical engineering insight on agent architecture: token cost loss when LLM provider connection dies mid-stream, durable buffer pattern, and token economics across model tiers (gpt-5.5 vs
- confidence=7: 详细程度与来源可信度
- stars=4: 独特技术洞察评分