AI · What's new · 9 min read · May 4, 2026

The agent stack reset: what actually matters in late 2026

Three years into the agent gold rush, half the tools we praised in 2024 are gone or irrelevant. Here's the stack I run for clients now, and the four shifts that forced the rewrite.

If you set up an AI workflow before mid-2025, throw it out. I mean it kindly. The platform under our feet has shifted three times in two years, and most of the architectures that looked clever twelve months ago are now slow, expensive, or solving problems that no longer exist.

I rebuilt my client stack twice this year. Below is what survived, what got cut, and the four shifts that forced the rewrite.

Shift 1 — Tool calls became the API, not the chat

Two years ago we wrote prompts that begged a model to use a tool. Now models reach for tools the way humans reach for a calculator: without thinking, in parallel, sometimes a dozen at a time. That changed what a workflow even looks like. The unit of work is no longer a prompt. It's a session — a model with access to your filesystem, your database, your design tool, and your inbox, running in a loop until the goal is met.

Practical consequence: I stopped writing chains in n8n and Make for anything that needs reasoning. I write capabilities. The model picks the order. n8n still glues triggers together, but the brain is the model, not the canvas.
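
A minimal sketch of what "capabilities, not chains" can look like. Every name here (Capability, ModelTurn, runSession, scriptedModel, readFile, queryDb) is hypothetical and stands in for whatever SDK you actually use; the scripted model is a stub so the loop runs without a network call.

```typescript
// Hypothetical types for illustration; not any vendor's SDK.
type ToolCall = { name: string; args: Record<string, unknown> };
type ToolResult = { name: string; output: string };

// One model turn either requests tool calls or declares the goal met.
type ModelTurn =
  | { kind: "tool_calls"; calls: ToolCall[] }
  | { kind: "done"; answer: string };

// A capability is a named function the model may call in any order.
type Capability = (args: Record<string, unknown>) => Promise<string>;

// Stand-in for a real model call: it sees every result so far and picks the next step.
type Model = (goal: string, results: ToolResult[]) => Promise<ModelTurn>;

async function runSession(
  goal: string,
  model: Model,
  capabilities: Map<string, Capability>,
  maxTurns = 8,
): Promise<string> {
  const results: ToolResult[] = [];
  for (let turn = 0; turn < maxTurns; turn++) {
    const next = await model(goal, results);
    if (next.kind === "done") return next.answer;
    // Run all requested calls in parallel; the model chose them, not a canvas.
    const outputs = await Promise.all(
      next.calls.map(async (call): Promise<ToolResult> => {
        const capability = capabilities.get(call.name);
        const output = capability
          ? await capability(call.args)
          : `error: unknown capability "${call.name}"`;
        return { name: call.name, output };
      }),
    );
    results.push(...outputs);
  }
  // A turn cap keeps a confused session from looping forever.
  throw new Error(`goal not met after ${maxTurns} turns`);
}

// Stub capabilities and a scripted model, so the sketch is self-contained.
const capabilities = new Map<string, Capability>([
  ["readFile", async (args) => `contents of ${String(args["path"])}`],
  ["queryDb", async (args) => `rows from ${String(args["table"])}`],
]);

const scriptedModel: Model = async (_goal, results) =>
  results.length === 0
    ? {
        kind: "tool_calls",
        calls: [
          { name: "readFile", args: { path: "brief.md" } },
          { name: "queryDb", args: { table: "invoices" } },
        ],
      }
    : { kind: "done", answer: `used ${results.map((r) => r.name).join(", ")}` };
```

Calling runSession("summarise the project", scriptedModel, capabilities) resolves once the stub reports it is done. The point of the shape is that nothing in the code fixes the order of steps; you only register what the model is allowed to do and cap how long it may try.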

Shift 2 — Context windows ate retrieval

RAG was a beautiful idea. It is also, in 2026, mostly unnecessary for the size of work most studios do. With one-million-token windows on the frontier models, I can paste an entire codebase, an entire client knowledge base, every email thread for a project — and the model holds it. Embedding pipelines, vector databases, hybrid search, reranking — all the infrastructure we built to fake long context — collapse to one line: read the file, hand it to the model.

There's still a place for retrieval at enterprise scale. Below that, it's a moat that's mostly dried up. If your stack still routes everything through a vector DB for a 50-page handbook, you are paying a tax for a problem that has been solved by Moore's law for tokens.
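
One way to make that call explicit is a rough size check before reaching for a vector DB. This is a sketch under stated assumptions: roughly 4 characters per token is a common heuristic for English text, not a tokenizer; the window and reserve sizes are illustrative defaults, not vendor limits; and planContext is a hypothetical helper.

```typescript
// Assumption: ~4 characters per token, a rough heuristic for English prose.
const CHARS_PER_TOKEN = 4;

type ContextPlan = {
  strategy: "long-context" | "retrieval";
  estimatedTokens: number;
};

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Decide whether the whole corpus fits in the window with room left for the reply.
function planContext(
  documents: string[],
  windowTokens = 1_000_000, // illustrative window size
  reservedForOutput = 50_000, // illustrative headroom for instructions and output
): ContextPlan {
  const estimatedTokens = documents.reduce(
    (sum, doc) => sum + estimateTokens(doc),
    0,
  );
  const fits = estimatedTokens <= windowTokens - reservedForOutput;
  return { strategy: fits ? "long-context" : "retrieval", estimatedTokens };
}

// A 50-page handbook at an assumed ~3,000 characters per page.
const handbook = "x".repeat(50 * 3_000);
console.log(planContext([handbook]));
```

Under those assumptions the handbook comes out at about 37,500 estimated tokens, a small fraction of the window, so the plan is to read the file and hand it over. Retrieval only wins once the estimate stops fitting.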

Shift 3 — Frameworks lost to OS-level integration

Late 2024 was the era of the framework: LangChain, LlamaIndex, AutoGen. Late 2026 is the era of the protocol. MCP servers, computer-use APIs, browser-control agents. The model talks to your tools through a thin, language-agnostic protocol — not through a Python class hierarchy that breaks every time a vendor ships an update.

What that means for builders: the moat moves from "who has the slickest abstraction" to "who shipped the cleanest MCP server for their tool first." Figma, Linear, Notion, Stripe, Postgres — they all have first-party servers now. The work isn't writing glue. It's deciding which glue to trust and which to host yourself. We help clients pick — see integrations and AI workflows.
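
To show how thin that protocol layer is: MCP messages are JSON-RPC 2.0, and a client discovers and invokes tools with the tools/list and tools/call methods. The sketch below only builds the request payloads; the transport, the initialization handshake, and the tool name create_issue with its arguments are assumptions for illustration, so check the current MCP specification and your server's own tool list before relying on any of it.

```typescript
// Shape of a JSON-RPC 2.0 request as used by MCP clients.
type JsonRpcRequest = {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: Record<string, unknown>;
};

// Ask a server which tools it exposes.
function toolsListRequest(id: number): JsonRpcRequest {
  return { jsonrpc: "2.0", id, method: "tools/list" };
}

// Invoke one tool by name with a plain JSON arguments object.
function toolsCallRequest(
  id: number,
  name: string,
  args: Record<string, unknown>,
): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// "create_issue" and its arguments are hypothetical, not a real server's schema.
const request = toolsCallRequest(2, "create_issue", { title: "Fix login bug" });
console.log(JSON.stringify(request));
```

Nothing in that payload depends on Python, TypeScript, or any class hierarchy, which is the whole argument: any client that can emit this JSON can drive any server that accepts it.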

Shift 4 — Cost stopped being the constraint

In 2024 we counted tokens. In 2026 the smaller frontier models are essentially free for any non-streaming, non-realtime use. Cost has been replaced by latency and concurrency. A workflow that costs five cents but makes a client wait sixteen seconds is now worse than one that costs forty cents and returns in three.

I rate every client integration on the same scale I rate UI animation: under 200 ms it feels instant, under 1 second it feels live, over 4 seconds it feels broken. Most of last year's automations are sitting at twelve to forty seconds and people have learned to tolerate them. They shouldn't. The latency budget is the new pricing tier.
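
That scale is easy to encode as a check. A minimal sketch: the 200 ms, 1 second, and 4 second thresholds come from the paragraph above; the "tolerable" label for the unnamed 1 to 4 second band, and the rule of comparing latency band first and cost second, are my assumptions, as are the names rateLatency and pickWorkflow.

```typescript
type Feel = "instant" | "live" | "tolerable" | "broken";

// Thresholds from the text; "tolerable" is an assumed label for 1 to 4 seconds.
function rateLatency(ms: number): Feel {
  if (ms < 200) return "instant";
  if (ms < 1000) return "live";
  if (ms <= 4000) return "tolerable";
  return "broken";
}

type Workflow = { name: string; costCents: number; latencyMs: number };

const rank: Record<Feel, number> = { instant: 0, live: 1, tolerable: 2, broken: 3 };

// Assumed policy: the better latency band wins; cost only breaks ties.
function pickWorkflow(a: Workflow, b: Workflow): Workflow {
  const byFeel = rank[rateLatency(a.latencyMs)] - rank[rateLatency(b.latencyMs)];
  if (byFeel !== 0) return byFeel < 0 ? a : b;
  return a.costCents <= b.costCents ? a : b;
}

const cheapButSlow: Workflow = { name: "cheap", costCents: 5, latencyMs: 16_000 };
const pricierButFast: Workflow = { name: "fast", costCents: 40, latencyMs: 3_000 };
console.log(pickWorkflow(cheapButSlow, pricierButFast).name);
```

With the two workflows from the previous section, the sixteen-second one lands in "broken" and the three-second one in "tolerable", so the forty-cent option is picked despite costing eight times as much.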

What survived in my stack

  • Claude Sonnet 4.6 for almost everything reasoning-heavy. Cheap, fast, doesn't hallucinate file paths. Claude Opus 4.7 when the work is worth waiting for.
  • OpenAI Codex for one-shot coding tasks I can specify clearly — pull request size and below.
  • Gemini 3 Pro when the input is video, audio, or genuinely massive (1M+ tokens of mixed media).
  • n8n self-hosted for the boring choreography: webhooks, retries, queues. The model is the worker. n8n is the supervisor.
  • Cursor and Claude Code for actual building, with MCP servers connecting them to the design tool, the database, and the deploy pipeline.
  • Postgres + pgvector for the rare retrieval case. No more dedicated vector DBs in any new build.

What got cut

  • LangChain pipelines older than six months. Rewritten as plain TypeScript with the SDK directly.
  • Pinecone, Weaviate, Qdrant — replaced by pgvector or by long context, depending on the case.
  • Anything that asked GPT-4 to "think step by step" in a manually written chain. Models do that natively now.
  • Five different image generation APIs, consolidated to two. The extra quality from juggling more vendors stopped being worth the overhead.
  • Most prompt-engineering libraries. Prompts shrank as models got better at intent.

The honest version

If you are running a real business and your AI stack hasn't been touched since the start of 2025, the most useful thing you can do this quarter is throw away half of it. Not all — half. Keep what's load-bearing. Cut what's there because someone said you needed it eighteen months ago. The technology has moved past that conversation. The tools you keep will get more powerful by the month. The ones you are still maintaining out of habit are a tax on your throughput. If you want a second pair of eyes on yours, the free 30-minute audit is the right shape of conversation.
