You're Building Two Products (And Only Shipping One)
When glue code becomes infrastructure: choosing between owning your middleware or migrating to managed platforms
TL;DR: Custom middleware grows from glue code into a second product. You have two choices: own it (investing in its ongoing development and maintenance) or outsource it safely. Both work. Ignoring it doesn't.

There's a moment that catches most teams off-guard. Three months into building a sales copilot, the demo works beautifully - it pulls context from your knowledge base, caches responses, respects permissions. The first few customers love it. Then suddenly you're trying to figure out why the Notion connector stopped syncing at 2 AM. Or why permission checks are taking 800ms per query. Or why schema changes from Slack's API silently broke your retrieval last week. You get the picture.
And none of this was in your original system requirements. Rather, this glue code grew naturally from necessity. At each step, your decision-making made sense. The Notion connector needed retries. Permission propagation had to be bulletproof. Schema versioning became non-negotiable when Slack changed their API format without warning. Each addition solved a real production problem.
The tension surfaces when you're trying to ship product features but your engineers are debugging connector failures or rewriting permission logic for the third time. Middleware that started as weekend work grows - becoming increasingly fragile and creating emergency engineering work that sabotages your product roadmap.
What This Framework Covers
This piece walks through three things: recognizing when glue code becomes a system that needs ownership, calculating what that actually costs your team, and mapping out your options without pausing your roadmap.
How Does Simple Glue Code Become a Second Product?

Month one looks clean:
# Month 1: a simple connector
docs = notion_client.get_pages(workspace_id)

Month three, you're handling edge cases:

# Month 3: retries + backoff
docs = fetch_with_retry(
    notion_client.get_pages,
    workspace_id,
    max_retries=3
)

Month six, and you’ve got a platform roadmap and ongoing maintenance:
# Month 6: it's now a second product
from custom_middleware import (
    MultiSourceConnector,
    PermissionPropagator,
    SchemaVersionManager,
    FreshnessMonitor,
    TenantIsolator,
    BackfillOrchestrator
)
# ...4,200 lines later
Each addition makes sense. Retries prevent failures. Permission propagation stops leaks. Schema versioning handles upstream changes. But they accumulate into something bigger: a second product on which the product you sell depends.
The question isn't "how did we let this happen?" It's "now that we’re here, what do we do?"
Recognizing the Inflection Point
The clearest signal: you're spending engineering time you didn't budget for. Infrastructure is competing with features in sprint planning. You’re realizing that your product is fragile and you can’t iterate on it at the speed you need.
Before diving into technical specifics, use this framework to measure the problem and see where you are on this curve:
The context matters: One engineer on middleware at a 3-person startup is 33% of engineering. At a 30-person company, it's 3%. Same activity, completely different implications.
Reframing Sunk Cost
Those months taught you exactly what to demand and test. That knowledge compounds whether you formalize your custom stack or evaluate managed alternatives.
The question isn't "do we throw this away?" It's "what compounds best over the next twelve months given our stage and constraints?"
When Does Custom Infrastructure Make Sense?
Custom infrastructure makes sense in specific situations. Teams building single-tenant systems with 2-3 stable data sources and strong compliance requirements often choose to build their stack. When infrastructure is your competitive advantage and you allocate the bandwidth to maintain it properly, building custom is often the right path.
The patterns that work:
Single-tenant deployments with stable requirements
2-3 data sources that rarely change
Upstreams with strong backwards compatibility
Low compliance burden (no SOC2/HIPAA walls)
Minimal collaboration needs with technical users only
Infrastructure is a core differentiator
If this describes your situation, embrace it. Treat middleware as the product it is by resourcing it, managing it proactively and not allowing it to distract your team from developing the differentiating features that will help you win in the market.
Why Does Custom Middleware Grow More Complex?
The patterns above describe ideal conditions for custom infrastructure. But even teams that check every box encounter unexpected complexity. Infrastructure that starts clean evolves through production realities: business requirements change, new tech is integrated, edge cases multiply, upstreams change without warning, and scale reveals problems that were invisible or manageable early on.
This is the nature of AI-powered applications - the Software Development Life Cycle (SDLC) for non-deterministic production systems is different.
Why Do Multi-Tenant Permission Layers Break at Scale?
Multi-tenant Retrieval-Augmented Generation reconciles three layers speaking different languages:
Source APIs (OAuth scopes, per-doc permissions)
↓
Vector DB (metadata filters, flattened access)
↓
Application (user context, policy decisions)
What breaks in production:
# Works in dev, hurts in prod
def search_docs(user_id, query):
    results = vector_db.search(query, top_k=10)
    return [r for r in results if user_id in r.metadata['owners']]

With 50 workspaces × 10k documents:
500k Access Control List (ACL) checks per query scanning metadata after retrieval
Stale permissions when someone loses access but vectors remain
Cross-tenant cache leaks when admin scopes at ingest serve user scopes at query
No audit trail for "why could this user see this?"
What good looks like: policy-as-code service (centralized decisions + logs), per-tenant cache keys, ingest-time Access Control List attribution, pre-filter at search time.
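A minimal sketch of the pre-filtering and per-tenant caching pieces, assuming a vector store client that accepts metadata filters at query time (the vector_db client, filter syntax, and the tenant_id / group_ids fields are illustrative, not any specific product's API):

# Sketch: pre-filter at query time instead of post-filtering results
def search_docs(user_ctx, query, vector_db, cache, top_k=10):
    acl_filter = {
        "tenant_id": user_ctx.tenant_id,                # hard tenant isolation
        "allowed_groups": {"$in": user_ctx.group_ids},  # ingest-time ACL attribution
    }
    # Per-tenant cache key prevents cross-tenant cache leaks
    cache_key = f"search:{user_ctx.tenant_id}:{query}"
    if (cached := cache.get(cache_key)) is not None:
        return cached
    # The store applies the filter before ranking, so top_k isn't wasted
    # on documents this user can never see
    results = vector_db.search(query, top_k=top_k, filter=acl_filter)
    cache.set(cache_key, results)
    return results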
How Do Upstream API Changes Wreak Havoc Without Breaking Your Health Checks?
Upstreams evolve without failing your health checks. Clean 200 OK, completely different data:
// Before
{"headline": "Breaking News", "body": "Full text"}
// After (still 200 OK)
{
  "headline": "Breaking News",
  "body": {
    "text": "Full text",
    "html": "<p>Full text</p>"
  }
}

Why this breaks your product: Your AI suddenly returns empty results for 40% of queries. Customer support is flooded with issues they can’t triage. Your dashboards show everything's healthy because the API technically returned 200 OK - but the data structure silently changed.
The fix production teams use: Versioned parsers (handle both old and new formats), contract tests (catch changes before production), and drift alerts (notify when formats don't match expectations). This prevents the 'silent failure' scenario where everything looks fine but nothing works.
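A minimal sketch of a versioned parser for the drift shown above (the field names mirror the example payloads; how you alert on unknown shapes is up to you):

def parse_article(payload: dict) -> dict:
    body = payload.get("body")
    if isinstance(body, str):
        text = body                      # old format: body is a plain string
    elif isinstance(body, dict) and isinstance(body.get("text"), str):
        text = body["text"]              # new format: body is an object
    else:
        # Unknown shape: fail loudly / emit a drift alert instead of
        # silently returning empty text
        raise ValueError(f"Unrecognized body schema: {body!r}")
    return {"headline": payload["headline"], "body_text": text}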
Why Does Vector Search Performance Degrade Over Time?
Incremental updates and metadata filters interact poorly at scale:
Hierarchical Navigable Small World (a vector search algorithm) insertion degradation: Graph quality deteriorates without maintenance
Restrictive filters fragment graphs: Heavy filtering disconnects segments
Multi-tenant isolation: Per-tenant indexes (expensive, isolated) vs shared index with strict filters (cheaper, complex)
Production teams track index health metrics (fragmentation percentage, recall benchmarks), schedule maintenance windows, make clear architectural choices, and monitor quality decay before users notice.
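One way to make "recall benchmarks" concrete: a periodic check that compares the approximate index against exact search on a held-out sample. This is a sketch - ann_index.search stands in for whatever index API you actually run, and the alert threshold is illustrative:

import numpy as np

def recall_at_k(ann_index, doc_vectors, query_vectors, k=10):
    total = 0.0
    for q in query_vectors:
        exact = set(np.argsort(-(doc_vectors @ q))[:k])  # brute-force ground truth
        approx = set(ann_index.search(q, top_k=k))       # assumed index API
        total += len(exact & approx) / k
    return total / len(query_vectors)

# Alert before users notice, e.g. when recall_at_k(...) drops below 0.95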
Driver | Dev Symptom | Hidden Signal | Minimum Guardrail |
Permission drift | "Why can't I see this?" | Stale Access Control List metadata | Ingest-time attribution + audit logs |
Schema change | Empty results | Type mismatch, silent fails | Contract tests, versioned parsers |
Index fragmentation | Degraded recall | Hierarchical Navigable Small World (a vector search algorithm) disconnected segments | Health metrics + maintenance |
Cross-tenant leaks | "I saw someone else's data" | Shared cache, wrong filters | Per-tenant cache keys, pre-filtering |
Why Does Standard Monitoring Miss These Problems?
Most monitoring tracks whether your system is running - uptime, error rates, latency. That measures the control plane. But users experience a different layer of failures: stale data that's six hours old, permission leaks that show the wrong documents, schema changes that break parsing silently. Your dashboards report green while customers hit real problems.
The data plane is where failures appear in agentic systems.
What Data-Plane Metrics Should You Monitor That Dashboards Don't Track?
Control-Plane Metric | What Actually Breaks | Why It's Invisible | Data-Plane Signal to Add |
API uptime (99.9%) | Stale data (6 hours old) | No freshness Service Level Objectives (SLO) per connector | Hours since last successful sync |
Request errors (0.1%) | Permission leaks | Access Control List (ACL) evaluation after retrieval | Permission outcomes: allowed/denied/leaks |
P95 latency (200ms) | Wrong schema version | Drift doesn't throw errors | Schema version distribution |
Vector search uptime | Degraded recall | No index health metrics | Fragmentation %, recall benchmarks |
This framework works with any infrastructure. With custom AI middleware, your team builds and maintains it. Managed platforms (like Lamatic.ai) make monitoring the data plane modular and manageable without engineering resources.
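If you own the middleware yourself, emitting these signals can be straightforward. A minimal sketch, assuming a generic statsd/Prometheus-style metrics client (the metrics.gauge / metrics.incr names and metric keys are illustrative):

import time

def record_sync(metrics, connector: str, last_success_ts: float):
    # Freshness SLO signal: hours since the last successful sync, per connector
    hours_stale = (time.time() - last_success_ts) / 3600
    metrics.gauge(f"data_plane.freshness_hours.{connector}", hours_stale)

def record_permission_check(metrics, outcome: str):
    # outcome: "allowed", "denied", or "leak_detected"
    metrics.incr(f"data_plane.permissions.{outcome}")

def record_schema_version(metrics, connector: str, version: str):
    # Version distribution makes silent drift visible on a dashboard
    metrics.incr(f"data_plane.schema_version.{connector}.{version}")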
"Your dashboards measure control-plane health. Your users experience data-plane failures."
Traq.ai hit this while iterating on prompts in production. Control-plane metrics: healthy. Data-plane reality: prompt changes needed full pipeline redeployments, blocking releases. Decoupling prompt config from embedding jobs delivered 5× faster iteration - but only after measuring the right things.
What Some Teams Learned When Middleware Became a Bottleneck
How Did Reveal Achieve 10x Cost Reduction Through Rapid Iteration?
Reveal, an AI-powered enterprise enablement platform, faced a bottleneck: every prompt optimization required full pipeline redeployments, slowing their ability to find optimal configurations.
After migrating to Lamatic, prompt updates became instant. "If we want to tweak a prompt, we go into Lamatic, update it, deploy, and those changes are live," says founder Johan Hoernell.
This iteration speed unlocked rapid testing of optimization permutations. The result: a 10x reduction in token usage through optimized prompts and intelligent image cropping, drastically improved output quality, and faster development velocity.
How Did Traq.ai Unlock Feature Velocity While Reducing Engineering Overhead?
Traq.ai, an AI-powered conversation intelligence platform for sales teams, needed to ship new AI features rapidly, but their small engineering team was stretched thin maintaining custom middleware.
Their challenge: Every new AI capability required building and maintaining additional infrastructure - connectors, embeddings, retrieval logic. This meant engineers spent time on undifferentiated middleware instead of customer-facing features.
By switching to Lamatic's managed platform, Traq.ai eliminated this bottleneck - freeing its scarce engineering resources to focus on the product features that differentiate it from slower-moving competitors in its space like Gong, Chorus, Fireflies, and Fathom.
Why Did Beehive Climate Need Extraction Contracts to Close Enterprise Deals?
Beehive Climate helps high-profile enterprises transform governmental climate compliance reporting from an expense line into profit-producing climate action. Their challenge: extracting insights from thousands of pages of mixed-format climate reports.
Mixed-format parsing wasn't enough - legal and investor reporting demanded lineage proof. Without extraction contracts and audit logs, they couldn't prove why a claim appeared in a disclosure. Schema contracts, decision logs and SOC2 compliance certification were key unlocks for Beehive’s enterprise deals.
What Does It Take to Properly Own Custom Middleware?
Ownership means:
Accountability & Expertise (not "whoever has time")
Roadmap (planned improvements, not just firefighting)
Service Level Objectives (freshness per connector, permission accuracy, schema compatibility)
Policy-as-code (centralized decisions, permissions, audit logs)
Contract tests (validate at build time)
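For that last item, a minimal pytest-style sketch of a contract test: pin the upstream shapes you depend on so drift fails CI instead of production (record_or_fetch_sample is an illustrative fixture returning a recorded or sandboxed upstream response):

def test_article_contract(record_or_fetch_sample):
    payload = record_or_fetch_sample("articles")
    assert isinstance(payload.get("headline"), str)
    body = payload.get("body")
    # Accept both known shapes; anything else is drift worth a human look
    assert isinstance(body, str) or (
        isinstance(body, dict) and isinstance(body.get("text"), str)
    )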
How Do You Migrate from Custom to Managed Without Downtime?
Migration is only practical if your architecture supports it. The most common mistake is mixing your AI logic directly into your application code. When your RAG pipeline, prompt orchestration, and vector search sit in the same files as your UI components and business logic, every connector change requires retesting your entire application.
Keep two separate layers: Your frontend and business logic in one layer. Your agentic infrastructure (connectors, retrieval, orchestration) in another. This modularity is similar to the way you separate your database layer from your application - you wouldn't write SQL queries directly in your React components.
Zero-downtime pattern: instrument → shim → dual-write → replay → cutover → rollback switches.
import os

class MiddlewareAdapter:
    def __init__(self):
        self.custom = CustomMiddleware()
        self.managed = ManagedPlatform()  # e.g., Lamatic
        # Env flags arrive as strings; compare explicitly so "false" isn't truthy
        self.dual = os.getenv('DUAL_WRITE', 'false').lower() == 'true'

    async def search(self, query, user_ctx):
        # Primary path (current system)
        primary = await self.custom.search(query, user_ctx)
        # Shadow path (future system)
        if self.dual:
            shadow = await self.managed.search(query, user_ctx)
            await self.compare(primary, shadow)  # Log parity
        return primary

Key moves:
Test with real queries first
Replay actual production queries before switching traffic. This catches edge cases your synthetic tests miss - like the user with 10,000 documents who triggers timeouts.
Why it matters: Finding problems in testing costs an afternoon. Finding them after customers are affected erodes trust and is far more disruptive operationally.
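A minimal sketch of a replay pass, assuming the MiddlewareAdapter above and a helper that yields (query, user_ctx) pairs from your request logs (the helper and the comparison are illustrative; compare doc IDs or scores with tolerance as fits your data):

async def replay(adapter, logged_queries):
    mismatches = []
    for query, user_ctx in logged_queries:
        old = await adapter.custom.search(query, user_ctx)
        new = await adapter.managed.search(query, user_ctx)
        if old != new:                        # or compare doc IDs / scores with tolerance
            mismatches.append((query, user_ctx))
    return mismatches                         # triage every mismatch before cutover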
Move one API call at a time
A “big bang” conversion is much slower and riskier.
Why it matters: Migrating incrementally lets you iterate and learn as you go, without the delay and contingency planning a full changeover would require.
Build instant rollback switches
Feature flags let you revert to the old system with one click - no code deployment needed.
Why it matters: When problems hit at 2 AM, you revert in 2 minutes instead of spending 2 hours redeploying code.
Monitor what users experience
Track data freshness, permission accuracy, and schema compatibility - not just uptime.
Why it matters: Traditional monitoring says "system is up." These metrics tell you if it's actually working correctly.
From the Reveal case study: Johan Hoernell migrated from Langchain to Lamatic in what he describes as "an afternoon project." The key was decoupling: "We desegregated the core underlying software infrastructure from our ML, which has been really helpful."
This architectural separation meant swapping API calls without touching core product logic. Agent updates that previously took weeks now happen instantly with 1-click config-driven deployments.
What About Vendor Lock-In with Managed Platforms?
You're thinking: "What if pricing changes, my platform partner is acquired, or they don’t keep up and I need to migrate?"
Legitimate concerns. Here's how to evaluate dependency:
How Do You Ensure Data Portability with Managed Middleware?
Verify:
Full export via documented API (embeddings, metadata, configs)
Standard formats (JSON, Parquet) not proprietary schemas
No termination fees or data hostage scenarios
For enterprise: Data Processing Agreement should include data return provisions and export assistance.
How Do You Avoid Getting Locked Into a Vendor's API?
Preserve portability with a thin abstraction layer - 50 lines that make switching providers a config change rather than a rewrite. The same adapter pattern used for dual-write migrations works for vendor portability.
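A minimal sketch of that abstraction layer - the Protocol, method names, and adapter module paths are illustrative, not a prescribed interface:

from typing import Protocol

class RetrievalProvider(Protocol):
    async def search(self, query: str, user_ctx: dict) -> list: ...
    async def ingest(self, doc: dict, tenant_id: str) -> None: ...

def get_provider(name: str) -> RetrievalProvider:
    # Provider selection lives in config, not scattered across call sites
    if name == "managed":
        from adapters.managed import ManagedAdapter  # hypothetical module
        return ManagedAdapter()
    from adapters.custom import CustomAdapter        # hypothetical module
    return CustomAdapter()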
What Should You Ask Vendors About Migration Support?
Ask any managed platform vendor:
Migration assistance in both directions? (in and out)
Typical timeline for migrating?
Customers who've successfully migrated away?
Pro tip: Vendors confident in their value document exit paths clearly. Defensiveness about lock-in is a red flag.
How Do You Protect Against Price Increases in Managed Platforms?
For enterprise contracts:
Price protection (max X% annual increase)
Change-of-control provisions (renegotiation if acquired)
Volume commitment transparency
Reality check: Custom middleware creates its own lock-in via opportunity cost and institutional knowledge. Every architecture involves trade-offs. Choose the constraint aligning with your priorities: internal technical debt vs vendor relationship management.
Which Path Should You Choose: Custom or Managed Middleware?
Historically, most teams have started custom - mostly because quality managed middleware didn’t exist. The inflection point hits when this undifferentiated code - which provides connectivity (to models, data, tools, infra and apps), orchestration, workflow, and run-time execution - becomes a bottleneck that slows your ability to deliver differentiated features as fast as competitors.
Two viable paths:
Own it: Dedicated team, roadmap, Service Level Objectives, data-plane monitoring, contracts preventing drift.
Migrate: Instrument, shim, dual-write, replay, cutover - without pausing features.
Both work. Treating critical infrastructure as "glue code" that maintains itself doesn't.
Consistent shippers choose their path intentionally and commit.
FAQ
Q: How do I know I'm at the inflection point?
Run the self-assessment. Score 4+? You're there. Symptoms: 2+ engineers on middleware, increasing feature lead times, no data-plane observability.
Q: What's the difference between control-plane and data-plane monitoring?
Control-plane: is your system running (uptime, errors, latency). Data-plane: is it working correctly (permission accuracy, data freshness, schema compatibility). Most teams only monitor control-plane. Dashboards stay green while users experience failures.
Q: How long does migration typically take?
With dual-write, 4-8 weeks migrating connector-by-connector with zero downtime. Key: rollback switches at connector level.
Q: What if I build custom and need to migrate later?
Use a modular, microservice architecture (SOA) to make both ongoing maintenance and future migration fast and easy. Knowledge from building custom helps you evaluate managed solutions better. Not wasted - teaches you what to demand. Dual-write adapter makes later migration safer.
Q: How do I calculate the real cost of custom middleware?
Engineers × % time on infra × fully-loaded cost + direct infra costs. Compare against managed platform subscription + migration investment. Factor opportunity cost: what could your engineers accomplish if they were freed from building and maintaining custom middleware?
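Illustrative numbers: two engineers spending 40% of their time on middleware at a $200k fully-loaded cost is roughly $160k/year before direct infrastructure spend. Weigh that against a platform subscription plus a one-time migration effort - and against what those engineers would ship if that time went to product.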
Next Steps
Read the Three Paths framework: our battle-tested framework that dives deeper into direct APIs, custom builds, and managed platforms.
Schedule an agentic middleware strategy session and demo with Lamatic: https://lamatic.ai/request-demo
Part 3: migration patterns in depth - dual-writing, replay strategies, safe cutover sequences.
Written by Chuck Whiteman (CEO, Lamatic) with technical contributions from Aman Sharma (CTO). Based on direct experience with dozens of experienced AI builders at organizations that include Traq, Reveal, Navigamo, Beehive, The AI Collective and many more.
Lamatic provides managed AI middleware for SaaS product teams - handling connectors, permissions, orchestration, and schema evolution. Focus on shipping features. Learn more at Lamatic.ai.