Platform Engineering

Juli 26, 2026

Documentation Is Agent Infrastructure: Governing the Knowledge Layer Behind Enterprise Automation

Most enterprise AI agent programs focus first on models, tools, and orchestration. Documentation is often treated as supporting material that can be connected later.

That order is backwards.

When an agent is expected to explain a product, interpret live system state, or choose the right action, the organization’s knowledge layer becomes part of the runtime. Documentation, examples, support guidance, and API references are no longer just content for people. They are operational dependencies that influence how the agent plans and acts.

This changes the governance question. The challenge is not simply whether an agent can search documentation. The challenge is whether the knowledge it retrieves is current, authorized, observable, and reliable enough to support enterprise automation.

A signal from 1,192 agent conversations

A recent CNCF member post from kapa.ai analyzed 1,192 conversations with an agent embedded in its own product. The agent had around 30 native tools for querying the platform and one tool for searching documentation, code examples, support FAQs, and API references.

The documentation tool became the most frequently used tool, almost matching all native tools combined. The analysis identified three roles for documentation:

Fallback: 32.1% of conversations relied on documentation when no native tool could directly answer the question.
Context: around 7% combined live product data with documentation to explain what that data meant.
Planning: the agent consulted documentation to understand product capabilities and select the appropriate native tool.

This is evidence from one product, not a universal benchmark. The exact percentages should not be generalized across every enterprise. The pattern, however, is strategically important: purpose-built tools tell an agent what the system can expose, while documentation often tells it how the system works and what an observation means.

The knowledge layer is part of the execution path

Consider an operations agent investigating a failed deployment. Native tools might return rollout status, recent events, policy results, and telemetry. Those facts are necessary but incomplete. To choose a useful next step, the agent may also need the organization’s deployment conventions, ownership model, approved rollback procedure, exception policy, and known platform limitations.

If that information is missing, the agent has three poor options: refuse, guess, or call the wrong tool. If the information is stale or over-broadly accessible, the outcome can be worse. The agent may produce a confident but obsolete recommendation, expose restricted operational details, or act on guidance that no longer matches the platform.

That makes documentation quality a reliability and control concern. A broken API can cause an automation failure. So can an obsolete runbook retrieved at the wrong moment.

Five controls for an enterprise agent knowledge layer

1. Assign ownership and freshness objectives

Every knowledge source available to an agent should have an accountable owner. Ownership needs to cover more than writing. It should include review frequency, retirement, escalation, and alignment with the systems the content describes.

Different content requires different freshness objectives. An architectural principle may remain valid for years. An incident runbook, API example, or security exception may become dangerous within weeks. Treating every document alike creates either excessive review effort or unacceptable drift.

A practical inventory should record the owner, classification, last review date, next review deadline, source system, and systems or tools affected by the content.

2. Preserve version and environment context

Retrieval should not flatten every document into one timeless knowledge pool. Guidance for an older Kubernetes version, a retired API, or a development environment can be accurate in isolation and still be wrong for the current task.

Attach metadata such as product version, environment, region, effective date, lifecycle state, and superseding document. Retrieval policies can then prefer knowledge that matches the live context and exclude archived or incompatible guidance.

For high-impact actions, the agent should be able to cite the exact source and version that informed its decision. This gives reviewers evidence instead of a vague statement that “the documentation says so.”

3. Enforce permissions at retrieval time

An agent must not gain broader knowledge access simply because it provides a convenient conversational interface. Retrieval should preserve the caller’s identity, tenant, role, and data-classification constraints.

This control applies to both source selection and returned content. Filtering after retrieval is too late if restricted text has already entered the model context. The authorization decision belongs at the retrieval boundary.

OWASP’s guidance on vector and embedding weaknesses also highlights risks such as unauthorized access, data leakage, and poisoning in retrieval-augmented systems. For enterprises, the knowledge index therefore needs the same security attention as other production data stores: provenance, access control, integrity checks, and monitored ingestion.

4. Evaluate answers and tool choices separately

Traditional documentation analytics measure page views and search terms. Agent-enabled knowledge requires additional evaluation.

Teams should test at least four outcomes:

Did retrieval return the correct and current source?
Did the answer remain grounded in that source?
Did the agent choose the correct native tool and parameters?
Did it refuse or escalate when knowledge was missing, ambiguous, or unauthorized?

These are different failure modes. A fluent answer can still be based on the wrong version. Correct retrieval can still lead to an unsafe tool call. A strong evaluation set should therefore include normal questions, conflicting documents, expired guidance, permission boundaries, and cases where no valid answer exists.

5. Make retrieval observable

Organizations need telemetry for the knowledge path, not only for model latency and tool execution. Useful signals include source selection, document version, retrieval score, authorization outcome, citation use, fallback frequency, tool choice, user correction, and escalation.

This telemetry answers operational questions that static content reviews cannot:

Which unanswered questions repeatedly force the agent to guess or refuse?
Which documents influence the most actions?
Where do retrieved instructions conflict?
Which content is frequently retrieved but rarely leads to a successful outcome?
Are users reaching knowledge they could not access through the original source?

Retrieval telemetry should respect privacy and retention policies. Its purpose is to improve reliability and governance, not to create an uncontrolled archive of user prompts and internal content.

A practical operating model

The knowledge layer sits across several existing responsibilities. Treating it as “the documentation team’s problem” leaves critical gaps.

Product and service owners own factual correctness and lifecycle decisions.
Platform teams provide ingestion, indexing, retrieval, observability, and safe rollout capabilities.
Security and data governance define classification, access, retention, and integrity controls.
Agent teams define retrieval policies, tool boundaries, evaluations, and failure behavior.
Operations teams validate runbooks and feed production corrections back into the source.

The platform should make the safe path easy: versioned ingestion, metadata validation, permission-aware retrieval, citation support, evaluation gates, and progressive rollout. Content owners should not need to become retrieval engineers, but they must remain accountable for what their knowledge enables.

Start with the decisions the agent must support

A useful implementation sequence begins with a bounded workflow rather than a company-wide content crawl.

Select one agent use case with measurable business value and clear risk boundaries.
List the decisions and tool calls the agent must make.
Identify the minimum authoritative sources required for those decisions.
Add ownership, classification, version, and lifecycle metadata before indexing.
Build evaluation cases for correct answers, correct tool selection, refusal, and authorization.
Instrument retrieval and review failure patterns before expanding scope.

This approach creates evidence about where documentation improves outcomes and where purpose-built tools or structured APIs are still required. It also avoids turning a large, inconsistent content estate into an ungoverned agent dependency overnight.

Documentation is not a substitute for tools

The lesson is not that documentation should replace typed APIs or deterministic controls. Native tools remain essential for current state, validated mutations, and enforceable authorization. Documentation contributes semantics, operating context, and capability guidance.

Reliable enterprise agents need both. Tools provide constrained interaction with the live system. Governed knowledge helps the agent interpret that system and select the right interaction. The boundary between them should be explicit: documentation may inform a decision, but consequential actions still need server-side validation, least-privilege authorization, and auditable execution.

The strategic implication

Organizations that deploy agents without governing their knowledge layer are building automation on an unmanaged dependency. The model may be capable and the tools may be secure, yet the agent can still fail because its understanding of the product, policy, or operating environment is incomplete or stale.

Documentation becomes agent infrastructure when it influences planning and action. Infrastructure requires ownership, lifecycle management, access control, testing, telemetry, and rollback. Applying those disciplines to enterprise knowledge is what turns documentation search from a convenient feature into a trustworthy platform capability.

Sources

Juli 19, 2026

AI Code Review Needs a Platform Contract: Governing Instructions, Network Access, and Runners

AI-assisted code review is moving beyond a simple product toggle. Once a reviewer can read repository instructions, execute setup steps, access networks, and run on different compute environments, it becomes part of the software delivery platform. That changes the governance question from “Should we enable AI review?” to “Under which versioned and auditable contract may it operate?”

GitHub’s latest Copilot code review improvements make that shift visible. Copilot code review can now read instructions from the pull request’s head branch, consume additional instruction formats such as REVIEW.md, GEMINI.md, and CLAUDE.md, execute custom setup steps, run behind a configurable firewall, and use runner settings that are independent from the Copilot cloud agent.

These capabilities can improve relevance and reduce false positives. They also expand the execution surface around every review. Enterprises should therefore treat the configuration as platform policy rather than scattered repository preferences.

The reviewer now has an execution environment

Traditional static analysis usually operates under a relatively fixed contract: a known rule set evaluates a defined artifact and emits deterministic results. An AI reviewer behaves differently. Its output depends on context, instructions, available tools, network access, and the environment in which it runs.

AI review can work in an enterprise, but model selection is only one part of the operating model.

Four control planes now shape the result:

Instruction policy: which repository files influence the review and who may change them.
Setup policy: which commands and dependencies prepare the review environment.
Network policy: which external destinations the reviewer may reach.
Runner policy: where the review executes and which identities, secrets, and internal resources are available there.

If these controls evolve independently, two repositories using the same AI reviewer can have very different security and quality properties. A platform contract makes those differences explicit.

Instructions are executable review policy

Reading custom instructions from the head branch is useful because teams can test review guidance before merging it. It also creates a trust-boundary question: a pull request can propose code changes and changes to the instructions used to review that code.

This is not inherently unsafe. Pull requests already modify workflows, build scripts, and policy files. The important point is that instruction files deserve comparable governance.

A practical enterprise baseline should:

define which instruction formats are supported and where they may live;
assign CODEOWNERS to organization-critical instruction and agent-skill paths;
require human review when a pull request changes both application code and reviewer policy;
separate mandatory organization guidance from repository-specific context;
test instruction changes against representative pull requests before broad rollout;
retain the exact instruction revision with review evidence where audit requirements justify it.

Product teams still need room for domain-specific guidance. The platform team should own the contract, validation, and protected baseline, while repository owners maintain local context within those boundaries.

Setup steps belong to the software supply chain

Custom setup steps can make reviews substantially better. A reviewer may need generated code, dependency metadata, project-specific linters, or a compiled schema before it can understand a change.

However, setup steps are code execution. They can download dependencies, run scripts from the branch, and interact with credentials exposed to the runner. Treating them as harmless customization would recreate risks that enterprises already learned to manage in CI/CD.

The platform contract should require:

minimal and explicit setup steps;
pinned versions or immutable references for downloaded tools and reusable actions;
dependency caching that does not weaken provenance checks;
timeouts and resource limits;
no production credentials in the review environment;
clear separation between untrusted pull-request code and trusted setup logic;
logs that show what ran without exposing secrets.

Where setup behavior is common across many repositories, publish a maintained golden path instead of copying shell fragments. Teams can then inherit a reviewed baseline and add only the repository-specific preparation they need.

Network access needs an exception model

GitHub states that Copilot code review now runs behind a firewall by default and that its network rules can be configured independently from the Copilot cloud agent. This is an important default because review tasks often need less network access than general-purpose coding agents.

The right enterprise pattern is deny by default with narrow, observable exceptions. Typical allowed destinations may include an internal package repository, a schema registry, or an approved documentation endpoint. General internet access should require a documented reason.

Every exception should answer four questions:

Which destination is required?
Which review capability depends on it?
What data may leave the runner?
Who owns and periodically revalidates the exception?

The current GitHub notice also matters operationally: self-hosted runners do not support the code-review firewall. Organizations using self-hosted runners therefore need compensating controls at the network, workload, or runner-pool level. “Self-hosted” should not be interpreted as “automatically safer.”

Runner selection is a data-boundary decision

Separating runner configuration for Copilot code review from the Copilot cloud agent allows enterprises to match infrastructure to the task. It also prevents an overly broad agent configuration from becoming the default for review workloads.

Runner policy should be based on data classification and required connectivity:

use isolated hosted runners when repositories need no private network access;
use hardened self-hosted pools only when internal dependencies make them necessary;
keep runner images minimal and reproducible;
use short-lived credentials with repository-scoped permissions;
prevent persistence between review jobs;
separate runner pools for different trust zones.

Runner selection affects far more than performance. It determines what source code, internal services, credentials, and telemetry the review process can reach.

A platform contract for AI code review

A useful contract should be small enough for teams to understand and strict enough for governance teams to trust. It can be implemented as policy-as-code, reusable repository configuration, protected settings, and automated conformance checks.

The contract should define at least:

Purpose: AI review supplements human review and deterministic controls; it does not become an unmeasured release gate.
Instruction ownership: approved formats, protected paths, and required reviewers.
Execution policy: allowed setup operations, dependency provenance, resource limits, and secret handling.
Egress policy: default network posture, approved destinations, and exception ownership.
Runner policy: permitted runner classes by repository classification and connectivity need.
Evidence: the configuration version, runner class, policy result, and operational outcome retained for each review where needed.
Escalation: how developers report incorrect findings, missing context, or policy constraints that reduce review quality.

This contract should be versioned. A change to review instructions, setup behavior, firewall rules, or runner selection can alter both quality and risk. It deserves the same change-management discipline as a CI template or deployment policy.

Measure the operating model, not just adoption

Counting how many repositories enabled AI review says little about whether it works safely. Platform and engineering leaders need evidence across quality, reliability, and governance.

Useful measures include:

acceptance and dismissal rates for review findings;
time from finding to resolution;
setup-step failure and timeout rates;
repositories compliant with the approved runner and firewall baseline;
number and age of network exceptions;
frequency of instruction changes and their effect on finding quality;
cases where deterministic checks or human reviewers contradicted the AI reviewer.

These measures create a feedback loop. Teams can improve instructions and setup context while the platform team identifies systemic policy or reliability problems.

A pragmatic rollout sequence

Enterprises do not need to design the final governance model before gaining value. A controlled rollout can proceed in four steps:

Inventory: identify existing instruction files, setup workflows, runner types, and network dependencies.
Baseline: define one default contract with protected instructions, minimal setup, restricted egress, and an approved runner class.
Pilot: onboard a representative set of repositories and measure finding quality, setup reliability, and exceptions.
Scale: publish golden-path configuration, automate conformance checks, and review exceptions periodically.

This approach lets the organization learn without turning every repository into a separate governance experiment.

From feature configuration to delivery policy

GitHub’s new controls are useful because they expose the real shape of AI-assisted review: it is a configurable workload operating inside the delivery system. Instructions influence judgment. Setup steps execute code. Network rules define egress. Runners define the data boundary.

Enterprises now have to make this customization reviewable, reusable, and measurable.

Platform teams should provide the paved road: a versioned contract, secure defaults, approved extension points, and evidence that the reviewer operated within policy. Product teams should own the domain context that makes reviews useful. Security and engineering leadership should define the exceptions and success measures.

That division of responsibility allows AI code review to improve delivery quality without creating an invisible parallel control plane.

GitHub’s implementation details are described in the Copilot code review customization announcement, the code review environment documentation, the repository custom instructions guidance, and the firewall configuration guidance.

Juli 15, 2026

Pods Are Workers, Not Agents: Designing the Runtime Boundary for Enterprise Agent Platforms

Kubernetes Pods are excellent execution units. They provide scheduling, resource controls, networking, workload identity integration, and a natural boundary for security and observability.

That does not automatically make a Pod the right representation of an AI agent.

Enterprise agent platforms need to distinguish two concepts that are easy to collapse during early implementations: the logical agent and the runtime worker executing its current task. Treating them as the same object can work for prototypes and continuously running agents. At scale, it creates idle infrastructure, slow burst handling, fragmented identity, and weak lifecycle semantics.

The durable pattern is to let Kubernetes manage execution workers while an agent control plane manages agent identity, state, policy, placement, and lifecycle. Pods remain essential. They become workers rather than the agent itself.

Why one Pod per agent is an attractive first design

The one-agent-per-Pod model solves several real problems quickly.

A Pod provides a process and container isolation boundary.
A ServiceAccount gives the workload a Kubernetes identity.
NetworkPolicy and admission policy can constrain its environment.
CPU and memory requests make resource consumption schedulable.
Logs, metrics, and traces can be attributed to a workload instance.
Existing GitOps, deployment, and incident-response practices remain usable.

For a small number of high-value agents, those benefits may outweigh the overhead. The model is understandable and conservative. It uses boundaries that platform and security teams already know how to operate.

The problem appears when the organization assumes that the execution container is also the durable identity and lifecycle of the agent.

Agents do not behave like ordinary services

A typical service is expected to remain available and handle a continuing stream of requests. An agent may wake up for a task, run for seconds or minutes, wait for a human decision, delegate work to subagents, and then remain idle for hours.

These characteristics create a different workload shape:

Bursty demand: a single business event can fan out into many parallel agent tasks.
Long idle periods: logical agents may exist without needing compute.
External waiting: execution may pause for approval, data, or another system.
Variable duration: tasks range from short tool calls to extended research or coding sessions.
Delegated authority: an agent often acts on behalf of a user or workflow rather than only as itself.
Stateful continuation: a later execution may need to resume the same logical conversation or plan on a different worker.

Keeping one Pod alive for every logical agent reserves capacity for identities that are not doing work. Creating a fresh Pod for every short task can introduce startup latency and control-plane churn. Encoding state inside the Pod makes rescheduling and recovery harder.

The architectural question is therefore not whether Kubernetes should run agents. It is which responsibilities belong to Kubernetes and which belong to an agent-specific control plane.

The runtime boundary: agents, actors, and workers

A recent CNCF article describing kagent’s agent-substrate architecture illustrates this separation. Kubernetes continues to manage Pods, networking, storage, and compute. A higher-level control plane manages logical actors and places them onto a pool of execution workers.

In that model:

The logical agent has durable identity, ownership, policy, configuration, and state.
An agent task or actor instance represents a unit of active execution.
A worker is a sandboxed runtime capable of executing one or more assigned actors.
A worker pool defines capacity, runtime profile, isolation class, and placement characteristics.

Agent-substrate is one implementation, not a universal enterprise standard. Its value for platform design is the principle it demonstrates: logical lifecycle can be decoupled from Pod lifecycle without removing Kubernetes from the architecture.

Six contracts the control plane must preserve

Decoupling an agent from a Pod improves efficiency only if the platform preserves the controls that dedicated Pods made easy.

1. Durable agent identity

An agent needs an identity that survives worker replacement. That identity should identify the agent definition, tenant, owner, environment, risk tier, and approved capabilities.

The worker also needs its own workload identity. The two must not be confused. A worker identity proves which runtime is communicating with the platform. The agent identity determines which business permissions and policies apply to the assigned execution.

When an agent acts for a person, the authorization decision should include delegated user context with explicit scope and expiry. Copying a user’s full credentials into a worker is not delegation.

2. Execution leases

Placement should create a time-bound execution lease binding an agent task to a specific worker. The lease should include the agent identity, policy revision, tool permissions, state reference, deadline, and expected resource profile.

Leases make reassignment and failure handling explicit. If a worker disappears, the control plane can determine whether the task is safe to retry, must resume from a checkpoint, or requires human review.

3. Isolation classes

Sharing workers does not mean sharing trust. The platform needs multiple runtime profiles based on risk.

Low-risk, read-only tasks may use a warm multi-tenant worker pool.
Tasks handling confidential data may require stronger sandboxing and tenant-dedicated workers.
Agents with write access to production systems may require a dedicated Pod or ephemeral sandbox per execution.
Untrusted code execution may require gVisor, microVMs, or another hardened isolation boundary.

The scheduling decision should derive from policy. Developers should request a workload class rather than select a weaker runtime to reduce latency.

4. Policy attribution

Kubernetes policy usually sees the Pod, namespace, and ServiceAccount. A shared worker introduces another logical principal inside that boundary. The platform must propagate agent, tenant, task, and delegated-user context to every policy enforcement point.

Tool gateways, model gateways, data APIs, and egress proxies should authorize the logical execution, not merely trust the worker’s network location. Audit events should record both worker identity and agent identity so investigators can reconstruct who did what and where it ran.

5. Externalized state and checkpoints

Agent state should not depend on the continued existence of a worker Pod. Conversation state, plans, artifacts, approval state, and checkpoints need durable storage with tenant-aware encryption and retention controls.

Externalizing state allows the platform to release compute while an agent is idle and rehydrate it when work resumes. It also creates a controlled recovery point instead of treating the worker filesystem as an accidental system of record.

6. End-to-end observability

Pod-level telemetry remains necessary but is no longer sufficient. Operators need to follow a logical agent across workers and over time.

Every execution should carry stable correlation fields such as:

agent, tenant, task, session, and parent-task identifiers;
worker and worker-pool identity;
policy, prompt, model, and tool versions;
delegated user and approval references where permitted;
token, latency, tool-call, cost, and outcome signals;
checkpoint, retry, reassignment, and termination reasons.

This creates observability for the business execution rather than only for the container currently hosting it.

A reference enterprise architecture

A practical runtime separates responsibilities across four layers.

Agent control plane

The control plane stores agent definitions, ownership, policy, lifecycle, state references, and desired runtime class. It accepts tasks, decides placement, issues leases, tracks execution, and coordinates retries or resumptions.

Worker pools

Kubernetes Deployments or other controllers maintain warm capacity for defined execution profiles. Pools may differ by tenant, geography, accelerator, sandbox technology, network access, or data classification.

Shared platform gateways

Model, tool, MCP, data, and egress gateways enforce logical identity and policy. They keep privileged credentials out of agent code and provide consistent rate limits, approval checks, observability, and revocation.

Durable state and evidence

State services store checkpoints and artifacts. An evidence plane records immutable links between the agent definition, execution lease, policy decision, worker, model interaction, tool call, and outcome.

Kubernetes remains the infrastructure substrate. The agent control plane provides semantics Kubernetes was not designed to infer.

Multi-tenancy must shape worker placement

Worker utilization can improve dramatically when idle logical agents do not retain Pods. That benefit should not override tenant boundaries.

Platform teams should define placement rules covering:

whether tenants may share a worker process, Pod, node, or cluster;
which data classifications require dedicated runtime capacity;
how memory, filesystems, caches, and credentials are cleared between assignments;
whether agent-generated code can execute and under which sandbox;
which tools and destinations each pool can reach;
how noisy-neighbor behavior is detected and constrained;
where state and inference traffic may be processed geographically.

There is no single correct sharing boundary. The platform should offer a small set of reviewed isolation classes and make the selected class visible in cost, latency, and risk reporting.

When one Pod per agent is still the right answer

Decoupling should not become an objective by itself. A dedicated Pod remains a strong choice when:

the agent is continuously active or exposes a stable service endpoint;
startup latency is acceptable and the fleet is small;
the workload needs strong tenant or process isolation;
it runs untrusted code or privileged tools;
its memory and resource profile do not fit a shared pool;
existing Kubernetes controls provide sufficient lifecycle semantics;
the added agent scheduler would cost more to operate than it saves.

The mature platform supports more than one runtime pattern. It chooses the boundary based on workload behavior and risk rather than forcing every agent into the same optimization.

Measure the runtime as a platform product

Worker density is useful, but cost efficiency alone is an incomplete success measure. Track flow, reliability, isolation, and control together.

Task queue time and time to first execution
Warm-start and cold-start latency
Active versus idle worker utilization
Logical agents per worker and per isolation class
Checkpoint, resume, retry, and reassignment success rates
Policy denials and unauthorized cross-tenant attempts
State cleanup and credential revocation failures
Cost per successful agent task
Trace and audit coverage from task request to external side effect

A cheaper runtime that cannot explain an agent’s actions is not an enterprise improvement.

A staged adoption path

1. Separate identifiers before changing runtime

Introduce stable agent, task, tenant, and worker identifiers in the current platform. Propagate them through logs, traces, policy decisions, and tool calls. This exposes hidden coupling before a scheduler is introduced.

2. Externalize state

Move durable state and artifacts out of the Pod. Define checkpoint, retry, expiry, encryption, and deletion semantics. Test recovery from worker termination.

3. Add one low-risk worker pool

Select bursty, read-only tasks with clear resource limits. Compare queue time, utilization, cost, and operational effort with the dedicated-Pod baseline.

4. Add policy-aware placement

Introduce reviewed isolation classes and execution leases. Integrate logical identity with tool, model, data, and egress gateways. Exercise tenant separation and credential revocation.

5. Expand only with evidence

Move higher-risk agents after proving state hygiene, observability, rollback, and incident response. Keep dedicated Pods as an explicit option rather than treating them as a failed legacy design.

Pods should host work, not define the agent

The Pod remains one of the strongest execution boundaries available to cloud-native platforms. The mistake is asking it to carry semantics it does not own: durable agent identity, delegated authority, conversation lifecycle, human approval, and cross-execution state.

Enterprise agent platforms should model those concerns explicitly. Kubernetes can then do what it does best — schedule and isolate execution — while the agent control plane decides which logical work runs where, under whose authority, with which policy, and with what evidence.

That separation improves utilization, but its greater value is governance. It allows the platform to scale agents without losing the identity and accountability that production systems require.

Sources

Juli 13, 2026

The Agent Egress Boundary: Making Every AI Tool Call Enforceable and Observable

AI agents do not create risk only when they generate the wrong answer. They create operational risk when they turn that answer into an outbound action: calling an API, querying a search service, downloading content, opening a ticket, sending a message, or changing a production system.

Most enterprise controls still focus on the agent’s intent. Prompts, guardrails, and model policies describe what the agent should do. They do not guarantee which destinations the workload can reach, which request was sent, or whether an unapproved path was used.

That gap calls for an agent egress boundary: a platform-enforced control through which every external tool call must pass, combined with traceable evidence that links the call to the originating agent interaction.

Guardrails are necessary, but they are not enforcement

Prompt-level guardrails are useful for shaping behavior. They can tell an agent not to disclose sensitive information, not to call unknown services, or to request human approval before a consequential action. But those controls operate inside the reasoning path they are intended to constrain.

Production systems need an independent layer. If an agent is compromised through prompt injection, a poisoned tool response, a vulnerable dependency, or a simple implementation mistake, the network should still prevent access to destinations outside the approved contract.

The distinction is familiar from other areas of security:

application authorization expresses intended access;
network enforcement limits reachable destinations;
observability records what actually happened;
human approval controls high-impact exceptions.

No single layer is sufficient. Together, they create defense in depth.

The platform contract

An agent egress boundary should answer four questions for every outbound request:

Who initiated it? Identify the workload, agent, tenant, and user or workflow context.
Where is it going? Resolve the approved destination, protocol, port, and application-level route.
Was it allowed? Evaluate the call against a versioned policy rather than an application convention.
What evidence remains? Record a traceable decision without leaking secrets or sensitive payloads.

This turns outbound connectivity into a platform contract. An agent receives only the network access required by its tools, while the platform provides a consistent control and evidence plane.

A practical cloud-native pattern

A recent CNCF implementation demonstrates the core idea using NGINX, Kubernetes, and OpenTelemetry. NGINX acts as both the inbound reverse proxy and the outbound forward proxy for an agent workload. Network rules drop direct egress so the proxy becomes the only approved path. The NGINX OpenTelemetry module emits a span for each request, and an OpenTelemetry Collector forwards the evidence to observability or security systems.

The important principle is architectural: the boundary is not a library the agent may choose to call. It is the only network path available.

A production-oriented request flow can look like this:

A user or system invokes the agent through an authenticated gateway.
The gateway propagates a trace context and workload identity.
The agent selects a tool and issues an outbound request.
Kubernetes egress controls permit traffic only to the designated proxy.
The proxy evaluates destination, protocol, identity, and policy.
Allowed traffic is forwarded; denied traffic returns a controlled error.
OpenTelemetry records the decision and correlates it with the originating interaction.

The result is a chain of evidence from user request to external side effect.

Why Kubernetes NetworkPolicy alone is not enough

Kubernetes NetworkPolicy is a strong foundation. It can isolate workloads and restrict egress by IP block, port, and selected peers, provided the cluster’s network plugin enforces the policy. A default-deny egress policy should be the starting point for sensitive agent workloads.

However, many agent tools call dynamic external services over HTTPS. IP addresses change, destinations share infrastructure, and business rules are usually expressed in terms of domains, API routes, methods, or tool identities rather than static addresses.

That is why a layered design is useful:

NetworkPolicy or equivalent CNI controls ensure the workload can only reach the approved proxy and essential platform services.
The egress proxy enforces destination and application-aware rules.
Workload identity distinguishes agents and tenants without relying only on source IP.
OpenTelemetry provides correlated evidence for operations, security, and audit.

The network layer prevents bypass. The proxy layer understands enough context to make a useful decision.

Policy should follow the tool contract

Allowing an agent to reach an entire domain is often broader than the tool definition requires. A better policy starts with the declared tool contract.

For example, an incident-analysis agent may need to:

read selected observability APIs;
create, but not delete, incident tickets;
query a controlled knowledge source;
send notifications only to an approved channel;
never call arbitrary internet destinations.

The platform can translate that contract into an egress policy covering destination, method, route, identity, rate, and approval requirements. High-risk actions can be routed through a separate approval service rather than granted as normal network access.

This also creates a cleaner ownership model. Domain teams define which tools are necessary. Security teams define control requirements. Platform teams provide the reusable enforcement mechanism.

Observability must produce evidence, not surveillance

OpenTelemetry is well suited to correlating inbound interactions with outbound HTTP client activity. Standard HTTP span conventions provide consistent attributes for requests and responses, while trace context links multiple services into one transaction.

But recording everything is not automatically safe. Agent traffic can include credentials, personal data, customer information, prompts, and tool payloads. The audit plane therefore needs its own policy.

Useful evidence

trace and request identifiers;
agent, workload, tenant, and tool identity;
policy version and allow or deny decision;
destination service and approved route classification;
HTTP method and status class;
latency, retries, and byte counts;
model or agent configuration version;
human approval reference where required.

Data to avoid by default

authorization headers and API keys;
full request or response bodies;
raw prompts containing confidential data;
URL query parameters unless explicitly sanitized;
unbounded high-cardinality attributes.

The purpose is to prove and investigate behavior, not to create a second uncontrolled copy of sensitive data.

Controls that make the boundary credible

A proxy is only a boundary when bypass is demonstrably difficult. Platform teams should validate at least the following controls:

Default-deny egress: direct external connectivity fails.
DNS control: workloads cannot switch to an unmonitored resolver or exploit unexpected resolution paths.
IPv4 and IPv6 parity: policy applies consistently to both address families.
Protocol coverage: non-HTTP tools, WebSockets, streaming APIs, and message protocols have explicit handling.
TLS design: the organization decides where TLS terminates and what metadata can be inspected without undermining privacy.
Identity: decisions rely on authenticated workload identity, not only mutable labels or network location.
Fail-closed behavior: proxy, collector, or policy failures do not silently open direct access.
High availability: the control plane does not become an avoidable single point of failure.

These details determine whether the pattern is an architectural control or merely a useful demonstration.

Operational signals for platform teams

Once all tool traffic crosses the boundary, the same telemetry can improve reliability and cost control.

Useful service-level indicators include:

allowed and denied tool calls by agent and policy version;
unexpected destinations or repeated policy violations;
external dependency latency and error rates;
retry storms and rate-limit responses;
egress volume and estimated third-party API cost;
calls that required human approval;
trace gaps where an outbound action lacks an originating interaction.

This gives security and operations teams a shared view. The same denied request may indicate an attack, an outdated policy, or a legitimate new tool requirement.

A phased adoption plan

Inventory agent egress. Identify destinations, protocols, credentials, and business owners for each production tool.
Introduce observation first. Capture sanitized outbound traces to understand real behavior before enforcing a narrow policy.
Define tool-level contracts. Document approved destinations and actions rather than granting general internet access.
Apply default deny. Force a low-risk agent through the proxy and prove that direct egress fails.
Add policy-as-code. Version destination rules, ownership, exceptions, and approval conditions in Git.
Connect the audit plane. Send sanitized OpenTelemetry data to the organization’s observability and SIEM platforms.
Test failure modes. Validate DNS bypass, IPv6, proxy outage, collector outage, policy rollback, and certificate rotation.
Scale by platform product. Offer the boundary as a reusable golden-path capability rather than a custom design for every agent.

Conclusion

Enterprises should not have to trust that an AI agent will respect its network boundaries. Those boundaries should be enforced by the platform and evidenced through telemetry.

NGINX, Kubernetes, and OpenTelemetry show that the core pattern can be built from mature cloud-native components: default-deny connectivity, an application-aware egress proxy, and correlated traces. The exact implementation will vary, but the platform contract should remain consistent.

Every agent tool call should be attributable, policy-checked, observable, and reversible where the downstream system allows it. That is the difference between experimenting with autonomous software and operating it responsibly.

Sources and further reading

Juli 13, 2026

The AI-Native Platform Contract: Expanding Golden Paths Beyond Application Delivery

Platform engineering earned its place by turning application delivery into a repeatable product. Golden paths combined infrastructure, security, deployment, and operational standards into a paved route that developers could use without learning every platform detail.

AI-native workloads do not invalidate that model. They expose where it stops too early.

A conventional golden path typically starts with source code and ends with a running service. An AI-native product depends on a wider chain: governed data, accelerator capacity, models and prompts, evaluation evidence, inference controls, agent identities, external tools, and continuous cost and risk feedback. If each of those capabilities arrives through a separate specialist portal, the organization has not created an AI platform. It has created another integration problem.

The next platform contract should therefore extend the golden path rather than build a parallel AI silo. The goal is not to hide every AI decision behind automation. It is to make safe defaults easy, exceptions explicit, and every promoted artifact traceable.

The application delivery contract is no longer enough

Platform Engineering 1.0 concentrated on a familiar delivery unit: an application packaged as a container, deployed through a pipeline, and operated with standard observability and security controls. That remains valuable, but AI changes both the workload and its consumers.

ML engineers need experiment tracking, model registries, feature and data access, and specialized compute. Application teams need stable inference endpoints and predictable latency. Security teams need controls for model provenance, prompt injection, data leakage, and non-human identities. FinOps teams need to attribute expensive training and inference usage. AI agents themselves become platform consumers that request tools, credentials, and runtime actions.

The CNCF discussion of evolving platform engineering for AI-native workloads captures this expansion through capabilities such as GPU and TPU allocation, model serving, MCP gateways, agentic guardrails, embedded FinOps, and policy-driven governance. The important organizational point is that these should not become an isolated platform owned by a small AI team. They should become extensions of the same product model, interfaces, and control philosophy used by the enterprise platform.

Define a platform contract, not a catalog of tools

A platform contract describes what a product team can request, what evidence it must provide, what the platform guarantees, and which controls are automatically applied. It is stronger than a service catalog entry and more flexible than a single mandatory implementation.

For an AI-native workload, that contract should cover at least six dimensions.

1. Governed data access

The path should make data classification, residency, retention, and permitted use visible before a workload reaches production. A request for a dataset should resolve to an approved identity, purpose, environment, and audit trail. The platform can automate access, but the product team remains accountable for whether the data is appropriate for the use case.

2. Compute and accelerator intent

Teams should request capabilities rather than hard-code a particular GPU model into every manifest. The contract can express workload class, memory, performance objective, duration, geographic constraints, and cost ceiling. Kubernetes mechanisms such as Dynamic Resource Allocation can support more structured resource claims, but the platform still needs policy for quotas, scarcity, preemption, and approved hardware profiles.

3. Model, prompt, and artifact provenance

Container images are not the only production artifacts. The platform must track model version, source, license, evaluation result, prompt bundle, retrieval configuration, tool definitions, and deployment policy. Promotion should be based on an immutable set of linked artifacts, not a model name copied into an environment variable.

4. Evaluation as a release gate

AI quality is probabilistic and context-dependent. A successful build does not prove production fitness. Golden paths should provide standard evaluation suites for task quality, safety, latency, robustness, and cost. Teams can add domain-specific tests, while the platform supplies the execution environment, evidence format, thresholds, and promotion workflow.

5. Runtime identity and guardrails

An inference service or autonomous agent needs a workload identity, scoped data access, approved tools, network boundaries, and observable policy decisions. The contract should distinguish a human user’s authority from an agent’s delegated authority. It should also define what happens when a model, tool, or policy is unavailable rather than allowing silent fallback to an uncontrolled path.

6. Cost and operational accountability

AI infrastructure introduces different cost behavior from ordinary stateless services. Training jobs can consume scarce capacity in bursts. Inference cost depends on model choice, token volume, batching, cache efficiency, and service-level objectives. Cost attribution and budgets should therefore be part of provisioning and release decisions, not a dashboard reviewed after the invoice arrives.

What an AI-native golden path looks like

A useful golden path follows the product lifecycle rather than exposing a collection of disconnected infrastructure forms.

Declare the workload. The team selects an archetype such as batch training, online inference, retrieval-augmented generation, or tool-using agent. It declares data class, expected scale, latency objective, risk tier, and ownership.
Provision an isolated workspace. The platform creates namespaces, identities, network boundaries, secrets references, storage, accelerator claims, quotas, and standard telemetry.
Develop with approved building blocks. Teams consume versioned model endpoints, registries, feature services, MCP or tool gateways, and evaluation templates through stable APIs.
Produce evidence. CI records model and data lineage, software dependencies, evaluation results, policy decisions, security findings, and predicted operating cost.
Promote as a release set. GitOps promotes the linked application, model, prompt, policy, and tool configuration together. A rollback restores the complete known-good set.
Operate with continuous feedback. Runtime telemetry covers service health, model quality indicators, policy denials, data drift, tool calls, accelerator utilization, and unit economics.

This lifecycle gives specialists room to innovate without forcing every product team to assemble the control plane themselves.

Avoid the separate AI platform trap

A dedicated AI enablement team may be necessary, but a separate delivery system should not be the default. Parallel identity models, pipelines, policy engines, and observability stacks increase cost and weaken governance. They also create a handoff between application engineers and AI specialists exactly where the product needs shared accountability.

A better operating model separates platform ownership by capability while preserving one product contract:

The core platform team owns common interfaces, workload identity, delivery workflows, policy integration, and the developer experience.
The AI platform capability team owns model-serving patterns, evaluation services, accelerator profiles, registries, and AI-specific runtime controls.
Data teams own governed data products and access semantics.
Security and risk teams define control objectives and approval boundaries as policy and evidence requirements.
Product teams own business fitness, domain evaluations, production outcomes, and accepted residual risk.

The teams collaborate through APIs, schemas, policy bundles, and service-level objectives rather than tickets and undocumented exceptions.

Measure whether the contract creates value

An AI-native platform should not be measured by the number of services in its catalog. Measure whether teams can deliver trustworthy outcomes faster.

Time from approved use case to first governed experiment
Time from candidate model to production release
Percentage of releases with complete model, data, prompt, and policy provenance
Evaluation failure escape rate
Percentage of agent tool calls using approved identities and gateways
Accelerator utilization and queue time by workload class
Inference cost per business transaction
Rollback time for a complete AI release set
Adoption and exception rates for each golden path

These metrics reveal whether the platform improves flow and control together. High adoption with slow delivery signals an overloaded path. Fast delivery with weak evidence signals unmanaged risk.

A practical 90-day starting point

Do not begin by designing a universal AI platform. Choose one real workload and use it to define the minimum viable contract.

Days 1–30: map the lifecycle

Select one representative AI product with a committed owner.
Map every artifact, identity, environment, approval, and operational dependency.
Classify which existing platform capabilities can be reused and where AI-specific gaps exist.
Define the workload’s risk tier, evaluation evidence, and cost objectives.

Days 31–60: build one vertical path

Create one workload template and governed workspace.
Connect model and prompt provenance to the existing GitOps release flow.
Add standard telemetry, policy checks, evaluation execution, and cost labels.
Document escape hatches with owners, expiry dates, and review requirements.

Days 61–90: prove and productize

Run a production-like release and rollback.
Measure lead time, evidence completeness, operational quality, and unit cost.
Interview the platform consumers and remove unnecessary steps.
Publish the contract as versioned schemas, APIs, examples, and service-level expectations.

The platform becomes the organizational control surface

AI-native platform engineering is not a race to add GPUs and model registries to an internal portal. It is the work of extending a proven product contract across a more complex value stream.

The strongest platforms will preserve what already works: product thinking, self-service, golden paths, policy automation, and composable cloud-native interfaces. They will add the missing contracts for data, models, evaluations, agents, specialized compute, and cost. That approach avoids a new silo while giving teams a credible path from experimentation to governed production.

Sources

Juli 13, 2026

OpenTelemetry Fleet Management: Why OpAMP Belongs in the Enterprise Observability Control Plane

OpenTelemetry can standardize how an enterprise collects and exports telemetry, but standardization alone does not make the collection layer operable.

At small scale, teams can manage Collector configuration through deployment manifests, virtual machine tooling, or a handful of automation scripts. At enterprise scale, the fleet becomes heterogeneous: Kubernetes DaemonSets, centralized gateways, virtual machines, laptops, point-of-sale devices, edge systems, and embedded environments. Different teams deploy the agents, while a central observability group remains accountable for data quality and service reliability.

That creates a control-plane problem. The organization needs to know which agents exist, what they are running, whether their configuration is current, whether a rollout succeeded, and how to recover without losing telemetry. The Open Agent Management Protocol, or OpAMP, provides a vendor-neutral protocol for that management relationship.

The strategic point is bigger than remote configuration. OpAMP belongs in the enterprise observability control plane because telemetry collection is production infrastructure. It needs identity, desired state, health feedback, controlled rollout, auditability, and rollback just like any other critical fleet.

Telemetry standardization exposes the management gap

OpenTelemetry adoption often begins with a sensible objective: remove proprietary instrumentation and normalize traces, metrics, and logs around open standards. The Collector becomes a flexible processing and export layer between workloads and one or more observability backends.

Success creates a new operating challenge. Collector configurations diverge by environment and team. Components run different versions. Credentials rotate at different times. Pipelines fail silently or begin dropping data. A change that looks safe in a development cluster can overload a regional gateway or remove a critical security log source.

GitOps helps with Kubernetes-managed Collectors, but it does not automatically cover agents on virtual machines, workstations, edge locations, or devices. It also tells the platform what was declared, not necessarily what every agent loaded or whether the resulting pipeline is healthy.

An enterprise control plane must connect declared intent with runtime evidence across the entire fleet.

What OpAMP actually provides

The OpenTelemetry specification describes OpAMP as a network protocol for remotely managing large fleets of data collection agents. It is vendor-agnostic and supports communication between an OpAMP server and clients associated with managed agents.

Core capabilities include:

Reporting agent identity, description, version, capabilities, and health
Receiving and acknowledging remote configuration
Reporting effective configuration and configuration status
Reporting package or component inventory
Receiving package update offers where the implementation supports them
Establishing bidirectional management communication over WebSocket or HTTP

OpAMP is a protocol, not a complete fleet-management product. It does not decide who may approve a production configuration, how rollout rings are selected, what policy is acceptable, or how a failed change should be escalated. Those are control-plane responsibilities that an enterprise platform must implement around the protocol.

The specification is currently marked beta. Newer Collector management work, including an alpha OpAMP Gateway Extension discussed by the CNCF, is promising but should be treated according to its maturity. Protocol adoption and production rollout should be deliberately separated from assumptions about experimental components.

The observability control plane needs a clear contract

A useful control plane maintains two views of every managed agent.

Desired state describes what the organization intends: approved Collector version, component set, configuration bundle, certificates, export destinations, and rollout assignment.

Observed state describes what the agent reports: identity, capabilities, effective configuration, health, errors, version, and last successful communication.

The difference between these views is configuration drift. Drift is not automatically a failure. An agent may be offline, a rollout may be paused, or a local emergency override may be permitted. The control plane should classify the difference, assign an owner, and decide whether to reconcile, roll back, or escalate.

This is why OpAMP complements rather than replaces GitOps. Git remains the reviewable source of approved configuration. OpAMP provides a standardized delivery and feedback channel for agents that cannot all be managed through the same deployment mechanism.

A reference enterprise architecture

A practical architecture separates policy, rollout orchestration, and protocol transport.

Configuration repository. Versioned Collector templates, component allow lists, routing policy, environment overlays, and rollout metadata are reviewed through pull requests.
Build and validation service. Every bundle is parsed, semantically validated, policy-checked, and tested against representative telemetry before promotion.
Fleet inventory. The platform records agent identity, owner, environment, workload class, capabilities, current version, desired version, and health.
Rollout controller. A change is assigned to cohorts, advanced through rings, paused on thresholds, and linked to an immutable configuration revision.
OpAMP server. The server communicates desired state to clients and receives acknowledgements and status. It should not become the only system of record for policy decisions.
Managed agents. Collectors or supervisors authenticate to the control plane, apply supported changes, and report effective state and health.
Control-plane observability. The management system emits its own metrics, logs, and traces to an independent path so a fleet failure remains visible.

This architecture keeps configuration governance in familiar enterprise workflows while using OpAMP for standardized fleet interaction.

Identity is the first security boundary

A remote management channel can change what telemetry is collected, where it is sent, and which components execute. It is therefore a high-value security boundary.

Each client needs a stable identity tied to an owner and expected environment. Transport encryption is necessary but not sufficient. The server must authorize what that identity may receive, which cohort it belongs to, and whether it can accept sensitive configuration.

Recommended controls include:

Mutual authentication with short-lived, automatically rotated credentials
Per-agent or narrowly scoped workload identities rather than shared fleet secrets
Authorization by tenant, environment, geography, and workload class
Signed or integrity-protected configuration artifacts
Strict separation between configuration authors, approvers, and rollout operators
Audit records linking every server instruction to a reviewed revision and actor
Egress restrictions so agents communicate only with approved management and telemetry endpoints
Safe local behavior when the management server is unavailable

The server also needs protection from compromised agents. Rate limits, message-size limits, tenant isolation, replay resistance, input validation, and anomaly detection should be part of the threat model.

Configuration rollout should look like progressive delivery

Collector configuration can affect the visibility of an entire production estate. Treating a fleet-wide change as a simple push is an operational risk.

A safer workflow uses progressive rollout rings:

Validation. Parse the configuration, resolve components, verify endpoints, run policy checks, and exercise representative telemetry.
Development cohort. Apply the revision to disposable or low-risk agents and verify configuration acknowledgement.
Canary cohort. Select a small production group that represents important environments and traffic patterns.
Regional or workload rings. Expand only while health, drop rate, queue pressure, and backend load remain within thresholds.
Fleet completion. Record coverage and identify offline or incompatible agents as explicit exceptions.
Rollback. Restore the last known-good revision automatically when defined safety conditions fail.

A rollout is not successful because the server sent a configuration. It is successful when the intended agents report the expected effective state and the telemetry pipeline remains healthy.

Observe the observability fleet

The Collector layer is part of the monitoring system, so its management telemetry must not disappear into the same failure domain.

Track at least:

Active, offline, unknown, and quarantined agents
Desired-versus-effective configuration drift
Configuration acknowledgement and failure rates
Rollout duration and rollback frequency
Agent version and component-version distribution
Telemetry receive, drop, retry, queue, and export-failure rates
Credential age and failed authentication attempts
Management-channel latency and reconnect rate
Coverage by business service, environment, and data type

Business-level objectives matter as well. How quickly can the organization deploy a new security log source? How long does it take to revoke a compromised exporter credential? What percentage of critical services has a healthy, policy-compliant collection path?

An operating model for shared ownership

Fleet management spans organizational boundaries. Clear ownership prevents the control plane from becoming either an unresponsive central bottleneck or an uncontrolled self-service system.

The observability platform team owns the protocol service, supported agent profiles, configuration schemas, rollout automation, and service-level objectives.
Security owns control objectives, management-plane threat modeling, credential requirements, and sensitive destination policy.
Service and infrastructure teams own agent coverage, local dependencies, and declared business criticality.
Backend owners publish capacity constraints and compatibility requirements.
A change advisory model is encoded through risk tiers, automated evidence, and approval rules rather than a universal manual meeting.

Teams should be able to request supported pipelines and processors through a controlled interface. They should not need permission for every low-risk change, but they also should not be able to redirect enterprise telemetry to an arbitrary endpoint.

A staged adoption plan

Phase 1: establish inventory and evidence

Enumerate Collector deployments and other managed agents.
Assign ownership, environment, and criticality.
Define a small set of supported configurations and component versions.
Measure current drift, rollout time, and blind spots before adding remote control.

Phase 2: introduce OpAMP in read-oriented mode

Connect a non-critical cohort.
Collect agent descriptions, versions, effective configuration, and health.
Validate identity and tenant boundaries.
Compare observed state with the Git-approved desired state.

Phase 3: controlled configuration delivery

Enable remote configuration for one standardized agent profile.
Use signed revisions, canary rings, automated thresholds, and rollback.
Exercise server outage, invalid configuration, expired credentials, and incompatible-agent scenarios.

Phase 4: expand deliberately

Add heterogeneous environments and additional agent capabilities.
Integrate package updates only after configuration delivery is reliable.
Publish service objectives and an exception process.
Keep experimental extensions behind explicit maturity and risk gates.

Standard telemetry needs standard operations

OpenTelemetry solves an important portability problem, but enterprises also need a portable way to operate the collection fleet. OpAMP creates the protocol foundation for that control plane.

The durable design is not a central server that can push arbitrary files. It is a governed system that connects reviewed intent to agent identity, progressive rollout, effective state, health evidence, and safe recovery. Organizations that build those capabilities can scale OpenTelemetry without replacing proprietary telemetry agents with a new collection layer that is open but operationally opaque.

Sources

Juli 10, 2026

GitOps for AI Agents: Why Prompts, Tools, and Policies Belong in Your Platform Repository

AI agents are increasingly moving from experiments into production workflows. They can inspect systems, call tools, change infrastructure, open pull requests, and trigger operational actions. Yet many teams still manage the most important parts of an agent—its system prompt, tool permissions, output contract, and safety rules—as scattered text in notebooks, environment variables, or application code.

That is not just inconvenient. It is a governance problem.

If agent configuration influences production behavior, it should be managed like any other form of production configuration: declarative, versioned, reviewed, testable, and reversible. This is where GitOps becomes relevant—not as another fashionable label, but as a practical operating model for agentic systems.

Agent configuration is production behavior

For a conventional service, teams already treat deployment manifests, network policies, resource limits, and feature flags as controlled artifacts. An AI agent adds another behavioral layer:

the system prompt defines role, boundaries, and decision priorities;
the tool list determines which actions the agent can perform;
the output schema defines what downstream systems may trust;
policy bundles decide which actions are allowed, denied, or escalated;
model and routing settings affect cost, latency, and risk;
confidence and blast-radius thresholds determine when a human must intervene.

A change to any of these elements can alter production outcomes without changing a single line of traditional application code. Treating them as informal configuration creates an audit gap: teams may know which container image ran, but not which instructions or tool permissions shaped the agent’s decision.

What GitOps adds

The OpenGitOps principles describe desired state as declarative, versioned and immutable, automatically pulled, and continuously reconciled. Applied to agents, these principles create a clear chain from intent to runtime behavior.

A practical model looks like this:

Agent configuration is stored in Git as structured data.
A pull request shows the exact behavioral change.
Automated checks validate schemas, policies, permissions, and evaluation results.
Reviewers approve the change based on ownership and risk.
A GitOps controller reconciles the approved state into the runtime platform.
Telemetry confirms which version is active and how it behaves.
A rollback restores the last known-good configuration when required.

This is already being applied in real cloud-native agent platforms. In a CNCF case study from Orange Innovation, each agent’s system prompt, tool list, and output schema is represented as a Kubernetes Custom Resource and reconciled from Git through Argo CD. Safety policies live in the same repository, making promotion code-reviewed, auditable, and reversible.

What should live in Git?

The goal is not to put every piece of runtime context into a repository. Git should contain the stable desired state that governs the agent.

Good candidates

system prompts and instruction templates;
allowed and denied tool definitions;
input and output schemas;
policy-as-code bundles;
model selection and fallback rules;
human-approval thresholds;
resource limits and deployment settings;
evaluation datasets and acceptance thresholds;
ownership metadata and escalation routes.

What should not live in Git?

API keys, tokens, and credentials;
personal or customer-sensitive conversation data;
short-lived runtime context;
unfiltered model traces containing confidential data;
mutable operational state that belongs in a database or event stream.

Secrets should be referenced through a secret-management system. Dynamic context should be retrieved through controlled tools with explicit identity, authorization, and audit trails.

An illustrative Kubernetes resource

Kubernetes Custom Resources provide one possible way to model agent desired state. The following example is illustrative rather than a proposed standard:

apiVersion: agents.platform.it-stud.io/v1alpha1
kind: AgentConfiguration
metadata:
  name: incident-reviewer
spec:
  promptRef: prompts/incident-reviewer-v12
  modelPolicy:
    primary: approved-enterprise-model
    fallback: approved-low-latency-model
  tools:
    allow:
      - read-observability-data
      - create-incident-ticket
    deny:
      - execute-production-change
  outputSchemaRef: schemas/incident-review-v3.json
  policyBundleRef: policies/soc-reviewer-v8
  humanApproval:
    requiredFor:
      - customer-facing-assets
      - identity-systems
      - actions-above-blast-radius-threshold

The value is not the YAML itself. The value is that the desired behavior becomes visible, reviewable, and reconcilable. A platform controller can translate this resource into runtime configuration while policy engines validate what teams are allowed to change.

The pull request becomes a governance control

A prompt review should not be treated like a copy-editing exercise. It is closer to reviewing infrastructure or authorization policy.

Different changes need different reviewers:

domain owners review whether instructions reflect the intended business process;
platform teams review runtime, deployment, and operational impact;
security teams review tool permissions, policy rules, identity, and blast radius;
AI engineers review model behavior, schemas, and evaluation results.

Branch protection and CODEOWNERS can turn this responsibility model into an enforceable workflow. A tool-permission change may require security approval, while a wording clarification within an existing boundary may only require the domain owner.

CI must test behavior, not just syntax

Schema validation is necessary but insufficient. An agent configuration can be valid YAML and still create unsafe or ineffective behavior.

A useful CI pipeline should combine:

schema and policy validation;
checks for forbidden tools or excessive permissions;
prompt-injection and adversarial test cases;
regression evaluations against representative scenarios;
cost and latency budgets;
output-schema conformance;
evidence that required human escalation still occurs.

The result should be an evaluation report attached to the pull request. Reviewers then see not only what changed, but how the agent’s measured behavior changed.

Deployment needs progressive delivery

GitOps makes rollback possible, but production agent changes should still be introduced gradually. A prompt or policy update can pass offline evaluations and fail under real operational conditions.

Platform teams can apply familiar delivery patterns:

shadow mode, where the new version makes decisions without executing them;
canary rollout to a limited workload or user group;
automatic rollback on quality, safety, latency, or cost regression;
version labels in traces so behavior can be tied to the exact Git revision;
human approval for changes that expand tool access or blast radius.

This is where agent operations begin to look less like prompt experimentation and more like mature platform engineering.

A practical operating model

Teams do not need a new organizational silo for every agent. They need clear contracts between existing responsibilities.

Domain teams own desired outcomes and business constraints.
AI engineering owns agent contracts, evaluations, and model behavior.
Platform engineering owns the runtime, GitOps reconciliation, observability, and deployment controls.
Security and risk own policy requirements, privileged actions, and evidence.

Machine-readable contracts—schemas, policies, Custom Resources, and evaluation thresholds—reduce coordination overhead. Teams can evolve their area without relying on undocumented meetings or hidden configuration.

A 30-day starting plan

Inventory: identify production agents and locate their prompts, tools, policies, and schemas.
Structure: move stable behavioral configuration into a versioned repository without migrating secrets or sensitive runtime data.
Protect: add CODEOWNERS, branch protection, and approval requirements for high-risk fields.
Validate: introduce schema checks, policy tests, and a small regression evaluation suite.
Reconcile: automate deployment through an existing GitOps controller or equivalent reconciliation process.
Observe: attach configuration version, model version, tool calls, cost, latency, and escalation outcomes to telemetry.
Roll back: test restoration of the last known-good configuration before the first production incident.

Conclusion

AI agents should not be governed through scattered prompts and tribal knowledge. The configuration that shapes their behavior belongs in the same disciplined operating model used for other production systems.

GitOps provides a practical foundation: declared intent, version history, peer review, automated validation, continuous reconciliation, and fast rollback. Combined with policy-as-code, behavioral evaluations, progressive delivery, and human approval boundaries, it gives platform teams a credible way to scale agentic systems without losing control.

The core principle is simple: if a configuration change can alter what an agent is allowed to decide or do, it deserves the same engineering rigor as a production code change.

Sources and further reading

Mai 12, 2026Mai 12, 2026

The .de DNSSEC Meltdown: What Platform Teams Can Learn from Germany’s TLD Outage

TL;DR — On May 5 2026, DENIC pushed broken DNSSEC signatures into the .de zone. Because DNSSEC validation is a strict chain-of-trust model, every validating resolver on the planet began returning SERVFAIL for all .de domains. Millions of websites, APIs, and mail servers went dark. Resolvers that had deployed Serve-Stale (RFC 8767) and Negative Trust Anchors (RFC 7646) recovered within minutes; everyone else waited hours. This article breaks down the incident, the mitigation patterns, and the concrete steps platform teams should take so a single TLD mistake doesn’t take down their stack.

What Happened on May 5, 2026

At approximately 10:42 UTC on Monday, May 5, monitoring dashboards across Europe lit up. DNS resolution for .de domains — one of the world’s largest country-code TLDs, consistently ranking in the Top 5 at Cloudflare Radar — started failing en masse. The root cause: DENIC, the registry operator for .de, had published DNSSEC signatures that did not match the zone’s active Zone Signing Key (ZSK).

The timing was no coincidence. The faulty signatures surfaced during a scheduled ZSK rotation — one of the most operationally sensitive windows in DNSSEC key management. A misconfiguration in the signing pipeline meant that the new signatures were generated with a key that validating resolvers could not verify against the published DS records in the root zone. The result was catastrophic: the entire .de chain of trust was broken.

Within minutes, every DNSSEC-validating resolver worldwide — including Cloudflare’s 1.1.1.1, Google’s 8.8.8.8, and Quad9’s 9.9.9.9 — began returning SERVFAIL for queries to .de domains. Non-validating resolvers continued to work, which created a confusing split-brain situation where some users could reach German websites and others couldn’t, depending on their configured resolver.

The DNSSEC Chain of Trust: One Link Breaks, Everything Falls

To understand why a single registry mistake can have such a massive blast radius, you need to understand how DNSSEC validation works.

DNSSEC adds cryptographic signatures to DNS records. Resolvers verify these signatures by walking a chain of trust from the root zone (.) down through the TLD (.de) to the individual domain (example.de). Each level delegates trust to the next via DS (Delegation Signer) records. If any link in this chain produces an invalid signature, a validating resolver must return SERVFAIL. That’s not a bug — it’s the design. DNSSEC was built to prevent cache poisoning, and treating unverifiable answers as failures is the entire point.

The double-edged nature of this design becomes painfully clear during operator errors at the TLD level. When DENIC’s signatures broke, it wasn’t just one domain that failed — it was every single .de domain, regardless of whether the individual domain owner had done everything right. The TLD is a single point of cryptographic failure for all domains beneath it.

ZSK/KSK Rotation: The Critical Window

DNSSEC uses two types of keys: the Key Signing Key (KSK), which signs the DNSKEY RRset, and the Zone Signing Key (ZSK), which signs the actual zone data. ZSK rotations happen more frequently and involve a carefully choreographed dance: pre-publish the new key, wait for caches to expire, sign with the new key, remove the old one. Get any step wrong — wrong timing, wrong key reference, stale DS record — and you shatter the chain of trust. This is exactly what happened with .de.

How Major Resolvers Responded

The incident provided a real-world stress test for two mitigation techniques that the DNS community has been advocating for years: Serve-Stale and Negative Trust Anchors.

Serve-Stale (RFC 8767)

Serve-Stale allows a resolver to return expired (stale) cached records instead of failing with SERVFAIL when it cannot fetch a fresh, valid answer from upstream. Cloudflare’s 1.1.1.1 had Serve-Stale enabled, and their detailed incident report showed that users hitting warm caches continued to get working answers for .de domains — stale data, but functional. For most use cases (websites, APIs, mail routing), a stale A or AAAA record from five minutes ago is infinitely better than SERVFAIL.

The limitation: Serve-Stale only works if the record was previously cached. Cold caches — new queries for domains the resolver hadn’t seen recently — still failed. And once stale TTLs expired (typically capped at 1–3 days depending on implementation), even warm caches would stop serving.

Negative Trust Anchors (RFC 7646)

Negative Trust Anchors (NTAs) are the emergency brake for DNSSEC. An NTA tells a resolver: „Stop validating DNSSEC for this specific domain or zone.“ When applied to .de, it effectively disables signature verification for the entire TLD, allowing queries to resolve normally — at the cost of losing DNSSEC protection.

Cloudflare, Google, and Quad9 all deployed NTAs for .de within the first hour of the incident. This was the fastest path to restoring service for end users. The NTAs were removed once DENIC republished correct signatures later that day.

The Third Option: Disabling DNSSEC Validation Entirely

Some smaller operators chose the nuclear option: disabling DNSSEC validation on their resolvers entirely. This restored service for all domains immediately but removed cryptographic protection for every zone, not just the broken one. This is the equivalent of disabling your firewall because one rule is misconfigured — it works, but the security implications are severe. NTAs are strictly preferable because they scope the trust bypass to the affected zone.

The Amplification Problem

DNS outages create a vicious feedback loop. When resolvers return SERVFAIL, clients retry — aggressively. Applications retry. Browsers retry. Stub resolvers retry. Monitoring systems fire off their own queries. Cloudflare reported a 10x spike in query volume for .de during the incident, as retry storms amplified the load on authoritative servers and resolvers alike.

This client-retry amplification is a well-known pattern in distributed systems, but it’s especially brutal in DNS because retries happen at multiple layers simultaneously. It delays recovery because even after the root cause is fixed, the query flood continues until retry backoffs settle.

Parallels to Prior TLD Outages

The .de incident wasn’t the first time a TLD’s DNSSEC misconfiguration caused widespread outages. In 2024, New Zealand’s .nz experienced a similar DNSSEC signing failure that took down domains across the country. Sweden’s .se has had its own DNSSEC-related incidents. Each time, the pattern is the same: a key management error at the TLD level cascades into a nationwide or zone-wide outage, and the community rediscovers that DNSSEC’s strict validation model trades availability for integrity.

The lesson keeps repeating because the operational complexity of DNSSEC key management is genuinely hard, and the failure mode is binary: it either validates or it doesn’t. There’s no graceful degradation built into the protocol itself.

Platform Engineering Lessons

If you’re running a platform team — especially one operating in the EU — the .de incident should be a wake-up call. DNS is deeply embedded in every layer of a modern cloud-native stack: ExternalDNS syncs records, cert-manager validates domain ownership via DNS-01 challenges, Ingress controllers rely on DNS routing, service meshes resolve endpoints. A DNS outage isn’t just „websites are down“ — it can break certificate issuance, deployment pipelines, service discovery, and monitoring.

1. Monitor DNSSEC Validation, Not Just Resolution

Most teams monitor whether DNS resolution works. Few monitor whether DNSSEC validation is healthy. Set up checks that specifically test DNSSEC signature validity for your critical domains and their parent zones. Tools like DNSViz, Zonemaster, and RIPE Atlas probes can automate this. Alert on validation failures before your users notice.

2. Implement a Multi-Resolver Strategy

Don’t depend on a single upstream resolver. Configure failover across multiple providers: Cloudflare (1.1.1.1), Google (8.8.8.8), Quad9 (9.9.9.9). Each operator has different NTA deployment speeds and Serve-Stale configurations. During the .de incident, the window between „Cloudflare deployed NTA“ and „smaller ISP resolvers deployed NTA“ was measured in hours. A multi-resolver setup lets you ride the fastest responder.

3. Deploy Serve-Stale in Your Own Resolvers

If you run local resolvers (CoreDNS, Unbound, BIND), enable Serve-Stale. In CoreDNS, this means configuring the cache plugin with serve_stale. In Unbound, set serve-expired: yes with appropriate serve-expired-ttl and serve-expired-client-timeout values. This single configuration change is your best passive defense against upstream DNSSEC failures.

# Unbound example
server:
    serve-expired: yes
    serve-expired-ttl: 86400
    serve-expired-client-timeout: 1800

# CoreDNS example
.:53 {
    forward . 1.1.1.1 8.8.8.8 9.9.9.9
    cache 3600 {
        serve_stale 86400
    }
}

4. Treat DNS as a Critical Dependency in Your Architecture

Map out every component in your stack that depends on DNS resolution. ExternalDNS, cert-manager (DNS-01 challenges), Ingress controllers, external API calls, webhook endpoints, OAuth/OIDC provider discovery — all of these break when DNS breaks. Document these dependencies and include DNS failure scenarios in your chaos engineering practice.

5. Build a DNS Incident Response Playbook

Your runbook should5 include:

Detection: Automated alerts for DNSSEC validation failures and elevated SERVFAIL rates
Triage: Is the issue local, resolver-level, or TLD-level? Use dig +dnssec and delv to isolate
Mitigation: Pre-approved steps to deploy NTAs on local resolvers, switch upstream resolvers, or enable Serve-Stale
Communication: Templates for status page updates that explain DNS issues to non-technical stakeholders
Recovery: Validation that DNSSEC signatures are correct before removing NTAs

6. NIS2 and DORA: DNS Resilience Is Now a Compliance Issue

For organizations operating in the EU, the NIS2 Directive and the Digital Operational Resilience Act (DORA) explicitly require resilience measures for critical infrastructure, including ICT supply chain risks. DNS is a foundational ICT service. A TLD-level outage that takes down your platform because you had no failover, no Serve-Stale, and no incident playbook is now a compliance gap, not just an operational one. Document your DNS resilience measures as part of your NIS2/DORA risk assessments.

The Bigger Picture

The .de DNSSEC meltdown highlights a fundamental tension in internet infrastructure: the systems designed to protect us (DNSSEC, certificate validation, strict security policies) can also become single points of failure when they break. The answer isn’t to disable security — it’s to build resilience layers that absorb the impact of failures without sacrificing protection during normal operations.

Serve-Stale and Negative Trust Anchors are exactly this kind of resilience layer. They don’t weaken DNSSEC; they give operators a controlled way to maintain availability while the underlying issue is fixed. Every platform team should have both in their toolkit.

Conclusion: Your DNS Is Only as Strong as Your Weakest Trust Anchor

The .de outage wasn’t caused by a sophisticated attack. It was a configuration error during routine key rotation — the kind of mistake that can happen to any registry, any operator, at any time. What separated the teams that weathered it from those that scrambled was preparation: multi-resolver setups, Serve-Stale configurations, DNSSEC monitoring, and tested incident playbooks.

Your action items for this week:

Check if your resolvers have Serve-Stale enabled. If not, enable it today.
Set up DNSSEC validation monitoring for your critical domains and their parent TLDs.
Document your DNS dependencies and add DNS failure to your incident response playbook.
Test a multi-resolver failover — don’t wait for the next TLD outage to find out if it works.

The next DNSSEC meltdown isn’t a matter of if — it’s a matter of which TLD and when. Be ready.

April 30, 2026Mai 3, 2026

Small Language Models for Platform Engineering: Why 8B Parameters Beat API Dependencies

The economics of AI in platform engineering are shifting — fast. For the past two years, the default answer to „how do we add AI to our internal platform?“ has been „call an API.“ But with inference costs rising, data governance getting stricter, and a new generation of compact models matching much larger counterparts on critical benchmarks, that default is worth questioning. Small Language Models (SLMs) — particularly in the 7B–9B parameter range — have reached a threshold where they can handle the majority of platform engineering workloads without ever leaving your network.

The Benchmark Reality Check: 8B Is Not a Compromise

IBM’s Granite 4.1 8B, released in April 2026 under Apache 2.0, is a useful anchor for this conversation. On enterprise coding benchmarks, the 8B model matches IBM’s own 32B Mixture-of-Experts (MoE) variant. On HumanEval pass@1, the 8B scores 87.2% compared to 89.6% for the 30B model — a gap of less than 3 percentage points that is largely irrelevant for the deterministic, constrained tasks that platform teams actually run.

This pattern holds across the SLM landscape:

Phi-4 (14B) — Microsoft’s model excels at reasoning-heavy tasks, punching well above its weight on MATH and GPQA
Qwen-3 (8B) — Strong multilingual coding support, excellent for polyglot infrastructure codebases
Llama-3.3 (8B) — Meta’s workhorse, widely supported across inference frameworks
Mistral-Small (22B) — A good middle ground when you need more capacity without the frontier price tag

The takeaway: if you are still reaching for GPT-4 or Claude Sonnet to answer „why is this Helm chart failing?“ you are likely overspending.

Dense Non-Thinking Architecture: Why It Matters for Operations

Granite 4.1 uses what IBM calls a Dense Non-Thinking Architecture. In practice, this means the model does not execute an internal chain-of-thought (CoT) reasoning step before responding. For frontier models solving novel math problems, CoT is valuable. For a platform engineer asking „summarize this PagerDuty alert and suggest the top three actions,“ CoT overhead is pure latency and token cost with zero benefit.

Platform tasks are largely pattern-matching with context, not novel reasoning. Alert triage, PR description generation, runbook execution, code review comments — these are well-defined, repetitive, structured tasks where a fast, confident response beats a slow, deeply deliberative one. Dense models optimized for inference speed are a natural fit.

The FinOps Case: What Self-Hosting an 8B Model Actually Costs

Let’s put numbers on this. A mid-tier platform team might generate 50,000 LLM calls per month for internal tooling: PR review summaries, alert enrichment, documentation queries, CI/CD pipeline diagnostics.

At $0.002 per 1K tokens (input + output average), 50,000 calls at ~500 tokens each = $50/month in API costs. Manageable — until agents arrive.

Agentic workflows are not single API calls. A single „investigate this alert“ agent might issue 15–25 tool calls, each with full context. That same 50,000-event scenario becomes 750,000–1,250,000 LLM calls. At $0.002/1K tokens, that is now $1,500–$2,500/month — and growing linearly with adoption.

Self-hosting an 8B model on a single RTX 4090 (~$1,800 hardware) or a Mac Studio M4 Max (~$2,000) delivers:

~30–50 tokens/second throughput (sufficient for internal tooling)
Zero marginal cost per call after hardware amortization
Full data residency — no tokens leave your network
Instant availability without rate limits or provider outages

At an agentic scale, the hardware pays for itself within 1–2 months. Beyond that, it is pure savings.

Platform Engineering Use Cases Where SLMs Shine

1. Alert Triage and Runbook Execution

The HolmesGPT pattern (CNCF Sandbox) demonstrates the right approach: give an SLM access to kubectl, PromQL, and Loki, and a structured Markdown runbook. With a well-crafted runbook, tool calls per investigation drop from 16+ to 2–4. An 8B model running locally handles this at millisecond latency with no data leaving the cluster.

2. CI/CD Pipeline Assistance

PR description generation, test coverage summaries, changelog drafting — these are low-complexity, high-volume tasks. An SLM integrated directly into your CI/CD pipeline (via Ollama’s REST API or a vLLM endpoint) can run as a pipeline step without any external dependency. No API key rotation. No rate limiting during a big release crunch.

3. Code Review Comments

Automated first-pass code review — style enforcement, security pattern flagging, documentation gaps — is exactly the kind of task where an 8B model is sufficient. The model does not need to understand your entire business domain; it needs to apply consistent rules to code diffs. Fine-tuning on your internal codebase further improves relevance.

4. Documentation and Runbook Generation

Keeping runbooks current is a perennial platform team pain point. An SLM that can read infrastructure-as-code, observe recent incident patterns, and generate or update Markdown documentation solves a real operational problem — without requiring a cloud API call for every update.

Enterprise Trust: Granite’s Compliance Credentials

IBM Granite 4.1 ships with two features that matter disproportionately in regulated industries: Guardian Models and cryptographic signing.

Guardian Models are companion classifiers that can check model inputs and outputs for compliance — harmful content, PII exposure, prompt injection attempts. This is built into the model ecosystem, not bolted on afterward. For financial services or healthcare platform teams, this is a significant differentiator versus a generic open-source model.

The cryptographic signing (with ISO certification) means you can verify model provenance. In an era where supply chain security is central to platform governance (see SLSA, Sigstore, in-toto), being able to verify that the model running in your cluster is exactly the model IBM published is not a minor detail.

The Multi-Model Strategy: SLM + Cloud for 80/20 Coverage

The most practical approach is not „replace all cloud APIs with SLMs“ — it is to route intelligently:

~80% of tasks → Local SLM: Alert triage, CI/CD assistance, doc generation, code review, runbook execution, structured queries against internal data
~20% of tasks → Cloud frontier model: Novel architecture decisions, complex multi-step reasoning, tasks requiring broad world knowledge not captured in your fine-tuned model

This mirrors how mature platform teams already think about compute: use the right tool at the right cost tier. An internal platform that routes requests based on complexity signals (task type, token budget, confidence threshold) gives you both cost efficiency and capability headroom.

Getting Started: Self-Hosting in the Platform Engineering Stack

The barrier to running an 8B model is lower than most teams expect:

Ollama — Single-command model serving, REST API, model library with one-line pulls (ollama pull granite3.3:8b)
LM Studio — Desktop GUI for evaluation, good for initial benchmarking before committing to infrastructure
vLLM — Production-grade serving with OpenAI-compatible API, batching, and quantization support; the right choice for Kubernetes-native deployments

For Kubernetes, vLLM running as a Deployment with a GPU node selector and an HPA on request queue depth is a reasonable production starting point. Pair it with an OpenAI-compatible API shim and your existing LLM-integrated tooling requires zero code changes to switch endpoints.

The Connection to Agentic Infrastructure

The Agentic Compute Cliff is real: GitHub Copilot paused new signups in April 2026 due to capacity constraints, and multiple cloud providers are experiencing GPU shortages. As agentic workloads scale — where a single developer workflow might trigger hundreds of LLM calls per hour — dependency on cloud inference is a reliability and cost risk.

SLMs running on internal infrastructure are not just a cost play. They are a resilience play. Your internal platform keeps working when the cloud provider has an outage. Your agents are not rate-limited during a major incident response. Your data never transits a network boundary you do not control.

When 8B Is Not Enough

Intellectual honesty matters here. SLMs are not the answer for everything:

Novel architecture decisions requiring broad reasoning across domains
Complex multi-step debugging across large, unfamiliar codebases
Tasks requiring deep world knowledge beyond your training/fine-tuning window
High-stakes customer-facing generation where quality variance is unacceptable

The skill is in classification — building a platform that knows when to route locally and when to escalate to a frontier model. That routing logic, often just a simple task classifier, is itself a good candidate to run on a local SLM.

Conclusion: Make the Economics Argument

The conversation about SLMs in platform engineering is no longer theoretical. The benchmarks have arrived. The tooling (Ollama, vLLM, LM Studio) is mature. The hardware cost is justified within months at agentic scale. And the privacy and compliance benefits — data residency, Guardian Models, cryptographic provenance — increasingly matter as organizations bring AI deeper into their software delivery lifecycle.

The 8B parameter class is not a compromise. It is a deliberate choice that aligns cost, performance, privacy, and operational simplicity for the tasks that platform teams actually run. Start with one use case — alert triage is a natural first target — measure the results, and expand from there. The API dependency you are paying for today may be entirely optional.

April 19, 2026April 19, 2026

The Vercel Breach Playbook: What Platform Teams Must Do When Their PaaS Provider Gets Compromised

Today — April 19, 2026 — Vercel disclosed a security incident involving unauthorized access to its internal systems. The breach has been linked to the ShinyHunters group, a threat actor known for targeting SaaS platforms via social engineering and vulnerability exploitation. Vercel says a „limited subset of customers“ was impacted and recommends reviewing environment variables — particularly urging use of their Sensitive Environment Variable feature.

If you’re a platform engineer running production workloads on Vercel, this is your signal to act. Not tomorrow. Now.

But this post isn’t just about Vercel. It’s about what every platform team should do when the infrastructure they trust gets compromised — because this has happened before, and it will happen again.

We’ve Been Here Before

The Vercel breach follows a pattern that platform teams should recognize by now:

CircleCI (January 2023) — An engineer’s laptop was compromised, giving attackers access to customer environment variables, tokens, and keys. CircleCI’s guidance was unambiguous: rotate every secret, immediately. Teams that delayed paid the price.
Codecov (April 2021) — Attackers modified Codecov’s Bash Uploader script, exfiltrating environment variables from CI pipelines for two months before detection. Thousands of repositories had their credentials silently harvested.
Travis CI (September 2021) — A vulnerability exposed secrets from public repositories, including signing keys and access tokens. The scope was enormous because the trust boundary had been quietly violated for years.

The common thread: environment variables are the crown jewels, and PaaS providers are the vault. When the vault gets cracked, every secret inside is potentially compromised.

The Shared Responsibility Blind Spot

Most teams understand the shared responsibility model for IaaS — you secure your workloads, AWS secures the hypervisor. But with PaaS providers like Vercel, Netlify, or Railway, the trust boundary is far murkier.

Consider what Vercel has access to in a typical deployment:

Your source code (pulled from Git during builds)
Every environment variable you’ve configured — database URLs, API keys, signing secrets
Build-time and runtime secrets
Deployment metadata and audit logs
DNS configuration and SSL certificates

When Vercel’s internal systems are breached, all of these become part of the blast radius. You didn’t misconfigure anything. You didn’t leak a credential. Your provider’s security posture became your security posture.

This is the platform trust boundary problem: the more convenience your PaaS offers, the more implicit trust you’ve delegated.

Immediate Response: The First 24 Hours

If you’re running on Vercel right now, here’s the checklist. Don’t wait for their investigation to conclude — assume the worst and work backward.

1. Audit Your Environment Variables

Vercel’s own advisory specifically calls out environment variables. Start here:

# List all Vercel projects and their env vars
vercel env ls --environment production
vercel env ls --environment preview
vercel env ls --environment development

Or use the consolidated environment variables page Vercel provides. Document every secret. You need to know what’s potentially exposed before you can rotate.

2. Rotate Every Secret — No Exceptions

This is the lesson from CircleCI: partial rotation is no rotation. If a secret was accessible to your PaaS provider, treat it as compromised.

Database credentials (connection strings, passwords)
API keys (Stripe, Twilio, SendGrid, any third-party service)
OAuth client secrets
JWT signing keys
Webhook secrets
Encryption keys

Prioritize by blast radius: payment processing keys and database credentials first, monitoring API keys last.

3. Review Deployment History

Check for unauthorized deployments or unexpected build activity:

# Review recent deployments via Vercel CLI
vercel ls --limit 50

# Check for deployments from unexpected branches or commits
vercel inspect <deployment-url>

Look for deployments that don’t correlate with your Git history. An attacker with access to Vercel’s internals could potentially trigger builds with modified environment variables or injected build steps.

4. Revoke and Regenerate Tokens

Beyond environment variables, rotate all integration tokens:

Vercel API tokens (personal and team)
Git integration tokens (GitHub/GitLab app installations)
Any webhook endpoints that use shared secrets for verification
CI/CD integration tokens that connect to Vercel

5. Check Downstream Systems

If your database credentials were in Vercel env vars, check your database audit logs for unusual access patterns. If your AWS keys were stored there, review CloudTrail. Every secret that was in Vercel is a thread to pull.

Stop Storing Secrets in Environment Variables

The deeper lesson here is architectural. Environment variables are the de facto standard for passing configuration to applications — but they were never designed as a secrets management system. They’re plaintext, they get logged, they get copied into build caches, and they’re only as secure as the system storing them.

External Secrets Operator

If you’re running Kubernetes workloads (even alongside a PaaS), the External Secrets Operator lets you reference secrets from external stores without ever putting them in your deployment platform:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: db-creds
  data:
    - secretKey: password
      remoteRef:
        key: secret/data/production/database
        property: password

The secret lives in Vault or AWS Secrets Manager. Your PaaS never sees it. If the PaaS is breached, the secret isn’t in the blast radius.

HashiCorp Vault with Dynamic Secrets

Even better: don’t store long-lived credentials at all. Vault’s dynamic secrets generate short-lived database credentials on demand:

# Application requests temporary database credentials at startup
vault read database/creds/my-role
# Returns credentials valid for 1 hour
# Automatically revoked after TTL expires

When your PaaS is breached, there’s nothing useful to steal — the credentials expired hours ago.

CI/CD Credential Hygiene: Kill the Static Tokens

Static API keys and long-lived tokens are the gift that keeps giving — to attackers. Every major PaaS breach has involved harvesting static credentials. The fix is structural.

OIDC Federation: Identity Without Secrets

Instead of storing cloud provider credentials in your CI/CD platform, use OIDC federation. Your pipeline proves its identity to the cloud provider directly, receiving short-lived tokens that can’t be stolen from the PaaS:

# GitHub Actions example — no AWS keys stored anywhere
- uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789:role/deploy-role
    aws-region: eu-central-1
    # No access-key-id or secret-access-key needed
    # GitHub's OIDC token proves the workflow's identity

All major cloud providers support OIDC federation from GitHub Actions, GitLab CI, and most CI/CD platforms. There is no good reason to store static cloud credentials in your PaaS in 2026.

Workload Identity and SPIFFE7.

For more complex deployments, SPIFFE (Secure Production Identity Framework for Everyone) and its reference implementation SPIRE provide cryptographic identity attestation for workloads. Every workload gets a verifiable identity (SVID) without static credentials, and identity is attested based on the workload’s environment — not a secret that can be exfiltrated.

This is zero-trust for deployment pipelines: trust is established through verifiable identity, not shared secrets.

SBOM and Provenance: Know What You Shipped

When your build platform is compromised, one critical question emerges: can you prove that what’s running in production is what you intended to ship?

Build provenance — cryptographic attestations that link a deployed artifact to its source code, build parameters, and builder identity — becomes essential during incident response:

# Verify build provenance with cosign
cosign verify-attestation \
  --type slsaprovenance \
  --certificate-identity builder@your-org.iam.gserviceaccount.com \
  --certificate-oidc-issuer https://accounts.google.com \
  ghcr.io/your-org/your-app:latest

If you maintain SBOMs (Software Bills of Materials) and SLSA provenance attestations, you can forensically verify whether a compromised build platform injected anything into your artifacts. Without them, you’re flying blind.

Long-Term: Multi-Provider Resilience

The uncomfortable truth is that every PaaS provider will eventually have a security incident. The question isn’t if — it’s whether your architecture limits the blast radius when it happens.

Reduce Single Points of Trust

Secrets in an external vault, not in the PaaS — Vault, AWS Secrets Manager, Azure Key Vault
Build artifacts signed independently — don’t rely on the build platform’s integrity alone
DNS and TLS managed separately — if your PaaS controls your DNS, a breach can redirect traffic
Audit logs forwarded in real-time — ship PaaS audit logs to your own SIEM before the provider can tamper with them

Portable Deployments

If your deployment is tightly coupled to a single PaaS, you can’t move quickly during an incident. Containerized workloads with Infrastructure-as-Code configuration give you the option to shift to another platform within hours, not weeks. You don’t need to be multi-cloud on day one — but you need the capability to move when the trust relationship breaks.

The Incident Response Checklist

Pin this somewhere visible. When your next PaaS breach notification lands in your inbox:

Timeframe	Action
0-1 hours	Inventory all secrets stored in the provider. Begin rotating critical credentials (database, payment, auth).
1-4 hours	Revoke all API tokens and integration credentials. Review deployment history for anomalies.
4-12 hours	Complete rotation of all remaining secrets. Check downstream system audit logs. Verify build artifact integrity.
12-24 hours	Confirm no unauthorized deployments occurred. Brief stakeholders. Document timeline.
1-7 days	Conduct full post-incident review. Implement architectural improvements (external secrets, OIDC federation). Update runbooks.

Trust, but Architect for Betrayal

The Vercel breach is a reminder that platform trust is borrowed, not owned. Every convenience a PaaS provides — environment variable storage, built-in secrets, managed DNS — is a trust delegation that becomes a liability during a breach.

The platforms you depend on will get compromised. The question is whether you’ve architected your systems so that a provider breach is a inconvenience you handle in hours — or a catastrophe that takes weeks to untangle.

Start rotating your secrets now. Then start building the architecture that means you won’t have to do it so urgently next time.