Upgrading Cloud Native infrastructure is complex, error-prone, and time-consuming. The process is highly manual, with teams sifting through scattered release notes and documentation for every component. Small changes in one layer can introduce hidden incompatibilities in another, risking outages and forcing expert-level troubleshooting at every step. Because the work is so complex, it can't be easily delegated or automated, which creates bottlenecks, long delays, and added costs, including extended support fees and mounting technical debt.

Chkk is an Agentic Operational Safety Platform designed to solve this problem for the Cloud Native ecosystem. It leverages Knowledge Graphs, Risk Signature Databases, and Agentic AI systems to proactively identify hidden dependencies, unknown incompatibilities, and potential risks before they cause failures. By generating pre-verified, agentic upgrade workflows on demand, Chkk Upgrade Copilot provides teams with structured, safe, and efficient upgrade plans, speeding up upgrades by 3x to 5x while eliminating last-minute surprises and disruptions.

Chkk offers two complementary paths. Use a Rapid Upgrade Assessment for a fast, zero-install, planning-grade read on scope, targets, and risks. When you're close to execution, switch to Upgrade Templates and cluster-specific Upgrade Plans for preverified, impact-free workflows.

Building Blocks

Upgrade Assessments

What it is

A fast, zero-install, CLI-first assessment that summarizes scope, recommended target versions, and early risks across clusters, nodes, and key Cloud Native Projects in minutes. Output is Markdown today; a JSON schema and flags for wiring results into CI and dashboards are planned.

When to use

  • Months ahead to T‑shirt‑size effort, choose “next versions,” and generate app‑team work (API deprecations/removals, misconfigured PDBs).
  • Weekly in CI to track readiness and make visibility/escalation machine‑driven (publish to dashboards/wikis).

How it works

Multiple agents classify environment artifacts (e.g. EKS clusters, Istio, cert‑manager versions), query Chkk’s Knowledge Graph, and compute recommended targets and notable risks without a preverification pass—that’s why it’s fast and planning‑grade. Use Templates/Plans when you’re close to execution.

Accuracy & Expectations

Because an Assessment skips preverification, a recommendation can occasionally be incomplete when upstream context is missing; coverage improves continuously. For impact-free execution, generate a Template and Plan.

Upgrade Templates

Use an Upgrade Template when you're near execution; for early planning and weekly readiness, start with an Upgrade Assessment. An Upgrade Template is an agentic workflow containing a tested and structured sequence of steps and stages to safely upgrade your clusters. An Upgrade Template is generated on demand and is scoped to an Environment (e.g. dev, staging, or prod). Upgrade Templates support three commonly used upgrade patterns: In-Place, Blue-Green, and Rolling/Surge. Upgrade Templates answer all the questions that Platform Teams have to address in an upgrade:
  1. It starts with Deep Analysis & Research:
  • What versions and configuration of control planes, nodes, Cloud Native Projects, and applications are running?
    • Have any of these versions reached EOL?
    • Are any of these versions incompatible with each other?
  • What Operational Risks should I be aware of before upgrading?
    • Are there open defects in specific versions that have caused breakages or disruptions?
    • Are there any misconfigurations that I should fix prior to upgrading?
  • Which version of control planes and Projects should we upgrade to?
    • What’s the version matrix where all Projects are compatible with the next cluster version?
    • What breaking changes (CRDs, configuration, features, behavioral changes,…) will we encounter when executing the upgrade?
    • Are there any hidden dependencies that can break Projects or applications?
    • Can I upgrade Projects to the desired version directly or do we need to do it as a sequence of upgrade hops to avoid schema breakages and/or incompatibilities?
  2. Then comes Preparation:
  • What Helm chart and CRD changes must be catered for before executing upgrades?
  • What preflight checks should be run for control plane and Projects to ensure it’s safe to execute upgrades?
  • What code diffs should be applied to upgrade Projects, control plane, and nodes, and in which order?
  • What postflight checks should be run to ensure everything is healthy after the upgrade is complete?
An Upgrade Template answers all the above questions, and its entire workflow of steps and stages is pre-verified to work without failures on a Digital Twin of your environment. While simple Projects (e.g. VPC CNI, cert-manager, External Secrets Operator) can be upgraded with the cluster, complex Projects (e.g. Istio, Contour, Consul) generally require a dedicated Project Upgrade Template with steps and stages specific to that Project. Project (or Application Service) Upgrade Templates ensure upgrade safety by enabling you to manage complex Projects' upgrades independently of your cluster upgrade lifecycle. Your team reviews and customizes an Upgrade Template by collaborating through comments inside the Template, adding your own custom steps, and finally approving the Upgrade Template for execution.
  3. And now you are ready for Upgrade Execution:
This is where you instantiate Upgrade Plans for each cluster in the Environment. The instantiated Upgrade Plans inherit all the information present in Upgrade Templates, plus additional cluster-specific information such as:
  • Which applications are using deprecated/removed APIs?
  • Are there any application client changes in Cloud Native Projects or Application Services?
  • Are there any application misconfigurations, like incorrect Pod Disruption Budgets (PDBs), that can cause the upgrade to fail?
All activities performed on the Upgrade Templates and Upgrade Plans are stored in long-running, durable workflows, ensuring safe, structured, and repeatable upgrades. Upgrade Templates get all the information they need from two foundational technology components: Risk Signature Database (RSig DB) and Knowledge Graph.
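The cluster-specific questions above (deprecated/removed APIs and misconfigured PDBs) can be illustrated with a minimal sketch. This is not how Chkk implements its checks; the function names and the manifest shape (plain dicts) are assumptions made for illustration. The removal table is a small subset of upstream Kubernetes API removals, and the PDB check only handles integer values, not percentages.

```python
# Illustrative subset of upstream Kubernetes API removals:
# (apiVersion, kind) -> first Kubernetes (major, minor) that no longer serves it.
REMOVED_APIS = {
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
}


def removed_api_findings(manifests, target):
    """Return a message for each manifest whose API is not served in `target`.

    `manifests` is a list of dicts with apiVersion, kind, and metadata.name;
    `target` is the target cluster version as a (major, minor) tuple.
    """
    findings = []
    for m in manifests:
        removed_in = REMOVED_APIS.get((m["apiVersion"], m["kind"]))
        if removed_in is not None and target >= removed_in:
            findings.append(
                f"{m['kind']}/{m['metadata']['name']} uses {m['apiVersion']}, "
                f"not served from {removed_in[0]}.{removed_in[1]}"
            )
    return findings


def pdb_blocks_drain(pdb_spec, replicas):
    """True if a PDB with integer fields allows zero voluntary disruptions.

    Such a PDB prevents node drains from evicting any pod of the workload,
    which can stall a node upgrade. Percentage values are ignored here.
    """
    max_unavailable = pdb_spec.get("maxUnavailable")
    min_available = pdb_spec.get("minAvailable")
    if isinstance(max_unavailable, int) and max_unavailable == 0:
        return True
    if isinstance(min_available, int) and min_available >= replicas:
        return True
    return False
```

For example, a `policy/v1beta1` PodDisruptionBudget is flagged when the target is `(1, 25)` but not when it is `(1, 24)`, and a PDB with `minAvailable: 3` on a 3-replica workload is reported as blocking drains.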

Operational Risks

Operational Risk refers to any known or potential defect, misconfiguration, or incompatibility in Cloud Native infrastructure that can cause incidents, disruptions, or breakages. These risks, which may include known defects or issues stemming from unsupported versions, deprecated APIs, and software nearing end-of-life, are categorized by severity—Critical, High, Medium, or Low. An Operational Risk is detected by scanning for at-risk components, identifying trigger conditions, and assessing availability impact, root cause, remediation steps, and possible mitigations. In Chkk, these risks are codified as Risk Signatures (RSigs) that continuously scan customer environments to proactively uncover and address Operational Risks before they cause breakages or outages.
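To make the structure of a Risk Signature more concrete, here is a hedged sketch of how the elements described above (at-risk component, trigger condition, severity, remediation) might be modeled. The class and field names are hypothetical, not Chkk's actual schema, and the example project in the usage note is invented.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Severity(Enum):
    # Declared from most to least severe; scan() relies on this order.
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


@dataclass
class Component:
    name: str
    version: tuple  # e.g. (1, 4, 2)
    config: dict = field(default_factory=dict)


@dataclass
class RiskSignature:
    rsig_id: str
    severity: Severity
    component: str                      # at-risk component name
    affected: Callable[[tuple], bool]   # is this version at risk?
    trigger: Callable[[dict], bool]     # does the config trigger the risk?
    remediation: str

    def matches(self, c: Component) -> bool:
        # A risk applies only when component, version, and trigger all match.
        return (
            c.name == self.component
            and self.affected(c.version)
            and self.trigger(c.config)
        )


def scan(inventory, signatures):
    """Return (signature, component) hits, most severe first."""
    hits = [(s, c) for s in signatures for c in inventory if s.matches(c)]
    order = list(Severity)
    return sorted(hits, key=lambda hit: order.index(hit[0].severity))
```

A signature for a hypothetical `example-operator` defect affecting versions 1.4.x when more than one replica is configured would match a 1.4.2 deployment with `replicas: 3` and skip a 1.5.0 deployment.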

Chkk Service

Chkk is a secure, scalable, multi-regional service offered in the US and EU. It is built on a cell-based architecture and supports multi-tenant and single-tenant instances. It comprises multiple services that run APIs and microservices to serve product modules and maintain inventory records.

At the heart of the Chkk Service are Classifiers and Engines, working in tandem to provide real-time insights and operational resilience for Cloud Native environments. Classifiers leverage references from the Knowledge Graph and RSig DB, either extracting image digests (hash-based) or applying RuleSets (rule-based) to identify resources and their relationships. Engines then use these classification results to detect latent Operational Risks, ensure guardrail conformance, and generate Upgrade Templates. They also create digital twins to preverify proposed Upgrade Plans, helping ensure that all steps can be executed without failures and enabling smoother, more reliable operations.

Rapid path: on demand, agents classify your environment (cluster and Cloud Native Project versions), query the Knowledge Graph, and compute recommended targets and notable risks without running the deep preverification loop, so results land in minutes. Use the preverified path (Templates/Plans) for go-time.
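The hash-based and rule-based classification described above can be sketched as a two-stage lookup: try an exact image-digest match first, then fall back to pattern rules. This is an illustration under assumptions, not Chkk's Classifier; the digest table, the rule list, and the `classify` function are all hypothetical, and the digest value is a placeholder.

```python
import re

# Hypothetical digest table: image digest -> (project, version).
# The digest below is a placeholder, not a real image digest.
KNOWN_DIGESTS = {
    "sha256:" + "a" * 64: ("example-controller", "2.3.1"),
}

# Hypothetical rule set: (project, pattern over the image reference).
RULES = [
    (
        "cert-manager",
        re.compile(r"(?:^|/)cert-manager-controller:v?(?P<version>\d+\.\d+\.\d+)$"),
    ),
]


def classify(image_ref, digest=None):
    """Identify a project and version for a container image.

    Returns a dict with project, version, and the method used ("hash" or
    "rule"), or None when nothing matches.
    """
    # Stage 1: hash-based, an exact match on the image digest.
    if digest in KNOWN_DIGESTS:
        project, version = KNOWN_DIGESTS[digest]
        return {"project": project, "version": version, "method": "hash"}
    # Stage 2: rule-based, pattern matching on the image reference.
    for project, pattern in RULES:
        match = pattern.search(image_ref)
        if match:
            return {
                "project": project,
                "version": match.group("version"),
                "method": "rule",
            }
    return None
```

Trying the digest first reflects a common design choice: a digest identifies an image exactly even when it has been retagged or mirrored, while rules cover images whose digests are not yet known.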

Risk Signature Database (RSig DB)

RSig DB takes inspiration from cybersecurity, where security vulnerabilities are reported publicly in the CVE Database. We extended this idea to operational safety: if an Operational Risk (e.g. an error, failure, or disruption) has happened anywhere in the world, Chkk AI aggregators and data connectors learn about it, convert it into a Risk Signature (similar to a virus signature), and store it in the RSig DB. Any new Risk Signature is streamed to all our customers, whose environments are then scanned for it. That way, our customers can proactively detect, identify, and remediate Operational Risks before they cause breakages and disruptions, much like antivirus software detects and removes viruses before they cause harm. RSig DB's AI aggregators and data connectors collect Operational Risks from the following sources of information:
  • Release schedules
  • Upstream Tickets / Issues
  • Release notes / Changelogs
  • Pull Requests
  • Cloud provider knowledge bases and issue trackers
Customers can also voluntarily opt in to share Operational Risks with Chkk; without that opt-in, we don't learn anything from the customer directly. RSig DB also codifies Guardrails, which represent operational best practices from across the Cloud Native ecosystem (upstream communities, cloud providers, and vendors). Platform Teams use these Guardrails to ensure Application developers conform to their Platform's operational excellence standards.

Knowledge Graph

Knowledge Graph stores agentic data and relationships across hundreds of Cloud Native Projects in the ecosystem, modeling their impact and identifying the safest upgrade paths. The Chkk Research Team provides oversight of this agentic data and these relationships. Knowledge Graph covers releases of all major clouds and distributions: EKS, GKE, AKS, VMware Tanzu, OpenShift, Rancher RKE1/RKE2, and Nutanix. We also support DIY and self-hosted clusters. Chkk covers 300+ Projects, and coverage for a new Project or application service can be extended within 48 hours. Coverage extension is done on a continual basis by an agentic AI architecture, with multiple task-specific AI agents identifying and curating the following information:
  • Release Notes are curated with emphasis on breaking changes and upgrade considerations
  • Image hashes from upstream registries
  • Components are modeled, where a Project or application service may package other Projects (e.g. Redis-OSS is packaged inside ArgoCD)
  • Package Systems (Helm, Kustomize, etc.)
  • Package Sources (HelmCharts, KustomizeSources, KubeSources, etc.)
  • Deployment Modes (e.g. Istio has two Deployment Modes: Sidecar, Ambient)
  • EOL policies
  • Version Compatibility:
    • With upstream cluster versions
    • With the cloud substrate (e.g. Amazon EKS versions and Amazon AMI versions), and
    • With other Projects (e.g. a Contour version being compatible with certain Envoy versions)
  • Safety, Health and Readiness Checks, which include per-version preflight, inflight and postflight checks packaged in single-click, ready-to-run containers
Using the Knowledge Graph, Chkk identifies the safest Upgrade Paths for clusters and Cloud Native Projects. This path discovery, at times, contains multiple upgrade hops.
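Multi-hop path discovery can be pictured as a search over a graph of supported direct upgrades. The sketch below uses breadth-first search to find the path with the fewest hops to any version that satisfies a compatibility predicate. It is an assumption-laden illustration: the function name, the graph shape, and the use of "fewest hops" as a stand-in for "safest" are not taken from Chkk, which may weigh other factors.

```python
from collections import deque


def find_upgrade_path(current, direct_upgrades, is_acceptable):
    """Return the shortest list of versions from `current` to an acceptable one.

    `direct_upgrades` maps a version to the versions it can be upgraded to in
    a single hop. `is_acceptable` says whether a version is a valid endpoint,
    for example because it is compatible with the target cluster version.
    Returns None when no acceptable version is reachable.
    """
    queue = deque([[current]])
    seen = {current}
    while queue:
        path = queue.popleft()
        if is_acceptable(path[-1]):
            return path
        for nxt in direct_upgrades.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical project that only supports upgrading one minor version at a time.
hops = {"1.0": ["1.1"], "1.1": ["1.2"], "1.2": ["1.3"]}
compatible_with_target_cluster = {"1.2", "1.3"}

path = find_upgrade_path("1.0", hops, compatible_with_target_cluster.__contains__)
print(path)  # ['1.0', '1.1', '1.2']
```

Here the project cannot jump straight from 1.0 to 1.2, so the result is a two-hop sequence through 1.1.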

Chkk Dashboard

Chkk Dashboard is a UI for you to interact with Chkk. This interaction includes, but isn't limited to, the following actions:
  • Onboarding clusters, managing access tokens
  • Inviting team members
  • Operational Risks: viewing Risks latent in clusters, reading Knowledge Base articles about these Risks, and performing actions (like ignoring a Risk or leaving comments for team members)
  • Guardrails: viewing Guardrails that are not followed, reading Knowledge Base articles about these Guardrails, and exposing these Guardrails to Application Teams through an API integration
  • Upgrade Templates: Requesting, customizing, reviewing and approving Upgrade Templates.
  • Upgrade Plans: instantiating and executing Upgrade Plans.
  • Artifacts: inventory extracted from running components, container images, repositories, and tools across multiple clusters, clouds and layers of infrastructure.
  • Integrations: with internal tools like GitHub, Slack, etc. Also includes SSO integration.

A Simple Operational Recipe

  • Month −3 to −2: Run an Upgrade Assessment to scope effort (Platform T‑shirt size + App work).
  • Month −2 to 0 (weekly): Automate a weekly Assessment → publish to dashboards/wikis → app teams burn down risks; leadership reviews trends.
  • D‑30: Generate an Upgrade Template for target versions; validate in sandbox.
  • D‑7: Create a cluster‑specific Upgrade Plan; promote dev → staging → prod.
  • Go‑time: Execute with high confidence.