Building Blocks
Upgrade Assessments
What it is
A fast, zero‑install, CLI‑first assessment that summarizes scope, recommended target versions, and early risks across clusters, nodes, and key Cloud Native Projects in minutes. Output is Markdown today; a JSON schema and flags for wiring results into CI and dashboards are planned.
When to use
- Months ahead to T‑shirt‑size effort, choose “next versions,” and generate app‑team work (API deprecations/removals, misconfigured PDBs).
- Weekly in CI to track readiness and make visibility/escalation machine‑driven (publish to dashboards/wikis).
How it works
Multiple agents classify environment artifacts (e.g. EKS clusters, Istio, cert‑manager versions), query Chkk’s Knowledge Graph, and compute recommended targets and notable risks without a preverification pass. Skipping that pass is what makes the assessment fast and planning‑grade. Use Templates/Plans when you’re close to execution.
Accuracy & Expectations
Skipping preverification means that, in rare cases, recommendations can be incomplete if upstream context is missing; coverage improves continuously. For impact‑free execution, generate a Template and Plan.
Upgrade Templates
Use an Upgrade Template when you’re near execution; for early planning and weekly readiness, start with an Upgrade Assessment. An Upgrade Template is an agentic workflow containing a tested and structured sequence of steps and stages to safely upgrade your clusters. It is generated on demand and scoped to an Environment (e.g. dev, staging, or prod). Upgrade Templates support three commonly used upgrade patterns: In-Place, Blue-Green, and Rolling/Surge. They answer the questions that Platform Teams have to address in an upgrade:
- It starts with Deep Analysis & Research:
  - What versions and configuration of control planes, nodes, Cloud Native Projects, and applications are running?
  - Have any of these versions reached EOL?
  - Are any of these versions incompatible with each other?
  - What Operational Risks should I be aware of before upgrading?
  - Are there open defects in specific versions that have caused breakages or disruptions?
  - Are there any misconfigurations that I should fix prior to upgrading?
  - Which version of control planes and Projects should we upgrade to?
  - What’s the version matrix where all Projects are compatible with the next cluster version?
  - What breaking changes (CRDs, configuration, features, behavioral changes, etc.) will we encounter when executing the upgrade?
  - Are there any hidden dependencies that can break Projects or applications?
  - Can we upgrade Projects to the desired version directly, or do we need a sequence of upgrade hops to avoid schema breakages and/or incompatibilities?
- Then comes Preparation:
  - What Helm chart and CRD changes must be catered for before executing upgrades?
  - What preflight checks should be run for control plane and Projects to ensure it’s safe to execute upgrades?
  - What code diffs should be applied to upgrade Projects, control plane, and nodes, and in which order?
  - What postflight checks should be run to ensure everything is healthy after the upgrade is complete?
- And now you are ready for Upgrade Execution:
  - Which applications are using deprecated/removed APIs?
  - Are there any application client changes in Cloud Native Projects or Application Services?
  - Are there any application misconfigurations, such as incorrect Pod Disruption Budgets (PDBs), that can cause the upgrade to fail?
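The question above about direct upgrades versus a sequence of upgrade hops can be pictured as a shortest-path search over supported version transitions. The following is a minimal sketch of that idea, not Chkk's implementation: the `SUPPORTED_HOPS` table, its version strings, and the `plan_hops` helper are all hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical table of supported direct upgrades for a single Project.
# Keys are installed versions; values are versions reachable in one hop.
SUPPORTED_HOPS = {
    "1.10": ["1.11"],
    "1.11": ["1.12", "1.13"],
    "1.12": ["1.13"],
    "1.13": ["1.14"],
}


def plan_hops(current, target, hops=SUPPORTED_HOPS):
    """Return the shortest version sequence from current to target, or None."""
    queue = deque([[current]])  # each queue entry is a path of versions
    seen = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in hops.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no supported route between the two versions


if __name__ == "__main__":
    print(plan_hops("1.10", "1.14"))
```

Breadth-first search is used here because it returns a route with the fewest hops; a real planner would also have to weigh schema migrations and cross-Project compatibility at every intermediate version.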
Operational Risks
Operational Risk refers to any known or potential defect, misconfiguration, or incompatibility in Cloud Native infrastructure that can cause incidents, disruptions, or breakages. These risks, which may include known defects or issues stemming from unsupported versions, deprecated APIs, and software nearing end-of-life, are categorized by severity: Critical, High, Medium, or Low. An Operational Risk is detected by scanning for at-risk components, identifying trigger conditions, and assessing availability impact, root cause, remediation steps, and possible mitigations. In Chkk, these risks are codified as Risk Signatures (RSigs) that continuously scan customer environments to proactively uncover and address Operational Risks before they cause breakages or outages.
Chkk Service
Chkk is a secure, scalable multi-regional solution offered in the US and EU. It is built using a cell-based architecture and supports multi-tenant and instances. It comprises multiple services that run APIs and microservices to serve product modules and maintain inventory records. At the heart of the Chkk Service are Classifiers and Engines, working in tandem to provide real-time insights and operational resilience for Cloud Native environments. Classifiers leverage references from the Knowledge Graph and RSig DB, either extracting image digests (hash-based) or applying RuleSets (rule-based) to identify resources and their relationships. Engines then use these classification results to detect latent Operational Risks, ensure guardrail conformance, and generate Upgrade Templates. They also create digital twins to preverify proposed Upgrade Plans, helping ensure that all steps can be executed without failures, and enabling smoother, more reliable operations.
Rapid path: on demand, agents classify your environment (cluster / Cloud Native Project versions), query the Knowledge Graph, and compute recommended targets and notable risks without running the deep preverification loop, so results land in minutes. Use the preverified path (Templates/Plans) for go‑time.
Risk Signature Database (RSig DB)
RSig DB takes inspiration from cybersecurity, where security vulnerabilities are reported publicly in the CVE Database. We extended this idea to operational safety: if an Operational Risk (e.g. an error, failure, or disruption) has occurred anywhere in the world, Chkk AI aggregators and data connectors learn about it, convert it into a Risk Signature (similar to a virus signature), and store it in the RSig DB. Every new Risk Signature is streamed to all our customers and scanned against their environments. That way, our customers can proactively detect, identify, and remediate Operational Risks before they cause breakages and disruptions, much like antivirus software detects and removes viruses before they cause harm. RSig DB’s AI aggregators and data connectors gather Operational Risks from the following sources of information:
- Release schedules
- Upstream Tickets / Issues
- Release notes / Changelogs
- Pull Requests
- Cloud provider knowledge bases and issue trackers
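To make the antivirus analogy concrete, the sketch below models a Risk Signature as a record with a severity, an at-risk component, a set of affected versions, and a trigger condition, then scans an inventory against a list of signatures. Everything here is an assumption for illustration: the field names, the `scan` helper, and the sample data are hypothetical and do not describe Chkk's actual schema or engine.

```python
from dataclasses import dataclass
from typing import Callable

# Severity levels as listed in the Operational Risks section, most severe first.
SEVERITY_ORDER = ["Critical", "High", "Medium", "Low"]


@dataclass
class Component:
    """One inventory item (hypothetical shape)."""
    name: str
    version: str
    config: dict


@dataclass
class RiskSignature:
    """Hypothetical signature: at-risk component, versions, trigger, remediation."""
    rsig_id: str
    severity: str
    component: str
    affected_versions: frozenset
    trigger: Callable[[dict], bool]  # trigger condition evaluated on the config
    remediation: str

    def matches(self, c: Component) -> bool:
        return (
            c.name == self.component
            and c.version in self.affected_versions
            and self.trigger(c.config)
        )


def scan(inventory, signatures):
    """Return (signature, component) hits, ordered from most to least severe."""
    hits = [(s, c) for s in signatures for c in inventory if s.matches(c)]
    return sorted(hits, key=lambda h: SEVERITY_ORDER.index(h[0].severity))


if __name__ == "__main__":
    # Made-up inventory and signatures, for illustration only.
    inventory = [
        Component("example-ingress", "2.1.0", {"replicas": 1}),
        Component("example-dns", "1.9.3", {"cache": True}),
    ]
    signatures = [
        RiskSignature("RSIG-EXAMPLE-1", "Medium", "example-dns",
                      frozenset({"1.9.3"}), lambda cfg: True,
                      "Upgrade to a fixed release."),
        RiskSignature("RSIG-EXAMPLE-2", "High", "example-ingress",
                      frozenset({"2.1.0"}),
                      lambda cfg: cfg.get("replicas", 0) < 2,
                      "Run at least two replicas."),
    ]
    for sig, comp in scan(inventory, signatures):
        print(sig.severity, sig.rsig_id, comp.name, "->", sig.remediation)
```

Keeping the trigger as a separate predicate means a signature fires only when both the version and the configuration match, which mirrors the distinction drawn earlier between at-risk components and trigger conditions.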
Knowledge Graph
Knowledge Graph stores agentic data and relationships across hundreds of Cloud Native Projects in the ecosystem, modeling their impact and identifying the safest upgrade paths. The Chkk Research Team provides oversight of this agentic data and these relationships. Knowledge Graph covers releases of all major clouds and distributions: EKS, GKE, AKS, VMware Tanzu, OpenShift, Rancher RKE1/RKE2, and Nutanix. DIY and self-hosted clusters are also supported. Chkk also covers 300+ Projects, and coverage for a new Project or application service can be extended within 48 hours. Coverage Extension is done on a continual basis by an agentic AI architecture, with multiple task-specific AI agents identifying and curating information from the following sources:
- Release Notes, curated with emphasis on breaking changes and upgrade considerations
- Image hashes from upstream registries
- Components are modeled, where a Project or application service may package other Projects (e.g. Redis-OSS is packaged inside ArgoCD)
- Package Systems (Helm, Kustomize, etc.)
- Package Sources (HelmCharts, KustomizeSources, KubeSources, etc.)
- Deployment Modes (e.g. Istio has two Deployment Modes: Sidecar, Ambient)
- EOL policies
- Version Compatibility:
  - With upstream cluster versions
  - With the cloud substrate (e.g. Amazon EKS versions and Amazon AMI versions), and
  - With other Projects (e.g. a Contour version being compatible with certain Envoy versions)
- Safety, Health and Readiness Checks, which include per-version preflight, inflight and postflight checks packaged in single-click, ready-to-run containers
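Version compatibility data of the kind listed above can be used to answer the earlier question about a version matrix in which all Projects work with the next cluster version. The sketch below shows one simple way to do that lookup. It is an assumption-laden illustration: the project names `ingress` and `proxy`, all version numbers, and both tables are invented, and a real Knowledge Graph would hold far richer relationships.

```python
from itertools import product

# Hypothetical data: (project, project version) -> supported cluster versions.
CLUSTER_COMPAT = {
    ("proxy", "1.0"): {"1.29", "1.30"},
    ("proxy", "1.1"): {"1.30", "1.31"},
    ("ingress", "2.0"): {"1.29", "1.30"},
    ("ingress", "2.1"): {"1.30", "1.31"},
}

# Hypothetical data: (ingress version, proxy version) pairs that work together.
PROJECT_COMPAT = {("2.0", "1.0"), ("2.1", "1.1")}


def candidates(project, cluster_version):
    """Versions of one project that support the given cluster version."""
    return sorted(
        version
        for (name, version), clusters in CLUSTER_COMPAT.items()
        if name == project and cluster_version in clusters
    )


def version_matrix(cluster_version):
    """(ingress, proxy) pairs valid on the cluster and with each other."""
    return [
        (ingress, proxy)
        for ingress, proxy in product(
            candidates("ingress", cluster_version),
            candidates("proxy", cluster_version),
        )
        if (ingress, proxy) in PROJECT_COMPAT
    ]


if __name__ == "__main__":
    print(version_matrix("1.31"))
```

The lookup applies two filters in turn: compatibility with the cluster version first, then compatibility between Projects. An empty result signals that no combination satisfies both constraints for that cluster version.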
Chkk Dashboard
Chkk Dashboard is the UI for interacting with Chkk. It supports, among others, the following actions:
- Onboarding clusters and managing access tokens
- Inviting team members
- Operational Risks: viewing Risks latent in clusters, reading Knowledge Base articles about these Risks, and performing actions (like ignoring a Risk or leaving comments for team members)
- Guardrails: viewing Guardrails that are not followed, reading Knowledge Base articles about these Guardrails, and exposing these Guardrails to Application Teams through an API integration
- Upgrade Templates: requesting, customizing, reviewing, and approving Upgrade Templates.
- Upgrade Plans: instantiating and executing Upgrade Plans.
- Artifacts: inventory extracted from running components, container images, repositories, and tools across multiple clusters, clouds and layers of infrastructure.
- Integrations: with internal tools like GitHub, Slack, etc. Also includes SSO integration.
A Simple Operational Recipe
- Month −3 to −2: Run an Upgrade Assessment to scope effort (Platform T‑shirt size + App work).
- Month −2 to 0 (weekly): Automate a weekly Assessment → publish to dashboards/wikis → app teams burn down risks; leadership reviews trends.
- D‑30: Generate an Upgrade Template for target versions; validate in sandbox.
- D‑7: Create a cluster‑specific Upgrade Plan; promote dev → staging → prod.
- Go‑time: Execute with high confidence.
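The recipe above is a countdown from a go-live date, so the milestones can be derived with simple date arithmetic. The sketch below is one way to do that under a stated assumption: a "month" is approximated as 30 days, which the recipe itself does not specify. The function names and milestone labels are hypothetical.

```python
from datetime import date, timedelta


def recipe_schedule(go_live: date) -> dict:
    """Map each recipe milestone to a calendar date.

    Assumption: one "month" is approximated as 30 days.
    """
    return {
        "first Assessment (month -3)": go_live - timedelta(days=90),
        "weekly Assessments start (month -2)": go_live - timedelta(days=60),
        "generate Upgrade Template (D-30)": go_live - timedelta(days=30),
        "create Upgrade Plan (D-7)": go_live - timedelta(days=7),
        "go-time": go_live,
    }


def weekly_assessment_dates(go_live: date) -> list:
    """Dates of the weekly Assessments between month -2 and go-live."""
    day = go_live - timedelta(days=60)
    dates = []
    while day <= go_live:
        dates.append(day)
        day += timedelta(weeks=1)
    return dates


if __name__ == "__main__":
    for milestone, when in recipe_schedule(date(2025, 6, 30)).items():
        print(when.isoformat(), milestone)
```

Feeding these dates into a calendar or CI scheduler is left out on purpose, since the scheduling mechanism depends on the tooling each team already uses.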