This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Kubernetes Blog

Kubernetes v1.36: Deprecation and removal of Service ExternalIPs

The .spec.externalIPs field for Service was an early attempt to provide cloud-load-balancer-like functionality for non-cloud clusters. Unfortunately, the API assumes that every user in the cluster is fully trusted, and in any situation where that is not the case, it enables various security exploits, as described in CVE-2020-8554.

Since Kubernetes 1.21, the Kubernetes project has recommended that all users disable .spec.externalIPs. To make that easier, Kubernetes also added an admission controller (DenyServiceExternalIPs) that can be enabled to do this. At the time, SIG Network felt that blocking the functionality by default was too large a breaking change to consider.

However, the security problems are still there, and as a project we're increasingly unhappy with the "insecure by default" state of the feature. Additionally, there are now several better alternatives for non-cloud clusters wanting load-balancer-like functionality.

As a result, the .spec.externalIPs field for Service is now formally deprecated in Kubernetes 1.36. We expect that a future minor release of Kubernetes will drop implementation of the behavior from kube-proxy, and will update the Kubernetes conformance criteria to require that conforming implementations do not provide support.

A note on terminology, and what hasn't been deprecated

The phrase external IP is somewhat overloaded in Kubernetes:

  • The Service API has a field .spec.externalIPs that can be used to add additional IP addresses that a Service will respond on.

  • The Node API's .status.addresses field can list addresses of several different types, one of which is called ExternalIP.

  • The kubectl tool, when displaying information about a Service of type LoadBalancer in the default output format, will show the load balancer IP address under the column heading EXTERNAL-IP.

This deprecation is about the first of those. If you are not setting the field externalIPs in any of your Services, then it does not apply to you.

That said, as a precaution, you may still want to enable the DenyServiceExternalIPs admission controller to block any future use of the externalIPs field.

Alternatives to externalIPs

If you are using .spec.externalIPs, then there are several alternatives.

Consider a Service like the following:

apiVersion: v1
kind: Service
metadata:
  name: my-example-service
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: my-example-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  externalIPs:
    - "192.0.2.4"

Using manually-managed LoadBalancer Services instead of externalIPs

The easiest (but also worst) option is to just switch from using externalIPs to using a type: LoadBalancer service, and assigning a load balancer IP by hand. This is, essentially, exactly the same as externalIPs, with one important difference: the load balancer IP is part of the Service's .status, not its .spec, and in a cluster with RBAC enabled, it can't be edited by ordinary users by default. Thus, this replacement for externalIPs would only be available to users who were given permission by the admins (although those users would then be fully empowered to replicate CVE-2020-8554; there would still not be any further checks to ensure that one user wasn't stealing another user's IPs, etc.)

Because of the way that .status works in Kubernetes, you must create the Service without a load balancer IP, and then add the IP as a second step:

$ cat loadbalancer-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-example-service
spec:
  # prevent any real load balancer controllers from managing this service
  # by using a non-existent loadBalancerClass
  loadBalancerClass: non-existent-class
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: my-example-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
$ kubectl apply -f loadbalancer-service.yaml
service/my-example-service created
$ kubectl patch service my-example-service --subresource=status --type=merge -p '{"status":{"loadBalancer":{"ingress":[{"ip":"192.0.2.4"}]}}}'

Using a non-cloud based load balancer controller

Although LoadBalancer services were originally designed to be backed by cloud load balancers, Kubernetes can also support them on non-cloud platforms by using a third-party load balancer controller such as MetalLB. This solves the security problems associated with externalIPs because the administrator can configure what ranges of IP addresses the controller will assign to services, and the controller will ensure that two services can't both use the same IP.

So, for example, after installing and configuring MetalLB, a cluster administrator could configure a pool of IP addresses for use in the cluster:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: production
  namespace: metallb-system
spec:
  addresses:
  - 192.0.2.0/24
  autoAssign: true
  avoidBuggyIPs: false

After which a user can create a type: LoadBalancer Service and MetalLB will handle the assignment of the IP address. MetalLB even supports the deprecated loadBalancerIP field in Service, so the end user can request a specific IP (assuming it is available) for backward-compatibility with the externalIPs approach, rather than being assigned one at random:

apiVersion: v1
kind: Service
metadata:
  name: my-example-service
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: my-example-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  loadBalancerIP: "192.0.2.4"

Similar approaches would work with other load balancer controllers. This approach can allow cluster administrators to have control over which IP addresses are assigned, rather than users.

Using Gateway API

Another potential solution is to use an implementation of the Gateway API.

Gateway API allows cluster administrators to define a Gateway resource, which can have an IP address attached to it via the .spec.addresses field. Since Gateway resources are designed to be managed by cluster administrators, RBAC rules can be put in place to only allow privileged users to manage them.

An example of how this could look is:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: example-gateway
spec:
  gatewayClassName: example-gateway-class
  addresses:
  - type: IPAddress
    value: "192.0.2.4"
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-route
spec:
  parentRefs:
  - name: example-gateway
  rules:
  - backendRefs:
    - name: example-svc
      port: 80
---
apiVersion: v1
kind: Service
metadata:
  name: example-svc
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: example-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080

The Gateway API project is the next generation of Kubernetes Ingress, Load Balancing, and Service Mesh APIs within Kubernetes. Gateway API was designed to fix the shortcomings of the Service and Ingress resource, making it a very reliable robust solution that is under active development.

Timeline for externalIPs deprecation

The rough timeline for this deprecation is as follows:

  1. With the release of Kubernetes 1.36, the field was deprecated; Kubernetes now emits warnings when a user uses this field
  2. About a year later (v1.40 at the earliest) support for .spec.externalIPs will be disabled in kube-proxy, but users will have a way to opt back in should they require more time to migrate away
  3. About another year later - (v1.43 at the earliest) support will be disabled completely; users won't have a way to opt back in

Kubernetes v1.36: Advancing Workload-Aware Scheduling

AI/ML and batch workloads introduce unique scheduling challenges that go beyond simple Pod-by-Pod scheduling. In Kubernetes v1.35, we introduced the first tranche of workload-aware scheduling improvements, featuring the foundational Workload API alongside basic gang scheduling support built on a Pod-based framework, and an opportunistic batching feature to efficiently process identical Pods.

Kubernetes v1.36 introduces a significant architectural evolution by cleanly separating API concerns: the Workload API acts as a static template, while the new PodGroup API handles the runtime state. To support this, the kube-scheduler features a new PodGroup scheduling cycle that enables atomic workload processing and paves the way for future enhancements. This release also debuts the first iterations of topology-aware scheduling and workload-aware preemption to advance scheduling capabilities. Additionally, ResourceClaim support for workloads unlocks Dynamic Resource Allocation (DRA) for PodGroups. Finally, to demonstrate real-world readiness, v1.36 delivers the first phase of integration between the Job controller and the new API.

Workload and PodGroup API updates

The Workload API now serves as a static template, while the new PodGroup API describes the runtime object. Kubernetes v1.36 introduces the Workload and PodGroup APIs as part of the scheduling.k8s.io/v1alpha2 API group, completely replacing the previous v1alpha1 API version.

In v1.35, Pod groups and their runtime states were embedded within the Workload resource. The new model decouples these concepts: the Workload now serves as a static template object, while the PodGroup manages the runtime state. This separation also improves performance and scalability as the PodGroup API allows per-replica sharding of status updates.

Because the Workload API acts merely as a template, the kube-scheduler's logic is streamlined. The scheduler can directly read the PodGroup, which contains all the information required by the scheduler, without needing to watch or parse the Workload object itself.

Here is what the updated configuration looks like. Workload controllers (such as the Job controller) define the Workload object, which now acts as a static template for your Pod groups:

apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  # Pod groups are now defined as templates,
  # which contains the PodGroup objects' spec fields.
  podGroupTemplates:
  - name: workers
    schedulingPolicy:
      gang:
        # The gang is schedulable only if 4 pods can run at once
        minCount: 4

Controllers then stamp out runtime PodGroup instances based on those templates. The PodGroup runtime object holds the actual scheduling policy and references the template from which it was created. It also has a status containing conditions that mirror the states of individual Pods, reflecting the overall scheduling state of the group:

apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-job-workers-pg
  namespace: some-ns
spec:
  # The PodGroup references the Workload template it originated from.
  # In comparison, .metadata.ownerReferences points to the "true" workload object,
  # e.g., a Job.
  podGroupTemplateRef:
    workload:
      workloadName: training-job-workload
      podGroupTemplateName: workers
  # The actual scheduling policy is placed inside the runtime PodGroup
  schedulingPolicy:
    gang:
      minCount: 4
status:
  # The status contains conditions mirroring individual Pod conditions.
  conditions:
  - type: PodGroupScheduled
    status: "True"
    lastTransitionTime: 2026-04-03T00:00:00Z

Finally, to bridge this new architecture with individual Pods, the workloadRef field in the Pod API has been replaced with the schedulingGroup field. When creating Pods, you link them directly to the runtime PodGroup:

apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: some-ns
spec:
  # The workloadRef field has been replaced by schedulingGroup
  schedulingGroup:
    podGroupName: training-job-workers-pg
  ...

By keeping the Workload as a static template and elevating the PodGroup to a first-class, standalone API, we establish a robust foundation for building advanced workload scheduling capabilities in future Kubernetes releases.

PodGroup scheduling cycle and gang scheduling

To efficiently manage these workloads, the kube-scheduler now features a dedicated PodGroup scheduling cycle. Instead of evaluating and reserving resources sequentially Pod-by-Pod, which risks scheduling deadlocks, the scheduler evaluates the group as a unified operation.

When the scheduler pops a PodGroup member from the scheduling queue, regardless of the group's specific policy, it fetches the rest of the queued Pods for that group, sorts them deterministically, and executes an atomic scheduling cycle as follows:

  1. The scheduler takes a single snapshot of the cluster state to prevent race conditions and ensure consistency while evaluating the entire group.

  2. It then attempts to find valid Node placements for all Pods in the group using a PodGroup scheduling algorithm, which leverages the standard Pod-based filtering and scoring phases.

  3. Based on the algorithm's outcome, the scheduling decision is applied atomically for the entire PodGroup.

    • Success: If the placement is found and group constraints are met, the schedulable member Pods are moved directly to the binding phase together. Any remaining unschedulable Pods are returned to the scheduling queue to wait for available resources so they can join the already scheduled Pods.

      (Note: If new Pods are added to a PodGroup after others are already scheduled, the cycle evaluates the new Pods while accounting for the existing ones. Crucially, Pods already assigned to Nodes remain running. The scheduler will not unassign or evict them, even if the group fails to meet its requirements in subsequent cycles.)

    • Failure: If the group fails to meet its requirements, the entire group is considered unschedulable. None of the Pods are bound, and they are returned to the scheduling queue to retry later after a backoff period.

This cycle acts as the foundation for gang scheduling. When your workload requires strict all-or-nothing placement, the gang policy leverages this cycle to prevent partial deployments that lead to resource wastage and potential deadlocks.

While the scheduler still holds the Pods in the PreEnqueue until the minCount requirement is met, the actual scheduling phase now relies entirely on the new PodGroup cycle. Specifically, during the algorithm's execution, the scheduler verifies that the number of schedulable Pods satisfies the minCount. If the cluster cannot accommodate the required minimum, none of the pods are bound. The group fails and waits for sufficient resources to free up.

Limitations

The first version of the PodGroup scheduling cycle comes with certain limitations:

  • For basic homogeneous Pod groups (i.e., those where all Pods have identical scheduling requirements and lack inter-Pod dependencies like affinity, anti-affinity, or topology spread constraints), the algorithm is expected to find a placement if one exists.

  • For heterogeneous Pod groups, finding a valid placement if one exists is not guaranteed, even when the solution might seem trivial.

  • For Pod groups with inter-Pod dependencies, finding a valid placement if one exists is not guaranteed.

In addition to the above, for cases involving intra-group dependencies (e.g., when the schedulability of one Pod depends on another group member via inter-Pod affinity), this algorithm may fail to find a placement regardless of cluster state due to its deterministic processing order.

Topology-aware scheduling

For complex distributed workloads like AI/ML training or batch processing, placing Pods randomly across a cluster can introduce significant network latency and bottleneck overall performance.

Topology-aware scheduling addresses this problem by allowing you to define topology constraints directly on a PodGroup, ensuring its Pods are co-located within specific physical or logical domains:

apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: topology-aware-workers-pg
spec:
  schedulingPolicy:
    gang:
      minCount: 4
  # Enforce that the pods are co-located based on the rack topology
  schedulingConstraints:
    topology:
      - key: topology.kubernetes.io/rack

In this example, the kube-scheduler attempts to schedule the Pods across various combinations of Nodes that match the rack topology constraint. It then selects the optimal placement based on how efficiently the PodGroup utilizes resources and how many Pods can successfully be scheduled within that domain.

To achieve this, the scheduler extends the PodGroup scheduling cycle with a dedicated placement-based algorithm consisting of three phases:

  1. Generate candidate placements (subsets of Nodes that are theoretically feasible for the PodGroup's assignment) based on the group's scheduling constraints. The topology-aware scheduling plugin uses the new PlacementGenerate extension point to create these placements.

  2. Evaluate each proposed placement to confirm whether the entire PodGroup can actually fit there.

  3. Score all feasible placements to select the best fit for the PodGroup. The topology-aware scheduling plugins use the new PlacementScore extension point to score these placements.

Currently, topology-aware scheduling does not trigger Pod preemption to satisfy constraints. However, we plan to integrate workload-aware preemption with topology constraints in the upcoming release.

While Kubernetes v1.36 delivers this foundational topology-aware scheduling, the Kubernetes project is planning expand its capabilities soon. Future updates will introduce support for multiple topology levels, soft constraints (preferences), deeper integration with Dynamic Resource Allocation (DRA), and more robust behavior when paired with the basic scheduling policy.

Workload-aware preemption

To support the new PodGroup scheduling cycle, Kubernetes v1.36 introduces a new type of preemption mechanism called workload-aware preemption. When a PodGroup cannot be scheduled, the scheduler utilizes this mechanism to try making a scheduling of this PodGroup possible.

Compared to the default preemption used in the standard Pod-by-Pod scheduling cycle, this new mechanism treats the entire PodGroup as a single preemptor unit. Instead of evaluating preemption victims on each Node separately, it searches across the entire cluster. This allows the scheduler to preempt Pods from multiple Nodes simultaneously, making enough space to schedule the whole PodGroup afterwards.

Workload-aware preemption also introduces two additional concepts directly to the PodGroup API:

  • PodGroup priority that overrides the priority of the individual Pods forming the PodGroup.

  • PodGroup disruptionMode that dictates whether the Pods within a PodGroup can be preempted independently, or if they have to be preempted together in an all-or-nothing fashion.

In Kubernetes v1.36, these fields are only respected by the workload-aware preemption mechanism. The people working on this set of features are hoping to extend support for these fields to other disruption sources, including default preemption used in the Pod-by-Pod scheduling cycle, in future releases.

apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: victim-pg
spec:
  priorityClassName: high-priority
  priority: 1000
  disruptionMode: PodGroup

In this example, when the scheduler evaluates victim-pg as a potential preemption victim during a workload-aware preemption cycle, it will use 1000 as its priority and preempt the PodGroup in a strictly all-or-nothing fashion.

DRA ResourceClaim support for workloads

Since its general availability in Kubernetes v1.34, DRA has enabled Pods to make detailed requests for devices like GPUs, TPUs, and NICs. Requested devices can be shared by multiple Pods requesting the same ResourceClaim by name. Other requests can be replicated through a ResourceClaimTemplate, in which Kubernetes generates one ResourceClaim with a non-deterministic name for each Pod referencing the template. However, large-scale workloads that require certain Pods to share certain devices are currently left to manage creating individual ResourceClaims themselves.

Now, in addition to Pods, PodGroups can represent the replicable unit for a ResourceClaimTemplate. For ResourceClaimTemplates referenced by one of a PodGroup's spec.resourceClaims, Kubernetes generates one ResourceClaim for the entire PodGroup, no matter how many Pods are in the group. When one of a Pod's spec.resourceClaims for a ResourceClaimTemplate matches one of its PodGroup's spec.resourceClaims, the Pod's claim resolves to the ResourceClaim generated for the PodGroup and a ResourceClaim will not be generated for that individual Pod. A single PodGroupTemplate in a Workload object can express resource requests which are both copied for each distinct PodGroup and shareable by the Pods within each group.

The following example shows two Pods requesting the same ResourceClaim generated from a ResourceClaimTemplate for their PodGroup:

apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-job-workers-pg
spec:
  ...
  resourceClaims:
    - name: pg-claim
      resourceClaimTemplateName: my-claim-template
---
apiVersion: v1
kind: Pod
metadata:
  name: topology-aware-workers-pg-pod-1
spec:
  ...
  schedulingGroup:
    podGroupName: training-job-workers-pg
  resourceClaims:
    - name: pg-claim
      resourceClaimTemplateName: my-claim-template
---
apiVersion: v1
kind: Pod
metadata:
  name: topology-aware-workers-pg-pod-2
spec:
  ...
  schedulingGroup:
    podGroupName: training-job-workers-pg
  resourceClaims:
    - name: pg-claim
      resourceClaimTemplateName: my-claim-template

In addition, ResourceClaims referenced by PodGroups, either through resourceClaimName or the claim generated from resourceClaimTemplateName, become reserved for the entire PodGroup. Previously, kube-scheduler could only list individual Pods in a ResourceClaim's status.reservedFor field which is limited to 256 items. Now, a single PodGroup reference in status.reservedFor can represent many more than 256 Pods, allowing high-cardinality sharing of devices.

Together, these changes enable massive workloads with complex topologies to utilize DRA for scalable device management.

Integration with the Job controller

In Kubernetes v1.36, the Job controller can create and manage Workload and PodGroup objects on your behalf, so that Jobs representing a tightly coupled parallel application, such as distributed AI training, are gang-scheduled without any additional tooling. Without this integration, you would have to create the Workload and PodGroup yourself and wire their references into the Pod template. Now, the Job controller automates this process natively.

When the WorkloadWithJob feature gate is enabled, the Job controller automatically:

  • creates a Workload and a corresponding runtime PodGroup for each qualifying Job,

  • sets .spec.schedulingGroup onto every Pod the Job creates so the scheduler treats them as a single gang, and

  • sets the Job as the owner of the generated objects, so they are garbage-collected when the Job is deleted.

When does the integration kick in?

To keep the first feature iteration predictable, the Job controller only creates a Workload and PodGroup when the Job has a well-defined, fixed shape:

  • .spec.parallelism is greater than 1

  • .spec.completionMode is set to Indexed

  • .spec.completions is equal to .spec.parallelism

  • The schedulingGroup is not already set on the Pod template.

These conditions describe the class of Jobs that gang scheduling can reason about: each Pod has a stable identity (Indexed), the gang size is known and fixed at admission time (parallelism == completions), and no other controller has already claimed scheduling responsibility (schedulingGroup field is unset). Jobs that do not meet these conditions are scheduled Pod-by-Pod, exactly as before.

If you set schedulingGroup on the Pod template yourself (for example, because a higher-level controller is managing the workload), the Job controller leaves the Pod template alone and does not create its own Workload or PodGroup. This makes the feature safe to enable in clusters that already use an external batch system.

Here is an example of a Job that qualifies for gang scheduling:

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
  namespace: job-ns
spec:
  completionMode: Indexed
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example/trainer:latest

The Job controller creates a Workload and a PodGroup owned by this Job, and every Pod it creates carries a .spec.schedulingGroup that points at the generated PodGroup. The Pods are then scheduled together once all four can be placed at the same time using the PodGroup scheduling cycle described earlier in this post.

What's not covered yet

The current constraints limit this integration to static, indexed, fully-parallel Jobs. Support for additional workload shapes, including elastic Jobs and other built-in controllers, is tracked in KEP-5547.

In future Kubernetes releases, this integration will expand to support additional workload controllers, and the current constraints for Jobs may be relaxed.

What's next?

The journey for workload-aware scheduling doesn't stop here. For v1.37, the community is actively working on:

  • Graduating Workload and PodGroup APIs to Beta: Our primary goal is to mature the Workload and PodGroup APIs to the Beta stage, solidifying their foundational role in the Kubernetes ecosystem. As part of this graduation process, we also plan to introduce minCount mutability to unlock elastic jobs and allow dynamic workloads to scale efficiently.

  • Multi-level Workload hierarchies: To support complex modern AI workloads like JobSet or Disaggregated Inference via LeaderWorkerSet (LWS), we are working on expanding the architecture to support multi-level hierarchies. We aim to introduce a new API that allows grouping multiple PodGroups into hierarchical structures, directly reflecting the organization of real-world workload controllers.

  • Graduating advanced scheduling features: We are focused on driving the maturity of the broader workload-aware scheduling ecosystem. This includes bringing existing features, such as topology-aware scheduling and workload-aware preemption, to the Beta stage.

  • Unified controller integration API: To streamline adoption, we’re working on a controller integration API. This will provide real-world workload controllers with a unified, standardized method for consuming workload-aware scheduling capabilities.

The priority and implementation order of these focus areas are subject to change. Stay tuned for further updates.

Getting started

All below workload-aware scheduling improvements are available as Alpha features in v1.36. To try them out, you must configure the following:

  • Prerequisite: Workload and PodGroup API support: Enable the GenericWorkload feature gate on both the kube-apiserver and kube-scheduler, and ensure the scheduling.k8s.io/v1alpha2 API group is enabled.

Once the prerequisite is met, you can enable specific features:

  • Gang scheduling: Enable the GangScheduling feature gate on the kube-scheduler.
  • Topology-aware scheduling: Enable the TopologyAwareWorkloadScheduling feature gate on the kube-scheduler.
  • Workload-aware preemption: Enable the WorkloadAwarePreemption feature gate on the kube-scheduler (requires GangScheduling to also be enabled).
  • DRA ResourceClaim support for workloads: Enable the DRAWorkloadResourceClaims feature gate on the kube-apiserver, kube-controller-manager, kube-scheduler and kubelet.
  • Workload API integration with the Job controller: Enable the WorkloadWithJob feature gate on the kube-apiserver and kube-controller-manager.

We encourage you to try out workload-aware scheduling in your test clusters and share your experiences to help shape the future of Kubernetes scheduling. You can send your feedback by:

Learn more

To dive deeper into the architecture and design of these features, read the KEPs:

Kubernetes v1.36: PSI Metrics for Kubernetes Graduates to GA

Since its original implementation in the Linux kernel in 2018, Pressure Stall Information (PSI) has provided users with the high-fidelity signals needed to identify resource saturation before it becomes an outage. Unlike traditional utilization metrics, PSI tells the story of tasks stalled and time lost, all in nicely-packaged percentages of time across the CPU, memory, and I/O.

With the recent release of Kubernetes v1.36, users across the ecosystem have a stable, reliable interface to observe resource contention at the node, pod, and container levels. In this post, we will dive into the improvements and performance testing that proved its readiness for production.

Beyond utilization: why PSI?

Monitoring CPU or memory usage alone can be misleading. A node may report XX% (below 100%) CPU utilization while certain tasks are experiencing severe latency due to scheduling delays. PSI fills this gap by providing:

  • Cumulative Totals: Absolute time spent in a stalled state.
  • Moving Averages: 10s, 60s, and 300s windows that allow operators to distinguish between transient spikes and sustained resource tension.

Proving stability: performance testing at scale

A common concern when graduating telemetry features is the resource overhead required to collect and serve the metrics. To address this, SIG Node conducted extensive performance validation on high-density workloads (80+ pods) across various machine types.

Our testing focused on two primary scenarios to isolate the impact of the Kubelet and kernel-level collection respectively:

  1. Kernel PSI ON / Kubelet Feature OFF vs Kernel PSI ON / Kubelet Feature ON (Kubelet overhead)
  2. Kernel PSI OFF / Kubelet Feature ON vs Kernel PSI ON / Kubelet Feature ON (Kernel overhead)

Scenario 1: The Kubelet Overhead

First, we looked at the kubelet usage on 4 core machines (Case 1). For these, the Linux kernel was already tracking pressure on both clusters by default(psi=1), but we toggled the KubeletPSI feature gate to see if the Kubelet actively querying and exposing these metrics impacted the resource usage. The synchronized bursts seen in the graph are practically identical in both magnitude and frequency, confirming that the Kubelet's collection logic is highly lightweight and blends seamlessly into standard housekeeping cycles. There is no issue about the feature affecting the pre-existing resource use, staying within the normal 0.1 cores or 2.5% of the total node capacity, and is therefore safe for production-scale deployments.

A line graph comparing the kubelet CPU usage rate over elapsed time with the Kubelet PSI feature turned off versus on and kernel PSI always on.

(Case 1) Kubelet CPU Usage Rate Comparison

Figure 2: Kubelet CPU Usage Rate Comparison.

Next, we evaluated the system overhead in the same run. As seen in the following graph, the System CPU usage lines for the Kubelet PSI-enabled (red) follows the same pattern as the Kubelet PSI-disabled (blue) clusters, with a slight expected increase from the baseline. This visualizes that once the OS is tracking PSI, at around 2.5 cores, the act of Kubernetes reading those cgroup metrics is negligible to performance.

A line graph comparing the system CPU usage rate over elapsed time with the PSI feature turned off versus on and kernel PSI default ON.

(Case 1) System CPU Usage Rate Comparison

Figure 1: Node System CPU Usage Rate Comparison.

Scenario 2: The Kernel Overhead

Shifting gears, we evaluated the underlying overhead of enabling PSI on the Linux kernel also on a 4 core machine. By comparing a cluster booted with psi=1 (COS default) against a cluster with psi=0, we isolated the exact cost of the OS-level bookkeeping. Even under heavy I/O and CPU load at an 80-pod density, the System CPU delta between the kernel-enabled and kernel-disabled clusters remained consistently between 0.037 cores and 0.125 cores or 0.925% - 3.125% of the total node capacity. There was a single spike to 0.225 cores, or 5.6%, but was controlled back down within a few seconds. This confirms that the internal kernel tracking is highly efficient under load.

A line graph comparing the Node System (Kernel) CPU usage rate with Kernel PSI ON and OFF over elapsed time.

(Case 2) Node System CPU Usage Rate Comparison

Figure 3: Node System CPU Usage Rate Comparison.

Figure 4 zooms in on the kubelet process itself, which serves as the primary collector for these metrics. . The results show that even while the kubelet performs periodic sweeps to aggregate data from the cgroup hierarchy, its CPU usage remains remarkably low with interchangeable spikes and nothing exceeding 0.25 cores or 6.25% of total capacity for longer than a second.

A line graph comparing the kubelet CPU usage rate over elapsed time with the Kernel PSI feature turned off versus on.

(Case 2) Kubelet CPU Usage Rate Comparison

Figure 4: Kubelet CPU Usage Rate Comparison.

Improvements between beta (1.34) and stable (1.36)

  • Smarter Metric Emission for GA: We improved how the Kubelet handles underlying OS support for PSI. Previously, if the feature was enabled in Kubernetes but the underlying Linux kernel didn't support PSI (psi=0), the Kubelet would emit misleading zero-valued metrics. These could trigger false alarms when read as real metrics instead of missing values. In v1.36, the Kubelet now detects OS-level PSI support via cgroup configurations before reporting. This ensures that pressure metrics are only collected and emitted when they are actually supported by the node, providing cleaner data for monitoring and alerting systems.

Getting started

To use PSI metrics in your Kubernetes cluster, your nodes must meet the following requirements:

  1. Ensure your nodes are running a Linux kernel version 4.20 or later and are using cgroup v2.
  2. Ensure PSI is enabled at the OS level (your kernel must be compiled with CONFIG_PSI=y and must not be booted with the psi=0 parameter).

As of v1.36, Kubelet PSI metrics are generally available and you do not need to opt in to any feature gate.

Once the OS prerequisites are met, you can start scraping the /metrics/cadvisor endpoint with your Prometheus-compatible monitoring solution or query the Summary API to collect and visualize the new PSI metrics. Note that PSI is a Linux-kernel feature, so these metrics are not available on Windows nodes. Your cluster can contain a mix of Linux and Windows nodes, and on the Windows nodes, the kubelet will simply omit the PSI metrics.

If your cluster is running a recent enough version of Kubernetes and you are a privileged node administrator, you can also proxy to the kubelet's HTTP API via the control plane's API server to see real-time pressure data from the Summary API.

Caution: Proxying to the kubelet is a privileged operation. Granting access to it is a security risk, so ensure you have the appropriate administrative permissions before executing these commands.

CONTAINER_NAME="example-container"
kubectl get --raw "/api/v1/nodes/$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')/proxy/stats/summary" | jq '.pods[].containers[] | select(.name=="'"$CONTAINER_NAME"'") | {name, cpu: .cpu.psi, memory: .memory.psi, io: .io.psi}'

Further reading

If you want to dive deeper into how these metrics are calculated and exposed, check out these resources:

  1. The official Kernel documentation
  2. Understanding PSI in the Kubernetes documentation
  3. cAdvisor Metrics Implementation

Acknowledgements

Support for PSI metrics was developed through the collaborative efforts of SIG Node. Special thanks to all contributors who helped design, implement, test, review, and document this feature across its journey from alpha in v1.33, through beta in v1.34, to GA in v1.36.

To provide feedback on this feature, join the Kubernetes Node Special Interest Group, participate in discussions on the public Slack channel (#sig-node), or file an issue on GitHub.

Feedback

If you have feedback and want to share your experience using this feature, join the discussion:

SIG Node would love to hear about your experiences using this feature in production!

Kubernetes v1.36: Moving Volume Group Snapshots to GA

Volume group snapshots were introduced as an Alpha feature with the Kubernetes v1.27 release, moved to Beta in v1.32, and to a second Beta in v1.34. We are excited to announce that in the Kubernetes v1.36 release, support for volume group snapshots has reached General Availability (GA).

The support for volume group snapshots relies on a set of extension APIs for group snapshots. These APIs allow users to take crash-consistent snapshots for a set of volumes. Behind the scenes, Kubernetes uses a label selector to group multiple PersistentVolumeClaim objects for snapshotting. A key aim is to allow you to restore that set of snapshots to new volumes and recover your workload based on a crash-consistent recovery point.

This feature is only supported for CSI volume drivers.

An overview of volume group snapshots

Some storage systems provide the ability to create a crash-consistent snapshot of multiple volumes. A group snapshot represents copies made from multiple volumes that are taken at the same point-in-time. A group snapshot can be used either to rehydrate new volumes (pre-populated with the snapshot data) or to restore existing volumes to a previous state (represented by the snapshots).

Why add volume group snapshots to Kubernetes?

The Kubernetes volume plugin system already provides a powerful abstraction that automates the provisioning, attaching, mounting, resizing, and snapshotting of block and file storage. Underpinning all these features is the Kubernetes goal of workload portability.

There was already a VolumeSnapshot API that provides the ability to take a snapshot of a persistent volume to protect against data loss or data corruption. However, some storage systems support consistent group snapshots that allow a snapshot to be taken from multiple volumes at the same point-in-time to achieve write order consistency. This is extremely useful for applications that contain multiple volumes. For example, an application may have data stored in one volume and logs stored in another. If snapshots for these volumes are taken at different times, the application will not be consistent and will not function properly if restored from those snapshots.

While you can quiesce the application first and take individual snapshots sequentially, this process can be time-consuming or sometimes impossible. Consistent group support provides crash consistency across all volumes in the group without the need for application quiescence.

Kubernetes APIs for volume group snapshots

Kubernetes' support for volume group snapshots relies on three API kinds that are used for managing snapshots:

VolumeGroupSnapshot
Created by a Kubernetes user (or automation) to request creation of a volume group snapshot for multiple persistent volume claims.
VolumeGroupSnapshotContent
Created by the snapshot controller for a dynamically created VolumeGroupSnapshot. It contains information about the provisioned cluster resource (a group snapshot). The object binds to the VolumeGroupSnapshot for which it was created with a one-to-one mapping.
VolumeGroupSnapshotClass
Created by cluster administrators to describe how volume group snapshots should be created, including the driver information, the deletion policy, etc.

These three API kinds are defined as CustomResourceDefinitions (CRDs). For the GA release, the API version has been promoted to v1.

What's new in GA?

  • The API version for VolumeGroupSnapshot, VolumeGroupSnapshotContent, and VolumeGroupSnapshotClass is promoted to groupsnapshot.storage.k8s.io/v1.
  • Enhanced stability and bug fixes based on feedback from the beta releases, including the improvements introduced in v1beta2 for accurate restoreSize reporting.

How do I use Kubernetes volume group snapshots

Creating a new group snapshot with Kubernetes

Once a VolumeGroupSnapshotClass object is defined and you have volumes you want to snapshot together, you may request a new group snapshot by creating a VolumeGroupSnapshot object.

Label the PVCs you wish to group:

% kubectl label pvc pvc-0 group=myGroup
persistentvolumeclaim/pvc-0 labeled

% kubectl label pvc pvc-1 group=myGroup
persistentvolumeclaim/pvc-1 labeled

For dynamic provisioning, a selector must be set so that the snapshot controller can find PVCs with the matching labels to be snapshotted together.

apiVersion: groupsnapshot.storage.k8s.io/v1
kind: VolumeGroupSnapshot
metadata:
  name: snapshot-daily-20260422
  namespace: demo-namespace
spec:
  volumeGroupSnapshotClassName: csi-groupSnapclass
  source:
    selector:
      matchLabels:
        group: myGroup

The VolumeGroupSnapshotClass is required for dynamic provisioning:

apiVersion: groupsnapshot.storage.k8s.io/v1
kind: VolumeGroupSnapshotClass
metadata:
  name: csi-groupSnapclass
driver: example.csi.k8s.io
deletionPolicy: Delete

How to use group snapshot for restore

At restore time, request a new PersistentVolumeClaim to be created from a VolumeSnapshot object that is part of a VolumeGroupSnapshot. Repeat this for all volumes that are part of the group snapshot.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: examplepvc-restored-2026-04-22
  namespace: demo-namespace
spec:
  storageClassName: example-sc
  dataSource:
    name: snapshot-0962a745b2bf930bb385b7b50c9b08af471f1a16780726de19429dd9c94eaca0
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOncePod
  resources:
    requests:
      storage: 100Mi

As a storage vendor, how do I add support for group snapshots?

To implement the volume group snapshot feature, a CSI driver must:

  • Implement a new group controller service.
  • Implement group controller RPCs: CreateVolumeGroupSnapshot, DeleteVolumeGroupSnapshot, and GetVolumeGroupSnapshot.
  • Add group controller capability CREATE_DELETE_GET_VOLUME_GROUP_SNAPSHOT.

See the CSI spec and the Kubernetes-CSI Driver Developer Guide for more details.

How can I learn more?

How do I get involved?

This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. On behalf of SIG Storage, I would like to offer a huge thank you to all the contributors who stepped up over the years to help the project reach GA:

For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We always welcome new contributors.

We also hold regular Data Protection Working Group meetings. New attendees are welcome to join our discussions.

Kubernetes v1.36: More Drivers, New Features, and the Next Era of DRA

Dynamic Resource Allocation (DRA) has fundamentally changed how platform administrators handle hardware accelerators and specialized resources in Kubernetes. In the v1.36 release, DRA continues to mature, bringing a wave of feature graduations, critical usability improvements, and new capabilities that extend the flexibility of DRA to native resources like memory and CPU, and support for ResourceClaims in PodGroups.

Driver availability continues to expand. Beyond specialized compute accelerators, the ecosystem includes support for networking and other hardware types, reflecting a move toward a more robust, hardware-agnostic infrastructure.

Whether you are managing massive fleets of GPUs, need better handling of failures, or simply looking for better ways to define resource fallback options, the upgrades to DRA in 1.36 have something for you. Let's dive into the new features and graduations!

Feature graduations

The community has been hard at work stabilizing core DRA concepts. In Kubernetes 1.36, several highly anticipated features have graduated to Beta and Stable.

Prioritized list (stable)

Hardware heterogeneity is a reality in most clusters. With the Prioritized list feature, you can confidently define fallback preferences when requesting devices. Instead of hardcoding a request for a specific device model, you can specify an ordered list of preferences (e.g., "Give me an H100, but if none are available, fall back to an A100"). The scheduler will evaluate these requests in order, drastically improving scheduling flexibility and cluster utilization.

Extended resource support (beta)

As DRA becomes the standard for resource allocation, bridging the gap with legacy systems is crucial. The DRA Extended resource feature allows users to request resources via traditional extended resources on a Pod. This allows for a gradual transition to DRA, meaning cluster operators can migrate clusters to DRA but let application developers adopt the ResourceClaim API on their own schedule.

Partitionable devices (beta)

Hardware accelerators are powerful, and sometimes a single workload doesn't need an entire device. The Partitionable devices feature, provides native DRA support for dynamically carving physical hardware into smaller, logical instances (such as Multi-Instance GPUs) based on workload demands. This allows administrators to safely and efficiently share expensive accelerators across multiple Pods.

Device taints (beta)

Just as you can taint a Kubernetes Node, you can apply taints directly to specific DRA devices. Device taints and tolerations empower cluster administrators to manage hardware more effectively. You can taint faulty devices to prevent them from being allocated to standard claims, or reserve specific hardware for dedicated teams, specialized workloads, and experiments. Ultimately, only Pods with matching tolerations are permitted to claim these tainted devices.

Device binding conditions (beta)

To improve scheduling reliability, the Kubernetes scheduler can use the Binding conditions feature to delay committing a Pod to a Node until its required external resources—such as attachable devices or FPGAs—are fully prepared. By explicitly modeling resource readiness, this prevents premature assignments that can lead to Pod failures, ensuring a much more robust and predictable deployment process.

Resource health status (beta)

Knowing when a device has failed or become unhealthy is critical for workloads running on specialized hardware. With Resource health status, Kubernetes expose device health information directly in the Pod status, giving users and controllers crucial visibility to quickly identify and react to hardware failures. The feature includes support for human-readable health status messages, making it significantly easier to diagnose issues without the need to dive into complex driver logs.

New Features

Beyond stabilizing existing capabilities, v1.36 introduces foundational new features that expand what DRA can do. These are alpha features, so they are behind feature gates that are disabled by default.

ResourceClaim support for workloads

To optimize large-scale AI/ML workloads that rely on strict topological scheduling, the ResourceClaim support for workloads feature enables Kubernetes to seamlessly manage shared resources across massive sets of Pods. By associating ResourceClaims or ResourceClaimTemplates with PodGroups, this feature eliminates previous scaling bottlenecks, such as the limit on the number of pods that can share a claim, and removes the burden of manual claim management from specialized orchestrators.

Node allocatable resources

Why should DRA only be for external accelerators? In v1.36, we are introducing the first iteration of using the DRA APIs to manage node allocatable infrastructure resources (like CPU and memory). By bringing CPU and memory allocation under the DRA umbrella with the DRA Node allocatable resources feature, users can leverage DRA's advanced placement, NUMA-awareness, and prioritization semantics for standard compute resources, paving the way for incredibly fine-grained performance tuning.

DRA resource availability visibility

One of the most requested features from cluster administrators has been better visibility into hardware capacity. The new Resource pool status feature allows you to query the availability of devices in DRA resource pools. By creating a ResourcePoolStatusRequest object, you get a point-in-time snapshot of device counts — total, allocated, available, and unavailable — for each pool managed by a given driver. This enables better integration with dashboards and capacity planning tools.

List types for attributes

ResourceClaim constraint evaluation has changed to work better with scalar and list values: matchAttribute now checks for a non-empty intersection, and distinctAttribute checks for pairwise disjoint values.

An includes() function in CEL has also been introduced, that lets device selectors keep working more easily when an attribute changes between scalar and list representations. (The includes() function is only available in DRA contexts for expression evaluation).

Deterministic device selection

The Kubernetes scheduler has been updated to evaluate devices using lexicographical ordering based on resource pool and ResourceSlice names. This change empowers drivers to proactively influence the scheduling process, leading to improved throughput and more optimal scheduling decisions. The ResourceSlice controller toolkit automatically generates names that reflect the exact device ordering specified by the driver author.

Discoverable device metadata in containers

Workloads running on nodes with DRA devices often need to discover details about their allocated devices, such as PCI bus addresses or network interface configuration, without querying the Kubernetes API. With Device metadata, Kubernetes defines a standard protocol for how DRA drivers expose device attributes to containers as versioned JSON files at well-known paths. Drivers built with the DRA kubelet plugin library get this behavior transparently; they just provide the metadata and the library handles file layout, CDI bind-mounts, versioning, and lifecycle. This gives applications a consistent, driver-independent way to discover and consume device metadata, eliminating the need for custom controllers or looking up ResourceSlice objects to get metadata via attributes.

What’s next?

This release introduced a wealth of new Dynamic Resource Allocation (DRA) features, and the momentum is only building. As we look ahead, our roadmap focuses on maturing existing features toward beta and stable releases while hardening DRA’s performance, scalability, and reliability. A key priority over the coming cycles will be deep integration with workload aware and topology aware scheduling.

A big goal for us is to migrate users from Device Plugin to DRA, and we want you involved. Whether you are currently maintaining a driver or are just beginning to explore the possibilities, your input is vital. Partner with us to shape the next generation of resource management. Reach out today to collaborate on development, share feedback, or start building your first DRA driver.

Getting involved

A good starting point is joining the WG Device Management Slack channel and meetings, which happen at Americas/EMEA and EMEA/APAC friendly time slots.

Not all enhancement ideas are tracked as issues yet, so come talk to us if you want to help or have some ideas yourself! We have work to do at all levels, from difficult core changes to usability enhancements in kubectl, which could be picked up by newcomers.

Kubernetes v1.36: Server-Side Sharded List and Watch

As Kubernetes clusters grow to tens of thousands of nodes, controllers that watch high-cardinality resources like Pods face a scaling wall. Every replica of a horizontally scaled controller receives the full stream of events from the API server, paying the CPU, memory, and network cost to deserialize everything, only to discard the objects it is not responsible for. Scaling out the controller does not reduce per-replica cost; it multiplies it.

Kubernetes v1.36 introduces server-side sharded list and watch as an alpha feature (KEP-5866). With this feature enabled, the API server filters events at the source so that each controller replica receives only the slice of the resource collection it owns.

The problem with client-side sharding

Some controllers, such as kube-state-metrics, already support horizontal sharding. Each replica is assigned a portion of the keyspace and discards objects that do not belong to it. While this works functionally, it does not reduce the volume of data flowing from the API server:

  • N replicas x full event stream: every replica deserializes and processes every event, then throws away what it does not need.
  • Network bandwidth scales with replicas, not with shard size.
  • CPU spent on deserialization is wasted for the discarded fraction.

Server-side sharded list and watch solves this by moving the filtering upstream into the API server. Each replica tells the API server which hash range it owns, and the API server only sends matching events.

How it works

The feature adds a shardSelector field to ListOptions. Clients specify a hash range using the shardRange() function:

shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000')

The API server computes a deterministic 64-bit FNV-1a hash of the specified field and returns only objects whose hash falls within the range [start, end). This applies to both list responses and watch event streams. The hash function produces the same result across all API server instances, so the feature is safe to use with multiple API server replicas.

Currently supported field paths are object.metadata.uid and object.metadata.namespace.

Using sharded watches in controllers

Controllers typically use informers to list and watch resources. To shard the workload, each replica injects the shardSelector into the ListOptions used by its informers via WithTweakListOptions:

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/informers"
)

shardSelector := "shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000')"

factory := informers.NewSharedInformerFactoryWithOptions(client, resyncPeriod,
    informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
        opts.ShardSelector = shardSelector
    }),
)

For a 2-replica deployment, the selectors split the hash space in half:

// Replica 0: lower half of the hash space
"shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000')"

// Replica 1: upper half of the hash space
"shardRange(object.metadata.uid, '0x8000000000000000', '0x10000000000000000')"

A single replica can also cover non-contiguous ranges using ||:

"shardRange(object.metadata.uid, '0x0000000000000000', '0x4000000000000000') || " +
    "shardRange(object.metadata.uid, '0x8000000000000000', '0xc000000000000000')"

Verifying server support

When the API server honors a shard selector, the list response includes a shardInfo field in the response metadata that echoes back the applied selector:

{
  "kind": "PodList",
  "apiVersion": "v1",
  "metadata": {
    "resourceVersion": "10245",
    "shardInfo": {
      "selector": "shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000')"
    }
  },
  "items": [...]
}

If shardInfo is absent, the server did not honor the shard selector and the client received the complete, unfiltered collection. In this case, the client should be prepared to handle the full result set, for example by applying client-side filtering to discard objects outside its assigned shard range.

Getting involved

This feature is in alpha and requires enabling the ShardedListAndWatch feature gate on the API server. We are looking for feedback from controller authors and operators running large clusters.

If you have questions or feedback, join the #sig-api-machinery channel on Kubernetes Slack.

Kubernetes v1.36: Declarative Validation Graduates to GA

In Kubernetes v1.36, Declarative Validation for Kubernetes native types has reached General Availability (GA).

For users, this means more reliable, predictable, and better-documented APIs. By moving to a declarative model, the project also unlocks the future ability to publish validation rules via OpenAPI and integrate with ecosystem tools like Kubebuilder. For contributors and ecosystem developers, this replaces thousands of lines of handwritten validation code with a unified, maintainable framework.

This post covers why this migration was necessary, how the declarative validation framework works, and what new capabilities come with this GA release.

The Motivation: Escaping the "Handwritten" Technical Debt

For years, the validation of Kubernetes native APIs relied almost entirely on handwritten Go code. If a field needed to be bounded by a minimum value, or if two fields needed to be mutually exclusive, developers had to write explicit Go functions to enforce those constraints.

As the Kubernetes API surface expanded, this approach led to several systemic issues:

  1. Technical Debt: The project accumulated roughly 18,000 lines of boilerplate validation code. This code was difficult to maintain, error-prone, and required intense scrutiny during code reviews.
  2. Inconsistency: Without a centralized framework, validation rules were sometimes applied inconsistently across different resources.
  3. Opaque APIs: Handwritten validation logic was difficult to discover or analyze programmatically. This meant clients and tooling couldn't predictably know validation rules without consulting the source code or encountering errors at runtime.

The solution proposed by SIG API Machinery was Declarative Validation: using Interface Definition Language (IDL) tags (specifically +k8s: marker tags) directly within types.go files to define validation rules.

Enter validation-gen

At the core of the declarative validation feature is a new code generator called validation-gen. Just as Kubernetes uses generators for deep copies, conversions, and defaulting, validation-gen parses +k8s: tags and automatically generates the corresponding Go validation functions.

These generated functions are then registered seamlessly with the API scheme. The generator is designed as an extensible framework, allowing developers to plug in new "Validators" by describing the tags they parse and the Go logic they should produce.

A Comprehensive Suite of +k8s: Tags

The declarative validation framework introduces a comprehensive suite of marker tags that provide rich validation capabilities highly optimized for Go types. For a full list of supported tags, check out the official documentation. Here is a catalog of some of the most common tags you will now see in the Kubernetes codebase:

  • Presence: +k8s:optional, +k8s:required
  • Basic Constraints: +k8s:minimum=0, +k8s:maximum=100, +k8s:maxLength=16, +k8s:format=k8s-short-name
  • Collections: +k8s:listType=map, +k8s:listMapKey=type
  • Unions: +k8s:unionMember, +k8s:unionDiscriminator
  • Immutability: +k8s:immutable, +k8s:update=[NoSet, NoModify, NoClear]

Example Usage:

type ReplicationControllerSpec struct {
    // +k8s:optional
    // +k8s:minimum=0
    Replicas *int32 `json:"replicas,omitempty"`
}

By placing these tags directly above the field definitions, the constraints are self-documenting and immediately visible to anyone reading the type definitions.

Advanced Capabilities: "Ambient Ratcheting"

One of the most substantial outcomes of this work is that validation ratcheting is now a standard, ambient part of the API. In the past, if we needed to tighten validation, we had to first add handwritten ratcheting code, wait a release, and then tighten the validation to avoid breaking existing objects.

With declarative validation, this safety mechanism is built-in. If a user updates an existing object, the validation framework compares the incoming object with the oldObject. If a specific field's value is semantically equivalent to its prior state (i.e., the user didn't change it), the new validation rule is bypassed. This "ambient ratcheting" means we can loosen or tighten validation immediately and in the least disruptive way possible.

Scaling API Reviews with kube-api-linter

Reaching GA required absolute confidence in the generated code, but our vision extends beyond just validation. Declarative validation is a key part of a comprehensive approach to making API review easier, more consistent, and highly scalable.

By moving validation rules out of opaque Go functions and into structured markers, we are empowering tools like kube-api-linter. This linter can now statically analyze API types and enforce API conventions automatically, significantly reducing the manual burden on SIG API Machinery reviewers and providing immediate feedback to contributors.

What's next?

With the release of Kubernetes v1.36, Declarative Validation graduates to General Availability (GA). As a stable feature, the associated DeclarativeValidation feature gate is now enabled by default. It has become the primary mechanism for adding new validation rules to Kubernetes native types.

Looking forward, the project is committed to adopting declarative validation even more extensively. This includes migrating the remaining legacy handwritten validation code for established APIs and requiring its use for all new APIs and new fields. This ongoing transition will continue to shrink the codebase's complexity while enhancing the consistency and reliability of the entire Kubernetes API surface.

Beyond the core migration, declarative validation also unlocks an exciting future for the broader ecosystem. Because validation rules are now defined as structured markers rather than opaque Go code, they can be parsed and reflected in the OpenAPI schemas published by the Kubernetes API server. This paves the way for tools like kubectl, client libraries, and IDEs to perform rich client-side validation before a request is ever sent to the cluster. The same declarative framework can also be consumed by ecosystem tools like Kubebuilder, enabling a more consistent developer experience for authors of Custom Resource Definitions (CRDs).

Getting involved

The migration to declarative validation is an ongoing effort. While the framework itself is GA, there is still work to be done migrating older APIs to the new declarative format.

If you are interested in contributing to the core of Kubernetes API Machinery, this is a fantastic place to start. Check out the validation-gen documentation, look for issues tagged with sig/api-machinery, and join the conversation in the #sig-api-machinery and #sig-api-machinery-dev-tools channels on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/). You can also attend the SIG API Machinery meetings to get involved directly.

Acknowledgments

A huge thank you to everyone who helped bring this feature to GA:

And the many others across the Kubernetes community who contributed along the way.

Welcome to the declarative future of Kubernetes validation!

Kubernetes v1.36: Admission Policies That Can't Be Deleted

If you've ever tried to enforce a security policy across a fleet of Kubernetes clusters, you've probably run into a frustrating chicken-and-egg problem. Your admission policies are API objects, which means they don't exist until someone creates them, and they can be deleted by anyone with the right permissions. There's always a window during cluster bootstrap where your policies aren't active yet, and there's no way to prevent a privileged user from removing them.

Kubernetes v1.36 introduces an alpha feature that addresses this: manifest-based admission control. It lets you define admission webhooks and CEL-based policies as files on disk, loaded by the API server at startup, before it serves any requests.

The gap we're closing

Most Kubernetes policy enforcement today works through the API. You create a ValidatingAdmissionPolicy or a webhook configuration as an API object, and the admission controller picks it up. This works well in steady state, but it has some fundamental limitations.

During cluster bootstrap, there's a gap between when the API server starts serving requests and when your policies are created and active. If you're restoring from a backup or recovering from an etcd failure, that gap can be significant.

There's also a self-protection problem. Admission webhooks and policies can't intercept operations on their own configuration resources. Kubernetes skips invoking webhooks on types like ValidatingWebhookConfiguration to avoid circular dependencies. That means a sufficiently privileged user can delete your critical admission policies, and there's nothing in the admission chain to stop them.

We - Kubernetes SIG API Machinery - wanted a way to say "these policies are always on, full stop."

How it works

You add a staticManifestsDir field to the AdmissionConfiguration file that you already pass to the API server via --admission-control-config-file. Point it at a directory, drop your policy YAML files in there, and the API server loads them before it starts serving.

apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: ValidatingAdmissionPolicy
  configuration:
    apiVersion: apiserver.config.k8s.io/v1
    kind: ValidatingAdmissionPolicyConfiguration
    staticManifestsDir: "/etc/kubernetes/admission/validating-policies/"

The manifest files are standard Kubernetes resource definitions. The only requirement is that all the objects that these manifests define must have names ending in .static.k8s.io. This reserved suffix prevents collisions with API-based configurations and makes it easy to tell where an admission decision came from when you're looking at metrics or audit logs.

Here's a complete example that denies privileged containers outside kube-system:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: "deny-privileged.static.k8s.io"
  annotations:
    kubernetes.io/description: "Deny launching privileged pods, anywhere this policy is applied"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["pods"]
  variables:
  - name: allContainers
    expression: >-
      object.spec.containers +
      (has(object.spec.initContainers) ? object.spec.initContainers : []) +
      (has(object.spec.ephemeralContainers) ? object.spec.ephemeralContainers : [])
  validations:
  - expression: >-
      !variables.allContainers.exists(c,
      has(c.securityContext) && has(c.securityContext.privileged) &&
      c.securityContext.privileged == true)
    message: "Privileged containers are not allowed"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "deny-privileged-binding.static.k8s.io"
  annotations:
    kubernetes.io/description: "Bind deny-privileged policy to all namespaces except kube-system"
spec:
  policyName: "deny-privileged.static.k8s.io"
  validationActions:
  - Deny
  matchResources:
    namespaceSelector:
      matchExpressions:
      - key: "kubernetes.io/metadata.name"
        operator: NotIn
        values: ["kube-system"]

Protecting what couldn't be protected before

The part we're most excited about is the ability to intercept operations on admission configuration resources themselves.

With API-based admission, webhooks and policies are never invoked on types like ValidatingAdmissionPolicy or ValidatingWebhookConfiguration. That restriction exists for good reason: if a webhook could reject changes to its own configuration, you could end up locked out with no way to fix it through the API.

Manifest-based policies don't have that problem. If a bad policy is blocking something it shouldn't, you fix the file on disk and the API server picks up the change. There's no circular dependency because the recovery path doesn't go through the API.

This means you can write a manifest-based policy that prevents deletion of your critical API-based admission policies. For platform teams managing shared clusters, this is a significant improvement. You can now guarantee that your baseline security policies can't be removed by a cluster admin, accidentally or otherwise.

Here's what that looks like in practice. This policy prevents any modification or deletion of admission resources that carry the platform.example.com/protected: "true" label:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: "protect-policies.static.k8s.io"
  annotations:
    kubernetes.io/description: "Prevent modification or deletion of protected admission resources"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: ["admissionregistration.k8s.io"]
      apiVersions: ["*"]
      operations: ["DELETE", "UPDATE"]
      resources:
      - "validatingadmissionpolicies"
      - "validatingadmissionpolicybindings"
      - "validatingwebhookconfigurations"
      - "mutatingwebhookconfigurations"
  validations:
  - expression: >-
      !has(oldObject.metadata.labels) ||
      !('platform.example.com/protected' in oldObject.metadata.labels) ||
      oldObject.metadata.labels['platform.example.com/protected'] != 'true'
    message: "Protected admission resources cannot be modified or deleted"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "protect-policies-binding.static.k8s.io"
  annotations:
    kubernetes.io/description: "Bind protect-policies policy to all admission resources"
spec:
  policyName: "protect-policies.static.k8s.io"
  validationActions:
  - Deny

With this in place, any API-based admission policy or webhook configuration labeled platform.example.com/protected: "true" is shielded from tampering. The protection itself lives on disk and can't be removed through the API.

A few things to know

Manifest-based configurations are intentionally self-contained. They can't reference API resources, which means no paramKind for policies, no Service references for admission webhooks (instead they are URL-only), and bindings may only reference policies in the same manifest set. These restrictions exist because the configurations need to work without any cluster state, including at startup before etcd is available.

If you run multiple API server instances, each one loads its own manifest files independently. There's no cross-server synchronization built in. This is the same model as other file-based API server configurations like encryption at rest. When this feature is enabled, Kubernetes exposes a configuration hash as a label on relevant metrics, so you can detect drift.

Files are watched for changes at runtime, so you don't need to restart the API server to update policies. If you update a manifest file, the API server validates the new configuration and swaps it in atomically. If validation fails, it keeps the previous good configuration and logs the error. This means you can roll out policy changes across your fleet using standard configuration management tools (Ansible, Puppet, or even mounted ConfigMaps) without any API server downtime.

The initial load at startup is stricter: if any manifest is invalid, the API server won't start. This is intentional. At startup, failing fast is safer than running without your expected policies.

Try it out

To try this in Kubernetes v1.36:

  1. Enable the ManifestBasedAdmissionControlConfig feature gate for each kube-apiserver.
  2. Create a directory with your static manifest files. If you need to mount that in to the Pod where the API server runs, do that too. Read-only is fine.
  3. Configure staticManifestsDir in your AdmissionConfiguration with the directory path.
  4. Start the API server with --admission-control-config-file pointing to your AdmissionConfiguration file.

The full documentation is at Manifest-Based Admission Control, and you can follow KEP-5793 for ongoing progress.

We'd love to hear your feedback. Reach out on the #sig-api-machinery channel on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/).

How to get involved

If you're interested in contributing to this feature or other SIG API Machinery projects, join us on #sig-api-machinery on Kubernetes Slack. You're also welcome to attend the SIG API Machinery meetings, held every other Wednesday.

Kubernetes v1.36: Pod-Level Resource Managers (Alpha)

Kubernetes v1.36 introduces Pod-Level Resource Managers as an alpha feature, bringing a more flexible and powerful resource management model to performance-sensitive workloads. This enhancement extends the kubelet's Topology, CPU, and Memory Managers to support pod-level resource specifications (.spec.resources), evolving them from a strictly per-container allocation model to a pod-centric one.

Why do we need pod-level resource managers?

When running performance-critical workloads such as machine learning (ML) training, high-frequency trading applications, or low-latency databases, you often need exclusive, NUMA-aligned resources for your primary application containers to ensure predictable performance.

However, modern Kubernetes pods rarely consist of just one container. They frequently include sidecar containers for logging, monitoring, service meshes, or data ingestion.

Before this feature, this created a trade-off, to get NUMA-aligned, exclusive resources for your main application, you had to allocate exclusive, integer-based CPU resources to every container in the pod. This might be wasteful for lightweight sidecars. If you didn't do this, you forfeited the pod's Guaranteed Quality of Service (QoS) class entirely, losing the performance benefits.

Introducing pod-level resource managers

Enabling pod-level resources support for the resource managers (via the PodLevelResourceManagers and PodLevelResources feature gates) allows the kubelet to create hybrid resource allocation models. This brings flexibility and efficiency to high-performance workloads without sacrificing NUMA alignment.

Real-world use cases

Here are a few practical scenarios demonstrating how this feature can be applied, depending on the configured Topology Manager scope:

1. Tightly-coupled database (Topology manager's pod scope)

Consider a latency-sensitive database pod that includes a main database container, a local metrics exporter, and a backup agent sidecar.

When configured with the pod Topology Manager scope, the kubelet performs a single NUMA alignment based on the entire pod's budget. The database container gets its exclusive CPU and memory slices from that NUMA node. The remaining resources from the pod's budget form a new pod shared pool. The metrics exporter and backup agent run in this pod shared pool. They share resources with each other, but they are strictly isolated from the database's exclusive slices and the rest of the node.

This allows you to safely co-locate auxiliary containers on the same NUMA node as your primary workload without wasting dedicated cores on them.

apiVersion: v1
kind: Pod
metadata:
  name: tightly-coupled-database
spec:
  # Pod-level resources establish the overall budget and NUMA alignment size.
  resources:
    requests:
      cpu: "8"
      memory: "16Gi"
    limits:
      cpu: "8"
      memory: "16Gi"
  initContainers:
  - name: metrics-exporter
    image: metrics-exporter:v1
    restartPolicy: Always
  - name: backup-agent
    image: backup-agent:v1
    restartPolicy: Always
  containers:
  - name: database
    image: database:v1
    # This Guaranteed container gets an exclusive 6 CPU slice from the pod's budget.
    # The remaining 2 CPUs and 4Gi memory form the pod shared pool for the sidecars.
    resources:
      requests:
        cpu: "6"
        memory: "12Gi"
      limits:
        cpu: "6"
        memory: "12Gi"

2. ML workload with infrastructure sidecars (Topology manager's container scope)

Imagine a pod running a GPU-accelerated ML training workload alongside a generic service mesh sidecar.

Under the container Topology Manager scope, the kubelet evaluates each container individually. You can grant the ML container exclusive, NUMA-aligned CPUs and Memory for maximum performance. Meanwhile, the service mesh sidecar doesn't need to be NUMA-aligned; it can run in the general node-wide shared pool. The collective resource consumption is still safely bounded by the overall pod limits, but you only allocate NUMA-aligned, exclusive resources to the specific containers that actually require them.

apiVersion: v1
kind: Pod
metadata:
  name: ml-workload
spec:
  # Pod-level resources establish the overall budget constraint.
  resources:
    requests:
      cpu: "4"
      memory: "8Gi"
    limits:
      cpu: "4"
      memory: "8Gi"
  initContainers:
  - name: service-mesh-sidecar
    image: service-mesh:v1
    restartPolicy: Always
  containers:
  - name: ml-training
    image: ml-training:v1
    # Under the 'container' scope, this Guaranteed container receives exclusive,
    # NUMA-aligned resources, while the sidecar runs in the node's shared pool.
    resources:
      requests:
        cpu: "3"
        memory: "6Gi"
      limits:
        cpu: "3"
        memory: "6Gi"

CPU quotas (CFS) and isolation

When running these mixed workloads within a pod, isolation is enforced differently depending on the allocation:

  • Exclusive containers: Containers granted exclusive CPU slices have their CPU CFS quota enforcement disabled at the container level, allowing them to run without being throttled by the Linux scheduler.
  • Pod shared pool containers: Containers falling into the pod shared pool have CPU CFS quotas enforced at the pod level, ensuring they do not consume more than the leftover pod budget.

How to enable Pod-Level Resource Managers

Using this feature requires Kubernetes v1.36 or newer. To enable it, you must configure the kubelet with the appropriate feature gates and policies:

  1. Enable the PodLevelResources and PodLevelResourceManagers feature gates.
  2. Configure the Topology Manager with a policy other than none (i.e., best-effort, restricted, or single-numa-node).
  3. Set the Topology Manager scope to either pod or container using the topologyManagerScope field in the KubeletConfiguration.
  4. Configure the CPU Manager with the static policy.
  5. Configure the Memory Manager with the Static policy.

Observability

To help cluster administrators monitor and debug these new allocation models, we have introduced several new kubelet metrics when the feature gate is enabled:

  • resource_manager_allocations_total: Counts the total number of exclusive resource allocations performed by a manager. The source label ("pod" or "node") distinguishes between allocations drawn from the node-level pool versus a pre-allocated pod-level pool.
  • resource_manager_allocation_errors_total: Counts errors encountered during exclusive resource allocation, distinguished by the intended allocation source ("pod" or "node").
  • resource_manager_container_assignments: Tracks the cumulative number of containers running with specific assignment types. The assignment_type label ("node_exclusive", "pod_exclusive", "pod_shared") provides visibility into how workloads are distributed.

Current limitations and caveats

While this feature opens up new possibilities, there are a few things to keep in mind during its alpha phase. Be sure to review the Limitations and caveats in the official documentation for full details on compatibility, requirements, and downgrade instructions.

Getting started and providing feedback

For a deep dive into the technical details and configuration of this feature, check out the official concept documentation:

To learn more about the overall pod-level resources feature and how to assign resources to pods, see:

As this feature moves through Alpha, your feedback is invaluable. Please report any issues or share your experiences via the standard Kubernetes communication channels:

Kubernetes v1.36: In-Place Vertical Scaling for Pod-Level Resources Graduates to Beta

Following the graduation of Pod-Level Resources to Beta in v1.34 and the General Availability (GA) of In-Place Pod Vertical Scaling in v1.35, the Kubernetes community is thrilled to announce that In-Place Pod-Level Resources Vertical Scaling has graduated to Beta in v1.36!

This feature is now enabled by default via the InPlacePodLevelResourcesVerticalScaling feature gate. It allows users to update the aggregate Pod resource budget (.spec.resources) for a running Pod, often without requiring a container restart.

Why Pod-level in-place resize?

The Pod-level resource model simplified management for complex Pods (such as those with sidecars) by allowing containers to share a collective pool of resources. In v1.36, you can now adjust this aggregate boundary on-the-fly.

This is particularly useful for Pods where containers do not have individual limits defined. These containers automatically scale their effective boundaries to fit the newly resized Pod-level dimensions, allowing you to expand the shared pool during peak demand without manual per-container recalculations.

Resource inheritance and the resizePolicy

When a Pod-level resize is initiated, the Kubelet treats the change as a resize event for every container that inherits its limits from the Pod-level budget. To determine whether a restart is required, the Kubelet consults the resizePolicy defined within individual containers:

  • Non-disruptive Updates: If a container's restartPolicy is set to NotRequired, the Kubelet attempts to update the cgroup limits dynamically via the Container Runtime Interface (CRI).
  • Disruptive Updates: If set to RestartContainer, the container will be restarted to apply the new aggregate boundary safely.

Note: Currently, resizePolicy is not supported at the Pod level. The Kubelet always defers to individual container settings to decide if an update can be applied in-place or requires a restart.

Example: Scaling a shared resource pool

In this scenario, a Pod is defined with a 2 CPU pod-level limit. Because the individual containers do not have their own limits defined, they share this total pool.

1. Initial Pod specification

apiVersion: v1
kind: Pod
metadata:
  name: shared-pool-app
spec:
  resources: # Pod-level limits
    limits:
      cpu: "2"
      memory: "4Gi"
  containers:
  - name: main-app
    image: my-app:v1
    resizePolicy: [{resourceName: "cpu", restartPolicy: "NotRequired"}]
  - name: sidecar
    image: logger:v1
    resizePolicy: [{resourceName: "cpu", restartPolicy: "NotRequired"}]

2. The resize operation

To double the CPU capacity to 4 CPUs, apply a patch using the resize subresource:

kubectl patch pod shared-pool-app --subresource resize --patch \
  '{"spec":{"resources":{"limits":{"cpu":"4"}}}}'

Node-Level reality: feasibility and safety

Applying a resize patch is only the first step. The Kubelet performs several checks and follows a specific sequence to ensure node stability:

1. The feasibility check

Before admitting a resize, the Kubelet verifies if the new aggregate request fits within the Node's allocatable capacity. If the Node is overcommitted, the resize is not ignored; instead, the PodResizePending condition will reflect a Deferred or Infeasible status, providing immediate feedback on why the "envelope" hasn't grown.

2. Update sequencing

To prevent resource "overshoot", the Kubelet coordinates the cgroup updates in a specific order:

  • When Increasing: The Pod-level cgroup is expanded first, creating the "room" before the individual container cgroups are enlarged.
  • When Decreasing: The container cgroups are throttled first, and only then is the aggregate Pod-level cgroup shrunken.

Observability: tracking resize status

With the move to Beta, Kubernetes uses Pod Conditions to track the lifecycle of a resize:

  • PodResizePending: The spec is updated, but the Node hasn't admitted the change (e.g., due to capacity).
  • PodResizeInProgress: The Node has admitted the resize (status.allocatedResources) but the changes aren't yet fully applied to the cgroups (status.resources).
status:
  allocatedResources:
    cpu: "4"
  resources:
    limits:
      cpu: "4"
  conditions:
  - type: PodResizeInProgress
    status: "True"

Constraints and requirements

  • cgroup v2 Only: Required for accurate aggregate enforcement.
  • CRI Support: Requires a container runtime that supports the UpdateContainerResources CRI call (e.g., containerd v2.0+ or CRI-O).
  • Feature Gates: Requires PodLevelResources, InPlacePodVerticalScaling, InPlacePodLevelResourcesVerticalScaling, and NodeDeclaredFeatures.
  • Linux Only: Currently exclusive to Linux-based nodes.

What's next?

As we move toward General Availability (GA), the community is focusing on Vertical Pod Autoscaler (VPA) Integration, enabling VPA to issue Pod-level resource recommendations and trigger in-place actuation automatically.

Getting started and providing feedback

We encourage you to test this feature and provide feedback via the standard Kubernetes communication channels:

Kubernetes v1.36: Tiered Memory Protection with Memory QoS

On behalf of SIG Node, we are pleased to announce updates to the Memory QoS feature (alpha) in Kubernetes v1.36. Memory QoS uses the cgroup v2 memory controller to give the kernel better guidance on how to treat container memory. It was first introduced in v1.22 and updated in v1.27. In Kubernetes v1.36, we're introducing: opt-in memory reservation, tiered protection by QoS class, observability metrics, and kernel-version warning for memory.high.

What's new in v1.36

Opt-in memory reservation with memoryReservationPolicy

v1.36 separates throttling from reservation. Enabling the feature gate turns on memory.high throttling (the kubelet sets memory.high based on memoryThrottlingFactor, default 0.9), but memory reservation is now controlled by a separate kubelet configuration field:

  • None (default): no memory.min or memory.low is written. Throttling via memory.high still works.
  • TieredReservation: the kubelet writes tiered memory protection based on the Pod's QoS class:

Guaranteed Pods get hard protection via memory.min. For example, a Guaranteed Pod requesting 512 MiB of memory results in:

$ cat /sys/fs/cgroup/kubepods.slice/kubepods-pod6a4f2e3b_1c9d_4a5e_8f7b_2d3e4f5a6b7c.slice/memory.min
536870912

The kernel will not reclaim this memory under any circumstances. If it cannot honor the guarantee, it invokes the OOM killer on other processes to free pages.

Burstable Pods get soft protection via memory.low. For the same 512 MiB request on a Burstable Pod:

$ cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod8b3c7d2e_4f5a_6b7c_9d1e_3f4a5b6c7d8e.slice/memory.low
536870912

The kernel avoids reclaiming this memory under normal pressure, but may reclaim it if the alternative is a system-wide OOM.

BestEffort Pods get neither memory.min nor memory.low. Their memory remains fully reclaimable.

Comparison with v1.27 behavior

In earlier versions, enabling the MemoryQoS feature gate immediately set memory.min for every container with a memory request. memory.min is a hard reservation that the kernel will not reclaim, regardless of memory pressure.

Consider a node with 8 GiB of RAM where Burstable Pod requests total 7 GiB. In earlier versions, that 7 GiB would be locked as memory.min, leaving little headroom for the kernel, system daemons, or BestEffort workloads and increasing the risk of OOM kills.

With v1.36 tiered reservation, those Burstable requests map to memory.low instead of memory.min. Under normal pressure, the kernel still protects that memory, but under extreme pressure it can reclaim part of it to avoid system-wide OOM. Only Guaranteed Pods use memory.min, which keeps hard reservation lower.

With memoryReservationPolicy in v1.36, you can enable throttling first, observe workload behavior, and opt into reservation when your node has enough headroom.

Observability metrics

Two alpha-stability metrics are exposed on the kubelet /metrics endpoint:

MetricDescription
kubelet_memory_qos_node_memory_min_bytesTotal memory.min across Guaranteed Pods
kubelet_memory_qos_node_memory_low_bytesTotal memory.low across Burstable Pods

These are useful for capacity planning. If kubelet_memory_qos_node_memory_min_bytes is creeping toward your node's physical memory, you know hard reservation is getting tight.

$ curl -sk https://localhost:10250/metrics | grep memory_qos
# HELP kubelet_memory_qos_node_memory_min_bytes [ALPHA] Total memory.min in bytes for Guaranteed pods
kubelet_memory_qos_node_memory_min_bytes 5.36870912e+08
# HELP kubelet_memory_qos_node_memory_low_bytes [ALPHA] Total memory.low in bytes for Burstable pods
kubelet_memory_qos_node_memory_low_bytes 2.147483648e+09

Kernel version check

On kernels older than 5.9, memory.high throttling can trigger the kernel livelock issue. The bug was fixed in kernel 5.9. In v1.36, when the feature gate is enabled, the kubelet checks the kernel version at startup and logs a warning if it is below 5.9. The feature continues to work — this is informational, not a hard block.

How Kubernetes maps Memory QoS to cgroup v2

Memory QoS uses four cgroup v2 memory controller interfaces:

  • memory.max: hard memory limit — unchanged from previous versions
  • memory.min: hard memory protection — with TieredReservation, set only for Guaranteed Pods
  • memory.low: soft memory protection — set for Burstable Pods with TieredReservation
  • memory.high: memory throttling threshold — unchanged from previous versions

The following table shows how Kubernetes container resources map to cgroup v2 interfaces when memoryReservationPolicy: TieredReservation is configured. With the default memoryReservationPolicy: None, no memory.min or memory.low values are set.

QoS Classmemory.minmemory.lowmemory.highmemory.max
GuaranteedSet to requests.memory
(hard protection)
Not setNot set
(requests == limits, so throttling is not useful)
Set to limits.memory
BurstableNot setSet to requests.memory
(soft protection)
Calculated based on
formula with throttling factor
Set to limits.memory
(if specified)
BestEffortNot setNot setCalculated based on
node allocatable memory
Not set

Cgroup hierarchy

cgroup v2 requires that a parent cgroup's memory protection is at least as large as the sum of its children's. The kubelet maintains this by setting memory.min on the kubepods root cgroup to the sum of all Guaranteed and Burstable Pod memory requests, and memory.low on the Burstable QoS cgroup to the sum of all Burstable Pod memory requests. This way the kernel can enforce the per-container and per-pod protection values correctly.

The kubelet manages pod-level and QoS-class cgroups directly using the runc libcontainer library, while container-level cgroups are managed by the container runtime (containerd or CRI-O).

How do I use it?

Prerequisites

  1. Kubernetes v1.36 or later
  2. Linux with cgroup v2. Kernel 5.9 or higher is recommended — earlier kernels work but may experience the livelock issue. You can verify cgroup v2 is active by running mount | grep cgroup2.
  3. A container runtime that supports cgroup v2 (containerd 1.6+, CRI-O 1.22+)

Configuration

To enable Memory QoS with tiered protection:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true
memoryReservationPolicy: TieredReservation  # Options: None (default), TieredReservation
memoryThrottlingFactor: 0.9  # Optional: default is 0.9

If you want memory.high throttling without memory protection, omit memoryReservationPolicy or set it to None:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true
memoryReservationPolicy: None  # This is the default

How can I learn more?

Getting involved

This feature is driven by SIG Node. If you are interested in contributing or have feedback, you can find us on Slack (#sig-node), the mailing list, or at the regular SIG Node meetings. Please file bugs at kubernetes/kubernetes and enhancement proposals at kubernetes/enhancements.

Kubernetes v1.36: Staleness Mitigation and Observability for Controllers

Staleness in Kubernetes controllers is a problem that affects many controllers, and is something may affect controller behavior in subtle ways. It is usually not until it is too late, when a controller in production has already taken incorrect action, that staleness is found to be an issue due to some underlying assumption made by the controller author. Some issues caused by staleness include controllers taking incorrect actions, controllers not taking action when they should, and controllers taking too long to take action. I am excited to announce that Kubernetes v1.36 includes new features that help mitigate staleness in controllers and provide better observability into controller behavior.

What is staleness?

Staleness in controllers comes from an outdated view of the world inside of the controller cache. In order to provide a fast user experience, controllers typically maintain a local cache of the state of the cluster. This cache is populated by watching the Kubernetes API server for changes to objects that the controller cares about. When the controller needs to take action, it will first check its cache to see if it has the latest information. If it does not, it will then update its cache by watching the API server for changes to objects that the controller cares about. This process is known as reconciliation.

However, there are some cases where the controller's cache may be outdated. For example, if the controller is restarted, it will need to rebuild its cache by watching the API server for changes to objects that the controller cares about. During this time, the controller's cache will be outdated, and it will not be able to take action. Additionally, if the API server is down, the controller's cache will not be updated, and it will not be able to take action. These are just a few examples of cases where the controller's cache may be outdated.

Improvements in 1.36

Kubernetes v1.36 includes improvements in both client-go as well as implementations of highly contended controllers in kube-controller-manager, using those client-go improvements.

client-go improvements

In client-go, the project added atomic FIFO processing (feature gate name AtomicFIFO), which is on top of the existing FIFO queue implementation. The new approach allows for the queue to atomically handle operations that are recieved in batches, such as the initial set of objects from a list operation that an informer uses to populate its cache. This ensures that the queue is always in a consistent state, even when events come out of order. Prior to this, events were added to the queue in the order that they were received, which could lead to an inconsistent state in the cache that does not accurately reflect the state of the cluster.

With this change, you can now ensure that the queue is always in a consistent state, even when events come out of order. To take advantage of this, clients using client-go can now introspect into the cache to determine the latest resource version that the controller cache has seen. This is done with the newly added function LastStoreSyncResourceVersion() implemented on the Store interface here. This function is the basis for the staleness mitigation features in kube-controller-manager.

kube-controller-manager improvements

In kube-controller-manager, the v1.36 release has added the ability for 4 different controllers to use this new capability. The controllers are:

  1. DaemonSet controller
  2. StatefulSet controller
  3. ReplicaSet controller
  4. Job controller

These controllers all act on pods, which in most cases are under the highest amount of contention in a cluster. The changes are on by default for these controllers, and can be disabled by setting the feature gates StaleControllerConsistency<API type> to false for the specific controller you wish to disable it for. For example, to disable the feature for the DaemonSet controller, you would set the feature gate StaleControllerConsistencyDaemonSet to false.

When the relevant feature gate is enabled, the controller will first check the latest resource version of the cache before taking action. If the latest resource version of the cache is lower than what the controller has written to the API server for the object it is trying to reconcile, the controller will not take action. This is because the controller's cache is outdated, and it does not have the latest information about the state of the cluster.

Use for informer authors

Informer authors using client-go can also immediately take advantage of these improvements. See an example of how to use this feature in the ReplicaSet informer. This PR shows how to use the new feature to check if the informer's cache is stale before taking action. The client-go library provides a ConsistencyStore data structure that queries the store and compares the latest resource version of the cache with the written resource version of the object.

The ReplicaSet controller tracks both the ReplicaSet's resource version and the resource version of the pods that the ReplicaSet manages. For a specific ReplicaSet, it tracks the latest written resource version of the pods that the ReplicaSet owns as well as any writes to the ReplicaSet itself. If the latest resource version of the cache is lower than what the controller has written to the API server for the object it is trying to reconcile, the controller will not take action. This is because the controller's cache is outdated, and it does not have the latest information about the state of the cluster.

An informer author can use the ConsistencyStore to track the latest resource version of the objects that the informer cares about. It provides 3 main functions:

type ConsistencyStore interface {
	// WroteAt records that the given object was written at the given resource version.
	WroteAt(owningObj runtime.Object, uid types.UID, groupResource schema.GroupResource, resourceVersion string)

	// EnsureReady returns true if the cache is up to date for the given object.
	// It is used prior to reconciliation to decide whether to reconcile or not.
	EnsureReady(namespacedName types.NamespacedName) bool

	// Clear removes the given object from the consistency store.
	// It is used when an object is deleted.
	Clear(namespacedName types.NamespacedName, uid types.UID)
}
  1. WroteAt: This function is called by the controller when it writes to the API server for an object. It is used to record the latest resource version of the object that the controller has written to the API server. The owningObj is the object that the controller is reconciling, and the uid is the UID of that object. The resource version and GroupResource are the resource version and GroupResource of the object that the controller has written to the API server. The object is not explicitly tracked, since the controller only cares about waiting to catch up to the latest resource version of the written object.
  2. EnsureReady: This function is called by the controller to ensure that the cache is up to date for the object. It is used prior to reconciliation to decide whether to reconcile or not. It returns true if the cache is up to date for the object, and false otherwise. It will use the information provided by WroteAt to determine if the cache is up to date.
  3. Clear: This function is called by the controller when an object is deleted. It is used to remove the object from the consistency store. This is mostly used for cleanup when an object is deleted to prevent the consistency store from growing indefinitely.

The UID is used to distinguish between different objects that have the same name, such as when an object is deleted and then recreated. It is not needed for EnsureReady because the consistency store is only concerned with catching up to the latest resource version of the object, not the specific object. It is primarily used to ensure that the controller doesn't delete the entry for an object when it is recreated with a new UID.

With these 3 functions, an informer author can implement staleness mitigation in their controller.

Observability

In addition to the staleness mitigation features, the Kubernetes project has also added related instrumentation to kube-controller-manager in 1.36. These metrics are also enabled by default, and are controlled using the same set of feature gates.

Metrics

The following alpha metrics have been added to kube-controller-manager in 1.36:

stale_sync_skips_total: The number of times the controller has skipped a sync due to stale cache. This metric is exposed for each controller that uses the staleness mitigation feature with the subsystem of the controller.

This metric is exposed by the kube-controller-manager metrics endpoint, and can be used to monitor the health of the controller.

Along with this metric, client-go also emits metrics that expose the latest resource version of every shared informer with the subsystem of the informer. This allows you to see the latest resource version of each informer, and use that to determine if the controller's cache is stale, especially great for comparing against the resource version of the API server.

This metric is named store_resource_version and has the Group, Version, and Resource as labels.

What's next?

Kubernetes SIG API Machinery is excited to continue working on this feature and hope to bring it to more controllers in the future. We are also interested in hearing your feedback on this feature. Please let us know what you think in the comments below or by opening an issue on the Kubernetes GitHub repository.

We are also working with controller-runtime to enable this set of semantics for all controllers built with controller-runtime. This will allow any controller built with controller-runtime to gain the benefits of read your own writes, without having to implement the logic themselves.

Kubernetes v1.36: Mutable Pod Resources for Suspended Jobs (beta)

Kubernetes v1.36 promotes the ability to modify container resource requests and limits in the pod template of a suspended Job to beta. First introduced as alpha in v1.35, this feature allows queue controllers and cluster administrators to adjust CPU, memory, GPU, and extended resource specifications on a Job while it is suspended, before it starts or resumes running.

Why mutable pod resources for suspended Jobs?

Batch and machine learning workloads often have resource requirements that are not precisely known at Job creation time. The optimal resource allocation depends on current cluster capacity, queue priorities, and the availability of specialized hardware like GPUs.

Before this feature, resource requirements in a Job's pod template were immutable once set. If a queue controller like Kueue determined that a suspended Job should run with different resources, the only option was to delete and recreate the Job, losing any associated metadata, status, or history. This feature also provides a way to let a specific Job instance for a CronJob progress slowly with reduced resources, rather than outright failing to run if the cluster is heavily loaded.

Consider a machine learning training Job initially requesting 4 GPUs:

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job-example-abcd123
  labels:
    app.kubernetes.io/name: trainer
spec:
  suspend: true
  template:
    metadata:
      annotations:
        kubernetes.io/description: "ML training, ID abcd123"
    spec:
      containers:
      - name: trainer
        image: example-registry.example.com/training:2026-04-23T150405.678
        resources:
          requests:
            cpu: "8"
            memory: "32Gi"
            example-hardware-vendor.com/gpu: "4"
          limits:
            cpu: "8"
            memory: "32Gi"
            example-hardware-vendor.com/gpu: "4"
      restartPolicy: Never

A queue controller managing cluster resources might determine that only 2 GPUs are available. With this feature, the controller can update the Job's resource requests before resuming it:

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job-example-abcd123
  labels:
    app.kubernetes.io/name: trainer
spec:
  suspend: true
  template:
    metadata:
      annotations:
        kubernetes.io/description: "ML training, ID abcd123"
    spec:
      containers:
      - name: trainer
        image: example-registry.example.com/training:2026-04-23T150405.678
        resources:
          requests:
            cpu: "4"
            memory: "16Gi"
            example-hardware-vendor.com/gpu: "2"
          limits:
            cpu: "4"
            memory: "16Gi"
            example-hardware-vendor.com/gpu: "2"
      restartPolicy: Never

Once the resources are updated, the controller resumes the Job by setting spec.suspend to false, and the new Pods are created with the adjusted resource specifications.

How it works

The Kubernetes API server relaxes the immutability constraint on pod template resource fields specifically for suspended Jobs. No new API types have been introduced; the existing Job and pod template structures accommodate the change through relaxed validation.

The mutable fields are:

  • spec.template.spec.containers[*].resources.requests
  • spec.template.spec.containers[*].resources.limits
  • spec.template.spec.initContainers[*].resources.requests
  • spec.template.spec.initContainers[*].resources.limits

Resource updates are permitted when the following conditions are met:

  1. The Job has spec.suspend set to true.
  2. For a Job that was previously running and then suspended, all active Pods must have terminated (status.active equals 0) before resource mutations are accepted.

Standard resource validation still applies. For example, resource limits must be greater than or equal to requests, and extended resources must be specified as whole numbers where required.

What's new in beta

With the promotion to beta in Kubernetes v1.36, the MutablePodResourcesForSuspendedJobs feature gate is enabled by default. This means clusters running v1.36 can use this feature without any additional configuration on the API server.

Try it out

If your cluster is running Kubernetes v1.36 or later, this feature is available by default. For v1.35 clusters, enable the MutablePodResourcesForSuspendedJobs feature gate on the kube-apiserver.

You can test it by creating a suspended Job, updating its container resources using kubectl edit or a controller, and then resuming the Job:

# Create a suspended Job
kubectl apply -f my-job.yaml --server-side

# Edit the resource requests
kubectl edit job training-job-example-abcd123

# Resume the Job
kubectl patch job training-job-example-abcd123 -p '{"spec":{"suspend":false}}'

Considerations

Running Jobs that are suspended

If you suspend a Job that was already running, you must wait for all of that Job's active Pods to terminate before modifying resources. The API server rejects resource mutations while status.active is greater than zero. This prevents inconsistency between running Pods and the updated pod template.

Pod replacement policy

When using this feature with Jobs that may have failed Pods, consider setting podReplacementPolicy: Failed. This ensures that replacement Pods are only created after the previous Pods have fully terminated, preventing resource contention from overlapping Pods.

ResourceClaims

Dynamic Resource Allocation (DRA) resourceClaimTemplates remain immutable. If your workload uses DRA, you must recreate the claim templates separately to match any resource changes.

Getting involved

This feature was developed by SIG Apps This feature was developed by SIG Apps with input from WG Batch. Both groups welcome feedback as the feature progresses toward stable.

You can reach out through:

Kubernetes v1.36: Fine-Grained Kubelet API Authorization Graduates to GA

On behalf of Kubernetes SIG Auth and SIG Node, we are pleased to announce the graduation of fine-grained kubelet API authorization to General Availability (GA) in Kubernetes v1.36!

The KubeletFineGrainedAuthz feature gate was introduced as an opt-in alpha feature in Kubernetes v1.32, then graduated to beta (enabled by default) in v1.33. Now, the feature is generally available and the feature gate is locked to enabled. This feature enables more precise, least-privilege access control over the kubelet's HTTPS API, replacing the need to grant the overly broad nodes/proxy permission for common monitoring and observability use cases.

Motivation: the nodes/proxy problem

The kubelet exposes an HTTPS endpoint with several APIs that give access to data of varying sensitivity, including pod listings, node metrics, container logs, and, critically, the ability to execute commands inside running containers.

Prior to this feature, kubelet authorization used a coarse-grained model. When webhook authorization was enabled, almost all kubelet API paths were mapped to a single nodes/proxy subresource. This meant that any workload needing to read metrics or health status from the kubelet required nodes/proxy permission, the same permission that also grants the ability to execute arbitrary commands in any container running on the node.

What's wrong with that?

Granting nodes/proxy to monitoring agents, log collectors, or health-checking tools violates the principle of least privilege. If any of those workloads were compromised, an attacker would gain the ability to run commands in every container on the node. The nodes/proxy permission is effectively a node-level superuser capability, and granting it broadly dramatically increases the blast radius of a security incident.

This problem has been well understood in the community for years (see kubernetes/kubernetes#83465), and was the driving motivation behind this enhancement KEP-2862.

The nodes/proxy GET WebSocket RCE risk

The situation is more severe than it might appear at first glance. Security researchers demonstrated in early 2026 that nodes/proxy GET alone, which is the minimal read-only permission routinely granted to monitoring tools, can be abused to execute commands in any pod on reachable nodes.

The root cause is a mismatch between how WebSocket connections work and how the kubelet maps HTTP methods to RBAC verbs. The WebSocket protocol (RFC 6455) requires an HTTP GET request for the initial connection handshake. The kubelet maps this GET to the RBAC get verb and authorizes the request without performing a secondary check to confirm that CREATE permission is also present for the write operation that follows. Using a WebSocket client like websocat, an attacker can reach the kubelet's /exec endpoint directly on port 10250 and execute arbitrary commands:

websocat --insecure \
  --header "Authorization: Bearer $TOKEN" \
  --protocol v4.channel.k8s.io \
  "wss://$NODE_IP:10250/exec/default/nginx/nginx?output=1&error=1&command=id"

uid=0(root) gid=0(root) groups=0(root)

Fine-grained kubelet authorization: how it works

With KubeletFineGrainedAuthz, the kubelet now performs an additional, more specific authorization check before falling back to the nodes/proxy subresource. Several commonly used kubelet API paths are mapped to their own dedicated subresources:

kubelet APIResourceSubresource
/stats/*nodesstats
/metrics/*nodesmetrics
/logs/*nodeslog
/podsnodespods, proxy
/runningPods/nodespods, proxy
/healthznodeshealthz, proxy
/configznodesconfigz, proxy
/spec/*nodesspec
/checkpoint/*nodescheckpoint
all othersnodesproxy

For the endpoints that now have fine-grained subresources (/pods, /runningPods/, /healthz, /configz), the kubelet first sends a SubjectAccessReview for the specific subresource. If that check succeeds, the request is authorized. If it fails, the kubelet retries with the coarse-grained nodes/proxy subresource for backward compatibility.

This dual-check approach ensures a smooth migration path. Existing workloads with nodes/proxy permissions continue to work, while new deployments can adopt least-privilege access from day one.

What this means in practice

Consider a Prometheus node exporter or a monitoring DaemonSet that needs to scrape /metrics from the kubelet. Previously, you would need an RBAC ClusterRole like this:

# Old approach: overly broad
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring-agent
rules:
- apiGroups: [""]
  resources: ["nodes/proxy"]
  verbs: ["get"]

This grants the monitoring agent far more access than it needs. With fine-grained authorization, you can now scope the permissions precisely:

# New approach: least privilege
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring-agent
rules:
- apiGroups: [""]
  resources: ["nodes/metrics", "nodes/stats"]
  verbs: ["get"]

The monitoring agent can now read metrics and stats from the kubelet without ever being able to execute commands in containers.

Updated system:kubelet-api-admin ClusterRole

When RBAC authorization is enabled, the built-in system:kubelet-api-admin ClusterRole is automatically updated to include permissions for all the new fine-grained subresources. This ensures that cluster administrators who already use this role, including the API server's kubelet client, continue to have full access without any manual configuration changes.

The role now includes permissions for:

  • nodes/proxy
  • nodes/stats
  • nodes/metrics
  • nodes/log
  • nodes/spec
  • nodes/checkpoint
  • nodes/configz
  • nodes/healthz
  • nodes/pods

Upgrade considerations

Because the kubelet performs a dual authorization check (fine-grained first, then falling back to nodes/proxy), upgrading to v1.36 should be seamless for most clusters:

  • Existing workloads with nodes/proxy permissions continue to work without changes. The fallback to nodes/proxy ensures backward compatibility.
  • The API server always has nodes/proxy permissions via system:kubelet-api-admin, so kube-apiserver-to-kubelet communication is unaffected regardless of feature gate state.
  • Mixed-version clusters are handled gracefully. If a kubelet supports fine-grained authorization but the API server does not (or vice versa), nodes/proxy permissions serve as the fallback.

Verifying the feature is enabled

You can confirm that the feature is active on a given node by checking the kubelet metrics endpoint. Since the metrics endpoint on port 10250 requires authorization, you'll first need to create appropriate RBAC bindings for the pod or ServiceAccount making the request.

Step 1: Create a ServiceAccount and ClusterRole

apiVersion: v1
kind: ServiceAccount
metadata:
  name: kubelet-metrics-checker
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubelet-metrics-reader
rules:
- apiGroups: [""]
  resources: ["nodes/metrics"]
  verbs: ["get"]

Step 2: Bind the ClusterRole to the ServiceAccount

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubelet-metrics-checker
subjects:
- kind: ServiceAccount
  name: kubelet-metrics-checker
  namespace: default
roleRef:
  kind: ClusterRole
  name: kubelet-metrics-reader
  apiGroup: rbac.authorization.k8s.io

Apply both manifests:

kubectl apply -f serviceaccount.yaml
kubectl apply -f clusterrole.yaml
kubectl apply -f clusterrolebinding.yaml

Step 3: Run a pod with the ServiceAccount and check the feature flag

kubectl run kubelet-check \
  --image=curlimages/curl \
  --serviceaccount=kubelet-metrics-checker \
  --restart=Never \
  --rm -it \
  -- sh

Then from within the pod, retrieve the node IP and query the metrics endpoint:

# Get the token
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)

# Query the kubelet metrics and filter for the feature gate
curl -sk \
  --header "Authorization: Bearer $TOKEN" \
  https://$NODE_IP:10250/metrics \
  | grep kubernetes_feature_enabled \
  | grep KubeletFineGrainedAuthz

If the feature is enabled, you should see output like:

kubernetes_feature_enabled{name="KubeletFineGrainedAuthz",stage="GA"} 1

Note: Replace $NODE_IP with the IP address of the node you want to check. You can retrieve node IPs with kubectl get nodes -o wide.

The journey from alpha to GA

ReleaseStageDetails
v1.32AlphaFeature gate KubeletFineGrainedAuthz introduced, disabled by default
v1.33BetaEnabled by default; fine-grained checks for /pods, /runningPods/, /healthz, /configz
v1.36GAFeature gate locked to enabled; fine-grained kubelet authorization is always active

What's next?

With fine-grained kubelet authorization now GA, the Kubernetes community can begin recommending and eventually enforcing the use of specific subresources instead of nodes/proxy for monitoring and observability workloads. The urgency of this migration is underscored by research showing that nodes/proxy GET can be abused for unlogged remote code execution via the WebSocket protocol. This risk is present in the default RBAC configurations of dozens of widely deployed Helm charts. Over time, we expect:

  • Ecosystem adoption: Monitoring tools like Prometheus, Datadog agents, and other DaemonSets can update their default RBAC configurations to use nodes/metrics, nodes/stats, and nodes/pods instead of nodes/proxy. This directly eliminates the WebSocket RCE attack surface for those workloads.
  • Policy enforcement: Admission controllers and policy engines can flag or reject RBAC bindings that grant nodes/proxy when fine-grained alternatives exist, helping organizations adopt least-privilege access at scale.
  • Deprecation path: As adoption grows, nodes/proxy may eventually be deprecated for monitoring use cases, further reducing the attack surface of Kubernetes clusters.

Getting involved

This enhancement was driven by SIG Auth and SIG Node. If you are interested in contributing to the security and authorization features of Kubernetes, please join us:

We look forward to hearing your feedback and experiences with this feature!

Kubernetes v1.36: User Namespaces in Kubernetes are finally GA

After several years of development, User Namespaces support in Kubernetes reached General Availability (GA) with the v1.36 release. This is a Linux-only feature.

For those of us working on low level container runtimes and rootless technologies, this has been a long awaited milestone. We finally reached the point where "rootless" security isolation can be used for Kubernetes workloads.

This feature also enables a critical pattern: running workloads with privileges and still being confined in the user namespace. When hostUsers: false is set, capabilities like CAP_NET_ADMIN become namespaced, meaning they grant administrative power over container local resources without affecting the host. This effectively enables new use cases that were not possible before without running a fully privileged container.

The Problem with UID 0

A process running as root inside a container is also seen from the kernel as root on the host. If an attacker manages to break out of the container, whether through a kernel vulnerability or a misconfigured mount, they are root on the host.

While there are many security measures in place for running containers, these measures don't change the underlying identity of the process, it still has some "parts" of root.

The engine: ID-mapped mounts

The road to GA wasn't just about the Kubernetes API; it was about making the kernel work for us. In the early stages, one of the biggest blockers was volume ownership. If you mapped a container to a high UID range, the Kubelet had to recursively chown every file in the attached volume so the container could read/write them. For large volumes, this was such an expensive operation that destroyed startup performance.

The key enabler was ID-mapped mounts (introduced in Linux 5.12 and refined in later versions). Instead of rewriting file ownership on disk, the kernel remaps it at mount time.

When a volume is mounted into a Pod with User Namespaces enabled, the kernel performs a transparent translation of the UIDs (user ids) and GIDs (group ids). To the container, the files appear owned by UID 0. On disk, file ownership is unchanged — no chown is needed. This is an O(1) operation, instant and efficient.

Using it in Kubernetes v1.36

Using user namespaces is straightforward: all you need to do is set hostUsers: false in your Pod spec. No changes to your container images, no complex configuration. The interface remains the same one introduced during the Alpha phase. In the spec for a Pod (or PodTemplate), you explicitly opt-out of the host user namespace:

apiVersion: v1
kind: Pod
metadata:
  name: isolated-workload
spec:
  hostUsers: false
  containers:
  - name: app
    image: fedora:42
    securityContext:
      runAsUser: 0

For more details on how user namespaces work in practice and demos of CVEs rated HIGH mitigated, see the previous blog posts: User Namespaces alpha, User Namespaces stateful pods in alpha, User Namespaces beta, and User Namespaces enabled by default.

Getting involved

If you're interested in user namespaces or want to contribute, here are some useful links:

Acknowledgments

This feature has been years in the making: the first KEP was opened 10 years ago by other contributors, and we have been actively working on it for the last 6 years. We'd like to thank everyone who contributed across SIG Node, the container runtimes, and the Linux kernel. Special thanks to the reviewers and early adopters who helped shape the design through multiple alpha and beta cycles.

SELinux Volume Label Changes goes GA (and likely implications in v1.37)

If you run Kubernetes on Linux with SELinux in enforcing mode, plan ahead: a future release (anticipated to be v1.37) is expected to turn the SELinuxMount feature gate on by default. This makes volume setup faster for most workloads, but it can break applications that still depend on the older recursive relabeling model in subtle ways (for example, sharing one volume between privileged and unprivileged Pods on the same node). Kubernetes v1.36 is the right release to audit your cluster and fix or opt out of this change.

If your nodes do not use SELinux, nothing changes for you: the kubelet skips the whole SELinux logic when SELinux is unavailable or disabled in the Linux kernel. You can skip this article completely.

This blog builds on the earlier work described in the Kubernetes 1.27: Efficient SELinux Relabeling (Beta) post, where the SELinuxMountReadWriteOncePod feature gate was described. The problem to be addressed remains the same, however, this blog extends that same approach to all volumes.

The problem

Linux systems with Security Enhanced Linux (SELinux) enabled use labels attached to objects (for example, files and network sockets) to make access control decisions. Historically, the container runtime applies SELinux labels to a Pod and all its volumes. Kubernetes only passes the SELinux label from a Pod's securityContext fields to the container runtime.

The container runtime then recursively changes the SELinux label on all files that are visible to the Pod's containers. This can be time-consuming if there are many files on the volume, especially when the volume is on a remote filesystem.

Caution:

If a container uses subPath of a volume, only that subPath of the whole volume is relabeled. This allows two Pods that have two different SELinux labels to use the same volume, as long as they use different subpaths of it.

If a Pod does not have any SELinux label assigned in the Kubernetes API, the container runtime assigns a unique random label, so a process that potentially escapes the container boundary cannot access data of any other container on the host. The container runtime still recursively relabels all Pod volumes with this random SELinux label.

What Kubernetes is improving

Where the stack supports it, the kubelet can mount the volume with -o context=<label> so the kernel applies the correct label for all inodes on that mount without a recursive inode traversal. That path is gated by feature flags and requires, among other things, that the Pod expose enough of an SELinux label (for example spec.securityContext.seLinuxOptions.level) and that the volume driver opts in (for CSI, CSIDriver field spec.seLinuxMount: true).

The project rolled this out in phases:

  • ReadWriteOncePod volumes were handled under the SELinuxMountReadWriteOncePod feature gate, on by default since v1.28 and GA in v1.36.
  • Broader coverage was handled under the SELinuxMount flag, paired with the spec.securityContext.seLinuxChangePolicy field on Pods.

If a Pod and its volume meet all of the following conditions, Kubernetes will mount the volume directly with the right SELinux label. Such a mount will happen in a constant time and the container runtime will not need to recursively relabel any files on it. For such a mount to happen:

  1. The operating system must support SELinux. Without SELinux support detected, the kubelet and the container runtime do not do anything with regard to SELinux.

  2. The feature gate SELinuxMountReadWriteOncePod must be enabled. If you're running Kubernetes v1.36, the feature is enabled unconditionally.

  3. The Pod must use a PersistentVolumeClaim with applicable accessModes:

    • Either the volume has accessModes: ["ReadWriteOncePod"]
    • or the volume can use any other access mode(s), provided that the feature gates SELinuxChangePolicy and SELinuxMount are both enabled and the Pod has spec.securityContext.seLinuxChangePolicy set to nil (default) or as MountOption.

    The feature gate SELinuxMount is Beta and disabled by default in Kubernetes 1.36. All other SELinux-related feature gates are now General Availability (GA).

    With any of these feature gates disabled, SELinux labels will always be applied by the container runtime via recursively traversing through the volume (or its subPaths).

  4. The Pod must have at least seLinuxOptions.level assigned in its security context or all containers in that Pod must have it set in their container-level security contexts. Kubernetes will read the default user, role and type from the operating system defaults (typically system_u, system_r and container_t).

    Without Kubernetes knowing at least the SELinux level, the container runtime will assign a random level after the volumes are mounted. The container runtime will still relabel the volumes recursively in that case.

  5. The volume plugin or the CSI driver responsible for the volume supports mounting with SELinux mount options.

    These in-tree volume plugins support mounting with SELinux mount options: fc and iscsi.

    CSI drivers that support mounting with SELinux mount options must declare this capability in their CSIDriver instance by setting the seLinuxMount field.

    Volumes managed by other volume plugins or CSI drivers that do not set seLinuxMount: true will be recursively relabeled by the container runtime.

The breaking change

The SELinuxMount feature gate changes what volumes can be shared among multiple Pods in a subtle way.

Both of these cases work with recursive relabeling:

  1. Two Pods with different SELinux labels share the same volume, but each of them uses a different subPath to the volume.
  2. A privileged Pod and an unprivileged Pod share the same volume.

The above scenarios will not work with modern, target behavior for Kubernetes mounting when SELinux is active. Instead, one of these Pods will be stuck in ContainerCreating until the other Pod is terminated.

The first case is very niche and hasn't been seen in practice. Although the second case is still quite rare, this setup has been observed in applications. Kubernetes v1.36 offers metrics and events to identify these Pods and allows cluster administrators to opt out of the mount option through the Pod field spec.securityContext.seLinuxChangePolicy.

seLinuxChangePolicy

The new Pod field spec.securityContext.seLinuxChangePolicy specifies how the SELinux label is applied to all Pod volumes. In Kubernetes v1.36, this field is part of the stable Pod API.

There are three choices available:

field not set (default)
In Kubernetes v1.36, the behavior depends on whether the SELinuxMount feature gate is enabled. By default that feature gate is not enabled, and the SELinux label is applied recursively. If you enable that feature gate in your cluster, and all other conditions are met, labelling will be applied using the mount option.
Recursive
the SELinux label is applied recursively. This opts out from using the mount option.
MountOption
the SELinux label is applied using the mount option, if all other conditions are met. This choice is available only when the SELinuxMount feature gate is enabled.

SELinux warning controller (optional)

Kubernetes v1.36 provides a new controller within the control plane, selinux-warning-controller. This controller runs within the kube-controller-manager controller. To use it, you pass --controllers=*,selinux-warning-controller on the kube-controller-manager command line; you also must not have explicitly overridden the SELinuxChangePolicy feature gate to be disabled.

The controller watches all Pods in the cluster and emits an Event when it finds two Pods that share the same volume in a way that is not compatible with the SELinuxMount feature gate. All such conflicting Pods will receive an event, such as:

SELinuxLabel "system_u:system_r:container_t:s0:c98,c99" conflicts with pod my-other-pod that uses the same volume as this pod with SELinuxLabel "system_u:system_r:container_t:s0:c0,c1". If both pods land on the same node, only one of them may access the volume.

The actual Pod name may be censored when the conflicting Pods run in different namespaces to prevent leaking information across namespace boundaries.

The controller reports such an event even when these Pods don't run on the same node, to make sure all Pods work regardless of the Kubernetes scheduler decision. They could run on the same node next time.

In addition, the controller emits the metric selinux_warning_controller_selinux_volume_conflict that lists all current conflicts among Pods. The metric has labels that identify the conflicting Pods and their SELinux labels, such as:

selinux_warning_controller_selinux_volume_conflict{pod1_name="my-other-pod",pod1_namespace="default",pod1_value="system_u:object_r:container_file_t:s0:c0,c1",pod2_name="my-pod",pod2_namespace="default",pod2_value="system_u:object_r:container_file_t:s0:c0,c2",property="SELinuxLabel"} 1

There is a security consequence from enabling this opt-in controller: it may reveal namespace names, which are always present in the metric. The Kubernetes project assumes only cluster administrators can access kube-controller-manager metrics.

Suggested upgrade path

To ensure a smooth upgrade path from v1.36 to a release with SELinuxMount enabled (anticipated to be v1.37), we suggest you follow these steps:

  1. Enable selinux-warning-controller in the kube-controller-manager.
  2. Check the selinux_warning_controller_selinux_volume_conflict metric. It shows all potential conflicts between Pods. For each conflicting Pod (Deployment, StatefulSet, etc.), either apply the opt-out (set Pod's spec.securityContext.seLinuxChangePolicy: Recursive) or re-architect the application to remove such a conflict. For example, do your Pods really need to run as privileged?
  3. Check the volume_manager_selinux_volume_context_mismatch_warnings_total metric. This metric is emitted by the kubelet when it actually starts a Pod that runs when SELinuxMount is disabled, but such a Pod won't start when SELinuxMount is enabled. This metric lists the number of Pods that will experience a true conflict. Unfortunately, this metric does not expose the exact Pod name as a label. The full Pod name is available only in the selinux_warning_controller_selinux_volume_conflict metric.
  4. Once both metrics have been accounted for, upgrade to a Kubernetes version that has SELinuxMount enabled.

Consider using a MutatingAdmissionPolicy, a mutating webhook, or a policy engine like Kyverno or Gatekeeper to apply the opt-out to all Pods in a namespace or across the entire cluster.

When SELinuxMount is enabled, the kubelet will emit the metric volume_manager_selinux_volume_context_mismatch_errors_total with the number of Pods that could not start because their SELinux label conflicts with an existing Pod that uses the same volume. The exact Pod names should still be available in the selinux_warning_controller_selinux_volume_conflict metric, if the selinux-warning-controller is enabled.

Further reading

Acknowledgements

If you run into issues, have feedback, or want to contribute, find us on the Kubernetes Slack in #sig-node and #sig-storage or join a SIG Node or SIG Storage meetings.

Kubernetes v1.36: ハル (Haru)

Editors: Chad M. Crowell, Kirti Goyal, Sophia Ugochukwu, Swathi Rao, Utkarsh Umre

Similar to previous releases, the release of Kubernetes v1.36 introduces new stable, beta, and alpha features. The consistent delivery of high-quality releases underscores the strength of our development cycle and the vibrant support from our community.

This release consists of 70 enhancements. Of those enhancements, 18 have graduated to Stable, 25 are entering Beta, and 25 have graduated to Alpha.

There are also some deprecations and removals in this release; make sure to read about those.

We open 2026 with Kubernetes v1.36, a release that arrives as the season turns and the light shifts on the mountain. ハル (Haru) is a sound in Japanese that carries many meanings; among those we hold closest are 春 (spring), 晴れ (hare, clear skies), and 遥か (haruka, far-off, distant). A season, a sky, and a horizon. You will find all three in what follows.

The logo, created by avocadoneko / Natsuho Ide, draws inspiration from Katsushika Hokusai's Thirty-six Views of Mount Fuji (富嶽三十六景, Fugaku Sanjūrokkei), the same series that gave the world The Great Wave off Kanagawa. Our v1.36 logo reimagines one of the series' most celebrated prints, Fine Wind, Clear Morning (凱風快晴, Gaifū Kaisei), also known as Red Fuji (赤富士, Aka Fuji): the mountain lit red in a summer dawn, bare of snow after the long thaw. Thirty-six views felt like a fitting number to sit with at v1.36, and a reminder that even Hokusai didn't stop there.1 Keeping watch over the scene is the Kubernetes helm, set into the sky alongside the mountain.

At the foot of Fuji sit Stella (left) and Nacho (right), two cats with the Kubernetes helm on their collars, standing in for the role of komainu, the paired lion-dog guardians that watch over Japanese shrines. Paired, because nothing is guarded alone. Stella and Nacho stand in for a very much larger set of paws: the SIGs and working groups, the maintainers and reviewers, the people behind docs, blogs, and translations, the release team, first-time contributors taking their first steps, and lifelong contributors returning season after season. Kubernetes v1.36 is, as always, held up by many hands.

Brushed across Red Fuji in the logo is the calligraphy 晴れに翔け (hare ni kake), "soar into clear skies". It is the first half of a couplet that was too long to fit on the mountain:

晴れに翔け、未来よ明け
hare ni kake, asu yo ake
"Soar into clear skies; toward tomorrow's sunrise."2

That is the wish we carry for this release: to soar into clear skies, for the release itself, for the project, and for everyone who ships it together. The dawn breaking over Red Fuji is not an ending but a passage: this release carries us to the next, and that one to the one after, on toward horizons far beyond what any single view can hold.

1. The series was so popular that Hokusai added ten more prints, bringing the total to forty-six.
2. 未来 means "the future" in its widest sense, not just tomorrow but everything still to come. It is usually read mirai; here it takes the informal reading asu.

Spotlight on key updates

Kubernetes v1.36 is packed with new features and improvements. Here are a few select updates the Release Team would like to highlight!

Stable: Fine-grained API authorization

On behalf of Kubernetes SIG Auth and SIG Node, we are pleased to announce the graduation of fine-grained kubelet API authorization to General Availability (GA) in Kubernetes v1.36!

The KubeletFineGrainedAuthz feature gate was introduced as an opt-in alpha feature in Kubernetes v1.32, then graduated to beta (enabled by default) in v1.33. Now, the feature is generally available. This feature enables more precise, least-privilege access control over the kubelet's HTTPS API replacing the need to grant the overly broad nodes/proxy permission for common monitoring and observability use cases.

​​This work was done as a part of KEP #2862 led by SIG Auth and SIG Node.

Beta: Resource health status

Before the v1.34 release, Kubernetes lacked a native way to report the health of allocated devices, making it difficult to diagnose Pod crashes caused by hardware failures. Building on the initial alpha release in v1.31 which focused on Device Plugins, Kubernetes v1.36 expands this feature by promoting the allocatedResourcesStatus field within the .status for each Pod (to beta). This field provides a unified health reporting mechanism for all specialized hardware.

Users can now run kubectl describe pod to determine if a container's crash loop is due to an Unhealthy or Unknown device status, regardless of whether the hardware was provisioned via traditional plugins or the newer DRA framework. This enhanced visibility allows administrators and automated controllers to quickly identify faulty hardware and streamline the recovery of high-performance workloads.

This work was done as part of KEP #4680 led by SIG Node.

Alpha: Workload Aware Scheduling (WAS) features

Previously, the Kubernetes scheduler and job controllers managed pods as independent units, often leading to fragmented scheduling or resource waste for complex, distributed workloads. Kubernetes v1.36 introduces a comprehensive suite of Workload Aware Scheduling (WAS) features in Alpha, natively integrating the Job controller with a revised Workload API and a new decoupled PodGroup API, to treat related pods as a single logical entity.

Kubernetes v1.35 already supported gang scheduling by requiring a minimum number of pods to be schedulable before any were bound to nodes. v1.36 goes further with a new PodGroup scheduling cycle that evaluates the entire group atomically, either all pods in the group are bound together, or none are.

This work was done across several KEPs (including #4671, #5547, #5832, #5732, and #5710) led by SIG Scheduling and SIG Apps.

Features graduating to Stable

This is a selection of some of the improvements that are now stable following the v1.36 release.

Volume group snapshots

After several cycles in beta, VolumeGroupSnapshot support reaches General Availability (GA) in Kubernetes v1.36. This feature allows you to take crash-consistent snapshots across multiple PersistentVolumeClaims simultaneously. The support for volume group snapshots relies on a set of extension APIs for group snapshots. These APIs allow users to take crash consistent snapshots for a set of volumes. A key aim is to allow you to restore that set of snapshots to new volumes and recover your workload based on a crash consistent recovery point.

This work was done as part of KEP #3476 led by SIG Storage.

Mutable volume attach limits

In Kubernetes v1.36, the mutable CSINode allocatable feature graduates to stable. This enhancement allows Container Storage Interface (CSI) drivers to dynamically update the reported maximum number of volumes that a node can handle.

With this update, the kubelet can dynamically update a node's volume limits and capacity information. The kubelet adjusts these limits based on periodic checks or in response to resource exhaustion errors from the CSI driver, without requiring a component restart. This ensures the Kubernetes scheduler maintains an accurate view of storage availability, preventing pod scheduling failures caused by outdated volume limits.

This work was done as part of KEP #4876 led by SIG Storage.

API for external signing of ServiceAccount tokens

In Kubernetes v1.36, the external ServiceAccount token signer feature for service accounts graduates to stable, making it possible to offload token signing to an external system while still integrating cleanly with the Kubernetes API. Clusters can now rely on an external JWT signer for issuing projected service account tokens that follow the standard service account token format, including support for extended expiration when needed. This is especially useful for clusters that already rely on external identity or key management systems, allowing Kubernetes to integrate without duplicating key management inside the control plane.

The kube-apiserver is wired to discover public keys from the external signer, cache them, and validate tokens it did not sign itself, so existing authentication and authorization flows continue to work as expected. Over the alpha and beta phases, the API and configuration for the external signer plugin, path validation, and OIDC discovery were hardened to handle real-world deployments and rotation patterns safely.

With GA in v1.36, external ServiceAccount token signing is now a fully supported option for platforms that centralize identity and signing, simplifying integration with external IAM systems and reducing the need to manage signing keys directly inside the control plane.

This work was done as part of KEP #740 led by SIG Auth.

DRA features graduating to Stable

Part of the Dynamic Resource Allocation (DRA) ecosystem reaches full production maturity in Kubernetes v1.36 as key governance and selection features graduate to Stable. The transition of DRA admin access to GA provides a permanent, secure framework for cluster administrators to access and manage hardware resources globally, while the stabilization of prioritized lists ensures that resource selection logic remains consistent and predictable across all cluster environments.

Now, organizations can confidently deploy mission-critical hardware automation with the guarantee of long-term API stability and backward compatibility. These features empower users to implement sophisticated resource-sharing policies and administrative overrides that are essential for large-scale GPU clusters and multi-tenant AI platforms, marking the completion of the core architectural foundation for next-generation resource management.

This work was done as part of KEPs #5018 and #4816 led by SIG Auth and SIG Scheduling.

Mutating admission policies

Declarative cluster management reaches a new level of sophistication in Kubernetes v1.36 with the graduation of MutatingAdmissionPolicies to Stable. This milestone provides a native, high-performance alternative to traditional webhooks by allowing administrators to define resource mutations directly in the API server using the Common Expression Language (CEL), fully replacing the need for external infrastructure for many common use cases.

Now, cluster operators can modify incoming requests without the latency and operational complexity associated with managing custom admission webhooks. By moving mutation logic into a declarative, versioned policy, organizations can achieve more predictable cluster behavior, reduced network overhead, and a hardened security model with the full guarantee of long-term API stability.

This work was done as part of KEP #3962 led by SIG API Machinery.

Declarative validation of Kubernetes native types with validation-gen

The development of custom resources reaches a new level of efficiency in Kubernetes v1.36 as declarative validation (with validation-gen) graduates to Stable. This milestone replaces the manual and often error-prone task of writing complex OpenAPI schemas by allowing developers to define sophisticated validation logic directly within Go struct tags using the Common Expression Language (CEL).

Instead of writing custom validation functions, Kubernetes contributors can now define validation rules using IDL marker comments (such as +k8s:minimum or +k8s:enum) directly within the API type definitions (types.go). The validation-gen tool parses these comments to automatically generate robust API validation code at compile-time. This reduces maintenance overhead and ensures that API validation remains consistent and synchronized with the source code.

This work was done as part of KEP #5073 led by SIG API Machinery.

Removal of gogo protobuf dependency for Kubernetes API types

Security and long-term maintainability for the Kubernetes codebase take a major step forward in Kubernetes v1.36 with the completion of the gogoprotobuf removal. This initiative has eliminated a significant dependency on the unmaintained gogoprotobuf library, which had become a source of potential security vulnerabilities and a blocker for adopting modern Go language features.

Instead of migrating to standard Protobuf generation, which presented compatibility risks for Kubernetes API types, the project opted to fork and internalize the required generation logic within k8s.io/code-generator. This approach successfully eliminates the unmaintained runtime dependencies from the Kubernetes dependency graph while preserving existing API behavior and serialization compatibility. For consumers of Kubernetes API Go types, this change reduces technical debt and prevents accidental misuse with standard protobuf libraries.

This work was done as part of KEP #5589 led by SIG API Machinery.

Node log query

Previously, Kubernetes required cluster administrators to log into nodes via SSH or implement a client-side reader for debugging issues pertaining to control-plane or worker nodes. While certain issues still require direct node access, issues with the kube-proxy or kubelet can be diagnosed by inspecting their logs. Node logs offer cluster administrators a method to view these logs using the kubelet API and kubectl plugin to simplify troubleshooting without logging into nodes, similar to debugging issues related to a pod or container. This method is operating system agnostic and requires the services or nodes to log to /var/log.

As this feature reaches GA in Kubernetes 1.36 after thorough performance validation on production workloads, it is enabled by default on the kubelet through the NodeLogQuery feature gate. In addition, the enableSystemLogQuery kubelet configuration option must also be enabled.

This work was done as a part of KEP #2258 led by SIG Windows.

Support User Namespaces in pods

Container isolation and node security reach a major maturity milestone in Kubernetes v1.36 as support for User Namespaces graduates to Stable. This long-awaited feature provides a critical layer of defense-in-depth by allowing the mapping of a container's root user to a non-privileged user on the host, ensuring that even if a process escapes the container, it possesses no administrative power over the underlying node.

Now, cluster operators can confidently enable this hardened isolation for production workloads to mitigate the impact of container breakout vulnerabilities. By decoupling the container's internal identity from the host's identity, Kubernetes provides a robust, standardized mechanism to protect multi-tenant environments and sensitive infrastructure from unauthorized access, all with the full guarantee of long-term API stability.

This work was done as part of KEP #127 led by SIG Node.

Support PSI based on cgroupv2

Node resource management and observability become more precise in Kubernetes v1.36 as the export of Pressure Stall Information (PSI) metrics graduates to Stable. This feature provides the kubelet with the ability to report "pressure" metrics for CPU, memory, and I/O, offering a more granular view of resource contention than traditional utilization metrics.

Cluster operators and autoscalers can use these metrics to distinguish between a system that is simply busy and one that is actively stalling due to resource exhaustion. By leveraging these signals, users can more accurately tune pod resource requests, improve the reliability of vertical autoscaling, and detect noisy neighbor effects before they lead to application performance degradation or node instability.

This work was done as part of KEP #4205 led by SIG Node.

Volume source: OCI artifact and/or image

The distribution of container data becomes more flexible in Kubernetes v1.36 as OCI volume source support graduates to Stable. This feature moves beyond the traditional requirement of mounting volumes from external storage providers or config maps by allowing the kubelet to pull and mount content directly from any OCI-compliant registry, such as a container image or an artifact repository.

Now, developers and platform engineers can package application data, models, or static assets as OCI artifacts and deliver them to pods using the same registries and versioning workflows they already use for container images. This convergence of image and volume management simplifies CI/CD pipelines, reduces dependency on specialized storage backends for read-only content, and ensures that data remains portable and securely accessible across any environment.

This work was done as part of KEP #4639 led by SIG Node.

New features in Beta

This is a selection of some of the improvements that are now beta following the v1.36 release.

Staleness mitigation for controllers

Staleness in Kubernetes controllers is a problem that affects many controllers and can subtly affect controller behavior. It is usually not until it is too late, when a controller in production has already taken incorrect action, that staleness is found to be an issue due to some underlying assumption made by the controller author. This could lead to conflicting updates or data corruption upon controller reconciliation during times of cache staleness.

We are excited to announce that Kubernetes v1.36 includes new features that help mitigate controller staleness and provide better observability of controller behavior. This prevents reconciliation based on an outdated view of cluster state that can often lead to harmful behavior.

This work was done as part of KEP #5647 led by SIG API Machinery.

IP/CIDR validation improvements

In Kubernetes v1.36, the StrictIPCIDRValidation feature for API IP and CIDR fields graduates to beta, tightening validation to catch malformed addresses and prefixes that previously slipped through. This helps prevent subtle configuration bugs where Services, Pods, NetworkPolicies, or other resources reference invalid IPs, which could otherwise lead to confusing runtime behavior or security surprises.

Controllers are updated to canonicalize IPs they write back into objects and to warn when they encounter bad values that were already stored, so clusters can gradually converge on clean, consistent data. With beta, StrictIPCIDRValidation is ready for wider use, giving operators more reliable guardrails around IP-related configuration as they evolve networks and policies over time.

This work was done as a part of KEP #4858 led by SIG Network.

Separate kubectl user preferences from cluster configs

The .kuberc feature for customizing kubectl user preferences continues to be beta and is enabled by default. The ~/.kube/kuberc file allows users to store aliases, default flags, and other personal settings separately from kubeconfig files, which hold cluster endpoints and credentials. This separation prevents personal preferences from interfering with CI pipelines or shared kubeconfig files, while maintaining a consistent kubectl experience across different clusters and contexts.

In Kubernetes v1.36, .kuberc was expanded with the ability to define policies for credential plugins (allowlists or denylists) to enforce safer authentication practicies. Users can disable this functionality if needed by setting the KUBECTL_KUBERC=false or KUBERC=off environment variables.

This work was done as a part of KEP #3104 led by SIG CLI, with the help from SIG Auth.

Mutable container resources when Job is suspended

In Kubernetes v1.36, the MutablePodResourcesForSuspendedJobs feature graduates to beta and is enabled by default. This update relaxes Job validation to allow updates to container CPU, memory, GPU, and extended resource requests and limits while a Job is suspended.

This capability allows queue controllers and operators to adjust batch workload requirements based on real‑time cluster conditions. For example, a queueing system can suspend incoming Jobs, adjust their resource requirements to match available capacity or quota, and then unsuspend them. The feature strictly limits mutability to suspended Jobs (or Jobs whose pods have been terminated upon suspension) to prevent disruptive changes to actively running pods.

This work was done as a part of KEP #5440 led by SIG Apps.

Constrained impersonation

In Kubernetes v1.36, the ConstrainedImpersonation feature for user impersonation graduates to beta, tightening a historically all‑or‑nothing mechanism into something that can actually follow least‑privilege principles. When this feature is enabled, an impersonator must have two distinct sets of permissions: one to impersonate a given identity, and another to perform specific actions on that identity’s behalf. This prevents support tools, controllers, or node agents from using impersonation to gain broader access than they themselves are allowed, even if their impersonation RBAC is misconfigured. Existing impersonate rules keep working, but the API server prefers the new constrained checks first, making the transition incremental instead of a flag day. With beta in v1.36, ConstrainedImpersonation is tested, documented, and ready for wider adoption by platform teams that rely on impersonation for debugging, proxying, or node‑level controllers.

This work was done as a part of KEP #5284 led by SIG Auth.

DRA features in beta

The Dynamic Resource Allocation (DRA) framework reaches another maturity milestone in Kubernetes v1.36 as several core features graduate to beta and are enabled by default. This transition moves DRA beyond basic allocation by graduating partitionable devices and consumable capacity, allowing for more granular sharing of hardware like GPUs, while device taints and tolerations ensure that specialized resources are only utilized by the appropriate workloads.

Now, users benefit from a much more reliable and observable resource lifecycle through ResourceClaim device status and the ability to ensure device attachment before Pod scheduling. By integrating these features with extended resource support, Kubernetes provides a robust production-ready alternative to the legacy device plugin system, enabling complex AI and HPC workloads to manage hardware with unprecedented precision and operational safety.

This work was done across several KEPs (including #5004, #4817, #5055, #5075, #4815, and #5007) led by SIG Scheduling and SIG Node.

Statusz for Kubernetes components

In Kubernetes v1.36, the ComponentStatusz feature gate for core Kubernetes components graduates to beta, providing a /statusz endpoint (enabled by default) that surfaces real‑time build and version details for each component. This low‑overhead z-page exposes information like start time, uptime, Go version, binary version, emulation version, and minimum compatibility version, so operators and developers can quickly see exactly what is running without digging through logs or configs.

The endpoint offers a human‑readable text view by default, plus a versioned structured API (config.k8s.io/v1beta1) for programmatic access in JSON, YAML, or CBOR via explicit content negotiation. Access is granted to the system:monitoring group, keeping it aligned with existing protections on health and metrics endpoints and avoiding exposure of sensitive data.

With beta, ComponentStatusz is enabled by default across all core control‑plane components and node agents, backed by unit, integration, and end‑to‑end tests so it can be safely used in production for observability and debugging workflows.

This work was done as a part of KEP #4827 led by SIG Instrumentation.

Flagz for Kubernetes components

In Kubernetes v1.36, the ComponentFlagz feature gate for core Kubernetes components graduates to beta, standardizing a /flagz endpoint that exposes the effective command‑line flags each component was started with. This gives cluster operators and developers real‑time, in‑cluster visibility into component configuration, making it much easier to debug unexpected behavior or verify that a flag rollout actually took effect after a restart.

The endpoint supports both a human‑readable text view and a versioned structured API (initially config.k8s.io/v1beta1), so you can either curl it during an incident or wire it into automated tooling once you are ready. Access is granted to the system:monitoring group and sensitive values can be redacted, keeping configuration insight aligned with existing security practices around health and status endpoints.

With beta, ComponentFlagz is now enabled by default and implemented across all core control‑plane components and node agents, backed by unit, integration, and end‑to‑end tests to ensure the endpoint is reliable in production clusters.

This work was done as a part of KEP #4828 led by SIG Instrumentation.

Mixed version proxy (aka unknown version interoperability proxy)

In Kubernetes v1.36, the mixed version proxy feature graduates to beta, building on its alpha introduction in v1.28 to provide safer control-plane upgrades for mixed-version clusters. Each API request can now be routed to the apiserver instance that serves the requested group, version, and resource, reducing 404s and failures due to version skew.

The feature relies on peer-aggregated discovery, so apiservers share information about which resources and versions they expose, then use that data to transparently reroute requests when needed. New metrics on rerouted traffic and proxy behavior help operators understand how often requests are forwarded and to which peers. Together, these changes make it easier to run highly available, mixed-version API control planes in production while performing multi-step or partial control-plane upgrades.

This work was done as a part of KEP #4020 led by SIG API Machinery

Memory QoS with cgroups v2

Kubernetes now enhances memory QoS on Linux cgroup v2 nodes with smarter, tiered memory protection that better aligns kernel controls with pod requests and limits, reducing interference and thrashing for workloads sharing the same node. This iteration also refines how kubelet programs memory.high and memory.min, adds metrics and safeguards to avoid livelocks, and introduces configuration options so cluster operators can tune memory protection behavior for their environments.

This work was done as part of KEP #2570 led by SIG Node.

New features in Alpha

This is a selection of some of the improvements that are now alpha following the v1.36 release.

HPA scale to zero for custom metrics

Until now, the HorizontalPodAutoscaler (HPA) required a minimum of at least one replica to remain active, as it could only calculate scaling needs based on metrics (like CPU or Memory) from running pods. Kubernetes v1.36 continues the development of the HPA scale to zero feature (disabled by default) in Alpha, allowing workloads to scale down to zero replicas specifically when using Object or External metrics.

Now, users can experiment with significantly reducing infrastructure costs by completely idling heavy workloads when no work is pending. While the feature remains behind the HPAScaleToZero feature gate, it enables the HPA to stay active even with zero running pods, automatically scaling the deployment back up as soon as the external metric (e.g., queue length) indicates that new tasks have arrived.

This work was done as part of KEP #2021 led by SIG Autoscaling.

DRA features in Alpha

Historically, the Dynamic Resource Allocation (DRA) framework lacked seamless integration with high-level controllers and provided limited visibility into device-specific metadata or availability. Kubernetes v1.36 introduces a wave of DRA enhancements in Alpha, including native ResourceClaim support for workloads, and DRA native resources to provide the flexibility of DRA to cpu management.

Now, users can leverage the downward API to expose complex resource attributes directly to containers and benefit from improved resource availability visibility for more predictable scheduling. these updates, combined with support for list types in device attributes, transform DRA from a low-level primitive into a robust system capable of handling the sophisticated networking and compute requirements of modern AI and high-performance computing (HPC) stacks.

This work was done across several KEPs (including #5729, #5304, #5517, #5677, and #5491) led by SIG Scheduling and SIG Node.

Native histogram support for Kubernetes metrics

High-resolution monitoring reaches a new milestone in Kubernetes v1.36 with the introduction of native histogram support in Alpha. While classical Prometheus histograms relied on static, pre-defined buckets that often forced a compromise between data accuracy and memory usage, this update allows the control plane to export sparse histograms that dynamically adjust their resolution based on real-time data.

Now, cluster operators can capture precise latency distributions for the kube-apiserver and other core components without the overhead of manual bucket management. This architectural shift ensures more reliable SLIs and SLOs, providing high-fidelity heatmaps that remain accurate even during the most unpredictable workload spikes.

This work was done as part of KEP #5808 led by SIG Instrumentation.

Manifest based admission control config

Managing admission controllers moves toward a more declarative and consistent model in Kubernetes v1.36 with the introduction of manifest-based admission control configuration in Alpha. This change addresses the long-standing challenge of configuring admission plugins through disparate command-line flags or separate, complex config files by allowing administrators to define the desired state of admission control directly through a structured manifest.

Now, cluster operators can manage admission plugin settings with the same versioned, declarative workflows used for other Kubernetes objects, significantly reducing the risk of configuration drift and manual errors during cluster upgrades. By centralizing these configurations into a unified manifest, the kube-apiserver becomes easier to audit and automate, paving the way for more secure and reproducible cluster deployments.

This work was done as part of KEP #5793 led by SIG API Machinery.

CRI list streaming

With the introduction of CRI list streaming in Alpha, Kubernetes v1.36 uses new internal streaming operations. This enhancement addresses the memory pressure and latency spikes often seen on large-scale nodes by replacing traditional, monolithic List requests between the kubelet and the container runtime with a more efficient server-side streaming RPC.

Now, instead of waiting for a single, massive response containing all container or image data, the kubelet can process results incrementally as they are streamed. This shift significantly reduces the peak memory footprint of the kubelet and improves responsiveness on high-density nodes, ensuring that cluster management remains fluid even as the number of containers per node continues to grow.

This work was done as part of KEP #5825 led by SIG Node.

Other notable changes

Ingress NGINX retirement

To prioritize the safety and security of the ecosystem, Kubernetes SIG Network and the Security Response Committee have retired Ingress NGINX on March 24, 2026. Since that date, there have been no further releases, no bugfixes, and no updates to resolve any security vulnerabilities discovered. Existing deployments of Ingress NGINX will continue to function, and installation artifacts like Helm charts and container images will remain available.

For full details, see the official retirement announcement.

Faster SELinux labelling for volumes (GA)

Kubernetes v1.36 makes the SELinux volume mounting improvement generally available. This change replaced recursive file relabeling with mount -o context=XYZ option, applying the correct SELinux label to the entire volume at mount time. It brings more consistent performance and reduces Pod startup delays on SELinux-enforcing systems.

This feature was introduced as beta in v1.28 for ReadWriteOncePod volumes. In v1.32, it gained metrics and an opt-out option (securityContext.seLinuxChangePolicy: Recursive) to help catch conflicts. Now in v1.36, it reaches Stable and defaults to all volumes, with Pods or CSIDrivers opting in via spec.seLinuxMount.

However, we expect this feature to create the risk of breaking changes in the future Kubernetes releases, potentially due to sharing one volume between privileged and unprivileged Pods on the same node.

Developers have the responsibility of setting the seLinuxChangePolicy field and SELinux volume labels on Pods. Regardless of whether they are writing a Deployment, StatefulSet, DaemonSet or even a custom resource that includes a Pod template, being careless with these settings can lead to a range of problems such as Pods not starting up correctly when Pods share a volume.

Kubernetes v1.36 is the ideal release to audit your clusters. To learn more, check out SELinux Volume Label Changes goes GA (and likely implications in v1.37) blog.

For more details on this enhancement, refer to KEP-1710: Speed up recursive SELinux label change.

Graduations, deprecations, and removals in v1.36

Graduations to stable

This lists all the features that graduated to stable (also known as general availability). For a full list of updates including new features and graduations from alpha to beta, see the release notes.

This release includes a total of 18 enhancements promoted to stable:

Deprecations removals, and community updates

As Kubernetes develops and matures, features may be deprecated, removed, or replaced with better ones to improve the project's overall health. See the Kubernetes deprecation and removal policy for more details on this process. Kubernetes v1.36 includes a couple of deprecations.

Deprecation of Service .spec.externalIPs

With this release, the externalIPs field in Service spec is deprecated. This means the functionality exists, but will no longer function in a future version of Kubernetes. You should plan to migrate if you currently rely on that field. This field has been a known security headache for years, enabling man-in-the-middle attacks on your cluster traffic, as documented in CVE-2020-8554. From Kubernetes v1.36 and onwards, you will see deprecation warnings when using it, with full removal planned for v1.43.

If your Services still lean on externalIPs, consider using LoadBalancer services for cloud-managed ingress, NodePort for simple port exposure, or Gateway API for a more flexible and secure way to handle external traffic.

For more details on this field and its deprecation, refer to External IPs or read KEP-5707: Deprecate service.spec.externalIPs.

Removal of the gitRepo volume driver

The gitRepo volume type has been deprecated since v1.11. For Kubernetes v1.36, the gitRepo volume plugin is permanently disabled and cannot be turned back on. This change protects clusters from a critical security issue where using gitRepo could let an attacker run code as root on the node.

Although gitRepo has been deprecated for years and better alternatives have been recommended, it was still technically possible to use it in previous releases. From v1.36 onward, that path is closed for good, so any existing workloads depending on gitRepo will need to migrate to supported approaches such as init containers or external git-sync style tools.

For more details on this removal, refer to KEP-5040: Remove gitRepo volume driver

Release notes

Check out the full details of the Kubernetes v1.36 release in our release notes.

Availability

Kubernetes v1.36 is available for download on GitHub or on the Kubernetes download page.

To get started with Kubernetes, check out these tutorials or run local Kubernetes clusters using minikube. You can also easily install v1.36 using kubeadm.

Release Team

Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.

We would like to thank the entire Release Team for the hours spent hard at work to deliver the Kubernetes v1.36 release to our community. The Release Team's membership ranges from first-time shadows to returning team leads with experience forged over several release cycles. A very special thanks goes out to our release lead, Ryota Sawada, for guiding us through a successful release cycle, for his hands-on approach to solving challenges, and for bringing the energy and care that drives our community forward.

Project Velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing, and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

During the v1.36 release cycle, which spanned 15 weeks from 12th January 2026 to 22nd April 2026, Kubernetes received contributions from as many as 106 different companies and 491 individuals. In the wider cloud native ecosystem, the figure goes up to 370 companies, counting 2235 total contributors.

Note that “contribution” counts when someone makes a commit, code review, comment, creates an issue or PR, reviews a PR (including blogs and documentation) or comments on issues and PRs. If you are interested in contributing, visit Getting Started on our contributor website.

Source for this data:

Events Update

Explore upcoming Kubernetes and cloud native events, including KubeCon + CloudNativeCon, KCD, and other notable conferences worldwide. Stay informed and get involved with the Kubernetes community!

April 2026

May 2026

June 2026

July 2026

September 2026

October 2026

November 2026

You can find the latest event details here.

Upcoming Release Webinar

Join members of the Kubernetes v1.36 Release Team on Wednesday, May 20th 2026 at 4:00 PM (UTC) to learn about the release highlights of this release. For more information and registration, visit the event page on the CNCF Online Programs site.

Get Involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

Gateway API v1.5: Moving features to Stable

Gateway API logo

The Kubernetes SIG Network community presents the release of Gateway API (v1.5)! Released on February 27, 2026, version 1.5 is our biggest release yet, and concentrates on moving existing Experimental features to Standard (Stable).

The Gateway API v1.5.1 patch release is already available.

The Gateway API v1.5 brings six widely-requested feature promotions to the Standard channel (Gateway API's GA release channel):

  • ListenerSet
  • TLSRoute
  • HTTPRoute CORS Filter
  • Client Certificate Validation
  • Certificate Selection for Gateway TLS Origination
  • ReferenceGrant

Special thanks for Gateway API Contributors for their efforts on this release.

New release process

As of Gateway API v1.5, the project has moved to a release train model, where on a feature freeze date, any features that are ready are shipped in the release.

This applies to both Experimental and Standard, and also applies to documentation -- if the documentation isn't ready to ship, the feature isn't ready to ship.

We are aiming for this to produce a more reliable release cadence (since we are basing our work off the excellent work done by SIG Release on Kubernetes itself). As part of this change, we've also introduced Release Manager and Release Shadow roles to our release team. Many thanks to Flynn (Buoyant) and Beka Modebadze (Google) for all the great work coordinating and filing the rough edges of our release process. They are both going to continue in this role for the next release as well.

New standard features

ListenerSet

Leads: Dave Protasowski, David Jumani

GEP-1713

Why ListenerSet?

Prior to ListenerSet, all listeners had to be specified directly on the Gateway object. While this worked well for simple use cases, it created challenges for more complex or multi-tenant environments:

  • Platform teams and application teams often needed to coordinate changes to the same Gateway
  • Safely delegating ownership of individual listeners was difficult
  • Extending existing Gateways required direct modification of the original resource

ListenerSet addresses these limitations by allowing listeners to be defined independently and then merged onto a target Gateway.

ListenerSets also enable attaching more than 64 listeners to a single, shared Gateway. This is critical for large scale deployments and scenarios with multiple hostnames per listener.

Even though the ListenerSet feature significantly enhances scalability, the listener field in Gateway remains a mandatory requirement and the Gateway must have at least one valid listener.

How it works

A ListenerSet attaches to a Gateway and contributes one or more listeners. The Gateway controller is responsible for merging listeners from the Gateway resource itself and any attached ListenerSet resources.

In this example, a central infrastructure team defines a Gateway with a default HTTP listener, while two different application teams define their own ListenerSet resources in separate namespaces. Both ListenerSets attach to the same Gateway and contribute additional HTTPS listeners.

---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: example-gateway
  namespace: infra
spec:
  gatewayClassName: example-gateway-class
  allowedListeners:
    namespaces:
      from: All # A selector lets you fine tune this
  listeners:
  - name: http
    protocol: HTTP
    port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: ListenerSet
metadata:
  name: team-a-listeners
  namespace: team-a
spec:
  parentRef:
    name: example-gateway
    namespace: infra
  listeners:
  - name: https-a
    protocol: HTTPS
    port: 443
    hostname: a.example.com
    tls:
      certificateRefs:
      - name: a-cert
---
apiVersion: gateway.networking.k8s.io/v1
kind: ListenerSet
metadata:
  name: team-b-listeners
  namespace: team-b
spec:
  parentRef:
    name: example-gateway
    namespace: infra
  listeners:
  - name: https-b
    protocol: HTTPS
    port: 443
    hostname: b.example.com
    tls:
      certificateRefs:
      - name: b-cert

TLSRoute

Leads: Rostislav Bobrovsky, Ricardo Pchevuzinske Katz

GEP-2643

The TLSRoute resource allows you to route requests by matching the Server Name Indication (SNI) presented by the client during the TLS handshake and directing the stream to the appropriate Kubernetes backends.

When working with TLSRoute, a Gateway's TLS listener can be configured in one of two modes: Passthrough or Terminate.

If you install Gateway API v1.5 Standard over v1.4 or earlier Experimental, your existing Experimental TLSRoutes will not be usable. This is because they will be stored in the v1alpha2 or v1alpha3 version, which is not included in the v1.5 Standard YAMLs. If this applies to you, either continue using Experimental for v1.5.1 and onward, or you'll need to download and migrate your TLSRoutes to v1, which is present in the Standard YAMLs.

Passthrough mode

The Passthrough mode is designed for strict security requirements. It is ideal for scenarios where traffic must remain encrypted end-to-end until it reaches the destination backend, when the external client and backend need to authenticate directly with each other, or when you can’t store certificates on the Gateway. This configuration is also applicable when an encrypted TCP stream is required instead of standard HTTP traffic.

In this mode, the encrypted byte stream is proxied directly to the destination backend. The Gateway has zero access to private keys or unencrypted data.

The following TLSRoute is attached to a listener that is configured in Passthrough mode. It will match only TLS handshakes with the foo.example.com SNI hostname and apply its routing rules to pass the encrypted TCP stream to the configured backend:

---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: example-gateway
spec:
  gatewayClassName: example-gateway-class
  listeners:
  - name: tls-passthrough
    protocol: TLS
    port: 8443
    tls:
      mode: Passthrough
---
apiVersion: gateway.networking.k8s.io/v1
kind: TLSRoute
metadata:
  name: foo-route
spec:
  parentRefs:
  - name: example-gateway
    sectionName: tls-passthrough
  hostnames:
  - "foo.example.com"
  rules:
  - backendRefs:
    - name: foo-svc
      port: 8443

Terminate mode

The Terminate mode provides the convenience of centralized TLS certificate management directly at the Gateway.

In this mode, the TLS session is fully terminated at the Gateway, which then routes the decrypted payload to the destination backend as a plain text TCP stream.

The following TLSRoute is attached to a listener that is configured in Terminate mode. It will match only TLS handshakes with the bar.example.com SNI hostname and apply its routing rules to pass the decrypted TCP stream to the configured backend:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: example-gateway
spec:
  gatewayClassName: example-gateway-class
  listeners:
  - name: tls-terminate
    protocol: TLS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - name: tls-terminate-certificate
---
apiVersion: gateway.networking.k8s.io/v1
kind: TLSRoute
metadata:
  name: bar-route
spec:
  parentRefs:
  - name: example-gateway
    sectionName: tls-terminate
  hostnames:
  - "bar.example.com"
  rules:
  - backendRefs:
    - name: bar-svc
      port: 8080

HTTPRoute CORS filter

Leads: Damian Sawicki, Ricardo Pchevuzinske Katz, Norwin Schnyder, Huabing (Robin) Zhao, LiangLliu,

GEP-1767

Cross-origin resource sharing (CORS) is an HTTP-header based security mechanism that allows (or denies) a web page to access resources from a server on an origin different from the domain that served the web page. See our documentation page for more information. The HTTPRoute resource can be used to configure Cross-Origin Resource Sharing (CORS). The following HTTPRoute allows requests from https://app.example:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: cors
spec:
  parentRefs:
  - name: same-namespace
  rules:
  - matches:
    - path:
       type: PathPrefix
       value: /cors-behavior-creds-false
    backendRefs:
    - name: infra-backend-v1
       port: 8080
    filters:
    - cors:
      allowOrigins:
        - https://app.example
      type: CORS

Instead of specifying a list of specific origins, you can also specify a single wildcard ("*"), which will allow any origin. It is also allowed to use semi-specified origins in the list, where the wildcard appears after the scheme and at the beginning of the hostname, e.g. https://*.bar.com:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: cors
spec:
  parentRefs:
  - name: same-namespace
  rules:
  - matches:
    - path:
      type: PathPrefix
      value: /cors-behavior-creds-false
    backendRefs:
    - name: infra-backend-v1
      port: 8080
    filters:
    - cors:
      allowOrigins:
        - https://www.baz.com
        - https://*.bar.com
        - https://*.foo.com
      type: CORS

HTTPRoute filters allow for the configuration of CORS settings. See a list of supported options below:

allowCredentials
Specifies whether the browser is allowed to include credentials (such as cookies and HTTP authentication) in the CORS request.
allowMethods
The HTTP methods that are allowed for CORS requests.
allowHeaders
The HTTP headers that are allowed for CORS requests.
exposeHeaders
The HTTP headers that are exposed to the client.
maxAge
The maximum time in seconds that the browser should cache the preflight response.

A comprehensive example:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: cors-allow-credentials
spec:
  parentRefs:
  - name: same-namespace
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /cors-behavior-creds-true
    backendRefs:
    - name: infra-backend-v1
      port: 8080
    filters:
    - cors:
        allowOrigins:
        - "https://www.foo.example.com"
        - "https://*.bar.example.com"
        allowMethods:
        - GET
        - OPTIONS
        allowHeaders:
        - "*"
        exposeHeaders:
        - "x-header-3"
        - "x-header-4"
        allowCredentials: true
        maxAge: 3600
      type: CORS

Gateway client certificate validation

Leads: Arko Dasgupta, Katarzyna Łach, Norwin Schnyder

GEP-91

Client certificate validation, also known as mutual TLS (mTLS), is a security mechanism where the client provides a certificate to the server to prove its identity. This is in contrast to standard TLS, where only the server presents a certificate to the client. In the context of the Gateway API, frontend mTLS means that the Gateway validates the client's certificate before allowing the connection to proceed to a backend service. This validation is done by checking the client certificate against a set of trusted Certificate Authorities (CAs) configured on the Gateway. The API was shaped this way to address a critical security vulnerability related to connection reuse and still provide some level of flexibility.

Configuration overview

Client validation is defined using the frontendValidation struct, which specifies how the Gateway should verify the client's identity.

  • caCertificateRefs: A list of references to Kubernetes objects (typically ConfigMap's) containing PEM-encoded CA certificate bundles used as trust anchors to validate the client's certificate.
  • mode: Defines the validation behavior.
    • AllowValidOnly (Default): The Gateway accepts connections only if the client presents a valid certificate that passes validation against the specified CA bundle.
    • AllowInsecureFallback: The Gateway accepts connections even if the client certificate is missing or fails verification. This mode typically delegates authorization to the backend and should be used with caution.

Validation can be applied globally to the Gateway or overridden for specific ports:

  1. Default Configuration: This configuration applies to all HTTPS listeners on the Gateway, unless a per-port override is defined.
  2. Per-Port Configuration: This allows for fine-grained control, overriding the default configuration for all listeners handling traffic on a specific port.

Example:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: client-validation-basic
spec:
  gatewayClassName: acme-lb
  tls:
    frontend:
      default:
        validation:
          caCertificateRefs:
          - kind: ConfigMap
            group: ""
            name: foo-example-com-ca-cert
      perPort:
      - port: 8443
        tls:
          validation:
            caCertificateRefs:
            - kind: ConfigMap
              group: ""
              name: foo-example-com-ca-cert
            mode: "AllowInsecureFallback"
  listeners:
  - name: foo-https
    protocol: HTTPS
    port: 443
    hostname: foo.example.com
    tls:
      certificateRefs:
      - kind: Secret
        group: ""
        name: foo-example-com-cert
  - name: bar-https
    protocol: HTTPS
    port: 8443
    hostname: bar.example.com
    tls:
      certificateRefs:
      - kind: Secret
        group: ""
        name: bar-example-com-cert

Certificate selection for Gateway TLS origination

Leads: Marcin Kosieradzki, Rob Scott, Norwin Schnyder, Lior Lieberman, Katarzyna Lach

GEP-3155

Mutual TLS (mTLS) for upstream connections requires the Gateway to present a client certificate to the backend, in addition to verifying the backend's certificate. This ensures that the backend only accepts connections from authorized Gateways.

Gateway’s client certificate configuration

To configure the client certificate that the Gateway uses when connecting to backends, use the tls.backend.clientCertificateRef field in the Gateway resource. This configuration applies to the Gateway as a client for all upstream connections managed by that Gateway.

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: backend-tls
spec:
  gatewayClassName: acme-lb
  tls:
    backend:
      clientCertificateRef:
        kind: Secret
        group: "" # empty string means core API group
        name: foo-example-cert
  listeners:
  - name: foo-http
    protocol: HTTP
    port: 80
    hostname: foo.example.com

ReferenceGrant promoted to v1

The ReferenceGrant resource has not changed in more than a year, and we do not expect it to change further, so its version has been bumped to v1, and it is now officially in the Standard channel, and abides by the GA API contract (that is, no breaking changes).

Try it out

Unlike other Kubernetes APIs, you don't need to upgrade to the latest version of Kubernetes to get the latest version of Gateway API. As long as you're running Kubernetes 1.30 or later, you'll be able to get up and running with this version of Gateway API.

To try out the API, follow the Getting Started Guide.

As of this writing, seven implementations are already fully conformant with Gateway API v1.5. In alphabetical order:

Get involved

Wondering when a feature will be added? There are lots of opportunities to get involved and help define the future of Kubernetes routing APIs for both ingress and service mesh.

The maintainers would like to thank everyone who's contributed to Gateway API, whether in the form of commits to the repo, discussion, ideas, or general support. We could never have made this kind of progress without the support of this dedicated and active community.

This article was edited in April 2026 to correct the release date for Gateway API 1.5.0.

Kubernetes v1.36 Sneak Peek

Kubernetes v1.36 is coming at the end of April 2026. This release will include removals and deprecations, and it is packed with an impressive number of enhancements. Here are some of the features we are most excited about in this cycle!

Please note that this information reflects the current state of v1.36 development and may change before release.

The Kubernetes API removal and deprecation process

The Kubernetes project has a well-documented deprecation policy for features. This policy states that stable APIs may only be deprecated when a newer, stable version of that same API is available and that APIs have a minimum lifetime for each stability level. A deprecated API has been marked for removal in a future Kubernetes release. It will continue to function until removal (at least one year from the deprecation), but usage will result in a warning being displayed. Removed APIs are no longer available in the current version, at which point you must migrate to using the replacement.

  • Generally available (GA) or stable API versions may be marked as deprecated but must not be removed within a major version of Kubernetes.
  • Beta or pre-release API versions must be supported for 3 releases after the deprecation.
  • Alpha or experimental API versions may be removed in any release without prior deprecation notice; this process can become a withdrawal in cases where a different implementation for the same feature is already in place.

Whether an API is removed as a result of a feature graduating from beta to stable, or because that API simply did not succeed, all removals comply with this deprecation policy. Whenever an API is removed, migration options are communicated in the deprecation guide.

A recent example of this principle in action is the retirement of the ingress-nginx project, announced by SIG-Security on March 24, 2026. As stewardship shifts away from the project, the community has been encouraged to evaluate alternative ingress controllers that align with current security and maintenance best practices. This transition reflects the same lifecycle discipline that underpins Kubernetes itself, ensuring continued evolution without abrupt disruption.

Ingress NGINX retirement

To prioritize the safety and security of the ecosystem, Kubernetes SIG Network and the Security Response Committee have retired Ingress NGINX on March 24, 2026. Since that date, there have been no further releases, no bugfixes, and no updates to resolve any security vulnerabilities discovered. Existing deployments of Ingress NGINX will continue to function, and installation artifacts like Helm charts and container images will remain available.

For full details, see the official retirement announcement.

Deprecations and removals for Kubernetes v1.36

Deprecation of .spec.externalIPs in Service

The externalIPs field in Service spec is being deprecated, which means you’ll soon lose a quick way to route arbitrary externalIPs to your Services. This field has been a known security headache for years, enabling man-in-the-middle attacks on your cluster traffic, as documented in CVE-2020-8554. From Kubernetes v1.36 and onwards, you will see deprecation warnings when using it, with full removal planned for v1.43.

If your Services still lean on externalIPs, consider using LoadBalancer services for cloud-managed ingress, NodePort for simple port exposure, or Gateway API for a more flexible and secure way to handle external traffic.

For more details on this enhancement, refer to KEP-5707: Deprecate service.spec.externalIPs

Removal of gitRepo volume driver

The gitRepo volume type has been deprecated since v1.11. Starting Kubernetes v1.36, the gitRepo volume plugin is permanently disabled and cannot be turned back on. This change protects clusters from a critical security issue where using gitRepo could let an attacker run code as root on the node.

Although gitRepo has been deprecated for years and better alternatives have been recommended, it was still technically possible to use it in previous releases. From v1.36 onward, that path is closed for good, so any existing workloads depending on gitRepo will need to migrate to supported approaches such as init containers or external git-sync style tools.

For more details on this enhancement, refer to KEP-5040: Remove gitRepo volume driver

The following list of enhancements is likely to be included in the upcoming v1.36 release. This is not a commitment and the release content is subject to change.

Faster SELinux labelling for volumes (GA)

Kubernetes v1.36 makes the SELinux volume mounting improvement generally available. This change replaced recursive file relabeling with mount -o context=XYZ option, applying the correct SELinux label to the entire volume at mount time. It brings more consistent performance and reduces Pod startup delays on SELinux-enforcing systems.

This feature was introduced as beta in v1.28 for ReadWriteOncePod volumes. In v1.32, it gained metrics and an opt-out option (securityContext.seLinuxChangePolicy: Recursive) to help catch conflicts. Now in v1.36, it reaches stable and defaults to all volumes, with Pods or CSIDrivers opting in via spec.SELinuxMount.

However, we expect this feature to create the risk of breaking changes in the future Kubernetes releases, due to the potential for mixing of privileged and unprivileged pods. Setting the seLinuxChangePolicy field and SELinux volume labels on Pods, correctly, is the responsibility of the Pod author Developers have that responsibility whether they are writing a Deployment, StatefulSet, DaemonSet or even a custom resource that includes a Pod template. Being careless with these settings can lead to a range of problems when Pods share volumes.

For more details on this enhancement, refer to KEP-1710: Speed up recursive SELinux label change

External signing of ServiceAccount tokens

As a beta feature, Kubernetes already supports external signing of ServiceAccount tokens. This allows clusters to integrate with external key management systems or signing services instead of relying only on internally managed keys.

With this enhancement, the kube-apiserver can delegate token signing to external systems such as cloud key management services or hardware security modules. This improves security and simplifies key management services for clusters that rely on centralized signing infrastructure. We expect that this will graduate to stable (GA) in Kubernetes v1.36.

For more details on this enhancement, refer to KEP-740: Support external signing of service account tokens

DRA Driver support for Device taints and tolerations

Kubernetes v1.33 introduced support for taints and tolerations for physical devices managed through Dynamic Resource Allocation (DRA). Normally, any device can be used for scheduling. However, this enhancement allows DRA drivers to mark devices as tainted, which ensures that they will not be used for scheduling purposes. Alternatively, cluster administrators can create a DeviceTaintRule to mark devices that match a certain selection criteria(such as all devices of a certain driver) as tainted. This improves scheduling control and helps ensure that specialized hardware resources are only used by workloads that explicitly request them.

In Kubernetes v1.36, this feature graduates to beta with more comprehensive testing complete, making it accessible by default without the need for a feature flag and open to user feedback.

To learn about taints and tolerations, see taints and tolerations.
For more details on this enhancement, refer to KEP-5055: DRA: device taints and tolerations.

DRA support for partitionable devices

Kubernetes v1.36 expands Dynamic Resource Allocation (DRA) by introducing support for partitionable devices, allowing a single hardware accelerator to be split into multiple logical units that can be shared across workloads. This is especially useful for high-cost resources like GPUs, where dedicating an entire device to a single workload can lead to underutilization.

With this enhancement, platform teams can improve overall cluster efficiency by allocating only the required portion of a device to each workload, rather than reserving it entirely. This makes it easier to run multiple workloads on the same hardware while maintaining isolation and control, helping organizations get more value out of their infrastructure.

To learn more about this enhancement, refer to KEP-4815: DRA Partitionable Devices

Want to know more?

New features and deprecations are also announced in the Kubernetes release notes. We will formally announce what's new in Kubernetes v1.36 as part of the CHANGELOG for that release.

Kubernetes v1.36 release is planned for Wednesday, April 22, 2026. Stay tuned for updates!

You can also see the announcements of changes in the release notes for:

Get involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

Announcing Ingress2Gateway 1.0: Your Path to Gateway API

With the Ingress-NGINX retirement scheduled for March 2026, the Kubernetes networking landscape is at a turning point. For most organizations, the question isn't whether to migrate to Gateway API, but how to do so safely.

Migrating from Ingress to Gateway API is a fundamental shift in API design. Gateway API provides a modular, extensible API with strong support for Kubernetes-native RBAC. Conversely, the Ingress API is simple, and implementations such as Ingress-NGINX extend the API through esoteric annotations, ConfigMaps, and CRDs. Migrating away from Ingress controllers such as Ingress-NGINX presents the daunting task of capturing all the nuances of the Ingress controller, and mapping that behavior to Gateway API.

Ingress2Gateway is an assistant that helps teams confidently move from Ingress to Gateway API. It translates Ingress resources/manifests along with implementation-specific annotations to Gateway API while warning you about untranslatable configuration and offering suggestions.

Today, SIG Network is proud to announce the 1.0 release of Ingress2Gateway. This milestone represents a stable, tested migration assistant for teams ready to modernize their networking stack.

Ingress2Gateway 1.0

Ingress-NGINX annotation support

The main improvement for the 1.0 release is more comprehensive Ingress-NGINX support. Before the 1.0 release, Ingress2Gateway only supported three Ingress-NGINX annotations. For the 1.0 release, Ingress2Gateway supports over 30 common annotations (CORS, backend TLS, regex matching, path rewrite, etc.).

Comprehensive integration testing

Each supported Ingress-NGINX annotation, and representative combinations of common annotations, is backed by controller-level integration tests that verify the behavioral equivalence of the Ingress-NGINX configuration and the generated Gateway API. These tests exercise real controllers in live clusters and compare runtime behavior (routing, redirects, rewrites, etc.), not just YAML structure.

The tests:

  • spin up an Ingress-NGINX controller
  • spin up multiple Gateway API controllers
  • apply Ingress resources that have implementation-specific configuration
  • translate Ingress resources to Gateway API with ingress2gateway and apply generated manifests
  • verify that the Gateway API controllers and the Ingress controller exhibit equivalent behavior.

A comprehensive test suite not only catches bugs in development, but also ensures the correctness of the translation, especially given surprising edge cases and unexpected defaults, so that you don't find out about them in production.

Notification & error handling

Migration is not a "one-click" affair. Surfacing subtleties and untranslatable behavior is as important as translating supported configuration. The 1.0 release cleans up the formatting and content of notifications, so it is clear what is missing and how you can fix it.

Using Ingress2Gateway

Ingress2Gateway is a migration assistant, not a one-shot replacement. Its goal is to

  • migrate supported Ingress configuration and behavior
  • identify unsupported configuration and suggest alternatives
  • reevaluate and potentially discard undesirable configuration

The rest of the section shows you how to safely migrate the following Ingress-NGINX configuration

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "1G"
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "1"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "1"
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/configuration-snippet: |
      more_set_headers "Request-Id: $req_id";
  name: my-ingress
  namespace: my-ns
spec:
  ingressClassName: nginx
  rules:
    - host: my-host.example.com
      http:
        paths:
          - backend:
              service:
                name: website-service
                port:
                  number: 80
            path: /users/(\d+)
            pathType: ImplementationSpecific
  tls:
    - hosts:
        - my-host.example.com
      secretName: my-secret

1. Install Ingress2Gateway

If you have a Go environment set up, you can install Ingress2Gateway with

go install github.com/kubernetes-sigs/ingress2gateway@v1.0.0

Otherwise,

brew install ingress2gateway

You can also download the binary from GitHub or build from source.

2. Run Ingress2Gateway

You can pass Ingress2Gateway Ingress manifests, or have the tool read directly from your cluster.

# Pass it files
ingress2gateway print --input-file my-manifest.yaml,my-other-manifest.yaml --providers=ingress-nginx > gwapi.yaml
# Use a namespace in your cluster
ingress2gateway print --namespace my-api --providers=ingress-nginx > gwapi.yaml
# Or your whole cluster
ingress2gateway print --providers=ingress-nginx --all-namespaces > gwapi.yaml

Note:

You can also pass --emitter <agentgateway|envoy-gateway|kgateway> to output implementation-specific extensions.

3. Review the output

This is the most critical step. The commands from the previous section output a Gateway API manifest to gwapi.yaml, and they also emit warnings that explain what did not translate exactly and what to review manually.

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  annotations:
    gateway.networking.k8s.io/generator: ingress2gateway-dev
  name: nginx
  namespace: my-ns
spec:
  gatewayClassName: nginx
  listeners:
  - hostname: my-host.example.com
    name: my-host-example-com-http
    port: 80
    protocol: HTTP
  - hostname: my-host.example.com
    name: my-host-example-com-https
    port: 443
    protocol: HTTPS
    tls:
      certificateRefs:
      - group: ""
        kind: Secret
        name: my-secret
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  annotations:
    gateway.networking.k8s.io/generator: ingress2gateway-dev
  name: my-ingress-my-host-example-com
  namespace: my-ns
spec:
  hostnames:
  - my-host.example.com
  parentRefs:
  - name: nginx
    port: 443
  rules:
  - backendRefs:
    - name: website-service
      port: 80
    filters:
    - cors:
        allowCredentials: true
        allowHeaders:
        - DNT
        - Keep-Alive
        - User-Agent
        - X-Requested-With
        - If-Modified-Since
        - Cache-Control
        - Content-Type
        - Range
        - Authorization
        allowMethods:
        - GET
        - PUT
        - POST
        - DELETE
        - PATCH
        - OPTIONS
        allowOrigins:
        - '*'
        maxAge: 1728000
      type: CORS
    matches:
    - path:
        type: RegularExpression
        value: (?i)/users/(\d+).*
    name: rule-0
    timeouts:
      request: 10s
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  annotations:
    gateway.networking.k8s.io/generator: ingress2gateway-dev
  name: my-ingress-my-host-example-com-ssl-redirect
  namespace: my-ns
spec:
  hostnames:
  - my-host.example.com
  parentRefs:
  - name: nginx
    port: 80
  rules:
  - filters:
    - requestRedirect:
        scheme: https
        statusCode: 308
      type: RequestRedirect

Ingress2Gateway successfully translated some annotations into their Gateway API equivalents. For example, the nginx.ingress.kubernetes.io/enable-cors annotation was translated into a CORS filter. But upon closer inspection, the nginx.ingress.kubernetes.io/proxy-{read,send}-timeout and nginx.ingress.kubernetes.io/proxy-body-size annotations do not map perfectly. The logs show the reason for these omissions as well as reasoning behind the translation.

┌─ WARN  ────────────────────────────────────────
│  Unsupported annotation nginx.ingress.kubernetes.io/configuration-snippet
│  source: INGRESS-NGINX
│  object: Ingress: my-ns/my-ingress
└─
┌─ INFO  ────────────────────────────────────────
│  Using case-insensitive regex path matches. You may want to change this.
│  source: INGRESS-NGINX
│  object: HTTPRoute: my-ns/my-ingress-my-host-example-com
└─
┌─ WARN  ────────────────────────────────────────
│  ingress-nginx only supports TCP-level timeouts; i2gw has made a best-effort translation to Gateway API timeouts.request. Please verify that this meets your needs. See documentation: https://gateway-api.sigs.k8s.io/guides/http-timeouts/
│  source: INGRESS-NGINX
│  object: HTTPRoute: my-ns/my-ingress-my-host-example-com
└─
┌─ WARN  ────────────────────────────────────────
│  Failed to apply my-ns.my-ingress.metadata.annotations."nginx.ingress.kubernetes.io/proxy-body-size" from my-ns/my-ingress: Most Gateway API implementations have reasonable body size and buffering defaults
│  source: STANDARD_EMITTER
│  object: HTTPRoute: my-ns/my-ingress-my-host-example-com
└─
┌─ WARN  ────────────────────────────────────────
│  Gateway API does not support configuring URL normalization (RFC 3986, Section 6). Please check if this matters for your use case and consult implementation-specific details.
│  source: STANDARD_EMITTER
└─

There is a warning that Ingress2Gateway does not support the nginx.ingress.kubernetes.io/configuration-snippet annotation. You will have to check your Gateway API implementation documentation to see if there is a way to achieve equivalent behavior.

The tool also notified us that Ingress-NGINX regex matches are case-insensitive prefix matches, which is why there is a match pattern of (?i)/users/(\d+).*. Most organizations will want to change this behavior to be an exact case-sensitive match by removing the leading (?i) and the trailing .* from the path pattern.

Ingress2Gateway made a best-effort translation from the nginx.ingress.kubernetes.io/proxy-{send,read}-timeout annotations to a 10 second request timeout in our HTTP route. If requests for this service should be much shorter, say 3 seconds, you can make the corresponding changes to your Gateway API manifests.

Also, nginx.ingress.kubernetes.io/proxy-body-size does not have a Gateway API equivalent, and was thus not translated. However, most Gateway API implementations have reasonable defaults for maximum body size and buffering, so this might not be a problem in practice. Further, some emitters might offer support for this annotation through implementation-specific extensions. For example, adding the --emitter agentgateway, --emitter envoy-gateway, or --emitter kgateway flag to the previous ingress2gateway print command would have resulted in additional implementation-specific configuration in the generated Gateway API manifests that attempted to capture the body size configuration.

We also see a warning about URL normalization. Gateway API implementations such as Agentgateway, Envoy Gateway, Kgateway, and Istio have some level of URL normalization, but the behavior varies across implementations and is not configurable through standard Gateway API. You should check and test the URL normalization behavior of your Gateway API implementation to ensure it is compatible with your use case.

To match Ingress-NGINX default behavior, Ingress2Gateway also added a listener on port 80 and a HTTP Request redirect filter to redirect HTTP traffic to HTTPS. You may not want to serve HTTP traffic at all and remove the listener on port 80 and the corresponding HTTPRoute.

Caution:

Always thoroughly review the generated output and logs.

After manually applying these changes, the Gateway API manifests might look as follows.

---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  annotations:
    gateway.networking.k8s.io/generator: ingress2gateway-dev
  name: nginx
  namespace: my-ns
spec:
  gatewayClassName: nginx
  listeners:
  - hostname: my-host.example.com
    name: my-host-example-com-https
    port: 443
    protocol: HTTPS
    tls:
      certificateRefs:
      - group: ""
        kind: Secret
        name: my-secret
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  annotations:
    gateway.networking.k8s.io/generator: ingress2gateway-dev
  name: my-ingress-my-host-example-com
  namespace: my-ns
spec:
  hostnames:
  - my-host.example.com
  parentRefs:
  - name: nginx
    port: 443
  rules:
  - backendRefs:
    - name: website-service
      port: 80
    filters:
    - cors:
        allowCredentials: true
        allowHeaders:
        - DNT
        ...
        allowMethods:
        - GET
        ...
        allowOrigins:
        - '*'
        maxAge: 1728000
      type: CORS
    matches:
    - path:
        type: RegularExpression
        value: /users/(\d+)
    name: rule-0
    timeouts:
      request: 3s

4. Verify

Now that you have Gateway API manifests, you should thoroughly test them in a development cluster. In this case, you should at least double-check that your Gateway API implementation's maximum body size defaults are appropriate for you and verify that a three-second timeout is enough.

After validating behavior in a development cluster, deploy your Gateway API configuration alongside your existing Ingress. We strongly suggest that you then gradually shift traffic using weighted DNS, your cloud load balancer, or traffic-splitting features of your platform. This way, you can quickly recover from any misconfiguration that made it through your tests.

Finally, when you have shifted all your traffic to your Gateway API controller, delete your Ingress resources and uninstall your Ingress controller.

Conclusion

The Ingress2Gateway 1.0 release is just the beginning, and we hope that you use Ingress2Gateway to safely migrate to Gateway API. As we approach the March 2026 Ingress-NGINX retirement, we invite the community to help us increase our configuration coverage, expand testing, and improve UX.

Resources about Gateway API

The scope of Gateway API can be daunting. Here are some resources to help you work with Gateway API:

Running Agents on Kubernetes with Agent Sandbox

The landscape of artificial intelligence is undergoing a massive architectural shift. In the early days of generative AI, interacting with a model was often treated as a transient, stateless function call: a request that spun up, executed for perhaps 50 milliseconds, and terminated.

Today, the world is witnessing AI v2 eating AI v1. The ecosystem is moving from short-lived, isolated tasks to deploying multiple, coordinated AI agents that run constantly. These autonomous agents need to maintain context, use external tools, write and execute code, and communicate with one another over extended periods.

As platform engineering teams look for the right infrastructure to host these new AI workloads, one platform stands out as the natural choice: Kubernetes. However, mapping these unique agentic workloads to traditional Kubernetes primitives requires a new abstraction.

This is where the new Agent Sandbox project (currently in development under SIG Apps) comes into play.

The Kubernetes advantage (and the abstraction gap)

Kubernetes is the de facto standard for orchestrating cloud-native applications precisely because it solves the challenges of extensibility, robust networking, and ecosystem maturity. However, as AI evolves from short-lived inference requests to long-running, autonomous agents, we are seeing the emergence of a new operational pattern.

AI agents, by contrast, are typically isolated, stateful, singleton workloads. They act as a digital workspace or execution environment for an LLM. An agent needs a persistent identity and a secure scratchpad for writing and executing (often untrusted) code. Crucially, because these long-lived agents are expected to be mostly idle except for brief bursts of activity, they require a lifecycle that supports mechanisms like suspension and rapid resumption.

While you could theoretically approximate this by stringing together a StatefulSet of size 1, a headless Service, and a PersistentVolumeClaim for every single agent, managing this at scale becomes an operational nightmare.

Because of these unique properties, traditional Kubernetes primitives don't perfectly align.

Introducing Kubernetes Agent Sandbox

To bridge this gap, SIG Apps is developing agent-sandbox. The project introduces a declarative, standardized API specifically tailored for singleton, stateful workloads like AI agent runtimes.

At its core, the project introduces the Sandbox CRD. It acts as a lightweight, single-container environment built entirely on Kubernetes primitives, offering:

  • Strong isolation for untrusted code: When an AI agent generates and executes code autonomously, security is paramount. The Sandbox custom resource natively supports different runtimes, like gVisor or Kata Containers. This provides the necessary kernel and network isolation required for multi-tenant, untrusted execution.
  • Lifecycle management: Unlike traditional web servers optimized for steady, stateless traffic, an AI agent operates as a stateful workspace that may be idle for hours between tasks. Agent Sandbox supports scaling these idle environments to zero to save resources, while ensuring they can resume exactly where they left off.
  • Stable identity: Coordinated multi-agent systems require stable networking. Every Sandbox is given a stable hostname and network identity, allowing distinct agents to discover and communicate with each other seamlessly.

Scaling agents with extensions

Because the AI space is moving incredibly quickly, we built an Extensions API layer that enables even faster iteration and development.

Starting a new pod adds about a second of overhead. That's perfectly fine when deploying a new version of a microservice, but when an agent is invoked after being idle, a one-second cold start breaks the continuity of the interaction. It forces the user or the orchestrating service to wait for the environment to provision before the model can even begin to think or act. SandboxWarmPool solves this by maintaining a pool of pre-provisioned Sandbox pods, effectively eliminating cold starts. Users or orchestration services can simply issue a SandboxClaim against a SandboxTemplate, and the controller immediately hands over a pre-warmed, fully isolated environment to the agent.

Quick start

Ready to try it yourself? You can install the Agent Sandbox core components and extensions directly into your learning or sandbox cluster, using your chosen release.

We recommend you use the latest release as the project is moving fast.

# Replace "vX.Y.Z" with a specific version tag (e.g., "v0.1.0") from
# https://github.com/kubernetes-sigs/agent-sandbox/releases
export VERSION="vX.Y.Z"

# Install the core components:
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${VERSION}/manifest.yaml

# Install the extensions components (optional):
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${VERSION}/extensions.yaml

# Install the Python SDK (optional):
# Create a virtual Python environment
python3 -m venv .venv
source .venv/bin/activate
# Install from PyPI
pip install k8s-agent-sandbox

Once installed, you can try out the Python SDK for AI agents or deploy one of the ready-to-use examples to see how easy it is to spin up an isolated agent environment.

The future of agents is cloud native

Whether it’s a 50-millisecond stateless task, or a multi-week, mostly-idle collaborative process, extending Kubernetes with primitives designed specifically for isolated stateful singletons allows us to leverage all the robust benefits of the cloud-native ecosystem.

The Agent Sandbox project is open source and community-driven. If you are building AI platforms, developing agentic frameworks, or are interested in Kubernetes extensibility, we invite you to get involved:

Securing Production Debugging in Kubernetes

During production debugging, the fastest route is often broad access such as cluster-admin (a ClusterRole that grants administrator-level access), shared bastions/jump boxes, or long-lived SSH keys. It works in the moment, but it comes with two common problems: auditing becomes difficult, and temporary exceptions have a way of becoming routine.

This post offers my recommendations for good practices applicable to existing Kubernetes environments with minimal tooling changes:

  • Least privilege with RBAC
  • Short-lived, identity-bound credentials
  • An SSH-style handshake model for cloud native debugging

A good architecture for securing production debugging workflows is to use a just-in-time secure shell gateway (often deployed as an on demand pod in the cluster). It acts as an SSH-style “front door” that makes temporary access actually temporary. You can authenticate with short-lived, identity-bound credentials, establish a session to the gateway, and the gateway uses the Kubernetes API and RBAC to control what they can do, such as pods/log, pods/exec, and pods/portforward. Sessions expire automatically, and both the gateway logs and Kubernetes audit logs capture who accessed what and when without shared bastion accounts or long-lived keys.

1) Using an access broker on top of Kubernetes RBAC

RBAC controls who can do what in Kubernetes. Many Kubernetes environments rely primarily on RBAC for authorization, although Kubernetes also supports other authorization modes such as Webhook authorization. You can enforce access directly with Kubernetes RBAC, or put an access broker in front of the cluster that still relies on Kubernetes permissions under the hood. In either model, Kubernetes RBAC remains the source of truth for what the Kubernetes API allows and at what scope.

An access broker adds controls that RBAC does not cover well. For example, it can decide whether a request is auto-approved or requires manual approval, whether a user can run a command, and which commands are allowed in a session. It can also manage group membership so that you grant permissions to groups instead of individual users. Kubernetes RBAC can allow actions such as pods/exec, but it cannot restrict which commands run inside an exec session.

With that model, Kubernetes RBAC defines the allowed actions for a user or group (for example, an on-call team in a single namespace). I recommend you only define access rules that grant rights to groups or to ServiceAccounts - never to individual users. The broker or identity provider then adds or removes users from that group as needed.

The broker can also enforce extra policy on top, like which commands are permitted in an interactive session and which requests can be auto-approved versus require manual approval. That policy can live in a JSON or XML file and be maintained through code review, so updates go through a formal pull request and are reviewed like any other production change.

Example: a namespaced on-call debug Role

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: oncall-debug
  namespace: <namespace>
rules:
  # Discover what’s running
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["get", "list", "watch"]

  # Read logs
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]

  # Interactive debugging actions
  - apiGroups: [""]
    resources: ["pods/exec", "pods/portforward"]
    verbs: ["create"]

  # Understand rollout/controller state
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]

  # Optional: allow kubectl debug ephemeral containers
  - apiGroups: [""]
    resources: ["pods/ephemeralcontainers"]
    verbs: ["update"]

Bind the Role to a group (rather than individual users) so membership can be managed through your identity provider:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: oncall-debug
  namespace: <namespace>
subjects:
  - kind: Group
    name: oncall-<team-name>
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: oncall-debug
  apiGroup: rbac.authorization.k8s.io

2) Short-lived, identity-bound credentials

The goal is to use short-lived, identity-bound credentials that clearly tie a session to a real person and expire quickly. These credentials can include the user’s identity and the scope of what they’re allowed to do. They’re typically signed using a private key that stays with the engineer, such as a hardware-backed key (for example, a YubiKey), so they can not be forged without access to that key.

You can implement this with Kubernetes-native authentication (for example, client certificates or an OIDC-based flow), or have the access broker from the previous section issue short-lived credentials on the user’s behalf. In many setups, Kubernetes still uses RBAC to enforce permissions based on the authenticated identity and groups/claims. If you use an access broker, it can also encode additional scope constraints in the credential and enforce them during the session, such as which cluster or namespace the session applies to and which actions (or approved commands) are allowed against pods or nodes. In either case, the credentials should be signed by a certificate authority (CA), and that CA should be rotated on a regular schedule (for example, quarterly) to limit long-term risk.

Option A: short-lived OIDC tokens

A lot of managed Kubernetes clusters already give you short-lived tokens. The main thing is to make sure your kubeconfig refreshes them automatically instead of copying a long-lived token into the file.

For example:

users:
- name: oncall
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1
      command: cred-helper
      args: ["--cluster=prod", "--ttl=30m"]

Option B: Short-lived client certificates (X.509)

If your API server (or your access broker from the previous section) is set up to trust a client CA, you can use short-lived client certificates for debugging access. The idea is:

  • The private key is created and kept under the engineer’s machine (ideally hardware-backed, like a non-exportable key in a YubiKey/PIV token)
  • A short-lived certificate is issued (often via the CertificateSigningRequest API, or your access broker from the previous section, with a TTL).
  • RBAC maps the authenticated identity to a minimal Role

This is straightforward to operationalize with the Kubernetes CertificateSigningRequest API.

Generate a key and CSR locally:

# Generate a private key.
# This could instead be generated within a hardware token;
# OpenSSL and several similar tools include support for that.
openssl genpkey -algorithm Ed25519 -out oncall.key

openssl req -new -key oncall.key -out oncall.csr \
  -subj "/CN=user/O=oncall-payments"

Create a CertificateSigningRequest with a short expiration:

apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: oncall-<user>-20260218
spec:
  request: <base64-encoded oncall.csr>
  signerName: kubernetes.io/kube-apiserver-client
  expirationSeconds: 1800  # 30 minutes
  usages:
    - client auth

After the CSR is approved and signed, you extract the issued certificate and use it together with the private key to authenticate, for example via kubectl.

3) Use a just-in-time access gateway to run debugging commands

Once you have short-lived credentials, you can use them to open a secure shell session to a just-in-time access gateway, often exposed over SSH and created on demand. If the gateway is exposed over SSH, a common pattern is to issue the engineer a short-lived OpenSSH user certificate for the session. The gateway trusts your SSH user CA, authenticates the engineer at connection time, and then applies the approved session policy before making Kubernetes API calls on the user’s behalf. OpenSSH certificates are separate from Kubernetes X.509 client certificates, so these are usually treated as distinct layers.

The resulting session should also be scoped so it cannot be reused outside of what was approved. For example, the gateway or broker can limit it to a specific cluster and namespace, and optionally to a narrower target such as a pod or node. That way, even if someone tries to reuse the access, it will not work outside the intended scope. After the session is established, the gateway executes only the allowed actions and records what happened for auditing.

Example: Namespace-scoped role bindings

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: jit-debug
  namespace: <namespace>
  annotations:
    kubernetes.io/description: >
      Colleagues performing semi-privileged debugging, with access provided
      just in time and on demand.
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: jit-debug
  namespace: <namespace>
subjects:
  - kind: Group
    name: jit:oncall:<namespace>   # mapped from the short-lived credential (cert/OIDC)
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: jit-debug
  apiGroup: rbac.authorization.k8s.io

These RBAC objects, and the rules they define, allow debugging only within the specified namespace; attempts to access other namespaces are not allowed.

Example: Cluster-scoped role binding

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: jit-cluster-read
rules:
  - apiGroups: [""]
    resources: ["nodes", "namespaces"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: jit-cluster-read
subjects:
  - kind: Group
    name: jit:oncall:cluster
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: jit-cluster-read
  apiGroup: rbac.authorization.k8s.io

These RBAC rules grant cluster-wide read access (for example, to nodes and namespaces) and should be used only for workflows that truly require cluster-scoped resources.

Finer-grained restrictions like “only this pod/node” or “only these commands” are typically enforced by the access gateway/broker during the session, but Kubernetes also offers other options, such as ValidatingAdmissionPolicy for restricting writes and webhook authorization for custom authorization across verbs.

In environments with stricter access controls, you can add an extra, short-lived session mediation layer to separate session establishment from privileged actions. Both layers are ephemeral, use identity-bound expiring credentials, and produce independent audit trails. The mediation layer handles session setup/forwarding, while the execution layer performs only RBAC-authorized Kubernetes actions. This separation can reduce exposure by narrowing responsibilities, scoping credentials per step, and enforcing end-to-end session expiry.

References

Disclaimer: The views expressed in this post are solely those of the author and do not reflect the views of the author’s employer or any other organization.

The Invisible Rewrite: Modernizing the Kubernetes Image Promoter

Every container image you pull from registry.k8s.io got there through kpromo, the Kubernetes image promoter. It copies images from staging registries to production, signs them with cosign, replicates signatures across more than 20 regional mirrors, and generates SLSA provenance attestations. If this tool breaks, no Kubernetes release ships. Over the past few weeks, we rewrote its core from scratch, deleted 20% of the codebase, made it dramatically faster, and nobody noticed. That was the whole point.

A bit of history

The image promoter started in late 2018 as an internal Google project by Linus Arver. The goal was simple: replace the manual, Googler-gated process of copying container images into k8s.gcr.io with a community-owned, GitOps-based workflow. Push to a staging registry, open a PR with a YAML manifest, get it reviewed and merged, and automation handles the rest. KEP-1734 formalized this proposal.

In early 2019, the code moved to kubernetes-sigs/k8s-container-image-promoter and grew quickly. Over the next few years, Stephen Augustus consolidated multiple tools (cip, gh2gcs, krel promote-images, promobot-files) into a single CLI called kpromo. The repository was renamed to promo-tools. Adolfo Garcia Veytia (Puerco) added cosign signing and SBOM support. Tyler Ferrara built vulnerability scanning. Carlos Panato kept the project in a healthy and releasable state. 42 contributors made about 3,500 commits across more than 60 releases.

It worked. But by 2025 the codebase carried the weight of seven years of incremental additions from multiple SIGs and subprojects. The README said it plainly: you will see duplicated code, multiple techniques for accomplishing the same thing, and several TODOs.

The problems we needed to solve

Production promotion jobs for Kubernetes core images regularly took over 30 minutes and frequently failed with rate limit errors. The core promotion logic had grown into a monolith that was hard to extend and difficult to test, making new features like provenance or vulnerability scanning painful to add.

On the SIG Release roadmap, two work items had been sitting for a while: "Rewrite artifact promoter" and "Make artifact validation more robust". We had discussed these at SIG Release meetings and KubeCons, and the open research spikes on project board #171 captured eight questions that needed answers before we could move forward.

One issue to answer them all

In February 2026, we opened issue #1701 ("Rewrite artifact promoter pipeline") and answered all eight spikes in a single tracking issue. The rewrite was deliberately phased so that each step could be reviewed, merged, and validated independently. Here is what we did:

Phase 1: Rate Limiting (#1702). Rewrote rate limiting to properly throttle all registry operations with adaptive backoff.

Phase 2: Interfaces (#1704). Put registry and auth operations behind clean interfaces so they can be swapped out and tested independently.

Phase 3: Pipeline Engine (#1705). Built a pipeline engine that runs promotion as a sequence of distinct phases instead of one large function.

Phase 4: Provenance (#1706). Added SLSA provenance verification for staging images.

Phase 5: Scanner and SBOMs (#1709). Added vulnerability scanning and SBOM support. Flipped the default to the new pipeline engine. At this point we cut v4.2.0 and let it soak in production before continuing.

Phase 6: Split Signing from Replication (#1713). Separated image signing from signature replication into their own pipeline phases, eliminating the rate limit contention that caused most production failures.

Phase 7: Remove Legacy Pipeline (#1712). Deleted the old code path entirely.

Phase 8: Remove Legacy Dependencies (#1716). Deleted the audit subsystem, deprecated tools, and e2e test infrastructure.

Phase 9: Delete the Monolith (#1718). Removed the old monolithic core and its supporting packages. Thousands of lines deleted across phases 7 through 9.

Each phase shipped independently. v4.3.0 followed the next day with the legacy code fully removed.

With the new architecture in place, a series of follow-up improvements landed: parallelized registry reads (#1736), retry logic for all network operations (#1742), per-request timeouts to prevent pipeline hangs (#1763), HTTP connection reuse (#1759), local registry integration tests (#1746), the removal of deprecated credential file support (#1758), a rework of attestation handling to use cosign's OCI APIs and the removal of deprecated SBOM support (#1764), and a dedicated promotion record predicate type registered with the in-toto attestation framework (#1767). These would have been much harder to land without the clean separation the rewrite provided. v4.4.0 shipped all of these improvements and enabled provenance generation and verification by default.

The new pipeline

The promotion pipeline now has seven clearly separated phases:

graph LR
    Setup --> Plan --> Provenance --> Validate --> Promote --> Sign --> Attest
PhaseWhat it does
SetupValidate options, prewarm TUF cache.
PlanParse manifests, read registries, compute which images need promotion.
ProvenanceVerify SLSA attestations on staging images.
ValidateCheck cosign signatures, exit here for dry runs.
PromoteCopy images server-side, preserving digests.
SignSign promoted images with keyless cosign.
AttestGenerate promotion provenance attestations using a dedicated in-toto predicate type.

Phases run sequentially, so each one gets exclusive access to the full rate limit budget. No more contention. Signature replication to mirror registries is no longer part of this pipeline and runs as a dedicated periodic Prow job instead.

Making it fast

With the architecture in place, we turned to performance.

Parallel registry reads (#1736): The plan phase reads 1,350 registries. We parallelized this and the plan phase dropped from about 20 minutes to about 2 minutes.

Two-phase tag listing (#1761): Instead of checking all 46,000 image groups across more than 20 mirrors, we first check only the source repositories. About 57% of images have no signatures at all because they were promoted before signing was enabled. We skip those entirely, cutting API calls roughly in half.

Source check before replication (#1727): Before iterating all mirrors for a given image, we check if the signature exists on the primary registry first. In steady state where most signatures are already replicated, this reduced the work from about 17 hours to about 15 minutes.

Per-request timeouts (#1763): We observed intermittent hangs where a stalled connection blocked the pipeline for over 9 hours. Every network operation now has its own timeout and transient failures are retried automatically.

Connection reuse (#1759): We started reusing HTTP connections and auth state across operations, eliminating redundant token negotiations. This closed a long-standing request from 2023.

By the numbers

Here is what the rewrite looks like in aggregate.

  • Over 40 PRs merged, 3 releases shipped (v4.2.0, v4.3.0, v4.4.0)
  • Over 10,000 lines added and over 16,000 lines deleted, a net reduction of about 5,000 lines (20% smaller codebase)
  • Performance drastically improved across the board
  • Robustness improved with retry logic, per-request timeouts, and adaptive rate limiting
  • 19 long-standing issues closed

The codebase shrank by a fifth while gaining provenance attestations, a pipeline engine, vulnerability scanning integration, parallelized operations, retry logic, integration tests against local registries, and a standalone signature replication mode.

No user-facing changes

This was a hard requirement. The kpromo cip command accepts the same flags and reads the same YAML manifests. The post-k8sio-image-promo Prow job continued working throughout. The promotion manifests in kubernetes/k8s.io did not change. Nobody had to update their workflows or configuration.

We caught two regressions early in production. One (#1731) caused a registry key mismatch that made every image appear as "lost" so that nothing was promoted. Another (#1733) set the default thread count to zero, blocking all goroutines. Both were fixed within hours. The phased release strategy (v4.2.0 with the new engine, v4.3.0 with legacy code removed) gave us a clear rollback path that we fortunately never needed.

What comes next

Signature replication across all mirror registries remains the most expensive part of the promotion cycle. Issue #1762 proposes eliminating it entirely by having archeio (the registry.k8s.io redirect service) route signature tag requests to a single canonical upstream instead of per-region backends. Another option would be to move signing closer to the registry infrastructure itself. Both approaches need further discussion with the SIG Release and infrastructure teams, but either one would remove thousands of API calls per promotion cycle and simplify the codebase even further.

Thank you

This project has been a community effort spanning seven years. Thank you to Linus, Stephen, Adolfo, Carlos, Ben, Marko, Lauri, Tyler, Arnaud, and many others who contributed code, reviews, and planning over the years. The SIG Release and Release Engineering communities provided the context, the discussions, and the patience for a rewrite of infrastructure that every Kubernetes release depends on.

If you want to get involved, join us in #release-management on the Kubernetes Slack or check out the repository.

Announcing the AI Gateway Working Group

The community around Kubernetes includes a number of Special Interest Groups (SIGs) and Working Groups (WGs) facilitating discussions on important topics between interested contributors. Today, we're excited to announce the formation of the AI Gateway Working Group, a new initiative focused on developing standards and best practices for networking infrastructure that supports AI workloads in Kubernetes environments.

What is an AI Gateway?

In a Kubernetes context, an AI Gateway refers to network gateway infrastructure (including proxy servers, load-balancers, etc.) that generally implements the Gateway API specification with enhanced capabilities for AI workloads. Rather than defining a distinct product category, AI Gateways describe infrastructure designed to enforce policy on AI traffic, including:

  • Token-based rate limiting for AI APIs.
  • Fine-grained access controls for inference APIs.
  • Payload inspection enabling intelligent routing, caching, and guardrails.
  • Support for AI-specific protocols and routing patterns.

Working group charter and mission

The AI Gateway Working Group operates under a clear charter with the mission to develop proposals for Kubernetes Special Interest Groups (SIGs) and their sub-projects. Its primary goals include:

  • Standards Development: Create declarative APIs, standards, and guidance for AI workload networking in Kubernetes.
  • Community Collaboration: Foster discussions and build consensus around best practices for AI infrastructure.
  • Extensible Architecture: Ensure composability, pluggability, and ordered processing for AI-specific gateway extensions.
  • Standards-Based Approach: Build on established networking foundations, layering AI-specific capabilities on top of proven standards.

Active proposals

WG AI Gateway currently has several active proposals that address key challenges in AI workload networking:

Payload Processing

The payload processing proposal addresses the critical need for AI workloads to inspect and transform full HTTP request and response payloads. This enables:

AI Inference Security

  • Guard against malicious prompts and prompt injection attacks.
  • Content filtering for AI responses.
  • Signature-based detection and anomaly detection for AI traffic.

AI Inference Optimization

  • Semantic routing based on request content.
  • Intelligent caching to reduce inference costs and improve response times.
  • RAG (Retrieval-Augmented Generation) system integration for context enhancement.

The proposal defines standards for declarative payload processor configuration, ordered processing pipelines, and configurable failure modes - all essential for production AI workload deployments.

Egress gateways

Modern AI applications increasingly depend on external inference services, whether for specialized models, failover scenarios, or cost optimization. The egress gateways proposal aims to define standards for securely routing traffic outside the cluster. Key features include:

External AI Service Integration

  • Secure access to cloud-based AI services (OpenAI, Vertex AI, Bedrock, etc.).
  • Managed authentication and token injection for third-party AI APIs.
  • Regional compliance and failover capabilities.

Advanced Traffic Management

  • Backend resource definitions for external FQDNs and services.
  • TLS policy management and certificate authority control.
  • Cross-cluster routing for centralized AI infrastructure.

User Stories We're Addressing

  • Platform operators providing managed access to external AI services.
  • Developers requiring inference failover across multiple cloud providers.
  • Compliance engineers enforcing regional restrictions on AI traffic.
  • Organizations centralizing AI workloads on dedicated clusters.

Upcoming events

KubeCon + CloudNativeCon Europe 2026, Amsterdam

AI Gateway working group members will be presenting at KubeCon + CloudNativeCon Europe in Amsterdam, discussing the problems at the intersection of AI and networking, including the working group's active proposals, as well as the intersection of AI gateways with Model Context Protocol (MCP) and agent networking patterns.
This session will showcase how AI Gateway working group proposals enable the infrastructure needed for next-generation AI deployments and communication patterns.
The session will also include the initial designs, early prototypes, and emerging directions shaping the WG’s roadmap.
For more details see our session here:

Get involved

The AI Gateway Working Group represents the Kubernetes community's commitment to standardizing AI workload networking. As AI becomes increasingly integral to modern applications, we need robust, standardized infrastructure that can support the unique requirements of inference workloads while maintaining the security, observability, and reliability standards that Kubernetes users expect.
Our proposals are currently in active development, with implementations beginning across various gateway projects. We're working closely with SIG Network on Gateway API enhancements and collaborating with the broader cloud-native community to ensure our standards meet real-world production needs.

Whether you're a gateway implementer, platform operator, AI application developer, or simply interested in the intersection of Kubernetes and AI, we'd love your input. The working group follows an open contribution model - you can review our proposals, join our weekly meetings, or start discussions on our GitHub repository. To learn more:

The future of AI infrastructure in Kubernetes is being built today, join up and learn how you can contribute and help shape the future of AI-aware gateway capabilities in Kubernetes.

Before You Migrate: Five Surprising Ingress-NGINX Behaviors You Need to Know

As announced November 2025, Kubernetes will retire Ingress-NGINX in March 2026. Despite its widespread usage, Ingress-NGINX is full of surprising defaults and side effects that are probably present in your cluster today. This blog highlights these behaviors so that you can migrate away safely and make a conscious decision about which behaviors to keep. This post also compares Ingress-NGINX with Gateway API and shows you how to preserve Ingress-NGINX behavior in Gateway API. The recurring risk pattern in every section is the same: a seemingly correct translation can still cause outages if it does not consider Ingress-NGINX's quirks.

I'm going to assume that you, the reader, have some familiarity with Ingress-NGINX and the Ingress API. Most examples use httpbin as the backend.

Also, note that Ingress-NGINX and NGINX Ingress are two separate Ingress controllers. Ingress-NGINX is an Ingress controller maintained and governed by the Kubernetes community that is retiring March 2026. NGINX Ingress is an Ingress controller by F5. Both use NGINX as the dataplane, but are otherwise unrelated. From now on, this blog post only discusses Ingress-NGINX.

1. Regex matches are prefix-based and case insensitive

Suppose that you wanted to route all requests with a path consisting of only three uppercase letters to the httpbin service. You might create the following Ingress with the nginx.ingress.kubernetes.io/use-regex: "true" annotation and the regex pattern of /[A-Z]{3}.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: regex-match-ingress
  annotations:
    nginx.ingress.kubernetes.io/use-regex: "true"
spec:
  ingressClassName: nginx
  rules:
  - host: regex-match.example.com
    http:
      paths:
      - path: "/[A-Z]{3}"
        pathType: ImplementationSpecific
        backend:
          service:
            name: httpbin
            port:
              number: 8000

However, because regex matches are prefix and case insensitive, Ingress-NGINX routes any request with a path that starts with any three letters to httpbin:

curl -sS -H "Host: regex-match.example.com" http://<your-ingress-ip>/uuid

The output is similar to:

{
  "uuid": "e55ef929-25a0-49e9-9175-1b6e87f40af7"
}

Note: The /uuid endpoint of httpbin returns a random UUID. A UUID in the response body means that the request was successfully routed to httpbin.

With Gateway API, you can use an HTTP path match with a type of RegularExpression for regular expression path matching. RegularExpression matches are implementation specific, so check with your Gateway API implementation to verify the semantics of RegularExpression matching. Popular Envoy-based Gateway API implementations such as Istio1, Envoy Gateway, and Kgateway do a full case-sensitive match.

Thus, if you are unaware that Ingress-NGINX patterns are prefix and case-insensitive, and, unbeknownst to you, clients or applications send traffic to /uuid (or /uuid/some/other/path), you might create the following HTTP route.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: regex-match-route
spec:
  hostnames:
  - regex-match.example.com
  parentRefs:
  - name: <your gateway>  # Change this depending on your use case
  rules:
  - matches:
    - path:
        type: RegularExpression
        value: "/[A-Z]{3}"
    backendRefs:
    - name: httpbin
      port: 8000

However, if your Gateway API implementation does full case-sensitive matches, the above HTTP route would not match a request with a path of /uuid. The above HTTP route would thus cause an outage because requests that Ingress-NGINX routed to httpbin would fail with a 404 Not Found at the gateway.

To preserve the case-insensitive regex matching, you can use the following HTTP route.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: regex-match-route
spec:
  hostnames:
  - regex-match.example.com
  parentRefs:
  - name: <your gateway>  # Change this depending on your use case
  rules:
  - matches:
    - path:
        type: RegularExpression
        value: "/[a-zA-Z]{3}.*"
    backendRefs:
    - name: httpbin
      port: 8000

Alternatively, the aforementioned proxies support the (?i) flag to indicate case insensitive matches. Using the flag, the pattern could be (?i)/[a-z]{3}.*.

2. The nginx.ingress.kubernetes.io/use-regex applies to all paths of a host across all (Ingress-NGINX) Ingresses

Now, suppose that you have an Ingress with the nginx.ingress.kubernetes.io/use-regex: "true" annotation, but you want to route requests with a path of exactly /headers to httpbin. Unfortunately, you made a typo and set the path to /Header instead of /headers.

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: regex-match-ingress
  annotations:
    nginx.ingress.kubernetes.io/use-regex: "true"
spec:
  ingressClassName: nginx
  rules:
  - host: regex-match.example.com
    http:
      paths:
      - path: "<some regex pattern>"
        pathType: ImplementationSpecific
        backend:
          service:
            name: <your backend>
            port:
              number: 8000
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: regex-match-ingress-other
spec:
  ingressClassName: nginx
  rules:
  - host: regex-match.example.com
    http:
      paths:
      - path: "/Header" # typo here, should be /headers
        pathType: Exact
        backend:
          service:
            name: httpbin
            port:
              number: 8000

Most would expect a request to /headers to respond with a 404 Not Found, since /headers does not match the Exact path of /Header. However, because the regex-match-ingress Ingress has the nginx.ingress.kubernetes.io/use-regex: "true" annotation and the regex-match.example.com host, all paths with the regex-match.example.com host are treated as regular expressions across all (Ingress-NGINX) Ingresses. Since regex patterns are case-insensitive prefix matches, /headers matches the /Header pattern and Ingress-NGINX routes such requests to httpbin. Running the command

curl -sS -H "Host: regex-match.example.com" http://<your-ingress-ip>/headers

the output looks like:

{
  "headers": {
    ...
  }
}

Note: The /headers endpoint of httpbin returns the request headers. The fact that the response contains the request headers in the body means that the request was successfully routed to httpbin.

Gateway API does not silently convert or interpret Exact and Prefix matches as regex patterns. So if you converted the above Ingresses into the following HTTP route and preserved the typo and match types, requests to /headers will respond with a 404 Not Found instead of a 200 OK.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: regex-match-route
spec:
  hostnames:
  - regex-match.example.com
  rules:
  ...
  - matches:
    - path:
        type: Exact
        value: "/Header"
    backendRefs:
    - name: httpbin
      port: 8000

To keep the case-insensitive prefix matching, you can change

  - matches:
    - path:
        type: Exact
        value: "/Header"

to

  - matches:
    - path:
        type: RegularExpression
        value: "(?i)/Header"

Or even better, you could fix the typo and change the match to

  - matches:
    - path:
        type: Exact
        value: "/headers"

3. Rewrite target implies regex

In this case, suppose you want to rewrite the path of requests with a path of /ip to /uuid before routing them to httpbin, and as in Section 2, you want to route requests with the path of exactly /headers to httpbin. However, you accidentally make a typo and set the path to /IP instead of /ip and /Header instead of /headers.

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rewrite-target-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: "/uuid"
spec:
  ingressClassName: nginx
  rules:
  - host: rewrite-target.example.com
    http:
      paths:
      - path: "/IP"
        pathType: Exact
        backend:
          service:
            name: httpbin
            port:
              number: 8000
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rewrite-target-ingress-other
spec:
  ingressClassName: nginx
  rules:
  - host: rewrite-target.example.com
    http:
      paths:
      - path: "/Header"
        pathType: Exact
        backend:
          service:
            name: httpbin
            port:
              number: 8000

The nginx.ingress.kubernetes.io/rewrite-target: "/uuid" annotation causes requests that match paths in the rewrite-target-ingress Ingress to have their paths rewritten to /uuid before being routed to the backend.

Even though no Ingress has the nginx.ingress.kubernetes.io/use-regex: "true" annotation, the presence of the nginx.ingress.kubernetes.io/rewrite-target annotation in the rewrite-target-ingress Ingress causes all paths with the rewrite-target.example.com host to be treated as regex patterns. In other words, the nginx.ingress.kubernetes.io/rewrite-target silently adds the nginx.ingress.kubernetes.io/use-regex: "true" annotation, along with all the side effects discussed above.

For example, a request to /ip has its path rewritten to /uuid because /ip matches the case-insensitive prefix pattern of /IP in the rewrite-target-ingress Ingress. After running the command

curl -sS -H "Host: rewrite-target.example.com" http://<your-ingress-ip>/ip

the output is similar to:

{
  "uuid": "12a0def9-1adg-2943-adcd-1234aadfgc67"
}

Like in the nginx.ingress.kubernetes.io/use-regex example, Ingress-NGINX treats paths of other ingresses with the rewrite-target.example.com host as case-insensitive prefix patterns. Running the command

curl -sS -H "Host: rewrite-target.example.com" http://<your-ingress-ip>/headers

gives an output that looks like

{
  "headers": {
    ...
  }
}

You can configure path rewrites in Gateway API with the HTTP URL rewrite filter which does not silently convert your Exact and Prefix matches into regex patterns. However, if you are unaware of the side effects of the nginx.ingress.kubernetes.io/rewrite-target annotation and do not realize that /Header and /IP are both typos, you might create the following HTTP route.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: rewrite-target-route
spec:
  hostnames:
  - rewrite-target.example.com
  parentRefs:
  - name: <your-gateway>
  rules:
  - matches:
    - path:
        type: Exact
        value: "/IP"
    filters:
    - type: URLRewrite
      urlRewrite:
        path:
          type: ReplaceFullPath
          replaceFullPath: /uuid
    backendRefs:
    - name: httpbin
      port: 8000
  - matches:
    - path:
        # This is an exact match, irrespective of other rules
        type: Exact
        value: "/Header"
    backendRefs:
    - name: httpbin
      port: 8000

As with Section 2, because /IP is now an Exact match type in your HTTP route, requests to /ip will respond with a 404 Not Found instead of a 200 OK. Similarly, requests to /headers will also respond with a 404 Not Found instead of a 200 OK. Thus, this HTTP route will break applications and clients that rely on the /ip and /headers routes.

To fix this, you can change the matches in the HTTP route to be regex matches, and change the path patterns to be case-insensitive prefix matches, as follows.

  - matches:
    - path:
        type: RegularExpression
        value: "(?i)/IP.*"
...
  - matches:
    - path:
        type: RegularExpression
        value: "(?i)/Header.*"

Or, you can keep the Exact match type and fix the typos.

4. Requests missing a trailing slash are redirected to the same path with a trailing slash

Consider the following Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: trailing-slash-ingress
spec:
  ingressClassName: nginx
  rules:
  - host: trailing-slash.example.com
    http:
      paths:
      - path: "/my-path/"
        pathType: Exact
        backend:
          service:
            name: <your-backend>
            port:
              number: 8000

You might expect Ingress-NGINX to respond to /my-path with a 404 Not Found since the /my-path does not exactly match the Exact path of /my-path/. However, Ingress-NGINX redirects the request to /my-path/ with a 301 Moved Permanently because the only difference between /my-path and /my-path/ is a trailing slash.

curl -isS -H "Host: trailing-slash.example.com" http://<your-ingress-ip>/my-path

The output looks like:

HTTP/1.1 301 Moved Permanently
...
Location: http://trailing-slash.example.com/my-path/
...

The same applies if you change the pathType to Prefix. However, the redirect does not happen if the path is a regex pattern.

Conformant Gateway API implementations do not silently configure any kind of redirects. If clients or downstream services depend on this redirect, a migration to Gateway API that does not explicitly configure request redirects will cause an outage because requests to /my-path will now respond with a 404 Not Found instead of a 301 Moved Permanently. You can explicitly configure redirects using the HTTP request redirect filter as follows:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: trailing-slash-route
spec:
  hostnames:
  - trailing-slash.example.com
  parentRefs:
  - name: <your-gateway>
  rules:
  - matches:
    - path:
        type: Exact
        value: "/my-path"
    filters:
      requestRedirect:
        statusCode: 301
        path:
          type: ReplaceFullPath
          replaceFullPath: /my-path/
  - matches:
    - path:
        type: Exact # or Prefix
        value: "/my-path/"
    backendRefs:
    - name: <your-backend>
      port: 8000

5. Ingress-NGINX normalizes URLs

URL normalization is the process of converting a URL into a canonical form before matching it against Ingress rules and routing it. The specifics of URL normalization are defined in RFC 3986 Section 6.2, but some examples are

  • removing path segments that are just a .: my/./path -> my/path
  • having a .. path segment remove the previous segment: my/../path -> /path
  • deduplicating consecutive slashes in a path: my//path -> my/path

Ingress-NGINX normalizes URLs before matching them against Ingress rules. For example, consider the following Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: path-normalization-ingress
spec:
  ingressClassName: nginx
  rules:
  - host: path-normalization.example.com
    http:
      paths:
      - path: "/uuid"
        pathType: Exact
        backend:
          service:
            name: httpbin
            port:
              number: 8000

Ingress-NGINX normalizes the path of the following requests to /uuid. Now that the request matches the Exact path of /uuid, Ingress-NGINX responds with either a 200 OK response or a 301 Moved Permanently to /uuid.

For the following commands

curl -sS -H "Host: path-normalization.example.com" http://<your-ingress-ip>/uuid
curl -sS -H "Host: path-normalization.example.com" http://<your-ingress-ip>/ip/abc/../../uuid
curl -sSi -H "Host: path-normalization.example.com" http://<your-ingress-ip>////uuid

the outputs are similar to

{
  "uuid": "29c77dfe-73ec-4449-b70a-ef328ea9dbce"
}
{
  "uuid": "d20d92e8-af57-4014-80ba-cf21c0c4ffae"
}
HTTP/1.1 301 Moved Permanently
...
Location: /uuid
...

Your backends might rely on the Ingress/Gateway API implementation to normalize URLs. That said, most Gateway API implementations will have some path normalization enabled by default. For example, Istio, Envoy Gateway, and Kgateway all normalize . and .. segments out of the box. For more details, check the documentation for each Gateway API implementation that you use.

Conclusion

As we all race to respond to the Ingress-NGINX retirement, I hope this blog post instills some confidence that you can migrate safely and effectively despite all the intricacies of Ingress-NGINX.

SIG Network has also been working on supporting the most common Ingress-NGINX annotations (and some of these unexpected behaviors) in Ingress2Gateway to help you translate Ingress-NGINX configuration into Gateway API, and offer alternatives to unsupported behavior.

SIG Network released Gateway API 1.5 earlier today (27th February 2026), which graduates features such as ListenerSet (that allow app developers to better manage TLS certificates), and the HTTPRoute CORS filter that allows CORS configuration.


  1. You can use Istio purely as Gateway API controller with no other service mesh features. ↩︎

Kubernetes v1.36: New Metric for Route Sync in the Cloud Controller Manager

Kubernetes v1.36 introduces a new alpha counter metric route_controller_route_sync_total to the Cloud Controller Manager (CCM) route controller implementation at k8s.io/cloud-provider. This metric increments each time routes are synced with the cloud provider.

A/B testing watch-based route reconciliation

This metric was added to help operators validate the CloudControllerManagerWatchBasedRoutesReconciliation feature gate introduced in Kubernetes v1.35. That feature gate switches the route controller from a fixed-interval loop to a watch-based approach that only reconciles when nodes actually change. This reduces unnecessary API calls to the infrastructure provider, lowering pressure on rate-limited APIs and allowing operators to make more efficient use of their available quota.

To A/B test this, compare route_controller_route_sync_total with the feature gate disabled (default) versus enabled. In clusters where node changes are infrequent, you should see a significant drop in the sync rate with the feature gate turned on.

Example: expected behavior

With the feature gate disabled (the default fixed-interval loop), the counter increments steadily regardless of whether any node changes occurred:

# After 10 minutes with no node changes
route_controller_route_sync_total 60
# After 20 minutes, still no node changes
route_controller_route_sync_total 120

With the feature gate enabled (watch-based reconciliation), the counter only increments when nodes are actually added, removed, or updated:

# After 10 minutes with no node changes
route_controller_route_sync_total 1
# After 20 minutes, still no node changes — counter unchanged
route_controller_route_sync_total 1
# A new node joins the cluster — counter increments
route_controller_route_sync_total 2

The difference is especially visible in stable clusters where nodes rarely change.

Where can I give feedback?

If you have feedback, feel free to reach out through any of the following channels:

How can I learn more?

For more details, refer to KEP-5237.

Spotlight on SIG Architecture: API Governance

This is the fifth interview of a SIG Architecture Spotlight series that covers the different subprojects, and we will be covering SIG Architecture: API Governance.

In this SIG Architecture spotlight we talked with Jordan Liggitt, lead of the API Governance sub-project.

Introduction

FM: Hello Jordan, thank you for your availability. Tell us a bit about yourself, your role and how you got involved in Kubernetes.

JL: My name is Jordan Liggitt. I'm a Christian, husband, father of four, software engineer at Google by day, and amateur musician by stealth. I was born in Texas (and still like to claim it as my point of origin), but I've lived in North Carolina for most of my life.

I've been working on Kubernetes since 2014. At that time, I was working on authentication and authorization at Red Hat, and my very first pull request to Kubernetes attempted to add an OAuth server to the Kubernetes API server. It never exited work-in-progress status. I ended up going with a different approach that layered on top of the core Kubernetes API server in a different project (spoiler alert: this is foreshadowing), and I closed it without merging six months later.

Undeterred by that start, I stayed involved, helped build Kubernetes authentication and authorization capabilities, and got involved in the definition and evolution of the core Kubernetes APIs from early beta APIs, like v1beta3 to v1. I got tagged as an API reviewer in 2016 based on those contributions, and was added as an API approver in 2017.

Today, I help lead the API Governance and code organization subprojects for SIG Architecture, and I am a tech lead for SIG Auth.

FM: And when did you get specifically involved in the API Governance project?

JL: Around 2019.

Goals and scope of API Governance

FM: How would you describe the main goals and areas of intervention of the subproject?

The surface area includes all the various APIs Kubernetes has, and there are APIs that people do not always realize are APIs: command-line flags, configuration files, how binaries are run, how they talk to back-end components like the container runtime, and how they persist data. People often think of "the API" as only the REST API... that is the biggest and most obvious one, and the one with the largest audience, but all of these other surfaces are also APIs. Their audiences are narrower, so there is more flexibility there, but they still require consideration.

The goals are to be stable while still enabling innovation. Stability is easy if you never change anything, but that contradicts the goal of evolution and growth. So we balance "be stable" with "allow change".

FM: Speaking of changes, in terms of ensuring consistency and quality (which is clearly one of the reasons this project exists), what are the specific quality gates in the lifecycle of a Kubernetes change? Does API Governance get involved during the release cycle, prior to it through guidelines, or somewhere in between? At what points do you ensure the intended role is fulfilled?

JL: We have guidelines and conventions, both for APIs in general and for how to change an API. These are living documents that we update as we encounter new scenarios. They are long and dense, so we also support them with involvement at either the design stage or the implementation stage.

Sometimes, due to bandwidth constraints, teams move ahead with design work without feedback from API Review. That’s fine, but it means that when implementation begins, the API review will happen then, and there may be substantial feedback. So we get involved when a new API is created or an existing API is changed, either at design or implementation.

FM: Is this during the Kubernetes Enhancement Proposal (KEP) process? Since KEPs are mandatory for enhancements, I assume part of the work intersects with API Governance?

JL: It can. KEPs vary in how detailed they are. Some include literal API definitions. When they do, we can perform an API review at the design stage. Then implementation becomes a matter of checking fidelity to the design.

Getting involved early is ideal. But some KEPs are conceptual and leave details to the implementation. That’s not wrong; it just means the implementation will be more exploratory. Then API Review gets involved later, possibly recommending structural changes.

There’s a trade-off regardless: detailed design upfront versus iterative discovery during implementation. People and teams work differently, and we’re flexible and happy to consult early or at implementation time.

FM: This reminds me of what Fred Brooks wrote in "The Mythical Man-Month" about conceptual integrity being central to product quality... No matter how you structure the process, there must be a point where someone looks at what is coming and ensures conceptual integrity. Kubernetes uses APIs everywhere -- externally and internally -- so API Governance is critical to maintaining that integrity. How is this captured?

JL: Yes, the conventions document captures patterns we’ve learned over time: what to do in various situations. We also have automated linters and checks to ensure correctness around patterns like spec/status semantics. These automated tools help catch issues even when humans miss them.

As new scenarios arise -- and they do constantly -- we think through how to approach them and fold the results back into our documentation and tools. Sometimes it takes a few attempts before we settle on an approach that works well.

FM: Exactly. Each new interaction improves the guidelines.

JL: Right. And sometimes the first approach turns out to be wrong. It may take two or three iterations before we land on something robust.

The impact of Custom Resource Definitions

FM: Is there any particular change, episode, or domain that stands out as especially noteworthy, complex, or interesting in your experience?

JL: The watershed moment was Custom Resources. Prior to that, every API was handcrafted by us and fully reviewed. There were inconsistencies, but we understood and controlled every type and field.

When Custom Resources arrived, anyone could define anything. The first version did not even require a schema. That made it extremely powerful -- it enabled change immediately -- but it left us playing catch-up on stability and consistency.

When Custom Resources graduated to General Availability (GA), schemas became required, but escape hatches still existed for backward compatibility. Since then, we’ve been working on giving CRD authors validation capabilities comparable to built-ins. Built-in validation rules for CRDs have only just reached GA in the last few releases.

So CRDs opened the "anything is possible" era. Built-in validation rules are the second major milestone: bringing consistency back.

The three major themes have been defining schemas, validating data, and handling pre-existing invalid data. With ratcheting validation (allowing data to improve without breaking existing objects), we can now guide CRD authors toward conventions without breaking the world.

API Governance in context

FM: How does API Governance relate to SIG Architecture and API Machinery?

JL: API Machinery provides the actual code and tools that people build APIs on. They don’t review APIs for storage, networking, scheduling, etc.

SIG Architecture sets the overall system direction and works with API Machinery to ensure the system supports that direction. API Governance works with other SIGs building on that foundation to define conventions and patterns, ensuring consistent use of what API Machinery provides.

FM: Thank you. That clarifies the flow. Going back to release cycles: do release phases -- enhancements freeze, code freeze -- change your workload? Or is API Governance mostly continuous?

JL: We get involved in two places: design and implementation. Design involvement increases before enhancements freeze; implementation involvement increases before code freeze. However, many efforts span multiple releases, so there is always some design and implementation happening, even for work targeting future releases. Between those intense periods, we often have time to work on long-term design work.

An anti-pattern we see is teams thinking about a large feature for months and then presenting it three weeks before enhancements freeze, saying, "Here is the design, please review." For big changes with API impact, it’s much better to involve API Governance early.

And there are good times in the cycle for this -- between freezes -- when people have bandwidth. That’s when long-term review work fits best.

Getting involved

FM: Clearly. Now, regarding team dynamics and new contributors: how can someone get involved in API Governance? What should they focus on?

JL: It’s usually best to follow a specific change rather than trying to learn everything at once. Pick a small API change, perhaps one someone else is making or one you want to make, and observe the full process: design, implementation, review.

High-bandwidth review -- live discussion over video -- is often very effective. If you’re making or following a change, ask whether there’s a time to go over the design or PR together. Observing those discussions is extremely instructive.

Start with a small change. Then move to a bigger one. Then maybe a new API. That builds understanding of conventions as they are applied in practice.

FM: Excellent. Any final comments, or anything we missed?

JL: Yes... the reason we care so much about compatibility and stability is for our users. It’s easy for contributors to see those requirements as painful obstacles preventing cleanup or requiring tedious work... but users integrated with our system, and we made a promise to them: we want them to trust that we won’t break that contract. So even when it requires more work, moves slower, or involves duplication, we choose stability.

We are not trying to be obstructive; we are trying to make life good for users.

A lot of our questions focus on the future: you want to do something now... how will you evolve it later without breaking it? We assume we will know more in the future, and we want the design to leave room for that.

We also assume we will make mistakes. The question then is: how do we leave ourselves avenues to improve while keeping compatibility promises?

FM: Exactly. Jordan, thank you, I think we’ve covered everything. This has been an insightful view into the API Governance project and its role in the wider Kubernetes project.

JL: Thank you.

Introducing Node Readiness Controller

Logo for node readiness controller

In the standard Kubernetes model, a node’s suitability for workloads hinges on a single binary "Ready" condition. However, in modern Kubernetes environments, nodes require complex infrastructure dependencies—such as network agents, storage drivers, GPU firmware, or custom health checks—to be fully operational before they can reliably host pods.

Today, on behalf of the Kubernetes project, I am announcing the Node Readiness Controller. This project introduces a declarative system for managing node taints, extending the readiness guardrails during node bootstrapping beyond standard conditions. By dynamically managing taints based on custom health signals, the controller ensures that workloads are only placed on nodes that met all infrastructure-specific requirements.

Why the Node Readiness Controller?

Core Kubernetes Node "Ready" status is often insufficient for clusters with sophisticated bootstrapping requirements. Operators frequently struggle to ensure that specific DaemonSets or local services are healthy before a node enters the scheduling pool.

The Node Readiness Controller fills this gap by allowing operators to define custom scheduling gates tailored to specific node groups. This enables you to enforce distinct readiness requirements across heterogeneous clusters, ensuring for example, that GPU equipped nodes only accept pods once specialized drivers are verified, while general purpose nodes follow a standard path.

It provides three primary advantages:

  • Custom Readiness Definitions: Define what ready means for your specific platform.
  • Automated Taint Management: The controller automatically applies or removes node taints based on condition status, preventing pods from landing on unready infrastructure.
  • Declarative Node Bootstrapping: Manage multi-step node initialization reliably, with a clear observability into the bootstrapping process.

Core concepts and features

The controller centers around the NodeReadinessRule (NRR) API, which allows you to define declarative gates for your nodes.

Flexible enforcement modes

The controller supports two distinct operational modes:

Continuous enforcement
Actively maintains the readiness guarantee throughout the node’s entire lifecycle. If a critical dependency (like a device driver) fails later, the node is immediately tainted to prevent new scheduling.
Bootstrap-only enforcement
Specifically for one-time initialization steps, such as pre-pulling heavy images or hardware provisioning. Once conditions are met, the controller marks the bootstrap as complete and stops monitoring that specific rule for the node.

Condition reporting

The controller reacts to Node Conditions rather than performing health checks itself. This decoupled design allows it to integrate seamlessly with other tools existing in the ecosystem as well as custom solutions:

  • Node Problem Detector (NPD): Use existing NPD setups and custom scripts to report node health.
  • Readiness Condition Reporter: A lightweight agent provided by the project that can be deployed to periodically check local HTTP endpoints and patch node conditions accordingly.

Operational safety with dry run

Deploying new readiness rules across a fleet carries inherent risk. To mitigate this, dry run mode allows operators to first simulate impact on the cluster. In this mode, the controller logs intended actions and updates the rule's status to show affected nodes without applying actual taints, enabling safe validation before enforcement.

Example: CNI bootstrapping

The following NodeReadinessRule ensures a node remains unschedulable until its CNI agent is functional. The controller monitors a custom cniplugin.example.net/NetworkReady condition and only removes the readiness.k8s.io/acme.com/network-unavailable taint once the status is True.

apiVersion: readiness.node.x-k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: network-readiness-rule
spec:
  conditions:
    - type: "cniplugin.example.net/NetworkReady"
      requiredStatus: "True"
  taint:
    key: "readiness.k8s.io/acme.com/network-unavailable"
    effect: "NoSchedule"
    value: "pending"
  enforcementMode: "bootstrap-only"
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""

Demo:

Getting involved

The Node Readiness Controller is just getting started, with our initial releases out, and we are seeking community feedback to refine the roadmap. Following our productive Unconference discussions at KubeCon NA 2025, we are excited to continue the conversation in person.

Join us at KubeCon + CloudNativeCon Europe 2026 for our maintainer track session: Addressing Non-Deterministic Scheduling: Introducing the Node Readiness Controller.

In the meantime, you can contribute or track our progress here:

New Conversion from cgroup v1 CPU Shares to v2 CPU Weight

I'm excited to announce the implementation of an improved conversion formula from cgroup v1 CPU shares to cgroup v2 CPU weight. This enhancement addresses critical issues with CPU priority allocation for Kubernetes workloads when running on systems with cgroup v2.

Background

Kubernetes was originally designed with cgroup v1 in mind, where CPU shares were derived from a container's CPU requests using the following formula:

cpu.shares=milliCPU×10241000cpu.shares = milliCPU \times \frac{1024}{1000}

Note that the value 1024 in this formula is the default cpu.shares value in cgroup v1, and is unrelated to millicores. For example, a container requesting 1 CPU (1000m) would get (cpu.shares = 1000 \times 1024 / 1000 = 1024), and a container requesting 100m would get (cpu.shares = 100 \times 1024 / 1000 = 102).

After a while, cgroup v1 started being replaced by its successor, cgroup v2. In cgroup v2, the concept of CPU shares (which ranges from 2 to 262144, or from 2¹ to 2¹⁸) was replaced with CPU weight (which ranges from [1, 10000], or 10⁰ to 10⁴).

With the transition to cgroup v2, KEP-2254 introduced a conversion formula to map cgroup v1 CPU shares to cgroup v2 CPU weight. The conversion formula was defined as: cpu.weight = (1 + ((cpu.shares - 2) * 9999) / 262142)

This formula linearly maps values from [2¹, 2¹⁸] to [10⁰, 10⁴].

Linear conversion formula

While this approach is simple, the linear mapping imposes a few significant problems and impacts both performance and configuration granularity.

Problems with previous conversion formula

The current conversion formula creates two major issues:

1. Reduced priority against non-Kubernetes workloads

In cgroup v1, the default value for CPU shares is 1024, meaning a container requesting 1 CPU has equal priority with system processes that live outside of Kubernetes' scope. However, in cgroup v2, the default CPU weight is 100, but the current formula converts 1 CPU (1000m) to only ≈39 weight - less than 40% of the default.

Example:

  • Container requesting 1 CPU (1000m)
  • cgroup v1: cpu.shares = 1024 (equal to default)
  • cgroup v2 (current): cpu.weight = 39 (much lower than default 100)

This means that after moving to cgroup v2, Kubernetes (or OCI) workloads would de-facto reduce their CPU priority against non-Kubernetes processes. The problem can be severe for setups with many system daemons that run outside of Kubernetes' scope and expect Kubernetes workloads to have priority, especially in situations of resource starvation.

2. Unmanageable granularity

The current formula produces very low values for small CPU requests, limiting the ability to create sub-cgroups within containers for fine-grained resource distribution (which will possibly be much easier moving forward, see KEP #5474 for more info).

Example:

  • Container requesting 100m CPU
  • cgroup v1: cpu.shares = 102
  • cgroup v2 (current): cpu.weight = 4 (too low for sub-cgroup configuration)

With cgroup v1, requesting 100m CPU which led to 102 CPU shares was manageable in the sense that sub-cgroups could have been created inside the main container, assigning fine-grained CPU priorities for different groups of processes. With cgroup v2 however, having 4 shares is very hard to distribute between sub-cgroups since it's not granular enough.

With plans to allow writable cgroups for unprivileged containers, this becomes even more relevant.

New conversion formula

Description

The new formula is more complicated, but does a much better job mapping between cgroup v1 CPU shares and cgroup v2 CPU weight:

cpu.weight=10(L2/612+125L/6127/34), where: L=log2(cpu.shares)cpu.weight = \lceil 10^{(L^{2}/612 + 125L/612 - 7/34)} \rceil, \text{ where: } L = \log_2(cpu.shares)

The idea is that this is a quadratic function to cross the following values:

  • (2, 1): The minimum values for both ranges.
  • (1024, 100): The default values for both ranges.
  • (262144, 10000): The maximum values for both ranges.

Visually, the new function looks as follows:

2025-10-25-new-cgroup-v1-to-v2-conversion-formula-new-conversion.png

And if you zoom in to the important part:

2025-10-25-new-cgroup-v1-to-v2-conversion-formula-new-conversion-zoom.png

The new formula is "close to linear", yet it is carefully designed to map the ranges in a clever way so the three important points above would cross.

How it solves the problems

  1. Better priority alignment:

    • A container requesting 1 CPU (1000m) will now get a cpu.weight = 102. This value is close to cgroup v2's default 100. This restores the intended priority relationship between Kubernetes workloads and system processes.
  2. Improved granularity:

    • A container requesting 100m CPU will get cpu.weight = 17, (see here). Enables better fine-grained resource distribution within containers.

Adoption and integration

This change was implemented at the OCI layer. In other words, this is not implemented in Kubernetes itself; therefore the adoption of the new conversion formula depends solely on the OCI runtime adoption.

For example:

  • runc: The new formula is enabled from version 1.3.2.
  • crun: The new formula is enabled from version 1.23.

Impact on existing deployments

Important: Some consumers may be affected if they assume the older linear conversion formula. Applications or monitoring tools that directly calculate expected CPU weight values based on the previous formula may need updates to account for the new quadratic conversion. This is particularly relevant for:

  • Custom resource management tools that predict CPU weight values.
  • Monitoring systems that validate or expect specific weight values.
  • Applications that programmatically set or verify CPU weight values.

Also note that reversing the conversion from cpu.weight back to milliCPU will not always yield the exact original value. There are two sources of information loss: the milliCPU to cpu.shares conversion involves integer truncation (e.g. 100m becomes 102 shares, not 102.4), and more significantly, the shares-to-weight mapping is many-to-one (e.g. milliCPU values 90 through 109 all map to cpu.weight = 17). Tools that need precise CPU request values should read them directly from the pod spec rather than deriving them from cgroup parameters.

The Kubernetes project recommends testing the new conversion formula in non-production environments before upgrading OCI runtimes to ensure compatibility with existing tooling.

Where can I learn more?

For those interested in this enhancement:

How do I get involved?

For those interested in getting involved with Kubernetes node-level features, join the Kubernetes Node Special Interest Group. We always welcome new contributors and diverse perspectives on resource management challenges.

Ingress NGINX: Statement from the Kubernetes Steering and Security Response Committees

In March 2026, Kubernetes will retire Ingress NGINX, a piece of critical infrastructure for about half of cloud native environments. The retirement of Ingress NGINX was announced for March 2026, after years of public warnings that the project was in dire need of contributors and maintainers. There will be no more releases for bug fixes, security patches, or any updates of any kind after the project is retired. This cannot be ignored, brushed off, or left until the last minute to address. We cannot overstate the severity of this situation or the importance of beginning migration to alternatives like Gateway API or one of the many third-party Ingress controllers immediately.

To be abundantly clear: choosing to remain with Ingress NGINX after its retirement leaves you and your users vulnerable to attack. None of the available alternatives are direct drop-in replacements. This will require planning and engineering time. Half of you will be affected. You have two months left to prepare.

Existing deployments will continue to work, so unless you proactively check, you may not know you are affected until you are compromised. In most cases, you can check to find out whether or not you rely on Ingress NGINX by running kubectl get pods --all-namespaces --selector app.kubernetes.io/name=ingress-nginx with cluster administrator permissions.

Despite its broad appeal and widespread use by companies of all sizes, and repeated calls for help from the maintainers, the Ingress NGINX project never received the contributors it so desperately needed. According to internal Datadog research, about 50% of cloud native environments currently rely on this tool, and yet for the last several years, it has been maintained solely by one or two people working in their free time. Without sufficient staffing to maintain the tool to a standard both ourselves and our users would consider secure, the responsible choice is to wind it down and refocus efforts on modern alternatives like Gateway API.

We did not make this decision lightly; as inconvenient as it is now, doing so is necessary for the safety of all users and the ecosystem as a whole. Unfortunately, the flexibility Ingress NGINX was designed with, that was once a boon, has become a burden that cannot be resolved. With the technical debt that has piled up, and fundamental design decisions that exacerbate security flaws, it is no longer reasonable or even possible to continue maintaining the tool even if resources did materialize.

We issue this statement together to reinforce the scale of this change and the potential for serious risk to a significant percentage of Kubernetes users if this issue is ignored. It is imperative that you check your clusters now. If you are reliant on Ingress NGINX, you must begin planning for migration.

Thank you,

Kubernetes Steering Committee

Kubernetes Security Response Committee

Experimenting with Gateway API using kind

This document will guide you through setting up a local experimental environment with Gateway API on kind. This setup is designed for learning and testing. It helps you understand Gateway API concepts without production complexity.

Caution:

This is an experimentation learning setup, and should not be used for production. The components used on this document are not suited for production usage. Once you're ready to deploy Gateway API in a production environment, select an implementation that suits your needs.

Overview

In this guide, you will:

  • Set up a local Kubernetes cluster using kind (Kubernetes in Docker)
  • Deploy cloud-provider-kind, which provides both LoadBalancer Services and a Gateway API controller
  • Create a Gateway and HTTPRoute to route traffic to a demo application
  • Test your Gateway API configuration locally

This setup is ideal for learning, development, and experimentation with Gateway API concepts.

Prerequisites

Before you begin, ensure you have the following installed on your local machine:

  • Docker - Required to run kind and cloud-provider-kind
  • kubectl - The Kubernetes command-line tool
  • kind - Kubernetes in Docker
  • curl - Required to test the routes

Create a kind cluster

Create a new kind cluster by running:

kind create cluster

This will create a single-node Kubernetes cluster running in a Docker container.

Install cloud-provider-kind

Next, you need cloud-provider-kind, which provides two key components for this setup:

  • A LoadBalancer controller that assigns addresses to LoadBalancer-type Services
  • A Gateway API controller that implements the Gateway API specification

It also automatically installs the Gateway API Custom Resource Definitions (CRDs) in your cluster.

Run cloud-provider-kind as a Docker container on the same host where you created the kind cluster:

VERSION="$(basename $(curl -s -L -o /dev/null -w '%{url_effective}' https://github.com/kubernetes-sigs/cloud-provider-kind/releases/latest))"
docker run -d --name cloud-provider-kind --rm --network host -v /var/run/docker.sock:/var/run/docker.sock registry.k8s.io/cloud-provider-kind/cloud-controller-manager:${VERSION}

Note: On some systems, you may need elevated privileges to access the Docker socket.

Verify that cloud-provider-kind is running:

docker ps --filter name=cloud-provider-kind

You should see the container listed and in a running state. You can also check the logs:

docker logs cloud-provider-kind

Experimenting with Gateway API

Now that your cluster is set up, you can start experimenting with Gateway API resources.

cloud-provider-kind automatically provisions a GatewayClass called cloud-provider-kind. You'll use this class to create your Gateway.

It is worth noticing that while kind is not a cloud provider, the project is named as cloud-provider-kind as it provides features that simulate a cloud-enabled environment.

Deploy a Gateway

The following manifest will:

  • Create a new namespace called gateway-infra
  • Deploy a Gateway that listens on port 80
  • Accept HTTPRoutes with hostnames matching the *.exampledomain.example pattern
  • Allow routes from any namespace to attach to the Gateway. Note: In real clusters, prefer Same or Selector values on the allowedRoutes namespace selector field to limit attachments.

Apply the following manifest:

---
apiVersion: v1
kind: Namespace
metadata:
  name: gateway-infra
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: gateway
  namespace: gateway-infra
spec:
  gatewayClassName: cloud-provider-kind
  listeners:
  - name: default
    hostname: "*.exampledomain.example"
    port: 80
    protocol: HTTP
    allowedRoutes:
      namespaces:
        from: All

Then verify that your Gateway is properly programmed and has an address assigned:

kubectl get gateway -n gateway-infra gateway

Expected output:

NAME      CLASS                 ADDRESS      PROGRAMMED   AGE
gateway   cloud-provider-kind   172.18.0.3   True         5m6s

The PROGRAMMED column should show True, and the ADDRESS field should contain an IP address.

Deploy a demo application

Next, deploy a simple echo application that will help you test your Gateway configuration. This application:

  • Listens on port 3000
  • Echoes back request details including path, headers, and environment variables
  • Runs in a namespace called demo

Apply the following manifest:

apiVersion: v1
kind: Namespace
metadata:
  name: demo
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: echo
  name: echo
  namespace: demo
spec:
  ports:
  - name: http
    port: 3000
    protocol: TCP
    targetPort: 3000
  selector:
    app.kubernetes.io/name: echo
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: echo
  name: echo
  namespace: demo
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: echo
  template:
    metadata:
      labels:
        app.kubernetes.io/name: echo
    spec:
      containers:
      - env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        image: registry.k8s.io/gateway-api/echo-basic:v20251204-v1.4.1
        name: echo-basic

Create an HTTPRoute

Now create an HTTPRoute to route traffic from your Gateway to the echo application. This HTTPRoute will:

  • Respond to requests for the hostname some.exampledomain.example
  • Route traffic to the echo application
  • Attach to the Gateway in the gateway-infra namespace

Apply the following manifest:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: echo
  namespace: demo
spec:
  parentRefs:
  - name: gateway
    namespace: gateway-infra
  hostnames: ["some.exampledomain.example"]
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: echo
      port: 3000

Test your route

The final step is to test your route using curl. You'll make a request to the Gateway's IP address with the hostname some.exampledomain.example. The command below is for POSIX shell only, and may need to be adjusted for your environment:

GW_ADDR=$(kubectl get gateway -n gateway-infra gateway -o jsonpath='{.status.addresses[0].value}')
curl --resolve some.exampledomain.example:80:${GW_ADDR} http://some.exampledomain.example

You should receive a JSON response similar to this:

{
 "path": "/",
 "host": "some.exampledomain.example",
 "method": "GET",
 "proto": "HTTP/1.1",
 "headers": {
  "Accept": [
   "*/*"
  ],
  "User-Agent": [
   "curl/8.15.0"
  ]
 },
 "namespace": "demo",
 "ingress": "",
 "service": "",
 "pod": "echo-dc48d7cf8-vs2df"
}

If you see this response, congratulations! Your Gateway API setup is working correctly.

Troubleshooting

If something isn't working as expected, you can troubleshoot by checking the status of your resources.

Check the Gateway status

First, inspect your Gateway resource:

kubectl get gateway -n gateway-infra gateway -o yaml

Look at the status section for conditions. Your Gateway should have:

  • Accepted: True - The Gateway was accepted by the controller
  • Programmed: True - The Gateway was successfully configured
  • .status.addresses populated with an IP address

Check the HTTPRoute status

Next, inspect your HTTPRoute:

kubectl get httproute -n demo echo -o yaml

Check the status.parents section for conditions. Common issues include:

  • ResolvedRefs set to False with reason BackendNotFound; this means that the backend Service doesn't exist or has the wrong name
  • Accepted set to False; this means that the route couldn't attach to the Gateway (check namespace permissions or hostname matching)

Example error when a backend is not found:

status:
  parents:
  - conditions:
    - lastTransitionTime: "2026-01-19T17:13:35Z"
      message: backend not found
      observedGeneration: 2
      reason: BackendNotFound
      status: "False"
      type: ResolvedRefs
    controllerName: kind.sigs.k8s.io/gateway-controller

Check controller logs

If the resource statuses don't reveal the issue, check the cloud-provider-kind logs:

docker logs -f cloud-provider-kind

This will show detailed logs from both the LoadBalancer and Gateway API controllers.

Cleanup

When you're finished with your experiments, you can clean up the resources:

Remove Kubernetes resources

Delete the namespaces (this will remove all resources within them):

kubectl delete namespace gateway-infra
kubectl delete namespace demo

Stop cloud-provider-kind

Stop and remove the cloud-provider-kind container:

docker stop cloud-provider-kind

Because the container was started with the --rm flag, it will be automatically removed when stopped.

Delete the kind cluster

Finally, delete the kind cluster:

kind delete cluster

Next steps

Now that you've experimented with Gateway API locally, you're ready to explore production-ready implementations:

  • Production Deployments: Review the Gateway API implementations to find a controller that matches your production requirements
  • Learn More: Explore the Gateway API documentation to learn about advanced features like TLS, traffic splitting, and header manipulation
  • Advanced Routing: Experiment with path-based routing, header matching, request mirroring and other features following Gateway API user guides

A final word of caution

This kind setup is for development and learning only. Always use a production-grade Gateway API implementation for real workloads.

Cluster API v1.12: Introducing In-place Updates and Chained Upgrades

Cluster API brings declarative management to Kubernetes cluster lifecycle, allowing users and platform teams to define the desired state of clusters and rely on controllers to continuously reconcile toward it.

Similar to how you can use StatefulSets or Deployments in Kubernetes to manage a group of Pods, in Cluster API you can use KubeadmControlPlane to manage a set of control plane Machines, or you can use MachineDeployments to manage a group of worker Nodes.

The Cluster API v1.12.0 release expands what is possible in Cluster API, reducing friction in common lifecycle operations by introducing in-place updates and chained upgrades.

Emphasis on simplicity and usability

With v1.12.0, the Cluster API project demonstrates once again that this community is capable of delivering a great amount of innovation, while at the same time minimizing impact for Cluster API users.

What does this mean in practice?

Users simply have to change the Cluster or the Machine spec (just as with previous Cluster API releases), and Cluster API will automatically trigger in-place updates or chained upgrades when possible and advisable.

In-place Updates

Like Kubernetes does for Pods in Deployments, when the Machine spec changes also Cluster API performs rollouts by creating a new Machine and deleting the old one.

This approach, inspired by the principle of immutable infrastructure, has a set of considerable advantages:

  • It is simple to explain, predictable, consistent and easy to reason about with users and engineers.
  • It is simple to implement, because it relies only on two core primitives, create and delete.
  • Implementation does not depend on Machine-specific choices, like OS, bootstrap mechanism etc.

As a result, Machine rollouts drastically reduce the number of variables to be considered when managing the lifecycle of a host server that is hosting Nodes.

However, while advantages of immutability are not under discussion, both Kubernetes and Cluster API are undergoing a similar journey, introducing changes that allow users to minimize workload disruption whenever possible.

Over time, also Cluster API has introduced several improvements to immutable rollouts, including:

The new in-place update feature in Cluster API is the next step in this journey.

With the v1.12.0 release, Cluster API introduces support for update extensions allowing users to make changes on existing machines in-place, without deleting and re-creating the Machines.

Both KubeadmControlPlane and MachineDeployments support in-place updates based on the new update extension, and this means that the boundary of what is possible in Cluster API is now changed in a significant way.

How do in-place updates work?

The simplest way to explain it is that once the user triggers an update by changing the desired state of Machines, then Cluster API chooses the best tool to achieve the desired state.

The news is that now Cluster API can choose between immutable rollouts and in-place update extensions to perform required changes.

In-place updates in Cluster API

Importantly, this is not immutable rollouts vs in-place updates; Cluster API considers both valid options and selects the most appropriate mechanism for a given change.

From the perspective of the Cluster API maintainers, in-place updates are most useful for making changes that don't otherwise require a node drain or pod restart; for example: changing user credentials for the Machine. On the other hand, when the workload will be disrupted anyway, just do a rollout.

Nevertheless, Cluster API remains true to its extensible nature, and everyone can create their own update extension and decide when and how to use in-place updates by trading in some of the benefits of immutable rollouts.

For a deep dive into this feature, make sure to attend the session In-place Updates with Cluster API: The Sweet Spot Between Immutable and Mutable Infrastructure at KubeCon EU in Amsterdam!

Chained Upgrades

ClusterClass and managed topologies in Cluster API jointly provided a powerful and effective framework that acts as a building block for many platforms offering Kubernetes-as-a-Service.

Now with v1.12.0 this feature is making another important step forward, by allowing users to upgrade by more than one Kubernetes minor version in a single operation, commonly referred to as a chained upgrade.

This allows users to declare a target Kubernetes version and let Cluster API safely orchestrate the required intermediate steps, rather than manually managing each minor upgrade.

The simplest way to explain how chained upgrades work, is that once the user triggers an update by changing the desired version for a Cluster, Cluster API computes an upgrade plan, and then starts executing it. Rather than (for example) update the Cluster to v1.33.0 and then v1.34.0 and then v1.35.0, checking on progress at each step, a chained upgrade lets you go directly to v1.35.0.

Executing an upgrade plan means upgrading control plane and worker machines in a strictly controlled order, repeating this process as many times as needed to reach the desired state. The Cluster API is now capable of managing this for you.

Cluster API takes care of optimizing and minimizing the upgrade steps for worker machines, and in fact worker machines will skip upgrades to intermediate Kubernetes minor releases whenever allowed by the Kubernetes version skew policies.

Chained upgrades in Cluster API

Also in this case extensibility is at the core of this feature, and upgrade plan runtime extensions can be used to influence how the upgrade plan is computed; similarly, lifecycle hooks can be used to automate other tasks that must be performed during an upgrade, e.g. upgrading an addon after the control plane update completed.

From our perspective, chained upgrades are most useful for users that struggle to keep up with Kubernetes minor releases, and e.g. they want to upgrade only once per year and then upgrade by three versions (n-3 → n). But be warned: the fact that you can now easily upgrade by more than one minor version is not an excuse to not patch your cluster frequently!

Release team

I would like to thank all the contributors, the maintainers, and all the engineers that volunteered for the release team.

The reliability and predictability of Cluster API releases, which is one of the most appreciated features from our users, is only possible with the support, commitment, and hard work of its community.

Kudos to the entire Cluster API community for the v1.12.0 release and all the great releases delivered in 2025! ​​ If you are interested in getting involved, learn about Cluster API contributing guidelines.

What’s next?

If you read the Cluster API manifesto, you can see how the Cluster API subproject claims the right to remain unfinished, recognizing the need to continuously evolve, improve, and adapt to the changing needs of Cluster API’s users and the broader Cloud Native ecosystem.

As Kubernetes itself continues to evolve, the Cluster API subproject will keep advancing alongside it, focusing on safer upgrades, reduced disruption, and stronger building blocks for platforms managing Kubernetes at scale.

Innovation remains at the heart of Cluster API, stay tuned for an exciting 2026!


Useful links:

Headlamp in 2025: Project Highlights

This announcement is a recap from a post originally published on the Headlamp blog.

Headlamp has come a long way in 2025. The project has continued to grow – reaching more teams across platforms, powering new workflows and integrations through plugins, and seeing increased collaboration from the broader community.

We wanted to take a moment to share a few updates and highlight how Headlamp has evolved over the past year.

Updates

Joining Kubernetes SIG UI

This year marked a big milestone for the project: Headlamp is now officially part of Kubernetes SIG UI. This move brings roadmap and design discussions even closer to the core Kubernetes community and reinforces Headlamp’s role as a modern, extensible UI for the project.

As part of that, we’ve also been sharing more about making Kubernetes approachable for a wider audience, including an appearance on Enlightening with Whitney Lee and a talk at KCD New York 2025.

Linux Foundation mentorship

This year, we were excited to work with several students through the Linux Foundation’s Mentorship program, and our mentees have already left a visible mark on Headlamp:

  • Adwait Godbole built the KEDA plugin, adding a UI in Headlamp to view and manage KEDA resources like ScaledObjects and ScaledJobs.
  • Dhairya Majmudar set up an OpenTelemetry-based observability stack for Headlamp, wiring up metrics, logs, and traces so the project is easier to monitor and debug.
  • Aishwarya Ghatole led a UX audit of Headlamp plugins, identifying usability issues and proposing design improvements and personas for plugin users.
  • Anirban Singha developed the Karpenter plugin, giving Headlamp a focused view into Karpenter autoscaling resources and decisions.
  • Aditya Chaudhary improved Gateway API support, so you can see networking relationships on the resource map, as well as improved support for many of the new Gateway API resources.
  • Faakhir Zahid completed a way to easily manage plugin installation with Headlamp deployed in clusters.
  • Saurav Upadhyay worked on backend caching for Kubernetes API calls, reducing load on the API server and improving performance in Headlamp.

New changes

Multi-cluster view

Managing multiple clusters is challenging: teams often switch between tools and lose context when trying to see what runs where. Headlamp solves this by giving you a single view to compare clusters side-by-side. This makes it easier to understand workloads across environments and reduces the time spent hunting for resources.

Multi-cluster view

View of multi-cluster workloads

Projects

Kubernetes apps often span multiple namespaces and resource types, which makes troubleshooting feel like piecing together a puzzle. We’ve added Projects to give you an application-centric view that groups related resources across multiple namespaces – and even clusters. This allows you to reduce sprawl, troubleshoot faster, and collaborate without digging through YAML or cluster-wide lists.

Projects feature

View of the new Projects feature

Changes:

  • New “Projects” feature for grouping namespaces into app- or team-centric projects
  • Extensible Projects details view that plugins can customize with their own tabs and actions

Day-to-day ops in Kubernetes often means juggling logs, terminals, YAML, and dashboards across clusters. We redesigned Headlamp’s navigation to treat these as first-class “activities” you can keep open and come back to, instead of one-off views you lose as soon as you click away.

New task bar

View of the new task bar

Changes:

  • A new task bar/activities model lets you pin logs, exec sessions, and details as ongoing activities
  • An activity overview with a “Close all” action and cluster information
  • Multi-select and global filters in tables

Thanks to Jan Jansen and Aditya Chaudhary.

Search and map

When something breaks in production, the first two questions are usually “where is it?” and “what is it connected to?” We’ve upgraded both search and the map view so you can get from a high-level symptom to the right set of objects much faster.

Advanced search

View of the new Advanced Search feature

Changes:

  • An Advanced search view that supports rich, expression-based queries over Kubernetes objects
  • Improved global search that understands labels and multiple search items, and can even update your current namespace based on what you find
  • EndpointSlice support in the Network section
  • A richer map view that now includes Custom Resources and Gateway API objects

Thanks to Fabian, Alexander North, and Victor Marcolino from Swisscom, and also to Aditya Chaudhary.

OIDC and authentication

We’ve put real work into making OIDC setup clearer and more resilient, especially for in-cluster deployments.

User info

View of user information for OIDC clusters

Changes:

  • User information displayed in the top bar for OIDC-authenticated users
  • PKCE support for more secure authentication flows, as well as hardened token refresh handling
  • Documentation for using the access token using -oidc-use-access-token=true
  • Improved support for public OIDC clients like AKS and EKS
  • New guide for setting up Headlamp on AKS with Azure Entra-ID using OAuth2Proxy

Thanks to David Dobmeier and Harsh Srivastava.

App Catalog and Helm

We’ve broadened how you deploy and source apps via Headlamp, specifically supporting vanilla Helm repos.

Changes:

  • A more capable Helm chart with optional backend TLS termination, PodDisruptionBudgets, custom pod labels, and more
  • Improved formatting and added missing access token arg in the Helm chart
  • New in-cluster Helm support with an --enable-helm flag and a service proxy

Thanks to Vrushali Shah and Murali Annamneni from Oracle, and also to Pat Riehecky, Joshua Akers, Rostislav Stříbrný, Rick L, and Victor.

Performance, accessibility, and UX

Finally, we’ve spent a lot of time on the things you notice every day but don’t always make headlines: startup time, list views, log viewers, accessibility, and small network UX details. A continuous accessibility self-audit has also helped us identify key issues and make Headlamp easier for everyone to use.

Learn section

View of the Learn section in docs

Changes:

  • Significant desktop improvements, with up to 60% faster app loads and much quicker dev-mode reloads for contributors
  • Numerous table and log viewer refinements: persistent sort order, consistent row actions, copy-name buttons, better tooltips, and more forgiving log inputs
  • Accessibility and localization improvements, including fixes for zoom-related layout issues, better color contrast, improved screen reader support, and expanded language coverage
  • More control over resources, with live pod CPU/memory metrics, richer pod details, and inline editing for secrets and CRD fields
  • A refreshed documentation and plugin onboarding experience, including a “Learn” section and plugin showcase
  • A more complete NetworkPolicy UI and network-related polish
  • Nightly builds available for early testing

Thanks to Jaehan Byun and Jan Jansen.

Plugins and extensibility

Discovering plugins is simpler now – no more hopping between Artifact Hub and assorted GitHub repos. Browse our dedicated Plugins page for a curated catalog of Headlamp-endorsed plugins, along with a showcase of featured plugins.

Plugins page

View of the Plugins showcase

Headlamp AI Assistant

Managing Kubernetes often means memorizing commands and juggling tools. Headlamp’s new AI Assistant changes this by adding a natural-language interface built into the UI. Now, instead of typing kubectl or digging through YAML you can ask, “Is my app healthy?” or “Show logs for this deployment,” and get answers in context, speeding up troubleshooting and smoothing onboarding for new users. Learn more about it here.

New plugins additions

Alongside the new AI Assistant, we’ve been growing Headlamp’s plugin ecosystem so you can bring more of your workflows into a single UI, with integrations like Minikube, Karpenter, and more.

Highlights from the latest plugin releases:

  • Minikube plugin, providing a locally stored single node Minikube cluster
  • Karpenter plugin, with support for Azure Node Auto-Provisioning (NAP)
  • KEDA plugin, which you can learn more about here
  • Community-maintained plugins for Gatekeeper and KAITO

Thanks to Vrushali Shah and Murali Annamneni from Oracle, and also to Anirban Singha, Adwait Godbole, Sertaç Özercan, Ernest Wong, and Chloe Lim.

Other plugins updates

Alongside new additions, we’ve also spent time refining plugins that many of you already use, focusing on smoother workflows and better integration with the core UI.

Backstage plugin

View of the Backstage plugin

Changes:

  • Flux plugin: Updated for Flux v2.7, with support for newer CRDs, navigation fixes so it works smoothly on recent clusters
  • App Catalog: Now supports Helm repos in addition to Artifact Hub, can run in-cluster via /serviceproxy, and shows both current and latest app versions
  • Plugin Catalog: Improved card layout and accessibility, plus dependency and Storybook test updates
  • Backstage plugin: Dependency and build updates, more info here

Plugin development

We’ve focused on making it faster and clearer to build, test, and ship Headlamp plugins, backed by improved documentation and lighter tooling.

Plugin development

View of the Plugin Development guide

Changes:

  • New and expanded guides for plugin architecture and development, including how to publish and ship plugins
  • Added i18n support documentation so plugins can be translated and localized
  • Added example plugins: ui-panels, resource-charts, custom-theme, and projects
  • Improved type checking for Headlamp APIs, restored Storybook support for component testing, and reduced dependencies for faster installs and fewer updates
  • Documented plugin install locations, UI signifiers in Plugin Settings, and labels that differentiated shipped, UI-installed, and dev-mode plugins

Security upgrades

We've also been investing in keeping Headlamp secure – both by tightening how authentication works and by staying on top of upstream vulnerabilities and tooling.

Updates:

  • We've been keeping up with security updates, regularly updating dependencies and addressing upstream security issues.
  • We tightened the Helm chart's default security context and fixed a regression that broke the plugin manager.
  • We've improved OIDC security with PKCE support, helping unblock more secure and standards-compliant OIDC setups when deploying Headlamp in-cluster.

Conclusion

Thank you to everyone who has contributed to Headlamp this year – whether through pull requests, plugins, or simply sharing how you're using the project. Seeing the different ways teams are adopting and extending the project is a big part of what keeps us moving forward. If your organization uses Headlamp, consider adding it to our adopters list.

If you haven't tried Headlamp recently, all these updates are available today. Check out the latest Headlamp release, explore the new views, plugins, and docs, and share your feedback with us on Slack or GitHub – your feedback helps shape where Headlamp goes next.

Announcing the Checkpoint/Restore Working Group

The community around Kubernetes includes a number of Special Interest Groups (SIGs) and Working Groups (WGs) facilitating discussions on important topics between interested contributors. Today we would like to announce the new Kubernetes Checkpoint Restore WG focusing on the integration of Checkpoint/Restore functionality into Kubernetes.

Motivation and use cases

There are several high-level scenarios discussed in the working group:

  • Optimizing resource utilization for interactive workloads, such as Jupyter notebooks and AI chatbots
  • Accelerating startup of applications with long initialization times, including Java applications and LLM inference services
  • Using periodic checkpointing to enable fault-tolerance for long-running workloads, such as distributed model training
  • Providing interruption-aware scheduling with transparent checkpoint/restore, allowing lower-priority Pods to be preempted while preserving the runtime state of applications
  • Facilitating Pod migration across nodes for load balancing and maintenance, without disrupting workloads.
  • Enabling forensic checkpointing to investigate and analyze security incidents such as cyberattacks, data breaches, and unauthorized access.

Across these scenarios, the goal is to help facilitate discussions of ideas between the Kubernetes community and the growing Checkpoint/Restore in Userspace (CRIU) ecosystem. The CRIU community includes several projects that support these use cases, including:

  • CRIU - A tool for checkpointing and restoring running applications and containers
  • checkpointctl - A tool for in-depth analysis of container checkpoints
  • criu-coordinator - A tool for coordinated checkpoint/restore of distributed applications with CRIU
  • checkpoint-restore-operator - A Kubernetes operator for managing checkpoints

More information about the checkpoint/restore integration with Kubernetes is also available here.

Following our presentation about transparent checkpointing at KubeCon EU 2025, we are excited to welcome you to our panel discussion and AI + ML session at KubeCon + CloudNativeCon Europe 2026.

Connect with us

If you are interested in contributing to Kubernetes or CRIU, there are several ways to participate:

Uniform API server access using clientcmd

If you've ever wanted to develop a command line client for a Kubernetes API, especially if you've considered making your client usable as a kubectl plugin, you might have wondered how to make your client feel familiar to users of kubectl. A quick glance at the output of kubectl options might put a damper on that: "Am I really supposed to implement all those options?"

Fear not, others have done a lot of the work involved for you. In fact, the Kubernetes project provides two libraries to help you handle kubectl-style command line arguments in Go programs: clientcmd and cli-runtime (which uses clientcmd). This article will show how to use the former.

General philosophy

As might be expected since it's part of client-go, clientcmd's ultimate purpose is to provide an instance of restclient.Config that can issue requests to an API server.

It follows kubectl semantics:

  • defaults are taken from ~/.kube or equivalent;
  • files can be specified using the KUBECONFIG environment variable;
  • all of the above settings can be further overridden using command line arguments.

It doesn't set up a --kubeconfig command line argument, which you might want to do to align with kubectl; you'll see how to do this in the "Bind the flags" section.

Available features

clientcmd allows programs to handle

  • kubeconfig selection (using KUBECONFIG);
  • context selection;
  • namespace selection;
  • client certificates and private keys;
  • user impersonation;
  • HTTP Basic authentication support (username/password).

Configuration merging

In various scenarios, clientcmd supports merging configuration settings: KUBECONFIG can specify multiple files whose contents are combined. This can be confusing, because settings are merged in different directions depending on how they are implemented. If a setting is defined in a map, the first definition wins, subsequent definitions are ignored. If a setting is not defined in a map, the last definition wins.

When settings are retrieved using KUBECONFIG, missing files result in warnings only. If the user explicitly specifies a path (in --kubeconfig style), there must be a corresponding file.

If KUBECONFIG isn't defined, the default configuration file, ~/.kube/config, is used instead, if present.

Overall process

The general usage pattern is succinctly expressed in the clientcmd package documentation:

loadingRules := clientcmd.NewDefaultClientConfigLoadingRules()
// if you want to change the loading rules (which files in which order), you can do so here

configOverrides := &clientcmd.ConfigOverrides{}
// if you want to change override values or bind them to flags, there are methods to help you

kubeConfig := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(loadingRules, configOverrides)
config, err := kubeConfig.ClientConfig()
if err != nil {
	// Do something
}
client, err := metav1.New(config)
// ...

In the context of this article, there are six steps:

  1. Configure the loading rules.
  2. Configure the overrides.
  3. Build a set of flags.
  4. Bind the flags.
  5. Build the merged configuration.
  6. Obtain an API client.

Configure the loading rules

clientcmd.NewDefaultClientConfigLoadingRules() builds loading rules which will use either the contents of the KUBECONFIG environment variable, or the default configuration file name (~/.kube/config). In addition, if the default configuration file is used, it is able to migrate settings from the (very) old default configuration file (~/.kube/.kubeconfig).

You can build your own ClientConfigLoadingRules, but in most cases the defaults are fine.

Configure the overrides

clientcmd.ConfigOverrides is a struct storing overrides which will be applied over the settings loaded from the configuration derived using the loading rules. In the context of this article, its primary purpose is to store values obtained from command line arguments. These are handled using the pflag library, which is a drop-in replacement for Go's flag package, adding support for double-hyphen arguments with long names.

In most cases there's nothing to set in the overrides; I will only bind them to flags.

Build a set of flags

In this context, a flag is a representation of a command line argument, specifying its long name (such as --namespace), its short name if any (such as -n), its default value, and a description shown in the usage information. Flags are stored in instances of the FlagInfo struct.

Three sets of flags are available, representing the following command line arguments:

  • authentication arguments (certificates, tokens, impersonations, username/password);
  • cluster arguments (API server, certificate authority, TLS configuration, proxy, compression)
  • context arguments (cluster name, kubeconfig user name, namespace)

The recommended selection includes all three with a named context selection argument and a timeout argument.

These are all available using the Recommended…Flags functions. The functions take a prefix, which is prepended to all the argument long names.

So calling clientcmd.RecommendedConfigOverrideFlags("") results in command line arguments such as --context, --namespace, and so on. The --timeout argument is given a default value of 0, and the --namespace argument has a corresponding short variant, -n. Adding a prefix, such as "from-", results in command line arguments such as --from-context, --from-namespace, etc. This might not seem particularly useful on commands involving a single API server, but they come in handy when multiple API servers are involved, such as in multi-cluster scenarios.

There's a potential gotcha here: prefixes don't modify the short name, so --namespace needs some care if multiple prefixes are used: only one of the prefixes can be associated with the -n short name. You'll have to clear the short names associated with the other prefixes' --namespace , or perhaps all prefixes if there's no sensible -n association. Short names can be cleared as follows:

kflags := clientcmd.RecommendedConfigOverrideFlags(prefix)
kflags.ContextOverrideFlags.Namespace.ShortName = ""

In a similar fashion, flags can be disabled entirely by clearing their long name:

kflags.ContextOverrideFlags.Namespace.LongName = ""

Bind the flags

Once a set of flags has been defined, it can be used to bind command line arguments to overrides using clientcmd.BindOverrideFlags. This requires a pflag FlagSet rather than one from Go's flag package.

If you also want to bind --kubeconfig, you should do so now, by binding ExplicitPath in the loading rules:

flags.StringVarP(&loadingRules.ExplicitPath, "kubeconfig", "", "", "absolute path(s) to the kubeconfig file(s)")

Build the merged configuration

Two functions are available to build a merged configuration:

As the names suggest, the difference between the two is that the first can ask for authentication information interactively, using a provided reader, whereas the second only operates on the information given to it by the caller.

The "deferred" mention in these function names refers to the fact that the final configuration will be determined as late as possible. This means that these functions can be called before the command line arguments are parsed, and the resulting configuration will use whatever values have been parsed by the time it's actually constructed.

Obtain an API client

The merged configuration is returned as a ClientConfig instance. An API client can be obtained from that by calling the ClientConfig() method.

If no configuration is given (KUBECONFIG is empty or points to non-existent files, ~/.kube/config doesn't exist, and no configuration is given using command line arguments), the default setup will return an obscure error referring to KUBERNETES_MASTER. This is legacy behaviour; several attempts have been made to get rid of it, but it is preserved for the --local and --dry-run command line arguments in --kubectl. You should check for "empty configuration" errors by calling clientcmd.IsEmptyConfig() and provide a more explicit error message.

The Namespace() method is also useful: it returns the namespace that should be used. It also indicates whether the namespace was overridden by the user (using --namespace).

Full example

Here's a complete example.

package main

import (
	"context"
	"fmt"
	"os"

	"github.com/spf13/pflag"
	v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Loading rules, no configuration
	loadingRules := clientcmd.NewDefaultClientConfigLoadingRules()

	// Overrides and flag (command line argument) setup
	configOverrides := &clientcmd.ConfigOverrides{}
	flags := pflag.NewFlagSet("clientcmddemo", pflag.ExitOnError)
	clientcmd.BindOverrideFlags(configOverrides, flags,
		clientcmd.RecommendedConfigOverrideFlags(""))
	flags.StringVarP(&loadingRules.ExplicitPath, "kubeconfig", "", "", "absolute path(s) to the kubeconfig file(s)")
	flags.Parse(os.Args)

	// Client construction
	kubeConfig := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(loadingRules, configOverrides)
	config, err := kubeConfig.ClientConfig()
	if err != nil {
		if clientcmd.IsEmptyConfig(err) {
			panic("Please provide a configuration pointing to the Kubernetes API server")
		}
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// How to find out what namespace to use
	namespace, overridden, err := kubeConfig.Namespace()
	if err != nil {
		panic(err)
	}
	fmt.Printf("Chosen namespace: %s; overridden: %t\n", namespace, overridden)

	// Let's use the client
	nodeList, err := client.CoreV1().Nodes().List(context.TODO(), v1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodeList.Items {
		fmt.Println(node.Name)
	}
}

Happy coding, and thank you for your interest in implementing tools with familiar usage patterns!

Kubernetes v1.35: Restricting executables invoked by kubeconfigs via exec plugin allowList added to kuberc

Did you know that kubectl can run arbitrary executables, including shell scripts, with the full privileges of the invoking user, and without your knowledge? Whenever you download or auto-generate a kubeconfig, the users[n].exec.command field can specify an executable to fetch credentials on your behalf. Don't get me wrong, this is an incredible feature that allows you to authenticate to the cluster with external identity providers. Nevertheless, you probably see the problem: Do you know exactly what executables your kubeconfig is running on your system? Do you trust the pipeline that generated your kubeconfig? If there has been a supply-chain attack on the code that generates the kubeconfig, or if the generating pipeline has been compromised, an attacker might well be doing unsavory things to your machine by tricking your kubeconfig into running arbitrary code.

To give the user more control over what gets run on their system, SIG-Auth and SIG-CLI added the credential plugin policy and allowlist as a beta feature to Kubernetes 1.35. This is available to all clients using the client-go library, by filling out the ExecProvider.PluginPolicy struct on a REST config. To broaden the impact of this change, Kubernetes v1.35 also lets you manage this without writing a line of application code. You can configure kubectl to enforce the policy and allowlist by adding two fields to the kuberc configuration file: credentialPluginPolicy and credentialPluginAllowlist. Adding one or both of these fields restricts which credential plugins kubectl is allowed to execute.

How it works

A full description of this functionality is available in our official documentation for kuberc, but this blog post will give a brief overview of the new security knobs. The new features are in beta and available without using any feature gates.

The following example is the simplest one: simply don't specify the new fields.

apiVersion: kubectl.config.k8s.io/v1beta1
kind: Preference

This will keep kubectl acting as it always has, and all plugins will be allowed.

The next example is functionally identical, but it is more explicit and therefore preferred if it's actually what you want:

apiVersion: kubectl.config.k8s.io/v1beta1
kind: Preference
credentialPluginPolicy: AllowAll

If you don't know whether or not you're using exec credential plugins, try setting your policy to DenyAll:

apiVersion: kubectl.config.k8s.io/v1beta1
kind: Preference
credentialPluginPolicy: DenyAll

If you are using credential plugins, you'll quickly find out what kubectl is trying to execute. You'll get an error like the following.

Unable to connect to the server: getting credentials: plugin "cloudco-login" not allowed: policy set to "DenyAll"

If there is insufficient information for you to debug the issue, increase the logging verbosity when you run your next command. For example:

# increase or decrease verbosity if the issue is still unclear
kubectl get pods --verbosity 5

Selectively allowing plugins

What if you need the cloudco-login plugin to do your daily work? That is why there's a third option for your policy, Allowlist. To allow a specific plugin, set the policy and add the credentialPluginAllowlist:

apiVersion: kubectl.config.k8s.io/v1beta1
kind: Preference
credentialPluginPolicy: Allowlist
credentialPluginAllowlist:
  - name: /usr/local/bin/cloudco-login
  - name: get-identity

You'll notice that there are two entries in the allowlist. One of them is specified by full path, and the other, get-identity is just a basename. When you specify just the basename, the full path will be looked up using exec.LookPath, which does not expand globbing or handle wildcards. Globbing is not supported at this time. Both forms (basename and full path) are acceptable, but the full path is preferable because it narrows the scope of allowed binaries even further.

Future enhancements

Currently, an allowlist entry has only one field, name. In the future, we (Kubernetes SIG CLI) want to see other requirements added. One idea that seems useful is checksum verification whereby, for example, a binary would only be allowed to run if it has the sha256 sum b9a3fad00d848ff31960c44ebb5f8b92032dc085020f857c98e32a5d5900ff9c and exists at the path /usr/bin/cloudco-login.

Another possibility is only allowing binaries that have been signed by one of a set of a trusted signing keys.

Get involved

The credential plugin policy is still under development and we are very interested in your feedback. We'd love to hear what you like about it and what problems you'd like to see it solve. Or, if you have the cycles to contribute one of the above enhancements, they'd be a great way to get started contributing to Kubernetes. Feel free to join in the discussion on slack:

Kubernetes v1.35: Mutable PersistentVolume Node Affinity (alpha)

The PersistentVolume node affinity API dates back to Kubernetes v1.10. It is widely used to express that volumes may not be equally accessible by all nodes in the cluster. This field was previously immutable, and it is now mutable in Kubernetes v1.35 (alpha). This change opens a door to more flexible online volume management.

Why make node affinity mutable?

This raises an obvious question: why make node affinity mutable now? While stateless workloads like Deployments can be changed freely and the changes will be rolled out automatically by re-creating every Pod, PersistentVolumes (PVs) are stateful and cannot be re-created easily without losing data.

However, Storage providers evolve and storage requirements change. Most notably, multiple providers are offering regional disks now. Some of them even support live migration from zonal to regional disks, without disrupting the workloads. This change can be expressed through the VolumeAttributesClass API, which recently graduated to GA in 1.34. However, even if the volume is migrated to regional storage, Kubernetes still prevents scheduling Pods to other zones because of the node affinity recorded in the PV object. In this case, you may want to change the PV node affinity from:

spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - us-east1-b

to:

spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/region
          operator: In
          values:
          - us-east1

As another example, providers sometimes offer new generations of disks. New disks cannot always be attached to older nodes in the cluster. This accessibility can also be expressed through PV node affinity and ensures the Pods can be scheduled to the right nodes. But when the disk is upgraded, new Pods using this disk can still be scheduled to older nodes. To prevent this, you may want to change the PV node affinity from:

spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: provider.com/disktype.gen1
          operator: In
          values:
          - available

to:

spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: provider.com/disktype.gen2
          operator: In
          values:
          - available

So, it is mutable now, a first step towards a more flexible online volume management. While it is a simple change that removes one validation from the API server, we still have a long way to go to integrate well with the Kubernetes ecosystem.

Try it out

This feature is for you if you are a Kubernetes cluster administrator, and your storage provider allows online update that you want to utilize, but those updates can affect the accessibility of the volume.

Note that changing PV node affinity alone will not actually change the accessibility of the underlying volume. Before using this feature, you must first update the underlying volume in the storage provider, and understand which nodes can access the volume after the update. You can then enable this feature and keep the PV node affinity in sync.

Currently, this feature is in alpha state. It is disabled by default, and may subject to change. To try it out, enable the MutablePVNodeAffinity feature gate on APIServer, then you can edit the PV spec.nodeAffinity field. Typically only administrators can edit PVs, please make sure you have the right RBAC permissions.

Race condition between updating and scheduling

There are only a few factors outside of a Pod that can affect the scheduling decision, and PV node affinity is one of them. It is fine to allow more nodes to access the volume by relaxing node affinity, but there is a race condition when you try to tighten node affinity: it is unclear how the Scheduler will see the modified PV in its cache, so there is a small window where the scheduler may place a Pod on an old node that can no longer access the volume. In this case, the Pod will stuck at ContainerCreating state.

One mitigation currently under discussion is for the kubelet to fail Pod startup if the PersistentVolume’s node affinity is violated. This has not landed yet. So if you are trying this out now, please watch subsequent Pods that use the updated PV, and make sure they are scheduled onto nodes that can access the volume. If you update PV and immediately start new Pods in a script, it may not work as intended.

Future integration with CSI (Container Storage Interface)

Currently, it is up to the cluster administrator to modify both PV's node affinity and the underlying volume in the storage provider. But manual operations are error-prone and time-consuming. It is preferred to eventually integrate this with VolumeAttributesClass, so that an unprivileged user can modify their PersistentVolumeClaim (PVC) to trigger storage-side updates, and PV node affinity is updated automatically when appropriate, without the need for cluster admin's intervention.

We welcome your feedback from users and storage driver developers

As noted earlier, this is only a first step.

If you are a Kubernetes user, we would like to learn how you use (or will use) PV node affinity. Is it beneficial to update it online in your case?

If you are a CSI driver developer, would you be willing to implement this feature? How would you like the API to look?

Please provide your feedback via:

For any inquiries or specific questions related to this feature, please reach out to the SIG Storage community.

Kubernetes v1.35: A Better Way to Pass Service Account Tokens to CSI Drivers

If you maintain a CSI driver that uses service account tokens, Kubernetes v1.35 brings a refinement you'll want to know about. Since the introduction of the TokenRequests feature, service account tokens requested by CSI drivers have been passed to them through the volume_context field. While this has worked, it's not the ideal place for sensitive information, and we've seen instances where tokens were accidentally logged in CSI drivers.

Kubernetes v1.35 introduces a beta solution to address this: CSI Driver Opt-in for Service Account Tokens via Secrets Field. This allows CSI drivers to receive service account tokens through the secrets field in NodePublishVolumeRequest, which is the appropriate place for sensitive data in the CSI specification.

Understanding the existing approach

When CSI drivers use the TokenRequests feature, they can request service account tokens for workload identity by configuring the TokenRequests field in the CSIDriver spec. These tokens are passed to drivers as part of the volume attributes map, using the key csi.storage.k8s.io/serviceAccount.tokens.

The volume_context field works, but it's not designed for sensitive data. Because of this, there are a few challenges:

First, the protosanitizer tool that CSI drivers use doesn't treat volume context as sensitive, so service account tokens can end up in logs when gRPC requests are logged. This happened with CVE-2023-2878 in the Secrets Store CSI Driver and CVE-2024-3744 in the Azure File CSI Driver.

Second, each CSI driver that wants to avoid this issue needs to implement its own sanitization logic, which leads to inconsistency across drivers.

The CSI specification already has a secrets field in NodePublishVolumeRequest that's designed exactly for this kind of sensitive information. The challenge is that we can't just change where we put the tokens without breaking existing CSI drivers that expect them in volume context.

How the opt-in mechanism works

Kubernetes v1.35 introduces an opt-in mechanism that lets CSI drivers choose how they receive service account tokens. This way, existing drivers continue working as they do today, and drivers can move to the more appropriate secrets field when they're ready.

CSI drivers can set a new field in their CSIDriver spec:

#
# CAUTION: this is an example configuration.
#          Do not use this for your own cluster!
#
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: example-csi-driver
spec:
  # ... existing fields ...
  tokenRequests:
  - audience: "example.com"
    expirationSeconds: 3600
  # New field for opting into secrets delivery
  serviceAccountTokenInSecrets: true  # defaults to false

The behavior depends on the serviceAccountTokenInSecrets field:

When set to false (the default), tokens are placed in VolumeContext with the key csi.storage.k8s.io/serviceAccount.tokens, just like today. When set to true, tokens are placed only in the Secrets field with the same key.

About the beta release

The CSIServiceAccountTokenSecrets feature gate is enabled by default on both kubelet and kube-apiserver. Since the serviceAccountTokenInSecrets field defaults to false, enabling the feature gate doesn't change any existing behavior. All drivers continue receiving tokens via volume context unless they explicitly opt in. This is why we felt comfortable starting at beta rather than alpha.

Guide for CSI driver authors

If you maintain a CSI driver that uses service account tokens, here's how to adopt this feature.

Adding fallback logic

First, update your driver code to check both locations for tokens. This makes your driver compatible with both the old and new approaches:

const serviceAccountTokenKey = "csi.storage.k8s.io/serviceAccount.tokens"

func getServiceAccountTokens(req *csi.NodePublishVolumeRequest) (string, error) {
    // Check secrets field first (new behavior when driver opts in)
    if tokens, ok := req.Secrets[serviceAccountTokenKey]; ok {
        return tokens, nil
    }
    
    // Fall back to volume context (existing behavior)
    if tokens, ok := req.VolumeContext[serviceAccountTokenKey]; ok {
        return tokens, nil
    }
    
    return "", fmt.Errorf("service account tokens not found")
}

This fallback logic is backward compatible and safe to ship in any driver version, even before clusters upgrade to v1.35.

Rollout sequence

CSI driver authors need to follow a specific sequence when adopting this feature to avoid breaking existing volumes.

Driver preparation (can happen anytime)

You can start preparing your driver right away by adding fallback logic that checks both the secrets field and volume context for tokens. This code change is backward compatible and safe to ship in any driver version, even before clusters upgrade to v1.35. We encourage you to add this fallback logic early, cut releases, and even backport to maintenance branches where feasible.

Cluster upgrade and feature enablement

Once your driver has the fallback logic deployed, here's the safe rollout order for enabling the feature in a cluster:

  1. Complete the kube-apiserver upgrade to 1.35 or later
  2. Complete kubelet upgrade to 1.35 or later on all nodes
  3. Ensure CSI driver version with fallback logic is deployed (if not already done in preparation phase)
  4. Fully complete CSI driver DaemonSet rollout across all nodes
  5. Update your CSIDriver manifest to set serviceAccountTokenInSecrets: true

Important constraints

The most important thing to remember is timing. If your CSI driver DaemonSet and CSIDriver object are in the same manifest or Helm chart, you need two separate updates. Deploy the new driver version with fallback logic first, wait for the DaemonSet rollout to complete, then update the CSIDriver spec to set serviceAccountTokenInSecrets: true.

Also, don't update the CSIDriver before all driver pods have rolled out. If you do, volume mounts will fail on nodes still running the old driver version, since those pods only check volume context.

Why this matters

Adopting this feature helps in a few ways:

  • It eliminates the risk of accidentally logging service account tokens as part of volume context in gRPC requests
  • It uses the CSI specification's designated field for sensitive data, which feels right
  • The protosanitizer tool automatically handles the secrets field correctly, so you don't need driver-specific workarounds
  • It's opt-in, so you can migrate at your own pace without breaking existing deployments

Call to action

We (Kubernetes SIG Storage) encourage CSI driver authors to adopt this feature and provide feedback on the migration experience. If you have thoughts on the API design or run into any issues during adoption, please reach out to us on the #csi channel on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/).

You can follow along on KEP-5538 to track progress across the coming Kubernetes releases.

Kubernetes v1.35: Extended Toleration Operators to Support Numeric Comparisons (Alpha)

Many production Kubernetes clusters blend on-demand (higher-SLA) and spot/preemptible (lower-SLA) nodes to optimize costs while maintaining reliability for critical workloads. Platform teams need a safe default that keeps most workloads away from risky capacity, while allowing specific workloads to opt-in with explicit thresholds like "I can tolerate nodes with failure probability up to 5%".

Today, Kubernetes taints and tolerations can match exact values or check for existence, but they can't compare numeric thresholds. You'd need to create discrete taint categories, use external admission controllers, or accept less-than-optimal placement decisions.

In Kubernetes v1.35, we're introducing Extended Toleration Operators as an alpha feature. This enhancement adds Gt (Greater Than) and Lt (Less Than) operators to spec.tolerations, enabling threshold-based scheduling decisions that unlock new possibilities for SLA-based placement, cost optimization, and performance-aware workload distribution.

The evolution of tolerations

Historically, Kubernetes supported two primary toleration operators:

  • Equal: The toleration matches a taint if the key and value are exactly equal
  • Exists: The toleration matches a taint if the key exists, regardless of value

While these worked well for categorical scenarios, they fell short for numeric comparisons. Starting with v1.35, we are closing this gap.

Consider these real-world scenarios:

  • SLA requirements: Schedule high-availability workloads only on nodes with failure probability below a certain threshold
  • Cost optimization: Allow cost-sensitive batch jobs to run on cheaper nodes that exceed a specific cost-per-hour value
  • Performance guarantees: Ensure latency-sensitive applications run only on nodes with disk IOPS or network bandwidth above minimum thresholds

Without numeric comparison operators, cluster operators have had to resort to workarounds like creating multiple discrete taint values or using external admission controllers, neither of which scale well or provide the flexibility needed for dynamic threshold-based scheduling.

Why extend tolerations instead of using NodeAffinity?

You might wonder: NodeAffinity already supports numeric comparison operators, so why extend tolerations? While NodeAffinity is powerful for expressing pod preferences, taints and tolerations provide critical operational benefits:

  • Policy orientation: NodeAffinity is per-pod, requiring every workload to explicitly opt-out of risky nodes. Taints invert control—nodes declare their risk level, and only pods with matching tolerations may land there. This provides a safer default; most pods stay away from spot/preemptible nodes unless they explicitly opt-in.
  • Eviction semantics: NodeAffinity has no eviction capability. Taints support the NoExecute effect with tolerationSeconds, enabling operators to drain and evict pods when a node's SLA degrades or spot instances receive termination notices.
  • Operational ergonomics: Centralized, node-side policy is consistent with other safety taints like disk-pressure and memory-pressure, making cluster management more intuitive.

This enhancement preserves the well-understood safety model of taints and tolerations while enabling threshold-based placement for SLA-aware scheduling.

Introducing Gt and Lt operators

Kubernetes v1.35 introduces two new operators for tolerations:

  • Gt (Greater Than): The toleration matches if the taint's numeric value is greater than the toleration's value
  • Lt (Less Than): The toleration matches if the taint's numeric value is less than the toleration's value

When a pod tolerates a taint with Lt, it's saying "I can tolerate nodes where this metric is less than my threshold". Since tolerations allow scheduling, the pod can run on nodes where the taint value is greater than the toleration value. Think of it as: "I tolerate nodes that are above my minimum requirements".

These operators work with numeric taint values and enable the scheduler to make sophisticated placement decisions based on continuous metrics rather than discrete categories.

Note:

Numeric values for Gt and Lt operators must be positive 64-bit integers without leading zeros. For example, "100" is valid, but "0100" (with leading zero) and "0" (zero value) are not permitted.

The Gt and Lt operators work with all taint effects: NoSchedule, NoExecute, and PreferNoSchedule.

Use cases and examples

Let's explore how Extended Toleration Operators solve real-world scheduling challenges.

Example 1: Spot instance protection with SLA thresholds

Many clusters mix on-demand and spot/preemptible nodes to optimize costs. Spot nodes offer significant savings but have higher failure rates. You want most workloads to avoid spot nodes by default, while allowing specific workloads to opt-in with clear SLA boundaries.

First, taint spot nodes with their failure probability (for example, 15% annual failure rate):

apiVersion: v1
kind: Node
metadata:
  name: spot-node-1
spec:
  taints:
  - key: "failure-probability"
    value: "15"
    effect: "NoExecute"

On-demand nodes have much lower failure rates:

apiVersion: v1
kind: Node
metadata:
  name: ondemand-node-1
spec:
  taints:
  - key: "failure-probability"
    value: "2"
    effect: "NoExecute"

Critical workloads can specify strict SLA requirements:

apiVersion: v1
kind: Pod
metadata:
  name: payment-processor
spec:
  tolerations:
  - key: "failure-probability"
    operator: "Lt"
    value: "5"
    effect: "NoExecute"
    tolerationSeconds: 30
  containers:
  - name: app
    image: payment-app:v1

This pod will only schedule on nodes with failure-probability less than 5 (meaning ondemand-node-1 with 2% but not spot-node-1 with 15%). The NoExecute effect with tolerationSeconds: 30 means if a node's SLA degrades (for example, cloud provider changes the taint value), the pod gets 30 seconds to gracefully terminate before forced eviction.

Meanwhile, a fault-tolerant batch job can explicitly opt-in to spot instances:

apiVersion: v1
kind: Pod
metadata:
  name: batch-job
spec:
  tolerations:
  - key: "failure-probability"
    operator: "Lt"
    value: "20"
    effect: "NoExecute"
  containers:
  - name: worker
    image: batch-worker:v1

This batch job tolerates nodes with failure probability up to 20%, so it can run on both on-demand and spot nodes, maximizing cost savings while accepting higher risk.

Example 2: AI workload placement with GPU tiers

AI and machine learning workloads often have specific hardware requirements. With Extended Toleration Operators, you can create GPU node tiers and ensure workloads land on appropriately powered hardware.

Taint GPU nodes with their compute capability score:

apiVersion: v1
kind: Node
metadata:
  name: gpu-node-a100
spec:
  taints:
  - key: "gpu-compute-score"
    value: "1000"
    effect: "NoSchedule"
---
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-t4
spec:
  taints:
  - key: "gpu-compute-score"
    value: "500"
    effect: "NoSchedule"

A heavy training workload can require high-performance GPUs:

apiVersion: v1
kind: Pod
metadata:
  name: model-training
spec:
  tolerations:
  - key: "gpu-compute-score"
    operator: "Gt"
    value: "800"
    effect: "NoSchedule"
  containers:
  - name: trainer
    image: ml-trainer:v1
    resources:
      limits:
        nvidia.com/gpu: 1

This ensures the training pod only schedules on nodes with compute scores greater than 800 (like the A100 node), preventing placement on lower-tier GPUs that would slow down training.

Meanwhile, inference workloads with less demanding requirements can use any available GPU:

apiVersion: v1
kind: Pod
metadata:
  name: model-inference
spec:
  tolerations:
  - key: "gpu-compute-score"
    operator: "Gt"
    value: "400"
    effect: "NoSchedule"
  containers:
  - name: inference
    image: ml-inference:v1
    resources:
      limits:
        nvidia.com/gpu: 1

Example 3: Cost-optimized workload placement

For batch processing or non-critical workloads, you might want to minimize costs by running on cheaper nodes, even if they have lower performance characteristics.

Nodes can be tainted with their cost rating:

spec:
  taints:
  - key: "cost-per-hour"
    value: "50"
    effect: "NoSchedule"

A cost-sensitive batch job can express its tolerance for expensive nodes:

tolerations:
- key: "cost-per-hour"
  operator: "Lt"
  value: "100"
  effect: "NoSchedule"

This batch job will schedule on nodes costing less than $100/hour but avoid more expensive nodes. Combined with Kubernetes scheduling priorities, this enables sophisticated cost-tiering strategies where critical workloads get premium nodes while batch workloads efficiently use budget-friendly resources.

Example 4: Performance-based placement

Storage-intensive applications often require minimum disk performance guarantees. With Extended Toleration Operators, you can enforce these requirements at the scheduling level.

tolerations:
- key: "disk-iops"
  operator: "Gt"
  value: "3000"
  effect: "NoSchedule"

This toleration ensures the pod only schedules on nodes where disk-iops exceeds 3000. The Gt operator means "I need nodes that are greater than this minimum".

How to use this feature

Extended Toleration Operators is an alpha feature in Kubernetes v1.35. To try it out:

  1. Enable the feature gate on both your API server and scheduler:

    --feature-gates=TaintTolerationComparisonOperators=true
    
  2. Taint your nodes with numeric values representing the metrics relevant to your scheduling needs:

      kubectl taint nodes node-1 failure-probability=5:NoSchedule
      kubectl taint nodes node-2 disk-iops=5000:NoSchedule
    
  3. Use the new operators in your pod specifications:

      spec:
        tolerations:
        - key: "failure-probability"
          operator: "Lt"
          value: "1"
          effect: "NoSchedule"
    

Note:

As an alpha feature, Extended Toleration Operators may change in future releases and should be used with caution in production environments. Always test thoroughly in non-production clusters first.

What's next?

This alpha release is just the beginning. As we gather feedback from the community, we plan to:

  • Add support for CEL (Common Expression Language) expressions in tolerations and node affinity for even more flexible scheduling logic, including semantic versioning comparisons
  • Improve integration with cluster autoscaling for threshold-aware capacity planning
  • Graduate the feature to beta and eventually GA with production-ready stability

We're particularly interested in hearing about your use cases! Do you have scenarios where threshold-based scheduling would solve problems? Are there additional operators or capabilities you'd like to see?

Getting involved

This feature is driven by the SIG Scheduling community. Please join us to connect with the community and share your ideas and feedback around this feature and beyond.

You can reach the maintainers of this feature at:

For questions or specific inquiries related to Extended Toleration Operators, please reach out to the SIG Scheduling community. We look forward to hearing from you!

How can I learn more?

Kubernetes v1.35: New level of efficiency with in-place Pod restart

The release of Kubernetes 1.35 introduces a powerful new feature that provides a much-requested capability: the ability to trigger a full, in-place restart of the Pod. This feature, Restart All Containers (alpha in 1.35), allows for an efficient way to reset a Pod's state compared to resource-intensive approach of deleting and recreating the entire Pod. This feature is especially useful for AI/ML workloads allowing application developers to concentrate on their core training logic while offloading complex failure-handling and recovery mechanisms to sidecars and declarative Kubernetes configuration. With RestartAllContainers and other planned enhancements, Kubernetes continues to add building blocks for creating the most flexible, robust, and efficient platforms for AI/ML workloads.

This new functionality is available by enabling the RestartAllContainersOnContainerExits feature gate. This alpha feature extends the Container Restart Rules feature, which graduated to beta in Kubernetes 1.35.

The problem: when a single container restart isn't enough and recreating pods is too costly

Kubernetes has long supported restart policies at the Pod level (restartPolicy) and, more recently, at the individual container level. These policies are great for handling crashes in a single, isolated process. However, many modern applications have more complex inter-container dependencies. For instance:

  • An init container prepares the environment by mounting a volume or generating a configuration file. If the main application container corrupts this environment, simply restarting that one container is not enough. The entire initialization process needs to run again.
  • A watcher sidecar monitors system health. If it detects an unrecoverable but retriable error state, it must trigger a restart of the main application container from a clean slate.
  • A sidecar that manages a remote resource fails. Even if the sidecar restarts on its own, the main container may be stuck trying to access an outdated or broken connection.

In all these cases, the desired action is not to restart a single container, but all of them. Previously, the only way to achieve this was to delete the Pod and have a controller (like a Job or ReplicaSet) create a new one. This process is slow and expensive, involving the scheduler, node resource allocation and re-initialization of networking and storage.

This inefficiency becomes even worse when handling large-scale AI/ML workloads (>= 1,000 Nodes with one Pod per Node). A common requirement for these synchronous workloads is that when a failure occurs (such as a Node crash), all Pods in the fleet must be recreated to reset the state before training can resume, even if all the other Pods were not directly affected by the failure. Deleting, creating and scheduling thousands of Pods simultaneously creates a massive bottleneck. The estimated overhead of this failure could cost $100,000 per month in wasted resources.

Handling these failures for AI/ML training jobs requires a complex integration touching both the training framework and Kubernetes, which are often fragile and toilsome. This feature introduces a Kubernetes-native solution, improving system robustness and allowing application developers to concentrate on their core training logic.

Another major benefit of restarting Pods in place is that keeping Pods on their assigned Nodes allows for further optimizations. For example, one can implement node-level caching tied to a specific Pod identity, something that is impossible when Pods are unnecessarily being recreated on different Nodes.

Introducing the RestartAllContainers action

To address this, Kubernetes v1.35 adds a new action to the container restart rules: RestartAllContainers. When a container exits in a way that matches a rule with this action, the kubelet initiates a fast, in-place restart of the Pod.

This in-place restart is highly efficient because it preserves the Pod's most important resources:

  • The Pod's UID, IP address and network namespace.
  • The Pod's sandbox and any attached devices.
  • All volumes, including emptyDir and mounted volumes from PVCs.

After terminating all running containers, the Pod's startup sequence is re-executed from the very beginning. This means all init containers are run again in order, followed by the sidecar and regular containers, ensuring a completely fresh start in a known-good environment. With the exception of ephemeral containers (which are terminated), all other containers—including those that previously succeeded or failed—will be restarted, regardless of their individual restart policies.

Use cases

1. Efficient restarts for ML/Batch jobs

For ML training jobs, rescheduling a worker Pod on failure is a costly operation that wastes valuable compute resources. On a 1,000-node training cluster, rescheduling overhead can waste over $100,000 in compute resources monthly.

With RestartAllContainers actions you can address this by enabling a much faster, hybrid recovery strategy: recreate only the "bad" Pods (e.g., those on unhealthy Nodes) while triggering RestartAllContainers for the remaining healthy Pods. Benchmarks show this reduces the recovery overhead from minutes to a few seconds.

With in-place restarts, a watcher sidecar can monitor the main training process. If it encounters a specific, retriable error, the watcher can exit with a designated code to trigger a fast reset of the worker Pod, allowing it to restart from the last checkpoint without involving the Job controller. This capability is now natively supported by Kubernetes.

Read more details about future development and JobSet features at KEP-467 JobSet in-place restart.

apiVersion: v1
kind: Pod
metadata:
  name: ml-worker-pod
spec:
  restartPolicy: Never
  initContainers:
  # This init container will re-run on every in-place restart
  - name: setup-environment
    image: my-repo/setup-worker:1.0
  - name: watcher-sidecar
    image: my-repo/watcher:1.0
    restartPolicy: Always
    restartPolicyRules:
    - action: RestartAllContainers
      exitCodes:
        operator: In
        # A specific exit code from the watcher triggers a full pod restart
        values: [88]
  containers:
  - name: main-application
    image: my-repo/training-app:1.0

2. Re-running init containers for a clean state

Imagine a scenario where an init container is responsible for fetching credentials or setting up a shared volume. If the main application fails in a way that corrupts this shared state, you need the init container to rerun.

By configuring the main application to exit with a specific code upon detecting such a corruption, you can trigger the RestartAllContainers action, guaranteeing that the init container provides a clean setup before the application restarts.

3. Handling high rate of similar tasks execution

There are cases when tasks are best represented as a Pod execution. And each task requires a clean execution. The task may be a game session backend or some queue item processing. If the rate of tasks is high, running the whole cycle of Pod creation, scheduling and initialization is simply too expensive, especially when tasks can be short. The ability to restart all containers from scratch enables a Kubernetes-native way to handle this scenario without custom solutions or frameworks.

How to use it

To try this feature, you must enable the RestartAllContainersOnContainerExits feature gate on your Kubernetes cluster components (API server and kubelet) running Kubernetes v1.35+. This alpha feature extends the ContainerRestartRules feature, which graduated to beta in v1.35 and is enabled by default.

Once enabled, you can add restartPolicyRules to any container (init, sidecar, or regular) and use the RestartAllContainers action.

The feature is designed to be easily usable on existing apps. However, if an application does not follow some best practices, it may cause issues for the application or for observability tooling. When enabling the feature, make sure that all containers are reentrant and that external tooling is prepared for init containers to re-run. Also, when restarting all containers, the kubelet does not run preStop hooks. This means containers must be designed to handle abrupt termination without relying on preStop hooks for graceful shutdown.

Observing the restart

To make this process observable, a new Pod condition, AllContainersRestarting, is added to the Pod's status. When a restart is triggered, this condition becomes True and it reverts to False once all containers have terminated and the Pod is ready to start its lifecycle anew. This provides a clear signal to users and other cluster components about the Pod's state.

All containers restarted by this action will have their restart count incremented in the container status.

Learn more

We want your feedback!

As an alpha feature, RestartAllContainers is ready for you to experiment with and any use cases and feedback are welcome. This feature is driven by the SIG Node community. If you are interested in getting involved, sharing your thoughts, or contributing, please join us!

You can reach SIG Node through:

Kubernetes 1.35: Enhanced Debugging with Versioned z-pages APIs

Debugging Kubernetes control plane components can be challenging, especially when you need to quickly understand the runtime state of a component or verify its configuration. With Kubernetes 1.35, we're enhancing the z-pages debugging endpoints with structured, machine-parseable responses that make it easier to build tooling and automate troubleshooting workflows.

What are z-pages?

z-pages are special debugging endpoints exposed by Kubernetes control plane components. Introduced as an alpha feature in Kubernetes 1.32, these endpoints provide runtime diagnostics for components like kube-apiserver, kube-controller-manager, kube-scheduler, kubelet and kube-proxy. The name "z-pages" comes from the convention of using /*z paths for debugging endpoints.

Currently, Kubernetes supports two primary z-page endpoints:

/statusz
Displays high-level component information including version information, start time, uptime, and available debug paths
/flagz
Shows all command-line arguments and their values used to start the component (with confidential values redacted for security)

These endpoints are valuable for human operators who need to quickly inspect component state, but until now, they only returned plain text output that was difficult to parse programmatically.

What's new in Kubernetes 1.35?

Kubernetes 1.35 introduces structured, versioned responses for both /statusz and /flagz endpoints. This enhancement maintains backward compatibility with the existing plain text format while adding support for machine-readable JSON responses.

Backward compatible design

The new structured responses are opt-in. Without specifying an Accept header, the endpoints continue to return the familiar plain text format:

$ curl --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
  --key /etc/kubernetes/pki/apiserver-kubelet-client.key \
  --cacert /etc/kubernetes/pki/ca.crt \
  https://localhost:6443/statusz

kube-apiserver statusz
Warning: This endpoint is not meant to be machine parseable, has no formatting compatibility guarantees and is for debugging purposes only.

Started: Wed Oct 16 21:03:43 UTC 2024
Up: 0 hr 00 min 16 sec
Go version: go1.23.2
Binary version: 1.35.0-alpha.0.1595
Emulation version: 1.35
Paths: /healthz /livez /metrics /readyz /statusz /version

Structured JSON responses

To receive a structured response, include the appropriate Accept header:

Accept: application/json;v=v1alpha1;g=config.k8s.io;as=Statusz

This returns a versioned JSON response:

{
  "kind": "Statusz",
  "apiVersion": "config.k8s.io/v1alpha1",
  "metadata": {
    "name": "kube-apiserver"
  },
  "startTime": "2025-10-29T00:30:01Z",
  "uptimeSeconds": 856,
  "goVersion": "go1.23.2",
  "binaryVersion": "1.35.0",
  "emulationVersion": "1.35",
  "paths": [
    "/healthz",
    "/livez",
    "/metrics",
    "/readyz",
    "/statusz",
    "/version"
  ]
}

Similarly, /flagz supports structured responses with the header:

Accept: application/json;v=v1alpha1;g=config.k8s.io;as=Flagz

Example response:

{
  "kind": "Flagz",
  "apiVersion": "config.k8s.io/v1alpha1",
  "metadata": {
    "name": "kube-apiserver"
  },
  "flags": {
    "advertise-address": "192.168.8.4",
    "allow-privileged": "true",
    "authorization-mode": "[Node,RBAC]",
    "enable-priority-and-fairness": "true",
    "profiling": "true"
  }
}

Why structured responses matter

The addition of structured responses opens up several new possibilities:

1. Automated health checks and monitoring

Instead of parsing plain text, monitoring tools can now easily extract specific fields. For example, you can programmatically check if a component has been running with an unexpected emulated version or verify that critical flags are set correctly.

2. Better debugging tools

Developers can build sophisticated debugging tools that compare configurations across multiple components or track configuration drift over time. The structured format makes it trivial to diff configurations or validate that components are running with expected settings.

3. API versioning and stability

By introducing versioned APIs (starting with v1alpha1), we provide a clear path to stability. As the feature matures, we'll introduce v1beta1 and eventually v1, giving you confidence that your tooling won't break with future Kubernetes releases.

How to use structured z-pages

Prerequisites

Both endpoints require feature gates to be enabled:

  • /statusz: Enable the ComponentStatusz feature gate
  • /flagz: Enable the ComponentFlagz feature gate

Example: Getting structured responses

Here's an example using curl to retrieve structured JSON responses from the kube-apiserver:

# Get structured statusz response
curl \
  --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
  --key /etc/kubernetes/pki/apiserver-kubelet-client.key \
  --cacert /etc/kubernetes/pki/ca.crt \
  -H "Accept: application/json;v=v1alpha1;g=config.k8s.io;as=Statusz" \
  https://localhost:6443/statusz | jq .

# Get structured flagz response
curl \
  --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
  --key /etc/kubernetes/pki/apiserver-kubelet-client.key \
  --cacert /etc/kubernetes/pki/ca.crt \
  -H "Accept: application/json;v=v1alpha1;g=config.k8s.io;as=Flagz" \
  https://localhost:6443/flagz | jq .

Note:

The examples above use client certificate authentication and verify the server's certificate using --cacert. If you need to bypass certificate verification in a test environment, you can use --insecure (or -k), but this should never be done in production as it makes you vulnerable to man-in-the-middle attacks.

Important considerations

Alpha feature status

The structured z-page responses are an alpha feature in Kubernetes 1.35. This means:

  • The API format may change in future releases
  • These endpoints are intended for debugging, not production automation
  • You should avoid relying on them for critical monitoring workflows until they reach beta or stable status

Security and access control

z-pages expose internal component information and require proper access controls. Here are the key security considerations:

Authorization: Access to z-page endpoints is restricted to members of the system:monitoring group, which follows the same authorization model as other debugging endpoints like /healthz, /livez, and /readyz. This ensures that only authorized users and service accounts can access debugging information. If your cluster uses RBAC, you can manage access by granting appropriate permissions to this group.

Authentication: The authentication requirements for these endpoints depend on your cluster's configuration. Unless anonymous authentication is enabled for your cluster, you typically need to use authentication mechanisms (such as client certificates) to access these endpoints.

Information disclosure: These endpoints reveal configuration details about your cluster components, including:

  • Component versions and build information
  • All command-line arguments and their values (with confidential values redacted)
  • Available debug endpoints

Only grant access to trusted operators and debugging tools. Avoid exposing these endpoints to unauthorized users or automated systems that don't require this level of access.

Future evolution

As the feature matures, we (Kubernetes SIG Instrumentation) expect to:

  • Introduce v1beta1 and eventually v1 versions of the API
  • Gather community feedback on the response schema
  • Potentially add additional z-page endpoints based on user needs

Try it out

We encourage you to experiment with structured z-pages in a test environment:

  1. Enable the ComponentStatusz and ComponentFlagz feature gates on your control plane components
  2. Try querying the endpoints with both plain text and structured formats
  3. Build a simple tool or script that uses the structured data
  4. Share your feedback with the community

Learn more

Get involved

We'd love to hear your feedback! The structured z-pages feature is designed to make Kubernetes easier to debug and monitor. Whether you're building internal tooling, contributing to open source projects, or just exploring the feature, your input helps shape the future of Kubernetes observability.

If you have questions, suggestions, or run into issues, please reach out to SIG Instrumentation. You can find us on Slack or at our regular community meetings.

Happy debugging!

Kubernetes v1.35: Watch Based Route Reconciliation in the Cloud Controller Manager

Up to and including Kubernetes v1.34, the route controller in Cloud Controller Manager (CCM) implementations built using the k8s.io/cloud-provider library reconciles routes at a fixed interval. This causes unnecessary API requests to the cloud provider when there are no changes to routes. Other controllers implemented through the same library already use watch-based mechanisms, leveraging informers to avoid unnecessary API calls. A new feature gate is being introduced in v1.35 to allow changing the behavior of the route controller to use watch-based informers.

What's new?

The feature gate CloudControllerManagerWatchBasedRoutesReconciliation has been introduced to k8s.io/cloud-provider in alpha stage by SIG Cloud Provider. To enable this feature you can use --feature-gate=CloudControllerManagerWatchBasedRoutesReconciliation=true in the CCM implementation you are using.

About the feature gate

This feature gate will trigger the route reconciliation loop whenever a node is added, deleted, or the fields .spec.podCIDRs or .status.addresses are updated.

An additional reconcile is performed in a random interval between 12h and 24h, which is chosen at the controller's start time.

This feature gate does not modify the logic within the reconciliation loop. Therefore, users of a CCM implementation should not experience significant changes to their existing route configurations.

How can I learn more?

For more details, refer to the KEP-5237.

Kubernetes v1.35: Introducing Workload Aware Scheduling

Scheduling large workloads is a much more complex and fragile operation than scheduling a single Pod, as it often requires considering all Pods together instead of scheduling each one independently. For example, when scheduling a machine learning batch job, you often need to place each worker strategically, such as on the same rack, to make the entire process as efficient as possible. At the same time, the Pods that are part of such a workload are very often identical from the scheduling perspective, which fundamentally changes how this process should look.

There are many custom schedulers adapted to perform workload scheduling efficiently, but considering how common and important workload scheduling is to Kubernetes users, especially in the AI era with the growing number of use cases, it is high time to make workloads a first-class citizen for kube-scheduler and support them natively.

Workload aware scheduling

The recent 1.35 release of Kubernetes delivered the first tranche of workload aware scheduling improvements. These are part of a wider effort that is aiming to improve scheduling and management of workloads. The effort will span over many SIGs and releases, and is supposed to gradually expand capabilities of the system toward reaching the north star goal, which is seamless workload scheduling and management in Kubernetes including, but not limited to, preemption and autoscaling.

Kubernetes v1.35 introduces the Workload API that you can use to describe the desired shape as well as scheduling-oriented requirements of the workload. It comes with an initial implementation of gang scheduling that instructs the kube-scheduler to schedule gang Pods in the all-or-nothing fashion. Finally, we improved scheduling of identical Pods (that typically make a gang) to speed up the process thanks to the opportunistic batching feature.

Workload API

The new Workload API resource is part of the scheduling.k8s.io/v1alpha1 API group. This resource acts as a structured, machine-readable definition of the scheduling requirements of a multi-Pod application. While user-facing workloads like Jobs define what to run, the Workload resource determines how a group of Pods should be scheduled and how its placement should be managed throughout its lifecycle.

A Workload allows you to define a group of Pods and apply a scheduling policy to them. Here is what a gang scheduling configuration looks like. You can define a podGroup named workers and apply the gang policy with a minCount of 4.

apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        # The gang is schedulable only if 4 pods can run at once
        minCount: 4

When you create your Pods, you link them to this Workload using the new workloadRef field:

apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: some-ns
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
  ...

How gang scheduling works

The gang policy enforces all-or-nothing placement. Without gang scheduling, a Job might be partially scheduled, consuming resources without being able to run, leading to resource wastage and potential deadlocks.

When you create Pods that are part of a gang-scheduled pod group, the scheduler's GangScheduling plugin manages the lifecycle independently for each pod group (or replica key):

  1. When you create your Pods (or a controller makes them for you), the scheduler blocks them from scheduling, until:

    • The referenced Workload object is created.
    • The referenced pod group exists in a Workload.
    • The number of pending Pods in that group meets your minCount.
  2. Once enough Pods arrive, the scheduler tries to place them. However, instead of binding them to nodes immediately, the Pods wait at a Permit gate.

  3. The scheduler checks if it has found valid assignments for the entire group (at least the minCount).

    • If there is room for the group, the gate opens, and all Pods are bound to nodes.
    • If only a subset of the group pods was successfully scheduled within a timeout (set to 5 minutes), the scheduler rejects all of the Pods in the group. They go back to the queue, freeing up the reserved resources for other workloads.

We'd like to point out that that while this is a first implementation, the Kubernetes project firmly intends to improve and expand the gang scheduling algorithm in future releases. Benefits we hope to deliver include a single-cycle scheduling phase for a whole gang, workload-level preemption, and more, moving towards the north star goal.

Opportunistic batching

In addition to explicit gang scheduling, v1.35 introduces opportunistic batching. This is a Beta feature that improves scheduling latency for identical Pods.

Unlike gang scheduling, this feature does not require the Workload API or any explicit opt-in on the user's part. It works opportunistically within the scheduler by identifying Pods that have identical scheduling requirements (container images, resource requests, affinities, etc.). When the scheduler processes a Pod, it can reuse the feasibility calculations for subsequent identical Pods in the queue, significantly speeding up the process.

Most users will benefit from this optimization automatically, without taking any special steps, provided their Pods meet the following criteria.

Restrictions

Opportunistic batching works under specific conditions. All fields used by the kube-scheduler to find a placement must be identical between Pods. Additionally, using some features disables the batching mechanism for those Pods to ensure correctness.

Note that you may need to review your kube-scheduler configuration to ensure it is not implicitly disabling batching for your workloads.

See the docs for more details about restrictions.

The north star vision

The project has a broad ambition to deliver workload aware scheduling. These new APIs and scheduling enhancements are just the first steps. In the near future, the effort aims to tackle:

  • Introducing a workload scheduling phase
  • Improved support for multi-node DRA and topology aware scheduling
  • Workload-level preemption
  • Improved integration between scheduling and autoscaling
  • Improved interaction with external workload schedulers
  • Managing placement of workloads throughout their entire lifecycle
  • Multi-workload scheduling simulations

And more. The priority and implementation order of these focus areas are subject to change. Stay tuned for further updates.

Getting started

To try the workload aware scheduling improvements:

  • Workload API: Enable the GenericWorkload feature gate on both kube-apiserver and kube-scheduler, and ensure the scheduling.k8s.io/v1alpha1 API group is enabled.
  • Gang scheduling: Enable the GangScheduling feature gate on kube-scheduler (requires the Workload API to be enabled).
  • Opportunistic batching: As a Beta feature, it is enabled by default in v1.35. You can disable it using the OpportunisticBatching feature gate on kube-scheduler if needed.

We encourage you to try out workload aware scheduling in your test clusters and share your experiences to help shape the future of Kubernetes scheduling. You can send your feedback by:

Learn more

Kubernetes v1.35: Fine-grained Supplemental Groups Control Graduates to GA

On behalf of Kubernetes SIG Node, we are pleased to announce the graduation of fine-grained supplemental groups control to General Availability (GA) in Kubernetes v1.35!

The new Pod field, supplementalGroupsPolicy, was introduced as an opt-in alpha feature for Kubernetes v1.31, and then had graduated to beta in v1.33. Now, the feature is generally available. This feature allows you to implement more precise control over supplemental groups in Linux containers that can strengthen the security posture particularly in accessing volumes. Moreover, it also enhances the transparency of UID/GID details in containers, offering improved security oversight.

If you are planning to upgrade your cluster from v1.32 or an earlier version, please be aware that some behavioral breaking change introduced since beta (v1.33). For more details, see the behavioral changes introduced in beta and the upgrade considerations sections of the previous blog for graduation to beta.

Motivation: Implicit group memberships defined in /etc/group in the container image

Even though the majority of Kubernetes cluster admins/users may not be aware of this, by default Kubernetes merges group information from the Pod with information defined in /etc/group in the container image.

Here's an example; a Pod manifest that specifies spec.securityContext.runAsUser: 1000, spec.securityContext.runAsGroup: 3000 and spec.securityContext.supplementalGroups: 4000 as part of the Pod's security context.

apiVersion: v1
kind: Pod
metadata:
  name: implicit-groups-example
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    supplementalGroups: [4000]
  containers:
  - name: example-container
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "sleep 1h" ]
    securityContext:
      allowPrivilegeEscalation: false

What is the result of id command in the example-container container? The output should be similar to this:

uid=1000 gid=3000 groups=3000,4000,50000

Where does group ID 50000 in supplementary groups (groups field) come from, even though 50000 is not defined in the Pod's manifest at all? The answer is /etc/group file in the container image.

Checking the contents of /etc/group in the container image contains something like the following:

user-defined-in-image:x:1000:
group-defined-in-image:x:50000:user-defined-in-image

This shows that the container's primary user 1000 belongs to the group 50000 in the last entry.

Thus, the group membership defined in /etc/group in the container image for the container's primary user is implicitly merged to the information from the Pod. Please note that this was a design decision the current CRI implementations inherited from Docker, and the community never really reconsidered it until now.

What's wrong with it?

The implicitly merged group information from /etc/group in the container image poses a security risk. These implicit GIDs can't be detected or validated by policy engines because there's no record of them in the Pod manifest. This can lead to unexpected access control issues, particularly when accessing volumes (see kubernetes/kubernetes#112879 for details) because file permission is controlled by UID/GIDs in Linux.

Fine-grained supplemental groups control in a Pod: supplementaryGroupsPolicy

To tackle this problem, a Pod's .spec.securityContext now includes supplementalGroupsPolicy field.

This field lets you control how Kubernetes calculates the supplementary groups for container processes within a Pod. The available policies are:

  • Merge: The group membership defined in /etc/group for the container's primary user will be merged. If not specified, this policy will be applied (i.e. as-is behavior for backward compatibility).

  • Strict: Only the group IDs specified in fsGroup, supplementalGroups, or runAsGroup are attached as supplementary groups to the container processes. Group memberships defined in /etc/group for the container's primary user are ignored.

I'll explain how the Strict policy works. The following Pod manifest specifies supplementalGroupsPolicy: Strict:

apiVersion: v1
kind: Pod
metadata:
  name: strict-supplementalgroups-policy-example
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    supplementalGroups: [4000]
    supplementalGroupsPolicy: Strict
  containers:
  - name: example-container
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "sleep 1h" ]
    securityContext:
      allowPrivilegeEscalation: false

The result of id command in the example-container container should be similar to this:

uid=1000 gid=3000 groups=3000,4000

You can see Strict policy can exclude group 50000 from groups!

Thus, ensuring supplementalGroupsPolicy: Strict (enforced by some policy mechanism) helps prevent the implicit supplementary groups in a Pod.

Note:

A container with sufficient privileges can change its process identity. The supplementalGroupsPolicy only affect the initial process identity.

Read on for more details.

Attached process identity in Pod status

This feature also exposes the process identity attached to the first container process of the container via .status.containerStatuses[].user.linux field. It would be helpful to see if implicit group IDs are attached.

...
status:
  containerStatuses:
  - name: ctr
    user:
      linux:
        gid: 3000
        supplementalGroups:
        - 3000
        - 4000
        uid: 1000
...

Note:

Please note that the values in status.containerStatuses[].user.linux field is the firstly attached process identity to the first container process in the container. If the container has sufficient privilege to call system calls related to process identity (e.g. setuid(2), setgid(2) or setgroups(2), etc.), the container process can change its identity. Thus, the actual process identity will be dynamic.

There are several ways to restrict these permissions in containers. We suggest the belows as simple solutions:

Also, kubelet has no visibility into NRI plugins or container runtime internal workings. Cluster Administrator configuring nodes or highly privilege workloads with the permission of a local administrator may change supplemental groups for any pod. However this is outside of a scope of Kubernetes control and should not be a concern for security-hardened nodes.

Strict policy requires up-to-date container runtimes

The high level container runtime (e.g. containerd, CRI-O) plays a key role for calculating supplementary group ids that will be attached to the containers. Thus, supplementalGroupsPolicy: Strict requires a CRI runtime that support this feature. The old behavior (supplementalGroupsPolicy: Merge) can work with a CRI runtime that does not support this feature, because this policy is fully backward compatible.

Here are some CRI runtimes that support this feature, and the versions you need to be running:

  • containerd: v2.0 or later
  • CRI-O: v1.31 or later

And, you can see if the feature is supported in the Node's .status.features.supplementalGroupsPolicy field. Please note that this field is different from status.declaredFeatures introduced in KEP-5328: Node Declared Features(formerly Node Capabilities).

apiVersion: v1
kind: Node
...
status:
  features:
    supplementalGroupsPolicy: true

As container runtimes support this feature universally, various security policies may start enforcing the Strict behavior as more secure. It is the best practice to ensure that your Pods are ready for this enforcement and all supplemental groups are transparently declared in Pod spec, rather than in images.

Getting involved

This enhancement was driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!

How can I learn more?

Kubernetes v1.35: Kubelet Configuration Drop-in Directory Graduates to GA

With the recent v1.35 release of Kubernetes, support for a kubelet configuration drop-in directory is generally available. The newly stable feature simplifies the management of kubelet configuration across large, heterogeneous clusters.

With v1.35, the kubelet command line argument --config-dir is production-ready and fully supported, allowing you to specify a directory containing kubelet configuration drop-in files. All files in that directory will be automatically merged with your main kubelet configuration. This allows cluster administrators to maintain a cohesive base configuration for kubelets while enabling targeted customizations for different node groups or use cases, and without complex tooling or manual configuration management.

The problem: managing kubelet configuration at scale

As Kubernetes clusters grow larger and more complex, they often include heterogeneous node pools with different hardware capabilities, workload requirements, and operational constraints. This diversity necessitates different kubelet configurations across node groups—yet managing these varied configurations at scale becomes increasingly challenging. Several pain points emerge:

  • Configuration drift: Different nodes may have slightly different configurations, leading to inconsistent behavior
  • Node group customization: GPU nodes, edge nodes, and standard compute nodes often require different kubelet settings
  • Operational overhead: Maintaining separate, complete configuration files for each node type is error-prone and difficult to audit
  • Change management: Rolling out configuration changes across heterogeneous node pools requires careful coordination

Before this support was added to Kubernetes, cluster administrators had to choose between using a single monolithic configuration file for all nodes, manually maintaining multiple complete configuration files, or relying on separate tooling. Each approach had its own drawbacks. This graduation to stable gives cluster administrators a fully supported fourth way to solve that challenge.

Example use cases

Managing heterogeneous node pools

Consider a cluster with multiple node types: standard compute nodes, high-capacity nodes (such as those with GPUs or large amounts of memory), and edge nodes with specialized requirements.

Base configuration

File: 00-base.conf

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clusterDNS:
  - "10.96.0.10"
clusterDomain: cluster.local

High-capacity node override

File: 50-high-capacity-nodes.conf

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 50
systemReserved:
  memory: "4Gi"
  cpu: "1000m"

Edge node override

File: 50-edge-nodes.conf (edge compute typically has lower capacity)

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "5%"

With this structure, high-capacity nodes apply both the base configuration and the capacity-specific overrides, while edge nodes apply the base configuration with edge-specific settings.

Gradual configuration rollouts

When rolling out configuration changes, you can:

  1. Add a new drop-in file with a high numeric prefix (e.g., 99-new-feature.conf)
  2. Test the changes on a subset of nodes
  3. Gradually roll out to more nodes
  4. Once stable, merge changes into the base configuration

Viewing the merged configuration

Since configuration is now spread across multiple files, you can inspect the final merged configuration using the kubelet's /configz endpoint:

# Start kubectl proxy
kubectl proxy

# In another terminal, fetch the merged configuration
# Change the '<node-name>' placeholder before running the curl command
curl -X GET http://127.0.0.1:8001/api/v1/nodes/<node-name>/proxy/configz | jq .

This shows the actual configuration the kubelet is using after all merging has been applied. The merged configuration also includes any configuration settings that were specified via kubelet command-line arguments.

For detailed setup instructions, configuration examples, and merging behavior, see the official documentation:

Good practices

When using the kubelet configuration drop-in directory:

  1. Test configurations incrementally: Always test new drop-in configurations on a subset of nodes before rolling out cluster-wide to minimize risk

  2. Version control your drop-ins: Store your drop-in configuration files in version control (or the configuration source from which these are generated) alongside your infrastructure as code to track changes and enable easy rollbacks

  3. Use numeric prefixes for predictable ordering: Name files with numeric prefixes (e.g., 00-, 50-, 90-) to explicitly control merge order and make the configuration layering obvious to other administrators

  4. Be mindful of temporary files: Some text editors automatically create backup files (such as .bak, .swp, or files with ~ suffix) in the same directory when editing. Ensure these temporary or backup files are not left in the configuration directory, as they may be processed by the kubelet

Acknowledgments

This feature was developed through the collaborative efforts of SIG Node. Special thanks to all contributors who helped design, implement, test, and document this feature across its journey from alpha in v1.28, through beta in v1.30, to GA in v1.35.

To provide feedback on this feature, join the Kubernetes Node Special Interest Group, participate in discussions on the public Slack channel (#sig-node), or file an issue on GitHub.

Get involved

If you have feedback or questions about kubelet configuration management, or want to share your experience using this feature, join the discussion:

SIG Node would love to hear about your experiences using this feature in production!

Avoiding Zombie Cluster Members When Upgrading to etcd v3.6

This article is a mirror of an original that was recently published to the official etcd blog. The key takeaway? Always upgrade to etcd v3.5.26 or later before moving to v3.6. This ensures your cluster is automatically repaired, and avoids zombie members.

Issue summary

Recently, the etcd community addressed an issue that may appear when users upgrade from v3.5 to v3.6. This bug can cause the cluster to report "zombie members", which are etcd nodes that were removed from the database cluster some time ago, and are re-appearing and joining database consensus. The etcd cluster is then inoperable until these zombie members are removed.

In etcd v3.5 and earlier, the v2store was the source of truth for membership data, even though the v3store was also present. As a part of our v2store deprecation plan, in v3.6 the v3store is the source of truth for cluster membership. Through a bug report we found out that, in some older clusters, v2store and v3store could become inconsistent. This inconsistency manifests after upgrading as seeing old, removed "zombie" cluster members re-appearing in the cluster.

The fix and upgrade path

We’ve added a mechanism in etcd v3.5.26 to automatically sync v3store from v2store, ensuring that affected clusters are repaired before upgrading to 3.6.x.

To support the many users currently upgrading to 3.6, we have provided the following safe upgrade path:

  1. Upgrade your cluster to v3.5.26 or later.
  2. Wait and confirm that all members are healthy post-update.
  3. Upgrade to v3.6.

We are unable to provide a safe workaround path for users who have some obstacle preventing updating to v3.5.26. As such, if v3.5.26 is not available from your packaging source or vendor, you should delay upgrading to v3.6 until it is.

Additional technical detail

Information below is offered for reference only. Users can follow the safe upgrade path without knowledge of the following details.

This issue is encountered with clusters that have been running in production on etcd v3.5.25 or earlier. It is a side effect of adding and removing members from the cluster, or recovering the cluster from failure. This means that the issue is more likely the older the etcd cluster is, but it cannot be ruled out for any user regardless of the age of the cluster.

etcd maintainers, working with issue reporters, have found three possible triggers for the issue based on symptoms and an analysis of etcd code and logs:

  1. Bug in etcdctl snapshot restore (v3.4 and old versions): When restoring a snapshot using etcdctl snapshot restore, etcdctl was supposed to remove existing members before adding the new ones. In v3.4, due to a bug, old members were not removed, resulting in zombie members. Refer to the comment on etcdctl.
  2. --force-new-cluster in v3.5 and earlier versions: In rare cases, forcibly creating a new single-member cluster did not fully remove old members, leaving zombies. The issue was resolved in v3.5.22. Please refer to this PR in the Raft project for detailed technical information.
  3. --unsafe-no-sync enabled: If --unsafe-no-sync is enabled, in rare cases etcd might persist a membership change to v3store but crash before writing it to the WAL, causing inconsistency between v2store and v3store. This is a problem for single-member clusters. For multi-member clusters, forcibly creating a new single-member cluster from the crashed node’s data may lead to zombie members.

Importantly, there may be other triggers for v2store and v3store membership data becoming inconsistent that we have not yet found. This means that you cannot assume that you are safe just because you have not performed any of the three actions above. Once users are upgraded to etcd v3.6, v3store becomes the source of membership data, and further inconsistency is not possible.

Advanced users who want to verify the consistency between v2store and v3store can follow the steps described in this comment. This check is not required to fix the issue, nor does SIG etcd recommend bypassing the v3.5.26 update regardless of the results of the check.

Key takeaway

Always upgrade to v3.5.26 or later before moving to v3.6. This ensures your cluster is automatically repaired and avoids zombie members.

Acknowledgements

We would like to thank Christian Baumann for reporting this long-standing upgrade issue. His report and follow-up work helped bring the issue to our attention so that we could investigate and resolve it upstream.

Kubernetes 1.35: In-Place Pod Resize Graduates to Stable

This release marks a major step: more than 6 years after its initial conception, the In-Place Pod Resize feature (also known as In-Place Pod Vertical Scaling), first introduced as alpha in Kubernetes v1.27, and graduated to beta in Kubernetes v1.33, is now stable (GA) in Kubernetes 1.35!

This graduation is a major milestone for improving resource efficiency and flexibility for workloads running on Kubernetes.

What is in-place Pod Resize?

In the past, the CPU and memory resources allocated to a container in a Pod were immutable. This meant changing them required deleting and recreating the entire Pod. For stateful services, batch jobs, or latency-sensitive workloads, this was an incredibly disruptive operation.

In-Place Pod Resize makes CPU and memory requests and limits mutable, allowing you to adjust these resources within a running Pod, often without requiring a container restart.

Key Concept:

  • Desired Resources: A container's spec.containers[*].resources field now represents the desired resources. For CPU and memory, these fields are now mutable.
  • Actual Resources: The status.containerStatuses[*].resources field reflects the resources currently configured for a running container.
  • Triggering a Resize: You can request a resize by updating the desired requests and limits in the Pod's specification by utilizing the new resize subresource.

How can I start using in-place Pod Resize?

Detailed usage instructions and examples are provided in the official documentation: Resize CPU and Memory Resources assigned to Containers.

How does this help me?

In-place Pod Resize is a foundational building block that unlocks seamless, vertical autoscaling and improvements to workload efficiency.

  • Resources adjusted without disruption Workloads sensitive to latency or restarts can have their resources modified in-place without downtime or loss of state.
  • More powerful autoscaling Autoscalers are now empowered to adjust resources and with less impact. For example, Vertical Pod Autoscaler (VPA)'s InPlaceOrRecreate update mode, which leverages this feature, has graduated to beta. This allows resources to be adjusted automatically and seamlessly based on usage with minimal disruption.
  • Address transient resource needs Workloads that temporarily need more resources can be adjusted quickly. This enables features like the CPU Startup Boost (AEP-7862) where applications can request more CPU during startup and then automatically scale back down.

Here are a few examples of some use cases:

  • A game server that needs to adjust its size with shifting player count.
  • A pre-warmed worker that can be shrunk while unused but inflated with the first request.
  • Dynamically scale with load for efficient bin-packing.
  • Increased resources for JIT compilation on startup.

Changes between beta (1.33) and stable (1.35)

Since the initial beta in v1.33, development effort has primarily been around stabilizing the feature and improving its usability based on community feedback. Here are the primary changes for the stable release:

  • Memory limit decrease Decreasing memory limits was previously prohibited. This restriction has been lifted, and memory limit decreases are now permitted. The Kubelet attempts to prevent OOM-kills by allowing the resize only if the current memory usage is below the new desired limit. However, this check is best-effort and not guaranteed.
  • Prioritized resizes If a node doesn't have enough room to accept all resize requests, Deferred resizes are reattempted based on the following priority:
    • PriorityClass
    • QoS class
    • Duration Deferred, with older requests prioritized first.
  • Pod Level Resources (Alpha) Support for in-place Pod Resize with Pod Level Resources has been introduced behind its own feature gate, which is alpha in v1.35.
  • Increased observability: There are now new Kubelet metrics and Pod events specifically associated with In-Place Pod Resize to help users track and debug resource changes.

What's next?

The graduation of In-Place Pod Resize to stable opens the door for powerful integrations across the Kubernetes ecosystem. There are several areas for futher improvement that are currently planned.

Integration with autoscalers and other projects

There are planned integrations with several autoscalers and other projects to improve workload efficiency at a larger scale. Some projects under discussion:

  • VPA CPU startup boost (AEP-7862): Allows applications to request more CPU at startup and scale back down after a specific period of time.
  • VPA Support for in-place updates (AEP-4016): VPA support for InPlaceOrRecreate has recently graduated to beta, with the eventual goal being to graduate the feature to stable. Support for InPlace mode is still being worked on; see this pull request.
  • Ray autoscaler: Plans to leverage In-Place Pod Resize to improve workload efficiency. See this Google Cloud blog post for more details.
  • Agent-sandbox "Soft-Pause": Investigating leveraging in-place Pod Resize for better improved latency. See the Github issue for more details.
  • Runtime support: Java and Python runtimes do not support resizing memory without restart. There is an open conversation with the Java developers, see the bug.

If you have a project that could benefit from integration with in-place pod resize, please reach out using the channels listed in the feedback section!

Feature expansion

Today, In-Place Pod Resize is prohibited when used in combination with: swap, the static CPU Manager, and the static Memory Manager. Additionally, resources other than CPU and memory are still immutable. Expanding the set of supported features and resources is under consideration as more feedback about community needs comes in.

There are also plans to support workload preemption; if there is not enough room on the node for the resize of a high priority pod, the goal is to enable policies to automatically evict a lower-priority pod or upsize the node.

Improved stability

  • Resolve kubelet-scheduler race conditions There are known race conditions between the kubelet and scheduler with regards to in-place pod resize. Work is underway to resolve these issues over the next few releases. See the issue for more details.

  • Safer memory limit decrease The Kubelet's best-effort check for OOM-kill prevention can be made even safer by moving the memory usage check into the container runtime itself. See the issue for more details.

Providing feedback

Looking to further build on this foundational feature, please share your feedback on how to improve and extend this feature. You can share your feedback through GitHub issues, mailing lists, or Slack channels related to the Kubernetes #sig-node and #sig-autoscaling communities.

Thank you to everyone who contributed to making this long-awaited feature a reality!

Kubernetes v1.35: Job Managed By Goes GA

In Kubernetes v1.35, the ability to specify an external Job controller (through .spec.managedBy) graduates to General Availability.

This feature allows external controllers to take full responsibility for Job reconciliation, unlocking powerful scheduling patterns like multi-cluster dispatching with MultiKueue.

Why delegate Job reconciliation?

The primary motivation for this feature is to support multi-cluster batch scheduling architectures, such as MultiKueue.

The MultiKueue architecture distinguishes between a Management Cluster and a pool of Worker Clusters:

  • The Management Cluster is responsible for dispatching Jobs but not executing them. It needs to accept Job objects to track status, but it skips the creation and execution of Pods.
  • The Worker Clusters receive the dispatched Jobs and execute the actual Pods.
  • Users usually interact with the Management Cluster. Because the status is automatically propagated back, they can observe the Job's progress "live" without accessing the Worker Clusters.
  • In the Worker Clusters, the dispatched Jobs run as regular Jobs managed by the built-in Job controller, with no .spec.managedBy set.

By using .spec.managedBy, the MultiKueue controller on the Management Cluster can take over the reconciliation of a Job. It copies the status from the "mirror" Job running on the Worker Cluster back to the Management Cluster.

Why not just disable the Job controller? While one could theoretically achieve this by disabling the built-in Job controller entirely, this is often impossible or impractical for two reasons:

  1. Managed Control Planes: In many cloud environments, the Kubernetes control plane is locked, and users cannot modify controller manager flags.
  2. Hybrid Cluster Role: Users often need a "hybrid" mode where the Management Cluster dispatches some heavy workloads to remote clusters but still executes smaller or control-plane-related Jobs in the Management Cluster. .spec.managedBy allows this granularity on a per-Job basis.

How .spec.managedBy works

The .spec.managedBy field indicates which controller is responsible for the Job, specifically there are two modes of operation:

  • Standard: if unset or set to the reserved value kubernetes.io/job-controller, the built-in Job controller reconciles the Job as usual (standard behavior).
  • Delegation: If set to any other value, the built-in Job controller skips reconciliation entirely for that Job.

To prevent orphaned Pods or resource leaks, this field is immutable. You cannot transfer a running Job from one controller to another.

If you are looking into implementing an external controller, be aware that your controller needs to be conformant with the definitions for the Job API. In order to enforce the conformance, a significant part of the effort was to introduce the extensive Job status validation rules. Navigate to the How can you learn more? section for more details.

Ecosystem Adoption

The .spec.managedBy field is rapidly becoming the standard interface for delegating control in the Kubernetes batch ecosystem.

Various custom workload controllers are adding this field (or an equivalent) to allow MultiKueue to take over their reconciliation and orchestrate them across clusters:

While it is possible to use .spec.managedBy to implement a custom Job controller from scratch, we haven't observed that yet. The feature is specifically designed to support delegation patterns, like MultiKueue, without reinventing the wheel.

How can you learn more?

If you want to dig deeper:

Read the user-facing documentation for:

Deep dive into the design history:

Explore how MultiKueue uses .spec.managedBy in practice in the task guide for running Jobs across clusters.

Acknowledgments

As with any Kubernetes feature, a lot of people helped shape this one through design discussions, reviews, test runs, and bug reports.

We would like to thank, in particular:

Get involved

This work was sponsored by the Kubernetes Batch Working Group in close collaboration with the SIG Apps, and with strong input from the SIG Scheduling community.

If you are interested in batch scheduling, multi-cluster solutions, or further improving the Job API:

Kubernetes v1.35: Timbernetes (The World Tree Release)

Editors: Aakanksha Bhende, Arujjwal Negi, Chad M. Crowell, Graziano Casto, Swathi Rao

Similar to previous releases, the release of Kubernetes v1.35 introduces new stable, beta, and alpha features. The consistent delivery of high-quality releases underscores the strength of our development cycle and the vibrant support from our community.

This release consists of 60 enhancements, including 17 stable, 19 beta, and 22 alpha features.

There are also some deprecations and removals in this release; make sure to read about those.

2025 began in the shimmer of Octarine: The Color of Magic (v1.33) and rode the gusts Of Wind & Will (v1.34). We close the year with our hands on the World Tree, inspired by Yggdrasil, the tree of life that binds many realms. Like any great tree, Kubernetes grows ring by ring and release by release, shaped by the care of a global community.

At its center sits the Kubernetes wheel wrapped around the Earth, grounded by the resilient maintainers, contributors and users who keep showing up. Between day jobs, life changes, and steady open-source stewardship, they prune old APIs, graft new features and keep one of the world’s largest open source projects healthy.

Three squirrels guard the tree: a wizard holding the LGTM scroll for reviewers, a warrior with an axe and Kubernetes shield for the release crews who cut new branches, and a rogue with a lantern for the triagers who bring light to dark issue queues.

Together, they stand in for a much larger adventuring party. Kubernetes v1.35 adds another growth ring to the World Tree, a fresh cut shaped by many hands, many paths and a community whose branches reach higher as its roots grow deeper.

Spotlight on key updates

Kubernetes v1.35 is packed with new features and improvements. Here are a few select updates the Release Team would like to highlight!

Stable: In-place update of Pod resources

Kubernetes has graduated in-place updates for Pod resources to General Availability (GA). This feature allows users to adjust CPU and memory resources without restarting Pods or Containers. Previously, such modifications required recreating Pods, which could disrupt workloads, particularly for stateful or batch applications. Earlier Kubernetes releases allowed you to change only infrastructure resource settings (requests and limits) for existing Pods. The new in-place functionality allows for smoother, nondisruptive vertical scaling, improves efficiency, and can also simplify development.

This work was done as part of KEP #1287 led by SIG Node.

Beta: Pod certificates for workload identity and security

Previously, delivering certificates to pods required external controllers (cert-manager, SPIFFE/SPIRE), CRD orchestration, and Secret management, with rotation handled by sidecars or init containers. Kubernetes v1.35 enables native workload identity with automated certificate rotation, drastically simplifying service mesh and zero-trust architectures.

Now, the kubelet generates keys, requests certificates via PodCertificateRequest, and writes credential bundles directly to the Pod's filesystem. The kube-apiserver enforces node restriction at admission time, eliminating the most common pitfall for third-party signers: accidentally violating node isolation boundaries. This enables pure mTLS flows with no bearer tokens in the issuance path.

This work was done as part of KEP #4317 led by SIG Auth.

Alpha: Node declared features before scheduling

When control planes enable new features but nodes lag behind (permitted by Kubernetes skew policy), the scheduler can place pods requiring those features onto incompatible older nodes. The node-declaration features framework allows nodes to declare their supported Kubernetes features. With the new alpha feature enabled, a Node reports the features it supports, publishing this information to the control plane via a new .status.declaredFeatures field. Then, the kube-scheduler, admission controllers, and third-party components can use these declarations. For example, you can enforce scheduling and API validation constraints to ensure that Pods run only on compatible nodes.

This work was done as part of KEP #5328 led by SIG Node.

Features graduating to Stable

This is a selection of some of the improvements that are now stable following the v1.35 release.

PreferSameNode traffic distribution

The trafficDistribution field for Services has been updated to provide more explicit control over traffic routing. A new option, PreferSameNode, has been introduced to let services strictly prioritize endpoints on the local node if available, falling back to remote endpoints otherwise.

Simultaneously, the existing PreferClose option has been renamed to PreferSameZone. This change makes the API self-explanatory by explicitly indicating that traffic is preferred within the current availability zone. While PreferClose is preserved for backward compatibility, PreferSameZone is now the standard for zonal routing, ensuring that both node-level and zone-level preferences are clearly distinguished.

This work was done as part of KEP #3015 led by SIG Network.

Job API managed-by mechanism

The Job API now includes a managedBy field that allows an external controller to handle Job status synchronization. This feature, which graduates to stable in Kubernetes v1.35, is primarily driven by MultiKueue, a multi-cluster dispatching system where a Job created in a management cluster is mirrored and executed in a worker cluster, with status updates propagated back. To enable this workflow, the built-in Job controller must not act on a particular Job resource so that the Kueue controller can manage status updates instead.

The goal is to allow clean delegation of Job synchronization to another controller. It does not aim to pass custom parameters to that controller or modify CronJob concurrency policies.

This work was done as part of KEP #4368 led by SIG Apps.

Reliable Pod update tracking with .metadata.generation

Historically, the Pod API lacked the metadata.generation field found in other Kubernetes objects such as Deployments. Because of this omission, controllers and users had no reliable way to verify whether the kubelet had actually processed the latest changes to a Pod's specification. This ambiguity was particularly problematic for features like In-Place Pod Vertical Scaling, where it was difficult to know exactly when a resource resize request had been enacted.

Kubernetes v1.33 added .metadata.generation fields for Pods, as an alpha feature. That field is now stable in the v1.35 Pod API, which means that every time a Pod's spec is updated, the .metadata.generation value is incremented. As part of this improvement, the Pod API also gained a .status.observedGeneration field, which reports the generation that the kubelet has successfully seen and processed. Pod conditions also each contain their own individual observedGeneration field that clients can report and / or observe.

Because this feature has graduated to stable in v1.35, it is available for all workloads.

This work was done as part of KEP #5067 led by SIG Node.

Configurable NUMA node limit for topology manager

The topology manager historically used a hard-coded limit of 8 for the maximum number of NUMA nodes it can support, preventing state explosion during affinity calculation. (There's an important detail here; a NUMA node is not the same as a Node in the Kubernetes API.) This limit on the number of NUMA nodes prevented Kubernetes from fully utilizing modern high-end servers, which increasingly feature CPU architectures with more than 8 NUMA nodes.

Kubernetes v1.31 introduced a new, beta max-allowable-numa-nodes option to the topology manager policy configuration. In Kubernetes v1.35, that option is stable. Cluster administrators who enable it can use servers with more than 8 NUMA nodes.

Although the configuration option is stable, the Kubernetes community is aware of the poor performance for large NUMA hosts, and there is a proposed enhancement (KEP-5726) that aims to improve on it. You can learn more about this by reading Control Topology Management Policies on a node.

This work was done as part of KEP #4622 led by SIG Node.

New features in Beta

This is a selection of some of the improvements that are now beta following the v1.35 release.

Expose node topology labels via Downward API

Accessing node topology information, such as region and zone, from within a Pod has typically required querying the Kubernetes API server. While functional, this approach creates complexity and security risks by necessitating broad RBAC permissions or sidecar containers just to retrieve infrastructure metadata. Kubernetes v1.35 promotes the capability to expose node topology labels directly via the Downward API to beta.

The kubelet can now inject standard topology labels, such as topology.kubernetes.io/zone and topology.kubernetes.io/region, into Pods as environment variables or projected volume files. The primary benefit is a safer and more efficient way for workloads to be topology-aware. This allows applications to natively adapt to their availability zone or region without dependencies on the API server, strengthening security by upholding the principle of least privilege and simplifying cluster configuration.

Note: Kubernetes now injects available topology labels to every Pod so that they can be used as inputs to the downward API. With the v1.35 upgrade, most cluster administrators will see several new labels added to each Pod; this is expected as part of the design.

This work was done as part of KEP #4742 led by SIG Node.

Native support for storage version migration

In Kubernetes v1.35, the native support for storage version migration graduates to beta and is enabled by default. This move integrates the migration logic directly into the core Kubernetes control plane ("in-tree"), eliminating the dependency on external tools.

Historically, administrators relied on manual "read/write loops"—often piping kubectl get into kubectl replace—to update schemas or re-encrypt data at rest. This method was inefficient and prone to conflicts, especially for large resources like Secrets. With this release, the built-in controller automatically handles update conflicts and consistency tokens, providing a safe, streamlined, and reliable way to ensure stored data remains current with minimal operational overhead.

This work was done as part of KEP #4192 led by SIG API Machinery.

Mutable Volume attach limits

A CSI (Container Storage Interface) driver is a Kubernetes plugin that provides a consistent way for storage systems to be exposed to containerized workloads. The CSINode object records details about all CSI drivers installed on a node. However, a mismatch can arise between the reported and actual attachment capacity on nodes. When volume slots are consumed after a CSI driver starts up, the kube-scheduler may assign stateful pods to nodes without sufficient capacity, ultimately getting stuck in a ContainerCreating state.

Kubernetes v1.35 makes CSINode.spec.drivers[*].allocatable.count mutable so that a node’s available volume attachment capacity can be updated dynamically. It also allows CSI drivers to control how frequently the allocatable.count value is updated on all nodes by introducing a configurable refresh interval, defined through the CSIDriver object. Additionally, it automatically updates CSINode.spec.drivers[*].allocatable.count on detecting a failure in volume attachment due to insufficient capacity. Although this feature graduated to beta in v1.34 with the feature flag MutableCSINodeAllocatableCount disabled by default, it remains in beta for v1.35 to allow time for feedback, but the feature flag is enabled by default.

This work was done as part of KEP #4876 led by SIG Storage.

Opportunistic batching

Historically, the Kubernetes scheduler processes pods sequentially with time complexity of O(num pods × num nodes), which can result in redundant computation for compatible pods. This KEP introduces an opportunistic batching mechanism that aims to improve performance by identifying such compatible Pods via Pod scheduling signature and batching them together, allowing shared filtering and scoring results across them.

The pod scheduling signature ensures that two pods with the same signature are “the same” from a scheduling perspective. It takes into account not only the pod and node attributes, but also the other pods in the system and global data about the pod placement. This means that any pod with the given signature will get the same scores/feasibility results from any arbitrary set of nodes.

The batching mechanism consists of two operations that can be invoked whenever needed - create and nominate. Create leads to the creation of a new set of batch information from the scheduling results of Pods that have a valid signature. Nominate uses the batching information from create to set the nominated node name from a new Pod whose signature matches the canonical Pod’s signature.

This work was done as part of KEP #5598 led by SIG Scheduling.

maxUnavailable for StatefulSets

A StatefulSet runs a group of Pods and maintains a sticky identity for each of those Pods. This is critical for stateful workloads requiring stable network identifiers or persistent storage. When a StatefulSet's .spec.updateStrategy.<type> is set to RollingUpdate, the StatefulSet controller will delete and recreate each Pod in the StatefulSet. It will proceed in the same order as Pod termination (from the largest ordinal to the smallest), updating each Pod one at a time.

Kubernetes v1.24 added a new alpha field to a StatefulSet's rollingUpdate configuration settings, called maxUnavailable. That field wasn't part of the Kubernetes API unless your cluster administrator explicitly opted in. In Kubernetes v1.35 that field is beta and is available by default. You can use it to define the maximum number of pods that can be unavailable during an update. This setting is most effective in combination with .spec.podManagementPolicy set to Parallel. You can set maxUnavailable as either a positive number (example: 2) or a percentage of the desired number of Pods (example: 10%). If this field is not specified, it will default to 1, to maintain the previous behavior of only updating one Pod at a time. This improvement allows stateful applications (that can tolerate more than one Pod being down) to finish updating faster.

This work was done as part of KEP #961 led by SIG Apps.

Configurable credential plugin policy in kuberc

The optional kuberc file is a way to separate server configurations and cluster credentials from user preferences without disrupting already running CI pipelines with unexpected outputs.

As part of the v1.35 release, kuberc gains additional functionality which allows users to configure credential plugin policy. This change introduces two fields credentialPluginPolicy, which allows or denies all plugins, and allows specifying a list of allowed plugins using credentialPluginAllowlist.

This work was done as part of KEP #3104 as a cooperation between SIG Auth and SIG CLI.

KYAML

YAML is a human-readable format of data serialization. In Kubernetes, YAML files are used to define and configure resources, such as Pods, Services, and Deployments. However, complex YAML is difficult to read. YAML's significant whitespace requires careful attention to indentation and nesting, while its optional string-quoting can lead to unexpected type coercion (see: The Norway Bug). While JSON is an alternative, it lacks support for comments and has strict requirements for trailing commas and quoted keys.

KYAML is a safer and less ambiguous subset of YAML designed specifically for Kubernetes. Introduced as an opt-in alpha feature in v1.34, this feature graduated to beta in Kubernetes v1.35 and has been enabled by default. It can be disabled by setting the environment variable KUBECTL_KYAML=false.

KYAML addresses challenges pertaining to both YAML and JSON. All KYAML files are also valid YAML files. This means you can write KYAML and pass it as an input to any version of kubectl. This also means that you don’t need to write in strict KYAML for the input to be parsed.

This work was done as part of KEP #5295 led by SIG CLI.

Configurable tolerance for HorizontalPodAutoscalers

The Horizontal Pod Autoscaler (HPA) has historically relied on a fixed, global 10% tolerance for scaling actions. A drawback of this hardcoded value was that workloads requiring high sensitivity, such as those needing to scale on a 5% load increase, were often blocked from scaling, while others might oscillate unnecessarily.

With Kubernetes v1.35, the configurable tolerance feature graduates to beta and is enabled by default. This enhancement allows users to define a custom tolerance window on a per-resource basis within the HPA behavior field. By setting a specific tolerance (e.g., lowering it to 0.05 for 5%), operators gain precise control over autoscaling sensitivity, ensuring that critical workloads react quickly to small metric changes, without requiring cluster-wide configuration adjustments.

This work was done as part of KEP #4951 led by SIG Autoscaling.

Support for user namespaces in Pods

Kubernetes is adding support for user namespaces, allowing pods to run with isolated user and group ID mappings instead of sharing host IDs. This means containers can operate as root internally while actually being mapped to an unprivileged user on the host, reducing the risk of privilege escalation in the event of a compromise. The feature improves pod-level security and makes it safer to run workloads that need root inside the container. Over time, support has expanded to both stateless and stateful Pods through id-mapped mounts.

This work was done as part of KEP #127 led by SIG Node.

VolumeSource: OCI artifact and/or image

When creating a Pod, you often need to provide data, binaries, or configuration files for your containers. This meant including the content into the main container image or using a custom init container to download and unpack files into an emptyDir. Both these approaches are still valid. Kubernetes v1.31 added support for the image volume type allowing Pods to declaratively pull and unpack OCI container image artifacts into a volume. This lets you package and deliver data-only artifacts such as configs, binaries, or machine learning models using standard OCI registry tools.

With this feature, you can fully separate your data from your container image and remove the need for extra init containers or startup scripts. The image volume type has been in beta since v1.33 and is enabled by default in v1.35. Please note that using this feature requires a compatible container runtime, such as containerd v2.1 or later.

This work was done as part of KEP #4639 led by SIG Node.

Enforced kubelet credential verification for cached images

The imagePullPolicy: IfNotPresent setting currently allows a Pod to use a container image that is already cached on a node, even if the Pod itself does not possess the credentials to pull that image. A drawback of this behavior is that it creates a security vulnerability in multi-tenant clusters: if a Pod with valid credentials pulls a sensitive private image to a node, a subsequent unauthorized Pod on the same node can access that image simply by relying on the local cache.

This KEP introduces a mechanism where the kubelet enforces credential verification for cached images. Before allowing a Pod to use a locally cached image, the kubelet checks if the Pod has the valid credentials to pull it. This ensures that only authorized workloads can use private images, regardless of whether they are already present on the node, significantly hardening the security posture for shared clusters.

In Kubernetes v1.35, this feature has graduated to beta and is enabled by default. Users can still disable it by setting the KubeletEnsureSecretPulledImages feature gate to false. Additionally, the imagePullCredentialsVerificationPolicy flag allows operators to configure the desired security level, ranging from a mode that prioritizes backward compatibility to a strict enforcement mode that offers maximum security.

This work was done as part of KEP #2535 led by SIG Node.

Fine-grained Container restart rules

Historically, the restartPolicy field was defined strictly at the Pod level, forcing the same behavior on all containers within a Pod. A drawback of this global setting was the lack of granularity for complex workloads, such as AI/ML training jobs. These often required restartPolicy: Never for the Pod to manage job completion, yet individual containers would benefit from in-place restarts for specific, retriable errors (like network glitches or GPU init failures).

Kubernetes v1.35 addresses this by enabling restartPolicy and restartPolicyRules within the container API itself. This allows users to define restart strategies for individual regular and init containers that operate independently of the Pod's overall policy. For example, a container can now be configured to restart automatically only if it exits with a specific error code, avoiding the expensive overhead of rescheduling the entire Pod for a transient failure.

In this release, the feature has graduated to beta and is enabled by default. Users can immediately leverage restartPolicyRules in their container specifications to optimize recovery times and resource utilization for long-running workloads, without altering the broader lifecycle logic of their Pods.

This work was done as part of KEP #5307 led by SIG Node.

CSI driver opt-in for service account tokens via secrets field

Providing ServiceAccount tokens to Container Storage Interface (CSI) drivers has traditionally relied on injecting them into the volume_context field. This approach presents a significant security risk because volume_context is intended for non-sensitive configuration data and is frequently logged in plain text by drivers and debugging tools, potentially leaking credentials.

Kubernetes v1.35 introduces an opt-in mechanism for CSI drivers to receive ServiceAccount tokens via the dedicated secrets field in the NodePublishVolume request. Drivers can now enable this behavior by setting the serviceAccountTokenInSecrets field to true in their CSIDriver object, instructing the kubelet to populate the token securely.

The primary benefit is the prevention of accidental credential exposure in logs and error messages. This change ensures that sensitive workload identities are handled via the appropriate secure channels, aligning with best practices for secret management while maintaining backward compatibility for existing drivers.

This work was done as part of KEP #5538 led by SIG Auth in cooperation with SIG Storage.

Deployment status: count of terminating replicas

Historically, the Deployment status provided details on available and updated replicas but lacked explicit visibility into Pods that were in the process of shutting down. A drawback of this omission was that users and controllers could not easily distinguish between a stable Deployment and one that still had Pods executing cleanup tasks or adhering to long grace periods.

Kubernetes v1.35 promotes the terminatingReplicas field within the Deployment status to beta. This field provides a count of Pods that have a deletion timestamp set but have not yet been removed from the system. This feature is a foundational step in a larger initiative to improve how Deployments handle Pod replacement, laying the groundwork for future policies regarding when to create new Pods during a rollout.

The primary benefit is improved observability for lifecycle management tools and operators. By exposing the number of terminating Pods, external systems can now make more informed decisions such as waiting for a complete shutdown before proceeding with subsequent tasks without needing to manually query and filter individual Pod lists.

This work was done as part of KEP #3973 led by SIG Apps.

New features in Alpha

This is a selection of some of the improvements that are now alpha following the v1.35 release.

Gang scheduling support in Kubernetes

Scheduling interdependent workloads, such as AI/ML training jobs or HPC simulations, has traditionally been challenging because the default Kubernetes scheduler places Pods individually. This often leads to partial scheduling where some Pods start while others wait indefinitely for resources, resulting in deadlocks and wasted cluster capacity.

Kubernetes v1.35 introduces native support for so-called "gang scheduling" via the new Workload API and PodGroup concept. This feature implements an "all-or-nothing" scheduling strategy, ensuring that a defined group of Pods is scheduled only if the cluster has sufficient resources to accommodate the entire group simultaneously.

The primary benefit is improved reliability and efficiency for batch and parallel workloads. By preventing partial deployments, it eliminates resource deadlocks and ensures that expensive cluster capacity is utilized only when a complete job can run, significantly optimizing the orchestration of large-scale data processing tasks.

This work was done as part of KEP #4671 led by SIG Scheduling.

Constrained impersonation

Historically, the impersonate verb in Kubernetes RBAC functioned on an all-or-nothing basis: once a user was authorized to impersonate a target identity, they gained all associated permissions. A drawback of this broad authorization was that it violated the principle of least privilege, preventing administrators from restricting impersonators to specific actions or resources.

Kubernetes v1.35 introduces a new alpha feature, constrained impersonation, which adds a secondary authorization check to the impersonation flow. When enabled via the ConstrainedImpersonation feature gate, the API server verifies not only the basic impersonate permission but also checks if the impersonator is authorized for the specific action using new verb prefixes (e.g., impersonate-on:<mode>:<verb>). This allows administrators to define fine-grained policies—such as permitting a support engineer to impersonate a cluster admin solely to view logs, without granting full administrative access.

This work was done as part of KEP #5284 led by SIG Auth.

Flagz for Kubernetes components

Verifying the runtime configuration of Kubernetes components, such as the API server or kubelet, has traditionally required privileged access to the host node or process arguments. To address this, the /flagz endpoint was introduced to expose command-line options via HTTP. However, its output was initially limited to plain text, making it difficult for automated tools to parse and validate configurations reliably.

In Kubernetes v1.35, the /flagz endpoint has been enhanced to support structured, machine-readable JSON output. Authorized users can now request a versioned JSON response using standard HTTP content negotiation, while the original plain text format remains available for human inspection. This update significantly improves observability and compliance workflows, allowing external systems to programmatically audit component configurations without fragile text parsing or direct infrastructure access.

This work was done as part of KEP #4828 led by SIG Instrumentation.

Statusz for Kubernetes components

Troubleshooting Kubernetes components like the kube-apiserver or kubelet has traditionally involved parsing unstructured logs or text output, which is brittle and difficult to automate. While a basic /statusz endpoint existed previously, it lacked a standardized, machine-readable format, limiting its utility for external monitoring systems.

In Kubernetes v1.35, the /statusz endpoint has been enhanced to support structured, machine-readable JSON output. Authorized users can now request this format using standard HTTP content negotiation to retrieve precise status data—such as version information and health indicators—without relying on fragile text parsing. This improvement provides a reliable, consistent interface for automated debugging and observability tools across all core components.

This work was done as part of KEP #4827 led by SIG Instrumentation.

CCM: watch-based route controller reconciliation using informers

Managing network routes within cloud environments has traditionally relied on the Cloud Controller Manager (CCM) periodically polling the cloud provider's API to verify and update route tables. This fixed-interval reconciliation approach can be inefficient, often generating a high volume of unnecessary API calls and introducing latency between a node state change and the corresponding route update.

For the Kubernetes v1.35 release, the cloud-controller-manager library introduces a watch-based reconciliation strategy for the route controller. Instead of relying on a timer, the controller now utilizes informers to watch for specific Node events, such as additions, deletions, or relevant field updates and triggers route synchronization only when a change actually occurs.

The primary benefit is a significant reduction in cloud provider API usage, which lowers the risk of hitting rate limits and reduces operational overhead. Additionally, this event-driven model improves the responsiveness of the cluster's networking layer by ensuring that route tables are updated immediately following changes in cluster topology.

This work was done as part of KEP #5237 led by SIG Cloud Provider.

Extended toleration operators for threshold-based placement

Kubernetes v1.35 introduces SLA-aware scheduling by enabling workloads to express reliability requirements. The feature adds numeric comparison operators to tolerations, allowing pods to match or avoid nodes based on SLA-oriented taints such as service guarantees or fault-domain quality.

The primary benefit is enhancing the scheduler with more precise placement. Critical workloads can demand higher-SLA nodes, while lower priority workloads can opt into lower SLA ones. This improves utilization and reduces cost without compromising reliability.

This work was done as part of KEP #5471 led by SIG Scheduling.

Mutable container resources when Job is suspended

Running batch workloads often involves trial and error with resource limits. Currently, the Job specification is immutable, meaning that if a Job fails due to an Out of Memory (OOM) error or insufficient CPU, the user cannot simply adjust the resources; they must delete the Job and create a new one, losing the execution history and status.

Kubernetes v1.35 introduces the capability to update resource requests and limits for Jobs that are in a suspended state. Enabled via the MutablePodResourcesForSuspendedJobs feature gate, this enhancement allows users to pause a failing Job, modify its Pod template with appropriate resource values, and then resume execution with the corrected configuration.

The primary benefit is a smoother recovery workflow for misconfigured jobs. By allowing in-place corrections during suspension, users can resolve resource bottlenecks without disrupting the Job's lifecycle identity or losing track of its completion status, significantly improving the developer experience for batch processing.

This work was done as part of KEP #5440 led by SIG Apps.

Other notable changes

Continued innovation in Dynamic Resource Allocation (DRA)

The core functionality was graduated to stable in v1.34, with the ability to turn it off. In v1.35 it is always enabled. Several alpha features have also been significantly improved and are ready for testing. We encourage users to provide feedback on these capabilities to help clear the path for their target promotion to beta in upcoming releases.

Extended Resource Requests via DRA

Several functional gaps compared to Extended Resource requests via Device Plugins were addressed, for example scoring and reuse of devices in init containers.

Device Taints and Tolerations

The new "None" effect can be used to report a problem without immediately affecting scheduling or running pod. DeviceTaintRule now provides status information about an ongoing eviction. The "None" effect can be used for a "dry run" before actually evicting pods:

  • Create DeviceTaintRule with "effect: None".
  • Check the status to see how many pods would be evicted.
  • Replace "effect: None" with "effect: NoExecute".

Partitionable Devices

Devices belonging to the same partitionable devices may now be defined in different ResourceSlices. You can read more in the official documentation.

Consumable Capacity, Device Binding Conditions

Several bugs were fixed and/or more tests added. You can learn more about Consumable Capacity and Binding Conditions in the official documentation.

Comparable resource version semantics

Kubernetes v1.35 changes the way that clients are allowed to interpret resource versions.

Before v1.35, the only supported comparison that clients could make was to check for string equality: if two resource versions were equal, they were the same. Clients could also provide a resource version to the API server and ask the control plane to do internal comparisons, such as streaming all events since a particular resource version.

In v1.35, all in-tree resource versions meet a new stricter definition: the values are a special form of decimal number. And, because they can be compared, clients can do their own operations to compare two different resource versions. For example, this means that a client reconnecting after a crash can detect when it has lost updates, as distinct from the case where there has been an update but no lost changes in the meantime.

This change in semantics enables other important use cases such as storage version migration, performance improvements to informers (a client helper concept), and controller reliability. All of those cases require knowing whether one resource version is newer than another.

This work was done as part of KEP #5504 led by SIG API Machinery.

Graduations, deprecations, and removals in v1.35

Graduations to stable

This lists all the features that graduated to stable (also known as general availability). For a full list of updates including new features and graduations from alpha to beta, see the release notes.

This release includes a total of 15 enhancements promoted to stable:

Deprecations, removals and community updates

As Kubernetes develops and matures, features may be deprecated, removed, or replaced with better ones to improve the project's overall health. See the Kubernetes deprecation and removal policy for more details on this process. Kubernetes v1.35 includes a couple of deprecations.

Ingress NGINX retirement

For years, the Ingress NGINX controller has been a popular choice for routing traffic into Kubernetes clusters. It was flexible, widely adopted, and served as the standard entry point for countless applications.

However, maintaining the project has become unsustainable. With a severe shortage of maintainers and mounting technical debt, the community recently made the difficult decision to retire it. This isn't strictly part of the v1.35 release, but it's such an important change that we wanted to highlight it here.

Consequently, the Kubernetes project announced that Ingress NGINX will receive only best-effort maintenance until March 2026. After this date, it will be archived with no further updates. The recommended path forward is to migrate to the Gateway API, which offers a more modern, secure, and extensible standard for traffic management.

You can find more in the official blog post.

Removal of cgroup v1 support

When it comes to managing resources on Linux nodes, Kubernetes has historically relied on cgroups (control groups). While the original cgroup v1 was functional, it was often inconsistent and limited. That is why Kubernetes introduced support for cgroup v2 back in v1.25, offering a much cleaner, unified hierarchy and better resource isolation.

Because cgroup v2 is now the modern standard, Kubernetes is ready to retire the legacy cgroup v1 support in v1.35. This is an important notice for cluster administrators: if you are still running nodes on older Linux distributions that don't support cgroup v2, your kubelet will fail to start. To avoid downtime, you will need to migrate those nodes to systems where cgroup v2 is enabled.

To learn more, read about cgroup v2;
you can also track the switchover work via KEP-5573: Remove cgroup v1 support.

Deprecation of ipvs mode in kube-proxy

Years ago, Kubernetes adopted the ipvs mode in kube-proxy to provide faster load balancing than the standard iptables. While it offered a performance boost, keeping it in sync with evolving networking requirements created too much technical debt and complexity.

Because of this maintenance burden, Kubernetes v1.35 deprecates ipvs mode. Although the mode remains available in this release, kube-proxy will now emit a warning on startup when configured to use it. The goal is to streamline the codebase and focus on modern standards. For Linux nodes, you should begin transitioning to nftables, which is now the recommended replacement.

You can find more in KEP-5495: Deprecate ipvs mode in kube-proxy.

Final call for containerd v1.X

While Kubernetes v1.35 still supports containerd 1.7 and other LTS releases, this is the final version with such support. The SIG Node community has designated v1.35 as the last release to support the containerd v1.X series.

This serves as an important reminder: before upgrading to the next Kubernetes version, you must switch to containerd 2.0 or later. To help identify which nodes need attention, you can monitor the kubelet_cri_losing_support metric within your cluster.

You can find more in the official blog post or in KEP-4033: Discover cgroup driver from CRI.

Improved Pod stability during kubelet restarts

Previously, restarting the kubelet service often caused a temporary disruption in Pod status. During a restart, the kubelet would reset container states, causing healthy Pods to be marked as NotReady and removed from load balancers, even if the application itself was still running correctly.

To address this reliability issue, this behavior has been corrected to ensure seamless node maintenance. The kubelet now properly restores the state of existing containers from the runtime upon startup. This ensures that your workloads remain Ready and traffic continues to flow uninterrupted during kubelet restarts or upgrades.

You can find more in KEP-4781: Fix inconsistent container ready state after kubelet restart.

Release notes

Check out the full details of the Kubernetes v1.35 release in our release notes.

Availability

Kubernetes v1.35 is available for download on GitHub or on the Kubernetes download page.

To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using minikube. You can also easily install v1.35 using kubeadm.

Release team

Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.

We honor the memory of Han Kang, a long-time contributor and respected engineer whose technical excellence and infectious enthusiasm left a lasting impact on the Kubernetes community. Han was a significant force within SIG Instrumentation and SIG API Machinery, earning a 2021 Kubernetes Contributor Award for his critical work and sustained commitment to the project's core stability. Beyond his technical contributions, Han was deeply admired for his generosity as a mentor and his passion for building connections among people. He was known for "opening doors" for others, whether guiding new contributors through their first pull requests or supporting colleagues with patience and kindness. Han’s legacy lives on through the engineers he inspired, the robust systems he helped build, and the warm, collaborative spirit he fostered within the cloud native ecosystem.

We would like to thank the entire Release Team for the hours spent hard at work to deliver the Kubernetes v1.35 release to our community. The Release Team's membership ranges from first-time shadows to returning team leads with experience forged over several release cycles. We are incredibly grateful to our Release Lead, Drew Hagen, whose hands-on guidance and vibrant energy not only navigated us through complex challenges but also fueled the community spirit behind this successful release.

Project velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

During the v1.35 release cycle, which spanned 14 weeks from 15th September 2025 to 17th December 2025, Kubernetes received contributions from as many as 85 different companies and 419 individuals. In the wider cloud native ecosystem, the figure goes up to 281 companies, counting 1769 total contributors.

Note that "contribution" counts when someone makes a commit, code review, comment, creates an issue or PR, reviews a PR (including blogs and documentation) or comments on issues and PRs.
If you are interested in contributing, visit Getting Started on our contributor website.

Sources for this data:

Events update

Explore upcoming Kubernetes and cloud native events, including KubeCon + CloudNativeCon, KCD, and other notable conferences worldwide. Stay informed and get involved with the Kubernetes community!

February 2026

March 2026

May 2026

June 2026

July 2026

You can find the latest event details here. ​

Upcoming release webinar

Join members of the Kubernetes v1.35 Release Team on Wednesday, January 14, 2026, at 5:00 PM (UTC) to learn about the release highlights of this release. For more information and registration, visit the event page on the CNCF Online Programs site.

Get involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

Kubernetes v1.35 Sneak Peek

As the release of Kubernetes v1.35 approaches, the Kubernetes project continues to evolve. Features may be deprecated, removed, or replaced to improve the project's overall health. This blog post outlines planned changes for the v1.35 release that the release team believes you should be aware of to ensure the continued smooth operation of your Kubernetes cluster(s), and to keep you up to date with the latest developments. The information below is based on the current status of the v1.35 release and is subject to change before the final release date.

Deprecations and removals for Kubernetes v1.35

cgroup v1 support

On Linux nodes, container runtimes typically rely on cgroups (short for "control groups"). Support for using cgroup v2 has been stable in Kubernetes since v1.25, providing an alternative to the original v1 cgroup support. While cgroup v1 provided the initial resource control mechanism, it suffered from well-known inconsistencies and limitations. Adding support for cgroup v2 allowed use of a unified control group hierarchy, improved resource isolation, and served as the foundation for modern features, making legacy cgroup v1 support ready for removal. The removal of cgroup v1 support will only impact cluster administrators running nodes on older Linux distributions that do not support cgroup v2; on those nodes, the kubelet will fail to start. Administrators must migrate their nodes to systems with cgroup v2 enabled. More details on compatibility requirements will be available in a blog post soon after the v1.35 release.

To learn more, read about cgroup v2;
you can also track the switchover work via KEP-5573: Remove cgroup v1 support.

Deprecation of ipvs mode in kube-proxy

Many releases ago, the Kubernetes project implemented an ipvs mode in kube-proxy. It was adopted as a way to provide high-performance service load balancing, with better performance than the existing iptables mode. However, maintaining feature parity between ipvs and other kube-proxy modes became difficult, due to technical complexity and diverging requirements. This created significant technical debt and made the ipvs backend impractical to support alongside newer networking capabilities.

The Kubernetes project intends to deprecate kube-proxy ipvs mode in the v1.35 release, to streamline the kube-proxy codebase. For Linux nodes, the recommended kube-proxy mode is already nftables.

You can find more in KEP-5495: Deprecate ipvs mode in kube-proxy

Kubernetes is deprecating containerd v1.y support

While Kubernetes v1.35 still supports containerd 1.7 and other LTS releases of containerd, as a consequence of automated cgroup driver detection, the Kubernetes SIG Node community has formally agreed upon a final support timeline for containerd v1.X. Kubernetes v1.35 is the last release to offer this support (aligned with containerd 1.7 EOL).

This is a final warning that if you are using containerd 1.X, you must switch to 2.0 or later before upgrading Kubernetes to the next version. You are able to monitor the kubelet_cri_losing_support metric to determine if any nodes in your cluster are using a containerd version that will soon be unsupported.

You can find more in the official blog post or in KEP-4033: Discover cgroup driver from CRI

The following enhancements are some of those likely to be included in the v1.35 release. This is not a commitment, and the release content is subject to change.

Node declared features

When scheduling Pods, Kubernetes uses node labels, taints, and tolerations to match workload requirements with node capabilities. However, managing feature compatibility becomes challenging during cluster upgrades due to version skew between the control plane and nodes. This can lead to Pods being scheduled on nodes that lack required features, resulting in runtime failures.

The node declared features framework will introduce a standard mechanism for nodes to declare their supported Kubernetes features. With the new alpha feature enabled, a Node reports the features it can support, publishing this information to the control plane through a new .status.declaredFeatures field. Then, the kube-scheduler, admission controllers and third-party components can use these declarations. For example, you can enforce scheduling and API validation constraints, ensuring that Pods run only on compatible nodes.

This approach reduces manual node labeling, improves scheduling accuracy, and prevents incompatible pod placements proactively. It also integrates with the Cluster Autoscaler for informed scale-up decisions. Feature declarations are temporary and tied to Kubernetes feature gates, enabling safe rollout and cleanup.

Targeting alpha in v1.35, node declared features aims to solve version skew scheduling issues by making node capabilities explicit, enhancing reliability and cluster stability in heterogeneous version environments.

To learn more about this before the official documentation is published, you can read KEP-5328.

In-place update of Pod resources

Kubernetes is graduating in-place updates for Pod resources to General Availability (GA). This feature allows users to adjust cpu and memory resources without restarting Pods or Containers. Previously, such modifications required recreating Pods, which could disrupt workloads, particularly for stateful or batch applications. Previous Kubernetes releases already allowed you to change infrastructure resources settings (requests and limits) for existing Pods. This allows for smoother vertical scaling, improves efficiency, and can also simplify solution development.

The Container Runtime Interface (CRI) has also been improved, extending the UpdateContainerResources API for Windows and future runtimes while allowing ContainerStatus to report real-time resource configurations. Together, these changes make scaling in Kubernetes faster, more flexible, and disruption-free. The feature was introduced as alpha in v1.27, graduated to beta in v1.33, and is targeting graduation to stable in v1.35.

You can find more in KEP-1287: In-place Update of Pod Resources

Pod certificates

When running microservices, Pods often require a strong cryptographic identity to authenticate with each other using mutual TLS (mTLS). While Kubernetes provides Service Account tokens, these are designed for authenticating to the API server, not for general-purpose workload identity.

Before this enhancement, operators had to rely on complex, external projects like SPIFFE/SPIRE or cert-manager to provision and rotate certificates for their workloads. But what if you could issue a unique, short-lived certificate to your Pods natively and automatically? KEP-4317 is designed to enable such native workload identity. It opens up various possibilities for securing pod-to-pod communication by allowing the kubelet to request and mount certificates for a Pod via a projected volume.

This provides a built-in mechanism for workload identity, complete with automated certificate rotation, significantly simplifying the setup of service meshes and other zero-trust network policies. This feature was introduced as alpha in v1.34 and is targeting beta in v1.35.

You can find more in KEP-4317: Pod Certificates

Numeric values for taints

Kubernetes is enhancing taints and tolerations by adding numeric comparison operators, such as Gt (Greater Than) and Lt (Less Than).

Previously, tolerations supported only exact (Equal) or existence (Exists) matches, which were not suitable for numeric properties such as reliability SLAs.

With this change, a Pod can use a toleration to "opt-in" to nodes that meet a specific numeric threshold. For example, a Pod can require a Node with an SLA taint value greater than 950 (operator: Gt, value: "950").

This approach is more powerful than Node Affinity because it supports the NoExecute effect, allowing Pods to be automatically evicted if a node's numeric value drops below the tolerated threshold.

You can find more in KEP-5471: Enable SLA-based Scheduling

User namespaces

When running Pods, you can use securityContext to drop privileges, but containers inside the pod often still run as root (UID 0). This simplicity poses a significant challenge, as that container UID 0 maps directly to the host's root user.

Before this enhancement, a container breakout vulnerability could grant an attacker full root access to the node. But what if you could dynamically remap the container's root user to a safe, unprivileged user on the host? KEP-127 specifically allows such native support for Linux User Namespaces. It opens up various possibilities for pod security by isolating container and host user/group IDs. This allows a process to have root privileges (UID 0) within its namespace, while running as a non-privileged, high-numbered UID on the host.

Released as alpha in v1.25 and beta in v1.30, this feature continues to progress through beta maturity, paving the way for truly "rootless" containers that drastically reduce the attack surface for a whole class of security vulnerabilities.

You can find more in KEP-127: User Namespaces

Support for mounting OCI images as volumes

When provisioning a Pod, you often need to bundle data, binaries, or configuration files for your containers. Before this enhancement, people often included that kind of data directly into the main container image, or required a custom init container to download and unpack files into an emptyDir. You can still take either of those approaches, of course.

But what if you could populate a volume directly from a data-only artifact in an OCI registry, just like pulling a container image? Kubernetes v1.31 added support for the image volume type, allowing Pods to pull and unpack OCI container image artifacts into a volume declaratively.

This allows for seamless distribution of data, binaries, or ML models using standard registry tooling, completely decoupling data from the container image and eliminating the need for complex init containers or startup scripts. This volume type has been in beta since v1.33 and will likely be enabled by default in v1.35.

You can try out the beta version of image volumes, or you can learn more about the plans from KEP-4639: OCI Volume Source.

Want to know more?

New features and deprecations are also announced in the Kubernetes release notes. We will formally announce what's new in Kubernetes v1.35 as part of the CHANGELOG for that release.

The Kubernetes v1.35 release is planned for December 17, 2025. Stay tuned for updates!

You can also see the announcements of changes in the release notes for:

Get involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

Kubernetes Configuration Good Practices

Configuration is one of those things in Kubernetes that seems small until it's not. Configuration is at the heart of every Kubernetes workload. A missing quote, a wrong API version or a misplaced YAML indent can ruin your entire deploy.

This blog brings together tried-and-tested configuration best practices. The small habits that make your Kubernetes setup clean, consistent and easier to manage. Whether you are just starting out or already deploying apps daily, these are the little things that keep your cluster stable and your future self sane.

This blog is inspired by the original Configuration Best Practices page, which has evolved through contributions from many members of the Kubernetes community.

General configuration practices

Use the latest stable API version

Kubernetes evolves fast. Older APIs eventually get deprecated and stop working. So, whenever you are defining resources, make sure you are using the latest stable API version. You can always check with

kubectl api-resources

This simple step saves you from future compatibility issues.

Store configuration in version control

Never apply manifest files directly from your desktop. Always keep them in a version control system like Git, it's your safety net. If something breaks, you can instantly roll back to a previous commit, compare changes or recreate your cluster setup without panic.

Write configs in YAML not JSON

Write your configuration files using YAML rather than JSON. Both work technically, but YAML is just easier for humans. It's cleaner to read and less noisy and widely used in the community.

YAML has some sneaky gotchas with boolean values: Use only true or false. Don't write yes, no, on or off. They might work in one version of YAML but break in another. To be safe, quote anything that looks like a Boolean (for example "yes").

Keep configuration simple and minimal

Avoid setting default values that are already handled by Kubernetes. Minimal manifests are easier to debug, cleaner to review and less likely to break things later.

If your Deployment, Service and ConfigMap all belong to one app, put them in a single manifest file.
It's easier to track changes and apply them as a unit. See the Guestbook all-in-one.yaml file for an example of this syntax.

You can even apply entire directories with:

kubectl apply -f configs/

One command and boom everything in that folder gets deployed.

Add helpful annotations

Manifest files are not just for machines, they are for humans too. Use annotations to describe why something exists or what it does. A quick one-liner can save hours when debugging later and also allows better collaboration.

The most helpful annotation to set is kubernetes.io/description. It's like using comment, except that it gets copied into the API so that everyone else can see it even after you deploy.

Managing Workloads: Pods, Deployments, and Jobs

A common early mistake in Kubernetes is creating Pods directly. Pods work, but they don't reschedule themselves if something goes wrong.

Naked Pods (Pods not managed by a controller, such as Deployment or a StatefulSet) are fine for testing, but in real setups, they are risky.

Why? Because if the node hosting that Pod dies, the Pod dies with it and Kubernetes won't bring it back automatically.

Use Deployments for apps that should always be running

A Deployment, which both creates a ReplicaSet to ensure that the desired number of Pods is always available, and specifies a strategy to replace Pods (such as RollingUpdate), is almost always preferable to creating Pods directly. You can roll out a new version, and if something breaks, roll back instantly.

Use Jobs for tasks that should finish

A Job is perfect when you need something to run once and then stop like database migration or batch processing task. It will retry if the pods fails and report success when it's done.

Service Configuration and Networking

Services are how your workloads talk to each other inside (and sometimes outside) your cluster. Without them, your pods exist but can't reach anyone. Let's make sure that doesn't happen.

Create Services before workloads that use them

When Kubernetes starts a Pod, it automatically injects environment variables for existing Services. So, if a Pod depends on a Service, create a Service before its corresponding backend workloads (Deployments or StatefulSets), and before any workloads that need to access it.

For example, if a Service named foo exists, all containers will get the following variables in their initial environment:

FOO_SERVICE_HOST=<the host the Service runs on>
FOO_SERVICE_PORT=<the port the Service runs on>

DNS based discovery doesn't have this problem, but it's a good habit to follow anyway.

Use DNS for Service discovery

If your cluster has the DNS add-on (most do), every Service automatically gets a DNS entry. That means you can access it by name instead of IP:

curl http://my-service.default.svc.cluster.local

It's one of those features that makes Kubernetes networking feel magical.

Avoid hostPort and hostNetwork unless absolutely necessary

You'll sometimes see these options in manifests:

hostPort: 8080
hostNetwork: true

But here's the thing: They tie your Pods to specific nodes, making them harder to schedule and scale. Because each <hostIP, hostPort, protocol> combination must be unique. If you don't specify the hostIP and protocol explicitly, Kubernetes will use 0.0.0.0 as the default hostIP and TCP as the default protocol. Unless you're debugging or building something like a network plugin, avoid them.

If you just need local access for testing, try kubectl port-forward:

kubectl port-forward deployment/web 8080:80

See Use Port Forwarding to access applications in a cluster to learn more. Or if you really need external access, use a type: NodePort Service. That's the safer, Kubernetes-native way.

Use headless Services for internal discovery

Sometimes, you don't want Kubernetes to load balance traffic. You want to talk directly to each Pod. That's where headless Services come in.

You create one by setting clusterIP: None. Instead of a single IP, DNS gives you a list of all Pods IPs, perfect for apps that manage connections themselves.

Working with labels effectively

Labels are key/value pairs that are attached to objects such as Pods. Labels help you organize, query and group your resources. They don't do anything by themselves, but they make everything else from Services to Deployments work together smoothly.

Use semantics labels

Good labels help you understand what's what, even after months later. Define and use labels that identify semantic attributes of your application or Deployment. For example;

labels:
  app.kubernetes.io/name: myapp
  app.kubernetes.io/component: web
  tier: frontend
  phase: test
  • app.kubernetes.io/name : what the app is
  • tier : which layer it belongs to (frontend/backend)
  • phase : which stage it's in (test/prod)

You can then use these labels to make powerful selectors. For example:

kubectl get pods -l tier=frontend

This will list all frontend Pods across your cluster, no matter which Deployment they came from. Basically you are not manually listing Pod names; you are just describing what you want. See the guestbook app for examples of this approach.

Use common Kubernetes labels

Kubernetes actually recommends a set of common labels. It's a standardized way to name things across your different workloads or projects. Following this convention makes your manifests cleaner, and it means that tools such as Headlamp, dashboard, or third-party monitoring systems can all automatically understand what's running.

Manipulate labels for debugging

Since controllers (like ReplicaSets or Deployments) use labels to manage Pods, you can remove a label to “detach” a Pod temporarily.

Example:

kubectl label pod mypod app-

The app- part removes the label key app. Once that happens, the controller won’t manage that Pod anymore. It’s like isolating it for inspection, a “quarantine mode” for debugging. To interactively remove or add labels, use kubectl label.

You can then check logs, exec into it and once done, delete it manually. That’s a super underrated trick every Kubernetes engineer should know.

Handy kubectl tips

These small tips make life much easier when you are working with multiple manifest files or clusters.

Apply entire directories

Instead of applying one file at a time, apply the whole folder:

# Using server-side apply is also a good practice
kubectl apply -f configs/ --server-side

This command looks for .yaml, .yml and .json files in that folder and applies them all together. It's faster, cleaner and helps keep things grouped by app.

Use label selectors to get or delete resources

You don't always need to type out resource names one by one. Instead, use selectors to act on entire groups at once:

kubectl get pods -l app=myapp
kubectl delete pod -l phase=test

It's especially useful in CI/CD pipelines, where you want to clean up test resources dynamically.

Quickly create Deployments and Services

For quick experiments, you don't always need to write a manifest. You can spin up a Deployment right from the CLI:

kubectl create deployment webapp --image=nginx

Then expose it as a Service:

kubectl expose deployment webapp --port=80

This is great when you just want to test something before writing full manifests. Also, see Use a Service to Access an Application in a cluster for an example.

Conclusion

Cleaner configuration leads to calmer cluster administrators. If you stick to a few simple habits: keep configuration simple and minimal, version-control everything, use consistent labels, and avoid relying on naked Pods, you'll save yourself hours of debugging down the road.

The best part? Clean configurations stay readable. Even after months, you or anyone on your team can glance at them and know exactly what’s happening.

Ingress NGINX Retirement: What You Need to Know

To prioritize the safety and security of the ecosystem, Kubernetes SIG Network and the Security Response Committee are announcing the upcoming retirement of Ingress NGINX. Best-effort maintenance will continue until March 2026. Afterward, there will be no further releases, no bugfixes, and no updates to resolve any security vulnerabilities that may be discovered. Existing deployments of Ingress NGINX will continue to function and installation artifacts will remain available.

We recommend migrating to one of the many alternatives. Consider migrating to Gateway API, the modern replacement for Ingress. If you must continue using Ingress, many alternative Ingress controllers are listed in the Kubernetes documentation. Continue reading for further information about the history and current state of Ingress NGINX, as well as next steps.

About Ingress NGINX

Ingress is the original user-friendly way to direct network traffic to workloads running on Kubernetes. (Gateway API is a newer way to achieve many of the same goals.) In order for an Ingress to work in your cluster, there must be an Ingress controller running. There are many Ingress controller choices available, which serve the needs of different users and use cases. Some are cloud-provider specific, while others have more general applicability.

Ingress NGINX was an Ingress controller, developed early in the history of the Kubernetes project as an example implementation of the API. It became very popular due to its tremendous flexibility, breadth of features, and independence from any particular cloud or infrastructure provider. Since those days, many other Ingress controllers have been created within the Kubernetes project by community groups, and by cloud native vendors. Ingress NGINX has continued to be one of the most popular, deployed as part of many hosted Kubernetes platforms and within innumerable independent users’ clusters.

History and Challenges

The breadth and flexibility of Ingress NGINX has caused maintenance challenges. Changing expectations about cloud native software have also added complications. What were once considered helpful options have sometimes come to be considered serious security flaws, such as the ability to add arbitrary NGINX configuration directives via the "snippets" annotations. Yesterday’s flexibility has become today’s insurmountable technical debt.

Despite the project’s popularity among users, Ingress NGINX has always struggled with insufficient or barely-sufficient maintainership. For years, the project has had only one or two people doing development work, on their own time, after work hours and on weekends. Last year, the Ingress NGINX maintainers announced their plans to wind down Ingress NGINX and develop a replacement controller together with the Gateway API community. Unfortunately, even that announcement failed to generate additional interest in helping maintain Ingress NGINX or develop InGate to replace it. (InGate development never progressed far enough to create a mature replacement; it will also be retired.)

Current State and Next Steps

Currently, Ingress NGINX is receiving best-effort maintenance. SIG Network and the Security Response Committee have exhausted our efforts to find additional support to make Ingress NGINX sustainable. To prioritize user safety, we must retire the project.

In March 2026, Ingress NGINX maintenance will be halted, and the project will be retired. After that time, there will be no further releases, no bugfixes, and no updates to resolve any security vulnerabilities that may be discovered. The GitHub repositories will be made read-only and left available for reference.

Existing deployments of Ingress NGINX will not be broken. Existing project artifacts such as Helm charts and container images will remain available.

In most cases, you can check whether you use Ingress NGINX by running kubectl get pods \--all-namespaces \--selector app.kubernetes.io/name=ingress-nginx with cluster administrator permissions.

We would like to thank the Ingress NGINX maintainers for their work in creating and maintaining this project–their dedication remains impressive. This Ingress controller has powered billions of requests in datacenters and homelabs all around the world. In a lot of ways, Kubernetes wouldn’t be where it is without Ingress NGINX, and we are grateful for so many years of incredible effort.

SIG Network and the Security Response Committee recommend that all Ingress NGINX users begin migration to Gateway API or another Ingress controller immediately. Many options are listed in the Kubernetes documentation: Gateway API, Ingress. Additional options may be available from vendors you work with.

Announcing the 2025 Steering Committee Election Results

The 2025 Steering Committee Election is now complete. The Kubernetes Steering Committee consists of 7 seats, 4 of which were up for election in 2025. Incoming committee members serve a term of 2 years, and all members are elected by the Kubernetes Community.

The Steering Committee oversees the governance of the entire Kubernetes project. With that great power comes great responsibility. You can learn more about the steering committee’s role in their charter.

Thank you to everyone who voted in the election; your participation helps support the community’s continued health and success.

Results

Congratulations to the elected committee members whose two year terms begin immediately (listed in alphabetical order by GitHub handle):

They join continuing members:

Maciej Szulik and Paco Xu are returning Steering Committee Members.

Big thanks!

Thank you and congratulations on a successful election to this round’s election officers:

Thanks to the Emeritus Steering Committee Members. Your service is appreciated by the community:

And thank you to all the candidates who came forward to run for election.

Get involved with the Steering Committee

This governing body, like all of Kubernetes, is open to all. You can follow along with Steering Committee meeting notes and weigh in by filing an issue or creating a PR against their repo. They have an open meeting on the first Wednesday at 8am PT of every month. They can also be contacted at their public mailing list steering@kubernetes.io.

You can see what the Steering Committee meetings are all about by watching past meetings on the YouTube Playlist.


This post was adapted from one written by the Contributor Comms Subproject. If you want to write stories about the Kubernetes community, learn more about us.

This article was revised in November 2025 to update the information about when the steering committee meets.

Gateway API 1.4: New Features

Gateway API logo

Ready to rock your Kubernetes networking? The Kubernetes SIG Network community presented the General Availability (GA) release of Gateway API (v1.4.0)! Released on October 6, 2025, version 1.4.0 reinforces the path for modern, expressive, and extensible service networking in Kubernetes.

Gateway API v1.4.0 brings three new features to the Standard channel (Gateway API's GA release channel):

  • BackendTLSPolicy for TLS between gateways and backends
  • supportedFeatures in GatewayClass status
  • Named rules for Routes

and introduces three new experimental features:

  • Mesh resource for service mesh configuration
  • Default gateways to ease configuration burden**
  • externalAuth filter for HTTPRoute

Graduations to Standard Channel

Backend TLS policy

Leads: Candace Holman, Norwin Schnyder, Katarzyna Łach

GEP-1897: BackendTLSPolicy

BackendTLSPolicy is a new Gateway API type for specifying the TLS configuration of the connection from the Gateway to backend pod(s). . Prior to the introduction of BackendTLSPolicy, there was no API specification that allowed encrypted traffic on the hop from Gateway to backend.

The BackendTLSPolicy validation configuration requires a hostname. This hostname serves two purposes. It is used as the SNI header when connecting to the backend and for authentication, the certificate presented by the backend must match this hostname, unless subjectAltNames is explicitly specified.

If subjectAltNames (SANs) are specified, the hostname is only used for SNI, and authentication is performed against the SANs instead. If you still need to authenticate against the hostname value in this case, you MUST add it to the subjectAltNames list.

BackendTLSPolicy validation configuration also requires either caCertificateRefs or wellKnownCACertificates. caCertificateRefs refer to one or more (up to 8) PEM-encoded TLS certificate bundles. If there are no specific certificates to use, then depending on your implementation, you may use wellKnownCACertificates, set to "System" to tell the Gateway to use an implementation-specific set of trusted CA Certificates.

In this example, the BackendTLSPolicy is configured to use certificates defined in the auth-cert ConfigMap to connect with a TLS-encrypted upstream connection where pods backing the auth service are expected to serve a valid certificate for auth.example.com. It uses subjectAltNames with a Hostname type, but you may also use a URI type.

apiVersion: gateway.networking.k8s.io/v1
kind: BackendTLSPolicy
metadata:
  name: tls-upstream-auth
spec:
  targetRefs:
  - kind: Service
    name: auth
    group: ""
    sectionName: "https"
  validation:
    caCertificateRefs:
    - group: "" # core API group
      kind: ConfigMap
      name: auth-cert
    subjectAltNames: 
    - type: "Hostname"
      hostname: "auth.example.com"

In this example, the BackendTLSPolicy is configured to use system certificates to connect with a TLS-encrypted backend connection where Pods backing the dev Service are expected to serve a valid certificate for dev.example.com.

apiVersion: gateway.networking.k8s.io/v1
kind: BackendTLSPolicy
metadata:
  name: tls-upstream-dev
spec:
  targetRefs:
  - kind: Service
    name: dev
    group: ""
    sectionName: "btls"
  validation:
    wellKnownCACertificates: "System"
    hostname: dev.example.com

More information on the configuration of TLS in Gateway API can be found in Gateway API - TLS Configuration.

Status information about the features that an implementation supports

Leads: Lior Lieberman, Beka Modebadze

GEP-2162: Supported features in GatewayClass Status

GatewayClass status has a new field, supportedFeatures. This addition allows implementations to declare the set of features they support. This provides a clear way for users and tools to understand the capabilities of a given GatewayClass.

This feature's name for conformance tests (and GatewayClass status reporting) is SupportedFeatures. Implementations must populate the supportedFeatures field in the .status of the GatewayClass before the GatewayClass is accepted, or in the same operation.

Here’s an example of a supportedFeatures published under GatewayClass' .status:

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
...
status:
  conditions:
  - lastTransitionTime: "2022-11-16T10:33:06Z"
    message: Handled by Foo controller
    observedGeneration: 1
    reason: Accepted
    status: "True"
    type: Accepted
  supportedFeatures:
    - HTTPRoute
    - HTTPRouteHostRewrite
    - HTTPRoutePortRedirect
    - HTTPRouteQueryParamMatching

Graduation of SupportedFeatures to Standard, helped improve the conformance testing process for Gateway API. The conformance test suite will now automatically run tests based on the features populated in the GatewayClass' status. This creates a strong, verifiable link between an implementation's declared capabilities and the test results, making it easier for implementers to run the correct conformance tests and for users to trust the conformance reports.

This means when the SupportedFeatures field is populated in the GatewayClass status there will be no need for additional conformance tests flags like –suported-features, or –exempt or –all-features. It's important to note that Mesh features are an exception to this and can be tested for conformance by using Conformance Profiles, or by manually providing any combination of features related flags until the dedicated resource graduates from the experimental channel.

Named rules for Routes

GEP-995: Adding a new name field to all xRouteRule types (HTTPRouteRule, GRPCRouteRule, etc.)

Leads: Guilherme Cassolato

This enhancement enables route rules to be explicitly identified and referenced across the Gateway API ecosystem. Some of the key use cases include:

  • Status: Allowing status conditions to reference specific rules directly by name.
  • Observability: Making it easier to identify individual rules in logs, traces, and metrics.
  • Policies: Enabling policies (GEP-713) to target specific route rules via the sectionName field in their targetRef[s].
  • Tooling: Simplifying filtering and referencing of route rules in tools such as gwctl, kubectl, and general-purpose utilities like jq and yq.
  • Internal configuration mapping: Facilitating the generation of internal configurations that reference route rules by name within gateway and mesh implementations.

This follows the same well-established pattern already adopted for Gateway listeners, Service ports, Pods (and containers), and many other Kubernetes resources.

While the new name field is optional (so existing resources remain valid), its use is strongly encouraged. Implementations are not expected to assign a default value, but they may enforce constraints such as immutability.

Finally, keep in mind that the name format is validated, and other fields (such as sectionName) may impose additional, indirect constraints.

Experimental channel changes

Enabling external Auth for HTTPRoute

Giving Gateway API the ability to enforce authentication and maybe authorization as well at the Gateway or HTTPRoute level has been a highly requested feature for a long time. (See the GEP-1494 issue for some background.)

This Gateway API release adds an Experimental filter in HTTPRoute that tells the Gateway API implementation to call out to an external service to authenticate (and, optionally, authorize) requests.

This filter is based on the Envoy ext_authz API, and allows talking to an Auth service that uses either gRPC or HTTP for its protocol.

Both methods allow the configuration of what headers to forward to the Auth service, with the HTTP protocol allowing some extra information like a prefix path.

A HTTP example might look like this (noting that this example requires the Experimental channel to be installed and an implementation that supports External Auth to actually understand the config):

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: require-auth
  namespace: default
spec:
  parentRefs:
    - name: your-gateway-here
  rules:
    - matches:
      - path:
          type: Prefix
          value: /admin
      filters:
        - type: ExternalAuth
          externalAuth:
            protocol: HTTP
            backendRef:
              name: auth-service
            http:
              # These headers are always sent for the HTTP protocol,
              # but are included here for illustrative purposes
              allowedHeaders:
                - Host
                - Method
                - Path
                - Content-Length
                - Authorization
      backendRefs:
        - name: admin-backend
          port: 8080

This allows the backend Auth service to use the supplied headers to make a determination about the authentication for the request.

When a request is allowed, the external Auth service will respond with a 200 HTTP response code, and optionally extra headers to be included in the request that is forwarded to the backend. When the request is denied, the Auth service will respond with a 403 HTTP response.

Since the Authorization header is used in many authentication methods, this method can be used to do Basic, Oauth, JWT, and other common authentication and authorization methods.

Mesh resource

Lead(s): Flynn

GEP-3949: Mesh-wide configuration and supported features

Gateway API v1.4.0 introduces a new experimental Mesh resource, which provides a way to configure mesh-wide settings and discover the features supported by a given mesh implementation. This resource is analogous to the Gateway resource and will initially be mainly used for conformance testing, with plans to extend its use to off-cluster Gateways in the future.

The Mesh resource is cluster-scoped and, as an experimental feature, is named XMesh and resides in the gateway.networking.x-k8s.io API group. A key field is controllerName, which specifies the mesh implementation responsible for the resource. The resource's status stanza indicates whether the mesh implementation has accepted it and lists the features the mesh supports.

One of the goals of this GEP is to avoid making it more difficult for users to adopt a mesh. To simplify adoption, mesh implementations are expected to create a default Mesh resource upon startup if one with a matching controllerName doesn't already exist. This avoids the need for manual creation of the resource to begin using a mesh.

The new XMesh API kind, within the gateway.networking.x-k8s.io/v1alpha1 API group, provides a central point for mesh configuration and feature discovery (source).

A minimal XMesh object specifies the controllerName:

apiVersion: gateway.networking.x-k8s.io/v1alpha1
kind: XMesh
metadata:
  name: one-mesh-to-mesh-them-all
spec:
  controllerName: one-mesh.example.com/one-mesh

The mesh implementation populates the status field to confirm it has accepted the resource and to list its supported features ( source):

status:
  conditions:
    - type: Accepted
      status: "True"
      reason: Accepted
  supportedFeatures:
    - name: MeshHTTPRoute
    - name: OffClusterGateway

Introducing default Gateways

Lead(s): Flynn

GEP-3793: Allowing Gateways to program some routes by default.

For application developers, one common piece of feedback has been the need to explicitly name a parent Gateway for every single north-south Route. While this explicitness prevents ambiguity, it adds friction, especially for developers who just want to expose their application to the outside world without worrying about the underlying infrastructure's naming scheme. To address this, we have introduce the concept of Default Gateways.

For application developers: Just "use the default"

As an application developer, you often don't care about the specific Gateway your traffic flows through, you just want it to work. With this enhancement, you can now create a Route and simply ask it to use a default Gateway.

This is done by setting the new useDefaultGateways field in your Route's spec.

Here’s a simple HTTPRoute that uses a default Gateway:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-route
spec:
  useDefaultGateways: All
  rules:
  - backendRefs:
    - name: my-service
      port: 80

That's it! No more need to hunt down the correct Gateway name for your environment. Your Route is now a "defaulted Route."

For cluster operators: You're still in control

This feature doesn't take control away from cluster operators ("Chihiro"). In fact, they have explicit control over which Gateways can act as a default. A Gateway will only accept these defaulted Routes if it is configured to do so.

You can also use a ValidatingAdmissionPolicy to either require or even forbid for Routes to rely on a default Gateway.

As a cluster operator, you can designate a Gateway as a default by setting the (new) .spec.defaultScope field:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-default-gateway
  namespace: default
spec:
  defaultScope: All
  # ... other gateway configuration

Operators can choose to have no default Gateways, or even multiple.

How it works and key details

  • To maintain a clean, GitOps-friendly workflow, a default Gateway does not modify the spec.parentRefs of your Route. Instead, the binding is reflected in the Route's status field. You can always inspect the status.parents stanza of your Route to see exactly which Gateway or Gateways have accepted it. This preserves your original intent and avoids conflicts with CD tools.

  • The design explicitly supports having multiple Gateways designated as defaults within a cluster. When this happens, a defaulted Route will bind to all of them. This enables cluster operators to perform zero-downtime migrations and testing of new default Gateways.

  • You can create a single Route that handles both north-south traffic (traffic entering or leaving the cluster, via a default Gateway) and east-west/mesh traffic (traffic between services within the cluster), by explicitly referencing a Service in parentRefs.

Default Gateways represent a significant step forward in making the Gateway API simpler and more intuitive for everyday use cases, bridging the gap between the flexibility needed by operators and the simplicity desired by developers.

Configuring client certificate validation

Lead(s): Arko Dasgupta, Katarzyna Łach

GEP-91: Address connection coalescing security issue

This release brings updates for configuring client certificate validation, addressing a critical security vulnerability related to connection reuse. HTTP connection coalescing is a web performance optimization that allows a client to reuse an existing TLS connection for requests to different domains. While this reduces the overhead of establishing new connections, it introduces a security risk in the context of API gateways. The ability to reuse a single TLS connection across multiple Listeners brings the need to introduce shared client certificate configuration in order to avoid unauthorized access.

Why SNI-based mTLS is not the answer

One might think that using Server Name Indication (SNI) to differentiate between Listeners would solve this problem. However, TLS SNI is not a reliable mechanism for enforcing security policies in a connection coalescing scenario. A client could use a single TLS connection for multiple peer connections, as long as they are all covered by the same certificate. This means that a client could establish a connection by indicating one peer identity (using SNI), and then reuse that connection to access a different virtual host that is listening on the same IP address and port. That reuse, which is controlled by client side heuristics, could bypass mutual TLS policies that were specific to the second listener configuration.

Here's an example to help explain it:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: wildcard-tls-gateway
spec:
  gatewayClassName: example
  listeners:
  - name: foo-https
    protocol: HTTPS
    port: 443
    hostname: foo.example.com
    tls:
      certificateRefs:
      - group: "" # core API group
        kind: Secret
        name: foo-example-com-cert  # SAN: foo.example.com
  - name: wildcard-https
    protocol: HTTPS
    port: 443
    hostname: "*.example.com"
    tls:
      certificateRefs:
      - group: "" # core API group
        kind: Secret
        name: wildcard-example-com-cert # SAN: *.example.com

I have configured a Gateway with two listeners, both having overlapping hostnames. My intention is for the foo-http listener to be accessible only by clients presenting the foo-example-com-cert certificate. In contrast, the wildcard-https listener should allow access to a broader audience using any certificate valid for the *.example.com domain.

Consider a scenario where a client initially connects to foo.example.com. The server requests and successfully validates the foo-example-com-cert certificate, establishing the connection. Subsequently, the same client wishes to access other sites within this domain, such as bar.example.com, which is handled by the wildcard-https listener. Due to connection reuse, clients can access wildcard-https backends without an additional TLS handshake on the existing connection. This process functions as expected.

However, a critical security vulnerability arises when the order of access is reversed. If a client first connects to bar.example.com and presents a valid bar.example.com certificate, the connection is successfully established. If this client then attempts to access foo.example.com, the existing connection's client certificate will not be re-validated. This allows the client to bypass the specific certificate requirement for the foo backend, leading to a serious security breach.

The solution: per-port TLS configuration

The updated Gateway API gains a tls field in the .spec of a Gateway, that allows you to define a default client certificate validation configuration for all Listeners, and then if needed override it on a per-port basis. This provides a flexible and powerful way to manage your TLS policies.

Here’s a look at the updated API definitions (shown as Go source code):

// GatewaySpec defines the desired state of Gateway.
type GatewaySpec struct {
    ...
    // GatewayTLSConfig specifies frontend tls configuration for gateway.
    TLS *GatewayTLSConfig `json:"tls,omitempty"`
}

// GatewayTLSConfig specifies frontend tls configuration for gateway.
type GatewayTLSConfig struct {
    // Default specifies the default client certificate validation configuration
    Default TLSConfig `json:"default"`

    // PerPort specifies tls configuration assigned per port.
    PerPort []TLSPortConfig `json:"perPort,omitempty"`
}

// TLSPortConfig describes a TLS configuration for a specific port.
type TLSPortConfig struct {
    // The Port indicates the Port Number to which the TLS configuration will be applied.
    Port PortNumber `json:"port"`

    // TLS store the configuration that will be applied to all Listeners handling
    // HTTPS traffic and matching given port.
    TLS TLSConfig `json:"tls"`
}

Breaking changes

Standard GRPCRoute - .spec field required (technicality)

The promotion of GRPCRoute to Standard introduces a minor but technically breaking change regarding the presence of the top-level .spec field. As part of achieving Standard status, the Gateway API has tightened the OpenAPI schema validation within the GRPCRoute CustomResourceDefinition (CRD) to explicitly ensure the spec field is required for all GRPCRoute resources. This change enforces stricter conformance to Kubernetes object standards and enhances the resource's stability and predictability. While it is highly unlikely that users were attempting to define a GRPCRoute without any specification, any existing automation or manifests that might have relied on a relaxed interpretation allowing a completely absent spec field will now fail validation and must be updated to include the .spec field, even if empty.

Experimental CORS support in HTTPRoute - breaking change for allowCredentials field

The Gateway API subproject has introduced a breaking change to the Experimental CORS support in HTTPRoute, concerning the allowCredentials field within the CORS policy. This field's definition has been strictly aligned with the upstream CORS specification, which dictates that the corresponding Access-Control-Allow-Credentials header must represent a Boolean value. Previously, the implementation might have been overly permissive, potentially accepting non-standard or string representations such as true due to relaxed schema validation. Users who were configuring CORS rules must now review their manifests and ensure the value for allowCredentials strictly conforms to the new, more restrictive schema. Any existing HTTPRoute definitions that do not adhere to this stricter validation will now be rejected by the API server, requiring a configuration update to maintain functionality.

Improving the development and usage experience

As part of this release, we have improved some of the developer experience workflow:

  • Added Kube API Linter to the CI/CD pipelines, reducing the burden of API reviewers and also reducing the amount of common mistakes.
  • Improving the execution time of CRD tests with the usage of envtest.

Additionally, as part of the effort to improve Gateway API usage experience, some efforts were made to remove some ambiguities and some old tech-debts from our documentation website:

  • The API reference is now explicit when a field is experimental.
  • The GEP (GatewayAPI Enhancement Proposal) navigation bar is automatically generated, reflecting the real status of the enhancements.

Try it out

Unlike other Kubernetes APIs, you don't need to upgrade to the latest version of Kubernetes to get the latest version of Gateway API. As long as you're running Kubernetes 1.26 or later, you'll be able to get up and running with this version of Gateway API.

To try out the API, follow the Getting Started Guide.

As of this writing, seven implementations are already conformant with Gateway API v1.4.0. In alphabetical order:

Get involved

Wondering when a feature will be added? There are lots of opportunities to get involved and help define the future of Kubernetes routing APIs for both ingress and service mesh.

The maintainers would like to thank everyone who's contributed to Gateway API, whether in the form of commits to the repo, discussion, ideas, or general support. We could never have made this kind of progress without the support of this dedicated and active community.

7 Common Kubernetes Pitfalls (and How I Learned to Avoid Them)

It’s no secret that Kubernetes can be both powerful and frustrating at times. When I first started dabbling with container orchestration, I made more than my fair share of mistakes enough to compile a whole list of pitfalls. In this post, I want to walk through seven big gotchas I’ve encountered (or seen others run into) and share some tips on how to avoid them. Whether you’re just kicking the tires on Kubernetes or already managing production clusters, I hope these insights help you steer clear of a little extra stress.

1. Skipping resource requests and limits

The pitfall: Not specifying CPU and memory requirements in Pod specifications. This typically happens because Kubernetes does not require these fields, and workloads can often start and run without them—making the omission easy to overlook in early configurations or during rapid deployment cycles.

Context: In Kubernetes, resource requests and limits are critical for efficient cluster management. Resource requests ensure that the scheduler reserves the appropriate amount of CPU and memory for each pod, guaranteeing that it has the necessary resources to operate. Resource limits cap the amount of CPU and memory a pod can use, preventing any single pod from consuming excessive resources and potentially starving other pods. When resource requests and limits are not set:

  1. Resource Starvation: Pods may get insufficient resources, leading to degraded performance or failures. This is because Kubernetes schedules pods based on these requests. Without them, the scheduler might place too many pods on a single node, leading to resource contention and performance bottlenecks.
  2. Resource Hoarding: Conversely, without limits, a pod might consume more than its fair share of resources, impacting the performance and stability of other pods on the same node. This can lead to issues such as other pods getting evicted or killed by the Out-Of-Memory (OOM) killer due to lack of available memory.

How to avoid it:

  • Start with modest requests (for example 100m CPU, 128Mi memory) and see how your app behaves.
  • Monitor real-world usage and refine your values; the HorizontalPodAutoscaler can help automate scaling based on metrics.
  • Keep an eye on kubectl top pods or your logging/monitoring tool to confirm you’re not over- or under-provisioning.

My reality check: Early on, I never thought about memory limits. Things seemed fine on my local cluster. Then, on a larger environment, Pods got OOMKilled left and right. Lesson learned. For detailed instructions on configuring resource requests and limits for your containers, please refer to Assign Memory Resources to Containers and Pods (part of the official Kubernetes documentation).

2. Underestimating liveness and readiness probes

The pitfall: Deploying containers without explicitly defining how Kubernetes should check their health or readiness. This tends to happen because Kubernetes will consider a container “running” as long as the process inside hasn’t exited. Without additional signals, Kubernetes assumes the workload is functioning—even if the application inside is unresponsive, initializing, or stuck.

Context:
Liveness, readiness, and startup probes are mechanisms Kubernetes uses to monitor container health and availability.

  • Liveness probes determine if the application is still alive. If a liveness check fails, the container is restarted.
  • Readiness probes control whether a container is ready to serve traffic. Until the readiness probe passes, the container is removed from Service endpoints.
  • Startup probes help distinguish between long startup times and actual failures.

How to avoid it:

  • Add a simple HTTP livenessProbe to check a health endpoint (for example /healthz) so Kubernetes can restart a hung container.
  • Use a readinessProbe to ensure traffic doesn’t reach your app until it’s warmed up.
  • Keep probes simple. Overly complex checks can create false alarms and unnecessary restarts.

My reality check: I once forgot a readiness probe for a web service that took a while to load. Users hit it prematurely, got weird timeouts, and I spent hours scratching my head. A 3-line readiness probe would have saved the day.

For comprehensive instructions on configuring liveness, readiness, and startup probes for containers, please refer to Configure Liveness, Readiness and Startup Probes in the official Kubernetes documentation.

3. “We’ll just look at container logs” (famous last words)

The pitfall: Relying solely on container logs retrieved via kubectl logs. This often happens because the command is quick and convenient, and in many setups, logs appear accessible during development or early troubleshooting. However, kubectl logs only retrieves logs from currently running or recently terminated containers, and those logs are stored on the node’s local disk. As soon as the container is deleted, evicted, or the node is restarted, the log files may be rotated out or permanently lost.

How to avoid it:

  • Centralize logs using CNCF tools like Fluentd or Fluent Bit to aggregate output from all Pods.
  • Adopt OpenTelemetry for a unified view of logs, metrics, and (if needed) traces. This lets you spot correlations between infrastructure events and app-level behavior.
  • Pair logs with Prometheus metrics to track cluster-level data alongside application logs. If you need distributed tracing, consider CNCF projects like Jaeger.

My reality check: The first time I lost Pod logs to a quick restart, I realized how flimsy “kubectl logs” can be on its own. Since then, I’ve set up a proper pipeline for every cluster to avoid missing vital clues.

4. Treating dev and prod exactly the same

The pitfall: Deploying the same Kubernetes manifests with identical settings across development, staging, and production environments. This often occurs when teams aim for consistency and reuse, but overlook that environment-specific factors—such as traffic patterns, resource availability, scaling needs, or access control—can differ significantly. Without customization, configurations optimized for one environment may cause instability, poor performance, or security gaps in another.

How to avoid it:

  • Use environment overlays or kustomize to maintain a shared base while customizing resource requests, replicas, or config for each environment.
  • Extract environment-specific configuration into ConfigMaps and / or Secrets. You can use a specialized tool such as Sealed Secrets to manage confidential data.
  • Plan for scale in production. Your dev cluster can probably get away with minimal CPU/memory, but prod might need significantly more.

My reality check: One time, I scaled up replicaCount from 2 to 10 in a tiny dev environment just to “test.” I promptly ran out of resources and spent half a day cleaning up the aftermath. Oops.

5. Leaving old stuff floating around

The pitfall: Leaving unused or outdated resources—such as Deployments, Services, ConfigMaps, or PersistentVolumeClaims—running in the cluster. This often happens because Kubernetes does not automatically remove resources unless explicitly instructed, and there is no built-in mechanism to track ownership or expiration. Over time, these forgotten objects can accumulate, consuming cluster resources, increasing cloud costs, and creating operational confusion, especially when stale Services or LoadBalancers continue to route traffic.

How to avoid it:

  • Label everything with a purpose or owner label. That way, you can easily query resources you no longer need.
  • Regularly audit your cluster: run kubectl get all -n <namespace> to see what’s actually running, and confirm it’s all legit.
  • Adopt Kubernetes’ Garbage Collection: K8s docs show how to remove dependent objects automatically.
  • Leverage policy automation: Tools like Kyverno can automatically delete or block stale resources after a certain period, or enforce lifecycle policies so you don’t have to remember every single cleanup step.

My reality check: After a hackathon, I forgot to tear down a “test-svc” pinned to an external load balancer. Three weeks later, I realized I’d been paying for that load balancer the entire time. Facepalm.

6. Diving too deep into networking too soon

The pitfall: Introducing advanced networking solutions—such as service meshes, custom CNI plugins, or multi-cluster communication—before fully understanding Kubernetes' native networking primitives. This commonly occurs when teams implement features like traffic routing, observability, or mTLS using external tools without first mastering how core Kubernetes networking works: including Pod-to-Pod communication, ClusterIP Services, DNS resolution, and basic ingress traffic handling. As a result, network-related issues become harder to troubleshoot, especially when overlays introduce additional abstractions and failure points.

How to avoid it:

  • Start small: a Deployment, a Service, and a basic ingress controller such as one based on NGINX (e.g., Ingress-NGINX).
  • Make sure you understand how traffic flows within the cluster, how service discovery works, and how DNS is configured.
  • Only move to a full-blown mesh or advanced CNI features when you actually need them, complex networking adds overhead.

My reality check: I tried Istio on a small internal app once, then spent more time debugging Istio itself than the actual app. Eventually, I stepped back, removed Istio, and everything worked fine.

7. Going too light on security and RBAC

The pitfall: Deploying workloads with insecure configurations, such as running containers as the root user, using the latest image tag, disabling security contexts, or assigning overly broad RBAC roles like cluster-admin. These practices persist because Kubernetes does not enforce strict security defaults out of the box, and the platform is designed to be flexible rather than opinionated. Without explicit security policies in place, clusters can remain exposed to risks like container escape, unauthorized privilege escalation, or accidental production changes due to unpinned images.

How to avoid it:

  • Use RBAC to define roles and permissions within Kubernetes. While RBAC is the default and most widely supported authorization mechanism, Kubernetes also allows the use of alternative authorizers. For more advanced or external policy needs, consider solutions like OPA Gatekeeper (based on Rego), Kyverno, or custom webhooks using policy languages such as CEL or Cedar.
  • Pin images to specific versions (no more :latest!). This helps you know what’s actually deployed.
  • Look into Pod Security Admission (or other solutions like Kyverno) to enforce non-root containers, read-only filesystems, etc.

My reality check: I never had a huge security breach, but I’ve heard plenty of cautionary tales. If you don’t tighten things up, it’s only a matter of time before something goes wrong.

Final thoughts

Kubernetes is amazing, but it’s not psychic, it won’t magically do the right thing if you don’t tell it what you need. By keeping these pitfalls in mind, you’ll avoid a lot of headaches and wasted time. Mistakes happen (trust me, I’ve made my share), but each one is a chance to learn more about how Kubernetes truly works under the hood. If you’re curious to dive deeper, the official docs and the community Slack are excellent next steps. And of course, feel free to share your own horror stories or success tips, because at the end of the day, we’re all in this cloud native adventure together.

Happy Shipping!

Spotlight on Policy Working Group

(Note: The Policy Working Group has completed its mission and is no longer active. This article reflects its work, accomplishments, and insights into how a working group operates.)

In the complex world of Kubernetes, policies play a crucial role in managing and securing clusters. But have you ever wondered how these policies are developed, implemented, and standardized across the Kubernetes ecosystem? To answer that, let's take a look back at the work of the Policy Working Group.

The Policy Working Group was dedicated to a critical mission: providing an overall architecture that encompasses both current policy-related implementations and future policy proposals in Kubernetes. Their goal was both ambitious and essential: to develop a universal policy architecture that benefits developers and end-users alike.

Through collaborative methods, this working group strove to bring clarity and consistency to the often complex world of Kubernetes policies. By focusing on both existing implementations and future proposals, they ensured that the policy landscape in Kubernetes remains coherent and accessible as the technology evolves.

This blog post dives deeper into the work of the Policy Working Group, guided by insights from its former co-chairs:

Interviewed by Arujjwal Negi.

These co-chairs explained what the Policy Working Group was all about.

Introduction

Hello, thank you for the time! Let’s start with some introductions, could you tell us a bit about yourself, your role, and how you got involved in Kubernetes?

Jim Bugwadia: My name is Jim Bugwadia, and I am a co-founder and the CEO at Nirmata which provides solutions that automate security and compliance for cloud-native workloads. At Nirmata, we have been working with Kubernetes since it started in 2014. We initially built a Kubernetes policy engine in our commercial platform and later donated it to CNCF as the Kyverno project. I joined the CNCF Kubernetes Policy Working Group to help build and standardize various aspects of policy management for Kubernetes and later became a co-chair.

Andy Suderman: My name is Andy Suderman and I am the CTO of Fairwinds, a managed Kubernetes-as-a-Service provider. I began working with Kubernetes in 2016 building a web conferencing platform. I am an author and/or maintainer of several Kubernetes-related open-source projects such as Goldilocks, Pluto, and Polaris. Polaris is a JSON-schema-based policy engine, which started Fairwinds' journey into the policy space and my involvement in the Policy Working Group.

Poonam Lamba: My name is Poonam Lamba, and I currently work as a Product Manager for Google Kubernetes Engine (GKE) at Google. My journey with Kubernetes began back in 2017 when I was building an SRE platform for a large enterprise, using a private cloud built on Kubernetes. Intrigued by its potential to revolutionize the way we deployed and managed applications at the time, I dove headfirst into learning everything I could about it. Since then, I've had the opportunity to build the policy and compliance products for GKE. I lead and contribute to GKE CIS benchmarks. I am involved with the Gatekeeper project as well as I have contributed to Policy-WG for over 2 years and served as a co-chair for the group.

Responses to the following questions represent an amalgamation of insights from the former co-chairs.

About Working Groups

One thing even I am not aware of is the difference between a working group and a SIG. Can you help us understand what a working group is and how it is different from a SIG?

Unlike SIGs, working groups are temporary and focused on tackling specific, cross-cutting issues or projects that may involve multiple SIGs. Their lifespan is defined, and they disband once they've achieved their objective. Generally, working groups don't own code or have long-term responsibility for managing a particular area of the Kubernetes project.

(To know more about SIGs, visit the list of Special Interest Groups)

You mentioned that Working Groups involve multiple SIGS. What SIGS was the Policy WG closely involved with, and how did you coordinate with them?

The group collaborated closely with Kubernetes SIG Auth throughout our existence, and more recently, the group also worked with SIG Security since its formation. Our collaboration occurred in a few ways. We provided periodic updates during the SIG meetings to keep them informed of our progress and activities. Additionally, we utilize other community forums to maintain open lines of communication and ensured our work aligned with the broader Kubernetes ecosystem. This collaborative approach helped the group stay coordinated with related efforts across the Kubernetes community.

Policy WG

Why was the Policy Working Group created?

To enable a broad set of use cases, we recognize that Kubernetes is powered by a highly declarative, fine-grained, and extensible configuration management system. We've observed that a Kubernetes configuration manifest may have different portions that are important to various stakeholders. For example, some parts may be crucial for developers, while others might be of particular interest to security teams or address operational concerns. Given this complexity, we believe that policies governing the usage of these intricate configurations are essential for success with Kubernetes.

Our Policy Working Group was created specifically to research the standardization of policy definitions and related artifacts. We saw a need to bring consistency and clarity to how policies are defined and implemented across the Kubernetes ecosystem, given the diverse requirements and stakeholders involved in Kubernetes deployments.

Can you give me an idea of the work you did in the group?

We worked on several Kubernetes policy-related projects. Our initiatives included:

  • We worked on a Kubernetes Enhancement Proposal (KEP) for the Kubernetes Policy Reports API. This aims to standardize how policy reports are generated and consumed within the Kubernetes ecosystem.
  • We conducted a CNCF survey to better understand policy usage in the Kubernetes space. This helped gauge the practices and needs across the community at the time.
  • We wrote a paper that will guide users in achieving PCI-DSS compliance for containers. This is intended to help organizations meet important security standards in their Kubernetes environments.
  • We also worked on a paper highlighting how shifting security down can benefit organizations. This focuses on the advantages of implementing security measures earlier in the development and deployment process.

Can you tell us what were the main objectives of the Policy Working Group and some of your key accomplishments?

The charter of the Policy WG was to help standardize policy management for Kubernetes and educate the community on best practices.

To accomplish this we updated the Kubernetes documentation (Policies | Kubernetes), produced several whitepapers (Kubernetes Policy Management, Kubernetes GRC), and created the Policy Reports API (API reference) which standardizes reporting across various tools. Several popular tools such as Falco, Trivy, Kyverno, kube-bench, and others support the Policy Report API. A major milestone for the Policy WG was promoting the Policy Reports API to a SIG-level API or finding it a stable home.

Beyond that, as ValidatingAdmissionPolicy and MutatingAdmissionPolicy approached GA in Kubernetes, a key goal of the WG was to guide and educate the community on the tradeoffs and appropriate usage patterns for these built-in API objects and other CNCF policy management solutions like OPA/Gatekeeper and Kyverno.

Challenges

What were some of the major challenges that the Policy Working Group worked on?

During our work in the Policy Working Group, we encountered several challenges:

  • One of the main issues we faced was finding time to consistently contribute. Given that many of us have other professional commitments, it can be difficult to dedicate regular time to the working group's initiatives.

  • Another challenge we experienced was related to our consensus-driven model. While this approach ensures that all voices are heard, it can sometimes lead to slower decision-making processes. We valued thorough discussion and agreement, but this can occasionally delay progress on our projects.

  • We've also encountered occasional differences of opinion among group members. These situations require careful navigation to ensure that we maintain a collaborative and productive environment while addressing diverse viewpoints.

  • Lastly, we've noticed that newcomers to the group may find it difficult to contribute effectively without consistent attendance at our meetings. The complex nature of our work often requires ongoing context, which can be challenging for those who aren't able to participate regularly.

Can you tell me more about those challenges? How did you discover each one? What has the impact been? What were some strategies you used to address them?

There are no easy answers, but having more contributors and maintainers greatly helps! Overall the CNCF community is great to work with and is very welcoming to beginners. So, if folks out there are hesitating to get involved, I highly encourage them to attend a WG or SIG meeting and just listen in.

It often takes a few meetings to fully understand the discussions, so don't feel discouraged if you don't grasp everything right away. We made a point to emphasize this and encouraged new members to review documentation as a starting point for getting involved.

Additionally, differences of opinion were valued and encouraged within the Policy-WG. We adhered to the CNCF core values and resolve disagreements by maintaining respect for one another. We also strove to timebox our decisions and assign clear responsibilities to keep things moving forward.


This is where our discussion about the Policy Working Group ends. The working group, and especially the people who took part in this article, hope this gave you some insights into the group's aims and workings. You can get more info about Working Groups here.

Introducing Headlamp Plugin for Karpenter - Scaling and Visibility

Headlamp is an open‑source, extensible Kubernetes SIG UI project designed to let you explore, manage, and debug cluster resources.

Karpenter is a Kubernetes Autoscaling SIG node provisioning project that helps clusters scale quickly and efficiently. It launches new nodes in seconds, selects appropriate instance types for workloads, and manages the full node lifecycle, including scale-down.

The new Headlamp Karpenter Plugin adds real-time visibility into Karpenter’s activity directly from the Headlamp UI. It shows how Karpenter resources relate to Kubernetes objects, displays live metrics, and surfaces scaling events as they happen. You can inspect pending pods during provisioning, review scaling decisions, and edit Karpenter-managed resources with built-in validation. The Karpenter plugin was made as part of a LFX mentor project.

The Karpenter plugin for Headlamp aims to make it easier for Kubernetes users and operators to understand, debug, and fine-tune autoscaling behavior in their clusters. Now we will give a brief tour of the Headlamp plugin.

Map view of Karpenter Resources and how they relate to Kubernetes resources

Easily see how Karpenter Resources like NodeClasses, NodePool and NodeClaims connect with core Kubernetes resources like Pods, Nodes etc.

Map view showing relationships between resources

Visualization of Karpenter Metrics

Get instant insights of Resource Usage v/s Limits, Allowed disruptions, Pending Pods, Provisioning Latency and many more .

NodePool default metrics shown with controls to see different frequencies

NodeClaim default metrics shown with controls to see different frequencies

Scaling decisions

Shows which instances are being provisioned for your workloads and understand the reason behind why Karpenter made those choices. Helpful while debugging.

Pod Placement Decisions data including reason, from, pod, message, and age

Node decision data including Type, Reason, Node, From, Message

Config editor with validation support

Make live edits to Karpenter configurations. The editor includes diff previews and resource validation for safer adjustments.
Config editor with validation support

Real time view of Karpenter resources

View and track Karpenter specific resources in real time such as “NodeClaims” as your cluster scales up and down.

Node claims data including Name, Status, Instance Type, CPU, Zone, Age, and Actions

Node Pools data including Name, NodeClass, CPU, Memory, Nodes, Status, Age, Actions

EC2 Node Classes data including Name, Cluster, Instance Profile, Status, IAM Role, Age, and Actions

Dashboard for Pending Pods

View all pending pods with unmet scheduling requirements/Failed Scheduling highlighting why they couldn't be scheduled.

Pending Pods data including Name, Namespace, Type, Reason, From, and Message

Karpenter Providers

This plugin should work with most Karpenter providers, but has only so far been tested on the ones listed in the table. Additionally, each provider gives some extra information, and the ones in the table below are displayed by the plugin.

Provider NameTestedExtra provider specific info supported
AWS
Azure
AlibabaCloud
Bizfly Cloud
Cluster API
GCP
Proxmox
Oracle Cloud Infrastructure (OCI)

Please submit an issue if you test one of the untested providers or if you want support for this provider (PRs also gladly accepted).

How to use

Please see the plugins/karpenter/README.md for instructions on how to use.

Feedback and Questions

Please submit an issue if you use Karpenter and have any other ideas or feedback. Or come to the Kubernetes slack headlamp channel for a chat.

Announcing Changed Block Tracking API support (alpha)

We're excited to announce the alpha support for a changed block tracking mechanism. This enhances the Kubernetes storage ecosystem by providing an efficient way for CSI storage drivers to identify changed blocks in PersistentVolume snapshots. With a driver that can use the feature, you could benefit from faster and more resource-efficient backup operations.

If you're eager to try this feature, you can skip to the Getting Started section.

What is changed block tracking?

Changed block tracking enables storage systems to identify and track modifications at the block level between snapshots, eliminating the need to scan entire volumes during backup operations. The improvement is a change to the Container Storage Interface (CSI), and also to the storage support in Kubernetes itself. With the alpha feature enabled, your cluster can:

  • Identify allocated blocks within a CSI volume snapshot
  • Determine changed blocks between two snapshots of the same volume
  • Streamline backup operations by focusing only on changed data blocks

For Kubernetes users managing large datasets, this API enables significantly more efficient backup processes. Backup applications can now focus only on the blocks that have changed, rather than processing entire volumes.

Note:

As of now, the Changed Block Tracking API is supported only for block volumes and not for file volumes. CSI drivers that manage file-based storage systems will not be able to implement this capability.

Benefits of changed block tracking support in Kubernetes

As Kubernetes adoption grows for stateful workloads managing critical data, the need for efficient backup solutions becomes increasingly important. Traditional full backup approaches face challenges with:

  • Long backup windows: Full volume backups can take hours for large datasets, making it difficult to complete within maintenance windows.
  • High resource utilization: Backup operations consume substantial network bandwidth and I/O resources, especially for large data volumes and data-intensive applications.
  • Increased storage costs: Repetitive full backups store redundant data, causing storage requirements to grow linearly even when only a small percentage of data actually changes between backups.

The Changed Block Tracking API addresses these challenges by providing native Kubernetes support for incremental backup capabilities through the CSI interface.

Key components

The implementation consists of three primary components:

  1. CSI SnapshotMetadata Service API: An API, offered by gRPC, that provides volume snapshot and changed block data.
  2. SnapshotMetadataService API: A Kubernetes CustomResourceDefinition (CRD) that advertises CSI driver metadata service availability and connection details to cluster clients.
  3. External Snapshot Metadata Sidecar: An intermediary component that connects CSI drivers to backup applications via a standardized gRPC interface.

Implementation requirements

Storage provider responsibilities

If you're an author of a storage integration with Kubernetes and want to support the changed block tracking feature, you must implement specific requirements:

  1. Implement CSI RPCs: Storage providers need to implement the SnapshotMetadata service as defined in the CSI specifications protobuf. This service requires server-side streaming implementations for the following RPCs:

    • GetMetadataAllocated: For identifying allocated blocks in a snapshot
    • GetMetadataDelta: For determining changed blocks between two snapshots
  2. Storage backend capabilities: Ensure the storage backend has the capability to track and report block-level changes.

  3. Deploy external components: Integrate with the external-snapshot-metadata sidecar to expose the snapshot metadata service.

  4. Register custom resource: Register the SnapshotMetadataService resource using a CustomResourceDefinition and create a SnapshotMetadataService custom resource that advertises the availability of the metadata service and provides connection details.

  5. Support error handling: Implement proper error handling for these RPCs according to the CSI specification requirements.

Backup solution responsibilities

A backup solution looking to leverage this feature must:

  1. Set up authentication: The backup application must provide a Kubernetes ServiceAccount token when using the Kubernetes SnapshotMetadataService API. Appropriate access grants, such as RBAC RoleBindings, must be established to authorize the backup application ServiceAccount to obtain such tokens.

  2. Implement streaming client-side code: Develop clients that implement the streaming gRPC APIs defined in the schema.proto file. Specifically:

    • Implement streaming client code for GetMetadataAllocated and GetMetadataDelta methods
    • Handle server-side streaming responses efficiently as the metadata comes in chunks
    • Process the SnapshotMetadataResponse message format with proper error handling

    The external-snapshot-metadata GitHub repository provides a convenient iterator support package to simplify client implementation.

  3. Handle large dataset streaming: Design clients to efficiently handle large streams of block metadata that could be returned for volumes with significant changes.

  4. Optimize backup processes: Modify backup workflows to use the changed block metadata to identify and only transfer changed blocks to make backups more efficient, reducing both backup duration and resource consumption.

Getting started

To use changed block tracking in your cluster:

  1. Ensure your CSI driver supports volume snapshots and implements the snapshot metadata capabilities with the required external-snapshot-metadata sidecar
  2. Make sure the SnapshotMetadataService custom resource is registered using CRD
  3. Verify the presence of a SnapshotMetadataService custom resource for your CSI driver
  4. Create clients that can access the API using appropriate authentication (via Kubernetes ServiceAccount tokens)

The API provides two main functions:

  • GetMetadataAllocated: Lists blocks allocated in a single snapshot
  • GetMetadataDelta: Lists blocks changed between two snapshots

What’s next?

Depending on feedback and adoption, the Kubernetes developers hope to push the CSI Snapshot Metadata implementation to Beta in the future releases.

Where can I learn more?

For those interested in trying out this new feature:

How do I get involved?

This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. On behalf of SIG Storage, I would like to offer a huge thank you to the contributors who helped review the design and implementation of the project, including but not limited to the following:

Thank also to everyone who has contributed to the project, including others who helped review the KEP and the CSI spec PR

For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We always welcome new contributors.

The SIG also holds regular Data Protection Working Group meetings. New attendees are welcome to join our discussions.

Kubernetes v1.34: Pod Level Resources Graduated to Beta

On behalf of the Kubernetes community, I am thrilled to announce that the Pod Level Resources feature has graduated to Beta in the Kubernetes v1.34 release and is enabled by default! This significant milestone introduces a new layer of flexibility for defining and managing resource allocation for your Pods. This flexibility stems from the ability to specify CPU and memory resources for the Pod as a whole. Pod level resources can be combined with the container-level specifications to express the exact resource requirements and limits your application needs.

Pod-level specification for resources

Until recently, resource specifications that applied to Pods were primarily defined at the individual container level. While effective, this approach sometimes required duplicating or meticulously calculating resource needs across multiple containers within a single Pod. As a beta feature, Kubernetes allows you to specify the CPU, memory and hugepages resources at the Pod-level. This means you can now define resource requests and limits for an entire Pod, enabling easier resource sharing without requiring granular, per-container management of these resources where it's not needed.

Why does Pod-level specification matter?

This feature enhances resource management in Kubernetes by offering flexible resource management at both the Pod and container levels.

  • It provides a consolidated approach to resource declaration, reducing the need for meticulous, per-container management, especially for Pods with multiple containers.

  • Pod-level resources enable containers within a pod to share unused resoures amongst themselves, promoting efficient utilization within the pod. For example, it prevents sidecar containers from becoming performance bottlenecks. Previously, a sidecar (e.g., a logging agent or service mesh proxy) hitting its individual CPU limit could be throttled and slow down the entire Pod, even if the main application container had plenty of spare CPU. With pod-level resources, the sidecar and the main container can share Pod's resource budget, ensuring smooth operation during traffic spikes - either the whole Pod is throttled or all containers work.

  • When both pod-level and container-level resources are specified, pod-level requests and limits take precedence. This gives you – and cluster administrators - a powerful way to enforce overall resource boundaries for your Pods.

    For scheduling, if a pod-level request is explicitly defined, the scheduler uses that specific value to find a suitable node, insteaf of the aggregated requests of the individual containers. At runtime, the pod-level limit acts as a hard ceiling for the combined resource usage of all containers. Crucially, this pod-level limit is the absolute enforcer; even if the sum of the individual container limits is higher, the total resource consumption can never exceed the pod-level limit.

  • Pod-level resources are prioritized in influencing the Quality of Service (QoS) class of the Pod.

  • For Pods running on Linux nodes, the Out-Of-Memory (OOM) score adjustment calculation considers both pod-level and container-level resources requests.

  • Pod-level resources are designed to be compatible with existing Kubernetes functionalities, ensuring a smooth integration into your workflows.

How to specify resources for an entire Pod

Using PodLevelResources feature gate requires Kubernetes v1.34 or newer for all cluster components, including the control plane and every node. This feature gate is in beta and enabled by default in v1.34.

Example manifest

You can specify CPU, memory and hugepages resources directly in the Pod spec manifest at the resources field for the entire Pod.

Here’s an example demonstrating a Pod with both CPU and memory requests and limits defined at the Pod level:

apiVersion: v1
kind: Pod
metadata:
  name: pod-resources-demo
  namespace: pod-resources-example
spec:
  # The 'resources' field at the Pod specification level defines the overall
  # resource budget for all containers within this Pod combined.
  resources: # Pod-level resources
    # 'limits' specifies the maximum amount of resources the Pod is allowed to use.
    # The sum of the limits of all containers in the Pod cannot exceed these values.
    limits:
      cpu: "1" # The entire Pod cannot use more than 1 CPU core.
      memory: "200Mi" # The entire Pod cannot use more than 200 MiB of memory.
    # 'requests' specifies the minimum amount of resources guaranteed to the Pod.
    # This value is used by the Kubernetes scheduler to find a node with enough capacity.
    requests:
      cpu: "1" # The Pod is guaranteed 1 CPU core when scheduled.
      memory: "100Mi" # The Pod is guaranteed 100 MiB of memory when scheduled.
  containers:
  - name: main-app-container
    image: nginx
    ...
    # This container has no resource requests or limits specified.
  - name: auxiliary-container
    image: fedora
    command: ["sleep", "inf"]
    ...
    # This container has no resource requests or limits specified.

In this example, the pod-resources-demo Pod as a whole requests 1 CPU and 100 MiB of memory, and is limited to 1 CPU and 200 MiB of memory. The containers within will operate under these overall Pod-level constraints, as explained in the next section.

Interaction with container-level resource requests or limits

When both pod-level and container-level resources are specified, pod-level requests and limits take precedence. This means the node allocates resources based on the pod-level specifications.

Consider a Pod with two containers where pod-level CPU and memory requests and limits are defined, and only one container has its own explicit resource definitions:

apiVersion: v1
kind: Pod
metadata:
  name: pod-resources-demo
  namespace: pod-resources-example
spec:
  resources:
    limits:
      cpu: "1"
      memory: "200Mi"
    requests:
      cpu: "1"
      memory: "100Mi"
  containers:
  - name: main-app-container
    image: nginx
    resources:
      requests:
        cpu: "0.5"
        memory: "50Mi"
  - name: auxiliary-container
    image: fedora
    command: [ "sleep", "inf"]
    # This container has no resource requests or limits specified.
  • Pod-Level Limits: The pod-level limits (cpu: "1", memory: "200Mi") establish an absolute boundary for the entire Pod. The sum of resources consumed by all its containers is enforced at this ceiling and cannot be surpassed.

  • Resource Sharing and Bursting: Containers can dynamically borrow any unused capacity, allowing them to burst as needed, so long as the Pod's aggregate usage stays within the overall limit.

  • Pod-Level Requests: The pod-level requests (cpu: "1", memory: "100Mi") serve as the foundational resource guarantee for the entire Pod. This value informs the scheduler's placement decision and represents the minimum resources the Pod can rely on during node-level contention.

  • Container-Level Requests: Container-level requests create a priority system within the Pod's guaranteed budget. Because main-app-container has an explicit request (cpu: "0.5", memory: "50Mi"), it is given precedence for its share of resources under resource pressure over the auxiliary-container, which has no such explicit claim.

Limitations

  • First of all, in-place resize of pod-level resources is not supported for Kubernetes v1.34 (or earlier). Attempting to modify the pod-level resource limits or requests on a running Pod results in an error: the resize is rejected. The v1.34 implementation of Pod level resources focuses on allowing initial declaration of an overall resource envelope, that applies to the entire Pod. That is distinct from in-place pod resize, which (despite what the name might suggest) allows you to make dynamic adjustments to container resource requests and limits, within a running Pod, and potentially without a container restart. In-place resizing is also not yet a stable feature; it graduated to Beta in the v1.33 release.

  • Only CPU, memory, and hugepages resources can be specified at pod-level.

  • Pod-level resources are not supported for Windows pods. If the Pod specification explicitly targets Windows (e.g., by setting spec.os.name: "windows"), the API server will reject the Pod during the validation step. If the Pod is not explicitly marked for Windows but is scheduled to a Windows node (e.g., via a nodeSelector), the Kubelet on that Windows node will reject the Pod during its admission process.

  • The Topology Manager, Memory Manager and CPU Manager do not align pods and containers based on pod-level resources as these resource managers don't currently support pod-level resources.

Getting started and providing feedback

Ready to explore Pod Level Resources feature? You'll need a Kubernetes cluster running version 1.34 or later. Remember to enable the PodLevelResources feature gate across your control plane and all nodes.

As this feature moves through Beta, your feedback is invaluable. Please report any issues or share your experiences via the standard Kubernetes communication channels:

Kubernetes v1.34: Recovery From Volume Expansion Failure (GA)

Have you ever made a typo when expanding your persistent volumes in Kubernetes? Meant to specify 2TB but specified 20TiB? This seemingly innocuous problem was kinda hard to fix - and took the project almost 5 years to fix. Automated recovery from storage expansion has been around for a while in beta; however, with the v1.34 release, we have graduated this to general availability.

While it was always possible to recover from failing volume expansions manually, it usually required cluster-admin access and was tedious to do (See aformentioned link for more information).

What if you make a mistake and then realize immediately? With Kubernetes v1.34, you should be able to reduce the requested size of the PersistentVolumeClaim (PVC) and, as long as the expansion to previously requested size hadn't finished, you can amend the size requested. Kubernetes will automatically work to correct it. Any quota consumed by failed expansion will be returned to the user and the associated PersistentVolume should be resized to the latest size you specified.

I'll walk through an example of how all of this works.

Reducing PVC size to recover from failed expansion

Imagine that you are running out of disk space for one of your database servers, and you want to expand the PVC from previously specified 10TB to 100TB - but you make a typo and specify 1000TB.

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1000TB # newly specified size - but incorrect!

Now, you may be out of disk space on your disk array or simply ran out of allocated quota on your cloud-provider. But, assume that expansion to 1000TB is never going to succeed.

In Kubernetes v1.34, you can simply correct your mistake and request a new PVC size, that is smaller than the mistake, provided it is still larger than the original size of the actual PersistentVolume.

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100TB # Corrected size; has to be greater than 10TB.
                     # You cannot shrink the volume below its actual size.

This requires no admin intervention. Even better, any surplus Kubernetes quota that you temporarily consumed will be automatically returned.

This fault recovery mechanism does have a caveat: whatever new size you specify for the PVC, it must be still higher than the original size in .status.capacity. Since Kubernetes doesn't support shrinking your PV objects, you can never go below the size that was originally allocated for your PVC request.

Improved error handling and observability of volume expansion

Implementing what might look like a relatively minor change also required us to almost fully redo how volume expansion works under the hood in Kubernetes. There are new API fields available in PVC objects which you can monitor to observe progress of volume expansion.

Improved observability of in-progress expansion

You can query .status.allocatedResourceStatus['storage'] of a PVC to monitor progress of a volume expansion operation. For a typical block volume, this should transition between ControllerResizeInProgress, NodeResizePending and NodeResizeInProgress and become nil/empty when volume expansion has finished.

If for some reason, volume expansion to requested size is not feasible it should accordingly be in states like - ControllerResizeInfeasible or NodeResizeInfeasible.

You can also observe size towards which Kubernetes is working by watching pvc.status.allocatedResources.

Improved error handling and reporting

Kubernetes should now retry your failed volume expansions at slower rate, it should make fewer requests to both storage system and Kubernetes apiserver.

Errors observerd during volume expansion are now reported as condition on PVC objects and should persist unlike events. Kubernetes will now populate pvc.status.conditions with error keys ControllerResizeError or NodeResizeError when volume expansion fails.

Fixes long standing bugs in resizing workflows

This feature also has allowed us to fix long standing bugs in resizing workflow such as Kubernetes issue #115294. If you observe anything broken, please report your bugs to https://github.com/kubernetes/kubernetes/issues, along with details about how to reproduce the problem.

Working on this feature through its lifecycle was challenging and it wouldn't have been possible to reach GA without feedback from @msau42, @jsafrane and @xing-yang.

All of the contributors who worked on this also appreciate the input provided by @thockin and @liggitt at various Kubernetes contributor summits.

Kubernetes v1.34: DRA Consumable Capacity

Dynamic Resource Allocation (DRA) is a Kubernetes API for managing scarce resources across Pods and containers. It enables flexible resource requests, going beyond simply allocating N number of devices to support more granular usage scenarios. With DRA, users can request specific types of devices based on their attributes, define custom configurations tailored to their workloads, and even share the same resource among multiple containers or Pods.

In this blog, we focus on the device sharing feature and dive into a new capability introduced in Kubernetes 1.34: DRA consumable capacity, which extends DRA to support finer-grained device sharing.

Background: device sharing via ResourceClaims

From the beginning, DRA introduced the ability for multiple Pods to share a device by referencing the same ResourceClaim. This design decouples resource allocation from specific hardware, allowing for more dynamic and reusable provisioning of devices.

In Kubernetes 1.33, the new support for partitionable devices allowed resource drivers to advertise slices of a device that are available, rather than exposing the entire device as an all-or-nothing resource. This enabled Kubernetes to model shareable hardware more accurately.

But there was still a missing piece: it didn't yet support scenarios where the device driver manages fine-grained, dynamic portions of a device resource — like network bandwidth — based on user demand, or to share those resources independently of ResourceClaims, which are restricted by their spec and namespace.

That’s where consumable capacity for DRA comes in.

Benefits of DRA consumable capacity support

Here's a taste of what you get in a cluster with the DRAConsumableCapacity feature gate enabled.

Device sharing across multiple ResourceClaims or DeviceRequests

Resource drivers can now support sharing the same device — or even a slice of a device — across multiple ResourceClaims or across multiple DeviceRequests.

This means that Pods from different namespaces can simultaneously share the same device, if permitted and supported by the specific DRA driver.

Device resource allocation

Kubernetes extends the allocation algorithm in the scheduler to support allocating a portion of a device's resources, as defined in the capacity field. The scheduler ensures that the total allocated capacity across all consumers never exceeds the device’s total capacity, even when shared across multiple ResourceClaims or DeviceRequests. This is very similar to the way the scheduler allows Pods and containers to share allocatable resources on Nodes; in this case, it allows them to share allocatable (consumable) resources on Devices.

This feature expands support for scenarios where the device driver is able to manage resources within a device and on a per-process basis — for example, allocating a specific amount of memory (e.g., 8 GiB) from a virtual GPU, or setting bandwidth limits on virtual network interfaces allocated to specific Pods. This aims to provide safe and efficient resource sharing.

DistinctAttribute constraint

This feature also introduces a new constraint: DistinctAttribute, which is the complement of the existing MatchAttribute constraint.

The primary goal of DistinctAttribute is to prevent the same underlying device from being allocated multiple times within a single ResourceClaim, which could happen since we are allocating shares (or subsets) of devices. This constraint ensures that each allocation refers to a distinct resource, even if they belong to the same device class.

It is useful for use cases such as allocating network devices connecting to different subnets to expand coverage or provide redundancy across failure domains.

How to use consumable capacity?

DRAConsumableCapacity is introduced as an alpha feature in Kubernetes 1.34. The feature gate DRAConsumableCapacity must be enabled in kubelet, kube-apiserver, kube-scheduler and kube-controller-manager.

--feature-gates=...,DRAConsumableCapacity=true

As a DRA driver developer

As a DRA driver developer writing in Golang, you can make a device within a ResourceSlice allocatable to multiple ResourceClaims (or devices.requests) by setting AllowMultipleAllocations to true.

Device {
  ...
  AllowMultipleAllocations: ptr.To(true),
  ...
}

Additionally, you can define a policy to restrict how each device's Capacity should be consumed by each DeviceRequest by defining RequestPolicy field in the DeviceCapacity. The example below shows how to define a policy that requires a GPU with 40 GiB of memory to allocate at least 5 GiB per request, with each allocation in multiples of 5 GiB.

DeviceCapacity{
  Value: resource.MustParse("40Gi"),
  RequestPolicy: &CapacityRequestPolicy{
    Default: ptr.To(resource.MustParse("5Gi")),
    ValidRange: &CapacityRequestPolicyRange {
      Min: ptr.To(resource.MustParse("5Gi")),
      Step: ptr.To(resource.MustParse("5Gi")),
    }
  }
}

This will be published to the ResourceSlice, as partially shown below:

apiVersion: resource.k8s.io/v1
kind: ResourceSlice
...
spec:
  devices:
  - name: gpu0
    allowMultipleAllocations: true
    capacity:
      memory:
        value: 40Gi
        requestPolicy:
          default: 5Gi
          validRange:
            min: 5Gi
            step: 5Gi

An allocated device with a specified portion of consumed capacity will have a ShareID field set in the allocation status.

claim.Status.Allocation.Devices.Results[i].ShareID

This ShareID allows the driver to distinguish between different allocations that refer to the same device or same statically-partitioned slice but come from different ResourceClaim requests.
It acts as a unique identifier for each shared slice, enabling the driver to manage and enforce resource limits independently across multiple consumers.

As a consumer

As a consumer (or user), the device resource can be requested with a ResourceClaim like this:

apiVersion: resource.k8s.io/v1
kind: ResourceClaim
...
spec:
  devices:
    requests: # for devices
    - name: req0
      exactly:
        deviceClassName: resource.example.com
        capacity:
          requests: # for resources which must be provided by those devices
            memory: 10Gi

This configuration ensures that the requested device can provide at least 10GiB of memory.

Notably that any resource.example.com device that has at least 10GiB of memory can be allocated. If a device that does not support multiple allocations is chosen, the allocation would consume the entire device. To filter only devices that support multiple allocations, you can define a selector like this:

selectors:
  - cel:
      expression: |-
        device.allowMultipleAllocations == true

Integration with DRA device status

In device sharing, general device information is provided through the resource slice. However, some details are set dynamically after allocation. These can be conveyed using the .status.devices field of a ResourceClaim. That field is only published in clusters where the DRAResourceClaimDeviceStatus feature gate is enabled.

If you do have device status support available, a driver can expose additional device-specific information beyond the ShareID. One particularly useful use case is for virtual networks, where a driver can include the assigned IP address(es) in the status. This is valuable for both network service operations and troubleshooting.

You can find more information by watching our recording at: KubeCon Japan 2025 - Reimagining Cloud Native Networks: The Critical Role of DRA.

What can you do next?

  • Check out the CNI DRA Driver project for an example of DRA integration in Kubernetes networking. Try integrating with network resources like macvlan, ipvlan, or smart NICs.

  • Start enabling the DRAConsumableCapacity feature gate and experimenting with virtualized or partitionable devices. Specify your workloads with consumable capacity (for example: fractional bandwidth or memory).

  • Let us know your feedback:

    • ✅ What worked well?
    • ⚠️ What didn’t?

    If you encountered issues to fix or opportunities to enhance, please file a new issue and reference KEP-5075 there, or reach out via Slack (#wg-device-management).

Conclusion

Consumable capacity support enhances the device sharing capability of DRA by allowing effective device sharing across namespaces, across claims, and tailored to each Pod’s actual needs. It also empowers drivers to enforce capacity limits, improves scheduling accuracy, and unlocks new use cases like bandwidth-aware networking and multi-tenant device sharing.

Try it out, experiment with consumable resources, and help shape the future of dynamic resource allocation in Kubernetes!

Further Reading

Kubernetes v1.34: Pods Report DRA Resource Health

The rise of AI/ML and other high-performance workloads has made specialized hardware like GPUs, TPUs, and FPGAs a critical component of many Kubernetes clusters. However, as discussed in a previous blog post about navigating failures in Pods with devices, when this hardware fails, it can be difficult to diagnose, leading to significant downtime. With the release of Kubernetes v1.34, we are excited to announce a new alpha feature that brings much-needed visibility into the health of these devices.

This work extends the functionality of KEP-4680, which first introduced a mechanism for reporting the health of devices managed by Device Plugins. Now, this capability is being extended to Dynamic Resource Allocation (DRA). Controlled by the ResourceHealthStatus feature gate, this enhancement allows DRA drivers to report device health directly into a Pod's .status field, providing crucial insights for operators and developers.

Why expose device health in Pod status?

For stateful applications or long-running jobs, a device failure can be disruptive and costly. By exposing device health in the .status field for a Pod, Kubernetes provides a standardized way for users and automation tools to quickly diagnose issues. If a Pod is failing, you can now check its status to see if an unhealthy device is the root cause, saving valuable time that might otherwise be spent debugging application code.

How it works

This feature introduces a new, optional communication channel between the Kubelet and DRA drivers, built on three core components.

A new gRPC health service

A new gRPC service, DRAResourceHealth, is defined in the dra-health/v1alpha1 API group. DRA drivers can implement this service to stream device health updates to the Kubelet. The service includes a NodeWatchResources server-streaming RPC that sends the health status (Healthy, Unhealthy, or Unknown) for the devices it manages.

Kubelet integration

The Kubelet’s DRAPluginManager discovers which drivers implement the health service. For each compatible driver, it starts a long-lived NodeWatchResources stream to receive health updates. The DRA Manager then consumes these updates and stores them in a persistent healthInfoCache that can survive Kubelet restarts.

Populating the Pod status

When a device's health changes, the DRA manager identifies all Pods affected by the change and triggers a Pod status update. A new field, allocatedResourcesStatus, is now part of the v1.ContainerStatus API object. The Kubelet populates this field with the current health of each device allocated to the container.

A practical example

If a Pod is in a CrashLoopBackOff state, you can use kubectl describe pod <pod-name> to inspect its status. If an allocated device has failed, the output will now include the allocatedResourcesStatus field, clearly indicating the problem:

status:
  containerStatuses:
  - name: my-gpu-intensive-container
    # ... other container statuses
    allocatedResourcesStatus:
    - name: "claim:my-gpu-claim"
      resources:
      - resourceID: "example.com/gpu-a1b2-c3d4"
        health: "Unhealthy"

This explicit status makes it clear that the issue is with the underlying hardware, not the application.

Now you can improve the failure detection logic to react on the unhealthy devices associated with the Pod by de-scheduling a Pod.

How to use this feature

As this is an alpha feature in Kubernetes v1.34, you must take the following steps to use it:

  1. Enable the ResourceHealthStatus feature gate on your kube-apiserver and kubelets.
  2. Ensure you are using a DRA driver that implements the v1alpha1 DRAResourceHealth gRPC service.

DRA drivers

If you are developing a DRA driver, make sure to think about device failure detection strategy and ensure that your driver is integrated with this feature. This way, your driver will improve the user experience and simplify debuggability of hardware issues.

What's next?

This is the first step in a broader effort to improve how Kubernetes handles device failures. As we gather feedback on this alpha feature, the community is planning several key enhancements before graduating to Beta:

  • Detailed health messages: To improve the troubleshooting experience, we plan to add a human-readable message field to the gRPC API. This will allow DRA drivers to provide specific context for a health status, such as "GPU temperature exceeds threshold" or "NVLink connection lost".
  • Configurable health timeouts: The timeout for marking a device's health as "Unknown" is currently hardcoded. We plan to make this configurable, likely on a per-driver basis, to better accommodate the different health-reporting characteristics of various hardware.
  • Improved post-mortem troubleshooting: We will address a known limitation where health updates may not be applied to pods that have already terminated. This fix will ensure that the health status of a device at the time of failure is preserved, which is crucial for troubleshooting batch jobs and other "run-to-completion" workloads.

This feature was developed as part of KEP-4680, and community feedback is crucial as we work toward graduating it to Beta. We have more improvements of device failure handling in k8s and encourage you to try it out and share your experiences with the SIG Node community!

Kubernetes v1.34: Moving Volume Group Snapshots to v1beta2

Volume group snapshots were introduced as an Alpha feature with the Kubernetes 1.27 release and moved to Beta in the Kubernetes 1.32 release. The recent release of Kubernetes v1.34 moved that support to a second beta. The support for volume group snapshots relies on a set of extension APIs for group snapshots. These APIs allow users to take crash consistent snapshots for a set of volumes. Behind the scenes, Kubernetes uses a label selector to group multiple PersistentVolumeClaims for snapshotting. A key aim is to allow you restore that set of snapshots to new volumes and recover your workload based on a crash consistent recovery point.

This new feature is only supported for CSI volume drivers.

What's new in Beta 2?

While testing the beta version, we encountered an issue where the restoreSize field is not set for individual VolumeSnapshotContents and VolumeSnapshots if CSI driver does not implement the ListSnapshots RPC call. We evaluated various options here and decided to make this change releasing a new beta for the API.

Specifically, a VolumeSnapshotInfo struct is added in v1beta2, it contains information for an individual volume snapshot that is a member of a volume group snapshot. VolumeSnapshotInfoList, a list of VolumeSnapshotInfo, is added to VolumeGroupSnapshotContentStatus, replacing VolumeSnapshotHandlePairList. VolumeSnapshotInfoList is a list of snapshot information returned by the CSI driver to identify snapshots on the storage system. VolumeSnapshotInfoList is populated by the csi-snapshotter sidecar based on the CSI CreateVolumeGroupSnapshotResponse returned by the CSI driver's CreateVolumeGroupSnapshot call.

The existing v1beta1 API objects will be converted to the new v1beta2 API objects by a conversion webhook.

What’s next?

Depending on feedback and adoption, the Kubernetes project plans to push the volume group snapshot implementation to general availability (GA) in a future release.

How can I learn more?

How do I get involved?

This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. On behalf of SIG Storage, I would like to offer a huge thank you to the contributors who stepped up these last few quarters to help the project reach beta:

For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We always welcome new contributors.

We also hold regular Data Protection Working Group meetings. New attendees are welcome to join our discussions.

Kubernetes v1.34: Decoupled Taint Manager Is Now Stable

This enhancement separates the responsibility of managing node lifecycle and pod eviction into two distinct components. Previously, the node lifecycle controller handled both marking nodes as unhealthy with NoExecute taints and evicting pods from them. Now, a dedicated taint eviction controller manages the eviction process, while the node lifecycle controller focuses solely on applying taints. This separation not only improves code organization but also makes it easier to improve taint eviction controller or build custom implementations of the taint based eviction.

What's new?

The feature gate SeparateTaintEvictionController has been promoted to GA in this release. Users can optionally disable taint-based eviction by setting --controllers=-taint-eviction-controller in kube-controller-manager.

How can I learn more?

For more details, refer to the KEP and to the beta announcement article: Kubernetes 1.29: Decoupling taint manager from node lifecycle controller.

How to get involved?

We offer a huge thank you to all the contributors who helped with design, implementation, and review of this feature and helped move it from beta to stable:

  • Ed Bartosh (@bart0sh)
  • Yuan Chen (@yuanchen8911)
  • Aldo Culquicondor (@alculquicondor)
  • Baofa Fan (@carlory)
  • Sergey Kanzhelev (@SergeyKanzhelev)
  • Tim Bannister (@lmktfy)
  • Maciej Skoczeń (@macsko)
  • Maciej Szulik (@soltysh)
  • Wojciech Tyczynski (@wojtek-t)

Kubernetes v1.34: Autoconfiguration for Node Cgroup Driver Goes GA

Historically, configuring the correct cgroup driver has been a pain point for users running new Kubernetes clusters. On Linux systems, there are two different cgroup drivers: cgroupfs and systemd. In the past, both the kubelet and CRI implementation (like CRI-O or containerd) needed to be configured to use the same cgroup driver, or else the kubelet would misbehave without any explicit error message. This was a source of headaches for many cluster admins. Now, we've (almost) arrived at the end of that headache.

Automated cgroup driver detection

In v1.28.0, the SIG Node community introduced the feature gate KubeletCgroupDriverFromCRI, which instructs the kubelet to ask the CRI implementation which cgroup driver to use. You can read more here. After many releases of waiting for each CRI implementation to have major versions released and packaged in major operating systems, this feature has gone GA as of Kubernetes 1.34.0.

In addition to setting the feature gate, a cluster admin needs to ensure their CRI implementation is new enough:

  • containerd: Support was added in v2.0.0
  • CRI-O: Support was added in v1.28.0

Announcement: Kubernetes is deprecating containerd v1.y support

While CRI-O releases versions that match Kubernetes versions, and thus CRI-O versions without this behavior are no longer supported, containerd maintains its own release cycle. containerd support for this feature is only in v2.0 and later, but Kubernetes 1.34 still supports containerd 1.7 and other LTS releases of containerd.

The Kubernetes SIG Node community has formally agreed upon a final support timeline for containerd v1.y. The last Kubernetes release to offer this support will be the last released version of v1.35, and support will be dropped in v1.36.0. To assist administrators in managing this future transition, a new detection mechanism is available. You are able to monitor the kubelet_cri_losing_support metric to determine if any nodes in your cluster are using a containerd version that will soon be outdated. The presence of this metric with a version label of 1.36.0 will indicate that the node's containerd runtime is not new enough for the upcoming requirements. Consequently, an administrator will need to upgrade containerd to v2.0 or a later version before, or at the same time as, upgrading the kubelet to v1.36.0.

Kubernetes v1.34: Mutable CSI Node Allocatable Graduates to Beta

The functionality for CSI drivers to update information about attachable volume count on the nodes, first introduced as Alpha in Kubernetes v1.33, has graduated to Beta in the Kubernetes v1.34 release! This marks a significant milestone in enhancing the accuracy of stateful pod scheduling by reducing failures due to outdated attachable volume capacity information.

Background

Traditionally, Kubernetes CSI drivers report a static maximum volume attachment limit when initializing. However, actual attachment capacities can change during a node's lifecycle for various reasons, such as:

  • Manual or external operations attaching/detaching volumes outside of Kubernetes control.
  • Dynamically attached network interfaces or specialized hardware (GPUs, NICs, etc.) consuming available slots.
  • Multi-driver scenarios, where one CSI driver’s operations affect available capacity reported by another.

Static reporting can cause Kubernetes to schedule pods onto nodes that appear to have capacity but don't, leading to pods stuck in a ContainerCreating state.

Dynamically adapting CSI volume limits

With this new feature, Kubernetes enables CSI drivers to dynamically adjust and report node attachment capacities at runtime. This ensures that the scheduler, as well as other components relying on this information, have the most accurate, up-to-date view of node capacity.

How it works

Kubernetes supports two mechanisms for updating the reported node volume limits:

  • Periodic Updates: CSI drivers specify an interval to periodically refresh the node's allocatable capacity.
  • Reactive Updates: An immediate update triggered when a volume attachment fails due to exhausted resources (ResourceExhausted error).

Enabling the feature

To use this beta feature, the MutableCSINodeAllocatableCount feature gate must be enabled in these components:

  • kube-apiserver
  • kubelet

Example CSI driver configuration

Below is an example of configuring a CSI driver to enable periodic updates every 60 seconds:

apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: example.csi.k8s.io
spec:
  nodeAllocatableUpdatePeriodSeconds: 60

This configuration directs kubelet to periodically call the CSI driver's NodeGetInfo method every 60 seconds, updating the node’s allocatable volume count. Kubernetes enforces a minimum update interval of 10 seconds to balance accuracy and resource usage.

Immediate updates on attachment failures

When a volume attachment operation fails due to a ResourceExhausted error (gRPC code 8), Kubernetes immediately updates the allocatable count instead of waiting for the next periodic update. The Kubelet then marks the affected pods as Failed, enabling their controllers to recreate them. This prevents pods from getting permanently stuck in the ContainerCreating state.

Getting started

To enable this feature in your Kubernetes v1.34 cluster:

  1. Enable the feature gate MutableCSINodeAllocatableCount on the kube-apiserver and kubelet components.
  2. Update your CSI driver configuration by setting nodeAllocatableUpdatePeriodSeconds.
  3. Monitor and observe improvements in scheduling accuracy and pod placement reliability.

Next steps

This feature is currently in beta and the Kubernetes community welcomes your feedback. Test it, share your experiences, and help guide its evolution to GA stability.

Join discussions in the Kubernetes Storage Special Interest Group (SIG-Storage) to shape the future of Kubernetes storage capabilities.

Kubernetes v1.34: Use An Init Container To Define App Environment Variables

Kubernetes typically uses ConfigMaps and Secrets to set environment variables, which introduces additional API calls and complexity, For example, you need to separately manage the Pods of your workloads and their configurations, while ensuring orderly updates for both the configurations and the workload Pods.

Alternatively, you might be using a vendor-supplied container that requires environment variables (such as a license key or a one-time token), but you don’t want to hard-code them or mount volumes just to get the job done.

If that's the situation you are in, you now have a new (alpha) way to achieve that. Provided you have the EnvFiles feature gate enabled across your cluster, you can tell the kubelet to load a container's environment variables from a volume (the volume must be part of the Pod that the container belongs to). this feature gate allows you to load environment variables directly from a file in an emptyDir volume without actually mounting that file into the container. It’s a simple yet elegant solution to some surprisingly common problems.

What’s this all about?

At its core, this feature allows you to point your container to a file, one generated by an initContainer, and have Kubernetes parse that file to set your environment variables. The file lives in an emptyDir volume (a temporary storage space that lasts as long as the pod does), Your main container doesn’t need to mount the volume. The kubelet will read the file and inject these variables when the container starts.

How It Works

Here's a simple example:

apiVersion: v1
kind: Pod
spec:
  initContainers:
  - name: generate-config
    image: busybox
    command: ['sh', '-c', 'echo "CONFIG_VAR=HELLO" > /config/config.env']
    volumeMounts:
    - name: config-volume
      mountPath: /config
  containers:
  - name: app-container
    image: gcr.io/distroless/static
    env:
    - name: CONFIG_VAR
      valueFrom:
        fileKeyRef:
          path: config.env
          volumeName: config-volume
          key: CONFIG_VAR
  volumes:
  - name: config-volume
    emptyDir: {}

Using this approach is a breeze. You define your environment variables in the pod spec using the fileKeyRef field, which tells Kubernetes where to find the file and which key to pull. The file itself resembles the standard for .env syntax (think KEY=VALUE), and (for this alpha stage at least) you must ensure that it is written into an emptyDir volume. Other volume types aren't supported for this feature. At least one init container must mount that emptyDir volume (to write the file), but the main container doesn’t need to—it just gets the variables handed to it at startup.

A word on security

While this feature supports handling sensitive data such as keys or tokens, note that its implementation relies on emptyDir volumes mounted into pod. Operators with node filesystem access could therefore easily retrieve this sensitive data through pod directory paths.

If storing sensitive data like keys or tokens using this feature, ensure your cluster security policies effectively protect nodes against unauthorized access to prevent exposure of confidential information.

Summary

This feature will eliminate a number of complex workarounds used today, simplifying apps authoring, and opening doors for more use cases. Kubernetes stays flexible and open for feedback. Tell us how you use this feature or what is missing.

Kubernetes v1.34: Snapshottable API server cache

For years, the Kubernetes community has been on a mission to improve the stability and performance predictability of the API server. A major focus of this effort has been taming list requests, which have historically been a primary source of high memory usage and heavy load on the etcd datastore. With each release, we've chipped away at the problem, and today, we're thrilled to announce the final major piece of this puzzle.

The snapshottable API server cache feature has graduated to Beta in Kubernetes v1.34, culminating a multi-release effort to allow virtually all read requests to be served directly from the API server's cache.

Evolving the cache for performance and stability

The path to the current state involved several key enhancements over recent releases that paved the way for today's announcement.

Consistent reads from cache (Beta in v1.31)

While the API server has long used a cache for performance, a key milestone was guaranteeing consistent reads of the latest data from it. This v1.31 enhancement allowed the watch cache to be used for strongly-consistent read requests for the first time, a huge win as it enabled filtered collections (e.g. "a list of pods bound to this node") to be safely served from the cache instead of etcd, dramatically reducing its load for common workloads.

Taming large responses with streaming (Beta in v1.33)

Another key improvement was tackling the problem of memory spikes when transmitting large responses. The streaming encoder, introduced in v1.33, allowed the API server to send list items one by one, rather than buffering the entire multi-gigabyte response in memory. This made the memory cost of sending a response predictable and minimal, regardless of its size.

The missing piece

Despite these huge improvements, a critical gap remained. Any request for a historical LIST—most commonly used for paginating through large result sets—still had to bypass the cache and query etcd directly. This meant that the cost of retrieving the data was still unpredictable and could put significant memory pressure on the API server.

Kubernetes 1.34: snapshots complete the picture

The snapshottable API server cache solves this final piece of the puzzle. This feature enhances the watch cache, enabling it to generate efficient, point-in-time snapshots of its state.

Here’s how it works: for each update, the cache creates a lightweight snapshot. These snapshots are "lazy copies," meaning they don't duplicate objects but simply store pointers, making them incredibly memory-efficient.

When a list request for a historical resourceVersion arrives, the API server now finds the corresponding snapshot and serves the response directly from its memory. This closes the final major gap, allowing paginated requests to be served entirely from the cache.

A new era of API Server performance 🚀

With this final piece in place, the synergy of these three features ushers in a new era of API server predictability and performance:

  1. Get Data from Cache: Consistent reads and snapshottable cache work together to ensure nearly all read requests—whether for the latest data or a historical snapshot—are served from the API server's memory.
  2. Send data via stream: Streaming list responses ensure that sending this data to the client has a minimal and constant memory footprint.

The result is a system where the resource cost of read operations is almost fully predictable and much more resiliant to spikes in request load. This means dramatically reduced memory pressure, a lighter load on etcd, and a more stable, scalable, and reliable control plane for all Kubernetes clusters.

How to get started

With its graduation to Beta, the SnapshottableCache feature gate is enabled by default in Kubernetes v1.34. There are no actions required to start benefiting from these performance and stability improvements.

Acknowledgements

Special thanks for designing, implementing, and reviewing these critical features go to:

...and many others in SIG API Machinery. This milestone is a testament to the community's dedication to building a more scalable and robust Kubernetes.

Kubernetes v1.34: VolumeAttributesClass for Volume Modification GA

The VolumeAttributesClass API, which empowers users to dynamically modify volume attributes, has officially graduated to General Availability (GA) in Kubernetes v1.34. This marks a significant milestone, providing a robust and stable way to tune your persistent storage directly within Kubernetes.

What is VolumeAttributesClass?

At its core, VolumeAttributesClass is a cluster-scoped resource that defines a set of mutable parameters for a volume. Think of it as a "profile" for your storage, allowing cluster administrators to expose different quality-of-service (QoS) levels or performance tiers.

Users can then specify a volumeAttributesClassName in their PersistentVolumeClaim (PVC) to indicate which class of attributes they desire. The magic happens through the Container Storage Interface (CSI): when a PVC referencing a VolumeAttributesClass is updated, the associated CSI driver interacts with the underlying storage system to apply the specified changes to the volume.

This means you can now:

  • Dynamically scale performance: Increase IOPS or throughput for a busy database, or reduce it for a less critical application.
  • Optimize costs: Adjust attributes on the fly to match your current needs, avoiding over-provisioning.
  • Simplify operations: Manage volume modifications directly within the Kubernetes API, rather than relying on external tools or manual processes.

What is new from Beta to GA

There are two major enhancements from beta.

Cancellation support when errors occur

To improve resilience and user experience, the GA release introduces explicit cancel support when a requested volume modification encounters an error. If the underlying storage system or CSI driver indicates that the requested changes cannot be applied (e.g., due to invalid arguments), users can cancel the operation and revert the volume to its previous stable configuration, preventing the volume from being left in an inconsistent state.

Quota support based on scope

While VolumeAttributesClass doesn't add a new quota type, the Kubernetes control plane can be configured to enforce quotas on PersistentVolumeClaims that reference a specific VolumeAttributesClass.

This is achieved by using the scopeSelector field in a ResourceQuota to target PVCs that have .spec.volumeAttributesClassName set to a particular VolumeAttributesClass name. Please see more details here.

Drivers support VolumeAttributesClass

  • Amazon EBS CSI Driver: The AWS EBS CSI driver has robust support for VolumeAttributesClass and allows you to modify parameters like volume type (e.g., gp2 to gp3, io1 to io2), IOPS, and throughput of EBS volumes dynamically.
  • Google Compute Engine (GCE) Persistent Disk CSI Driver (pd.csi.storage.gke.io): This driver also supports dynamic modification of persistent disk attributes, including IOPS and throughput, via VolumeAttributesClass.

Contact

For any inquiries or specific questions related to VolumeAttributesClass, please reach out to the SIG Storage community.

Kubernetes v1.34: Pod Replacement Policy for Jobs Goes GA

In Kubernetes v1.34, the Pod replacement policy feature has reached general availability (GA). This blog post describes the Pod replacement policy feature and how to use it in your Jobs.

About Pod Replacement Policy

By default, the Job controller immediately recreates Pods as soon as they fail or begin terminating (when they have a deletion timestamp).

As a result, while some Pods are terminating, the total number of running Pods for a Job can temporarily exceed the specified parallelism. For Indexed Jobs, this can even mean multiple Pods running for the same index at the same time.

This behavior works fine for many workloads, but it can cause problems in certain cases.

For example, popular machine learning frameworks like TensorFlow and JAX expect exactly one Pod per worker index. If two Pods run at the same time, you might encounter errors such as:

/job:worker/task:4: Duplicate task registration with task_name=/job:worker/replica:0/task:4

Additionally, starting replacement Pods before the old ones fully terminate can lead to:

  • Scheduling delays by kube-scheduler as the nodes remain occupied.
  • Unnecessary cluster scale-ups to accommodate the replacement Pods.
  • Temporary bypassing of quota checks by workload orchestrators like Kueue.

With Pod replacement policy, Kubernetes gives you control over when the control plane replaces terminating Pods, helping you avoid these issues.

How Pod Replacement Policy works

This enhancement means that Jobs in Kubernetes have an optional field .spec.podReplacementPolicy.
You can choose one of two policies:

  • TerminatingOrFailed (default): Replaces Pods as soon as they start terminating.
  • Failed: Replaces Pods only after they fully terminate and transition to the Failed phase.

Setting the policy to Failed ensures that a new Pod is only created after the previous one has completely terminated.

For Jobs with a Pod Failure Policy, the default podReplacementPolicy is Failed, and no other value is allowed. See Pod Failure Policy to learn more about Pod Failure Policies for Jobs.

You can check how many Pods are currently terminating by inspecting the Job’s .status.terminating field:

kubectl get job myjob -o=jsonpath='{.status.terminating}'

Example

Here’s a Job example that executes a task two times (spec.completions: 2) in parallel (spec.parallelism: 2) and replaces Pods only after they fully terminate (spec.podReplacementPolicy: Failed):

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  completions: 2
  parallelism: 2
  podReplacementPolicy: Failed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: your-image

If a Pod receives a SIGTERM signal (deletion, eviction, preemption...), it begins terminating. When the container handles termination gracefully, cleanup may take some time.

When the Job starts, we will see two Pods running:

kubectl get pods

NAME                READY   STATUS    RESTARTS   AGE
example-job-qr8kf   1/1     Running   0          2s
example-job-stvb4   1/1     Running   0          2s

Let's delete one of the Pods (example-job-qr8kf).

With the TerminatingOrFailed policy, as soon as one Pod (example-job-qr8kf) starts terminating, the Job controller immediately creates a new Pod (example-job-b59zk) to replace it.

kubectl get pods

NAME                READY   STATUS        RESTARTS   AGE
example-job-b59zk   1/1     Running       0          1s
example-job-qr8kf   1/1     Terminating   0          17s
example-job-stvb4   1/1     Running       0          17s

With the Failed policy, the new Pod (example-job-b59zk) is not created while the old Pod (example-job-qr8kf) is terminating.

kubectl get pods

NAME                READY   STATUS        RESTARTS   AGE
example-job-qr8kf   1/1     Terminating   0          17s
example-job-stvb4   1/1     Running       0          17s

When the terminating Pod has fully transitioned to the Failed phase, a new Pod is created:

kubectl get pods

NAME                READY   STATUS        RESTARTS   AGE
example-job-b59zk   1/1     Running       0          1s
example-job-stvb4   1/1     Running       0          25s

How can you learn more?

Acknowledgments

As with any Kubernetes feature, multiple people contributed to getting this done, from testing and filing bugs to reviewing code.

As this feature moves to stable after 2 years, we would like to thank the following people:

Get involved

This work was sponsored by the Kubernetes batch working group in close collaboration with the SIG Apps community.

If you are interested in working on new features in the space we recommend subscribing to our Slack channel and attending the regular community meetings.

Kubernetes v1.34: PSI Metrics for Kubernetes Graduates to Beta

As Kubernetes clusters grow in size and complexity, understanding the health and performance of individual nodes becomes increasingly critical. We are excited to announce that as of Kubernetes v1.34, Pressure Stall Information (PSI) Metrics has graduated to Beta.

What is Pressure Stall Information (PSI)?

Pressure Stall Information (PSI) is a feature of the Linux kernel (version 4.20 and later) that provides a canonical way to quantify pressure on infrastructure resources, in terms of whether demand for a resource exceeds current supply. It moves beyond simple resource utilization metrics and instead measures the amount of time that tasks are stalled due to resource contention. This is a powerful way to identify and diagnose resource bottlenecks that can impact application performance.

PSI exposes metrics for CPU, memory, and I/O, categorized as either some or full pressure:

some
The percentage of time that at least one task is stalled on a resource. This indicates some level of resource contention.
full
The percentage of time that all non-idle tasks are stalled on a resource simultaneously. This indicates a more severe resource bottleneck.
Diagram illustrating the difference between 'some' and 'full' PSI pressure.

PSI: 'Some' vs. 'Full' Pressure

These metrics are aggregated over 10-second, 1-minute, and 5-minute rolling windows, providing a comprehensive view of resource pressure over time.

PSI metrics in Kubernetes

With the KubeletPSI feature gate enabled, the kubelet can now collect PSI metrics from the Linux kernel and expose them through two channels: the Summary API and the /metrics/cadvisor Prometheus endpoint. This allows you to monitor and alert on resource pressure at the node, pod, and container level.

The following new metrics are available in Prometheus exposition format via /metrics/cadvisor:

  • container_pressure_cpu_stalled_seconds_total
  • container_pressure_cpu_waiting_seconds_total
  • container_pressure_memory_stalled_seconds_total
  • container_pressure_memory_waiting_seconds_total
  • container_pressure_io_stalled_seconds_total
  • container_pressure_io_waiting_seconds_total

These metrics, along with the data from the Summary API, provide a granular view of resource pressure, enabling you to pinpoint the source of performance issues and take corrective action. For example, you can use these metrics to:

  • Identify memory leaks: A steadily increasing some pressure for memory can indicate a memory leak in an application.
  • Optimize resource requests and limits: By understanding the resource pressure of your workloads, you can more accurately tune their resource requests and limits.
  • Autoscale workloads: You can use PSI metrics to trigger autoscaling events, ensuring that your workloads have the resources they need to perform optimally.

How to enable PSI metrics

To enable PSI metrics in your Kubernetes cluster, you need to:

  1. Ensure your nodes are running a Linux kernel version 4.20 or later and are using cgroup v2.
  2. Enable the KubeletPSI feature gate on the kubelet.

Once enabled, you can start scraping the /metrics/cadvisor endpoint with your Prometheus-compatible monitoring solution or query the Summary API to collect and visualize the new PSI metrics. Note that PSI is a Linux-kernel feature, so these metrics are not available on Windows nodes. Your cluster can contain a mix of Linux and Windows nodes, and on the Windows nodes the kubelet does not expose PSI metrics.

What's next?

We are excited to bring PSI metrics to the Kubernetes community and look forward to your feedback. As a beta feature, we are actively working on improving and extending this functionality towards a stable GA release. We encourage you to try it out and share your experiences with us.

To learn more about PSI metrics, check out the official Kubernetes documentation. You can also get involved in the conversation on the #sig-node Slack channel.

Kubernetes v1.34: Service Account Token Integration for Image Pulls Graduates to Beta

The Kubernetes community continues to advance security best practices by reducing reliance on long-lived credentials. Following the successful alpha release in Kubernetes v1.33, Service Account Token Integration for Kubelet Credential Providers has now graduated to beta in Kubernetes v1.34, bringing us closer to eliminating long-lived image pull secrets from Kubernetes clusters.

This enhancement allows credential providers to use workload-specific service account tokens to obtain registry credentials, providing a secure, ephemeral alternative to traditional image pull secrets.

What's new in beta?

The beta graduation brings several important changes that make the feature more robust and production-ready:

Required cacheType field

Breaking change from alpha: The cacheType field is required in the credential provider configuration when using service account tokens. This field is new in beta and must be specified to ensure proper caching behavior.

# CAUTION: this is not a complete configuration example, just a reference for the 'tokenAttributes.cacheType' field.
tokenAttributes:
  serviceAccountTokenAudience: "my-registry-audience"
  cacheType: "ServiceAccount"  # Required field in beta
  requireServiceAccount: true

Choose between two caching strategies:

  • Token: Cache credentials per service account token (use when credential lifetime is tied to the token). This is useful when the credential provider transforms the service account token into registry credentials with the same lifetime as the token, or when registries support Kubernetes service account tokens directly. Note: The kubelet cannot send service account tokens directly to registries; credential provider plugins are needed to transform tokens into the username/password format expected by registries.
  • ServiceAccount: Cache credentials per service account identity (use when credentials are valid for all pods using the same service account)

Isolated image pull credentials

The beta release provides stronger security isolation for container images when using service account tokens for image pulls. It ensures that pods can only access images that were pulled using ServiceAccounts they're authorized to use. This prevents unauthorized access to sensitive container images and enables granular access control where different workloads can have different registry permissions based on their ServiceAccount.

When credential providers use service account tokens, the system tracks ServiceAccount identity (namespace, name, and UID) for each pulled image. When a pod attempts to use a cached image, the system verifies that the pod's ServiceAccount matches exactly with the ServiceAccount that was used to originally pull the image.

Administrators can revoke access to previously pulled images by deleting and recreating the ServiceAccount, which changes the UID and invalidates cached image access.

For more details about this capability, see the image pull credential verification documentation.

How it works

Configuration

Credential providers opt into using ServiceAccount tokens by configuring the tokenAttributes field:

#
# CAUTION: this is an example configuration.
#          Do not use this for your own cluster!
#
apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
providers:
- name: my-credential-provider
  matchImages:
  - "*.myregistry.io/*"
  defaultCacheDuration: "10m"
  apiVersion: credentialprovider.kubelet.k8s.io/v1
  tokenAttributes:
    serviceAccountTokenAudience: "my-registry-audience"
    cacheType: "ServiceAccount"  # New in beta
    requireServiceAccount: true
    requiredServiceAccountAnnotationKeys:
    - "myregistry.io/identity-id"
    optionalServiceAccountAnnotationKeys:
    - "myregistry.io/optional-annotation"

Image pull flow

At a high level, kubelet coordinates with your credential provider and the container runtime as follows:

  • When the image is not present locally:

    • kubelet checks its credential cache using the configured cacheType (Token or ServiceAccount)
    • If needed, kubelet requests a ServiceAccount token for the pod's ServiceAccount and passes it, plus any required annotations, to the credential provider
    • The provider exchanges that token for registry credentials and returns them to kubelet
    • kubelet caches credentials per the cacheType strategy and pulls the image with those credentials
    • kubelet records the ServiceAccount coordinates (namespace, name, UID) associated with the pulled image for later authorization checks
  • When the image is already present locally:

    • kubelet verifies the pod's ServiceAccount coordinates match the coordinates recorded for the cached image
    • If they match exactly, the cached image can be used without pulling from the registry
    • If they differ, kubelet performs a fresh pull using credentials for the new ServiceAccount
  • With image pull credential verification enabled:

    • Authorization is enforced using the recorded ServiceAccount coordinates, ensuring pods only use images pulled by a ServiceAccount they are authorized to use
    • Administrators can revoke access by deleting and recreating a ServiceAccount; the UID changes and previously recorded authorization no longer matches

Audience restriction

The beta release builds on service account node audience restriction (beta since v1.33) to ensure kubelet can only request tokens for authorized audiences. Administrators configure allowed audiences using RBAC to enable kubelet to request service account tokens for image pulls:

#
# CAUTION: this is an example configuration.
#          Do not use this for your own cluster!
#
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubelet-credential-provider-audiences
rules:
- verbs: ["request-serviceaccounts-token-audience"]
  apiGroups: [""]
  resources: ["my-registry-audience"]
  resourceNames: ["registry-access-sa"]  # Optional: specific SA

Getting started with beta

Prerequisites

  1. Kubernetes v1.34 or later
  2. Feature gate enabled: KubeletServiceAccountTokenForCredentialProviders=true (beta, enabled by default)
  3. Credential provider support: Update your credential provider to handle ServiceAccount tokens

Migration from alpha

If you're already using the alpha version, the migration to beta requires minimal changes:

  1. Add cacheType field: Update your credential provider configuration to include the required cacheType field
  2. Review caching strategy: Choose between Token and ServiceAccount cache types based on your provider's behavior
  3. Test audience restrictions: Ensure your RBAC configuration, or other cluster authorization rules, will properly restrict token audiences

Example setup

Here's a complete example for setting up a credential provider with service account tokens (this example assumes your cluster uses RBAC authorization):

#
# CAUTION: this is an example configuration.
#          Do not use this for your own cluster!
#

# Service Account with registry annotations
apiVersion: v1
kind: ServiceAccount
metadata:
  name: registry-access-sa
  namespace: default
  annotations:
    myregistry.io/identity-id: "user123"
---
# RBAC for audience restriction
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: registry-audience-access
rules:
- verbs: ["request-serviceaccounts-token-audience"]
  apiGroups: [""]
  resources: ["my-registry-audience"]
  resourceNames: ["registry-access-sa"]  # Optional: specific ServiceAccount
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubelet-registry-audience
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: registry-audience-access
subjects:
- kind: Group
  name: system:nodes
  apiGroup: rbac.authorization.k8s.io
---
# Pod using the ServiceAccount
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  serviceAccountName: registry-access-sa
  containers:
  - name: my-app
    image: myregistry.example/my-app:latest

What's next?

For Kubernetes v1.35, we - Kubernetes SIG Auth - expect the feature to stay in beta, and we will continue to solicit feedback.

You can learn more about this feature on the service account token for image pulls page in the Kubernetes documentation.

You can also follow along on the KEP-4412 to track progress across the coming Kubernetes releases.

Call to action

In this blog post, I have covered the beta graduation of ServiceAccount token integration for Kubelet Credential Providers in Kubernetes v1.34. I discussed the key improvements, including the required cacheType field and enhanced integration with Ensure Secret Pull Images.

We have been receiving positive feedback from the community during the alpha phase and would love to hear more as we stabilize this feature for GA. In particular, we would like feedback from credential provider implementors as they integrate with the new beta API and caching mechanisms. Please reach out to us on the #sig-auth-authenticators-dev channel on Kubernetes Slack.

How to get involved

If you are interested in getting involved in the development of this feature, share feedback, or participate in any other ongoing SIG Auth projects, please reach out on the #sig-auth channel on Kubernetes Slack.

You are also welcome to join the bi-weekly SIG Auth meetings, held every other Wednesday.

Kubernetes v1.34: Introducing CPU Manager Static Policy Option for Uncore Cache Alignment

A new CPU Manager Static Policy Option called prefer-align-cpus-by-uncorecache was introduced in Kubernetes v1.32 as an alpha feature, and has graduated to beta in Kubernetes v1.34. This CPU Manager Policy Option is designed to optimize performance for specific workloads running on processors with a split uncore cache architecture. In this article, I'll explain what that means and why it's useful.

Understanding the feature

What is uncore cache?

Until relatively recently, nearly all mainstream computer processors had a monolithic last-level-cache cache that was shared across every core in a multiple CPU package. This monolithic cache is also referred to as uncore cache (because it is not linked to a specific core), or as Level 3 cache. As well as the Level 3 cache, there is other cache, commonly called Level 1 and Level 2 cache, that is associated with a specific CPU core.

In order to reduce access latency between the CPU cores and their cache, recent AMD64 and ARM architecture based processors have introduced a split uncore cache architecture, where the last-level-cache is divided into multiple physical caches, that are aligned to specific CPU groupings within the physical package. The shorter distances within the CPU package help to reduce latency. Diagram showing monolithic cache on the left and split cache on the right

Kubernetes is able to place workloads in a way that accounts for the cache topology within the CPU package(s).

Cache-aware workload placement

The matrix below shows the CPU-to-CPU latency measured in nanoseconds (lower is better) when passing a packet between CPUs, via its cache coherence protocol on a processor that uses split uncore cache. In this example, the processor package consists of 2 uncore caches. Each uncore cache serves 8 CPU cores. Table showing CPU-to-CPU latency figures Blue entries in the matrix represent latency between CPUs sharing the same uncore cache, while grey entries indicate latency between CPUs corresponding to different uncore caches. Latency between CPUs that correspond to different caches are higher than the latency between CPUs that belong to the same cache.

With prefer-align-cpus-by-uncorecache enabled, the static CPU Manager attempts to allocates CPU resources for a container, such that all CPUs assigned to a container share the same uncore cache. This policy operates on a best-effort basis, aiming to minimize the distribution of a container's CPU resources across uncore caches, based on the container's requirements, and accounting for allocatable resources on the node.

By running a workload, where it can, on a set of CPUS that use the smallest feasible number of uncore caches, applications benefit from reduced cache latency (as seen in the matrix above), and from reduced contention against other workloads, which can result in overall higher throughput. The benefit only shows up if your nodes use a split uncore cache topology for their processors.

The following diagram below illustrates uncore cache alignment when the feature is enabled.

Diagram showing an example workload CPU assignment, default static policy, and with prefer-align-cpus-by-uncorecache

By default, Kubernetes does not account for uncore cache topology; containers are assigned CPU resources using a packed methodology. As a result, Container 1 and Container 2 can experience a noisy neighbor impact due to cache access contention on Uncore Cache 0. Additionally, Container 2 will have CPUs distributed across both caches which can introduce a cross-cache latency.

With prefer-align-cpus-by-uncorecache enabled, each container is isolated on an individual cache. This resolves the cache contention between the containers and minimizes the cache latency for the CPUs being utilized.

Use cases

Common use cases can include telco applications like vRAN, Mobile Packet Core, and Firewalls. It's important to note that the optimization provided by prefer-align-cpus-by-uncorecache can be dependent on the workload. For example, applications that are memory bandwidth bound may not benefit from uncore cache alignment, as utilizing more uncore caches can increase memory bandwidth access.

Enabling the feature

To enable this feature, set the CPU Manager Policy to static and enable the CPU Manager Policy Options with prefer-align-cpus-by-uncorecache.

For Kubernetes 1.34, the feature is in the beta stage and requires the CPUManagerPolicyBetaOptions feature gate to also be enabled.

Append the following to the kubelet configuration file:

kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
featureGates:
  ...
  CPUManagerPolicyBetaOptions: true
cpuManagerPolicy: "static"
cpuManagerPolicyOptions:
  prefer-align-cpus-by-uncorecache: "true"
reservedSystemCPUs: "0"
...

If you're making this change to an existing node, remove the cpu_manager_state file and then restart kubelet.

prefer-align-cpus-by-uncorecache can be enabled on nodes with a monolithic uncore cache processor. The feature will mimic a best-effort socket alignment effect and will pack CPU resources on the socket similar to the default static CPU Manager policy.

Further reading

See Node Resource Managers to learn more about the CPU Manager and the available policies.

Reference the documentation for prefer-align-cpus-by-uncorecache here.

Please see the Kubernetes Enhancement Proposal for more information on how prefer-align-cpus-by-uncorecache is implemented.

Getting involved

This feature is driven by SIG Node. If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please attend the SIG Node meeting for more details.

Kubernetes v1.34: DRA has graduated to GA

Kubernetes 1.34 is here, and it has brought a huge wave of enhancements for Dynamic Resource Allocation (DRA)! This release marks a major milestone with many APIs in the resource.k8s.io group graduating to General Availability (GA), unlocking the full potential of how you manage devices on Kubernetes. On top of that, several key features have moved to beta, and a fresh batch of new alpha features promise even more expressiveness and flexibility.

Let's dive into what's new for DRA in Kubernetes 1.34!

The core of DRA is now GA

The headline feature of the v1.34 release is that the core of DRA has graduated to General Availability.

Kubernetes Dynamic Resource Allocation (DRA) provides a flexible framework for managing specialized hardware and infrastructure resources, such as GPUs or FPGAs. DRA provides APIs that enable each workload to specify the properties of the devices it needs, but leaving it to the scheduler to allocate actual devices, allowing increased reliability and improved utilization of expensive hardware.

With the graduation to GA, DRA is stable and will be part of Kubernetes for the long run. The community can still expect a steady stream of new features being added to DRA over the next several Kubernetes releases, but they will not make any breaking changes to DRA. So users and developers of DRA drivers can start adopting DRA with confidence.

Starting with Kubernetes 1.34, DRA is enabled by default; the DRA features that have reached beta are also enabled by default. That's because the default API version for DRA is now the stable v1 version, and not the earlier versions (eg: v1beta1 or v1beta2) that needed explicit opt in.

Features promoted to beta

Several powerful features have been promoted to beta, adding more control, flexibility, and observability to resource management with DRA.

Admin access labelling has been updated. In v1.34, you can restrict device support to people (or software) authorized to use it. This is meant as a way to avoid privilege escalation if a DRA driver grants additional privileges when admin access is requested and to avoid accessing devices which are in use by normal applications, potentially in another namespace. The restriction works by ensuring that only users with access to a namespace with the resource.k8s.io/admin-access: "true" label are authorized to create ResourceClaim or ResourceClaimTemplates objects with the adminAccess field set to true. This ensures that non-admin users cannot misuse the feature.

Prioritized list lets users specify a list of acceptable devices for their workloads, rather than just a single type of device. So while the workload might run best on a single high-performance GPU, it might also be able to run on 2 mid-level GPUs. The scheduler will attempt to satisfy the alternatives in the list in order, so the workload will be allocated the best set of devices available on the node.

The kubelet's API has been updated to report on Pod resources allocated through DRA. This allows node monitoring agents to know the allocated DRA resources for Pods on a node and makes it possible to use the DRA information in the PodResources API to develop new features and integrations.

New alpha features

Kubernetes 1.34 also introduces several new alpha features that give us a glimpse into the future of resource management with DRA.

Extended resource mapping support in DRA allows cluster administrators to advertise DRA-managed resources as extended resources, allowing developers to consume them using the familiar, simpler request syntax while still benefiting from dynamic allocation. This makes it possible for existing workloads to start using DRA without modifications, simplifying the transition to DRA for both application developers and cluster administrators.

Consumable capacity introduces a flexible device sharing model where multiple, independent resource claims from unrelated pods can each be allocated a share of the same underlying physical device. This new capability is managed through optional, administrator-defined sharing policies that govern how a device's total capacity is divided and enforced by the platform for each request. This allows for sharing of devices in scenarios where pre-defined partitions are not viable.

For more information, see Kubernetes v1.34: DRA Consumable Capacity

Binding conditions improve scheduling reliability for certain classes of devices by allowing the Kubernetes scheduler to delay binding a pod to a node until its required external resources, such as attachable devices or FPGAs, are confirmed to be fully prepared. This prevents premature pod assignments that could lead to failures and ensures more robust, predictable scheduling by explicitly modeling resource readiness before the pod is committed to a node.

Resource health status for DRA improves observability by exposing the health status of devices allocated to a Pod via Pod Status. This works whether the device is allocated through DRA or Device Plugin. This makes it easier to understand the cause of an unhealthy device and respond properly.

For more information, see Kubernetes v1.34: Pods Report DRA Resource Health

What’s next?

While DRA got promoted to GA this cycle, the hard work on DRA doesn't stop. There are several features in alpha and beta that we plan to bring to GA in the next couple of releases and we are looking to continue to improve performance, scalability and reliability of DRA. So expect an equally ambitious set of features in DRA for the 1.35 release.

Getting involved

A good starting point is joining the WG Device Management Slack channel and meetings, which happen at US/EU and EU/APAC friendly time slots.

Not all enhancement ideas are tracked as issues yet, so come talk to us if you want to help or have some ideas yourself! We have work to do at all levels, from difficult core changes to usability enhancements in kubectl, which could be picked up by newcomers.

Acknowledgments

A huge thanks to the new contributors to DRA this cycle:

Kubernetes v1.34: Finer-Grained Control Over Container Restarts

With the release of Kubernetes 1.34, a new alpha feature is introduced that gives you more granular control over container restarts within a Pod. This feature, named Container Restart Policy and Rules, allows you to specify a restart policy for each container individually, overriding the Pod's global restart policy. In addition, it also allows you to conditionally restart individual containers based on their exit codes. This feature is available behind the alpha feature gate ContainerRestartRules.

This has been a long-requested feature. Let's dive into how it works and how you can use it.

The problem with a single restart policy

Before this feature, the restartPolicy was set at the Pod level. This meant that all containers in a Pod shared the same restart policy (Always, OnFailure, or Never). While this works for many use cases, it can be limiting in others.

For example, consider a Pod with a main application container and an init container that performs some initial setup. You might want the main container to always restart on failure, but the init container should only run once and never restart. With a single Pod-level restart policy, this wasn't possible.

Introducing per-container restart policies

With the new ContainerRestartRules feature gate, you can now specify a restartPolicy for each container in your Pod's spec. You can also define restartPolicyRules to control restarts based on exit codes. This gives you the fine-grained control you need to handle complex scenarios.

Use cases

Let's look at some real-life use cases where per-container restart policies can be beneficial.

In-place restarts for training jobs

In ML research, it's common to orchestrate a large number of long-running AI/ML training workloads. In these scenarios, workload failures are unavoidable. When a workload fails with a retriable exit code, you want the container to restart quickly without rescheduling the entire Pod, which consumes a significant amount of time and resources. Restarting the failed container "in-place" is critical for better utilization of compute resources. The container should only restart "in-place" if it failed due to a retriable error; otherwise, the container and Pod should terminate and possibly be rescheduled.

This can now be achieved with container-level restartPolicyRules. The workload can exit with different codes to represent retriable and non-retriable errors. With restartPolicyRules, the workload can be restarted in-place quickly, but only when the error is retriable.

Try-once init containers

Init containers are often used to perform initialization work for the main container, such as setting up environments and credentials. Sometimes, you want the main container to always be restarted, but you don't want to retry initialization if it fails.

With a container-level restartPolicy, this is now possible. The init container can be executed only once, and its failure would be considered a Pod failure. If the initialization succeeds, the main container can be always restarted.

Pods with multiple containers

For Pods that run multiple containers, you might have different restart requirements for each container. Some containers might have a clear definition of success and should only be restarted on failure. Others might need to be always restarted.

This is now possible with a container-level restartPolicy, allowing individual containers to have different restart policies.

How to use it

To use this new feature, you need to enable the ContainerRestartRules feature gate on your Kubernetes cluster control-plane and worker nodes running Kubernetes 1.34+. Once enabled, you can specify the restartPolicy and restartPolicyRules fields in your container definitions.

Here are some examples:

Example 1: Restarting on specific exit codes

In this example, the container should restart if and only if it fails with a retriable error, represented by exit code 42.

To achieve this, the container has restartPolicy: Never, and a restart policy rule that tells Kubernetes to restart the container in-place if it exits with code 42.

apiVersion: v1
kind: Pod
metadata:
  name: restart-on-exit-codes
  annotations:
    kubernetes.io/description: "This Pod only restart the container only when it exits with code 42."
spec:
  restartPolicy: Never
  containers:
  - name: restart-on-exit-codes
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 60 && exit 0']
    restartPolicy: Never     # Container restart policy must be specified if rules are specified
    restartPolicyRules:      # Only restart the container if it exits with code 42
    - action: Restart
      exitCodes:
        operator: In
        values: [42]

Example 2: A try-once init container

In this example, a Pod should always be restarted once the initialization succeeds. However, the initialization should only be tried once.

To achieve this, the Pod has an Always restart policy. The init-once init container will only try once. If it fails, the Pod will fail. This allows the Pod to fail if the initialization failed, but also keep running once the initialization succeeds.

apiVersion: v1
kind: Pod
metadata:
  name: fail-pod-if-init-fails
  annotations:
    kubernetes.io/description: "This Pod has an init container that runs only once. After initialization succeeds, the main container will always be restarted."
spec:
  restartPolicy: Always
  initContainers:
  - name: init-once      # This init container will only try once. If it fails, the Pod will fail.
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'echo "Failing initialization" && sleep 10 && exit 1']
    restartPolicy: Never
  containers:
  - name: main-container # This container will always be restarted once initialization succeeds.
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 1800 && exit 0']

Example 3: Containers with different restart policies

In this example, there are two containers with different restart requirements. One should always be restarted, while the other should only be restarted on failure.

This is achieved by using a different container-level restartPolicy on each of the two containers.

apiVersion: v1
kind: Pod
metadata:
  name: on-failure-pod
  annotations:
    kubernetes.io/description: "This Pod has two containers with different restart policies."
spec:
  containers:
  - name: restart-on-failure
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'echo "Not restarting after success" && sleep 10 && exit 0']
    restartPolicy: OnFailure
  - name: restart-always
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'echo "Always restarting" && sleep 1800 && exit 0']
    restartPolicy: Always

Learn more

Roadmap

More actions and signals to restart Pods and containers are coming! Notably, there are plans to add support for restarting the entire Pod. Planning and discussions on these features are in progress. Feel free to share feedback or requests with the SIG Node community!

Your feedback is welcome!

This is an alpha feature, and the Kubernetes project would love to hear your feedback. Please try it out. This feature is driven by the SIG Node. If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please reach out to the SIG Node community!

You can reach SIG Node by several means:

Kubernetes v1.34: User preferences (kuberc) are available for testing in kubectl 1.34

Have you ever wished you could enable interactive delete, by default, in kubectl? Or maybe, you'd like to have custom aliases defined, but not necessarily generate hundreds of them manually? Look no further. SIG-CLI has been working hard to add user preferences to kubectl, and we are happy to announce that this functionality is reaching beta as part of the Kubernetes v1.34 release.

How it works

A full description of this functionality is available in our official documentation, but this blog post will answer both of the questions from the beginning of this article.

Before we dive into details, let's quickly cover what the user preferences file looks like and where to place it. By default, kubectl will look for kuberc file in your default kubeconfig directory, which is $HOME/.kube. Alternatively, you can specify this location using --kuberc option or the KUBERC environment variable.

Just like every Kubernetes manifest, kuberc file will start with an apiVersion and kind:

apiVersion: kubectl.config.k8s.io/v1beta1
kind: Preference
# the user preferences will follow here

Defaults

Let's start by setting default values for kubectl command options. Our goal is to always use interactive delete, which means we want the --interactive option for kubectl delete to always be set to true. This can be achieved with the following addition to our kuberc file:

defaults:
- command: delete
  options:
  - name: interactive
    default: "true"

In the above example, I'm introducing defaults section, which allows users to define default values for kubectl options. In this case, we're setting the interactive option for kubectl delete to be true by default. This default can be overridden if a user explicitly provides a different value such as kubectl delete --interactive=false, in which case the explicit option takes precedence.

Another highly encouraged default from SIG-CLI, is using Server-Side Apply. To do so, you can add the following snippet to your preferences:

# continuing defaults section
- command: apply
  options:
  - name: server-side
    default: "true"

Aliases

The ability to define aliases allows us to save precious seconds when typing commands. I bet that you most likely have one defined for kubectl, because typing seven letters is definitely longer than just pressing k.

For this reason, the ability to define aliases was a must-have when we decided to implement user preferences, alongside defaulting. To define an alias for any of the built-in commands, expand your kuberc file with the following addition:

aliases:
- name: gns
  command: get
  prependArgs:
   - namespace
  options:
   - name: output
     default: json

There's a lot going on above, so let me break this down. First, we're introducing a new section: aliases. Here, we're defining a new alias gns, which is mapped to the command get command. Next, we're defining arguments (namespace resource) that will be inserted right after the command name. Additionally, we're setting --output=json option for this alias. The structure of options block is identical to the one in the defaults section.

You probably noticed that we've introduced a mechanism for prepending arguments, and you might wonder if there is a complementary setting for appending them (in other words, adding to the end of the command, after user-provided arguments). This can be achieved through appendArgs block, which is presented below:

# continuing aliases section
- name: runx
  command: run
  options:
    - name: image
      default: busybox
    - name: namespace
      default: test-ns
  appendArgs:
    - --
    - custom-arg

Here, we're introducing another alias: runx, which invokes kubectl run command, passing --image and --namespace options with predefined values, and also appending -- and custom-arg at the end of the invocation.

Debugging

We hope that kubectl user preferences will open up new possibilities for our users. Whenever you're in doubt, feel free to run kubectl with increased verbosity. At -v=5, you should get all the possible debugging information from this feature, which will be crucial when reporting issues.

To learn more, I encourage you to read through our official documentation and the actual proposal.

Get involved

Kubectl user preferences feature has reached beta, and we are very interested in your feedback. We'd love to hear what you like about it and what problems you'd like to see it solve. Feel free to join SIG-CLI slack channel, or open an issue against kubectl repository. You can also join us at our community meetings, which happen every other Wednesday, and share your stories with us.

Kubernetes v1.34: Of Wind & Will (O' WaW)

Editors: Agustina Barbetta, Alejandro Josue Leon Bellido, Graziano Casto, Melony Qin, Dipesh Rawat

Similar to previous releases, the release of Kubernetes v1.34 introduces new stable, beta, and alpha features. The consistent delivery of high-quality releases underscores the strength of our development cycle and the vibrant support from our community.

This release consists of 58 enhancements. Of those enhancements, 23 have graduated to Stable, 22 have entered Beta, and 13 have entered Alpha.

There are also some deprecations and removals in this release; make sure to read about those.

A release powered by the wind around us — and the will within us.

Every release cycle, we inherit winds that we don't really control — the state of our tooling, documentation, and the historical quirks of our project. Sometimes these winds fill our sails, sometimes they push us sideways or die down.

What keeps Kubernetes moving isn't the perfect winds, but the will of our sailors who adjust the sails, man the helm, chart the courses and keep the ship steady. The release happens not because conditions are always ideal, but because of the people who build it, the people who release it, and the bears ^, cats, dogs, wizards, and curious minds who keep Kubernetes sailing strong — no matter which way the wind blows.

This release, Of Wind & Will (O' WaW), honors the winds that have shaped us, and the will that propels us forward.

^ Oh, and you wonder why bears? Keep wondering!

Spotlight on key updates

Kubernetes v1.34 is packed with new features and improvements. Here are a few select updates the Release Team would like to highlight!

Stable: The core of DRA is GA

Dynamic Resource Allocation (DRA) enables more powerful ways to select, allocate, share, and configure GPUs, TPUs, NICs and other devices.

Since the v1.30 release, DRA has been based around claiming devices using structured parameters that are opaque to the core of Kubernetes. This enhancement took inspiration from dynamic provisioning for storage volumes. DRA with structured parameters relies on a set of supporting API kinds: ResourceClaim, DeviceClass, ResourceClaimTemplate, and ResourceSlice API types under resource.k8s.io, while extending the .spec for Pods with a new resourceClaims field.
The resource.k8s.io/v1 APIs have graduated to stable and are now available by default.

This work was done as part of KEP #4381 led by WG Device Management.

Beta: Projected ServiceAccount tokens for kubelet image credential providers

The kubelet credential providers, used for pulling private container images, traditionally relied on long-lived Secrets stored on the node or in the cluster. This approach increased security risks and management overhead, as these credentials were not tied to the specific workload and did not rotate automatically.
To solve this, the kubelet can now request short-lived, audience-bound ServiceAccount tokens for authenticating to container registries. This allows image pulls to be authorized based on the Pod's own identity rather than a node-level credential.
The primary benefit is a significant security improvement. It eliminates the need for long-lived Secrets for image pulls, reducing the attack surface and simplifying credential management for both administrators and developers.

This work was done as part of KEP #4412 led by SIG Auth and SIG Node.

Alpha: Support for KYAML, a Kubernetes dialect of YAML

KYAML aims to be a safer and less ambiguous YAML subset, and was designed specifically for Kubernetes. Whatever version of Kubernetes you use, starting from Kubernetes v1.34 you are able to use KYAML as a new output format for kubectl.

KYAML addresses specific challenges with both YAML and JSON. YAML's significant whitespace requires careful attention to indentation and nesting, while its optional string-quoting can lead to unexpected type coercion (for example: "The Norway Bug"). Meanwhile, JSON lacks comment support and has strict requirements for trailing commas and quoted keys.

You can write KYAML and pass it as an input to any version of kubectl, because all KYAML files are also valid as YAML. With kubectl v1.34, you are also able to request KYAML output (as in kubectl get -o kyaml …) by setting environment variable KUBECTL_KYAML=true. If you prefer, you can still request the output in JSON or YAML format.

This work was done as part of KEP #5295 led by SIG CLI.

Features graduating to Stable

This is a selection of some of the improvements that are now stable following the v1.34 release.

Delayed creation of Job’s replacement Pods

By default, Job controllers create replacement Pods immediately when a Pod starts terminating, causing both Pods to run simultaneously. This can cause resource contention in constrained clusters, where the replacement Pod may struggle to find available nodes until the original Pod fully terminates. The situation can also trigger unwanted cluster autoscaler scale-ups. Additionally, some machine learning frameworks like TensorFlow and JAX require only one Pod per index to run at a time, making simultaneous Pod execution problematic. This feature introduces .spec.podReplacementPolicy in Jobs. You may choose to create replacement Pods only when the Pod is fully terminated (has .status.phase: Failed). To do this, set .spec.podReplacementPolicy: Failed.
Introduced as alpha in v1.28, this feature has graduated to stable in v1.34.

This work was done as part of KEP #3939 led by SIG Apps.

Recovery from volume expansion failure

This feature allows users to cancel volume expansions that are unsupported by the underlying storage provider, and retry volume expansion with smaller values that may succeed.
Introduced as alpha in v1.23, this feature has graduated to stable in v1.34.

This work was done as part of KEP #1790 led by SIG Storage.

VolumeAttributesClass for volume modification

VolumeAttributesClass has graduated to stable in v1.34. VolumeAttributesClass is a generic, Kubernetes-native API for modifying volume parameters like provisioned IO. It allows workloads to vertically scale their volumes on-line to balance cost and performance, if supported by their provider.
Like all new volume features in Kubernetes, this API is implemented via the container storage interface (CSI). Your provisioner-specific CSI driver must support the new ModifyVolume API which is the CSI side of this feature.

This work was done as part of KEP #3751 led by SIG Storage.

Structured authentication configuration

Kubernetes v1.29 introduced a configuration file format to manage API server client authentication, moving away from the previous reliance on a large set of command-line options. The AuthenticationConfiguration kind allows administrators to support multiple JWT authenticators, CEL expression validation, and dynamic reloading. This change significantly improves the manageability and auditability of the cluster's authentication settings - and has graduated to stable in v1.34.

This work was done as part of KEP #3331 led by SIG Auth.

Finer-grained authorization based on selectors

Kubernetes authorizers, including webhook authorizers and the built-in node authorizer, can now make authorization decisions based on field and label selectors in incoming requests. When you send list, watch or deletecollection requests with selectors, the authorization layer can now evaluate access with that additional context.

For example, you can write an authorization policy that only allows listing Pods bound to a specific .spec.nodeName. The client (perhaps the kubelet on a particular node) must specify the field selector that the policy requires, otherwise the request is forbidden. This change makes it feasible to set up least privilege rules, provided that the client knows how to conform to the restrictions you set. Kubernetes v1.34 now supports more granular control in environments like per-node isolation or custom multi-tenant setups.

This work was done as part of KEP #4601 led by SIG Auth.

Restrict anonymous requests with fine-grained controls

Instead of fully enabling or disabling anonymous access, you can now configure a strict list of endpoints where unauthenticated requests are allowed. This provides a safer alternative for clusters that rely on anonymous access to health or bootstrap endpoints like /healthz, /readyz, or /livez.

With this feature, accidental RBAC misconfigurations that grant broad access to anonymous users can be avoided without requiring changes to external probes or bootstrapping tools.

This work was done as part of KEP #4633 led by SIG Auth.

More efficient requeueing through plugin-specific callbacks

The kube-scheduler can now make more accurate decisions about when to retry scheduling Pods that were previously unschedulable. Each scheduling plugin can now register callback functions that tell the scheduler whether an incoming cluster event is likely to make a rejected Pod schedulable again.

This reduces unnecessary retries and improves overall scheduling throughput - especially in clusters using dynamic resource allocation. The feature also lets certain plugins skip the usual backoff delay when it is safe to do so, making scheduling faster in specific cases.

This work was done as part of KEP #4247 led by SIG Scheduling.

Ordered Namespace deletion

Semi-random resource deletion order can create security gaps or unintended behavior, such as Pods persisting after their associated NetworkPolicies are deleted.
This improvement introduces a more structured deletion process for Kubernetes namespaces to ensure secure and deterministic resource removal. By enforcing a structured deletion sequence that respects logical and security dependencies, this approach ensures Pods are removed before other resources.
This feature was introduced in Kubernetes v1.33 and graduated to stable in v1.34. The graduation improves security and reliability by mitigating risks from non-deterministic deletions, including the vulnerability described in CVE-2024-7598.

This work was done as part of KEP #5080 led by SIG API Machinery.

Streaming list responses

Handling large list responses in Kubernetes previously posed a significant scalability challenge. When clients requested extensive resource lists, such as thousands of Pods or Custom Resources, the API server was required to serialize the entire collection of objects into a single, large memory buffer before sending it. This process created substantial memory pressure and could lead to performance degradation, impacting the overall stability of the cluster.
To address this limitation, a streaming encoding mechanism for collections (list responses) has been introduced. For the JSON and Kubernetes Protobuf response formats, that streaming mechanism is automatically active and the associated feature gate is stable. The primary benefit of this approach is the avoidance of large memory allocations on the API server, resulting in a much smaller and more predictable memory footprint. Consequently, the cluster becomes more resilient and performant, especially in large-scale environments where frequent requests for extensive resource lists are common.

This work was done as part of KEP #5116 led by SIG API Machinery.

Resilient watch cache initialization

Watch cache is a caching layer inside kube-apiserver that maintains an eventually consistent cache of cluster state stored in etcd. In the past, issues could occur when the watch cache was not yet initialized during kube-apiserver startup or when it required re-initialization.

To address these issues, the watch cache initialization process has been made more resilient to failures, improving control plane robustness and ensuring controllers and clients can reliably establish watches. This improvement was introduced as beta in v1.31 and is now stable.

This work was done as part of KEP #4568 led by SIG API Machinery and SIG Scalability.

Relaxing DNS search path validation

Previously, the strict validation of a Pod's DNS search path in Kubernetes often created integration challenges in complex or legacy network environments. This restrictiveness could block configurations that were necessary for an organization's infrastructure, forcing administrators to implement difficult workarounds.
To address this, relaxed DNS validation was introduced as alpha in v1.32 and has now graduated to stable in v1.34. A common use case involves Pods that need to communicate with both internal Kubernetes services and external domains. By setting a single dot (.) as the first entry in the searches list of the Pod's .spec.dnsConfig, administrators can prevent the system's resolver from appending the cluster's internal search domains to external queries. This avoids generating unnecessary DNS requests to the internal DNS server for external hostnames, improving efficiency and preventing potential resolution errors.

This work was done as part of KEP #4427 led by SIG Network.

Support for Direct Service Return (DSR) in Windows kube-proxy

DSR provides performance optimizations by allowing return traffic routed through load balancers to bypass the load balancer and respond directly to the client, reducing load on the load balancer and improving overall latency. For information on DSR on Windows, read Direct Server Return (DSR) in a nutshell.
Initially introduced in v1.14, this feature has graduated to stable in v1.34.

This work was done as part of KEP #5100 led by SIG Windows.

Sleep action for Container lifecycle hooks

A Sleep action for containers’ PreStop and PostStart lifecycle hooks was introduced to provide a straightforward way to manage graceful shutdowns and improve overall container lifecycle management.
The Sleep action allows containers to pause for a specified duration after starting or before termination. Using a negative or zero sleep duration returns immediately, resulting in a no-op.
The Sleep action was introduced in Kubernetes v1.29, with zero value support added in v1.32. Both features graduated to stable in v1.34.

This work was done as part of KEP #3960 and KEP #4818 led by SIG Node.

Linux node swap support

Historically, the lack of swap support in Kubernetes could lead to workload instability, as nodes under memory pressure often had to terminate processes abruptly. This particularly affected applications with large but infrequently accessed memory footprints and prevented more graceful resource management.

To address this, configurable per-node swap support was introduced in v1.22. It has progressed through alpha and beta stages and has graduated to stable in v1.34. The primary mode, LimitedSwap, allows Pods to use swap within their existing memory limits, providing a direct solution to the problem. By default, the kubelet is configured with NoSwap mode, which means Kubernetes workloads cannot use swap.

This feature improves workload stability and allows for more efficient resource utilization. It enables clusters to support a wider variety of applications, especially in resource-constrained environments, though administrators must consider the potential performance impact of swapping.

This work was done as part of KEP #2400 led by SIG Node.

Allow special characters in environment variables

The environment variable validation rules in Kubernetes have been relaxed to allow nearly all printable ASCII characters in variable names, excluding =. This change supports scenarios where workloads require nonstandard characters in variable names - for example, frameworks like .NET Core that use : to represent nested configuration keys.

The relaxed validation applies to environment variables defined directly in Pod spec, as well as those injected using envFrom references to ConfigMaps and Secrets.

This work was done as part of KEP #4369 led by SIG Node.

Taint management is separated from Node lifecycle

Historically, the TaintManager's logic for applying NoSchedule and NoExecute taints to nodes based on their condition (NotReady, Unreachable, etc.) was tightly coupled with the node lifecycle controller. This tight coupling made the code harder to maintain and test, and it also limited the flexibility of the taint-based eviction mechanism. This KEP refactors the TaintManager into its own separate controller within the Kubernetes controller manager. It is an internal architectural improvement designed to increase code modularity and maintainability. This change allows the logic for taint-based evictions to be tested and evolved independently, but it has no direct user-facing impact on how taints are used.

This work was done as part of KEP #3902 led by SIG Scheduling and SIG Node.

New features in Beta

This is a selection of some of the improvements that are now beta following the v1.34 release.

Pod-level resource requests and limits

Defining resource needs for Pods with multiple containers has been challenging, as requests and limits could only be set on a per-container basis. This forced developers to either over-provision resources for each container or meticulously divide the total desired resources, making configuration complex and often leading to inefficient resource allocation. To simplify this, the ability to specify resource requests and limits at the Pod level was introduced. This allows developers to define an overall resource budget for a Pod, which is then shared among its constituent containers. This feature was introduced as alpha in v1.32 and has graduated to beta in v1.34, with HPA now supporting pod-level resource specifications.

The primary benefit is a more intuitive and straightforward way to manage resources for multi-container Pods. It ensures that the total resources used by all containers do not exceed the Pod's defined limits, leading to better resource planning, more accurate scheduling, and more efficient utilization of cluster resources.

This work was done as part of KEP #2837 led by SIG Scheduling and SIG Autoscaling.

.kuberc file for kubectl user preferences

A .kuberc configuration file allows you to define preferences for kubectl, such as default options and command aliases. Unlike the kubeconfig file, the .kuberc configuration file does not contain cluster details, usernames or passwords.
This feature was introduced as alpha in v1.33, gated behind the environment variable KUBECTL_KUBERC. It has graduated to beta in v1.34 and is enabled by default.

This work was done as part of KEP #3104 led by SIG CLI.

External ServiceAccount token signing

Traditionally, Kubernetes manages ServiceAccount tokens using static signing keys that are loaded from disk at kube-apiserver startup. This feature introduces an ExternalJWTSigner gRPC service for out-of-process signing, enabling Kubernetes distributions to integrate with external key management solutions (for example, HSMs, cloud KMSes) for ServiceAccount token signing instead of static disk-based keys.

Introduced as alpha in v1.32, this external JWT signing capability advances to beta and is enabled by default in v1.34.

This work was done as part of KEP #740 led by SIG Auth.

DRA features in beta

Admin access for secure resource monitoring

DRA supports controlled administrative access via the adminAccess field in ResourceClaims or ResourceClaimTemplates, allowing cluster operators to access devices already in use by others for monitoring or diagnostics. This privileged mode is limited to users authorized to create such objects in namespaces labeled resource.k8s.io/admin-access: "true", ensuring regular workloads remain unaffected. Graduating to beta in v1.34, this feature provides secure introspection capabilities while preserving workload isolation through namespace-based authorization checks.

This work was done as part of KEP #5018 led by WG Device Management and SIG Auth.

Prioritized alternatives in ResourceClaims and ResourceClaimTemplates

While a workload might run best on a single high-performance GPU, it might also be able to run on two mid-level GPUs.
With the feature gate DRAPrioritizedList (now enabled by default), ResourceClaims and ResourceClaimTemplates get a new field named firstAvailable. This field is an ordered list that allows users to specify that a request may be satisfied in different ways, including allocating nothing at all if specific hardware is not available. The scheduler will attempt to satisfy the alternatives in the list in order, so the workload will be allocated the best set of devices available in the cluster.

This work was done as part of KEP #4816 led by WG Device Management.

The kubelet reports allocated DRA resources

The kubelet's API has been updated to report on Pod resources allocated through DRA. This allows node monitoring agents to discover the allocated DRA resources for Pods on a node. Additionally, it enables node components to use the PodResourcesAPI and leverage this DRA information when developing new features and integrations.
Starting from Kubernetes v1.34, this feature is enabled by default.

This work was done as part of KEP #3695 led by WG Device Management.

kube-scheduler non-blocking API calls

The kube-scheduler makes blocking API calls during scheduling cycles, creating performance bottlenecks. This feature introduces asynchronous API handling through a prioritized queue system with request deduplication, allowing the scheduler to continue processing Pods while API operations complete in the background. Key benefits include reduced scheduling latency, prevention of scheduler thread starvation during API delays, and immediate retry capability for unschedulable Pods. The implementation maintains backward compatibility and adds metrics for monitoring pending API operations.

This work was done as part of KEP #5229 led by SIG Scheduling.

Mutating admission policies

MutatingAdmissionPolicies offer a declarative, in-process alternative to mutating admission webhooks. This feature leverages CEL's object instantiation and JSON Patch strategies, combined with Server Side Apply’s merge algorithms.
This significantly simplifies admission control by allowing administrators to define mutation rules directly in the API server.
Introduced as alpha in v1.32, mutating admission policies has graduated to beta in v1.34.

This work was done as part of KEP #3962 led by SIG API Machinery.

Snapshottable API server cache

The kube-apiserver's caching mechanism (watch cache) efficiently serves requests for the latest observed state. However, list requests for previous states (for example, via pagination or by specifying a resourceVersion) often bypass this cache and are served directly from etcd. This direct etcd access significantly increases performance costs and can lead to stability issues, particularly with large resources, due to memory pressure from transferring large data blobs.
With the ListFromCacheSnapshot feature gate enabled by default, kube-apiserver will attempt to serve the response from snapshots if one is available with resourceVersion older than requested. The kube-apiserver starts with no snapshots, creates a new snapshot on every watch event, and keeps them until it detects etcd is compacted or if cache is full with events older than 75 seconds. If the provided resourceVersion is unavailable, the server will fallback to etcd.

This work was done as part of KEP #4988 led by SIG API Machinery.

Tooling for declarative validation of Kubernetes-native types

Prior to this release, validation rules for the APIs built into Kubernetes were written entirely by hand, which makes them difficult for maintainers to discover, understand, improve or test. There was no single way to find all the validation rules that might apply to an API. Declarative validation benefits Kubernetes maintainers by making API development, maintenance, and review easier while enabling programmatic inspection for better tooling and documentation. For people using Kubernetes libraries to write their own code (for example: a controller), the new approach streamlines adding new fields through IDL tags, rather than complex validation functions. This change helps speed up API creation by automating validation boilerplate, and provides more relevant error messages by performing validation on versioned types.​​​​​​​​​​​​​​​​
This enhancement (which graduated to beta in v1.33 and continues as beta in v1.34) brings CEL-based validation rules to native Kubernetes types. It allows for more granular and declarative validation to be defined directly in the type definitions, improving API consistency and developer experience.

This work was done as part of KEP #5073 led by SIG API Machinery.

Streaming informers for list requests

The streaming informers feature, which has been in beta since v1.32, gains further beta refinements in v1.34. This capability allows list requests to return data as a continuous stream of objects from the API server’s watch cache, rather than assembling paged results directly from etcd. By reusing the same mechanics used for watch operations, the API server can serve large datasets while keeping memory usage steady and avoiding allocation spikes that can affect stability.

In this release, the kube-apiserver and kube-controller-manager both take advantage of the new WatchList mechanism by default. For the kube-apiserver, this means list requests are streamed more efficiently, while the kube-controller-manager benefits from a more memory-efficient and predictable way to work with informers. Together, these improvements reduce memory pressure during large list operations, and improve reliability under sustained load, making list streaming more predictable and efficient.

This work was done as part of KEP #3157 led by SIG API Machinery and SIG Scalability.

Graceful node shutdown handling for Windows nodes

The kubelet on Windows nodes can now detect system shutdown events and begin graceful termination of running Pods. This mirrors existing behavior on Linux and helps ensure workloads exit cleanly during planned shutdowns or restarts.
When the system begins shutting down, the kubelet reacts by using standard termination logic. It respects the configured lifecycle hooks and grace periods, giving Pods time to stop before the node powers off. The feature relies on Windows pre-shutdown notifications to coordinate this process. This enhancement improves workload reliability during maintenance, restarts, or system updates. It is now in beta and enabled by default.

This work was done as part of KEP #4802 led by SIG Windows.

In-place Pod resize improvements

Graduated to beta and enabled by default in v1.33, in-place Pod resizing receives further improvements in v1.34. These include support for decreasing memory usage and integration with Pod-level resources.

This feature remains in beta in v1.34. For detailed usage instructions and examples, refer to the documentation: Resize CPU and Memory Resources assigned to Containers.

This work was done as part of KEP #1287 led by SIG Node and SIG Autoscaling.

New features in Alpha

This is a selection of some of the improvements that are now alpha following the v1.34 release.

Pod certificates for mTLS authentication

Authenticating workloads within a cluster, especially for communication with the API server, has primarily relied on ServiceAccount tokens. While effective, these tokens aren't always ideal for establishing a strong, verifiable identity for mutual TLS (mTLS) and can present challenges when integrating with external systems that expect certificate-based authentication.
Kubernetes v1.34 introduces a built-in mechanism for Pods to obtain X.509 certificates via PodCertificateRequests. The kubelet can request and manage certificates for Pods, which can then be used to authenticate to the Kubernetes API server and other services using mTLS. The primary benefit is a more robust and flexible identity mechanism for Pods. It provides a native way to implement strong mTLS authentication without relying solely on bearer tokens, aligning Kubernetes with standard security practices and simplifying integrations with certificate-aware observability and security tooling.

This work was done as part of KEP #4317 led by SIG Auth.

"Restricted" Pod security standard now forbids remote probes

The host field within probes and lifecycle handlers allows users to specify an entity other than the podIP for the kubelet to probe. However, this opens up a route for misuse and for attacks that bypass security controls, since the host field could be set to any value, including security sensitive external hosts, or localhost on the node. In Kubernetes v1.34, Pods only meet the Restricted Pod security standard if they either leave the host field unset, or if they don't even use this kind of probe. You can use Pod security admission, or a third party solution, to enforce that Pods meet this standard. Because these are security controls, check the documentation to understand the limitations and behavior of the enforcement mechanism you choose.

This work was done as part of KEP #4940 led by SIG Auth.

Use .status.nominatedNodeName to express Pod placement

When the kube-scheduler takes time to bind Pods to Nodes, cluster autoscalers may not understand that a Pod will be bound to a specific Node. Consequently, they may mistakenly consider the Node as underutilized and delete it.
To address this issue, the kube-scheduler can use .status.nominatedNodeName not only to indicate ongoing preemption but also to express Pod placement intentions. By enabling the NominatedNodeNameForExpectation feature gate, the scheduler uses this field to indicate where a Pod will be bound. This exposes internal reservations to help external components make informed decisions.

This work was done as part of KEP #5278 led by SIG Scheduling.

DRA features in alpha

Resource health status for DRA

It can be difficult to know when a Pod is using a device that has failed or is temporarily unhealthy, which makes troubleshooting Pod crashes challenging or impossible.
Resource Health Status for DRA improves observability by exposing the health status of devices allocated to a Pod in the Pod’s status. This makes it easier to identify the cause of Pod issues related to unhealthy devices and respond appropriately.
To enable this functionality, the ResourceHealthStatus feature gate must be enabled, and the DRA driver must implement the DRAResourceHealth gRPC service.

This work was done as part of KEP #4680 led by WG Device Management.

Extended resource mapping

Extended resource mapping provides a simpler alternative to DRA's expressive and flexible approach by offering a straightforward way to describe resource capacity and consumption. This feature enables cluster administrators to advertise DRA-managed resources as extended resources, allowing application developers and operators to continue using the familiar container’s .spec.resources syntax to consume them.
This enables existing workloads to adopt DRA without modifications, simplifying the transition to DRA for both application developers and cluster administrators.

This work was done as part of KEP #5004 led by WG Device Management.

DRA consumable capacity

Kubernetes v1.33 added support for resource drivers to advertise slices of a device that are available, rather than exposing the entire device as an all-or-nothing resource. However, this approach couldn't handle scenarios where device drivers manage fine-grained, dynamic portions of a device resource based on user demand, or share those resources independently of ResourceClaims, which are restricted by their spec and namespace.
Enabling the DRAConsumableCapacity feature gate (introduced as alpha in v1.34) allows resource drivers to share the same device, or even a slice of a device, across multiple ResourceClaims or across multiple DeviceRequests. The feature also extends the scheduler to support allocating portions of device resources, as defined in the capacity field. This DRA feature improves device sharing across namespaces and claims, tailoring it to Pod needs. It enables drivers to enforce capacity limits, enhances scheduling, and supports new use cases like bandwidth-aware networking and multi-tenant sharing.

This work was done as part of KEP #5075 led by WG Device Management.

Device binding conditions

The Kubernetes scheduler gets more reliable by delaying binding a Pod to a Node until its required external resources, such as attachable devices or FPGAs, are confirmed to be ready.
This delay mechanism is implemented in the PreBind phase of the scheduling framework. During this phase, the scheduler checks whether all required device conditions are satisfied before proceeding with binding. This enables coordination with external device controllers, ensuring more robust, predictable scheduling.

This work was done as part of KEP #5007 led by WG Device Management.

Container restart rules

Currently, all containers within a Pod will follow the same .spec.restartPolicy when exited or crashed. However, Pods that run multiple containers might have different restart requirements for each container. For example, for init containers used to perform initialization, you may not want to retry initialization if they fail. Similarly, in ML research environments with long-running training workloads, containers that fail with retriable exit codes should restart quickly in place, rather than triggering Pod recreation and losing progress.
Kubernetes v1.34 introduces the ContainerRestartRules feature gate. When enabled, a restartPolicy can be specified for each container within a Pod. A restartPolicyRules list can also be defined to override restartPolicy based on the last exit code. This provides the fine-grained control needed to handle complex scenarios and better utilization of compute resources.

This work was done as part of KEP #5307 led by SIG Node.

Load environment variables from files created in runtime

Application developers have long requested greater flexibility in declaring environment variables. Traditionally, environment variables are declared on the API server side via static values, ConfigMaps, or Secrets.

Behind the EnvFiles feature gate, Kubernetes v1.34 introduces the ability to declare environment variables at runtime. One container (typically an init container) can generate the variable and store it in a file, and a subsequent container can start with the environment variable loaded from that file. This approach eliminates the need to "wrap" the target container's entry point, enabling more flexible in-Pod container orchestration.

This feature particularly benefits AI/ML training workloads, where each Pod in a training Job requires initialization with runtime-defined values.

This work was done as part of KEP #5307 led by SIG Node.

Graduations, deprecations, and removals in v1.34

Graduations to stable

This lists all the features that graduated to stable (also known as general availability). For a full list of updates including new features and graduations from alpha to beta, see the release notes.

This release includes a total of 23 enhancements promoted to stable:

Deprecations and removals

As Kubernetes develops and matures, features may be deprecated, removed, or replaced with better ones to improve the project's overall health. See the Kubernetes deprecation and removal policy for more details on this process. Kubernetes v1.34 includes a couple of deprecations.

Manual cgroup driver configuration is deprecated

Historically, configuring the correct cgroup driver has been a pain point for users running Kubernetes clusters. Kubernetes v1.28 added a way for the kubelet to query the CRI implementation and find which cgroup driver to use. That automated detection is now strongly recommended and support for it has graduated to stable in v1.34. If your CRI container runtime does not support the ability to report the cgroup driver it needs, you should upgrade or change your container runtime. The cgroupDriver configuration setting in the kubelet configuration file is now deprecated. The corresponding command-line option --cgroup-driver was previously deprecated, as Kubernetes recommends using the configuration file instead. Both the configuration setting and command-line option will be removed in a future release, that removal will not happen before the v1.36 minor release.

This work was done as part of KEP #4033 led by SIG Node.

Kubernetes to end containerd 1.x support in v1.36

While Kubernetes v1.34 still supports containerd 1.7 and other LTS releases of containerd, as a consequence of automated cgroup driver detection, the Kubernetes SIG Node community has formally agreed upon a final support timeline for containerd v1.X. The last Kubernetes release to offer this support will be v1.35 (aligned with containerd 1.7 EOL). This is an early warning that if you are using containerd 1.X, consider switching to 2.0+ soon. You are able to monitor the kubelet_cri_losing_support metric to determine if any nodes in your cluster are using a containerd version that will soon be outdated.

This work was done as part of KEP #4033 led by SIG Node.

PreferClose traffic distribution is deprecated

The spec.trafficDistribution field within a Kubernetes Service allows users to express preferences for how traffic should be routed to Service endpoints.

KEP-3015 deprecates PreferClose and introduces two additional values: PreferSameZone and PreferSameNode. PreferSameZone is an alias for the existing PreferClose to clarify its semantics. PreferSameNode allows connections to be delivered to a local endpoint when possible, falling back to a remote endpoint when not possible.

This feature was introduced in v1.33 behind the PreferSameTrafficDistribution feature gate. It has graduated to beta in v1.34 and is enabled by default.

This work was done as part of KEP #3015 led by SIG Network.

Release notes

Check out the full details of the Kubernetes v1.34 release in our release notes.

Availability

Kubernetes v1.34 is available for download on GitHub or on the Kubernetes download page.

To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using minikube. You can also easily install v1.34 using kubeadm.

Release Team

Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.

We honor the memory of Rodolfo "Rodo" Martínez Vega, a dedicated contributor whose passion for technology and community building left a mark on the Kubernetes community. Rodo served as a member of the Kubernetes Release Team across multiple releases, including v1.22-v1.23 and v1.25-v1.30, demonstrating unwavering commitment to the project's success and stability.
Beyond his Release Team contributions, Rodo was deeply involved in fostering the Cloud Native LATAM community, helping to bridge language and cultural barriers in the space. His work on the Spanish version of Kubernetes documentation and the CNCF Glossary exemplified his dedication to making knowledge accessible to Spanish-speaking developers worldwide. Rodo's legacy lives on through the countless community members he mentored, the releases he helped deliver, and the vibrant LATAM Kubernetes community he helped cultivate.

We would like to thank the entire Release Team for the hours spent hard at work to deliver the Kubernetes v1.34 release to our community. The Release Team's membership ranges from first-time shadows to returning team leads with experience forged over several release cycles. A very special thanks goes out to our release lead, Vyom Yadav, for guiding us through a successful release cycle, for his hands-on approach to solving challenges, and for bringing the energy and care that drives our community forward.

Project Velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

During the v1.34 release cycle, which spanned 15 weeks from 19th May 2025 to 27th August 2025, Kubernetes received contributions from as many as 106 different companies and 491 individuals. In the wider cloud native ecosystem, the figure goes up to 370 companies, counting 2235 total contributors.

Note that "contribution" counts when someone makes a commit, code review, comment, creates an issue or PR, reviews a PR (including blogs and documentation) or comments on issues and PRs.
If you are interested in contributing, visit Getting Started on our contributor website.

Source for this data:

Event Update

Explore upcoming Kubernetes and cloud native events, including KubeCon + CloudNativeCon, KCD, and other notable conferences worldwide. Stay informed and get involved with the Kubernetes community!

August 2025

September 2025

October 2025

November 2025

December 2025

You can find the latest event details here.

Upcoming Release Webinar

Join members of the Kubernetes v1.34 Release Team on Wednesday, September 24th 2025 at 4:00 PM (UTC), to learn about the release highlights of this release. For more information and registration, visit the event page on the CNCF Online Programs site.

Get Involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

Tuning Linux Swap for Kubernetes: A Deep Dive

The Kubernetes NodeSwap feature, likely to graduate to stable in the upcoming Kubernetes v1.34 release, allows swap usage: a significant shift from the conventional practice of disabling swap for performance predictability. This article focuses exclusively on tuning swap on Linux nodes, where this feature is available. By allowing Linux nodes to use secondary storage for additional virtual memory when physical RAM is exhausted, node swap support aims to improve resource utilization and reduce out-of-memory (OOM) kills.

However, enabling swap is not a "turn-key" solution. The performance and stability of your nodes under memory pressure are critically dependent on a set of Linux kernel parameters. Misconfiguration can lead to performance degradation and interfere with Kubelet's eviction logic.

In this blogpost, I'll dive into critical Linux kernel parameters that govern swap behavior. I will explore how these parameters influence Kubernetes workload performance, swap utilization, and crucial eviction mechanisms. I will present various test results showcasing the impact of different configurations, and share my findings on achieving optimal settings for stable and high-performing Kubernetes clusters.

Introduction to Linux swap

At a high level, the Linux kernel manages memory through pages, typically 4KiB in size. When physical memory becomes constrained, the kernel's page replacement algorithm decides which pages to move to swap space. While the exact logic is a sophisticated optimization, this decision-making process is influenced by certain key factors:

  1. Page access patterns (how recently pages are accessed)
  2. Page dirtyness (whether pages have been modified)
  3. Memory pressure (how urgently the system needs free memory)

Anonymous vs File-backed memory

It is important to understand that not all memory pages are the same. The kernel distinguishes between anonymous and file-backed memory.

Anonymous memory: This is memory that is not backed by a specific file on the disk, such as a program's heap and stack. From the application's perspective this is private memory, and when the kernel needs to reclaim these pages, it must write them to a dedicated swap device.

File-backed memory: This memory is backed by a file on a filesystem. This includes a program's executable code, shared libraries, and filesystem caches. When the kernel needs to reclaim these pages, it can simply discard them if they have not been modified ("clean"). If a page has been modified ("dirty"), the kernel must first write the changes back to the file before it can be discarded.

While a system without swap can still reclaim clean file-backed pages memory under pressure by dropping them, it has no way to offload anonymous memory. Enabling swap provides this capability, allowing the kernel to move less-frequently accessed memory pages to disk to conserve memory to avoid system OOM kills.

Key kernel parameters for swap tuning

To effectively tune swap behavior, Linux provides several kernel parameters that can be managed via sysctl.

  • vm.swappiness: This is the most well-known parameter. It is a value from 0 to 200 (100 in older kernels) that controls the kernel's preference for swapping anonymous memory pages versus reclaiming file-backed memory pages (page cache).
    • High value (eg: 90+): The kernel will be aggressive in swapping out less-used anonymous memory to make room for file-cache.
    • Low value (eg: < 10): The kernel will strongly prefer dropping file cache pages over swapping anonymous memory.
  • vm.min_free_kbytes: This parameter tells the kernel to keep a minimum amount of memory free as a buffer. When the amount of free memory drops below the this safety buffer, the kernel starts more aggressively reclaiming pages (swapping, and eventually handling OOM kills).
    • Function: It acts as a safety lever to ensure the kernel has enough memory for critical allocation requests that cannot be deferred.
    • Impact on swap: Setting a higher min_free_kbytes effectively raises the floor for for free memory, causing the kernel to initiate swap earlier under memory pressure.
  • vm.watermark_scale_factor: This setting controls the gap between different watermarks: min, low and high, which are calculated based on min_free_kbytes.
    • Watermarks explained:
      • low: When free memory is below this mark, the kswapd kernel process wakes up to reclaim pages in the background. This is when a swapping cycle begins.
      • min: When free memory hits this minimum level, then aggressive page reclamation will block process allocation. Failing to reclaim pages will cause OOM kills.
      • high: Memory reclamation stops once the free memory reaches this level.
    • Impact: A higher watermark_scale_factor careates a larger buffer between the low and min watermarks. This gives kswapd more time to reclaim memory gradually before the system hits a critical state.

In a typical server workload, you might have a long-running process with some memory that becomes 'cold'. A higher swappiness value can free up RAM by swapping out the cold memory, for other active processes that can benefit from keeping their file-cache.

Tuning the min_free_kbytes and watermark_scale_factor parameters to move the swapping window early will give more room for kswapd to offload memory to disk and prevent OOM kills during sudden memory spikes.

Swap tests and results

To understand the real-impact of these parameters, I designed a series of stress tests.

Test setup

  • Environment: GKE on Google Cloud
  • Kubernetes version: 1.33.2
  • Node configuration: n2-standard-2 (8GiB RAM, 50GB swap on a pd-balanced disk, without encryption), Ubuntu 22.04
  • Workload: A custom Go application designed to allocate memory at a configurable rate, generate file-cache pressure, and simulate different memory access patterns (random vs sequential).
  • Monitoring: A sidecar container capturing system metrics every second.
  • Protection: Critical system components (kubelet, container runtime, sshd) were prevented from swapping by setting memory.swap.max=0 in their respective cgroups.

Test methodology

I ran a stress-test pod on nodes with different swappiness settings (0, 60, and 90) and varied the min_free_kbytes and watermark_scale_factor parameters to observe the outcomes under heavy memory allocation and I/O pressure.

Visualizing swap in action

The graph below, from a 100MBps stress test, shows swap in action. As free memory (in the "Memory Usage" plot) decreases, swap usage (Swap Used (GiB)) and swap-out activity (Swap Out (MiB/s)) increase. Critically, as the system relies more on swap, the I/O activity and corresponding wait time (IO Wait % in the "CPU Usage" plot) also rises, indicating CPU stress.

Graph showing CPU, Memory, Swap utilization and I/O activity on a Kubernetes node

Findings

My initial tests with default kernel parameters (swappiness=60, min_free_kbytes=68MB, watermark_scale_factor=10) quickly led to OOM kills and even unexpected node restarts under high memory pressure. With selecting appropriate kernel parameters a good balance in node stability and performance can be achieved.

The impact of swappiness

The swappiness parameter directly influences the kernel's choice between reclaiming anonymous memory (swapping) and dropping page cache. To observe this, I ran a test where one pod generated and held file-cache pressure, followed by a second pod allocating anonymous memory at 100MB/s, to observe the kernel preference on reclaim:

My findings reveal a clear trade-off:

  • swappiness=90: The kernel proactively swapped out the inactive anonymous memory to keep the file cache. This resulted in high and sustained swap usage and significant I/O activity ("Blocks Out"), which in turn caused spikes in I/O wait on the CPU.
  • swappiness=0: The kernel favored dropping file-cache pages delaying swap consumption. However, it's critical to understand that this does not disable swapping. When memory pressure was high, the kernel still swapped anonymous memory to disk.

The choice is workload-dependent. For workloads sensitive to I/O latency, a lower swappiness is preferable. For workloads that rely on a large and frequently accessed file cache, a higher swappiness may be beneficial, provided the underlying disk is fast enough to handle the load.

Tuning watermarks to prevent eviction and OOM kills

The most critical challenge I encountered was the interaction between rapid memory allocation and Kubelet's eviction mechanism. When my test pod, which was deliberately configured to overcommit memory, allocated it at a high rate (e.g., 300-500 MBps), the system quickly ran out of free memory.

With default watermarks, the buffer for reclamation was too small. Before kswapd could free up enough memory by swapping, the node would hit a critical state, leading to two potential outcomes:

  1. Kubelet eviction If kubelet's eviction manager detected memory.available was below its threshold, it would evict the pod.
  2. OOM killer In some high-rate scenarios, the OOM Killer would activate before eviction could complete, sometimes killing higher priority pods that were not the source of the pressure.

To mitigate this I tuned the watermarks:

  1. Increased min_free_kbytes to 512MiB: This forces the kernel to start reclaiming memory much earlier, providing a larger safety buffer.
  2. Increased watermark_scale_factor to 2000: This widened the gap between the low and high watermarks (from ≈337MB to ≈591MB in my test node's /proc/zoneinfo), effectively increasing the swapping window.

This combination gave kswapd a larger operational zone and more time to swap pages to disk during memory spikes, successfully preventing both premature evictions and OOM kills in my test runs.

Table compares watermark levels from /proc/zoneinfo (Non-NUMA node):

min_free_kbytes=67584KiB and watermark_scale_factor=10min_free_kbytes=524288KiB and watermark_scale_factor=2000
Node 0, zone Normal
  pages free 583273
  boost 0
  min 10504
  low 13130
  high 15756
  spanned 1310720
  present 1310720
  managed 1265603
Node 0, zone Normal
  pages free 470539
  min 82109
  low 337017
  high 591925
  spanned 1310720
  present 1310720
  managed 1274542

The graph below reveals that the kernel buffer size and scaling factor play a crucial role in determining how the system responds to memory load. With the right combination of these parameters, the system can effectively use swap space to avoid eviction and maintain stability.

A side-by-side comparison of different min_free_kbytes settings, showing differences in Swap, Memory Usage and Eviction impact

Risks and recommendations

Enabling swap in Kubernetes is a powerful tool, but it comes with risks that must be managed through careful tuning.

  • Risk of performance degradation Swapping is orders of magnitude slower than accessing RAM. If an application's active working set is swapped out, its performance will suffer dramatically due to high I/O wait times (thrashing). Swap could preferably be provisioned with a SSD backed storage to improve performance.

  • Risk of masking memory leaks Swap can hide memory leaks in applications, which might otherwise lead to a quick OOM kill. With swap, a leaky application might slowly degrade node performance over time, making the root cause harder to diagnose.

  • Risk of disabling evictions Kubelet proactively monitors the node for memory-pressure and terminates pods to reclaim the resources. Improper tuning can lead to OOM kills before kubelet has a chance to evict pods gracefully. A properly configured min_free_kbytes is essential to ensure kubelet's eviction mechanism remains effective.

Kubernetes context

Together, the kernel watermarks and kubelet eviction threshold create a series of memory pressure zones on a node. The eviction-threshold parameters need to be adjusted to configure Kubernetes managed evictions occur before the OOM kills.

Preferred thresholds for effective swap utilization

As the diagram shows, an ideal configuration will be to create a large enough 'swapping zone' (between high and min watermarks) so that the kernel can handle memory pressure by swapping before available memory drops into the Eviction/Direct Reclaim zone.

Based on these findings, I recommend the following as a starting point for Linux nodes with swap enabled. You should benchmark this with your own workloads.

  • vm.swappiness=60: Linux default is a good starting point for general-purpose workloads. However, the ideal value is workload-dependent, and swap-sensitive applications may need more careful tuning.
  • vm.min_free_kbytes=500000 (500MB): Set this to a reasonably high value (e.g., 2-3% of total node memory) to give the node a reasonable safety buffer.
  • vm.watermark_scale_factor=2000: Create a larger window for kswapd to work with, preventing OOM kills during sudden memory allocation spikes.

I encourage running benchmark tests with your own workloads in test-environments, when setting up swap for the first time in your Kubernetes cluster. Swap performance can be sensitive to different environment differences such as CPU load, disk type (SSD vs HDD) and I/O patterns.

Introducing Headlamp AI Assistant

This announcement originally appeared on the Headlamp blog.

To simplify Kubernetes management and troubleshooting, we're thrilled to introduce Headlamp AI Assistant: a powerful new plugin for Headlamp that helps you understand and operate your Kubernetes clusters and applications with greater clarity and ease.

Whether you're a seasoned engineer or just getting started, the AI Assistant offers:

  • Fast time to value: Ask questions like "Is my application healthy?" or "How can I fix this?" without needing deep Kubernetes knowledge.
  • Deep insights: Start with high-level queries and dig deeper with prompts like "List all the problematic pods" or "How can I fix this pod?"
  • Focused & relevant: Ask questions in the context of what you're viewing in the UI, such as "What's wrong here?"
  • Action-oriented: Let the AI take action for you, like "Restart that deployment", with your permission.

Here is a demo of the AI Assistant in action as it helps troubleshoot an application running with issues in a Kubernetes cluster:

Hopping on the AI train

Large Language Models (LLMs) have transformed not just how we access data but also how we interact with it. The rise of tools like ChatGPT opened a world of possibilities, inspiring a wave of new applications. Asking questions or giving commands in natural language is intuitive, especially for users who aren't deeply technical. Now everyone can quickly ask how to do X or Y, without feeling awkward or having to traverse pages and pages of documentation like before.

Therefore, Headlamp AI Assistant brings a conversational UI to Headlamp, powered by LLMs that Headlamp users can configure with their own API keys. It is available as a Headlamp plugin, making it easy to integrate into your existing setup. Users can enable it by installing the plugin and configuring it with their own LLM API keys, giving them control over which model powers the assistant. Once enabled, the assistant becomes part of the Headlamp UI, ready to respond to contextual queries and perform actions directly from the interface.

Context is everything

As expected, the AI Assistant is focused on helping users with Kubernetes concepts. Yet, while there is a lot of value in responding to Kubernetes related questions from Headlamp's UI, we believe that the great benefit of such an integration is when it can use the context of what the user is experiencing in an application. So, the Headlamp AI Assistant knows what you're currently viewing in Headlamp, and this makes the interaction feel more like working with a human assistant.

For example, if a pod is failing, users can simply ask "What's wrong here?" and the AI Assistant will respond with the root cause, like a missing environment variable or a typo in the image name. Follow-up prompts like "How can I fix this?" allow the AI Assistant to suggest a fix, streamlining what used to take multiple steps into a quick, conversational flow.

Sharing the context from Headlamp is not a trivial task though, so it's something we will keep working on perfecting.

Tools

Context from the UI is helpful, but sometimes additional capabilities are needed. If the user is viewing the pod list and wants to identify problematic deployments, switching views should not be necessary. To address this, the AI Assistant includes support for a Kubernetes tool. This allows asking questions like "Get me all deployments with problems" prompting the assistant to fetch and display relevant data from the current cluster. Likewise, if the user requests an action like "Restart that deployment" after the AI points out what deployment needs restarting, it can also do that. In case of "write" operations, the AI Assistant does check with the user for permission to run them.

AI Plugins

Although the initial version of the AI Assistant is already useful for Kubernetes users, future iterations will expand its capabilities. Currently, the assistant supports only the Kubernetes tool, but further integration with Headlamp plugins is underway. Similarly, we could get richer insights for GitOps via the Flux plugin, monitoring through Prometheus, package management with Helm, and more.

And of course, as the popularity of MCP grows, we are looking into how to integrate it as well, for a more plug-and-play fashion.

Try it out!

We hope this first version of the AI Assistant helps users manage Kubernetes clusters more effectively and assist newcomers in navigating the learning curve. We invite you to try out this early version and give us your feedback. The AI Assistant plugin can be installed from Headlamp's Plugin Catalog in the desktop version, or by using the container image when deploying Headlamp. Stay tuned for the future versions of the Headlamp AI Assistant!

Kubernetes v1.34 Sneak Peek

Kubernetes v1.34 is coming at the end of August 2025. This release will not include any removal or deprecation, but it is packed with an impressive number of enhancements. Here are some of the features we are most excited about in this cycle!

Please note that this information reflects the current state of v1.34 development and may change before release.

The following list highlights some of the notable enhancements likely to be included in the v1.34 release, but is not an exhaustive list of all planned changes. This is not a commitment and the release content is subject to change.

The core of DRA targets stable

Dynamic Resource Allocation (DRA) provides a flexible way to categorize, request, and use devices like GPUs or custom hardware in your Kubernetes cluster.

Since the v1.30 release, DRA has been based around claiming devices using structured parameters that are opaque to the core of Kubernetes. The relevant enhancement proposal, KEP-4381, took inspiration from dynamic provisioning for storage volumes. DRA with structured parameters relies on a set of supporting API kinds: ResourceClaim, DeviceClass, ResourceClaimTemplate, and ResourceSlice API types under resource.k8s.io, while extending the .spec for Pods with a new resourceClaims field. The core of DRA is targeting graduation to stable in Kubernetes v1.34.

With DRA, device drivers and cluster admins define device classes that are available for use. Workloads can claim devices from a device class within device requests. Kubernetes allocates matching devices to specific claims and places the corresponding Pods on nodes that can access the allocated devices. This framework provides flexible device filtering using CEL, centralized device categorization, and simplified Pod requests, among other benefits.

Once this feature has graduated, the resource.k8s.io/v1 APIs will be available by default.

ServiceAccount tokens for image pull authentication

The ServiceAccount token integration for kubelet credential providers is likely to reach beta and be enabled by default in Kubernetes v1.34. This allows the kubelet to use these tokens when pulling container images from registries that require authentication.

That support already exists as alpha, and is tracked as part of KEP-4412.

The existing alpha integration allows the kubelet to use short-lived, automatically rotated ServiceAccount tokens (that follow OIDC-compliant semantics) to authenticate to a container image registry. Each token is scoped to one associated Pod; the overall mechanism replaces the need for long-lived image pull Secrets.

Adopting this new approach reduces security risks, supports workload-level identity, and helps cut operational overhead. It brings image pull authentication closer to modern, identity-aware good practice.

Pod replacement policy for Deployments

After a change to a Deployment, terminating pods may stay up for a considerable amount of time and may consume additional resources. As part of KEP-3973, the .spec.podReplacementPolicy field will be introduced (as alpha) for Deployments.

If your cluster has the feature enabled, you'll be able to select one of two policies:

TerminationStarted
Creates new pods as soon as old ones start terminating, resulting in faster rollouts at the cost of potentially higher resource consumption.
TerminationComplete
Waits until old pods fully terminate before creating new ones, resulting in slower rollouts but ensuring controlled resource consumption.

This feature makes Deployment behavior more predictable by letting you choose when new pods should be created during updates or scaling. It's beneficial when working in clusters with tight resource constraints or with workloads with long termination periods.

It's expected to be available as an alpha feature and can be enabled using the DeploymentPodReplacementPolicy and DeploymentReplicaSetTerminatingReplicas feature gates in the API server and kube-controller-manager.

Production-ready tracing for kubelet and API Server

To address the longstanding challenge of debugging node-level issues by correlating disconnected logs, KEP-2831 provides deep, contextual insights into the kubelet.

This feature instruments critical kubelet operations, particularly its gRPC calls to the Container Runtime Interface (CRI), using the vendor-agnostic OpenTelemetry standard. It allows operators to visualize the entire lifecycle of events (for example: a Pod startup) to pinpoint sources of latency and errors. Its most powerful aspect is the propagation of trace context; the kubelet passes a trace ID with its requests to the container runtime, enabling runtimes to link their own spans.

This effort is complemented by a parallel enhancement, KEP-647, which brings the same tracing capabilities to the Kubernetes API server. Together, these enhancements provide a more unified, end-to-end view of events, simplifying the process of pinpointing latency and errors from the control plane down to the node. These features have matured through the official Kubernetes release process. KEP-2831 was introduced as an alpha feature in v1.25, while KEP-647 debuted as alpha in v1.22. Both enhancements were promoted to beta together in the v1.27 release. Looking forward, Kubelet Tracing (KEP-2831) and API Server Tracing (KEP-647) are now targeting graduation to stable in the upcoming v1.34 release.

PreferSameZone and PreferSameNode traffic distribution for Services

The spec.trafficDistribution field within a Kubernetes Service allows users to express preferences for how traffic should be routed to Service endpoints.

KEP-3015 deprecates PreferClose and introduces two additional values: PreferSameZone and PreferSameNode. PreferSameZone is equivalent to the current PreferClose. PreferSameNode prioritizes sending traffic to endpoints on the same node as the client.

This feature was introduced in v1.33 behind the PreferSameTrafficDistribution feature gate. It is targeting graduation to beta in v1.34 with its feature gate enabled by default.

Support for KYAML: a Kubernetes dialect of YAML

KYAML aims to be a safer and less ambiguous YAML subset, and was designed specifically for Kubernetes. Whatever version of Kubernetes you use, you'll be able use KYAML for writing manifests and/or Helm charts. You can write KYAML and pass it as an input to any version of kubectl, because all KYAML files are also valid as YAML. With kubectl v1.34, we expect you'll also be able to request KYAML output from kubectl (as in kubectl get -o kyaml …). If you prefer, you can still request the output in JSON or YAML format.

KYAML addresses specific challenges with both YAML and JSON. YAML's significant whitespace requires careful attention to indentation and nesting, while its optional string-quoting can lead to unexpected type coercion (for example: "The Norway Bug"). Meanwhile, JSON lacks comment support and has strict requirements for trailing commas and quoted keys.

KEP-5295 introduces KYAML, which tries to address the most significant problems by:

  • Always double-quoting value strings

  • Leaving keys unquoted unless they are potentially ambiguous

  • Always using {} for mappings (associative arrays)

  • Always using [] for lists

This might sound a lot like JSON, because it is! But unlike JSON, KYAML supports comments, allows trailing commas, and doesn't require quoted keys.

We're hoping to see KYAML introduced as a new output format for kubectl v1.34. As with all these features, none of these changes are 100% confirmed; watch this space!

As a format, KYAML is and will remain a strict subset of YAML, ensuring that any compliant YAML parser can parse KYAML documents. Kubernetes does not require you to provide input specifically formatted as KYAML, and we have no plans to change that.

Fine-grained autoscaling control with HPA configurable tolerance

KEP-4951 introduces a new feature that allows users to configure autoscaling tolerance on a per-HPA basis, overriding the default cluster-wide 10% tolerance setting that often proves too coarse-grained for diverse workloads. The enhancement adds an optional tolerance field to the HPA's spec.behavior.scaleUp and spec.behavior.scaleDown sections, enabling different tolerance values for scale-up and scale-down operations, which is particularly valuable since scale-up responsiveness is typically more critical than scale-down speed for handling traffic surges.

Released as alpha in Kubernetes v1.33 behind the HPAConfigurableTolerance feature gate, this feature is expected to graduate to beta in v1.34. This improvement helps to address scaling challenges with large deployments, where for scaling in, a 10% tolerance might mean leaving hundreds of unnecessary Pods running. Using the new, more flexible approach would enable workload-specific optimization for both responsive and conservative scaling behaviors.

Want to know more?

New features and deprecations are also announced in the Kubernetes release notes. We will formally announce what's new in Kubernetes v1.34 as part of the CHANGELOG for that release.

The Kubernetes v1.34 release is planned for Wednesday 27th August 2025. Stay tuned for updates!

Get involved

The simplest way to get involved with Kubernetes is to join one of the many Special Interest Groups (SIGs) that align with your interests. Have something you'd like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

Post-Quantum Cryptography in Kubernetes

The world of cryptography is on the cusp of a major shift with the advent of quantum computing. While powerful quantum computers are still largely theoretical for many applications, their potential to break current cryptographic standards is a serious concern, especially for long-lived systems. This is where Post-Quantum Cryptography (PQC) comes in. In this article, I'll dive into what PQC means for TLS and, more specifically, for the Kubernetes ecosystem. I'll explain what the (suprising) state of PQC in Kubernetes is and what the implications are for current and future clusters.

What is Post-Quantum Cryptography

Post-Quantum Cryptography refers to cryptographic algorithms that are thought to be secure against attacks by both classical and quantum computers. The primary concern is that quantum computers, using algorithms like Shor's Algorithm, could efficiently break widely used public-key cryptosystems such as RSA and Elliptic Curve Cryptography (ECC), which underpin much of today's secure communication, including TLS. The industry is actively working on standardizing and adopting PQC algorithms. One of the first to be standardized by NIST is the Module-Lattice Key Encapsulation Mechanism (ML-KEM), formerly known as Kyber, and now standardized as FIPS-203 (PDF download).

It is difficult to predict when quantum computers will be able to break classical algorithms. However, it is clear that we need to start migrating to PQC algorithms now, as the next section shows. To get a feeling for the predicted timeline we can look at a NIST report covering the transition to post-quantum cryptography standards. It declares that system with classical crypto should be deprecated after 2030 and disallowed after 2035.

Key exchange vs. digital signatures: different needs, different timelines

In TLS, there are two main cryptographic operations we need to secure:

Key Exchange: This is how the client and server agree on a shared secret to encrypt their communication. If an attacker records encrypted traffic today, they could decrypt it in the future, if they gain access to a quantum computer capable of breaking the key exchange. This makes migrating KEMs to PQC an immediate priority.

Digital Signatures: These are primarily used to authenticate the server (and sometimes the client) via certificates. The authenticity of a server is verified at the time of connection. While important, the risk of an attack today is much lower, because the decision of trusting a server cannot be abused after the fact. Additionally, current PQC signature schemes often come with significant computational overhead and larger key/signature sizes compared to their classical counterparts.

Another significant hurdle in the migration to PQ certificates is the upgrade of root certificates. These certificates have long validity periods and are installed in many devices and operating systems as trust anchors.

Given these differences, the focus for immediate PQC adoption in TLS has been on hybrid key exchange mechanisms. These combine a classical algorithm (such as Elliptic Curve Diffie-Hellman Ephemeral (ECDHE)) with a PQC algorithm (such as ML-KEM). The resulting shared secret is secure as long as at least one of the component algorithms remains unbroken. The X25519MLKEM768 hybrid scheme is the most widely supported one.

State of PQC key exchange mechanisms (KEMs) today

Support for PQC KEMs is rapidly improving across the ecosystem.

Go: The Go standard library's crypto/tls package introduced support for X25519MLKEM768 in version 1.24 (released February 2025). Crucially, it's enabled by default when there is no explicit configuration, i.e., Config.CurvePreferences is nil.

Browsers & OpenSSL: Major browsers like Chrome (version 131, November 2024) and Firefox (version 135, February 2025), as well as OpenSSL (version 3.5.0, April 2025), have also added support for the ML-KEM based hybrid scheme.

Apple is also rolling out support for X25519MLKEM768 in version 26 of their operating systems. Given the proliferation of Apple devices, this will have a significant impact on the global PQC adoption.

For a more detailed overview of the state of PQC in the wider industry, see this blog post by Cloudflare.

Post-quantum KEMs in Kubernetes: an unexpected arrival

So, what does this mean for Kubernetes? Kubernetes components, including the API server and kubelet, are built with Go.

As of Kubernetes v1.33, released in April 2025, the project uses Go 1.24. A quick check of the Kubernetes codebase reveals that Config.CurvePreferences is not explicitly set. This leads to a fascinating conclusion: Kubernetes v1.33, by virtue of using Go 1.24, supports hybrid post-quantum X25519MLKEM768 for TLS connections by default!

You can test this yourself. If you set up a Minikube cluster running Kubernetes v1.33.0, you can connect to the API server using a recent OpenSSL client:

$ minikube start --kubernetes-version=v1.33.0
$ kubectl cluster-info
Kubernetes control plane is running at https://127.0.0.1:<PORT>
$ kubectl config view --minify --raw -o jsonpath=\'{.clusters[0].cluster.certificate-authority-data}\' | base64 -d > ca.crt
$ openssl version
OpenSSL 3.5.0 8 Apr 2025 (Library: OpenSSL 3.5.0 8 Apr 2025)
$ echo -n "Q" | openssl s_client -connect 127.0.0.1:<PORT> -CAfile ca.crt
[...]
Negotiated TLS1.3 group: X25519MLKEM768
[...]
DONE

Lo and behold, the negotiated group is X25519MLKEM768! This is a significant step towards making Kubernetes quantum-safe, seemingly without a major announcement or dedicated KEP (Kubernetes Enhancement Proposal).

The Go version mismatch pitfall

An interesting wrinkle emerged with Go versions 1.23 and 1.24. Go 1.23 included experimental support for a draft version of ML-KEM, identified as X25519Kyber768Draft00. This was also enabled by default if Config.CurvePreferences was nil. Kubernetes v1.32 used Go 1.23. However, Go 1.24 removed the draft support and replaced it with the standardized version X25519MLKEM768.

What happens if a client and server are using mismatched Go versions (one on 1.23, the other on 1.24)? They won't have a common PQC KEM to negotiate, and the handshake will fall back to classical ECC curves (e.g., X25519). How could this happen in practice?

Consider a scenario:

A Kubernetes cluster is running v1.32 (using Go 1.23 and thus X25519Kyber768Draft00). A developer upgrades their kubectl to v1.33, compiled with Go 1.24, only supporting X25519MLKEM768. Now, when kubectl communicates with the v1.32 API server, they no longer share a common PQC algorithm. The connection will downgrade to classical cryptography, silently losing the PQC protection that has been in place. This highlights the importance of understanding the implications of Go version upgrades, and the details of the TLS stack.

Limitations: packet size

One practical consideration with ML-KEM is the size of its public keys with encoded key sizes of around 1.2 kilobytes for ML-KEM-768. This can cause the initial TLS ClientHello message not to fit inside a single TCP/IP packet, given the typical networking constraints (most commonly, the standard Ethernet frame size limit of 1500 bytes). Some TLS libraries or network appliances might not handle this gracefully, assuming the Client Hello always fits in one packet. This issue has been observed in some Kubernetes-related projects and networking components, potentially leading to connection failures when PQC KEMs are used. More details can be found at tldr.fail.

State of Post-Quantum Signatures

While KEMs are seeing broader adoption, PQC digital signatures are further behind in terms of widespread integration into standard toolchains. NIST has published standards for PQC signatures, such as ML-DSA (FIPS-204) and SLH-DSA (FIPS-205). However, implementing these in a way that's broadly usable (e.g., for PQC Certificate Authorities) presents challenges:

Larger Keys and Signatures: PQC signature schemes often have significantly larger public keys and signature sizes compared to classical algorithms like Ed25519 or RSA. For instance, Dilithium2 keys can be 30 times larger than Ed25519 keys, and certificates can be 12 times larger.

Performance: Signing and verification operations can be substantially slower. While some algorithms are on par with classical algorithms, others may have a much higher overhead, sometimes on the order of 10x to 1000x worse performance. To improve this situation, NIST is running a second round of standardization for PQC signatures.

Toolchain Support: Mainstream TLS libraries and CA software do not yet have mature, built-in support for these new signature algorithms. The Go team, for example, has indicated that ML-DSA support is a high priority, but the soonest it might appear in the standard library is Go 1.26 (as of May 2025).

Cloudflare's CIRCL (Cloudflare Interoperable Reusable Cryptographic Library) library implements some PQC signature schemes like variants of Dilithium, and they maintain a fork of Go (cfgo) that integrates CIRCL. Using cfgo, it's possible to experiment with generating certificates signed with PQC algorithms like Ed25519-Dilithium2. However, this requires using a custom Go toolchain and is not yet part of the mainstream Kubernetes or Go distributions.

Conclusion

The journey to a post-quantum secure Kubernetes is underway, and perhaps further along than many realize, thanks to the proactive adoption of ML-KEM in Go. With Kubernetes v1.33, users are already benefiting from hybrid post-quantum key exchange in many TLS connections by default.

However, awareness of potential pitfalls, such as Go version mismatches leading to downgrades and issues with Client Hello packet sizes, is crucial. While PQC for KEMs is becoming a reality, PQC for digital signatures and certificate hierarchies is still in earlier stages of development and adoption for mainstream use. As Kubernetes maintainers and contributors, staying informed about these developments will be key to ensuring the long-term security of the platform.

Navigating Failures in Pods With Devices

Kubernetes is the de facto standard for container orchestration, but when it comes to handling specialized hardware like GPUs and other accelerators, things get a bit complicated. This blog post dives into the challenges of managing failure modes when operating pods with devices in Kubernetes, based on insights from Sergey Kanzhelev and Mrunal Patel's talk at KubeCon NA 2024. You can follow the links to slides and recording.

The AI/ML boom and its impact on Kubernetes

The rise of AI/ML workloads has brought new challenges to Kubernetes. These workloads often rely heavily on specialized hardware, and any device failure can significantly impact performance and lead to frustrating interruptions. As highlighted in the 2024 Llama paper, hardware issues, particularly GPU failures, are a major cause of disruption in AI/ML training. You can also learn how much effort NVIDIA spends on handling devices failures and maintenance in the KubeCon talk by Ryan Hallisey and Piotr Prokop All-Your-GPUs-Are-Belong-to-Us: An Inside Look at NVIDIA's Self-Healing GeForce NOW Infrastructure (recording) as they see 19 remediation requests per 1000 nodes a day! We also see data centers offering spot consumption models and overcommit on power, making device failures commonplace and a part of the business model.

However, Kubernetes’s view on resources is still very static. The resource is either there or not. And if it is there, the assumption is that it will stay there fully functional - Kubernetes lacks good support for handling full or partial hardware failures. These long-existing assumptions combined with the overall complexity of a setup lead to a variety of failure modes, which we discuss here.

Understanding AI/ML workloads

Generally, all AI/ML workloads require specialized hardware, have challenging scheduling requirements, and are expensive when idle. AI/ML workloads typically fall into two categories - training and inference. Here is an oversimplified view of those categories’ characteristics, which are different from traditional workloads like web services:

Training
These workloads are resource-intensive, often consuming entire machines and running as gangs of pods. Training jobs are usually "run to completion" - but that could be days, weeks or even months. Any failure in a single pod can necessitate restarting the entire step across all the pods.
Inference
These workloads are usually long-running or run indefinitely, and can be small enough to consume a subset of a Node’s devices or large enough to span multiple nodes. They often require downloading huge files with the model weights.

These workload types specifically break many past assumptions:

Workload assumptions before and now
BeforeNow
Can get a better CPU and the app will work faster.Require a specific device (or class of devices) to run.
When something doesn’t work, just recreate it.Allocation or reallocation is expensive.
Any node will work. No need to coordinate between Pods.Scheduled in a special way - devices often connected in a cross-node topology.
Each Pod can be plug-and-play replaced if failed.Pods are a part of a larger task. Lifecycle of an entire task depends on each Pod.
Container images are slim and easily available.Container images may be so big that they require special handling.
Long initialization can be offset by slow rollout.Initialization may be long and should be optimized, sometimes across many Pods together.
Compute nodes are commoditized and relatively inexpensive, so some idle time is acceptable.Nodes with specialized hardware can be an order of magnitude more expensive than those without, so idle time is very wasteful.

The existing failure model was relying on old assumptions. It may still work for the new workload types, but it has limited knowledge about devices and is very expensive for them. In some cases, even prohibitively expensive. You will see more examples later in this article.

Why Kubernetes still reigns supreme

This article is not going deeper into the question: why not start fresh for
AI/ML workloads since they are so different from the traditional Kubernetes workloads. Despite many challenges, Kubernetes remains the platform of choice for AI/ML workloads. Its maturity, security, and rich ecosystem of tools make it a compelling option. While alternatives exist, they often lack the years of development and refinement that Kubernetes offers. And the Kubernetes developers are actively addressing the gaps identified in this article and beyond.

The current state of device failure handling

This section outlines different failure modes and the best practices and DIY (Do-It-Yourself) solutions used today. The next session will describe a roadmap of improving things for those failure modes.

Failure modes: K8s infrastructure

In order to understand the failures related to the Kubernetes infrastructure, you need to understand how many moving parts are involved in scheduling a Pod on the node. The sequence of events when the Pod is scheduled in the Node is as follows:

  1. Device plugin is scheduled on the Node
  2. Device plugin is registered with the kubelet via local gRPC
  3. Kubelet uses device plugin to watch for devices and updates capacity of the node
  4. Scheduler places a user Pod on a Node based on the updated capacity
  5. Kubelet asks Device plugin to Allocate devices for a User Pod
  6. Kubelet creates a User Pod with the allocated devices attached to it

This diagram shows some of those actors involved:

The diagram shows relationships between the kubelet, Device plugin, and a user Pod. It shows that kubelet connects to the Device plugin named my-device, kubelet reports the node status with the my-device availability, and the user Pod requesting the 2 of my-device.

As there are so many actors interconnected, every one of them and every connection may experience interruptions. This leads to many exceptional situations that are often considered failures, and may cause serious workload interruptions:

  • Pods failing admission at various stages of its lifecycle
  • Pods unable to run on perfectly fine hardware
  • Scheduling taking unexpectedly long time
The same diagram as one above it, however it has an overlayed orange bang drawings over individual components with the text indicating what can break in that component. Over the kubelet text reads: 'kubelet restart: looses all devices info before re-Watch'. Over the Device plugin text reads: 'device plugin update, evictIon, restart: kubelet cannot Allocate devices or loses all devices state'. Over the user Pod text reads: 'slow pod termination: devices are unavailable'.

The goal for Kubernetes is to make the interruption between these components as reliable as possible. Kubelet already implements retries, grace periods, and other techniques to improve it. The roadmap section goes into details on other edge cases that the Kubernetes project tracks. However, all these improvements only work when these best practices are followed:

  • Configure and restart kubelet and the container runtime (such as containerd or CRI-O) as early as possible to not interrupt the workload.
  • Monitor device plugin health and carefully plan for upgrades.
  • Do not overload the node with less-important workloads to prevent interruption of device plugin and other components.
  • Configure user pods tolerations to handle node readiness flakes.
  • Configure and code graceful termination logic carefully to not block devices for too long.

Another class of Kubernetes infra-related issues is driver-related. With traditional resources like CPU and memory, no compatibility checks between the application and hardware were needed. With special devices like hardware accelerators, there are new failure modes. Device drivers installed on the node:

  • Must match the hardware
  • Be compatible with an app
  • Must work with other drivers (like nccl, etc.)

Best practices for handling driver versions:

  • Monitor driver installer health
  • Plan upgrades of infrastructure and Pods to match the version
  • Have canary deployments whenever possible

Following the best practices in this section and using device plugins and device driver installers from trusted and reliable sources generally eliminate this class of failures. Kubernetes is tracking work to make this space even better.

Failure modes: device failed

There is very little handling of device failure in Kubernetes today. Device plugins report the device failure only by changing the count of allocatable devices. And Kubernetes relies on standard mechanisms like liveness probes or container failures to allow Pods to communicate the failure condition to the kubelet. However, Kubernetes does not correlate device failures with container crashes and does not offer any mitigation beyond restarting the container while being attached to the same device.

This is why many plugins and DIY solutions exist to handle device failures based on various signals.

Health controller

In many cases a failed device will result in unrecoverable and very expensive nodes doing nothing. A simple DIY solution is a node health controller. The controller could compare the device allocatable count with the capacity and if the capacity is greater, it starts a timer. Once the timer reaches a threshold, the health controller kills and recreates a node.

There are problems with the health controller approach:

  • Root cause of the device failure is typically not known
  • The controller is not workload aware
  • Failed device might not be in use and you want to keep other devices running
  • The detection may be too slow as it is very generic
  • The node may be part of a bigger set of nodes and simply cannot be deleted in isolation without other nodes

There are variations of the health controller solving some of the problems above. The overall theme here though is that to best handle failed devices, you need customized handling for the specific workload. Kubernetes doesn’t yet offer enough abstraction to express how critical the device is for a node, for the cluster, and for the Pod it is assigned to.

Pod failure policy

Another DIY approach for device failure handling is a per-pod reaction on a failed device. This approach is applicable for training workloads that are implemented as Jobs.

Pod can define special error codes for device failures. For example, whenever unexpected device behavior is encountered, Pod exits with a special exit code. Then the Pod failure policy can handle the device failure in a special way. Read more on Handling retriable and non-retriable pod failures with Pod failure policy

There are some problems with the Pod failure policy approach for Jobs:

  • There is no well-known device failed condition, so this approach does not work for the generic Pod case
  • Error codes must be coded carefully and in some cases are hard to guarantee.
  • Only works with Jobs with restartPolicy: Never, due to the limitation of a pod failure policy feature.

So, this solution has limited applicability.

Custom pod watcher

A little more generic approach is to implement the Pod watcher as a DIY solution or use some third party tools offering this functionality. The pod watcher is most often used to handle device failures for inference workloads.

Since Kubernetes just keeps a pod assigned to a device, even if the device is reportedly unhealthy, the idea is to detect this situation with the pod watcher and apply some remediation. It often involves obtaining device health status and its mapping to the Pod using Pod Resources API on the node. If a device fails, it can then delete the attached Pod as a remediation. The replica set will handle the Pod recreation on a healthy device.

The other reasons to implement this watcher:

  • Without it, the Pod will keep being assigned to the failed device forever.
  • There is no descheduling for a pod with restartPolicy=Always.
  • There are no built-in controllers that delete Pods in CrashLoopBackoff.

Problems with the custom pod watcher:

  • The signal for the pod watcher is expensive to get, and involves some privileged actions.
  • It is a custom solution and it assumes the importance of a device for a Pod.
  • The pod watcher relies on external controllers to reschedule a Pod.

There are more variations of DIY solutions for handling device failures or upcoming maintenance. Overall, Kubernetes has enough extension points to implement these solutions. However, some extension points require higher privilege than users may be comfortable with or are too disruptive. The roadmap section goes into more details on specific improvements in handling the device failures.

Failure modes: container code failed

When the container code fails or something bad happens with it, like out of memory conditions, Kubernetes knows how to handle those cases. There is either the restart of a container, or a crash of a Pod if it has restartPolicy: Never and scheduling it on another node. Kubernetes has limited expressiveness on what is a failure (for example, non-zero exit code or liveness probe failure) and how to react on such a failure (mostly either Always restart or immediately fail the Pod).

This level of expressiveness is often not enough for the complicated AI/ML workloads. AI/ML pods are better rescheduled locally or even in-place as that would save on image pulling time and device allocation. AI/ML pods are often interconnected and need to be restarted together. This adds another level of complexity and optimizing it often brings major savings in running AI/ML workloads.

There are various DIY solutions to handle Pod failures orchestration. The most typical one is to wrap a main executable in a container by some orchestrator. And this orchestrator will be able to restart the main executable whenever the job needs to be restarted because some other pod has failed.

Solutions like this are very fragile and elaborate. They are often worth the money saved comparing to a regular JobSet delete/recreate cycle when used in large training jobs. Making these solutions less fragile and more streamlined by developing new hooks and extension points in Kubernetes will make it easy to apply to smaller jobs, benefiting everybody.

Failure modes: device degradation

Not all device failures are terminal for the overall workload or batch job. As the hardware stack gets more and more complex, misconfiguration on one of the hardware stack layers, or driver failures, may result in devices that are functional, but lagging on performance. One device that is lagging behind can slow down the whole training job.

We see reports of such cases more and more often. Kubernetes has no way to express this type of failures today and since it is the newest type of failure mode, there is not much of a best practice offered by hardware vendors for detection and third party tooling for remediation of these situations.

Typically, these failures are detected based on observed workload characteristics. For example, the expected speed of AI/ML training steps on particular hardware. Remediation for those issues is highly depend on a workload needs.

Roadmap

As outlined in a section above, Kubernetes offers a lot of extension points which are used to implement various DIY solutions. The space of AI/ML is developing very fast, with changing requirements and usage patterns. SIG Node is taking a measured approach of enabling more extension points to implement the workload-specific scenarios over introduction of new semantics to support specific scenarios. This means prioritizing making information about failures readily available over implementing automatic remediations for those failures that might only be suitable for a subset of workloads.

This approach ensures there are no drastic changes for workload handling which may break existing, well-oiled DIY solutions or experiences with the existing more traditional workloads.

Many error handling techniques used today work for AI/ML, but are very expensive. SIG Node will invest in extension points to make those cheaper, with the understanding that the price cutting for AI/ML is critical.

The following is the set of specific investments we envision for various failure modes.

Roadmap for failure modes: K8s infrastructure

The area of Kubernetes infrastructure is the easiest to understand and very important to make right for the upcoming transition from Device Plugins to DRA. SIG Node is tracking many work items in this area, most notably the following:

Basically, every interaction of Kubernetes components must be reliable via either the kubelet improvements or the best practices in plugins development and deployment.

Roadmap for failure modes: device failed

For the device failures some patterns are already emerging in common scenarios that Kubernetes can support. However, the very first step is to make information about failed devices available easier. The very first step here is the work in KEP 4680 (Add Resource Health Status to the Pod Status for Device Plugin and DRA).

Longer term ideas include to be tested:

  • Integrate device failures into Pod Failure Policy.
  • Node-local retry policies, enabling pod failure policies for Pods with restartPolicy=OnFailure and possibly beyond that.
  • Ability to deschedule pod, including with the restartPolicy: Always, so it can get a new device allocated.
  • Add device health to the ResourceSlice used to represent devices in DRA, rather than simply withdrawing an unhealthy device from the ResourceSlice.

Roadmap for failure modes: container code failed

The main improvements to handle container code failures for AI/ML workloads are all targeting cheaper error handling and recovery. The cheapness is mostly coming from reuse of pre-allocated resources as much as possible. From reusing the Pods by restarting containers in-place, to node local restart of containers instead of rescheduling whenever possible, to snapshotting support, and re-scheduling prioritizing the same node to save on image pulls.

Consider this scenario: A big training job needs 512 Pods to run. And one of the pods failed. It means that all Pods need to be interrupted and synced up to restart the failed step. The most efficient way to achieve this generally is to reuse as many Pods as possible by restarting them in-place, while replacing the failed pod to clear up the error from it. Like demonstrated in this picture:

The picture shows 512 pod, most ot them are green and have a recycle sign next to them indicating that they can be reused, and one Pod drawn in red, and a new green replacement Pod next to it indicating that it needs to be replaced.

It is possible to implement this scenario, but all solutions implementing it are fragile due to lack of certain extension points in Kubernetes. Adding these extension points to implement this scenario is on the Kubernetes roadmap.

Roadmap for failure modes: device degradation

There is very little done in this area - there is no clear detection signal, very limited troubleshooting tooling, and no built-in semantics to express the "degraded" device on Kubernetes. There has been discussion of adding data on device performance or degradation in the ResourceSlice used by DRA to represent devices, but it is not yet clearly defined. There are also projects like node-healthcheck-operator that can be used for some scenarios.

We expect developments in this area from hardware vendors and cloud providers, and we expect to see mostly DIY solutions in the near future. As more users get exposed to AI/ML workloads, this is a space needing feedback on patterns used here.

Join the conversation

The Kubernetes community encourages feedback and participation in shaping the future of device failure handling. Join SIG Node and contribute to the ongoing discussions!

This blog post provides a high-level overview of the challenges and future directions for device failure management in Kubernetes. By addressing these issues, Kubernetes can solidify its position as the leading platform for AI/ML workloads, ensuring resilience and reliability for applications that depend on specialized hardware.

Image Compatibility In Cloud Native Environments

In industries where systems must run very reliably and meet strict performance criteria such as telecommunication, high-performance or AI computing, containerized applications often need specific operating system configuration or hardware presence. It is common practice to require the use of specific versions of the kernel, its configuration, device drivers, or system components. Despite the existence of the Open Container Initiative (OCI), a governing community to define standards and specifications for container images, there has been a gap in expression of such compatibility requirements. The need to address this issue has led to different proposals and, ultimately, an implementation in Kubernetes' Node Feature Discovery (NFD).

NFD is an open source Kubernetes project that automatically detects and reports hardware and system features of cluster nodes. This information helps users to schedule workloads on nodes that meet specific system requirements, which is especially useful for applications with strict hardware or operating system dependencies.

The need for image compatibility specification

Dependencies between containers and host OS

A container image is built on a base image, which provides a minimal runtime environment, often a stripped-down Linux userland, completely empty or distroless. When an application requires certain features from the host OS, compatibility issues arise. These dependencies can manifest in several ways:

  • Drivers: Host driver versions must match the supported range of a library version inside the container to avoid compatibility problems. Examples include GPUs and network drivers.
  • Libraries or Software: The container must come with a specific version or range of versions for a library or software to run optimally in the environment. Examples from high performance computing are MPI, EFA, or Infiniband.
  • Kernel Modules or Features: Specific kernel features or modules must be present. Examples include having support of write protected huge page faults, or the presence of VFIO
  • And more…

While containers in Kubernetes are the most likely unit of abstraction for these needs, the definition of compatibility can extend further to include other container technologies such as Singularity and other OCI artifacts such as binaries from a spack binary cache.

Multi-cloud and hybrid cloud challenges

Containerized applications are deployed across various Kubernetes distributions and cloud providers, where different host operating systems introduce compatibility challenges. Often those have to be pre-configured before workload deployment or are immutable. For instance, different cloud providers will include different operating systems like:

  • RHCOS/RHEL
  • Photon OS
  • Amazon Linux 2
  • Container-Optimized OS
  • Azure Linux OS
  • And more...

Each OS comes with unique kernel versions, configurations, and drivers, making compatibility a non-trivial issue for applications requiring specific features. It must be possible to quickly assess a container for its suitability to run on any specific environment.

Image compatibility initiative

An effort was made within the Open Containers Initiative Image Compatibility working group to introduce a standard for image compatibility metadata. A specification for compatibility would allow container authors to declare required host OS features, making compatibility requirements discoverable and programmable. The specification implemented in Kubernetes Node Feature Discovery is one of the discussed proposals. It aims to:

  • Define a structured way to express compatibility in OCI image manifests.
  • Support a compatibility specification alongside container images in image registries.
  • Allow automated validation of compatibility before scheduling containers.

The concept has since been implemented in the Kubernetes Node Feature Discovery project.

Implementation in Node Feature Discovery

The solution integrates compatibility metadata into Kubernetes via NFD features and the NodeFeatureGroup API. This interface enables the user to match containers to nodes based on exposing features of hardware and software, allowing for intelligent scheduling and workload optimization.

Compatibility specification

The compatibility specification is a structured list of compatibility objects containing Node Feature Groups. These objects define image requirements and facilitate validation against host nodes. The feature requirements are described by using the list of available features from the NFD project. The schema has the following structure:

  • version (string) - Specifies the API version.
  • compatibilities (array of objects) - List of compatibility sets.
    • rules (object) - Specifies NodeFeatureGroup to define image requirements.
    • weight (int, optional) - Node affinity weight.
    • tag (string, optional) - Categorization tag.
    • description (string, optional) - Short description.

An example might look like the following:

version: v1alpha1
compatibilities:
- description: "My image requirements"
  rules:
  - name: "kernel and cpu"
    matchFeatures:
    - feature: kernel.loadedmodule
      matchExpressions:
        vfio-pci: {op: Exists}
    - feature: cpu.model
      matchExpressions:
        vendor_id: {op: In, value: ["Intel", "AMD"]}
  - name: "one of available nics"
    matchAny:
    - matchFeatures:
      - feature: pci.device
        matchExpressions:
          vendor: {op: In, value: ["0eee"]}
          class: {op: In, value: ["0200"]}
    - matchFeatures:
      - feature: pci.device
        matchExpressions:
          vendor: {op: In, value: ["0fff"]}
          class: {op: In, value: ["0200"]}

Client implementation for node validation

To streamline compatibility validation, we implemented a client tool that allows for node validation based on an image's compatibility artifact. In this workflow, the image author would generate a compatibility artifact that points to the image it describes in a registry via the referrers API. When a need arises to assess the fit of an image to a host, the tool can discover the artifact and verify compatibility of an image to a node before deployment. The client can validate nodes both inside and outside a Kubernetes cluster, extending the utility of the tool beyond the single Kubernetes use case. In the future, image compatibility could play a crucial role in creating specific workload profiles based on image compatibility requirements, aiding in more efficient scheduling. Additionally, it could potentially enable automatic node configuration to some extent, further optimizing resource allocation and ensuring seamless deployment of specialized workloads.

Examples of usage

  1. Define image compatibility metadata

    A container image can have metadata that describes its requirements based on features discovered from nodes, like kernel modules or CPU models. The previous compatibility specification example in this article exemplified this use case.

  2. Attach the artifact to the image

    The image compatibility specification is stored as an OCI artifact. You can attach this metadata to your container image using the oras tool. The registry only needs to support OCI artifacts, support for arbitrary types is not required. Keep in mind that the container image and the artifact must be stored in the same registry. Use the following command to attach the artifact to the image:

    oras attach \
    --artifact-type application/vnd.nfd.image-compatibility.v1alpha1 <image-url> \ 
    <path-to-spec>.yaml:application/vnd.nfd.image-compatibility.spec.v1alpha1+yaml
    
  3. Validate image compatibility

    After attaching the compatibility specification, you can validate whether a node meets the image's requirements. This validation can be done using the nfd client:

    nfd compat validate-node --image <image-url>
    
  4. Read the output from the client

    Finally you can read the report generated by the tool or use your own tools to act based on the generated JSON report.

    validate-node command output

Conclusion

The addition of image compatibility to Kubernetes through Node Feature Discovery underscores the growing importance of addressing compatibility in cloud native environments. It is only a start, as further work is needed to integrate compatibility into scheduling of workloads within and outside of Kubernetes. However, by integrating this feature into Kubernetes, mission-critical workloads can now define and validate host OS requirements more efficiently. Moving forward, the adoption of compatibility metadata within Kubernetes ecosystems will significantly enhance the reliability and performance of specialized containerized applications, ensuring they meet the stringent requirements of industries like telecommunications, high-performance computing or any environment that requires special hardware or host OS configuration.

Get involved

Join the Kubernetes Node Feature Discovery project if you're interested in getting involved with the design and development of Image Compatibility API and tools. We always welcome new contributors.

Changes to Kubernetes Slack

UPDATE: We’ve received notice from Salesforce that our Slack workspace WILL NOT BE DOWNGRADED on June 20th. Stand by for more details, but for now, there is no urgency to back up private channels or direct messages.

Kubernetes Slack will lose its special status and will be changing into a standard free Slack on June 20, 2025. Sometime later this year, our community may move to a new platform. If you are responsible for a channel or private channel, or a member of a User Group, you will need to take some actions as soon as you can.

For the last decade, Slack has supported our project with a free customized enterprise account. They have let us know that they can no longer do so, particularly since our Slack is one of the largest and more active ones on the platform. As such, they will be downgrading it to a standard free Slack while we decide on, and implement, other options.

On Friday, June 20, we will be subject to the feature limitations of free Slack. The primary ones which will affect us will be only retaining 90 days of history, and having to disable several apps and workflows which we are currently using. The Slack Admin team will do their best to manage these limitations.

Responsible channel owners, members of private channels, and members of User Groups should take some actions to prepare for the upgrade and preserve information as soon as possible.

The CNCF Projects Staff have proposed that our community look at migrating to Discord. Because of existing issues where we have been pushing the limits of Slack, they have already explored what a Kubernetes Discord would look like. Discord would allow us to implement new tools and integrations which would help the community, such as GitHub group membership synchronization. The Steering Committee will discuss and decide on our future platform.

Please see our FAQ, and check the kubernetes-dev mailing list and the #announcements channel for further news. If you have specific feedback on our Slack status join the discussion on GitHub.

Enhancing Kubernetes Event Management with Custom Aggregation

Kubernetes Events provide crucial insights into cluster operations, but as clusters grow, managing and analyzing these events becomes increasingly challenging. This blog post explores how to build custom event aggregation systems that help engineering teams better understand cluster behavior and troubleshoot issues more effectively.

The challenge with Kubernetes events

In a Kubernetes cluster, events are generated for various operations - from pod scheduling and container starts to volume mounts and network configurations. While these events are invaluable for debugging and monitoring, several challenges emerge in production environments:

  1. Volume: Large clusters can generate thousands of events per minute
  2. Retention: Default event retention is limited to one hour
  3. Correlation: Related events from different components are not automatically linked
  4. Classification: Events lack standardized severity or category classifications
  5. Aggregation: Similar events are not automatically grouped

To learn more about Events in Kubernetes, read the Event API reference.

Real-World value

Consider a production environment with tens of microservices where the users report intermittent transaction failures:

Traditional event aggregation process: Engineers are wasting hours sifting through thousands of standalone events spread across namespaces. By the time they look into it, the older events have long since purged, and correlating pod restarts to node-level issues is practically impossible.

With its event aggregation in its custom events: The system groups events across resources, instantly surfacing correlation patterns such as volume mount timeouts before pod restarts. History indicates it occurred during past record traffic spikes, highlighting a storage scalability issue in minutes rather than hours.

The benefit of this approach is that organizations that implement it commonly cut down their troubleshooting time significantly along with increasing the reliability of systems by detecting patterns early.

Building an Event aggregation system

This post explores how to build a custom event aggregation system that addresses these challenges, aligned to Kubernetes best practices. I've picked the Go programming language for my example.

Architecture overview

This event aggregation system consists of three main components:

  1. Event Watcher: Monitors the Kubernetes API for new events
  2. Event Processor: Processes, categorizes, and correlates events
  3. Storage Backend: Stores processed events for longer retention

Here's a sketch for how to implement the event watcher:

package main

import (
    "context"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    eventsv1 "k8s.io/api/events/v1"
)

type EventWatcher struct {
    clientset *kubernetes.Clientset
}

func NewEventWatcher(config *rest.Config) (*EventWatcher, error) {
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        return nil, err
    }
    return &EventWatcher{clientset: clientset}, nil
}

func (w *EventWatcher) Watch(ctx context.Context) (<-chan *eventsv1.Event, error) {
    events := make(chan *eventsv1.Event)
    
    watcher, err := w.clientset.EventsV1().Events("").Watch(ctx, metav1.ListOptions{})
    if err != nil {
        return nil, err
    }

    go func() {
        defer close(events)
        for {
            select {
            case event := <-watcher.ResultChan():
                if e, ok := event.Object.(*eventsv1.Event); ok {
                    events <- e
                }
            case <-ctx.Done():
                watcher.Stop()
                return
            }
        }
    }()

    return events, nil
}

Event processing and classification

The event processor enriches events with additional context and classification:

type EventProcessor struct {
    categoryRules []CategoryRule
    correlationRules []CorrelationRule
}

type ProcessedEvent struct {
    Event     *eventsv1.Event
    Category  string
    Severity  string
    CorrelationID string
    Metadata  map[string]string
}

func (p *EventProcessor) Process(event *eventsv1.Event) *ProcessedEvent {
    processed := &ProcessedEvent{
        Event:    event,
        Metadata: make(map[string]string),
    }
    
    // Apply classification rules
    processed.Category = p.classifyEvent(event)
    processed.Severity = p.determineSeverity(event)
    
    // Generate correlation ID for related events
    processed.CorrelationID = p.correlateEvent(event)
    
    // Add useful metadata
    processed.Metadata = p.extractMetadata(event)
    
    return processed
}

Implementing Event correlation

One of the key features you could implement is a way of correlating related Events. Here's an example correlation strategy:

func (p *EventProcessor) correlateEvent(event *eventsv1.Event) string {
    // Correlation strategies:
    // 1. Time-based: Events within a time window
    // 2. Resource-based: Events affecting the same resource
    // 3. Causation-based: Events with cause-effect relationships

    correlationKey := generateCorrelationKey(event)
    return correlationKey
}

func generateCorrelationKey(event *eventsv1.Event) string {
    // Example: Combine namespace, resource type, and name
    return fmt.Sprintf("%s/%s/%s",
        event.InvolvedObject.Namespace,
        event.InvolvedObject.Kind,
        event.InvolvedObject.Name,
    )
}

Event storage and retention

For long-term storage and analysis, you'll probably want a backend that supports:

  • Efficient querying of large event volumes
  • Flexible retention policies
  • Support for aggregation queries

Here's a sample storage interface:

type EventStorage interface {
    Store(context.Context, *ProcessedEvent) error
    Query(context.Context, EventQuery) ([]ProcessedEvent, error)
    Aggregate(context.Context, AggregationParams) ([]EventAggregate, error)
}

type EventQuery struct {
    TimeRange     TimeRange
    Categories    []string
    Severity      []string
    CorrelationID string
    Limit         int
}

type AggregationParams struct {
    GroupBy    []string
    TimeWindow string
    Metrics    []string
}

Good practices for Event management

  1. Resource Efficiency

    • Implement rate limiting for event processing
    • Use efficient filtering at the API server level
    • Batch events for storage operations
  2. Scalability

    • Distribute event processing across multiple workers
    • Use leader election for coordination
    • Implement backoff strategies for API rate limits
  3. Reliability

    • Handle API server disconnections gracefully
    • Buffer events during storage backend unavailability
    • Implement retry mechanisms with exponential backoff

Advanced features

Pattern detection

Implement pattern detection to identify recurring issues:

type PatternDetector struct {
    patterns map[string]*Pattern
    threshold int
}

func (d *PatternDetector) Detect(events []ProcessedEvent) []Pattern {
    // Group similar events
    groups := groupSimilarEvents(events)
    
    // Analyze frequency and timing
    patterns := identifyPatterns(groups)
    
    return patterns
}

func groupSimilarEvents(events []ProcessedEvent) map[string][]ProcessedEvent {
    groups := make(map[string][]ProcessedEvent)
    
    for _, event := range events {
        // Create similarity key based on event characteristics
        similarityKey := fmt.Sprintf("%s:%s:%s",
            event.Event.Reason,
            event.Event.InvolvedObject.Kind,
            event.Event.InvolvedObject.Namespace,
        )
        
        // Group events with the same key
        groups[similarityKey] = append(groups[similarityKey], event)
    }
    
    return groups
}


func identifyPatterns(groups map[string][]ProcessedEvent) []Pattern {
    var patterns []Pattern
    
    for key, events := range groups {
        // Only consider groups with enough events to form a pattern
        if len(events) < 3 {
            continue
        }
        
        // Sort events by time
        sort.Slice(events, func(i, j int) bool {
            return events[i].Event.LastTimestamp.Time.Before(events[j].Event.LastTimestamp.Time)
        })
        
        // Calculate time range and frequency
        firstSeen := events[0].Event.FirstTimestamp.Time
        lastSeen := events[len(events)-1].Event.LastTimestamp.Time
        duration := lastSeen.Sub(firstSeen).Minutes()
        
        var frequency float64
        if duration > 0 {
            frequency = float64(len(events)) / duration
        }
        
        // Create a pattern if it meets threshold criteria
        if frequency > 0.5 { // More than 1 event per 2 minutes
            pattern := Pattern{
                Type:         key,
                Count:        len(events),
                FirstSeen:    firstSeen,
                LastSeen:     lastSeen,
                Frequency:    frequency,
                EventSamples: events[:min(3, len(events))], // Keep up to 3 samples
            }
            patterns = append(patterns, pattern)
        }
    }
    
    return patterns
}

With this implementation, the system can identify recurring patterns such as node pressure events, pod scheduling failures, or networking issues that occur with a specific frequency.

Real-time alerts

The following example provides a starting point for building an alerting system based on event patterns. It is not a complete solution but a conceptual sketch to illustrate the approach.

type AlertManager struct {
    rules []AlertRule
    notifiers []Notifier
}

func (a *AlertManager) EvaluateEvents(events []ProcessedEvent) {
    for _, rule := range a.rules {
        if rule.Matches(events) {
            alert := rule.GenerateAlert(events)
            a.notify(alert)
        }
    }
}

Conclusion

A well-designed event aggregation system can significantly improve cluster observability and troubleshooting capabilities. By implementing custom event processing, correlation, and storage, operators can better understand cluster behavior and respond to issues more effectively.

The solutions presented here can be extended and customized based on specific requirements while maintaining compatibility with the Kubernetes API and following best practices for scalability and reliability.

Next steps

Future enhancements could include:

  • Machine learning for anomaly detection
  • Integration with popular observability platforms
  • Custom event APIs for application-specific events
  • Enhanced visualization and reporting capabilities

For more information on Kubernetes events and custom controllers, refer to the official Kubernetes documentation.

Introducing Gateway API Inference Extension

Modern generative AI and large language model (LLM) services create unique traffic-routing challenges on Kubernetes. Unlike typical short-lived, stateless web requests, LLM inference sessions are often long-running, resource-intensive, and partially stateful. For example, a single GPU-backed model server may keep multiple inference sessions active and maintain in-memory token caches.

Traditional load balancers focused on HTTP path or round-robin lack the specialized capabilities needed for these workloads. They also don’t account for model identity or request criticality (e.g., interactive chat vs. batch jobs). Organizations often patch together ad-hoc solutions, but a standardized approach is missing.

Gateway API Inference Extension

Gateway API Inference Extension was created to address this gap by building on the existing Gateway API, adding inference-specific routing capabilities while retaining the familiar model of Gateways and HTTPRoutes. By adding an inference extension to your existing gateway, you effectively transform it into an Inference Gateway, enabling you to self-host GenAI/LLMs with a “model-as-a-service” mindset.

The project’s goal is to improve and standardize routing to inference workloads across the ecosystem. Key objectives include enabling model-aware routing, supporting per-request criticalities, facilitating safe model roll-outs, and optimizing load balancing based on real-time model metrics. By achieving these, the project aims to reduce latency and improve accelerator (GPU) utilization for AI workloads.

How it works

The design introduces two new Custom Resources (CRDs) with distinct responsibilities, each aligning with a specific user persona in the AI/ML serving workflow​:

Resource Model
  1. InferencePool Defines a pool of pods (model servers) running on shared compute (e.g., GPU nodes). The platform admin can configure how these pods are deployed, scaled, and balanced. An InferencePool ensures consistent resource usage and enforces platform-wide policies. An InferencePool is similar to a Service but specialized for AI/ML serving needs and aware of the model-serving protocol.

  2. InferenceModel A user-facing model endpoint managed by AI/ML owners. It maps a public name (e.g., "gpt-4-chat") to the actual model within an InferencePool. This lets workload owners specify which models (and optional fine-tuning) they want served, plus a traffic-splitting or prioritization policy.

In summary, the InferenceModel API lets AI/ML owners manage what is served, while the InferencePool lets platform operators manage where and how it’s served.

Request flow

The flow of a request builds on the Gateway API model (Gateways and HTTPRoutes) with one or more extra inference-aware steps (extensions) in the middle. Here’s a high-level example of the request flow with the Endpoint Selection Extension (ESE):

Request Flow
  1. Gateway Routing
    A client sends a request (e.g., an HTTP POST to /completions). The Gateway (like Envoy) examines the HTTPRoute and identifies the matching InferencePool backend.

  2. Endpoint Selection
    Instead of simply forwarding to any available pod, the Gateway consults an inference-specific routing extension— the Endpoint Selection Extension—to pick the best of the available pods. This extension examines live pod metrics (queue lengths, memory usage, loaded adapters) to choose the ideal pod for the request.

  3. Inference-Aware Scheduling
    The chosen pod is the one that can handle the request with the lowest latency or highest efficiency, given the user’s criticality or resource needs. The Gateway then forwards traffic to that specific pod.

Endpoint Extension Scheduling

This extra step provides a smarter, model-aware routing mechanism that still feels like a normal single request to the client. Additionally, the design is extensible—any Inference Gateway can be enhanced with additional inference-specific extensions to handle new routing strategies, advanced scheduling logic, or specialized hardware needs. As the project continues to grow, contributors are encouraged to develop new extensions that are fully compatible with the same underlying Gateway API model, further expanding the possibilities for efficient and intelligent GenAI/LLM routing.

Benchmarks

We evaluated ​this extension against a standard Kubernetes Service for a vLLM‐based model serving deployment. The test environment consisted of multiple H100 (80 GB) GPU pods running vLLM (version 1) on a Kubernetes cluster, with 10 Llama2 model replicas. The Latency Profile Generator (LPG) tool was used to generate traffic and measure throughput, latency, and other metrics. The ShareGPT dataset served as the workload, and traffic was ramped from 100 Queries per Second (QPS) up to 1000 QPS.

Key results

Endpoint Extension Scheduling
  • Comparable Throughput: Throughout the tested QPS range, the ESE delivered throughput roughly on par with a standard Kubernetes Service.

  • Lower Latency:

    • Per‐Output‐Token Latency: The ​ESE showed significantly lower p90 latency at higher QPS (500+), indicating that its model-aware routing decisions reduce queueing and resource contention as GPU memory approaches saturation.
    • Overall p90 Latency: Similar trends emerged, with the ​ESE reducing end‐to‐end tail latencies compared to the baseline, particularly as traffic increased beyond 400–500 QPS.

These results suggest that this extension's model‐aware routing significantly reduced latency for GPU‐backed LLM workloads. By dynamically selecting the least‐loaded or best‐performing model server, it avoids hotspots that can appear when using traditional load balancing methods for large, long‐running inference requests.

Roadmap

As the Gateway API Inference Extension heads toward GA, planned features include:

  1. Prefix-cache aware load balancing for remote caches
  2. LoRA adapter pipelines for automated rollout
  3. Fairness and priority between workloads in the same criticality band
  4. HPA support for scaling based on aggregate, per-model metrics
  5. Support for large multi-modal inputs/outputs
  6. Additional model types (e.g., diffusion models)
  7. Heterogeneous accelerators (serving on multiple accelerator types with latency- and cost-aware load balancing)
  8. Disaggregated serving for independently scaling pools

Summary

By aligning model serving with Kubernetes-native tooling, Gateway API Inference Extension aims to simplify and standardize how AI/ML traffic is routed. With model-aware routing, criticality-based prioritization, and more, it helps ops teams deliver the right LLM services to the right users—smoothly and efficiently.

Ready to learn more? Visit the project docs to dive deeper, give an Inference Gateway extension a try with a few simple steps, and get involved if you’re interested in contributing to the project!

Start Sidecar First: How To Avoid Snags

From the Kubernetes Multicontainer Pods: An Overview blog post you know what their job is, what are the main architectural patterns, and how they are implemented in Kubernetes. The main thing I’ll cover in this article is how to ensure that your sidecar containers start before the main app. It’s more complicated than you might think!

A gentle refresher

I'd just like to remind readers that the v1.29.0 release of Kubernetes added native support for sidecar containers, which can now be defined within the .spec.initContainers field, but with restartPolicy: Always. You can see that illustrated in the following example Pod manifest snippet:

initContainers:
  - name: logshipper
    image: alpine:latest
    restartPolicy: Always # this is what makes it a sidecar container
    command: ['sh', '-c', 'tail -F /opt/logs.txt']
    volumeMounts:
    - name: data
        mountPath: /opt

What are the specifics of defining sidecars with a .spec.initContainers block, rather than as a legacy multi-container pod with multiple .spec.containers? Well, all .spec.initContainers are always launched before the main application. If you define Kubernetes-native sidecars, those are terminated after the main application. Furthermore, when used with Jobs, a sidecar container should still be alive and could potentially even restart after the owning Job is complete; Kubernetes-native sidecar containers do not block pod completion.

To learn more, you can also read the official Pod sidecar containers tutorial.

The problem

Now you know that defining a sidecar with this native approach will always start it before the main application. From the kubelet source code, it's visible that this often means being started almost in parallel, and this is not always what an engineer wants to achieve. What I'm really interested in is whether I can delay the start of the main application until the sidecar is not just started, but fully running and ready to serve. It might be a bit tricky because the problem with sidecars is there’s no obvious success signal, contrary to init containers - designed to run only for a specified period of time. With an init container, exit status 0 is unambiguously "I succeeded". With a sidecar, there are lots of points at which you can say "a thing is running". Starting one container only after the previous one is ready is part of a graceful deployment strategy, ensuring proper sequencing and stability during startup. It’s also actually how I’d expect sidecar containers to work as well, to cover the scenario where the main application is dependent on the sidecar. For example, it may happen that an app errors out if the sidecar isn’t available to serve requests (e.g., logging with DataDog). Sure, one could change the application code (and it would actually be the “best practice” solution), but sometimes they can’t - and this post focuses on this use case.

I'll explain some ways that you might try, and show you what approaches will really work.

Readiness probe

To check whether Kubernetes native sidecar delays the start of the main application until the sidecar is ready, let’s simulate a short investigation. Firstly, I’ll simulate a sidecar container which will never be ready by implementing a readiness probe which will never succeed. As a reminder, a readiness probe checks if the container is ready to start accepting traffic and therefore, if the pod can be used as a backend for services.

(Unlike standard init containers, sidecar containers can have probes so that the kubelet can supervise the sidecar and intervene if there are problems. For example, restarting a sidecar container if it fails a health check.)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: alpine:latest
          command: ["sh", "-c", "sleep 3600"]
      initContainers:
        - name: nginx
          image: nginx:latest
          restartPolicy: Always
          ports:
            - containerPort: 80
              protocol: TCP
          readinessProbe:
            exec:
              command:
              - /bin/sh
              - -c
              - exit 1 # this command always fails, keeping the container "Not Ready"
            periodSeconds: 5
      volumes:
        - name: data
          emptyDir: {}

The result is:

controlplane $ kubectl get pods -w
NAME                    READY   STATUS    RESTARTS   AGE
myapp-db5474f45-htgw5   1/2     Running   0          9m28s

controlplane $ kubectl describe pod myapp-db5474f45-htgw5 
Name:             myapp-db5474f45-htgw5
Namespace:        default
(...)
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  17s               default-scheduler  Successfully assigned default/myapp-db5474f45-htgw5 to node01
  Normal   Pulling    16s               kubelet            Pulling image "nginx:latest"
  Normal   Pulled     16s               kubelet            Successfully pulled image "nginx:latest" in 163ms (163ms including waiting). Image size: 72080558 bytes.
  Normal   Created    16s               kubelet            Created container nginx
  Normal   Started    16s               kubelet            Started container nginx
  Normal   Pulling    15s               kubelet            Pulling image "alpine:latest"
  Normal   Pulled     15s               kubelet            Successfully pulled image "alpine:latest" in 159ms (160ms including waiting). Image size: 3652536 bytes.
  Normal   Created    15s               kubelet            Created container myapp
  Normal   Started    15s               kubelet            Started container myapp
  Warning  Unhealthy  1s (x6 over 15s)  kubelet            Readiness probe failed:

From these logs it’s evident that only one container is ready - and I know it can’t be the sidecar, because I’ve defined it so it’ll never be ready (you can also check container statuses in kubectl get pod -o json). I also saw that myapp has been started before the sidecar is ready. That was not the result I wanted to achieve; in this case, the main app container has a hard dependency on its sidecar.

Maybe a startup probe?

To ensure that the sidecar is ready before the main app container starts, I can define a startupProbe. It will delay the start of the main container until the command is successfully executed (returns 0 exit status). If you’re wondering why I’ve added it to my initContainer, let’s analyse what happens If I’d added it to myapp container. I wouldn’t have guaranteed the probe would run before the main application code - and this one, can potentially error out without the sidecar being up and running.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: alpine:latest
          command: ["sh", "-c", "sleep 3600"]
      initContainers:
        - name: nginx
          image: nginx:latest
          ports:
            - containerPort: 80
              protocol: TCP
          restartPolicy: Always
          startupProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 30
            failureThreshold: 10
            timeoutSeconds: 20
      volumes:
        - name: data
          emptyDir: {}

This results in 2/2 containers being ready and running, and from events, it can be inferred that the main application started only after nginx had already been started. But to confirm whether it waited for the sidecar readiness, let’s change the startupProbe to the exec type of command:

startupProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - sleep 15

and run kubectl get pods -w to watch in real time whether the readiness of both containers only changes after a 15 second delay. Again, events confirm the main application starts after the sidecar. That means that using the startupProbe with a correct startupProbe.httpGet request helps to delay the main application start until the sidecar is ready. It’s not optimal, but it works.

What about the postStart lifecycle hook?

Fun fact: using the postStart lifecycle hook block will also do the job, but I’d have to write my own mini-shell script, which is even less efficient.

initContainers:
  - name: nginx
    image: nginx:latest
    restartPolicy: Always
    ports:
      - containerPort: 80
        protocol: TCP
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            echo "Waiting for readiness at http://localhost:80"
            until curl -sf http://localhost:80; do
              echo "Still waiting for http://localhost:80..."
              sleep 5
            done
            echo "Service is ready at http://localhost:80"

Liveness probe

An interesting exercise would be to check the sidecar container behavior with a liveness probe. A liveness probe behaves and is configured similarly to a readiness probe - only with the difference that it doesn’t affect the readiness of the container but restarts it in case the probe fails.

livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - exit 1 # this command always fails, keeping the container "Not Ready"
  periodSeconds: 5

After adding the liveness probe configured just as the previous readiness probe and checking events of the pod by kubectl describe pod it’s visible that the sidecar has a restart count above 0. Nevertheless, the main application is not restarted nor influenced at all, even though I'm aware that (in our imaginary worst-case scenario) it can error out when the sidecar is not there serving requests. What if I’d used a livenessProbe without lifecycle postStart? Both containers will be immediately ready: at the beginning, this behavior will not be different from the one without any additional probes since the liveness probe doesn’t affect readiness at all. After a while, the sidecar will begin to restart itself, but it won’t influence the main container.

Findings summary

I’ll summarize the startup behavior in the table below:

Probe/HookSidecar starts before the main app?Main app waits for the sidecar to be ready?What if the check doesn’t pass?
readinessProbeYes, but it’s almost in parallel (effectively no)NoSidecar is not ready; main app continues running
livenessProbeYes, but it’s almost in parallel (effectively no)NoSidecar is restarted, main app continues running
startupProbeYesYesMain app is not started
postStartYes, main app container starts after postStart completesYes, but you have to provide custom logic for thatMain app is not started

To summarize: with sidecars often being a dependency of the main application, you may want to delay the start of the latter until the sidecar is healthy. The ideal pattern is to start both containers simultaneously and have the app container logic delay at all levels, but it’s not always possible. If that's what you need, you have to use the right kind of customization to the Pod definition. Thankfully, it’s nice and quick, and you have the recipe ready above.

Happy deploying!

Gateway API v1.3.0: Advancements in Request Mirroring, CORS, Gateway Merging, and Retry Budgets

Gateway API logo

Join us in the Kubernetes SIG Network community in celebrating the general availability of Gateway API v1.3.0! We are also pleased to announce that there are already a number of conformant implementations to try, made possible by postponing this blog announcement. Version 1.3.0 of the API was released about a month ago on April 24, 2025.

Gateway API v1.3.0 brings a new feature to the Standard channel (Gateway API's GA release channel): percentage-based request mirroring, and introduces three new experimental features: cross-origin resource sharing (CORS) filters, a standardized mechanism for listener and gateway merging, and retry budgets.

Also see the full release notes and applaud the v1.3.0 release team next time you see them.

Graduation to Standard channel

Graduation to the Standard channel is a notable achievement for Gateway API features, as inclusion in the Standard release channel denotes a high level of confidence in the API surface and provides guarantees of backward compatibility. Of course, as with any other Kubernetes API, Standard channel features can continue to evolve with backward-compatible additions over time, and we (SIG Network) certainly expect further refinements and improvements in the future. For more information on how all of this works, refer to the Gateway API Versioning Policy.

Percentage-based request mirroring

Leads: Lior Lieberman,Jake Bennert

GEP-3171: Percentage-Based Request Mirroring

Percentage-based request mirroring is an enhancement to the existing support for HTTP request mirroring, which allows HTTP requests to be duplicated to another backend using the RequestMirror filter type. Request mirroring is particularly useful in blue-green deployment. It can be used to assess the impact of request scaling on application performance without impacting responses to clients.

The previous mirroring capability worked on all the requests to a backendRef.
Percentage-based request mirroring allows users to specify a subset of requests they want to be mirrored, either by percentage or fraction. This can be particularly useful when services are receiving a large volume of requests. Instead of mirroring all of those requests, this new feature can be used to mirror a smaller subset of them.

Here's an example with 42% of the requests to "foo-v1" being mirrored to "foo-v2":

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: http-filter-mirror
  labels:
    gateway: mirror-gateway
spec:
  parentRefs:
  - name: mirror-gateway
  hostnames:
  - mirror.example
  rules:
  - backendRefs:
    - name: foo-v1
      port: 8080
    filters:
    - type: RequestMirror
      requestMirror:
        backendRef:
          name: foo-v2
          port: 8080
        percent: 42 # This value must be an integer.

You can also configure the partial mirroring using a fraction. Here is an example with 5 out of every 1000 requests to "foo-v1" being mirrored to "foo-v2".

  rules:
  - backendRefs:
    - name: foo-v1
      port: 8080
    filters:
    - type: RequestMirror
      requestMirror:
        backendRef:
          name: foo-v2
          port: 8080
        fraction:
          numerator: 5
          denominator: 1000

Additions to Experimental channel

The Experimental channel is Gateway API's channel for experimenting with new features and gaining confidence with them before allowing them to graduate to standard. Please note: the experimental channel may include features that are changed or removed later.

Starting in release v1.3.0, in an effort to distinguish Experimental channel resources from Standard channel resources, any new experimental API kinds have the prefix "X". For the same reason, experimental resources are now added to the API group gateway.networking.x-k8s.io instead of gateway.networking.k8s.io. Bear in mind that using new experimental channel resources means they can coexist with standard channel resources, but migrating these resources to the standard channel will require recreating them with the standard channel names and API group (both of which lack the "x-k8s" designator or "X" prefix).

The v1.3 release introduces two new experimental API kinds: XBackendTrafficPolicy and XListenerSet. To be able to use experimental API kinds, you need to install the Experimental channel Gateway API YAMLs from the locations listed below.

CORS filtering

Leads: Liang Li, Eyal Pazz, Rob Scott

GEP-1767: CORS Filter

Cross-origin resource sharing (CORS) is an HTTP-header based mechanism that allows a web page to access restricted resources from a server on an origin (domain, scheme, or port) different from the domain that served the web page. This feature adds a new HTTPRoute filter type, called "CORS", to configure the handling of cross-origin requests before the response is sent back to the client.

To be able to use experimental CORS filtering, you need to install the Experimental channel Gateway API HTTPRoute yaml.

Here's an example of a simple cross-origin configuration:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: http-route-cors
spec:
  parentRefs:
  - name: http-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /resource/foo
    filters:
    - cors:
      - type: CORS
        allowOrigins:
        - *
        allowMethods: 
        - GET
        - HEAD
        - POST
        allowHeaders: 
        - Accept
        - Accept-Language
        - Content-Language
        - Content-Type
        - Range
    backendRefs:
    - kind: Service
      name: http-route-cors
      port: 80

In this case, the Gateway returns an origin header of "*", which means that the requested resource can be referenced from any origin, a methods header (Access-Control-Allow-Methods) that permits the GET, HEAD, and POST verbs, and a headers header allowing Accept, Accept-Language, Content-Language, Content-Type, and Range.

HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, HEAD, POST
Access-Control-Allow-Headers: Accept,Accept-Language,Content-Language,Content-Type,Range

The complete list of fields in the new CORS filter:

  • allowOrigins
  • allowMethods
  • allowHeaders
  • allowCredentials
  • exposeHeaders
  • maxAge

See CORS protocol for details.

XListenerSets (standardized mechanism for Listener and Gateway merging)

Lead: Dave Protasowski

GEP-1713: ListenerSets - Standard Mechanism to Merge Multiple Gateways

This release adds a new experimental API kind, XListenerSet, that allows a shared list of listeners to be attached to one or more parent Gateway(s). In addition, it expands upon the existing suggestion that Gateway API implementations may merge configuration from multiple Gateway objects. It also:

  • adds a new field allowedListeners to the .spec of a Gateway. The allowedListeners field defines from which Namespaces to select XListenerSets that are allowed to attach to that Gateway: Same, All, None, or Selector based.
  • increases the previous maximum number (64) of listeners with the addition of XListenerSets.
  • allows the delegation of listener configuration, such as TLS, to applications in other namespaces.

To be able to use experimental XListenerSet, you need to install the Experimental channel Gateway API XListenerSet yaml.

The following example shows a Gateway with an HTTP listener and two child HTTPS XListenerSets with unique hostnames and certificates. The combined set of listeners attached to the Gateway includes the two additional HTTPS listeners in the XListenerSets that attach to the Gateway. This example illustrates the delegation of listener TLS config to application owners in different namespaces ("store" and "app"). The HTTPRoute has both the Gateway listener named "foo" and one XListenerSet listener named "second" as parentRefs.

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: prod-external
  namespace: infra
spec:
  gatewayClassName: example
  allowedListeners:
  - from: All
  listeners:
  - name: foo
    hostname: foo.com
    protocol: HTTP
    port: 80
---
apiVersion: gateway.networking.x-k8s.io/v1alpha1
kind: XListenerSet
metadata:
  name: store
  namespace: store
spec:
  parentRef:
    name: prod-external
  listeners:
  - name: first
    hostname: first.foo.com
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - kind: Secret
        group: ""
        name: first-workload-cert
---
apiVersion: gateway.networking.x-k8s.io/v1alpha1
kind: XListenerSet
metadata:
  name: app
  namespace: app
spec:
  parentRef:
    name: prod-external
  listeners:
  - name: second
    hostname: second.foo.com
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - kind: Secret
        group: ""
        name: second-workload-cert
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: httproute-example
spec:
  parentRefs:
  - name: app
    kind: XListenerSet
    sectionName: second
  - name: parent-gateway
    kind: Gateway
    sectionName: foo
    ...

Each listener in a Gateway must have a unique combination of port, protocol, (and hostname if supported by the protocol) in order for all listeners to be compatible and not conflicted over which traffic they should receive.

Furthermore, implementations can merge separate Gateways into a single set of listener addresses if all listeners across those Gateways are compatible. The management of merged listeners was under-specified in releases prior to v1.3.0.

With the new feature, the specification on merging is expanded. Implementations must treat the parent Gateways as having the merged list of all listeners from itself and from attached XListenerSets, and validation of this list of listeners must behave the same as if the list were part of a single Gateway. Within a single Gateway, listeners are ordered using the following precedence:

  1. Single Listeners (not a part of an XListenerSet) first,
  2. Remaining listeners ordered by:
    • object creation time (oldest first), and if two listeners are defined in objects that have the same timestamp, then
    • alphabetically based on "{namespace}/{name of listener}"

Retry budgets (XBackendTrafficPolicy)

Leads: Eric Bishop, Mike Morris

GEP-3388: Retry Budgets

This feature allows you to configure a retry budget across all endpoints of a destination Service. This is used to limit additional client-side retries after reaching a configured threshold. When configuring the budget, the maximum percentage of active requests that may consist of retries may be specified, as well as the interval over which requests will be considered when calculating the threshold for retries. The development of this specification changed the existing experimental API kind BackendLBPolicy into a new experimental API kind, XBackendTrafficPolicy, in the interest of reducing the proliferation of policy resources that had commonalities.

To be able to use experimental retry budgets, you need to install the Experimental channel Gateway API XBackendTrafficPolicy yaml.

The following example shows an XBackendTrafficPolicy that applies a retryConstraint that represents a budget that limits the retries to a maximum of 20% of requests, over a duration of 10 seconds, and to a minimum of 3 retries over 1 second.

apiVersion: gateway.networking.x-k8s.io/v1alpha1
kind: XBackendTrafficPolicy
metadata:
  name: traffic-policy-example
spec:
  retryConstraint:
    budget: 
        percent: 20
        interval: 10s
    minRetryRate:
      count: 3
      interval: 1s
    ...

Try it out

Unlike other Kubernetes APIs, you don't need to upgrade to the latest version of Kubernetes to get the latest version of Gateway API. As long as you're running Kubernetes 1.26 or later, you'll be able to get up and running with this version of Gateway API.

To try out the API, follow the Getting Started Guide. As of this writing, four implementations are already conformant with Gateway API v1.3 experimental channel features. In alphabetical order:

Get involved

Wondering when a feature will be added? There are lots of opportunities to get involved and help define the future of Kubernetes routing APIs for both ingress and service mesh.

The maintainers would like to thank everyone who's contributed to Gateway API, whether in the form of commits to the repo, discussion, ideas, or general support. We could never have made this kind of progress without the support of this dedicated and active community.

Kubernetes v1.33: In-Place Pod Resize Graduated to Beta

On behalf of the Kubernetes project, I am excited to announce that the in-place Pod resize feature (also known as In-Place Pod Vertical Scaling), first introduced as alpha in Kubernetes v1.27, has graduated to Beta and will be enabled by default in the Kubernetes v1.33 release! This marks a significant milestone in making resource management for Kubernetes workloads more flexible and less disruptive.

What is in-place Pod resize?

Traditionally, changing the CPU or memory resources allocated to a container required restarting the Pod. While acceptable for many stateless applications, this could be disruptive for stateful services, batch jobs, or any workloads sensitive to restarts.

In-place Pod resizing allows you to change the CPU and memory requests and limits assigned to containers within a running Pod, often without requiring a container restart.

Here's the core idea:

  • The spec.containers[*].resources field in a Pod specification now represents the desired resources and is mutable for CPU and memory.
  • The status.containerStatuses[*].resources field reflects the actual resources currently configured on a running container.
  • You can trigger a resize by updating the desired resources in the Pod spec via the new resize subresource.

You can try it out on a v1.33 Kubernetes cluster by using kubectl to edit a Pod (requires kubectl v1.32+):

kubectl edit pod <pod-name> --subresource resize

For detailed usage instructions and examples, please refer to the official Kubernetes documentation: Resize CPU and Memory Resources assigned to Containers.

Why does in-place Pod resize matter?

Kubernetes still excels at scaling workloads horizontally (adding or removing replicas), but in-place Pod resizing unlocks several key benefits for vertical scaling:

  • Reduced Disruption: Stateful applications, long-running batch jobs, and sensitive workloads can have their resources adjusted without suffering the downtime or state loss associated with a Pod restart.
  • Improved Resource Utilization: Scale down over-provisioned Pods without disruption, freeing up resources in the cluster. Conversely, provide more resources to Pods under heavy load without needing a restart.
  • Faster Scaling: Address transient resource needs more quickly. For example Java applications often need more CPU during startup than during steady-state operation. Start with higher CPU and resize down later.

What's changed between Alpha and Beta?

Since the alpha release in v1.27, significant work has gone into maturing the feature, improving its stability, and refining the user experience based on feedback and further development. Here are the key changes:

Notable user-facing changes

  • resize Subresource: Modifying Pod resources must now be done via the Pod's resize subresource (kubectl patch pod <name> --subresource resize ...). kubectl versions v1.32+ support this argument.
  • Resize Status via Conditions: The old status.resize field is deprecated. The status of a resize operation is now exposed via two Pod conditions:
    • PodResizePending: Indicates the Kubelet cannot grant the resize immediately (e.g., reason: Deferred if temporarily unable, reason: Infeasible if impossible on the node).
    • PodResizeInProgress: Indicates the resize is accepted and being applied. Errors encountered during this phase are now reported in this condition's message with reason: Error.
  • Sidecar Support: Resizing sidecar containers in-place is now supported.

Stability and reliability enhancements

  • Refined Allocated Resources Management: The allocation management logic with the Kubelet was significantly reworked, making it more consistent and robust. The changes eliminated whole classes of bugs, and greatly improved the reliability of in-place Pod resize.
  • Improved Checkpointing & State Tracking: A more robust system for tracking "allocated" and "actuated" resources was implemented, using new checkpoint files (allocated_pods_state, actuated_pods_state) to reliably manage resize state across Kubelet restarts and handle edge cases where runtime-reported resources differ from requested ones. Several bugs related to checkpointing and state restoration were fixed. Checkpointing efficiency was also improved.
  • Faster Resize Detection: Enhancements to the Kubelet's Pod Lifecycle Event Generator (PLEG) allow the Kubelet to respond to and complete resizes much more quickly.
  • Enhanced CRI Integration: A new UpdatePodSandboxResources CRI call was added to better inform runtimes and plugins (like NRI) about Pod-level resource changes.
  • Numerous Bug Fixes: Addressed issues related to systemd cgroup drivers, handling of containers without limits, CPU minimum share calculations, container restart backoffs, error propagation, test stability, and more.

What's next?

Graduating to Beta means the feature is ready for broader adoption, but development doesn't stop here! Here's what the community is focusing on next:

  • Stability and Productionization: Continued focus on hardening the feature, improving performance, and ensuring it is robust for production environments.
  • Addressing Limitations: Working towards relaxing some of the current limitations noted in the documentation, such as allowing memory limit decreases.
  • VerticalPodAutoscaler (VPA) Integration: Work to enable VPA to leverage in-place Pod resize is already underway. A new InPlaceOrRecreate update mode will allow it to attempt non-disruptive resizes first, or fall back to recreation if needed. This will allow users to benefit from VPA's recommendations with significantly less disruption.
  • User Feedback: Gathering feedback from users adopting the beta feature is crucial for prioritizing further enhancements and addressing any uncovered issues or bugs.

Getting started and providing feedback

With the InPlacePodVerticalScaling feature gate enabled by default in v1.33, you can start experimenting with in-place Pod resizing right away!

Refer to the documentation for detailed guides and examples.

As this feature moves through Beta, your feedback is invaluable. Please report any issues or share your experiences via the standard Kubernetes communication channels (GitHub issues, mailing lists, Slack). You can also review the KEP-1287: In-place Update of Pod Resources for the full in-depth design details.

We look forward to seeing how the community leverages in-place Pod resize to build more efficient and resilient applications on Kubernetes!

Announcing etcd v3.6.0

This announcement originally appeared on the etcd blog.

Today, we are releasing etcd v3.6.0, the first minor release since etcd v3.5.0 on June 15, 2021. This release introduces several new features, makes significant progress on long-standing efforts like downgrade support and migration to v3store, and addresses numerous critical & major issues. It also includes major optimizations in memory usage, improving efficiency and performance.

In addition to the features of v3.6.0, etcd has joined Kubernetes as a SIG (sig-etcd), enabling us to improve project sustainability. We've introduced systematic robustness testing to ensure correctness and reliability. Through the etcd-operator Working Group, we plan to improve usability as well.

What follows are the most significant changes introduced in etcd v3.6.0, along with the discussion of the roadmap for future development. For a detailed list of changes, please refer to the CHANGELOG-3.6.

A heartfelt thank you to all the contributors who made this release possible!

Security

etcd takes security seriously. To enhance software security in v3.6.0, we have improved our workflow checks by integrating govulncheck to scan the source code and trivy to scan container images. These improvements have also been backported to supported stable releases.

etcd continues to follow the Security Release Process to ensure vulnerabilities are properly managed and addressed.

Features

Migration to v3store

The v2store has been deprecated since etcd v3.4 but could still be enabled via --enable-v2. It remained the source of truth for membership data. In etcd v3.6.0, v2store can no longer be enabled as the --enable-v2 flag has been removed, and v3store has become the sole source of truth for membership data.

While v2store still exists in v3.6.0, etcd will fail to start if it contains any data other than membership information. To assist with migration, etcd v3.5.18+ provides the etcdutl check v2store command, which verifies that v2store contains only membership data (see PR 19113).

Compared to v2store, v3store offers better performance and transactional support. It is also the actively maintained storage engine moving forward.

The removal of v2store is still ongoing and is tracked in issues/12913.

Downgrade

etcd v3.6.0 is the first version to fully support downgrade. The effort for this downgrade task spans both versions 3.5 and 3.6, and all related work is tracked in issues/11716.

At a high level, the process involves migrating the data schema to the target version (e.g., v3.5), followed by a rolling downgrade.

Ensure the cluster is healthy, and take a snapshot backup. Validate whether the downgrade is valid:

$ etcdctl downgrade validate 3.5
Downgrade validate success, cluster version 3.6

If the downgrade is valid, enable downgrade mode:

$ etcdctl downgrade enable 3.5
Downgrade enable success, cluster version 3.6

etcd will then migrate the data schema in the background. Once complete, proceed with the rolling downgrade.

For details, refer to the Downgrade-3.6 guide.

Feature gates

In etcd v3.6.0, we introduced Kubernetes-style feature gates for managing new features. Previously, we indicated unstable features through the --experimental prefix in feature flag names. The prefix was removed once the feature was stable, causing a breaking change. Now, features will start in Alpha, progress to Beta, then GA, or get deprecated. This ensures a much smoother upgrade and downgrade experience for users.

See feature-gates for details.

livez / readyz checks

etcd now supports /livez and /readyz endpoints, aligning with Kubernetes' Liveness and Readiness probes. /livez indicates whether the etcd instance is alive, while /readyz indicates when it is ready to serve requests. This feature has also been backported to release-3.5 (starting from v3.5.11) and release-3.4 (starting from v3.4.29). See livez/readyz for details.

The existing /health endpoint remains functional. /livez is similar to /health?serializable=true, while /readyz is similar to /health or /health?serializable=false. Clearly, the /livez and /readyz endpoints provide clearer semantics and are easier to understand.

v3discovery

In etcd v3.6.0, the new discovery protocol v3discovery was introduced, based on clientv3. It facilitates the discovery of all cluster members during the bootstrap phase.

The previous v2discovery protocol, based on clientv2, has been deprecated. Additionally, the public discovery service at https://discovery.etcd.io/, which relied on v2discovery, is no longer maintained.

Performance

Memory

In this release, we reduced average memory consumption by at least 50% (see Figure 1). This improvement is primarily due to two changes:

  • The default value of --snapshot-count has been reduced from 100,000 in v3.5 to 10,000 in v3.6. As a result, etcd v3.6 now retains only about 10% of the history records compared to v3.5.
  • Raft history is compacted more frequently, as introduced in PR/18825.
Diagram of memory usage

Figure 1: Memory usage comparison between etcd v3.5.20 and v3.6.0-rc.2 under different read/write ratios. Each subplot shows the memory usage over time with a specific read/write ratio. The red line represents etcd v3.5.20, while the teal line represents v3.6.0-rc.2. Across all tested ratios, v3.6.0-rc.2 exhibits lower and more stable memory usage.

Throughput

Compared to v3.5, etcd v3.6 delivers an average performance improvement of approximately 10% in both read and write throughput (see Figure 2, 3, 4 and 5). This improvement is not attributed to any single major change, but rather the cumulative effect of multiple minor enhancements. One such example is the optimization of the free page queries introduced in PR/419.

etcd read transaction performance with a high write ratio

Figure 2: Read throughput comparison between etcd v3.5.20 and v3.6.0-rc.2 under a high write ratio. The read/write ratio is 0.0078, meaning 1 read per 128 writes. The right bar shows the percentage improvement in read throughput of v3.6.0-rc.2 over v3.5.20, ranging from 3.21% to 25.59%.

etcd read transaction performance with a high read ratio

Figure 3: Read throughput comparison between etcd v3.5.20 and v3.6.0-rc.2 under a high read ratio. The read/write ratio is 8, meaning 8 reads per write. The right bar shows the percentage improvement in read throughput of v3.6.0-rc.2 over v3.5.20, ranging from 4.38% to 27.20%.

etcd write transaction performance with a high write ratio

Figure 4: Write throughput comparison between etcd v3.5.20 and v3.6.0-rc.2 under a high write ratio. The read/write ratio is 0.0078, meaning 1 read per 128 writes. The right bar shows the percentage improvement in write throughput of v3.6.0-rc.2 over v3.5.20, ranging from 2.95% to 24.24%.

etcd write transaction performance with a high read ratio

Figure 5: Write throughput comparison between etcd v3.5.20 and v3.6.0-rc.2 under a high read ratio. The read/write ratio is 8, meaning 8 reads per write. The right bar shows the percentage improvement in write throughput of v3.6.0-rc.2 over v3.5.20, ranging from 3.86% to 28.37%.

Breaking changes

This section highlights a few notable breaking changes. For a complete list, please refer to the Upgrade etcd from v3.5 to v3.6 and the CHANGELOG-3.6.

Old binaries are incompatible with new schema versions

Old etcd binaries are not compatible with newer data schema versions. For example, etcd 3.5 cannot start with data created by etcd 3.6, and etcd 3.4 cannot start with data created by either 3.5 or 3.6.

When downgrading etcd, it's important to follow the documented downgrade procedure. Simply replacing the binary or image will result in the incompatibility issue.

Peer endpoints no longer serve client requests

Client endpoints (--advertise-client-urls) are intended to serve client requests only, while peer endpoints (--initial-advertise-peer-urls) are intended solely for peer communication. However, due to an implementation oversight, the peer endpoints were also able to handle client requests in etcd 3.4 and 3.5. This behavior was misleading and encouraged incorrect usage patterns. In etcd 3.6, this misleading behavior was corrected via PR/13565; peer endpoints no longer serve client requests.

Clear boundary between etcdctl and etcdutl

Both etcdctl and etcdutl are command line tools. etcdutl is an offline utility designed to operate directly on etcd data files, while etcdctl is an online tool that interacts with etcd over a network. Previously, there were some overlapping functionalities between the two, but these overlaps were removed in 3.6.0.

  • Removed etcdctl defrag --data-dir

    The etcdctl defrag command only support online defragmentation and no longer supports offline defragmentation. To perform offline defragmentation, use the etcdutl defrag --data-dir command instead.

  • Removed etcdctl snapshot status

    etcdctl no longer supports retrieving the status of a snapshot. Use the etcdutl snapshot status command instead.

  • Removed etcdctl snapshot restore

    etcdctl no longer supports restoring from a snapshot. Use the etcdutl snapshot restore command instead.

Critical bug fixes

Correctness has always been a top priority for the etcd project. In the process of developing 3.6.0, we found and fixed a few notable bugs that could lead to data inconsistency in specific cases. These fixes have been backported to previous releases, but we believe they deserve special mention here.

  • Data Inconsistency when Crashing Under Load

Previously, when etcd was applying data, it would update the consistent-index first, followed by committing the data. However, these operations were not atomic. If etcd crashed in between, it could lead to data inconsistency (see issue/13766). The issue was introduced in v3.5.0, and fixed in v3.5.3 with PR/13854.

  • Durability API guarantee broken in single node cluster

When a client writes data and receives a success response, the data is expected to be persisted. However, the data might be lost if etcd crashes immediately after sending the success response to the client. This was a legacy issue (see issue/14370) affecting all previous releases. It was addressed in v3.4.21 and v3.5.5 with PR/14400, and fixed in raft side in main branch (now release-3.6) with PR/14413.

  • Revision Inconsistency when Crashing During Defragmentation

If etcd crashed during the defragmentation operation, upon restart, it might reapply some entries which had already been applied, accordingly leading to the revision inconsistency issue (see the discussions in PR/14685). The issue was introduced in v3.5.0, and fixed in v3.5.6 with PR/14730.

Upgrade issue

This section highlights a common issue issues/19557 in the etcd v3.5 to v3.6 upgrade that may cause the upgrade process to fail. For a complete upgrade guide, refer to Upgrade etcd from v3.5 to v3.6.

The issue was introduced in etcd v3.5.1, and resolved in v3.5.20.

Key takeaway: users are required to first upgrade to etcd v3.5.20 (or a higher patch version) before upgrading to etcd v3.6.0; otherwise, the upgrade may fail.

For more background and technical context, see upgrade_from_3.5_to_3.6_issue.

Testing

We introduced the Robustness testing to verify correctness, which has always been our top priority. It plays traffic of various types and volumes against an etcd cluster, concurrently injects a random failpoint, records all operations (including both requests and responses), and finally performs a linearizability check. It also verifies that the Watch APIs guarantees have not been violated. The robustness test increases our confidence in ensuring the quality of each etcd release.

We have migrated most of the etcd workflow tests to Kubernetes' Prow testing infrastructure to take advantage of its benefit, such as nice dashboards for viewing test results and the ability for contributors to rerun failed tests themselves.

Platforms

While retaining all existing supported platforms, we have promoted Linux/ARM64 to Tier 1 support. For more details, please refer to issues/15951. For the complete list of supported platforms, see supported-platform.

Dependencies

Dependency bumping guide

We have published an official guide on how to bump dependencies for etcd’s main branch and stable releases. It also covers how to update the Go version. For more details, please refer to dependency_management. With this guide available, any contributors can now help with dependency upgrades.

Core Dependency Updates

bbolt and raft are two core dependencies of etcd.

Both etcd v3.4 and v3.5 depend on bbolt v1.3, while etcd v3.6 depends on bbolt v1.4.

For the release-3.4 and release-3.5 branches, raft is included in the etcd repository itself, so etcd v3.4 and v3.5 do not depend on an external raft module. Starting from etcd v3.6, raft was moved to a separate repository (raft), and the first standalone raft release is v3.6.0. As a result, etcd v3.6.0 depends on raft v3.6.0.

Please see the table below for a summary:

etcd versionsbbolt versionsraft versions
3.4.xv1.3.xN/A
3.5.xv1.3.xN/A
3.6.xv1.4.xv3.6.x

grpc-gateway@v2

We upgraded grpc-gateway from v1 to v2 via PR/16595 in etcd v3.6.0. This is a major step toward migrating to protobuf-go, the second major version of the Go protocol buffer API implementation.

grpc-gateway@v2 is designed to work with protobuf-go. However, etcd v3.6 still depends on the deprecated gogo/protobuf, which is actually protocol buffer v1 implementation. To resolve this incompatibility, we applied a patch to the generated *.pb.gw.go files to convert v1 messages to v2 messages.

grpc-ecosystem/go-grpc-middleware/providers/prometheus

We switched from the deprecated (and archived) grpc-ecosystem/go-grpc-prometheus to grpc-ecosystem/go-grpc-middleware/providers/prometheus via PR/19195. This change ensures continued support and access to the latest features and improvements in the gRPC Prometheus integration.

Community

There are exciting developments in the etcd community that reflect our ongoing commitment to strengthening collaboration, improving maintainability, and evolving the project’s governance.

etcd Becomes a Kubernetes SIG

etcd has officially become a Kubernetes Special Interest Group: SIG-etcd. This change reflects etcd’s critical role as the primary datastore for Kubernetes and establishes a more structured and transparent home for long-term stewardship and cross-project collaboration. The new SIG designation will help streamline decision-making, align roadmaps with Kubernetes needs, and attract broader community involvement.

New contributors, maintainers, and reviewers

We’ve seen increasing engagement from contributors, which has resulted in the addition of three new maintainers:

Their continued contributions have been instrumental in driving the project forward.

We also welcome two new reviewers to the project:

We appreciate their dedication to code quality and their willingness to take on broader review responsibilities within the community.

New release team

We've formed a new release team led by ivanvc and jmhbnz, streamlining the release process by automating many previously manual steps. Inspired by Kubernetes SIG Release, we've adopted several best practices, including clearly defined release team roles and the introduction of release shadows to support knowledge sharing and team sustainability. These changes have made our releases smoother and more reliable, allowing us to approach each release with greater confidence and consistency.

Introducing the etcd Operator Working Group

To further advance etcd’s operational excellence, we have formed a new working group: WG-etcd-operator. The working group is dedicated to enabling the automatic and efficient operation of etcd clusters that run in the Kubernetes environment using an etcd-operator.

Future Development

The legacy v2store has been deprecated since etcd v3.4, and the flag --enable-v2 was removed entirely in v3.6. This means that starting from v3.6, there is no longer a way to enable or use the v2store. However, etcd still bootstraps internally from the legacy v2 snapshots. To address this inconsistency, We plan to change etcd to bootstrap from the v3store and replay the WAL entries based on the consistent-index. The work is being tracked in issues/12913.

One of the most persistent challenges remains the large range of queries from the kube-apiserver, which can lead to process crashes due to their unpredictable nature. The range stream feature, originally outlined in the v3.5 release blog/Future roadmaps, remains an idea worth revisiting to address the challenges of large range queries.

For more details and upcoming plans, please refer to the etcd roadmap.

Kubernetes 1.33: Job's SuccessPolicy Goes GA

On behalf of the Kubernetes project, I'm pleased to announce that Job success policy has graduated to General Availability (GA) as part of the v1.33 release.

About Job's Success Policy

In batch workloads, you might want to use leader-follower patterns like MPI, in which the leader controls the execution, including the followers' lifecycle.

In this case, you might want to mark it as succeeded even if some of the indexes failed. Unfortunately, a leader-follower Kubernetes Job that didn't use a success policy, in most cases, would have to require all Pods to finish successfully for that Job to reach an overall succeeded state.

For Kubernetes Jobs, the API allows you to specify the early exit criteria using the .spec.successPolicy field (you can only use the .spec.successPolicy field for an indexed Job). Which describes a set of rules either using a list of succeeded indexes for a job, or defining a minimal required size of succeeded indexes.

This newly stable field is especially valuable for scientific simulation, AI/ML and High-Performance Computing (HPC) batch workloads. Users in these areas often run numerous experiments and may only need a specific number to complete successfully, rather than requiring all of them to succeed. In this case, the leader index failure is the only relevant Job exit criteria, and the outcomes for individual follower Pods are handled only indirectly via the status of the leader index. Moreover, followers do not know when they can terminate themselves.

After Job meets any Success Policy, the Job is marked as succeeded, and all Pods are terminated including the running ones.

How it works

The following excerpt from a Job manifest, using .successPolicy.rules[0].succeededCount, shows an example of using a custom success policy:

  parallelism: 10
  completions: 10
  completionMode: Indexed
  successPolicy:
    rules:
    - succeededCount: 1

Here, the Job is marked as succeeded when one index succeeded regardless of its number. Additionally, you can constrain index numbers against succeededCount in .successPolicy.rules[0].succeededCount as shown below:

parallelism: 10
completions: 10
completionMode: Indexed
successPolicy:
  rules:
  - succeededIndexes: 0 # index of the leader Pod
    succeededCount: 1

This example shows that the Job will be marked as succeeded once a Pod with a specific index (Pod index 0) has succeeded.

Once the Job either reaches one of the successPolicy rules, or achieves its Complete criteria based on .spec.completions, the Job controller within kube-controller-manager adds the SuccessCriteriaMet condition to the Job status. After that, the job-controller initiates cleanup and termination of Pods for Jobs with SuccessCriteriaMet condition. Eventually, Jobs obtain Complete condition when the job-controller finished cleanup and termination.

Learn more

Get involved

This work was led by the Kubernetes batch working group in close collaboration with the SIG Apps community.

If you are interested in working on new features in the space I recommend subscribing to our Slack channel and attending the regular community meetings.

Kubernetes v1.33: Updates to Container Lifecycle

Kubernetes v1.33 introduces a few updates to the lifecycle of containers. The Sleep action for container lifecycle hooks now supports a zero sleep duration (feature enabled by default). There is also alpha support for customizing the stop signal sent to containers when they are being terminated.

This blog post goes into the details of these new aspects of the container lifecycle, and how you can use them.

Zero value for Sleep action

Kubernetes v1.29 introduced the Sleep action for container PreStop and PostStart Lifecycle hooks. The Sleep action lets your containers pause for a specified duration after the container is started or before it is terminated. This was needed to provide a straightforward way to manage graceful shutdowns. Before the Sleep action, folks used to run the sleep command using the exec action in their container lifecycle hooks. If you wanted to do this you'd need to have the binary for the sleep command in your container image. This is difficult if you're using third party images.

The sleep action when it was added initially didn't have support for a sleep duration of zero seconds. The time.Sleep which the Sleep action uses under the hood supports a duration of zero seconds. Using a negative or a zero value for the sleep returns immediately, resulting in a no-op. We wanted the same behaviour with the sleep action. This support for the zero duration was later added in v1.32, with the PodLifecycleSleepActionAllowZero feature gate.

The PodLifecycleSleepActionAllowZero feature gate has graduated to beta in v1.33, and is now enabled by default. The original Sleep action for preStop and postStart hooks is been enabled by default, starting from Kubernetes v1.30. With a cluster running Kubernetes v1.33, you are able to set a zero duration for sleep lifecycle hooks. For a cluster with default configuration, you don't need to enable any feature gate to make that possible.

Container stop signals

Container runtimes such as containerd and CRI-O honor a StopSignal instruction in the container image definition. This can be used to specify a custom stop signal that the runtime will used to terminate containers based on that image. Stop signal configuration was not originally part of the Pod API in Kubernetes. Until Kubernetes v1.33, the only way to override the stop signal for containers was by rebuilding your container image with the new custom stop signal (for example, specifying STOPSIGNAL in a Containerfile or Dockerfile).

The ContainerStopSignals feature gate which is newly added in Kubernetes v1.33 adds stop signals to the Kubernetes API. This allows users to specify a custom stop signal in the container spec. Stop signals are added to the API as a new lifecycle along with the existing PreStop and PostStart lifecycle handlers. In order to use this feature, we expect the Pod to have the operating system specified with spec.os.name. This is enforced so that we can cross-validate the stop signal against the operating system and make sure that the containers in the Pod are created with a valid stop signal for the operating system the Pod is being scheduled to. For Pods scheduled on Windows nodes, only SIGTERM and SIGKILL are allowed as valid stop signals. Find the full list of signals supported in Linux nodes here.

Default behaviour

If a container has a custom stop signal defined in its lifecycle, the container runtime would use the signal defined in the lifecycle to kill the container, given that the container runtime also supports custom stop signals. If there is no custom stop signal defined in the container lifecycle, the runtime would fallback to the stop signal defined in the container image. If there is no stop signal defined in the container image, the default stop signal of the runtime would be used. The default signal is SIGTERM for both containerd and CRI-O.

Version skew

For the feature to work as intended, both the versions of Kubernetes and the container runtime should support container stop signals. The changes to the Kuberentes API and kubelet are available in alpha stage from v1.33, which can be enabled with the ContainerStopSignals feature gate. The container runtime implementations for containerd and CRI-O are still a work in progress and will be rolled out soon.

Using container stop signals

To enable this feature, you need to turn on the ContainerStopSignals feature gate in both the kube-apiserver and the kubelet. Once you have nodes where the feature gate is turned on, you can create Pods with a StopSignal lifecycle and a valid OS name like so:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  os:
    name: linux
  containers:
    - name: nginx
      image: nginx:latest
      lifecycle:
        stopSignal: SIGUSR1

Do note that the SIGUSR1 signal in this example can only be used if the container's Pod is scheduled to a Linux node. Hence we need to specify spec.os.name as linux to be able to use the signal. You will only be able to configure SIGTERM and SIGKILL signals if the Pod is being scheduled to a Windows node. You cannot specify a containers[*].lifecycle.stopSignal if the spec.os.name field is nil or unset either.

How do I get involved?

This feature is driven by the SIG Node. If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please reach out to us!

You can reach SIG Node by several means:

You can also contact me directly:

  • GitHub: @sreeram-venkitesh
  • Slack: @sreeram.venkitesh

Kubernetes v1.33: Job's Backoff Limit Per Index Goes GA

In Kubernetes v1.33, the Backoff Limit Per Index feature reaches general availability (GA). This blog describes the Backoff Limit Per Index feature and its benefits.

About backoff limit per index

When you run workloads on Kubernetes, you must consider scenarios where Pod failures can affect the completion of your workloads. Ideally, your workload should tolerate transient failures and continue running.

To achieve failure tolerance in a Kubernetes Job, you can set the spec.backoffLimit field. This field specifies the total number of tolerated failures.

However, for workloads where every index is considered independent, like embarassingly parallel workloads - the spec.backoffLimit field is often not flexible enough. For example, you may choose to run multiple suites of integration tests by representing each suite as an index within an Indexed Job. In that setup, a fast-failing index (test suite) is likely to consume your entire budget for tolerating Pod failures, and you might not be able to run the other indexes.

In order to address this limitation, Kubernetes introduced backoff limit per index, which allows you to control the number of retries per index.

How backoff limit per index works

To use Backoff Limit Per Index for Indexed Jobs, specify the number of tolerated Pod failures per index with the spec.backoffLimitPerIndex field. When you set this field, the Job executes all indexes by default.

Additionally, to fine-tune the error handling:

  • Specify the cap on the total number of failed indexes by setting the spec.maxFailedIndexes field. When the limit is exceeded the entire Job is terminated.
  • Define a short-circuit to detect a failed index by using the FailIndex action in the Pod Failure Policy mechanism.

When the number of tolerated failures is exceeded, the Job marks that index as failed and lists it in the Job's status.failedIndexes field.

Example

The following Job spec snippet is an example of how to combine backoff limit per index with the Pod Failure Policy feature:

completions: 10
parallelism: 10
completionMode: Indexed
backoffLimitPerIndex: 1
maxFailedIndexes: 5
podFailurePolicy:
  rules:
  - action: Ignore
    onPodConditions:
    - type: DisruptionTarget
  - action: FailIndex
    onExitCodes:
      operator: In
      values: [ 42 ]

In this example, the Job handles Pod failures as follows:

  • Ignores any failed Pods that have the built-in disruption condition, called DisruptionTarget. These Pods don't count towards Job backoff limits.
  • Fails the index corresponding to the failed Pod if any of the failed Pod's containers finished with the exit code 42 - based on the matching "FailIndex" rule.
  • Retries the first failure of any index, unless the index failed due to the matching FailIndex rule.
  • Fails the entire Job if the number of failed indexes exceeded 5 (set by the spec.maxFailedIndexes field).

Learn more

Get involved

This work was sponsored by the Kubernetes batch working group in close collaboration with the SIG Apps community.

If you are interested in working on new features in the space we recommend subscribing to our Slack channel and attending the regular community meetings.

Kubernetes v1.33: Image Pull Policy the way you always thought it worked!

Image Pull Policy the way you always thought it worked!

Some things in Kubernetes are surprising, and the way imagePullPolicy behaves might be one of them. Given Kubernetes is all about running pods, it may be peculiar to learn that there has been a caveat to restricting pod access to authenticated images for over 10 years in the form of issue 18787! It is an exciting release when you can resolve a ten-year-old issue.

Note:

Throughout this blog post, the term "pod credentials" will be used often. In this context, the term generally encapsulates the authentication material that is available to a pod to authenticate a container image pull.

IfNotPresent, even if I'm not supposed to have it

The gist of the problem is that the imagePullPolicy: IfNotPresent strategy has done precisely what it says, and nothing more. Let's set up a scenario. To begin, Pod A in Namespace X is scheduled to Node 1 and requires image Foo from a private repository. For it's image pull authentication material, the pod references Secret 1 in its imagePullSecrets. Secret 1 contains the necessary credentials to pull from the private repository. The Kubelet will utilize the credentials from Secret 1 as supplied by Pod A and it will pull container image Foo from the registry. This is the intended (and secure) behavior.

But now things get curious. If Pod B in Namespace Y happens to also be scheduled to Node 1, unexpected (and potentially insecure) things happen. Pod B may reference the same private image, specifying the IfNotPresent image pull policy. Pod B does not reference Secret 1 (or in our case, any secret) in its imagePullSecrets. When the Kubelet tries to run the pod, it honors the IfNotPresent policy. The Kubelet sees that the image Foo is already present locally, and will provide image Foo to Pod B. Pod B gets to run the image even though it did not provide credentials authorizing it to pull the image in the first place.

Illustration of the process of two pods trying to access a private image, the first one with a pull secret, the second one without it

Using a private image pulled by a different pod

While IfNotPresent should not pull image Foo if it is already present on the node, it is an incorrect security posture to allow all pods scheduled to a node to have access to previously pulled private image. These pods were never authorized to pull the image in the first place.

IfNotPresent, but only if I am supposed to have it

In Kubernetes v1.33, we - SIG Auth and SIG Node - have finally started to address this (really old) problem and getting the verification right! The basic expected behavior is not changed. If an image is not present, the Kubelet will attempt to pull the image. The credentials each pod supplies will be utilized for this task. This matches behavior prior to 1.33.

If the image is present, then the behavior of the Kubelet changes. The Kubelet will now verify the pod's credentials before allowing the pod to use the image.

Performance and service stability have been a consideration while revising the feature. Pods utilizing the same credential will not be required to re-authenticate. This is also true when pods source credentials from the same Kubernetes Secret object, even when the credentials are rotated.

Never pull, but use if authorized

The imagePullPolicy: Never option does not fetch images. However, if the container image is already present on the node, any pod attempting to use the private image will be required to provide credentials, and those credentials require verification.

Pods utilizing the same credential will not be required to re-authenticate. Pods that do not supply credentials previously used to successfully pull an image will not be allowed to use the private image.

Always pull, if authorized

The imagePullPolicy: Always has always worked as intended. Each time an image is requested, the request goes to the registry and the registry will perform an authentication check.

In the past, forcing the Always image pull policy via pod admission was the only way to ensure that your private container images didn't get reused by other pods on nodes which already pulled the images.

Fortunately, this was somewhat performant. Only the image manifest was pulled, not the image. However, there was still a cost and a risk. During a new rollout, scale up, or pod restart, the image registry that provided the image MUST be available for the auth check, putting the image registry in the critical path for stability of services running inside of the cluster.

How it all works

The feature is based on persistent, file-based caches that are present on each of the nodes. The following is a simplified description of how the feature works. For the complete version, please see KEP-2535.

The process of requesting an image for the first time goes like this:

  1. A pod requesting an image from a private registry is scheduled to a node.
  2. The image is not present on the node.
  3. The Kubelet makes a record of the intention to pull the image.
  4. The Kubelet extracts credentials from the Kubernetes Secret referenced by the pod as an image pull secret, and uses them to pull the image from the private registry.
  5. After the image has been successfully pulled, the Kubelet makes a record of the successful pull. This record includes details about credentials used (in the form of a hash) as well as the Secret from which they originated.
  6. The Kubelet removes the original record of intent.
  7. The Kubelet retains the record of successful pull for later use.

When future pods scheduled to the same node request the previously pulled private image:

  1. The Kubelet checks the credentials that the new pod provides for the pull.
  2. If the hash of these credentials, or the source Secret of the credentials match the hash or source Secret which were recorded for a previous successful pull, the pod is allowed to use the previously pulled image.
  3. If the credentials or their source Secret are not found in the records of successful pulls for that image, the Kubelet will attempt to use these new credentials to request a pull from the remote registry, triggering the authorization flow.

Try it out

In Kubernetes v1.33 we shipped the alpha version of this feature. To give it a spin, enable the KubeletEnsureSecretPulledImages feature gate for your 1.33 Kubelets.

You can learn more about the feature and additional optional configuration on the concept page for Images in the official Kubernetes documentation.

What's next?

In future releases we are going to:

  1. Make this feature work together with Projected service account tokens for Kubelet image credential providers which adds a new, workload-specific source of image pull credentials.
  2. Write a benchmarking suite to measure the performance of this feature and assess the impact of any future changes.
  3. Implement an in-memory caching layer so that we don't need to read files for each image pull request.
  4. Add support for credential expirations, thus forcing previously validated credentials to be re-authenticated.

How to get involved

Reading KEP-2535 is a great way to understand these changes in depth.

If you are interested in further involvement, reach out to us on the #sig-auth-authenticators-dev channel on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/). You are also welcome to join the bi-weekly SIG Auth meetings, held every other Wednesday.

Kubernetes v1.33: Streaming List responses

Managing Kubernetes cluster stability becomes increasingly critical as your infrastructure grows. One of the most challenging aspects of operating large-scale clusters has been handling List requests that fetch substantial datasets - a common operation that could unexpectedly impact your cluster's stability.

Today, the Kubernetes community is excited to announce a significant architectural improvement: streaming encoding for List responses.

The problem: unnecessary memory consumption with large resources

Current API response encoders just serialize an entire response into a single contiguous memory and perform one ResponseWriter.Write call to transmit data to the client. Despite HTTP/2's capability to split responses into smaller frames for transmission, the underlying HTTP server continues to hold the complete response data as a single buffer. Even as individual frames are transmitted to the client, the memory associated with these frames cannot be freed incrementally.

When cluster size grows, the single response body can be substantial - like hundreds of megabytes in size. At large scale, the current approach becomes particularly inefficient, as it prevents incremental memory release during transmission. Imagining that when network congestion occurs, that large response body’s memory block stays active for tens of seconds or even minutes. This limitation leads to unnecessarily high and prolonged memory consumption in the kube-apiserver process. If multiple large List requests occur simultaneously, the cumulative memory consumption can escalate rapidly, potentially leading to an Out-of-Memory (OOM) situation that compromises cluster stability.

The encoding/json package uses sync.Pool to reuse memory buffers during serialization. While efficient for consistent workloads, this mechanism creates challenges with sporadic large List responses. When processing these large responses, memory pools expand significantly. But due to sync.Pool's design, these oversized buffers remain reserved after use. Subsequent small List requests continue utilizing these large memory allocations, preventing garbage collection and maintaining persistently high memory consumption in the kube-apiserver even after the initial large responses complete.

Additionally, Protocol Buffers are not designed to handle large datasets. But it’s great for handling individual messages within a large data set. This highlights the need for streaming-based approaches that can process and transmit large collections incrementally rather than as monolithic blocks.

As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.

From https://protobuf.dev/programming-guides/techniques/

Streaming encoder for List responses

The streaming encoding mechanism is specifically designed for List responses, leveraging their common well-defined collection structures. The core idea focuses exclusively on the Items field within collection structures, which represents the bulk of memory consumption in large responses. Rather than encoding the entire Items array as one contiguous memory block, the new streaming encoder processes and transmits each item individually, allowing memory to be freed progressively as frame or chunk is transmitted. As a result, encoding items one by one significantly reduces the memory footprint required by the API server.

With Kubernetes objects typically limited to 1.5 MiB (from ETCD), streaming encoding keeps memory consumption predictable and manageable regardless of how many objects are in a List response. The result is significantly improved API server stability, reduced memory spikes, and better overall cluster performance - especially in environments where multiple large List operations might occur simultaneously.

To ensure perfect backward compatibility, the streaming encoder validates Go struct tags rigorously before activation, guaranteeing byte-for-byte consistency with the original encoder. Standard encoding mechanisms process all fields except Items, maintaining identical output formatting throughout. This approach seamlessly supports all Kubernetes List types—from built-in *List objects to Custom Resource UnstructuredList objects - requiring zero client-side modifications or awareness that the underlying encoding method has changed.

Performance gains you'll notice

  • Reduced Memory Consumption: Significantly lowers the memory footprint of the API server when handling large list requests, especially when dealing with large resources.
  • Improved Scalability: Enables the API server to handle more concurrent requests and larger datasets without running out of memory.
  • Increased Stability: Reduces the risk of OOM kills and service disruptions.
  • Efficient Resource Utilization: Optimizes memory usage and improves overall resource efficiency.

Benchmark results

To validate results Kubernetes has introduced a new list benchmark which executes concurrently 10 list requests each returning 1GB of data.

The benchmark has showed 20x improvement, reducing memory usage from 70-80GB to 3GB.

Screenshot of a K8s performance dashboard showing memory usage for benchmark list going down from 60GB to 3GB

List benchmark memory usage

Kubernetes 1.33: Volume Populators Graduate to GA

Kubernetes volume populators are now generally available (GA)! The AnyVolumeDataSource feature gate is treated as always enabled for Kubernetes v1.33, which means that users can specify any appropriate custom resource as the data source of a PersistentVolumeClaim (PVC).

An example of how to use dataSourceRef in PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc1
spec:
  ...
  dataSourceRef:
    apiGroup: provider.example.com
    kind: Provider
    name: provider1

What is new

There are four major enhancements from beta.

Populator Pod is optional

During the beta phase, contributors to Kubernetes identified potential resource leaks with PersistentVolumeClaim (PVC) deletion while volume population was in progress; these leaks happened due to limitations in finalizer handling. Ahead of the graduation to general availability, the Kubernetes project added support to delete temporary resources (PVC prime, etc.) if the original PVC is deleted.

To accommodate this, we've introduced three new plugin-based functions:

  • PopulateFn(): Executes the provider-specific data population logic.
  • PopulateCompleteFn(): Checks if the data population operation has finished successfully.
  • PopulateCleanupFn(): Cleans up temporary resources created by the provider-specific functions after data population is completed

A provider example is added in lib-volume-populator/example.

Mutator functions to modify the Kubernetes resources

For GA, the CSI volume populator controller code gained a MutatorConfig, allowing the specification of mutator functions to modify Kubernetes resources. For example, if the PVC prime is not an exact copy of the PVC and you need provider-specific information for the driver, you can include this information in the optional MutatorConfig. This allows you to customize the Kubernetes objects in the volume populator.

Flexible metric handling for providers

Our beta phase highlighted a new requirement: the need to aggregate metrics not just from lib-volume-populator, but also from other components within the provider's codebase.

To address this, SIG Storage introduced a provider metric manager. This enhancement delegates the implementation of metrics logic to the provider itself, rather than relying solely on lib-volume-populator. This shift provides greater flexibility and control over metrics collection and aggregation, enabling a more comprehensive view of provider performance.

Clean up for temporary resources

During the beta phase, we identified potential resource leaks with PersistentVolumeClaim (PVC) deletion while volume population was in progress, due to limitations in finalizer handling. We have improved the populator to support the deletion of temporary resources (PVC prime, etc.) if the original PVC is deleted in this GA release.

How to use it

To try it out, please follow the steps in the previous beta blog.

Future directions and potential feature requests

For next step, there are several potential feature requests for volume populator:

  • Multi sync: the current implementation is a one-time unidirectional sync from source to destination. This can be extended to support multiple syncs, enabling periodic syncs or allowing users to sync on demand
  • Bidirectional sync: an extension of multi sync above, but making it bidirectional between source and destination
  • Populate data with priorities: with a list of different dataSourceRef, populate based on priorities
  • Populate data from multiple sources of the same provider: populate multiple different sources to one destination
  • Populate data from multiple sources of the different providers: populate multiple different sources to one destination, pipelining different resources’ population

To ensure we're building something truly valuable, Kubernetes SIG Storage would love to hear about any specific use cases you have in mind for this feature. For any inquiries or specific questions related to volume populator, please reach out to the SIG Storage community.

Kubernetes v1.33: From Secrets to Service Accounts: Kubernetes Image Pulls Evolved

Kubernetes has steadily evolved to reduce reliance on long-lived credentials stored in the API. A prime example of this shift is the transition of Kubernetes Service Account (KSA) tokens from long-lived, static tokens to ephemeral, automatically rotated tokens with OpenID Connect (OIDC)-compliant semantics. This advancement enables workloads to securely authenticate with external services without needing persistent secrets.

However, one major gap remains: image pull authentication. Today, Kubernetes clusters rely on image pull secrets stored in the API, which are long-lived and difficult to rotate, or on node-level kubelet credential providers, which allow any pod running on a node to access the same credentials. This presents security and operational challenges.

To address this, Kubernetes is introducing Service Account Token Integration for Kubelet Credential Providers, now available in alpha. This enhancement allows credential providers to use pod-specific service account tokens to obtain registry credentials, which kubelet can then use for image pulls — eliminating the need for long-lived image pull secrets.

The problem with image pull secrets

Currently, Kubernetes administrators have two primary options for handling private container image pulls:

  1. Image pull secrets stored in the Kubernetes API

    • These secrets are often long-lived because they are hard to rotate.
    • They must be explicitly attached to a service account or pod.
    • Compromise of a pull secret can lead to unauthorized image access.
  2. Kubelet credential providers

    • These providers fetch credentials dynamically at the node level.
    • Any pod running on the node can access the same credentials.
    • There’s no per-workload isolation, increasing security risks.

Neither approach aligns with the principles of least privilege or ephemeral authentication, leaving Kubernetes with a security gap.

The solution: Service Account token integration for Kubelet credential providers

This new enhancement enables kubelet credential providers to use workload identity when fetching image registry credentials. Instead of relying on long-lived secrets, credential providers can use service account tokens to request short-lived credentials tied to a specific pod’s identity.

This approach provides:

  • Workload-specific authentication: Image pull credentials are scoped to a particular workload.
  • Ephemeral credentials: Tokens are automatically rotated, eliminating the risks of long-lived secrets.
  • Seamless integration: Works with existing Kubernetes authentication mechanisms, aligning with cloud-native security best practices.

How it works

1. Service Account tokens for credential providers

Kubelet generates short-lived, automatically rotated tokens for service accounts if the credential provider it communicates with has opted into receiving a service account token for image pulls. These tokens conform to OIDC ID token semantics and are provided to the credential provider as part of the CredentialProviderRequest. The credential provider can then use this token to authenticate with an external service.

2. Image registry authentication flow

  • When a pod starts, the kubelet requests credentials from a credential provider.
  • If the credential provider has opted in, the kubelet generates a service account token for the pod.
  • The service account token is included in the CredentialProviderRequest, allowing the credential provider to authenticate and exchange it for temporary image pull credentials from a registry (e.g. AWS ECR, GCP Artifact Registry, Azure ACR).
  • The kubelet then uses these credentials to pull images on behalf of the pod.

Benefits of this approach

  • Security: Eliminates long-lived image pull secrets, reducing attack surfaces.
  • Granular Access Control: Credentials are tied to individual workloads rather than entire nodes or clusters.
  • Operational Simplicity: No need for administrators to manage and rotate image pull secrets manually.
  • Improved Compliance: Helps organizations meet security policies that prohibit persistent credentials in the cluster.

What's next?

For Kubernetes v1.34, we expect to ship this feature in beta while continuing to gather feedback from users.

In the coming releases, we will focus on:

  • Implementing caching mechanisms to improve performance for token generation.
  • Giving more flexibility to credential providers to decide how the registry credentials returned to the kubelet are cached.
  • Making the feature work with Ensure Secret Pulled Images to ensure pods that use an image are authorized to access that image when service account tokens are used for authentication.

You can learn more about this feature on the service account token for image pulls page in the Kubernetes documentation.

You can also follow along on the KEP-4412 to track progress across the coming Kubernetes releases.

Try it out

To try out this feature:

  1. Ensure you are running Kubernetes v1.33 or later.
  2. Enable the ServiceAccountTokenForKubeletCredentialProviders feature gate on the kubelet.
  3. Ensure credential provider support: Modify or update your credential provider to use service account tokens for authentication.
  4. Update the credential provider configuration to opt into receiving service account tokens for the credential provider by configuring the tokenAttributes field.
  5. Deploy a pod that uses the credential provider to pull images from a private registry.

We would love to hear your feedback on this feature. Please reach out to us on the #sig-auth-authenticators-dev channel on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/).

How to get involved

If you are interested in getting involved in the development of this feature, sharing feedback, or participating in any other ongoing SIG Auth projects, please reach out on the #sig-auth channel on Kubernetes Slack.

You are also welcome to join the bi-weekly SIG Auth meetings, held every other Wednesday.

Kubernetes v1.33: Fine-grained SupplementalGroups Control Graduates to Beta

The new field, supplementalGroupsPolicy, was introduced as an opt-in alpha feature for Kubernetes v1.31 and has graduated to beta in v1.33; the corresponding feature gate (SupplementalGroupsPolicy) is now enabled by default. This feature enables to implement more precise control over supplemental groups in containers that can strengthen the security posture, particularly in accessing volumes. Moreover, it also enhances the transparency of UID/GID details in containers, offering improved security oversight.

Please be aware that this beta release contains some behavioral breaking change. See The Behavioral Changes Introduced In Beta and Upgrade Considerations sections for details.

Motivation: Implicit group memberships defined in /etc/group in the container image

Although the majority of Kubernetes cluster admins/users may not be aware, kubernetes, by default, merges group information from the Pod with information defined in /etc/group in the container image.

Let's see an example, below Pod manifest specifies runAsUser=1000, runAsGroup=3000 and supplementalGroups=4000 in the Pod's security context.

apiVersion: v1
kind: Pod
metadata:
  name: implicit-groups
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    supplementalGroups: [4000]
  containers:
  - name: ctr
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "sleep 1h" ]
    securityContext:
      allowPrivilegeEscalation: false

What is the result of id command in the ctr container? The output should be similar to this:

uid=1000 gid=3000 groups=3000,4000,50000

Where does group ID 50000 in supplementary groups (groups field) come from, even though 50000 is not defined in the Pod's manifest at all? The answer is /etc/group file in the container image.

Checking the contents of /etc/group in the container image should show below:

user-defined-in-image:x:1000:
group-defined-in-image:x:50000:user-defined-in-image

This shows that the container's primary user 1000 belongs to the group 50000 in the last entry.

Thus, the group membership defined in /etc/group in the container image for the container's primary user is implicitly merged to the information from the Pod. Please note that this was a design decision the current CRI implementations inherited from Docker, and the community never really reconsidered it until now.

What's wrong with it?

The implicitly merged group information from /etc/group in the container image poses a security risk. These implicit GIDs can't be detected or validated by policy engines because there's no record of them in the Pod manifest. This can lead to unexpected access control issues, particularly when accessing volumes (see kubernetes/kubernetes#112879 for details) because file permission is controlled by UID/GIDs in Linux.

Fine-grained supplemental groups control in a Pod: supplementaryGroupsPolicy

To tackle the above problem, Pod's .spec.securityContext now includes supplementalGroupsPolicy field.

This field lets you control how Kubernetes calculates the supplementary groups for container processes within a Pod. The available policies are:

  • Merge: The group membership defined in /etc/group for the container's primary user will be merged. If not specified, this policy will be applied (i.e. as-is behavior for backward compatibility).

  • Strict: Only the group IDs specified in fsGroup, supplementalGroups, or runAsGroup are attached as supplementary groups to the container processes. Group memberships defined in /etc/group for the container's primary user are ignored.

Let's see how Strict policy works. Below Pod manifest specifies supplementalGroupsPolicy: Strict:

apiVersion: v1
kind: Pod
metadata:
  name: strict-supplementalgroups-policy
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    supplementalGroups: [4000]
    supplementalGroupsPolicy: Strict
  containers:
  - name: ctr
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "sleep 1h" ]
    securityContext:
      allowPrivilegeEscalation: false

The result of id command in the ctr container should be similar to this:

uid=1000 gid=3000 groups=3000,4000

You can see Strict policy can exclude group 50000 from groups!

Thus, ensuring supplementalGroupsPolicy: Strict (enforced by some policy mechanism) helps prevent the implicit supplementary groups in a Pod.

Note:

A container with sufficient privileges can change its process identity. The supplementalGroupsPolicy only affect the initial process identity. See the following section for details.

Attached process identity in Pod status

This feature also exposes the process identity attached to the first container process of the container via .status.containerStatuses[].user.linux field. It would be helpful to see if implicit group IDs are attached.

...
status:
  containerStatuses:
  - name: ctr
    user:
      linux:
        gid: 3000
        supplementalGroups:
        - 3000
        - 4000
        uid: 1000
...

Note:

Please note that the values in status.containerStatuses[].user.linux field is the firstly attached process identity to the first container process in the container. If the container has sufficient privilege to call system calls related to process identity (e.g. setuid(2), setgid(2) or setgroups(2), etc.), the container process can change its identity. Thus, the actual process identity will be dynamic.

Strict Policy requires newer CRI versions

Actually, CRI runtime (e.g. containerd, CRI-O) plays a core role for calculating supplementary group ids to be attached to the containers. Thus, SupplementalGroupsPolicy=Strict requires a CRI runtime that support this feature (SupplementalGroupsPolicy: Merge can work with the CRI runtime which does not support this feature because this policy is fully backward compatible policy).

Here are some CRI runtimes that support this feature, and the versions you need to be running:

  • containerd: v2.0 or later
  • CRI-O: v1.31 or later

And, you can see if the feature is supported in the Node's .status.features.supplementalGroupsPolicy field.

apiVersion: v1
kind: Node
...
status:
  features:
    supplementalGroupsPolicy: true

The behavioral changes introduced in beta

In the alpha release, when a Pod with supplementalGroupsPolicy: Strict was scheduled to a node that did not support the feature (i.e., .status.features.supplementalGroupsPolicy=false), the Pod's supplemental groups policy silently fell back to Merge.

In v1.33, this has entered beta to enforce the policy more strictly, where kubelet rejects pods whose nodes cannot ensure the specified policy. If your pod is rejected, you will see warning events with reason=SupplementalGroupsPolicyNotSupported like below:

apiVersion: v1
kind: Event
...
type: Warning
reason: SupplementalGroupsPolicyNotSupported
message: "SupplementalGroupsPolicy=Strict is not supported in this node"
involvedObject:
  apiVersion: v1
  kind: Pod
  ...

Upgrade consideration

If you're already using this feature, especially the supplementalGroupsPolicy: Strict policy, we assume that your cluster's CRI runtimes already support this feature. In that case, you don't need to worry about the pod rejections described above.

However, if your cluster:

  • uses the supplementalGroupsPolicy: Strict policy, but
  • its CRI runtimes do NOT yet support the feature (i.e., .status.features.supplementalGroupsPolicy=false),

you need to prepare the behavioral changes (pod rejection) when upgrading your cluster.

We recommend several ways to avoid unexpected pod rejections:

  • Upgrading your cluster's CRI runtimes together with kubernetes or before the upgrade
  • Putting some label to your nodes describing CRI runtime supports this feature or not and also putting label selector to pods with Strict policy to select such nodes (but, you will need to monitor the number of Pending pods in this case instead of pod rejections).

Getting involved

This feature is driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!

How can I learn more?

Kubernetes v1.33: Prevent PersistentVolume Leaks When Deleting out of Order graduates to GA

I am thrilled to announce that the feature to prevent PersistentVolume (or PVs for short) leaks when deleting out of order has graduated to General Availability (GA) in Kubernetes v1.33! This improvement, initially introduced as a beta feature in Kubernetes v1.31, ensures that your storage resources are properly reclaimed, preventing unwanted leaks.

How did reclaim work in previous Kubernetes releases?

PersistentVolumeClaim (or PVC for short) is a user's request for storage. A PV and PVC are considered Bound if a newly created PV or a matching PV is found. The PVs themselves are backed by volumes allocated by the storage backend.

Normally, if the volume is to be deleted, then the expectation is to delete the PVC for a bound PV-PVC pair. However, there are no restrictions on deleting a PV before deleting a PVC.

For a Bound PV-PVC pair, the ordering of PV-PVC deletion determines whether the PV reclaim policy is honored. The reclaim policy is honored if the PVC is deleted first; however, if the PV is deleted prior to deleting the PVC, then the reclaim policy is not exercised. As a result of this behavior, the associated storage asset in the external infrastructure is not removed.

PV reclaim policy with Kubernetes v1.33

With the graduation to GA in Kubernetes v1.33, this issue is now resolved. Kubernetes now reliably honors the configured Delete reclaim policy, even when PVs are deleted before their bound PVCs. This is achieved through the use of finalizers, ensuring that the storage backend releases the allocated storage resource as intended.

How does it work?

For CSI volumes, the new behavior is achieved by adding a finalizer external-provisioner.volume.kubernetes.io/finalizer on new and existing PVs. The finalizer is only removed after the storage from the backend is deleted. Addition or removal of finalizer is handled by external-provisioner `

An example of a PV with the finalizer, notice the new finalizer in the finalizers list

kubectl get pv pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: csi.example.driver.com
  creationTimestamp: "2021-11-17T19:28:56Z"
  finalizers:
  - kubernetes.io/pv-protection
  - external-provisioner.volume.kubernetes.io/finalizer
  name: pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53
  resourceVersion: "194711"
  uid: 087f14f2-4157-4e95-8a70-8294b039d30e
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: example-vanilla-block-pvc
    namespace: default
    resourceVersion: "194677"
    uid: a7b7e3ba-f837-45ba-b243-dec7d8aaed53
  csi:
    driver: csi.example.driver.com
    fsType: ext4
    volumeAttributes:
      storage.kubernetes.io/csiProvisionerIdentity: 1637110610497-8081-csi.example.driver.com
      type: CNS Block Volume
    volumeHandle: 2dacf297-803f-4ccc-afc7-3d3c3f02051e
  persistentVolumeReclaimPolicy: Delete
  storageClassName: example-vanilla-block-sc
  volumeMode: Filesystem
status:
  phase: Bound

The finalizer prevents this PersistentVolume from being removed from the cluster. As stated previously, the finalizer is only removed from the PV object after it is successfully deleted from the storage backend. To learn more about finalizers, please refer to Using Finalizers to Control Deletion.

Similarly, the finalizer kubernetes.io/pv-controller is added to dynamically provisioned in-tree plugin volumes.

Important note

The fix does not apply to statically provisioned in-tree plugin volumes.

How to enable new behavior?

To take advantage of the new behavior, you must have upgraded your cluster to the v1.33 release of Kubernetes and run the CSI external-provisioner version 5.0.1 or later. The feature was released as beta in v1.31 release of Kubernetes, where it was enabled by default.

References

How do I get involved?

The Kubernetes Slack channel SIG Storage communication channels are great mediums to reach out to the SIG Storage and migration working group teams.

Special thanks to the following people for the insightful reviews, thorough consideration and valuable contribution:

  • Fan Baofa (carlory)
  • Jan Šafránek (jsafrane)
  • Xing Yang (xing-yang)
  • Matthew Wong (wongma7)

Join the Kubernetes Storage Special Interest Group (SIG) if you're interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system. We’re rapidly growing and always welcome new contributors.

Kubernetes v1.33: Mutable CSI Node Allocatable Count

Scheduling stateful applications reliably depends heavily on accurate information about resource availability on nodes. Kubernetes v1.33 introduces an alpha feature called mutable CSI node allocatable count, allowing Container Storage Interface (CSI) drivers to dynamically update the reported maximum number of volumes that a node can handle. This capability significantly enhances the accuracy of pod scheduling decisions and reduces scheduling failures caused by outdated volume capacity information.

Background

Traditionally, Kubernetes CSI drivers report a static maximum volume attachment limit when initializing. However, actual attachment capacities can change during a node's lifecycle for various reasons, such as:

  • Manual or external operations attaching/detaching volumes outside of Kubernetes control.
  • Dynamically attached network interfaces or specialized hardware (GPUs, NICs, etc.) consuming available slots.
  • Multi-driver scenarios, where one CSI driver’s operations affect available capacity reported by another.

Static reporting can cause Kubernetes to schedule pods onto nodes that appear to have capacity but don't, leading to pods stuck in a ContainerCreating state.

Dynamically adapting CSI volume limits

With the new feature gate MutableCSINodeAllocatableCount, Kubernetes enables CSI drivers to dynamically adjust and report node attachment capacities at runtime. This ensures that the scheduler has the most accurate, up-to-date view of node capacity.

How it works

When this feature is enabled, Kubernetes supports two mechanisms for updating the reported node volume limits:

  • Periodic Updates: CSI drivers specify an interval to periodically refresh the node's allocatable capacity.
  • Reactive Updates: An immediate update triggered when a volume attachment fails due to exhausted resources (ResourceExhausted error).

Enabling the feature

To use this alpha feature, you must enable the MutableCSINodeAllocatableCount feature gate in these components:

  • kube-apiserver
  • kubelet

Example CSI driver configuration

Below is an example of configuring a CSI driver to enable periodic updates every 60 seconds:

apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: example.csi.k8s.io
spec:
  nodeAllocatableUpdatePeriodSeconds: 60

This configuration directs Kubelet to periodically call the CSI driver's NodeGetInfo method every 60 seconds, updating the node’s allocatable volume count. Kubernetes enforces a minimum update interval of 10 seconds to balance accuracy and resource usage.

Immediate updates on attachment failures

In addition to periodic updates, Kubernetes now reacts to attachment failures. Specifically, if a volume attachment fails with a ResourceExhausted error (gRPC code 8), an immediate update is triggered to correct the allocatable count promptly.

This proactive correction prevents repeated scheduling errors and helps maintain cluster health.

Getting started

To experiment with mutable CSI node allocatable count in your Kubernetes v1.33 cluster:

  1. Enable the feature gate MutableCSINodeAllocatableCount on the kube-apiserver and kubelet components.
  2. Update your CSI driver configuration by setting nodeAllocatableUpdatePeriodSeconds.
  3. Monitor and observe improvements in scheduling accuracy and pod placement reliability.

Next steps

This feature is currently in alpha and the Kubernetes community welcomes your feedback. Test it, share your experiences, and help guide its evolution toward beta and GA stability.

Join discussions in the Kubernetes Storage Special Interest Group (SIG-Storage) to shape the future of Kubernetes storage capabilities.

Kubernetes v1.33: New features in DRA

Kubernetes Dynamic Resource Allocation (DRA) was originally introduced as an alpha feature in the v1.26 release, and then went through a significant redesign for Kubernetes v1.31. The main DRA feature went to beta in v1.32, and the project hopes it will be generally available in Kubernetes v1.34.

The basic feature set of DRA provides a far more powerful and flexible API for requesting devices than Device Plugin. And while DRA remains a beta feature for v1.33, the DRA team has been hard at work implementing a number of new features and UX improvements. One feature has been promoted to beta, while a number of new features have been added in alpha. The team has also made progress towards getting DRA ready for GA.

Features promoted to beta

Driver-owned Resource Claim Status was promoted to beta. This allows the driver to report driver-specific device status data for each allocated device in a resource claim, which is particularly useful for supporting network devices.

New alpha features

Partitionable Devices lets a driver advertise several overlapping logical devices (“partitions”), and the driver can reconfigure the physical device dynamically based on the actual devices allocated. This makes it possible to partition devices on-demand to meet the needs of the workloads and therefore increase the utilization.

Device Taints and Tolerations allow devices to be tainted and for workloads to tolerate those taints. This makes it possible for drivers or cluster administrators to mark devices as unavailable. Depending on the effect of the taint, this can prevent devices from being allocated or cause eviction of pods that are using the device.

Prioritized List lets users specify a list of acceptable devices for their workloads, rather than just a single type of device. So while the workload might run best on a single high-performance GPU, it might also be able to run on 2 mid-level GPUs. The scheduler will attempt to satisfy the alternatives in the list in order, so the workload will be allocated the best set of devices available in the cluster.

Admin Access has been updated so that only users with access to a namespace with the resource.k8s.io/admin-access: "true" label are authorized to create ResourceClaim or ResourceClaimTemplates objects with the adminAccess field within the namespace. This grants administrators access to in-use devices and may enable additional permissions when making the device available in a container. This ensures that non-admin users cannot misuse the feature.

Preparing for general availability

A new v1beta2 API has been added to simplify the user experience and to prepare for additional features being added in the future. The RBAC rules for DRA have been improved and support has been added for seamless upgrades of DRA drivers.

What’s next?

The plan for v1.34 is even more ambitious than for v1.33. Most importantly, we (the Kubernetes device management working group) hope to bring DRA to general availability, which will make it available by default on all v1.34 Kubernetes clusters. This also means that many, perhaps all, of the DRA features that are still beta in v1.34 will become enabled by default, making it much easier to use them.

The alpha features that were added in v1.33 will be brought to beta in v1.34.

Getting involved

A good starting point is joining the WG Device Management Slack channel and meetings, which happen at US/EU and EU/APAC friendly time slots.

Not all enhancement ideas are tracked as issues yet, so come talk to us if you want to help or have some ideas yourself! We have work to do at all levels, from difficult core changes to usability enhancements in kubectl, which could be picked up by newcomers.

Acknowledgments

A huge thanks to everyone who has contributed:

Kubernetes v1.33: Storage Capacity Scoring of Nodes for Dynamic Provisioning (alpha)

Kubernetes v1.33 introduces a new alpha feature called StorageCapacityScoring. This feature adds a scoring method for pod scheduling with the topology-aware volume provisioning. This feature eases to schedule pods on nodes with either the most or least available storage capacity.

About this feature

This feature extends the kube-scheduler's VolumeBinding plugin to perform scoring using node storage capacity information obtained from Storage Capacity. Currently, you can only filter out nodes with insufficient storage capacity. So, you have to use a scheduler extender to achieve storage-capacity-based pod scheduling.

This feature is useful for provisioning node-local PVs, which have size limits based on the node's storage capacity. By using this feature, you can assign the PVs to the nodes with the most available storage space so that you can expand the PVs later as much as possible.

In another use case, you might want to reduce the number of nodes as much as possible for low operation costs in cloud environments by choosing the least storage capacity node. This feature helps maximize resource utilization by filling up nodes more sequentially, starting with the most utilized nodes first that still have enough storage capacity for the requested volume size.

How to use

Enabling the feature

In the alpha phase, StorageCapacityScoring is disabled by default. To use this feature, add StorageCapacityScoring=true to the kube-scheduler command line option --feature-gates.

Configuration changes

You can configure node priorities based on storage utilization using the shape parameter in the VolumeBinding plugin configuration. This allows you to prioritize nodes with higher available storage capacity (default) or, conversely, nodes with lower available storage capacity. For example, to prioritize lower available storage capacity, configure KubeSchedulerConfiguration as follows:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  ...
  pluginConfig:
  - name: VolumeBinding
    args:
      ...
      shape:
      - utilization: 0
        score: 0
      - utilization: 100
        score: 10

For more details, please refer to the documentation.

Further reading

Additional note: Relationship with VolumeCapacityPriority

The alpha feature gate VolumeCapacityPriority, which performs node scoring based on available storage capacity during static provisioning, will be deprecated and replaced by StorageCapacityScoring.

Please note that while VolumeCapacityPriority prioritizes nodes with lower available storage capacity by default, StorageCapacityScoring prioritizes nodes with higher available storage capacity by default.

Kubernetes v1.33: Image Volumes graduate to beta!

Image Volumes were introduced as an Alpha feature with the Kubernetes v1.31 release as part of KEP-4639. In Kubernetes v1.33, this feature graduates to beta.

Please note that the feature is still disabled by default, because not all container runtimes have full support for it. CRI-O supports the initial feature since version v1.31 and will add support for Image Volumes as beta in v1.33. containerd merged support for the alpha feature which will be part of the v2.1.0 release and is working on beta support as part of PR #11578.

What's new

The major change for the beta graduation of Image Volumes is the support for subPath and subPathExpr mounts for containers via spec.containers[*].volumeMounts.[subPath,subPathExpr]. This allows end-users to mount a certain subdirectory of an image volume, which is still mounted as readonly (ro). This means that non-existing subdirectories cannot be mounted by default. As for other subPath and subPathExpr values, Kubernetes will ensure that there are no absolute path or relative path components part of the specified sub path. Container runtimes are also required to double check those requirements for safety reasons. If a specified subdirectory does not exist within a volume, then runtimes should fail on container creation and provide user feedback by using existing kubelet events.

Besides that, there are also three new kubelet metrics available for image volumes:

  • kubelet_image_volume_requested_total: Outlines the number of requested image volumes.
  • kubelet_image_volume_mounted_succeed_total: Counts the number of successful image volume mounts.
  • kubelet_image_volume_mounted_errors_total: Accounts the number of failed image volume mounts.

To use an existing subdirectory for a specific image volume, just use it as subPath (or subPathExpr) value of the containers volumeMounts:

apiVersion: v1
kind: Pod
metadata:
  name: image-volume
spec:
  containers:
  - name: shell
    command: ["sleep", "infinity"]
    image: debian
    volumeMounts:
    - name: volume
      mountPath: /volume
      subPath: dir
  volumes:
  - name: volume
    image:
      reference: quay.io/crio/artifact:v2
      pullPolicy: IfNotPresent

Then, create the pod on your cluster:

kubectl apply -f image-volumes-subpath.yaml

Now you can attach to the container:

kubectl attach -it image-volume bash

And check the content of the file from the dir sub path in the volume:

cat /volume/file

The output will be similar to:

1

Thank you for reading through the end of this blog post! SIG Node is proud and happy to deliver this feature graduation as part of Kubernetes v1.33.

As writer of this blog post, I would like to emphasize my special thanks to all involved individuals out there!

If you would like to provide feedback or suggestions feel free to reach out to SIG Node using the Kubernetes Slack (#sig-node) channel or the SIG Node mailing list.

Further reading

Kubernetes v1.33: HorizontalPodAutoscaler Configurable Tolerance

This post describes configurable tolerance for horizontal Pod autoscaling, a new alpha feature first available in Kubernetes 1.33.

What is it?

Horizontal Pod Autoscaling is a well-known Kubernetes feature that allows your workload to automatically resize by adding or removing replicas based on resource utilization.

Let's say you have a web application running in a Kubernetes cluster with 50 replicas. You configure the HorizontalPodAutoscaler (HPA) to scale based on CPU utilization, with a target of 75% utilization. Now, imagine that the current CPU utilization across all replicas is 90%, which is higher than the desired 75%. The HPA will calculate the required number of replicas using the formula:

desiredReplicas=ceilcurrentReplicas×currentMetricValuedesiredMetricValuedesiredReplicas = ceil\left\lceil currentReplicas \times \frac{currentMetricValue}{desiredMetricValue} \right\rceil

In this example:

50×(90/75)=6050 \times (90/75) = 60

So, the HPA will increase the number of replicas from 50 to 60 to reduce the load on each pod. Similarly, if the CPU utilization were to drop below 75%, the HPA would scale down the number of replicas accordingly. The Kubernetes documentation provides a detailed description of the scaling algorithm.

In order to avoid replicas being created or deleted whenever a small metric fluctuation occurs, Kubernetes applies a form of hysteresis: it only changes the number of replicas when the current and desired metric values differ by more than 10%. In the example above, since the ratio between the current and desired metric values is \(90/75\), or 20% above target, exceeding the 10% tolerance, the scale-up action will proceed.

This default tolerance of 10% is cluster-wide; in older Kubernetes releases, it could not be fine-tuned. It's a suitable value for most usage, but too coarse for large deployments, where a 10% tolerance represents tens of pods. As a result, the community has long asked to be able to tune this value.

In Kubernetes v1.33, this is now possible.

How do I use it?

After enabling the HPAConfigurableTolerance feature gate in your Kubernetes v1.33 cluster, you can add your desired tolerance for your HorizontalPodAutoscaler object.

Tolerances appear under the spec.behavior.scaleDown and spec.behavior.scaleUp fields and can thus be different for scale up and scale down. A typical usage would be to specify a small tolerance on scale up (to react quickly to spikes), but higher on scale down (to avoid adding and removing replicas too quickly in response to small metric fluctuations).

For example, an HPA with a tolerance of 5% on scale-down, and no tolerance on scale-up, would look like the following:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  ...
  behavior:
    scaleDown:
      tolerance: 0.05
    scaleUp:
      tolerance: 0

I want all the details!

Get all the technical details by reading KEP-4951 and follow issue 4951 to be notified of the feature graduation.

Kubernetes v1.33: User Namespaces enabled by default!

In Kubernetes v1.33 support for user namespaces is enabled by default. This means that, when the stack requirements are met, pods can opt-in to use user namespaces. To use the feature there is no need to enable any Kubernetes feature flag anymore!

In this blog post we answer some common questions about user namespaces. But, before we dive into that, let's recap what user namespaces are and why they are important.

What is a user namespace?

Note: Linux user namespaces are a different concept from Kubernetes namespaces. The former is a Linux kernel feature; the latter is a Kubernetes feature.

Linux provides different namespaces to isolate processes from each other. For example, a typical Kubernetes pod runs within a network namespace to isolate the network identity and a PID namespace to isolate the processes.

One Linux namespace that was left behind is the user namespace. It isolates the UIDs and GIDs of the containers from the ones on the host. The identifiers in a container can be mapped to identifiers on the host in a way where host and container(s) never end up in overlapping UID/GIDs. Furthermore, the identifiers can be mapped to unprivileged, non-overlapping UIDs and GIDs on the host. This brings three key benefits:

  • Prevention of lateral movement: As the UIDs and GIDs for different containers are mapped to different UIDs and GIDs on the host, containers have a harder time attacking each other, even if they escape the container boundaries. For example, suppose container A runs with different UIDs and GIDs on the host than container B. In that case, the operations it can do on container B's files and processes are limited: only read/write what a file allows to others, as it will never have permission owner or group permission (the UIDs/GIDs on the host are guaranteed to be different for different containers).

  • Increased host isolation: As the UIDs and GIDs are mapped to unprivileged users on the host, if a container escapes the container boundaries, even if it runs as root inside the container, it has no privileges on the host. This greatly protects what host files it can read/write, which process it can send signals to, etc. Furthermore, capabilities granted are only valid inside the user namespace and not on the host, limiting the impact a container escape can have.

  • Enablement of new use cases: User namespaces allow containers to gain certain capabilities inside their own user namespace without affecting the host. This unlocks new possibilities, such as running applications that require privileged operations without granting full root access on the host. This is particularly useful for running nested containers.

Image showing IDs 0-65535 are reserved to the host, pods use higher IDs

User namespace IDs allocation

If a pod running as the root user without a user namespace manages to breakout, it has root privileges on the node. If some capabilities were granted to the container, the capabilities are valid on the host too. None of this is true when using user namespaces (modulo bugs, of course 🙂).

Demos

Rodrigo created demos to understand how some CVEs are mitigated when user namespaces are used. We showed them here before (see here and here), but take a look if you haven't:

Mitigation of CVE 2024-21626 with user namespaces:

Mitigation of CVE 2022-0492 with user namespaces:

Everything you wanted to know about user namespaces in Kubernetes

Here we try to answer some of the questions we have been asked about user namespaces support in Kubernetes.

1. What are the requirements to use it?

The requirements are documented here. But we will elaborate a bit more, in the following questions.

Note this is a Linux-only feature.

2. How do I configure a pod to opt-in?

A complete step-by-step guide is available here. But the short version is you need to set the hostUsers: false field in the pod spec. For example like this:

apiVersion: v1
kind: Pod
metadata:
  name: userns
spec:
  hostUsers: false
  containers:
  - name: shell
    command: ["sleep", "infinity"]
    image: debian

Yes, it is that simple. Applications will run just fine, without any other changes needed (unless your application needs the privileges).

User namespaces allows you to run as root inside the container, but not have privileges in the host. However, if your application needs the privileges on the host, for example an app that needs to load a kernel module, then you can't use user namespaces.

3. What are idmap mounts and why the file-systems used need to support it?

Idmap mounts are a Linux kernel feature that uses a mapping of UIDs/GIDs when accessing a mount. When combined with user namespaces, it greatly simplifies the support for volumes, as you can forget about the host UIDs/GIDs the user namespace is using.

In particular, thanks to idmap mounts we can:

  • Run each pod with different UIDs/GIDs on the host. This is key for the lateral movement prevention we mentioned earlier.
  • Share volumes with pods that don't use user namespaces.
  • Enable/disable user namespaces without needing to chown the pod's volumes.

Support for idmap mounts in the kernel is per file-system and different kernel releases added support for idmap mounts on different file-systems.

To find which kernel version added support for each file-system, you can check out the mount_setattr man page, or the online version of it here.

Most popular file-systems are supported, the notable absence that isn't supported yet is NFS.

4. Can you clarify exactly which file-systems need to support idmap mounts?

The file-systems that need to support idmap mounts are all the file-systems used by a pod in the pod.spec.volumes field.

This means: for PV/PVC volumes, the file-system used in the PV needs to support idmap mounts; for hostPath volumes, the file-system used in the hostPath needs to support idmap mounts.

What does this mean for secrets/configmaps/projected/downwardAPI volumes? For these volumes, the kubelet creates a tmpfs file-system. So, you will need a 6.3 kernel to use these volumes (note that if you use them as env variables it is fine).

And what about emptyDir volumes? Those volumes are created by the kubelet by default in /var/lib/kubelet/pods/. You can also use a custom directory for this. But what needs to support idmap mounts is the file-system used in that directory.

The kubelet creates some more files for the container, like /etc/hostname, /etc/resolv.conf, /dev/termination-log, /etc/hosts, etc. These files are also created in /var/lib/kubelet/pods/ by default, so it's important for the file-system used in that directory to support idmap mounts.

Also, some container runtimes may put some of these ephemeral volumes inside a tmpfs file-system, in which case you will need support for idmap mounts in tmpfs.

5. Can I use a kernel older than 6.3?

Yes, but you will need to make sure you are not using a tmpfs file-system. If you avoid that, you can easily use 5.19 (if all the other file-systems you use support idmap mounts in that kernel).

It can be tricky to avoid using tmpfs, though, as we just described above. Besides having to avoid those volume types, you will also have to avoid mounting the service account token. Every pod has it mounted by default, and it uses a projected volume that, as we mentioned, uses a tmpfs file-system.

You could even go lower than 5.19, all the way to 5.12. However, your container rootfs probably uses an overlayfs file-system, and support for overlayfs was added in 5.19. We wouldn't recommend to use a kernel older than 5.19, as not being able to use idmap mounts for the rootfs is a big limitation. If you absolutely need to, you can check this blog post Rodrigo wrote some years ago, about tricks to use user namespaces when you can't support idmap mounts on the rootfs.

6. If my stack supports user namespaces, do I need to configure anything else?

No, if your stack supports it and you are using Kubernetes v1.33, there is nothing you need to configure. You should be able to follow the task: Use a user namespace with a pod.

However, in case you have specific requirements, you may configure various options. You can find more information here. You can also enable a feature gate to relax the PSS rules.

7. The demos are nice, but are there more CVEs that this mitigates?

Yes, quite a lot, actually! Besides the ones in the demo, the KEP has more CVEs you can check. That list is not exhaustive, there are many more.

8. Can you sum up why user namespaces is important?

Think about running a process as root, maybe even an untrusted process. Do you think that is secure? What if we limit it by adding seccomp and apparmor, mask some files in /proc (so it can't crash the node, etc.) and some more tweaks?

Wouldn't it be better if we don't give it privileges in the first place, instead of trying to play whack-a-mole with all the possible ways root can escape?

This is what user namespaces does, plus some other goodies:

  • Run as an unprivileged user on the host without making changes to your application. Greg and Vinayak did a great talk on the pains you can face when trying to run unprivileged without user namespaces. The pains part starts in this minute.

  • All pods run with different UIDs/GIDs, we significantly improve the lateral movement. This is guaranteed with user namespaces (the kubelet chooses it for you). In the same talk, Greg and Vinayak show that to achieve the same without user namespaces, they went through a quite complex custom solution. This part starts in this minute.

  • The capabilities granted are only granted inside the user namespace. That means that if a pod breaks out of the container, they are not valid on the host. We can't provide that without user namespaces.

  • It enables new use-cases in a secure way. You can run docker in docker, unprivileged container builds, Kubernetes inside Kubernetes, etc all in a secure way. Most of the previous solutions to do this required privileged containers or putting the node at a high risk of compromise.

9. Is there container runtime documentation for user namespaces?

Yes, we have containerd documentation. This explains different limitations of containerd 1.7 and how to use user namespaces in containerd without Kubernetes pods (using ctr). Note that if you use containerd, you need containerd 2.0 or higher to use user namespaces with Kubernetes.

CRI-O doesn't have special documentation for user namespaces, it works out of the box.

10. What about the other container runtimes?

No other container runtime that we are aware of supports user namespaces with Kubernetes. That sadly includes cri-dockerd too.

11. I'd like to learn more about it, what would you recommend?

Rodrigo did an introduction to user namespaces at KubeCon 2022:

Also, this aforementioned presentation at KubeCon 2023 can be useful as a motivation for user namespaces:

Bear in mind the presentation are some years old, some things have changed since then. Use the Kubernetes documentation as the source of truth.

If you would like to learn more about the low-level details of user namespaces, you can check man 7 user_namespaces and man 1 unshare. You can easily create namespaces and experiment with how they behave. Be aware that the unshare tool has a lot of flexibility, and with that options to create incomplete setups.

If you would like to know more about idmap mounts, you can check its Linux kernel documentation.

Conclusions

Running pods as root is not ideal and running them as non-root is also hard with containers, as it can require a lot of changes to the applications. User namespaces are a unique feature to let you have the best of both worlds: run as non-root, without any changes to your application.

This post covered: what are user namespaces, why they are important, some real world examples of CVEs mitigated by user-namespaces, and some common questions. Hopefully, this post helped you to eliminate the last doubts you had and you will now try user-namespaces (if you didn't already!).

How do I get involved?

You can reach SIG Node by several means:

You can also contact us directly:

  • GitHub: @rata @giuseppe @saschagrunert
  • Slack: @rata @giuseppe @sascha

Kubernetes v1.33: Continuing the transition from Endpoints to EndpointSlices

Since the addition of EndpointSlices (KEP-752) as alpha in v1.15 and later GA in v1.21, the Endpoints API in Kubernetes has been gathering dust. New Service features like dual-stack networking and traffic distribution are only supported via the EndpointSlice API, so all service proxies, Gateway API implementations, and similar controllers have had to be ported from using Endpoints to using EndpointSlices. At this point, the Endpoints API is really only there to avoid breaking end user workloads and scripts that still make use of it.

As of Kubernetes 1.33, the Endpoints API is now officially deprecated, and the API server will return warnings to users who read or write Endpoints resources rather than using EndpointSlices.

Eventually, the plan (as documented in KEP-4974) is to change the Kubernetes Conformance criteria to no longer require that clusters run the Endpoints controller (which generates Endpoints objects based on Services and Pods), to avoid doing work that is unneeded in most modern-day clusters.

Thus, while the Kubernetes deprecation policy means that the Endpoints type itself will probably never completely go away, users who still have workloads or scripts that use the Endpoints API should start migrating them to EndpointSlices.

Notes on migrating from Endpoints to EndpointSlices

Consuming EndpointSlices rather than Endpoints

For end users, the biggest change between the Endpoints API and the EndpointSlice API is that while every Service with a selector has exactly 1 Endpoints object (with the same name as the Service), a Service may have any number of EndpointSlices associated with it:

$ kubectl get endpoints myservice
Warning: v1 Endpoints is deprecated in v1.33+; use discovery.k8s.io/v1 EndpointSlice
NAME        ENDPOINTS          AGE
myservice   10.180.3.17:443    1h

$ kubectl get endpointslice -l kubernetes.io/service-name=myservice
NAME              ADDRESSTYPE   PORTS   ENDPOINTS          AGE
myservice-7vzhx   IPv4          443     10.180.3.17        21s
myservice-jcv8s   IPv6          443     2001:db8:0123::5   21s

In this case, because the service is dual stack, it has 2 EndpointSlices: 1 for IPv4 addresses and 1 for IPv6 addresses. (The Endpoints API does not support dual stack, so the Endpoints object shows only the addresses in the cluster's primary address family.) Although any Service with multiple endpoints can have multiple EndpointSlices, there are three main cases where you will see this:

  • An EndpointSlice can only represent endpoints of a single IP family, so dual-stack Services will have separate EndpointSlices for IPv4 and IPv6.

  • All of the endpoints in an EndpointSlice must target the same ports. So, for example, if you have a set of endpoint Pods listening on port 80, and roll out an update to make them listen on port 8080 instead, then while the rollout is in progress, the Service will need 2 EndpointSlices: 1 for the endpoints listening on port 80, and 1 for the endpoints listening on port 8080.

  • When a Service has more than 100 endpoints, the EndpointSlice controller will split the endpoints into multiple EndpointSlices rather than aggregating them into a single excessively-large object like the Endpoints controller does.

Because there is not a predictable 1-to-1 mapping between Services and EndpointSlices, there is no way to know what the actual name of the EndpointSlice resource(s) for a Service will be ahead of time; thus, instead of fetching the EndpointSlice(s) by name, you instead ask for all EndpointSlices with a "kubernetes.io/service-name" label pointing to the Service:

$ kubectl get endpointslice -l kubernetes.io/service-name=myservice

A similar change is needed in Go code. With Endpoints, you would do something like:

// Get the Endpoints named `name` in `namespace`.
endpoint, err := client.CoreV1().Endpoints(namespace).Get(ctx, name, metav1.GetOptions{})
if err != nil {
	if apierrors.IsNotFound(err) {
		// No Endpoints exists for the Service (yet?)
		...
	}
        // handle other errors
	...
}

// process `endpoint`
...

With EndpointSlices, this becomes:

// Get all EndpointSlices for Service `name` in `namespace`.
slices, err := client.DiscoveryV1().EndpointSlices(namespace).List(ctx,
	metav1.ListOptions{LabelSelector: discoveryv1.LabelServiceName + "=" + name})
if err != nil {
        // handle errors
	...
} else if len(slices.Items) == 0 {
	// No EndpointSlices exist for the Service (yet?)
	...
}

// process `slices.Items`
...

Generating EndpointSlices rather than Endpoints

For people (or controllers) generating Endpoints, migrating to EndpointSlices is slightly easier, because in most cases you won't have to worry about multiple slices. You just need to update your YAML or Go code to use the new type (which organizes the information in a slightly different way than Endpoints did).

For example, this Endpoints object:

apiVersion: v1
kind: Endpoints
metadata:
  name: myservice
subsets:
  - addresses:
      - ip: 10.180.3.17
        nodeName: node-4
      - ip: 10.180.5.22
        nodeName: node-9
      - ip: 10.180.18.2
        nodeName: node-7
    notReadyAddresses:
      - ip: 10.180.6.6
        nodeName: node-8
    ports:
      - name: https
        protocol: TCP
        port: 443

would become something like:

apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: myservice
  labels:
    kubernetes.io/service-name: myservice
addressType: IPv4
endpoints:
  - addresses:
      - 10.180.3.17
    nodeName: node-4
  - addresses:
      - 10.180.5.22
    nodeName: node-9
  - addresses:
      - 10.180.18.12
    nodeName: node-7
  - addresses:
      - 10.180.6.6
    nodeName: node-8
    conditions:
      ready: false
ports:
  - name: https
    protocol: TCP
    port: 443

Some points to note:

  1. This example uses an explicit name, but you could also use generateName and let the API server append a unique suffix. The name itself does not matter: what matters is the "kubernetes.io/service-name" label pointing back to the Service.

  2. You have to explicitly indicate addressType: IPv4 (or IPv6).

  3. An EndpointSlice is similar to a single element of the "subsets" array in Endpoints. An Endpoints object with multiple subsets will normally need to be expressed as multiple EndpointSlices, each with different "ports".

  4. The endpoints and addresses fields are both arrays, but by convention, each addresses array only contains a single element. If your Service has multiple endpoints, then you need to have multiple elements in the endpoints array, each with a single element in its addresses array.

  5. The Endpoints API lists "ready" and "not-ready" endpoints separately, while the EndpointSlice API allows each endpoint to have conditions (such as "ready: false") associated with it.

And of course, once you have ported to EndpointSlice, you can make use of EndpointSlice-specific features, such as topology hints and terminating endpoints. Consult the EndpointSlice API documentation for more information.

Kubernetes v1.33: Octarine

Editors: Agustina Barbetta, Aakanksha Bhende, Udi Hofesh, Ryota Sawada, Sneha Yadav

Similar to previous releases, the release of Kubernetes v1.33 introduces new stable, beta, and alpha features. The consistent delivery of high-quality releases underscores the strength of our development cycle and the vibrant support from our community.

This release consists of 64 enhancements. Of those enhancements, 18 have graduated to Stable, 20 are entering Beta, 24 have entered Alpha, and 2 are deprecated or withdrawn.

There are also several notable deprecations and removals in this release; make sure to read about those if you already run an older version of Kubernetes.

The theme for Kubernetes v1.33 is Octarine: The Color of Magic1, inspired by Terry Pratchett’s Discworld series. This release highlights the open source magic2 that Kubernetes enables across the ecosystem.

If you’re familiar with the world of Discworld, you might recognize a small swamp dragon perched atop the tower of the Unseen University, gazing up at the Kubernetes moon above the city of Ankh-Morpork with 64 stars3 in the background.

As Kubernetes moves into its second decade, we celebrate both the wizardry of its maintainers, the curiosity of new contributors, and the collaborative spirit that fuels the project. The v1.33 release is a reminder that, as Pratchett wrote, “It’s still magic even if you know how it’s done.” Even if you know the ins and outs of the Kubernetes code base, stepping back at the end of the release cycle, you’ll realize that Kubernetes remains magical.

Kubernetes v1.33 is a testament to the enduring power of open source innovation, where hundreds of contributors4 from around the world work together to create something truly extraordinary. Behind every new feature, the Kubernetes community works to maintain and improve the project, ensuring it remains secure, reliable, and released on time. Each release builds upon the other, creating something greater than we could achieve alone.

1. Octarine is the mythical eighth color, visible only to those attuned to the arcane—wizards, witches, and, of course, cats. And occasionally, someone who’s stared at IPtable rules for too long.
2. Any sufficiently advanced technology is indistinguishable from magic…?
3. It’s not a coincidence 64 KEPs (Kubernetes Enhancement Proposals) are also included in v1.33.
4. See the Project Velocity section for v1.33 🚀

Spotlight on key updates

Kubernetes v1.33 is packed with new features and improvements. Here are a few select updates the Release Team would like to highlight!

Stable: Sidecar containers

The sidecar pattern involves deploying separate auxiliary container(s) to handle extra capabilities in areas such as networking, logging, and metrics gathering. Sidecar containers graduate to stable in v1.33.

Kubernetes implements sidecars as a special class of init containers with restartPolicy: Always, ensuring that sidecars start before application containers, remain running throughout the pod's lifecycle, and terminate automatically after the main containers exit.

Additionally, sidecars can utilize probes (startup, readiness, liveness) to signal their operational state, and their Out-Of-Memory (OOM) score adjustments are aligned with primary containers to prevent premature termination under memory pressure.

To learn more, read Sidecar Containers.

This work was done as part of KEP-753: Sidecar Containers led by SIG Node.

Beta: In-place resource resize for vertical scaling of Pods

Workloads can be defined using APIs like Deployment, StatefulSet, etc. These describe the template for the Pods that should run, including memory and CPU resources, as well as the replica count of the number of Pods that should run. Workloads can be scaled horizontally by updating the Pod replica count, or vertically by updating the resources required in the Pods container(s). Before this enhancement, container resources defined in a Pod's spec were immutable, and updating any of these details within a Pod template would trigger Pod replacement.

But what if you could dynamically update the resource configuration for your existing Pods without restarting them?

The KEP-1287 is precisely to allow such in-place Pod updates. It was released as alpha in v1.27, and has graduated to beta in v1.33. This opens up various possibilities for vertical scale-up of stateful processes without any downtime, seamless scale-down when the traffic is low, and even allocating larger resources during startup, which can then be reduced once the initial setup is complete.

This work was done as part of KEP-1287: In-Place Update of Pod Resources led by SIG Node and SIG Autoscaling.

Alpha: New configuration option for kubectl with .kuberc for user preferences

In v1.33, kubectl introduces a new alpha feature with opt-in configuration file .kuberc for user preferences. This file can contain kubectl aliases and overrides (e.g. defaulting to use server-side apply), while leaving cluster credentials and host information in kubeconfig. This separation allows sharing the same user preferences for kubectl interaction, regardless of target cluster and kubeconfig used.

To enable this alpha feature, users can set the environment variable of KUBECTL_KUBERC=true and create a .kuberc configuration file. By default, kubectl looks for this file in ~/.kube/kuberc. You can also specify an alternative location using the --kuberc flag, for example: kubectl --kuberc /var/kube/rc.

This work was done as part of KEP-3104: Separate kubectl user preferences from cluster configs led by SIG CLI.

Features graduating to Stable

This is a selection of some of the improvements that are now stable following the v1.33 release.

Backoff limits per index for indexed Jobs

​This release graduates a feature that allows setting backoff limits on a per-index basis for Indexed Jobs. Traditionally, the backoffLimit parameter in Kubernetes Jobs specifies the number of retries before considering the entire Job as failed. This enhancement allows each index within an Indexed Job to have its own backoff limit, providing more granular control over retry behavior for individual tasks. This ensures that the failure of specific indices does not prematurely terminate the entire Job, allowing the other indices to continue processing independently.

This work was done as part of KEP-3850: Backoff Limit Per Index For Indexed Jobs led by SIG Apps.

Job success policy

Using .spec.successPolicy, users can specify which pod indexes must succeed (succeededIndexes), how many pods must succeed (succeededCount), or a combination of both. This feature benefits various workloads, including simulations where partial completion is sufficient, and leader-worker patterns where only the leader's success determines the Job's overall outcome.

This work was done as part of KEP-3998: Job success/completion policy led by SIG Apps.

Bound ServiceAccount token security improvements

This enhancement introduced features such as including a unique token identifier (i.e. JWT ID Claim, also known as JTI) and node information within the tokens, enabling more precise validation and auditing. Additionally, it supports node-specific restrictions, ensuring that tokens are only usable on designated nodes, thereby reducing the risk of token misuse and potential security breaches. These improvements, now generally available, aim to enhance the overall security posture of service account tokens within Kubernetes clusters.

This work was done as part of KEP-4193: Bound service account token improvements led by SIG Auth.

Subresource support in kubectl

The --subresource argument is now generally available for kubectl subcommands such as get, patch, edit, apply and replace, allowing users to fetch and update subresources for all resources that support them. To learn more about the subresources supported, visit the kubectl reference.

This work was done as part of KEP-2590: Add subresource support to kubectl led by SIG CLI.

Multiple Service CIDRs

This enhancement introduced a new implementation of allocation logic for Service IPs. Across the whole cluster, every Service of type: ClusterIP must have a unique IP address assigned to it. Trying to create a Service with a specific cluster IP that has already been allocated will return an error. The updated IP address allocator logic uses two newly stable API objects: ServiceCIDR and IPAddress. Now generally available, these APIs allow cluster administrators to dynamically increase the number of IP addresses available for type: ClusterIP Services (by creating new ServiceCIDR objects).

This work was done as part of KEP-1880: Multiple Service CIDRs led by SIG Network.

nftables backend for kube-proxy

The nftables backend for kube-proxy is now stable, adding a new implementation that significantly improves performance and scalability for Services implementation within Kubernetes clusters. For compatibility reasons, iptables remains the default on Linux nodes. Check the migration guide if you want to try it out.

This work was done as part of KEP-3866: nftables kube-proxy backend led by SIG Network.

Topology aware routing with trafficDistribution: PreferClose

This release graduates topology-aware routing and traffic distribution to GA, which would allow us to optimize service traffic in multi-zone clusters. The topology-aware hints in EndpointSlices would enable components like kube-proxy to prioritize routing traffic to endpoints within the same zone, thereby reducing latency and cross-zone data transfer costs. Building upon this, trafficDistribution field is added to the Service specification, with the PreferClose option directing traffic to the nearest available endpoints based on network topology. This configuration enhances performance and cost-efficiency by minimizing inter-zone communication.

This work was done as part of KEP-4444: Traffic Distribution for Services and KEP-2433: Topology Aware Routing led by SIG Network.

Options to reject non SMT-aligned workload

This feature added policy options to the CPU Manager, enabling it to reject workloads that do not align with Simultaneous Multithreading (SMT) configurations. This enhancement, now generally available, ensures that when a pod requests exclusive use of CPU cores, the CPU Manager can enforce allocation of entire core pairs (comprising primary and sibling threads) on SMT-enabled systems, thereby preventing scenarios where workloads share CPU resources in unintended ways.

This work was done as part of KEP-2625: node: cpumanager: add options to reject non SMT-aligned workload led by SIG Node.

Defining Pod affinity or anti-affinity using matchLabelKeys and mismatchLabelKeys

The matchLabelKeys and mismatchLabelKeys fields are available in Pod affinity terms, enabling users to finely control the scope where Pods are expected to co-exist (Affinity) or not (AntiAffinity). These newly stable options complement the existing labelSelector mechanism. The affinity fields facilitate enhanced scheduling for versatile rolling updates, as well as isolation of services managed by tools or controllers based on global configurations.

This work was done as part of KEP-3633: Introduce MatchLabelKeys to Pod Affinity and Pod Anti Affinity led by SIG Scheduling.

Considering taints and tolerations when calculating Pod topology spread skew

This enhanced PodTopologySpread by introducing two fields: nodeAffinityPolicy and nodeTaintsPolicy. These fields allow users to specify whether node affinity rules and node taints should be considered when calculating pod distribution across nodes. By default, nodeAffinityPolicy is set to Honor, meaning only nodes matching the pod's node affinity or selector are included in the distribution calculation. The nodeTaintsPolicy defaults to Ignore, indicating that node taints are not considered unless specified. This enhancement provides finer control over pod placement, ensuring that pods are scheduled on nodes that meet both affinity and taint toleration requirements, thereby preventing scenarios where pods remain pending due to unsatisfied constraints.

This work was done as part of KEP-3094: Take taints/tolerations into consideration when calculating PodTopologySpread skew led by SIG Scheduling.

Volume populators

After being released as beta in v1.24, volume populators have graduated to GA in v1.33. This newly stable feature provides a way to allow users to pre-populate volumes with data from various sources, and not just from PersistentVolumeClaim (PVC) clones or volume snapshots. The mechanism relies on the dataSourceRef field within a PersistentVolumeClaim. This field offers more flexibility than the existing dataSource field, and allows for custom resources to be used as data sources.

A special controller, volume-data-source-validator, validates these data source references, alongside a newly stable CustomResourceDefinition (CRD) for an API kind named VolumePopulator. The VolumePopulator API allows volume populator controllers to register the types of data sources they support. You need to set up your cluster with the appropriate CRD in order to use volume populators.

This work was done as part of KEP-1495: Generic data populators led by SIG Storage.

Always honor PersistentVolume reclaim policy

This enhancement addressed an issue where the Persistent Volume (PV) reclaim policy is not consistently honored, leading to potential storage resource leaks. Specifically, if a PV is deleted before its associated Persistent Volume Claim (PVC), the "Delete" reclaim policy may not be executed, leaving the underlying storage assets intact. To mitigate this, Kubernetes now sets finalizers on relevant PVs, ensuring that the reclaim policy is enforced regardless of the deletion sequence. This enhancement prevents unintended retention of storage resources and maintains consistency in PV lifecycle management.

This work was done as part of KEP-2644: Always Honor PersistentVolume Reclaim Policy led by SIG Storage.

New features in Beta

This is a selection of some of the improvements that are now beta following the v1.33 release.

Support for Direct Service Return (DSR) in Windows kube-proxy

DSR provides performance optimizations by allowing the return traffic routed through load balancers to bypass the load balancer and respond directly to the client; reducing load on the load balancer and also reducing overall latency. For information on DSR on Windows, read Direct Server Return (DSR) in a nutshell.

Initially introduced in v1.14, support for DSR has been promoted to beta by SIG Windows as part of KEP-5100: Support for Direct Service Return (DSR) and overlay networking in Windows kube-proxy.

Structured parameter support

While structured parameter support continues as a beta feature in Kubernetes v1.33, this core part of Dynamic Resource Allocation (DRA) has seen significant improvements. A new v1beta2 version simplifies the resource.k8s.io API, and regular users with the namespaced cluster edit role can now use DRA.

The kubelet now includes seamless upgrade support, enabling drivers deployed as DaemonSets to use a rolling update mechanism. For DRA implementations, this prevents the deletion and re-creation of ResourceSlices, allowing them to remain unchanged during upgrades. Additionally, a 30-second grace period has been introduced before the kubelet cleans up after unregistering a driver, providing better support for drivers that do not use rolling updates.

This work was done as part of KEP-4381: DRA: structured parameters by WG Device Management, a cross-functional team including SIG Node, SIG Scheduling, and SIG Autoscaling.

Dynamic Resource Allocation (DRA) for network interfaces

The standardized reporting of network interface data via DRA, introduced in v1.32, has graduated to beta in v1.33. This enables more native Kubernetes network integrations, simplifying the development and management of networking devices. This was covered previously in the v1.32 release announcement blog.

This work was done as part of KEP-4817: DRA: Resource Claim Status with possible standardized network interface data led by SIG Network, SIG Node, and WG Device Management.

Handle unscheduled pods early when scheduler does not have any pod on activeQ

This feature improves queue scheduling behavior. Behind the scenes, the scheduler achieves this by popping pods from the backoffQ, which are not backed off due to errors, when the activeQ is empty. Previously, the scheduler would become idle even when the activeQ was empty; this enhancement improves scheduling efficiency by preventing that.

This work was done as part of KEP-5142: Pop pod from backoffQ when activeQ is empty led by SIG Scheduling.

Asynchronous preemption in the Kubernetes Scheduler

Preemption ensures higher-priority pods get the resources they need by evicting lower-priority ones. Asynchronous Preemption, introduced in v1.32 as alpha, has graduated to beta in v1.33. With this enhancement, heavy operations such as API calls to delete pods are processed in parallel, allowing the scheduler to continue scheduling other pods without delays. This improvement is particularly beneficial in clusters with high Pod churn or frequent scheduling failures, ensuring a more efficient and resilient scheduling process.

This work was done as part of KEP-4832: Asynchronous preemption in the scheduler led by SIG Scheduling.

ClusterTrustBundles

ClusterTrustBundle, a cluster-scoped resource designed for holding X.509 trust anchors (root certificates), has graduated to beta in v1.33. This API makes it easier for in-cluster certificate signers to publish and communicate X.509 trust anchors to cluster workloads.

This work was done as part of KEP-3257: ClusterTrustBundles (previously Trust Anchor Sets) led by SIG Auth.

Fine-grained SupplementalGroups control

Introduced in v1.31, this feature graduates to beta in v1.33 and is now enabled by default. Provided that your cluster has the SupplementalGroupsPolicy feature gate enabled, the supplementalGroupsPolicy field within a Pod's securityContext supports two policies: the default Merge policy maintains backward compatibility by combining specified groups with those from the container image's /etc/group file, whereas the new Strict policy applies only to explicitly defined groups.

This enhancement helps to address security concerns where implicit group memberships from container images could lead to unintended file access permissions and bypass policy controls.

This work was done as part of KEP-3619: Fine-grained SupplementalGroups control led by SIG Node.

Support for mounting images as volumes

Support for using Open Container Initiative (OCI) images as volumes in Pods, introduced in v1.31, has graduated to beta. This feature allows users to specify an image reference as a volume in a Pod while reusing it as a volume mount within containers. It opens up the possibility of packaging the volume data separately, and sharing them among containers in a Pod without including them in the main image, thereby reducing vulnerabilities and simplifying image creation.

This work was done as part of KEP-4639: VolumeSource: OCI Artifact and/or Image led by SIG Node and SIG Storage.

Support for user namespaces within Linux Pods

One of the oldest open KEPs as of writing is KEP-127, Pod security improvement by using Linux User namespaces for Pods. This KEP was first opened in late 2016, and after multiple iterations, had its alpha release in v1.25, initial beta in v1.30 (where it was disabled by default), and has moved to on-by-default beta as part of v1.33.

This support will not impact existing Pods unless you manually specify pod.spec.hostUsers to opt in. As highlighted in the v1.30 sneak peek blog, this is an important milestone for mitigating vulnerabilities.

This work was done as part of KEP-127: Support User Namespaces in pods led by SIG Node.

Pod procMount option

The procMount option, introduced as alpha in v1.12, and off-by-default beta in v1.31, has moved to an on-by-default beta in v1.33. This enhancement improves Pod isolation by allowing users to fine-tune access to the /proc filesystem. Specifically, it adds a field to the Pod securityContext that lets you override the default behavior of masking and marking certain /proc paths as read-only. This is particularly useful for scenarios where users want to run unprivileged containers inside the Kubernetes Pod using user namespaces. Normally, the container runtime (via the CRI implementation) starts the outer container with strict /proc mount settings. However, to successfully run nested containers with an unprivileged Pod, users need a mechanism to relax those defaults, and this feature provides exactly that.

This work was done as part of KEP-4265: add ProcMount option led by SIG Node.

CPUManager policy to distribute CPUs across NUMA nodes

This feature adds a new policy option for the CPU Manager to distribute CPUs across Non-Uniform Memory Access (NUMA) nodes, rather than concentrating them on a single node. It optimizes CPU resource allocation by balancing workloads across multiple NUMA nodes, thereby improving performance and resource utilization in multi-NUMA systems.

This work was done as part of KEP-2902: Add CPUManager policy option to distribute CPUs across NUMA nodes instead of packing them led by SIG Node.

Zero-second sleeps for container PreStop hooks

Kubernetes 1.29 introduced a Sleep action for the preStop lifecycle hook in Pods, allowing containers to pause for a specified duration before termination. This provides a straightforward method to delay container shutdown, facilitating tasks such as connection draining or cleanup operations.

The Sleep action in a preStop hook can now accept a zero-second duration as a beta feature. This allows defining a no-op preStop hook, which is useful when a preStop hook is required but no delay is desired.

This work was done as part of KEP-3960: Introducing Sleep Action for PreStop Hook and KEP-4818: Allow zero value for Sleep Action of PreStop Hook led by SIG Node.

Internal tooling for declarative validation of Kubernetes-native types

Behind the scenes, the internals of Kubernetes are starting to use a new mechanism for validating objects and changes to objects. Kubernetes v1.33 introduces validation-gen, an internal tool that Kubernetes contributors use to generate declarative validation rules. The overall goal is to improve the robustness and maintainability of API validations by enabling developers to specify validation constraints declaratively, reducing manual coding errors and ensuring consistency across the codebase.

This work was done as part of KEP-5073: Declarative Validation Of Kubernetes Native Types With validation-gen led by SIG API Machinery.

New features in Alpha

This is a selection of some of the improvements that are now alpha following the v1.33 release.

Configurable tolerance for HorizontalPodAutoscalers

This feature introduces configurable tolerance for HorizontalPodAutoscalers, which dampens scaling reactions to small metric variations.

This work was done as part of KEP-4951: Configurable tolerance for Horizontal Pod Autoscalers led by SIG Autoscaling.

Configurable container restart delay

Introduced as alpha1 in v1.32, this feature provides a set of kubelet-level configurations to fine-tune how CrashLoopBackOff is handled.

This work was done as part of KEP-4603: Tune CrashLoopBackOff led by SIG Node.

Custom container stop signals

Before Kubernetes v1.33, stop signals could only be set in container image definitions (for example, via the StopSignal configuration field in the image metadata). If you wanted to modify termination behavior, you needed to build a custom container image. By enabling the (alpha) ContainerStopSignals feature gate in Kubernetes v1.33, you can now define custom stop signals directly within Pod specifications. This is defined in the container's lifecycle.stopSignal field and requires the Pod's spec.os.name field to be present. If unspecified, containers fall back to the image-defined stop signal (if present), or the container runtime default (typically SIGTERM for Linux).

This work was done as part of KEP-4960: Container Stop Signals led by SIG Node.

DRA enhancements galore!

Kubernetes v1.33 continues to develop Dynamic Resource Allocation (DRA) with features designed for today’s complex infrastructures. DRA is an API for requesting and sharing resources between pods and containers inside a pod. Typically those resources are devices such as GPUs, FPGAs, and network adapters.

The following are all the alpha DRA feature gates introduced in v1.33:

  • Similar to Node taints, by enabling the DRADeviceTaints feature gate, devices support taints and tolerations. An admin or a control plane component can taint devices to limit their usage. Scheduling of pods which depend on those devices can be paused while a taint exists and/or pods using a tainted device can be evicted.
  • By enabling the feature gate DRAPrioritizedList, DeviceRequests get a new field named firstAvailable. This field is an ordered list that allows the user to specify that a request may be satisfied in different ways, including allocating nothing at all if some specific hardware is not available.
  • With feature gate DRAAdminAccess enabled, only users authorized to create ResourceClaim or ResourceClaimTemplate objects in namespaces labeled with resource.k8s.io/admin-access: "true" can use the adminAccess field. This ensures that non-admin users cannot misuse the adminAccess feature.
  • While it has been possible to consume device partitions since v1.31, vendors had to pre-partition devices and advertise them accordingly. By enabling the DRAPartitionableDevices feature gate in v1.33, device vendors can advertise multiple partitions, including overlapping ones. The Kubernetes scheduler will choose the partition based on workload requests, and prevent the allocation of conflicting partitions simultaneously. This feature gives vendors the ability to dynamically create partitions at allocation time. The allocation and dynamic partitioning are automatic and transparent to users, enabling improved resource utilization.

These feature gates have no effect unless you also enable the DynamicResourceAllocation feature gate.

This work was done as part of KEP-5055: DRA: device taints and tolerations, KEP-4816: DRA: Prioritized Alternatives in Device Requests, KEP-5018: DRA: AdminAccess for ResourceClaims and ResourceClaimTemplates, and KEP-4815: DRA: Add support for partitionable devices, led by SIG Node, SIG Scheduling and SIG Auth.

Robust image pull policy to authenticate images for IfNotPresent and Never

This feature allows users to ensure that kubelet requires an image pull authentication check for each new set of credentials, regardless of whether the image is already present on the node.

This work was done as part of KEP-2535: Ensure secret pulled images led by SIG Auth.

Node topology labels are available via downward API

This feature enables Node topology labels to be exposed via the downward API. Prior to Kubernetes v1.33, a workaround involved using an init container to query the Kubernetes API for the underlying node; this alpha feature simplifies how workloads can access Node topology information.

This work was done as part of KEP-4742: Expose Node labels via downward API led by SIG Node.

Better pod status with generation and observed generation

Prior to this change, the metadata.generation field was unused in pods. Along with extending to support metadata.generation, this feature will introduce status.observedGeneration to provide clearer pod status.

This work was done as part of KEP-5067: Pod Generation led by SIG Node.

Support for split level 3 cache architecture with kubelet’s CPU Manager

The previous kubelet’s CPU Manager was unaware of split L3 cache architecture (also known as Last Level Cache, or LLC), and can potentially distribute CPU assignments without considering the split L3 cache, causing a noisy neighbor problem. This alpha feature improves the CPU Manager to better assign CPU cores for better performance.

This work was done as part of KEP-5109: Split L3 Cache Topology Awareness in CPU Manager led by SIG Node.

PSI (Pressure Stall Information) metrics for scheduling improvements

This feature adds support on Linux nodes for providing PSI stats and metrics using cgroupv2. It can detect resource shortages and provide nodes with more granular control for pod scheduling.

This work was done as part of KEP-4205: Support PSI based on cgroupv2 led by SIG Node.

Secret-less image pulls with kubelet

The kubelet's on-disk credential provider now supports optional Kubernetes ServiceAccount (SA) token fetching. This simplifies authentication with image registries by allowing cloud providers to better integrate with OIDC compatible identity solutions.

This work was done as part of KEP-4412: Projected service account tokens for Kubelet image credential providers led by SIG Auth.

Graduations, deprecations, and removals in v1.33

Graduations to stable

This lists all the features that have graduated to stable (also known as general availability). For a full list of updates including new features and graduations from alpha to beta, see the release notes.

This release includes a total of 18 enhancements promoted to stable:

Deprecations and removals

As Kubernetes develops and matures, features may be deprecated, removed, or replaced with better ones to improve the project's overall health. See the Kubernetes deprecation and removal policy for more details on this process. Many of these deprecations and removals were announced in the Deprecations and Removals blog post.

Deprecation of the stable Endpoints API

The EndpointSlices API has been stable since v1.21, which effectively replaced the original Endpoints API. While the original Endpoints API was simple and straightforward, it also posed some challenges when scaling to large numbers of network endpoints. The EndpointSlices API has introduced new features such as dual-stack networking, making the original Endpoints API ready for deprecation.

This deprecation affects only those who use the Endpoints API directly from workloads or scripts; these users should migrate to use EndpointSlices instead. There will be a dedicated blog post with more details on the deprecation implications and migration plans.

You can find more in KEP-4974: Deprecate v1.Endpoints.

Removal of kube-proxy version information in node status

Following its deprecation in v1.31, as highlighted in the v1.31 release announcement, the .status.nodeInfo.kubeProxyVersion field for Nodes was removed in v1.33.

This field was set by kubelet, but its value was not consistently accurate. As it has been disabled by default since v1.31, this field has been removed entirely in v1.33.

You can find more in KEP-4004: Deprecate status.nodeInfo.kubeProxyVersion field.

Removal of in-tree gitRepo volume driver

The gitRepo volume type has been deprecated since v1.11, nearly 7 years ago. Since its deprecation, there have been security concerns, including how gitRepo volume types can be exploited to gain remote code execution as root on the nodes. In v1.33, the in-tree driver code is removed.

There are alternatives such as git-sync and initContainers. gitVolumes in the Kubernetes API is not removed, and thus pods with gitRepo volumes will be admitted by kube-apiserver, but kubelets with the feature-gate GitRepoVolumeDriver set to false will not run them and return an appropriate error to the user. This allows users to opt-in to re-enabling the driver for 3 versions to give them enough time to fix workloads.

The feature gate in kubelet and in-tree plugin code is planned to be removed in the v1.39 release.

You can find more in KEP-5040: Remove gitRepo volume driver.

Removal of host network support for Windows pods

Windows Pod networking aimed to achieve feature parity with Linux and provide better cluster density by allowing containers to use the Node’s networking namespace. The original implementation landed as alpha with v1.26, but because it faced unexpected containerd behaviours and alternative solutions were available, the Kubernetes project has decided to withdraw the associated KEP. Support was fully removed in v1.33.

Please note that this does not affect HostProcess containers, which provides host network as well as host level access. The KEP withdrawn in v1.33 was about providing the host network only, which was never stable due to technical limitations with Windows networking logic.

You can find more in KEP-3503: Host network support for Windows pods.

Release notes

Check out the full details of the Kubernetes v1.33 release in our release notes.

Availability

Kubernetes v1.33 is available for download on GitHub or on the Kubernetes download page.

To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using minikube. You can also easily install v1.33 using kubeadm.

Release Team

Kubernetes is only possible with the support, commitment, and hard work of its community. Release Team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.

We would like to thank the entire Release Team for the hours spent hard at work to deliver the Kubernetes v1.33 release to our community. The Release Team's membership ranges from first-time shadows to returning team leads with experience forged over several release cycles. There was a new team structure adopted in this release cycle, which was to combine Release Notes and Docs subteams into a unified subteam of Docs. Thanks to the meticulous effort in organizing the relevant information and resources from the new Docs team, both Release Notes and Docs tracking have seen a smooth and successful transition. Finally, a very special thanks goes out to our release lead, Nina Polshakova, for her support throughout a successful release cycle, her advocacy, her efforts to ensure that everyone could contribute effectively, and her challenges to improve the release process.

Project velocity

The CNCF K8s DevStats project aggregates several interesting data points related to the velocity of Kubernetes and various subprojects. This includes everything from individual contributions, to the number of companies contributing, and illustrates the depth and breadth of effort that goes into evolving this ecosystem.

During the v1.33 release cycle, which spanned 15 weeks from January 13 to April 23, 2025, Kubernetes received contributions from as many as 121 different companies and 570 individuals (as of writing, a few weeks before the release date). In the wider cloud native ecosystem, the figure goes up to 435 companies counting 2400 total contributors. You can find the data source in this dashboard. Compared to the velocity data from previous release, v1.32, we see a similar level of contribution from companies and individuals, indicating strong community interest and engagement.

Note that, “contribution” counts when someone makes a commit, code review, comment, creates an issue or PR, reviews a PR (including blogs and documentation) or comments on issues and PRs. If you are interested in contributing, visit Getting Started on our contributor website.

Check out DevStats to learn more about the overall velocity of the Kubernetes project and community.

Event update

Explore upcoming Kubernetes and cloud native events, including KubeCon + CloudNativeCon, KCD, and other notable conferences worldwide. Stay informed and get involved with the Kubernetes community!

May 2025

June 2025

July 2025

August 2025

You can find the latest KCD details here.

Upcoming release webinar

Join members of the Kubernetes v1.33 Release Team on Friday, May 16th 2025 at 4:00 PM (UTC), to learn about the release highlights of this release, as well as deprecations and removals to help plan for upgrades. For more information and registration, visit the event page on the CNCF Online Programs site.

Get involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

Kubernetes Multicontainer Pods: An Overview

As cloud-native architectures continue to evolve, Kubernetes has become the go-to platform for deploying complex, distributed systems. One of the most powerful yet nuanced design patterns in this ecosystem is the sidecar pattern—a technique that allows developers to extend application functionality without diving deep into source code.

The origins of the sidecar pattern

Think of a sidecar like a trusty companion motorcycle attachment. Historically, IT infrastructures have always used auxiliary services to handle critical tasks. Before containers, we relied on background processes and helper daemons to manage logging, monitoring, and networking. The microservices revolution transformed this approach, making sidecars a structured and intentional architectural choice. With the rise of microservices, the sidecar pattern became more clearly defined, allowing developers to offload specific responsibilities from the main service without altering its code. Service meshes like Istio and Linkerd have popularized sidecar proxies, demonstrating how these companion containers can elegantly handle observability, security, and traffic management in distributed systems.

Kubernetes implementation

In Kubernetes, sidecar containers operate within the same Pod as the main application, enabling communication and resource sharing. Does this sound just like defining multiple containers along each other inside the Pod? It actually does, and this is how sidecar containers had to be implemented before Kubernetes v1.29.0, which introduced native support for sidecars. Sidecar containers can now be defined within a Pod manifest using the spec.initContainers field. What makes it a sidecar container is that you specify it with restartPolicy: Always. You can see an example of this below, which is a partial snippet of the full Kubernetes manifest:

initContainers:
  - name: logshipper
    image: alpine:latest
    restartPolicy: Always
  command: ['sh', '-c', 'tail -F /opt/logs.txt']
    volumeMounts:
    - name: data
        mountPath: /opt

That field name, spec.initContainers may sound confusing. How come when you want to define a sidecar container, you have to put an entry in the spec.initContainers array? spec.initContainers are run to completion just before main application starts, so they’re one-off, whereas sidecars often run in parallel to the main app container. It’s the spec.initContainers with restartPolicy:Always which differs classic init containers from Kubernetes-native sidecar containers and ensures they are always up.

When to embrace (or avoid) sidecars

While the sidecar pattern can be useful in many cases, it is generally not the preferred approach unless the use case justifies it. Adding a sidecar increases complexity, resource consumption, and potential network latency. Instead, simpler alternatives such as built-in libraries or shared infrastructure should be considered first.

Deploy a sidecar when:

  1. You need to extend application functionality without touching the original code
  2. Implementing cross-cutting concerns like logging, monitoring or security
  3. Working with legacy applications requiring modern networking capabilities
  4. Designing microservices that demand independent scaling and updates

Proceed with caution if:

  1. Resource efficiency is your primary concern
  2. Minimal network latency is critical
  3. Simpler alternatives exist
  4. You want to minimize troubleshooting complexity

Four essential multi-container patterns

Init container pattern

The Init container pattern is used to execute (often critical) setup tasks before the main application container starts. Unlike regular containers, init containers run to completion and then terminate, ensuring that preconditions for the main application are met.

Ideal for:

  1. Preparing configurations
  2. Loading secrets
  3. Verifying dependency availability
  4. Running database migrations

The init container ensures your application starts in a predictable, controlled environment without code modifications.

Ambassador pattern

An ambassador container provides Pod-local helper services that expose a simple way to access a network service. Commonly, ambassador containers send network requests on behalf of a an application container and take care of challenges such as service discovery, peer identity verification, or encryption in transit.

Perfect when you need to:

  1. Offload client connectivity concerns
  2. Implement language-agnostic networking features
  3. Add security layers like TLS
  4. Create robust circuit breakers and retry mechanisms

Configuration helper

A configuration helper sidecar provides configuration updates to an application dynamically, ensuring it always has access to the latest settings without disrupting the service. Often the helper needs to provide an initial configuration before the application would be able to start successfully.

Use cases:

  1. Fetching environment variables and secrets
  2. Polling configuration changes
  3. Decoupling configuration management from application logic

Adapter pattern

An adapter (or sometimes façade) container enables interoperability between the main application container and external services. It does this by translating data formats, protocols, or APIs.

Strengths:

  1. Transforming legacy data formats
  2. Bridging communication protocols
  3. Facilitating integration between mismatched services

Wrap-up

While sidecar patterns offer tremendous flexibility, they're not a silver bullet. Each added sidecar introduces complexity, consumes resources, and potentially increases operational overhead. Always evaluate simpler alternatives first. The key is strategic implementation: use sidecars as precision tools to solve specific architectural challenges, not as a default approach. When used correctly, they can improve security, networking, and configuration management in containerized environments. Choose wisely, implement carefully, and let your sidecars elevate your container ecosystem.

Introducing kube-scheduler-simulator

The Kubernetes Scheduler is a crucial control plane component that determines which node a Pod will run on. Thus, anyone utilizing Kubernetes relies on a scheduler.

kube-scheduler-simulator is a simulator for the Kubernetes scheduler, that started as a Google Summer of Code 2021 project developed by me (Kensei Nakada) and later received a lot of contributions. This tool allows users to closely examine the scheduler’s behavior and decisions.

It is useful for casual users who employ scheduling constraints (for example, inter-Pod affinity) and experts who extend the scheduler with custom plugins.

Motivation

The scheduler often appears as a black box, composed of many plugins that each contribute to the scheduling decision-making process from their unique perspectives. Understanding its behavior can be challenging due to the multitude of factors it considers.

Even if a Pod appears to be scheduled correctly in a simple test cluster, it might have been scheduled based on different calculations than expected. This discrepancy could lead to unexpected scheduling outcomes when deployed in a large production environment.

Also, testing a scheduler is a complex challenge. There are countless patterns of operations executed within a real cluster, making it unfeasible to anticipate every scenario with a finite number of tests. More often than not, bugs are discovered only when the scheduler is deployed in an actual cluster. Actually, many bugs are found by users after shipping the release, even in the upstream kube-scheduler.

Having a development or sandbox environment for testing the scheduler — or, indeed, any Kubernetes controllers — is a common practice. However, this approach falls short of capturing all the potential scenarios that might arise in a production cluster because a development cluster is often much smaller with notable differences in workload sizes and scaling dynamics. It never sees the exact same use or exhibits the same behavior as its production counterpart.

The kube-scheduler-simulator aims to solve those problems. It enables users to test their scheduling constraints, scheduler configurations, and custom plugins while checking every detailed part of scheduling decisions. It also allows users to create a simulated cluster environment, where they can test their scheduler with the same resources as their production cluster without affecting actual workloads.

Features of the kube-scheduler-simulator

The kube-scheduler-simulator’s core feature is its ability to expose the scheduler's internal decisions. The scheduler operates based on the scheduling framework, using various plugins at different extension points, filter nodes (Filter phase), score nodes (Score phase), and ultimately determine the best node for the Pod.

The simulator allows users to create Kubernetes resources and observe how each plugin influences the scheduling decisions for Pods. This visibility helps users understand the scheduler’s workings and define appropriate scheduling constraints.

Screenshot of the simulator web frontend that shows the detailed scheduling results per node and per extension point

The simulator web frontend

Inside the simulator, a debuggable scheduler runs instead of the vanilla scheduler. This debuggable scheduler outputs the results of each scheduler plugin at every extension point to the Pod’s annotations like the following manifest shows and the web front end formats/visualizes the scheduling results based on these annotations.

kind: Pod
apiVersion: v1
metadata:
  # The JSONs within these annotations are manually formatted for clarity in the blog post. 
  annotations:
    kube-scheduler-simulator.sigs.k8s.io/bind-result: '{"DefaultBinder":"success"}'
    kube-scheduler-simulator.sigs.k8s.io/filter-result: >-
      {
        "node-jjfg5":{
            "NodeName":"passed",
            "NodeResourcesFit":"passed",
            "NodeUnschedulable":"passed",
            "TaintToleration":"passed"
        },
        "node-mtb5x":{
            "NodeName":"passed",
            "NodeResourcesFit":"passed",
            "NodeUnschedulable":"passed",
            "TaintToleration":"passed"
        }
      }
    kube-scheduler-simulator.sigs.k8s.io/finalscore-result: >-
      {
        "node-jjfg5":{
            "ImageLocality":"0",
            "NodeAffinity":"0",
            "NodeResourcesBalancedAllocation":"52",
            "NodeResourcesFit":"47",
            "TaintToleration":"300",
            "VolumeBinding":"0"
        },
        "node-mtb5x":{
            "ImageLocality":"0",
            "NodeAffinity":"0",
            "NodeResourcesBalancedAllocation":"76",
            "NodeResourcesFit":"73",
            "TaintToleration":"300",
            "VolumeBinding":"0"
        }
      } 
    kube-scheduler-simulator.sigs.k8s.io/permit-result: '{}'
    kube-scheduler-simulator.sigs.k8s.io/permit-result-timeout: '{}'
    kube-scheduler-simulator.sigs.k8s.io/postfilter-result: '{}'
    kube-scheduler-simulator.sigs.k8s.io/prebind-result: '{"VolumeBinding":"success"}'
    kube-scheduler-simulator.sigs.k8s.io/prefilter-result: '{}'
    kube-scheduler-simulator.sigs.k8s.io/prefilter-result-status: >-
      {
        "AzureDiskLimits":"",
        "EBSLimits":"",
        "GCEPDLimits":"",
        "InterPodAffinity":"",
        "NodeAffinity":"",
        "NodePorts":"",
        "NodeResourcesFit":"success",
        "NodeVolumeLimits":"",
        "PodTopologySpread":"",
        "VolumeBinding":"",
        "VolumeRestrictions":"",
        "VolumeZone":""
      }
    kube-scheduler-simulator.sigs.k8s.io/prescore-result: >-
      {
        "InterPodAffinity":"",
        "NodeAffinity":"success",
        "NodeResourcesBalancedAllocation":"success",
        "NodeResourcesFit":"success",
        "PodTopologySpread":"",
        "TaintToleration":"success"
      }
    kube-scheduler-simulator.sigs.k8s.io/reserve-result: '{"VolumeBinding":"success"}'
    kube-scheduler-simulator.sigs.k8s.io/result-history: >-
      [
        {
            "kube-scheduler-simulator.sigs.k8s.io/bind-result":"{\"DefaultBinder\":\"success\"}",
            "kube-scheduler-simulator.sigs.k8s.io/filter-result":"{\"node-jjfg5\":{\"NodeName\":\"passed\",\"NodeResourcesFit\":\"passed\",\"NodeUnschedulable\":\"passed\",\"TaintToleration\":\"passed\"},\"node-mtb5x\":{\"NodeName\":\"passed\",\"NodeResourcesFit\":\"passed\",\"NodeUnschedulable\":\"passed\",\"TaintToleration\":\"passed\"}}",
            "kube-scheduler-simulator.sigs.k8s.io/finalscore-result":"{\"node-jjfg5\":{\"ImageLocality\":\"0\",\"NodeAffinity\":\"0\",\"NodeResourcesBalancedAllocation\":\"52\",\"NodeResourcesFit\":\"47\",\"TaintToleration\":\"300\",\"VolumeBinding\":\"0\"},\"node-mtb5x\":{\"ImageLocality\":\"0\",\"NodeAffinity\":\"0\",\"NodeResourcesBalancedAllocation\":\"76\",\"NodeResourcesFit\":\"73\",\"TaintToleration\":\"300\",\"VolumeBinding\":\"0\"}}",
            "kube-scheduler-simulator.sigs.k8s.io/permit-result":"{}",
            "kube-scheduler-simulator.sigs.k8s.io/permit-result-timeout":"{}",
            "kube-scheduler-simulator.sigs.k8s.io/postfilter-result":"{}",
            "kube-scheduler-simulator.sigs.k8s.io/prebind-result":"{\"VolumeBinding\":\"success\"}",
            "kube-scheduler-simulator.sigs.k8s.io/prefilter-result":"{}",
            "kube-scheduler-simulator.sigs.k8s.io/prefilter-result-status":"{\"AzureDiskLimits\":\"\",\"EBSLimits\":\"\",\"GCEPDLimits\":\"\",\"InterPodAffinity\":\"\",\"NodeAffinity\":\"\",\"NodePorts\":\"\",\"NodeResourcesFit\":\"success\",\"NodeVolumeLimits\":\"\",\"PodTopologySpread\":\"\",\"VolumeBinding\":\"\",\"VolumeRestrictions\":\"\",\"VolumeZone\":\"\"}",
            "kube-scheduler-simulator.sigs.k8s.io/prescore-result":"{\"InterPodAffinity\":\"\",\"NodeAffinity\":\"success\",\"NodeResourcesBalancedAllocation\":\"success\",\"NodeResourcesFit\":\"success\",\"PodTopologySpread\":\"\",\"TaintToleration\":\"success\"}",
            "kube-scheduler-simulator.sigs.k8s.io/reserve-result":"{\"VolumeBinding\":\"success\"}",
            "kube-scheduler-simulator.sigs.k8s.io/score-result":"{\"node-jjfg5\":{\"ImageLocality\":\"0\",\"NodeAffinity\":\"0\",\"NodeResourcesBalancedAllocation\":\"52\",\"NodeResourcesFit\":\"47\",\"TaintToleration\":\"0\",\"VolumeBinding\":\"0\"},\"node-mtb5x\":{\"ImageLocality\":\"0\",\"NodeAffinity\":\"0\",\"NodeResourcesBalancedAllocation\":\"76\",\"NodeResourcesFit\":\"73\",\"TaintToleration\":\"0\",\"VolumeBinding\":\"0\"}}",
            "kube-scheduler-simulator.sigs.k8s.io/selected-node":"node-mtb5x"
        }
      ]
    kube-scheduler-simulator.sigs.k8s.io/score-result: >-
      {
        "node-jjfg5":{
            "ImageLocality":"0",
            "NodeAffinity":"0",
            "NodeResourcesBalancedAllocation":"52",
            "NodeResourcesFit":"47",
            "TaintToleration":"0",
            "VolumeBinding":"0"
        },
        "node-mtb5x":{
            "ImageLocality":"0",
            "NodeAffinity":"0",
            "NodeResourcesBalancedAllocation":"76",
            "NodeResourcesFit":"73",
            "TaintToleration":"0",
            "VolumeBinding":"0"
        }
      }
    kube-scheduler-simulator.sigs.k8s.io/selected-node: node-mtb5x

Users can also integrate their custom plugins or extenders, into the debuggable scheduler and visualize their results.

This debuggable scheduler can also run standalone, for example, on any Kubernetes cluster or in integration tests. This would be useful to custom plugin developers who want to test their plugins or examine their custom scheduler in a real cluster with better debuggability.

The simulator as a better dev cluster

As mentioned earlier, with a limited set of tests, it is impossible to predict every possible scenario in a real-world cluster. Typically, users will test the scheduler in a small, development cluster before deploying it to production, hoping that no issues arise.

The simulator’s importing feature provides a solution by allowing users to simulate deploying a new scheduler version in a production-like environment without impacting their live workloads.

By continuously syncing between a production cluster and the simulator, users can safely test a new scheduler version with the same resources their production cluster handles. Once confident in its performance, they can proceed with the production deployment, reducing the risk of unexpected issues.

What are the use cases?

  1. Cluster users: Examine if scheduling constraints (for example, PodAffinity, PodTopologySpread) work as intended.
  2. Cluster admins: Assess how a cluster would behave with changes to the scheduler configuration.
  3. Scheduler plugin developers: Test a custom scheduler plugins or extenders, use the debuggable scheduler in integration tests or development clusters, or use the syncing feature for testing within a production-like environment.

Getting started

The simulator only requires Docker to be installed on a machine; a Kubernetes cluster is not necessary.

git clone git@github.com:kubernetes-sigs/kube-scheduler-simulator.git
cd kube-scheduler-simulator
make docker_up

You can then access the simulator's web UI at http://localhost:3000.

Visit the kube-scheduler-simulator repository for more details!

Getting involved

The scheduler simulator is developed by Kubernetes SIG Scheduling. Your feedback and contributions are welcome!

Open issues or PRs at the kube-scheduler-simulator repository. Join the conversation on the #sig-scheduling slack channel.

Acknowledgments

The simulator has been maintained by dedicated volunteer engineers, overcoming many challenges to reach its current form.

A big shout out to all the awesome contributors!

Kubernetes v1.33 sneak peek

As the release of Kubernetes v1.33 approaches, the Kubernetes project continues to evolve. Features may be deprecated, removed, or replaced to improve the overall health of the project. This blog post outlines some planned changes for the v1.33 release, which the release team believes you should be aware of to ensure the continued smooth operation of your Kubernetes environment and to keep you up-to-date with the latest developments. The information below is based on the current status of the v1.33 release and is subject to change before the final release date.

The Kubernetes API removal and deprecation process

The Kubernetes project has a well-documented deprecation policy for features. This policy states that stable APIs may only be deprecated when a newer, stable version of that same API is available and that APIs have a minimum lifetime for each stability level. A deprecated API has been marked for removal in a future Kubernetes release. It will continue to function until removal (at least one year from the deprecation), but usage will result in a warning being displayed. Removed APIs are no longer available in the current version, at which point you must migrate to using the replacement.

  • Generally available (GA) or stable API versions may be marked as deprecated but must not be removed within a major version of Kubernetes.

  • Beta or pre-release API versions must be supported for 3 releases after the deprecation.

  • Alpha or experimental API versions may be removed in any release without prior deprecation notice; this process can become a withdrawal in cases where a different implementation for the same feature is already in place.

Whether an API is removed as a result of a feature graduating from beta to stable, or because that API simply did not succeed, all removals comply with this deprecation policy. Whenever an API is removed, migration options are communicated in the deprecation guide.

Deprecations and removals for Kubernetes v1.33

Deprecation of the stable Endpoints API

The EndpointSlices API has been stable since v1.21, which effectively replaced the original Endpoints API. While the original Endpoints API was simple and straightforward, it also posed some challenges when scaling to large numbers of network endpoints. The EndpointSlices API has introduced new features such as dual-stack networking, making the original Endpoints API ready for deprecation.

This deprecation only impacts those who use the Endpoints API directly from workloads or scripts; these users should migrate to use EndpointSlices instead. There will be a dedicated blog post with more details on the deprecation implications and migration plans in the coming weeks.

You can find more in KEP-4974: Deprecate v1.Endpoints.

Removal of kube-proxy version information in node status

Following its deprecation in v1.31, as highlighted in the release announcement, the status.nodeInfo.kubeProxyVersion field will be removed in v1.33. This field was set by kubelet, but its value was not consistently accurate. As it has been disabled by default since v1.31, the v1.33 release will remove this field entirely.

You can find more in KEP-4004: Deprecate status.nodeInfo.kubeProxyVersion field.

Removal of host network support for Windows pods

Windows Pod networking aimed to achieve feature parity with Linux and provide better cluster density by allowing containers to use the Node’s networking namespace. The original implementation landed as alpha with v1.26, but as it faced unexpected containerd behaviours, and alternative solutions were available, the Kubernetes project has decided to withdraw the associated KEP. We're expecting to see support fully removed in v1.33.

You can find more in KEP-3503: Host network support for Windows pods.

As authors of this article, we picked one improvement as the most significant change to call out!

Support for user namespaces within Linux Pods

One of the oldest open KEPs today is KEP-127, Pod security improvement by using Linux User namespaces for Pods. This KEP was first opened in late 2016, and after multiple iterations, had its alpha release in v1.25, initial beta in v1.30 (where it was disabled by default), and now is set to be a part of v1.33, where the feature is available by default.

This support will not impact existing Pods unless you manually specify pod.spec.hostUsers to opt in. As highlighted in the v1.30 sneak peek blog, this is an important milestone for mitigating vulnerabilities.

You can find more in KEP-127: Support User Namespaces in pods.

Selected other Kubernetes v1.33 improvements

The following list of enhancements is likely to be included in the upcoming v1.33 release. This is not a commitment and the release content is subject to change.

In-place resource resize for vertical scaling of Pods

When provisioning a Pod, you can use various resources such as Deployment, StatefulSet, etc. Scalability requirements may need horizontal scaling by updating the Pod replica count, or vertical scaling by updating resources allocated to Pod’s container(s). Before this enhancement, container resources defined in a Pod's spec were immutable, and updating any of these details within a Pod template would trigger Pod replacement.

But what if you could dynamically update the resource configuration for your existing Pods without restarting them?

The KEP-1287 is precisely to allow such in-place Pod updates. It opens up various possibilities of vertical scale-up for stateful processes without any downtime, seamless scale-down when the traffic is low, and even allocating larger resources during startup that is eventually reduced once the initial setup is complete. This was released as alpha in v1.27, and is expected to land as beta in v1.33.

You can find more in KEP-1287: In-Place Update of Pod Resources.

DRA’s ResourceClaim Device Status graduates to beta

The devices field in ResourceClaim status, originally introduced in the v1.32 release, is likely to graduate to beta in v1.33. This field allows drivers to report device status data, improving both observability and troubleshooting capabilities.

For example, reporting the interface name, MAC address, and IP addresses of network interfaces in the status of a ResourceClaim can significantly help in configuring and managing network services, as well as in debugging network related issues. You can read more about ResourceClaim Device Status in Dynamic Resource Allocation: ResourceClaim Device Status document.

Also, you can find more about the planned enhancement in KEP-4817: DRA: Resource Claim Status with possible standardized network interface data.

Ordered namespace deletion

This KEP introduces a more structured deletion process for Kubernetes namespaces to ensure secure and deterministic resource removal. The current semi-random deletion order can create security gaps or unintended behaviour, such as Pods persisting after their associated NetworkPolicies are deleted. By enforcing a structured deletion sequence that respects logical and security dependencies, this approach ensures Pods are removed before other resources. The design improves Kubernetes’s security and reliability by mitigating risks associated with non-deterministic deletions.

You can find more in KEP-5080: Ordered namespace deletion.

Enhancements for indexed job management

These two KEPs are both set to graduate to GA to provide better reliability for job handling, specifically for indexed jobs. KEP-3850 provides per-index backoff limits for indexed jobs, which allows each index to be fully independent of other indexes. Also, KEP-3998 extends Job API to define conditions for making an indexed job as successfully completed when not all indexes are succeeded.

You can find more in KEP-3850: Backoff Limit Per Index For Indexed Jobs and KEP-3998: Job success/completion policy.

Want to know more?

New features and deprecations are also announced in the Kubernetes release notes. We will formally announce what's new in Kubernetes v1.33 as part of the CHANGELOG for that release.

Kubernetes v1.33 release is planned for Wednesday, 23rd April, 2025. Stay tuned for updates!

You can also see the announcements of changes in the release notes for:

Get involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

Fresh Swap Features for Linux Users in Kubernetes 1.32

Swap is a fundamental and an invaluable Linux feature. It offers numerous benefits, such as effectively increasing a node’s memory by swapping out unused data, shielding nodes from system-level memory spikes, preventing Pods from crashing when they hit their memory limits, and much more. As a result, the node special interest group within the Kubernetes project has invested significant effort into supporting swap on Linux nodes.

The 1.22 release introduced Alpha support for configuring swap memory usage for Kubernetes workloads running on Linux on a per-node basis. Later, in release 1.28, support for swap on Linux nodes has graduated to Beta, along with many new improvements. In the following Kubernetes releases more improvements were made, paving the way to GA in the near future.

Prior to version 1.22, Kubernetes did not provide support for swap memory on Linux systems. This was due to the inherent difficulty in guaranteeing and accounting for pod memory utilization when swap memory was involved. As a result, swap support was deemed out of scope in the initial design of Kubernetes, and the default behavior of a kubelet was to fail to start if swap memory was detected on a node.

In version 1.22, the swap feature for Linux was initially introduced in its Alpha stage. This provided Linux users the opportunity to experiment with the swap feature for the first time. However, as an Alpha version, it was not fully developed and only partially worked on limited environments.

In version 1.28 swap support on Linux nodes was promoted to Beta. The Beta version was a drastic leap forward. Not only did it fix a large amount of bugs and made swap work in a stable way, but it also brought cgroup v2 support, introduced a wide variety of tests which include complex scenarios such as node-level pressure, and more. It also brought many exciting new capabilities such as the LimitedSwap behavior which sets an auto-calculated swap limit to containers, OpenMetrics instrumentation support (through the /metrics/resource endpoint) and Summary API for VerticalPodAutoscalers (through the /stats/summary endpoint), and more.

Today we are working on more improvements, paving the way for GA. Currently, the focus is especially towards ensuring node stability, enhanced debug abilities, addressing user feedback, polishing the feature and making it stable. For example, in order to increase stability, containers in high-priority pods cannot access swap which ensures the memory they need is ready to use. In addition, the UnlimitedSwap behavior was removed since it might compromise the node's health. Secret content protection against swapping has also been introduced (see relevant security-risk section for more info).

To conclude, compared to previous releases, the kubelet's support for running with swap enabled is more stable and robust, more user-friendly, and addresses many known shortcomings. That said, the NodeSwap feature introduces basic swap support, and this is just the beginning. In the near future, additional features are planned to enhance swap functionality in various ways, such as improving evictions, extending the API, increasing customizability, and more!

How do I use it?

In order for the kubelet to initialize on a swap-enabled node, the failSwapOn field must be set to false on kubelet's configuration setting, or the deprecated --fail-swap-on command line flag must be deactivated.

It is possible to configure the memorySwap.swapBehavior option to define the manner in which a node utilizes swap memory. For instance,

# this fragment goes into the kubelet's configuration file
memorySwap:
  swapBehavior: LimitedSwap

The currently available configuration options for swapBehavior are:

  • NoSwap (default): Kubernetes workloads cannot use swap. However, processes outside of Kubernetes' scope, like system daemons (such as kubelet itself!) can utilize swap. This behavior is beneficial for protecting the node from system-level memory spikes, but it does not safeguard the workloads themselves from such spikes.
  • LimitedSwap: Kubernetes workloads can utilize swap memory, but with certain limitations. The amount of swap available to a Pod is determined automatically, based on the proportion of the memory requested relative to the node's total memory. Only non-high-priority Pods under the Burstable Quality of Service (QoS) tier are permitted to use swap. For more details, see the section below.

If configuration for memorySwap is not specified, by default the kubelet will apply the same behaviour as the NoSwap setting.

On Linux nodes, Kubernetes only supports running with swap enabled for hosts that use cgroup v2. On cgroup v1 systems, all Kubernetes workloads are not allowed to use swap memory.

Install a swap-enabled cluster with kubeadm

Before you begin

It is required for this demo that the kubeadm tool be installed, following the steps outlined in the kubeadm installation guide. If swap is already enabled on the node, cluster creation may proceed. If swap is not enabled, please refer to the provided instructions for enabling swap.

Create a swap file and turn swap on

I'll demonstrate creating 4GiB of swap, both in the encrypted and unencrypted case.

Setting up unencrypted swap

An unencrypted swap file can be set up as follows.

# Allocate storage and restrict access
fallocate --length 4GiB /swapfile
chmod 600 /swapfile

# Format the swap space
mkswap /swapfile

# Activate the swap space for paging
swapon /swapfile

Setting up encrypted swap

An encrypted swap file can be set up as follows. Bear in mind that this example uses the cryptsetup binary (which is available on most Linux distributions).

# Allocate storage and restrict access
fallocate --length 4GiB /swapfile
chmod 600 /swapfile

# Create an encrypted device backed by the allocated storage
cryptsetup --type plain --cipher aes-xts-plain64 --key-size 256 -d /dev/urandom open /swapfile cryptswap

# Format the swap space
mkswap /dev/mapper/cryptswap

# Activate the swap space for paging
swapon /dev/mapper/cryptswap

Verify that swap is enabled

Swap can be verified to be enabled with both swapon -s command or the free command

> swapon -s
Filename				Type		Size		Used		Priority
/dev/dm-0                               partition	4194300		0		-2
> free -h
               total        used        free      shared  buff/cache   available
Mem:           3.8Gi       1.3Gi       249Mi        25Mi       2.5Gi       2.5Gi
Swap:          4.0Gi          0B       4.0Gi

Enable swap on boot

After setting up swap, to start the swap file at boot time, you either set up a systemd unit to activate (encrypted) swap, or you add a line similar to /swapfile swap swap defaults 0 0 into /etc/fstab.

Set up a Kubernetes cluster that uses swap-enabled nodes

To make things clearer, here is an example kubeadm configuration file kubeadm-config.yaml for the swap enabled cluster.

---
apiVersion: "kubeadm.k8s.io/v1beta3"
kind: InitConfiguration
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false
memorySwap:
  swapBehavior: LimitedSwap

Then create a single-node cluster using kubeadm init --config kubeadm-config.yaml. During init, there is a warning that swap is enabled on the node and in case the kubelet failSwapOn is set to true. We plan to remove this warning in a future release.

How is the swap limit being determined with LimitedSwap?

The configuration of swap memory, including its limitations, presents a significant challenge. Not only is it prone to misconfiguration, but as a system-level property, any misconfiguration could potentially compromise the entire node rather than just a specific workload. To mitigate this risk and ensure the health of the node, we have implemented Swap with automatic configuration of limitations.

With LimitedSwap, Pods that do not fall under the Burstable QoS classification (i.e. BestEffort/Guaranteed QoS Pods) are prohibited from utilizing swap memory. BestEffort QoS Pods exhibit unpredictable memory consumption patterns and lack information regarding their memory usage, making it difficult to determine a safe allocation of swap memory. Conversely, Guaranteed QoS Pods are typically employed for applications that rely on the precise allocation of resources specified by the workload, with memory being immediately available. To maintain the aforementioned security and node health guarantees, these Pods are not permitted to use swap memory when LimitedSwap is in effect. In addition, high-priority pods are not permitted to use swap in order to ensure the memory they consume always residents on disk, hence ready to use.

Prior to detailing the calculation of the swap limit, it is necessary to define the following terms:

  • nodeTotalMemory: The total amount of physical memory available on the node.
  • totalPodsSwapAvailable: The total amount of swap memory on the node that is available for use by Pods (some swap memory may be reserved for system use).
  • containerMemoryRequest: The container's memory request.

Swap limitation is configured as: (containerMemoryRequest / nodeTotalMemory) × totalPodsSwapAvailable

In other words, the amount of swap that a container is able to use is proportionate to its memory request, the node's total physical memory and the total amount of swap memory on the node that is available for use by Pods.

It is important to note that, for containers within Burstable QoS Pods, it is possible to opt-out of swap usage by specifying memory requests that are equal to memory limits. Containers configured in this manner will not have access to swap memory.

How does it work?

There are a number of possible ways that one could envision swap use on a node. When swap is already provisioned and available on a node, the kubelet is able to be configured so that:

  • It can start with swap on.
  • It will direct the Container Runtime Interface to allocate zero swap memory to Kubernetes workloads by default.

Swap configuration on a node is exposed to a cluster admin via the memorySwap in the KubeletConfiguration. As a cluster administrator, you can specify the node's behaviour in the presence of swap memory by setting memorySwap.swapBehavior.

The kubelet employs the CRI (container runtime interface) API, and directs the container runtime to configure specific cgroup v2 parameters (such as memory.swap.max) in a manner that will enable the desired swap configuration for a container. For runtimes that use control groups, the container runtime is then responsible for writing these settings to the container-level cgroup.

How can I monitor swap?

Node and container level metric statistics

Kubelet now collects node and container level metric statistics, which can be accessed at the /metrics/resource (which is used mainly by monitoring tools like Prometheus) and /stats/summary (which is used mainly by Autoscalers) kubelet HTTP endpoints. This allows clients who can directly interrogate the kubelet to monitor swap usage and remaining swap memory when using LimitedSwap. Additionally, a machine_swap_bytes metric has been added to cadvisor to show the total physical swap capacity of the machine. See this page for more info.

Node Feature Discovery (NFD)

Node Feature Discovery is a Kubernetes addon for detecting hardware features and configuration. It can be utilized to discover which nodes are provisioned with swap.

As an example, to figure out which nodes are provisioned with swap, use the following command:

kubectl get nodes -o jsonpath='{range .items[?(@.metadata.labels.feature\.node\.kubernetes\.io/memory-swap)]}{.metadata.name}{"\t"}{.metadata.labels.feature\.node\.kubernetes\.io/memory-swap}{"\n"}{end}'

This will result in an output similar to:

k8s-worker1: true
k8s-worker2: true
k8s-worker3: false

In this example, swap is provisioned on nodes k8s-worker1 and k8s-worker2, but not on k8s-worker3.

Caveats

Having swap available on a system reduces predictability. While swap can enhance performance by making more RAM available, swapping data back to memory is a heavy operation, sometimes slower by many orders of magnitude, which can cause unexpected performance regressions. Furthermore, swap changes a system's behaviour under memory pressure. Enabling swap increases the risk of noisy neighbors, where Pods that frequently use their RAM may cause other Pods to swap. In addition, since swap allows for greater memory usage for workloads in Kubernetes that cannot be predictably accounted for, and due to unexpected packing configurations, the scheduler currently does not account for swap memory usage. This heightens the risk of noisy neighbors.

The performance of a node with swap memory enabled depends on the underlying physical storage. When swap memory is in use, performance will be significantly worse in an I/O operations per second (IOPS) constrained environment, such as a cloud VM with I/O throttling, when compared to faster storage mediums like solid-state drives or NVMe. As swap might cause IO pressure, it is recommended to give a higher IO latency priority to system critical daemons. See the relevant section in the recommended practices section below.

Memory-backed volumes

On Linux nodes, memory-backed volumes (such as secret volume mounts, or emptyDir with medium: Memory) are implemented with a tmpfs filesystem. The contents of such volumes should remain in memory at all times, hence should not be swapped to disk. To ensure the contents of such volumes remain in memory, the noswap tmpfs option is being used.

The Linux kernel officially supports the noswap option from version 6.3 (more info can be found in Linux Kernel Version Requirements). However, the different distributions often choose to backport this mount option to older Linux versions as well.

In order to verify whether the node supports the noswap option, the kubelet will do the following:

  • If the kernel's version is above 6.3 then the noswap option will be assumed to be supported.
  • Otherwise, kubelet would try to mount a dummy tmpfs with the noswap option at startup. If kubelet fails with an error indicating of an unknown option, noswap will be assumed to not be supported, hence will not be used. A kubelet log entry will be emitted to warn the user about memory-backed volumes might swap to disk. If kubelet succeeds, the dummy tmpfs will be deleted and the noswap option will be used.
    • If the noswap option is not supported, kubelet will emit a warning log entry, then continue its execution.

It is deeply encouraged to encrypt the swap space. See the section above with an example for setting unencrypted swap. However, handling encrypted swap is not within the scope of kubelet; rather, it is a general OS configuration concern and should be addressed at that level. It is the administrator's responsibility to provision encrypted swap to mitigate this risk.

Good practice for using swap in a Kubernetes cluster

Disable swap for system-critical daemons

During the testing phase and based on user feedback, it was observed that the performance of system-critical daemons and services might degrade. This implies that system daemons, including the kubelet, could operate slower than usual. If this issue is encountered, it is advisable to configure the cgroup of the system slice to prevent swapping (i.e., set memory.swap.max=0).

Protect system-critical daemons for I/O latency

Swap can increase the I/O load on a node. When memory pressure causes the kernel to rapidly swap pages in and out, system-critical daemons and services that rely on I/O operations may experience performance degradation.

To mitigate this, it is recommended for systemd users to prioritize the system slice in terms of I/O latency. For non-systemd users, setting up a dedicated cgroup for system daemons and processes and prioritizing I/O latency in the same way is advised. This can be achieved by setting io.latency for the system slice, thereby granting it higher I/O priority. See cgroup's documentation for more info.

Swap and control plane nodes

The Kubernetes project recommends running control plane nodes without any swap space configured. The control plane primarily hosts Guaranteed QoS Pods, so swap can generally be disabled. The main concern is that swapping critical services on the control plane could negatively impact performance.

Use of a dedicated disk for swap

It is recommended to use a separate, encrypted disk for the swap partition. If swap resides on a partition or the root filesystem, workloads may interfere with system processes that need to write to disk. When they share the same disk, processes can overwhelm swap, disrupting the I/O of kubelet, container runtime, and systemd, which would impact other workloads. Since swap space is located on a disk, it is crucial to ensure the disk is fast enough for the intended use cases. Alternatively, one can configure I/O priorities between different mapped areas of a single backing device.

Looking ahead

As you can see, the swap feature was dramatically improved lately, paving the way for a feature GA. However, this is just the beginning. It's a foundational implementation marking the beginning of enhanced swap functionality.

In the near future, additional features are planned to further improve swap capabilities, including better eviction mechanisms, extended API support, increased customizability, better debug abilities and more!

How can I learn more?

You can review the current documentation for using swap with Kubernetes.

For more information, please see KEP-2400 and its design proposal.

How do I get involved?

Your feedback is always welcome! SIG Node meets regularly and can be reached via Slack (channel #sig-node), or the SIG's mailing list. A Slack channel dedicated to swap is also available at #sig-node-swap.

Feel free to reach out to me, Itamar Holder (@iholder101 on Slack and GitHub) if you'd like to help or ask further questions.

Ingress-nginx CVE-2025-1974: What You Need to Know

Today, the ingress-nginx maintainers have released patches for a batch of critical vulnerabilities that could make it easy for attackers to take over your Kubernetes cluster: ingress-nginx v1.12.1 and ingress-nginx v1.11.5. If you are among the over 40% of Kubernetes administrators using ingress-nginx, you should take action immediately to protect your users and data.

Background

Ingress is the traditional Kubernetes feature for exposing your workload Pods to the world so that they can be useful. In an implementation-agnostic way, Kubernetes users can define how their applications should be made available on the network. Then, an ingress controller uses that definition to set up local or cloud resources as required for the user’s particular situation and needs.

Many different ingress controllers are available, to suit users of different cloud providers or brands of load balancers. Ingress-nginx is a software-only ingress controller provided by the Kubernetes project. Because of its versatility and ease of use, ingress-nginx is quite popular: it is deployed in over 40% of Kubernetes clusters!

Ingress-nginx translates the requirements from Ingress objects into configuration for nginx, a powerful open source webserver daemon. Then, nginx uses that configuration to accept and route requests to the various applications running within a Kubernetes cluster. Proper handling of these nginx configuration parameters is crucial, because ingress-nginx needs to allow users significant flexibility while preventing them from accidentally or intentionally tricking nginx into doing things it shouldn’t.

Vulnerabilities Patched Today

Four of today’s ingress-nginx vulnerabilities are improvements to how ingress-nginx handles particular bits of nginx config. Without these fixes, a specially-crafted Ingress object can cause nginx to misbehave in various ways, including revealing the values of Secrets that are accessible to ingress-nginx. By default, ingress-nginx has access to all Secrets cluster-wide, so this can often lead to complete cluster takeover by any user or entity that has permission to create an Ingress.

The most serious of today’s vulnerabilities, CVE-2025-1974, rated 9.8 CVSS, allows anything on the Pod network to exploit configuration injection vulnerabilities via the Validating Admission Controller feature of ingress-nginx. This makes such vulnerabilities far more dangerous: ordinarily one would need to be able to create an Ingress object in the cluster, which is a fairly privileged action. When combined with today’s other vulnerabilities, CVE-2025-1974 means that anything on the Pod network has a good chance of taking over your Kubernetes cluster, with no credentials or administrative access required. In many common scenarios, the Pod network is accessible to all workloads in your cloud VPC, or even anyone connected to your corporate network! This is a very serious situation.

Today, we have released ingress-nginx v1.12.1 and ingress-nginx v1.11.5, which have fixes for all five of these vulnerabilities.

Your next steps

First, determine if your clusters are using ingress-nginx. In most cases, you can check this by running kubectl get pods --all-namespaces --selector app.kubernetes.io/name=ingress-nginx with cluster administrator permissions.

If you are using ingress-nginx, make a plan to remediate these vulnerabilities immediately.

The best and easiest remedy is to upgrade to the new patch release of ingress-nginx. All five of today’s vulnerabilities are fixed by installing today’s patches.

If you can’t upgrade right away, you can significantly reduce your risk by turning off the Validating Admission Controller feature of ingress-nginx.

  • If you have installed ingress-nginx using Helm
    • Reinstall, setting the Helm value controller.admissionWebhooks.enabled=false
  • If you have installed ingress-nginx manually
    • delete the ValidatingWebhookconfiguration called ingress-nginx-admission
    • edit the ingress-nginx-controller Deployment or Daemonset, removing --validating-webhook from the controller container’s argument list

If you turn off the Validating Admission Controller feature as a mitigation for CVE-2025-1974, remember to turn it back on after you upgrade. This feature provides important quality of life improvements for your users, warning them about incorrect Ingress configurations before they can take effect.

Conclusion, thanks, and further reading

The ingress-nginx vulnerabilities announced today, including CVE-2025-1974, present a serious risk to many Kubernetes users and their data. If you use ingress-nginx, you should take action immediately to keep yourself safe.

Thanks go out to Nir Ohfeld, Sagi Tzadik, Ronen Shustin, and Hillai Ben-Sasson from Wiz for responsibly disclosing these vulnerabilities, and for working with the Kubernetes SRC members and ingress-nginx maintainers (Marco Ebert and James Strong) to ensure we fixed them effectively.

For further information about the maintenance and future of ingress-nginx, please see this GitHub issue and/or attend James and Marco’s KubeCon/CloudNativeCon EU 2025 presentation.

For further information about the specific vulnerabilities discussed in this article, please see the appropriate GitHub issue: CVE-2025-24513, CVE-2025-24514, CVE-2025-1097, CVE-2025-1098, or CVE-2025-1974

This blog post was revised in May 2025 to update the hyperlinks.

Introducing JobSet

Authors: Daniel Vega-Myhre (Google), Abdullah Gharaibeh (Google), Kevin Hannon (Red Hat)

In this article, we introduce JobSet, an open source API for representing distributed jobs. The goal of JobSet is to provide a unified API for distributed ML training and HPC workloads on Kubernetes.

Why JobSet?

The Kubernetes community’s recent enhancements to the batch ecosystem on Kubernetes has attracted ML engineers who have found it to be a natural fit for the requirements of running distributed training workloads.

Large ML models (particularly LLMs) which cannot fit into the memory of the GPU or TPU chips on a single host are often distributed across tens of thousands of accelerator chips, which in turn may span thousands of hosts.

As such, the model training code is often containerized and executed simultaneously on all these hosts, performing distributed computations which often shard both the model parameters and/or the training dataset across the target accelerator chips, using communication collective primitives like all-gather and all-reduce to perform distributed computations and synchronize gradients between hosts.

These workload characteristics make Kubernetes a great fit for this type of workload, as efficiently scheduling and managing the lifecycle of containerized applications across a cluster of compute resources is an area where it shines.

It is also very extensible, allowing developers to define their own Kubernetes APIs, objects, and controllers which manage the behavior and life cycle of these objects, allowing engineers to develop custom distributed training orchestration solutions to fit their needs.

However, as distributed ML training techniques continue to evolve, existing Kubernetes primitives do not adequately model them alone anymore.

Furthermore, the landscape of Kubernetes distributed training orchestration APIs has become fragmented, and each of the existing solutions in this fragmented landscape has certain limitations that make it non-optimal for distributed ML training.

For example, the KubeFlow training operator defines custom APIs for different frameworks (e.g. PyTorchJob, TFJob, MPIJob, etc.); however, each of these job types are in fact a solution fit specifically to the target framework, each with different semantics and behavior.

On the other hand, the Job API fixed many gaps for running batch workloads, including Indexed completion mode, higher scalability, Pod failure policies and Pod backoff policy to mention a few of the most recent enhancements. However, running ML training and HPC workloads using the upstream Job API requires extra orchestration to fill the following gaps:

Multi-template Pods : Most HPC or ML training jobs include more than one type of Pods. The different Pods are part of the same workload, but they need to run a different container, request different resources or have different failure policies. A common example is the driver-worker pattern.

Job groups : Large scale training workloads span multiple network topologies, running across multiple racks for example. Such workloads are network latency sensitive, and aim to localize communication and minimize traffic crossing the higher-latency network links. To facilitate this, the workload needs to be split into groups of Pods each assigned to a network topology.

Inter-Pod communication : Create and manage the resources (e.g. headless Services) necessary to establish communication between the Pods of a job.

Startup sequencing : Some jobs require a specific start sequence of pods; sometimes the driver is expected to start first (like Ray or Spark), in other cases the workers are expected to be ready before starting the driver (like MPI).

JobSet aims to address those gaps using the Job API as a building block to build a richer API for large-scale distributed HPC and ML use cases.

How JobSet Works

JobSet models a distributed batch workload as a group of Kubernetes Jobs. This allows a user to easily specify different pod templates for different distinct groups of pods (e.g. a leader, workers, parameter servers, etc.).

It uses the abstraction of a ReplicatedJob to manage child Jobs, where a ReplicatedJob is essentially a Job Template with some desired number of Job replicas specified. This provides a declarative way to easily create identical child-jobs to run on different islands of accelerators, without resorting to scripting or Helm charts to generate many versions of the same job but with different names.

JobSet Architecture

Some other key JobSet features which address the problems described above include:

Replicated Jobs : In modern data centers, hardware accelerators like GPUs and TPUs allocated in islands of homogenous accelerators connected via a specialized, high bandwidth network links. For example, a user might provision nodes containing a group of hosts co-located on a rack, each with H100 GPUs, where GPU chips within each host are connected via NVLink, with a NVLink Switch connecting the multiple NVLinks. TPU Pods are another example of this: TPU ViperLitePods consist of 64 hosts, each with 4 TPU v5e chips attached, all connected via ICI mesh. When running a distributed training job across multiple of these islands, we often want to partition the workload into a group of smaller identical jobs, 1 per island, where each pod primarily communicates with the pods within the same island to do segments of distributed computation, and keeping the gradient synchronization over DCN (data center network, which is lower bandwidth than ICI) to a bare minimum.

Automatic headless service creation, configuration, and lifecycle management : Pod-to-pod communication via pod hostname is enabled by default, with automatic configuration and lifecycle management of the headless service enabling this.

Configurable success policies : JobSet has configurable success policies which target specific ReplicatedJobs, with operators to target “Any” or “All” of their child jobs. For example, you can configure the JobSet to be marked complete if and only if all pods that are part of the “worker” ReplicatedJob are completed.

Configurable failure policies : JobSet has configurable failure policies which allow the user to specify a maximum number of times the JobSet should be restarted in the event of a failure. If any job is marked failed, the entire JobSet will be recreated, allowing the workload to resume from the last checkpoint. When no failure policy is specified, if any job fails, the JobSet simply fails.

Exclusive placement per topology domain : JobSet allows users to express that child jobs have 1:1 exclusive assignment to a topology domain, typically an accelerator island like a rack. For example, if the JobSet creates two child jobs, then this feature will enforce that the pods of each child job will be co-located on the same island, and that only one child job is allowed to schedule per island. This is useful for scenarios where we want to use a distributed data parallel (DDP) training strategy to train a model using multiple islands of compute resources (GPU racks or TPU slices), running 1 model replica in each accelerator island, ensuring the forward and backward passes themselves occur within a single model replica occurs over the high bandwidth interconnect linking the accelerators chips within the island, and only the gradient synchronization between model replicas occurs across accelerator islands over the lower bandwidth data center network.

Integration with Kueue : Users can submit JobSets via Kueue to oversubscribe their clusters, queue workloads to run as capacity becomes available, prevent partial scheduling and deadlocks, enable multi-tenancy, and more.

Example use case

Distributed ML training on multiple TPU slices with Jax

The following example is a JobSet spec for running a TPU Multislice workload on 4 TPU v5e slices. To learn more about TPU concepts and terminology, please refer to these docs.

This example uses Jax, an ML framework with native support for Just-In-Time (JIT) compilation targeting TPU chips via OpenXLA. However, you can also use PyTorch/XLA to do ML training on TPUs.

This example makes use of several JobSet features (both explicitly and implicitly) to support the unique scheduling requirements of TPU multislice training out-of-the-box with very little configuration required by the user.

# Run a simple Jax workload on 
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice
  annotations:
    # Give each child Job exclusive usage of a TPU slice 
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 3
  replicatedJobs:
  - name: workers
    replicas: 4 # Set to number of TPU slices
    template:
      spec:
        parallelism: 2 # Set to number of VMs per TPU slice
        completions: 2 # Set to number of VMs per TPU slice
        backoffLimit: 0
        template:
          spec:
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
            containers:
            - name: jax-tpu
              image: python:3.8
              ports:
              - containerPort: 8471
              - containerPort: 8080
              securityContext:
                privileged: true
              command:
              - bash
              - -c
              - |
                pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                python -c 'import jax; print("Global device count:", jax.device_count())'
                sleep 60
              resources:
                limits:
                  google.com/tpu: 4

Future work and getting involved

We have a number of features on the JobSet roadmap planned for development this year, which can be found in the JobSet roadmap.

Please feel free to reach out with feedback of any kind. We’re also open to additional contributors, whether it is to fix or report bugs, or help add new features or write documentation.

You can get in touch with us via our repo, mailing list or on Slack.

Last but not least, thanks to all our contributors who made this project possible!

Spotlight on SIG Apps

In our ongoing SIG Spotlight series, we dive into the heart of the Kubernetes project by talking to the leaders of its various Special Interest Groups (SIGs). This time, we focus on SIG Apps, the group responsible for everything related to developing, deploying, and operating applications on Kubernetes. Sandipan Panda (DevZero) had the opportunity to interview Maciej Szulik (Defense Unicorns) and Janet Kuo (Google), the chairs and tech leads of SIG Apps. They shared their experiences, challenges, and visions for the future of application management within the Kubernetes ecosystem.

Introductions

Sandipan: Hello, could you start by telling us a bit about yourself, your role, and your journey within the Kubernetes community that led to your current roles in SIG Apps?

Maciej: Hey, my name is Maciej, and I’m one of the leads for SIG Apps. Aside from this role, you can also find me helping SIG CLI and also being one of the Steering Committee members. I’ve been contributing to Kubernetes since late 2014 in various areas, including controllers, apiserver, and kubectl.

Janet: Certainly! I'm Janet, a Staff Software Engineer at Google, and I've been deeply involved with the Kubernetes project since its early days, even before the 1.0 launch in 2015. It's been an amazing journey!

My current role within the Kubernetes community is one of the chairs and tech leads of SIG Apps. My journey with SIG Apps started organically. I started with building the Deployment API and adding rolling update functionalities. I naturally gravitated towards SIG Apps and became increasingly involved. Over time, I took on more responsibilities, culminating in my current leadership roles.

About SIG Apps

All following answers were jointly provided by Maciej and Janet.

Sandipan: For those unfamiliar, could you provide an overview of SIG Apps' mission and objectives? What key problems does it aim to solve within the Kubernetes ecosystem?

As described in our charter, we cover a broad area related to developing, deploying, and operating applications on Kubernetes. That, in short, means we’re open to each and everyone showing up at our bi-weekly meetings and discussing the ups and downs of writing and deploying various applications on Kubernetes.

Sandipan: What are some of the most significant projects or initiatives currently being undertaken by SIG Apps?

At this point in time, the main factors driving the development of our controllers are the challenges coming from running various AI-related workloads. It’s worth giving credit here to two working groups we’ve sponsored over the past years:

  1. The Batch Working Group, which is looking at running HPC, AI/ML, and data analytics jobs on top of Kubernetes.
  2. The Serving Working Group, which is focusing on hardware-accelerated AI/ML inference.

Best practices and challenges

Sandipan: SIG Apps plays a crucial role in developing application management best practices for Kubernetes. Can you share some of these best practices and how they help improve application lifecycle management?

  1. Implementing health checks and readiness probes ensures that your applications are healthy and ready to serve traffic, leading to improved reliability and uptime. The above, combined with comprehensive logging, monitoring, and tracing solutions, will provide insights into your application's behavior, enabling you to identify and resolve issues quickly.

  2. Auto-scale your application based on resource utilization or custom metrics, optimizing resource usage and ensuring your application can handle varying loads.

  3. Use Deployment for stateless applications, StatefulSet for stateful applications, Job and CronJob for batch workloads, and DaemonSet for running a daemon on each node. Use Operators and CRDs to extend the Kubernetes API to automate the deployment, management, and lifecycle of complex applications, making them easier to operate and reducing manual intervention.

Sandipan: What are some of the common challenges SIG Apps faces, and how do you address them?

The biggest challenge we’re facing all the time is the need to reject a lot of features, ideas, and improvements. This requires a lot of discipline and patience to be able to explain the reasons behind those decisions.

Sandipan: How has the evolution of Kubernetes influenced the work of SIG Apps? Are there any recent changes or upcoming features in Kubernetes that you find particularly relevant or beneficial for SIG Apps?

The main benefit for both us and the whole community around SIG Apps is the ability to extend kubernetes with Custom Resource Definitions and the fact that users can build their own custom controllers leveraging the built-in ones to achieve whatever sophisticated use cases they might have and we, as the core maintainers, haven’t considered or weren’t able to efficiently resolve inside Kubernetes.

Contributing to SIG Apps

Sandipan: What opportunities are available for new contributors who want to get involved with SIG Apps, and what advice would you give them?

We get the question, "What good first issue might you recommend we start with?" a lot :-) But unfortunately, there’s no easy answer to it. We always tell everyone that the best option to start contributing to core controllers is to find one you are willing to spend some time with. Read through the code, then try running unit tests and integration tests focusing on that controller. Once you grasp the general idea, try breaking it and the tests again to verify your breakage. Once you start feeling confident you understand that particular controller, you may want to search through open issues affecting that controller and either provide suggestions, explaining the problem users have, or maybe attempt your first fix.

Like we said, there are no shortcuts on that road; you need to spend the time with the codebase to understand all the edge cases we’ve slowly built up to get to the point where we are. Once you’re successful with one controller, you’ll need to repeat that same process with others all over again.

Sandipan: How does SIG Apps gather feedback from the community, and how is this feedback integrated into your work?

We always encourage everyone to show up and present their problems and solutions during our bi-weekly meetings. As long as you’re solving an interesting problem on top of Kubernetes and you can provide valuable feedback about any of the core controllers, we’re always happy to hear from everyone.

Looking ahead

Sandipan: Looking ahead, what are the key focus areas or upcoming trends in application management within Kubernetes that SIG Apps is excited about? How is the SIG adapting to these trends?

Definitely the current AI hype is the major driving factor; as mentioned above, we have two working groups, each covering a different aspect of it.

Sandipan: What are some of your favorite things about this SIG?

Without a doubt, the people that participate in our meetings and on Slack, who tirelessly help triage issues, pull requests and invest a lot of their time (very frequently their private time) into making kubernetes great!


SIG Apps is an essential part of the Kubernetes community, helping to shape how applications are deployed and managed at scale. From its work on improving Kubernetes' workload APIs to driving innovation in AI/ML application management, SIG Apps is continually adapting to meet the needs of modern application developers and operators. Whether you’re a new contributor or an experienced developer, there’s always an opportunity to get involved and make an impact.

If you’re interested in learning more or contributing to SIG Apps, be sure to check out their SIG README and join their bi-weekly meetings.

Spotlight on SIG etcd

In this SIG etcd spotlight we talked with James Blair, Marek Siarkowicz, Wenjia Zhang, and Benjamin Wang to learn a bit more about this Kubernetes Special Interest Group.

Introducing SIG etcd

Frederico: Hello, thank you for the time! Let’s start with some introductions, could you tell us a bit about yourself, your role and how you got involved in Kubernetes.

Benjamin: Hello, I am Benjamin. I am a SIG etcd Tech Lead and one of the etcd maintainers. I work for VMware, which is part of the Broadcom group. I got involved in Kubernetes & etcd & CSI (Container Storage Interface) because of work and also a big passion for open source. I have been working on Kubernetes & etcd (and also CSI) since 2020.

James: Hey team, I’m James, a co-chair for SIG etcd and etcd maintainer. I work at Red Hat as a Specialist Architect helping people adopt cloud native technology. I got involved with the Kubernetes ecosystem in 2019. Around the end of 2022 I noticed how the etcd community and project needed help so started contributing as often as I could. There is a saying in our community that "you come for the technology, and stay for the people": for me this is absolutely real, it’s been a wonderful journey so far and I’m excited to support our community moving forward.

Marek: Hey everyone, I'm Marek, the SIG etcd lead. At Google, I lead the GKE etcd team, ensuring a stable and reliable experience for all GKE users. My Kubernetes journey began with SIG Instrumentation, where I created and led the Kubernetes Structured Logging effort.
I'm still the main project lead for Kubernetes Metrics Server, providing crucial signals for autoscaling in Kubernetes. I started working on etcd 3 years ago, right around the 3.5 release. We faced some challenges, but I'm thrilled to see etcd now the most scalable and reliable it's ever been, with the highest contribution numbers in the project's history. I'm passionate about distributed systems, extreme programming, and testing.

Wenjia: Hi there, my name is Wenjia, I am the co-chair of SIG etcd and one of the etcd maintainers. I work at Google as an Engineering Manager, working on GKE (Google Kubernetes Engine) and GDC (Google Distributed Cloud). I have been working in the area of open source Kubernetes and etcd since the Kubernetes v1.10 and etcd v3.1 releases. I got involved in Kubernetes because of my job, but what keeps me in the space is the charm of the container orchestration technology, and more importantly, the awesome open source community.

Becoming a Kubernetes Special Interest Group (SIG)

Frederico: Excellent, thank you. I'd like to start with the origin of the SIG itself: SIG etcd is a very recent SIG, could you quickly go through the history and reasons behind its creation?

Marek: Absolutely! SIG etcd was formed because etcd is a critical component of Kubernetes, serving as its data store. However, etcd was facing challenges like maintainer turnover and reliability issues. Creating a dedicated SIG allowed us to focus on addressing these problems, improving development and maintenance processes, and ensuring etcd evolves in sync with the cloud-native landscape.

Frederico: And has becoming a SIG worked out as expected? Better yet, are the motivations you just described being addressed, and to what extent?

Marek: It's been a positive change overall. Becoming a SIG has brought more structure and transparency to etcd's development. We've adopted Kubernetes processes like KEPs (Kubernetes Enhancement Proposals and PRRs (Production Readiness Reviews, which has improved our feature development and release cycle.

Frederico: On top of those, what would you single out as the major benefit that has resulted from becoming a SIG?

Marek: The biggest benefits for me was adopting Kubernetes testing infrastructure, tools like Prow and TestGrid. For large projects like etcd there is just no comparison to the default GitHub tooling. Having known, easy to use, clear tools is a major boost to the etcd as it makes it much easier for Kubernetes contributors to also help etcd.

Wenjia: Totally agree, while challenges remain, the SIG structure provides a solid foundation for addressing them and ensuring etcd's continued success as a critical component of the Kubernetes ecosystem.

The positive impact on the community is another crucial aspect of SIG etcd's success that I’d like to highlight. The Kubernetes SIG structure has created a welcoming environment for etcd contributors, leading to increased participation from the broader Kubernetes community. We have had greater collaboration with other SIGs like SIG API Machinery, SIG Scalability, SIG Testing, SIG Cluster Lifecycle, etc.

This collaboration helps ensure etcd's development aligns with the needs of the wider Kubernetes ecosystem. The formation of the etcd Operator Working Group under the joint effort between SIG etcd and SIG Cluster Lifecycle exemplifies this successful collaboration, demonstrating a shared commitment to improving etcd's operational aspects within Kubernetes.

Frederico: Since you mentioned collaboration, have you seen changes in terms of contributors and community involvement in recent months?

James: Yes -- as showing in our unique PR author data we recently hit an all time high in March and are trending in a positive direction:

Unique PR author data stats

Additionally, looking at our overall contributions across all etcd project repositories we are also observing a positive trend showing a resurgence in etcd project activity:

Overall contributions stats

The road ahead

Frederico: That's quite telling, thank you. In terms of the near future, what are the current priorities for SIG etcd?

Marek: Reliability is always top of mind -– we need to make sure etcd is rock-solid. We're also working on making etcd easier to use and manage for operators. And we have our sights set on making etcd a viable standalone solution for infrastructure management, not just for Kubernetes. Oh, and of course, scaling -– we need to ensure etcd can handle the growing demands of the cloud-native world.

Benjamin: I agree that reliability should always be our top guiding principle. We need to ensure not only correctness but also compatibility. Additionally, we should continuously strive to improve the understandability and maintainability of etcd. Our focus should be on addressing the pain points that the community cares about the most.

Frederico: Are there any specific SIGs that you work closely with?

Marek: SIG API Machinery, for sure – they own the structure of the data etcd stores, so we're constantly working together. And SIG Cluster Lifecycle – etcd is a key part of Kubernetes clusters, so we collaborate on the newly created etcd operator Working group.

Wenjia: Other than SIG API Machinery and SIG Cluster Lifecycle that Marek mentioned above, SIG Scalability and SIG Testing is another group that we work closely with.

Frederico: In a more general sense, how would you list the key challenges for SIG etcd in the evolving cloud native landscape?

Marek: Well, reliability is always a challenge when you're dealing with critical data. The cloud-native world is evolving so fast that scaling to meet those demands is a constant effort.

Getting involved

Frederico: We're almost at the end of our conversation, but for those interested in in etcd, how can they get involved?

Marek: We'd love to have them! The best way to start is to join our SIG etcd meetings, follow discussions on the etcd-dev mailing list, and check out our GitHub issues. We're always looking for people to review proposals, test code, and contribute to documentation.

Wenjia: I love this question 😀 . There are numerous ways for people interested in contributing to SIG etcd to get involved and make a difference. Here are some key areas where you can help:

Code Contributions:

  • Bug Fixes: Tackle existing issues in the etcd codebase. Start with issues labeled "good first issue" or "help wanted" to find tasks that are suitable for newcomers.
  • Feature Development: Contribute to the development of new features and enhancements. Check the etcd roadmap and discussions to see what's being planned and where your skills might fit in.
  • Testing and Code Reviews: Help ensure the quality of etcd by writing tests, reviewing code changes, and providing feedback.
  • Documentation: Improve etcd's documentation by adding new content, clarifying existing information, or fixing errors. Clear and comprehensive documentation is essential for users and contributors.
  • Community Support: Answer questions on forums, mailing lists, or Slack channels. Helping others understand and use etcd is a valuable contribution.

Getting Started:

  • Join the community: Start by joining the etcd community on Slack, attending SIG meetings, and following the mailing lists. This will help you get familiar with the project, its processes, and the people involved.
  • Find a mentor: If you're new to open source or etcd, consider finding a mentor who can guide you and provide support. Stay tuned! Our first cohort of mentorship program was very successful. We will have a new round of mentorship program coming up.
  • Start small: Don't be afraid to start with small contributions. Even fixing a typo in the documentation or submitting a simple bug fix can be a great way to get involved.

By contributing to etcd, you'll not only be helping to improve a critical piece of the cloud-native ecosystem but also gaining valuable experience and skills. So, jump in and start contributing!

Frederico: Excellent, thank you. Lastly, one piece of advice that you'd like to give to other newly formed SIGs?

Marek: Absolutely! My advice would be to embrace the established processes of the larger community, prioritize collaboration with other SIGs, and focus on building a strong community.

Wenjia: Here are some tips I myself found very helpful in my OSS journey:

  • Be patient: Open source development can take time. Don't get discouraged if your contributions aren't accepted immediately or if you encounter challenges.
  • Be respectful: The etcd community values collaboration and respect. Be mindful of others' opinions and work together to achieve common goals.
  • Have fun: Contributing to open source should be enjoyable. Find areas that interest you and contribute in ways that you find fulfilling.

Frederico: A great way to end this spotlight, thank you all!


For more information and resources, please take a look at :

  1. etcd website: https://etcd.io/
  2. etcd GitHub repository: https://github.com/etcd-io/etcd
  3. etcd community: https://etcd.io/community/

NFTables mode for kube-proxy

A new nftables mode for kube-proxy was introduced as an alpha feature in Kubernetes 1.29. Currently in beta, it is expected to be GA as of 1.33. The new mode fixes long-standing performance problems with the iptables mode and all users running on systems with reasonably-recent kernels are encouraged to try it out. (For compatibility reasons, even once nftables becomes GA, iptables will still be the default.)

Why nftables? Part 1: data plane latency

The iptables API was designed for implementing simple firewalls, and has problems scaling up to support Service proxying in a large Kubernetes cluster with tens of thousands of Services.

In general, the ruleset generated by kube-proxy in iptables mode has a number of iptables rules proportional to the sum of the number of Services and the total number of endpoints. In particular, at the top level of the ruleset, there is one rule to test each possible Service IP (and port) that a packet might be addressed to:

# If the packet is addressed to 172.30.0.41:80, then jump to the chain
# KUBE-SVC-XPGD46QRK7WJZT7O for further processing
-A KUBE-SERVICES -m comment --comment "namespace1/service1:p80 cluster IP" -m tcp -p tcp -d 172.30.0.41 --dport 80 -j KUBE-SVC-XPGD46QRK7WJZT7O

# If the packet is addressed to 172.30.0.42:443, then...
-A KUBE-SERVICES -m comment --comment "namespace2/service2:p443 cluster IP" -m tcp -p tcp -d 172.30.0.42 --dport 443 -j KUBE-SVC-GNZBNJ2PO5MGZ6GT

# etc...
-A KUBE-SERVICES -m comment --comment "namespace3/service3:p80 cluster IP" -m tcp -p tcp -d 172.30.0.43 --dport 80 -j KUBE-SVC-X27LE4BHSL4DOUIK

This means that when a packet comes in, the time it takes the kernel to check it against all of the Service rules is O(n) in the number of Services. As the number of Services increases, both the average and the worst-case latency for the first packet of a new connection increases (with the difference between best-case, average, and worst-case being mostly determined by whether a given Service IP address appears earlier or later in the KUBE-SERVICES chain).

kube-proxy iptables first packet latency, at various percentiles, in clusters of various sizes

By contrast, with nftables, the normal way to write a ruleset like this is to have a single rule, using a "verdict map" to do the dispatch:

table ip kube-proxy {

        # The service-ips verdict map indicates the action to take for each matching packet.
	map service-ips {
		type ipv4_addr . inet_proto . inet_service : verdict
		comment "ClusterIP, ExternalIP and LoadBalancer IP traffic"
		elements = { 172.30.0.41 . tcp . 80 : goto service-ULMVA6XW-namespace1/service1/tcp/p80,
                             172.30.0.42 . tcp . 443 : goto service-42NFTM6N-namespace2/service2/tcp/p443,
                             172.30.0.43 . tcp . 80 : goto service-4AT6LBPK-namespace3/service3/tcp/p80,
                             ... }
        }

        # Now we just need a single rule to process all packets matching an
        # element in the map. (This rule says, "construct a tuple from the
        # destination IP address, layer 4 protocol, and destination port; look
        # that tuple up in "service-ips"; and if there's a match, execute the
        # associated verdict.)
	chain services {
		ip daddr . meta l4proto . th dport vmap @service-ips
	}

        ...
}

Since there's only a single rule, with a roughly O(1) map lookup, packet processing time is more or less constant regardless of cluster size, and the best/average/worst cases are very similar:

kube-proxy nftables first packet latency, at various percentiles, in clusters of various sizes

But note the huge difference in the vertical scale between the iptables and nftables graphs! In the clusters with 5000 and 10,000 Services, the p50 (average) latency for nftables is about the same as the p01 (approximately best-case) latency for iptables. In the 30,000 Service cluster, the p99 (approximately worst-case) latency for nftables manages to beat out the p01 latency for iptables by a few microseconds! Here's both sets of data together, but you may have to squint to see the nftables results!:

kube-proxy iptables-vs-nftables first packet latency, at various percentiles, in clusters of various sizes

Why nftables? Part 2: control plane latency

While the improvements to data plane latency in large clusters are great, there's another problem with iptables kube-proxy that often keeps users from even being able to grow their clusters to that size: the time it takes kube-proxy to program new iptables rules when Services and their endpoints change.

With both iptables and nftables, the total size of the ruleset as a whole (actual rules, plus associated data) is O(n) in the combined number of Services and their endpoints. Originally, the iptables backend would rewrite every rule on every update, and with tens of thousands of Services, this could grow to be hundreds of thousands of iptables rules. Starting in Kubernetes 1.26, we began improving kube-proxy so that it could skip updating most of the unchanged rules in each update, but the limitations of iptables-restore as an API meant that it was still always necessary to send an update that's O(n) in the number of Services (though with a noticeably smaller constant than it used to be). Even with those optimizations, it can still be necessary to make use of kube-proxy's minSyncPeriod config option to ensure that it doesn't spend every waking second trying to push iptables updates.

The nftables APIs allow for doing much more incremental updates, and when kube-proxy in nftables mode does an update, the size of the update is only O(n) in the number of Services and endpoints that have changed since the last sync, regardless of the total number of Services and endpoints. The fact that the nftables API allows each nftables-using component to have its own private table also means that there is no global lock contention between components like with iptables. As a result, kube-proxy's nftables updates can be done much more efficiently than with iptables.

(Unfortunately I don't have cool graphs for this part.)

Why not nftables?

All that said, there are a few reasons why you might not want to jump right into using the nftables backend for now.

First, the code is still fairly new. While it has plenty of unit tests, performs correctly in our CI system, and has now been used in the real world by multiple users, it has not seen anything close to as much real-world usage as the iptables backend has, so we can't promise that it is as stable and bug-free.

Second, the nftables mode will not work on older Linux distributions; currently it requires a 5.13 or newer kernel. Additionally, because of bugs in early versions of the nft command line tool, you should not run kube-proxy in nftables mode on nodes that have an old (earlier than 1.0.0) version of nft in the host filesystem (or else kube-proxy's use of nftables may interfere with other uses of nftables on the system).

Third, you may have other networking components in your cluster, such as the pod network or NetworkPolicy implementation, that do not yet support kube-proxy in nftables mode. You should consult the documentation (or forums, bug tracker, etc.) for any such components to see if they have problems with nftables mode. (In many cases they will not; as long as they don't try to directly interact with or override kube-proxy's iptables rules, they shouldn't care whether kube-proxy is using iptables or nftables.) Additionally, observability and monitoring tools that have not been updated may report less data for kube-proxy in nftables mode than they do for kube-proxy in iptables mode.

Finally, kube-proxy in nftables mode is intentionally not 100% compatible with kube-proxy in iptables mode. There are a few old kube-proxy features whose default behaviors are less secure, less performant, or less intuitive than we'd like, but where we felt that changing the default would be a compatibility break. Since the nftables mode is opt-in, this gave us a chance to fix those bad defaults without breaking users who weren't expecting changes. (In particular, with nftables mode, NodePort Services are now only reachable on their nodes' default IPs, as opposed to being reachable on all IPs, including 127.0.0.1, with iptables mode.) The kube-proxy documentation has more information about this, including information about metrics you can look at to determine if you are relying on any of the changed functionality, and what configuration options are available to get more backward-compatible behavior.

Trying out nftables mode

Ready to try it out? In Kubernetes 1.31 and later, you just need to pass --proxy-mode nftables to kube-proxy (or set mode: nftables in your kube-proxy config file).

If you are using kubeadm to set up your cluster, the kubeadm documentation explains how to pass a KubeProxyConfiguration to kubeadm init. You can also deploy nftables-based clusters with kind.

You can also convert existing clusters from iptables (or ipvs) mode to nftables by updating the kube-proxy configuration and restarting the kube-proxy pods. (You do not need to reboot the nodes: when restarting in nftables mode, kube-proxy will delete any existing iptables or ipvs rules, and likewise, if you later revert back to iptables or ipvs mode, it will delete any existing nftables rules.)

Future plans

As mentioned above, while nftables is now the best kube-proxy mode, it is not the default, and we do not yet have a plan for changing that. We will continue to support the iptables mode for a long time.

The future of the IPVS mode of kube-proxy is less certain: its main advantage over iptables was that it was faster, but certain aspects of the IPVS architecture and APIs were awkward for kube-proxy's purposes (for example, the fact that the kube-ipvs0 device needs to have every Service IP address assigned to it), and some parts of Kubernetes Service proxying semantics were difficult to implement using IPVS (particularly the fact that some Services had to have different endpoints depending on whether you connected to them from a local or remote client). And now, the nftables mode has the same performance as IPVS mode (actually, slightly better), without any of the downsides:

kube-proxy ipvs-vs-nftables first packet latency, at various percentiles, in clusters of various sizes

(In theory the IPVS mode also has the advantage of being able to use various other IPVS functionality, like alternative "schedulers" for balancing endpoints. In practice, this ended up not being very useful, because kube-proxy runs independently on every node, and the IPVS schedulers on each node had no way of sharing their state with the proxies on other nodes, thus thwarting the effort to balance traffic more cleverly.)

While the Kubernetes project does not have an immediate plan to drop the IPVS backend, it is probably doomed in the long run, and people who are currently using IPVS mode should try out the nftables mode instead (and file bugs if you think there is missing functionality in nftables mode that you can't work around).

Learn more

The Cloud Controller Manager Chicken and Egg Problem

Kubernetes 1.31 completed the largest migration in Kubernetes history, removing the in-tree cloud provider. While the component migration is now done, this leaves some additional complexity for users and installer projects (for example, kOps or Cluster API) . We will go over those additional steps and failure points and make recommendations for cluster owners. This migration was complex and some logic had to be extracted from the core components, building four new subsystems.

  1. Cloud controller manager (KEP-2392)
  2. API server network proxy (KEP-1281)
  3. kubelet credential provider plugins (KEP-2133)
  4. Storage migration to use CSI (KEP-625)

The cloud controller manager is part of the control plane. It is a critical component that replaces some functionality that existed previously in the kube-controller-manager and the kubelet.

Components of Kubernetes

Components of Kubernetes

One of the most critical functionalities of the cloud controller manager is the node controller, which is responsible for the initialization of the nodes.

As you can see in the following diagram, when the kubelet starts, it registers the Node object with the apiserver, Tainting the node so it can be processed first by the cloud-controller-manager. The initial Node is missing the cloud-provider specific information, like the Node Addresses and the Labels with the cloud provider specific information like the Node, Region and Instance type information.

Chicken and egg problem sequence diagram

Chicken and egg problem sequence diagram

This new initialization process adds some latency to the node readiness. Previously, the kubelet was able to initialize the node at the same time it created the node. Since the logic has moved to the cloud-controller-manager, this can cause a chicken and egg problem during the cluster bootstrapping for those Kubernetes architectures that do not deploy the controller manager as the other components of the control plane, commonly as static pods, standalone binaries or daemonsets/deployments with tolerations to the taints and using hostNetwork (more on this below)

Examples of the dependency problem

As noted above, it is possible during bootstrapping for the cloud-controller-manager to be unschedulable and as such the cluster will not initialize properly. The following are a few concrete examples of how this problem can be expressed and the root causes for why they might occur.

These examples assume you are running your cloud-controller-manager using a Kubernetes resource (e.g. Deployment, DaemonSet, or similar) to control its lifecycle. Because these methods rely on Kubernetes to schedule the cloud-controller-manager, care must be taken to ensure it will schedule properly.

Example: Cloud controller manager not scheduling due to uninitialized taint

As noted in the Kubernetes documentation, when the kubelet is started with the command line flag --cloud-provider=external, its corresponding Node object will have a no schedule taint named node.cloudprovider.kubernetes.io/uninitialized added. Because the cloud-controller-manager is responsible for removing the no schedule taint, this can create a situation where a cloud-controller-manager that is being managed by a Kubernetes resource, such as a Deployment or DaemonSet, may not be able to schedule.

If the cloud-controller-manager is not able to be scheduled during the initialization of the control plane, then the resulting Node objects will all have the node.cloudprovider.kubernetes.io/uninitialized no schedule taint. It also means that this taint will not be removed as the cloud-controller-manager is responsible for its removal. If the no schedule taint is not removed, then critical workloads, such as the container network interface controllers, will not be able to schedule, and the cluster will be left in an unhealthy state.

Example: Cloud controller manager not scheduling due to not-ready taint

The next example would be possible in situations where the container network interface (CNI) is waiting for IP address information from the cloud-controller-manager (CCM), and the CCM has not tolerated the taint which would be removed by the CNI.

The Kubernetes documentation describes the node.kubernetes.io/not-ready taint as follows:

"The Node controller detects whether a Node is ready by monitoring its health and adds or removes this taint accordingly."

One of the conditions that can lead to a Node resource having this taint is when the container network has not yet been initialized on that node. As the cloud-controller-manager is responsible for adding the IP addresses to a Node resource, and the IP addresses are needed by the container network controllers to properly configure the container network, it is possible in some circumstances for a node to become stuck as not ready and uninitialized permanently.

This situation occurs for a similar reason as the first example, although in this case, the node.kubernetes.io/not-ready taint is used with the no execute effect and thus will cause the cloud-controller-manager not to run on the node with the taint. If the cloud-controller-manager is not able to execute, then it will not initialize the node. It will cascade into the container network controllers not being able to run properly, and the node will end up carrying both the node.cloudprovider.kubernetes.io/uninitialized and node.kubernetes.io/not-ready taints, leaving the cluster in an unhealthy state.

Our Recommendations

There is no one “correct way” to run a cloud-controller-manager. The details will depend on the specific needs of the cluster administrators and users. When planning your clusters and the lifecycle of the cloud-controller-managers please consider the following guidance:

For cloud-controller-managers running in the same cluster, they are managing.

  1. Use host network mode, rather than the pod network: in most cases, a cloud controller manager will need to communicate with an API service endpoint associated with the infrastructure. Setting “hostNetwork” to true will ensure that the cloud controller is using the host networking instead of the container network and, as such, will have the same network access as the host operating system. It will also remove the dependency on the networking plugin. This will ensure that the cloud controller has access to the infrastructure endpoint (always check your networking configuration against your infrastructure provider’s instructions).
  2. Use a scalable resource type. Deployments and DaemonSets are useful for controlling the lifecycle of a cloud controller. They allow easy access to running multiple copies for redundancy as well as using the Kubernetes scheduling to ensure proper placement in the cluster. When using these primitives to control the lifecycle of your cloud controllers and running multiple replicas, you must remember to enable leader election, or else your controllers will collide with each other which could lead to nodes not being initialized in the cluster.
  3. Target the controller manager containers to the control plane. There might exist other controllers which need to run outside the control plane (for example, Azure’s node manager controller). Still, the controller managers themselves should be deployed to the control plane. Use a node selector or affinity stanza to direct the scheduling of cloud controllers to the control plane to ensure that they are running in a protected space. Cloud controllers are vital to adding and removing nodes to a cluster as they form a link between Kubernetes and the physical infrastructure. Running them on the control plane will help to ensure that they run with a similar priority as other core cluster controllers and that they have some separation from non-privileged user workloads.
    1. It is worth noting that an anti-affinity stanza to prevent cloud controllers from running on the same host is also very useful to ensure that a single node failure will not degrade the cloud controller performance.
  4. Ensure that the tolerations allow operation. Use tolerations on the manifest for the cloud controller container to ensure that it will schedule to the correct nodes and that it can run in situations where a node is initializing. This means that cloud controllers should tolerate the node.cloudprovider.kubernetes.io/uninitialized taint, and it should also tolerate any taints associated with the control plane (for example, node-role.kubernetes.io/control-plane or node-role.kubernetes.io/master). It can also be useful to tolerate the node.kubernetes.io/not-ready taint to ensure that the cloud controller can run even when the node is not yet available for health monitoring.

For cloud-controller-managers that will not be running on the cluster they manage (for example, in a hosted control plane on a separate cluster), then the rules are much more constrained by the dependencies of the environment of the cluster running the cloud-controller-manager. The advice for running on a self-managed cluster may not be appropriate as the types of conflicts and network constraints will be different. Please consult the architecture and requirements of your topology for these scenarios.

Example

This is an example of a Kubernetes Deployment highlighting the guidance shown above. It is important to note that this is for demonstration purposes only, for production uses please consult your cloud provider’s documentation.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: cloud-controller-manager
  name: cloud-controller-manager
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: cloud-controller-manager
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: cloud-controller-manager
      annotations:
        kubernetes.io/description: Cloud controller manager for my infrastructure
    spec:
      containers: # the container details will depend on your specific cloud controller manager
      - name: cloud-controller-manager
        command:
        - /bin/my-infrastructure-cloud-controller-manager
        - --leader-elect=true
        - -v=1
        image: registry/my-infrastructure-cloud-controller-manager@latest
        resources:
          requests:
            cpu: 200m
            memory: 50Mi
      hostNetwork: true # these Pods are part of the control plane
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: "kubernetes.io/hostname"
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: cloud-controller-manager
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 120
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 120
      - effect: NoSchedule
        key: node.cloudprovider.kubernetes.io/uninitialized
        operator: Exists
      - effect: NoSchedule
        key: node.kubernetes.io/not-ready
        operator: Exists

When deciding how to deploy your cloud controller manager it is worth noting that cluster-proportional, or resource-based, pod autoscaling is not recommended. Running multiple replicas of a cloud controller manager is good practice for ensuring high-availability and redundancy, but does not contribute to better performance. In general, only a single instance of a cloud controller manager will be reconciling a cluster at any given time.

Spotlight on SIG Architecture: Enhancements

This is the fourth interview of a SIG Architecture Spotlight series that will cover the different subprojects, and we will be covering SIG Architecture: Enhancements.

In this SIG Architecture spotlight we talked with Kirsten Garrison, lead of the Enhancements subproject.

The Enhancements subproject

Frederico (FSM): Hi Kirsten, very happy to have the opportunity to talk about the Enhancements subproject. Let's start with some quick information about yourself and your role.

Kirsten Garrison (KG): I’m a lead of the Enhancements subproject of SIG-Architecture and currently work at Google. I first got involved by contributing to the service-catalog project with the help of Carolyn Van Slyck. With time, I joined the Release team, eventually becoming the Enhancements Lead and a Release Lead shadow. While on the release team, I worked on some ideas to make the process better for the SIGs and Enhancements team (the opt-in process) based on my team’s experiences. Eventually, I started attending Subproject meetings and contributing to the Subproject’s work.

FSM: You mentioned the Enhancements subproject: how would you describe its main goals and areas of intervention?

KG: The Enhancements Subproject primarily concerns itself with the Kubernetes Enhancement Proposal (KEP for short)—the "design" documents required for all features and significant changes to the Kubernetes project.

The KEP and its impact

FSM: The improvement of the KEP process was (and is) one in which SIG Architecture was heavily involved. Could you explain the process to those that aren’t aware of it?

KG: Every release, the SIGs let the Release Team know which features they intend to work on to be put into the release. As mentioned above, the prerequisite for these changes is a KEP - a standardized design document that all authors must fill out and approve in the first weeks of the release cycle. Most features will move through 3 phases: alpha, beta and finally GA so approving a feature represents a significant commitment for the SIG.

The KEP serves as the full source of truth of a feature. The KEP template has different requirements based on what stage a feature is in, but it generally requires a detailed discussion of the design and the impact as well as providing artifacts of stability and performance. The KEP takes quite a bit of iterative work between authors, SIG reviewers, api review team and the Production Readiness Review team1 before it is approved. Each set of reviewers is looking to make sure that the proposal meets their standards in order to have a stable and performant Kubernetes release. Only after all approvals are secured, can an author go forth and merge their feature in the Kubernetes code base.

FSM: I see, quite a bit of additional structure was added. Looking back, what were the most significant improvements of that approach?

KG: In general, I think that the improvements with the most impact had to do with focusing on the core intent of the KEP. KEPs exist not just to memorialize designs, but provide a structured way to discuss and come to an agreement about different facets of the change. At the core of the KEP process is communication and consideration.

To that end, some of the significant changes revolve around a more detailed and accessible KEP template. A significant amount of work was put in over time to get the k/enhancements repo into its current form -- a directory structure organized by SIG with the contours of the modern KEP template (with Proposal/Motivation/Design Details subsections). We might take that basic structure for granted today, but it really represents the work of many people trying to get the foundation of this process in place over time.

As Kubernetes matures, we’ve needed to think about more than just the end goal of getting a single feature merged. We need to think about things like: stability, performance, setting and meeting user expectations. And as we’ve thought about those things the template has grown more detailed. The addition of the Production Readiness Review was major as well as the enhanced testing requirements (varying at different stages of a KEP’s lifecycle).

Current areas of focus

FSM: Speaking of maturing, we’ve recently released Kubernetes v1.31, and work on v1.32 has started. Are there any areas that the Enhancements sub-project is currently addressing that might change the way things are done?

KG: We’re currently working on two things:

  1. Creating a Process KEP template. Sometimes people want to harness the KEP process for significant changes that are more process oriented rather than feature oriented. We want to support this because memorializing changes is important and giving people a better tool to do so will only encourage more discussion and transparency.
  2. KEP versioning. While our template changes aim to be as non-disruptive as possible, we believe that it will be easier to track and communicate those changes to the community better with a versioned KEP template and the policies that go alongside such versioning.

Both features will take some time to get right and fully roll out (just like a KEP feature) but we believe that they will both provide improvements that will benefit the community at large.

FSM: You mentioned improvements: I remember when project boards for Enhancement tracking were introduced in recent releases, to great effect and unanimous applause from release team members. Was this a particular area of focus for the subproject?

KG: The Subproject provided support to the Release Team’s Enhancement team in the migration away from using the spreadsheet to a project board. The collection and tracking of enhancements has always been a logistical challenge. During my time on the Release Team, I helped with the transition to an opt-in system of enhancements, whereby the SIG leads "opt-in" KEPs for release tracking. This helped to enhance communication between authors and SIGs before any significant work was undertaken on a KEP and removed toil from the Enhancements team. This change used the existing tools to avoid introducing too many changes at once to the community. Later, the Release Team approached the Subproject with an idea of leveraging GitHub Project Boards to further improve the collection process. This was to be a move away from the use of complicated spreadsheets to using repo-native labels on k/enhancement issues and project boards.

FSM: That surely adds an impact on simplifying the workflow...

KG: Removing sources of friction and promoting clear communication is very important to the Enhancements Subproject. At the same time, it’s important to give careful consideration to decisions that impact the community as a whole. We want to make sure that changes are balanced to give an upside and while not causing any regressions and pain in the rollout. We supported the Release Team in ideation as well as through the actual migration to the project boards. It was a great success and exciting to see the team make high impact changes that helped everyone involved in the KEP process!

Getting involved

FSM: For those reading that might be curious and interested in helping, how would you describe the required skills for participating in the sub-project?

KG: Familiarity with KEPs either via experience or taking time to look through the kubernetes/enhancements repo is helpful. All are welcome to participate if interested - we can take it from there.

FSM: Excellent! Many thanks for your time and insight -- any final comments you would like to share with our readers?

KG: The Enhancements process is one of the most important parts of Kubernetes and requires enormous amounts of coordination and collaboration of people and teams across the project to make it successful. I’m thankful and inspired by everyone’s continued hard work and dedication to making the project great. This is truly a wonderful community.


  1. For more information, check the Production Readiness Review spotlight interview in this series. ↩︎

Kubernetes 1.32: Moving Volume Group Snapshots to Beta

Volume group snapshots were introduced as an Alpha feature with the Kubernetes 1.27 release. The recent release of Kubernetes v1.32 moved that support to beta. The support for volume group snapshots relies on a set of extension APIs for group snapshots. These APIs allow users to take crash consistent snapshots for a set of volumes. Behind the scenes, Kubernetes uses a label selector to group multiple PersistentVolumeClaims for snapshotting. A key aim is to allow you restore that set of snapshots to new volumes and recover your workload based on a crash consistent recovery point.

This new feature is only supported for CSI volume drivers.

An overview of volume group snapshots

Some storage systems provide the ability to create a crash consistent snapshot of multiple volumes. A group snapshot represents copies made from multiple volumes, that are taken at the same point-in-time. A group snapshot can be used either to rehydrate new volumes (pre-populated with the snapshot data) or to restore existing volumes to a previous state (represented by the snapshots).

Why add volume group snapshots to Kubernetes?

The Kubernetes volume plugin system already provides a powerful abstraction that automates the provisioning, attaching, mounting, resizing, and snapshotting of block and file storage.

Underpinning all these features is the Kubernetes goal of workload portability: Kubernetes aims to create an abstraction layer between distributed applications and underlying clusters so that applications can be agnostic to the specifics of the cluster they run on and application deployment requires no cluster specific knowledge.

There was already a VolumeSnapshot API that provides the ability to take a snapshot of a persistent volume to protect against data loss or data corruption. However, there are other snapshotting functionalities not covered by the VolumeSnapshot API.

Some storage systems support consistent group snapshots that allow a snapshot to be taken from multiple volumes at the same point-in-time to achieve write order consistency. This can be useful for applications that contain multiple volumes. For example, an application may have data stored in one volume and logs stored in another volume. If snapshots for the data volume and the logs volume are taken at different times, the application will not be consistent and will not function properly if it is restored from those snapshots when a disaster strikes.

It is true that you can quiesce the application first, take an individual snapshot from each volume that is part of the application one after the other, and then unquiesce the application after all the individual snapshots are taken. This way, you would get application consistent snapshots.

However, sometimes the application quiesce can be so time consuming that you want to do it less frequently, or it may not be possible to quiesce an application at all. For example, a user may want to run weekly backups with application quiesce and nightly backups without application quiesce but with consistent group support which provides crash consistency across all volumes in the group.

Kubernetes APIs for volume group snapshots

Kubernetes' support for volume group snapshots relies on three API kinds that are used for managing snapshots:

VolumeGroupSnapshot
Created by a Kubernetes user (or perhaps by your own automation) to request creation of a volume group snapshot for multiple persistent volume claims. It contains information about the volume group snapshot operation such as the timestamp when the volume group snapshot was taken and whether it is ready to use. The creation and deletion of this object represents a desire to create or delete a cluster resource (a group snapshot).
VolumeGroupSnapshotContent
Created by the snapshot controller for a dynamically created VolumeGroupSnapshot. It contains information about the volume group snapshot including the volume group snapshot ID. This object represents a provisioned resource on the cluster (a group snapshot). The VolumeGroupSnapshotContent object binds to the VolumeGroupSnapshot for which it was created with a one-to-one mapping.
VolumeGroupSnapshotClass
Created by cluster administrators to describe how volume group snapshots should be created, including the driver information, the deletion policy, etc.

These three API kinds are defined as CustomResourceDefinitions (CRDs). These CRDs must be installed in a Kubernetes cluster for a CSI Driver to support volume group snapshots.

What components are needed to support volume group snapshots

Volume group snapshots are implemented in the external-snapshotter repository. Implementing volume group snapshots meant adding or changing several components:

  • Added new CustomResourceDefinitions for VolumeGroupSnapshot and two supporting APIs.
  • Volume group snapshot controller logic is added to the common snapshot controller.
  • Adding logic to make CSI calls into the snapshotter sidecar controller.

The volume snapshot controller and CRDs are deployed once per cluster, while the sidecar is bundled with each CSI driver.

Therefore, it makes sense to deploy the volume snapshot controller and CRDs as a cluster addon.

The Kubernetes project recommends that Kubernetes distributors bundle and deploy the volume snapshot controller and CRDs as part of their Kubernetes cluster management process (independent of any CSI Driver).

What's new in Beta?

  • The VolumeGroupSnapshot feature in CSI spec moved to GA in the v1.11.0 release.

  • The snapshot validation webhook was deprecated in external-snapshotter v8.0.0 and it is now removed. Most of the validation webhook logic was added as validation rules into the CRDs. Minimum required Kubernetes version is 1.25 for these validation rules. One thing in the validation webhook not moved to CRDs is the prevention of creating multiple default volume snapshot classes and multiple default volume group snapshot classes for the same CSI driver. With the removal of the validation webhook, an error will still be raised when dynamically provisioning a VolumeSnapshot or VolumeGroupSnapshot when multiple default volume snapshot classes or multiple default volume group snapshot classes for the same CSI driver exist.

  • The enable-volumegroup-snapshot flag in the snapshot-controller and the CSI snapshotter sidecar has been replaced by a feature gate. Since VolumeGroupSnapshot is a new API, the feature moves to Beta but the feature gate is disabled by default. To use this feature, enable the feature gate by adding the flag --feature-gates=CSIVolumeGroupSnapshot=true when starting the snapshot-controller and the CSI snapshotter sidecar.

  • The logic to dynamically create the VolumeGroupSnapshot and its corresponding individual VolumeSnapshot and VolumeSnapshotContent objects are moved from the CSI snapshotter to the common snapshot-controller. New RBAC rules are added to the common snapshot-controller and some RBAC rules are removed from the CSI snapshotter sidecar accordingly.

How do I use Kubernetes volume group snapshots

Creating a new group snapshot with Kubernetes

Once a VolumeGroupSnapshotClass object is defined and you have volumes you want to snapshot together, you may request a new group snapshot by creating a VolumeGroupSnapshot object.

The source of the group snapshot specifies whether the underlying group snapshot should be dynamically created or if a pre-existing VolumeGroupSnapshotContent should be used.

A pre-existing VolumeGroupSnapshotContent is created by a cluster administrator. It contains the details of the real volume group snapshot on the storage system which is available for use by cluster users.

One of the following members in the source of the group snapshot must be set.

  • selector - a label query over PersistentVolumeClaims that are to be grouped together for snapshotting. This selector will be used to match the label added to a PVC.
  • volumeGroupSnapshotContentName - specifies the name of a pre-existing VolumeGroupSnapshotContent object representing an existing volume group snapshot.

Dynamically provision a group snapshot

In the following example, there are two PVCs.

NAME    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      VOLUMEATTRIBUTESCLASS   AGE
pvc-0   Bound    pvc-6e1f7d34-a5c5-4548-b104-01e72c72b9f2   100Mi      RWO            csi-hostpath-sc   <unset>                 2m15s
pvc-1   Bound    pvc-abc640b3-2cc1-4c56-ad0c-4f0f0e636efa   100Mi      RWO            csi-hostpath-sc   <unset>                 2m7s

Label the PVCs.

% kubectl label pvc pvc-0 group=myGroup
persistentvolumeclaim/pvc-0 labeled

% kubectl label pvc pvc-1 group=myGroup
persistentvolumeclaim/pvc-1 labeled

For dynamic provisioning, a selector must be set so that the snapshot controller can find PVCs with the matching labels to be snapshotted together.

apiVersion: groupsnapshot.storage.k8s.io/v1beta1
kind: VolumeGroupSnapshot
metadata:
  name: snapshot-daily-20241217
  namespace: demo-namespace
spec:
  volumeGroupSnapshotClassName: csi-groupSnapclass
  source:
    selector:
      matchLabels:
        group: myGroup

In the VolumeGroupSnapshot spec, a user can specify the VolumeGroupSnapshotClass which has the information about which CSI driver should be used for creating the group snapshot. A VolumGroupSnapshotClass is required for dynamic provisioning.

apiVersion: groupsnapshot.storage.k8s.io/v1beta1
kind: VolumeGroupSnapshotClass
metadata:
  name: csi-groupSnapclass
  annotations:
    kubernetes.io/description: "Example group snapshot class"
driver: example.csi.k8s.io
deletionPolicy: Delete

As a result of the volume group snapshot creation, a corresponding VolumeGroupSnapshotContent object will be created with a volumeGroupSnapshotHandle pointing to a resource on the storage system.

Two individual volume snapshots will be created as part of the volume group snapshot creation.

NAME                                                                        READYTOUSE   SOURCEPVC   RESTORESIZE   SNAPSHOTCONTENT                                                                AGE
snapshot-0962a745b2bf930bb385b7b50c9b08af471f1a16780726de19429dd9c94eaca0   true         pvc-0       100Mi         snapcontent-0962a745b2bf930bb385b7b50c9b08af471f1a16780726de19429dd9c94eaca0   16m
snapshot-da577d76bd2106c410616b346b2e72440f6ec7b12a75156263b989192b78caff   true         pvc-1       100Mi         snapcontent-da577d76bd2106c410616b346b2e72440f6ec7b12a75156263b989192b78caff   16m

Importing an existing group snapshot with Kubernetes

To import a pre-existing volume group snapshot into Kubernetes, you must also import the corresponding individual volume snapshots.

Identify the individual volume snapshot handles, manually construct a VolumeSnapshotContent object first, then create a VolumeSnapshot object pointing to the VolumeSnapshotContent object. Repeat this for every individual volume snapshot.

Then manually create a VolumeGroupSnapshotContent object, specifying the volumeGroupSnapshotHandle and individual volumeSnapshotHandles already existing on the storage system.

apiVersion: groupsnapshot.storage.k8s.io/v1beta1
kind: VolumeGroupSnapshotContent
metadata:
  name: static-group-content
spec:
  deletionPolicy: Delete
  driver: hostpath.csi.k8s.io
  source:
    groupSnapshotHandles:
      volumeGroupSnapshotHandle: e8779136-a93e-11ef-9549-66940726f2fd
      volumeSnapshotHandles:
      - e8779147-a93e-11ef-9549-66940726f2fd
      - e8783cd0-a93e-11ef-9549-66940726f2fd
  volumeGroupSnapshotRef:
    name: static-group-snapshot
    namespace: demo-namespace

After that create a VolumeGroupSnapshot object pointing to the VolumeGroupSnapshotContent object.

apiVersion: groupsnapshot.storage.k8s.io/v1beta1
kind: VolumeGroupSnapshot
metadata:
  name: static-group-snapshot
  namespace: demo-namespace
spec:
  source:
    volumeGroupSnapshotContentName: static-group-content

How to use group snapshot for restore in Kubernetes

At restore time, the user can request a new PersistentVolumeClaim to be created from a VolumeSnapshot object that is part of a VolumeGroupSnapshot. This will trigger provisioning of a new volume that is pre-populated with data from the specified snapshot. The user should repeat this until all volumes are created from all the snapshots that are part of a group snapshot.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: examplepvc-restored-2024-12-17
  namespace: demo-namespace
spec:
  storageClassName: example-foo-nearline
  dataSource:
    name: snapshot-0962a745b2bf930bb385b7b50c9b08af471f1a16780726de19429dd9c94eaca0
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOncePod
  resources:
    requests:
      storage: 100Mi # must be enough storage to fit the existing snapshot

As a storage vendor, how do I add support for group snapshots to my CSI driver?

To implement the volume group snapshot feature, a CSI driver must:

  • Implement a new group controller service.
  • Implement group controller RPCs: CreateVolumeGroupSnapshot, DeleteVolumeGroupSnapshot, and GetVolumeGroupSnapshot.
  • Add group controller capability CREATE_DELETE_GET_VOLUME_GROUP_SNAPSHOT.

See the CSI spec and the Kubernetes-CSI Driver Developer Guide for more details.

As mentioned earlier, it is strongly recommended that Kubernetes distributors bundle and deploy the volume snapshot controller and CRDs as part of their Kubernetes cluster management process (independent of any CSI Driver).

As part of this recommended deployment process, the Kubernetes team provides a number of sidecar (helper) containers, including the external-snapshotter sidecar container which has been updated to support volume group snapshot.

The external-snapshotter watches the Kubernetes API server for VolumeGroupSnapshotContent objects, and triggers CreateVolumeGroupSnapshot and DeleteVolumeGroupSnapshot operations against a CSI endpoint.

What are the limitations?

The beta implementation of volume group snapshots for Kubernetes has the following limitations:

  • Does not support reverting an existing PVC to an earlier state represented by a snapshot (only supports provisioning a new volume from a snapshot).
  • No application consistency guarantees beyond any guarantees provided by the storage system (e.g. crash consistency). See this doc for more discussions on application consistency.

What’s next?

Depending on feedback and adoption, the Kubernetes project plans to push the volume group snapshot implementation to general availability (GA) in a future release.

How can I learn more?

How do I get involved?

This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. On behalf of SIG Storage, I would like to offer a huge thank you to the contributors who stepped up these last few quarters to help the project reach beta:

For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We always welcome new contributors.

We also hold regular Data Protection Working Group meetings. New attendees are welcome to join our discussions.

Enhancing Kubernetes API Server Efficiency with API Streaming

Managing Kubernetes clusters efficiently is critical, especially as their size is growing. A significant challenge with large clusters is the memory overhead caused by list requests.

In the existing implementation, the kube-apiserver processes list requests by assembling the entire response in-memory before transmitting any data to the client. But what if the response body is substantial, say hundreds of megabytes? Additionally, imagine a scenario where multiple list requests flood in simultaneously, perhaps after a brief network outage. While API Priority and Fairness has proven to reasonably protect kube-apiserver from CPU overload, its impact is visibly smaller for memory protection. This can be explained by the differing nature of resource consumption by a single API request - the CPU usage at any given time is capped by a constant, whereas memory, being uncompressible, can grow proportionally with the number of processed objects and is unbounded. This situation poses a genuine risk, potentially overwhelming and crashing any kube-apiserver within seconds due to out-of-memory (OOM) conditions. To better visualize the issue, let's consider the below graph.

Monitoring graph showing kube-apiserver memory usage

The graph shows the memory usage of a kube-apiserver during a synthetic test. (see the synthetic test section for more details). The results clearly show that increasing the number of informers significantly boosts the server's memory consumption. Notably, at approximately 16:40, the server crashed when serving only 16 informers.

Why does kube-apiserver allocate so much memory for list requests?

Our investigation revealed that this substantial memory allocation occurs because the server before sending the first byte to the client must:

  • fetch data from the database,
  • deserialize the data from its stored format,
  • and finally construct the final response by converting and serializing the data into a client requested format

This sequence results in significant temporary memory consumption. The actual usage depends on many factors like the page size, applied filters (e.g. label selectors), query parameters, and sizes of individual objects.

Unfortunately, neither API Priority and Fairness nor Golang's garbage collection or Golang memory limits can prevent the system from exhausting memory under these conditions. The memory is allocated suddenly and rapidly, and just a few requests can quickly deplete the available memory, leading to resource exhaustion.

Depending on how the API server is run on the node, it might either be killed through OOM by the kernel when exceeding the configured memory limits during these uncontrolled spikes, or if limits are not configured it might have even worse impact on the control plane node. And worst, after the first API server failure, the same requests will likely hit another control plane node in an HA setup with probably the same impact. Potentially a situation that is hard to diagnose and hard to recover from.

Streaming list requests

Today, we're excited to announce a major improvement. With the graduation of the watch list feature to beta in Kubernetes 1.32, client-go users can opt-in (after explicitly enabling WatchListClient feature gate) to streaming lists by switching from list to (a special kind of) watch requests.

Watch requests are served from the watch cache, an in-memory cache designed to improve scalability of read operations. By streaming each item individually instead of returning the entire collection, the new method maintains constant memory overhead. The API server is bound by the maximum allowed size of an object in etcd plus a few additional allocations. This approach drastically reduces the temporary memory usage compared to traditional list requests, ensuring a more efficient and stable system, especially in clusters with a large number of objects of a given type or large average object sizes where despite paging memory consumption used to be high.

Building on the insight gained from the synthetic test (see the synthetic test, we developed an automated performance test to systematically evaluate the impact of the watch list feature. This test replicates the same scenario, generating a large number of Secrets with a large payload, and scaling the number of informers to simulate heavy list request patterns. The automated test is executed periodically to monitor memory usage of the server with the feature enabled and disabled.

The results showed significant improvements with the watch list feature enabled. With the feature turned on, the kube-apiserver’s memory consumption stabilized at approximately 2 GB. By contrast, with the feature disabled, memory usage increased to approximately 20GB, a 10x increase! These results confirm the effectiveness of the new streaming API, which reduces the temporary memory footprint.

Enabling API Streaming for your component

Upgrade to Kubernetes 1.32. Make sure your cluster uses etcd in version 3.4.31+ or 3.5.13+. Change your client software to use watch lists. If your client code is written in Golang, you'll want to enable WatchListClient for client-go. For details on enabling that feature, read Introducing Feature Gates to Client-Go: Enhancing Flexibility and Control.

What's next?

In Kubernetes 1.32, the feature is enabled in kube-controller-manager by default despite its beta state. This will eventually be expanded to other core components like kube-scheduler or kubelet; once the feature becomes generally available, if not earlier. Other 3rd-party components are encouraged to opt-in to the feature during the beta phase, especially when they are at risk of accessing a large number of resources or kinds with potentially large object sizes.

For the time being, API Priority and Fairness assigns a reasonable small cost to list requests. This is necessary to allow enough parallelism for the average case where list requests are cheap enough. But it does not match the spiky exceptional situation of many and large objects. Once the majority of the Kubernetes ecosystem has switched to watch list, the list cost estimation can be changed to larger values without risking degraded performance in the average case, and with that increasing the protection against this kind of requests that can still hit the API server in the future.

The synthetic test

In order to reproduce the issue, we conducted a manual test to understand the impact of list requests on kube-apiserver memory usage. In the test, we created 400 Secrets, each containing 1 MB of data, and used informers to retrieve all Secrets.

The results were alarming, only 16 informers were needed to cause the test server to run out of memory and crash, demonstrating how quickly memory consumption can grow under such conditions.

Special shout out to @deads2k for his help in shaping this feature.

Kubernetes 1.33 update

Since this feature was started, Marek Siarkowicz integrated a new technology into the Kubernetes API server: streaming collection encoding. Kubernetes v1.33 introduced two related feature gates, StreamingCollectionEncodingToJSON and StreamingCollectionEncodingToProtobuf. These features encode via a stream and avoid allocating all the memory at once. This functionality is bit-for-bit compatible with existing list encodings, produces even greater server-side memory savings, and doesn't require any changes to client code. In 1.33, the WatchList feature gate is disabled by default.

Kubernetes v1.32 Adds A New CPU Manager Static Policy Option For Strict CPU Reservation

In Kubernetes v1.32, after years of community discussion, we are excited to introduce a strict-cpu-reservation option for the CPU Manager static policy. This feature is currently in alpha, with the associated policy hidden by default. You can only use the policy if you explicitly enable the alpha behavior in your cluster.

Understanding the feature

The CPU Manager static policy is used to reduce latency or improve performance. The reservedSystemCPUs defines an explicit CPU set for OS system daemons and kubernetes system daemons. This option is designed for Telco/NFV type use cases where uncontrolled interrupts/timers may impact the workload performance. you can use this option to define the explicit cpuset for the system/kubernetes daemons as well as the interrupts/timers, so the rest CPUs on the system can be used exclusively for workloads, with less impact from uncontrolled interrupts/timers. More details of this parameter can be found on the Explicitly Reserved CPU List page.

If you want to protect your system daemons and interrupt processing, the obvious way is to use the reservedSystemCPUs option.

However, until the Kubernetes v1.32 release, this isolation was only implemented for guaranteed pods that made requests for a whole number of CPUs. At pod admission time, the kubelet only compares the CPU requests against the allocatable CPUs. In Kubernetes, limits can be higher than the requests; the previous implementation allowed burstable and best-effort pods to use up the capacity of reservedSystemCPUs, which could then starve host OS services of CPU - and we know that people saw this in real life deployments. The existing behavior also made benchmarking (for both infrastructure and workloads) results inaccurate.

When this new strict-cpu-reservation policy option is enabled, the CPU Manager static policy will not allow any workload to use the reserved system CPU cores.

Enabling the feature

To enable this feature, you need to turn on both the CPUManagerPolicyAlphaOptions feature gate and the strict-cpu-reservation policy option. And you need to remove the /var/lib/kubelet/cpu_manager_state file if it exists and restart kubelet.

With the following kubelet configuration:

kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
featureGates:
  ...
  CPUManagerPolicyOptions: true
  CPUManagerPolicyAlphaOptions: true
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  strict-cpu-reservation: "true"
reservedSystemCPUs: "0,32,1,33,16,48"
...

When strict-cpu-reservation is not set or set to false:

# cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"0-63","checksum":1058907510}

When strict-cpu-reservation is set to true:

# cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"2-15,17-31,34-47,49-63","checksum":4141502832}

Monitoring the feature

You can monitor the feature impact by checking the following CPU Manager counters:

  • cpu_manager_shared_pool_size_millicores: report shared pool size, in millicores (e.g. 13500m)
  • cpu_manager_exclusive_cpu_allocation_count: report exclusively allocated cores, counting full cores (e.g. 16)

Your best-effort workloads may starve if the cpu_manager_shared_pool_size_millicores count is zero for prolonged time.

We believe any pod that is required for operational purpose like a log forwarder should not run as best-effort, but you can review and adjust the amount of CPU cores reserved as needed.

Conclusion

Strict CPU reservation is critical for Telco/NFV use cases. It is also a prerequisite for enabling the all-in-one type of deployments where workloads are placed on nodes serving combined control+worker+storage roles.

We want you to start using the feature and looking forward to your feedback.

Further reading

Please check out the Control CPU Management Policies on the Node task page to learn more about the CPU Manager, and how it fits in relation to the other node-level resource managers.

Getting involved

This feature is driven by the SIG Node. If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please attend the SIG Node meeting for more details.

Kubernetes v1.32: Memory Manager Goes GA

With Kubernetes 1.32, the memory manager has officially graduated to General Availability (GA), marking a significant milestone in the journey toward efficient and predictable memory allocation for containerized applications. Since Kubernetes v1.22, where it graduated to beta, the memory manager has proved itself reliable, stable and a good complementary feature for the CPU Manager.

As part of kubelet's workload admission process, the memory manager provides topology hints to optimize memory allocation and alignment. This enables users to allocate exclusive memory for Pods in the Guaranteed QoS class. More details about the process can be found in the memory manager goes to beta blog.

Most of the changes introduced since the Beta are bug fixes, internal refactoring and observability improvements, such as metrics and better logging.

Observability improvements

As part of the effort to increase the observability of memory manager, new metrics have been added to provide some statistics on memory allocation patterns.

  • memory_manager_pinning_requests_total - tracks the number of times the pod spec required the memory manager to pin memory pages.

  • memory_manager_pinning_errors_total - tracks the number of times the pod spec required the memory manager to pin memory pages, but the allocation failed.

Improving memory manager reliability and consistency

The kubelet does not guarantee pod ordering when admitting pods after a restart or reboot.

In certain edge cases, this behavior could cause the memory manager to reject some pods, and in more extreme cases, it may cause kubelet to fail upon restart.

Previously, the beta implementation lacked certain checks and logic to prevent these issues.

To stabilize the memory manager for general availability (GA) readiness, small but critical refinements have been made to the algorithm, improving its robustness and handling of edge cases.

Future development

There is more to come for the future of Topology Manager in general, and memory manager in particular. Notably, ongoing efforts are underway to extend memory manager support to Windows, enabling CPU and memory affinity on a Windows operating system.

Getting involved

This feature is driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!

Kubernetes v1.32: QueueingHint Brings a New Possibility to Optimize Pod Scheduling

The Kubernetes scheduler is the core component that selects the nodes on which new Pods run. The scheduler processes these new Pods one by one. Therefore, the larger your clusters, the more important the throughput of the scheduler becomes.

Over the years, Kubernetes SIG Scheduling has improved the throughput of the scheduler in multiple enhancements. This blog post describes a major improvement to the scheduler in Kubernetes v1.32: a scheduling context element named QueueingHint. This page provides background knowledge of the scheduler and explains how QueueingHint improves scheduling throughput.

Scheduling queue

The scheduler stores all unscheduled Pods in an internal component called the scheduling queue.

The scheduling queue consists of the following data structures:

  • ActiveQ: holds newly created Pods or Pods that are ready to be retried for scheduling.
  • BackoffQ: holds Pods that are ready to be retried but are waiting for a backoff period to end. The backoff period depends on the number of unsuccessful scheduling attempts performed by the scheduler on that Pod.
  • Unschedulable Pod Pool: holds Pods that the scheduler won't attempt to schedule for one of the following reasons:
    • The scheduler previously attempted and was unable to schedule the Pods. Since that attempt, the cluster hasn't changed in a way that could make those Pods schedulable.
    • The Pods are blocked from entering the scheduling cycles by PreEnqueue Plugins, for example, they have a scheduling gate, and get blocked by the scheduling gate plugin.

Scheduling framework and plugins

The Kubernetes scheduler is implemented following the Kubernetes scheduling framework.

And, all scheduling features are implemented as plugins (e.g., Pod affinity is implemented in the InterPodAffinity plugin.)

The scheduler processes pending Pods in phases called cycles as follows:

  1. Scheduling cycle: the scheduler takes pending Pods from the activeQ component of the scheduling queue one by one. For each Pod, the scheduler runs the filtering/scoring logic from every scheduling plugin. The scheduler then decides on the best node for the Pod, or decides that the Pod can't be scheduled at that time.

    If the scheduler decides that a Pod can't be scheduled, that Pod enters the Unschedulable Pod Pool component of the scheduling queue. However, if the scheduler decides to place the Pod on a node, the Pod goes to the binding cycle.

  2. Binding cycle: the scheduler communicates the node placement decision to the Kubernetes API server. This operation bounds the Pod to the selected node.

Aside from some exceptions, most unscheduled Pods enter the unschedulable pod pool after each scheduling cycle. The Unschedulable Pod Pool component is crucial because of how the scheduling cycle processes Pods one by one. If the scheduler had to constantly retry placing unschedulable Pods, instead of offloading those Pods to the Unschedulable Pod Pool, multiple scheduling cycles would be wasted on those Pods.

Improvements to retrying Pod scheduling with QueuingHint

Unschedulable Pods only move back into the ActiveQ or BackoffQ components of the scheduling queue if changes in the cluster might allow the scheduler to place those Pods on nodes.

Prior to v1.32, each plugin registered which cluster changes could solve their failures, an object creation, update, or deletion in the cluster (called cluster events), with EnqueueExtensions (EventsToRegister), and the scheduling queue retries a pod with an event that is registered by a plugin that rejected the pod in a previous scheduling cycle.

Additionally, we had an internal feature called preCheck, which helped further filtering of events for efficiency, based on Kubernetes core scheduling constraints; For example, preCheck could filter out node-related events when the node status is NotReady.

However, we had two issues for those approaches:

  • Requeueing with events was too broad, could lead to scheduling retries for no reason.
    • A new scheduled Pod might solve the InterPodAffinity's failure, but not all of them do. For example, if a new Pod is created, but without a label matching InterPodAffinity of the unschedulable pod, the pod wouldn't be schedulable.
  • preCheck relied on the logic of in-tree plugins and was not extensible to custom plugins, like in issue #110175.

Here QueueingHints come into play; a QueueingHint subscribes to a particular kind of cluster event, and make a decision about whether each incoming event could make the Pod schedulable.

For example, consider a Pod named pod-a that has a required Pod affinity. pod-a was rejected in the scheduling cycle by the InterPodAffinity plugin because no node had an existing Pod that matched the Pod affinity specification for pod-a.

A diagram showing the scheduling queue and pod-a rejected by InterPodAffinity plugin

A diagram showing the scheduling queue and pod-a rejected by InterPodAffinity plugin

pod-a moves into the Unschedulable Pod Pool. The scheduling queue records which plugin caused the scheduling failure for the Pod. For pod-a, the scheduling queue records that the InterPodAffinity plugin rejected the Pod.

pod-a will never be schedulable until the InterPodAffinity failure is resolved. There're some scenarios that the failure could be resolved, one example is an existing running pod gets a label update and becomes matching a Pod affinity. For this scenario, the InterPodAffinity plugin's QueuingHint callback function checks every Pod label update that occurs in the cluster. Then, if a Pod gets a label update that matches the Pod affinity requirement of pod-a, the InterPodAffinity, plugin's QueuingHint prompts the scheduling queue to move pod-a back into the ActiveQ or the BackoffQ component.

A diagram showing the scheduling queue and pod-a being moved by InterPodAffinity QueueingHint

A diagram showing the scheduling queue and pod-a being moved by InterPodAffinity QueueingHint

QueueingHint's history and what's new in v1.32

At SIG Scheduling, we have been working on the development of QueueingHint since Kubernetes v1.28.

While QueuingHint isn't user-facing, we implemented the SchedulerQueueingHints feature gate as a safety measure when we originally added this feature. In v1.28, we implemented QueueingHints with a few in-tree plugins experimentally, and made the feature gate enabled by default.

However, users reported a memory leak, and consequently we disabled the feature gate in a patch release of v1.28. From v1.28 until v1.31, we kept working on the QueueingHint implementation within the rest of the in-tree plugins and fixing bugs.

In v1.32, we made this feature enabled by default again. We finished implementing QueueingHints in all plugins and also identified the cause of the memory leak!

We thank all the contributors who participated in the development of this feature and those who reported and investigated the earlier issues.

Getting involved

These features are managed by Kubernetes SIG Scheduling.

Please join us and share your feedback.

How can I learn more?

Kubernetes v1.32: Penelope

Editors: Matteo Bianchi, Edith Puclla, William Rizzo, Ryota Sawada, Rashan Smith

Announcing the release of Kubernetes v1.32: Penelope!

In line with previous releases, the release of Kubernetes v1.32 introduces new stable, beta, and alpha features. The consistent delivery of high-quality releases underscores the strength of our development cycle and the vibrant support from our community. This release consists of 44 enhancements in total. Of those enhancements, 13 have graduated to Stable, 12 are entering Beta, and 19 have entered in Alpha.

The Kubernetes v1.32 Release Theme is "Penelope".

If Kubernetes is Ancient Greek for "pilot", in this release we start from that origin and reflect on the last 10 years of Kubernetes and our accomplishments: each release cycle is a journey, and just like Penelope, in "The Odyssey",
weaved for 10 years -- each night removing parts of what she had done during the day -- so does each release add new features and removes others, albeit here with a much clearer purpose of constantly improving Kubernetes. With v1.32 being the last release in the year Kubernetes marks its first decade anniversary, we wanted to honour all of those that have been part of the global Kubernetes crew that roams the cloud-native seas through perils and challanges: may we continue to weave the future of Kubernetes together.

Updates to recent key features

A note on DRA enhancements

In this release, like the previous one, the Kubernetes project continues proposing a number of enhancements to the Dynamic Resource Allocation (DRA), a key component of the Kubernetes resource management system. These enhancements aim to improve the flexibility and efficiency of resource allocation for workloads that require specialized hardware, such as GPUs, FPGAs and network adapters. These features are particularly useful for use-cases such as machine learning or high-performance computing applications. The core part enabling DRA Structured parameter support got promoted to beta.

Quality of life improvements on nodes and sidecar containers update

SIG Node has the following highlights that go beyond KEPs:

  1. The systemd watchdog capability is now used to restart the kubelet when its health check fails, while also limiting the maximum number of restarts within a given time period. This enhances the reliability of the kubelet. For more details, see pull request #127566.

  2. In cases when an image pull back-off error is encountered, the message displayed in the Pod status has been improved to be more human-friendly and to indicate details about why the Pod is in this condition. When an image pull back-off occurs, the error is appended to the status.containerStatuses[*].state.waiting.message field in the Pod specification with an ImagePullBackOff value in the reason field. This change provides you with more context and helps you to identify the root cause of the issue. For more details, see pull request #127918.

  3. The sidecar containers feature is targeting graduation to Stable in v1.33. To view the remaining work items and feedback from users, see comments in the issue #753.

Highlights of features graduating to Stable

This is a selection of some of the improvements that are now stable following the v1.32 release.

Custom Resource field selectors

Custom resource field selector allows developers to add field selectors to custom resources, mirroring the functionality available for built-in Kubernetes objects. This allows for more efficient and precise filtering of custom resources, promoting better API design practices.

This work was done as a part of KEP #4358, by SIG API Machinery.

Support to size memory backed volumes

This feature makes it possible to dynamically size memory-backed volumes based on Pod resource limits, improving the workload's portability and overall node resource utilization.

This work was done as a part of KEP #1967, by SIG Node.

Bound service account token improvement

The inclusion of the node name in the service account token claims allows users to use such information during authorization and admission (ValidatingAdmissionPolicy). Furthermore this improvement keeps service account credentials from being a privilege escalation path for nodes.

This work was done as part of KEP #4193 by SIG Auth.

Structured authorization configuration

Multiple authorizers can be configured in the API server to allow for structured authorization decisions, with support for CEL match conditions in webhooks. This work was done as part of KEP #3221 by SIG Auth.

Auto remove PVCs created by StatefulSet

PersistentVolumeClaims (PVCs) created by StatefulSets get automatically deleted when no longer needed, while ensuring data persistence during StatefulSet updates and node maintenance. This feature simplifies storage management for StatefulSets and reduces the risk of orphaned PVCs.

This work was done as part of KEP #1847 by SIG Apps.

Highlights of features graduating to Beta

This is a selection of some of the improvements that are now beta following the v1.32 release.

Job API managed-by mechanism

The managedBy field for Jobs was promoted to beta in the v1.32 release. This feature enables external controllers (like Kueue) to manage Job synchronization, offering greater flexibility and integration with advanced workload management systems.

This work was done as a part of KEP #4368, by SIG Apps.

Only allow anonymous auth for configured endpoints

This feature lets admins specify which endpoints are allowed for anonymous requests. For example, the admin can choose to only allow anonymous access to health endpoints like /healthz, /livez, and /readyz while making sure preventing anonymous access to other cluster endpoints or resources even if a user misconfigures RBAC.

This work was done as a part of KEP #4633, by SIG Auth.

Per-plugin callback functions for accurate requeueing in kube-scheduler enhancements

This feature enhances scheduling throughput with more efficient scheduling retry decisions by per-plugin callback functions (QueueingHint). All plugins now have QueueingHints.

This work was done as a part of KEP #4247, by SIG Scheduling.

Recover from volume expansion failure

This feature lets users recover from volume expansion failure by retrying with a smaller size. This enhancement ensures that volume expansion is more resilient and reliable, reducing the risk of data loss or corruption during the process.

This work was done as a part of KEP #1790, by SIG Storage.

Volume group snapshot

This feature introduces a VolumeGroupSnapshot API, which lets users take a snapshot of multiple volumes together, ensuring data consistency across the volumes.

This work was done as a part of KEP #3476, by SIG Storage.

Structured parameter support

The core part of Dynamic Resource Allocation (DRA), the structured parameter support, got promoted to beta. This allows the kube-scheduler and Cluster Autoscaler to simulate claim allocation directly, without needing a third-party driver. These components can now predict whether resource requests can be fulfilled based on the cluster's current state without actually committing to the allocation. By eliminating the need for a third-party driver to validate or test allocations, this feature improves planning and decision-making for resource distribution, making the scheduling and scaling processes more efficient.

This work was done as a part of KEP #4381, by WG Device Management (a cross functional team containing SIG Node, SIG Scheduling and SIG Autoscaling).

Label and field selector authorization

Label and field selectors can be used in authorization decisions. The node authorizer automatically takes advantage of this to limit nodes to list or watch their pods only. Webhook authorizers can be updated to limit requests based on the label or field selector used.

This work was done as part of KEP #4601 by SIG Auth.

Highlights of new features in Alpha

This is a selection of key improvements introduced as alpha features in the v1.32 release.

Asynchronous preemption in the Kubernetes Scheduler

The Kubernetes scheduler has been enhanced with Asynchronous Preemption, a feature that improves scheduling throughput by handling preemption operations asynchronously. Preemption ensures higher-priority pods get the resources they need by evicting lower-priority ones, but this process previously involved heavy operations like API calls to delete pods, slowing down the scheduler. With this enhancement, such tasks are now processed in parallel, allowing the scheduler to continue scheduling other pods without delays. This improvement is particularly beneficial in clusters with high Pod churn or frequent scheduling failures, ensuring a more efficient and resilient scheduling process.

This work was done as a part of KEP #4832 by SIG Scheduling.

Mutating admission policies using CEL expressions

This feature leverages CEL's object instantiation and JSON Patch strategies, combined with Server Side Apply’s merge algorithms. It simplifies policy definition, reduces mutation conflicts, and enhances admission control performance while laying a foundation for more robust, extensible policy frameworks in Kubernetes.

The Kubernetes API server now supports Common Expression Language (CEL)-based Mutating Admission Policies, providing a lightweight, efficient alternative to mutating admission webhooks. With this enhancement, administrators can use CEL to declare mutations like setting labels, defaulting fields, or injecting sidecars with simple, declarative expressions. This approach reduces operational complexity, eliminates the need for webhooks, and integrates directly with the kube-apiserver, offering faster and more reliable in-process mutation handling.

This work was done as a part of KEP #3962 by SIG API Machinery.

Pod-level resource specifications

This enhancement simplifies resource management in Kubernetes by introducing the ability to set resource requests and limits at the Pod level, creating a shared pool that all containers in the Pod can dynamically use. This is particularly valuable for workloads with containers that have fluctuating or bursty resource needs, as it minimizes over-provisioning and improves overall resource efficiency.

By leveraging Linux cgroup settings at the Pod level, Kubernetes ensures that these resource limits are enforced while enabling tightly coupled containers to collaborate more effectively without hitting artificial constraints. Importantly, this feature maintains backward compatibility with existing container-level resource settings, allowing users to adopt it incrementally without disrupting current workflows or existing configurations.

This marks a significant improvement for multi-container pods, as it reduces the operational complexity of managing resource allocations across containers. It also provides a performance boost for tightly integrated applications, such as sidecar architectures, where containers share workloads or depend on each other’s availability to perform optimally.

This work was done as part of KEP #2837 by SIG Node.

Allow zero value for sleep action of PreStop hook

This enhancement introduces the ability to set a zero-second sleep duration for the PreStop lifecycle hook in Kubernetes, offering a more flexible and no-op option for resource validation and customization. Previously, attempting to define a zero value for the sleep action resulted in validation errors, restricting its use. With this update, users can configure a zero-second duration as a valid sleep setting, enabling immediate execution and termination behaviors where needed.

The enhancement is backward-compatible, introduced as an opt-in feature controlled by the PodLifecycleSleepActionAllowZero feature gate. This change is particularly beneficial for scenarios requiring PreStop hooks for validation or admission webhook processing without requiring an actual sleep duration. By aligning with the capabilities of the time.After Go function, this update simplifies configuration and expands usability for Kubernetes workloads.

This work was done as part of KEP #4818 by SIG Node.

DRA: Standardized network interface data for resource claim status

This enhancement adds a new field that allows drivers to report specific device status data for each allocated object in a ResourceClaim. It also establishes a standardized way to represent networking devices information.

This work was done as a part of KEP #4817, by SIG Network.

New statusz and flagz endpoints for core components

You can enable two new HTTP endpoints, /statusz and /flagz, for core components. These enhance cluster debuggability by gaining insight into what versions (e.g. Golang version) that component is running as, along with details about its uptime, and which command line flags that component was executed with; making it easier to diagnose both runtime and configuration issues.

This work was done as part of KEP #4827 and KEP #4828 by SIG Instrumentation.

Windows strikes back!

Support for graceful shutdowns of Windows nodes in Kubernetes clusters has been added. Before this release, Kubernetes provided graceful node shutdown functionality for Linux nodes but lacked equivalent support for Windows. This enhancement enables the kubelet on Windows nodes to handle system shutdown events properly. Doing so, it ensures that Pods running on Windows nodes are gracefully terminated, allowing workloads to be rescheduled without disruption. This improvement enhances the reliability and stability of clusters that include Windows nodes, especially during a planned maintenance or any system updates.

Moreover CPU and memory affinity support has been added for Windows nodes with nodes, with improvements to the CPU manager, memory manager and topology manager.

This work was done respectively as part of KEP #4802 and KEP #4885 by SIG Windows.

Graduations, deprecations, and removals in 1.32

Graduations to Stable

This lists all the features that graduated to stable (also known as general availability). For a full list of updates including new features and graduations from alpha to beta, see the release notes.

This release includes a total of 13 enhancements promoted to Stable:

Deprecations and removals

As Kubernetes develops and matures, features may be deprecated, removed, or replaced with better ones for the project's overall health. See the Kubernetes deprecation and removal policy for more details on this process.

Withdrawal of the old DRA implementation

The enhancement #3063 introduced Dynamic Resource Allocation (DRA) in Kubernetes 1.26.

However, in Kubernetes v1.32, this approach to DRA will be significantly changed. Code related to the original implementation will be removed, leaving KEP #4381 as the "new" base functionality.

The decision to change the existing approach originated from its incompatibility with cluster autoscaling as resource availability was non-transparent, complicating decision-making for both Cluster Autoscaler and controllers. The newly added Structured Parameter model substitutes the functionality.

This removal will allow Kubernetes to handle new hardware requirements and resource claims more predictably, bypassing the complexities of back and forth API calls to the kube-apiserver.

See the enhancement issue #3063 to find out more.

API removals

There is one API removal in Kubernetes v1.32:

  • The flowcontrol.apiserver.k8s.io/v1beta3 API version of FlowSchema and PriorityLevelConfiguration has been removed. To prepare for this, you can edit your existing manifests and rewrite client software to use the flowcontrol.apiserver.k8s.io/v1 API version, available since v1.29. All existing persisted objects are accessible via the new API. Notable changes in flowcontrol.apiserver.k8s.io/v1beta3 include that the PriorityLevelConfiguration spec.limited.nominalConcurrencyShares field only defaults to 30 when unspecified, and an explicit value of 0 is not changed to 30.

For more information, refer to the API deprecation guide.

Release notes and upgrade actions required

Check out the full details of the Kubernetes v1.32 release in our release notes.

Availability

Kubernetes v1.32 is available for download on GitHub or on the Kubernetes download page.

To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using minikube. You can also easily install v1.32 using kubeadm.

Release team

Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.

We would like to thank the entire release team for the hours spent hard at work to deliver the Kubernetes v1.32 release to our community. The Release Team's membership ranges from first-time shadows to returning team leads with experience forged over several release cycles. A very special thanks goes out our release lead, Frederico Muñoz, for leading the release team so gracefully and handle any matter with the uttermost care, making sure this release was executed smoothly and efficiently. Last but not least a big thanks goes to all the release members - leads and shadows alike - and to the following SIGs for the terrific work and outcome achieved during these 14 weeks of release work:

  • SIG Docs - for the fundamental support in docs and blog reviews and continous collaboration with release Comms and Docs;
  • SIG k8s Infra and SIG Testing - for the outstanding work in keeping the testing framework in check, along with all the infra components necessary;
  • SIG Release and all the release managers - for the incredible support provided throughout the orchestration of the entire release, addressing even the most challenging issues in a graceful and timely manner.

Project velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

In the v1.32 release cycle, which ran for 14 weeks (September 9th to December 11th), we saw contributions to Kubernetes from as many as 125 different companies and 559 individuals as of writing.

In the whole Cloud Native ecosystem, the figure goes up to 433 companies counting 2441 total contributors. This sees an increase of 7% more overall contributions compared to the previous release cycle, along with 14% increase in the number of companies involved, showcasing strong interest and community behind the Cloud Native projects.

Source for this data:

By contribution we mean when someone makes a commit, code review, comment, creates an issue or PR, reviews a PR (including blogs and documentation) or comments on issues and PRs.

If you are interested in contributing visit Getting Started on our contributor website.

Check out DevStats to learn more about the overall velocity of the Kubernetes project and community.

Event updates

Explore the upcoming Kubernetes and cloud-native events from March to June 2025, featuring KubeCon and KCD Stay informed and engage with the Kubernetes community.

March 2025

April 2025

May 2025

June 2025

Upcoming release webinar

Join members of the Kubernetes v1.32 release team on Thursday, January 9th 2025 at 5:00 PM (UTC), to learn about the release highlights of this release, as well as deprecations and removals to help plan for upgrades. For more information and registration, visit the event page on the CNCF Online Programs site.

Get involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

Gateway API v1.2: WebSockets, Timeouts, Retries, and More

Gateway API logo

Kubernetes SIG Network is delighted to announce the general availability of Gateway API v1.2! This version of the API was released on October 3, and we're delighted to report that we now have a number of conformant implementations of it for you to try out.

Gateway API v1.2 brings a number of new features to the Standard channel (Gateway API's GA release channel), introduces some new experimental features, and inaugurates our new release process — but it also brings two breaking changes that you'll want to be careful of.

Breaking changes

GRPCRoute and ReferenceGrant v1alpha2 removal

Now that the v1 versions of GRPCRoute and ReferenceGrant have graduated to Standard, the old v1alpha2 versions have been removed from both the Standard and Experimental channels, in order to ease the maintenance burden that perpetually supporting the old versions would place on the Gateway API community.

Before upgrading to Gateway API v1.2, you'll want to confirm that any implementations of Gateway API have been upgraded to support the v1 API version of these resources instead of the v1alpha2 API version. Note that even if you've been using v1 in your YAML manifests, a controller may still be using v1alpha2 which would cause it to fail during this upgrade. Additionally, Kubernetes itself goes to some effort to stop you from removing a CRD version that it thinks you're using: check out the release notes for more information about what you need to do to safely upgrade.

Change to .status.supportedFeatures (experimental)

A much smaller breaking change: .status.supportedFeatures in a Gateway is now a list of objects instead of a list of strings. The objects have a single name field, so the translation from the strings is straightforward, but moving to objects permits a lot more flexibility for the future. This stanza is not yet present in the Standard channel.

Graduations to the standard channel

Gateway API 1.2.0 graduates four features to the Standard channel, meaning that they can now be considered generally available. Inclusion in the Standard release channel denotes a high level of confidence in the API surface and provides guarantees of backward compatibility. Of course, as with any other Kubernetes API, Standard channel features can continue to evolve with backward-compatible additions over time, and we certainly expect further refinements and improvements to these new features in the future. For more information on how all of this works, refer to the Gateway API Versioning Policy.

HTTPRoute timeouts

GEP-1742 introduced the timeouts stanza into HTTPRoute, permitting configuring basic timeouts for HTTP traffic. This is a simple but important feature for proper resilience when handling HTTP traffic, and it is now Standard.

For example, this HTTPRoute configuration sets a timeout of 300ms for traffic to the /face path:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: face-with-timeouts
  namespace: faces
spec:
  parentRefs:
    - name: my-gateway
      kind: Gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /face
    backendRefs:
    - name: face
      port: 80
    timeouts:
      request: 300ms

For more information, check out the HTTP routing documentation. (Note that this applies only to HTTPRoute timeouts. GRPCRoute timeouts are not yet part of Gateway API.)

Gateway infrastructure labels and annotations

Gateway API implementations are responsible for creating the backing infrastructure needed to make each Gateway work. For example, implementations running in a Kubernetes cluster often create Services and Deployments, while cloud-based implementations may be creating cloud load balancer resources. In many cases, it can be helpful to be able to propagate labels or annotations to these generated resources.

In v1.2.0, the Gateway infrastructure stanza moves to the Standard channel, allowing you to specify labels and annotations for the infrastructure created by the Gateway API controller. For example, if your Gateway infrastructure is running in-cluster, you can specify both Linkerd and Istio injection using the following Gateway configuration, making it simpler for the infrastructure to be incorporated into whichever service mesh you've installed:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: meshed-gateway
  namespace: incoming
spec:
  gatewayClassName: meshed-gateway-class
  listeners:
  - name: http-listener
    protocol: HTTP
    port: 80
  infrastructure:
    labels:
      istio-injection: enabled
    annotations:
      linkerd.io/inject: enabled

For more information, check out the infrastructure API reference.

Backend protocol support

Since Kubernetes v1.20, the Service and EndpointSlice resources have supported a stable appProtocol field to allow users to specify the L7 protocol that Service supports. With the adoption of KEP 3726, Kubernetes now supports three new appProtocol values:

kubernetes.io/h2c
HTTP/2 over cleartext as described in RFC7540
kubernetes.io/ws
WebSocket over cleartext as described in RFC6445
kubernetes.io/wss
WebSocket over TLS as described in RFC6445

With Gateway API 1.2.0, support for honoring appProtocol is now Standard. For example, given the following Service:

apiVersion: v1
kind: Service
metadata:
  name: websocket-service
  namespace: my-namespace
spec:
  selector:
    app.kubernetes.io/name: websocket-app
  ports:
    - name: http
      port: 80
      targetPort: 9376
      protocol: TCP
      appProtocol: kubernetes.io/ws

then an HTTPRoute that includes this Service as a backendRef will automatically upgrade the connection to use WebSockets rather than assuming that the connection is pure HTTP.

For more information, check out GEP-1911.

New additions to experimental channel

Named rules for *Route resources

The rules field in HTTPRoute and GRPCRoute resources can now be named, in order to make it easier to reference the specific rule, for example:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: multi-color-route
  namespace: faces
spec:
  parentRefs:
    - name: my-gateway
      kind: Gateway
      port: 80
  rules:
  - name: center-rule
    matches:
    - path:
        type: PathPrefix
        value: /color/center
    backendRefs:
    - name: color-center
      port: 80
  - name: edge-rule
    matches:
    - path:
        type: PathPrefix
        value: /color/edge
    backendRefs:
    - name: color-edge
      port: 80

Logging or status messages can now refer to these two rules as center-rule or edge-rule instead of being forced to refer to them by index. For more information, see GEP-995.

HTTPRoute retry support

Gateway API 1.2.0 introduces experimental support for counted HTTPRoute retries. For example, the following HTTPRoute configuration retries requests to the /face path up to 3 times with a 500ms delay between retries:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: face-with-retries
  namespace: faces
spec:
  parentRefs:
    - name: my-gateway
      kind: Gateway
      port: 80
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /face
    backendRefs:
    - name: face
      port: 80
    retry:
      codes: [ 500, 502, 503, 504 ]
      attempts: 3
      backoff: 500ms

For more information, check out GEP 1731.

HTTPRoute percentage-based mirroring

Gateway API has long supported the Request Mirroring feature, which allows sending the same request to multiple backends. In Gateway API 1.2.0, we're introducing percentage-based mirroring, which allows you to specify a percentage of requests to mirror to a different backend. For example, the following HTTPRoute configuration mirrors 42% of requests to the color-mirror backend:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: color-mirror-route
  namespace: faces
spec:
  parentRefs:
  - name: mirror-gateway
  hostnames:
  - mirror.example
  rules:
  - backendRefs:
    - name: color
      port: 80
    filters:
    - type: RequestMirror
      requestMirror:
        backendRef:
          name: color-mirror
          port: 80
        percent: 42   # This value must be an integer.

There's also a fraction stanza which can be used in place of percent, to allow for more precise control over exactly what amount of traffic is mirrored, for example:

    ...
    filters:
    - type: RequestMirror
      requestMirror:
        backendRef:
          name: color-mirror
          port: 80
        fraction:
          numerator: 1
          denominator: 10000

This configuration mirrors 1 in 10,000 requests to the color-mirror backend, which may be relevant with very high request rates. For more details, see GEP-1731.

Additional backend TLS configuration

This release includes three additions related to TLS configuration for communications between a Gateway and a workload (a backend):

  1. A new backendTLS field on Gateway

    This new field allows you to specify the client certificate that a Gateway should use when connecting to backends.

  2. A new subjectAltNames field on BackendTLSPolicy

    Previously, the hostname field was used to configure both the SNI that a Gateway should send to a backend and the identity that should be provided by a certificate. When the new subjectAltNames field is specified, any certificate matching at least one of the specified SANs will be considered valid. This is particularly critical for SPIFFE where URI-based SANs may not be valid SNIs.

  3. A new options field on BackendTLSPolicy

    Similar to the TLS options field on Gateway Listeners, we believe the same concept will be broadly useful for TLS-specific configuration for Backend TLS.

For more information, check out GEP-3135.

More changes

For a full list of the changes included in this release, please refer to the v1.2.0 release notes.

Project updates

Beyond the technical, the v1.2 release also marks a few milestones in the life of the Gateway API project itself.

Release process improvements

Gateway API has never been intended to be a static API, and as more projects use it as a component to build on, it's become clear that we need to bring some more predictability to Gateway API releases. To that end, we're pleased - and a little nervous! - to announce that we've formalized a new release process:

  • Scoping (4-6 weeks): maintainers and community determine the set of features we want to include in the release. A particular emphasis here is getting features out of the Experimental channel — ideally this involves moving them to Standard, but it can also mean removing them.

  • GEP Iteration and Review (5-7 weeks): contributors write or update Gateway Enhancement Proposals (GEPs) for features accepted into the release, with emphasis on getting consensus around the design and graduation criteria of the feature.

  • API Refinement and Documentation (3-5 weeks): contributors implement the features in the Gateway API controllers and write the necessary documentation.

  • SIG Network Review and Release Candidates (2-4 weeks): maintainers get the required upstream review, build release candidates, and release the new version.

Gateway API 1.2.0 was the first release to use the new process, and although there are the usual rough edges of anything new, we believe that it went well. We've already completed the Scoping phase for Gateway API 1.3, with the release expected around the end of January 2025.

gwctl moves out

The gwctl CLI tool has moved into its very own repository, https://github.com/kubernetes-sigs/gwctl. gwctl has proven a valuable tool for the Gateway API community; moving it into its own repository will, we believe, make it easier to maintain and develop. As always, we welcome contributions; while still experimental, gwctl already helps make working with Gateway API a bit easier — especially for newcomers to the project!

Maintainer changes

Rounding out our changes to the project itself, we're pleased to announce that Mattia Lavacca has joined the ranks of Gateway API Maintainers! We're also sad to announce that Keith Mattix has stepped down as a GAMMA lead — happily, Mike Morris has returned to the role. We're grateful for everything Keith has done, and excited to have Mattia and Mike on board.

Try it out

Unlike other Kubernetes APIs, you don't need to upgrade to the latest version of Kubernetes to get the latest version of Gateway API. As long as you're running Kubernetes 1.26 or later, you'll be able to get up and running with this version of Gateway API.

To try out the API, follow our Getting Started Guide. As of this writing, five implementations are already conformant with Gateway API v1.2. In alphabetical order:

Get involved

There are lots of opportunities to get involved and help define the future of Kubernetes routing APIs for both ingress and service mesh.

The maintainers would like to thank everyone who's contributed to Gateway API, whether in the form of commits to the repo, discussion, ideas, or general support. We could never have gotten this far without the support of this dedicated and active community.

How we built a dynamic Kubernetes API Server for the API Aggregation Layer in Cozystack

Hi there! I'm Andrei Kvapil, but you might know me as @kvaps in communities dedicated to Kubernetes and cloud-native tools. In this article, I want to share how we implemented our own extension api-server in the open-source PaaS platform, Cozystack.

Kubernetes truly amazes me with its powerful extensibility features. You're probably already familiar with the controller concept and frameworks like kubebuilder and operator-sdk that help you implement it. In a nutshell, they allow you to extend your Kubernetes cluster by defining custom resources (CRDs) and writing additional controllers that handle your business logic for reconciling and managing these kinds of resources. This approach is well-documented, with a wealth of information available online on how to develop your own operators.

However, this is not the only way to extend the Kubernetes API. For more complex scenarios such as implementing imperative logic, managing subresources, and dynamically generating responses—the Kubernetes API aggregation layer provides an effective alternative. Through the aggregation layer, you can develop a custom extension API server and seamlessly integrate it within the broader Kubernetes API framework.

In this article, I will explore the API aggregation layer, the types of challenges it is well-suited to address, cases where it may be less appropriate, and how we utilized this model to implement our own extension API server in Cozystack.

What Is the API Aggregation Layer?

First, let's get definitions straight to avoid any confusion down the road. The API aggregation layer is a feature in Kubernetes, while an extension api-server is a specific implementation of an API server for the aggregation layer. An extension API server is just like the standard Kubernetes API server, except it runs separately and handles requests for your specific resource types.

So, the aggregation layer lets you write your own extension API server, integrate it easily into Kubernetes, and directly process requests for resources in a certain group. Unlike the CRD mechanism, the extension API is registered in Kubernetes as an APIService, telling Kubernetes to consider this new API server and acknowledge that it serves certain APIs.

You can execute this command to list all registered apiservices:

kubectl get apiservices.apiregistration.k8s.io

Example APIService:

NAME                          	SERVICE                   	AVAILABLE   AGE
v1alpha1.apps.cozystack.io    	cozy-system/cozystack-api 	True    	7h29m

As soon as the Kubernetes api-server receives requests for resources in the group v1alpha1.apps.cozystack.io, it redirects all those requests to our extension api-server, which can handle them based on the business logic we've built into it.

When to use the API Aggregation Layer

The API Aggregation Layer helps solve several issues where the usual CRD mechanism might not enough. Let's break them down.

Imperative Logic and Subresources

Besides regular resources, Kubernetes also has something called subresources.

In Kubernetes, subresources are additional actions or operations you can perform on primary resources (like Pods, Deployments, Services) via the Kubernetes API. They provide interfaces to manage specific aspects of resources without affecting the entire object.

A simple example is status, which is traditionally exposed as a separate subresource that you can access independently from the parent object. The status field isn't meant to be changed

But beyond /status, Pods in Kubernetes also have subresources like /exec, /portforward, and /log. Interestingly, instead of the usual declarative resources in Kubernetes, these represent endpoints for imperative operations like viewing logs, proxying connections, executing commands in a running container, and so on.

To support such imperative commands on your own API, you need implement an extension API and an extension API server. Here are some well-known examples:

  • KubeVirt: An add-on for Kubernetes that extends its API capabilities to run traditional virtual machines. The extension api-server created as part of KubeVirt handles subresources like /restart, /console, and /vnc for virtual machines.
  • Knative: A Kubernetes add-on that extends its capabilities for serverless computing, implementing the /scale subresource to set up autoscaling for its resource types.

By the way, even though subresource logic in Kubernetes can be imperative, you can manage access to them declaratively using Kubernetes standard RBAC model.

For example this way you can control access to the /log and /exec subresources of the Pod kind:

kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: default
  name: pod-and-pod-logs-reader
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]

You're not tied to use etcd

Usually, the Kubernetes API server uses etcd for its backend. However, implementing your own API server doesn't lock you into using only etcd. If it doesn't make sense to store your server's state in etcd, you can store information in any other system and generate responses on the fly. Here are a few cases to illustrate:

  • metrics-server is a standard extension for Kubernetes which allows you to view real-time metrics of your nodes and pods. It defines alternative Pod and Node kinds in its own metrics.k8s.io API. Requests to these resources are translated into metrics directly from Kubelet. So when you run kubectl top node or kubectl top pod, metrics-server fetches metrics from cAdvisor in real-time. It then returns these metrics to you. Since the information is generated in real-time and is only relevant at the moment of the request, there is no need to store it in etcd. This approach saves resources.

  • If needed, you can use a backend other than etcd. You can even implement a Kubernetes-compatible API for it. For example, if you use Postgres, you can create a transparent representation of its entities in the Kubernetes API. Eg. databases, users, and grants within Postgres would appear as regular Kubernetes resources, thanks to your extension API server. You could manage them using kubectl or any other Kubernetes-compatible tool. Unlike controllers, which implement business logic using custom resources and reconciliation methods, an extension API server eliminates the need for separate controllers for every kind. This means you don't have to sync state between the Kubernetes API and your backend.

One-Time resources

  • Kubernetes has a special API used to provide users with information about their permissions. This is implemented using the SelfSubjectAccessReview API. One unusual detail of these resources is that you can't view them using get or list verbs. You can only create them (using the create verb) and receive output with information about what you have access to at that moment.

    If you try to run kubectl get selfsubjectaccessreviews directly, you'll just get an error like this:

    Error from server (MethodNotAllowed): the server does not allow this method on the requested resource
    

    The reason is that the Kubernetes API server doesn't support any other interaction with this type of resource (you can only CREATE them).

    The SelfSubjectAccessReview API supports commands such as:

    kubectl auth can-i create deployments --namespace dev
    

    When you run the command above, kubectl creates a SelfSubjectAccessReview using the Kubernetes API. This allows Kubernetes to fetch a list of possible permissions for your user. Kubernetes then generates a personalized response to your request in real-time. This logic is different from a scenario where this resource is simply stored in etcd.

  • Similarly, in KubeVirt's CDI (Containerized Data Importer) extension, which allows file uploads into a PVC from a local machine using the virtctl tool, a special token is required before the upload process begins. This token is generated by creating an UploadTokenRequest resource via the Kubernetes API. Kubernetes routes (proxies) all UploadTokenRequest resource creation requests to the CDI extension API server, which generates and returns the token in response.

Full control over conversion, validation, and output formatting

  • Your own API server can have all the capabilities of the vanilla Kubernetes API server. The resources you create in your API server can be validated immediately on the server side without additional webhooks. While CRDs also support server-side validation using Common Expression Language (CEL) for declarative validation and ValidatingAdmissionPolicies without the need for webhooks, a custom API server allows for more complex and tailored validation logic if needed.

    Kubernetes allows you to serve multiple API versions for each resource type, traditionally v1alpha1, v1beta1 and v1. Only one version can be specified as the storage version. All requests to other versions must be automatically converted to the version specified as storage version. With CRDs, this mechanism is implemented using conversion webhooks. Whereas in an extension API server, you can implement your own conversion mechanism, choose to mix up different storage versions (one object might be serialized as v1, another as v2), or rely on an external backing API.

  • Directly implementing the Kubernetes API lets you format table output however you like and doesn't force you to follow the additionalPrinterColumns logic in CRDs. Instead, you can write your own formatter that formats the table output and custom fields in it. For example, when using additionalPrinterColumns, you can display field values only following the JSONPath logic. In your own API server, you can generate and insert values on the fly, formatting the table output as you wish.

Dynamic resource registration

  • The resources served by an extension api-server don't need to be pre-registered as CRDs. Once your extension API server is registered using an APIService, Kubernetes starts polling it to discover APIs and resources it can serve. After receiving a discovery response, the Kubernetes API server automatically registers all available types for this API group. Although this isn't considered common practice, you can implement logic that dynamically registers the resource types you need in your Kubernetes cluster.

When not to use the API Aggregation Layer

There are some anti-patterns where using the API Aggregation Layer isn't recommended. Let's go through them.

Unstable backend

If your API server stops responding for some reason due to an unavailable backend or other issues it may block some Kubernetes functionality. For example, when deleting namespaces, Kubernetes will wait for a response from your API server to see if there are any remaining resources. If the response doesn't come, the namespace deletion will be blocked.

Also, you might have encountered a situation where, when the metrics-server is unavailable, an extra message appears in stderr after every API request (even unrelated to metrics) stating that metrics.k8s.io is unavailable. This is another example of how using the API Aggregation Layer can lead to problems when the api-server handling requests is unavailable.

Slow requests

If you can't guarantee an instant response for user requests, it's better to consider using a CustomResourceDefinition and controller. Otherwise, you might make your cluster less stable. Many projects implement an extension API server only for a limited set of resources, particularly for imperative logic and subresources. This recommendation is also mentioned in the official Kubernetes documentation.

Why we needed it in Cozystack

As a reminder, we're developing the open-source PaaS platform Cozystack, which can also be used as a framework for building your own private cloud. Therefore, the ability to easily extend the platform is crucial for us.

Cozystack is built on top of FluxCD. Any application is packaged into its own Helm chart, ready for deployment in a tenant namespace. Deploying any application on the platform is done by creating a HelmRelease resource, specifying the chart name and parameters for the application. All the rest logic is handled by FluxCD. This pattern allows us to easily extend the platform with new applications and provide the ability to create new applications that just need to be packaged into the appropriate Helm chart.

Interface of the Cozystack platform

Interface of the Cozystack platform

So, in our platform, everything is configured as HelmRelease resources. However, we ran into two problems: limitations of the RBAC model and the need for a public API. Let's delve into these

Limitations of the RBAC model

The widely-deployed RBAC system in Kubernetes doesn't allow you to restrict access to a list of resources of the same kind based on labels or specific fields in the spec. When creating a role, you can limit access across the resources in the same kind only by specifying specific resource names in resourceNames. For verbs like get or update it will work. However, filtering by resourceNames using list verb doesn't work like that. Thus you can limit listing certain resources by kind but not by name.

  • Kubernetes has a special API used to provide users with information about their permissions. This is implemented using the SelfSubjectAccessReview API. One unusual detail of these resources is that you can't view them using get or list verbs. You can only create them (using the create verb) and receive output with information about what you have access to at that moment.

So, we decided to introduce new resource types based on the names of the Helm charts they use and generate the list of available kinds dynamically at runtime in our extension api-server. This way, we can reuse Kubernetes standard RBAC model to manage access to specific resource types.

Need for a public API

Since our platform provides capabilities for deploying various managed services, we want to organize public access to the platform's API. However, we can't allow users to interact directly with resources like HelmRelease because that would let them specify arbitrary names and parameters for Helm charts to deploy, potentially compromising our system.

We wanted to give users the ability to deploy a specific service simply by creating the resource with corresponding kind in Kubernetes. The type of this resource should be named the same as the chart from which it's deployed. Here are some examples:

  • kind: Kuberneteschart: kubernetes
  • kind: Postgreschart: postgres
  • kind: Redischart: redis
  • kind: VirtualMachinechart: virtual-machine

Moreover, we don't want to have to add a new type to codegen and recompile our extension API server every time we add a new chart for it to start being served. The schema update should be done dynamically or provided via a ConfigMap by the administrator.

Two-Way conversion

Currently, we already have integrations and a dashboard that continue to use HelmRelease resources. At this stage, we didn't want to lose the ability to support this API. Considering that we're simply translating one resource into another, support is maintained and it works both ways. If you create a HelmRelease, you'll get a custom resource in Kubernetes, and if you create a custom resource in Kubernetes, it will also be available as a HelmRelease.

We don't have any additional controllers that synchronize state between these resources. All requests to resources in our extension API server are transparently proxied to HelmRelease and vice versa. This eliminates intermediate states and the need to write controllers and synchronization logic.

Implementation

To implement the Aggregation API, you might consider starting with the following projects:

  • apiserver-builder: Currently in alpha and hasn't been updated for two years. It works like kubebuilder, providing a framework for creating an extension API server, allowing you to sequentially create a project structure and generate code for your resources.
  • sample-apiserver: A ready-made example of an implemented API server, based on official Kubernetes libraries, which you can use as a foundation for your project.

For practical reasons, we chose the second project. Here's what we needed to do:

Disable etcd support

In our case, we don't need it since all resources are stored directly in the Kubernetes API.

You can disable etcd options by passing nil to RecommendedOptions.Etcd:

Generate a common resource kind

We called it Application, and it looks like this:

This is a generic type used for any application type, and its handling logic is the same for all charts.

Configure configuration loading

Since we want to configure our extension api-server via a config file, we formed the config structure in Go:

We also modified the resource registration logic so that the resources we create are registered in scheme with different Kind values:

As a result, we got a config where you can pass all possible types and specify what they should map to:

Implement our own registry

To store state not in etcd but translate it directly into Kubernetes HelmRelease resources (and vice versa), we wrote conversion functions from Application to HelmRelease and from HelmRelease to Application:

We implemented logic to filter resources by chart name, sourceRef, and prefix in the HelmRelease name:

Then, using this logic, we implemented the methods Get(), Delete(), List(), Create().

You can see the full example here:

At the end of each method, we set the correct Kind and return an unstructured.Unstructured{} object so that Kubernetes serializes the object correctly. Otherwise, it would always serialize them with kind: Application, which we don't want.

What did we achieve?

In Cozystack, all our types from the ConfigMap are now available in Kubernetes as-is:

kubectl api-resources | grep cozystack
buckets                   apps.cozystack.io/v1alpha1      true        Bucket
clickhouses               apps.cozystack.io/v1alpha1      true        ClickHouse
etcds                     apps.cozystack.io/v1alpha1      true        Etcd
ferretdb                  apps.cozystack.io/v1alpha1      true        FerretDB
httpcaches                apps.cozystack.io/v1alpha1      true        HTTPCache
ingresses                 apps.cozystack.io/v1alpha1      true        Ingress
kafkas                    apps.cozystack.io/v1alpha1      true        Kafka
kuberneteses              apps.cozystack.io/v1alpha1      true        Kubernetes
monitorings               apps.cozystack.io/v1alpha1      true        Monitoring
mysqls                    apps.cozystack.io/v1alpha1      true        MySQL
natses                    apps.cozystack.io/v1alpha1      true        NATS
postgreses                apps.cozystack.io/v1alpha1      true        Postgres
rabbitmqs                 apps.cozystack.io/v1alpha1      true        RabbitMQ
redises                   apps.cozystack.io/v1alpha1      true        Redis
seaweedfses               apps.cozystack.io/v1alpha1      true        SeaweedFS
tcpbalancers              apps.cozystack.io/v1alpha1      true        TCPBalancer
tenants                   apps.cozystack.io/v1alpha1      true        Tenant
virtualmachines           apps.cozystack.io/v1alpha1      true        VirtualMachine
vmdisks                   apps.cozystack.io/v1alpha1      true        VMDisk
vminstances               apps.cozystack.io/v1alpha1      true        VMInstance
vpns                      apps.cozystack.io/v1alpha1      true        VPN

We can work with them just like regular Kubernetes resources.

Listing S3 Buckets:

kubectl get buckets.apps.cozystack.io -n tenant-kvaps

Example output:

NAME         READY   AGE    VERSION
foo          True    22h    0.1.0
testaasd     True    27h    0.1.0

Listing Kubernetes Clusters:

kubectl get kuberneteses.apps.cozystack.io -n tenant-kvaps

Example output:

NAME     READY   AGE    VERSION
abc      False   19h    0.14.0
asdte    True    22h    0.13.0

Listing Virtual Machine Disks:

kubectl get vmdisks.apps.cozystack.io -n tenant-kvaps

Example output:

NAME               READY   AGE    VERSION
docker             True    21d    0.1.0
test               True    18d    0.1.0
win2k25-iso        True    21d    0.1.0
win2k25-system     True    21d    0.1.0

Listing Virtual Machine Instances:

kubectl get vminstances.apps.cozystack.io -n tenant-kvaps

Example output:

NAME        READY   AGE    VERSION
docker      True    21d    0.1.0
test        True    18d    0.1.0
win2k25     True    20d    0.1.0

We can create, modify, and delete each of them, and any interaction with them will be translated into HelmRelease resources, while also applying the resource structure and prefix in the name.

To see all related Helm releases:

kubectl get helmreleases -n tenant-kvaps -l cozystack.io/ui

Example output:

NAME                     AGE    READY
bucket-foo               22h    True
bucket-testaasd          27h    True
kubernetes-abc           19h    False
kubernetes-asdte         22h    True
redis-test               18d    True
redis-yttt               12d    True
vm-disk-docker           21d    True
vm-disk-test             18d    True
vm-disk-win2k25-iso      21d    True
vm-disk-win2k25-system   21d    True
vm-instance-docker       21d    True
vm-instance-test         18d    True
vm-instance-win2k25      20d    True

Next Steps

We don’t intend to stop here with our API. In the future, we plan to add new features:

  • Add validation based on an OpenAPI spec generated directly from Helm charts.
  • Develop a controller that collects release notes from deployed releases and shows users access information for specific services.
  • Revamp our dashboard to work directly with the new API.

Conclusion

The API Aggregation Layer allowed us to quickly and efficiently solve our problem by providing a flexible mechanism for extending the Kubernetes API with dynamically registered resources and converting them on the fly. Ultimately, this made our platform even more flexible and extensible without the need to write code for each new resource.

You can test the API yourself in the open-source PaaS platform Cozystack, starting from version v0.18.

Kubernetes v1.32 sneak peek

As we get closer to the release date for Kubernetes v1.32, the project develops and matures. Features may be deprecated, removed, or replaced with better ones for the project's overall health.

This blog outlines some of the planned changes for the Kubernetes v1.32 release, that the release team feels you should be aware of, for the continued maintenance of your Kubernetes environment and keeping up to date with the latest changes. Information listed below is based on the current status of the v1.32 release and may change before the actual release date.

The Kubernetes API removal and deprecation process

The Kubernetes project has a well-documented deprecation policy for features. This policy states that stable APIs may only be deprecated when a newer, stable version of that API is available and that APIs have a minimum lifetime for each stability level. A deprecated API has been marked for removal in a future Kubernetes release will continue to function until removal (at least one year from the deprecation). Its usage will result in a warning being displayed. Removed APIs are no longer available in the current version, so you must migrate to use the replacement instead.

  • Generally available (GA) or stable API versions may be marked as deprecated but must not be removed within a major version of Kubernetes.

  • Beta or pre-release API versions must be supported for 3 releases after the deprecation.

  • Alpha or experimental API versions may be removed in any release without prior deprecation notice; this process can become a withdrawal in cases where a different implementation for the same feature is already in place.

Whether an API is removed due to a feature graduating from beta to stable or because that API did not succeed, all removals comply with this deprecation policy. Whenever an API is removed, migration options are communicated in the deprecation guide.

Note on the withdrawal of the old DRA implementation

The enhancement #3063 introduced Dynamic Resource Allocation (DRA) in Kubernetes 1.26.

However, in Kubernetes v1.32, this approach to DRA will be significantly changed. Code related to the original implementation will be removed, leaving KEP #4381 as the "new" base functionality.

The decision to change the existing approach originated from its incompatibility with cluster autoscaling as resource availability was non-transparent, complicating decision-making for both Cluster Autoscaler and controllers. The newly added Structured Parameter model substitutes the functionality.

This removal will allow Kubernetes to handle new hardware requirements and resource claims more predictably, bypassing the complexities of back and forth API calls to the kube-apiserver.

Please also see the enhancement issue #3063 to find out more.

API removal

There is only a single API removal planned for Kubernetes v1.32:

  • The flowcontrol.apiserver.k8s.io/v1beta3 API version of FlowSchema and PriorityLevelConfiguration has been removed. To prepare for this, you can edit your existing manifests and rewrite client software to use the flowcontrol.apiserver.k8s.io/v1 API version, available since v1.29. All existing persisted objects are accessible via the new API. Notable changes in flowcontrol.apiserver.k8s.io/v1beta3 include that the PriorityLevelConfiguration spec.limited.nominalConcurrencyShares field only defaults to 30 when unspecified, and an explicit value of 0 is not changed to 30.

For more information, please refer to the API deprecation guide.

Sneak peek of Kubernetes v1.32

The following list of enhancements is likely to be included in the v1.32 release. This is not a commitment and the release content is subject to change.

Even more DRA enhancements!

In this release, like the previous one, the Kubernetes project continues proposing a number of enhancements to the Dynamic Resource Allocation (DRA), a key component of the Kubernetes resource management system. These enhancements aim to improve the flexibility and efficiency of resource allocation for workloads that require specialized hardware, such as GPUs, FPGAs and network adapters. This release introduces improvements, including the addition of resource health status in the Pod status, as outlined in KEP #4680.

Add resource health status to the Pod status

It isn't easy to know when a Pod uses a device that has failed or is temporarily unhealthy. KEP #4680 proposes exposing device health via Pod status, making troubleshooting of Pod crashes easier.

Windows strikes back!

KEP #4802 adds support for graceful shutdowns of Windows nodes in Kubernetes clusters. Before this release, Kubernetes provided graceful node shutdown functionality for Linux nodes but lacked equivalent support for Windows. This enhancement enables the kubelet on Windows nodes to handle system shutdown events properly. Doing so, it ensures that Pods running on Windows nodes are gracefully terminated, allowing workloads to be rescheduled without disruption. This improvement enhances the reliability and stability of clusters that include Windows nodes, especially during a planned maintenance or any system updates.

Allow special characters in environment variables

With the graduation of this enhancement to beta, Kubernetes now allows almost all printable ASCII characters (excluding "=") to be used as environment variable names. This change addresses the limitations previously imposed on variable naming, facilitating a broader adoption of Kubernetes by accommodating various application needs. The relaxed validation will be enabled by default via the RelaxedEnvironmentVariableValidation feature gate, ensuring that users can easily utilize environment variables without strict constraints, enhancing flexibility for developers working with applications like .NET Core that require special characters in their configurations.

Make Kubernetes aware of the LoadBalancer behavior

KEP #1860 graduates to GA, introducing the ipMode field for a Service of type: LoadBalancer, which can be set to either "VIP" or "Proxy". This enhancement is aimed at improving how cloud providers load balancers interact with kube-proxy and it is a change transparent to the end user. The existing behavior of kube-proxy is preserved when using "VIP", where kube-proxy handles the load balancing. Using "Proxy" results in traffic sent directly to the load balancer, providing cloud providers greater control over relying on kube-proxy; this means that you could see an improvement in the performance of your load balancer for some cloud providers.

Retry generate name for resources

This enhancement improves how name conflicts are handled for Kubernetes resources created with the generateName field. Previously, if a name conflict occurred, the API server returned a 409 HTTP Conflict error and clients had to manually retry the request. With this update, the API server automatically retries generating a new name up to seven times in case of a conflict. This significantly reduces the chances of collision, ensuring smooth generation of up to 1 million names with less than a 0.1% probability of a conflict, providing more resilience for large-scale workloads.

Want to know more?

New features and deprecations are also announced in the Kubernetes release notes. We will formally announce what's new in Kubernetes v1.32 as part of the CHANGELOG for this release.

You can see the announcements of changes in the release notes for:

Spotlight on Kubernetes Upstream Training in Japan

We are organizers of Kubernetes Upstream Training in Japan. Our team is composed of members who actively contribute to Kubernetes, including individuals who hold roles such as member, reviewer, approver, and chair.

Our goal is to increase the number of Kubernetes contributors and foster the growth of the community. While Kubernetes community is friendly and collaborative, newcomers may find the first step of contributing to be a bit challenging. Our training program aims to lower that barrier and create an environment where even beginners can participate smoothly.

What is Kubernetes upstream training in Japan?

Upstream Training in 2022

Our training started in 2019 and is held 1 to 2 times a year. Initially, Kubernetes Upstream Training was conducted as a co-located event of KubeCon (Kubernetes Contributor Summit), but we launched Kubernetes Upstream Training in Japan with the aim of increasing Japanese contributors by hosting a similar event in Japan.

Before the pandemic, the training was held in person, but since 2020, it has been conducted online. The training offers the following content for those who have not yet contributed to Kubernetes:

  • Introduction to Kubernetes community
  • Overview of Kubernetes codebase and how to create your first PR
  • Tips and encouragement to lower participation barriers, such as language
  • How to set up the development environment
  • Hands-on session using kubernetes-sigs/contributor-playground

At the beginning of the program, we explain why contributing to Kubernetes is important and who can contribute. We emphasize that contributing to Kubernetes allows you to make a global impact and that Kubernetes community is looking forward to your contributions!

We also explain Kubernetes community, SIGs, and Working Groups. Next, we explain the roles and responsibilities of Member, Reviewer, Approver, Tech Lead, and Chair. Additionally, we introduce the communication tools we primarily use, such as Slack, GitHub, and mailing lists. Some Japanese speakers may feel that communicating in English is a barrier. Additionally, those who are new to the community need to understand where and how communication takes place. We emphasize the importance of taking that first step, which is the most important aspect we focus on in our training!

We then go over the structure of Kubernetes codebase, the main repositories, how to create a PR, and the CI/CD process using Prow. We explain in detail the process from creating a PR to getting it merged.

After several lectures, participants get to experience hands-on work using kubernetes-sigs/contributor-playground, where they can create a simple PR. The goal is for participants to get a feel for the process of contributing to Kubernetes.

At the end of the program, we also provide a detailed explanation of setting up the development environment for contributing to the kubernetes/kubernetes repository, including building code locally, running tests efficiently, and setting up clusters.

Interview with participants

We conducted interviews with those who participated in our training program. We asked them about their reasons for joining, their impressions, and their future goals.

Keita Mochizuki (NTT DATA Group Corporation)

Keita Mochizuki is a contributor who consistently contributes to Kubernetes and related projects. Keita is also a professional in container security and has recently published a book. Additionally, he has made available a Roadmap for New Contributors, which is highly beneficial for those new to contributing.

Junya: Why did you decide to participate in Kubernetes Upstream Training?

Keita: Actually, I participated twice, in 2020 and 2022. In 2020, I had just started learning about Kubernetes and wanted to try getting involved in activities outside of work, so I signed up after seeing the event on Twitter by chance. However, I didn't have much knowledge at the time, and contributing to OSS felt like something beyond my reach. As a result, my understanding after the training was shallow, and I left with more of a "hmm, okay" feeling.

In 2022, I participated again when I was at a stage where I was seriously considering starting contributions. This time, I did prior research and was able to resolve my questions during the lectures, making it a very productive experience.

Junya: How did you feel after participating?

Keita: I felt that the significance of this training greatly depends on the participant's mindset. The training itself consists of general explanations and simple hands-on exercises, but it doesn't mean that attending the training will immediately lead to contributions.

Junya: What is your purpose for contributing?

Keita: My initial motivation was to "gain a deep understanding of Kubernetes and build a track record," meaning "contributing itself was the goal." Nowadays, I also contribute to address bugs or constraints I discover during my work. Additionally, through contributing, I've become less hesitant to analyze undocumented features directly from the source code.

Junya: What has been challenging about contributing?

Keita: The most difficult part was taking the first step. Contributing to OSS requires a certain level of knowledge, and leveraging resources like this training and support from others was essential. One phrase that stuck with me was, "Once you take the first step, it becomes easier to move forward." Also, in terms of continuing contributions as part of my job, the most challenging aspect is presenting the outcomes as achievements. To keep contributing over time, it's important to align it with business goals and strategies, but upstream contributions don't always lead to immediate results that can be directly tied to performance. Therefore, it's crucial to ensure mutual understanding with managers and gain their support.

Junya: What are your future goals?

Keita: My goal is to contribute to areas with a larger impact. So far, I've mainly contributed by fixing smaller bugs as my primary focus was building a track record, but moving forward, I'd like to challenge myself with contributions that have a greater impact on Kubernetes users or that address issues related to my work. Recently, I've also been working on reflecting the changes I've made to the codebase into the official documentation, and I see this as a step toward achieving my goals.

Junya: Thank you very much!

Yoshiki Fujikane (CyberAgent, Inc.)

Yoshiki Fujikane is one of the maintainers of PipeCD, a CNCF Sandbox project. In addition to developing new features for Kubernetes support in PipeCD, Yoshiki actively participates in community management and speaks at various technical conferences.

Junya: Why did you decide to participate in the Kubernetes Upstream Training?

Yoshiki: At the time I participated, I was still a student. I had only briefly worked with EKS, but I thought Kubernetes seemed complex yet cool, and I was casually interested in it. Back then, OSS felt like something out of reach, and upstream development for Kubernetes seemed incredibly daunting. While I had always been interested in OSS, I didn't know where to start. It was during this time that I learned about the Kubernetes Upstream Training and decided to take the challenge of contributing to Kubernetes.

Junya: What were your impressions after participating?

Yoshiki: I found it extremely valuable as a way to understand what it's like to be part of an OSS community. At the time, my English skills weren't very strong, so accessing primary sources of information felt like a big hurdle for me. Kubernetes is a very large project, and I didn't have a clear understanding of the overall structure, let alone what was necessary for contributing. The upstream training provided a Japanese explanation of the community structure and allowed me to gain hands-on experience with actual contributions. Thanks to the guidance I received, I was able to learn how to approach primary sources and use them as entry points for further investigation, which was incredibly helpful. This experience made me realize the importance of organizing and reviewing primary sources, and now I often dive into GitHub issues and documentation when something piques my interest. As a result, while I am no longer contributing to Kubernetes itself, the experience has been a great foundation for contributing to other projects.

Junya: What areas are you currently contributing to, and what are the other projects you're involved in?

Yoshiki: Right now, I'm no longer working with Kubernetes, but instead, I'm a maintainer of PipeCD, a CNCF Sandbox project. PipeCD is a CD tool that supports GitOps-style deployments for various application platforms. The tool originally started as an internal project at CyberAgent. With different teams adopting different platforms, PipeCD was developed to provide a unified CD platform with a consistent user experience. Currently, it supports Kubernetes, AWS ECS, Lambda, Cloud Run, and Terraform.

Junya: What role do you play within the PipeCD team?

Yoshiki: I work full-time on improving and developing Kubernetes-related features within the team. Since we provide PipeCD as a SaaS internally, my main focus is on adding new features and improving existing ones as part of that support. In addition to code contributions, I also contribute by giving talks at various events and managing community meetings to help grow the PipeCD community.

Junya: Could you explain what kind of improvements or developments you are working on with regards to Kubernetes?

Yoshiki: PipeCD supports GitOps and Progressive Delivery for Kubernetes, so I'm involved in the development of those features. Recently, I've been working on features that streamline deployments across multiple clusters.

Junya: Have you encountered any challenges while contributing to OSS?

Yoshiki: One challenge is developing features that maintain generality while meeting user use cases. When we receive feature requests while operating the internal SaaS, we first consider adding features to solve those issues. At the same time, we want PipeCD to be used by a broader audience as an OSS tool. So, I always think about whether a feature designed for one use case could be applied to another, ensuring the software remains flexible and widely usable.

Junya: What are your goals moving forward?

Yoshiki: I want to focus on expanding PipeCD's functionality. Currently, we are developing PipeCD under the slogan "One CD for All." As I mentioned earlier, it supports Kubernetes, AWS ECS, Lambda, Cloud Run, and Terraform, but there are many other platforms out there, and new platforms may emerge in the future. For this reason, we are currently developing a plugin system that will allow users to extend PipeCD on their own, and I want to push this effort forward. I'm also working on features for multi-cluster deployments in Kubernetes, and I aim to continue making impactful contributions.

Junya: Thank you very much!

Future of Kubernetes upstream training

We plan to continue hosting Kubernetes Upstream Training in Japan and look forward to welcoming many new contributors. Our next session is scheduled to take place at the end of November during CloudNative Days Winter 2024.

Moreover, our goal is to expand these training programs not only in Japan but also around the world. Kubernetes celebrated its 10th anniversary this year, and for the community to become even more active, it's crucial for people across the globe to continue contributing. While Upstream Training is already held in several regions, we aim to bring it to even more places.

We hope that as more people join Kubernetes community and contribute, our community will become even more vibrant!

Announcing the 2024 Steering Committee Election Results

The 2024 Steering Committee Election is now complete. The Kubernetes Steering Committee consists of 7 seats, 3 of which were up for election in 2024. Incoming committee members serve a term of 2 years, and all members are elected by the Kubernetes Community.

This community body is significant since it oversees the governance of the entire Kubernetes project. With that great power comes great responsibility. You can learn more about the steering committee’s role in their charter.

Thank you to everyone who voted in the election; your participation helps support the community’s continued health and success.

Results

Congratulations to the elected committee members whose two year terms begin immediately (listed in alphabetical order by GitHub handle):

They join continuing members:

Benjamin Elder is a returning Steering Committee Member.

Big thanks!

Thank you and congratulations on a successful election to this round’s election officers:

Thanks to the Emeritus Steering Committee Members. Your service is appreciated by the community:

And thank you to all the candidates who came forward to run for election.

Get involved with the Steering Committee

This governing body, like all of Kubernetes, is open to all. You can follow along with Steering Committee meeting notes and weigh in by filing an issue or creating a PR against their repo. They have an open meeting on the first Monday at 8am PT of every month. They can also be contacted at their public mailing list steering@kubernetes.io.

You can see what the Steering Committee meetings are all about by watching past meetings on the YouTube Playlist.

If you want to meet some of the newly elected Steering Committee members, join us for the Steering AMA at the Kubernetes Contributor Summit North America 2024 in Salt Lake City.


This post was adapted from one written by the Contributor Comms Subproject. If you want to write stories about the Kubernetes community, learn more about us.

Spotlight on CNCF Deaf and Hard-of-hearing Working Group (DHHWG)

In recognition of Deaf Awareness Month and the importance of inclusivity in the tech community, we are spotlighting Catherine Paganini, facilitator and one of the founding members of CNCF Deaf and Hard-of-Hearing Working Group (DHHWG). In this interview, Sandeep Kanabar, a deaf member of the DHHWG and part of the Kubernetes SIG ContribEx Communications team, sits down with Catherine to explore the impact of the DHHWG on cloud native projects like Kubernetes.

Sandeep’s journey is a testament to the power of inclusion. Through his involvement in the DHHWG, he connected with members of the Kubernetes community who encouraged him to join SIG ContribEx - the group responsible for sustaining the Kubernetes contributor experience. In an ecosystem where open-source projects are actively seeking contributors and maintainers, this story highlights how important it is to create pathways for underrepresented groups, including those with disabilities, to contribute their unique perspectives and skills.

In this interview, we delve into Catherine’s journey, the challenges and triumphs of establishing the DHHWG, and the vision for a more inclusive future in cloud native. We invite Kubernetes contributors, maintainers, and community members to reflect on the significance of empathy, advocacy, and community in fostering a truly inclusive environment for all, and to think about how they can support efforts to increase diversity and accessibility within their own projects.

Introduction

Sandeep Kanabar (SK): Hello Catherine, could you please introduce yourself, share your professional background, and explain your connection to the Kubernetes ecosystem?

Catherine Paganini (CP): I'm the Head of Marketing at Buoyant, the creator of Linkerd, the CNCF-graduated service mesh, and 5th CNCF project. Four years ago, I started contributing to open source. The initial motivation was to make cloud native concepts more accessible to newbies and non-technical people. Without a technical background, it was hard for me to understand what Kubernetes, containers, service meshes, etc. mean. All content was targeted at engineers already familiar with foundational concepts. Clearly, I couldn't be the only one struggling with wrapping my head around cloud native.

My first contribution was the CNCF Landscape Guide, which I co-authored with my former colleague Jason Morgan. Next, we started the CNCF Glossary, which explains cloud native concepts in simple terms. Today, the glossary has been (partially) localised into 14 languages!

Currently, I'm the co-chair of the TAG Contributor Strategy and the Facilitator of the Deaf and Hard of Hearing Working Group (DHHWG) and Blind and Visually Impaired WG (BVIWG), which is still in formation. I'm also working on a new Linux Foundation (LF) initiative called ABIDE (Accessibility and Belonging through Inclusion, Diversity, and Equity), so stay tuned to learn more about it!

Motivation and early milestones

SK: That's inspiring! Building on your passion for accessibility, what motivated you to facilitate the creation of the DHHWG? Was there a speecifc moment or experience that sparked this initiative?

CP: Last year at KubeCon Amsterdam, I learned about a great initiative by Jay Tihema that creates pathways for Maori youth into cloud native and open source. While telling my CODA (children of deaf adults) high school friend about it, I thought it'd be great to create something similar for deaf folks. A few months later, I posted about it in a LinkedIn post that the CNCF shared. Deaf people started to reach out, wanting to participate. And the rest is history.

SK: Speaking of history, since its launch, how has the DHHWG evolved? Could you highlight some of the key milestones or achievements the group has reached recently?

CP: Our WG is about a year old. It started with a few deaf engineers and me brainstorming how to make KubeCon more accessible. We published an initial draft of Best practices for an inclusive conference and shared it with the LF events team. KubeCon Chicago was two months later, and we had a couple of deaf attendees. It was the first KubeCon accessible to deaf signers. Destiny, one of our co-chairs, even participated in a keynote panel. It was incredible how quickly everything happened!

DHHWG members at KubeCon Chicago DHHWG members at KubeCon Chicago

The team has grown since then, and we've been able to do much more. With a kiosk in the project pavilion, an open space discussion, a sign language crash course, and a few media interviews, KubeCon Paris had a stronger advocacy and outreach focus. Check out this video of our team in Paris to get a glimpse of all the different KubeCon activities — it was such a great event! The team also launched the first CNCF Community Group in sign language, Deaf in Cloud Native, a glossary team that creates sign language videos for each technical term to help standardize technical signs across the globe. It's crazy to think that it all happened within one year!

Overcoming challenges and addressing misconceptions

SK: That's remarkable progress in just a year! Building such momentum must have come with its challenges. What barriers have you encountered in facilitating the DHHWG, and how did you and the group work to overcome them?

CP: The support from the community, LF, and CNCF has been incredible. The fact that we achieved so much is proof of it. The challenges are more in helping some team members overcome their fear of contributing. Most are new to open source, and it can be intimidating to put your work out there for everyone to see. The fear of being criticized in public is real; however, as they will hopefully realize over time, our community is incredibly supportive. Instead of criticizing, people tend to help improve the work, leading to better outcomes.

SK: Are there any misconceptions about the deaf and hard-of-hearing community in tech that you'd like to address?

CP: Deaf and hard of hearing individuals are very diverse — there is no one-size-fits-all. Some deaf people are oral (speak), others sign, while some lip read or prefer captions. It generally depends on how people grew up. While some people come from deaf families and sign language is their native language, others were born into hearing families who may or may not have learned how to sign. Some deaf people grew up surrounded by hearing people, while others grew up deeply embedded in Deaf culture. Hard-of-hearing individuals, on the other hand, typically can communicate well with hearing peers one-on-one in quiet settings, but loud environments or conversations with multiple people can make it hard to follow the conversation. Most rely heavily on captions. Each background and experience will shape their communication style and preferences. In short, what works for one person, doesn't necessarily work for others. So never assume and always ask about accessibility needs and preferences.

Impact and the role of allies

SK: Can you share some key impacts/outcomes of the conference best practices document?

CP: Here are the two most important ones: Captions should be on the monitor, not in an app. That's especially important during technical talks with live demos. Deaf and hard of hearing attendees will miss important information switching between captions on their phone and code on the screen.

Interpreters are most valuable during networking, not in talks (with captions). Most people come to conferences for the hallway track. That is no different for deaf attendees. If they can't network, they are missing out on key professional connections, affecting their career prospects.

SK: In your view, how crucial is the role of allies within the DHHWG, and what contributions have they made to the group’s success?

CP: Deaf and hard of hearing individuals are a minority and can only do so much. Allies are the key to any diversity and inclusion initiative. As a majority, allies can help spread the word and educate their peers, playing a key role in scaling advocacy efforts. They also have the power to demand change. It's easy for companies to ignore minorities, but if the majority demands that their employers be accessible, environmentally conscious, and good citizens, they will ultimately be pushed to adapt to new societal values.

Expanding DEI efforts and future vision

SK: The importance of allies in driving change is clear. Beyond the DHHWG, are you involved in any other DEI groups or initiatives within the tech community?

CP: As mentioned above, I'm working on an initiative called ABIDE, which is still work in progress. I don't want to share too much about it yet, but what I can say is that the DHHWG will be part of it and that we just started a Blind and Visually Impaired WG (BVIWG). ABIDE will start by focusing on accessibility, so if anyone reading this has an idea for another WG, please reach out to me via the CNCF Slack @Catherine Paganini.

SK: What does the future hold for the DHHWG? Can you share details about any ongoing or upcoming initiatives?

CP: I think we've been very successful in terms of visibility and awareness so far. We can't stop, though. Awareness work is ongoing, and most people in our community haven't heard about us or met anyone on our team yet, so a lot of work still lies ahead.

DHHWG members at KubeCon Paris DHHWG members at KubeCon Paris

The next step is to refocus on advocacy. The same thing we did with the conference best practices but for other areas. The goal is to help educate the community about what real accessibility looks like, how projects can be more accessible, and why employers should seriously consider deaf candidates while providing them with the tools they need to conduct successful interviews and employee onboarding. We need to capture all that in documents, publish it, and then get the word out. That last part is certainly the most challenging, but it's also where everyone can get involved.

Call to action

SK: Thank you for sharing your insights, Catherine. As we wrap up, do you have any final thoughts or a call to action for our readers?

CP: As we build our accessibility page, check in regularly to see what's new. Share the docs with your team, employer, and network — anyone, really. The more people understand what accessibility really means and why it matters, the more people will recognize when something isn't accessible, and be able to call out marketing-BS, which, unfortunately, is more often the case than not. We need allies to help push for change. No minority can do this on their own. So please learn about accessibility, keep an eye out for it, and call it out when something isn't accessible. We need your help!

Wrapping up

Catherine and the DHHWG's work exemplify the power of community and advocacy. As we celebrate Deaf Awareness Month, let's reflect on her role as an ally and consider how we can all contribute to building a more inclusive tech community, particularly within open-source projects like Kubernetes.

Together, we can break down barriers, challenge misconceptions, and ensure that everyone feels welcome and valued. By advocating for accessibility, supporting initiatives like the DHHWG, and fostering a culture of empathy, we can create a truly inclusive and welcoming space for all.

Spotlight on SIG Scheduling

In this SIG Scheduling spotlight we talked with Kensei Nakada, an approver in SIG Scheduling.

Introductions

Arvind: Hello, thank you for the opportunity to learn more about SIG Scheduling! Would you like to introduce yourself and tell us a bit about your role, and how you got involved with Kubernetes?

Kensei: Hi, thanks for the opportunity! I’m Kensei Nakada (@sanposhiho), a software engineer at Tetrate.io. I have been contributing to Kubernetes in my free time for more than 3 years, and now I’m an approver of SIG Scheduling in Kubernetes. Also, I’m a founder/owner of two SIG subprojects, kube-scheduler-simulator and kube-scheduler-wasm-extension.

About SIG Scheduling

AP: That's awesome! You've been involved with the project since a long time. Can you provide a brief overview of SIG Scheduling and explain its role within the Kubernetes ecosystem?

KN: As the name implies, our responsibility is to enhance scheduling within Kubernetes. Specifically, we develop the components that determine which Node is the best place for each Pod. In Kubernetes, our main focus is on maintaining the kube-scheduler, along with other scheduling-related components as part of our SIG subprojects.

AP: I see, got it! That makes me curious--what recent innovations or developments has SIG Scheduling introduced to Kubernetes scheduling?

KN: From a feature perspective, there have been several enhancements to PodTopologySpread recently. PodTopologySpread is a relatively new feature in the scheduler, and we are still in the process of gathering feedback and making improvements.

Most recently, we have been focusing on a new internal enhancement called QueueingHint which aims to enhance scheduling throughput. Throughput is one of our crucial metrics in scheduling. Traditionally, we have primarily focused on optimizing the latency of each scheduling cycle. QueueingHint takes a different approach, optimizing when to retry scheduling, thereby reducing the likelihood of wasting scheduling cycles.

A: That sounds interesting! Are there any other interesting topics or projects you are currently working on within SIG Scheduling?

KN: I’m leading the development of QueueingHint which I just shared. Given that it’s a big new challenge for us, we’ve been facing many unexpected challenges, especially around the scalability, and we’re trying to solve each of them to eventually enable it by default.

And also, I believe kube-scheduler-wasm-extension (a SIG subproject) that I started last year would be interesting to many people. Kubernetes has various extensions from many components. Traditionally, extensions are provided via webhooks (extender in the scheduler) or Go SDK (Scheduling Framework in the scheduler). However, these come with drawbacks - performance issues with webhooks and the need to rebuild and replace schedulers with Go SDK, posing difficulties for those seeking to extend the scheduler but lacking familiarity with it. The project is trying to introduce a new solution to this general challenge - a WebAssembly based extension. Wasm allows users to build plugins easily, without worrying about recompiling or replacing their scheduler, and sidestepping performance concerns.

Through this project, SIG Scheduling has been learning valuable insights about WebAssembly's interaction with large Kubernetes objects. And I believe the experience that we’re gaining should be useful broadly within the community, beyond SIG Scheduling.

A: Definitely! Now, there are 8 subprojects inside SIG Scheduling. Would you like to talk about them? Are there some interesting contributions by those teams you want to highlight?

KN: Let me pick up three subprojects: Kueue, KWOK and descheduler.

Kueue
Recently, many people have been trying to manage batch workloads with Kubernetes, and in 2022, Kubernetes community founded WG-Batch for better support for such batch workloads in Kubernetes. Kueue is a project that takes a crucial role for it. It’s a job queueing controller, deciding when a job should wait, when a job should be admitted to start, and when a job should be preempted. Kueue aims to be installed on a vanilla Kubernetes cluster while cooperating with existing matured controllers (scheduler, cluster-autoscaler, kube-controller-manager, etc).
KWOK
KWOK is a component in which you can create a cluster of thousands of Nodes in seconds. It’s mostly useful for simulation/testing as a lightweight cluster, and actually another SIG sub project kube-scheduler-simulator uses KWOK background.
descheduler
Descheduler is a component recreating pods that are running on undesired Nodes. In Kubernetes, scheduling constraints (PodAffinity, NodeAffinity, PodTopologySpread, etc) are honored only at Pod schedule, but it’s not guaranteed that the contrtaints are kept being satisfied afterwards. Descheduler evicts Pods violating their scheduling constraints (or other undesired conditions) so that they’re recreated and rescheduled.
Descheduling Framework
One very interesting on-going project, similar to Scheduling Framework in the scheduler, aiming to make descheduling logic extensible and allow maintainers to focus on building a core engine of descheduler.

AP: Thank you for letting us know! And I have to ask, what are some of your favorite things about this SIG?

KN: What I really like about this SIG is how actively engaged everyone is. We come from various companies and industries, bringing diverse perspectives to the table. Instead of these differences causing division, they actually generate a wealth of opinions. Each view is respected, and this makes our discussions both rich and productive.

I really appreciate this collaborative atmosphere, and I believe it has been key to continuously improving our components over the years.

Contributing to SIG Scheduling

AP: Kubernetes is a community-driven project. Any recommendations for new contributors or beginners looking to get involved and contribute to SIG scheduling? Where should they start?

KN: Let me start with a general recommendation for contributing to any SIG: a common approach is to look for good-first-issue. However, you'll soon realize that many people worldwide are trying to contribute to the Kubernetes repository.

I suggest starting by examining the implementation of a component that interests you. If you have any questions about it, ask in the corresponding Slack channel (e.g., #sig-scheduling for the scheduler, #sig-node for kubelet, etc). Once you have a rough understanding of the implementation, look at issues within the SIG (e.g., sig-scheduling), where you'll find more unassigned issues compared to good-first-issue ones. You may also want to filter issues with the kind/cleanup label, which often indicates lower-priority tasks and can be starting points.

Specifically for SIG Scheduling, you should first understand the Scheduling Framework, which is the fundamental architecture of kube-scheduler. Most of the implementation is found in pkg/scheduler. I suggest starting with ScheduleOne function and then exploring deeper from there.

Additionally, apart from the main kubernetes/kubernetes repository, consider looking into sub-projects. These typically have fewer maintainers and offer more opportunities to make a significant impact. Despite being called "sub" projects, many have a large number of users and a considerable impact on the community.

And last but not least, remember contributing to the community isn’t just about code. While I talked a lot about the implementation contribution, there are many ways to contribute, and each one is valuable. One comment to an issue, one feedback to an existing feature, one review comment in PR, one clarification on the documentation; every small contribution helps drive the Kubernetes ecosystem forward.

AP: Those are some pretty useful tips! And if I may ask, how do you assist new contributors in getting started, and what skills are contributors likely to learn by participating in SIG Scheduling?

KN: Our maintainers are available to answer your questions in the #sig-scheduling Slack channel. By participating, you'll gain a deeper understanding of Kubernetes scheduling and have the opportunity to collaborate and network with maintainers from diverse backgrounds. You'll learn not just how to write code, but also how to maintain a large project, design and discuss new features, address bugs, and much more.

Future Directions

AP: What are some Kubernetes-specific challenges in terms of scheduling? Are there any particular pain points?

KN: Scheduling in Kubernetes can be quite challenging because of the diverse needs of different organizations with different business requirements. Supporting all possible use cases in kube-scheduler is impossible. Therefore, extensibility is a key focus for us. A few years ago, we rearchitected kube-scheduler with Scheduling Framework, which offers flexible extensibility for users to implement various scheduling needs through plugins. This allows maintainers to focus on the core scheduling features and the framework runtime.

Another major issue is maintaining sufficient scheduling throughput. Typically, a Kubernetes cluster has only one kube-scheduler, so its throughput directly affects the overall scheduling scalability and, consequently, the cluster's scalability. Although we have an internal performance test (scheduler_perf), unfortunately, we sometimes overlook performance degradation in less common scenarios. It’s difficult as even small changes, which look irrelevant to performance, can lead to degradation.

AP: What are some upcoming goals or initiatives for SIG Scheduling? How do you envision the SIG evolving in the future?

KN: Our primary goal is always to build and maintain extensible and stable scheduling runtime, and I bet this goal will remain unchanged forever.

As already mentioned, extensibility is key to solving the challenge of the diverse needs of scheduling. Rather than trying to support every different use case directly in kube-scheduler, we will continue to focus on enhancing extensibility so that it can accommodate various use cases. kube-scheduler-wasm-extension that I mentioned is also part of this initiative.

Regarding stability, introducing new optimizations like QueueHint is one of our strategies. Additionally, maintaining throughput is also a crucial goal towards the future. We’re planning to enhance our throughput monitoring (ref), so that we can notice degradation as much as possible on our own before releasing. But, realistically, we can't cover every possible scenario. We highly appreciate any attention the community can give to scheduling throughput and encourage feedback and alerts regarding performance issues!

Closing Remarks

AP: Finally, what message would you like to convey to those who are interested in learning more about SIG Scheduling?

KN: Scheduling is one of the most complicated areas in Kubernetes, and you may find it difficult at first. But, as I shared earlier, you can find many opportunities for contributions, and many maintainers are willing to help you understand things. We know your unique perspective and skills are what makes our open source so powerful 😊

Feel free to reach out to us in Slack (#sig-scheduling) or meetings. I hope this article interests everyone and we can see new contributors!

AP: Thank you so much for taking the time to do this! I'm confident that many will find this information invaluable for understanding more about SIG Scheduling and for contributing to the SIG.

Kubernetes v1.31: kubeadm v1beta4

As part of the Kubernetes v1.31 release, kubeadm is adopting a new (v1beta4) version of its configuration file format. Configuration in the previous v1beta3 format is now formally deprecated, which means it's supported but you should migrate to v1beta4 and stop using the deprecated format. Support for v1beta3 configuration will be removed after a minimum of 3 Kubernetes minor releases.

In this article, I'll walk you through key changes; I'll explain about the kubeadm v1beta4 configuration format, and how to migrate from v1beta3 to v1beta4.

You can read the reference for the v1beta4 configuration format: kubeadm Configuration (v1beta4).

A list of changes since v1beta3

This version improves on the v1beta3 format by fixing some minor issues and adding a few new fields.

To put it simply,

  • Two new configuration elements: ResetConfiguration and UpgradeConfiguration
  • For InitConfiguration and JoinConfiguration, dryRun mode and nodeRegistration.imagePullSerial are supported
  • For ClusterConfiguration, there are new fields including certificateValidityPeriod, caCertificateValidityPeriod, encryptionAlgorithm, dns.disabled and proxy.disabled.
  • Support extraEnvs for all control plan components
  • extraArgs changed from a map to structured extra arguments for duplicates
  • Add a timeouts structure for init, join, upgrade and reset.

For details, you can see the official document below:

  • Support custom environment variables in control plane components under ClusterConfiguration. Use apiServer.extraEnvs, controllerManager.extraEnvs, scheduler.extraEnvs, etcd.local.extraEnvs.
  • The ResetConfiguration API type is now supported in v1beta4. Users are able to reset a node by passing a --config file to kubeadm reset.
  • dryRun mode is now configurable in InitConfiguration and JoinConfiguration.
  • Replace the existing string/string extra argument maps with structured extra arguments that support duplicates. The change applies to ClusterConfiguration - apiServer.extraArgs, controllerManager.extraArgs, scheduler.extraArgs, etcd.local.extraArgs. Also to nodeRegistrationOptions.kubeletExtraArgs.
  • Added ClusterConfiguration.encryptionAlgorithm that can be used to set the asymmetric encryption algorithm used for this cluster's keys and certificates. Can be one of "RSA-2048" (default), "RSA-3072", "RSA-4096" or "ECDSA-P256".
  • Added ClusterConfiguration.dns.disabled and ClusterConfiguration.proxy.disabled that can be used to disable the CoreDNS and kube-proxy addons during cluster initialization. Skipping the related addons phases, during cluster creation will set the same fields to true.
  • Added the nodeRegistration.imagePullSerial field in InitConfiguration and JoinConfiguration, which can be used to control if kubeadm pulls images serially or in parallel.
  • The UpgradeConfiguration kubeadm API is now supported in v1beta4 when passing --config to kubeadm upgrade subcommands. For upgrade subcommands, the usage of component configuration for kubelet and kube-proxy, as well as InitConfiguration and ClusterConfiguration, is now deprecated and will be ignored when passing --config.
  • Added a timeouts structure to InitConfiguration, JoinConfiguration, ResetConfiguration and UpgradeConfiguration that can be used to configure various timeouts. The ClusterConfiguration.timeoutForControlPlane field is replaced by timeouts.controlPlaneComponentHealthCheck. The JoinConfiguration.discovery.timeout is replaced by timeouts.discovery.
  • Added a certificateValidityPeriod and caCertificateValidityPeriod fields to ClusterConfiguration. These fields can be used to control the validity period of certificates generated by kubeadm during sub-commands such as init, join, upgrade and certs. Default values continue to be 1 year for non-CA certificates and 10 years for CA certificates. Also note that only non-CA certificates are renewable by kubeadm certs renew.

These changes simplify the configuration of tools that use kubeadm and improve the extensibility of kubeadm itself.

How to migrate v1beta3 configuration to v1beta4?

If your configuration is not using the latest version, it is recommended that you migrate using the kubeadm config migrate command.

This command reads an existing configuration file that uses the old format, and writes a new file that uses the current format.

Example

Using kubeadm v1.31, run kubeadm config migrate --old-config old-v1beta3.yaml --new-config new-v1beta4.yaml

How do I get involved?

Huge thanks to all the contributors who helped with the design, implementation, and review of this feature:

For those interested in getting involved in future discussions on kubeadm configuration, you can reach out kubeadm or SIG-cluster-lifecycle by several means:

Kubernetes 1.31: Custom Profiling in Kubectl Debug Graduates to Beta

There are many ways of troubleshooting the pods and nodes in the cluster. However, kubectl debug is one of the easiest, highly used and most prominent ones. It provides a set of static profiles and each profile serves for a different kind of role. For instance, from the network administrator's point of view, debugging the node should be as easy as this:

$ kubectl debug node/mynode -it --image=busybox --profile=netadmin

On the other hand, static profiles also bring about inherent rigidity, which has some implications for some pods contrary to their ease of use. Because there are various kinds of pods (or nodes) that all have their specific necessities, and unfortunately, some can't be debugged by only using the static profiles.

Take an instance of a simple pod consisting of a container whose healthiness relies on an environment variable:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: example-container
    image: customapp:latest
    env:
    - name: REQUIRED_ENV_VAR
      value: "value1"

Currently, copying the pod is the sole mechanism that supports debugging this pod in kubectl debug. Furthermore, what if user needs to modify the REQUIRED_ENV_VAR to something different for advanced troubleshooting?. There is no mechanism to achieve this.

Custom Profiling

Custom profiling is a new functionality available under --custom flag, introduced in kubectl debug to provide extensibility. It expects partial Container spec in either YAML or JSON format. In order to debug the example-container above by creating an ephemeral container, we simply have to define this YAML:

# partial_container.yaml
env:
  - name: REQUIRED_ENV_VAR
    value: value2

and execute:

kubectl debug example-pod -it --image=customapp --custom=partial_container.yaml

Here is another example that modifies multiple fields at once (change port number, add resource limits, modify environment variable) in JSON:

{
  "ports": [
    {
      "containerPort": 80
    }
  ],
  "resources": {
    "limits": {
      "cpu": "0.5",
      "memory": "512Mi"
    },
    "requests": {
      "cpu": "0.2",
      "memory": "256Mi"
    }
  },
  "env": [
    {
      "name": "REQUIRED_ENV_VAR",
      "value": "value2"
    }
  ]
}

Constraints

Uncontrolled extensibility hurts the usability. So that, custom profiling is not allowed for certain fields such as command, image, lifecycle, volume devices and container name. In the future, more fields can be added to the disallowed list if required.

Limitations

The kubectl debug command has 3 aspects: Debugging with ephemeral containers, pod copying, and node debugging. The largest intersection set of these aspects is the container spec within a Pod That's why, custom profiling only supports the modification of the fields that are defined with containers. This leads to a limitation that if user needs to modify the other fields in the Pod spec, it is not supported.

Acknowledgments

Special thanks to all the contributors who reviewed and commented on this feature, from the initial conception to its actual implementation (alphabetical order):

Kubernetes 1.31: Fine-grained SupplementalGroups control

This blog discusses a new feature in Kubernetes 1.31 to improve the handling of supplementary groups in containers within Pods.

Motivation: Implicit group memberships defined in /etc/group in the container image

Although this behavior may not be popular with many Kubernetes cluster users/admins, kubernetes, by default, merges group information from the Pod with information defined in /etc/group in the container image.

Let's see an example, below Pod specifies runAsUser=1000, runAsGroup=3000 and supplementalGroups=4000 in the Pod's security context.

apiVersion: v1
kind: Pod
metadata:
  name: implicit-groups
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    supplementalGroups: [4000]
  containers:
  - name: ctr
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "sleep 1h" ]
    securityContext:
      allowPrivilegeEscalation: false

What is the result of id command in the ctr container?

# Create the Pod:
$ kubectl apply -f https://k8s.io/blog/2024-08-22-Fine-grained-SupplementalGroups-control/implicit-groups.yaml

# Verify that the Pod's Container is running:
$ kubectl get pod implicit-groups

# Check the id command
$ kubectl exec implicit-groups -- id

Then, output should be similar to this:

uid=1000 gid=3000 groups=3000,4000,50000

Where does group ID 50000 in supplementary groups (groups field) come from, even though 50000 is not defined in the Pod's manifest at all? The answer is /etc/group file in the container image.

Checking the contents of /etc/group in the container image should show below:

$ kubectl exec implicit-groups -- cat /etc/group
...
user-defined-in-image:x:1000:
group-defined-in-image:x:50000:user-defined-in-image

Aha! The container's primary user 1000 belongs to the group 50000 in the last entry.

Thus, the group membership defined in /etc/group in the container image for the container's primary user is implicitly merged to the information from the Pod. Please note that this was a design decision the current CRI implementations inherited from Docker, and the community never really reconsidered it until now.

What's wrong with it?

The implicitly merged group information from /etc/group in the container image may cause some concerns particularly in accessing volumes (see kubernetes/kubernetes#112879 for details) because file permission is controlled by uid/gid in Linux. Even worse, the implicit gids from /etc/group can not be detected/validated by any policy engines because there is no clue for the implicit group information in the manifest. This can also be a concern for Kubernetes security.

Fine-grained SupplementalGroups control in a Pod: SupplementaryGroupsPolicy

To tackle the above problem, Kubernetes 1.31 introduces new field supplementalGroupsPolicy in Pod's .spec.securityContext.

This field provies a way to control how to calculate supplementary groups for the container processes in a Pod. The available policy is below:

  • Merge: The group membership defined in /etc/group for the container's primary user will be merged. If not specified, this policy will be applied (i.e. as-is behavior for backword compatibility).

  • Strict: it only attaches specified group IDs in fsGroup, supplementalGroups, or runAsGroup fields as the supplementary groups of the container processes. This means no group membership defined in /etc/group for the container's primary user will be merged.

Let's see how Strict policy works.

apiVersion: v1
kind: Pod
metadata:
  name: strict-supplementalgroups-policy
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    supplementalGroups: [4000]
    supplementalGroupsPolicy: Strict
  containers:
  - name: ctr
    image: registry.k8s.io/e2e-test-images/agnhost:2.45
    command: [ "sh", "-c", "sleep 1h" ]
    securityContext:
      allowPrivilegeEscalation: false
# Create the Pod:
$ kubectl apply -f https://k8s.io/blog/2024-08-22-Fine-grained-SupplementalGroups-control/strict-supplementalgroups-policy.yaml

# Verify that the Pod's Container is running:
$ kubectl get pod strict-supplementalgroups-policy

# Check the process identity:
kubectl exec -it strict-supplementalgroups-policy -- id

The output should be similar to this:

uid=1000 gid=3000 groups=3000,4000

You can see Strict policy can exclude group 50000 from groups!

Thus, ensuring supplementalGroupsPolicy: Strict (enforced by some policy mechanism) helps prevent the implicit supplementary groups in a Pod.

Note:

Actually, this is not enough because container with sufficient privileges / capability can change its process identity. Please see the following section for details.

Attached process identity in Pod status

This feature also exposes the process identity attached to the first container process of the container via .status.containerStatuses[].user.linux field. It would be helpful to see if implicit group IDs are attached.

...
status:
  containerStatuses:
  - name: ctr
    user:
      linux:
        gid: 3000
        supplementalGroups:
        - 3000
        - 4000
        uid: 1000
...

Note:

Please note that the values in status.containerStatuses[].user.linux field is the firstly attached process identity to the first container process in the container. If the container has sufficient privilege to call system calls related to process identity (e.g. setuid(2), setgid(2) or setgroups(2), etc.), the container process can change its identity. Thus, the actual process identity will be dynamic.

Feature availability

To enable supplementalGroupsPolicy field, the following components have to be used:

  • Kubernetes: v1.31 or later, with the SupplementalGroupsPolicy feature gate enabled. As of v1.31, the gate is marked as alpha.
  • CRI runtime:
    • containerd: v2.0 or later
    • CRI-O: v1.31 or later

You can see if the feature is supported in the Node's .status.features.supplementalGroupsPolicy field.

apiVersion: v1
kind: Node
...
status:
  features:
    supplementalGroupsPolicy: true

What's next?

Kubernetes SIG Node hope - and expect - that the feature will be promoted to beta and eventually general availability (GA) in future releases of Kubernetes, so that users no longer need to enable the feature gate manually.

Merge policy is applied when supplementalGroupsPolicy is not specified, for backwards compatibility.

How can I learn more?

How to get involved?

This feature is driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!

Kubernetes v1.31: New Kubernetes CPUManager Static Policy: Distribute CPUs Across Cores

In Kubernetes v1.31, we are excited to introduce a significant enhancement to CPU management capabilities: the distribute-cpus-across-cores option for the CPUManager static policy. This feature is currently in alpha and hidden by default, marking a strategic shift aimed at optimizing CPU utilization and improving system performance across multi-core processors.

Understanding the feature

Traditionally, Kubernetes' CPUManager tends to allocate CPUs as compactly as possible, typically packing them onto the fewest number of physical cores. However, allocation strategy matters, CPUs on the same physical host still share some resources of the physical core, such as the cache and execution units, etc.

cpu-cache-architecture

While default approach minimizes inter-core communication and can be beneficial under certain scenarios, it also poses a challenge. CPUs sharing a physical core can lead to resource contention, which in turn may cause performance bottlenecks, particularly noticeable in CPU-intensive applications.

The new distribute-cpus-across-cores feature addresses this issue by modifying the allocation strategy. When enabled, this policy option instructs the CPUManager to spread out the CPUs (hardware threads) across as many physical cores as possible. This distribution is designed to minimize contention among CPUs sharing the same physical core, potentially enhancing the performance of applications by providing them dedicated core resources.

Technically, within this static policy, the free CPU list is reordered in the manner depicted in the diagram, aiming to allocate CPUs from separate physical cores.

cpu-ordering

Enabling the feature

To enable this feature, users firstly need to add --cpu-manager-policy=static kubelet flag or the cpuManagerPolicy: static field in KubeletConfiuration. Then user can add --cpu-manager-policy-options distribute-cpus-across-cores=true or distribute-cpus-across-cores=true to their CPUManager policy options in the Kubernetes configuration or. This setting directs the CPUManager to adopt the new distribution strategy. It is important to note that this policy option cannot currently be used in conjunction with full-pcpus-only or distribute-cpus-across-numa options.

Current limitations and future directions

As with any new feature, especially one in alpha, there are limitations and areas for future improvement. One significant current limitation is that distribute-cpus-across-cores cannot be combined with other policy options that might conflict in terms of CPU allocation strategies. This restriction can affect compatibility with certain workloads and deployment scenarios that rely on more specialized resource management.

Looking forward, we are committed to enhancing the compatibility and functionality of the distribute-cpus-across-cores option. Future updates will focus on resolving these compatibility issues, allowing this policy to be combined with other CPUManager policies seamlessly. Our goal is to provide a more flexible and robust CPU allocation framework that can adapt to a variety of workloads and performance demands.

Conclusion

The introduction of the distribute-cpus-across-cores policy in Kubernetes CPUManager is a step forward in our ongoing efforts to refine resource management and improve application performance. By reducing the contention on physical cores, this feature offers a more balanced approach to CPU resource allocation, particularly beneficial for environments running heterogeneous workloads. We encourage Kubernetes users to test this new feature and provide feedback, which will be invaluable in shaping its future development.

This draft aims to clearly explain the new feature while setting expectations for its current stage and future improvements.

Further reading

Please check out the Control CPU Management Policies on the Node task page to learn more about the CPU Manager, and how it fits in relation to the other node-level resource managers.

Getting involved

This feature is driven by the SIG Node. If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Node projects, please attend the SIG Node meeting for more details.

Kubernetes 1.31: Autoconfiguration For Node Cgroup Driver (beta)

Historically, configuring the correct cgroup driver has been a pain point for users running new Kubernetes clusters. On Linux systems, there are two different cgroup drivers: cgroupfs and systemd. In the past, both the kubelet and CRI implementation (like CRI-O or containerd) needed to be configured to use the same cgroup driver, or else the kubelet would exit with an error. This was a source of headaches for many cluster admins. However, there is light at the end of the tunnel!

Automated cgroup driver detection

In v1.28.0, the SIG Node community introduced the feature gate KubeletCgroupDriverFromCRI, which instructs the kubelet to ask the CRI implementation which cgroup driver to use. A few minor releases of Kubernetes happened whilst we waited for support to land in the major two CRI implementations (containerd and CRI-O), but as of v1.31.0, this feature is now beta!

In addition to setting the feature gate, a cluster admin needs to ensure their CRI implementation is new enough:

  • containerd: Support was added in v2.0.0
  • CRI-O: Support was added in v1.28.0

Then, they should ensure their CRI implementation is configured to the cgroup_driver they would like to use.

Future work

Eventually, support for the kubelet's cgroupDriver configuration field will be dropped, and the kubelet will fail to start if the CRI implementation isn't new enough to have support for this feature.

Kubernetes 1.31: Streaming Transitions from SPDY to WebSockets

In Kubernetes 1.31, by default kubectl now uses the WebSocket protocol instead of SPDY for streaming.

This post describes what these changes mean for you and why these streaming APIs matter.

Streaming APIs in Kubernetes

In Kubernetes, specific endpoints that are exposed as an HTTP or RESTful interface are upgraded to streaming connections, which require a streaming protocol. Unlike HTTP, which is a request-response protocol, a streaming protocol provides a persistent connection that's bi-directional, low-latency, and lets you interact in real-time. Streaming protocols support reading and writing data between your client and the server, in both directions, over the same connection. This type of connection is useful, for example, when you create a shell in a running container from your local workstation and run commands in the container.

Why change the streaming protocol?

Before the v1.31 release, Kubernetes used the SPDY/3.1 protocol by default when upgrading streaming connections. SPDY/3.1 has been deprecated for eight years, and it was never standardized. Many modern proxies, gateways, and load balancers no longer support the protocol. As a result, you might notice that commands like kubectl cp, kubectl attach, kubectl exec, and kubectl port-forward stop working when you try to access your cluster through a proxy or gateway.

As of Kubernetes v1.31, SIG API Machinery has modified the streaming protocol that a Kubernetes client (such as kubectl) uses for these commands to the more modern WebSocket streaming protocol. The WebSocket protocol is a currently supported standardized streaming protocol that guarantees compatibility and interoperability with different components and programming languages. The WebSocket protocol is more widely supported by modern proxies and gateways than SPDY.

How streaming APIs work

Kubernetes upgrades HTTP connections to streaming connections by adding specific upgrade headers to the originating HTTP request. For example, an HTTP upgrade request for running the date command on an nginx container within a cluster is similar to the following:

$ kubectl exec -v=8 nginx -- date
GET https://127.0.0.1:43251/api/v1/namespaces/default/pods/nginx/exec?command=date…
Request Headers:
    Connection: Upgrade
    Upgrade: websocket
    Sec-Websocket-Protocol: v5.channel.k8s.io
    User-Agent: kubectl/v1.31.0 (linux/amd64) kubernetes/6911225

If the container runtime supports the WebSocket streaming protocol and at least one of the subprotocol versions (e.g. v5.channel.k8s.io), the server responds with a successful 101 Switching Protocols status, along with the negotiated subprotocol version:

Response Status: 101 Switching Protocols in 3 milliseconds
Response Headers:
    Upgrade: websocket
    Connection: Upgrade
    Sec-Websocket-Accept: j0/jHW9RpaUoGsUAv97EcKw8jFM=
    Sec-Websocket-Protocol: v5.channel.k8s.io

At this point the TCP connection used for the HTTP protocol has changed to a streaming connection. Subsequent STDIN, STDOUT, and STDERR data (as well as terminal resizing data and process exit code data) for this shell interaction is then streamed over this upgraded connection.

How to use the new WebSocket streaming protocol

If your cluster and kubectl are on version 1.29 or later, there are two control plane feature gates and two kubectl environment variables that govern the use of the WebSockets rather than SPDY. In Kubernetes 1.31, all of the following feature gates are in beta and are enabled by default:

  • Feature gates
    • TranslateStreamCloseWebsocketRequests
      • .../exec
      • .../attach
    • PortForwardWebsockets
      • .../port-forward
  • kubectl feature control environment variables
    • KUBECTL_REMOTE_COMMAND_WEBSOCKETS
      • kubectl exec
      • kubectl cp
      • kubectl attach
    • KUBECTL_PORT_FORWARD_WEBSOCKETS
      • kubectl port-forward

If you're connecting to an older cluster but can manage the feature gate settings, turn on both TranslateStreamCloseWebsocketRequests (added in Kubernetes v1.29) and PortForwardWebsockets (added in Kubernetes v1.30) to try this new behavior. Version 1.31 of kubectl can automatically use the new behavior, but you do need to connect to a cluster where the server-side features are explicitly enabled.

Learn more about streaming APIs

Kubernetes 1.31: Pod Failure Policy for Jobs Goes GA

This post describes Pod failure policy, which graduates to stable in Kubernetes 1.31, and how to use it in your Jobs.

About Pod failure policy

When you run workloads on Kubernetes, Pods might fail for a variety of reasons. Ideally, workloads like Jobs should be able to ignore transient, retriable failures and continue running to completion.

To allow for these transient failures, Kubernetes Jobs include the backoffLimit field, which lets you specify a number of Pod failures that you're willing to tolerate during Job execution. However, if you set a large value for the backoffLimit field and rely solely on this field, you might notice unnecessary increases in operating costs as Pods restart excessively until the backoffLimit is met.

This becomes particularly problematic when running large-scale Jobs with thousands of long-running Pods across thousands of nodes.

The Pod failure policy extends the backoff limit mechanism to help you reduce costs in the following ways:

  • Gives you control to fail the Job as soon as a non-retriable Pod failure occurs.
  • Allows you to ignore retriable errors without increasing the backoffLimit field.

For example, you can use a Pod failure policy to run your workload on more affordable spot machines by ignoring Pod failures caused by graceful node shutdown.

The policy allows you to distinguish between retriable and non-retriable Pod failures based on container exit codes or Pod conditions in a failed Pod.

How it works

You specify a Pod failure policy in the Job specification, represented as a list of rules.

For each rule you define match requirements based on one of the following properties:

  • Container exit codes: the onExitCodes property.
  • Pod conditions: the onPodConditions property.

Additionally, for each rule, you specify one of the following actions to take when a Pod matches the rule:

  • Ignore: Do not count the failure towards the backoffLimit or backoffLimitPerIndex.
  • FailJob: Fail the entire Job and terminate all running Pods.
  • FailIndex: Fail the index corresponding to the failed Pod. This action works with the Backoff limit per index feature.
  • Count: Count the failure towards the backoffLimit or backoffLimitPerIndex. This is the default behavior.

When Pod failures occur in a running Job, Kubernetes matches the failed Pod status against the list of Pod failure policy rules, in the specified order, and takes the corresponding actions for the first matched rule.

Note that when specifying the Pod failure policy, you must also set the Job's Pod template with restartPolicy: Never. This prevents race conditions between the kubelet and the Job controller when counting Pod failures.

Kubernetes-initiated Pod disruptions

To allow matching Pod failure policy rules against failures caused by disruptions initiated by Kubernetes, this feature introduces the DisruptionTarget Pod condition.

Kubernetes adds this condition to any Pod, regardless of whether it's managed by a Job controller, that fails because of a retriable disruption scenario. The DisruptionTarget condition contains one of the following reasons that corresponds to these disruption scenarios:

In all other disruption scenarios, like eviction due to exceeding Pod container limits, Pods don't receive the DisruptionTarget condition because the disruptions were likely caused by the Pod and would reoccur on retry.

Example

The Pod failure policy snippet below demonstrates an example use:

podFailurePolicy:
  rules:
  - action: Ignore
    onPodConditions:
    - type: DisruptionTarget
  - action: FailJob
    onPodConditions:
    - type: ConfigIssue
  - action: FailJob
    onExitCodes:
      operator: In
      values: [ 42 ]

In this example, the Pod failure policy does the following:

  • Ignores any failed Pods that have the built-in DisruptionTarget condition. These Pods don't count towards Job backoff limits.
  • Fails the Job if any failed Pods have the custom user-supplied ConfigIssue condition, which was added either by a custom controller or webhook.
  • Fails the Job if any containers exited with the exit code 42.
  • Counts all other Pod failures towards the default backoffLimit (or backoffLimitPerIndex if used).

Learn more

Based on the concepts introduced by Pod failure policy, the following additional work is in progress:

Get involved

This work was sponsored by batch working group in close collaboration with the SIG Apps, and SIG Node, and SIG Scheduling communities.

If you are interested in working on new features in the space we recommend subscribing to our Slack channel and attending the regular community meetings.

Acknowledgments

I would love to thank everyone who was involved in this project over the years - it's been a journey and a joint community effort! The list below is my best-effort attempt to remember and recognize people who made an impact. Thank you!

Kubernetes 1.31: MatchLabelKeys in PodAffinity graduates to beta

Kubernetes 1.29 introduced new fields matchLabelKeys and mismatchLabelKeys in podAffinity and podAntiAffinity.

In Kubernetes 1.31, this feature moves to beta and the corresponding feature gate (MatchLabelKeysInPodAffinity) gets enabled by default.

matchLabelKeys - Enhanced scheduling for versatile rolling updates

During a workload's (e.g., Deployment) rolling update, a cluster may have Pods from multiple versions at the same time. However, the scheduler cannot distinguish between old and new versions based on the labelSelector specified in podAffinity or podAntiAffinity. As a result, it will co-locate or disperse Pods regardless of their versions.

This can lead to sub-optimal scheduling outcome, for example:

  • New version Pods are co-located with old version Pods (podAffinity), which will eventually be removed after rolling updates.
  • Old version Pods are distributed across all available topologies, preventing new version Pods from finding nodes due to podAntiAffinity.

matchLabelKeys is a set of Pod label keys and addresses this problem. The scheduler looks up the values of these keys from the new Pod's labels and combines them with labelSelector so that podAffinity matches Pods that have the same key-value in labels.

By using label pod-template-hash in matchLabelKeys, you can ensure that only Pods of the same version are evaluated for podAffinity or podAntiAffinity.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: application-server
...
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - database
        topologyKey: topology.kubernetes.io/zone
        matchLabelKeys:
        - pod-template-hash

The above matchLabelKeys will be translated in Pods like:

kind: Pod
metadata:
  name: application-server
  labels:
    pod-template-hash: xyz
...
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - database
          - key: pod-template-hash # Added from matchLabelKeys; Only Pods from the same replicaset will match this affinity.
            operator: In
            values:
            - xyz
        topologyKey: topology.kubernetes.io/zone
        matchLabelKeys:
        - pod-template-hash

mismatchLabelKeys - Service isolation

mismatchLabelKeys is a set of Pod label keys, like matchLabelKeys, which looks up the values of these keys from the new Pod's labels, and merge them with labelSelector as key notin (value) so that podAffinity does not match Pods that have the same key-value in labels.

Suppose all Pods for each tenant get tenant label via a controller or a manifest management tool like Helm.

Although the value of tenant label is unknown when composing each workload's manifest, the cluster admin wants to achieve exclusive 1:1 tenant to domain placement for a tenant isolation.

mismatchLabelKeys works for this usecase; By applying the following affinity globally using a mutating webhook, the cluster admin can ensure that the Pods from the same tenant will land on the same domain exclusively, meaning Pods from other tenants won't land on the same domain.

affinity:
  podAffinity:      # ensures the pods of this tenant land on the same node pool
    requiredDuringSchedulingIgnoredDuringExecution:
    - matchLabelKeys:
        - tenant
      topologyKey: node-pool
  podAntiAffinity:  # ensures only Pods from this tenant lands on the same node pool
    requiredDuringSchedulingIgnoredDuringExecution:
    - mismatchLabelKeys:
        - tenant
      labelSelector:
        matchExpressions:
        - key: tenant
          operator: Exists
      topologyKey: node-pool

The above matchLabelKeys and mismatchLabelKeys will be translated to like:

kind: Pod
metadata:
  name: application-server
  labels:
    tenant: service-a
spec: 
  affinity:
    podAffinity:      # ensures the pods of this tenant land on the same node pool
      requiredDuringSchedulingIgnoredDuringExecution:
      - matchLabelKeys:
          - tenant
        topologyKey: node-pool
        labelSelector:
          matchExpressions:
          - key: tenant
            operator: In
            values:
            - service-a 
    podAntiAffinity:  # ensures only Pods from this tenant lands on the same node pool
      requiredDuringSchedulingIgnoredDuringExecution:
      - mismatchLabelKeys:
          - tenant
        labelSelector:
          matchExpressions:
          - key: tenant
            operator: Exists
          - key: tenant
            operator: NotIn
            values:
            - service-a
        topologyKey: node-pool

Getting involved

These features are managed by Kubernetes SIG Scheduling.

Please join us and share your feedback. We look forward to hearing from you!

How can I learn more?

Kubernetes 1.31: Prevent PersistentVolume Leaks When Deleting out of Order

PersistentVolume (or PVs for short) are associated with Reclaim Policy. The reclaim policy is used to determine the actions that need to be taken by the storage backend on deletion of the PVC Bound to a PV. When the reclaim policy is Delete, the expectation is that the storage backend releases the storage resource allocated for the PV. In essence, the reclaim policy needs to be honored on PV deletion.

With the recent Kubernetes v1.31 release, a beta feature lets you configure your cluster to behave that way and honor the configured reclaim policy.

How did reclaim work in previous Kubernetes releases?

PersistentVolumeClaim (or PVC for short) is a user's request for storage. A PV and PVC are considered Bound if a newly created PV or a matching PV is found. The PVs themselves are backed by volumes allocated by the storage backend.

Normally, if the volume is to be deleted, then the expectation is to delete the PVC for a bound PV-PVC pair. However, there are no restrictions on deleting a PV before deleting a PVC.

First, I'll demonstrate the behavior for clusters running an older version of Kubernetes.

Retrieve a PVC that is bound to a PV

Retrieve an existing PVC example-vanilla-block-pvc

kubectl get pvc example-vanilla-block-pvc

The following output shows the PVC and its bound PV; the PV is shown under the VOLUME column:

NAME                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS               AGE
example-vanilla-block-pvc   Bound    pvc-6791fdd4-5fad-438e-a7fb-16410363e3da   5Gi        RWO            example-vanilla-block-sc   19s

Delete PV

When I try to delete a bound PV, the kubectl session blocks and the kubectl tool does not return back control to the shell; for example:

kubectl delete pv pvc-6791fdd4-5fad-438e-a7fb-16410363e3da
persistentvolume "pvc-6791fdd4-5fad-438e-a7fb-16410363e3da" deleted
^C

Retrieving the PV

kubectl get pv pvc-6791fdd4-5fad-438e-a7fb-16410363e3da

It can be observed that the PV is in a Terminating state

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS        CLAIM                               STORAGECLASS               REASON   AGE
pvc-6791fdd4-5fad-438e-a7fb-16410363e3da   5Gi        RWO            Delete           Terminating   default/example-vanilla-block-pvc   example-vanilla-block-sc            2m23s

Delete PVC

kubectl delete pvc example-vanilla-block-pvc

The following output is seen if the PVC gets successfully deleted:

persistentvolumeclaim "example-vanilla-block-pvc" deleted

The PV object from the cluster also gets deleted. When attempting to retrieve the PV it will be observed that the PV is no longer found:

kubectl get pv pvc-6791fdd4-5fad-438e-a7fb-16410363e3da
Error from server (NotFound): persistentvolumes "pvc-6791fdd4-5fad-438e-a7fb-16410363e3da" not found

Although the PV is deleted, the underlying storage resource is not deleted and needs to be removed manually.

To sum up, the reclaim policy associated with the PersistentVolume is currently ignored under certain circumstances. For a Bound PV-PVC pair, the ordering of PV-PVC deletion determines whether the PV reclaim policy is honored. The reclaim policy is honored if the PVC is deleted first; however, if the PV is deleted prior to deleting the PVC, then the reclaim policy is not exercised. As a result of this behavior, the associated storage asset in the external infrastructure is not removed.

PV reclaim policy with Kubernetes v1.31

The new behavior ensures that the underlying storage object is deleted from the backend when users attempt to delete a PV manually.

How to enable new behavior?

To take advantage of the new behavior, you must have upgraded your cluster to the v1.31 release of Kubernetes and run the CSI external-provisioner version 5.0.1 or later.

How does it work?

For CSI volumes, the new behavior is achieved by adding a finalizer external-provisioner.volume.kubernetes.io/finalizer on new and existing PVs. The finalizer is only removed after the storage from the backend is deleted. `

An example of a PV with the finalizer, notice the new finalizer in the finalizers list

kubectl get pv pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: csi.vsphere.vmware.com
  creationTimestamp: "2021-11-17T19:28:56Z"
  finalizers:
  - kubernetes.io/pv-protection
  - external-provisioner.volume.kubernetes.io/finalizer
  name: pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53
  resourceVersion: "194711"
  uid: 087f14f2-4157-4e95-8a70-8294b039d30e
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: example-vanilla-block-pvc
    namespace: default
    resourceVersion: "194677"
    uid: a7b7e3ba-f837-45ba-b243-dec7d8aaed53
  csi:
    driver: csi.vsphere.vmware.com
    fsType: ext4
    volumeAttributes:
      storage.kubernetes.io/csiProvisionerIdentity: 1637110610497-8081-csi.vsphere.vmware.com
      type: vSphere CNS Block Volume
    volumeHandle: 2dacf297-803f-4ccc-afc7-3d3c3f02051e
  persistentVolumeReclaimPolicy: Delete
  storageClassName: example-vanilla-block-sc
  volumeMode: Filesystem
status:
  phase: Bound

The finalizer prevents this PersistentVolume from being removed from the cluster. As stated previously, the finalizer is only removed from the PV object after it is successfully deleted from the storage backend. To learn more about finalizers, please refer to Using Finalizers to Control Deletion.

Similarly, the finalizer kubernetes.io/pv-controller is added to dynamically provisioned in-tree plugin volumes.

What about CSI migrated volumes?

The fix applies to CSI migrated volumes as well.

Some caveats

The fix does not apply to statically provisioned in-tree plugin volumes.

References

How do I get involved?

The Kubernetes Slack channel SIG Storage communication channels are great mediums to reach out to the SIG Storage and migration working group teams.

Special thanks to the following people for the insightful reviews, thorough consideration and valuable contribution:

  • Fan Baofa (carlory)
  • Jan Šafránek (jsafrane)
  • Xing Yang (xing-yang)
  • Matthew Wong (wongma7)

Join the Kubernetes Storage Special Interest Group (SIG) if you're interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system. We’re rapidly growing and always welcome new contributors.

Kubernetes 1.31: Read Only Volumes Based On OCI Artifacts (alpha)

The Kubernetes community is moving towards fulfilling more Artificial Intelligence (AI) and Machine Learning (ML) use cases in the future. While the project has been designed to fulfill microservice architectures in the past, it’s now time to listen to the end users and introduce features which have a stronger focus on AI/ML.

One of these requirements is to support Open Container Initiative (OCI) compatible images and artifacts (referred as OCI objects) directly as a native volume source. This allows users to focus on OCI standards as well as enables them to store and distribute any content using OCI registries. A feature like this gives the Kubernetes project a chance to grow into use cases which go beyond running particular images.

Given that, the Kubernetes community is proud to present a new alpha feature introduced in v1.31: The Image Volume Source (KEP-4639). This feature allows users to specify an image reference as volume in a pod while reusing it as volume mount within containers:


kind: Pod
spec:
  containers:
    - …
      volumeMounts:
        - name: my-volume
          mountPath: /path/to/directory
  volumes:
    - name: my-volume
      image:
        reference: my-image:tag

The above example would result in mounting my-image:tag to /path/to/directory in the pod’s container.

Use cases

The goal of this enhancement is to stick as close as possible to the existing container image implementation within the kubelet, while introducing a new API surface to allow more extended use cases.

For example, users could share a configuration file among multiple containers in a pod without including the file in the main image, so that they can minimize security risks and the overall image size. They can also package and distribute binary artifacts using OCI images and mount them directly into Kubernetes pods, so that they can streamline their CI/CD pipeline as an example.

Data scientists, MLOps engineers, or AI developers, can mount large language model weights or machine learning model weights in a pod alongside a model-server, so that they can efficiently serve them without including them in the model-server container image. They can package these in an OCI object to take advantage of OCI distribution and ensure efficient model deployment. This allows them to separate the model specifications/content from the executables that process them.

Another use case is that security engineers can use a public image for a malware scanner and mount in a volume of private (commercial) malware signatures, so that they can load those signatures without baking their own combined image (which might not be allowed by the copyright on the public image). Those files work regardless of the OS or version of the scanner software.

But in the long term it will be up to you as an end user of this project to outline further important use cases for the new feature. SIG Node is happy to retrieve any feedback or suggestions for further enhancements to allow more advanced usage scenarios. Feel free to provide feedback by either using the Kubernetes Slack (#sig-node) channel or the SIG Node mailinglist.

Detailed example

The Kubernetes alpha feature gate ImageVolume needs to be enabled on the API Server as well as the kubelet to make it functional. If that’s the case and the container runtime has support for the feature (like CRI-O ≥ v1.31), then an example pod.yaml like this can be created:

apiVersion: v1
kind: Pod
metadata:
  name: pod
spec:
  containers:
    - name: test
      image: registry.k8s.io/e2e-test-images/echoserver:2.3
      volumeMounts:
        - name: volume
          mountPath: /volume
  volumes:
    - name: volume
      image:
        reference: quay.io/crio/artifact:v1
        pullPolicy: IfNotPresent

The pod declares a new volume using the image.reference of quay.io/crio/artifact:v1, which refers to an OCI object containing two files. The pullPolicy behaves in the same way as for container images and allows the following values:

  • Always: the kubelet always attempts to pull the reference and the container creation will fail if the pull fails.
  • Never: the kubelet never pulls the reference and only uses a local image or artifact. The container creation will fail if the reference isn’t present.
  • IfNotPresent: the kubelet pulls if the reference isn’t already present on disk. The container creation will fail if the reference isn’t present and the pull fails.

The volumeMounts field is indicating that the container with the name test should mount the volume under the path /volume.

If you now create the pod:

kubectl apply -f pod.yaml

And exec into it:

kubectl exec -it pod -- sh

Then you’re able to investigate what has been mounted:

/ # ls /volume
dir   file
/ # cat /volume/file
2
/ # ls /volume/dir
file
/ # cat /volume/dir/file
1

You managed to consume an OCI artifact using Kubernetes!

The container runtime pulls the image (or artifact), mounts it to the container and makes it finally available for direct usage. There are a bunch of details in the implementation, which closely align to the existing image pull behavior of the kubelet. For example:

  • If a :latest tag as reference is provided, then the pullPolicy will default to Always, while in any other case it will default to IfNotPresent if unset.
  • The volume gets re-resolved if the pod gets deleted and recreated, which means that new remote content will become available on pod recreation. A failure to resolve or pull the image during pod startup will block containers from starting and may add significant latency. Failures will be retried using normal volume backoff and will be reported on the pod reason and message.
  • Pull secrets will be assembled in the same way as for the container image by looking up node credentials, service account image pull secrets, and pod spec image pull secrets.
  • The OCI object gets mounted in a single directory by merging the manifest layers in the same way as for container images.
  • The volume is mounted as read-only (ro) and non-executable files (noexec).
  • Sub-path mounts for containers are not supported (spec.containers[*].volumeMounts.subpath).
  • The field spec.securityContext.fsGroupChangePolicy has no effect on this volume type.
  • The feature will also work with the AlwaysPullImages admission plugin if enabled.

Thank you for reading through the end of this blog post! SIG Node is proud and happy to deliver this feature as part of Kubernetes v1.31.

As writer of this blog post, I would like to emphasize my special thanks to all involved individuals out there! You all rock, let’s keep on hacking!

Further reading

Kubernetes 1.31: VolumeAttributesClass for Volume Modification Beta

Volumes in Kubernetes have been described by two attributes: their storage class, and their capacity. The storage class is an immutable property of the volume, while the capacity can be changed dynamically with volume resize.

This complicates vertical scaling of workloads with volumes. While cloud providers and storage vendors often offer volumes which allow specifying IO quality of service (Performance) parameters like IOPS or throughput and tuning them as workloads operate, Kubernetes has no API which allows changing them.

We are pleased to announce that the VolumeAttributesClass KEP, alpha since Kubernetes 1.29, will be beta in 1.31. This provides a generic, Kubernetes-native API for modifying volume parameters like provisioned IO.

Like all new volume features in Kubernetes, this API is implemented via the container storage interface (CSI). In addition to the VolumeAttributesClass feature gate, your provisioner-specific CSI driver must support the new ModifyVolume API which is the CSI side of this feature.

See the full documentation for all details. Here we show the common workflow.

Dynamically modifying volume attributes.

A VolumeAttributesClass is a cluster-scoped resource that specifies provisioner-specific attributes. These are created by the cluster administrator in the same way as storage classes. For example, a series of gold, silver and bronze volume attribute classes can be created for volumes with greater or lessor amounts of provisioned IO.

apiVersion: storage.k8s.io/v1alpha1
kind: VolumeAttributesClass
metadata:
  name: silver
driverName: your-csi-driver
parameters:
  provisioned-iops: "500"
  provisioned-throughput: "50MiB/s"
---
apiVersion: storage.k8s.io/v1alpha1
kind: VolumeAttributesClass
metadata:
  name: gold
driverName: your-csi-driver
parameters:
  provisioned-iops: "10000"
  provisioned-throughput: "500MiB/s"

An attribute class is added to a PVC in much the same way as a storage class.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pv-claim
spec:
  storageClassName: any-storage-class
  volumeAttributesClassName: silver
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 64Gi

Unlike a storage class, the volume attributes class can be changed:

kubectl patch pvc test-pv-claim -p '{"spec": "volumeAttributesClassName": "gold"}'

Kubernetes will work with the CSI driver to update the attributes of the volume. The status of the PVC will track the current and desired attributes class. The PV resource will also be updated with the new volume attributes class which will be set to the currently active attributes of the PV.

Limitations with the beta

As a beta feature, there are still some features which are planned for GA but not yet present. The largest is quota support, see the KEP and discussion in sig-storage for details.

See the Kubernetes CSI driver list for up-to-date information of support for this feature in CSI drivers.

Kubernetes v1.31: Accelerating Cluster Performance with Consistent Reads from Cache

Kubernetes is renowned for its robust orchestration of containerized applications, but as clusters grow, the demands on the control plane can become a bottleneck. A key challenge has been ensuring strongly consistent reads from the etcd datastore, requiring resource-intensive quorum reads.

Today, the Kubernetes community is excited to announce a major improvement: consistent reads from cache, graduating to Beta in Kubernetes v1.31.

Why consistent reads matter

Consistent reads are essential for ensuring that Kubernetes components have an accurate view of the latest cluster state. Guaranteeing consistent reads is crucial for maintaining the accuracy and reliability of Kubernetes operations, enabling components to make informed decisions based on up-to-date information. In large-scale clusters, fetching and processing this data can be a performance bottleneck, especially for requests that involve filtering results. While Kubernetes can filter data by namespace directly within etcd, any other filtering by labels or field selectors requires the entire dataset to be fetched from etcd and then filtered in-memory by the Kubernetes API server. This is particularly impactful for components like the kubelet, which only needs to list pods scheduled to its node - but previously required the API Server and etcd to process all pods in the cluster.

The breakthrough: Caching with confidence

Kubernetes has long used a watch cache to optimize read operations. The watch cache stores a snapshot of the cluster state and receives updates through etcd watches. However, until now, it couldn't serve consistent reads directly, as there was no guarantee the cache was sufficiently up-to-date.

The consistent reads from cache feature addresses this by leveraging etcd's progress notifications mechanism. These notifications inform the watch cache about how current its data is compared to etcd. When a consistent read is requested, the system first checks if the watch cache is up-to-date. If the cache is not up-to-date, the system queries etcd for progress notifications until it's confirmed that the cache is sufficiently fresh. Once ready, the read is efficiently served directly from the cache, which can significantly improve performance, particularly in cases where it would require fetching a lot of data from etcd. This enables requests that filter data to be served from the cache, with only minimal metadata needing to be read from etcd.

Important Note: To benefit from this feature, your Kubernetes cluster must be running etcd version 3.4.31+ or 3.5.13+. For older etcd versions, Kubernetes will automatically fall back to serving consistent reads directly from etcd.

Performance gains you'll notice

This seemingly simple change has a profound impact on Kubernetes performance and scalability:

  • Reduced etcd Load: Kubernetes v1.31 can offload work from etcd, freeing up resources for other critical operations.
  • Lower Latency: Serving reads from cache is significantly faster than fetching and processing data from etcd. This translates to quicker responses for components, improving overall cluster responsiveness.
  • Improved Scalability: Large clusters with thousands of nodes and pods will see the most significant gains, as the reduction in etcd load allows the control plane to handle more requests without sacrificing performance.

5k Node Scalability Test Results: In recent scalability tests on 5,000 node clusters, enabling consistent reads from cache delivered impressive improvements:

  • 30% reduction in kube-apiserver CPU usage
  • 25% reduction in etcd CPU usage
  • Up to 3x reduction (from 5 seconds to 1.5 seconds) in 99th percentile pod LIST request latency

What's next?

With the graduation to beta, consistent reads from cache are enabled by default, offering a seamless performance boost to all Kubernetes users running a supported etcd version.

Our journey doesn't end here. Kubernetes community is actively exploring pagination support in the watch cache, which will unlock even more performance optimizations in the future.

Getting started

Upgrading to Kubernetes v1.31 and ensuring you are using etcd version 3.4.31+ or 3.5.13+ is the easiest way to experience the benefits of consistent reads from cache. If you have any questions or feedback, don't hesitate to reach out to the Kubernetes community.

Let us know how consistent reads from cache transforms your Kubernetes experience!

Special thanks to @ah8ad3 and @p0lyn0mial for their contributions to this feature!

Kubernetes 1.31: Moving cgroup v1 Support into Maintenance Mode

As Kubernetes continues to evolve and adapt to the changing landscape of container orchestration, the community has decided to move cgroup v1 support into maintenance mode in v1.31. This shift aligns with the broader industry's move towards cgroup v2, offering improved functionalities: including scalability and a more consistent interface. Before we dive into the consequences for Kubernetes, let's take a step back to understand what cgroups are and their significance in Linux.

Understanding cgroups

Control groups, or cgroups, are a Linux kernel feature that allows the allocation, prioritization, denial, and management of system resources (such as CPU, memory, disk I/O, and network bandwidth) among processes. This functionality is crucial for maintaining system performance and ensuring that no single process can monopolize system resources, which is especially important in multi-tenant environments.

There are two versions of cgroups: v1 and v2. While cgroup v1 provided sufficient capabilities for resource management, it had limitations that led to the development of cgroup v2. Cgroup v2 offers a more unified and consistent interface, on top of better resource control features.

Cgroups in Kubernetes

For Linux nodes, Kubernetes relies heavily on cgroups to manage and isolate the resources consumed by containers running in pods. Each container in Kubernetes is placed in its own cgroup, which allows Kubernetes to enforce resource limits, monitor usage, and ensure fair resource distribution among all containers.

How Kubernetes uses cgroups

Resource Allocation
Ensures that containers do not exceed their allocated CPU and memory limits.
Isolation
Isolates containers from each other to prevent resource contention.
Monitoring
Tracks resource usage for each container to provide insights and metrics.

Transitioning to Cgroup v2

The Linux community has been focusing on cgroup v2 for new features and improvements. Major Linux distributions and projects like systemd are transitioning towards cgroup v2. Using cgroup v2 provides several benefits over cgroupv1, such as Unified Hierarchy, Improved Interface, Better Resource Control, cgroup aware OOM killer, rootless support etc.

Given these advantages, Kubernetes is also making the move to embrace cgroup v2 more fully. However, this transition needs to be handled carefully to avoid disrupting existing workloads and to provide a smooth migration path for users.

Moving cgroup v1 support into maintenance mode

What does maintenance mode mean?

When cgroup v1 is placed into maintenance mode in Kubernetes, it means that:

  1. Feature Freeze: No new features will be added to cgroup v1 support.
  2. Security Fixes: Critical security fixes will still be provided.
  3. Best-Effort Bug Fixes: Major bugs may be fixed if feasible, but some issues might remain unresolved.

Why move to maintenance mode?

The move to maintenance mode is driven by the need to stay in line with the broader ecosystem and to encourage the adoption of cgroup v2, which offers better performance, security, and usability. By transitioning cgroup v1 to maintenance mode, Kubernetes can focus on enhancing support for cgroup v2 and ensure it meets the needs of modern workloads. It's important to note that maintenance mode does not mean deprecation; cgroup v1 will continue to receive critical security fixes and major bug fixes as needed.

What this means for cluster administrators

Users currently relying on cgroup v1 are highly encouraged to plan for the transition to cgroup v2. This transition involves:

  1. Upgrading Systems: Ensuring that the underlying operating systems and container runtimes support cgroup v2.
  2. Testing Workloads: Verifying that workloads and applications function correctly with cgroup v2.

Further reading

Kubernetes v1.31: PersistentVolume Last Phase Transition Time Moves to GA

Announcing the graduation to General Availability (GA) of the PersistentVolume lastTransitionTime status field, in Kubernetes v1.31!

The Kubernetes SIG Storage team is excited to announce that the "PersistentVolumeLastPhaseTransitionTime" feature, introduced as an alpha in Kubernetes v1.28, has now reached GA status and is officially part of the Kubernetes v1.31 release. This enhancement helps Kubernetes users understand when a PersistentVolume transitions between different phases, allowing for more efficient and informed resource management.

For a v1.31 cluster, you can now assume that every PersistentVolume object has a .status.lastTransitionTime field, that holds a timestamp of when the volume last transitioned its phase. This change is not immediate; the new field will be populated whenever a PersistentVolume is updated and first transitions between phases (Pending, Bound, or Released) after upgrading to Kubernetes v1.31.

What changed?

The API strategy for updating PersistentVolume objects has been modified to populate the .status.lastTransitionTime field with the current timestamp whenever a PersistentVolume transitions phases. Users are allowed to set this field manually if needed, but it will be overwritten when the PersistentVolume transitions phases again.

For more details, read about Phase transition timestamp in the Kubernetes documentation. You can also read the previous blog post announcing the feature as alpha in v1.28.

To provide feedback, join our Kubernetes Storage Special-Interest-Group (SIG) or participate in discussions on our public Slack channel.

Kubernetes v1.31: Elli

Editors: Matteo Bianchi, Yigit Demirbas, Abigail McCarthy, Edith Puclla, Rashan Smith

Announcing the release of Kubernetes v1.31: Elli!

Similar to previous releases, the release of Kubernetes v1.31 introduces new stable, beta, and alpha features. The consistent delivery of high-quality releases underscores the strength of our development cycle and the vibrant support from our community. This release consists of 45 enhancements. Of those enhancements, 11 have graduated to Stable, 22 are entering Beta, and 12 have graduated to Alpha.

The Kubernetes v1.31 Release Theme is "Elli".

Kubernetes v1.31's Elli is a cute and joyful dog, with a heart of gold and a nice sailor's cap, as a playful wink to the huge and diverse family of Kubernetes contributors.

Kubernetes v1.31 marks the first release after the project has successfully celebrated its first 10 years. Kubernetes has come a very long way since its inception, and it's still moving towards exciting new directions with each release. After 10 years, it is awe-inspiring to reflect on the effort, dedication, skill, wit and tiring work of the countless Kubernetes contributors who have made this a reality.

And yet, despite the herculean effort needed to run the project, there is no shortage of people who show up, time and again, with enthusiasm, smiles and a sense of pride for contributing and being part of the community. This "spirit" that we see from new and old contributors alike is the sign of a vibrant community, a "joyful" community, if we might call it that.

Kubernetes v1.31's Elli is all about celebrating this wonderful spirit! Here's to the next decade of Kubernetes!

Highlights of features graduating to Stable

This is a selection of some of the improvements that are now stable following the v1.31 release.

AppArmor support is now stable

Kubernetes support for AppArmor is now GA. Protect your containers using AppArmor by setting the appArmorProfile.type field in the container's securityContext. Note that before Kubernetes v1.30, AppArmor was controlled via annotations; starting in v1.30 it is controlled using fields. It is recommended that you should migrate away from using annotations and start using the appArmorProfile.type field.

To learn more read the AppArmor tutorial. This work was done as a part of KEP #24, by SIG Node.

Improved ingress connectivity reliability for kube-proxy

Kube-proxy improved ingress connectivity reliability is stable in v1.31. One of the common problems with load balancers in Kubernetes is the synchronization between the different components involved to avoid traffic drop. This feature implements a mechanism in kube-proxy for load balancers to do connection draining for terminating Nodes exposed by services of type: LoadBalancer and externalTrafficPolicy: Cluster and establish some best practices for cloud providers and Kubernetes load balancers implementations.

To use this feature, kube-proxy needs to run as default service proxy on the cluster and the load balancer needs to support connection draining. There are no specific changes required for using this feature, it has been enabled by default in kube-proxy since v1.30 and been promoted to stable in v1.31.

For more details about this feature please visit the Virtual IPs and Service Proxies documentation page.

This work was done as part of KEP #3836 by SIG Network.

Persistent Volume last phase transition time

Persistent Volume last phase transition time feature moved to GA in v1.31. This feature adds a PersistentVolumeStatus field which holds a timestamp of when a PersistentVolume last transitioned to a different phase. With this feature enabled, every PersistentVolume object will have a new field .status.lastTransitionTime, that holds a timestamp of when the volume last transitioned its phase. This change is not immediate; the new field will be populated whenever a PersistentVolume is updated and first transitions between phases (Pending, Bound, or Released) after upgrading to Kubernetes v1.31. This allows you to measure time between when a PersistentVolume moves from Pending to Bound. This can be also useful for providing metrics and SLOs.

For more details about this feature please visit the PersistentVolume documentation page.

This work was done as a part of KEP #3762 by SIG Storage.

Highlights of features graduating to Beta

This is a selection of some of the improvements that are now beta following the v1.31 release.

nftables backend for kube-proxy

The nftables backend moves to beta in v1.31, behind the NFTablesProxyMode feature gate which is now enabled by default.

The nftables API is the successor to the iptables API and is designed to provide better performance and scalability than iptables. The nftables proxy mode is able to process changes to service endpoints faster and more efficiently than the iptables mode, and is also able to more efficiently process packets in the kernel (though this only becomes noticeable in clusters with tens of thousands of services).

As of Kubernetes v1.31, the nftables mode is still relatively new, and may not be compatible with all network plugins; consult the documentation for your network plugin. This proxy mode is only available on Linux nodes, and requires kernel 5.13 or later. Before migrating, note that some features, especially around NodePort services, are not implemented exactly the same in nftables mode as they are in iptables mode. Check the migration guide to see if you need to override the default configuration.

This work was done as part of KEP #3866 by SIG Network.

Changes to reclaim policy for PersistentVolumes

The Always Honor PersistentVolume Reclaim Policy feature has advanced to beta in Kubernetes v1.31. This enhancement ensures that the PersistentVolume (PV) reclaim policy is respected even after the associated PersistentVolumeClaim (PVC) is deleted, thereby preventing the leakage of volumes.

Prior to this feature, the reclaim policy linked to a PV could be disregarded under specific conditions, depending on whether the PV or PVC was deleted first. Consequently, the corresponding storage resource in the external infrastructure might not be removed, even if the reclaim policy was set to "Delete". This led to potential inconsistencies and resource leaks.

With the introduction of this feature, Kubernetes now guarantees that the "Delete" reclaim policy will be enforced, ensuring the deletion of the underlying storage object from the backend infrastructure, regardless of the deletion sequence of the PV and PVC.

This work was done as a part of KEP #2644 and by SIG Storage.

Bound service account token improvements

The ServiceAccountTokenNodeBinding feature is promoted to beta in v1.31. This feature allows requesting a token bound only to a node, not to a pod, which includes node information in claims in the token and validates the existence of the node when the token is used. For more information, read the bound service account tokens documentation.

This work was done as part of KEP #4193 by SIG Auth.

Multiple Service CIDRs

Support for clusters with multiple Service CIDRs moves to beta in v1.31 (disabled by default).

There are multiple components in a Kubernetes cluster that consume IP addresses: Nodes, Pods and Services. Nodes and Pods IP ranges can be dynamically changed because depend on the infrastructure or the network plugin respectively. However, Services IP ranges are defined during the cluster creation as a hardcoded flag in the kube-apiserver. IP exhaustion has been a problem for long lived or large clusters, as admins needed to expand, shrink or even replace entirely the assigned Service CIDR range. These operations were never supported natively and were performed via complex and delicate maintenance operations, often causing downtime on their clusters. This new feature allows users and cluster admins to dynamically modify Service CIDR ranges with zero downtime.

For more details about this feature please visit the Virtual IPs and Service Proxies documentation page.

This work was done as part of KEP #1880 by SIG Network.

Traffic distribution for Services

Traffic distribution for Services moves to beta in v1.31 and is enabled by default.

After several iterations on finding the best user experience and traffic engineering capabilities for Services networking, SIG Networking implemented the trafficDistribution field in the Service specification, which serves as a guideline for the underlying implementation to consider while making routing decisions.

For more details about this feature please read the 1.30 Release Blog or visit the Service documentation page.

This work was done as part of KEP #4444 by SIG Network.

Kubernetes VolumeAttributesClass ModifyVolume

VolumeAttributesClass API is moving to beta in v1.31. The VolumeAttributesClass provides a generic, Kubernetes-native API for modifying dynamically volume parameters like provisioned IO. This allows workloads to vertically scale their volumes on-line to balance cost and performance, if supported by their provider. This feature had been alpha since Kubernetes 1.29.

This work was done as a part of KEP #3751 and lead by SIG Storage.

New features in Alpha

This is a selection of some of the improvements that are now alpha following the v1.31 release.

New DRA APIs for better accelerators and other hardware management

Kubernetes v1.31 brings an updated dynamic resource allocation (DRA) API and design. The main focus in the update is on structured parameters because they make resource information and requests transparent to Kubernetes and clients and enable implementing features like cluster autoscaling. DRA support in the kubelet was updated such that version skew between kubelet and the control plane is possible. With structured parameters, the scheduler allocates ResourceClaims while scheduling a pod. Allocation by a DRA driver controller is still supported through what is now called "classic DRA".

With Kubernetes v1.31, classic DRA has a separate feature gate named DRAControlPlaneController, which you need to enable explicitly. With such a control plane controller, a DRA driver can implement allocation policies that are not supported yet through structured parameters.

This work was done as part of KEP #3063 by SIG Node.

Support for image volumes

The Kubernetes community is moving towards fulfilling more Artificial Intelligence (AI) and Machine Learning (ML) use cases in the future.

One of the requirements to fulfill these use cases is to support Open Container Initiative (OCI) compatible images and artifacts (referred as OCI objects) directly as a native volume source. This allows users to focus on OCI standards as well as enables them to store and distribute any content using OCI registries.

Given that, v1.31 adds a new alpha feature to allow using an OCI image as a volume in a Pod. This feature allows users to specify an image reference as volume in a pod while reusing it as volume mount within containers. You need to enable the ImageVolume feature gate to try this out.

This work was done as part of KEP #4639 by SIG Node and SIG Storage.

Exposing device health information through Pod status

Expose device health information through the Pod Status is added as a new alpha feature in v1.31, disabled by default.

Before Kubernetes v1.31, the way to know whether or not a Pod is associated with the failed device is to use the PodResources API.

By enabling this feature, the field allocatedResourcesStatus will be added to each container status, within the .status for each Pod. The allocatedResourcesStatus field reports health information for each device assigned to the container.

This work was done as part of KEP #4680 by SIG Node.

Finer-grained authorization based on selectors

This feature allows webhook authorizers and future (but not currently designed) in-tree authorizers to allow list and watch requests, provided those requests use label and/or field selectors. For example, it is now possible for an authorizer to express: this user cannot list all pods, but can list all pods where .spec.nodeName matches some specific value. Or to allow a user to watch all Secrets in a namespace that are not labelled as confidential: true. Combined with CRD field selectors (also moving to beta in v1.31), it is possible to write more secure per-node extensions.

This work was done as part of KEP #4601 by SIG Auth.

Restrictions on anonymous API access

By enabling the feature gate AnonymousAuthConfigurableEndpoints users can now use the authentication configuration file to configure the endpoints that can be accessed by anonymous requests. This allows users to protect themselves against RBAC misconfigurations that can give anonymous users broad access to the cluster.

This work was done as a part of KEP #4633 and by SIG Auth.

Graduations, deprecations, and removals in 1.31

Graduations to Stable

This lists all the features that graduated to stable (also known as general availability). For a full list of updates including new features and graduations from alpha to beta, see the release notes.

This release includes a total of 11 enhancements promoted to Stable:

Deprecations and Removals

As Kubernetes develops and matures, features may be deprecated, removed, or replaced with better ones for the project's overall health. See the Kubernetes deprecation and removal policy for more details on this process.

Cgroup v1 enters the maintenance mode

As Kubernetes continues to evolve and adapt to the changing landscape of container orchestration, the community has decided to move cgroup v1 support into maintenance mode in v1.31. This shift aligns with the broader industry's move towards cgroup v2, offering improved functionality, scalability, and a more consistent interface. Kubernetes maintance mode means that no new features will be added to cgroup v1 support. Critical security fixes will still be provided, however, bug-fixing is now best-effort, meaning major bugs may be fixed if feasible, but some issues might remain unresolved.

It is recommended that you start switching to use cgroup v2 as soon as possible. This transition depends on your architecture, including ensuring the underlying operating systems and container runtimes support cgroup v2 and testing workloads to verify that workloads and applications function correctly with cgroup v2.

Please report any problems you encounter by filing an issue.

This work was done as part of KEP #4569 by SIG Node.

A note about SHA-1 signature support

In go1.18 (released in March 2022), the crypto/x509 library started to reject certificates signed with a SHA-1 hash function. While SHA-1 is established to be unsafe and publicly trusted Certificate Authorities have not issued SHA-1 certificates since 2015, there might still be cases in the context of Kubernetes where user-provided certificates are signed using a SHA-1 hash function through private authorities with them being used for Aggregated API Servers or webhooks. If you have relied on SHA-1 based certificates, you must explicitly opt back into its support by setting GODEBUG=x509sha1=1 in your environment.

Given Go's compatibility policy for GODEBUGs, the x509sha1 GODEBUG and the support for SHA-1 certificates will fully go away in go1.24 which will be released in the first half of 2025. If you rely on SHA-1 certificates, please start moving off them.

Please see Kubernetes issue #125689 to get a better idea of timelines around the support for SHA-1 going away, when Kubernetes releases plans to adopt go1.24, and for more details on how to detect usage of SHA-1 certificates via metrics and audit logging.

Deprecation of status.nodeInfo.kubeProxyVersion field for Nodes (KEP 4004)

The .status.nodeInfo.kubeProxyVersion field of Nodes has been deprecated in Kubernetes v1.31, and will be removed in a later release. It's being deprecated because the value of this field wasn't (and isn't) accurate. This field is set by the kubelet, which does not have reliable information about the kube-proxy version or whether kube-proxy is running.

The DisableNodeKubeProxyVersion feature gate will be set to true in by default in v1.31 and the kubelet will no longer attempt to set the .status.kubeProxyVersion field for its associated Node.

Removal of all in-tree integrations with cloud providers

As highlighted in a previous article, the last remaining in-tree support for cloud provider integration has been removed as part of the v1.31 release. This doesn't mean you can't integrate with a cloud provider, however you now must use the recommended approach using an external integration. Some integrations are part of the Kubernetes project and others are third party software.

This milestone marks the completion of the externalization process for all cloud providers' integrations from the Kubernetes core (KEP-2395), a process started with Kubernetes v1.26. This change helps Kubernetes to get closer to being a truly vendor-neutral platform.

For further details on the cloud provider integrations, read our v1.29 Cloud Provider Integrations feature blog. For additional context about the in-tree code removal, we invite you to check the (v1.29 deprecation blog).

The latter blog also contains useful information for users who need to migrate to version v1.29 and later.

Removal of in-tree provider feature gates

In Kubernetes v1.31, the following alpha feature gates InTreePluginAWSUnregister, InTreePluginAzureDiskUnregister, InTreePluginAzureFileUnregister, InTreePluginGCEUnregister, InTreePluginOpenStackUnregister, and InTreePluginvSphereUnregister have been removed. These feature gates were introduced to facilitate the testing of scenarios where in-tree volume plugins were removed from the codebase, without actually removing them. Since Kubernetes 1.30 had deprecated these in-tree volume plugins, these feature gates were redundant and no longer served a purpose. The only CSI migration gate still standing is InTreePluginPortworxUnregister, which will remain in alpha until the CSI migration for Portworx is completed and its in-tree volume plugin will be ready for removal.

Removal of kubelet --keep-terminated-pod-volumes command line flag

The kubelet flag --keep-terminated-pod-volumes, which was deprecated in 2017, has been removed as part of the v1.31 release.

You can find more details in the pull request #122082.

Removal of CephFS volume plugin

CephFS volume plugin was removed in this release and the cephfs volume type became non-functional.

It is recommended that you use the CephFS CSI driver as a third-party storage driver instead. If you were using the CephFS volume plugin before upgrading the cluster version to v1.31, you must re-deploy your application to use the new driver.

CephFS volume plugin was formally marked as deprecated in v1.28.

Removal of Ceph RBD volume plugin

The v1.31 release removes the Ceph RBD volume plugin and its CSI migration support, making the rbd volume type non-functional.

It's recommended that you use the RBD CSI driver in your clusters instead. If you were using Ceph RBD volume plugin before upgrading the cluster version to v1.31, you must re-deploy your application to use the new driver.

The Ceph RBD volume plugin was formally marked as deprecated in v1.28.

Deprecation of non-CSI volume limit plugins in kube-scheduler

The v1.31 release will deprecate all non-CSI volume limit scheduler plugins, and will remove some already deprected plugins from the default plugins, including:

  • AzureDiskLimits
  • CinderLimits
  • EBSLimits
  • GCEPDLimits

It's recommended that you use the NodeVolumeLimits plugin instead because it can handle the same functionality as the removed plugins since those volume types have been migrated to CSI. Please replace the deprecated plugins with the NodeVolumeLimits plugin if you explicitly use them in the scheduler config. The AzureDiskLimits, CinderLimits, EBSLimits, and GCEPDLimits plugins will be removed in a future release.

These plugins will be removed from the default scheduler plugins list as they have been deprecated since Kubernetes v1.14.

Release notes and upgrade actions required

Check out the full details of the Kubernetes v1.31 release in our release notes.

Scheduler now uses QueueingHint when SchedulerQueueingHints is enabled

Added support to the scheduler to start using a QueueingHint registered for Pod/Updated events, to determine whether updates to previously unschedulable Pods have made them schedulable. The new support is active when the feature gate SchedulerQueueingHints is enabled.

Previously, when unschedulable Pods were updated, the scheduler always put Pods back to into a queue (activeQ / backoffQ). However not all updates to Pods make Pods schedulable, especially considering many scheduling constraints nowadays are immutable. Under the new behaviour, once unschedulable Pods are updated, the scheduling queue checks with QueueingHint(s) whether the update may make the pod(s) schedulable, and requeues them to activeQ or backoffQ only when at least one QueueingHint returns Queue.

Action required for custom scheduler plugin developers: Plugins have to implement a QueueingHint for Pod/Update event if the rejection from them could be resolved by updating unscheduled Pods themselves. Example: suppose you develop a custom plugin that denies Pods that have a schedulable=false label. Given Pods with a schedulable=false label will be schedulable if the schedulable=false label is removed, this plugin would implement QueueingHint for Pod/Update event that returns Queue when such label changes are made in unscheduled Pods. You can find more details in the pull request #122234.

Removal of kubelet --keep-terminated-pod-volumes command line flag

The kubelet flag --keep-terminated-pod-volumes, which was deprecated in 2017, was removed as part of the v1.31 release.

You can find more details in the pull request #122082.

Availability

Kubernetes v1.31 is available for download on GitHub or on the Kubernetes download page.

To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using minikube. You can also easily install v1.31 using kubeadm.

Release team

Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.

We would like to thank the entire release team for the hours spent hard at work to deliver the Kubernetes v1.31 release to our community. The Release Team's membership ranges from first-time shadows to returning team leads with experience forged over several release cycles. A very special thanks goes out our release lead, Angelos Kolaitis, for supporting us through a successful release cycle, advocating for us, making sure that we could all contribute in the best way possible, and challenging us to improve the release process.

Project velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

In the v1.31 release cycle, which ran for 14 weeks (May 7th to August 13th), we saw contributions to Kubernetes from 113 different companies and 528 individuals.

In the whole Cloud Native ecosystem we have 379 companies counting 2268 total contributors - which means that respect to the previous release cycle we experienced an astounding 63% increase on individuals contributing!

Source for this data:

By contribution we mean when someone makes a commit, code review, comment, creates an issue or PR, reviews a PR (including blogs and documentation) or comments on issues and PRs.

If you are interested in contributing visit this page to get started.

Check out DevStats to learn more about the overall velocity of the Kubernetes project and community.

Event update

Explore the upcoming Kubernetes and cloud-native events from August to November 2024, featuring KubeCon, KCD, and other notable conferences worldwide. Stay informed and engage with the Kubernetes community.

August 2024

September 2024

October 2024

November 2024

Upcoming release webinar

Join members of the Kubernetes v1.31 release team on Thursday, Thu Sep 12, 2024 10am PT to learn about the major features of this release, as well as deprecations and removals to help plan for upgrades. For more information and registration, visit the event page on the CNCF Online Programs site.

Get involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

Introducing Feature Gates to Client-Go: Enhancing Flexibility and Control

Kubernetes components use on-off switches called feature gates to manage the risk of adding a new feature. The feature gate mechanism is what enables incremental graduation of a feature through the stages Alpha, Beta, and GA.

Kubernetes components, such as kube-controller-manager and kube-scheduler, use the client-go library to interact with the API. The same library is used across the Kubernetes ecosystem to build controllers, tools, webhooks, and more. client-go now includes its own feature gating mechanism, giving developers and cluster administrators more control over how they adopt client features.

To learn more about feature gates in Kubernetes, visit Feature Gates.

Motivation

In the absence of client-go feature gates, each new feature separated feature availability from enablement in its own way, if at all. Some features were enabled by updating to a newer version of client-go. Others needed to be actively configured in each program that used them. A few were configurable at runtime using environment variables. Consuming a feature-gated functionality exposed by the kube-apiserver sometimes required a client-side fallback mechanism to remain compatible with servers that don’t support the functionality due to their age or configuration. In cases where issues were discovered in these fallback mechanisms, mitigation required updating to a fixed version of client-go or rolling back.

None of these approaches offer good support for enabling a feature by default in some, but not all, programs that consume client-go. Instead of enabling a new feature at first only for a single component, a change in the default setting immediately affects the default for all Kubernetes components, which broadens the blast radius significantly.

Feature gates in client-go

To address these challenges, substantial client-go features will be phased in using the new feature gate mechanism. It will allow developers and users to enable or disable features in a way that will be familiar to anyone who has experience with feature gates in the Kubernetes components.

Out of the box, simply by using a recent version of client-go, this offers several benefits.

For people who use software built with client-go:

  • Early adopters can enable a default-off client-go feature on a per-process basis.
  • Misbehaving features can be disabled without building a new binary.
  • The state of all known client-go feature gates is logged, allowing users to inspect it.

For people who develop software built with client-go:

  • By default, client-go feature gate overrides are read from environment variables. If a bug is found in a client-go feature, users will be able to disable it without waiting for a new release.
  • Developers can replace the default environment-variable-based overrides in a program to change defaults, read overrides from another source, or disable runtime overrides completely. The Kubernetes components use this customizability to integrate client-go feature gates with the existing --feature-gates command-line flag, feature enablement metrics, and logging.

Overriding client-go feature gates

Note: This describes the default method for overriding client-go feature gates at runtime. It can be disabled or customized by the developer of a particular program. In Kubernetes components, client-go feature gate overrides are controlled by the --feature-gates flag.

Features of client-go can be enabled or disabled by setting environment variables prefixed with KUBE_FEATURE. For example, to enable a feature named MyFeature, set the environment variable as follows:

 KUBE_FEATURE_MyFeature=true

To disable the feature, set the environment variable to false:

 KUBE_FEATURE_MyFeature=false

Note: Environment variables are case-sensitive on some operating systems. Therefore, KUBE_FEATURE_MyFeature and KUBE_FEATURE_MYFEATURE would be considered two different variables.

Customizing client-go feature gates

The default environment-variable based mechanism for feature gate overrides can be sufficient for many programs in the Kubernetes ecosystem, and requires no special integration. Programs that require different behavior can replace it with their own custom feature gate provider. This allows a program to do things like force-disable a feature that is known to work poorly, read feature gates directly from a remote configuration service, or accept feature gate overrides through command-line options.

The Kubernetes components replace client-go’s default feature gate provider with a shim to the existing Kubernetes feature gate provider. For all practical purposes, client-go feature gates are treated the same as other Kubernetes feature gates: they are wired to the --feature-gates command-line flag, included in feature enablement metrics, and logged on startup.

To replace the default feature gate provider, implement the Gates interface and call ReplaceFeatureGates at package initialization time, as in this simple example:

import (
 k8s.io/client-go/features
)

type AlwaysEnabledGates struct{}

func (AlwaysEnabledGates) Enabled(features.Feature) bool {
 return true
}

func init() {
 features.ReplaceFeatureGates(AlwaysEnabledGates{})
}

Implementations that need the complete list of defined client-go features can get it by implementing the Registry interface and calling AddFeaturesToExistingFeatureGates. For a complete example, refer to the usage within Kubernetes.

Summary

With the introduction of feature gates in client-go v1.30, rolling out a new client-go feature has become safer and easier. Users and developers can control the pace of their own adoption of client-go features. The work of Kubernetes contributors is streamlined by having a common mechanism for graduating features that span both sides of the Kubernetes API boundary.

Special shoutout to @sttts and @deads2k for their help in shaping this feature.

Spotlight on SIG API Machinery

We recently talked with Federico Bongiovanni (Google) and David Eads (Red Hat), Chairs of SIG API Machinery, to know a bit more about this Kubernetes Special Interest Group.

Introductions

Frederico (FSM): Hello, and thank your for your time. To start with, could you tell us about yourselves and how you got involved in Kubernetes?

David: I started working on OpenShift (the Red Hat distribution of Kubernetes) in the fall of 2014 and got involved pretty quickly in API Machinery. My first PRs were fixing kube-apiserver error messages and from there I branched out to kubectl (kubeconfigs are my fault!), auth (RBAC and *Review APIs are ports from OpenShift), apps (workqueues and sharedinformers for example). Don’t tell the others, but API Machinery is still my favorite :)

Federico: I was not as early in Kubernetes as David, but now it's been more than six years. At my previous company we were starting to use Kubernetes for our own products, and when I came across the opportunity to work directly with Kubernetes I left everything and boarded the ship (no pun intended). I joined Google and Kubernetes in early 2018, and have been involved since.

SIG Machinery's scope

FSM: It only takes a quick look at the SIG API Machinery charter to see that it has quite a significant scope, nothing less than the Kubernetes control plane. Could you describe this scope in your own words?

David: We own the kube-apiserver and how to efficiently use it. On the backend, that includes its contract with backend storage and how it allows API schema evolution over time. On the frontend, that includes schema best practices, serialization, client patterns, and controller patterns on top of all of it.

Federico: Kubernetes has a lot of different components, but the control plane has a really critical mission: it's your communication layer with the cluster and also owns all the extensibility mechanisms that make Kubernetes so powerful. We can't make mistakes like a regression, or an incompatible change, because the blast radius is huge.

FSM: Given this breadth, how do you manage the different aspects of it?

Federico: We try to organize the large amount of work into smaller areas. The working groups and subprojects are part of it. Different people on the SIG have their own areas of expertise, and if everything fails, we are really lucky to have people like David, Joe, and Stefan who really are "all terrain", in a way that keeps impressing me even after all these years. But on the other hand this is the reason why we need more people to help us carry the quality and excellence of Kubernetes from release to release.

An evolving collaboration model

FSM: Was the existing model always like this, or did it evolve with time - and if so, what would you consider the main changes and the reason behind them?

David: API Machinery has evolved over time both growing and contracting in scope. When trying to satisfy client access patterns it’s very easy to add scope both in terms of features and applying them.

A good example of growing scope is the way that we identified a need to reduce memory utilization by clients writing controllers and developed shared informers. In developing shared informers and the controller patterns use them (workqueues, error handling, and listers), we greatly reduced memory utilization and eliminated many expensive lists. The downside: we grew a new set of capability to support and effectively took ownership of that area from sig-apps.

For an example of more shared ownership: building out cooperative resource management (the goal of server-side apply), kubectl expanded to take ownership of leveraging the server-side apply capability. The transition isn’t yet complete, but SIG CLI manages that usage and owns it.

FSM: And for the boundary between approaches, do you have any guidelines?

David: I think much depends on the impact. If the impact is local in immediate effect, we advise other SIGs and let them move at their own pace. If the impact is global in immediate effect without a natural incentive, we’ve found a need to press for adoption directly.

FSM: Still on that note, SIG Architecture has an API Governance subproject, is it mostly independent from SIG API Machinery or are there important connection points?

David: The projects have similar sounding names and carry some impacts on each other, but have different missions and scopes. API Machinery owns the how and API Governance owns the what. API conventions, the API approval process, and the final say on individual k8s.io APIs belong to API Governance. API Machinery owns the REST semantics and non-API specific behaviors.

Federico: I really like how David put it: "API Machinery owns the how and API Governance owns the what": we don't own the actual APIs, but the actual APIs live through us.

The challenges of Kubernetes popularity

FSM: With the growth in Kubernetes adoption we have certainly seen increased demands from the Control Plane: how is this felt and how does it influence the work of the SIG?

David: It’s had a massive influence on API Machinery. Over the years we have often responded to and many times enabled the evolutionary stages of Kubernetes. As the central orchestration hub of nearly all capability on Kubernetes clusters, we both lead and follow the community. In broad strokes I see a few evolution stages for API Machinery over the years, with constantly high activity.

  1. Finding purpose: pre-1.0 up until v1.3 (up to our first 1000+ nodes/namespaces) or so. This time was characterized by rapid change. We went through five different versions of our schemas and rose to meet the need. We optimized for quick, in-tree API evolution (sometimes to the detriment of longer term goals), and defined patterns for the first time.

  2. Scaling to meet the need: v1.3-1.9 (up to shared informers in controllers) or so. When we started trying to meet customer needs as we gained adoption, we found severe scale limitations in terms of CPU and memory. This was where we broadened API machinery to include access patterns, but were still heavily focused on in-tree types. We built the watch cache, protobuf serialization, and shared caches.

  3. Fostering the ecosystem: v1.8-1.21 (up to CRD v1) or so. This was when we designed and wrote CRDs (the considered replacement for third-party-resources), the immediate needs we knew were coming (admission webhooks), and evolution to best practices we knew we needed (API schemas). This enabled an explosion of early adopters willing to work very carefully within the constraints to enable their use-cases for servicing pods. The adoption was very fast, sometimes outpacing our capability, and creating new problems.

  4. Simplifying deployments: v1.22+. In the relatively recent past, we’ve been responding to pressures or running kube clusters at scale with large numbers of sometimes-conflicting ecosystem projects using our extensions mechanisms. Lots of effort is now going into making platform extensions easier to write and safer to manage by people who don't hold PhDs in kubernetes. This started with things like server-side-apply and continues today with features like webhook match conditions and validating admission policies.

Work in API Machinery has a broad impact across the project and the ecosystem. It’s an exciting area to work for those able to make a significant time investment on a long time horizon.

The road ahead

FSM: With those different evolutionary stages in mind, what would you pinpoint as the top priorities for the SIG at this time?

David: Reliability, efficiency, and capability in roughly that order.

With the increased usage of our kube-apiserver and extensions mechanisms, we find that our first set of extensions mechanisms, while fairly complete in terms of capability, carry significant risks in terms of potential mis-use with large blast radius. To mitigate these risks, we’re investing in features that reduce the blast radius for accidents (webhook match conditions) and which provide alternative mechanisms with lower risk profiles for most actions (validating admission policy).

At the same time, the increased usage has made us more aware of scaling limitations that we can improve both server and client-side. Efforts here include more efficient serialization (CBOR), reduced etcd load (consistent reads from cache), and reduced peak memory usage (streaming lists).

And finally, the increased usage has highlighted some long existing gaps that we’re closing. Things like field selectors for CRDs which the Batch Working Group is eager to leverage and will eventually form the basis for a new way to prevent trampoline pod attacks from exploited nodes.

Joining the fun

FSM: For anyone wanting to start contributing, what's your suggestions?

Federico: SIG API Machinery is not an exception to the Kubernetes motto: Chop Wood and Carry Water. There are multiple weekly meetings that are open to everybody, and there is always more work to be done than people to do it.

I acknowledge that API Machinery is not easy, and the ramp up will be steep. The bar is high, because of the reasons we've been discussing: we carry a huge responsibility. But of course with passion and perseverance many people has ramped up through the years, and we hope more will come.

In terms of concrete opportunities, there is the SIG meeting every two weeks. Everyone is welcome to attend and listen, see what the group talks about, see what's going on in this release, etc.

Also two times a week, Tuesday and Thursday, we have the public Bug Triage, where we go through everything new from the last meeting. We've been keeping this practice for more than 7 years now. It's a great opportunity to volunteer to review code, fix bugs, improve documentation, etc. Tuesday's it's at 1 PM (PST) and Thursday is on an EMEA friendly time (9:30 AM PST). We are always looking to improve, and we hope to be able to provide more concrete opportunities to join and participate in the future.

FSM: Excellent, thank you! Any final comments you would like to share with our readers?

Federico: As I mentioned, the first steps might be hard, but the reward is also larger. Working on API Machinery is working on an area of huge impact (millions of users?), and your contributions will have a direct outcome in the way that Kubernetes works and the way that it's used. For me that's enough reward and motivation!

Kubernetes Removals and Major Changes In v1.31

As Kubernetes develops and matures, features may be deprecated, removed, or replaced with better ones for the project's overall health. This article outlines some planned changes for the Kubernetes v1.31 release that the release team feels you should be aware of for the continued maintenance of your Kubernetes environment. The information listed below is based on the current status of the v1.31 release. It may change before the actual release date.

The Kubernetes API removal and deprecation process

The Kubernetes project has a well-documented deprecation policy for features. This policy states that stable APIs may only be deprecated when a newer, stable version of that API is available and that APIs have a minimum lifetime for each stability level. A deprecated API has been marked for removal in a future Kubernetes release. It will continue to function until removal (at least one year from the deprecation), but usage will display a warning. Removed APIs are no longer available in the current version, so you must migrate to using the replacement.

  • Generally available (GA) or stable API versions may be marked as deprecated but must not be removed within a major version of Kubernetes.

  • Beta or pre-release API versions must be supported for 3 releases after the deprecation.

  • Alpha or experimental API versions may be removed in any release without prior deprecation notice.

Whether an API is removed because a feature graduated from beta to stable or because that API did not succeed, all removals comply with this deprecation policy. Whenever an API is removed, migration options are communicated in the documentation.

A note about SHA-1 signature support

In go1.18 (released in March 2022), the crypto/x509 library started to reject certificates signed with a SHA-1 hash function. While SHA-1 is established to be unsafe and publicly trusted Certificate Authorities have not issued SHA-1 certificates since 2015, there might still be cases in the context of Kubernetes where user-provided certificates are signed using a SHA-1 hash function through private authorities with them being used for Aggregated API Servers or webhooks. If you have relied on SHA-1 based certificates, you must explicitly opt back into its support by setting GODEBUG=x509sha1=1 in your environment.

Given Go's compatibility policy for GODEBUGs, the x509sha1 GODEBUG and the support for SHA-1 certificates will fully go away in go1.24 which will be released in the first half of 2025. If you rely on SHA-1 certificates, please start moving off them.

Please see Kubernetes issue #125689 to get a better idea of timelines around the support for SHA-1 going away, when Kubernetes releases plans to adopt go1.24, and for more details on how to detect usage of SHA-1 certificates via metrics and audit logging.

Deprecations and removals in Kubernetes 1.31

Deprecation of status.nodeInfo.kubeProxyVersion field for Nodes (KEP 4004)

The .status.nodeInfo.kubeProxyVersion field of Nodes is being deprecated in Kubernetes v1.31, and will be removed in a later release. It's being deprecated because the value of this field wasn't (and isn't) accurate. This field is set by the kubelet, which does not have reliable information about the kube-proxy version or whether kube-proxy is running.

The DisableNodeKubeProxyVersion feature gate will be set to true in by default in v1.31 and the kubelet will no longer attempt to set the .status.kubeProxyVersion field for its associated Node.

Removal of all in-tree integrations with cloud providers

As highlighted in a previous article, the last remaining in-tree support for cloud provider integration will be removed as part of the v1.31 release. This doesn't mean you can't integrate with a cloud provider, however you now must use the recommended approach using an external integration. Some integrations are part of the Kubernetes project and others are third party software.

This milestone marks the completion of the externalization process for all cloud providers' integrations from the Kubernetes core (KEP-2395), a process started with Kubernetes v1.26. This change helps Kubernetes to get closer to being a truly vendor-neutral platform.

For further details on the cloud provider integrations, read our v1.29 Cloud Provider Integrations feature blog. For additional context about the in-tree code removal, we invite you to check the (v1.29 deprecation blog).

The latter blog also contains useful information for users who need to migrate to version v1.29 and later.

Removal of kubelet --keep-terminated-pod-volumes command line flag

The kubelet flag --keep-terminated-pod-volumes, which was deprecated in 2017, will be removed as part of the v1.31 release.

You can find more details in the pull request #122082.

Removal of CephFS volume plugin

CephFS volume plugin was removed in this release and the cephfs volume type became non-functional.

It is recommended that you use the CephFS CSI driver as a third-party storage driver instead. If you were using the CephFS volume plugin before upgrading the cluster version to v1.31, you must re-deploy your application to use the new driver.

CephFS volume plugin was formally marked as deprecated in v1.28.

Removal of Ceph RBD volume plugin

The v1.31 release will remove the Ceph RBD volume plugin and its CSI migration support, making the rbd volume type non-functional.

It's recommended that you use the RBD CSI driver in your clusters instead. If you were using Ceph RBD volume plugin before upgrading the cluster version to v1.31, you must re-deploy your application to use the new driver.

The Ceph RBD volume plugin was formally marked as deprecated in v1.28.

Deprecation of non-CSI volume limit plugins in kube-scheduler

The v1.31 release will deprecate all non-CSI volume limit scheduler plugins, and will remove some already deprected plugins from the default plugins, including:

  • AzureDiskLimits
  • CinderLimits
  • EBSLimits
  • GCEPDLimits

It's recommended that you use the NodeVolumeLimits plugin instead because it can handle the same functionality as the removed plugins since those volume types have been migrated to CSI. Please replace the deprecated plugins with the NodeVolumeLimits plugin if you explicitly use them in the scheduler config. The AzureDiskLimits, CinderLimits, EBSLimits, and GCEPDLimits plugins will be removed in a future release.

These plugins will be removed from the default scheduler plugins list as they have been deprecated since Kubernetes v1.14.

Looking ahead

The official list of API removals planned for Kubernetes v1.32 include:

  • The flowcontrol.apiserver.k8s.io/v1beta3 API version of FlowSchema and PriorityLevelConfiguration will be removed. To prepare for this, you can edit your existing manifests and rewrite client software to use the flowcontrol.apiserver.k8s.io/v1 API version, available since v1.29. All existing persisted objects are accessible via the new API. Notable changes in flowcontrol.apiserver.k8s.io/v1beta3 include that the PriorityLevelConfiguration spec.limited.nominalConcurrencyShares field only defaults to 30 when unspecified, and an explicit value of 0 is not changed to 30.

For more information, please refer to the API deprecation guide.

Want to know more?

The Kubernetes release notes announce deprecations. We will formally announce the deprecations in Kubernetes v1.31 as part of the CHANGELOG for that release.

You can see the announcements of pending deprecations in the release notes for:

Spotlight on SIG Node

In the world of container orchestration, Kubernetes reigns supreme, powering some of the most complex and dynamic applications across the globe. Behind the scenes, a network of Special Interest Groups (SIGs) drives Kubernetes' innovation and stability.

Today, I have the privilege of speaking with Matthias Bertschy, Gunju Kim, and Sergey Kanzhelev, members of SIG Node, who will shed some light on their roles, challenges, and the exciting developments within SIG Node.

Answers given collectively by all interviewees will be marked by their initials.

Introductions

Arpit: Thank you for joining us today. Could you please introduce yourselves and provide a brief overview of your roles within SIG Node?

Matthias: My name is Matthias Bertschy, I am French and live next to Lake Geneva, near the French Alps. I have been a Kubernetes contributor since 2017, a reviewer for SIG Node and a maintainer of Prow. I work as a Senior Kubernetes Developer for a security startup named ARMO, which donated Kubescape to the CNCF.

Lake Geneva and the Alps

Gunju: My name is Gunju Kim. I am a software engineer at NAVER, where I focus on developing a cloud platform for search services. I have been contributing to the Kubernetes project in my free time since 2021.

Sergey: My name is Sergey Kanzhelev. I have worked on Kubernetes and Google Kubernetes Engine for 3 years and have worked on open-source projects for many years now. I am a chair of SIG Node.

Understanding SIG Node

Arpit: Thank you! Could you provide our readers with an overview of SIG Node's responsibilities within the Kubernetes ecosystem?

M/G/S: SIG Node is one of the first if not the very first SIG in Kubernetes. The SIG is responsible for all iterations between Kubernetes and node resources, as well as node maintenance itself. This is quite a large scope, and the SIG owns a large part of the Kubernetes codebase. Because of this wide ownership, SIG Node is always in contact with other SIGs such as SIG Network, SIG Storage, and SIG Security and almost any new features and developments in Kubernetes involves SIG Node in some way.

Arpit: How does SIG Node contribute to Kubernetes' performance and stability?

M/G/S: Kubernetes works on nodes of many different sizes and shapes, from small physical VMs with cheap hardware to large AI/ML-optimized GPU-enabled nodes. Nodes may stay online for months or maybe be short-lived and be preempted at any moment as they are running on excess compute of a cloud provider.

kubelet — the Kubernetes agent on a node — must work in all these environments reliably. As for the performance of kubelet operations, this is becoming increasingly important today. On one hand, as Kubernetes is being used on extra small nodes more and more often in telecom and retail environments, it needs to scale into the smallest footprint possible. On the other hand, with AI/ML workloads where every node is extremely expensive, every second of delayed operations can visibly change the price of computation.

Challenges and Opportunities

Arpit: What upcoming challenges and opportunities is SIG Node keeping an eye on?

M/G/S: As Kubernetes enters the second decade of its life, we see a huge demand to support new workload types. And SIG Node will play a big role in this. The Sidecar KEP, which we will be talking about later, is one of the examples of increased emphasis on supporting new workload types.

The key challenge we will have in the next few years is how to keep innovations while maintaining high quality and backward compatibility of existing scenarios. SIG Node will continue to play a central role in Kubernetes.

Arpit: And are there any ongoing research or development areas within SIG Node that excite you?

M/G/S: Supporting new workload types is a fascinating area for us. Our recent exploration of sidecar containers is a testament to this. Sidecars offer a versatile solution for enhancing application functionality without altering the core codebase.

Arpit: What are some of the challenges you've faced while maintaining SIG Node, and how have you overcome them?

M/G/S: The biggest challenge of SIG Node is its size and the many feature requests it receives. We are encouraging more people to join as reviewers and are always open to improving processes and addressing feedback. For every release, we run the feedback session at the SIG Node meeting and identify problematic areas and action items.

Arpit: Are there specific technologies or advancements that SIG Node is closely monitoring or integrating into Kubernetes?

M/G/S: Developments in components that the SIG depends on, like container runtimes (e.g. containerd and CRI-O, and OS features are something we contribute to and monitor closely. For example, there is an upcoming cgroup v1 deprecation and removal that Kubernetes and SIG Node will need to guide Kubernetes users through. Containerd is also releasing version 2.0, which removes deprecated features, which will affect Kubernetes users.

Arpit: Could you share a memorable experience or achievement from your time as a SIG Node maintainer that you're particularly proud of?

Mathias: I think the best moment was when my first KEP (introducing the startupProbe) finally graduated to GA (General Availability). I also enjoy seeing my contributions being used daily by contributors, such as the comment containing the GitHub tree hash used to retain LGTM despite squash commits.

Sidecar containers

Arpit: Can you provide more context on the concept of sidecar containers and their evolution in the context of Kubernetes?

M/G/S: The concept of sidecar containers dates back to 2015 when Kubernetes introduced the idea of composite containers. These additional containers, running alongside the main application container within the same pod, were seen as a way to extend and enhance application functionality without modifying the core codebase. Early adopters of sidecars employed custom scripts and configurations to manage them, but this approach presented challenges in terms of consistency and scalability.

Arpit: Can you share specific use cases or examples where sidecar containers are particularly beneficial?

M/G/S: Sidecar containers are a versatile tool that can be used to enhance the functionality of applications in a variety of ways:

  • Logging and monitoring: Sidecar containers can be used to collect logs and metrics from the primary application container and send them to a centralized logging and monitoring system.
  • Traffic filtering and routing: Sidecar containers can be used to filter and route traffic to and from the primary application container.
  • Encryption and decryption: Sidecar containers can be used to encrypt and decrypt data as it flows between the primary application container and external services.
  • Data synchronization: Sidecar containers can be used to synchronize data between the primary application container and external databases or services.
  • Fault injection: Sidecar containers can be used to inject faults into the primary application container in order to test its resilience to failures.

Arpit: The proposal mentions that some companies are using a fork of Kubernetes with sidecar functionality added. Can you provide insights into the level of adoption and community interest in this feature?

M/G/S: While we lack concrete metrics to measure adoption rates, the KEP has garnered significant interest from the community, particularly among service mesh vendors like Istio, who actively participated in its alpha testing phase. The KEP's visibility through numerous blog posts, interviews, talks, and workshops further demonstrates its widespread appeal. The KEP addresses the growing demand for additional capabilities alongside main containers in Kubernetes pods, such as network proxies, logging systems, and security measures. The community acknowledges the importance of providing easy migration paths for existing workloads to facilitate widespread adoption of the feature.

Arpit: Are there any notable examples or success stories from companies using sidecar containers in production?

M/G/S: It is still too early to expect widespread adoption in production environments. The 1.29 release has only been available in Google Kubernetes Engine (GKE) since January 11, 2024, and there still needs to be comprehensive documentation on how to enable and use them effectively via universal injector. Istio, a popular service mesh platform, also lacks proper documentation for enabling native sidecars, making it difficult for developers to get started with this new feature. However, as native sidecar support matures and documentation improves, we can expect to see wider adoption of this technology in production environments.

Arpit: The proposal suggests introducing a restartPolicy field for init containers to indicate sidecar functionality. Can you explain how this solution addresses the outlined challenges?

M/G/S: The proposal to introduce a restartPolicy field for init containers addresses the outlined challenges by utilizing existing infrastructure and simplifying sidecar management. This approach avoids adding new fields to the pod specification, keeping it manageable and avoiding more clutter. By leveraging the existing init container mechanism, sidecars can be run alongside regular init containers during pod startup, ensuring a consistent ordering of initialization. Additionally, setting the restart policy of sidecar init containers to Always explicitly states that they continue running even after the main application container terminates, enabling persistent services like logging and monitoring until the end of the workload.

Arpit: How will the introduction of the restartPolicy field for init containers affect backward compatibility with existing Kubernetes configurations?

M/G/S: The introduction of the restartPolicy field for init containers will maintain backward compatibility with existing Kubernetes configurations. Existing init containers will continue to function as they have before, and the new restartPolicy field will only apply to init containers explicitly marked as sidecars. This approach ensures that existing applications and deployments will not be disrupted by the new feature, and provides a more streamlined way to define and manage sidecars.

Contributing to SIG Node

Arpit: What is the best place for the new members and especially beginners to contribute?

M/G/S: New members and beginners can contribute to the Sidecar KEP (Kubernetes Enhancement Proposal) by:

  • Raising awareness: Create content that highlights the benefits and use cases of sidecars. This can educate others about the feature and encourage its adoption.
  • Providing feedback: Share your experiences with sidecars, both positive and negative. This feedback can be used to improve the feature and make it more widely usable.
  • Sharing your use cases: If you are using sidecars in production, share your experiences with others. This can help to demonstrate the real-world value of the feature and encourage others to adopt it.
  • Improving the documentation: Help to clarify and expand the documentation for the feature. This can make it easier for others to understand and use sidecars.

In addition to the Sidecar KEP, there are many other areas where SIG Node needs more contributors:

  • Test coverage: SIG Node is always looking for ways to improve the test coverage of Kubernetes components.
  • CI maintenance: SIG Node maintains a suite of e2e tests ensuring Kubernetes components function as intended across a variety of scenarios.

Conclusion

In conclusion, SIG Node stands as a cornerstone in Kubernetes' journey, ensuring its reliability and adaptability in the ever-changing landscape of cloud-native computing. With dedicated members like Matthias, Gunju, and Sergey leading the charge, SIG Node remains at the forefront of innovation, driving Kubernetes towards new horizons.

10 Years of Kubernetes

KCSEU 2024 group photo

Ten (10) years ago, on June 6th, 2014, the first commit of Kubernetes was pushed to GitHub. That first commit with 250 files and 47,501 lines of go, bash and markdown kicked off the project we have today. Who could have predicted that 10 years later, Kubernetes would grow to become one of the largest Open Source projects to date with over 88,000 contributors from more than 8,000 companies, across 44 countries.

KCSCN 2019

This milestone isn't just for Kubernetes but for the Cloud Native ecosystem that blossomed from it. There are close to 200 projects within the CNCF itself, with contributions from 240,000+ individual contributors and thousands more in the greater ecosystem. Kubernetes would not be where it is today without them, the 7M+ Developers, and the even larger user community that have all helped shape the ecosystem that it is today.

Kubernetes' beginnings - a converging of technologies

The ideas underlying Kubernetes started well before the first commit, or even the first prototype (which came about in 2013). In the early 2000s, Moore's Law was well in effect. Computing hardware was becoming more and more powerful at an incredibly fast rate. Correspondingly, applications were growing more and more complex. This combination of hardware commoditization and application complexity pointed to a need to further abstract software from hardware, and solutions started to emerge.

Like many companies at the time, Google was scaling rapidly, and its engineers were interested in the idea of creating a form of isolation in the Linux kernel. Google engineer Rohit Seth described the concept in an email in 2006:

We use the term container to indicate a structure against which we track and charge utilization of system resources like memory, tasks, etc. for a Workload.

The future of Linux containers

In March of 2013, a 5-minute lightning talk called "The future of Linux Containers," presented by Solomon Hykes at PyCon, introduced an upcoming open source tool called "Docker" for creating and using Linux Containers. Docker introduced a level of usability to Linux Containers that made them accessible to more users than ever before, and the popularity of Docker, and thus of Linux Containers, skyrocketed. With Docker making the abstraction of Linux Containers accessible to all, running applications in much more portable and repeatable ways was suddenly possible, but the question of scale remained.

Google's Borg system for managing application orchestration at scale had adopted Linux containers as they were developed in the mid-2000s. Since then, the company had also started working on a new version of the system called "Omega." Engineers at Google who were familiar with the Borg and Omega systems saw the popularity of containerization driven by Docker. They recognized not only the need for an open source container orchestration system but its "inevitability," as described by Brendan Burns in this blog post. That realization in the fall of 2013 inspired a small team to start working on a project that would later become Kubernetes. That team included Joe Beda, Brendan Burns, Craig McLuckie, Ville Aikas, Tim Hockin, Dawn Chen, Brian Grant, and Daniel Smith.

A decade of Kubernetes

KubeCon EU 2017

Kubernetes' history begins with that historic commit on June 6th, 2014, and the subsequent announcement of the project in a June 10th keynote by Google engineer Eric Brewer at DockerCon 2014 (and its corresponding Google blog).

Over the next year, a small community of contributors, largely from Google and Red Hat, worked hard on the project, culminating in a version 1.0 release on July 21st, 2015. Alongside 1.0, Google announced that Kubernetes would be donated to a newly formed branch of the Linux Foundation called the Cloud Native Computing Foundation (CNCF).

Despite reaching 1.0, the Kubernetes project was still very challenging to use and understand. Kubernetes contributor Kelsey Hightower took special note of the project's shortcomings in ease of use and on July 7, 2016, he pushed the first commit of his famed "Kubernetes the Hard Way" guide.

The project has changed enormously since its original 1.0 release; experiencing a number of big wins such as Custom Resource Definitions (CRD) going GA in 1.16 or full dual stack support launching in 1.23 and community "lessons learned" from the removal of widely used beta APIs in 1.22 or the deprecation of Dockershim.

Some notable updates, milestones and events since 1.0 include:

  • December 2016 - Kubernetes 1.5 introduces runtime pluggability with initial CRI support and alpha Windows node support. OpenAPI also appears for the first time, paving the way for clients to be able to discover extension APIs.
    • This release also introduced StatefulSets and PodDisruptionBudgets in Beta.
  • April 2017 — Introduction of Role-Based Access Controls or RBAC.
  • June 2017 — In Kubernetes 1.7, ThirdPartyResources or "TPRs" are replaced with CustomResourceDefinitions (CRDs).
  • December 2017 — Kubernetes 1.9 sees the Workloads API becoming GA (Generally Available). The release blog states: "Deployment and ReplicaSet, two of the most commonly used objects in Kubernetes, are now stabilized after more than a year of real-world use and feedback."
  • December 2018 — In 1.13, the Container Storage Interface (CSI) reaches GA, kubeadm tool for bootstrapping minimum viable clusters reaches GA, and CoreDNS becomes the default DNS server.
  • September 2019 — Custom Resource Definitions go GA in Kubernetes 1.16.
  • August 2020 — Kubernetes 1.19 increases the support window for releases to 1 year.
  • December 2020 — Dockershim is deprecated in 1.20
  • April 2021 — the Kubernetes release cadence changes from 4 releases per year to 3 releases per year.
  • July 2021 — Widely used beta APIs are removed in Kubernetes 1.22.
  • May 2022 — Kubernetes 1.24 sees beta APIs become disabled by default to reduce upgrade conflicts and removal of Dockershim, leading to widespread user confusion (we've since improved our communication!)
  • December 2022 — In 1.26, there was a significant batch and Job API overhaul that paved the way for better support for AI /ML / batch workloads.

PS: Curious to see how far the project has come for yourself? Check out this tutorial for spinning up a Kubernetes 1.0 cluster created by community members Carlos Santana, Amim Moises Salum Knabben, and James Spurin.


Kubernetes offers more extension points than we can count. Originally designed to work with Docker and only Docker, now you can plug in any container runtime that adheres to the CRI standard. There are other similar interfaces: CSI for storage and CNI for networking. And that's far from all you can do. In the last decade, whole new patterns have emerged, such as using

Custom Resource Definitions (CRDs) to support third-party controllers - now a huge part of the Kubernetes ecosystem.

The community building the project has also expanded immensely over the last decade. Using DevStats, we can see the incredible volume of contribution over the last decade that has made Kubernetes the second-largest open source project in the world:

  • 88,474 contributors
  • 15,121 code committers
  • 4,228,347 contributions
  • 158,530 issues
  • 311,787 pull requests

Kubernetes today

KubeCon NA 2023

Since its early days, the project has seen enormous growth in technical capability, usage, and contribution. The project is still actively working to improve and better serve its users.

In the upcoming 1.31 release, the project will celebrate the culmination of an important long-term project: the removal of in-tree cloud provider code. In this largest migration in Kubernetes history, roughly 1.5 million lines of code have been removed, reducing the binary sizes of core components by approximately 40%. In the project's early days, it was clear that extensibility would be key to success. However, it wasn't always clear how that extensibility should be achieved. This migration removes a variety of vendor-specific capabilities from the core Kubernetes code base. Vendor-specific capabilities can now be better served by other pluggable extensibility features or patterns, such as Custom Resource Definitions (CRDs) or API standards like the Gateway API. Kubernetes also faces new challenges in serving its vast user base, and the community is adapting accordingly. One example of this is the migration of image hosting to the new, community-owned registry.k8s.io. The egress bandwidth and costs of providing pre-compiled binary images for user consumption have become immense. This new registry change enables the community to continue providing these convenient images in more cost- and performance-efficient ways. Make sure you check out the blog post and update any automation you have to use registry.k8s.io!

The future of Kubernetes

A decade in, the future of Kubernetes still looks bright. The community is prioritizing changes that both improve the user experiences, and enhance the sustainability of the project. The world of application development continues to evolve, and Kubernetes is poised to change along with it.

In 2024, the advent of AI changed a once-niche workload type into one of prominent importance. Distributed computing and workload scheduling has always gone hand-in-hand with the resource-intensive needs of Artificial Intelligence, Machine Learning, and High Performance Computing workloads. Contributors are paying close attention to the needs of newly developed workloads and how Kubernetes can best serve them. The new Serving Working Group is one example of how the community is organizing to address these workloads' needs. It's likely that the next few years will see improvements to Kubernetes' ability to manage various types of hardware, and its ability to manage the scheduling of large batch-style workloads which are run across hardware in chunks.

The ecosystem around Kubernetes will continue to grow and evolve. In the future, initiatives to maintain the sustainability of the project, like the migration of in-tree vendor code and the registry change, will be ever more important.

The next 10 years of Kubernetes will be guided by its users and the ecosystem, but most of all, by the people who contribute to it. The community remains open to new contributors. You can find more information about contributing in our New Contributor Course at https://k8s.dev/docs/onboarding.

We look forward to building the future of Kubernetes with you!

KCSNA 2023

Completing the largest migration in Kubernetes history

Since as early as Kubernetes v1.7, the Kubernetes project has pursued the ambitious goal of removing built-in cloud provider integrations (KEP-2395). While these integrations were instrumental in Kubernetes' early development and growth, their removal was driven by two key factors: the growing complexity of maintaining native support for every cloud provider across millions of lines of Go code, and the desire to establish Kubernetes as a truly vendor-neutral platform.

After many releases, we're thrilled to announce that all cloud provider integrations have been successfully migrated from the core Kubernetes repository to external plugins. In addition to achieving our initial objectives, we've also significantly streamlined Kubernetes by removing roughly 1.5 million lines of code and reducing the binary sizes of core components by approximately 40%.

This migration was a complex and long-running effort due to the numerous impacted components and the critical code paths that relied on the built-in integrations for the five initial cloud providers: Google Cloud, AWS, Azure, OpenStack, and vSphere. To successfully complete this migration, we had to build four new subsystems from the ground up:

  1. Cloud controller manager (KEP-2392)
  2. API server network proxy (KEP-1281)
  3. kubelet credential provider plugins (KEP-2133)
  4. Storage migration to use CSI (KEP-625)

Each subsystem was critical to achieve full feature parity with built-in capabilities and required several releases to bring each subsystem to GA-level maturity with a safe and reliable migration path. More on each subsystem below.

Cloud controller manager

The cloud controller manager was the first external component introduced in this effort, replacing functionality within the kube-controller-manager and kubelet that directly interacted with cloud APIs. This essential component is responsible for initializing nodes by applying metadata labels that indicate the cloud region and zone a Node is running on, as well as IP addresses that are only known to the cloud provider. Additionally, it runs the service controller, which is responsible for provisioning cloud load balancers for Services of type LoadBalancer.

Kubernetes components

To learn more, read Cloud Controller Manager in the Kubernetes documentation.

API server network proxy

The API Server Network Proxy project, initiated in 2018 in collaboration with SIG API Machinery, aimed to replace the SSH tunneler functionality within the kube-apiserver. This tunneler had been used to securely proxy traffic between the Kubernetes control plane and nodes, but it heavily relied on provider-specific implementation details embedded in the kube-apiserver to establish these SSH tunnels.

Now, the API Server Network Proxy is a GA-level extension point within the kube-apiserver. It offers a generic proxying mechanism that can route traffic from the API server to nodes through a secure proxy, eliminating the need for the API server to have any knowledge of the specific cloud provider it is running on. This project also introduced the Konnectivity project, which has seen growing adoption in production environments.

You can learn more about the API Server Network Proxy from its README.

Credential provider plugins for the kubelet

The Kubelet credential provider plugin was developed to replace the kubelet's built-in functionality for dynamically fetching credentials for image registries hosted on Google Cloud, AWS, or Azure. The legacy capability was convenient as it allowed the kubelet to seamlessly retrieve short-lived tokens for pulling images from GCR, ECR, or ACR. However, like other areas of Kubernetes, supporting this required the kubelet to have specific knowledge of different cloud environments and APIs.

Introduced in 2019, the credential provider plugin mechanism offers a generic extension point for the kubelet to execute plugin binaries that dynamically provide credentials for images hosted on various clouds. This extensibility expands the kubelet's capabilities to fetch short-lived tokens beyond the initial three cloud providers.

To learn more, read kubelet credential provider for authenticated image pulls.

Storage plugin migration from in-tree to CSI

The Container Storage Interface (CSI) is a control plane standard for managing block and file storage systems in Kubernetes and other container orchestrators that went GA in 1.13. It was designed to replace the in-tree volume plugins built directly into Kubernetes with drivers that can run as Pods within the Kubernetes cluster. These drivers communicate with kube-controller-manager storage controllers via the Kubernetes API, and with kubelet through a local gRPC endpoint. Now there are over 100 CSI drivers available across all major cloud and storage vendors, making stateful workloads in Kubernetes a reality.

However, a major challenge remained on how to handle all the existing users of in-tree volume APIs. To retain API backwards compatibility, we built an API translation layer into our controllers that will convert the in-tree volume API into the equivalent CSI API. This allowed us to redirect all storage operations to the CSI driver, paving the way for us to remove the code for the built-in volume plugins without removing the API.

You can learn more about In-tree Storage migration in Kubernetes In-Tree to CSI Volume Migration Moves to Beta.

What's next?

This migration has been the primary focus for SIG Cloud Provider over the past few years. With this significant milestone achieved, we will be shifting our efforts towards exploring new and innovative ways for Kubernetes to better integrate with cloud providers, leveraging the external subsystems we've built over the years. This includes making Kubernetes smarter in hybrid environments where nodes in the cluster can run on both public and private clouds, as well as providing better tools and frameworks for developers of external providers to simplify and streamline their integration efforts.

With all the new features, tools, and frameworks being planned, SIG Cloud Provider is not forgetting about the other side of the equation: testing. Another area of focus for the SIG's future activities is the improvement of cloud controller testing to include more providers. The ultimate goal of this effort being to create a testing framework that will include as many providers as possible so that we give the Kubernetes community the highest levels of confidence about their Kubernetes environments.

If you're using a version of Kubernetes older than v1.29 and haven't migrated to an external cloud provider yet, we recommend checking out our previous blog post Kubernetes 1.29: Cloud Provider Integrations Are Now Separate Components.It provides detailed information on the changes we've made and offers guidance on how to migrate to an external provider. Starting in v1.31, in-tree cloud providers will be permanently disabled and removed from core Kubernetes components.

If you’re interested in contributing, come join our bi-weekly SIG meetings!

Gateway API v1.1: Service mesh, GRPCRoute, and a whole lot more

Gateway API logo

Following the GA release of Gateway API last October, Kubernetes SIG Network is pleased to announce the v1.1 release of Gateway API. In this release, several features are graduating to Standard Channel (GA), notably including support for service mesh and GRPCRoute. We're also introducing some new experimental features, including session persistence and client certificate verification.

What's new

Graduation to Standard

This release includes the graduation to Standard of four eagerly awaited features. This means they are no longer experimental concepts; inclusion in the Standard release channel denotes a high level of confidence in the API surface and provides guarantees of backward compatibility. Of course, as with any other Kubernetes API, Standard Channel features can continue to evolve with backward-compatible additions over time, and we certainly expect further refinements and improvements to these new features in the future. For more information on how all of this works, refer to the Gateway API Versioning Policy.

Service Mesh Support

Service mesh support in Gateway API allows service mesh users to use the same API to manage ingress traffic and mesh traffic, reusing the same policy and routing interfaces. In Gateway API v1.1, routes (such as HTTPRoute) can now have a Service as a parentRef, to control how traffic to specific services behave. For more information, read the Gateway API service mesh documentation or see the list of Gateway API implementations.

As an example, one could do a canary deployment of a workload deep in an application's call graph with an HTTPRoute as follows:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: color-canary
  namespace: faces
spec:
  parentRefs:
    - name: color
      kind: Service
      group: ""
      port: 80
  rules:
  - backendRefs:
    - name: color
      port: 80
      weight: 50
    - name: color2
      port: 80
      weight: 50

This would split traffic sent to the color Service in the faces namespace 50/50 between the original color Service and the color2 Service, using a portable configuration that's easy to move from one mesh to another.

GRPCRoute

If you are already using the experimental version of GRPCRoute, we recommend holding off on upgrading to the standard channel version of GRPCRoute until the controllers you're using have been updated to support GRPCRoute v1. Until then, it is safe to upgrade to the experimental channel version of GRPCRoute in v1.1 that includes both v1alpha2 and v1 API versions.

ParentReference Port

The port field was added to ParentReference, allowing you to attach resources to Gateway Listeners, Services, or other parent resources (depending on the implementation). Binding to a port also allows you to attach to multiple Listeners at once.

For example, you can attach an HTTPRoute to one or more specific Listeners of a Gateway as specified by the Listener port, instead of the Listener name field.

For more information, see Attaching to Gateways.

Conformance Profiles and Reports

The conformance report API has been expanded with the mode field (intended to specify the working mode of the implementation), and the gatewayAPIChannel (standard or experimental). The gatewayAPIVersion and gatewayAPIChannel are now filled in automatically by the suite machinery, along with a brief description of the testing outcome. The Reports have been reorganized in a more structured way, and the implementations can now add information on how the tests have been run and provide reproduction steps.

New additions to Experimental channel

Gateway Client Certificate Verification

Gateways can now configure client cert verification for each Gateway Listener by introducing a new frontendValidation field within tls. This field supports configuring a list of CA Certificates that can be used as a trust anchor to validate the certificates presented by the client.

The following example shows how the CACertificate stored in the foo-example-com-ca-cert ConfigMap can be used to validate the certificates presented by clients connecting to the foo-https Gateway Listener.

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: client-validation-basic
spec:
  gatewayClassName: acme-lb
  listeners:
  - name: foo-https
    protocol: HTTPS
    port: 443
    hostname: foo.example.com
    tls:
      certificateRefs:
      - kind: Secret
        group: ""
        name: foo-example-com-cert
      frontendValidation:
        caCertificateRefs:
        - kind: ConfigMap
          group: ""
          name: foo-example-com-ca-cert

Session Persistence and BackendLBPolicy

Session Persistence is being introduced to Gateway API via a new policy (BackendLBPolicy) for Service-level configuration and as fields within HTTPRoute and GRPCRoute for route-level configuration. The BackendLBPolicy and route-level APIs provide the same session persistence configuration, including session timeouts, session name, session type, and cookie lifetime type.

Below is an example configuration of BackendLBPolicy that enables cookie-based session persistence for the foo service. It sets the session name to foo-session, defines absolute and idle timeouts, and configures the cookie to be a session cookie:

apiVersion: gateway.networking.k8s.io/v1alpha2
kind: BackendLBPolicy
metadata:
  name: lb-policy
  namespace: foo-ns
spec:
  targetRefs:
  - group: core
    kind: service
    name: foo
  sessionPersistence:
    sessionName: foo-session
    absoluteTimeout: 1h
    idleTimeout: 30m
    type: Cookie
    cookieConfig:
      lifetimeType: Session

Everything else

TLS Terminology Clarifications

As part of a broader goal of making our TLS terminology more consistent throughout the API, we've introduced some breaking changes to BackendTLSPolicy. This has resulted in a new API version (v1alpha3) and will require any existing implementations of this policy to properly handle the version upgrade, e.g. by backing up data and uninstalling the v1alpha2 version before installing this newer version.

Any references to v1alpha2 BackendTLSPolicy fields will need to be updated to v1alpha3. Specific changes to fields include:

  • targetRef becomes targetRefs to allow a BackendTLSPolicy to attach to multiple targets
  • tls becomes validation
  • tls.caCertRefs becomes validation.caCertificateRefs
  • tls.wellKnownCACerts becomes validation.wellKnownCACertificates

For a full list of the changes included in this release, please refer to the v1.1.0 release notes.

Gateway API background

The idea of Gateway API was initially proposed at the 2019 KubeCon San Diego as the next generation of Ingress API. Since then, an incredible community has formed to develop what has likely become the most collaborative API in Kubernetes history. Over 200 people have contributed to this API so far, and that number continues to grow.

The maintainers would like to thank everyone who's contributed to Gateway API, whether in the form of commits to the repo, discussion, ideas, or general support. We literally couldn't have gotten this far without the support of this dedicated and active community.

Try it out

Unlike other Kubernetes APIs, you don't need to upgrade to the latest version of Kubernetes to get the latest version of Gateway API. As long as you're running Kubernetes 1.26 or later, you'll be able to get up and running with this version of Gateway API.

To try out the API, follow our Getting Started Guide.

Get involved

There are lots of opportunities to get involved and help define the future of Kubernetes routing APIs for both ingress and service mesh.

Container Runtime Interface streaming explained

The Kubernetes Container Runtime Interface (CRI) acts as the main connection between the kubelet and the Container Runtime. Those runtimes have to provide a gRPC server which has to fulfill a Kubernetes defined Protocol Buffer interface. This API definition evolves over time, for example when contributors add new features or fields are going to become deprecated.

In this blog post, I'd like to dive into the functionality and history of three extraordinary Remote Procedure Calls (RPCs), which are truly outstanding in terms of how they work: Exec, Attach and PortForward.

Exec can be used to run dedicated commands within the container and stream the output to a client like kubectl or crictl. It also allows interaction with that process using standard input (stdin), for example if users want to run a new shell instance within an existing workload.

Attach streams the output of the currently running process via standard I/O from the container to the client and also allows interaction with them. This is particularly useful if users want to see what is going on in the container and be able to interact with the process.

PortForward can be utilized to forward a port from the host to the container to be able to interact with it using third party network tools. This allows it to bypass Kubernetes services for a certain workload and interact with its network interface.

What is so special about them?

All RPCs of the CRI either use the gRPC unary calls for communication or the server side streaming feature (only GetContainerEvents right now). This means that mainly all RPCs retrieve a single client request and have to return a single server response. The same applies to Exec, Attach, and PortForward, where their protocol definition looks like this:

// Exec prepares a streaming endpoint to execute a command in the container.
rpc Exec(ExecRequest) returns (ExecResponse) {}
// Attach prepares a streaming endpoint to attach to a running container.
rpc Attach(AttachRequest) returns (AttachResponse) {}
// PortForward prepares a streaming endpoint to forward ports from a PodSandbox.
rpc PortForward(PortForwardRequest) returns (PortForwardResponse) {}

The requests carry everything required to allow the server to do the work, for example, the ContainerId or command (Cmd) to be run in case of Exec. More interestingly, all of their responses only contain a url:

message ExecResponse {
    // Fully qualified URL of the exec streaming server.
    string url = 1;
}
message AttachResponse {
    // Fully qualified URL of the attach streaming server.
    string url = 1;
}
message PortForwardResponse {
    // Fully qualified URL of the port-forward streaming server.
    string url = 1;
}

Why is it implemented like that? Well, the original design document for those RPCs even predates Kubernetes Enhancements Proposals (KEPs) and was originally outlined back in 2016. The kubelet had a native implementation for Exec, Attach, and PortForward before the initiative to bring the functionality to the CRI started. Before that, everything was bound to Docker or the later abandoned container runtime rkt.

The CRI related design document also elaborates on the option to use native RPC streaming for exec, attach, and port forward. The downsides outweighed this approach: the kubelet would still create a network bottleneck and future runtimes would not be free in choosing the server implementation details. Also, another option that the Kubelet implements a portable, runtime-agnostic solution has been abandoned over the final one, because this would mean another project to maintain which nevertheless would be runtime dependent.

This means, that the basic flow for Exec, Attach and PortForward was proposed to look like this:

CRI Streaming flow

Clients like crictl or the kubelet (via kubectl) request a new exec, attach or port forward session from the runtime using the gRPC interface. The runtime implements a streaming server that also manages the active sessions. This streaming server provides an HTTP endpoint for the client to connect to. The client upgrades the connection to use the SPDY streaming protocol or (in the future) to a WebSocket connection and starts to stream the data back and forth.

This implementation allows runtimes to have the flexibility to implement Exec, Attach and PortForward the way they want, and also allows a simple test path. Runtimes can change the underlying implementation to support any kind of feature without having a need to modify the CRI at all.

Many smaller enhancements to this overall approach have been merged into Kubernetes in the past years, but the general pattern has always stayed the same. The kubelet source code transformed into a reusable library, which is nowadays usable from container runtimes to implement the basic streaming capability.

How does the streaming actually work?

At a first glance, it looks like all three RPCs work the same way, but that's not the case. It's possible to group the functionality of Exec and Attach, while PortForward follows a distinct internal protocol definition.

Exec and Attach

Kubernetes defines Exec and Attach as remote commands, where its protocol definition exists in five different versions:

#VersionNote
1channel.k8s.ioInitial (unversioned) SPDY sub protocol (#13394, #13395)
2v2.channel.k8s.ioResolves the issues present in the first version (#15961)
3v3.channel.k8s.ioAdds support for resizing container terminals (#25273)
4v4.channel.k8s.ioAdds support for exit codes using JSON errors (#26541)
5v5.channel.k8s.ioAdds support for a CLOSE signal (#119157)

On top of that, there is an overall effort to replace the SPDY transport protocol using WebSockets as part KEP #4006. Runtimes have to satisfy those protocols over their life cycle to stay up to date with the Kubernetes implementation.

Let's assume that a client uses the latest (v5) version of the protocol as well as communicating over WebSockets. In that case, the general flow would be:

  1. The client requests an URL endpoint for Exec or Attach using the CRI.

    • The server (runtime) validates the request, inserts it into a connection tracking cache, and provides the HTTP endpoint URL for that request.
  2. The client connects to that URL, upgrades the connection to establish a WebSocket, and starts to stream data.

    • In the case of Attach, the server has to stream the main container process data to the client.
    • In the case of Exec, the server has to create the subprocess command within the container and then streams the output to the client.

    If stdin is required, then the server needs to listen for that as well and redirect it to the corresponding process.

Interpreting data for the defined protocol is fairly simple: The first byte of every input and output packet defines the actual stream:

First ByteTypeDescription
0standard inputData streamed from stdin
1standard outputData streamed to stdout
2standard errorData streamed to stderr
3stream errorA streaming error occurred
4stream resizeA terminal resize event
255stream closeStream should be closed (for WebSockets)

How should runtimes now implement the streaming server methods for Exec and Attach by using the provided kubelet library? The key is that the streaming server implementation in the kubelet outlines an interface called Runtime which has to be fulfilled by the actual container runtime if it wants to use that library:

// Runtime is the interface to execute the commands and provide the streams.
type Runtime interface {
        Exec(ctx context.Context, containerID string, cmd []string, in io.Reader, out, err io.WriteCloser, tty bool, resize <-chan remotecommand.TerminalSize) error
        Attach(ctx context.Context, containerID string, in io.Reader, out, err io.WriteCloser, tty bool, resize <-chan remotecommand.TerminalSize) error
        PortForward(ctx context.Context, podSandboxID string, port int32, stream io.ReadWriteCloser) error
}

Everything related to the protocol interpretation is already in place and runtimes only have to implement the actual Exec and Attach logic. For example, the container runtime CRI-O does it like this pseudo code:

func (s StreamService) Exec(
    ctx context.Context,
    containerID string,
    cmd []string,
    stdin io.Reader, stdout, stderr io.WriteCloser,
    tty bool,
    resizeChan <-chan remotecommand.TerminalSize,
) error {
    // Retrieve the container by the provided containerID
    // …

    // Update the container status and verify that the workload is running
    // …

    // Execute the command and stream the data
    return s.runtimeServer.Runtime().ExecContainer(
        s.ctx, c, cmd, stdin, stdout, stderr, tty, resizeChan,
    )
}

PortForward

Forwarding ports to a container works a bit differently when comparing it to streaming IO data from a workload. The server still has to provide a URL endpoint for the client to connect to, but then the container runtime has to enter the network namespace of the container, allocate the port as well as stream the data back and forth. There is no simple protocol definition available like for Exec or Attach. This means that the client will stream the plain SPDY frames (with or without an additional WebSocket connection) which can be interpreted using libraries like moby/spdystream.

Luckily, the kubelet library already provides the PortForward interface method which has to be implemented by the runtime. CRI-O does that by (simplified):

func (s StreamService) PortForward(
    ctx context.Context,
    podSandboxID string,
    port int32,
    stream io.ReadWriteCloser,
) error {
    // Retrieve the pod sandbox by the provided podSandboxID
    sandboxID, err := s.runtimeServer.PodIDIndex().Get(podSandboxID)
    sb := s.runtimeServer.GetSandbox(sandboxID)
    // …

    // Get the network namespace path on disk for that sandbox
    netNsPath := sb.NetNsPath()
    // …

    // Enter the network namespace and stream the data
    return s.runtimeServer.Runtime().PortForwardContainer(
        ctx, sb.InfraContainer(), netNsPath, port, stream,
    )
}

Future work

The flexibility Kubernetes provides for the RPCs Exec, Attach and PortForward is truly outstanding compared to other methods. Nevertheless, container runtimes have to keep up with the latest and greatest implementations to support those features in a meaningful way. The general effort to support WebSockets is not only a plain Kubernetes thing, it also has to be supported by container runtimes as well as clients like crictl.

For example, crictl v1.30 features a new --transport flag for the subcommands exec, attach and port-forward (#1383, #1385) to allow choosing between websocket and spdy.

CRI-O is going an experimental path by moving the streaming server implementation into conmon-rs (a substitute for the container monitor conmon). conmon-rs is a Rust implementation of the original container monitor and allows streaming WebSockets directly using supported libraries (#2070). The major benefit of this approach is that CRI-O does not even have to be running while conmon-rs can keep active Exec, Attach and PortForward sessions open. The simplified flow when using crictl directly will then look like this:

sequenceDiagram autonumber participant crictl participant runtime as Container Runtime participant conmon-rs Note over crictl,runtime: Container Runtime Interface (CRI) crictl->>runtime: Exec, Attach, PortForward Note over runtime,conmon-rs: Cap’n Proto runtime->>conmon-rs: Serve Exec, Attach, PortForward conmon-rs->>runtime: HTTP endpoint (URL) runtime->>crictl: Response URL crictl-->>conmon-rs: Connection upgrade to WebSocket conmon-rs-)crictl: Stream data

All of those enhancements require iterative design decisions, while the original well-conceived implementation acts as the foundation for those. I really hope you've enjoyed this compact journey through the history of CRI RPCs. Feel free to reach out to me anytime for suggestions or feedback using the official Kubernetes Slack.

Kubernetes 1.30: Preventing unauthorized volume mode conversion moves to GA

With the release of Kubernetes 1.30, the feature to prevent the modification of the volume mode of a PersistentVolumeClaim that was created from an existing VolumeSnapshot in a Kubernetes cluster, has moved to GA!

The problem

The Volume Mode of a PersistentVolumeClaim refers to whether the underlying volume on the storage device is formatted into a filesystem or presented as a raw block device to the Pod that uses it.

Users can leverage the VolumeSnapshot feature, which has been stable since Kubernetes v1.20, to create a PersistentVolumeClaim (shortened as PVC) from an existing VolumeSnapshot in the Kubernetes cluster. The PVC spec includes a dataSource field, which can point to an existing VolumeSnapshot instance. Visit Create a PersistentVolumeClaim from a Volume Snapshot for more details on how to create a PVC from an existing VolumeSnapshot in a Kubernetes cluster.

When leveraging the above capability, there is no logic that validates whether the mode of the original volume, whose snapshot was taken, matches the mode of the newly created volume.

This presents a security gap that allows malicious users to potentially exploit an as-yet-unknown vulnerability in the host operating system.

There is a valid use case to allow some users to perform such conversions. Typically, storage backup vendors convert the volume mode during the course of a backup operation, to retrieve changed blocks for greater efficiency of operations. This prevents Kubernetes from blocking the operation completely and presents a challenge in distinguishing trusted users from malicious ones.

Preventing unauthorized users from converting the volume mode

In this context, an authorized user is one who has access rights to perform update or patch operations on VolumeSnapshotContents, which is a cluster-level resource.
It is up to the cluster administrator to provide these rights only to trusted users or applications, like backup vendors. Users apart from such authorized ones will never be allowed to modify the volume mode of a PVC when it is being created from a VolumeSnapshot.

To convert the volume mode, an authorized user must do the following:

  1. Identify the VolumeSnapshot that is to be used as the data source for a newly created PVC in the given namespace.
  2. Identify the VolumeSnapshotContent bound to the above VolumeSnapshot.
kubectl describe volumesnapshot -n <namespace> <name>
  1. Add the annotation snapshot.storage.kubernetes.io/allow-volume-mode-change: "true" to the above VolumeSnapshotContent. The VolumeSnapshotContent annotations must include one similar to the following manifest fragment:
kind: VolumeSnapshotContent
metadata:
  annotations:
    - snapshot.storage.kubernetes.io/allow-volume-mode-change: "true"
...

Note: For pre-provisioned VolumeSnapshotContents, you must take an extra step of setting spec.sourceVolumeMode field to either Filesystem or Block, depending on the mode of the volume from which this snapshot was taken.

An example is shown below:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  annotations:
  - snapshot.storage.kubernetes.io/allow-volume-mode-change: "true"
  name: <volume-snapshot-content-name>
spec:
  deletionPolicy: Delete
  driver: hostpath.csi.k8s.io
  source:
    snapshotHandle: <snapshot-handle>
  sourceVolumeMode: Filesystem
  volumeSnapshotRef:
    name: <volume-snapshot-name>
    namespace: <namespace>

Repeat steps 1 to 3 for all VolumeSnapshotContents whose volume mode needs to be converted during a backup or restore operation. This can be done either via software with credentials of an authorized user or manually by the authorized user(s).

If the annotation shown above is present on a VolumeSnapshotContent object, Kubernetes will not prevent the volume mode from being converted. Users should keep this in mind before they attempt to add the annotation to any VolumeSnapshotContent.

Action required

The prevent-volume-mode-conversion feature flag is enabled by default in the external-provisioner v4.0.0 and external-snapshotter v7.0.0. Volume mode change will be rejected when creating a PVC from a VolumeSnapshot unless the steps described above have been performed.

What's next

To determine which CSI external sidecar versions support this feature, please head over to the CSI docs page. For any queries or issues, join Kubernetes on Slack and create a thread in the #csi or #sig-storage channel. Alternately, create an issue in the CSI external-snapshotter repository.

Kubernetes 1.30: Multi-Webhook and Modular Authorization Made Much Easier

With Kubernetes 1.30, we (SIG Auth) are moving Structured Authorization Configuration to beta.

Today's article is about authorization: deciding what someone can and cannot access. Check a previous article from yesterday to find about what's new in Kubernetes v1.30 around authentication (finding out who's performing a task, and checking that they are who they say they are).

Introduction

Kubernetes continues to evolve to meet the intricate requirements of system administrators and developers alike. A critical aspect of Kubernetes that ensures the security and integrity of the cluster is the API server authorization. Until recently, the configuration of the authorization chain in kube-apiserver was somewhat rigid, limited to a set of command-line flags and allowing only a single webhook in the authorization chain. This approach, while functional, restricted the flexibility needed by cluster administrators to define complex, fine-grained authorization policies. The latest Structured Authorization Configuration feature (KEP-3221) aims to revolutionize this aspect by introducing a more structured and versatile way to configure the authorization chain, focusing on enabling multiple webhooks and providing explicit control mechanisms.

The Need for Improvement

Cluster administrators have long sought the ability to specify multiple authorization webhooks within the API Server handler chain and have control over detailed behavior like timeout and failure policy for each webhook. This need arises from the desire to create layered security policies, where requests can be validated against multiple criteria or sets of rules in a specific order. The previous limitations also made it difficult to dynamically configure the authorizer chain, leaving no room to manage complex authorization scenarios efficiently.

The Structured Authorization Configuration feature addresses these limitations by introducing a configuration file format to configure the Kubernetes API Server Authorization chain. This format allows specifying multiple webhooks in the authorization chain (all other authorization types are specified no more than once). Each webhook authorizer has well-defined parameters, including timeout settings, failure policies, and conditions for invocation with CEL rules to pre-filter requests before they are dispatched to webhooks, helping you prevent unnecessary invocations. The configuration also supports automatic reloading, ensuring changes can be applied dynamically without restarting the kube-apiserver. This feature addresses current limitations and opens up new possibilities for securing and managing Kubernetes clusters more effectively.

Sample Configurations

Here is a sample structured authorization configuration along with descriptions for all fields, their defaults, and possible values.

apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthorizationConfiguration
authorizers:
  - type: Webhook
    # Name used to describe the authorizer
    # This is explicitly used in monitoring machinery for metrics
    # Note:
    #   - Validation for this field is similar to how K8s labels are validated today.
    # Required, with no default
    name: webhook
    webhook:
      # The duration to cache 'authorized' responses from the webhook
      # authorizer.
      # Same as setting `--authorization-webhook-cache-authorized-ttl` flag
      # Default: 5m0s
      authorizedTTL: 30s
      # The duration to cache 'unauthorized' responses from the webhook
      # authorizer.
      # Same as setting `--authorization-webhook-cache-unauthorized-ttl` flag
      # Default: 30s
      unauthorizedTTL: 30s
      # Timeout for the webhook request
      # Maximum allowed is 30s.
      # Required, with no default.
      timeout: 3s
      # The API version of the authorization.k8s.io SubjectAccessReview to
      # send to and expect from the webhook.
      # Same as setting `--authorization-webhook-version` flag
      # Required, with no default
      # Valid values: v1beta1, v1
      subjectAccessReviewVersion: v1
      # MatchConditionSubjectAccessReviewVersion specifies the SubjectAccessReview
      # version the CEL expressions are evaluated against
      # Valid values: v1
      # Required, no default value
      matchConditionSubjectAccessReviewVersion: v1
      # Controls the authorization decision when a webhook request fails to
      # complete or returns a malformed response or errors evaluating
      # matchConditions.
      # Valid values:
      #   - NoOpinion: continue to subsequent authorizers to see if one of
      #     them allows the request
      #   - Deny: reject the request without consulting subsequent authorizers
      # Required, with no default.
      failurePolicy: Deny
      connectionInfo:
        # Controls how the webhook should communicate with the server.
        # Valid values:
        # - KubeConfigFile: use the file specified in kubeConfigFile to locate the
        #   server.
        # - InClusterConfig: use the in-cluster configuration to call the
        #   SubjectAccessReview API hosted by kube-apiserver. This mode is not
        #   allowed for kube-apiserver.
        type: KubeConfigFile
        # Path to KubeConfigFile for connection info
        # Required, if connectionInfo.Type is KubeConfigFile
        kubeConfigFile: /kube-system-authz-webhook.yaml
        # matchConditions is a list of conditions that must be met for a request to be sent to this
        # webhook. An empty list of matchConditions matches all requests.
        # There are a maximum of 64 match conditions allowed.
        #
        # The exact matching logic is (in order):
        #   1. If at least one matchCondition evaluates to FALSE, then the webhook is skipped.
        #   2. If ALL matchConditions evaluate to TRUE, then the webhook is called.
        #   3. If at least one matchCondition evaluates to an error (but none are FALSE):
        #      - If failurePolicy=Deny, then the webhook rejects the request
        #      - If failurePolicy=NoOpinion, then the error is ignored and the webhook is skipped
      matchConditions:
      # expression represents the expression which will be evaluated by CEL. Must evaluate to bool.
      # CEL expressions have access to the contents of the SubjectAccessReview in v1 version.
      # If version specified by subjectAccessReviewVersion in the request variable is v1beta1,
      # the contents would be converted to the v1 version before evaluating the CEL expression.
      #
      # Documentation on CEL: https://kubernetes.io/docs/reference/using-api/cel/
      #
      # only send resource requests to the webhook
      - expression: has(request.resourceAttributes)
      # only intercept requests to kube-system
      - expression: request.resourceAttributes.namespace == 'kube-system'
      # don't intercept requests from kube-system service accounts
      - expression: "!('system:serviceaccounts:kube-system' in request.groups)"
  - type: Node
    name: node
  - type: RBAC
    name: rbac
  - type: Webhook
    name: in-cluster-authorizer
    webhook:
      authorizedTTL: 5m
      unauthorizedTTL: 30s
      timeout: 3s
      subjectAccessReviewVersion: v1
      failurePolicy: NoOpinion
      connectionInfo:
        type: InClusterConfig

The following configuration examples illustrate real-world scenarios that need the ability to specify multiple webhooks with distinct settings, precedence order, and failure modes.

Protecting Installed CRDs

Ensuring of Custom Resource Definitions (CRDs) availability at cluster startup has been a key demand. One of the blockers of having a controller reconcile those CRDs is having a protection mechanism for them, which can be achieved through multiple authorization webhooks. This was not possible before as specifying multiple authorization webhooks in the Kubernetes API Server authorization chain was simply not possible. Now, with the Structured Authorization Configuration feature, administrators can specify multiple webhooks, offering a solution where RBAC falls short, especially when denying permissions to 'non-system' users for certain CRDs.

Assuming the following for this scenario:

  • The "protected" CRDs are installed.
  • They can only be modified by users in the group admin.
apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthorizationConfiguration
authorizers:
  - type: Webhook
    name: system-crd-protector
    webhook:
      unauthorizedTTL: 30s
      timeout: 3s
      subjectAccessReviewVersion: v1
      matchConditionSubjectAccessReviewVersion: v1
      failurePolicy: Deny
      connectionInfo:
        type: KubeConfigFile
        kubeConfigFile: /files/kube-system-authz-webhook.yaml
      matchConditions:
      # only send resource requests to the webhook
      - expression: has(request.resourceAttributes)
      # only intercept requests for CRDs
      - expression: request.resourceAttributes.resource.resource = "customresourcedefinitions"
      - expression: request.resourceAttributes.resource.group = ""
      # only intercept update, patch, delete, or deletecollection requests
      - expression: request.resourceAttributes.verb in ['update', 'patch', 'delete','deletecollection']
  - type: Node
  - type: RBAC

Preventing unnecessarily nested webhooks

A system administrator wants to apply specific validations to requests before handing them off to webhooks using frameworks like Open Policy Agent. In the past, this would require running nested webhooks within the one added to the authorization chain to achieve the desired result. The Structured Authorization Configuration feature simplifies this process, offering a structured API to selectively trigger additional webhooks when needed. It also enables administrators to set distinct failure policies for each webhook, ensuring more consistent and predictable responses.

apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthorizationConfiguration
authorizers:
  - type: Webhook
    name: system-crd-protector
    webhook:
      unauthorizedTTL: 30s
      timeout: 3s
      subjectAccessReviewVersion: v1
      matchConditionSubjectAccessReviewVersion: v1
      failurePolicy: Deny
      connectionInfo:
        type: KubeConfigFile
        kubeConfigFile: /files/kube-system-authz-webhook.yaml
      matchConditions:
      # only send resource requests to the webhook
      - expression: has(request.resourceAttributes)
      # only intercept requests for CRDs
      - expression: request.resourceAttributes.resource.resource = "customresourcedefinitions"
      - expression: request.resourceAttributes.resource.group = ""
      # only intercept update, patch, delete, or deletecollection requests
      - expression: request.resourceAttributes.verb in ['update', 'patch', 'delete','deletecollection']
  - type: Node
  - type: RBAC
  - name: opa
    type: Webhook
    webhook:
      unauthorizedTTL: 30s
      timeout: 3s
      subjectAccessReviewVersion: v1
      matchConditionSubjectAccessReviewVersion: v1
      failurePolicy: Deny
      connectionInfo:
        type: KubeConfigFile
        kubeConfigFile: /files/opa-default-authz-webhook.yaml
      matchConditions:
      # only send resource requests to the webhook
      - expression: has(request.resourceAttributes)
      # only intercept requests to default namespace
      - expression: request.resourceAttributes.namespace == 'default'
      # don't intercept requests from default service accounts
      - expression: "!('system:serviceaccounts:default' in request.groups)"

What's next?

From Kubernetes 1.30, the feature is in beta and enabled by default. For Kubernetes v1.31, we expect the feature to stay in beta while we get more feedback from users. Once it is ready for GA, the feature flag will be removed, and the configuration file version will be promoted to v1.

Learn more about this feature on the structured authorization configuration Kubernetes doc website. You can also follow along with KEP-3221 to track progress in coming Kubernetes releases.

Call to action

In this post, we have covered the benefits of the Structured Authorization Configuration feature in Kubernetes v1.30 and a few sample configurations for real-world scenarios. To use this feature, you must specify the path to the authorization configuration using the --authorization-config command line argument. From Kubernetes 1.30, the feature is in beta and enabled by default. If you want to keep using command line flags instead of a configuration file, those will continue to work as-is. Specifying both --authorization-config and --authorization-modes/--authorization-webhook-* won't work. You need to drop the older flags from your kube-apiserver command.

The following kind Cluster configuration sets that command argument on the APIserver to load an AuthorizationConfiguration from a file (authorization_config.yaml) in the files folder. Any needed kubeconfig and certificate files can also be put in the files directory.

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  StructuredAuthorizationConfiguration: true  # enabled by default in v1.30
kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    metadata:
      name: config
    apiServer:
      extraArgs:
        authorization-config: "/files/authorization_config.yaml"
      extraVolumes:
      - name: files
        hostPath: "/files"
        mountPath: "/files"
        readOnly: true
nodes:
- role: control-plane
  extraMounts:
  - hostPath: files
    containerPath: /files

We would love to hear your feedback on this feature. In particular, we would like feedback from Kubernetes cluster administrators and authorization webhook implementors as they build their integrations with this new API. Please reach out to us on the #sig-auth-authorizers-dev channel on Kubernetes Slack.

How to get involved

If you are interested in helping develop this feature, sharing feedback, or participating in any other ongoing SIG Auth projects, please reach out on the #sig-auth channel on Kubernetes Slack.

You are also welcome to join the bi-weekly SIG Auth meetings held every other Wednesday.

Acknowledgments

This feature was driven by contributors from several different companies. We would like to extend a huge thank you to everyone who contributed their time and effort to make this possible.

Kubernetes 1.30: Structured Authentication Configuration Moves to Beta

With Kubernetes 1.30, we (SIG Auth) are moving Structured Authentication Configuration to beta.

Today's article is about authentication: finding out who's performing a task, and checking that they are who they say they are. Check back in tomorrow to find about what's new in Kubernetes v1.30 around authorization (deciding what someone can and can't access).

Motivation

Kubernetes has had a long-standing need for a more flexible and extensible authentication system. The current system, while powerful, has some limitations that make it difficult to use in certain scenarios. For example, it is not possible to use multiple authenticators of the same type (e.g., multiple JWT authenticators) or to change the configuration without restarting the API server. The Structured Authentication Configuration feature is the first step towards addressing these limitations and providing a more flexible and extensible way to configure authentication in Kubernetes.

What is structured authentication configuration?

Kubernetes v1.30 builds on the experimental support for configurating authentication based on a file, that was added as alpha in Kubernetes v1.30. At this beta stage, Kubernetes only supports configuring JWT authenticators, which serve as the next iteration of the existing OIDC authenticator. JWT authenticator is an authenticator to authenticate Kubernetes users using JWT compliant tokens. The authenticator will attempt to parse a raw ID token, verify it's been signed by the configured issuer.

The Kubernetes project added configuration from a file so that it can provide more flexibility than using command line options (which continue to work, and are still supported). Supporting a configuration file also makes it easy to deliver further improvements in upcoming releases.

Benefits of structured authentication configuration

Here's why using a configuration file to configure cluster authentication is a benefit:

  1. Multiple JWT authenticators: You can configure multiple JWT authenticators simultaneously. This allows you to use multiple identity providers (e.g., Okta, Keycloak, GitLab) without needing to use an intermediary like Dex that handles multiplexing between multiple identity providers.
  2. Dynamic configuration: You can change the configuration without restarting the API server. This allows you to add, remove, or modify authenticators without disrupting the API server.
  3. Any JWT-compliant token: You can use any JWT-compliant token for authentication. This allows you to use tokens from any identity provider that supports JWT. The minimum valid JWT payload must contain the claims documented in structured authentication configuration page in the Kubernetes documentation.
  4. CEL (Common Expression Language) support: You can use CEL to determine whether the token's claims match the user's attributes in Kubernetes (e.g., username, group). This allows you to use complex logic to determine whether a token is valid.
  5. Multiple audiences: You can configure multiple audiences for a single authenticator. This allows you to use the same authenticator for multiple audiences, such as using a different OAuth client for kubectl and dashboard.
  6. Using identity providers that don't support OpenID connect discovery: You can use identity providers that don't support OpenID Connect discovery. The only requirement is to host the discovery document at a different location than the issuer (such as locally in the cluster) and specify the issuer.discoveryURL in the configuration file.

How to use Structured Authentication Configuration

To use structured authentication configuration, you specify the path to the authentication configuration using the --authentication-config command line argument in the API server. The configuration file is a YAML file that specifies the authenticators and their configuration. Here is an example configuration file that configures two JWT authenticators:

apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthenticationConfiguration
# Someone with a valid token from either of these issuers could authenticate
# against this cluster.
jwt:
- issuer:
    url: https://issuer1.example.com
    audiences:
    - audience1
    - audience2
    audienceMatchPolicy: MatchAny
  claimValidationRules:
    expression: 'claims.hd == "example.com"'
    message: "the hosted domain name must be example.com"
  claimMappings:
    username:
      expression: 'claims.username'
    groups:
      expression: 'claims.groups'
    uid:
      expression: 'claims.uid'
    extra:
    - key: 'example.com/tenant'
      expression: 'claims.tenant'
  userValidationRules:
  - expression: "!user.username.startsWith('system:')"
    message: "username cannot use reserved system: prefix"
# second authenticator that exposes the discovery document at a different location
# than the issuer
- issuer:
    url: https://issuer2.example.com
    discoveryURL: https://discovery.example.com/.well-known/openid-configuration
    audiences:
    - audience3
    - audience4
    audienceMatchPolicy: MatchAny
  claimValidationRules:
    expression: 'claims.hd == "example.com"'
    message: "the hosted domain name must be example.com"
  claimMappings:
    username:
      expression: 'claims.username'
    groups:
      expression: 'claims.groups'
    uid:
      expression: 'claims.uid'
    extra:
    - key: 'example.com/tenant'
      expression: 'claims.tenant'
  userValidationRules:
  - expression: "!user.username.startsWith('system:')"
    message: "username cannot use reserved system: prefix"

Migration from command line arguments to configuration file

The Structured Authentication Configuration feature is designed to be backwards-compatible with the existing approach, based on command line options, for configuring the JWT authenticator. This means that you can continue to use the existing command-line options to configure the JWT authenticator. However, we (Kubernetes SIG Auth) recommend migrating to the new configuration file-based approach, as it provides more flexibility and extensibility.

Here is an example of how to migrate from the command-line flags to the configuration file:

Command-line arguments

--oidc-issuer-url=https://issuer.example.com
--oidc-client-id=example-client-id
--oidc-username-claim=username
--oidc-groups-claim=groups
--oidc-username-prefix=oidc:
--oidc-groups-prefix=oidc:
--oidc-required-claim="hd=example.com"
--oidc-required-claim="admin=true"
--oidc-ca-file=/path/to/ca.pem

There is no equivalent in the configuration file for the --oidc-signing-algs. For Kubernetes v1.30, the authenticator supports all the asymmetric algorithms listed in oidc.go.

Configuration file

apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthenticationConfiguration
jwt:
- issuer:
    url: https://issuer.example.com
    audiences:
    - example-client-id
    certificateAuthority: <value is the content of file /path/to/ca.pem>
  claimMappings:
    username:
      claim: username
      prefix: "oidc:"
    groups:
      claim: groups
      prefix: "oidc:"
  claimValidationRules:
  - claim: hd
    requiredValue: "example.com"
  - claim: admin
    requiredValue: "true"

What's next?

For Kubernetes v1.31, we expect the feature to stay in beta while we get more feedback. In the coming releases, we want to investigate:

  • Making distributed claims work via CEL expressions.
  • Egress selector configuration support for calls to issuer.url and issuer.discoveryURL.

You can learn more about this feature on the structured authentication configuration page in the Kubernetes documentation. You can also follow along on the KEP-3331 to track progress across the coming Kubernetes releases.

Try it out

In this post, I have covered the benefits the Structured Authentication Configuration feature brings in Kubernetes v1.30. To use this feature, you must specify the path to the authentication configuration using the --authentication-config command line argument. From Kubernetes v1.30, the feature is in beta and enabled by default. If you want to keep using command line arguments instead of a configuration file, those will continue to work as-is.

We would love to hear your feedback on this feature. Please reach out to us on the #sig-auth-authenticators-dev channel on Kubernetes Slack (for an invitation, visit https://slack.k8s.io/).

How to get involved

If you are interested in getting involved in the development of this feature, share feedback, or participate in any other ongoing SIG Auth projects, please reach out on the #sig-auth channel on Kubernetes Slack.

You are also welcome to join the bi-weekly SIG Auth meetings held every-other Wednesday.

Kubernetes 1.30: Validating Admission Policy Is Generally Available

On behalf of the Kubernetes project, I am excited to announce that ValidatingAdmissionPolicy has reached general availability as part of Kubernetes 1.30 release. If you have not yet read about this new declarative alternative to validating admission webhooks, it may be interesting to read our previous post about the new feature. If you have already heard about ValidatingAdmissionPolicies and you are eager to try them out, there is no better time to do it than now.

Let's have a taste of a ValidatingAdmissionPolicy, by replacing a simple webhook.

Example admission webhook

First, let's take a look at an example of a simple webhook. Here is an excerpt from a webhook that enforces runAsNonRoot, readOnlyRootFilesystem, allowPrivilegeEscalation, and privileged to be set to the least permissive values.

func verifyDeployment(deploy *appsv1.Deployment) error {
	var errs []error
	for i, c := range deploy.Spec.Template.Spec.Containers {
		if c.Name == "" {
			return fmt.Errorf("container %d has no name", i)
		}
		if c.SecurityContext == nil {
			errs = append(errs, fmt.Errorf("container %q does not have SecurityContext", c.Name))
		}
		if c.SecurityContext.RunAsNonRoot == nil || !*c.SecurityContext.RunAsNonRoot {
			errs = append(errs, fmt.Errorf("container %q must set RunAsNonRoot to true in its SecurityContext", c.Name))
		}
		if c.SecurityContext.ReadOnlyRootFilesystem == nil || !*c.SecurityContext.ReadOnlyRootFilesystem {
			errs = append(errs, fmt.Errorf("container %q must set ReadOnlyRootFilesystem to true in its SecurityContext", c.Name))
		}
		if c.SecurityContext.AllowPrivilegeEscalation != nil && *c.SecurityContext.AllowPrivilegeEscalation {
			errs = append(errs, fmt.Errorf("container %q must NOT set AllowPrivilegeEscalation to true in its SecurityContext", c.Name))
		}
		if c.SecurityContext.Privileged != nil && *c.SecurityContext.Privileged {
			errs = append(errs, fmt.Errorf("container %q must NOT set Privileged to true in its SecurityContext", c.Name))
		}
	}
	return errors.NewAggregate(errs)
}

Check out What are admission webhooks? Or, see the full code of this webhook to follow along with this walkthrough.

The policy

Now let's try to recreate the validation faithfully with a ValidatingAdmissionPolicy.

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: "pod-security.policy.example.com"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups:   ["apps"]
      apiVersions: ["v1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["deployments"]
  validations:
  - expression: object.spec.template.spec.containers.all(c, has(c.securityContext) && has(c.securityContext.runAsNonRoot) && c.securityContext.runAsNonRoot)
    message: 'all containers must set runAsNonRoot to true'
  - expression: object.spec.template.spec.containers.all(c, has(c.securityContext) && has(c.securityContext.readOnlyRootFilesystem) && c.securityContext.readOnlyRootFilesystem)
    message: 'all containers must set readOnlyRootFilesystem to true'
  - expression: object.spec.template.spec.containers.all(c, !has(c.securityContext) || !has(c.securityContext.allowPrivilegeEscalation) || !c.securityContext.allowPrivilegeEscalation)
    message: 'all containers must NOT set allowPrivilegeEscalation to true'
  - expression: object.spec.template.spec.containers.all(c, !has(c.securityContext) || !has(c.securityContext.Privileged) || !c.securityContext.Privileged)
    message: 'all containers must NOT set privileged to true'

Create the policy with kubectl. Great, no complain so far. But let's get the policy object back and take a look at its status.

kubectl get -oyaml validatingadmissionpolicies/pod-security.policy.example.com
  status:
    typeChecking:
      expressionWarnings:
      - fieldRef: spec.validations[3].expression
        warning: |
          apps/v1, Kind=Deployment: ERROR: <input>:1:76: undefined field 'Privileged'
           | object.spec.template.spec.containers.all(c, !has(c.securityContext) || !has(c.securityContext.Privileged) || !c.securityContext.Privileged)
           | ...........................................................................^
          ERROR: <input>:1:128: undefined field 'Privileged'
           | object.spec.template.spec.containers.all(c, !has(c.securityContext) || !has(c.securityContext.Privileged) || !c.securityContext.Privileged)
           | ...............................................................................................................................^

The policy was checked against its matched type, which is apps/v1.Deployment. Looking at the fieldRef, the problem was with the 3rd expression (index starts with 0) The expression in question accessed an undefined Privileged field. Ahh, looks like it was a copy-and-paste error. The field name should be in lowercase.

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: "pod-security.policy.example.com"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups:   ["apps"]
      apiVersions: ["v1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["deployments"]
  validations:
  - expression: object.spec.template.spec.containers.all(c, has(c.securityContext) && has(c.securityContext.runAsNonRoot) && c.securityContext.runAsNonRoot)
    message: 'all containers must set runAsNonRoot to true'
  - expression: object.spec.template.spec.containers.all(c, has(c.securityContext) && has(c.securityContext.readOnlyRootFilesystem) && c.securityContext.readOnlyRootFilesystem)
    message: 'all containers must set readOnlyRootFilesystem to true'
  - expression: object.spec.template.spec.containers.all(c, !has(c.securityContext) || !has(c.securityContext.allowPrivilegeEscalation) || !c.securityContext.allowPrivilegeEscalation)
    message: 'all containers must NOT set allowPrivilegeEscalation to true'
  - expression: object.spec.template.spec.containers.all(c, !has(c.securityContext) || !has(c.securityContext.privileged) || !c.securityContext.privileged)
    message: 'all containers must NOT set privileged to true'

Check its status again, and you should see all warnings cleared.

Next, let's create a namespace for our tests.

kubectl create namespace policy-test

Then, I bind the policy to the namespace. But at this point, I set the action to Warn so that the policy prints out warnings instead of rejecting the requests. This is especially useful to collect results from all expressions during development and automated testing.

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "pod-security.policy-binding.example.com"
spec:
  policyName: "pod-security.policy.example.com"
  validationActions: ["Warn"]
  matchResources:
    namespaceSelector:
      matchLabels:
        "kubernetes.io/metadata.name": "policy-test"

Tests out policy enforcement.

kubectl create -n policy-test -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
        securityContext:
          privileged: true
          allowPrivilegeEscalation: true
EOF
Warning: Validation failed for ValidatingAdmissionPolicy 'pod-security.policy.example.com' with binding 'pod-security.policy-binding.example.com': all containers must set runAsNonRoot to true
Warning: Validation failed for ValidatingAdmissionPolicy 'pod-security.policy.example.com' with binding 'pod-security.policy-binding.example.com': all containers must set readOnlyRootFilesystem to true
Warning: Validation failed for ValidatingAdmissionPolicy 'pod-security.policy.example.com' with binding 'pod-security.policy-binding.example.com': all containers must NOT set allowPrivilegeEscalation to true
Warning: Validation failed for ValidatingAdmissionPolicy 'pod-security.policy.example.com' with binding 'pod-security.policy-binding.example.com': all containers must NOT set privileged to true
Error from server: error when creating "STDIN": admission webhook "webhook.example.com" denied the request: [container "nginx" must set RunAsNonRoot to true in its SecurityContext, container "nginx" must set ReadOnlyRootFilesystem to true in its SecurityContext, container "nginx" must NOT set AllowPrivilegeEscalation to true in its SecurityContext, container "nginx" must NOT set Privileged to true in its SecurityContext]

Looks great! The policy and the webhook give equivalent results. After a few other cases, when we are confident with our policy, maybe it is time to do some cleanup.

  • For every expression, we repeat access to object.spec.template.spec.containers and to each securityContext;
  • There is a pattern of checking presence of a field and then accessing it, which looks a bit verbose.

Fortunately, since Kubernetes 1.28, we have new solutions for both issues. Variable Composition allows us to extract repeated sub-expressions into their own variables. Kubernetes enables the optional library for CEL, which are excellent to work with fields that are, you guessed it, optional.

With both features in mind, let's refactor the policy a bit.

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: "pod-security.policy.example.com"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups:   ["apps"]
      apiVersions: ["v1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["deployments"]
  variables:
  - name: containers
    expression: object.spec.template.spec.containers
  - name: securityContexts
    expression: 'variables.containers.map(c, c.?securityContext)'
  validations:
  - expression: variables.securityContexts.all(c, c.?runAsNonRoot == optional.of(true))
    message: 'all containers must set runAsNonRoot to true'
  - expression: variables.securityContexts.all(c, c.?readOnlyRootFilesystem == optional.of(true))
    message: 'all containers must set readOnlyRootFilesystem to true'
  - expression: variables.securityContexts.all(c, c.?allowPrivilegeEscalation != optional.of(true))
    message: 'all containers must NOT set allowPrivilegeEscalation to true'
  - expression: variables.securityContexts.all(c, c.?privileged != optional.of(true))
    message: 'all containers must NOT set privileged to true'

The policy is now much cleaner and more readable. Update the policy, and you should see it function the same as before.

Now let's change the policy binding from warning to actually denying requests that fail validation.

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "pod-security.policy-binding.example.com"
spec:
  policyName: "pod-security.policy.example.com"
  validationActions: ["Deny"]
  matchResources:
    namespaceSelector:
      matchLabels:
        "kubernetes.io/metadata.name": "policy-test"

And finally, remove the webhook. Now the result should include only messages from the policy.

kubectl create -n policy-test -f- <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
        securityContext:
          privileged: true
          allowPrivilegeEscalation: true
EOF
The deployments "nginx" is invalid: : ValidatingAdmissionPolicy 'pod-security.policy.example.com' with binding 'pod-security.policy-binding.example.com' denied request: all containers must set runAsNonRoot to true

Please notice that, by design, the policy will stop evaluation after the first expression that causes the request to be denied. This is different from what happens when the expressions generate only warnings.

Set up monitoring

Unlike a webhook, a policy is not a dedicated process that can expose its own metrics. Instead, you can use metrics from the API server in their place.

Here are some examples in Prometheus Query Language of common monitoring tasks.

To find the 95th percentile execution duration of the policy shown above.

histogram_quantile(0.95, sum(rate(apiserver_validating_admission_policy_check_duration_seconds_bucket{policy="pod-security.policy.example.com"}[5m])) by (le)) 

To find the rate of the policy evaluation.

rate(apiserver_validating_admission_policy_check_total{policy="pod-security.policy.example.com"}[5m])

You can read the metrics reference to learn more about the metrics above. The metrics of ValidatingAdmissionPolicy are currently in alpha, and more and better metrics will come while the stability graduates in the future release.

Kubernetes 1.30: Read-only volume mounts can be finally literally read-only

Read-only volume mounts have been a feature of Kubernetes since the beginning. Surprisingly, read-only mounts are not completely read-only under certain conditions on Linux. As of the v1.30 release, they can be made completely read-only, with alpha support for recursive read-only mounts.

Read-only volume mounts are not really read-only by default

Volume mounts can be deceptively complicated.

You might expect that the following manifest makes everything under /mnt in the containers read-only:

---
apiVersion: v1
kind: Pod
spec:
  volumes:
    - name: mnt
      hostPath:
        path: /mnt
  containers:
    - volumeMounts:
        - name: mnt
          mountPath: /mnt
          readOnly: true

However, any sub-mounts beneath /mnt may still be writable! For example, consider that /mnt/my-nfs-server is writeable on the host. Inside the container, writes to /mnt/* will be rejected but /mnt/my-nfs-server/* will still be writeable.

New mount option: recursiveReadOnly

Kubernetes 1.30 added a new mount option recursiveReadOnly so as to make submounts recursively read-only.

The option can be enabled as follows:

---
apiVersion: v1
kind: Pod
spec:
  volumes:
    - name: mnt
      hostPath:
        path: /mnt
  containers:
    - volumeMounts:
        - name: mnt
          mountPath: /mnt
          readOnly: true
          # NEW
          # Possible values are `Enabled`, `IfPossible`, and `Disabled`.
          # Needs to be specified in conjunction with `readOnly: true`.
          recursiveReadOnly: Enabled

This is implemented by applying the MOUNT_ATTR_RDONLY attribute with the AT_RECURSIVE flag using mount_setattr(2) added in Linux kernel v5.12.

For backwards compatibility, the recursiveReadOnly field is not a replacement for readOnly, but is used in conjunction with it. To get a properly recursive read-only mount, you must set both fields.

Feature availability

To enable recursiveReadOnly mounts, the following components have to be used:

  • Kubernetes: v1.30 or later, with the RecursiveReadOnlyMounts feature gate enabled. As of v1.30, the gate is marked as alpha.

  • CRI runtime:

    • containerd: v2.0 or later
  • OCI runtime:

    • runc: v1.1 or later
    • crun: v1.8.6 or later
  • Linux kernel: v5.12 or later

What's next?

Kubernetes SIG Node hope - and expect - that the feature will be promoted to beta and eventually general availability (GA) in future releases of Kubernetes, so that users no longer need to enable the feature gate manually.

The default value of recursiveReadOnly will still remain Disabled, for backwards compatibility.

How can I learn more?

Please check out the documentation for the further details of recursiveReadOnly mounts.

How to get involved?

This feature is driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!

Kubernetes 1.30: Beta Support For Pods With User Namespaces

Linux provides different namespaces to isolate processes from each other. For example, a typical Kubernetes pod runs within a network namespace to isolate the network identity and a PID namespace to isolate the processes.

One Linux namespace that was left behind is the user namespace. This namespace allows us to isolate the user and group identifiers (UIDs and GIDs) we use inside the container from the ones on the host.

This is a powerful abstraction that allows us to run containers as "root": we are root inside the container and can do everything root can inside the pod, but our interactions with the host are limited to what a non-privileged user can do. This is great for limiting the impact of a container breakout.

A container breakout is when a process inside a container can break out onto the host using some unpatched vulnerability in the container runtime or the kernel and can access/modify files on the host or other containers. If we run our pods with user namespaces, the privileges the container has over the rest of the host are reduced, and the files outside the container it can access are limited too.

In Kubernetes v1.25, we introduced support for user namespaces only for stateless pods. Kubernetes 1.28 lifted that restriction, and now, with Kubernetes 1.30, we are moving to beta!

What is a user namespace?

Note: Linux user namespaces are a different concept from Kubernetes namespaces. The former is a Linux kernel feature; the latter is a Kubernetes feature.

User namespaces are a Linux feature that isolates the UIDs and GIDs of the containers from the ones on the host. The identifiers in the container can be mapped to identifiers on the host in a way where the host UID/GIDs used for different containers never overlap. Furthermore, the identifiers can be mapped to unprivileged, non-overlapping UIDs and GIDs on the host. This brings two key benefits:

  • Prevention of lateral movement: As the UIDs and GIDs for different containers are mapped to different UIDs and GIDs on the host, containers have a harder time attacking each other, even if they escape the container boundaries. For example, suppose container A runs with different UIDs and GIDs on the host than container B. In that case, the operations it can do on container B's files and processes are limited: only read/write what a file allows to others, as it will never have permission owner or group permission (the UIDs/GIDs on the host are guaranteed to be different for different containers).

  • Increased host isolation: As the UIDs and GIDs are mapped to unprivileged users on the host, if a container escapes the container boundaries, even if it runs as root inside the container, it has no privileges on the host. This greatly protects what host files it can read/write, which process it can send signals to, etc. Furthermore, capabilities granted are only valid inside the user namespace and not on the host, limiting the impact a container escape can have.

Image showing IDs 0-65535 are reserved to the host, pods use higher IDs

User namespace IDs allocation

Without using a user namespace, a container running as root in the case of a container breakout has root privileges on the node. If some capabilities were granted to the container, the capabilities are valid on the host too. None of this is true when using user namespaces (modulo bugs, of course 🙂).

Changes in 1.30

In Kubernetes 1.30, besides moving user namespaces to beta, the contributors working on this feature:

  • Introduced a way for the kubelet to use custom ranges for the UIDs/GIDs mapping
  • Have added a way for Kubernetes to enforce that the runtime supports all the features needed for user namespaces. If they are not supported, Kubernetes will show a clear error when trying to create a pod with user namespaces. Before 1.30, if the container runtime didn't support user namespaces, the pod could be created without a user namespace.
  • Added more tests, including tests in the cri-tools repository.

You can check the documentation on user namespaces for how to configure custom ranges for the mapping.

Demo

A few months ago, CVE-2024-21626 was disclosed. This vulnerability score is 8.6 (HIGH). It allows an attacker to escape a container and read/write to any path on the node and other pods hosted on the same node.

Rodrigo created a demo that exploits CVE 2024-21626 and shows how the exploit, which works without user namespaces, is mitigated when user namespaces are in use.

Please note that with user namespaces, an attacker can do on the host file system what the permission bits for "others" allow. Therefore, the CVE is not completely prevented, but the impact is greatly reduced.

Node system requirements

There are requirements on the Linux kernel version and the container runtime to use this feature.

On Linux you need Linux 6.3 or greater. This is because the feature relies on a kernel feature named idmap mounts, and support for using idmap mounts with tmpfs was merged in Linux 6.3.

Suppose you are using CRI-O with crun; as always, you can expect support for Kubernetes 1.30 with CRI-O 1.30. Please note you also need crun 1.9 or greater. If you are using CRI-O with runc, this is still not supported.

Containerd support is currently targeted for containerd 2.0, and the same crun version requirements apply. If you are using containerd with runc, this is still not supported.

Please note that containerd 1.7 added experimental support for user namespaces, as implemented in Kubernetes 1.25 and 1.26. We did a redesign in Kubernetes 1.27, which requires changes in the container runtime. Those changes are not present in containerd 1.7, so it only works with user namespaces support in Kubernetes 1.25 and 1.26.

Another limitation of containerd 1.7 is that it needs to change the ownership of every file and directory inside the container image during Pod startup. This has a storage overhead and can significantly impact the container startup latency. Containerd 2.0 will probably include an implementation that will eliminate the added startup latency and storage overhead. Consider this if you plan to use containerd 1.7 with user namespaces in production.

None of these containerd 1.7 limitations apply to CRI-O.

How do I get involved?

You can reach SIG Node by several means:

You can also contact us directly:

  • GitHub: @rata @giuseppe @saschagrunert
  • Slack: @rata @giuseppe @sascha

Kubernetes v1.30: Uwubernetes

Editors: Amit Dsouza, Frederick Kautz, Kristin Martin, Abigail McCarthy, Natali Vlatko

Announcing the release of Kubernetes v1.30: Uwubernetes, the cutest release!

Similar to previous releases, the release of Kubernetes v1.30 introduces new stable, beta, and alpha features. The consistent delivery of top-notch releases underscores the strength of our development cycle and the vibrant support from our community.

This release consists of 45 enhancements. Of those enhancements, 17 have graduated to Stable, 18 are entering Beta, and 10 have graduated to Alpha.

Kubernetes v1.30: Uwubernetes

Kubernetes v1.30 makes your clusters cuter!

Kubernetes is built and released by thousands of people from all over the world and all walks of life. Most contributors are not being paid to do this; we build it for fun, to solve a problem, to learn something, or for the simple love of the community. Many of us found our homes, our friends, and our careers here. The Release Team is honored to be a part of the continued growth of Kubernetes.

For the people who built it, for the people who release it, and for the furries who keep all of our clusters online, we present to you Kubernetes v1.30: Uwubernetes, the cutest release to date. The name is a portmanteau of “kubernetes” and “UwU,” an emoticon used to indicate happiness or cuteness. We’ve found joy here, but we’ve also brought joy from our outside lives that helps to make this community as weird and wonderful and welcoming as it is. We’re so happy to share our work with you.

UwU ♥️

Improvements that graduated to stable in Kubernetes v1.30

This is a selection of some of the improvements that are now stable following the v1.30 release.

Robust VolumeManager reconstruction after kubelet restart (SIG Storage)

This is a volume manager refactoring that allows the kubelet to populate additional information about how existing volumes are mounted during the kubelet startup. In general, this makes volume cleanup after kubelet restart or machine reboot more robust.

This does not bring any changes for user or cluster administrators. We used the feature process and feature gate NewVolumeManagerReconstruction to be able to fall back to the previous behavior in case something goes wrong. Now that the feature is stable, the feature gate is locked and cannot be disabled.

Prevent unauthorized volume mode conversion during volume restore (SIG Storage)

For Kubernetes v1.30, the control plane always prevents unauthorized changes to volume modes when restoring a snapshot into a PersistentVolume. As a cluster administrator, you'll need to grant permissions to the appropriate identity principals (for example: ServiceAccounts representing a storage integration) if you need to allow that kind of change at restore time.

Warning:

Action required before upgrading. The prevent-volume-mode-conversion feature flag is enabled by default in the external-provisioner v4.0.0 and external-snapshotter v7.0.0. Volume mode change will be rejected when creating a PVC from a VolumeSnapshot unless you perform the steps described in the "Urgent Upgrade Notes" sections for the external-provisioner 4.0.0 and the external-snapshotter v7.0.0.

For more information on this feature also read converting the volume mode of a Snapshot.

Pod Scheduling Readiness (SIG Scheduling)

Pod scheduling readiness graduates to stable this release, after being promoted to beta in Kubernetes v1.27.

This now-stable feature lets Kubernetes avoid trying to schedule a Pod that has been defined, when the cluster doesn't yet have the resources provisioned to allow actually binding that Pod to a node. That's not the only use case; the custom control on whether a Pod can be allowed to schedule also lets you implement quota mechanisms, security controls, and more.

Crucially, marking these Pods as exempt from scheduling cuts the work that the scheduler would otherwise do, churning through Pods that can't or won't schedule onto the nodes your cluster currently has. If you have cluster autoscaling active, using scheduling gates doesn't just cut the load on the scheduler, it can also save money. Without scheduling gates, the autoscaler might otherwise launch a node that doesn't need to be started.

In Kubernetes v1.30, by specifying (or removing) a Pod's .spec.schedulingGates, you can control when a Pod is ready to be considered for scheduling. This is a stable feature and is now formally part of the Kubernetes API definition for Pod.

Min domains in PodTopologySpread (SIG Scheduling)

The minDomains parameter for PodTopologySpread constraints graduates to stable this release, which allows you to define the minimum number of domains. This feature is designed to be used with Cluster Autoscaler.

If you previously attempted use and there weren't enough domains already present, Pods would be marked as unschedulable. The Cluster Autoscaler would then provision node(s) in new domain(s), and you'd eventually get Pods spreading over enough domains.

Go workspaces for k/k (SIG Architecture)

The Kubernetes repo now uses Go workspaces. This should not impact end users at all, but does have a impact for developers of downstream projects. Switching to workspaces caused some breaking changes in the flags to the various k8s.io/code-generator tools. Downstream consumers should look at staging/src/k8s.io/code-generator/kube_codegen.sh to see the changes.

For full details on the changes and reasons why Go workspaces was introduced, read Using Go workspaces in Kubernetes.

Improvements that graduated to beta in Kubernetes v1.30

This is a selection of some of the improvements that are now beta following the v1.30 release.

Node log query (SIG Windows)

To help with debugging issues on nodes, Kubernetes v1.27 introduced a feature that allows fetching logs of services running on the node. To use the feature, ensure that the NodeLogQuery feature gate is enabled for that node, and that the kubelet configuration options enableSystemLogHandler and enableSystemLogQuery are both set to true.

Following the v1.30 release, this is now beta (you still need to enable the feature to use it, though).

On Linux the assumption is that service logs are available via journald. On Windows the assumption is that service logs are available in the application log provider. Logs are also available by reading files within /var/log/ (Linux) or C:\var\log\ (Windows). For more information, see the log query documentation.

CRD validation ratcheting (SIG API Machinery)

You need to enable the CRDValidationRatcheting feature gate to use this behavior, which then applies to all CustomResourceDefinitions in your cluster.

Provided you enabled the feature gate, Kubernetes implements validation ratcheting for CustomResourceDefinitions. The API server is willing to accept updates to resources that are not valid after the update, provided that each part of the resource that failed to validate was not changed by the update operation. In other words, any invalid part of the resource that remains invalid must have already been wrong. You cannot use this mechanism to update a valid resource so that it becomes invalid.

This feature allows authors of CRDs to confidently add new validations to the OpenAPIV3 schema under certain conditions. Users can update to the new schema safely without bumping the version of the object or breaking workflows.

Contextual logging (SIG Instrumentation)

Contextual Logging advances to beta in this release, empowering developers and operators to inject customizable, correlatable contextual details like service names and transaction IDs into logs through WithValues and WithName. This enhancement simplifies the correlation and analysis of log data across distributed systems, significantly improving the efficiency of troubleshooting efforts. By offering a clearer insight into the workings of your Kubernetes environments, Contextual Logging ensures that operational challenges are more manageable, marking a notable step forward in Kubernetes observability.

Make Kubernetes aware of the LoadBalancer behaviour (SIG Network)

The LoadBalancerIPMode feature gate is now beta and is now enabled by default. This feature allows you to set the .status.loadBalancer.ingress.ipMode for a Service with type set to LoadBalancer. The .status.loadBalancer.ingress.ipMode specifies how the load-balancer IP behaves. It may be specified only when the .status.loadBalancer.ingress.ip field is also specified. See more details about specifying IPMode of load balancer status.

Structured Authentication Configuration (SIG Auth)

Structured Authentication Configuration graduates to beta in this release.

Kubernetes has had a long-standing need for a more flexible and extensible authentication system. The current system, while powerful, has some limitations that make it difficult to use in certain scenarios. For example, it is not possible to use multiple authenticators of the same type (e.g., multiple JWT authenticators) or to change the configuration without restarting the API server. The Structured Authentication Configuration feature is the first step towards addressing these limitations and providing a more flexible and extensible way to configure authentication in Kubernetes. See more details about structured authentication configuration.

Structured Authorization Configuration (SIG Auth)

Structured Authorization Configuration graduates to beta in this release.

Kubernetes continues to evolve to meet the intricate requirements of system administrators and developers alike. A critical aspect of Kubernetes that ensures the security and integrity of the cluster is the API server authorization. Until recently, the configuration of the authorization chain in kube-apiserver was somewhat rigid, limited to a set of command-line flags and allowing only a single webhook in the authorization chain. This approach, while functional, restricted the flexibility needed by cluster administrators to define complex, fine-grained authorization policies. The latest Structured Authorization Configuration feature aims to revolutionize this aspect by introducing a more structured and versatile way to configure the authorization chain, focusing on enabling multiple webhooks and providing explicit control mechanisms. See more details about structured authorization configuration.

New alpha features

Speed up recursive SELinux label change (SIG Storage)

From the v1.27 release, Kubernetes already included an optimization that sets SELinux labels on the contents of volumes, using only constant time. Kubernetes achieves that speed up using a mount option. The slower legacy behavior requires the container runtime to recursively walk through the whole volumes and apply SELinux labelling individually to each file and directory; this is especially noticable for volumes with large amount of files and directories.

Kubernetes v1.27 graduated this feature as beta, but limited it to ReadWriteOncePod volumes. The corresponding feature gate is SELinuxMountReadWriteOncePod. It's still enabled by default and remains beta in v1.30.

Kubernetes v1.30 extends support for SELinux mount option to all volumes as alpha, with a separate feature gate: SELinuxMount. This feature gate introduces a behavioral change when multiple Pods with different SELinux labels share the same volume. See KEP for details.

We strongly encourage users that run Kubernetes with SELinux enabled to test this feature and provide any feedback on the KEP issue.

Feature gateStage in v1.30Behavior change
SELinuxMountReadWriteOncePodBetaNo
SELinuxMountAlphaYes

Both feature gates SELinuxMountReadWriteOncePod and SELinuxMount must be enabled to test this feature on all volumes.

This feature has no effect on Windows nodes or on Linux nodes without SELinux support.

Recursive Read-only (RRO) mounts (SIG Node)

Introducing Recursive Read-Only (RRO) Mounts in alpha this release, you'll find a new layer of security for your data. This feature lets you set volumes and their submounts as read-only, preventing accidental modifications. Imagine deploying a critical application where data integrity is key—RRO Mounts ensure that your data stays untouched, reinforcing your cluster's security with an extra safeguard. This is especially crucial in tightly controlled environments, where even the slightest change can have significant implications.

Job success/completion policy (SIG Apps)

From Kubernetes v1.30, indexed Jobs support .spec.successPolicy to define when a Job can be declared succeeded based on succeeded Pods. This allows you to define two types of criteria:

  • succeededIndexes indicates that the Job can be declared succeeded when these indexes succeeded, even if other indexes failed.
  • succeededCount indicates that the Job can be declared succeeded when the number of succeeded Indexes reaches this criterion.

After the Job meets the success policy, the Job controller terminates the lingering Pods.

Traffic distribution for services (SIG Network)

Kubernetes v1.30 introduces the spec.trafficDistribution field within a Kubernetes Service as alpha. This allows you to express preferences for how traffic should be routed to Service endpoints. While traffic policies focus on strict semantic guarantees, traffic distribution allows you to express preferences (such as routing to topologically closer endpoints). This can help optimize for performance, cost, or reliability. You can use this field by enabling the ServiceTrafficDistribution feature gate for your cluster and all of its nodes. In Kubernetes v1.30, the following field value is supported:

PreferClose: Indicates a preference for routing traffic to endpoints that are topologically proximate to the client. The interpretation of "topologically proximate" may vary across implementations and could encompass endpoints within the same node, rack, zone, or even region. Setting this value gives implementations permission to make different tradeoffs, for example optimizing for proximity rather than equal distribution of load. You should not set this value if such tradeoffs are not acceptable.

If the field is not set, the implementation (like kube-proxy) will apply its default routing strategy.

See Traffic Distribution for more details.

Storage Version Migration (SIG API Machinery)

Kubernetes v1.30 introduces a new built-in API for StorageVersionMigration. Kubernetes relies on API data being actively re-written, to support some maintenance activities related to at rest storage. Two prominent examples are the versioned schema of stored resources (that is, the preferred storage schema changing from v1 to v2 for a given resource) and encryption at rest (that is, rewriting stale data based on a change in how the data should be encrypted).

StorageVersionMigration is alpha API which was available out of tree before.

See storage version migration for more details.

Graduations, deprecations and removals for Kubernetes v1.30

Graduated to stable

This lists all the features that graduated to stable (also known as general availability). For a full list of updates including new features and graduations from alpha to beta, see the release notes.

This release includes a total of 17 enhancements promoted to Stable:

Deprecations and removals

Removed the SecurityContextDeny admission plugin, deprecated since v1.27

(SIG Auth, SIG Security, and SIG Testing) With the removal of the SecurityContextDeny admission plugin, the Pod Security Admission plugin, available since v1.25, is recommended instead.

Release notes

Check out the full details of the Kubernetes v1.30 release in our release notes.

Availability

Kubernetes v1.30 is available for download on GitHub. To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using minikube. You can also easily install v1.30 using kubeadm.

Release team

Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.

We would like to thank the entire release team for the hours spent hard at work to deliver the Kubernetes v1.30 release to our community. The Release Team's membership ranges from first-time shadows to returning team leads with experience forged over several release cycles. A very special thanks goes out our release lead, Kat Cosgrove, for supporting us through a successful release cycle, advocating for us, making sure that we could all contribute in the best way possible, and challenging us to improve the release process.

Project velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

In the v1.30 release cycle, which ran for 14 weeks (January 8 to April 17), we saw contributions from 863 companies and 1391 individuals.

Event update

  • KubeCon + CloudNativeCon China 2024 will take place in Hong Kong, from 21 – 23 August 2024! You can find more information about the conference and registration on the event site.
  • KubeCon + CloudNativeCon North America 2024 will take place in Salt Lake City, Utah, The United States of America, from 12 – 15 November 2024! You can find more information about the conference and registration on the eventsite.

Upcoming release webinar

Join members of the Kubernetes v1.30 release team on Thursday, May 23rd, 2024, at 9 A.M. PT to learn about the major features of this release, as well as deprecations and removals to help plan for upgrades. For more information and registration, visit the event page on the CNCF Online Programs site.

Get involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

This blog was updated on April 19th, 2024 to highlight two additional changes not originally included in the release blog.

Spotlight on SIG Architecture: Code Organization

This is the third interview of a SIG Architecture Spotlight series that will cover the different subprojects. We will cover SIG Architecture: Code Organization.

In this SIG Architecture spotlight I talked with Madhav Jivrajani (VMware), a member of the Code Organization subproject.

Introducing the Code Organization subproject

Frederico (FSM): Hello Madhav, thank you for your availability. Could you start by telling us a bit about yourself, your role and how you got involved in Kubernetes?

Madhav Jivrajani (MJ): Hello! My name is Madhav Jivrajani, I serve as a technical lead for SIG Contributor Experience and a GitHub Admin for the Kubernetes project. Apart from that I also contribute to SIG API Machinery and SIG Etcd, but more recently, I’ve been helping out with the work that is needed to help Kubernetes stay on supported versions of Go, and it is through this that I am involved with the Code Organization subproject of SIG Architecture.

FSM: A project the size of Kubernetes must have unique challenges in terms of code organization -- is this a fair assumption? If so, what would you pick as some of the main challenges that are specific to Kubernetes?

MJ: That’s a fair assumption! The first interesting challenge comes from the sheer size of the Kubernetes codebase. We have ≅2.2 million lines of Go code (which is steadily decreasing thanks to dims and other folks in this sub-project!), and a little over 240 dependencies that we rely on either directly or indirectly, which is why having a sub-project dedicated to helping out with dependency management is crucial: we need to know what dependencies we’re pulling in, what versions these dependencies are at, and tooling to help make sure we are managing these dependencies across different parts of the codebase in a consistent manner.

Another interesting challenge with Kubernetes is that we publish a lot of Go modules as part of the Kubernetes release cycles, one example of this is client-go.However, we as a project would also like the benefits of having everything in one repository to get the advantages of using a monorepo, like atomic commits... so, because of this, code organization works with other SIGs (like SIG Release) to automate the process of publishing code from the monorepo to downstream individual repositories which are much easier to consume, and this way you won’t have to import the entire Kubernetes codebase!

Code organization and Kubernetes

FSM: For someone just starting contributing to Kubernetes code-wise, what are the main things they should consider in terms of code organization? How would you sum up the key concepts?

MJ: I think one of the key things to keep in mind at least as you’re starting off is the concept of staging directories. In the kubernetes/kubernetes repository, you will come across a directory called staging/. The sub-folders in this directory serve as a bunch of pseudo-repositories. For example, the kubernetes/client-go repository that publishes releases for client-go is actually a staging repo.

FSM: So the concept of staging directories fundamentally impact contributions?

MJ: Precisely, because if you’d like to contribute to any of the staging repos, you will need to send in a PR to its corresponding staging directory in kubernetes/kubernetes. Once the code merges there, we have a bot called the publishing-bot that will sync the merged commits to the required staging repositories (like kubernetes/client-go). This way we get the benefits of a monorepo but we also can modularly publish code for downstream consumption. PS: The publishing-bot needs more folks to help out!

For more information on staging repositories, please see the contributor documentation.

FSM: Speaking of contributions, the very high number of contributors, both individuals and companies, must also be a challenge: how does the subproject operate in terms of making sure that standards are being followed?

MJ: When it comes to dependency management in the project, there is a dedicated team that helps review and approve dependency changes. These are folks who have helped lay the foundation of much of the tooling that Kubernetes uses today for dependency management. This tooling helps ensure there is a consistent way that contributors can make changes to dependencies. The project has also worked on additional tooling to signal statistics of dependencies that is being added or removed: depstat

Apart from dependency management, another crucial task that the project does is management of the staging repositories. The tooling for achieving this (publishing-bot) is completely transparent to contributors and helps ensure that the staging repos get a consistent view of contributions that are submitted to kubernetes/kubernetes.

Code Organization also works towards making sure that Kubernetes stays on supported versions of Go. The linked KEP provides more context on why we need to do this. We collaborate with SIG Release to ensure that we are testing Kubernetes as rigorously and as early as we can on Go releases and working on changes that break our CI as a part of this. An example of how we track this process can be found here.

Release cycle and current priorities

FSM: Is there anything that changes during the release cycle?

MJ During the release cycle, specifically before code freeze, there are often changes that go in that add/update/delete dependencies, fix code that needs fixing as part of our effort to stay on supported versions of Go.

Furthermore, some of these changes are also candidates for backporting to our supported release branches.

FSM: Is there any major project or theme the subproject is working on right now that you would like to highlight?

MJ: I think one very interesting and immensely useful change that has been recently added (and I take the opportunity to specifically highlight the work of Tim Hockin on this) is the introduction of Go workspaces to the Kubernetes repo. A lot of our current tooling for dependency management and code publishing, as well as the experience of editing code in the Kubernetes repo, can be significantly improved by this change.

Wrapping up

FSM: How would someone interested in the topic start helping the subproject?

MJ: The first step, as is the first step with any project in Kubernetes, is to join our slack: slack.k8s.io, and after that join the #k8s-code-organization channel. There is also a code-organization office hours that takes place that you can choose to attend. Timezones are hard, so feel free to also look at the recordings or meeting notes and follow up on slack!

FSM: Excellent, thank you! Any final comments you would like to share?

MJ: The Code Organization subproject always needs help! Especially areas like the publishing bot, so don’t hesitate to get involved in the #k8s-code-organization Slack channel.

DIY: Create Your Own Cloud with Kubernetes (Part 3)

Approaching the most interesting phase, this article delves into running Kubernetes within Kubernetes. Technologies such as Kamaji and Cluster API are highlighted, along with their integration with KubeVirt.

Previous discussions have covered preparing Kubernetes on bare metal and how to turn Kubernetes into virtual machines management system. This article concludes the series by explaining how, using all of the above, you can build a full-fledged managed Kubernetes and run virtual Kubernetes clusters with just a click.

First up, let's dive into the Cluster API.

Cluster API

Cluster API is an extension for Kubernetes that allows the management of Kubernetes clusters as custom resources within another Kubernetes cluster.

The main goal of the Cluster API is to provide a unified interface for describing the basic entities of a Kubernetes cluster and managing their lifecycle. This enables the automation of processes for creating, updating, and deleting clusters, simplifying scaling, and infrastructure management.

Within the context of Cluster API, there are two terms: management cluster and tenant clusters.

  • Management cluster is a Kubernetes cluster used to deploy and manage other clusters. This cluster contains all the necessary Cluster API components and is responsible for describing, creating, and updating tenant clusters. It is often used just for this purpose.
  • Tenant clusters are the user clusters or clusters deployed using the Cluster API. They are created by describing the relevant resources in the management cluster. They are then used for deploying applications and services by end-users.

It's important to understand that physically, tenant clusters do not necessarily have to run on the same infrastructure with the management cluster; more often, they are running elsewhere.

A diagram showing interaction of management Kubernetes cluster and tenant Kubernetes clusters using Cluster API

A diagram showing interaction of management Kubernetes cluster and tenant Kubernetes clusters using Cluster API

For its operation, Cluster API utilizes the concept of providers which are separate controllers responsible for specific components of the cluster being created. Within Cluster API, there are several types of providers. The major ones are:

  • Infrastructure Provider, which is responsible for providing the computing infrastructure, such as virtual machines or physical servers.
  • Control Plane Provider, which provides the Kubernetes control plane, namely the components kube-apiserver, kube-scheduler, and kube-controller-manager.
  • Bootstrap Provider, which is used for generating cloud-init configuration for the virtual machines and servers being created.

To get started, you will need to install the Cluster API itself and one provider of each type. You can find a complete list of supported providers in the project's documentation.

For installation, you can use the clusterctl utility, or Cluster API Operator as the more declarative method.

Choosing providers

Infrastructure provider

To run Kubernetes clusters using KubeVirt, the KubeVirt Infrastructure Provider must be installed. It enables the deployment of virtual machines for worker nodes in the same management cluster, where the Cluster API operates.

Control plane provider

The Kamaji project offers a ready solution for running the Kubernetes control plane for tenant clusters as containers within the management cluster. This approach has several significant advantages:

  • Cost-effectiveness: Running the control plane in containers avoids the use of separate control plane nodes for each cluster, thereby significantly reducing infrastructure costs.
  • Stability: Simplifying architecture by eliminating complex multi-layered deployment schemes. Instead of sequentially launching a virtual machine and then installing etcd and Kubernetes components inside it, there's a simple control plane that is deployed and run as a regular application inside Kubernetes and managed by an operator.
  • Security: The cluster's control plane is hidden from the end user, reducing the possibility of its components being compromised, and also eliminates user access to the cluster's certificate store. This approach to organizing a control plane invisible to the user is often used by cloud providers.

Bootstrap provider

Kubeadm as the Bootstrap Provider - as the standard method for preparing clusters in Cluster API. This provider is developed as part of the Cluster API itself. It requires only a prepared system image with kubelet and kubeadm installed and allows generating configs in the cloud-init and ignition formats.

It's worth noting that Talos Linux also supports provisioning via the Cluster API and has providers for this. Although previous articles discussed using Talos Linux to set up a management cluster on bare-metal nodes, to provision tenant clusters the Kamaji+Kubeadm approach has more advantages. It facilitates the deployment of Kubernetes control planes in containers, thus removing the need for separate virtual machines for control plane instances. This simplifies the management and reduces costs.

How it works

The primary object in Cluster API is the Cluster resource, which acts as the parent for all the others. Typically, this resource references two others: a resource describing the control plane and a resource describing the infrastructure, each managed by a separate provider.

Unlike the Cluster, these two resources are not standardized, and their kind depends on the specific provider you are using:

A diagram showing the relationship of a Cluster resource and the resources it links to in Cluster API

A diagram showing the relationship of a Cluster resource and the resources it links to in Cluster API

Within Cluster API, there is also a resource named MachineDeployment, which describes a group of nodes, whether they are physical servers or virtual machines. This resource functions similarly to standard Kubernetes resources such as Deployment, ReplicaSet, and Pod, providing a mechanism for the declarative description of a group of nodes and automatic scaling.

In other words, the MachineDeployment resource allows you to declaratively describe nodes for your cluster, automating their creation, deletion, and updating according to specified parameters and the requested number of replicas.

A diagram showing the relationship of a Cluster resource and its children in Cluster API

A diagram showing the relationship of a MachineDeployment resource and its children in Cluster API

To create machines, MachineDeployment refers to a template for generating the machine itself and a template for generating its cloud-init config:

A diagram showing the relationship of a Cluster resource and the resources it links to in Cluster API

A diagram showing the relationship of a MachineDeployment resource and the resources it links to in Cluster API

To deploy a new Kubernetes cluster using Cluster API, you will need to prepare the following set of resources:

  • A general Cluster resource
  • A KamajiControlPlane resource, responsible for the control plane operated by Kamaji
  • A KubevirtCluster resource, describing the cluster configuration in KubeVirt
  • A KubevirtMachineTemplate resource, responsible for the virtual machine template
  • A KubeadmConfigTemplate resource, responsible for generating tokens and cloud-init
  • At least one MachineDeployment to create some workers

Polishing the cluster

In most cases, this is sufficient, but depending on the providers used, you may need other resources as well. You can find examples of the resources created for each type of provider in the Kamaji project documentation.

At this stage, you already have a ready tenant Kubernetes cluster, but so far, it contains nothing but API workers and a few core plugins that are standardly included in the installation of any Kubernetes cluster: kube-proxy and CoreDNS. For full integration, you will need to install several more components:

To install additional components, you can use a separate Cluster API Add-on Provider for Helm, or the same FluxCD discussed in previous articles.

When creating resources in FluxCD, it's possible to specify the target cluster by referring to the kubeconfig generated by Cluster API. Then, the installation will be performed directly into it. Thus, FluxCD becomes a universal tool for managing resources both in the management cluster and in the user tenant clusters.

A diagram showing the interaction scheme of fluxcd, which can install components in both management and tenant Kubernetes clusters

A diagram showing the interaction scheme of fluxcd, which can install components in both management and tenant Kubernetes clusters

What components are being discussed here? Generally, the set includes the following:

CNI Plugin

To ensure communication between pods in a tenant Kubernetes cluster, it's necessary to deploy a CNI plugin. This plugin creates a virtual network that allows pods to interact with each other and is traditionally deployed as a Daemonset on the cluster's worker nodes. You can choose and install any CNI plugin that you find suitable.

A diagram showing a CNI plugin installed inside the tenant Kubernetes cluster on a scheme of nested Kubernetes clusters

A diagram showing a CNI plugin installed inside the tenant Kubernetes cluster on a scheme of nested Kubernetes clusters

Cloud Controller Manager

The main task of the Cloud Controller Manager (CCM) is to integrate Kubernetes with the cloud infrastructure provider's environment (in your case, it is the management Kubernetes cluster in which all worksers of tenant Kubernetes are provisioned). Here are some tasks it performs:

  1. When a service of type LoadBalancer is created, the CCM initiates the process of creating a cloud load balancer, which directs traffic to your Kubernetes cluster.
  2. If a node is removed from the cloud infrastructure, the CCM ensures its removal from your cluster as well, maintaining the cluster's current state.
  3. When using the CCM, nodes are added to the cluster with a special taint, node.cloudprovider.kubernetes.io/uninitialized, which allows for the processing of additional business logic if necessary. After successful initialization, this taint is removed from the node.

Depending on the cloud provider, the CCM can operate both inside and outside the tenant cluster.

The KubeVirt Cloud Provider is designed to be installed in the external parent management cluster. Thus, creating services of type LoadBalancer in the tenant cluster initiates the creation of LoadBalancer services in the parent cluster, which direct traffic into the tenant cluster.

A diagram showing a Cloud Controller Manager installed outside of a tenant Kubernetes cluster on a scheme of nested Kubernetes clusters and the mapping of services it manages from the parent to the child Kubernetes cluster

A diagram showing a Cloud Controller Manager installed outside of a tenant Kubernetes cluster on a scheme of nested Kubernetes clusters and the mapping of services it manages from the parent to the child Kubernetes cluster

CSI Driver

The Container Storage Interface (CSI) is divided into two main parts for interacting with storage in Kubernetes:

  • csi-controller: This component is responsible for interacting with the cloud provider's API to create, delete, attach, detach, and resize volumes.
  • csi-node: This component runs on each node and facilitates the mounting of volumes to pods as requested by kubelet.

In the context of using the KubeVirt CSI Driver, a unique opportunity arises. Since virtual machines in KubeVirt runs within the management Kubernetes cluster, where a full-fledged Kubernetes API is available, this opens the path for running the csi-controller outside of the user's tenant cluster. This approach is popular in the KubeVirt community and offers several key advantages:

  • Security: This method hides the internal cloud API from the end-user, providing access to resources exclusively through the Kubernetes interface. Thus, it reduces the risk of direct access to the management cluster from user clusters.
  • Simplicity and Convenience: Users don't need to manage additional controllers in their clusters, simplifying the architecture and reducing the management burden.

However, the CSI-node must necessarily run inside the tenant cluster, as it directly interacts with kubelet on each node. This component is responsible for the mounting and unmounting of volumes into pods, requiring close integration with processes occurring directly on the cluster nodes.

The KubeVirt CSI Driver acts as a proxy for ordering volumes. When a PVC is created inside the tenant cluster, a PVC is created in the management cluster, and then the created PV is connected to the virtual machine.

A diagram showing a CSI plugin components installed on both inside and outside of a tenant Kubernetes cluster on a scheme of nested Kubernetes clusters and the mapping of persistent volumes it manages from the parent to the child Kubernetes cluster

A diagram showing a CSI plugin components installed on both inside and outside of a tenant Kubernetes cluster on a scheme of nested Kubernetes clusters and the mapping of persistent volumes it manages from the parent to the child Kubernetes cluster

Cluster Autoscaler

The Cluster Autoscaler is a versatile component that can work with various cloud APIs, and its integration with Cluster-API is just one of the available functions. For proper configuration, it requires access to two clusters: the tenant cluster, to track pods and determine the need for adding new nodes, and the managing Kubernetes cluster (management kubernetes cluster), where it interacts with the MachineDeployment resource and adjusts the number of replicas.

Although Cluster Autoscaler usually runs inside the tenant Kubernetes cluster, in this situation, it is suggested to install it outside for the same reasons described before. This approach is simpler to maintain and more secure as it prevents users of tenant clusters from accessing the management API of the management cluster.

A diagram showing a Cloud Controller Manager installed outside of a tenant Kubernetes cluster on a scheme of nested Kubernetes clusters

A diagram showing a Cluster Autoscaler installed outside of a tenant Kubernetes cluster on a scheme of nested Kubernetes clusters

Konnectivity

There's another additional component I'd like to mention - Konnectivity. You will likely need it later on to get webhooks and the API aggregation layer working in your tenant Kubernetes cluster. This topic is covered in detail in one of my previous article.

Unlike the components presented above, Kamaji allows you to easily enable Konnectivity and manage it as one of the core components of your tenant cluster, alongside kube-proxy and CoreDNS.

Conclusion

Now you have a fully functional Kubernetes cluster with the capability for dynamic scaling, automatic provisioning of volumes, and load balancers.

Going forward, you might consider metrics and logs collection from your tenant clusters, but that goes beyond the scope of this article.

Of course, all the components necessary for deploying a Kubernetes cluster can be packaged into a single Helm chart and deployed as a unified application. This is precisely how we organize the deployment of managed Kubernetes clusters with the click of a button on our open PaaS platform, Cozystack, where you can try all the technologies described in the article for free.

DIY: Create Your Own Cloud with Kubernetes (Part 2)

Continuing our series of posts on how to build your own cloud using just the Kubernetes ecosystem. In the previous article, we explained how we prepare a basic Kubernetes distribution based on Talos Linux and Flux CD. In this article, we'll show you a few various virtualization technologies in Kubernetes and prepare everything need to run virtual machines in Kubernetes, primarily storage and networking.

We will talk about technologies such as KubeVirt, LINSTOR, and Kube-OVN.

But first, let's explain what virtual machines are needed for, and why can't you just use docker containers for building cloud? The reason is that containers do not provide a sufficient level of isolation. Although the situation improves year by year, we often encounter vulnerabilities that allow escaping the container sandbox and elevating privileges in the system.

On the other hand, Kubernetes was not originally designed to be a multi-tenant system, meaning the basic usage pattern involves creating a separate Kubernetes cluster for every independent project and development team.

Virtual machines are the primary means of isolating tenants from each other in a cloud environment. In virtual machines, users can execute code and programs with administrative privilege, but this doesn't affect other tenants or the environment itself. In other words, virtual machines allow to achieve hard multi-tenancy isolation, and run in environments where tenants do not trust each other.

Virtualization technologies in Kubernetes

There are several different technologies that bring virtualization into the Kubernetes world: KubeVirt and Kata Containers are the most popular ones. But you should know that they work differently.

Kata Containers implements the CRI (Container Runtime Interface) and provides an additional level of isolation for standard containers by running them in virtual machines. But they work in a same single Kubernetes-cluster.

A diagram showing how container isolation is ensured by running containers in virtual machines with Kata Containers

A diagram showing how container isolation is ensured by running containers in virtual machines with Kata Containers

KubeVirt allows running traditional virtual machines using the Kubernetes API. KubeVirt virtual machines are run as regular linux processes in containers. In other words, in KubeVirt, a container is used as a sandbox for running virtual machine (QEMU) processes. This can be clearly seen in the figure below, by looking at how live migration of virtual machines is implemented in KubeVirt. When migration is needed, the virtual machine moves from one container to another.

A diagram showing live migration of a virtual machine from one container to another in KubeVirt

A diagram showing live migration of a virtual machine from one container to another in KubeVirt

There is also an alternative project - Virtink, which implements lightweight virtualization using Cloud-Hypervisor and is initially focused on running virtual Kubernetes clusters using the Cluster API.

Considering our goals, we decided to use KubeVirt as the most popular project in this area. Besides we have extensive expertise and already made a lot of contributions to KubeVirt.

KubeVirt is easy to install and allows you to run virtual machines out-of-the-box using containerDisk feature - this allows you to store and distribute VM images directly as OCI images from container image registry. Virtual machines with containerDisk are well suited for creating Kubernetes worker nodes and other VMs that do not require state persistence.

For managing persistent data, KubeVirt offers a separate tool, Containerized Data Importer (CDI). It allows for cloning PVCs and populating them with data from base images. The CDI is necessary if you want to automatically provision persistent volumes for your virtual machines, and it is also required for the KubeVirt CSI Driver, which is used to handle persistent volumes claims from tenant Kubernetes clusters.

But at first, you have to decide where and how you will store these data.

Storage for Kubernetes VMs

With the introduction of the CSI (Container Storage Interface), a wide range of technologies that integrate with Kubernetes has become available. In fact, KubeVirt fully utilizes the CSI interface, aligning the choice of storage for virtualization closely with the choice of storage for Kubernetes itself. However, there are nuances, which you need to consider. Unlike containers, which typically use a standard filesystem, block devices are more efficient for virtual machine.

Although the CSI interface in Kubernetes allows the request of both types of volumes: filesystems and block devices, it's important to verify that your storage backend supports this.

Using block devices for virtual machines eliminates the need for an additional abstraction layer, such as a filesystem, that makes it more performant and in most cases enables the use of the ReadWriteMany mode. This mode allows concurrent access to the volume from multiple nodes, which is a critical feature for enabling the live migration of virtual machines in KubeVirt.

The storage system can be external or internal (in the case of hyper-converged infrastructure). Using external storage in many cases makes the whole system more stable, as your data is stored separately from compute nodes.

A diagram showing external data storage communication with the compute nodes

A diagram showing external data storage communication with the compute nodes

External storage solutions are often popular in enterprise systems because such storage is frequently provided by an external vendor, that takes care of its operations. The integration with Kubernetes involves only a small component installed in the cluster - the CSI driver. This driver is responsible for provisioning volumes in this storage and attaching them to pods run by Kubernetes. However, such storage solutions can also be implemented using purely open-source technologies. One of the popular solutions is TrueNAS powered by democratic-csi driver.

A diagram showing local data storage running on the compute nodes

A diagram showing local data storage running on the compute nodes

On the other hand, hyper-converged systems are often implemented using local storage (when you do not need replication) and with software-defined storages, often installed directly in Kubernetes, such as Rook/Ceph, OpenEBS, Longhorn, LINSTOR, and others.

A diagram showing clustered data storage running on the compute nodes

A diagram showing clustered data storage running on the compute nodes

A hyper-converged system has its advantages. For example, data locality: when your data is stored locally, access to such data is faster. But there are disadvantages as such a system is usually more difficult to manage and maintain.

At Ænix, we wanted to provide a ready-to-use solution that could be used without the need to purchase and setup an additional external storage, and that was optimal in terms of speed and resource utilization. LINSTOR became that solution. The time-tested and industry-popular technologies such as LVM and ZFS as backend gives confidence that data is securely stored. DRBD-based replication is incredible fast and consumes a small amount of computing resources.

For installing LINSTOR in Kubernetes, there is the Piraeus project, which already provides a ready-made block storage to use with KubeVirt.

Note:

In case you are using Talos Linux, as we described in the previous article, you will need to enable the necessary kernel modules in advance, and configure piraeus as described in the instruction.

Networking for Kubernetes VMs

Despite having the similar interface - CNI, The network architecture in Kubernetes is actually more complex and typically consists of many independent components that are not directly connected to each other. In fact, you can split Kubernetes networking into four layers, which are described below.

Node Network (Data Center Network)

The network through which nodes are interconnected with each other. This network is usually not managed by Kubernetes, but it is an important one because, without it, nothing would work. In practice, the bare metal infrastructure usually has more than one of such networks e.g. one for node-to-node communication, second for storage replication, third for external access, etc.

A diagram showing the role of the node network (data center network) on the Kubernetes networking scheme

A diagram showing the role of the node network (data center network) on the Kubernetes networking scheme

Configuring the physical network interaction between nodes goes beyond the scope of this article, as in most situations, Kubernetes utilizes already existing network infrastructure.

Pod Network

This is the network provided by your CNI plugin. The task of the CNI plugin is to ensure transparent connectivity between all containers and nodes in the cluster. Most CNI plugins implement a flat network from which separate blocks of IP addresses are allocated for use on each node.

A diagram showing the role of the pod network (CNI-plugin) on the Kubernetes network scheme

A diagram showing the role of the pod network (CNI-plugin) on the Kubernetes network scheme

In practice, your cluster can have several CNI plugins managed by Multus. This approach is often used in virtualization solutions based on KubeVirt - Rancher and OpenShift. The primary CNI plugin is used for integration with Kubernetes services, while additional CNI plugins are used to implement private networks (VPC) and integration with the physical networks of your data center.

The default CNI-plugins can be used to connect bridges or physical interfaces. Additionally, there are specialized plugins such as macvtap-cni which are designed to provide more performance.

One additional aspect to keep in mind when running virtual machines in Kubernetes is the need for IPAM (IP Address Management), especially for secondary interfaces provided by Multus. This is commonly managed by a DHCP server operating within your infrastructure. Additionally, the allocation of MAC addresses for virtual machines can be managed by Kubemacpool.

Although in our platform, we decided to go another way and fully rely on Kube-OVN. This CNI plugin is based on OVN (Open Virtual Network) which was originally developed for OpenStack and it provides a complete network solution for virtual machines in Kubernetes, features Custom Resources for managing IPs and MAC addresses, supports live migration with preserving IP addresses between the nodes, and enables the creation of VPCs for physical network separation between tenants.

In Kube-OVN you can assign separate subnets to an entire namespace or connect them as additional network interfaces using Multus.

Services Network

In addition to the CNI plugin, Kubernetes also has a services network, which is primarily needed for service discovery. Contrary to traditional virtual machines, Kubernetes is originally designed to run pods with a random address. And the services network provides a convenient abstraction (stable IP addresses and DNS names) that will always direct traffic to the correct pod. The same approach is also commonly used with virtual machines in clouds despite the fact that their IPs are usually static.

A diagram showing the role of the services network (services network plugin) on the Kubernetes network scheme

A diagram showing the role of the services network (services network plugin) on the Kubernetes network scheme

The implementation of the services network in Kubernetes is handled by the services network plugin, The standard implementation is called kube-proxy and is used in most clusters. But nowadays, this functionality might be provided as part of the CNI plugin. The most advanced implementation is offered by the Cilium project, which can be run in kube-proxy replacement mode.

Cilium is based on the eBPF technology, which allows for efficient offloading of the Linux networking stack, thereby improving performance and security compared to traditional methods based on iptables.

In practice, Cilium and Kube-OVN can be easily integrated to provide a unified solution that offers seamless, multi-tenant networking for virtual machines, as well as advanced network policies and combined services network functionality.

External Traffic Load Balancer

At this stage, you already have everything needed to run virtual machines in Kubernetes. But there is actually one more thing. You still need to access your services from outside your cluster, and an external load balancer will help you with organizing this.

For bare metal Kubernetes clusters, there are several load balancers available: MetalLB, kube-vip, LoxiLB, also Cilium and Kube-OVN provides built-in implementation.

The role of a external load balancer is to provide a stable address available externally and direct external traffic to the services network. The services network plugin will direct it to your pods and virtual machines as usual.

The role of the external load balancer on the Kubernetes network scheme

A diagram showing the role of the external load balancer on the Kubernetes network scheme

In most cases, setting up a load balancer on bare metal is achieved by creating floating IP address on the nodes within the cluster, and announce it externally using ARP/NDP or BGP protocols.

After exploring various options, we decided that MetalLB is the simplest and most reliable solution, although we do not strictly enforce the use of only it.

Another benefit is that in L2 mode, MetalLB speakers continuously check their neighbour's state by sending preforming liveness checks using a memberlist protocol. This enables failover that works independently of Kubernetes control-plane.

Conclusion

This concludes our overview of virtualization, storage, and networking in Kubernetes. The technologies mentioned here are available and already pre-configured on the Cozystack platform, where you can try them with no limitations.

In the next article, I'll detail how, on top of this, you can implement the provisioning of fully functional Kubernetes clusters with just the click of a button.

DIY: Create Your Own Cloud with Kubernetes (Part 1)

At Ænix, we have a deep affection for Kubernetes and dream that all modern technologies will soon start utilizing its remarkable patterns.

Have you ever thought about building your own cloud? I bet you have. But is it possible to do this using only modern technologies and approaches, without leaving the cozy Kubernetes ecosystem? Our experience in developing Cozystack required us to delve deeply into it.

You might argue that Kubernetes is not intended for this purpose and why not simply use OpenStack for bare metal servers and run Kubernetes inside it as intended. But by doing so, you would simply shift the responsibility from your hands to the hands of OpenStack administrators. This would add at least one more huge and complex system to your ecosystem.

Why complicate things? - after all, Kubernetes already has everything needed to run tenant Kubernetes clusters at this point.

I want to share with you our experience in developing a cloud platform based on Kubernetes, highlighting the open-source projects that we use ourselves and believe deserve your attention.

In this series of articles, I will tell you our story about how we prepare managed Kubernetes from bare metal using only open-source technologies. Starting from the basic level of data center preparation, running virtual machines, isolating networks, setting up fault-tolerant storage to provisioning full-featured Kubernetes clusters with dynamic volume provisioning, load balancers, and autoscaling.

With this article, I start a series consisting of several parts:

  • Part 1: Preparing the groundwork for your cloud. Challenges faced during the preparation and operation of Kubernetes on bare metal and a ready-made recipe for provisioning infrastructure.
  • Part 2: Networking, storage, and virtualization. How to turn Kubernetes into a tool for launching virtual machines and what is needed for this.
  • Part 3: Cluster API and how to start provisioning Kubernetes clusters at the push of a button. How autoscaling works, dynamic provisioning of volumes, and load balancers.

I will try to describe various technologies as independently as possible, but at the same time, I will share our experience and why we came to one solution or another.

To begin with, let's understand the main advantage of Kubernetes and how it has changed the approach to using cloud resources.

It is important to understand that the use of Kubernetes in the cloud and on bare metal differs.

Kubernetes in the cloud

When you operate Kubernetes in the cloud, you don't worry about persistent volumes, cloud load balancers, or the process of provisioning nodes. All of this is handled by your cloud provider, who accepts your requests in the form of Kubernetes objects. In other words, the server side is completely hidden from you, and you don't really want to know how exactly the cloud provider implements as it's not in your area of responsibility.

A diagram showing cloud Kubernetes, with load balancing and storage done outside the cluster

A diagram showing cloud Kubernetes, with load balancing and storage done outside the cluster

Kubernetes offers convenient abstractions that work the same everywhere, allowing you to deploy your application on any Kubernetes in any cloud.

In the cloud, you very commonly have several separate entities: the Kubernetes control plane, virtual machines, persistent volumes, and load balancers as distinct entities. Using these entities, you can create highly dynamic environments.

Thanks to Kubernetes, virtual machines are now only seen as a utility entity for utilizing cloud resources. You no longer store data inside virtual machines. You can delete all your virtual machines at any moment and recreate them without breaking your application. The Kubernetes control plane will continue to hold information about what should run in your cluster. The load balancer will keep sending traffic to your workload, simply changing the endpoint to send traffic to a new node. And your data will be safely stored in external persistent volumes provided by cloud.

This approach is fundamental when using Kubernetes in clouds. The reason for it is quite obvious: the simpler the system, the more stable it is, and for this simplicity you go buying Kubernetes in the cloud.

Kubernetes on bare metal

Using Kubernetes in the clouds is really simple and convenient, which cannot be said about bare metal installations. In the bare metal world, Kubernetes, on the contrary, becomes unbearably complex. Firstly, because the entire network, backend storage, cloud balancers, etc. are usually run not outside, but inside your cluster. As result such a system is much more difficult to update and maintain.

A diagram showing bare metal Kubernetes, with load balancing and storage done inside the cluster

A diagram showing bare metal Kubernetes, with load balancing and storage done inside the cluster

Judge for yourself: in the cloud, to update a node, you typically delete the virtual machine (or even use kubectl delete node) and you let your node management tooling create a new one, based on an immutable image. The new node will join the cluster and ”just work” as a node; following a very simple and commonly used pattern in the Kubernetes world. Many clusters order new virtual machines every few minutes, simply because they can use cheaper spot instances. However, when you have a physical server, you can't just delete and recreate it, firstly because it often runs some cluster services, stores data, and its update process is significantly more complicated.

There are different approaches to solving this problem, ranging from in-place updates, as done by kubeadm, kubespray, and k3s, to full automation of provisioning physical nodes through Cluster API and Metal3.

I like the hybrid approach offered by Talos Linux, where your entire system is described in a single configuration file. Most parameters of this file can be applied without rebooting or recreating the node, including the version of Kubernetes control-plane components. However, it still keeps the maximum declarative nature of Kubernetes. This approach minimizes unnecessary impact on cluster services when updating bare metal nodes. In most cases, you won't need to migrate your virtual machines and rebuild the cluster filesystem on minor updates.

Preparing a base for your future cloud

So, suppose you've decided to build your own cloud. To start somewhere, you need a base layer. You need to think not only about how you will install Kubernetes on your servers but also about how you will update and maintain it. Consider the fact that you will have to think about things like updating the kernel, installing necessary modules, as well packages and security patches. Now you have to think much more that you don't have to worry about when using a ready-made Kubernetes in the cloud.

Of course you can use standard distributions like Ubuntu or Debian, or you can consider specialized ones like Flatcar Container Linux, Fedora Core, and Talos Linux. Each has its advantages and disadvantages.

What about us? At Ænix, we use quite a few specific kernel modules like ZFS, DRBD, and OpenvSwitch, so we decided to go the route of forming a system image with all the necessary modules in advance. In this case, Talos Linux turned out to be the most convenient for us. For example, such a config is enough to build a system image with all the necessary kernel modules:

arch: amd64
platform: metal
secureboot: false
version: v1.6.4
input:
  kernel:
    path: /usr/install/amd64/vmlinuz
  initramfs:
    path: /usr/install/amd64/initramfs.xz
  baseInstaller:
    imageRef: ghcr.io/siderolabs/installer:v1.6.4
  systemExtensions:
    - imageRef: ghcr.io/siderolabs/amd-ucode:20240115
    - imageRef: ghcr.io/siderolabs/amdgpu-firmware:20240115
    - imageRef: ghcr.io/siderolabs/bnx2-bnx2x:20240115
    - imageRef: ghcr.io/siderolabs/i915-ucode:20240115
    - imageRef: ghcr.io/siderolabs/intel-ice-firmware:20240115
    - imageRef: ghcr.io/siderolabs/intel-ucode:20231114
    - imageRef: ghcr.io/siderolabs/qlogic-firmware:20240115
    - imageRef: ghcr.io/siderolabs/drbd:9.2.6-v1.6.4
    - imageRef: ghcr.io/siderolabs/zfs:2.1.14-v1.6.4
output:
  kind: installer
  outFormat: raw

Then we use the docker command line tool to build an OS image:

cat config.yaml | docker run --rm -i -v /dev:/dev --privileged "ghcr.io/siderolabs/imager:v1.6.4" - 

And as a result, we get a Docker container image with everything we need, which we can use to install Talos Linux on our servers. You can do the same; this image will contain all the necessary firmware and kernel modules.

But the question arises, how do you deliver the freshly formed image to your nodes?

I have been contemplating the idea of PXE booting for quite some time. For example, the Kubefarm project that I wrote an article about two years ago was entirely built using this approach. But unfortunately, it does help you to deploy your very first parent cluster that will hold the others. So now you have prepared a solution that will help you do this the same using PXE approach.

Essentially, all you need to do is run temporary DHCP and PXE servers inside containers. Then your nodes will boot from your image, and you can use a simple Debian-flavored script to help you bootstrap your nodes.

asciicast

The source for that talos-bootstrap script is available on GitHub.

This script allows you to deploy Kubernetes on bare metal in five minutes and obtain a kubeconfig for accessing it. However, many unresolved issues still lie ahead.

Delivering system components

At this stage, you already have a Kubernetes cluster capable of running various workloads. However, it is not fully functional yet. In other words, you need to set up networking and storage, as well as install necessary cluster extensions, like KubeVirt to run virtual machines, as well the monitoring stack and other system-wide components.

Traditionally, this is solved by installing Helm charts into your cluster. You can do this by running helm install commands locally, but this approach becomes inconvenient when you want to track updates, and if you have multiple clusters and you want to keep them uniform. In fact, there are plenty of ways to do this declaratively. To solve this, I recommend using best GitOps practices. I mean tools like ArgoCD and FluxCD.

While ArgoCD is more convenient for dev purposes with its graphical interface and a central control plane, FluxCD, on the other hand, is better suited for creating Kubernetes distributions. With FluxCD, you can specify which charts with what parameters should be launched and describe dependencies. Then, FluxCD will take care of everything for you.

It is suggested to perform a one-time installation of FluxCD in your newly created cluster and provide it with the configuration. This will install everything necessary, bringing the cluster to the expected state.

By carrying out a single installation of FluxCD in your newly minted cluster and configuring it accordingly, you enable it to automatically deploy all the essentials. This will allow your cluster to upgrade itself into the desired state. For example, after installing our platform you'll see the next pre-configured Helm charts with system components:

NAMESPACE                        NAME                        AGE    READY   STATUS
cozy-cert-manager                cert-manager                4m1s   True    Release reconciliation succeeded
cozy-cert-manager                cert-manager-issuers        4m1s   True    Release reconciliation succeeded
cozy-cilium                      cilium                      4m1s   True    Release reconciliation succeeded
cozy-cluster-api                 capi-operator               4m1s   True    Release reconciliation succeeded
cozy-cluster-api                 capi-providers              4m1s   True    Release reconciliation succeeded
cozy-dashboard                   dashboard                   4m1s   True    Release reconciliation succeeded
cozy-fluxcd                      cozy-fluxcd                 4m1s   True    Release reconciliation succeeded
cozy-grafana-operator            grafana-operator            4m1s   True    Release reconciliation succeeded
cozy-kamaji                      kamaji                      4m1s   True    Release reconciliation succeeded
cozy-kubeovn                     kubeovn                     4m1s   True    Release reconciliation succeeded
cozy-kubevirt-cdi                kubevirt-cdi                4m1s   True    Release reconciliation succeeded
cozy-kubevirt-cdi                kubevirt-cdi-operator       4m1s   True    Release reconciliation succeeded
cozy-kubevirt                    kubevirt                    4m1s   True    Release reconciliation succeeded
cozy-kubevirt                    kubevirt-operator           4m1s   True    Release reconciliation succeeded
cozy-linstor                     linstor                     4m1s   True    Release reconciliation succeeded
cozy-linstor                     piraeus-operator            4m1s   True    Release reconciliation succeeded
cozy-mariadb-operator            mariadb-operator            4m1s   True    Release reconciliation succeeded
cozy-metallb                     metallb                     4m1s   True    Release reconciliation succeeded
cozy-monitoring                  monitoring                  4m1s   True    Release reconciliation succeeded
cozy-postgres-operator           postgres-operator           4m1s   True    Release reconciliation succeeded
cozy-rabbitmq-operator           rabbitmq-operator           4m1s   True    Release reconciliation succeeded
cozy-redis-operator              redis-operator              4m1s   True    Release reconciliation succeeded
cozy-telepresence                telepresence                4m1s   True    Release reconciliation succeeded
cozy-victoria-metrics-operator   victoria-metrics-operator   4m1s   True    Release reconciliation succeeded

Conclusion

As a result, you achieve a highly repeatable environment that you can provide to anyone, knowing that it operates exactly as intended. This is actually what the Cozystack project does, which you can try out for yourself absolutely free.

In the following articles, I will discuss how to prepare Kubernetes for running virtual machines and how to run Kubernetes clusters with the click of a button. Stay tuned, it'll be fun!

Introducing the Windows Operational Readiness Specification

Since Windows support graduated to stable with Kubernetes 1.14 in 2019, the capability to run Windows workloads has been much appreciated by the end user community. The level of and availability of Windows workload support has consistently been a major differentiator for Kubernetes distributions used by large enterprises. However, with more Windows workloads being migrated to Kubernetes and new Windows features being continuously released, it became challenging to test Windows worker nodes in an effective and standardized way.

The Kubernetes project values the ability to certify conformance without requiring a closed-source license for a certified distribution or service that has no intention of offering Windows.

Some notable examples brought to the attention of SIG Windows were:

  • An issue with load balancer source address ranges functionality not operating correctly on Windows nodes, detailed in a GitHub issue: kubernetes/kubernetes#120033.
  • Reports of functionality issues with Windows features, such as “GMSA not working with containerd, discussed in microsoft/Windows-Containers#44.
  • Challenges developing networking policy tests that could objectively evaluate Container Network Interface (CNI) plugins across different operating system configurations, as discussed in kubernetes/kubernetes#97751.

SIG Windows therefore recognized the need for a tailored solution to ensure Windows nodes' operational readiness before their deployment into production environments. Thus, the idea to develop a Windows Operational Readiness Specification was born.

Can’t we just run the official Conformance tests?

The Kubernetes project contains a set of conformance tests, which are standardized tests designed to ensure that a Kubernetes cluster meets the required Kubernetes specifications.

However, these tests were originally defined at a time when Linux was the only operating system compatible with Kubernetes, and thus, they were not easily extendable for use with Windows. Given that Windows workloads, despite their importance, account for a smaller portion of the Kubernetes community, it was important to ensure that the primary conformance suite relied upon by many Kubernetes distributions to certify Linux conformance, didn't become encumbered with Windows specific features or enhancements such as GMSA or multi-operating system kube-proxy behavior.

Therefore, since there was a specialized need for Windows conformance testing, SIG Windows went down the path of offering Windows specific conformance tests through the Windows Operational Readiness Specification.

Can’t we just run the Kubernetes end-to-end test suite?

In the Linux world, tools such as Sonobuoy simplify execution of the conformance suite, relieving users from needing to be aware of Kubernetes' compilation paths or the semantics of Ginkgo tags.

Regarding needing to compile the Kubernetes tests, we realized that Windows users might similarly find the process of compiling and running the Kubernetes e2e suite from scratch similarly undesirable, hence, there was a clear need to provide a user-friendly, "push-button" solution that is ready to go. Moreover, regarding Ginkgo tags, applying conformance tests to Windows nodes through a set of Ginkgo tags would also be burdensome for any user, including Linux enthusiasts or experienced Windows system admins alike.

To bridge the gap and give users a straightforward way to confirm their clusters support a variety of features, the Kubernetes SIG for Windows found it necessary to therefore create the Windows Operational Readiness application. This application written in Go, simplifies the process to run the necessary Windows specific tests while delivering results in a clear, accessible format.

This initiative has been a collaborative effort, with contributions from different cloud providers and platforms, including Amazon, Microsoft, SUSE, and Broadcom.

A closer look at the Windows Operational Readiness Specification

The Windows Operational Readiness specification specifically targets and executes tests found within the Kubernetes repository in a more user-friendly way than simply targeting Ginkgo tags. It introduces a structured test suite that is split into sets of core and extended tests, with each set of tests containing categories directed at testing a specific area of testing, such as networking. Core tests target fundamental and critical functionalities that Windows nodes should support as defined by the Kubernetes specification. On the other hand, extended tests cover more complex features, more aligned with diving deeper into Windows-specific capabilities such as integrations with Active Directory. These goal of these tests is to be extensive, covering a wide array of Windows-specific capabilities to ensure compatibility with a diverse set of workloads and configurations, extending beyond basic requirements. Below is the current list of categories.

Category NameCategory Description
Core.NetworkTests minimal networking functionality (ability to access pod-by-pod IP.)
Core.StorageTests minimal storage functionality, (ability to mount a hostPath storage volume.)
Core.SchedulingTests minimal scheduling functionality, (ability to schedule a pod with CPU limits.)
Core.ConcurrentTests minimal concurrent functionality, (the ability of a node to handle traffic to multiple pods concurrently.)
Extend.HostProcessTests features related to Windows HostProcess pod functionality.
Extend.ActiveDirectoryTests features related to Active Directory functionality.
Extend.NetworkPolicyTests features related to Network Policy functionality.
Extend.NetworkTests advanced networking functionality, (ability to support IPv6)
Extend.WorkerTests features related to Windows worker node functionality, (ability for nodes to access TCP and UDP services in the same cluster)

How to conduct operational readiness tests for Windows nodes

To run the Windows Operational Readiness test suite, refer to the test suite's README, which explains how to set it up and run it. The test suite offers flexibility in how you can execute tests, either using a compiled binary or a Sonobuoy plugin. You also have the choice to run the tests against the entire test suite or by specifying a list of categories. Cloud providers have the choice of uploading their conformance results, enhancing transparency and reliability.

Once you have checked out that code, you can run a test. For example, this sample command runs the tests from the Core.Concurrent category:

./op-readiness --kubeconfig $KUBE_CONFIG --category Core.Concurrent

As a contributor to Kubernetes, if you want to test your changes against a specific pull request using the Windows Operational Readiness Specification, use the following bot command in the new pull request.

/test operational-tests-capz-windows-2019

Looking ahead

We’re looking to improve our curated list of Windows-specific tests by adding new tests to the Kubernetes repository and also identifying existing test cases that can be targetted. The long term goal for the specification is to continually enhance test coverage for Windows worker nodes and improve the robustness of Windows support, facilitating a seamless experience across diverse cloud environments. We also have plans to integrate the Windows Operational Readiness tests into the official Kubernetes conformance suite.

If you are interested in helping us out, please reach out to us! We welcome help in any form, from giving once-off feedback to making a code contribution, to having long-term owners to help us drive changes. The Windows Operational Readiness specification is owned by the SIG Windows team. You can reach out to the team on the Kubernetes Slack workspace #sig-windows channel. You can also explore the Windows Operational Readiness test suite and make contributions directly to the GitHub repository.

Special thanks to Kulwant Singh (AWS), Pramita Gautam Rana (VMWare), Xinqi Li (Google) and Marcio Morales (AWS) for their help in making notable contributions to the specification. Additionally, appreciation goes to James Sturtevant (Microsoft), Mark Rossetti (Microsoft), Claudiu Belu (Cloudbase Solutions) and Aravindh Puthiyaparambil (Softdrive Technologies Group Inc.) from the SIG Windows team for their guidance and support.

A Peek at Kubernetes v1.30

A quick look: exciting changes in Kubernetes v1.30

It's a new year and a new Kubernetes release. We're halfway through the release cycle and have quite a few interesting and exciting enhancements coming in v1.30. From brand new features in alpha, to established features graduating to stable, to long-awaited improvements, this release has something for everyone to pay attention to!

To tide you over until the official release, here's a sneak peek of the enhancements we're most excited about in this cycle!

Major changes for Kubernetes v1.30

Structured parameters for dynamic resource allocation (KEP-4381)

Dynamic resource allocation was added to Kubernetes as an alpha feature in v1.26. It defines an alternative to the traditional device-plugin API for requesting access to third-party resources. By design, dynamic resource allocation uses parameters for resources that are completely opaque to core Kubernetes. This approach poses a problem for the Cluster Autoscaler (CA) or any higher-level controller that needs to make decisions for a group of pods (e.g. a job scheduler). It cannot simulate the effect of allocating or deallocating claims over time. Only the third-party DRA drivers have the information available to do this.

​​Structured Parameters for dynamic resource allocation is an extension to the original implementation that addresses this problem by building a framework to support making these claim parameters less opaque. Instead of handling the semantics of all claim parameters themselves, drivers could manage resources and describe them using a specific "structured model" pre-defined by Kubernetes. This would allow components aware of this "structured model" to make decisions about these resources without outsourcing them to some third-party controller. For example, the scheduler could allocate claims rapidly without back-and-forth communication with dynamic resource allocation drivers. Work done for this release centers on defining the framework necessary to enable different "structured models" and to implement the "named resources" model. This model allows listing individual resource instances and, compared to the traditional device plugin API, adds the ability to select those instances individually via attributes.

Node memory swap support (KEP-2400)

In Kubernetes v1.30, memory swap support on Linux nodes gets a big change to how it works - with a strong emphasis on improving system stability. In previous Kubernetes versions, the NodeSwap feature gate was disabled by default, and when enabled, it used UnlimitedSwap behavior as the default behavior. To achieve better stability, UnlimitedSwap behavior (which might compromise node stability) will be removed in v1.30.

The updated, still-beta support for swap on Linux nodes will be available by default. However, the default behavior will be to run the node set to NoSwap (not UnlimitedSwap) mode. In NoSwap mode, the kubelet supports running on a node where swap space is active, but Pods don't use any of the page file. You'll still need to set --fail-swap-on=false for the kubelet to run on that node. However, the big change is the other mode: LimitedSwap. In this mode, the kubelet actually uses the page file on that node and allows Pods to have some of their virtual memory paged out. Containers (and their parent pods) do not have access to swap beyond their memory limit, but the system can still use the swap space if available.

Kubernetes' Node special interest group (SIG Node) will also update the documentation to help you understand how to use the revised implementation, based on feedback from end users, contributors, and the wider Kubernetes community.

Read the previous blog post or the node swap documentation for more details on Linux node swap support in Kubernetes.

Support user namespaces in pods (KEP-127)

User namespaces is a Linux-only feature that better isolates pods to prevent or mitigate several CVEs rated high/critical, including CVE-2024-21626, published in January 2024. In Kubernetes 1.30, support for user namespaces is migrating to beta and now supports pods with and without volumes, custom UID/GID ranges, and more!

Structured authorization configuration (KEP-3221)

Support for structured authorization configuration is moving to beta and will be enabled by default. This feature enables the creation of authorization chains with multiple webhooks with well-defined parameters that validate requests in a particular order and allows fine-grained control – such as explicit Deny on failures. The configuration file approach even allows you to specify CEL rules to pre-filter requests before they are dispatched to webhooks, helping you to prevent unnecessary invocations. The API server also automatically reloads the authorizer chain when the configuration file is modified.

You must specify the path to that authorization configuration using the --authorization-config command line argument. If you want to keep using command line flags instead of a configuration file, those will continue to work as-is. To gain access to new authorization webhook capabilities like multiple webhooks, failure policy, and pre-filter rules, switch to putting options in an --authorization-config file. From Kubernetes 1.30, the configuration file format is beta-level, and only requires specifying --authorization-config since the feature gate is enabled by default. An example configuration with all possible values is provided in the Authorization docs. For more details, read the Authorization docs.

Container resource based pod autoscaling (KEP-1610)

Horizontal pod autoscaling based on ContainerResource metrics will graduate to stable in v1.30. This new behavior for HorizontalPodAutoscaler allows you to configure automatic scaling based on the resource usage for individual containers, rather than the aggregate resource use over a Pod. See our previous article for further details, or read container resource metrics.

CEL for admission control (KEP-3488)

Integrating Common Expression Language (CEL) for admission control in Kubernetes introduces a more dynamic and expressive way of evaluating admission requests. This feature allows complex, fine-grained policies to be defined and enforced directly through the Kubernetes API, enhancing security and governance capabilities without compromising performance or flexibility.

CEL's addition to Kubernetes admission control empowers cluster administrators to craft intricate rules that can evaluate the content of API requests against the desired state and policies of the cluster without resorting to Webhook-based access controllers. This level of control is crucial for maintaining the integrity, security, and efficiency of cluster operations, making Kubernetes environments more robust and adaptable to various use cases and requirements. For more information on using CEL for admission control, see the API documentation for ValidatingAdmissionPolicy.

We hope you're as excited for this release as we are. Keep an eye out for the official release blog in a few weeks for more highlights!

CRI-O: Applying seccomp profiles from OCI registries

Seccomp stands for secure computing mode and has been a feature of the Linux kernel since version 2.6.12. It can be used to sandbox the privileges of a process, restricting the calls it is able to make from userspace into the kernel. Kubernetes lets you automatically apply seccomp profiles loaded onto a node to your Pods and containers.

But distributing those seccomp profiles is a major challenge in Kubernetes, because the JSON files have to be available on all nodes where a workload can possibly run. Projects like the Security Profiles Operator solve that problem by running as a daemon within the cluster, which makes me wonder which part of that distribution could be done by the container runtime.

Runtimes usually apply the profiles from a local path, for example:

apiVersion: v1
kind: Pod
metadata:
  name: pod
spec:
  containers:
    - name: container
      image: nginx:1.25.3
      securityContext:
        seccompProfile:
          type: Localhost
          localhostProfile: nginx-1.25.3.json

The profile nginx-1.25.3.json has to be available in the root directory of the kubelet, appended by the seccomp directory. This means the default location for the profile on-disk would be /var/lib/kubelet/seccomp/nginx-1.25.3.json. If the profile is not available, then runtimes will fail on container creation like this:

kubectl get pods
NAME   READY   STATUS                 RESTARTS   AGE
pod    0/1     CreateContainerError   0          38s
kubectl describe pod/pod | tail
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  117s                default-scheduler  Successfully assigned default/pod to 127.0.0.1
  Normal   Pulling    117s                kubelet            Pulling image "nginx:1.25.3"
  Normal   Pulled     111s                kubelet            Successfully pulled image "nginx:1.25.3" in 5.948s (5.948s including waiting)
  Warning  Failed     7s (x10 over 111s)  kubelet            Error: setup seccomp: unable to load local profile "/var/lib/kubelet/seccomp/nginx-1.25.3.json": open /var/lib/kubelet/seccomp/nginx-1.25.3.json: no such file or directory
  Normal   Pulled     7s (x9 over 111s)   kubelet            Container image "nginx:1.25.3" already present on machine

The major obstacle of having to manually distribute the Localhost profiles will lead many end-users to fall back to RuntimeDefault or even running their workloads as Unconfined (with disabled seccomp).

CRI-O to the rescue

The Kubernetes container runtime CRI-O provides various features using custom annotations. The v1.30 release adds support for a new set of annotations called seccomp-profile.kubernetes.cri-o.io/POD and seccomp-profile.kubernetes.cri-o.io/<CONTAINER>. Those annotations allow you to specify:

  • a seccomp profile for a specific container, when used as: seccomp-profile.kubernetes.cri-o.io/<CONTAINER> (example: seccomp-profile.kubernetes.cri-o.io/webserver: 'registry.example/example/webserver:v1')
  • a seccomp profile for every container within a pod, when used without the container name suffix but the reserved name POD: seccomp-profile.kubernetes.cri-o.io/POD
  • a seccomp profile for a whole container image, if the image itself contains the annotation seccomp-profile.kubernetes.cri-o.io/POD or seccomp-profile.kubernetes.cri-o.io/<CONTAINER>.

CRI-O will only respect the annotation if the runtime is configured to allow it, as well as for workloads running as Unconfined. All other workloads will still use the value from the securityContext with a higher priority.

The annotations alone will not help much with the distribution of the profiles, but the way they can be referenced will! For example, you can now specify seccomp profiles like regular container images by using OCI artifacts:

apiVersion: v1
kind: Pod
metadata:
  name: pod
  annotations:
    seccomp-profile.kubernetes.cri-o.io/POD: quay.io/crio/seccomp:v2
spec: 

The image quay.io/crio/seccomp:v2 contains a seccomp.json file, which contains the actual profile content. Tools like ORAS or Skopeo can be used to inspect the contents of the image:

oras pull quay.io/crio/seccomp:v2
Downloading 92d8ebfa89aa seccomp.json
Downloaded  92d8ebfa89aa seccomp.json
Pulled [registry] quay.io/crio/seccomp:v2
Digest: sha256:f0205dac8a24394d9ddf4e48c7ac201ca7dcfea4c554f7ca27777a7f8c43ec1b
jq . seccomp.json | head
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 38,
  "defaultErrno": "ENOSYS",
  "archMap": [
    {
      "architecture": "SCMP_ARCH_X86_64",
      "subArchitectures": [
        "SCMP_ARCH_X86",
        "SCMP_ARCH_X32"
# Inspect the plain manifest of the image
skopeo inspect --raw docker://quay.io/crio/seccomp:v2 | jq .
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config":
    {
      "mediaType": "application/vnd.cncf.seccomp-profile.config.v1+json",
      "digest": "sha256:ca3d163bab055381827226140568f3bef7eaac187cebd76878e0b63e9e442356",
      "size": 3,
    },
  "layers":
    [
      {
        "mediaType": "application/vnd.oci.image.layer.v1.tar",
        "digest": "sha256:92d8ebfa89aa6dd752c6443c27e412df1b568d62b4af129494d7364802b2d476",
        "size": 18853,
        "annotations": { "org.opencontainers.image.title": "seccomp.json" },
      },
    ],
  "annotations": { "org.opencontainers.image.created": "2024-02-26T09:03:30Z" },
}

The image manifest contains a reference to a specific required config media type (application/vnd.cncf.seccomp-profile.config.v1+json) and a single layer (application/vnd.oci.image.layer.v1.tar) pointing to the seccomp.json file. But now, let's give that new feature a try!

Using the annotation for a specific container or whole pod

CRI-O needs to be configured adequately before it can utilize the annotation. To do this, add the annotation to the allowed_annotations array for the runtime. This can be done by using a drop-in configuration /etc/crio/crio.conf.d/10-crun.conf like this:

[crio.runtime]
default_runtime = "crun"

[crio.runtime.runtimes.crun]
allowed_annotations = [
    "seccomp-profile.kubernetes.cri-o.io",
]

Now, let's run CRI-O from the latest main commit. This can be done by either building it from source, using the static binary bundles or the prerelease packages.

To demonstrate this, I ran the crio binary from my command line using a single node Kubernetes cluster via local-up-cluster.sh. Now that the cluster is up and running, let's try a pod without the annotation running as seccomp Unconfined:

cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod
spec:
  containers:
    - name: container
      image: nginx:1.25.3
      securityContext:
        seccompProfile:
          type: Unconfined
kubectl apply -f pod.yaml

The workload is up and running:

kubectl get pods
NAME   READY   STATUS    RESTARTS   AGE
pod    1/1     Running   0          15s

And no seccomp profile got applied if I inspect the container using crictl:

export CONTAINER_ID=$(sudo crictl ps --name container -q)
sudo crictl inspect $CONTAINER_ID | jq .info.runtimeSpec.linux.seccomp
null

Now, let's modify the pod to apply the profile quay.io/crio/seccomp:v2 to the container:

apiVersion: v1
kind: Pod
metadata:
  name: pod
  annotations:
    seccomp-profile.kubernetes.cri-o.io/container: quay.io/crio/seccomp:v2
spec:
  containers:
    - name: container
      image: nginx:1.25.3

I have to delete and recreate the Pod, because only recreation will apply a new seccomp profile:

kubectl delete pod/pod
pod "pod" deleted
kubectl apply -f pod.yaml
pod/pod created

The CRI-O logs will now indicate that the runtime pulled the artifact:

WARN[…] Allowed annotations are specified for workload [seccomp-profile.kubernetes.cri-o.io]
INFO[…] Found container specific seccomp profile annotation: seccomp-profile.kubernetes.cri-o.io/container=quay.io/crio/seccomp:v2  id=26ddcbe6-6efe-414a-88fd-b1ca91979e93 name=/runtime.v1.RuntimeService/CreateContainer
INFO[…] Pulling OCI artifact from ref: quay.io/crio/seccomp:v2  id=26ddcbe6-6efe-414a-88fd-b1ca91979e93 name=/runtime.v1.RuntimeService/CreateContainer
INFO[…] Retrieved OCI artifact seccomp profile of len: 18853  id=26ddcbe6-6efe-414a-88fd-b1ca91979e93 name=/runtime.v1.RuntimeService/CreateContainer

And the container is finally using the profile:

export CONTAINER_ID=$(sudo crictl ps --name container -q)
sudo crictl inspect $CONTAINER_ID | jq .info.runtimeSpec.linux.seccomp | head
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 38,
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    {

The same would work for every container in the pod, if users replace the /container suffix with the reserved name /POD, for example:

apiVersion: v1
kind: Pod
metadata:
  name: pod
  annotations:
    seccomp-profile.kubernetes.cri-o.io/POD: quay.io/crio/seccomp:v2
spec:
  containers:
    - name: container
      image: nginx:1.25.3

Using the annotation for a container image

While specifying seccomp profiles as OCI artifacts on certain workloads is a cool feature, the majority of end users would like to link seccomp profiles to published container images. This can be done by using a container image annotation; instead of being applied to a Kubernetes Pod, the annotation is some metadata applied at the container image itself. For example, Podman can be used to add the image annotation directly during image build:

podman build \
    --annotation seccomp-profile.kubernetes.cri-o.io=quay.io/crio/seccomp:v2 \
    -t quay.io/crio/nginx-seccomp:v2 .

The pushed image then contains the annotation:

skopeo inspect --raw docker://quay.io/crio/nginx-seccomp:v2 |
    jq '.annotations."seccomp-profile.kubernetes.cri-o.io"'
"quay.io/crio/seccomp:v2"

If I now use that image in an CRI-O test pod definition:

apiVersion: v1
kind: Pod
metadata:
  name: pod
  # no Pod annotations set
spec:
  containers:
    - name: container
      image: quay.io/crio/nginx-seccomp:v2

Then the CRI-O logs will indicate that the image annotation got evaluated and the profile got applied:

kubectl delete pod/pod
pod "pod" deleted
kubectl apply -f pod.yaml
pod/pod created
INFO[…] Found image specific seccomp profile annotation: seccomp-profile.kubernetes.cri-o.io=quay.io/crio/seccomp:v2  id=c1f22c59-e30e-4046-931d-a0c0fdc2c8b7 name=/runtime.v1.RuntimeService/CreateContainer
INFO[…] Pulling OCI artifact from ref: quay.io/crio/seccomp:v2  id=c1f22c59-e30e-4046-931d-a0c0fdc2c8b7 name=/runtime.v1.RuntimeService/CreateContainer
INFO[…] Retrieved OCI artifact seccomp profile of len: 18853  id=c1f22c59-e30e-4046-931d-a0c0fdc2c8b7 name=/runtime.v1.RuntimeService/CreateContainer
INFO[…] Created container 116a316cd9a11fe861dd04c43b94f45046d1ff37e2ed05a4e4194fcaab29ee63: default/pod/container  id=c1f22c59-e30e-4046-931d-a0c0fdc2c8b7 name=/runtime.v1.RuntimeService/CreateContainer
export CONTAINER_ID=$(sudo crictl ps --name container -q)
sudo crictl inspect $CONTAINER_ID | jq .info.runtimeSpec.linux.seccomp | head
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 38,
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    {

For container images, the annotation seccomp-profile.kubernetes.cri-o.io will be treated in the same way as seccomp-profile.kubernetes.cri-o.io/POD and applies to the whole pod. In addition to that, the whole feature also works when using the container specific annotation on an image, for example if a container is named container1:

skopeo inspect --raw docker://quay.io/crio/nginx-seccomp:v2-container |
    jq '.annotations."seccomp-profile.kubernetes.cri-o.io/container1"'
"quay.io/crio/seccomp:v2"

The cool thing about this whole feature is that users can now create seccomp profiles for specific container images and store them side by side in the same registry. Linking the images to the profiles provides a great flexibility to maintain them over the whole application's life cycle.

Pushing profiles using ORAS

The actual creation of the OCI object that contains a seccomp profile requires a bit more work when using ORAS. I have the hope that tools like Podman will simplify the overall process in the future. Right now, the container registry needs to be OCI compatible, which is also the case for Quay.io. CRI-O expects the seccomp profile object to have a container image media type (application/vnd.cncf.seccomp-profile.config.v1+json), while ORAS uses application/vnd.oci.empty.v1+json per default. To achieve all of that, the following commands can be executed:

echo "{}" > config.json
oras push \
    --config config.json:application/vnd.cncf.seccomp-profile.config.v1+json \
     quay.io/crio/seccomp:v2 seccomp.json

The resulting image contains the mediaType that CRI-O expects. ORAS pushes a single layer seccomp.json to the registry. The name of the profile does not matter much. CRI-O will pick the first layer and check if that can act as a seccomp profile.

Future work

CRI-O internally manages the OCI artifacts like regular files. This provides the benefit of moving them around, removing them if not used any more or having any other data available than seccomp profiles. This enables future enhancements in CRI-O on top of OCI artifacts, but also allows thinking about stacking seccomp profiles as part of having multiple layers in an OCI artifact. The limitation that it only works for Unconfined workloads for v1.30.x releases is something different CRI-O would like to address in the future. Simplifying the overall user experience by not compromising security seems to be the key for a successful future of seccomp in container workloads.

The CRI-O maintainers will be happy to listen to any feedback or suggestions on the new feature! Thank you for reading this blog post, feel free to reach out to the maintainers via the Kubernetes Slack channel #crio or create an issue in the GitHub repository.

Spotlight on SIG Cloud Provider

One of the most popular ways developers use Kubernetes-related services is via cloud providers, but have you ever wondered how cloud providers can do that? How does this whole process of integration of Kubernetes to various cloud providers happen? To answer that, let's put the spotlight on SIG Cloud Provider.

SIG Cloud Provider works to create seamless integrations between Kubernetes and various cloud providers. Their mission? Keeping the Kubernetes ecosystem fair and open for all. By setting clear standards and requirements, they ensure every cloud provider plays nicely with Kubernetes. It is their responsibility to configure cluster components to enable cloud provider integrations.

In this blog of the SIG Spotlight series, Arujjwal Negi interviews Michael McCune (Red Hat), also known as elmiko, co-chair of SIG Cloud Provider, to give us an insight into the workings of this group.

Introduction

Arujjwal: Let's start by getting to know you. Can you give us a small intro about yourself and how you got into Kubernetes?

Michael: Hi, I’m Michael McCune, most people around the community call me by my handle, elmiko. I’ve been a software developer for a long time now (Windows 3.1 was popular when I started!), and I’ve been involved with open-source software for most of my career. I first got involved with Kubernetes as a developer of machine learning and data science applications; the team I was on at the time was creating tutorials and examples to demonstrate the use of technologies like Apache Spark on Kubernetes. That said, I’ve been interested in distributed systems for many years and when an opportunity arose to join a team working directly on Kubernetes, I jumped at it!

Functioning and working

Arujjwal: Can you give us an insight into what SIG Cloud Provider does and how it functions?

Michael: SIG Cloud Provider was formed to help ensure that Kubernetes provides a neutral integration point for all infrastructure providers. Our largest task to date has been the extraction and migration of in-tree cloud controllers to out-of-tree components. The SIG meets regularly to discuss progress and upcoming tasks and also to answer questions and bugs that arise. Additionally, we act as a coordination point for cloud provider subprojects such as the cloud provider framework, specific cloud controller implementations, and the Konnectivity proxy project.

Arujjwal: After going through the project README, I learned that SIG Cloud Provider works with the integration of Kubernetes with cloud providers. How does this whole process go?

Michael: One of the most common ways to run Kubernetes is by deploying it to a cloud environment (AWS, Azure, GCP, etc). Frequently, the cloud infrastructures have features that enhance the performance of Kubernetes, for example, by providing elastic load balancing for Service objects. To ensure that cloud-specific services can be consistently consumed by Kubernetes, the Kubernetes community has created cloud controllers to address these integration points. Cloud providers can create their own controllers either by using the framework maintained by the SIG or by following the API guides defined in the Kubernetes code and documentation. One thing I would like to point out is that SIG Cloud Provider does not deal with the lifecycle of nodes in a Kubernetes cluster; for those types of topics, SIG Cluster Lifecycle and the Cluster API project are more appropriate venues.

Important subprojects

Arujjwal: There are a lot of subprojects within this SIG. Can you highlight some of the most important ones and what job they do?

Michael: I think the two most important subprojects today are the cloud provider framework and the extraction/migration project. The cloud provider framework is a common library to help infrastructure integrators build a cloud controller for their infrastructure. This project is most frequently the starting point for new people coming to the SIG. The extraction and migration project is the other big subproject and a large part of why the framework exists. A little history might help explain further: for a long time, Kubernetes needed some integration with the underlying infrastructure, not necessarily to add features but to be aware of cloud events like instance termination. The cloud provider integrations were built into the Kubernetes code tree, and thus the term "in-tree" was created (check out this article on the topic for more info). The activity of maintaining provider-specific code in the main Kubernetes source tree was considered undesirable by the community. The community’s decision inspired the creation of the extraction and migration project to remove the "in-tree" cloud controllers in favor of "out-of-tree" components.

Arujjwal: What makes [the cloud provider framework] a good place to start? Does it have consistent good beginner work? What kind?

Michael: I feel that the cloud provider framework is a good place to start as it encodes the community’s preferred practices for cloud controller managers and, as such, will give a newcomer a strong understanding of how and what the managers do. Unfortunately, there is not a consistent stream of beginner work on this component; this is due in part to the mature nature of the framework and that of the individual providers as well. For folks who are interested in getting more involved, having some Go language knowledge is good and also having an understanding of how at least one cloud API (e.g., AWS, Azure, GCP) works is also beneficial. In my personal opinion, being a newcomer to SIG Cloud Provider can be challenging as most of the code around this project deals directly with specific cloud provider interactions. My best advice to people wanting to do more work on cloud providers is to grow your familiarity with one or two cloud APIs, then look for open issues on the controller managers for those clouds, and always communicate with the other contributors as much as possible.

Accomplishments

Arujjwal: Can you share about an accomplishment(s) of the SIG that you are proud of?

Michael: Since I joined the SIG, more than a year ago, we have made great progress in advancing the extraction and migration subproject. We have moved from an alpha status on the defining KEP to a beta status and are inching ever closer to removing the old provider code from the Kubernetes source tree. I've been really proud to see the active engagement from our community members and to see the progress we have made towards extraction. I have a feeling that, within the next few releases, we will see the final removal of the in-tree cloud controllers and the completion of the subproject.

Advice for new contributors

Arujjwal: Is there any suggestion or advice for new contributors on how they can start at SIG Cloud Provider?

Michael: This is a tricky question in my opinion. SIG Cloud Provider is focused on the code pieces that integrate between Kubernetes and an underlying infrastructure. It is very common, but not necessary, for members of the SIG to be representing a cloud provider in an official capacity. I recommend that anyone interested in this part of Kubernetes should come to an SIG meeting to see how we operate and also to study the cloud provider framework project. We have some interesting ideas for future work, such as a common testing framework, that will cut across all cloud providers and will be a great opportunity for anyone looking to expand their Kubernetes involvement.

Arujjwal: Are there any specific skills you're looking for that we should highlight? To give you an example from our own [SIG ContribEx] (https://github.com/kubernetes/community/blob/master/sig-contributor-experience/README.md): if you're an expert in Hugo, we can always use some help with k8s.dev!

Michael: The SIG is currently working through the final phases of our extraction and migration process, but we are looking toward the future and starting to plan what will come next. One of the big topics that the SIG has discussed is testing. Currently, we do not have a generic common set of tests that can be exercised by each cloud provider to confirm the behaviour of their controller manager. If you are an expert in Ginkgo and the Kubetest framework, we could probably use your help in designing and implementing the new tests.


This is where the conversation ends. I hope this gave you some insights about SIG Cloud Provider's aim and working. This is just the tip of the iceberg. To know more and get involved with SIG Cloud Provider, try attending their meetings here.

A look into the Kubernetes Book Club

Learning Kubernetes and the entire ecosystem of technologies around it is not without its challenges. In this interview, we will talk with Carlos Santana (AWS) to learn a bit more about how he created the Kubernetes Book Club, how it works, and how anyone can join in to take advantage of a community-based learning experience.

Carlos Santana speaking at KubeCon NA 2023

Frederico Muñoz (FSM): Hello Carlos, thank you so much for your availability. To start with, could you tell us a bit about yourself?

Carlos Santana (CS): Of course. My experience in deploying Kubernetes in production six years ago opened the door for me to join Knative and then contribute to Kubernetes through the Release Team. Working on upstream Kubernetes has been one of the best experiences I've had in open-source. Over the past two years, in my role as a Senior Specialist Solutions Architect at AWS, I have been assisting large enterprises build their internal developer platforms (IDP) on top of Kubernetes. Going forward, my open source contributions are directed towards CNOE and CNCF projects like Argo, Crossplane, and Backstage.

Creating the Book Club

FSM: So your path led you to Kubernetes, and at that point what was the motivating factor for starting the Book Club?

CS: The idea for the Kubernetes Book Club sprang from a casual suggestion during a TGIK livestream. For me, it was more than just about reading a book; it was about creating a learning community. This platform has not only been a source of knowledge but also a support system, especially during the challenging times of the pandemic. It's gratifying to see how this initiative has helped members cope and grow. The first book Production Kubernetes took 36 weeks, when we started on March 5th 2021. Currently don't take that long to cover a book, one or two chapters per week.

FSM: Could you describe the way the Kubernetes Book Club works? How do you select the books and how do you go through them?

CS: We collectively choose books based on the interests and needs of the group. This practical approach helps members, especially beginners, grasp complex concepts more easily. We have two weekly series, one for the EMEA timezone, and I organize the US one. Each organizer works with their co-host and picks a book on Slack, then sets up a lineup of hosts for a couple of weeks to discuss each chapter.

FSM: If I’m not mistaken, the Kubernetes Book Club is in its 17th book, which is significant: is there any secret recipe for keeping things active?

CS: The secret to keeping the club active and engaging lies in a couple of key factors.

Firstly, consistency has been crucial. We strive to maintain a regular schedule, only cancelling meetups for major events like holidays or KubeCon. This regularity helps members stay engaged and builds a reliable community.

Secondly, making the sessions interesting and interactive has been vital. For instance, I often introduce pop-up quizzes during the meetups, which not only tests members' understanding but also adds an element of fun. This approach keeps the content relatable and helps members understand how theoretical concepts are applied in real-world scenarios.

Topics covered in the Book Club

FSM: The main topics of the books have been Kubernetes, GitOps, Security, SRE, and Observability: is this a reflection of the cloud native landscape, especially in terms of popularity?

CS: Our journey began with 'Production Kubernetes', setting the tone for our focus on practical, production-ready solutions. Since then, we've delved into various aspects of the CNCF landscape, aligning our books with a different theme. Each theme, whether it be Security, Observability, or Service Mesh, is chosen based on its relevance and demand within the community. For instance, in our recent themes on Kubernetes Certifications, we brought the book authors into our fold as active hosts, enriching our discussions with their expertise.

FSM: I know that the project had recent changes, namely being integrated into the CNCF as a Cloud Native Community Group. Could you talk a bit about this change?

CS: The CNCF graciously accepted the book club as a Cloud Native Community Group. This is a significant development that has streamlined our operations and expanded our reach. This alignment has been instrumental in enhancing our administrative capabilities, similar to those used by Kubernetes Community Days (KCD) meetups. Now, we have a more robust structure for memberships, event scheduling, mailing lists, hosting web conferences, and recording sessions.

FSM: How has your involvement with the CNCF impacted the growth and engagement of the Kubernetes Book Club over the past six months?

CS: Since becoming part of the CNCF community six months ago, we've witnessed significant quantitative changes within the Kubernetes Book Club. Our membership has surged to over 600 members, and we've successfully organized and conducted more than 40 events during this period. What's even more promising is the consistent turnout, with an average of 30 attendees per event. This growth and engagement are clear indicators of the positive influence of our CNCF affiliation on the Kubernetes Book Club's reach and impact in the community.

Joining the Book Club

FSM: For anyone wanting to join, what should they do?

CS: There are three steps to join:

FSM: Excellent, thank you! Any final comments you would like to share?

CS: The Kubernetes Book Club is more than just a group of professionals discussing books; it's a vibrant community and amazing volunteers that help organize and host Neependra Khare, Eric Smalling, Sevi Karakulak, Chad M. Crowell, and Walid (CNJ) Shaari. Look us up at KubeCon and get your Kubernetes Book Club sticker!

Image Filesystem: Configuring Kubernetes to store containers on a separate filesystem

A common issue in running/operating Kubernetes clusters is running out of disk space. When the node is provisioned, you should aim to have a good amount of storage space for your container images and running containers. The container runtime usually writes to /var. This can be located as a separate partition or on the root filesystem. CRI-O, by default, writes its containers and images to /var/lib/containers, while containerd writes its containers and images to /var/lib/containerd.

In this blog post, we want to bring attention to ways that you can configure your container runtime to store its content separately from the default partition.
This allows for more flexibility in configuring Kubernetes and provides support for adding a larger disk for the container storage while keeping the default filesystem untouched.

One area that needs more explaining is where/what Kubernetes is writing to disk.

Understanding Kubernetes disk usage

Kubernetes has persistent data and ephemeral data. The base path for the kubelet and local Kubernetes-specific storage is configurable, but it is usually assumed to be /var/lib/kubelet. In the Kubernetes docs, this is sometimes referred to as the root or node filesystem. The bulk of this data can be categorized into:

  • ephemeral storage
  • logs
  • and container runtime

This is different from most POSIX systems as the root/node filesystem is not / but the disk that /var/lib/kubelet is on.

Ephemeral storage

Pods and containers can require temporary or transient local storage for their operation. The lifetime of the ephemeral storage does not extend beyond the life of the individual pod, and the ephemeral storage cannot be shared across pods.

Logs

By default, Kubernetes stores the logs of each running container, as files within /var/log. These logs are ephemeral and are monitored by the kubelet to make sure that they do not grow too large while the pods are running.

You can customize the log rotation settings for each node to manage the size of these logs, and configure log shipping (using a 3rd party solution) to avoid relying on the node-local storage.

Container runtime

The container runtime has two different areas of storage for containers and images.

  • read-only layer: Images are usually denoted as the read-only layer, as they are not modified when containers are running. The read-only layer can consist of multiple layers that are combined into a single read-only layer. There is a thin layer on top of containers that provides ephemeral storage for containers if the container is writing to the filesystem.

  • writeable layer: Depending on your container runtime, local writes might be implemented as a layered write mechanism (for example, overlayfs on Linux or CimFS on Windows). This is referred to as the writable layer. Local writes could also use a writeable filesystem that is initialized with a full clone of the container image; this is used for some runtimes based on hypervisor virtualisation.

The container runtime filesystem contains both the read-only layer and the writeable layer. This is considered the imagefs in Kubernetes documentation.

Container runtime configurations

CRI-O

CRI-O uses a storage configuration file in TOML format that lets you control how the container runtime stores persistent and temporary data. CRI-O utilizes the storage library.
Some Linux distributions have a manual entry for storage (man 5 containers-storage.conf). The main configuration for storage is located in /etc/containers/storage.conf and one can control the location for temporary data and the root directory.
The root directory is where CRI-O stores the persistent data.

[storage]
# Default storage driver
driver = "overlay"
# Temporary storage location
runroot = "/var/run/containers/storage"
# Primary read/write location of container storage 
graphroot = "/var/lib/containers/storage"
  • graphroot
    • Persistent data stored from the container runtime
    • If SELinux is enabled, this must match the /var/lib/containers/storage
  • runroot
    • Temporary read/write access for container
    • Recommended to have this on a temporary filesystem

Here is a quick way to relabel your graphroot directory to match /var/lib/containers/storage:

semanage fcontext -a -e /var/lib/containers/storage <YOUR-STORAGE-PATH>
restorecon -R -v <YOUR-STORAGE-PATH>

containerd

The containerd runtime uses a TOML configuration file to control where persistent and ephemeral data is stored. The default path for the config file is located at /etc/containerd/config.toml.

The relevant fields for containerd storage are root and state.

  • root
    • The root directory for containerd metadata
    • Default is /var/lib/containerd
    • Root also requires SELinux labels if your OS requires it
  • state
    • Temporary data for containerd
    • Default is /run/containerd

Kubernetes node pressure eviction

Kubernetes will automatically detect if the container filesystem is split from the node filesystem. When one separates the filesystem, Kubernetes is responsible for monitoring both the node filesystem and the container runtime filesystem. Kubernetes documentation refers to the node filesystem and the container runtime filesystem as nodefs and imagefs. If either nodefs or the imagefs are running out of disk space, then the overall node is considered to have disk pressure. Kubernetes will first reclaim space by deleting unusued containers and images, and then it will resort to evicting pods. On a node that has a nodefs and an imagefs, the kubelet will garbage collect unused container images on imagefs and will remove dead pods and their containers from the nodefs. If there is only a nodefs, then Kubernetes garbage collection includes dead containers, dead pods and unused images.

Kubernetes allows more configurations for determining if your disk is full.
The eviction manager within the kubelet has some configuration settings that let you control the relevant thresholds. For filesystems, the relevant measurements are nodefs.available, nodefs.inodesfree, imagefs.available, and imagefs.inodesfree. If there is not a dedicated disk for the container runtime then imagefs is ignored.

Users can use the existing defaults:

  • memory.available < 100MiB
  • nodefs.available < 10%
  • imagefs.available < 15%
  • nodefs.inodesFree < 5% (Linux nodes)

Kubernetes allows you to set user defined values in EvictionHard and EvictionSoft in the kubelet configuration file.

EvictionHard
defines limits; once these limits are exceeded, pods will be evicted without any grace period.
EvictionSoft
defines limits; once these limits are exceeded, pods will be evicted with a grace period that can be set per signal.

If you specify a value for EvictionHard, it will replace the defaults.
This means it is important to set all signals in your configuration.

For example, the following kubelet configuration could be used to configure eviction signals and grace period options.

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
address: "192.168.0.8"
port: 20250
serializeImagePulls: false
evictionHard:
    memory.available:  "100Mi"
    nodefs.available:  "10%"
    nodefs.inodesFree: "5%"
    imagefs.available: "15%"
    imagefs.inodesFree: "5%"
evictionSoft:
    memory.available:  "100Mi"
    nodefs.available:  "10%"
    nodefs.inodesFree: "5%"
    imagefs.available: "15%"
    imagefs.inodesFree: "5%"
evictionSoftGracePeriod:
    memory.available:  "1m30s"
    nodefs.available:  "2m"
    nodefs.inodesFree: "2m"
    imagefs.available: "2m"
    imagefs.inodesFree: "2m"
evictionMaxPodGracePeriod: 60s

Problems

The Kubernetes project recommends that you either use the default settings for eviction or you set all the fields for eviction. You can use the default settings or specify your own evictionHard settings. If you miss a signal, then Kubernetes will not monitor that resource. One common misconfiguration administrators or users can hit is mounting a new filesystem to /var/lib/containers/storage or /var/lib/containerd. Kubernetes will detect a separate filesystem, so you want to make sure to check that imagefs.inodesfree and imagefs.available match your needs if you've done this.

Another area of confusion is that ephemeral storage reporting does not change if you define an image filesystem for your node. The image filesystem (imagefs) is used to store container image layers; if a container writes to its own root filesystem, that local write doesn't count towards the size of the container image. The place where the container runtime stores those local modifications is runtime-defined, but is often the image filesystem. If a container in a pod is writing to a filesystem-backed emptyDir volume, then this uses space from the nodefs filesystem. The kubelet always reports ephemeral storage capacity and allocations based on the filesystem represented by nodefs; this can be confusing when ephemeral writes are actually going to the image filesystem.

Future work

To fix the ephemeral storage reporting limitations and provide more configuration options to the container runtime, SIG Node are working on KEP-4191. In KEP-4191, Kubernetes will detect if the writeable layer is separated from the read-only layer (images). This would allow us to have all ephemeral storage, including the writeable layer, on the same disk as well as allowing for a separate disk for images.

Getting involved

If you would like to get involved, you can join Kubernetes Node Special-Interest-Group (SIG).

If you would like to share feedback, you can do so on our #sig-node Slack channel. If you're not already part of that Slack workspace, you can visit https://slack.k8s.io/ for an invitation.

Special thanks to all the contributors who provided great reviews, shared valuable insights or suggested the topic idea.

  • Peter Hunt
  • Mrunal Patel
  • Ryan Phillips
  • Gaurav Singh

Spotlight on SIG Release (Release Team Subproject)

The Release Special Interest Group (SIG Release), where Kubernetes sharpens its blade with cutting-edge features and bug fixes every 4 months. Have you ever considered how such a big project like Kubernetes manages its timeline so efficiently to release its new version, or how the internal workings of the Release Team look like? If you're curious about these questions or want to know more and get involved with the work SIG Release does, read on!

SIG Release plays a crucial role in the development and evolution of Kubernetes. Its primary responsibility is to manage the release process of new versions of Kubernetes. It operates on a regular release cycle, typically every three to four months. During this cycle, the Kubernetes Release Team works closely with other SIGs and contributors to ensure a smooth and well-coordinated release. This includes planning the release schedule, setting deadlines for code freeze and testing phases, as well as creating release artefacts like binaries, documentation, and release notes.

Before you read further, it is important to note that there are two subprojects under SIG Release - Release Engineering and Release Team.

In this blog post, Nitish Kumar interviews Verónica López (PlanetScale), Technical Lead of SIG Release, with the spotlight on the Release Team subproject, how the release process looks like, and ways to get involved.

  1. What is the typical release process for a new version of Kubernetes, from initial planning to the final release? Are there any specific methodologies and tools that you use to ensure a smooth release?

    The release process for a new Kubernetes version is a well-structured and community-driven effort. There are no specific methodologies or tools as such that we follow, except a calendar with a series of steps to keep things organised. The complete release process looks like this:

  • Release Team Onboarding: We start with the formation of a Release Team, which includes volunteers from the Kubernetes community who will be responsible for managing different components of the new release. This is typically done before the previous release is about to wrap up. Once the team is formed, new members are onboarded while the Release Team Lead and the Branch Manager propose a calendar for the usual deliverables. As an example, you can take a look at the v1.29 team formation issue created at the SIG Release repository. For a contributor to be the part of Release Team, they typically go through the Release Shadow program, but that's not the only way to get involved with SIG Release.

  • Beginning Phase: In the initial weeks of each release cycle, SIG Release diligently tracks the progress of new features and enhancements outlined in Kubernetes Enhancement Proposals (KEPs). While not all of these features are entirely new, they often commence their journey in the alpha phase, subsequently advancing to the beta stage, and ultimately attaining the status of stability.

  • Feature Maturation Phase: We usually cut a couple of Alpha releases, containing new features in an experimental state, to gather feedback from the community, followed by a couple of Beta releases, where features are more stable and the focus is on fixing bugs. Feedback from users is critical at this stage, to the point where sometimes we need to cut an additional Beta release to address bugs or other concerns that may arise during this phase. Once this is cleared, we cut a release candidate (RC) before the actual release. Throughout the cycle, efforts are made to update and improve documentation, including release notes and user guides, a process that, in my opinion, deserves its own post.

  • Stabilisation Phase: A few weeks before the new release, we implement a code freeze, and no new features are allowed after this point: this allows the focus to shift towards testing and stabilisation. In parallel to the main release, we keep cutting monthly patches of old, officially supported versions of Kubernetes, so you could say that the lifecycle of a Kubernetes version extends for several months afterwards. Throughout the complete release cycle, efforts are made to update and improve documentation, including release notes and user guides, a process that, in our opinion, deserves its own post.

    Release team onboarding; beginning phase; stabalization phase; feature maturation phase

  1. How do you handle the balance between stability and introducing new features in each release? What criteria are used to determine which features make it into a release?

    It’s a neverending mission, however, we think that the key is in respecting our process and guidelines. Our guidelines are the result of hours of discussions and feedback from dozens of members of the community who bring a wealth of knowledge and experience to the project. If we didn’t have strict guidelines, we would keep having the same discussions over and over again, instead of using our time for more productive topics that needs our attention. All the critical exceptions require consensus from most of the team members, so we can ensure quality.

    The process of deciding what makes it into a release starts way before the Release Teams takes over the workflows. Each individual SIG along with the most experienced contributors gets to decide whether they’d like to include a feature or change, so the planning and ultimate approval usually belongs to them. Then, the Release Team makes sure those contributions meet the requirements of documentation, testing, backwards compatibility, among others, before officially allowing them in. A similar process happens with cherry-picks for the monthly patch releases, where we have strict policies about not accepting PRs that would require a full KEP, or fixes that don’t include all the affected branches.

  2. What are some of the most significant challenges you’ve encountered while developing and releasing Kubernetes? How have you overcome these challenges?

    Every cycle of release brings its own array of challenges. It might involve tackling last-minute concerns like newly discovered Common Vulnerabilities and Exposures (CVEs), resolving bugs within our internal tools, or addressing unexpected regressions caused by features from previous releases. Another obstacle we often face is that, although our team is substantial, most of us contribute on a volunteer basis. Sometimes it can feel like we’re a bit understaffed, however we always manage to get organised and make it work.

  3. As a new contributor, what should be my ideal path to get involved with SIG Release? In a community where everyone is busy with their own tasks, how can I find the right set of tasks to contribute effectively to it?

    Everyone's way of getting involved within the Open Source community is different. SIG Release is a self-serving team, meaning that we write our own tools to be able to ship releases. We collaborate a lot with other SIGs, such as SIG K8s Infra, but all the tools that we used needs to be tailor-made for our massive technical needs, while reducing costs. This means that we are constantly looking for volunteers who’d like to help with different types of projects, beyond “just” cutting a release.

    Our current project requires a mix of skills like Go programming, understanding Kubernetes internals, Linux packaging, supply chain security, technical writing, and general open-source project maintenance. This skill set is always evolving as our project grows.

    For an ideal path, this is what we suggest:

    • Get yourself familiar with the code, including how features are managed, the release calendar, and the overall structure of the Release Team.
    • Join the Kubernetes community communication channels, such as Slack (#sig-release), where we are particularly active.
    • Join the SIG Release weekly meetings which are open to all in the community. Participating in these meetings is a great way to learn about ongoing and future projects that you might find relevant for your skillset and interests.

    Remember, every experienced contributor was once in your shoes, and the community is often more than willing to guide and support newcomers. Don't hesitate to ask questions, engage in discussions, and take small steps to contribute. sig-release-questions

  4. What is the Release Shadow Program and how is it different from other shadow programs included in various other SIGs?

    The Release Shadow Program offers a chance for interested individuals to shadow experienced members of the Release Team throughout a Kubernetes release cycle. This is a unique chance to see all the hard work that a Kubernetes release requires across sub-teams. A lot of people think that all we do is cut a release every three months, but that’s just the top of the iceberg.

    Our program typically aligns with a specific Kubernetes release cycle, which has a predictable timeline of approximately three months. While this program doesn’t involve writing new Kubernetes features, it still requires a high sense of responsibility since the Release Team is the last step between a new release and thousands of contributors, so it’s a great opportunity to learn a lot about modern software development cycles at an accelerated pace.

  5. What are the qualifications that you generally look for in a person to volunteer as a release shadow/release lead for the next Kubernetes release?

    While all the roles require some degree of technical ability, some require more hands-on experience with Go and familiarity with the Kubernetes API while others require people who are good at communicating technical content in a clear and concise way. It’s important to mention that we value enthusiasm and commitment over technical expertise from day 1. If you have the right attitude and show us that you enjoy working with Kubernetes and or/release engineering, even if it’s only through a personal project that you put together in your spare time, the team will make sure to guide you. Being a self-starter and not being afraid to ask questions can take you a long way in our team.

  6. What will you suggest to someone who has got rejected from being a part of the Release Shadow Program several times?

    Keep applying.

    With every release cycle we have had an exponential growth in the number of applicants, so it gets harder to be selected, which can be discouraging, but please know that getting rejected doesn’t mean you’re not talented. It’s just practically impossible to accept every applicant, however here's an alternative that we suggest:

    Start attending our weekly Kubernetes SIG Release meetings to introduce yourself and get familiar with the team and the projects we are working on.

    The Release Team is one of the way to join SIG Release, but we are always looking for more hands to help. Again, in addition to certain technical ability, the most sought after trait that we look for is people we can trust, and that requires time. sig-release-motivation

  7. Can you discuss any ongoing initiatives or upcoming features that the release team is particularly excited about for Kubernetes v1.28? How do these advancements align with the long-term vision of Kubernetes?

    We are excited about finally publishing Kubernetes packages on community infrastructure. It has been something that we have been wanting to do for a few years now, but it’s a project with many technical implications that must be in place before doing the transition. Once that’s done, we’ll be able to increase our productivity and take control of the entire workflows.

Final thoughts

Well, this conversation ends here but not the learning. I hope this interview has given you some idea about what SIG Release does and how to get started in helping out. It is important to mention again that this article covers the first subproject under SIG Release, the Release Team. In the next Spotlight blog on SIG Release, we will provide a spotlight on the Release Engineering subproject, what it does and how to get involved. Finally, you can go through the SIG Release charter to get a more in-depth understanding of how SIG Release operates.

Contextual logging in Kubernetes 1.29: Better troubleshooting and enhanced logging

On behalf of the Structured Logging Working Group and SIG Instrumentation, we are pleased to announce that the contextual logging feature introduced in Kubernetes v1.24 has now been successfully migrated to two components (kube-scheduler and kube-controller-manager) as well as some directories. This feature aims to provide more useful logs for better troubleshooting of Kubernetes and to empower developers to enhance Kubernetes.

What is contextual logging?

Contextual logging is based on the go-logr API. The key idea is that libraries are passed a logger instance by their caller and use that for logging instead of accessing a global logger. The binary decides the logging implementation, not the libraries. The go-logr API is designed around structured logging and supports attaching additional information to a logger.

This enables additional use cases:

  • The caller can attach additional information to a logger:

    • WithName adds a "logger" key with the names concatenated by a dot as value
    • WithValues adds key/value pairs

    When passing this extended logger into a function, and the function uses it instead of the global logger, the additional information is then included in all log entries, without having to modify the code that generates the log entries. This is useful in highly parallel applications where it can become hard to identify all log entries for a certain operation, because the output from different operations gets interleaved.

  • When running unit tests, log output can be associated with the current test. Then, when a test fails, only the log output of the failed test gets shown by go test. That output can also be more verbose by default because it will not get shown for successful tests. Tests can be run in parallel without interleaving their output.

One of the design decisions for contextual logging was to allow attaching a logger as value to a context.Context. Since the logger encapsulates all aspects of the intended logging for the call, it is part of the context, and not just using it. A practical advantage is that many APIs already have a ctx parameter or can add one. This provides additional advantages, like being able to get rid of context.TODO() calls inside the functions.

How to use it

The contextual logging feature is alpha starting from Kubernetes v1.24, so it requires the ContextualLogging feature gate to be enabled. If you want to test the feature while it is alpha, you need to enable this feature gate on the kube-controller-manager and the kube-scheduler.

For the kube-scheduler, there is one thing to note, in addition to enabling the ContextualLogging feature gate, instrumentation also depends on log verbosity. To avoid slowing down the scheduler with the logging instrumentation for contextual logging added for 1.29, it is important to choose carefully when to add additional information:

  • At -v3 or lower, only WithValues("pod") is used once per scheduling cycle. This has the intended effect that all log messages for the cycle include the pod information. Once contextual logging is GA, "pod" key/value pairs can be removed from all log calls.
  • At -v4 or higher, richer log entries get produced where WithValues is also used for the node (when applicable) and WithName is used for the current operation and plugin.

Here is an example that demonstrates the effect:

I1113 08:43:37.029524 87144 default_binder.go:53] "Attempting to bind pod to node" logger="Bind.DefaultBinder" pod="kube-system/coredns-69cbfb9798-ms4pq" node="127.0.0.1"

The immediate benefit is that the operation and plugin name are visible in logger. pod and node are already logged as parameters in individual log calls in kube-scheduler code. Once contextual logging is supported by more packages outside of kube-scheduler, they will also be visible there (for example, client-go). Once it is GA, log calls can be simplified to avoid repeating those values.

In kube-controller-manager, WithName is used to add the user-visible controller name to log output, for example:

I1113 08:43:29.284360 87141 graph_builder.go:285] "garbage controller monitor not synced: no monitors" logger="garbage-collector-controller"

The logger=”garbage-collector-controller” was added by the kube-controller-manager core when instantiating that controller and appears in all of its log entries - at least as long as the code that it calls supports contextual logging. Further work is needed to convert shared packages like client-go.

Performance impact

Supporting contextual logging in a package, i.e. accepting a logger from a caller, is cheap. No performance impact was observed for the kube-scheduler. As noted above, adding WithName and WithValues needs to be done more carefully.

In Kubernetes 1.29, enabling contextual logging at production verbosity (-v3 or lower) caused no measurable slowdown for the kube-scheduler and is not expected for the kube-controller-manager either. At debug levels, a 28% slowdown for some test cases is still reasonable given that the resulting logs make debugging easier. For details, see the discussion around promoting the feature to beta.

Impact on downstream users

Log output is not part of the Kubernetes API and changes regularly in each release, whether it is because developers work on the code or because of the ongoing conversion to structured and contextual logging.

If downstream users have dependencies on specific logs, they need to be aware of how this change affects them.

Further reading

Get involved

If you're interested in getting involved, we always welcome new contributors to join us. Contextual logging provides a fantastic opportunity for you to contribute to Kubernetes development and make a meaningful impact. By joining Structured Logging WG, you can actively participate in the development of Kubernetes and make your first contribution. It's a great way to learn and engage with the community while gaining valuable experience.

We encourage you to explore the repository and familiarize yourself with the ongoing discussions and projects. It's a collaborative environment where you can exchange ideas, ask questions, and work together with other contributors.

If you have any questions or need guidance, don't hesitate to reach out to us and you can do so on our public Slack channel. If you're not already part of that Slack workspace, you can visit https://slack.k8s.io/ for an invitation.

We would like to express our gratitude to all the contributors who provided excellent reviews, shared valuable insights, and assisted in the implementation of this feature (in alphabetical order):

Kubernetes 1.29: Decoupling taint manager from node lifecycle controller

This blog discusses a new feature in Kubernetes 1.29 to improve the handling of taint-based pod eviction.

Background

In Kubernetes 1.29, an improvement has been introduced to enhance the taint-based pod eviction handling on nodes. This blog discusses the changes made to node-lifecycle-controller to separate its responsibilities and improve overall code maintainability.

Summary of changes

node-lifecycle-controller previously combined two independent functions:

  • Adding a pre-defined set of NoExecute taints to Node based on Node's condition.
  • Performing pod eviction on NoExecute taint.

With the Kubernetes 1.29 release, the taint-based eviction implementation has been moved out of node-lifecycle-controller into a separate and independent component called taint-eviction-controller. This separation aims to disentangle code, enhance code maintainability, and facilitate future extensions to either component.

As part of the change, additional metrics were introduced to help you monitor taint-based pod evictions:

  • pod_deletion_duration_seconds measures the latency between the time when a taint effect has been activated for the Pod and its deletion via taint-eviction-controller.
  • pod_deletions_total reports the total number of Pods deleted by taint-eviction-controller since its start.

How to use the new feature?

A new feature gate, SeparateTaintEvictionController, has been added. The feature is enabled by default as Beta in Kubernetes 1.29. Please refer to the feature gate document.

When this feature is enabled, users can optionally disable taint-based eviction by setting --controllers=-taint-eviction-controller in kube-controller-manager.

To disable the new feature and use the old taint-manager within node-lifecylecycle-controller , users can set the feature gate SeparateTaintEvictionController=false.

Use cases

This new feature will allow cluster administrators to extend and enhance the default taint-eviction-controller and even replace the default taint-eviction-controller with a custom implementation to meet different needs. An example is to better support stateful workloads that use PersistentVolume on local disks.

FAQ

Does this feature change the existing behavior of taint-based pod evictions?

No, the taint-based pod eviction behavior remains unchanged. If the feature gate SeparateTaintEvictionController is turned off, the legacy node-lifecycle-controller with taint-manager will continue to be used.

Will enabling/using this feature result in an increase in the time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling/using this feature result in an increase in resource usage (CPU, RAM, disk, IO, ...)?

The increase in resource usage by running a separate taint-eviction-controller will be negligible.

Learn more

For more details, refer to the KEP.

Acknowledgments

As with any Kubernetes feature, multiple community members have contributed, from writing the KEP to implementing the new controller and reviewing the KEP and code. Special thanks to:

  • Aldo Culquicondor (@alculquicondor)
  • Maciej Szulik (@soltysh)
  • Filip Křepinský (@atiratree)
  • Han Kang (@logicalhan)
  • Wei Huang (@Huang-Wei)
  • Sergey Kanzhelevi (@SergeyKanzhelev)
  • Ravi Gudimetla (@ravisantoshgudimetla)
  • Deep Debroy (@ddebroy)

Kubernetes 1.29: PodReadyToStartContainers Condition Moves to Beta

With the recent release of Kubernetes 1.29, the PodReadyToStartContainers condition is available by default. The kubelet manages the value for that condition throughout a Pod's lifecycle, in the status field of a Pod. The kubelet will use the PodReadyToStartContainers condition to accurately surface the initialization state of a Pod, from the perspective of Pod sandbox creation and network configuration by a container runtime.

What's the motivation for this feature?

Cluster administrators did not have a clear and easily accessible way to view the completion of Pod's sandbox creation and initialization. As of 1.28, the Initialized condition in Pods tracks the execution of init containers. However, it has limitations in accurately reflecting the completion of sandbox creation and readiness to start containers for all Pods in a cluster. This distinction is particularly important in multi-tenant clusters where tenants own the Pod specifications, including the set of init containers, while cluster administrators manage storage plugins, networking plugins, and container runtime handlers. Therefore, there is a need for an improved mechanism to provide cluster administrators with a clear and comprehensive view of Pod sandbox creation completion and container readiness.

What's the benefit?

  1. Improved Visibility: Cluster administrators gain a clearer and more comprehensive view of Pod sandbox creation completion and container readiness. This enhanced visibility allows them to make better-informed decisions and troubleshoot issues more effectively.
  2. Metric Collection and Monitoring: Monitoring services can leverage the fields associated with the PodReadyToStartContainers condition to report sandbox creation state and latency. Metrics can be collected at per-Pod cardinality or aggregated based on various properties of the Pod, such as volumes, runtimeClassName, custom annotations for CNI and IPAM plugins or arbitrary labels and annotations, and storageClassName of PersistentVolumeClaims. This enables comprehensive monitoring and analysis of Pod readiness across the cluster.
  3. Enhanced Troubleshooting: With a more accurate representation of Pod sandbox creation and container readiness, cluster administrators can quickly identify and address any issues that may arise during the initialization process. This leads to improved troubleshooting capabilities and reduced downtime.

What’s next?

Due to feedback and adoption, the Kubernetes team promoted PodReadyToStartContainersCondition to Beta in 1.29. Your comments will help determine if this condition continues forward to get promoted to GA, so please submit additional feedback on this feature!

How can I learn more?

Please check out the documentation for the PodReadyToStartContainersCondition to learn more about it and how it fits in relation to other Pod conditions.

How to get involved?

This feature is driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!

Kubernetes 1.29: New (alpha) Feature, Load Balancer IP Mode for Services

This blog introduces a new alpha feature in Kubernetes 1.29. It provides a configurable approach to define how Service implementations, exemplified in this blog by kube-proxy, handle traffic from pods to the Service, within the cluster.

Background

In older Kubernetes releases, the kube-proxy would intercept traffic that was destined for the IP address associated with a Service of type: LoadBalancer. This happened whatever mode you used for kube-proxy. The interception implemented the expected behavior (traffic eventually reaching the expected endpoints behind the Service). The mechanism to make that work depended on the mode for kube-proxy; on Linux, kube-proxy in iptables mode would redirecting packets directly to the endpoint; in ipvs mode, kube-proxy would configure the load balancer's IP address to one interface on the node. The motivation for implementing that interception was for two reasons:

  1. Traffic path optimization: Efficiently redirecting pod traffic - when a container in a pod sends an outbound packet that is destined for the load balancer's IP address - directly to the backend service by bypassing the load balancer.

  2. Handling load balancer packets: Some load balancers send packets with the destination IP set to the load balancer's IP address. As a result, these packets need to be routed directly to the correct backend (which might not be local to that node), in order to avoid loops.

Problems

However, there are several problems with the aforementioned behavior:

  1. Source IP: Some cloud providers use the load balancer's IP as the source IP when transmitting packets to the node. In the ipvs mode of kube-proxy, there is a problem that health checks from the load balancer never return. This occurs because the reply packets would be forward to the local interface kube-ipvs0(where the load balancer's IP is bound to) and be subsequently ignored.

  2. Feature loss at load balancer level: Certain cloud providers offer features(such as TLS termination, proxy protocol, etc.) at the load balancer level. Bypassing the load balancer results in the loss of these features when the packet reaches the service (leading to protocol errors).

Even with the new alpha behaviour disabled (the default), there is a workaround that involves setting .status.loadBalancer.ingress.hostname for the Service, in order to bypass kube-proxy binding. But this is just a makeshift solution.

Solution

In summary, providing an option for cloud providers to disable the current behavior would be highly beneficial.

To address this, Kubernetes v1.29 introduces a new (alpha) .status.loadBalancer.ingress.ipMode field for a Service. This field specifies how the load balancer IP behaves and can be specified only when the .status.loadBalancer.ingress.ip field is also specified.

Two values are possible for .status.loadBalancer.ingress.ipMode: "VIP" and "Proxy". The default value is "VIP", meaning that traffic delivered to the node with the destination set to the load balancer's IP and port will be redirected to the backend service by kube-proxy. This preserves the existing behavior of kube-proxy. The "Proxy" value is intended to prevent kube-proxy from binding the load balancer's IP address to the node in both ipvs and iptables modes. Consequently, traffic is sent directly to the load balancer and then forwarded to the destination node. The destination setting for forwarded packets varies depending on how the cloud provider's load balancer delivers traffic:

  • If the traffic is delivered to the node then DNATed to the pod, the destination would be set to the node's IP and node port;
  • If the traffic is delivered directly to the pod, the destination would be set to the pod's IP and port.

Usage

Here are the necessary steps to enable this feature:

  • Download the latest Kubernetes project (version v1.29.0 or later).
  • Enable the feature gate with the command line flag --feature-gates=LoadBalancerIPMode=true on kube-proxy, kube-apiserver, and cloud-controller-manager.
  • For Services with type: LoadBalancer, set ipMode to the appropriate value. This step is likely handled by your chosen cloud-controller-manager during the EnsureLoadBalancer process.

More information

Getting involved

Reach us on Slack#sig-network, or through the mailing list.

Acknowledgments

Huge thanks to @Sh4d1 for the original KEP and initial implementation code. I took over midway and completed the work. Similarly, immense gratitude to other contributors who have assisted in the design, implementation, and review of this feature (alphabetical order):

Kubernetes 1.29: Single Pod Access Mode for PersistentVolumes Graduates to Stable

With the release of Kubernetes v1.29, the ReadWriteOncePod volume access mode has graduated to general availability: it's part of Kubernetes' stable API. In this blog post, I'll take a closer look at this access mode and what it does.

What is ReadWriteOncePod?

ReadWriteOncePod is an access mode for PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs) introduced in Kubernetes v1.22. This access mode enables you to restrict volume access to a single pod in the cluster, ensuring that only one pod can write to the volume at a time. This can be particularly useful for stateful workloads that require single-writer access to storage.

For more context on access modes and how ReadWriteOncePod works read What are access modes and why are they important? in the Introducing Single Pod Access Mode for PersistentVolumes article from 2021.

How can I start using ReadWriteOncePod?

The ReadWriteOncePod volume access mode is available by default in Kubernetes versions v1.27 and beyond. In Kubernetes v1.29 and later, the Kubernetes API always recognizes this access mode.

Note that ReadWriteOncePod is only supported for CSI volumes, and before using this feature, you will need to update the following CSI sidecars to these versions or greater:

To start using ReadWriteOncePod, you need to create a PVC with the ReadWriteOncePod access mode:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: single-writer-only
spec:
  accessModes:
  - ReadWriteOncePod # Allows only a single pod to access single-writer-only.
  resources:
    requests:
      storage: 1Gi

If your storage plugin supports Dynamic provisioning, then new PersistentVolumes will be created with the ReadWriteOncePod access mode applied.

Read Migrating existing PersistentVolumes for details on migrating existing volumes to use ReadWriteOncePod.

How can I learn more?

Please see the blog posts alpha, beta, and KEP-2485 for more details on the ReadWriteOncePod access mode and motivations for CSI spec changes.

How do I get involved?

The Kubernetes #csi Slack channel and any of the standard SIG Storage communication channels are great methods to reach out to the SIG Storage and the CSI teams.

Special thanks to the following people whose thoughtful reviews and feedback helped shape this feature:

  • Abdullah Gharaibeh (ahg-g)
  • Aldo Culquicondor (alculquicondor)
  • Antonio Ojea (aojea)
  • David Eads (deads2k)
  • Jan Šafránek (jsafrane)
  • Joe Betz (jpbetz)
  • Kante Yin (kerthcet)
  • Michelle Au (msau42)
  • Tim Bannister (sftim)
  • Xing Yang (xing-yang)

If you’re interested in getting involved with the design and development of CSI or any part of the Kubernetes storage system, join the Kubernetes Storage Special Interest Group (SIG). We’re rapidly growing and always welcome new contributors.

Kubernetes 1.29: CSI Storage Resizing Authenticated and Generally Available in v1.29

Kubernetes version v1.29 brings generally available support for authentication during CSI (Container Storage Interface) storage resize operations.

Let's embark on the evolution of this feature, initially introduced in alpha in Kubernetes v1.25, and unravel the changes accompanying its transition to GA.

Authenticated CSI storage resizing unveiled

Kubernetes harnesses the capabilities of CSI to seamlessly integrate with third-party storage systems, empowering your cluster to seamlessly expand storage volumes managed by the CSI driver. The recent elevation of authentication secret support for resizes from Beta to GA ushers in new horizons, enabling volume expansion in scenarios where the underlying storage operation demands credentials for backend cluster operations – such as accessing a SAN/NAS fabric. This enhancement addresses a critical limitation for CSI drivers, allowing volume expansion at the node level, especially in cases necessitating authentication for resize operations.

The challenges extend beyond node-level expansion. Within the Special Interest Group (SIG) Storage, use cases have surfaced, including scenarios where the CSI driver needs to validate the actual size of backend block storage before initiating a node-level filesystem expand operation. This validation prevents false positive returns from the backend storage cluster during file system expansion. Additionally, for PersistentVolumes representing encrypted block storage (e.g., using LUKS), a passphrase is mandated to expand the device and grow the filesystem, underscoring the necessity for authenticated resizing.

What's new for Kubernetes v1.29

With the graduation to GA, the feature remains enabled by default. Support for node-level volume expansion secrets has been seamlessly integrated into the CSI external-provisioner sidecar controller. To take advantage, ensure your external CSI storage provisioner sidecar controller is operating at v3.3.0 or above.

Assuming all requisite components, including the CSI driver, are deployed and operational on your cluster, and you have a CSI driver supporting resizing, you can initiate a NodeExpand operation on a CSI volume. Credentials for the CSI NodeExpand operation can be conveniently provided as a Kubernetes Secret, specifying the Secret via the StorageClass. Here's an illustrative manifest for a Secret holding credentials:

---
apiVersion: v1
kind: Secret
metadata:
  name: test-secret
  namespace: default
data:
  stringData:
    username: admin
    password: t0p-Secret

And here's an example manifest for a StorageClass referencing those credentials:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-blockstorage-sc
parameters:
  csi.storage.k8s.io/node-expand-secret-name: test-secret
  csi.storage.k8s.io/node-expand-secret-namespace: default
provisioner: blockstorage.cloudprovider.example
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true

Upon successful creation of the PersistentVolumeClaim (PVC), you can verify the configuration within the .spec.csi field of the PersistentVolume. To confirm, execute kubectl get persistentvolume <pv_name> -o yaml.

Engage with the Evolution!

For those enthusiastic about contributing or delving deeper into the technical intricacies, the enhancement proposal comprises exhaustive details about the feature's history and implementation. Explore the realms of StorageClass-based dynamic provisioning in Kubernetes by referring to the [storage class documentation] (https://kubernetes.io/docs/concepts/storage/persistent-volumes/#class) and the overarching PersistentVolumes documentation.

Join the Kubernetes Storage SIG (Special Interest Group) to actively participate in elevating this feature. Your insights are invaluable, and we eagerly anticipate welcoming more contributors to shape the future of Kubernetes storage!

Kubernetes 1.29: VolumeAttributesClass for Volume Modification

The v1.29 release of Kubernetes introduced an alpha feature to support modifying a volume by changing the volumeAttributesClassName that was specified for a PersistentVolumeClaim (PVC). With the feature enabled, Kubernetes can handle updates of volume attributes other than capacity. Allowing volume attributes to be changed without managing it through different provider's APIs directly simplifies the current flow.

You can read about VolumeAttributesClass usage details in the Kubernetes documentation or you can read on to learn about why the Kubernetes project is supporting this feature.

VolumeAttributesClass

The new storage.k8s.io/v1alpha1 API group provides two new types:

VolumeAttributesClass

Represents a specification of mutable volume attributes defined by the CSI driver. The class can be specified during dynamic provisioning of PersistentVolumeClaims, and changed in the PersistentVolumeClaim spec after provisioning.

ModifyVolumeStatus

Represents the status object of ControllerModifyVolume operation.

With this alpha feature enabled, the spec of PersistentVolumeClaim defines VolumeAttributesClassName that is used in the PVC. At volume provisioning, the CreateVolume operation will apply the parameters in the VolumeAttributesClass along with the parameters in the StorageClass.

When there is a change of volumeAttributesClassName in the PVC spec, the external-resizer sidecar will get an informer event. Based on the current state of the configuration, the resizer will trigger a CSI ControllerModifyVolume. More details can be found in KEP-3751.

How to use it

If you want to test the feature whilst it's alpha, you need to enable the relevant feature gate in the kube-controller-manager and the kube-apiserver. Use the --feature-gates command line argument:

--feature-gates="...,VolumeAttributesClass=true"

It also requires that the CSI driver has implemented the ModifyVolume API.

User flow

If you would like to see the feature in action and verify it works fine in your cluster, here's what you can try:

  1. Define a StorageClass and VolumeAttributesClass

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: csi-sc-example
    provisioner: pd.csi.storage.gke.io
    parameters:
      type: "hyperdisk-balanced"
    volumeBindingMode: WaitForFirstConsumer
    
    apiVersion: storage.k8s.io/v1alpha1
    kind: VolumeAttributesClass
    metadata:
      name: silver
    driverName: pd.csi.storage.gke.io
    parameters:
      provisioned-iops: "3000"
      provisioned-throughput: "50"
    
  2. Define and create the PersistentVolumeClaim

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: test-pv-claim
    spec:
      storageClassName: csi-sc-example
      volumeAttributesClassName: silver
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 64Gi
    
  3. Verify that the PersistentVolumeClaim is now provisioned correctly with:

    kubectl get pvc
    
  4. Create a new VolumeAttributesClass gold:

    apiVersion: storage.k8s.io/v1alpha1
    kind: VolumeAttributesClass
    metadata:
      name: gold
    driverName: pd.csi.storage.gke.io
    parameters:
      iops: "4000"
      throughput: "60"
    
  5. Update the PVC with the new VolumeAttributesClass and apply:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: test-pv-claim
    spec:
      storageClassName: csi-sc-example
      volumeAttributesClassName: gold
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 64Gi
    
  6. Verify that PersistentVolumeClaims has the updated VolumeAttributesClass parameters with:

    kubectl describe pvc <PVC_NAME>
    

Next steps

  • See the VolumeAttributesClass KEP for more information on the design
  • You can view or comment on the project board for VolumeAttributesClass
  • In order to move this feature towards beta, we need feedback from the community, so here's a call to action: add support to the CSI drivers, try out this feature, consider how it can help with problems that your users are having…

Getting involved

We always welcome new contributors. So, if you would like to get involved, you can join our Kubernetes Storage Special Interest Group (SIG).

If you would like to share feedback, you can do so on our public Slack channel.

Special thanks to all the contributors that provided great reviews, shared valuable insight and helped implement this feature (alphabetical order):

  • Baofa Fan (calory)
  • Ben Swartzlander (bswartz)
  • Connor Catlett (ConnorJC3)
  • Hemant Kumar (gnufied)
  • Jan Šafránek (jsafrane)
  • Joe Betz (jpbetz)
  • Jordan Liggitt (liggitt)
  • Matthew Cary (mattcary)
  • Michelle Au (msau42)
  • Xing Yang (xing-yang)

Kubernetes 1.29: Cloud Provider Integrations Are Now Separate Components

For Kubernetes v1.29, you need to use additional components to integrate your Kubernetes cluster with a cloud infrastructure provider. By default, Kubernetes v1.29 components abort if you try to specify integration with any cloud provider using one of the legacy compiled-in cloud provider integrations. If you want to use a legacy integration, you have to opt back in - and a future release will remove even that option.

In 2018, the Kubernetes community agreed to form the Cloud Provider Special Interest Group (SIG), with a mission to externalize all cloud provider integrations and remove all the existing in-tree cloud provider integrations. In January 2019, the Kubernetes community approved the initial draft of KEP-2395: Removing In-Tree Cloud Provider Code. This KEP defines a process by which we can remove cloud provider specific code from the core Kubernetes source tree. From the KEP:

Motiviation [sic] behind this effort is to allow cloud providers to develop and make releases independent from the core Kubernetes release cycle. The de-coupling of cloud provider code allows for separation of concern between "Kubernetes core" and the cloud providers within the ecosystem. In addition, this ensures all cloud providers in the ecosystem are integrating with Kubernetes in a consistent and extendable way.

After many years of development and collaboration across many contributors, the default behavior for legacy cloud provider integrations is changing. This means that users will need to confirm their Kubernetes configurations, and in some cases run external cloud controller managers. These changes are taking effect in Kubernetes version 1.29; read on to learn if you are affected and what changes you will need to make.

These updated default settings affect a large proportion of Kubernetes users, and will require changes for users who were previously using the in-tree provider integrations. The legacy integrations offered compatibility with Azure, AWS, GCE, OpenStack, and vSphere; however for AWS and OpenStack the compiled-in integrations were removed in Kubernetes versions 1.27 and 1.26, respectively.

What has changed?

At the most basic level, two feature gates are changing their default value from false to true. Those feature gates, DisableCloudProviders and DisableKubeletCloudCredentialProviders, control the way that the kube-apiserver, kube-controller-manager, and kubelet invoke the cloud provider related code that is included in those components. When these feature gates are true (the default), the only recognized value for the --cloud-provider command line argument is external.

Let's see what the official Kubernetes documentation says about these feature gates:

DisableCloudProviders: Disables any functionality in kube-apiserver, kube-controller-manager and kubelet related to the --cloud-provider component flag.

DisableKubeletCloudCredentialProviders: Disable the in-tree functionality in kubelet to authenticate to a cloud provider container registry for image pull credentials.

The next stage beyond beta will be full removal; for that release onwards, you won't be able to override those feature gates back to false.

What do you need to do?

If you are upgrading from Kubernetes 1.28+ and are not on Azure, GCE, or vSphere then there are no changes you will need to make. If you are on Azure, GCE, or vSphere, or you are upgrading from a version older than 1.28, then read on.

Historically, Kubernetes has included code for a set of cloud providers that included AWS, Azure, GCE, OpenStack, and vSphere. Since the inception of KEP-2395 the community has been moving towards removal of that cloud provider code. The OpenStack provider code was removed in version 1.26, and the AWS provider code was removed in version 1.27. This means that users who are upgrading from one of the affected cloud providers and versions will need to modify their deployments.

Upgrading on Azure, GCE, or vSphere

There are two options for upgrading in this configuration: migrate to external cloud controller managers, or continue using the in-tree provider code. Although migrating to external cloud controller managers is recommended, there are scenarios where continuing with the current behavior is desired. Please choose the best option for your needs.

Migrate to external cloud controller managers

Migrating to use external cloud controller managers is the recommended upgrade path, when possible in your situation. To do this you will need to enable the --cloud-provider=external command line flag for the kube-apiserver, kube-controller-manager, and kubelet components. In addition you will need to deploy a cloud controller manager for your provider.

Installing and running cloud controller managers is a larger topic than this post can address; if you would like more information on this process please read the documentation for Cloud Controller Manager Administration and Migrate Replicated Control Plane To Use Cloud Controller Manager. See below for links to specific cloud provider implementations.

Continue using in-tree provider code

If you wish to continue using Kubernetes with the in-tree cloud provider code, you will need to modify the command line parameters for kube-apiserver, kube-controller-manager, and kubelet to disable the feature gates for DisableCloudProviders and DisableKubeletCloudCredentialProviders. To do this, add the following command line flag to the arguments for the previously listed commands:

--feature-gates=DisableCloudProviders=false,DisableKubeletCloudCredentialProviders=false

Please note that if you have other feature gate modifications on the command line, they will need to include these 2 feature gates.

Note: These feature gates will be locked to true in an upcoming release. Setting these feature gates to false should be used as a last resort. It is highly recommended to migrate to an external cloud controller manager as the in-tree providers are planned for removal as early as Kubernetes version 1.31.

Upgrading on other providers

For providers other than Azure, GCE, or vSphere, good news, the external cloud controller manager should already be in use. You can confirm this by inspecting the --cloud-provider flag for the kubelets in your cluster, they will have the value external if using external providers. The code for AWS and OpenStack providers was removed from Kubernetes before version 1.27 was released. Other providers beyond the AWS, Azure, GCE, OpenStack, and vSphere were never included in Kubernetes and as such they began their life as external cloud controller managers.

Upgrading from older Kubernetes versions

If you are upgrading from a Kubernetes release older than 1.26, and you are on AWS, Azure, GCE, OpenStack, or vSphere then you will need to enable the --cloud-provider=external flag, and follow the advice for installing and running a cloud controller manager for your provider.

Please read the documentation for Cloud Controller Manager Administration and Migrate Replicated Control Plane To Use Cloud Controller Manager. See below for links to specific cloud provider implementations.

Where to find a cloud controller manager?

At its core, this announcement is about the cloud provider integrations that were previously included in Kubernetes. As these components move out of the core Kubernetes code and into their own repositories, it is important to note a few things:

First, SIG Cloud Provider offers a reference framework for developers who wish to create cloud controller managers for any provider. See the cloud-provider repository for more information about how these controllers work and how to get started creating your own.

Second, there are many cloud controller managers available for Kubernetes. This post is addressing the provider integrations that have been historically included with Kubernetes but are now in the process of being removed. If you need a cloud controller manager for your provider and do not see it listed here, please reach out to the cloud provider you are integrating with or the Kubernetes SIG Cloud Provider community for help and advice. It is worth noting that while most cloud controller managers are open source today, this may not always be the case. Users should always contact their cloud provider to learn if there are preferred solutions to utilize on their infrastructure.

Cloud provider integrations provided by the Kubernetes project

If you are looking for an automated approach to installing cloud controller managers in your clusters, the kOps project provides a convenient solution for managing production-ready clusters.

Want to learn more?

Cloud providers and cloud controller managers serve a core function in Kubernetes. Cloud providers are often the substrate upon which Kubernetes is operated, and the cloud controller managers supply the essential lifeline between Kubernetes clusters and their physical infrastructure.

This post covers one aspect of how the Kubernetes community interacts with the world of cloud infrastructure providers. If you are curious about this topic and want to learn more, the Cloud Provider Special Interest Group (SIG) is the place to go. SIG Cloud Provider hosts bi-weekly meetings to discuss all manner of topics related to cloud providers and cloud controller managers in Kubernetes.

SIG Cloud Provider

Kubernetes v1.29: Mandala

Editors: Carol Valencia, Kristin Martin, Abigail McCarthy, James Quigley

Announcing the release of Kubernetes v1.29: Mandala (The Universe), the last release of 2023!

Similar to previous releases, the release of Kubernetes v1.29 introduces new stable, beta, and alpha features. The consistent delivery of top-notch releases underscores the strength of our development cycle and the vibrant support from our community.

This release consists of 49 enhancements. Of those enhancements, 11 have graduated to Stable, 19 are entering Beta and 19 have graduated to Alpha.

Kubernetes v1.29: Mandala (The Universe) ✨🌌

Join us on a cosmic journey with Kubernetes v1.29!

This release is inspired by the beautiful art form that is Mandala—a symbol of the universe in its perfection. Our tight-knit universe of around 40 Release Team members, backed by hundreds of community contributors, has worked tirelessly to turn challenges into joy for millions worldwide.

The Mandala theme reflects our community’s interconnectedness—a vibrant tapestry woven by enthusiasts and experts alike. Each contributor is a crucial part, adding their unique energy, much like the diverse patterns in Mandala art. Kubernetes thrives on collaboration, echoing the harmony in Mandala creations.

The release logo, made by Mario Jason Braganza (base Mandala art, courtesy - Fibrel Ojalá), symbolizes the little universe that is the Kubernetes project and all its people.

In the spirit of Mandala’s transformative symbolism, Kubernetes v1.29 celebrates our project’s evolution. Like stars in the Kubernetes universe, each contributor, user, and supporter lights the way. Together, we create a universe of possibilities—one release at a time.

Improvements that graduated to stable in Kubernetes v1.29

This is a selection of some of the improvements that are now stable following the v1.29 release.

ReadWriteOncePod PersistentVolume access mode (SIG Storage)

In Kubernetes, volume access modes are the way you can define how durable storage is consumed. These access modes are a part of the spec for PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs). When using storage, there are different ways to model how that storage is consumed. For example, a storage system like a network file share can have many users all reading and writing data simultaneously. In other cases maybe everyone is allowed to read data but not write it. For highly sensitive data, maybe only one user is allowed to read and write data but nobody else.

Before v1.22, Kubernetes offered three access modes for PVs and PVCs:

  • ReadWriteOnce – the volume can be mounted as read-write by a single node
  • ReadOnlyMany – the volume can be mounted read-only by many nodes
  • ReadWriteMany – the volume can be mounted as read-write by many nodes

The ReadWriteOnce access mode restricts volume access to a single node, which means it is possible for multiple pods on the same node to read from and write to the same volume. This could potentially be a major problem for some applications, especially if they require at most one writer for data safety guarantees.

To address this problem, a fourth access mode ReadWriteOncePod was introduced as an Alpha feature in v1.22 for CSI volumes. If you create a pod with a PVC that uses the ReadWriteOncePod access mode, Kubernetes ensures that pod is the only pod across your whole cluster that can read that PVC or write to it. In v1.29, this feature became Generally Available.

Node volume expansion Secret support for CSI drivers (SIG Storage)

In Kubernetes, a volume expansion operation may include the expansion of the volume on the node, which involves filesystem resize. Some CSI drivers require secrets, for example a credential for accessing a SAN fabric, during the node expansion for the following use cases:

  • When a PersistentVolume represents encrypted block storage, for example using LUKS, you may need to provide a passphrase in order to expand the device.
  • For various validations, the CSI driver needs to have credentials to communicate with the backend storage system at time of node expansion.

To meet this requirement, the CSI Node Expand Secret feature was introduced in Kubernetes v1.25. This allows an optional secret field to be sent as part of the NodeExpandVolumeRequest by the CSI drivers so that node volume expansion operation can be performed with the underlying storage system. In Kubernetes v1.29, this feature became generally available.

KMS v2 encryption at rest generally available (SIG Auth)

One of the first things to consider when securing a Kubernetes cluster is encrypting persisted API data at rest. KMS provides an interface for a provider to utilize a key stored in an external key service to perform this encryption. With the Kubernetes v1.29, KMS v2 has become a stable feature bringing numerous improvements in performance, key rotation, health check & status, and observability. These enhancements provide users with a reliable solution to encrypt all resources in their Kubernetes clusters. You can read more about this in KEP-3299.

It is recommended to use KMS v2. KMS v1 feature gate is disabled by default. You will have to opt in to continue to use it.

Improvements that graduated to beta in Kubernetes v1.29

This is a selection of some of the improvements that are now beta following the v1.29 release.

The throughput of the scheduler is our eternal challenge. This QueueingHint feature brings a new possibility to optimize the efficiency of requeueing, which could reduce useless scheduling retries significantly.

Node lifecycle separated from taint management (SIG Scheduling)

As title describes, it's to decouple TaintManager that performs taint-based pod eviction from NodeLifecycleController and make them two separate controllers: NodeLifecycleController to add taints to unhealthy nodes and TaintManager to perform pod deletion on nodes tainted with NoExecute effect.

Clean up for legacy Secret-based ServiceAccount tokens (SIG Auth)

Kubernetes switched to using more secure service account tokens, which were time-limited and bound to specific pods by 1.22. Stopped auto-generating legacy secret-based service account tokens in 1.24. Then started labeling remaining auto-generated secret-based tokens still in use with their last-used date in 1.27.

In v1.29, to reduce potential attack surface, the LegacyServiceAccountTokenCleanUp feature labels legacy auto-generated secret-based tokens as invalid if they have not been used for a long time (1 year by default), and automatically removes them if use is not attempted for a long time after being marked as invalid (1 additional year by default). KEP-2799

New alpha features

Define Pod affinity or anti-affinity using matchLabelKeys (SIG Scheduling)

One enhancement will be introduced in PodAffinity/PodAntiAffinity as alpha. It will increase the accuracy of calculation during rolling updates.

nftables backend for kube-proxy (SIG Network)

The default kube-proxy implementation on Linux is currently based on iptables. This was the preferred packet filtering and processing system in the Linux kernel for many years (starting with the 2.4 kernel in 2001). However, unsolvable problems with iptables led to the development of a successor, nftables. Development on iptables has mostly stopped, with new features and performance improvements primarily going into nftables instead.

This feature adds a new backend to kube-proxy based on nftables, since some Linux distributions already started to deprecate and remove iptables, and nftables claims to solve the main performance problems of iptables.

APIs to manage IP address ranges for Services (SIG Network)

Services are an abstract way to expose an application running on a set of Pods. Services can have a cluster-scoped virtual IP address, that is allocated from a predefined CIDR defined in the kube-apiserver flags. However, users may want to add, remove, or resize existing IP ranges allocated for Services without having to restart the kube-apiserver.

This feature implements a new allocator logic that uses 2 new API Objects: ServiceCIDR and IPAddress, allowing users to dynamically increase the number of Services IPs available by creating new ServiceCIDRs. This helps to resolve problems like IP exhaustion or IP renumbering.

Add support to containerd/kubelet/CRI to support image pull per runtime class (SIG Windows)

Kubernetes v1.29 adds support to pull container images based on the RuntimeClass of the Pod that uses them. This feature is off by default in v1.29 under a feature gate called RuntimeClassInImageCriApi.

Container images can either be a manifest or an index. When the image being pulled is an index (image index has a list of image manifests ordered by platform), platform matching logic in the container runtime is used to pull an appropriate image manifest from the index. By default, the platform matching logic picks a manifest that matches the host that the image pull is being executed from. This can be limiting for VM-based containers where a user could pull an image with the intention of running it as a VM-based container, for example, Windows Hyper-V containers.

The image pull per runtime class feature adds support to pull different images based the runtime class specified. This is achieved by referencing an image by a tuple of (imageID, runtimeClass), instead of just the imageName or imageID. Container runtimes could choose to add support for this feature if they'd like. If they do not, the default behavior of kubelet that existed prior to Kubernetes v1.29 will be retained.

In-place updates for Pod resources, for Windows Pods (SIG Windows)

As an alpha feature, Kubernetes Pods can be mutable with respect to their resources, allowing users to change the desired resource requests and limits for a Pod without the need to restart the Pod. With v1.29, this feature is now supported for Windows containers.

Graduations, deprecations and removals for Kubernetes v1.29

Graduated to stable

This lists all the features that graduated to stable (also known as general availability). For a full list of updates including new features and graduations from alpha to beta, see the release notes.

This release includes a total of 11 enhancements promoted to Stable:

Deprecations and removals

Removal of in-tree integrations with cloud providers (SIG Cloud Provider)

Kubernetes v1.29 defaults to operating without a built-in integration to any cloud provider. If you have previously been relying on in-tree cloud provider integrations (with Azure, GCE, or vSphere) then you can either:

  • enable an equivalent external cloud controller manager integration (recommended)
  • opt back in to the legacy integration by setting the associated feature gates to false; the feature gates to change are DisableCloudProviders and DisableKubeletCloudCredentialProviders

Enabling external cloud controller managers means you must run a suitable cloud controller manager within your cluster's control plane; it also requires setting the command line argument --cloud-provider=external for the kubelet (on every relevant node), and across the control plane (kube-apiserver and kube-controller-manager).

For more information about how to enable and run external cloud controller managers, read Cloud Controller Manager Administration and Migrate Replicated Control Plane To Use Cloud Controller Manager.

If you need a cloud controller manager for one of the legacy in-tree providers, please see the following links:

There are more details in KEP-2395.

Removal of the v1beta2 flow control API group

The deprecated flowcontrol.apiserver.k8s.io/v1beta2 API version of FlowSchema and PriorityLevelConfiguration are no longer served in Kubernetes v1.29.

If you have manifests or client software that uses the deprecated beta API group, you should change these before you upgrade to v1.29. See the deprecated API migration guide for details and advice.

Deprecation of the status.nodeInfo.kubeProxyVersion field for Node

The .status.kubeProxyVersion field for Node objects is now deprecated, and the Kubernetes project is proposing to remove that field in a future release. The deprecated field is not accurate and has historically been managed by kubelet - which does not actually know the kube-proxy version, or even whether kube-proxy is running.

If you've been using this field in client software, stop - the information isn't reliable and the field is now deprecated.

Legacy Linux package repositories

Please note that in August of 2023, the legacy package repositories (apt.kubernetes.io and yum.kubernetes.io) were formally deprecated and the Kubernetes project announced the general availability of the community-owned package repositories for Debian and RPM packages, available at https://pkgs.k8s.io.

These legacy repositories were frozen in September of 2023, and will go away entirely in January of 2024. If you are currently relying on them, you must migrate.

This deprecation is not directly related to the v1.29 release. For more details, including how these changes may affect you and what to do if you are affected, please read the legacy package repository deprecation announcement.

Release notes

Check out the full details of the Kubernetes v1.29 release in our release notes.

Availability

Kubernetes v1.29 is available for download on GitHub. To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using minikube. You can also easily install v1.29 using kubeadm.

Release team

Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.

We would like to thank the entire release team for the hours spent hard at work to deliver the Kubernetes v1.29 release for our community. A very special thanks is in order for our release lead, Priyanka Saggu, for supporting and guiding us through a successful release cycle, making sure that we could all contribute in the best way possible, and challenging us to improve the release process.

Project velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

In the v1.29 release cycle, which ran for 14 weeks (September 6 to December 13), we saw contributions from 888 companies and 1422 individuals.

Ecosystem updates

  • KubeCon + CloudNativeCon Europe 2024 will take in Paris, France, from 19 – 22 March 2024! You can find more information about the conference and registration on the event site.

Upcoming release webinar

Join members of the Kubernetes v1.29 release team on Friday, December 15th, 2023, at 11am PT (2pm eastern) to learn about the major features of this release, as well as deprecations and removals to help plan for upgrades. For more information and registration, visit the event page on the CNCF Online Programs site.

Get involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below. Thank you for your continued feedback and support.

New Experimental Features in Gateway API v1.0

Recently, the Gateway API announced its v1.0 GA release, marking a huge milestone for the project.

Along with stabilizing some of the core functionality in the API, a number of exciting new experimental features have been added.

Backend TLS Policy

BackendTLSPolicy is a new Gateway API type used for specifying the TLS configuration of the connection from the Gateway to backend Pods via the Service API object. It is specified as a Direct PolicyAttachment without defaults or overrides, applied to a Service that accesses a backend, where the BackendTLSPolicy resides in the same namespace as the Service to which it is applied. All Gateway API Routes that point to a referenced Service should respect a configured BackendTLSPolicy.

While there were existing ways provided for TLS to be configured for edge and passthrough termination, this new API object specifically addresses the configuration of TLS in order to convey HTTPS from the Gateway dataplane to the backend. This is referred to as "backend TLS termination" and enables the Gateway to know how to connect to a backend Pod that has its own certificate.

Termination Types

The specification of a BackendTLSPolicy consists of:

  • targetRef - Defines the targeted API object of the policy. Only Service is allowed.
  • tls - Defines the configuration for TLS, including hostname, caCertRefs, and wellKnownCACerts. Either caCertRefs or wellKnownCACerts may be specified, but not both.
  • hostname - Defines the Server Name Indication (SNI) that the Gateway uses to connect to the backend. The certificate served by the backend must match this SNI.
  • caCertRefs - Defines one or more references to objects that contain PEM-encoded TLS certificates, which are used to establish a TLS handshake between the Gateway and backend.
  • wellKnownCACerts - Specifies whether or not system CA certificates may be used in the TLS handshake between the Gateway and backend.

Examples

Using System Certificates

In this example, the BackendTLSPolicy is configured to use system certificates to connect with a TLS-encrypted upstream connection where Pods backing the dev Service are expected to serve a valid certificate for dev.example.com.

apiVersion: gateway.networking.k8s.io/v1alpha2
kind: BackendTLSPolicy
metadata:
  name: tls-upstream-dev
spec:
  targetRef:
    kind: Service
    name: dev-service
    group: ""
  tls:
    wellKnownCACerts: "System"
    hostname: dev.example.com

Using Explicit CA Certificates

In this example, the BackendTLSPolicy is configured to use certificates defined in the configuration map auth-cert to connect with a TLS-encrypted upstream connection where Pods backing the auth Service are expected to serve a valid certificate for auth.example.com.

apiVersion: gateway.networking.k8s.io/v1alpha2
kind: BackendTLSPolicy
metadata:
  name: tls-upstream-auth
spec:
  targetRef:
    kind: Service
    name: auth-service
    group: ""
  tls:
    caCertRefs:
      - kind: ConfigMapReference
        name: auth-cert
        group: ""
    hostname: auth.example.com

The following illustrates a BackendTLSPolicy that configures TLS for a Service serving a backend:

flowchart LR client(["client"]) gateway["Gateway"] style gateway fill:#02f,color:#fff httproute["HTTP
Route"] style httproute fill:#02f,color:#fff service["Service"] style service fill:#02f,color:#fff pod1["Pod"] style pod1 fill:#02f,color:#fff pod2["Pod"] style pod2 fill:#02f,color:#fff client -.->|HTTP
request| gateway gateway --> httproute httproute -.->|BackendTLSPolicy|service service --> pod1 & pod2

For more information, refer to the documentation for TLS.

HTTPRoute Timeouts

A key enhancement in Gateway API's latest release (v1.0) is the introduction of the timeouts field within HTTPRoute Rules. This feature offers a dynamic way to manage timeouts for incoming HTTP requests, adding precision and reliability to your gateway setups.

With Timeouts, developers can fine-tune their Gateway API's behavior in two fundamental ways:

  1. Request Timeout:

    The request timeout is the duration within which the Gateway API implementation must send a response to a client's HTTP request. It allows flexibility in specifying when this timeout starts, either before or after the entire client request stream is received, making it implementation-specific. This timeout efficiently covers the entire request-response transaction, enhancing the responsiveness of your services.

  2. Backend Request Timeout:

    The backendRequest timeout is a game-changer for those dealing with backends. It sets a timeout for a single request sent from the Gateway to a backend service. This timeout spans from the initiation of the request to the reception of the full response from the backend. This feature is particularly helpful in scenarios where the Gateway needs to retry connections to a backend, ensuring smooth communication under various conditions.

Notably, the request timeout encompasses the backendRequest timeout. Hence, the value of backendRequest should never exceed the value of the request timeout.

The ability to configure these timeouts adds a new layer of reliability to your Kubernetes services. Whether it's ensuring client requests are processed within a specified timeframe or managing backend service communications, Gateway API's Timeouts offer the control and predictability you need.

To get started, you can define timeouts in your HTTPRoute Rules using the Timeouts field, specifying their type as Duration. A zero-valued timeout (0s) disables the timeout, while a valid non-zero-valued timeout should be at least 1ms.

Here's an example of setting request and backendRequest timeouts in an HTTPRoute:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: timeout-example
spec:
  parentRefs:
  - name: example-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /timeout
    timeouts:
      request: 10s
      backendRequest: 2s
    backendRefs:
    - name: timeout-svc
      port: 8080

In this example, a request timeout of 10 seconds is defined, ensuring that client requests are processed within that timeframe. Additionally, a 2-second backendRequest timeout is set for individual requests from the Gateway to a backend service called timeout-svc.

These new HTTPRoute Timeouts provide Kubernetes users with more control and flexibility in managing network communications, helping ensure a smoother and more predictable experience for both clients and backends. For additional details and examples, refer to the official timeouts API documentation.

Gateway Infrastructure Labels

While Gateway API providers a common API for different implementations, each implementation will have different resources created under-the-hood to apply users' intent. This could be configuring cloud load balancers, creating in-cluster Pods and Services, or more.

While the API has always provided an extension point -- parametersRef in GatewayClass -- to customize implementation specific things, there was no common core way to express common infrastructure customizations.

Gateway API v1.0 paves the way for this with a new infrastructure field on the Gateway object, allowing customization of the underlying infrastructure. For now, this starts small with two critical fields: labels and annotations. When these are set, any generated infrastructure will have the provided labels and annotations set on them.

For example, I may want to group all my resources for one application together:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: hello-world
spec:
  infrastructure:
    labels:
      app.kubernetes.io/name: hello-world

In the future, we are looking into more common infrastructure configurations, such as resource sizing.

For more information, refer to the documentation for this feature.

Support for Websockets, HTTP/2 and more!

Not all implementations of Gateway API support automatic protocol selection. In some cases protocols are disabled without an explicit opt-in.

When a Route's backend references a Kubernetes Service, application developers can specify the protocol using ServicePort appProtocol field.

For example the following store Kubernetes Service is indicating the port 8080 supports HTTP/2 Prior Knowledge.

apiVersion: v1
kind: Service
metadata:
  name: store
spec:
  selector:
    app: store
  ports:
  - protocol: TCP
    appProtocol: kubernetes.io/h2c
    port: 8080
    targetPort: 8080

Currently, Gateway API has conformance testing for:

  • kubernetes.io/h2c - HTTP/2 Prior Knowledge
  • kubernetes.io/ws - WebSocket over HTTP

For more information, refer to the documentation for Backend Protocol Selection.

gwctl, our new Gateway API command line tool

gwctl is a command line tool that aims to be a kubectl replacement for viewing Gateway API resources.

The initial release of gwctl that comes bundled with Gateway v1.0 release includes helpful features for managing Gateway API Policies. Gateway API Policies serve as powerful extension mechanisms for modifying the behavior of Gateway resources. One challenge with using policies is that it may be hard to discover which policies are affecting which Gateway resources. gwctl helps bridge this gap by answering questions like:

  • Which policies are available for use in the Kubernetes cluster?
  • Which policies are attached to a particular Gateway, HTTPRoute, etc?
  • If policies are applied to multiple resources in the Gateway resource hierarchy, what is the effective policy that is affecting a particular resource? (For example, if an HTTP request timeout policy is applied to both an HTTPRoute and its parent Gateway, what is the effective timeout for the HTTPRoute?)

gwctl is still in the very early phases of development and hence may be a bit rough around the edges. Follow the instructions in the repository to install and try out gwctl.

Examples

Here are some examples of how gwctl can be used:

# List all policies in the cluster. This will also give the resource they bind
# to.
gwctl get policies -A
# List all available policy types.
gwctl get policycrds
# Describe all HTTPRoutes in namespace ns2. (Output includes effective policies)
gwctl describe httproutes -n ns2
# Describe a single HTTPRoute in the default namespace. (Output includes
# effective policies)
gwctl describe httproutes my-httproute-1
# Describe all Gateways across all namespaces. (Output includes effective
# policies)
gwctl describe gateways -A
# Describe a single GatewayClass. (Output includes effective policies)
gwctl describe gatewayclasses foo-com-external-gateway-class

Get involved

These projects, and many more, continue to be improved in Gateway API. There are lots of opportunities to get involved and help define the future of Kubernetes routing APIs for both Ingress and Mesh.

If this is interesting to you, please join us in the community and help us build the future of Gateway API together!

Spotlight on SIG Testing

Welcome to another edition of the SIG spotlight blog series, where we highlight the incredible work being done by various Special Interest Groups (SIGs) within the Kubernetes project. In this edition, we turn our attention to SIG Testing, a group interested in effective testing of Kubernetes and automating away project toil. SIG Testing focus on creating and running tools and infrastructure that make it easier for the community to write and run tests, and to contribute, analyze and act upon test results.

To gain some insights into SIG Testing, Sandipan Panda spoke with Michelle Shepardson, a senior software engineer at Google and a chair of SIG Testing, and Patrick Ohly, a software engineer and architect at Intel and a SIG Testing Tech Lead.

Meet the contributors

Sandipan: Could you tell us a bit about yourself, your role, and how you got involved in the Kubernetes project and SIG Testing?

Michelle: Hi! I'm Michelle, a senior software engineer at Google. I first got involved in Kubernetes through working on tooling for SIG Testing, like the external instance of TestGrid. I'm part of oncall for TestGrid and Prow, and am now a chair for the SIG.

Patrick: Hello! I work as a software engineer and architect in a team at Intel which focuses on open source Cloud Native projects. When I ramped up on Kubernetes to develop a storage driver, my very first question was "how do I test it in a cluster and how do I log information?" That interest led to various enhancement proposals until I had (re)written enough code that also took over official roles as SIG Testing Tech Lead (for the E2E framework) and structured logging WG lead.

Testing practices and tools

Sandipan: Testing is a field in which multiple approaches and tools exist; how did you arrive at the existing practices?

Patrick: I can’t speak about the early days because I wasn’t around yet 😆, but looking back at some of the commit history it’s pretty obvious that developers just took what was available and started using it. For E2E testing, that was Ginkgo+Gomega. Some hacks were necessary, for example around cleanup after a test run and for categorising tests. Eventually this led to Ginkgo v2 and revised best practices for E2E testing. Regarding unit testing opinions are pretty diverse: some maintainers prefer to use just the Go standard library with hand-written checks. Others use helper packages like stretchr/testify. That diversity is okay because unit tests are self-contained - contributors just have to be flexible when working on many different areas. Integration testing falls somewhere in the middle. It’s based on Go unit tests, but needs complex helper packages to bring up an apiserver and other components, then runs tests that are more like E2E tests.

Subprojects owned by SIG Testing

Sandipan: SIG Testing is pretty diverse. Can you give a brief overview of the various subprojects owned by SIG Testing?

Michelle: Broadly, we have subprojects related to testing frameworks, and infrastructure, though they definitely overlap. So for the former, there's e2e-framework (used externally), test/e2e/framework (used for Kubernetes itself) and kubetest2 for end-to-end testing, as well as boskos (resource rental for e2e tests), KIND (Kubernetes-in-Docker, for local testing and development), and the cloud provider for KIND. For the latter, there's Prow (K8s-based CI/CD and chatops), and a litany of other tools and utilities for triage, analysis, coverage, Prow/TestGrid config generation, and more in the test-infra repo.

If you are willing to learn more and get involved with any of the SIG Testing subprojects, check out the SIG Testing README.

Key challenges and accomplishments

Sandipan: What are some of the key challenges you face?

Michelle: Kubernetes is a gigantic project in every aspect, from contributors to code to users and more. Testing and infrastructure have to meet that scale, keeping up with every change from every repo under Kubernetes while facilitating developing, improving, and releasing the project as much as possible, though of course, we're not the only SIG involved in that. I think another other challenge is staffing subprojects. SIG Testing has a number of subprojects that have existed for years, but many of the original maintainers for them have moved on to other areas or no longer have the time to maintain them. We need to grow long-term expertise and owners in those subprojects.

Patrick: As Michelle said, the sheer size can be a challenge. It’s not just the infrastructure, also our processes must scale with the number of contributors. It’s good to document best practices, but not good enough: we have many new contributors, which is good, but having reviewers explain best practices doesn’t scale - assuming that the reviewers even know about them! It also doesn’t help that existing code cannot get updated immediately because there is so much of it, in particular for E2E testing. The initiative to apply stricter linting to new or modified code while accepting that existing code doesn’t pass those same linter checks helps a bit.

Sandipan: Any SIG accomplishments that you are proud of and would like to highlight?

Patrick: I am biased because I have been driving this, but I think that the E2E framework and linting are now in a much better shape than they used to be. We may soon be able to run integration tests with race detection enabled, which is important because we currently only have that for unit tests and those tend to be less complex.

Sandipan: Testing is always important, but is there anything specific to your work in terms of the Kubernetes release process?

Patrick: test flakes… if we have too many of those, development velocity goes down because PRs cannot be merged without clean test runs and those become less likely. Developers also lose trust in testing and just "retest" until they have a clean run, without checking whether failures might indeed be related to a regression in their current change.

The people and the scope

Sandipan: What are some of your favourite things about this SIG?

Michelle: The people, of course 🙂. Aside from that, I like the broad scope SIG Testing has. I feel like even small changes can make a big difference for fellow contributors, and even if my interests change over time, I'll never run out of projects to work on.

Patrick: I can work on things that make my life and the life of my fellow developers better, like the tooling that we have to use every day while working on some new feature elsewhere.

Sandipan: Are there any funny / cool / TIL anecdotes that you could tell us?

Patrick: I started working on E2E framework enhancements five years ago, then was less active there for a while. When I came back and wanted to test some new enhancement, I asked about how to write unit tests for the new code and was pointed to some existing tests which looked vaguely familiar, as if I had seen them before. I looked at the commit history and found that I had written them! I’ll let you decide whether that says something about my failing long-term memory or simply is normal… Anyway, folks, remember to write good commit messages and comments; someone will need them at some point - it might even be yourself!

Looking ahead

Sandipan: What areas and/or subprojects does your SIG need help with?

Michelle: Some subprojects aren't staffed at the moment and could use folks willing to learn more about them. boskos and kubetest2 especially stand out to me, since both are important for testing but lack dedicated owners.

Sandipan: Are there any useful skills that new contributors to SIG Testing can bring to the table? What are some things that people can do to help this SIG if they come from a background that isn’t directly linked to programming?

Michelle: I think user empathy, writing clear feedback, and recognizing patterns are really useful. Someone who uses the test framework or tooling and can outline pain points with clear examples, or who can recognize a wider issue in the project and pull data to inform solutions for it.

Sandipan: What’s next for SIG Testing?

Patrick: Stricter linting will soon become mandatory for new code. There are several E2E framework sub-packages that could be modernised, if someone wants to take on that work. I also see an opportunity to unify some of our helper code for E2E and integration testing, but that needs more thought and discussion.

Michelle: I'm looking forward to making some usability improvements for some of our tools and infra, and to supporting more long-term contributions and growth of contributors into long-term roles within the SIG. If you're interested, hit us up!

Looking ahead, SIG Testing has exciting plans in store. You can get in touch with the folks at SIG Testing in their Slack channel or attend one of their regular bi-weekly meetings on Tuesdays. If you are interested in making it easier for the community to run tests and contribute test results, to ensure Kubernetes is stable across a variety of cluster configurations and cloud providers, join the SIG Testing community today!

Kubernetes Removals, Deprecations, and Major Changes in Kubernetes 1.29

As with every release, Kubernetes v1.29 will introduce feature deprecations and removals. Our continued ability to produce high-quality releases is a testament to our robust development cycle and healthy community. The following are some of the deprecations and removals coming in the Kubernetes 1.29 release.

The Kubernetes API removal and deprecation process

The Kubernetes project has a well-documented deprecation policy for features. This policy states that stable APIs may only be deprecated when a newer, stable version of that same API is available and that APIs have a minimum lifetime for each stability level. A deprecated API is one that has been marked for removal in a future Kubernetes release; it will continue to function until removal (at least one year from the deprecation), but usage will result in a warning being displayed. Removed APIs are no longer available in the current version, at which point you must migrate to using the replacement.

  • Generally available (GA) or stable API versions may be marked as deprecated, but must not be removed within a major version of Kubernetes.
  • Beta or pre-release API versions must be supported for 3 releases after deprecation.
  • Alpha or experimental API versions may be removed in any release without prior deprecation notice.

Whether an API is removed as a result of a feature graduating from beta to stable or because that API simply did not succeed, all removals comply with this deprecation policy. Whenever an API is removed, migration options are communicated in the documentation.

A note about the k8s.gcr.io redirect to registry.k8s.io

To host its container images, the Kubernetes project uses a community-owned image registry called registry.k8s.io. Starting last March traffic to the old k8s.gcr.io registry began being redirected to registry.k8s.io. The deprecated k8s.gcr.io registry will eventually be phased out. For more details on this change or to see if you are impacted, please read k8s.gcr.io Redirect to registry.k8s.io - What You Need to Know.

A note about the Kubernetes community-owned package repositories

Earlier in 2023, the Kubernetes project introduced pkgs.k8s.io, community-owned software repositories for Debian and RPM packages. The community-owned repositories replaced the legacy Google-owned repositories (apt.kubernetes.io and yum.kubernetes.io). On September 13, 2023, those legacy repositories were formally deprecated and their contents frozen.

For more information on this change or to see if you are impacted, please read the deprecation announcement.

Deprecations and removals for Kubernetes v1.29

See the official list of API removals for a full list of planned deprecations for Kubernetes v1.29.

Removal of in-tree integrations with cloud providers (KEP-2395)

The feature gates DisableCloudProviders and DisableKubeletCloudCredentialProviders will both be set to true by default for Kubernetes v1.29. This change will require that users who are currently using in-tree cloud provider integrations (Azure, GCE, or vSphere) enable external cloud controller managers, or opt in to the legacy integration by setting the associated feature gates to false.

Enabling external cloud controller managers means you must run a suitable cloud controller manager within your cluster's control plane; it also requires setting the command line argument --cloud-provider=external for the kubelet (on every relevant node), and across the control plane (kube-apiserver and kube-controller-manager).

For more information about how to enable and run external cloud controller managers, read Cloud Controller Manager Administration and Migrate Replicated Control Plane To Use Cloud Controller Manager.

For general information about cloud controller managers, please see Cloud Controller Manager in the Kubernetes documentation.

Removal of the v1beta2 flow control API group

The flowcontrol.apiserver.k8s.io/v1beta2 API version of FlowSchema and PriorityLevelConfiguration will no longer be served in Kubernetes v1.29.

To prepare for this, you can edit your existing manifests and rewrite client software to use the flowcontrol.apiserver.k8s.io/v1beta3 API version, available since v1.26. All existing persisted objects are accessible via the new API. Notable changes in flowcontrol.apiserver.k8s.io/v1beta3 include that the PriorityLevelConfiguration spec.limited.assuredConcurrencyShares field was renamed to spec.limited.nominalConcurrencyShares.

Deprecation of the status.nodeInfo.kubeProxyVersion field for Node

The .status.kubeProxyVersion field for Node objects will be marked as deprecated in v1.29 in preparation for its removal in a future release. This field is not accurate and is set by kubelet, which does not actually know the kube-proxy version, or even if kube-proxy is running.

Want to know more?

Deprecations are announced in the Kubernetes release notes. You can see the announcements of pending deprecations in the release notes for:

We will formally announce the deprecations that come with Kubernetes v1.29 as part of the CHANGELOG for that release.

For information on the deprecation and removal process, refer to the official Kubernetes deprecation policy document.

The Case for Kubernetes Resource Limits: Predictability vs. Efficiency

There’s been quite a lot of posts suggesting that not using Kubernetes resource limits might be a fairly useful thing (for example, For the Love of God, Stop Using CPU Limits on Kubernetes or Kubernetes: Make your services faster by removing CPU limits ). The points made there are totally valid – it doesn’t make much sense to pay for compute power that will not be used due to limits, nor to artificially increase latency. This post strives to argue that limits have their legitimate use as well.

As a Site Reliability Engineer on the Grafana Labs platform team, which maintains and improves internal infrastructure and tooling used by the product teams, I primarily try to make Kubernetes upgrades as smooth as possible. But I also spend a lot of time going down the rabbit hole of various interesting Kubernetes issues. This article reflects my personal opinion, and others in the community may disagree.

Let’s flip the problem upside down. Every pod in a Kubernetes cluster has inherent resource limits – the actual CPU, memory, and other resources of the machine it’s running on. If those physical limits are reached by a pod, it will experience throttling similar to what is caused by reaching Kubernetes limits.

The problem

Pods without (or with generous) limits can easily consume the extra resources on the node. This, however, has a hidden cost – the amount of extra resources available often heavily depends on pods scheduled on the particular node and their actual load. These extra resources make each pod a special snowflake when it comes to real resource allocation. Even worse, it’s fairly hard to figure out the resources that the pod had at its disposal at any given moment – certainly not without unwieldy data mining of pods running on a particular node, their resource consumption, and similar. And finally, even if we pass this obstacle, we can only have data sampled up to a certain rate and get profiles only for a certain fraction of our calls. This can be scaled up, but the amount of observability data generated might easily reach diminishing returns. Thus, there’s no easy way to tell if a pod had a quick spike and for a short period of time used twice as much memory as usual to handle a request burst.

Now, with Black Friday and Cyber Monday approaching, businesses expect a surge in traffic. Good performance data/benchmarks of the past performance allow businesses to plan for some extra capacity. But is data about pods without limits reliable? With memory or CPU instant spikes handled by the extra resources, everything might look good according to past data. But once the pod bin-packing changes and the extra resources get more scarce, everything might start looking different – ranging from request latencies rising negligibly to requests slowly snowballing and causing pod OOM kills. While almost no one actually cares about the former, the latter is a serious issue that requires instant capacity increase.

Configuring the limits

Not using limits takes a tradeoff – it opportunistically improves the performance if there are extra resources available, but lowers predictability of the performance, which might strike back in the future. There are a few approaches that can be used to increase the predictability again. Let’s pick two of them to analyze:

  • Configure workload limits to be a fixed (and small) percentage more than the requests – I'll call it fixed-fraction headroom. This allows the use of some extra shared resources, but keeps the per-node overcommit bound and can be taken to guide worst-case estimates for the workload. Note that the bigger the limits percentage is, the bigger the variance in the performance that might happen across the workloads.
  • Configure workloads with requests = limits. From some point of view, this is equivalent to giving each pod its own tiny machine with constrained resources; the performance is fairly predictable. This also puts the pod into the Guaranteed QoS class, which makes it get evicted only after BestEffort and Burstable pods have been evicted by a node under resource pressure (see Quality of Service for Pods).

Some other cases might also be considered, but these are probably the two simplest ones to discuss.

Cluster resource economy

Note that in both cases discussed above, we’re effectively preventing the workloads from using some cluster resources it has at the cost of getting more predictability – which might sound like a steep price to pay for a bit more stable performance. Let’s try to quantify the impact there.

Bin-packing and cluster resource allocation

Firstly, let’s discuss bin-packing and cluster resource allocation. There’s some inherent cluster inefficiency that comes to play – it’s hard to achieve 100% resource allocation in a Kubernetes cluster. Thus, some percentage will be left unallocated.

When configuring fixed-fraction headroom limits, a proportional amount of this will be available to the pods. If the percentage of unallocated resources in the cluster is lower than the constant we use for setting fixed-fraction headroom limits (see the figure, line 2), all the pods together are able to theoretically use up all the node’s resources; otherwise there are some resources that will inevitably be wasted (see the figure, line 1). In order to eliminate the inevitable resource waste, the percentage for fixed-fraction headroom limits should be configured so that it’s at least equal to the expected percentage of unallocated resources.

Chart displaying various requests/limits configurations

For requests = limits (see the figure, line 3), this does not hold: Unless we’re able to allocate all node’s resources, there’s going to be some inevitably wasted resources. Without any knobs to turn on the requests/limits side, the only suitable approach here is to ensure efficient bin-packing on the nodes by configuring correct machine profiles. This can be done either manually or by using a variety of cloud service provider tooling – for example Karpenter for EKS or GKE Node auto provisioning.

Optimizing actual resource utilization

Free resources also come in the form of unused resources of other pods (reserved vs. actual CPU utilization, etc.), and their availability can’t be predicted in any reasonable way. Configuring limits makes it next to impossible to utilize these. Looking at this from a different perspective, if a workload wastes a significant amount of resources it has requested, re-visiting its own resource requests might be a fair thing to do. Looking at past data and picking more fitting resource requests might help to make the packing more tight (although at the price of worsening its performance – for example increasing long tail latencies).

Conclusion

Optimizing resource requests and limits is hard. Although it’s much easier to break things when setting limits, those breakages might help prevent a catastrophe later by giving more insights into how the workload behaves in bordering conditions. There are cases where setting limits makes less sense: batch workloads (which are not latency-sensitive – for example non-live video encoding), best-effort services (don’t need that level of availability and can be preempted), clusters that have a lot of spare resources by design (various cases of specialty workloads – for example services that handle spikes by design).

On the other hand, setting limits shouldn’t be avoided at all costs – even though figuring out the "right” value for limits is harder and configuring a wrong value yields less forgiving situations. Configuring limits helps you learn about a workload’s behavior in corner cases, and there are simple strategies that can help when reasoning about the right value. It’s a tradeoff between efficient resource usage and performance predictability and should be considered as such.

There’s also an economic aspect of workloads with spiky resource usage. Having “freebie” resources always at hand does not serve as an incentive to improve performance for the product team. Big enough spikes might easily trigger efficiency issues or even problems when trying to defend a product’s SLA – and thus, might be a good candidate to mention when assessing any risks.

Introducing SIG etcd

Special Interest Groups (SIGs) are a fundamental part of the Kubernetes project, with a substantial share of the community activity happening within them. When the need arises, new SIGs can be created, and that was precisely what happened recently.

SIG etcd is the most recent addition to the list of Kubernetes SIGs. In this article we will get to know it a bit better, understand its origins, scope, and plans.

The critical role of etcd

If we look inside the control plane of a Kubernetes cluster, we will find etcd, a consistent and highly-available key value store used as Kubernetes' backing store for all cluster data -- this description alone highlights the critical role that etcd plays, and the importance of it within the Kubernetes ecosystem.

This critical role makes the health of the etcd project and community an important consideration, and concerns about the state of the project in early 2022 did not go unnoticed. The changes in the maintainer team, amongst other factors, contributed to a situation that needed to be addressed.

Why a special interest group

With the critical role of etcd in mind, it was proposed that the way forward would be to create a new special interest group. If etcd was already at the heart of Kubernetes, creating a dedicated SIG not only recognises that role, it would make etcd a first-class citizen of the Kubernetes community.

Establishing SIG etcd creates a dedicated space to make explicit the contract between etcd and Kubernetes api machinery and to prevent, on the etcd level, changes which violate this contract. Additionally, etcd will be able to adopt the processes that Kubernetes offers its SIGs (KEPs, PRR, phased feature gates, amongst others) in order to improve the consistency and reliability of the codebase. Being able to use these processes will be a substantial benefit to the etcd community.

As a SIG, etcd will also be able to draw contributor support from Kubernetes proper: active contributions to etcd from Kubernetes maintainers would decrease the likelihood of breaking Kubernetes changes, through the increased number of potential reviewers and the integration with existing testing framework. This will not only benefit Kubernetes, which will be able to better participate and shape the direction of etcd in terms of the critical role it plays, but also etcd as a whole.

About SIG etcd

The recently created SIG is already working towards its goals, defined in its Charter and Vision. The purpose is clear: to ensure etcd is a reliable, simple, and scalable production-ready store for building cloud-native distributed systems and managing cloud-native infrastructure via orchestrators like Kubernetes.

The scope of SIG etcd is not exclusively about etcd as a Kubernetes component, it also covers etcd as a standard solution. Our goal is to make etcd the most reliable key-value storage to be used anywhere, unconstrained by any Kubernetes-specific limits and scaling to meet the requirements of many diverse use-cases.

We are confident that the creation of SIG etcd constitutes an important milestone in the lifecycle of the project, simultaneously improving etcd itself, and also the integration of etcd with Kubernetes. We invite everyone interested in etcd to visit our page, join us at our Slack channel, and get involved in this new stage of etcd's life.

Kubernetes Contributor Summit: Behind-the-scenes

Every year, just before the official start of KubeCon+CloudNativeCon, there's a special event that has a very special place in the hearts of those organizing and participating in it: the Kubernetes Contributor Summit. To find out why, and to provide a behind-the-scenes perspective, we interview Noah Abrahams, whom amongst other roles was the co-lead for the Kubernetes Contributor Summit in 2023.

Frederico Muñoz (FSM): Hello Noah, and welcome. Could you start by introducing yourself and telling us how you got involved in Kubernetes?

Noah Abrahams (NA): I’ve been in this space for quite a while.  I got started in IT in the mid 90's, and I’ve been working in the "Cloud" space for about 15 years.  It was, frankly, through a combination of sheer luck (being in the right place at the right time) and having good mentors to pull me into those places (thanks, Tim!), that I ended up at a startup called Apprenda in 2016. While I was there, they pivoted into Kubernetes, and it was the best thing that could have happened to my career.  It was around v1.2 and someone asked me if I could give a presentation on Kubernetes concepts at "my local meetup" in Las Vegas.  The meetup didn’t exist yet, so I created it, and got involved in the wider community.  One thing led to another, and soon I was involved in ContribEx, joined the release team, was doing booth duty for the CNCF, became an ambassador, and here we are today.

The Contributor Summit

KCSEU 2023 group photo

FM: Before leading the organisation of the KCSEU 2023, how many other Contributor Summits were you a part of?

NA: I was involved in four or five before taking the lead. If I'm recalling correctly, I attended the summit in Copenhagen, then sometime in 2018 I joined the wrong meeting, because the summit staff meeting was listed on the ContribEx calendar. Instead of dropping out of the call, I listened a bit, then volunteered to take on some work that didn't look like it had anybody yet dedicated to it. I ended up running Ops in Seattle and helping run the New Contributor Workshop in Shanghai, that year. Since then, I’ve been involved in all but two, since I missed both Barcelona and Valencia.

FM: Have you noticed any major changes in terms of how the conference is organized throughout the years? Namely in terms of number of participants, venues, speakers, themes...

NA: The summit changes over the years with the ebb and flow of the desires of the contributors that attend. While we can typically expect about the same number of attendees, depending on the region that the event is held in, we adapt the style and content greatly based on the feedback that we receive at the end of each event. Some years, contributors ask for more free-style or unconference type sessions, and we plan on having more of those, but some years, people ask for more planned sessions or workshops, so that's what we facilitate. We also have to continually adapt to the venue that we have, the number of rooms we're allotted, how we're going to share the space with other events and so forth. That all goes into the planning ahead of time, from how many talk tracks we’ll have, to what types of tables and how many microphones we want in a room.

There has been one very significant change over the years, though, and that is that we no longer run the New Contributor Workshop. While the content was valuable, running the session during the summit never led to any people who weren’t already contributing to the project becoming dedicated contributors to the project, so we removed it from the schedule. We'll deliver that content another way, while we’ll keep the summit focused on existing contributors.

What makes it special

FM: Going back to the introduction I made, I’ve heard several participants saying that KubeCon is great, but that the Contributor Summit is for them the main event. In your opinion, why do you think that makes it so?

NA: I think part of it ties into what I mentioned a moment ago, the flexibility in our content types. For many contributors, I think the summit is basically "How Kubecon used to be", back when it was primarily a gathering of the contributors to talk about the health of the project and the work that needed to be done. So, in that context, if the contributors want to discuss, say, a new Working Group, then they have dedicated space to do so in the summit. They also have the space to sit down and hack on a tough problem, discuss architectural philosophy, bring potential problems to more people’s attention, refine our methods, and so forth. Plus, the unconference aspect allows for some malleability on the day-of, for whatever is most important right then and there. Whatever folks want to get out of this environment is what we’ll provide, and having a space and time specifically to address your particular needs is always going to be well received.

Let's not forget the social aspect, too. Despite the fact that we're a global community and work together remotely and asynchronously, it's still easier to work together when you have a personal connection, and can put a face to a Github handle. Zoom meetings are a good start, but even a single instance of in-person time makes a big difference in how people work together. So, getting folks together a couple times a year makes the project run more smoothly.

Organizing the Summit

FM: In terms of the organization team itself, could you share with us a general overview of the staffing process? Who are the people that make it happen? How many different teams are involved?

NA: There's a bit of the "usual suspects" involved in making this happen, many of whom you'll find in the ContribEx meetings, but really it comes down to whoever is going to step up and do the work. We start with a general call out for volunteers from the org. There's a Github issue where we'll track the staffing and that will get shouted out to all the usual comms channels: slack, k-dev, etc.

From there, there's a handful of different teams, overseeing content/program committee, registration, communications, day-of operations, the awards the SIGs present to their members, the after-summit social event, and so on. The leads for each team/role are generally picked from folks who have stepped up and worked the event before, either as a shadow, or a previous lead, so we know we can rely on them, which is a recurring theme. The leads pick their shadows from whoever pipes up on the issue, and the teams move forward, operating according to their role books, which we try to update at the end of each summit, with what we've learned over the past few months. It's expected that a shadow will be in line to lead that role at some point in a future summit, so we always have a good bench of folks available to make this event happen. A couple of the roles also have some non-shadow volunteers where people can step in to help a bit, like as an on-site room monitor, and get a feel for how things are put together without having to give a serious up-front commitment, but most of the folks working the event are dedicated to both making the summit successful, and coming back to do so in the future. Of course, the roster can change over time, or even suddenly, as people gain or lose travel budget, get new jobs, only attend Europe or North America or Asia, etc. It's a constant dance, relying 100% on the people who want to make this project successful.

Last, but not least, is the Summit lead. They have to keep the entire process moving forward, be willing to step in to keep bike-shedding from derailing our deadlines, make sure the right people are talking to one another, lead all our meetings to make sure everyone gets a voice, etc. In some cases, the lead has to even be willing to take over an entirely separate role, in case someone gets sick or has any other extenuating circumstances, to make sure absolutely nothing falls through the cracks. The lead is only allowed to volunteer after they’ve been through this a few times and know what the event entails. Event planning is not for the faint of heart.

FM: The participation of volunteers is essential, but there's also the topic of CNCF support: how does this dynamic play out in practice?

NA: This event would not happen in its current form without our CNCF liaison. They provide us with space, make sure we are fed and caffeinated and cared for, bring us outside spaces to evaluate, so we have somewhere to hold the social gathering, get us the budget so we have t-shirts and patches and the like, and generally make it possible for us to put this event together. They're even responsible for the signage and arrows, so the attendees know where to go. They're the ones sitting at the front desk, keeping an eye on everything and answering people's questions. At the same time, they're along to facilitate, and try to avoid influencing our planning.

There's a ton of work that goes into making the summit happen that is easy to overlook, as an attendee, because people tend to expect things to just work. It is not exaggerating to say this event would not have happened like it has over the years, without the help from our liaisons, like Brienne and Deb. They are an integral part of the team.

A look ahead

FM: Currently, we’re preparing the NA 2023 summit, how is it going? Any changes in format compared with previous ones?

NA: I would say it's going great, though I'm sort of emeritus lead for this event, mostly picking up the things that I see need to be done and don't have someone assigned to it. We're always learning from our past experiences and making small changes to continually be better, from how many people need to be on a particular rotation to how far in advance we open and close the CFP. There's no major changes right now, just continually providing the content that the contributors want.

FM: For our readers that might be interested in joining in the Kubernetes Contributor Summit, is there anything they should know?

NA: First of all, the summit is an event by and for Org members. If you're not already an org member, you should be getting involved before trying to attend the summit, as the content is curated specifically towards the contributors and maintainers of the project. That applies to the staff, as well, as all the decisions should be made with the interests and health of kubernetes contributors being the end goal. We get a lot of people who show interest in helping out, but then aren't ready to make any sort of commitment, and that just makes more work for us. If you're not already a proven and committed member of this community, it’s difficult for us to place you in a position that requires reliability. We have made some rare exceptions when we need someone local to help us out, but those are few and far between.

If you are, however, already a member, we'd love to have you. The more people that are involved, the better the event becomes. That applies to both dedicated staff, and those in attendance bringing CFPs, unconference topics, and just contributing to the discussions. If you're part of this community and you're going to be at KubeCon, I would highly urge you to attend, and if you're not yet an org member, let's make that happen!

FM: Indeed! Any final comments you would like to share?

NA: Just that the Contributor Summit is, for me, the ultimate manifestation of the Hallway Track. By being here, you're part of the conversations that move this project forward. It's good for you, and it's good for Kubernetes. I hope to see you all in Chicago!

Spotlight on SIG Architecture: Production Readiness

This is the second interview of a SIG Architecture Spotlight series that will cover the different subprojects. In this blog, we will cover the SIG Architecture: Production Readiness subproject.

In this SIG Architecture spotlight, we talked with Wojciech Tyczynski (Google), lead of the Production Readiness subproject.

About SIG Architecture and the Production Readiness subproject

Frederico (FSM): Hello Wojciech, could you tell us a bit about yourself, your role and how you got involved in Kubernetes?

Wojciech Tyczynski (WT): I started contributing to Kubernetes in January 2015. At that time, Google (where I was and still am working) decided to start a Kubernetes team in the Warsaw office (in addition to already existing teams in California and Seattle). I was lucky enough to be one of the seeding engineers for that team.

After two months of onboarding and helping with different tasks across the project towards 1.0 launch, I took ownership of the scalability area and I was leading Kubernetes to support clusters with 5000 nodes. I’m still involved in SIG Scalability as its Technical Lead. That was the start of a journey since scalability is such a cross-cutting topic, and I started contributing to many other areas including, over time, to SIG Architecture.

FSM: In SIG Architecture, why specifically the Production Readiness subproject? Was it something you had in mind from the start, or was it an unexpected consequence of your initial involvement in scalability?

WT: After reaching that milestone of Kubernetes supporting 5000-node clusters, one of the goals was to ensure that Kubernetes would not degrade its scalability properties over time. While non-scalable implementation is always fixable, designing non-scalable APIs or contracts is problematic. I was looking for a way to ensure that people are thinking about scalability when they create new features and capabilities without introducing too much overhead.

This is when I joined forces with John Belamaric and David Eads and created a Production Readiness subproject within SIG Architecture. While setting the bar for scalability was only one of a few motivations for it, it ended up fitting quite well. At the same time, I was already involved in the overall reliability of the system internally, so other goals of Production Readiness were also close to my heart.

FSM: To anyone new to how SIG Architecture works, how would you describe the main goals and areas of intervention of the Production Readiness subproject?

WT: The goal of the Production Readiness subproject is to ensure that any feature that is added to Kubernetes can be reliably used in production clusters. This primarily means that those features are observable, scalable, supportable, can always be safely enabled and in case of production issues also disabled.

Production readiness and the Kubernetes project

FSM: Architectural consistency being one of the goals of the SIG, is this made more challenging by the distributed and open nature of Kubernetes? Do you feel this impacts the approach that Production Readiness has to take?

WT: The distributed nature of Kubernetes certainly impacts Production Readiness, because it makes thinking about aspects like enablement/disablement or scalability more challenging. To be more precise, when enabling or disabling features that span multiple components you need to think about version skew between them and design for it. For scalability, changes in one component may actually result in problems for a completely different one, so it requires a good understanding of the whole system, not just individual components. But it’s also what makes this project so interesting.

FSM: Those running Kubernetes in production will have their own perspective on things, how do you capture this feedback?

WT: Fortunately, we aren’t talking about "them" here, we’re talking about "us": all of us are working for companies that are managing large fleets of Kubernetes clusters and we’re involved in that too, so we suffer from those problems ourselves.

So while we’re trying to get feedback (our annual PRR survey is very important for us), it rarely reveals completely new problems - it rather shows the scale of them. And we try to react to it - changes like "Beta APIs off by default" happen in reaction to the data that we observe.

FSM: On the topic of reaction, that made me think of how the Kubernetes Enhancement Proposal (KEP) template has a Production Readiness Review (PRR) section, which is tied to the graduation process. Was this something born out of identified insufficiencies? How would you describe the results?

WT: As mentioned above, the overall goal of the Production Readiness subproject is to ensure that every newly added feature can be reliably used in production. It’s not possible to enforce that by a central team - we need to make it everyone's problem.

To achieve it, we wanted to ensure that everyone designing their new feature is thinking about safe enablement, scalability, observability, supportability, etc. from the very beginning. Which means not when the implementation starts, but rather during the design. Given that KEPs are effectively Kubernetes design docs, making it part of the KEP template was the way to achieve the goal.

FSM: So, in a way making sure that feature owners have thought about the implications of their proposal.

WT: Exactly. We already observed that just by forcing feature owners to think through the PRR aspects (via forcing them to fill in the PRR questionnaire) many of the original issues are going away. Sure - as PRR approvers we’re still catching gaps, but even the initial versions of KEPs are better now than they used to be a couple of years ago in what concerns thinking about productionisation aspects, which is exactly what we wanted to achieve - spreading the culture of thinking about reliability in its widest possible meaning.

FSM: We've been talking about the PRR process, could you describe it for our readers?

WT: The PRR process is fairly simple - we just want to ensure that you think through the productionisation aspects of your feature early enough. If you do your job, it’s just a matter of answering some questions in the KEP template and getting approval from a PRR approver (in addition to regular SIG approval). If you didn’t think about those aspects earlier, it may require spending more time and potentially revising some decisions, but that’s exactly what we need to make the Kubernetes project reliable.

Helping with Production Readiness

FSM: Production Readiness seems to be one area where a good deal of prior exposure is required in order to be an effective contributor. Are there also ways for someone newer to the project to contribute?

WT: PRR approvers have to have a deep understanding of the whole Kubernetes project to catch potential issues. Kubernetes is such a large project now with so many nuances that people who are new to the project can simply miss the context, no matter how senior they are.

That said, there are many ways that you may implicitly help. Increasing the reliability of particular areas of the project by improving its observability and debuggability, increasing test coverage, and building new kinds of tests (upgrade, downgrade, chaos, etc.) will help us a lot. Note that the PRR subproject is focused on keeping the bar at the design level, but we should also care equally about the implementation. For that, we’re relying on individual SIGs and code approvers, so having people there who are aware of productionisation aspects, and who deeply care about it, will help the project a lot.

FSM: Thank you! Any final comments you would like to share with our readers?

WT: I would like to highlight and thank all contributors for their cooperation. While the PRR adds some additional work for them, we see that people care about it, and what’s even more encouraging is that with every release the quality of the answers improves, and questions "do I really need a metric reflecting if my feature works" or "is downgrade really that important" don’t really appear anymore.

Gateway API v1.0: GA Release

On behalf of Kubernetes SIG Network, we are pleased to announce the v1.0 release of Gateway API! This release marks a huge milestone for this project. Several key APIs are graduating to GA (generally available), while other significant features have been added to the Experimental channel.

What's new

Graduation to v1

This release includes the graduation of Gateway, GatewayClass, and HTTPRoute to v1, which means they are now generally available (GA). This API version denotes a high level of confidence in the API surface and provides guarantees of backwards compatibility. Note that although, the version of these APIs included in the Standard channel are now considered stable, that does not mean that they are complete. These APIs will continue to receive new features via the Experimental channel as they meet graduation criteria. For more information on how all of this works, refer to the Gateway API Versioning Policy.

Gateway API now has a logo! This logo was designed through a collaborative process, and is intended to represent the idea that this is a set of Kubernetes APIs for routing traffic both north-south and east-west:

Gateway API Logo

CEL Validation

Historically, Gateway API has bundled a validating webhook as part of installing the API. Starting in v1.0, webhook installation is optional and only recommended for Kubernetes 1.24. Gateway API now includes CEL validation rules as part of the CRDs. This new form of validation is supported in Kubernetes 1.25+, and thus the validating webhook is no longer required in most installations.

Standard channel

This release was primarily focused on ensuring that the existing beta APIs were well defined and sufficiently stable to graduate to GA. That led to a variety of spec clarifications, as well as some improvements to status to improve the overall UX when interacting with Gateway API.

Experimental channel

Most of the changes included in this release were limited to the experimental channel. These include HTTPRoute timeouts, TLS config from Gateways to backends, WebSocket support, Gateway infrastructure labels, and more. Stay tuned for a follow up blog post that will cover each of these new features in detail.

Everything else

For a full list of the changes included in this release, please refer to the v1.0.0 release notes.

How we got here

The idea of Gateway API was initially proposed 4 years ago at KubeCon San Diego as the next generation of Ingress API. Since then, an incredible community has formed to develop what has likely become the most collaborative API in Kubernetes history. Over 170 people have contributed to this API so far, and that number continues to grow.

A special thank you to the 20+ community members who agreed to take on an official role in the project, providing some time for reviews and sharing the load of maintaining the project!

We especially want to highlight the emeritus maintainers that played a pivotal role in the early development of this project:

Try it out

Unlike other Kubernetes APIs, you don't need to upgrade to the latest version of Kubernetes to get the latest version of Gateway API. As long as you're running one of the 5 most recent minor versions of Kubernetes (1.24+), you'll be able to get up and running with the latest version of Gateway API.

To try out the API, follow our Getting Started guide.

What's next

This release is just the beginning of a much larger journey for Gateway API, and there are still plenty of new features and new ideas in flight for future releases of the API.

One of our key goals going forward is to work to stabilize and graduate other experimental features of the API. These include support for service mesh, additional route types (GRPCRoute, TCPRoute, TLSRoute, UDPRoute), and a variety of experimental features.

We've also been working towards moving ReferenceGrant into a built-in Kubernetes API that can be used for more than just Gateway API. Within Gateway API, we've used this resource to safely enable cross-namespace references, and that concept is now being adopted by other SIGs. The new version of this API will be owned by SIG Auth and will likely include at least some modifications as it migrates to a built-in Kubernetes API.

Gateway API at KubeCon + CloudNativeCon

At KubeCon North America (Chicago) and the adjacent Contributor Summit there are several talks related to Gateway API that will go into more detail on these topics. If you're attending either of these events this year, considering adding these to your schedule.

Contributor Summit:

KubeCon Main Event:

KubeCon Office Hours:

Gateway API maintainers will be holding office hours sessions at KubeCon if you'd like to discuss or brainstorm any related topics. To get the latest updates on these sessions, join the #sig-network-gateway-api channel on Kubernetes Slack.

Get involved

We've only barely scratched the surface of what's in flight with Gateway API. There are lots of opportunities to get involved and help define the future of Kubernetes routing APIs for both Ingress and Mesh.

If this is interesting to you, please join us in the community and help us build the future of Gateway API together!

Introducing ingress2gateway; Simplifying Upgrades to Gateway API

Update: Ingress2Gateway v1.0 is now available!

Today we are releasing ingress2gateway, a tool that can help you migrate from Ingress to Gateway API. Gateway API is just weeks away from graduating to GA, if you haven't upgraded yet, now's the time to think about it!

Background

In the ever-evolving world of Kubernetes, networking plays a pivotal role. As more applications are deployed in Kubernetes clusters, effective exposure of these services to clients becomes a critical concern. If you've been working with Kubernetes, you're likely familiar with the Ingress API, which has been the go-to solution for managing external access to services.

The Ingress API provides a way to route external traffic to your applications within the cluster, making it an indispensable tool for many Kubernetes users. Ingress has its limitations however, and as applications become more complex and the demands on your Kubernetes clusters increase, these limitations can become bottlenecks.

Some of the limitations are:

  • Insufficient common denominator - by attempting to establish a common denominator for various HTTP proxies, Ingress can only accommodate basic HTTP routing, forcing more features of contemporary proxies like traffic splitting and header matching into provider-specific, non-transferable annotations.
  • Inadequate permission model - Ingress spec configures both infrastructure and application configuration in one object. With Ingress, the cluster operator and application developer operate on the same Ingress object without being aware of each other’s roles. This creates an insufficient role-based access control and has high potential for setup errors.
  • Lack of protocol diversity - Ingress primarily focuses on HTTP(S) routing and does not provide native support for other protocols, such as TCP, UDP and gRPC. This limitation makes it less suitable for handling non-HTTP workloads.

Gateway API

To overcome this, Gateway API is designed to provide a more flexible, extensible, and powerful way to manage traffic to your services.

Gateway API is just weeks away from a GA (General Availability) release. It provides a standard Kubernetes API for ingress traffic control. It offers extended functionality, improved customization, and greater flexibility. By focusing on modular and expressive API resources, Gateway API makes it possible to describe a wider array of routing configurations and models.

The transition from Ingress API to Gateway API in Kubernetes is driven by advantages and advanced functionalities that Gateway API offers, with its foundation built on four core principles: a role-oriented approach, portability, expressiveness and extensibility.

A role-oriented approach

Gateway API employs a role-oriented approach that aligns with the conventional roles within organizations involved in configuring Kubernetes service networking. This approach enables infrastructure engineers, cluster operators, and application developers to collectively address different aspects of Gateway API.

For instance, infrastructure engineers play a pivotal role in deploying GatewayClasses, cluster-scoped resources that act as templates to explicitly define behavior for Gateways derived from them, laying the groundwork for robust service networking.

Subsequently, cluster operators utilize these GatewayClasses to deploy gateways. A Gateway in Kubernetes' Gateway API defines how external traffic can be directed to Services within the cluster, essentially bridging non-Kubernetes sources to Kubernetes-aware destinations. It represents a request for a load balancer configuration aligned with a GatewayClass’ specification. The Gateway spec may not be exhaustive as some details can be supplied by the GatewayClass controller, ensuring portability. Additionally, a Gateway can be linked to multiple Route references to channel specific traffic subsets to designated services.

Lastly, application developers configure route resources (such as HTTPRoutes), to manage configuration (e.g. timeouts, request matching/filter) and Service composition (e.g. path routing to backends) Route resources define protocol-specific rules for mapping requests from a Gateway to Kubernetes Services. HTTPRoute is for multiplexing HTTP or terminated HTTPS connections. It's intended for use in cases where you want to inspect the HTTP stream and use HTTP request data for either routing or modification, for example using HTTP Headers for routing, or modifying them in-flight.

Diagram showing the key resources that make up Gateway API and how they relate to each other. The resources shown are GatewayClass, Gateway, and HTTPRoute; the Service API is also shown

Portability

With more than 20 API implementations, Gateway API is designed to be more portable across different implementations, clusters and environments. It helps reduce Ingress' reliance on non-portable, provider-specific annotations, making your configurations more consistent and easier to manage across multiple clusters.

Gateway API commits to supporting the 5 latest Kubernetes minor versions. That means that Gateway API currently supports Kubernetes 1.24+.

Expressiveness

Gateway API provides standard, Kubernetes-backed support for a wide range of features, such as header-based matching, traffic splitting, weight-based routing, request mirroring and more. With Ingress, these features need custom provider-specific annotations.

Extensibility

Gateway API is designed with extensibility as a core feature. Rather than enforcing a one-size-fits-all model, it offers the flexibility to link custom resources at multiple layers within the API's framework. This layered approach to customization ensures that users can tailor configurations to their specific needs without overwhelming the main structure. By doing so, Gateway API facilitates more granular and context-sensitive adjustments, allowing for a fine-tuned balance between standardization and adaptability. This becomes particularly valuable in complex cloud-native environments where specific use cases require nuanced configurations. A critical difference is that Gateway API has a much broader base set of features and a standard pattern for extensions that can be more expressive than annotations were on Ingress.

Upgrading to Gateway

Migrating from Ingress to Gateway API may seem intimidating, but luckily Kubernetes just released a tool to simplify the process. ingress2gateway assists in the migration by converting your existing Ingress resources into Gateway API resources. Here is how you can get started with Gateway API and using ingress2gateway:

  1. Install a Gateway controller OR install the Gateway API CRDs manually .

  2. Install ingress2gateway.

    If you have a Go development environment locally, you can install ingress2gateway with:

    go install github.com/kubernetes-sigs/ingress2gateway@v0.1.0
    

    This installs ingress2gateway to $(go env GOPATH)/bin/ingress2gateway.

    Alternatively, follow the installation guide here.

  3. Once the tool is installed, you can use it to convert the ingress resources in your cluster to Gateway API resources.

    ingress2gateway print
    

    This above command will:

    1. Load your current Kubernetes client config including the active context, namespace and authentication details.
    2. Search for ingresses and provider-specific resources in that namespace.
    3. Convert them to Gateway API resources (Currently only Gateways and HTTPRoutes). For other options you can run the tool with -h, or refer to https://github.com/kubernetes-sigs/ingress2gateway#options.
  4. Review the converted Gateway API resources, validate them, and then apply them to your cluster.

  5. Send test requests to your Gateway to check that it is working. You could get your gateway address using kubectl get gateway <gateway-name> -n <namespace> -o jsonpath='{.status.addresses}{"\n"}'.

  6. Update your DNS to point to the new Gateway.

  7. Once you've confirmed that no more traffic is going through your Ingress configuration, you can safely delete it.

Wrapping up

Achieving reliable, scalable and extensible networking has always been a challenging objective. The Gateway API is designed to improve the current Kubernetes networking standards like ingress and reduce the need for implementation specific annotations and CRDs.

It is a Kubernetes standard API, consistent across different platforms and implementations and most importantly it is future proof. Gateway API is the next generation of the Ingress API, but has a larger scope than that, expanding to tackle mesh and layer 4 routing as well. Gateway API and ingress2gateway are supported by a dedicated team under SIG Network that actively work on it and manage the ecosystem. It is also likely to receive more updates and community support.

The Road Ahead

ingress2gateway is just getting started. We're planning to onboard more providers, introduce support for more types of Gateway API routes, and make sure everything syncs up smoothly with the ongoing development of Gateway API.

Excitingly, Gateway API is also making significant strides. While v1.0 is about to launching, there's still a lot of work ahead. This release incorporates many new experimental features, with additional functionalities currently in the early stages of planning and development.

If you're interested in helping to contribute, we would love to have you! Please check out the community page which includes links to the Slack channel and community meetings. We look forward to seeing you!!

Plants, process and parties: the Kubernetes 1.28 release interview

Since 2018, one of my favourite contributions to the Kubernetes community has been to share the story of each release. Many of these stories were told on behalf of a past employer; by popular demand, I've brought them back, now under my own name. If you were a fan of the old show, I would be delighted if you would subscribe.

Back in August, we welcomed the release of Kubernetes 1.28. That release was led by Grace Nguyen, a CS student at the University of Waterloo. Grace joined me for the traditional release interview, and while you can read her story below, I encourage you to listen to it if you can.

This transcript has been lightly edited and condensed for clarity.


You're a student at the University of Waterloo, so I want to spend the first two minutes of this interview talking about the Greater Kitchener-Waterloo region. It's August, so this is one of the four months of the year when there's no snow visible on the ground?
Well, it's not that bad. I think the East Coast has it kind of good. I grew up in Calgary, but I do love summer here in Waterloo. We have a petting zoo close to our university campus, so I go and see the llamas sometimes.

Is that a new thing?
I'm not sure, it seems like it's been around five-ish years, the Waterloo Park?

I lived there in 2007, for a couple of years, just to set the scene for why we're talking about this. I think they were building a lot of the park then. I do remember, of course, that Kitchener holds the second largest Oktoberfest in the world. Is that something you've had a chance to check out?
I have not. I actually didn't know that was a fact.

The local civic organization is going to have to do a bit more work, I feel. Do you like ribs?
I have mixed feelings about ribs. It's kind of a hit or miss situation for me so far.

Again, that might be something that's changed over the last few years. The Ribfests used to have a lot of trophies with little pigs on top of them, but I feel that the shifting dining habits of the world might mean they have to offer some vegan or vegetarian options, to please the modern palette.
[LAUGHS] For sure. Do you recommend the Oktoberfest here? Have you been?

I went a couple of times. It was a lot of fun.
Okay.

It's basically just drinking. I would have recommended it back then; I'm not sure it would be quite what I'd be doing today.
All right, good to know.

The Ribfest, however, I would go back just for that.
Oh, ok.

And the great thing about Ribfests as a concept is that they have one in every little town. The Kitchener Ribfest, I looked it up, it's in July; you've just missed that. But, you could go to the Waterloo Ribfest in September.
Oh, it is in September? They have their own Ribfest?

They do. I think Guelph has one, and Cambridge has one. That's the advantage of the region — there are lots of little cities. Kitchener and Waterloo are two cities that grew into each other — they do call them the Twin Cities. I hear that they finally built the light rail link between the two of them?
It is fantastic, and makes the city so much more walkable.

Yes, you can go from one mall to the other. That's Canada for you.
Well, Uptown is really nice. I quite like it. It's quite cozy.

Do you ever cross the border over into Kitchener? Or only when you've lost a bet?
Yeah, not a lot. Only for farmer's market, I say.

It's worthwhile. There's a lot of good food there, I remember.
Yeah. Quite lovely.

Now we've got all that out of the way, let's travel back in time a little bit. You mentioned there that you went to high school in Calgary?
I did. I had not been to Ontario before I went to university. Calgary was frankly too cold and not walkable enough for me.

I basically say the same thing about Waterloo and that's why I moved to England.
Fascinating. Gets better.

How did you get into tech?
I took a computer science class in high school. I was one of maybe only three women in the class, and I kind of stuck with it since.

Was the gender distribution part of your thought process at the time?
Yeah, I think I was drawn to it partially because I didn't see a lot of people who looked like me in the class.

You followed it through to university. What is it that you're studying?
I am studying computer engineering, so a lot of hardware stuff.

You're involved in the UW Cybersecurity Club. What can you tell me about that without having to kill me?
Oh, we are very nice and friendly people! I told myself I'm going to have a nice and chill summer and then I got chosen to lead the release and also ended up running the Waterloo Cybersecurity Club. The club kind of died out during the pandemic, because we weren't on campus, but we have so many smart and amazing people who are in cybersecurity, so it's great to get them together and I learned so many things.

Is that like the modern equivalent of the LAN party? You're all getting into a dark room and trying to hack the Gibson?
[LAUGHS] Well, you'll have to explain to me again what a LAN party is. Do you bring your own PC?

You used to. Back in the day it was incomprehensible that you could communicate with a different person in a different place at a fast enough speed, so you had to physically sit next to somebody and plug a cable in between you.
Okay, well kind of the same, I guess. We bring our own laptop and we go to CTF competitions together.

They didn't have laptops back in the days of LAN parties. You'd bring a giant 19-inch square monitor, and everything. It was a badge of honor what you could carry.
Okay. Can't relate, but good to know. [LAUGHS]

One of the more unique aspects of UW is its co-op system. Tell us a little bit about that?
As part of my degree, I am required to do minimum five and maximum six co-ops. I've done all six of them. Two of them were in Kubernetes and that's how I got started.

A co-op is a placement, as opposed to something you do on campus?
Right, so co-op is basically an internship. My first one was at the Canada Revenue Agency. We didn't have wifi and I had my own cubicle, which is interesting. They don't do that anymore, they have open office space. But my second was at Ericsson, where I learned about Kubernetes. It was during the pandemic. KubeCon offered virtual attendance for students and I signed up and I poked around and I have been around since.

What was it like going through university during the COVID years? What did that mean in terms of the fact you would previously have traveled to these internships? Did you do them all from home?
I'm not totally sure what I missed out on. For sure, a lot of relationship building, but also that we do have to move a lot as part of the co-op experience. Last fall I was in San Francisco, I was in Palo Alto earlier this year. A lot of that dynamic has already been the case.

Definitely different weather systems, Palo Alto versus Waterloo.
Oh, for sure. Yes, yes. Really glad I was there over the winter.

The first snow would fall in Ontario about the end of October and it would pile up over the next few months. There were still piles that hadn't melted by June. That's why I say, there were only four months of the year, July through September, where there was no snow on the ground.
That's true. Didn't catch any snow in Palo Alto, and honestly, that's great. [CHUCKLES]

Thank you, global warming, I guess.
Oh no! [LAUGHS]

Tell me about the co-op term that you did working with Kubernetes at Ericsson?
This was such a long time ago, but we were trying to build some sort of pipeline to deploy testing. It was running inside a cluster, and I learned Helm charts and all that good stuff. And then, for the co-op after that, I worked at a Canadian startup in FinTech. It was 24/7 Kubernetes, building their secret injection system, using ArgoCD to automatically pull secrets from 1Password.

How did that lead you on to involvement with the release team?
It was over the pandemic, so I didn't have a lot to do, I went to the conference, saw so many cool talks. One that really stuck out to me was a Kubernetes hacking talk by Tabitha Sable and V Korbes. I thought it was the most amazing thing and it was so cool. One of my friends was on the release team at the time, and she showed me what she does. I applied and thankfully got in. I didn't have any open source experience. It was fully like one of those things where someone took a chance on me.

How would you characterize the experience that you've had to date? You have had involvement with pretty much every release since then.
Yeah, I think it was a really formative experience, and the community has been such a big part of it.

You started as an enhancement shadow with Kubernetes 1.22, eventually moving up to enhancements lead, then you moved on to be the release lead shadow. Obviously, you are the lead for 1.28, but for 1.27 you did something a bit different. What was that, and why did you do it?
For 1.25 and 1.26, I was release lead shadow, so I had an understanding of what that role was like. I wanted to shadow another team, and at that time I thought CI Signal was a big black box to me. I joined the team, but I also had capacity for other things, I joined as a branch manager associate as well.

What is the difference between that role and the traditional release team roles we think about?
Yeah, that's a great question. So the branch management role is a more constant role. They don't necessarily get swapped out every release. You shadow as an associate, so you do things like cut releases, distribute them, update distros, things like that. It's a really important role, and the folks that are in there are more technical. So if you have been on the release team for a long time and are looking for more permanent role, I recommend looking into that.

Congratulations again on the release of 1.28 today.
Yeah, thank you.

What is the best new feature in Kubernetes 1.28, and why is it sidecar container support?
Great question. I am as excited as you. In 1.28, we have a new feature in alpha, which is sidecar container support. We introduced a new field called restartPolicy for init containers, that allows the containers to live throughout the life cycle of the pod and not block the pod from terminating. Craig, you know a lot about this, but there are so many use cases for this. It is a very common pattern. You use it for logging, monitoring, metrics; also configs and secrets as well.

And the service mesh!
And the service mesh.

Very popular. I will say that the Sidecar pattern was called out very early on, in a blog post Brendan Burns wrote, talking about how you can achieve some of the things you just mentioned. Support for it in Kubernetes has been— it's been a while, shall we say. I've been doing these interviews since 2018, and September 2019 was when I first had a conversation with a release manager who felt they had to apologize for Sidecar containers not shipping in that release.
Well, here we are!

Thank you for not letting the side down.
[LAUGHS]

There are a bunch of other features that are going to GA in 1.28. Tell me about what's new with kubectl events?
It got a new CLI and now it is separate from kubectl get. I think that changes in the CLI are always a little bit more apparent because they are user-facing.

Are there a lot of other user-facing changes, or are most of the things in the release very much behind the scenes?
I would say it's a good mix of both; it depends on what you're interested in.

I am interested, of course, in non-graceful node shutdown support. What can you tell us about that?
Right, so for situations where you have a hardware failure or a broken OS, we have added additional support for a better graceful shutdown.

If someone trips over the power cord at your LAN party and your cluster goes offline as a result?
Right, exactly. More availability! That's always good.

And if it's not someone tripping over your power cord, it's probably DNS that broke your cluster. What's changed in terms of DNS configuration?
Oh, we introduced a new feature gate to allow more DNS search path.

Is that all there is to it?
That's pretty much it. [LAUGHING] Yeah, you can have more and longer DNS search path.

It can never be long enough. Just search everything! If .com doesn't work, try .net and try .io after that.
Surely.

Those are a few of the big features that are moving to stable. Obviously, over the course of the last few releases, features come in, moving from Alpha to Beta and so on. New features coming in today might not be available to people for a while. As you mentioned, there are feature gates that you can enable to allow people to have access to these. What are some of the newest features that have been introduced that are in Alpha, that are particularly interesting to you personally?
I have two. The first one is kubectl delete --interactive. I'm always nervous when I delete something, you know, it's going to be a typo or it's going to be on the wrong tab. So we have an --interactive flag for that now.

So you can get feedback on what you're about to delete before you do it?
Right; confirmation is good!

You mentioned two there, what was the second one?
Right; this one is close to my heart. It is a SIG Release KEP, publishing on community infrastructure. I'm not sure if you know, but as part of my branch management associate role in 1.27, I had the opportunity to cut a few releases. It takes up to 12 hours sometimes. And now, we are hoping that the process only includes release managers, so we don't have to call up the folks at Google and, you know, lengthen that process anymore.

Is 12 hours the expected length for software of this size, or is there work in place to try and bring that down?
There's so much work in place to bring that down. I think 12 hours is on the shorter end of it. Unfortunately, we have had a situation where we have to, you know, switch the release manager because it's just so late at night for them.

They've fallen asleep halfway through?
Exactly, yeah. 6 to 12 hours, I think, is our status quo.

The theme for this release is "Planternetes". That's going to need some explanation, I feel.
Okay. I had full creative control over this. It is summer in the northern hemisphere, and I am a big house plant fanatic. It's always a little sad when I have to move cities for co-op and can't take my plants with me.

Is that a border control thing? They don't let you take them over the border?
It's not even that; they're just so clunky and fragile. It's usually not worth the effort. But I think our community is very much like a garden. We have very critical roles in the ecosystem and we all have to work together.

Will you be posting seeds out to contributors and growing something together all around the world?
That would be so cool if we had merch, like a little card with seeds embedded in it. I don't think we have the budget for that though. [LAUGHS]

You say that. There are people who are inspired in many different areas. I love talking to the release managers and hearing the things that they're interested in. You should think about taking some seeds off one of your plants, and just spreading them around the world. People can take pictures, and tag you in them on Instagram.
That's cool. You know how we have a SIG Beard? We can have a SIG Plant.

You worked for a long time with the release lead for 1.27. Xander Grzywinski. One of the benefits of having done my interview with him in writing and not as a podcast is I didn't have to try and butcher pronouncing his surname. Can you help me out here?
I unfortunately cannot. I don't want to butcher it either!

Anyway, Xander told me that he suspected that in this release you would have to deal with some very last-minute PRs, as is tradition. Was that the case?
I vividly remember the last minute PRs from last release because I was trying to cut the releases, as part of the branch management team. Thankfully, that was not the case this release. We have had other challenges, of course.

Can you tell me some of those challenges?
I think improvement on documentation is always a big part. The KEP process can be very daunting to new contributors. How do you get people to review your KEPs? How do you opt in? All that stuff. We're improving documentations for that.

As someone who has been through a lot of releases, I've been feeling, like you've said, that the last minute nature has slowed down a little. The process is perhaps improving. Do you see that, or do you think there's still a long way to go for the leads to improve it?
I think we've come very far. When I started in 1.22, we were using spreadsheets to track a hundred enhancements. It was a monster; I was terrified to touch it. Now, we're on GitHub boards. As a result of that, we are actually merging the bug triage and CI Signal team in 1.29.

What's the impact of that?
The bug triage team is now using the GitHub board to track issues, which is much more efficient. We are able to merge the two teams together.

I have heard a rumor that GitHub boards are powered by spreadsheets underneath.
Honestly, even if that's true, the fact that it's on the same platform and it has better version control is just magical.

At this time, the next release lead has not yet been announced, but tradition dictates that you write down your feelings, best wishes and instructions to them in an envelope, which you'll leave in their desk drawer. What are you going to put inside that envelope?
Our 1.28 release lead is fantastic and they're so capable of handling the release—

That's you, isn't it?
1.29? [LAUGHS] No, I'm too tired. I need to catch up on my sleep. My advice for them? It's going to be okay. It's all going to be okay. I was going to echo Leo's and Cici's words, to overcommunicate, but I think that has been said enough times already.

You've communicated enough. Stop! No more communication!
Yeah, no more communication. [LAUGHS] It's going to be okay. And honestly, shout out to my emeritus advisor, Leo, for reminding me that. Sometimes there are a lot of fires and it can be overwhelming, but it will be okay.

As we've alluded to a little bit throughout our conversation, there are a lot of people in the Kubernetes community who, for want of a better term, have had "a lot of experience" at running these systems. Then there are, of course, a lot of people who are just at the beginning of their careers; like yourself, at university. How do you see the difference between how those groups interact? Is there one team throughout, or what do you think that each can learn from the other?
I think the diversity of the team is one of its strengths and I really enjoy it. I learn so much from folks who have been doing this for 20 years or folks who are new to the industry like I am.

I know the CNCF goes to a lot of effort to enable new people to take part. Is there anything that you can say about how people might get involved?
Firstly, I think SIG Release has started a wonderful tradition, or system, of helping new folks join the release team as a shadow, and helping them grow into bigger positions, like leads. I think other SIGs are also following that template as well. But a big part of me joining and sticking with the community has been the ability to go to conferences. As I said, my first conference was KubeCon, when I was not involved in the community at all. And so a big shout-out to the CNCF and the companies that sponsor the Dan Kohn and the speaker scholarships. They have been the sole reason that I was able to attend KubeCon, meet people, and feel the power of the community.

Last year's KubeCon in North America was in Detroit?
Detroit, I was there, yeah.

That's quite a long drive?
I was in SF, so I flew over.

You live right next door! If only you'd been in Waterloo.
Yeah, but who knows? Maybe I'll do a road trip from Waterloo to Chicago this year.


Grace Nguyen is a student at the University of Waterloo, and was the release team lead for Kubernetes 1.28. Subscribe to Let's Get To The News, or search for it wherever you get your podcasts.

PersistentVolume Last Phase Transition Time in Kubernetes

In the recent Kubernetes v1.28 release, we (SIG Storage) introduced a new alpha feature that aims to improve PersistentVolume (PV) storage management and help cluster administrators gain better insights into the lifecycle of PVs. With the addition of the lastPhaseTransitionTime field into the status of a PV, cluster administrators are now able to track the last time a PV transitioned to a different phase, allowing for more efficient and informed resource management.

Why do we need new PV field?

PersistentVolumes in Kubernetes play a crucial role in providing storage resources to workloads running in the cluster. However, managing these PVs effectively can be challenging, especially when it comes to determining the last time a PV transitioned between different phases, such as Pending, Bound or Released. Administrators often need to know when a PV was last used or transitioned to certain phases; for instance, to implement retention policies, perform cleanup, or monitor storage health.

In the past, Kubernetes users have faced data loss issues when using the Delete retain policy and had to resort to the safer Retain policy. When we planned the work to introduce the new lastPhaseTransitionTime field, we wanted to provide a more generic solution that can be used for various use cases, including manual cleanup based on the time a volume was last used or producing alerts based on phase transition times.

How lastPhaseTransitionTime helps

Provided you've enabled the feature gate (see How to use it, the new .status.lastPhaseTransitionTime field of a PersistentVolume (PV) is updated every time that PV transitions from one phase to another. Whether it's transitioning from Pending to Bound, Bound to Released, or any other phase transition, the lastPhaseTransitionTime will be recorded. For newly created PVs the phase will be set to Pending and the lastPhaseTransitionTime will be recorded as well.

This feature allows cluster administrators to:

  1. Implement Retention Policies

    With the lastPhaseTransitionTime, administrators can now track when a PV was last used or transitioned to the Released phase. This information can be crucial for implementing retention policies to clean up resources that have been in the Released phase for a specific duration. For example, it is now trivial to write a script or a policy that deletes all PVs that have been in the Released phase for a week.

  2. Monitor Storage Health

    By analyzing the phase transition times of PVs, administrators can monitor storage health more effectively. For example, they can identify PVs that have been in the Pending phase for an unusually long time, which may indicate underlying issues with the storage provisioner.

How to use it

The lastPhaseTransitionTime field is alpha starting from Kubernetes v1.28, so it requires the PersistentVolumeLastPhaseTransitionTime feature gate to be enabled.

If you want to test the feature whilst it's alpha, you need to enable this feature gate on the kube-controller-manager and the kube-apiserver.

Use the --feature-gates command line argument:

--feature-gates="...,PersistentVolumeLastPhaseTransitionTime=true"

Keep in mind that the feature enablement does not have immediate effect; the new field will be populated whenever a PV is updated and transitions between phases. Administrators can then access the new field through the PV status, which can be retrieved using standard Kubernetes API calls or through Kubernetes client libraries.

Here is an example of how to retrieve the lastPhaseTransitionTime for a specific PV using the kubectl command-line tool:

kubectl get pv <pv-name> -o jsonpath='{.status.lastPhaseTransitionTime}'

Going forward

This feature was initially introduced as an alpha feature, behind a feature gate that is disabled by default. During the alpha phase, we (Kubernetes SIG Storage) will collect feedback from the end user community and address any issues or improvements identified.

Once sufficient feedback has been received, or no complaints are received the feature can move to beta. The beta phase will allow us to further validate the implementation and ensure its stability.

At least two Kubernetes releases will happen between the release where this field graduates to beta and the release that graduates the field to general availability (GA). That means that the earliest release where this field could be generally available is Kubernetes 1.32, likely to be scheduled for early 2025.

Getting involved

We always welcome new contributors so if you would like to get involved you can join our Kubernetes Storage Special-Interest-Group (SIG).

If you would like to share feedback, you can do so on our public Slack channel. If you're not already part of that Slack workspace, you can visit https://slack.k8s.io/ for an invitation.

Special thanks to all the contributors that provided great reviews, shared valuable insight and helped implement this feature (alphabetical order):

A Quick Recap of 2023 China Kubernetes Contributor Summit

On September 26, 2023, the first day of KubeCon + CloudNativeCon + Open Source Summit China 2023, nearly 50 contributors gathered in Shanghai for the Kubernetes Contributor Summit.

All participants in the 2023 Kubernetes Contributor Summit

All participants in the 2023 Kubernetes Contributor Summit

This marked the first in-person offline gathering held in China after three years of the pandemic.

A joyful meetup

The event began with welcome speeches from Kevin Wang from Huawei Cloud, one of the co-chairs of KubeCon, and Puja from Giant Swarm.

Following the opening remarks, the contributors introduced themselves briefly. Most attendees were from China, while some contributors had made the journey from Europe and the United States specifically for the conference. Technical experts from companies such as Microsoft, Intel, Huawei, as well as emerging forces like DaoCloud, were present. Laughter and cheerful voices filled the room, regardless of whether English was spoken with European or American accents or if conversations were carried out in authentic Chinese language. This created an atmosphere of comfort, joy, respect, and anticipation. Past contributions brought everyone closer, and mutual recognition and accomplishments made this offline gathering possible.

Face to face meeting in Shanghai

Face to face meeting in Shanghai

The attending contributors were no longer just GitHub IDs; they transformed into vivid faces. From sitting together and capturing group photos to attempting to identify "Who is who," a loosely connected collective emerged. This team structure, although loosely knit and free-spirited, was established to pursue shared dreams.

As the saying goes, "You reap what you sow." Each effort has been diligently documented within the Kubernetes community contributions. Regardless of the passage of time, the community will not erase those shining traces. Brilliance can be found in your PRs, issues, or comments. It can also be seen in the smiling faces captured in meetup photos or heard through stories passed down among contributors.

Technical sharing and discussions

Next, there were three technical sharing sessions:

  • sig-multi-cluster: Hongcai Ren, a maintainer of Karmada, provided an introduction to the responsibilities and roles of this SIG. Their focus is on designing, discussing, implementing, and maintaining APIs, tools, and documentation related to multi-cluster management. Cluster Federation, one of Karmada's core concepts, is also part of their work.

  • helmfile: yxxhero from GitLab presented how to deploy Kubernetes manifests declaratively, customize configurations, and leverage the latest features of Helm, including Helmfile.

  • sig-scheduling: william-wang from Huawei Cloud shared the recent updates and future plans of SIG Scheduling. This SIG is responsible for designing, developing, and testing components related to Pod scheduling.

A technical session about sig-multi-cluster

A technical session about sig-multi-cluster

Following the sessions, a video featuring a call for contributors by Sergey Kanzhelev, the SIG-Node Chair, was played. The purpose was to encourage more contributors to join the Kubernetes community, with a special emphasis on the popular SIG-Node.

Lastly, Kevin hosted an Unconference collective discussion session covering topics such as multi-cluster management, scheduling, elasticity, AI, and more. For detailed minutes of the Unconference meeting, please refer to https://docs.qq.com/doc/DY3pLWklzQkhjWHNT.

China's contributor statistics

The contributor summit took place in Shanghai, with 90% of the attendees being Chinese. Within the Cloud Native Computing Foundation (CNCF) ecosystem, contributions from China have been steadily increasing. Currently:

  • Chinese contributors account for 9% of the total.
  • Contributions from China make up 11.7% of the overall volume.
  • China ranks second globally in terms of contributions.

Note:

The data is from KubeCon keynotes by Chris Aniszczyk, CTO of Cloud Native Computing Foundation, on September 26, 2023. This probably understates Chinese contributions. A lot of Chinese contributors use VPNs and may not show up as being from China in the stats accurately.

The Kubernetes Contributor Summit is an inclusive meetup that welcomes all community contributors, including:

  • New Contributors
  • Current Contributors
    • docs
    • code
    • community management
  • Subproject members
  • Members of Special Interest Group (SIG) / Working Group (WG)
  • Active Contributors
  • Casual Contributors

Acknowledgments

We would like to express our gratitude to the organizers of this event:

  • Kevin Wang, the co-chair of KubeCon and the lead of the kubernetes contributor summit.
  • Paco Xu, who actively coordinated the venue, meals, invited contributors from both China and international sources, and established WeChat groups to collect agenda topics. They also shared details of the event before and after its occurrence through pre and post announcements.
  • Mengjiao Liu, who was responsible for organizing, coordinating, and facilitating various matters related to the summit.

We extend our appreciation to all the contributors who attended the China Kubernetes Contributor Summit in Shanghai. Your dedication and commitment to the Kubernetes community are invaluable. Together, we continue to push the boundaries of cloud native technology and shape the future of this ecosystem.

Bootstrap an Air Gapped Cluster With Kubeadm

Ever wonder how software gets deployed onto a system that is deliberately disconnected from the Internet and other networks? These systems are typically disconnected due to their sensitive nature. Sensitive as in utilities (power/water), banking, healthcare, weapons systems, other government use cases, etc. Sometimes it's technically a water gap, if you're running Kubernetes on an underwater vessel. Still, these environments need software to operate. This concept of deployment in a disconnected state is what it means to deploy to the other side of an air gap.

Again, despite this posture, software still needs to run in these environments. Traditionally, software artifacts are physically carried across the air gap on hard drives, USB sticks, CDs, or floppy disks (for ancient systems, it still happens). Kubernetes lends itself particularly well to running software behind an air gap for several reasons, largely due to its declarative nature.

In this blog article, I will walk through the process of bootstrapping a Kubernetes cluster in an air-gapped lab environment using Fedora Linux and kubeadm.

The Air Gap VM Setup

A real air-gapped network can take some effort to set up, so for this post, I will use an example VM on a laptop and do some network modifications. Below is the topology:

Topology on the host/laptop which shows that connectivity to the internet from the air gap VM is not possible. However, connectivity between the host/laptop and the VM is possible

Local topology

This VM will have its network connectivity disabled but in a way that doesn't shut down the VM's virtual NIC. Instead, its network will be downed by injecting a default route to a dummy interface, making anything internet-hosted unreachable. However, the VM still has a connected route to the bridge interface on the host, which means that network connectivity to the host is still working. This posture means that data can be transferred from the host/laptop to the VM via scp, even with the default route on the VM black-holing all traffic that isn't destined for the local bridge subnet. This type of transfer is analogous to carrying data across the air gap and will be used throughout this post.

Other details about the lab setup:

VM OS: Fedora 37
Kubernetes Version: v1.27.3
CNI Plugins Version: v1.3.0
CNI Provider and Version: Flannel v0.22.0

While this single VM lab is a simplified example, the below diagram more approximately shows what a real air-gapped environment could look like:

Example production topology which shows 3 control plane Kubernetes nodes and 'n' worker nodes along with a Docker registry in an air-gapped environment.  Additionally shows two workstations, one on each side of the air gap and an IT admin which physically carries the artifacts across.

Note, there is still intentional isolation between the environment and the internet. There are also some things that are not shown in order to keep the diagram simple, for example malware scanning on the secure side of the air gap.

Back to the single VM lab environment.

Identifying the required software artifacts

I have gone through the trouble of identifying all of the required software components that need to be carried across the air gap in order for this cluster to be stood up:

  • Docker (to host an internal container image registry)
  • Containerd
  • libcgroup
  • socat
  • conntrack-tools
  • CNI plugins
  • crictl
  • kubeadm
  • kubelet
  • kubectl and k9s (strictly speaking, these aren't required to bootstrap a cluster but they are handy to interact with one)
  • kubelet.service systemd file
  • kubeadm configuration file
  • Docker registry container image
  • Kubernetes component container images
  • CNI network plugin container images (Flannel will be used for this lab)
  • CNI network plugin manifests
  • CNI tooling container images

The way I identified these was by trying to do the installation and working through all of the errors that are thrown around an additional dependency being required. In a real air-gapped scenario, each transport of artifacts across the air gap could represent anywhere from 20 minutes to several weeks of time spent by the installer. That is to say that the target system could be located in a data center on the same floor as your desk, at a satellite downlink facility in the middle of nowhere, or on a submarine that's out to sea. Knowing what is on that system at any given time is important so you know what you have to bring.

Prepare the Node for K8s

Before downloading and moving the artifacts to the VM, let's first prep that VM to run Kubernetes.

VM preparation

Run these steps as a normal user

Make destination directory for software artifacts

mkdir ~/tmp

Run the following steps as the superuser (root)

Write to /etc/sysctl.d/99-k8s-cri.conf:

cat > /etc/sysctl.d/99-k8s-cri.conf << EOF
net.bridge.bridge-nf-call-iptables=1
net.ipv4.ip_forward=1
net.bridge.bridge-nf-call-ip6tables=1
EOF

Write to /etc/modules-load.d/k8s.conf (enable overlay and nbr_netfilter):

echo -e overlay\\nbr_netfilter > /etc/modules-load.d/k8s.conf

Install iptables:

dnf -y install iptables-legacy

Set iptables to use legacy mode (not nft emulating iptables):

update-alternatives --set iptables /usr/sbin/iptables-legacy

Turn off swap:

touch /etc/systemd/zram-generator.conf
systemctl mask systemd-zram-setup@.service
sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab

Disable firewalld (this is OK in a demo context):

systemctl disable --now firewalld

Disable systemd-resolved:

systemctl disable --now systemd-resolved

Configure DNS defaults for NetworkManager:

sed -i '/\[main\]/a dns=default' /etc/NetworkManager/NetworkManager.conf

Blank the system-level DNS resolver configuration:

unlink /etc/resolv.conf || true
touch /etc/resolv.conf

Disable SELinux (just for a demo - check before doing this in production!):

setenforce 0

Make sure all changes survive a reboot

reboot

Download all the artifacts

On the laptop/host machine, download all of the artifacts enumerated in the previous section. Since the air gapped VM is running Fedora 37, all of the dependencies shown in this part are for Fedora 37. Note, this procedure will only work on AArch64 or AMD64 CPU architectures as they are the most popular and widely available.. You can execute this procedure anywhere you have write permissions; your home directory is a perfectly suitable choice.

Note, operating system packages for the Kubernetes artifacts that need to be carried across can now be found at pkgs.k8s.io. This blog post will use a combination of Fedora repositories and GitHub in order to download all of the required artifacts. When you’re doing this on your own cluster, you should decide whether to use the official Kubernetes packages, or the official packages from your operating system distribution - both are valid choices.

# Set architecture variables
UARCH=$(uname -m)

if [["$UARCH" == "arm64" || "$UARCH" == "aarch64"]]; then

    ARCH="aarch64"
    K8s_ARCH="arm64"

else

    ARCH="x86_64"
    K8s_ARCH="amd64"

fi

Set environment variables for software versions to use:

CNI_PLUGINS_VERSION="v1.3.0"
CRICTL_VERSION="v1.27.0"
KUBE_RELEASE="v1.27.3"
RELEASE_VERSION="v0.15.1"
K9S_VERSION="v0.27.4"

Create a download directory, change into it, and download all of the RPMs and configuration files

mkdir download && cd download

curl -O https://download.docker.com/linux/fedora/37/${ARCH}/stable/Packages/docker-ce-cli-23.0.2-1.fc37.${ARCH}.rpm

curl -O https://download.docker.com/linux/fedora/37/${ARCH}/stable/Packages/containerd.io-1.6.19-3.1.fc37.${ARCH}.rpm

curl -O https://download.docker.com/linux/fedora/37/${ARCH}/stable/Packages/docker-compose-plugin-2.17.2-1.fc37.${ARCH}.rpm

curl -O https://download.docker.com/linux/fedora/37/${ARCH}/stable/Packages/docker-ce-rootless-extras-23.0.2-1.fc37.${ARCH}.rpm

curl -O https://download.docker.com/linux/fedora/37/${ARCH}/stable/Packages/docker-ce-23.0.2-1.fc37.${ARCH}.rpm

curl -O https://download-ib01.fedoraproject.org/pub/fedora/linux/releases/37/Everything/${ARCH}/os/Packages/l/libcgroup-3.0-1.fc37.${ARCH}.rpm

echo -e "\nDownload Kubernetes Binaries"

curl -L -O "https://github.com/containernetworking/plugins/releases/download/${CNI_PLUGINS_VERSION}/cni-plugins-linux-${K8s_ARCH}-${CNI_PLUGINS_VERSION}.tgz"

curl -L -O "https://github.com/kubernetes-sigs/cri-tools/releases/download/${CRICTL_VERSION}/crictl-${CRICTL_VERSION}-linux-${K8s_ARCH}.tar.gz"

curl -L --remote-name-all https://dl.k8s.io/release/${KUBE_RELEASE}/bin/linux/${K8s_ARCH}/{kubeadm,kubelet}

curl -L -O "https://raw.githubusercontent.com/kubernetes/release/${RELEASE_VERSION}/cmd/kubepkg/templates/latest/deb/kubelet/lib/systemd/system/kubelet.service"

curl -L -O "https://raw.githubusercontent.com/kubernetes/release/${RELEASE_VERSION}/cmd/kubepkg/templates/latest/deb/kubeadm/10-kubeadm.conf"

curl -L -O "https://dl.k8s.io/release/${KUBE_RELEASE}/bin/linux/${K8s_ARCH}/kubectl"

echo -e "\nDownload dependencies"

curl -O "https://dl.fedoraproject.org/pub/fedora/linux/releases/37/Everything/${ARCH}/os/Packages/s/socat-1.7.4.2-3.fc37.${ARCH}.rpm"

curl -O "https://dl.fedoraproject.org/pub/fedora/linux/releases/37/Everything/${ARCH}/os/Packages/l/libcgroup-3.0-1.fc37.${ARCH}.rpm"

curl -O "https://dl.fedoraproject.org/pub/fedora/linux/releases/37/Everything/${ARCH}/os/Packages/c/conntrack-tools-1.4.6-4.fc37.${ARCH}.rpm"

curl -LO "https://github.com/derailed/k9s/releases/download/${K9S_VERSION}/k9s_Linux_${K8s_ARCH}.tar.gz"

curl -LO "https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml"

Download all of the necessary container images:

images=(
    "registry.k8s.io/kube-apiserver:${KUBE_RELEASE}"
    "registry.k8s.io/kube-controller-manager:${KUBE_RELEASE}"
    "registry.k8s.io/kube-scheduler:${KUBE_RELEASE}"
    "registry.k8s.io/kube-proxy:${KUBE_RELEASE}"
    "registry.k8s.io/pause:3.9"
    "registry.k8s.io/etcd:3.5.7-0"
    "registry.k8s.io/coredns/coredns:v1.10.1"
    "registry:2.8.2"
    "flannel/flannel:v0.22.0"
    "flannel/flannel-cni-plugin:v1.1.2"
)

for image in "${images[@]}"; do
    # Pull the image from the registry
    docker pull "$image"

    # Save the image to a tar file on the local disk
    image_name=$(echo "$image" | sed 's|/|_|g' | sed 's/:/_/g')
    docker save -o "${image_name}.tar" "$image"

done

The above commands will take a look at the CPU architecture for the current host/laptop, create and change into a directory called download, and finally download all of the dependencies. Each of these files must then be transported over the air gap via scp. The exact syntax of the command will vary depending on the user on the VM, if you created an SSH key, and the IP of your air gap VM. The rough syntax is:

scp -i <<SSH_KEY>> <<FILE>> <<AIRGAP_VM_USER>>@<<AIRGAP_VM_IP>>:~/tmp/

Once all of the files have been transported to the air gapped VM, the rest of the blog post will take place from the VM. Open a terminal session to that system.

Put the artifacts in place

Everything that is needed in order to bootstrap a Kubernetes cluster now exists on the air-gapped VM. This section is a lot more complicated since various types of artifacts are now on disk on the air-gapped VM. Get a root shell on the air gap VM as the rest of this section will be executed from there. Let's start by setting the same architecture variables and environmental as were set on the host/laptop and then install all of the RPM packages:

UARCH=$(uname -m)
# Set architecture variables

if [["$UARCH" == "arm64" || "$UARCH" == "aarch64"]]; then

    ARCH="aarch64"
    K8s_ARCH="arm64"

else

    ARCH="x86_64"
    K8s_ARCH="amd64"

fi

# Set environment variables
CNI_PLUGINS_VERSION="v1.3.0"
CRICTL_VERSION="v1.27.0"
KUBE_RELEASE="v1.27.3"
RELEASE_VERSION="v0.15.1"
K9S_VERSION="v0.27.4"

cd ~/tmp/

dnf -y install ./*.rpm

Next, install the CNI plugins and crictl:

mkdir -p /opt/cni/bin
tar -C /opt/cni/bin -xz -f "cni-plugins-linux-${K8s_ARCH}-v1.3.0.tgz"
tar -C /usr/local/bin-xz -f "crictl-v1.27.0-linux-${K8s_ARCH}.tar.gz"

Make kubeadm, kubelet and kubectl executable and move them from the /tmp directory to /usr/local/bin:

chmod +x kubeadm kubelet kubectl
mv kubeadm kubelet kubectl /usr/local/bin

Define an override for the systemd kubelet service file, and move it to the proper location:

mkdir -p /etc/systemd/system/kubelet.service.d

sed "s:/usr/bin:/usr/local/bin:g" 10-kubeadm.conf > /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

The CRI plugin for containerd is disabled by default; enable it:

sed -i 's/^disabled_plugins = \["cri"\]/#&/' /etc/containerd/config.toml

Put a custom /etc/docker/daemon.json file in place:

echo '{

"exec-opts": ["native.cgroupdriver=systemd"],

"insecure-registries" : ["localhost:5000"],

"allow-nondistributable-artifacts": ["localhost:5000"],

"log-driver": "json-file",

"log-opts": {

"max-size": "100m"

},

"group": "rnd",

"storage-driver": "overlay2",

"storage-opts": [

"overlay2.override_kernel_check=true"

]

}' > /etc/docker/daemon.json

Two important items to highlight in the Docker daemon.json configuration file. The insecure-registries line means that the registry in brackets does not support TLS. Even inside an air gapped environment, this isn't a good practice but is fine for the purposes of this lab. The allow-nondistributable-artifacts line tells Docker to permit pushing nondistributable artifacts to this registry. Docker by default does not push these layers to avoid potential issues around licensing or distribution rights. A good example of this is the Windows base container image. This line will allow layers that Docker marks as "foreign" to be pushed to the registry. While not a big deal for this article, that line could be required for some air gapped environments. All layers have to exist locally since nothing inside the air gapped environment can reach out to a public container image registry to get what it needs.

(Re)start Docker and enable it so it starts at system boot:

systemctl restart docker
systemctl enable docker

Start, and enable, containerd and the kubelet:

systemctl enable --now containerd
systemctl enable --now kubelet

The container image registry that runs in Docker is only required for any CNI related containers and subsequent workload containers. This registry is not used to house the Kubernetes component containers. Note, nerdctl would have also worked here as an alternative to Docker and would have allowed for direct interaction with containerd. Docker was chosen for its familiarity.

Start a container image registry inside Docker:

docker load -i registry_2.8.2.tar
docker run -d -p 5000:5000 --restart=always --name registry registry:2.8.2

Load Flannel containers into the Docker registry

Note: Flannel was chosen for this lab due to familiarity. Chose whatever CNI works best in your environment.

docker load -i flannel_flannel_v0.22.0.tar
docker load -i flannel_flannel-cni-plugin_v1.1.2.tar
docker tag flannel/flannel:v0.22.0 localhost:5000/flannel/flannel:v0.22.0
docker tag flannel/flannel-cni-plugin:v1.1.1 localhost:5000/flannel/flannel-cni-plugin:v1.1.1
docker push localhost:5000/flannel/flannel:v0.22.0
docker push localhost:5000/flannel/flannel-cni-plugin:v1.1.1

Load container images for Kubernetes components, via ctr:

images_files=(
    "registry.k8s.io/kube-apiserver:${KUBE_RELEASE}"
    "registry.k8s.io/kube-controller-manager:${KUBE_RELEASE}"
    "registry.k8s.io/kube-scheduler:${KUBE_RELEASE}"
    "registry.k8s.io/kube-proxy:${KUBE_RELEASE}"
    "registry.k8s.io/pause:3.9"
    "registry.k8s.io/etcd:3.5.7-0"
    "registry.k8s.io/coredns/coredns:v1.10.1"
    
)


for index in "${!image_files[@]}"; do

    if [[-f "${image_files[$index]}" ]]; then

        # The below line loads the images where they need to be on the VM
        ctr -n k8s.io images import ${image_files[$index]}

    else

        echo "File ${image_files[$index]} not found!" 1>&2

    fi

done

A totally reasonable question here could be "Why not use the Docker registry that was just stood up to house the K8s component images?" This simply didn't work even with the proper modification to the configuration file that gets passed to kubeadm.

Spin up the Kubernetes cluster

Check if a cluster is already running and tear it down if it is:

if systemctl is-active --quiet kubelet; then

    # Reset the Kubernetes cluster

    echo "A Kubernetes cluster is already running. Resetting the cluster..."

    kubeadm reset -f

fi

Log into the Docker registry from inside the air-gapped VM:

# OK for a demo; use secure credentials in production!

DOCKER_USER=user
DOCKER_PASS=pass
echo ${DOCKER_PASS} | docker login --username=${DOCKER_USER} --password-stdin localhost:5000

Create a cluster configuration file and initialize the cluster:

echo "---

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
clusterName: kubernetes
kubernetesVersion: v1.27.3
networking:
    dnsDomain: cluster.local
    podSubnet: 10.244.0.0/16 # --pod-network-cidr
    serviceSubnet: 10.96.0.0/12
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
    advertiseAddress: 10.10.10.10 # Update to the IP address of the air gap VM
    bindPort: 6443
nodeRegistration:
    criSocket: unix:///run/containerd/containerd.sock # or rely on autodetection
    name: airgap # this must match the hostname of the air gap VM
# Since this is a single node cluster, this taint has to be commented out,
# otherwise the coredns pods will not come up.
# taints:
# - effect: NoSchedule
# key: node-role.kubernetes.io/master" > kubeadm_cluster.yaml

kubeadm init --config kubeadm_config.yaml

Set $KUBECONFIG and use kubectl to wait until the API server is healthy:

export KUBECONFIG=/etc/kubernetes/admin.conf

until kubectl get nodes; do
    echo -e "\nWaiting for API server to respond..." 1>&2
    sleep 5

done

Set up networking

Update Flannel image locations in the Flannel manifest, and apply it:

sed -i 's/image: docker\.io/image: localhost:5000/g' kube-flannel.yaml
kubectl apply -f kube-flannel.yaml

Run kubectl get pods -A --watch until all pods are up and running.

Run an example Pod

With a cluster operational, the next step is a workload. For this simple demonstration, the Podinfo application will be deployed.

Install Helm

This first part of the procedure must be executed from the host/laptop. If not already present, install Helm following Installing Helm.

Next, download the helm binary for Linux:

UARCH=$(uname -m)
# Reset the architecture variables if needed
if [["$UARCH" == "arm64" || "$UARCH" == "aarch64"]]; then

    ARCH="aarch64"
    K8s_ARCH="arm64"

else

    ARCH="x86_64"
    K8s_ARCH="amd64"

fi

curl -LO https://get.helm.sh/helm-v3.12.2-linux-${K8s_ARCH}.tar.gz

Add the Podinfo helm repository, download the Podinfo helm chart, download the Podinfo container image, and then finally save it to the local disk:

helm repo add https://stefanprodan.github.io/podinfo
helm fetch podinfo/podinfo --version 6.4.0
docker pull ghcr.io/stefanprodan/podinfo:6.4.0

Save the podinfo image to a tar file on the local disk

docker save -o podinfo_podinfo-6.4.0.tar ghcr.io/stefanprodan/podinfo

### Transfer the image across the air gap

Reuse the `~/tmp` directory created on the air gapped VM to transport these artifacts across the air gap:

```bash
scp -i <<SSH_KEY>> <<FILE>> <<AIRGAP_VM_USER>>@<<AIRGAP_VM_IP>>:~/tmp/

Continue on the isolated side

Now pivot over to the air gap VM for the rest of the installation procedure.

Switch into ~/tmp:

cd ~/tmp

Extract and move the helm binary:

tar -zxvf helm-v3.0.0-linux-amd64.tar.gz
mv linux-amd64/helm /usr/local/bin/helm

Load the Podinfo container image into the local Docker registry:

docker load -i podinfo_podinfo-6.4.0.tar
docker tag podinfo/podinfo:6.4.0 localhost:5000/podinfo/podinfo:6.4.0
docker push localhost:5000/podinfo/podinfo:6.4.0

Ensure "$KUBECONFIG` is set correctly, then install the Podinfo Helm chart:

# Outside of a demo or lab environment, use lower (or even least) privilege
# credentials to manage your workloads.
export KUBECONFIG=/etc/kubernetes/admin.conf
helm install podinfo ./podinfo-6.4.0.tgz --set image.repository=localhost:5000/podinfo/podinfo

Verify that the Podinfo application comes up:

kubectl get pods -n default

Or run k9s (a terminal user interface for Kubernetes):

k9s

Zarf

Zarf is an open-source tool that takes a declarative approach to software packaging and delivery, including air gap. This same podinfo application will be installed onto the air gap VM using Zarf in this section. The first step is to install Zarf on the host/laptop.

Alternatively, a prebuilt binary can be downloaded onto the host/laptop from GitHub for various OS/CPU architectures.

A binary is also needed across the air gap on the VM:

UARCH=$(uname -m)
# Set the architecture variables if needed
if [["$UARCH" == "arm64" || "$UARCH" == "aarch64"]]; then

    ARCH="aarch64"
    K8s_ARCH="arm64"

else

    ARCH="x86_64"
    K8s_ARCH="amd64"

fi

export ZARF_VERSION=v0.28.3

curl -LO "https://github.com/defenseunicorns/zarf/releases/download/${ZARF_VERSION}/zarf_${ZARF_VERSION}_Linux_${K8s_ARCH}"

Zarf needs to bootstrap itself into a Kubernetes cluster through the use of an init package. That also needs to be transported across the air gap so let's download it onto the host/laptop:

curl -LO "https://github.com/defenseunicorns/zarf/releases/download/${ZARF_VERSION}/zarf-init-${K8s_ARCH}-${ZARF_VERSION}.tar.zst"

The way that Zarf is declarative is through the use of a zarf.yaml file. Here is the zarf.yaml file that will be used for this Podinfo installation. Write it to whatever directory you you have write access to on your host/laptop; your home directory is fine:

echo 'kind: ZarfPackageConfig
metadata:
    name: podinfo
    description: "Deploy helm chart for the podinfo application in K8s via zarf"
components:
    - name: podinfo
        required: true
        charts:
            - name: podinfo
              version: 6.4.0
              namespace: podinfo-helm-namespace
              releaseName: podinfo
              url: https://stefanprodan.github.io/podinfo
        images:
        - ghcr.io/stefanprodan/podinfo:6.4.0' > zarf.yaml

The next step is to build the Podinfo package. This must be done from the same directory location where the zarf.yaml file is located.

zarf package create --confirm

This command will download the defined helm chart and image and put them into a single file written to disk. This single file is all that needs to be carried across the air gap:

ls zarf-package-*

Sample output:

zarf-package-podinfo-arm64.tar.zst

Transport the linux zarf binary, zarf init package and Podinfo package over to the air gapped VM:

scp -i <<SSH_KEY>> <<FILE>> <<AIRGAP_VM_USER>>@<<AIRGAP_VM_IP>>:~/tmp/

From the air gapped VM, switch into the ~/tmp directory where all of the artifacts were placed:

cd ~/tmp

Set $KUBECONFIG to a file with credentials for the local cluster; also set the Zarf version:

export KUBECONFIG=/etc/kubernetes/admin.conf

export ZARF_VERSION=$(zarf version)

Make the zarf binary executable and (as root) move it to /usr/bin:

chmod +x zarf && sudo mv zarf /usr/bin

Likewise, move the Zarf init package to /usr/bin:

mv zarf-init-arm64-${ZARF_VERSION}.tar.zst /usr/bin

Initialize Zarf into the cluster:

zarf init --confirm --components=git-server

When this command is done, a Zarf package is ready to be deployed.

zarf package deploy

This command will search the current directory for a Zarf package. Select the podinfo package (zarf-package-podinfo-${K8s_ARCH}.tar.zst) and continue. Once the package deployment is complete, run zarf tools monitor in order to bring up k9s to view the cluster.

Conclusion

This is one method that can be used to spin up an air-gapped cluster and two methods to deploy a mission application. Your mileage may vary on different operating systems regarding the exact software artifacts that need to be carried across the air gap, but conceptually this procedure is still valid.

This demo also created an artificial air-gapped environment. In the real world, every missed dependency could represent hours, if not days, or weeks of lost time to get running software in the air-gapped environment. This artificial air gap also obscured some common methods or air gap software delivery such as using a data diode. Depending on the environment, the diode can be very expensive to use. Also, none of the artifacts were scanned before being carried across the air gap. The presence of the air gap in general means that the workload running there is more sensitive, and nothing should be carried across unless it's known to be safe.

CRI-O is moving towards pkgs.k8s.io

The Kubernetes community recently announced that their legacy package repositories are frozen, and now they moved to introduced community-owned package repositories powered by the OpenBuildService (OBS). CRI-O has a long history of utilizing OBS for their package builds, but all of the packaging efforts have been done manually so far.

The CRI-O community absolutely loves Kubernetes, which means that they're delighted to announce that:

All future CRI-O packages will be shipped as part of the officially supported Kubernetes infrastructure hosted on pkgs.k8s.io!

There will be a deprecation phase for the existing packages, which is currently being discussed in the CRI-O community. The new infrastructure will only support releases of CRI-O >= v1.28.2 as well as release branches newer than release-1.28.

How to use the new packages

In the same way as the Kubernetes community, CRI-O provides deb and rpm packages as part of a dedicated subproject in OBS, called isv:kubernetes:addons:cri-o. This project acts as an umbrella and provides stable (for CRI-O tags) as well as prerelease (for CRI-O release-1.y and main branches) package builds.

Stable Releases:

Prereleases:

There are no stable releases available in the v1.29 repository yet, because v1.29.0 will be released in December. The CRI-O community will also not support release branches older than release-1.28, because there have been CI requirements merged into main which could be only backported to release-1.28 with appropriate efforts.

For example, If an end-user would like to install the latest available version of the CRI-O main branch, then they can add the repository in the same way as they do for Kubernetes.

rpm Based Distributions

For rpm based distributions, you can run the following commands as a root user to install CRI-O together with Kubernetes:

Add the Kubernetes repo

cat <<EOF | tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.28/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.28/rpm/repodata/repomd.xml.key
EOF

Add the CRI-O repo

cat <<EOF | tee /etc/yum.repos.d/cri-o.repo
[cri-o]
name=CRI-O
baseurl=https://pkgs.k8s.io/addons:/cri-o:/prerelease:/main/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/addons:/cri-o:/prerelease:/main/rpm/repodata/repomd.xml.key
EOF

Install official package dependencies

dnf install -y \
    conntrack \
    container-selinux \
    ebtables \
    ethtool \
    iptables \
    socat

Install the packages from the added repos

dnf install -y --repo cri-o --repo kubernetes \
    cri-o \
    kubeadm \
    kubectl \
    kubelet

deb Based Distributions

For deb based distributions, you can run the following commands as a root user:

Install dependencies for adding the repositories

apt-get update
apt-get install -y software-properties-common curl

Add the Kubernetes repository

curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.28/deb/Release.key |
    gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.28/deb/ /" |
    tee /etc/apt/sources.list.d/kubernetes.list

Add the CRI-O repository

curl -fsSL https://pkgs.k8s.io/addons:/cri-o:/prerelease:/main/deb/Release.key |
    gpg --dearmor -o /etc/apt/keyrings/cri-o-apt-keyring.gpg
echo "deb [signed-by=/etc/apt/keyrings/cri-o-apt-keyring.gpg] https://pkgs.k8s.io/addons:/cri-o:/prerelease:/main/deb/ /" |
    tee /etc/apt/sources.list.d/cri-o.list

Install the packages

apt-get update
apt-get install -y cri-o kubelet kubeadm kubectl

Start CRI-O

systemctl start crio.service

The Project's prerelease:/main prefix at the CRI-O's package path, can be replaced with stable:/v1.28, stable:/v1.29, prerelease:/v1.28 or prerelease:/v1.29 if another stream package is used.

Bootstrapping a cluster using kubeadm can be done by running kubeadm init command, which automatically detects that CRI-O is running in the background. There are also Vagrantfile examples available for Fedora 38 as well as Ubuntu 22.04 for testing the packages together with kubeadm.

How it works under the hood

Everything related to these packages lives in the new CRI-O packaging repository. It contains a daily reconciliation GitHub action workflow, for all supported release branches as well as tags of CRI-O. A test pipeline in the OBS workflow ensures that the packages can be correctly installed and used before being published. All of the staging and publishing of the packages is done with the help of the Kubernetes Release Toolbox (krel), which is also used for the official Kubernetes deb and rpm packages.

The package build inputs will undergo daily reconciliation and will be supplied by CRI-O's static binary bundles. These bundles are built and signed for each commit in the CRI-O CI, and contain everything CRI-O requires to run on a certain architecture. The static builds are reproducible, powered by nixpkgs and available only for x86_64, aarch64 and ppc64le architecture.

The CRI-O maintainers will be happy to listen to any feedback or suggestions on the new packaging efforts! Thank you for reading this blog post, feel free to reach out to the maintainers via the Kubernetes Slack channel #crio or create an issue in the packaging repository.

Spotlight on SIG Architecture: Conformance

This is the first interview of a SIG Architecture Spotlight series that will cover the different subprojects. We start with the SIG Architecture: Conformance subproject

In this SIG Architecture spotlight, we talked with Riaan Kleinhans (ii.nz), Lead for the Conformance sub-project.

About SIG Architecture and the Conformance subproject

Frederico (FSM): Hello Riaan, and welcome! For starters, tell us a bit about yourself, your role and how you got involved in Kubernetes.

Riaan Kleinhans (RK): Hi! My name is Riaan Kleinhans and I live in South Africa. I am the Project manager for the ii.nz team in New Zealand. When I joined ii the plan was to move to New Zealand in April 2020 and then Covid happened. Fortunately, being a flexible and dynamic team we were able to make it work remotely and in very different time zones.

The ii team have been tasked with managing the Kubernetes Conformance testing technical debt and writing tests to clear the technical debt. I stepped into the role of project manager to be the link between monitoring, test writing and the community. Through that work I had the privilege of meeting Dan Kohn in those first months, his enthusiasm about the work we were doing was a great inspiration.

FSM: Thank you - so, your involvement in SIG Architecture started because of the conformance work?

RK: SIG Architecture is the home for the Kubernetes Conformance subproject. Initially, most of my interactions were directly with SIG Architecture through the Conformance sub-project. However, as we began organizing the work by SIG, we started engaging directly with each individual SIG. These engagements with the SIGs that own the untested APIs have helped us accelerate our work.

FSM: How would you describe the main goals and areas of intervention of the Conformance sub-project?

RM: The Kubernetes Conformance sub-project focuses on guaranteeing compatibility and adherence to the Kubernetes specification by developing and maintaining a comprehensive conformance test suite. Its main goals include assuring compatibility across different Kubernetes implementations, verifying adherence to the API specification, supporting the ecosystem by encouraging conformance certification, and fostering collaboration within the Kubernetes community. By providing standardised tests and promoting consistent behaviour and functionality, the Conformance subproject ensures a reliable and compatible Kubernetes ecosystem for developers and users alike.

More on the Conformance Test Suite

FSM: A part of providing those standardised tests is, I believe, the Conformance Test Suite. Could you explain what it is and its importance?

RK: The Kubernetes Conformance Test Suite checks if Kubernetes distributions meet the project's specifications, ensuring compatibility across different implementations. It covers various features like APIs, networking, storage, scheduling, and security. Passing the tests confirms proper implementation and promotes a consistent and portable container orchestration platform.

FSM: Right, the tests are important in the way they define the minimum features that any Kubernetes cluster must support. Could you describe the process around determining which features are considered for inclusion? Is there any tension between a more minimal approach, and proposals from the other SIGs?

RK: The requirements for each endpoint that undergoes conformance testing are clearly defined by SIG Architecture. Only API endpoints that are generally available and non-optional features are eligible for conformance. Over the years, there have been several discussions regarding conformance profiles, exploring the possibility of including optional endpoints like RBAC, which are widely used by most end users, in specific profiles. However, this aspect is still a work in progress.

Endpoints that do not meet the conformance criteria are listed in ineligible_endpoints.yaml, which is publicly accessible in the Kubernetes repo. This file can be updated to add or remove endpoints as their status or requirements change. These ineligible endpoints are also visible on APISnoop.

Ensuring transparency and incorporating community input regarding the eligibility or ineligibility of endpoints is of utmost importance to SIG Architecture.

FSM: Writing tests for new features is something generally requires some kind of enforcement. How do you see the evolution of this in Kubernetes? Was there a specific effort to improve the process in a way that required tests would be a first-class citizen, or was that never an issue?

RK: When discussions surrounding the Kubernetes conformance programme began in 2018, only approximately 11% of endpoints were covered by tests. At that time, the CNCF's governing board requested that if funding were to be provided for the work to cover missing conformance tests, the Kubernetes Community should adopt a policy of not allowing new features to be added unless they include conformance tests for their stable APIs.

SIG Architecture is responsible for stewarding this requirement, and APISnoop has proven to be an invaluable tool in this regard. Through automation, APISnoop generates a pull request every weekend to highlight any discrepancies in Conformance coverage. If any endpoints are promoted to General Availability without a conformance test, it will be promptly identified. This approach helps prevent the accumulation of new technical debt.

Additionally, there are plans in the near future to create a release informing job, which will add an additional layer to prevent any new technical debt.

FSM: I see, tooling and automation play an important role there. What are, in your opinion, the areas that, conformance-wise, still require some work to be done? In other words, what are the current priority areas marked for improvement?

RK: We have reached the “100% Conformance Tested” milestone in release 1.27!

At that point, the community took another look at all the endpoints that were listed as ineligible for conformance. The list was populated through community input over several years. Several endpoints that were previously deemed ineligible for conformance have been identified and relocated to a new dedicated list, which is currently receiving focused attention for conformance test development. Again, that list can also be checked on apisnoop.cncf.io.

To ensure the avoidance of new technical debt in the conformance project, there are upcoming plans to establish a release informing job as an additional preventive measure.

While APISnoop is currently hosted on CNCF infrastructure, the project has been generously donated to the Kubernetes community. Consequently, it will be transferred to community-owned infrastructure before the end of 2023.

FSM: That's great news! For anyone wanting to help, what are the venues for collaboration that you would highlight? Do all of them require solid knowledge of Kubernetes as a whole, or are there ways someone newer to the project can contribute?

RK: Contributing to conformance testing is akin to the task of "washing the dishes" – it may not be highly visible, but it remains incredibly important. It necessitates a strong understanding of Kubernetes, particularly in the areas where the endpoints need to be tested. This is why working with each SIG that owns the API endpoint being tested is so important.

As part of our commitment to making test writing accessible to everyone, the ii team is currently engaged in the development of a "click and deploy" solution. This solution aims to enable anyone to swiftly create a working environment on real hardware within minutes. We will share updates regarding this development as soon as we are ready.

FSM: That's very helpful, thank you. Any final comments you would like to share with our readers?

RK: Conformance testing is a collaborative community endeavour that involves extensive cooperation among SIGs. SIG Architecture has spearheaded the initiative and provided guidance. However, the progress of the work relies heavily on the support of all SIGs in reviewing, enhancing, and endorsing the tests.

I would like to extend my sincere appreciation to the ii team for their unwavering commitment to resolving technical debt over the years. In particular, Hippie Hacker's guidance and stewardship of the vision has been invaluable. Additionally, I want to give special recognition to Stephen Heywood for shouldering the majority of the test writing workload in recent releases, as well as to Zach Mandeville for his contributions to APISnoop.

FSM: Many thanks for your availability and insightful comments, I've personally learned quite a bit with it and I'm sure our readers will as well.

Announcing the 2023 Steering Committee Election Results

The 2023 Steering Committee Election is now complete. The Kubernetes Steering Committee consists of 7 seats, 4 of which were up for election in 2023. Incoming committee members serve a term of 2 years, and all members are elected by the Kubernetes Community.

This community body is significant since it oversees the governance of the entire Kubernetes project. With that great power comes great responsibility. You can learn more about the steering committee’s role in their charter.

Thank you to everyone who voted in the election; your participation helps support the community’s continued health and success.

Results

Congratulations to the elected committee members whose two year terms begin immediately (listed in alphabetical order by GitHub handle):

They join continuing members:

Stephen Augustus is a returning Steering Committee Member.

Big Thanks!

Thank you and congratulations on a successful election to this round’s election officers:

Thanks to the Emeritus Steering Committee Members. Your service is appreciated by the community:

And thank you to all the candidates who came forward to run for election.

Get Involved with the Steering Committee

This governing body, like all of Kubernetes, is open to all. You can follow along with Steering Committee backlog items and weigh in by filing an issue or creating a PR against their repo. They have an open meeting on the first Monday at 9:30am PT of every month. They can also be contacted at their public mailing list steering@kubernetes.io.

You can see what the Steering Committee meetings are all about by watching past meetings on the YouTube Playlist.

If you want to meet some of the newly elected Steering Committee members, join us for the Steering AMA at the Kubernetes Contributor Summit in Chicago.


This post was written by the Contributor Comms Subproject. If you want to write stories about the Kubernetes community, learn more about us.

Happy 7th Birthday kubeadm!

What a journey so far!

Starting from the initial blog post “How we made Kubernetes insanely easy to install” in September 2016, followed by an exciting growth that lead to general availability / “Production-Ready Kubernetes Cluster Creation with kubeadm” two years later.

And later on a continuous, steady and reliable flow of small improvements that is still going on as of today.

What is kubeadm? (quick refresher)

kubeadm is focused on bootstrapping Kubernetes clusters on existing infrastructure and performing an essential set of maintenance tasks. The core of the kubeadm interface is quite simple: new control plane nodes are created by running kubeadm init and worker nodes are joined to the control plane by running kubeadm join. Also included are utilities for managing already bootstrapped clusters, such as control plane upgrades and token and certificate renewal.

To keep kubeadm lean, focused, and vendor/infrastructure agnostic, the following tasks are out of its scope:

  • Infrastructure provisioning
  • Third-party networking
  • Non-critical add-ons, e.g. for monitoring, logging, and visualization
  • Specific cloud provider integrations

Infrastructure provisioning, for example, is left to other SIG Cluster Lifecycle projects, such as the Cluster API. Instead, kubeadm covers only the common denominator in every Kubernetes cluster: the control plane. The user may install their preferred networking solution and other add-ons on top of Kubernetes after cluster creation.

Behind the scenes, kubeadm does a lot. The tool makes sure you have all the key components: etcd, the API server, the scheduler, the controller manager. You can join more control plane nodes for improving resiliency or join worker nodes for running your workloads. You get cluster DNS and kube-proxy set up for you. TLS between components is enabled and used for encryption in transit.

Let's celebrate! Past, present and future of kubeadm

In all and for all kubeadm's story is tightly coupled with Kubernetes' story, and with this amazing community.

Therefore celebrating kubeadm is first of all celebrating this community, a set of people, who joined forces in finding a common ground, a minimum viable tool, for bootstrapping Kubernetes clusters.

This tool, was instrumental to the Kubernetes success back in time as well as it is today, and the silver line of kubeadm's value proposition can be summarized in two points

  • An obsession in making things deadly simple for the majority of the users: kubeadm init & kubeadm join, that's all you need!

  • A sharp focus on a well-defined problem scope: bootstrapping Kubernetes clusters on existing infrastructure. As our slogan says: keep it simple, keep it extensible!

This silver line, this clear contract, is the foundation the entire kubeadm user base relies on, and this post is a celebration for kubeadm's users as well.

We are deeply thankful for any feedback from our users, for the enthusiasm that they are continuously showing for this tool via Slack, GitHub, social media, blogs, in person at every KubeCon or at the various meet ups around the world. Keep going!

What continues to amaze me after all those years is the great things people are building on top of kubeadm, and as of today there is a strong and very active list of projects doing so:

  • minikube
  • kind
  • Cluster API
  • Kubespray
  • and many more; if you are using Kubernetes today, there is a good chance that you are using kubeadm even without knowing it 😜

This community, the kubeadm’s users, the projects building on top of kubeadm are the highlights of kubeadm’s 7th birthday celebration and the foundation for what will come next!

Stay tuned, and feel free to reach out to us!

  • Try kubeadm to install Kubernetes today
  • Get involved with the Kubernetes project on GitHub
  • Connect with the community on Slack
  • Follow us on Twitter @Kubernetesio for latest updates

kubeadm: Use etcd Learner to Join a Control Plane Node Safely

The kubeadm tool now supports etcd learner mode, which allows you to enhance the resilience and stability of your Kubernetes clusters by leveraging the learner mode feature introduced in etcd version 3.4. This guide will walk you through using etcd learner mode with kubeadm. By default, kubeadm runs a local etcd instance on each control plane node.

In v1.27, kubeadm introduced a new feature gate EtcdLearnerMode. With this feature gate enabled, when joining a new control plane node, a new etcd member will be created as a learner and promoted to a voting member only after the etcd data are fully aligned.

What are the advantages of using etcd learner mode?

etcd learner mode offers several compelling reasons to consider its adoption in Kubernetes clusters:

  1. Enhanced Resilience: etcd learner nodes are non-voting members that catch up with the leader's logs before becoming fully operational. This prevents new cluster members from disrupting the quorum or causing leader elections, making the cluster more resilient during membership changes.
  2. Reduced Cluster Unavailability: Traditional approaches to adding new members often result in cluster unavailability periods, especially in slow infrastructure or misconfigurations. etcd learner mode minimizes such disruptions.
  3. Simplified Maintenance: Learner nodes provide a safer and reversible way to add or replace cluster members. This reduces the risk of accidental cluster outages due to misconfigurations or missteps during member additions.
  4. Improved Network Tolerance: In scenarios involving network partitions, learner mode allows for more graceful handling. Depending on the partition a new member lands, it can seamlessly integrate with the existing cluster without causing disruptions.

In summary, the etcd learner mode improves the reliability and manageability of Kubernetes clusters during member additions and changes, making it a valuable feature for cluster operators.

How nodes join a cluster that's using the new mode

Create a Kubernetes cluster backed by etcd in learner mode

For a general explanation about creating highly available clusters with kubeadm, you can refer to Creating Highly Available Clusters with kubeadm.

To create a Kubernetes cluster, backed by etcd in learner mode, using kubeadm, follow these steps:

# kubeadm init --feature-gates=EtcdLearnerMode=true ...
kubeadm init --config=kubeadm-config.yaml

The kubeadm configuration file is like below:

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
featureGates:
  EtcdLearnerMode: true

The kubeadm tool deploys a single-node Kubernetes cluster with etcd set to use learner mode.

Join nodes to the Kubernetes cluster

Before joining a control-plane node to the new Kubernetes cluster, ensure that the existing control plane nodes and all etcd members are healthy.

Check the cluster health with etcdctl. If etcdctl isn't available, you can run this tool inside a container image. You would do that directly with your container runtime using a tool such as crictl run and not through Kubernetes

Here is an example on a client command that uses secure communication to check the cluster health of the etcd cluster:

ETCDCTL_API=3 etcdctl --endpoints 127.0.0.1:2379 \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  member list
...
dc543c4d307fadb9, started, node1, https://10.6.177.40:2380, https://10.6.177.40:2379, false

To check if the Kubernetes control plane is healthy, run kubectl get node -l node-role.kubernetes.io/control-plane= and check if the nodes are ready.

Note:

It is recommended to have an odd number of members in an etcd cluster.

Before joining a worker node to the new Kubernetes cluster, ensure that the control plane nodes are healthy.

What's next

Feedback

Was this guide helpful? If you have any feedback or encounter any issues, please let us know. Your feedback is always welcome! Join the bi-weekly SIG Cluster Lifecycle meeting or weekly kubeadm office hours. Or reach us via Slack (channel #kubeadm), or the SIG's mailing list.

User Namespaces: Now Supports Running Stateful Pods in Alpha!

Kubernetes v1.25 introduced support for user namespaces for only stateless pods. Kubernetes 1.28 lifted that restriction, after some design changes were done in 1.27.

The beauty of this feature is that:

  • it is trivial to adopt (you just need to set a bool in the pod spec)
  • doesn't need any changes for most applications
  • improves security by drastically enhancing the isolation of containers and mitigating CVEs rated HIGH and CRITICAL.

This post explains the basics of user namespaces and also shows:

  • the changes that arrived in the recent Kubernetes v1.28 release
  • a demo of a vulnerability rated as HIGH that is not exploitable with user namespaces
  • the runtime requirements to use this feature
  • what you can expect in future releases regarding user namespaces.

What is a user namespace?

A user namespace is a Linux feature that isolates the user and group identifiers (UIDs and GIDs) of the containers from the ones on the host. The indentifiers in the container can be mapped to indentifiers on the host in a way where the host UID/GIDs used for different containers never overlap. Even more, the identifiers can be mapped to unprivileged non-overlapping UIDs and GIDs on the host. This basically means two things:

  • As the UIDs and GIDs for different containers are mapped to different UIDs and GIDs on the host, containers have a harder time to attack each other even if they escape the container boundaries. For example, if container A is running with different UIDs and GIDs on the host than container B, the operations it can do on container B's files and process are limited: only read/write what a file allows to others, as it will never have permission for the owner or group (the UIDs/GIDs on the host are guaranteed to be different for different containers).

  • As the UIDs and GIDs are mapped to unprivileged users on the host, if a container escapes the container boundaries, even if it is running as root inside the container, it has no privileges on the host. This greatly protects what host files it can read/write, which process it can send signals to, etc.

Furthermore, capabilities granted are only valid inside the user namespace and not on the host.

Without using a user namespace a container running as root, in the case of a container breakout, has root privileges on the node. And if some capabilities were granted to the container, the capabilities are valid on the host too. None of this is true when using user namespaces (modulo bugs, of course 🙂).

Changes in 1.28

As already mentioned, starting from 1.28, Kubernetes supports user namespaces with stateful pods. This means that pods with user namespaces can use any type of volume, they are no longer limited to only some volume types as before.

The feature gate to activate this feature was renamed, it is no longer UserNamespacesStatelessPodsSupport but from 1.28 onwards you should use UserNamespacesSupport. There were many changes done and the requirements on the node hosts changed. So with Kubernetes 1.28 the feature flag was renamed to reflect this.

Demo

Rodrigo created a demo which exploits CVE 2022-0492 and shows how the exploit can occur without user namespaces. He also shows how it is not possible to use this exploit from a Pod where the containers are using this feature.

This vulnerability is rated HIGH and allows a container with no special privileges to read/write to any path on the host and launch processes as root on the host too.

Most applications in containers run as root today, or as a semi-predictable non-root user (user ID 65534 is a somewhat popular choice). When you run a Pod with containers using a userns, Kubernetes runs those containers as unprivileged users, with no changes needed in your app.

This means two containers running as user 65534 will effectively be mapped to different users on the host, limiting what they can do to each other in case of an escape, and if they are running as root, the privileges on the host are reduced to the one of an unprivileged user.

Node system requirements

There are requirements on the Linux kernel version as well as the container runtime to use this feature.

On Linux you need Linux 6.3 or greater. This is because the feature relies on a kernel feature named idmap mounts, and support to use idmap mounts with tmpfs was merged in Linux 6.3.

If you are using CRI-O with crun, this is supported in CRI-O 1.28.1 and crun 1.9 or greater. If you are using CRI-O with runc, this is still not supported.

containerd support is currently targeted for containerd 2.0; it is likely that it won't matter if you use it with crun or runc.

Please note that containerd 1.7 added experimental support for user namespaces as implemented in Kubernetes 1.25 and 1.26. The redesign done in 1.27 is not supported by containerd 1.7, therefore it only works, in terms of user namespaces support, with Kubernetes 1.25 and 1.26.

One limitation present in containerd 1.7 is that it needs to change the ownership of every file and directory inside the container image, during Pod startup. This means it has a storage overhead and can significantly impact the container startup latency. Containerd 2.0 will probably include a implementation that will eliminate the startup latency added and the storage overhead. Take this into account if you plan to use containerd 1.7 with user namespaces in production.

None of these containerd limitations apply to CRI-O 1.28.

What’s next?

Looking ahead to Kubernetes 1.29, the plan is to work with SIG Auth to integrate user namespaces to Pod Security Standards (PSS) and the Pod Security Admission. For the time being, the plan is to relax checks in PSS policies when user namespaces are in use. This means that the fields spec[.*].securityContext runAsUser, runAsNonRoot, allowPrivilegeEscalation and capabilities will not trigger a violation if user namespaces are in use. The behavior will probably be controlled by utilizing a API Server feature gate, like UserNamespacesPodSecurityStandards or similar.

How do I get involved?

You can reach SIG Node by several means:

You can also contact us directly:

  • GitHub: @rata @giuseppe @saschagrunert
  • Slack: @rata @giuseppe @sascha

Comparing Local Kubernetes Development Tools: Telepresence, Gefyra, and mirrord

The Kubernetes development cycle is an evolving landscape with a myriad of tools seeking to streamline the process. Each tool has its unique approach, and the choice often comes down to individual project requirements, the team's expertise, and the preferred workflow.

Among the various solutions, a category we dubbed “Local K8S Development tools” has emerged, which seeks to enhance the Kubernetes development experience by connecting locally running components to the Kubernetes cluster. This facilitates rapid testing of new code in cloud conditions, circumventing the traditional cycle of Dockerization, CI, and deployment.

In this post, we compare three solutions in this category: Telepresence, Gefyra, and our own contender, mirrord.

Telepresence

The oldest and most well-established solution in the category, Telepresence uses a VPN (or more specifically, a tun device) to connect the user's machine (or a locally running container) and the cluster's network. It then supports the interception of incoming traffic to a specific service in the cluster, and its redirection to a local port. The traffic being redirected can also be filtered to avoid completely disrupting the remote service. It also offers complementary features to support file access (by locally mounting a volume mounted to a pod) and importing environment variables. Telepresence requires the installation of a local daemon on the user's machine (which requires root privileges) and a Traffic Manager component on the cluster. Additionally, it runs an Agent as a sidecar on the pod to intercept the desired traffic.

Gefyra

Gefyra, similar to Telepresence, employs a VPN to connect to the cluster. However, it only supports connecting locally running Docker containers to the cluster. This approach enhances portability across different OSes and local setups. However, the downside is that it does not support natively run uncontainerized code.

Gefyra primarily focuses on network traffic, leaving file access and environment variables unsupported. Unlike Telepresence, it doesn't alter the workloads in the cluster, ensuring a straightforward clean-up process if things go awry.

mirrord

The newest of the three tools, mirrord adopts a different approach by injecting itself into the local binary (utilizing LD_PRELOAD on Linux or DYLD_INSERT_LIBRARIES on macOS), and overriding libc function calls, which it then proxies a temporary agent it runs in the cluster. For example, when the local process tries to read a file mirrord intercepts that call and sends it to the agent, which then reads the file from the remote pod. This method allows mirrord to cover all inputs and outputs to the process – covering network access, file access, and environment variables uniformly.

By working at the process level, mirrord supports running multiple local processes simultaneously, each in the context of their respective pod in the cluster, without requiring them to be containerized and without needing root permissions on the user’s machine.

Summary

Comparison of Telepresence, Gefyra, and mirrord
TelepresenceGefyramirrord
Cluster connection scopeEntire machine or containerContainerProcess
Developer OS supportLinux, macOS, WindowsLinux, macOS, WindowsLinux, macOS, Windows (WSL)
Incoming traffic featuresInterceptionInterceptionInterception or mirroring
File accessSupportedUnsupportedSupported
Environment variablesSupportedUnsupportedSupported
Requires local rootYesNoNo
How to use
  • CLI
  • Docker Desktop extension
  • CLI
  • Docker Desktop extension
  • CLI
  • Visual Studio Code extension
  • IntelliJ plugin

Conclusion

Telepresence, Gefyra, and mirrord each offer unique approaches to streamline the Kubernetes development cycle, each having its strengths and weaknesses. Telepresence is feature-rich but comes with complexities, mirrord offers a seamless experience and supports various functionalities, while Gefyra aims for simplicity and robustness.

Your choice between them should depend on the specific requirements of your project, your team's familiarity with the tools, and the desired development workflow. Whichever tool you choose, we believe the local Kubernetes development approach can provide an easy, effective, and cheap solution to the bottlenecks of the Kubernetes development cycle, and will become even more prevalent as these tools continue to innovate and evolve.

Kubernetes Legacy Package Repositories Will Be Frozen On September 13, 2023

On August 15, 2023, the Kubernetes project announced the general availability of the community-owned package repositories for Debian and RPM packages available at pkgs.k8s.io. The new package repositories are replacement for the legacy Google-hosted package repositories: apt.kubernetes.io and yum.kubernetes.io. The announcement blog post for pkgs.k8s.io highlighted that we will stop publishing packages to the legacy repositories in the future.

Today, we're formally deprecating the legacy package repositories (apt.kubernetes.io and yum.kubernetes.io), and we're announcing our plans to freeze the contents of the repositories as of September 13, 2023.

Please continue reading in order to learn what does this mean for you as an user or distributor, and what steps you may need to take.

ℹ️ Update (March 26, 2024): the legacy Google-hosted repositories went away on March 4, 2024. It's not possible to install Kubernetes packages from the legacy Google-hosted package repositories any longer.

How does this affect me as a Kubernetes end user?

This change affects users directly installing upstream versions of Kubernetes, either manually by following the official installation and upgrade instructions, or by using a Kubernetes installer that's using packages provided by the Kubernetes project.

This change also affects you if you run Linux on your own PC and have installed kubectl using the legacy package repositories. We'll explain later on how to check if you're affected.

If you use fully managed Kubernetes, for example through a service from a cloud provider, you would only be affected by this change if you also installed kubectl on your Linux PC using packages from the legacy repositories. Cloud providers are generally using their own Kubernetes distributions and therefore they don't use packages provided by the Kubernetes project; more importantly, if someone else is managing Kubernetes for you, then they would usually take responsibility for that check.

If you have a managed control plane but you are responsible for managing the nodes yourself, and any of those nodes run Linux, you should check whether you are affected.

If you're managing your clusters on your own by following the official installation and upgrade instructions, please follow the instructions in this blog post to migrate to the (new) community-owned package repositories.

If you're using a Kubernetes installer that's using packages provided by the Kubernetes project, please check the installer tool's communication channels for information about what steps you need to take, and eventually if needed, follow up with maintainers to let them know about this change.

The following diagram shows who's affected by this change in a visual form (click on diagram for the larger version):

Visual explanation of who's affected by the legacy repositories being deprecated and frozen. Textual explanation is available above this diagram.

How does this affect me as a Kubernetes distributor?

If you're using the legacy repositories as part of your project (e.g. a Kubernetes installer tool), you should migrate to the community-owned repositories as soon as possible and inform your users about this change and what steps they need to take.

Timeline of changes

(updated on March 26, 2024)

  • 15th August 2023:
    Kubernetes announces a new, community-managed source for Linux software packages of Kubernetes components
  • 31st August 2023:
    (this announcement) Kubernetes formally deprecates the legacy package repositories
  • 13th September 2023 (approximately):
    Kubernetes will freeze the legacy package repositories, (apt.kubernetes.io and yum.kubernetes.io). The freeze will happen immediately following the patch releases that are scheduled for September, 2023.
  • 12th January 2024:
    Kubernetes announced intentions to remove the legacy package repositories in January 2024
  • 4th March 2024:
    The legacy package repositories have been removed. It's not possible to install Kubernetes packages from the legacy package repositories any longer

The Kubernetes patch releases scheduled for September 2023 (v1.28.2, v1.27.6, v1.26.9, v1.25.14) will have packages published both to the community-owned and the legacy repositories.

We'll freeze the legacy repositories after cutting the patch releases for September which means that we'll completely stop publishing packages to the legacy repositories at that point.

For the v1.28, v1.27, v1.26, and v1.25 patch releases from October 2023 and onwards, we'll only publish packages to the new package repositories (pkgs.k8s.io).

What about future minor releases?

Kubernetes 1.29 and onwards will have packages published only to the community-owned repositories (pkgs.k8s.io).

What releases are available in the new community-owned package repositories?

Linux packages for releases starting from Kubernetes v1.24.0 are available in the Kubernetes package repositories (pkgs.k8s.io). Kubernetes does not have official Linux packages available for earlier releases of Kubernetes; however, your Linux distribution may provide its own packages.

Can I continue to use the legacy package repositories?

(updated on March 26, 2024)

The legacy Google-hosted repositories went away on March 4, 2024. It's not possible to install Kubernetes packages from the legacy Google-hosted package repositories any longer.

The existing packages in the legacy repositories will be available for the foreseeable future. However, the Kubernetes project can't provide any guarantees on how long is that going to be. The deprecated legacy repositories, and their contents, might be removed at any time in the future and without a further notice period.

The Kubernetes project strongly recommends migrating to the new community-owned repositories as soon as possible. Migrating to the new package repositories is required to consume the official Kubernetes packages.

Given that no new releases will be published to the legacy repositories after the September 13, 2023 cut-off point, you will not be able to upgrade to any patch or minor release made from that date onwards.

Whilst the project makes every effort to release secure software, there may one day be a high-severity vulnerability in Kubernetes, and consequently an important release to upgrade to. The advice we're announcing will help you be as prepared for any future security update, whether trivial or urgent.

How can I check if I'm using the legacy repositories?

The steps to check if you're using the legacy repositories depend on whether you're using Debian-based distributions (Debian, Ubuntu, and more) or RPM-based distributions (CentOS, RHEL, Rocky Linux, and more) in your cluster.

Run these instructions on one of your nodes in the cluster.

Debian-based Linux distributions

The repository definitions (sources) are located in /etc/apt/sources.list and /etc/apt/sources.list.d/ on Debian-based distributions. Inspect these two locations and try to locate a package repository definition that looks like:

deb [signed-by=/etc/apt/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main

If you find a repository definition that looks like this, you're using the legacy repository and you need to migrate.

If the repository definition uses pkgs.k8s.io, you're already using the community-hosted repositories and you don't need to take any action.

On most systems, this repository definition should be located in /etc/apt/sources.list.d/kubernetes.list (as recommended by the Kubernetes documentation), but on some systems it might be in a different location.

If you can't find a repository definition related to Kubernetes, it's likely that you don't use package managers to install Kubernetes and you don't need to take any action.

RPM-based Linux distributions

The repository definitions are located in /etc/yum.repos.d if you're using the yum package manager, or /etc/dnf/dnf.conf and /etc/dnf/repos.d/ if you're using dnf package manager. Inspect those locations and try to locate a package repository definition that looks like this:

[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-\$basearch
enabled=1
gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
exclude=kubelet kubeadm kubectl

If you find a repository definition that looks like this, you're using the legacy repository and you need to migrate.

If the repository definition uses pkgs.k8s.io, you're already using the community-hosted repositories and you don't need to take any action.

On most systems, that repository definition should be located in /etc/yum.repos.d/kubernetes.repo (as recommended by the Kubernetes documentation), but on some systems it might be in a different location.

If you can't find a repository definition related to Kubernetes, it's likely that you don't use package managers to install Kubernetes and you don't need to take any action.

How can I migrate to the new community-operated repositories?

For more information on how to migrate to the new community managed packages, please refer to the announcement blog post for pkgs.k8s.io.

Why is the Kubernetes project making this change?

Kubernetes has been publishing packages solely to the Google-hosted repository since Kubernetes v1.5, or the past seven years! Following in the footsteps of migrating to our community-managed registry, registry.k8s.io, we are now migrating the Kubernetes package repositories to our own community-managed infrastructure. We’re thankful to Google for their continuous hosting and support all these years, but this transition marks another big milestone for the project’s goal of migrating to complete community-owned infrastructure.

Is there a Kubernetes tool to help me migrate?

We don't have any announcement to make about tooling there. As a Kubernetes user, you have to manually modify your configuration to use the new repositories. Automating the migration from the legacy to the community-owned repositories is technically challenging and we want to avoid any potential risks associated with this.

Acknowledgments

First of all, we want to acknowledge the contributions from Alphabet. Staff at Google have provided their time; Google as a business has provided both the infrastructure to serve packages, and the security context for giving those packages trustworthy digital signatures. These have been important to the adoption and growth of Kubernetes.

Releasing software might not be glamorous but it's important. Many people within the Kubernetes contributor community have contributed to the new way that we, as a project, have for building and publishing packages.

And finally, we want to once again acknowledge the help from SUSE. OpenBuildService, from SUSE, is the technology that the powers the new community-managed package repositories.

Gateway API v0.8.0: Introducing Service Mesh Support

We are thrilled to announce the v0.8.0 release of Gateway API! With this release, Gateway API support for service mesh has reached Experimental status. We look forward to your feedback!

We're especially delighted to announce that Kuma 2.3+, Linkerd 2.14+, and Istio 1.16+ are all fully-conformant implementations of Gateway API service mesh support.

Service mesh support in Gateway API

While the initial focus of Gateway API was always ingress (north-south) traffic, it was clear almost from the beginning that the same basic routing concepts should also be applicable to service mesh (east-west) traffic. In 2022, the Gateway API subproject started the GAMMA initiative, a dedicated vendor-neutral workstream, specifically to examine how best to fit service mesh support into the framework of the Gateway API resources, without requiring users of Gateway API to relearn everything they understand about the API.

Over the last year, GAMMA has dug deeply into the challenges and possible solutions around using Gateway API for service mesh. The end result is a small number of enhancement proposals that subsume many hours of thought and debate, and provide a minimum viable path to allow Gateway API to be used for service mesh.

How will mesh routing work when using Gateway API?

You can find all the details in the Gateway API Mesh routing documentation and GEP-1426, but the short version for Gateway API v0.8.0 is that an HTTPRoute can now have a parentRef that is a Service, rather than just a Gateway. We anticipate future GEPs in this area as we gain more experience with service mesh use cases -- binding to a Service makes it possible to use the Gateway API with a service mesh, but there are several interesting use cases that remain difficult to cover.

As an example, you might use an HTTPRoute to do an A-B test in the mesh as follows:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: bar-route
spec:
  parentRefs:
  - group: ""
    kind: Service
    name: demo-app
    port: 5000
  rules:
  - matches:
    - headers:
      - type: Exact
        name: env
        value: v1
    backendRefs:
    - name: demo-app-v1
      port: 5000
  - backendRefs:
    - name: demo-app-v2
      port: 5000

Any request to port 5000 of the demo-app Service that has the header env: v1 will be routed to demo-app-v1, while any request without that header will be routed to demo-app-v2 -- and since this is being handled by the service mesh, not the ingress controller, the A/B test can happen anywhere in the application's call graph.

How do I know this will be truly portable?

Gateway API has been investing heavily in conformance tests across all features it supports, and mesh is no exception. One of the challenges that the GAMMA initiative ran into is that many of these tests were strongly tied to the idea that a given implementation provides an ingress controller. Many service meshes don't, and requiring a GAMMA-conformant mesh to also implement an ingress controller seemed impractical at best. This resulted in work restarting on Gateway API conformance profiles, as discussed in GEP-1709.

The basic idea of conformance profiles is that we can define subsets of the Gateway API, and allow implementations to choose (and document) which subsets they conform to. GAMMA is adding a new profile, named Mesh and described in GEP-1686, which checks only the mesh functionality as defined by GAMMA. At this point, Kuma 2.3+, Linkerd 2.14+, and Istio 1.16+ are all conformant with the Mesh profile.

What else is in Gateway API v0.8.0?

This release is all about preparing Gateway API for the upcoming v1.0 release where HTTPRoute, Gateway, and GatewayClass will graduate to GA. There are two main changes related to this: CEL validation and API version changes.

CEL Validation

The first major change is that Gateway API v0.8.0 is the start of a transition from webhook validation to CEL validation using information built into the CRDs. That will mean different things depending on the version of Kubernetes you're using:

Kubernetes 1.25+

CEL validation is fully supported, and almost all validation is implemented in CEL. (The sole exception is that header names in header modifier filters can only do case-insensitive validation. There is more information in issue 2277.)

We recommend not using the validating webhook on these Kubernetes versions.

Kubernetes 1.23 and 1.24

CEL validation is not supported, but Gateway API v0.8.0 CRDs can still be installed. When you upgrade to Kubernetes 1.25+, the validation included in these CRDs will automatically take effect.

We recommend continuing to use the validating webhook on these Kubernetes versions.

Kubernetes 1.22 and older

Gateway API only commits to support for 5 most recent versions of Kubernetes. As such, these versions are no longer supported by Gateway API, and unfortunately Gateway API v0.8.0 cannot be installed on them, since CRDs containing CEL validation will be rejected.

API Version Changes

As we prepare for a v1.0 release that will graduate Gateway, GatewayClass, and HTTPRoute to the v1 API Version from v1beta1, we are continuing the process of moving away from v1alpha2 for resources that have graduated to v1beta1. For more information on this change and everything else included in this release, refer to the v0.8.0 release notes.

How can I get started with Gateway API?

Gateway API represents the future of load balancing, routing, and service mesh APIs in Kubernetes. There are already more than 20 implementations available (including both ingress controllers and service meshes) and the list keeps growing.

If you're interested in getting started with Gateway API, take a look at the API concepts documentation and check out some of the Guides to try it out. Because this is a CRD-based API, you can install the latest version on any Kubernetes 1.23+ cluster.

If you're specifically interested in helping to contribute to Gateway API, we would love to have you! Please feel free to open a new issue on the repository, or join in the discussions. Also check out the community page which includes links to the Slack channel and community meetings. We look forward to seeing you!!

Further Reading:

  • GEP-1324 provides an overview of the GAMMA goals and some important definitions. This GEP is well worth a read for its discussion of the problem space.
  • GEP-1426 defines how to use Gateway API route resources, such as HTTPRoute, to manage traffic within a service mesh.
  • GEP-1686 builds on the work of GEP-1709 to define a conformance profile for service meshes to be declared conformant with Gateway API.

Although these are Experimental patterns, note that they are available in the standard release channel, since the GAMMA initiative has not needed to introduce new resources or fields to date.

Kubernetes 1.28: A New (alpha) Mechanism For Safer Cluster Upgrades

This blog describes the mixed version proxy, a new alpha feature in Kubernetes 1.28. The mixed version proxy enables an HTTP request for a resource to be served by the correct API server in cases where there are multiple API servers at varied versions in a cluster. For example, this is useful during a cluster upgrade, or when you're rolling out the runtime configuration of the cluster's control plane.

What problem does this solve?

When a cluster undergoes an upgrade, the kube-apiservers existing at different versions in that scenario can serve different sets (groups, versions, resources) of built-in resources. A resource request made in this scenario may be served by any of the available apiservers, potentially resulting in the request ending up at an apiserver that may not be aware of the requested resource; consequently it being served a 404 not found error which is incorrect. Furthermore, incorrect serving of the 404 errors can lead to serious consequences such as namespace deletion being blocked incorrectly or objects being garbage collected mistakenly.

How do we solve the problem?

The new feature “Mixed Version Proxy” provides the kube-apiserver with the capability to proxy a request to a peer kube-apiserver which is aware of the requested resource and hence can serve the request. To do this, a new filter has been added to the handler chain in the API server's aggregation layer.

  1. The new filter in the handler chain checks if the request is for a group/version/resource that the apiserver doesn't know about (using the existing StorageVersion API). If so, it proxies the request to one of the apiservers that is listed in the ServerStorageVersion object. If the identified peer apiserver fails to respond (due to reasons like network connectivity, race between the request being received and the controller registering the apiserver-resource info in ServerStorageVersion object), then error 503("Service Unavailable") is served.
  2. To prevent indefinite proxying of the request, a (new for v1.28) HTTP header X-Kubernetes-APIServer-Rerouted: true is added to the original request once it is determined that the request cannot be served by the original API server. Setting that to true marks that the original API server couldn't handle the request and it should therefore be proxied. If a destination peer API server sees this header, it never proxies the request further.
  3. To set the network location of a kube-apiserver that peers will use to proxy requests, the value passed in --advertise-address or (when --advertise-address is unspecified) the --bind-address flag is used. For users with network configurations that would not allow communication between peer kube-apiservers using the addresses specified in these flags, there is an option to pass in the correct peer address as --peer-advertise-ip and --peer-advertise-port flags that are introduced in this feature.

How do I enable this feature?

Following are the required steps to enable the feature:

  • Download the latest Kubernetes project (version v1.28.0 or later)
  • Switch on the feature gate with the command line flag --feature-gates=UnknownVersionInteroperabilityProxy=true on the kube-apiservers
  • Pass the CA bundle that will be used by source kube-apiserver to authenticate destination kube-apiserver's serving certs using the flag --peer-ca-file on the kube-apiservers. Note: this is a required flag for this feature to work. There is no default value enabled for this flag.
  • Pass the correct ip and port of the local kube-apiserver that will be used by peers to connect to this kube-apiserver while proxying a request. Use the flags --peer-advertise-ip and peer-advertise-port to the kube-apiservers upon startup. If unset, the value passed to either --advertise-address or --bind-address is used. If those too, are unset, the host's default interface will be used.

What’s missing?

Currently we only proxy resource requests to a peer kube-apiserver when its determined to do so. Next we need to address how to work discovery requests in such scenarios. Right now we are planning to have the following capabilities for beta

  • Merged discovery across all kube-apiservers
  • Use an egress dialer for network connections made to peer kube-apiservers

How can I learn more?

How can I get involved?

Reach us on Slack: #sig-api-machinery, or through the mailing list.

Huge thanks to the contributors that have helped in the design, implementation, and review of this feature: Daniel Smith, Han Kang, Joe Betz, Jordan Liggit, Antonio Ojea, David Eads and Ben Luddy!

Kubernetes v1.28: Introducing native sidecar containers

This post explains how to use the new sidecar feature, which enables restartable init containers and is available in alpha in Kubernetes 1.28. We want your feedback so that we can graduate this feature as soon as possible.

The concept of a “sidecar” has been part of Kubernetes since nearly the very beginning. In 2015, sidecars were described in a blog post about composite containers as additional containers that “extend and enhance the ‘main’ container”. Sidecar containers have become a common Kubernetes deployment pattern and are often used for network proxies or as part of a logging system. Until now, sidecars were a concept that Kubernetes users applied without native support. The lack of native support has caused some usage friction, which this enhancement aims to resolve.

What are sidecar containers in 1.28?

Kubernetes 1.28 adds a new restartPolicy field to init containers that is available when the SidecarContainers feature gate is enabled.

apiVersion: v1
kind: Pod
spec:
  initContainers:
  - name: secret-fetch
    image: secret-fetch:1.0
  - name: network-proxy
    image: network-proxy:1.0
    restartPolicy: Always
  containers:
  ...

The field is optional and, if set, the only valid value is Always. Setting this field changes the behavior of init containers as follows:

  • The container restarts if it exits
  • Any subsequent init container starts immediately after the startupProbe has successfully completed instead of waiting for the restartable init container to exit
  • The resource usage calculation changes for the pod as restartable init container resources are now added to the sum of the resource requests by the main containers

Pod termination continues to only depend on the main containers. An init container with a restartPolicy of Always (named a sidecar) won't prevent the pod from terminating after the main containers exit.

The following properties of restartable init containers make them ideal for the sidecar deployment pattern:

  • Init containers have a well-defined startup order regardless of whether you set a restartPolicy, so you can ensure that your sidecar starts before any container declarations that come after the sidecar declaration in your manifest.
  • Sidecar containers don't extend the lifetime of the Pod, so you can use them in short-lived Pods with no changes to the Pod lifecycle.
  • Sidecar containers are restarted on exit, which improves resilience and lets you use sidecars to provide services that your main containers can more reliably consume.

When to use sidecar containers

You might find built-in sidecar containers useful for workloads such as the following:

  • Batch or AI/ML workloads, or other Pods that run to completion. These workloads will experience the most significant benefits.
  • Network proxies that start up before any other container in the manifest. Every other container that runs can use the proxy container's services. For instructions, see the Kubernetes Native sidecars in Istio blog post.
  • Log collection containers, which can now start before any other container and run until the Pod terminates. This improves the reliability of log collection in your Pods.
  • Jobs, which can use sidecars for any purpose without Job completion being blocked by the running sidecar. No additional configuration is required to ensure this behavior.

How did users get sidecar behavior before 1.28?

Prior to the sidecar feature, the following options were available for implementing sidecar behavior depending on the desired lifetime of the sidecar container:

  • Lifetime of sidecar less than Pod lifetime: Use an init container, which provides well-defined startup order. However, the sidecar has to exit for other init containers and main Pod containers to start.
  • Lifetime of sidecar equal to Pod lifetime: Use a main container that runs alongside your workload containers in the Pod. This method doesn't give you control over startup order, and lets the sidecar container potentially block Pod termination after the workload containers exit.

The built-in sidecar feature solves for the use case of having a lifetime equal to the Pod lifetime and has the following additional benefits:

  • Provides control over startup order
  • Doesn’t block Pod termination

Transitioning existing sidecars to the new model

We recommend only using the sidecars feature gate in short lived testing clusters at the alpha stage. If you have an existing sidecar that is configured as a main container so it can run for the lifetime of the pod, it can be moved to the initContainers section of the pod spec and given a restartPolicy of Always. In many cases, the sidecar should work as before with the added benefit of having a defined startup ordering and not prolonging the pod lifetime.

Known issues

The alpha release of built-in sidecar containers has the following known issues, which we'll resolve before graduating the feature to beta:

  • The CPU, memory, device, and topology manager are unaware of the sidecar container lifetime and additional resource usage, and will operate as if the Pod had lower resource requests than it actually does.
  • The output of kubectl describe node is incorrect when sidecars are in use. The output shows resource usage that's lower than the actual usage because it doesn't use the new resource usage calculation for sidecar containers.

We need your feedback!

In the alpha stage, we want you to try out sidecar containers in your environments and open issues if you encounter bugs or friction points. We're especially interested in feedback about the following:

  • The shutdown sequence, especially with multiple sidecars running
  • The backoff timeout adjustment for crashing sidecars
  • The behavior of Pod readiness and liveness probes when sidecars are running

To open an issue, see the Kubernetes GitHub repository.

What’s next?

In addition to the known issues that will be resolved, we're working on adding termination ordering for sidecar and main containers. This will ensure that sidecar containers only terminate after the Pod's main containers have exited.

We’re excited to see the sidecar feature come to Kubernetes and are interested in feedback.

Acknowledgements

Many years have passed since the original KEP was written, so we apologize if we omit anyone who worked on this feature over the years. This is a best-effort attempt to recognize the people involved in this effort.

More Information

Kubernetes 1.28: Beta support for using swap on Linux

The 1.22 release introduced Alpha support for configuring swap memory usage for Kubernetes workloads running on Linux on a per-node basis. Now, in release 1.28, support for swap on Linux nodes has graduated to Beta, along with many new improvements.

Prior to version 1.22, Kubernetes did not provide support for swap memory on Linux systems. This was due to the inherent difficulty in guaranteeing and accounting for pod memory utilization when swap memory was involved. As a result, swap support was deemed out of scope in the initial design of Kubernetes, and the default behavior of a kubelet was to fail to start if swap memory was detected on a node.

In version 1.22, the swap feature for Linux was initially introduced in its Alpha stage. This represented a significant advancement, providing Linux users with the opportunity to experiment with the swap feature for the first time. However, as an Alpha version, it was not fully developed and had several issues, including inadequate support for cgroup v2, insufficient metrics and summary API statistics, inadequate testing, and more.

Swap in Kubernetes has numerous use cases for a wide range of users. As a result, the node special interest group within the Kubernetes project has invested significant effort into supporting swap on Linux nodes for beta. Compared to the alpha, the kubelet's support for running with swap enabled is more stable and robust, more user-friendly, and addresses many known shortcomings. This graduation to beta represents a crucial step towards achieving the goal of fully supporting swap in Kubernetes.

How do I use it?

The utilization of swap memory on a node where it has already been provisioned can be facilitated by the activation of the NodeSwap feature gate on the kubelet. Additionally, you must disable the failSwapOn configuration setting, or the deprecated --fail-swap-on command line flag must be deactivated.

It is possible to configure the memorySwap.swapBehavior option to define the manner in which a node utilizes swap memory. For instance,

# this fragment goes into the kubelet's configuration file
memorySwap:
  swapBehavior: UnlimitedSwap

The available configuration options for swapBehavior are:

  • UnlimitedSwap (default): Kubernetes workloads can use as much swap memory as they request, up to the system limit.
  • LimitedSwap: The utilization of swap memory by Kubernetes workloads is subject to limitations. Only Pods of Burstable QoS are permitted to employ swap.

If configuration for memorySwap is not specified and the feature gate is enabled, by default the kubelet will apply the same behaviour as the UnlimitedSwap setting.

Note that NodeSwap is supported for cgroup v2 only. For Kubernetes v1.28, using swap along with cgroup v1 is no longer supported.

Install a swap-enabled cluster with kubeadm

Before you begin

It is required for this demo that the kubeadm tool be installed, following the steps outlined in the kubeadm installation guide. If swap is already enabled on the node, cluster creation may proceed. If swap is not enabled, please refer to the provided instructions for enabling swap.

Create a swap file and turn swap on

I'll demonstrate creating 4GiB of unencrypted swap.

dd if=/dev/zero of=/swapfile bs=128M count=32
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
swapon -s # enable the swap file only until this node is rebooted

To start the swap file at boot time, add line like /swapfile swap swap defaults 0 0 to /etc/fstab file.

Set up a Kubernetes cluster that uses swap-enabled nodes

To make things clearer, here is an example kubeadm configuration file kubeadm-config.yaml for the swap enabled cluster.

---
apiVersion: "kubeadm.k8s.io/v1beta3"
kind: InitConfiguration
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false
featureGates:
  NodeSwap: true
memorySwap:
  swapBehavior: LimitedSwap

Then create a single-node cluster using kubeadm init --config kubeadm-config.yaml. During init, there is a warning that swap is enabled on the node and in case the kubelet failSwapOn is set to true. We plan to remove this warning in a future release.

How is the swap limit being determined with LimitedSwap?

The configuration of swap memory, including its limitations, presents a significant challenge. Not only is it prone to misconfiguration, but as a system-level property, any misconfiguration could potentially compromise the entire node rather than just a specific workload. To mitigate this risk and ensure the health of the node, we have implemented Swap in Beta with automatic configuration of limitations.

With LimitedSwap, Pods that do not fall under the Burstable QoS classification (i.e. BestEffort/Guaranteed Qos Pods) are prohibited from utilizing swap memory. BestEffort QoS Pods exhibit unpredictable memory consumption patterns and lack information regarding their memory usage, making it difficult to determine a safe allocation of swap memory. Conversely, Guaranteed QoS Pods are typically employed for applications that rely on the precise allocation of resources specified by the workload, with memory being immediately available. To maintain the aforementioned security and node health guarantees, these Pods are not permitted to use swap memory when LimitedSwap is in effect.

Prior to detailing the calculation of the swap limit, it is necessary to define the following terms:

  • nodeTotalMemory: The total amount of physical memory available on the node.
  • totalPodsSwapAvailable: The total amount of swap memory on the node that is available for use by Pods (some swap memory may be reserved for system use).
  • containerMemoryRequest: The container's memory request.

Swap limitation is configured as: (containerMemoryRequest / nodeTotalMemory) × totalPodsSwapAvailable

In other words, the amount of swap that a container is able to use is proportionate to its memory request, the node's total physical memory and the total amount of swap memory on the node that is available for use by Pods.

It is important to note that, for containers within Burstable QoS Pods, it is possible to opt-out of swap usage by specifying memory requests that are equal to memory limits. Containers configured in this manner will not have access to swap memory.

How does it work?

There are a number of possible ways that one could envision swap use on a node. When swap is already provisioned and available on a node, SIG Node have proposed the kubelet should be able to be configured so that:

  • It can start with swap on.
  • It will direct the Container Runtime Interface to allocate zero swap memory to Kubernetes workloads by default.

Swap configuration on a node is exposed to a cluster admin via the memorySwap in the KubeletConfiguration. As a cluster administrator, you can specify the node's behaviour in the presence of swap memory by setting memorySwap.swapBehavior.

The kubelet employs the CRI (container runtime interface) API to direct the CRI to configure specific cgroup v2 parameters (such as memory.swap.max) in a manner that will enable the desired swap configuration for a container. The CRI is then responsible to write these settings to the container-level cgroup.

How can I monitor swap?

A notable deficiency in the Alpha version was the inability to monitor and introspect swap usage. This issue has been addressed in the Beta version introduced in Kubernetes 1.28, which now provides the capability to monitor swap usage through several different methods.

The beta version of kubelet now collects node-level metric statistics, which can be accessed at the /metrics/resource and /stats/summary kubelet HTTP endpoints. This allows clients who can directly interrogate the kubelet to monitor swap usage and remaining swap memory when using LimitedSwap. Additionally, a machine_swap_bytes metric has been added to cadvisor to show the total physical swap capacity of the machine.

Caveats

Having swap available on a system reduces predictability. Swap's performance is worse than regular memory, sometimes by many orders of magnitude, which can cause unexpected performance regressions. Furthermore, swap changes a system's behaviour under memory pressure. Since enabling swap permits greater memory usage for workloads in Kubernetes that cannot be predictably accounted for, it also increases the risk of noisy neighbours and unexpected packing configurations, as the scheduler cannot account for swap memory usage.

The performance of a node with swap memory enabled depends on the underlying physical storage. When swap memory is in use, performance will be significantly worse in an I/O operations per second (IOPS) constrained environment, such as a cloud VM with I/O throttling, when compared to faster storage mediums like solid-state drives or NVMe.

As such, we do not advocate the utilization of swap memory for workloads or environments that are subject to performance constraints. Furthermore, it is recommended to employ LimitedSwap, as this significantly mitigates the risks posed to the node.

Cluster administrators and developers should benchmark their nodes and applications before using swap in production scenarios, and we need your help with that!

Security risk

Enabling swap on a system without encryption poses a security risk, as critical information, such as volumes that represent Kubernetes Secrets, may be swapped out to the disk. If an unauthorized individual gains access to the disk, they could potentially obtain these confidential data. To mitigate this risk, the Kubernetes project strongly recommends that you encrypt your swap space. However, handling encrypted swap is not within the scope of kubelet; rather, it is a general OS configuration concern and should be addressed at that level. It is the administrator's responsibility to provision encrypted swap to mitigate this risk.

Furthermore, as previously mentioned, with LimitedSwap the user has the option to completely disable swap usage for a container by specifying memory requests that are equal to memory limits. This will prevent the corresponding containers from accessing swap memory.

Looking ahead

The Kubernetes 1.28 release introduced Beta support for swap memory on Linux nodes, and we will continue to work towards general availability for this feature. I hope that this will include:

  • Add the ability to set a system-reserved quantity of swap from what kubelet detects on the host.
  • Adding support for controlling swap consumption at the Pod level via cgroups.
    • This point is still under discussion.
  • Collecting feedback from test user cases.
    • We will consider introducing new configuration modes for swap, such as a node-wide swap limit for workloads.

How can I learn more?

You can review the current documentation for using swap with Kubernetes.

For more information, and to assist with testing and provide feedback, please see KEP-2400 and its design proposal.

How do I get involved?

Your feedback is always welcome! SIG Node meets regularly and can be reached via Slack (channel #sig-node), or the SIG's mailing list. A Slack channel dedicated to swap is also available at #sig-node-swap.

Feel free to reach out to me, Itamar Holder (@iholder101 on Slack and GitHub) if you'd like to help or ask further questions.

Kubernetes 1.28: Node podresources API Graduates to GA

The podresources API is an API served by the kubelet locally on the node, which exposes the compute resources exclusively allocated to containers. With the release of Kubernetes 1.28, that API is now Generally Available.

What problem does it solve?

The kubelet can allocate exclusive resources to containers, like CPUs, granting exclusive access to full cores or memory, either regions or hugepages. Workloads which require high performance, or low latency (or both) leverage these features. The kubelet also can assign devices to containers. Collectively, these features which enable exclusive assignments are known as "resource managers".

Without an API like podresources, the only possible option to learn about resource assignment was to read the state files the resource managers use. While done out of necessity, the problem with this approach is the path and the format of these file are both internal implementation details. Albeit very stable, the project reserves the right to change them freely. Consuming the content of the state files is thus fragile and unsupported, and projects doing this are recommended to consider moving to podresources API or to other supported APIs.

Overview of the API

The podresources API was initially proposed to enable device monitoring. In order to enable monitoring agents, a key prerequisite is to enable introspection of device assignment, which is performed by the kubelet. Serving this purpose was the initial goal of the API. The first iteration of the API only had a single function implemented, List, to return information about the assignment of devices to containers. The API is used by multus CNI and by GPU monitoring tools.

Since its inception, the podresources API increased its scope to cover other resource managers than device manager. Starting from Kubernetes 1.20, the List API reports also CPU cores and memory regions (including hugepages); the API also reports the NUMA locality of the devices, while the locality of CPUs and memory can be inferred from the system.

In Kubernetes 1.21, the API gained the GetAllocatableResources function. This newer API complements the existing List API and enables monitoring agents to determine the unallocated resources, thus enabling new features built on top of the podresources API like a NUMA-aware scheduler plugin.

Finally, in Kubernetes 1.27, another function, Get was introduced to be more friendly with CNI meta-plugins, to make it simpler to access resources allocated to a specific pod, rather than having to filter through resources for all pods on the node. The Get function is currently alpha level.

Consuming the API

The podresources API is served by the kubelet locally, on the same node on which is running. On unix flavors, the endpoint is served over a unix domain socket; the default path is /var/lib/kubelet/pod-resources/kubelet.sock. On windows, the endpoint is served over a named pipe; the default path is npipe://\\.\pipe\kubelet-pod-resources.

In order for the containerized monitoring application consume the API, the socket should be mounted inside the container. A good practice is to mount the directory on which the podresources socket endpoint sits rather than the socket directly. This will ensure that after a kubelet restart, the containerized monitor application will be able to re-connect to the socket.

An example manifest for a hypothetical monitoring agent consuming the podresources API and deployed as a DaemonSet could look like:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: podresources-monitoring-app
  namespace: monitoring
spec:
  selector:
    matchLabels:
      name: podresources-monitoring
  template:
    metadata:
      labels:
        name: podresources-monitoring
    spec:
      containers:
      - args:
        - --podresources-socket=unix:///host-podresources/kubelet.sock
        command:
        - /bin/podresources-monitor
        image: podresources-monitor:latest  # just for an example
        volumeMounts:
        - mountPath: /host-podresources
          name: host-podresources
      serviceAccountName: podresources-monitor
      volumes:
      - hostPath:
          path: /var/lib/kubelet/pod-resources
          type: Directory
        name: host-podresources

I hope you find it straightforward to consume the podresources API programmatically. The kubelet API package provides the protocol file and the go type definitions; however, a client package is not yet available from the project, and the existing code should not be used directly. The recommended approach is to reimplement the client in your projects, copying and pasting the related functions like for example the multus project is doing.

When operating the containerized monitoring application consuming the podresources API, few points are worth highlighting to prevent "gotcha" moments:

  • Even though the API only exposes data, and doesn't allow by design clients to mutate the kubelet state, the gRPC request/response model requires read-write access to the podresources API socket. In other words, it is not possible to limit the container mount to ReadOnly.
  • Multiple clients are allowed to connect to the podresources socket and consume the API, since it is stateless.
  • The kubelet has built-in rate limits to mitigate local Denial of Service attacks from misbehaving or malicious consumers. The consumers of the API must tolerate rate limit errors returned by the server. The rate limit is currently hardcoded and global, so misbehaving clients can consume all the quota and potentially starve correctly behaving clients.

Future enhancements

For historical reasons, the podresources API has a less precise specification than typical kubernetes APIs (such as the Kubernetes HTTP API, or the container runtime interface). This leads to unspecified behavior in corner cases. An effort is ongoing to rectify this state and to have a more precise specification.

The Dynamic Resource Allocation (DRA) infrastructure is a major overhaul of the resource management. The integration with the podresources API is already ongoing.

An effort is ongoing to recommend or create a reference client package ready to be consumed.

Getting involved

This feature is driven by SIG Node. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!

Kubernetes 1.28: Improved failure handling for Jobs

This blog discusses two new features in Kubernetes 1.28 to improve Jobs for batch users: Pod replacement policy and Backoff limit per index.

These features continue the effort started by the Pod failure policy to improve the handling of Pod failures in a Job.

Pod replacement policy

By default, when a pod enters a terminating state (e.g. due to preemption or eviction), Kubernetes immediately creates a replacement Pod. Therefore, both Pods are running at the same time. In API terms, a pod is considered terminating when it has a deletionTimestamp and it has a phase Pending or Running.

The scenario when two Pods are running at a given time is problematic for some popular machine learning frameworks, such as TensorFlow and JAX, which require at most one Pod running at the same time, for a given index. Tensorflow gives the following error if two pods are running for a given index.

 /job:worker/task:4: Duplicate task registration with task_name=/job:worker/replica:0/task:4

See more details in the (issue).

Creating the replacement Pod before the previous one fully terminates can also cause problems in clusters with scarce resources or with tight budgets, such as:

  • cluster resources can be difficult to obtain for Pods pending to be scheduled, as Kubernetes might take a long time to find available nodes until the existing Pods are fully terminated.
  • if cluster autoscaler is enabled, the replacement Pods might produce undesired scale ups.

How can you use it?

This is an alpha feature, which you can enable by turning on JobPodReplacementPolicy feature gate in your cluster.

Once the feature is enabled in your cluster, you can use it by creating a new Job that specifies a podReplacementPolicy field as shown here:

kind: Job
metadata:
  name: new
  ...
spec:
  podReplacementPolicy: Failed
  ...

In that Job, the Pods would only be replaced once they reached the Failed phase, and not when they are terminating.

Additionally, you can inspect the .status.terminating field of a Job. The value of the field is the number of Pods owned by the Job that are currently terminating.

kubectl get jobs/myjob -o=jsonpath='{.items[*].status.terminating}'
3 # three Pods are terminating and have not yet reached the Failed phase

This can be particularly useful for external queueing controllers, such as Kueue, that tracks quota from running Pods of a Job until the resources are reclaimed from the currently terminating Job.

Note that the podReplacementPolicy: Failed is the default when using a custom Pod failure policy.

Backoff limit per index

By default, Pod failures for Indexed Jobs are counted towards the global limit of retries, represented by .spec.backoffLimit. This means, that if there is a consistently failing index, it is restarted repeatedly until it exhausts the limit. Once the limit is reached the entire Job is marked failed and some indexes may never be even started.

This is problematic for use cases where you want to handle Pod failures for every index independently. For example, if you use Indexed Jobs for running integration tests where each index corresponds to a testing suite. In that case, you may want to account for possible flake tests allowing for 1 or 2 retries per suite. There might be some buggy suites, making the corresponding indexes fail consistently. In that case you may prefer to limit retries for the buggy suites, yet allowing other suites to complete.

The feature allows you to:

  • complete execution of all indexes, despite some indexes failing.
  • better utilize the computational resources by avoiding unnecessary retries of consistently failing indexes.

How can you use it?

This is an alpha feature, which you can enable by turning on the JobBackoffLimitPerIndex feature gate in your cluster.

Once the feature is enabled in your cluster, you can create an Indexed Job with the .spec.backoffLimitPerIndex field specified.

Example

The following example demonstrates how to use this feature to make sure the Job executes all indexes (provided there is no other reason for the early Job termination, such as reaching the activeDeadlineSeconds timeout, or being manually deleted by the user), and the number of failures is controlled per index.

apiVersion: batch/v1
kind: Job
metadata:
  name: job-backoff-limit-per-index-execute-all
spec:
  completions: 8
  parallelism: 2
  completionMode: Indexed
  backoffLimitPerIndex: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: example # this example container returns an error, and fails,
                      # when it is run as the second or third index in any Job
                      # (even after a retry)        
        image: python
        command:
        - python3
        - -c
        - |
          import os, sys, time
          id = int(os.environ.get("JOB_COMPLETION_INDEX"))
          if id == 1 or id == 2:
            sys.exit(1)
          time.sleep(1)

Now, inspect the Pods after the job is finished:

kubectl get pods -l job-name=job-backoff-limit-per-index-execute-all

Returns output similar to this:

NAME                                              READY   STATUS      RESTARTS   AGE
job-backoff-limit-per-index-execute-all-0-b26vc   0/1     Completed   0          49s
job-backoff-limit-per-index-execute-all-1-6j5gd   0/1     Error       0          49s
job-backoff-limit-per-index-execute-all-1-6wd82   0/1     Error       0          37s
job-backoff-limit-per-index-execute-all-2-c66hg   0/1     Error       0          32s
job-backoff-limit-per-index-execute-all-2-nf982   0/1     Error       0          43s
job-backoff-limit-per-index-execute-all-3-cxmhf   0/1     Completed   0          33s
job-backoff-limit-per-index-execute-all-4-9q6kq   0/1     Completed   0          28s
job-backoff-limit-per-index-execute-all-5-z9hqf   0/1     Completed   0          28s
job-backoff-limit-per-index-execute-all-6-tbkr8   0/1     Completed   0          23s
job-backoff-limit-per-index-execute-all-7-hxjsq   0/1     Completed   0          22s

Additionally, you can take a look at the status for that Job:

kubectl get jobs job-backoff-limit-per-index-fail-index -o yaml

The output ends with a status similar to:

  status:
    completedIndexes: 0,3-7
    failedIndexes: 1,2
    succeeded: 6
    failed: 4
    conditions:
    - message: Job has failed indexes
      reason: FailedIndexes
      status: "True"
      type: Failed

Here, indexes 1 and 2 were both retried once. After the second failure, in each of them, the specified .spec.backoffLimitPerIndex was exceeded, so the retries were stopped. For comparison, if the per-index backoff was disabled, then the buggy indexes would retry until the global backoffLimit was exceeded, and then the entire Job would be marked failed, before some of the higher indexes are started.

How can you learn more?

Getting Involved

These features were sponsored by SIG Apps. Batch use cases are actively being improved for Kubernetes users in the batch working group. Working groups are relatively short-lived initiatives focused on specific goals. The goal of the WG Batch is to improve experience for batch workload users, offer support for batch processing use cases, and enhance the Job API for common use cases. If that interests you, please join the working group either by subscriping to our mailing list or on Slack.

Acknowledgments

As with any Kubernetes feature, multiple people contributed to getting this done, from testing and filing bugs to reviewing code.

We would not have been able to achieve either of these features without Aldo Culquicondor (Google) providing excellent domain knowledge and expertise throughout the Kubernetes ecosystem.

Kubernetes v1.28: Retroactive Default StorageClass move to GA

Announcing graduation to General Availability (GA) - Retroactive Default StorageClass Assignment in Kubernetes v1.28!

Kubernetes SIG Storage team is thrilled to announce that the "Retroactive Default StorageClass Assignment" feature, introduced as an alpha in Kubernetes v1.25, has now graduated to GA and is officially part of the Kubernetes v1.28 release. This enhancement brings a significant improvement to how default StorageClasses are assigned to PersistentVolumeClaims (PVCs).

With this feature enabled, you no longer need to create a default StorageClass first and then a PVC to assign the class. Instead, any PVCs without a StorageClass assigned will now be retroactively updated to include the default StorageClass. This enhancement ensures that PVCs no longer get stuck in an unbound state, and storage provisioning works seamlessly, even when a default StorageClass is not defined at the time of PVC creation.

What changed?

The PersistentVolume (PV) controller has been modified to automatically assign a default StorageClass to any unbound PersistentVolumeClaim with the storageClassName not set. Additionally, the PersistentVolumeClaim admission validation mechanism within the API server has been adjusted to allow changing values from an unset state to an actual StorageClass name.

How to use it?

As this feature has graduated to GA, there's no need to enable a feature gate anymore. Simply make sure you are running Kubernetes v1.28 or later, and the feature will be available for use.

For more details, read about default StorageClass assignment in the Kubernetes documentation. You can also read the previous blog post announcing beta graduation in v1.26.

To provide feedback, join our Kubernetes Storage Special-Interest-Group (SIG) or participate in discussions on our public Slack channel.

Kubernetes 1.28: Non-Graceful Node Shutdown Moves to GA

The Kubernetes Non-Graceful Node Shutdown feature is now GA in Kubernetes v1.28. It was introduced as alpha in Kubernetes v1.24, and promoted to beta in Kubernetes v1.26. This feature allows stateful workloads to restart on a different node if the original node is shutdown unexpectedly or ends up in a non-recoverable state such as the hardware failure or unresponsive OS.

What is a Non-Graceful Node Shutdown

In a Kubernetes cluster, a node can be shutdown in a planned graceful way or unexpectedly because of reasons such as power outage or something else external. A node shutdown could lead to workload failure if the node is not drained before the shutdown. A node shutdown can be either graceful or non-graceful.

The Graceful Node Shutdown feature allows Kubelet to detect a node shutdown event, properly terminate the pods, and release resources, before the actual shutdown.

When a node is shutdown but not detected by Kubelet's Node Shutdown Manager, this becomes a non-graceful node shutdown. Non-graceful node shutdown is usually not a problem for stateless apps, however, it is a problem for stateful apps. The stateful application cannot function properly if the pods are stuck on the shutdown node and are not restarting on a running node.

In the case of a non-graceful node shutdown, you can manually add an out-of-service taint on the Node.

kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

This taint triggers pods on the node to be forcefully deleted if there are no matching tolerations on the pods. Persistent volumes attached to the shutdown node will be detached, and new pods will be created successfully on a different running node.

Note: Before applying the out-of-service taint, you must verify that a node is already in shutdown or power-off state (not in the middle of restarting).

Once all the workload pods that are linked to the out-of-service node are moved to a new running node, and the shutdown node has been recovered, you should remove that taint on the affected node after the node is recovered.

What’s new in stable

With the promotion of the Non-Graceful Node Shutdown feature to stable, the feature gate NodeOutOfServiceVolumeDetach is locked to true on kube-controller-manager and cannot be disabled.

Metrics force_delete_pods_total and force_delete_pod_errors_total in the Pod GC Controller are enhanced to account for all forceful pods deletion. A reason is added to the metric to indicate whether the pod is forcefully deleted because it is terminated, orphaned, terminating with the out-of-service taint, or terminating and unscheduled.

A "reason" is also added to the metric attachdetach_controller_forced_detaches in the Attach Detach Controller to indicate whether the force detach is caused by the out-of-service taint or a timeout.

What’s next?

This feature requires a user to manually add a taint to the node to trigger workloads failover and remove the taint after the node is recovered. In the future, we plan to find ways to automatically detect and fence nodes that are shutdown/failed and automatically failover workloads to another node.

How can I learn more?

Check out additional documentation on this feature here.

How to get involved?

We offer a huge thank you to all the contributors who helped with design, implementation, and review of this feature and helped move it from alpha, beta, to stable:

This feature is a collaboration between SIG Storage and SIG Node. For those interested in getting involved with the design and development of any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). For those interested in getting involved with the design and development of the components that support the controlled interactions between pods and host resources, join the Kubernetes Node SIG.

pkgs.k8s.io: Introducing Kubernetes Community-Owned Package Repositories

On behalf of Kubernetes SIG Release, I am very excited to introduce the Kubernetes community-owned software repositories for Debian and RPM packages: pkgs.k8s.io! The new package repositories are replacement for the Google-hosted package repositories (apt.kubernetes.io and yum.kubernetes.io) that we've been using since Kubernetes v1.5.

This blog post contains information about these new package repositories, what does it mean to you as an end user, and how to migrate to the new repositories.

ℹ️ Update (March 26, 2024): the legacy Google-hosted repositories went away on March 4, 2024. It's not possible to install Kubernetes packages from the legacy Google-hosted package repositories any longer. Check out the deprecation announcement for more details about this change.

What you need to know about the new package repositories?

(updated on January 12, 2024 and March 26, 2024)

  • This is an opt-in change; you're required to manually migrate from the Google-hosted repository to the Kubernetes community-owned repositories. See how to migrate later in this announcement for migration information and instructions.
  • The legacy Google-hosted package repositories went away on March 4, 2024. It's not possible to install Kubernetes packages from the legacy Google-hosted package repositories any longer. These repositories have been deprecated as of August 31, 2023, and frozen as of September 13, 2023. Check out the deprecation announcement for more details about this change.
  • The legacy Google-hosted package repositories are going away in January 2024. These repositories have been deprecated as of August 31, 2023, and frozen as of September 13, 2023. Check out the deprecation announcement for more details about this change.
  • The existing packages in the legacy repositories will be available for the foreseeable future. However, the Kubernetes project can't provide any guarantees on how long is that going to be. The deprecated legacy repositories, and their contents, might be removed at any time in the future and without a further notice period. The legacy package repositories are going away in January 2024. The legacy Google-hosted package repositories went away on March 4, 2024.
  • Given that no new releases will be published to the legacy repositories after the September 13, 2023 cut-off point, you will not be able to upgrade to any patch or minor release made from that date onwards if you don't migrate to the new Kubernetes package repositories. That said, we recommend migrating to the new Kubernetes package repositories as soon as possible. Migrating to the new Kubernetes package repositories is required to consume the official Kubernetes packages.
  • The new Kubernetes package repositories contain packages beginning with those Kubernetes versions that were still under support when the community took over the package builds. This means that the new package repositories have Linux packages for all Kubernetes releases starting with v1.24.0.
  • Kubernetes does not have official Linux packages available for earlier releases of Kubernetes; however, your Linux distribution may provide its own packages.
  • There's a dedicated package repository for each Kubernetes minor version. When upgrading to a different minor release, you must bear in mind that the package repository details also change. Check out Changing The Kubernetes Package Repository guide for information about steps that you need to take upon upgrading the Kubernetes minor version.

Why are we introducing new package repositories?

As the Kubernetes project is growing, we want to ensure the best possible experience for the end users. The Google-hosted repository has been serving us well for many years, but we started facing some problems that require significant changes to how we publish packages. Another goal that we have is to use community-owned infrastructure for all critical components and that includes package repositories.

Publishing packages to the Google-hosted repository is a manual process that can be done only by a team of Google employees called Google Build Admins. The Kubernetes Release Managers team is a very diverse team especially in terms of timezones that we work in. Given this constraint, we have to do very careful planning for every release to ensure that we have both Release Manager and Google Build Admin available to carry out the release.

Another problem is that we only have a single package repository. Because of this, we were not able to publish packages for prerelease versions (alpha, beta, and rc). This made testing Kubernetes prereleases harder for anyone who is interested to do so. The feedback that we receive from people testing these releases is critical to ensure the best quality of releases, so we want to make testing these releases as easy as possible. On top of that, having only one repository limited us when it comes to publishing dependencies like cri-tools and kubernetes-cni.

Regardless of all these issues, we're very thankful to Google and Google Build Admins for their involvement, support, and help all these years!

How the new package repositories work?

The new package repositories are hosted at pkgs.k8s.io for both Debian and RPM packages. At this time, this domain points to a CloudFront CDN backed by S3 bucket that contains repositories and packages. However, we plan on onboarding additional mirrors in the future, giving possibility for other companies to help us with serving packages.

Packages are built and published via the OpenBuildService (OBS) platform. After a long period of evaluating different solutions, we made a decision to use OpenBuildService as a platform to manage our repositories and packages. First of all, OpenBuildService is an open source platform used by a large number of open source projects and companies, like openSUSE, VideoLAN, Dell, Intel, and more. OpenBuildService has many features making it very flexible and easy to integrate with our existing release tooling. It also allows us to build packages in a similar way as for the Google-hosted repository making the migration process as seamless as possible.

SUSE sponsors the Kubernetes project with access to their reference OpenBuildService setup (build.opensuse.org) and with technical support to integrate OBS with our release processes.

We use SUSE's OBS instance for building and publishing packages. Upon building a new release, our tooling automatically pushes needed artifacts and package specifications to build.opensuse.org. That will trigger the build process that's going to build packages for all supported architectures (AMD64, ARM64, PPC64LE, S390X). At the end, generated packages will be automatically pushed to our community-owned S3 bucket making them available to all users.

We want to take this opportunity to thank SUSE for allowing us to use build.opensuse.org and their generous support to make this integration possible!

What are significant differences between the Google-hosted and Kubernetes package repositories?

There are three significant differences that you should be aware of:

  • There's a dedicated package repository for each Kubernetes minor release. For example, repository called core:/stable:/v1.28 only hosts packages for stable Kubernetes v1.28 releases. This means you can install v1.28.0 from this repository, but you can't install v1.27.0 or any other minor release other than v1.28. Upon upgrading to another minor version, you have to add a new repository and optionally remove the old one
  • There's a difference in what cri-tools and kubernetes-cni package versions are available in each Kubernetes repository
    • These two packages are dependencies for kubelet and kubeadm
    • Kubernetes repositories for v1.24 to v1.27 have same versions of these packages as the Google-hosted repository
    • Kubernetes repositories for v1.28 and onwards are going to have published only versions that are used by that Kubernetes minor release
      • Speaking of v1.28, only kubernetes-cni 1.2.0 and cri-tools v1.28 are going to be available in the repository for Kubernetes v1.28
      • Similar for v1.29, we only plan on publishing cri-tools v1.29 and whatever kubernetes-cni version is going to be used by Kubernetes v1.29
  • The revision part of the package version (the -00 part in 1.28.0-00) is now autogenerated by the OpenBuildService platform and has a different format. The revision is now in the format of -x.y, e.g. 1.28.0-1.1

Does this in any way affect existing Google-hosted repositories?

(updated on March 26, 2024)

The legacy Google-hosted repositories went away on March 4, 2024. It's not possible to install Kubernetes packages from the legacy Google-hosted package repositories any longer. Check out the deprecation announcement for more details about this change.

The Google-hosted repository and all packages published to it will continue working in the same way as before. There are no changes in how we build and publish packages to the Google-hosted repository, all newly-introduced changes are only affecting packages publish to the community-owned repositories.

However, as mentioned at the beginning of this blog post, we plan to stop publishing packages to the Google-hosted repository in the future.

How to migrate to the Kubernetes community-owned repositories?

Debian, Ubuntu, and operating systems using apt/apt-get

  1. Replace the apt repository definition so that apt points to the new repository instead of the Google-hosted repository. Make sure to replace the Kubernetes minor version in the command below with the minor version that you're currently using:

    echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.28/deb/ /" | sudo tee /etc/apt/sources.list.d/kubernetes.list
    
  2. Download the public signing key for the Kubernetes package repositories. The same signing key is used for all repositories, so you can disregard the version in the URL:

    curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.28/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
    

    Update: In releases older than Debian 12 and Ubuntu 22.04, the folder /etc/apt/keyrings does not exist by default, and it should be created before the curl command.

  3. Update the apt package index:

    sudo apt-get update
    

CentOS, Fedora, RHEL, and operating systems using rpm/dnf

  1. Replace the yum repository definition so that yum points to the new repository instead of the Google-hosted repository. Make sure to replace the Kubernetes minor version in the command below with the minor version that you're currently using:

    cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
    [kubernetes]
    name=Kubernetes
    baseurl=https://pkgs.k8s.io/core:/stable:/v1.28/rpm/
    enabled=1
    gpgcheck=1
    gpgkey=https://pkgs.k8s.io/core:/stable:/v1.28/rpm/repodata/repomd.xml.key
    exclude=kubelet kubeadm kubectl cri-tools kubernetes-cni
    EOF
    

Where can I get packages for Kubernetes versions prior to v1.24.0?

(updated on March 26, 2024)

For Kubernetes v1.24 and onwards, Linux packages of Kubernetes components are available for download via the official Kubernetes package repositories. Kubernetes does not publish any software packages for releases of Kubernetes older than v1.24.0; however, your Linux distribution may provide its own packages. Alternatively, you can directly download binaries instead of using packages. As an example, see Without a package manager instructions in "Installing kubeadm" document.

Can I rollback to the Google-hosted repository after migrating to the Kubernetes repositories?

(updated on March 26, 2024)

The legacy Google-hosted repositories went away on March 4, 2024 and therefore it's not possible to rollback to the legacy Google-hosted repositories any longer.

In general, yes. Just do the same steps as when migrating, but use parameters for the Google-hosted repository. You can find those parameters in a document like "Installing kubeadm".

Why isn’t there a stable list of domains/IPs? Why can’t I restrict package downloads?

Our plan for pkgs.k8s.io is to make it work as a redirector to a set of backends (package mirrors) based on user's location. The nature of this change means that a user downloading a package could be redirected to any mirror at any time. Given the architecture and our plans to onboard additional mirrors in the near future, we can't provide a list of IP addresses or domains that you can add to an allow list.

Restrictive control mechanisms like man-in-the-middle proxies or network policies that restrict access to a specific list of IPs/domains will break with this change. For these scenarios, we encourage you to mirror the release packages to a local package repository that you have strict control over.

What should I do if I detect some abnormality with the new repositories?

If you encounter any issue with new Kubernetes package repositories, please file an issue in the kubernetes/release repository.

Kubernetes v1.28: Planternetes

Announcing the release of Kubernetes v1.28 Planternetes, the second release of 2023!

This release consists of 45 enhancements. Of those enhancements, 19 are entering Alpha, 14 have graduated to Beta, and 12 have graduated to Stable.

Kubernetes v1.28: Planternetes

The theme for Kubernetes v1.28 is Planternetes.

Each Kubernetes release is the culmination of the hard work of thousands of individuals from our community. The people behind this release come from a wide range of backgrounds, some of us industry veterans, parents, others students and newcomers to open-source. We combine our unique experience to create a collective artifact with global impact.

Much like a garden, our release has ever-changing growth, challenges and opportunities. This theme celebrates the meticulous care, intention and efforts to get the release to where we are today. Harmoniously together, we grow better.

What's New (Major Themes)

Changes to supported skew between control plane and node versions

Kubernetes v1.28 expands the supported skew between core node and control plane components by one minor version, from n-2 to n-3, so that node components (kubelet and kube-proxy) for the oldest supported minor version work with control plane components (kube-apiserver, kube-scheduler, kube-controller-manager, cloud-controller-manager) for the newest supported minor version.

Some cluster operators avoid node maintenance and especially changes to node behavior, because nodes are where the workloads run. For minor version upgrades to a kubelet, the supported process includes draining that node, and hence disruption to any Pods that had been executing there. For Kubernetes end users with very long running workloads, and where Pods should stay running wherever possible, reducing the time lost to node maintenance is a benefit.

The Kubernetes yearly support period already made annual upgrades possible. Users can upgrade to the latest patch versions to pick up security fixes and do 3 sequential minor version upgrades once a year to "catch up" to the latest supported minor version.

Previously, to stay within the supported skew, a cluster operator planning an annual upgrade would have needed to upgrade their nodes twice (perhaps only hours apart). Now, with Kubernetes v1.28, you have the option of making a minor version upgrade to nodes just once in each calendar year and still staying within upstream support.

If you'd like to stay current and upgrade your clusters more often, that's fine and is still completely supported.

Generally available: recovery from non-graceful node shutdown

If a node shuts down unexpectedly or ends up in a non-recoverable state (perhaps due to hardware failure or unresponsive OS), Kubernetes allows you to clean up afterward and allow stateful workloads to restart on a different node. For Kubernetes v1.28, that's now a stable feature.

This allows stateful workloads to fail over to a different node successfully after the original node is shut down or in a non-recoverable state, such as the hardware failure or broken OS.

Versions of Kubernetes earlier than v1.20 lacked handling for node shutdown on Linux, the kubelet integrates with systemd and implements graceful node shutdown (beta, and enabled by default). However, even an intentional shutdown might not get handled well that could be because:

  • the node runs Windows
  • the node runs Linux, but uses a different init (not systemd)
  • the shutdown does not trigger the system inhibitor locks mechanism
  • because of a node-level configuration error (such as not setting appropriate values for shutdownGracePeriod and shutdownGracePeriodCriticalPods).

When a node shutdowns or fails, and that shutdown was not detected by the kubelet, the pods that are part of a StatefulSet will be stuck in terminating status on the shutdown node. If the stopped node restarts, the kubelet on that node can clean up (DELETE) the Pods that the Kubernetes API still sees as bound to that node. However, if the node stays stopped - or if the kubelet isn't able to start after a reboot - then Kubernetes may not be able to create replacement Pods. When the kubelet on the shut-down node is not available to delete the old pods, an associated StatefulSet cannot create a new pod (which would have the same name).

There's also a problem with storage. If there are volumes used by the pods, existing VolumeAttachments will not be disassociated from the original - and now shut down - node so the PersistentVolumes used by these pods cannot be attached to a different, healthy node. As a result, an application running on an affected StatefulSet may not be able to function properly. If the original, shut down node does come up, then their pods will be deleted by its kubelet and new pods can be created on a different running node. If the original node does not come up (common with an immutable infrastructure design), those pods would be stuck in a Terminating status on the shut-down node forever.

For more information on how to trigger cleanup after a non-graceful node shutdown, read non-graceful node shutdown.

Improvements to CustomResourceDefinition validation rules

The Common Expression Language (CEL) can be used to validate custom resources. The primary goal is to allow the majority of the validation use cases that might once have needed you, as a CustomResourceDefinition (CRD) author, to design and implement a webhook. Instead, and as a beta feature, you can add validation expressions directly into the schema of a CRD.

CRDs need direct support for non-trivial validation. While admission webhooks do support CRDs validation, they significantly complicate the development and operability of CRDs.

In 1.28, two optional fields reason and fieldPath were added to allow user to specify the failure reason and fieldPath when validation failed.

For more information, read validation rules in the CRD documentation.

ValidatingAdmissionPolicies graduate to beta

Common Expression language for admission control is customizable, in-process validation of requests to the Kubernetes API server as an alternative to validating admission webhooks.

This builds on the capabilities of the CRD Validation Rules feature that graduated to beta in 1.25 but with a focus on the policy enforcement capabilities of validating admission control.

This will lower the infrastructure barrier to enforcing customizable policies as well as providing primitives that help the community establish and adhere to the best practices of both K8s and its extensions.

To use ValidatingAdmissionPolicies, you need to enable both the admissionregistration.k8s.io/v1beta1 API group and the ValidatingAdmissionPolicy feature gate in your cluster's control plane.

Match conditions for admission webhooks

Kubernetes v1.27 lets you specify match conditions for admission webhooks, which lets you narrow the scope of when Kubernetes makes a remote HTTP call at admission time. The matchCondition field for ValidatingWebhookConfiguration and MutatingWebhookConfiguration is a CEL expression that must evaluate to true for the admission request to be sent to the webhook.

In Kubernetes v1.28, that field moved to beta, and it's enabled by default.

To learn more, see matchConditions in the Kubernetes documentation.

Beta support for enabling swap space on Linux

This adds swap support to nodes in a controlled, predictable manner so that Kubernetes users can perform testing and provide data to continue building cluster capabilities on top of swap.

There are two distinct types of users for swap, who may overlap:

  • Node administrators, who may want swap available for node-level performance tuning and stability/reducing noisy neighbor issues.

  • Application developers, who have written applications that would benefit from using swap memory.

Mixed version proxy (alpha)

When a cluster has multiple API servers at mixed versions (such as during an upgrade/downgrade or when runtime-config changes and a rollout happens), not every apiserver can serve every resource at every version.

For Kubernetes v1.28, you can enable the mixed version proxy within the API server's aggregation layer. The mixed version proxy finds requests that the local API server doesn't recognize but another API server inside the control plan is able to support. Having found a suitable peer, the aggregation layer proxies the request to a compatible API server; this is transparent from the client's perspective.

When an upgrade or downgrade is performed on a cluster, for some period of time the API servers within the control plane may be at differing versions; when that happens, different subsets of the API servers are able to serve different sets of built-in resources (different groups, versions, and resources are all possible). This new alpha mechanism lets you hide that skew from clients.

Source code reorganization for control plane components

Kubernetes contributors have begun to reorganize the code for the kube-apiserver to build on a new staging repository that consumes k/apiserver but has a bigger, carefully chosen subset of the functionality of kube-apiserver such that it is reusable.

This is a gradual reorganization; eventually there will be a new git repository with generic functionality abstracted from Kubernetes' API server.

Support for CDI injection into containers (alpha)

CDI provides a standardized way of injecting complex devices into a container (i.e. devices that logically require more than just a single /dev node to be injected for them to work). This new feature enables plugin developers to utilize the CDIDevices field added to the CRI in 1.27 to pass CDI devices directly to CDI enabled runtimes (of which containerd and crio-o are in recent releases).

API awareness of sidecar containers (alpha)

Kubernetes 1.28 introduces an alpha restartPolicy field for init containers, and uses that to indicate when an init container is also a sidecar container. The kubelet will start init containers with restartPolicy: Always in the order they are defined, along with other init containers. Instead of waiting for that sidecar container to complete before starting the main container(s) for the Pod, the kubelet only waits for the sidecar init container to have started.

The kubelet will consider the startup for the sidecar container as being completed if the startup probe succeeds and the postStart handler is completed. This condition is represented with the field Started of ContainerStatus type. If you do not define a startup probe, the kubelet will consider the container startup to be completed immediately after the postStart handler completion.

For init containers, you can either omit the restartPolicy field, or set it to Always. Omitting the field means that you want a true init container that runs to completion before application startup.

Sidecar containers do not block Pod completion: if all regular containers are complete, sidecar containers in that Pod will be terminated.

Once the sidecar container has started (process running, postStart was successful, and any configured startup probe is passing), and then there's a failure, that sidecar container will be restarted even when the Pod's overall restartPolicy is Never or OnFailure. Furthermore, sidecar containers will be restarted (on failure or on normal exit) even during Pod termination.

To learn more, read API for sidecar containers.

Automatic, retroactive assignment of a default StorageClass graduates to stable

Kubernetes automatically sets a storageClassName for a PersistentVolumeClaim (PVC) if you don't provide a value. The control plane also sets a StorageClass for any existing PVC that doesn't have a storageClassName defined. Previous versions of Kubernetes also had this behavior; for Kubernetes v1.28 it is automatic and always active; the feature has graduated to stable (general availability).

To learn more, read about StorageClass in the Kubernetes documentation.

Pod replacement policy for Jobs (alpha)

Kubernetes 1.28 adds a new field for the Job API that allows you to specify if you want the control plane to make new Pods as soon as the previous Pods begin termination (existing behavior), or only once the existing pods are fully terminated (new, optional behavior).

Many common machine learning frameworks, such as Tensorflow and JAX, require unique pods per index. With the older behaviour, if a pod that belongs to an Indexed Job enters a terminating state (due to preemption, eviction or other external factors), a replacement pod is created but then immediately fails to start due to the clash with the old pod that has not yet shut down.

Having a replacement Pod appear before the previous one fully terminates can also cause problems in clusters with scarce resources or with tight budgets. These resources can be difficult to obtain so pods may only be able to find nodes once the existing pods have been terminated. If cluster autoscaler is enabled, early creation of replacement Pods might produce undesired scale-ups.

To learn more, read Delayed creation of replacement pods in the Job documentation.

Job retry backoff limit, per index (alpha)

This extends the Job API to support indexed jobs where the backoff limit is per index, and the Job can continue execution despite some of its indexes failing.

Currently, the indexes of an indexed job share a single backoff limit. When the job reaches this shared backoff limit, the job controller marks the entire job as failed, and the resources are cleaned up, including indexes that have yet to run to completion.

As a result, the existing implementation did not cover the situation where the workload is truly embarrassingly parallel: each index is fully independent of other indexes.

For instance, if indexed jobs were used as the basis for a suite of long-running integration tests, then each test run would only be able to find a single test failure.

For more information, read Handling Pod and container failures in the Kubernetes documentation.


Correction: the feature CRI container and pod statistics without cAdvisor has been removed as it did not make the release. The original release announcement stated that Kubernetes 1.28 included the new feature.

Feature graduations and deprecations in Kubernetes v1.28

Graduations to stable

This release includes a total of 12 enhancements promoted to Stable:

Deprecations and removals

Removals:

Deprecations:

Release Notes

The complete details of the Kubernetes v1.28 release are available in our release notes.

Availability

Kubernetes v1.28 is available for download on GitHub. To get started with Kubernetes, you can run local Kubernetes clusters using minikube, kind, etc. You can also easily install v1.28 using kubeadm.

Release Team

Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is comprised of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.

We would like to thank the entire release team for the hours spent hard at work to ensure we deliver a solid Kubernetes v1.28 release for our community.

Special thanks to our release lead, Grace Nguyen, for guiding us through a smooth and successful release cycle.

Ecosystem Updates

  • KubeCon + CloudNativeCon China 2023 will take place in Shanghai, China, from 26 – 28 September 2023! You can find more information about the conference and registration on the event site.
  • KubeCon + CloudNativeCon North America 2023 will take place in Chicago, Illinois, The United States of America, from 6 – 9 November 2023! You can find more information about the conference and registration on the event site.

Project Velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

In the v1.28 release cycle, which ran for 14 weeks (May 15 to August 15), we saw contributions from 911 companies and 1440 individuals.

Upcoming Release Webinar

Join members of the Kubernetes v1.28 release team on Wednesday, September 6th, 2023, at 9 A.M. PDT to learn about the major features of this release, as well as deprecations and removals to help plan for upgrades. For more information and registration, visit the event page on the CNCF Online Programs site.

Get Involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests.

Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below:

Spotlight on SIG ContribEx

Welcome to the world of Kubernetes and its vibrant contributor community! In this blog post, we'll be shining a spotlight on the Special Interest Group for Contributor Experience (SIG ContribEx), an essential component of the Kubernetes project.

SIG ContribEx in Kubernetes is responsible for developing and maintaining a healthy and productive community of contributors to the project. This involves identifying and addressing bottlenecks that may hinder the project's growth and feature velocity, such as pull request latency and the number of open pull requests and issues.

SIG ContribEx works to improve the overall contributor experience by creating and maintaining guidelines, tools, and processes that facilitate collaboration and communication among contributors. They also focus on community building and support, including outreach programs and mentorship initiatives to onboard and retain new contributors.

Ultimately, the role of SIG ContribEx is to foster a welcoming and inclusive environment that encourages contribution and supports the long-term sustainability of the Kubernetes project.

In this blog post, Fyka Ansari interviews Kaslin Fields, a DevRel Engineer at Google, who is a chair of SIG ContribEx, and Madhav Jivrajani, a Software Engineer at VMWare who serves as a SIG ContribEx Tech Lead. This interview covers various aspects of SIG ContribEx, including current initiatives, exciting developments, and how interested individuals can get involved and contribute to the group. It provides valuable insights into the workings of SIG ContribEx and highlights the importance of its role in the Kubernetes ecosystem.

Introductions

Fyka: Let's start by diving into your background and how you got involved in the Kubernetes ecosystem. Can you tell us more about that journey?

Kaslin: I first got involved in the Kubernetes ecosystem through my mentor, Jonathan Rippy, who introduced me to containers during my early days in tech. Eventually, I transitioned to a team working with containers, which sparked my interest in Kubernetes when it was announced. While researching Kubernetes in that role, I eagerly sought opportunities to engage with the containers/Kubernetes community. It was not until my subsequent job that I found a suitable role to contribute consistently. I joined SIG ContribEx, specifically in the Contributor Comms subproject, to both deepen my knowledge of Kubernetes and support the community better.

Madhav: My journey with Kubernetes began when I was a student, searching for interesting and exciting projects to work on. With my peers, I discovered open source and attended The New Contributor Workshop organized by the Kubernetes community. The workshop not only provided valuable insights into the community structure but also gave me a sense of warmth and welcome, which motivated me to join and remain involved. I realized that collaboration is at the heart of open-source communities, and to get answers and support, I needed to contribute and do my part. I started working on issues in ContribEx, particularly focusing on GitHub automation, despite not fully understanding the task at first. I continued to contribute for various technical and non-technical aspects of the project, finding it to be one of the most professionally rewarding experiences in my life.

Fyka: That's such an inspiration in itself! I'm sure beginners who are reading this got the ultimate motivation to take their first steps. Embracing the Learning journey, seeking mentorship, and engaging with the Kubernetes community can pave the way for exciting opportunities in the tech industry. Your stories proved the importance of starting small and being proactive, just like Madhav said Don't be afraid to take on tasks, even if you're uncertain at first.

Primary goals and scope

Fyka: Given your experience as a member of SIG ContribEx, could you tell us a bit about the group's primary goals and initiatives? Its current focus areas? What do you see as the scope of SIG ContribEx and the impact it has on the Kubernetes community?

Kaslin: SIG ContribEx's primary goals are to simplify the contributions of Kubernetes contributors and foster a welcoming community. It collaborates with other Kubernetes SIGs, such as planning the Contributor Summit at KubeCon, ensuring it meets the needs of various groups. The group's impact is evident in projects like updating org membership policies and managing critical platforms like Zoom, YouTube, and Slack. Its scope encompasses making the contributor experience smoother and supporting the overall Kubernetes community.

Madhav: The Kubernetes project has vertical SIGs and cross-cutting SIGs, ContribEx is a deeply cross-cutting SIG, impacting virtually every area of the Kubernetes community. Adding to Kaslin, sustainability in the Kubernetes project and community is critical now more than ever, it plays a central role in addressing critical issues, such as maintainer succession, by facilitating cohorts for SIGs to train experienced community members to take on leadership roles. Excellent examples include SIG CLI and SIG Apps, leading to the onboarding of new reviewers. Additionally, SIG ContribEx is essential in managing GitHub automation tools, including bots and commands used by contributors for interacting with Prow and other automation (label syncing, group and GitHub team management, etc).

Beginner's guide!

Fyka: I'll never forget talking to Kaslin when I joined the community and needed help with contributing. Kaslin, your quick and clear answers were a huge help in getting me started. Can you both give some tips for people new to contributing to Kubernetes? What makes SIG ContribEx a great starting point? Why should beginners and current contributors consider it? And what cool opportunities are there for newbies to jump in?

Kaslin: If you want to contribute to Kubernetes for the first time, it can be overwhelming to know where to start. A good option is to join SIG ContribEx as it offers great opportunities to know and serve the community. Within SIG ContribEx, various subprojects allow you to explore different parts of the Kubernetes project while you learn how contributions work. Once you know a bit more, it’s common for you to move to other SIGs within the project, and we think that’s wonderful. While many newcomers look for "good first issues" to start with, these opportunities can be scarce and get claimed quickly. Instead, the real benefit lies in attending meetings and getting to know the community. As you learn more about the project and the people involved, you'll be better equipped to offer your help, and the community will be more inclined to seek your assistance when needed. As a co-lead for the Contributor Comms subproject, I can confidently say that it's an excellent place for beginners to get involved. We have supportive leads and particularly beginner-friendly projects too.

Madhav: To begin, read the SIG README on GitHub, which provides an overview of the projects the SIG manages. While attending meetings is beneficial for all SIGs, it's especially recommended for SIG ContribEx, as each subproject gets dedicated slots for updates and areas that need help. If you can't attend in real-time due to time zone differences, you can catch the meeting recordings or Notes later.

Skills you learn!

Fyka: What skills do you look for when bringing in new contributors to SIG ContribEx, from passion to expertise? Additionally, what skills can contributors expect to develop while working with SIG ContribEx?

Kaslin: Skills folks need to have or will acquire vary depending on what area of ContribEx they work upon. Even within a subproject, a range of skills can be useful and/or developed. For example, the tech lead role involves technical tasks and overseeing automation, while the social media lead role requires excellent communication skills. Working with SIG ContribEx allows contributors to acquire various skills based on their chosen subproject. By participating in meetings, listening, learning, and taking on tasks related to their interests, they can develop and hone these skills. Some subprojects may require more specialized skills, like program management for the mentoring project, but all contributors can benefit from offering their talents to help teach others and contribute to the community.

Sub-projects under SIG ContribEx

Fyka: SIG ContribEx has several smaller projects. Can you tell me about the aims of these projects and how they've impacted the Kubernetes community?

Kaslin: Some SIGs have one or two subprojects and some have none at all, but in SIG ContribEx, we have ELEVEN!

Here’s a list of them and their respective mission statements

  1. Community: Manages the community repository, documentation, and operations.
  2. Community management: Handles communication platforms and policies for the community.
  3. Contributor-comms: Focuses on promoting the success of Kubernetes contributors through marketing.
  4. Contributors-documentation: Writes and maintains documentation for contributing to Kubernetes.
  5. Devstats: Maintains and updates the Kubernetes statistics website.
  6. Elections: Oversees community elections and maintains related documentation and software.
  7. Events: Organizes contributor-focused events like the Contributor Summit.
  8. Github management: Manages permissions, repositories, and groups on GitHub.
  9. Mentoring: Develop programs to help contributors progress in their contributions.
  10. Sigs-GitHub-actions: Repository for GitHub actions related to all SIGs in Kubernetes.
  11. Slack-infra: Creates and maintains tools and automation for Kubernetes Slack.

Madhav: Also, Devstats is critical from a sustainability standpoint!

(If you are willing to learn more and get involved with any of these sub-projects, check out the SIG ContribEx README)._

Accomplishments

Fyka: With that said, any SIG-related accomplishment that you’re proud of?

Kaslin: I'm proud of the accomplishments made by SIG ContribEx and its contributors in supporting the community. Some of the recent achievements include:

  1. Establishment of the elections subproject: Kubernetes is a massive project, and ensuring smooth leadership transitions is crucial. The contributors in this subproject organize fair and consistent elections, which helps keep the project running effectively.
  2. New issue triage proces: With such a large open-source project like Kubernetes, there's always a lot of work to be done. To ensure things progress safely, we implemented new labels and updated functionality for issue triage using our PROW tool. This reduces bottlenecks in the workflow and allows leaders to accomplish more.
  3. New org membership requirements: Becoming an org member in Kubernetes can be overwhelming for newcomers. We view org membership as a significant milestone for contributors aiming to take on leadership roles. We recently updated the rules to automatically remove privileges from inactive members, making sure that the right people have access to the necessary tools and responsibilities.

Overall, these accomplishments have greatly benefited our fellow contributors and strengthened the Kubernetes community.

Upcoming initiatives

Fyka: Could you give us a sneak peek into what's next for the group? We're excited to hear about upcoming projects and initiatives from this dynamic team.

Madhav: We’d love for more groups to sign up for mentoring cohorts! We’re probably going to have to spend some time polishing the process around that.

Final thoughts

Fyka: As we wrap up our conversation, would you like to share some final thoughts for those interested in contributing to SIG ContribEx or getting involved with Kubernetes?

Madhav: Kubernetes is meant to be overwhelming and difficult initially! You’re coming into something that’s taken multiple people, from multiple countries, multiple years to build. Embrace that diversity! Use the high entropy initially to collide around and gain as much knowledge about the project and community as possible before you decide to settle in your niche.

Fyka: Thank You Madhav and Kaslin, it was an absolute pleasure chatting about SIG ContribEx and your experiences as a member. It's clear that the role of SIG ContribEx in Kubernetes is significant and essential, ensuring scalability, growth and productivity, and I hope this interview inspires more people to get involved and contribute to Kubernetes. I wish SIG ContribEx all the best, and can't wait to see what exciting things lie ahead!

What next?

We love meeting new contributors and helping them in investigating different Kubernetes project spaces. If you are interested in getting more involved with SIG ContribEx, here are some resources for you to get started:

Spotlight on SIG CLI

In the world of Kubernetes, managing containerized applications at scale requires powerful and efficient tools. The command-line interface (CLI) is an integral part of any developer or operator’s toolkit, offering a convenient and flexible way to interact with a Kubernetes cluster.

SIG CLI plays a crucial role in improving the Kubernetes CLI experience by focusing on the development and enhancement of kubectl, the primary command-line tool for Kubernetes.

In this SIG CLI Spotlight, Arpit Agrawal, SIG ContribEx-Comms team member, talked with Katrina Verey, Tech Lead & Chair of SIG CLI,and Maciej Szulik, SIG CLI Batch Lead, about SIG CLI, current projects, challenges and how anyone can get involved.

So, whether you are a seasoned Kubernetes enthusiast or just getting started, understanding the significance of SIG CLI will undoubtedly enhance your Kubernetes journey.

Introductions

Arpit: Could you tell us a bit about yourself, your role, and how you got involved in SIG CLI?

Maciej: I’m one of the technical leads for SIG-CLI. I was working on Kubernetes in multiple areas since 2014, and in 2018 I got appointed a lead.

Katrina: I’ve been working with Kubernetes as an end-user since 2016, but it was only in late 2019 that I discovered how well SIG CLI aligned with my experience from internal projects. I started regularly attending meetings and made a few small PRs, and by 2021 I was working more deeply with the Kustomize team specifically. Later that year, I was appointed to my current roles as subproject owner for Kustomize and KRM Functions, and as SIG CLI Tech Lead and Chair.

About SIG CLI

Arpit: Thank you! Could you share with us the purpose and goals of SIG CLI?

Maciej: Our charter has the most detailed description, but in few words, we handle all CLI tooling that helps you manage your Kubernetes manifests and interact with your Kubernetes clusters.

Arpit: I see. And how does SIG CLI work to promote best-practices for CLI development and usage in the cloud native ecosystem?

Maciej: Within kubectl, we have several on-going efforts that try to encourage new contributors to align existing commands to new standards. We publish several libraries which hopefully make it easier to write CLIs that interact with Kubernetes APIs, such as cli-runtime and kyaml.

Katrina: We also maintain some interoperability specifications for CLI tooling, such as the KRM Functions Specification (GA) and the new ApplySet Specification (alpha).

Current projects and challenges

Arpit: Going through the README file, it’s clear SIG CLI has a number of subprojects, could you highlight some important ones?

Maciej: The four most active subprojects that are, in my opinion, worthy of your time investment would be:

  • kubectl: the canonical Kubernetes CLI.
  • Kustomize: a template-free customization tool for Kubernetes yaml manifest files.
  • KUI - a GUI interface to Kubernetes, think kubectl on steroids.
  • krew: a plugin manager for kubectl.

Arpit: Are there any upcoming initiatives or developments that SIG CLI is working on?

Maciej: There are always several initiatives we’re working on at any given point in time. It’s best to join one of our calls to learn about the current ones.

Katrina: For major features, you can check out our open KEPs. For instance, in 1.27 we introduced alphas for a new pruning mode in kubectl apply, and for kubectl create plugins. Exciting ideas that are currently under discussion include an interactive mode for kubectl delete (KEP 3895) and the kuberc user preferences file (KEP 3104).

Arpit: Could you discuss any challenges that SIG CLI faces in its efforts to improve CLIs for cloud-native technologies? What are the future efforts to solve them?

Katrina: The biggest challenge we’re facing with every decision is backwards compatibility and ensuring we don’t break existing users. It frequently happens that fixing what's on the surface may seem straightforward, but even fixing a bug could constitute a breaking change for some users, which means we need to go through an extended deprecation process to change it, or in some cases we can’t change it at all. Another challenge is the need to balance customization with usability in the flag sets we expose on our tools. For example, we get many proposals for new flags that would certainly be useful to some users, but not a large enough subset to justify the increased complexity having them in the tool entails for everyone. The kuberc proposal may help with some of these problems by giving individual users the ability to set or override default values we can’t change, and even create custom subcommands via aliases

Arpit: With every new version release of Kubernetes, maintaining consistency and integrity is surely challenging: how does the SIG CLI team tackle it?

Maciej: This is mostly similar to the topic mentioned in the previous question: every new change, especially to existing commands goes through a lot of scrutiny to ensure we don’t break existing users. At any point in time we have to keep a reasonable balance between features and not breaking users.

Future plans and contribution

Arpit: How do you see the role of CLI tools in the cloud-native ecosystem evolving in the future?

Maciej: I think that CLI tools were and will always be an important piece of the ecosystem. Whether used by administrators on remote machines that don’t have GUI or in every CI/CD pipeline, they are irreplaceable.

Arpit: Kubernetes is a community-driven project. Any recommendation for anyone looking into getting involved in SIG CLI work? Where should they start? Are there any prerequisites?

Maciej: There are no prerequisites other than a little bit of free time on your hands and willingness to learn something new :-)

Katrina: A working knowledge of Go often helps, but we also have areas in need of non-code contributions, such as the Kustomize docs consolidation project.

Confidential Kubernetes: Use Confidential Virtual Machines and Enclaves to improve your cluster security

In this blog post, we will introduce the concept of Confidential Computing (CC) to improve any computing environment's security and privacy properties. Further, we will show how the Cloud-Native ecosystem, particularly Kubernetes, can benefit from the new compute paradigm.

Confidential Computing is a concept that has been introduced previously in the cloud-native world. The Confidential Computing Consortium (CCC) is a project community in the Linux Foundation that already worked on Defining and Enabling Confidential Computing. In the Whitepaper, they provide a great motivation for the use of Confidential Computing:

Data exists in three states: in transit, at rest, and in use. …Protecting sensitive data in all of its states is more critical than ever. Cryptography is now commonly deployed to provide both data confidentiality (stopping unauthorized viewing) and data integrity (preventing or detecting unauthorized changes). While techniques to protect data in transit and at rest are now commonly deployed, the third state - protecting data in use - is the new frontier.

Confidential Computing aims to primarily solve the problem of protecting data in use by introducing a hardware-enforced Trusted Execution Environment (TEE).

Trusted Execution Environments

For more than a decade, Trusted Execution Environments (TEEs) have been available in commercial computing hardware in the form of Hardware Security Modules (HSMs) and Trusted Platform Modules (TPMs). These technologies provide trusted environments for shielded computations. They can store highly sensitive cryptographic keys and carry out critical cryptographic operations such as signing or encrypting data.

TPMs are optimized for low cost, allowing them to be integrated into mainboards and act as a system's physical root of trust. To keep the cost low, TPMs are limited in scope, i.e., they provide storage for only a few keys and are capable of just a small subset of cryptographic operations.

In contrast, HSMs are optimized for high performance, providing secure storage for far more keys and offering advanced physical attack detection mechanisms. Additionally, high-end HSMs can be programmed so that arbitrary code can be compiled and executed. The downside is that they are very costly. A managed CloudHSM from AWS costs around $1.50 / hour or ~$13,500 / year.

In recent years, a new kind of TEE has gained popularity. Technologies like AMD SEV, Intel SGX, and Intel TDX provide TEEs that are closely integrated with userspace. Rather than low-power or high-performance devices that support specific use cases, these TEEs shield normal processes or virtual machines and can do so with relatively low overhead. These technologies each have different design goals, advantages, and limitations, and they are available in different environments, including consumer laptops, servers, and mobile devices.

Additionally, we should mention ARM TrustZone, which is optimized for embedded devices such as smartphones, tablets, and smart TVs, as well as AWS Nitro Enclaves, which are only available on Amazon Web Services and have a different threat model compared to the CPU-based solutions by Intel and AMD.

IBM Secure Execution for Linux lets you run your Kubernetes cluster's nodes as KVM guests within a trusted execution environment on IBM Z series hardware. You can use this hardware-enhanced virtual machine isolation to provide strong isolation between tenants in a cluster, with hardware attestation about the (virtual) node's integrity.

Security properties and feature set

In the following sections, we will review the security properties and additional features these new technologies bring to the table. Only some solutions will provide all properties; we will discuss each technology in further detail in their respective section.

The Confidentiality property ensures that information cannot be viewed while it is in use in the TEE. This provides us with the highly desired feature to secure data in use. Depending on the specific TEE used, both code and data may be protected from outside viewers. The differences in TEE architectures and how their use in a cloud native context are important considerations when designing end-to-end security for sensitive workloads with a minimal Trusted Computing Base (TCB) in mind. CCC has recently worked on a common vocabulary and supporting material that helps to explain where confidentiality boundaries are drawn with the different TEE architectures and how that impacts the TCB size.

Confidentiality is a great feature, but an attacker can still manipulate or inject arbitrary code and data for the TEE to execute and, therefore, easily leak critical information. Integrity guarantees a TEE owner that neither code nor data can be tampered with while running critical computations.

Availability is a basic property often discussed in the context of information security. However, this property is outside the scope of most TEEs. Usually, they can be controlled (shut down, restarted, …) by some higher level abstraction. This could be the CPU itself, the hypervisor, or the kernel. This is to preserve the overall system's availability, not the TEE itself. When running in the cloud, availability is usually guaranteed by the cloud provider in terms of Service Level Agreements (SLAs) and is not cryptographically enforceable.

Confidentiality and Integrity by themselves are only helpful in some cases. For example, consider a TEE running in a remote cloud. How would you know the TEE is genuine and running your intended software? It could be an imposter stealing your data as soon as you send it over. This fundamental problem is addressed by Attestability. Attestation allows us to verify the identity, confidentiality, and integrity of TEEs based on cryptographic certificates issued from the hardware itself. This feature can also be made available to clients outside of the confidential computing hardware in the form of remote attestation.

TEEs can hold and process information that predates or outlives the trusted environment. That could mean across restarts, different versions, or platform migrations. Therefore Recoverability is an important feature. Data and the state of a TEE need to be sealed before they are written to persistent storage to maintain confidentiality and integrity guarantees. The access to such sealed data needs to be well-defined. In most cases, the unsealing is bound to a TEE's identity. Hence, making sure the recovery can only happen in the same confidential context.

This does not have to limit the flexibility of the overall system. AMD SEV-SNP's migration agent (MA) allows users to migrate a confidential virtual machine to a different host system while keeping the security properties of the TEE intact.

Feature comparison

These sections of the article will dive a little bit deeper into the specific implementations, compare supported features and analyze their security properties.

AMD SEV

AMD's Secure Encrypted Virtualization (SEV) technologies are a set of features to enhance the security of virtual machines on AMD's server CPUs. SEV transparently encrypts the memory of each VM with a unique key. SEV can also calculate a signature of the memory contents, which can be sent to the VM's owner as an attestation that the initial guest memory was not manipulated.

The second generation of SEV, known as Encrypted State or SEV-ES, provides additional protection from the hypervisor by encrypting all CPU register contents when a context switch occurs.

The third generation of SEV, Secure Nested Paging or SEV-SNP, is designed to prevent software-based integrity attacks and reduce the risk associated with compromised memory integrity. The basic principle of SEV-SNP integrity is that if a VM can read a private (encrypted) memory page, it must always read the value it last wrote.

Additionally, by allowing the guest to obtain remote attestation statements dynamically, SNP enhances the remote attestation capabilities of SEV.

AMD SEV has been implemented incrementally. New features and improvements have been added with each new CPU generation. The Linux community makes these features available as part of the KVM hypervisor and for host and guest kernels. The first SEV features were discussed and implemented in 2016 - see AMD x86 Memory Encryption Technologies from the 2016 Usenix Security Symposium. The latest big addition was SEV-SNP guest support in Linux 5.19.

Confidential VMs based on AMD SEV-SNP are available in Microsoft Azure since July 2022. Similarly, Google Cloud Platform (GCP) offers confidential VMs based on AMD SEV-ES.

Intel SGX

Intel's Software Guard Extensions has been available since 2015 and were introduced with the Skylake architecture.

SGX is an instruction set that enables users to create a protected and isolated process called an enclave. It provides a reverse sandbox that protects enclaves from the operating system, firmware, and any other privileged execution context.

The enclave memory cannot be read or written from outside the enclave, regardless of the current privilege level and CPU mode. The only way to call an enclave function is through a new instruction that performs several protection checks. Its memory is encrypted. Tapping the memory or connecting the DRAM modules to another system will yield only encrypted data. The memory encryption key randomly changes every power cycle. The key is stored within the CPU and is not accessible.

Since the enclaves are process isolated, the operating system's libraries are not usable as is; therefore, SGX enclave SDKs are required to compile programs for SGX. This also implies applications need to be designed and implemented to consider the trusted/untrusted isolation boundaries. On the other hand, applications get built with very minimal TCB.

An emerging approach to easily transition to process-based confidential computing and avoid the need to build custom applications is to utilize library OSes. These OSes facilitate running native, unmodified Linux applications inside SGX enclaves. A library OS intercepts all application requests to the host OS and processes them securely without the application knowing it's running a TEE.

The 3rd generation Xeon CPUs (aka Ice Lake Server - "ICX") and later generations did switch to using a technology called Total Memory Encryption - Multi-Key (TME-MK) that uses AES-XTS, moving away from the Memory Encryption Engine that the consumer and Xeon E CPUs used. This increased the possible enclave page cache (EPC) size (up to 512GB/CPU) and improved performance. More info about SGX on multi-socket platforms can be found in the Whitepaper.

A list of supported platforms is available from Intel.

SGX is available on Azure, Alibaba Cloud, IBM, and many more.

Intel TDX

Where Intel SGX aims to protect the context of a single process, Intel's Trusted Domain Extensions protect a full virtual machine and are, therefore, most closely comparable to AMD SEV.

As with SEV-SNP, guest support for TDX was merged in Linux Kernel 5.19. However, hardware support will land with Sapphire Rapids during 2023: Alibaba Cloud provides invitational preview instances, and Azure has announced its TDX preview opportunity.

Overhead analysis

The benefits that Confidential Computing technologies provide via strong isolation and enhanced security to customer data and workloads are not for free. Quantifying this impact is challenging and depends on many factors: The TEE technology, the benchmark, the metrics, and the type of workload all have a huge impact on the expected performance overhead.

Intel SGX-based TEEs are hard to benchmark, as shown by different papers. The chosen SDK/library OS, the application itself, as well as the resource requirements (especially large memory requirements) have a huge impact on performance. A single-digit percentage overhead can be expected if an application is well suited to run inside an enclave.

Confidential virtual machines based on AMD SEV-SNP require no changes to the executed program and operating system and are a lot easier to benchmark. A benchmark from Azure and AMD shows that SEV-SNP VM overhead is <10%, sometimes as low as 2%.

Although there is a performance overhead, it should be low enough to enable real-world workloads to run in these protected environments and improve the security and privacy of our data.

Confidential Computing compared to FHE, ZKP, and MPC

Fully Homomorphic Encryption (FHE), Zero Knowledge Proof/Protocol (ZKP), and Multi-Party Computations (MPC) are all a form of encryption or cryptographic protocols that offer similar security guarantees to Confidential Computing but do not require hardware support.

Fully (also partially and somewhat) homomorphic encryption allows one to perform computations, such as addition or multiplication, on encrypted data. This provides the property of encryption in use but does not provide integrity protection or attestation like confidential computing does. Therefore, these two technologies can complement to each other.

Zero Knowledge Proofs or Protocols are a privacy-preserving technique (PPT) that allows one party to prove facts about their data without revealing anything else about the data. ZKP can be used instead of or in addition to Confidential Computing to protect the privacy of the involved parties and their data. Similarly, Multi-Party Computation enables multiple parties to work together on a computation, i.e., each party provides their data to the result without leaking it to any other parties.

Use cases of Confidential Computing

The presented Confidential Computing platforms show that both the isolation of a single container process and, therefore, minimization of the trusted computing base and the isolation of a `` full virtual machine are possible. This has already enabled a lot of interesting and secure projects to emerge:

Confidential Containers

Confidential Containers (CoCo) is a CNCF sandbox project that isolates Kubernetes pods inside of confidential virtual machines.

CoCo can be installed on a Kubernetes cluster with an operator. The operator will create a set of runtime classes that can be used to deploy pods inside an enclave on several different platforms, including AMD SEV, Intel TDX, Secure Execution for IBM Z, and Intel SGX.

CoCo is typically used with signed and/or encrypted container images which are pulled, verified, and decrypted inside the enclave. Secrets, such as image decryption keys, are conditionally provisioned to the enclave by a trusted Key Broker Service that validates the hardware evidence of the TEE prior to releasing any sensitive information.

CoCo has several deployment models. Since the Kubernetes control plane is outside the TCB, CoCo is suitable for managed environments. CoCo can be run in virtual environments that don't support nesting with the help of an API adaptor that starts pod VMs in the cloud. CoCo can also be run on bare metal, providing strong isolation even in multi-tenant environments.

Managed confidential Kubernetes

Azure and GCP both support the use of confidential virtual machines as worker nodes for their managed Kubernetes offerings.

Both services aim for better workload protection and security guarantees by enabling memory encryption for container workloads. However, they don't seek to fully isolate the cluster or workloads against the service provider or infrastructure. Specifically, they don't offer a dedicated confidential control plane or expose attestation capabilities for the confidential cluster/nodes.

Azure also enables Confidential Containers in their managed Kubernetes offering. They support the creation based on Intel SGX enclaves and AMD SEV-based VMs.

Constellation

Constellation is a Kubernetes engine that aims to provide the best possible data security. Constellation wraps your entire Kubernetes cluster into a single confidential context that is shielded from the underlying cloud infrastructure. Everything inside is always encrypted, including at runtime in memory. It shields both the worker and control plane nodes. In addition, it already integrates with popular CNCF software such as Cilium for secure networking and provides extended CSI drivers to write data securely.

Occlum and Gramine

Occlum and Gramine are examples of open source library OS projects that can be used to run unmodified applications in SGX enclaves. They are member projects under the CCC, but similar projects and products maintained by companies also exist. With these libOS projects, existing containerized applications can be easily converted into confidential computing enabled containers. Many curated prebuilt containers are also available.

Where are we today? Vendors, limitations, and FOSS landscape

As we hope you have seen from the previous sections, Confidential Computing is a powerful new concept to improve security, but we are still in the (early) adoption phase. New products are starting to emerge to take advantage of the unique properties.

Google and Microsoft are the first major cloud providers to have confidential offerings that can run unmodified applications inside a protected boundary. Still, these offerings are limited to compute, while end-to-end solutions for confidential databases, cluster networking, and load balancers have to be self-managed.

These technologies provide opportunities to bring even the most sensitive workloads into the cloud and enables them to leverage all the tools in the CNCF landscape.

Call to action

If you are currently working on a high-security product that struggles to run in the public cloud due to legal requirements or are looking to bring the privacy and security of your cloud-native project to the next level: Reach out to all the great projects we have highlighted! Everyone is keen to improve the security of our ecosystem, and you can play a vital role in that journey.

Verifying Container Image Signatures Within CRI Runtimes

The Kubernetes community has been signing their container image-based artifacts since release v1.24. While the graduation of the corresponding enhancement from alpha to beta in v1.26 introduced signatures for the binary artifacts, other projects followed the approach by providing image signatures for their releases, too. This means that they either create the signatures within their own CI/CD pipelines, for example by using GitHub actions, or rely on the Kubernetes image promotion process to automatically sign the images by proposing pull requests to the k/k8s.io repository. A requirement for using this process is that the project is part of the kubernetes or kubernetes-sigs GitHub organization, so that they can utilize the community infrastructure for pushing images into staging buckets.

Assuming that a project now produces signed container image artifacts, how can one actually verify the signatures? It is possible to do it manually like outlined in the official Kubernetes documentation. The problem with this approach is that it involves no automation at all and should be only done for testing purposes. In production environments, tools like the sigstore policy-controller can help with the automation. These tools provide a higher level API by using Custom Resource Definitions (CRD) as well as an integrated admission controller and webhook to verify the signatures.

The general usage flow for an admission controller based verification is:

Create an instance of the policy and annotate the namespace to validate the signatures. Then create the pod. The controller evaluates the policy and if it passes, then it does the image pull if necessary. If the policy evaluation fails, then it will not admit the pod.

A key benefit of this architecture is simplicity: A single instance within the cluster validates the signatures before any image pull can happen in the container runtime on the nodes, which gets initiated by the kubelet. This benefit also brings along the issue of separation: The node which should pull the container image is not necessarily the same node that performs the admission. This means that if the controller is compromised, then a cluster-wide policy enforcement can no longer be possible.

One way to solve this issue is doing the policy evaluation directly within the Container Runtime Interface (CRI) compatible container runtime. The runtime is directly connected to the kubelet on a node and does all the tasks like pulling images. CRI-O is one of those available runtimes and will feature full support for container image signature verification in v1.28.

How does it work? CRI-O reads a file called policy.json, which contains all the rules defined for container images. For example, you can define a policy which only allows signed images quay.io/crio/signed for any tag or digest like this:

{
  "default": [{ "type": "reject" }],
  "transports": {
    "docker": {
      "quay.io/crio/signed": [
        {
          "type": "sigstoreSigned",
          "signedIdentity": { "type": "matchRepository" },
          "fulcio": {
            "oidcIssuer": "https://github.com/login/oauth",
            "subjectEmail": "sgrunert@redhat.com",
            "caData": "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUI5ekNDQVh5Z0F3SUJBZ0lVQUxaTkFQRmR4SFB3amVEbG9Ed3lZQ2hBTy80d0NnWUlLb1pJemowRUF3TXcKS2pFVk1CTUdBMVVFQ2hNTWMybG5jM1J2Y21VdVpHVjJNUkV3RHdZRFZRUURFd2h6YVdkemRHOXlaVEFlRncweQpNVEV3TURjeE16VTJOVGxhRncwek1URXdNRFV4TXpVMk5UaGFNQ294RlRBVEJnTlZCQW9UREhOcFozTjBiM0psCkxtUmxkakVSTUE4R0ExVUVBeE1JYzJsbmMzUnZjbVV3ZGpBUUJnY3Foa2pPUFFJQkJnVXJnUVFBSWdOaUFBVDcKWGVGVDRyYjNQUUd3UzRJYWp0TGszL09sbnBnYW5nYUJjbFlwc1lCcjVpKzR5bkIwN2NlYjNMUDBPSU9aZHhleApYNjljNWlWdXlKUlErSHowNXlpK1VGM3VCV0FsSHBpUzVzaDArSDJHSEU3U1hyazFFQzVtMVRyMTlMOWdnOTJqCll6QmhNQTRHQTFVZER3RUIvd1FFQXdJQkJqQVBCZ05WSFJNQkFmOEVCVEFEQVFIL01CMEdBMVVkRGdRV0JCUlkKd0I1ZmtVV2xacWw2ekpDaGt5TFFLc1hGK2pBZkJnTlZIU01FR0RBV2dCUll3QjVma1VXbFpxbDZ6SkNoa3lMUQpLc1hGK2pBS0JnZ3Foa2pPUFFRREF3TnBBREJtQWpFQWoxbkhlWFpwKzEzTldCTmErRURzRFA4RzFXV2cxdENNCldQL1dIUHFwYVZvMGpoc3dlTkZaZ1NzMGVFN3dZSTRxQWpFQTJXQjlvdDk4c0lrb0YzdlpZZGQzL1Z0V0I1YjkKVE5NZWE3SXgvc3RKNVRmY0xMZUFCTEU0Qk5KT3NRNHZuQkhKCi0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0="
          },
          "rekorPublicKeyData": "LS0tLS1CRUdJTiBQVUJMSUMgS0VZLS0tLS0KTUZrd0V3WUhLb1pJemowQ0FRWUlLb1pJemowREFRY0RRZ0FFMkcyWSsydGFiZFRWNUJjR2lCSXgwYTlmQUZ3cgprQmJtTFNHdGtzNEwzcVg2eVlZMHp1ZkJuaEM4VXIvaXk1NUdoV1AvOUEvYlkyTGhDMzBNOStSWXR3PT0KLS0tLS1FTkQgUFVCTElDIEtFWS0tLS0tCg=="
        }
      ]
    }
  }
}

CRI-O has to be started to use that policy as the global source of truth:

> sudo crio --log-level debug --signature-policy ./policy.json

CRI-O is now able to pull the image while verifying its signatures. This can be done by using crictl (cri-tools), for example:

> sudo crictl -D pull quay.io/crio/signed
DEBU[…] get image connection
DEBU[…] PullImageRequest: &PullImageRequest{Image:&ImageSpec{Image:quay.io/crio/signed,Annotations:map[string]string{},},Auth:nil,SandboxConfig:nil,}
DEBU[…] PullImageResponse: &PullImageResponse{ImageRef:quay.io/crio/signed@sha256:18b42e8ea347780f35d979a829affa178593a8e31d90644466396e1187a07f3a,}
Image is up to date for quay.io/crio/signed@sha256:18b42e8ea347780f35d979a829affa178593a8e31d90644466396e1187a07f3a

The CRI-O debug logs will also indicate that the signature got successfully validated:

DEBU[…] IsRunningImageAllowed for image docker:quay.io/crio/signed:latest
DEBU[…]  Using transport "docker" specific policy section quay.io/crio/signed
DEBU[…] Reading /var/lib/containers/sigstore/crio/signed@sha256=18b42e8ea347780f35d979a829affa178593a8e31d90644466396e1187a07f3a/signature-1
DEBU[…] Looking for sigstore attachments in quay.io/crio/signed:sha256-18b42e8ea347780f35d979a829affa178593a8e31d90644466396e1187a07f3a.sig
DEBU[…] GET https://quay.io/v2/crio/signed/manifests/sha256-18b42e8ea347780f35d979a829affa178593a8e31d90644466396e1187a07f3a.sig
DEBU[…] Content-Type from manifest GET is "application/vnd.oci.image.manifest.v1+json"
DEBU[…] Found a sigstore attachment manifest with 1 layers
DEBU[…] Fetching sigstore attachment 1/1: sha256:8276724a208087e73ae5d9d6e8f872f67808c08b0acdfdc73019278807197c45
DEBU[…] Downloading /v2/crio/signed/blobs/sha256:8276724a208087e73ae5d9d6e8f872f67808c08b0acdfdc73019278807197c45
DEBU[…] GET https://quay.io/v2/crio/signed/blobs/sha256:8276724a208087e73ae5d9d6e8f872f67808c08b0acdfdc73019278807197c45
DEBU[…]  Requirement 0: allowed
DEBU[…] Overall: allowed

All of the defined fields like oidcIssuer and subjectEmail in the policy have to match, while fulcio.caData and rekorPublicKeyData are the public keys from the upstream fulcio (OIDC PKI) and rekor (transparency log) instances.

This means that if you now invalidate the subjectEmail of the policy, for example to wrong@mail.com:

> jq '.transports.docker."quay.io/crio/signed"[0].fulcio.subjectEmail = "wrong@mail.com"' policy.json > new-policy.json
> mv new-policy.json policy.json

Then remove the image, since it already exists locally:

> sudo crictl rmi quay.io/crio/signed

Now when you pull the image, CRI-O complains that the required email is wrong:

> sudo crictl pull quay.io/crio/signed
FATA[…] pulling image: rpc error: code = Unknown desc = Source image rejected: Required email wrong@mail.com not found (got []string{"sgrunert@redhat.com"})

It is also possible to test an unsigned image against the policy. For that you have to modify the key quay.io/crio/signed to something like quay.io/crio/unsigned:

> sed -i 's;quay.io/crio/signed;quay.io/crio/unsigned;' policy.json

If you now pull the container image, CRI-O will complain that no signature exists for it:

> sudo crictl pull quay.io/crio/unsigned
FATA[…] pulling image: rpc error: code = Unknown desc = SignatureValidationFailed: Source image rejected: A signature was required, but no signature exists

It is important to mention that CRI-O will match the .critical.identity.docker-reference field within the signature to match with the image repository. For example, if you verify the image registry.k8s.io/kube-apiserver-amd64:v1.28.0-alpha.3, then the corresponding docker-reference should be registry.k8s.io/kube-apiserver-amd64:

> cosign verify registry.k8s.io/kube-apiserver-amd64:v1.28.0-alpha.3 \
    --certificate-identity krel-trust@k8s-releng-prod.iam.gserviceaccount.com \
    --certificate-oidc-issuer https://accounts.google.com \
    | jq -r '.[0].critical.identity."docker-reference"'

registry.k8s.io/kubernetes/kube-apiserver-amd64

The Kubernetes community introduced registry.k8s.io as proxy mirror for various registries. Before the release of kpromo v4.0.2, images had been signed with the actual mirror rather than registry.k8s.io:

> cosign verify registry.k8s.io/kube-apiserver-amd64:v1.28.0-alpha.2 \
    --certificate-identity krel-trust@k8s-releng-prod.iam.gserviceaccount.com \
    --certificate-oidc-issuer https://accounts.google.com \
    | jq -r '.[0].critical.identity."docker-reference"'

asia-northeast2-docker.pkg.dev/k8s-artifacts-prod/images/kubernetes/kube-apiserver-amd64

The change of the docker-reference to registry.k8s.io makes it easier for end users to validate the signatures, because they cannot know anything about the underlying infrastructure being used. The feature to set the identity on image signing has been added to cosign via the flag sign --sign-container-identity as well and will be part of its upcoming release.

The Kubernetes image pull error code SignatureValidationFailed got recently added to Kubernetes and will be available from v1.28. This error code allows end-users to understand image pull failures directly from the kubectl CLI. For example, if you run CRI-O together with Kubernetes using the policy which requires quay.io/crio/unsigned to be signed, then a pod definition like this:

apiVersion: v1
kind: Pod
metadata:
  name: pod
spec:
  containers:
    - name: container
      image: quay.io/crio/unsigned

Will cause the SignatureValidationFailed error when applying the pod manifest:

> kubectl apply -f pod.yaml
pod/pod created
> kubectl get pods
NAME   READY   STATUS                      RESTARTS   AGE
pod    0/1     SignatureValidationFailed   0          4s
> kubectl describe pod pod | tail -n8
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  58s                default-scheduler  Successfully assigned default/pod to 127.0.0.1
  Normal   BackOff    22s (x2 over 55s)  kubelet            Back-off pulling image "quay.io/crio/unsigned"
  Warning  Failed     22s (x2 over 55s)  kubelet            Error: ImagePullBackOff
  Normal   Pulling    9s (x3 over 58s)   kubelet            Pulling image "quay.io/crio/unsigned"
  Warning  Failed     6s (x3 over 55s)   kubelet            Failed to pull image "quay.io/crio/unsigned": SignatureValidationFailed: Source image rejected: A signature was required, but no signature exists
  Warning  Failed     6s (x3 over 55s)   kubelet            Error: SignatureValidationFailed

This overall behavior provides a more Kubernetes native experience and does not rely on third party software to be installed in the cluster.

There are still a few corner cases to consider: For example, what if you want to allow policies per namespace in the same way the policy-controller supports it? Well, there is an upcoming CRI-O feature in v1.28 for that! CRI-O will support the --signature-policy-dir / signature_policy_dir option, which defines the root path for pod namespace-separated signature policies. This means that CRI-O will lookup that path and assemble a policy like <SIGNATURE_POLICY_DIR>/<NAMESPACE>.json, which will be used on image pull if existing. If no pod namespace is provided on image pull (via the sandbox config), or the concatenated path is non-existent, then CRI-O's global policy will be used as fallback.

Another corner case to consider is critical for the correct signature verification within container runtimes: The kubelet only invokes container image pulls if the image does not already exist on disk. This means that an unrestricted policy from Kubernetes namespace A can allow pulling an image, while namespace B is not able to enforce the policy because it already exits on the node. Finally, CRI-O has to verify the policy not only on image pull, but also on container creation. This fact makes things even a bit more complicated, because the CRI does not really pass down the user specified image reference on container creation, but an already resolved image ID, or digest. A small change to the CRI can help with that.

Now that everything happens within the container runtime, someone has to maintain and define the policies to provide a good user experience around that feature. The CRDs of the policy-controller are great, while we could imagine that a daemon within the cluster can write the policies for CRI-O per namespace. This would make any additional hook obsolete and moves the responsibility of verifying the image signature to the actual instance which pulls the image. I evaluated other possible paths toward a better container image signature verification within plain Kubernetes, but I could not find a great fit for a native API. This means that I believe that a CRD is the way to go, but users still need an instance which actually serves it.

Thank you for reading this blog post! If you're interested in more, providing feedback or asking for help, then feel free to get in touch with me directly via Slack (#crio) or the SIG Node mailing list.

dl.k8s.io to adopt a Content Delivery Network

We're happy to announce that dl.k8s.io, home of the official Kubernetes binaries, will soon be powered by Fastly.

Fastly is known for its high-performance content delivery network (CDN) designed to deliver content quickly and reliably around the world. With its powerful network, Fastly will help us deliver official Kubernetes binaries to users faster and more reliably than ever before.

The decision to use Fastly was made after an extensive evaluation process in which we carefully evaluated several potential content delivery network providers. Ultimately, we chose Fastly because of their commitment to the open internet and proven track record of delivering fast and secure digital experiences to some of the most known open source projects (through their Fast Forward program).

What you need to know about this change

  • On Monday, July 24th, the IP addresses and backend storage associated with the dl.k8s.io domain name will change.
  • The change will not impact the vast majority of users since the domain name will remain the same.
  • If you restrict access to specific IP ranges, access to the dl.k8s.io domain could stop working.

If you think you may be impacted or want to know more about this change, please keep reading.

Why are we making this change

The official Kubernetes binaries site, dl.k8s.io, is used by thousands of users all over the world, and currently serves more than 5 petabytes of binaries each month. This change will allow us to improve access to those resources by leveraging a world-wide CDN.

Does this affect dl.k8s.io only, or are other domains also affected?

Only dl.k8s.io will be affected by this change.

My company specifies the domain names that we are allowed to be accessed. Will this change affect the domain name?

No, the domain name (dl.k8s.io) will remain the same: no change will be necessary, and access to the Kubernetes release binaries site should not be affected.

My company uses some form of IP filtering. Will this change affect access to the site?

If IP-based filtering is in place, it’s possible that access to the site will be affected when the new IP addresses become active.

If my company doesn’t use IP addresses to restrict network traffic, do we need to do anything?

No, the switch to the CDN should be transparent.

Will there be a dual running period?

No, it is a cutover. You can, however, test your networks right now to check if they can route to the new public IP addresses from Fastly. You should add the new IPs to your network's allowlist before July 24th. Once the transfer is complete, ensure your networks use the new IP addresses to connect to the dl.k8s.io service.

What are the new IP addresses?

If you need to manage an allow list for downloads, you can get the ranges to match from the Fastly API, in JSON: public IP address ranges. You don't need any credentials to download that list of ranges.

What next steps would you recommend?

If you have IP-based filtering in place, we recommend the following course of action before July, 24th:

  • Add the new IP addresses to your allowlist.
  • Conduct tests with your networks/firewall to ensure your networks can route to the new IP addresses.

After the change is made, we recommend double-checking that HTTP calls are accessing dl.k8s.io with the new IP addresses.

What should I do if I detect some abnormality after the cutover date?

If you encounter any weirdness during binaries download, please open an issue.

Using OCI artifacts to distribute security profiles for seccomp, SELinux and AppArmor

The Security Profiles Operator (SPO) makes managing seccomp, SELinux and AppArmor profiles within Kubernetes easier than ever. It allows cluster administrators to define the profiles in a predefined custom resource YAML, which then gets distributed by the SPO into the whole cluster. Modification and removal of the security profiles are managed by the operator in the same way, but that’s a small subset of its capabilities.

Another core feature of the SPO is being able to stack seccomp profiles. This means that users can define a baseProfileName in the YAML specification, which then gets automatically resolved by the operator and combines the syscall rules. If a base profile has another baseProfileName, then the operator will recursively resolve the profiles up to a certain depth. A common use case is to define base profiles for low level container runtimes (like runc or crun) which then contain syscalls which are required in any case to run the container. Alternatively, application developers can define seccomp base profiles for their standard distribution containers and stack dedicated profiles for the application logic on top. This way developers can focus on maintaining seccomp profiles which are way simpler and scoped to the application logic, without having a need to take the whole infrastructure setup into account.

But how to maintain those base profiles? For example, the amount of required syscalls for a runtime can change over its release cycle in the same way it can change for the main application. Base profiles have to be available in the same cluster, otherwise the main seccomp profile will fail to deploy. This means that they’re tightly coupled to the main application profiles, which acts against the main idea of base profiles. Distributing and managing them as plain files feels like an additional burden to solve.

OCI artifacts to the rescue

The v0.8.0 release of the Security Profiles Operator supports managing base profiles as OCI artifacts! Imagine OCI artifacts as lightweight container images, storing files in layers in the same way images do, but without a process to be executed. Those artifacts can be used to store security profiles like regular container images in compatible registries. This means they can be versioned, namespaced and annotated similar to regular container images.

To see how that works in action, specify a baseProfileName prefixed with oci:// within a seccomp profile CRD, for example:

apiVersion: security-profiles-operator.x-k8s.io/v1beta1
kind: SeccompProfile
metadata:
  name: test
spec:
  defaultAction: SCMP_ACT_ERRNO
  baseProfileName: oci://ghcr.io/security-profiles/runc:v1.1.5
  syscalls:
    - action: SCMP_ACT_ALLOW
      names:
        - uname

The operator will take care of pulling the content by using oras, as well as verifying the sigstore (cosign) signatures of the artifact. If the artifacts are not signed, then the SPO will reject them. The resulting profile test will then contain all base syscalls from the remote runc profile plus the additional allowed uname one. It is also possible to reference the base profile by its digest (SHA256) making the artifact to be pulled more specific, for example by referencing oci://ghcr.io/security-profiles/runc@sha256:380….

The operator internally caches pulled artifacts up to 24 hours for 1000 profiles, meaning that they will be refreshed after that time period, if the cache is full or the operator daemon gets restarted.

Because the overall resulting syscalls are hidden from the user (I only have the baseProfileName listed in the SeccompProfile, and not the syscalls themselves), I'll additionally annotate that SeccompProfile with the final syscalls.

Here's how the SeccompProfile looks after I annotate it:

> kubectl describe seccompprofile test
Name:         test
Namespace:    security-profiles-operator
Labels:       spo.x-k8s.io/profile-id=SeccompProfile-test
Annotations:  syscalls:
                [{"names":["arch_prctl","brk","capget","capset","chdir","clone","close",...
API Version:  security-profiles-operator.x-k8s.io/v1beta1

The SPO maintainers provide all public base profiles as part of the “Security Profiles” GitHub organization.

Managing OCI security profiles

Alright, now the official SPO provides a bunch of base profiles, but how can I define my own? Well, first of all we have to choose a working registry. There are a bunch of registries that already supports OCI artifacts:

The Security Profiles Operator ships a new command line interface called spoc, which is a little helper tool for managing OCI profiles among doing various other things which are out of scope of this blog post. But, the command spoc push can be used to push a security profile to a registry:

> export USERNAME=my-user
> export PASSWORD=my-pass
> spoc push -f ./examples/baseprofile-crun.yaml ghcr.io/security-profiles/crun:v1.8.3
16:35:43.899886 Pushing profile ./examples/baseprofile-crun.yaml to: ghcr.io/security-profiles/crun:v1.8.3
16:35:43.899939 Creating file store in: /tmp/push-3618165827
16:35:43.899947 Adding profile to store: ./examples/baseprofile-crun.yaml
16:35:43.900061 Packing files
16:35:43.900282 Verifying reference: ghcr.io/security-profiles/crun:v1.8.3
16:35:43.900310 Using tag: v1.8.3
16:35:43.900313 Creating repository for ghcr.io/security-profiles/crun
16:35:43.900319 Using username and password
16:35:43.900321 Copying profile to repository
16:35:46.976108 Signing container image
Generating ephemeral keys...
Retrieving signed certificate...

        Note that there may be personally identifiable information associated with this signed artifact.
        This may include the email address associated with the account with which you authenticate.
        This information will be used for signing this artifact and will be stored in public transparency logs and cannot be removed later.

By typing 'y', you attest that you grant (or have permission to grant) and agree to have this information stored permanently in transparency logs.
Your browser will now be opened to:
https://oauth2.sigstore.dev/auth/auth?access_type=…
Successfully verified SCT...
tlog entry created with index: 16520520
Pushing signature to: ghcr.io/security-profiles/crun

You can see that the tool automatically signs the artifact and pushes the ./examples/baseprofile-crun.yaml to the registry, which is then directly ready for usage within the SPO. If username and password authentication is required, either use the --username, -u flag or export the USERNAME environment variable. To set the password, export the PASSWORD environment variable.

It is possible to add custom annotations to the security profile by using the --annotations / -a flag multiple times in KEY:VALUE format. Those have no effect for now, but at some later point additional features of the operator may rely them.

The spoc client is also able to pull security profiles from OCI artifact compatible registries. To do that, just run spoc pull:

> spoc pull ghcr.io/security-profiles/runc:v1.1.5
16:32:29.795597 Pulling profile from: ghcr.io/security-profiles/runc:v1.1.5
16:32:29.795610 Verifying signature

Verification for ghcr.io/security-profiles/runc:v1.1.5 --
The following checks were performed on each of these signatures:
  - Existence of the claims in the transparency log was verified offline
  - The code-signing certificate was verified using trusted certificate authority certificates

[{"critical":{"identity":{"docker-reference":"ghcr.io/security-profiles/runc"},…}}]
16:32:33.208695 Creating file store in: /tmp/pull-3199397214
16:32:33.208713 Verifying reference: ghcr.io/security-profiles/runc:v1.1.5
16:32:33.208718 Creating repository for ghcr.io/security-profiles/runc
16:32:33.208742 Using tag: v1.1.5
16:32:33.208743 Copying profile from repository
16:32:34.119652 Reading profile
16:32:34.119677 Trying to unmarshal seccomp profile
16:32:34.120114 Got SeccompProfile: runc-v1.1.5
16:32:34.120119 Saving profile in: /tmp/profile.yaml

The profile can be now found in /tmp/profile.yaml or the specified output file --output-file / -o. We can specify an username and password in the same way as for spoc push.

spoc makes it easy to manage security profiles as OCI artifacts, which can be then consumed directly by the operator itself.

That was our compact journey through the latest possibilities of the Security Profiles Operator! If you're interested in more, providing feedback or asking for help, then feel free to get in touch with us directly via Slack (#security-profiles-operator) or the mailing list.

Having fun with seccomp profiles on the edge

The Security Profiles Operator (SPO) is a feature-rich operator for Kubernetes to make managing seccomp, SELinux and AppArmor profiles easier than ever. Recording those profiles from scratch is one of the key features of this operator, which usually involves the integration into large CI/CD systems. Being able to test the recording capabilities of the operator in edge cases is one of the recent development efforts of the SPO and makes it excitingly easy to play around with seccomp profiles.

Recording seccomp profiles with spoc record

The v0.8.0 release of the Security Profiles Operator shipped a new command line interface called spoc, a little helper tool for recording and replaying seccomp profiles among various other things that are out of scope of this blog post.

Recording a seccomp profile requires a binary to be executed, which can be a simple golang application which just calls uname(2):

package main

import (
	"syscall"
)

func main() {
	utsname := syscall.Utsname{}
	if err := syscall.Uname(&utsname); err != nil {
		panic(err)
	}
}

Building a binary from that code can be done by:

> go build -o main main.go
> ldd ./main
        not a dynamic executable

Now it's possible to download the latest binary of spoc from GitHub and run the application on Linux with it:

> sudo ./spoc record ./main
10:08:25.591945 Loading bpf module
10:08:25.591958 Using system btf file
libbpf: loading object 'recorder.bpf.o' from buffer
libbpf: prog 'sys_enter': relo #3: patched insn #22 (ALU/ALU64) imm 16 -> 16
10:08:25.610767 Getting bpf program sys_enter
10:08:25.610778 Attaching bpf tracepoint
10:08:25.611574 Getting syscalls map
10:08:25.611582 Getting pid_mntns map
10:08:25.613097 Module successfully loaded
10:08:25.613311 Processing events
10:08:25.613693 Running command with PID: 336007
10:08:25.613835 Received event: pid: 336007, mntns: 4026531841
10:08:25.613951 No container ID found for PID (pid=336007, mntns=4026531841, err=unable to find container ID in cgroup path)
10:08:25.614856 Processing recorded data
10:08:25.614975 Found process mntns 4026531841 in bpf map
10:08:25.615110 Got syscalls: read, close, mmap, rt_sigaction, rt_sigprocmask, madvise, nanosleep, clone, uname, sigaltstack, arch_prctl, gettid, futex, sched_getaffinity, exit_group, openat
10:08:25.615195 Adding base syscalls: access, brk, capget, capset, chdir, chmod, chown, close_range, dup2, dup3, epoll_create1, epoll_ctl, epoll_pwait, execve, faccessat2, fchdir, fchmodat, fchown, fchownat, fcntl, fstat, fstatfs, getdents64, getegid, geteuid, getgid, getpid, getppid, getuid, ioctl, keyctl, lseek, mkdirat, mknodat, mount, mprotect, munmap, newfstatat, openat2, pipe2, pivot_root, prctl, pread64, pselect6, readlink, readlinkat, rt_sigreturn, sched_yield, seccomp, set_robust_list, set_tid_address, setgid, setgroups, sethostname, setns, setresgid, setresuid, setsid, setuid, statfs, statx, symlinkat, tgkill, umask, umount2, unlinkat, unshare, write
10:08:25.616293 Wrote seccomp profile to: /tmp/profile.yaml
10:08:25.616298 Unloading bpf module

I have to execute spoc as root because it will internally run an ebpf program by reusing the same code parts from the Security Profiles Operator itself. I can see that the bpf module got loaded successfully and spoc attached the required tracepoint to it. Then it will track the main application by using its mount namespace and process the recorded syscall data. The nature of ebpf programs is that they see the whole context of the Kernel, which means that spoc tracks all syscalls of the system, but does not interfere with their execution.

The logs indicate that spoc found the syscalls read, close, mmap and so on, including uname. All other syscalls than uname are coming from the golang runtime and its garbage collection, which already adds overhead to a basic application like in our demo. I can also see from the log line Adding base syscalls: … that spoc adds a bunch of base syscalls to the resulting profile. Those are used by the OCI runtime (like runc or crun) in order to be able to run a container. This means that spoc can be used to record seccomp profiles which then can be containerized directly. This behavior can be disabled in spoc by using the --no-base-syscalls/-n or customized via the --base-syscalls/-b command line flags. This can be helpful in cases where different OCI runtimes other than crun and runc are used, or if I just want to record the seccomp profile for the application and stack it with another base profile.

The resulting profile is now available in /tmp/profile.yaml, but the default location can be changed using the --output-file value/-o flag:

> cat /tmp/profile.yaml
apiVersion: security-profiles-operator.x-k8s.io/v1beta1
kind: SeccompProfile
metadata:
  creationTimestamp: null
  name: main
spec:
  architectures:
    - SCMP_ARCH_X86_64
  defaultAction: SCMP_ACT_ERRNO
  syscalls:
    - action: SCMP_ACT_ALLOW
      names:
        - access
        - arch_prctl
        - brk
        - …
        - uname
        - …
status: {}

The seccomp profile Custom Resource Definition (CRD) can be directly used together with the Security Profiles Operator for managing it within Kubernetes. spoc is also capable of producing raw seccomp profiles (as JSON), by using the --type/-t raw-seccomp flag:

> sudo ./spoc record --type raw-seccomp ./main
52.628827 Wrote seccomp profile to: /tmp/profile.json
> jq . /tmp/profile.json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["access", "…", "write"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

The utility spoc record allows us to record complex seccomp profiles directly from binary invocations in any Linux system which is capable of running the ebpf code within the Kernel. But it can do more: How about modifying the seccomp profile and then testing it by using spoc run.

Running seccomp profiles with spoc run

spoc is also able to run binaries with applied seccomp profiles, making it easy to test any modification to it. To do that, just run:

> sudo ./spoc run ./main
10:29:58.153263 Reading file /tmp/profile.yaml
10:29:58.153311 Assuming YAML profile
10:29:58.154138 Setting up seccomp
10:29:58.154178 Load seccomp profile
10:29:58.154189 Starting audit log enricher
10:29:58.154224 Enricher reading from file /var/log/audit/audit.log
10:29:58.155356 Running command with PID: 437880
>

It looks like that the application exited successfully, which is anticipated because I did not modify the previously recorded profile yet. I can also specify a custom location for the profile by using the --profile/-p flag, but this was not necessary because I did not modify the default output location from the record. spoc will automatically determine if it's a raw (JSON) or CRD (YAML) based seccomp profile and then apply it to the process.

The Security Profiles Operator supports a log enricher feature, which provides additional seccomp related information by parsing the audit logs. spoc run uses the enricher in the same way to provide more data to the end users when it comes to debugging seccomp profiles.

Now I have to modify the profile to see anything valuable in the output. For example, I could remove the allowed uname syscall:

> jq 'del(.syscalls[0].names[] | select(. == "uname"))' /tmp/profile.json > /tmp/no-uname-profile.json

And then try to run it again with the new profile /tmp/no-uname-profile.json:

> sudo ./spoc run -p /tmp/no-uname-profile.json ./main
10:39:12.707798 Reading file /tmp/no-uname-profile.json
10:39:12.707892 Setting up seccomp
10:39:12.707920 Load seccomp profile
10:39:12.707982 Starting audit log enricher
10:39:12.707998 Enricher reading from file /var/log/audit/audit.log
10:39:12.709164 Running command with PID: 480512
panic: operation not permitted

goroutine 1 [running]:
main.main()
        /path/to/main.go:10 +0x85
10:39:12.713035 Unable to run: launch runner: wait for command: exit status 2

Alright, that was expected! The applied seccomp profile blocks the uname syscall, which results in an "operation not permitted" error. This error is pretty generic and does not provide any hint on what got blocked by seccomp. It is generally extremely difficult to predict how applications behave if single syscalls are forbidden by seccomp. It could be possible that the application terminates like in our simple demo, but it could also lead to a strange misbehavior and the application does not stop at all.

If I now change the default seccomp action of the profile from SCMP_ACT_ERRNO to SCMP_ACT_LOG like this:

> jq '.defaultAction = "SCMP_ACT_LOG"' /tmp/no-uname-profile.json > /tmp/no-uname-profile-log.json

Then the log enricher will give us a hint that the uname syscall got blocked when using spoc run:

> sudo ./spoc run -p /tmp/no-uname-profile-log.json ./main
10:48:07.470126 Reading file /tmp/no-uname-profile-log.json
10:48:07.470234 Setting up seccomp
10:48:07.470245 Load seccomp profile
10:48:07.470302 Starting audit log enricher
10:48:07.470339 Enricher reading from file /var/log/audit/audit.log
10:48:07.470889 Running command with PID: 522268
10:48:07.472007 Seccomp: uname (63)

The application will not terminate any more, but seccomp will log the behavior to /var/log/audit/audit.log and spoc will parse the data to correlate it directly to our program. Generating the log messages to the audit subsystem comes with a large performance overhead and should be handled with care in production systems. It also comes with a security risk when running untrusted apps in audit mode in production environments.

This demo should give you an impression how to debug seccomp profile issues with applications, probably by using our shiny new helper tool powered by the features of the Security Profiles Operator. spoc is a flexible and portable binary suitable for edge cases where resources are limited and even Kubernetes itself may not be available with its full capabilities.

Thank you for reading this blog post! If you're interested in more, providing feedback or asking for help, then feel free to get in touch with us directly via Slack (#security-profiles-operator) or the mailing list.

Kubernetes 1.27: KMS V2 Moves to Beta

With Kubernetes 1.27, we (SIG Auth) are moving Key Management Service (KMS) v2 API to beta.

What is KMS?

One of the first things to consider when securing a Kubernetes cluster is encrypting etcd data at rest. KMS provides an interface for a provider to utilize a key stored in an external key service to perform this encryption.

KMS v1 has been a feature of Kubernetes since version 1.10, and is currently in beta as of version v1.12. KMS v2 was introduced as alpha in v1.25.

What’s new in v2beta1?

The KMS encryption provider uses an envelope encryption scheme to encrypt data in etcd. The data is encrypted using a data encryption key (DEK). The DEKs are encrypted with a key encryption key (KEK) that is stored and managed in a remote KMS. With KMS v1, a new DEK is generated for each encryption. With KMS v2, a new DEK is only generated on server startup and when the KMS plugin informs the API server that a KEK rotation has occurred.

Sequence Diagram

Encrypt Request

Sequence diagram for KMSv2 beta Encrypt

Decrypt Request

Sequence diagram for KMSv2 beta Decrypt

Status Request

Sequence diagram for KMSv2 beta Status

Generate Data Encryption Key (DEK)

Sequence diagram for KMSv2 beta Generate DEK

Performance Improvements

With KMS v2, we have made significant improvements to the performance of the KMS encryption provider. In case of KMS v1, a new DEK is generated for every encryption. This means that for every write request, the API server makes a call to the KMS plugin to encrypt the DEK using the remote KEK. The API server also has to cache the DEKs to avoid making a call to the KMS plugin for every read request. When the API server restarts, it has to populate the cache by making a call to the KMS plugin for every DEK in the etcd store based on the cache size. This is a significant overhead for the API server. With KMS v2, the API server generates a DEK at startup and caches it. The API server also makes a call to the KMS plugin to encrypt the DEK using the remote KEK. This is a one-time call at startup and on KEK rotation. The API server then uses the cached DEK to encrypt the resources. This reduces the number of calls to the KMS plugin and improves the overall latency of the API server requests.

We conducted a test that created 12k secrets and measured the time taken for the API server to encrypt the resources. The metric used was apiserver_storage_transformation_duration_seconds. For KMS v1, the test was run on a managed Kubernetes v1.25 cluster with 2 nodes. There was no additional load on the cluster during the test. For KMS v2, the test was run in the Kubernetes CI environment with the following cluster configuration.

KMS ProviderTime taken by 95 percentile
KMS v1160ms
KMS v280μs

The results show that the KMS v2 encryption provider is three orders of magnitude faster than the KMS v1 encryption provider.

What's next?

For Kubernetes v1.28, we expect the feature to stay in beta. In the coming releases we want to investigate:

  • Cryptographic changes to remove the limitation on VM state store.
  • Kubernetes REST API changes to enable a more robust story around key rotation.
  • Handling undecryptable resources. Refer to the KEP for details.

You can learn more about KMS v2 by reading Using a KMS provider for data encryption. You can also follow along on the KEP to track progress across the coming Kubernetes releases.

Call to action

In this blog post, we have covered the improvements made to the KMS encryption provider in Kubernetes v1.27. We have also discussed the new KMS v2 API and how it works. We would love to hear your feedback on this feature. In particular, we would like feedback from Kubernetes KMS plugin implementors as they go through the process of building their integrations with this new API. Please reach out to us on the #sig-auth-kms-dev channel on Kubernetes Slack.

How to get involved

If you are interested in getting involved in the development of this feature, share feedback, or participate in any other ongoing SIG Auth projects, please reach out on the #sig-auth channel on Kubernetes Slack.

You are also welcome to join the bi-weekly SIG Auth meetings, held every-other Wednesday.

Acknowledgements

This feature has been an effort driven by contributors from several different companies. We would like to extend a huge thank you to everyone that contributed their time and effort to help make this possible.

Kubernetes 1.27: updates on speeding up Pod startup

How can Pod start-up be accelerated on nodes in large clusters? This is a common issue that cluster administrators may face.

This blog post focuses on methods to speed up pod start-up from the kubelet side. It does not involve the creation time of pods by controller-manager through kube-apiserver, nor does it include scheduling time for pods or webhooks executed on it.

We have mentioned some important factors here to consider from the kubelet's perspective, but this is not an exhaustive list. As Kubernetes v1.27 is released, this blog highlights significant changes in v1.27 that aid in speeding up pod start-up.

Parallel container image pulls

Pulling images always takes some time and what's worse is that image pulls are done serially by default. In other words, kubelet will send only one image pull request to the image service at a time. Other image pull requests have to wait until the one being processed is complete.

To enable parallel image pulls, set the serializeImagePulls field to false in the kubelet configuration. When serializeImagePulls is disabled, requests for image pulls are immediately sent to the image service and multiple images can be pulled concurrently.

Maximum parallel image pulls will help secure your node from overloading on image pulling

We introduced a new feature in kubelet that sets a limit on the number of parallel image pulls at the node level. This limit restricts the maximum number of images that can be pulled simultaneously. If there is an image pull request beyond this limit, it will be blocked until one of the ongoing image pulls finishes. Before enabling this feature, please ensure that your container runtime's image service can handle parallel image pulls effectively.

To limit the number of simultaneous image pulls, you can configure the maxParallelImagePulls field in kubelet. By setting maxParallelImagePulls to a value of n, only n images will be pulled concurrently. Any additional image pulls beyond this limit will wait until at least one ongoing pull is complete.

You can find more details in the associated KEP: Kubelet limit of Parallel Image Pulls (KEP-3673).

Raised default API query-per-second limits for kubelet

To improve pod startup in scenarios with multiple pods on a node, particularly sudden scaling situations, it is necessary for Kubelet to synchronize the pod status and prepare configmaps, secrets, or volumes. This requires a large bandwidth to access kube-apiserver.

In versions prior to v1.27, the default kubeAPIQPS was 5 and kubeAPIBurst was 10. However, the kubelet in v1.27 has increased these defaults to 50 and 100 respectively for better performance during pod startup. It's worth noting that this isn't the only reason why we've bumped up the API QPS limits for Kubelet.

  1. It has a potential to be hugely throttled now (default QPS = 5)
  2. In large clusters they can generate significant load anyway as there are a lot of them
  3. They have a dedicated PriorityLevel and FlowSchema that we can easily control

Previously, we often encountered volume mount timeout on kubelet in node with more than 50 pods during pod start up. We suggest that cluster operators bump kubeAPIQPS to 20 and kubeAPIBurst to 40, especially if using bare metal nodes.

More detials can be found in the KEP https://kep.k8s.io/1040 and the pull request #116121.

Event triggered updates to container status

Evented PLEG (PLEG is short for "Pod Lifecycle Event Generator") is set to be in beta for v1.27, Kubernetes offers two ways for the kubelet to detect Pod lifecycle events, such as the last process in a container shutting down. In Kubernetes v1.27, the event based mechanism has graduated to beta but remains disabled by default. If you do explicitly switch to event-based lifecycle change detection, the kubelet is able to start Pods more quickly than with the default approach that relies on polling. The default mechanism, polling for lifecycle changes, adds a noticeable overhead; this affects the kubelet's ability to handle different tasks in parallel, and leads to poor performance and reliability issues. For these reasons, we recommend that you switch your nodes to use event-based pod lifecycle change detection.

Further details can be found in the KEP https://kep.k8s.io/3386 and Switching From Polling to CRI Event-based Updates to Container Status.

Raise your pod resource limit if needed

During start-up, some pods may consume a considerable amount of CPU or memory. If the CPU limit is low, this can significantly slow down the pod start-up process. To improve the memory management, Kubernetes v1.22 introduced a feature gate called MemoryQoS to kubelet. This feature enables kubelet to set memory QoS at container, pod, and QoS levels for better protection and guaranteed quality of memory when running with cgroups v2. Although it has benefits, it is possible that enabling this feature gate may affect the start-up speed of the pod if the pod startup consumes a large amount of memory.

Kubelet configuration now includes memoryThrottlingFactor. This factor is multiplied by the memory limit or node allocatable memory to set the cgroupv2 memory.high value for enforcing MemoryQoS. Decreasing this factor sets a lower high limit for container cgroups, increasing reclaim pressure. Increasing this factor will put less reclaim pressure. The default value is 0.8 initially and will change to 0.9 in Kubernetes v1.27. This parameter adjustment can reduce the potential impact of this feature on pod startup speed.

Further details can be found in the KEP https://kep.k8s.io/2570.

What's more?

In Kubernetes v1.26, a new histogram metric pod_start_sli_duration_seconds was added for Pod startup latency SLI/SLO details. Additionally, the kubelet log will now display more information about pod start-related timestamps, as shown below:

Dec 30 15:33:13.375379 e2e-022435249c-674b9-minion-group-gdj4 kubelet[8362]: I1230 15:33:13.375359 8362 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="kube-system/konnectivity-agent-gnc9k" podStartSLOduration=-9.223372029479458e+09 pod.CreationTimestamp="2022-12-30 15:33:06 +0000 UTC" firstStartedPulling="2022-12-30 15:33:09.258791695 +0000 UTC m=+13.029631711" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2022-12-30 15:33:13.375009262 +0000 UTC m=+17.145849275" watchObservedRunningTime="2022-12-30 15:33:13.375317944 +0000 UTC m=+17.146157970"

The SELinux Relabeling with Mount Options feature moved to Beta in v1.27. This feature speeds up container startup by mounting volumes with the correct SELinux label instead of changing each file on the volumes recursively. Further details can be found in the KEP https://kep.k8s.io/1710.

To identify the cause of slow pod startup, analyzing metrics and logs can be helpful. Other factors that may impact pod startup include container runtime, disk speed, CPU and memory resources on the node.

SIG Node is responsible for ensuring fast Pod startup times, while addressing issues in large clusters falls under the purview of SIG Scalability as well.

Kubernetes 1.27: In-place Resource Resize for Kubernetes Pods (alpha)

If you have deployed Kubernetes pods with CPU and/or memory resources specified, you may have noticed that changing the resource values involves restarting the pod. This has been a disruptive operation for running workloads... until now.

In Kubernetes v1.27, we have added a new alpha feature that allows users to resize CPU/memory resources allocated to pods without restarting the containers. To facilitate this, the resources field in a pod's containers now allow mutation for cpu and memory resources. They can be changed simply by patching the running pod spec.

This also means that resources field in the pod spec can no longer be relied upon as an indicator of the pod's actual resources. Monitoring tools and other such applications must now look at new fields in the pod's status. Kubernetes queries the actual CPU and memory requests and limits enforced on the running containers via a CRI (Container Runtime Interface) API call to the runtime, such as containerd, which is responsible for running the containers. The response from container runtime is reflected in the pod's status.

In addition, a new restartPolicy for resize has been added. It gives users control over how their containers are handled when resources are resized.

What's new in v1.27?

Besides the addition of resize policy in the pod's spec, a new field named allocatedResources has been added to containerStatuses in the pod's status. This field reflects the node resources allocated to the pod's containers.

In addition, a new field called resources has been added to the container's status. This field reflects the actual resource requests and limits configured on the running containers as reported by the container runtime.

Lastly, a new field named resize has been added to the pod's status to show the status of the last requested resize. A value of Proposed is an acknowledgement of the requested resize and indicates that request was validated and recorded. A value of InProgress indicates that the node has accepted the resize request and is in the process of applying the resize request to the pod's containers. A value of Deferred means that the requested resize cannot be granted at this time, and the node will keep retrying. The resize may be granted when other pods leave and free up node resources. A value of Infeasible is a signal that the node cannot accommodate the requested resize. This can happen if the requested resize exceeds the maximum resources the node can ever allocate for a pod.

When to use this feature

Here are a few examples where this feature may be useful:

  • Pod is running on node but with either too much or too little resources.
  • Pods are not being scheduled due to lack of sufficient CPU or memory in a cluster that is underutilized by running pods that were overprovisioned.
  • Evicting certain stateful pods that need more resources to schedule them on bigger nodes is an expensive or disruptive operation when other lower priority pods in the node can be resized down or moved.

How to use this feature

In order to use this feature in v1.27, the InPlacePodVerticalScaling feature gate must be enabled. A local cluster with this feature enabled can be started as shown below:

root@vbuild:~/go/src/k8s.io/kubernetes# FEATURE_GATES=InPlacePodVerticalScaling=true ./hack/local-up-cluster.sh
go version go1.20.2 linux/arm64
+++ [0320 13:52:02] Building go targets for linux/arm64
    k8s.io/kubernetes/cmd/kubectl (static)
    k8s.io/kubernetes/cmd/kube-apiserver (static)
    k8s.io/kubernetes/cmd/kube-controller-manager (static)
    k8s.io/kubernetes/cmd/cloud-controller-manager (non-static)
    k8s.io/kubernetes/cmd/kubelet (non-static)
...
...
Logs:
  /tmp/etcd.log
  /tmp/kube-apiserver.log
  /tmp/kube-controller-manager.log

  /tmp/kube-proxy.log
  /tmp/kube-scheduler.log
  /tmp/kubelet.log

To start using your cluster, you can open up another terminal/tab and run:

  export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig
  cluster/kubectl.sh

Alternatively, you can write to the default kubeconfig:

  export KUBERNETES_PROVIDER=local

  cluster/kubectl.sh config set-cluster local --server=https://localhost:6443 --certificate-authority=/var/run/kubernetes/server-ca.crt
  cluster/kubectl.sh config set-credentials myself --client-key=/var/run/kubernetes/client-admin.key --client-certificate=/var/run/kubernetes/client-admin.crt
  cluster/kubectl.sh config set-context local --cluster=local --user=myself
  cluster/kubectl.sh config use-context local
  cluster/kubectl.sh

Once the local cluster is up and running, Kubernetes users can schedule pods with resources, and resize the pods via kubectl. An example of how to use this feature is illustrated in the following demo video.

Example Use Cases

Cloud-based Development Environment

In this scenario, developers or development teams write their code locally but build and test their code in Kubernetes pods with consistent configs that reflect production use. Such pods need minimal resources when the developers are writing code, but need significantly more CPU and memory when they build their code or run a battery of tests. This use case can leverage in-place pod resize feature (with a little help from eBPF) to quickly resize the pod's resources and avoid kernel OOM (out of memory) killer from terminating their processes.

This KubeCon North America 2022 conference talk illustrates the use case.

Java processes initialization CPU requirements

Some Java applications may need significantly more CPU during initialization than what is needed during normal process operation time. If such applications specify CPU requests and limits suited for normal operation, they may suffer from very long startup times. Such pods can request higher CPU values at the time of pod creation, and can be resized down to normal running needs once the application has finished initializing.

Known Issues

This feature enters v1.27 at alpha stage. Below are a few known issues users may encounter:

  • containerd versions below v1.6.9 do not have the CRI support needed for full end-to-end operation of this feature. Attempts to resize pods will appear to be stuck in the InProgress state, and resources field in the pod's status are never updated even though the new resources may have been enacted on the running containers.
  • Pod resize may encounter a race condition with other pod updates, causing delayed enactment of pod resize.
  • Reflecting the resized container resources in pod's status may take a while.
  • Static CPU management policy is not supported with this feature.

Credits

This feature is a result of the efforts of a very collaborative Kubernetes community. Here's a little shoutout to just a few of the many many people that contributed countless hours of their time and helped make this happen.

  • @thockin for detail-oriented API design and air-tight code reviews.
  • @derekwaynecarr for simplifying the design and thorough API and node reviews.
  • @dchen1107 for bringing vast knowledge from Borg and helping us avoid pitfalls.
  • @ruiwen-zhao for adding containerd support that enabled full E2E implementation.
  • @wangchen615 for implementing comprehensive E2E tests and driving scheduler fixes.
  • @bobbypage for invaluable help getting CI ready and quickly investigating issues, covering for me on my vacation.
  • @Random-Liu for thorough kubelet reviews and identifying problematic race conditions.
  • @Huang-Wei, @ahg-g, @alculquicondor for helping get scheduler changes done.
  • @mikebrow @marosset for reviews on short notice that helped CRI changes make it into v1.25.
  • @endocrimes, @ehashman for helping ensure that the oft-overlooked tests are in good shape.
  • @mrunalp for reviewing cgroupv2 changes and ensuring clean handling of v1 vs v2.
  • @liggitt, @gjkim42 for tracking down, root-causing important missed issues post-merge.
  • @SergeyKanzhelev for supporting and shepherding various issues during the home stretch.
  • @pdgetrf for making the first prototype a reality.
  • @dashpole for bringing me up to speed on 'the Kubernetes way' of doing things.
  • @bsalamat, @kgolab for very thoughtful insights and suggestions in the early stages.
  • @sftim, @tengqm for ensuring docs are easy to follow.
  • @dims for being omnipresent and helping make merges happen at critical hours.
  • Release teams for ensuring that the project stayed healthy.

And a big thanks to my very supportive management Dr. Xiaoning Ding and Dr. Ying Xiong for their patience and encouragement.

References

For app developers

For cluster administrators

Kubernetes 1.27: Avoid Collisions Assigning Ports to NodePort Services

In Kubernetes, a Service can be used to provide a unified traffic endpoint for applications running on a set of Pods. Clients can use the virtual IP address (or VIP) provided by the Service for access, and Kubernetes provides load balancing for traffic accessing different back-end Pods, but a ClusterIP type of Service is limited to providing access to nodes within the cluster, while traffic from outside the cluster cannot be routed. One way to solve this problem is to use a type: NodePort Service, which sets up a mapping to a specific port of all nodes in the cluster, thus redirecting traffic from the outside to the inside of the cluster.

How Kubernetes allocates node ports to Services?

When a type: NodePort Service is created, its corresponding port(s) are allocated in one of two ways:

  • Dynamic : If the Service type is NodePort and you do not set a nodePort value explicitly in the spec for that Service, the Kubernetes control plane will automatically allocate an unused port to it at creation time.

  • Static : In addition to the dynamic auto-assignment described above, you can also explicitly assign a port that is within the nodeport port range configuration.

The value of nodePort that you manually assign must be unique across the whole cluster. Attempting to create a Service of type: NodePort where you explicitly specify a node port that was already allocated results in an error.

Why do you need to reserve ports of NodePort Service?

Sometimes, you may want to have a NodePort Service running on well-known ports so that other components and users inside o r outside the cluster can use them.

In some complex cluster deployments with a mix of Kubernetes nodes and other servers on the same network, it may be necessary to use some pre-defined ports for communication. In particular, some fundamental components cannot rely on the VIPs that back type: LoadBalancer Services because the virtual IP address mapping implementation for that cluster also relies on these foundational components.

Now suppose you need to expose a Minio object storage service on Kubernetes to clients running outside the Kubernetes cluster, and the agreed port is 30009, we need to create a Service as follows:

apiVersion: v1
kind: Service
metadata:
  name: minio
spec:
  ports:
  - name: api
    nodePort: 30009
    port: 9000
    protocol: TCP
    targetPort: 9000
  selector:
    app: minio
  type: NodePort

However, as mentioned before, if the port (30009) required for the minio Service is not reserved, and another type: NodePort (or possibly type: LoadBalancer) Service is created and dynamically allocated before or concurrently with the minio Service, TCP port 30009 might be allocated to that other Service; if so, creation of the minio Service will fail due to a node port collision.

How can you avoid NodePort Service port conflicts?

Kubernetes 1.24 introduced changes for type: ClusterIP Services, dividing the CIDR range for cluster IP addresses into two blocks that use different allocation policies to reduce the risk of conflicts. In Kubernetes 1.27, as an alpha feature, you can adopt a similar policy for type: NodePort Services. You can enable a new feature gate ServiceNodePortStaticSubrange. Turning this on allows you to use a different port allocation strategy for type: NodePort Services, and reduce the risk of collision.

The port range for NodePort will be divided, based on the formula min(max(16, nodeport-size / 32), 128). The outcome of the formula will be a number between 16 and 128, with a step size that increases as the size of the nodeport range increases. The outcome of the formula determine that the size of static port range. When the port range is less than 16, the size of static port range will be set to 0, which means that all ports will be dynamically allocated.

Dynamic port assignment will use the upper band by default, once this has been exhausted it will use the lower range. This will allow users to use static allocations on the lower band with a low risk of collision.

Examples

default range: 30000-32767

Range propertiesValues
service-node-port-range30000-32767
Band Offsetmin(max(16, 2768/32), 128)
= min(max(16, 86), 128)
= min(86, 128)
= 86
Static band start30000
Static band end30085
Dynamic band start30086
Dynamic band end32767
pie showData title 30000-32767 "Static" : 86 "Dynamic" : 2682

very small range: 30000-30015

Range propertiesValues
service-node-port-range30000-30015
Band Offset0
Static band start-
Static band end-
Dynamic band start30000
Dynamic band end30015
pie showData title 30000-30015 "Static" : 0 "Dynamic" : 16

small(lower boundary) range: 30000-30127

Range propertiesValues
service-node-port-range30000-30127
Band Offsetmin(max(16, 128/32), 128)
= min(max(16, 4), 128)
= min(16, 128)
= 16
Static band start30000
Static band end30015
Dynamic band start30016
Dynamic band end30127
pie showData title 30000-30127 "Static" : 16 "Dynamic" : 112

large(upper boundary) range: 30000-34095

Range propertiesValues
service-node-port-range30000-34095
Band Offsetmin(max(16, 4096/32), 128)
= min(max(16, 128), 128)
= min(128, 128)
= 128
Static band start30000
Static band end30127
Dynamic band start30128
Dynamic band end34095
pie showData title 30000-34095 "Static" : 128 "Dynamic" : 3968

very large range: 30000-38191

Range propertiesValues
service-node-port-range30000-38191
Band Offsetmin(max(16, 8192/32), 128)
= min(max(16, 256), 128)
= min(256, 128)
= 128
Static band start30000
Static band end30127
Dynamic band start30128
Dynamic band end38191
pie showData title 30000-38191 "Static" : 128 "Dynamic" : 8064

Kubernetes 1.27: Safer, More Performant Pruning in kubectl apply

Declarative configuration management with the kubectl apply command is the gold standard approach to creating or modifying Kubernetes resources. However, one challenge it presents is the deletion of resources that are no longer needed. In Kubernetes version 1.5, the --prune flag was introduced to address this issue, allowing kubectl apply to automatically clean up previously applied resources removed from the current configuration.

Unfortunately, that existing implementation of --prune has design flaws that diminish its performance and can result in unexpected behaviors. The main issue stems from the lack of explicit encoding of the previously applied set by the preceding apply operation, necessitating error-prone dynamic discovery. Object leakage, inadvertent over-selection of resources, and limited compatibility with custom resources are a few notable drawbacks of this implementation. Moreover, its coupling to client-side apply hinders user upgrades to the superior server-side apply mechanism.

Version 1.27 of kubectl introduces an alpha version of a revamped pruning implementation that addresses these issues. This new implementation, based on a concept called ApplySet, promises better performance and safety.

An ApplySet is a group of resources associated with a parent object on the cluster, as identified and configured through standardized labels and annotations. Additional standardized metadata allows for accurate identification of ApplySet member objects within the cluster, simplifying operations like pruning.

To leverage ApplySet-based pruning, set the KUBECTL_APPLYSET=true environment variable and include the flags --prune and --applyset in your kubectl apply invocation:

KUBECTL_APPLYSET=true kubectl apply -f <directory/> --prune --applyset=<name>

By default, ApplySet uses a Secret as the parent object. However, you can also use a ConfigMap with the format --applyset=configmaps/<name>. If your desired Secret or ConfigMap object does not yet exist, kubectl will create it for you. Furthermore, custom resources can be enabled for use as ApplySet parent objects.

The ApplySet implementation is based on a new low-level specification that can support higher-level ecosystem tools by improving their interoperability. The lightweight nature of this specification enables these tools to continue to use existing object grouping systems while opting in to ApplySet's metadata conventions to prevent inadvertent changes by other tools (such as kubectl).

ApplySet-based pruning offers a promising solution to the shortcomings of the previous --prune implementation in kubectl and can help streamline your Kubernetes resource management. Please give this new feature a try and share your experiences with the community—ApplySet is under active development, and your feedback is invaluable!

Additional resources

How do I get involved?

If you want to get involved in ApplySet development, you can get in touch with the developers at SIG CLI. To provide feedback on the feature, please file a bug or request an enhancement on the kubernetes/kubectl repository.

Kubernetes 1.27: Introducing An API For Volume Group Snapshots

Volume group snapshot is introduced as an Alpha feature in Kubernetes v1.27. This feature introduces a Kubernetes API that allows users to take crash consistent snapshots for multiple volumes together. It uses a label selector to group multiple PersistentVolumeClaims for snapshotting. This new feature is only supported for CSI volume drivers.

An overview of volume group snapshots

Some storage systems provide the ability to create a crash consistent snapshot of multiple volumes. A group snapshot represents “copies” from multiple volumes that are taken at the same point-in-time. A group snapshot can be used either to rehydrate new volumes (pre-populated with the snapshot data) or to restore existing volumes to a previous state (represented by the snapshots).

Why add volume group snapshots to Kubernetes?

The Kubernetes volume plugin system already provides a powerful abstraction that automates the provisioning, attaching, mounting, resizing, and snapshotting of block and file storage.

Underpinning all these features is the Kubernetes goal of workload portability: Kubernetes aims to create an abstraction layer between distributed applications and underlying clusters so that applications can be agnostic to the specifics of the cluster they run on and application deployment requires no cluster specific knowledge.

There is already a VolumeSnapshot API that provides the ability to take a snapshot of a persistent volume to protect against data loss or data corruption. However, there are other snapshotting functionalities not covered by the VolumeSnapshot API.

Some storage systems support consistent group snapshots that allow a snapshot to be taken from multiple volumes at the same point-in-time to achieve write order consistency. This can be useful for applications that contain multiple volumes. For example, an application may have data stored in one volume and logs stored in another volume. If snapshots for the data volume and the logs volume are taken at different times, the application will not be consistent and will not function properly if it is restored from those snapshots when a disaster strikes.

It is true that you can quiesce the application first, take an individual snapshot from each volume that is part of the application one after the other, and then unquiesce the application after all the individual snapshots are taken. This way, you would get application consistent snapshots.

However, sometimes it may not be possible to quiesce an application or the application quiesce can be too expensive so you want to do it less frequently. Taking individual snapshots one after another may also take longer time compared to taking a consistent group snapshot. Some users may not want to do application quiesce very often for these reasons. For example, a user may want to run weekly backups with application quiesce and nightly backups without application quiesce but with consistent group support which provides crash consistency across all volumes in the group.

Kubernetes Volume Group Snapshots API

Kubernetes Volume Group Snapshots introduce three new API objects for managing snapshots:

VolumeGroupSnapshot
Created by a Kubernetes user (or perhaps by your own automation) to request creation of a volume group snapshot for multiple persistent volume claims. It contains information about the volume group snapshot operation such as the timestamp when the volume group snapshot was taken and whether it is ready to use. The creation and deletion of this object represents a desire to create or delete a cluster resource (a group snapshot).
VolumeGroupSnapshotContent
Created by the snapshot controller for a dynamically created VolumeGroupSnapshot. It contains information about the volume group snapshot including the volume group snapshot ID. This object represents a provisioned resource on the cluster (a group snapshot). The VolumeGroupSnapshotContent object binds to the VolumeGroupSnapshot for which it was created with a one-to-one mapping.
VolumeGroupSnapshotClass
Created by cluster administrators to describe how volume group snapshots should be created. including the driver information, the deletion policy, etc.

These three API kinds are defined as CustomResourceDefinitions (CRDs). These CRDs must be installed in a Kubernetes cluster for a CSI Driver to support volume group snapshots.

How do I use Kubernetes Volume Group Snapshots

Volume group snapshots are implemented in the external-snapshotter repository. Implementing volume group snapshots meant adding or changing several components:

  • Added new CustomResourceDefinitions for VolumeGroupSnapshot and two supporting APIs.
  • Volume group snapshot controller logic is added to the common snapshot controller.
  • Volume group snapshot validation webhook logic is added to the common snapshot validation webhook.
  • Adding logic to make CSI calls into the snapshotter sidecar controller.

The volume snapshot controller, CRDs, and validation webhook are deployed once per cluster, while the sidecar is bundled with each CSI driver.

Therefore, it makes sense to deploy the volume snapshot controller, CRDs, and validation webhook as a cluster addon. I strongly recommend that Kubernetes distributors bundle and deploy the volume snapshot controller, CRDs, and validation webhook as part of their Kubernetes cluster management process (independent of any CSI Driver).

Creating a new group snapshot with Kubernetes

Once a VolumeGroupSnapshotClass object is defined and you have volumes you want to snapshot together, you may request a new group snapshot by creating a VolumeGroupSnapshot object.

The source of the group snapshot specifies whether the underlying group snapshot should be dynamically created or if a pre-existing VolumeGroupSnapshotContent should be used.

A pre-existing VolumeGroupSnapshotContent is created by a cluster administrator. It contains the details of the real volume group snapshot on the storage system which is available for use by cluster users.

One of the following members in the source of the group snapshot must be set.

  • selector - a label query over PersistentVolumeClaims that are to be grouped together for snapshotting. This labelSelector will be used to match the label added to a PVC.
  • volumeGroupSnapshotContentName - specifies the name of a pre-existing VolumeGroupSnapshotContent object representing an existing volume group snapshot.

In the following example, there are two PVCs.

NAME        STATUS    VOLUME                                     CAPACITY   ACCESSMODES   AGE
pvc-0       Bound     pvc-a42d7ea2-e3df-11ed-b5ea-0242ac120002   1Gi        RWO           48s
pvc-1       Bound     pvc-a42d81b8-e3df-11ed-b5ea-0242ac120002   1Gi        RWO           48s

Label the PVCs.

% kubectl label pvc pvc-0 group=myGroup
persistentvolumeclaim/pvc-0 labeled

% kubectl label pvc pvc-1 group=myGroup
persistentvolumeclaim/pvc-1 labeled

For dynamic provisioning, a selector must be set so that the snapshot controller can find PVCs with the matching labels to be snapshotted together.

apiVersion: groupsnapshot.storage.k8s.io/v1alpha1
kind: VolumeGroupSnapshot
metadata:
  name: new-group-snapshot-demo
  namespace: demo-namespace
spec:
  volumeGroupSnapshotClassName: csi-groupSnapclass
  source:
    selector:
      matchLabels:
        group: myGroup

In the VolumeGroupSnapshot spec, a user can specify the VolumeGroupSnapshotClass which has the information about which CSI driver should be used for creating the group snapshot.

Two individual volume snapshots will be created as part of the volume group snapshot creation.

snapshot-62abb5db7204ac6e4c1198629fec533f2a5d9d60ea1a25f594de0bf8866c7947-2023-04-26-2.20.4
snapshot-2026811eb9f0787466171fe189c805a22cdb61a326235cd067dc3a1ac0104900-2023-04-26-2.20.4

How to use group snapshot for restore in Kubernetes

At restore time, the user can request a new PersistentVolumeClaim to be created from a VolumeSnapshot object that is part of a VolumeGroupSnapshot. This will trigger provisioning of a new volume that is pre-populated with data from the specified snapshot. The user should repeat this until all volumes are created from all the snapshots that are part of a group snapshot.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc0-restore
  namespace: demo-namespace
spec:
  storageClassName: csi-hostpath-sc
  dataSource:
    name: snapshot-62abb5db7204ac6e4c1198629fec533f2a5d9d60ea1a25f594de0bf8866c7947-2023-04-26-2.20.4
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

As a storage vendor, how do I add support for group snapshots to my CSI driver?

To implement the volume group snapshot feature, a CSI driver must:

  • Implement a new group controller service.
  • Implement group controller RPCs: CreateVolumeGroupSnapshot, DeleteVolumeGroupSnapshot, and GetVolumeGroupSnapshot.
  • Add group controller capability CREATE_DELETE_GET_VOLUME_GROUP_SNAPSHOT.

See the CSI spec and the Kubernetes-CSI Driver Developer Guide for more details.

a CSI Volume Driver as possible, it provides a suggested mechanism to deploy a containerized CSI driver to simplify the process.

As part of this recommended deployment process, the Kubernetes team provides a number of sidecar (helper) containers, including the external-snapshotter sidecar container which has been updated to support volume group snapshot.

The external-snapshotter watches the Kubernetes API server for the VolumeGroupSnapshotContent object and triggers CreateVolumeGroupSnapshot and DeleteVolumeGroupSnapshot operations against a CSI endpoint.

What are the limitations?

The alpha implementation of volume group snapshots for Kubernetes has the following limitations:

  • Does not support reverting an existing PVC to an earlier state represented by a snapshot (only supports provisioning a new volume from a snapshot).
  • No application consistency guarantees beyond any guarantees provided by the storage system (e.g. crash consistency). See this doc for more discussions on application consistency.

What’s next?

Depending on feedback and adoption, the Kubernetes team plans to push the CSI Group Snapshot implementation to Beta in either 1.28 or 1.29. Some of the features we are interested in supporting include volume replication, replication group, volume placement, application quiescing, changed block tracking, and more.

How can I learn more?

How do I get involved?

This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. On behalf of SIG Storage, I would like to offer a huge thank you to the contributors who stepped up these last few quarters to help the project reach alpha:

We also want to thank everyone else who has contributed to the project, including others who helped review the KEP and the CSI spec PR.

For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We always welcome new contributors.

We also hold regular Data Protection Working Group meetings. New attendees are welcome to join our discussions.

Kubernetes 1.27: Quality-of-Service for Memory Resources (alpha)

Kubernetes v1.27, released in April 2023, introduced changes to Memory QoS (alpha) to improve memory management capabilites in Linux nodes.

Support for Memory QoS was initially added in Kubernetes v1.22, and later some limitations around the formula for calculating memory.high were identified. These limitations are addressed in Kubernetes v1.27.

Background

Kubernetes allows you to optionally specify how much of each resources a container needs in the Pod specification. The most common resources to specify are CPU and Memory.

For example, a Pod manifest that defines container resource requirements could look like:

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "64Mi"
        cpu: "500m"
  • spec.containers[].resources.requests

    When you specify the resource request for containers in a Pod, the Kubernetes scheduler uses this information to decide which node to place the Pod on. The scheduler ensures that for each resource type, the sum of the resource requests of the scheduled containers is less than the total allocatable resources on the node.

  • spec.containers[].resources.limits

    When you specify the resource limit for containers in a Pod, the kubelet enforces those limits so that the running containers are not allowed to use more of those resources than the limits you set.

When the kubelet starts a container as a part of a Pod, kubelet passes the container's requests and limits for CPU and memory to the container runtime. The container runtime assigns both CPU request and CPU limit to a container. Provided the system has free CPU time, the containers are guaranteed to be allocated as much CPU as they request. Containers cannot use more CPU than the configured limit i.e. containers CPU usage will be throttled if they use more CPU than the specified limit within a given time slice.

Prior to Memory QoS feature, the container runtime only used the memory limit and discarded the memory request (requests were, and still are, also used to influence scheduling). If a container uses more memory than the configured limit, the Linux Out Of Memory (OOM) killer will be invoked.

Let's compare how the container runtime on Linux typically configures memory request and limit in cgroups, with and without Memory QoS feature:

  • Memory request

    The memory request is mainly used by kube-scheduler during (Kubernetes) Pod scheduling. In cgroups v1, there are no controls to specify the minimum amount of memory the cgroups must always retain. Hence, the container runtime did not use the value of requested memory set in the Pod spec.

    cgroups v2 introduced a memory.min setting, used to specify the minimum amount of memory that should remain available to the processes within a given cgroup. If the memory usage of a cgroup is within its effective min boundary, the cgroup’s memory won’t be reclaimed under any conditions. If the kernel cannot maintain at least memory.min bytes of memory for the processes within the cgroup, the kernel invokes its OOM killer. In other words, the kernel guarantees at least this much memory is available or terminates processes (which may be outside the cgroup) in order to make memory more available. Memory QoS maps memory.min to spec.containers[].resources.requests.memory to ensure the availability of memory for containers in Kubernetes Pods.

  • Memory limit

    The memory.limit specifies the memory limit, beyond which if the container tries to allocate more memory, Linux kernel will terminate a process with an OOM (Out of Memory) kill. If the terminated process was the main (or only) process inside the container, the container may exit.

    In cgroups v1, memory.limit_in_bytes interface is used to set the memory usage limit. However, unlike CPU, it was not possible to apply memory throttling: as soon as a container crossed the memory limit, it would be OOM killed.

    In cgroups v2, memory.max is analogous to memory.limit_in_bytes in cgroupv1. Memory QoS maps memory.max to spec.containers[].resources.limits.memory to specify the hard limit for memory usage. If the memory consumption goes above this level, the kernel invokes its OOM Killer.

    cgroups v2 also added memory.high configuration. Memory QoS uses memory.high to set memory usage throttle limit. If the memory.high limit is breached, the offending cgroups are throttled, and the kernel tries to reclaim memory which may avoid an OOM kill.

How it works

Cgroups v2 memory controller interfaces & Kubernetes container resources mapping

Memory QoS uses the memory controller of cgroups v2 to guarantee memory resources in Kubernetes. cgroupv2 interfaces that this feature uses are:

  • memory.max
  • memory.min
  • memory.high.
Memory QoS Levels

Memory QoS Levels

memory.max is mapped to limits.memory specified in the Pod spec. The kubelet and the container runtime configure the limit in the respective cgroup. The kernel enforces the limit to prevent the container from using more than the configured resource limit. If a process in a container tries to consume more than the specified limit, kernel terminates a process(es) with an Out of Memory (OOM) error.

memory.max maps to limits.memory

memory.max maps to limits.memory

memory.min is mapped to requests.memory, which results in reservation of memory resources that should never be reclaimed by the kernel. This is how Memory QoS ensures the availability of memory for Kubernetes pods. If there's no unprotected reclaimable memory available, the OOM killer is invoked to make more memory available.

memory.min maps to requests.memory

memory.min maps to requests.memory

For memory protection, in addition to the original way of limiting memory usage, Memory QoS throttles workload approaching its memory limit, ensuring that the system is not overwhelmed by sporadic increases in memory usage. A new field, memoryThrottlingFactor, is available in the KubeletConfiguration when you enable MemoryQoS feature. It is set to 0.9 by default. memory.high is mapped to throttling limit calculated by using memoryThrottlingFactor, requests.memory and limits.memory as in the formula below, and rounding down the value to the nearest page size:

memory.high formula

memory.high formula

Note:

If a container has no memory limits specified, limits.memory is substituted for node allocatable memory.

Summary:

FileDescription
memory.maxmemory.max specifies the maximum memory limit, a container is allowed to use. If a process within the container tries to consume more memory than the configured limit, the kernel terminates the process with an Out of Memory (OOM) error.

It is mapped to the container's memory limit specified in Pod manifest.
memory.minmemory.min specifies a minimum amount of memory the cgroups must always retain, i.e., memory that should never be reclaimed by the system. If there's no unprotected reclaimable memory available, OOM kill is invoked.

It is mapped to the container's memory request specified in the Pod manifest.
memory.highmemory.high specifies the memory usage throttle limit. This is the main mechanism to control a cgroup's memory use. If cgroups memory use goes over the high boundary specified here, the cgroups processes are throttled and put under heavy reclaim pressure.

Kubernetes uses a formula to calculate memory.high, depending on container's memory request, memory limit or node allocatable memory (if container's memory limit is empty) and a throttling factor. Please refer to the KEP for more details on the formula.

Note:

memory.high is set only on container level cgroups while memory.min is set on container, pod, and node level cgroups.

memory.min calculations for cgroups heirarchy

When container memory requests are made, kubelet passes memory.min to the back-end CRI runtime (such as containerd or CRI-O) via the Unified field in CRI during container creation. For every ith container in a pod, the memory.min in container level cgroups will be set to:

memory.min =  pod.spec.containers[i].resources.requests[memory]

Since the memory.min interface requires that the ancestor cgroups directories are all set, the pod and node cgroups directories need to be set correctly.

For every ith container in a pod, memory.min in pod level cgroup:

memory.min = \sum_{i=0}^{no. of pods}pod.spec.containers[i].resources.requests[memory]

For every jth container in every ith pod on a node, memory.min in node level cgroup:

memory.min = \sum_{i}^{no. of nodes}\sum_{j}^{no. of pods}pod[i].spec.containers[j].resources.requests[memory]

Kubelet will manage the cgroups hierarchy of the pod level and node level cgroups directly using the libcontainer library (from the runc project), while container cgroups limits are managed by the container runtime.

Support for Pod QoS classes

Based on user feedback for the Alpha feature in Kubernetes v1.22, some users would like to opt out of MemoryQoS on a per-pod basis to ensure there is no early memory throttling. Therefore, in Kubernetes v1.27 Memory QOS also supports memory.high to be set as per Quality of Service(QoS) for Pod classes. Following are the different cases for memory.high as per QOS classes:

  1. Guaranteed pods by their QoS definition require memory requests=memory limits and are not overcommitted. Hence MemoryQoS feature is disabled on those pods by not setting memory.high. This ensures that Guaranteed pods can fully use their memory requests up to their set limit, and not hit any throttling.

  2. Burstable pods by their QoS definition require at least one container in the Pod with CPU or memory request or limit set.

    • When requests.memory and limits.memory are set, the formula is used as-is:

      memory.high when requests and limits are set

      memory.high when requests and limits are set

    • When requests.memory is set and limits.memory is not set, limits.memory is substituted for node allocatable memory in the formula:

      memory.high when requests and limits are not set

      memory.high when requests and limits are not set

  3. BestEffort by their QoS definition do not require any memory or CPU limits or requests. For this case, kubernetes sets requests.memory = 0 and substitute limits.memory for node allocatable memory in the formula:

    memory.high for BestEffort Pod

    memory.high for BestEffort Pod

Summary: Only Pods in Burstable and BestEffort QoS classes will set memory.high. Guaranteed QoS pods do not set memory.high as their memory is guaranteed.

How do I use it?

The prerequisites for enabling Memory QoS feature on your Linux node are:

  1. Verify the requirements related to Kubernetes support for cgroups v2 are met.
  2. Ensure CRI Runtime supports Memory QoS. At the time of writing, only containerd and CRI-O provide support compatible with Memory QoS (alpha). This was implemented in the following PRs:

Memory QoS remains an alpha feature for Kubernetes v1.27. You can enable the feature by setting MemoryQoS=true in the kubelet configuration file:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true

How do I get involved?

Huge thank you to all the contributors who helped with the design, implementation, and review of this feature:

For those interested in getting involved in future discussions on Memory QoS feature, you can reach out SIG Node by several means:

Kubernetes 1.27: StatefulSet PVC Auto-Deletion (beta)

Kubernetes v1.27 graduated to beta a new policy mechanism for StatefulSets that controls the lifetime of their PersistentVolumeClaims (PVCs). The new PVC retention policy lets users specify if the PVCs generated from the StatefulSet spec template should be automatically deleted or retrained when the StatefulSet is deleted or replicas in the StatefulSet are scaled down.

What problem does this solve?

A StatefulSet spec can include Pod and PVC templates. When a replica is first created, the Kubernetes control plane creates a PVC for that replica if one does not already exist. The behavior before the PVC retention policy was that the control plane never cleaned up the PVCs created for StatefulSets - this was left up to the cluster administrator, or to some add-on automation that you’d have to find, check suitability, and deploy. The common pattern for managing PVCs, either manually or through tools such as Helm, is that the PVCs are tracked by the tool that manages them, with explicit lifecycle. Workflows that use StatefulSets must determine on their own what PVCs are created by a StatefulSet and what their lifecycle should be.

Before this new feature, when a StatefulSet-managed replica disappears, either because the StatefulSet is reducing its replica count, or because its StatefulSet is deleted, the PVC and its backing volume remains and must be manually deleted. While this behavior is appropriate when the data is critical, in many cases the persistent data in these PVCs is either temporary, or can be reconstructed from another source. In those cases, PVCs and their backing volumes remaining after their StatefulSet or replicas have been deleted are not necessary, incur cost, and require manual cleanup.

The new StatefulSet PVC retention policy

The new StatefulSet PVC retention policy is used to control if and when PVCs created from a StatefulSet’s volumeClaimTemplate are deleted. There are two contexts when this may occur.

The first context is when the StatefulSet resource is deleted (which implies that all replicas are also deleted). This is controlled by the whenDeleted policy. The second context, controlled by whenScaled is when the StatefulSet is scaled down, which removes some but not all of the replicas in a StatefulSet. In both cases the policy can either be Retain, where the corresponding PVCs are not touched, or Delete, which means that PVCs are deleted. The deletion is done with a normal object deletion, so that, for example, all retention policies for the underlying PV are respected.

This policy forms a matrix with four cases. I’ll walk through and give an example for each one.

  • whenDeleted and whenScaled are both Retain.

    This matches the existing behavior for StatefulSets, where no PVCs are deleted. This is also the default retention policy. It’s appropriate to use when data on StatefulSet volumes may be irreplaceable and should only be deleted manually.

  • whenDeleted is Delete and whenScaled is Retain.

    In this case, PVCs are deleted only when the entire StatefulSet is deleted. If the StatefulSet is scaled down, PVCs are not touched, meaning they are available to be reattached if a scale-up occurs with any data from the previous replica. This might be used for a temporary StatefulSet, such as in a CI instance or ETL pipeline, where the data on the StatefulSet is needed only during the lifetime of the StatefulSet lifetime, but while the task is running the data is not easily reconstructible. Any retained state is needed for any replicas that scale down and then up.

  • whenDeleted and whenScaled are both Delete.

    PVCs are deleted immediately when their replica is no longer needed. Note this does not include when a Pod is deleted and a new version rescheduled, for example when a node is drained and Pods need to migrate elsewhere. The PVC is deleted only when the replica is no longer needed as signified by a scale-down or StatefulSet deletion. This use case is for when data does not need to live beyond the life of its replica. Perhaps the data is easily reconstructable and the cost savings of deleting unused PVCs is more important than quick scale-up, or perhaps that when a new replica is created, any data from a previous replica is not usable and must be reconstructed anyway.

  • whenDeleted is Retain and whenScaled is Delete.

    This is similar to the previous case, when there is little benefit to keeping PVCs for fast reuse during scale-up. An example of a situation where you might use this is an Elasticsearch cluster. Typically you would scale that workload up and down to match demand, whilst ensuring a minimum number of replicas (for example: 3). When scaling down, data is migrated away from removed replicas and there is no benefit to retaining those PVCs. However, it can be useful to bring the entire Elasticsearch cluster down temporarily for maintenance. If you need to take the Elasticsearch system offline, you can do this by temporarily deleting the StatefulSet, and then bringing the Elasticsearch cluster back by recreating the StatefulSet. The PVCs holding the Elasticsearch data will still exist and the new replicas will automatically use them.

Visit the documentation to see all the details.

What’s next?

Try it out! The StatefulSetAutoDeletePVC feature gate is beta and enabled by default on cluster running Kubernetes 1.27. Create a StatefulSet using the new policy, test it out and tell us what you think!

I'm very curious to see if this owner reference mechanism works well in practice. For example, I realized there is no mechanism in Kubernetes for knowing who set a reference, so it’s possible that the StatefulSet controller may fight with custom controllers that set their own references. Fortunately, maintaining the existing retention behavior does not involve any new owner references, so default behavior will be compatible.

Please tag any issues you report with the label sig/apps and assign them to Matthew Cary (@mattcary at GitHub).

Enjoy!

Kubernetes 1.27: HorizontalPodAutoscaler ContainerResource type metric moves to beta

Kubernetes 1.20 introduced the ContainerResource type metric in HorizontalPodAutoscaler (HPA).

In Kubernetes 1.27, this feature moves to beta and the corresponding feature gate (HPAContainerMetrics) gets enabled by default.

What is the ContainerResource type metric

The ContainerResource type metric allows us to configure the autoscaling based on resource usage of individual containers.

In the following example, the HPA controller scales the target so that the average utilization of the cpu in the application container of all the pods is around 60%. (See the algorithm details to know how the desired replica number is calculated exactly)

type: ContainerResource
containerResource:
  name: cpu
  container: application
  target:
    type: Utilization
    averageUtilization: 60

The difference from the Resource type metric

HPA already had a Resource type metric.

You can define the target resource utilization like the following, and then HPA will scale up/down the replicas based on the current utilization.

type: Resource
resource:
  name: cpu
  target:
    type: Utilization
    averageUtilization: 60

But, this Resource type metric refers to the average utilization of the Pods.

In case a Pod has multiple containers, the utilization calculation would be:

sum{the resource usage of each container} / sum{the resource request of each container}

The resource utilization of each container may not have a direct correlation or may grow at different rates as the load changes.

For example:

  • A sidecar container is only providing an auxiliary service such as log shipping. If the application does not log very frequently or does not produce logs in its hotpath then the usage of the log shipper will not grow.
  • A sidecar container which provides authentication. Due to heavy caching the usage will only increase slightly when the load on the main container increases. In the current blended usage calculation approach this usually results in the HPA not scaling up the deployment because the blended usage is still low.
  • A sidecar may be injected without resources set which prevents scaling based on utilization. In the current logic the HPA controller can only scale on absolute resource usage of the pod when the resource requests are not set.

And, in such case, if only one container's resource utilization goes high, the Resource type metric may not suggest scaling up.

So, for the accurate autoscaling, you may want to use the ContainerResource type metric for such Pods instead.

What's new for the beta?

For Kubernetes v1.27, the ContainerResource type metric is available by default as described at the beginning of this article. (You can still disable it by the HPAContainerMetrics feature gate.)

Also, we've improved the observability of HPA controller by exposing some metrics from the kube-controller-manager:

  • metric_computation_total: Number of metric computations.
  • metric_computation_duration_seconds: The time that the HPA controller takes to calculate one metric.
  • reconciliations_total: Number of reconciliation of HPA controller.
  • reconciliation_duration_seconds: The time that the HPA controller takes to reconcile a HPA object once.

These metrics have labels action (scale_up, scale_down, none) and error (spec, internal, none). And, in addition to them, the first two metrics have the metric_type label which corresponds to .spec.metrics[*].type for a HorizontalPodAutoscaler.

All metrics are useful for general monitoring of HPA controller, you can get deeper insight into which part has a problem, where it takes time, how much scaling tends to happen at which time on your cluster etc.

Another minor stuff, we've changed the SuccessfulRescale event's messages so that everyone can check whether the events came from the resource metric or the container resource metric (See the related PR).

Getting involved

This feature is managed by SIG Autoscaling. Please join us and share your feedback. We look forward to hearing from you!

How can I learn more?

Kubernetes 1.27: StatefulSet Start Ordinal Simplifies Migration

Kubernetes v1.26 introduced a new, alpha-level feature for StatefulSets that controls the ordinal numbering of Pod replicas. As of Kubernetes v1.27, this feature is now beta. Ordinals can start from arbitrary non-negative numbers. This blog post will discuss how this feature can be used.

Background

StatefulSets ordinals provide sequential identities for pod replicas. When using OrderedReady Pod management Pods are created from ordinal index 0 up to N-1.

With Kubernetes today, orchestrating a StatefulSet migration across clusters is challenging. Backup and restore solutions exist, but these require the application to be scaled down to zero replicas prior to migration. In today's fully connected world, even planned application downtime may not allow you to meet your business goals. You could use Cascading Delete or On Delete to migrate individual pods, however this is error prone and tedious to manage. You lose the self-healing benefit of the StatefulSet controller when your Pods fail or are evicted.

Kubernetes v1.26 enables a StatefulSet to be responsible for a range of ordinals within a range {0..N-1} (the ordinals 0, 1, ... up to N-1). With it, you can scale down a range {0..k-1} in a source cluster, and scale up the complementary range {k..N-1} in a destination cluster, while maintaining application availability. This enables you to retain at most one semantics (meaning there is at most one Pod with a given identity running in a StatefulSet) and Rolling Update behavior when orchestrating a migration across clusters.

Why would I want to use this feature?

Say you're running your StatefulSet in one cluster, and need to migrate it out to a different cluster. There are many reasons why you would need to do this:

  • Scalability: Your StatefulSet has scaled too large for your cluster, and has started to disrupt the quality of service for other workloads in your cluster.
  • Isolation: You're running a StatefulSet in a cluster that is accessed by multiple users, and namespace isolation isn't sufficient.
  • Cluster Configuration: You want to move your StatefulSet to a different cluster to use some environment that is not available on your current cluster.
  • Control Plane Upgrades: You want to move your StatefulSet to a cluster running an upgraded control plane, and can't handle the risk or downtime of in-place control plane upgrades.

How do I use it?

Enable the StatefulSetStartOrdinal feature gate on a cluster, and create a StatefulSet with a customized .spec.ordinals.start.

Try it out

In this demo, I'll use the new mechanism to migrate a StatefulSet from one Kubernetes cluster to another. The redis-cluster Bitnami Helm chart will be used to install Redis.

Tools Required:

Pre-requisites

To do this, I need two Kubernetes clusters that can both access common networking and storage; I've named my clusters source and destination. Specifically, I need:

  • The StatefulSetStartOrdinal feature gate enabled on both clusters.
  • Client configuration for kubectl that lets me access both clusters as an administrator.
  • The same StorageClass installed on both clusters, and set as the default StorageClass for both clusters. This StorageClass should provision underlying storage that is accessible from either or both clusters.
  • A flat network topology that allows for pods to send and receive packets to and from Pods in either clusters. If you are creating clusters on a cloud provider, this configuration may be called private cloud or private network.
  1. Create a demo namespace on both clusters:

    kubectl create ns kep-3335
    
  2. Deploy a Redis cluster with six replicas in the source cluster:

    helm repo add bitnami https://charts.bitnami.com/bitnami
    helm install redis --namespace kep-3335 \
      bitnami/redis-cluster \
      --set persistence.size=1Gi \
      --set cluster.nodes=6
    
  3. Check the replication status in the source cluster:

    kubectl exec -it redis-redis-cluster-0 -- /bin/bash -c \
      "redis-cli -c -h redis-redis-cluster -a $(kubectl get secret redis-redis-cluster -o jsonpath="{.data.redis-password}" | base64 -d) CLUSTER NODES;"
    
    2ce30362c188aabc06f3eee5d92892d95b1da5c3 10.104.0.14:6379@16379 myself,master - 0 1669764411000 3 connected 10923-16383                                                                                                                                              
    7743661f60b6b17b5c71d083260419588b4f2451 10.104.0.16:6379@16379 slave 2ce30362c188aabc06f3eee5d92892d95b1da5c3 0 1669764410000 3 connected                                                                                             
    961f35e37c4eea507cfe12f96e3bfd694b9c21d4 10.104.0.18:6379@16379 slave a8765caed08f3e185cef22bd09edf409dc2bcc61 0 1669764411000 1 connected                                                                                                             
    7136e37d8864db983f334b85d2b094be47c830e5 10.104.0.15:6379@16379 slave 2cff613d763b22c180cd40668da8e452edef3fc8 0 1669764412595 2 connected                                                                                                                    
    a8765caed08f3e185cef22bd09edf409dc2bcc61 10.104.0.19:6379@16379 master - 0 1669764411592 1 connected 0-5460                                                                                                                                                   
    2cff613d763b22c180cd40668da8e452edef3fc8 10.104.0.17:6379@16379 master - 0 1669764410000 2 connected 5461-10922
    
  4. Deploy a Redis cluster with zero replicas in the destination cluster:

    helm install redis --namespace kep-3335 \
      bitnami/redis-cluster \
      --set persistence.size=1Gi \
      --set cluster.nodes=0 \
      --set redis.extraEnvVars\[0\].name=REDIS_NODES,redis.extraEnvVars\[0\].value="redis-redis-cluster-headless.kep-3335.svc.cluster.local" \
      --set existingSecret=redis-redis-cluster
    
  5. Scale down the redis-redis-cluster StatefulSet in the source cluster by 1, to remove the replica redis-redis-cluster-5:

    kubectl patch sts redis-redis-cluster -p '{"spec": {"replicas": 5}}'
    
  6. Migrate dependencies from the source cluster to the destination cluster:

    The following commands copy resources from source to destionation. Details that are not relevant in destination cluster are removed (eg: uid, resourceVersion, status).

    Steps for the source cluster

    Note: If using a StorageClass with reclaimPolicy: Delete configured, you should patch the PVs in source with reclaimPolicy: Retain prior to deletion to retain the underlying storage used in destination. See Change the Reclaim Policy of a PersistentVolume for more details.

    kubectl get pvc redis-data-redis-redis-cluster-5 -o yaml | yq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.annotations, .metadata.finalizers, .status)' > /tmp/pvc-redis-data-redis-redis-cluster-5.yaml
    kubectl get pv $(yq '.spec.volumeName' /tmp/pvc-redis-data-redis-redis-cluster-5.yaml) -o yaml | yq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.annotations, .metadata.finalizers, .spec.claimRef, .status)' > /tmp/pv-redis-data-redis-redis-cluster-5.yaml
    kubectl get secret redis-redis-cluster -o yaml | yq 'del(.metadata.uid, .metadata.resourceVersion)' > /tmp/secret-redis-redis-cluster.yaml
    

    Steps for the destination cluster

    Note: For the PV/PVC, this procedure only works if the underlying storage system that your PVs use can support being copied into destination. Storage that is associated with a specific node or topology may not be supported. Additionally, some storage systems may store addtional metadata about volumes outside of a PV object, and may require a more specialized sequence to import a volume.

    kubectl create -f /tmp/pv-redis-data-redis-redis-cluster-5.yaml
    kubectl create -f /tmp/pvc-redis-data-redis-redis-cluster-5.yaml
    kubectl create -f /tmp/secret-redis-redis-cluster.yaml
    
  7. Scale up the redis-redis-cluster StatefulSet in the destination cluster by 1, with a start ordinal of 5:

    kubectl patch sts redis-redis-cluster -p '{"spec": {"ordinals": {"start": 5}, "replicas": 1}}'
    
  8. Check the replication status in the destination cluster:

    kubectl exec -it redis-redis-cluster-5 -- /bin/bash -c \
      "redis-cli -c -h redis-redis-cluster -a $(kubectl get secret redis-redis-cluster -o jsonpath="{.data.redis-password}" | base64 -d) CLUSTER NODES;"
    

    I should see that the new replica (labeled myself) has joined the Redis cluster (the IP address belongs to a different CIDR block than the replicas in the source cluster).

    2cff613d763b22c180cd40668da8e452edef3fc8 10.104.0.17:6379@16379 master - 0 1669766684000 2 connected 5461-10922
    7136e37d8864db983f334b85d2b094be47c830e5 10.108.0.22:6379@16379 myself,slave 2cff613d763b22c180cd40668da8e452edef3fc8 0 1669766685609 2 connected
    2ce30362c188aabc06f3eee5d92892d95b1da5c3 10.104.0.14:6379@16379 master - 0 1669766684000 3 connected 10923-16383
    961f35e37c4eea507cfe12f96e3bfd694b9c21d4 10.104.0.18:6379@16379 slave a8765caed08f3e185cef22bd09edf409dc2bcc61 0 1669766683600 1 connected
    a8765caed08f3e185cef22bd09edf409dc2bcc61 10.104.0.19:6379@16379 master - 0 1669766685000 1 connected 0-5460
    7743661f60b6b17b5c71d083260419588b4f2451 10.104.0.16:6379@16379 slave 2ce30362c188aabc06f3eee5d92892d95b1da5c3 0 1669766686613 3 connected
    
  9. Repeat steps #5 to #7 for the remainder of the replicas, until the Redis StatefulSet in the source cluster is scaled to 0, and the Redis StatefulSet in the destination cluster is healthy with 6 total replicas.

What's Next?

This feature provides a building block for a StatefulSet to be split up across clusters, but does not prescribe the mechanism as to how the StatefulSet should be migrated. Migration requires coordination of StatefulSet replicas, along with orchestration of the storage and network layer. This is dependent on the storage and connectivity requirements of the application installed by the StatefulSet. Additionally, many StatefulSets are managed by operators, which adds another layer of complexity to migration.

If you're interested in building enhancements to make these processes easier, get involved with SIG Multicluster to contribute!

Updates to the Auto-refreshing Official CVE Feed

Since launching the Auto-refreshing Official CVE feed as an alpha feature in the 1.25 release, we have made significant improvements and updates. We are excited to announce the release of the beta version of the feed. This blog post will outline the feedback received, the changes made, and talk about how you can help as we prepare to make this a stable feature in a future Kubernetes Release.

Feedback from end-users

SIG Security received some feedback from end-users:

Summary of changes

In response, the SIG did a rework of the script generating the JSON feed to comply with the JSON Feed specification from generation and add a last_updated root field to indicate overall freshness. This redesign needed a corresponding fix on the Kubernetes website side for the CVE feed page to continue to work with the new format.

After that, RSS feed support could be added transparently so that end-users can consume the feed in their preferred format.

Overall, the redesign based on the JSON Feed specification, which this time broke backward compatibility, will allow updates in the future to address the rest of the issue while being more transparent and less disruptive to end-users.

Updates

TitleIssueStatus
CVE Feed: JSON feed should pass jsonfeed spec validatorkubernetes/webite#36808closed, addressed by kubernetes/sig-security#76
CVE Feed: Add lastUpdatedAt as a metadata fieldkubernetes/sig-security#72closed, addressed by kubernetes/sig-security#76
Support RSS feeds by generating data in Atom formatkubernetes/sig-security#77closed, addressed by kubernetes/website#39513
CVE Feed: Sort Markdown Table from most recent to least recently announced CVEkubernetes/sig-security#73closed, addressed by kubernetes/sig-security#76
CVE Feed: Include a timestamp field for each CVE indicating when it was last updatedkubernetes/sig-security#63closed, addressed by kubernetes/sig-security#76
CVE Feed: Add Prow job link as a metadata fieldkubernetes/sig-security#71closed, addressed by kubernetes/sig-security#83

What's next?

In preparation to graduate the feed to stable i.e. General Availability stage, SIG Security is still gathering feedback from end users who are using the updated beta feed.

To help us continue to improve the feed in future Kubernetes Releases please share feedback by adding a comment to this tracking issue or let us know on #sig-security-tooling Kubernetes Slack channel, join Kubernetes Slack here.

Kubernetes 1.27: Server Side Field Validation and OpenAPI V3 move to GA

Before Kubernetes v1.8 (!), typos, mis-indentations or minor errors in YAMLs could have catastrophic consequences (e.g. a typo like forgetting the trailing s in replica: 1000 could cause an outage, because the value would be ignored and missing, forcing a reset of replicas back to 1). This was solved back then by fetching the OpenAPI v2 in kubectl and using it to verify that fields were correct and present before applying. Unfortunately, at that time, Custom Resource Definitions didn’t exist, and the code was written under that assumption. When CRDs were later introduced, the lack of flexibility in the validation code forced some hard decisions in the way CRDs exposed their schema, leaving us in a cycle of bad validation causing bad OpenAPI and vice-versa. With the new OpenAPI v3 and Server Field Validation being GA in 1.27, we’ve now solved both of these problems.

Server Side Field Validation offers resource validation on create, update and patch requests to the apiserver and was added to Kubernetes in v1.25, beta in v1.26 and is now GA in v1.27. It provides all the functionality of kubectl validate on the server side.

OpenAPI is a standard, language agnostic interface for discovering the set of operations and types that a Kubernetes cluster supports. OpenAPI V3 is the latest standard of the OpenAPI and is an improvement upon OpenAPI V2 which has been supported since Kubernetes 1.5. OpenAPI V3 support was added in Kubernetes in v1.23, moved to beta in v1.24 and is now GA in v1.27.

OpenAPI V3

What does OpenAPI V3 offer over V2

Built-in types

Kubernetes offers certain annotations on fields that are not representable in OpenAPI V2, or sometimes not represented in the OpenAPI v2 that Kubernetes generate. Most notably, the "default" field is published in OpenAPI V3 while omitted in OpenAPI V2. A single type that can represent multiple types is also expressed correctly in OpenAPI V3 with the oneOf field. This includes proper representations for IntOrString and Quantity.

Custom Resource Definitions

In Kubernetes, Custom Resource Definitions use a structural OpenAPI V3 schema that cannot be represented as OpenAPI V2 without a loss of certain fields. Some of these include nullable, default, anyOf, oneOf, not, etc. OpenAPI V3 is a completely lossless representation of the CustomResourceDefinition structural schema.

How do I use it?

The OpenAPI V3 root discovery can be found at the /openapi/v3 endpoint of a Kubernetes API server. OpenAPI V3 documents are grouped by group-version to reduce the size of the data transported, the separate documents can be accessed at /openapi/v3/apis/<group>/<version> and /openapi/v3/api/v1 representing the legacy group version. Please refer to the Kubernetes API Documentation for more information around this endpoint.

Various consumers of the OpenAPI have already been updated to consume v3, including the entirety of kubectl, and server side apply. An OpenAPI V3 Golang client is available in client-go.

Server Side Field Validation

The query parameter fieldValidation may be used to indicate the level of field validation the server should perform. If the parameter is not passed, server side field validation is in Warn mode by default.

  • Strict: Strict field validation, errors on validation failure
  • Warn: Field validation is performed, but errors are exposed as warnings rather than failing the request
  • Ignore: No server side field validation is performed

kubectl will skip client side validation and will automatically use server side field validation in Strict mode. Controllers by default use server side field validation in Warn mode.

With client side validation, we had to be extra lenient because some fields were missing from OpenAPI V2 and we didn’t want to reject possibly valid objects. This is all fixed in server side validation. Additional documentation may be found here

What's next?

With Server Side Field Validation and OpenAPI V3 released as GA, we introduce more accurate representations of Kubernetes resources. It is recommended to use server side field validation over client side, but with OpenAPI V3, clients are free to implement their own validation if necessary (to “shift things left”) and we guarantee a full lossless schema published by OpenAPI.

Some existing efforts will further improve the information available through OpenAPI including CEL validation and admission, along with OpenAPI annotations on built-in types.

Many other tools can be built for authoring and transforming resources using the type information found in the OpenAPI v3.

How to get involved?

These two features are driven by the SIG API Machinery community, available on the slack channel #sig-api-machinery, through the mailing list and we meet every other Wednesday at 11:00 AM PT on Zoom.

We offer a huge thanks to all the contributors who helped design, implement, and review these two features.

  • Alexander Zielenski
  • Antoine Pelisse
  • Daniel Smith
  • David Eads
  • Jeffrey Ying
  • Jordan Liggitt
  • Kevin Delgado
  • Sean Sullivan

Kubernetes 1.27: Query Node Logs Using The Kubelet API

Kubernetes 1.27 introduced a new feature called Node log query that allows viewing logs of services running on the node.

What problem does it solve?

Cluster administrators face issues when debugging malfunctioning services running on the node. They usually have to SSH or RDP into the node to view the logs of the service to debug the issue. The Node log query feature helps with this scenario by allowing the cluster administrator to view the logs using kubectl. This is especially useful with Windows nodes where you run into the issue of the node going to the ready state but containers not coming up due to CNI misconfigurations and other issues that are not easily identifiable by looking at the Pod status.

How does it work?

The kubelet already has a /var/log/ viewer that is accessible via the node proxy endpoint. The feature supplements this endpoint with a shim that shells out to journalctl, on Linux nodes, and the Get-WinEvent cmdlet on Windows nodes. It then uses the existing filters provided by the commands to allow filtering the logs. The kubelet also uses heuristics to retrieve the logs. If the user is not aware if a given system services logs to a file or to the native system logger, the heuristics first checks the native operating system logger and if that is not available it attempts to retrieve the first logs from /var/log/<servicename> or /var/log/<servicename>.log or /var/log/<servicename>/<servicename>.log.

On Linux we assume that service logs are available via journald, and that journalctl is installed. On Windows we assume that service logs are available in the application log provider. Also note that fetching node logs is only available if you are authorized to do so (in RBAC, that's get and create access to nodes/proxy).

Warning:

Granting permissions to nodes/proxy (even just get permission) also authorizes access to powerful kubelet APIs that can be used to execute commands in any container running on the node, so be careful about how you manage them. See Kubelet authentication/authorization for more information.

(this warning message was added to the article in early 2026)

How do I use it?

To use the feature, ensure that the NodeLogQuery feature gate is enabled for that node, and that the kubelet configuration options enableSystemLogHandler and enableSystemLogQuery are both set to true. You can then query the logs from all your nodes or just a subset. Here is an example to retrieve the kubelet service logs from a node:

# Fetch kubelet logs from a node named node-1.example
kubectl get --raw "/api/v1/nodes/node-1.example/proxy/logs/?query=kubelet"

You can further filter the query to narrow down the results:

# Fetch kubelet logs from a node named node-1.example that have the word "error"
kubectl get --raw "/api/v1/nodes/node-1.example/proxy/logs/?query=kubelet&pattern=error"

You can also fetch files from /var/log/ on a Linux node:

kubectl get --raw "/api/v1/nodes/<insert-node-name-here>/proxy/logs/?query=/<insert-log-file-name-here>"

You can read the documentation for all the available options.

How do I help?

Please use the feature and provide feedback by opening GitHub issues or reaching out to us on the #sig-windows channel on the Kubernetes Slack or the SIG Windows mailing list.

Kubernetes 1.27: Single Pod Access Mode for PersistentVolumes Graduates to Beta

With the release of Kubernetes v1.27 the ReadWriteOncePod feature has graduated to beta. In this blog post, we'll take a closer look at this feature, what it does, and how it has evolved in the beta release.

What is ReadWriteOncePod?

ReadWriteOncePod is a new access mode for PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs) introduced in Kubernetes v1.22. This access mode enables you to restrict volume access to a single pod in the cluster, ensuring that only one pod can write to the volume at a time. This can be particularly useful for stateful workloads that require single-writer access to storage.

For more context on access modes and how ReadWriteOncePod works read What are access modes and why are they important? in the Introducing Single Pod Access Mode for PersistentVolumes article from 2021.

Changes in the ReadWriteOncePod beta

The ReadWriteOncePod beta adds support for scheduler preemption of pods using ReadWriteOncePod PVCs.

Scheduler preemption allows higher-priority pods to preempt lower-priority pods, so that they can start running on the same node. With this release, pods using ReadWriteOncePod PVCs can also be preempted if a higher-priority pod requires the same PVC.

How can I start using ReadWriteOncePod?

With ReadWriteOncePod now in beta, it will be enabled by default in cluster versions v1.27 and beyond.

Note that ReadWriteOncePod is only supported for CSI volumes. Before using this feature you will need to update the following CSI sidecars to these versions or greater:

To start using ReadWriteOncePod, create a PVC with the ReadWriteOncePod access mode:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: single-writer-only
spec:
  accessModes:
  - ReadWriteOncePod # Allow only a single pod to access single-writer-only.
  resources:
    requests:
      storage: 1Gi

If your storage plugin supports dynamic provisioning, new PersistentVolumes will be created with the ReadWriteOncePod access mode applied.

Read Migrating existing PersistentVolumes for details on migrating existing volumes to use ReadWriteOncePod.

How can I learn more?

Please see the alpha blog post and KEP-2485 for more details on the ReadWriteOncePod access mode and motivations for CSI spec changes.

How do I get involved?

The Kubernetes #csi Slack channel and any of the standard SIG Storage communication channels are great mediums to reach out to the SIG Storage and the CSI teams.

Special thanks to the following people whose thoughtful reviews and feedback helped shape this feature:

  • Abdullah Gharaibeh (ahg-g)
  • Aldo Culquicondor (alculquicondor)
  • Antonio Ojea (aojea)
  • David Eads (deads2k)
  • Jan Šafránek (jsafrane)
  • Joe Betz (jpbetz)
  • Kante Yin (kerthcet)
  • Michelle Au (msau42)
  • Tim Bannister (sftim)
  • Xing Yang (xing-yang)

If you’re interested in getting involved with the design and development of CSI or any part of the Kubernetes storage system, join the Kubernetes Storage Special Interest Group (SIG). We’re rapidly growing and always welcome new contributors.

Kubernetes 1.27: Efficient SELinux volume relabeling (Beta)

The problem

On Linux with Security-Enhanced Linux (SELinux) enabled, it's traditionally the container runtime that applies SELinux labels to a Pod and all its volumes. Kubernetes only passes the SELinux label from a Pod's securityContext fields to the container runtime.

The container runtime then recursively changes SELinux label on all files that are visible to the Pod's containers. This can be time-consuming if there are many files on the volume, especially when the volume is on a remote filesystem.

If a Pod does not have any SELinux label assigned in Kubernetes API, the container runtime assigns a unique random one, so a process that potentially escapes the container boundary cannot access data of any other container on the host. The container runtime still recursively relabels all pod volumes with this random SELinux label.

Improvement using mount options

If a Pod and its volume meet all of the following conditions, Kubernetes will mount the volume directly with the right SELinux label. Such mount will happen in a constant time and the container runtime will not need to recursively relabel any files on it.

  1. The operating system must support SELinux.

    Without SELinux support detected, kubelet and the container runtime do not do anything with regard to SELinux.

  2. The feature gates ReadWriteOncePod and SELinuxMountReadWriteOncePod must be enabled. These feature gates are Beta in Kubernetes 1.27 and Alpha in 1.25.

    With any of these feature gates disabled, SELinux labels will be always applied by the container runtime by a recursive walk through the volume (or its subPaths).

  3. The Pod must have at least seLinuxOptions.level assigned in its Pod Security Context or all Pod containers must have it set in their Security Contexts. Kubernetes will read the default user, role and type from the operating system defaults (typically system_u, system_r and container_t).

    Without Kubernetes knowing at least the SELinux level, the container runtime will assign a random one after the volumes are mounted. The container runtime will still relabel the volumes recursively in that case.

  4. The volume must be a Persistent Volume with Access Mode ReadWriteOncePod.

    This is a limitation of the initial implementation. As described above, two Pods can have a different SELinux label and still use the same volume, as long as they use a different subPath of it. This use case is not possible when the volumes are mounted with the SELinux label, because the whole volume is mounted and most filesystems don't support mounting a single volume multiple times with multiple SELinux labels.

    If running two Pods with two different SELinux contexts and using different subPaths of the same volume is necessary in your deployments, please comment in the KEP issue (or upvote any existing comment - it's best not to duplicate). Such pods may not run when the feature is extended to cover all volume access modes.

  5. The volume plugin or the CSI driver responsible for the volume supports mounting with SELinux mount options.

    These in-tree volume plugins support mounting with SELinux mount options: fc, iscsi, and rbd.

    CSI drivers that support mounting with SELinux mount options must announce that in their CSIDriver instance by setting seLinuxMount field.

    Volumes managed by other volume plugins or CSI drivers that don't set seLinuxMount: true will be recursively relabelled by the container runtime.

Mounting with SELinux context

When all aforementioned conditions are met, kubelet will pass -o context=<SELinux label> mount option to the volume plugin or CSI driver. CSI driver vendors must ensure that this mount option is supported by their CSI driver and, if necessary, the CSI driver appends other mount options that are needed for -o context to work.

For example, NFS may need -o context=<SELinux label>,nosharecache, so each volume mounted from the same NFS server can have a different SELinux label value. Similarly, CIFS may need -o context=<SELinux label>,nosharesock.

It's up to the CSI driver vendor to test their CSI driver in a SELinux enabled environment before setting seLinuxMount: true in the CSIDriver instance.

How can I learn more?

SELinux in containers: see excellent visual SELinux guide by Daniel J Walsh. Note that the guide is older than Kubernetes, it describes Multi-Category Security (MCS) mode using virtual machines as an example, however, a similar concept is used for containers.

See a series of blog posts for details how exactly SELinux is applied to containers by container runtimes:

Read the KEP: Speed up SELinux volume relabeling using mounts

Kubernetes 1.27: More fine-grained pod topology spread policies reached beta

In Kubernetes v1.19, Pod topology spread constraints went to general availability (GA).

As time passed, we - SIG Scheduling - received feedback from users, and, as a result, we're actively working on improving the Topology Spread feature via three KEPs. All of these features have reached beta in Kubernetes v1.27 and are enabled by default.

This blog post introduces each feature and the use case behind each of them.

KEP-3022: min domains in Pod Topology Spread

Pod Topology Spread has the maxSkew parameter to define the degree to which Pods may be unevenly distributed.

But, there wasn't a way to control the number of domains over which we should spread. Some users want to force spreading Pods over a minimum number of domains, and if there aren't enough already present, make the cluster-autoscaler provision them.

Kubernetes v1.24 introduced the minDomains parameter for pod topology spread constraints, as an alpha feature. Via minDomains parameter, you can define the minimum number of domains.

For example, assume there are 3 Nodes with the enough capacity, and a newly created ReplicaSet has the following topologySpreadConstraints in its Pod template.

...
topologySpreadConstraints:
- maxSkew: 1
  minDomains: 5 # requires 5 Nodes at least (because each Node has a unique hostname).
  whenUnsatisfiable: DoNotSchedule # minDomains is valid only when DoNotSchedule is used.
  topologyKey: kubernetes.io/hostname
  labelSelector:
    matchLabels:
        foo: bar

In this case, 3 Pods will be scheduled to those 3 Nodes, but other 2 Pods from this replicaset will be unschedulable until more Nodes join the cluster.

You can imagine that the cluster autoscaler provisions new Nodes based on these unschedulable Pods, and as a result, the replicas are finally spread over 5 Nodes.

KEP-3094: Take taints/tolerations into consideration when calculating podTopologySpread skew

Before this enhancement, when you deploy a pod with podTopologySpread configured, kube-scheduler would take the Nodes that satisfy the Pod's nodeAffinity and nodeSelector into consideration in filtering and scoring, but would not care about whether the node taints are tolerated by the incoming pod or not. This may lead to a node with untolerated taint as the only candidate for spreading, and as a result, the pod will stuck in Pending if it doesn't tolerate the taint.

To allow more fine-gained decisions about which Nodes to account for when calculating spreading skew, Kubernetes 1.25 introduced two new fields within topologySpreadConstraints to define node inclusion policies: nodeAffinityPolicy and nodeTaintPolicy.

A manifest that applies these policies looks like the following:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  # Configure a topology spread constraint
  topologySpreadConstraints:
    - maxSkew: <integer>
      # ...
      nodeAffinityPolicy: [Honor|Ignore]
      nodeTaintsPolicy: [Honor|Ignore]
  # other Pod fields go here

The nodeAffinityPolicy field indicates how Kubernetes treats a Pod's nodeAffinity or nodeSelector for pod topology spreading. If Honor, kube-scheduler filters out nodes not matching nodeAffinity/nodeSelector in the calculation of spreading skew. If Ignore, all nodes will be included, regardless of whether they match the Pod's nodeAffinity/nodeSelector or not.

For backwards compatibility, nodeAffinityPolicy defaults to Honor.

The nodeTaintsPolicy field defines how Kubernetes considers node taints for pod topology spreading. If Honor, only tainted nodes for which the incoming pod has a toleration, will be included in the calculation of spreading skew. If Ignore, kube-scheduler will not consider the node taints at all in the calculation of spreading skew, so a node with pod untolerated taint will also be included.

For backwards compatibility, nodeTaintsPolicy defaults to Ignore.

The feature was introduced in v1.25 as alpha. By default, it was disabled, so if you want to use this feature in v1.25, you had to explictly enable the feature gate NodeInclusionPolicyInPodTopologySpread. In the following v1.26 release, that associated feature graduated to beta and is enabled by default.

KEP-3243: Respect Pod topology spread after rolling upgrades

Pod Topology Spread uses the field labelSelector to identify the group of pods over which spreading will be calculated. When using topology spreading with Deployments, it is common practice to use the labelSelector of the Deployment as the labelSelector in the topology spread constraints. However, this implies that all pods of a Deployment are part of the spreading calculation, regardless of whether they belong to different revisions. As a result, when a new revision is rolled out, spreading will apply across pods from both the old and new ReplicaSets, and so by the time the new ReplicaSet is completely rolled out and the old one is rolled back, the actual spreading we are left with may not match expectations because the deleted pods from the older ReplicaSet will cause skewed distribution for the remaining pods. To avoid this problem, in the past users needed to add a revision label to Deployment and update it manually at each rolling upgrade (both the label on the pod template and the labelSelector in the topologySpreadConstraints).

To solve this problem with a simpler API, Kubernetes v1.25 introduced a new field named matchLabelKeys to topologySpreadConstraints. matchLabelKeys is a list of pod label keys to select the pods over which spreading will be calculated. The keys are used to lookup values from the labels of the Pod being scheduled, those key-value labels are ANDed with labelSelector to select the group of existing pods over which spreading will be calculated for the incoming pod.

With matchLabelKeys, you don't need to update the pod.spec between different revisions. The controller or operator managing rollouts just needs to set different values to the same label key for different revisions. The scheduler will assume the values automatically based on matchLabelKeys. For example, if you are configuring a Deployment, you can use the label keyed with pod-template-hash, which is added automatically by the Deployment controller, to distinguish between different revisions in a single Deployment.

topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: foo
      matchLabelKeys:
        - pod-template-hash

Getting involved

These features are managed by Kubernetes SIG Scheduling.

Please join us and share your feedback. We look forward to hearing from you!

How can I learn more?

Kubernetes v1.27: Chill Vibes

Announcing the release of Kubernetes v1.27, the first release of 2023!

This release consist of 60 enhancements. 18 of those enhancements are entering Alpha, 29 are graduating to Beta, and 13 are graduating to Stable.

Kubernetes v1.27: Chill Vibes

The theme for Kubernetes v1.27 is Chill Vibes.

It's a little silly, but there were some important shifts in this release that helped inspire the theme. Throughout a typical Kubernetes release cycle, there are several deadlines that features need to meet to remain included. If a feature misses any of these deadlines, there is an exception process they can go through. Handling these exceptions is a very normal part of the release. But v1.27 is the first release that anyone can remember where we didn't receive a single exception request after the enhancements freeze. Even as the release progressed, things remained much calmer than any of us are used to.

There's a specific reason we were able to enjoy a more calm release this time around, and that's all the work that folks put in behind the scenes to improve how we manage the release. That's what this theme celebrates, people putting in the work to make things better for the community.

Special thanks to Britnee Laverack for creating the logo. Britnee also designed the logo for Kubernetes 1.24: Stargazer.

What's New (Major Themes)

Freeze k8s.gcr.io image registry

Replacing the old image registry, k8s.gcr.io with registry.k8s.io which has been generally available for several months. The Kubernetes project created and runs the registry.k8s.io image registry, which is fully controlled by the community. This means that the old registry k8s.gcr.io will be frozen and no further images for Kubernetes and related sub-projects will be published to the old registry.

What does this change mean for contributors?

  • If you are a maintainer of a sub-project, you will need to update your manifests and Helm charts to use the new registry. For more information, checkout this project.

What does this change mean for end users?

  • Kubernetes v1.27 release will not be published to the k8s.gcr.io registry.

  • Patch releases for v1.24, v1.25, and v1.26 will no longer be published to the old registry after April.

  • Starting in v1.25, the default image registry has been set to registry.k8s.io. This value is overridable in kubeadm and kubelet but setting it to k8s.gcr.io will fail for new releases after April as they won’t be present in the old registry.

  • If you want to increase the reliability of your cluster and remove dependency on the community-owned registry or you are running Kubernetes in networks where external traffic is restricted, you should consider hosting local image registry mirrors. Some cloud vendors may offer hosted solutions for this.

SeccompDefault graduates to stable

To use seccomp profile defaulting, you must run the kubelet with the --seccomp-default command line flag enabled for each node where you want to use it. If enabled, the kubelet will use the RuntimeDefault seccomp profile by default, which is defined by the container runtime, instead of using the Unconfined (seccomp disabled) mode. The default profiles aim to provide a strong set of security defaults while preserving the functionality of the workload. It is possible that the default profiles differ between container runtimes and their release versions.

You can find detailed information about a possible upgrade and downgrade strategy in the related Kubernetes Enhancement Proposal (KEP): Enable seccomp by default.

Mutable scheduling directives for Jobs graduates to GA

This was introduced in v1.22 and started as a beta level, now it's stable. In most cases a parallel job will want the pods to run with constraints, like all in the same zone, or all either on GPU model x or y but not a mix of both. The suspend field is the first step towards achieving those semantics. suspend allows a custom queue controller to decide when a job should start. However, once a job is unsuspended, a custom queue controller has no influence on where the pods of a job will actually land.

This feature allows updating a Job's scheduling directives before it starts, which gives custom queue controllers the ability to influence pod placement while at the same time offloading actual pod-to-node assignment to kube-scheduler. This is allowed only for suspended Jobs that have never been unsuspended before. The fields in a Job's pod template that can be updated are node affinity, node selector, tolerations, labels ,annotations, and scheduling gates. Find more details in the KEP: Allow updating scheduling directives of jobs.

DownwardAPIHugePages graduates to stable

In Kubernetes v1.20, support for requests.hugepages-<pagesize> and limits.hugepages-<pagesize> was added to the downward API to be consistent with other resources like cpu, memory, and ephemeral storage. This feature graduates to stable in this release. You can find more details in the KEP: Downward API HugePages.

Pod Scheduling Readiness goes to beta

Upon creation, Pods are ready for scheduling. Kubernetes scheduler does its due diligence to find nodes to place all pending Pods. However, in a real-world case, some Pods may stay in a missing-essential-resources state for a long period. These Pods actually churn the scheduler (and downstream integrators like Cluster Autoscaler) in an unnecessary manner.

By specifying/removing a Pod's .spec.schedulingGates, you can control when a Pod is ready to be considered for scheduling.

The schedulingGates field contains a list of strings, and each string literal is perceived as a criteria that must be satisfied before a Pod is considered schedulable. This field can be initialized only when a Pod is created (either by the client, or mutated during admission). After creation, each schedulingGate can be removed in an arbitrary order, but addition of a new scheduling gate is disallowed.

Node log access via Kubernetes API

This feature helps cluster administrators debug issues with services running on nodes by allowing them to query service logs. To use this feature, ensure that the NodeLogQuery feature gate is enabled on that node, and that the kubelet configuration options enableSystemLogHandler and enableSystemLogQuery are both set to true. On Linux, we assume that service logs are available via journald. On Windows, we assume that service logs are available in the application log provider. You can also fetch logs from the /var/log/ and C:\var\log directories on Linux and Windows, respectively.

A cluster administrator can try out this alpha feature across all nodes of their cluster, or on a subset of them.

ReadWriteOncePod PersistentVolume access mode goes to beta

Kubernetes v1.22 introduced a new access mode ReadWriteOncePod for PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs). This access mode enables you to restrict volume access to a single pod in the cluster, ensuring that only one pod can write to the volume at a time. This can be particularly useful for stateful workloads that require single-writer access to storage.

The ReadWriteOncePod beta adds support for scheduler preemption of pods that use ReadWriteOncePod PVCs. Scheduler preemption allows higher-priority pods to preempt lower-priority pods. For example when a pod (A) with a ReadWriteOncePod PVC is scheduled, if another pod (B) is found using the same PVC and pod (A) has higher priority, the scheduler will return an Unschedulable status and attempt to preempt pod (B). For more context, see the KEP: ReadWriteOncePod PersistentVolume AccessMode.

Respect PodTopologySpread after rolling upgrades

matchLabelKeys is a list of pod label keys used to select the pods over which spreading will be calculated. The keys are used to lookup values from the pod labels. Those key-value labels are ANDed with labelSelector to select the group of existing pods over which spreading will be calculated for the incoming pod. Keys that don't exist in the pod labels will be ignored. A null or empty list means only match against the labelSelector.

With matchLabelKeys, users don't need to update the pod.spec between different revisions. The controller/operator just needs to set different values to the same label key for different revisions. The scheduler will assume the values automatically based on matchLabelKeys. For example, if users use Deployment, they can use the label keyed with pod-template-hash, which is added automatically by the Deployment controller, to distinguish between different revisions in a single Deployment.

Faster SELinux volume relabeling using mounts

In this release, how SELinux labels are applied to volumes used by Pods is graduating to beta. This feature speeds up container startup by mounting volumes with the correct SELinux label instead of changing each file on the volumes recursively. Linux kernel with SELinux support allows the first mount of a volume to set SELinux label on the whole volume using -o context= mount option. This way, all files will have assigned the given label in a constant time, without recursively walking through the whole volumes.

The context mount option cannot be applied to bind mounts or re-mounts of already mounted volumes. For CSI storage, a CSI driver does the first mount of a volume, and so it must be the CSI driver that actually applies this mount option. We added a new field SELinuxMount to CSIDriver objects, so that drivers can announce whether they support the -o context mount option.

If Kubernetes knows the SELinux label of a Pod and the CSI driver responsible for a pod's volume announces SELinuxMount: true and the volume has Access Mode ReadWriteOncePod, then it will ask the CSI driver to mount the volume with mount option context= and it will tell the container runtime not to relabel content of the volume (because all files already have the right label). Get more information on this from the KEP: Speed up SELinux volume relabeling using mounts.

Robust VolumeManager reconstruction goes to beta

This is a volume manager refactoring that allows the kubelet to populate additional information about how existing volumes are mounted during the kubelet startup. In general, this makes volume cleanup more robust. If you enable the NewVolumeManagerReconstruction feature gate on a node, you'll get enhanced discovery of mounted volumes during kubelet startup.

Before Kubernetes v1.25, the kubelet used different default behavior for discovering mounted volumes during the kubelet startup. If you disable this feature gate (it's enabled by default), you select the legacy discovery behavior.

In Kubernetes v1.25 and v1.26, this behavior toggle was part of the SELinuxMountReadWriteOncePod feature gate.

Mutable Pod Scheduling Directives goes to beta

This allows mutating a pod that is blocked on a scheduling readiness gate with a more constrained node affinity/selector. It gives the ability to mutate a pods scheduling directives before it is allowed to be scheduled and gives an external resource controller the ability to influence pod placement while at the same time offload actual pod-to-node assignment to kube-scheduler.

This opens the door for a new pattern of adding scheduling features to Kubernetes. Specifically, building lightweight schedulers that implement features not supported by kube-scheduler, while relying on the existing kube-scheduler to support all upstream features and handle the pod-to-node binding. This pattern should be the preferred one if the custom feature doesn't require implementing a schedule plugin, which entails re-building and maintaining a custom kube-scheduler binary.

Feature graduations and deprecations in Kubernetes v1.27

Graduations to stable

This release includes a total of 9 enhancements promoted to Stable:

Deprecations and removals

This release saw several removals:

Release notes

The complete details of the Kubernetes v1.27 release are available in our release notes.

Availability

Kubernetes v1.27 is available for download on GitHub. To get started with Kubernetes, you can run local Kubernetes clusters using minikube, kind, etc. You can also easily install v1.27 using kubeadm.

Release team

Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires people with specialised skills from all corners of our community, from the code itself to its documentation and project management.

Special thanks to our Release Lead Xander Grzywinski for guiding us through a smooth and successful release cycle and to all members of the release team for supporting one another and working so hard to produce the v1.27 release for the community.

Ecosystem updates

  • KubeCon + CloudNativeCon Europe 2023 will take place in Amsterdam, The Netherlands, from 17 – 21 April 2023! You can find more information about the conference and registration on the event site.
  • cdCon + GitOpsCon will be held in Vancouver, Canada, on May 8th and 9th, 2023! More information about the conference and registration can be found on the event site.

Project velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing, and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

In the v1.27 release cycle, which ran for 14 weeks (January 9 to April 11), we saw contributions from 1020 companies and 1603 individuals.

Upcoming release webinar

Join members of the Kubernetes v1.27 release team on Friday, April 14, 2023, at 10 a.m. PDT to learn about the major features of this release, as well as deprecations and removals to help plan for upgrades. For more information and registration, visit the event page on the CNCF Online Programs site.

Get Involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests.

Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below:

Keeping Kubernetes Secure with Updated Go Versions

The problem

Since v1.19 (released in 2020), the Kubernetes project provides 12-14 months of patch releases for each minor version. This enables users to qualify and adopt Kubernetes versions in an annual upgrade cycle and receive security fixes for a year.

The Go project releases new minor versions twice a year, and provides security fixes for the last two minor versions, resulting in about a year of support for each Go version. Even though each new Kubernetes minor version is built with a supported Go version when it is first released, that Go version falls out of support before the Kubernetes minor version does, and the lengthened Kubernetes patch support since v1.19 only widened that gap.

At the time this was written, just over half of all Go patch releases (88/171) have contained fixes for issues with possible security implications. Even though many of these issues were not relevant to Kubernetes, some were, so it remained important to use supported Go versions that received those fixes.

An obvious solution would be to simply update Kubernetes release branches to new minor versions of Go. However, Kubernetes avoids destabilizing changes in patch releases, and historically, this prevented updating existing release branches to new minor versions of Go, due to changes that were considered prohibitively complex, risky, or breaking to include in a patch release. Examples include:

  • Go 1.6: enabling http/2 by default
  • Go 1.14: EINTR handling issues
  • Go 1.17: dropping x509 CN support, ParseIP changes
  • Go 1.18: disabling x509 SHA-1 certificate support by default
  • Go 1.19: dropping current-dir LookPath behavior

Some of these changes could be easily mitigated in Kubernetes code, some could only be opted out of via a user-specified GODEBUG envvar, and others required invasive code changes or could not be avoided at all. Because of this inconsistency, Kubernetes release branches have typically remained on a single Go minor version, and risked being unable to pick up relevant Go security fixes for the last several months of each Kubernetes minor version's support lifetime.

When a relevant Go security fix was only available in newer Kubernetes minor versions, users would have to upgrade away from older Kubernetes minor versions before their 12-14 month support period ended, just to pick up those fixes. If a user was not prepared to do that upgrade, it could result in vulnerable Kubernetes clusters. Even if a user could accommodate the unexpected upgrade, the uncertainty made Kubernetes' annual support less reliable for planning.

The solution

We're happy to announce that the gap between supported Kubernetes versions and supported Go versions has been resolved as of January 2023.

We worked closely with the Go team over the past year to address the difficulties adopting new Go versions. This prompted a discussion, proposal, talk at GopherCon, and a design for improving backward compatibility in Go, ensuring new Go versions can maintain compatible runtime behavior with previous Go versions for a minimum of two years (four Go releases). This allows projects like Kubernetes to update release branches to supported Go versions without exposing users to behavior changes.

The proposed improvements are on track to be included in Go 1.21, and the Go team already delivered targeted compatibility improvements in a Go 1.19 patch release in late 2022. Those changes enabled Kubernetes 1.23+ to update to Go 1.19 in January of 2023, while avoiding any user-facing configuration or behavior changes. All supported Kubernetes release branches now use supported Go versions, and can pick up new Go patch releases with available security fixes.

Going forward, Kubernetes maintainers remain committed to making Kubernetes patch releases as safe and non-disruptive as possible, so there are several requirements a new Go minor version must meet before existing Kubernetes release branches will update to use it:

  1. The new Go version must be available for at least 3 months. This gives time for adoption by the Go community, and for reports of issues or regressions.
  2. The new Go version must be used in a new Kubernetes minor release for at least 1 month. This ensures all Kubernetes release-blocking tests pass on the new Go version, and gives time for feedback from the Kubernetes community on release candidates and early adoption of the new minor release.
  3. There must be no regressions from the previous Go version known to impact Kubernetes.
  4. Runtime behavior must be preserved by default, without requiring any action by Kubernetes users / administrators.
  5. Kubernetes libraries like k8s.io/client-go must remain compatible with the original Go version used for each minor release, so consumers won't have to update Go versions to pick up a library patch release (though they are encouraged to build with supported Go versions, which is made even easier with the compatibility improvements planned in Go 1.21).

The goal of all of this work is to unobtrusively make Kubernetes patch releases safer and more secure, and to make Kubernetes minor versions safe to use for the entire duration of their support lifetime.

Many thanks to the Go team, especially Russ Cox, for helping drive these improvements in ways that will benefit all Go users, not just Kubernetes.

Kubernetes Validating Admission Policies: A Practical Example

Admission control is an important part of the Kubernetes control plane, with several internal features depending on the ability to approve or change an API object as it is submitted to the server. It is also useful for an administrator to be able to define business logic, or policies, regarding what objects can be admitted into a cluster. To better support that use case, Kubernetes introduced external admission control in v1.7.

In addition to countless custom, internal implementations, many open source projects and commercial solutions implement admission controllers with user-specified policy, including Kyverno and Open Policy Agent’s Gatekeeper.

While admission controllers for policy have seen adoption, there are blockers for their widespread use. Webhook infrastructure must be maintained as a production service, with all that entails. The failure case of an admission control webhook must either be closed, reducing the availability of the cluster; or open, negating the use of the feature for policy enforcement. The network hop and evaluation time makes admission control a notable component of latency when dealing with, for example, pods being spun up to respond to a network request in a "serverless" environment.

Validating admission policies and the Common Expression Language

Version 1.26 of Kubernetes introduced, in alpha, a compromise solution. Validating admission policies are a declarative, in-process alternative to admission webhooks. They use the Common Expression Language (CEL) to declare validation rules.

CEL was developed by Google for security and policy use cases, based on learnings from the Firebase real-time database. Its design allows it to be safely embedded into applications and executed in microseconds, with limited compute and memory impact. Validation rules for CRDs introduced CEL to the Kubernetes ecosystem in v1.23, and at the time it was noted that the language would suit a more generic implementation of validation by admission control.

Giving CEL a roll - a practical example

Kubescape is a CNCF project which has become one of the most popular ways for users to improve the security posture of a Kubernetes cluster and validate its compliance. Its controls — groups of tests against API objects — are built in Rego, the policy language of Open Policy Agent.

Rego has a reputation for complexity, based largely on the fact that it is a declarative query language (like SQL). It was considered for use in Kubernetes, but it does not offer the same sandbox constraints as CEL.

A common feature request for the project is to be able to implement policies based on Kubescape’s findings and output. For example, after scanning pods for known paths to cloud credential files, users would like the ability to enforce policy that these pods should not be admitted at all. The Kubescape team thought this would be the perfect opportunity to try and port our existing controls to CEL and apply them as admission policies.

Show me the policy

It did not take us long to convert many of our controls and build a library of validating admission policies. Let’s look at one as an example.

Kubescape’s control C-0017 covers the requirement for containers to have an immutable (read-only) root filesystem. This is a best practice according to the NSA Kubernetes hardening guidelines, but is not currently required as a part of any of the pod security standards.

Here's how we implemented it in CEL:

apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy
metadata:
  name: "kubescape-c-0017-deny-resources-with-mutable-container-filesystem"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups:   [""]
      apiVersions: ["v1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["pods"]
    - apiGroups:   ["apps"]
      apiVersions: ["v1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["deployments","replicasets","daemonsets","statefulsets"]
    - apiGroups:   ["batch"]
      apiVersions: ["v1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["jobs","cronjobs"]
  validations:
    - expression: "object.kind != 'Pod' || object.spec.containers.all(container, has(container.securityContext) && has(container.securityContext.readOnlyRootFilesystem) &&  container.securityContext.readOnlyRootFilesystem == true)"
      message: "Pods having containers with mutable filesystem not allowed! (see more at https://hub.armosec.io/docs/c-0017)"
    - expression: "['Deployment','ReplicaSet','DaemonSet','StatefulSet','Job'].all(kind, object.kind != kind) || object.spec.template.spec.containers.all(container, has(container.securityContext) && has(container.securityContext.readOnlyRootFilesystem) &&  container.securityContext.readOnlyRootFilesystem == true)"
      message: "Workloads having containers with mutable filesystem not allowed! (see more at https://hub.armosec.io/docs/c-0017)"
    - expression: "object.kind != 'CronJob' || object.spec.jobTemplate.spec.template.spec.containers.all(container, has(container.securityContext) && has(container.securityContext.readOnlyRootFilesystem) &&  container.securityContext.readOnlyRootFilesystem == true)"
      message: "CronJob having containers with mutable filesystem not allowed! (see more at https://hub.armosec.io/docs/c-0017)"

Match constraints are provided for three possible API groups: the core/v1 group for Pods, the apps/v1 workload controllers, and the batch/v1 job controllers.

Note:

matchConstraints will convert the API object to the matched version for you. If, for example, an API request was for apps/v1beta1 and you match apps/v1 in matchConstraints, the API request will be converted from apps/v1beta1 to apps/v1 and then validated. This has the useful property of making validation rules secure against the introduction of new versions of APIs, which would otherwise allow API requests to sneak past the validation rule by using the newly introduced version.

The validations include the CEL rules for the objects. There are three different expressions, catering for the fact that a Pod spec can be at the root of the object (a naked pod, under template (a workload controller or a Job), or under jobTemplate (a CronJob).

In the event that any spec does not have readOnlyRootFilesystem set to true, the object will not be admitted.

Note:

In our initial release, we have grouped the three expressions into the same policy object. This means they can be enabled and disabled atomically, and thus there is no chance that a user will accidentally leave a compliance gap by enabling policy for one API group and not the others. Breaking them into separate policies would allow us access to improvements targeted for the 1.27 release, including type checking. We are talking to SIG API Machinery about how to best address this before the APIs reach v1.

Using the CEL library in your cluster

Policies are provided as Kubernetes objects, which are then bound to certain resources by a selector.

Minikube is a quick and easy way to install and configure a Kubernetes cluster for testing. To install Kubernetes v1.26 with the ValidatingAdmissionPolicy feature gate enabled:

minikube start --kubernetes-version=1.26.1 --extra-config=apiserver.runtime-config=admissionregistration.k8s.io/v1alpha1  --feature-gates='ValidatingAdmissionPolicy=true'

To install the policies in your cluster:

# Install configuration CRD
kubectl apply -f https://github.com/kubescape/cel-admission-library/releases/latest/download/policy-configuration-definition.yaml
# Install basic configuration
kubectl apply -f https://github.com/kubescape/cel-admission-library/releases/latest/download/basic-control-configuration.yaml
# Install policies
kubectl apply -f https://github.com/kubescape/cel-admission-library/releases/latest/download/kubescape-validating-admission-policies.yaml

To apply policies to objects, create a ValidatingAdmissionPolicyBinding resource. Let’s apply the above Kubescape C-0017 control to any namespace with the label policy=enforced:

# Create a binding
kubectl apply -f - <<EOT
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: c0017-binding
spec:
  policyName: kubescape-c-0017-deny-mutable-container-filesystem
  matchResources:
    namespaceSelector:
      matchLabels:
        policy: enforced
EOT

# Create a namespace for running the example
kubectl create namespace policy-example
kubectl label namespace policy-example 'policy=enforced'

Now, if you attempt to create an object without specifying a readOnlyRootFilesystem, it will not be created.

# The next line should fail
kubectl -n policy-example run nginx --image=nginx --restart=Never

The output shows our error:

The pods "nginx" is invalid: : ValidatingAdmissionPolicy 'kubescape-c-0017-deny-mutable-container-filesystem' with binding 'c0017-binding' denied request: Pods having containers with mutable filesystem not allowed! (see more at https://hub.armosec.io/docs/c-0017)

Configuration

Policy objects can include configuration, which is provided in a different object. Many of the Kubescape controls require a configuration: which labels to require, which capabilities to allow or deny, which registries to allow containers to be deployed from, etc. Default values for those controls are defined in the ControlConfiguration object.

To use this configuration object, or your own object in the same format, add a paramRef.name value to your binding object:

apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: c0001-binding
spec:
  policyName: kubescape-c-0001-deny-forbidden-container-registries
  paramRef:
    name: basic-control-configuration
  matchResources:
    namespaceSelector:
      matchLabels:
        policy: enforced

Summary

Converting our controls to CEL was simple, in most cases. We cannot port the whole Kubescape library, as some controls check for things outside a Kubernetes cluster, and some require data that is not available in the admission request object. Overall, we are happy to contribute this library to the Kubernetes community and will continue to develop it for Kubescape and Kubernetes users alike. We hope it becomes useful, either as something you use yourself, or as examples for you to write your own policies.

As for the validating admission policy feature itself, we are very excited to see this native functionality introduced to Kubernetes. We look forward to watching it move to Beta and then GA, hopefully by the end of the year. It is important to note this feature is currently in Alpha, which means this is the perfect opportunity to play around with it in environments like Minikube and give a test drive. However, it is not yet considered production-ready and stable, and will not be enabled on most managed Kubernetes environments. We will not recommend Kubescape users use these policies in production until the underlying functionality becomes stable. Keep an eye on the KEP, and of course this blog, for an eventual release announcement.

Kubernetes Removals and Major Changes In v1.27

As Kubernetes develops and matures, features may be deprecated, removed, or replaced with better ones for the project's overall health. Based on the information available at this point in the v1.27 release process, which is still ongoing and can introduce additional changes, this article identifies and describes some of the planned changes for the Kubernetes v1.27 release.

A note about the k8s.gcr.io redirect to registry.k8s.io

To host its container images, the Kubernetes project uses a community-owned image registry called registry.k8s.io. On March 20th, all traffic from the out-of-date k8s.gcr.io registry will be redirected to registry.k8s.io. The deprecated k8s.gcr.io registry will eventually be phased out.

What does this change mean?

  • If you are a subproject maintainer, you must update your manifests and Helm charts to use the new registry.

  • The v1.27 Kubernetes release will not be published to the old registry.

  • From April, patch releases for v1.24, v1.25, and v1.26 will no longer be published to the old registry.

We have a blog post with all the information about this change and what to do if it impacts you.

The Kubernetes API Removal and Deprecation process

The Kubernetes project has a well-documented deprecation policy for features. This policy states that stable APIs may only be deprecated when a newer, stable version of that same API is available and that APIs have a minimum lifetime for each stability level. A deprecated API has been marked for removal in a future Kubernetes release, it will continue to function until removal (at least one year from the deprecation), but usage will result in a warning being displayed. Removed APIs are no longer available in the current version, at which point you must migrate to using the replacement.

  • Generally available (GA) or stable API versions may be marked as deprecated but must not be removed within a major version of Kubernetes.

  • Beta or pre-release API versions must be supported for 3 releases after the deprecation.

  • Alpha or experimental API versions may be removed in any release without prior deprecation notice.

Whether an API is removed as a result of a feature graduating from beta to stable or because that API simply did not succeed, all removals comply with this deprecation policy. Whenever an API is removed, migration options are communicated in the documentation.

API removals, and other changes for Kubernetes v1.27

Removal of storage.k8s.io/v1beta1 from CSIStorageCapacity

The CSIStorageCapacity API supports exposing currently available storage capacity via CSIStorageCapacity objects and enhances the scheduling of pods that use CSI volumes with late binding. The storage.k8s.io/v1beta1 API version of CSIStorageCapacity was deprecated in v1.24, and it will no longer be served in v1.27.

Migrate manifests and API clients to use the storage.k8s.io/v1 API version, available since v1.24. All existing persisted objects are accessible via the new API.

Refer to the Storage Capacity Constraints for Pod Scheduling KEP for more information.

Kubernetes v1.27 is not removing any other APIs; however several other aspects are going to be removed. Read on for details.

Support for deprecated seccomp annotations

In Kubernetes v1.19, the seccomp (secure computing mode) support graduated to General Availability (GA). This feature can be used to increase the workload security by restricting the system calls for a Pod (applies to all containers) or single containers.

The support for the alpha seccomp annotations seccomp.security.alpha.kubernetes.io/pod and container.seccomp.security.alpha.kubernetes.io were deprecated since v1.19, now have been completely removed. The seccomp fields are no longer auto-populated when pods with seccomp annotations are created. Pods should use the corresponding pod or container securityContext.seccompProfile field instead.

Removal of several feature gates for volume expansion

The following feature gates for volume expansion GA features will be removed and must no longer be referenced in --feature-gates flags:

ExpandCSIVolumes
Enable expanding of CSI volumes.
ExpandInUsePersistentVolumes
Enable expanding in-use PVCs.
ExpandPersistentVolumes
Enable expanding of persistent volumes.

Removal of --master-service-namespace command line argument

The kube-apiserver accepts a deprecated command line argument, --master-service-namespace, that specified where to create the Service named kubernetes to represent the API server. Kubernetes v1.27 will remove that argument, which has been deprecated since the v1.26 release.

Removal of the ControllerManagerLeaderMigration feature gate

Leader Migration provides a mechanism in which HA clusters can safely migrate "cloud-specific" controllers between the kube-controller-manager and the cloud-controller-manager via a shared resource lock between the two components while upgrading the replicated control plane.

The ControllerManagerLeaderMigration feature, GA since v1.24, is unconditionally enabled and for the v1.27 release the feature gate option will be removed. If you're setting this feature gate explicitly, you'll need to remove that from command line arguments or configuration files.

Removal of --enable-taint-manager command line argument

The kube-controller-manager command line argument --enable-taint-manager is deprecated, and will be removed in Kubernetes v1.27. The feature that it supports, taint based eviction, is already enabled by default and will continue to be implicitly enabled when the flag is removed.

Removal of --pod-eviction-timeout command line argument

The deprecated command line argument --pod-eviction-timeout will be removed from the kube-controller-manager.

Removal of the CSI Migration feature gate

The CSI migration programme allows moving from in-tree volume plugins to out-of-tree CSI drivers. CSI migration is generally available since Kubernetes v1.16, and the associated CSIMigration feature gate will be removed in v1.27.

Removal of CSIInlineVolume feature gate

The CSI Ephemeral Volume feature allows CSI volumes to be specified directly in the pod specification for ephemeral use cases. They can be used to inject arbitrary states, such as configuration, secrets, identity, variables or similar information, directly inside pods using a mounted volume. This feature graduated to GA in v1.25. Hence, the feature gate CSIInlineVolume will be removed in the v1.27 release.

Removal of EphemeralContainers feature gate

Ephemeral containers graduated to GA in v1.25. These are containers with a temporary duration that executes within namespaces of an existing pod. Ephemeral containers are typically initiated by a user in order to observe the state of other pods and containers for troubleshooting and debugging purposes. For Kubernetes v1.27, API support for ephemeral containers is unconditionally enabled; the EphemeralContainers feature gate will be removed.

Removal of LocalStorageCapacityIsolation feature gate

The Local Ephemeral Storage Capacity Isolation feature moved to GA in v1.25. The feature provides support for capacity isolation of local ephemeral storage between pods, such as emptyDir volumes, so that a pod can be hard limited in its consumption of shared resources. The kubelet will evicting Pods if consumption of local ephemeral storage exceeds the configured limit. The feature gate, LocalStorageCapacityIsolation, will be removed in the v1.27 release.

Removal of NetworkPolicyEndPort feature gate

The v1.25 release of Kubernetes promoted endPort in NetworkPolicy to GA. NetworkPolicy providers that support the endPort field that can be used to specify a range of ports to apply a NetworkPolicy. Previously, each NetworkPolicy could only target a single port. So the feature gate NetworkPolicyEndPort will be removed in this release.

Please be aware that endPort field must be supported by the Network Policy provider. If your provider does not support endPort, and this field is specified in a Network Policy, the Network Policy will be created covering only the port field (single port).

Removal of StatefulSetMinReadySeconds feature gate

For a pod that is part of a StatefulSet, Kubernetes can mark the Pod ready only if Pod is available (and passing checks) for at least the period you specify in minReadySeconds. The feature became generally available in Kubernetes v1.25, and the StatefulSetMinReadySeconds feature gate will be locked to true and removed in the v1.27 release.

Removal of IdentifyPodOS feature gate

You can specify the operating system for a Pod, and the feature support for that is stable since the v1.25 release. The IdentifyPodOS feature gate will be removed for Kubernetes v1.27.

Removal of DaemonSetUpdateSurge feature gate

The v1.25 release of Kubernetes also stabilised surge support for DaemonSet pods, implemented in order to minimize DaemonSet downtime during rollouts. The DaemonSetUpdateSurge feature gate will be removed in Kubernetes v1.27.

Removal of --container-runtime command line argument

The kubelet accepts a deprecated command line argument, --container-runtime, and the only valid value would be remote after dockershim code is removed. Kubernetes v1.27 will remove that argument, which has been deprecated since the v1.24 release.

Looking ahead

The official list of API removals planned for Kubernetes v1.29 includes:

  • The flowcontrol.apiserver.k8s.io/v1beta2 API version of FlowSchema and PriorityLevelConfiguration will no longer be served in v1.29.

Want to know more?

Deprecations are announced in the Kubernetes release notes. You can see the announcements of pending deprecations in the release notes for:

We will formally announce the deprecations that come with Kubernetes v1.27 as part of the CHANGELOG for that release.

For information on the process of deprecation and removal, check out the official Kubernetes deprecation policy document.

k8s.gcr.io Redirect to registry.k8s.io - What You Need to Know

On Monday, March 20th, the k8s.gcr.io registry will be redirected to the community owned registry, registry.k8s.io .

TL;DR: What you need to know about this change

  • On Monday, March 20th, traffic from the older k8s.gcr.io registry will be redirected to registry.k8s.io with the eventual goal of sunsetting k8s.gcr.io.
  • If you run in a restricted environment, and apply strict domain name or IP address access policies limited to k8s.gcr.io, the image pulls will not function after k8s.gcr.io starts redirecting to the new registry. 
  • A small subset of non-standard clients do not handle HTTP redirects by image registries, and will need to be pointed directly at registry.k8s.io.
  • The redirect is a stopgap to assist users in making the switch. The deprecated k8s.gcr.io registry will be phased out at some point. Please update your manifests as soon as possible to point to registry.k8s.io.
  • If you host your own image registry, you can copy images you need there as well to reduce traffic to community owned registries.

If you think you may be impacted, or would like to know more about this change, please keep reading.

How can I check if I am impacted?

To test connectivity to registry.k8s.io and being able to pull images from there, here is a sample command that can be executed in the namespace of your choosing:

kubectl run hello-world -ti --rm --image=registry.k8s.io/busybox:latest --restart=Never -- date

When you run the command above, here’s what to expect when things work correctly:

$ kubectl run hello-world -ti --rm --image=registry.k8s.io/busybox:latest --restart=Never -- date
Fri Feb 31 07:07:07 UTC 2023
pod "hello-world" deleted

What kind of errors will I see if I’m impacted?

Errors may depend on what kind of container runtime you are using, and what endpoint you are routed to, but it should present such as ErrImagePull, ImagePullBackOff, or a container failing to be created with the warning FailedCreatePodSandBox.

Below is an example error message showing a proxied deployment failing to pull due to an unknown certificate:

FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = Error response from daemon: Head “https://us-west1-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.8”: x509: certificate signed by unknown authority

What images will be impacted?

ALL images on k8s.gcr.io will be impacted by this change. k8s.gcr.io hosts many images beyond Kubernetes releases. A large number of Kubernetes subprojects host their images there as well. Some examples include the dns/k8s-dns-node-cache, ingress-nginx/controller, and node-problem-detector/node-problem-detector images.

I am impacted. What should I do?

For impacted users that run in a restricted environment, the best option is to copy over the required images to a private registry or configure a pull-through cache in their registry.

There are several tools to copy images between registries; crane is one of those tools, and images can be copied to a private registry by using crane copy SRC DST. There are also vendor-specific tools, like e.g. Google’s gcrane, that perform a similar function but are streamlined for their platform.

How can I find which images are using the legacy registry, and fix them?

Option 1: See the one line kubectl command in our earlier blog post:

kubectl get pods --all-namespaces -o jsonpath="{.items[*].spec.containers[*].image}" |\
tr -s '[[:space:]]' '\n' |\
sort |\
uniq -c

Option 2: A kubectl krew plugin has been developed called community-images, that will scan and report any images using the k8s.gcr.io endpoint.

If you have krew installed, you can install it with:

kubectl krew install community-images

and generate a report with:

kubectl community-images

For alternate methods of install and example output, check out the repo: kubernetes-sigs/community-images.

Option 3: If you do not have access to a cluster directly, or manage many clusters - the best way is to run a search over your manifests and charts for "k8s.gcr.io".

Option 4: If you wish to prevent k8s.gcr.io based images from running in your cluster, example policies for Gatekeeper and Kyverno are available in the AWS EKS Best Practices repository that will block them from being pulled. You can use these third-party policies with any Kubernetes cluster.

Option 5: As a LAST possible option, you can use a Mutating Admission Webhook to change the image address dynamically. This should only be considered a stopgap till your manifests have been updated. You can find a (third party) Mutating Webhook and Kyverno policy in k8s-gcr-quickfix.

Why did Kubernetes change to a different image registry?

k8s.gcr.io is hosted on a custom Google Container Registry (GCR) domain that was set up solely for the Kubernetes project. This has worked well since the inception of the project, and we thank Google for providing these resources, but today, there are other cloud providers and vendors that would like to host images to provide a better experience for the people on their platforms. In addition to Google’s renewed commitment to donate $3 million to support the project's infrastructure last year, Amazon Web Services announced a matching donation during their Kubecon NA 2022 keynote in Detroit. This will provide a better experience for users (closer servers = faster downloads) and will reduce the egress bandwidth and costs from GCR at the same time.

For more details on this change, check out registry.k8s.io: faster, cheaper and Generally Available (GA).

Why is a redirect being put in place?

The project switched to registry.k8s.io last year with the 1.25 release; however, most of the image pull traffic is still directed at the old endpoint k8s.gcr.io. This has not been sustainable for us as a project, as it is not utilizing the resources that have been donated to the project from other providers, and we are in the danger of running out of funds due to the cost of serving this traffic.

A redirect will enable the project to take advantage of these new resources, significantly reducing our egress bandwidth costs. We only expect this change to impact a small subset of users running in restricted environments or using very old clients that do not respect redirects properly.

What will happen to k8s.gcr.io?

Separate from the redirect, k8s.gcr.io will be frozen and will not be updated with new images after April 3rd, 2023. k8s.gcr.io will not get any new releases, patches, or security updates. It will continue to remain available to help people migrate, but it WILL be phased out entirely in the future.

I still have questions, where should I go?

For more information on registry.k8s.io and why it was developed, see registry.k8s.io: faster, cheaper and Generally Available.

If you would like to know more about the image freeze and the last images that will be available there, see the blog post: k8s.gcr.io Image Registry Will Be Frozen From the 3rd of April 2023.

Information on the architecture of registry.k8s.io and its request handling decision tree can be found in the kubernetes/registry.k8s.io repo.

If you believe you have encountered a bug with the new registry or the redirect, please open an issue in the kubernetes/registry.k8s.io repo. Please check if there is an issue already open similar to what you are seeing before you create a new issue.

Forensic container analysis

In my previous article, Forensic container checkpointing in Kubernetes, I introduced checkpointing in Kubernetes and how it has to be setup and how it can be used. The name of the feature is Forensic container checkpointing, but I did not go into any details how to do the actual analysis of the checkpoint created by Kubernetes. In this article I want to provide details how the checkpoint can be analyzed.

Checkpointing is still an alpha feature in Kubernetes and this article wants to provide a preview how the feature might work in the future.

Preparation

Details about how to configure Kubernetes and the underlying CRI implementation to enable checkpointing support can be found in my Forensic container checkpointing in Kubernetes article.

As an example I prepared a container image (quay.io/adrianreber/counter:blog) which I want to checkpoint and then analyze in this article. This container allows me to create files in the container and also store information in memory which I later want to find in the checkpoint.

To run that container I need a pod, and for this example I am using the following Pod manifest:

apiVersion: v1
kind: Pod
metadata:
  name: counters
spec:
  containers:
  - name: counter
    image: quay.io/adrianreber/counter:blog

This results in a container called counter running in a pod called counters.

Once the container is running I am performing following actions with that container:

$ kubectl get pod counters --template '{{.status.podIP}}'
10.88.0.25
$ curl 10.88.0.25:8088/create?test-file
$ curl 10.88.0.25:8088/secret?RANDOM_1432_KEY
$ curl 10.88.0.25:8088

The first access creates a file called test-file with the content test-file in the container and the second access stores my secret information (RANDOM_1432_KEY) somewhere in the container's memory. The last access just adds an additional line to the internal log file.

The last step before I can analyze the checkpoint it to tell Kubernetes to create the checkpoint. As described in the previous article this requires access to the kubelet only checkpoint API endpoint.

For a container named counter in a pod named counters in a namespace named default the kubelet API endpoint is reachable at:

# run this on the node where that Pod is executing
curl -X POST "https://localhost:10250/checkpoint/default/counters/counter"

For completeness the following curl command-line options are necessary to have curl accept the kubelet's self signed certificate and authorize the use of the kubelet checkpoint API:

--insecure --cert /var/run/kubernetes/client-admin.crt --key /var/run/kubernetes/client-admin.key

Once the checkpointing has finished the checkpoint should be available at /var/lib/kubelet/checkpoints/checkpoint-<pod-name>_<namespace-name>-<container-name>-<timestamp>.tar

In the following steps of this article I will use the name checkpoint.tar when analyzing the checkpoint archive.

Checkpoint archive analysis using checkpointctl

To get some initial information about the checkpointed container I am using the tool checkpointctl like this:

$ checkpointctl show checkpoint.tar --print-stats
+-----------+----------------------------------+--------------+---------+---------------------+--------+------------+------------+-------------------+
| CONTAINER |              IMAGE               |      ID      | RUNTIME |       CREATED       | ENGINE |     IP     | CHKPT SIZE | ROOT FS DIFF SIZE |
+-----------+----------------------------------+--------------+---------+---------------------+--------+------------+------------+-------------------+
| counter   | quay.io/adrianreber/counter:blog | 059a219a22e5 | runc    | 2023-03-02T06:06:49 | CRI-O  | 10.88.0.23 | 8.6 MiB    | 3.0 KiB           |
+-----------+----------------------------------+--------------+---------+---------------------+--------+------------+------------+-------------------+
CRIU dump statistics
+---------------+-------------+--------------+---------------+---------------+---------------+
| FREEZING TIME | FROZEN TIME | MEMDUMP TIME | MEMWRITE TIME | PAGES SCANNED | PAGES WRITTEN |
+---------------+-------------+--------------+---------------+---------------+---------------+
| 100809 us     | 119627 us   | 11602 us     | 7379 us       |          7800 |          2198 |
+---------------+-------------+--------------+---------------+---------------+---------------+

This gives me already some information about the checkpoint in that checkpoint archive. I can see the name of the container, information about the container runtime and container engine. It also lists the size of the checkpoint (CHKPT SIZE). This is mainly the size of the memory pages included in the checkpoint, but there is also information about the size of all changed files in the container (ROOT FS DIFF SIZE).

The additional parameter --print-stats decodes information in the checkpoint archive and displays them in the second table (CRIU dump statistics). This information is collected during checkpoint creation and gives an overview how much time CRIU needed to checkpoint the processes in the container and how many memory pages were analyzed and written during checkpoint creation.

Digging deeper

With the help of checkpointctl I am able to get some high level information about the checkpoint archive. To be able to analyze the checkpoint archive further I have to extract it. The checkpoint archive is a tar archive and can be extracted with the help of tar xf checkpoint.tar.

Extracting the checkpoint archive will result in following files and directories:

  • bind.mounts - this file contains information about bind mounts and is needed during restore to mount all external files and directories at the right location
  • checkpoint/ - this directory contains the actual checkpoint as created by CRIU
  • config.dump and spec.dump - these files contain metadata about the container which is needed during restore
  • dump.log - this file contains the debug output of CRIU created during checkpointing
  • stats-dump - this file contains the data which is used by checkpointctl to display dump statistics (--print-stats)
  • rootfs-diff.tar - this file contains all changed files on the container's file-system

File-system changes - rootfs-diff.tar

The first step to analyze the container's checkpoint further is to look at the files that have changed in my container. This can be done by looking at the file rootfs-diff.tar:

$ tar xvf rootfs-diff.tar
home/counter/logfile
home/counter/test-file

Now the files that changed in the container can be studied:

$ cat home/counter/logfile
10.88.0.1 - - [02/Mar/2023 06:07:29] "GET /create?test-file HTTP/1.1" 200 -
10.88.0.1 - - [02/Mar/2023 06:07:40] "GET /secret?RANDOM_1432_KEY HTTP/1.1" 200 -
10.88.0.1 - - [02/Mar/2023 06:07:43] "GET / HTTP/1.1" 200 -
$ cat home/counter/test-file
test-file 

Compared to the container image (quay.io/adrianreber/counter:blog) this container is based on, I can see that the file logfile contains information about all access to the service the container provides and the file test-file was created just as expected.

With the help of rootfs-diff.tar it is possible to inspect all files that were created or changed compared to the base image of the container.

Analyzing the checkpointed processes - checkpoint/

The directory checkpoint/ contains data created by CRIU while checkpointing the processes in the container. The content in the directory checkpoint/ consists of different image files which can be analyzed with the help of the tool CRIT which is distributed as part of CRIU.

First lets get an overview of the processes inside of the container:

$ crit show checkpoint/pstree.img | jq .entries[].pid
1
7
8

This output means that I have three processes inside of the container's PID namespace with the PIDs: 1, 7, 8

This is only the view from the inside of the container's PID namespace. During restore exactly these PIDs will be recreated. From the outside of the container's PID namespace the PIDs will change after restore.

The next step is to get some additional information about these three processes:

$ crit show checkpoint/core-1.img | jq .entries[0].tc.comm
"bash"
$ crit show checkpoint/core-7.img | jq .entries[0].tc.comm
"counter.py"
$ crit show checkpoint/core-8.img | jq .entries[0].tc.comm
"tee"

This means the three processes in my container are bash, counter.py (a Python interpreter) and tee. For details about the parent child relations of these processes there is more data to be analyzed in checkpoint/pstree.img.

Let's compare the so far collected information to the still running container:

$ crictl inspect --output go-template --template "{{(index .info.pid)}}" 059a219a22e56
722520
$ ps auxf | grep -A 2 722520
fedora    722520  \_ bash -c /home/counter/counter.py 2>&1 | tee /home/counter/logfile
fedora    722541      \_ /usr/bin/python3 /home/counter/counter.py
fedora    722542      \_ /usr/bin/coreutils --coreutils-prog-shebang=tee /usr/bin/tee /home/counter/logfile
$ cat /proc/722520/comm
bash
$ cat /proc/722541/comm
counter.py
$ cat /proc/722542/comm
tee

In this output I am first retrieving the PID of the first process in the container and then I am looking for that PID and child processes on the system where the container is running. I am seeing three processes and the first one is "bash" which is PID 1 inside of the containers PID namespace. Then I am looking at /proc/<PID>/comm and I can find the exact same value as in the checkpoint image.

Important to remember is that the checkpoint will contain the view from within the container's PID namespace because that information is important to restore the processes.

One last example of what crit can tell us about the container is the information about the UTS namespace:

$ crit show checkpoint/utsns-12.img
{
    "magic": "UTSNS",
    "entries": [
        {
            "nodename": "counters",
            "domainname": "(none)"
        }
    ]
}

This tells me that the hostname inside of the UTS namespace is counters.

For every resource CRIU collected during checkpointing the checkpoint/ directory contains corresponding image files which can be analyzed with the help of crit.

Looking at the memory pages

In addition to the information from CRIU that can be decoded with the help of CRIT, there are also files containing the raw memory pages written by CRIU to disk:

$ ls  checkpoint/pages-*
checkpoint/pages-1.img  checkpoint/pages-2.img  checkpoint/pages-3.img

When I initially used the container I stored a random key (RANDOM_1432_KEY) somewhere in the memory. Let see if I can find it:

$ grep -ao RANDOM_1432_KEY checkpoint/pages-*
checkpoint/pages-2.img:RANDOM_1432_KEY

And indeed, there is my data. This way I can easily look at the content of all memory pages of the processes in the container, but it is also important to remember that anyone that can access the checkpoint archive has access to all information that was stored in the memory of the container's processes.

Using gdb for further analysis

Another possibility to look at the checkpoint images is gdb. The CRIU repository contains the script coredump which can convert a checkpoint into a coredump file:

$ /home/criu/coredump/coredump-python3
$ ls -al core*
core.1  core.7  core.8

Running the coredump-python3 script will convert the checkpoint images into one coredump file for each process in the container. Using gdb I can also look at the details of the processes:

$ echo info registers | gdb --core checkpoint/core.1 -q

[New LWP 1]

Core was generated by `bash -c /home/counter/counter.py 2>&1 | tee /home/counter/logfile'.

#0  0x00007fefba110198 in ?? ()
(gdb)
rax            0x3d                61
rbx            0x8                 8
rcx            0x7fefba11019a      140667595587994
rdx            0x0                 0
rsi            0x7fffed9c1110      140737179816208
rdi            0xffffffff          4294967295
rbp            0x1                 0x1
rsp            0x7fffed9c10e8      0x7fffed9c10e8
r8             0x1                 1
r9             0x0                 0
r10            0x0                 0
r11            0x246               582
r12            0x0                 0
r13            0x7fffed9c1170      140737179816304
r14            0x0                 0
r15            0x0                 0
rip            0x7fefba110198      0x7fefba110198
eflags         0x246               [ PF ZF IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0

In this example I can see the value of all registers as they were during checkpointing and I can also see the complete command-line of my container's PID 1 process: bash -c /home/counter/counter.py 2>&1 | tee /home/counter/logfile

Summary

With the help of container checkpointing, it is possible to create a checkpoint of a running container without stopping the container and without the container knowing that it was checkpointed. The result of checkpointing a container in Kubernetes is a checkpoint archive; using different tools like checkpointctl, tar, crit and gdb the checkpoint can be analyzed. Even with simple tools like grep it is possible to find information in the checkpoint archive.

The different examples I have shown in this article how to analyze a checkpoint are just the starting point. Depending on your requirements it is possible to look at certain things in much more detail, but this article should give you an introduction how to start the analysis of your checkpoint.

How do I get involved?

You can reach SIG Node by several means:

Introducing KWOK: Kubernetes WithOut Kubelet

KWOK logo

Have you ever wondered how to set up a cluster of thousands of nodes just in seconds, how to simulate real nodes with a low resource footprint, and how to test your Kubernetes controller at scale without spending much on infrastructure?

If you answered "yes" to any of these questions, then you might be interested in KWOK, a toolkit that enables you to create a cluster of thousands of nodes in seconds.

What is KWOK?

KWOK stands for Kubernetes WithOut Kubelet. So far, it provides two tools:

kwok
kwok is the cornerstone of this project, responsible for simulating the lifecycle of fake nodes, pods, and other Kubernetes API resources.
kwokctl
kwokctl is a CLI tool designed to streamline the creation and management of clusters, with nodes simulated by kwok.

Why use KWOK?

KWOK has several advantages:

  • Speed: You can create and delete clusters and nodes almost instantly, without waiting for boot or provisioning.
  • Compatibility: KWOK works with any tools or clients that are compliant with Kubernetes APIs, such as kubectl, helm, kui, etc.
  • Portability: KWOK has no specific hardware or software requirements. You can run it using pre-built images, once Docker or Nerdctl is installed. Alternatively, binaries are also available for all platforms and can be easily installed.
  • Flexibility: You can configure different node types, labels, taints, capacities, conditions, etc., and you can configure different pod behaviors, status, etc. to test different scenarios and edge cases.
  • Performance: You can simulate thousands of nodes on your laptop without significant consumption of CPU or memory resources.

What are the use cases?

KWOK can be used for various purposes:

  • Learning: You can use KWOK to learn about Kubernetes concepts and features without worrying about resource waste or other consequences.
  • Development: You can use KWOK to develop new features or tools for Kubernetes without accessing to a real cluster or requiring other components.
  • Testing:
    • You can measure how well your application or controller scales with different numbers of nodes and(or) pods.
    • You can generate high loads on your cluster by creating many pods or services with different resource requests or limits.
    • You can simulate node failures or network partitions by changing node conditions or randomly deleting nodes.
    • You can test how your controller interacts with other components or features of Kubernetes by enabling different feature gates or API versions.

What are the limitations?

KWOK is not intended to replace others completely. It has some limitations that you should be aware of:

  • Functionality: KWOK is not a kubelet and may exhibit different behaviors in areas such as pod lifecycle management, volume mounting, and device plugins. Its primary function is to simulate updates of node and pod status.
  • Accuracy: It's important to note that KWOK doesn't accurately reflect the performance or behavior of real nodes under various workloads or environments. Instead, it approximates some behaviors using simple formulas.
  • Security: KWOK does not enforce any security policies or mechanisms on simulated nodes. It assumes that all requests from the kube-apiserver are authorized and valid.

Getting started

If you are interested in trying out KWOK, please check its documents for more details.

Animation of a terminal showing kwokctl in use

Using kwokctl to manage simulated clusters

Getting Involved

If you're interested in participating in future discussions or development related to KWOK, there are several ways to get involved:

We welcome feedback and contributions from anyone who wants to join us in this exciting project.

Free Katacoda Kubernetes Tutorials Are Shutting Down

Katacoda, the popular learning platform from O’Reilly that has been helping people learn all about Java, Docker, Kubernetes, Python, Go, C++, and more, shut down for public use in June 2022. However, tutorials specifically for Kubernetes, linked from the Kubernetes website for our project’s users and contributors, remained available and active after this change. Unfortunately, this will no longer be the case, and Katacoda tutorials for learning Kubernetes will cease working after March 31st, 2023.

The Kubernetes Project wishes to thank O'Reilly Media for the many years it has supported the community via the Katacoda learning platform. You can read more about the decision to shutter katacoda.com on O'Reilly's own site. With this change, we’ll be focusing on the work needed to remove links to their various tutorials. We have a general issue tracking this topic at #33936 and GitHub discussion. We’re also interested in researching what other learning platforms could be beneficial for the Kubernetes community, replacing Katacoda with a link to a platform or service that has a similar user experience. However, this research will take time, so we’re actively looking for volunteers to help with this work. If a replacement is found, it will need to be supported by Kubernetes leadership, specifically, SIG Contributor Experience, SIG Docs, and the Kubernetes Steering Committee.

The Katacoda shutdown affects 25 tutorial pages, their localizations, as well as the Katacoda Scenario repository: github.com/katacoda-scenarios/kubernetes-bootcamp-scenarios. We recommend that any links, guides, or documentation you have that points to the Katacoda learning platform be updated immediately to reflect this change. While we have yet to find a replacement learning solution, the Kubernetes website contains a lot of helpful documentation to support your continued learning and growth. You can find all of our available documentation tutorials for Kubernetes at https://k8s.io/docs/tutorials/.

If you have any questions regarding the Katacoda shutdown, or subsequent link removal from Kubernetes tutorial pages, please feel free to comment on the general issue tracking the shutdown, or visit the #sig-docs channel on the Kubernetes Slack.

k8s.gcr.io Image Registry Will Be Frozen From the 3rd of April 2023

The Kubernetes project runs a community-owned image registry called registry.k8s.io to host its container images. On the 3rd of April 2023, the old registry k8s.gcr.io will be frozen and no further images for Kubernetes and related subprojects will be pushed to the old registry.

This registry registry.k8s.io replaced the old one and has been generally available for several months. We have published a blog post about its benefits to the community and the Kubernetes project. This post also announced that future versions of Kubernetes will not be available in the old registry. Now that time has come.

What does this change mean for contributors:

  • If you are a maintainer of a subproject, you will need to update your manifests and Helm charts to use the new registry.

What does this change mean for end users:

  • 1.27 Kubernetes release will not be published to the old registry.
  • Patch releases for 1.24, 1.25, and 1.26 will no longer be published to the old registry from April. Please read the timelines below for details of the final patch releases in the old registry.
  • Starting in 1.25, the default image registry has been set to registry.k8s.io. This value is overridable in kubeadm and kubelet but setting it to k8s.gcr.io will fail for new releases after April as they won’t be present in the old registry.
  • If you want to increase the reliability of your cluster and remove dependency on the community-owned registry or you are running Kubernetes in networks where external traffic is restricted, you should consider hosting local image registry mirrors. Some cloud vendors may offer hosted solutions for this.

Timeline of the changes

  • k8s.gcr.io will be frozen on the 3rd of April 2023
  • 1.27 is expected to be released on the 12th of April 2023
  • The last 1.23 release on k8s.gcr.io will be 1.23.18 (1.23 goes end-of-life before the freeze)
  • The last 1.24 release on k8s.gcr.io will be 1.24.12
  • The last 1.25 release on k8s.gcr.io will be 1.25.8
  • The last 1.26 release on k8s.gcr.io will be 1.26.3

What's next

Please make sure your cluster does not have dependencies on old image registry. For example, you can run this command to list the images used by pods:

kubectl get pods --all-namespaces -o jsonpath="{.items[*].spec.containers[*].image}" |\
tr -s '[[:space:]]' '\n' |\
sort |\
uniq -c

There may be other dependencies on the old image registry. Make sure you review any potential dependencies to keep your cluster healthy and up to date.

Acknowledgments

Change is hard, and evolving our image-serving platform is needed to ensure a sustainable future for the project. We strive to make things better for everyone using Kubernetes. Many contributors from all corners of our community have been working long and hard to ensure we are making the best decisions possible, executing plans, and doing our best to communicate those plans.

Thanks to Aaron Crickenberger, Arnaud Meukam, Benjamin Elder, Caleb Woodbine, Davanum Srinivas, Mahamed Ali, and Tim Hockin from SIG K8s Infra, Brian McQueen, and Sergey Kanzhelev from SIG Node, Lubomir Ivanov from SIG Cluster Lifecycle, Adolfo García Veytia, Jeremy Rickard, Sascha Grunert, and Stephen Augustus from SIG Release, Bob Killen and Kaslin Fields from SIG Contribex, Tim Allclair from the Security Response Committee. Also a big thank you to our friends acting as liaisons with our cloud provider partners: Jay Pipes from Amazon and Jon Johnson Jr. from Google.

Spotlight on SIG Instrumentation

Observability requires the right data at the right time for the right consumer (human or piece of software) to make the right decision. In the context of Kubernetes, having best practices for cluster observability across all Kubernetes components is crucial.

SIG Instrumentation helps to address this issue by providing best practices and tools that all other SIGs use to instrument Kubernetes components-like the API server, scheduler, kubelet and kube-controller-manager.

In this SIG Instrumentation spotlight, Imran Noor Mohamed, SIG ContribEx-Comms tech lead talked with Elana Hashman, and Han Kang, chairs of SIG Instrumentation, on how the SIG is organized, what are the current challenges and how anyone can get involved and contribute.

About SIG Instrumentation

Imran (INM): Hello, thank you for the opportunity of learning more about SIG Instrumentation. Could you tell us a bit about yourself, your role, and how you got involved in SIG Instrumentation?

Han (HK): I started in SIG Instrumentation in 2018, and became a chair in 2020. I primarily got involved with SIG instrumentation due to a number of upstream issues with metrics which ended up affecting GKE in bad ways. As a result, we ended up launching an initiative to stabilize our metrics and make metrics a proper API.

Elana (EH): I also joined SIG Instrumentation in 2018 and became a chair at the same time as Han. I was working as a site reliability engineer (SRE) on bare metal Kubernetes clusters and was working to build out our observability stack. I encountered some issues with label joins where Kubernetes metrics didn’t match kube-state-metrics (KSM) and started participating in SIG meetings to improve things. I helped test performance improvements to kube-state-metrics and ultimately coauthored a KEP for overhauling metrics in the 1.14 release to improve usability.

Imran (INM): Interesting! Does that mean SIG Instrumentation involves a lot of plumbing?

Han (HK): I wouldn’t say it involves a ton of plumbing, though it does touch basically every code base. We have our own dedicated directories for our metrics, logs, and tracing frameworks which we tend to work out of primarily. We do have to interact with other SIGs in order to propagate our changes which makes us more of a horizontal SIG.

Imran (INM): Speaking about interaction and coordination with other SIG could you describe how the SIGs is organized?

Elana (EH): In SIG Instrumentation, we have two chairs, Han and myself, as well as two tech leads, David Ashpole and Damien Grisonnet. We all work together as the SIG’s leads in order to run meetings, triage issues and PRs, review and approve KEPs, plan for each release, present at KubeCon and community meetings, and write our annual report. Within the SIG we also have a number of important subprojects, each of which is stewarded by its subproject owners. For example, Marek Siarkowicz is a subproject owner of metrics-server.

Because we’re a horizontal SIG, some of our projects have a wide scope and require coordination from a dedicated group of contributors. For example, in order to guide the Kubernetes migration to structured logging, we chartered the Structured Logging Working Group (WG), organized by Marek and Patrick Ohly. The WG doesn’t own any code, but helps with various components such as the kubelet, scheduler, etc. in migrating their code to use structured logs.

Imran (INM): Walking through the charter alone it’s clear that SIG Instrumentation has a lot of sub-projects. Could you highlight some important ones?

Han (HK): We have many different sub-projects and we are in dire need of people who can come and help shepherd them. Our most important projects in-tree (that is, within the kubernetes/kubernetes repo) are metrics, tracing, and, structured logging. Our most important projects out-of-tree are (a) KSM (kube-state-metrics) and (b) metrics-server.

Elana (EH): Echoing this, we would love to bring on more maintainers for kube-state-metrics and metrics-server. Our friends at WG Structured Logging are also looking for contributors. Other subprojects include klog, prometheus-adapter, and a new subproject that we just launched for collecting high-fidelity, scalable utilization metrics called usage-metrics-collector. All are seeking new contributors!

Current status and ongoing challenges

Imran (INM): For release 1.26 we can see that there are a relevant number of metrics, logs, and tracing KEPs in the pipeline. Would you like to point out important things for last release (maybe alpha & stable milestone candidates?)

Han (HK): We can now generate documentation for every single metric in the main Kubernetes code base! We have a pretty fancy static analysis pipeline that enables this functionality. We’ve also added feature metrics so that you can look at your metrics to determine which features are enabled in your cluster at a given time. Lastly, we added a component-sli endpoint, which should make it easy for people to create availability SLOs for control-plane components.

Elana (EH): We’ve also been working on tracing KEPs for both the API server and kubelet, though neither graduated in 1.26. I’m also really excited about the work Han is doing with WG Reliability to extend and improve our metrics stability framework.

Imran (INM): What do you think are the Kubernetes-specific challenges tackled by the SIG Instrumentation? What are the future efforts to solve them?

Han (HK): SIG instrumentation suffered a bit in the past from being a horizontal SIG. We did not have an obvious location to put our code and did not have a good mechanism to audit metrics that people would randomly add. We’ve fixed this over the years and now we have dedicated spots for our code and a reliable mechanism for auditing new metrics. We also now offer stability guarantees for metrics. We hope to have full-blown tracing up and down the kubernetes stack, and metric support via exemplars.

Elana (EH): I think SIG Instrumentation is a really interesting SIG because it poses different kinds of opportunities to get involved than in other SIGs. You don’t have to be a software developer to contribute to our SIG! All of our components and subprojects are focused on better understanding Kubernetes and its performance in production, which allowed me to get involved as one of the few SIG Chairs working as an SRE at that time. I like that we provide opportunities for newcomers to contribute through using, testing, and providing feedback on our subprojects, which is a lower barrier to entry. Because many of these projects are out-of-tree, I think one of our challenges is to figure out what’s in scope for core Kubernetes SIGs instrumentation subprojects, what’s missing, and then fill in the gaps.

Community and contribution

Imran (INM): Kubernetes values community over products. Any recommendation for anyone looking into getting involved in SIG Instrumentation work? Where should they start (new contributor-friendly areas within SIG?)

Han(HK) and Elana (EH): Come to our bi-weekly triage meetings! They aren’t recorded and are a great place to ask questions and learn about our ongoing work. We strive to be a friendly community and one of the easiest SIGs to get started with. You can check out our latest KubeCon NA 2022 SIG Instrumentation Deep Dive to get more insight into our work. We also invite you to join our Slack channel #sig-instrumentation and feel free to reach out to any of our SIG leads or subproject owners directly.

Thank you so much for your time and insights into the workings of SIG Instrumentation!

Consider All Microservices Vulnerable — And Monitor Their Behavior

This post warns Devops from a false sense of security. Following security best practices when developing and configuring microservices do not result in non-vulnerable microservices. The post shows that although all deployed microservices are vulnerable, there is much that can be done to ensure microservices are not exploited. It explains how analyzing the behavior of clients and services from a security standpoint, named here "Security-Behavior Analytics", can protect the deployed vulnerable microservices. It points to Guard, an open source project offering security-behavior monitoring and control of Kubernetes microservices presumed vulnerable.

As cyber attacks continue to intensify in sophistication, organizations deploying cloud services continue to grow their cyber investments aiming to produce safe and non-vulnerable services. However, the year-by-year growth in cyber investments does not result in a parallel reduction in cyber incidents. Instead, the number of cyber incidents continues to grow annually. Evidently, organizations are doomed to fail in this struggle - no matter how much effort is made to detect and remove cyber weaknesses from deployed services, it seems offenders always have the upper hand.

Considering the current spread of offensive tools, sophistication of offensive players, and ever-growing cyber financial gains to offenders, any cyber strategy that relies on constructing a non-vulnerable, weakness-free service in 2023 is clearly too naïve. It seems the only viable strategy is to:

Admit that your services are vulnerable!

In other words, consciously accept that you will never create completely invulnerable services. If your opponents find even a single weakness as an entry-point, you lose! Admitting that in spite of your best efforts, all your services are still vulnerable is an important first step. Next, this post discusses what you can do about it...

How to protect microservices from being exploited

Being vulnerable does not necessarily mean that your service will be exploited. Though your services are vulnerable in some ways unknown to you, offenders still need to identify these vulnerabilities and then exploit them. If offenders fail to exploit your service vulnerabilities, you win! In other words, having a vulnerability that can’t be exploited, represents a risk that can’t be realized.

Image of an example of offender gaining foothold in a service

Figure 1. An Offender gaining foothold in a vulnerable service

The above diagram shows an example in which the offender does not yet have a foothold in the service; that is, it is assumed that your service does not run code controlled by the offender on day 1. In our example the service has vulnerabilities in the API exposed to clients. To gain an initial foothold the offender uses a malicious client to try and exploit one of the service API vulnerabilities. The malicious client sends an exploit that triggers some unplanned behavior of the service.

More specifically, let’s assume the service is vulnerable to an SQL injection. The developer failed to sanitize the user input properly, thereby allowing clients to send values that would change the intended behavior. In our example, if a client sends a query string with key “username” and value of “tom or 1=1”, the client will receive the data of all users. Exploiting this vulnerability requires the client to send an irregular string as the value. Note that benign users will not be sending a string with spaces or with the equal sign character as a username, instead they will normally send legal usernames which for example may be defined as a short sequence of characters a-z. No legal username can trigger service unplanned behavior.

In this simple example, one can already identify several opportunities to detect and block an attempt to exploit the vulnerability (un)intentionally left behind by the developer, making the vulnerability unexploitable. First, the malicious client behavior differs from the behavior of benign clients, as it sends irregular requests. If such a change in behavior is detected and blocked, the exploit will never reach the service. Second, the service behavior in response to the exploit differs from the service behavior in response to a regular request. Such behavior may include making subsequent irregular calls to other services such as a data store, taking irregular time to respond, and/or responding to the malicious client with an irregular response (for example, containing much more data than normally sent in case of benign clients making regular requests). Service behavioral changes, if detected, will also allow blocking the exploit in different stages of the exploitation attempt.

More generally:

  • Monitoring the behavior of clients can help detect and block exploits against service API vulnerabilities. In fact, deploying efficient client behavior monitoring makes many vulnerabilities unexploitable and others very hard to achieve. To succeed, the offender needs to create an exploit undetectable from regular requests.

  • Monitoring the behavior of services can help detect services as they are being exploited regardless of the attack vector used. Efficient service behavior monitoring limits what an attacker may be able to achieve as the offender needs to ensure the service behavior is undetectable from regular service behavior.

Combining both approaches may add a protection layer to the deployed vulnerable services, drastically decreasing the probability for anyone to successfully exploit any of the deployed vulnerable services. Next, let us identify four use cases where you need to use security-behavior monitoring.

Use cases

One can identify the following four different stages in the life of any service from a security standpoint. In each stage, security-behavior monitoring is required to meet different challenges:

Service StateUse caseWhat do you need in order to cope with this use case?
NormalNo known vulnerabilities: The service owner is normally not aware of any known vulnerabilities in the service image or configuration. Yet, it is reasonable to assume that the service has weaknesses.Provide generic protection against any unknown, zero-day, service vulnerabilities - Detect/block irregular patterns sent as part of incoming client requests that may be used as exploits.
VulnerableAn applicable CVE is published: The service owner is required to release a new non-vulnerable revision of the service. Research shows that in practice this process of removing a known vulnerability may take many weeks to accomplish (2 months on average).Add protection based on the CVE analysis - Detect/block incoming requests that include specific patterns that may be used to exploit the discovered vulnerability. Continue to offer services, although the service has a known vulnerability.
ExploitableA known exploit is published: The service owner needs a way to filter incoming requests that contain the known exploit.Add protection based on a known exploit signature - Detect/block incoming client requests that carry signatures identifying the exploit. Continue to offer services, although the presence of an exploit.
MisusedAn offender misuses pods backing the service: The offender can follow an attack pattern enabling him/her to misuse pods. The service owner needs to restart any compromised pods while using non compromised pods to continue offering the service. Note that once a pod is restarted, the offender needs to repeat the attack pattern before he/she may again misuse it.Identify and restart instances of the component that is being misused - At any given time, some backing pods may be compromised and misused, while others behave as designed. Detect/remove the misused pods while allowing other pods to continue servicing client requests.

Fortunately, microservice architecture is well suited to security-behavior monitoring as discussed next.

Security-Behavior of microservices versus monoliths

Kubernetes is often used to support workloads designed with microservice architecture. By design, microservices aim to follow the UNIX philosophy of "Do One Thing And Do It Well". Each microservice has a bounded context and a clear interface. In other words, you can expect the microservice clients to send relatively regular requests and the microservice to present a relatively regular behavior as a response to these requests. Consequently, a microservice architecture is an excellent candidate for security-behavior monitoring.

Image showing why microservices are well suited for security-behavior monitoring

Figure 2. Microservices are well suited for security-behavior monitoring

The diagram above clarifies how dividing a monolithic service to a set of microservices improves our ability to perform security-behavior monitoring and control. In a monolithic service approach, different client requests are intertwined, resulting in a diminished ability to identify irregular client behaviors. Without prior knowledge, an observer of the intertwined client requests will find it hard to distinguish between types of requests and their related characteristics. Further, internal client requests are not exposed to the observer. Lastly, the aggregated behavior of the monolithic service is a compound of the many different internal behaviors of its components, making it hard to identify irregular service behavior.

In a microservice environment, each microservice is expected by design to offer a more well-defined service and serve better defined type of requests. This makes it easier for an observer to identify irregular client behavior and irregular service behavior. Further, a microservice design exposes the internal requests and internal services which offer more security-behavior data to identify irregularities by an observer. Overall, this makes the microservice design pattern better suited for security-behavior monitoring and control.

Security-Behavior monitoring on Kubernetes

Kubernetes deployments seeking to add Security-Behavior may use Guard, developed under the CNCF project Knative. Guard is integrated into the full Knative automation suite that runs on top of Kubernetes. Alternatively, you can deploy Guard as a standalone tool to protect any HTTP-based workload on Kubernetes.

See:

  • Guard on Github, for using Guard as a standalone tool.
  • The Knative automation suite - Read about Knative, in the blog post Opinionated Kubernetes which describes how Knative simplifies and unifies the way web services are deployed on Kubernetes.
  • You may contact Guard maintainers on the SIG Security Slack channel or on the Knative community security Slack channel. The Knative community channel will move soon to the CNCF Slack under the name #knative-security.

The goal of this post is to invite the Kubernetes community to action and introduce Security-Behavior monitoring and control to help secure Kubernetes based deployments. Hopefully, the community as a follow up will:

  1. Analyze the cyber challenges presented for different Kubernetes use cases
  2. Add appropriate security documentation for users on how to introduce Security-Behavior monitoring and control.
  3. Consider how to integrate with tools that can help users monitor and control their vulnerable services.

Getting involved

You are welcome to get involved and join the effort to develop security behavior monitoring and control for Kubernetes; to share feedback and contribute to code or documentation; and to make or suggest improvements of any kind.

Protect Your Mission-Critical Pods From Eviction With PriorityClass

Pod priority and preemption help to make sure that mission-critical pods are up in the event of a resource crunch by deciding order of scheduling and eviction.

Kubernetes has been widely adopted, and many organizations use it as their de-facto orchestration engine for running workloads that need to be created and deleted frequently.

Therefore, proper scheduling of the pods is key to ensuring that application pods are up and running within the Kubernetes cluster without any issues. This article delves into the use cases around resource management by leveraging the PriorityClass object to protect mission-critical or high-priority pods from getting evicted and making sure that the application pods are up, running, and serving traffic.

Resource management in Kubernetes

The control plane consists of multiple components, out of which the scheduler (usually the built-in kube-scheduler) is one of the components which is responsible for assigning a node to a pod.

Whenever a pod is created, it enters a "pending" state, after which the scheduler determines which node is best suited for the placement of the new pod.

In the background, the scheduler runs as an infinite loop looking for pods without a nodeName set that are ready for scheduling. For each Pod that needs scheduling, the scheduler tries to decide which node should run that Pod.

If the scheduler cannot find any node, the pod remains in the pending state, which is not ideal.

Note:

To name a few, nodeSelector, taints and tolerations, nodeAffinity, the rank of nodes based on available resources (for example, CPU and memory), and several other criteria are used to determine the pod's placement.

The below diagram, from point number 1 through 4, explains the request flow:

A diagram showing the scheduling of three Pods that a client has directly created.

Scheduling in Kubernetes

Typical use cases

Below are some real-life scenarios where control over the scheduling and eviction of pods may be required.

  1. Let's say the pod you plan to deploy is critical, and you have some resource constraints. An example would be the DaemonSet of an infrastructure component like Grafana Loki. The Loki pods must run before other pods can on every node. In such cases, you could ensure resource availability by manually identifying and deleting the pods that are not required or by adding a new node to the cluster. Both these approaches are unsuitable since the former would be tedious to execute, and the latter could involve an expenditure of time and money.

  2. Another use case could be a single cluster that holds the pods for the below environments with associated priorities:

    • Production (prod): top priority
    • Preproduction (preprod): intermediate priority
    • Development (dev): least priority

    In the event of high resource consumption in the cluster, there is competition for CPU and memory resources on the nodes. While cluster-level autoscaling may add more nodes, it takes time. In the interim, if there are no further nodes to scale the cluster, some Pods could remain in a Pending state, or the service could be degraded as they compete for resources. If the kubelet does evict a Pod from the node, that eviction would be random because the kubelet doesn’t have any special information about which Pods to evict and which to keep.

  3. A third example could be a microservice backed by a queuing application or a database running into a resource crunch and the queue or database getting evicted. In such a case, all the other services would be rendered useless until the database can serve traffic again.

There can also be other scenarios where you want to control the order of scheduling or order of eviction of pods.

PriorityClasses in Kubernetes

PriorityClass is a cluster-wide API object in Kubernetes and part of the scheduling.k8s.io/v1 API group. It contains a mapping of the PriorityClass name (defined in .metadata.name) and an integer value (defined in .value). This represents the value that the scheduler uses to determine Pod's relative priority.

Additionally, when you create a cluster using kubeadm or a managed Kubernetes service (for example, Azure Kubernetes Service), Kubernetes uses PriorityClasses to safeguard the pods that are hosted on the control plane nodes. This ensures that critical cluster components such as CoreDNS and kube-proxy can run even if resources are constrained.

This availability of pods is achieved through the use of a special PriorityClass that ensures the pods are up and running and that the overall cluster is not affected.

$ kubectl get priorityclass
NAME                      VALUE        GLOBAL-DEFAULT   AGE
system-cluster-critical   2000000000   false            82m
system-node-critical      2000001000   false            82m

The diagram below shows exactly how it works with the help of an example, which will be detailed in the upcoming section.

A flow chart that illustrates how the kube-scheduler prioritizes new Pods and potentially preempts existing Pods

Pod scheduling and preemption

Pod priority and preemption

Pod preemption is a Kubernetes feature that allows the cluster to preempt pods (removing an existing Pod in favor of a new Pod) on the basis of priority. Pod priority indicates the importance of a pod relative to other pods while scheduling. If there aren't enough resources to run all the current pods, the scheduler tries to evict lower-priority pods over high-priority ones.

Also, when a healthy cluster experiences a node failure, typically, lower-priority pods get preempted to create room for higher-priority pods on the available node. This happens even if the cluster can bring up a new node automatically since pod creation is usually much faster than bringing up a new node.

PriorityClass requirements

Before you set up PriorityClasses, there are a few things to consider.

  1. Decide which PriorityClasses are needed. For instance, based on environment, type of pods, type of applications, etc.
  2. The default PriorityClass resource for your cluster. The pods without a priorityClassName will be treated as priority 0.
  3. Use a consistent naming convention for all PriorityClasses.
  4. Make sure that the pods for your workloads are running with the right PriorityClass.

PriorityClass hands-on example

Let’s say there are 3 application pods: one for prod, one for preprod, and one for development. Below are three sample YAML manifest files for each of those.

---
# development
apiVersion: v1
kind: Pod
metadata:
  name: dev-nginx
  labels:
    env: dev
spec:
  containers:
  - name: dev-nginx
    image: nginx
    resources:
      requests:
        memory: "256Mi"
        cpu: "0.2"
      limits:
        memory: ".5Gi"
        cpu: "0.5"
---
# preproduction
apiVersion: v1
kind: Pod
metadata:
  name: preprod-nginx
  labels:
    env: preprod
spec:
  containers:
  - name: preprod-nginx
    image: nginx
    resources:
      requests:
        memory: "1.5Gi"
        cpu: "1.5"
      limits:
        memory: "2Gi"
        cpu: "2"
---
# production
apiVersion: v1
kind: Pod
metadata:
  name: prod-nginx
  labels:
    env: prod
spec:
  containers:
  - name: prod-nginx
    image: nginx
    resources:
      requests:
        memory: "2Gi"
        cpu: "2"
      limits:
        memory: "2Gi"
        cpu: "2"

You can create these pods with the kubectl create -f <FILE.yaml> command, and then check their status using the kubectl get pods command. You can see if they are up and look ready to serve traffic:

$ kubectl get pods --show-labels
NAME            READY   STATUS    RESTARTS   AGE   LABELS
dev-nginx       1/1     Running   0          55s   env=dev
preprod-nginx   1/1     Running   0          55s   env=preprod
prod-nginx      0/1     Pending   0          55s   env=prod

Bad news. The pod for the Production environment is still Pending and isn't serving any traffic.

Let's see why this is happening:

$ kubectl get events
...
...
5s          Warning   FailedScheduling   pod/prod-nginx      0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.

In this example, there is only one worker node, and that node has a resource crunch.

Now, let's look at how PriorityClass can help in this situation since prod should be given higher priority than the other environments.

PriorityClass API

Before creating PriorityClasses based on these requirements, let's see what a basic manifest for a PriorityClass looks like and outline some prerequisites:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: PRIORITYCLASS_NAME
value: 0 # any integer value between -1000000000 to 1000000000 
description: >-
  (Optional) description goes here!
globalDefault: false # or true. Only one PriorityClass can be the global default.

Below are some prerequisites for PriorityClasses:

  • The name of a PriorityClass must be a valid DNS subdomain name.
  • When you make your own PriorityClass, the name should not start with system-, as those names are reserved by Kubernetes itself (for example, they are used for two built-in PriorityClasses).
  • Its absolute value should be between -1000000000 to 1000000000 (1 billion).
  • Larger numbers are reserved by PriorityClasses such as system-cluster-critical (this Pod is critically important to the cluster) and system-node-critical (the node critically relies on this Pod). system-node-critical is a higher priority than system-cluster-critical, because a cluster-critical Pod can only work well if the node where it is running has all its node-level critical requirements met.
  • There are two optional fields:
    • globalDefault: When true, this PriorityClass is used for pods where a priorityClassName is not specified. Only one PriorityClass with globalDefault set to true can exist in a cluster.
      If there is no PriorityClass defined with globalDefault set to true, all the pods with no priorityClassName defined will be treated with 0 priority (i.e. the least priority).
    • description: A string with a meaningful value so that people know when to use this PriorityClass.

Note:

Adding a PriorityClass with globalDefault set to true does not mean it will apply the same to the existing pods that are already running. This will be applicable only to the pods that came into existence after the PriorityClass was created.

PriorityClass in action

Here's an example. Next, create some environment-specific PriorityClasses:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: dev-pc
value: 1000000
globalDefault: false
description: >-
  (Optional) This priority class should only be used for all development pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: preprod-pc
value: 2000000
globalDefault: false
description: >-
  (Optional) This priority class should only be used for all preprod pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prod-pc
value: 4000000
globalDefault: false
description: >-
  (Optional) This priority class should only be used for all prod pods.

Use kubectl create -f <FILE.YAML> command to create a pc and kubectl get pc to check its status.

$ kubectl get pc
NAME                      VALUE        GLOBAL-DEFAULT   AGE
dev-pc                    1000000      false            3m13s
preprod-pc                2000000      false            2m3s
prod-pc                   4000000      false            7s
system-cluster-critical   2000000000   false            82m
system-node-critical      2000001000   false            82m

The new PriorityClasses are in place now. A small change is needed in the pod manifest or pod template (in a ReplicaSet or Deployment). In other words, you need to specify the priority class name at .spec.priorityClassName (which is a string value).

First update the previous production pod manifest file to have a PriorityClass assigned, then delete the Production pod and recreate it. You can't edit the priority class for a Pod that already exists.

In my cluster, when I tried this, here's what happened. First, that change seems successful; the status of pods has been updated:

$ kubectl get pods --show-labels
NAME            READY   STATUS    	RESTARTS   AGE   LABELS
dev-nginx       1/1     Terminating	0          55s   env=dev
preprod-nginx   1/1     Running   	0          55s   env=preprod
prod-nginx      0/1     Pending   	0          55s   env=prod

The dev-nginx pod is getting terminated. Once that is successfully terminated and there are enough resources for the prod pod, the control plane can schedule the prod pod:

Warning   FailedScheduling   pod/prod-nginx    0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.
Normal    Preempted          pod/dev-nginx     by default/prod-nginx on node node01
Normal    Killing            pod/dev-nginx     Stopping container dev-nginx
Normal    Scheduled          pod/prod-nginx    Successfully assigned default/prod-nginx to node01
Normal    Pulling            pod/prod-nginx    Pulling image "nginx"
Normal    Pulled             pod/prod-nginx    Successfully pulled image "nginx"
Normal    Created            pod/prod-nginx    Created container prod-nginx
Normal    Started            pod/prod-nginx    Started container prod-nginx

Enforcement

When you set up PriorityClasses, they exist just how you defined them. However, people (and tools) that make changes to your cluster are free to set any PriorityClass, or to not set any PriorityClass at all. However, you can use other Kubernetes features to make sure that the priorities you wanted are actually applied.

As an alpha feature, you can define a ValidatingAdmissionPolicy and a ValidatingAdmissionPolicyBinding so that, for example, Pods that go into the prod namespace must use the prod-pc PriorityClass. With another ValidatingAdmissionPolicyBinding you ensure that the preprod namespace uses the preprod-pc PriorityClass, and so on. In any cluster, you can enforce similar controls using external projects such as Kyverno or Gatekeeper, through validating admission webhooks.

However you do it, Kubernetes gives you options to make sure that the PriorityClasses are used how you wanted them to be, or perhaps just to warn users when they pick an unsuitable option.

Summary

The above example and its events show you what this feature of Kubernetes brings to the table, along with several scenarios where you can use this feature. To reiterate, this helps ensure that mission-critical pods are up and available to serve the traffic and, in the case of a resource crunch, determines cluster behavior.

It gives you some power to decide the order of scheduling and order of preemption for Pods. Therefore, you need to define the PriorityClasses sensibly. For example, if you have a cluster autoscaler to add nodes on demand, make sure to run it with the system-cluster-critical PriorityClass. You don't want to get in a situation where the autoscaler has been preempted and there are no new nodes coming online.

If you have any queries or feedback, feel free to reach out to me on LinkedIn.

Kubernetes 1.26: Eviction policy for unhealthy pods guarded by PodDisruptionBudgets

Ensuring the disruptions to your applications do not affect its availability isn't a simple task. Last month's release of Kubernetes v1.26 lets you specify an unhealthy pod eviction policy for PodDisruptionBudgets (PDBs) to help you maintain that availability during node management operations. In this article, we will dive deeper into what modifications were introduced for PDBs to give application owners greater flexibility in managing disruptions.

What problems does this solve?

API-initiated eviction of pods respects PodDisruptionBudgets (PDBs). This means that a requested voluntary disruption via an eviction to a Pod, should not disrupt a guarded application and .status.currentHealthy of a PDB should not fall below .status.desiredHealthy. Running pods that are Unhealthy do not count towards the PDB status, but eviction of these is only possible in case the application is not disrupted. This helps disrupted or not yet started application to achieve availability as soon as possible without additional downtime that would be caused by evictions.

Unfortunately, this poses a problem for cluster administrators that would like to drain nodes without any manual interventions. Misbehaving applications with pods in CrashLoopBackOff state (due to a bug or misconfiguration) or pods that are simply failing to become ready make this task much harder. Any eviction request will fail due to violation of a PDB, when all pods of an application are unhealthy. Draining of a node cannot make any progress in that case.

On the other hand there are users that depend on the existing behavior, in order to:

  • prevent data-loss that would be caused by deleting pods that are guarding an underlying resource or storage
  • achieve the best availability possible for their application

Kubernetes 1.26 introduced a new experimental field to the PodDisruptionBudget API: .spec.unhealthyPodEvictionPolicy. When enabled, this field lets you support both of those requirements.

How does it work?

API-initiated eviction is the process that triggers graceful pod termination. The process can be initiated either by calling the API directly, by using a kubectl drain command, or other actors in the cluster. During this process every pod removal is consulted with appropriate PDBs, to ensure that a sufficient number of pods is always running in the cluster.

The following policies allow PDB authors to have a greater control how the process deals with unhealthy pods.

There are two policies IfHealthyBudget and AlwaysAllow to choose from.

The former, IfHealthyBudget, follows the existing behavior to achieve the best availability that you get by default. Unhealthy pods can be disrupted only if their application has a minimum available .status.desiredHealthy number of pods.

By setting the spec.unhealthyPodEvictionPolicy field of your PDB to AlwaysAllow, you are choosing the best effort availability for your application. With this policy it is always possible to evict unhealthy pods. This will make it easier to maintain and upgrade your clusters.

We think that AlwaysAllow will often be a better choice, but for some critical workloads you may still prefer to protect even unhealthy Pods from node drains or other forms of API-initiated eviction.

How do I use it?

This is an alpha feature, which means you have to enable the PDBUnhealthyPodEvictionPolicy feature gate, with the command line argument --feature-gates=PDBUnhealthyPodEvictionPolicy=true to the kube-apiserver.

Here's an example. Assume that you've enabled the feature gate in your cluster, and that you already defined a Deployment that runs a plain webserver. You labelled the Pods for that Deployment with app: nginx. You want to limit avoidable disruption, and you know that best effort availability is sufficient for this app. You decide to allow evictions even if those webserver pods are unhealthy. You create a PDB to guard this application, with the AlwaysAllow policy for evicting unhealthy pods:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  selector:
    matchLabels:
      app: nginx
  maxUnavailable: 1
  unhealthyPodEvictionPolicy: AlwaysAllow

How can I learn more?

How do I get involved?

If you have any feedback, please reach out to us in the #sig-apps channel on Slack (visit https://slack.k8s.io/ for an invitation if you need one), or on the SIG Apps mailing list: kubernetes-sig-apps@googlegroups.com

Kubernetes v1.26: Retroactive Default StorageClass

The v1.25 release of Kubernetes introduced an alpha feature to change how a default StorageClass was assigned to a PersistentVolumeClaim (PVC). With the feature enabled, you no longer need to create a default StorageClass first and PVC second to assign the class. Additionally, any PVCs without a StorageClass assigned can be updated later. This feature was graduated to beta in Kubernetes v1.26.

You can read retroactive default StorageClass assignment in the Kubernetes documentation for more details about how to use that, or you can read on to learn about why the Kubernetes project is making this change.

Why did StorageClass assignment need improvements

Users might already be familiar with a similar feature that assigns default StorageClasses to new PVCs at the time of creation. This is currently handled by the admission controller.

But what if there wasn't a default StorageClass defined at the time of PVC creation? Users would end up with a PVC that would never be assigned a class. As a result, no storage would be provisioned, and the PVC would be somewhat "stuck" at this point. Generally, two main scenarios could result in "stuck" PVCs and cause problems later down the road. Let's take a closer look at each of them.

Changing default StorageClass

With the alpha feature enabled, there were two options admins had when they wanted to change the default StorageClass:

  1. Creating a new StorageClass as default before removing the old one associated with the PVC. This would result in having two defaults for a short period. At this point, if a user were to create a PersistentVolumeClaim with storageClassName set to null (implying default StorageClass), the newest default StorageClass would be chosen and assigned to this PVC.

  2. Removing the old default first and creating a new default StorageClass. This would result in having no default for a short time. Subsequently, if a user were to create a PersistentVolumeClaim with storageClassName set to null (implying default StorageClass), the PVC would be in Pending state forever. The user would have to fix this by deleting the PVC and recreating it once the default StorageClass was available.

Resource ordering during cluster installation

If a cluster installation tool needed to create resources that required storage, for example, an image registry, it was difficult to get the ordering right. This is because any Pods that required storage would rely on the presence of a default StorageClass and would fail to be created if it wasn't defined.

What changed

We've changed the PersistentVolume (PV) controller to assign a default StorageClass to any unbound PersistentVolumeClaim that has the storageClassName set to null. We've also modified the PersistentVolumeClaim admission within the API server to allow the change of values from an unset value to an actual StorageClass name.

Null storageClassName versus storageClassName: "" - does it matter?

Before this feature was introduced, those values were equal in terms of behavior. Any PersistentVolumeClaim with the storageClassName set to null or "" would bind to an existing PersistentVolume resource with storageClassName also set to null or "".

With this new feature enabled we wanted to maintain this behavior but also be able to update the StorageClass name. With these constraints in mind, the feature changes the semantics of null. If a default StorageClass is present, null would translate to "Give me a default" and "" would mean "Give me PersistentVolume that also has "" StorageClass name." In the absence of a StorageClass, the behavior would remain unchanged.

Summarizing the above, we've changed the semantics of null so that its behavior depends on the presence or absence of a definition of default StorageClass.

The tables below show all these cases to better describe when PVC binds and when its StorageClass gets updated.

PVC binding behavior with Retroactive default StorageClass
PVC storageClassName = ""PVC storageClassName = null
Without default classPV storageClassName = ""bindsbinds
PV without storageClassNamebindsbinds
With default classPV storageClassName = ""bindsclass updates
PV without storageClassNamebindsclass updates

How to use it

If you want to test the feature whilst it's alpha, you need to enable the relevant feature gate in the kube-controller-manager and the kube-apiserver. Use the --feature-gates command line argument:

--feature-gates="...,RetroactiveDefaultStorageClass=true"

Test drive

If you would like to see the feature in action and verify it works fine in your cluster here's what you can try:

  1. Define a basic PersistentVolumeClaim:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: pvc-1
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
    
  2. Create the PersistentVolumeClaim when there is no default StorageClass. The PVC won't provision or bind (unless there is an existing, suitable PV already present) and will remain in Pending state.

    kubectl get pvc
    

    The output is similar to this:

    NAME      STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    pvc-1     Pending   
    
  3. Configure one StorageClass as default.

    kubectl patch sc -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
    

    The output is similar to this:

    storageclass.storage.k8s.io/my-storageclass patched
    
  4. Verify that PersistentVolumeClaims is now provisioned correctly and was updated retroactively with new default StorageClass.

    kubectl get pvc
    

    The output is similar to this:

    NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
    pvc-1     Bound    pvc-06a964ca-f997-4780-8627-b5c3bf5a87d8   1Gi        RWO            my-storageclass   87m
    

New metrics

To help you see that the feature is working as expected we also introduced a new retroactive_storageclass_total metric to show how many times that the PV controller attempted to update PersistentVolumeClaim, and retroactive_storageclass_errors_total to show how many of those attempts failed.

Getting involved

We always welcome new contributors so if you would like to get involved you can join our Kubernetes Storage Special-Interest-Group (SIG).

If you would like to share feedback, you can do so on our public Slack channel.

Special thanks to all the contributors that provided great reviews, shared valuable insight and helped implement this feature (alphabetical order):

Kubernetes v1.26: Alpha support for cross-namespace storage data sources

Kubernetes v1.26, released last month, introduced an alpha feature that lets you specify a data source for a PersistentVolumeClaim, even where the source data belong to a different namespace. With the new feature enabled, you specify a namespace in the dataSourceRef field of a new PersistentVolumeClaim. Once Kubernetes checks that access is OK, the new PersistentVolume can populate its data from the storage source specified in that other namespace. Before Kubernetes v1.26, provided your cluster had the AnyVolumeDataSource feature enabled, you could already provision new volumes from a data source in the same namespace. However, that only worked for the data source in the same namespace, therefore users couldn't provision a PersistentVolume with a claim in one namespace from a data source in other namespace. To solve this problem, Kubernetes v1.26 added a new alpha namespace field to dataSourceRef field in PersistentVolumeClaim the API.

How it works

Once the csi-provisioner finds that a data source is specified with a dataSourceRef that has a non-empty namespace name, it checks all reference grants within the namespace that's specified by the.spec.dataSourceRef.namespace field of the PersistentVolumeClaim, in order to see if access to the data source is allowed. If any ReferenceGrant allows access, the csi-provisioner provisions a volume from the data source.

Trying it out

The following things are required to use cross namespace volume provisioning:

  • Enable the AnyVolumeDataSource and CrossNamespaceVolumeDataSource feature gates for the kube-apiserver and kube-controller-manager
  • Install a CRD for the specific VolumeSnapShot controller
  • Install the CSI Provisioner controller and enable the CrossNamespaceVolumeDataSource feature gate
  • Install the CSI driver
  • Install a CRD for ReferenceGrants

Putting it all together

To see how this works, you can install the sample and try it out. This sample do to create PVC in dev namespace from VolumeSnapshot in prod namespace. That is a simple example. For real world use, you might want to use a more complex approach.

Assumptions for this example

  • Your Kubernetes cluster was deployed with AnyVolumeDataSource and CrossNamespaceVolumeDataSource feature gates enabled
  • There are two namespaces, dev and prod
  • CSI driver is being deployed
  • There is an existing VolumeSnapshot named new-snapshot-demo in the prod namespace
  • The ReferenceGrant CRD (from the Gateway API project) is already deployed

Grant ReferenceGrants read permission to the CSI Provisioner

Access to ReferenceGrants is only needed when the CSI driver has the CrossNamespaceVolumeDataSource controller capability. For this example, the external-provisioner needs get, list, and watch permissions for referencegrants (API group gateway.networking.k8s.io).

  - apiGroups: ["gateway.networking.k8s.io"]
    resources: ["referencegrants"]
    verbs: ["get", "list", "watch"]

Enable the CrossNamespaceVolumeDataSource feature gate for the CSI Provisioner

Add --feature-gates=CrossNamespaceVolumeDataSource=true to the csi-provisioner command line. For example, use this manifest snippet to redefine the container:

      - args:
        - -v=5
        - --csi-address=/csi/csi.sock
        - --feature-gates=Topology=true
        - --feature-gates=CrossNamespaceVolumeDataSource=true
        image: csi-provisioner:latest
        imagePullPolicy: IfNotPresent
        name: csi-provisioner

Create a ReferenceGrant

Here's a manifest for an example ReferenceGrant.

apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-prod-pvc
  namespace: prod
spec:
  from:
  - group: ""
    kind: PersistentVolumeClaim
    namespace: dev
  to:
  - group: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: new-snapshot-demo

Create a PersistentVolumeClaim by using cross namespace data source

Kubernetes creates a PersistentVolumeClaim on dev and the CSI driver populates the PersistentVolume used on dev from snapshots on prod.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc
  namespace: dev
spec:
  storageClassName: example
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  dataSourceRef:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: new-snapshot-demo
    namespace: prod
  volumeMode: Filesystem

How can I learn more?

The enhancement proposal, Provision volumes from cross-namespace snapshots, includes lots of detail about the history and technical implementation of this feature.

Please get involved by joining the Kubernetes Storage Special Interest Group (SIG) to help us enhance this feature. There are a lot of good ideas already and we'd be thrilled to have more!

Acknowledgments

It takes a wonderful group to make wonderful software. Special thanks to the following people for the insightful reviews, thorough consideration and valuable contribution to the CrossNamespaceVolumeDataSouce feature:

  • Michelle Au (msau42)
  • Xing Yang (xing-yang)
  • Masaki Kimura (mkimuram)
  • Tim Hockin (thockin)
  • Ben Swartzlander (bswartz)
  • Rob Scott (robscott)
  • John Griffith (j-griffith)
  • Michael Henriksen (mhenriks)
  • Mustafa Elbehery (Elbehery)

It’s been a joy to work with y'all on this.

Kubernetes v1.26: Advancements in Kubernetes Traffic Engineering

Kubernetes v1.26 includes significant advancements in network traffic engineering with the graduation of two features (Service internal traffic policy support, and EndpointSlice terminating conditions) to GA, and a third feature (Proxy terminating endpoints) to beta. The combination of these enhancements aims to address short-comings in traffic engineering that people face today, and unlock new capabilities for the future.

Traffic Loss from Load Balancers During Rolling Updates

Prior to Kubernetes v1.26, clusters could experience loss of traffic from Service load balancers during rolling updates when setting the externalTrafficPolicy field to Local. There are a lot of moving parts at play here so a quick overview of how Kubernetes manages load balancers might help!

In Kubernetes, you can create a Service with type: LoadBalancer to expose an application externally with a load balancer. The load balancer implementation varies between clusters and platforms, but the Service provides a generic abstraction representing the load balancer that is consistent across all Kubernetes installations.

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app.kubernetes.io/name: my-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376
  type: LoadBalancer

Under the hood, Kubernetes allocates a NodePort for the Service, which is then used by kube-proxy to provide a network data path from the NodePort to the Pod. A controller will then add all available Nodes in the cluster to the load balancer’s backend pool, using the designated NodePort for the Service as the backend target port.

Figure 1: Overview of Service load balancers

Figure 1: Overview of Service load balancers

Oftentimes it is beneficial to set externalTrafficPolicy: Local for Services, to avoid extra hops between Nodes that are not running healthy Pods backing that Service. When using externalTrafficPolicy: Local, an additional NodePort is allocated for health checking purposes, such that Nodes that do not contain healthy Pods are excluded from the backend pool for a load balancer.

Figure 2: Load balancer traffic to a healthy Node, when externalTrafficPolicy is Local

Figure 2: Load balancer traffic to a healthy Node, when externalTrafficPolicy is Local

One such scenario where traffic can be lost is when a Node loses all Pods for a Service, but the external load balancer has not probed the health check NodePort yet. The likelihood of this situation is largely dependent on the health checking interval configured on the load balancer. The larger the interval, the more likely this will happen, since the load balancer will continue to send traffic to a node even after kube-proxy has removed forwarding rules for that Service. This also occurrs when Pods start terminating during rolling updates. Since Kubernetes does not consider terminating Pods as “Ready”, traffic can be loss when there are only terminating Pods on any given Node during a rolling update.

Figure 3: Load balancer traffic to terminating endpoints, when externalTrafficPolicy is Local

Figure 3: Load balancer traffic to terminating endpoints, when externalTrafficPolicy is Local

Starting in Kubernetes v1.26, kube-proxy enables the ProxyTerminatingEndpoints feature by default, which adds automatic failover and routing to terminating endpoints in scenarios where the traffic would otherwise be dropped. More specifically, when there is a rolling update and a Node only contains terminating Pods, kube-proxy will route traffic to the terminating Pods based on their readiness. In addition, kube-proxy will actively fail the health check NodePort if there are only terminating Pods available. By doing so, kube-proxy alerts the external load balancer that new connections should not be sent to that Node but will gracefully handle requests for existing connections.

Figure 4: Load Balancer traffic to terminating endpoints with ProxyTerminatingEndpoints enabled, when externalTrafficPolicy is Local

Figure 4: Load Balancer traffic to terminating endpoints with ProxyTerminatingEndpoints enabled, when externalTrafficPolicy is Local

EndpointSlice Conditions

In order to support this new capability in kube-proxy, the EndpointSlice API introduced new conditions for endpoints: serving and terminating.

Figure 5: Overview of EndpointSlice conditions

Figure 5: Overview of EndpointSlice conditions

The serving condition is semantically identical to ready, except that it can be true or false while a Pod is terminating, unlike ready which will always be false for terminating Pods for compatibility reasons. The terminating condition is true for Pods undergoing termination (non-empty deletionTimestamp), false otherwise.

The addition of these two conditions enables consumers of this API to understand Pod states that were previously not possible. For example, we can now track "ready" and "not ready" Pods that are also terminating.

Figure 6: EndpointSlice conditions with a terminating Pod

Figure 6: EndpointSlice conditions with a terminating Pod

Consumers of the EndpointSlice API, such as Kube-proxy and Ingress Controllers, can now use these conditions to coordinate connection draining events, by continuing to forward traffic for existing connections but rerouting new connections to other non-terminating endpoints.

Optimizing Internal Node-Local Traffic

Similar to how Services can set externalTrafficPolicy: Local to avoid extra hops for externally sourced traffic, Kubernetes now supports internalTrafficPolicy: Local, to enable the same optimization for traffic originating within the cluster, specifically for traffic using the Service Cluster IP as the destination address. This feature graduated to Beta in Kubernetes v1.24 and is graduating to GA in v1.26.

Services default the internalTrafficPolicy field to Cluster, where traffic is randomly distributed to all endpoints.

Figure 7: Service routing when internalTrafficPolicy is Cluster

Figure 7: Service routing when internalTrafficPolicy is Cluster

When internalTrafficPolicy is set to Local, kube-proxy will forward internal traffic for a Service only if there is an available endpoint that is local to the same Node.

Figure 8: Service routing when internalTrafficPolicy is Local

Figure 8: Service routing when internalTrafficPolicy is Local

Caution:

When using internalTrafficPoliy: Local, traffic will be dropped by kube-proxy when no local endpoints are available.

Getting Involved

If you're interested in future discussions on Kubernetes traffic engineering, you can get involved in SIG Network through the following ways:

Kubernetes 1.26: Job Tracking, to Support Massively Parallel Batch Workloads, Is Generally Available

The Kubernetes 1.26 release includes a stable implementation of the Job controller that can reliably track a large amount of Jobs with high levels of parallelism. SIG Apps and WG Batch have worked on this foundational improvement since Kubernetes 1.22. After multiple iterations and scale verifications, this is now the default implementation of the Job controller.

Paired with the Indexed completion mode, the Job controller can handle massively parallel batch Jobs, supporting up to 100k concurrent Pods.

The new implementation also made possible the development of Pod failure policy, which is in beta in the 1.26 release.

How do I use this feature?

To use Job tracking with finalizers, upgrade to Kubernetes 1.25 or newer and create new Jobs. You can also use this feature in v1.23 and v1.24, if you have the ability to enable the JobTrackingWithFinalizers feature gate.

If your cluster runs Kubernetes 1.26, Job tracking with finalizers is a stable feature. For v1.25, it's behind that feature gate, and your cluster administrators may have explicitly disabled it - for example, if you have a policy of not using beta features.

Jobs created before the upgrade will still be tracked using the legacy behavior. This is to avoid retroactively adding finalizers to running Pods, which might introduce race conditions.

For maximum performance on large Jobs, the Kubernetes project recommends using the Indexed completion mode. In this mode, the control plane is able to track Job progress with less API calls.

If you are a developer of operator(s) for batch, HPC, AI, ML or related workloads, we encourage you to use the Job API to delegate accurate progress tracking to Kubernetes. If there is something missing in the Job API that forces you to manage plain Pods, the Working Group Batch welcomes your feedback and contributions.

Deprecation notices

During the development of the feature, the control plane added the annotation batch.kubernetes.io/job-tracking to the Jobs that were created when the feature was enabled. This allowed a safe transition for older Jobs, but it was never meant to stay.

In the 1.26 release, we deprecated the annotation batch.kubernetes.io/job-tracking and the control plane will stop adding it in Kubernetes 1.27. Along with that change, we will remove the legacy Job tracking implementation. As a result, the Job controller will track all Jobs using finalizers and it will ignore Pods that don't have the aforementioned finalizer.

Before you upgrade your cluster to 1.27, we recommend that you verify that there are no running Jobs that don't have the annotation, or you wait for those jobs to complete. Otherwise, you might observe the control plane recreating some Pods. We expect that this shouldn't affect any users, as the feature is enabled by default since Kubernetes 1.25, giving enough buffer for old jobs to complete.

What problem does the new implementation solve?

Generally, Kubernetes workload controllers, such as ReplicaSet or StatefulSet, rely on the existence of Pods or other objects in the API to determine the status of the workload and whether replacements are needed. For example, if a Pod that belonged to a ReplicaSet terminates or ceases to exist, the ReplicaSet controller needs to create a replacement Pod to satisfy the desired number of replicas (.spec.replicas).

Since its inception, the Job controller also relied on the existence of Pods in the API to track Job status. A Job has completion and failure handling policies, requiring the end state of a finished Pod to determine whether to create a replacement Pod or mark the Job as completed or failed. As a result, the Job controller depended on Pods, even terminated ones, to remain in the API in order to keep track of the status.

This dependency made the tracking of Job status unreliable, because Pods can be deleted from the API for a number of reasons, including:

  • The garbage collector removing orphan Pods when a Node goes down.
  • The garbage collector removing terminated Pods when they reach a threshold.
  • The Kubernetes scheduler preempting a Pod to accommodate higher priority Pods.
  • The taint manager evicting a Pod that doesn't tolerate a NoExecute taint.
  • External controllers, not included as part of Kubernetes, or humans deleting Pods.

The new implementation

When a controller needs to take an action on objects before they are removed, it should add a finalizer to the objects that it manages. A finalizer prevents the objects from being deleted from the API until the finalizers are removed. Once the controller is done with the cleanup and accounting for the deleted object, it can remove the finalizer from the object and the control plane removes the object from the API.

This is what the new Job controller is doing: adding a finalizer during Pod creation, and removing the finalizer after the Pod has terminated and has been accounted for in the Job status. However, it wasn't that simple.

The main challenge is that there are at least two objects involved: the Pod and the Job. While the finalizer lives in the Pod object, the accounting lives in the Job object. There is no mechanism to atomically remove the finalizer in the Pod and update the counters in the Job status. Additionally, there could be more than one terminated Pod at a given time.

To solve this problem, we implemented a three staged approach, each translating to an API call.

  1. For each terminated Pod, add the unique ID (UID) of the Pod into short-lived lists stored in the .status of the owning Job (.status.uncountedTerminatedPods).
  2. Remove the finalizer from the Pods(s).
  3. Atomically do the following operations:
    • remove UIDs from the short-lived lists
    • increment the overall succeeded and failed counters in the status of the Job.

Additional complications come from the fact that the Job controller might receive the results of the API changes in steps 1 and 2 out of order. We solved this by adding an in-memory cache for removed finalizers.

Still, we faced some issues during the beta stage, leaving some pods stuck with finalizers in some conditions (#108645, #109485, and #111646). As a result, we decided to switch that feature gate to be disabled by default for the 1.23 and 1.24 releases.

Once resolved, we re-enabled the feature for the 1.25 release. Since then, we have received reports from our customers running tens of thousands of Pods at a time in their clusters through the Job API. Seeing this success, we decided to graduate the feature to stable in 1.26, as part of our long term commitment to make the Job API the best way to run large batch Jobs in a Kubernetes cluster.

To learn more about the feature, you can read the KEP.

Acknowledgments

As with any Kubernetes feature, multiple people contributed to getting this done, from testing and filing bugs to reviewing code.

On behalf of SIG Apps, I would like to especially thank Jordan Liggitt (Google) for helping me debug and brainstorm solutions for more than one race condition and Maciej Szulik (Red Hat) for his thorough reviews.

Kubernetes v1.26: CPUManager goes GA

The CPU Manager is a part of the kubelet, the Kubernetes node agent, which enables the user to allocate exclusive CPUs to containers. Since Kubernetes v1.10, where it graduated to Beta, the CPU Manager proved itself reliable and fulfilled its role of allocating exclusive CPUs to containers, so adoption has steadily grown making it a staple component of performance-critical and low-latency setups. Over time, most changes were about bugfixes or internal refactoring, with the following noteworthy user-visible changes:

The CPU Manager reached the point on which it "just works", so in Kubernetes v1.26 it has graduated to generally available (GA).

Customization options for CPU Manager

The CPU Manager supports two operation modes, configured using its policies. With the none policy, the CPU Manager allocates CPUs to containers without any specific constraint except the (optional) quota set in the Pod spec. With the static policy, then provided that the pod is in the Guaranteed QoS class and every container in that Pod requests an integer amount of vCPU cores, then the CPU Manager allocates CPUs exclusively. Exclusive assignment means that other containers (whether from the same Pod, or from a different Pod) do not get scheduled onto that CPU.

This simple operational model served the user base pretty well, but as the CPU Manager matured more and more, users started to look at more elaborate use cases and how to better support them.

Rather than add more policies, the community realized that pretty much all the novel use cases are some variation of the behavior enabled by the static CPU Manager policy. Hence, it was decided to add options to tune the behavior of the static policy. The options have a varying degree of maturity, like any other Kubernetes feature, and in order to be accepted, each new option provides a backward compatible behavior when disabled, and to document how to interact with each other, should they interact at all.

This enabled the Kubernetes project to graduate to GA the CPU Manager core component and core CPU allocation algorithms to GA, while also enabling a new age of experimentation in this area. In Kubernetes v1.26, the CPU Manager supports three different policy options:

full-pcpus-only
restrict the CPU Manager core allocation algorithm to full physical cores only, reducing noisy neighbor issues from hardware technologies that allow sharing cores.
distribute-cpus-across-numa
drive the CPU Manager to evenly distribute CPUs across NUMA nodes, for cases where more than one NUMA node is required to satisfy the allocation.
align-by-socket
change how the CPU Manager allocates CPUs to a container: consider CPUs to be aligned at the socket boundary, instead of NUMA node boundary.

Further development

After graduating the main CPU Manager feature, each existing policy option will follow their graduation process, independent from CPU Manager and from each other option. There is room for new options to be added, but there's also a growing demand for even more flexibility than what the CPU Manager, and its policy options, currently grant.

Conversations are in progress in the community about splitting the CPU Manager and the other resource managers currently part of the kubelet executable into pluggable, independent kubelet plugins. If you are interested in this effort, please join the conversation on SIG Node communication channels (Slack, mailing list, weekly meeting).

Further reading

Please check out the Control CPU Management Policies on the Node task page to learn more about the CPU Manager, and how it fits in relation to the other node-level resource managers.

Getting involved

This feature is driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!

Kubernetes 1.26: Pod Scheduling Readiness

Kubernetes 1.26 introduced a new Pod feature: scheduling gates. In Kubernetes, scheduling gates are keys that tell the scheduler when a Pod is ready to be considered for scheduling.

What problem does it solve?

When a Pod is created, the scheduler will continuously attempt to find a node that fits it. This infinite loop continues until the scheduler either finds a node for the Pod, or the Pod gets deleted.

Pods that remain unschedulable for long periods of time (e.g., ones that are blocked on some external event) waste scheduling cycles. A scheduling cycle may take ≅20ms or more depending on the complexity of the Pod's scheduling constraints. Therefore, at scale, those wasted cycles significantly impact the scheduler's performance. See the arrows in the "scheduler" box below.

graph LR; pod((New Pod))-->queue subgraph Scheduler queue(scheduler queue) sched_cycle[/scheduling cycle/] schedulable{schedulable?} queue==>|Pop out|sched_cycle sched_cycle==>schedulable schedulable==>|No|queue subgraph note [Cycles wasted on keep rescheduling 'unready' Pods] end end classDef plain fill:#ddd,stroke:#fff,stroke-width:1px,color:#000; classDef k8s fill:#326ce5,stroke:#fff,stroke-width:1px,color:#fff; classDef Scheduler fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5; classDef note fill:#edf2ae,stroke:#fff,stroke-width:1px; class queue,sched_cycle,schedulable k8s; class pod plain; class note note; class Scheduler Scheduler;

Scheduling gates helps address this problem. It allows declaring that newly created Pods are not ready for scheduling. When scheduling gates are present on a Pod, the scheduler ignores the Pod and therefore saves unnecessary scheduling attempts. Those Pods will also be ignored by Cluster Autoscaler if you have it installed in the cluster.

Clearing the gates is the responsibility of external controllers with knowledge of when the Pod should be considered for scheduling (e.g., a quota manager).

graph LR; pod((New Pod))-->queue subgraph Scheduler queue(scheduler queue) sched_cycle[/scheduling cycle/] schedulable{schedulable?} popout{Pop out?} queue==>|PreEnqueue check|popout popout-->|Yes|sched_cycle popout==>|No|queue sched_cycle-->schedulable schedulable-->|No|queue subgraph note [A knob to gate Pod's scheduling] end end classDef plain fill:#ddd,stroke:#fff,stroke-width:1px,color:#000; classDef k8s fill:#326ce5,stroke:#fff,stroke-width:1px,color:#fff; classDef Scheduler fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5; classDef note fill:#edf2ae,stroke:#fff,stroke-width:1px; classDef popout fill:#f96,stroke:#fff,stroke-width:1px; class queue,sched_cycle,schedulable k8s; class pod plain; class note note; class popout popout; class Scheduler Scheduler;

How does it work?

Scheduling gates in general works very similar to Finalizers. Pods with a non-empty spec.schedulingGates field will show as status SchedulingGated and be blocked from scheduling. Note that more than one gate can be added, but they all should be added upon Pod creation (e.g., you can add them as part of the spec or via a mutating webhook).

NAME       READY   STATUS            RESTARTS   AGE
test-pod   0/1     SchedulingGated   0          10s

To clear the gates, you update the Pod by removing all of the items from the Pod's schedulingGates field. The gates do not need to be removed all at once, but only when all the gates are removed the scheduler will start to consider the Pod for scheduling.

Under the hood, scheduling gates are implemented as a PreEnqueue scheduler plugin, a new scheduler framework extension point that is invoked at the beginning of each scheduling cycle.

Use Cases

An important use case this feature enables is dynamic quota management. Kubernetes supports ResourceQuota, however the API Server enforces quota at the time you attempt Pod creation. For example, if a new Pod exceeds the CPU quota, it gets rejected. The API Server doesn't queue the Pod; therefore, whoever created the Pod needs to continuously attempt to recreate it again. This either means a delay between resources becoming available and the Pod actually running, or it means load on the API server and Scheduler due to constant attempts.

Scheduling gates allows an external quota manager to address the above limitation of ResourceQuota. Specifically, the manager could add a example.com/quota-check scheduling gate to all Pods created in the cluster (using a mutating webhook). The manager would then remove the gate when there is quota to start the Pod.

Whats next?

To use this feature, the PodSchedulingReadiness feature gate must be enabled in the API Server and scheduler. You're more than welcome to test it out and tell us (SIG Scheduling) what you think!

Additional resources

Kubernetes 1.26: Support for Passing Pod fsGroup to CSI Drivers At Mount Time

Delegation of fsGroup to CSI drivers was first introduced as alpha in Kubernetes 1.22, and graduated to beta in Kubernetes 1.25. For Kubernetes 1.26, we are happy to announce that this feature has graduated to General Availability (GA).

In this release, if you specify a fsGroup in the security context, for a (Linux) Pod, all processes in the pod's containers are part of the additional group that you specified.

In previous Kubernetes releases, the kubelet would always apply the fsGroup ownership and permission changes to files in the volume according to the policy you specified in the Pod's .spec.securityContext.fsGroupChangePolicy field.

Starting with Kubernetes 1.26, CSI drivers have the option to apply the fsGroup settings during volume mount time, which frees the kubelet from changing the permissions of files and directories in those volumes.

How does it work?

CSI drivers that support this feature should advertise the VOLUME_MOUNT_GROUP node capability.

After recognizing this information, the kubelet passes the fsGroup information to the CSI driver during pod startup. This is done through the NodeStageVolumeRequest and NodePublishVolumeRequest CSI calls.

Consequently, the CSI driver is expected to apply the fsGroup to the files in the volume using a mount option. As an example, Azure File CSIDriver utilizes the gid mount option to map the fsGroup information to all the files in the volume.

It should be noted that in the example above the kubelet refrains from directly applying the permission changes into the files and directories in that volume files. Additionally, two policy definitions no longer have an effect: neither .spec.fsGroupPolicy for the CSIDriver object, nor .spec.securityContext.fsGroupChangePolicy for the Pod.

For more details about the inner workings of this feature, check out the enhancement proposal and the CSI Driver fsGroup Support in the CSI developer documentation.

Why is it important?

Without this feature, applying the fsGroup information to files is not possible in certain storage environments.

For instance, Azure File does not support a concept of POSIX-style ownership and permissions of files. The CSI driver is only able to set the file permissions at the volume level.

How do I use it?

This feature should be mostly transparent to users. If you maintain a CSI driver that should support this feature, read CSI Driver fsGroup Support for more information on how to support this feature in your CSI driver.

Existing CSI drivers that do not support this feature will continue to work as usual: they will not receive any fsGroup information from the kubelet. In addition to that, the kubelet will continue to perform the ownership and permissions changes to files for those volumes, according to the policies specified in .spec.fsGroupPolicy for the CSIDriver and .spec.securityContext.fsGroupChangePolicy for the relevant Pod.

Kubernetes v1.26: GA Support for Kubelet Credential Providers

Kubernetes v1.26 introduced generally available (GA) support for kubelet credential provider plugins, offering an extensible plugin framework to dynamically fetch credentials for any container image registry.

Background

Kubernetes supports the ability to dynamically fetch credentials for a container registry service. Prior to Kubernetes v1.20, this capability was compiled into the kubelet and only available for Amazon Elastic Container Registry, Azure Container Registry, and Google Cloud Container Registry.

Figure 1: Kubelet built-in credential provider support for Amazon Elastic Container Registry, Azure Container Registry, and Google Cloud Container Registry.

Figure 1: Kubelet built-in credential provider support for Amazon Elastic Container Registry, Azure Container Registry, and Google Cloud Container Registry.

Kubernetes v1.20 introduced alpha support for kubelet credential providers plugins, which provides a mechanism for the kubelet to dynamically authenticate and pull images for arbitrary container registries - whether these are public registries, managed services, or even a self-hosted registry. In Kubernetes v1.26, this feature is now GA

Figure 2: Kubelet credential provider overview

Figure 2: Kubelet credential provider overview

Why is it important?

Prior to Kubernetes v1.20, if you wanted to dynamically fetch credentials for image registries other than ACR (Azure Container Registry), ECR (Elastic Container Registry), or GCR (Google Container Registry), you needed to modify the kubelet code. The new plugin mechanism can be used in any cluster, and lets you authenticate to new registries without any changes to Kubernetes itself. Any cloud provider or vendor can publish a plugin that lets you authenticate with their image registry.

How it works

The kubelet and the exec plugin binary communicate through stdio (stdin, stdout, and stderr) by sending and receiving json-serialized api-versioned types. If the exec plugin is enabled and the kubelet requires authentication information for an image that matches against a plugin, the kubelet will execute the plugin binary, passing the CredentialProviderRequest API via stdin. Then the exec plugin communicates with the container registry to dynamically fetch the credentials and returns the credentials in an encoded response of the CredentialProviderResponse API to the kubelet via stdout.

Figure 3: Kubelet credential provider plugin flow

Figure 3: Kubelet credential provider plugin flow

On receiving credentials from the kubelet, the plugin can also indicate how long credentials can be cached for, to prevent unnecessary execution of the plugin by the kubelet for subsequent image pull requests to the same registry. In cases where the cache duration is not specified by the plugin, a default cache duration can be specified by the kubelet (more details below).

{
  "apiVersion": "kubelet.k8s.io/v1",
  "kind": "CredentialProviderResponse",
  "auth": {
    "cacheDuration": "6h",
    "private-registry.io/my-app": {
      "username": "exampleuser",
      "password": "token12345"
    }
  }
}

In addition, the plugin can specify the scope in which cached credentials are valid for. This is specified through the cacheKeyType field in CredentialProviderResponse. When the value is Image, the kubelet will only use cached credentials for future image pulls that exactly match the image of the first request. When the value is Registry, the kubelet will use cached credentials for any subsequent image pulls destined for the same registry host but using different paths (for example, gcr.io/foo/bar and gcr.io/bar/foo refer to different images from the same registry). Lastly, when the value is Global, the kubelet will use returned credentials for all images that match against the plugin, including images that can map to different registry hosts (for example, gcr.io vs registry.k8s.io (previously k8s.gcr.io)). The cacheKeyType field is required by plugin implementations.

{
  "apiVersion": "kubelet.k8s.io/v1",
  "kind": "CredentialProviderResponse",
  "auth": {
    "cacheKeyType": "Registry",
    "private-registry.io/my-app": {
      "username": "exampleuser",
      "password": "token12345"
    }
  }
}

Using kubelet credential providers

You can configure credential providers by installing the exec plugin(s) into a local directory accessible by the kubelet on every node. Then you set two command line arguments for the kubelet:

  • --image-credential-provider-config: the path to the credential provider plugin config file.
  • --image-credential-provider-bin-dir: the path to the directory where credential provider plugin binaries are located.

The configuration file passed into --image-credential-provider-config is read by the kubelet to determine which exec plugins should be invoked for a container image used by a Pod. Note that the name of each provider must match the name of the binary located in the local directory specified in --image-credential-provider-bin-dir, otherwise the kubelet cannot locate the path of the plugin to invoke.

kind: CredentialProviderConfig
apiVersion: kubelet.config.k8s.io/v1
providers:
- name: auth-provider-gcp
  apiVersion: credentialprovider.kubelet.k8s.io/v1
  matchImages:
  - "container.cloud.google.com"
  - "gcr.io"
  - "*.gcr.io"
  - "*.pkg.dev"
  args:
  - get-credentials
  - --v=3
  defaultCacheDuration: 1m

Below is an overview of how the Kubernetes project is using kubelet credential providers for end-to-end testing.

Figure 4: Kubelet credential provider configuration used for Kubernetes e2e testing

Figure 4: Kubelet credential provider configuration used for Kubernetes e2e testing

For more configuration details, see Kubelet Credential Providers.

Getting Involved

Come join SIG Node if you want to report bugs or have feature requests for the Kubelet Credential Provider. You can reach us through the following ways:

Kubernetes 1.26: Introducing Validating Admission Policies

In Kubernetes 1.26, the 1st alpha release of validating admission policies is available!

Validating admission policies use the Common Expression Language (CEL) to offer a declarative, in-process alternative to validating admission webhooks.

CEL was first introduced to Kubernetes for the Validation rules for CustomResourceDefinitions. This enhancement expands the use of CEL in Kubernetes to support a far wider range of admission use cases.

Admission webhooks can be burdensome to develop and operate. Webhook developers must implement and maintain a webhook binary to handle admission requests. Also, admission webhooks are complex to operate. Each webhook must be deployed, monitored and have a well defined upgrade and rollback plan. To make matters worse, if a webhook times out or becomes unavailable, the Kubernetes control plane can become unavailable. This enhancement avoids much of this complexity of admission webhooks by embedding CEL expressions into Kubernetes resources instead of calling out to a remote webhook binary.

For example, to set a limit on how many replicas a Deployment can have. Start by defining a validation policy:

apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy
metadata:
  name: "demo-policy.example.com"
spec:
  matchConstraints:
    resourceRules:
    - apiGroups:   ["apps"]
      apiVersions: ["v1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["deployments"]
  validations:
    - expression: "object.spec.replicas <= 5"

The expression field contains the CEL expression that is used to validate admission requests. matchConstraints declares what types of requests this ValidatingAdmissionPolicy is may validate.

Next bind the policy to the appropriate resources:

apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "demo-binding-test.example.com"
spec:
  policyName: "demo-policy.example.com"
  matchResources:
    namespaceSelector:
      matchExpressions:
      - key: environment
        operator: In
        values:
        - test

This ValidatingAdmissionPolicyBinding resource binds the above policy only to namespaces where the environment label is set to test. Once this binding is created, the kube-apiserver will begin enforcing this admission policy.

To emphasize how much simpler this approach is than admission webhooks, if this example were instead implemented with a webhook, an entire binary would need to be developed and maintained just to perform a <= check. In our review of a wide range of admission webhooks used in production, the vast majority performed relatively simple checks, all of which can easily be expressed using CEL.

Validation admission policies are highly configurable, enabling policy authors to define policies that can be parameterized and scoped to resources as needed by cluster administrators.

For example, the above admission policy can be modified to make it configurable:

apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy
metadata:
  name: "demo-policy.example.com"
spec:
  paramKind:
    apiVersion: rules.example.com/v1 # You also need a CustomResourceDefinition for this API
    kind: ReplicaLimit
  matchConstraints:
    resourceRules:
    - apiGroups:   ["apps"]
      apiVersions: ["v1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["deployments"]
  validations:
    - expression: "object.spec.replicas <= params.maxReplicas"

Here, paramKind defines the resources used to configure the policy and the expression uses the params variable to access the parameter resource.

This allows multiple bindings to be defined, each configured differently. For example:

apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "demo-binding-production.example.com"
spec:
  policyName: "demo-policy.example.com"
  paramRef:
    name: "demo-params-production.example.com"
  matchResources:
    namespaceSelector:
      matchExpressions:
      - key: environment
        operator: In
        values:
        - production
apiVersion: rules.example.com/v1 # defined via a CustomResourceDefinition
kind: ReplicaLimit
metadata:
  name: "demo-params-production.example.com"
maxReplicas: 1000

This binding and parameter resource pair limit deployments in namespaces with the environment label set to production to a max of 1000 replicas.

You can then use a separate binding and parameter pair to set a different limit for namespaces in the test environment.

I hope this has given you a glimpse of what is possible with validating admission policies! There are many features that we have not yet touched on.

To learn more, read Validating Admission Policy.

We are working hard to add more features to admission policies and make the enhancement easier to use. Try it out, send us your feedback and help us build a simpler alternative to admission webhooks!

How do I get involved?

If you want to get involved in development of admission policies, discuss enhancement roadmaps, or report a bug, you can get in touch with developers at SIG API Machinery.

Kubernetes 1.26: Device Manager graduates to GA

The Device Plugin framework was introduced in the Kubernetes v1.8 release as a vendor independent framework to enable discovery, advertisement and allocation of external devices without modifying core Kubernetes. The feature graduated to Beta in v1.10. With the recent release of Kubernetes v1.26, Device Manager is now generally available (GA).

Within the kubelet, the Device Manager facilitates communication with device plugins using gRPC through Unix sockets. Device Manager and Device plugins both act as gRPC servers and clients by serving and connecting to the exposed gRPC services respectively. Device plugins serve a gRPC service that kubelet connects to for device discovery, advertisement (as extended resources) and allocation. Device Manager connects to the Registration gRPC service served by kubelet to register itself with kubelet.

Please refer to the documentation for an example on how a pod can request a device exposed to the cluster by a device plugin.

Here are some example implementations of device plugins:

Noteworthy developments since Device Plugin framework introduction

Kubelet APIs moved to kubelet staging repo

External facing deviceplugin API packages moved from k8s.io/kubernetes/pkg/kubelet/apis/ to k8s.io/kubelet/pkg/apis/ in v1.17. Refer to Move external facing kubelet apis to staging for more details on the rationale behind this change.

Device Plugin API updates

Additional gRPC endpoints introduced:

  1. GetDevicePluginOptions is used by device plugins to communicate options to the DeviceManager in order to indicate if PreStartContainer, GetPreferredAllocation or other future optional calls are supported and can be called before making devices available to the container.
  2. GetPreferredAllocation allows a device plugin to forward allocation preferrence to the DeviceManager so it can incorporate this information into its allocation decisions. The DeviceManager will call out to a plugin at pod admission time asking for a preferred device allocation of a given size from a list of available devices to make a more informed decision. E.g. Specifying inter-device constraints to indicate preferrence on best-connected set of devices when allocating devices to a container.
  3. PreStartContainer is called before each container start if indicated by device plugins during registration phase. It allows Device Plugins to run device specific operations on the Devices requested. E.g. reconfiguring or reprogramming FPGAs before the container starts running.

Pull Requests that introduced these changes are here:

  1. Invoke preStart RPC call before container start, if desired by plugin
  2. Add GetPreferredAllocation() call to the v1beta1 device plugin API

With introduction of the above endpoints the interaction between Device Manager in kubelet and Device Manager can be shown as below:

Representation of the Device Plugin framework showing the relationship between the kubelet and a device plugin

Device Plugin framework Overview

Change in semantics of device plugin registration process

Device plugin code was refactored to separate 'plugin' package under the devicemanager package to lay the groundwork for introducing a v1beta2 device plugin API. This would allow adding support in devicemanager to service multiple device plugin APIs at the same time.

With this refactoring work, it is now mandatory for a device plugin to start serving its gRPC service before registering itself with kubelet. Previously, these two operations were asynchronous and device plugin could register itself before starting its gRPC server which is no longer the case. For more details, refer to PR #109016 and Issue #112395.

Dynamic resource allocation

In Kubernetes 1.26, inspired by how Persistent Volumes are handled in Kubernetes, Dynamic Resource Allocation has been introduced to cater to devices that have more sophisticated resource requirements like:

  1. Decouple device initialization and allocation from the pod lifecycle.
  2. Facilitate dynamic sharing of devices between containers and pods.
  3. Support custom resource-specific parameters
  4. Enable resource-specific setup and cleanup actions
  5. Enable support for Network-attached resources, not just node-local resources

Is the Device Plugin API stable now?

No, the Device Plugin API is still not stable; the latest Device Plugin API version available is v1beta1. There are plans in the community to introduce v1beta2 API to service multiple plugin APIs at once. A per-API call with request/response types would allow adding support for newer API versions without explicitly bumping the API.

In addition to that, there are existing proposals in the community to introduce additional endpoints KEP-3162: Add Deallocate and PostStopContainer to Device Manager API.

Kubernetes 1.26: Non-Graceful Node Shutdown Moves to Beta

Kubernetes v1.24 introduced an alpha quality implementation of improvements for handling a non-graceful node shutdown. In Kubernetes v1.26, this feature moves to beta. This feature allows stateful workloads to failover to a different node after the original node is shut down or in a non-recoverable state, such as the hardware failure or broken OS.

What is a node shutdown in Kubernetes?

In a Kubernetes cluster, it is possible for a node to shut down. This could happen either in a planned way or it could happen unexpectedly. You may plan for a security patch, or a kernel upgrade and need to reboot the node, or it may shut down due to preemption of VM instances. A node may also shut down due to a hardware failure or a software problem.

To trigger a node shutdown, you could run a shutdown or poweroff command in a shell, or physically press a button to power off a machine.

A node shutdown could lead to workload failure if the node is not drained before the shutdown.

In the following, we will describe what is a graceful node shutdown and what is a non-graceful node shutdown.

What is a graceful node shutdown?

The kubelet's handling for a graceful node shutdown allows the kubelet to detect a node shutdown event, properly terminate the pods on that node, and release resources before the actual shutdown. Critical pods are terminated after all the regular pods are terminated, to ensure that the essential functions of an application can continue to work as long as possible.

What is a non-graceful node shutdown?

A Node shutdown can be graceful only if the kubelet's node shutdown manager can detect the upcoming node shutdown action. However, there are cases where a kubelet does not detect a node shutdown action. This could happen because the shutdown command does not trigger the Inhibitor Locks mechanism used by the kubelet on Linux, or because of a user error. For example, if the shutdownGracePeriod and shutdownGracePeriodCriticalPods details are not configured correctly for that node.

When a node is shut down (or crashes), and that shutdown was not detected by the kubelet node shutdown manager, it becomes a non-graceful node shutdown. Non-graceful node shutdown is a problem for stateful apps. If a node containing a pod that is part of a StatefulSet is shut down in a non-graceful way, the Pod will be stuck in Terminating status indefinitely, and the control plane cannot create a replacement Pod for that StatefulSet on a healthy node. You can delete the failed Pods manually, but this is not ideal for a self-healing cluster. Similarly, pods that ReplicaSets created as part of a Deployment will be stuck in Terminating status, and that were bound to the now-shutdown node, stay as Terminating indefinitely. If you have set a horizontal scaling limit, even those terminating Pods count against the limit, so your workload may struggle to self-heal if it was already at maximum scale. (By the way: if the node that had done a non-graceful shutdown comes back up, the kubelet does delete the old Pod, and the control plane can make a replacement.)

What's new for the beta?

For Kubernetes v1.26, the non-graceful node shutdown feature is beta and enabled by default. The NodeOutOfServiceVolumeDetach feature gate is enabled by default on kube-controller-manager instead of being opt-in; you can still disable it if needed (please also file an issue to explain the problem).

On the instrumentation side, the kube-controller-manager reports two new metrics.

force_delete_pods_total
number of pods that are being forcibly deleted (resets on Pod garbage collection controller restart)
force_delete_pod_errors_total
number of errors encountered when attempting forcible Pod deletion (also resets on Pod garbage collection controller restart)

How does it work?

In the case of a node shutdown, if a graceful shutdown is not working or the node is in a non-recoverable state due to hardware failure or broken OS, you can manually add an out-of-service taint on the Node. For example, this can be node.kubernetes.io/out-of-service=nodeshutdown:NoExecute or node.kubernetes.io/out-of-service=nodeshutdown:NoSchedule. This taint trigger pods on the node to be forcefully deleted if there are no matching tolerations on the pods. Persistent volumes attached to the shutdown node will be detached, and new pods will be created successfully on a different running node.

kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

Note: Before applying the out-of-service taint, you must verify that a node is already in shutdown or power-off state (not in the middle of restarting), either because the user intentionally shut it down or the node is down due to hardware failures, OS issues, etc.

Once all the workload pods that are linked to the out-of-service node are moved to a new running node, and the shutdown node has been recovered, you should remove that taint on the affected node after the node is recovered.

What’s next?

Depending on feedback and adoption, the Kubernetes team plans to push the Non-Graceful Node Shutdown implementation to GA in either 1.27 or 1.28.

This feature requires a user to manually add a taint to the node to trigger the failover of workloads and remove the taint after the node is recovered.

The cluster operator can automate this process by automatically applying the out-of-service taint if there is a programmatic way to determine that the node is really shut down and there isn’t IO between the node and storage. The cluster operator can then automatically remove the taint after the workload fails over successfully to another running node and that the shutdown node has been recovered.

In the future, we plan to find ways to automatically detect and fence nodes that are shut down or in a non-recoverable state and fail their workloads over to another node.

How can I learn more?

To learn more, read Non Graceful node shutdown in the Kubernetes documentation.

How to get involved?

We offer a huge thank you to all the contributors who helped with design, implementation, and review of this feature:

There are many people who have helped review the design and implementation along the way. We want to thank everyone who has contributed to this effort including the about 30 people who have reviewed the KEP and implementation over the last couple of years.

This feature is a collaboration between SIG Storage and SIG Node. For those interested in getting involved with the design and development of any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). For those interested in getting involved with the design and development of the components that support the controlled interactions between pods and host resources, join the Kubernetes Node SIG.

Kubernetes 1.26: Alpha API For Dynamic Resource Allocation

Dynamic resource allocation is a new API for requesting resources. It is a generalization of the persistent volumes API for generic resources, making it possible to:

  • access the same resource instance in different pods and containers,
  • attach arbitrary constraints to a resource request to get the exact resource you are looking for,
  • initialize a resource according to parameters provided by the user.

Third-party resource drivers are responsible for interpreting these parameters as well as tracking and allocating resources as requests come in.

Dynamic resource allocation is an alpha feature and only enabled when the DynamicResourceAllocation feature gate and the resource.k8s.io/v1alpha1 API group are enabled. For details, see the --feature-gates and --runtime-config kube-apiserver parameters. The kube-scheduler, kube-controller-manager and kubelet components all need the feature gate enabled as well.

The default configuration of kube-scheduler enables the DynamicResources plugin if and only if the feature gate is enabled. Custom configurations may have to be modified to include it.

Once dynamic resource allocation is enabled, resource drivers can be installed to manage certain kinds of hardware. Kubernetes has a test driver that is used for end-to-end testing, but also can be run manually. See below for step-by-step instructions.

API

The new resource.k8s.io/v1alpha1 API group provides four new types:

ResourceClass
Defines which resource driver handles a certain kind of resource and provides common parameters for it. ResourceClasses are created by a cluster administrator when installing a resource driver.
ResourceClaim
Defines a particular resource instances that is required by a workload. Created by a user (lifecycle managed manually, can be shared between different Pods) or for individual Pods by the control plane based on a ResourceClaimTemplate (automatic lifecycle, typically used by just one Pod).
ResourceClaimTemplate
Defines the spec and some meta data for creating ResourceClaims. Created by a user when deploying a workload.
PodScheduling
Used internally by the control plane and resource drivers to coordinate pod scheduling when ResourceClaims need to be allocated for a Pod.

Parameters for ResourceClass and ResourceClaim are stored in separate objects, typically using the type defined by a CRD that was created when installing a resource driver.

With this alpha feature enabled, the spec of Pod defines ResourceClaims that are needed for a Pod to run: this information goes into a new resourceClaims field. Entries in that list reference either a ResourceClaim or a ResourceClaimTemplate. When referencing a ResourceClaim, all Pods using this .spec (for example, inside a Deployment or StatefulSet) share the same ResourceClaim instance. When referencing a ResourceClaimTemplate, each Pod gets its own ResourceClaim instance.

For a container defined within a Pod, the resources.claims list defines whether that container gets access to these resource instances, which makes it possible to share resources between one or more containers inside the same Pod. For example, an init container could set up the resource before the application uses it.

Here is an example of a fictional resource driver. Two ResourceClaim objects will get created for this Pod and each container gets access to one of them.

Assuming a resource driver called resource-driver.example.com was installed together with the following resource class:

apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClass
name: resource.example.com
driverName: resource-driver.example.com

An end-user could then allocate two specific resources of type resource.example.com as follows:

---
apiVersion: cats.resource.example.com/v1
kind: ClaimParameters
name: large-black-cats
spec:
  color: black
  size: large
---
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  name: large-black-cats
spec:
  spec:
    resourceClassName: resource.example.com
    parametersRef:
      apiGroup: cats.resource.example.com
      kind: ClaimParameters
      name: large-black-cats
–--
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-cats
spec:
  containers: # two example containers; each container claims one cat resource
  - name: first-example
    image: ubuntu:22.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-0
  - name: second-example
    image: ubuntu:22.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-1
  resourceClaims:
  - name: cat-0
    source:
      resourceClaimTemplateName: large-black-cats
  - name: cat-1
    source:
      resourceClaimTemplateName: large-black-cats

Scheduling

In contrast to native resources (such as CPU or RAM) and extended resources (managed by a device plugin, advertised by kubelet), the scheduler has no knowledge of what dynamic resources are available in a cluster or how they could be split up to satisfy the requirements of a specific ResourceClaim. Resource drivers are responsible for that. Drivers mark ResourceClaims as allocated once resources for it are reserved. This also then tells the scheduler where in the cluster a claimed resource is actually available.

ResourceClaims can get resources allocated as soon as the ResourceClaim is created (immediate allocation), without considering which Pods will use the resource. The default (wait for first consumer) is to delay allocation until a Pod that relies on the ResourceClaim becomes eligible for scheduling. This design with two allocation options is similar to how Kubernetes handles storage provisioning with PersistentVolumes and PersistentVolumeClaims.

In the wait for first consumer mode, the scheduler checks all ResourceClaims needed by a Pod. If the Pods has any ResourceClaims, the scheduler creates a PodScheduling (a special object that requests scheduling details on behalf of the Pod). The PodScheduling has the same name and namespace as the Pod and the Pod as its as owner. Using its PodScheduling, the scheduler informs the resource drivers responsible for those ResourceClaims about nodes that the scheduler considers suitable for the Pod. The resource drivers respond by excluding nodes that don't have enough of the driver's resources left.

Once the scheduler has that resource information, it selects one node and stores that choice in the PodScheduling object. The resource drivers then allocate resources based on the relevant ResourceClaims so that the resources will be available on that selected node. Once that resource allocation is complete, the scheduler attempts to schedule the Pod to a suitable node. Scheduling can still fail at this point; for example, a different Pod could be scheduled to the same node in the meantime. If this happens, already allocated ResourceClaims may get deallocated to enable scheduling onto a different node.

As part of this process, ResourceClaims also get reserved for the Pod. Currently ResourceClaims can either be used exclusively by a single Pod or an unlimited number of Pods.

One key feature is that Pods do not get scheduled to a node unless all of their resources are allocated and reserved. This avoids the scenario where a Pod gets scheduled onto one node and then cannot run there, which is bad because such a pending Pod also blocks all other resources like RAM or CPU that were set aside for it.

Limitations

The scheduler plugin must be involved in scheduling Pods which use ResourceClaims. Bypassing the scheduler by setting the nodeName field leads to Pods that the kubelet refuses to start because the ResourceClaims are not reserved or not even allocated. It may be possible to remove this limitation in the future.

Writing a resource driver

A dynamic resource allocation driver typically consists of two separate-but-coordinating components: a centralized controller, and a DaemonSet of node-local kubelet plugins. Most of the work required by the centralized controller to coordinate with the scheduler can be handled by boilerplate code. Only the business logic required to actually allocate ResourceClaims against the ResourceClasses owned by the plugin needs to be customized. As such, Kubernetes provides the following package, including APIs for invoking this boilerplate code as well as a Driver interface that you can implement to provide their custom business logic:

Likewise, boilerplate code can be used to register the node-local plugin with the kubelet, as well as start a gRPC server to implement the kubelet plugin API. For drivers written in Go, the following package is recommended:

It is up to the driver developer to decide how these two components communicate. The KEP outlines an approach using CRDs.

Within SIG Node, we also plan to provide a complete example driver that can serve as a template for other drivers.

Running the test driver

The following steps bring up a local, one-node cluster directly from the Kubernetes source code. As a prerequisite, your cluster must have nodes with a container runtime that supports the Container Device Interface (CDI). For example, you can run CRI-O v1.23.2 or later. Once containerd v1.7.0 is released, we expect that you can run that or any later version. In the example below, we use CRI-O.

First, clone the Kubernetes source code. Inside that directory, run:

$ hack/install-etcd.sh
...

$ RUNTIME_CONFIG=resource.k8s.io/v1alpha1 \
  FEATURE_GATES=DynamicResourceAllocation=true \
  DNS_ADDON="coredns" \
  CGROUP_DRIVER=systemd \
  CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/crio/crio.sock \
  LOG_LEVEL=6 \
  ENABLE_CSI_SNAPSHOTTER=false \
  API_SECURE_PORT=6444 \
  ALLOW_PRIVILEGED=1 \
  PATH=$(pwd)/third_party/etcd:$PATH \
  ./hack/local-up-cluster.sh -O
...

To start using your cluster, you can open up another terminal/tab and run:

$ export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig

Once the cluster is up, in another terminal run the test driver controller. KUBECONFIG must be set for all of the following commands.

$ go run ./test/e2e/dra/test-driver --feature-gates ContextualLogging=true -v=5 controller

In another terminal, run the kubelet plugin:

$ sudo mkdir -p /var/run/cdi && \
  sudo chmod a+rwx /var/run/cdi /var/lib/kubelet/plugins_registry /var/lib/kubelet/plugins/
$ go run ./test/e2e/dra/test-driver --feature-gates ContextualLogging=true -v=6 kubelet-plugin

Changing the permissions of the directories makes it possible to run and (when using delve) debug the kubelet plugin as a normal user, which is convenient because it uses the already populated Go cache. Remember to restore permissions with sudo chmod go-w when done. Alternatively, you can also build the binary and run that as root.

Now the cluster is ready to create objects:

$ kubectl create -f test/e2e/dra/test-driver/deploy/example/resourceclass.yaml
resourceclass.resource.k8s.io/example created

$ kubectl create -f test/e2e/dra/test-driver/deploy/example/pod-inline.yaml
configmap/test-inline-claim-parameters created
resourceclaimtemplate.resource.k8s.io/test-inline-claim-template created
pod/test-inline-claim created

$ kubectl get resourceclaims
NAME                         RESOURCECLASSNAME   ALLOCATIONMODE         STATE                AGE
test-inline-claim-resource   example             WaitForFirstConsumer   allocated,reserved   8s

$ kubectl get pods
NAME                READY   STATUS      RESTARTS   AGE
test-inline-claim   0/2     Completed   0          21s

The test driver doesn't do much, it only sets environment variables as defined in the ConfigMap. The test pod dumps the environment, so the log can be checked to verify that everything worked:

$ kubectl logs test-inline-claim with-resource | grep user_a
user_a='b'

Next steps

Kubernetes 1.26: Windows HostProcess Containers Are Generally Available

The long-awaited day has arrived: HostProcess containers, the Windows equivalent to Linux privileged containers, has finally made it to GA in Kubernetes 1.26!

What are HostProcess containers and why are they useful?

Cluster operators are often faced with the need to configure their nodes upon provisioning such as installing Windows services, configuring registry keys, managing TLS certificates, making network configuration changes, or even deploying monitoring tools such as a Prometheus's node-exporter. Previously, performing these actions on Windows nodes was usually done by running PowerShell scripts over SSH or WinRM sessions and/or working with your cloud provider's virtual machine management tooling. HostProcess containers now enable you to do all of this and more with minimal effort using Kubernetes native APIs.

With HostProcess containers you can now package any payload into the container image, map volumes into containers at runtime, and manage them like any other Kubernetes workload. You get all the benefits of containerized packaging and deployment methods combined with a reduction in both administrative and development cost. Gone are the days where cluster operators would need to manually log onto Windows nodes to perform administrative duties.

HostProcess containers differ quite significantly from regular Windows Server containers. They are run directly as processes on the host with the access policies of a user you specify. HostProcess containers run as either the built-in Windows system accounts or ephemeral users within a user group defined by you. HostProcess containers also share the host's network namespace and access/configure storage mounts visible to the host. On the other hand, Windows Server containers are highly isolated and exist in a separate execution namespace. Direct access to the host from a Windows Server container is explicitly disallowed by default.

How does it work?

Windows HostProcess containers are implemented with Windows Job Objects, a break from the previous container model which use server silos. Job Objects are components of the Windows OS which offer the ability to manage a group of processes as a group (also known as a job) and assign resource constraints to the group as a whole. Job objects are specific to the Windows OS and are not associated with the Kubernetes Job API. They have no process or file system isolation, enabling the privileged payload to view and edit the host file system with the desired permissions, among other host resources. The init process, and any processes it launches (including processes explicitly launched by the user) are all assigned to the job object of that container. When the init process exits or is signaled to exit, all the processes in the job will be signaled to exit, the job handle will be closed and the storage will be unmounted.

HostProcess and Linux privileged containers enable similar scenarios but differ greatly in their implementation (hence the naming difference). HostProcess containers have their own PodSecurityContext fields. Those used to configure Linux privileged containers do not apply. Enabling privileged access to a Windows host is a fundamentally different process than with Linux so the configuration and capabilities of each differ significantly. Below is a diagram detailing the overall architecture of Windows HostProcess containers:

HostProcess Architecture

Two major features were added prior to moving to stable: the ability to run as local user accounts, and a simplified method of accessing volume mounts. To learn more, read Create a Windows HostProcess Pod.

HostProcess containers in action

Kubernetes SIG Windows has been busy putting HostProcess containers to use - even before GA! They've been very excited to use HostProcess containers for a number of important activities that were a pain to perform in the past.

Here are just a few of the many use use cases with example deployments:

How do I use it?

A HostProcess container can be built using any base image of your choosing, however, for convenience we have created a HostProcess container base image. This image is only a few KB in size and does not inherit any of the same compatibility requirements as regular Windows server containers which allows it to run on any Windows server version.

To use that Microsoft image, put this in your Dockerfile:

FROM mcr.microsoft.com/oss/kubernetes/windows-host-process-containers-base-image:v1.0.0

You can run HostProcess containers from within a HostProcess Pod.

To get started with running Windows containers, see the general guidance for deploying Windows nodes. If you have a compatible node (for example: Windows as the operating system with containerd v1.7 or later as the container runtime), you can deploy a Pod with one or more HostProcess containers. See the Create a Windows HostProcess Pod - Prerequisites for more information.

Please note that within a Pod, you can't mix HostProcess containers with normal Windows containers.

How can I learn more?

How do I get involved?

Get involved with SIG Windows to contribute!

Kubernetes 1.26: We're now signing our binary release artifacts!

The Kubernetes Special Interest Group (SIG) Release is proud to announce that we are digitally signing all release artifacts, and that this aspect of Kubernetes has now reached beta.

Signing artifacts provides end users a chance to verify the integrity of the downloaded resource. It allows to mitigate man-in-the-middle attacks directly on the client side and therefore ensures the trustfulness of the remote serving the artifacts. The overall goal of out past work was to define the used tooling for signing all Kubernetes related artifacts as well as providing a standard signing process for related projects (for example for those in kubernetes-sigs).

We already signed all officially released container images (from Kubernetes v1.24 onwards). Image signing was alpha for v1.24 and v1.25. For v1.26, we've added all binary artifacts to the signing process as well! This means that now all client, server and source tarballs, binary artifacts, Software Bills of Material (SBOMs) as well as the build provenance will be signed using cosign. Technically speaking, we now ship additional *.sig (signature) and *.cert (certificate) files side by side to the artifacts for verifying their integrity.

To verify an artifact, for example kubectl, you can download the signature and certificate alongside with the binary. I use the release candidate rc.1 of v1.26 for demonstration purposes because the final has not been released yet:

curl -sSfL https://dl.k8s.io/release/v1.26.0-rc.1/bin/linux/amd64/kubectl -o kubectl
curl -sSfL https://dl.k8s.io/release/v1.26.0-rc.1/bin/linux/amd64/kubectl.sig -o kubectl.sig
curl -sSfL https://dl.k8s.io/release/v1.26.0-rc.1/bin/linux/amd64/kubectl.cert -o kubectl.cert

Then you can verify kubectl using cosign:

COSIGN_EXPERIMENTAL=1 cosign verify-blob kubectl --signature kubectl.sig --certificate kubectl.cert
tlog entry verified with uuid: 5d54b39222e3fa9a21bcb0badd8aac939b4b0d1d9085b37f1f10b18a8cd24657 index: 8173886
Verified OK

The UUID can be used to query the rekor transparency log:

rekor-cli get --uuid 5d54b39222e3fa9a21bcb0badd8aac939b4b0d1d9085b37f1f10b18a8cd24657
LogID: c0d23d6ad406973f9559f3ba2d1ca01f84147d8ffc5b8445c224f98b9591801d
Index: 8173886
IntegratedTime: 2022-11-30T18:59:07Z
UUID: 24296fb24b8ad77a5d54b39222e3fa9a21bcb0badd8aac939b4b0d1d9085b37f1f10b18a8cd24657
Body: {
  "HashedRekordObj": {
    "data": {
      "hash": {
        "algorithm": "sha256",
        "value": "982dfe7eb5c27120de6262d30fa3e8029bc1da9e632ce70570e9c921d2851fc2"
      }
    },
    "signature": {
      "content": "MEQCIH0e1/0svxMoLzjeyhAaLFSHy5ZaYy0/2iQl2t3E0Pj4AiBsWmwjfLzrVyp9/v1sy70Q+FHE8miauOOVkAW2lTYVug==",
      "publicKey": {
        "content": "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUN2akNDQWthZ0F3SUJBZ0lVRldab0pLSUlFWkp3LzdsRkFrSVE2SHBQdi93d0NnWUlLb1pJemowRUF3TXcKTnpFVk1CTUdBMVVFQ2hNTWMybG5jM1J2Y21VdVpHVjJNUjR3SEFZRFZRUURFeFZ6YVdkemRHOXlaUzFwYm5SbApjbTFsWkdsaGRHVXdIaGNOTWpJeE1UTXdNVGcxT1RBMldoY05Nakl4TVRNd01Ua3dPVEEyV2pBQU1Ga3dFd1lICktvWkl6ajBDQVFZSUtvWkl6ajBEQVFjRFFnQUVDT3h5OXBwTFZzcVFPdHJ6RFgveTRtTHZSeU1scW9sTzBrS0EKTlJDM3U3bjMreHorYkhvWVkvMUNNRHpJQjBhRTA3NkR4ZWVaSkhVaWFjUXU4a0dDNktPQ0FXVXdnZ0ZoTUE0RwpBMVVkRHdFQi93UUVBd0lIZ0RBVEJnTlZIU1VFRERBS0JnZ3JCZ0VGQlFjREF6QWRCZ05WSFE0RUZnUVV5SmwxCkNlLzIzNGJmREJZQ2NzbXkreG5qdnpjd0h3WURWUjBqQkJnd0ZvQVUzOVBwejFZa0VaYjVxTmpwS0ZXaXhpNFkKWkQ4d1FnWURWUjBSQVFIL0JEZ3dOb0UwYTNKbGJDMXpkR0ZuYVc1blFHczRjeTF5Wld4bGJtY3RjSEp2WkM1cApZVzB1WjNObGNuWnBZMlZoWTJOdmRXNTBMbU52YlRBcEJnb3JCZ0VFQVlPL01BRUJCQnRvZEhSd2N6b3ZMMkZqClkyOTFiblJ6TG1kdmIyZHNaUzVqYjIwd2dZb0dDaXNHQVFRQjFua0NCQUlFZkFSNkFIZ0FkZ0RkUFRCcXhzY1IKTW1NWkhoeVpaemNDb2twZXVONDhyZitIaW5LQUx5bnVqZ0FBQVlUSjZDdlJBQUFFQXdCSE1FVUNJRXI4T1NIUQp5a25jRFZpNEJySklXMFJHS0pqNkQyTXFGdkFMb0I5SmNycXlBaUVBNW4xZ283cmQ2U3ZVeXNxeldhMUdudGZKCllTQnVTZHF1akVySFlMQTUrZTR3Q2dZSUtvWkl6ajBFQXdNRFpnQXdZd0l2Tlhub3pyS0pWVWFESTFiNUlqa1oKUWJJbDhvcmlMQ1M4MFJhcUlBSlJhRHNCNTFUeU9iYTdWcGVYdThuTHNjVUNNREU4ZmpPZzBBc3ZzSXp2azNRUQo0c3RCTkQrdTRVV1UrcjhYY0VxS0YwNGJjTFQwWEcyOHZGQjRCT2x6R204K093PT0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo="
      }
    }
  }
}

The HashedRekordObj.signature.content should match the content of the file kubectl.sig and HashedRekordObj.signature.publicKey.content should be identical with the contents of kubectl.cert. It is also possible to specify the remote certificate and signature locations without downloading them:

COSIGN_EXPERIMENTAL=1 cosign verify-blob kubectl \
    --signature https://dl.k8s.io/release/v1.26.0-rc.1/bin/linux/amd64/kubectl.sig \
    --certificate https://dl.k8s.io/release/v1.26.0-rc.1/bin/linux/amd64/kubectl.cert
tlog entry verified with uuid: 5d54b39222e3fa9a21bcb0badd8aac939b4b0d1d9085b37f1f10b18a8cd24657 index: 8173886
Verified OK

All of the mentioned steps as well as how to verify container images are outlined in the official documentation about how to Verify Signed Kubernetes Artifacts. In one of the next upcoming Kubernetes releases we will working making the global story more mature by ensuring that truly all Kubernetes artifacts are signed. Beside that, we are considering using Kubernetes owned infrastructure for the signing (root trust) and verification (transparency log) process.

Getting involved

If you're interested in contributing to SIG Release, then consider applying for the upcoming v1.27 shadowing program (watch for the announcement on k-dev) or join our weekly meeting to say hi.

We're looking forward to making even more of those awesome changes for future Kubernetes releases. For example, we're working on the SLSA Level 3 Compliance in the Kubernetes Release Process or the Renaming of the kubernetes/kubernetes default branch name to main.

Thank you for reading this blog post! I'd like to use this opportunity to give all involved SIG Release folks a special shout-out for shipping this feature in time!

Feel free to reach out to us by using the SIG Release mailing list or the #sig-release Slack channel.

Additional resources

Kubernetes v1.26: Electrifying

It's with immense joy that we announce the release of Kubernetes v1.26!

This release includes a total of 37 enhancements: eleven of them are graduating to Stable, ten are graduating to Beta, and sixteen of them are entering Alpha. We also have twelve features being deprecated or removed, three of which we better detail in this announcement.

Kubernetes 1.26: Electrifying

The theme for Kubernetes v1.26 is Electrifying.

Each Kubernetes release is the result of the coordinated effort of dedicated volunteers, and only made possible due to the use of a diverse and complex set of computing resources, spread out through multiple datacenters and regions worldwide. The end result of a release - the binaries, the image containers, the documentation - are then deployed on a growing number of personal, on-premises, and cloud computing resources.

In this release we want to recognise the importance of all these building blocks on which Kubernetes is developed and used, while at the same time raising awareness on the importance of taking the energy consumption footprint into account: environmental sustainability is an inescapable concern of creators and users of any software solution, and the environmental footprint of software, like Kubernetes, an area which we believe will play a significant role in future releases.

As a community, we always work to make each new release process better than before (in this release, we have started to use Projects for tracking enhancements, for example). If v1.24 "Stargazer" had us looking upwards, to what is possible when our community comes together, and v1.25 "Combiner" what the combined efforts of our community are capable of, this v1.26 "Electrifying" is also dedicated to all of those whose individual motion, integrated into the release flow, made all of this possible.

Major themes

Kubernetes v1.26 is composed of many changes, brought to you by a worldwide team of volunteers. For this release, we have identified several major themes.

Change in container image registry

In the previous release, Kubernetes changed the container registry, allowing the spread of the load across multiple Cloud Providers and Regions, a change that reduced the reliance on a single entity and provided a faster download experience for a large number of users.

This release of Kubernetes is the first that is exclusively published in the new registry.k8s.io container image registry. In the (now legacy) k8s.gcr.io image registry, no container images tags for v1.26 will be published, and only tags from releases before v1.26 will continue to be updated. Refer to registry.k8s.io: faster, cheaper and Generally Available for more information on the motivation, advantages, and implications of this significant change.

CRI v1alpha2 removed

With the adoption of the Container Runtime Interface (CRI) and the removal of dockershim in v1.24, the CRI is the only supported and documented way through which Kubernetes interacts with different container runtimes. Each kubelet negotiates which version of CRI to use with the container runtime on that node.

In the previous release, the Kubernetes project recommended using CRI version v1, but kubelet could still negotiate the use of CRI v1alpha2, which was deprecated.

Kubernetes v1.26 drops support for CRI v1alpha2. That removal will result in the kubelet not registering the node if the container runtime doesn't support CRI v1. This means that containerd minor version 1.5 and older are not supported in Kubernetes 1.26; if you use containerd, you will need to upgrade to containerd version 1.6.0 or later before you upgrade that node to Kubernetes v1.26. This applies equally to any other container runtimes that only support the v1alpha2: if that affects you, you should contact the container runtime vendor for advice or check their website for additional instructions in how to move forward.

Storage improvements

Following the GA of the core Container Storage Interface (CSI) Migration feature in the previous release, CSI migration is an on-going effort that we've been working on for a few releases now, and this release continues to add (and remove) features aligned with the migration's goals, as well as other improvements to Kubernetes storage.

CSI migration for Azure File and vSphere graduated to stable

Both the vSphere and Azure in-tree driver migration to CSI have graduated to Stable. You can find more information about them in the vSphere CSI driver and Azure File CSI driver repositories.

Delegate FSGroup to CSI Driver graduated to stable

This feature allows Kubernetes to supply the pod's fsGroup to the CSI driver when a volume is mounted so that the driver can utilize mount options to control volume permissions. Previously, the kubelet would always apply the fsGroupownership and permission change to files in the volume according to the policy specified in the Pod's .spec.securityContext.fsGroupChangePolicy field. Starting with this release, CSI drivers have the option to apply the fsGroup settings during attach or mount time of the volumes.

In-tree GlusterFS driver removal

Already deprecated in the v1.25 release, the in-tree GlusterFS driver was removed in this release.

In-tree OpenStack Cinder driver removal

This release removed the deprecated in-tree storage integration for OpenStack (the cinder volume type). You should migrate to external cloud provider and CSI driver from https://github.com/kubernetes/cloud-provider-openstack instead. For more information, visit Cinder in-tree to CSI driver migration.

Signing Kubernetes release artifacts graduates to beta

Introduced in Kubernetes v1.24, this feature constitutes a significant milestone in improving the security of the Kubernetes release process. All release artifacts are signed keyless using cosign, and both binary artifacts and images can be verified.

Support for Windows privileged containers graduates to stable

Privileged container support allows containers to run with similar access to the host as processes that run on the host directly. Support for this feature in Windows nodes, called HostProcess containers, will now graduate to Stable, enabling access to host resources (including network resources) from privileged containers.

Improvements to Kubernetes metrics

This release has several noteworthy improvements on metrics.

Metrics framework extension graduates to alpha

The metrics framework extension graduates to Alpha, and documentation is now published for every metric in the Kubernetes codebase.This enhancement adds two additional metadata fields to Kubernetes metrics: Internal and Beta, representing different stages of metric maturity.

Component Health Service Level Indicators graduates to alpha

Also improving on the ability to consume Kubernetes metrics, component health Service Level Indicators (SLIs) have graduated to Alpha: by enabling the ComponentSLIs feature flag there will be an additional metrics endpoint which allows the calculation of Service Level Objectives (SLOs) from raw healthcheck data converted into metric format.

Feature metrics are now available

Feature metrics are now available for each Kubernetes component, making it possible to track whether each active feature gate is enabled by checking the component's metric endpoint for kubernetes_feature_enabled.

Dynamic Resource Allocation graduates to alpha

Dynamic Resource Allocation is a new feature that puts resource scheduling in the hands of third-party developers: it offers an alternative to the limited "countable" interface for requesting access to resources (e.g. nvidia.com/gpu: 2), providing an API more akin to that of persistent volumes. Under the hood, it uses the Container Device Interface (CDI) to do its device injection. This feature is blocked by the DynamicResourceAllocation feature gate.

CEL in Admission Control graduates to alpha

This feature introduces a v1alpha1 API for validating admission policies, enabling extensible admission control via Common Expression Language expressions. Currently, custom policies are enforced via admission webhooks, which, while flexible, have a few drawbacks when compared to in-process policy enforcement. To use, enable the ValidatingAdmissionPolicy feature gate and the admissionregistration.k8s.io/v1alpha1 API via --runtime-config.

Pod scheduling improvements

Kubernetes v1.26 introduces some relevant enhancements to the ability to better control scheduling behavior.

PodSchedulingReadiness graduates to alpha

This feature introduces a .spec.schedulingGates field to Pod's API, to indicate whether the Pod is allowed to be scheduled or not. External users/controllers can use this field to hold a Pod from scheduling based on their policies and needs.

NodeInclusionPolicyInPodTopologySpread graduates to beta

By specifying a nodeInclusionPolicy in topologySpreadConstraints, you can control whether to take taints/tolerations into consideration when calculating Pod Topology Spread skew.

Other Updates

Graduations to stable

This release includes a total of eleven enhancements promoted to Stable:

Deprecations and removals

12 features were deprecated or removed from Kubernetes with this release.

Release notes

The complete details of the Kubernetes v1.26 release are available in our release notes.

Availability

Kubernetes v1.26 is available for download on the Kubernetes site. To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using containers as "nodes", with kind. You can also easily install v1.26 using kubeadm.

Release team

Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.

We would like to thank the entire release team for the hours spent hard at work to ensure we deliver a solid Kubernetes v1.26 release for our community.

A very special thanks is in order for our Release Lead, Leonard Pahlke, for successfully steering the entire release team throughout the entire release cycle, by making sure that we could all contribute in the best way possible to this release through his constant support and attention to the many and diverse details that make up the path to a successful release.

User highlights

Ecosystem updates

  • KubeCon + CloudNativeCon Europe 2023 will take place in Amsterdam, The Netherlands, from 17 – 21 April 2023! You can find more information about the conference and registration on the event site.
  • CloudNativeSecurityCon North America, a two-day event designed to foster collaboration, discussion and knowledge sharing of cloud native security projects and how to best use these to address security challenges and opportunities, will take place in Seattle, Washington (USA), from 1-2 February 2023. See the event page for more information.
  • The CNCF announced the 2022 Community Awards Winners: the Community Awards recognize CNCF community members that are going above and beyond to advance cloud native technology.

Project velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing, and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

In the v1.26 release cycle, which ran for 14 weeks (September 5 to December 9), we saw contributions from 976 companies and 6877 individuals.

Upcoming Release Webinar

Join members of the Kubernetes v1.26 release team on Tuesday January 17, 2023 10am - 11am EST (3pm - 4pm UTC) to learn about the major features of this release, as well as deprecations and removals to help plan for upgrades. For more information and registration, visit the event page.

Get Involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests.

Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below:

Forensic container checkpointing in Kubernetes

Forensic container checkpointing is based on Checkpoint/Restore In Userspace (CRIU) and allows the creation of stateful copies of a running container without the container knowing that it is being checkpointed. The copy of the container can be analyzed and restored in a sandbox environment multiple times without the original container being aware of it. Forensic container checkpointing was introduced as an alpha feature in Kubernetes v1.25.

How does it work?

With the help of CRIU it is possible to checkpoint and restore containers. CRIU is integrated in runc, crun, CRI-O and containerd and forensic container checkpointing as implemented in Kubernetes uses these existing CRIU integrations.

Why is it important?

With the help of CRIU and the corresponding integrations it is possible to get all information and state about a running container on disk for later forensic analysis. Forensic analysis might be important to inspect a suspicious container without stopping or influencing it. If the container is really under attack, the attacker might detect attempts to inspect the container. Taking a checkpoint and analysing the container in a sandboxed environment offers the possibility to inspect the container without the original container and maybe attacker being aware of the inspection.

In addition to the forensic container checkpointing use case, it is also possible to migrate a container from one node to another node without loosing the internal state. Especially for stateful containers with long initialization times restoring from a checkpoint might save time after a reboot or enable much faster startup times.

How do I use container checkpointing?

The feature is behind a feature gate, so make sure to enable the ContainerCheckpoint gate before you can use the new feature.

The runtime must also support container checkpointing:

  • containerd: support is currently under discussion. See containerd pull request #6965 for more details.

  • CRI-O: v1.25 has support for forensic container checkpointing.

Usage example with CRI-O

To use forensic container checkpointing in combination with CRI-O, the runtime needs to be started with the command-line option --enable-criu-support=true. For Kubernetes, you need to run your cluster with the ContainerCheckpoint feature gate enabled. As the checkpointing functionality is provided by CRIU it is also necessary to install CRIU. Usually runc or crun depend on CRIU and therefore it is installed automatically.

It is also important to mention that at the time of writing the checkpointing functionality is to be considered as an alpha level feature in CRI-O and Kubernetes and the security implications are still under consideration.

Once containers and pods are running it is possible to create a checkpoint. Checkpointing is currently only exposed on the kubelet level. To checkpoint a container, you can run curl on the node where that container is running, and trigger a checkpoint:

curl -X POST "https://localhost:10250/checkpoint/namespace/podId/container"

For a container named counter in a pod named counters in a namespace named default the kubelet API endpoint is reachable at:

curl -X POST "https://localhost:10250/checkpoint/default/counters/counter"

For completeness the following curl command-line options are necessary to have curl accept the kubelet's self signed certificate and authorize the use of the kubelet checkpoint API:

--insecure --cert /var/run/kubernetes/client-admin.crt --key /var/run/kubernetes/client-admin.key

Triggering this kubelet API will request the creation of a checkpoint from CRI-O. CRI-O requests a checkpoint from your low-level runtime (for example, runc). Seeing that request, runc invokes the criu tool to do the actual checkpointing.

Once the checkpointing has finished the checkpoint should be available at /var/lib/kubelet/checkpoints/checkpoint-<pod-name>_<namespace-name>-<container-name>-<timestamp>.tar

You could then use that tar archive to restore the container somewhere else.

Restore a checkpointed container outside of Kubernetes (with CRI-O)

With the checkpoint tar archive it is possible to restore the container outside of Kubernetes in a sandboxed instance of CRI-O. For better user experience during restore, I recommend that you use the latest version of CRI-O from the main CRI-O GitHub branch. If you're using CRI-O v1.25, you'll need to manually create certain directories Kubernetes would create before starting the container.

The first step to restore a container outside of Kubernetes is to create a pod sandbox using crictl:

crictl runp pod-config.json

Then you can restore the previously checkpointed container into the newly created pod sandbox:

crictl create <POD_ID> container-config.json pod-config.json

Instead of specifying a container image in a registry in container-config.json you need to specify the path to the checkpoint archive that you created earlier:

{
  "metadata": {
      "name": "counter"
  },
  "image":{
      "image": "/var/lib/kubelet/checkpoints/<checkpoint-archive>.tar"
  }
}

Next, run crictl start <CONTAINER_ID> to start that container, and then a copy of the previously checkpointed container should be running.

Restore a checkpointed container within of Kubernetes

To restore the previously checkpointed container directly in Kubernetes it is necessary to convert the checkpoint archive into an image that can be pushed to a registry.

One possible way to convert the local checkpoint archive consists of the following steps with the help of buildah:

newcontainer=$(buildah from scratch)
buildah add $newcontainer /var/lib/kubelet/checkpoints/checkpoint-<pod-name>_<namespace-name>-<container-name>-<timestamp>.tar /
buildah config --annotation=io.kubernetes.cri-o.annotations.checkpoint.name=<container-name> $newcontainer
buildah commit $newcontainer checkpoint-image:latest
buildah rm $newcontainer

The resulting image is not standardized and only works in combination with CRI-O. Please consider this image format as pre-alpha. There are ongoing discussions to standardize the format of checkpoint images like this. Important to remember is that this not yet standardized image format only works if CRI-O has been started with --enable-criu-support=true. The security implications of starting CRI-O with CRIU support are not yet clear and therefore the functionality as well as the image format should be used with care.

Now, you'll need to push that image to a container image registry. For example:

buildah push localhost/checkpoint-image:latest container-image-registry.example/user/checkpoint-image:latest

To restore this checkpoint image (container-image-registry.example/user/checkpoint-image:latest), the image needs to be listed in the specification for a Pod. Here's an example manifest:

apiVersion: v1
kind: Pod
metadata:
  namePrefix: example-
spec:
  containers:
  - name: <container-name>
    image: container-image-registry.example/user/checkpoint-image:latest
  nodeName: <destination-node>

Kubernetes schedules the new Pod onto a node. The kubelet on that node instructs the container runtime (CRI-O in this example) to create and start a container based on an image specified as registry/user/checkpoint-image:latest. CRI-O detects that registry/user/checkpoint-image:latest is a reference to checkpoint data rather than a container image. Then, instead of the usual steps to create and start a container, CRI-O fetches the checkpoint data and restores the container from that specified checkpoint.

The application in that Pod would continue running as if the checkpoint had not been taken; within the container, the application looks and behaves like any other container that had been started normally and not restored from a checkpoint.

With these steps, it is possible to replace a Pod running on one node with a new equivalent Pod that is running on a different node, and without losing the state of the containers in that Pod.

How do I get involved?

You can reach SIG Node by several means:

Further reading

Please see the follow-up article Forensic container analysis for details on how a container checkpoint can be analyzed.

Finding suspicious syscalls with the seccomp notifier

Debugging software in production is one of the biggest challenges we have to face in our containerized environments. Being able to understand the impact of the available security options, especially when it comes to configuring our deployments, is one of the key aspects to make the default security in Kubernetes stronger. We have all those logging, tracing and metrics data already at hand, but how do we assemble the information they provide into something human readable and actionable?

Seccomp is one of the standard mechanisms to protect a Linux based Kubernetes application from malicious actions by interfering with its system calls. This allows us to restrict the application to a defined set of actionable items, like modifying files or responding to HTTP requests. Linking the knowledge of which set of syscalls is required to, for example, modify a local file, to the actual source code is in the same way non-trivial. Seccomp profiles for Kubernetes have to be written in JSON and can be understood as an architecture specific allow-list with superpowers, for example:

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 38,
  "defaultErrno": "ENOSYS",
  "syscalls": [
    {
      "names": ["chmod", "chown", "open", "write"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

The above profile errors by default specifying the defaultAction of SCMP_ACT_ERRNO. This means we have to allow a set of syscalls via SCMP_ACT_ALLOW, otherwise the application would not be able to do anything at all. Okay cool, for being able to allow file operations, all we have to do is adding a bunch of file specific syscalls like open or write, and probably also being able to change the permissions via chmod and chown, right? Basically yes, but there are issues with the simplicity of that approach:

Seccomp profiles need to include the minimum set of syscalls required to start the application. This also includes some syscalls from the lower level Open Container Initiative (OCI) container runtime, for example runc or crun. Beside that, we can only guarantee the required syscalls for a very specific version of the runtimes and our application, because the code parts can change between releases. The same applies to the termination of the application as well as the target architecture we're deploying on. Features like executing commands within containers also require another subset of syscalls. Not to mention that there are multiple versions for syscalls doing slightly different things and the seccomp profiles are able to modify their arguments. It's also not always clearly visible to the developers which syscalls are used by their own written code parts, because they rely on programming language abstractions or frameworks.

How can we know which syscalls are even required then? Who should create and maintain those profiles during its development life-cycle?

Well, recording and distributing seccomp profiles is one of the problem domains of the Security Profiles Operator, which is already solving that. The operator is able to record seccomp, SELinux and even AppArmor profiles into a Custom Resource Definition (CRD), reconciles them to each node and makes them available for usage.

The biggest challenge about creating security profiles is to catch all code paths which execute syscalls. We could achieve that by having 100% logical coverage of the application when running an end-to-end test suite. You get the problem with the previous statement: It's too idealistic to be ever fulfilled, even without taking all the moving parts during application development and deployment into account.

Missing a syscall in the seccomp profiles' allow list can have tremendously negative impact on the application. It's not only that we can encounter crashes, which are trivially detectable. It can also happen that they slightly change logical paths, change the business logic, make parts of the application unusable, slow down performance or even expose security vulnerabilities. We're simply not able to see the whole impact of that, especially because blocked syscalls via SCMP_ACT_ERRNO do not provide any additional audit logging on the system.

Does that mean we're lost? Is it just not realistic to dream about a Kubernetes where everyone uses the default seccomp profile? Should we stop striving towards maximum security in Kubernetes and accept that it's not meant to be secure by default?

Definitely not. Technology evolves over time and there are many folks working behind the scenes of Kubernetes to indirectly deliver features to address such problems. One of the mentioned features is the seccomp notifier, which can be used to find suspicious syscalls in Kubernetes.

The seccomp notify feature consists of a set of changes introduced in Linux 5.9. It makes the kernel capable of communicating seccomp related events to the user space. That allows applications to act based on the syscalls and opens for a wide range of possible use cases. We not only need the right kernel version, but also at least runc v1.1.0 (or crun v0.19) to be able to make the notifier work at all. The Kubernetes container runtime CRI-O gets support for the seccomp notifier in v1.26.0. The new feature allows us to identify possibly malicious syscalls in our application, and therefore makes it possible to verify profiles for consistency and completeness. Let's give that a try.

First of all we need to run the latest main version of CRI-O, because v1.26.0 has not been released yet at time of writing. You can do that by either compiling it from the source code or by using the pre-built binary bundle via the get-script. The seccomp notifier feature of CRI-O is guarded by an annotation, which has to be explicitly allowed, for example by using a configuration drop-in like this:

> cat /etc/crio/crio.conf.d/02-runtimes.conf
[crio.runtime]
default_runtime = "runc"

[crio.runtime.runtimes.runc]
allowed_annotations = [ "io.kubernetes.cri-o.seccompNotifierAction" ]

If CRI-O is up and running, then it should indicate that the seccomp notifier is available as well:

> sudo ./bin/crio --enable-metrics
INFO[…] Starting seccomp notifier watcher
INFO[…] Serving metrics on :9090 via HTTP

We also enable the metrics, because they provide additional telemetry data about the notifier. Now we need a running Kubernetes cluster for demonstration purposes. For this demo, we mainly stick to the hack/local-up-cluster.sh approach to locally spawn a single node Kubernetes cluster.

If everything is up and running, then we would have to define a seccomp profile for testing purposes. But we do not have to create our own, we can just use the RuntimeDefault profile which gets shipped with each container runtime. For example the RuntimeDefault profile for CRI-O can be found in the containers/common library.

Now we need a test container, which can be a simple nginx pod like this:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  annotations:
    io.kubernetes.cri-o.seccompNotifierAction: "stop"
spec:
  restartPolicy: Never
  containers:
    - name: nginx
      image: nginx:1.23.2
      securityContext:
        seccompProfile:
          type: RuntimeDefault

Please note the annotation io.kubernetes.cri-o.seccompNotifierAction, which enables the seccomp notifier for this workload. The value of the annotation can be either stop for stopping the workload or anything else for doing nothing else than logging and throwing metrics. Because of the termination we also use the restartPolicy: Never to not automatically recreate the container on failure.

Let's run the pod and check if it works:

> kubectl apply -f nginx.yaml
> kubectl get pods -o wide
NAME    READY   STATUS    RESTARTS   AGE     IP          NODE        NOMINATED NODE   READINESS GATES
nginx   1/1     Running   0          3m39s   10.85.0.3   127.0.0.1   <none>           <none>

We can also test if the web server itself works as intended:

> curl 10.85.0.3
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>

While everything is now up and running, CRI-O also indicates that it has started the seccomp notifier:

…
INFO[…] Injecting seccomp notifier into seccomp profile of container 662a3bb0fdc7dd1bf5a88a8aa8ef9eba6296b593146d988b4a9b85822422febb
…

If we would now run a forbidden syscall inside of the container, then we can expect that the workload gets terminated. Let's give that a try by running chroot in the containers namespaces:

> kubectl exec -it nginx -- bash
root@nginx:/# chroot /tmp
chroot: cannot change root directory to '/tmp': Function not implemented
root@nginx:/# command terminated with exit code 137

The exec session got terminated, so it looks like the container is not running any more:

> kubectl get pods
NAME    READY   STATUS           RESTARTS   AGE
nginx   0/1     seccomp killed   0          96s

Alright, the container got killed by seccomp, do we get any more information about what was going on?

> kubectl describe pod nginx
Name:             nginx
Containers:
  nginx:
    State:          Terminated
      Reason:       seccomp killed
      Message:      Used forbidden syscalls: chroot (1x)
      Exit Code:    137
      Started:      Mon, 14 Nov 2022 12:19:46 +0100
      Finished:     Mon, 14 Nov 2022 12:20:26 +0100

The seccomp notifier feature of CRI-O correctly set the termination reason and message, including which forbidden syscall has been used how often (1x). How often? Yes, the notifier gives the application up to 5 seconds after the last seen syscall until it starts the termination. This means that it's possible to catch multiple forbidden syscalls within one test by avoiding time-consuming trial and errors.

> kubectl exec -it nginx -- chroot /tmp
chroot: cannot change root directory to '/tmp': Function not implemented
command terminated with exit code 125
> kubectl exec -it nginx -- chroot /tmp
chroot: cannot change root directory to '/tmp': Function not implemented
command terminated with exit code 125
> kubectl exec -it nginx -- swapoff -a
command terminated with exit code 32
> kubectl exec -it nginx -- swapoff -a
command terminated with exit code 32
> kubectl describe pod nginx | grep Message
      Message:      Used forbidden syscalls: chroot (2x), swapoff (2x)

The CRI-O metrics will also reflect that:

> curl -sf localhost:9090/metrics | grep seccomp_notifier
# HELP container_runtime_crio_containers_seccomp_notifier_count_total Amount of containers stopped because they used a forbidden syscalls by their name
# TYPE container_runtime_crio_containers_seccomp_notifier_count_total counter
container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (1x)"} 1
container_runtime_crio_containers_seccomp_notifier_count_total{name="…",syscalls="chroot (2x), swapoff (2x)"} 1

How does it work in detail? CRI-O uses the chosen seccomp profile and injects the action SCMP_ACT_NOTIFY instead of SCMP_ACT_ERRNO, SCMP_ACT_KILL, SCMP_ACT_KILL_PROCESS or SCMP_ACT_KILL_THREAD. It also sets a local listener path which will be used by the lower level OCI runtime (runc or crun) to create the seccomp notifier socket. If the connection between the socket and CRI-O has been established, then CRI-O will receive notifications for each syscall being interfered by seccomp. CRI-O stores the syscalls, allows a bit of timeout for them to arrive and then terminates the container if the chosen seccompNotifierAction=stop. Unfortunately, the seccomp notifier is not able to notify on the defaultAction, which means that it's required to have a list of syscalls to test for custom profiles. CRI-O does also state that limitation in the logs:

INFO[…] The seccomp profile default action SCMP_ACT_ERRNO cannot be overridden to SCMP_ACT_NOTIFY,
        which means that syscalls using that default action can't be traced by the notifier

As a conclusion, the seccomp notifier implementation in CRI-O can be used to verify if your applications behave correctly when using RuntimeDefault or any other custom profile. Alerts can be created based on the metrics to create long running test scenarios around that feature. Making seccomp understandable and easier to use will increase adoption as well as help us to move towards a more secure Kubernetes by default!

Thank you for reading this blog post. If you'd like to read more about the seccomp notifier, checkout the following resources:

Boosting Kubernetes container runtime observability with OpenTelemetry

When speaking about observability in the cloud native space, then probably everyone will mention OpenTelemetry (OTEL) at some point in the conversation. That's great, because the community needs standards to rely on for developing all cluster components into the same direction. OpenTelemetry enables us to combine logs, metrics, traces and other contextual information (called baggage) into a single resource. Cluster administrators or software engineers can use this resource to get a viewport about what is going on in the cluster over a defined period of time. But how can Kubernetes itself make use of this technology stack?

Kubernetes consists of multiple components where some are independent and others are stacked together. Looking at the architecture from a container runtime perspective, then there are from the top to the bottom:

  • kube-apiserver: Validates and configures data for the API objects
  • kubelet: Agent running on each node
  • CRI runtime: Container Runtime Interface (CRI) compatible container runtime like CRI-O or containerd
  • OCI runtime: Lower level Open Container Initiative (OCI) runtime like runc or crun
  • Linux kernel or Microsoft Windows: Underlying operating system

That means if we encounter a problem with running containers in Kubernetes, then we start looking at one of those components. Finding the root cause for problems is one of the most time consuming actions we face with the increased architectural complexity from today's cluster setups. Even if we know the component which seems to cause the issue, we still have to take the others into account to maintain a mental timeline of events which are going on. How do we achieve that? Well, most folks will probably stick to scraping logs, filtering them and assembling them together over the components borders. We also have metrics, right? Correct, but bringing metrics values in correlation with plain logs makes it even harder to track what is going on. Some metrics are also not made for debugging purposes. They have been defined based on the end user perspective of the cluster for linking usable alerts and not for developers debugging a cluster setup.

OpenTelemetry to the rescue: the project aims to combine signals such as traces, metrics and logs together to maintain the right viewport on the cluster state.

What is the current state of OpenTelemetry tracing in Kubernetes? From an API server perspective, we have alpha support for tracing since Kubernetes v1.22, which will graduate to beta in one of the upcoming releases. Unfortunately the beta graduation has missed the v1.26 Kubernetes release. The design proposal can be found in the API Server Tracing Kubernetes Enhancement Proposal (KEP) which provides more information about it.

The kubelet tracing part is tracked in another KEP, which was implemented in an alpha state in Kubernetes v1.25. A beta graduation is not planned as time of writing, but more may come in the v1.27 release cycle. There are other side-efforts going on beside both KEPs, for example klog is considering OTEL support, which would boost the observability by linking log messages to existing traces. Within SIG Instrumentation and SIG Node, we're also discussing how to link the kubelet traces together, because right now they're focused on the gRPC calls between the kubelet and the CRI container runtime.

CRI-O features OpenTelemetry tracing support since v1.23.0 and is working on continuously improving them, for example by attaching the logs to the traces or extending the spans to logical parts of the application. This helps users of the traces to gain the same information like parsing the logs, but with enhanced capabilities of scoping and filtering to other OTEL signals. The CRI-O maintainers are also working on a container monitoring replacement for conmon, which is called conmon-rs and is purely written in Rust. One benefit of having a Rust implementation is to be able to add features like OpenTelemetry support, because the crates (libraries) for those already exist. This allows a tight integration with CRI-O and lets consumers see the most low level tracing data from their containers.

The containerd folks added tracing support since v1.6.0, which is available by using a plugin. Lower level OCI runtimes like runc or crun feature no support for OTEL at all and it does not seem to exist a plan for that. We always have to consider that there is a performance overhead when collecting the traces as well as exporting them to a data sink. I still think it would be worth an evaluation on how extended telemetry collection could look like in OCI runtimes. Let's see if the Rust OCI runtime youki is considering something like that in the future.

I'll show you how to give it a try. For my demo I'll stick to a stack with a single local node that has runc, conmon-rs, CRI-O, and a kubelet. To enable tracing in the kubelet, I need to apply the following KubeletConfiguration:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletTracing: true
tracing:
  samplingRatePerMillion: 1000000

A samplingRatePerMillion equally to one million will internally translate to sampling everything. A similar configuration has to be applied to CRI-O; I can either start the crio binary with --enable-tracing and --tracing-sampling-rate-per-million 1000000 or we use a drop-in configuration like this:

cat /etc/crio/crio.conf.d/99-tracing.conf
[crio.tracing]
enable_tracing = true
tracing_sampling_rate_per_million = 1000000

To configure CRI-O to use conmon-rs, you require at least the latest CRI-O v1.25.x and conmon-rs v0.4.0. Then a configuration drop-in like this can be used to make CRI-O use conmon-rs:

cat /etc/crio/crio.conf.d/99-runtimes.conf
[crio.runtime]
default_runtime = "runc"

[crio.runtime.runtimes.runc]
runtime_type = "pod"
monitor_path = "/path/to/conmonrs" # or will be looked up in $PATH

That's it, the default configuration will point to an OpenTelemetry collector gRPC endpoint of localhost:4317, which has to be up and running as well. There are multiple ways to run OTLP as described in the docs, but it's also possible to kubectl proxy into an existing instance running within Kubernetes.

If everything is set up, then the collector should log that there are incoming traces:

ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope go.opentelemetry.io/otel/sdk/tracer
Span #0
    Trace ID       : 71896e69f7d337730dfedb6356e74f01
    Parent ID      : a2a7714534c017e6
    ID             : 1d27dbaf38b9da8b
    Name           : github.com/cri-o/cri-o/server.(*Server).filterSandboxList
    Kind           : SPAN_KIND_INTERNAL
    Start time     : 2022-11-15 09:50:20.060325562 +0000 UTC
    End time       : 2022-11-15 09:50:20.060326291 +0000 UTC
    Status code    : STATUS_CODE_UNSET
    Status message :
Span #1
    Trace ID       : 71896e69f7d337730dfedb6356e74f01
    Parent ID      : a837a005d4389579
    ID             : a2a7714534c017e6
    Name           : github.com/cri-o/cri-o/server.(*Server).ListPodSandbox
    Kind           : SPAN_KIND_INTERNAL
    Start time     : 2022-11-15 09:50:20.060321973 +0000 UTC
    End time       : 2022-11-15 09:50:20.060330602 +0000 UTC
    Status code    : STATUS_CODE_UNSET
    Status message :
Span #2
    Trace ID       : fae6742709d51a9b6606b6cb9f381b96
    Parent ID      : 3755d12b32610516
    ID             : 0492afd26519b4b0
    Name           : github.com/cri-o/cri-o/server.(*Server).filterContainerList
    Kind           : SPAN_KIND_INTERNAL
    Start time     : 2022-11-15 09:50:20.0607746 +0000 UTC
    End time       : 2022-11-15 09:50:20.060795505 +0000 UTC
    Status code    : STATUS_CODE_UNSET
    Status message :
Events:
SpanEvent #0
     -> Name: log
     -> Timestamp: 2022-11-15 09:50:20.060778668 +0000 UTC
     -> DroppedAttributesCount: 0
     -> Attributes::
          -> id: Str(adf791e5-2eb8-4425-b092-f217923fef93)
          -> log.message: Str(No filters were applied, returning full container list)
          -> log.severity: Str(DEBUG)
          -> name: Str(/runtime.v1.RuntimeService/ListContainers)

I can see that the spans have a trace ID and typically have a parent attached. Events such as logs are part of the output as well. In the above case, the kubelet is periodically triggering a ListPodSandbox RPC to CRI-O caused by the Pod Lifecycle Event Generator (PLEG). Displaying those traces can be done via, for example, Jaeger. When running the tracing stack locally, then a Jaeger instance should be exposed on http://localhost:16686 per default.

The ListPodSandbox requests are directly visible within the Jaeger UI:

ListPodSandbox RPC in the Jaeger UI

That's not too exciting, so I'll run a workload directly via kubectl:

kubectl run -it --rm --restart=Never --image=alpine alpine -- echo hi
hi
pod "alpine" deleted

Looking now at Jaeger, we can see that we have traces for conmonrs, crio as well as the kubelet for the RunPodSandbox and CreateContainer CRI RPCs:

Container creation in the Jaeger UI

The kubelet and CRI-O spans are connected to each other to make investigation easier. If we now take a closer look at the spans, then we can see that CRI-O's logs are correctly accosted with the corresponding functionality. For example we can extract the container user from the traces like this:

CRI-O in the Jaeger UI

The lower level spans of conmon-rs are also part of this trace. For example conmon-rs maintains an internal read_loop for handling IO between the container and the end user. The logs for reading and writing bytes are part of the span. The same applies to the wait_for_exit_code span, which tells us that the container exited successfully with code 0:

conmon-rs in the Jaeger UI

Having all that information at hand side by side to the filtering capabilities of Jaeger makes the whole stack a great solution for debugging container issues! Mentioning the "whole stack" also shows the biggest downside of the overall approach: Compared to parsing logs it adds a noticeable overhead on top of the cluster setup. Users have to maintain a sink like Elasticsearch to persist the data, expose the Jaeger UI and possibly take the performance drawback into account. Anyways, it's still one of the best ways to increase the observability aspect of Kubernetes.

Thank you for reading this blog post, I'm pretty sure we're looking into a bright future for OpenTelemetry support in Kubernetes to make troubleshooting simpler.

registry.k8s.io: faster, cheaper and Generally Available (GA)

Starting with Kubernetes 1.25, our container image registry has changed from k8s.gcr.io to registry.k8s.io. This new registry spreads the load across multiple Cloud Providers & Regions, functioning as a sort of content delivery network (CDN) for Kubernetes container images. This change reduces the project’s reliance on a single entity and provides a faster download experience for a large number of users.

TL;DR: What you need to know about this change

  • Container images for Kubernetes releases from 1.25 1.27 onward are not published to k8s.gcr.io, only to registry.k8s.io.
  • In the upcoming December patch releases, the new registry domain default will be backported to all branches still in support (1.22, 1.23, 1.24).
  • If you run in a restricted environment and apply strict domain/IP address access policies limited to k8s.gcr.io, the image pulls will not function after the migration to this new registry. For these users, the recommended method is to mirror the release images to a private registry.

If you’d like to know more about why we made this change, or some potential issues you might run into, keep reading.

Why has Kubernetes changed to a different image registry?

k8s.gcr.io is hosted on a custom Google Container Registry (GCR) domain that was setup solely for the Kubernetes project. This has worked well since the inception of the project, and we thank Google for providing these resources, but today there are other cloud providers and vendors that would like to host images to provide a better experience for the people on their platforms. In addition to Google’s renewed commitment to donate $3 million to support the project's infrastructure, Amazon announced a matching donation during their Kubecon NA 2022 keynote in Detroit. This will provide a better experience for users (closer servers = faster downloads) and will reduce the egress bandwidth and costs from GCR at the same time. registry.k8s.io will spread the load between Google and Amazon, with other providers to follow in the future.

Why isn’t there a stable list of domains/IPs? Why can’t I restrict image pulls?

registry.k8s.io is a secure blob redirector that connects clients to the closest cloud provider. The nature of this change means that a client pulling an image could be redirected to any one of a large number of backends. We expect the set of backends to keep changing and will only increase as more and more cloud providers and vendors come on board to help mirror the release images.

Restrictive control mechanisms like man-in-the-middle proxies or network policies that restrict access to a specific list of IPs/domains will break with this change. For these scenarios, we encourage you to mirror the release images to a local registry that you have strict control over.

For more information on this policy, please see the stability section of the registry.k8s.io documentation.

What kind of errors will I see? How will I know if I’m still using the old address?

Errors may depend on what kind of container runtime you are using, and what endpoint you are routed to, but it should present as a container failing to be created with the warning FailedCreatePodSandBox.

Below is an example error message showing a proxied deployment failing to pull due to an unknown certificate:

FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = Error response from daemon: Head “https://us-west1-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.8”: x509: certificate signed by unknown authority

I’m impacted by this change, how do I revert to the old registry address?

If using the new registry domain name is not an option, you can revert to the old domain name for cluster versions less than 1.25. Keep in mind that, eventually, you will have to switch to the new registry, as new image tags will no longer be pushed to GCR.

Reverting the registry name in kubeadm

The registry used by kubeadm to pull its images can be controlled by two methods:

Setting the --image-repository flag.

kubeadm init --image-repository=k8s.gcr.io

Or in kubeadm config ClusterConfiguration:

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
imageRepository: "k8s.gcr.io"

Reverting the Registry Name in kubelet

The image used by kubelet for the pod sandbox (pause) can be overridden by configuring your container runtime or by setting the --pod-infra-container-image flag depending on the version of Kubernetes you are using.

Other runtimes: containerd, CRI-O, cri-dockerd.

When using dockershim before v1.23:

kubelet --pod-infra-container-image=k8s.gcr.io/pause:3.5

Legacy container registry freeze

k8s.gcr.io Image Registry Will Be Frozen From the 3rd of April 2023 announces the freeze of the legacy k8s.gcr.io image registry. Read that article for more details.

Acknowledgments

Change is hard, and evolving our image-serving platform is needed to ensure a sustainable future for the project. We strive to make things better for everyone using Kubernetes. Many contributors from all corners of our community have been working long and hard to ensure we are making the best decisions possible, executing plans, and doing our best to communicate those plans.

Thanks to Aaron Crickenberger, Arnaud Meukam, Benjamin Elder, Caleb Woodbine, Davanum Srinivas, Mahamed Ali, and Tim Hockin from SIG K8s Infra, Brian McQueen, and Sergey Kanzhelev from SIG Node, Lubomir Ivanov from SIG Cluster Lifecycle, Adolfo García Veytia, Jeremy Rickard, Sascha Grunert, and Stephen Augustus from SIG Release, Bob Killen and Kaslin Fields from SIG Contribex, Tim Allclair from the Security Response Committee. Also a big thank you to our friends acting as liaisons with our cloud provider partners: Jay Pipes from Amazon and Jon Johnson Jr. from Google.

This article was updated on the 28th of February 2023.

Kubernetes Removals, Deprecations, and Major Changes in 1.26

Change is an integral part of the Kubernetes life-cycle: as Kubernetes grows and matures, features may be deprecated, removed, or replaced with improvements for the health of the project. For Kubernetes v1.26 there are several planned: this article identifies and describes some of them, based on the information available at this mid-cycle point in the v1.26 release process, which is still ongoing and can introduce additional changes.

The Kubernetes API Removal and Deprecation process

The Kubernetes project has a well-documented deprecation policy for features. This policy states that stable APIs may only be deprecated when a newer, stable version of that same API is available and that APIs have a minimum lifetime for each stability level. A deprecated API is one that has been marked for removal in a future Kubernetes release; it will continue to function until removal (at least one year from the deprecation), but usage will result in a warning being displayed. Removed APIs are no longer available in the current version, at which point you must migrate to using the replacement.

  • Generally available (GA) or stable API versions may be marked as deprecated but must not be removed within a major version of Kubernetes.
  • Beta or pre-release API versions must be supported for 3 releases after deprecation.
  • Alpha or experimental API versions may be removed in any release without prior deprecation notice.

Whether an API is removed as a result of a feature graduating from beta to stable or because that API simply did not succeed, all removals comply with this deprecation policy. Whenever an API is removed, migration options are communicated in the documentation.

A note about the removal of the CRI v1alpha2 API and containerd 1.5 support

Following the adoption of the Container Runtime Interface (CRI) and the [removal of dockershim] in v1.24 , the CRI is the supported and documented way through which Kubernetes interacts with different container runtimes. Each kubelet negotiates which version of CRI to use with the container runtime on that node.

The Kubernetes project recommends using CRI version v1; in Kubernetes v1.25 the kubelet can also negotiate the use of CRI v1alpha2 (which was deprecated along at the same time as adding support for the stable v1 interface).

Kubernetes v1.26 will not support CRI v1alpha2. That removal will result in the kubelet not registering the node if the container runtime doesn't support CRI v1. This means that containerd minor version 1.5 and older will not be supported in Kubernetes 1.26; if you use containerd, you will need to upgrade to containerd version 1.6.0 or later before you upgrade that node to Kubernetes v1.26. Other container runtimes that only support the v1alpha2 are equally affected: if that affects you, you should contact the container runtime vendor for advice or check their website for additional instructions in how to move forward.

If you want to benefit from v1.26 features and still use an older container runtime, you can run an older kubelet. The supported skew for the kubelet allows you to run a v1.25 kubelet, which still is still compatible with v1alpha2 CRI support, even if you upgrade the control plane to the 1.26 minor release of Kubernetes.

As well as container runtimes themselves, that there are tools like stargz-snapshotter that act as a proxy between kubelet and container runtime and those also might be affected.

Deprecations and removals in Kubernetes v1.26

In addition to the above, Kubernetes v1.26 is targeted to include several additional removals and deprecations.

Removal of the v1beta1 flow control API group

The flowcontrol.apiserver.k8s.io/v1beta1 API version of FlowSchema and PriorityLevelConfiguration will no longer be served in v1.26. Users should migrate manifests and API clients to use the flowcontrol.apiserver.k8s.io/v1beta2 API version, available since v1.23.

Removal of the v2beta2 HorizontalPodAutoscaler API

The autoscaling/v2beta2 API version of HorizontalPodAutoscaler will no longer be served in v1.26. Users should migrate manifests and API clients to use the autoscaling/v2 API version, available since v1.23.

Removal of in-tree credential management code

In this upcoming release, legacy vendor-specific authentication code that is part of Kubernetes will be removed from both client-go and kubectl. The existing mechanism supports authentication for two specific cloud providers: Azure and Google Cloud. In its place, Kubernetes already offers a vendor-neutral authentication plugin mechanism - you can switch over right now, before the v1.26 release happens. If you're affected, you can find additional guidance on how to proceed for Azure and for Google Cloud.

Removal of kube-proxy userspace modes

The userspace proxy mode, deprecated for over a year, is no longer supported on either Linux or Windows and will be removed in this release. Users should use iptables or ipvs on Linux, or kernelspace on Windows: using --mode userspace will now fail.

Removal of in-tree OpenStack cloud provider

Kubernetes is switching from in-tree code for storage integrations, in favor of the Container Storage Interface (CSI). As part of this, Kubernetes v1.26 will remove the deprecated in-tree storage integration for OpenStack (the cinder volume type). You should migrate to external cloud provider and CSI driver from https://github.com/kubernetes/cloud-provider-openstack instead. For more information, visit Cinder in-tree to CSI driver migration.

Removal of the GlusterFS in-tree driver

The in-tree GlusterFS driver was deprecated in v1.25, and will be removed from Kubernetes v1.26.

Deprecation of non-inclusive kubectl flag

As part of the implementation effort of the Inclusive Naming Initiative, the --prune-whitelist flag will be deprecated, and replaced with --prune-allowlist. Users that use this flag are strongly advised to make the necessary changes prior to the final removal of the flag, in a future release.

Removal of dynamic kubelet configuration

Dynamic kubelet configuration allowed new kubelet configurations to be rolled out via the Kubernetes API, even in a live cluster. A cluster operator could reconfigure the kubelet on a Node by specifying a ConfigMap that contained the configuration data that the kubelet should use. Dynamic kubelet configuration was removed from the kubelet in v1.24, and will be removed from the API server in the v1.26 release.

Deprecations for kube-apiserver command line arguments

The --master-service-namespace command line argument to the kube-apiserver doesn't have any effect, and was already informally deprecated. That command line argument will be formally marked as deprecated in v1.26, preparing for its removal in a future release. The Kubernetes project does not expect any impact from this deprecation and removal.

Deprecations for kubectl run command line arguments

Several unused option arguments for the kubectl run subcommand will be marked as deprecated, including:

  • --cascade
  • --filename
  • --force
  • --grace-period
  • --kustomize
  • --recursive
  • --timeout
  • --wait

These arguments are already ignored so no impact is expected: the explicit deprecation sets a warning message and prepares the removal of the arguments in a future release.

Removal of legacy command line arguments relating to logging

Kubernetes v1.26 will remove some command line arguments relating to logging. These command line arguments were already deprecated. For more information, see Deprecate klog specific flags in Kubernetes Components.

Looking ahead

The official list of API removals planned for Kubernetes 1.27 includes:

  • All beta versions of the CSIStorageCapacity API; specifically: storage.k8s.io/v1beta1

Want to know more?

Deprecations are announced in the Kubernetes release notes. You can see the announcements of pending deprecations in the release notes for:

We will formally announce the deprecations that come with Kubernetes 1.26 as part of the CHANGELOG for that release.

Live and let live with Kluctl and Server Side Apply

This blog post was inspired by a previous Kubernetes blog post about Advanced Server Side Apply. The author of said blog post listed multiple benefits for applications and controllers when switching to server-side apply (from now on abbreviated with SSA). Especially the chapter about CI/CD systems motivated me to respond and write down my thoughts and experiences.

These thoughts and experiences are the results of me working on Kluctl for the past 2 years. I describe Kluctl as "The missing glue to put together large Kubernetes deployments, composed of multiple smaller parts (Helm/Kustomize/...) in a manageable and unified way."

To get a basic understanding of Kluctl, I suggest to visit the kluctl.io website and read through the documentation and tutorials, for example the microservices demo tutorial. As an alternative, you can watch Hands-on Introduction to kluctl from the Rawkode Academy YouTube channel which shows a hands-on demo session.

There is also a Kluctl delivery scenario for the podtato-head demo project.

Live and let live

One of the main philosophies that Kluctl follows is "live and let live", meaning that it will try its best to work in conjunction with any other tool or controller running outside or inside your clusters. Kluctl will not overwrite any fields that it lost ownership of, unless you explicitly tell it to do so.

Achieving this would not have been possible (or at least several magnitudes harder) without the use of SSA. Server-side apply allows Kluctl to detect when ownership for a field got lost, for example when another controller or operator updates that field to another value. Kluctl can then decide on a field-by-field basis if force-applying is required before retrying based on these decisions.

The days before SSA

The first versions of Kluctl were based on shelling out to kubectl and thus implicitly relied on client-side apply. At that time, SSA was still alpha and quite buggy. And to be honest, I didn't even know it was a thing at that time.

The way client-side apply worked had some serious drawbacks. The most obvious one (it was guaranteed that you'd stumble on this by yourself if enough time passed) is that it relied on an annotation (kubectl.kubernetes.io/last-applied-configuration) being added to the object, bringing in all the limitations and issues with huge annotation values. A good example of such issues are CRDs being so large, that they don't fit into the annotation's value anymore.

Another drawback can be seen just by looking at the name (client-side apply). Being client side means that each client has to provide the apply-logic on its own, which at that time was only properly implemented inside kubectl, making it hard to be replicated inside controllers.

This added kubectl as a dependency (either as an executable or in the form of Go packages) to all controllers that wanted to leverage the apply-logic.

However, even if one managed to get client-side apply running from inside a controller, you ended up with a solution that gave no control over how it worked internally. As an example, there was no way to individually decide which fields to overwrite in case of external changes and which ones to let go.

Discovering SSA apply

I was never happy with the solution described above and then somehow stumbled across server-side apply, which was still in beta at that time. Experimenting with it via kubectl apply --server-side revealed immediately that the true power of SSA can not be easily leveraged by shelling out to kubectl.

The way SSA is implemented in kubectl does not allow enough control over conflict resolution as it can only switch between "not force-applying anything and erroring out" and "force-applying everything without showing any mercy!".

The API documentation however made it clear that SSA is able to control conflict resolution on field level, simply by choosing which fields to include and which fields to omit from the supplied object.

Moving away from kubectl

This meant that Kluctl had to move away from shelling out to kubectl first. Only after that was done, I would have been able to properly implement SSA with its powerful conflict resolution.

To achieve this, I first implemented access to the target clusters via a Kubernetes client library. This had the nice side effect of dramatically speeding up Kluctl as well. It also improved the security and usability of Kluctl by ensuring that a running Kluctl command could not be messed around with by externally modifying the kubeconfig while it was running.

Implementing SSA

After switching to a Kubernetes client library, leveraging SSA felt easy. Kluctl now has to send each manifest to the API server as part of a PATCH request, which signals that Kluctl wants to perform a SSA operation. The API server then responds with an OK response (HTTP status code 200), or with a Conflict response (HTTP status 409).

In case of a Conflict response, the body of that response includes machine-readable details about the conflicts. Kluctl can then use these details to figure out which fields are in conflict and which actors (field managers) have taken ownership of the conflicted fields.

Then, for each field, Kluctl will decide if the conflict should be ignored or if it should be force-applied. If any field needs to be force-applied, Kluctl will retry the apply operation with the ignored fields omitted and the force flag being set on the API call.

In case a conflict is ignored, Kluctl will issue a warning to the user so that the user can react properly (or ignore it forever...).

That's basically it. That is all that is required to leverage SSA. Big thanks and thumbs-up to the Kubernetes developers who made this possible!

Conflict Resolution

Kluctl has a few simple rules to figure out if a conflict should be ignored or force-applied.

It first checks the field's actor (the field manager) against a list of known field manager strings from tools that are frequently used to perform manual modifications. These are for example kubectl and k9s. Any modifications performed with these tools are considered "temporary" and will be overwritten by Kluctl.

If you're using Kluctl along with kubectl where you don't want the changes from kubectl to be overwritten (for example, using in a script) then you can specify --field-manager=<manager-name> on the command line to kubectl, and Kluctl doesn't apply its special heuristic.

If the field manager is not known by Kluctl, it will check if force-applying is requested for that field. Force-applying can be requested in different ways:

  1. By passing --force-apply to Kluctl. This will cause ALL fields to be force-applied on conflicts.
  2. By adding the kluctl.io/force-apply=true annotation to the object in question. This will cause all fields of that object to be force-applied on conflicts.
  3. By adding the kluctl.io/force-apply-field=my.json.path annotation to the object in question. This causes only fields matching the JSON path to be force-applied on conflicts.

Marking a field to be force-applied is required whenever some other actor is known to erroneously claim fields (the ECK operator does this to the nodeSets field for example), you can ensure that Kluctl always overwrites these fields to the original or a new value.

In the future, Kluctl will allow even more control about conflict resolution. For example, the CLI will allow to control force-applying on field level.

DevOps vs Controllers

So how does SSA in Kluctl lead to "live and let live"?

It allows the co-existence of classical pipelines (e.g. Github Actions or Gitlab CI), controllers (e.g. the HPA controller or GitOps style controllers) and even admins running deployments from their local machines.

Wherever you are on your infrastructure automation journey, Kluctl has a place for you. From running deployments using a script on your PC, all the way to fully automated CI/CD with the pipelines themselves defined in code, Kluctl aims to complement the workflow that's right for you.

And even after fully automating everything, you can intervene with your admin permissions if required and run a kubectl command that will modify a field and prevent Kluctl from overwriting it. You'd just have to switch to a field-manager (e.g. "admin-override") that is not overwritten by Kluctl.

A few takeaways

Server-side apply is a great feature and essential for the future of controllers and tools in Kubernetes. The amount of controllers involved will only get more and proper modes of working together are a must.

I believe that CI/CD-related controllers and tools should leverage SSA to perform proper conflict resolution. I also believe that other controllers (e.g. Flux and ArgoCD) would benefit from the same kind of conflict resolution control on field-level.

It might even be a good idea to come together and work on a standardized set of annotations to control conflict resolution for CI/CD-related tooling.

On the other side, non CI/CD-related controllers should ensure that they don't cause unnecessary conflicts when modifying objects. As of the server-side apply documentation, it is strongly recommended for controllers to always perform force-applying. When following this recommendation, controllers should really make sure that only fields related to the controller are included in the applied object. Otherwise, unnecessary conflicts are guaranteed.

In many cases, controllers are meant to only modify the status subresource of the objects they manage. In this case, controllers should only patch the status subresource and not touch the actual object. If this is followed, conflicts become impossible to occur.

If you are a developer of such a controller and unsure about your controller adhering to the above, simply try to retrieve an object managed by your controller and look at the managedFields (you'll need to pass --show-managed-fields -oyaml to kubectl get) to see if some field got claimed unexpectedly.

Server Side Apply Is Great And You Should Be Using It

Server-side apply (SSA) has now been GA for a few releases, and I have found myself in a number of conversations, recommending that people / teams in various situations use it. So I’d like to write down some of those reasons.

Obvious (and not-so-obvious) benefits of SSA

A list of improvements / niceties you get from switching from various things to Server-side apply!

  • Versus client-side-apply (that is, plain kubectl apply):
    • The system gives you conflicts when you accidentally fight with another actor over the value of a field!
    • When combined with --dry-run, there’s no chance of accidentally running a client-side dry run instead of a server side dry run.
  • Versus hand-rolling patches:
    • The SSA patch format is extremely natural to write, with no weird syntax. It’s just a regular object, but you can (and should) omit any field you don’t care about.
    • The old patch format (“strategic merge patch”) was ad-hoc and still has some bugs; JSON-patch and JSON merge-patch fail to handle some cases that are common in the Kubernetes API, namely lists with items that should be recursively merged based on a “name” or other identifying field.
    • There’s also now great go-language library support for building apply calls programmatically!
    • You can use SSA to explicitly delete fields you don’t “own” by setting them to null, which makes it a feature-complete replacement for all of the old patch formats.
  • Versus shelling out to kubectl:
  • Versus GET-modify-PUT:
    • (This one is more complicated and you can skip it if you've never written a controller!)
    • To use GET-modify-PUT correctly, you have to handle and retry a write failure in the case that someone else has modified the object in any way between your GET and PUT. This is an “optimistic concurrency failure” when it happens.
    • SSA offloads this task to the server– you only have to retry if there’s a conflict, and the conflicts you can get are all meaningful, like when you’re actually trying to take a field away from another actor in the system.
    • To put it another way, if 10 actors do a GET-modify-PUT cycle at the same time, 9 will get an optimistic concurrency failure and have to retry, then 8, etc, for up to 50 total GET-PUT attempts in the worst case (that’s .5N^2 GET and PUT calls for N actors making simultaneous changes). If the actors are using SSA instead, and the changes don’t actually conflict over specific fields, then all the changes can go in in any order. Additionally, SSA changes can often be done without a GET call at all. That’s only N apply requests for N actors, which is a drastic improvement!

How can I use SSA?

Users

Use kubectl apply --server-side! Soon we (SIG API Machinery) hope to make this the default and remove the “client side” apply completely!

Controller authors

There’s two main categories here, but for both of them, you should probably force conflicts when using SSA. This is because your controller probably doesn’t know what to do when some other entity in the system has a different desire than your controller about a particular field. (See the CI/CD section, though!)

Controllers that use either a GET-modify-PUT sequence or a PATCH

This kind of controller GETs an object (possibly from a watch), modifies it, and then PUTs it back to write its changes. Sometimes it constructs a custom PATCH, but the semantics are the same. Most existing controllers (especially those in-tree) work like this.

If your controller is perfect, great! You don’t need to change it. But if you do want to change it, you can take advantage of the new client library’s extract workflow– that is, get the existing object, extract your existing desires, make modifications, and re-apply. For many controllers that were computing the smallest API changes possible, this will be a minor update to the existing implementation.

This workflow avoids the failure mode of accidentally trying to own every field in the object, which is what happens if you just GET the object, make changes, and then apply. (Note that the server will notice you did this and reject your change!)

Reconstructive controllers

This kind of controller wasn't really possible prior to SSA. The idea here is to (whenever something changes etc) reconstruct from scratch the fields of the object as the controller wishes them to be, and then apply the change to the server, letting it figure out the result. I now recommend that new controllers start out this way–it's less fiddly to say what you want an object to look like than it is to say how you want it to change.

The client library supports this method of operation by default.

The only downside is that you may end up sending unneeded apply requests to the API server, even if actually the object already matches your controller’s desires. This doesn't matter if it happens once in a while, but for extremely high-throughput controllers, it might cause a performance problem for the cluster–specifically, the API server. No-op writes are not written to storage (etcd) or broadcast to any watchers, so it’s not really that big of a deal. If you’re worried about this anyway, today you could use the method explained in the previous section, or you could still do it this way for now, and wait for an additional client-side mechanism to suppress zero-change applies.

To get around this downside, why not GET the object and only send your apply if the object needs it? Surprisingly, it doesn't help much – a no-op apply is not very much more work for the API server than an extra GET; and an apply that changes things is cheaper than that same apply with a preceding GET. Worse, since it is a distributed system, something could change between your GET and apply, invalidating your computation. Instead, you can use this optimization on an object retrieved from a cache–then it legitimately will reduce load on the system (at the cost of a delay when a change is needed and the cache is a bit behind).

CI/CD systems

Continuous integration (CI) and/or continuous deployment (CD) systems are a special kind of controller which is doing something like reading manifests from source control (such as a Git repo) and automatically pushing them into the cluster. Perhaps the CI / CD process first generates manifests from a template, then runs some tests, and then deploys a change. Typically, users are the entities pushing changes into source control, although that’s not necessarily always the case.

Some systems like this continuously reconcile with the cluster, others may only operate when a change is pushed to the source control system. The following considerations are important for both, but more so for the continuously reconciling kind.

CI/CD systems are literally controllers, but for the purpose of apply, they are more like users, and unlike other controllers, they need to pay attention to conflicts. Reasoning:

  • Abstractly, CI/CD systems can change anything, which means they could conflict with any controller out there. The recommendation that controllers force conflicts is assuming that controllers change a limited number of things and you can be reasonably sure that they won’t fight with other controllers about those things; that’s clearly not the case for CI/CD controllers.
  • Concrete example: imagine the CI/CD system wants .spec.replicas for some Deployment to be 3, because that is the value that is checked into source code; however there is also a HorizontalPodAutoscaler (HPA) that targets the same deployment. The HPA computes a target scale and decides that there should be 10 replicas. Which should win? I just said that most controllers–including the HPA–should ignore conflicts. The HPA has no idea if it has been enabled incorrectly, and the HPA has no convenient way of informing users of errors.
  • The other common cause of a CI/CD system getting a conflict is probably when it is trying to overwrite a hot-fix (hand-rolled patch) placed there by a system admin / SRE / dev-on-call. You almost certainly don’t want to override that automatically.
  • Of course, sometimes SRE makes an accidental change, or a dev makes an unauthorized change – those you do want to notice and overwrite; however, the CI/CD system can’t tell the difference between these last two cases.

Hopefully this convinces you that CI/CD systems need error paths–a way to back-propagate these conflict errors to humans; in fact, they should have this already, certainly continuous integration systems need some way to report that tests are failing. But maybe I can also say something about how humans can deal with errors:

  • Reject the hotfix: the (human) administrator of the CI/CD system observes the error, and manually force-applies the manifest in question. Then the CI/CD system will be able to apply the manifest successfully and become a co-owner.

    Optional: then the administrator applies a blank manifest (just the object type / namespace / name) to relinquish any fields they became a manager for. if this step is omitted, there's some chance the administrator will end up owning fields and causing an unwanted future conflict.

    Note: why an administrator? I'm assuming that developers which ordinarily push to the CI/CD system and / or its source control system may not have permissions to push directly to the cluster.

  • Accept the hotfix: the author of the change in question sees the conflict, and edits their change to accept the value running in production.

  • Accept then reject: as in the accept option, but after that manifest is applied, and the CI/CD queue owns everything again (so no conflicts), re-apply the original manifest.

  • I can also imagine the CI/CD system permitting you to mark a manifest as “force conflicts” somehow– if there’s demand for this we could consider making a more standardized way to do this. A rigorous version of this which lets you declare exactly which conflicts you intend to force would require support from the API server; in lieu of that, you can make a second manifest with only that subset of fields.

  • Future work: we could imagine an especially advanced CI/CD system that could parse metadata.managedFields data to see who or what they are conflicting with, over what fields, and decide whether or not to ignore the conflict. In fact, this information is also presented in any conflict errors, though perhaps not in an easily machine-parseable format. We (SIG API Machinery) mostly didn't expect that people would want to take this approach — so we would love to know if in fact people want/need the features implied by this approach, such as the ability, when applying to request to override certain conflicts but not others.

    If this sounds like an approach you'd want to take for your own controller, come talk to SIG API Machinery!

Happy applying!

Current State: 2019 Third Party Security Audit of Kubernetes

We expect the brand new Third Party Security Audit of Kubernetes will be published later this month (Oct 2022).

In preparation for that, let's look at the state of findings that were made public as part of the last third party security audit of 2019 that was based on Kubernetes v1.13.4.

Motivation

Craig Ingram has graciously attempted over the years to keep track of the status of the findings reported in the last audit in this issue: kubernetes/kubernetes#81146. This blog post will attempt to dive deeper into this, address any gaps in tracking and become a point in time summary of the state of the findings reported from 2019.

This article should also help readers gain confidence through transparent communication, of work done by the community to address these findings and bubble up any findings that need help from community contributors.

Current State

The status of each issue / finding here is represented in a best effort manner. Authors do not claim to be 100% accurate on the status and welcome any corrections or feedback if the current state is not reflected accurately by commenting directly on the relevant issue.

#TitleIssueStatus
1hostPath PersistentVolumes enable PodSecurityPolicy bypass#81110closed, addressed by kubernetes/website#15756 and kubernetes/kubernetes#109798
2Kubernetes does not facilitate certificate revocation#81111duplicate of #18982 and needs a KEP
3HTTPS connections are not authenticated#81112Largely left as an end user exercise in setting up the right configuration
4TOCTOU when moving PID to manager's cgroup via kubelet#81113Requires Node access for successful exploitation. Fix needed
5Improperly patched directory traversal in kubectl cp#76788closed, assigned CVE-2019-11249, fixed in #80436
6Bearer tokens are revealed in logs#81114closed, assigned CVE-2019-11250, fixed in #81330
7Seccomp is disabled by default#81115closed, addressed by #101943
8Pervasive world-accessible file permissions#81116#112384 ( in progress)
9Environment variables expose sensitive data#81117closed, addressed by #84992 and #84677
10Use of InsecureIgnoreHostKey in SSH connections#81118This feature was removed in v1.22: #102297
11Use of InsecureSkipVerify and other TLS weaknesses#81119Needs a KEP
12kubeadm performs potentially-dangerous reset operations#81120closed, fixed by #81495, #81494, and kubernetes/website#15881
13Overflows when using strconv.Atoi and downcasting the result#81121closed, fixed by #89120
14kubelet can cause an Out of Memory error with a malicious manifest#81122closed, fixed by #76518
15kubectl can cause an Out Of Memory error with a malicious Pod specification#81123Fix needed
16Improper fetching of PIDs allows incorrect cgroup movement#81124Fix needed
17Directory traversal of host logs running kube-apiserver and kubelet#81125closed, fixed by #87273
18Non-constant time password comparison#81126closed, fixed by #81152
19Encryption recommendations not in accordance with best practices#81127Work in Progress
20Adding credentials to containers by default is unsafe#81128Closed, fixed by #89193
21kubelet liveness probes can be used to enumerate host network#81129Needs a KEP
22iSCSI volume storage cleartext secrets in logs#81130closed, fixed by #81215
23Hard coded credential paths#81131closed, awaiting more evidence
24Log rotation is not atomic#81132Fix needed
25Arbitrary file paths without bounding#81133Fix needed.
26Unsafe JSON construction#81134Partially fixed
27kubelet crash due to improperly handled errors#81135Closed. Fixed by #81135
28Legacy tokens do not expire#81136closed, fixed as part of #70679
29CoreDNS leaks internal cluster information across namespaces#81137Closed, resolved with CoreDNS v1.6.2. #81137 (comment)
30Services use questionable default functions#81138Fix needed
31Incorrect docker daemon process name in container manager#81139closed, fixed by #81083
32Use standard formats everywhere#81140Needs a KEP
33Superficial health check provides false sense of safety#81141closed, fixed by #81319
34Hardcoded use of insecure gRPC transport#81142Needs a KEP
35Incorrect handling of Retry-After#81143closed, fixed by #91048
36Incorrect isKernelPid check#81144closed, fixed by #81086
37Kubelet supports insecure TLS ciphersuites#81145closed but fix needed for #91444 (see this comment)

Inspired outcomes

Apart from fixes to the specific issues, the 2019 third party security audit also motivated security focussed enhancements in the next few releases of Kubernetes. One such example is Kubernetes Enhancement Proposal (KEP) 1933 Defend Against Logging Secrets via Static Analysis to prevent exposing secrets to logs with Patrick Rhomberg driving the implementation. As a result of this KEP, go-flow-levee, a taint propagation analysis tool configured to detect logging of secrets, is executed in a script as a Prow presubmit job. This KEP was introduced in v1.20.0 as an alpha feature, then graduated to beta in v1.21.0, and graduated to stable in v1.23.0. As stable, the analysis runs as a blocking presubmit test. This KEP also helped resolve the following issues from the 2019 third party security audit:

Remaining Work

Many of the 37 findings identified were fixed by work from our community members over the last 3 years. However, we still have some work left to do. Here's a breakdown of remaining work with rough estimates on time commitment, complexity and benefits to the ecosystem on fixing these pending issues.

Note:

Anything requiring a KEP (Kubernetes Enhancement Proposal) is considered high time commitment and high complexity. Benefits to Ecosystem are roughly equivalent to risk of keeping the finding unfixed which is determined by Severity Level + Likelihood of a successful vulnerability exploit. These estimates and values in the table below are the authors' personal opinion. An individual or end users' threat model may rate the benefits to fix a particular issue higher or lower.
TitleIssueTime CommitmentComplexityBenefit to Ecosystem
Kubernetes does not facilitate certificate revocation#81111HighHighMedium
Use of InsecureSkipVerify and other TLS weaknesses#81119HighHighMedium
kubectl can cause a local Out Of Memory error with a malicious Pod specification#81123MediumMediumMedium
Improper fetching of PIDs allows incorrect cgroup movement#81124MediumMediumMedium
kubelet liveness probes can be used to enumerate host network#81129HighHighMedium
API Server supports insecure TLS ciphersuites#81145MediumMediumLow
TOCTOU when moving PID to manager's cgroup via kubelet#81113MediumMediumLow
Log rotation is not atomic#81132MediumMediumLow
Arbitrary file paths without bounding#81133MediumMediumLow
Services use questionable default functions#81138MediumMediumLow
Use standard formats everywhere#81140HighHighVery Low
Hardcoded use of insecure gRPC transport#81142HighHighVery Low

To get started on fixing any of these findings that need help, please consider getting involved in Kubernetes SIG Security by joining our bi-weekly meetings or hanging out with us on our Slack Channel.

Introducing Kueue

Whether on-premises or in the cloud, clusters face real constraints for resource usage, quota, and cost management reasons. Regardless of the autoscalling capabilities, clusters have finite capacity. As a result, users want an easy way to fairly and efficiently share resources.

In this article, we introduce Kueue, an open source job queueing controller designed to manage batch jobs as a single unit. Kueue leaves pod-level orchestration to existing stable components of Kubernetes. Kueue natively supports the Kubernetes Job API and offers hooks for integrating other custom-built APIs for batch jobs.

Why Kueue?

Job queueing is a key feature to run batch workloads at scale in both on-premises and cloud environments. The main goal of job queueing is to manage access to a limited pool of resources shared by multiple tenants. Job queueing decides which jobs should wait, which can start immediately, and what resources they can use.

Some of the most desired job queueing requirements include:

  • Quota and budgeting to control who can use what and up to what limit. This is not only needed in clusters with static resources like on-premises, but it is also needed in cloud environments to control spend or usage of scarce resources.
  • Fair sharing of resources between tenants. To maximize the usage of available resources, any unused quota assigned to inactive tenants should be allowed to be shared fairly between active tenants.
  • Flexible placement of jobs across different resource types based on availability. This is important in cloud environments which have heterogeneous resources such as different architectures (GPU or CPU models) and different provisioning modes (spot vs on-demand).
  • Support for autoscaled environments where resources can be provisioned on demand.

Plain Kubernetes doesn't address the above requirements. In normal circumstances, once a Job is created, the job-controller instantly creates the pods and kube-scheduler continuously attempts to assign the pods to nodes. At scale, this situation can work the control plane to death. There is also currently no good way to control at the job level which jobs should get which resources first, and no way to express order or fair sharing. The current ResourceQuota model is not a good fit for these needs because quotas are enforced on resource creation, and there is no queueing of requests. The intent of ResourceQuotas is to provide a builtin reliability mechanism with policies needed by admins to protect clusters from failing over.

In the Kubernetes ecosystem, there are several solutions for job scheduling. However, we found that these alternatives have one or more of the following problems:

  • They replace existing stable components of Kubernetes, like kube-scheduler or the job-controller. This is problematic not only from an operational point of view, but also the duplication in the job APIs causes fragmentation of the ecosystem and reduces portability.
  • They don't integrate with autoscaling, or
  • They lack support for resource flexibility.

How Kueue works

With Kueue we decided to take a different approach to job queueing on Kubernetes that is anchored around the following aspects:

  • Not duplicating existing functionalities already offered by established Kubernetes components for pod scheduling, autoscaling and job lifecycle management.
  • Adding key features that are missing to existing components. For example, we invested in the Job API to cover more use cases like IndexedJob and fixed long standing issues related to pod tracking. While this path takes longer to land features, we believe it is the more sustainable long term solution.
  • Ensuring compatibility with cloud environments where compute resources are elastic and heterogeneous.

For this approach to be feasible, Kueue needs knobs to influence the behavior of those established components so it can effectively manage when and where to start a job. We added those knobs to the Job API in the form of two features:

  • Suspend field, which allows Kueue to signal to the job-controller when to start or stop a Job.
  • Mutable scheduling directives, which allows Kueue to update a Job's .spec.template.spec.nodeSelector before starting the Job. This way, Kueue can control Pod placement while still delegating to kube-scheduler the actual pod-to-node scheduling.

Note that any custom job API can be managed by Kueue if that API offers the above two capabilities.

Resource model

Kueue defines new APIs to address the requirements mentioned at the beginning of this post. The three main APIs are:

  • ResourceFlavor: a cluster-scoped API to define resource flavor available for consumption, like a GPU model. At its core, a ResourceFlavor is a set of labels that mirrors the labels on the nodes that offer those resources.
  • ClusterQueue: a cluster-scoped API to define resource pools by setting quotas for one or more ResourceFlavor.
  • LocalQueue: a namespaced API for grouping and managing single tenant jobs. In its simplest form, a LocalQueue is a pointer to the ClusterQueue that the tenant (modeled as a namespace) can use to start their jobs.

For more details, take a look at the API concepts documentation. While the three APIs may look overwhelming, most of Kueue’s operations are centered around ClusterQueue; the ResourceFlavor and LocalQueue APIs are mainly organizational wrappers.

Example use case

Imagine the following setup for running batch workloads on a Kubernetes cluster on the cloud:

  • You have cluster-autoscaler installed in the cluster to automatically adjust the size of your cluster.
  • There are two types of autoscaled node groups that differ on their provisioning policies: spot and on-demand. The nodes of each group are differentiated by the label instance-type=spot or instance-type=ondemand. Moreover, since not all Jobs can tolerate running on spot nodes, the nodes are tainted with spot=true:NoSchedule.
  • To strike a balance between cost and resource availability, imagine you want Jobs to use up to 1000 cores of on-demand nodes, then use up to 2000 cores of spot nodes.

As an admin for the batch system, you define two ResourceFlavors that represent the two types of nodes:

---
apiVersion: kueue.x-k8s.io/v1alpha2
kind: ResourceFlavor
metadata:
  name: ondemand
  labels:
    instance-type: ondemand 
---
apiVersion: kueue.x-k8s.io/v1alpha2
kind: ResourceFlavor
metadata:
  name: spot
  labels:
    instance-type: spot
taints:
- effect: NoSchedule
  key: spot
  value: "true"

Then you define the quotas by creating a ClusterQueue as follows:

apiVersion: kueue.x-k8s.io/v1alpha2
kind: ClusterQueue
metadata:
  name: research-pool
spec:
  namespaceSelector: {}
  resources:
  - name: "cpu"
    flavors:
    - name: ondemand
      quota:
        min: 1000
    - name: spot
      quota:
        min: 2000

Note that the order of flavors in the ClusterQueue resources matters: Kueue will attempt to fit jobs in the available quotas according to the order unless the job has an explicit affinity to specific flavors.

For each namespace, you define a LocalQueue that points to the ClusterQueue above:

apiVersion: kueue.x-k8s.io/v1alpha2
kind: LocalQueue
metadata:
  name: training
  namespace: team-ml
spec:
  clusterQueue: research-pool

Admins create the above setup once. Batch users are able to find the queues they are allowed to submit to by listing the LocalQueues in their namespace(s). The command is similar to the following: kubectl get -n my-namespace localqueues

To submit work, create a Job and set the kueue.x-k8s.io/queue-name annotation as follows:

apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-job-
  annotations:
    kueue.x-k8s.io/queue-name: training
spec:
  parallelism: 3
  completions: 3
  template:
    spec:
      tolerations:
      - key: spot
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: example-batch-workload
        image: registry.example/batch/calculate-pi:3.14
        args: ["30s"]
        resources:
          requests:
            cpu: 1
      restartPolicy: Never

Kueue intervenes to suspend the Job as soon as it is created. Once the Job is at the head of the ClusterQueue, Kueue evaluates if it can start by checking if the resources requested by the job fit the available quota.

In the above example, the Job tolerates spot resources. If there are previously admitted Jobs consuming all existing on-demand quota but not all of spot’s, Kueue admits the Job using the spot quota. Kueue does this by issuing a single update to the Job object that:

  • Changes the .spec.suspend flag to false
  • Adds the term instance-type: spot to the job's .spec.template.spec.nodeSelector so that when the pods are created by the job controller, those pods can only schedule onto spot nodes.

Finally, if there are available empty nodes with matching node selector terms, then kube-scheduler will directly schedule the pods. If not, then kube-scheduler will initially mark the pods as unschedulable, which will trigger the cluster-autoscaler to provision new nodes.

Future work and getting involved

The example above offers a glimpse of some of Kueue's features including support for quota, resource flexibility, and integration with cluster autoscaler. Kueue also supports fair-sharing, job priorities, and different queueing strategies. Take a look at the Kueue documentation to learn more about those features and how to use Kueue.

We have a number of features that we plan to add to Kueue, such as hierarchical quota, budgets, and support for dynamically sized jobs. In the more immediate future, we are focused on adding support for job preemption.

The latest Kueue release is available on Github; try it out if you run batch workloads on Kubernetes (requires v1.22 or newer). We are in the early stages of this project and we are seeking feedback of all levels, major or minor, so please don’t hesitate to reach out. We’re also open to additional contributors, whether it is to fix or report bugs, or help add new features or write documentation. You can get in touch with us via our repo, mailing list or on Slack.

Last but not least, thanks to all our contributors who made this project possible!

Kubernetes 1.25: alpha support for running Pods with user namespaces

Kubernetes v1.25 introduces the support for user namespaces.

This is a major improvement for running secure workloads in Kubernetes. Each pod will have access only to a limited subset of the available UIDs and GIDs on the system, thus adding a new security layer to protect from other pods running on the same system.

How does it work?

A process running on Linux can use up to 4294967296 different UIDs and GIDs.

User namespaces is a Linux feature that allows mapping a set of users in the container to different users in the host, thus restricting what IDs a process can effectively use. Furthermore, the capabilities granted in a new user namespace do not apply in the host initial namespaces.

Why is it important?

There are mainly two reasons why user namespaces are important:

  • improve security since they restrict the IDs a pod can use, so each pod can run in its own separate environment with unique IDs.

  • enable running workloads as root in a safer manner.

In a user namespace we can map the root user inside the pod to a non-zero ID outside the container, containers believe in running as root while they are a regular unprivileged ID from the host point of view.

The process can keep capabilities that are usually restricted to privileged pods and do it in a safe way since the capabilities granted in a new user namespace do not apply in the host initial namespaces.

How do I enable user namespaces?

At the moment, user namespaces support is opt-in, so you must enable it for a pod setting hostUsers to false under the pod spec stanza:

apiVersion: v1
kind: Pod
spec:
  hostUsers: false
  containers:
  - name: nginx
    image: docker.io/nginx

The feature is behind a feature gate, so make sure to enable the UserNamespacesStatelessPodsSupport gate before you can use the new feature.

The runtime must also support user namespaces:

  • containerd: support is planned for the 1.7 release. See containerd issue #7063 for more details.

  • CRI-O: v1.25 has support for user namespaces.

Support for this in cri-dockerd is not planned yet.

How do I get involved?

You can reach SIG Node by several means:

You can also contact us directly:

  • GitHub / Slack: @rata @giuseppe

Enforce CRD Immutability with CEL Transition Rules

Immutable fields can be found in a few places in the built-in Kubernetes types. For example, you can't change the .metadata.name of an object. Specific objects have fields where changes to existing objects are constrained; for example, the .spec.selector of a Deployment.

Aside from simple immutability, there are other common design patterns such as lists which are append-only, or a map with mutable values and immutable keys.

Until recently the best way to restrict field mutability for CustomResourceDefinitions has been to create a validating admission webhook: this means a lot of complexity for the common case of making a field immutable.

Beta since Kubernetes 1.25, CEL Validation Rules allow CRD authors to express validation constraints on their fields using a rich expression language, CEL. This article explores how you can use validation rules to implement a few common immutability patterns directly in the manifest for a CRD.

Basics of validation rules

The new support for CEL validation rules in Kubernetes allows CRD authors to add complicated admission logic for their resources without writing any code!

For example, A CEL rule to constrain a field maximumSize to be greater than a minimumSize for a CRD might look like the following:

rule: |
    self.maximumSize > self.minimumSize
message: 'Maximum size must be greater than minimum size.'

The rule field contains an expression written in CEL. self is a special keyword in CEL which refers to the object whose type contains the rule.

The message field is an error message which will be sent to Kubernetes clients whenever this particular rule is not satisfied.

For more details about the capabilities and limitations of Validation Rules using CEL, please refer to validation rules. The CEL specification is also a good reference for information specifically related to the language.

Immutability patterns with CEL validation rules

This section implements several common use cases for immutability in Kubernetes CustomResourceDefinitions, using validation rules expressed as kubebuilder marker comments. Resultant OpenAPI generated by the kubebuilder marker comments will also be included so that if you are writing your CRD manifests by hand you can still follow along.

Project setup

To use CEL rules with kubebuilder comments, you first need to set up a Golang project structure with the CRD defined in Go.

You may skip this step if you are not using kubebuilder or are only interested in the resultant OpenAPI extensions.

Begin with a folder structure of a Go module set up like the following. If you have your own project already set up feel free to adapt this tutorial to your liking:

graph LR . --> generate.go . --> pkg --> apis --> stable.example.com --> v1 v1 --> doc.go v1 --> types.go . --> tools.go

This is the typical folder structure used by Kubernetes projects for defining new API resources.

doc.go contains package-level metadata such as the group and the version:

// +groupName=stable.example.com
// +versionName=v1
package v1

types.go contains all type definitions in stable.example.com/v1

package v1

import (
   metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// An empty CRD as an example of defining a type using controller tools
// +kubebuilder:storageversion
// +kubebuilder:subresource:status
type TestCRD struct {
   metav1.TypeMeta   `json:",inline"`
   metav1.ObjectMeta `json:"metadata,omitempty"`

   Spec   TestCRDSpec   `json:"spec,omitempty"`
   Status TestCRDStatus `json:"status,omitempty"`
}

type TestCRDStatus struct {}
type TestCRDSpec struct {
   // You will fill this in as you go along
}

tools.go contains a dependency on controller-gen which will be used to generate the CRD definition:

//go:build tools

package celimmutabilitytutorial

// Force direct dependency on code-generator so that it may be executed with go run
import (
   _ "sigs.k8s.io/controller-tools/cmd/controller-gen"
)

Finally, generate.gocontains a go:generate directive to make use of controller-gen. controller-gen parses our types.go and creates generates CRD yaml files into a crd folder:

package celimmutabilitytutorial

//go:generate go run sigs.k8s.io/controller-tools/cmd/controller-gen crd paths=./pkg/apis/... output:dir=./crds

You may now want to add dependencies for our definitions and test the code generation:

cd cel-immutability-tutorial
go mod init <your-org>/<your-module-name>
go mod tidy
go generate ./...

After running these commands you now have completed the basic project structure. Your folder tree should look like the following:

graph LR . --> crds --> stable.example.com_testcrds.yaml . --> generate.go . --> go.mod . --> go.sum . --> pkg --> apis --> stable.example.com --> v1 v1 --> doc.go v1 --> types.go . --> tools.go

The manifest for the example CRD is now available in crds/stable.example.com_testcrds.yaml.

Immutablility after first modification

A common immutability design pattern is to make the field immutable once it has been first set. This example will throw a validation error if the field after changes after being first initialized.

// +kubebuilder:validation:XValidation:rule="!has(oldSelf.value) || has(self.value)", message="Value is required once set"
type ImmutableSinceFirstWrite struct {
   metav1.TypeMeta   `json:",inline"`
   metav1.ObjectMeta `json:"metadata,omitempty"`

   // +kubebuilder:validation:Optional
   // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
   // +kubebuilder:validation:MaxLength=512
   Value string `json:"value"`
}

The +kubebuilder directives in the comments inform controller-gen how to annotate the generated OpenAPI. The XValidation rule causes the rule to appear among the x-kubernetes-validations OpenAPI extension. Kubernetes then respects the OpenAPI spec to enforce our constraints.

To enforce a field's immutability after its first write, you need to apply the following constraints:

  1. Field must be allowed to be initially unset +kubebuilder:validation:Optional
  2. Once set, field must not be allowed to be removed: !has(oldSelf.value) | has(self.value) (type-scoped rule)
  3. Once set, field must not be allowed to change value self == oldSelf (field-scoped rule)

Also note the additional directive +kubebuilder:validation:MaxLength. CEL requires that all strings have attached max length so that it may estimate the computation cost of the rule. Rules that are too expensive will be rejected. For more information on CEL cost budgeting, check out the other tutorial.

Example usage

Generating and installing the CRD should succeed:

# Ensure the CRD yaml is generated by controller-gen
go generate ./...
kubectl apply -f crds/stable.example.com_immutablesincefirstwrites.yaml
customresourcedefinition.apiextensions.k8s.io/immutablesincefirstwrites.stable.example.com created

Creating initial empty object with no value is permitted since value is optional:

kubectl apply -f - <<EOF
---
apiVersion: stable.example.com/v1
kind: ImmutableSinceFirstWrite
metadata:
  name: test1
EOF
immutablesincefirstwrite.stable.example.com/test1 created

The initial modification of value succeeds:

kubectl apply -f - <<EOF
---
apiVersion: stable.example.com/v1
kind: ImmutableSinceFirstWrite
metadata:
  name: test1
value: Hello, world!
EOF
immutablesincefirstwrite.stable.example.com/test1 configured

An attempt to change value is blocked by the field-level validation rule. Note the error message shown to the user comes from the validation rule.

kubectl apply -f - <<EOF
---
apiVersion: stable.example.com/v1
kind: ImmutableSinceFirstWrite
metadata:
  name: test1
value: Hello, new world!
EOF
The ImmutableSinceFirstWrite "test1" is invalid: value: Invalid value: "string": Value is immutable

An attempt to remove the value field altogether is blocked by the other validation rule on the type. The error message also comes from the rule.

kubectl apply -f - <<EOF
---
apiVersion: stable.example.com/v1
kind: ImmutableSinceFirstWrite
metadata:
  name: test1
EOF
The ImmutableSinceFirstWrite "test1" is invalid: <nil>: Invalid value: "object": Value is required once set

Generated schema

Note that in the generated schema there are two separate rule locations. One is directly attached to the property immutable_since_first_write. The other rule is associated with the crd type itself.

openAPIV3Schema:
  properties:
    value:
      maxLength: 512
      type: string
      x-kubernetes-validations:
      - message: Value is immutable
        rule: self == oldSelf
  type: object
  x-kubernetes-validations:
  - message: Value is required once set
    rule: '!has(oldSelf.value) || has(self.value)'

Immutability upon object creation

A field which is immutable upon creation time is implemented similarly to the earlier example. The difference is that that field is marked required, and the type-scoped rule is no longer necessary.

type ImmutableSinceCreation struct {
   metav1.TypeMeta   `json:",inline"`
   metav1.ObjectMeta `json:"metadata,omitempty"`

   // +kubebuilder:validation:Required
   // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
   // +kubebuilder:validation:MaxLength=512
   Value string `json:"value"`
}

This field will be required when the object is created, and after that point will not be allowed to be modified. Our CEL Validation Rule self == oldSelf

Usage example

Generating and installing the CRD should succeed:

# Ensure the CRD yaml is generated by controller-gen
go generate ./...
kubectl apply -f crds/stable.example.com_immutablesincecreations.yaml
customresourcedefinition.apiextensions.k8s.io/immutablesincecreations.stable.example.com created

Applying an object without the required field should fail:

kubectl apply -f - <<EOF
apiVersion: stable.example.com/v1
kind: ImmutableSinceCreation
metadata:
  name: test1
EOF
The ImmutableSinceCreation "test1" is invalid:
* value: Required value
* <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation

Now that the field has been added, the operation is permitted:

kubectl apply -f - <<EOF
apiVersion: stable.example.com/v1
kind: ImmutableSinceCreation
metadata:
  name: test1
value: Hello, world!
EOF
immutablesincecreation.stable.example.com/test1 created

If you attempt to change the value, the operation is blocked due to the validation rules in the CRD. Note that the error message is as it was defined in the validation rule.

kubectl apply -f - <<EOF
apiVersion: stable.example.com/v1
kind: ImmutableSinceCreation
metadata:
  name: test1
value: Hello, new world!
EOF
The ImmutableSinceCreation "test1" is invalid: value: Invalid value: "string": Value is immutable

Also if you attempted to remove value altogether after adding it, you will see an error as expected:

kubectl apply -f - <<EOF
apiVersion: stable.example.com/v1
kind: ImmutableSinceCreation
metadata:
  name: test1
EOF
The ImmutableSinceCreation "test1" is invalid:
* value: Required value
* <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation

Generated schema

openAPIV3Schema:
  properties:
    value:
      maxLength: 512
      type: string
      x-kubernetes-validations:
      - message: Value is immutable
        rule: self == oldSelf
  required:
  - value
  type: object

Append-only list of containers

In the case of ephemeral containers on Pods, Kubernetes enforces that the elements in the list are immutable, and can’t be removed. The following example shows how you could use CEL to achieve the same behavior.

// +kubebuilder:validation:XValidation:rule="!has(oldSelf.value) || has(self.value)", message="Value is required once set"
type AppendOnlyList struct {
   metav1.TypeMeta   `json:",inline"`
   metav1.ObjectMeta `json:"metadata,omitempty"`

   // +kubebuilder:validation:Optional
   // +kubebuilder:validation:MaxItems=100
   // +kubebuilder:validation:XValidation:rule="oldSelf.all(x, x in self)",message="Values may only be added"
   Values []v1.EphemeralContainer `json:"value"`
}
  1. Once set, field must not be deleted: !has(oldSelf.value) || has(self.value) (type-scoped)
  2. Once a value is added it is not removed: oldSelf.all(x, x in self) (field-scoped)
  3. Value may be initially unset: +kubebuilder:validation:Optional

Note that for cost-budgeting purposes, MaxItems is also required to be specified.

Example usage

Generating and installing the CRD should succeed:

# Ensure the CRD yaml is generated by controller-gen
go generate ./...
kubectl apply -f crds/stable.example.com_appendonlylists.yaml
customresourcedefinition.apiextensions.k8s.io/appendonlylists.stable.example.com created

Creating an initial list with one element inside should succeed without problem:

kubectl apply -f - <<EOF
---
apiVersion: stable.example.com/v1
kind: AppendOnlyList
metadata:
  name: testlist
value:
  - name: container1
    image: nginx/nginx
EOF
appendonlylist.stable.example.com/testlist created

Adding an element to the list should also proceed without issue as expected:

kubectl apply -f - <<EOF
---
apiVersion: stable.example.com/v1
kind: AppendOnlyList
metadata:
  name: testlist
value:
  - name: container1
    image: nginx/nginx
  - name: container2
    image: mongodb/mongodb
EOF
appendonlylist.stable.example.com/testlist configured

But if you now attempt to remove an element, the error from the validation rule is triggered:

kubectl apply -f - <<EOF
---
apiVersion: stable.example.com/v1
kind: AppendOnlyList
metadata:
  name: testlist
value:
  - name: container1
    image: nginx/nginx
EOF
The AppendOnlyList "testlist" is invalid: value: Invalid value: "array": Values may only be added

Additionally, to attempt to remove the field once it has been set is also disallowed by the type-scoped validation rule.

kubectl apply -f - <<EOF
---
apiVersion: stable.example.com/v1
kind: AppendOnlyList
metadata:
  name: testlist
EOF
The AppendOnlyList "testlist" is invalid: <nil>: Invalid value: "object": Value is required once set

Generated schema

openAPIV3Schema:
  properties:
    value:
      items: ...
      maxItems: 100
      type: array
      x-kubernetes-validations:
      - message: Values may only be added
        rule: oldSelf.all(x, x in self)
  type: object
  x-kubernetes-validations:
  - message: Value is required once set
    rule: '!has(oldSelf.value) || has(self.value)'

Map with append-only keys, immutable values

// A map which does not allow keys to be removed or their values changed once set. New keys may be added, however.
// +kubebuilder:validation:XValidation:rule="!has(oldSelf.values) || has(self.values)", message="Value is required once set"
type MapAppendOnlyKeys struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// +kubebuilder:validation:Optional
	// +kubebuilder:validation:MaxProperties=10
	// +kubebuilder:validation:XValidation:rule="oldSelf.all(key, key in self && self[key] == oldSelf[key])",message="Keys may not be removed and their values must stay the same"
	Values map[string]string `json:"values,omitempty"`
}
  1. Once set, field must not be deleted: !has(oldSelf.values) || has(self.values) (type-scoped)
  2. Once a key is added it is not removed nor is its value modified: oldSelf.all(key, key in self && self[key] == oldSelf[key]) (field-scoped)
  3. Value may be initially unset: +kubebuilder:validation:Optional

Example usage

Generating and installing the CRD should succeed:

# Ensure the CRD yaml is generated by controller-gen
go generate ./...
kubectl apply -f crds/stable.example.com_mapappendonlykeys.yaml
customresourcedefinition.apiextensions.k8s.io/mapappendonlykeys.stable.example.com created

Creating an initial object with one key within values should be permitted:

kubectl apply -f - <<EOF
---
apiVersion: stable.example.com/v1
kind: MapAppendOnlyKeys
metadata:
  name: testmap
values:
    key1: value1
EOF
mapappendonlykeys.stable.example.com/testmap created

Adding new keys to the map should also be permitted:

kubectl apply -f - <<EOF
---
apiVersion: stable.example.com/v1
kind: MapAppendOnlyKeys
metadata:
  name: testmap
values:
    key1: value1
    key2: value2
EOF
mapappendonlykeys.stable.example.com/testmap configured

But if a key is removed, the error messagr from the validation rule should be returned:

kubectl apply -f - <<EOF
---
apiVersion: stable.example.com/v1
kind: MapAppendOnlyKeys
metadata:
  name: testmap
values:
    key1: value1
EOF
The MapAppendOnlyKeys "testmap" is invalid: values: Invalid value: "object": Keys may not be removed and their values must stay the same

If the entire field is removed, the other validation rule is triggered and the operation is prevented. Note that the error message for the validation rule is shown to the user.

kubectl apply -f - <<EOF
---
apiVersion: stable.example.com/v1
kind: MapAppendOnlyKeys
metadata:
  name: testmap
EOF
The MapAppendOnlyKeys "testmap" is invalid: <nil>: Invalid value: "object": Value is required once set

Generated schema

openAPIV3Schema:
  description: A map which does not allow keys to be removed or their values
    changed once set. New keys may be added, however.
  properties:
    values:
      additionalProperties:
        type: string
      maxProperties: 10
      type: object
      x-kubernetes-validations:
      - message: Keys may not be removed and their values must stay the same
        rule: oldSelf.all(key, key in self && self[key] == oldSelf[key])
  type: object
  x-kubernetes-validations:
  - message: Value is required once set
    rule: '!has(oldSelf.values) || has(self.values)'

Going further

The above examples showed how CEL rules can be added to kubebuilder types. The same rules can be added directly to OpenAPI if writing a manifest for a CRD by hand.

For native types, the same behavior can be achieved using kube-openapi’s marker +validations.

Usage of CEL within Kubernetes Validation Rules is so much more powerful than what has been shown in this article. For more information please check out validation rules in the Kubernetes documentation and CRD Validation Rules Beta blog post.

Kubernetes 1.25: Kubernetes In-Tree to CSI Volume Migration Status Update

The Kubernetes in-tree storage plugin to Container Storage Interface (CSI) migration infrastructure has already been beta since v1.17. CSI migration was introduced as alpha in Kubernetes v1.14. Since then, SIG Storage and other Kubernetes special interest groups are working to ensure feature stability and compatibility in preparation for CSI Migration feature to go GA.

SIG Storage is excited to announce that the core CSI Migration feature is generally available in Kubernetes v1.25 release!

SIG Storage wrote a blog post in v1.23 for CSI Migration status update which discussed the CSI migration status for each storage driver. It has been a while and this article is intended to give a latest status update on each storage driver for their CSI Migration status in Kubernetes v1.25.

Quick recap: What is CSI Migration, and why migrate?

The Container Storage Interface (CSI) was designed to help Kubernetes replace its existing, in-tree storage driver mechanisms - especially vendor specific plugins. Kubernetes support for the Container Storage Interface has been generally available since Kubernetes v1.13. Support for using CSI drivers was introduced to make it easier to add and maintain new integrations between Kubernetes and storage backend technologies. Using CSI drivers allows for better maintainability (driver authors can define their own release cycle and support lifecycle) and reduce the opportunity for vulnerabilities (with less in-tree code, the risks of a mistake are reduced, and cluster operators can select only the storage drivers that their cluster requires).

As more CSI Drivers were created and became production ready, SIG Storage wanted all Kubernetes users to benefit from the CSI model. However, we could not break API compatibility with the existing storage API types due to k8s architecture conventions. The solution we came up with was CSI migration: a feature that translates in-tree APIs to equivalent CSI APIs and delegates operations to a replacement CSI driver.

The CSI migration effort enables the replacement of existing in-tree storage plugins such as kubernetes.io/gce-pd or kubernetes.io/aws-ebs with a corresponding CSI driver from the storage backend. If CSI Migration is working properly, Kubernetes end users shouldn’t notice a difference. Existing StorageClass, PersistentVolume and PersistentVolumeClaim objects should continue to work. When a Kubernetes cluster administrator updates a cluster to enable CSI migration, existing workloads that utilize PVCs which are backed by in-tree storage plugins will continue to function as they always have. However, behind the scenes, Kubernetes hands control of all storage management operations (previously targeting in-tree drivers) to CSI drivers.

For example, suppose you are a kubernetes.io/gce-pd user; after CSI migration, you can still use kubernetes.io/gce-pd to provision new volumes, mount existing GCE-PD volumes or delete existing volumes. All existing APIs and Interface will still function correctly. However, the underlying function calls are all going through the GCE PD CSI driver instead of the in-tree Kubernetes function.

This enables a smooth transition for end users. Additionally as storage plugin developers, we can reduce the burden of maintaining the in-tree storage plugins and eventually remove them from the core Kubernetes binary.

What is the timeline / status?

The current and targeted releases for each individual driver is shown in the table below:

DriverAlphaBeta (in-tree deprecated)Beta (on-by-default)GATarget "in-tree plugin" removal
AWS EBS1.141.171.231.251.27 (Target)
Azure Disk1.151.191.231.241.26 (Target)
Azure File1.151.211.241.26 (Target)1.28 (Target)
Ceph FS1.26 (Target)
Ceph RBD1.231.26 (Target)1.27 (Target)1.28 (Target)1.30 (Target)
GCE PD1.141.171.231.251.27 (Target)
OpenStack Cinder1.141.181.211.241.26 (Target)
Portworx1.231.251.26 (Target)1.27 (Target)1.29 (Target)
vSphere1.181.191.251.26 (Target)1.28 (Target)

The following storage drivers will not have CSI migration support. The scaleio, flocker, quobyte and storageos drivers were removed; the others are deprecated and will be removed from core Kubernetes in the coming releases.

DriverDeprecatedCode Removal
Flocker1.221.25
GlusterFS1.251.26 (Target)
Quobyte1.221.25
ScaleIO1.161.22
StorageOS1.221.25

What does it mean for the core CSI Migration feature to go GA?

Core CSI Migration goes to GA means that the general framework, core library and API for CSI migration is stable for Kubernetes v1.25 and will be part of future Kubernetes releases as well.

  • If you are a Kubernetes distribution maintainer, this means if you disabled CSIMigration feature gate previously, you are no longer allowed to do so because the feature gate has been locked.
  • If you are a Kubernetes storage driver developer, this means you can expect no backwards incompatibility changes in the CSI migration library.
  • If you are a Kubernetes maintainer, expect nothing changes from your day to day development flows.
  • If you are a Kubernetes user, expect nothing to change from your day-to-day usage flows. If you encounter any storage related issues, contact the people who operate your cluster (if that's you, contact the provider of your Kubernetes distribution, or get help from the community).

What does it mean for the storage driver CSI migration to go GA?

Storage Driver CSI Migration goes to GA means that the specific storage driver supports CSI Migration. Expect feature parity between the in-tree plugin with the CSI driver.

  • If you are a Kubernetes distribution maintainer, make sure you install the corresponding CSI driver on the distribution. And make sure you are not disabling the specific CSIMigration{provider} flag, as they are locked.
  • If you are a Kubernetes storage driver maintainer, make sure the CSI driver can ensure feature parity if it supports CSI migration.
  • If you are a Kubernetes maintainer/developer, expect nothing to change from your day-to-day development flows.
  • If you are a Kubernetes user, the CSI Migration feature should be completely transparent to you, the only requirement is to install the corresponding CSI driver.

What's next?

We are expecting cloud provider in-tree storage plugins code removal to start to happen as part of the v1.26 and v1.27 releases of Kubernetes. More and more drivers that support CSI migration will go GA in the upcoming releases.

How do I get involved?

The Kubernetes Slack channel #csi-migration along with any of the standard SIG Storage communication channels are great ways to reach out to the SIG Storage and migration working group teams.

This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. We offer a huge thank you to the contributors who stepped up these last quarters to help move the project forward:

  • Xing Yang (xing-yang)
  • Hemant Kumar (gnufied)

Special thanks to the following people for the insightful reviews, thorough consideration and valuable contribution to the CSI migration feature:

  • Andy Zhang (andyzhangz)
  • Divyen Patel (divyenpatel)
  • Deep Debroy (ddebroy)
  • Humble Devassy Chirammal (humblec)
  • Ismail Alidzhikov (ialidzhikov)
  • Jordan Liggitt (liggitt)
  • Matthew Cary (mattcary)
  • Matthew Wong (wongma7)
  • Neha Arora (nearora-msft)
  • Oksana Naumov (trierra)
  • Saad Ali (saad-ali)
  • Michelle Au (msau42)

Those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We’re rapidly growing and always welcome new contributors.

Kubernetes 1.25: CustomResourceDefinition Validation Rules Graduate to Beta

In Kubernetes 1.25, Validation rules for CustomResourceDefinitions (CRDs) have graduated to Beta!

Validation rules make it possible to declare how custom resources are validated using the Common Expression Language (CEL). For example:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
    ...
    openAPIV3Schema:
      type: object
      properties:
        spec:
          type: object
          x-kubernetes-validations:
            - rule: "self.minReplicas <= self.replicas && self.replicas <= self.maxReplicas"
              message: "replicas should be in the range minReplicas..maxReplicas."
          properties:
            replicas:
              type: integer
            ...

Validation rules support a wide range of use cases. To get a sense of some of the capabilities, let's look at a few examples:

Validation RulePurpose
self.minReplicas <= self.replicasValidate an integer field is less than or equal to another integer field
'Available' in self.stateCountsValidate an entry with the 'Available' key exists in a map
self.set1.all(e, !(e in self.set2))Validate that the elements of two sets are disjoint
self == oldSelfValidate that a required field is immutable once it is set
self.created + self.ttl < self.expiredValidate that 'expired' date is after a 'create' date plus a 'ttl' duration

Validation rules are expressive and flexible. See the Validation Rules documentation to learn more about what validation rules are capable of.

Why CEL?

CEL was chosen as the language for validation rules for a couple reasons:

  • CEL expressions can easily be inlined into CRD schemas. They are sufficiently expressive to replace the vast majority of CRD validation checks currently implemented in admission webhooks. This results in CRDs that are self-contained and are easier to understand.
  • CEL expressions are compiled and type checked against a CRD's schema "ahead-of-time" (when CRDs are created and updated) allowing them to be evaluated efficiently and safely "runtime" (when custom resources are validated). Even regex string literals in CEL are validated and pre-compiled when CRDs are created or updated.

Why not use validation webhooks?

Benefits of using validation rules when compared with validation webhooks:

  • CRD authors benefit from a simpler workflow since validation rules eliminate the need to develop and maintain a webhook.
  • Cluster administrators benefit by no longer having to install, upgrade and operate webhooks for the purposes of CRD validation.
  • Cluster operability improves because CRD validation no longer requires a remote call to a webhook endpoint, eliminating a potential point of failure in the request-serving-path of the Kubernetes API server. This allows clusters to retain high availability while scaling to larger amounts of installed CRD extensions, since expected control plane availability would otherwise decrease with each additional webhook installed.

Getting started with validation rules

Writing validation rules in OpenAPIv3 schemas

You can define validation rules for any level of a CRD's OpenAPIv3 schema. Validation rules are automatically scoped to their location in the schema where they are declared.

Good practices for CRD validation rules:

  • Scope validation rules as close as possible to the fields(s) they validate.
  • Use multiple rules when validating independent constraints.
  • Do not use validation rules for validations already
  • Use OpenAPIv3 value validations (maxLength, maxItems, maxProperties, required, enum, minimum, maximum, ..) and string formats where available.
  • Use x-kubernetes-int-or-string, x-kubernetes-embedded-type and x-kubernetes-list-type=(set|map) were appropriate.

Examples of good practice:

ValidationBest PracticeExample(s)
Validate an integer is between 0 and 100.Use OpenAPIv3 value validations.
type: integer
minimum: 0
maximum: 100
Constraint the max size limits on maps (objects with additionalProperties), arrays and string.Use OpenAPIv3 value validations. Recommended for all maps, arrays and strings. This best practice is essential for rule cost estimation (explained below).
type:
maxItems: 100
Require a date-time be more recent than a particular timestamp.Use OpenAPIv3 string formats to declare that the field is a date-time. Use validation rules to compare it to a particular timestamp.
type: string
format: date-time
x-kubernetes-validations:
- rule: "self >= timestamp('2000-01-01T00:00:00.000Z')"
Require two sets to be disjoint.Use x-kubernetes-list-type to validate that the arrays are sets.
Use validation rules to validate the sets are disjoint.
type: object
properties:
set1:
type: array
x-kubernetes-list-type: set
set2: ...
x-kubernetes-validations:
- rule: "!self.set1.all(e, !(e in self.set2))"

CRD transition rules

Transition Rules make it possible to compare the new state against the old state of a resource in validation rules. You use transition rules to make sure that the cluster's API server does not accept invalid state transitions. A transition rule is a validation rule that references 'oldSelf'. The API server only evaluates transition rules when both an old value and new value exist.

Transition rule examples:

Transition RulePurpose
self == oldSelfFor a required field, make that field immutable once it is set. For an optional field, only allow transitioning from unset to set, or from set to unset.
(on parent of field) has(self.field) == has(oldSelf.field)
on field: self == oldSelf
Make a field immutable: validate that a field, even if optional, never changes after the resource is created (for a required field, the previous rule is simpler).
self.all(x, x in oldSelf)Only allow adding items to a field that represents a set (prevent removals).
self >= oldSelfValidate that a number is monotonically increasing.

Using the Functions Libraries

Validation rules have access to a couple different function libraries:

Examples of function libraries in use:

Validation RulePurpose
!(self.getDayOfWeek() in [0, 6])Validate that a date is not a Sunday or Saturday.
isUrl(self) && url(self).getHostname() in [a.example.com', 'b.example.com']Validate that a URL has an allowed hostname.
self.map(x, x.weight).sum() == 1Validate that the weights of a list of objects sum to 1.
int(self.find('^[0-9]*')) < 100Validate that a string starts with a number less than 100.
self.isSorted()Validates that a list is sorted.

Resource use and limits

To prevent CEL evaluation from consuming excessive compute resources, validation rules impose some limits. These limits are based on CEL cost units, a platform and machine independent measure of execution cost. As a result, the limits are the same regardless of where they are enforced.

Estimated cost limit

CEL is, by design, non-Turing-complete in such a way that the halting problem isn’t a concern. CEL takes advantage of this design choice to include an "estimated cost" subsystem that can statically compute the worst case run time cost of any CEL expression. Validation rules are integrated with the estimated cost system and disallow CEL expressions from being included in CRDs if they have a sufficiently poor (high) estimated cost. The estimated cost limit is set quite high and typically requires an O(n^2) or worse operation, across something of unbounded size, to be exceeded. Fortunately the fix is usually quite simple: because the cost system is aware of size limits declared in the CRD's schema, CRD authors can add size limits to the CRD's schema (maxItems for arrays, maxProperties for maps, maxLength for strings) to reduce the estimated cost.

Good practice:

Set maxItems, maxProperties and maxLength on all array, map (object with additionalProperties) and string types in CRD schemas! This results in lower and more accurate estimated costs and generally makes a CRD safer to use.

Runtime cost limits for CRD validation rules

In addition to the estimated cost limit, CEL keeps track of actual cost while evaluating a CEL expression and will halt execution of the expression if a limit is exceeded.

With the estimated cost limit already in place, the runtime cost limit is rarely encountered. But it is possible. For example, it might be encountered for a large resource composed entirely of a single large list and a validation rule that is either evaluated on each element in the list, or traverses the entire list.

CRD authors can ensure the runtime cost limit will not be exceeded in much the same way the estimated cost limit is avoided: by setting maxItems, maxProperties and maxLength on array, map and string types.

Future work

We look forward to working with the community on the adoption of CRD Validation Rules, and hope to see this feature promoted to general availability in an upcoming Kubernetes release!

There is a growing community of Kubernetes contributors thinking about how to make it possible to write extensible admission controllers using CEL as a substitute for admission webhooks for policy enforcement use cases. Anyone interested should reach out to us on the usual SIG API Machinery channels or via slack at #sig-api-machinery-cel-dev.

Acknowledgements

Special thanks to Cici Huang, Ben Luddy, Jordan Liggitt, David Eads, Daniel Smith, Dr. Stefan Schimanski, Leila Jalali and everyone who contributed to Validation Rules!

Kubernetes 1.25: Use Secrets for Node-Driven Expansion of CSI Volumes

Kubernetes v1.25, released earlier this month, introduced a new feature that lets your cluster expand storage volumes, even when access to those volumes requires a secret (for example: a credential for accessing a SAN fabric) to perform node expand operation. This new behavior is in alpha and you must enable a feature gate (CSINodeExpandSecret) to make use of it. You must also be using CSI storage; this change isn't relevant to storage drivers that are built in to Kubernetes.

To turn on this new, alpha feature, you enable the CSINodeExpandSecret feature gate for the kube-apiserver and kubelet, which turns on a mechanism to send secretRef configuration as part of NodeExpansion by the CSI drivers thus make use of the same to perform node side expansion operation with the underlying storage system.

What is this all about?

Before Kubernetes v1.24, you were able to define a cluster-level StorageClass that made use of StorageClass Secrets, but you didn't have any mechanism to specify the credentials that would be used for operations that take place when the storage was mounted onto a node and when the volume has to be expanded at node side.

The Kubernetes CSI already implemented a similar mechanism specific kinds of volume resizes; namely, resizes of PersistentVolumes where the resizes take place independently from any node referred as Controller Expansion. In that case, you associate a PersistentVolume with a Secret that contains credentials for volume resize actions, so that controller expansion can take place. CSI also supports a nodeExpandVolume operation which CSI drivers can make use independent of Controller Expansion or along with Controller Expansion on which, where the resize is driven from a node in your cluster where the volume is attached. Please read Kubernetes 1.24: Volume Expansion Now A Stable Feature

  • At times, the CSI driver needs to check the actual size of the backend block storage (or image) before proceeding with a node-level filesystem expand operation. This avoids false positive returns from the backend storage cluster during filesystem expands.

  • When a PersistentVolume represents encrypted block storage (for example using LUKS) you need to provide a passphrase in order to expand the device, and also to make it possible to grow the filesystem on that device.

  • For various validations at time of node expansion, the CSI driver has to be connected to the backend storage cluster. If the nodeExpandVolume request includes a secretRef then the CSI driver can make use of the same and connect to the storage cluster to perform the cluster operations.

How does it work?

To enable this functionality from this version of Kubernetes, SIG Storage have introduced a new feature gate called CSINodeExpandSecret. Once the feature gate is enabled in the cluster, NodeExpandVolume requests can include a secretRef field. The NodeExpandVolume request is part of CSI; for example, in a request which has been sent from the Kubernetes control plane to the CSI driver.

As a cluster operator, you admin can specify these secrets as an opaque parameter in a StorageClass, the same way that you can already specify other CSI secret data. The StorageClass needs to have some CSI-specific parameters set. Here's an example of those parameters:

csi.storage.k8s.io/node-expand-secret-name: test-secret
csi.storage.k8s.io/node-expand-secret-namespace: default

If feature gates are enabled and storage class carries the above secret configuration, the CSI provisioner receives the credentials from the Secret as part of the NodeExpansion request.

CSI volumes that require secrets for online expansion will have NodeExpandSecretRef field set. If not set, the NodeExpandVolume CSI RPC call will be made without a secret.

Trying it out

  1. Enable the CSINodeExpandSecret feature gate (please refer to Feature Gates).

  2. Create a Secret, and then a StorageClass that uses that Secret.

Here's an example manifest for a Secret that holds credentials:

apiVersion: v1
kind: Secret
metadata:
  name: test-secret
  namespace: default
data:
stringData:
  username: admin
  password: t0p-Secret

Here's an example manifest for a StorageClass that refers to those credentials:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-blockstorage-sc
parameters:
  csi.storage.k8s.io/node-expand-secret-name: test-secret   # the name of the Secret
  csi.storage.k8s.io/node-expand-secret-namespace: default  # the namespace that the Secret is in
provisioner: blockstorage.cloudprovider.example
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true

Example output

If the PersistentVolumeClaim (PVC) was created successfully, you can see that configuration within the spec.csi field of the PersistentVolume (look for spec.csi.nodeExpandSecretRef). Check that it worked by running kubectl get persistentvolume <pv_name> -o yaml. You should see something like.

apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: blockstorage.cloudprovider.example
  creationTimestamp: "2022-08-26T15:14:07Z"
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-95eb531a-d675-49f6-940b-9bc3fde83eb0
  resourceVersion: "420263"
  uid: 6fa824d7-8a06-4e0c-b722-d3f897dcbd65
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 6Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: csi-pvc
    namespace: default
    resourceVersion: "419862"
    uid: 95eb531a-d675-49f6-940b-9bc3fde83eb0
  csi:
    driver: blockstorage.cloudprovider.example
    nodeExpandSecretRef:
      name: test-secret
      namespace: default
    volumeAttributes:
      storage.kubernetes.io/csiProvisionerIdentity: 1648042783218-8081-blockstorage.cloudprovider.example
    volumeHandle: e21c7809-aabb-11ec-917a-2e2e254eb4cf
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.hostpath.csi/node
          operator: In
          values:
          - racknode01
  persistentVolumeReclaimPolicy: Delete
  storageClassName: csi-blockstorage-sc
  volumeMode: Filesystem
status:
  phase: Bound

If you then trigger online storage expansion, the kubelet passes the appropriate credentials to the CSI driver, by loading that Secret and passing the data to the storage driver.

Here's an example debug log:

I0330 03:29:51.966241       1 server.go:101] GRPC call: /csi.v1.Node/NodeExpandVolume
I0330 03:29:51.966261       1 server.go:105] GRPC request: {"capacity_range":{"required_bytes":7516192768},"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/blockstorage.cloudprovider.example/f7c62e6e08ce21e9b2a95c841df315ed4c25a15e91d8fcaf20e1c2305e5300ab/globalmount","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":7}},"volume_id":"e21c7809-aabb-11ec-917a-2e2e254eb4cf","volume_path":"/var/lib/kubelet/pods/bcb1b2c4-5793-425c-acf1-47163a81b4d7/volumes/kubernetes.io~csi/pvc-95eb531a-d675-49f6-940b-9bc3fde83eb0/mount"}
I0330 03:29:51.966360       1 nodeserver.go:459] req:volume_id:"e21c7809-aabb-11ec-917a-2e2e254eb4cf" volume_path:"/var/lib/kubelet/pods/bcb1b2c4-5793-425c-acf1-47163a81b4d7/volumes/kubernetes.io~csi/pvc-95eb531a-d675-49f6-940b-9bc3fde83eb0/mount" capacity_range:<required_bytes:7516192768 > staging_target_path:"/var/lib/kubelet/plugins/kubernetes.io/csi/blockstorage.cloudprovider.example/f7c62e6e08ce21e9b2a95c841df315ed4c25a15e91d8fcaf20e1c2305e5300ab/globalmount" volume_capability:<mount:<> access_mode:<mode:SINGLE_NODE_MULTI_WRITER > > secrets:<key:"XXXXXX" value:"XXXXX" > secrets:<key:"XXXXX" value:"XXXXXX" >

The future

As this feature is still in alpha, Kubernetes Storage SIG expect to update or get feedback from CSI driver authors with more tests and implementation. The community plans to eventually promote the feature to Beta in upcoming releases.

Get involved or learn more?

The enhancement proposal includes lots of detail about the history and technical implementation of this feature.

To learn more about StorageClass based dynamic provisioning in Kubernetes, please refer to Storage Classes and Persistent Volumes.

Please get involved by joining the Kubernetes Storage SIG (Special Interest Group) to help us enhance this feature. There are a lot of good ideas already and we'd be thrilled to have more!

Kubernetes 1.25: Local Storage Capacity Isolation Reaches GA

Local ephemeral storage capacity isolation was introduced as a alpha feature in Kubernetes 1.7 and it went beta in 1.9. With Kubernetes 1.25 we are excited to announce general availability(GA) of this feature.

Pods use ephemeral local storage for scratch space, caching, and logs. The lifetime of local ephemeral storage does not extend beyond the life of the individual pod. It is exposed to pods using the container’s writable layer, logs directory, and EmptyDir volumes. Before this feature was introduced, there were issues related to the lack of local storage accounting and isolation, such as Pods not knowing how much local storage is available and being unable to request guaranteed local storage. Local storage is a best-effort resource and pods can be evicted due to other pods filling the local storage.

The local storage capacity isolation feature allows users to manage local ephemeral storage in the same way as managing CPU and memory. It provides support for capacity isolation of shared storage between pods, such that a pod can be hard limited in its consumption of shared resources by evicting Pods if its consumption of shared storage exceeds that limit. It also allows setting ephemeral storage requests for resource reservation. The limits and requests for shared ephemeral-storage are similar to those for memory and CPU consumption.

How to use local storage capacity isolation

A typical configuration for local ephemeral storage is to place all different kinds of ephemeral local data (emptyDir volumes, writeable layers, container images, logs) into one filesystem. Typically, both /var/lib/kubelet and /var/log are on the system's root filesystem. If users configure the local storage in different ways, kubelet might not be able to correctly measure disk usage and use this feature.

Setting requests and limits for local ephemeral storage

You can specify ephemeral-storage for managing local ephemeral storage. Each container of a Pod can specify either or both of the following:

  • spec.containers[].resources.limits.ephemeral-storage
  • spec.containers[].resources.requests.ephemeral-storage

In the following example, the Pod has two containers. The first container has a request of 8GiB of local ephemeral storage and a limit of 12GiB. The second container requests 2GiB of local storage, but no limit setting. Therefore, the Pod requests a total of 10GiB (8GiB+2GiB) of local ephemeral storage and enforces a limit of 12GiB of local ephemeral storage. It also sets emptyDir sizeLimit to 5GiB. With this setting in pod spec, it will affect how the scheduler makes a decision on scheduling pods and also how kubelet evict pods.

apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
  - name: app
    image: images.my-company.example/app:v4
    resources:
      requests:
        ephemeral-storage: "8Gi"
      limits:
        ephemeral-storage: "12Gi"
    volumeMounts:
    - name: ephemeral
      mountPath: "/tmp"
  - name: log-aggregator
    image: images.my-company.example/log-aggregator:v6
    resources:
      requests:
        ephemeral-storage: "2Gi"
    volumeMounts:
    - name: ephemeral
      mountPath: "/tmp"
  volumes:
    - name: ephemeral
      emptyDir: {}
        sizeLimit: 5Gi

First of all, the scheduler ensures that the sum of the resource requests of the scheduled containers is less than the capacity of the node. In this case, the pod can be assigned to a node only if its available ephemeral storage (allocatable resource) has more than 10GiB.

Secondly, at container level, since one of the container sets resource limit, kubelet eviction manager will measure the disk usage of this container and evict the pod if the storage usage of the first container exceeds its limit (12GiB). At pod level, kubelet works out an overall Pod storage limit by adding up the limits of all the containers in that Pod. In this case, the total storage usage at pod level is the sum of the disk usage from all containers plus the Pod's emptyDir volumes. If this total usage exceeds the overall Pod storage limit (12GiB), then the kubelet also marks the Pod for eviction.

Last, in this example, emptyDir volume sets its sizeLimit to 5Gi. It means that if this pod's emptyDir used up more local storage than 5GiB, the pod will be evicted from the node.

Setting resource quota and limitRange for local ephemeral storage

This feature adds two more resource quotas for storage. The request and limit set constraints on the total requests/limits of all containers’ in a namespace.

  • requests.ephemeral-storage
  • limits.ephemeral-storage
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-resources
spec:
  hard:
    requests.ephemeral-storage: "10Gi"
    limits.ephemeral-storage: "20Gi"

Similar to CPU and memory, admin could use LimitRange to set default container’s local storage request/limit, and/or minimum/maximum resource constraints for a namespace.

apiVersion: v1
kind: LimitRange
metadata:
  name: storage-limit-range
spec:
  limits:
  - default:
      ephemeral-storage: 10Gi
    defaultRequest:
      ephemeral-storage: 5Gi
    type: Container

Also, ephemeral-storage may be specified to reserve for kubelet or system. example, --system-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=10Gi][,][pid=1000] --kube-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=5Gi][,][pid=1000]. If your cluster node root disk capacity is 100Gi, after setting system-reserved and kube-reserved value, the available allocatable ephemeral storage would become 85Gi. The schedule will use this information to assign pods based on request and allocatable resources from each node. The eviction manager will also use allocatable resource to determine pod eviction. See more details from Reserve Compute Resources for System Daemons.

How do I get involved?

This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together.

We offer a huge thank you to all the contributors in Kubernetes Storage SIG and CSI community who helped review the design and implementation of the project, including but not limited to the following:

Kubernetes 1.25: Two Features for Apps Rollouts Graduate to Stable

This blog describes the two features namely minReadySeconds for StatefulSets and maxSurge for DaemonSets that SIG Apps is happy to graduate to stable in Kubernetes 1.25.

Specifying minReadySeconds slows down a rollout of a StatefulSet, when using a RollingUpdate value in .spec.updateStrategy field, by waiting for each pod for a desired time. This time can be used for initializing the pod (e.g. warming up the cache) or as a delay before acknowledging the pod.

maxSurge allows a DaemonSet workload to run multiple instances of the same pod on a node during a rollout when using a RollingUpdate value in .spec.updateStrategy field. This helps to minimize the downtime of the DaemonSet for consumers.

These features were already available in a Deployment and other workloads. This graduation helps to align this functionality across the workloads.

What problems do these features solve?

minReadySeconds for StatefulSets

minReadySeconds ensures that the StatefulSet workload is Ready for the given number of seconds before reporting the pod as Available. The notion of being Ready and Available is quite important for workloads. For example, some workloads, like Prometheus with multiple instances of Alertmanager, should be considered Available only when the Alertmanager's state transfer is complete. minReadySeconds also helps when using loadbalancers with cloud providers. Since the pod should be Ready for the given number of seconds, it provides buffer time to prevent killing pods in rotation before new pods show up.

maxSurge for DaemonSets

Kubernetes system-level components like CNI, CSI are typically run as DaemonSets. These components can have impact on the availability of the workloads if those DaemonSets go down momentarily during the upgrades. The feature allows DaemonSet pods to temporarily increase their number, thereby ensuring zero-downtime for the DaemonSets.

Please note that the usage of hostPort in conjunction with maxSurge in DaemonSets is not allowed as DaemonSet pods are tied to a single node and two active pods cannot share the same port on the same node.

How does it work?

minReadySeconds for StatefulSets

The StatefulSet controller watches for the StatefulSet pods and counts how long a particular pod has been in the Running state, if this value is greater than or equal to the time specified in .spec.minReadySeconds field of the StatefulSet, the StatefulSet controller updates the AvailableReplicas field in the StatefulSet's status.

maxSurge for DaemonSets

The DaemonSet controller creates the additional pods (above the desired number resulting from DaemonSet spec) based on the value given in .spec.strategy.rollingUpdate.maxSurge. The additional pods would run on the same node where the old DaemonSet pod is running till the old pod gets killed.

  • The default value is 0.
  • The value cannot be 0 when MaxUnavailable is 0.
  • The value can be specified either as an absolute number of pods, or a percentage (rounded up) of desired pods.

How do I use it?

minReadySeconds for StatefulSets

Specify a value for minReadySeconds for any StatefulSet and check if pods are available or not by inspecting AvailableReplicas field using:

kubectl get statefulset/<name_of_the_statefulset> -o yaml

Please note that the default value of minReadySeconds is 0.

maxSurge for DaemonSets

Specify a value for .spec.updateStrategy.rollingUpdate.maxSurge and set .spec.updateStrategy.rollingUpdate.maxUnavailable to 0.

Then observe a faster rollout and higher number of pods running at the same time in the next rollout.

kubectl rollout restart daemonset <name_of_the_daemonset>
kubectl get pods -w

How can I learn more?

minReadySeconds for StatefulSets

maxSurge for DaemonSets

How do I get involved?

Please reach out to us on #sig-apps channel on Slack, or through the SIG Apps mailing list kubernetes-sig-apps@googlegroups.com.

Kubernetes 1.25: PodHasNetwork Condition for Pods

Kubernetes 1.25 introduces Alpha support for a new kubelet-managed pod condition in the status field of a pod: PodHasNetwork. The kubelet, for a worker node, will use the PodHasNetwork condition to accurately surface the initialization state of a pod from the perspective of pod sandbox creation and network configuration by a container runtime (typically in coordination with CNI plugins). The kubelet starts to pull container images and start individual containers (including init containers) after the status of the PodHasNetwork condition is set to "True". Metrics collection services that report latency of pod initialization from a cluster infrastructural perspective (i.e. agnostic of per container characteristics like image size or payload) can utilize the PodHasNetwork condition to accurately generate Service Level Indicators (SLIs). Certain operators or controllers that manage underlying pods may utilize the PodHasNetwork condition to optimize the set of actions performed when pods repeatedly fail to come up.

Updates for Kubernetes 1.28

The PodHasNetwork condition has been renamed to PodReadyToStartContainers. Alongside that change, the feature gate PodHasNetworkCondition has been replaced by PodReadyToStartContainersCondition. You need to set PodReadyToStartContainersCondition to true in order to use the new feature in v1.28.0 and later.

How is this different from the existing Initialized condition reported for pods?

The kubelet sets the status of the existing Initialized condition reported in the status field of a pod depending on the presence of init containers in a pod.

If a pod specifies init containers, the status of the Initialized condition in the pod status will not be set to "True" until all init containers for the pod have succeeded. However, init containers, configured by users, may have errors (payload crashing, invalid image, etc) and the number of init containers configured in a pod may vary across different workloads. Therefore, cluster-wide, infrastructural SLIs around pod initialization cannot depend on the Initialized condition of pods.

If a pod does not specify init containers, the status of the Initialized condition in the pod status is set to "True" very early in the lifecycle of the pod. This occurs before the kubelet initiates any pod runtime sandbox creation and network configuration steps. As a result, a pod without init containers will report the status of the Initialized condition as "True" even if the container runtime is not able to successfully initialize the pod sandbox environment.

Relative to either situation above, the PodHasNetwork condition surfaces more accurate data around when the pod runtime sandbox was initialized with networking configured so that the kubelet can proceed to launch user-configured containers (including init containers) in the pod.

Special Cases

If a pod specifies hostNetwork as "True", the PodHasNetwork condition is set to "True" based on successful creation of the pod sandbox while the network configuration state of the pod sandbox is ignored. This is because the CRI implementation typically skips any pod sandbox network configuration when hostNetwork is set to "True" for a pod.

A node agent may dynamically re-configure network interface(s) for a pod by watching changes in pod annotations that specify additional networking configuration (e.g. k8s.v1.cni.cncf.io/networks). Dynamic updates of pod networking configuration after the pod sandbox is initialized by Kubelet (in coordination with a container runtime) are not reflected by the PodHasNetwork condition.

Try out the PodHasNetwork condition for pods

In order to have the kubelet report the PodHasNetwork condition in the status field of a pod, please enable the PodHasNetworkCondition feature gate on the kubelet.

For a pod whose runtime sandbox has been successfully created and has networking configured, the kubelet will report the PodHasNetwork condition with status set to "True":

$ kubectl describe pod nginx1
Name:             nginx1
Namespace:        default
...
Conditions:
  Type              Status
  PodHasNetwork     True
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True

For a pod whose runtime sandbox has not been created yet (and networking not configured either), the kubelet will report the PodHasNetwork condition with status set to "False":

$ kubectl describe pod nginx2
Name:             nginx2
Namespace:        default
...
Conditions:
  Type              Status
  PodHasNetwork     False
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True

What’s next?

Depending on feedback and adoption, the Kubernetes team plans to push the reporting of the PodHasNetwork condition to Beta in 1.26 or 1.27.

How can I learn more?

Please check out the documentation for the PodHasNetwork condition to learn more about it and how it fits in relation to other pod conditions.

How to get involved?

This feature is driven by the SIG Node community. Please join us to connect with the community and share your ideas and feedback around the above feature and beyond. We look forward to hearing from you!

Acknowledgements

We want to thank the following people for their insightful and helpful reviews of the KEP and PRs around this feature: Derek Carr (@derekwaynecarr), Mrunal Patel (@mrunalp), Dawn Chen (@dchen1107), Qiutong Song (@qiutongs), Ruiwen Zhao (@ruiwen-zhao), Tim Bannister (@sftim), Danielle Lancashire (@endocrimes) and Agam Dua (@agamdua).

Announcing the Auto-refreshing Official Kubernetes CVE Feed

A long-standing request from the Kubernetes community has been to have a programmatic way for end users to keep track of Kubernetes security issues (also called "CVEs", after the database that tracks public security issues across different products and vendors). Accompanying the release of Kubernetes v1.25, we are excited to announce availability of such a feed as an alpha feature. This blog will cover the background and scope of this new service.

Motivation

With the growing number of eyes on Kubernetes, the number of CVEs related to Kubernetes have increased. Although most CVEs that directly, indirectly, or transitively impact Kubernetes are regularly fixed, there is no single place for the end users of Kubernetes to programmatically subscribe or pull the data of fixed CVEs. Current options are either broken or incomplete.

Scope

What This Does

Create a periodically auto-refreshing, human and machine-readable list of official Kubernetes CVEs

What This Doesn't Do

  • Triage and vulnerability disclosure will continue to be done by SRC (Security Response Committee).
  • Listing CVEs that are identified in build time dependencies and container images are out of scope.
  • Only official CVEs announced by the Kubernetes SRC will be published in the feed.

Who It's For

  • End Users: Persons or teams who use Kubernetes to deploy applications they own
  • Platform Providers: Persons or teams who manage Kubernetes clusters
  • Maintainers: Persons or teams who create and support Kubernetes releases through their work in Kubernetes Community - via various Special Interest Groups and Committees.

Implementation Details

A supporting contributor blog was published that describes in depth on how this CVE feed was implemented to ensure the feed was reasonably protected against tampering and was automatically updated after a new CVE was announced.

What's Next?

In order to graduate this feature, SIG Security is gathering feedback from end users who are using this alpha feed.

So in order to improve the feed in future Kubernetes Releases, if you have any feedback, please let us know by adding a comment to this tracking issue or let us know on #sig-security-tooling Kubernetes Slack channel. (Join Kubernetes Slack here)

A special shout out and massive thanks to Neha Lohia (@nehalohia27) and Tim Bannister (@sftim) for their stellar collaboration for many months from "ideation to implementation" of this feature.

Kubernetes 1.25: KMS V2 Improvements

With Kubernetes v1.25, SIG Auth is introducing a new v2alpha1 version of the Key Management Service (KMS) API. There are a lot of improvements in the works, and we're excited to be able to start down the path of a new and improved KMS!

What is KMS?

One of the first things to consider when securing a Kubernetes cluster is encrypting persisted API data at rest. KMS provides an interface for a provider to utilize a key stored in an external key service to perform this encryption.

Encryption at rest using KMS v1 has been a feature of Kubernetes since version v1.10, and is currently in beta as of version v1.12.

What’s new in v2alpha1?

While the original v1 implementation has been successful in helping Kubernetes users encrypt etcd data, it did fall short in a few key ways:

  1. Performance: When starting a cluster, all resources are serially fetched and decrypted to fill the kube-apiserver cache. When using a KMS plugin, this can cause slow startup times due to the large number of requests made to the remote vault. In addition, there is the potential to hit API rate limits on external key services depending on how many encrypted resources exist in the cluster.
  2. Key Rotation: With KMS v1, rotation of a key-encrypting key is a manual and error-prone process. It can be difficult to determine what encryption keys are in-use on a cluster.
  3. Health Check & Status: Before the KMS v2 API, the kube-apiserver was forced to make encrypt and decrypt calls as a proxy to determine if the KMS plugin is healthy. With cloud services these operations usually cost actual money with cloud service. Whatever the cost, those operations on their own do not provide a holistic view of the service's health.
  4. Observability: Without some kind of trace ID, it's has been difficult to correlate events found in the various logs across kube-apiserver, KMS, and KMS plugins.

The KMS v2 enhancement attempts to address all of these shortcomings, though not all planned features are implemented in the initial alpha release. Here are the improvements that arrived in Kubernetes v1.25:

  1. Support for KMS plugins that use a key hierarchy to reduce network requests made to the remote vault. To learn more, check out the design details for how a KMS plugin can leverage key hierarchy.
  2. Extra metadata is now tracked to allow a KMS plugin to communicate what key it is currently using with the kube-apiserver, allowing for rotation without API server restart. Data stored in etcd follows a more standard proto format to allow external tools to observe its state. To learn more, check out the details for metadata.
  3. A dedicated status API is used to communicate the health of the KMS plugin with the API server. To learn more, check out the details for status API.
  4. To improve observability, a new UID field is included in EncryptRequest and DecryptRequest of the v2 API. The UID is generated for each envelope operation. To learn more, check out the details for observability.

Sequence Diagram

Encrypt Request

Sequence diagram for KMSv2 Encrypt

Decrypt Request

Sequence diagram for KMSv2 Decrypt

What’s next?

For Kubernetes v1.26, we expect to ship another alpha version. As of right now, the alpha API will be ready to be used by KMS plugin authors. We hope to include a reference plugin implementation with the next release, and you'll be able to try out the feature at that time.

You can learn more about KMS v2 by reading Using a KMS provider for data encryption. You can also follow along on the KEP to track progress across the coming Kubernetes releases.

How to get involved

If you are interested in getting involved in the development of this feature or would like to share feedback, please reach out on the #sig-auth-kms-dev channel on Kubernetes Slack.

You are also welcome to join the bi-weekly SIG Auth meetings, held every-other Wednesday.

Acknowledgements

This feature has been an effort driven by contributors from several different companies. We would like to extend a huge thank you to everyone that contributed their time and effort to help make this possible.

Kubernetes’s IPTables Chains Are Not API

Some Kubernetes components (such as kubelet and kube-proxy) create iptables chains and rules as part of their operation. These chains were never intended to be part of any Kubernetes API/ABI guarantees, but some external components nonetheless make use of some of them (in particular, using KUBE-MARK-MASQ to mark packets as needing to be masqueraded).

As a part of the v1.25 release, SIG Network made this declaration explicit: that (with one exception), the iptables chains that Kubernetes creates are intended only for Kubernetes’s own internal use, and third-party components should not assume that Kubernetes will create any specific iptables chains, or that those chains will contain any specific rules if they do exist.

Then, in future releases, as part of KEP-3178, we will begin phasing out certain chains that Kubernetes itself no longer needs. Components outside of Kubernetes itself that make use of KUBE-MARK-MASQ, KUBE-MARK-DROP, or other Kubernetes-generated iptables chains should start migrating away from them now.

Background

In addition to various service-specific iptables chains, kube-proxy creates certain general-purpose iptables chains that it uses as part of service proxying. In the past, kubelet also used iptables for a few features (such as setting up hostPort mapping for pods) and so it also redundantly created some of the same chains.

However, with the removal of dockershim in Kubernetes in 1.24, kubelet now no longer ever uses any iptables rules for its own purposes; the things that it used to use iptables for are now always the responsibility of the container runtime or the network plugin, and there is no reason for kubelet to be creating any iptables rules.

Meanwhile, although iptables is still the default kube-proxy backend on Linux, it is unlikely to remain the default forever, since the associated command-line tools and kernel APIs are essentially deprecated, and no longer receiving improvements. (RHEL 9 logs a warning if you use the iptables API, even via iptables-nft.)

Although as of Kubernetes 1.25 iptables kube-proxy remains popular, and kubelet continues to create the iptables rules that it historically created (despite no longer using them), third party software cannot assume that core Kubernetes components will keep creating these rules in the future.

Upcoming changes

Starting a few releases from now, kubelet will no longer create the following iptables chains in the nat table:

  • KUBE-MARK-DROP
  • KUBE-MARK-MASQ
  • KUBE-POSTROUTING

Additionally, the KUBE-FIREWALL chain in the filter table will no longer have the functionality currently associated with KUBE-MARK-DROP (and it may eventually go away entirely).

This change will be phased in via the IPTablesOwnershipCleanup feature gate. That feature gate is available and can be manually enabled for testing in Kubernetes 1.25. The current plan is that it will become enabled-by-default in Kubernetes 1.27, though this may be delayed to a later release. (It will not happen sooner than Kubernetes 1.27.)

What to do if you use Kubernetes’s iptables chains

(Although the discussion below focuses on short-term fixes that are still based on iptables, you should probably also start thinking about eventually migrating to nftables or another API).

If you use KUBE-MARK-MASQ...

If you are making use of the KUBE-MARK-MASQ chain to cause packets to be masqueraded, you have two options: (1) rewrite your rules to use -j MASQUERADE directly, (2) create your own alternative “mark for masquerade” chain.

The reason kube-proxy uses KUBE-MARK-MASQ is because there are lots of cases where it needs to call both -j DNAT and -j MASQUERADE on a packet, but it’s not possible to do both of those at the same time in iptables; DNAT must be called from the PREROUTING (or OUTPUT) chain (because it potentially changes where the packet will be routed to) while MASQUERADE must be called from POSTROUTING (because the masqueraded source IP that it picks depends on what the final routing decision was).

In theory, kube-proxy could have one set of rules to match packets in PREROUTING/OUTPUT and call -j DNAT, and then have a second set of rules to match the same packets in POSTROUTING and call -j MASQUERADE. But instead, for efficiency, it only matches them once, during PREROUTING/OUTPUT, at which point it calls -j DNAT and then calls -j KUBE-MARK-MASQ to set a bit on the kernel packet mark as a reminder to itself. Then later, during POSTROUTING, it has a single rule that matches all previously-marked packets, and calls -j MASQUERADE on them.

If you have a lot of rules where you need to apply both DNAT and masquerading to the same packets like kube-proxy does, then you may want a similar arrangement. But in many cases, components that use KUBE-MARK-MASQ are only doing it because they copied kube-proxy’s behavior without understanding why kube-proxy was doing it that way. Many of these components could easily be rewritten to just use separate DNAT and masquerade rules. (In cases where no DNAT is occurring then there is even less point to using KUBE-MARK-MASQ; just move your rules from PREROUTING to POSTROUTING and call -j MASQUERADE directly.)

If you use KUBE-MARK-DROP...

The rationale for KUBE-MARK-DROP is similar to the rationale for KUBE-MARK-MASQ: kube-proxy wanted to make packet-dropping decisions alongside other decisions in the nat KUBE-SERVICES chain, but you can only call -j DROP from the filter table. So instead, it uses KUBE-MARK-DROP to mark packets to be dropped later on.

In general, the approach for removing a dependency on KUBE-MARK-DROP is the same as for removing a dependency on KUBE-MARK-MASQ. In kube-proxy’s case, it is actually quite easy to replace the usage of KUBE-MARK-DROP in the nat table with direct calls to DROP in the filter table, because there are no complicated interactions between DNAT rules and drop rules, and so the drop rules can simply be moved from nat to filter.

In more complicated cases, it might be necessary to “re-match” the same packets in both nat and filter.

If you use Kubelet’s iptables rules to figure out iptables-legacy vs iptables-nft...

Components that manipulate host-network-namespace iptables rules from inside a container need some way to figure out whether the host is using the old iptables-legacy binaries or the newer iptables-nft binaries (which talk to a different kernel API underneath).

The iptables-wrappers module provides a way for such components to autodetect the system iptables mode, but in the past it did this by assuming that Kubelet will have created “a bunch” of iptables rules before any containers start, and so it can guess which mode the iptables binaries in the host filesystem are using by seeing which mode has more rules defined.

In future releases, Kubelet will no longer create many iptables rules, so heuristics based on counting the number of rules present may fail.

However, as of 1.24, Kubelet always creates a chain named KUBE-IPTABLES-HINT in the mangle table of whichever iptables subsystem it is using. Components can now look for this specific chain to know which iptables subsystem Kubelet (and thus, presumably, the rest of the system) is using.

(Additionally, since Kubernetes 1.17, kubelet has created a chain called KUBE-KUBELET-CANARY in the mangle table. While this chain may go away in the future, it will of course still be there in older releases, so in any recent version of Kubernetes, at least one of KUBE-IPTABLES-HINT or KUBE-KUBELET-CANARY will be present.)

The iptables-wrappers package has already been updated with this new heuristic, so if you were previously using that, you can rebuild your container images with an updated version of that.

Further reading

The project to clean up iptables chain ownership and deprecate the old chains is tracked by KEP-3178.

Introducing COSI: Object Storage Management using Kubernetes APIs

This article introduces the Container Object Storage Interface (COSI), a standard for provisioning and consuming object storage in Kubernetes. It is an alpha feature in Kubernetes v1.25.

File and block storage are treated as first class citizens in the Kubernetes ecosystem via Container Storage Interface (CSI). Workloads using CSI volumes enjoy the benefits of portability across vendors and across Kubernetes clusters without the need to change application manifests. An equivalent standard does not exist for Object storage.

Object storage has been rising in popularity in recent years as an alternative form of storage to filesystems and block devices. Object storage paradigm promotes disaggregation of compute and storage. This is done by making data available over the network, rather than locally. Disaggregated architectures allow compute workloads to be stateless, which consequently makes them easier to manage, scale and automate.

COSI

COSI aims to standardize consumption of object storage to provide the following benefits:

  • Kubernetes Native - Use the Kubernetes API to provision, configure and manage buckets
  • Self Service - A clear delineation between administration and operations (DevOps) to enable self-service capability for DevOps personnel
  • Portability - Vendor neutrality enabled through portability across Kubernetes Clusters and across Object Storage vendors

Portability across vendors is only possible when both vendors support a common datapath-API. Eg. it is possible to port from AWS S3 to Ceph, or AWS S3 to MinIO and back as they all use S3 API. In contrast, it is not possible to port from AWS S3 and Google Cloud’s GCS or vice versa.

Architecture

COSI is made up of three components:

  • COSI Controller Manager
  • COSI Sidecar
  • COSI Driver

The COSI Controller Manager acts as the main controller that processes changes to COSI API objects. It is responsible for fielding requests for bucket creation, updates, deletion and access management. One instance of the controller manager is required per kubernetes cluster. Only one is needed even if multiple object storage providers are used in the cluster.

The COSI Sidecar acts as a translator between COSI API requests and vendor-specific COSI Drivers. This component uses a standardized gRPC protocol that vendor drivers are expected to satisfy.

The COSI Driver is the vendor specific component that receives requests from the sidecar and calls the appropriate vendor APIs to create buckets, manage their lifecycle and manage access to them.

API

The COSI API is centered around buckets, since bucket is the unit abstraction for object storage. COSI defines three Kubernetes APIs aimed at managing them

  • Bucket
  • BucketClass
  • BucketClaim

In addition, two more APIs for managing access to buckets are also defined:

  • BucketAccess
  • BucketAccessClass

In a nutshell, Bucket and BucketClaim can be considered to be similar to PersistentVolume and PersistentVolumeClaim respectively. The BucketClass’ counterpart in the file/block device world is StorageClass.

Since Object Storage is always authenticated, and over the network, access credentials are required to access buckets. The two APIs, namely, BucketAccess and BucketAccessClass are used to denote access credentials and policies for authentication. More info about these APIs can be found in the official COSI proposal - https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1979-object-storage-support

Self-Service

Other than providing kubernetes-API driven bucket management, COSI also aims to empower DevOps personnel to provision and manage buckets on their own, without admin intervention. This, further enabling dev teams to realize faster turn-around times and faster time-to-market.

COSI achieves this by dividing bucket provisioning steps among two different stakeholders, namely the administrator (admin), and the cluster operator. The administrator will be responsible for setting broad policies and limits on how buckets are provisioned, and how access is obtained for them. The cluster operator will be free to create and utilize buckets within the limits set by the admin.

For example, a cluster operator could use an admin policy could be used to restrict maximum provisioned capacity to 100GB, and developers would be allowed to create buckets and store data upto that limit. Similarly for access credentials, admins would be able to restrict who can access which buckets, and developers would be able to access all the buckets available to them.

Portability

The third goal of COSI is to achieve vendor neutrality for bucket management. COSI enables two kinds of portability:

  • Cross Cluster
  • Cross Provider

Cross Cluster portability is allowing buckets provisioned in one cluster to be available in another cluster. This is only valid when the object storage backend itself is accessible from both clusters.

Cross-provider portability is about allowing organizations or teams to move from one object storage provider to another seamlessly, and without requiring changes to application definitions (PodTemplates, StatefulSets, Deployment and so on). This is only possible if the source and destination providers use the same data.

COSI does not handle data migration as it is outside of its scope. In case porting between providers requires data to be migrated as well, then other measures need to be taken to ensure data availability.

What’s next

The amazing sig-storage-cosi community has worked hard to bring the COSI standard to alpha status. We are looking forward to onboarding a lot of vendors to write COSI drivers and become COSI compatible!

We want to add more authentication mechanisms for COSI buckets, we are designing advanced bucket sharing primitives, multi-cluster bucket management and much more. Lots of great ideas and opportunities ahead!

Stay tuned for what comes next, and if you have any questions, comments or suggestions

Kubernetes 1.25: cgroup v2 graduates to GA

Kubernetes 1.25 brings cgroup v2 to GA (general availability), letting the kubelet use the latest container resource management capabilities.

What are cgroups?

Effective resource management is a critical aspect of Kubernetes. This involves managing the finite resources in your nodes, such as CPU, memory, and storage.

cgroups are a Linux kernel capability that establish resource management functionality like limiting CPU usage or setting memory limits for running processes.

When you use the resource management capabilities in Kubernetes, such as configuring requests and limits for Pods and containers, Kubernetes uses cgroups to enforce your resource requests and limits.

The Linux kernel offers two versions of cgroups: cgroup v1 and cgroup v2.

What is cgroup v2?

cgroup v2 is the latest version of the Linux cgroup API. cgroup v2 provides a unified control system with enhanced resource management capabilities.

cgroup v2 has been in development in the Linux Kernel since 2016 and in recent years has matured across the container ecosystem. With Kubernetes 1.25, cgroup v2 support has graduated to general availability.

Many recent releases of Linux distributions have switched over to cgroup v2 by default so it's important that Kubernetes continues to work well on these new updated distros.

cgroup v2 offers several improvements over cgroup v1, such as the following:

  • Single unified hierarchy design in API
  • Safer sub-tree delegation to containers
  • Newer features like Pressure Stall Information
  • Enhanced resource allocation management and isolation across multiple resources
    • Unified accounting for different types of memory allocations (network and kernel memory, etc)
    • Accounting for non-immediate resource changes such as page cache write backs

Some Kubernetes features exclusively use cgroup v2 for enhanced resource management and isolation. For example, the MemoryQoS feature improves memory utilization and relies on cgroup v2 functionality to enable it. New resource management features in the kubelet will also take advantage of the new cgroup v2 features moving forward.

How do you use cgroup v2?

Many Linux distributions are switching to cgroup v2 by default; you might start using it the next time you update the Linux version of your control plane and nodes!

Using a Linux distribution that uses cgroup v2 by default is the recommended method. Some of the popular Linux distributions that use cgroup v2 include the following:

  • Container Optimized OS (since M97)
  • Ubuntu (since 21.10)
  • Debian GNU/Linux (since Debian 11 Bullseye)
  • Fedora (since 31)
  • Arch Linux (since April 2021)
  • RHEL and RHEL-like distributions (since 9)

To check if your distribution uses cgroup v2 by default, refer to Check your cgroup version or consult your distribution's documentation.

If you're using a managed Kubernetes offering, consult your provider to determine how they're adopting cgroup v2, and whether you need to take action.

To use cgroup v2 with Kubernetes, you must meet the following requirements:

  • Your Linux distribution enables cgroup v2 on kernel version 5.8 or later
  • Your container runtime supports cgroup v2. For example:
  • The kubelet and the container runtime are configured to use the systemd cgroup driver

The kubelet and container runtime use a cgroup driver to set cgroup parameters. When using cgroup v2, it's strongly recommended that both the kubelet and your container runtime use the systemd cgroup driver, so that there's a single cgroup manager on the system. To configure the kubelet and the container runtime to use the driver, refer to the systemd cgroup driver documentation.

Migrate to cgroup v2

When you run Kubernetes with a Linux distribution that enables cgroup v2, the kubelet should automatically adapt without any additional configuration required, as long as you meet the requirements.

In most cases, you won't see a difference in the user experience when you switch to using cgroup v2 unless your users access the cgroup file system directly.

If you have applications that access the cgroup file system directly, either on the node or from inside a container, you must update the applications to use the cgroup v2 API instead of the cgroup v1 API.

Scenarios in which you might need to update to cgroup v2 include the following:

  • If you run third-party monitoring and security agents that depend on the cgroup file system, update the agents to versions that support cgroup v2.
  • If you run cAdvisor as a stand-alone DaemonSet for monitoring pods and containers, update it to v0.43.0 or later.
  • If you deploy Java applications, prefer to use versions which fully support cgroup v2:

Learn more

Get involved

Your feedback is always welcome! SIG Node meets regularly and are available in the #sig-node channel in the Kubernetes Slack, or using the SIG mailing list.

cgroup v2 has had a long journey and is a great example of open source community collaboration across the industry because it required work across the stack, from the Linux Kernel to systemd to various container runtimes, and (of course) Kubernetes.

Acknowledgments

We would like to thank Giuseppe Scrivano who initiated cgroup v2 support in Kubernetes, and reviews and leadership from the SIG Node community including chairs Dawn Chen and Derek Carr.

We'd also like to thank the maintainers of container runtimes like Docker, containerd and CRI-O, and the maintainers of components like cAdvisor and runc, libcontainer, which underpin many container runtimes. Finally, this wouldn't have been possible without support from systemd and upstream Linux Kernel maintainers.

It's a team effort!

Kubernetes 1.25: CSI Inline Volumes have graduated to GA

CSI Inline Volumes were introduced as an alpha feature in Kubernetes 1.15 and have been beta since 1.16. We are happy to announce that this feature has graduated to General Availability (GA) status in Kubernetes 1.25.

CSI Inline Volumes are similar to other ephemeral volume types, such as configMap, downwardAPI and secret. The important difference is that the storage is provided by a CSI driver, which allows the use of ephemeral storage provided by third-party vendors. The volume is defined as part of the pod spec and follows the lifecycle of the pod, meaning the volume is created once the pod is scheduled and destroyed when the pod is destroyed.

What's new in 1.25?

There are a couple of new bug fixes related to this feature in 1.25, and the CSIInlineVolume feature gate has been locked to True with the graduation to GA. There are no new API changes, so users of this feature during beta should not notice any significant changes aside from these bug fixes.

When to use this feature

CSI inline volumes are meant for simple local volumes that should follow the lifecycle of the pod. They may be useful for providing secrets, configuration data, or other special-purpose storage to the pod from a CSI driver.

A CSI driver is not suitable for inline use when:

  • The volume needs to persist longer than the lifecycle of a pod
  • Volume snapshots, cloning, or volume expansion are required
  • The CSI driver requires volumeAttributes that should be restricted to an administrator

How to use this feature

In order to use this feature, the CSIDriver spec must explicitly list Ephemeral as one of the supported volumeLifecycleModes. Here is a simple example from the Secrets Store CSI Driver.

apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: secrets-store.csi.k8s.io
spec:
  podInfoOnMount: true
  attachRequired: false
  volumeLifecycleModes:
  - Ephemeral

Any pod spec may then reference that CSI driver to create an inline volume, as in this example.

kind: Pod
apiVersion: v1
metadata:
  name: my-csi-app-inline
spec:
  containers:
    - name: my-frontend
      image: busybox
      volumeMounts:
      - name: secrets-store-inline
        mountPath: "/mnt/secrets-store"
        readOnly: true
      command: [ "sleep", "1000000" ]
  volumes:
    - name: secrets-store-inline
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: "my-provider"

If the driver supports any volume attributes, you can provide these as part of the spec for the Pod as well:

      csi:
        driver: block.csi.vendor.example
        volumeAttributes:
          foo: bar

Example Use Cases

Two existing CSI drivers that support the Ephemeral volume lifecycle mode are the Secrets Store CSI Driver and the Cert-Manager CSI Driver.

The Secrets Store CSI Driver allows users to mount secrets from external secret stores into a pod as an inline volume. This can be useful when the secrets are stored in an external managed service or Vault instance.

The Cert-Manager CSI Driver works along with cert-manager to seamlessly request and mount certificate key pairs into a pod. This allows the certificates to be renewed and updated in the application pod automatically.

Security Considerations

Special consideration should be given to which CSI drivers may be used as inline volumes. volumeAttributes are typically controlled through the StorageClass, and may contain attributes that should remain restricted to the cluster administrator. Allowing a CSI driver to be used for inline ephmeral volumes means that any user with permission to create pods may also provide volumeAttributes to the driver through a pod spec.

Cluster administrators may choose to omit (or remove) Ephemeral from volumeLifecycleModes in the CSIDriver spec to prevent the driver from being used as an inline ephemeral volume, or use an admission webhook to restrict how the driver is used.

References

For more information on this feature, see:

Kubernetes v1.25: Pod Security Admission Controller in Stable

The release of Kubernetes v1.25 marks a major milestone for Kubernetes out-of-the-box pod security controls: Pod Security admission (PSA) graduated to stable, and Pod Security Policy (PSP) has been removed. PSP was deprecated in Kubernetes v1.21, and no longer functions in Kubernetes v1.25 and later.

The Pod Security admission controller replaces PodSecurityPolicy, making it easier to enforce predefined Pod Security Standards by simply adding a label to a namespace. The Pod Security Standards are maintained by the K8s community, which means you automatically get updated security policies whenever new security-impacting Kubernetes features are introduced.

What’s new since Beta?

Pod Security Admission hasn’t changed much since the Beta in Kubernetes v1.23. The focus has been on improving the user experience, while continuing to maintain a high quality bar.

Improved violation messages

We improved violation messages so that you get fewer duplicate messages. For example, instead of the following message when the Baseline and Restricted policies check the same capability:

pods "admin-pod" is forbidden: violates PodSecurity "restricted:latest": non-default capabilities (container "admin" must not include "SYS_ADMIN" in securityContext.capabilities.add), unrestricted capabilities (container "admin" must not include "SYS_ADMIN" in securityContext.capabilities.add)

You get this message:

pods "admin-pod" is forbidden: violates PodSecurity "restricted:latest": unrestricted capabilities (container "admin" must not include "SYS_ADMIN" in securityContext.capabilities.add)

Improved namespace warnings

When you modify the enforce Pod Security labels on a namespace, the Pod Security admission controller checks all existing pods for violations and surfaces a warning if any are out of compliance. These warnings are now aggregated for pods with identical violations, making large namespaces with many replicas much more manageable. For example:

Warning: frontend-h23gf2: allowPrivilegeEscalation != false
Warning: myjob-g342hj (and 6 other pods): host namespaces, allowPrivilegeEscalation != false Warning: backend-j23h42 (and 1 other pod): non-default capabilities, unrestricted capabilities

Additionally, when you apply a non-privileged label to a namespace that has been configured to be exempt, you will now get a warning alerting you to this fact:

Warning: namespace 'kube-system' is exempt from Pod Security, and the policy (enforce=baseline:latest) will be ignored

Changes to the Pod Security Standards

The Pod Security Standards, which Pod Security admission enforces, have been updated with support for the new Pod OS field. In v1.25 and later, if you use the Restricted policy, the following Linux-specific restrictions will no longer be required if you explicitly set the pod's .spec.os.name field to windows:

  • Seccomp - The seccompProfile.type field for Pod and container security contexts
  • Privilege escalation - The allowPrivilegeEscalation field on container security contexts
  • Capabilities - The requirement to drop ALL capabilities in the capabilities field on containers

In Kubernetes v1.23 and earlier, the kubelet didn't enforce the Pod OS field. If your cluster includes nodes running a v1.23 or older kubelet, you should explicitly pin Restricted policies to a version prior to v1.25.

Migrating from PodSecurityPolicy to the Pod Security admission controller

For instructions to migrate from PodSecurityPolicy to the Pod Security admission controller, and for help choosing a migration strategy, refer to the migration guide. We're also developing a tool called pspmigrator to automate parts of the migration process.

We'll be talking about PSP migration in more detail at our upcoming KubeCon 2022 NA talk, Migrating from Pod Security Policy. Use the KubeCon NA schedule to learn more.

PodSecurityPolicy: The Historical Context

The PodSecurityPolicy (PSP) admission controller has been removed, as of Kubernetes v1.25. Its deprecation was announced and detailed in the blog post PodSecurityPolicy Deprecation: Past, Present, and Future, published for the Kubernetes v1.21 release.

This article aims to provide historical context on the birth and evolution of PSP, explain why the feature never made it to stable, and show why it was removed and replaced by Pod Security admission control.

PodSecurityPolicy, like other specialized admission control plugins, provided fine-grained permissions on specific fields concerning the pod security settings as a built-in policy API. It acknowledged that cluster administrators and cluster users are usually not the same people, and that creating workloads in the form of a Pod or any resource that will create a Pod should not equal being "root on the cluster". It could also encourage best practices by configuring more secure defaults through mutation and decoupling low-level Linux security decisions from the deployment process.

The birth of PodSecurityPolicy

PodSecurityPolicy originated from OpenShift's SecurityContextConstraints (SCC) that were in the very first release of the Red Hat OpenShift Container Platform, even before Kubernetes 1.0. PSP was a stripped-down version of the SCC.

The origin of the creation of PodSecurityPolicy is difficult to track, notably because it was mainly added before Kubernetes Enhancements Proposal (KEP) process, when design proposals were still a thing. Indeed, the archive of the final design proposal is still available. Nevertheless, a KEP issue number five was created after the first pull requests were merged.

Before adding the first piece of code that created PSP, two main pull requests were merged into Kubernetes, a SecurityContext subresource that defined new fields on pods' containers, and the first iteration of the ServiceAccount API.

Kubernetes 1.0 was released on 10 July 2015 without any mechanism to restrict the security context and sensitive options of workloads, other than an alpha-quality SecurityContextDeny admission plugin (then known as scdeny). The SecurityContextDeny plugin is still in Kubernetes today (as an alpha feature) and creates an admission controller that prevents the usage of some fields in the security context.

The roots of the PodSecurityPolicy were added with the very first pull request on security policy, which added the design proposal with the new PSP object, based on the SCC (Security Context Constraints). It was a long discussion of nine months, with back and forth from OpenShift's SCC, many rebases, and the rename to PodSecurityPolicy that finally made it to upstream Kubernetes in February 2016. Now that the PSP object had been created, the next step was to add an admission controller that could enforce these policies. The first step was to add the admission without taking into account the users or groups. A specific issue to bring PodSecurityPolicy to a usable state was added to keep track of the progress and a first version of the admission controller was merged in pull request named PSP admission in May 2016. Then around two months later, Kubernetes 1.3 was released.

Here is a timeline that recaps the main pull requests of the birth of the PodSecurityPolicy and its admission controller with 1.0 and 1.3 releases as reference points.

Timeline of the PodSecurityPolicy creation pull requests

After that, the PSP admission controller was enhanced by adding what was initially left aside. The authorization mechanism, merged in early November 2016 allowed administrators to use multiple policies in a cluster to grant different levels of access for different types of users. Later, a pull request merged in October 2017 fixed a design issue on ordering PodSecurityPolicies between mutating and alphabetical order, and continued to build the PSP admission as we know it. After that, many improvements and fixes followed to build the PodSecurityPolicy feature of recent Kubernetes releases.

The rise of Pod Security Admission

Despite the crucial issue it was trying to solve, PodSecurityPolicy presented some major flaws:

  • Flawed authorization model - users can create a pod if they have the use verb on the PSP that allows that pod or the pod's service account has the use permission on the allowing PSP.
  • Difficult to roll out - PSP fail-closed. That is, in the absence of a policy, all pods are denied. It mostly means that it cannot be enabled by default and that users have to add PSPs for all workloads before enabling the feature, thus providing no audit mode to discover which pods would not be allowed by the new policy. The opt-in model also leads to insufficient test coverage and frequent breakage due to cross-feature incompatibility. And unlike RBAC, there was no strong culture of shipping PSP manifests with projects.
  • Inconsistent unbounded API - the API has grown with lots of inconsistencies notably because of many requests for niche use cases: e.g. labels, scheduling, fine-grained volume controls, etc. It has poor composability with a weak prioritization model, leading to unexpected mutation priority. It made it really difficult to combine PSP with other third-party admission controllers.
  • Require security knowledge - effective usage still requires an understanding of Linux security primitives. e.g. MustRunAsNonRoot + AllowPrivilegeEscalation.

The experience with PodSecurityPolicy concluded that most users care for two or three policies, which led to the creation of the Pod Security Standards, that define three policies:

  • Privileged - unrestricted policy.
  • Baseline - minimally restrictive policy, allowing the default pod configuration.
  • Restricted - security best practice policy.

The replacement for PSP, the new Pod Security Admission is an in-tree, stable for Kubernetes v1.25, admission plugin to enforce these standards at the namespace level. It makes it easier to enforce basic pod security without deep security knowledge. For more sophisticated use cases, you might need a third-party solution that can be easily combined with Pod Security Admission.

What's next

For further details on the SIG Auth processes, covering PodSecurityPolicy removal and creation of Pod Security admission, the SIG auth update at KubeCon NA 2019 and the PodSecurityPolicy Replacement: Past, Present, and Future presentation at KubeCon NA 2021 records are available.

Particularly on the PSP removal, the PodSecurityPolicy Deprecation: Past, Present, and Future blog post is still accurate.

And for the new Pod Security admission, documentation is available. In addition, the blog post Kubernetes 1.23: Pod Security Graduates to Beta along with the KubeCon EU 2022 presentation The Hitchhiker's Guide to Pod Security give great hands-on tutorials to learn.

Kubernetes v1.25: Combiner

Announcing the release of Kubernetes v1.25!

This release includes a total of 40 enhancements. Fifteen of those enhancements are entering Alpha, ten are graduating to Beta, and thirteen are graduating to Stable. We also have two features being deprecated or removed.

Kubernetes 1.25: Combiner

The theme for Kubernetes v1.25 is Combiner.

The Kubernetes project itself is made up of many, many individual components that, when combined, take the form of the project you see today. It is also built and maintained by many individuals, all of them with different skills, experiences, histories, and interests, who join forces not just as the release team but as the many SIGs that support the project and the community year-round.

With this release, we wish to honor the collaborative, open spirit that takes us from isolated developers, writers, and users spread around the globe to a combined force capable of changing the world. Kubernetes v1.25 includes a staggering 40 enhancements, none of which would exist without the incredible power we have when we work together.

Inspired by our release lead's son, Albert Song, Kubernetes v1.25 is named for each and every one of you, no matter how you choose to contribute your unique power to the combined force that becomes Kubernetes.

What's New (Major Themes)

PodSecurityPolicy is removed; Pod Security Admission graduates to Stable

PodSecurityPolicy was initially deprecated in v1.21, and with the release of v1.25, it has been removed. The updates required to improve its usability would have introduced breaking changes, so it became necessary to remove it in favor of a more friendly replacement. That replacement is Pod Security Admission, which graduates to Stable with this release. If you are currently relying on PodSecurityPolicy, please follow the instructions for migration to Pod Security Admission.

Ephemeral Containers Graduate to Stable

Ephemeral Containers are containers that exist for only a limited time within an existing pod. This is particularly useful for troubleshooting when you need to examine another container but cannot use kubectl exec because that container has crashed or its image lacks debugging utilities. Ephemeral containers graduated to Beta in Kubernetes v1.23, and with this release, the feature graduates to Stable.

Support for cgroups v2 Graduates to Stable

It has been more than two years since the Linux kernel cgroups v2 API was declared stable. With some distributions now defaulting to this API, Kubernetes must support it to continue operating on those distributions. cgroups v2 offers several improvements over cgroups v1, for more information see the cgroups v2 documentation. While cgroups v1 will continue to be supported, this enhancement puts us in a position to be ready for its eventual deprecation and replacement.

Improved Windows support

Moved container registry service from k8s.gcr.io to registry.k8s.io

Moving container registry from k8s.gcr.io to registry.k8s.io got merged. For more details, see the wiki page, announcement was sent to the kubernetes development mailing list.

SeccompDefault promoted to beta, see the tutorial Restrict a Container's Syscalls with seccomp for more details.

Promoted endPort in Network Policy to GA. Network Policy providers that support endPort field now can use it to specify a range of ports to apply a Network Policy. Previously, each Network Policy could only target a single port.

Please be aware that endPort field must be supported by the Network Policy provider. If your provider does not support endPort, and this field is specified in a Network Policy, the Network Policy will be created covering only the port field (single port).

The Local Ephemeral Storage Capacity Isolation feature moved to GA. This was introduced as alpha in 1.8, moved to beta in 1.10, and it is now a stable feature. It provides support for capacity isolation of local ephemeral storage between pods, such as EmptyDir, so that a pod can be hard limited in its consumption of shared resources by evicting Pods if its consumption of local ephemeral storage exceeds that limit.

CSI Migration is an ongoing effort that SIG Storage has been working on for a few releases. The goal is to move in-tree volume plugins to out-of-tree CSI drivers and eventually remove the in-tree volume plugins. The core CSI Migration feature moved to GA. CSI Migration for GCE PD and AWS EBS also moved to GA. CSI Migration for vSphere remains in beta (but is on by default). CSI Migration for Portworx moved to Beta (but is off-by-default).

The CSI Ephemeral Volume feature allows CSI volumes to be specified directly in the pod specification for ephemeral use cases. They can be used to inject arbitrary states, such as configuration, secrets, identity, variables or similar information, directly inside pods using a mounted volume. This was initially introduced in 1.15 as an alpha feature, and it moved to GA. This feature is used by some CSI drivers such as the secret-store CSI driver.

CRD Validation Expression Language is promoted to beta, which makes it possible to declare how custom resources are validated using the Common Expression Language (CEL). Please see the validation rules guide.

Promoted the ServerSideFieldValidation feature gate to beta (on by default). This allows optionally triggering schema validation on the API server that errors when unknown fields are detected. This allows the removal of client-side validation from kubectl while maintaining the same core functionality of erroring out on requests that contain unknown or invalid fields.

Introduced KMS v2 API

Introduce KMS v2alpha1 API to add performance, rotation, and observability improvements. Encrypt data at rest (ie Kubernetes Secrets) with DEK using AES-GCM instead of AES-CBC for kms data encryption. No user action is required. Reads with AES-GCM and AES-CBC will continue to be allowed. See the guide Using a KMS provider for data encryption for more information.

Kube-proxy images are now based on distroless images

In previous releases, kube-proxy container images were built using Debian as the base image. Starting with this release, the images are now built using distroless. This change reduced image size by almost 50% and decreased the number of installed packages and files to only those strictly required for kube-proxy to do its job.

Other Updates

Graduations to Stable

This release includes a total of thirteen enhancements promoted to stable:

Deprecations and Removals

Two features were deprecated or removed from Kubernetes with this release.

Release Notes

The complete details of the Kubernetes v1.25 release are available in our release notes.

Availability

Kubernetes v1.25 is available for download on GitHub. To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using containers as “nodes”, with kind. You can also easily install 1.25 using kubeadm.

Release Team

Kubernetes is only possible with the support, commitment, and hard work of its community. Each release team is made up of dedicated community volunteers who work together to build the many pieces that, when combined, make up the Kubernetes releases you rely on. This requires the specialized skills of people from all corners of our community, from the code itself to its documentation and project management.

We would like to thank the entire release team for the hours spent hard at work to ensure we deliver a solid Kubernetes v1.25 release for our community. Every one of you had a part to play in building this, and you all executed beautifully. We would like to extend special thanks to our fearless release lead, Cici Huang, for all she did to guarantee we had what we needed to succeed.

User Highlights

Ecosystem Updates

  • KubeCon + CloudNativeCon North America 2022 will take place in Detroit, Michigan from 24 – 28 October 2022! You can find more information about the conference and registration on the event site.
  • KubeDay event series kicks off with KubeDay Japan on December 7! Register or submit a proposal on the event site
  • In the 2021 Cloud Native Survey, the CNCF saw record Kubernetes and container adoption. Take a look at the results of the survey.

Project Velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing, and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

In the v1.25 release cycle, which ran for 14 weeks (May 23 to August 23), we saw contributions from 1065 companies and 1620 individuals.

Upcoming Release Webinar

Join members of the Kubernetes v1.25 release team on Thursday September 22, 2022 10am – 11am PT to learn about the major features of this release, as well as deprecations and removals to help plan for upgrades. For more information and registration, visit the event page.

Get Involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below:

Spotlight on SIG Storage

Since the very beginning of Kubernetes, the topic of persistent data and how to address the requirement of stateful applications has been an important topic. Support for stateless deployments was natural, present from the start, and garnered attention, becoming very well-known. Work on better support for stateful applications was also present from early on, with each release increasing the scope of what could be run on Kubernetes.

Message queues, databases, clustered filesystems: these are some examples of the solutions that have different storage requirements and that are, today, increasingly deployed in Kubernetes. Dealing with ephemeral and persistent storage, local or remote, file or block, from many different vendors, while considering how to provide the needed resiliency and data consistency that users expect, all of this is under SIG Storage's umbrella.

In this SIG Storage spotlight, Frederico Muñoz (Cloud & Architecture Lead at SAS) talked with Xing Yang, Tech Lead at VMware and co-chair of SIG Storage, on how the SIG is organized, what are the current challenges and how anyone can get involved and contribute.

About SIG Storage

Frederico (FSM): Hello, thank you for the opportunity of learning more about SIG Storage. Could you tell us a bit about yourself, your role, and how you got involved in SIG Storage.

Xing Yang (XY): I am a Tech Lead at VMware, working on Cloud Native Storage. I am also a Co-Chair of SIG Storage. I started to get involved in K8s SIG Storage at the end of 2017, starting with contributing to the VolumeSnapshot project. At that time, the VolumeSnapshot project was still in an experimental, pre-alpha stage. It needed contributors. So I volunteered to help. Then I worked with other community members to bring VolumeSnapshot to Alpha in K8s 1.12 release in 2018, Beta in K8s 1.17 in 2019, and eventually GA in 1.20 in 2020.

FSM: Reading the SIG Storage charter alone it’s clear that SIG Storage covers a lot of ground, could you describe how the SIG is organised?

XY: In SIG Storage, there are two Co-Chairs and two Tech Leads. Saad Ali from Google and myself are Co-Chairs. Michelle Au from Google and Jan Šafránek from Red Hat are Tech Leads.

We have bi-weekly meetings where we go through features we are working on for each particular release, getting the statuses, making sure each feature has dev owners and reviewers working on it, and reminding people about the release deadlines, etc. More information on the SIG is on the community page. People can also add PRs that need attention, design proposals that need discussion, and other topics to the meeting agenda doc. We will go over them after project tracking is done.

We also have other regular meetings, i.e., CSI Implementation meeting, Object Bucket API design meeting, and one-off meetings for specific topics if needed. There is also a K8s Data Protection Workgroup that is sponsored by SIG Storage and SIG Apps. SIG Storage owns or co-owns features that are being discussed at the Data Protection WG.

Storage and Kubernetes

FSM: Storage is such a foundational component in so many things, not least in Kubernetes: what do you think are the Kubernetes-specific challenges in terms of storage management?

XY: In Kubernetes, there are multiple components involved for a volume operation. For example, creating a Pod to use a PVC has multiple components involved. There are the Attach Detach Controller and the external-attacher working on attaching the PVC to the pod. There’s the Kubelet that works on mounting the PVC to the pod. Of course the CSI driver is involved as well. There could be race conditions sometimes when coordinating between multiple components.

Another challenge is regarding core vs Custom Resource Definitions (CRD), not really storage specific. CRD is a great way to extend Kubernetes capabilities while not adding too much code to the Kubernetes core itself. However, this also means there are many external components that are needed when running a Kubernetes cluster.

From the SIG Storage side, one most notable example is Volume Snapshot. Volume Snapshot APIs are defined as CRDs. API definitions and controllers are out-of-tree. There is a common snapshot controller and a snapshot validation webhook that should be deployed on the control plane, similar to how kube-controller-manager is deployed. Although Volume Snapshot is a CRD, it is a core feature of SIG Storage. It is recommended for the K8s cluster distros to deploy Volume Snapshot CRDs, the snapshot controller, and the snapshot validation webhook, however, most of the time we don’t see distros deploy them. So this becomes a problem for the storage vendors: now it becomes their responsibility to deploy these non-driver specific common components. This could cause conflicts if a customer wants to use more than one storage system and deploy more than one CSI driver.

FSM: Not only the complexity of a single storage system, you have to consider how they will be used together in Kubernetes?

XY: Yes, there are many different storage systems that can provide storage to containers in Kubernetes. They don’t work the same way. It is challenging to find a solution that works for everyone.

FSM: Storage in Kubernetes also involves interacting with external solutions, perhaps more so than other parts of Kubernetes. Is this interaction with vendors and external providers challenging? Has it evolved with time in any way?

XY: Yes, it is definitely challenging. Initially Kubernetes storage had in-tree volume plugin interfaces. Multiple storage vendors implemented in-tree interfaces and have volume plugins in the Kubernetes core code base. This caused lots of problems. If there is a bug in a volume plugin, it affects the entire Kubernetes code base. All volume plugins must be released together with Kubernetes. There was no flexibility if storage vendors need to fix a bug in their plugin or want to align with their own product release.

FSM: That’s where CSI enters the game?

XY: Exactly, then there comes Container Storage Interface (CSI). This is an industry standard trying to design common storage interfaces so that a storage vendor can write one plugin and have it work across a range of container orchestration systems (CO). Now Kubernetes is the main CO, but back when CSI just started, there were Docker, Mesos, Cloud Foundry, in addition to Kubernetes. CSI drivers are out-of-tree so bug fixes and releases can happen at their own pace.

CSI is definitely a big improvement compared to in-tree volume plugins. Kubernetes implementation of CSI has been GA since the 1.13 release. It has come a long way. SIG Storage has been working on moving in-tree volume plugins to out-of-tree CSI drivers for several releases now.

FSM: Moving drivers away from the Kubernetes main tree and into CSI was an important improvement.

XY: CSI interface is an improvement over the in-tree volume plugin interface, however, there are still challenges. There are lots of storage systems. Currently there are more than 100 CSI drivers listed in CSI driver docs. These storage systems are also very diverse. So it is difficult to design a common API that works for all. We introduced capabilities at CSI driver level, but we also have challenges when volumes provisioned by the same driver have different behaviors. The other day we just had a meeting discussing Per Volume CSI Driver Capabilities. We have a problem differentiating some CSI driver capabilities when the same driver supports both block and file volumes. We are going to have follow up meetings to discuss this problem.

Ongoing challenges

FSM: Specifically for the 1.25 release we can see that there are a relevant number of storage-related KEPs in the pipeline, would you say that this release is particularly important for the SIG?

XY: I wouldn’t say one release is more important than other releases. In any given release, we are working on a few very important things.

FSM: Indeed, but are there any 1.25 specific specificities and highlights you would like to point out though?

XY: Yes. For the 1.25 release, I want to highlight the following:

  • CSI Migration is an on-going effort that SIG Storage has been working on for a few releases now. The goal is to move in-tree volume plugins to out-of-tree CSI drivers and eventually remove the in-tree volume plugins. There are 7 KEPs that we are targeting in 1.25 are related to CSI migration. There is one core KEP for the general CSI Migration feature. That is targeting GA in 1.25. CSI Migration for GCE PD and AWS EBS are targeting GA. CSI Migration for vSphere is targeting to have the feature gate on by default while staying in 1.25 that are in Beta. Ceph RBD and PortWorx are targeting Beta, with feature gate off by default. Ceph FS is targeting Alpha.
  • The second one I want to highlight is COSI, the Container Object Storage Interface. This is a sub-project under SIG Storage. COSI proposes object storage Kubernetes APIs to support orchestration of object store operations for Kubernetes workloads. It also introduces gRPC interfaces for object storage providers to write drivers to provision buckets. The COSI team has been working on this project for more than two years now. The COSI feature is targeting Alpha in 1.25. The KEP just got merged. The COSI team is working on updating the implementation based on the updated KEP.
  • Another feature I want to mention is CSI Ephemeral Volume support. This feature allows CSI volumes to be specified directly in the pod specification for ephemeral use cases. They can be used to inject arbitrary states, such as configuration, secrets, identity, variables or similar information, directly inside pods using a mounted volume. This was initially introduced in 1.15 as an alpha feature, and it is now targeting GA in 1.25.

FSM: If you had to single something out, what would be the most pressing areas the SIG is working on?

XY: CSI migration is definitely one area that the SIG has put in lots of effort and it has been on-going for multiple releases now. It involves work from multiple cloud providers and storage vendors as well.

Community involvement

FSM: Kubernetes is a community-driven project. Any recommendation for anyone looking into getting involved in SIG Storage work? Where should they start?

XY: Take a look at the SIG Storage community page, it has lots of information on how to get started. There are SIG annual reports that tell you what we did each year. Take a look at the Contributing guide. It has links to presentations that can help you get familiar with Kubernetes storage concepts.

Join our bi-weekly meetings on Thursdays. Learn how the SIG operates and what we are working on for each release. Find a project that you are interested in and help out. As I mentioned earlier, I got started in SIG Storage by contributing to the Volume Snapshot project.

FSM: Any closing thoughts you would like to add?

XY: SIG Storage always welcomes new contributors. We need contributors to help with building new features, fixing bugs, doing code reviews, writing tests, monitoring test grid health, and improving documentation, etc.

FSM: Thank you so much for your time and insights into the workings of SIG Storage!

Meet Our Contributors - APAC (China region)

Authors & Interviewers: Avinesh Tripathi, Debabrata Panigrahi, Jayesh Srivastava, Priyanka Saggu, Purneswar Prasad, Vedant Kakde


Hello, everyone 👋

Welcome back to the third edition of the "Meet Our Contributors" blog post series for APAC.

This post features four outstanding contributors from China, who have played diverse leadership and community roles in the upstream Kubernetes project.

So, without further ado, let's get straight to the article.

Andy Zhang

Andy Zhang currently works for Microsoft China at the Shanghai site. His main focus is on Kubernetes storage drivers. Andy started contributing to Kubernetes about 5 years ago.

He states that as he is working in Azure Kubernetes Service team and spends most of his time contributing to the Kubernetes community project. Now he is the main contributor of quite a lot Kubernetes subprojects such as Kubernetes cloud provider code.

His open source contributions are mainly self-motivated. In the last two years he has mentored a few students contributing to Kubernetes through the LFX Mentorship program, some of whom got jobs due to their expertise and contributions on Kubernetes projects.

Andy is an active member of the China Kubernetes community. He adds that the Kubernetes community has a good guide about how to become members, code reviewers, approvers and finally when he found out that some open source projects are in the very early stage, he actively contributed to those projects and became the project maintainer.

Shiming Zhang

Shiming Zhang is a Software Engineer working on Kubernetes for DaoCloud in Shanghai, China.

He has mostly been involved with SIG Node as a reviewer. His major contributions have mainly been bug fixes and feature improvements in an ongoing KEP, all revolving around SIG Node.

Some of his major PRs are fixing watchForLockfileContention memory leak, fixing startupProbe behaviour, adding Field status.hostIPs for Pod.

Paco Xu

Paco Xu works at DaoCloud, a Shanghai-based cloud-native firm. He works with the infra and the open source team, focusing on enterprise cloud native platforms based on Kubernetes.

He started with Kubernetes in early 2017 and his first contribution was in March 2018. He started with a bug that he found, but his solution was not that graceful, hence wasn't accepted. He then started with some good first issues, which helped him to a great extent. In addition to this, from 2016 to 2017, he made some minor contributions to Docker.

Currently, Paco is a reviewer for kubeadm (a SIG Cluster Lifecycle product), and for SIG Node.

Paco says that you should contribute to open source projects you use. For him, an open source project is like a book to learn, getting inspired through discussions with the project maintainers.

In my opinion, the best way for me is learning how owners work on the project.

Jintao Zhang

Jintao Zhang is presently employed at API7, where he focuses on ingress and service mesh.

In 2017, he encountered an issue which led to a community discussion and his contributions to Kubernetes started. Before contributing to Kubernetes, Jintao was a long-time contributor to Docker-related open source projects.

Currently Jintao is a maintainer for the ingress-nginx project.

He suggests keeping track of job opportunities at open source companies so that you can find one that allows you to contribute full time. For new contributors Jintao says that if anyone wants to make a significant contribution to an open source project, then they should choose the project based on their interests and should generously invest time.


If you have any recommendations/suggestions for who we should interview next, please let us know in the #sig-contribex channel channel on the Kubernetes Slack. Your suggestions would be much appreciated. We're thrilled to have additional folks assisting us in reaching out to even more wonderful individuals of the community.

We'll see you all in the next one. Everyone, till then, have a happy contributing! 👋

Enhancing Kubernetes one KEP at a Time

Did you know that Kubernetes v1.24 has 46 enhancements? That's a lot of new functionality packed into a 4-month release cycle. The Kubernetes release team coordinates the logistics of the release, from remediating test flakes to publishing updated docs. It's a ton of work, but they always deliver.

The release team comprises around 30 people across six subteams - Bug Triage, CI Signal, Enhancements, Release Notes, Communications, and Docs.  Each of these subteams manages a component of the release. This post will focus on the role of the enhancements subteam and how you can get involved.

What's the enhancements subteam?

Great question. We'll get to that in a second but first, let's talk about how features are managed in Kubernetes.

Each new feature requires a Kubernetes Enhancement Proposal - KEP for short. KEPs are small structured design documents that provide a way to propose and coordinate new features. The KEP author describes the motivation, design (and alternatives), risks, and tests - then community members provide feedback to build consensus.

KEPs are submitted and updated through a pull request (PR) workflow on the k/enhancements repo. Features start in alpha and move through a graduation process to beta and stable as they mature. For example, here's a cool KEP about privileged container support on Windows Server.  It was introduced as alpha in Kubernetes v1.22 and graduated to beta in v1.23.

Now getting back to the question - the enhancements subteam coordinates the lifecycle tracking of the KEPs for each release. Each KEP is required to meet a set of requirements to be cleared for inclusion in a release. The enhancements subteam verifies each requirement for each KEP and tracks the status.

At the start of a release, Kubernetes Special Interest Groups (SIGs) submit their enhancements to opt into a release. A typical release might have from 60 to 90 enhancements at the beginning.  During the release, many enhancements will drop out. Some do not quite meet the KEP requirements, and others do not complete their implementation in code. About 60%-70% of the opted-in KEPs will make it into the final release.

What does the enhancements subteam do?

Another great question, keep them coming! The enhancements team is involved in two crucial milestones during each release: enhancements freeze and code freeze.

Enhancements Freeze

Enhancements freeze is the deadline for a KEP to be complete in order for the enhancement to be included in a release. It's a quality gate to enforce alignment around maintaining and updating KEPs. The most notable requirements are a (1) production readiness review (PRR) and a (2) KEP file with a complete test plan and graduation criteria.

The enhancements subteam communicates to each KEP author through comments on the KEP issue on Github. As a first step, they'll verify the status and check if it meets the requirements.  The KEP gets marked as tracked after satisfying the requirements; otherwise, it's considered at risk. If a KEP is still at risk when enhancement freeze is in effect, the KEP is removed from the release.

This part of the cycle is typically the busiest for the enhancements subteam because of the large number of KEPs to groom, and each KEP might need to be visited multiple times to verify whether it meets requirements.

Code Freeze

Code freeze is the implementation deadline for all enhancements. The code must be implemented, reviewed, and merged by this point if a code change or update is needed for the enhancement. The latter third of the release is focused on stabilizing the codebase - fixing flaky tests, resolving various regressions, and preparing docs - and all the code needs to be in place before those steps can happen.

The enhancements subteam verifies that all PRs for an enhancement are merged into the Kubernetes codebase (k/k). During this period, the subteam reaches out to KEP authors to understand what PRs are part of the KEP, verifies that those PRs get merged, and then updates the status of the KEP. The enhancement is removed from the release if the code isn't all merged before the code freeze deadline.

How can I get involved with the release team?

I'm glad you asked. The most direct way is to apply to be a release team shadow. The shadow role is a hands-on apprenticeship intended to prepare individuals for leadership positions on the release team. Many shadow roles are non-technical and do not require prior contributions to the Kubernetes codebase.

With 3 Kubernetes releases every year and roughly 25 shadows per release, the release team is always in need of individuals wanting to contribute. Before each release cycle, the release team opens the application for the shadow program. When the application goes live, it's posted in the Kubernetes Dev Mailing List.  You can subscribe to notifications from that list (or check it regularly!) to watch when the application opens. The announcement will typically go out in mid-April, mid-July, and mid-December - or roughly a month before the start of each release.

How can I find out more?

Check out the role handbooks if you're curious about the specifics of all the Kubernetes release subteams. The handbooks capture the logistics of each subteam, including a week-by-week breakdown of the subteam activities.  It's an excellent reference for getting to know each team better.

You can also check out the release-related Kubernetes slack channels - particularly #release, #sig-release, and #sig-arch. These channels have discussions and updates surrounding many aspects of the release.

Kubernetes Removals and Major Changes In 1.25

As Kubernetes grows and matures, features may be deprecated, removed, or replaced with improvements for the health of the project. Kubernetes v1.25 includes several major changes and one major removal.

The Kubernetes API Removal and Deprecation process

The Kubernetes project has a well-documented deprecation policy for features. This policy states that stable APIs may only be deprecated when a newer, stable version of that same API is available and that APIs have a minimum lifetime for each stability level. A deprecated API is one that has been marked for removal in a future Kubernetes release; it will continue to function until removal (at least one year from the deprecation), but usage will result in a warning being displayed. Removed APIs are no longer available in the current version, at which point you must migrate to using the replacement.

  • Generally available (GA) or stable API versions may be marked as deprecated but must not be removed within a major version of Kubernetes.
  • Beta or pre-release API versions must be supported for 3 releases after deprecation.
  • Alpha or experimental API versions may be removed in any release without prior deprecation notice.

Whether an API is removed as a result of a feature graduating from beta to stable or because that API simply did not succeed, all removals comply with this deprecation policy. Whenever an API is removed, migration options are communicated in the documentation.

A note about PodSecurityPolicy

In Kubernetes v1.25, we will be removing PodSecurityPolicy after its deprecation in v1.21. PodSecurityPolicy has served us honorably, but its complex and often confusing usage necessitated changes, which unfortunately would have been breaking changes. To address this, it is being removed in favor of a replacement, Pod Security Admission, which is graduating to stable in this release as well. If you are currently relying on PodSecurityPolicy, follow the instructions for migration to Pod Security Admission.

Major Changes for Kubernetes v1.25

Kubernetes v1.25 will include several major changes, in addition to the removal of PodSecurityPolicy.

CSI Migration

The effort to move the in-tree volume plugins to out-of-tree CSI drivers continues, with the core CSI Migration feature going GA in v1.25. This is an important step towards removing the in-tree volume plugins entirely.

Deprecations and removals for storage drivers

Several volume plugins are being deprecated or removed.

GlusterFS will be deprecated in v1.25. While a CSI driver was built for it, it has not been maintained. The possibility of migration to a compatible CSI driver was discussed, but a decision was ultimately made to begin the deprecation of the GlusterFS plugin from in-tree drivers. The Portworx in-tree volume plugin is also being deprecated with this release. The Flocker, Quobyte, and StorageOS in-tree volume plugins are being removed.

Flocker, Quobyte, and StorageOS in-tree volume plugins will be removed in v1.25. Users of these plugins need to switch to an equivalent CSI driver or an alternate storage provider.

Change to vSphere version support

From Kubernetes v1.25, the in-tree vSphere volume driver will not support any vSphere release before 7.0u2. Once Kubernetes v1.25 is released, check the v1.25 detailed release notes for more advice on how to handle this.

Cleaning up IPTables Chain Ownership

On Linux, Kubernetes (usually) creates iptables chains to ensure that network packets reach Although these chains and their names have been an internal implementation detail, some tooling has relied upon that behavior. will only support for internal Kubernetes use cases. Starting with v1.25, the Kubelet will gradually move towards not creating the following iptables chains in the nat table:

  • KUBE-MARK-DROP
  • KUBE-MARK-MASQ
  • KUBE-POSTROUTING

This change will be phased in via the IPTablesCleanup feature gate. Although this is not formally a deprecation, some end users have come to rely on specific internal behavior of kube-proxy. The Kubernetes project overall wants to make it clear that depending on these internal details is not supported, and that future implementations will change their behavior here.

Looking ahead

The official list of API removals planned for Kubernetes 1.26 is:

  • The beta FlowSchema and PriorityLevelConfiguration APIs (flowcontrol.apiserver.k8s.io/v1beta1)
  • The beta HorizontalPodAutoscaler API (autoscaling/v2beta2)

Want to know more?

Deprecations are announced in the Kubernetes release notes. You can see the announcements of pending deprecations in the release notes for:

For information on the process of deprecation and removal, check out the official Kubernetes deprecation policy document.

Spotlight on SIG Docs

Introduction

The official documentation is the go-to source for any open source project. For Kubernetes, it's an ever-evolving Special Interest Group (SIG) with people constantly putting in their efforts to make details about the project easier to consume for new contributors and users. SIG Docs publishes the official documentation on kubernetes.io which includes, but is not limited to, documentation of the core APIs, core architectural details, and CLI tools shipped with the Kubernetes release.

To learn more about the work of SIG Docs and its future ahead in shaping the community, I have summarised my conversation with the co-chairs, Divya Mohan (DM), Rey Lejano (RL) and Natali Vlatko (NV), who ran through the SIG's goals and how fellow contributors can help.

A summary of the conversation

Could you tell us a little bit about what SIG Docs does?

SIG Docs is the special interest group for documentation for the Kubernetes project on kubernetes.io, generating reference guides for the Kubernetes API, kubeadm and kubectl as well as maintaining the official website’s infrastructure and analytics. The remit of their work also extends to docs releases, translation of docs, improvement and adding new features to existing documentation, pushing and reviewing content for the official Kubernetes blog and engaging with the Release Team for each cycle to get docs and blogs reviewed.

There are 2 subprojects under Docs: blogs and localization. How has the community benefited from it and are there some interesting contributions by those teams you want to highlight?

Blogs: This subproject highlights new or graduated Kubernetes enhancements, community reports, SIG updates or any relevant news to the Kubernetes community such as thought leadership, tutorials and project updates, such as the Dockershim removal and removal of PodSecurityPolicy, which is upcoming in the 1.25 release. Tim Bannister, one of the SIG Docs tech leads, does awesome work and is a major force when pushing contributions through to the docs and blogs.

Localization: With this subproject, the Kubernetes community has been able to achieve greater inclusivity and diversity among both users and contributors. This has also helped the project gain more contributors, especially students, since a couple of years ago. One of the major highlights and up-and-coming localizations are Hindi and Bengali. The efforts for Hindi localization are currently being spearheaded by students in India.

In addition to that, there are two other subprojects: reference-docs and the website, which is built with Hugo and is an important ownership area.

Recently there has been a lot of buzz around the Kubernetes ecosystem as well as the industry regarding the removal of dockershim in the latest 1.24 release. How has SIG Docs helped the project to ensure a smooth change among the end-users?

Documenting the removal of Dockershim was a mammoth task, requiring the revamping of existing documentation and communicating to the various stakeholders regarding the deprecation efforts. It needed a community effort, so ahead of the 1.24 release, SIG Docs partnered with Docs and Comms verticals, the Release Lead from the Release Team, and also the CNCF to help put the word out. Weekly meetings and a GitHub project board were set up to track progress, review issues and approve PRs and keep the Kubernetes website updated. This has also helped new contributors know about the depreciation, so that if any good-first-issue pops up, they could chip in. A dedicated Slack channel was used to communicate meeting updates, invite feedback or to solicit help on outstanding issues and PRs. The weekly meeting also continued for a month after the 1.24 release to review related issues and fix them. A huge shoutout to Celeste Horgan, who kept the ball rolling on this conversation throughout the deprecation process.

Why should new and existing contributors consider joining this SIG?

Kubernetes is a vast project and can be intimidating at first for a lot of folks to find a place to start. Any open source project is defined by its quality of documentation and SIG Docs aims to be a welcoming, helpful place for new contributors to get onboard. One gets the perks of working with the project docs as well as learning by reading it. They can also bring their own, new perspective to create and improve the documentation. In the long run if they stick to SIG Docs, they can rise up the ladder to be maintainers. This will help make a big project like Kubernetes easier to parse and navigate.

How do you help new contributors get started? Are there any prerequisites to join?

There are no such prerequisites to get started with contributing to Docs. But there is certainly a fantastic Contribution to Docs guide which is always kept as updated and relevant as possible and new contributors are urged to read it and keep it handy. Also, there are a lot of useful pins and bookmarks in the community Slack channel #sig-docs. GitHub issues with the good-first-issue labels in the kubernetes/website repo is a great place to create your first PR. Now, SIG Docs has a monthly New Contributor Meet and Greet on the first Tuesday of the month with the first occupant of the New Contributor Ambassador role, Arsh Sharma. This has helped in making a more accessible point of contact within the SIG for new contributors.

DM & RL : The formalization of the localization subproject in the last few months has been a big win for SIG Docs, given all the great work put in by contributors from different countries. Earlier the localization efforts didn’t have any streamlined process and focus was given to provide a structure by drafting a KEP over the past couple of months for localization to be formalized as a subproject, which is planned to be pushed through by the end of third quarter.

DM : Another area where there has been a lot of success is the New Contributor Ambassador role, which has helped in making a more accessible point of contact for the onboarding of new contributors into the project.

NV : For each release cycle, SIG Docs have to review release docs and feature blogs highlighting release updates within a short window. This is always a big effort for the docs and blogs reviewers.

Is there something exciting coming up for the future of SIG Docs that you want the community to know?

SIG Docs is now looking forward to establishing a roadmap, having a steady pipeline of folks being able to push improvements to the documentation and streamlining community involvement in triaging issues and reviewing PRs being filed. To build one such contributor and reviewership base, a mentorship program is being set up to help current contributors become reviewers. This definitely is a space to watch out for more!

Wrap Up

SIG Docs hosted a deep dive talk during on KubeCon + CloudNativeCon North America 2021, covering their awesome SIG. They are very welcoming and have been the starting ground into Kubernetes for a lot of new folks who want to contribute to the project. Join the SIG's meetings to find out about the most recent research results, their plans for the forthcoming year, and how to get involved in the upstream Docs team as a contributor!

Kubernetes Gateway API Graduates to Beta

We are excited to announce the v0.5.0 release of Gateway API. For the first time, several of our most important Gateway API resources are graduating to beta. Additionally, we are starting a new initiative to explore how Gateway API can be used for mesh and introducing new experimental concepts such as URL rewrites. We'll cover all of this and more below.

What is Gateway API?

Gateway API is a collection of resources centered around Gateway resources (which represent the underlying network gateways / proxy servers) to enable robust Kubernetes service networking through expressive, extensible and role-oriented interfaces that are implemented by many vendors and have broad industry support.

Originally conceived as a successor to the well known Ingress API, the benefits of Gateway API include (but are not limited to) explicit support for many commonly used networking protocols (e.g. HTTP, TLS, TCP, UDP) as well as tightly integrated support for Transport Layer Security (TLS). The Gateway resource in particular enables implementations to manage the lifecycle of network gateways as a Kubernetes API.

If you're an end-user interested in some of the benefits of Gateway API we invite you to jump in and find an implementation that suits you. At the time of this release there are over a dozen implementations for popular API gateways and service meshes and guides are available to start exploring quickly.

Getting started

Gateway API is an official Kubernetes API like Ingress. Gateway API represents a superset of Ingress functionality, enabling more advanced concepts. Similar to Ingress, there is no default implementation of Gateway API built into Kubernetes. Instead, there are many different implementations available, providing significant choice in terms of underlying technologies while providing a consistent and portable experience.

Take a look at the API concepts documentation and check out some of the Guides to start familiarizing yourself with the APIs and how they work. When you're ready for a practical application open the implementations page and select an implementation that belongs to an existing technology you may already be familiar with or the one your cluster provider uses as a default (if applicable). Gateway API is a Custom Resource Definition (CRD) based API so you'll need to install the CRDs onto a cluster to use the API.

If you're specifically interested in helping to contribute to Gateway API, we would love to have you! Please feel free to open a new issue on the repository, or join in the discussions. Also check out the community page which includes links to the Slack channel and community meetings.

Release highlights

Graduation to beta

The v0.5.0 release is particularly historic because it marks the growth in maturity to a beta API version (v1beta1) release for some of the key APIs:

This achievement was marked by the completion of several graduation criteria:

  • API has been widely implemented.
  • Conformance tests provide basic coverage for all resources and have multiple implementations passing tests.
  • Most of the API surface is actively being used.
  • Kubernetes SIG Network API reviewers have approved graduation to beta.

For more information on Gateway API versioning, refer to the official documentation. To see what's in store for future releases check out the next steps section.

Release channels

This release introduces the experimental and standard release channels which enable a better balance of maintaining stability while still enabling experimentation and iterative development.

The standard release channel includes:

  • resources that have graduated to beta
  • fields that have graduated to standard (no longer considered experimental)

The experimental release channel includes everything in the standard release channel, plus:

  • alpha API resources
  • fields that are considered experimental and have not graduated to standard channel

Release channels are used internally to enable iterative development with quick turnaround, and externally to indicate feature stability to implementors and end-users.

For this release we've added the following experimental features:

Other improvements

For an exhaustive list of changes included in the v0.5.0 release, please see the v0.5.0 release notes.

Gateway API for service mesh: the GAMMA Initiative

Some service mesh projects have already implemented support for the Gateway API. Significant overlap between the Service Mesh Interface (SMI) APIs and the Gateway API has inspired discussion in the SMI community about possible integration.

We are pleased to announce that the service mesh community, including representatives from Cilium Service Mesh, Consul, Istio, Kuma, Linkerd, NGINX Service Mesh and Open Service Mesh, is coming together to form the GAMMA Initiative, a dedicated workstream within the Gateway API subproject focused on Gateway API for Mesh Management and Administration.

This group will deliver enhancement proposals consisting of resources, additions, and modifications to the Gateway API specification for mesh and mesh-adjacent use-cases.

This work has begun with an exploration of using Gateway API for service-to-service traffic and will continue with enhancement in areas such as authentication and authorization policy.

Next steps

As we continue to mature the API for production use cases, here are some of the highlights of what we'll be working on for the next Gateway API releases:

If there's something on this list you want to get involved in, or there's something not on this list that you want to advocate for to get on the roadmap please join us in the #sig-network-gateway-api channel on Kubernetes Slack or our weekly community calls.

Annual Report Summary 2021

Last year, we published our first Annual Report Summary for 2020 and it's already time for our second edition!

2021 Annual Report Summary

This summary reflects the work that has been done in 2021 and the initiatives on deck for the rest of 2022. Please forward to organizations and indidviduals participating in upstream activities, planning cloud native strategies, and/or those looking to help out. To find a specific community group's complete report, go to the kubernetes/community repo under the groups folder. Example: sig-api-machinery/annual-report-2021.md

You’ll see that this report summary is a growth area in itself. It takes us roughly 6 months to prepare and execute, which isn’t helpful or valuable to anyone as a fast moving project with short and long term needs. How can we make this better? Provide your feedback here: https://github.com/kubernetes/steering/issues/242

Reference: Annual Report Documentation

Kubernetes 1.24: Maximum Unavailable Replicas for StatefulSet

Kubernetes StatefulSets, since their introduction in 1.5 and becoming stable in 1.9, have been widely used to run stateful applications. They provide stable pod identity, persistent per pod storage and ordered graceful deployment, scaling and rolling updates. You can think of StatefulSet as the atomic building block for running complex stateful applications. As the use of Kubernetes has grown, so has the number of scenarios requiring StatefulSets. Many of these scenarios, require faster rolling updates than the currently supported one-pod-at-a-time updates, in the case where you're using the OrderedReady Pod management policy for a StatefulSet.

Here are some examples:

  • I am using a StatefulSet to orchestrate a multi-instance, cache based application where the size of the cache is large. The cache starts cold and requires some significant amount of time before the container can start. There could be more initial startup tasks that are required. A RollingUpdate on this StatefulSet would take a lot of time before the application is fully updated. If the StatefulSet supported updating more than one pod at a time, it would result in a much faster update.

  • My stateful application is composed of leaders and followers or one writer and multiple readers. I have multiple readers or followers and my application can tolerate multiple pods going down at the same time. I want to update this application more than one pod at a time so that i get the new updates rolled out quickly, especially if the number of instances of my application are large. Note that my application still requires unique identity per pod.

In order to support such scenarios, Kubernetes 1.24 includes a new alpha feature to help. Before you can use the new feature you must enable the MaxUnavailableStatefulSet feature flag. Once you enable that, you can specify a new field called maxUnavailable, part of the spec for a StatefulSet. For example:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
  namespace: default
spec:
  podManagementPolicy: OrderedReady  # you must set OrderedReady
  replicas: 5
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      # image changed since publication (previously used registry "k8s.gcr.io")
      - image: registry.k8s.io/nginx-slim:0.8
        imagePullPolicy: IfNotPresent
        name: nginx
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 2 # this is the new alpha field, whose default value is 1
      partition: 0
    type: RollingUpdate

If you enable the new feature and you don't specify a value for maxUnavailable in a StatefulSet, Kubernetes applies a default maxUnavailable: 1. This matches the behavior you would see if you don't enable the new feature.

I'll run through a scenario based on that example manifest to demonstrate how this feature works. I will deploy a StatefulSet that has 5 replicas, with maxUnavailable set to 2 and partition set to 0.

I can trigger a rolling update by changing the image to registry.k8s.io/nginx-slim:0.9. Once I initiate the rolling update, I can watch the pods update 2 at a time as the current value of maxUnavailable is 2. The below output shows a span of time and is not complete. The maxUnavailable can be an absolute number (for example, 2) or a percentage of desired Pods (for example, 10%). The absolute number is calculated from percentage by rounding up to the nearest integer.

kubectl get pods --watch 
NAME    READY   STATUS    RESTARTS   AGE
web-0   1/1     Running   0          85s
web-1   1/1     Running   0          2m6s
web-2   1/1     Running   0          106s
web-3   1/1     Running   0          2m47s
web-4   1/1     Running   0          2m27s
web-4   1/1     Terminating   0          5m43s ----> start terminating 4
web-3   1/1     Terminating   0          6m3s  ----> start terminating 3
web-3   0/1     Terminating   0          6m7s
web-3   0/1     Pending       0          0s
web-3   0/1     Pending       0          0s
web-4   0/1     Terminating   0          5m48s
web-4   0/1     Terminating   0          5m48s
web-3   0/1     ContainerCreating   0          2s
web-3   1/1     Running             0          2s
web-4   0/1     Pending             0          0s
web-4   0/1     Pending             0          0s
web-4   0/1     ContainerCreating   0          0s
web-4   1/1     Running             0          1s
web-2   1/1     Terminating         0          5m46s ----> start terminating 2 (only after both 4 and 3 are running)
web-1   1/1     Terminating         0          6m6s  ----> start terminating 1
web-2   0/1     Terminating         0          5m47s
web-1   0/1     Terminating         0          6m7s
web-1   0/1     Pending             0          0s
web-1   0/1     Pending             0          0s
web-1   0/1     ContainerCreating   0          1s
web-1   1/1     Running             0          2s
web-2   0/1     Pending             0          0s
web-2   0/1     Pending             0          0s
web-2   0/1     ContainerCreating   0          0s
web-2   1/1     Running             0          1s
web-0   1/1     Terminating         0          6m6s ----> start terminating 0 (only after 2 and 1 are running)
web-0   0/1     Terminating         0          6m7s
web-0   0/1     Pending             0          0s
web-0   0/1     Pending             0          0s
web-0   0/1     ContainerCreating   0          0s
web-0   1/1     Running             0          1s

Note that as soon as the rolling update starts, both 4 and 3 (the two highest ordinal pods) start terminating at the same time. Pods with ordinal 4 and 3 may become ready at their own pace. As soon as both pods 4 and 3 are ready, pods 2 and 1 start terminating at the same time. When pods 2 and 1 are both running and ready, pod 0 starts terminating.

In Kubernetes, updates to StatefulSets follow a strict ordering when updating Pods. In this example, the update starts at replica 4, then replica 3, then replica 2, and so on, one pod at a time. When going one pod at a time, its not possible for 3 to be running and ready before 4. When maxUnavailable is more than 1 (in the example scenario I set maxUnavailable to 2), it is possible that replica 3 becomes ready and running before replica 4 is ready—and that is ok. If you're a developer and you set maxUnavailable to more than 1, you should know that this outcome is possible and you must ensure that your application is able to handle such ordering issues that occur if any. When you set maxUnavailable greater than 1, the ordering is guaranteed in between each batch of pods being updated. That guarantee means that pods in update batch 2 (replicas 2 and 1) cannot start updating until the pods from batch 0 (replicas 4 and 3) are ready.

Although Kubernetes refers to these as replicas, your stateful application may have a different view and each pod of the StatefulSet may be holding completely different data than other pods. The important thing here is that updates to StatefulSets happen in batches, and you can now have a batch size larger than 1 (as an alpha feature).

Also note, that the above behavior is with podManagementPolicy: OrderedReady. If you defined a StatefulSet as podManagementPolicy: Parallel, not only maxUnavailable number of replicas are terminated at the same time; maxUnavailable number of replicas start in ContainerCreating phase at the same time as well. This is called bursting.

So, now you may have a lot of questions about:-

  • What is the behavior when you set podManagementPolicy: Parallel?
  • What is the behavior when partition to a value other than 0?

It might be better to try and see it for yourself. This is an alpha feature, and the Kubernetes contributors are looking for feedback on this feature. Did this help you achieve your stateful scenarios Did you find a bug or do you think the behavior as implemented is not intuitive or can break applications or catch them by surprise? Please open an issue to let us know.

Further reading and next steps

Contextual Logging in Kubernetes 1.24

The Structured Logging Working Group has added new capabilities to the logging infrastructure in Kubernetes 1.24. This blog post explains how developers can take advantage of those to make log output more useful and how they can get involved with improving Kubernetes.

Structured logging

The goal of structured logging is to replace C-style formatting and the resulting opaque log strings with log entries that have a well-defined syntax for storing message and parameters separately, for example as a JSON struct.

When using the traditional klog text output format for structured log calls, strings were originally printed with \n escape sequences, except when embedded inside a struct. For structs, log entries could still span multiple lines, with no clean way to split the log stream into individual entries:

I1112 14:06:35.783529  328441 structured_logging.go:51] "using InfoS" longData={Name:long Data:Multiple
lines
with quite a bit
of text. internal:0}
I1112 14:06:35.783549  328441 structured_logging.go:52] "using InfoS with\nthe message across multiple lines" int=1 stringData="long: Multiple\nlines\nwith quite a bit\nof text." str="another value"

Now, the < and > markers along with indentation are used to ensure that splitting at a klog header at the start of a line is reliable and the resulting output is human-readable:

I1126 10:31:50.378204  121736 structured_logging.go:59] "using InfoS" longData=<
	{Name:long Data:Multiple
	lines
	with quite a bit
	of text. internal:0}
 >
I1126 10:31:50.378228  121736 structured_logging.go:60] "using InfoS with\nthe message across multiple lines" int=1 stringData=<
	long: Multiple
	lines
	with quite a bit
	of text.
 > str="another value"

Note that the log message itself is printed with quoting. It is meant to be a fixed string that identifies a log entry, so newlines should be avoided there.

Before Kubernetes 1.24, some log calls in kube-scheduler still used klog.Info for multi-line strings to avoid the unreadable output. Now all log calls have been updated to support structured logging.

Contextual logging

Contextual logging is based on the go-logr API. The key idea is that libraries are passed a logger instance by their caller and use that for logging instead of accessing a global logger. The binary decides about the logging implementation, not the libraries. The go-logr API is designed around structured logging and supports attaching additional information to a logger.

This enables additional use cases:

  • The caller can attach additional information to a logger:

    When passing this extended logger into a function and a function uses it instead of the global logger, the additional information is then included in all log entries, without having to modify the code that generates the log entries. This is useful in highly parallel applications where it can become hard to identify all log entries for a certain operation because the output from different operations gets interleaved.

  • When running unit tests, log output can be associated with the current test. Then when a test fails, only the log output of the failed test gets shown by go test. That output can also be more verbose by default because it will not get shown for successful tests. Tests can be run in parallel without interleaving their output.

One of the design decisions for contextual logging was to allow attaching a logger as value to a context.Context. Since the logger encapsulates all aspects of the intended logging for the call, it is part of the context and not just using it. A practical advantage is that many APIs already have a ctx parameter or adding one has additional advantages, like being able to get rid of context.TODO() calls inside the functions.

Another decision was to not break compatibility with klog v2:

  • Libraries that use the traditional klog logging calls in a binary that has set up contextual logging will work and log through the logging backend chosen by the binary. However, such log output will not include the additional information and will not work well in unit tests, so libraries should be modified to support contextual logging. The migration guide for structured logging has been extended to also cover contextual logging.

  • When a library supports contextual logging and retrieves a logger from its context, it will still work in a binary that does not initialize contextual logging because it will get a logger that logs through klog.

In Kubernetes 1.24, contextual logging is a new alpha feature with ContextualLogging as feature gate. When disabled (the default), the new klog API calls for contextual logging (see below) become no-ops to avoid performance or functional regressions.

No Kubernetes component has been converted yet. An example program in the Kubernetes repository demonstrates how to enable contextual logging in a binary and how the output depends on the binary's parameters:

$ cd $GOPATH/src/k8s.io/kubernetes/staging/src/k8s.io/component-base/logs/example/cmd/
$ go run . --help
...
      --feature-gates mapStringBool  A set of key=value pairs that describe feature gates for alpha/experimental features. Options are:
                                     AllAlpha=true|false (ALPHA - default=false)
                                     AllBeta=true|false (BETA - default=false)
                                     ContextualLogging=true|false (ALPHA - default=false)
$ go run . --feature-gates ContextualLogging=true
...
I0404 18:00:02.916429  451895 logger.go:94] "example/myname: runtime" foo="bar" duration="1m0s"
I0404 18:00:02.916447  451895 logger.go:95] "example: another runtime" foo="bar" duration="1m0s"

The example prefix and foo="bar" were added by the caller of the function which logs the runtime message and duration="1m0s" value.

The sample code for klog includes an example for a unit test with per-test output.

klog enhancements

Contextual logging API

The following calls manage the lookup of a logger:

FromContext
from a context parameter, with fallback to the global logger
Background
the global fallback, with no intention to support contextual logging
TODO
the global fallback, but only as a temporary solution until the function gets extended to accept a logger through its parameters
SetLoggerWithOptions
changes the fallback logger; when called with ContextualLogger(true), the logger is ready to be called directly, in which case logging will be done without going through klog

To support the feature gate mechanism in Kubernetes, klog has wrapper calls for the corresponding go-logr calls and a global boolean controlling their behavior:

Usage of those functions in Kubernetes code is enforced with a linter check. The klog default for contextual logging is to enable the functionality because it is considered stable in klog. It is only in Kubernetes binaries where that default gets overridden and (in some binaries) controlled via the --feature-gate parameter.

ktesting logger

The new ktesting package implements logging through testing.T using klog's text output format. It has a single API call for instrumenting a test case and support for command line flags.

klogr

klog/klogr continues to be supported and it's default behavior is unchanged: it formats structured log entries using its own, custom format and prints the result via klog.

However, this usage is discouraged because that format is neither machine-readable (in contrast to real JSON output as produced by zapr, the go-logr implementation used by Kubernetes) nor human-friendly (in contrast to the klog text format).

Instead, a klogr instance should be created with WithFormat(FormatKlog) which chooses the klog text format. A simpler construction method with the same result is the new klog.NewKlogr. That is the logger that klog returns as fallback when nothing else is configured.

Reusable output test

A lot of go-logr implementations have very similar unit tests where they check the result of certain log calls. If a developer didn't know about certain caveats like for example a String function that panics when called, then it is likely that both the handling of such caveats and the unit test are missing.

klog.test is a reusable set of test cases that can be applied to a go-logr implementation.

Output flushing

klog used to start a goroutine unconditionally during init which flushed buffered data at a hard-coded interval. Now that goroutine is only started on demand (i.e. when writing to files with buffering) and can be controlled with StopFlushDaemon and StartFlushDaemon.

When a go-logr implementation buffers data, flushing that data can be integrated into klog.Flush by registering the logger with the FlushLogger option.

Various other changes

For a description of all other enhancements see in the release notes.

logcheck

Originally designed as a linter for structured log calls, the logcheck tool has been enhanced to support also contextual logging and traditional klog log calls. These enhanced checks already found bugs in Kubernetes, like calling klog.Info instead of klog.Infof with a format string and parameters.

It can be included as a plugin in a golangci-lint invocation, which is how Kubernetes uses it now, or get invoked stand-alone.

We are in the process of moving the tool into a new repository because it isn't really related to klog and its releases should be tracked and tagged properly.

Next steps

The Structured Logging WG is always looking for new contributors. The migration away from C-style logging is now going to target structured, contextual logging in one step to reduce the overall code churn and number of PRs. Changing log calls is good first contribution to Kubernetes and an opportunity to get to know code in various different areas.

Kubernetes 1.24: Avoid Collisions Assigning IP Addresses to Services

In Kubernetes, Services are an abstract way to expose an application running on a set of Pods. Services can have a cluster-scoped virtual IP address (using a Service of type: ClusterIP). Clients can connect using that virtual IP address, and Kubernetes then load-balances traffic to that Service across the different backing Pods.

How Service ClusterIPs are allocated?

A Service ClusterIP can be assigned:

dynamically
the cluster's control plane automatically picks a free IP address from within the configured IP range for type: ClusterIP Services.
statically
you specify an IP address of your choice, from within the configured IP range for Services.

Across your whole cluster, every Service ClusterIP must be unique. Trying to create a Service with a specific ClusterIP that has already been allocated will return an error.

Why do you need to reserve Service Cluster IPs?

Sometimes you may want to have Services running in well-known IP addresses, so other components and users in the cluster can use them.

The best example is the DNS Service for the cluster. Some Kubernetes installers assign the 10th address from the Service IP range to the DNS service. Assuming you configured your cluster with Service IP range 10.96.0.0/16 and you want your DNS Service IP to be 10.96.0.10, you'd have to create a Service like this:

apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: CoreDNS
  name: kube-dns
  namespace: kube-system
spec:
  clusterIP: 10.96.0.10
  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: 53
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: 53
  selector:
    k8s-app: kube-dns
  type: ClusterIP

but as I explained before, the IP address 10.96.0.10 has not been reserved; if other Services are created before or in parallel with dynamic allocation, there is a chance they can allocate this IP, hence, you will not be able to create the DNS Service because it will fail with a conflict error.

How can you avoid Service ClusterIP conflicts?

In Kubernetes 1.24, you can enable a new feature gate ServiceIPStaticSubrange. Turning this on allows you to use a different IP allocation strategy for Services, reducing the risk of collision.

The ClusterIP range will be divided, based on the formula min(max(16, cidrSize / 16), 256), described as never less than 16 or more than 256 with a graduated step between them.

Dynamic IP assignment will use the upper band by default, once this has been exhausted it will use the lower range. This will allow users to use static allocations on the lower band with a low risk of collision.

Examples:

Service IP CIDR block: 10.96.0.0/24

Range Size: 28 - 2 = 254
Band Offset: min(max(16, 256/16), 256) = min(16, 256) = 16
Static band start: 10.96.0.1
Static band end: 10.96.0.16
Range end: 10.96.0.254

pie showData title 10.96.0.0/24 "Static" : 16 "Dynamic" : 238

Service IP CIDR block: 10.96.0.0/20

Range Size: 212 - 2 = 4094
Band Offset: min(max(16, 4096/16), 256) = min(256, 256) = 256
Static band start: 10.96.0.1
Static band end: 10.96.1.0
Range end: 10.96.15.254

pie showData title 10.96.0.0/20 "Static" : 256 "Dynamic" : 3838

Service IP CIDR block: 10.96.0.0/16

Range Size: 216 - 2 = 65534
Band Offset: min(max(16, 65536/16), 256) = min(4096, 256) = 256
Static band start: 10.96.0.1
Static band ends: 10.96.1.0
Range end: 10.96.255.254

pie showData title 10.96.0.0/16 "Static" : 256 "Dynamic" : 65278

Get involved with SIG Network

The current SIG-Network KEPs and issues on GitHub illustrate the SIG’s areas of emphasis.

SIG Network meetings are a friendly, welcoming venue for you to connect with the community and share your ideas. Looking forward to hearing from you!

Kubernetes 1.24: Introducing Non-Graceful Node Shutdown Alpha

Kubernetes v1.24 introduces alpha support for Non-Graceful Node Shutdown. This feature allows stateful workloads to failover to a different node after the original node is shutdown or in a non-recoverable state such as hardware failure or broken OS.

How is this different from Graceful Node Shutdown

You might have heard about the Graceful Node Shutdown capability of Kubernetes, and are wondering how the Non-Graceful Node Shutdown feature is different from that. Graceful Node Shutdown allows Kubernetes to detect when a node is shutting down cleanly, and handles that situation appropriately. A Node Shutdown can be "graceful" only if the node shutdown action can be detected by the kubelet ahead of the actual shutdown. However, there are cases where a node shutdown action may not be detected by the kubelet. This could happen either because the shutdown command does not trigger the systemd inhibitor locks mechanism that kubelet relies upon, or because of a configuration error (the ShutdownGracePeriod and ShutdownGracePeriodCriticalPods are not configured properly).

Graceful node shutdown relies on Linux-specific support. The kubelet does not watch for upcoming shutdowns on Windows nodes (this may change in a future Kubernetes release).

When a node is shutdown but without the kubelet detecting it, pods on that node also shut down ungracefully. For stateless apps, that's often not a problem (a ReplicaSet adds a new pod once the cluster detects that the affected node or pod has failed). For stateful apps, the story is more complicated. If you use a StatefulSet and have a pod from that StatefulSet on a node that fails uncleanly, that affected pod will be marked as terminating; the StatefulSet cannot create a replacement pod because the pod still exists in the cluster. As a result, the application running on the StatefulSet may be degraded or even offline. If the original, shut down node comes up again, the kubelet on that original node reports in, deletes the existing pods, and the control plane makes a replacement pod for that StatefulSet on a different running node. If the original node has failed and does not come up, those stateful pods would be stuck in a terminating status on that failed node indefinitely.

$ kubectl get pod -o wide
NAME    READY   STATUS        RESTARTS   AGE   IP           NODE                      NOMINATED NODE   READINESS GATES
web-0   1/1     Running       0          100m   10.244.2.4   k8s-node-876-1639279816   <none>           <none>
web-1   1/1     Terminating   0          100m   10.244.1.3   k8s-node-433-1639279804   <none>           <none>

Try out the new non-graceful shutdown handling

To use the non-graceful node shutdown handling, you must enable the NodeOutOfServiceVolumeDetach feature gate for the kube-controller-manager component.

In the case of a node shutdown, you can manually taint that node as out of service. You should make certain that the node is truly shutdown (not in the middle of restarting) before you add that taint. You could add that taint following a shutdown that the kubelet did not detect and handle in advance; another case where you can use that taint is when the node is in a non-recoverable state due to a hardware failure or a broken OS. The values you set for that taint can be node.kubernetes.io/out-of-service=nodeshutdown: "NoExecute" or node.kubernetes.io/out-of-service=nodeshutdown:" NoSchedule". Provided you have enabled the feature gate mentioned earlier, setting the out-of-service taint on a Node means that pods on the node will be deleted unless if there are matching tolerations on the pods. Persistent volumes attached to the shutdown node will be detached, and for StatefulSets, replacement pods will be created successfully on a different running node.

$ kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

$ kubectl get pod -o wide
NAME    READY   STATUS    RESTARTS   AGE    IP           NODE                      NOMINATED NODE   READINESS GATES
web-0   1/1     Running   0          150m   10.244.2.4   k8s-node-876-1639279816   <none>           <none>
web-1   1/1     Running   0          10m    10.244.1.7   k8s-node-433-1639279804   <none>           <none>

Note: Before applying the out-of-service taint, you must verify that a node is already in shutdown or power off state (not in the middle of restarting), either because the user intentionally shut it down or the node is down due to hardware failures, OS issues, etc.

Once all the workload pods that are linked to the out-of-service node are moved to a new running node, and the shutdown node has been recovered, you should remove that taint on the affected node after the node is recovered. If you know that the node will not return to service, you could instead delete the node from the cluster.

What’s next?

Depending on feedback and adoption, the Kubernetes team plans to push the Non-Graceful Node Shutdown implementation to Beta in either 1.25 or 1.26.

This feature requires a user to manually add a taint to the node to trigger workloads failover and remove the taint after the node is recovered. In the future, we plan to find ways to automatically detect and fence nodes that are shutdown/failed and automatically failover workloads to another node.

How can I learn more?

Check out the documentation for non-graceful node shutdown.

How to get involved?

This feature has a long story. Yassine Tijani (yastij) started the KEP more than two years ago. Xing Yang (xing-yang) continued to drive the effort. There were many discussions among SIG Storage, SIG Node, and API reviewers to nail down the design details. Ashutosh Kumar (sonasingh46) did most of the implementation and brought it to Alpha in Kubernetes 1.24.

We want to thank the following people for their insightful reviews: Tim Hockin (thockin) for his guidance on the design, Jing Xu (jingxu97), Hemant Kumar (gnufied), and Michelle Au (msau42) for reviews from SIG Storage side, and Mrunal Patel (mrunalp), David Porter (bobbypage), Derek Carr (derekwaynecarr), and Danielle Endocrimes (endocrimes) for reviews from SIG Node side.

There are many people who have helped review the design and implementation along the way. We want to thank everyone who has contributed to this effort including the about 30 people who have reviewed the KEP and implementation over the last couple of years.

This feature is a collaboration between SIG Storage and SIG Node. For those interested in getting involved with the design and development of any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). For those interested in getting involved with the design and development of the components that support the controlled interactions between pods and host resources, join the Kubernetes Node SIG.

Kubernetes 1.24: Prevent unauthorised volume mode conversion

Kubernetes v1.24 introduces a new alpha-level feature that prevents unauthorised users from modifying the volume mode of a PersistentVolumeClaim created from an existing VolumeSnapshot in the Kubernetes cluster.

The problem

The Volume Mode determines whether a volume is formatted into a filesystem or presented as a raw block device.

Users can leverage the VolumeSnapshot feature, which has been stable since Kubernetes v1.20, to create a PersistentVolumeClaim (shortened as PVC) from an existing VolumeSnapshot in the Kubernetes cluster. The PVC spec includes a dataSource field, which can point to an existing VolumeSnapshot instance. Visit Create a PersistentVolumeClaim from a Volume Snapshot for more details.

When leveraging the above capability, there is no logic that validates whether the mode of the original volume, whose snapshot was taken, matches the mode of the newly created volume.

This presents a security gap that allows malicious users to potentially exploit an as-yet-unknown vulnerability in the host operating system.

Many popular storage backup vendors convert the volume mode during the course of a backup operation, for efficiency purposes, which prevents Kubernetes from blocking the operation completely and presents a challenge in distinguishing trusted users from malicious ones.

Preventing unauthorised users from converting the volume mode

In this context, an authorised user is one who has access rights to perform Update or Patch operations on VolumeSnapshotContents, which is a cluster-level resource.
It is upto the cluster administrator to provide these rights only to trusted users or applications, like backup vendors.

If the alpha feature is enabled in snapshot-controller, snapshot-validation-webhook and external-provisioner, then unauthorised users will not be allowed to modify the volume mode of a PVC when it is being created from a VolumeSnapshot.

To convert the volume mode, an authorised user must do the following:

  1. Identify the VolumeSnapshot that is to be used as the data source for a newly created PVC in the given namespace.
  2. Identify the VolumeSnapshotContent bound to the above VolumeSnapshot.
kubectl get volumesnapshot -n <namespace>
  1. Add the annotation snapshot.storage.kubernetes.io/allowVolumeModeChange to the VolumeSnapshotContent.

  2. This annotation can be added either via software or manually by the authorised user. The VolumeSnapshotContent annotation must look like following manifest fragment:

kind: VolumeSnapshotContent
metadata:
  annotations:
    - snapshot.storage.kubernetes.io/allowVolumeModeChange: "true"
...

Note: For pre-provisioned VolumeSnapshotContents, you must take an extra step of setting spec.sourceVolumeMode field to either Filesystem or Block, depending on the mode of the volume from which this snapshot was taken.

An example is shown below:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  annotations:
  - snapshot.storage.kubernetes.io/allowVolumeModeChange: "true"
  name: new-snapshot-content-test
spec:
  deletionPolicy: Delete
  driver: hostpath.csi.k8s.io
  source:
    snapshotHandle: 7bdd0de3-aaeb-11e8-9aae-0242ac110002
  sourceVolumeMode: Filesystem
  volumeSnapshotRef:
    name: new-snapshot-test
    namespace: default

Repeat steps 1 to 3 for all VolumeSnapshotContents whose volume mode needs to be converted during a backup or restore operation.

If the annotation shown in step 4 above is present on a VolumeSnapshotContent object, Kubernetes will not prevent the volume mode from being converted. Users should keep this in mind before they attempt to add the annotation to any VolumeSnapshotContent.

What's next

Enable this feature and let us know what you think!

We hope this feature causes no disruption to existing workflows while preventing malicious users from exploiting security vulnerabilities in their clusters.

For any queries or issues, join Kubernetes on Slack and create a thread in the #sig-storage channel. Alternately, create an issue in the CSI external-snapshotter repository.

Kubernetes 1.24: Volume Populators Graduate to Beta

The volume populators feature is now two releases old and entering beta! The AnyVolumeDataSource feature gate defaults to enabled in Kubernetes v1.24, which means that users can specify any custom resource as the data source of a PVC.

An earlier blog article detailed how the volume populators feature works. In short, a cluster administrator can install a CRD and associated populator controller in the cluster, and any user who can create instances of the CR can create pre-populated volumes by taking advantage of the populator.

Multiple populators can be installed side by side for different purposes. The SIG storage community is already seeing some implementations in public, and more prototypes should appear soon.

Cluster administrations are strongly encouraged to install the volume-data-source-validator controller and associated VolumePopulator CRD before installing any populators so that users can get feedback about invalid PVC data sources.

New Features

The lib-volume-populator library on which populators are built now includes metrics to help operators monitor and detect problems. This library is now beta and latest release is v1.0.1.

The volume data source validator controller also has metrics support added, and is in beta. The VolumePopulator CRD is beta and the latest release is v1.0.1.

Trying it out

To see how this works, you can install the sample "hello" populator and try it out.

First install the volume-data-source-validator controller.

kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/volume-data-source-validator/v1.0.1/client/config/crd/populator.storage.k8s.io_volumepopulators.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/volume-data-source-validator/v1.0.1/deploy/kubernetes/rbac-data-source-validator.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/volume-data-source-validator/v1.0.1/deploy/kubernetes/setup-data-source-validator.yaml

Next install the example populator.

kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/lib-volume-populator/v1.0.1/example/hello-populator/crd.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/lib-volume-populator/87a47467b86052819e9ad13d15036d65b9a32fbb/example/hello-populator/deploy.yaml

Your cluster now has a new CustomResourceDefinition that provides a test API named Hello. Create an instance of the Hello custom resource, with some text:

apiVersion: hello.example.com/v1alpha1
kind: Hello
metadata:
  name: example-hello
spec:
  fileName: example.txt
  fileContents: Hello, world!

Create a PVC that refers to that CR as its data source.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Mi
  dataSourceRef:
    apiGroup: hello.example.com
    kind: Hello
    name: example-hello
  volumeMode: Filesystem

Next, run a Job that reads the file in the PVC.

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  template:
    spec:
      containers:
        - name: example-container
          image: busybox:latest
          command:
            - cat
            - /mnt/example.txt
          volumeMounts:
            - name: vol
              mountPath: /mnt
      restartPolicy: Never
      volumes:
        - name: vol
          persistentVolumeClaim:
            claimName: example-pvc

Wait for the job to complete (including all of its dependencies).

kubectl wait --for=condition=Complete job/example-job

And last examine the log from the job.

kubectl logs job/example-job

The output should be:

Hello, world!

Note that the volume already contained a text file with the string contents from the CR. This is only the simplest example. Actual populators can set up the volume to contain arbitrary contents.

How to write your own volume populator

Developers interested in writing new poplators are encouraged to use the lib-volume-populator library and to only supply a small controller wrapper around the library, and a pod image capable of attaching to volumes and writing the appropriate data to the volume.

Individual populators can be extremely generic such that they work with every type of PVC, or they can do vendor specific things to rapidly fill a volume with data if the volume was provisioned by a specific CSI driver from the same vendor, for example, by communicating directly with the storage for that volume.

How can I learn more?

The enhancement proposal, Volume Populators, includes lots of detail about the history and technical implementation of this feature.

Volume populators and data sources, within the documentation topic about persistent volumes, explains how to use this feature in your cluster.

Please get involved by joining the Kubernetes storage SIG to help us enhance this feature. There are a lot of good ideas already and we'd be thrilled to have more!

Kubernetes 1.24: gRPC container probes in beta

_Update: Since this article was posted, the feature was graduated to GA in v1.27 and doesn't require any feature gates to be enabled.

With Kubernetes 1.24 the gRPC probes functionality entered beta and is available by default. Now you can configure startup, liveness, and readiness probes for your gRPC app without exposing any HTTP endpoint, nor do you need an executable. Kubernetes can natively connect to your workload via gRPC and query its status.

Some history

It's useful to let the system managing your workload check that the app is healthy, has started OK, and whether the app considers itself good to accept traffic. Before the gRPC support was added, Kubernetes already allowed you to check for health based on running an executable from inside the container image, by making an HTTP request, or by checking whether a TCP connection succeeded.

For most apps, those checks are enough. If your app provides a gRPC endpoint for a health (or readiness) check, it is easy to repurpose the exec probe to use it for gRPC health checking. In the blog article Health checking gRPC servers on Kubernetes, Ahmet Alp Balkan described how you can do that — a mechanism that still works today.

There is a commonly used tool to enable this that was created on August 21, 2018, and with the first release at Sep 19, 2018.

This approach for gRPC apps health checking is very popular. There are 3,626 Dockerfiles with the grpc_health_probe and 6,621 yaml files that are discovered with the basic search on GitHub (at the moment of writing). This is a good indication of the tool popularity and the need to support this natively.

Kubernetes v1.23 introduced an alpha-quality implementation of native support for querying a workload status using gRPC. Because it was an alpha feature, this was disabled by default for the v1.23 release.

Using the feature

We built gRPC health checking in similar way with other probes and believe it will be easy to use if you are familiar with other probe types in Kubernetes. The natively supported health probe has many benefits over the workaround involving grpc_health_probe executable.

With the native gRPC support you don't need to download and carry 10MB of an additional executable with your image. Exec probes are generally slower than a gRPC call as they require instantiating a new process to run an executable. It also makes the checks less sensible for edge cases when the pod is running at maximum resources and has troubles instantiating new processes.

There are a few limitations though. Since configuring a client certificate for probes is hard, services that require client authentication are not supported. The built-in probes are also not checking the server certificates and ignore related problems.

Built-in checks also cannot be configured to ignore certain types of errors (grpc_health_probe returns different exit codes for different errors), and cannot be "chained" to run the health check on multiple services in a single probe.

But all these limitations are quite standard for gRPC and there are easy workarounds for those.

Try it for yourself

Cluster-level setup

You can try this feature today. To try native gRPC probes, you can spin up a Kubernetes cluster yourself with the GRPCContainerProbe feature gate enabled, there are many tools available.

Since the feature gate GRPCContainerProbe is enabled by default in 1.24, many vendors will have this functionality working out of the box. So you may just create an 1.24 cluster on platform of your choice. Some vendors allow to enable alpha features on 1.23 clusters.

For example, at the moment of writing, you can spin up the test cluster on GKE for a quick test. Other vendors may also have similar capabilities, especially if you are reading this blog post long after the Kubernetes 1.24 release.

On GKE use the following command (note, version is 1.23 and enable-kubernetes-alpha are specified).

gcloud container clusters create test-grpc \
    --enable-kubernetes-alpha \
    --no-enable-autorepair \
    --no-enable-autoupgrade \
    --release-channel=rapid \
    --cluster-version=1.23

You will also need to configure kubectl to access the cluster:

gcloud container clusters get-credentials test-grpc

Trying the feature out

Let's create the pod to test how gRPC probes work. For this test we will use the agnhost image. This is a k8s maintained image with that can be used for all sorts of workload testing. For example, it has a useful grpc-health-checking module that exposes two ports - one is serving health checking service, another - http port to react on commands make-serving and make-not-serving.

Here is an example pod definition. It starts the grpc-health-checking module, exposes ports 5000 and 8080, and configures gRPC readiness probe:

---
apiVersion: v1
kind: Pod
metadata:
  name: test-grpc
spec:
  containers:
  - name: agnhost
    # image changed since publication (previously used registry "k8s.gcr.io")
    image: registry.k8s.io/e2e-test-images/agnhost:2.35
    command: ["/agnhost", "grpc-health-checking"]
    ports:
    - containerPort: 5000
    - containerPort: 8080
    readinessProbe:
      grpc:
        port: 5000

In the manifest file called test.yaml, you can create the pod and check its status. The pod will be in ready state as indicated by the snippet of the output.

kubectl apply -f test.yaml
kubectl describe test-grpc

The output will contain something like this:

Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True

Now let's change the health checking endpoint status to NOT_SERVING. In order to call the http port of the Pod, let's create a port forward:

kubectl port-forward test-grpc 8080:8080

You can curl to call the command...

curl http://localhost:8080/make-not-serving

... and in a few seconds the port status will switch to not ready.

kubectl describe pod test-grpc

The output now will have:

Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True

...

  Warning  Unhealthy  2s (x6 over 42s)  kubelet            Readiness probe failed: service unhealthy (responded with "NOT_SERVING")

Once it is switched back, in about one second the Pod will get back to ready status:

curl http://localhost:8080/make-serving
kubectl describe test-grpc

The output indicates that the Pod went back to being Ready:

Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True

This new built-in gRPC health probing on Kubernetes makes implementing a health-check via gRPC much easier than the older approach that relied on using a separate exec probe. Read through the official documentation to learn more and provide feedback before the feature will be promoted to GA.

Summary

Kubernetes is a popular workload orchestration platform and we add features based on feedback and demand. Features like gRPC probes support is a minor improvement that will make life of many app developers easier and apps more resilient. Try it today and give feedback, before the feature went into GA.

Kubernetes 1.24: Storage Capacity Tracking Now Generally Available

The v1.24 release of Kubernetes brings storage capacity tracking as a generally available feature.

Problems we have solved

As explained in more detail in the previous blog post about this feature, storage capacity tracking allows a CSI driver to publish information about remaining capacity. The kube-scheduler then uses that information to pick suitable nodes for a Pod when that Pod has volumes that still need to be provisioned.

Without this information, a Pod may get stuck without ever being scheduled onto a suitable node because kube-scheduler has to choose blindly and always ends up picking a node for which the volume cannot be provisioned because the underlying storage system managed by the CSI driver does not have sufficient capacity left.

Because CSI drivers publish storage capacity information that gets used at a later time when it might not be up-to-date anymore, it can still happen that a node is picked that doesn't work out after all. Volume provisioning recovers from that by informing the scheduler that it needs to try again with a different node.

Load tests that were done again for promotion to GA confirmed that all storage in a cluster can be consumed by Pods with storage capacity tracking whereas Pods got stuck without it.

Problems we have not solved

Recovery from a failed volume provisioning attempt has one known limitation: if a Pod uses two volumes and only one of them could be provisioned, then all future scheduling decisions are limited by the already provisioned volume. If that volume is local to a node and the other volume cannot be provisioned there, the Pod is stuck. This problem pre-dates storage capacity tracking and while the additional information makes it less likely to occur, it cannot be avoided in all cases, except of course by only using one volume per Pod.

An idea for solving this was proposed in a KEP draft: volumes that were provisioned and haven't been used yet cannot have any valuable data and therefore could be freed and provisioned again elsewhere. SIG Storage is looking for interested developers who want to continue working on this.

Also not solved is support in Cluster Autoscaler for Pods with volumes. For CSI drivers with storage capacity tracking, a prototype was developed and discussed in a PR. It was meant to work with arbitrary CSI drivers, but that flexibility made it hard to configure and slowed down scale up operations: because autoscaler was unable to simulate volume provisioning, it only scaled the cluster by one node at a time, which was seen as insufficient.

Therefore that PR was not merged and a different approach with tighter coupling between autoscaler and CSI driver will be needed. For this a better understanding is needed about which local storage CSI drivers are used in combination with cluster autoscaling. Should this lead to a new KEP, then users will have to try out an implementation in practice before it can move to beta or GA. So please reach out to SIG Storage if you have an interest in this topic.

Acknowledgements

Thanks a lot to the members of the community who have contributed to this feature or given feedback including members of SIG Scheduling, SIG Autoscaling, and of course SIG Storage!

Kubernetes 1.24: Volume Expansion Now A Stable Feature

Volume expansion was introduced as a alpha feature in Kubernetes 1.8 and it went beta in 1.11 and with Kubernetes 1.24 we are excited to announce general availability(GA) of volume expansion.

This feature allows Kubernetes users to simply edit their PersistentVolumeClaim objects and specify new size in PVC Spec and Kubernetes will automatically expand the volume using storage backend and also expand the underlying file system in-use by the Pod without requiring any downtime at all if possible.

How to use volume expansion

You can trigger expansion for a PersistentVolume by editing the spec field of a PVC, specifying a different (and larger) storage request. For example, given following PVC:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi # specify new size here

You can request expansion of the underlying PersistentVolume by specifying a new value instead of old 1Gi size. Once you've changed the requested size, watch the status.conditions field of the PVC to see if the resize has completed.

When Kubernetes starts expanding the volume - it will add Resizing condition to the PVC, which will be removed once expansion completes. More information about progress of expansion operation can also be obtained by monitoring events associated with the PVC:

kubectl describe pvc <pvc>

Storage driver support

Not every volume type however is expandable by default. Some volume types such as - intree hostpath volumes are not expandable at all. For CSI volumes - the CSI driver must have capability EXPAND_VOLUME in controller or node service (or both if appropriate). Please refer to documentation of your CSI driver, to find out if it supports volume expansion.

Please refer to volume expansion documentation for intree volume types which support volume expansion - Expanding Persistent Volumes.

In general to provide some degree of control over volumes that can be expanded, only dynamically provisioned PVCs whose storage class has allowVolumeExpansion parameter set to true are expandable.

A Kubernetes cluster administrator must edit the appropriate StorageClass object and set the allowVolumeExpansion field to true. For example:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-default
provisioner: kubernetes.io/aws-ebs
parameters:
  secretNamespace: ""
  secretName: ""
allowVolumeExpansion: true

Online expansion compared to offline expansion

By default, Kubernetes attempts to expand volumes immediately after user requests a resize. If one or more Pods are using the volume, Kubernetes tries to expands the volume using an online resize; as a result volume expansion usually requires no application downtime. Filesystem expansion on the node is also performed online and hence does not require shutting down any Pod that was using the PVC.

If you expand a PersistentVolume that is not in use, Kubernetes does an offline resize (and, because the volume isn't in use, there is again no workload disruption).

In some cases though - if underlying Storage Driver can only support offline expansion, users of the PVC must take down their Pod before expansion can succeed. Please refer to documentation of your storage provider to find out - what mode of volume expansion it supports.

When volume expansion was introduced as an alpha feature, Kubernetes only supported offline filesystem expansion on the node and hence required users to restart their pods for file system resizing to finish. His behaviour has been changed and Kubernetes tries its best to fulfil any resize request regardless of whether the underlying PersistentVolume volume is online or offline. If your storage provider supports online expansion then no Pod restart should be necessary for volume expansion to finish.

Next steps

Although volume expansion is now stable as part of the recent v1.24 release, SIG Storage are working to make it even simpler for users of Kubernetes to expand their persistent storage. Kubernetes 1.23 introduced features for triggering recovery from failed volume expansion, allowing users to attempt self-service healing after a failed resize. See Recovering from volume expansion failure for more details.

The Kubernetes contributor community is also discussing the potential for StatefulSet-driven storage expansion. This proposed feature would let you trigger expansion for all underlying PVs that are providing storage to a StatefulSet, by directly editing the StatefulSet object. See the Support Volume Expansion Through StatefulSets enhancement proposal for more details.

Dockershim: The Historical Context

Dockershim has been removed as of Kubernetes v1.24, and this is a positive move for the project. However, context is important for fully understanding something, be it socially or in software development, and this deserves a more in-depth review. Alongside the dockershim removal in Kubernetes v1.24, we’ve seen some confusion (sometimes at a panic level) and dissatisfaction with this decision in the community, largely due to a lack of context around this removal. The decision to deprecate and eventually remove dockershim from Kubernetes was not made quickly or lightly. Still, it’s been in the works for so long that many of today’s users are newer than that decision, and certainly newer than the choices that led to the dockershim being necessary in the first place.

So what is the dockershim, and why is it going away?

In the early days of Kubernetes, we only supported one container runtime. That runtime was Docker Engine. Back then, there weren’t really a lot of other options out there and Docker was the dominant tool for working with containers, so this was not a controversial choice. Eventually, we started adding more container runtimes, like rkt and hypernetes, and it became clear that Kubernetes users want a choice of runtimes working best for them. So Kubernetes needed a way to allow cluster operators the flexibility to use whatever runtime they choose.

The Container Runtime Interface (CRI) was released to allow that flexibility. The introduction of CRI was great for the project and users alike, but it did introduce a problem: Docker Engine’s use as a container runtime predates CRI, and Docker Engine is not CRI-compatible. To solve this issue, a small software shim (dockershim) was introduced as part of the kubelet component specifically to fill in the gaps between Docker Engine and CRI, allowing cluster operators to continue using Docker Engine as their container runtime largely uninterrupted.

However, this little software shim was never intended to be a permanent solution. Over the course of years, its existence has introduced a lot of unnecessary complexity to the kubelet itself. Some integrations are inconsistently implemented for Docker because of this shim, resulting in an increased burden on maintainers, and maintaining vendor-specific code is not in line with our open source philosophy. To reduce this maintenance burden and move towards a more collaborative community in support of open standards, KEP-2221 was introduced, proposing the removal of the dockershim. With the release of Kubernetes v1.20, the deprecation was official.

We didn’t do a great job communicating this, and unfortunately, the deprecation announcement led to some panic within the community. Confusion around what this meant for Docker as a company, if container images built by Docker would still run, and what Docker Engine actually is led to a conflagration on social media. This was our fault; we should have more clearly communicated what was happening and why at the time. To combat this, we released a blog and accompanying FAQ to allay the community’s fears and correct some misconceptions about what Docker is and how containers work within Kubernetes. As a result of the community’s concerns, Docker and Mirantis jointly agreed to continue supporting the dockershim code in the form of cri-dockerd, allowing you to continue using Docker Engine as your container runtime if need be. For the interest of users who want to try other runtimes, like containerd or cri-o, migration documentation was written.

We later surveyed the community and discovered that there are still many users with questions and concerns. In response, Kubernetes maintainers and the CNCF committed to addressing these concerns by extending documentation and other programs. In fact, this blog post is a part of this program. With so many end users successfully migrated to other runtimes, and improved documentation, we believe that everyone has a paved way to migration now.

Docker is not going away, either as a tool or as a company. It’s an important part of the cloud native community and the history of the Kubernetes project. We wouldn’t be where we are without them. That said, removing dockershim from kubelet is ultimately good for the community, the ecosystem, the project, and open source at large. This is an opportunity for all of us to come together to support open standards, and we’re glad to be doing so with the help of Docker and the community.

Kubernetes 1.24: Stargazer

We are excited to announce the release of Kubernetes 1.24, the first release of 2022!

This release consists of 46 enhancements: fourteen enhancements have graduated to stable, fifteen enhancements are moving to beta, and thirteen enhancements are entering alpha. Also, two features have been deprecated, and two features have been removed.

Major Themes

Dockershim Removed from kubelet

After its deprecation in v1.20, the dockershim component has been removed from the kubelet in Kubernetes v1.24. From v1.24 onwards, you will need to either use one of the other supported runtimes (such as containerd or CRI-O) or use cri-dockerd if you are relying on Docker Engine as your container runtime. For more information about ensuring your cluster is ready for this removal, please see this guide.

Beta APIs Off by Default

New beta APIs will not be enabled in clusters by default. Existing beta APIs and new versions of existing beta APIs will continue to be enabled by default.

Signing Release Artifacts

Release artifacts are signed using cosign signatures, and there is experimental support for verifying image signatures. Signing and verification of release artifacts is part of increasing software supply chain security for the Kubernetes release process.

OpenAPI v3

Kubernetes 1.24 offers beta support for publishing its APIs in the OpenAPI v3 format.

Storage Capacity and Volume Expansion Are Generally Available

Storage capacity tracking supports exposing currently available storage capacity via CSIStorageCapacity objects and enhances scheduling of pods that use CSI volumes with late binding.

Volume expansion adds support for resizing existing persistent volumes.

NonPreemptingPriority to Stable

This feature adds a new option to PriorityClasses, which can enable or disable pod preemption.

Storage Plugin Migration

Work is underway to migrate the internals of in-tree storage plugins to call out to CSI Plugins while maintaining the original API. The Azure Disk and OpenStack Cinder plugins have both been migrated.

gRPC Probes Graduate to Beta

With Kubernetes 1.24, the gRPC probes functionality has entered beta and is available by default. You can now configure startup, liveness, and readiness probes for your gRPC app natively within Kubernetes without exposing an HTTP endpoint or using an extra executable.

Kubelet Credential Provider Graduates to Beta

Originally released as Alpha in Kubernetes 1.20, the kubelet's support for image credential providers has now graduated to Beta. This allows the kubelet to dynamically retrieve credentials for a container image registry using exec plugins rather than storing credentials on the node's filesystem.

Contextual Logging in Alpha

Kubernetes 1.24 has introduced contextual logging that enables the caller of a function to control all aspects of logging (output formatting, verbosity, additional values, and names).

Avoiding Collisions in IP allocation to Services

Kubernetes 1.24 introduces a new opt-in feature that allows you to soft-reserve a range for static IP address assignments to Services. With the manual enablement of this feature, the cluster will prefer automatic assignment from the pool of Service IP addresses, thereby reducing the risk of collision.

A Service ClusterIP can be assigned:

  • dynamically, which means the cluster will automatically pick a free IP within the configured Service IP range.
  • statically, which means the user will set one IP within the configured Service IP range.

Service ClusterIP are unique; hence, trying to create a Service with a ClusterIP that has already been allocated will return an error.

Dynamic Kubelet Configuration is Removed from the Kubelet

After being deprecated in Kubernetes 1.22, Dynamic Kubelet Configuration has been removed from the kubelet. The feature will be removed from the API server in Kubernetes 1.26.

Before you upgrade to Kubernetes 1.24, please verify that you are using/upgrading to a container runtime that has been tested to work correctly with this release.

For example, the following container runtimes are being prepared, or have already been prepared, for Kubernetes:

  • containerd v1.6.4 and later, v1.5.11 and later
  • CRI-O 1.24 and later

Service issues exist for pod CNI network setup and tear down in containerd v1.6.0–v1.6.3 when the CNI plugins have not been upgraded and/or the CNI config version is not declared in the CNI config files. The containerd team reports, "these issues are resolved in containerd v1.6.4."

With containerd v1.6.0–v1.6.3, if you do not upgrade the CNI plugins and/or declare the CNI config version, you might encounter the following "Incompatible CNI versions" or "Failed to destroy network for sandbox" error conditions.

CSI Snapshot

This information was added after initial publication.

VolumeSnapshot v1beta1 CRD has been removed. Volume snapshot and restore functionality for Kubernetes and the Container Storage Interface (CSI), which provides standardized APIs design (CRDs) and adds PV snapshot/restore support for CSI volume drivers, moved to GA in v1.20. VolumeSnapshot v1beta1 was deprecated in v1.20 and is now unsupported. Refer to KEP-177: CSI Snapshot and Volume Snapshot GA blog for more information.

Other Updates

Graduations to Stable

This release saw fourteen enhancements promoted to stable:

Major Changes

This release saw two major changes:

Release Notes

Check out the full details of the Kubernetes 1.24 release in our release notes.

Availability

Kubernetes 1.24 is available for download on GitHub. To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using containers as “nodes”, with kind. You can also easily install 1.24 using kubeadm.

Release Team

This release would not have been possible without the combined efforts of committed individuals comprising the Kubernetes 1.24 release team. This team came together to deliver all of the components that go into each Kubernetes release, including code, documentation, release notes, and more.

Special thanks to James Laverack, our release lead, for guiding us through a successful release cycle, and to all of the release team members for the time and effort they put in to deliver the v1.24 release for the Kubernetes community.

Kubernetes 1.24: Stargazer

The theme for Kubernetes 1.24 is Stargazer.

Generations of people have looked to the stars in awe and wonder, from ancient astronomers to the scientists who built the James Webb Space Telescope. The stars have inspired us, set our imagination alight, and guided us through long nights on difficult seas.

With this release we gaze upwards, to what is possible when our community comes together. Kubernetes is the work of hundreds of contributors across the globe and thousands of end-users supporting applications that serve millions. Every one is a star in our sky, helping us chart our course.

The release logo is made by Britnee Laverack, and depicts a telescope set upon starry skies and the Pleiades, often known in mythology as the “Seven Sisters”. The number seven is especially auspicious for the Kubernetes project, and is a reference back to our original “Project Seven” name.

This release of Kubernetes is named for those that would look towards the night sky and wonder — for all the stargazers out there. ✨

User Highlights

Ecosystem Updates

Project Velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing, and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

In the v1.24 release cycle, which ran for 17 weeks (January 10 to May 3), we saw contributions from 1029 companies and 1179 individuals.

Upcoming Release Webinar

Join members of the Kubernetes 1.24 release team on Tue May 24, 2022 9:45am – 11am PT to learn about the major features of this release, as well as deprecations and removals to help plan for upgrades. For more information and registration, visit the event page on the CNCF Online Programs site.

Get Involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below:

Increasing the security bar in Ingress-NGINX v1.2.0

The Ingress may be one of the most targeted components of Kubernetes. An Ingress typically defines an HTTP reverse proxy, exposed to the Internet, containing multiple websites, and with some privileged access to Kubernetes API (such as to read Secrets relating to TLS certificates and their private keys).

While it is a risky component in your architecture, it is still the most popular way to properly expose your services.

Ingress-NGINX has been part of security assessments that figured out we have a big problem: we don't do all proper sanitization before turning the configuration into an nginx.conf file, which may lead to information disclosure risks.

While we understand this risk and the real need to fix this, it's not an easy process to do, so we took another approach to reduce (but not remove!) this risk in the current (v1.2.0) release.

Meet Ingress NGINX v1.2.0 and the chrooted NGINX process

One of the main challenges is that Ingress-NGINX runs the web proxy server (NGINX) alongside the Ingress controller (the component that has access to Kubernetes API that and that creates the nginx.conf file).

So, NGINX does have the same access to the filesystem of the controller (and Kubernetes service account token, and other configurations from the container). While splitting those components is our end goal, the project needed a fast response; that lead us to the idea of using chroot().

Let's take a look into what an Ingress-NGINX container looked like before this change:

Ingress NGINX pre chroot

As we can see, the same container (not the Pod, the container!) that provides HTTP Proxy is the one that watches Ingress objects and writes the Container Volume

Now, meet the new architecture:

Ingress NGINX post chroot

What does all of this mean? A basic summary is: that we are isolating the NGINX service as a container inside the controller container.

While this is not strictly true, to understand what was done here, it's good to understand how Linux containers (and underlying mechanisms such as kernel namespaces) work. You can read about cgroups in the Kubernetes glossary: cgroup and learn more about cgroups interact with namespaces in the NGINX project article What Are Namespaces and cgroups, and How Do They Work?. (As you read that, bear in mind that Linux kernel namespaces are a different thing from Kubernetes namespaces).

Skip the talk, what do I need to use this new approach?

While this increases the security, we made this feature an opt-in in this release so you can have time to make the right adjustments in your environment(s). This new feature is only available from release v1.2.0 of the Ingress-NGINX controller.

There are two required changes in your deployments to use this feature:

  • Append the suffix "-chroot" to the container image name. For example: gcr.io/k8s-staging-ingress-nginx/controller-chroot:v1.2.0
  • In your Pod template for the Ingress controller, find where you add the capability NET_BIND_SERVICE and add the capability SYS_CHROOT. After you edit the manifest, you'll see a snippet like:
capabilities:
  drop:
  - ALL
  add:
  - NET_BIND_SERVICE
  - SYS_CHROOT

If you deploy the controller using the official Helm chart then change the following setting in values.yaml:

controller:
  image:
    chroot: true

Ingress controllers are normally set up cluster-wide (the IngressClass API is cluster scoped). If you manage the Ingress-NGINX controller but you're not the overall cluster operator, then check with your cluster admin about whether you can use the SYS_CHROOT capability, before you enable it in your deployment.

OK, but how does this increase the security of my Ingress controller?

Take the following configuration snippet and imagine, for some reason it was added to your nginx.conf:

location /randomthing/ {
      alias /;
      autoindex on;
}

If you deploy this configuration, someone can call http://website.example/randomthing and get some listing (and access) to the whole filesystem of the Ingress controller.

Now, can you spot the difference between chrooted and non chrooted Nginx on the listings below?

Without extra chroot()With extra chroot()
binbin
devdev
etcetc
home
liblib
media
mnt
optopt
procproc
root
runrun
sbin
srv
sys
tmptmp
usrusr
varvar
dbg
nginx-ingress-controller
wait-shutdown

The one in left side is not chrooted. So NGINX has full access to the filesystem. The one in right side is chrooted, so a new filesystem with only the required files to make NGINX work is created.

What about other security improvements in this release?

We know that the new chroot() mechanism helps address some portion of the risk, but still, someone can try to inject commands to read, for example, the nginx.conf file and extract sensitive information.

So, another change in this release (this is opt-out!) is the deep inspector. We know that some directives or regular expressions may be dangerous to NGINX, so the deep inspector checks all fields from an Ingress object (during its reconciliation, and also with a validating admission webhook) to verify if any fields contains these dangerous directives.

The ingress controller already does this for annotations, and our goal is to move this existing validation to happen inside deep inspection as part of a future release.

You can take a look into the existing rules in https://github.com/kubernetes/ingress-nginx/blob/main/internal/ingress/inspector/rules.go.

Due to the nature of inspecting and matching all strings within relevant Ingress objects, this new feature may consume a bit more CPU. You can disable it by running the ingress controller with the command line argument --deep-inspect=false.

What's next?

This is not our final goal. Our final goal is to split the control plane and the data plane processes. In fact, doing so will help us also achieve a Gateway API implementation, as we may have a different controller as soon as it "knows" what to provide to the data plane (we need some help here!!)

Some other projects in Kubernetes already take this approach (like KPNG, the proposed replacement for kube-proxy), and we plan to align with them and get the same experience for Ingress-NGINX.

Further reading

If you want to take a look into how chrooting was done in Ingress NGINX, take a look into https://github.com/kubernetes/ingress-nginx/pull/8337 The release v1.2.0 containing all the changes can be found at https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v1.2.0

Kubernetes Removals and Deprecations In 1.24

As Kubernetes evolves, features and APIs are regularly revisited and removed. New features may offer an alternative or improved approach to solving existing problems, motivating the team to remove the old approach.

We want to make sure you are aware of the changes coming in the Kubernetes 1.24 release. The release will deprecate several (beta) APIs in favor of stable versions of the same APIs. The major change coming in the Kubernetes 1.24 release is the removal of Dockershim. This is discussed below and will be explored in more depth at release time. For an early look at the changes coming in Kubernetes 1.24, take a look at the in-progress CHANGELOG.

A note about Dockershim

It's safe to say that the removal receiving the most attention with the release of Kubernetes 1.24 is Dockershim. Dockershim was deprecated in v1.20. As noted in the Kubernetes 1.20 changelog: "Docker support in the kubelet is now deprecated and will be removed in a future release. The kubelet uses a module called "dockershim" which implements CRI support for Docker and it has seen maintenance issues in the Kubernetes community." With the upcoming release of Kubernetes 1.24, the Dockershim will finally be removed.

In the article Don't Panic: Kubernetes and Docker, the authors succinctly captured the change's impact and encouraged users to remain calm:

Docker as an underlying runtime is being deprecated in favor of runtimes that use the Container Runtime Interface (CRI) created for Kubernetes. Docker-produced images will continue to work in your cluster with all runtimes, as they always have.

Several guides have been created with helpful information about migrating from dockershim to container runtimes that are directly compatible with Kubernetes. You can find them on the Migrating from dockershim page in the Kubernetes documentation.

For more information about why Kubernetes is moving away from dockershim, check out the aptly named: Kubernetes is Moving on From Dockershim and the updated dockershim removal FAQ.

Take a look at the Is Your Cluster Ready for v1.24? post to learn about how to ensure your cluster continues to work after upgrading from v1.23 to v1.24.

The Kubernetes API removal and deprecation process

Kubernetes contains a large number of components that evolve over time. In some cases, this evolution results in APIs, flags, or entire features, being removed. To prevent users from facing breaking changes, Kubernetes contributors adopted a feature deprecation policy. This policy ensures that stable APIs may only be deprecated when a newer stable version of that same API is available and that APIs have a minimum lifetime as indicated by the following stability levels:

  • Generally available (GA) or stable API versions may be marked as deprecated but must not be removed within a major version of Kubernetes.
  • Beta or pre-release API versions must be supported for 3 releases after deprecation.
  • Alpha or experimental API versions may be removed in any release without prior deprecation notice.

Removals follow the same deprecation policy regardless of whether an API is removed due to a beta feature graduating to stable or because that API was not proven to be successful. Kubernetes will continue to make sure migration options are documented whenever APIs are removed.

Deprecated APIs are those that have been marked for removal in a future Kubernetes release. Removed APIs are those that are no longer available for use in current, supported Kubernetes versions after having been deprecated. These removals have been superseded by newer, stable/generally available (GA) APIs.

API removals, deprecations, and other changes for Kubernetes 1.24

What to do

Dockershim removal

As stated earlier, there are several guides about Migrating from dockershim. You can start with Finding what container runtime are on your nodes. If your nodes are using dockershim, there are other possible Docker Engine dependencies such as Pods or third-party tools executing Docker commands or private registries in the Docker configuration file. You can follow the Check whether Dockershim removal affects you guide to review possible Docker Engine dependencies. Before upgrading to v1.24, you decide to either remain using Docker Engine and Migrate Docker Engine nodes from dockershim to cri-dockerd or migrate to a CRI-compatible runtime. Here's a guide to change the container runtime on a node from Docker Engine to containerd.

kubectl convert

The kubectl convert plugin for kubectl can be helpful to address migrating off deprecated APIs. The plugin facilitates the conversion of manifests between different API versions, for example, from a deprecated to a non-deprecated API version. More general information about the API migration process can be found in the Deprecated API Migration Guide. Follow the install kubectl convert plugin documentation to download and install the kubectl-convert binary.

Looking ahead

The Kubernetes 1.25 and 1.26 releases planned for later this year will stop serving beta versions of several currently stable Kubernetes APIs. The v1.25 release will also remove PodSecurityPolicy, which was deprecated with Kubernetes 1.21 and will not graduate to stable. See PodSecurityPolicy Deprecation: Past, Present, and Future for more information.

The official list of API removals planned for Kubernetes 1.25 is:

  • The beta CronJob API (batch/v1beta1)
  • The beta EndpointSlice API (discovery.k8s.io/v1beta1)
  • The beta Event API (events.k8s.io/v1beta1)
  • The beta HorizontalPodAutoscaler API (autoscaling/v2beta1)
  • The beta PodDisruptionBudget API (policy/v1beta1)
  • The beta PodSecurityPolicy API (policy/v1beta1)
  • The beta RuntimeClass API (node.k8s.io/v1beta1)

The official list of API removals planned for Kubernetes 1.26 is:

  • The beta FlowSchema and PriorityLevelConfiguration APIs (flowcontrol.apiserver.k8s.io/v1beta1)
  • The beta HorizontalPodAutoscaler API (autoscaling/v2beta2)

Want to know more?

Deprecations are announced in the Kubernetes release notes. You can see the announcements of pending deprecations in the release notes for:

For information on the process of deprecation and removal, check out the official Kubernetes deprecation policy document.

Is Your Cluster Ready for v1.24?

Way back in December of 2020, Kubernetes announced the deprecation of Dockershim. In Kubernetes, dockershim is a software shim that allows you to use the entire Docker engine as your container runtime within Kubernetes. In the upcoming v1.24 release, we are removing Dockershim - the delay between deprecation and removal in line with the project’s policy of supporting features for at least one year after deprecation. If you are a cluster operator, this guide includes the practical realities of what you need to know going into this release. Also, what do you need to do to ensure your cluster doesn’t fall over!

First, does this even affect you?

If you are rolling your own cluster or are otherwise unsure whether or not this removal affects you, stay on the safe side and check to see if you have any dependencies on Docker Engine. Please note that using Docker Desktop to build your application containers is not a Docker dependency for your cluster. Container images created by Docker are compliant with the Open Container Initiative (OCI), a Linux Foundation governance structure that defines industry standards around container formats and runtimes. They will work just fine on any container runtime supported by Kubernetes.

If you are using a managed Kubernetes service from a cloud provider, and you haven’t explicitly changed the container runtime, there may be nothing else for you to do. Amazon EKS, Azure AKS, and Google GKE all default to containerd now, though you should make sure they do not need updating if you have any node customizations. To check the runtime of your nodes, follow Find Out What Container Runtime is Used on a Node.

Regardless of whether you are rolling your own cluster or using a managed Kubernetes service from a cloud provider, you may need to migrate telemetry or security agents that rely on Docker Engine.

I have a Docker dependency. What now?

If your Kubernetes cluster depends on Docker Engine and you intend to upgrade to Kubernetes v1.24 (which you should eventually do for security and similar reasons), you will need to change your container runtime from Docker Engine to something else or use cri-dockerd. Since containerd is a graduated CNCF project and the runtime within Docker itself, it’s a safe bet as an alternative container runtime. Fortunately, the Kubernetes project has already documented the process of changing a node’s container runtime, using containerd as an example. Instructions are similar for switching to one of the other supported runtimes.

I want to upgrade Kubernetes, and I need to maintain compatibility with Docker as a runtime. What are my options?

Fear not, you aren’t being left out in the cold and you don’t have to take the security risk of staying on an old version of Kubernetes. Mirantis and Docker have jointly released, and are maintaining, a replacement for dockershim. That replacement is called cri-dockerd. If you do need to maintain compatibility with Docker as a runtime, install cri-dockerd following the instructions in the project’s documentation.

Is that it?

Yes. As long as you go into this release aware of the changes being made and the details of your own clusters, and you make sure to communicate clearly with your development teams, it will be minimally dramatic. You may have some changes to make to your cluster, application code, or scripts, but all of these requirements are documented. Switching from using Docker Engine as your runtime to using one of the other supported container runtimes effectively means removing the middleman, since the purpose of dockershim is to access the container runtime used by Docker itself. From a practical perspective, this removal is better both for you and for Kubernetes maintainers in the long-run.

If you still have questions, please first check the Dockershim Removal FAQ.

Meet Our Contributors - APAC (Aus-NZ region)

Authors & Interviewers: Anubhav Vardhan, Atharva Shinde, Avinesh Tripathi, Brad McCoy, Debabrata Panigrahi, Jayesh Srivastava, Kunal Verma, Pranshu Srivastava, Priyanka Saggu, Purneswar Prasad, Vedant Kakde


Good day, everyone 👋

Welcome back to the second episode of the "Meet Our Contributors" blog post series for APAC.

This post will feature four outstanding contributors from the Australia and New Zealand regions, who have played diverse leadership and community roles in the Upstream Kubernetes project.

So, without further ado, let's get straight to the blog.

Caleb Woodbine

Caleb Woodbine is currently a member of the ii.nz organisation.

He began contributing to the Kubernetes project in 2018 as a member of the Kubernetes Conformance working group. His experience was positive, and he benefited from early guidance from Hippie Hacker, a fellow contributor from New Zealand.

He has made major contributions to Kubernetes project since then through SIG k8s-infra and k8s-conformance working group.

Caleb is also a co-organizer of the CloudNative NZ community events, which aim to expand the reach of Kubernetes project throughout New Zealand in order to encourage technical education and improved employment opportunities.

There need to be more outreach in APAC and the educators and universities must pick up Kubernetes, as they are very slow and about 8+ years out of date. NZ tends to rather pay overseas than educate locals on the latest cloud tech Locally.

Dylan Graham

Dylan Graham is a cloud engineer from Adelaide, Australia. He has been contributing to the upstream Kubernetes project since 2018.

He stated that being a part of such a large-scale project was initially overwhelming, but that the community's friendliness and openness assisted him in getting through it.

He began by contributing to the project documentation and is now mostly focused on the community support for the APAC region.

He believes that consistent attendance at community/project meetings, taking on project tasks, and seeking community guidance as needed can help new aspiring developers become effective contributors.

The feeling of being a part of a large community is really special. I've met some amazing people, even some before the pandemic in real life :)

Hippie Hacker

Hippie has worked for the CNCF.io as a Strategic Initiatives contractor from New Zealand for almost 5+ years. He is an active contributor to k8s-infra, API conformance testing, Cloud provider conformance submissions, and apisnoop.cncf.io domains of the upstream Kubernetes & CNCF projects.

He recounts their early involvement with the Kubernetes project, which began roughly 5 years ago when their firm, ii.nz, demonstrated network booting from a Raspberry Pi using PXE and running Gitlab in-cluster to install Kubernetes on servers.

He describes their own contributing experience as someone who, at first, tried to do all of the hard lifting on their own, but eventually saw the benefit of group contributions which reduced burnout and task division which allowed folks to keep moving forward on their own momentum.

He recommends that new contributors use pair programming.

The cross pollination of approaches and two pairs of eyes on the same work can often yield a much more amplified effect than a PR comment / approval alone can afford.

Nick Young

Nick Young works at VMware as a technical lead for Contour, a CNCF ingress controller. He was active with the upstream Kubernetes project from the beginning, and eventually became the chair of the LTS working group, where he advocated user concerns. He is currently the SIG Network Gateway API subproject's maintainer.

His contribution path was notable in that he began working on major areas of the Kubernetes project early on, skewing his trajectory.

He asserts the best thing a new contributor can do is to "start contributing". Naturally, if it is relevant to their employment, that is excellent; however, investing non-work time in contributing can pay off in the long run in terms of work. He believes that new contributors, particularly those who are currently Kubernetes users, should be encouraged to participate in higher-level project discussions.

Just being active and contributing will get you a long way. Once you've been active for a while, you'll find that you're able to answer questions, which will mean you're asked questions, and before you know it you are an expert.


If you have any recommendations/suggestions for who we should interview next, please let us know in #sig-contribex. Your suggestions would be much appreciated. We're thrilled to have additional folks assisting us in reaching out to even more wonderful individuals of the community.

We'll see you all in the next one. Everyone, till then, have a happy contributing! 👋

Updated: Dockershim Removal FAQ

This supersedes the original Dockershim Deprecation FAQ article, published in late 2020. The article includes updates from the v1.24 release of Kubernetes.


This document goes over some frequently asked questions regarding the removal of dockershim from Kubernetes. The removal was originally announced as a part of the Kubernetes v1.20 release. The Kubernetes v1.24 release actually removed the dockershim from Kubernetes.

For more on what that means, check out the blog post Don't Panic: Kubernetes and Docker.

To determine the impact that the removal of dockershim would have for you or your organization, you can read Check whether dockershim removal affects you.

In the months and days leading up to the Kubernetes 1.24 release, Kubernetes contributors worked hard to try to make this a smooth transition.

Why was the dockershim removed from Kubernetes?

Early versions of Kubernetes only worked with a specific container runtime: Docker Engine. Later, Kubernetes added support for working with other container runtimes. The CRI standard was created to enable interoperability between orchestrators (like Kubernetes) and many different container runtimes. Docker Engine doesn't implement that interface (CRI), so the Kubernetes project created special code to help with the transition, and made that dockershim code part of Kubernetes itself.

The dockershim code was always intended to be a temporary solution (hence the name: shim). You can read more about the community discussion and planning in the Dockershim Removal Kubernetes Enhancement Proposal. In fact, maintaining dockershim had become a heavy burden on the Kubernetes maintainers.

Additionally, features that were largely incompatible with the dockershim, such as cgroups v2 and user namespaces are being implemented in these newer CRI runtimes. Removing the dockershim from Kubernetes allows further development in those areas.

Are Docker and containers the same thing?

Docker popularized the Linux containers pattern and has been instrumental in developing the underlying technology, however containers in Linux have existed for a long time. The container ecosystem has grown to be much broader than just Docker. Standards like OCI and CRI have helped many tools grow and thrive in our ecosystem, some replacing aspects of Docker while others enhance existing functionality.

Will my existing container images still work?

Yes, the images produced from docker build will work with all CRI implementations. All your existing images will still work exactly the same.

What about private images?

Yes. All CRI runtimes support the same pull secrets configuration used in Kubernetes, either via the PodSpec or ServiceAccount.

Can I still use Docker Engine in Kubernetes 1.23?

Yes, the only thing changed in 1.20 is a single warning log printed at kubelet startup if using Docker Engine as the runtime. You'll see this warning in all versions up to 1.23. The dockershim removal occurred in Kubernetes 1.24.

If you're running Kubernetes v1.24 or later, see Can I still use Docker Engine as my container runtime?. (Remember, you can switch away from the dockershim if you're using any supported Kubernetes release; from release v1.24, you must switch as Kubernetes no longer includes the dockershim).

Which CRI implementation should I use?

That’s a complex question and it depends on a lot of factors. If Docker Engine is working for you, moving to containerd should be a relatively easy swap and will have strictly better performance and less overhead. However, we encourage you to explore all the options from the CNCF landscape in case another would be an even better fit for your environment.

Can I still use Docker Engine as my container runtime?

First off, if you use Docker on your own PC to develop or test containers: nothing changes. You can still use Docker locally no matter what container runtime(s) you use for your Kubernetes clusters. Containers make this kind of interoperability possible.

Mirantis and Docker have committed to maintaining a replacement adapter for Docker Engine, and to maintain that adapter even after the in-tree dockershim is removed from Kubernetes. The replacement adapter is named cri-dockerd.

You can install cri-dockerd and use it to connect the kubelet to Docker Engine. Read Migrate Docker Engine nodes from dockershim to cri-dockerd to learn more.

Are there examples of folks using other runtimes in production today?

All Kubernetes project produced artifacts (Kubernetes binaries) are validated with each release.

Additionally, the kind project has been using containerd for some time and has seen an improvement in stability for its use case. Kind and containerd are leveraged multiple times every day to validate any changes to the Kubernetes codebase. Other related projects follow a similar pattern as well, demonstrating the stability and usability of other container runtimes. As an example, OpenShift 4.x has been using the CRI-O runtime in production since June 2019.

For other examples and references you can look at the adopters of containerd and CRI-O, two container runtimes under the Cloud Native Computing Foundation (CNCF).

People keep referencing OCI, what is that?

OCI stands for the Open Container Initiative, which standardized many of the interfaces between container tools and technologies. They maintain a standard specification for packaging container images (OCI image-spec) and running containers (OCI runtime-spec). They also maintain an actual implementation of the runtime-spec in the form of runc, which is the underlying default runtime for both containerd and CRI-O. The CRI builds on these low-level specifications to provide an end-to-end standard for managing containers.

What should I look out for when changing CRI implementations?

While the underlying containerization code is the same between Docker and most CRIs (including containerd), there are a few differences around the edges. Some common things to consider when migrating are:

  • Logging configuration
  • Runtime resource limitations
  • Node provisioning scripts that call docker or use Docker Engine via its control socket
  • Plugins for kubectl that require the docker CLI or the Docker Engine control socket
  • Tools from the Kubernetes project that require direct access to Docker Engine (for example: the deprecated kube-imagepuller tool)
  • Configuration of functionality like registry-mirrors and insecure registries
  • Other support scripts or daemons that expect Docker Engine to be available and are run outside of Kubernetes (for example, monitoring or security agents)
  • GPUs or special hardware and how they integrate with your runtime and Kubernetes

If you use Kubernetes resource requests/limits or file-based log collection DaemonSets then they will continue to work the same, but if you've customized your dockerd configuration, you’ll need to adapt that for your new container runtime where possible.

Another thing to look out for is anything expecting to run for system maintenance or nested inside a container when building images will no longer work. For the former, you can use the crictl tool as a drop-in replacement (see mapping from docker cli to crictl) and for the latter you can use newer container build options like img, buildah, kaniko, or buildkit-cli-for-kubectl that don’t require Docker.

For containerd, you can start with their documentation to see what configuration options are available as you migrate things over.

For instructions on how to use containerd and CRI-O with Kubernetes, see the Kubernetes documentation on Container Runtimes.

What if I have more questions?

If you use a vendor-supported Kubernetes distribution, you can ask them about upgrade plans for their products. For end-user questions, please post them to our end user community forum: https://discuss.kubernetes.io/.

You can discuss the decision to remove dockershim via a dedicated GitHub issue.

You can also check out the excellent blog post Wait, Docker is deprecated in Kubernetes now? a more in-depth technical discussion of the changes.

Is there any tooling that can help me find dockershim in use?

Yes! The Detector for Docker Socket (DDS) is a kubectl plugin that you can install and then use to check your cluster. DDS can detect if active Kubernetes workloads are mounting the Docker Engine socket (docker.sock) as a volume. Find more details and usage patterns in the DDS project's README.

Can I have a hug?

Yes, we're still giving hugs as requested. 🤗🤗🤗

SIG Node CI Subproject Celebrates Two Years of Test Improvements

Ensuring the reliability of SIG Node upstream code is a continuous effort that takes a lot of behind-the-scenes effort from many contributors. There are frequent releases of Kubernetes, base operating systems, container runtimes, and test infrastructure that result in a complex matrix that requires attention and steady investment to "keep the lights on." In May 2020, the Kubernetes node special interest group ("SIG Node") organized a new subproject for continuous integration (CI) for node-related code and tests. Since its inauguration, the SIG Node CI subproject has run a weekly meeting, and even the full hour is often not enough to complete triage of all bugs, test-related PRs and issues, and discuss all related ongoing work within the subgroup.

Over the past two years, we've fixed merge-blocking and release-blocking tests, reducing time to merge Kubernetes contributors' pull requests thanks to reduced test flakes. When we started, Node test jobs only passed 42% of the time, and through our efforts, we now ensure a consistent >90% job pass rate. We've closed 144 test failure issues and merged 176 pull requests just in kubernetes/kubernetes. And we've helped subproject participants ascend the Kubernetes contributor ladder, with 3 new org members, 6 new reviewers, and 2 new approvers.

The Node CI subproject is an approachable first stop to help new contributors get started with SIG Node. There is a low barrier to entry for new contributors to address high-impact bugs and test fixes, although there is a long road before contributors can climb the entire contributor ladder: it took over a year to establish two new approvers for the group. The complexity of all the different components that power Kubernetes nodes and its test infrastructure requires a sustained investment over a long period for developers to deeply understand the entire system, both at high and low levels of detail.

We have several regular contributors at our meetings, however; our reviewers and approvers pool is still small. It is our goal to continue to grow contributors to ensure a sustainable distribution of work that does not just fall to a few key approvers.

It's not always obvious how subprojects within SIGs are formed, operate, and work. Each is unique to its sponsoring SIG and tailored to the projects that the group is intended to support. As a group that has welcomed many first-time SIG Node contributors, we'd like to share some of the details and accomplishments over the past two years, helping to demystify our inner workings and celebrate the hard work of all our dedicated contributors!

Timeline

May 2020. SIG Node CI group was formed on May 11, 2020, with more than 30 volunteers signed up, to improve SIG Node CI signal and overall observability. Victor Pickard focused on getting testgrid jobs passing when Ning Liao suggested forming a group around this effort and came up with the original group charter document. The SIG Node chairs sponsored group creation with Victor as a subproject lead. Sergey Kanzhelev joined Victor shortly after as a co-lead.

At the kick-off meeting, we discussed which tests to concentrate on fixing first and discussed merge-blocking and release-blocking tests, many of which were failing due to infrastructure issues or buggy test code.

The subproject launched weekly hour-long meetings to discuss ongoing work discussion and triage.

June 2020. Morgan Bauer, Karan Goel, and Jorge Alarcon Ochoa were recognized as reviewers for the SIG Node CI group for their contributions, helping significantly with the early stages of the subproject. David Porter and Roy Yang also joined the SIG test failures GitHub team.

August 2020. All merge-blocking and release-blocking tests were passing, with some flakes. However, only 42% of all SIG Node test jobs were green, as there were many flakes and failing tests.

October 2020. Amim Knabben becomes a Kubernetes org member for his contributions to the subproject.

January 2021. With healthy presubmit and critical periodic jobs passing, the subproject discussed its goal for cleaning up the rest of periodic tests and ensuring they passed without flakes.

Elana Hashman joined the subproject, stepping up to help lead it after Victor's departure.

February 2021. Artyom Lukianov becomes a Kubernetes org member for his contributions to the subproject.

August 2021. After SIG Node successfully ran a bug scrub to clean up its bug backlog, the scope of the meeting was extended to include bug triage to increase overall reliability, anticipating issues before they affect the CI signal.

Subproject leads Elana Hashman and Sergey Kanzhelev are both recognized as approvers on all node test code, supported by SIG Node and SIG Testing.

September 2021. After significant deflaking progress with serial tests in the 1.22 release spearheaded by Francesco Romani, the subproject set a goal for getting the serial job fully passing by the 1.23 release date.

Mike Miranda becomes a Kubernetes org member for his contributions to the subproject.

November 2021. Throughout 2021, SIG Node had no merge or release-blocking test failures. Many flaky tests from past releases are removed from release-blocking dashboards as they had been fully cleaned up.

Danielle Lancashire was recognized as a reviewer for SIG Node's subgroup, test code.

The final node serial tests were completely fixed. The serial tests consist of many disruptive and slow tests which tend to be flakey and are hard to troubleshoot. By the 1.23 release freeze, the last serial tests were fixed and the job was passing without flakes.

Slack announcement that Serial tests are green

The 1.23 release got a special shout out for the tests quality and CI signal. The SIG Node CI subproject was proud to have helped contribute to such a high-quality release, in part due to our efforts in identifying and fixing flakes in Node and beyond.

Slack shoutout that release was mostly green

December 2021. An estimated 90% of test jobs were passing at the time of the 1.23 release (up from 42% in August 2020).

Dockershim code was removed from Kubernetes. This affected nearly half of SIG Node's test jobs and the SIG Node CI subproject reacted quickly and retargeted all the tests. SIG Node was the first SIG to complete test migrations off dockershim, providing examples for other affected SIGs. The vast majority of new jobs passed at the time of introduction without further fixes required. The effort of removing dockershim) from Kubernetes is ongoing. There are still some wrinkles from the dockershim removal as we uncover more dependencies on dockershim, but we plan to stabilize all test jobs by the 1.24 release.

Statistics

Our regular meeting attendees and subproject participants for the past few months:

  • Aditi Sharma
  • Artyom Lukianov
  • Arnaud Meukam
  • Danielle Lancashire
  • David Porter
  • Davanum Srinivas
  • Elana Hashman
  • Francesco Romani
  • Matthias Bertschy
  • Mike Miranda
  • Paco Xu
  • Peter Hunt
  • Ruiwen Zhao
  • Ryan Phillips
  • Sergey Kanzhelev
  • Skyler Clark
  • Swati Sehgal
  • Wenjun Wu

The kubernetes/test-infra source code repository contains test definitions. The number of Node PRs just in that repository:

  • 2020 PRs (since May): 183
  • 2021 PRs: 264

Triaged issues and PRs on CI board (including triaging away from the subgroup scope):

  • 2020 (since May): 132
  • 2021: 532

Future

Just "keeping the lights on" is a bold task and we are committed to improving this experience. We are working to simplify the triage and review processes for SIG Node.

Specifically, we are working on better test organization, naming, and tracking:

We are also constantly making progress on improved tests debuggability and de-flaking.

If any of this interests you, we'd love for you to join us! There's plenty to learn in debugging test failures, and it will help you gain familiarity with the code that SIG Node maintains.

You can always find information about the group on the SIG Node page. We give group updates at our maintainer track sessions, such as KubeCon + CloudNativeCon Europe 2021 and KubeCon + CloudNative North America 2021. Join us in our mission to keep the kubelet and other SIG Node components reliable and ensure smooth and uneventful releases!

Spotlight on SIG Multicluster

Introduction

SIG Multicluster is the SIG focused on how Kubernetes concepts are expanded and used beyond the cluster boundary. Historically, Kubernetes resources only interacted within that boundary - KRU or Kubernetes Resource Universe (not an actual Kubernetes concept). Kubernetes clusters, even now, don't really know anything about themselves or, about other clusters. Absence of cluster identifiers is a case in point. With the growing adoption of multicloud and multicluster deployments, the work SIG Multicluster doing is gaining a lot of attention. In this blog, Jeremy Olmsted-Thompson, Google and Chris Short, AWS discuss the interesting problems SIG Multicluster is solving and how you can get involved. Their initials JOT and CS will be used for brevity.

A summary of their conversation

CS: How long has the SIG Multicluster existed and how was the SIG in its infancy? How long have you been with this SIG?

JOT: I've been around for almost two years in the SIG Multicluster. All I know about the infancy years is from the lore but even in the early days, it was always about solving this same problem. Early efforts have been things like KubeFed. I think there are still folks using KubeFed but it's a smaller slice. Back then, I think people out there deploying large numbers of Kubernetes clusters were really not at a point where we had a ton of real concrete use cases. Projects like KubeFed and Cluster Registry were developed around that time and the need back then can be associated to these projects. The motivation for these projects were how do we solve the problems that we think people are going to have, when they start expanding to multiple clusters. Honestly, in some ways, it was trying to do too much at that time.

CS: How does KubeFed differ from the current state of SIG Multicluster? How does the lore differ from the now?

JOT: Yeah, it was like trying to get ahead of potential problems instead of addressing specific problems. I think towards the end of 2019, there was a slow down in SIG multicluster work and we kind of picked it back up with one of the most active recent projects that is the SIG Multicluster services (MCS).

Now this is the shift to solving real specific problems. For example,

I've got workloads that are spread across multiple clusters and I need them to talk to each other.

Okay, that's very straightforward and we know that we need to solve that. To get started, let's make sure that these projects can work together on a common API so you get the same kind of portability that you get with Kubernetes.

There's a few implementations of the MCS API out there and more are being developed. But, we didn't build an implementation because depending on how you're deploying things there could be hundreds of implementations. As long as you only need the basic Multicluster service functionality, it'll just work on whatever background you want, whether it's Submariner, GKE, or a service mesh.

My favorite example of "then vs. now" is cluster ID. A few years ago, there was an effort to define a cluster ID. A lot of really good thought went into this concept, for example, how do we make a cluster ID is unique across multiple clusters. How do we make this ID globally unique so it'll work in every contact? Let's say, there's an acquisition or merger of teams - does the cluster IDs still remain unique for those teams?

With Multicluster services, we found the need for an actual cluster ID, and it has a very specific need. To address this specific need, we're no longer considering every single Kubernetes cluster out there rather the ClusterSets - a grouping of clusters that work together in some kind of bounds. That's a much narrower scope than considering clusters everywhere in time and space. It also leaves flexibility for an implementer to define the boundary (a ClusterSet) beyond which this cluster ID will no longer be unique.

CS: How do you feel about the current state of SIG Multicluster versus where you're hoping to be in future?

JOT: There's a few projects that are kind of getting started, for example, Work API. In the future, I think that some common practices around how do we deploy things across clusters are going to develop.

If I have clusters deployed in a bunch of different regions; what's the best way to actually do that?

The answer is, almost always, "it depends". Why are you doing this? Is it because there's some kind of compliance that makes you care about locality? Is it performance? Is it availability?

I think revisiting registry patterns will probably be a natural step after we have cluster IDs, that is, how do you actually associate these clusters together? Maybe you've got a distributed deployment that you run in your own data centers all over the world. I imagine that expanding the API in that space is going to be important as more multi cluster features develop. It really depends on what the community starts doing with these tools.

CS: In the early days of Kubernetes, we used to have a few large Kubernetes clusters and now we're dealing with many small Kubernetes clusters - even multiple clusters for our own dev environments. How has this shift from a few large clusters to many small clusters affected the SIG? Has it accelerated the work or make it challenging in any way?

JOT: I think that it has created a lot of ambiguity that needs solving. Originally, you'd have a dev cluster, a staging cluster, and a prod cluster. When the multi region thing came in, we started needing dev/staging/prod clusters, per region. And then, sometimes clusters really need more isolation due to compliance or some regulations issues. Thus, we're ending up with a lot of clusters. I think figuring out the right balance on how many clusters should you actually have is important. The power of Kubernetes is being able to deploy a lot of things managed by a single control plane. So, it's not like every single workload that gets deployed should be in its own cluster. But I think it's pretty clear that we can't put every single workload in a single cluster.

CS: What are some of your favorite things about this SIG?

JOT: The complexity of the problems, the people and the newness of the space. We don't have right answers and we have to figure this out. At the beginning, we couldn't even think about multi clusters because there was no way to connect services across clusters. Now there is and we're starting to go tackle those problems, I think that this is a really fun place to be in because I expect that the SIG is going to get a lot busier the next couple of years. It's a very collaborative group and we definitely would like more people to come join us, get involved, raise their problems and bring their ideas.

CS: What do you think keeps people in this group? How has the pandemic affected you?

JOT: I think it definitely got a little bit quieter during the pandemic. But for the most part; it's a very distributed group so whether you're calling in to our weekly meetings from a conference room or from your home, it doesn't make that huge of a difference. During the pandemic, a lot of people had time to focus on what's next for their scale and growth. I think that's what keeps people in the group - we have real problems that need to be solved which are very new in this space. And it's fun :)

Wrap up

CS: That's all we have for today. Thanks Jeremy for your time.

JOT: Thanks Chris. Everybody is welcome at our bi-weekly meetings. We love as many people to come as possible and welcome all questions and all ideas. It's a new space and it'd be great to grow the community.

Securing Admission Controllers

Admission control is a key part of Kubernetes security, alongside authentication and authorization. Webhook admission controllers are extensively used to help improve the security of Kubernetes clusters in a variety of ways including restricting the privileges of workloads and ensuring that images deployed to the cluster meet organization’s security requirements.

However, as with any additional component added to a cluster, security risks can present themselves. A security risk example is if the deployment and management of the admission controller are not handled correctly. To help admission controller users and designers manage these risks appropriately, the security documentation subgroup of SIG Security has spent some time developing a threat model for admission controllers. This threat model looks at likely risks which may arise from the incorrect use of admission controllers, which could allow security policies to be bypassed, or even allow an attacker to get unauthorised access to the cluster.

From the threat model, we developed a set of security best practices that should be adopted to ensure that cluster operators can get the security benefits of admission controllers whilst avoiding any risks from using them.

Admission controllers and good practices for security

From the threat model, a couple of themes emerged around how to ensure the security of admission controllers.

Secure webhook configuration

It’s important to ensure that any security component in a cluster is well configured and admission controllers are no different here. There are a couple of security best practices to consider when using admission controllers

  • Correctly configured TLS for all webhook traffic. Communications between the API server and the admission controller webhook should be authenticated and encrypted to ensure that attackers who may be in a network position to view or modify this traffic cannot do so. To achieve this access the API server and webhook must be using certificates from a trusted certificate authority so that they can validate their mutual identities
  • Only authenticated access allowed. If an attacker can send an admission controller large numbers of requests, they may be able to overwhelm the service causing it to fail. Ensuring all access requires strong authentication should mitigate that risk.
  • Admission controller fails closed. This is a security practice that has a tradeoff, so whether a cluster operator wants to configure it will depend on the cluster’s threat model. If an admission controller fails closed, when the API server can’t get a response from it, all deployments will fail. This stops attackers bypassing the admission controller by disabling it, but, can disrupt the cluster’s operation. As clusters can have multiple webhooks, one approach to hit a middle ground might be to have critical controls on a fail closed setups and less critical controls allowed to fail open.
  • Regular reviews of webhook configuration. Configuration mistakes can lead to security issues, so it’s important that the admission controller webhook configuration is checked to make sure the settings are correct. This kind of review could be done automatically by an Infrastructure As Code scanner or manually by an administrator.

Secure cluster configuration for admission control

In most cases, the admission controller webhook used by a cluster will be installed as a workload in the cluster. As a result, it’s important to ensure that Kubernetes' security features that could impact its operation are well configured.

  • Restrict RBAC rights. Any user who has rights which would allow them to modify the configuration of the webhook objects or the workload that the admission controller uses could disrupt its operation. So it’s important to make sure that only cluster administrators have those rights.
  • Prevent privileged workloads. One of the realities of container systems is that if a workload is given certain privileges, it will be possible to break out to the underlying cluster node and impact other containers on that node. Where admission controller services run in the cluster they’re protecting, it’s important to ensure that any requirement for privileged workloads is carefully reviewed and restricted as much as possible.
  • Strictly control external system access. As a security service in a cluster admission controller systems will have access to sensitive information like credentials. To reduce the risk of this information being sent outside the cluster, network policies should be used to restrict the admission controller services access to external networks.
  • Each cluster has a dedicated webhook. Whilst it may be possible to have admission controller webhooks that serve multiple clusters, there is a risk when using that model that an attack on the webhook service would have a larger impact where it’s shared. Also where multiple clusters use an admission controller there will be increased complexity and access requirements, making it harder to secure.

Admission controller rules

A key element of any admission controller used for Kubernetes security is the rulebase it uses. The rules need to be able to accurately meet their goals avoiding false positive and false negative results.

  • Regularly test and review rules. Admission controller rules need to be tested to ensure their accuracy. They also need to be regularly reviewed as the Kubernetes API will change with each new version, and rules need to be assessed with each Kubernetes release to understand any changes that may be required to keep them up to date.

Meet Our Contributors - APAC (India region)

Authors & Interviewers: Anubhav Vardhan, Atharva Shinde, Avinesh Tripathi, Debabrata Panigrahi, Kunal Verma, Pranshu Srivastava, Pritish Samal, Purneswar Prasad, Vedant Kakde

Editor: Priyanka Saggu


Good day, everyone 👋

Welcome to the first episode of the APAC edition of the "Meet Our Contributors" blog post series.

In this post, we'll introduce you to five amazing folks from the India region who have been actively contributing to the upstream Kubernetes projects in a variety of ways, as well as being the leaders or maintainers of numerous community initiatives.

💫 Let's get started, so without further ado…

Arsh Sharma

Arsh is currently employed with Okteto as a Developer Experience engineer. As a new contributor, he realised that 1:1 mentorship opportunities were quite beneficial in getting him started with the upstream project.

He is presently a CI Signal shadow on the Kubernetes 1.23 release team. He is also contributing to the SIG Testing and SIG Docs projects, as well as to the cert-manager tools development work that is being done under the aegis of SIG Architecture.

To the newcomers, Arsh helps plan their early contributions sustainably.

I would encourage folks to contribute in a way that's sustainable. What I mean by that is that it's easy to be very enthusiastic early on and take up more stuff than one can actually handle. This can often lead to burnout in later stages. It's much more sustainable to work on things iteratively.

Kunal Kushwaha

Kunal Kushwaha is a core member of the Kubernetes marketing council. He is also a CNCF ambassador and one of the founders of the CNCF Students Program.. He also served as a Communications role shadow during the 1.22 release cycle.

At the end of his first year, Kunal began contributing to the fabric8io kubernetes-client project. He was then selected to work on the same project as part of Google Summer of Code. Kunal mentored people on the same project, first through Google Summer of Code then through Google Code-in.

As an open-source enthusiast, he believes that diverse participation in the community is beneficial since it introduces new perspectives and opinions and respect for one's peers. He has worked on various open-source projects, and his participation in communities has considerably assisted his development as a developer.

I believe if you find yourself in a place where you do not know much about the project, that's a good thing because now you can learn while contributing and the community is there to help you. It has helped me a lot in gaining skills, meeting people from around the world and also helping them. You can learn on the go, you don't have to be an expert. Make sure to also check out no code contributions because being a beginner is a skill and you can bring new perspectives to the organisation.

Madhav Jivarajani

Madhav Jivarajani works on the VMware Upstream Kubernetes stability team. He began contributing to the Kubernetes project in January 2021 and has since made significant contributions to several areas of work under SIG Architecture, SIG API Machinery, and SIG ContribEx (contributor experience).

Among several significant contributions are his recent efforts toward the Archival of design proposals, refactoring the "groups" codebase under k8s-infra repository to make it mockable and testable, and improving the functionality of the GitHub k8s bot.

In addition to his technical efforts, Madhav oversees many projects aimed at assisting new contributors. He organises bi-weekly "KEP reading club" sessions to help newcomers understand the process of adding new features, deprecating old ones, and making other key changes to the upstream project. He has also worked on developing Katacoda scenarios to assist new contributors to become acquainted with the process of contributing to k/k. In addition to his current efforts to meet with community members every week, he has organised several new contributors workshops (NCW).

I initially did not know much about Kubernetes. I joined because the community was super friendly. But what made me stay was not just the people, but the project itself. My solution to not feeling overwhelmed in the community was to gain as much context and knowledge into the topics that I was interested in and were being discussed. And as a result I continued to dig deeper into Kubernetes and the design of it. I am a systems nut & thus Kubernetes was an absolute goldmine for me.

Rajas Kakodkar

Rajas Kakodkar currently works at VMware as a Member of Technical Staff. He has been engaged in many aspects of the upstream Kubernetes project since 2019.

He is now a key contributor to the Testing special interest group. He is also active in the SIG Network community. Lately, Rajas has contributed significantly to the NetworkPolicy++ and kpng sub-projects.

One of the first challenges he ran across was that he was in a different time zone than the upstream project's regular meeting hours. However, async interactions on community forums progressively corrected that problem.

I enjoy contributing to Kubernetes not just because I get to work on cutting edge tech but more importantly because I get to work with awesome people and help in solving real world problems.

Rajula Vineet Reddy

Rajula Vineet Reddy, a Junior Engineer at CERN, is a member of the Marketing Council team under SIG ContribEx . He also served as a release shadow for SIG Release during the 1.22 and 1.23 Kubernetes release cycles.

He started looking at the Kubernetes project as part of a university project with the help of one of his professors. Over time, he spent a significant amount of time reading the project's documentation, Slack discussions, GitHub issues, and blogs, which helped him better grasp the Kubernetes project and piqued his interest in contributing upstream. One of his key contributions was his assistance with automation in the SIG ContribEx Upstream Marketing subproject.

According to Rajula, attending project meetings and shadowing various project roles are vital for learning about the community.

I find the community very helpful and it's always “you get back as much as you contribute”. The more involved you are, the more you will understand, get to learn and contribute new things.

The first step to “come forward and start” is hard. But it's all gonna be smooth after that. Just take that jump.


If you have any recommendations/suggestions for who we should interview next, please let us know in #sig-contribex. We're thrilled to have other folks assisting us in reaching out to even more wonderful individuals of the community. Your suggestions would be much appreciated.

We'll see you all in the next one. Everyone, till then, have a happy contributing! 👋

Kubernetes is Moving on From Dockershim: Commitments and Next Steps

Kubernetes is removing dockershim in the upcoming v1.24 release. We're excited to reaffirm our community values by supporting open source container runtimes, enabling a smaller kubelet, and increasing engineering velocity for teams using Kubernetes. If you use Docker Engine as a container runtime for your Kubernetes cluster, get ready to migrate in 1.24! To check if you're affected, refer to Check whether dockershim removal affects you.

Why we’re moving away from dockershim

Docker was the first container runtime used by Kubernetes. This is one of the reasons why Docker is so familiar to many Kubernetes users and enthusiasts. Docker support was hardcoded into Kubernetes – a component the project refers to as dockershim. As containerization became an industry standard, the Kubernetes project added support for additional runtimes. This culminated in the implementation of the container runtime interface (CRI), letting system components (like the kubelet) talk to container runtimes in a standardized way. As a result, dockershim became an anomaly in the Kubernetes project. Dependencies on Docker and dockershim have crept into various tools and projects in the CNCF ecosystem ecosystem, resulting in fragile code.

By removing the dockershim CRI, we're embracing the first value of CNCF: "Fast is better than slow". Stay tuned for future communications on the topic!

Deprecation timeline

We formally announced the dockershim deprecation in December 2020. Full removal is targeted in Kubernetes 1.24, in April 2022. This timeline aligns with our deprecation policy, which states that deprecated behaviors must function for at least 1 year after their announced deprecation.

We'll support Kubernetes version 1.23, which includes dockershim, for another year in the Kubernetes project. For managed Kubernetes providers, vendor support is likely to last even longer, but this is dependent on the companies themselves. Regardless, we're confident all cluster operations will have time to migrate. If you have more questions about the dockershim removal, refer to the Dockershim Deprecation FAQ.

We asked you whether you feel prepared for the migration from dockershim in this survey: Are you ready for Dockershim removal. We had over 600 responses. To everybody who took time filling out the survey, thank you.

The results show that we still have a lot of ground to cover to help you to migrate smoothly. Other container runtimes exist, and have been promoted extensively. However, many users told us they still rely on dockershim, and sometimes have dependencies that need to be re-worked. Some of these dependencies are outside of your control. Based on your feedback, here are some of the steps we are taking to help.

Our next steps

Based on the feedback you provided:

  • CNCF and the 1.24 release team are committed to delivering documentation in time for the 1.24 release. This includes more informative blog posts like this one, updating existing code samples, tutorials, and tasks, and producing a migration guide for cluster operators.
  • We are reaching out to the rest of the CNCF community to help prepare them for this change.

If you're part of a project with dependencies on dockershim, or if you're interested in helping with the migration effort, please join us! There's always room for more contributors, whether to our transition tools or to our documentation. To get started, say hello in the #sig-node channel on Kubernetes Slack!

Final thoughts

As a project, we've already seen cluster operators increasingly adopt other container runtimes through 2021. We believe there are no major blockers to migration. The steps we're taking to improve the migration experience will light the path more clearly for you.

We understand that migration from dockershim is yet another action you may need to do to keep your Kubernetes infrastructure up to date. For most of you, this step will be straightforward and transparent. In some cases, you will encounter hiccups or issues. The community has discussed at length whether postponing the dockershim removal would be helpful. For example, we recently talked about it in the SIG Node discussion on November 11th and in the Kubernetes Steering committee meeting held on December 6th. We already postponed it once in 2021 because the adoption rate of other runtimes was lower than we wanted, which also gave us more time to identify potential blocking issues.

At this point, we believe that the value that you (and Kubernetes) gain from dockershim removal makes up for the migration effort you'll have. Start planning now to avoid surprises. We'll have more updates and guides before Kubernetes 1.24 is released.

Kubernetes-in-Kubernetes and the WEDOS PXE bootable server farm

When you own two data centers, thousands of physical servers, virtual machines and hosting for hundreds of thousands sites, Kubernetes can actually simplify the management of all these things. As practice has shown, by using Kubernetes, you can declaratively describe and manage not only applications, but also the infrastructure itself. I work for the largest Czech hosting provider WEDOS Internet a.s and today I'll show you two of my projects — Kubernetes-in-Kubernetes and Kubefarm.

With their help you can deploy a fully working Kubernetes cluster inside another Kubernetes using Helm in just a couple of commands. How and why?

Let me introduce you to how our infrastructure works. All our physical servers can be divided into two groups: control-plane and compute nodes. Control plane nodes are usually set up manually, have a stable OS installed, and designed to run all cluster services including Kubernetes control-plane. The main task of these nodes is to ensure the smooth operation of the cluster itself. Compute nodes do not have any operating system installed by default, instead they are booting the OS image over the network directly from the control plane nodes. Their work is to carry out the workload.

Kubernetes cluster layout

Once nodes have downloaded their image, they can continue to work without keeping connection to the PXE server. That is, a PXE server is just keeping rootfs image and does not hold any other complex logic. After our nodes have booted, we can safely restart the PXE server, nothing critical will happen to them.

Kubernetes cluster after bootstrapping

After booting, the first thing our nodes do is join to the existing Kubernetes cluster, namely, execute the kubeadm join command so that kube-scheduler could schedule some pods on them and launch various workloads afterwards. From the beginning we used the scheme when nodes were joined into the same cluster used for the control-plane nodes.

Kubernetes scheduling containers to the compute nodes

This scheme worked stably for over two years. However later we decided to add containerized Kubernetes to it. And now we can spawn new Kubernetes-clusters very easily right on our control-plane nodes which are now member special admin-clusters. Now, compute nodes can be joined directly to their own clusters - depending on the configuration.

Multiple clusters are running in single Kubernetes, compute nodes joined to them

Kubefarm

This project came with the goal of enabling anyone to deploy such an infrastructure in just a couple of commands using Helm and get about the same in the end.

At this time, we moved away from the idea of a monocluster. Because it turned out to be not very convenient for managing work of several development teams in the same cluster. The fact is that Kubernetes was never designed as a multi-tenant solution and at the moment it does not provide sufficient means of isolation between projects. Therefore, running separate clusters for each team turned out to be a good idea. However, there should not be too many clusters, to let them be convenient to manage. Nor is it too small to have sufficient independence between development teams.

The scalability of our clusters became noticeably better after that change. The more clusters you have per number of nodes, the smaller the failure domain and the more stable they work. And as a bonus, we got a fully declaratively described infrastructure. Thus, now you can deploy a new Kubernetes cluster in the same way as deploying any other application in Kubernetes.

It uses Kubernetes-in-Kubernetes as a basis, LTSP as PXE-server from which the nodes are booted, and automates the DHCP server configuration using dnsmasq-controller:

Kubefarm

How it works

Now let's see how it works. In general, if you look at Kubernetes as from an application perspective, you can note that it follows all the principles of The Twelve-Factor App, and is actually written very well. Thus, it means running Kubernetes as an app in a different Kubernetes shouldn't be a big deal.

Running Kubernetes in Kubernetes

Now let's take a look at the Kubernetes-in-Kubernetes project, which provides a ready-made Helm chart for running Kubernetes in Kubernetes.

Here is the parameters that you can pass to Helm in the values file:

Kubernetes is just five binaries

Beside persistence (storage parameters for the cluster), the Kubernetes control-plane components are described here: namely: etcd cluster, apiserver, controller-manager and scheduler. These are pretty much standard Kubernetes components. There is a light-hearted saying that “Kubernetes is just five binaries”. So here is where the configuration for these binaries is located.

If you ever tried to bootstrap a cluster using kubeadm, then this config will remind you it's configuration. But in addition to Kubernetes entities, you also have an admin container. In fact, it is a container which holds two binaries inside: kubectl and kubeadm. They are used to generate kubeconfig for the above components and to perform the initial configuration for the cluster. Also, in an emergency, you can always exec into it to check and manage your cluster.

After the release has been deployed, you can see a list of pods: admin-container, apiserver in two replicas, controller-manager, etcd-cluster, scheduller and the initial job that initializes the cluster. In the end you have a command, which allows you to get shell into the admin container, you can use it to see what is happening inside:

Also, let's take look at the certificates. If you've ever installed Kubernetes, then you know that it has a scary directory /etc/kubernetes/pki with a bunch of some certificates. In case of Kubernetes-in-Kubernetes, you have fully automated management of them with cert-manager. Thus, it is enough to pass all certificates parameters to Helm during installation, and all the certificates will automatically be generated for your cluster.

Looking at one of the certificates, eg. apiserver, you can see that it has a list of DNS names and IP addresses. If you want to make this cluster accessible outside, then just describe the additional DNS names in the values file and update the release. This will update the certificate resource, and cert-manager will regenerate the certificate. You'll no longer need to think about this. If kubeadm certificates need to be renewed at least once a year, here the cert-manager will take care and automatically renew them.

Now let's log into the admin container and look at the cluster and nodes. Of course, there are no nodes, yet, because at the moment you have deployed just the blank control-plane for Kubernetes. But in kube-system namespace you can see some coredns pods waiting for scheduling and configmaps already appeared. That is, you can conclude that the cluster is working:

Here is the diagram of the deployed cluster. You can see services for all Kubernetes components: apiserver, controller-manager, etcd-cluster and scheduler. And the pods on right side to which they forward traffic.

By the way, the diagram, is drawn in ArgoCD — the GitOps tool we use to manage our clusters, and cool diagrams are one of its features.

Orchestrating physical servers

OK, now you can see the way how is our Kubernetes control-plane deployed, but what about worker nodes, how are we adding them? As I already said, all our servers are bare metal. We do not use virtualization to run Kubernetes, but we orchestrate all physical servers by ourselves.

Also, we do use Linux network boot feature very actively. Moreover, this is exactly the booting, not some kind of automation of the installation. When the nodes are booting, they just run a ready-made system image for them. That is, to update any node, we just need to reboot it - and it will download a new image. It is very easy, simple and convenient.

For this, the Kubefarm project was created, which allows you to automate this. The most commonly used examples can be found in the examples directory. The most standard of them named generic. Let's take a look at values.yaml:

Here you can specify the parameters which are passed into the upstream Kubernetes-in-Kubernetes chart. In order for you control-plane to be accessible from the outside, it is enough to specify the IP address here, but if you wish, you can specify some DNS name here.

In the PXE server configuration you can specify a timezone. You can also add an SSH key for logging in without a password (but you can also specify a password), as well as kernel modules and parameters that should be applied during booting the system.

Next comes the nodePools configuration, i.e. the nodes themselves. If you've ever used a terraform module for gke, then this logic will remind you of it. Here you statically describe all nodes with a set of parameters:

  • Name (hostname);

  • MAC-addresses — we have nodes with two network cards, and each one can boot from any of the MAC addresses specified here.

  • IP-address, which the DHCP server should issue to this node.

In this example, you have two pools: the first has five nodes, the second has only one, the second pool has also two tags assigned. Tags are the way to describe configuration for specific nodes. For example, you can add specific DHCP options for some pools, options for the PXE server for booting (e.g. here is debug option enabled) and set of kubernetesLabels and kubernetesTaints options. What does that mean?

For example, in this configuration you have a second nodePool with one node. The pool has debug and foo tags assigned. Now see the options for foo tag in kubernetesLabels. This means that the m1c43 node will boot with these two labels and taint assigned. Everything seems to be simple. Now let's try this in practice.

Demo

Go to examples and update previously deployed chart to Kubefarm. Just use the generic parameters and look at the pods. You can see that a PXE server and one more job were added. This job essentially goes to the deployed Kubernetes cluster and creates a new token. Now it will run repeatedly every 12 hours to generate a new token, so that the nodes can connect to your cluster.

In a graphical representation, it looks about the same, but now apiserver started to be exposed outside.

In the diagram, the IP is highlighted in green, the PXE server can be reached through it. At the moment, Kubernetes does not allow creating a single LoadBalancer service for TCP and UDP protocols by default, so you have to create two different services with the same IP address. One is for TFTP, and the second for HTTP, through which the system image is downloaded.

But this simple example is not always enough, sometimes you might need to modify the logic at boot. For example, here is a directory advanced_network, inside which there is a values file with a simple shell script. Let's call it network.sh:

All this script does is take environment variables at boot time, and generates a network configuration based on them. It creates a directory and puts the netplan config inside. For example, a bonding interface is created here. Basically, this script can contain everything you need. It can hold the network configuration or generate the system services, add some hooks or describe any other logic. Anything that can be described in bash or shell languages will work here, and it will be executed at boot time.

Let's see how it can be deployed. Let's pass the generic values file as the first parameter, and an additional values file as the second parameter. This is a standard Helm feature. This way you can also pass the secrets, but in this case, the configuration is just expanded by the second file:

Let's look at the configmap foo-kubernetes-ltsp for the netboot server and make sure that network.sh script is really there. These commands used to configure the network at boot time:

Here you can see how it works in principle. The chassis interface (we use HPE Moonshots 1500) have the nodes, you can enter show node list command to get a list of all the nodes. Now you can see the booting process.

You can also get their MAC addresses by show node macaddr all command. We have a clever operator that collects MAC-addresses from chassis automatically and passes them to the DHCP server. Actually, it's just creating custom configuration resources for dnsmasq-controller which is running in same admin Kubernetes cluster. Also, trough this interface you can control the nodes themselves, e.g. turn them on and off.

If you have no such opportunity to enter the chassis through iLO and collect a list of MAC addresses for your nodes, you can consider using catchall cluster pattern. Purely speaking, it is just a cluster with a dynamic DHCP pool. Thus, all nodes that are not described in the configuration to other clusters will automatically join to this cluster.

For example, you can see a special cluster with some nodes. They are joined to the cluster with an auto-generated name based on their MAC address. Starting from this point you can connect to them and see what happens there. Here you can somehow prepare them, for example, set up the file system and then rejoin them to another cluster.

Now let's try connecting to the node terminal and see how it is booting. After the BIOS, the network card is configured, here it sends a request to the DHCP server from a specific MAC address, which redirects it to a specific PXE server. Later the kernel and initrd image are downloaded from the server using the standard HTTP protocol:

After loading the kernel, the node downloads the rootfs image and transfers control to systemd. Then the booting proceeds as usual, and after that the node joins Kubernetes:

If you take a look at fstab, you can see only two entries there: /var/lib/docker and /var/lib/kubelet, they are mounted as tmpfs (in fact, from RAM). At the same time, the root partition is mounted as overlayfs, so all changes that you make here on the system will be lost on the next reboot.

Looking into the block devices on the node, you can see some nvme disk, but it has not yet been mounted anywhere. There is also a loop device - this is the exact rootfs image downloaded from the server. At the moment it is located in RAM, occupies 653 MB and mounted with the loop option.

If you look in /etc/ltsp, you find the network.sh file that was executed at boot. From containers, you can see running kube-proxy and pause container for it.

Details

Network Boot Image

But where does the main image come from? There is a little trick here. The image for the nodes is built through the Dockerfile along with the server. The Docker multi-stage build feature allows you to easily add any packages and kernel modules exactly at the stage of the image build. It looks like this:

What's going on here? First, we take a regular Ubuntu 20.04 and install all the packages we need. First of all we install the kernel, lvm, systemd, ssh. In general, everything that you want to see on the final node should be described here. Here we also install docker with kubelet and kubeadm, which are used to join the node to the cluster.

And then we perform an additional configuration. In the last stage, we simply install tftp and nginx (which serves our image to clients), grub (bootloader). Then root of the previous stages copied into the final image and generate squashed image from it. That is, in fact, we get a docker image, which has both the server and the boot image for our nodes. At the same time, it can be easily updated by changing the Dockerfile.

Webhooks and API aggregation layer

I want to pay special attention to the problem of webhooks and aggregation layer. In general, webhooks is a Kubernetes feature that allows you to respond to the creation or modification of any resources. Thus, you can add a handler so that when resources are applied, Kubernetes must send request to some pod and check if configuration of this resource is correct, or make additional changes to it.

But the point is, in order for the webhooks to work, the apiserver must have direct access to the cluster for which it is running. And if it is started in a separate cluster, like our case, or even separately from any cluster, then Konnectivity service can help us here. Konnectivity is one of the optional but officially supported Kubernetes components.

Let's take cluster of four nodes for example, each of them is running a kubelet and we have other Kubernetes components running outside: kube-apiserver, kube-scheduler and kube-controller-manager. By default, all these components interact with the apiserver directly - this is the most known part of the Kubernetes logic. But in fact, there is also a reverse connection. For example, when you want to view the logs or run a kubectl exec command, the API server establishes a connection to the specific kubelet independently:

Kubernetes apiserver reaching kubelet

But the problem is that if we have a webhook, then it usually runs as a standard pod with a service in our cluster. And when apiserver tries to reach it, it will fail because it will try to access an in-cluster service named webhook.namespace.svc being outside of the cluster where it is actually running:

Kubernetes apiserver can't reach webhook

And here Konnectivity comes to our rescue. Konnectivity is a tricky proxy server developed especially for Kubernetes. It can be deployed as a server next to the apiserver. And Konnectivity-agent is deployed in several replicas directly in the cluster you want to access. The agent establishes a connection to the server and sets up a stable channel to make apiserver able to access all webhooks and all kubelets in the cluster. Thus, now all communication with the cluster will take place through the Konnectivity-server:

Kubernetes apiserver reaching webhook via konnectivity

Our plans

Of course, we are not going to stop at this stage. People interested in the project often write to me. And if there will be a sufficient number of interested people, I hope to move Kubernetes-in-Kubernetes project under Kubernetes SIGs, by representing it in form of the official Kubernetes Helm chart. Perhaps, by making this project independent we'll gather an even larger community.

I am also thinking of integrating it with the Machine Controller Manager, which would allow creating worker nodes, not only of physical servers, but also, for example, for creating virtual machines using kubevirt and running them in the same Kubernetes cluster. By the way, it also allows to spawn virtual machines in the clouds, and have a control-plane deployed locally.

I am also considering the option of integrating with the Cluster-API so that you can create physical Kubefarm clusters directly through the Kubernetes environment. But at the moment I'm not completely sure about this idea. If you have any thoughts on this matter, I'll be happy to listen to them.

Using Admission Controllers to Detect Container Drift at Runtime

Introductory illustration

Illustration by Munire Aireti

At Box, we use Kubernetes (K8s) to manage hundreds of micro-services that enable Box to stream data at a petabyte scale. When it comes to the deployment process, we run kube-applier as part of the GitOps workflows with declarative configuration and automated deployment. Developers declare their K8s apps manifest into a Git repository that requires code reviews and automatic checks to pass, before any changes can get merged and applied inside our K8s clusters. With kubectl exec and other similar commands, however, developers are able to directly interact with running containers and alter them from their deployed state. This interaction could then subvert the change control and code review processes that are enforced in our CI/CD pipelines. Further, it allows such impacted containers to continue receiving traffic long-term in production.

To solve this problem, we developed our own K8s component called kube-exec-controller along with its corresponding kubectl plugin. They function together in detecting and terminating potentially mutated containers (caused by interactive kubectl commands), as well as revealing the interaction events directly to the target Pods for better visibility.

Admission control for interactive kubectl commands

Once a request is sent to K8s, it needs to be authenticated and authorized by the API server to proceed. Additionally, K8s has a separate layer of protection called admission controllers, which can intercept the request before an object is persisted in etcd. There are various predefined admission controls compiled into the API server binary (e.g. ResourceQuota to enforce hard resource usage limits per namespace). Besides, there are two dynamic admission controls named MutatingAdmissionWebhook and ValidatingAdmissionWebhook, used for mutating or validating K8s requests respectively. The latter is what we adopted to detect container drift at runtime caused by interactive kubectl commands. This whole process can be divided into three steps as explained in detail below.

1. Admit interactive kubectl command requests

First of all, we needed to enable a validating webhook that sends qualified requests to kube-exec-controller. To add the new validation mechanism applying to interactive kubectl commands specifically, we configured the webhook’s rules with resources as [pods/exec, pods/attach], and operations as CONNECT. These rules tell the cluster's API server that all exec and attach requests should be subject to our admission control webhook. In the ValidatingAdmissionWebhook that we configured, we specified a service reference (could also be replaced with url that gives the location of the webhook) and caBundle to allow validating its X.509 certificate, both under the clientConfig stanza.

Here is a short example of what our ValidatingWebhookConfiguration object looks like:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-validating-webhook-config
webhooks:
  - name: validate-pod-interaction.example.com
    sideEffects: None
    rules:
      - apiGroups: ["*"]
        apiVersions: ["*"]
        operations: ["CONNECT"]
        resources: ["pods/exec", "pods/attach"]
    failurePolicy: Fail
    clientConfig:
      service:
        # reference to kube-exec-controller service deployed inside the K8s cluster
        name: example-service
        namespace: kube-exec-controller
        path: "/admit-pod-interaction"
      caBundle: "{{VALUE}}" # PEM encoded CA bundle to validate kube-exec-controller's certificate
    admissionReviewVersions: ["v1", "v1beta1"]

2. Label the target Pod with potentially mutated containers

Once a request of kubectl exec comes in, kube-exec-controller makes an internal note to label the associated Pod. The added labels mean that we can not only query all the affected Pods, but also enable the security mechanism to retrieve previously identified Pods, in case the controller service itself gets restarted.

The admission control process cannot directly modify the targeted in its admission response. This is because the pods/exec request is against a subresource of the Pod API, and the API kind for that subresource is PodExecOptions. As a result, there is a separate process in kube-exec-controller that patches the labels asynchronously. The admission control always permits the exec request, then acts as a client of the K8s API to label the target Pod and to log related events. Developers can check whether their Pods are affected or not using kubectl or similar tools. For example:

$ kubectl get pod --show-labels
NAME      READY  STATUS   RESTARTS  AGE  LABELS
test-pod  1/1    Running  0         2s   box.com/podInitialInteractionTimestamp=1632524400,box.com/podInteractorUsername=username-1,box.com/podTTLDuration=1h0m0s

$ kubectl describe pod test-pod
...
Events:
Type       Reason            Age     From                            Message
----       ------            ----    ----                            -------
Warning    PodInteraction    5s      admission-controller-service    Pod was interacted with 'kubectl exec' command by user 'username-1' initially at time 2021-09-24 16:00:00 -0800 PST
Warning    PodInteraction    5s      admission-controller-service    Pod will be evicted at time 2021-09-24 17:00:00 -0800 PST (in about 1h0m0s).

3. Evict the target Pod after a predefined period

As you can see in the above event messages, the affected Pod is not evicted immediately. At times, developers might have to get into their running containers necessarily for debugging some live issues. Therefore, we define a time to live (TTL) of affected Pods based on the environment of clusters they are running. In particular, we allow a longer time in our dev clusters as it is more common to run kubectl exec or other interactive commands for active development.

For our production clusters, we specify a lower time limit so as to avoid the impacted Pods serving traffic abidingly. The kube-exec-controller internally sets and tracks a timer for each Pod that matches the associated TTL. Once the timer is up, the controller evicts that Pod using K8s API. The eviction (rather than deletion) is to ensure service availability, since the cluster respects any configured PodDisruptionBudget (PDB). Let's say if a user has defined x number of Pods as critical in their PDB, the eviction (as requested by kube-exec-controller) does not continue when the target workload has fewer than x Pods running.

Here comes a sequence diagram of the entire workflow mentioned above:

Sequence Diagram

A new kubectl plugin for better user experience

Our admission controller component works great for solving the container drift issue we had on the platform. It is also able to submit all related Events to the target Pod that has been affected. However, K8s clusters don't retain Events very long (the default retention period is one hour). We need to provide other ways for developers to get their Pod interaction activity. A kubectl plugin is a perfect choice for us to expose this information. We named our plugin kubectl pi (short for pod-interaction) and provide two subcommands: get and extend.

When the get subcommand is called, the plugin checks the metadata attached by our admission controller and transfers it to human-readable information. Here is an example output from running kubectl pi get:

$ kubectl pi get test-pod
POD-NAME  INTERACTOR  POD-TTL  EXTENSION  EXTENSION-REQUESTER  EVICTION-TIME
test-pod  username-1  1h0m0s   /          /                    2021-09-24 17:00:00 -0800 PST

The plugin can also be used to extend the TTL for a Pod that is marked for future eviction. This is useful in case developers need extra time to debug ongoing issues. To achieve this, a developer uses the kubectl pi extend subcommand, where the plugin patches the relevant annotations for the given Pod. These annotations include the duration and username who made the extension request for transparency (displayed in the table returned from the kubectl pi get command).

Correspondingly, there is another webhook defined in kube-exec-controller which admits valid annotation updates. Once admitted, those updates reset the eviction timer of the target Pod as requested. An example of requesting the extension from the developer side would be:

$ kubectl pi extend test-pod --duration=30m
Successfully extended the termination time of pod/test-pod with a duration=30m
 
$ kubectl pi get test-pod
POD-NAME  INTERACTOR  POD-TTL  EXTENSION  EXTENSION-REQUESTER  EVICTION-TIME
test-pod  username-1  1h0m0s   30m        username-2           2021-09-24 17:30:00 -0800 PST

Future improvement

Although our admission controller service works great in handling interactive requests to a Pod, it could as well evict the Pod while the actual commands are no-op in these requests. For instance, developers sometimes run kubectl exec merely to check their service logs stored on hosts. Nevertheless, the target Pods would still get bounced despite the state of their containers not changing at all. One of the improvements here could be adding the ability to distinguish the commands that are passed to the interactive requests, so that no-op commands should not always force a Pod eviction. However, this becomes challenging when developers get a shell to a running container and execute commands inside the shell, since they will no longer be visible to our admission controller service.

Another item worth pointing out here is the choice of using K8s labels and annotations. In our design, we decided to have all immutable metadata attached as labels for better enforcing the immutability in our admission control. Yet some of these metadata could fit better as annotations. For instance, we had a label with the key box.com/podInitialInteractionTimestamp used to list all affected Pods in kube-exec-controller code, although its value would be unlikely to query for. As a more ideal design in the K8s world, a single label could be preferable in our case for identification with other metadata applied as annotations instead.

Summary

With the power of admission controllers, we are able to secure our K8s clusters by detecting potentially mutated containers at runtime, and evicting their Pods without affecting service availability. We also utilize kubectl plugins to provide flexibility of the eviction time and hence, bringing a better and more self-independent experience to service owners. We are proud to announce that we have open-sourced the whole project for the community to leverage in their own K8s clusters. Any contribution is more than welcomed and appreciated. You can find this project hosted on GitHub at https://github.com/box/kube-exec-controller

Special thanks to Ayush Sobti and Ethan Goldblum for their technical guidance on this project.

What's new in Security Profiles Operator v0.4.0

The Security Profiles Operator (SPO) is an out-of-tree Kubernetes enhancement to make the management of seccomp, SELinux and AppArmor profiles easier and more convenient. We're happy to announce that we recently released v0.4.0 of the operator, which contains a ton of new features, fixes and usability improvements.

What's new

It has been a while since the last v0.3.0 release of the operator. We added new features, fine-tuned existing ones and reworked our documentation in 290 commits over the past half year.

One of the highlights is that we're now able to record seccomp and SELinux profiles using the operators log enricher. This allows us to reduce the dependencies required for profile recording to have auditd or syslog (as fallback) running on the nodes. All profile recordings in the operator work in the same way by using the ProfileRecording CRD as well as their corresponding label selectors. The log enricher itself can be also used to gather meaningful insights about seccomp and SELinux messages of a node. Checkout the official documentation to learn more about it.

Beside the log enricher based recording we now offer an alternative to record seccomp profiles by utilizing ebpf. This optional feature can be enabled by setting enableBpfRecorder to true. This results in running a dedicated container, which ships a custom bpf module on every node to collect the syscalls for containers. It even supports older Kernel versions which do not expose the BPF Type Format (BTF) per default as well as the amd64 and arm64 architectures. Checkout our documentation to see it in action. By the way, we now add the seccomp profile architecture of the recorder host to the recorded profile as well.

We also graduated the seccomp profile API from v1alpha1 to v1beta1. This aligns with our overall goal to stabilize the CRD APIs over time. The only thing which has changed is that the seccomp profile type Architectures now points to []Arch instead of []*Arch.

SELinux enhancements

Managing SELinux policies (an equivalent to using semodule that you would normally call on a single server) is not done by SPO itself, but by another container called selinuxd to provide better isolation. This release switched to using selinuxd containers from a personal repository to images located under our team's quay.io repository. The selinuxd repository has moved as well to the containers GitHub organization.

Please note that selinuxd links dynamically to libsemanage and mounts the SELinux directories from the nodes, which means that the selinuxd container must be running the same distribution as the cluster nodes. SPO defaults to using CentOS-8 based containers, but we also build Fedora based ones. If you are using another distribution and would like us to add support for it, please file an issue against selinuxd.

Profile Recording

This release adds support for recording of SELinux profiles. The recording itself is managed via an instance of a ProfileRecording Custom Resource as seen in an example in our repository. From the user's point of view it works pretty much the same as recording of seccomp profiles.

Under the hood, to know what the workload is doing SPO installs a special permissive policy called selinuxrecording on startup which allows everything and logs all AVCs to audit.log. These AVC messages are scraped by the log enricher component and when the recorded workload exits, the policy is created.

SELinuxProfile CRD graduation

An v1alpha2 version of the SelinuxProfile object has been introduced. This removes the raw Common Intermediate Language (CIL) from the object itself and instead adds a simple policy language to ease the writing and parsing experience.

Alongside, a RawSelinuxProfile object was also introduced. This contains a wrapped and raw representation of the policy. This was intended for folks to be able to take their existing policies into use as soon as possible. However, on validations are done here.

AppArmor support

This version introduces the initial support for AppArmor, allowing users to load and unload AppArmor profiles into cluster nodes by using the new AppArmorProfile CRD.

To enable AppArmor support use the enableAppArmor feature gate switch of your SPO configuration. Then use our apparmor example to deploy your first profile across your cluster.

Metrics

The operator now exposes metrics, which are described in detail in our new metrics documentation. We decided to secure the metrics retrieval process by using kube-rbac-proxy, while we ship an additional spo-metrics-client cluster role (and binding) to retrieve the metrics from within the cluster. If you're using OpenShift, then we provide an out of the box working ServiceMonitor to access the metrics.

Debuggability and robustness

Beside all those new features, we decided to restructure parts of the Security Profiles Operator internally to make it better to debug and more robust. For example, we now maintain an internal gRPC API to communicate within the operator across different features. We also improved the performance of the log enricher, which now caches results for faster retrieval of the log data. The operator can be put into a more verbose log mode by setting verbosity from 0 to 1.

We also print the used libseccomp and libbpf versions on startup, as well as expose CPU and memory profiling endpoints for each container via the enableProfiling option. Dedicated liveness and startup probes inside of the operator daemon will now additionally improve the life cycle of the operator.

Conclusion

Thank you for reading this update. We're looking forward to future enhancements of the operator and would love to get your feedback about the latest release. Feel free to reach out to us via the Kubernetes slack #security-profiles-operator for any feedback or question.

Kubernetes 1.23: StatefulSet PVC Auto-Deletion (alpha)

Kubernetes v1.23 introduced a new, alpha-level policy for StatefulSets that controls the lifetime of PersistentVolumeClaims (PVCs) generated from the StatefulSet spec template for cases when they should be deleted automatically when the StatefulSet is deleted or pods in the StatefulSet are scaled down.

What problem does this solve?

A StatefulSet spec can include Pod and PVC templates. When a replica is first created, the Kubernetes control plane creates a PVC for that replica if one does not already exist. The behavior before Kubernetes v1.23 was that the control plane never cleaned up the PVCs created for StatefulSets - this was left up to the cluster administrator, or to some add-on automation that you’d have to find, check suitability, and deploy. The common pattern for managing PVCs, either manually or through tools such as Helm, is that the PVCs are tracked by the tool that manages them, with explicit lifecycle. Workflows that use StatefulSets must determine on their own what PVCs are created by a StatefulSet and what their lifecycle should be.

Before this new feature, when a StatefulSet-managed replica disappears, either because the StatefulSet is reducing its replica count, or because its StatefulSet is deleted, the PVC and its backing volume remains and must be manually deleted. While this behavior is appropriate when the data is critical, in many cases the persistent data in these PVCs is either temporary, or can be reconstructed from another source. In those cases, PVCs and their backing volumes remaining after their StatefulSet or replicas have been deleted are not necessary, incur cost, and require manual cleanup.

The new StatefulSet PVC retention policy

If you enable the alpha feature, a StatefulSet spec includes a PersistentVolumeClaim retention policy. This is used to control if and when PVCs created from a StatefulSet’s volumeClaimTemplate are deleted. This first iteration of the retention policy contains two situations where PVCs may be deleted.

The first situation is when the StatefulSet resource is deleted (which implies that all replicas are also deleted). This is controlled by the whenDeleted policy. The second situation, controlled by whenScaled is when the StatefulSet is scaled down, which removes some but not all of the replicas in a StatefulSet. In both cases the policy can either be Retain, where the corresponding PVCs are not touched, or Delete, which means that PVCs are deleted. The deletion is done with a normal object deletion, so that, for example, all retention policies for the underlying PV are respected.

This policy forms a matrix with four cases. I’ll walk through and give an example for each one.

  • whenDeleted and whenScaled are both Retain. This matches the existing behavior for StatefulSets, where no PVCs are deleted. This is also the default retention policy. It’s appropriate to use when data on StatefulSet volumes may be irreplaceable and should only be deleted manually.

  • whenDeleted is Delete and whenScaled is Retain. In this case, PVCs are deleted only when the entire StatefulSet is deleted. If the StatefulSet is scaled down, PVCs are not touched, meaning they are available to be reattached if a scale-up occurs with any data from the previous replica. This might be used for a temporary StatefulSet, such as in a CI instance or ETL pipeline, where the data on the StatefulSet is needed only during the lifetime of the StatefulSet lifetime, but while the task is running the data is not easily reconstructible. Any retained state is needed for any replicas that scale down and then up.

  • whenDeleted and whenScaled are both Delete. PVCs are deleted immediately when their replica is no longer needed. Note this does not include when a Pod is deleted and a new version rescheduled, for example when a node is drained and Pods need to migrate elsewhere. The PVC is deleted only when the replica is no longer needed as signified by a scale-down or StatefulSet deletion. This use case is for when data does not need to live beyond the life of its replica. Perhaps the data is easily reconstructable and the cost savings of deleting unused PVCs is more important than quick scale-up, or perhaps that when a new replica is created, any data from a previous replica is not usable and must be reconstructed anyway.

  • whenDeleted is Retain and whenScaled is Delete. This is similar to the previous case, when there is little benefit to keeping PVCs for fast reuse during scale-up. An example of a situation where you might use this is an Elasticsearch cluster. Typically you would scale that workload up and down to match demand, whilst ensuring a minimum number of replicas (for example: 3). When scaling down, data is migrated away from removed replicas and there is no benefit to retaining those PVCs. However, it can be useful to bring the entire Elasticsearch cluster down temporarily for maintenance. If you need to take the Elasticsearch system offline, you can do this by temporarily deleting the StatefulSet, and then bringing the Elasticsearch cluster back by recreating the StatefulSet. The PVCs holding the Elasticsearch data will still exist and the new replicas will automatically use them.

Visit the documentation to see all the details.

What’s next?

Enable the feature and try it out! Enable the StatefulSetAutoDeletePVC feature gate on a cluster, then create a StatefulSet using the new policy. Test it out and tell us what you think!

I'm very curious to see if this owner reference mechanism works well in practice. For example, we realized there is no mechanism in Kubernetes for knowing who set a reference, so it’s possible that the StatefulSet controller may fight with custom controllers that set their own references. Fortunately, maintaining the existing retention behavior does not involve any new owner references, so default behavior will be compatible.

Please tag any issues you report with the label sig/apps and assign them to Matthew Cary (@mattcary at GitHub).

Enjoy!

Kubernetes 1.23: Prevent PersistentVolume leaks when deleting out of order

PersistentVolume (or PVs for short) are associated with Reclaim Policy. The Reclaim Policy is used to determine the actions that need to be taken by the storage backend on deletion of the PV. Where the reclaim policy is Delete, the expectation is that the storage backend releases the storage resource that was allocated for the PV. In essence, the reclaim policy needs to honored on PV deletion.

With the recent Kubernetes v1.23 release, an alpha feature lets you configure your cluster to behave that way and honor the configured reclaim policy.

How did reclaim work in previous Kubernetes releases?

PersistentVolumeClaim (or PVC for short) is a request for storage by a user. A PV and PVC are considered Bound if there is a newly created PV or a matching PV is found. The PVs themselves are backed by a volume allocated by the storage backend.

Normally, if the volume is to be deleted, then the expectation is to delete the PVC for a bound PV-PVC pair. However, there are no restrictions to delete a PV prior to deleting a PVC.

First, I'll demonstrate the behavior for clusters that are running an older version of Kubernetes.

Retrieve an PVC that is bound to a PV

Retrieve an existing PVC example-vanilla-block-pvc

kubectl get pvc example-vanilla-block-pvc

The following output shows the PVC and it's Bound PV, the PV is shown under the VOLUME column:

NAME                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS               AGE
example-vanilla-block-pvc   Bound    pvc-6791fdd4-5fad-438e-a7fb-16410363e3da   5Gi        RWO            example-vanilla-block-sc   19s

Delete PV

When I try to delete a bound PV, the cluster blocks and the kubectl tool does not return back control to the shell; for example:

kubectl delete pv pvc-6791fdd4-5fad-438e-a7fb-16410363e3da
persistentvolume "pvc-6791fdd4-5fad-438e-a7fb-16410363e3da" deleted
^C

Retrieving the PV:

kubectl get pv pvc-6791fdd4-5fad-438e-a7fb-16410363e3da

It can be observed that the PV is in Terminating state

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS        CLAIM                               STORAGECLASS               REASON   AGE
pvc-6791fdd4-5fad-438e-a7fb-16410363e3da   5Gi        RWO            Delete           Terminating   default/example-vanilla-block-pvc   example-vanilla-block-sc            2m23s

Delete PVC

kubectl delete pvc example-vanilla-block-pvc

The following output is seen if the PVC gets successfully deleted:

persistentvolumeclaim "example-vanilla-block-pvc" deleted

The PV object from the cluster also gets deleted. When attempting to retrieve the PV it will be observed that the PV is no longer found:

kubectl get pv pvc-6791fdd4-5fad-438e-a7fb-16410363e3da
Error from server (NotFound): persistentvolumes "pvc-6791fdd4-5fad-438e-a7fb-16410363e3da" not found

Although the PV is deleted the underlying storage resource is not deleted, and needs to be removed manually.

To sum it up, the reclaim policy associated with the Persistent Volume is currently ignored under certain circumstance. For a Bound PV-PVC pair the ordering of PV-PVC deletion determines whether the PV reclaim policy is honored. The reclaim policy is honored if the PVC is deleted first, however, if the PV is deleted prior to deleting the PVC then the reclaim policy is not exercised. As a result of this behavior, the associated storage asset in the external infrastructure is not removed.

PV reclaim policy with Kubernetes v1.23

The new behavior ensures that the underlying storage object is deleted from the backend when users attempt to delete a PV manually.

How to enable new behavior?

To make use of the new behavior, you must have upgraded your cluster to the v1.23 release of Kubernetes. You need to make sure that you are running the CSI external-provisioner version 4.0.0, or later. You must also enable the HonorPVReclaimPolicy feature gate for the external-provisioner and for the kube-controller-manager.

If you're not using a CSI driver to integrate with your storage backend, the fix isn't available. The Kubernetes project doesn't have a current plan to fix the bug for in-tree storage drivers: the future of those in-tree drivers is deprecation and migration to CSI.

How does it work?

The new behavior is achieved by adding a finalizer external-provisioner.volume.kubernetes.io/finalizer on new and existing PVs, the finalizer is only removed after the storage from backend is deleted.

An example of a PV with the finalizer, notice the new finalizer in the finalizers list

kubectl get pv pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: csi.vsphere.vmware.com
  creationTimestamp: "2021-11-17T19:28:56Z"
  finalizers:
  - kubernetes.io/pv-protection
  - external-provisioner.volume.kubernetes.io/finalizer
  name: pvc-a7b7e3ba-f837-45ba-b243-dec7d8aaed53
  resourceVersion: "194711"
  uid: 087f14f2-4157-4e95-8a70-8294b039d30e
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: example-vanilla-block-pvc
    namespace: default
    resourceVersion: "194677"
    uid: a7b7e3ba-f837-45ba-b243-dec7d8aaed53
  csi:
    driver: csi.vsphere.vmware.com
    fsType: ext4
    volumeAttributes:
      storage.kubernetes.io/csiProvisionerIdentity: 1637110610497-8081-csi.vsphere.vmware.com
      type: vSphere CNS Block Volume
    volumeHandle: 2dacf297-803f-4ccc-afc7-3d3c3f02051e
  persistentVolumeReclaimPolicy: Delete
  storageClassName: example-vanilla-block-sc
  volumeMode: Filesystem
status:
  phase: Bound

The presence of the finalizer prevents the PV object from being removed from the cluster. As stated previously, the finalizer is only removed from the PV object after it is successfully deleted from the storage backend. To learn more about finalizers, please refer to Using Finalizers to Control Deletion.

What about CSI migrated volumes?

The fix is applicable to CSI migrated volumes as well. However, when the feature HonorPVReclaimPolicy is enabled on 1.23, and CSI Migration is disabled, the finalizer is removed from the PV object if it exists.

Some caveats

  1. The fix is applicable only to CSI volumes and migrated volumes. In-tree volumes will exhibit older behavior.
  2. The fix is introduced as an alpha feature in the external-provisioner under the feature gate HonorPVReclaimPolicy. The feature is disabled by default, and needs to be enabled explicitly.

References

How do I get involved?

The Kubernetes Slack channel SIG Storage communication channels are great mediums to reach out to the SIG Storage and migration working group teams.

Special thanks to the following people for the insightful reviews, thorough consideration and valuable contribution:

  • Jan Šafránek (jsafrane)
  • Xing Yang (xing-yang)
  • Matthew Wong (wongma7)

Those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We’re rapidly growing and always welcome new contributors.

Kubernetes 1.23: Kubernetes In-Tree to CSI Volume Migration Status Update

The Kubernetes in-tree storage plugin to Container Storage Interface (CSI) migration infrastructure has already been beta since v1.17. CSI migration was introduced as alpha in Kubernetes v1.14.

Since then, SIG Storage and other Kubernetes special interest groups are working to ensure feature stability and compatibility in preparation for GA. This article is intended to give a status update to the feature as well as changes between Kubernetes 1.17 and 1.23. In addition, I will also cover the future roadmap for the CSI migration feature GA for each storage plugin.

Quick recap: What is CSI Migration, and why migrate?

The Container Storage Interface (CSI) was designed to help Kubernetes replace its existing, in-tree storage driver mechanisms - especially vendor specific plugins. Kubernetes support for the Container Storage Interface has been generally available since Kubernetes v1.13. Support for using CSI drivers was introduced to make it easier to add and maintain new integrations between Kubernetes and storage backend technologies. Using CSI drivers allows for for better maintainability (driver authors can define their own release cycle and support lifecycle) and reduce the opportunity for vulnerabilities (with less in-tree code, the risks of a mistake are reduced, and cluster operators can select only the storage drivers that their cluster requires).

As more CSI Drivers were created and became production ready, SIG Storage group wanted all Kubernetes users to benefit from the CSI model. However, we cannot break API compatibility with the existing storage API types. The solution we came up with was CSI migration: a feature that translates in-tree APIs to equivalent CSI APIs and delegates operations to a replacement CSI driver.

The CSI migration effort enables the replacement of existing in-tree storage plugins such as kubernetes.io/gce-pd or kubernetes.io/aws-ebs with a corresponding CSI driver from the storage backend. If CSI Migration is working properly, Kubernetes end users shouldn’t notice a difference. Existing StorageClass, PersistentVolume and PersistentVolumeClaim objects should continue to work. When a Kubernetes cluster administrator updates a cluster to enable CSI migration, existing workloads that utilize PVCs which are backed by in-tree storage plugins will continue to function as they always have. However, behind the scenes, Kubernetes hands control of all storage management operations (previously targeting in-tree drivers) to CSI drivers.

For example, suppose you are a kubernetes.io/gce-pd user, after CSI migration, you can still use kubernetes.io/gce-pd to provision new volumes, mount existing GCE-PD volumes or delete existing volumes. All existing API/Interface will still function correctly. However, the underlying function calls are all going through the GCE PD CSI driver instead of the in-tree Kubernetes function.

This enables a smooth transition for end users. Additionally as storage plugin developers, we can reduce the burden of maintaining the in-tree storage plugins and eventually remove them from the core Kubernetes binary.

What has been changed, and what's new?

Building on the work done in Kubernetes v1.17 and earlier, the releases since then have made a series of changes:

New feature gates

The Kubernetes v1.21 release deprecated the CSIMigration{provider}Complete feature flags, and stopped honoring them. In their place came new feature flags named InTreePlugin{vendor}Unregister, that replace the old feature flag and retain all the functionality that CSIMigration{provider}Complete provided.

CSIMigration{provider}Complete was introduced before as a supplementary feature gate once CSI migration is enabled on all of the nodes. This flag unregisters the in-tree storage plugin you specify with the {provider} part of the flag name.

When you enable that feature gate, then instead of using the in-tree driver code, your cluster directly selects and uses the relevant CSI driver. This happens without any check for whether CSI migration is enabled on the node, or whether you have in fact deployed that CSI driver.

While this feature gate is a great helper, SIG Storage (and, I'm sure, lots of cluster operators) also wanted a feature gate that lets you disable an in-tree storage plugin, even without also enabling CSI migration. For example, you might want to disable the EBS storage plugin on a GCE cluster, because EBS volumes are specific to a different vendor's cloud (AWS).

To make this possible, Kubernetes v1.21 introduced a new feature flag set: InTreePlugin{vendor}Unregister.

InTreePlugin{vendor}Unregister is a standalone feature gate that can be enabled and disabled independently from CSI Migration. When enabled, the component will not register the specific in-tree storage plugin to the supported list. If the cluster operator only enables this flag, end users will get an error from PVC saying it cannot find the plugin when the plugin is used. The cluster operator may want to enable this regardless of CSI Migration if they do not want to support the legacy in-tree APIs and only support CSI moving forward.

Observability

Kubernetes v1.21 introduced metrics for tracking CSI migration. You can use these metrics to observe how your cluster is using storage services and whether access to that storage is using the legacy in-tree driver or its CSI-based replacement.

ComponentsMetricsNotes
Kube-Controller-Managerstorage_operation_duration_secondsA new label migrated is added to the metric to indicate whether this storage operation is a CSI migration operation(string value true for enabled and false for not enabled).
Kubeletcsi_operations_secondsThe new metric exposes labels including driver_name, method_name, grpc_status_code and migrated. The meaning of these labels is identical to csi_sidecar_operations_seconds.
CSI Sidecars(provisioner, attacher, resizer)csi_sidecar_operations_secondsA new label migrated is added to the metric to indicate whether this storage operation is a CSI migration operation(string value true for enabled and false for not enabled).

Bug fixes and feature improvement

We have fixed numerous bugs like dangling attachment, garbage collection, incorrect topology label through the help of our beta testers.

Cloud Provider && Cluster Lifecycle Collaboration

SIG Storage has been working closely with SIG Cloud Provider and SIG Cluster Lifecycle on the rollout of CSI migration.

If you are a user of a managed Kubernetes service, check with your provider if anything needs to be done. In many cases, the provider will manage the migration and no additional work is required.

If you use a distribution of Kubernetes, check its official documentation for information about support for this feature. For the CSI Migration feature graduation to GA, SIG Storage and SIG Cluster Lifecycle are collaborating towards making the migration mechanisms available in tooling (such as kubeadm) as soon as they're available in Kubernetes itself.

What is the timeline / status?

The current and targeted releases for each individual driver is shown in the table below:

DriverAlphaBeta (in-tree deprecated)Beta (on-by-default)GATarget "in-tree plugin" removal
AWS EBS1.141.171.231.24 (Target)1.26 (Target)
GCE PD1.141.171.231.24 (Target)1.26 (Target)
OpenStack Cinder1.141.181.211.24 (Target)1.26 (Target)
Azure Disk1.151.191.231.24 (Target)1.26 (Target)
Azure File1.151.211.24 (Target)1.25 (Target)1.27 (Target)
vSphere1.181.191.24 (Target)1.25 (Target)1.27 (Target)
Ceph RBD1.23
Portworx1.23

The following storage drivers will not have CSI migration support. The ScaleIO driver was already removed; the others are deprecated and will be removed from core Kubernetes.

DriverDeprecatedCode Removal
ScaleIO1.161.22
Flocker1.221.25 (Target)
Quobyte1.221.25 (Target)
StorageOS1.221.25 (Target)

What's next?

With more CSI drivers graduating to GA, we hope to soon mark the overall CSI Migration feature as GA. We are expecting cloud provider in-tree storage plugins code removal to happen by Kubernetes v1.26 and v1.27.

What should I do as a user?

Note that all new features for the Kubernetes storage system (such as volume snapshotting) will only be added to the CSI interface. Therefore, if you are starting up a new cluster, creating stateful applications for the first time, or require these new features we recommend using CSI drivers natively (instead of the in-tree volume plugin API). Follow the updated user guides for CSI drivers and use the new CSI APIs.

However, if you choose to roll a cluster forward or continue using specifications with the legacy volume APIs, CSI Migration will ensure we continue to support those deployments with the new CSI drivers. However, if you want to leverage new features like snapshot, it will require a manual migration to re-import an existing intree PV as a CSI PV.

How do I get involved?

The Kubernetes Slack channel #csi-migration along with any of the standard SIG Storage communication channels are great mediums to reach out to the SIG Storage and migration working group teams.

This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together. We offer a huge thank you to the contributors who stepped up these last quarters to help move the project forward:

  • Michelle Au (msau42)
  • Jan Šafránek (jsafrane)
  • Hemant Kumar (gnufied)

Special thanks to the following people for the insightful reviews, thorough consideration and valuable contribution to the CSI migration feature:

  • Andy Zhang (andyzhangz)
  • Divyen Patel (divyenpatel)
  • Deep Debroy (ddebroy)
  • Humble Devassy Chirammal (humblec)
  • Jing Xu (jingxu97)
  • Jordan Liggitt (liggitt)
  • Matthew Cary (mattcary)
  • Matthew Wong (wongma7)
  • Neha Arora (nearora-msft)
  • Oksana Naumov (trierra)
  • Saad Ali (saad-ali)
  • Tim Bannister (sftim)
  • Xing Yang (xing-yang)

Those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We’re rapidly growing and always welcome new contributors.

Kubernetes 1.23: Pod Security Graduates to Beta

With the release of Kubernetes v1.23, Pod Security admission has now entered beta. Pod Security is a built-in admission controller that evaluates pod specifications against a predefined set of Pod Security Standards and determines whether to admit or deny the pod from running.

Pod Security is the successor to PodSecurityPolicy which was deprecated in the v1.21 release, and will be removed in Kubernetes v1.25. In this article, we cover the key concepts of Pod Security along with how to use it. We hope that cluster administrators and developers alike will use this new mechanism to enforce secure defaults for their workloads.

Why Pod Security

The overall aim of Pod Security is to let you isolate workloads. You can run a cluster that runs different workloads and, without adding extra third-party tooling, implement controls that require Pods for a workload to restrict their own privileges to a defined bounding set.

Pod Security overcomes key shortcomings of Kubernetes' existing, but deprecated, PodSecurityPolicy (PSP) mechanism:

  • Policy authorization model — challenging to deploy with controllers.
  • Risks around switching — a lack of dry-run/audit capabilities made it hard to enable PodSecurityPolicy.
  • Inconsistent and Unbounded API — the large configuration surface and evolving constraints led to a complex and confusing API.

The shortcomings of PSP made it very difficult to use which led the community to reevaluate whether or not a better implementation could achieve the same goals. One of those goals was to provide an out-of-the-box solution to apply security best practices. Pod Security ships with predefined Pod Security levels that a cluster administrator can configure to meet the desired security posture.

It's important to note that Pod Security doesn't have complete feature parity with the deprecated PodSecurityPolicy. Specifically, it doesn't have the ability to mutate or change Kubernetes resources to auto-remediate a policy violation on behalf of the user. Additionally, it doesn't provide fine-grained control over each allowed field and value within a pod specification or any other Kubernetes resource that you may wish to evaluate. If you need more fine-grained policy control then take a look at these other projects which support such use cases.

Pod Security also adheres to Kubernetes best practices of declarative object management by denying resources that violate the policy. This requires resources to be updated in source repositories, and tooling to be updated prior to being deployed to Kubernetes.

How Does Pod Security Work?

Pod Security is a built-in admission controller starting with Kubernetes v1.22, but can also be run as a standalone webhook. Admission controllers function by intercepting requests in the Kubernetes API server prior to persistence to storage. They can either admit or deny a request. In the case of Pod Security, pod specifications will be evaluated against a configured policy in the form of a Pod Security Standard. This means that security sensitive fields in a pod specification will only be allowed to have specific values.

Configuring Pod Security

Pod Security Standards

In order to use Pod Security we first need to understand Pod Security Standards. These standards define three different policy levels that range from permissive to restrictive. These levels are as follows:

  • privileged — open and unrestricted
  • baseline — Covers known privilege escalations while minimizing restrictions
  • restricted — Highly restricted, hardening against known and unknown privilege escalations. May cause compatibility issues

Each of these policy levels define which fields are restricted within a pod specification and the allowed values. Some of the fields restricted by these policies include:

  • spec.securityContext.sysctls
  • spec.hostNetwork
  • spec.volumes[*].hostPath
  • spec.containers[*].securityContext.privileged

Policy levels are applied via labels on Namespace resources, which allows for granular per-namespace policy selection. The AdmissionConfiguration in the API server can also be configured to set cluster-wide default levels and exemptions.

Policy modes

Policies are applied in a specific mode. Multiple modes (with different policy levels) can be set on the same namespace. Here is a list of modes:

  • enforce — Any Pods that violate the policy will be rejected
  • audit — Violations will be recorded as an annotation in the audit logs, but don't affect whether the pod is allowed.
  • warn — Violations will send a warning message back to the user, but don't affect whether the pod is allowed.

In addition to modes you can also pin the policy to a specific version (for example v1.22). Pinning to a specific version allows the behavior to remain consistent if the policy definition changes in future Kubernetes releases.

Hands on demo

Prerequisites

Deploy a kind cluster

kind create cluster --image kindest/node:v1.23.0

It might take a while to start and once it's started it might take a minute or so before the node becomes ready.

kubectl cluster-info --context kind-kind

Wait for the node STATUS to become ready.

kubectl get nodes

The output is similar to this:

NAME                 STATUS   ROLES                  AGE   VERSION
kind-control-plane   Ready    control-plane,master   54m   v1.23.0

Confirm Pod Security is enabled

The best way to confirm the API's default enabled plugins is to check the Kubernetes API container's help arguments.

kubectl -n kube-system exec kube-apiserver-kind-control-plane -it -- kube-apiserver -h | grep "default enabled ones"

The output is similar to this:

...
      --enable-admission-plugins strings
admission plugins that should be enabled in addition
to default enabled ones (NamespaceLifecycle, LimitRanger,
ServiceAccount, TaintNodesByCondition, PodSecurity, Priority,
DefaultTolerationSeconds, DefaultStorageClass,
StorageObjectInUseProtection, PersistentVolumeClaimResize,
RuntimeClass, CertificateApproval, CertificateSigning,
CertificateSubjectRestriction, DefaultIngressClass,
MutatingAdmissionWebhook, ValidatingAdmissionWebhook,
ResourceQuota).
...

PodSecurity is listed in the group of default enabled admission plugins.

If using a cloud provider, or if you don't have access to the API server, the best way to check would be to run a quick end-to-end test:

kubectl create namespace verify-pod-security
kubectl label namespace verify-pod-security pod-security.kubernetes.io/enforce=restricted
# The following command does NOT create a workload (--dry-run=server)
kubectl -n verify-pod-security run test --dry-run=server --image=busybox --privileged
kubectl delete namespace verify-pod-security

The output is similar to this:

Error from server (Forbidden): pods "test" is forbidden: violates PodSecurity "restricted:latest": privileged (container "test" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "test" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "test" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "test" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "test" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

Configure Pod Security

Policies are applied to a namespace via labels. These labels are as follows:

  • pod-security.kubernetes.io/<MODE>: <LEVEL> (required to enable pod security)
  • pod-security.kubernetes.io/<MODE>-version: <VERSION> (optional, defaults to latest)

A specific version can be supplied for each enforcement mode. The version pins the policy to the version that was shipped as part of the Kubernetes release. Pinning to a specific Kubernetes version allows for deterministic policy behavior while allowing flexibility for future updates to Pod Security Standards. The possible <MODE(S)> are enforce, audit and warn.

When to use warn?

The typical uses for warn are to get ready for a future change where you want to enforce a different policy. The most two common cases would be:

  • warn at the same level but a different version (e.g. pin enforce to restricted+v1.23 and warn at restricted+latest)
  • warn at a stricter level (e.g. enforce baseline, warn restricted)

It's not recommended to use warn for the exact same level+version of the policy as enforce. In the admission sequence, if enforce fails, the entire sequence fails before evaluating the warn.

First, create a namespace called verify-pod-security if not created earlier. For the demo, --overwrite is used when labeling to allow repurposing a single namespace for multiple examples.

kubectl create namespace verify-pod-security

Deploy demo workloads

Each workload represents a higher level of security that would not pass the profile that comes after it.

For the following examples, use the busybox container runs a sleep command for 1 million seconds (≅11 days) or until deleted. Pod Security is not interested in which container image you chose, but rather the Pod level settings and their implications for security.

Privileged level and workload

For the privileged pod, use the privileged policy. This allows the process inside a container to gain new processes (also known as "privilege escalation") and can be dangerous if untrusted.

First, let's apply a restricted Pod Security level for a test.

# enforces a "restricted" security policy and audits on restricted
kubectl label --overwrite ns verify-pod-security \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/audit=restricted

Next, try to deploy a privileged workload in the namespace.

cat <<EOF | kubectl -n verify-pod-security apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: busybox-privileged
spec:
  containers:
  - name: busybox
    image: busybox
    args:
    - sleep
    - "1000000"
    securityContext:
      allowPrivilegeEscalation: true
EOF

The output is similar to this:

Error from server (Forbidden): error when creating "STDIN": pods "busybox-privileged" is forbidden: violates PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "busybox" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "busybox" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "busybox" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "busybox" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

Now let's apply the privileged Pod Security level and try again.

# enforces a "privileged" security policy and warns / audits on baseline
kubectl label --overwrite ns verify-pod-security \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/warn=baseline \
  pod-security.kubernetes.io/audit=baseline
cat <<EOF | kubectl -n verify-pod-security apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: busybox-privileged
spec:
  containers:
  - name: busybox
    image: busybox
    args:
    - sleep
    - "1000000"
    securityContext:
      allowPrivilegeEscalation: true
EOF

The output is similar to this:

pod/busybox-privileged created

We can run kubectl -n verify-pod-security get pods to verify it is running. Clean up with:

kubectl -n verify-pod-security delete pod busybox-privileged

Baseline level and workload

The baseline policy demonstrates sensible defaults while preventing common container exploits.

Let's revert back to a restricted Pod Security level for a quick test.

# enforces a "restricted" security policy and audits on restricted
kubectl label --overwrite ns verify-pod-security \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/audit=restricted

Apply the workload.

cat <<EOF | kubectl -n verify-pod-security apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: busybox-baseline
spec:
  containers:
  - name: busybox
    image: busybox
    args:
    - sleep
    - "1000000"
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        add:
          - NET_BIND_SERVICE
          - CHOWN
EOF

The output is similar to this:

Error from server (Forbidden): error when creating "STDIN": pods "busybox-baseline" is forbidden: violates PodSecurity "restricted:latest": unrestricted capabilities (container "busybox" must set securityContext.capabilities.drop=["ALL"]; container "busybox" must not include "CHOWN" in securityContext.capabilities.add), runAsNonRoot != true (pod or container "busybox" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "busybox" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

Let's apply the baseline Pod Security level and try again.

# enforces a "baseline" security policy and warns / audits on restricted
kubectl label --overwrite ns verify-pod-security \
  pod-security.kubernetes.io/enforce=baseline \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted
cat <<EOF | kubectl -n verify-pod-security apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: busybox-baseline
spec:
  containers:
  - name: busybox
    image: busybox
    args:
    - sleep
    - "1000000"
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        add:
          - NET_BIND_SERVICE
          - CHOWN
EOF

The output is similar to the following. Note that the warnings match the error message from the test above, but the pod is still successfully created.

Warning: would violate PodSecurity "restricted:latest": unrestricted capabilities (container "busybox" must set securityContext.capabilities.drop=["ALL"]; container "busybox" must not include "CHOWN" in securityContext.capabilities.add), runAsNonRoot != true (pod or container "busybox" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "busybox" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
pod/busybox-baseline created

Remember, we set the verify-pod-security namespace to warn based on the restricted profile. We can run kubectl -n verify-pod-security get pods to verify it is running. Clean up with:

kubectl -n verify-pod-security delete pod busybox-baseline

Restricted level and workload

The restricted policy requires rejection of all privileged parameters. It is the most secure with a trade-off for complexity. The restricted policy allows containers to add the NET_BIND_SERVICE capability only.

While we've already tested restricted as a blocking function, let's try to get something running that meets all the criteria.

First we need to reapply the restricted profile, for the last time.

# enforces a "restricted" security policy and audits on restricted
kubectl label --overwrite ns verify-pod-security \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/audit=restricted
cat <<EOF | kubectl -n verify-pod-security apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: busybox-restricted
spec:
  containers:
  - name: busybox
    image: busybox
    args:
    - sleep
    - "1000000"
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        add:
          - NET_BIND_SERVICE
EOF

The output is similar to this:

Error from server (Forbidden): error when creating "STDIN": pods "busybox-restricted" is forbidden: violates PodSecurity "restricted:latest": unrestricted capabilities (container "busybox" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "busybox" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "busybox" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")

This is because the restricted profile explicitly requires that certain values are set to the most secure parameters.

By requiring explicit values, manifests become more declarative and your entire security model can shift left. With the restricted level of enforcement, a company could audit their cluster's compliance based on permitted manifests.

Let's fix each warning resulting in the following file:

cat <<EOF | kubectl -n verify-pod-security apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: busybox-restricted
spec:
  containers:
  - name: busybox
    image: busybox
    args:
    - sleep
    - "1000000"
    securityContext:
      seccompProfile:
        type: RuntimeDefault
      runAsNonRoot: true
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL
        add:
          - NET_BIND_SERVICE
EOF

The output is similar to this:

pod/busybox-restricted created

Run kubectl -n verify-pod-security get pods to verify it is running. The output is similar to this:

NAME               READY   STATUS                       RESTARTS   AGE
busybox-restricted   0/1     CreateContainerConfigError   0          2m26s

Let's figure out why the container is not starting with kubectl -n verify-pod-security describe pod busybox-restricted. The output is similar to this:

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Warning  Failed     2m29s (x8 over 3m55s)  kubelet            Error: container has runAsNonRoot and image will run as root (pod: "busybox-restricted_verify-pod-security(a4c6a62d-2166-41a9-b288-20df17cf5c90)", container: busybox)

To solve this, set the effective UID (runAsUser) to a non-zero (root) value or use the nobody UID (65534).

# delete the original pod
kubectl -n verify-pod-security delete pod busybox-restricted

# create the pod again with new runAsUser
cat <<EOF | kubectl -n verify-pod-security apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: busybox-restricted
spec:
  securityContext:
    runAsUser: 65534
  containers:
  - name: busybox
    image: busybox
    args:
    - sleep
    - "1000000"
    securityContext:
      seccompProfile:
        type: RuntimeDefault
      runAsNonRoot: true
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL
        add:
          - NET_BIND_SERVICE
EOF

Run kubectl -n verify-pod-security get pods to verify it is running. The output is similar to this:

NAME                 READY   STATUS    RESTARTS   AGE
busybox-restricted   1/1     Running   0          25s

Clean up the demo (restricted pod and namespace) with:

kubectl delete namespace verify-pod-security

At this point, if you wanted to dive deeper into linux permissions or what is permitted for a certain container, exec into the control plane and play around with containerd and crictl inspect.

# if using docker, shell into the control plane
docker exec -it kind-control-plane bash

# list running containers
crictl ps

# inspect each one by container ID
crictl inspect <CONTAINER ID>

Applying a cluster-wide policy

In addition to applying labels to namespaces to configure policy you can also configure cluster-wide policies and exemptions using the AdmissionConfiguration resource.

Using this resource, policy definitions are applied cluster-wide by default and any policy that is applied via namespace labels will take precedence.

There is no runtime configurable API for the AdmissionConfiguration configuration file so a cluster administrator would need to specify a path to the file below via the --admission-control-config-file flag on the API server.

In the following resource we are enforcing the baseline policy and warning and auditing the baseline policy. We are also making the kube-system namespace exempt from this policy.

It's not recommended to alter control plane / clusters after install, so let's build a new cluster with a default policy on all namespaces.

First, delete the current cluster.

kind delete cluster

Create a Pod Security configuration that enforce and audit baseline policies while using a restricted profile to warn the end user.

cat <<EOF > pod-security.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
  configuration:
    apiVersion: pod-security.admission.config.k8s.io/v1beta1
    kind: PodSecurityConfiguration
    defaults:
      enforce: "baseline"
      enforce-version: "latest"
      audit: "baseline"
      audit-version: "latest"
      warn: "restricted"
      warn-version: "latest"
    exemptions:
      # Array of authenticated usernames to exempt.
      usernames: []
      # Array of runtime class names to exempt.
      runtimeClasses: []
      # Array of namespaces to exempt.
      namespaces: [kube-system]
EOF

For additional options, check out the official standards admission controller docs.

We now have a default baseline policy. Next pass it to the kind configuration to enable the --admission-control-config-file API server argument and pass the policy file. To pass a file to a kind cluster, use a configuration file to pass additional setup instructions. Kind uses kubeadm to provision the cluster and the configuration file has the ability to pass kubeadmConfigPatches for further customization. In our case, the local file is mounted into the control plane node as /etc/kubernetes/policies/pod-security.yaml which is then mounted into the apiServer container. We also pass the --admission-control-config-file argument pointing to the policy's location.

cat <<EOF > kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    apiServer:
        # enable admission-control-config flag on the API server
        extraArgs:
          admission-control-config-file: /etc/kubernetes/policies/pod-security.yaml
        # mount new file / directories on the control plane
        extraVolumes:
          - name: policies
            hostPath: /etc/kubernetes/policies
            mountPath: /etc/kubernetes/policies
            readOnly: true
            pathType: "DirectoryOrCreate"
  # mount the local file on the control plane
  extraMounts:
  - hostPath: ./pod-security.yaml
    containerPath: /etc/kubernetes/policies/pod-security.yaml
    readOnly: true
EOF

Create a new cluster using the kind configuration file defined above.

kind create cluster --image kindest/node:v1.23.0 --config kind-config.yaml

Let's look at the default namespace.

kubectl describe namespace default

The output is similar to this:

Name:         default
Labels:       kubernetes.io/metadata.name=default
Annotations:  <none>
Status:       Active

No resource quota.

No LimitRange resource.

Let's create a new namespace and see if the labels apply there.

kubectl create namespace test-defaults
kubectl describe namespace test-defaults

Same.

Name:         test-defaults
Labels:       kubernetes.io/metadata.name=test-defaults
Annotations:  <none>
Status:       Active

No resource quota.

No LimitRange resource.

Can a privileged workload be deployed?

cat <<EOF | kubectl -n test-defaults apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: busybox-privileged
spec:
  containers:
  - name: busybox
    image: busybox
    args:
    - sleep
    - "1000000"
    securityContext:
      allowPrivilegeEscalation: true
EOF

Hmm... yep. The default warn level is working at least.

Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "busybox" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "busybox" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "busybox" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "busybox" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
pod/busybox-privileged created

Let's delete the pod with kubectl -n test-defaults delete pod/busybox-privileged.

Is my config even working?

# if using docker, shell into the control plane
docker exec -it kind-control-plane bash

# cat out the file we mounted
cat /etc/kubernetes/policies/pod-security.yaml

# check the api server logs
cat /var/log/containers/kube-apiserver*.log 

# check the api server config
cat /etc/kubernetes/manifests/kube-apiserver.yaml 

UPDATE: The baseline policy permits allowPrivilegeEscalation. While I cannot see the Pod Security default levels of enforcement, they are there. Let's try to provide a manifest that violates the baseline by requesting hostNetwork access.

# delete the original pod
kubectl -n test-defaults delete pod busybox-privileged

cat <<EOF | kubectl -n test-defaults apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: busybox-privileged
spec:
  containers:
  - name: busybox
    image: busybox
    args:
    - sleep
    - "1000000"
  hostNetwork: true
EOF

The output is similar to this:

Error from server (Forbidden): error when creating "STDIN": pods "busybox-privileged" is forbidden: violates PodSecurity "baseline:latest": host namespaces (hostNetwork=true)

Yes!!! It worked! 🎉🎉🎉

I later found out, another way to check if things are operating as intended is to check the raw API server metrics endpoint.

Run the following command:

kubectl get --raw /metrics | grep pod_security_evaluations_total

The output is similar to this:

# HELP pod_security_evaluations_total [ALPHA] Number of policy evaluations that occurred, not counting ignored or exempt requests.
# TYPE pod_security_evaluations_total counter
pod_security_evaluations_total{decision="allow",mode="enforce",policy_level="baseline",policy_version="latest",request_operation="create",resource="pod",subresource=""} 2
pod_security_evaluations_total{decision="allow",mode="enforce",policy_level="privileged",policy_version="latest",request_operation="create",resource="pod",subresource=""} 0
pod_security_evaluations_total{decision="allow",mode="enforce",policy_level="privileged",policy_version="latest",request_operation="update",resource="pod",subresource=""} 0
pod_security_evaluations_total{decision="deny",mode="audit",policy_level="baseline",policy_version="latest",request_operation="create",resource="pod",subresource=""} 1
pod_security_evaluations_total{decision="deny",mode="enforce",policy_level="baseline",policy_version="latest",request_operation="create",resource="pod",subresource=""} 1
pod_security_evaluations_total{decision="deny",mode="warn",policy_level="restricted",policy_version="latest",request_operation="create",resource="controller",subresource=""} 2
pod_security_evaluations_total{decision="deny",mode="warn",policy_level="restricted",policy_version="latest",request_operation="create",resource="pod",subresource=""} 2

A monitoring tool could ingest these metrics too for reporting, assessments, or measuring trends.

Clean up

When finished, delete the kind cluster.

kind delete cluster

Auditing

Auditing is another way to track what policies are being enforced in your cluster. To set up auditing with kind, review the official docs for enabling auditing. As of version 1.11, Kubernetes audit logs include two annotations that indicate whether or not a request was authorized (authorization.k8s.io/decision) and the reason for the decision (authorization.k8s.io/reason). Audit events can be streamed to a webhook for monitoring, tracking, or alerting.

The audit events look similar to the following:

{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"","pod-security.kubernetes.io/audit":"allowPrivilegeEscalation != false (container \"busybox\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container \"busybox\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or container \"busybox\" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container \"busybox\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")"}}

Auditing is also a good first step in evaluating your cluster's current compliance with Pod Security. The Kubernetes Enhancement Proposal (KEP) hints at a future where baseline could be the default for unlabeled namespaces.

Example audit-policy.yaml configuration tuned for Pod Security events:

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: RequestResponse
  resources:
    - group: "" # core API group
      resources: ["pods", "pods/ephemeralcontainers", "podtemplates", "replicationcontrollers"]
    - group: "apps"
      resources: ["daemonsets", "deployments", "replicasets", "statefulsets"]
    - group: "batch"
      resources: ["cronjobs", "jobs"]
  verbs: ["create", "update"]
  omitStages:
    - "RequestReceived"
    - "ResponseStarted"
    - "Panic"

Once auditing is enabled, look at the configured local file if using --audit-log-path or the destination of a webhook if using --audit-webhook-config-file.

If using a file (--audit-log-path), run cat /PATH/TO/API/AUDIT.log | grep "is forbidden:" to see all rejected workloads audited.

PSP migrations

If you're already using PSP, SIG Auth has created a guide and published the steps to migrate off of PSP.

To summarize the process:

  • Update all existing PSPs to be non-mutating
  • Apply Pod Security policies in warn or audit mode
  • Upgrade Pod Security policies to enforce mode
  • Remove PodSecurityPolicy from --enable-admission-plugins

Listed as "optional future extensions" and currently out of scope, SIG Auth has kicked around the idea of providing a tool to assist with migrations. More details in the KEP.

Wrap up

Pod Security is a promising new feature that provides an out-of-the-box way to allow users to improve the security posture of their workloads. Like any new enhancement that has matured to beta, we ask that you try it out, provide feedback, or share your experience via either raising a Github issue or joining SIG Auth community meetings. It's our hope that Pod Security will be deployed on every cluster in our ongoing pursuit as a community to make Kubernetes security a priority.

For a step by step guide on how to enable "baseline" Pod Security Standards with Pod Security Admission feature please refer to these dedicated tutorials that cover the configuration needed at cluster level and namespace level.

Additional resources

Kubernetes 1.23: Dual-stack IPv4/IPv6 Networking Reaches GA

"When will Kubernetes have IPv6?" This question has been asked with increasing frequency ever since alpha support for IPv6 was first added in k8s v1.9. While Kubernetes has supported IPv6-only clusters since v1.18, migration from IPv4 to IPv6 was not yet possible at that point. At long last, dual-stack IPv4/IPv6 networking has reached general availability (GA) in Kubernetes v1.23.

What does dual-stack networking mean for you? Let’s take a look…

Service API updates

Services were single-stack before 1.20, so using both IP families meant creating one Service per IP family. The user experience was simplified in 1.20, when Services were re-implemented to allow both IP families, meaning a single Service can handle both IPv4 and IPv6 workloads. Dual-stack load balancing is possible between services running any combination of IPv4 and IPv6.

The Service API now has new fields to support dual-stack, replacing the single ipFamily field.

  • You can select your choice of IP family by setting ipFamilyPolicy to one of three options: SingleStack, PreferDualStack, or RequireDualStack. A service can be changed between single-stack and dual-stack (within some limits).
  • Setting ipFamilies to a list of families assigned allows you to set the order of families used.
  • clusterIPs is inclusive of the previous clusterIP but allows for multiple entries, so it’s no longer necessary to run duplicate services, one in each of the two IP families. Instead, you can assign cluster IP addresses in both IP families.

Note that Pods are also dual-stack. For a given pod, there is no possibility of setting multiple IP addresses in the same family.

Default behavior remains single-stack

Starting in 1.20 with the re-implementation of dual-stack services as alpha, the underlying networking for Kubernetes has included dual-stack whether or not a cluster was configured with the feature flag to enable dual-stack.

Kubernetes 1.23 removed that feature flag as part of graduating the feature to stable. Dual-stack networking is always available if you want to configure it. You can set your cluster network to operate as single-stack IPv4, as single-stack IPv6, or as dual-stack IPv4/IPv6.

While Services are set according to what you configure, Pods default to whatever the CNI plugin sets. If your CNI plugin assigns single-stack IPs, you will have single-stack unless ipFamilyPolicy specifies PreferDualStack or RequireDualStack. If your CNI plugin assigns dual-stack IPs, pod.status.PodIPs defaults to dual-stack.

Even though dual-stack is possible, it is not mandatory to use it. Examples in the documentation show the variety possible in dual-stack service configurations.

Try dual-stack right now

While upstream Kubernetes now supports dual-stack networking as a GA or stable feature, each provider’s support of dual-stack Kubernetes may vary. Nodes need to be provisioned with routable IPv4/IPv6 network interfaces. Pods need to be dual-stack. The network plugin is what assigns the IP addresses to the Pods, so it's the network plugin being used for the cluster that needs to support dual-stack. Some Container Network Interface (CNI) plugins support dual-stack, as does kubenet.

Ecosystem support of dual-stack is increasing; you can create dual-stack clusters with kubeadm, try a dual-stack cluster locally with KIND, and deploy dual-stack clusters in cloud providers (after checking docs for CNI or kubenet availability).

Get involved with SIG Network

SIG-Network wants to learn from community experiences with dual-stack networking to find out more about evolving needs and your use cases. The SIG-network update video from KubeCon NA 2021 summarizes the SIG’s recent updates, including dual-stack going to stable in 1.23.

The current SIG-Network KEPs and issues on GitHub illustrate the SIG’s areas of emphasis. The dual-stack API server is one place to consider contributing.

SIG-Network meetings are a friendly, welcoming venue for you to connect with the community and share your ideas. Looking forward to hearing from you!

Acknowledgments

The dual-stack networking feature represents the work of many Kubernetes contributors. Thanks to all who contributed code, experience reports, documentation, code reviews, and everything in between. Bridget Kromhout details this community effort in Dual-Stack Networking in Kubernetes. KubeCon keynotes by Tim Hockin & Khaled (Kal) Henidak in 2019 (The Long Road to IPv4/IPv6 Dual-stack Kubernetes) and by Lachlan Evenson in 2021 (And Here We Go: Dual-stack Networking in Kubernetes) talk about the dual-stack journey, spanning five years and a great many lines of code.

Kubernetes 1.23: The Next Frontier

We’re pleased to announce the release of Kubernetes 1.23, the last release of 2021!

This release consists of 47 enhancements: 11 enhancements have graduated to stable, 17 enhancements are moving to beta, and 19 enhancements are entering alpha. Also, 1 feature has been deprecated.

Major Themes

Deprecation of FlexVolume

FlexVolume is deprecated. The out-of-tree CSI driver is the recommended way to write volume drivers in Kubernetes. See this doc for more information. Maintainers of FlexVolume drivers should implement a CSI driver and move users of FlexVolume to CSI. Users of FlexVolume should move their workloads to the CSI driver.

Deprecation of klog specific flags

To simplify the code base, several logging flags were marked as deprecated in Kubernetes 1.23. The code which implements them will be removed in a future release, so users of those need to start replacing the deprecated flags with some alternative solutions.

Software Supply Chain SLSA Level 1 Compliance in the Kubernetes Release Process

Kubernetes releases now generate provenance attestation files describing the staging and release phases of the release process. Artifacts are now verified as they are handed over from one phase to the next. This final piece completes the work needed to comply with Level 1 of the SLSA security framework (Supply-chain Levels for Software Artifacts).

IPv4/IPv6 Dual-stack Networking graduates to GA

IPv4/IPv6 dual-stack networking graduates to GA. Since 1.21, Kubernetes clusters have been enabled to support dual-stack networking by default. In 1.23, the IPv6DualStack feature gate is removed. The use of dual-stack networking is not mandatory. Although clusters are enabled to support dual-stack networking, Pods and Services continue to default to single-stack. To use dual-stack networking Kubernetes nodes must have routable IPv4/IPv6 network interfaces, a dual-stack capable CNI network plugin must be used, Pods must be configured to be dual-stack and Services must have their .spec.ipFamilyPolicy field set to either PreferDualStack or RequireDualStack.

HorizontalPodAutoscaler v2 graduates to GA

The HorizontalPodAutoscaler autoscaling/v2 stable API moved to GA in 1.23. The HorizontalPodAutoscaler autoscaling/v2beta2 API has been deprecated.

Generic Ephemeral Volume feature graduates to GA

The generic ephemeral volume feature moved to GA in 1.23. This feature allows any existing storage driver that supports dynamic provisioning to be used as an ephemeral volume with the volume’s lifecycle bound to the Pod. All StorageClass parameters for volume provisioning and all features supported with PersistentVolumeClaims are supported.

Skip Volume Ownership change graduates to GA

The feature to configure volume permission and ownership change policy for Pods moved to GA in 1.23. This allows users to skip recursive permission changes on mount and speeds up the pod start up time.

Allow CSI drivers to opt-in to volume ownership and permission change graduates to GA

The feature to allow CSI Drivers to declare support for fsGroup based permissions graduates to GA in 1.23.

PodSecurity graduates to Beta

PodSecurity moves to Beta. PodSecurity replaces the deprecated PodSecurityPolicy admission controller. PodSecurity is an admission controller that enforces Pod Security Standards on Pods in a Namespace based on specific namespace labels that set the enforcement level. In 1.23, the PodSecurity feature gate is enabled by default.

Container Runtime Interface (CRI) v1 is default

The Kubelet now supports the CRI v1 API, which is now the project-wide default. If a container runtime does not support the v1 API, Kubernetes will fall back to the v1alpha2 implementation. There is no intermediate action required by end-users, because v1 and v1alpha2 do not differ in their implementation. It is likely that v1alpha2 will be removed in one of the future Kubernetes releases to be able to develop v1.

Structured logging graduate to Beta

Structured logging reached its Beta milestone. Most log messages from kubelet and kube-scheduler have been converted. Users are encouraged to try out JSON output or parsing of the structured text format and provide feedback on possible solutions for the open issues, such as handling of multi-line strings in log values.

Simplified Multi-point plugin configuration for scheduler

The kube-scheduler is adding a new, simplified config field for Plugins to allow multiple extension points to be enabled in one spot. The new multiPoint plugin field is intended to simplify most scheduler setups for administrators. Plugins that are enabled via multiPoint will automatically be registered for each individual extension point that they implement. For example, a plugin that implements Score and Filter extensions can be simultaneously enabled for both. This means entire plugins can be enabled and disabled without having to manually edit individual extension point settings. These extension points can now be abstracted away due to their irrelevance for most users.

CSI Migration updates

CSI Migration enables the replacement of existing in-tree storage plugins such as kubernetes.io/gce-pd or kubernetes.io/aws-ebs with a corresponding CSI driver. If CSI Migration is working properly, Kubernetes end users shouldn’t notice a difference. After migration, Kubernetes users may continue to rely on all the functionality of in-tree storage plugins using the existing interface.

  • CSI Migration feature is turned on by default but stays in Beta for GCE PD, AWS EBS, and Azure Disk in 1.23.
  • CSI Migration is introduced as an Alpha feature for Ceph RBD and Portworx in 1.23.

Expression language validation for CRD is alpha

Expression language validation for CRD is in alpha starting in 1.23. If the CustomResourceValidationExpressions feature gate is enabled, custom resources will be validated by validation rules using the Common Expression Language (CEL).

Server Side Field Validation is Alpha

If the ServerSideFieldValidation feature gate is enabled starting 1.23, users will receive warnings from the server when they send Kubernetes objects in the request that contain unknown or duplicate fields. Previously unknown fields and all but the last duplicate fields would be dropped by the server.

With the feature gate enabled, we also introduce the fieldValidation query parameter so that users can specify the desired behavior of the server on a per request basis. Valid values for the fieldValidation query parameter are:

  • Ignore (default when feature gate is disabled, same as pre-1.23 behavior of dropping/ignoring unknown fields)
  • Warn (default when feature gate is enabled).
  • Strict (this will fail the request with an Invalid Request error)

OpenAPI v3 is Alpha

If the OpenAPIV3 feature gate is enabled starting 1.23, users will be able to request the OpenAPI v3.0 spec for all Kubernetes types. OpenAPI v3 aims to be fully transparent and includes support for a set of fields that are dropped when publishing OpenAPI v2: default, nullable, oneOf, anyOf. A separate spec is published per Kubernetes group version (at the $cluster/openapi/v3/apis/<group>/<version> endpoint) for improved performance and discovery, for all group versions can be found at the $cluster/openapi/v3 path.

Other Updates

Graduated to Stable

Major Changes

Release Notes

Check out the full details of the Kubernetes 1.23 release in our release notes.

Availability

Kubernetes 1.23 is available for download on GitHub. To get started with Kubernetes, check out these interactive tutorials or run local Kubernetes clusters using Docker container “nodes” with kind. You can also easily install 1.23 using kubeadm.

Release Team

This release was made possible by a very dedicated group of individuals, who came together as a team to deliver technical content, documentation, code, and a host of other components that go into every Kubernetes release.

A huge thank you to the release lead Rey Lejano for leading us through a successful release cycle, and to everyone else on the release team for supporting each other, and working so hard to deliver the 1.23 release for the community.

Kubernetes 1.23: The Next Frontier

"The Next Frontier" theme represents the new and graduated enhancements in 1.23, Kubernetes' history of Star Trek references, and the growth of community members in the release team.

Kubernetes has a history of Star Trek references. The original codename for Kubernetes within Google is Project 7, a reference to Seven of Nine from Star Trek Voyager. And of course Borg was the name for the predecessor to Kubernetes. "The Next Frontier" theme continues the Star Trek references. "The Next Frontier" is a fusion of two Star Trek titles, Star Trek V: The Final Frontier and Star Trek the Next Generation.

"The Next Frontier" represents a line in the SIG Release charter, "Ensure there is a consistent group of community members in place to support the release process across time." With each release team, we grow the community with new release team members and for many it's their first contribution in their open source frontier.

Reference: https://kubernetes.io/blog/2015/04/borg-predecessor-to-kubernetes/ Reference: https://github.com/kubernetes/community/blob/master/sig-release/charter.md

The Kubernetes 1.23 release logo continues with the theme's Star Trek reference. Every star is a helm from the Kubernetes logo. The ship represents the collective teamwork of the release team.

Rey Lejano designed the logo.

User Highlights

Ecosystem Updates

Project Velocity

The CNCF K8s DevStats project aggregates a number of interesting data points related to the velocity of Kubernetes and various sub-projects. This includes everything from individual contributions to the number of companies that are contributing, and is an illustration of the depth and breadth of effort that goes into evolving this ecosystem.

In the v1.23 release cycle, which ran for 16 weeks (August 23 to December 7), we saw contributions from 1032 companies and 1084 individuals.

Event Update

  • KubeCon + CloudNativeCon China 2021 is happening this month from December 9 - 11. After taking a break last year, the event will be virtual this year and includes 105 sessions. Check out the event schedule here.
  • KubeCon + CloudNativeCon Europe 2022 will take place in Valencia, Spain, May 4 – 7, 2022! You can find more information about the conference and registration on the event site.
  • Kubernetes Community Days has upcoming events scheduled in Pakistan, Brazil, Chengdu, and in Australia.

Upcoming Release Webinar

Join members of the Kubernetes 1.23 release team on January 4, 2022 to learn about the major features of this release, as well as deprecations and removals to help plan for upgrades. For more information and registration, visit the event page on the CNCF Online Programs site.

Get Involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below:

Contribution, containers and cricket: the Kubernetes 1.22 release interview

The Kubernetes release train rolls on, and we look ahead to the release of 1.23 next week. As is our tradition, I'm pleased to bring you a look back at the process that brought us the previous version.

The release team for 1.22 was led by Savitha Raghunathan, who was, at the time, a Senior Platform Engineer at MathWorks. I spoke to Savitha on the Kubernetes Podcast from Google, the weekly* show covering the Kubernetes and Cloud Native ecosystem.

Our release conversations shine a light on the team that puts together each Kubernetes release. Make sure you subscribe, wherever you get your podcasts so you catch the story of 1.23.

And in case you're interested in why the show has been on a hiatus the last few weeks, all will be revealed in the next episode!

This transcript has been lightly edited and condensed for clarity.


CRAIG BOX: Welcome to the show, Savitha.

SAVITHA RAGHUNATHAN: Hey, Craig. Thanks for having me on the show. How are you today?

CRAIG BOX: I'm very well, thank you. I've interviewed a lot of people on the show, and you're actually the first person who's asked that of me.

SAVITHA RAGHUNATHAN: I'm glad. It's something that I always do. I just want to make sure the other person is good and happy.

CRAIG BOX: That's very kind of you. Thank you for kicking off on a wonderful foot there. I want to ask first of all — you grew up in Chennai. My association with Chennai is the Super Kings cricket team. Was cricket part of your upbringing?

SAVITHA RAGHUNATHAN: Yeah. Actually, a lot. My mom loves watching cricket. I have a younger brother, and when we were growing up, we used to play cricket on the terrace. Everyone surrounding me, my best friends — and even now, my partner — loves watching cricket, too. Cricket is a part of my life.

I stopped watching it a while ago, but I still enjoy a good game.

CRAIG BOX: It's probably a bit harder in the US. Everything's in a different time zone. I find, with my cricket team being on the other side of the world, that it's a lot easier when they're playing near me, as opposed to trying to keep up with what they're doing when they're playing at 3:00 in the morning.

SAVITHA RAGHUNATHAN: That is actually one of the things that made me lose touch with cricket. I'm going to give you a piece of interesting information. I never supported Chennai Super Kings. I always supported Royal Challengers of Bangalore.

I once went to the stadium, and it was a match between the Chennai Super Kings and the RCB. I was the only one who was cheering whenever the RCB hit a 6, or when they were scoring. I got the stares of thousands of people looking at me. I'm like, "what are you doing?" My friends are like, "you're going to get us killed! Just stop screaming!"

CRAIG BOX: I hear you. As a New Zealander in the UK, there are a lot of international cricket matches I've been to where I am one of the few people dressed in the full beige kit. But I have to ask, why an affiliation with a different team?

SAVITHA RAGHUNATHAN: I'm not sure. When the IPL came out, I really liked Virat Kohli. He was playing for RCB at that time, and I think pretty much that's it.

CRAIG BOX: Well, what I know about the Chennai Super Kings is that their coach is New Zealand's finest batsmen and air conditioning salesman, Stephen Fleming.

SAVITHA RAGHUNATHAN: Oh, really?

CRAIG BOX: Yeah, he's a dead ringer for the guy who played the yellow Wiggle back in the day.

SAVITHA RAGHUNATHAN: Oh, interesting. I remember the name, but I cannot put the picture and the name together. I stopped watching cricket once I moved to the States. Then, all my focus was on studies and extracurriculars. I have always been an introvert. The campus — it was a new thing for me — they had international festivals.

And every week, they'd have some kind of new thing going on, so I'd go check them out. I wouldn't participate, but I did go out and check them out. That was a big feat for me around that time because a lot of people — and still, even now, a lot of people — they kind of scare me. I don't know how to make a conversation with everyone.

I'll just go and say, "hi, how are you? OK, I'm good. I'm just going to move on". And I'll just go to the next person. And after two hours, I'm out of that place.

CRAIG BOX: Perhaps a pleasant side effect of the last 12 months — a lot fewer gatherings of people.

SAVITHA RAGHUNATHAN: Could be that, but I'm so excited about KubeCon. But when I think about it, I'm like "oh my God. There's going to be a lot of people. What am I going to do? I'm going to meet all my friends over there".

Sometimes I have social anxiety like, what's going to happen?

CRAIG BOX: What's going to happen is you're going to ask them how they are at the beginning, and they're immediately going to be set at ease.

SAVITHA RAGHUNATHAN: laughs I hope so.

CRAIG BOX: Let's talk a little bit, then, about your transition from India to the US. You did your undergraduate degree in computer science at the SSN College of Engineering. How did you end up at Arizona State?

SAVITHA RAGHUNATHAN: I always wanted to pursue higher studies when I was in India, and I didn't have the opportunity immediately. Once I graduated from my school there, I went and I worked for a couple of years. My aim was always to get out of there and come here, do my graduate studies.

Eventually, I want to do a PhD. I have an idea of what I want to do. I always wanted to keep studying. If there's an option that I could just keep studying and not do work or anything of that sort, I'd just pick that other one — I'll just keep studying.

But unfortunately, you need money and other things to live and sustain in this world. So I'm like, OK, I'll take a break from studies, and I will work for a while.

CRAIG BOX: The road to success is littered with dreams of PhDs. I have a lot of friends who thought that that was the path they were going to take, and they've had a beautiful career and probably aren't going to go back to study. Did you use the Matlab software at all while you were going through your schooling?

SAVITHA RAGHUNATHAN: No, unfortunately. That is a question that everyone asks. I have not used Matlab. I haven't used it even now. I don't use it for work. I didn't have any necessity for my school work. I didn't have anything to do with Matlab. I never analysed, or did data processing, or anything, with Matlab. So unfortunately, no.

Everyone asks me like, you're working at MathWorks. Have you used Matlab? I'm like, no.

CRAIG BOX: Fair enough. Nor have I. But it's been around since the late 1970s, so I imagine there are a lot of people who will have come across it at some point. Do you work with a lot of people who have been working on it that whole time?

SAVITHA RAGHUNATHAN: Kind of. Not all the time, but I get to meet some folks who work on the product itself. Most of my interactions are with the infrastructure team and platform engineering teams at MathWorks. One other interesting fact is that when I joined the company — MathWorks has an extensive internal curriculum for training and learning, which I really love. They have an "Intro to Matlab" course, and that's on my bucket of things to do.

It was like 500 years ago. I added it, and I never got to it. I'm like, OK, maybe this year at least I want to get to it and I want to learn something new. My partner used Matlab extensively. He misses it right now at his current employer. And he's like, "you have the entire licence! You have access to the entire suite and you haven't used it?" I'm like, "no!"

CRAIG BOX: Well, I have bad news for the idea of you doing a PhD, I'm sorry.

SAVITHA RAGHUNATHAN: Another thing is that none of my family knew about the company MathWorks and Matlab. The only person who knew was my younger brother. He was so proud. He was like, "oh my God".

When he was 12 years old, he started getting involved in robotics and all that stuff. That's how he got introduced to Matlab. He goes absolutely bananas for the swag. So all the t-shirts, all the hoodies — any swag that I get from MathWorks goes to him, without saying.

Over the five, six years, the things that I've got — there was only one sweatshirt that I kept for myself. Everything else I've just given to him. And he cherishes it. He's the only one in my family who knew about Matlab and MathWorks.

Now, everyone knows, because I'm working there. They were initially like, I don't even know that company name. Is it like Amazon? I'm like, no, we make software that can send people to the moon. And we also make software that can do amazing robotic surgeries and even make a car drive on its own. That's something that I take immense pride in.

I know I don't directly work on the product, but I'm enabling the people who are creating the product. I'm really, really proud of that.

CRAIG BOX: I think Jeff Bezos is working on at least two out of three of those disciplines that you mentioned before, so it's maybe a little bit like Amazon. One thing I've always thought about Matlab is that, because it's called Matlab, it solves that whole problem where Americans call it math, and the rest of the world call it maths. Why do Americans think there's only one math?

SAVITHA RAGHUNATHAN: Definitely. I had trouble — growing up in India, it's always British English. And I had so much trouble when I moved here. So many things changed.

One of the things is maths. I always got used to writing maths, physics, and everything.

CRAIG BOX: They don't call it "physic" in the US, do they?

SAVITHA RAGHUNATHAN: No, no, they don't. Luckily, they don't. That still stays "physics". But math — I had trouble. It's maths. Even when you do the full abbreviations like mathematics and you are still calling it math, I'm like, mm.

CRAIG BOX: They can do the computer science abbreviation thing and call it math-7-S or whatever the number of letters is.

SAVITHA RAGHUNATHAN: Just like Kubernetes. K-8-s.

CRAIG BOX: Your path to Kubernetes is through MathWorks. They started out as a company making software which was distributed in a physical sense — boxed copies, if you will. I understand now there is a cloud version. Can I assume that that is where the two worlds intersect?

SAVITHA RAGHUNATHAN: Kind of. I have interaction with the team that supports Matlab on the cloud, but I don't get to work with them on a day-to-day basis. They use Docker containers, and they are building the platform using Kubernetes. So yeah, a little bit of that.

CRAIG BOX: So what exactly is the platform that you are engineering day to day?

SAVITHA RAGHUNATHAN: Providing Kubernetes as a platform, obviously — that goes without saying — to some of the internal development teams. In the future we might expand it to more teams within the company. That is a focus area right now, so that's what we are doing. In the process, we might even get to work with the people who are deploying Matlab on the cloud, which is exciting.

CRAIG BOX: Now, your path to contribution to Kubernetes, you've said before, was through fixing a 404 error on the Kubernetes.io website. Do you remember what the page was?

SAVITHA RAGHUNATHAN: I do. I was going to something for work, and I came across this changelog. In Kubernetes there's a nice page — once you got to the release page, there would be a long list of changelogs.

One of the things that I fixed was, the person who worked on the feature had changed their GitHub handle, and that wasn't reflected on this page. So that was my first. I got curious and clicked on the links. One of the links was the handle, and that went to a 404. And I was like "Yeah, I'll just fix that. They have done all the hard work. They can get the credit that's due".

It was easy. It wasn't overwhelming for me to pick it up as my first issue. Before that I logged on around Kubernetes for about six to eight months without doing anything because it was just a lot.

CRAIG BOX: One of the other things that you said about your initial contribution is that you had to learn how to use Git. As a very powerful tool, I find Git is a high barrier to entry for even contributing code to a project. When you want to contribute a blog post or documentation or a fix like you did before, I find it almost impossible to think how a new user would come along and do that. What was your process? Do you think that there's anything we can do to make that barrier lower for new contributors?

SAVITHA RAGHUNATHAN: Of course. There are more and more tutorials available these days. There is a new contributor workshop. They actually have a GitHub workflow section, how to do a pull request and stuff like that. I know a couple of folks from SIG Docs that are working on which Git commands that you need, or how to get to writing something small and getting it committed. But more tutorials or more links to intro to Git would definitely help.

The thing is also, someone like a documentation writer — they don't actually want to know the entirety of Git. Honestly, it's an ocean. I don't know how to do it. Most of the time, I still ask for help even though I work with Git on a day to day basis. There are several articles and a lot of help is available already within the community. Maybe we could just add a couple more to kubernetes.dev. That is an amazing site for all the new contributors and existing contributors who want to build code, who want to write documentation.

We could just add a tutorial there like, "hey, don't know Git, you are new to Git? You just need to know these main things".

CRAIG BOX: I find it a shame, to be honest, that people need to use Git for that, by comparison to Wikipedia where you can come along, and even though it might be written in Markdown or something like it, it seems like the barrier is a lot lower. Similar to you, I always have to look up anything more complicated than the five or six Git commands that I use on a day to day basis. Even to do simple things, I basically just go and follow a recipe which I find on the internet.

SAVITHA RAGHUNATHAN: This is how I got introduced to one of the amazing mentors in Kubernetes. Everyone knows him by his handle, Dims. It was my second PR to the Kubernetes website, and I made a mistake. I destroyed the Git history. I could not push my reviews and comments — I addressed them. I couldn't push them back.

My immediate thought was to delete it and recreate, do another pull request. But then I was like, "what happens to others who have already put effort into reviewing them?" I asked for help, and Dims was there.

I would say I just got lucky he was there. And he was like, "OK, let me walk you through". We did troubleshooting through Slack messages. I copied and pasted all the errors. Every single command that he said, I copied and pasted. And then he was like, "OK, run this one. Try this one. And do this one".

Finally, I got it fixed. So you know what I did? I went and I stored the command history somewhere local for the next time when I run into this problem. Luckily, I haven't. But I find the contributors so helpful. They are busy. They have a lot of things to do, but they take moments to stop and help someone who's new.

That is also another part of the reason why I stay — I want to contribute more. It's mainly the community. It's the Kubernetes community. I know you asked me about Git, and I just took the conversation to the Kubernetes community. That's how my brain works.

CRAIG BOX: A lot of people in the community do that and think that's fantastic, obviously, people like Dims who are just floating around on Slack and seem to have endless time. I don't know how they do it.

SAVITHA RAGHUNATHAN: I really want to know the secret for endless time. If I only had 48 hours in a day. I would sleep for 16 hours, and I would use the rest of the time for doing the things that I want.

CRAIG BOX: If I had a chance to sleep up to 48 hours a day, I think it'd be a lot more than 16.

Now, one of the areas that you've been contributing to Kubernetes is in the release team. In 1.18, you were a shadow for the docs role. You led that role in 1.19. And you were a release lead shadow for versions 1,20 and 1.21 before finally leading this release, 1.22, which we will talk about soon.

How did you get involved? And how did you decide which roles to take as you went through that process?

SAVITHA RAGHUNATHAN: That is a topic I love to talk about. This was fresh when I started learning about Kubernetes and using Kubernetes at work. And I got so much help from the community, I got interested in contributing back.

At the first KubeCon that I attended in 2018, in Seattle, they had a speed mentoring session. Now they call it "pod mentoring". I went to the session, and said, "hey, I want to contribute. I don't know where to start". And I got a lot of information on how to get started.

One of the places was SIG Release and the release team. I came back and diligently attended all the SIG Release meetings for four to six months. And in between, I applied to the Kubernetes release team — 1.14 and 1.15. I didn't get through. So I took a little bit of a break, and I focused on doing some documentation work. Then I applied for 1.18.

Since I was already working on some kinds of — not like full fledged "documentation" documentation, I still don't write. I eventually want to write something really nice and full fledged documentation like other awesome folks.

CRAIG BOX: You'll need a lot more than 48 hours in your day to do that.

SAVITHA RAGHUNATHAN: laughing That's how I applied for the docs role, because I know a little bit about the website. I've done a few pull requests and commits. That's how I got started. I applied for that one role, and I got selected for the 1.18 team. That's how my journey just took off.

And the next release, I was leading the documentation team. And as everyone knows, the pandemic hit. It was one of the longest releases. I could lean back on the community. I would just wait for the release team meetings.

It was my way of coping with the pandemic. It took my mind off. It was actually more than a release team, they were people. They were all people first, and we took care of each other. So it felt good.

And then, I became a release lead shadow for 1.20 and 1.21 because I wanted to know more. I wanted to learn more. I wasn't ready. I still don't feel ready, but I have led 1.22. So if I could do it, anyone could do it.

CRAIG BOX: How much of this work is day job?

SAVITHA RAGHUNATHAN: I am lucky to be blessed with an awesome team. I do most of my work after work, but there have been times where I have to take meetings and attend to immediate urgent stuff. During the time of exception requests and stuff like that, I take a little bit of time from my work.

My team has been wonderful: they support me in all possible ways, and the management as well. Other than the meetings, I don't do much of the work during the day job. It just takes my focus and attention away too much, and I end up having to spend a lot of time sitting in front of the computer, which I don't like.

Before the pandemic I had a good work life balance. I'd just go to work at 7:00, 7:30, and I'd be back by 4 o'clock. I never touched my laptop ever again. I left all work behind when I came home. So right now, I'm still learning how to get through.

I try to limit the amount of open source work that I do during work time. The release lead shadow and the release lead job — they require a lot of time, effort. So on average, I'd be spending two to three hours post work time on the release activities.

CRAIG BOX: Before the pandemic, everyone was worried that if we let people work from home, they wouldn't work enough. I think the opposite has actually happened, is that now we're worried that if we let people work from home, they will just get on the computer in the morning and you'll have to pry it out of their hands at midnight.

SAVITHA RAGHUNATHAN: Yeah, I think the productivity has increased at least twofold, I would say, for everyone, once they started working from home.

CRAIG BOX: But at the expense of work-life balance, though, because as you say, when you're sitting in the same chair in front of, perhaps, the same computer doing your MathWorks work and then your open source work, they kind of can blur into one perhaps?

SAVITHA RAGHUNATHAN: That is a challenge. I face it every day. But so many others are also facing it. I implemented a few little tricks to help me. When I used to come back home from work, the first thing I would do is remove my watch. That was an indication that OK, I'm done.

That's the thing that I still do. I just remove my watch, and I just keep it right where my workstation is. And I just close the door so that I never look back. Even going past the room, I don't get a glimpse of my work office. I start implementing tiny little things like that to avoid burnout.

I think I'm still facing a little bit of burnout. I don't know if I have fully recovered from it. I constantly feel like I need a vacation. And I could just take a vacation for like a month or two. If it's possible, I will just do it.

CRAIG BOX: I do hope that travel opens up for everyone as an opportunity because I know that, for a lot of people, it's not so much they've been working from home but they've been living at work. The idea of taking vacation effectively means, well, I've been stuck in the same place, if I've been under a lockdown. It's hard to justify that. It will be good as things improve worldwide for us to be able to start focusing more on mental health and perhaps getting away from the "everything room," as I sometimes call it.

SAVITHA RAGHUNATHAN: I'm totally looking forward to it. I hope that travel opens up and I could go home and I could meet my siblings and my aunt and my parents.

CRAIG BOX: Catch a cricket match?

SAVITHA RAGHUNATHAN: Yeah. Probably yes, if I have company and if there is anything interesting happening around the time. I don't mind going back to the Chepauk Stadium and catching a match or two.

CRAIG BOX: Let's turn now to the recently released Kubernetes 1.22. Congratulations on the launch.

SAVITHA RAGHUNATHAN: Thank you.

CRAIG BOX: Each launch comes with a theme and a mascot or a logo. What is the theme for this release?

SAVITHA RAGHUNATHAN: The theme for the release is reaching new peaks. I am fascinated with a lot of space travel and chasing stars, the Milky Way. The best place to do that is over the top of a mountain. So that is the release logo, basically. It's a mountain — Mount Rainier. On top of that, there is a Kubernetes flag, and it's overlooking the Milky Way.

It's also symbolic that with every release, that we are achieving something new, bigger, and better, and we are making the release awesome. So I just wanted to incorporate that into the team as to say, we are achieving new things with every release. That's the "reaching new peaks" theme.

CRAIG BOX: The last couple of releases have both been incrementally larger — as a result, perhaps, of the fact there are now only three releases per year rather than four. There were also changes to the process, where the work has been driven a lot more by the SIGs than by the release team having to go and ask the SIGs what was going on. What can you say about the size and scope of the 1.22 release?

SAVITHA RAGHUNATHAN: The 1.22 release is the largest release to date. We have 56 enhancements if I'm not wrong, and we have a good amount of features that's graduated as stable. You can now say that Kubernetes as a project has become more mature because you see new features coming in. At the same time, you see the features that weren't used getting deprecated — we have like three deprecations in this release.

Aside from that fact, we also have a big team that's supporting one of the longest releases. This is the first official release cycle after the cadence KEP got approved. Officially, we are at four months, even though 1.19 was six months, and 1.21 was like 3 and 1/2 months, I think, this is the first one after the official KEP approval.

CRAIG BOX: What changes did you make to the process knowing that you had that extra month?

SAVITHA RAGHUNATHAN: One of the things the community had asked for is more time for development. We tried to incorporate that in the release schedule. We had about six weeks between the enhancements freeze and the code freeze. That's one.

It might not be visible to everyone, but one of the things that I wanted to make sure of was the health of the team — since it was a long, long release, we had time to plan out, and not have everyone work during the weekends or during their evenings or time off. That actually helped everyone keep their sanity, and also in making good progress and delivering good results at the end of the release. That's one of the process improvements that I'd call out.

We got better by making a post during the exception request process. Everyone works around the world. People from the UK start a little earlier than the people in the US East Coast. The West Coast starts three hours later than the East Coast. We used to make a post every Friday evening saying "hey, we actually received this many requests. We have addressed a number of them. We are waiting on a couple, or whatever. All the release team members are done for the day. We will see you around on Monday. Have a good weekend." Something like that.

We set the expectations from the community as well. We understand things are really important and urgent, but we are done. This gave everyone their time back. They don't have to worry over the weekend thinking like, hey, what's happening? What's happening in the release? They could spend time with their family, or they could do whatever they want to do, like go on a hike, or just sit and watch TV.

There have been weekends that I just did that. I just binge-watched a series. That's what I did.

CRAIG BOX: Any recommendations?

SAVITHA RAGHUNATHAN: I'm a big fan of Marvel, so I have watched the new Loki, which I really love. Loki is one of my favourite characters in Marvel. And I also liked WandaVision. That was good, too.

CRAIG BOX: I've not seen Loki yet, but I've heard it described as the best series of Doctor Who in the last few years.

SAVITHA RAGHUNATHAN: Really?

CRAIG BOX: There must be an element of time-travelling in there if that's how people are describing it.

SAVITHA RAGHUNATHAN: You should really go and watch it whenever you have time. It's really amazing. I might go back and watch it again because I might have missed bits and pieces. That always happens in Marvel movies and the episodes; you need to watch them a couple of times to catch, "oh, this is how they relate".

CRAIG BOX: Yes, the mark of good media that you want to immediately go back and watch it again once you've seen it.

Let's look now at some of the new features in Kubernetes 1.22. A couple of things that have graduated to general availability — server-side apply, external credential providers, a couple of new security features — the replacement for pod security policy has been announced, and seccomp is now available by default.

Do you have any favourite features in 1.22 that you'd like to discuss?

SAVITHA RAGHUNATHAN: I have a lot of them. All my favourite features are related to security. OK, one of them is not security, but a major theme of my favourite KEPs is security. I'll start with the default seccomp. I think it will help make clusters secure by default, and may assist in preventing more vulnerabilities, which means less headaches for the cluster administrators.

This is close to my heart because the base of the MathWorks platform is provisioning Kubernetes clusters. Knowing that they are secure by default will definitely provide me with some good sleep. And also, I'm paranoid about security most of the time. I'm super interested in making everything secure. It might get in the way of making the users of the platform angry because it's not usable in any way.

My next one is rootless Kubelet. That feature's going to enable the cluster admin, the platform developers to deploy Kubernetes components to run in a user namespace. And I think that is also a great addition.

Like you mention, the most awaited drop in for the PSP replacement is here. It's pod admission control. It lets cluster admins apply the pod security standards. And I think it's just not related to the cluster admins. I might have to go back and check on that. Anyone can probably use it — the developers and the admins alike.

It also supports various modes, which is most welcome. There are times where you don't want to just cut the users off because they are trying to do something which is not securely correct. You just want to warn them, hey, this is what you are doing. This might just cause a security issue later, so you might want to correct it. But you just don't want to cut them off from using the platform, or them trying to attempt to do something — deploy their workload and get their day-to-day job done. That is something that I really like, that it also supports a warning mechanism.

Another one which is not security is node swap support. Kubernetes didn't have support for swap before, but it is taken into consideration now. This is an alpha feature. With this, you can take advantage of the swap, which is provisioned on the Linux VMs.

Some of the workloads — when they are deployed, they might need a lot of swap for the start-up — example, like Node and Java applications, which I just took out of their KEP user stories. So if anyone's interested, they can go and look in the KEP. That's useful. And it also increases the node stability and whatnot. So I think it's going to be beneficial for a lot of folks.

We know how Java and containers work. I think it has gotten better, but five years ago, it was so hard to get a Java application to fit in a small container. It always needed a lot of memory, swap, and everything to start up and run. I think this will help the users and help the admins and keep the cost low, and it will tie into so many other things as well. I'm excited about that feature.

Another feature that I want to just call out — I don't use Windows that much, but I just want to give a shout out to the folks who are doing an amazing job bringing all the Kubernetes features to Windows as well, to give a seamless experience.

One of the things is Windows privileged containers. I think it went alpha this release. And that is a wonderful addition, if you ask me. It can take advantage of whatever that's happening on the Linux side. And they can also port it over and see, OK, I can now run Windows containers in a privileged mode.

So whatever they are trying to achieve, they can do it. So that's a noteworthy mention. I need to give a shout out for the folks who work and make things happen in the Windows ecosystem as well.

CRAIG BOX: One of the things that's great about the release process is the continuity between groups and teams. There's always an emeritus advisor who was a lead from a previous release. One thing that I always ask when I do these interviews is, what is the advice that you give to the next person? When we talked to Nabarun for the 1.21 interview, he said that his advice to you would be "do, delegate, and defer". Figure out what you can do, figure out what you can ask other people to do, and figure out what doesn't need to be done. Were you able to take that advice on board?

SAVITHA RAGHUNATHAN: Yeah, you won't believe it. I have it right here stuck to my monitor.

CRAIG BOX: Next to your Git cheat sheet?

SAVITHA RAGHUNATHAN: laughs Absolutely. I just have it stuck there. I just took a look at it.

CRAIG BOX: Someone that you will have been able to delegate and defer to is Rey Lejano from Rancher Labs and SUSE, who is the release lead to be for 1.23.

SAVITHA RAGHUNATHAN: I want to tell Rey to beware of the team's mental health. Schedule in such a way that it avoids burnout. Check in, and make sure that everyone is doing good. If they need some kind of help, create a safe space where they can actually ask for help, if they want to step back, if they need someone to cover.

I think that is most important. The releases are successful based on the thousands and thousands of contributors. But when it comes to a release team, you need to have a healthy team where people feel they are in a good place and they just want to make good contributions, which means they want to be heard. That's one thing that I want to tell Rey.

Also collaborate and learn from each other. I constantly learn. I think the team was 39 folks, including me. Every day I learned something or the other, even starting from how to interact.

Sometimes I have learned more leadership skills from my release lead shadows. They are awesome, and they are mature. I constantly learn from them, and I admire them a lot.

It also helps to have good, strong individuals in the team who can step up and help when needed. For example, unfortunately, we lost one of our teammates after the start of the release cycle. That was tragic. His name was Peeyush Gupta. He was an awesome and wonderful human — very warm.

I didn't get more of a chance to interact with him. I had exchanged a few Slack messages, but I got his warm personality. I just want to take a couple of seconds to remember him. He was awesome.

After we lost him, we had this strong person from the team step up and lead the communications, who had never been a part of the release team before at all. He was a shadow for the first time. His name is Jesse Butler. So he stepped up, and he just took it away. He ran the comms show for 1.22.

That's what the community is about. You take care of team members, and the team will take care of you. So that's one other thing that I want to let Rey know, and maybe whoever — I think it's applicable overall.

CRAIG BOX: There's a link to a family education fund for Peeyush Gupta, which you can find in the show notes.

Five releases in a row now you've been a member of the release team. Will you be putting your feet up now for 1.23?

SAVITHA RAGHUNATHAN: I am going to take a break for a while. In the future, I want to be contributing, if not the release team, the SIG Release and the release management effort. But right now, I have been there for five releases. And I feel like, OK, I just need a little bit of fresh air.

And also the pandemic and the burnout has caught up, so I'm going to take a break from certain contributions. You will see me in the future. I will be around, but I might not be actively participating in the release team activities. I will be around the community. Anyone can reach out to me. They all know my Slack, so they can just reach out to me via Slack or Twitter.

CRAIG BOX: Yes, your Twitter handle is CoffeeArtGirl. Does that mean that you'll be spending some time working on your lattes?

SAVITHA RAGHUNATHAN: I am very bad at making lattes. The coffee art means that I used to make art with coffee. You get instant coffee powder and just mix it with water. You get the colours, very beautiful brown colours. I used to make art using that.

And I love coffee. So I just combined all the words together. And I had to come up with it in a span of one hour or so because I was joining this 'meet our contributors' panel. And Paris asked me, "do you have a Twitter handle?" I was planning to create one, but I didn't have the time.

I'm like, well, let me just think what I could just come up with real quick. So I just came up with that. So that's the story behind my Twitter handle. Everyone's interested in it. You are not the first person you have asked me or mentioned about it. So many others are like, why coffee art?

CRAIG BOX: And you are also interested in art with perhaps other materials?

SAVITHA RAGHUNATHAN: Yes. My interests keep changing. I used to do pebble art. It's just collecting pebbles from wherever I go, and I used to paint on them. I used to use watercolour, but I want to come back to watercolour sometime.

My recent interests are coloured pencils, which came back. When I was very young, I used to do a lot of coloured pencils. And then I switched to watercolours and oil painting. So I just go around in circles.

One of the hobbies that I picked up during a pandemic is crochet. I made a scarf for Mother's Day. My mum and my dad were here last year. They got stuck because of the pandemic, and they couldn't go back home. So they stayed with me for 10 months. That is the jackpot that I had, that I got to spend so much time with my parents after I moved to the US.

CRAIG BOX: And they got rewarded with a scarf.

SAVITHA RAGHUNATHAN: Yeah.

CRAIG BOX: One to share between them.

SAVITHA RAGHUNATHAN: I started making a blanket for my dad. And it became so heavy, I might have to just pick up some lighter yarn. I still don't know the differences between different kinds of yarns, but I'm getting better.

I started out because I wanted to make these little toys. They call them amigurumi in the crochet world. I wanted to make them. That's why I started out. I'm trying. I made a little cat which doesn't look like a cat, but it is a cat. I have to tell everyone that it's a cat so that they don't mock me later, but.

CRAIG BOX: It's an artistic interpretation of a cat.

SAVITHA RAGHUNATHAN: It definitely is!


Savitha Raghunathan, now a Senior Software Engineer at Red Hat, served as the Kubernetes 1.22 release team lead.

You can find the Kubernetes Podcast from Google at @KubernetesPod on Twitter, and you can subscribe so you never miss an episode.

Quality-of-Service for Memory Resources

Kubernetes v1.22, released in August 2021, introduced a new alpha feature that improves how Linux nodes implement memory resource requests and limits.

In prior releases, Kubernetes did not support memory quality guarantees. For example, if you set container resources as follows:

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "64Mi"
        cpu: "500m"

spec.containers[].resources.requests(e.g. cpu, memory) is designed for scheduling. When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on. Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node.

spec.containers[].resources.limits is passed to the container runtime when the kubelet starts a container. CPU is considered a "compressible" resource. If your app starts hitting your CPU limits, Kubernetes starts throttling your container, giving your app potentially worse performance. However, it won’t be terminated. That is what "compressible" means.

In cgroup v1, and prior to this feature, the container runtime never took into account and effectively ignored spec.containers[].resources.requests["memory"]. This is unlike CPU, in which the container runtime consider both requests and limits. Furthermore, memory actually can't be compressed in cgroup v1. Because there is no way to throttle memory usage, if a container goes past its memory limit it will be terminated by the kernel with an OOM (Out of Memory) kill.

Fortunately, cgroup v2 brings a new design and implementation to achieve full protection on memory. The new feature relies on cgroups v2 which most current operating system releases for Linux already provide. With this experimental feature, quality-of-service for pods and containers extends to cover not just CPU time but memory as well.

How does it work?

Memory QoS uses the memory controller of cgroup v2 to guarantee memory resources in Kubernetes. Memory requests and limits of containers in pod are used to set specific interfaces memory.min and memory.high provided by the memory controller. When memory.min is set to memory requests, memory resources are reserved and never reclaimed by the kernel; this is how Memory QoS ensures the availability of memory for Kubernetes pods. And if memory limits are set in the container, this means that the system needs to limit container memory usage, Memory QoS uses memory.high to throttle workload approaching it's memory limit, ensuring that the system is not overwhelmed by instantaneous memory allocation.

The following table details the specific functions of these two parameters and how they correspond to Kubernetes container resources.

FileDescription
memory.minmemory.min specifies a minimum amount of memory the cgroup must always retain, i.e., memory that can never be reclaimed by the system. If the cgroup's memory usage reaches this low limit and can’t be increased, the system OOM killer will be invoked.

We map it to the container's memory request
memory.highmemory.high is the memory usage throttle limit. This is the main mechanism to control a cgroup's memory use. If a cgroup's memory use goes over the high boundary specified here, the cgroup’s processes are throttled and put under heavy reclaim pressure. The default is max, meaning there is no limit.

We use a formula to calculate memory.high, depending on container's memory limit or node allocatable memory (if container's memory limit is empty) and a throttling factor. Please refer to the KEP for more details on the formula.

When container memory requests are made, kubelet passes memory.min to the back-end CRI runtime (possibly containerd, cri-o) via the Unified field in CRI during container creation. The memory.min in container level cgroup will be set to:


i: the ith container in one pod

Since the memory.min interface requires that the ancestor cgroup directories are all set, the pod and node cgroup directories need to be set correctly.

memory.min in pod level cgroup:

i: the ith container in one pod

memory.min in node level cgroup:

i: the ith pod in one node, j: the jth container in one pod

Kubelet will manage the cgroup hierarchy of the pod level and node level cgroups directly using runc libcontainer library, while container cgroup limits are managed by the container runtime.

For memory limits, in addition to the original way of limiting memory usage, Memory QoS adds an additional feature of throttling memory allocation. A throttling factor is introduced as a multiplier (default is 0.8). If the result of multiplying memory limits by the factor is greater than memory requests, kubelet will set memory.high to the value and use Unified via CRI. And if the container does not specify memory limits, kubelet will use node allocatable memory instead. The memory.high in container level cgroup is set to:


i: the ith container in one pod

This can can help improve stability when pod memory usage increases, ensuring that memory is throttled as it approaches the memory limit.

How do I use it?

Here are the prerequisites for enabling Memory QoS on your Linux node, some of these are related to Kubernetes support for cgroup v2.

  1. Kubernetes since v1.22
  2. runc since v1.0.0-rc93; containerd since 1.4; cri-o since 1.20
  3. Linux kernel minimum version: 4.15, recommended version: 5.2+
  4. Linux image with cgroupv2 enabled or enabling cgroupv2 unified_cgroup_hierarchy manually

OCI runtimes such as runc and crun already support cgroups v2 Unified, and Kubernetes CRI has also made the desired changes to support passing Unified. However, CRI Runtime support is required as well. Memory QoS in Alpha phase is designed to support containerd and cri-o. Related PR Feature: containerd-cri support LinuxContainerResources.Unified #5627 has been merged and will be released in containerd 1.6. CRI-O implement kube alpha features for 1.22 #5207 is still in WIP.

With those prerequisites met, you can enable the memory QoS feature gate (see Set kubelet parameters via a config file).

How can I learn more?

You can find more details as follows:

How do I get involved?

You can reach SIG Node by several means:

You can also contact me directly:

Dockershim removal is coming. Are you ready?

Reviewers: Davanum Srinivas, Elana Hashman, Noah Kantrowitz, Rey Lejano.

Last year we announced that Kubernetes' dockershim component (which provides a built-in integration for Docker Engine) is deprecated.

Update: There's a Dockershim Deprecation FAQ with more information, and you can also discuss the deprecation via a dedicated GitHub issue.

Our current plan is to remove dockershim from the Kubernetes codebase soon. We are looking for feedback from you whether you are ready for dockershim removal and to ensure that you are ready when the time comes.

Please fill out this survey: https://forms.gle/svCJmhvTv78jGdSx8

The dockershim component that enables Docker as a Kubernetes container runtime is being deprecated in favor of runtimes that directly use the Container Runtime Interface created for Kubernetes. Many Kubernetes users have migrated to other container runtimes without problems. However we see that dockershim is still very popular. You may see some public numbers in recent Container Report from DataDog. Some Kubernetes hosting vendors just recently enabled other runtimes support (especially for Windows nodes). And we know that many third party tools vendors are still not ready: migrating telemetry and security agents.

At this point, we believe that there is feature parity between Docker and the other runtimes. Many end-users have used our migration guide and are running production workload using these different runtimes. The plan of record today is that dockershim will be removed in version 1.24, slated for release around April of next year. For those developing or running alpha and beta versions, dockershim will be removed in December at the beginning of the 1.24 release development cycle.

There is only one month left to give us feedback. We want you to tell us how ready you are.

We are collecting opinions through this survey: https://forms.gle/svCJmhvTv78jGdSx8 To better understand preparedness for the dockershim removal, our survey is asking the version of Kubernetes you are currently using, and an estimate of when you think you will adopt Kubernetes 1.24. All the aggregated information on dockershim removal readiness will be published. Free form comments will be reviewed by SIG Node leadership. If you want to discuss any details of migrating from dockershim, report bugs or adoption blockers, you can use one of the SIG Node contact options any time: https://github.com/kubernetes/community/tree/master/sig-node#contact

Kubernetes is a mature project. This deprecation is another step in the effort to get away from permanent beta features and providing more stability and compatibility guarantees. With the migration from dockershim you will get more flexibility and choice of container runtime features as well as less dependencies of your apps on specific underlying technology. Please take time to review the dockershim migration documentation and consult your Kubernetes hosting vendor (if you have one) what container runtime options are available for you. Read up container runtime documentation with instructions on how to use containerd and CRI-O to help prepare you when you're ready to upgrade to 1.24. CRI-O, containerd, and Docker with Mirantis cri-dockerd are not the only container runtime options, we encourage you to explore the CNCF landscape on container runtimes in case another suits you better.

Thank you!

Non-root Containers And Devices

The user/group ID related security settings in Pod's securityContext trigger a problem when users want to deploy containers that use accelerator devices (via Kubernetes Device Plugins) on Linux. In this blog post I talk about the problem and describe the work done so far to address it. It's not meant to be a long story about getting the k/k issue fixed.

Instead, this post aims to raise awareness of the issue and to highlight important device use-cases too. This is needed as Kubernetes works on new related features such as support for user namespaces.

Why non-root containers can't use devices and why it matters

One of the key security principles for running containers in Kubernetes is the principle of least privilege. The Pod/container securityContext specifies the config options to set, e.g., Linux capabilities, MAC policies, and user/group ID values to achieve this.

Furthermore, the cluster admins are supported with tools like PodSecurityPolicy (deprecated) or Pod Security Admission (alpha) to enforce the desired security settings for pods that are being deployed in the cluster. These settings could, for instance, require that containers must be runAsNonRoot or that they are forbidden from running with root's group ID in runAsGroup or supplementalGroups.

In Kubernetes, the kubelet builds the list of Device resources to be made available to a container (based on inputs from the Device Plugins) and the list is included in the CreateContainer CRI message sent to the CRI container runtime. Each Device contains little information: host/container device paths and the desired devices cgroups permissions.

The OCI Runtime Spec for Linux Container Configuration expects that in addition to the devices cgroup fields, more detailed information about the devices must be provided:

{
        "type": "<string>",
        "path": "<string>",
        "major": <int64>,
        "minor": <int64>,
        "fileMode": <uint32>,
        "uid": <uint32>,
        "gid": <uint32>
},

The CRI container runtimes (containerd, CRI-O) are responsible for obtaining this information from the host for each Device. By default, the runtimes copy the host device's user and group IDs:

  • uid (uint32, OPTIONAL) - id of device owner in the container namespace.
  • gid (uint32, OPTIONAL) - id of device group in the container namespace.

Similarly, the runtimes prepare other mandatory config.json sections based on the CRI fields, including the ones defined in securityContext: runAsUser/runAsGroup, which become part of the POSIX platforms user structure via:

  • uid (int, REQUIRED) specifies the user ID in the container namespace.
  • gid (int, REQUIRED) specifies the group ID in the container namespace.
  • additionalGids (array of ints, OPTIONAL) specifies additional group IDs in the container namespace to be added to the process.

However, the resulting config.json triggers a problem when trying to run containers with both devices added and with non-root uid/gid set via runAsUser/runAsGroup: the container user process has no permission to use the device even when its group id (gid, copied from host) was permissive to non-root groups. This is because the container user does not belong to that host group (e.g., via additionalGids).

Being able to run applications that use devices as non-root user is normal and expected to work so that the security principles can be met. Therefore, several alternatives were considered to get the gap filled with what the PodSec/CRI/OCI supports today.

What was done to solve the issue?

You might have noticed from the problem definition that it would at least be possible to workaround the problem by manually adding the device gid(s) to supplementalGroups, or in the case of just one device, set runAsGroup to the device's group id. However, this is problematic because the device gid(s) may have different values depending on the nodes' distro/version in the cluster. For example, with GPUs the following commands for different distros and versions return different gids:

Fedora 33:

$ ls -l /dev/dri/
total 0
drwxr-xr-x. 2 root root         80 19.10. 10:21 by-path
crw-rw----+ 1 root video  226,   0 19.10. 10:42 card0
crw-rw-rw-. 1 root render 226, 128 19.10. 10:21 renderD128
$ grep -e video -e render /etc/group
video:x:39:
render:x:997:

Ubuntu 20.04:

$ ls -l /dev/dri/
total 0
drwxr-xr-x 2 root root         80 19.10. 17:36 by-path
crw-rw---- 1 root video  226,   0 19.10. 17:36 card0
crw-rw---- 1 root render 226, 128 19.10. 17:36 renderD128
$ grep -e video -e render /etc/group
video:x:44:
render:x:133:

Which number to choose in your securityContext? Also, what if the runAsGroup/runAsUser values cannot be hard-coded because they are automatically assigned during pod admission time via external security policies?

Unlike volumes with fsGroup, the devices have no official notion of deviceGroup/deviceUser that the CRI runtimes (or kubelet) would be able to use. We considered using container annotations set by the device plugins (e.g., io.kubernetes.cri.hostDeviceSupplementalGroup/) to get custom OCI config.json uid/gid values. This would have required changes to all existing device plugins which was not ideal.

Instead, a solution that is seamless to end-users without getting the device plugin vendors involved was preferred. The selected approach was to re-use runAsUser and runAsGroup values in config.json for devices:

{
        "type": "c",
        "path": "/dev/foo",
        "major": 123,
        "minor": 4,
        "fileMode": 438,
        "uid": <runAsUser>,
        "gid": <runAsGroup>
},

With runc OCI runtime (in non-rootless mode), the device is created (mknod(2)) in the container namespace and the ownership is changed to runAsUser/runAsGroup using chmod(2).

Note:

Rootless mode and devices is not supported.
Having the ownership updated in the container namespace is justified as the user process is the only one accessing the device. Only runAsUser/runAsGroup are taken into account, and, e.g., the USER setting in the container is currently ignored.

While it is likely that the "faulty" deployments (i.e., non-root securityContext + devices) do not exist, to be absolutely sure no deployments break, an opt-in config entry in both containerd and CRI-O to enable the new behavior was added. The following:

device_ownership_from_security_context (bool)

defaults to false and must be enabled to use the feature.

See non-root containers using devices after the fix

To demonstrate the new behavior, let's use a Data Plane Development Kit (DPDK) application using hardware accelerators, Kubernetes CPU manager, and HugePages as an example. The cluster runs containerd with:

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    device_ownership_from_security_context = true

or CRI-O with:

[crio.runtime]
device_ownership_from_security_context = true

and the Guaranteed QoS Class Pod that runs DPDK's crypto-perf test utility with this YAML:

...
metadata:
  name: qat-dpdk
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 2000
    fsGroup: 3000
  containers:
  - name: crypto-perf
    image: intel/crypto-perf:devel
    ...
    resources:
      requests:
        cpu: "3"
        memory: "128Mi"
        qat.intel.com/generic: '4'
        hugepages-2Mi: "128Mi"
      limits:
        cpu: "3"
        memory: "128Mi"
        qat.intel.com/generic: '4'
        hugepages-2Mi: "128Mi"
  ...

To verify the results, check the user and group ID that the container runs as:

$ kubectl exec -it qat-dpdk -c crypto-perf -- id

They are set to non-zero values as expected:

uid=1000 gid=2000 groups=2000,3000

Next, check the device node permissions (qat.intel.com/generic exposes /dev/vfio/ devices) are accessible to runAsUser/runAsGroup:

$ kubectl exec -it qat-dpdk -c crypto-perf -- ls -la /dev/vfio
total 0
drwxr-xr-x 2 root root      140 Sep  7 10:55 .
drwxr-xr-x 7 root root      380 Sep  7 10:55 ..
crw------- 1 1000 2000 241,   0 Sep  7 10:55 58
crw------- 1 1000 2000 241,   2 Sep  7 10:55 60
crw------- 1 1000 2000 241,  10 Sep  7 10:55 68
crw------- 1 1000 2000 241,  11 Sep  7 10:55 69
crw-rw-rw- 1 1000 2000  10, 196 Sep  7 10:55 vfio

Finally, check the non-root container is also allowed to create HugePages:

$ kubectl exec -it qat-dpdk -c crypto-perf -- ls -la /dev/hugepages/

fsGroup gives a runAsUser writable HugePages emptyDir mountpoint:

total 0
drwxrwsr-x 2 root 3000   0 Sep  7 10:55 .
drwxr-xr-x 7 root root 380 Sep  7 10:55 ..

Help us test it and provide feedback!

The functionality described here is expected to help with cluster security and the configurability of device permissions. To allow non-root containers to use devices requires cluster admins to opt-in to the functionality by setting device_ownership_from_security_context = true. To make it a default setting, please test it and provide your feedback (via SIG-Node meetings or issues)! The flag is available in CRI-O v1.22 release and queued for containerd v1.6.

More work is needed to get it properly supported. It is known to work with runc but it also needs to be made to function with other OCI runtimes too, where applicable. For instance, Kata Containers supports device passthrough and allows it to make devices available to containers in VM sandboxes too.

Moreover, the additional challenge comes with support of user names and devices. This problem is still open and requires more brainstorming.

Finally, it needs to be understood whether runAsUser/runAsGroup are enough or if device specific settings similar to fsGroups are needed in PodSpec/CRI v2.

Thanks

My thanks goes to Mike Brown (IBM, containerd), Peter Hunt (Redhat, CRI-O), and Alexander Kanevskiy (Intel) for providing all the feedback and good conversations.

Announcing the 2021 Steering Committee Election Results

The 2021 Steering Committee Election is now complete. The Kubernetes Steering Committee consists of 7 seats, 4 of which were up for election in 2021. Incoming committee members serve a term of 2 years, and all members are elected by the Kubernetes Community.

This community body is significant since it oversees the governance of the entire Kubernetes project. With that great power comes great responsibility. You can learn more about the steering committee’s role in their charter.

Results

Congratulations to the elected committee members whose two year terms begin immediately (listed in alphabetical order by GitHub handle):

They join continuing members:

Paris Pittman and Christoph Blecker are returning Steering Committee Members.

Big Thanks

Thank you and congratulations on a successful election to this round’s election officers:

Special thanks to Arnaud Meukam (@ameukam), k8s-infra liaison, who enabled our voting software on community-owned infrastructure.

Thanks to the Emeritus Steering Committee Members. Your prior service is appreciated by the community:

And thank you to all the candidates who came forward to run for election.

Get Involved with the Steering Committee

This governing body, like all of Kubernetes, is open to all. You can follow along with Steering Committee backlog items and weigh in by filing an issue or creating a PR against their repo. They have an open meeting on the first Monday at 9:30am PT of every month and regularly attend Meet Our Contributors. They can also be contacted at their public mailing list steering@kubernetes.io.

You can see what the Steering Committee meetings are all about by watching past meetings on the YouTube Playlist.


This post was written by the Upstream Marketing Working Group. If you want to write stories about the Kubernetes community, learn more about us.

Use KPNG to Write Specialized kube-proxiers

The post will show you how to create a specialized service kube-proxy style network proxier using Kubernetes Proxy NG kpng without interfering with the existing kube-proxy. The kpng project aims at renewing the the default Kubernetes Service implementation, the "kube-proxy". An important feature of kpng is that it can be used as a library to create proxiers outside K8s. While this is useful for CNI-plugins that replaces the kube-proxy it also opens the possibility for anyone to create a proxier for a special purpose.

Define a service that uses a specialized proxier

apiVersion: v1
kind: Service
metadata:
  name: kpng-example
  labels:
    service.kubernetes.io/service-proxy-name: kpng-example
spec:
  clusterIP: None
  ipFamilyPolicy: RequireDualStack
  externalIPs:
  - 10.0.0.55
  - 1000::55
  selector:
    app: kpng-alpine
  ports:
  - port: 6000

If the service.kubernetes.io/service-proxy-name label is defined the kube-proxy will ignore the service. A custom controller can watch services with the label set to its own name, "kpng-example" in this example, and setup specialized load-balancing.

The service.kubernetes.io/service-proxy-name label is not new, but so far is has been quite hard to write a specialized proxier.

The common use for a specialized proxier is assumed to be handling external traffic for some use-case not supported by K8s. In that case ClusterIP is not needed, so we use a "headless" service in this example.

Specialized proxier using kpng

A kpng based proxier consists of the kpng controller handling all the K8s api related functions, and a "backend" implementing the load-balancing. The backend can be linked with the kpng controller binary or be a separate program communicating with the controller using gRPC.

kpng kube --service-proxy-name=kpng-example to-api

This starts the kpng controller and tell it to watch only services with the "kpng-example" service proxy name. The "to-api" parameter will open a gRPC server for backends.

You can test this yourself outside your cluster. Please see the example below.

Now we start a backend that simply prints the updates from the controller.

$ kubectl apply -f kpng-example.yaml
$ kpng-json | jq     # (this is the backend)
{
  "Service": {
    "Namespace": "default",
    "Name": "kpng-example",
    "Type": "ClusterIP",
    "IPs": {
      "ClusterIPs": {},
      "ExternalIPs": {
        "V4": [
          "10.0.0.55"
        ],
        "V6": [
          "1000::55"
        ]
      },
      "Headless": true
    },
    "Ports": [
      {
        "Protocol": 1,
        "Port": 6000,
        "TargetPort": 6000
      }
    ]
  },
  "Endpoints": [
    {
      "IPs": {
        "V6": [
          "1100::202"
        ]
      },
      "Local": true
    },
    {
      "IPs": {
        "V4": [
          "11.0.2.2"
        ]
      },
      "Local": true
    },
    {
      "IPs": {
        "V4": [
          "11.0.1.2"
        ]
      }
    },
    {
      "IPs": {
        "V6": [
          "1100::102"
        ]
      }
    }
  ]
}

A real backend would use some mechanism to load-balance traffic from the external IPs to the endpoints.

Writing a backend

The kpng-json backend looks like this:

package main
import (
        "os"
        "encoding/json"
        "sigs.k8s.io/kpng/client"
)
func main() {
        client.Run(jsonPrint)
}
func jsonPrint(items []*client.ServiceEndpoints) {
        enc := json.NewEncoder(os.Stdout)
        for _, item := range items {
                _ = enc.Encode(item)
        }
}

(yes, that is the entire program)

A real backend would of course be much more complex, but this illustrates how kpng let you focus on load-balancing.

You can have several backends connected to a kpng controller, so during development or debug it can be useful to let something like the kpng-json backend run in parallel with your real backend.

Example

The complete example can be found here.

As an example we implement an "all-ip" backend. It direct all traffic for the externalIPs to a local endpoint, regardless of ports and upper layer protocols. There is a KEP for this function and this example is a much simplified version.

To direct all traffic from an external address to a local POD only one iptables rule is needed, for instance;

ip6tables -t nat -A PREROUTING -d 1000::55/128 -j DNAT --to-destination 1100::202

As you can see the addresses are in the call to the backend and all it have to do is:

  • Extract the addresses with Local: true
  • Setup iptables rules for the ExternalIPs

A script doing that may look like:

xip=$(cat /tmp/out | jq -r .Service.IPs.ExternalIPs.V6[0])
podip=$(cat /tmp/out | jq -r '.Endpoints[]|select(.Local == true)|select(.IPs.V6 != null)|.IPs.V6[0]')
ip6tables -t nat -A PREROUTING -d $xip/128 -j DNAT --to-destination $podip

Assuming the JSON output above is stored in /tmp/out (jq is an awesome program!).

As this is an example we make it really simple for ourselves by using a minor variation of the kpng-json backend above. Instead of just printing, a program is called and the JSON output is passed as stdin to that program. The backend can be tested stand-alone:

CALLOUT=jq kpng-callout

Where jq can be replaced with your own program or script. A script may look like the example above. For more info and the complete example please see https://github.com/kubernetes-sigs/kpng/tree/master/examples/pipe-exec.

Summary

While kpng is in early stage of development this post wants to show how you may build your own specialized K8s proxiers in the future. The only thing your applications need to do is to add the service.kubernetes.io/service-proxy-name label in the Service manifest.

It is a tedious process to get new features into the kube-proxy and it is not unlikely that they will be rejected, so to write a specialized proxier may be the only option.

Introducing ClusterClass and Managed Topologies in Cluster API

The Cluster API community is happy to announce the implementation of ClusterClass and Managed Topologies, a new feature that will greatly simplify how you can provision, upgrade, and operate multiple Kubernetes clusters in a declarative way.

A little bit of context…

Before getting into the details, let's take a step back and look at the history of Cluster API.

The Cluster API project started three years ago, and the first releases focused on extensibility and implementing a declarative API that allows a seamless experience across infrastructure providers. This was a success with many cloud providers: AWS, Azure, Digital Ocean, GCP, Metal3, vSphere and still counting.

With extensibility addressed, the focus shifted to features, like automatic control plane and etcd management, health-based machine remediation, machine rollout strategies and more.

Fast forwarding to 2021, with lots of companies using Cluster API to manage fleets of Kubernetes clusters running workloads in production, the community focused its effort on stabilization of both code, APIs, documentation, and on extensive test signals which inform Kubernetes releases.

With solid foundations in place, and a vibrant and welcoming community that still continues to grow, it was time to plan another iteration on our UX for both new and advanced users.

Enter ClusterClass and Managed Topologies, tada!

ClusterClass

As the name suggests, ClusterClass and managed topologies are built in two parts.

The idea behind ClusterClass is simple: define the shape of your cluster once, and reuse it many times, abstracting the complexities and the internals of a Kubernetes cluster away.

Defining a ClusterClass

ClusterClass, at its heart, is a collection of Cluster and Machine templates. You can use it as a “stamp” that can be leveraged to create many clusters of a similar shape.

---
apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: my-amazing-cluster-class
spec:
  controlPlane:
    ref:
      apiVersion: controlplane.cluster.x-k8s.io/v1beta1
      kind: KubeadmControlPlaneTemplate
      name: high-availability-control-plane
    machineInfrastructure:
      ref:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: DockerMachineTemplate
        name: control-plane-machine
  workers:
    machineDeployments:
      - class: type1-workers
        template:
          bootstrap:
            ref:
              apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
              kind: KubeadmConfigTemplate
              name: type1-bootstrap
          infrastructure:
            ref:
              apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
              kind: DockerMachineTemplate
              name: type1-machine
      - class: type2-workers
        template:
          bootstrap:
            ref:
              apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
              kind: KubeadmConfigTemplate
              name: type2-bootstrap
          infrastructure:
            ref:
              kind: DockerMachineTemplate
              apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
              name: type2-machine
  infrastructure:
    ref:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: DockerClusterTemplate
      name: cluster-infrastructure

The possibilities are endless; you can get a default ClusterClass from the community, “off-the-shelf” classes from your vendor of choice, “certified” classes from the platform admin in your company, or even create custom ones for advanced scenarios.

Managed Topologies

Managed Topologies let you put the power of ClusterClass into action.

Given a ClusterClass, you can create many Clusters of a similar shape by providing a single resource, the Cluster.

Create a Cluster with ClusterClass

Here is an example:

---
apiVersion: cluster.x-k8s.io/v1beta1
 kind: Cluster
 metadata:
   name: my-amazing-cluster
   namespace: bar
 spec:
   topology: # define a managed topology
     class: my-amazing-cluster-class # use the ClusterClass mentioned earlier
     version: v1.21.2
     controlPlane:
       replicas: 3
     workers:
       machineDeployments:
       - class: type1-workers
         name: big-pool-of-machines
         replicas: 5
       - class: type2-workers
         name: small-pool-of-machines
         replicas: 1

But there is more than simplified cluster creation. Now the Cluster acts as a single control point for your entire topology.

All the power of Cluster API, extensibility, lifecycle automation, stability, all the features required for managing an enterprise grade Kubernetes cluster on the infrastructure provider of your choice are now at your fingertips: you can create your Cluster, add new machines, upgrade to the next Kubernetes version, and all from a single place.

It is just as simple as it looks!

What’s next

While the amazing Cluster API community is working hard to deliver the first version of ClusterClass and managed topologies later this year, we are already looking forward to what comes next for the project and its ecosystem.

There are a lot of great ideas and opportunities ahead!

We want to make managed topologies even more powerful and flexible, allowing users to dynamically change bits of a ClusterClass according to the specific needs of a Cluster; this will ensure the same simple and intuitive UX for solving complex problems like e.g. selecting machine image for a specific Kubernetes version and for a specific region of your infrastructure provider, or injecting proxy configurations in the entire Cluster, and so on.

Stay tuned for what comes next, and if you have any questions, comments or suggestions:

A Closer Look at NSA/CISA Kubernetes Hardening Guidance

Background

USA's National Security Agency (NSA) and the Cybersecurity and Infrastructure Security Agency (CISA) released Kubernetes Hardening Guidance on August 3rd, 2021. The guidance details threats to Kubernetes environments and provides secure configuration guidance to minimize risk.

The following sections of this blog correlate to the sections in the NSA/CISA guidance. Any missing sections are skipped because of limited opportunities to add anything new to the existing content.

Note: This blog post is not a substitute for reading the guide. Reading the published guidance is recommended before proceeding as the following content is complementary.

Update, November 2023:

The National Security Agency (NSA) and the Cybersecurity and Infrastructure Security Agency (CISA) released the 1.0 version of the Kubernetes hardening guide in August 2021 and updated it based on industry feedback in March 2022 (version 1.1).

The most recent version of the Kubernetes hardening guidance was released in August 2022 with corrections and clarifications. Version 1.2 outlines a number of recommendations for hardening Kubernetes clusters.

Introduction and Threat Model

Note that the threats identified as important by the NSA/CISA, or the intended audience of this guidance, may be different from the threats that other enterprise users of Kubernetes consider important. This section is still useful for organizations that care about data, resource theft and service unavailability.

The guidance highlights the following three sources of compromises:

  • Supply chain risks
  • Malicious threat actors
  • Insider threats (administrators, users, or cloud service providers)

The threat model tries to take a step back and review threats that not only exist within the boundary of a Kubernetes cluster but also include the underlying infrastructure and surrounding workloads that Kubernetes does not manage.

For example, when a workload outside the cluster shares the same physical network, it has access to the kubelet and to control plane components: etcd, controller manager, scheduler and API server. Therefore, the guidance recommends having network level isolation separating Kubernetes clusters from other workloads that do not need connectivity to Kubernetes control plane nodes. Specifically, scheduler, controller-manager, etcd only need to be accessible to the API server. Any interactions with Kubernetes from outside the cluster can happen by providing access to API server port.

List of ports and protocols for each of these components are defined in Ports and Protocols within the Kubernetes documentation.

Special note: kube-scheduler and kube-controller-manager uses different ports than the ones mentioned in the guidance

The Threat modelling section from the CNCF Cloud Native Security Whitepaper + Map provides another perspective on approaching threat modelling Kubernetes, from a cloud native lens.

Kubernetes Pod security

Kubernetes by default does not guarantee strict workload isolation between pods running in the same node in a cluster. However, the guidance provides several techniques to enhance existing isolation and reduce the attack surface in case of a compromise.

"Non-root" containers and "rootless" container engines

Several best practices related to basic security principle of least privilege i.e. provide only the permissions are needed; no more, no less, are worth a second look.

The guide recommends setting non-root user at build time instead of relying on setting runAsUser at runtime in your Pod spec. This is a good practice and provides some level of defense in depth. For example, if the container image is built with user 10001 and the Pod spec misses adding the runAsuser field in its Deployment object. In this case there are certain edge cases that are worth exploring for awareness:

  1. Pods can fail to start, if the user defined at build time is different from the one defined in pod spec and some files are as a result inaccessible.
  2. Pods can end up sharing User IDs unintentionally. This can be problematic even if the User IDs are non-zero in a situation where a container escape to host file system is possible. Once the attacker has access to the host file system, they get access to all the file resources that are owned by other unrelated pods that share the same UID.
  3. Pods can end up sharing User IDs, with other node level processes not managed by Kubernetes e.g. node level daemons for auditing, vulnerability scanning, telemetry. The threat is similar to the one above where host file system access can give attacker full access to these node level daemons without needing to be root on the node.

However, none of these cases will have as severe an impact as a container running as root being able to escape as a root user on the host, which can provide an attacker with complete control of the worker node, further allowing lateral movement to other worker or control plane nodes.

Kubernetes 1.22 introduced an alpha feature that specifically reduces the impact of such a control plane component running as root user to a non-root user through user namespaces.

That (alpha stage) support for user namespaces / rootless mode is available with the following container runtimes:

Some distributions support running in rootless mode, like the following:

Immutable container filesystems

The NSA/CISA Kubernetes Hardening Guidance highlights an often overlooked feature readOnlyRootFileSystem, with a working example in Appendix B. This example limits execution and tampering of containers at runtime. Any read/write activity can then be limited to few directories by using tmpfs volume mounts.

However, some applications that modify the container filesystem at runtime, like exploding a WAR or JAR file at container startup, could face issues when enabling this feature. To avoid this issue, consider making minimal changes to the filesystem at runtime when possible.

Building secure container images

Kubernetes Hardening Guidance also recommends running a scanner at deploy time as an admission controller, to prevent vulnerable or misconfigured pods from running in the cluster. Theoretically, this sounds like a good approach but there are several caveats to consider before this can be implemented in practice:

  • Depending on network bandwidth, available resources and scanner of choice, scanning for vulnerabilities for an image can take an indeterminate amount of time. This could lead to slower or unpredictable pod start up times, which could result in spikes of unavailability when apps are serving peak load.
  • If the policy that allows or denies pod startup is made using incorrect or incomplete data it could result in several false positive or false negative outcomes like the following:
    • inside a container image, the openssl package is detected as vulnerable. However, the application is written in Golang and uses the Go crypto package for TLS. Therefore, this vulnerability is not in the code execution path and as such has minimal impact if it remains unfixed.
    • A vulnerability is detected in the openssl package for a Debian base image. However, the upstream Debian community considers this as a Minor impact vulnerability and as a result does not release a patch fix for this vulnerability. The owner of this image is now stuck with a vulnerability that cannot be fixed and a cluster that does not allow the image to run because of predefined policy that does not take into account whether the fix for a vulnerability is available or not
    • A Golang app is built on top of a distroless image, but it is compiled with a Golang version that uses a vulnerable standard library. The scanner has no visibility into golang version but only on OS level packages. So it allows the pod to run in the cluster in spite of the image containing an app binary built on vulnerable golang.

To be clear, relying on vulnerability scanners is absolutely a good idea but policy definitions should be flexible enough to allow:

  • Creation of exception lists for images or vulnerabilities through labelling
  • Overriding the severity with a risk score based on impact of a vulnerability
  • Applying the same policies at build time to catch vulnerable images with fixable vulnerabilities before they can be deployed into Kubernetes clusters

Special considerations like offline vulnerability database fetch, may also be needed, if the clusters run in an air-gapped environment and the scanners require internet access to update the vulnerability database.

Pod Security Policies

Since Kubernetes v1.21, the PodSecurityPolicy API and related features are deprecated, but some of the guidance in this section will still apply for the next few years, until cluster operators upgrade their clusters to newer Kubernetes versions.

The Kubernetes project is working on a replacement for PodSecurityPolicy. Kubernetes v1.22 includes an alpha feature called Pod Security Admission that is intended to allow enforcing a minimum level of isolation between pods.

The built-in isolation levels for Pod Security Admission are derived from Pod Security Standards, which is a superset of all the components mentioned in Table I page 10 of the guidance.

Information about migrating from PodSecurityPolicy to the Pod Security Admission feature is available in Migrate from PodSecurityPolicy to the Built-In PodSecurity Admission Controller.

One important behavior mentioned in the guidance that remains the same between Pod Security Policy and its replacement is that enforcing either of them does not affect pods that are already running. With both PodSecurityPolicy and Pod Security Admission, the enforcement happens during the pod creation stage.

Hardening container engines

Some container workloads are less trusted than others but may need to run in the same cluster. In those cases, running them on dedicated nodes that include hardened container runtimes that provide stricter pod isolation boundaries can act as a useful security control.

Kubernetes supports an API called RuntimeClass that is stable / GA (and, therefore, enabled by default) stage as of Kubernetes v1.20. RuntimeClass allows you to ensure that Pods requiring strong isolation are scheduled onto nodes that can offer it.

Some third-party projects that you can use in conjunction with RuntimeClass are:

As discussed here and in the guidance, many features and tooling exist in and around Kubernetes that can enhance the isolation boundaries between pods. Based on relevant threats and risk posture, you should pick and choose between them, instead of trying to apply all the recommendations. Having said that, cluster level isolation i.e. running workloads in dedicated clusters, remains the strictest workload isolation mechanism, in spite of improvements mentioned earlier here and in the guide.

Network Separation and Hardening

Kubernetes Networking can be tricky and this section focuses on how to secure and harden the relevant configurations. The guide identifies the following as key takeaways:

  • Using NetworkPolicies to create isolation between resources,
  • Securing the control plane
  • Encrypting traffic and sensitive data

Network Policies

Network policies can be created with the help of network plugins. In order to make the creation and visualization easier for users, Cilium supports a web GUI tool. That web GUI lets you create Kubernetes NetworkPolicies (a generic API that nevertheless requires a compatible CNI plugin), and / or Cilium network policies (CiliumClusterwideNetworkPolicy and CiliumNetworkPolicy, which only work in clusters that use the Cilium CNI plugin). You can use these APIs to restrict network traffic between pods, and therefore minimize the attack vector.

Another scenario that is worth exploring is the usage of external IPs. Some services, when misconfigured, can create random external IPs. An attacker can take advantage of this misconfiguration and easily intercept traffic. This vulnerability has been reported in CVE-2020-8554. Using externalip-webhook can mitigate this vulnerability by preventing the services from using random external IPs. externalip-webhook only allows creation of services that don't require external IPs or whose external IPs are within the range specified by the administrator.

CVE-2020-8554 - Kubernetes API server in all versions allow an attacker who is able to create a ClusterIP service and set the spec.externalIPs field, to intercept traffic to that IP address. Additionally, an attacker who is able to patch the status (which is considered a privileged operation and should not typically be granted to users) of a LoadBalancer service can set the status.loadBalancer.ingress.ip to similar effect.

Resource Policies

In addition to configuring ResourceQuotas and limits, consider restricting how many process IDs (PIDs) a given Pod can use, and also to reserve some PIDs for node-level use to avoid resource exhaustion. More details to apply these limits can be found in Process ID Limits And Reservations.

Control Plane Hardening

In the next section, the guide covers control plane hardening. It is worth noting that from Kubernetes 1.20, insecure port from API server, has been removed.

Etcd

As a general rule, the etcd server should be configured to only trust certificates assigned to the API server. It limits the attack surface and prevents a malicious attacker from gaining access to the cluster. It might be beneficial to use a separate CA for etcd, as it by default trusts all the certificates issued by the root CA.

Kubeconfig Files

In addition to specifying the token and certificates directly, .kubeconfig supports dynamic retrieval of temporary tokens using auth provider plugins. Beware of the possibility of malicious shell code execution in a kubeconfig file. Once attackers gain access to the cluster, they can steal ssh keys/secrets or more.

Secrets

Kubernetes Secrets is the native way of managing secrets as a Kubernetes API object. However, in some scenarios such as a desire to have a single source of truth for all app secrets, irrespective of whether they run on Kubernetes or not, secrets can be managed loosely coupled with Kubernetes and consumed by pods through side-cars or init-containers with minimal usage of Kubernetes Secrets API.

External secrets providers and csi-secrets-store are some of these alternatives to Kubernetes Secrets

Log Auditing

The NSA/CISA guidance stresses monitoring and alerting based on logs. The key points include logging at the host level, application level, and on the cloud. When running Kubernetes in production, it's important to understand who's responsible, and who's accountable, for each layer of logging.

Kubernetes API auditing

One area that deserves more focus is what exactly should alert or be logged. The document outlines a sample policy in Appendix L: Audit Policy that logs all RequestResponse's including metadata and request / response bodies. While helpful for a demo, it may not be practical for production.

Each organization needs to evaluate their own threat model and build an audit policy that complements or helps troubleshooting incident response. Think about how someone would attack your organization and what audit trail could identify it. Review more advanced options for tuning audit logs in the official audit logging documentation. It's crucial to tune your audit logs to only include events that meet your threat model. A minimal audit policy that logs everything at metadata level can also be a good starting point.

Audit logging configurations can also be tested with kind following these instructions.

Streaming logs and auditing

Logging is important for threat and anomaly detection. As the document outlines, it's a best practice to scan and alert on logs as close to real time as possible and to protect logs from tampering if a compromise occurs. It's important to reflect on the various levels of logging and identify the critical areas such as API endpoints.

Kubernetes API audit logging can stream to a webhook and there's an example in Appendix N: Webhook configuration. Using a webhook could be a method that stores logs off cluster and/or centralizes all audit logs. Once logs are centrally managed, look to enable alerting based on critical events. Also ensure you understand what the baseline is for normal activities.

Alert identification

While the guide stressed the importance of notifications, there is not a blanket event list to alert from. The alerting requirements vary based on your own requirements and threat model. Examples include the following events:

  • Changes to the securityContext of a Pod
  • Updates to admission controller configs
  • Accessing certain files / URLs

Additional logging resources

Upgrading and Application Security practices

Kubernetes releases three times per year, so upgrade-related toil is a common problem for people running production clusters. In addition to this, operators must regularly upgrade the underlying node's operating system and running applications. This is a best practice to ensure continued support and to reduce the likelihood of bugs or vulnerabilities.

Kubernetes supports the three most recent stable releases. While each Kubernetes release goes through a large number of tests before being published, some teams aren't comfortable running the latest stable release until some time has passed. No matter what version you're running, ensure that patch upgrades happen frequently or automatically. More information can be found in the version skew policy pages.

When thinking about how you'll manage node OS upgrades, consider ephemeral nodes. Having the ability to destroy and add nodes allows your team to respond quicker to node issues. In addition, having deployments that tolerate node instability (and a culture that encourages frequent deployments) allows for easier cluster upgrades.

Additionally, it's worth reiterating from the guidance that periodic vulnerability scans and penetration tests can be performed on the various system components to proactively look for insecure configurations and vulnerabilities.

Finding release & security information

To find the most recent Kubernetes supported versions, refer to https://k8s.io/releases, which includes minor versions. It's good to stay up to date with your minor version patches.

If you're running a managed Kubernetes offering, look for their release documentation and find their various security channels.

Subscribe to the Kubernetes Announce mailing list. The Kubernetes Announce mailing list is searchable for terms such as "Security Advisories". You can set up alerts and email notifications as long as you know what key words to alert on.

Conclusion

In summary, it is fantastic to see security practitioners sharing this level of detailed guidance in public. This guidance further highlights Kubernetes going mainstream and how securing Kubernetes clusters and the application containers running on Kubernetes continues to need attention and focus of practitioners. Only a few weeks after the guidance was published, an open source tool kubescape to validate cluster against this guidance became available.

This tool can be a great starting point to check the current state of your clusters, after which you can use the information in this blog post and in the guidance to assess where improvements can be made.

Finally, it is worth reiterating that not all controls in this guidance will make sense for all practitioners. The best way to know which controls matter is to rely on the threat model of your own Kubernetes environment.

A special shout out and thanks to Rory McCune (@raesene) for his inputs to this blog post

How to Handle Data Duplication in Data-Heavy Kubernetes Environments

Why Duplicate Data?

It’s convenient to create a copy of your application with a copy of its state for each team. For example, you might want a separate database copy to test some significant schema changes or develop other disruptive operations like bulk insert/delete/update...

Duplicating data takes a lot of time. That’s because you need first to download all the data from a source block storage provider to compute and then send it back to a storage provider again. There’s a lot of network traffic and CPU/RAM used in this process. Hardware acceleration by offloading certain expensive operations to dedicated hardware is always a huge performance boost. It reduces the time required to complete an operation by orders of magnitude.

Volume Snapshots to the rescue

Kubernetes introduced VolumeSnapshots as alpha in 1.12, beta in 1.17, and the Generally Available version in 1.20. VolumeSnapshots use specialized APIs from storage providers to duplicate volume of data.

Since data is already in the same storage device (array of devices), duplicating data is usually a metadata operation for storage providers with local snapshots (majority of on-premise storage providers). All you need to do is point a new disk to an immutable snapshot and only save deltas (or let it do a full-disk copy). As an operation that is inside the storage back-end, it’s much quicker and usually doesn’t involve sending traffic over the network. Public Clouds storage providers under the hood work a bit differently. They save snapshots to Object Storage and then copy back from Object storage to Block storage when "duplicating" disk. Technically there is a lot of Compute and network resources spent on Cloud providers side, but from Kubernetes user perspective VolumeSnapshots work the same way whether is it local or remote snapshot storage provider and no Compute and Network resources are involved in this operation.

Sounds like we have our solution, right?

Actually, VolumeSnapshots are namespaced, and Kubernetes protects namespaced data from being shared between tenants (Namespaces). This Kubernetes limitation is a conscious design decision so that a Pod running in a different namespace can’t mount another application’s PersistentVolumeClaim (PVC).

One way around it would be to create multiple volumes with duplicate data in one namespace. However, you could easily reference the wrong copy.

So the idea is to separate teams/initiatives by namespaces to avoid that and generally limit access to the production namespace.

Solution? Creating a Golden Snapshot externally

Another way around this design limitation is to create Snapshot externally (not through Kubernetes). This is also called pre-provisioning a snapshot manually. Next, I will import it as a multi-tenant golden snapshot that can be used for many namespaces. Below illustration will be for AWS EBS (Elastic Block Storage) and GCE PD (Persistent Disk) services.

High-level plan for preparing the Golden Snapshot

  1. Identify Disk (EBS/Persistent Disk) that you want to clone with data in the cloud provider
  2. Make a Disk Snapshot (in cloud provider console)
  3. Get Disk Snapshot ID

High-level plan for cloning data for each team

  1. Create Namespace “sandbox01”
  2. Import Disk Snapshot (ID) as VolumeSnapshotContent to Kubernetes
  3. Create VolumeSnapshot in the Namespace "sandbox01" mapped to VolumeSnapshotContent
  4. Create the PersistentVolumeClaim from VolumeSnapshot
  5. Install Deployment or StatefulSet with PVC

Step 1: Identify Disk

First, you need to identify your golden source. In my case, it’s a PostgreSQL database on PersistentVolumeClaim “postgres-pv-claim” in the “production” namespace.

kubectl -n <namespace> get pvc <pvc-name> -o jsonpath='{.spec.volumeName}'

The output will look similar to:

pvc-3096b3ba-38b6-4fd1-a42f-ec99176ed0d90

Step 2: Prepare your golden source

You need to do this once or every time you want to refresh your golden data.

Make a Disk Snapshot

Go to AWS EC2 or GCP Compute Engine console and search for an EBS volume (on AWS) or Persistent Disk (on GCP), that has a label matching the last output. In this case I saw: pvc-3096b3ba-38b6-4fd1-a42f-ec99176ed0d9.

Click on Create snapshot and give it a name. You can do it in Console manually, in AWS CloudShell / Google Cloud Shell, or in the terminal. To create a snapshot in the terminal you must have the AWS CLI tool (aws) or Google's CLI (gcloud) installed and configured.

Here’s the command to create snapshot on GCP:

gcloud compute disks snapshot <cloud-disk-id> --project=<gcp-project-id> --snapshot-names=<set-new-snapshot-name> --zone=<availability-zone> --storage-location=<region>
Screenshot of a terminal showing volume snapshot creation on GCP

GCP snapshot creation

GCP identifies the disk by its PVC name, so it’s direct mapping. In AWS, you need to find volume by the CSIVolumeName AWS tag with PVC name value first that will be used for snapshot creation.

Screenshot of AWS web console, showing EBS volume identification

Identify disk ID on AWS

Mark done Volume (volume-id) vol-00c7ecd873c6fb3ec and ether create EBS snapshot in AWS Console, or use aws cli.

aws ec2 create-snapshot --volume-id '<volume-id>' --description '<set-new-snapshot-name>' --tag-specifications 'ResourceType=snapshot'

Step 3: Get your Disk Snapshot ID

In AWS, the command above will output something similar to:

"SnapshotId": "snap-09ed24a70bc19bbe4"

If you’re using the GCP cloud, you can get the snapshot ID from the gcloud command by querying for the snapshot’s given name:

gcloud compute snapshots --project=<gcp-project-id> describe <new-snapshot-name> | grep id:

You should get similar output to:

id: 6645363163809389170

Step 4: Create a development environment for each team

Now I have my Golden Snapshot, which is immutable data. Each team will get a copy of this data, and team members can modify it as they see fit, given that a new EBS/persistent disk will be created for each team.

Below I will define a manifest for each namespace. To save time, you can replace the namespace name (such as changing “sandbox01” → “sandbox42”) using tools such as sed or yq, with Kubernetes-aware templating tools like Kustomize, or using variable substitution in a CI/CD pipeline.

Here's an example manifest:

---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
 name: postgresql-orders-db-sandbox01
 namespace: sandbox01
spec:
 deletionPolicy: Retain
 driver: pd.csi.storage.gke.io
 source:
   snapshotHandle: 'gcp/projects/staging-eu-castai-vt5hy2/global/snapshots/6645363163809389170'
 volumeSnapshotRef:
   kind: VolumeSnapshot
   name: postgresql-orders-db-snap
   namespace: sandbox01
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
 name: postgresql-orders-db-snap
 namespace: sandbox01
spec:
 source:
   volumeSnapshotContentName: postgresql-orders-db-sandbox01

In Kubernetes, VolumeSnapshotContent (VSC) objects are not namespaced. However, I need a separate VSC for each different namespace to use, so the metadata.name of each VSC must also be different. To make that straightfoward, I used the target namespace as part of the name.

Now it’s time to replace the driver field with the CSI (Container Storage Interface) driver installed in your K8s cluster. Major cloud providers have CSI driver for block storage that support VolumeSnapshots but quite often CSI drivers are not installed by default, consult with your Kubernetes provider.

That manifest above defines a VSC that works on GCP. On AWS, driver and SnashotHandle values might look like:

  driver: ebs.csi.aws.com
  source:
    snapshotHandle: "snap-07ff83d328c981c98"

At this point, I need to use the Retain policy, so that the CSI driver doesn’t try to delete my manually created EBS disk snapshot.

For GCP, you will have to build this string by hand - add a full project ID and snapshot ID. For AWS, it’s just a plain snapshot ID.

VSC also requires specifying which VolumeSnapshot (VS) will use it, so VSC and VS are referencing each other.

Now I can create PersistentVolumeClaim from VS above. It’s important to set this first:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
 name: postgres-pv-claim
 namespace: sandbox01
spec:
 dataSource:
   kind: VolumeSnapshot
   name: postgresql-orders-db-snap
   apiGroup: snapshot.storage.k8s.io
 accessModes:
   - ReadWriteOnce
 resources:
   requests:
     storage: 21Gi

If default StorageClass has WaitForFirstConsumer policy, then the actual Cloud Disk will be created from the Golden Snapshot only when some Pod bounds that PVC.

Now I assign that PVC to my Pod (in my case, it’s Postgresql) as I would with any other PVC.

kubectl -n <namespace> get volumesnapshotContent,volumesnapshot,pvc,pod

Both VS and VSC should be READYTOUSE true, PVC bound, and the Pod (from Deployment or StatefulSet) running.

To keep on using data from my Golden Snapshot, I just need to repeat this for the next namespace and voilà! No need to waste time and compute resources on the duplication process.

Spotlight on SIG Node

Introduction

In Kubernetes, a Node is a representation of a single machine in your cluster. SIG Node owns that very important Node component and supports various subprojects such as Kubelet, Container Runtime Interface (CRI) and more to support how the pods and host resources interact. In this blog, we have summarized our conversation with Elana Hashman (EH) & Sergey Kanzhelev (SK), who walk us through the various aspects of being a part of the SIG and share some insights about how others can get involved.

A summary of our conversation

Could you tell us a little about what SIG Node does?

SK: SIG Node is a vertical SIG responsible for the components that support the controlled interactions between the pods and host resources. We manage the lifecycle of pods that are scheduled to a node. This SIG's focus is to enable a broad set of workload types, including workloads with hardware specific or performance sensitive requirements. All while maintaining isolation boundaries between pods on a node, as well as the pod and the host. This SIG maintains quite a few components and has many external dependencies (like container runtimes or operating system features), which makes the complexity we deal with huge. We tame the complexity and aim to continuously improve node reliability.

"SIG Node is a vertical SIG" could you explain a bit more?

EH: There are two kinds of SIGs: horizontal and vertical. Horizontal SIGs are concerned with a particular function of every component in Kubernetes: for example, SIG Security considers security aspects of every component in Kubernetes, or SIG Instrumentation looks at the logs, metrics, traces and events of every component in Kubernetes. Such SIGs don't tend to own a lot of code.

Vertical SIGs, on the other hand, own a single component, and are responsible for approving and merging patches to that code base. SIG Node owns the "Node" vertical, pertaining to the kubelet and its lifecycle. This includes the code for the kubelet itself, as well as the node controller, the container runtime interface, and related subprojects like the node problem detector.

How did the CI subproject start? Is this specific to SIG Node and how does it help the SIG?

SK: The subproject started as a follow up after one of the releases was blocked by numerous test failures of critical tests. These tests haven’t started falling all at once, rather continuous lack of attention led to slow degradation of tests quality. SIG Node was always prioritizing quality and reliability, and forming of the subproject was a way to highlight this priority.

As the 3rd largest SIG in terms of number of issues and PRs, how does your SIG juggle so much work?

EH: It helps to be organized. When I increased my contributions to the SIG in January of 2021, I found myself overwhelmed by the volume of pull requests and issues and wasn't sure where to start. We were already tracking test-related issues and pull requests on the CI subproject board, but that was missing a lot of our bugfixes and feature work. So I began putting together a triage board for the rest of our pull requests, which allowed me to sort each one by status and what actions to take, and documented its use for other contributors. We closed or merged over 500 issues and pull requests tracked by our two boards in each of the past two releases. The Kubernetes devstats showed that we have significantly increased our velocity as a result.

In June, we ran our first bug scrub event to work through the backlog of issues filed against SIG Node, ensuring they were properly categorized. We closed over 130 issues over the course of this 48 hour global event, but as of writing we still have 333 open issues.

Why should new and existing contributors consider joining SIG Node?

SK: Being a SIG Node contributor gives you skills and recognition that are rewarding and useful. Understanding under the hood of a kubelet helps architecting better apps, tune and optimize those apps, and gives leg up in issues troubleshooting. If you are a new contributor, SIG Node gives you the foundational knowledge that is key to understanding why other Kubernetes components are designed the way they are. Existing contributors may benefit as many features will require SIG Node changes one way or another. So being a SIG Node contributor helps building features in other SIGs faster.

SIG Node maintains numerous components, many of which have dependency on external projects or OS features. This makes the onboarding process quite lengthy and demanding. But if you are up for a challenge, there is always a place for you, and a group of people to support.

What do you do to help new contributors get started?

EH: Getting started in SIG Node can be intimidating, since there is so much work to be done, our SIG meetings are very large, and it can be hard to find a place to start.

I always encourage new contributors to work on things that they have some investment in already. In SIG Node, that might mean volunteering to help fix a bug that you have personally been affected by, or helping to triage bugs you care about by priority.

To come up to speed on any open source code base, there are two strategies you can take: start by exploring a particular issue deeply, and follow that to expand the edges of your knowledge as needed, or briefly review as many issues and change requests as you possibly can to get a higher level picture of how the component works. Ultimately, you will need to do both if you want to become a Node reviewer or approver.

Davanum Srinivas and I each ran a cohort of group mentoring to help teach new contributors the skills to become Node reviewers, and if there's interest we can work to find a mentor to run another session. I also encourage new contributors to attend our Node CI Subproject meeting: it's a smaller audience and we don't record the triage sessions, so it can be a less intimidating way to get started with the SIG.

Are there any particular skills you’d like to recruit for? What skills are contributors to SIG Usability likely to learn?

SK: SIG Node works on many workstreams in very different areas. All of these areas are on system level. For the typical code contributions you need to have a passion for building and utilizing low level APIs and writing performant and reliable components. Being a contributor you will learn how to debug and troubleshoot, profile, and monitor these components, as well as user workload that is run by these components. Often, with the limited to no access to Nodes, as they are running production workloads.

The other way of contribution is to help document SIG node features. This type of contribution requires a deep understanding of features, and ability to explain them in simple terms.

Finally, we are always looking for feedback on how best to run your workload. Come and explain specifics of it, and what features in SIG Node components may help to run it better.

What are you getting positive feedback on, and what’s coming up next for SIG Node?

EH: Over the past year SIG Node has adopted some new processes to help manage our feature development and Kubernetes enhancement proposals, and other SIGs have looked to us for inspiration in managing large workloads. I hope that this is an area we can continue to provide leadership in and further iterate on.

We have a great balance of new features and deprecations in flight right now. Deprecations of unused or difficult to maintain features help us keep technical debt and maintenance load under control, and examples include the dockershim and DynamicKubeletConfiguration deprecations. New features will unlock additional functionality in end users' clusters, and include exciting features like support for cgroups v2, swap memory, graceful node shutdowns, and device management policies.

Any closing thoughts/resources you’d like to share?

SK/EH: It takes time and effort to get to any open source community. SIG Node may overwhelm you at first with the number of participants, volume of work, and project scope. But it is totally worth it. Join our welcoming community! SIG Node GitHub Repo contains many useful resources including Slack, mailing list and other contact info.

Wrap Up

SIG Node hosted a KubeCon + CloudNativeCon Europe 2021 talk with an intro and deep dive to their awesome SIG. Join the SIG's meetings to find out about the most recent research results, what the plans are for the forthcoming year, and how to get involved in the upstream Node team as a contributor!

Introducing Single Pod Access Mode for PersistentVolumes

Last month's release of Kubernetes v1.22 introduced a new ReadWriteOncePod access mode for PersistentVolumes and PersistentVolumeClaims. With this alpha feature, Kubernetes allows you to restrict volume access to a single pod in the cluster.

What are access modes and why are they important?

When using storage, there are different ways to model how that storage is consumed.

For example, a storage system like a network file share can have many users all reading and writing data simultaneously. In other cases maybe everyone is allowed to read data but not write it. For highly sensitive data, maybe only one user is allowed to read and write data but nobody else.

In the world of Kubernetes, access modes are the way you can define how durable storage is consumed. These access modes are a part of the spec for PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs).

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: shared-cache
spec:
  accessModes:
  - ReadWriteMany # Allow many nodes to access shared-cache simultaneously.
  resources:
    requests:
      storage: 1Gi

Before v1.22, Kubernetes offered three access modes for PVs and PVCs:

  • ReadWriteOnce – the volume can be mounted as read-write by a single node
  • ReadOnlyMany – the volume can be mounted read-only by many nodes
  • ReadWriteMany – the volume can be mounted as read-write by many nodes

These access modes are enforced by Kubernetes components like the kube-controller-manager and kubelet to ensure only certain pods are allowed to access a given PersistentVolume.

What is this new access mode and how does it work?

Kubernetes v1.22 introduced a fourth access mode for PVs and PVCs, that you can use for CSI volumes:

  • ReadWriteOncePod – the volume can be mounted as read-write by a single pod

If you create a pod with a PVC that uses the ReadWriteOncePod access mode, Kubernetes ensures that pod is the only pod across your whole cluster that can read that PVC or write to it.

If you create another pod that references the same PVC with this access mode, the pod will fail to start because the PVC is already in use by another pod. For example:

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  1s    default-scheduler  0/1 nodes are available: 1 node has pod using PersistentVolumeClaim with the same name and ReadWriteOncePod access mode.

How is this different than the ReadWriteOnce access mode?

The ReadWriteOnce access mode restricts volume access to a single node, which means it is possible for multiple pods on the same node to read from and write to the same volume. This could potentially be a major problem for some applications, especially if they require at most one writer for data safety guarantees.

With ReadWriteOncePod these issues go away. Set the access mode on your PVC, and Kubernetes guarantees that only a single pod has access.

How do I use it?

The ReadWriteOncePod access mode is in alpha for Kubernetes v1.22 and is only supported for CSI volumes. As a first step you need to enable the ReadWriteOncePod feature gate for kube-apiserver, kube-scheduler, and kubelet. You can enable the feature by setting command line arguments:

--feature-gates="...,ReadWriteOncePod=true"

You also need to update the following CSI sidecars to these versions or greater:

Creating a PersistentVolumeClaim

In order to use the ReadWriteOncePod access mode for your PVs and PVCs, you will need to create a new PVC with the access mode:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: single-writer-only
spec:
  accessModes:
  - ReadWriteOncePod # Allow only a single pod to access single-writer-only.
  resources:
    requests:
      storage: 1Gi

If your storage plugin supports dynamic provisioning, new PersistentVolumes will be created with the ReadWriteOncePod access mode applied.

Migrating existing PersistentVolumes

If you have existing PersistentVolumes, they can be migrated to use ReadWriteOncePod.

In this example, we already have a "cat-pictures-pvc" PersistentVolumeClaim that is bound to a "cat-pictures-pv" PersistentVolume, and a "cat-pictures-writer" Deployment that uses this PersistentVolumeClaim.

As a first step, you need to edit your PersistentVolume's spec.persistentVolumeReclaimPolicy and set it to Retain. This ensures your PersistentVolume will not be deleted when we delete the corresponding PersistentVolumeClaim:

kubectl patch pv cat-pictures-pv -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'

Next you need to stop any workloads that are using the PersistentVolumeClaim bound to the PersistentVolume you want to migrate, and then delete the PersistentVolumeClaim.

Once that is done, you need to clear your PersistentVolume's spec.claimRef.uid to ensure PersistentVolumeClaims can bind to it upon recreation:

kubectl scale --replicas=0 deployment cat-pictures-writer
kubectl delete pvc cat-pictures-pvc
kubectl patch pv cat-pictures-pv -p '{"spec":{"claimRef":{"uid":""}}}'

After that you need to replace the PersistentVolume's access modes with ReadWriteOncePod:

kubectl patch pv cat-pictures-pv -p '{"spec":{"accessModes":["ReadWriteOncePod"]}}'

Note:

The ReadWriteOncePod access mode cannot be combined with other access modes. Make sure ReadWriteOncePod is the only access mode on the PersistentVolume when updating, otherwise the request will fail.

Next you need to modify your PersistentVolumeClaim to set ReadWriteOncePod as the only access mode. You should also set your PersistentVolumeClaim's spec.volumeName to the name of your PersistentVolume.

Once this is done, you can recreate your PersistentVolumeClaim and start up your workloads:

# IMPORTANT: Make sure to edit your PVC in cat-pictures-pvc.yaml before applying. You need to:
# - Set ReadWriteOncePod as the only access mode
# - Set spec.volumeName to "cat-pictures-pv"

kubectl apply -f cat-pictures-pvc.yaml
kubectl apply -f cat-pictures-writer-deployment.yaml

Lastly you may edit your PersistentVolume's spec.persistentVolumeReclaimPolicy and set to it back to Delete if you previously changed it.

kubectl patch pv cat-pictures-pv -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'

You can read Configure a Pod to Use a PersistentVolume for Storage for more details on working with PersistentVolumes and PersistentVolumeClaims.

What volume plugins support this?

The only volume plugins that support this are CSI drivers. SIG Storage does not plan to support this for in-tree plugins because they are being deprecated as part of CSI migration. Support may be considered for beta for users that prefer to use the legacy in-tree volume APIs with CSI migration enabled.

As a storage vendor, how do I add support for this access mode to my CSI driver?

The ReadWriteOncePod access mode will work out of the box without any required updates to CSI drivers, but does require updates to CSI sidecars. With that being said, if you would like to stay up to date with the latest changes to the CSI specification (v1.5.0+), read on.

Two new access modes were introduced to the CSI specification in order to disambiguate the legacy SINGLE_NODE_WRITER access mode. They are SINGLE_NODE_SINGLE_WRITER and SINGLE_NODE_MULTI_WRITER. In order to communicate to sidecars (like the external-provisioner) that your driver understands and accepts these two new CSI access modes, your driver will also need to advertise the SINGLE_NODE_MULTI_WRITER capability for the controller service and node service.

If you'd like to read up on the motivation for these access modes and capability bits, you can also read the CSI Specification Changes, Volume Capabilities section of KEP-2485 (ReadWriteOncePod PersistentVolume Access Mode).

Update your CSI driver to use the new interface

As a first step you will need to update your driver's container-storage-interface dependency to v1.5.0+, which contains support for these new access modes and capabilities.

Accept new CSI access modes

If your CSI driver contains logic for validating CSI access modes for requests , it may need updating. If it currently accepts SINGLE_NODE_WRITER, it should be updated to also accept SINGLE_NODE_SINGLE_WRITER and SINGLE_NODE_MULTI_WRITER.

Using the GCP PD CSI driver validation logic as an example, here is how it can be extended:

diff --git a/pkg/gce-pd-csi-driver/utils.go b/pkg/gce-pd-csi-driver/utils.go
index 281242c..b6c5229 100644
--- a/pkg/gce-pd-csi-driver/utils.go
+++ b/pkg/gce-pd-csi-driver/utils.go
@@ -123,6 +123,8 @@ func validateAccessMode(am *csi.VolumeCapability_AccessMode) error {
        case csi.VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY:
        case csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY:
        case csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER:
+       case csi.VolumeCapability_AccessMode_SINGLE_NODE_SINGLE_WRITER:
+       case csi.VolumeCapability_AccessMode_SINGLE_NODE_MULTI_WRITER:
        default:
                return fmt.Errorf("%v access mode is not supported for for PD", am.GetMode())
        }

Your CSI driver will also need to return the new SINGLE_NODE_MULTI_WRITER capability as part of the ControllerGetCapabilities and NodeGetCapabilities RPCs.

Using the GCP PD CSI driver capability advertisement logic as an example, here is how it can be extended:

diff --git a/pkg/gce-pd-csi-driver/gce-pd-driver.go b/pkg/gce-pd-csi-driver/gce-pd-driver.go
index 45903f3..0d7ea26 100644
--- a/pkg/gce-pd-csi-driver/gce-pd-driver.go
+++ b/pkg/gce-pd-csi-driver/gce-pd-driver.go
@@ -56,6 +56,8 @@ func (gceDriver *GCEDriver) SetupGCEDriver(name, vendorVersion string, extraVolu
                csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
                csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY,
                csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER,
+               csi.VolumeCapability_AccessMode_SINGLE_NODE_SINGLE_WRITER,
+               csi.VolumeCapability_AccessMode_SINGLE_NODE_MULTI_WRITER,
        }
        gceDriver.AddVolumeCapabilityAccessModes(vcam)
        csc := []csi.ControllerServiceCapability_RPC_Type{
@@ -67,12 +69,14 @@ func (gceDriver *GCEDriver) SetupGCEDriver(name, vendorVersion string, extraVolu
                csi.ControllerServiceCapability_RPC_EXPAND_VOLUME,
                csi.ControllerServiceCapability_RPC_LIST_VOLUMES,
                csi.ControllerServiceCapability_RPC_LIST_VOLUMES_PUBLISHED_NODES,
+               csi.ControllerServiceCapability_RPC_SINGLE_NODE_MULTI_WRITER,
        }
        gceDriver.AddControllerServiceCapabilities(csc)
        ns := []csi.NodeServiceCapability_RPC_Type{
                csi.NodeServiceCapability_RPC_STAGE_UNSTAGE_VOLUME,
                csi.NodeServiceCapability_RPC_EXPAND_VOLUME,
                csi.NodeServiceCapability_RPC_GET_VOLUME_STATS,
+               csi.NodeServiceCapability_RPC_SINGLE_NODE_MULTI_WRITER,
        }
        gceDriver.AddNodeServiceCapabilities(ns)

Implement NodePublishVolume behavior

The CSI spec outlines expected behavior for the NodePublishVolume RPC when called more than once for the same volume but with different arguments (like the target path). Please refer to the second table in the NodePublishVolume section of the CSI spec for more details on expected behavior when implementing in your driver.

Update your CSI sidecars

When deploying your CSI drivers, you must update the following CSI sidecars to versions that depend on CSI spec v1.5.0+ and the Kubernetes v1.22 API. The minimum required versions are:

What’s next?

As part of the beta graduation for this feature, SIG Storage plans to update the Kubernetes scheduler to support pod preemption in relation to ReadWriteOncePod storage. This means if two pods request a PersistentVolumeClaim with ReadWriteOncePod, the pod with highest priority will gain access to the PersistentVolumeClaim and any pod with lower priority will be preempted from the node and be unable to access the PersistentVolumeClaim.

How can I learn more?

Please see KEP-2485 for more details on the ReadWriteOncePod access mode and motivations for CSI spec changes.

How do I get involved?

The Kubernetes #csi Slack channel and any of the standard SIG Storage communication channels are great mediums to reach out to the SIG Storage and the CSI teams.

Special thanks to the following people for their insightful reviews and design considerations:

  • Abdullah Gharaibeh (ahg-g)
  • Aldo Culquicondor (alculquicondor)
  • Ben Swartzlander (bswartz)
  • Deep Debroy (ddebroy)
  • Hemant Kumar (gnufied)
  • Humble Devassy Chirammal (humblec)
  • James DeFelice (jdef)
  • Jan Šafránek (jsafrane)
  • Jing Xu (jingxu97)
  • Jordan Liggitt (liggitt)
  • Michelle Au (msau42)
  • Saad Ali (saad-ali)
  • Tim Hockin (thockin)
  • Xing Yang (xing-yang)

If you’re interested in getting involved with the design and development of CSI or any part of the Kubernetes storage system, join the Kubernetes Storage Special Interest Group (SIG). We’re rapidly growing and always welcome new contributors.

Alpha in Kubernetes v1.22: API Server Tracing

In distributed systems, it can be hard to figure out where problems are. You grep through one component's logs just to discover that the source of your problem is in another component. You search there only to discover that you need to enable debug logs to figure out what really went wrong... And it goes on. The more complex the path your request takes, the harder it is to answer questions about where it went. I've personally spent many hours doing this dance with a variety of Kubernetes components. Distributed tracing is a tool which is designed to help in these situations, and the Kubernetes API Server is, perhaps, the most important Kubernetes component to be able to debug. At Kubernetes' Sig Instrumentation, our mission is to make it easier to understand what's going on in your cluster, and we are happy to announce that distributed tracing in the Kubernetes API Server reached alpha in 1.22.

What is Tracing?

Distributed tracing links together a bunch of super-detailed information from multiple different sources, and structures that telemetry into a single tree for that request. Unlike logging, which limits the quantity of data ingested by using log levels, tracing collects all of the details and uses sampling to collect only a small percentage of requests. This means that once you have a trace which demonstrates an issue, you should have all the information you need to root-cause the problem--no grepping for object UID required! My favorite aspect, though, is how useful the visualizations of traces are. Even if you don't understand the inner workings of the API Server, or don't have a clue what an etcd "Transaction" is, I'd wager you (yes, you!) could tell me roughly what the order of events was, and which components were involved in the request. If some step takes a long time, it is easy to tell where the problem is.

Why OpenTelemetry?

It's important that Kubernetes works well for everyone, regardless of who manages your infrastructure, or which vendors you choose to integrate with. That is particularly true for Kubernetes' integrations with telemetry solutions. OpenTelemetry, being a CNCF project, shares these core values, and is creating exactly what we need in Kubernetes: A set of open standards for Tracing client library APIs and a standard trace format. By using OpenTelemetry, we can ensure users have the freedom to choose their backend, and ensure vendors have a level playing field. The timing couldn't be better: the OpenTelemetry golang API and SDK are very close to their 1.0 release, and will soon offer backwards-compatibility for these open standards.

Why instrument the API Server?

The Kubernetes API Server is a great candidate for tracing for a few reasons:

  • It follows the standard "RPC" model (serve a request by making requests to downstream components), which makes it easy to instrument.
  • Users are latency-sensitive: If a request takes more than 10 seconds to complete, many clients will time-out.
  • It has a complex service topology: A single request could require consulting a dozen webhooks, or involve multiple requests to etcd.

Trying out APIServer Tracing with a webhook

Enabling API Server Tracing

  1. Enable the APIServerTracing feature-gate.

  2. Set our configuration for tracing by pointing the --tracing-config-file flag on the kube-apiserver at our config file, which contains:

apiVersion: apiserver.config.k8s.io/v1alpha1
kind: TracingConfiguration
# 1% sampling rate
samplingRatePerMillion: 10000

Enabling Etcd Tracing

Update: the following was added after the blog was published Add --experimental-enable-distributed-tracing, --experimental-distributed-tracing-address=0.0.0.0:4317, --experimental-distributed-tracing-service-name=etcd flags to etcd to enable tracing. Note that this traces every request, so it will probably generate a lot of traces if you enable it. Required etcd version is 3.5 up to 3.5.4.

Starting from version 3.5.5 until version 3.5.10, the default sampling rate for traces is set to 0%, meaning no traces were collected by default. Unfortunately, there is no option provided to configure a higher sampling rate. (See details)

In version 3.5.11, the number of samples to collect per million spans can be configured using the newly introduced --experimental-distributed-tracing-sampling-rate=1000000 flag.

Example Trace: List Nodes

I could've used any trace backend, but decided to use Jaeger, since it is one of the most popular open-source tracing projects. I deployed the Jaeger All-in-one container in my cluster, deployed the OpenTelemetry collector on my control-plane node (example), and captured traces like this one:

Jaeger screenshot showing API server and etcd trace

The teal lines are from the API Server, and includes it serving a request to /api/v1/nodes, and issuing a grpc Range RPC to ETCD. The yellow-ish line is from ETCD handling the Range RPC.

Example Trace: Create Pod with Mutating Webhook

I instrumented the example webhook with OpenTelemetry (I had to patch controller-runtime, but it makes a neat demo), and routed traces to Jaeger as well. I collected traces like this one:

Jaeger screenshot showing API server, admission webhook, and etcd trace

Compared with the previous trace, there are two new spans: A teal span from the API Server making a request to the admission webhook, and a brown span from the admission webhook serving the request. Even if you didn't instrument your webhook, you would still get the span from the API Server making the request to the webhook.

Get involved!

As this is our first attempt at adding distributed tracing to a Kubernetes component, there is probably a lot we can improve! If my struggles resonated with you, or if you just want to try out the latest Kubernetes has to offer, please give the feature a try and open issues with any problem you encountered and ways you think the feature could be improved.

This is just the very beginning of what we can do with distributed tracing in Kubernetes. If there are other components you think would benefit from distributed tracing, or want to help bring API Server Tracing to GA, join sig-instrumentation at our regular meetings and get involved!

Kubernetes 1.22: A New Design for Volume Populators

Kubernetes v1.22, released earlier this month, introduced a redesigned approach for volume populators. Originally implemented in v1.18, the API suffered from backwards compatibility issues. Kubernetes v1.22 includes a new API field called dataSourceRef that fixes these problems.

Data sources

Earlier Kubernetes releases already added a dataSource field into the PersistentVolumeClaim API, used for cloning volumes and creating volumes from snapshots. You could use the dataSource field when creating a new PVC, referencing either an existing PVC or a VolumeSnapshot in the same namespace. That also modified the normal provisioning process so that instead of yielding an empty volume, the new PVC contained the same data as either the cloned PVC or the cloned VolumeSnapshot.

Volume populators embrace the same design idea, but extend it to any type of object, as long as there exists a custom resource to define the data source, and a populator controller to implement the logic. Initially, the dataSource field was directly extended to allow arbitrary objects, if the AnyVolumeDataSource feature gate was enabled on a cluster. That change unfortunately caused backwards compatibility problems, and so the new dataSourceRef field was born.

In v1.22 if the AnyVolumeDataSource feature gate is enabled, the dataSourceRef field is added, which behaves similarly to the dataSource field except that it allows arbitrary objects to be specified. The API server ensures that the two fields always have the same contents, and neither of them are mutable. The differences is that at creation time dataSource allows only PVCs or VolumeSnapshots, and ignores all other values, while dataSourceRef allows most types of objects, and in the few cases it doesn't allow an object (core objects other than PVCs) a validation error occurs.

When this API change graduates to stable, we would deprecate using dataSource and recommend using dataSourceRef field for all use cases. In the v1.22 release, dataSourceRef is available (as an alpha feature) specifically for cases where you want to use for custom volume populators.

Using populators

Every volume populator must have one or more CRDs that it supports. Administrators may install the CRD and the populator controller and then PVCs with a dataSourceRef specifies a CR of the type that the populator supports will be handled by the populator controller instead of the CSI driver directly.

Underneath the covers, the CSI driver is still invoked to create an empty volume, which the populator controller fills with the appropriate data. The PVC doesn't bind to the PV until it's fully populated, so it's safe to define a whole application manifest including pod and PVC specs and the pods won't begin running until everything is ready, just as if the PVC was a clone of another PVC or VolumeSnapshot.

How it works

PVCs with data sources are still noticed by the external-provisioner sidecar for the related storage class (assuming a CSI provisioner is used), but because the sidecar doesn't understand the data source kind, it doesn't do anything. The populator controller is also watching for PVCs with data sources of a kind that it understands and when it sees one, it creates a temporary PVC of the same size, volume mode, storage class, and even on the same topology (if topology is used) as the original PVC. The populator controller creates a worker pod that attaches to the volume and writes the necessary data to it, then detaches from the volume and the populator controller rebinds the PV from the temporary PVC to the orignal PVC.

Trying it out

The following things are required to use volume populators:

  • Enable the AnyVolumeDataSource feature gate
  • Install a CRD for the specific data source / populator
  • Install the populator controller itself

Populator controllers may use the lib-volume-populator library to do most of the Kubernetes API level work. Individual populators only need to provide logic for actually writing data into the volume based on a particular CR instance. This library provides a sample populator implementation.

These optional components improve user experience:

  • Install the VolumePopulator CRD
  • Create a VolumePopulator custom respource for each specific data source
  • Install the volume data source validator controller (alpha)

The purpose of these components is to generate warning events on PVCs with data sources for which there is no populator.

Putting it all together

To see how this works, you can install the sample "hello" populator and try it out.

First install the volume-data-source-validator controller.

kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/volume-data-source-validator/master/client/config/crd/populator.storage.k8s.io_volumepopulators.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/volume-data-source-validator/master/deploy/kubernetes/rbac-data-source-validator.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/volume-data-source-validator/master/deploy/kubernetes/setup-data-source-validator.yaml

Next install the example populator.

kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/lib-volume-populator/master/example/hello-populator/crd.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/lib-volume-populator/master/example/hello-populator/deploy.yaml

Create an instance of the Hello CR, with some text.

apiVersion: hello.k8s.io/v1alpha1
kind: Hello
metadata:
  name: example-hello
spec:
  fileName: example.txt
  fileContents: Hello, world!

Create a PVC that refers to that CR as its data source.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Mi
  dataSourceRef:
    apiGroup: hello.k8s.io
    kind: Hello
    name: example-hello
  volumeMode: Filesystem

Next, run a job that reads the file in the PVC.

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  template:
    spec:
      containers:
        - name: example-container
          image: busybox:latest
          command:
            - cat
            - /mnt/example.txt
          volumeMounts:
            - name: vol
              mountPath: /mnt
      restartPolicy: Never
      volumes:
        - name: vol
          persistentVolumeClaim:
            claimName: example-pvc

Wait for the job to complete (including all of its dependencies).

kubectl wait --for=condition=Complete job/example-job

And last examine the log from the job.

kubectl logs job/example-job
Hello, world!

Note that the volume already contained a text file with the string contents from the CR. This is only the simplest example. Actual populators can set up the volume to contain arbitrary contents.

How to write your own volume populator

Developers interested in writing new poplators are encouraged to use the lib-volume-populator library and to only supply a small controller wrapper around the library, and a pod image capable of attaching to volumes and writing the appropriate data to the volume.

Individual populators can be extremely generic such that they work with every type of PVC, or they can do vendor specific things to rapidly fill a volume with data if the volume was provisioned by a specific CSI driver from the same vendor, for example, by communicating directly with the storage for that volume.

The future

As this feature is still in alpha, we expect to update the out of tree controllers with more tests and documentation. The community plans to eventually re-implement the populator library as a sidecar, for ease of operations.

We hope to see some official community-supported populators for some widely-shared use cases. Also, we expect that volume populators will be used by backup vendors as a way to "restore" backups to volumes, and possibly a standardized API to do this will evolve.

How can I learn more?

The enhancement proposal, Volume Populators, includes lots of detail about the history and technical implementation of this feature.

Volume populators and data sources, within the documentation topic about persistent volumes, explains how to use this feature in your cluster.

Please get involved by joining the Kubernetes storage SIG to help us enhance this feature. There are a lot of good ideas already and we'd be thrilled to have more!

Minimum Ready Seconds for StatefulSets

This blog describes the notion of Availability for StatefulSet workloads, and a new alpha feature in Kubernetes 1.22 which adds minReadySeconds configuration for StatefulSets.

What problems does this solve?

Prior to Kubernetes 1.22 release, once a StatefulSet Pod is in the Ready state it is considered Available to receive traffic. For some of the StatefulSet workloads, it may not be the case. For example, a workload like Prometheus with multiple instances of Alertmanager, it should be considered Available only when Alertmanager's state transfer is complete, not when the Pod is in Ready state. Since minReadySeconds adds buffer, the state transfer may be complete before the Pod becomes Available. While this is not a fool proof way of identifying if the state transfer is complete or not, it gives a way to the end user to express their intention of waiting for sometime before the Pod is considered Available and it is ready to serve requests.

Another case, where minReadySeconds helps is when using LoadBalancer Services with cloud providers. Since minReadySeconds adds latency after a Pod is Ready, it provides buffer time to prevent killing pods in rotation before new pods show up. Imagine a load balancer in unhappy path taking 10-15s to propagate. If you have 2 replicas then, you'd kill the second replica only after the first one is up but in reality, first replica cannot be seen because it is not yet ready to serve requests.

So, in general, the notion of Availability in StatefulSets is pretty useful and this feature helps in solving the above problems. This is a feature that already exists for Deployments and DaemonSets and we now have them for StatefulSets too to give users consistent workload experience.

How does it work?

The statefulSet controller watches for both StatefulSets and the Pods associated with them. When the feature gate associated with this feature is enabled, the statefulSet controller identifies how long a particular Pod associated with a StatefulSet has been in the Running state.

If this value is greater than or equal to the time specified by the end user in .spec.minReadySeconds field, the statefulSet controller updates a field called availableReplicas in the StatefulSet's status subresource to include this Pod. The status.availableReplicas in StatefulSet's status is an integer field which tracks the number of pods that are Available.

How do I use it?

You are required to prepare the following things in order to try out the feature:

  • Download and install a kubectl greater than v1.22.0 version
  • Switch on the feature gate with the command line flag --feature-gates=StatefulSetMinReadySeconds=true on kube-apiserver and kube-controller-manager

After successfully starting kube-apiserver and kube-controller-manager, you will see AvailableReplicas in the status and minReadySeconds of spec (with a default value of 0).

Specify a value for minReadySeconds for any StatefulSet and you can check if Pods are available or not by checking AvailableReplicas field using: kubectl get statefulset/<name_of_the_statefulset> -o yaml

How can I learn more?

How do I get involved?

Please reach out to us in the #sig-apps channel on Slack (visit https://slack.k8s.io/ for an invitation if you need one), or on the SIG Apps mailing list: kubernetes-sig-apps@googlegroups.com

Enable seccomp for all workloads with a new v1.22 alpha feature

This blog post is about a new Kubernetes feature introduced in v1.22, which adds an additional security layer on top of the existing seccomp support. Seccomp is a security mechanism for Linux processes to filter system calls (syscalls) based on a set of defined rules. Applying seccomp profiles to containerized workloads is one of the key tasks when it comes to enhancing the security of the application deployment. Developers, site reliability engineers and infrastructure administrators have to work hand in hand to create, distribute and maintain the profiles over the applications life-cycle.

You can use the securityContext field of Pods and their containers can be used to adjust security related configurations of the workload. Kubernetes introduced dedicated seccomp related API fields in this SecurityContext with the graduation of seccomp to General Availability (GA) in v1.19.0. This enhancement allowed an easier way to specify if the whole pod or a specific container should run as:

  • Unconfined: seccomp will not be enabled
  • RuntimeDefault: the container runtimes default profile will be used
  • Localhost: a node local profile will be applied, which is being referenced by a relative path to the seccomp profile root (<kubelet-root-dir>/seccomp) of the kubelet

With the graduation of seccomp, nothing has changed from an overall security perspective, because Unconfined is still the default. This is totally fine if you consider this from the upgrade path and backwards compatibility perspective of Kubernetes releases. But it also means that it is more likely that a workload runs without seccomp at all, which should be fixed in the long term.

SeccompDefault to the rescue

Kubernetes v1.22.0 introduces a new kubelet feature gate SeccompDefault, which has been added in alpha state as every other new feature. This means that it is disabled by default and can be enabled manually for every single Kubernetes node.

What does the feature do? Well, it just changes the default seccomp profile from Unconfined to RuntimeDefault. If not specified differently in the pod manifest, then the feature will add a higher set of security constraints by using the default profile of the container runtime. These profiles may differ between runtimes like CRI-O or containerd. They also differ for its used hardware architectures. But generally speaking, those default profiles allow a common amount of syscalls while blocking the more dangerous ones, which are unlikely or unsafe to be used in a containerized application.

Enabling the feature

Two kubelet configuration changes have to be made to enable the feature:

  1. Enable the feature gate by setting the SeccompDefault=true via the command line (--feature-gates) or the kubelet configuration file.
  2. Turn on the feature by enabling the feature by adding the --seccomp-default command line flag or via the kubelet configuration file (seccompDefault: true).

The kubelet will error on startup if only one of the above steps have been done.

Trying it out

If the feature is enabled on a node, then you can create a new workload like this:

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
    - name: test-container
      image: nginx:1.21

Now it is possible to inspect the used seccomp profile by using crictl while investigating the containers runtime specification:

CONTAINER_ID=$(sudo crictl ps -q --name=test-container)
sudo crictl inspect $CONTAINER_ID | jq .info.runtimeSpec.linux.seccomp
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
  "syscalls": [
    {
      "names": ["_llseek", "_newselect", "accept", …, "write", "writev"],
      "action": "SCMP_ACT_ALLOW"
    },
    
  ]
}

You can see that the lower level container runtime (CRI-O and runc in our case), successfully applied the default seccomp profile. This profile denies all syscalls per default, while allowing commonly used ones like accept or write.

Please note that the feature will not influence any Kubernetes API for now. Therefore, it is not possible to retrieve the used seccomp profile via kubectl get or describe if the SeccompProfile field is unset within the SecurityContext.

The feature also works when using multiple containers within a pod, for example if you create a pod like this:

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
    - name: test-container-nginx
      image: nginx:1.21
      securityContext:
        seccompProfile:
          type: Unconfined
    - name: test-container-redis
      image: redis:6.2

then you should see that the test-container-nginx runs without a seccomp profile:

sudo crictl inspect $(sudo crictl ps -q --name=test-container-nginx) |
    jq '.info.runtimeSpec.linux.seccomp == null'
true

Whereas the container test-container-redis runs with RuntimeDefault:

sudo crictl inspect $(sudo crictl ps -q --name=test-container-redis) |
    jq '.info.runtimeSpec.linux.seccomp != null'
true

The same applies to the pod itself, which also runs with the default profile:

sudo crictl inspectp (sudo crictl pods -q --name test-pod) |
    jq '.info.runtimeSpec.linux.seccomp != null'
true

Upgrade strategy

It is recommended to enable the feature in multiple steps, whereas different risks and mitigations exist for each one.

Feature gate enabling

Enabling the feature gate at the kubelet level will not turn on the feature, but will make it possible by using the SeccompDefault kubelet configuration or the --seccomp-default CLI flag. This can be done by an administrator for the whole cluster or only a set of nodes.

Testing the Application

If you're trying this within a dedicated test environment, you have to ensure that the application code does not trigger syscalls blocked by the RuntimeDefault profile before enabling the feature on a node. This can be done by:

  • Recommended: Analyzing the code (manually or by running the application with strace) for any executed syscalls which may be blocked by the default profiles. If that's the case, then you can override the default by explicitly setting the pod or container to run as Unconfined. Alternatively, you can create a custom seccomp profile (see optional step below). profile based on the default by adding the additional syscalls to the "action": "SCMP_ACT_ALLOW" section.

  • Recommended: Manually set the profile to the target workload and use a rolling upgrade to deploy into production. Rollback the deployment if the application does not work as intended.

  • Optional: Run the application against an end-to-end test suite to trigger all relevant code paths with RuntimeDefault enabled. If a test fails, use the same mitigation as mentioned above.

  • Optional: Create a custom seccomp profile based on the default and change its default action from SCMP_ACT_ERRNO to SCMP_ACT_LOG. This means that the seccomp filter for unknown syscalls will have no effect on the application at all, but the system logs will now indicate which syscalls may be blocked. This requires at least a Kernel version 4.14 as well as a recent runc release. Monitor the application hosts audit logs (defaults to /var/log/audit/audit.log) or syslog entries (defaults to /var/log/syslog) for syscalls via type=SECCOMP (for audit) or type=1326 (for syslog). Compare the syscall ID with those listed in the Linux Kernel sources and add them to the custom profile. Be aware that custom audit policies may lead into missing syscalls, depending on the configuration of auditd.

  • Optional: Use cluster additions like the Security Profiles Operator for profiling the application via its log enrichment capabilities or recording a profile by using its recording feature. This makes the above mentioned manual log investigation obsolete.

Deploying the modified application

Based on the outcome of the application tests, it may be required to change the application deployment by either specifying Unconfined or a custom seccomp profile. This is not the case if the application works as intended with RuntimeDefault.

Enable the kubelet configuration

If everything went well, then the feature is ready to be enabled by the kubelet configuration or its corresponding CLI flag. This should be done on a per-node basis to reduce the overall risk of missing a syscall during the investigations when running the application tests. If it's possible to monitor audit logs within the cluster, then it's recommended to do this for eventually missed seccomp events. If the application works as intended then the feature can be enabled for further nodes within the cluster.

Conclusion

Thank you for reading this blog post! I hope you enjoyed to see how the usage of seccomp profiles has been evolved in Kubernetes over the past releases as much as I do. On your own cluster, change the default seccomp profile to RuntimeDefault (using this new feature) and see the security benefits, and, of course, feel free to reach out any time for feedback or questions.


Editor's note: If you have any questions or feedback about this blog post, feel free to reach out via the Kubernetes slack in #sig-node.

Alpha in v1.22: Windows HostProcess Containers

Kubernetes v1.22 introduced a new alpha feature for clusters that include Windows nodes: HostProcess containers.

HostProcess containers aim to extend the Windows container model to enable a wider range of Kubernetes cluster management scenarios. HostProcess containers run directly on the host and maintain behavior and access similar to that of a regular process. With HostProcess containers, users can package and distribute management operations and functionalities that require host access while retaining versioning and deployment methods provided by containers. This allows Windows containers to be used for a variety of device plugin, storage, and networking management scenarios in Kubernetes. With this comes the enablement of host network mode—allowing HostProcess containers to be created within the host's network namespace instead of their own. HostProcess containers can also be built on top of existing Windows server 2019 (or later) base images, managed through the Windows container runtime, and run as any user that is available on or in the domain of the host machine.

Linux privileged containers are currently used for a variety of key scenarios in Kubernetes, including kube-proxy (via kubeadm), storage, and networking scenarios. Support for these scenarios in Windows previously required workarounds via proxies or other implementations. Using HostProcess containers, cluster operators no longer need to log onto and individually configure each Windows node for administrative tasks and management of Windows services. Operators can now utilize the container model to deploy management logic to as many clusters as needed with ease.

How does it work?

Windows HostProcess containers are implemented with Windows Job Objects, a break from the previous container model using server silos. Job objects are components of the Windows OS which offer the ability to manage a group of processes as a group (a.k.a. jobs) and assign resource constraints to the group as a whole. Job objects are specific to the Windows OS and are not associated with the Kubernetes Job API. They have no process or file system isolation, enabling the privileged payload to view and edit the host file system with the correct permissions, among other host resources. The init process, and any processes it launches or that are explicitly launched by the user, are all assigned to the job object of that container. When the init process exits or is signaled to exit, all the processes in the job will be signaled to exit, the job handle will be closed and the storage will be unmounted.

HostProcess and Linux privileged containers enable similar scenarios but differ greatly in their implementation (hence the naming difference). HostProcess containers have their own pod security policies. Those used to configure Linux privileged containers do not apply. Enabling privileged access to a Windows host is a fundamentally different process than with Linux so the configuration and capabilities of each differ significantly. Below is a diagram detailing the overall architecture of Windows HostProcess containers:

HostProcess Architecture

How do I use it?

HostProcess containers can be run from within a HostProcess Pod. With the feature enabled on Kubernetes version 1.22, a containerd container runtime of 1.5.4 or higher, and the latest version of hcsshim, deploying a pod spec with the correct HostProcess configuration will enable you to run HostProcess containers. To get started with running Windows containers see the general guidance for Windows in Kubernetes

How can I learn more?

How do I get involved?

HostProcess containers are in active development. SIG Windows welcomes suggestions from the community. Get involved with SIG Windows to contribute!

Kubernetes Memory Manager moves to beta

The blog post explains some of the internals of the Memory manager, a beta feature of Kubernetes 1.22. In Kubernetes, the Memory Manager is a kubelet subcomponent. The memory manage provides guaranteed memory (and hugepages) allocation for pods in the Guaranteed QoS class.

This blog post covers:

  1. Why do you need it?
  2. The internal details of how the MemoryManager works
  3. Current limitations of the MemoryManager
  4. Future work for the MemoryManager

Why do you need it?

Some Kubernetes workloads run on nodes with non-uniform memory access (NUMA). Suppose you have NUMA nodes in your cluster. In that case, you'll know about the potential for extra latency when compute resources need to access memory on the different NUMA locality.

To get the best performance and latency for your workload, container CPUs, peripheral devices, and memory should all be aligned to the same NUMA locality. Before Kubernetes v1.22, the kubelet already provided a set of managers to align CPUs and PCI devices, but you did not have a way to align memory. The Linux kernel was able to make best-effort attempts to allocate memory for tasks from the same NUMA node where the container is executing are placed, but without any guarantee about that placement.

How does it work?

The memory manager is doing two main things:

  • provides the topology hint to the Topology Manager
  • allocates the memory for containers and updates the state

The overall sequence of the Memory Manager under the Kubelet

MemoryManagerDiagram

During the Admission phase:

  1. When first handling a new pod, the kubelet calls the TopologyManager's Admit() method.
  2. The Topology Manager is calling GetTopologyHints() for every hint provider including the Memory Manager.
  3. The Memory Manager calculates all possible NUMA nodes combinations for every container inside the pod and returns hints to the Topology Manager.
  4. The Topology Manager calls to Allocate() for every hint provider including the Memory Manager.
  5. The Memory Manager allocates the memory under the state according to the hint that the Topology Manager chose.

During Pod creation:

  1. The kubelet calls PreCreateContainer().
  2. For each container, the Memory Manager looks the NUMA nodes where it allocated the memory for the container and then returns that information to the kubelet.
  3. The kubelet creates the container, via CRI, using a container specification that incorporates information from the Memory Manager information.

Let's talk about the configuration

By default, the Memory Manager runs with the None policy, meaning it will just relax and not do anything. To make use of the Memory Manager, you should set two command line options for the kubelet:

  • --memory-manager-policy=Static
  • --reserved-memory="<numaNodeID>:<resourceName>=<quantity>"

The value for --memory-manager-policy is straightforward: Static. Deciding what to specify for --reserved-memory takes more thought. To configure it correctly, you should follow two main rules:

  • The amount of reserved memory for the memory resource must be greater than zero.
  • The amount of reserved memory for the resource type must be equal to NodeAllocatable (kube-reserved + system-reserved + eviction-hard) for the resource. You can read more about memory reservations in Reserve Compute Resources for System Daemons.

Reserved memory

Current limitations

The 1.22 release and promotion to beta brings along enhancements and fixes, but the Memory Manager still has several limitations.

Single vs Cross NUMA node allocation

The NUMA node can not have both single and cross NUMA node allocations. When the container memory is pinned to two or more NUMA nodes, we can not know from which NUMA node the container will consume the memory.

Single vs Cross NUMA allocation

  1. The container1