In Kubernetes v1.33, the Backoff Limit Per Index feature reaches general availability (GA). This blog describes the Backoff Limit Per Index feature and its benefits.
When you run workloads on Kubernetes, you must consider scenarios where Pod failures can affect the completion of your workloads. Ideally, your workload should tolerate transient failures and continue running.
To achieve failure tolerance in a Kubernetes Job, you can set the
spec.backoffLimit field. This field specifies the total number of tolerated
failures.
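As a point of reference, a minimal sketch of an Indexed Job that relies only on spec.backoffLimit might look like the following. The Job name, container name, image, and command are placeholders chosen for illustration, not taken from the original post.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-backoff-limit        # hypothetical name
spec:
  completions: 10
  parallelism: 10
  completionMode: Indexed
  backoffLimit: 3                    # one failure budget shared by all indexes
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker                 # hypothetical container name
        image: busybox:1.36          # placeholder image
        command: ["sh", "-c", "echo running index $JOB_COMPLETION_INDEX"]
```

With this configuration, failures from any index draw from the same budget of 3.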
However, for workloads where every index is considered independent, such as
embarrassingly parallel
workloads, the spec.backoffLimit field is often not flexible enough.
For example, you may choose to run multiple suites of integration tests by
representing each suite as an index within an Indexed Job.
In that setup, a fast-failing index (test suite) is likely to consume your
entire budget for tolerating Pod failures, and you might not be able to run the
other indexes.
To address this limitation, Kubernetes introduced backoff limit per index, which allows you to control the number of retries for each index separately.
To use Backoff Limit Per Index for Indexed Jobs, specify the number of tolerated
Pod failures per index with the spec.backoffLimitPerIndex field. When you set
this field, the Job executes all indexes by default.
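A minimal sketch of such a Job could look like the manifest below. The Job name, container name, image, and the command that deliberately fails index 0 are assumptions made for illustration. Note that Kubernetes requires restartPolicy: Never in the Pod template when backoffLimitPerIndex is set.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-backoff-limit-per-index   # hypothetical name
spec:
  completions: 10
  parallelism: 10
  completionMode: Indexed                 # the per-index limit only applies to Indexed Jobs
  backoffLimitPerIndex: 1                 # each index tolerates one Pod failure (one retry)
  template:
    spec:
      restartPolicy: Never                # required when backoffLimitPerIndex is set
      containers:
      - name: worker                      # hypothetical container name
        image: busybox:1.36               # placeholder image
        command:
        - sh
        - -c
        # Illustrative workload: index 0 always fails, every other index succeeds.
        - 'if [ "$JOB_COMPLETION_INDEX" -eq 0 ]; then exit 1; fi; echo ok'
```

In this sketch, index 0 would use up its own retry budget and be marked as failed, while the remaining indexes still run.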
Additionally, to fine-tune the error handling:
- Cap the total number of failed indexes by setting the spec.maxFailedIndexes field. When the limit is exceeded, the entire Job is terminated.
- Define a short-circuit that marks an index as failed without further retries by using the FailIndex action in the Pod Failure Policy mechanism.
When the number of tolerated failures is exceeded, the Job marks that index as
failed and lists it in the Job's status.failedIndexes field.
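To make the status reporting concrete, here is an illustrative excerpt of what a finished Job's status might contain. The values are invented for this example: they assume a Job with 10 completions and backoffLimitPerIndex: 1 in which indexes 4 and 9 each failed twice and no other Pod failed.

```yaml
# Illustrative status excerpt; the values are assumptions, not real output.
status:
  completedIndexes: "0-3,5-8"   # indexes that succeeded
  failedIndexes: "4,9"          # indexes that exceeded their per-index backoff limit
  succeeded: 8                  # one successful Pod per completed index
  failed: 4                     # two failed Pods for each of the two failed indexes
```

You could inspect these fields on a real Job with a command such as kubectl get job with YAML output.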
The following Job spec snippet is an example of how to combine backoff limit per index with the Pod Failure Policy feature:
completions: 10
parallelism: 10
completionMode: Indexed
backoffLimitPerIndex: 1
maxFailedIndexes: 5
podFailurePolicy:
  rules:
  - action: Ignore
    onPodConditions:
    - type: DisruptionTarget
  - action: FailIndex
    onExitCodes:
      operator: In
      values: [ 42 ]
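In this example, the Job handles Pod failures as follows:
- Ignores any failed Pods that have the built-in disruption condition, called DisruptionTarget. These Pods don't count towards Job backoff limits.
- Fails the index corresponding to the failed Pod if any of the failed Pod's containers finished with exit code 42, based on the matching FailIndex rule.
- Retries the first failure of any index, unless the index failed due to the matching FailIndex rule.
- Fails the entire Job if the number of failed indexes exceeds 5 (set by the spec.maxFailedIndexes field).
This work was sponsored by the Kubernetes batch working group in close collaboration with the SIG Apps community.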
If you are interested in working on new features in this space, we recommend subscribing to our Slack channel and attending the regular community meetings.