Optimize K8s API usage for watching events #59080

Merged: potiuk merged 5 commits into apache:main from boschglobal:feature/optimize-kubernetes-api-usage on Dec 11, 2025

wolfdn (Contributor) commented Dec 5, 2025

Description

This PR optimizes how the KubernetesPodOperator interacts with the Kubernetes API when retrieving events.

Previously, the operator did not pass the resourceVersion parameter when listing events for a pod. This forced Kubernetes to perform a quorum read, an expensive operation, for every request. Combined with frequent polling for new events, this created significant load on the Kubernetes API and etcd, especially when many pods were started in parallel.

Best practice is for clients to store the resourceVersion from each response and provide it in subsequent requests. This allows Kubernetes to serve the event list far more efficiently. As stated in the Kubernetes documentation:

Unless you have strong consistency requirements, using resourceVersionMatch=NotOlderThan and a known resourceVersion is preferable since it can achieve better performance and scalability of your cluster than leaving resourceVersion and resourceVersionMatch unset, which requires quorum read to be served.

Reference: https://kubernetes.io/docs/reference/using-api/api-concepts/#semantics-for-get-and-list

With this change, the operator performs one initial event listing without a resourceVersion, and all subsequent requests include the last known resourceVersion.
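
The request pattern described above can be sketched roughly as follows. This is an illustration, not the operator's actual code: `EventCursor`, `build_list_kwargs`, and `read_events` are hypothetical names, and `list_fn` stands in for a list call such as the Kubernetes Python client's `CoreV1Api.list_namespaced_event`, which is assumed here to accept `resource_version` and `resource_version_match` keyword arguments.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class EventCursor:
    """Remembers the resourceVersion of the most recent event list response."""

    resource_version: Optional[str] = None


def build_list_kwargs(pod_name: str, cursor: EventCursor) -> dict:
    """Build keyword arguments for an event list call (hypothetical helper)."""
    kwargs = {"field_selector": f"involvedObject.name={pod_name}"}
    if cursor.resource_version is not None:
        # Later calls: pass the known version so no quorum read is required.
        kwargs["resource_version"] = cursor.resource_version
        kwargs["resource_version_match"] = "NotOlderThan"
    return kwargs


def read_events(
    list_fn: Callable[..., Any], namespace: str, pod_name: str, cursor: EventCursor
) -> list:
    """Call list_fn, then store the returned resourceVersion for the next call."""
    response = list_fn(namespace=namespace, **build_list_kwargs(pod_name, cursor))
    cursor.resource_version = response.metadata.resource_version
    return list(response.items)
```

The first call sends no resourceVersion; every later call reuses the value stored in the cursor.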

Additionally, this PR introduces usage of the Kubernetes watch API in deferred (asynchronous) mode. Instead of polling every few seconds, the operator can now watch for new events. This provides two major benefits:

  • New events become visible almost immediately.
  • The number of requests sent to the Kubernetes API is reduced because the watch connection remains active for a longer period.
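
As a rough sketch of the idea (not the code in this PR), the loop below prefers a long-lived watch and switches to the polling fallback listed under Changes when watching is not permitted. All names are hypothetical: `WatchForbidden` is an assumed error type, and a real implementation would wrap the Kubernetes client's watch stream and translate an HTTP 403 response into it. `max_polls` exists only to keep the sketch bounded.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, List


class WatchForbidden(Exception):
    """Hypothetical error: the caller lacks permission to watch events."""


async def follow_events(
    watch: Callable[[], AsyncIterator[str]],
    poll: Callable[[], Awaitable[List[str]]],
    on_event: Callable[[str], None],
    poll_interval: float = 2.0,
    max_polls: int = 3,
) -> str:
    """Prefer a long-lived watch; fall back to periodic polling if it is forbidden."""
    try:
        async for event in watch():
            on_event(event)  # handled as soon as the stream yields it
        return "watch"
    except WatchForbidden:
        # Fallback: one list request per interval instead of one open connection.
        for _ in range(max_polls):
            for event in await poll():
                on_event(event)
            await asyncio.sleep(poll_interval)
        return "poll"
```

The return value only reports which path was taken, which makes the sketch easy to exercise with fake `watch` and `poll` callables.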

We implemented this change after observing a high number of HTTP 429 (rate-limited) responses from our cluster's API server. One contributing factor was the large volume of GET requests for event listings, which placed heavy load on etcd. After deploying a patched version of the operator with these improvements, the number of 429 responses dropped from several thousand per minute to nearly zero.

Changes

  • Remember resourceVersion when retrieving events from K8s API
  • Use Kubernetes watch API to watch events when running in deferred (asynchronous) mode
    • If the Airflow triggerer does not have permission to watch events, deferred mode falls back to polling events (this keeps compatibility with older versions of the Helm chart)
  • Improve mechanism to avoid printing duplicate events (now remembers seen event UIDs instead of counting events)
  • Add the watch verb for events to the pod launcher role in the Helm chart (required so that the Airflow triggerer has permission to watch events)
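
The UID-based duplicate check can be illustrated with a minimal sketch. The helper name `unseen_events` and the dictionary shape of an event are assumptions made for illustration; the real code works on Kubernetes event objects. Counting events presumably only works when each response is a strict extension of the previous one, whereas a set of UIDs does not depend on list length or order.

```python
from typing import Iterable, List, Set


def unseen_events(events: Iterable[dict], seen_uids: Set[str]) -> List[dict]:
    """Return events whose UID has not been seen yet and record them as seen."""
    fresh = []
    for event in events:
        uid = event["uid"]
        if uid not in seen_uids:
            seen_uids.add(uid)
            fresh.append(event)
    return fresh
```

The same `seen_uids` set can be shared between the polling path and the watch path, so an event delivered by both is printed once.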


wolvery reacted with heart emoji
wolfdn requested review from hussein-awala, jedcunningham and jscheffl as code owners December 5, 2025 06:53
boring-cyborg bot added area:helm-chart Airflow Helm Chart area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Dec 5, 2025
jscheffl approved these changes Dec 7, 2025
jscheffl (Contributor) left a comment

Thanks for this - it looks like an impressive improvement!

One mini nit. Before merging, I would leave the PR open for a few days so that another pair of eyes can review it. LGTM in my view.

wolfdn reacted with heart emoji
jscheffl added this to the Airflow Helm Chart 1.19.0 milestone Dec 7, 2025
jscheffl (Contributor) commented Dec 7, 2025

@jedcunningham It would be good to have your opinion, and to have this in the chart 1.19 release as well.

...s/utils/pod_manager.py

Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>
jscheffl (Contributor) commented Dec 8, 2025

I heard from @AutomationDev85 about some problems with the async event polling, so we might need to triage tomorrow to double-check that this is not adding more problems than benefits. Please do not merge before this is clarified tomorrow (which might be 10:00 CET on Tuesday the 9th).

AutomationDev85 (Contributor) commented Dec 11, 2025

@jscheffl @wolfdn I added a commit which improves pod start handling by awaiting start completion and cancelling the parallel event watcher. This resolves a sporadic hang in the event stream; after testing, the new approach proved more stable.
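
A minimal sketch of that pattern, under the assumption that both steps are coroutines (the names `await_start_with_watcher`, `wait_for_start`, and `watch_events` are hypothetical and not taken from the commit): the watcher runs as a background task only while the pod start is awaited, and is cancelled once the start completes or fails.

```python
import asyncio
from typing import Awaitable, Callable


async def await_start_with_watcher(
    wait_for_start: Callable[[], Awaitable[str]],
    watch_events: Callable[[], Awaitable[None]],
) -> str:
    """Run the event watcher only while the pod is starting."""
    watcher = asyncio.create_task(watch_events())
    try:
        return await wait_for_start()
    finally:
        watcher.cancel()
        # Wait for the cancellation to finish so the watcher cannot linger.
        await asyncio.gather(watcher, return_exceptions=True)
```

Cancelling in `finally` means the watcher is stopped on both the success and the error path, so a stalled event stream cannot keep the caller waiting.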

wolfdn reacted with thumbs up emoji

jscheffl approved these changes Dec 11, 2025
jscheffl (Contributor) left a comment

Looks good, and even a bit better like this. I would still wait 1-2 days (there is no pressure to merge), hoping for feedback from others prior to merging. Some more eyes might be good.

potiuk (Member) commented Dec 11, 2025

This looks like a fantastic improvement.

potiuk approved these changes Dec 11, 2025
potiuk merged commit 81dee2f into apache:main Dec 11, 2025
125 checks passed
shahar1 mentioned this pull request Dec 30, 2025
54 tasks
danielhoherd mentioned this pull request Jan 20, 2026
jedcunningham mentioned this pull request Jan 30, 2026
98 tasks
jhgoebbert pushed a commit to jhgoebbert/airflow_Owen-CH-Leung that referenced this pull request Feb 8, 2026
* Optimize K8s API usage for watching events

* Fix mypy errors

* Update providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/utils/pod_manager.py

Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>

* Fix hanging API communication during pod event watching

---------

Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>
Co-authored-by: AutomationDev85
Subham-KRLX pushed a commit to Subham-KRLX/airflow that referenced this pull request Mar 4, 2026

Reviewers

  • potiuk: approved these changes
  • jscheffl: approved these changes
  • jedcunningham: awaiting requested review (code owner)
  • hussein-awala: awaiting requested review (code owner)
