-
Notifications
You must be signed in to change notification settings - Fork 16.6k
helm: ensure graceful redis shutdown#58432
Conversation
What
This pull request modifies the Helm chart's Redis configuration to start the redis-server binary via exec, ensuring it runs as PID 1 inside the container.
Why
When a Kubernetes Pod is shutting down, it sends a SIGTERM signal to the container's main process (PID 1).
The previous configuration used the command: /bin/sh -c "redis-server". In this setup:
- The
/bin/shshell is PID 1. - The actual
redis-serveris a child process.
Standard shells like sh do not automatically propagate signals (like SIGTERM) to their child processes. When the shell receives SIGTERM, it ignores it and waits for its child to exit. Since the redis-server never receives the SIGTERM, it doesn't initiate a graceful save-and-exit sequence.
This leads to the pod hanging until the terminationGracePeriodSeconds timeout is reached (200 seconds by default), at which point Kubernetes sends a SIGKILL (forceful kill).
Impact
- Fixes Pod Hangs: Eliminates unnecessary delays during Pod termination/scaling.
- Prevents Data Loss Risk : Allows Redis to perform a proper
BGSAVEor persistence flush upon receivingSIGTERM, reducing the risk of data loss compared to an immediateSIGKILL.
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.
3bba933 to
89f8883
Compare
jscheffl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sound very reasonable! Wondering why it was made like this in the past and not noticed for years. Code is always a mystery...
Does someody else know a reason why it was made like this?
I leave this PR open for a moment but in my view this is avery good bugfix. Thanks for raising!
Note: CI checks failed because of infra problems/GH problems today. Can be rebased before merge, then I assume will turn green.
89f8883 to
6e1f79b
Compare
|
Wouldn't it be better to simply 'promote' redis-server to be the command? Good to know why the Redis pod was always taking forever to terminate! |
|
I did notice redis pod takes a good 15-20 min to terminate, hopefully this fixes. |
I haven't tried it, but I don't know if the password would be interpolated. If it does though, then yes direct would be fine. Feel free to open another PR if you want to try it out @amordoch. |