-
Notifications
You must be signed in to change notification settings - Fork 134
Have stuck detection account for worker-level timeouts as well as client-level #1126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a76ba97 to
11838bf
Compare
9134eaf to
a0cd70e
Compare
…ent-level As reported by #1125, the stuck job detection mechanism currently only considers client-level timeouts and without properly accounting for the possibility of worker-level overrides. Here, make sure to account for worker-level timeouts as well by hosting the job timeout calculation further up in the executor logic. Fixes #1125.
a0cd70e to
7eebbb3
Compare
| } | ||
|
|
||
| return context.WithCancel(ctx) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was doing very little (it's the same as a if jobTimeout > 0 { check) and I needed the conditional for something else, so I just pulled it into execute.
bgentry
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, great quick fix. Probably worthy of a quick .1 release?
| watchStuckCancel := e.watchStuck(ctx) | ||
| defer watchStuckCancel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this is so much more readable than the large inline block before. I don't feel like I lose the context and flow of the surrounding code, while the fairly complex stuck job watching logic is nicely contained in a helper. 👏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah +1. Should have done this from the beginning.
Thx. And yeah, let's get a patch release in given there's no upcoming plans for another release anytime soon. |
Prepare patch release largely containing #1126. [skip ci]
Prepare patch release largely containing #1126. [skip ci]
Fix a bug that came in with #1126 in which we we were correctly calculating timeout, but then not passing it down to the stuck job function when starting the stuck detection goroutine. There is a test that was checking this worked, but due to the nature of the bug, it was in effect detecting a stuck job after 0s and therefore passing by accident. I looked into ways to add additional testing here, but elected not to add more because they'd involve the sort of test I really hate, which has to wait arbitrarily wait to try and check that something did not happen, introducing both slowness and intermittency. After the fix here lands, this is the sort of thing that's not too likely to regress, and should be noticed quickly in case it does.
Fix a bug that came in with #1126 in which we we were correctly calculating timeout, but then not passing it down to the stuck job function when starting the stuck detection goroutine. There is a test that was checking this worked, but due to the nature of the bug, it was in effect detecting a stuck job after 0s and therefore passing by accident. I looked into ways to add additional testing here, but elected not to add more because they'd involve the sort of test I really hate, which has to wait arbitrarily wait to try and check that something did not happen, introducing both slowness and intermittency. After the fix here lands, this is the sort of thing that's not too likely to regress, and should be noticed quickly in case it does. Fixes #1125.
Fix a bug that came in with #1126 in which we we were correctly calculating timeout, but then not passing it down to the stuck job function when starting the stuck detection goroutine. There is a test that was checking this worked, but due to the nature of the bug, it was in effect detecting a stuck job after 0s and therefore passing by accident. I looked into ways to add additional testing here, but elected not to add more because they'd involve the sort of test I really hate, which has to wait arbitrarily wait to try and check that something did not happen, introducing both slowness and intermittency. After the fix here lands, this is the sort of thing that's not too likely to regress, and should be noticed quickly in case it does. Fixes #1125.
Fix a bug that came in with #1126 in which we we were correctly calculating timeout, but then not passing it down to the stuck job function when starting the stuck detection goroutine. There is a test that was checking this worked, but due to the nature of the bug, it was in effect detecting a stuck job after 0s and therefore passing by accident. I looked into ways to add additional testing here, but elected not to add more because they'd involve the sort of test I really hate, which has to wait arbitrarily wait to try and check that something did not happen, introducing both slowness and intermittency. After the fix here lands, this is the sort of thing that's not too likely to regress, and should be noticed quickly in case it does. Fixes #1125.
As reported by #1125, the stuck job detection mechanism currently only
considers client-level timeouts and without properly accounting for the
possibility of worker-level overrides.
Here, make sure to account for worker-level timeouts as well by hosting
the job timeout calculation further up in the executor logic.
Fixes #1125.