
Conversation

@joamaki (Contributor) commented Oct 10, 2024

When a timer/one-shot/observer job finishes, it should close its health reporting instead of stopping it, so that we don't accumulate an endless number of health statuses for one-off jobs and the like.


Signed-off-by: Jussi Maki <[email protected]>
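
For context, a minimal sketch of the behavior this change describes, not the actual cilium/hive code: the `Health` interface below is a reduced, assumed stand-in for cilium's `cell.Health`, and `runOneShot` is a hypothetical wrapper.

```go
package jobs

import "context"

// Health is a reduced, assumed stand-in for cilium's cell.Health;
// the method set here is illustrative only.
type Health interface {
	OK(status string)
	Degraded(reason string, err error)
	Stopped(reason string)
	Close() // removes the status entry entirely
}

// runOneShot runs fn once and then closes (rather than merely stops)
// its health reporter, so a finished one-off job does not leave a
// permanent stopped entry behind in the health table.
func runOneShot(ctx context.Context, health Health, fn func(context.Context) error) {
	defer health.Close()
	if err := fn(ctx); err != nil {
		health.Degraded("one-shot job failed", err)
		return
	}
	health.OK("one-shot job done")
}
```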
@joamaki joamaki requested a review from a team as a code owner October 10, 2024 15:49
@joamaki joamaki requested review from ovidiutirla and removed request for a team October 10, 2024 15:49
@joamaki (Contributor, Author) commented Oct 10, 2024

There's also another related issue on the cilium/cilium pkg/hive/health/provider.go side, where the Stopped() method only sets the StoppedAt time but not the Level or Message, which can lead to confusion. Will do a separate PR for that.

(https://github.com/cilium/cilium/pull/35154/files#diff-f0f7633343b350adb76fe60850c162ed40b56279936699feffa2c0dce5587411)
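
A hypothetical illustration of that follow-up fix; the `Status` fields and the "Stopped" level here are assumptions, not cilium's actual types:

```go
package health

import "time"

// Status is an assumed, simplified version of a health table entry.
type Status struct {
	Level     string
	Message   string
	StoppedAt time.Time
}

// stop records why and when the reporter stopped. The confusion noted
// above comes from only StoppedAt being set, leaving a stale
// Level/Message pair from the last OK/Degraded update.
func (s *Status) stop(reason string) {
	s.StoppedAt = time.Now()
	s.Level = "Stopped"
	s.Message = reason
}
```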

@joamaki (Contributor, Author) commented Oct 10, 2024

I'm slightly unsure of this change. Do we want the jobs themselves to close their health scopes, or should we just have them call Stopped and Close the whole tree when the job.Group is stopped? That might never happen, though; it's more likely that the group stays running and is used to spawn a lot of one-shot jobs. I do see the benefit of seeing that a one-shot or a timer job was stopped. Another way to solve this would be to garbage collect stopped health entries after a while (sketched below).

My main concern is that right now it's way too easy to leave a lot of these around, and if we ever start spawning a bunch of these with generated names, we'll end up with a huge health table, so I feel like erring on the side of cleaning up might be better. Thoughts?
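
For illustration, a hedged sketch of the GC alternative mentioned above, reusing the assumed `Status` type from the earlier sketch (nothing like this is in the PR):

```go
// gcStopped drops entries that have been stopped for longer than ttl.
// It would be called periodically, e.g. from a timer job.
func gcStopped(statuses map[string]*Status, ttl time.Duration) {
	now := time.Now()
	for id, s := range statuses {
		if !s.StoppedAt.IsZero() && now.Sub(s.StoppedAt) > ttl {
			delete(statuses, id) // deleting while ranging over a map is safe in Go
		}
	}
}
```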

@ovidiutirla (Contributor) commented Oct 13, 2024

I'm more into regular GC of the stopped health entries. Keeping their status as stopped for as long as the cilium-agent runs sounds like it could easily become a memory leak.

JobGroups might never get stopped; I wouldn't rely on that (agree on this).

Could we mark them as stopped, so the metric reports that they were stopped, and then close them immediately after?
Or have a slight delay between stop and close, instead of a periodic GC (see the sketch below)?

Closing them instead of reporting them as stopped could produce some weird visuals in the metrics, which wouldn't be that intuitive.
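
A sketch of the "stop now, close after a delay" idea, using the assumed `Health` interface from the first sketch; the grace period gives metrics a chance to observe the stopped state before the entry disappears:

```go
// stopThenClose marks the reporter stopped, then removes the entry
// after a grace period. time.AfterFunc invokes Close on its own
// goroutine once the delay elapses.
func stopThenClose(health Health, reason string, delay time.Duration) {
	health.Stopped(reason)
	time.AfterFunc(delay, health.Close)
}
```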

@tommyp1ckles (Contributor) commented
> My main concern is that right now it's way too easy to leave a lot of these around, and if we ever start spawning a bunch of these with generated names, we'll end up with a huge health table, so I feel like erring on the side of cleaning up might be better. Thoughts?

> I'm more into regular GC of the stopped health entries. Keeping their status as stopped for as long as the cilium-agent runs sounds like it could easily become a memory leak.

I guess this begs the question of when it's appropriate to stop vs. close the reporter. In my view, deleting reporters once something like a Job ceases to run makes sense.

For the one-shot, that's supposed to be auto-resilient, right? So it's in one of these states:

  • Failed -> retrying with backoff: keep the report around.
  • Stopped -> task no longer relevant, or it completed: get rid of the reporter.

There may be instances where we're interested in retaining a report from a OneShot that failed and that we decided we didn't want to keep retrying, maybe? But in that case I think the above behaviour should be the default, and we could account for that use case separately. A sketch of that default follows below.
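
A sketch of that default behavior (assumptions, not the merged implementation), again using the assumed `Health` interface from the first sketch: the reporter stays visible while the one-shot retries, and is closed once the job completes, gives up, or is cancelled.

```go
// runWithRetries keeps the reporter around while a one-shot job retries
// with backoff, and closes it when the job stops for good.
func runWithRetries(ctx context.Context, health Health, fn func(context.Context) error, maxAttempts int) {
	defer health.Close() // stopped: completed, gave up, or no longer relevant
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := fn(ctx); err != nil {
			// Failed -> retrying with backoff: keep the report around.
			health.Degraded(fmt.Sprintf("attempt %d failed", attempt), err)
			select {
			case <-ctx.Done():
				return
			case <-time.After(time.Second << attempt): // crude exponential backoff
			}
			continue
		}
		health.OK("completed")
		return
	}
}
```

(This needs "context", "fmt", and "time" imported alongside the first sketch.)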

