
Conversation

@joamaki (Contributor) commented Oct 10, 2024

When a timer/one-shot/observer job finishes, it should close its health reporting instead of stopping it, so that we don't accumulate an endless number of health statuses for one-off jobs and the like.


Signed-off-by: Jussi Maki <[email protected]>
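
For context, a minimal sketch of the behavior this change describes, not the actual cilium/hive code: the `Health` interface below is a reduced, assumed stand-in for cilium's `cell.Health`, and `runOneShot` is a hypothetical wrapper.

```go
package jobs

import "context"

// Health is a reduced, assumed stand-in for cilium's cell.Health;
// the method set here is illustrative only.
type Health interface {
	OK(status string)
	Degraded(reason string, err error)
	Stopped(reason string)
	Close() // removes the status entry entirely
}

// runOneShot runs fn once and then closes (rather than merely stops)
// its health reporter, so a finished one-off job does not leave a
// permanent stopped entry behind in the health table.
func runOneShot(ctx context.Context, health Health, fn func(context.Context) error) {
	defer health.Close()
	if err := fn(ctx); err != nil {
		health.Degraded("one-shot job failed", err)
		return
	}
	health.OK("one-shot job done")
}
```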
@joamaki joamaki requested a review from a team as a code owner October 10, 2024 15:49
@joamaki joamaki requested review from ovidiutirla and removed request for a team October 10, 2024 15:49
@joamaki (Contributor, Author) commented Oct 10, 2024

There's also another related issue on the cilium/cilium pkg/hive/health/provider.go side, where the Stopped() method only sets the StoppedAt time but not the Level or Message, which can lead to confusion. Will do a separate PR for that.

(https://github.com/cilium/cilium/pull/35154/files#diff-f0f7633343b350adb76fe60850c162ed40b56279936699feffa2c0dce5587411)
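
A hypothetical illustration of that follow-up fix; the `Status` fields and the "Stopped" level here are assumptions, not cilium's actual types:

```go
package health

import "time"

// Status is an assumed, simplified version of a health table entry.
type Status struct {
	Level     string
	Message   string
	StoppedAt time.Time
}

// stop records why and when the reporter stopped. The confusion noted
// above comes from only StoppedAt being set, leaving a stale
// Level/Message pair from the last OK/Degraded update.
func (s *Status) stop(reason string) {
	s.StoppedAt = time.Now()
	s.Level = "Stopped"
	s.Message = reason
}
```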

@joamaki (Contributor, Author) commented Oct 10, 2024

I'm slightly unsure of this change. Do we want the jobs themselves to close their health scopes, or should we just have them call Stopped and Close the whole tree when the job.Group is stopped? That might never happen, though; it's more likely that the group stays running and is used to spawn a lot of one-shot jobs. I do see the benefit of seeing that a one-shot or a timer job was stopped. Another way to solve this would be to garbage collect stopped health entries after a while (sketched below).

My main concern is that right now it's way too easy to leave a lot of these around, and if we ever start spawning a bunch of these with generated names, we'll end up with a huge health table, so I feel like erring on the side of cleaning up might be better. Thoughts?
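
For illustration, a hedged sketch of the GC alternative mentioned above, reusing the assumed `Status` type from the earlier sketch (nothing like this is in the PR):

```go
// gcStopped drops entries that have been stopped for longer than ttl.
// It would be called periodically, e.g. from a timer job.
func gcStopped(statuses map[string]*Status, ttl time.Duration) {
	now := time.Now()
	for id, s := range statuses {
		if !s.StoppedAt.IsZero() && now.Sub(s.StoppedAt) > ttl {
			delete(statuses, id) // deleting while ranging over a map is safe in Go
		}
	}
}
```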

@ovidiutirla (Contributor) commented Oct 13, 2024

I'm more into regular GC of the stopped health entries. Keeping their status as stopped for as long as the cilium-agent runs sounds like it could easily become a memory leak.

JobGroups might never get stopped; I wouldn't rely on that (agree on this).

Could we mark them as stopped, so the metric reports that they were stopped, and then close them immediately after?
Or have a slight delay between stop and close, instead of a periodic GC (see the sketch below)?

Closing them instead of reporting them as stopped could produce some weird visuals in the metrics, which wouldn't be that intuitive.
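
A sketch of the "stop now, close after a delay" idea, using the assumed `Health` interface from the first sketch; the grace period gives metrics a chance to observe the stopped state before the entry disappears:

```go
// stopThenClose marks the reporter stopped, then removes the entry
// after a grace period. time.AfterFunc invokes Close on its own
// goroutine once the delay elapses.
func stopThenClose(health Health, reason string, delay time.Duration) {
	health.Stopped(reason)
	time.AfterFunc(delay, health.Close)
}
```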

@tommyp1ckles (Contributor) commented
> My main concern is that right now it's way too easy to leave a lot of these around, and if we ever start spawning a bunch of these with generated names, we'll end up with a huge health table, so I feel like erring on the side of cleaning up might be better. Thoughts?

> I'm more into regular GC of the stopped health entries. Keeping their status as stopped for as long as the cilium-agent runs sounds like it could easily become a memory leak.

I guess this begs the question of when it's appropriate to stop vs. close the reporter. In my view, deleting reporters once something like a Job ceases to run makes sense.

For the one-shot, that's supposed to be auto-resilient, right? So it's in one of these states:

  • Failed -> retrying with backoff: keep the report around.
  • Stopped -> task no longer relevant, or it completed: get rid of the reporter.

There may be instances where we're interested in retaining a report from a OneShot that failed and that we decided we didn't want to keep retrying, maybe? But in that case I think the above behaviour should be the default, and we could account for that use case separately. A sketch of that default follows below.
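
A sketch of that default behavior (assumptions, not the merged implementation), again using the assumed `Health` interface from the first sketch: the reporter stays visible while the one-shot retries, and is closed once the job completes, gives up, or is cancelled.

```go
// runWithRetries keeps the reporter around while a one-shot job retries
// with backoff, and closes it when the job stops for good.
func runWithRetries(ctx context.Context, health Health, fn func(context.Context) error, maxAttempts int) {
	defer health.Close() // stopped: completed, gave up, or no longer relevant
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := fn(ctx); err != nil {
			// Failed -> retrying with backoff: keep the report around.
			health.Degraded(fmt.Sprintf("attempt %d failed", attempt), err)
			select {
			case <-ctx.Done():
				return
			case <-time.After(time.Second << attempt): // crude exponential backoff
			}
			continue
		}
		health.OK("completed")
		return
	}
}
```

(This needs "context", "fmt", and "time" imported alongside the first sketch.)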

