-
Notifications
You must be signed in to change notification settings - Fork 111
Replace cfn-hup on compute nodes with systemd timers to signal updates #3070
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
5ce2f72 to
96d1cbd
Compare
| @@ -0,0 +1,11 @@ | |||
| [Unit] | |||
| Description=Check file modification time every minute | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description must be more specific: it must be more clear that this is a timer configured by pcluster (add this info in the description and in the file name) and the goal of the timer (checks file modifications, which files, why?)
| Description=Check file modification time every minute | ||
|
|
||
| [Timer] | ||
| AccuracySec=1s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The accuracy has an impact on CPU wake-ups.
The finer the accuracy, the higher the chances to generate load on the CPU.
Since we are using a 60sec activation, what about using a 20s accuracy?
See https://www.freedesktop.org/software/systemd/man/latest/systemd.timer.html
| # Cookbook:: aws-parallelcluster-slurm | ||
| # Recipe:: config_compute | ||
| # | ||
| # Copyright:: 2013-2021 Amazon.com, Inc. or its affiliates. All Rights Reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here and in other files: fix copyright year with Copyright:: 2025 Amazon.com
|
|
||
| # | ||
| # Cookbook:: aws-parallelcluster-slurm | ||
| # Recipe:: config_compute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix recipe name
| # OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| template '/etc/systemd/system/check-update.service' do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, above and below: the name of the service and the corresponding timer must be more talkative, expressing that they are pcluster services related to the cluster updates
| ------ | ||
|
|
||
| **CHANGES** | ||
| - Replace cfn-hup in compute nodes with systemd timers to support in place updates. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We must surface the value of this change for the user, which is providing better performance.
Also we should surface that the new mechanism relies on shared storage sync between head node and compute fleet.
| chef_sleep '15' | ||
|
|
||
| wait_cluster_ready if cluster_readiness_check_on_update_enabled? | ||
| wait_cluster_ready |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why removing this?
We should keep it as it may be useful to keep a way for the user to skip the readiness check if necessary.
| node['cluster']['node_type'] == 'HeadNode' || node['cluster']['in_place_update_on_fleet_enabled'] == 'true' | ||
| end | ||
|
|
||
| def cluster_readiness_check_on_update_enabled? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as https://github.com/aws/aws-parallelcluster-cookbook/pull/3070/changes#r2641117741
I think it is useful to keep a way to skip cluster readiness check through chef attributes.
The current chef attribute name must be changed as in_place_update_on_fleet_enabled will not be valid anymore. We could expose something like cluster/update/cluster_readiness_check_enabled
| default['cluster']['nfs']['hard_mount_options'] = 'hard,_netdev,noatime' | ||
|
|
||
| # Cluster Updates | ||
| default['cluster']['in_place_update_on_fleet_enabled'] = 'true' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| default['cluster']['shared_update_path'] = "#{node['cluster']['shared_dir']}/check_update" | ||
| default['cluster']['update_checkpoint'] = "#{node['cluster']['scripts_dir']}/update_checkpoint" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to use more talkative names, such as
default['cluster']['update']['trigger_file'] = "#{node['cluster']['shared_dir']}/update_trigger"
default['cluster']['update']['checkpoint_file'] = "#{node['cluster']['scripts_dir']}/update_checkpoint"
With such names for the attributes we better convey that:
- they are files (when an attribute contains a directory path, it has
diras suffix) - one is used as a trigger
| fi' | ||
|
|
||
| [Install] | ||
| WantedBy=multi-user.target No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This service is triggered by a timer. What is the point of having WantedBy=multi-user.target?
I think it could be removed as it is actually unused.
| [Unit] | ||
| Description=Check for recent file modifications | ||
|
|
||
| [Service] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are missing logs for the logic executed by this service.
I suggest to set the property StandardOutput to log to a local log file. We must then push such file to cloudwatch.
Also, the logic should log more information:
- when it starts
- if the file exists
- what is the content of the checkpoint file
- what is the content of the trigger file
- the decision taken
- when it completes
All log lines must be timestamped with millis resolution.
|
|
||
| [Service] | ||
| Type=oneshot | ||
| TimeoutStartSec=30 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose this timeout is here to prevent the service being stuck in case of file system unresponsiveness.
This is a good strategy as it is important to put a timeout when dealing with remote resources.
However, I think TimeoutStartSec may not be the right parameter to achieve this.
According to the official documentation for TimeoutStartSec:
If a daemon service does not signal start-up completion within the configured time, the service will be considered failed and will be shut down again.
So the timeout covers the whole start logic. The start logic here includes the execution of cfn-hup-update-action.sh, which can legitimately last longer than 30 seconds. So with the current approach we have the risk of interrupting legitimate updates.
If you agree with this risk, I suggest to configure the timeout logic differently:
- define a timeout of 20 seconds to read the shared files (20s is enough to capture file system failures with a simple read operation on a single file)
- define a timeout for the update action, which must be set through the node_bootstrap_timeout parameter, as it currently is. See https://github.com/gmarciani/aws-parallelcluster-cookbook/blob/wip/mgiacomo/3141/fix-build-ubhu-1218-1/cookbooks/aws-parallelcluster-environment/templates/cfn_hup_configuration/cfn-hook-update.conf.erb#L9-L9
Description of changes
in_place_update_on_fleet_enabledfrom: 6eda378#diff-6d6c58cce2dd575c0638ee245d9647b0dfa3cbdef86a136bd816d00538529fb4Tests
References
in_place_update_truechef attribute: Remove in_place_update_true chef attribute from tests aws-parallelcluster#7156Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.