Description
Required Info
- AWS ParallelCluster version: 3.14.0
- Cluster name: pcluster-prod
- Region: us-east-2
Summary
When deleting LoginNodes, two cascading bugs cause cluster update to fail:
- Bug 1: The LoginNodes ASG termination lifecycle hook is never completed, leaving instances stuck in the `Terminating:Wait` state
- Bug 2: After manually terminating the LoginNodes, ComputeFleet nodes don't update their config version
Both bugs cause the HeadNode readiness check to fail, resulting in `UPDATE_FAILED` status.
Bug 1: LoginNodes Lifecycle Hook Not Completed
Bug Description
When removing LoginNodes from a cluster (setting `Count: 0` or removing the `LoginNodes` section entirely), the cluster update fails because:
- The ASG termination lifecycle hook keeps instances in the `Terminating:Wait` state (this can be confirmed with the command shown after this list)
- The HeadNode readiness check expects all nodes (including the terminating LoginNodes) to have the new config version
- The lifecycle hook is never completed (no CONTINUE/ABANDON signal is sent by ParallelCluster)
- The readiness check fails repeatedly until the CloudFormation update times out and rolls back
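The stuck state can be confirmed directly from the ASG (the instance IDs below are the ones from the evidence section; substitute your own):

```bash
# Show the lifecycle state of the affected LoginNode instances;
# stuck instances report Terminating:Wait
aws autoscaling describe-auto-scaling-instances \
  --instance-ids i-0dc1a01d5db6e9c5f i-0306b95ecec4aca39 \
  --query 'AutoScalingInstances[].[InstanceId,LifecycleState]' \
  --output text \
  --region us-east-2
```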
Steps to Reproduce
1. Create a cluster with LoginNodes configured:

```yaml
LoginNodes:
  Pools:
    - Name: login
      Count: 2
      InstanceType: m5.xlarge
      GracetimePeriod: 120
      Networking:
        SubnetIds:
          - subnet-XXXXXXXXX
```

2. Wait for the cluster to be fully operational
3. Update the cluster to remove LoginNodes (either set `Count: 0` or remove the `LoginNodes` section):

```bash
pcluster update-cluster --cluster-name pcluster-prod \
  --cluster-configuration config.yaml \
  --region us-east-2
```

4. Observe that the update fails with `UPDATE_FAILED` status (a monitoring command is shown below)
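The failing status can be watched from the CLI while the update runs (a convenience check using the standard `pcluster` v3 CLI):

```bash
# The clusterStatus field moves from UPDATE_IN_PROGRESS to UPDATE_FAILED
# once the readiness check exhausts its retries
pcluster describe-cluster --cluster-name pcluster-prod --region us-east-2
```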
Expected Behavior
- LoginNodes should be gracefully terminated
- Lifecycle hook should be completed (CONTINUE signal sent after grace period)
- OR LoginNodes being deleted should be excluded from readiness check
- OR DynamoDB records for terminating LoginNodes should be cleaned up before readiness check
- Cluster update should succeed
Actual Behavior
- LoginNode EC2 instances are stuck in the `Terminating:Wait` state (up to the 7200-second hook timeout)
- The readiness check fails with a "wrong records" error for the LoginNodes
- The CloudFormation update fails and rolls back
- The LoginNodes remain in a limbo state
Evidence
ASG Lifecycle Hooks Configuration
| Hook Name | Transition | Default Result | Timeout |
|---|---|---|---|
| pcluster-prod-login-LoginNod... | EC2_INSTANCE_LAUNCHING | ABANDON | 600s |
| pcluster-prod-login-LoginNod... | EC2_INSTANCE_TERMINATING | ABANDON | 7200s |
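The hook configuration above can be dumped with the following command (the ASG name follows the naming pattern used in the workaround section; treat the exact name as an assumption for other clusters):

```bash
# Inspect the lifecycle hooks attached to the LoginNodes ASG
aws autoscaling describe-lifecycle-hooks \
  --auto-scaling-group-name "pcluster-prod-login-AutoScalingGroup" \
  --region us-east-2
```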
chef-client.log (HeadNode)
```
INFO:__main__:Checking cluster readiness with arguments: cluster_name=pcluster-prod, table_name=parallelcluster-pcluster-prod, config_version=S5cS4rx2cUU9WXqE014a3vMyIrIhgZUH, region=us-east-2
INFO:__main__:Found batch of 7 cluster node(s): ['i-07ebfd7ab147b8072', 'i-0accf185802c0e3df', 'i-0dc1a01d5db6e9c5f', 'i-0306b95ecec4aca39', 'i-0c3e898294a535f4f', 'i-060a5f082a35a55da', 'i-0962fb3dbc9b6f634']
INFO:__main__:Retrieved 7 DDB item(s):
{'Id': {'S': 'CLUSTER_CONFIG.i-0c3e898294a535f4f'}, 'Data': {'M': {'node_type': {'S': 'LoginNode'}, 'cluster_config_version': {'S': 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'}, ...}}}
{'Id': {'S': 'CLUSTER_CONFIG.i-0dc1a01d5db6e9c5f'}, 'Data': {'M': {'node_type': {'S': 'LoginNode'}, 'cluster_config_version': {'S': 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'}, ...}}}
... (ComputeFleet nodes updated to new config) ...
ERROR:__main__:Some cluster readiness checks failed: Check failed due to the following erroneous records:
* missing records (0): []
* incomplete records (0): []
* wrong records (4): [('i-0dc1a01d5db6e9c5f', 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'), ('i-0306b95ecec4aca39', 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'), ('i-0c3e898294a535f4f', 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'), ('i-060a5f082a35a55da', 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE')]
```
Note: The 4 "wrong records" are all LoginNodes with old config version. ComputeFleet nodes successfully updated to new config.
clustermgtd.log (HeadNode)
clustermgtd only manages ComputeFleet - no LoginNode management observed:
```
2025-12-23 03:22:58,584 - [slurm_plugin.clustermgtd:set_config] - INFO - Applying new clustermgtd config:
fleet_config={'compute-gpu': {...}, 'rlwrld-cpu': {...}}
# No LoginNodes in fleet_config
```
DynamoDB State
LoginNodes remain in DynamoDB with the old config version while in `Terminating:Wait`:
| Instance ID | Node Type | Config Version | Status |
|---|---|---|---|
| i-0dc1a01d5db6e9c5f | LoginNode | aZKApygBp2ZBLMeqyn_LoVputxaHnQYE (old) | DEPLOYED |
| i-0306b95ecec4aca39 | LoginNode | aZKApygBp2ZBLMeqyn_LoVputxaHnQYE (old) | DEPLOYED |
| i-0c3e898294a535f4f | LoginNode | aZKApygBp2ZBLMeqyn_LoVputxaHnQYE (old) | DEPLOYED |
| i-060a5f082a35a55da | LoginNode | aZKApygBp2ZBLMeqyn_LoVputxaHnQYE (old) | DEPLOYED |
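Individual records can be inspected directly (table name taken from the chef-client.log above):

```bash
# Fetch the DynamoDB record for one of the stuck LoginNodes
aws dynamodb get-item \
  --table-name parallelcluster-pcluster-prod \
  --key '{"Id": {"S": "CLUSTER_CONFIG.i-0dc1a01d5db6e9c5f"}}' \
  --region us-east-2
```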
Cluster Status After Failed Update
```json
{
  "clusterStatus": "UPDATE_FAILED",
  "cloudFormationStackStatus": "UPDATE_ROLLBACK_COMPLETE",
  "loginNodes": [
    {
      "status": "active",
      "healthyNodes": 0,
      "unhealthyNodes": 0
    }
  ]
}
```

Root Cause Analysis
The issue is a coordination failure between multiple ParallelCluster components:
- ASG Lifecycle Hook: configured with a 7200s timeout for the TERMINATING transition
- Readiness Check (`check_cluster_ready.py`): expects ALL nodes in DynamoDB to have the new config version
- No Lifecycle Hook Completion: ParallelCluster doesn't send a CONTINUE signal to complete the termination
- No DynamoDB Cleanup: the terminating LoginNodes' records are not removed from DynamoDB
Timeline of Failure
```
T+0:    Update initiated, LoginNodes Count=0
T+0:    ASG sets desired capacity to 0
T+0:    Instances enter Terminating:Wait (lifecycle hook)
T+30s:  Readiness check starts
T+30s:  Finds LoginNodes with old config version → FAIL
T+45s:  Retry 1 → FAIL (LoginNodes still in Terminating:Wait)
...
T+15m:  Retry 10 → FAIL
T+15m:  Chef update fails
T+15m:  CloudFormation UPDATE_FAILED, rollback begins
T+2h:   Lifecycle hook finally times out (default: ABANDON)
```
Workaround
Manually complete the lifecycle action for all stuck LoginNode instances:
```bash
# Get the ASG name and lifecycle hook name
ASG_NAME="pcluster-<cluster-name>-login-AutoScalingGroup"
HOOK_NAME="pcluster-<cluster-name>-login-LoginNodesTerminatingLifecycleHook"

# Complete the lifecycle action for each stuck instance
for instance_id in i-xxx1 i-xxx2 i-xxx3 i-xxx4; do
  aws autoscaling complete-lifecycle-action \
    --lifecycle-hook-name "$HOOK_NAME" \
    --auto-scaling-group-name "$ASG_NAME" \
    --lifecycle-action-result CONTINUE \
    --instance-id "$instance_id" \
    --region us-east-2
  echo "Completed: $instance_id"
done
```

After completing the lifecycle actions, the instances terminate and the ASG cleanup proceeds.
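To verify that the instances actually left `Terminating:Wait` (the instance IDs are the same placeholders as in the loop above):

```bash
# Confirm the instances have moved on to shutting-down/terminated
aws ec2 describe-instances \
  --instance-ids i-xxx1 i-xxx2 i-xxx3 i-xxx4 \
  --query 'Reservations[].Instances[].[InstanceId,State.Name]' \
  --output text \
  --region us-east-2
```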
Proposed Fix
One of the following approaches:
Option 1: Complete Lifecycle Hook
When LoginNodes are being deleted, ParallelCluster should explicitly complete the termination lifecycle hook:
```bash
aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name <hook-name> \
  --auto-scaling-group-name <asg-name> \
  --lifecycle-action-result CONTINUE \
  --instance-id <instance-id>
```

Option 2: Exclude Terminating Nodes from Readiness Check
Modify `check_cluster_ready.py` to exclude nodes that are:
- Being terminated (ASG desired count reduced)
- Or in the `Terminating:Wait` state (see the sketch after this list)
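A minimal sketch of the exclusion logic; the JMESPath filter below only illustrates how terminating instances can be identified, it is not the actual `check_cluster_ready.py` implementation:

```bash
# Instances the readiness check could skip: anything in a Terminating:* lifecycle state
aws autoscaling describe-auto-scaling-instances \
  --query "AutoScalingInstances[?starts_with(LifecycleState, 'Terminating')].[InstanceId,LifecycleState]" \
  --output text \
  --region us-east-2
```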
Option 3: Clean Up DynamoDB Before Readiness Check
When LoginNodes are removed, delete their DynamoDB records before the readiness check runs.
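For example (a sketch; the key schema is taken from the DynamoDB records shown in the evidence above):

```bash
# Delete the stale record for a LoginNode that is being removed
aws dynamodb delete-item \
  --table-name parallelcluster-pcluster-prod \
  --key '{"Id": {"S": "CLUSTER_CONFIG.<instance-id>"}}' \
  --region us-east-2
```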
Impact
- Cannot cleanly remove LoginNodes from a cluster
- Cluster updates involving LoginNodes deletion will fail
- Requires manual intervention to complete lifecycle hooks
- Or requires waiting 2 hours for timeout
Bug 2: ComputeFleet Nodes Don't Update Config Version on Unrelated Changes
Bug Description
After manually terminating the stuck LoginNodes (workaround for Bug 1), the cluster update still fails because ComputeFleet nodes don't update their config version when the change doesn't affect their queue configuration.
What Happens
- LoginNodes removal changes the cluster config version: `aZKApy...` → `z8lGUg...`
- ComputeFleet nodes detect no change to their queue configuration
- ComputeFleet nodes don't update their config version in DynamoDB
- The readiness check expects ALL nodes to have the new config version
- The readiness check fails with "wrong records" for the ComputeFleet nodes (the mismatch can be listed directly, as shown below)
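The per-node versions can be listed straight from the table (a sketch assuming the schema shown in the evidence; `Data` is aliased because it collides with a DynamoDB reserved word):

```bash
# List node type and config version for every node record in the cluster table
aws dynamodb scan \
  --table-name parallelcluster-pcluster-prod \
  --filter-expression 'begins_with(Id, :prefix)' \
  --expression-attribute-values '{":prefix": {"S": "CLUSTER_CONFIG."}}' \
  --projection-expression 'Id, #d.node_type, #d.cluster_config_version' \
  --expression-attribute-names '{"#d": "Data"}' \
  --region us-east-2
```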
Evidence
chef-client.log (after LoginNodes manually terminated)
```
INFO:__main__:Checking cluster readiness with arguments:
  config_version=z8lGUg2HJMh50OGgXy99JI9osLbNn1iO            ◀── Expected version
INFO:__main__:Retrieved 3 DDB item(s):
{'Id': 'CLUSTER_CONFIG.i-0962fb3dbc9b6f634',
 'node_type': 'ComputeFleet',
 'cluster_config_version': 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'}  ◀── Old version
{'Id': 'CLUSTER_CONFIG.i-0accf185802c0e3df',
 'node_type': 'ComputeFleet',
 'cluster_config_version': 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'}  ◀── Old version
{'Id': 'CLUSTER_CONFIG.i-07ebfd7ab147b8072',
 'node_type': 'ComputeFleet',
 'cluster_config_version': 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'}  ◀── Old version
ERROR:__main__:Some cluster readiness checks failed:
* wrong records (3): [
    ('i-07ebfd7ab147b8072', 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'),
    ('i-0accf185802c0e3df', 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'),
    ('i-0962fb3dbc9b6f634', 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE')
  ]
```
Note: All 3 nodes are ComputeFleet (not LoginNodes), but they still have the old config version.
Root Cause
- Cluster config version changes on any configuration change (including LoginNodes removal)
- ComputeFleet nodes only update their config version when their queue configuration changes
- Readiness check expects all nodes to have the new config version
- Result: when an unrelated part of the config changes, ComputeFleet nodes don't update their version, causing the readiness check to fail
Workaround
Manually update DynamoDB records for ComputeFleet nodes:
```bash
# For each ComputeFleet node
aws dynamodb update-item --table-name parallelcluster-<cluster-name> \
  --key '{"Id":{"S":"CLUSTER_CONFIG.<instance-id>"}}' \
  --update-expression "SET #data.cluster_config_version = :v" \
  --expression-attribute-names '{"#data":"Data"}' \
  --expression-attribute-values '{":v":{"S":"<new-config-version>"}}' \
  --region us-east-2
```

Proposed Fix
- ComputeFleet nodes should update config version on any cluster config version change
- OR Readiness check should only verify nodes whose configuration actually changed
- OR Use separate config versions per node type (HeadNode, ComputeFleet, LoginNodes)
Common Theme: Readiness Check Design Issues
Both bugs stem from Readiness Check design issues:
| Aspect | Bug 1 (Lifecycle Hook) | Bug 2 (Config Version) |
|---|---|---|
| Trigger | LoginNodes deletion | LoginNodes deletion |
| Problem | Lifecycle hook not completed | Nodes don't update config version |
| Readiness Check expects | All nodes have new config | All nodes have new config |
| Actual state | LoginNodes stuck in old config | ComputeFleet has old config |
| Result | UPDATE_FAILED | UPDATE_FAILED |
Core Issue: Readiness check requires ALL nodes to have the new config version, but doesn't account for:
- Nodes being terminated (should be excluded)
- Nodes whose configuration didn't actually change (shouldn't need to update)
Related
This bug was discovered while investigating a separate issue: LoginNodes NLB incorrectly created with internal scheme on public subnet (see separate bug report).