
[Bug] LoginNodes deletion triggers multiple Readiness Check failures #7172

@almightychang

Description


Required Info

  • AWS ParallelCluster version: 3.14.0
  • Cluster name: pcluster-prod
  • Region: us-east-2

Summary

When deleting LoginNodes, two cascading bugs cause the cluster update to fail:

  1. Bug 1: LoginNodes ASG lifecycle hook is never completed, leaving instances stuck in Terminating:Wait state
  2. Bug 2: After manually terminating LoginNodes, ComputeFleet nodes don't update their config version

Both bugs cause Readiness Check to fail, resulting in UPDATE_FAILED status.


Bug 1: LoginNodes Lifecycle Hook Not Completed

Bug Description

When removing LoginNodes from a cluster (setting Count: 0 or removing the LoginNodes section entirely), the cluster update fails because:

  1. ASG termination lifecycle hook keeps instances in Terminating:Wait state
  2. HeadNode readiness check expects all nodes (including terminating LoginNodes) to have the new config version
  3. The lifecycle hook is never completed (no CONTINUE/ABANDON signal sent by ParallelCluster)
  4. Readiness check fails repeatedly until CloudFormation update times out and rolls back
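
That the instances are being held by the hook can be verified directly; a minimal check, using instance IDs taken from the logs below:

# Show the ASG lifecycle state of the affected LoginNode instances;
# instances held by the termination hook report "Terminating:Wait"
aws autoscaling describe-auto-scaling-instances \
  --instance-ids i-0dc1a01d5db6e9c5f i-0c3e898294a535f4f \
  --query 'AutoScalingInstances[].{Id:InstanceId,State:LifecycleState}' \
  --region us-east-2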

Steps to Reproduce

  1. Create a cluster with LoginNodes configured:
LoginNodes:
  Pools:
    - Name: login
      Count: 2
      InstanceType: m5.xlarge
      GracetimePeriod: 120
      Networking:
        SubnetIds:
          - subnet-XXXXXXXXX
  2. Wait for the cluster to be fully operational

  3. Update the cluster to remove LoginNodes (either set Count: 0 or remove the LoginNodes section):

pcluster update-cluster --cluster-name pcluster-prod \
  --cluster-configuration config.yaml \
  --region us-east-2
  4. Observe that the update fails with UPDATE_FAILED status (see the check below)
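
The failed state can be confirmed with pcluster describe-cluster (jq is optional here, used only to trim the output):

# Check cluster and stack status after the update attempt
pcluster describe-cluster --cluster-name pcluster-prod --region us-east-2 \
  | jq '{clusterStatus, cloudFormationStackStatus}'

After the failure this matches the status shown later in this report: clusterStatus UPDATE_FAILED, cloudFormationStackStatus UPDATE_ROLLBACK_COMPLETE.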

Expected Behavior

  • LoginNodes should be gracefully terminated
  • Lifecycle hook should be completed (CONTINUE signal sent after grace period)
  • OR LoginNodes being deleted should be excluded from readiness check
  • OR DynamoDB records for terminating LoginNodes should be cleaned up before readiness check
  • Cluster update should succeed

Actual Behavior

  • LoginNode EC2 instances stuck in Terminating:Wait state until the lifecycle hook's 7200-second timeout expires
  • Readiness check fails with "wrong records" error for LoginNodes
  • CloudFormation update fails and rolls back
  • LoginNodes remain in limbo state

Evidence

ASG Lifecycle Hooks Configuration

| Hook Name | Transition | Default Result | Timeout |
| --- | --- | --- | --- |
| pcluster-prod-login-LoginNod... | EC2_INSTANCE_LAUNCHING | ABANDON | 600s |
| pcluster-prod-login-LoginNod... | EC2_INSTANCE_TERMINATING | ABANDON | 7200s |
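
The hook configuration above can be listed with describe-lifecycle-hooks, assuming the ASG naming pattern used in the workaround section below:

# List lifecycle hooks on the LoginNodes ASG; the EC2_INSTANCE_TERMINATING
# hook carries the 7200s timeout and ABANDON default result
aws autoscaling describe-lifecycle-hooks \
  --auto-scaling-group-name "pcluster-prod-login-AutoScalingGroup" \
  --region us-east-2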

chef-client.log (HeadNode)

INFO:__main__:Checking cluster readiness with arguments: cluster_name=pcluster-prod, table_name=parallelcluster-pcluster-prod, config_version=S5cS4rx2cUU9WXqE014a3vMyIrIhgZUH, region=us-east-2

INFO:__main__:Found batch of 7 cluster node(s): ['i-07ebfd7ab147b8072', 'i-0accf185802c0e3df', 'i-0dc1a01d5db6e9c5f', 'i-0306b95ecec4aca39', 'i-0c3e898294a535f4f', 'i-060a5f082a35a55da', 'i-0962fb3dbc9b6f634']

INFO:__main__:Retrieved 7 DDB item(s):
  {'Id': {'S': 'CLUSTER_CONFIG.i-0c3e898294a535f4f'}, 'Data': {'M': {'node_type': {'S': 'LoginNode'}, 'cluster_config_version': {'S': 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'}, ...}}}
  {'Id': {'S': 'CLUSTER_CONFIG.i-0dc1a01d5db6e9c5f'}, 'Data': {'M': {'node_type': {'S': 'LoginNode'}, 'cluster_config_version': {'S': 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'}, ...}}}
  ... (ComputeFleet nodes updated to new config) ...

ERROR:__main__:Some cluster readiness checks failed: Check failed due to the following erroneous records:
  * missing records (0): []
  * incomplete records (0): []
  * wrong records (4): [('i-0dc1a01d5db6e9c5f', 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'), ('i-0306b95ecec4aca39', 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'), ('i-0c3e898294a535f4f', 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'), ('i-060a5f082a35a55da', 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE')]

Note: the 4 "wrong records" are all LoginNodes still holding the old config version; the ComputeFleet nodes updated to the new config successfully.

clustermgtd.log (HeadNode)

clustermgtd only manages ComputeFleet - no LoginNode management observed:

2025-12-23 03:22:58,584 - [slurm_plugin.clustermgtd:set_config] - INFO - Applying new clustermgtd config:
  fleet_config={'compute-gpu': {...}, 'rlwrld-cpu': {...}}
  # No LoginNodes in fleet_config

DynamoDB State

LoginNodes remain in DynamoDB with old config version while in Terminating:Wait:

| Instance ID | Node Type | Config Version | Status |
| --- | --- | --- | --- |
| i-0dc1a01d5db6e9c5f | LoginNode | aZKApygBp2ZBLMeqyn_LoVputxaHnQYE (old) | DEPLOYED |
| i-0306b95ecec4aca39 | LoginNode | aZKApygBp2ZBLMeqyn_LoVputxaHnQYE (old) | DEPLOYED |
| i-0c3e898294a535f4f | LoginNode | aZKApygBp2ZBLMeqyn_LoVputxaHnQYE (old) | DEPLOYED |
| i-060a5f082a35a55da | LoginNode | aZKApygBp2ZBLMeqyn_LoVputxaHnQYE (old) | DEPLOYED |
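
Individual records can be read back with get-item, using the same key schema as the workaround commands later in this report:

# Fetch the config version recorded for one stuck LoginNode
aws dynamodb get-item \
  --table-name parallelcluster-pcluster-prod \
  --key '{"Id":{"S":"CLUSTER_CONFIG.i-0dc1a01d5db6e9c5f"}}' \
  --query 'Item.Data.M.cluster_config_version' \
  --region us-east-2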

Cluster Status After Failed Update

{
  "clusterStatus": "UPDATE_FAILED",
  "cloudFormationStackStatus": "UPDATE_ROLLBACK_COMPLETE",
  "loginNodes": [
    {
      "status": "active",
      "healthyNodes": 0,
      "unhealthyNodes": 0
    }
  ]
}

Root Cause Analysis

The issue is a coordination failure between multiple ParallelCluster components:

  1. ASG Lifecycle Hook: Configured with 7200s timeout for TERMINATING transition
  2. Readiness Check (check_cluster_ready.py): Expects ALL nodes in DynamoDB to have new config version
  3. No Lifecycle Hook Completion: ParallelCluster doesn't send CONTINUE signal to complete termination
  4. No DynamoDB Cleanup: Terminating LoginNodes' records not removed from DynamoDB

Timeline of Failure

T+0:    Update initiated, LoginNodes Count=0
T+0:    ASG sets desired capacity to 0
T+0:    Instances enter Terminating:Wait (lifecycle hook)
T+30s:  Readiness check starts
T+30s:  Finds LoginNodes with old config version → FAIL
T+45s:  Retry 1 → FAIL (LoginNodes still in Terminating:Wait)
...
T+15m:  Retry 10 → FAIL
T+15m:  Chef update fails
T+15m:  CloudFormation UPDATE_FAILED, rollback begins
T+2h:   Lifecycle hook finally times out (default: ABANDON)

Workaround

Manually complete the lifecycle action for all stuck LoginNode instances:

# Get the ASG name and lifecycle hook name
ASG_NAME="pcluster-<cluster-name>-login-AutoScalingGroup"
HOOK_NAME="pcluster-<cluster-name>-login-LoginNodesTerminatingLifecycleHook"

# Complete lifecycle action for each stuck instance
for instance_id in i-xxx1 i-xxx2 i-xxx3 i-xxx4; do
  aws autoscaling complete-lifecycle-action \
    --lifecycle-hook-name "$HOOK_NAME" \
    --auto-scaling-group-name "$ASG_NAME" \
    --lifecycle-action-result CONTINUE \
    --instance-id "$instance_id" \
    --region us-east-2
  echo "Completed: $instance_id"
done

After completing the lifecycle actions, the instances will terminate and the ASG cleanup will proceed.
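
Termination can be verified before retrying the update (same placeholder instance IDs as above):

# Confirm the instances left Terminating:Wait and are shutting down
aws ec2 describe-instances \
  --instance-ids i-xxx1 i-xxx2 i-xxx3 i-xxx4 \
  --query 'Reservations[].Instances[].[InstanceId,State.Name]' \
  --region us-east-2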

Proposed Fix

One of the following approaches:

Option 1: Complete Lifecycle Hook

When LoginNodes are being deleted, ParallelCluster should explicitly complete the termination lifecycle hook:

aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name <hook-name> \
  --auto-scaling-group-name <asg-name> \
  --lifecycle-action-result CONTINUE \
  --instance-id <instance-id>

Option 2: Exclude Terminating Nodes from Readiness Check

Modify check_cluster_ready.py to exclude nodes that are:

  • Being terminated (ASG desired count reduced)
  • Or in Terminating:Wait state
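
Illustrative only: the actual fix belongs in check_cluster_ready.py, but the exclusion logic amounts to something like this shell sketch (ASG_NAME as in the workaround above; pagination ignored for brevity):

# Hypothetical sketch: collect instance IDs the readiness check should skip
# because the ASG is already terminating them
SKIP_IDS=$(aws autoscaling describe-auto-scaling-instances --region us-east-2 \
  | jq -r --arg asg "$ASG_NAME" \
      '.AutoScalingInstances[]
       | select(.AutoScalingGroupName == $asg
                and (.LifecycleState | startswith("Terminating")))
       | .InstanceId')
echo "Excluding from readiness check: $SKIP_IDS"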

Option 3: Clean Up DynamoDB Before Readiness Check

When LoginNodes are removed, delete their DynamoDB records before the readiness check runs.
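
A sketch of that cleanup, reusing the key schema from the workaround below (hypothetical: ParallelCluster would run this as part of the update, not the user):

# Hypothetical cleanup step: drop the DDB record of a LoginNode being removed
aws dynamodb delete-item \
  --table-name parallelcluster-<cluster-name> \
  --key '{"Id":{"S":"CLUSTER_CONFIG.<instance-id>"}}' \
  --region us-east-2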

Impact

  • LoginNodes cannot be cleanly removed from a cluster
  • Cluster updates that delete LoginNodes fail
  • Recovery requires manually completing the lifecycle hooks, or waiting 2 hours for the hook timeout

Bug 2: ComputeFleet Nodes Don't Update Config Version on Unrelated Changes

Bug Description

After manually terminating the stuck LoginNodes (workaround for Bug 1), the cluster update still fails because ComputeFleet nodes don't update their config version when the change doesn't affect their queue configuration.

What Happens

  1. LoginNodes removal changes the cluster config version: aZKApy... → z8lGUg2...
  2. ComputeFleet nodes detect no change to their queue configuration
  3. ComputeFleet nodes don't update their config version in DynamoDB
  4. Readiness check expects ALL nodes to have new config version
  5. Readiness check fails with "wrong records" for ComputeFleet nodes

Evidence

chef-client.log (after LoginNodes manually terminated)

INFO:__main__:Checking cluster readiness with arguments:
  config_version=z8lGUg2HJMh50OGgXy99JI9osLbNn1iO   ◀── Expected version

INFO:__main__:Retrieved 3 DDB item(s):
  {'Id': 'CLUSTER_CONFIG.i-0962fb3dbc9b6f634',
   'node_type': 'ComputeFleet',
   'cluster_config_version': 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'}   ◀── Old version
  {'Id': 'CLUSTER_CONFIG.i-0accf185802c0e3df',
   'node_type': 'ComputeFleet',
   'cluster_config_version': 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'}   ◀── Old version
  {'Id': 'CLUSTER_CONFIG.i-07ebfd7ab147b8072',
   'node_type': 'ComputeFleet',
   'cluster_config_version': 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'}   ◀── Old version

ERROR:__main__:Some cluster readiness checks failed:
  * wrong records (3): [
    ('i-07ebfd7ab147b8072', 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'),
    ('i-0accf185802c0e3df', 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'),
    ('i-0962fb3dbc9b6f634', 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE')
  ]

Note: All 3 nodes are ComputeFleet (not LoginNodes), but they still have the old config version.

Root Cause

  1. Cluster config version changes on any configuration change (including LoginNodes removal)
  2. ComputeFleet nodes only update their config version when their queue configuration changes
  3. Readiness check expects all nodes to have the new config version
  4. Result: when an unrelated part of the config changes, ComputeFleet nodes keep their old version, so the readiness check fails

Workaround

Manually update DynamoDB records for ComputeFleet nodes:

# For each ComputeFleet node
aws dynamodb update-item --table-name parallelcluster-<cluster-name> \
  --key '{"Id":{"S":"CLUSTER_CONFIG.<instance-id>"}}' \
  --update-expression "SET #data.cluster_config_version = :v" \
  --expression-attribute-names '{"#data":"Data"}' \
  --expression-attribute-values '{":v":{"S":"<new-config-version>"}}' \
  --region us-east-2
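
Applied to this cluster, the template expands to a loop over the three ComputeFleet nodes, with the instance IDs and target version taken from the chef-client.log excerpt above:

# Stamp the new config version onto each ComputeFleet record
NEW_VERSION="z8lGUg2HJMh50OGgXy99JI9osLbNn1iO"
for instance_id in i-07ebfd7ab147b8072 i-0accf185802c0e3df i-0962fb3dbc9b6f634; do
  aws dynamodb update-item --table-name parallelcluster-pcluster-prod \
    --key "{\"Id\":{\"S\":\"CLUSTER_CONFIG.${instance_id}\"}}" \
    --update-expression "SET #data.cluster_config_version = :v" \
    --expression-attribute-names '{"#data":"Data"}' \
    --expression-attribute-values "{\":v\":{\"S\":\"${NEW_VERSION}\"}}" \
    --region us-east-2
done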

Proposed Fix

  1. ComputeFleet nodes should update config version on any cluster config version change
  2. OR Readiness check should only verify nodes whose configuration actually changed
  3. OR Use separate config versions per node type (HeadNode, ComputeFleet, LoginNodes)

Common Theme: Readiness Check Design Issues

Both bugs stem from Readiness Check design issues:

| Aspect | Bug 1 (Lifecycle Hook) | Bug 2 (Config Version) |
| --- | --- | --- |
| Trigger | LoginNodes deletion | LoginNodes deletion |
| Problem | Lifecycle hook not completed | Nodes don't update config version |
| Readiness check expects | All nodes have new config | All nodes have new config |
| Actual state | LoginNodes stuck on old config | ComputeFleet has old config |
| Result | UPDATE_FAILED | UPDATE_FAILED |

Core Issue: Readiness check requires ALL nodes to have the new config version, but doesn't account for:

  • Nodes being terminated (should be excluded)
  • Nodes whose configuration didn't actually change (shouldn't need to update)

Related

This bug was discovered while investigating a separate issue: LoginNodes NLB incorrectly created with internal scheme on public subnet (see separate bug report).
