
[Bug] LoginNodes deletion triggers multiple Readiness Check failures #7172

@almightychang

Description


Required Info

  • AWS ParallelCluster version: 3.14.0
  • Cluster name: pcluster-prod
  • Region: us-east-2

Summary

When deleting LoginNodes, two cascading bugs cause the cluster update to fail:

  1. Bug 1: LoginNodes ASG lifecycle hook is never completed, leaving instances stuck in Terminating:Wait state
  2. Bug 2: After manually terminating LoginNodes, ComputeFleet nodes don't update their config version

Both bugs cause Readiness Check to fail, resulting in UPDATE_FAILED status.


Bug 1: LoginNodes Lifecycle Hook Not Completed

Bug Description

When removing LoginNodes from a cluster (setting Count: 0 or removing the LoginNodes section entirely), the cluster update fails because:

  1. ASG termination lifecycle hook keeps instances in Terminating:Wait state
  2. HeadNode readiness check expects all nodes (including terminating LoginNodes) to have the new config version
  3. The lifecycle hook is never completed (no CONTINUE/ABANDON signal sent by ParallelCluster)
  4. Readiness check fails repeatedly until CloudFormation update times out and rolls back
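
That the instances are being held by the hook can be verified directly; a minimal check, using instance IDs taken from the logs below:

# Show the ASG lifecycle state of the affected LoginNode instances;
# instances held by the termination hook report "Terminating:Wait"
aws autoscaling describe-auto-scaling-instances \
  --instance-ids i-0dc1a01d5db6e9c5f i-0c3e898294a535f4f \
  --query 'AutoScalingInstances[].{Id:InstanceId,State:LifecycleState}' \
  --region us-east-2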

Steps to Reproduce

  1. Create a cluster with LoginNodes configured:
LoginNodes:
  Pools:
    - Name: login
      Count: 2
      InstanceType: m5.xlarge
      GracetimePeriod: 120
      Networking:
        SubnetIds:
          - subnet-XXXXXXXXX
  2. Wait for the cluster to be fully operational

  3. Update the cluster to remove LoginNodes (either set Count: 0 or remove the LoginNodes section):

pcluster update-cluster --cluster-name pcluster-prod \
  --cluster-configuration config.yaml \
  --region us-east-2
  4. Observe that the update fails with UPDATE_FAILED status (see the check below)
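
The failed state can be confirmed with pcluster describe-cluster (jq is optional here, used only to trim the output):

# Check cluster and stack status after the update attempt
pcluster describe-cluster --cluster-name pcluster-prod --region us-east-2 \
  | jq '{clusterStatus, cloudFormationStackStatus}'

After the failure this matches the status shown later in this report: clusterStatus UPDATE_FAILED, cloudFormationStackStatus UPDATE_ROLLBACK_COMPLETE.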

Expected Behavior

  • LoginNodes should be gracefully terminated
  • Lifecycle hook should be completed (CONTINUE signal sent after grace period)
  • OR LoginNodes being deleted should be excluded from readiness check
  • OR DynamoDB records for terminating LoginNodes should be cleaned up before readiness check
  • Cluster update should succeed

Actual Behavior

  • LoginNode EC2 instances stuck in Terminating:Wait state until the lifecycle hook's 7200-second timeout expires
  • Readiness check fails with "wrong records" error for LoginNodes
  • CloudFormation update fails and rolls back
  • LoginNodes remain in limbo state

Evidence

ASG Lifecycle Hooks Configuration

| Hook Name | Transition | Default Result | Timeout |
| --- | --- | --- | --- |
| pcluster-prod-login-LoginNod... | EC2_INSTANCE_LAUNCHING | ABANDON | 600s |
| pcluster-prod-login-LoginNod... | EC2_INSTANCE_TERMINATING | ABANDON | 7200s |
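
The hook configuration above can be listed with describe-lifecycle-hooks, assuming the ASG naming pattern used in the workaround section below:

# List lifecycle hooks on the LoginNodes ASG; the EC2_INSTANCE_TERMINATING
# hook carries the 7200s timeout and ABANDON default result
aws autoscaling describe-lifecycle-hooks \
  --auto-scaling-group-name "pcluster-prod-login-AutoScalingGroup" \
  --region us-east-2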

chef-client.log (HeadNode)

INFO:__main__:Checking cluster readiness with arguments: cluster_name=pcluster-prod, table_name=parallelcluster-pcluster-prod, config_version=S5cS4rx2cUU9WXqE014a3vMyIrIhgZUH, region=us-east-2

INFO:__main__:Found batch of 7 cluster node(s): ['i-07ebfd7ab147b8072', 'i-0accf185802c0e3df', 'i-0dc1a01d5db6e9c5f', 'i-0306b95ecec4aca39', 'i-0c3e898294a535f4f', 'i-060a5f082a35a55da', 'i-0962fb3dbc9b6f634']

INFO:__main__:Retrieved 7 DDB item(s):
  {'Id': {'S': 'CLUSTER_CONFIG.i-0c3e898294a535f4f'}, 'Data': {'M': {'node_type': {'S': 'LoginNode'}, 'cluster_config_version': {'S': 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'}, ...}}}
  {'Id': {'S': 'CLUSTER_CONFIG.i-0dc1a01d5db6e9c5f'}, 'Data': {'M': {'node_type': {'S': 'LoginNode'}, 'cluster_config_version': {'S': 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'}, ...}}}
  ... (ComputeFleet nodes updated to new config) ...

ERROR:__main__:Some cluster readiness checks failed: Check failed due to the following erroneous records:
  * missing records (0): []
  * incomplete records (0): []
  * wrong records (4): [('i-0dc1a01d5db6e9c5f', 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'), ('i-0306b95ecec4aca39', 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'), ('i-0c3e898294a535f4f', 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'), ('i-060a5f082a35a55da', 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE')]

Note: the 4 "wrong records" are all LoginNodes still holding the old config version; the ComputeFleet nodes updated to the new config successfully.

clustermgtd.log (HeadNode)

clustermgtd only manages ComputeFleet - no LoginNode management observed:

2025-12-23 03:22:58,584 - [slurm_plugin.clustermgtd:set_config] - INFO - Applying new clustermgtd config:
  fleet_config={'compute-gpu': {...}, 'rlwrld-cpu': {...}}
  # No LoginNodes in fleet_config

DynamoDB State

LoginNodes remain in DynamoDB with old config version while in Terminating:Wait:

| Instance ID | Node Type | Config Version | Status |
| --- | --- | --- | --- |
| i-0dc1a01d5db6e9c5f | LoginNode | aZKApygBp2ZBLMeqyn_LoVputxaHnQYE (old) | DEPLOYED |
| i-0306b95ecec4aca39 | LoginNode | aZKApygBp2ZBLMeqyn_LoVputxaHnQYE (old) | DEPLOYED |
| i-0c3e898294a535f4f | LoginNode | aZKApygBp2ZBLMeqyn_LoVputxaHnQYE (old) | DEPLOYED |
| i-060a5f082a35a55da | LoginNode | aZKApygBp2ZBLMeqyn_LoVputxaHnQYE (old) | DEPLOYED |
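
Individual records can be read back with get-item, using the same key schema as the workaround commands later in this report:

# Fetch the config version recorded for one stuck LoginNode
aws dynamodb get-item \
  --table-name parallelcluster-pcluster-prod \
  --key '{"Id":{"S":"CLUSTER_CONFIG.i-0dc1a01d5db6e9c5f"}}' \
  --query 'Item.Data.M.cluster_config_version' \
  --region us-east-2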

Cluster Status After Failed Update

{
  "clusterStatus": "UPDATE_FAILED",
  "cloudFormationStackStatus": "UPDATE_ROLLBACK_COMPLETE",
  "loginNodes": [
    {
      "status": "active",
      "healthyNodes": 0,
      "unhealthyNodes": 0
    }
  ]
}

Root Cause Analysis

The issue is a coordination failure between multiple ParallelCluster components:

  1. ASG Lifecycle Hook: Configured with 7200s timeout for TERMINATING transition
  2. Readiness Check (check_cluster_ready.py): Expects ALL nodes in DynamoDB to have new config version
  3. No Lifecycle Hook Completion: ParallelCluster doesn't send CONTINUE signal to complete termination
  4. No DynamoDB Cleanup: Terminating LoginNodes' records not removed from DynamoDB

Timeline of Failure

T+0:    Update initiated, LoginNodes Count=0
T+0:    ASG sets desired capacity to 0
T+0:    Instances enter Terminating:Wait (lifecycle hook)
T+30s:  Readiness check starts
T+30s:  Finds LoginNodes with old config version → FAIL
T+45s:  Retry 1 → FAIL (LoginNodes still in Terminating:Wait)
...
T+15m:  Retry 10 → FAIL
T+15m:  Chef update fails
T+15m:  CloudFormation UPDATE_FAILED, rollback begins
T+2h:   Lifecycle hook finally times out (default: ABANDON)

Workaround

Manually complete the lifecycle action for all stuck LoginNode instances:

# Get the ASG name and lifecycle hook name
ASG_NAME="pcluster-<cluster-name>-login-AutoScalingGroup"
HOOK_NAME="pcluster-<cluster-name>-login-LoginNodesTerminatingLifecycleHook"

# Complete lifecycle action for each stuck instance
for instance_id in i-xxx1 i-xxx2 i-xxx3 i-xxx4; do
  aws autoscaling complete-lifecycle-action \
    --lifecycle-hook-name "$HOOK_NAME" \
    --auto-scaling-group-name "$ASG_NAME" \
    --lifecycle-action-result CONTINUE \
    --instance-id "$instance_id" \
    --region us-east-2
  echo "Completed: $instance_id"
done

After completing the lifecycle actions, the instances will terminate and the ASG cleanup will proceed.
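
Termination can be verified before retrying the update (same placeholder instance IDs as above):

# Confirm the instances left Terminating:Wait and are shutting down
aws ec2 describe-instances \
  --instance-ids i-xxx1 i-xxx2 i-xxx3 i-xxx4 \
  --query 'Reservations[].Instances[].[InstanceId,State.Name]' \
  --region us-east-2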

Proposed Fix

One of the following approaches:

Option 1: Complete Lifecycle Hook

When LoginNodes are being deleted, ParallelCluster should explicitly complete the termination lifecycle hook:

aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name <hook-name> \
  --auto-scaling-group-name <asg-name> \
  --lifecycle-action-result CONTINUE \
  --instance-id <instance-id>

Option 2: Exclude Terminating Nodes from Readiness Check

Modify check_cluster_ready.py to exclude nodes that are:

  • Being terminated (ASG desired count reduced)
  • Or in Terminating:Wait state
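
Illustrative only: the actual fix belongs in check_cluster_ready.py, but the exclusion logic amounts to something like this shell sketch (ASG_NAME as in the workaround above; pagination ignored for brevity):

# Hypothetical sketch: collect instance IDs the readiness check should skip
# because the ASG is already terminating them
SKIP_IDS=$(aws autoscaling describe-auto-scaling-instances --region us-east-2 \
  | jq -r --arg asg "$ASG_NAME" \
      '.AutoScalingInstances[]
       | select(.AutoScalingGroupName == $asg
                and (.LifecycleState | startswith("Terminating")))
       | .InstanceId')
echo "Excluding from readiness check: $SKIP_IDS"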

Option 3: Clean Up DynamoDB Before Readiness Check

When LoginNodes are removed, delete their DynamoDB records before the readiness check runs.
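
A sketch of that cleanup, reusing the key schema from the workaround below (hypothetical: ParallelCluster would run this as part of the update, not the user):

# Hypothetical cleanup step: drop the DDB record of a LoginNode being removed
aws dynamodb delete-item \
  --table-name parallelcluster-<cluster-name> \
  --key '{"Id":{"S":"CLUSTER_CONFIG.<instance-id>"}}' \
  --region us-east-2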

Impact

  • LoginNodes cannot be cleanly removed from a cluster
  • Cluster updates that delete LoginNodes fail
  • Recovery requires manually completing the lifecycle hooks, or waiting 2 hours for the hook timeout

Bug 2: ComputeFleet Nodes Don't Update Config Version on Unrelated Changes

Bug Description

After manually terminating the stuck LoginNodes (workaround for Bug 1), the cluster update still fails because ComputeFleet nodes don't update their config version when the change doesn't affect their queue configuration.

What Happens

  1. LoginNodes removal changes the cluster config version: aZKApy... → z8lGUg2...
  2. ComputeFleet nodes detect no change to their queue configuration
  3. ComputeFleet nodes don't update their config version in DynamoDB
  4. Readiness check expects ALL nodes to have new config version
  5. Readiness check fails with "wrong records" for ComputeFleet nodes

Evidence

chef-client.log (after LoginNodes manually terminated)

INFO:__main__:Checking cluster readiness with arguments:
  config_version=z8lGUg2HJMh50OGgXy99JI9osLbNn1iO   ◀── Expected version

INFO:__main__:Retrieved 3 DDB item(s):
  {'Id': 'CLUSTER_CONFIG.i-0962fb3dbc9b6f634',
   'node_type': 'ComputeFleet',
   'cluster_config_version': 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'}   ◀── Old version
  {'Id': 'CLUSTER_CONFIG.i-0accf185802c0e3df',
   'node_type': 'ComputeFleet',
   'cluster_config_version': 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'}   ◀── Old version
  {'Id': 'CLUSTER_CONFIG.i-07ebfd7ab147b8072',
   'node_type': 'ComputeFleet',
   'cluster_config_version': 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'}   ◀── Old version

ERROR:__main__:Some cluster readiness checks failed:
  * wrong records (3): [
    ('i-07ebfd7ab147b8072', 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'),
    ('i-0accf185802c0e3df', 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE'),
    ('i-0962fb3dbc9b6f634', 'aZKApygBp2ZBLMeqyn_LoVputxaHnQYE')
  ]

Note: All 3 nodes are ComputeFleet (not LoginNodes), but they still have the old config version.

Root Cause

  1. Cluster config version changes on any configuration change (including LoginNodes removal)
  2. ComputeFleet nodes only update their config version when their queue configuration changes
  3. Readiness check expects all nodes to have the new config version
  4. Result: when an unrelated part of the config changes, ComputeFleet nodes keep their old version, so the readiness check fails

Workaround

Manually update DynamoDB records for ComputeFleet nodes:

# For each ComputeFleet node
aws dynamodb update-item --table-name parallelcluster-<cluster-name> \
  --key '{"Id":{"S":"CLUSTER_CONFIG.<instance-id>"}}' \
  --update-expression "SET #data.cluster_config_version = :v" \
  --expression-attribute-names '{"#data":"Data"}' \
  --expression-attribute-values '{":v":{"S":"<new-config-version>"}}' \
  --region us-east-2
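
Applied to this cluster, the template expands to a loop over the three ComputeFleet nodes, with the instance IDs and target version taken from the chef-client.log excerpt above:

# Stamp the new config version onto each ComputeFleet record
NEW_VERSION="z8lGUg2HJMh50OGgXy99JI9osLbNn1iO"
for instance_id in i-07ebfd7ab147b8072 i-0accf185802c0e3df i-0962fb3dbc9b6f634; do
  aws dynamodb update-item --table-name parallelcluster-pcluster-prod \
    --key "{\"Id\":{\"S\":\"CLUSTER_CONFIG.${instance_id}\"}}" \
    --update-expression "SET #data.cluster_config_version = :v" \
    --expression-attribute-names '{"#data":"Data"}' \
    --expression-attribute-values "{\":v\":{\"S\":\"${NEW_VERSION}\"}}" \
    --region us-east-2
done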

Proposed Fix

  1. ComputeFleet nodes should update config version on any cluster config version change
  2. OR Readiness check should only verify nodes whose configuration actually changed
  3. OR Use separate config versions per node type (HeadNode, ComputeFleet, LoginNodes)

Common Theme: Readiness Check Design Issues

Both bugs stem from Readiness Check design issues:

| Aspect | Bug 1 (Lifecycle Hook) | Bug 2 (Config Version) |
| --- | --- | --- |
| Trigger | LoginNodes deletion | LoginNodes deletion |
| Problem | Lifecycle hook not completed | Nodes don't update config version |
| Readiness check expects | All nodes have new config | All nodes have new config |
| Actual state | LoginNodes stuck on old config | ComputeFleet has old config |
| Result | UPDATE_FAILED | UPDATE_FAILED |

Core Issue: Readiness check requires ALL nodes to have the new config version, but doesn't account for:

  • Nodes being terminated (should be excluded)
  • Nodes whose configuration didn't actually change (shouldn't need to update)

Related

This bug was discovered while investigating a separate issue: LoginNodes NLB incorrectly created with internal scheme on public subnet (see separate bug report).
