
Add pre/post lease hooks #140

Open
kirkbrauer wants to merge 76 commits into main from add-hooks

Conversation


@kirkbrauer kirkbrauer commented Jan 24, 2026

Summary

Implement hook scripts, configurable in the exporter config, that run j commands within a temporary lease environment. This PR combines changes across Python, Controller, and Protocol components.

Environment Variables

The following environment variables are exposed to hook scripts:

  • JUMPSTARTER_HOST - Path to the Unix socket for j CLI access
  • JMP_DRIVERS_ALLOW=UNSAFE - Allows all drivers for local access
  • LEASE_NAME - The name of the current lease
  • CLIENT_NAME - Name of the client that acquired the lease
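
For illustration, a hook script could consume these variables directly. The following sketch is hypothetical; only the variable names above come from this PR:

#!/usr/bin/env python3
"""Hypothetical beforeLease hook script."""
import os
import subprocess

lease_name = os.environ["LEASE_NAME"]
client_name = os.environ["CLIENT_NAME"]
print(f"Setting up lease {lease_name} for client {client_name}")

# JUMPSTARTER_HOST and JMP_DRIVERS_ALLOW are already exported, so the
# j CLI can reach the exporter's drivers over the local Unix socket.
subprocess.run(["j", "power", "on"], check=True)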

Example Configuration

apiVersion: jumpstarter.dev/v1alpha1
kind: ExporterConfig
metadata:
  namespace: default
  name: test-exporter
endpoint: grpc.jumpstarter.100.123.60.47.nip.io:8082
hooks:
  beforeLease:
    script: |
      echo "Starting lease setup for ${CLIENT_NAME}"
      j power on
      echo "Pre-lease setup completed!"
    timeout: 120
    onFailure: exit
  afterLease:
    script: |
      echo "Starting lease cleanup for ${CLIENT_NAME}"
      j power off
      echo "Post-lease cleanup completed!"
    timeout: 60
    onFailure: warn
tls:
  ca: ''
  insecure: true
token: abc123
export:
  power:
    type: "jumpstarter_driver_power.driver.MockPower"

Changes

Python (python/)

  • Hook executor with before/after lease lifecycle support
  • Configurable timeouts and failure modes (exit, warn)
  • Exporter status reporting and streaming
  • EndSession support for graceful session termination
  • Typed Protobuf/gRPC stubs generation
  • Status field exposed in CLI (jmp admin get exporter)

Controller (controller/)

  • Full ExporterStatus field support with state transitions
  • Status updates reflected from exporters in real-time
  • EndSession API implementation
  • Exporter-initiated lease release support
  • CRDs managed via Helm chart templates

Protocol (protocol/)

  • EndSession API for keeping connections open after lease ends
  • release_lease field for exporter-initiated lease release
  • status_version and previous_status fields for transition tracking

Related PRs

Summary by CodeRabbit

  • New Features

    • Rich exporter status (state, message, version); ReportStatus may request lease release; EndSession RPC; background status monitor; configurable before/after-lease hooks with timeouts, on-failure policies and live hook logs.
  • Bug Fixes

    • Improved online/offline detection, stricter driver-call readiness checks, connection retry/backoff, and safer lease-release sequencing.
  • Chores

    • CRDs and installer regenerated; printer columns for Status/Message added; Helm deploys skip CRD install; ArgoCD sync-wave annotation injected.
  • Tests

    • Expanded unit, integration and e2e coverage for hooks, status monitor, controller flows and logging.
  • Documentation

    • New Hooks docs, glossary and updated configuration/getting-started guides.


coderabbitai bot commented Jan 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds exporter lifecycle hooks (beforeLease/afterLease), exporter status fields/enums and status_message, new ReportStatus release_lease flag and EndSession RPC, StatusMonitor and client status APIs, HookExecutor and LeaseContext, CRD/schema and printer-column updates, protobuf/grpc regeneration, CLI/status rendering, extensive tests and e2e hook configs, and small Helm/manifest build tweaks.

Changes

Cohort / File(s) / Summary

• Protocol & Generated Stubs
  protocol/proto/jumpstarter/v1/jumpstarter.proto, python/packages/jumpstarter-protocol/.../jumpstarter_pb2*.py, .../*.pyi, ..._grpc.py, ..._grpc.pyi
  Add ReportStatus.release_lease, EndSession RPC/messages, GetStatus fields (status_version, previous_status), LogStream.source; regenerate protobuf/grpc runtime and typing stubs.

• CRDs & Manifests
  controller/deploy/helm/jumpstarter/crds/*, controller/deploy/helm/jumpstarter/charts/.../templates/crds/*, controller/deploy/operator/config/crd/bases/*, controller/deploy/operator/bundle/manifests/*
  Add CRDs (Exporter, Lease, Client, ExporterAccessPolicy); add exporterStatus and statusMessage to Exporter schema and printer columns; copy CRDs into chart templates and bundle.

• Controller Makefile / Helm
  controller/Makefile, controller/hack/deploy_with_helm.sh
  Inject ArgoCD sync-wave annotation into generated role YAML, copy CRDs into chart templates, regenerate operator installer, and use Helm --skip-crds.

• Controller API & Service
  controller/api/v1alpha1/exporter_types.go, controller/api/v1alpha1/exporter_helpers.go, controller/internal/controller/exporter_controller.go, controller/internal/service/controller_service.go
  Add ExporterStatus fields/constants and printer columns; map CRD↔proto status; adjust online/offline reconcile logic; add ControllerService.ReportStatus handler and lease-release helpers.

• Exporter runtime & Hooks
  python/packages/jumpstarter/jumpstarter/exporter/... (hooks.py, lease_context.py, session.py, logging*, exporter.py)
  Implement HookExecutor (PTY subprocess execution, timeouts, onFailure modes), LeaseContext, multi-socket/session and EndSession handling, per-logger LogSource routing, and exporter lifecycle orchestration.

• Status Monitor & Client
  python/packages/jumpstarter/jumpstarter/client/status_monitor.py, .../core.py, .../grpc.py, .../lease.py
  Add StatusMonitor (background GetStatus polling, waiters, callbacks); extend AsyncDriverClient with status APIs (get/check/wait/end_session), enforce status preconditions for driver calls, and add dial_timeout retry behavior.

• Common types & enums
  python/packages/jumpstarter/jumpstarter/common/enums.py, .../types.py, .../__init__.py
  Add ExporterStatus and LogSource enums with proto conversions, gRPC stub type aliases, and export them from common.

• CLI, K8s models & Rendering
  python/packages/jumpstarter-cli/..., python/packages/jumpstarter-kubernetes/..., python/packages/jumpstarter-cli-admin/*
  Plumb include_status/show_status through CLI and models, add STATUS column to rich tables, update serializers and tests to include exporterStatus/statusMessage.

• Hooks Tests & E2E
  python/packages/.../hooks_test.py, e2e/exporters/*, e2e/tests-hooks.bats
  Add unit tests for HookExecutor, StatusMonitor, LeaseContext, session behavior, controller integration tests, many e2e configs and BATS tests exercising hook scenarios and failure modes.

• Build tooling & Misc
  python/buf.gen.yaml, controller/.gitignore, e2e/run-e2e.sh
  Add mypy-protobuf plugins, ignore operator bin dir, update e2e runner to discover multiple .bats tests, and manifest generation tweaks (sync-wave injection, CRD copy, skip-crds).

Sequence Diagram(s)

sequenceDiagram
    rect rgba(200,230,255,0.5)
    participant Client
    participant Exporter
    participant HookExec as HookExecutor
    participant Controller
    end
    Client->>Exporter: Request lease / Dial
    Exporter->>Controller: ReportStatus(status=BEFORE_LEASE_HOOK)
    Exporter->>HookExec: execute_before_lease_hook(LeaseContext)
    HookExec-->>Exporter: result (success / failure)
    alt before-hook success
        Exporter->>Controller: ReportStatus(status=LEASE_READY)
        Client->>Exporter: Driver calls allowed
    else onFailure=endLease
        Exporter->>Controller: ReportStatus(..., release_lease=true)
        Controller->>Controller: mark lease.release = true
    end
    Client->>Exporter: End session
    Exporter->>HookExec: execute_after_lease_hook(...)
    HookExec-->>Exporter: after-hook done
    Exporter->>Controller: ReportStatus(status=OFFLINE)
sequenceDiagram
    rect rgba(230,255,200,0.5)
    participant AsyncClient
    participant StatusMon as StatusMonitor
    participant ExporterSvc as ExporterService
    end
    AsyncClient->>StatusMon: start()
    loop polling
      StatusMon->>ExporterSvc: GetStatus()
      ExporterSvc-->>StatusMon: status (+version, previous_status)
      StatusMon-->>AsyncClient: notify on change / wake waiters
    end
    AsyncClient->>StatusMon: wait_for_status(LEASE_READY)
    StatusMon-->>AsyncClient: signal when LEASE_READY observed

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

  • [Proposal] Enhanced Lease Statuses for Hooks #104 — Implements exporter status enums, statusMessage, LogSource and hook-related status/log streaming; aligns with this PR’s CRD/proto and runtime changes.
  • jumpstarter-dev/jumpstarter#522 — Requested exporter lifecycle hooks; this PR adds HookConfig/HookInstanceConfig, HookExecutor and hook wiring.
  • jumpstarter-dev/jumpstarter-protocol#25 — Protocol-level request for ExporterStatus/LogSource and GetStatus enhancements; matches proto/stub updates here.

Possibly related PRs

  • jumpstarter-dev/jumpstarter#297 — Overlaps logging surface (LogHandler/LogStream) and log-source routing added here.
  • jumpstarter-dev/jumpstarter#451 — Related CLI logging / RichHandler adjustments present in this PR.
  • jumpstarter-dev/jumpstarter#589 — Related to process/signal and exit-code semantics adjusted in CLI run logic present here.

Suggested labels

backport release-0.7

Suggested reviewers

  • kirkbrauer

Poem

🐰 I dug a hook and tunneled a log,

Before the lease I danced with a cog.
Status bells chimed and monitors peeped,
Hooks ran, leases cleared, and none overslept.
Carrots for reviewers — hop on and leap!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 42.59%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: The title 'Add pre/post lease hooks' is concise and directly describes the main feature addition, which aligns with the extensive changes across controller, Python, protocol, and end-to-end test components.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
python/packages/jumpstarter/jumpstarter/config/client.py (1)

184-196: Add status field to exporter reconstruction to preserve it when include_leases=True.

The exporter objects lose their status field when leases are included. The original exporters have a status field that should be preserved during reconstruction, especially when include_status=True is requested. Update lines 187-193 to include status=exporter.status.

Code snippet
exporter_with_lease = Exporter(
    namespace=exporter.namespace,
    name=exporter.name,
    labels=exporter.labels,
    online=exporter.online,
    status=exporter.status,  # Add this line
    lease=lease,
)
python/packages/jumpstarter/jumpstarter/config/exporter.py (1)

224-246: Remove unused device_factory parameter from HookExecutor.

The device_factory parameter passed to HookExecutor at line 231 is never used. HookExecutor only executes shell commands via subprocess and accesses the session for logging; it has no need for driver instantiation. Remove the unused parameter from both the HookExecutor dataclass definition and the instantiation call.

🤖 Fix all issues with AI agents
In
`@controller/deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/jumpstarter.dev_clients.yaml`:
- Around line 20-47: Update the CRD description strings so they reference
"Client" consistently instead of "identities" or "Identity": change the top
description "Client is the Schema for the identities API" to a concise "Client
is the Schema for the Client API" (or simply "Client"), and update the
spec.description "ClientSpec defines the desired state of Identity" and
status.description "ClientStatus defines the observed state of Identity" to use
"Client" (e.g., "ClientSpec defines the desired state of Client" and
"ClientStatus defines the observed state of Client") so all descriptive fields
in the CRD (the description at the top and the
spec.description/status.description entries) no longer reference
Identity/identities.

In
`@controller/deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/jumpstarter.dev_leases.yaml`:
- Around line 30-31: Update the CRD description for the Lease resource: replace
the current "Lease is the Schema for the exporters API" text in the description
property of the Lease CRD with the corrected wording using "leases" (e.g.,
"Leases is the Schema for the exporters API" or "leases is the Schema for the
exporters API") so the description uses the plural form; locate the description
property under the Lease CRD (kind: Lease) in the CRD template and change that
string accordingly.

In `@python/packages/jumpstarter-cli-common/jumpstarter_cli_common/opt.py`:
- Line 19: basicConfig assignment line exceeds max line length; split the call
across multiple lines to satisfy Ruff E501. Locate the basicConfig variable
initialization (basicConfig = partial(logging.basicConfig, format="%(message)s
[%(name)s]", handlers=[RichHandler(show_path=False)]) ) and reformat the
arguments onto separate lines (e.g., place format=... and handlers=[...] each on
their own lines and close parens on its own line) so the call to
partial(logging.basicConfig, ...) fits within line-length limits while
preserving the same arguments and use of RichHandler(show_path=False).
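
A sketch of the reformatted call, keeping the same arguments quoted above:

from functools import partial
import logging

from rich.logging import RichHandler

# Same call as before, split across lines to satisfy the E501 limit.
basicConfig = partial(
    logging.basicConfig,
    format="%(message)s [%(name)s]",
    handlers=[RichHandler(show_path=False)],
)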

In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py`:
- Around line 80-89: Remove the unused synchronous helper _run_shell_with_lease:
delete the entire function definition (including its docstring and with-blocks)
since the async variant _run_shell_with_lease_async is now the only one used;
then search the repo for any remaining references to _run_shell_with_lease and
remove or update them, run the test suite/lint to confirm no import or name
errors, and ensure behavior relies solely on _run_shell_with_lease_async.

In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 395-401: The async context manager on StatusMonitor is incomplete:
__aenter__ currently just returns self and never starts the background polling;
update __aenter__ to initialize the task group and call self.start() (mirroring
client.status_monitor_async() behavior) so entering the context begins
monitoring, and keep __aexit__ awaiting self.stop(); alternatively remove both
methods if context-manager usage is unsupported—modify the methods on the
StatusMonitor class (__aenter__, __aexit__) to either perform full start/stop
lifecycle (create task group + await self.start() in __aenter__, await
self.stop() in __aexit__) or delete them to avoid silent failures.
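
A minimal sketch of the first option, assuming an anyio task group and the existing start()/stop() coroutines (names outside the comment above are placeholders):

import anyio

class StatusMonitor:
    async def start(self) -> None:
        ...  # existing method: begins background GetStatus polling

    async def stop(self) -> None:
        ...  # existing method: stops polling

    async def __aenter__(self) -> "StatusMonitor":
        # Enter a task group and start polling, mirroring
        # client.status_monitor_async(), instead of returning self inert.
        self._tg = anyio.create_task_group()
        await self._tg.__aenter__()
        await self.start()
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.stop()
        return await self._tg.__aexit__(exc_type, exc, tb)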

In `@python/packages/jumpstarter/jumpstarter/common/enums.py`:
- Around line 26-27: The docstring for the enum value AFTER_LEASE_HOOK (mapped
to common_pb2.EXPORTER_STATUS_AFTER_LEASE_HOOK) contains a typo; update the
string from "Lease was releaseed, but exporter is executing after lease hook" to
"Lease was released, but exporter is executing after lease hook" so the
documentation reads correctly.

In `@python/packages/jumpstarter/jumpstarter/exporter/exporter.py`:
- Around line 541-613: The Listen retry task started with
tg.start_soon(self._retry_stream, "Listen",
self._listen_stream_factory(lease_name), listen_tx) is attached to the outer
task group so it keeps running after the lease ends; move that start call into
the lease-scoped task group (use conn_tg.start_soon(...) before starting
wait_for_lease_end/process_connections) so the Listen task is cancelled with the
lease, and in the enclosing finally block ensure the send stream is closed
(await listen_tx.aclose() or the appropriate close method) to signal listen_rx
termination; reference symbols: _retry_stream,
_listen_stream_factory(lease_name), listen_tx/listen_rx, tg -> conn_tg, and
conn_tg.start_soon.
- Around line 202-321: Change _get_controller_stub to be an async context
manager (e.g., async def _controller_stub(self):) that creates the channel via
self.channel_factory(), yields a
jumpstarter_pb2_grpc.ControllerServiceStub(channel) and always closes the
underlying channel with await channel.close() in a finally block; then update
all callers to use "async with self._controller_stub() as controller:" instead
of awaiting _get_controller_stub(), specifically in _retry_stream (where
controller is acquired for stream_factory), _register_with_controller (Register
call), and _report_status (ReportStatus call) so channels are closed after use
and no resources leak.
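
A sketch of the suggested context manager, using only the symbols named above (channel_factory and ControllerServiceStub are assumed to exist as described):

from contextlib import asynccontextmanager

from jumpstarter_protocol import jumpstarter_pb2_grpc

class Exporter:  # illustrative fragment, not the full class
    @asynccontextmanager
    async def _controller_stub(self):
        # One channel per use; always closed, even on error or cancellation.
        channel = self.channel_factory()
        try:
            yield jumpstarter_pb2_grpc.ControllerServiceStub(channel)
        finally:
            await channel.close()

Callers then become: async with self._controller_stub() as controller: await controller.ReportStatus(...).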

In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 299-307: wait_for_process currently uses
anyio.to_thread.run_sync(process.wait, abandon_on_cancel=True) which can leave
the child as a zombie if the wait is abandoned; modify wait_for_process to
ensure the subprocess is reaped on cancellation by adding a try/finally: in the
try await the run_sync as before, but in the finally check process.poll() and if
it is None try to terminate/kill (process.terminate()/process.kill()) then call
a non-abandoning run_sync(process.wait) or loop-poll with process.poll() until
exit to fully reap the child; reference the wait_for_process coroutine and the
process variable when making the changes.
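
One possible shape for the reaping logic (a sketch; the shielded cancel scope is an assumption so the final wait can run during cancellation):

import anyio

async def wait_for_process(process) -> None:
    """Wait for a subprocess, reaping it even if the threaded wait is abandoned."""
    try:
        await anyio.to_thread.run_sync(process.wait, abandon_on_cancel=True)
    finally:
        if process.poll() is None:
            # The wait was abandoned and the child is still running:
            # terminate it and do a non-abandoning wait to avoid a zombie.
            process.terminate()
            with anyio.CancelScope(shield=True):
                with anyio.move_on_after(5):
                    await anyio.to_thread.run_sync(process.wait)
                if process.poll() is None:
                    process.kill()
                    await anyio.to_thread.run_sync(process.wait)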
- Around line 257-260: The debug call inside read_pty_output exceeds the
120-char line limit; shorten the line by splitting the log message across
multiple strings or moving the format string to a short variable and then
calling logger.debug. For example, break the message into two adjacent string
literals or assign msg = "read_pty_output: heartbeat at %.1fs, iterations=%d"
and call logger.debug(msg, elapsed, read_count) so you keep the same
logger.debug call and the same format arguments (logger, elapsed, read_count)
but comply with the E501 line-length rule.
- Around line 186-189: The test mock for logging_session must support the
context manager used by logging_session.context_log_source in hooks.py; update
the test fixture (e.g., mock_session) so mock_session.context_log_source returns
an object that implements __enter__ and __exit__ (for async usage you can set
mock_session.context_log_source.return_value.__enter__ and __exit__ to AsyncMock
or wrap with contextlib.nullcontext), ensuring the context manager protocol is
honored when hooks.py calls logging_session.context_log_source.
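
For example, a sketch of the fixture change using contextlib.nullcontext:

from contextlib import nullcontext
from unittest.mock import MagicMock

mock_session = MagicMock()
# context_log_source is entered as a context manager in hooks.py, so the
# mock must honor the __enter__/__exit__ protocol.
mock_session.context_log_source.return_value = nullcontext()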
🧹 Nitpick comments (10)
controller/deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/jumpstarter.dev_exporteraccesspolicies.yaml (2)

149-152: Add schema validation for duration and priority.

Without constraints, malformed durations or negative priorities can be accepted and only fail at runtime. Consider adding format/bounds in the CRD.

♻️ Suggested schema tightening
                   maximumDuration:
                     type: string
+                    format: duration
                   priority:
                     type: integer
+                    minimum: 0

40-157: Require spec/policies to avoid empty CRs.

Right now an empty resource (missing spec or policies) is valid. If that’s not intended, enforce it in the schema to reduce no-op objects.

♻️ Suggested required fields
           spec:
             description: ExporterAccessPolicySpec defines the desired state of ExporterAccessPolicy.
             properties:
               exporterSelector:
                 description: |-
                   A label selector is a label query over a set of resources. The result of matchLabels and
                   matchExpressions are ANDed. An empty label selector matches all objects. A null
                   label selector matches no objects.
                 properties:
                   matchExpressions:
                     description: matchExpressions is a list of label selector requirements.
                       The requirements are ANDed.
                     items:
                       description: |-
                         A label selector requirement is a selector that contains values, a key, and an operator that
                         relates the key and values.
                       properties:
                         key:
                           description: key is the label key that the selector applies
                             to.
                           type: string
                         operator:
                           description: |-
                             operator represents a key's relationship to a set of values.
                             Valid operators are In, NotIn, Exists and DoesNotExist.
                           type: string
                         values:
                           description: |-
                             values is an array of string values. If the operator is In or NotIn,
                             the values array must be non-empty. If the operator is Exists or DoesNotExist,
                             the values array must be empty. This array is replaced during a strategic
                             merge patch.
                           items:
                             type: string
                           type: array
                           x-kubernetes-list-type: atomic
                       required:
                       - key
                       - operator
                       type: object
                     type: array
                     x-kubernetes-list-type: atomic
                   matchLabels:
                     additionalProperties:
                       type: string
                     description: |-
                       matchLabels is a map of {key,value} pairs. A single {key,value} in the matchLabels
                       map is equivalent to an element of matchExpressions, whose key field is "key", the
                       operator is "In", and the values array contains only "value". The requirements are ANDed.
                     type: object
                 type: object
                 x-kubernetes-map-type: atomic
               policies:
                 items:
                   properties:
                     from:
                       items:
                         properties:
                           clientSelector:
                             description: |-
                               A label selector is a label query over a set of resources. The result of matchLabels and
                               matchExpressions are ANDed. An empty label selector matches all objects. A null
                               label selector matches no objects.
                             properties:
                               matchExpressions:
                                 description: matchExpressions is a list of label selector
                                   requirements. The requirements are ANDed.
                                 items:
                                   description: |-
                                     A label selector requirement is a selector that contains values, a key, and an operator that
                                     relates the key and values.
                                   properties:
                                     key:
                                       description: key is the label key that the selector
                                         applies to.
                                       type: string
                                     operator:
                                       description: |-
                                         operator represents a key's relationship to a set of values.
                                         Valid operators are In, NotIn, Exists and DoesNotExist.
                                       type: string
                                     values:
                                       description: |-
                                         values is an array of string values. If the operator is In or NotIn,
                                         the values array must be non-empty. If the operator is Exists or DoesNotExist,
                                         the values array must be empty. This array is replaced during a strategic
                                         merge patch.
                                       items:
                                         type: string
                                       type: array
                                       x-kubernetes-list-type: atomic
                                   required:
                                   - key
                                   - operator
                                   type: object
                                 type: array
                                 x-kubernetes-list-type: atomic
                               matchLabels:
                                 additionalProperties:
                                   type: string
                                 description: |-
                                   matchLabels is a map of {key,value} pairs. A single {key,value} in the matchLabels
                                   map is equivalent to an element of matchExpressions, whose key field is "key", the
                                   operator is "In", and the values array contains only "value". The requirements are ANDed.
                                 type: object
                             type: object
                             x-kubernetes-map-type: atomic
                         type: object
                       type: array
                     maximumDuration:
                       type: string
                     priority:
                       type: integer
                     spotAccess:
                       type: boolean
                   type: object
                 type: array
+                minItems: 1
             type: object
+            required:
+            - policies
           status:
             description: ExporterAccessPolicyStatus defines the observed state of
               ExporterAccessPolicy.
             type: object
         type: object
+        required:
+        - spec

Also applies to: 162-162

python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)

157-161: Redundant timeout configuration.

Both move_on_after(300) and wait_for_any_of(..., timeout=300.0) specify 300 seconds. Since wait_for_any_of has its own timeout, the outer move_on_after is redundant. Consider either removing the outer timeout or differentiating them if different semantics are intended (e.g., outer as a hard cap slightly longer than inner).

♻️ Suggested simplification
-                                        with anyio.move_on_after(300) as timeout_scope:  # 5 minute timeout
-                                            result = await monitor.wait_for_any_of(
-                                                [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-                                                timeout=300.0
-                                            )
+                                        result = await monitor.wait_for_any_of(
+                                            [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+                                            timeout=300.0
+                                        )
+                                        if result is None:
+                                            logger.warning("Timeout waiting for afterLease hook to complete")

Note: This would require adjusting the subsequent conditional logic (lines 162-172) to handle the timeout case inline rather than via timeout_scope.cancelled_caught.

controller/Makefile (1)

66-68: Make the CRD copy step more robust with mkdir.

If the templates directory is missing in a fresh checkout, cp will fail.

♻️ Suggested tweak
 	cp deploy/helm/jumpstarter/crds/* deploy/operator/config/crd/bases/
+	mkdir -p deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/
 	cp deploy/helm/jumpstarter/crds/* deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/
controller/internal/service/controller_service.go (1)

318-338: Consider logging unexpected status values in the default case.

The protoStatusToString function silently maps unknown proto status values to ExporterStatusUnspecified. While this provides a safe fallback, logging at debug/warning level would help diagnose protocol mismatches.

♻️ Optional: Add logging for unexpected status values
 func protoStatusToString(status pb.ExporterStatus) string {
 	switch status {
 	case pb.ExporterStatus_EXPORTER_STATUS_OFFLINE:
 		return jumpstarterdevv1alpha1.ExporterStatusOffline
 	// ... other cases ...
 	default:
+		// Log unexpected status values for debugging protocol mismatches
+		// Consider using a package-level logger if available
 		return jumpstarterdevv1alpha1.ExporterStatusUnspecified
 	}
 }
python/packages/jumpstarter/jumpstarter/common/enums.py (1)

38-45: Consider handling invalid protobuf values gracefully in from_proto.

The current implementation will raise ValueError if an unrecognized integer is passed to from_proto. For robustness against protocol version mismatches, consider returning UNSPECIFIED for unknown values.

♻️ Optional: Handle unknown values gracefully
     @classmethod
     def from_proto(cls, value: int) -> "ExporterStatus":
         """Convert from protobuf integer to enum."""
-        return cls(value)
+        try:
+            return cls(value)
+        except ValueError:
+            return cls.UNSPECIFIED
python/packages/jumpstarter/jumpstarter/exporter/logging.py (1)

31-39: Consider longest-prefix matching for hierarchical loggers.

The current prefix matching returns the first match found during iteration. With hierarchical logger names (e.g., driver.power, driver.power.usb), if both driver and driver.power are mapped, the match depends on iteration order. Consider sorting by key length descending to ensure the most specific (longest) prefix wins.

♻️ Suggested improvement for deterministic matching
     def get_source_for_record(self, record):
         """Determine the appropriate log source for a record."""
         with self._lock:
             # Check if this record comes from a logger with a specific source mapping
             logger_name = record.name
-            for mapped_logger, source in self._child_handlers.items():
+            # Sort by key length descending to match most specific prefix first
+            for mapped_logger, source in sorted(
+                self._child_handlers.items(), key=lambda x: len(x[0]), reverse=True
+            ):
                 if logger_name.startswith(mapped_logger):
                     return source
             return self.source
python/packages/jumpstarter/jumpstarter/client/core.py (1)

440-510: Consider extracting helper functions to reduce complexity.

The static analysis correctly flags this method as too complex (C901: 14 > 10). The method handles multiple concerns: reconnection with backoff, log source routing, and filtering. While the current structure is readable, extracting helpers would improve maintainability.

♻️ Suggested structure for reduced complexity
def _route_log_to_logger(self, response, show_all_logs: bool) -> None:
    """Route a log response to the appropriate logger."""
    from jumpstarter.common import LogSource

    if response.HasField("source"):
        source = LogSource(response.source)
        is_hook = source in (LogSource.BEFORE_LEASE_HOOK, LogSource.AFTER_LEASE_HOOK)
    else:
        source = LogSource.SYSTEM
        is_hook = False

    if not (is_hook or show_all_logs):
        return

    severity = response.severity if response.severity else "INFO"
    log_level = getattr(logging, severity, logging.INFO)

    logger_names = {
        LogSource.BEFORE_LEASE_HOOK: "exporter:beforeLease",
        LogSource.AFTER_LEASE_HOOK: "exporter:afterLease",
        LogSource.DRIVER: "exporter:driver",
    }
    logger_name = logger_names.get(source, "exporter:system")
    logging.getLogger(logger_name).log(log_level, response.message)

Then use this helper in the main loop to reduce nesting.

python/packages/jumpstarter/jumpstarter/exporter/hooks.py (1)

496-498: TODO: Controller-side lease ending not yet implemented.

The comment indicates that when on_failure='endLease', there's no mechanism to actually end the lease from the exporter side. This should be tracked.

Would you like me to open an issue to track the controller-side lease ending mechanism for beforeLease hook failures with on_failure='endLease'?

python/packages/jumpstarter/jumpstarter/client/status_monitor.py (1)

197-279: Consider extracting common verification logic.

Both wait_for_status and wait_for_any_of share substantial code for connection recovery verification. This duplication contributes to the complexity flagged by static analysis.

♻️ Suggested refactor to reduce duplication
async def _verify_connection_recovery(self) -> ExporterStatus | None:
    """Verify connection and return current status if recovered, None if still lost."""
    if not self._connection_lost:
        return self._current_status
    
    logger.debug("Connection was marked as lost, verifying...")
    try:
        response = await self._stub.GetStatus(jumpstarter_pb2.GetStatusRequest())
        logger.info("Connection recovered during verification poll")
        self._connection_lost = False
        new_status = ExporterStatus.from_proto(response.status)
        self._current_status = new_status
        self._status_version = response.status_version
        return new_status
    except AioRpcError as e:
        if e.code() == StatusCode.UNAVAILABLE:
            logger.debug("Connection still lost (UNAVAILABLE)")
        else:
            logger.debug("GetStatus error during verification: %s", e.code())
        return None
    except Exception as e:
        logger.debug("GetStatus error during verification: %s", e)
        return None

Then use this helper in both wait_for_status and wait_for_any_of.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In
`@python/packages/jumpstarter-driver-power/jumpstarter_driver_power/client_test.py`:
- Around line 8-20: The test_log_stream uses fixed time.sleep(1) which is flaky;
replace these sleeps with a short-polling loop that checks caplog.text until the
expected substring appears or a timeout expires. In test_log_stream (around
client.log_stream(), client.on(), client.off()) implement a small helper or
inline poll using time.monotonic() to loop with a tiny sleep (e.g., 0.01–0.1s)
and fail if timeout reached, then assert "power on" and "power off" only after
the poll succeeds; keep using caplog and the same logger scope so the
client.log_stream(), client.on(), and client.off() calls remain unchanged.
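
A small helper along these lines would do (a sketch; the name wait_for_log is made up here):

import time

def wait_for_log(caplog, needle: str, timeout: float = 5.0) -> None:
    """Poll captured logs until `needle` appears or the timeout expires."""
    deadline = time.monotonic() + timeout
    while needle not in caplog.text:
        if time.monotonic() > deadline:
            raise AssertionError(f"timed out waiting for {needle!r} in logs")
        time.sleep(0.05)

The test would then call wait_for_log(caplog, "power on") after client.on() instead of sleeping a fixed second.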

In `@python/packages/jumpstarter/jumpstarter/client/client_test.py`:
- Line 3: Remove the unused import OrderedDict from the imports in
client_test.py; locate the line "from collections import OrderedDict" and delete
it (or remove OrderedDict from the import list) so the lint warning is resolved
and no functionality is changed.

In `@python/packages/jumpstarter/jumpstarter/client/status_monitor_test.py`:
- Line 3: The import list in status_monitor_test.py includes an unused symbol
AsyncMock; remove AsyncMock from the import statement (keep MagicMock) so the
module only imports symbols actually used in the tests, ensuring static analysis
no longer flags the unused AsyncMock.

In `@python/packages/jumpstarter/jumpstarter/common/enums_test.py`:
- Line 3: Remove the unused top-level import "import pytest" from enums_test.py;
locate the import statement at the top of the test module and delete it so the
file no longer contains the unused pytest symbol.
🧹 Nitpick comments (4)
python/packages/jumpstarter/jumpstarter/exporter/lease_context_test.py (1)

186-200: Avoid time-based sleep to reduce test flakiness.

Line 195 uses a fixed sleep, which can be flaky on slow CI. Consider a handshake event so the test deterministically waits for wait_for_drivers() to start before setting the hook.

♻️ Proposed deterministic rewrite
     async def test_wait_for_drivers_blocks_until_set(self) -> None:
         """Test that wait_for_drivers() blocks until hook event is set."""
         import anyio
 
         before_hook = Event()
         ctx = LeaseContext(lease_name="test-lease", before_lease_hook=before_hook)
 
-        # Set the event after a short delay
-        async def set_after_delay():
-            await anyio.sleep(0.05)
-            before_hook.set()
+        waiter_started = Event()
+
+        async def wait_and_signal():
+            waiter_started.set()
+            await ctx.wait_for_drivers()
 
         async with anyio.create_task_group() as tg:
-            tg.start_soon(set_after_delay)
-            await ctx.wait_for_drivers()
+            tg.start_soon(wait_and_signal)
+            await waiter_started.wait()
+            before_hook.set()
 
         assert ctx.drivers_ready()
python/packages/jumpstarter/jumpstarter/client/core_test.py (2)

16-32: Consider extracting shared test utilities.

MockAioRpcError and create_mock_rpc_error are duplicated from status_monitor_test.py. Consider extracting these to a shared test utilities module (e.g., jumpstarter/client/test_utils.py or a conftest.py).


90-116: Move import to module level.

The anyio import inside wait_for_hook_status should be at the module level for consistency and clarity.

Proposed fix

Add to imports at top of file:

import anyio

Then update the method:

     async def wait_for_hook_status(self, target_status: ExporterStatus, timeout: float = 60.0) -> bool:
         """Wait for exporter to reach a target status using polling."""
-        import anyio
-
         poll_interval = 0.1  # Fast for testing
python/packages/jumpstarter/jumpstarter/client/client_test.py (1)

121-124: Consider adding assertion to fully verify topological order.

The comment states children should be built before parents, but the assertion only verifies grandparent is built after parent. Adding an assertion that child is built before parent would fully validate the topological ordering.

Suggested improvement
         # Children should be built before parents in topological order
-        # The order should be child, parent, grandparent (or parent, child, grandparent)
+        # The order should be child, parent, grandparent
         assert "grandparent" in build_order
         assert build_order.index("grandparent") > build_order.index("parent")
+        assert build_order.index("parent") > build_order.index("child")

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@controller/internal/service/suite_test.go`:
- Around line 56-58: The test harness sets up testEnv as an envtest.Environment
with CRDDirectoryPaths and ErrorIfCRDPathMissing: true, which will abort the
suite if the CRD path is wrong; verify and update the CRDDirectoryPaths value to
the actual CRD directory (or add multiple fallback paths) so envtest can find
the CRDs, e.g., check and correct the path used in the testEnv initialization
(the envtest.Environment struct, the CRDDirectoryPaths slice, and the
ErrorIfCRDPathMissing flag) or set ErrorIfCRDPathMissing to false while you
confirm the correct location.
- Around line 29-31: The test currently ignores errors when creating a
Namespace; change the Namespace creation logic (the call that creates a
corev1.Namespace object, e.g., client.CoreV1().Namespaces().Create or similar in
the test setup) to explicitly handle errors: import
k8s.io/apimachinery/pkg/api/errors as apierrors, then after the Create call
check if err != nil and if apierrors.IsAlreadyExists(err) continue/ignore,
otherwise fail the test (t.Fatalf or return the error). Apply the same pattern
for the other namespace-creation site referenced in the comment (the second
block around lines 81-87) so only AlreadyExists is tolerated and all other
errors cause the test to fail.
- Around line 19-24: Update the test setup in suite_test.go to avoid hard-coding
BinaryAssetsDirectory: first respect the KUBEBUILDER_ASSETS environment variable
if set (use it to set BinaryAssetsDirectory), then fall back to existing helper
GetFirstFoundEnvTestBinaryDir() or a directory-existence check for the current
hard-coded path (bin/k8s/1.30.0-<os>-<arch>), and only assign the hard-coded
path as a last resort; modify the code that sets BinaryAssetsDirectory to
perform conditional checks (os.Getenv("KUBEBUILDER_ASSETS"), os.Stat on
candidate dirs, and GetFirstFoundEnvTestBinaryDir()) so tests run correctly
outside the Makefile and when the version/path differs.

In `@python/packages/jumpstarter/jumpstarter/client/client_test.py`:
- Around line 120-124: The test's topological-order check only ensures
"grandparent" comes after "parent" but doesn't verify that "child" precedes
"parent"; update the assertions to explicitly assert that
build_order.index("child") < build_order.index("parent") (and keep/optionally
assert both "child" and "parent" are before "grandparent") so the test enforces
children-before-parents ordering; reference the existing build_order list and
the node names "child", "parent", and "grandparent" when adding the
assertion(s).

In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 31-38: The docstring usage example in status_monitor.py
incorrectly calls client.start_status_monitor(); update the example to call the
actual method client.status_monitor_async() (or add a thin alias
start_status_monitor -> status_monitor_async in the client if you prefer) so the
docstring matches the real API; ensure the rest of the example still references
monitor.current_status and monitor.wait_for_status(ExporterStatus.LEASE_READY).
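
The corrected docstring example would then read roughly as follows (a sketch; client is whatever driver client the surrounding docstring already uses):

async with client.status_monitor_async() as monitor:
    print(monitor.current_status)
    await monitor.wait_for_status(ExporterStatus.LEASE_READY)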

In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 426-500: The before-lease failure branch doesn't actually end the
lease for on_failure='endLease'; modify run_before_lease_hook to accept an async
lease-release callback (e.g., end_lease: Callable[[LeaseContext],
Awaitable[None]]) or reuse an existing controller-release method, and in the
HookExecutionError else-branch (the on_failure='endLease' path) invoke await
end_lease(lease_scope) (and still set lease_scope.before_lease_hook
event/unblock connections and report status) so the lease is cleanly terminated;
update all call sites of run_before_lease_hook to pass the new end_lease
callback (mirroring how run_after_lease_hook is wired).
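
A sketch of the suggested wiring; HookExecutionError and the callback signature follow the comment above, everything else is a placeholder:

from collections.abc import Awaitable, Callable

class HookExecutionError(Exception):
    """Stand-in for the real hook failure exception."""

async def run_before_lease_hook(
    lease_scope,
    executor,
    on_failure: str,
    end_lease: Callable[[object], Awaitable[None]],
) -> None:
    try:
        await executor.execute_before_lease_hook(lease_scope)
    except HookExecutionError:
        if on_failure == "endLease":
            # Actually release the lease instead of only reporting failure.
            await end_lease(lease_scope)
        raise
    finally:
        # Unblock connections waiting on the hook either way.
        lease_scope.before_lease_hook.set()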
♻️ Duplicate comments (2)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)

80-90: Unused synchronous function.

This function is no longer called anywhere in the codebase since line 213 now uses _run_shell_with_lease_async. Consider removing it to reduce maintenance burden.

python/packages/jumpstarter/jumpstarter/client/status_monitor.py (1)

397-403: Async context manager won't start monitoring.

The __aenter__ method returns self without calling start(). Direct usage as an async context manager (async with StatusMonitor(...) as monitor:) will silently fail to start background polling. Users must use client.status_monitor_async() instead. Either fix __aenter__ to properly start monitoring or remove the context manager methods to prevent misuse.

🧹 Nitpick comments (9)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (2)

157-161: Redundant double timeout.

Both anyio.move_on_after(300) and timeout=300.0 in wait_for_any_of set 5-minute timeouts. The inner timeout parameter makes the outer cancel scope redundant. Consider removing one:

♻️ Proposed fix
-                                        with anyio.move_on_after(300) as timeout_scope:  # 5 minute timeout
-                                            result = await monitor.wait_for_any_of(
-                                                [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-                                                timeout=300.0
-                                            )
+                                        result = await monitor.wait_for_any_of(
+                                            [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+                                            timeout=300.0
+                                        )
+                                        if result is None:
+                                            logger.warning("Timeout waiting for afterLease hook to complete")

Then adjust the subsequent conditional logic accordingly, removing the timeout_scope.cancelled_caught check at line 171.


133-136: Consider checking connection_lost for beforeLease hook.

The afterLease handling (line 166) checks monitor.connection_lost to distinguish between timeout and connection loss. The beforeLease handling treats both cases as "Timeout" which could be misleading in logs.

♻️ Proposed enhancement
                             if result == ExporterStatus.BEFORE_LEASE_HOOK_FAILED:
                                 logger.warning("beforeLease hook failed")
                             elif result is None:
-                                logger.warning("Timeout waiting for beforeLease hook")
+                                if monitor.connection_lost:
+                                    logger.warning("Connection lost during beforeLease hook")
+                                else:
+                                    logger.warning("Timeout waiting for beforeLease hook")
python/packages/jumpstarter/jumpstarter/client/client_test.py (1)

58-58: Ensure ExitStack is closed after each test.

ExitStack() is never closed, which can leak cleanup callbacks if client_from_channel registers any. Prefer with ExitStack() as mock_stack: to guarantee teardown.

♻️ Example pattern (apply to each test)
-        mock_stack = ExitStack()
+        with ExitStack() as mock_stack:
             with patch(
                 "jumpstarter.client.client.MultipathExporterStub",
                 return_value=mock_stub
             ), patch(
                 "jumpstarter.client.client.import_class",
                 return_value=MockDriverClient
             ):
                 from jumpstarter.client.client import client_from_channel

                 client = await client_from_channel(
                     mock_channel, mock_portal, mock_stack, allow=[], unsafe=True
                 )

Also applies to: 98-98, 143-143, 177-177, 211-211, 245-245

python/packages/jumpstarter/jumpstarter/client/status_monitor_test.py (1)

16-32: Consider extracting shared mock utilities to a common test module.

MockAioRpcError and create_mock_rpc_error are duplicated here and in core_test.py (per the relevant snippets). Extracting these to a shared test utilities module (e.g., conftest.py or a dedicated testing.py) would reduce duplication.

python/packages/jumpstarter/jumpstarter/client/core.py (1)

332-351: Asymmetric behavior when monitor is not started.

wait_for_lease_ready_monitored raises RuntimeError if the monitor isn't started, while wait_for_hook_complete_monitored silently returns True. This asymmetry could be surprising. Consider documenting this behavior difference more explicitly or using consistent error handling.

python/packages/jumpstarter/jumpstarter/client/status_monitor.py (1)

309-309: Use %-style formatting in logger calls instead of f-strings.

F-strings in logging calls evaluate the string even when the log level is disabled, causing unnecessary overhead. Use %-style formatting for deferred evaluation.

Proposed fix
-                    logger.warning(f"Missed {missed} status transition(s)")
+                    logger.warning("Missed %d status transition(s)", missed)
-                    logger.info(f"Status changed: {old_status} -> {new_status} (version={new_version})")
+                    logger.info("Status changed: %s -> %s (version=%d)", old_status, new_status, new_version)
-                            logger.error(f"Status change callback error: {e}")
+                            logger.error("Status change callback error: %s", e)
-                logger.debug(f"GetStatus poll error: {e.code()}")
+                logger.debug("GetStatus poll error: %s", e.code())
-                logger.debug(f"GetStatus poll error: {e}")
+                logger.debug("GetStatus poll error: %s", e)

Also applies to: 320-320, 337-337, 352-354

controller/internal/service/suite_test.go (1)

90-93: Guard teardown if setup failed.

If testEnv never initialized, Stop() can panic in AfterSuite. A simple nil check avoids cascading failures.

🛡️ Suggested guard
 	By("tearing down the test environment")
-	err := testEnv.Stop()
-	Expect(err).NotTo(HaveOccurred())
+	if testEnv != nil {
+		err := testEnv.Stop()
+		Expect(err).NotTo(HaveOccurred())
+	}
controller/Makefile (1)

66-75: Guard against duplicate metadata.annotations blocks.

Line 66–71 unconditionally inserts a new annotations: map. If the generated role ever includes annotations (now or later), this can create duplicate keys and potentially drop the sync-wave or other annotations depending on YAML parsing behavior. Consider merging into an existing annotations map or using a YAML-aware tool for the insert.

python/packages/jumpstarter/jumpstarter/exporter/hooks.py (1)

341-356: Use anyio.to_thread.run_sync() with timeout for blocking process.wait() after process.kill().

The direct process.wait() call at line 354 blocks the event loop if the process doesn't exit promptly. The same block already demonstrates the correct pattern at line 346–347 with anyio.to_thread.run_sync() and anyio.move_on_after(). Apply the same pattern here for consistency and robustness:

Safer non‑blocking wait
                 if process.poll() is None:
                     process.kill()
                     try:
-                        process.wait()
+                        with anyio.move_on_after(5):
+                            await anyio.to_thread.run_sync(process.wait)
                     except Exception:
                         pass

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@controller/Makefile`:
- Line 74: The Makefile currently runs the cp command copying CRDs to
deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/, which
will fail if that target directory doesn't exist; update the Makefile to create
the directory first by running a mkdir -p for
deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/
immediately before the cp command so the cp in the Makefile succeeds even when
the templates/crds directory is missing.
🧹 Nitpick comments (1)
controller/Makefile (1)

66-71: Idempotency risk: duplicate annotations: keys if file already contains annotations.

The AWK script unconditionally inserts an annotations: block after the name: line. If controller-gen ever outputs annotations in the generated role.yaml, or if the manifests target runs without a fresh regeneration, this will produce invalid YAML with duplicate annotations: keys.

Consider using a more robust approach that checks for existing annotations before inserting:

♻️ Suggested safer AWK approach
 	# Add ArgoCD sync-wave annotation to RBAC role for proper deployment ordering (PR #207)
-	@awk '/^  name: jumpstarter-manager-role$$/{print; print "  annotations:"; print "    argocd.argoproj.io/sync-wave: \"-1\""; next}1' \
+	@awk ' \
+		/^  name: jumpstarter-manager-role$$/ { print; found_name=1; next } \
+		found_name && /^  annotations:/ { print; found_annotations=1; next } \
+		found_name && found_annotations && /^  [a-zA-Z]/ && !/^    / { \
+			print "    argocd.argoproj.io/sync-wave: \"-1\""; found_name=0; found_annotations=0 \
+		} \
+		found_name && !found_annotations && /^  [a-zA-Z]/ { \
+			print "  annotations:"; print "    argocd.argoproj.io/sync-wave: \"-1\""; found_name=0 \
+		} \
+		{ print }' \
 		deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml > \
 		deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml.tmp && \
 		mv deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml.tmp \
 		deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml

Alternatively, consider using a tool like yq for safer YAML manipulation.

Also, the comment references "PR #207" but this appears to be PR #140.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py`:
- Around line 125-143: The code currently launches the shell unconditionally
after monitor.wait_for_any_of; change it to treat
ExporterStatus.BEFORE_LEASE_HOOK_FAILED and a None (timeout) as hard stops by
not calling _run_shell_only. After the wait_for_any_of call
(monitor.wait_for_any_of), check result and if it equals
ExporterStatus.BEFORE_LEASE_HOOK_FAILED or is None, log the warning and exit the
function (or raise an appropriate exception) instead of proceeding to call
anyio.to_thread.run_sync(_run_shell_only, lease, config, command, path); only
call _run_shell_only when result == ExporterStatus.LEASE_READY.
♻️ Duplicate comments (5)
python/packages/jumpstarter/jumpstarter/exporter/exporter.py (4)

202-204: Controller channels are not closed after use.

This creates a new gRPC channel per call but never closes it. Multiple calls during status updates and stream retries will leak channels. The pattern in _unregister_with_controller() (lines 357-369) demonstrates proper cleanup with explicit await channel.close().


414-458: Task may hang when lease ends without EndSession.

This task only waits on end_session_requested (line 426). If the lease ends without the client calling EndSession, the task continues waiting indefinitely on the old lease context. Consider racing against lease_ended or using a task group scoped to the lease lifetime.


546-553: Listen task should be scoped to the lease lifetime.

The _retry_stream task for Listen is started on the outer tg (line 548), causing it to continue running after handle_lease exits. This results in the task persisting across lease transitions. Move this to conn_tg so it's automatically cancelled when the lease ends, and close listen_tx in the finally block to signal proper stream termination.


578-579: Consider scoping _handle_end_session to the lease task group.

Starting this task on the outer tg means it survives the lease ending. If combined with the fact that it only waits on end_session_requested, the task can be orphaned. This is related to the hang issue noted in the _handle_end_session method.

python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)

80-89: Consider removing _run_shell_with_lease if it’s now unused.

The async path appears to be the only entry point, so this sync helper likely adds maintenance overhead.

#!/bin/bash
# Verify remaining usages of the sync helper
rg -nP '\b_run_shell_with_lease\s*\(' --type py

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 8-18: The import block is unsorted per Ruff I001; reorder imports
into standard-library first (logging, collections.abc.Awaitable,
collections.abc.Callable, contextlib.AsyncExitStack), then third-party packages
(anyio, anyio.Event, anyio.create_task_group, grpc.StatusCode,
grpc.aio.AioRpcError, jumpstarter_protocol.jumpstarter_pb2), and finally
local/project imports (jumpstarter.common.ExporterStatus) so the import order
satisfies the linter and keeps the existing imported symbols intact.
- Around line 196-201: The timeout checks in wait_for_status() and
wait_for_any_of() incorrectly use "if timeout:" which treats 0 as falsy; change
both checks to "if timeout is not None:" so a timeout of 0 is handled as an
immediate timeout; update the branches around anyio.move_on_after in
wait_for_status() and wait_for_any_of() (the blocks that call wait_loop() and
return False) to use the new condition.
- Around line 396-400: stop() currently sets _running = False and signals
_stop_event but does not wake waiters blocked on _any_change_event; update
stop() to also set _any_change_event (and optionally _any_change_event.set()
followed by _any_change_event.clear() if you want a one-shot wake) so waiters in
wait_for_status and wait_for_any_of that are awaiting _any_change_event.wait()
will wake and observe _running == False and exit their loops; reference the stop
method and the _any_change_event, wait_for_status, and wait_for_any_of functions
when applying this change.

In `@python/packages/jumpstarter/jumpstarter/exporter/exporter.py`:
- Around line 168-191: The stop() method currently ignores the caller's
should_unregister when wait_for_lease_exit is True, causing later calls (e.g.,
in serve()) to override intent; update stop() so that whenever a deferred stop
is requested (the branch setting self._stop_requested = True) you also set
self._unregister = should_unregister (and likewise when a stop has already been
requested you should update/preserve self._unregister accordingly) so the
original caller preference is retained; refer to the stop() method, the
attributes self._unregister, self._stop_requested, and self._tg.cancel_scope to
implement this change.
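
A sketch of the intent-preserving stop(); _lease_active is a hypothetical stand-in for however the exporter tracks an active lease, the other attribute names come from the prompt:

def stop(self, should_unregister: bool = True, wait_for_lease_exit: bool = False) -> None:
    if wait_for_lease_exit and self._lease_active:
        # Deferred stop: record the caller's unregister preference now so the
        # eventual internal stop (e.g. from serve()) cannot override it.
        self._stop_requested = True
        self._unregister = should_unregister
        return
    self._unregister = should_unregister
    self._tg.cancel_scope.cancel()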
🧹 Nitpick comments (1)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)

147-162: Remove redundant timeout wrapper and improve timeout detection logic.

The StatusMonitor.connection_lost property is valid and well-tested. However, the outer anyio.move_on_after(30) is redundant—wait_for_any_of() already has an identical timeout=30.0 parameter. When both are equal, the inner timeout fires first, making the outer scope's cancelled_caught flag never True. This means timeout cases currently fall through to "Hook completion not confirmed" rather than the timeout warning.

The suggested refactor properly addresses this by removing the outer timeout wrapper and using result is None to detect timeouts, which disambiguates between connection loss and timeout.

♻️ Suggested simplification
-                                        with anyio.move_on_after(30) as timeout_scope:  # 30 second timeout
-                                            result = await monitor.wait_for_any_of(
-                                                [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-                                                timeout=30.0
-                                            )
-                                            if result == ExporterStatus.AVAILABLE:
-                                                logger.info("afterLease hook completed")
-                                            elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
-                                                logger.warning("afterLease hook failed")
-                                            elif monitor.connection_lost:
-                                                # Connection lost - exporter closed, hook likely completed
-                                                logger.info("Exporter connection closed (hook completed)")
-                                            else:
-                                                logger.debug("Hook completion not confirmed")
-                                        if timeout_scope.cancelled_caught:
-                                            logger.warning("Timeout waiting for afterLease hook to complete")
+                                        result = await monitor.wait_for_any_of(
+                                            [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+                                            timeout=30.0
+                                        )
+                                        if result == ExporterStatus.AVAILABLE:
+                                            logger.info("afterLease hook completed")
+                                        elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
+                                            logger.warning("afterLease hook failed")
+                                        elif result is None and monitor.connection_lost:
+                                            # Connection lost - exporter closed, hook likely completed
+                                            logger.info("Exporter connection closed (hook completed)")
+                                        elif result is None:
+                                            logger.warning("Timeout waiting for afterLease hook to complete")
+                                        else:
+                                            logger.debug("Hook completion not confirmed")

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/lease.py`:
- Around line 237-257: The fixed-count retry loop around self.controller.Dial
(using max_retries/base_delay/max_delay) should be replaced with a time‑based
retry bounded by a configurable timeout (use the existing acquisition_timeout if
available or add a new dial_timeout attribute/parameter) so Dial keeps retrying
until time.monotonic() exceeds the deadline; compute deadline = now + timeout,
then loop attempting await self.controller.Dial(...), on AioRpcError with
FAILED_PRECONDITION and "not ready" apply the same exponential/backoff delay
(min(base_delay*2**attempt, max_delay)) but stop retrying and re-raise once
monotonic() > deadline; update the retry log to show remaining time or attempts
left and ensure the new timeout is plumbed through the Lease acquisition
call/site that calls this Dial logic.
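
A self-contained sketch of that time-based retry; dial stands in for self.controller.Dial and the function name is illustrative:

import time

import anyio
from grpc import StatusCode
from grpc.aio import AioRpcError


async def dial_with_deadline(dial, request, timeout=60.0, base_delay=0.5, max_delay=5.0):
    # Retry until a wall-clock deadline instead of a fixed attempt count.
    deadline = time.monotonic() + timeout
    attempt = 0
    while True:
        try:
            return await dial(request)
        except AioRpcError as e:
            retryable = (
                e.code() == StatusCode.FAILED_PRECONDITION
                and "not ready" in (e.details() or "")
            )
            if not retryable or time.monotonic() > deadline:
                raise  # out of time, or a different error: surface it
            await anyio.sleep(min(base_delay * 2**attempt, max_delay))
            attempt += 1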

In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 343-347: In _poll_loop (status_monitor.py) the AioRpcError
UNIMPLEMENTED branch currently breaks without clearing _running or notifying
waiters; update that branch to mirror the UNAVAILABLE handling by setting
self._running = False and calling self._any_change_event.set() (and any other
cleanup done in the UNAVAILABLE branch) before breaking so wait_for_status() and
wait_for_any_of() don't hang; also add a unit test that triggers UNIMPLEMENTED
while a waiter (using wait_for_status() or wait_for_any_of()) is active and
assert the waiter returns promptly.
♻️ Duplicate comments (3)
python/packages/jumpstarter/jumpstarter/client/status_monitor.py (3)

8-18: Reorder imports to satisfy Ruff (I001).

Current ordering may still fail Ruff I001.

🔧 Proposed import order
-import logging
-from collections.abc import Awaitable, Callable
-from contextlib import AsyncExitStack
+from collections.abc import Awaitable, Callable
+from contextlib import AsyncExitStack
+import logging

Please verify with:

#!/bin/bash
ruff check python/packages/jumpstarter/jumpstarter/client/status_monitor.py

Based on learnings, Ruff is the formatter/linter for Python files.


196-201: Handle timeout=0 explicitly.

if timeout: treats 0 as falsy, so an immediate timeout becomes an infinite wait (Line 196 and Line 280).

🐛 Proposed fix
-        if timeout:
+        if timeout is not None:
             with anyio.move_on_after(timeout):
                 return await wait_loop()
             return False
-        if timeout:
+        if timeout is not None:
             with anyio.move_on_after(timeout):
                 return await wait_for_first()
             return None

Optional verification:

#!/bin/bash
rg -n "wait_for_status|wait_for_any_of|timeout\s*=\s*0" --type=py

Also applies to: 280-285


396-400: Wake waiting coroutines on stop().

stop() doesn’t signal _any_change_event, so wait_for_* can hang after shutdown.

🛠️ Proposed fix
     async def stop(self):
         """Stop the background polling task."""
         self._running = False
         self._stop_event.set()
+        # Wake any waiters so they can observe _running=False
+        self._any_change_event.set()
+        self._any_change_event = Event()

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@e2e/dex.values.yaml`:
- Around line 37-40: The userID value for the test user "test-exporter-hooks" is
not a valid UUID because it contains the non-hex character 'g' (value:
c6ed6f40-6689-6g7e-c64c-55d0a9a5871f); replace this with a valid UUID (only 0-9
and a-f), e.g., generate a new v4 UUID and set userID to that new value so the
entry for username "test-exporter-hooks" and its userID field are valid.
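
For example, a conforming value can be generated with Python's standard library; any fresh v4 UUID works:

import uuid

print(uuid.uuid4())  # v4 UUIDs contain only the hex digits 0-9/a-f plus dashes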
🧹 Nitpick comments (5)
e2e/tests-hooks.bats (5)

9-10: Unused variable declaration.

CURRENT_HOOKS_CONFIG is declared but never referenced anywhere in the file. Consider removing it to avoid confusion.

Suggested fix
-# Track which config is currently active
-CURRENT_HOOKS_CONFIG=""

51-54: Consider graceful termination before SIGKILL.

Using kill -9 immediately doesn't give the process a chance to clean up. Consider sending SIGTERM first with a brief grace period before escalating to SIGKILL.

Suggested improvement
         if ps -p "$pid" > /dev/null 2>&1; then
           echo "  Killing PID $pid" >&2
-          kill -9 "$pid" 2>/dev/null || true
+          kill -TERM "$pid" 2>/dev/null || true
+          sleep 0.5
+          ps -p "$pid" > /dev/null 2>&1 && kill -9 "$pid" 2>/dev/null || true
         fi

104-110: Fixed sleep may introduce flakiness.

The 2-second sleep on line 107 is intended to avoid catching pre-connect state, but this is timing-dependent. If the exporter connects faster or slower than expected, tests could become flaky. Consider polling for a specific pre-condition instead, or documenting why 2 seconds is sufficient.


205-208: Add safeguards for empty grep results.

If the grep commands fail to find matches (e.g., due to unexpected output format), before_pos or after_pos could be empty, causing the comparison on line 208 to fail with a shell error. Consider adding validation or using || true with a fallback.

Suggested improvement
   # Verify order: BEFORE should appear before AFTER in output
   local before_pos=$(echo "$output" | grep -n "BEFORE_HOOK:" | head -1 | cut -d: -f1)
   local after_pos=$(echo "$output" | grep -n "AFTER_HOOK:" | head -1 | cut -d: -f1)

+  # Ensure both markers were found
+  [ -n "$before_pos" ] && [ -n "$after_pos" ]
   [ "$before_pos" -lt "$after_pos" ]

315-333: Tests lack verification of actual j command effects.

Tests E1 and E2 only verify that the hook scripts complete (via output markers), but don't verify that j power on/off commands actually executed or had the expected effect. Consider adding assertions that verify the power state changed, or at minimum checking for command output in the hook logs.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 358-365: The UNAVAILABLE branch breaks the poll loop but doesn't
update the monitor's running state, so set self._running = False when handling
StatusCode.UNAVAILABLE; specifically, inside the elif e.code() ==
StatusCode.UNAVAILABLE block update self._running to False alongside setting
self._connection_lost, trigger and recreate self._any_change_event as shown, and
ensure start() can restart the monitor after this exit.
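
A sketch of the terminal-error handling that this prompt and the earlier UNIMPLEMENTED one converge on (_poll_once is a hypothetical stand-in for one GetStatus round-trip; the attribute names are the monitor's own):

from anyio import Event
from grpc import StatusCode
from grpc.aio import AioRpcError


class StatusMonitorSketch:
    async def _poll_loop(self):
        while self._running:
            try:
                await self._poll_once()
            except AioRpcError as e:
                if e.code() in (StatusCode.UNIMPLEMENTED, StatusCode.UNAVAILABLE):
                    # Mark the monitor stopped and wake waiters before breaking,
                    # so wait_for_status()/wait_for_any_of() cannot hang.
                    self._connection_lost = e.code() == StatusCode.UNAVAILABLE
                    self._running = False
                    self._any_change_event.set()
                    self._any_change_event = Event()
                    break
                raise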
♻️ Duplicate comments (2)
python/packages/jumpstarter/jumpstarter/client/status_monitor.py (2)

199-204: Handle timeout=0 explicitly.

if timeout: treats 0 as falsy, so timeout=0 waits indefinitely instead of timing out immediately.

🛠️ Proposed fix
-        if timeout:
+        if timeout is not None:
             with anyio.move_on_after(timeout):
                 return await wait_loop()
             return False
-        if timeout:
+        if timeout is not None:
             with anyio.move_on_after(timeout):
                 return await wait_for_first()
             return None

Also applies to: 283-286


416-420: Wake waiters on stop to avoid hangs.

stop() only signals _stop_event, but waiters block on _any_change_event, so they can hang indefinitely.

🛠️ Proposed fix
     async def stop(self):
         """Stop the background polling task."""
         self._running = False
         self._stop_event.set()
+        # Wake waiters so they can observe _running=False
+        self._any_change_event.set()
+        self._any_change_event = Event()

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/packages/jumpstarter/jumpstarter/config/client.py (1)

162-177: Thread include_status through to the ListExporters request.

The include_status parameter is only used to set a display flag on the response (line 176), but it's never passed to ClientService.ListExporters. If the server requires this flag in the request to include status data, status information won't be fetched. Update ClientService.ListExporters to accept and propagate include_status to the protobuf request.

Suggested update
-exporters_response = await svc.ListExporters(page_size=page_size, page_token=page_token, filter=filter)
+exporters_response = await svc.ListExporters(
+    page_size=page_size,
+    page_token=page_token,
+    filter=filter,
+    include_status=include_status,
+)

@kirkbrauer kirkbrauer requested a review from evakhoni February 9, 2026 21:41
@@ -0,0 +1,69 @@
---
Member

The CRD files in this dir need to be deleted; they were probably added by the Makefile before the rebases. They now live in deploy/helm/crds instead.

scope: Namespaced
versions:
- name: v1alpha1
- additionalPrinterColumns:
Member

This one is good :-)

@mangelajo
Member

🤦 sorry for all the conflicts now.

  • controller/deploy/operator/bundle has been deleted
  • controller/deploy/operator/dist/install.yaml deleted

The shell.py was updated. I am reviewing and testing it as-is; then we can rebase and merge. Let's prioritize this.

# --skip-crds: CRDs are managed via templates/crds/ instead of the special crds/ directory
helm ${METHOD} --namespace jumpstarter-lab \
--create-namespace \
--skip-crds \
Member

Suggested change
--skip-crds \

You need to remove this again once you remove the extra CRDs added in the Helm template dir.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@e2e/tests-hooks.bats`:
- Around line 127-128: The backgrounded single-run launcher is missing the fd-3
close which can let bats hang; modify the command that starts the one-off
exporter (the line that runs `jmp run --exporter test-exporter-hooks &` and
writes to `HOOKS_EXPORTER_PIDS_FILE`) to close file descriptor 3 before
backgrounding (i.e., ensure the process is started with `3>&-` so fd 3 is not
inherited), mirroring how `start_hooks_exporter` closes fd 3 to avoid leaving
bats' internal pipe open.

In `@python/packages/jumpstarter/jumpstarter/client/core.py`:
- Around line 323-339: The monitored waiter wait_for_lease_ready_monitored
currently only waits for ExporterStatus.LEASE_READY and will hang until timeout
if BEFORE_LEASE_HOOK_FAILED occurs; update wait_for_lease_ready_monitored (which
uses self._status_monitor) to call the monitor's wait_for_any_of (or equivalent)
for both ExporterStatus.LEASE_READY and ExporterStatus.BEFORE_LEASE_HOOK_FAILED,
then handle the result by returning/raising appropriately (e.g., return early or
raise DriverError when BEFORE_LEASE_HOOK_FAILED is observed) instead of only
awaiting wait_for_status.
- Around line 148-193: The loop in start_shell_wait_for_exporter (uses
get_status_async, ExporterStatus, poll_interval, timeout, poll_count) increments
elapsed by poll_interval only and thus ignores time spent in the RPC; wrap the
whole polling loop in anyio.move_on_after(timeout) (or use anyio.fail_after) so
the actual wall-clock timeout includes the GetStatus RPCs, preserve existing
logging on timeout and ensure you still await anyio.sleep(poll_interval) between
polls; remove or stop updating the manual elapsed counter (or replace it with
time.monotonic() if you prefer explicit tracking) so the timeout cannot be
overshot.

In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 182-214: wait_for_status's wait_loop can return False when the
monitor stopped after setting the desired status; update wait_for_status to
perform the same post-loop status verification as wait_for_any_of: after the
loop or move_on_after block completes (just before returning False), check
self._current_status == target and return True if so (while still preserving
connection_lost handling), ensuring you reference the existing wait_loop,
wait_for_status, _current_status, _running and _connection_lost symbols when
making the change.
🧹 Nitpick comments (9)
e2e/tests-hooks.bats (3)

9-10: Unused variable CURRENT_HOOKS_CONFIG.

This variable is declared but never read or written anywhere else in the file. Remove it to avoid confusion.

Proposed fix
-# Track which config is currently active
-CURRENT_HOOKS_CONFIG=""

92-131: Optional: extract shared config-merge logic.

Lines 98–102 and 121–125 are identical. A small helper like apply_hooks_config "$config_file" would reduce duplication and ensure both paths stay in sync.


349-350: Weak assertion for CLIENT_NAME — matches even when the value is empty.

Line 349 checks lease=[0-9a-f-]+ (must have a non-empty value), but line 350 only checks that client= appears in the output. Consider requiring a non-empty value:

Proposed fix
-  [[ "$output" =~ client= ]]
+  [[ "$output" =~ client=[a-zA-Z0-9_-]+ ]]
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (2)

154-172: Redundant re-imports of ExporterOfflineError inside elif branches.

ExporterOfflineError is already imported at line 154 (before the try). The re-imports at lines 168 and 172 are unnecessary.

♻️ Proposed cleanup
                                        elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
-                                            from jumpstarter.common.exceptions import ExporterOfflineError
                                             reason = monitor.status_message or "afterLease hook failed"
                                             raise ExporterOfflineError(reason)
                                        elif monitor.connection_lost:
-                                            from jumpstarter.common.exceptions import ExporterOfflineError

230-242: Broad exception group conversion may mask non-connection errors.

Lines 240–241 convert any BaseExceptionGroup into ExporterOfflineError("Connection to exporter lost") when lease_used is not None, even if the underlying exceptions are unrelated to connectivity (e.g., AttributeError, KeyError). This can make debugging harder.

Consider logging the original exception group before re-raising, or only converting when the group actually contains connection-related exceptions.

♻️ Suggested improvement
                 if offline_exc:
                     raise offline_exc from None
                 if lease_used is not None:
+                    logger.debug("Unhandled exception group during lease: %s", eg)
                     raise ExporterOfflineError("Connection to exporter lost") from None
python/packages/jumpstarter/jumpstarter/exporter/hooks.py (3)

183-191: Consider decomposing _execute_hook_process further.

This method is ~280 lines with # noqa: C901. The nested read_pty_output and wait_for_process coroutines could be extracted as private methods (or module-level helpers accepting pty_state and process as arguments) to reduce cognitive load and make individual pieces independently testable.
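
A rough sketch of that decomposition, with the nested coroutines promoted to private methods that take their state explicitly (bodies elided; pty_state and process follow the names in the comment):

import anyio


class HookExecutorSketch:
    async def _read_pty_output(self, pty_state):
        ...  # stream PTY output into the hook log

    async def _wait_for_process(self, process):
        ...  # await process exit and record the return code

    async def _execute_hook_process(self, pty_state, process):
        # The orchestration shrinks to wiring the two halves together, and
        # each half can now be unit-tested on its own.
        async with anyio.create_task_group() as tg:
            tg.start_soon(self._read_pty_output, pty_state)
            tg.start_soon(self._wait_for_process, process)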


229-252: Minor inconsistency: command vs script_stripped passed to interpreter.

Line 249 uses script_stripped for the file path (correct—no leading/trailing whitespace), but line 252 passes the original command (with potential whitespace) to interpreter -c. This is harmless since shells handle it fine, but for consistency you could use script_stripped in both branches.

Suggested fix
                 else:
                     logger.info("Executing inline script (interpreter: %s)", interpreter)
-                    cmd = [interpreter, "-c", command]
+                    cmd = [interpreter, "-c", script_stripped]

502-508: Consider clarifying in the docstring why request_lease_release is accepted but intentionally unused.

The request_lease_release parameter is documented and intentionally not used—the inline comment at line 576 explains the design: "No request_lease_release - client will discover failure and release." The parameter is kept for API parity with run_after_lease_hook and accepted by callers. However, the docstring does not explicitly explain why the parameter exists despite not being invoked. Adding a note like "Note: This parameter is reserved for consistency with run_after_lease_hook and may be used by clients expecting future enhancements" would clarify the intent and prevent confusion.

python/packages/jumpstarter/jumpstarter/client/status_monitor.py (1)

343-343: Use lazy %-formatting in logger calls instead of f-strings.

Several logger calls use f-strings (lines 343, 354, 371, 414, 416), which eagerly format the string even when the log level is disabled. Prefer logger.warning("Missed %d status transition(s)", missed) style.

Also applies to: 354-354, 371-371, 414-416

Comment on lines +127 to +128
jmp run --exporter test-exporter-hooks &
echo "$!" >> "$HOOKS_EXPORTER_PIDS_FILE"
Contributor

⚠️ Potential issue | 🟡 Minor

Missing 3>&- on backgrounded process — can cause bats to hang.

start_hooks_exporter (line 104) correctly closes fd 3 (3>&-) to prevent bats from hanging on its internal pipe. This single-run variant backgrounds jmp run directly without closing fd 3, so bats may block waiting for the child to release the descriptor.

Proposed fix
-  jmp run --exporter test-exporter-hooks &
+  jmp run --exporter test-exporter-hooks 3>&- &
🤖 Prompt for AI Agents
In `@e2e/tests-hooks.bats` around lines 127 - 128, The backgrounded single-run
launcher is missing the fd-3 close which can let bats hang; modify the command
that starts the one-off exporter (the line that runs `jmp run --exporter
test-exporter-hooks &` and writes to `HOOKS_EXPORTER_PIDS_FILE`) to close file
descriptor 3 before backgrounding (i.e., ensure the process is started with
`3>&-` so fd 3 is not inherited), mirroring how `start_hooks_exporter` closes fd
3 to avoid leaving bats' internal pipe open.

Comment on lines +148 to +193
poll_interval = 0.5 # seconds
elapsed = 0.0
poll_count = 0

self.logger.debug("Waiting for exporter to be ready...")
while elapsed < timeout:
poll_count += 1
self.logger.debug("[POLL %d] Calling GetStatus (elapsed: %.1fs)...", poll_count, elapsed)
try:
status = await self.get_status_async()
self.logger.debug("[POLL %d] GetStatus returned: %s", poll_count, status)
except Exception as e:
# Connection error - keep trying
self.logger.debug("[POLL %d] Error getting status, will retry: %s", poll_count, e)
await anyio.sleep(poll_interval)
elapsed += poll_interval
continue

if status is None:
# GetStatus not implemented - assume ready for backward compatibility
self.logger.debug("[POLL %d] GetStatus not implemented, assuming ready", poll_count)
return

if status == ExporterStatus.LEASE_READY:
self.logger.info("Exporter ready, starting shell...")
return
elif status == ExporterStatus.BEFORE_LEASE_HOOK:
# Hook is running - this is expected, keep waiting
self.logger.debug("[POLL %d] beforeLease hook is running...", poll_count)
elif status == ExporterStatus.BEFORE_LEASE_HOOK_FAILED:
# Hook failed - log but continue (exporter may still be usable)
self.logger.warning("beforeLease hook failed")
return
elif status == ExporterStatus.AVAILABLE:
# Exporter is available but not yet leased - keep waiting
# This can happen if client connects before exporter receives lease assignment
self.logger.debug("[POLL %d] Exporter status: AVAILABLE (waiting for lease assignment)", poll_count)
else:
# Other status - continue waiting
self.logger.debug("[POLL %d] Exporter status: %s (waiting...)", poll_count, status)

self.logger.debug("[POLL %d] Sleeping for %.1fs before next poll...", poll_count, poll_interval)
await anyio.sleep(poll_interval)
elapsed += poll_interval

self.logger.warning("Timeout waiting for beforeLease hook to complete (after %d polls)", poll_count)
Contributor

⚠️ Potential issue | 🟡 Minor

Elapsed time tracking doesn't account for RPC duration, potentially overshooting timeout.

elapsed is incremented by poll_interval after each iteration, but the actual time includes the GetStatus RPC call (which has a 5-second gRPC timeout in the poll loop). In the worst case each iteration takes 5s (RPC) + 0.5s (sleep) while only 0.5s is tracked, so the method could run roughly 11x longer than the requested timeout.

Consider using anyio.move_on_after(timeout) to wrap the entire loop, similar to how the StatusMonitor methods handle timeouts.

🛠️ Proposed fix
     async def wait_for_lease_ready(self, timeout: float = 300.0) -> None:
         import anyio
 
-        poll_interval = 0.5  # seconds
-        elapsed = 0.0
-        poll_count = 0
-
+        poll_interval = 0.5
+        poll_count = 0
         self.logger.debug("Waiting for exporter to be ready...")
-        while elapsed < timeout:
+        with anyio.move_on_after(timeout):
+          while True:
             poll_count += 1
-            self.logger.debug("[POLL %d] Calling GetStatus (elapsed: %.1fs)...", poll_count, elapsed)
+            self.logger.debug("[POLL %d] Calling GetStatus...", poll_count)
             # ... rest of loop body unchanged, remove elapsed tracking ...
+          return  # unreachable, but satisfies the flow
 
         self.logger.warning("Timeout waiting for beforeLease hook to complete (after %d polls)", poll_count)
🤖 Prompt for AI Agents
In `@python/packages/jumpstarter/jumpstarter/client/core.py` around lines 148 -
193, The loop in start_shell_wait_for_exporter (uses get_status_async,
ExporterStatus, poll_interval, timeout, poll_count) increments elapsed by
poll_interval only and thus ignores time spent in the RPC; wrap the whole
polling loop in anyio.move_on_after(timeout) (or use anyio.fail_after) so the
actual wall-clock timeout includes the GetStatus RPCs, preserve existing logging
on timeout and ensure you still await anyio.sleep(poll_interval) between polls;
remove or stop updating the manual elapsed counter (or replace it with
time.monotonic() if you prefer explicit tracking) so the timeout cannot be
overshot.

Comment on lines +323 to +339
async def wait_for_lease_ready_monitored(self, timeout: float = 300.0) -> None:
"""Wait for LEASE_READY status using background monitor.

Non-blocking to other async tasks. Requires status_monitor_async context.

Args:
timeout: Maximum time to wait in seconds (default: 5 minutes)
"""
if not self._status_monitor:
raise RuntimeError("Status monitor not started. Use status_monitor_async() context.")

reached = await self._status_monitor.wait_for_status(
ExporterStatus.LEASE_READY,
timeout=timeout
)
if not reached:
raise DriverError("Timeout waiting for LEASE_READY status")
Contributor

⚠️ Potential issue | 🟠 Major

wait_for_lease_ready_monitored doesn't handle BEFORE_LEASE_HOOK_FAILED, causing a 300s timeout on hook failure.

The polling-based wait_for_lease_ready (line 177-180) correctly handles BEFORE_LEASE_HOOK_FAILED by returning early. However, the monitored version only waits for LEASE_READY, so a failed beforeLease hook would block for the full timeout duration.

🛠️ Proposed fix: use wait_for_any_of
     async def wait_for_lease_ready_monitored(self, timeout: float = 300.0) -> None:
         if not self._status_monitor:
             raise RuntimeError("Status monitor not started. Use status_monitor_async() context.")
 
-        reached = await self._status_monitor.wait_for_status(
-            ExporterStatus.LEASE_READY,
+        result = await self._status_monitor.wait_for_any_of(
+            [ExporterStatus.LEASE_READY, ExporterStatus.BEFORE_LEASE_HOOK_FAILED],
             timeout=timeout
         )
-        if not reached:
+        if result is None:
             raise DriverError("Timeout waiting for LEASE_READY status")
+        if result == ExporterStatus.BEFORE_LEASE_HOOK_FAILED:
+            self.logger.warning("beforeLease hook failed")
🤖 Prompt for AI Agents
In `@python/packages/jumpstarter/jumpstarter/client/core.py` around lines 323 -
339, The monitored waiter wait_for_lease_ready_monitored currently only waits
for ExporterStatus.LEASE_READY and will hang until timeout if
BEFORE_LEASE_HOOK_FAILED occurs; update wait_for_lease_ready_monitored (which
uses self._status_monitor) to call the monitor's wait_for_any_of (or equivalent)
for both ExporterStatus.LEASE_READY and ExporterStatus.BEFORE_LEASE_HOOK_FAILED,
then handle the result by returning/raising appropriately (e.g., return early or
raise DriverError when BEFORE_LEASE_HOOK_FAILED is observed) instead of only
awaiting wait_for_status.

Comment on lines +182 to +214
async def wait_loop():
# Increment active waiters to force fast polling
self._active_waiters += 1
try:
while self._running and not self._connection_lost:
# Check current status
if self._current_status == target:
return True

# Capture event reference before waiting
current_event = self._any_change_event

# Double-check after capturing
if self._current_status == target:
return True

# Check for connection lost
if self._connection_lost:
logger.debug("Connection lost while waiting")
return False

# Wait for any status change
await current_event.wait()
return False
finally:
self._active_waiters -= 1

if timeout is not None:
with anyio.move_on_after(timeout):
return await wait_loop()
return False
else:
return await wait_loop()
Contributor

⚠️ Potential issue | 🟠 Major

wait_for_status misses the target when the monitor stops after setting the status.

When _signal_unsupported() (or stop()) sets _current_status to LEASE_READY and simultaneously sets _running = False, wait_loop wakes from the event but the while self._running check fails immediately, returning False — even though the target status was reached.

wait_for_any_of already handles this correctly with a post-loop status check (lines 289-294). wait_for_status lacks the equivalent.

🐛 Proposed fix: add post-loop check matching wait_for_any_of
                     await current_event.wait()
-                return False
+                # Loop exited (running=False or connection_lost) - check status one
+                # last time. The poll loop may have set _current_status (e.g. to
+                # LEASE_READY when GetStatus is unimplemented) before stopping.
+                if self._current_status == target:
+                    return True
+                return False
             finally:
                 self._active_waiters -= 1
🤖 Prompt for AI Agents
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py` around
lines 182 - 214, wait_for_status's wait_loop can return False when the monitor
stopped after setting the desired status; update wait_for_status to perform the
same post-loop status verification as wait_for_any_of: after the loop or
move_on_after block completes (just before returning False), check
self._current_status == target and return True if so (while still preserving
connection_lost handling), ensuring you reference the existing wait_loop,
wait_for_status, _current_status, _running and _connection_lost symbols when
making the change.

// the afterLease hook completes, ensuring leases are always released even if
// the client disconnects unexpectedly.
if req.GetReleaseLease() {
if err := s.handleExporterLeaseRelease(ctx, exporter); err != nil {
Member

What happens if the exporter goes offline and never reports the release request? Is there some sort of timeout mechanism or alternative path?

func checkExporterStatusForDriverCalls(exporterStatus string) error {
switch exporterStatus {
case jumpstarterdevv1alpha1.ExporterStatusLeaseReady,
jumpstarterdevv1alpha1.ExporterStatusBeforeLeaseHook,
Member

Won't this allow clients to dial in while before/after-lease hooks are running?

deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml

cp deploy/helm/jumpstarter/crds/* deploy/operator/config/crd/bases/
cp deploy/helm/jumpstarter/crds/* deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/
Member

We should not copy this (it goes along with the helm install --skip-crds flag that needs to be removed).

Comment on lines +70 to +74
@awk '/^ name: jumpstarter-manager-role$$/{print; print " annotations:"; print " argocd.argoproj.io/sync-wave: \"-1\""; next}1' \
deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml > \
deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml.tmp && \
mv deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml.tmp \
deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml
Member

We should just make these changes directly in the Helm charts.

from jumpstarter.exporter import Session
from jumpstarter.utils.env import env

if TYPE_CHECKING:
Member

After talking to @kirkbrauer: this was done because of a cyclic import problem.

from jumpstarter.driver import Driver

if TYPE_CHECKING:
from jumpstarter.driver import Driver
Member

Same cyclic-import issue; it's good.


# Create hook executor if hooks are configured
hook_executor = None
if self.hooks.before_lease or self.hooks.after_lease:
Member

Not enabled as long as those fields are empty 👍

None - caller manages socket paths externally
"""
server = grpc.aio.server()
for port in ports:
Member

This is nice; it will be useful when we want to make the exporter listen on TCP too.
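
As a hedged sketch of what that could look like: grpc.aio accepts both unix: and host:port listener addresses, so the same loop covers TCP (the port list here is illustrative):

import grpc


async def serve_on(ports):
    server = grpc.aio.server()
    for port in ports:  # e.g. ["unix:///run/jumpstarter/socket", "[::]:50051"]
        server.add_insecure_port(port)
    await server.start()
    return server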

@evakhoni
Contributor

evakhoni commented Feb 10, 2026

@kirkbrauer I've seen that you added logic to allow ending a lease from the exporter side. I wonder: is shutting down the exporter supposed to end the lease as well, or is that functionality reserved for a hook with onFailure: endLease?

Here I shut down the exporter:

INFO:jumpstarter.exporter.session:GetStatus() -> LEASE_READY (version=2)
INFO:jumpstarter.exporter.session:GetStatus() -> LEASE_READY (version=2)
^CINFO:jumpstarter_cli.run:PARENT: Got 2 (SIGINT), forwarding to child PG 1858585
INFO:jumpstarter_cli.run:CHILD: Received 2 (SIGINT)
INFO:jumpstarter.exporter.exporter:Stopping exporter immediately, unregistering from controller
INFO:jumpstarter.exporter.exporter:Updated status to AVAILABLE: Available for new lease
INFO:jumpstarter.exporter.session:Stopping session server...
INFO:jumpstarter.exporter.exporter:Unregistering exporter with controller
INFO:jumpstarter.exporter.exporter:Updated status to OFFLINE: Exporter shutting down
INFO:jumpstarter.exporter.exporter:Controller unregistration completed successfully

but the lease is still up:

python ⚡hooks-test-none ➤ uv run jmp get leases --client hooks-client 
NAME                                  SELECTOR    BEGIN TIME           DURATION  CLIENT        EXPORTER       
019c492c-e4b5-785e-bfdf-d191b5a72afe  hooks=none  2026-02-10 22:10:04  0:30:00   hooks-client  hooks-test-none

Is this expected behavior?

@kirkbrauer
Member Author

Hi @evakhoni,

That is a good point. Currently, the mechanism is only built to handle the case where a failed hook needs to end the lease. I did not add any additional functionality to end the lease on exporter shutdown, but that's a good enhancement to have.

@evakhoni
Contributor

Ack. I have opened #212 to track it separately; we can probably address it in a separate PR later on.

@evakhoni
Contributor

evakhoni commented Feb 11, 2026

Tried an old client against a new exporter/controller pair; it doesn't work:

│ │ │ AioRpcError: <AioRpcError of RPC that terminated with:                                   │ │ │
│ │ │         status = StatusCode.FAILED_PRECONDITION                                          │ │ │
│ │ │         details = "exporter is not ready (status: Available)"                            │ │ │
│ │ │         debug_error_string = "UNKNOWN:Error received from peer ipv4:10.0.0.1:8082        │ │ │
│ │ │ {grpc_status:9, grpc_message:"exporter is not ready (status: Available)"}"               │ │ │
│ │ │ >                                                                                        │ │ │
│ │ ╰──────────────────────────────────────────────────────────────────────────────────────────╯ │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯

Interestingly, if I start the client first and then start the exporter while the client is waiting for it to come online, it connects with no such error observed.

@evakhoni
Contributor

evakhoni commented Feb 11, 2026

A momentary restart of an exporter underneath an open client connection results in the following error later, when the client disconnects:

python ⚡hooks-test-none ➤ exit
[02/11/26 11:24:31] INFO     Running afterLease hook (Ctrl+C to skip)... [jumpstarter_cli.shell]                                                                                                                             
[02/11/26 11:24:38] INFO     Releasing Lease 019c4c03-de44-7068-9616-b3d0ef267441 [jumpstarter.client.lease]                                                                                                                 
Error: Ready for commands

If, however, a client command is sent during this brief exporter restart, we observe the following crash immediately:

│ │ AioRpcError: <AioRpcError of RPC that terminated with:                                       │ │
│ │         status = StatusCode.UNAVAILABLE                                                      │ │
│ │         details = "failed to connect to all addresses; last error: UNAVAILABLE:              │ │
│ │ unix:/run/user/1000/jumpstarter-brod65vz/socket: recvmsg:Connection reset by peer"           │ │
│ │         debug_error_string = "UNKNOWN:Error received from peer  {grpc_status:14,             │ │
│ │ grpc_message:"failed to connect to all addresses; last error: UNAVAILABLE:                   │ │
│ │ unix:/run/user/1000/jumpstarter-brod65vz/socket: recvmsg:Connection reset by peer"}"         │ │
│ │ >                                                                                            │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯

This is recoverable: the client can send calls again once the exporter finishes restarting. However, after such a recovery we receive the following, more severe error when exiting the client session:

python ⚡hooks-test-none ➤ exit
[02/11/26 11:30:58] INFO     Running afterLease hook (Ctrl+C to skip)... [jumpstarter_cli.shell]                                                                                                                             
                    INFO     Connection recovered during verification poll [jumpstarter.client.status_monitor]                                                                                                               
                    INFO     afterLease hook completed [jumpstarter_cli.shell]                                                                                                                                               
                    ERROR    Exception in callback Future.set_result()                                                                                                                                                       
                             handle: <Handle Future.set_result()> [asyncio]                                                                                                                                                  
                             Traceback (most recent call last):                                                                                                                                                              
                               File "/usr/lib64/python3.13/asyncio/events.py", line 89, in _run                                                                                                                              
                                 self._context.run(self._callback, *self._args)                                                                                                                                              
                                 ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                              
                             asyncio.exceptions.InvalidStateError: invalid state                                                                                                                                             
                    INFO     Releasing Lease 019c4c08-b9a8-70b8-85fd-db5e11d982de [jumpstarter.client.lease]       

Note: this behavior is observed with no hooks set in the exporter. I haven't tested with hooks present yet, but I suspect irregularities may be observed.


Edit: this was with the code just before today's commit dc7832f. I have since updated to the latest; for now I was unable to reproduce the second, more severe error, and I receive Error: Ready for commands in both cases.

@evakhoni
Contributor

A new controller with an old exporter doesn't work; the controller marks it as offline:

 ~/repos/jumpstarter/python  ➦ 758e70be  jmp version
Jumpstarter v0.8.1rc1 from /home/evakhoni/.local/jumpstarter/venv/lib64/python3.14/site-packages (Python 3.14.2 (main, Dec  5 2025, 00:00:00) [GCC 15.2.1 20251111 (Red Hat 15.2.1-4)])
 ~/repos/jumpstarter/python  ➦ 758e70be  jmp run --exporter hooks-test-none 
INFO:jumpstarter.exporter.exporter:Registering exporter with controller
INFO:jumpstarter.exporter.exporter:Currently not leased

In another terminal:

 ~/repos/jumpstarter/python  ➦ 758e70be  uv run jmp get exporters --with leases,online
NAME             ONLINE  LABELS      LEASED BY  LEASE STATUS  RELEASE TIME
hooks-test-both  no      hooks=both             Available                 
hooks-test-none  no      hooks=none             Available      

@mangelajo mangelajo self-requested a review February 12, 2026 09:47
# These jobs can be removed once 0.7.x controller support is no longer needed.
# ============================================================================

compat-old-controller:
Member

Suggested change
compat-old-controller:
e2e-compat-old-controller:

nit

env:
CI: true

compat-old-client:
Member

Suggested change
compat-old-client:
e2e-compat-old-client:

Another nit.

But very cool that you found a way to do backwards-compat testing!


Labels

build-pr-images, enhancement (New feature or request), go (Pull requests that update go code), protocol, python (Pull requests that update python code)
