Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough

Adds exporter lifecycle hooks (beforeLease/afterLease), exporter status fields/enums and status_message, a new ReportStatus release_lease flag and EndSession RPC, StatusMonitor and client status APIs, HookExecutor and LeaseContext, CRD/schema and printer-column updates, protobuf/gRPC regeneration, CLI/status rendering, extensive tests and e2e hook configs, and small Helm/manifest build tweaks.
Sequence Diagram(s)sequenceDiagram
rect rgba(200,230,255,0.5)
participant Client
participant Exporter
participant HookExec as HookExecutor
participant Controller
end
Client->>Exporter: Request lease / Dial
Exporter->>Controller: ReportStatus(status=BEFORE_LEASE_HOOK)
Exporter->>HookExec: execute_before_lease_hook(LeaseContext)
HookExec-->>Exporter: result (success / failure)
alt before-hook success
Exporter->>Controller: ReportStatus(status=LEASE_READY)
Client->>Exporter: Driver calls allowed
else onFailure=endLease
Exporter->>Controller: ReportStatus(..., release_lease=true)
Controller->>Controller: mark lease.release = true
end
Client->>Exporter: End session
Exporter->>HookExec: execute_after_lease_hook(...)
HookExec-->>Exporter: after-hook done
Exporter->>Controller: ReportStatus(status=OFFLINE)
```mermaid
sequenceDiagram
    rect rgba(230,255,200,0.5)
    participant AsyncClient
    participant StatusMon as StatusMonitor
    participant ExporterSvc as ExporterService
    end
    AsyncClient->>StatusMon: start()
    loop polling
        StatusMon->>ExporterSvc: GetStatus()
        ExporterSvc-->>StatusMon: status (+version, previous_status)
        StatusMon-->>AsyncClient: notify on change / wake waiters
    end
    AsyncClient->>StatusMon: wait_for_status(LEASE_READY)
    StatusMon-->>AsyncClient: signal when LEASE_READY observed
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
python/packages/jumpstarter/jumpstarter/config/client.py (1)
184-196: Add `status` field to exporter reconstruction to preserve it when `include_leases=True`.

The exporter objects lose their `status` field when leases are included. The original exporters have a `status` field that should be preserved during reconstruction, especially when `include_status=True` is requested. Update lines 187-193 to include `status=exporter.status`.

Code snippet

```python
exporter_with_lease = Exporter(
    namespace=exporter.namespace,
    name=exporter.name,
    labels=exporter.labels,
    online=exporter.online,
    status=exporter.status,  # Add this line
    lease=lease,
)
```

python/packages/jumpstarter/jumpstarter/config/exporter.py (1)
224-246: Remove unused `device_factory` parameter from `HookExecutor`.

The `device_factory` parameter passed to `HookExecutor` at line 231 is never used. `HookExecutor` only executes shell commands via subprocess and accesses the session for logging; it has no need for driver instantiation. Remove the unused parameter from both the `HookExecutor` dataclass definition and the instantiation call.
🤖 Fix all issues with AI agents
In
`@controller/deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/jumpstarter.dev_clients.yaml`:
- Around line 20-47: Update the CRD description strings so they reference
"Client" consistently instead of "identities" or "Identity": change the top
description "Client is the Schema for the identities API" to a concise "Client
is the Schema for the Client API" (or simply "Client"), and update the
spec.description "ClientSpec defines the desired state of Identity" and
status.description "ClientStatus defines the observed state of Identity" to use
"Client" (e.g., "ClientSpec defines the desired state of Client" and
"ClientStatus defines the observed state of Client") so all descriptive fields
in the CRD (the description at the top and the
spec.description/status.description entries) no longer reference
Identity/identities.
In
`@controller/deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/jumpstarter.dev_leases.yaml`:
- Around line 30-31: Update the CRD description for the Lease resource: replace
the current "Lease is the Schema for the exporters API" text in the description
property of the Lease CRD with the corrected wording using "leases" (e.g.,
"Leases is the Schema for the exporters API" or "leases is the Schema for the
exporters API") so the description uses the plural form; locate the description
property under the Lease CRD (kind: Lease) in the CRD template and change that
string accordingly.
In `@python/packages/jumpstarter-cli-common/jumpstarter_cli_common/opt.py`:
- Line 19: basicConfig assignment line exceeds max line length; split the call
across multiple lines to satisfy Ruff E501. Locate the basicConfig variable
initialization (basicConfig = partial(logging.basicConfig, format="%(message)s
[%(name)s]", handlers=[RichHandler(show_path=False)]) ) and reformat the
arguments onto separate lines (e.g., place format=... and handlers=[...] each on
their own lines and close parens on its own line) so the call to
partial(logging.basicConfig, ...) fits within line-length limits while
preserving the same arguments and use of RichHandler(show_path=False).
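A minimal sketch of what the reformatted call could look like, assuming the module keeps the same `partial`/`RichHandler` arguments quoted in the item above:

```python
from functools import partial
import logging

from rich.logging import RichHandler

# Same arguments as before, split across lines to stay under the E501 limit.
basicConfig = partial(
    logging.basicConfig,
    format="%(message)s [%(name)s]",
    handlers=[RichHandler(show_path=False)],
)
```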
In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py`:
- Around line 80-89: Remove the unused synchronous helper _run_shell_with_lease:
delete the entire function definition (including its docstring and with-blocks)
since the async variant _run_shell_with_lease_async is now the only one used;
then search the repo for any remaining references to _run_shell_with_lease and
remove or update them, run the test suite/lint to confirm no import or name
errors, and ensure behavior relies solely on _run_shell_with_lease_async.
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 395-401: The async context manager on StatusMonitor is incomplete:
__aenter__ currently just returns self and never starts the background polling;
update __aenter__ to initialize the task group and call self.start() (mirroring
client.status_monitor_async() behavior) so entering the context begins
monitoring, and keep __aexit__ awaiting self.stop(); alternatively remove both
methods if context-manager usage is unsupported—modify the methods on the
StatusMonitor class (__aenter__, __aexit__) to either perform full start/stop
lifecycle (create task group + await self.start() in __aenter__, await
self.stop() in __aexit__) or delete them to avoid silent failures.
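A hedged sketch of the start/stop lifecycle the item above asks for; the real StatusMonitor internals (task group wiring, `start()`/`stop()` signatures) may differ:

```python
class _StatusMonitorSketch:
    """Illustrative only; stands in for the real StatusMonitor."""

    async def start(self) -> None:
        ...  # begin background polling (task group + poll loop)

    async def stop(self) -> None:
        ...  # cancel polling and wake waiters

    async def __aenter__(self):
        # Actually start monitoring on enter instead of just returning self,
        # mirroring client.status_monitor_async().
        await self.start()
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.stop()
        return False
```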
In `@python/packages/jumpstarter/jumpstarter/common/enums.py`:
- Around line 26-27: The docstring for the enum value AFTER_LEASE_HOOK (mapped
to common_pb2.EXPORTER_STATUS_AFTER_LEASE_HOOK) contains a typo; update the
string from "Lease was releaseed, but exporter is executing after lease hook" to
"Lease was released, but exporter is executing after lease hook" so the
documentation reads correctly.
In `@python/packages/jumpstarter/jumpstarter/exporter/exporter.py`:
- Around line 541-613: The Listen retry task started with
tg.start_soon(self._retry_stream, "Listen",
self._listen_stream_factory(lease_name), listen_tx) is attached to the outer
task group so it keeps running after the lease ends; move that start call into
the lease-scoped task group (use conn_tg.start_soon(...) before starting
wait_for_lease_end/process_connections) so the Listen task is cancelled with the
lease, and in the enclosing finally block ensure the send stream is closed
(await listen_tx.aclose() or the appropriate close method) to signal listen_rx
termination; reference symbols: _retry_stream,
_listen_stream_factory(lease_name), listen_tx/listen_rx, tg -> conn_tg, and
conn_tg.start_soon.
- Around line 202-321: Change _get_controller_stub to be an async context
manager (e.g., async def _controller_stub(self):) that creates the channel via
self.channel_factory(), yields a
jumpstarter_pb2_grpc.ControllerServiceStub(channel) and always closes the
underlying channel with await channel.close() in a finally block; then update
all callers to use "async with self._controller_stub() as controller:" instead
of awaiting _get_controller_stub(), specifically in _retry_stream (where
controller is acquired for stream_factory), _register_with_controller (Register
call), and _report_status (ReportStatus call) so channels are closed after use
and no resources leak.
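One possible shape for the suggested context manager, assuming `self.channel_factory()` and `jumpstarter_pb2_grpc.ControllerServiceStub` as referenced in the item above:

```python
from contextlib import asynccontextmanager

@asynccontextmanager
async def _controller_stub(self):
    # Create a channel per use and always close it, even on error.
    channel = self.channel_factory()
    try:
        yield jumpstarter_pb2_grpc.ControllerServiceStub(channel)
    finally:
        await channel.close()
```

Callers would then use `async with self._controller_stub() as controller:` in `_retry_stream`, `_register_with_controller`, and `_report_status`.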
In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 299-307: wait_for_process currently uses
anyio.to_thread.run_sync(process.wait, abandon_on_cancel=True) which can leave
the child as a zombie if the wait is abandoned; modify wait_for_process to
ensure the subprocess is reaped on cancellation by adding a try/finally: in the
try await the run_sync as before, but in the finally check process.poll() and if
it is None try to terminate/kill (process.terminate()/process.kill()) then call
a non-abandoning run_sync(process.wait) or loop-poll with process.poll() until
exit to fully reap the child; reference the wait_for_process coroutine and the
process variable when making the changes.
- Around line 257-260: The debug call inside read_pty_output exceeds the
120-char line limit; shorten the line by splitting the log message across
multiple strings or moving the format string to a short variable and then
calling logger.debug. For example, break the message into two adjacent string
literals or assign msg = "read_pty_output: heartbeat at %.1fs, iterations=%d"
and call logger.debug(msg, elapsed, read_count) so you keep the same
logger.debug call and the same format arguments (logger, elapsed, read_count)
but comply with the E501 line-length rule.
- Around line 186-189: The test mock for logging_session must support the
context manager used by logging_session.context_log_source in hooks.py; update
the test fixture (e.g., mock_session) so mock_session.context_log_source returns
an object that implements __enter__ and __exit__ (for async usage you can set
mock_session.context_log_source.return_value.__enter__ and __exit__ to AsyncMock
or wrap with contextlib.nullcontext), ensuring the context manager protocol is
honored when hooks.py calls logging_session.context_log_source.
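For the `wait_for_process` item above, a sketch of the reap-on-cancel pattern, assuming `process` is a `subprocess.Popen`:

```python
import anyio

async def wait_for_process(process) -> None:
    try:
        await anyio.to_thread.run_sync(process.wait, abandon_on_cancel=True)
    finally:
        # If the wait was abandoned, make sure the child is still reaped.
        if process.poll() is None:
            try:
                process.terminate()
            except ProcessLookupError:
                pass
            # Non-abandoning wait so the zombie is collected.
            await anyio.to_thread.run_sync(process.wait)
```

For the test-mock item, a sketch of a fixture whose `context_log_source` honors the context-manager protocol; `mock_session` is the hypothetical fixture name used in the comment:

```python
from contextlib import nullcontext
from unittest.mock import MagicMock

mock_session = MagicMock()
# hooks.py enters `with logging_session.context_log_source(...):`, so the
# mocked attribute must return something usable as a context manager.
mock_session.context_log_source.return_value = nullcontext()
```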
🧹 Nitpick comments (10)
controller/deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/jumpstarter.dev_exporteraccesspolicies.yaml (2)
149-152: Add schema validation for duration and priority.

Without constraints, malformed durations or negative priorities can be accepted and only fail at runtime. Consider adding format/bounds in the CRD.
♻️ Suggested schema tightening
```diff
       maximumDuration:
         type: string
+        format: duration
       priority:
         type: integer
+        minimum: 0
```
40-157: Require spec/policies to avoid empty CRs.

Right now an empty resource (missing spec or policies) is valid. If that’s not intended, enforce it in the schema to reduce no-op objects.
♻️ Suggested required fields
```diff
   spec:
     description: ExporterAccessPolicySpec defines the desired state of ExporterAccessPolicy.
     properties:
       exporterSelector:
         description: |-
           A label selector is a label query over a set of resources. The result of matchLabels and matchExpressions are ANDed. An empty label selector matches all objects. A null label selector matches no objects.
         properties:
           matchExpressions:
             description: matchExpressions is a list of label selector requirements. The requirements are ANDed.
             items:
               description: |-
                 A label selector requirement is a selector that contains values, a key, and an operator that relates the key and values.
               properties:
                 key:
                   description: key is the label key that the selector applies to.
                   type: string
                 operator:
                   description: |-
                     operator represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists and DoesNotExist.
                   type: string
                 values:
                   description: |-
                     values is an array of string values. If the operator is In or NotIn, the values array must be non-empty. If the operator is Exists or DoesNotExist, the values array must be empty. This array is replaced during a strategic merge patch.
                   items:
                     type: string
                   type: array
                   x-kubernetes-list-type: atomic
               required:
               - key
               - operator
               type: object
             type: array
             x-kubernetes-list-type: atomic
           matchLabels:
             additionalProperties:
               type: string
             description: |-
               matchLabels is a map of {key,value} pairs. A single {key,value} in the matchLabels map is equivalent to an element of matchExpressions, whose key field is "key", the operator is "In", and the values array contains only "value". The requirements are ANDed.
             type: object
         type: object
         x-kubernetes-map-type: atomic
       policies:
         items:
           properties:
             from:
               items:
                 properties:
                   clientSelector:
                     description: |-
                       A label selector is a label query over a set of resources. The result of matchLabels and matchExpressions are ANDed. An empty label selector matches all objects. A null label selector matches no objects.
                     properties:
                       matchExpressions:
                         description: matchExpressions is a list of label selector requirements. The requirements are ANDed.
                         items:
                           description: |-
                             A label selector requirement is a selector that contains values, a key, and an operator that relates the key and values.
                           properties:
                             key:
                               description: key is the label key that the selector applies to.
                               type: string
                             operator:
                               description: |-
                                 operator represents a key's relationship to a set of values. Valid operators are In, NotIn, Exists and DoesNotExist.
                               type: string
                             values:
                               description: |-
                                 values is an array of string values. If the operator is In or NotIn, the values array must be non-empty. If the operator is Exists or DoesNotExist, the values array must be empty. This array is replaced during a strategic merge patch.
                               items:
                                 type: string
                               type: array
                               x-kubernetes-list-type: atomic
                           required:
                           - key
                           - operator
                           type: object
                         type: array
                         x-kubernetes-list-type: atomic
                       matchLabels:
                         additionalProperties:
                           type: string
                         description: |-
                           matchLabels is a map of {key,value} pairs. A single {key,value} in the matchLabels map is equivalent to an element of matchExpressions, whose key field is "key", the operator is "In", and the values array contains only "value". The requirements are ANDed.
                         type: object
                     type: object
                     x-kubernetes-map-type: atomic
                 type: object
               type: array
             maximumDuration:
               type: string
             priority:
               type: integer
             spotAccess:
               type: boolean
           type: object
         type: array
+        minItems: 1
     type: object
+    required:
+    - policies
   status:
     description: ExporterAccessPolicyStatus defines the observed state of ExporterAccessPolicy.
     type: object
 type: object
+required:
+- spec
```

Also applies to: 162-162
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)
157-161: Redundant timeout configuration.

Both `move_on_after(300)` and `wait_for_any_of(..., timeout=300.0)` specify 300 seconds. Since `wait_for_any_of` has its own timeout, the outer `move_on_after` is redundant. Consider either removing the outer timeout or differentiating them if different semantics are intended (e.g., outer as a hard cap slightly longer than inner).

♻️ Suggested simplification

```diff
-        with anyio.move_on_after(300) as timeout_scope:  # 5 minute timeout
-            result = await monitor.wait_for_any_of(
-                [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-                timeout=300.0
-            )
+        result = await monitor.wait_for_any_of(
+            [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+            timeout=300.0
+        )
+        if result is None:
+            logger.warning("Timeout waiting for afterLease hook to complete")
```

Note: This would require adjusting the subsequent conditional logic (lines 162-172) to handle the timeout case inline rather than via `timeout_scope.cancelled_caught`.

controller/Makefile (1)
66-68: Make the CRD copy step more robust with mkdir.

If the templates directory is missing in a fresh checkout, `cp` will fail.

♻️ Suggested tweak

```diff
 	cp deploy/helm/jumpstarter/crds/* deploy/operator/config/crd/bases/
+	mkdir -p deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/
 	cp deploy/helm/jumpstarter/crds/* deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/
```

controller/internal/service/controller_service.go (1)
318-338: Consider logging unexpected status values in the default case.

The `protoStatusToString` function silently maps unknown proto status values to `ExporterStatusUnspecified`. While this provides a safe fallback, logging at debug/warning level would help diagnose protocol mismatches.

♻️ Optional: Add logging for unexpected status values

```diff
 func protoStatusToString(status pb.ExporterStatus) string {
 	switch status {
 	case pb.ExporterStatus_EXPORTER_STATUS_OFFLINE:
 		return jumpstarterdevv1alpha1.ExporterStatusOffline
 	// ... other cases ...
 	default:
+		// Log unexpected status values for debugging protocol mismatches
+		// Consider using a package-level logger if available
 		return jumpstarterdevv1alpha1.ExporterStatusUnspecified
 	}
 }
```

python/packages/jumpstarter/jumpstarter/common/enums.py (1)
38-45: Consider handling invalid protobuf values gracefully in `from_proto`.

The current implementation will raise `ValueError` if an unrecognized integer is passed to `from_proto`. For robustness against protocol version mismatches, consider returning `UNSPECIFIED` for unknown values.

♻️ Optional: Handle unknown values gracefully

```diff
 @classmethod
 def from_proto(cls, value: int) -> "ExporterStatus":
     """Convert from protobuf integer to enum."""
-    return cls(value)
+    try:
+        return cls(value)
+    except ValueError:
+        return cls.UNSPECIFIED
```

python/packages/jumpstarter/jumpstarter/exporter/logging.py (1)
31-39: Consider longest-prefix matching for hierarchical loggers.

The current prefix matching returns the first match found during iteration. With hierarchical logger names (e.g., `driver.power`, `driver.power.usb`), if both `driver` and `driver.power` are mapped, the match depends on iteration order. Consider sorting by key length descending to ensure the most specific (longest) prefix wins.

♻️ Suggested improvement for deterministic matching

```diff
 def get_source_for_record(self, record):
     """Determine the appropriate log source for a record."""
     with self._lock:
         # Check if this record comes from a logger with a specific source mapping
         logger_name = record.name
-        for mapped_logger, source in self._child_handlers.items():
+        # Sort by key length descending to match most specific prefix first
+        for mapped_logger, source in sorted(
+            self._child_handlers.items(), key=lambda x: len(x[0]), reverse=True
+        ):
             if logger_name.startswith(mapped_logger):
                 return source
         return self.source
```

python/packages/jumpstarter/jumpstarter/client/core.py (1)
440-510: Consider extracting helper functions to reduce complexity.

The static analysis correctly flags this method as too complex (C901: 14 > 10). The method handles multiple concerns: reconnection with backoff, log source routing, and filtering. While the current structure is readable, extracting helpers would improve maintainability.
♻️ Suggested structure for reduced complexity
```python
def _route_log_to_logger(self, response, show_all_logs: bool) -> None:
    """Route a log response to the appropriate logger."""
    from jumpstarter.common import LogSource

    if response.HasField("source"):
        source = LogSource(response.source)
        is_hook = source in (LogSource.BEFORE_LEASE_HOOK, LogSource.AFTER_LEASE_HOOK)
    else:
        source = LogSource.SYSTEM
        is_hook = False

    if not (is_hook or show_all_logs):
        return

    severity = response.severity if response.severity else "INFO"
    log_level = getattr(logging, severity, logging.INFO)

    logger_names = {
        LogSource.BEFORE_LEASE_HOOK: "exporter:beforeLease",
        LogSource.AFTER_LEASE_HOOK: "exporter:afterLease",
        LogSource.DRIVER: "exporter:driver",
    }
    logger_name = logger_names.get(source, "exporter:system")
    logging.getLogger(logger_name).log(log_level, response.message)
```

Then use this helper in the main loop to reduce nesting.
python/packages/jumpstarter/jumpstarter/exporter/hooks.py (1)
496-498: TODO: Controller-side lease ending not yet implemented.

The comment indicates that when `on_failure='endLease'`, there's no mechanism to actually end the lease from the exporter side. This should be tracked.

Would you like me to open an issue to track the controller-side lease ending mechanism for `beforeLease` hook failures with `on_failure='endLease'`?

python/packages/jumpstarter/jumpstarter/client/status_monitor.py (1)
197-279: Consider extracting common verification logic.

Both `wait_for_status` and `wait_for_any_of` share substantial code for connection recovery verification. This duplication contributes to the complexity flagged by static analysis.

♻️ Suggested refactor to reduce duplication

```python
async def _verify_connection_recovery(self) -> ExporterStatus | None:
    """Verify connection and return current status if recovered, None if still lost."""
    if not self._connection_lost:
        return self._current_status
    logger.debug("Connection was marked as lost, verifying...")
    try:
        response = await self._stub.GetStatus(jumpstarter_pb2.GetStatusRequest())
        logger.info("Connection recovered during verification poll")
        self._connection_lost = False
        new_status = ExporterStatus.from_proto(response.status)
        self._current_status = new_status
        self._status_version = response.status_version
        return new_status
    except AioRpcError as e:
        if e.code() == StatusCode.UNAVAILABLE:
            logger.debug("Connection still lost (UNAVAILABLE)")
        else:
            logger.debug("GetStatus error during verification: %s", e.code())
        return None
    except Exception as e:
        logger.debug("GetStatus error during verification: %s", e)
        return None
```

Then use this helper in both `wait_for_status` and `wait_for_any_of`.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In
`@python/packages/jumpstarter-driver-power/jumpstarter_driver_power/client_test.py`:
- Around line 8-20: The test_log_stream uses fixed time.sleep(1) which is flaky;
replace these sleeps with a short-polling loop that checks caplog.text until the
expected substring appears or a timeout expires. In test_log_stream (around
client.log_stream(), client.on(), client.off()) implement a small helper or
inline poll using time.monotonic() to loop with a tiny sleep (e.g., 0.01–0.1s)
and fail if timeout reached, then assert "power on" and "power off" only after
the poll succeeds; keep using caplog and the same logger scope so the
client.log_stream(), client.on(), and client.off() calls remain unchanged.
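A sketch of the polling helper described above; `wait_for_log` is a hypothetical name and the timeout/interval values are illustrative:

```python
import time

def wait_for_log(caplog, needle: str, timeout: float = 5.0, interval: float = 0.05) -> None:
    """Poll caplog.text until `needle` appears, failing after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while needle not in caplog.text:
        if time.monotonic() > deadline:
            raise AssertionError(f"timed out waiting for {needle!r} in captured logs")
        time.sleep(interval)
```

The test would then call `wait_for_log(caplog, "power on")` after `client.on()` (and likewise for "power off") instead of sleeping for a fixed second.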
In `@python/packages/jumpstarter/jumpstarter/client/client_test.py`:
- Line 3: Remove the unused import OrderedDict from the imports in
client_test.py; locate the line "from collections import OrderedDict" and delete
it (or remove OrderedDict from the import list) so the lint warning is resolved
and no functionality is changed.
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor_test.py`:
- Line 3: The import list in status_monitor_test.py includes an unused symbol
AsyncMock; remove AsyncMock from the import statement (keep MagicMock) so the
module only imports symbols actually used in the tests, ensuring static analysis
no longer flags the unused AsyncMock.
In `@python/packages/jumpstarter/jumpstarter/common/enums_test.py`:
- Line 3: Remove the unused top-level import "import pytest" from enums_test.py;
locate the import statement at the top of the test module and delete it so the
file no longer contains the unused pytest symbol.
🧹 Nitpick comments (4)
python/packages/jumpstarter/jumpstarter/exporter/lease_context_test.py (1)
186-200: Avoid time-based sleep to reduce test flakiness.

Line 195 uses a fixed sleep, which can be flaky on slow CI. Consider a handshake event so the test deterministically waits for `wait_for_drivers()` to start before setting the hook.

♻️ Proposed deterministic rewrite

```diff
 async def test_wait_for_drivers_blocks_until_set(self) -> None:
     """Test that wait_for_drivers() blocks until hook event is set."""
     import anyio

     before_hook = Event()
     ctx = LeaseContext(lease_name="test-lease", before_lease_hook=before_hook)

-    # Set the event after a short delay
-    async def set_after_delay():
-        await anyio.sleep(0.05)
-        before_hook.set()
+    waiter_started = Event()
+
+    async def wait_and_signal():
+        waiter_started.set()
+        await ctx.wait_for_drivers()

     async with anyio.create_task_group() as tg:
-        tg.start_soon(set_after_delay)
-        await ctx.wait_for_drivers()
+        tg.start_soon(wait_and_signal)
+        await waiter_started.wait()
+        before_hook.set()

     assert ctx.drivers_ready()
```

python/packages/jumpstarter/jumpstarter/client/core_test.py (2)
16-32: Consider extracting shared test utilities.

`MockAioRpcError` and `create_mock_rpc_error` are duplicated from `status_monitor_test.py`. Consider extracting these to a shared test utilities module (e.g., `jumpstarter/client/test_utils.py` or a `conftest.py`).
90-116: Move import to module level.

The `anyio` import inside `wait_for_hook_status` should be at the module level for consistency and clarity.

Proposed fix

Add to imports at top of file:

```python
import anyio
```

Then update the method:

```diff
 async def wait_for_hook_status(self, target_status: ExporterStatus, timeout: float = 60.0) -> bool:
     """Wait for exporter to reach a target status using polling."""
-    import anyio
-
     poll_interval = 0.1  # Fast for testing
```

python/packages/jumpstarter/jumpstarter/client/client_test.py (1)
121-124: Consider adding assertion to fully verify topological order.

The comment states children should be built before parents, but the assertion only verifies `grandparent` is built after `parent`. Adding an assertion that `child` is built before `parent` would fully validate the topological ordering.

Suggested improvement

```diff
 # Children should be built before parents in topological order
-# The order should be child, parent, grandparent (or parent, child, grandparent)
+# The order should be child, parent, grandparent
 assert "grandparent" in build_order
 assert build_order.index("grandparent") > build_order.index("parent")
+assert build_order.index("parent") > build_order.index("child")
```
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@controller/internal/service/suite_test.go`:
- Around line 56-58: The test harness sets up testEnv as an envtest.Environment
with CRDDirectoryPaths and ErrorIfCRDPathMissing: true, which will abort the
suite if the CRD path is wrong; verify and update the CRDDirectoryPaths value to
the actual CRD directory (or add multiple fallback paths) so envtest can find
the CRDs, e.g., check and correct the path used in the testEnv initialization
(the envtest.Environment struct, the CRDDirectoryPaths slice, and the
ErrorIfCRDPathMissing flag) or set ErrorIfCRDPathMissing to false while you
confirm the correct location.
- Around line 29-31: The test currently ignores errors when creating a
Namespace; change the Namespace creation logic (the call that creates a
corev1.Namespace object, e.g., client.CoreV1().Namespaces().Create or similar in
the test setup) to explicitly handle errors: import
k8s.io/apimachinery/pkg/api/errors as apierrors, then after the Create call
check if err != nil and if apierrors.IsAlreadyExists(err) continue/ignore,
otherwise fail the test (t.Fatalf or return the error). Apply the same pattern
for the other namespace-creation site referenced in the comment (the second
block around lines 81-87) so only AlreadyExists is tolerated and all other
errors cause the test to fail.
- Around line 19-24: Update the test setup in suite_test.go to avoid hard-coding
BinaryAssetsDirectory: first respect the KUBEBUILDER_ASSETS environment variable
if set (use it to set BinaryAssetsDirectory), then fall back to existing helper
GetFirstFoundEnvTestBinaryDir() or a directory-existence check for the current
hard-coded path (bin/k8s/1.30.0-<os>-<arch>), and only assign the hard-coded
path as a last resort; modify the code that sets BinaryAssetsDirectory to
perform conditional checks (os.Getenv("KUBEBUILDER_ASSETS"), os.Stat on
candidate dirs, and GetFirstFoundEnvTestBinaryDir()) so tests run correctly
outside the Makefile and when the version/path differs.
In `@python/packages/jumpstarter/jumpstarter/client/client_test.py`:
- Around line 120-124: The test's topological-order check only ensures
"grandparent" comes after "parent" but doesn't verify that "child" precedes
"parent"; update the assertions to explicitly assert that
build_order.index("child") < build_order.index("parent") (and keep/optionally
assert both "child" and "parent" are before "grandparent") so the test enforces
children-before-parents ordering; reference the existing build_order list and
the node names "child", "parent", and "grandparent" when adding the
assertion(s).
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 31-38: The docstring usage example in status_monitor.py
incorrectly calls client.start_status_monitor(); update the example to call the
actual method client.status_monitor_async() (or add a thin alias
start_status_monitor -> status_monitor_async in the client if you prefer) so the
docstring matches the real API; ensure the rest of the example still references
monitor.current_status and monitor.wait_for_status(ExporterStatus.LEASE_READY).
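The corrected docstring example would look roughly like this, assuming `status_monitor_async()` is the real entry point as the item above states:

```python
async with client.status_monitor_async() as monitor:
    print(monitor.current_status)
    await monitor.wait_for_status(ExporterStatus.LEASE_READY)
```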
In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 426-500: The before-lease failure branch doesn't actually end the
lease for on_failure='endLease'; modify run_before_lease_hook to accept an async
lease-release callback (e.g., end_lease: Callable[[LeaseContext],
Awaitable[None]]) or reuse an existing controller-release method, and in the
HookExecutionError else-branch (the on_failure='endLease' path) invoke await
end_lease(lease_scope) (and still set lease_scope.before_lease_hook
event/unblock connections and report status) so the lease is cleanly terminated;
update all call sites of run_before_lease_hook to pass the new end_lease
callback (mirroring how run_after_lease_hook is wired).
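A heavily hedged sketch of the suggested wiring; `LeaseContext`, the hook execution body, and `HookExecutionError` are assumed from the item above, and `end_lease` is the new callback being proposed:

```python
from collections.abc import Awaitable, Callable


class HookExecutionError(Exception):
    """Stand-in for the real exception type used by the hook executor."""


async def run_before_lease_hook(
    self,
    lease_scope,  # LeaseContext
    end_lease: Callable[[object], Awaitable[None]],
) -> None:
    try:
        await self._execute_before_lease(lease_scope)  # existing hook execution (assumed name)
    except HookExecutionError:
        if self.on_failure == "endLease":
            # Actually release the lease instead of only reporting the failure.
            await end_lease(lease_scope)
    finally:
        # Unblock connections waiting on the hook either way.
        lease_scope.before_lease_hook.set()
```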
♻️ Duplicate comments (2)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)
80-90: Unused synchronous function.

This function is no longer called anywhere in the codebase since line 213 now uses `_run_shell_with_lease_async`. Consider removing it to reduce maintenance burden.

python/packages/jumpstarter/jumpstarter/client/status_monitor.py (1)
397-403: Async context manager won't start monitoring.

The `__aenter__` method returns `self` without calling `start()`. Direct usage as an async context manager (`async with StatusMonitor(...) as monitor:`) will silently fail to start background polling. Users must use `client.status_monitor_async()` instead. Either fix `__aenter__` to properly start monitoring or remove the context manager methods to prevent misuse.
🧹 Nitpick comments (9)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (2)
157-161: Redundant double timeout.

Both `anyio.move_on_after(300)` and `timeout=300.0` in `wait_for_any_of` set 5-minute timeouts. The inner `timeout` parameter makes the outer cancel scope redundant. Consider removing one:

♻️ Proposed fix

```diff
-        with anyio.move_on_after(300) as timeout_scope:  # 5 minute timeout
-            result = await monitor.wait_for_any_of(
-                [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-                timeout=300.0
-            )
+        result = await monitor.wait_for_any_of(
+            [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+            timeout=300.0
+        )
+        if result is None:
+            logger.warning("Timeout waiting for afterLease hook to complete")
```

Then adjust the subsequent conditional logic accordingly, removing the `timeout_scope.cancelled_caught` check at line 171.
133-136: Consider checking `connection_lost` for beforeLease hook.

The afterLease handling (line 166) checks `monitor.connection_lost` to distinguish between timeout and connection loss. The beforeLease handling treats both cases as "Timeout" which could be misleading in logs.

♻️ Proposed enhancement

```diff
 if result == ExporterStatus.BEFORE_LEASE_HOOK_FAILED:
     logger.warning("beforeLease hook failed")
 elif result is None:
-    logger.warning("Timeout waiting for beforeLease hook")
+    if monitor.connection_lost:
+        logger.warning("Connection lost during beforeLease hook")
+    else:
+        logger.warning("Timeout waiting for beforeLease hook")
```

python/packages/jumpstarter/jumpstarter/client/client_test.py (1)
58-58: Ensure `ExitStack` is closed after each test.

`ExitStack()` is never closed, which can leak cleanup callbacks if `client_from_channel` registers any. Prefer `with ExitStack() as mock_stack:` to guarantee teardown.

♻️ Example pattern (apply to each test)

```diff
-    mock_stack = ExitStack()
+    with ExitStack() as mock_stack:
         with patch(
             "jumpstarter.client.client.MultipathExporterStub", return_value=mock_stub
         ), patch(
             "jumpstarter.client.client.import_class", return_value=MockDriverClient
         ):
             from jumpstarter.client.client import client_from_channel

             client = await client_from_channel(
                 mock_channel, mock_portal, mock_stack, allow=[], unsafe=True
             )
```

Also applies to: 98-98, 143-143, 177-177, 211-211, 245-245
python/packages/jumpstarter/jumpstarter/client/status_monitor_test.py (1)
16-32: Consider extracting shared mock utilities to a common test module.
`MockAioRpcError` and `create_mock_rpc_error` are duplicated here and in `core_test.py` (per the relevant snippets). Extracting these to a shared test utilities module (e.g., `conftest.py` or a dedicated `testing.py`) would reduce duplication.

python/packages/jumpstarter/jumpstarter/client/core.py (1)
332-351: Asymmetric behavior when monitor is not started.
`wait_for_lease_ready_monitored` raises `RuntimeError` if the monitor isn't started, while `wait_for_hook_complete_monitored` silently returns `True`. This asymmetry could be surprising. Consider documenting this behavior difference more explicitly or using consistent error handling.

python/packages/jumpstarter/jumpstarter/client/status_monitor.py (1)
309-309: Use %-style formatting in logger calls instead of f-strings.

F-strings in logging calls evaluate the string even when the log level is disabled, causing unnecessary overhead. Use %-style formatting for deferred evaluation.

Proposed fix

```diff
-logger.warning(f"Missed {missed} status transition(s)")
+logger.warning("Missed %d status transition(s)", missed)

-logger.info(f"Status changed: {old_status} -> {new_status} (version={new_version})")
+logger.info("Status changed: %s -> %s (version=%d)", old_status, new_status, new_version)

-logger.error(f"Status change callback error: {e}")
+logger.error("Status change callback error: %s", e)

-logger.debug(f"GetStatus poll error: {e.code()}")
+logger.debug("GetStatus poll error: %s", e.code())
-logger.debug(f"GetStatus poll error: {e}")
+logger.debug("GetStatus poll error: %s", e)
```

Also applies to: 320-320, 337-337, 352-354
controller/internal/service/suite_test.go (1)
90-93: Guard teardown if setup failed.

If `testEnv` never initialized, `Stop()` can panic in `AfterSuite`. A simple nil check avoids cascading failures.

🛡️ Suggested guard

```diff
 By("tearing down the test environment")
-err := testEnv.Stop()
-Expect(err).NotTo(HaveOccurred())
+if testEnv != nil {
+	err := testEnv.Stop()
+	Expect(err).NotTo(HaveOccurred())
+}
```

controller/Makefile (1)
66-75: Guard against duplicate `metadata.annotations` blocks.

Lines 66–71 unconditionally insert a new `annotations:` map. If the generated role ever includes annotations (now or later), this can create duplicate keys and potentially drop the sync-wave or other annotations depending on YAML parsing behavior. Consider merging into an existing annotations map or using a YAML-aware tool for the insert.

python/packages/jumpstarter/jumpstarter/exporter/hooks.py (1)
341-356: Use `anyio.to_thread.run_sync()` with timeout for blocking `process.wait()` after `process.kill()`.

The direct `process.wait()` call at line 354 blocks the event loop if the process doesn't exit promptly. The same block already demonstrates the correct pattern at lines 346–347 with `anyio.to_thread.run_sync()` and `anyio.move_on_after()`. Apply the same pattern here for consistency and robustness:

Safer non-blocking wait

```diff
 if process.poll() is None:
     process.kill()
     try:
-        process.wait()
+        with anyio.move_on_after(5):
+            await anyio.to_thread.run_sync(process.wait)
     except Exception:
         pass
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@controller/Makefile`:
- Line 74: The Makefile currently runs the cp command copying CRDs to
deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/, which
will fail if that target directory doesn't exist; update the Makefile to create
the directory first by running a mkdir -p for
deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/
immediately before the cp command so the cp in the Makefile succeeds even when
the templates/crds directory is missing.
🧹 Nitpick comments (1)
controller/Makefile (1)
66-71: Idempotency risk: duplicate `annotations:` keys if file already contains annotations.

The AWK script unconditionally inserts an `annotations:` block after the `name:` line. If controller-gen ever outputs annotations in the generated role.yaml, or if the manifests target runs without a fresh regeneration, this will produce invalid YAML with duplicate `annotations:` keys.

Consider using a more robust approach that checks for existing annotations before inserting:
♻️ Suggested safer AWK approach
```diff
 # Add ArgoCD sync-wave annotation to RBAC role for proper deployment ordering (PR #207)
-	@awk '/^  name: jumpstarter-manager-role$$/{print; print "  annotations:"; print "    argocd.argoproj.io/sync-wave: \"-1\""; next}1' \
+	@awk ' \
+	  /^  name: jumpstarter-manager-role$$/ { print; found_name=1; next } \
+	  found_name && /^  annotations:/ { print; found_annotations=1; next } \
+	  found_name && found_annotations && /^    [a-zA-Z]/ && !/^      / { \
+	    print "    argocd.argoproj.io/sync-wave: \"-1\""; found_name=0; found_annotations=0 \
+	  } \
+	  found_name && !found_annotations && /^  [a-zA-Z]/ { \
+	    print "  annotations:"; print "    argocd.argoproj.io/sync-wave: \"-1\""; found_name=0 \
+	  } \
+	  { print }' \
 	deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml > \
 	deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml.tmp && \
 	mv deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml.tmp \
 	deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml
```
yqfor safer YAML manipulation.Also, the comment references "PR
#207" but this appears to be PR#140.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py`:
- Around line 125-143: The code currently launches the shell unconditionally
after monitor.wait_for_any_of; change it to treat
ExporterStatus.BEFORE_LEASE_HOOK_FAILED and a None (timeout) as hard stops by
not calling _run_shell_only. After the wait_for_any_of call
(monitor.wait_for_any_of), check result and if it equals
ExporterStatus.BEFORE_LEASE_HOOK_FAILED or is None, log the warning and exit the
function (or raise an appropriate exception) instead of proceeding to call
anyio.to_thread.run_sync(_run_shell_only, lease, config, command, path); only
call _run_shell_only when result == ExporterStatus.LEASE_READY.
♻️ Duplicate comments (5)
python/packages/jumpstarter/jumpstarter/exporter/exporter.py (4)
202-204: Controller channels are not closed after use.

This creates a new gRPC channel per call but never closes it. Multiple calls during status updates and stream retries will leak channels. The pattern in `_unregister_with_controller()` (lines 357-369) demonstrates proper cleanup with an explicit `await channel.close()`.
414-458: Task may hang when lease ends without EndSession.

This task only waits on `end_session_requested` (line 426). If the lease ends without the client calling EndSession, the task continues waiting indefinitely on the old lease context. Consider racing against `lease_ended` or using a task group scoped to the lease lifetime.
546-553: Listen task should be scoped to the lease lifetime.

The `_retry_stream` task for Listen is started on the outer `tg` (line 548), causing it to continue running after `handle_lease` exits. This results in the task persisting across lease transitions. Move this to `conn_tg` so it's automatically cancelled when the lease ends, and close `listen_tx` in the finally block to signal proper stream termination.
578-579: Consider scoping `_handle_end_session` to the lease task group.

Starting this task on the outer `tg` means it survives the lease ending. If combined with the fact that it only waits on `end_session_requested`, the task can be orphaned. This is related to the hang issue noted in the `_handle_end_session` method.

python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)
80-89: Consider removing `_run_shell_with_lease` if it’s now unused.

The async path appears to be the only entry point, so this sync helper likely adds maintenance overhead.

```bash
#!/bin/bash
# Verify remaining usages of the sync helper
rg -nP '\b_run_shell_with_lease\s*\(' --type py
```
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 8-18: The import block is unsorted per Ruff I001; reorder imports
into standard-library first (logging, collections.abc.Awaitable,
collections.abc.Callable, contextlib.AsyncExitStack), then third-party packages
(anyio, anyio.Event, anyio.create_task_group, grpc.StatusCode,
grpc.aio.AioRpcError, jumpstarter_protocol.jumpstarter_pb2), and finally
local/project imports (jumpstarter.common.ExporterStatus) so the import order
satisfies the linter and keeps the existing imported symbols intact.
- Around line 196-201: The timeout checks in wait_for_status() and
wait_for_any_of() incorrectly use "if timeout:" which treats 0 as falsy; change
both checks to "if timeout is not None:" so a timeout of 0 is handled as an
immediate timeout; update the branches around anyio.move_on_after in
wait_for_status() and wait_for_any_of() (the blocks that call wait_loop() and
return False) to use the new condition.
- Around line 396-400: stop() currently sets _running = False and signals
_stop_event but does not wake waiters blocked on _any_change_event; update
stop() to also set _any_change_event (and optionally _any_change_event.set()
followed by _any_change_event.clear() if you want a one-shot wake) so waiters in
wait_for_status and wait_for_any_of that are awaiting _any_change_event.wait()
will wake and observe _running == False and exit their loops; reference the stop
method and the _any_change_event, wait_for_status, and wait_for_any_of functions
when applying this change.
In `@python/packages/jumpstarter/jumpstarter/exporter/exporter.py`:
- Around line 168-191: The stop() method currently ignores the caller's
should_unregister when wait_for_lease_exit is True, causing later calls (e.g.,
in serve()) to override intent; update stop() so that whenever a deferred stop
is requested (the branch setting self._stop_requested = True) you also set
self._unregister = should_unregister (and likewise when a stop has already been
requested you should update/preserve self._unregister accordingly) so the
original caller preference is retained; refer to the stop() method, the
attributes self._unregister, self._stop_requested, and self._tg.cancel_scope to
implement this change.
🧹 Nitpick comments (1)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)
147-162: Remove redundant timeout wrapper and improve timeout detection logic.

The `StatusMonitor.connection_lost` property is valid and well-tested. However, the outer `anyio.move_on_after(30)` is redundant; `wait_for_any_of()` already has an identical `timeout=30.0` parameter. When both are equal, the inner timeout fires first, making the outer scope's `cancelled_caught` flag never True. This means timeout cases currently fall through to "Hook completion not confirmed" rather than the timeout warning.

The suggested refactor properly addresses this by removing the outer timeout wrapper and using `result is None` to detect timeouts, which disambiguates between connection loss and timeout.

♻️ Suggested simplification

```diff
-        with anyio.move_on_after(30) as timeout_scope:  # 30 second timeout
-            result = await monitor.wait_for_any_of(
-                [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-                timeout=30.0
-            )
-        if result == ExporterStatus.AVAILABLE:
-            logger.info("afterLease hook completed")
-        elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
-            logger.warning("afterLease hook failed")
-        elif monitor.connection_lost:
-            # Connection lost - exporter closed, hook likely completed
-            logger.info("Exporter connection closed (hook completed)")
-        else:
-            logger.debug("Hook completion not confirmed")
-        if timeout_scope.cancelled_caught:
-            logger.warning("Timeout waiting for afterLease hook to complete")
+        result = await monitor.wait_for_any_of(
+            [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+            timeout=30.0
+        )
+        if result == ExporterStatus.AVAILABLE:
+            logger.info("afterLease hook completed")
+        elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
+            logger.warning("afterLease hook failed")
+        elif result is None and monitor.connection_lost:
+            # Connection lost - exporter closed, hook likely completed
+            logger.info("Exporter connection closed (hook completed)")
+        elif result is None:
+            logger.warning("Timeout waiting for afterLease hook to complete")
+        else:
+            logger.debug("Hook completion not confirmed")
```
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/lease.py`:
- Around line 237-257: The fixed-count retry loop around self.controller.Dial
(using max_retries/base_delay/max_delay) should be replaced with a time‑based
retry bounded by a configurable timeout (use the existing acquisition_timeout if
available or add a new dial_timeout attribute/parameter) so Dial keeps retrying
until time.monotonic() exceeds the deadline; compute deadline = now + timeout,
then loop attempting await self.controller.Dial(...), on AioRpcError with
FAILED_PRECONDITION and "not ready" apply the same exponential/backoff delay
(min(base_delay*2**attempt, max_delay)) but stop retrying and re-raise once
monotonic() > deadline; update the retry log to show remaining time or attempts
left and ensure the new timeout is plumbed through the Lease acquisition
call/site that calls this Dial logic.
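A sketch of the deadline-based retry, assuming a new `dial_timeout` attribute and the same FAILED_PRECONDITION / "not ready" detection described above:

```python
import time

import anyio
from grpc import StatusCode
from grpc.aio import AioRpcError


async def dial_with_deadline(self, request, base_delay: float = 0.5, max_delay: float = 5.0):
    # Retry until the wall-clock deadline passes, not a fixed attempt count.
    deadline = time.monotonic() + self.dial_timeout  # hypothetical timeout attribute
    attempt = 0
    while True:
        try:
            return await self.controller.Dial(request)
        except AioRpcError as e:
            not_ready = (
                e.code() == StatusCode.FAILED_PRECONDITION and "not ready" in e.details()
            )
            if not not_ready or time.monotonic() >= deadline:
                raise
            delay = min(base_delay * 2**attempt, max_delay)
            attempt += 1
            await anyio.sleep(delay)
```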
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 343-347: In _poll_loop (status_monitor.py) the AioRpcError
UNIMPLEMENTED branch currently breaks without clearing _running or notifying
waiters; update that branch to mirror the UNAVAILABLE handling by setting
self._running = False and calling self._any_change_event.set() (and any other
cleanup done in the UNAVAILABLE branch) before breaking so wait_for_status() and
wait_for_any_of() don't hang; also add a unit test that triggers UNIMPLEMENTED
while a waiter (using wait_for_status() or wait_for_any_of()) is active and
assert the waiter returns promptly.
♻️ Duplicate comments (3)
python/packages/jumpstarter/jumpstarter/client/status_monitor.py (3)
8-18: Reorder imports to satisfy Ruff (I001).

Current ordering may still fail Ruff I001.

🔧 Proposed import order

```diff
-import logging
-from collections.abc import Awaitable, Callable
-from contextlib import AsyncExitStack
+from collections.abc import Awaitable, Callable
+from contextlib import AsyncExitStack
+import logging
```

Please verify with:

```bash
#!/bin/bash
ruff check python/packages/jumpstarter/jumpstarter/client/status_monitor.py
```

Based on learnings, Ruff is the formatter/linter for Python files.
196-201: Handle `timeout=0` explicitly.

`if timeout:` treats `0` as falsy, so an immediate timeout becomes an infinite wait (line 196 and line 280).

🐛 Proposed fix

```diff
-        if timeout:
+        if timeout is not None:
             with anyio.move_on_after(timeout):
                 return await wait_loop()
             return False
```

```diff
-        if timeout:
+        if timeout is not None:
             with anyio.move_on_after(timeout):
                 return await wait_for_first()
             return None
```

Optional verification:

```bash
#!/bin/bash
rg -n "wait_for_status|wait_for_any_of|timeout\s*=\s*0" --type=py
```

Also applies to: 280-285
396-400: Wake waiting coroutines on stop().
`stop()` doesn’t signal `_any_change_event`, so `wait_for_*` can hang after shutdown.

🛠️ Proposed fix

```diff
 async def stop(self):
     """Stop the background polling task."""
     self._running = False
     self._stop_event.set()
+    # Wake any waiters so they can observe _running=False
+    self._any_change_event.set()
+    self._any_change_event = Event()
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@e2e/dex.values.yaml`:
- Around line 37-40: The userID value for the test user "test-exporter-hooks" is
not a valid UUID because it contains the non-hex character 'g' (value:
c6ed6f40-6689-6g7e-c64c-55d0a9a5871f); replace this with a valid UUID (only 0-9
and a-f), e.g., generate a new v4 UUID and set userID to that new value so the
entry for username "test-exporter-hooks" and its userID field are valid.
🧹 Nitpick comments (5)
e2e/tests-hooks.bats (5)
9-10: Unused variable declaration.
`CURRENT_HOOKS_CONFIG` is declared but never referenced anywhere in the file. Consider removing it to avoid confusion.

Suggested fix

```diff
-# Track which config is currently active
-CURRENT_HOOKS_CONFIG=""
```
51-54: Consider graceful termination before SIGKILL.

Using `kill -9` immediately doesn't give the process a chance to clean up. Consider sending `SIGTERM` first with a brief grace period before escalating to `SIGKILL`.

Suggested improvement

```diff
 if ps -p "$pid" > /dev/null 2>&1; then
     echo "  Killing PID $pid" >&2
-    kill -9 "$pid" 2>/dev/null || true
+    kill -TERM "$pid" 2>/dev/null || true
+    sleep 0.5
+    ps -p "$pid" > /dev/null 2>&1 && kill -9 "$pid" 2>/dev/null || true
 fi
```
104-110: Fixed sleep may introduce flakiness.

The 2-second sleep on line 107 is intended to avoid catching pre-connect state, but this is timing-dependent. If the exporter connects faster or slower than expected, tests could become flaky. Consider polling for a specific pre-condition instead, or documenting why 2 seconds is sufficient.
205-208: Add safeguards for empty grep results.

If the grep commands fail to find matches (e.g., due to unexpected output format), `before_pos` or `after_pos` could be empty, causing the comparison on line 208 to fail with a shell error. Consider adding validation or using `|| true` with a fallback.

Suggested improvement

```diff
 # Verify order: BEFORE should appear before AFTER in output
 local before_pos=$(echo "$output" | grep -n "BEFORE_HOOK:" | head -1 | cut -d: -f1)
 local after_pos=$(echo "$output" | grep -n "AFTER_HOOK:" | head -1 | cut -d: -f1)
+# Ensure both markers were found
+[ -n "$before_pos" ] && [ -n "$after_pos" ]
 [ "$before_pos" -lt "$after_pos" ]
```
315-333: Tests lack verification of actual `j` command effects.

Tests E1 and E2 only verify that the hook scripts complete (via output markers), but don't verify that `j power on/off` commands actually executed or had the expected effect. Consider adding assertions that verify the power state changed, or at minimum checking for command output in the hook logs.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 358-365: The UNAVAILABLE branch breaks the poll loop but doesn't
update the monitor's running state, so set self._running = False when handling
StatusCode.UNAVAILABLE; specifically, inside the elif e.code() ==
StatusCode.UNAVAILABLE block update self._running to False alongside setting
self._connection_lost, trigger and recreate self._any_change_event as shown, and
ensure start() can restart the monitor after this exit.
♻️ Duplicate comments (2)
python/packages/jumpstarter/jumpstarter/client/status_monitor.py (2)
199-204: Handle `timeout=0` explicitly.

`if timeout:` treats `0` as falsy, so `timeout=0` waits indefinitely instead of timing out immediately.

🛠️ Proposed fix

```diff
-        if timeout:
+        if timeout is not None:
             with anyio.move_on_after(timeout):
                 return await wait_loop()
             return False
```

```diff
-        if timeout:
+        if timeout is not None:
             with anyio.move_on_after(timeout):
                 return await wait_for_first()
             return None
```

Also applies to: 283-286
416-420: Wake waiters on stop to avoid hangs.
`stop()` only signals `_stop_event`, but waiters block on `_any_change_event`, so they can hang indefinitely.

🛠️ Proposed fix

```diff
 async def stop(self):
     """Stop the background polling task."""
     self._running = False
     self._stop_event.set()
+    # Wake waiters so they can observe _running=False
+    self._any_change_event.set()
+    self._any_change_event = Event()
```
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/packages/jumpstarter/jumpstarter/config/client.py (1)
162-177: Thread `include_status` through to the ListExporters request.

The `include_status` parameter is only used to set a display flag on the response (line 176), but it's never passed to `ClientService.ListExporters`. If the server requires this flag in the request to include status data, status information won't be fetched. Update `ClientService.ListExporters` to accept and propagate `include_status` to the protobuf request.

Suggested update

```diff
-exporters_response = await svc.ListExporters(page_size=page_size, page_token=page_token, filter=filter)
+exporters_response = await svc.ListExporters(
+    page_size=page_size,
+    page_token=page_token,
+    filter=filter,
+    include_status=include_status,
+)
```
the CRD files in this dir need to be deleted, were probably added by makefile before the rebases, now those live in deploy/helm/crds instead
```yaml
scope: Namespaced
versions:
- name: v1alpha1
- additionalPrinterColumns:
```
🤦 sorry for all the conflicts now.
the shell.py was updated. I am reviewing and testing as it is, then we can rebase and merge, let's prioritize this.
controller/hack/deploy_with_helm.sh
Outdated
```bash
# --skip-crds: CRDs are managed via templates/crds/ instead of the special crds/ directory
helm ${METHOD} --namespace jumpstarter-lab \
  --create-namespace \
  --skip-crds \
```
```
  --skip-crds \
```
You need to remove again once you remove the extra CRDs added in the helm template dir.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@e2e/tests-hooks.bats`:
- Around line 127-128: The backgrounded single-run launcher is missing the fd-3
close which can let bats hang; modify the command that starts the one-off
exporter (the line that runs `jmp run --exporter test-exporter-hooks &` and
writes to `HOOKS_EXPORTER_PIDS_FILE`) to close file descriptor 3 before
backgrounding (i.e., ensure the process is started with `3>&-` so fd 3 is not
inherited), mirroring how `start_hooks_exporter` closes fd 3 to avoid leaving
bats' internal pipe open.
In `@python/packages/jumpstarter/jumpstarter/client/core.py`:
- Around line 323-339: The monitored waiter wait_for_lease_ready_monitored
currently only waits for ExporterStatus.LEASE_READY and will hang until timeout
if BEFORE_LEASE_HOOK_FAILED occurs; update wait_for_lease_ready_monitored (which
uses self._status_monitor) to call the monitor's wait_for_any_of (or equivalent)
for both ExporterStatus.LEASE_READY and ExporterStatus.BEFORE_LEASE_HOOK_FAILED,
then handle the result by returning/raising appropriately (e.g., return early or
raise DriverError when BEFORE_LEASE_HOOK_FAILED is observed) instead of only
awaiting wait_for_status.
- Around line 148-193: The loop in start_shell_wait_for_exporter (uses
get_status_async, ExporterStatus, poll_interval, timeout, poll_count) increments
elapsed by poll_interval only and thus ignores time spent in the RPC; wrap the
whole polling loop in anyio.move_on_after(timeout) (or use anyio.fail_after) so
the actual wall-clock timeout includes the GetStatus RPCs, preserve existing
logging on timeout and ensure you still await anyio.sleep(poll_interval) between
polls; remove or stop updating the manual elapsed counter (or replace it with
time.monotonic() if you prefer explicit tracking) so the timeout cannot be
overshot.
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 182-214: wait_for_status's wait_loop can return False when the
monitor stopped after setting the desired status; update wait_for_status to
perform the same post-loop status verification as wait_for_any_of: after the
loop or move_on_after block completes (just before returning False), check
self._current_status == target and return True if so (while still preserving
connection_lost handling), ensuring you reference the existing wait_loop,
wait_for_status, _current_status, _running and _connection_lost symbols when
making the change.
🧹 Nitpick comments (9)
e2e/tests-hooks.bats (3)
9-10: Unused variable `CURRENT_HOOKS_CONFIG`.
This variable is declared but never read or written anywhere else in the file. Remove it to avoid confusion.
Proposed fix
```diff
-# Track which config is currently active
-CURRENT_HOOKS_CONFIG=""
```
92-131: Optional: extract shared config-merge logic.
Lines 98-102 and 121-125 are identical. A small helper like `apply_hooks_config "$config_file"` would reduce duplication and ensure both paths stay in sync.
349-350: Weak assertion for `CLIENT_NAME`: it matches even when the value is empty.
Line 349 checks `lease=[0-9a-f-]+` (must have a non-empty value), but line 350 only checks that `client=` appears in the output. Consider requiring a non-empty value:

Proposed fix

```diff
-    [[ "$output" =~ client= ]]
+    [[ "$output" =~ client=[a-zA-Z0-9_-]+ ]]
```

python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (2)
154-172: Redundant re-imports of `ExporterOfflineError` inside elif branches.
`ExporterOfflineError` is already imported at line 154 (before the `try`). The re-imports at lines 168 and 172 are unnecessary.

♻️ Proposed cleanup
```diff
 elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
-    from jumpstarter.common.exceptions import ExporterOfflineError
     reason = monitor.status_message or "afterLease hook failed"
     raise ExporterOfflineError(reason)
 elif monitor.connection_lost:
-    from jumpstarter.common.exceptions import ExporterOfflineError
```
230-242: Broad exception group conversion may mask non-connection errors.
Lines 240-241 convert any `BaseExceptionGroup` into `ExporterOfflineError("Connection to exporter lost")` when `lease_used is not None`, even if the underlying exceptions are unrelated to connectivity (e.g., `AttributeError`, `KeyError`). This can make debugging harder.
Consider logging the original exception group before re-raising, or only converting when the group actually contains connection-related exceptions.
♻️ Suggested improvement
```diff
 if offline_exc:
     raise offline_exc from None
 if lease_used is not None:
+    logger.debug("Unhandled exception group during lease: %s", eg)
     raise ExporterOfflineError("Connection to exporter lost") from None
```

python/packages/jumpstarter/jumpstarter/exporter/hooks.py (3)
183-191: Consider decomposing `_execute_hook_process` further.
This method is ~280 lines with `# noqa: C901`. The nested `read_pty_output` and `wait_for_process` coroutines could be extracted as private methods (or module-level helpers accepting `pty_state` and `process` as arguments) to reduce cognitive load and make individual pieces independently testable.
229-252: Minor inconsistency: `command` vs `script_stripped` passed to interpreter.
Line 249 uses `script_stripped` for the file path (correct: no leading/trailing whitespace), but line 252 passes the original `command` (with potential whitespace) to `interpreter -c`. This is harmless since shells handle it fine, but for consistency you could use `script_stripped` in both branches.

Suggested fix
```diff
 else:
     logger.info("Executing inline script (interpreter: %s)", interpreter)
-    cmd = [interpreter, "-c", command]
+    cmd = [interpreter, "-c", script_stripped]
```
502-508: Consider clarifying in the docstring why `request_lease_release` is accepted but intentionally unused.
The `request_lease_release` parameter is documented and intentionally not used; the inline comment at line 576 explains the design: "No request_lease_release - client will discover failure and release." The parameter is kept for API parity with `run_after_lease_hook` and accepted by callers. However, the docstring does not explicitly explain why the parameter exists despite not being invoked. Adding a note like "Note: This parameter is reserved for consistency with `run_after_lease_hook` and may be used by clients expecting future enhancements" would clarify the intent and prevent confusion.
343-343: Use lazy %-formatting in logger calls instead of f-strings.
Several logger calls use f-strings (lines 343, 354, 371, 414, 416), which eagerly format the string even when the log level is disabled. Prefer `logger.warning("Missed %d status transition(s)", missed)` style.
Also applies to: 354-354, 371-371, 414-416
```bash
jmp run --exporter test-exporter-hooks &
echo "$!" >> "$HOOKS_EXPORTER_PIDS_FILE"
```
Missing `3>&-` on backgrounded process: can cause bats to hang.
`start_hooks_exporter` (line 104) correctly closes fd 3 (`3>&-`) to prevent bats from hanging on its internal pipe. This single-run variant backgrounds `jmp run` directly without closing fd 3, so bats may block waiting for the child to release the descriptor.
Proposed fix
```diff
-jmp run --exporter test-exporter-hooks &
+jmp run --exporter test-exporter-hooks 3>&- &
```
In `@e2e/tests-hooks.bats` around lines 127 - 128, The backgrounded single-run
launcher is missing the fd-3 close which can let bats hang; modify the command
that starts the one-off exporter (the line that runs `jmp run --exporter
test-exporter-hooks &` and writes to `HOOKS_EXPORTER_PIDS_FILE`) to close file
descriptor 3 before backgrounding (i.e., ensure the process is started with
`3>&-` so fd 3 is not inherited), mirroring how `start_hooks_exporter` closes fd
3 to avoid leaving bats' internal pipe open.
```python
poll_interval = 0.5  # seconds
elapsed = 0.0
poll_count = 0

self.logger.debug("Waiting for exporter to be ready...")
while elapsed < timeout:
    poll_count += 1
    self.logger.debug("[POLL %d] Calling GetStatus (elapsed: %.1fs)...", poll_count, elapsed)
    try:
        status = await self.get_status_async()
        self.logger.debug("[POLL %d] GetStatus returned: %s", poll_count, status)
    except Exception as e:
        # Connection error - keep trying
        self.logger.debug("[POLL %d] Error getting status, will retry: %s", poll_count, e)
        await anyio.sleep(poll_interval)
        elapsed += poll_interval
        continue

    if status is None:
        # GetStatus not implemented - assume ready for backward compatibility
        self.logger.debug("[POLL %d] GetStatus not implemented, assuming ready", poll_count)
        return

    if status == ExporterStatus.LEASE_READY:
        self.logger.info("Exporter ready, starting shell...")
        return
    elif status == ExporterStatus.BEFORE_LEASE_HOOK:
        # Hook is running - this is expected, keep waiting
        self.logger.debug("[POLL %d] beforeLease hook is running...", poll_count)
    elif status == ExporterStatus.BEFORE_LEASE_HOOK_FAILED:
        # Hook failed - log but continue (exporter may still be usable)
        self.logger.warning("beforeLease hook failed")
        return
    elif status == ExporterStatus.AVAILABLE:
        # Exporter is available but not yet leased - keep waiting
        # This can happen if client connects before exporter receives lease assignment
        self.logger.debug("[POLL %d] Exporter status: AVAILABLE (waiting for lease assignment)", poll_count)
    else:
        # Other status - continue waiting
        self.logger.debug("[POLL %d] Exporter status: %s (waiting...)", poll_count, status)

    self.logger.debug("[POLL %d] Sleeping for %.1fs before next poll...", poll_count, poll_interval)
    await anyio.sleep(poll_interval)
    elapsed += poll_interval

self.logger.warning("Timeout waiting for beforeLease hook to complete (after %d polls)", poll_count)
```
Elapsed time tracking doesn't account for RPC duration, potentially overshooting timeout.
elapsed is incremented by poll_interval after each iteration, but the actual time includes the GetStatus RPC call (which has a 5-second gRPC timeout in the poll loop). In the worst case, each iteration could take 5s (RPC) + 0.5s (sleep) but only 0.5s is tracked, so the method could run ~6x longer than timeout.
Consider using anyio.move_on_after(timeout) to wrap the entire loop, similar to how the StatusMonitor methods handle timeouts.
🛠️ Proposed fix
```diff
 async def wait_for_lease_ready(self, timeout: float = 300.0) -> None:
     import anyio
-    poll_interval = 0.5  # seconds
-    elapsed = 0.0
-    poll_count = 0
-
+    poll_interval = 0.5
+    poll_count = 0
     self.logger.debug("Waiting for exporter to be ready...")
-    while elapsed < timeout:
+    with anyio.move_on_after(timeout):
+        while True:
             poll_count += 1
-            self.logger.debug("[POLL %d] Calling GetStatus (elapsed: %.1fs)...", poll_count, elapsed)
+            self.logger.debug("[POLL %d] Calling GetStatus...", poll_count)
             # ... rest of loop body unchanged, remove elapsed tracking ...
+            return  # unreachable, but satisfies the flow
     self.logger.warning("Timeout waiting for beforeLease hook to complete (after %d polls)", poll_count)
```
In `@python/packages/jumpstarter/jumpstarter/client/core.py` around lines 148 -
193, The loop in start_shell_wait_for_exporter (uses get_status_async,
ExporterStatus, poll_interval, timeout, poll_count) increments elapsed by
poll_interval only and thus ignores time spent in the RPC; wrap the whole
polling loop in anyio.move_on_after(timeout) (or use anyio.fail_after) so the
actual wall-clock timeout includes the GetStatus RPCs, preserve existing logging
on timeout and ensure you still await anyio.sleep(poll_interval) between polls;
remove or stop updating the manual elapsed counter (or replace it with
time.monotonic() if you prefer explicit tracking) so the timeout cannot be
overshot.
```python
async def wait_for_lease_ready_monitored(self, timeout: float = 300.0) -> None:
    """Wait for LEASE_READY status using background monitor.

    Non-blocking to other async tasks. Requires status_monitor_async context.

    Args:
        timeout: Maximum time to wait in seconds (default: 5 minutes)
    """
    if not self._status_monitor:
        raise RuntimeError("Status monitor not started. Use status_monitor_async() context.")

    reached = await self._status_monitor.wait_for_status(
        ExporterStatus.LEASE_READY,
        timeout=timeout
    )
    if not reached:
        raise DriverError("Timeout waiting for LEASE_READY status")
```
wait_for_lease_ready_monitored doesn't handle BEFORE_LEASE_HOOK_FAILED, causing a 300s timeout on hook failure.
The polling-based wait_for_lease_ready (line 177-180) correctly handles BEFORE_LEASE_HOOK_FAILED by returning early. However, the monitored version only waits for LEASE_READY, so a failed beforeLease hook would block for the full timeout duration.
🛠️ Proposed fix: use wait_for_any_of
```diff
 async def wait_for_lease_ready_monitored(self, timeout: float = 300.0) -> None:
     if not self._status_monitor:
         raise RuntimeError("Status monitor not started. Use status_monitor_async() context.")
-    reached = await self._status_monitor.wait_for_status(
-        ExporterStatus.LEASE_READY,
+    result = await self._status_monitor.wait_for_any_of(
+        [ExporterStatus.LEASE_READY, ExporterStatus.BEFORE_LEASE_HOOK_FAILED],
         timeout=timeout
     )
-    if not reached:
+    if result is None:
         raise DriverError("Timeout waiting for LEASE_READY status")
+    if result == ExporterStatus.BEFORE_LEASE_HOOK_FAILED:
+        self.logger.warning("beforeLease hook failed")
```
In `@python/packages/jumpstarter/jumpstarter/client/core.py` around lines 323 -
339, The monitored waiter wait_for_lease_ready_monitored currently only waits
for ExporterStatus.LEASE_READY and will hang until timeout if
BEFORE_LEASE_HOOK_FAILED occurs; update wait_for_lease_ready_monitored (which
uses self._status_monitor) to call the monitor's wait_for_any_of (or equivalent)
for both ExporterStatus.LEASE_READY and ExporterStatus.BEFORE_LEASE_HOOK_FAILED,
then handle the result by returning/raising appropriately (e.g., return early or
raise DriverError when BEFORE_LEASE_HOOK_FAILED is observed) instead of only
awaiting wait_for_status.
```python
async def wait_loop():
    # Increment active waiters to force fast polling
    self._active_waiters += 1
    try:
        while self._running and not self._connection_lost:
            # Check current status
            if self._current_status == target:
                return True

            # Capture event reference before waiting
            current_event = self._any_change_event

            # Double-check after capturing
            if self._current_status == target:
                return True

            # Check for connection lost
            if self._connection_lost:
                logger.debug("Connection lost while waiting")
                return False

            # Wait for any status change
            await current_event.wait()
        return False
    finally:
        self._active_waiters -= 1

if timeout is not None:
    with anyio.move_on_after(timeout):
        return await wait_loop()
    return False
else:
    return await wait_loop()
```
wait_for_status misses the target when the monitor stops after setting the status.
When _signal_unsupported() (or stop()) sets _current_status to LEASE_READY and simultaneously sets _running = False, wait_loop wakes from the event but the while self._running check fails immediately, returning False — even though the target status was reached.
wait_for_any_of already handles this correctly with a post-loop status check (lines 289-294). wait_for_status lacks the equivalent.
🐛 Proposed fix: add post-loop check matching wait_for_any_of
```diff
             await current_event.wait()
-        return False
+        # Loop exited (running=False or connection_lost) - check status one
+        # last time. The poll loop may have set _current_status (e.g. to
+        # LEASE_READY when GetStatus is unimplemented) before stopping.
+        if self._current_status == target:
+            return True
+        return False
     finally:
         self._active_waiters -= 1
```
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py` around
lines 182 - 214, wait_for_status's wait_loop can return False when the monitor
stopped after setting the desired status; update wait_for_status to perform the
same post-loop status verification as wait_for_any_of: after the loop or
move_on_after block completes (just before returning False), check
self._current_status == target and return True if so (while still preserving
connection_lost handling), ensuring you reference the existing wait_loop,
wait_for_status, _current_status, _running and _connection_lost symbols when
making the change.
```go
// the afterLease hook completes, ensuring leases are always released even if
// the client disconnects unexpectedly.
if req.GetReleaseLease() {
	if err := s.handleExporterLeaseRelease(ctx, exporter); err != nil {
```
What happens if the exporter goes offline and does not report the request to release? Is there some sort of timeout mechanism or alternate way?
```go
func checkExporterStatusForDriverCalls(exporterStatus string) error {
	switch exporterStatus {
	case jumpstarterdevv1alpha1.ExporterStatusLeaseReady,
		jumpstarterdevv1alpha1.ExporterStatusBeforeLeaseHook,
```
won't this allow clients to dial in while before/after lease hooks are happening?
controller/Makefile
```make
	deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml

	cp deploy/helm/jumpstarter/crds/* deploy/operator/config/crd/bases/
	cp deploy/helm/jumpstarter/crds/* deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/
```
we shall not copy this (goes along with the helm install --skip-crds that needs to be removed)
```make
	@awk '/^  name: jumpstarter-manager-role$$/{print; print "  annotations:"; print "    argocd.argoproj.io/sync-wave: \"-1\""; next}1' \
		deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml > \
		deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml.tmp && \
		mv deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml.tmp \
		deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml
```
we should just do those changes to the helm charts.
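For reference, a sketch of what that change could look like expressed directly in the chart template instead of being injected by the Makefile's awk step. The file path and surrounding fields are assumed here; only the annotation itself is taken from the awk command above.

```yaml
# deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml (assumed fragment)
metadata:
  name: jumpstarter-manager-role
  annotations:
    # Same annotation the awk step injects today, so ArgoCD applies the RBAC
    # role ahead of wave-0 resources.
    argocd.argoproj.io/sync-wave: "-1"
```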
```python
from jumpstarter.exporter import Session
from jumpstarter.utils.env import env

if TYPE_CHECKING:
```
Talking to @kirkbrauer, this was done because of a cyclic import problem.
```python
from jumpstarter.driver import Driver

if TYPE_CHECKING:
    from jumpstarter.driver import Driver
```
same cyclical issue, it's good.
```python
# Create hook executor if hooks are configured
hook_executor = None
if self.hooks.before_lease or self.hooks.after_lease:
```
Not enabled as long as those fields are empty 👍
```python
    None - caller manages socket paths externally
    """
    server = grpc.aio.server()
    for port in ports:
```
This is nice; it will be useful when we want to make the exporter listen on TCP too.
@kirkbrauer I've seen that you added logic to allow ending the lease from the exporter side. I wonder: is shutting down the exporter supposed to end the lease as well, or is that functionality reserved only for a hook action? Here I shut down the exporter but the lease is still up. Is this expected behavior?

Hi @evakhoni, that is a good point. Currently, the mechanism is only built to handle the case where a failed hook needs to end the lease. I did not add any additional functionality to end the lease on exporter shutdown, but that's a good enhancement to have.

Ack. I have opened #212 to track it separately. We can probably address it in a separate PR later on.

Tried an old client against a new exporter/controller pair - it doesn't work. Interestingly, if I start the client first and then start the exporter while the client is still waiting for it to come online, it connects and no such error is observed.

A momentary restart of an exporter underneath an open client connection results in the following error later, when the client disconnects:

If an attempt to send a client command was made during this brief exporter restart, we instead observe the following crash immediately. It is recoverable, and the client can send calls again once the exporter finishes restarting; however, after such a recovery we receive the following, more severe error when exiting the client session:

Note, this behavior is observed with no hooks set in the exporter. I haven't tested with hooks present yet, but I suspect irregularities may be observed.

Edit: this was with the code just before today's commit dc7832f. I have updated to the latest; for now I was unable to reproduce the second, more severe error, and am receiving:

New controller with an old exporter doesn't work; the controller marks it as offline in another terminal.
```
# Conflicts:
#	controller/deploy/operator/bundle/manifests/jumpstarter.dev_exporters.yaml
#	controller/deploy/operator/dist/install.yaml
```
```yaml
  # These jobs can be removed once 0.7.x controller support is no longer needed.
  # ============================================================================

  compat-old-controller:
```
```diff
-  compat-old-controller:
+  e2e-compat-old-controller:
```
nit
```yaml
    env:
      CI: true

  compat-old-client:
```
```diff
-  compat-old-client:
+  e2e-compat-old-client:
```
another nit
But very cool that you found a way to do backwards compat testing
Summary
Implement hook scripts that can be configured in the exporter config to run `j` commands within a temporary lease environment. This PR combines changes across Python, Controller, and Protocol components.

Environment Variables
The following environment variables are exposed to hook scripts:
- `JUMPSTARTER_HOST` - Path to the Unix socket for `j` CLI access
- `JMP_DRIVERS_ALLOW=UNSAFE` - Allows all drivers for local access
- `LEASE_NAME` - The name of the current lease
- `CLIENT_NAME` - Name of the client that acquired the lease

Example Configuration
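A minimal sketch of what the hooks section of an exporter config could look like, tying the environment variables above into the hook scripts. The exact schema is assumed here for illustration (field names such as `hooks`, `beforeLease`/`afterLease`, `script`, `timeout`, and `onFailure`, and the `j power` calls), not the authoritative config format.

```yaml
# Hypothetical exporter config fragment - field names are assumptions for illustration.
hooks:
  beforeLease:
    # Runs in a temporary lease environment before the client gets driver access.
    # JUMPSTARTER_HOST points the j CLI at the exporter's local socket, and
    # JMP_DRIVERS_ALLOW=UNSAFE lets the hook script call any driver.
    script: |
      echo "Preparing lease ${LEASE_NAME} for client ${CLIENT_NAME}"
      j power on
    timeout: 300
    onFailure: endLease   # e.g. exit, warn, or endLease
  afterLease:
    # Runs after the client session ends, before the exporter reports OFFLINE.
    script: |
      echo "Cleaning up after lease ${LEASE_NAME}"
      j power off
    timeout: 300
    onFailure: warn
```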
Changes
Python (`python/`)

- … (`exit`, `warn`)
- … (`jmp admin get exporter`)

Controller (`controller/`)

Protocol (`protocol/`)

- `release_lease` field for exporter-initiated lease release
- `status_version` and `previous_status` fields for transition tracking

Related PRs
Summary by CodeRabbit
New Features
Bug Fixes
Chores
Tests
Documentation