Exception and Cleanup Findings (2026-05-18)

This document captures a focused review of exception handling and shutdown/cleanup behavior in the Python SDK runtime lifecycle.

Scope

  • Command provider session lifecycle

  • Command consumer session lifecycle

  • Context startup/shutdown orchestration

  • Hook exception isolation

Findings Already Handled

1. Consumer hook exception isolation is robust

  • on_status, on_ack, on_exec_status, and on_terminal hook exceptions are caught and logged.

  • Consumer session cleanup still completes even if on_terminal raises.

  • Existing tests cover this behavior (test_consumer_hook_isolation.py).

2. Shutdown is idempotent

  • DDSContext.shutdown() checks an _is_shutdown guard and returns immediately on repeated calls.

3. Shutdown order is correct for dispatcher safety

  • Dispatcher is stopped before task cancellation and DDS entity teardown.

  • Service-level close() is used for logical cleanup only.

4. Context-level service close failures are contained

  • DDSContext.shutdown() catches/logs service close() exceptions and continues teardown.
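
The idempotency guard (finding 2) and contained close() failures (finding 4) can be pictured with a minimal sketch. SketchContext is a hypothetical stand-in, not the SDK's DDSContext, and it assumes service close() is a coroutine. Dispatcher stop, task cancellation, and DDS entity teardown are left out.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class SketchContext:
    """Hypothetical stand-in for DDSContext; not the SDK class."""

    def __init__(self, services):
        self._services = list(services)
        self._is_shutdown = False
        self.closed = []  # services whose close() completed (for illustration)

    async def shutdown(self):
        if self._is_shutdown:
            return  # repeated calls are no-ops
        self._is_shutdown = True
        # Dispatcher stop, task cancellation, and DDS entity teardown
        # are omitted from this sketch.
        for service in self._services:
            try:
                await service.close()  # assumption: close() is async
                self.closed.append(service)
            except Exception:
                # One failing service must not block the rest of teardown.
                logger.exception("service close() failed; continuing teardown")
```

Setting the flag before any teardown work means a second shutdown() call returns even if the first one is still in progress or has failed partway.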

Open Findings (Not Yet Fully Handled)

1. Provider on_terminal exceptions can skip provider session cleanup

In CommandProviderSession.run() finalization, await self._provider.on_terminal(self) executes before:

  • provider instance disposal

  • active-session map removal

If on_terminal() raises, the exception propagates past the remaining finalization steps, so disposal and map cleanup may be skipped for that session.

Risk:

  • lingering _active_sessions entries

  • missed instance disposal

  • shutdown path inconsistencies under hook failure

2. run_until_shutdown() can double-start already-started services

DDSContext.run_until_shutdown() currently creates a new task for every service exposing _run() without checking whether a prior start() already created a live task.

Risk:

  • duplicate reader loops for services that were manually started

  • hard-to-debug duplicated processing

3. Provider close() can abort early on a non-cancellation exception raised while awaiting the _run task

CommandProvider.close() cancels _task and only handles asyncio.CancelledError when awaiting it. If awaiting _task raises a different exception, active-session fail/cleanup logic below may not execute in that close() call.

Risk:

  • incomplete fail-on-shutdown behavior for active sessions

  • reduced cleanup resilience after reader-loop failure

Suggested Fixes

A. Harden provider session finalization

Wrap the provider on_terminal call in try/except inside the finally block, and always run disposal and _active_sessions.pop(...) afterward.

Suggested shape:

  1. try: await on_terminal(...)

  2. except Exception: log

  3. always dispose instances

  4. always remove session from active map
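
The suggested shape can be sketched as follows. SketchProviderSession, RaisingProvider, _dispose_instances, and _session_id are hypothetical names used for illustration only; the real CommandProviderSession internals may differ.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class SketchProviderSession:
    """Hypothetical stand-in for CommandProviderSession; not the SDK class."""

    def __init__(self, provider, session_id, active_sessions):
        self._provider = provider
        self._session_id = session_id            # hypothetical key
        self._active_sessions = active_sessions  # shared active-session map
        self.disposed = False

    def _dispose_instances(self):
        # Placeholder for provider instance disposal.
        self.disposed = True

    async def run(self):
        try:
            await asyncio.sleep(0)  # placeholder for the real session work
        finally:
            try:
                await self._provider.on_terminal(self)
            except Exception:
                # Log and continue so the cleanup below always runs.
                logger.exception("provider on_terminal hook failed")
            try:
                self._dispose_instances()
            finally:
                # Runs even if disposal itself raises.
                self._active_sessions.pop(self._session_id, None)


class RaisingProvider:
    """Hypothetical provider whose terminal hook always fails."""

    async def on_terminal(self, session):
        raise RuntimeError("hook failure")
```

Note that except Exception does not catch asyncio.CancelledError on Python 3.8 and later, so a cancellation arriving during the hook would still propagate out of this block. Whether that case also needs guarding is a separate decision.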

B. Prevent double-start in run_until_shutdown()

Before creating a task for _run(), check whether _task exists and is still running.

Suggested guard:

  • start only when _task is None or _task.done()
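
A minimal sketch of this guard, assuming each service keeps its task in a _task attribute as described above. needs_start, start_services_once, and SketchService are hypothetical helpers, not SDK functions.

```python
import asyncio


def needs_start(service):
    """Return True when the service has no live _run() task."""
    task = getattr(service, "_task", None)
    return task is None or task.done()


async def start_services_once(services):
    """Create a _run() task only for services without a live task."""
    started = []
    for service in services:
        if hasattr(service, "_run") and needs_start(service):
            service._task = asyncio.create_task(service._run())
            started.append(service)
    return started


class SketchService:
    """Hypothetical service with a manual start() and a _run() loop."""

    def __init__(self):
        self._task = None
        self.run_count = 0

    async def _run(self):
        self.run_count += 1
        await asyncio.sleep(0)  # placeholder for the reader loop

    def start(self):
        self._task = asyncio.create_task(self._run())
```

With this guard, a task that has already finished (including one that failed) counts as restartable. If a failed reader loop should not be restarted silently, the guard would need to inspect the task's exception as well.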

C. Make provider close() resilient to non-cancel task failures

When awaiting the cancelled _task, catch generic exceptions as well as asyncio.CancelledError (log and continue) so that active sessions are still failed and awaited.
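
One possible shape for this, as a sketch only. SketchProvider and _fail_active_sessions are hypothetical placeholders; the actual CommandProvider.close() logic for failing and awaiting sessions is not reproduced here.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class SketchProvider:
    """Hypothetical stand-in for CommandProvider; not the SDK class."""

    def __init__(self):
        self._task = None
        self.sessions_failed = False

    async def _fail_active_sessions(self):
        # Placeholder for failing and awaiting the active sessions.
        self.sessions_failed = True

    async def close(self):
        task, self._task = self._task, None
        if task is not None:
            task.cancel()
            try:
                await task
            except asyncio.CancelledError:
                pass  # expected outcome of cancel()
            except Exception:
                # The reader loop had already failed; log and keep going.
                logger.exception("_run task failed before close()")
        # Reached regardless of how the task ended.
        await self._fail_active_sessions()
```

If the task has already finished with an error, cancel() has no effect and awaiting the task re-raises the stored exception, which is the path the second except clause covers.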

Test Gaps to Add

  1. Provider hook isolation test:

  • Provider subclass whose on_terminal() raises.

  • Assert session still disposes and is removed from _active_sessions.

  2. Lifecycle double-start guard test:

  • Call service.start() and then ctx.run_until_shutdown().

  • Assert only one _run() task loop executes.

  3. Provider close resilience test:

  • Force the _run() task to fail with a non-cancellation exception.

  • Assert close() still proceeds to fail/await active sessions.

Priority

  • High: provider on_terminal finalization hardening

  • High: double-start guard in run_until_shutdown()

  • Medium: provider close() robustness for non-cancel task exceptions

Notes

  • Findings are based on code-path verification in runtime implementation, not architecture docs alone.

  • Consumer exception isolation is in better shape than provider finalization paths.