Resource Health & Reconciliation
Clauderon continuously monitors the health of your sessions and can automatically recover from failures. This guide explains health states, reconciliation, and recovery workflows.
Health States
Every session has a health state representing the status of its backend resources (containers, pods, processes).
State Definitions
| State | Description | Recoverable | User Action |
|---|---|---|---|
| Healthy | Session running normally | N/A | None needed |
| Stopped | Container stopped but intact | ✅ Yes | Start |
| Hibernated | Session suspended to save resources | ✅ Yes | Wake |
| Pending | Resource creation in progress | ⏳ Wait | Wait or cancel |
| Error | Container failed to start/run | ✅ Yes | Recreate |
| CrashLoop | Container repeatedly crashing (K8s) | ⚠️ Maybe | Recreate Fresh |
| Missing | Resource deleted externally | ✅ Yes | Recreate |
| DeletedExternally | Resource removed outside Clauderon | ✅ Yes | Recreate or Cleanup |
State Transitions
                ┌─────────────┐
                │   Pending   │
                └──────┬──────┘
                       │
                       ▼
┌─────────────────► Healthy ◄─────────────┐
│                      │                  │
│                      ▼                  │
│                 ┌──────────┐      ┌─────┴────┐
│                 │ Stopped  │      │   Wake   │
│                 └────┬─────┘      └──────────┘
│                      │                  ▲
│                      ▼                  │
│                 ┌──────────┐      ┌─────┴────────┐
└─────────────────┤  Start   │      │  Hibernated  │
                  └──────────┘      └──────────────┘

Error/Missing/CrashLoop
           │
           ▼
    ┌──────────────┐
    │   Recreate   │────► Healthy
    │ (or Cleanup) │
    └──────────────┘
Backend-Specific Mappings
Different backends report health differently:
Docker:
- Healthy - Container running
- Stopped - Container exists but not running
- Missing - Container deleted
- Error - Container exited with error
Kubernetes:
- Healthy - Pod running and ready
- Pending - Pod scheduled but not yet running
- CrashLoop - Pod in CrashLoopBackOff state
- Error - Pod failed or ImagePullBackOff
- Missing - Pod/deployment deleted
Zellij:
- Healthy - Zellij session active
- Missing - Zellij session not found
- Error - Zellij process exited
Sprites:
- Healthy - Container running
- Hibernated - Container suspended
- Error - Container failed
- Missing - Container not found on sprites.dev
Apple Container:
- Healthy - Container running
- Stopped - Container stopped
- Error - Container failed
- Missing - Container deleted
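To make the Docker mapping above concrete, here is a minimal sketch. `docker_health` is a hypothetical helper, not part of Clauderon, and it assumes the status strings are those Docker reports for a container (such as `running` and `exited`). Treating a clean exit (code 0) as Stopped rather than Error is also an assumption.

```python
from typing import Optional


def docker_health(status: Optional[str], exit_code: int = 0) -> str:
    """Map a Docker container status to a health state name.

    `status` is None when the container cannot be found at all.
    """
    if status is None:
        return "Missing"  # container deleted
    if status == "running":
        return "Healthy"  # container running
    if status == "exited" and exit_code != 0:
        return "Error"  # container exited with error
    return "Stopped"  # container exists but is not running


print(docker_health("exited", exit_code=1))
```

The other backends could follow the same pattern with their own status vocabulary.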
Health Checking
Automatic Health Checks
Clauderon automatically checks session health:
- On access - When you attach to, view, or interact with a session
- Periodic - Background health checks (configurable interval)
- On reconciliation - During reconciliation attempts
Manual Health Check
Check session health via API:
GET /api/sessions/{id}/health
Response:
{
  "session_id": "abc123",
  "health": "Error",
  "details": {
    "container_status": "exited",
    "exit_code": 1,
    "error_message": "OCI runtime error"
  },
  "available_actions": ["recreate", "recreate_fresh", "cleanup"],
  "data_preservation": {
    "recreate": true,
    "recreate_fresh": false,
    "cleanup": false
  },
  "last_check": "2025-01-28T12:34:56Z"
}
Health in User Interfaces
Web UI:
- Health badge on session card
- Detailed health view in session detail page
- Action buttons based on available actions
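As an illustration of how a client might derive action buttons from the health response shown under Manual Health Check, the sketch below pairs each available action with its data-preservation flag. `action_buttons` is a hypothetical helper, and treating an action with no `data_preservation` entry as not preserving data is an assumption made to stay on the safe side.

```python
# Sample payload copied from the health response example in this guide.
SAMPLE_RESPONSE = {
    "session_id": "abc123",
    "health": "Error",
    "available_actions": ["recreate", "recreate_fresh", "cleanup"],
    "data_preservation": {
        "recreate": True,
        "recreate_fresh": False,
        "cleanup": False,
    },
}


def action_buttons(health_response: dict) -> list:
    """Return one entry per available action, in the order the API lists them."""
    preservation = health_response.get("data_preservation", {})
    return [
        {"action": action, "preserves_data": preservation.get(action, False)}
        for action in health_response.get("available_actions", [])
    ]


for button in action_buttons(SAMPLE_RESPONSE):
    print(button)
```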
TUI:
- Color-coded session list (green=healthy, yellow=stopped, red=error)
- Press h on a session to show the health modal
- Health modal shows state, actions, and data preservation
CLI:
# View session status (includes health)
clauderon status <session-name>
# Detailed health info
clauderon inspect <session-name>
Available Actions by State
Actions you can take depend on the current health state:
Healthy State
| Action | Effect | Preserves Data |
|---|---|---|
| Recreate | Rebuild container | ✅ Yes |
| Cleanup | Delete all resources | ❌ No |
Use case: Force rebuild without stopping first
Stopped State
| Action | Effect | Preserves Data |
|---|---|---|
| Start | Start existing container | ✅ Yes |
| Recreate | Rebuild container | ✅ Yes |
| Cleanup | Delete all resources | ❌ No |
Use case: Resume stopped session or rebuild if needed
Hibernated State
| Action | Effect | Preserves Data |
|---|---|---|
| Wake | Resume from hibernation | ✅ Yes |
| Recreate | Rebuild container | ✅ Yes |
| Cleanup | Delete all resources | ❌ No |
Use case: Wake to continue working or rebuild if corrupted
Error State
| Action | Effect | Preserves Data |
|---|---|---|
| Recreate | Rebuild with existing clone | ✅ Yes (uncommitted changes) |
| Recreate Fresh | Rebuild with fresh clone | ⚠️ Partial (committed only) |
| Cleanup | Delete all resources | ❌ No |
Use case: Fix broken container while preserving work
CrashLoop State
| Action | Effect | Preserves Data |
|---|---|---|
| Recreate Fresh | Rebuild with fresh clone | ⚠️ Partial (committed only) |
| Cleanup | Delete all resources | ❌ No |
Use case: Container won’t start - fresh rebuild likely needed
Missing State
| Action | Effect | Preserves Data |
|---|---|---|
| Recreate | Rebuild with existing clone | ✅ Yes (if clone exists) |
| Recreate Fresh | Rebuild with fresh clone | ⚠️ Partial (committed only) |
| Cleanup | Delete all resources | ❌ No |
Use case: Container deleted externally - recreate from database state
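The per-state tables above can be read as a lookup from health state to allowed actions. The sketch below encodes them that way and picks the least destructive option. The names `ACTIONS_BY_STATE` and `least_destructive` and the labels `full`, `partial`, and `none` are illustrative assumptions, not Clauderon identifiers.

```python
# Actions per state, in table order, with their data-preservation level.
ACTIONS_BY_STATE = {
    "Healthy": {"Recreate": "full", "Cleanup": "none"},
    "Stopped": {"Start": "full", "Recreate": "full", "Cleanup": "none"},
    "Hibernated": {"Wake": "full", "Recreate": "full", "Cleanup": "none"},
    "Error": {"Recreate": "full", "Recreate Fresh": "partial", "Cleanup": "none"},
    "CrashLoop": {"Recreate Fresh": "partial", "Cleanup": "none"},
    "Missing": {"Recreate": "full", "Recreate Fresh": "partial", "Cleanup": "none"},
}

# Lower rank means more data is preserved.
_RANK = {"full": 0, "partial": 1, "none": 2}


def least_destructive(state: str) -> str:
    """Return the action that preserves the most data for this state.

    Ties go to the action listed first in the table.
    """
    actions = ACTIONS_BY_STATE[state]
    return min(actions, key=lambda name: _RANK[actions[name]])


print(least_destructive("CrashLoop"))
```

Pending and DeletedExternally are left out because this guide gives no action table for them.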
Data Preservation
Understanding what each action preserves:
Preserves Everything (✅)
Actions: Start, Wake, Recreate (if git clone exists)
Preserved:
- Session chat history and metadata
- Git repository state (committed and uncommitted)
- Container filesystem (if recreating from existing clone)
- Environment variables and configuration
Lost:
- Running processes (must restart)
- In-memory state
Preserves Committed Changes Only (⚠️)
Actions: Recreate Fresh
Preserved:
- Session chat history and metadata
- Git repository committed changes
- Configuration and settings
Lost:
- Uncommitted changes (git working directory)
- Untracked files
- Container filesystem state
Destroys Everything (❌)
Actions: Cleanup
Destroyed:
- Session record in database
- All git repository data
- All container resources
- Chat history and metadata
Irreversible! Only use when you’re sure.
Reconciliation System
Reconciliation automatically recovers sessions from failures.
What Reconciliation Does
Reconciliation attempts to:
- Detect failures - Check for error states
- Determine cause - Analyze why resource failed
- Apply fix - Recreate, restart, or clean up
- Verify recovery - Confirm session is healthy again
Reconciliation Triggers
Automatic (if enabled):
- On Clauderon startup (feature flag: reconcile_on_startup)
- After backend errors during operations
- Periodic background reconciliation (future feature)
Manual:
clauderon reconcile [session-name]
Reconciles all sessions, or only the named session if one is given.
Reconciliation Attempts
Reconciliation uses exponential backoff:
| Attempt | Delay | Total Time |
|---|---|---|
| 1 | 30 seconds | 30s |
| 2 | 2 minutes | 2m 30s |
| 3 | 5 minutes | 7m 30s |
| Max | Stops after 3 attempts | - |
After 3 failed attempts, reconciliation stops and the session remains in its error state until you intervene manually.
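The arithmetic behind the Total Time column can be checked with a short sketch. The delays are copied from the table rather than derived from `initial_delay` and `backoff_multiplier`, and `cumulative_wait` and `next_attempt_at` are hypothetical helper names. It also assumes each delay is measured from the previous attempt.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Delay before attempts 1, 2 and 3, as listed in the table.
RETRY_DELAYS = [timedelta(seconds=30), timedelta(minutes=2), timedelta(minutes=5)]


def cumulative_wait(attempt: int) -> timedelta:
    """Total time elapsed once the given attempt (1-based) has started."""
    return sum(RETRY_DELAYS[:attempt], timedelta())


def next_attempt_at(last_attempt: datetime, attempts_so_far: int) -> Optional[datetime]:
    """When the next attempt is due, or None once all attempts are used up."""
    if attempts_so_far >= len(RETRY_DELAYS):
        return None  # manual intervention required
    return last_attempt + RETRY_DELAYS[attempts_so_far]


last = datetime(2025, 1, 28, 12, 30, tzinfo=timezone.utc)
print(cumulative_wait(3), next_attempt_at(last, 2))
```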
Reconciliation Tracking
Each session tracks reconciliation status:
-- In database
reconciliation_attempts: 2
last_reconciliation_at: "2025-01-28T12:30:00Z"
reconciliation_error: "OCI runtime create failed"
View via API:
GET /api/sessions/{id}
{
  "reconciliation": {
    "attempts": 2,
    "last_attempt": "2025-01-28T12:30:00Z",
    "next_attempt": "2025-01-28T12:35:00Z",
    "error": "OCI runtime create failed"
  }
}
Reconciliation Strategies by State
| State | Reconciliation Action |
|---|---|
| Error | Attempt recreate |
| Missing | Attempt recreate (if clone exists) |
| CrashLoop | Wait, then attempt recreate fresh |
| DeletedExternally | Mark as missing, attempt recreate |
| Stopped | Do nothing (intentional stop) |
| Hibernated | Do nothing (intentional hibernation) |
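The strategy table above amounts to a small dispatch function, sketched below. `reconciliation_action` is a hypothetical name, the returned strings reuse the action names from the health API example, and skipping reconciliation when no clone exists is an assumption, since the table does not say what happens in that case. The wait before a CrashLoop recreate is left out.

```python
from typing import Optional


def reconciliation_action(state: str, clone_exists: bool = True) -> Optional[str]:
    """Return the action reconciliation would attempt, or None to do nothing."""
    if state in ("Stopped", "Hibernated"):
        return None  # intentional stop or hibernation
    if state == "DeletedExternally":
        state = "Missing"  # marked as missing, then handled like Missing
    if state == "Error":
        return "recreate"
    if state == "Missing":
        # Assumption: without an existing clone there is nothing to recreate from.
        return "recreate" if clone_exists else None
    if state == "CrashLoop":
        return "recreate_fresh"  # attempted after waiting out the backoff
    return None  # Healthy, Pending, or an unknown state


print(reconciliation_action("DeletedExternally"))
```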
Recovery Workflows
Via TUI
1. View Health
   - Navigate to session in TUI
   - Press h to show health modal
2. Choose Action
   - Health modal shows available actions
   - Select action with arrow keys
   - Press Enter to confirm
3. Confirm Data Impact
   - TUI shows data preservation indicator
   - ✅ Green = preserves data
   - ⚠️ Yellow = partial preservation
   - ❌ Red = destructive
4. Execute
   - Action executes immediately
   - TUI shows progress
   - Session state updates when complete
Via Web UI
1. Open Session
   - Click on session in session list
   - Session detail page opens
2. View Health Status
   - Health badge shows current state
   - “Actions” dropdown shows available actions
3. Select Action
   - Click action (Start, Wake, Recreate, etc.)
   - Confirmation dialog appears
4. Confirm
   - Dialog shows data preservation info
   - Click “Confirm” to proceed
5. Monitor Progress
   - Progress indicator shows recovery status
   - Page refreshes when complete
Via CLI
Start stopped session:
clauderon start <session-name>
Wake hibernated session:
clauderon wake <session-name>
Recreate failed session:
clauderon recreate <session-name>
# or for fresh rebuild
clauderon recreate <session-name> --fresh
Cleanup session:
clauderon cleanup <session-name>
# or delete entirely
clauderon delete <session-name>
Via API
Start:
POST /api/sessions/{id}/start
Wake:
POST /api/sessions/{id}/wake
Recreate:
POST /api/sessions/{id}/recreate
Recreate Fresh:
POST /api/sessions/{id}/recreate-fresh
Cleanup:
POST /api/sessions/{id}/cleanup
Crash Loop Detection
Crash loop detection applies only to the Kubernetes backend.
What is CrashLoop?
Kubernetes puts pods in CrashLoopBackOff when they repeatedly fail to start:
- Container exits immediately after starting
- Kubernetes tries to restart
- Container fails again
- Backoff delay increases each time
Common Causes
- Invalid container image - Image doesn’t exist or is corrupted
- Missing dependencies - Required libraries not in image
- Configuration errors - Invalid environment variables or config
- Resource limits - Not enough CPU/memory to start
- Command errors - Entrypoint or command fails
Detection
Clauderon detects CrashLoop by:
- Checking pod status for CrashLoopBackOff
- Monitoring container restart count
- Analyzing pod events
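The first two checks can be sketched against the pod JSON that `kubectl get pod <pod-name> -o json` prints. The field names (`status.containerStatuses`, `state.waiting.reason`, `restartCount`) come from the Kubernetes Pod API. `crash_loop_info` is a hypothetical helper and does not show how Clauderon itself implements detection. Pod event analysis is omitted.

```python
def crash_loop_info(pod: dict) -> dict:
    """Report whether any container is in CrashLoopBackOff, plus the highest restart count."""
    statuses = pod.get("status", {}).get("containerStatuses") or []
    crash_looping = False
    max_restarts = 0
    for container in statuses:
        # `waiting` is only present while the container is not running.
        waiting = (container.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            crash_looping = True
        max_restarts = max(max_restarts, container.get("restartCount", 0))
    return {"crash_looping": crash_looping, "max_restart_count": max_restarts}


sample_pod = {
    "status": {
        "containerStatuses": [
            {"restartCount": 7, "state": {"waiting": {"reason": "CrashLoopBackOff"}}}
        ]
    }
}
print(crash_loop_info(sample_pod))
```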
Recovery
Automatic reconciliation:
- Waits for backoff period
- Attempts recreate fresh (resets restart count)
Manual recovery:
# Recreate with fresh clone
clauderon recreate <session-name> --fresh
# Or cleanup and start over
clauderon cleanup <session-name>
clauderon create <session-name> --backend kubernetes
Container Restart Policies
Clauderon uses these restart policies:
| Backend | Restart Policy | Notes |
|---|---|---|
| Docker | unless-stopped | Restarts unless manually stopped |
| Kubernetes | Always | Always restarts, may enter CrashLoop |
| Zellij | N/A | No restart (process management) |
| Sprites | Automatic | sprites.dev manages restarts |
| Apple Container | unless-stopped | Similar to Docker |
Troubleshooting
Reconciliation Failures
Problem: Session fails to reconcile after 3 attempts
Diagnosis:
# Check reconciliation status
clauderon inspect <session-name>
# View reconciliation errors
# (in database or via API)
Common causes:
- Backend resource limits reached (Docker/K8s)
- Network issues (Sprites)
- Corrupted git clone
- Invalid configuration
Solutions:
# Try manual recreate fresh
clauderon recreate <session-name> --fresh
# Or cleanup and recreate from scratch
clauderon cleanup <session-name>
clauderon create <session-name>
Persistent Health Errors
Problem: Session always returns to error state
Diagnosis:
1. Check backend logs:
   # Docker
   docker logs <container-id>
   # Kubernetes
   kubectl logs <pod-name>
2. Check session logs:
   clauderon logs <session-name>
3. Inspect session configuration:
   clauderon inspect <session-name>
Solutions:
- Fix underlying backend issue (disk space, permissions, etc.)
- Update session configuration (resource limits, image, etc.)
- Recreate with different backend
Orphaned Resources
Problem: Backend resources exist but Clauderon lost track
Symptoms:
- Session shows as “Missing” but container still running
- docker ps shows the container, but Clauderon doesn’t see it
- Backend resources still consuming CPU, memory, or disk but not accessible through Clauderon
Solutions:
# Cleanup orphaned resources manually
docker stop <container-id>
docker rm <container-id>
# Or use backend-specific cleanup
kubectl delete pod <pod-name>
# Then cleanup in Clauderon
clauderon cleanup <session-name>
External Deletion
Problem: Someone deleted container/pod outside Clauderon
Detection:
- Session shows “DeletedExternally” state
- Health check fails with “not found”
Recovery:
# Recreate from database state
clauderon recreate <session-name>
# Or cleanup and start fresh
clauderon cleanup <session-name>
Slow Reconciliation
Problem: Reconciliation takes too long
Causes:
- Large git repository (slow to clone)
- Slow container image pull
- Backend resource contention
- Network latency (Sprites)
Solutions:
- Use local backends for faster recovery
- Pre-pull container images
- Increase backend resources
- Use smaller repositories when possible
Configuration
Reconciliation Settings
[features]
# Enable reconciliation on startup
reconcile_on_startup = true

[reconciliation]
# First retry delay (seconds)
initial_delay = 30
# Backoff multiplier
backoff_multiplier = 2.0
# Maximum attempts
max_attempts = 3
# Maximum delay (seconds)
max_delay = 300
Health Check Settings
[health]
# Health check interval (seconds)
check_interval = 60
# Timeout for health checks (seconds)
check_timeout = 10
# Enable background health checks
background_checks = false
Best Practices
- Monitor health - Regularly check session health in TUI/Web UI
- Enable reconciliation - Set reconcile_on_startup = true for automatic recovery
- Understand data preservation - Know what each action preserves before executing
- Start small - Try “Start” or “Wake” before “Recreate”
- Commit work - Commit changes before risky operations
- Check logs - Always check logs before recreating
- Cleanup promptly - Remove failed sessions you won’t recover
- Use fresh carefully - “Recreate Fresh” loses uncommitted work
- Test recovery - Practice recovery workflows before you need them
- Document issues - Note error messages for troubleshooting
See Also
- Docker Backend - Docker-specific health checks
- Kubernetes Backend - K8s crash loop detection
- Sprites Backend - Hibernation and wake
- API Reference - Health and recovery endpoints
- Troubleshooting - General troubleshooting guide