Operations Manual
This manual covers routine operational tasks for a running AuthNexus deployment, including monitoring, database maintenance, certificate lifecycle management, and common troubleshooting scenarios.
Monitoring
Node Health
Each server node performs periodic checkins to the Control Plane via Channel 1. Monitor node status through:
GET /admin/v1/nodes/overview # Aggregate node statistics
GET /admin/v1/nodes # List all nodes with status
GET /admin/v1/nodes/:nid # Detailed node information
Key fields to watch:
| Field | Healthy Value | Action if Unhealthy |
|---|---|---|
| Node status | online | Check network, certificates, CP connectivity |
| Last checkin | Within 2 minutes | Investigate Channel 1 connectivity |
| SSE connected | true | Check Channel 2 (SSE may have degraded to polling) |
| OCSP last applied | Recent timestamp | Check Channel 4 (OCSP fetch) |
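The checks in the table above can be scripted against a node record. The sketch below is illustrative only: the field names `status`, `last_checkin`, and `sse_connected` are assumptions, since the actual response schema of the node endpoints is not shown here, and the OCSP check is left out because "recent" has no stated threshold.

```python
from datetime import datetime, timedelta, timezone

# Healthy threshold taken from the table: last checkin within 2 minutes.
MAX_CHECKIN_AGE = timedelta(minutes=2)


def node_health_issues(node, now):
    """Return a list of problems for one node record (assumed field names)."""
    issues = []
    if node.get("status") != "online":
        issues.append("status not online: check network, certificates, CP connectivity")
    last = node.get("last_checkin")  # assumed ISO 8601 string with UTC offset
    if last is None or now - datetime.fromisoformat(last) > MAX_CHECKIN_AGE:
        issues.append("last checkin older than 2 minutes: investigate Channel 1")
    if not node.get("sse_connected", False):
        issues.append("SSE not connected: check Channel 2")
    return issues
```

An empty list means the node passes all three checks; anything else can be forwarded to an alerting system.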
Certificate Expiration
Proactively monitor certificates approaching expiry:
GET /admin/v1/pki/expiring # Certificates nearing expiration
GET /admin/v1/pki/certs # Full certificate inventory
Set up a periodic check (daily recommended) and rotate certificates well before expiry. See Certificate Rotation below.
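A daily check could filter the certificate inventory by remaining lifetime. This is a minimal sketch that assumes each certificate record carries an ISO 8601 `not_after` field and an `id`; both names are hypothetical, and the 30-day window is only an example.

```python
from datetime import datetime, timedelta, timezone


def certs_expiring_within(certs, now, days=30):
    """Return certs whose assumed `not_after` is within `days` of `now`, soonest first.

    Certificates that have already expired are included as well.
    """
    cutoff = now + timedelta(days=days)
    soon = [c for c in certs if datetime.fromisoformat(c["not_after"]) <= cutoff]
    return sorted(soon, key=lambda c: datetime.fromisoformat(c["not_after"]))
```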
Dashboard Reports
The admin dashboard provides daily aggregate reports:
GET /admin/v1/reports/login-daily # User login trends
GET /admin/v1/reports/card-generated-daily # Card key generation
GET /admin/v1/reports/card-activated-daily # Card key activation
GET /admin/v1/reports/agent-login-daily # Admin login activity
Online User Presence
Real-time online user counts are aggregated from node checkin reports:
GET /admin/v1/presence # Current online users across all nodes
Log Analysis
Log Categories
| Log Prefix | Source | What to Watch For |
|---|---|---|
| [CP Agent] | Server node | Connection failures, config apply errors |
| [Security] | Server node | Blacklist sync issues, epoch propagation delays |
| [TLS] | Both | Handshake failures, certificate validation errors |
| [Setup] | Control Plane | First-run initialization problems |
| [PKI] | Control Plane | Certificate issuance and revocation events |
| [SSE] | Both | Event stream disconnections, degraded mode entry |
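A quick way to see which category is noisy is to count log lines per prefix. The sketch below assumes only that each log line contains one of the bracketed prefixes from the table; the rest of the line format is not specified here and is not parsed.

```python
import re
from collections import Counter

# Prefixes taken from the table above.
KNOWN_PREFIXES = ("CP Agent", "Security", "TLS", "Setup", "PKI", "SSE")
PREFIX_RE = re.compile(r"\[(" + "|".join(re.escape(p) for p in KNOWN_PREFIXES) + r")\]")


def count_by_prefix(lines):
    """Count log lines per known prefix; lines without a known prefix are ignored."""
    counts = Counter()
    for line in lines:
        match = PREFIX_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

A sudden jump in the `TLS` or `SSE` counts between two time windows is a reasonable trigger for a closer look at the corresponding troubleshooting section.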
Audit Log Queries
Security-sensitive operations are tracked in structured audit logs:
GET /admin/v1/audit-logs # User-related audit events
GET /admin/v1/agent-audit-logs # Admin operation audit trail
GET /admin/v1/pki/audit-logs # PKI certificate operations
GET /admin/v1/nodes/audit-logs # Node management events
GET /admin/v1/login-logs # User login attempts
GET /admin/v1/agent-login-logs # Admin login attempts
GET /admin/v1/nodes/auth-events # Node authentication events
All audit endpoints support pagination and time-range filtering.
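The query parameter names for pagination and time ranges are not documented in this section. The sketch below uses `from`, `to`, `page`, and `page_size` purely as placeholders, along with a made-up base URL, to show how a paged, time-bounded request might be assembled; check the Admin API reference for the real names.

```python
from urllib.parse import urlencode


def audit_query_url(base_url, endpoint, start, end, page=1, page_size=100):
    """Build a paged, time-bounded audit query URL.

    NOTE: `from`, `to`, `page`, and `page_size` are hypothetical parameter names.
    """
    params = {"from": start, "to": end, "page": page, "page_size": page_size}
    return f"{base_url.rstrip('/')}{endpoint}?{urlencode(params)}"
```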
Database Maintenance
SQLite
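- Backup: take backups while the process is running using SQLite's online backup mechanism (the sqlite3 shell's .backup command, or VACUUM INTO). A plain file copy of a live WAL-mode database is only consistent if the -wal file is captured in the same state, which an ordinary copy cannot guarantee.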
- Vacuum: run VACUUM periodically (monthly) during low-traffic windows to reclaim space.
- WAL checkpoint: SQLite auto-checkpoints, but manual PRAGMA wal_checkpoint(TRUNCATE) can be run if the WAL file grows large.
- Busy timeout: configured via --sqlite-busy-timeout (default 5000ms). Increase if you see SQLITE_BUSY errors under load.
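The maintenance steps above can be combined into one script. This is a minimal sketch using Python's standard `sqlite3` module; the file paths are placeholders, and it takes the backup through the online backup API (`Connection.backup`) rather than a raw file copy. Run it in a low-traffic window, since `VACUUM` rewrites the whole database.

```python
import sqlite3


def backup_and_maintain(db_path, backup_path, busy_timeout_s=5.0):
    """Take an online backup, then truncate the WAL and vacuum the source database."""
    # isolation_level=None keeps the connection in autocommit mode, which VACUUM requires.
    src = sqlite3.connect(db_path, timeout=busy_timeout_s, isolation_level=None)
    try:
        dst = sqlite3.connect(backup_path)
        try:
            src.backup(dst)  # page-by-page copy that stays consistent with live writers
        finally:
            dst.close()
        # Fetch the result row so the statement is finished before VACUUM runs.
        src.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchall()
        src.execute("VACUUM")
    finally:
        src.close()
```

Because the Control DB and Runtime DB are separate, a script like this would be run once per database file.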
PostgreSQL
- Backup: use pg_dump for logical backups or continuous archiving with WAL shipping.
- Vacuum: ensure autovacuum is enabled (default in PostgreSQL). Monitor bloat with pg_stat_user_tables.
- Connection pooling: AuthNexus manages its own connection pool. External poolers (PgBouncer) are optional but can help with many server nodes sharing one PG instance.
- Timeouts: tuned via --db-connect-timeout, --db-statement-timeout, and --db-lock-timeout.
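One rough bloat indicator is the share of dead tuples per table. The sketch below works on rows already fetched from `pg_stat_user_tables` (the `relname`, `n_live_tup`, and `n_dead_tup` columns exist in that view); the 20% threshold is an arbitrary starting point, not an AuthNexus recommendation, and the table names in any sample data are made up.

```python
def tables_needing_vacuum(rows, max_dead_ratio=0.2):
    """Flag tables whose dead-tuple share exceeds `max_dead_ratio`, worst first.

    `rows` are dicts with relname, n_live_tup, n_dead_tup from pg_stat_user_tables.
    """
    flagged = []
    for row in rows:
        total = row["n_live_tup"] + row["n_dead_tup"]
        if total == 0:
            continue  # empty table, nothing to reclaim
        ratio = row["n_dead_tup"] / total
        if ratio > max_dead_ratio:
            flagged.append((row["relname"], round(ratio, 3)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

Tables that stay flagged across several checks suggest autovacuum is not keeping up and its settings for those tables may need tuning.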
Database Separation
Remember that the Control DB and Runtime DB are always separate. Maintenance must be performed independently on each database.
Certificate Rotation
Node Server Certificate Rotation
POST /admin/v1/nodes/:nid/certs/rotate
This triggers asynchronous PKI jobs that:
- Issue a new server certificate from tcp_server_ca.
- Issue a new CP client certificate from cp_node_client_ca.
- Rebuild the node deploy package.
- Deliver updated certificates via config pull.
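Because these jobs run asynchronously, automation that triggers a rotation usually has to wait for them to finish. The following is a minimal polling sketch: `fetch_job` stands in for a call to `GET /admin/v1/pki/jobs/:id`, and the `status` field and its values (`succeeded`, `failed`, `cancelled`) are assumptions rather than documented API values.

```python
import time

# Assumed terminal status names; confirm against the Admin API reference.
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def wait_for_job(fetch_job, job_id, interval_s=2.0, timeout_s=300.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_job(job_id) until the job is terminal or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        job = fetch_job(job_id)
        if job["status"] in TERMINAL_STATES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']} after {timeout_s}s")
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the function easy to test without real delays. A caller would inspect the returned status and decide whether to retry a failed job.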
Monitor job progress:
GET /admin/v1/pki/jobs # List all PKI jobs
GET /admin/v1/pki/jobs/:id # Job detail and status
POST /admin/v1/pki/jobs/:id/retry # Retry a failed job
POST /admin/v1/pki/jobs/:id/cancel # Cancel a pending job
Application Client CA Rotation
- Generate the new CA.
- Publish a new app_client_ca_bundle trust bundle.
- Wait for all nodes to pull and apply the updated bundle.
- Begin issuing client certificates from the new CA.
- Optionally revoke the old CA after a migration window.
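The ordering matters because a node that has not yet applied the new bundle would presumably reject client certificates issued by the new CA. A gate for the "wait for all nodes" step might look like the sketch below; the `id` and `applied_bundle_version` fields are hypothetical, as this section does not say how a node reports which bundle it has applied.

```python
def nodes_missing_bundle(nodes, bundle_version):
    """Return IDs of nodes that have not applied the given trust bundle version.

    `applied_bundle_version` is an assumed field name.
    """
    return [n["id"] for n in nodes if n.get("applied_bundle_version") != bundle_version]


def safe_to_issue_from_new_ca(nodes, bundle_version):
    """True only when every node reports the new bundle as applied."""
    return not nodes_missing_bundle(nodes, bundle_version)
```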
Node Certificate Reissuance
If a node's certificate is compromised or needs replacement:
POST /admin/v1/nodes/:nid/reissue-package
After reissuance, the old certificate files must be removed from the node before it can re-enroll with the new credentials.
Node Lifecycle
Creating a Node
POST /admin/v1/nodes
This creates the node record and triggers a NodeOnboard PKI job that provisions certificates. Download the deploy package after the job completes:
GET /admin/v1/nodes/:nid/deploy-packages
GET /admin/v1/nodes/:nid/deploy-packages/:pid/download
Disabling / Enabling a Node
POST /admin/v1/nodes/:nid/disable
POST /admin/v1/nodes/:nid/enable
The OCSP response for a disabled node reports its certificate as revoked, so SDK clients reject handshakes with that node.
Deleting a Node
DELETE /admin/v1/nodes/:nid # Hard delete (production-grade)
This permanently removes the node and its associated certificates.
Troubleshooting
Node Cannot Connect to CP
Symptoms: node status "offline", no checkin records.
- Verify network connectivity between the node and CP on port 9091.
- Check that the node's node_agent.json has the correct CP address.
- Verify mTLS certificates: the node's client cert must be signed by cp_node_client_ca, and the CP server cert must be signed by cp_server_ca.
- Check CP logs for TLS handshake errors.
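The first check, plain network reachability, can be run from the node with a few lines of standard-library Python. This sketch only confirms that a TCP connection to port 9091 can be opened; it does not perform the mTLS handshake, so a success here rules out routing and firewall problems but not certificate problems.

```python
import socket


def can_reach(host, port=9091, timeout_s=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```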
SSE Degraded Mode
Symptoms: [SSE] log messages indicating disconnection; polling intervals tighten.
- The system continues to function via polling -- this is not a critical outage.
- Check for network interruptions or firewalls dropping long-lived connections.
- Verify the CP process is healthy and not overloaded.
- SSE will auto-reconnect when the connection is restored.
SDK Handshake Failures
Symptoms: SDK clients fail to connect, ErrorCode::TlsHandshakeFailed.
- Verify the SDK's trust policy: the bundle at server_ca_bundle_path must include tcp_server_ca.
- Check that the SPKI pins match the current server certificate.
- Verify the client certificate is signed by app_client_ca and has the correct clientAuth EKU and URI SAN.
- If must-staple is enabled, ensure the node has a valid (non-revoked) OCSP staple.
Stale Session After Password Reset
Symptoms: user can still access resources after password change.
- Verify the auth epoch was bumped (check auth_epoch_changes in the Control DB).
- Check Channel 3 delta pull is functioning (node should pull within seconds).
- In degraded SSE mode, expect up to 5 seconds of propagation delay.
Database Lock Contention (SQLite)
Symptoms: SQLITE_BUSY errors in logs.
- Increase --sqlite-busy-timeout (default 5000ms).
- Check for long-running queries or transactions.
- Consider migrating to PostgreSQL for high-concurrency workloads.
Next Steps
- Deployment Guide -- initial setup and configuration
- Security Model -- understanding fail-closed semantics
- Admin API -- complete API endpoint reference