Operations Manual

This manual covers routine operational tasks for a running AuthNexus deployment, including monitoring, database maintenance, certificate lifecycle management, and common troubleshooting scenarios.

Monitoring

Node Health

Each server node checks in periodically with the Control Plane (CP) over Channel 1. Monitor node status through these endpoints:

GET /admin/v1/nodes/overview    # Aggregate node statistics
GET /admin/v1/nodes             # List all nodes with status
GET /admin/v1/nodes/:nid        # Detailed node information

Key fields to watch:

Field	Healthy Value	Action if Unhealthy
Node status	`online`	Check network, certificates, CP connectivity
Last checkin	Within 2 minutes	Investigate Channel 1 connectivity
SSE connected	`true`	Check Channel 2 (SSE may have degraded to polling)
OCSP last applied	Recent timestamp	Check Channel 4 (OCSP fetch)
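
The table above can be turned into an automated check. The following is a minimal sketch rather than AuthNexus code: the JSON field names (`status`, `last_checkin`, `sse_connected`) and the ISO 8601 timestamp format are assumptions, so adjust them to whatever GET /admin/v1/nodes actually returns.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# Threshold taken from the table above: a healthy node checked in within 2 minutes.
MAX_CHECKIN_AGE = timedelta(minutes=2)


def node_health_issues(node: dict, now: Optional[datetime] = None) -> List[str]:
    """Return human-readable problems for one node record.

    The keys read here (status, last_checkin, sse_connected) are
    hypothetical; the real API may name them differently.
    """
    now = now or datetime.now(timezone.utc)
    issues = []

    if node.get("status") != "online":
        issues.append("status is not online: check network, certificates, CP connectivity")

    last_checkin = node.get("last_checkin")
    if last_checkin is None:
        issues.append("no checkin recorded: investigate Channel 1 connectivity")
    elif now - datetime.fromisoformat(last_checkin) > MAX_CHECKIN_AGE:
        issues.append("last checkin is stale: investigate Channel 1 connectivity")

    if node.get("sse_connected") is not True:
        issues.append("SSE not connected: check Channel 2 (may have degraded to polling)")

    return issues
```

An empty list means the node passed every check; anything else can be forwarded to an alerting channel.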

Certificate Expiration

Proactively monitor certificates approaching expiry:

GET /admin/v1/pki/expiring    # Certificates nearing expiration
GET /admin/v1/pki/certs       # Full certificate inventory

Set up a periodic check (daily recommended) and rotate certificates well before expiry. See Certificate Rotation below.
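
A daily check could filter the certificate inventory by remaining lifetime. This is a sketch under stated assumptions: the record keys `serial` and `not_after`, the ISO 8601 timestamp format, and the 30-day warning window are illustrative and do not come from the AuthNexus API reference.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple


def certs_needing_rotation(
    certs: List[dict],
    now: Optional[datetime] = None,
    warn_days: int = 30,  # illustrative threshold, not a documented default
) -> List[Tuple[str, int]]:
    """Return (serial, days_left) for certificates expiring within warn_days.

    Results are sorted so the most urgent certificate comes first.
    Already-expired certificates appear with a negative days_left.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now + timedelta(days=warn_days)
    due = []
    for cert in certs:
        # "serial" and "not_after" are hypothetical field names.
        not_after = datetime.fromisoformat(cert["not_after"])
        if not_after <= cutoff:
            due.append((cert["serial"], (not_after - now).days))
    return sorted(due, key=lambda item: item[1])
```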

Dashboard Reports

The admin dashboard provides daily aggregate reports:

GET /admin/v1/reports/login-daily           # User login trends
GET /admin/v1/reports/card-generated-daily  # Card key generation
GET /admin/v1/reports/card-activated-daily  # Card key activation
GET /admin/v1/reports/agent-login-daily     # Admin login activity

Online User Presence

Real-time online user counts are aggregated from node checkin reports:

GET /admin/v1/presence    # Current online users across all nodes

Log Analysis

Log Categories

Log Prefix	Source	What to Watch For
`[CP Agent]`	Server node	Connection failures, config apply errors
`[Security]`	Server node	Blacklist sync issues, epoch propagation delays
`[TLS]`	Both	Handshake failures, certificate validation errors
`[Setup]`	Control Plane	First-run initialization problems
`[PKI]`	Control Plane	Certificate issuance and revocation events
`[SSE]`	Both	Event stream disconnections, degraded mode entry

Audit Log Queries

Security-sensitive operations are tracked in structured audit logs:

GET /admin/v1/audit-logs          # User-related audit events
GET /admin/v1/agent-audit-logs    # Admin operation audit trail
GET /admin/v1/pki/audit-logs      # PKI certificate operations
GET /admin/v1/nodes/audit-logs    # Node management events
GET /admin/v1/login-logs          # User login attempts
GET /admin/v1/agent-login-logs    # Admin login attempts
GET /admin/v1/nodes/auth-events   # Node authentication events

All audit endpoints support pagination and time-range filtering.
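
As an illustration of combining pagination with a time range, the sketch below builds a query URL for one of the audit endpoints. The query parameter names (`from`, `to`, `page`, `page_size`) are assumptions made for this example; consult the Admin API reference for the real names.

```python
from urllib.parse import urlencode


def audit_log_url(base_url, endpoint, start, end, page=1, page_size=100):
    """Build a paginated, time-filtered URL for an audit endpoint.

    start and end are timezone-aware datetime objects. The parameter
    names below are hypothetical placeholders.
    """
    query = urlencode({
        "from": start.isoformat(),
        "to": end.isoformat(),
        "page": page,
        "page_size": page_size,
    })
    return f"{base_url.rstrip('/')}{endpoint}?{query}"
```

Walking through pages then amounts to incrementing `page` until a response comes back empty.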

Database Maintenance

SQLite

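Backup: take backups of a running database with SQLite's online backup API (for example, the sqlite3 shell's .backup command) or with VACUUM INTO. Avoid a plain file copy while the process is running: even in WAL mode it can miss committed data that still lives in the -wal file or capture a half-written state.
Vacuum: run VACUUM periodically (for example, monthly) during low-traffic windows to reclaim space.
WAL checkpoint: SQLite checkpoints automatically, but you can run PRAGMA wal_checkpoint(TRUNCATE) manually if the WAL file grows large.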
Busy timeout: configured via --sqlite-busy-timeout (default 5000ms). Increase if you see SQLITE_BUSY errors under load.
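
These maintenance tasks can be scripted with Python's standard sqlite3 module. The sketch below is one possible approach, not part of AuthNexus: it uses SQLite's online backup API to take a consistent copy of a live database, then truncates the WAL and runs VACUUM. The function names and file paths are placeholders.

```python
import sqlite3


def backup_database(src_path, dest_path):
    """Copy a database using SQLite's online backup API.

    Unlike a plain file copy, this yields a consistent snapshot even
    while other connections are reading or writing the source.
    """
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)
    finally:
        dest.close()
        src.close()


def checkpoint_and_vacuum(db_path):
    """Truncate the WAL file, then rebuild the database to reclaim space.

    Returns True if the checkpoint completed without being blocked.
    """
    # isolation_level=None selects autocommit mode; VACUUM cannot run
    # inside an open transaction.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        busy, _log_frames, _checkpointed = conn.execute(
            "PRAGMA wal_checkpoint(TRUNCATE)"
        ).fetchone()
        conn.execute("VACUUM")
        return busy == 0
    finally:
        conn.close()
```

Run `checkpoint_and_vacuum` during a low-traffic window, since VACUUM rewrites the whole file and blocks writers while it runs.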

PostgreSQL

Backup: use pg_dump for logical backups or continuous archiving with WAL shipping.
Vacuum: ensure autovacuum is enabled (default in PostgreSQL). Monitor bloat with pg_stat_user_tables.
Connection pooling: AuthNexus manages its own connection pool. External poolers (PgBouncer) are optional but can help with many server nodes sharing one PG instance.
Timeouts: tuned via --db-connect-timeout, --db-statement-timeout, and --db-lock-timeout.

Database Separation

The Control DB and Runtime DB are always separate databases, so every maintenance task (backup, vacuum, checkpoint) must be performed on each one independently.

Certificate Rotation

Node Server Certificate Rotation

POST /admin/v1/nodes/:nid/certs/rotate

This triggers asynchronous PKI jobs that:

Issue a new server certificate from tcp_server_ca.
Issue a new CP client certificate from cp_node_client_ca.
Rebuild the node deploy package.
Deliver updated certificates via config pull.

Monitor job progress:

GET /admin/v1/pki/jobs          # List all PKI jobs
GET /admin/v1/pki/jobs/:id      # Job detail and status
POST /admin/v1/pki/jobs/:id/retry   # Retry a failed job
POST /admin/v1/pki/jobs/:id/cancel  # Cancel a pending job
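
Because the rotation jobs are asynchronous, automation typically polls the job detail endpoint until the job finishes. The sketch below shows one way to do that. The status strings (`succeeded`, `failed`, `cancelled`) and the `status` key are assumptions, and `fetch_job` is a caller-supplied function that wraps GET /admin/v1/pki/jobs/:id and returns the decoded JSON.

```python
import time


def wait_for_job(fetch_job, job_id, poll_seconds=2.0, timeout_seconds=300.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll a PKI job until it reaches a terminal state or the timeout expires.

    fetch_job(job_id) must return a dict with a "status" key. The
    terminal status names below are hypothetical.
    """
    terminal = {"succeeded", "failed", "cancelled"}
    deadline = clock() + timeout_seconds
    while True:
        job = fetch_job(job_id)
        if job["status"] in terminal:
            return job
        if clock() >= deadline:
            raise TimeoutError(
                f"PKI job {job_id} still {job['status']} after {timeout_seconds}s"
            )
        sleep(poll_seconds)
```

A caller can inspect the returned status and, for a failed job, decide whether to call the retry endpoint. Passing `sleep` and `clock` as parameters keeps the loop easy to test without real delays.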

Application Client CA Rotation

Generate the new CA.
Publish a new app_client_ca_bundle trust bundle.
Wait for all nodes to pull and apply the updated bundle.
Begin issuing client certificates from the new CA.
Optionally revoke the old CA after a migration window.
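
The ordering matters: a client certificate issued from the new CA will be rejected by any node that has not yet applied the new trust bundle. A rollout script could gate the issuance step on a check like the sketch below. The `id` and `applied_bundle_version` fields are hypothetical; how a node reports its applied bundle is not specified here.

```python
def nodes_blocking_rotation(nodes, new_bundle_version):
    """Return IDs of nodes that have not yet applied the new trust bundle.

    Issuing certificates from the new CA is only safe once this list is
    empty. Field names are illustrative assumptions.
    """
    return [
        node["id"]
        for node in nodes
        if node.get("applied_bundle_version") != new_bundle_version
    ]
```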

Node Certificate Reissuance

If a node's certificate is compromised or needs replacement:

POST /admin/v1/nodes/:nid/reissue-package

After reissuance, the old certificate files must be removed from the node before it can re-enroll with the new credentials.

Node Lifecycle

Creating a Node

POST /admin/v1/nodes

This creates the node record and triggers a NodeOnboard PKI job that provisions certificates. Download the deploy package after the job completes:

GET /admin/v1/nodes/:nid/deploy-packages
GET /admin/v1/nodes/:nid/deploy-packages/:pid/download

Disabling / Enabling a Node

POST /admin/v1/nodes/:nid/disable
POST /admin/v1/nodes/:nid/enable

A disabled node's OCSP response will reflect revocation, causing SDK clients to reject handshakes.

Deleting a Node

DELETE /admin/v1/nodes/:nid    # Hard delete (production-grade)

This permanently removes the node and its associated certificates.

Troubleshooting

Node Cannot Connect to CP

Symptoms: node status "offline", no checkin records.

Verify network connectivity between the node and CP on port 9091.
Check that the node's node_agent.json has the correct CP address.
Verify mTLS certificates: the node's client cert must be signed by cp_node_client_ca, and the CP server cert must be signed by cp_server_ca.
Check CP logs for TLS handshake errors.
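
The first check above can be automated from the node host with a plain TCP probe. This sketch tests reachability only; it does not perform the mTLS handshake, so a successful result still leaves the certificate checks to be done. The function name is a placeholder.

```python
import socket


def can_reach(host, port=9091, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 9091 is the CP port named in the steps above.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

If this returns False, look at routing and firewalls first; if it returns True but the node stays offline, move on to the certificate and log checks.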

SSE Degraded Mode

Symptoms: [SSE] log messages indicating disconnection; polling intervals tighten.

The system continues to function via polling -- this is not a critical outage.
Check for network interruptions or firewalls dropping long-lived connections.
Verify the CP process is healthy and not overloaded.
SSE will auto-reconnect when the connection is restored.

SDK Handshake Failures

Symptoms: SDK clients fail to connect and report ErrorCode::TlsHandshakeFailed.

Verify the SDK's trust policy: server_ca_bundle_path must trust the tcp_server_ca.
Check SPKI pins match the current server certificate.
Verify the client certificate is signed by app_client_ca and has the correct clientAuth EKU and URI SAN.
If must-staple is enabled, ensure the node has a valid (non-revoked) OCSP staple.

Stale Session After Password Reset

Symptoms: a user can still access resources after a password reset.

Verify that the auth epoch was bumped (check auth_epoch_changes in the Control DB).
Check that the Channel 3 delta pull is functioning (the node should pull within seconds).
In degraded SSE mode, expect up to 5 seconds of propagation delay.

Database Lock Contention (SQLite)

Symptoms: SQLITE_BUSY errors in logs.

Increase --sqlite-busy-timeout (default 5000ms).
Check for long-running queries or transactions.
Consider migrating to PostgreSQL for high-concurrency workloads.
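
To see what the busy timeout controls, the sketch below reproduces the contention with Python's standard sqlite3 module: a writer waits up to the timeout for a competing write lock to clear and fails with a "database is locked" error if it does not. The `events` table and the function name are made up for this example and are unrelated to the AuthNexus schema.

```python
import sqlite3


def try_write(db_path, busy_timeout_ms):
    """Attempt one write; return False if the database stayed locked.

    sqlite3.connect's timeout argument (in seconds) sets SQLite's busy
    timeout for the connection.
    """
    conn = sqlite3.connect(
        db_path, timeout=busy_timeout_ms / 1000.0, isolation_level=None
    )
    try:
        # BEGIN IMMEDIATE takes the write lock up front, so contention
        # surfaces here rather than in the middle of the transaction.
        conn.execute("BEGIN IMMEDIATE")
        conn.execute("INSERT INTO events (note) VALUES ('probe')")
        conn.execute("COMMIT")
        return True
    except sqlite3.OperationalError as exc:
        if "locked" in str(exc) or "busy" in str(exc):
            return False
        raise
    finally:
        conn.close()
```

A longer timeout lets short lock holders finish before the waiter gives up, but it cannot help when another transaction holds the write lock for longer than the timeout, which is why the second step is to look for long-running transactions.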

Next Steps

Deployment Guide -- initial setup and configuration
Security Model -- understanding fail-closed semantics
Admin API -- complete API endpoint reference
