
Operations Manual

This manual covers routine operational tasks for a running AuthNexus deployment, including monitoring, database maintenance, certificate lifecycle management, and common troubleshooting scenarios.

Monitoring

Node Health

Each server node performs periodic checkins to the Control Plane via Channel 1. Monitor node status through:

GET /admin/v1/nodes/overview    # Aggregate node statistics
GET /admin/v1/nodes             # List all nodes with status
GET /admin/v1/nodes/:nid        # Detailed node information

Key fields to watch:

Field              | Healthy Value     | Action if Unhealthy
Node status        | online            | Check network, certificates, CP connectivity
Last checkin       | Within 2 minutes  | Investigate Channel 1 connectivity
SSE connected      | true              | Check Channel 2 (SSE may have degraded to polling)
OCSP last applied  | Recent timestamp  | Check Channel 4 (OCSP fetch)
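The checks in the table can be automated against the node list. The following is a minimal sketch, not the product's own tooling: the JSON field names (`status`, `last_checkin`, `sse_connected`) are hypothetical and must be adapted to the actual response of GET /admin/v1/nodes. Timestamps are assumed to be ISO 8601 strings with a UTC offset. The OCSP check is omitted because the manual does not define how recent "recent" is.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical field names; adapt to the real GET /admin/v1/nodes payload.
MAX_CHECKIN_AGE = timedelta(minutes=2)


def node_problems(node, now=None):
    """Return a list of problems found in one node record (empty if healthy)."""
    now = now or datetime.now(timezone.utc)
    problems = []
    if node.get("status") != "online":
        problems.append("status not online: check network, certificates, CP connectivity")
    last = node.get("last_checkin")
    # Assumes an offset-aware ISO 8601 timestamp such as 2024-01-01T11:59:30+00:00.
    if last is None or now - datetime.fromisoformat(last) > MAX_CHECKIN_AGE:
        problems.append("last checkin older than 2 minutes: investigate Channel 1")
    if not node.get("sse_connected", False):
        problems.append("SSE not connected: check Channel 2")
    return problems
```

Running this over every entry in the node list and alerting on non-empty results gives a simple health probe.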

Certificate Expiration

Proactively monitor certificates approaching expiry:

GET /admin/v1/pki/expiring    # Certificates nearing expiration
GET /admin/v1/pki/certs       # Full certificate inventory

Set up a periodic check (daily recommended) and rotate certificates well before expiry. See Certificate Rotation below.
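A daily check could be sketched as follows. This is an illustration under assumptions: the `not_after` field name and the 30-day warning window are hypothetical, and the input is whatever certificate list GET /admin/v1/pki/certs actually returns.

```python
from datetime import datetime, timedelta, timezone


def certs_needing_rotation(certs, warn_days=30, now=None):
    """Return certificates expiring within warn_days, soonest first.

    Each cert is a dict with a hypothetical 'not_after' field holding an
    offset-aware ISO 8601 timestamp.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now + timedelta(days=warn_days)

    def expiry(cert):
        return datetime.fromisoformat(cert["not_after"])

    due = [c for c in certs if expiry(c) <= cutoff]
    return sorted(due, key=expiry)
```

Scheduling this from cron or a similar job runner and alerting on a non-empty result leaves enough lead time to rotate before expiry.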

Dashboard Reports

The admin dashboard provides daily aggregate reports:

GET /admin/v1/reports/login-daily           # User login trends
GET /admin/v1/reports/card-generated-daily  # Card key generation
GET /admin/v1/reports/card-activated-daily  # Card key activation
GET /admin/v1/reports/agent-login-daily     # Admin login activity

Online User Presence

Real-time online user counts are aggregated from node checkin reports:

GET /admin/v1/presence    # Current online users across all nodes

Log Analysis

Log Categories

Log Prefix  | Source         | What to Watch For
[CP Agent]  | Server node    | Connection failures, config apply errors
[Security]  | Server node    | Blacklist sync issues, epoch propagation delays
[TLS]       | Both           | Handshake failures, certificate validation errors
[Setup]     | Control Plane  | First-run initialization problems
[PKI]       | Control Plane  | Certificate issuance and revocation events
[SSE]       | Both           | Event stream disconnections, degraded mode entry

Audit Log Queries

Security-sensitive operations are tracked in structured audit logs:

GET /admin/v1/audit-logs          # User-related audit events
GET /admin/v1/agent-audit-logs    # Admin operation audit trail
GET /admin/v1/pki/audit-logs      # PKI certificate operations
GET /admin/v1/nodes/audit-logs    # Node management events
GET /admin/v1/login-logs          # User login attempts
GET /admin/v1/agent-login-logs    # Admin login attempts
GET /admin/v1/nodes/auth-events   # Node authentication events

All audit endpoints support pagination and time-range filtering.
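The manual does not specify the pagination or filter parameters, so the sketch below keeps the HTTP call behind a caller-supplied function. It assumes page-number pagination where an empty page marks the end; `fetch_page` and its keyword arguments are hypothetical names, not part of the AuthNexus API.

```python
def iter_audit_events(fetch_page, start, end, page_size=100):
    """Yield every audit event in [start, end], walking pages until one is empty.

    fetch_page is a hypothetical callable that performs the HTTP request,
    for example against GET /admin/v1/audit-logs, and returns a list of events.
    """
    page = 1
    while True:
        events = fetch_page(start=start, end=end, page=page, page_size=page_size)
        if not events:
            return
        yield from events
        page += 1
```

Because the generator is lazy, a caller can stop early (for example, after finding the event of interest) without fetching the remaining pages.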

Database Maintenance

SQLite

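  • Backup: take backups with SQLite's online backup API (for example, the sqlite3 shell's .backup command) or VACUUM INTO rather than a plain file copy. WAL mode lets the backup proceed while the process is running, but copying only the main database file can miss committed data still held in the -wal file.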
  • Vacuum: run VACUUM periodically (monthly) during low-traffic windows to reclaim space.
  • WAL checkpoint: SQLite auto-checkpoints, but manual PRAGMA wal_checkpoint(TRUNCATE) can be run if the WAL file grows large.
  • Busy timeout: configured via --sqlite-busy-timeout (default 5000ms). Increase if you see SQLITE_BUSY errors under load.
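These maintenance tasks can be scripted with Python's standard sqlite3 module. The sketch below is illustrative: the file paths are placeholders, and it uses SQLite's online backup API, which produces a consistent snapshot even while other connections are writing.

```python
import sqlite3


def backup_sqlite(src_path, dest_path):
    """Copy a live database to dest_path using SQLite's online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)
    finally:
        dest.close()
        src.close()


def checkpoint_and_vacuum(db_path, busy_timeout_ms=5000):
    """Truncate the WAL, then VACUUM. Returns True if the checkpoint was not blocked."""
    # isolation_level=None selects autocommit; VACUUM cannot run inside a transaction.
    conn = sqlite3.connect(db_path, timeout=busy_timeout_ms / 1000, isolation_level=None)
    try:
        # The pragma returns one row: (busy, wal_frames, checkpointed_frames).
        busy, _log, _done = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
        conn.execute("VACUUM")
        return busy == 0
    finally:
        conn.close()
```

Since the Control DB and Runtime DB are separate, each function would be called once per database file.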

PostgreSQL

  • Backup: use pg_dump for logical backups or continuous archiving with WAL shipping.
  • Vacuum: ensure autovacuum is enabled (default in PostgreSQL). Monitor bloat with pg_stat_user_tables.
  • Connection pooling: AuthNexus manages its own connection pool. External poolers (PgBouncer) are optional but can help with many server nodes sharing one PG instance.
  • Timeouts: tuned via --db-connect-timeout, --db-statement-timeout, and --db-lock-timeout.
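For bloat monitoring, one rough approach is to compare dead and live tuple counts from pg_stat_user_tables. The sketch below is an assumption-laden illustration: the dead-tuple ratio is only a proxy for bloat, the thresholds are arbitrary starting points, and running the query is left to whichever PostgreSQL client is already in use.

```python
# Columns referenced here exist in the pg_stat_user_tables statistics view.
BLOAT_QUERY = """
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
"""


def bloated_tables(rows, max_dead_ratio=0.2, min_dead=1000):
    """Flag tables whose share of dead tuples exceeds max_dead_ratio.

    rows are (relname, n_live_tup, n_dead_tup, last_autovacuum) tuples as
    returned by BLOAT_QUERY. Thresholds are illustrative, not recommendations.
    """
    flagged = []
    for relname, live, dead, _last_autovacuum in rows:
        total = live + dead
        if dead >= min_dead and total > 0 and dead / total > max_dead_ratio:
            flagged.append((relname, round(dead / total, 3)))
    return flagged
```

A table that stays flagged across several runs, with an old or empty last_autovacuum, suggests autovacuum is not keeping up on that table.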

Database Separation

The Control DB and Runtime DB are always separate databases, so backups, vacuuming, and other maintenance must be performed independently on each one.

Certificate Rotation

Node Server Certificate Rotation

POST /admin/v1/nodes/:nid/certs/rotate

This triggers asynchronous PKI jobs that:

  1. Issue a new server certificate from tcp_server_ca.
  2. Issue a new CP client certificate from cp_node_client_ca.
  3. Rebuild the node deploy package.
  4. Deliver updated certificates via config pull.

Monitor job progress:

GET /admin/v1/pki/jobs          # List all PKI jobs
GET /admin/v1/pki/jobs/:id      # Job detail and status
POST /admin/v1/pki/jobs/:id/retry   # Retry a failed job
POST /admin/v1/pki/jobs/:id/cancel  # Cancel a pending job
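Because rotation is asynchronous, a script typically polls the job until it reaches a terminal state. The following is a minimal sketch: the status names in `TERMINAL` and the `status` field are hypothetical, and `get_job` stands for a caller-supplied function that wraps GET /admin/v1/pki/jobs/:id.

```python
import time

# Hypothetical status names; check the actual job payload for the real values.
TERMINAL = {"succeeded", "failed", "cancelled"}


def wait_for_job(get_job, job_id, interval=2.0, timeout=300.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_job(job_id) until the job reaches a terminal status or timeout expires."""
    deadline = clock() + timeout
    while True:
        job = get_job(job_id)
        if job["status"] in TERMINAL:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"PKI job {job_id} still {job['status']} after {timeout}s")
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without real delays. A job that comes back failed can then be retried through the retry endpoint.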

Application Client CA Rotation

  1. Generate the new CA.
  2. Publish a new app_client_ca_bundle trust bundle.
  3. Wait for all nodes to pull and apply the updated bundle.
  4. Begin issuing client certificates from the new CA.
  5. Optionally revoke the old CA after a migration window.
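The ordering matters: certificates from the new CA should not be issued until every node trusts it. Step 3 can be gated with a check like the sketch below, where the `id` and `applied_bundle_version` fields are hypothetical stand-ins for whatever the node detail endpoint actually reports about the applied trust bundle.

```python
def nodes_missing_bundle(nodes, bundle_version):
    """Return IDs of nodes that have not yet applied the given trust bundle version.

    Field names are hypothetical; an empty result means it is safe to move on
    to issuing client certificates from the new CA.
    """
    return [n["id"] for n in nodes if n.get("applied_bundle_version") != bundle_version]
```

Polling until this returns an empty list, rather than waiting a fixed interval, avoids handshake failures on nodes that are slow to pull the new bundle.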

Node Certificate Reissuance

If a node's certificate is compromised or needs replacement:

POST /admin/v1/nodes/:nid/reissue-package

After reissuance, the old certificate files must be removed from the node before it can re-enroll with the new credentials.

Node Lifecycle

Creating a Node

POST /admin/v1/nodes

This creates the node record and triggers a NodeOnboard PKI job that provisions certificates. Download the deploy package after the job completes:

GET /admin/v1/nodes/:nid/deploy-packages
GET /admin/v1/nodes/:nid/deploy-packages/:pid/download

Disabling / Enabling a Node

POST /admin/v1/nodes/:nid/disable
POST /admin/v1/nodes/:nid/enable

A disabled node's OCSP response will reflect revocation, causing SDK clients to reject handshakes.

Deleting a Node

DELETE /admin/v1/nodes/:nid    # Hard delete (production-grade)

This permanently removes the node and its associated certificates.

Troubleshooting

Node Cannot Connect to CP

Symptoms: node status "offline", no checkin records.

  1. Verify network connectivity between the node and CP on port 9091.
  2. Check that the node's node_agent.json has the correct CP address.
  3. Verify mTLS certificates: the node's client cert must be signed by cp_node_client_ca, and the CP server cert must be signed by cp_server_ca.
  4. Check CP logs for TLS handshake errors.
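Steps 1 and 3 can be checked from the node host with Python's standard library. This is a sketch under assumptions: the host name and certificate file paths are placeholders supplied by the caller, and hostname verification assumes the CP server certificate carries a SAN matching the address used.

```python
import socket
import ssl


def can_reach(host, port=9091, timeout=3.0):
    """Step 1: return True if a plain TCP connection to the CP succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def mtls_handshake(host, port, ca_file, cert_file, key_file, timeout=5.0):
    """Step 3: attempt a mutual-TLS handshake and return the negotiated TLS version.

    ca_file should contain cp_server_ca; cert_file and key_file are the node's
    client certificate and key. Raises ssl.SSLError if the handshake fails.
    """
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return tls.version()
```

If `can_reach` succeeds but `mtls_handshake` raises, the problem is in the certificate chain rather than the network, which narrows the search to step 3 and the CP logs.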

SSE Degraded Mode

Symptoms: [SSE] log messages indicating disconnection; polling intervals tighten.

  1. The system continues to function via polling, so this is not a critical outage.
  2. Check for network interruptions or firewalls dropping long-lived connections.
  3. Verify the CP process is healthy and not overloaded.
  4. SSE will auto-reconnect when the connection is restored.

SDK Handshake Failures

Symptoms: SDK clients fail to connect and report ErrorCode::TlsHandshakeFailed.

  1. Verify the SDK's trust policy: server_ca_bundle_path must trust the tcp_server_ca.
  2. Check SPKI pins match the current server certificate.
  3. Verify the client certificate is signed by app_client_ca and has the correct clientAuth EKU and URI SAN.
  4. If must-staple is enabled, ensure the node has a valid (non-revoked) OCSP staple.

Stale Session After Password Reset

Symptoms: user can still access resources after password change.

  1. Verify the auth epoch was bumped (check auth_epoch_changes in the Control DB).
  2. Check Channel 3 delta pull is functioning (node should pull within seconds).
  3. In degraded SSE mode, expect up to 5 seconds of propagation delay.

Database Lock Contention (SQLite)

Symptoms: SQLITE_BUSY errors in logs.

  1. Increase --sqlite-busy-timeout (default 5000ms).
  2. Check for long-running queries or transactions.
  3. Consider migrating to PostgreSQL for high-concurrency workloads.

Next Steps