
FAQ

Why C++23?

Q: Why is AuthNexus written in C++23 instead of Go, Rust, or Java?

AuthNexus targets high-throughput, low-latency authentication workloads where every microsecond on the hot path matters. C++23 provides:

  • Zero-cost abstractions -- no garbage collector pauses, no runtime overhead.
  • Asio coroutines -- native co_await integration with the Asio networking library for scalable async I/O without callback spaghetti.
  • Direct hardware control -- thread domain isolation, memory layout control, and CPU-cache-friendly data structures.
  • Mature ecosystem -- OpenSSL, Lua, SQLite, and libpq all have first-class C/C++ bindings.

The thread domain architecture (IO, Logic, DB, Crypto, CloudFunction pools all physically isolated) would be difficult to express cleanly in languages with a managed runtime.

Why a Custom Binary Protocol?

Q: Why not use HTTP/REST or gRPC for SDK-to-server communication?

The business data path between SDK clients and server nodes uses a custom binary protocol over TLS 1.3 for several reasons:

  • Minimal overhead -- no HTTP header parsing, no JSON serialization on the hot path. Packets are compact fixed-layout structures.
  • Bidirectional push -- the server can push notifications (announcements, force logout, kill process) to connected clients without polling or WebSocket upgrade negotiation.
  • Session-oriented -- a single persistent TCP connection carries authentication, heartbeats, queries, cloud function calls, and push notifications. No connection-per-request overhead.
  • mTLS-native -- mutual TLS is part of the connection lifecycle, not an afterthought. The four-layer validation (CA chain, certificate semantics, CP binding, business handshake) is deeply integrated.

HTTP is still used for admin-to-CP and node-to-CP communication where request-response semantics and human-debuggability are more valuable than raw throughput.

Why HMAC-SHA256 Instead of Argon2 / bcrypt?

Q: Isn't a fast hash insecure for passwords?

AuthNexus uses the $authnexus-fast-hmac-sha256$v=1$ hash format as a deliberate design trade-off:

| Property | Slow Hash (Argon2/bcrypt) | AuthNexus HMAC-SHA256 |
|---|---|---|
| Offline brute-force resistance | High | Relies on HMAC key secrecy |
| Online throughput | 100--1000 hashes/sec/core | 1,000,000+ hashes/sec/core |
| Latency per login | 50--500 ms | Sub-millisecond |
| Requirement for crypto thread pool | Large pool needed | Minimal pool sufficient |

The HMAC key is a server-held secret. An attacker who obtains only the database (without the key) cannot mount an offline guessing attack, because checking each candidate password requires computing the HMAC with that key. This is comparable to the "pepper" technique used alongside slow hashes, except that here the entire hash depends on the secret key.
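The idea can be sketched in a few lines of Python. This is an illustration only, not the AuthNexus implementation: the format prefix comes from this FAQ, but the per-user salt, the hex encoding, and the field layout after the prefix are assumptions, and `hash_password` / `verify_password` are hypothetical names.

```python
import hashlib
import hmac
import os
from typing import Optional

# Format tag taken from the FAQ; everything after it is an assumed layout.
PREFIX = "$authnexus-fast-hmac-sha256$v=1$"


def _mac(key: bytes, salt: bytes, password: str) -> str:
    """HMAC-SHA256 over salt + password, keyed with the server-held secret."""
    return hmac.new(key, salt + password.encode("utf-8"), hashlib.sha256).hexdigest()


def hash_password(key: bytes, password: str, salt: Optional[bytes] = None) -> str:
    """Return PREFIX + salt_hex + '$' + mac_hex (hypothetical field layout)."""
    salt = os.urandom(16) if salt is None else salt
    return f"{PREFIX}{salt.hex()}${_mac(key, salt, password)}"


def verify_password(key: bytes, password: str, stored: str) -> bool:
    """Recompute the MAC with the server key and compare in constant time."""
    if not stored.startswith(PREFIX):
        return False
    salt_hex, _, mac_hex = stored[len(PREFIX):].partition("$")
    try:
        salt = bytes.fromhex(salt_hex)
    except ValueError:
        return False
    expected = _mac(key, salt, password)
    return hmac.compare_digest(expected.encode("utf-8"), mac_hex.encode("utf-8"))
```

Without `key`, the stored string gives an attacker nothing to test guesses against, which is the property the paragraph above relies on.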

This trade-off is appropriate for:

  • High-concurrency authentication servers handling thousands of logins per second.
  • Environments where the HMAC key is stored in hardware security modules (HSMs) or secure enclaves.
  • Systems where network-level brute-force is already mitigated by rate limiting and blacklisting.

If your threat model requires resistance to full server compromise (attacker obtains both the database and the key), consider adding an application-level slow hash in the SDK before transmission.
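A client-side pre-hash of that kind might look like the following sketch. It uses PBKDF2-HMAC-SHA256 from the Python standard library purely as an example of a slow hash; the function name, the salt source, and the iteration count are assumptions, and a real SDK would implement the equivalent in its own language.

```python
import hashlib


def client_prehash(password: str, salt: bytes, iterations: int = 600_000) -> bytes:
    """Stretch the password on the client before it is sent to the server.

    `salt` is assumed to be a stable per-user or per-application value that
    the client already knows; the iteration count is a tunable assumption.
    """
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
```

The server would then treat the 32-byte result as the "password" and apply its HMAC as usual, so an attacker holding both the database and the key still has to pay the slow-hash cost per guess.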

How Do I Migrate from Another System?

Q: Can I migrate existing users and data to AuthNexus?

AuthNexus includes a migration framework (src/migrator/) implemented in Python that supports:

  • Dual-backend migration -- import into either SQLite or PostgreSQL.
  • CShield migration -- tested with real CShield database dumps (a 10-database schema).
  • Automatic certificate provisioning -- migrated applications automatically receive certificates via the PKI job system (zero C++ changes required).

The migration process:

  1. Export data from the source system.
  2. Run the Python migrator against the AuthNexus Control DB.
  3. Start the Control Plane -- the PKI job poller automatically provisions certificates for imported applications and nodes.
  4. Verify data integrity through the admin dashboard.

Legacy password hashes (plain SHA256, Argon2id) are not auto-migrated. Users must reset their passwords after migration.

Is mTLS Required?

Q: Can I disable mTLS for development or testing?

No. mTLS is a fundamental security invariant in AuthNexus, not an optional feature:

  • SDK-to-server: TLS 1.3 with mutual authentication is mandatory. The client certificate embeds the app_id via URI SAN, which is validated against the business handshake.
  • Node-to-CP: HTTP over mTLS is required for all /cp/v2/* endpoints. The node's identity is proven by its cp_node_client_ca-signed certificate.
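The SDK-to-server check that ties the certificate to the business handshake can be sketched as below. The URI layout `authnexus://app/<app_id>` is a hypothetical placeholder (the actual SAN format is not given here), and both function names are illustrative.

```python
from typing import Iterable, Optional
from urllib.parse import urlparse


def app_id_from_uri_sans(uri_sans: Iterable[str]) -> Optional[str]:
    """Extract the app_id from the first matching URI SAN.

    Assumes a hypothetical layout: authnexus://app/<app_id>
    """
    for uri in uri_sans:
        parsed = urlparse(uri)
        if parsed.scheme == "authnexus" and parsed.netloc == "app":
            app_id = parsed.path.lstrip("/")
            if app_id:
                return app_id
    return None


def handshake_matches_certificate(uri_sans: Iterable[str], claimed_app_id: str) -> bool:
    """Reject the session unless the handshake app_id equals the certificate's."""
    cert_app_id = app_id_from_uri_sans(uri_sans)
    return cert_app_id is not None and cert_app_id == claimed_app_id
```

The point of the comparison is that a valid certificate issued for one application cannot be used to open a session claiming to be another.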

For development, the PKI setup wizard generates all necessary CAs and certificates. The sdk_demo binary and the admin frontend demo mode work with the generated certificates without manual PKI setup.

The only exception is Channel 4 (OCSP), which uses plain HTTP because OCSP responses are themselves cryptographically signed by the responder, so their integrity does not depend on the transport.

How Does AuthNexus Scale?

Q: What are the scaling characteristics?

AuthNexus scales vertically within a single node and horizontally across multiple nodes:

Vertical Scaling

A single server_app instance on an 8-core machine can handle thousands of concurrent SDK connections. The --auto thread configuration scales all thread pools proportionally to CPU cores.

Key vertical scaling dimensions:

  • IO threads -- network connection capacity.
  • Logic threads -- request processing throughput.
  • DB threads -- query concurrency (PostgreSQL scales better than SQLite here).
  • Cloud function threads -- Lua execution parallelism.

Horizontal Scaling

Deploy multiple server_app nodes, each connecting to the same control_plane_app. The Control Plane distributes configuration, commands, and security policy to all nodes.

SDK Clients ──> server_app (Node 1) ──> control_plane_app
SDK Clients ──> server_app (Node 2) ──>      (shared)
SDK Clients ──> server_app (Node 3) ──>

Each node operates independently with its own Runtime DB. Client routing to nodes is handled externally (DNS, load balancer, or application-level selection).
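For the application-level option, one simple approach is to shuffle the known node list and take the first node that accepts a connection. The sketch below assumes the client already has a list of node addresses and a `try_connect` callback; both are hypothetical and not part of any AuthNexus SDK API described here.

```python
import random
from typing import Callable, Optional, Sequence


def pick_node(
    nodes: Sequence[str],
    try_connect: Callable[[str], bool],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Return the first reachable node in random order, or None if all fail."""
    order = list(nodes)
    (rng or random).shuffle(order)  # spread clients across nodes
    for node in order:
        if try_connect(node):
            return node
    return None
```

Random ordering spreads load roughly evenly without coordination, and trying the remaining nodes on failure gives basic failover.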

Control Plane Scaling

The Control Plane is currently a single process. For most deployments, a single CP instance is sufficient because:

  • Admin API traffic is low-volume (human operators).
  • Node checkins are infrequent (minutes, not seconds).
  • SSE connections are one-per-node (dozens, not thousands).

What Happens When a Node Loses CP Connectivity?

Q: Can server nodes operate independently?

Yes, with graceful degradation:

| Capability | CP Connected | CP Disconnected |
|---|---|---|
| User authentication | Full | Full (using cached data) |
| Heartbeat processing | Full | Full |
| Cloud function execution | Full | Functions cached locally continue working |
| Session invalidation (epoch) | Real-time (seconds) | Delayed (polling at 800 ms--5 s) |
| Blacklist updates | Real-time | Delayed (polling at 800 ms--5 s) |
| New configurations | Delivered via Channel 1 | Queued until reconnection |
| OCSP stapling | Fresh responses | Cached response until nextUpdate expires |

The server node caches all critical data locally in its Runtime DB. Authentication, heartbeats, and cloud functions continue without interruption. Security updates (epoch bumps, blacklist changes) may be delayed but will catch up when connectivity is restored.

The --blacklist-fail-closed flag controls behavior when the blacklist cache itself is unavailable (not the CP connection): if set, all requests are denied until the cache is rebuilt.
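The decision can be pictured with a small sketch. Representing an unavailable cache as `None` and allowing requests when the flag is not set are assumptions made for illustration; only the fail-closed branch is described above.

```python
from typing import Optional, Set


def is_request_allowed(
    client_id: str,
    blacklist: Optional[Set[str]],
    fail_closed: bool,
) -> bool:
    """Decide whether to admit a request given the blacklist cache state.

    blacklist is None when the cache is unavailable (assumed representation).
    """
    if blacklist is None:
        # Fail closed: deny everything until the cache is rebuilt.
        # Fail open (assumed behavior without the flag): let requests through.
        return not fail_closed
    return client_id not in blacklist
```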

Why Four Separate CAs?

Q: Wouldn't a single CA be simpler?

The four-CA model provides cryptographic isolation between trust domains:

  1. Compromise containment -- if one CA's key is compromised, only that trust domain is affected. A compromised app_client_ca does not grant access to the CP management plane.
  2. Independent rotation -- each CA can be rotated on its own schedule without disrupting other domains.
  3. Least privilege -- server nodes hold tcp_server_ca and cp_node_client_ca certificates, but never app_client_ca signing keys. SDKs hold app_client_ca certificates but cannot impersonate nodes.
  4. Revocation isolation -- revoking a client CA does not affect node-to-CP or CP server certificates.

Can I Use AuthNexus Without the Admin Frontend?

Q: Can I manage everything through the API?

Yes. The admin frontend is a pure SPA that communicates exclusively through the /admin/v1/* REST API. Every operation available in the UI is available via direct API calls. You can build your own management interface, use curl, or integrate with existing admin tools.

How Do I Handle Database Lock Errors on SQLite?

Q: I see SQLITE_BUSY errors under load. What should I do?

SQLite uses file-level locking. Under high write concurrency, lock contention can cause SQLITE_BUSY errors. Solutions:

  1. Increase busy timeout -- --sqlite-busy-timeout 10000 (10 seconds).
  2. WAL mode -- enabled by default, allows concurrent reads during writes.
  3. Migrate to PostgreSQL -- for deployments with multiple nodes or high write throughput, PostgreSQL's row-level locking and MVCC remove the file-lock bottleneck.

SQLite is best suited for single-node deployments with moderate traffic. PostgreSQL is recommended for anything beyond that.
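To see what the first two settings do at the SQLite level, the following sketch applies them through Python's standard sqlite3 module. It illustrates the underlying pragmas only; the function name is hypothetical, and the server applies its own settings through the flags described above.

```python
import sqlite3


def open_with_busy_timeout(path: str, busy_timeout_ms: int = 10_000) -> sqlite3.Connection:
    """Open a SQLite database with WAL journaling and a busy timeout."""
    conn = sqlite3.connect(path, timeout=busy_timeout_ms / 1000)
    # Wait up to busy_timeout_ms for a lock instead of failing with SQLITE_BUSY.
    conn.execute(f"PRAGMA busy_timeout = {int(busy_timeout_ms)}")
    # WAL lets readers proceed while a single writer appends to the log.
    conn.execute("PRAGMA journal_mode = WAL")
    return conn
```

Even with both settings, SQLite still allows only one writer at a time, which is why sustained write-heavy load is better served by PostgreSQL.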

How Are Cloud Functions Delivered to Nodes?

Q: How do Lua scripts get from the admin dashboard to the server node?

Cloud function delivery follows the standard configuration pipeline:

  1. Admin creates or updates a function via POST /admin/v1/cloud-functions.
  2. The function metadata and script body are stored in the Control DB.
  3. The CP sends a config.pending SSE hint (Channel 2) to relevant nodes.
  4. Each node pulls the updated manifest via POST /cp/v2/nodes/:id/configs/pull (Channel 1).
  5. The node fetches the script body via GET /cp/v2/objects/cloud-functions/:name?app_id= (Channel 1).
  6. The script is compiled into Lua bytecode and cached in the node's cloud function runtime.

If SSE is unavailable, nodes discover new functions during their periodic config poll.
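The node-side part of this pipeline (steps 4 to 6) can be sketched as follows. The two endpoint paths come from the steps above; the manifest shape (`cloud_functions`, `name`, `app_id`, `version`), the injected `post`/`get` transport callbacks, and the in-memory cache are assumptions for illustration, and the Lua compilation step is omitted.

```python
from typing import Callable, Dict, List, Tuple
from urllib.parse import quote, urlencode


def manifest_pull_path(node_id: str) -> str:
    """Path for step 4: POST /cp/v2/nodes/:id/configs/pull."""
    return f"/cp/v2/nodes/{quote(node_id, safe='')}/configs/pull"


def script_path(name: str, app_id: str) -> str:
    """Path for step 5: GET /cp/v2/objects/cloud-functions/:name?app_id=."""
    query = urlencode({"app_id": app_id})
    return f"/cp/v2/objects/cloud-functions/{quote(name, safe='')}?{query}"


def sync_cloud_functions(
    node_id: str,
    post: Callable[[str], dict],
    get: Callable[[str], str],
    cache: Dict[Tuple[str, str], dict],
) -> List[str]:
    """Pull the manifest, then fetch only scripts whose version changed."""
    manifest = post(manifest_pull_path(node_id))
    updated = []
    for entry in manifest["cloud_functions"]:  # hypothetical manifest shape
        key = (entry["app_id"], entry["name"])
        if cache.get(key, {}).get("version") == entry["version"]:
            continue  # already up to date
        body = get(script_path(entry["name"], entry["app_id"]))
        cache[key] = {"version": entry["version"], "body": body}
        updated.append(entry["name"])
    return updated
```

The same function serves both triggers: it can run when a config.pending hint arrives or on the periodic config poll.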

What Logging Level Should I Use in Production?

Q: What are the trade-offs between log levels?

| Level | Use Case | Volume |
|---|---|---|
| info | Production default. Startup, shutdown, key events, errors | Low |
| warn | Included in info. Degraded states, retries, near-limit conditions | Low |
| debug | Local troubleshooting only. Per-request tracing, state transitions | High (tens of MB) |
| trace | Extreme diagnostics. Packet-level, per-field logging | Very high |

Using debug or trace in production will generate tens of megabytes of logs quickly, pollute benchmark samples, and may impact performance under load. Always use info for production deployments and switch to debug only for targeted troubleshooting sessions.

Next Steps