Ingress Architecture Risks and Hardening

Last Updated: 2026-01-31
Document Version: 1.0
Status: Active Documentation


Date: 2026-01-20
Status: Complete Risk Assessment
Purpose: Identify risks and hardening opportunities for ingress architecture


Overview

This document identifies risks and hardening opportunities for the ingress architecture:

Cloudflare DNS → UDM Pro port-forward → NPMplus (reverse proxy + SSL termination) → Backend VMs/services (nginx or direct ports)

Scope: Identifies risks and recommends hardening measures that can be applied without breaking production.


Identified Risks

Risk 1: Single Point of Failure - NPMplus

Severity: High
Component: NPMplus (VMID 10233)
Status: Current

Description:

  • NPMplus is a single reverse proxy container
  • All ingress traffic depends on one container
  • If NPMplus fails, all public-facing services become unavailable

Impact:

  • Complete ingress outage if NPMplus container fails
  • No redundancy or failover
  • Single container failure affects all 19 domains

Mitigation (Current):

  • Container is monitored and backed up
  • Configuration is documented and can be restored
  • Container is running on stable Proxmox host (r630-01)

Hardening Opportunities:

  • HA Setup Guide Created: Complete guide available at docs/04-configuration/NPMPLUS_HA_SETUP_GUIDE.md
  • Deploy HA NPMplus instance (active-passive with Keepalived)
  • Set up automatic failover (Keepalived virtual IP)
  • Document manual failover procedures (done in backup/restore guide)

Recommendation:

  • Review and implement HA setup guide during next maintenance window
  • Set up container health monitoring
  • Regular backups (done in backup/restore guide)

HA Implementation: See docs/04-configuration/NPMPLUS_HA_SETUP_GUIDE.md for complete step-by-step instructions.


Risk 2: DNS-Only Mode (No Cloudflare Proxy/WAF)

Severity: Medium
Component: Cloudflare DNS
Status: Intentional Configuration

Description:

  • All DNS records use "DNS Only" mode (gray cloud)
  • No Cloudflare proxy, WAF, or DDoS protection
  • Origin IPs (76.53.10.36) exposed directly

Impact:

  • No DDoS protection from Cloudflare
  • No WAF rules for application-layer attacks
  • Origin IPs visible to attackers
  • No CDN caching

Rationale (Intentional):

  • Direct SSL termination at NPMplus required
  • Cloudflare proxy can interfere with Let's Encrypt validation at the origin (for example, the TLS-ALPN-01 challenge cannot be completed through the proxy)
  • Allows direct control over SSL certificates

Hardening Opportunities (without breaking production):

  1. Enable Cloudflare Access for Admin Portals:

    • Add authentication layer for dbis-admin.d-bis.org
    • Add authentication layer for secure.d-bis.org
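    • Caveat: Cloudflare Access only enforces policies on traffic that passes through Cloudflare, so these two hostnames would need to be proxied (orange cloud) or published through Cloudflare Tunnel; all other records can stay DNS-only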
  2. Implement Rate Limiting at NPMplus:

    • Add rate limiting for RPC endpoints (especially public RPC)
    • Configure rate limiting per IP or per domain
    • Does not require changing DNS configuration
  3. Monitor and Alert on Unusual Traffic:

    • Set up log aggregation for NPMplus access logs
    • Configure alerts for unusual traffic patterns
    • Detect DDoS attempts early

Not in Scope (would require production changes):

  • Enabling Cloudflare proxy (would require changing SSL termination)
  • Changing to Cloudflare SSL (would require certificate changes)

Recommendation:

  • Implement rate limiting for RPC endpoints
  • Set up Cloudflare Access for admin portals
  • Monitor traffic patterns and set up alerts

Risk 3: Certificate Expiration

Severity: Medium
Component: SSL Certificates
Status: Current

Description:

  • All 19 SSL certificates expire on 2026-04-16
  • Auto-renewal enabled but could fail
  • Certificate failure would cause HTTPS outages

Impact:

  • Services become inaccessible if certificates expire
  • Browser warnings if certificates invalid
  • All domains affected simultaneously (same expiration date)

Current Mitigation:

  • Auto-renewal enabled in NPMplus
  • Let's Encrypt handles renewal automatically
  • Certificates valid until 2026-04-16

Hardening Opportunities (without breaking production):

  1. Certificate Expiration Monitoring:

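    • Set up alerts 30/14/7 days before expiration (Let's Encrypt certificates are valid for 90 days and typically renew with about 30 days left, so earlier thresholds would fire on healthy certificates)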
    • Monitor certificate status via NPMplus API
    • Alert if auto-renewal fails
  2. Certificate Verification Scripts:

    • Regular verification of certificate validity
    • Automated checks for certificate expiration
    • Integration with monitoring systems

Recommendation:

  • Set up certificate expiration alerts
  • Regular verification of certificate status
  • Document manual renewal procedures (done in backup/restore guide)

Risk 4: Sankofa Routing Issue

Severity: High
Component: Backend Routing
Status: Known, Cutover Plan in Place

Description:

  • Five Sankofa domains currently route to Blockscout (192.168.11.140) because the Sankofa services are not yet deployed
  • Incorrect routing prevents Sankofa services from working
  • Users may access wrong content

Impact:

  • Sankofa domains don't work as intended
  • Incorrect content served (Blockscout instead of Sankofa)
  • SSL certificates exist but services not available

Current Status:

  • Known issue documented
  • Cutover plan created (see SANKOFA_CUTOVER_PLAN.md)
  • Waiting for Sankofa service deployment

Mitigation:

  • Cutover plan in place
  • Will update routing once services deployed
  • Temporary routing keeps domains accessible (though incorrect)

Recommendation:

  • Complete Sankofa service deployment
  • Execute cutover plan when services ready
  • Update source-of-truth after cutover

Risk 5: UDM Pro Port Forwarding - Manual Configuration

Severity: Medium
Component: Edge Routing
Status: Current

Description:

  • Port forwarding configured manually via UDM Pro web UI
  • No automation or API access
  • Risk of misconfiguration during changes

Impact:

  • Manual errors during configuration changes
  • No version control or audit trail
  • Difficult to verify configuration matches documentation

Hardening Opportunities (without breaking production):

  1. Document Exact Steps:

    • Create detailed configuration guide
    • Document exact values for port forwarding rules
    • Create verification checklist
  2. Verification Procedures:

    • Regular verification of port forwarding rules
    • Screenshot evidence of configuration
    • Automated connectivity tests

Recommendation:

  • Document exact port forwarding steps (done in verification runbook)
  • Regular verification of configuration
  • Screenshot evidence stored
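
The automated connectivity tests mentioned above could start from a sketch like the one below. It assumes bash (for /dev/tcp) and the coreutils timeout command are installed; the function names are hypothetical, and ports 80 and 443 in the usage note are an assumption rather than values taken from the UDM Pro configuration.

```shell
# Sketch only: function names are illustrative, not part of any existing script.

# port_open <host> <port> [timeout-seconds]: succeed if a TCP connection opens.
port_open() {
    local host=$1 port=$2 timeout_s=${3:-3}
    timeout "$timeout_s" bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# check_forwards: read "host port" pairs on stdin, report each one,
# and return non-zero if any expected port is unreachable.
check_forwards() {
    local host port failed=0
    while read -r host port; do
        # Skip blank lines and comment lines
        case $host in ''|'#'*) continue ;; esac
        if port_open "$host" "$port"; then
            echo "open   ${host}:${port}"
        else
            echo "closed ${host}:${port}"
            failed=1
        fi
    done
    return "$failed"
}
```

A possible invocation is `printf '76.53.10.36 80\n76.53.10.36 443\n' | check_forwards`, ideally run from a host outside the LAN, because results from inside the network may depend on hairpin NAT behavior.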

Risk 6: Backend VM Direct Access (No Nginx)

Severity: Low-Medium
Component: Backend VMs
Status: Intentional Configuration

Description:

  • Some backends are proxied by NPMplus straight to the application port, with no nginx layer on the VM
  • Besu RPC nodes (VMIDs 2101, 2201) expose ports 8545/8546 directly
  • Node.js APIs (VMIDs 10150, 10151) expose port 3000 directly

Impact:

  • Direct exposure of application ports
  • No additional security layer (nginx headers, rate limiting)
  • Application-level security only

Rationale (Intentional):

  • RPC services require direct access for performance
  • Node.js APIs designed for direct exposure
  • Nginx layer adds unnecessary complexity for these services

Hardening Opportunities (without breaking production):

  1. Rate Limiting at NPMplus:

    • Add rate limiting to RPC proxy hosts
    • Configure rate limits per IP or globally
    • Prevent abuse without adding nginx layer
  2. Security Headers at NPMplus:

    • Add security headers via NPMplus advanced config
    • Configure CSP, X-Frame-Options, etc.
    • Apply to all proxy hosts
  3. Access Lists:

    • Configure IP allowlists for private RPC endpoints
    • Restrict access to authorized IPs only
    • Use NPMplus access lists feature

Not in Scope (would require production changes):

  • Adding nginx layer to all services
  • Changing backend architecture

Recommendation:

  • Add rate limiting for RPC endpoints at NPMplus
  • Configure access lists for private RPC endpoints
  • Add security headers via NPMplus advanced config
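
Once headers are added in the NPMplus advanced config, a quick check can confirm they actually reach clients. The sketch below reads raw response headers on stdin and prints the ones that are missing. The function name is hypothetical, and the header list goes beyond the CSP and X-Frame-Options named above: X-Content-Type-Options and Strict-Transport-Security are assumed additions.

```shell
# missing_headers: read raw HTTP response headers on stdin and print every
# expected security header that is absent (names are compared case-insensitively).
missing_headers() {
    local headers name
    # Strip carriage returns and lowercase everything so HTTP/1.1 and HTTP/2
    # style header names compare equal.
    headers=$(tr -d '\r' | tr '[:upper:]' '[:lower:]')
    # Assumed header set; adjust to the policy actually chosen.
    for name in content-security-policy x-frame-options \
                x-content-type-options strict-transport-security; do
        printf '%s\n' "$headers" | grep -q "^${name}:" || echo "$name"
    done
}
```

For example, `curl -sI https://dbis-admin.d-bis.org | missing_headers` prints nothing when all four headers are present.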

Risk 7: Internal TLS (Double TLS)

Severity: Low
Component: VMID 2400
Status: Current Configuration

Description:

  • VMID 2400 (thirdweb-rpc-1) uses HTTPS internally (port 443)
  • NPMplus terminates the public TLS session, then opens a second TLS session to the HTTPS backend
  • Results in two TLS hops (client → NPMplus, NPMplus → VMID 2400)

Impact:

  • Additional complexity in certificate management
  • Two SSL certificates required (NPMplus + VMID 2400)
  • Potential performance overhead

Rationale (Documentation Needed):

  • Need to document why this is required
  • May be intentional for additional security
  • Or legacy configuration that could be simplified

Hardening Opportunities (without breaking production):

  1. Document Internal TLS Rationale:

    • Document why VMID 2400 uses HTTPS internally
    • Verify if internal TLS is necessary
    • Document certificate management for internal TLS
  2. Monitor Internal TLS Certificate Expiration:

    • Track internal SSL certificate expiration
    • Ensure internal certificates are renewed
    • Avoid internal certificate expiration causing outages

Recommendation:

  • Document why internal TLS is used
  • Monitor internal certificate expiration
  • Verify if internal TLS could be changed to HTTP (future consideration)

Hardening Opportunities (Without Breaking Production)

1. Rate Limiting at NPMplus

Priority: High
Effort: Medium
Impact: High

Implementation:

  • Configure rate limiting for RPC endpoints
  • Set limits per IP (e.g., 100 requests/minute)
  • Apply to all RPC proxy hosts

Steps:

  1. Access NPMplus UI
  2. Navigate to Proxy Hosts
  3. Edit RPC proxy hosts (rpc-http-pub, rpc-ws-pub, etc.)
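  4. Configure rate limiting in the proxy host's advanced config using nginx limit_req (the matching limit_req_zone must be declared at http level; NPMplus access lists handle IP allow/deny and basic auth, not rate limiting)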
  5. Test rate limiting behavior

Benefits:

  • Protects RPC endpoints from abuse
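  • Mitigates application-layer floods (volumetric DDoS traffic still reaches the origin, since Cloudflare proxying is disabled)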
  • Does not require backend changes
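
Step 5 (testing the limit) can be sketched as a short burst test: send more requests than the limit allows and tally the response codes. The function names are hypothetical, and the request body assumes the endpoint accepts the standard eth_blockNumber JSON-RPC call. With nginx's limit_req, rejected requests return 503 unless limit_req_status is set to something else.

```shell
# burst <url> <n>: send n JSON-RPC requests and print one HTTP status per line
# (curl prints 000 when no response is received).
burst() {
    local url=$1 n=$2 i=0
    while [ "$i" -lt "$n" ]; do
        curl -s -o /dev/null -m 5 -w '%{http_code}\n' \
            -H 'Content-Type: application/json' \
            -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
            "$url" || true
        i=$((i + 1))
    done
}

# tally_statuses: read one status per line on stdin, print "status count" pairs.
tally_statuses() {
    sort | uniq -c | awk '{ print $2, $1 }'
}
```

Running `burst "$RPC_URL" 150 | tally_statuses` against a host limited to 100 requests/minute should show a mix of 200 and rejection codes; `RPC_URL` is a placeholder for one of the RPC proxy hosts.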

2. Cloudflare Access for Admin Portals

Priority: Medium
Effort: Medium
Impact: Medium

Implementation:

  • Enable Cloudflare Access for dbis-admin.d-bis.org
  • Enable Cloudflare Access for secure.d-bis.org
  • Configure access policies (email allowlist, MFA, etc.)

Steps:

  1. Access Cloudflare Zero Trust dashboard
  2. Navigate to Access → Applications
  3. Add application: dbis-admin.d-bis.org
  4. Configure access policy (email allowlist, MFA)
  5. Repeat for secure.d-bis.org

Benefits:

  • Additional authentication layer
  • MFA support
  • Audit trail
  • Other DNS records can remain DNS-only; only the protected hostnames must route through Cloudflare (proxied record or Cloudflare Tunnel)

3. Certificate Expiration Monitoring

Priority: High
Effort: Low
Impact: High

Implementation:

  • Set up monitoring for certificate expiration
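  • Configure alerts 30/14/7 days before expiration (Let's Encrypt certificates are valid for 90 days, so 90- and 60-day thresholds would fire on healthy certificates)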
  • Monitor auto-renewal status

Steps:

  1. Create monitoring script or use existing verification scripts
  2. Run daily checks of certificate expiration
  3. Configure alerts (email, Slack, etc.)
  4. Test alert system

Script:

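# Export the current NPMplus configuration, including certificate metadata (run daily)
bash scripts/verify/export-npmplus-config.sh

# List domains whose certificates expire within the next 30 days.
# fromdateiso8601 expects timestamps like 2026-04-16T00:00:00Z (no fractional seconds).
jq '.[] | select((.expires | fromdateiso8601) < (now + 30 * 86400)) | .domain_names' \
    docs/04-configuration/verification-evidence/npmplus-verification-*/certificates.json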

Benefits:

  • Early warning of certificate expiration
  • Time to fix auto-renewal issues
  • Prevents unexpected outages
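
As a complement to the exported JSON, the expiry date can also be read from the certificate each endpoint actually serves. The sketch below assumes GNU date and the openssl CLI; the function names and the 30/14/7-day thresholds are assumptions to adapt to the alerting policy in use.

```shell
# cert_not_after <host> [port]: print the notAfter date of the served certificate.
cert_not_after() {
    local host=$1 port=${2:-443}
    openssl s_client -connect "${host}:${port}" -servername "$host" </dev/null 2>/dev/null \
        | openssl x509 -noout -enddate | cut -d= -f2
}

# days_until <expiry-date> [now-epoch]: whole days from now until expiry.
# Relies on GNU date's -d parsing; the optional second argument pins "now".
days_until() {
    local expiry_epoch now_epoch
    expiry_epoch=$(date -u -d "$1" +%s) || return 1
    now_epoch=${2:-$(date -u +%s)}
    echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# alert_level <days>: map days remaining to a severity (assumed thresholds).
alert_level() {
    if   [ "$1" -le 7 ];  then echo critical
    elif [ "$1" -le 14 ]; then echo warning
    elif [ "$1" -le 30 ]; then echo notice
    else                       echo ok
    fi
}
```

A daily job could then call `alert_level "$(days_until "$(cert_not_after dbis-admin.d-bis.org)")"` for each domain and forward anything other than `ok` to email or Slack.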

4. Health Check Endpoints for All Backend Services

Priority: Medium
Effort: Low-Medium
Impact: Medium

Implementation:

  • Add health check endpoints to all backend services
  • Configure health checks in NPMplus (if supported)
  • Monitor health endpoints

Steps:

  1. Add /health endpoints to all backend services
  2. Configure health checks in application config
  3. Set up monitoring for health endpoints
  4. Configure alerts for failed health checks

Benefits:

  • Early detection of service issues
  • Proactive monitoring
  • Better troubleshooting
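
Monitoring the health endpoints can begin with a small poller before a full monitoring stack is in place. This is a minimal sketch, assuming each service answers /health with a 2xx status when healthy; the function names are hypothetical and the backend URLs must be supplied by the operator.

```shell
# is_healthy <http-status>: succeed only for 2xx responses.
is_healthy() {
    case $1 in
        2[0-9][0-9]) return 0 ;;
        *)           return 1 ;;
    esac
}

# probe <url>: print the HTTP status of <url> (000 if no response arrives).
probe() {
    curl -s -o /dev/null -m 5 -w '%{http_code}' "$1" || true
}

# check_all: read URLs on stdin, print "OK|FAIL status url" for each,
# and return non-zero if any endpoint is unhealthy.
check_all() {
    local url status failed=0
    while IFS= read -r url; do
        [ -n "$url" ] || continue
        status=$(probe "$url")
        if is_healthy "$status"; then
            echo "OK   $status $url"
        else
            echo "FAIL $status $url"
            failed=1
        fi
    done
    return "$failed"
}
```

Feeding it a file with one health URL per line (`check_all < health-urls.txt`, where the file name is a placeholder) gives a non-zero exit code that cron or a monitoring agent can alert on.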

5. Log Aggregation for NPMplus Access Logs

Priority: Medium
Effort: Medium
Impact: Medium

Implementation:

  • Set up log aggregation for NPMplus access logs
  • Configure log forwarding (syslog, filebeat, etc.)
  • Set up log analysis and alerting

Steps:

  1. Configure NPMplus to log to syslog or file
  2. Set up log forwarder (filebeat, fluentd, etc.)
  3. Configure log aggregation (ELK stack, Loki, etc.)
  4. Set up alerts for unusual patterns

Benefits:

  • Better visibility into traffic patterns
  • Detect attacks early
  • Audit trail for troubleshooting
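
Until a full aggregation stack is running, unusual patterns can be spotted directly from the access log. The sketch below counts requests per client and prints the heaviest ones. It assumes the client IP is the first whitespace-separated field, as in nginx's default combined format; NPMplus may log in a different layout, so the field index may need adjusting. The function name is hypothetical.

```shell
# top_talkers <threshold>: read access-log lines on stdin and print "count ip"
# for every client with more than <threshold> requests, busiest first.
top_talkers() {
    local threshold=$1
    awk -v t="$threshold" '
        { hits[$1]++ }                      # $1 is assumed to be the client IP
        END {
            for (ip in hits)
                if (hits[ip] > t) print hits[ip], ip
        }
    ' | sort -rn
}
```

Piping a recent slice of the log through `top_talkers 100` highlights clients that would exceed a 100-request budget for that slice.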

6. Document Failover Procedures

Priority: High
Effort: Low
Impact: High

Implementation:

  • Document failover procedures if NPMplus fails
  • Create step-by-step recovery guide
  • Test failover procedures

Status: Done in NPMPLUS_BACKUP_RESTORE.md


Not in Scope (Would Require Production Changes)

The following hardening measures would require production changes and are not in scope for this plan:

  1. Enabling Cloudflare Proxy:

    • Would require changing SSL termination from NPMplus to Cloudflare
    • Would require reconfiguration of all SSL certificates
    • Would break current architecture
  2. Adding HA NPMplus Instance:

    • Would require deployment of additional NPMplus container
    • Would require a shared virtual IP (Keepalived, as in the HA setup guide) or a load balancer in front of both instances
    • Would require database replication or shared storage
  3. Changing Backend Architecture:

    • Adding nginx layer to all services
    • Changing RPC endpoints to use nginx
    • Would require application changes

Risk Summary Table

| Risk | Severity | Status | Mitigation | Hardening Priority |
|---|---|---|---|---|
| Single Point of Failure (NPMplus) | High | Current | Documented | High (monitoring) |
| DNS-Only Mode | Medium | Intentional | Rate limiting, Cloudflare Access | Medium |
| Certificate Expiration | Medium | Current | Auto-renewal | High (monitoring) |
| Sankofa Routing Issue | High | Known | Cutover plan in place | High (cutover) |
| UDM Pro Manual Config | Medium | Current | Documentation | Medium (verification) |
| Backend Direct Access | Low-Medium | Intentional | Rate limiting | Medium |
| Internal TLS | Low | Current | Documentation | Low (documentation) |

Hardening Implementation Priority

High Priority (Implement First)

  1. Certificate Expiration Monitoring - Critical for preventing outages
  2. Rate Limiting for RPC Endpoints - Prevents abuse
  3. Document Failover Procedures - Done

Medium Priority

  1. Cloudflare Access for Admin Portals - Additional security
  2. Health Check Endpoints - Better monitoring
  3. Log Aggregation - Better visibility

Low Priority

  1. Document Internal TLS Rationale - Documentation improvement

Related Documentation

  • Verification Runbook: docs/04-configuration/INGRESS_VERIFICATION_RUNBOOK.md
  • Backup/Restore Guide: docs/04-configuration/NPMPLUS_BACKUP_RESTORE.md
  • Sankofa Cutover Plan: docs/04-configuration/SANKOFA_CUTOVER_PLAN.md
  • Comprehensive Architecture: docs/04-configuration/DNS_NPMPLUS_VM_COMPREHENSIVE_ARCHITECTURE.md

Last Updated: 2026-01-20
Maintained By: Infrastructure Team
Status: Complete Risk Assessment