
Building a Budget-Friendly Lab VPS Platform – Part 5: When Things Break - Cleanup, Monitoring, and Keeping the Lights On


TL;DR: The provisioning pipeline from Part 4 works great when everything goes right. This post is about the equally important code that runs when everything goes wrong.


Last week was about the happy path — payment clears, VM provisions, user gets access.

This week is about everything else: failed provisioning, stuck operations, orphaned resources, and the background processes that keep the platform working even when individual operations fail.

The reality is that most of the operational code isn’t about creating VMs. It’s about cleaning up after partial failures, detecting when things are stuck, and ensuring that what users see matches what actually exists in Proxmox.


Failed Provisioning Recovery

When provisionVmForPaidOrder() fails halfway through, it leaves the Order in a “failed” state with an error message. But that’s not the end of the story.

The platform runs a periodic cleanup job that finds failed orders and attempts to recover:

async function retryFailedProvisioning() {
  // Find orders that failed but might be retryable
  const failedOrders = await Order.find({
    status: "failed",
    createdAt: { $gt: new Date(Date.now() - 24 * 60 * 60 * 1000) }, // within 24h
    retryCount: { $lt: 3 } // haven't retried too many times
  });

  for (const order of failedOrders) {
    const errorMsg = String(order.error || "").toLowerCase();
    
    // Skip permanent failures (payment issues, invalid plans, etc.)
    if (errorMsg.includes("payment") || errorMsg.includes("subscription") || errorMsg.includes("plan not found")) {
      console.log(`[Retry] Skipping order ${order._id}: permanent failure`);
      continue;
    }

    // Retry transient failures (timeouts, network issues, Proxmox locks)
    console.log(`[Retry] Attempting to retry failed order ${order._id}`);
    
    try {
      // Reset status so provisionVmForPaidOrder will run again
      order.status = "created";
      order.error = null;
      order.retryCount = (order.retryCount || 0) + 1;
      await order.save();
      
      await provisionVmForPaidOrder(order._id);
      console.log(`[Retry] Successfully retried order ${order._id}`);
    } catch (err) {
      console.error(`[Retry] Retry failed for order ${order._id}:`, err);
      
      // Mark as failed again with updated error
      order.status = "failed";
      order.error = `Retry ${order.retryCount}: ${err.message}`;
      await order.save();
    }
  }
}

This runs every 10 minutes and catches cases where provisioning failed due to temporary Proxmox issues, network timeouts, or resource locks that have since cleared.
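The transient/permanent split above is keyword matching on the stored error string. Pulled out as its own helper, it is easier to test and extend. This is a hypothetical sketch, not the platform's actual code; the keyword lists are assumptions based on the checks in the retry loop:

```javascript
// Hypothetical classifier for provisioning errors. Permanent failures should
// never be retried; everything else defaults to transient and lets the
// retryCount cap decide when to give up.
const PERMANENT_PATTERNS = ["payment", "subscription", "plan not found"];
const TRANSIENT_PATTERNS = ["timeout", "etimedout", "econnrefused", "lock", "socket hang up"];

function classifyProvisioningError(message) {
  const msg = String(message || "").toLowerCase();
  if (PERMANENT_PATTERNS.some((p) => msg.includes(p))) return "permanent";
  if (TRANSIENT_PATTERNS.some((p) => msg.includes(p))) return "transient";
  // Unknown errors are treated as transient so a one-off bug doesn't
  // permanently strand an order that would succeed on retry.
  return "transient";
}
```

The retry loop would then skip any order where `classifyProvisioningError(order.error)` returns `"permanent"`.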


Orphaned Resource Detection

One of the nastier failure modes is when VM creation succeeds in Proxmox but fails before updating the database. This leaves “orphaned” VMs that exist in the hypervisor but aren’t tracked by the platform.

The cleanup job scans for these:

async function detectOrphanedVms() {
  // Get all VMs currently running in Proxmox
  const proxmoxVms = await getAllProxmoxVms();
  
  // Get all VMs the platform thinks it owns
  const trackedVmIds = new Set();
  const users = await User.find({ vmId: { $exists: true, $ne: null } });
  const vms = await Vm.find({});
  
  users.forEach(u => u.vmId && trackedVmIds.add(u.vmId));
  vms.forEach(v => v.vmId && trackedVmIds.add(v.vmId));

  // Find VMs in our ID range that aren't tracked
  const minId = parseInt(process.env.VM_ID_MIN || "6000", 10);
  const maxId = parseInt(process.env.VM_ID_MAX || "6999", 10);
  
  const orphans = [];
  for (const pveVm of proxmoxVms) {
    const vmId = parseInt(pveVm.vmid, 10);
    if (vmId >= minId && vmId <= maxId && !trackedVmIds.has(vmId)) {
      orphans.push({
        vmId,
        name: pveVm.name,
        status: pveVm.status,
        uptime: pveVm.uptime || 0
      });
    }
  }

  if (orphans.length > 0) {
    console.log(`[Cleanup] Found ${orphans.length} orphaned VMs:`, orphans);
    
    // For now, just log them. In production, you might want to:
    // 1. Try to match them to failed Orders by timing/name
    // 2. Automatically destroy VMs that are clearly orphaned
    // 3. Send alerts to administrators
    
    for (const orphan of orphans) {
      await logOrphanedVm(orphan);
    }
  }

  return orphans;
}

async function getAllProxmoxVms() {
  const json = await pveFetchJson(`/cluster/resources?type=vm`, { method: "GET" });
  return Array.isArray(json?.data) ? json.data : [];
}

The platform logs orphaned VMs but doesn’t automatically delete them, since that could be destructive if there’s a database sync issue. Instead, it provides visibility for manual cleanup.
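The first option from the code comment — matching orphans back to failed Orders — can be sketched as a pure function. This assumes VM names embed a prefix of the order id (e.g. `lab-64f1a2b3`); that naming convention is an illustration, not necessarily what the platform uses:

```javascript
// Hypothetical matcher: link an orphaned Proxmox VM to the failed Order that
// probably created it, by looking for an order-id prefix in the VM name.
function matchOrphanToOrder(orphan, failedOrders) {
  if (!orphan.name) return null;
  return (
    failedOrders.find((o) => orphan.name.includes(String(o._id).slice(0, 8))) ||
    null
  );
}
```

A matched orphan is a strong candidate for automated destruction; an unmatched one stays in the "alert a human" bucket.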


Stuck Operation Detection

Sometimes VMs get stuck in intermediate states — provisioning VMs that never finish, resize requests that never apply, suspended VMs that never actually stop.

The monitoring code watches for these:

async function detectStuckOperations() {
  const now = new Date();
  const issues = [];

  // Orders stuck in "provisioning" for more than 20 minutes
  const stuckProvisioning = await Order.find({
    status: "provisioning",
    updatedAt: { $lt: new Date(now.getTime() - 20 * 60 * 1000) }
  });

  for (const order of stuckProvisioning) {
    issues.push({
      type: "stuck_provisioning",
      orderId: order._id,
      userId: order.userId,
      stuckFor: Math.round((now.getTime() - order.updatedAt.getTime()) / (60 * 1000)) + " minutes"
    });
  }

  // Resize requests stuck in "billed" (paid but not applied)
  const stuckResizes = await ResizeRequest.find({
    status: "billed",
    updatedAt: { $lt: new Date(now.getTime() - 10 * 60 * 1000) }
  });

  for (const rr of stuckResizes) {
    issues.push({
      type: "stuck_resize", 
      resizeId: rr._id,
      vmId: rr.vmId,
      stuckFor: Math.round((now.getTime() - rr.updatedAt.getTime()) / (60 * 1000)) + " minutes"
    });
  }

  // VMs that should exist but don't show up in Proxmox
  const activeUsers = await User.find({ 
    vmId: { $exists: true, $ne: null },
    stripeCustomerId: { $exists: true }
  });

  const proxmoxVmIds = new Set();
  const proxmoxVms = await getAllProxmoxVms();
  proxmoxVms.forEach(vm => proxmoxVmIds.add(parseInt(vm.vmid, 10)));

  for (const user of activeUsers) {
    if (user.vmId && !proxmoxVmIds.has(user.vmId)) {
      // Double-check that their subscription is actually active
      try {
        if (user.stripeCustomerId) {
          const customer = await stripe.customers.retrieve(user.stripeCustomerId);
          const subs = await stripe.subscriptions.list({ customer: customer.id, status: 'active' });
          
          if (subs.data.length > 0) {
            issues.push({
              type: "missing_vm",
              userId: user._id,
              email: user.email,
              vmId: user.vmId,
              hasActiveSubscription: true
            });
          }
        }
      } catch (err) {
        console.error(`[Monitor] Error checking subscription for user ${user._id}:`, err);
      }
    }
  }

  if (issues.length > 0) {
    console.log(`[Monitor] Found ${issues.length} operational issues:`, JSON.stringify(issues, null, 2));
    
    // In production, send alerts
    await sendOperationalAlert({
      type: "stuck_operations",
      count: issues.length,
      details: issues
    });
  }

  return issues;
}

This gives visibility into operations that started but never completed, which is usually a sign that error handling missed an edge case.


Subscription Sync and Enforcement

One of the trickiest parts is keeping Stripe subscription state in sync with actual VM state. Users can cancel subscriptions directly in Stripe, payment methods can fail, or webhooks can be missed.

The platform runs a periodic sync to catch these:

async function syncSubscriptionStates() {
  const users = await User.find({ 
    stripeCustomerId: { $exists: true, $ne: null },
    vmId: { $exists: true, $ne: null }
  });

  for (const user of users) {
    try {
      // Get current Stripe state
      const customer = await stripe.customers.retrieve(user.stripeCustomerId);
      const subs = await stripe.subscriptions.list({ 
        customer: customer.id,
        limit: 10
      });

      const activeSub = subs.data.find(sub => 
        ['active', 'trialing'].includes(sub.status)
      );

      if (!activeSub) {
        // User has no active subscription but still has a VM
        console.log(`[Sync] User ${user.email} has VM ${user.vmId} but no active subscription`);
        
        // Check if VM is already suspended to avoid double-suspension
        const vmStatus = await getVmStatus(user.vmId);
        
        if (vmStatus === 'running') {
          console.log(`[Sync] Suspending VM ${user.vmId} for inactive subscription`);
          await suspendVm(user.vmId);
          
          // Notify user
          await sendEmailOnce({
            key: `vm:${user.vmId}:suspended:no_subscription`,
            userId: user._id,
            vmId: user.vmId,
            to: user.email,
            subject: `Your Lab VM has been suspended`,
            text: `Hi,

Your lab VM (${user.vmId}) has been suspended because we couldn't find an active subscription.

This usually happens when:
- Your subscription was cancelled
- Your payment method failed
- There was a billing issue

To restore access, please visit your dashboard and reactivate your subscription.

Dashboard: ${absoluteUrl("/dashboard")}

If you believe this is an error, please reply to this email.

— ${appName()} Team`
          });
        }
        
      } else {
        // User has active subscription, make sure VM is running
        const vmStatus = await getVmStatus(user.vmId);
        
        if (vmStatus === 'stopped') {
          console.log(`[Sync] Resuming VM ${user.vmId} for active subscription`);
          await resumeVm(user.vmId);
        }
      }
      
    } catch (err) {
      console.error(`[Sync] Error syncing subscription for user ${user._id}:`, err);
    }
  }
}

This catches cases where subscription state changes but webhooks fail to deliver, ensuring that users don’t keep running VMs after cancelling subscriptions.
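The `sendEmailOnce()` call in the sync loop is what keeps a job that runs every 30 minutes from re-sending the same suspension email on every pass. A minimal sketch of that helper, assuming the real implementation persists keys (e.g. a unique index on an EmailLog collection) rather than the in-memory Set used here for illustration:

```javascript
// Hypothetical sketch of sendEmailOnce: key-based deduplication so repeated
// background runs don't re-send the same notification. In-memory only here;
// production would persist keys so dedup survives restarts.
const sentKeys = new Set();

async function deliverEmail({ to, subject }) {
  // stand-in for the real mail transport
  console.log(`[Mail] -> ${to}: ${subject}`);
}

async function sendEmailOnce({ key, to, subject, text }) {
  if (sentKeys.has(key)) return { sent: false, reason: "duplicate" };
  sentKeys.add(key); // claim the key before sending to avoid double-send races
  await deliverEmail({ to, subject, text });
  return { sent: true };
}
```

The key encodes the event, not just the recipient (`vm:6001:suspended:no_subscription`), so the same user can still receive a fresh email if the VM is suspended again later under a different key.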


Resource Cleanup Utilities

The platform also needs utilities for manual cleanup operations:

async function terminateVm(vmId, reason = "manual") {
  console.log(`[Terminate] Starting termination of VM ${vmId} (reason: ${reason})`);
  
  try {
    // Find the VM record
    const vm = await Vm.findOne({ vmId });
    const user = vm ? await User.findById(vm.userId) : null;
    
    // Stop the VM in Proxmox (don't fail if already stopped)
    try {
      const vmInfo = await getQemuConfigViaRest({ node: vm?.node || "pve", vmId });
      if (vmInfo) {
        await sshExec(`qm stop ${vmId} --skiplock 1 || true`);
        console.log(`[Terminate] Stopped VM ${vmId}`);
        
        // Wait a bit for clean shutdown
        await sleep(5000);
        
        // Destroy the VM
        await sshExec(`qm destroy ${vmId} --skiplock 1 --purge 1 || true`);
        console.log(`[Terminate] Destroyed VM ${vmId}`);
      }
    } catch (err) {
      console.error(`[Terminate] Error stopping/destroying VM ${vmId}:`, err);
    }

    // Clean up Cloudflare tunnel and DNS
    if (vm?.cf?.tunnelId) {
      try {
        await deleteCloudflareTunnel(vm.cf.tunnelId);
        await deleteCloudflareRecord(vm.cf.sshDnsRecordId);
        if (vm.cf.eveDnsRecordId) {
          await deleteCloudflareRecord(vm.cf.eveDnsRecordId);
        }
        console.log(`[Terminate] Cleaned up Cloudflare resources for VM ${vmId}`);
      } catch (err) {
        console.error(`[Terminate] Error cleaning up Cloudflare for VM ${vmId}:`, err);
      }
    }

    // Clean up database records
    if (user) {
      user.vmId = null;
      user.vmName = null;
      user.sshHostname = null;
      user.eveHostname = null;
      user.eveWebUrl = null;
      user.cfTunnelId = null;
      user.cfTunnelName = null;
      user.cfDnsRecordId = null;
      user.plan = null;
      await user.save();
    }

    if (vm) {
      vm.status = 'terminated';
      await vm.save();
    }

    console.log(`[Terminate] Successfully terminated VM ${vmId}`);
    
    return { success: true, vmId };
    
  } catch (err) {
    console.error(`[Terminate] Failed to terminate VM ${vmId}:`, err);
    return { success: false, vmId, error: err.message };
  }
}

async function suspendVm(vmId) {
  console.log(`[Suspend] Suspending VM ${vmId}`);
  
  try {
    await sshExec(`qm stop ${vmId} || true`);
    
    const vm = await Vm.findOne({ vmId });
    if (vm) {
      vm.status = 'suspended';
      await vm.save();
    }
    
    console.log(`[Suspend] Successfully suspended VM ${vmId}`);
    return { success: true };
  } catch (err) {
    console.error(`[Suspend] Failed to suspend VM ${vmId}:`, err);
    return { success: false, error: err.message };
  }
}

async function resumeVm(vmId) {
  console.log(`[Resume] Resuming VM ${vmId}`);
  
  try {
    await sshExec(`qm start ${vmId} || true`);
    
    const vm = await Vm.findOne({ vmId });
    if (vm) {
      vm.status = 'active';
      await vm.save();
    }
    
    console.log(`[Resume] Successfully resumed VM ${vmId}`);
    return { success: true };
  } catch (err) {
    console.error(`[Resume] Failed to resume VM ${vmId}:`, err);
    return { success: false, error: err.message };
  }
}

The Background Job Runner

All these cleanup and monitoring functions need to run periodically. The platform uses a simple interval-based approach:

// Background job scheduler - runs cleanup and monitoring tasks
function startBackgroundJobs() {
  console.log('[Jobs] Starting background job scheduler');

  // Retry failed provisioning every 10 minutes
  setInterval(async () => {
    try {
      await retryFailedProvisioning();
    } catch (err) {
      console.error('[Jobs] Error in retryFailedProvisioning:', err);
    }
  }, 10 * 60 * 1000);

  // Detect operational issues every 5 minutes
  setInterval(async () => {
    try {
      await detectStuckOperations();
    } catch (err) {
      console.error('[Jobs] Error in detectStuckOperations:', err);
    }
  }, 5 * 60 * 1000);

  // Sync subscription states every 30 minutes
  setInterval(async () => {
    try {
      await syncSubscriptionStates();
    } catch (err) {
      console.error('[Jobs] Error in syncSubscriptionStates:', err);
    }
  }, 30 * 60 * 1000);

  // Detect orphaned VMs every hour
  setInterval(async () => {
    try {
      await detectOrphanedVms();
    } catch (err) {
      console.error('[Jobs] Error in detectOrphanedVms:', err);
    }
  }, 60 * 60 * 1000);

  // Clean up old failed orders (after 7 days)
  setInterval(async () => {
    try {
      const cutoff = new Date(Date.now() - 7 * 24 * 60 * 60 * 1000);
      const result = await Order.deleteMany({
        status: 'failed',
        createdAt: { $lt: cutoff },
        retryCount: { $gte: 3 }
      });
      if (result.deletedCount > 0) {
        console.log(`[Jobs] Cleaned up ${result.deletedCount} old failed orders`);
      }
    } catch (err) {
      console.error('[Jobs] Error cleaning up old orders:', err);
    }
  }, 24 * 60 * 60 * 1000); // daily

  console.log('[Jobs] Background jobs started');
}

// Start background jobs when the server starts
startBackgroundJobs();

This isn’t sophisticated, but it’s reliable. Each job is isolated and failures in one don’t affect the others.
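One refinement worth adding even to this simple scheduler is an overlap guard: if a sync run takes longer than its interval, the next tick should be skipped rather than stacked. A sketch of a hypothetical wrapper (not part of the platform's code as shown):

```javascript
// Skip interval ticks while the previous run of the same job is still in
// flight, so a slow Proxmox or Stripe call can't pile up concurrent runs.
function nonOverlapping(name, fn) {
  let running = false;
  return async () => {
    if (running) {
      console.log(`[Jobs] ${name} still running, skipping this tick`);
      return;
    }
    running = true;
    try {
      await fn();
    } finally {
      running = false;
    }
  };
}

// usage: setInterval(nonOverlapping("sync", syncSubscriptionStates), 30 * 60 * 1000);
```

This only guards within a single process; running two instances of the app would still need a shared lock.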


Health Checks and Status Endpoints

For external monitoring, the platform exposes health check endpoints:

// Basic health check
app.get('/api/health', async (req, res) => {
  try {
    // Check database connection
    await User.findOne().limit(1);
    
    // Check Proxmox API
    await pveFetchJson('/version', { method: 'GET' });
    
    // Check Stripe API
    await stripe.balance.retrieve();
    
    res.json({ 
      status: 'healthy',
      timestamp: new Date().toISOString(),
      version: process.env.npm_package_version || 'unknown'
    });
  } catch (err) {
    res.status(500).json({
      status: 'unhealthy', 
      error: err.message,
      timestamp: new Date().toISOString()
    });
  }
});

// Detailed status for operations dashboard
app.get('/api/status', requireAuth, requireAdmin, async (req, res) => {
  try {
    const [
      totalUsers,
      activeVms, 
      pendingOrders,
      failedOrders,
      stuckOperations,
      orphanedVms
    ] = await Promise.all([
      User.countDocuments(),
      Vm.countDocuments({ status: 'active' }),
      Order.countDocuments({ status: 'provisioning' }),
      Order.countDocuments({ status: 'failed' }),
      detectStuckOperations(),
      detectOrphanedVms()
    ]);

    res.json({
      stats: {
        totalUsers,
        activeVms,
        pendingOrders,
        failedOrders
      },
      issues: {
        stuckOperations: stuckOperations.length,
        orphanedVms: orphanedVms.length
      },
      details: {
        stuckOperations,
        orphanedVms
      },
      timestamp: new Date().toISOString()
    });
  } catch (err) {
    res.status(500).json({ error: err.message });
  }
});

The /api/health endpoint is designed for load balancers and uptime monitoring. The /api/status endpoint gives detailed operational visibility.


What This Cleanup Code Prevents

Without these background processes, the platform would slowly degrade:

  • Resource leaks: Failed provisioning would leave VMs running in Proxmox but not tracked in the database
  • Billing drift: Cancelled subscriptions wouldn’t stop VM access, active subscriptions wouldn’t resume suspended VMs
  • Stuck operations: Transient failures would become permanent, requiring manual intervention
  • Poor user experience: Users would see “provisioning” status forever instead of getting retry attempts

The cleanup code isn’t glamorous, but it’s what makes the difference between a demo that works once and a platform that works reliably.

What I’d Improve Next

The current approach has some obvious limitations:

No proper job queue: Everything runs on intervals, which means jobs can overlap or get skipped if they run long. A proper job queue (Redis-based or database-backed) would be more reliable.

Limited retry logic: The current retry is a fixed-interval loop with a simple attempt cap. More sophisticated retry logic with exponential backoff, jitter, and different strategies per operation type would be better.
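Exponential backoff with jitter is small enough to sketch inline. The "full jitter" variant picks a random delay up to the exponentially growing cap, so many retries triggered by the same outage don't all fire again at the same instant. The `baseMs`/`capMs` values below are illustrative, not tuned for this platform:

```javascript
// Exponential backoff with full jitter: delay doubles per attempt up to a
// cap, and a uniformly random fraction of it is used.
function backoffDelay(attempt, baseMs = 1000, capMs = 60000) {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * exp);
}
```

The retry job would sleep `backoffDelay(order.retryCount)` before re-running `provisionVmForPaidOrder()` instead of relying purely on the 10-minute interval.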

Manual alerting: The platform logs issues but doesn’t automatically page anyone. Integration with PagerDuty, Slack, or email alerting would catch problems faster.

No circuit breakers: If Proxmox or Stripe APIs are down, the platform keeps hammering them instead of backing off gracefully.
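A minimal circuit breaker is also only a few lines. The sketch below (hypothetical helper, illustrative defaults) opens after a run of consecutive failures and fails fast until a cooldown elapses, instead of letting every background job keep hammering a downed API:

```javascript
// Minimal circuit breaker: after `threshold` consecutive failures, reject
// calls immediately until `cooldownMs` has passed; one success closes it.
function circuitBreaker(fn, { threshold = 5, cooldownMs = 30000 } = {}) {
  let failures = 0;
  let openedAt = 0;
  return async (...args) => {
    if (failures >= threshold && Date.now() - openedAt < cooldownMs) {
      throw new Error("circuit open");
    }
    try {
      const result = await fn(...args);
      failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      failures += 1;
      openedAt = Date.now();
      throw err;
    }
  };
}
```

Wrapping `pveFetchJson` and the Stripe calls this way would turn an external outage into fast, visible failures rather than a pile of slow timeouts across every job.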

These improvements matter more as the platform scales, but they’re not critical at the current size. The simple approach works and is easy to debug when something goes wrong.

The provisioning pipeline from Part 4 gets users from payment to working VM. The cleanup and monitoring code in this post keeps that working over time, even when individual operations fail or external systems have issues.


Together, they form the operational backbone that lets the platform run without constant manual intervention. Next week: the user-facing parts — dashboards, controls, and how users actually interact with their lab VMs once they’re provisioned.

Need a real lab environment?

I run a small KVM-based lab VPS platform designed for Containerlab and EVE-NG workloads — without cloud pricing nonsense.

Visit localedgedatacenter.com →
This post is licensed under CC BY 4.0 by the author.