Building a Budget-Friendly Lab VPS Platform – Part 4: Provisioning After Payment - Wiring Stripe, Proxmox, and Reality Together

TL;DR: No VM is created, resized, or modified until Stripe confirms payment. Everything else in the code exists to enforce that rule.


Last week was about access boundaries — how users reach their lab VMs without ever touching Proxmox directly.

This week is about a different boundary, and it’s one that’s much easier to get wrong:

billing vs infrastructure.

Specifically: when does the platform actually do something irreversible?

A lot of homegrown platforms quietly blur this line. They provision first, bill later, and spend the rest of their lives trying to reconcile reality after the fact.

This post is about how I’ve tried — imperfectly — to avoid that.


The Rule That Shapes Everything

There’s one rule that shows up everywhere in the provisioning code:

Proxmox does nothing until Stripe says the money is paid.

Not:

  • “Checkout was created”
  • “The user clicked confirm”
  • “We’ll charge them later”

Paid.

That rule drives VM creation, resizing, and even suspension behavior. It makes the code longer and more defensive than I’d like — but it keeps infrastructure from drifting ahead of billing.


Orders Are State, Not Action

When a user provisions a VM, the platform does not immediately touch Proxmox.

Instead, it creates an Order record and then deliberately does nothing with infrastructure until that order reaches a paid state. The order represents intent, not execution.
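The Order schema itself isn't shown anywhere in this post, so here is a rough sketch of the shape the rest of the code assumes, written Mongoose-style since that's what the findById/save calls suggest. Field names are illustrative; the status lifecycle is the part that matters.

const mongoose = require("mongoose");

// Sketch of the Order model as the rest of this post uses it.
// Field names are illustrative; only the status lifecycle is load-bearing:
// pending -> paid -> provisioning -> provisioned (or failed).
const OrderSchema = new mongoose.Schema(
  {
    userId: { type: mongoose.Schema.Types.ObjectId, ref: "User", required: true },
    plan: Object, // vcpus, ram, disk, node, templateId, labType, ...
    stripeCustomerId: String,
    stripeSubscriptionId: String,
    status: {
      type: String,
      enum: ["pending", "paid", "provisioning", "provisioned", "failed"],
      default: "pending",
    },
    vmId: Number,
    error: String,
  },
  { timestamps: true }
);

const Order = mongoose.model("Order", OrderSchema);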

You can see that separation immediately in the provisioning entry point:

async function provisionVmForPaidOrder(orderId) {
  const order = await Order.findById(orderId);
  if (!order) throw new Error(`Order not found: ${orderId}`);

  if (order.status === "provisioning" || order.status === "provisioned") {
    return;
  }

  order.status = "provisioning";
  await order.save();
  // Proxmox work only happens after this point
}

This function is never called for “checkout created” or “user clicked confirm.” It’s only invoked after Stripe has confirmed payment and the order has transitioned into a paid state.

That may feel heavy-handed, but it prevents the most common failure mode: provisioning machines for sessions that never convert.
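To make that concrete, here is roughly what the Stripe side of that boundary looks like, assuming an Express app. This is a sketch, not the real handler: the event type, signature check, and payment_status field are standard Stripe, but the stripeCheckoutSessionId lookup field and the pending-to-paid transition are my assumptions about how the flow described above is wired.

// Sketch: the only place infrastructure work is ever triggered.
// Assumes the Order stored its Checkout Session id (stripeCheckoutSessionId)
// when checkout was created; that field name is illustrative.
app.post(
  "/webhooks/stripe",
  express.raw({ type: "application/json" }),
  async (req, res) => {
    let event;
    try {
      event = stripe.webhooks.constructEvent(
        req.body,
        req.headers["stripe-signature"],
        process.env.STRIPE_WEBHOOK_SECRET
      );
    } catch (err) {
      return res.status(400).send(`Webhook signature verification failed: ${err.message}`);
    }

    if (event.type === "checkout.session.completed") {
      const session = event.data.object;
      if (session.payment_status === "paid") {
        const order = await Order.findOne({ stripeCheckoutSessionId: session.id });
        if (order && order.status === "pending") {
          order.status = "paid";
          await order.save();
          await provisionVmForPaidOrder(order._id); // Proxmox work starts here, and only here
        }
      }
    }

    res.json({ received: true });
  }
);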


Stripe Is the Source of Truth — Not the UI

The UI never decides when something changes in Proxmox.

Stripe does.

That’s most obvious in how resize requests are handled. When a user requests more CPU or RAM, the platform records a ResizeRequest, but it does not touch the VM.

const rr = await ResizeRequest.create({
  userId,
  vmId,
  subscriptionId,
  desiredTotals: { cpu, ramGb, diskGb },
  status: "created",
});

At this point:

  • The VM is unchanged
  • Proxmox has not been contacted
  • No SSH commands have run

The resize request just sits there until Stripe confirms that money has actually been collected.

Only when Stripe reports a paid invoice does the platform cross the boundary:

await resizeApplyOnProxmox({
  vmId: rr.vmId,
  cpu: rr.desiredTotals.cpu,
  ramGb: rr.desiredTotals.ramGb,
});

That sequencing is intentional. It trades speed for correctness and avoids the mess of rolling back infrastructure after a failed charge.


Charging Immediately Instead of “Eventually”

Upgrades are charged immediately, not deferred to the next billing cycle.

That decision shows up clearly in the Stripe flow. When a resize is requested, the platform explicitly creates and pays a proration invoice instead of waiting:

const invoice = await stripe.invoices.create({
  customer: customerId,
  subscription: subscriptionId,
  collection_method: "charge_automatically",
  auto_advance: false,
});

const finalized = await stripe.invoices.finalizeInvoice(invoice.id);
const paidInvoice = await stripe.invoices.pay(finalized.id);

If that payment fails or requires user action, the resize is never applied. The VM stays exactly as it was.

This avoids a whole class of “you owe us money” states where infrastructure and billing drift apart.

The code here isn’t elegant, but it matches how Stripe actually behaves under failure — which matters more than how clean it looks.
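Putting those two snippets together, the gate around the resize looks roughly like this, picking up right after the finalize call above. It is a sketch: checking the status on the invoice returned by stripe.invoices.pay is how I read the flow, and the ResizeRequest status names are illustrative.

// Sketch: the payment result gates the resize. If pay() throws, or the
// invoice ends up anything other than "paid" (e.g. it requires 3DS),
// the ResizeRequest is parked and the VM stays untouched.
let paidInvoice;
try {
  paidInvoice = await stripe.invoices.pay(finalized.id);
} catch (err) {
  rr.status = "payment_failed"; // illustrative status name
  rr.error = err.message;
  await rr.save();
  return; // nothing on Proxmox has changed
}

if (paidInvoice.status !== "paid") {
  rr.status = "payment_action_required"; // illustrative status name
  await rr.save();
  return;
}

rr.status = "paid";
await rr.save();

await resizeApplyOnProxmox({
  vmId: rr.vmId,
  cpu: rr.desiredTotals.cpu,
  ramGb: rr.desiredTotals.ramGb,
});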


Provisioning Is Procedural on Purpose

Once payment clears, provisioning becomes very literal and very linear.

There’s no abstraction hiding what’s happening. The code walks Proxmox step by step:

await proxmox.nodes.$(node).qemu.$(templateId).clone.$post({
  newid: vmId,
  full: 1,
  name: vmName,
});

await waitForCloneUnlock({ node, vmId });

await proxmox.nodes.$(node).qemu.$(vmId).config.$put({
  cores: plan.vcpus,
  memory: plan.ram * 1024,
});

Disk resizing, cloud-init injection, tunnel creation, and boot all follow in the same explicit sequence.

Each step blocks on the previous one. Each step can fail. Each failure is handled where it occurs.

This isn’t a workflow engine. It’s an admission that Proxmox is stateful, sometimes slow, and very honest about when it’s busy.
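Two small helpers carry a lot of that weight: withTimeout() wraps every Proxmox call in the companion code at the end of this post, and sleep() backs the retry loop in the failure-handling section below. Neither is shown elsewhere, so here is a minimal sketch of what they might look like; the real implementations may differ.

// Minimal sketch of the helpers referenced throughout the provisioning code.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Rejects if `promise` doesn't settle within `ms`, tagging the error with a
// label so the logs show which Proxmox call stalled.
function withTimeout(promise, ms, label = "operation") {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}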


SSH Is Still in the Loop — and That’s a Compromise

Some operations still go through SSH and qm. That’s visible and intentional.

For example, resizing CPU and memory during a paid resize:

await sshExec(`qm stop ${vmId} --skiplock 1 || true`);
await sshExec(`qm set ${vmId} -cores ${cores} -memory ${memMb}`);
await sshExec(`qm start ${vmId} || true`);

I don’t love this, but the REST API doesn’t cleanly cover every operation the platform needs today. SSH access is scoped to the orchestrator, commands are short-lived, and nothing is interactive.

It’s a compromise, not something I’m pretending doesn’t exist.
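For completeness, sshExec() itself is a thin wrapper. The sketch below shells out to the system ssh binary with key-based, non-interactive auth; that is an assumption (the real helper might use an SSH library instead), and the PVE_SSH_USER / PVE_SSH_HOST variables are illustrative.

const { execFile } = require("node:child_process");

// Sketch: run a single non-interactive command on the Proxmox host over SSH.
// Assumes key-based auth for a dedicated orchestrator user.
function sshExec(command, { timeoutMs = 60_000 } = {}) {
  return new Promise((resolve, reject) => {
    execFile(
      "ssh",
      ["-o", "BatchMode=yes", `${process.env.PVE_SSH_USER}@${process.env.PVE_SSH_HOST}`, command],
      { timeout: timeoutMs },
      (err, stdout, stderr) => {
        if (err) return reject(new Error(`ssh failed: ${stderr || err.message}`));
        resolve(stdout.trim());
      }
    );
  });
}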


Failure Is Treated as a Normal Outcome

A lot of the code exists purely to handle failure paths.

Retry logic around Proxmox locks looks like this:

async function retryOperation(operation, maxRetries = 12, delayMs = 5000) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (String(err.message).toLowerCase().includes("lock")) {
        await sleep(delayMs);
        continue;
      }
      throw err;
    }
  }
  throw new Error("Max retries exceeded");
}

If provisioning ultimately fails, the order is marked failed and stops there. The VM isn’t partially owned. The user doesn’t get partial access.

That’s not advanced error handling. It’s just refusing to pretend failure is rare.


What I’d Refactor Next

The code works, but a few things are clearly showing strain.

My single server.js is doing too much. Billing, provisioning, access control, and orchestration all live in one file because it was expedient, not because it’s a good shape.

Retry and timeout logic should be centralized. Right now it’s scattered and inconsistent, even when it behaves correctly.

Longer term, this probably wants a background job system. The synchronous approach is defensible at this scale, but it won’t age gracefully.

None of these are urgent fixes. The platform behaves correctly today. But they’re the parts that slow me down every time I touch the code — which usually means they’re the right places to improve next.


Code Companion - provisionVmForPaidOrder() End-to-End

// Entry point: provisioning only after payment.
// Key property: idempotent-ish (won't double-provision if already provisioning/provisioned).
async function provisionVmForPaidOrder(orderId) {
  const order = await Order.findById(orderId);
  if (!order) throw new Error(`Order not found: ${orderId}`);
  if (order.status === "provisioning" || order.status === "provisioned") {
    console.log(`[Provision] Order ${orderId} already ${order.status}; skipping`);
    return;
  }

  // Transition early so duplicate webhook deliveries don't double-run.
  order.status = "provisioning";
  order.error = null;
  await order.save();

  try {
    const user = await User.findById(order.userId);
    if (!user) throw new Error("Not authorized");

    const plan = order.plan;
    const node = plan.node;
    const templateId = plan.templateId;

    console.log(
      `[Provision] paid order=${orderId} user=${user.email} plan=${JSON.stringify(plan)} stripeSub=${
        order.stripeSubscriptionId || "none"
      } interval=${order.billing?.interval || "month"}`
    );

    // Allocate a free VMID from cluster resources (within configured range).
    const vmId = await generateVmId();
    const vmName = safeVmNameFromEmail(user.email);

    // Clone timing knobs
    const CLONE_TIMEOUT_MS = Number.isFinite(parseInt(process.env.CLONE_TIMEOUT_MS || "", 10))
      ? parseInt(process.env.CLONE_TIMEOUT_MS, 10)
      : 15 * 60 * 1000;
    const CLONE_POLL_MS = Number.isFinite(parseInt(process.env.CLONE_POLL_MS || "", 10))
      ? parseInt(process.env.CLONE_POLL_MS, 10)
      : 10_000;

    console.log(
      `[Provision] clone start templateId=${templateId} -> vmId=${vmId} (timeout=${Math.round(
        CLONE_TIMEOUT_MS / 1000
      )}s)`
    );

    // Clone template -> new VM
    await withTimeout(
      proxmox.nodes.$(node).qemu.$(templateId).clone.$post({ newid: vmId, full: 1, name: vmName }),
      CLONE_TIMEOUT_MS,
      "pve clone"
    );

    // Wait for Proxmox to drop the clone lock so later operations don't fail.
    console.log("[Provision] clone request returned; waiting for clone lock to clear...");
    await waitForCloneUnlock({ node, vmId, timeoutMs: CLONE_TIMEOUT_MS, pollMs: CLONE_POLL_MS });
    console.log("[Provision] clone done (lock cleared)");

    // Derive guest username and recover stored password (encrypted at registration).
    const vmUsername = vmUsernameFromEmail(user.email);
    const vmPassword = user.accountPasswordEnc ? decryptString(user.accountPasswordEnc) : null;

    // Apply CPU/RAM and cloud-init guest auth via Proxmox config API.
    console.log("[Provision] config start");
    await retryOperation(async () => {
      await withTimeout(
        proxmox.nodes.$(node).qemu.$(vmId).config.$put({
          cores: plan.vcpus,
          memory: plan.ram * 1024,
          ciuser: vmUsername,
          cipassword: vmPassword || undefined,
          ipconfig0: "ip=dhcp",
        }),
        20_000,
        "pve config"
      );
    });
    console.log("[Provision] config done");

    // Disk resize: check current disk size and only grow if needed.
    const diskInterface = process.env.VM_DISK_INTERFACE || "scsi0";
    const currentDisk = await getCurrentDiskSize(node, vmId, diskInterface);
    if (plan.disk > currentDisk) {
      console.log(`[Provision] resize start (${currentDisk}G -> ${plan.disk}G)`);
      await retryOperation(async () => {
        await withTimeout(
          proxmox.nodes.$(node).qemu.$(vmId).resize.$put({
            disk: diskInterface,
            size: `+${plan.disk - currentDisk}G`,
          }),
          20_000,
          "pve resize"
        );
      });
      console.log("[Provision] resize done");
    }

    // Determine whether this VM needs EVE web ingress in addition to SSH ingress.
    const isEve = plan.type === "lab" && String(plan.labType || "").toLowerCase().includes("eve");

    // Create Cloudflare tunnel + DNS and receive a tunnel token to inject into the guest.
    console.log(`[Provision] cloudflare tunnel/dns start (isEve=${isEve})`);
    const cf = await ensureTunnelAndDns({ vmId, includeEveHttp: isEve });
    console.log("[Provision] cloudflare tunnel/dns done");

    // Render cloud-init YAML (contains tunnel token + SSH hostname + user creds).
    const userDataYaml = renderCloudInitTemplate({
      tunnelToken: cf.tunnelToken,
      sshHostname: cf.sshHostname,
      vmUsername,
      vmPassword,
    });

    // Upload cloud-init snippet to Proxmox and attach to VM.
    const snippetFilename = `user-data-${vmId}.yaml`;
    await uploadCloudInitSnippetToProxmox({ snippetFilename, userDataYaml });
    await attachCloudInitSnippetToVm({ vmId, snippetFilename });

    // Boot VM.
    console.log("[Provision] start VM");
    await retryOperation(async () => {
      await withTimeout(proxmox.nodes.$(node).qemu.$(vmId).status.start.$post(), 20_000, "pve start");
    });
    console.log("[Provision] start VM done");

    // Persist “user-facing” fields on the User model (legacy-ish convenience fields).
    user.plan = {
      type: "lab",
      vcpus: plan.vcpus,
      ram: plan.ram,
      disk: plan.disk,
      labType: plan.labType,
    };
    user.vmId = vmId;
    user.vmName = vmName;
    user.sshHostname = cf.sshHostname;
    user.cfTunnelId = cf.tunnelId;
    user.cfTunnelName = cf.tunnelName;
    user.cfDnsRecordId = cf.sshDnsRecordId || null;
    user.eveHostname = isEve ? cf.eveHostname : null;
    user.eveWebUrl = isEve ? cf.eveWebUrl : null;

    // Keep Stripe linkage if present on the Order.
    try {
      user.stripeCustomerId = order.stripeCustomerId || user.stripeCustomerId || null;
    } catch {}

    await user.save();

    // Persist canonical VM record (Vm collection) with tunnel + Stripe linkage.
    await upsertVmRecord({
      userId: user._id,
      vmId,
      vmName,
      node,
      plan,
      sshHostname: cf.sshHostname,
      eveHostname: isEve ? cf.eveHostname : null,
      eveWebUrl: isEve ? cf.eveWebUrl : null,
      cf,
      stripeSubscriptionId: order.stripeSubscriptionId || null,
      stripeCustomerId: order.stripeCustomerId || null,
      orderId: String(order._id),
    });

    // Finalize Order.
    order.status = "provisioned";
    order.vmId = vmId;
    await order.save();

    // Notify user (optional SMTP); email contains access details.
    const firstName = user.email.split("@")[0].split(/[.\-_]/)[0];
    const capitalizedName = firstName.charAt(0).toUpperCase() + firstName.slice(1);

    const readyMsg = {
      subject: `Your Lab VM is Ready! (VM ID: ${vmId}) – ${appName()}`,
      text: `Hi ${capitalizedName},

Fantastic news — your lab VM is fully provisioned and online!

Access Details

SSH Access
→ Host: ${cf.sshHostname}
→ Username: ${vmUsername}
→ Password: Same as your ${appName()} account password

${
  isEve
    ? `EVE-NG Web UI
→ URL: https://${cf.eveHostname}/
→ Default credentials: admin / admin
   (Please change the password immediately after first login!)

`
    : ""
}You can also use the built-in noVNC console directly from your dashboard at any time.

Your subscription is now active. Enjoy your lab!

Dashboard: ${absoluteUrl("/dashboard")}

Need help? Reply to this email — we're here to assist.

— ${appName()} Team
`,
    };

    await sendEmailOnce({
      key: `vm:${vmId}:ready`,
      userId: user._id,
      vmId,
      to: user.email,
      subject: readyMsg.subject,
      text: readyMsg.text,
      meta: { type: "vm_ready", orderId: String(order._id), vmId },
    });

    console.log(`[Provision] SUCCESS order=${orderId} vmId=${vmId}`);
  } catch (err) {
    // If anything fails, record the failure on the Order so the system can reconcile.
    console.error(`[Provision] FAILED order=${orderId}:`, err);
    const order2 = await Order.findById(orderId);
    if (order2) {
      order2.status = "failed";
      order2.error = err?.message || String(err);
      await order2.save();
    }
    throw err;
  }
}

Supporting functions used by the pipeline (the ones that actually enforce behavior):

// VMID allocation from Proxmox cluster resources
async function generateVmId() {
  const minId = parseInt(process.env.VM_ID_MIN || "6000", 10);
  const maxId = parseInt(process.env.VM_ID_MAX || "6999", 10);
  const json = await pveFetchJson(`/cluster/resources?type=vm`, { method: "GET" });
  const items = Array.isArray(json?.data) ? json.data : [];
  const used = new Set();
  for (const it of items) {
    const id = parseInt(it?.vmid, 10);
    if (Number.isFinite(id)) used.add(id);
  }
  for (let vmId = minId; vmId <= maxId; vmId++) {
    if (!used.has(vmId)) return vmId;
  }
  throw new Error(`No available VMIDs in range ${minId}-${maxId}`);
}
// Wait for clone lock to clear by polling VM config via REST
async function waitForCloneUnlock({ node, vmId, timeoutMs = 15 * 60 * 1000, pollMs = 10_000 }) {
  const start = Date.now();
  while (true) {
    const cfg = await getQemuConfigViaRest({ node, vmId });
    const lock = cfg?.lock ? String(cfg.lock) : "";
    if (!lock) return cfg;
    if (Date.now() - start > timeoutMs) {
      throw new Error(`Timed out waiting for vmId=${vmId} clone lock to clear (last lock='${lock}')`);
    }
    await sleep(pollMs);
  }
}
// Create/update Cloudflare tunnel config + DNS for this VM, return token for cloud-init injection
async function ensureTunnelAndDns({ vmId, includeEveHttp = false }) {
  const sshHostname = `ssh-${vmId}.${mustEnv("BASE_DOMAIN")}`;
  const eveHostname = includeEveHttp ? `eve-${vmId}.${mustEnv("BASE_DOMAIN")}` : null;

  // tunnel creation omitted here (it contains the Cloudflare API calls)
  const ingress = [{ hostname: sshHostname, service: "ssh://localhost:22" }];
  if (includeEveHttp && eveHostname) {
    ingress.push({ hostname: eveHostname, service: "http://localhost:80" });
  }
  ingress.push({ service: "http_status:404" });

  // upsert DNS + fetch tunnel token (placeholder return values shown in this excerpt)
  return { tunnelToken: "", sshHostname, eveHostname, tunnelId: "", sshDnsRecordId: "" };
}
// Cloud-init template render -> upload to Proxmox snippets -> attach to VM -> cloudinit update
function renderCloudInitTemplate({ tunnelToken, sshHostname, vmUsername, vmPassword }) {
  const templatePath = path.join(__dirname, "views", "cloudinit", "user-data.ejs");
  const tmpl = fs.readFileSync(templatePath, "utf8");
  const vmPasswordB64 = Buffer.from(String(vmPassword || ""), "utf8").toString("base64");
  return ejs.render(tmpl, { tunnelToken, sshHostname, vmUsername, vmPasswordB64 });
}
async function uploadCloudInitSnippetToProxmox({ snippetFilename, userDataYaml }) {
  const snippetsDir = process.env.PROXMOX_SNIPPETS_DIR || "/var/lib/vz/snippets";
  await sshExec(`mkdir -p ${snippetsDir}`);
  await sftpUploadText(`${snippetsDir}/${snippetFilename}`, userDataYaml);
  await sshExec(`test -s ${snippetsDir}/${snippetFilename}`);
}
async function attachCloudInitSnippetToVm({ vmId, snippetFilename }) {
  const storage = process.env.PROXMOX_SNIPPET_STORAGE || "local";
  await sshExec(`qm set ${vmId} --cicustom "user=${storage}:snippets/${snippetFilename}"`);
  await sshExec(`qm cloudinit update ${vmId}`);
}
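A few helpers referenced above (pveFetchJson, getQemuConfigViaRest, mustEnv) aren't shown in this post. As one example, a minimal pveFetchJson against the Proxmox REST API using an API token could look like the sketch below; the env var names are assumptions, and a real version would also need to deal with self-signed certificates and richer error reporting.

// Sketch: thin wrapper around the Proxmox REST API using an API token.
// PVE_API_URL / PVE_API_TOKEN are assumed env vars, e.g.
//   PVE_API_URL=https://pve.example.com:8006/api2/json
//   PVE_API_TOKEN=user@pam!orchestrator=<token-secret>
async function pveFetchJson(path, { method = "GET", body } = {}) {
  const res = await fetch(`${process.env.PVE_API_URL}${path}`, {
    method,
    headers: {
      Authorization: `PVEAPIToken=${process.env.PVE_API_TOKEN}`,
      ...(body ? { "Content-Type": "application/json" } : {}),
    },
    body: body ? JSON.stringify(body) : undefined,
  });
  if (!res.ok) {
    throw new Error(`PVE API ${method} ${path} failed: ${res.status} ${res.statusText}`);
  }
  return res.json();
}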
Need a real lab environment?

I run a small KVM-based lab VPS platform designed for Containerlab and EVE-NG workloads — without cloud pricing nonsense.

Visit localedgedatacenter.com →