ProxLB VM Migration Error: Max CPU Issue

by SLV Team

Hey guys! Let's break down a tricky issue I encountered while using ProxLB for VM migration in my Proxmox cluster. Specifically, ProxLB was trying to move a VM to a host that didn't have enough CPU resources, leading to a migration failure. We'll explore the problem, the configuration, the logs, and how to potentially address it. This is a common issue, and understanding it can save you a lot of headaches when managing your virtual machines.

The Core Problem: Max CPU Mismatch

At its heart, the issue stems from a max CPU mismatch. ProxLB, in its effort to balance resources, attempted to migrate a VM (guest web07.redacted.domain, in this case) to a Proxmox node (server05) that couldn't support the VM's CPU requirements. The target node simply didn't have the available CPU cores to handle the VM as configured. This led to a migration failure, specifically an online migrate failure as reported by Proxmox. This kind of problem is something many of us face, so let's try to understand the details to avoid it.

When a VM needs to be migrated, ProxLB examines the available resources and makes a decision based on the configuration. In my case, ProxLB identified server05 as a suitable destination based on available memory and other factors. However, Proxmox itself then refused the move because the VM's vCPU count exceeded what server05 allows, and the migration was aborted.

Diving into the Configuration

Let's get into the details of the configuration that influenced this behavior. Understanding the setup is crucial to grasping the root cause. This section will go over the relevant parts of the ProxLB configuration file that played a role in this situation.

ProxLB's Perspective

First, we have the proxmox_api section, which defines the Proxmox API connection parameters: the hosts, the user, the token, and other settings like SSL verification and timeouts. The important thing is that the hosts variable specifies the nodes that ProxLB considers for migration. In my case, the list included server06, server05, and server04. The timeout, retries, and wait_time settings are essential for handling potential network or API issues during the migration process.

proxmox_api:
  # hosts: ['server04.redacted.domain', 'server05.redacted.domain', 'server06.redacted.domain', 'server07.redacted.domain', 'server08.redacted.domain']
  hosts: ['server06.redacted.domain', 'server05.redacted.domain', 'server04.redacted.domain']
  user: redacted
  token_id: redacted
  token_secret: redacted
  ssl_verification: False
  timeout: 1000
  retries: 3
  wait_time: 10

Cluster and Balancing Settings

Next, the proxmox_cluster section helps define which nodes are actively considered for migration and which are intentionally excluded. Here, maintenance_nodes and ignore_nodes are especially important. ignore_nodes can tell ProxLB to avoid using certain nodes altogether. In this configuration, server04 is in both maintenance_nodes and ignore_nodes. This is a useful tool to manage your cluster's behavior.

proxmox_cluster:
  maintenance_nodes: ['server04']
  ignore_nodes: ['server04']
  overprovisioning: False

Within the balancing section, we configure how ProxLB approaches resource balancing. The enable parameter turns balancing on or off. Other key options include live (for live migrations), with_local_disks, and balance_types. The method option dictates which metric to use for balancing, in this case memory. It's also worth noting that mode is set to assigned, which, as I understand it, makes ProxLB balance on the resources assigned to the guests rather than their measured usage, while balanciness and max_job_validation influence how aggressively ProxLB tries to move VMs.

balancing:
  enable: True
  enforce_affinity: False
  parallel: False
  parallel_jobs: 1
  live: True
  with_local_disks: True
  with_conntrack_state: False
  balance_types: ['vm', 'ct']
  max_job_validation: 1800
  balanciness: 5
  method: memory
  mode: assigned

Service and Logging

The service section controls the daemon behavior, scheduling, and delay settings. The log_level parameter is key for debugging, as it controls the verbosity of the logs. In my setup, the log level was set to DEBUG, which means we get detailed information about each step of the process. This setting made it much easier to diagnose the issue.

service:
  daemon: False
  schedule:
    interval: 3
    format: hours
  delay:
    enable: False
    time: 1
    format: hours
  log_level: DEBUG

Examining the Log Files

Now, let's turn to the log files, which are crucial for troubleshooting. The logs reveal exactly what happened during the migration attempt and, importantly, what went wrong. The logs come from both ProxLB and the underlying Proxmox system, giving us a comprehensive picture of the events.
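
If you want to pull the same information from your own cluster, here is roughly where I grabbed these excerpts from. Treat the first command as an assumption based on my Debian-repository install (the systemd unit name may differ on yours); the Proxmox side of the migration is recorded as a task log on the source node, which you can also open in the web UI under the node's task history.

# ProxLB log (assuming the systemd unit is named "proxlb"; adjust to your install)
journalctl -u proxlb --since today

# The Proxmox task log for the migration lives on the source node (server06 here);
# searching the task index for the UPID from the ProxLB log points you at the right file
grep -r "qmigrate:1054" /var/log/pve/tasks/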

ProxLB Log Snippets

Here are some relevant excerpts from the ProxLB log that show the steps leading up to the failure. Notice how ProxLB detected the resource usage on each node, selected server05, and then initiated the migration.

2025-10-16 07:21:01,809 - ProxLB - DEBUG - Starting: log_node_metrics.
2025-10-16 07:21:01,809 - ProxLB - DEBUG - Nodes usage memory: server05: 92.78% | server06: 64.97%
2025-10-16 07:21:01,809 - ProxLB - DEBUG - Nodes usage cpu:    server05: 2.88%  | server06: 22.30%
2025-10-16 07:21:01,809 - ProxLB - DEBUG - Nodes usage disk:   server05: 5.58% | server06: 3.84%
2025-10-16 07:21:01,809 - ProxLB - DEBUG - Finished: log_node_metrics.
2025-10-16 07:21:01,809 - ProxLB - DEBUG - Balancing: Parallel balancing is disabled. Running sequentially.
2025-10-16 07:21:01,809 - ProxLB - DEBUG - Starting: chunk_dict.
2025-10-16 07:21:01,810 - ProxLB - DEBUG - Balancing: Balancing for guest web07.redacted.domain of type VM started.
2025-10-16 07:21:01,810 - ProxLB - DEBUG - Starting: exec_rebalancing_vm.
2025-10-16 07:21:01,810 - ProxLB - INFO - Balancing: Starting to migrate VM guest web07.redacted.domain from server06 to server05.
2025-10-16 07:21:01,852 - ProxLB - DEBUG - Finished: exec_rebalancing_vm.
2025-10-16 07:21:01,852 - ProxLB - DEBUG - Starting: get_rebalancing_job_status.
2025-10-16 07:21:11,880 - ProxLB - DEBUG - Balancing: Job ID UPID:server06:003ECE30:05F5BC98:68F080BE:qmigrate:1054:redacted: (guest: web07.redacted.domain) for migration is still running... (Run: 2)
2025-10-16 07:21:11,881 - ProxLB - DEBUG - Starting: get_rebalancing_job_status.
2025-10-16 07:21:11,909 - ProxLB - CRITICAL - Balancing: Job ID UPID:server06:003ECE30:05F5BC98:68F080BE:qmigrate:1054:redacted: (guest: web07.redacted.domain) went into an error! Please check manually.
2025-10-16 07:21:11,909 - ProxLB - DEBUG - Finished: get_rebalancing_job_status.

Proxmox Log Snippets

Now, let's examine the Proxmox logs. This is where the root cause of the failure becomes clear. The Proxmox logs explicitly state the CPU limitation and why the migration was terminated.

2025-10-16 07:21:03 use dedicated network address for sending migration traffic (10.150.0.5)
2025-10-16 07:21:04 starting migration of VM 1054 to node 'vhx05' (10.150.0.5)
2025-10-16 07:21:04 starting VM 1054 on remote node 'server05'
2025-10-16 07:21:05 [vhx05] MAX 4 vcpus allowed per VM on this node
2025-10-16 07:21:05 ERROR: online migrate failure - remote command failed with exit code 255
2025-10-16 07:21:05 aborting phase 2 - cleanup resources
2025-10-16 07:21:05 migrate_cancel
2025-10-16 07:21:06 ERROR: migration finished with problems (duration 00:00:03)
TASK ERROR: migration problems

The crucial line here is [vhx05] MAX 4 vcpus allowed per VM on this node. This tells us that server05 (vhx05 is the server's internal name, which is what appears in the logs) only permits a maximum of 4 vCPUs per VM. The VM being migrated (web07.redacted.domain, VM ID 1054) was configured with more than 4 vCPUs, which is what triggered the failure. This shows how important it is to match the resources a VM requests with what the target host can actually provide.
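
To verify the source side of that mismatch, you can dump the VM's CPU settings with qm on the node currently running it (server06 here). This is just a quick sketch of the check; remember that the effective vCPU count is cores multiplied by sockets unless a separate vcpus value is set.

# Run on the source node; shows the CPU-related settings of VM 1054
qm config 1054 | grep -E '^(cores|sockets|vcpus)'
# Something like "cores: 8" on a single socket means the guest wants 8 vCPUs,
# double the 4-vCPU cap that server05 enforces.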

Root Cause Analysis and Solution

The primary cause of this failure is straightforward: the target host (server05) did not have the necessary CPU resources to accommodate the migrating VM. Specifically, the VM was configured with more virtual CPUs than server05 could provide.
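
The same check is worth doing on the target side. As far as I can tell, Proxmox caps a VM's vCPUs at the number of CPU threads the node itself exposes, which is exactly where the MAX 4 vcpus message comes from. Two quick ways to see that number (a sketch; both tools ship with Proxmox):

# On server05 itself: how many CPU threads does the node have?
nproc

# Or ask the Proxmox API from any cluster node; the "cpus" value in the node's
# cpuinfo is the number that matters here
pvesh get /nodes/server05/status --output-format json-pretty | grep '"cpus"'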

Potential Solutions

  1. Review VM Configuration: The first step is to check how many vCPUs the VM has assigned. If the VM genuinely needs more than the target's limit (4 vCPUs on server05 in my case), migrate it only to hosts that can provide them; otherwise, reduce the VM's vCPU count so it fits the available hardware.
  2. Modify ProxLB Configuration: If you want ProxLB to avoid such migrations automatically, adjust the configuration so the problematic host (server05) is no longer a migration target, for example via ignore_nodes (see the config sketch after this list). This prevents ProxLB from even attempting to migrate to a host that can't run the VM.
  3. Upgrade Hardware: A more long-term solution could involve upgrading the hardware on the underpowered nodes, so they can support more powerful VMs. This would give you more flexibility in resource allocation and migration options.
  4. Affinity Rules: Use affinity rules to influence where guests are placed. This gives you finer control over how VMs are distributed across the cluster, though it does require adjusting the settings of the affected guests.
  5. Monitoring and Alerts: Set up monitoring to watch out for resource contention issues on your nodes, particularly CPU usage. Alerts can proactively notify you of potential migration problems, helping you fix the issues before migrations are attempted.
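
For option 2, the simplest lever in the configuration shown earlier is the ignore_nodes list I'm already using for server04. Extending it with server05 should stop ProxLB from choosing the small HP boxes as targets at all. This is just my proxmox_cluster block with that one value added; double-check the option against the ProxLB documentation for your version before relying on it.

proxmox_cluster:
  maintenance_nodes: ['server04']
  ignore_nodes: ['server04', 'server05']
  overprovisioning: False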

Additional Context: My Setup

Here's some additional context about my setup that might be helpful. I'm running ProxLB on a dedicated VM within the cluster, on a fresh Debian 13 installation, and I installed it from the repository as described in the documentation. I'm still on Proxmox VE 8, albeit with the latest updates for that release installed. The cluster is a mix of server-class machines and some HP ProDesk PCs that I added for testing. The HP PCs, which only have 4 cores, were the source of the issue: the VM in question, web07.redacted.domain, has 8 vCPUs assigned, which exceeds the limit on the HP machines.

  • ProxLB Version: 1.1.8
  • Installation: Debian repository
  • Running as: Dedicated VM on Proxmox cluster
  • Cluster Nodes: 5 total (mix of server-class and HP ProDesk PCs)
  • VM Configuration: web07 has 8 vCPUs assigned

Conclusion

This ProxLB migration failure highlights the importance of matching VM resource requirements to the available resources of the target host. By carefully examining the logs, we can identify the cause and take steps to prevent similar issues in the future. Always make sure your Proxmox hosts have enough resources to handle your VMs and adjust your configuration to suit your needs. I hope this helps you troubleshoot similar issues in your own clusters, guys!