ProxLB VM Migration Error: Max CPU Issue
Hey guys! Let's break down a tricky issue I encountered while using ProxLB for VM migration in my Proxmox cluster. Specifically, ProxLB was trying to move a VM to a host that didn't have enough CPU resources, leading to a migration failure. We'll explore the problem, the configuration, the logs, and how to potentially address it. This is a common issue, and understanding it can save you a lot of headaches when managing your virtual machines.
The Core Problem: Max CPU Mismatch
At its heart, the issue stems from a max CPU mismatch. ProxLB, in its effort to balance resources, attempted to migrate a VM (guest web07.redacted.domain, in this case) to a Proxmox node (server05) that couldn't support the VM's CPU requirements. The target node simply didn't have enough CPU cores to run the VM as configured, so the attempt ended with the online migrate failure error reported by Proxmox. This kind of problem is something many of us face, so let's work through the details to avoid it.
When a VM needs to be migrated, ProxLB examines the available resources and makes a decision based on its configuration. In my case, ProxLB picked server05 as a suitable destination based on the configured balancing method (memory, in assigned mode) and other factors. The underlying Proxmox system, however, immediately hit the node's per-VM vCPU limit and terminated the migration.
Diving into the Configuration
Let's get into the details of the configuration that influenced this behavior. Understanding the setup is crucial to grasping the root cause. This section will go over the relevant parts of the ProxLB configuration file that played a role in this situation.
ProxLB's Perspective
First, we have the proxmox_api section, which defines the Proxmox API connection parameters: the hosts, the user, the token, and other settings like SSL verification and timeouts. The important thing is that the hosts variable specifies the nodes that ProxLB considers for migration; in my case, the list included server06, server05, and server04. The timeout, retries, and wait_time settings are essential for handling potential network or API issues during the migration process.
proxmox_api:
  # hosts: ['server04.redacted.domain', 'server05.redacted.domain', 'server06.redacted.domain', 'server07.redacted.domain', 'server08.redacted.domain']
  hosts: ['server06.redacted.domain', 'server05.redacted.domain', 'server04.redacted.domain']
  user: redacted
  token_id: redacted
  token_secret: redacted
  ssl_verification: False
  timeout: 1000
  retries: 3
  wait_time: 10
Cluster and Balancing Settings
Next, the proxmox_cluster section defines which nodes are actively considered for migration and which are intentionally excluded. Here, maintenance_nodes and ignore_nodes are especially important: ignore_nodes tells ProxLB to avoid using certain nodes altogether. In this configuration, server04 is listed in both maintenance_nodes and ignore_nodes, which is a useful way to keep a node out of the balancing logic entirely.
proxmox_cluster:
  maintenance_nodes: ['server04']
  ignore_nodes: ['server04']
  overprovisioning: False
Within the balancing section, we configure how ProxLB approaches resource balancing. The enable parameter turns balancing on or off. Other key options include live (for live migrations), with_local_disks, and balance_types. The method option dictates which metric to use for balancing, in this case memory, and mode is set to assigned, so ProxLB works from the resources assigned to each guest rather than what is currently in use. Finally, balanciness controls how much imbalance ProxLB tolerates before it moves VMs, and max_job_validation puts a time limit on how long a migration job is tracked.
balancing:
  enable: True
  enforce_affinity: False
  parallel: False
  parallel_jobs: 1
  live: True
  with_local_disks: True
  with_conntrack_state: False
  balance_types: ['vm', 'ct']
  max_job_validation: 1800
  balanciness: 5
  method: memory
  mode: assigned
Service and Logging
The service section controls the daemon behavior, scheduling, and delay settings. The log_level parameter is key for debugging, as it controls the verbosity of the logs. In my setup, the log level was set to DEBUG, which gives detailed information about each step of the process and made it much easier to diagnose the issue.
service:
  daemon: False
  schedule:
    interval: 3
    format: hours
  delay:
    enable: False
    time: 1
    format: hours
  log_level: DEBUG
Examining the Log Files
Now, let's turn to the log files, which are crucial for troubleshooting. The logs reveal exactly what happened during the migration attempt and, importantly, what went wrong. The logs come from both ProxLB and the underlying Proxmox system, giving us a comprehensive picture of the events.
ProxLB Log Snippets
Here are some relevant excerpts from the ProxLB log that show the steps leading up to the failure. Notice how ProxLB recorded the resource usage on each node, selected server05, and then initiated the migration.
2025-10-16 07:21:01,809 - ProxLB - DEBUG - Starting: log_node_metrics.
2025-10-16 07:21:01,809 - ProxLB - DEBUG - Nodes usage memory: server05: 92.78% | server06: 64.97%
2025-10-16 07:21:01,809 - ProxLB - DEBUG - Nodes usage cpu: server05: 2.88% | server06: 22.30%
2025-10-16 07:21:01,809 - ProxLB - DEBUG - Nodes usage disk: server05: 5.58% | server06: 3.84%
2025-10-16 07:21:01,809 - ProxLB - DEBUG - Finished: log_node_metrics.
2025-10-16 07:21:01,809 - ProxLB - DEBUG - Balancing: Parallel balancing is disabled. Running sequentially.
2025-10-16 07:21:01,809 - ProxLB - DEBUG - Starting: chunk_dict.
2025-10-16 07:21:01,810 - ProxLB - DEBUG - Balancing: Balancing for guest web07.redacted.domain of type VM started.
2025-10-16 07:21:01,810 - ProxLB - DEBUG - Starting: exec_rebalancing_vm.
2025-10-16 07:21:01,810 - ProxLB - INFO - Balancing: Starting to migrate VM guest web07.redacted.domain from server06 to server05.
2025-10-16 07:21:01,852 - ProxLB - DEBUG - Finished: exec_rebalancing_vm.
2025-10-16 07:21:01,852 - ProxLB - DEBUG - Starting: get_rebalancing_job_status.
2025-10-16 07:21:11,880 - ProxLB - DEBUG - Balancing: Job ID UPID:server06:003ECE30:05F5BC98:68F080BE:qmigrate:1054:redacted: (guest: web07.redacted.domain) for migration is still running... (Run: 2)
2025-10-16 07:21:11,881 - ProxLB - DEBUG - Starting: get_rebalancing_job_status.
2025-10-16 07:21:11,909 - ProxLB - CRITICAL - Balancing: Job ID UPID:server06:003ECE30:05F5BC98:68F080BE:qmigrate:1054:redacted: (guest: web07.redacted.domain) went into an error! Please check manually.
2025-10-16 07:21:11,909 - ProxLB - DEBUG - Finished: get_rebalancing_job_status.
Proxmox Log Snippets
Now, let's examine the Proxmox logs. This is where the root cause of the failure becomes clear. The Proxmox logs explicitly state the CPU limitation and why the migration was terminated.
2025-10-16 07:21:03 use dedicated network address for sending migration traffic (10.150.0.5)
2025-10-16 07:21:04 starting migration of VM 1054 to node 'vhx05' (10.150.0.5)
2025-10-16 07:21:04 starting VM 1054 on remote node 'server05'
2025-10-16 07:21:05 [vhx05] MAX 4 vcpus allowed per VM on this node
2025-10-16 07:21:05 ERROR: online migrate failure - remote command failed with exit code 255
2025-10-16 07:21:05 aborting phase 2 - cleanup resources
2025-10-16 07:21:05 migrate_cancel
2025-10-16 07:21:06 ERROR: migration finished with problems (duration 00:00:03)
TASK ERROR: migration problems
The crucial line here is [vhx05] MAX 4 vcpus allowed per VM on this node. This tells us that server05 (vhx05 is the server's internal name) only permits a maximum of 4 vCPUs per VM, because the node itself only has 4 CPU cores. The VM being migrated (web07.redacted.domain, VM ID 1054) is configured with 8 vCPUs, which is what triggered the failure. It's a good reminder of how important it is to match the resources a VM requests with those available on the target host.
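If you want to verify this kind of mismatch yourself, the stock Proxmox tooling is enough. Here's a minimal sketch, assuming root shell access on the nodes; the VM ID (1054) and node names come from the logs above.
# On the node currently hosting the VM (server06 here):
# total vCPUs = sockets x cores (plus an optional vcpus override, if set)
qm config 1054 | grep -E '^(sockets|cores|vcpus):'

# On the intended target (server05 / vhx05):
# count the CPUs the node actually exposes
nproc
If the VM's vCPU total is larger than the target's CPU count, an online migration to that node will fail just like it did here.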
Root Cause Analysis and Solution
The primary cause of this failure is straightforward: the target host (server05) did not have the necessary CPU resources to accommodate the migrating VM. Specifically, the VM was configured with more virtual CPUs (8) than server05 allows per guest (4).
Potential Solutions
- Review VM Configuration: The first step should be to check how many vCPUs the VM has assigned. If the VM truly needs more than 4 vCPUs (the limit on server05 in my case), then the best solution is to migrate it to a host that does have sufficient CPU resources; otherwise, adjust the VM's CPU configuration to match the capabilities of the available hardware.
- Modify ProxLB Configuration: If you want ProxLB to avoid such migrations automatically, you can adjust the configuration file to exclude the problematic host (server05) from migration targets. This prevents ProxLB from even attempting to migrate to a host that won't work. A sketch of both of these quick fixes follows this list.
- Upgrade Hardware: A more long-term solution could involve upgrading the hardware on the underpowered nodes, so they can support more powerful VMs. This would give you more flexibility in resource allocation and migration options.
- Affinity Rules: Implement affinity (or anti-affinity) rules to influence where specific guests are allowed to be placed. This gives you much more control over how VMs are distributed across the cluster, though it does require adjusting the guests' settings.
- Monitoring and Alerts: Set up monitoring to watch out for resource contention issues on your nodes, particularly CPU usage. Alerts can proactively notify you of potential migration problems, helping you fix the issues before migrations are attempted.
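To make the first two options concrete, here's a rough sketch. The qm command has to run on the node that currently hosts the VM, the smaller CPU allocation only takes effect the next time the VM starts, and the ProxLB config path is an assumption — adjust it to wherever your installation keeps its configuration.
# Option 1: shrink web07's CPU allocation so it fits the 4-core nodes
qm set 1054 --sockets 1 --cores 4

# Option 2: keep the 8 vCPUs and stop ProxLB from targeting server05.
# In the proxmox_cluster section of the ProxLB config file
# (path assumed, e.g. /etc/proxlb/proxlb.yaml), extend the exclusion list:
#   ignore_nodes: ['server04', 'server05']
Either change on its own is enough to prevent this particular failure; which one makes sense depends on whether web07 actually needs its 8 vCPUs.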
Additional Context: My Setup
Here's some additional context about my setup that might be helpful. I'm running ProxLB on a dedicated VM within the cluster, on a fresh Debian 13 installation, and I installed it from the repository as described in the documentation. I'm still on Proxmox VE 8, with the latest updates installed. The cluster includes a mix of server-class machines and some HP Prodesk PCs that I added for testing. The HP PCs, which only have 4 cores, were the source of the issue: the VM in question, web07.redacted.domain, has 8 vCPUs assigned, exceeding the limit on the HP machines.
- ProxLB Version: 1.1.8
- Installation: Debian repository
- Running as: Dedicated VM on Proxmox cluster
- Cluster Nodes: 5 total (mix of server-class and HP Prodesk PCs)
- VM Configuration: web07 has 8 vCPUs assigned
Conclusion
This ProxLB migration failure highlights the importance of matching VM resource requirements to the available resources of the target host. By carefully examining the logs, we can identify the cause and take steps to prevent similar issues in the future. Always make sure your Proxmox hosts have enough resources to handle your VMs and adjust your configuration to suit your needs. I hope this helps you troubleshoot similar issues in your own clusters, guys!