The issue occurs when more than one thread processes connect for the same host simultaneously. VM full sync is not designed for this scenario, and as a result user VMs may get stopped incorrectly.
The direct agent scan task runs at regular intervals (direct.agent.scan.interval, defaulting to 90 seconds) and identifies hosts that need to be processed for connect. Normally hosts get connected within that interval and there is no issue. But if for some reason the connect process takes longer and has not completed by the time the next agent scan runs, the same hosts may get picked up again based on the DB state, and more than one thread ends up processing connect for the same host.
The fix is to check whether a thread is already processing connect for a given host; if so, all subsequent threads for that host simply bail out. There is also the scenario where one thread has already completed connect processing but another thread was scheduled before it finished and would repeat the same work; appropriate checks prevent this as well.
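A minimal sketch of the bail-out guard, assuming a concurrent set keyed by host id; ConnectGuardSketch, hostsBeingConnected and connectHost() are illustrative names, not the actual CloudStack identifiers:

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    class ConnectGuardSketch {
        // hosts currently being processed for connect
        private final Set<Long> hostsBeingConnected = ConcurrentHashMap.newKeySet();

        void processConnect(long hostId) {
            // add() returns false if another thread already registered this
            // host, so threads scheduled by a later agent scan simply bail out
            if (!hostsBeingConnected.add(hostId)) {
                return;
            }
            try {
                connectHost(hostId); // connect + VM full sync happens here
            } finally {
                hostsBeingConnected.remove(hostId);
            }
        }

        private void connectHost(long hostId) {
            // actual connect processing omitted
        }
    }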
With commit d79f1f6fdc the AgentMonitor
was replaced with a pluggable service. However, the ping timeout from the
original constructor was no longer passed on, leading to a default
pingTimeout of 0. This caused all agents to fail constantly.
Modified the startMonitor command to take a pingTimeout as an argument
and instructed AgentManagerImpl to pass it along.
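A hedged sketch of that change; the class names below are stand-ins and the
real CloudStack signatures may differ:

    class AgentMonitorSketch {
        private long pingTimeout; // seconds; defaulted to 0 after the refactor

        // startMonitor now receives the timeout instead of relying on a default
        void startMonitor(long pingTimeout) {
            this.pingTimeout = pingTimeout;
            // schedule the periodic ping check using this.pingTimeout ...
        }
    }

    class AgentManagerImplSketch {
        private final AgentMonitorSketch monitor = new AgentMonitorSketch();
        private long configuredPingTimeout = 150; // read from config in reality

        void start() {
            monitor.startMonitor(configuredPingTimeout); // pass it along
        }
    }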
* send StartupAnswer right after StartupCommand is received
* if post-processing goes wrong, send a ReadyCommand with the error message to the agent; the agent will then exit (sketched below)
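A minimal, illustrative sketch of this handshake. StartupCommand,
StartupAnswer and ReadyCommand stand in for the CloudStack command types;
the send() transport and the setDetails() error field are assumptions:

    class StartupCommand { long hostId; }
    class StartupAnswer { StartupAnswer(StartupCommand cmd) { } }
    class ReadyCommand {
        private String details; // error details, if any
        ReadyCommand(long hostId) { }
        void setDetails(String d) { details = d; }
    }

    class HandshakeSketch {
        void onStartup(StartupCommand cmd) {
            send(new StartupAnswer(cmd)); // ack right away, before post-processing

            ReadyCommand ready = new ReadyCommand(cmd.hostId);
            try {
                postProcess(cmd); // host-connect post-processing
            } catch (Exception e) {
                ready.setDetails(e.getMessage()); // agent exits on seeing the error
            }
            send(ready);
        }

        void postProcess(StartupCommand cmd) throws Exception { }
        void send(Object msg) { /* transport omitted */ }
    }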
73be77a4c1
I've renamed discover to discoverer to fix the issue. My ant debug fails
with:
[java] ERROR [utils.component.ComponentLocator] (main:) Unable to load configuration for management-server from components.xml
[java] com.cloud.utils.exception.CloudRuntimeException: Unable to find class: com.cloud.hypervisor.kvm.discoverer.KvmServerDiscoverer
RB: https://reviews.apache.org/r/6239/
Sent-by: rohit.yadav@citrix.com
Changes:
- In the case of external service providers, there is no discoverer that can load the resource.
- So we have to rely on agentMgr to load the resource, as before.
Changes:
- We do not need these global settings anymore; they will be hidden since 3.0.
- The default traffic label is picked from the global setting, which is null by default. A null traffic label means the resource uses the tag on the default gateway.
- Invoke the discoverer to reload the resource object on host connection.
- Since a zone can have many physical networks, there can be multiple guest and public networks. Only the zone-wide storage and management traffic labels will be stored in host_details henceforth.
- If traffic labels are updated, the discoverer should update host_details (see the sketch after this list).
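A rough sketch of that host_details update on (re)connect; the detail keys
and names below are illustrative assumptions, not the actual schema:

    import java.util.HashMap;
    import java.util.Map;

    class TrafficLabelSketch {
        // stand-in for the host_details table: hostId -> detail name -> value
        private final Map<Long, Map<String, String>> hostDetails = new HashMap<>();

        // called by the discoverer when a host connects or its labels change
        void persistTrafficLabels(long hostId, String storageLabel, String mgmtLabel) {
            Map<String, String> details =
                    hostDetails.computeIfAbsent(hostId, k -> new HashMap<>());
            // only the zone-wide storage and management labels are stored;
            // guest/public labels vary per physical network and live elsewhere
            details.put("storage.network.device", storageLabel);
            details.put("management.network.device", mgmtLabel);
        }
    }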
2. add an ha parameter to host disconnect to indicate whether to HA the VMs on this host
status 12844, 13394: resolved fixed
Reviewed-by: edison
Conflicts:
server/src/com/cloud/agent/manager/AgentManagerImpl.java
server/src/com/cloud/agent/manager/ClusteredAgentManagerImpl.java
We use 'update count' to make sure an agent status transition is atomic.
However, atomic means success or failure, which is not the right model for
agent status: some important transitions occasionally fail because of a race
condition where someone else is changing the status simultaneously, which can
finally leave the agent stuck in a wrong status.
Use a reentrant lock to serialize agent status transitions. This in-memory
lock works in a clustered environment as well because, in our design, an
agent is only active on one management server.
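A minimal sketch of the serialization, assuming one lock per agent id; the
names below are illustrative:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.locks.ReentrantLock;

    class AgentStatusLockSketch {
        // one lock per agent so unrelated agents do not contend
        private final ConcurrentHashMap<Long, ReentrantLock> locks =
                new ConcurrentHashMap<>();

        boolean transitState(long agentId, String event) {
            ReentrantLock lock = locks.computeIfAbsent(agentId, k -> new ReentrantLock());
            lock.lock();
            try {
                // read the current status, validate the transition, write the
                // new status; no other thread on this management server can
                // interleave, and an agent is only active on one server
                return updateStatusInDb(agentId, event);
            } finally {
                lock.unlock();
            }
        }

        private boolean updateStatusInDb(long agentId, String event) {
            return true; // DB update omitted
        }
    }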
status 13269: resolved fixed