The issue happens randomly when hosts in a cluster gets distributed across multiple MS. Host can get split in following scenarios:
a. Add host – MS on which add host is executed takes ownership of the host. So if 2 hosts belonging to same cluster are added from 2 different MS then cluster gets split
b. scanDirectAgentToLoad – This runs every 90 secs. and check if there are any hosts that needs to be reconnected. The current logic of host scan can also lead to a split
The idea is to fix (b) to ensure that hosts in a cluster are managed by same MS. For (a) only the entry in the database is going to be created except in case if the host getting added is first in the cluster (in this case agent creation happens at the same time) and then (b) will take care of connection and agent creation part. Since currently addHost only creates an entry in the db there is a small window where the host state will be shown as 'Alert' till the time (b) is scheduled and picks up the host to make a connection. The MS doing add host will immediately schedule a scan task and also send notification to peers to start the scan task.
Entities correlated to the Identity and carry a uuid and those
correlated to InternalIdentity carry an id. Those entities that carry
both will correlated to Identity and InternalIdentity.
This refactors entities wherever possible to ensure the VO only
implements the first class entity.
Signed-off-by: Prasanna Santhanam <tsp@apache.org>
1) Support HTTP keep-alive in clustering communication channel
2) Increase concurrency level for clustering message delivery
Reviewed-By: Kelven (with unit test)
1) Drop synchronized call semantic for ClusterManagerImpl.broadcast()
2) Have no choice now but to use an unbound thread pool to notify upper layer. This is to prevent thread starvation when we have cross-management server waitings.
Reviewed-By: Kelven(with unit test)
Description:
1) Moved RuntimeCloudException from api/ to utils/.
Added simple constructor to RuntimeCloudException.
Modified all classes that extended RuntimeException
to extend RuntimeCloudException. These classes
are listed below:
ServerApiException
CloudAuthenticationException
CloudExecutionException
AsyncCommandQueued
HypervisorVersionChangedException
RuntimeCloudException
2) Added overloaded constructed to CloudException.
Modified all classes that extend Exception to extend CloudException instead.
These classes are listed below:
ConcurrentOperationException
ConflictingNetworkSettingsException
ConnectionException
DiscoveryException
InsufficientCapacityException
ManagementServerException
ResourceUnavailableException
VirtualMachineMigrationException
AgentControlChannelException
OperationTimedoutException.java
UnsupportedVersionException.java
UsageServerException.java
UnableDeleteHostException.java
AgentAuthnException.java
HttpCallException.java
ActiveFencingException.java
ClusterInvalidSessionException.java
GreTunnelException.java
OvsVlanExhaustedException.java
move all listxxx interface from HostDao to managers(ResourceManager, SecondaryStorageVmManager etc) with decent name using SearchCriteria2
or direct call SearchCriteria2 on demand