This introduces a new certificate authority framework that allows
pluggable CA provider implementations to handle certificate operations
around issuance, revocation and propagation. The framework injects
itself to `NioServer` to handle agent connections securely. The
framework adds assumptions in `NioClient` that a keystore if available
with known name `cloud.jks` will be used for SSL negotiations and
handshake.
This includes a default 'root' CA provider plugin which creates its own
self-signed root certificate authority on first run and uses it for
issuance and provisioning of certificate to CloudStack agents such as
the KVM, CPVM and SSVM agents and also for the management server for
peer clustering.
Additional changes and notes:
- Comma separate list of management server IPs can be set to the 'host'
global setting. Newly provisioned agents (KVM/CPVM/SSVM etc) will get
radomized comma separated list to which they will attempt connection
or reconnection in provided order. This removes need of a TCP LB on
port 8250 (default) of the management server(s).
- All fresh deployment will enforce two-way SSL authentication where
connecting agents will be required to present certificates issued
by the 'root' CA plugin.
- Existing environment on upgrade will continue to use one-way SSL
authentication and connecting agents will not be required to present
certificates.
- A script `keystore-setup` is responsible for initial keystore setup
and CSR generation on the agent/hosts.
- A script `keystore-cert-import` is responsible for import provided
certificate payload to the java keystore file.
- Agent security (keystore, certificates etc) are setup initially using
SSH, and later provisioning is handled via an existing agent connection
using command-answers. The supported clients and agents are limited to
CPVM, SSVM, and KVM agents, and clustered management server (peering).
- Certificate revocation does not revoke an existing agent-mgmt server
connection, however rejects a revoked certificate used during SSL
handshake.
- Older `cloudstackmanagement.keystore` is deprecated and will no longer
be used by mgmt server(s) for SSL negotiations and handshake. New
keystores will be named `cloud.jks`, any additional SSL certificates
should not be imported in it for use with tomcat etc. The `cloud.jks`
keystore is stricly used for agent-server communications.
- Management server keystore are validated and renewed on start up only,
the validity of them are same as the CA certificates.
New APIs:
- listCaProviders: lists all available CA provider plugins
- listCaCertificate: lists the CA certificate(s)
- issueCertificate: issues X509 client certificate with/without a CSR
- provisionCertificate: provisions certificate to a host
- revokeCertificate: revokes a client certificate using its serial
Global settings for the CA framework:
- ca.framework.provider.plugin: The configured CA provider plugin
- ca.framework.cert.keysize: The key size for certificate generation
- ca.framework.cert.signature.algorithm: The certificate signature algorithm
- ca.framework.cert.validity.period: Certificate validity in days
- ca.framework.cert.automatic.renewal: Certificate auto-renewal setting
- ca.framework.background.task.delay: CA background task delay/interval
- ca.framework.cert.expiry.alert.period: Days to check and alert expiring certificates
Global settings for the default 'root' CA provider:
- ca.plugin.root.private.key: (hidden/encrypted) CA private key
- ca.plugin.root.public.key: (hidden/encrypted) CA public key
- ca.plugin.root.ca.certificate: (hidden/encrypted) CA certificate
- ca.plugin.root.issuer.dn: The CA issue distinguished name
- ca.plugin.root.auth.strictness: Are clients required to present certificates
- ca.plugin.root.allow.expired.cert: Are clients with expired certificates allowed
UI changes:
- Button to download/save the CA certificates.
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Host-HA offers investigation, fencing and recovery mechanisms for host that for
any reason are malfunctioning. It uses Activity and Health checks to determine
current host state based on which it may degrade a host or try to recover it. On
failing to recover it, it may try to fence the host.
The core feature is implemented in a hypervisor agnostic way, with two separate
implementations of the driver/provider for Simulator and KVM hypervisors. The
framework also allows for implementation of other hypervisor specific provider
implementation in future.
The Host-HA provider implementation for KVM hypervisor uses the out-of-band
management sub-system to issue IPMI calls to reset (recover) or poweroff (fence)
a host.
The Host-HA provider implementation for Simulator provides a means of testing
and validating the core framework implementation.
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
KVM hosts on shared storage failure was accepted by mgmt server with the
host state as Up, even though there was no primary/shared storage available on
it. This patch offers a quick fix by throwing an exception in the storage monitor
which connects storage pool on host. The failure is trapped by agent manager
that disconnects the agent without any investigation.
Based on Lab tests, KVM agent may take upto 2 minutes to attempt NFS mount when
the storage is inaccessible (firewalled, or shutdown) before returning back with
an error. It is safe to assume that this won't add pressure on mgmt server due to
several reconnection attempts, and KVM agent would retry reconnection every 2
minutes.
For such KVM hosts, where failure happens due to storage issues; they will be
briefly put in Alert state but will be mostly be in Connecting state during which
the KVM host attempts to mount/reconfigure NFS storage pool.
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Support access to a host’s out-of-band management interface (e.g. IPMI, iLO,
DRAC, etc.) to manage host power operations (on/off etc.) and querying current
power state in CloudStack.
Given the wide range of out-of-band management interfaces such as iLO and iDRA,
the service implementation allows for development of separate drivers as plugins.
This feature comes with a ipmitool based driver that uses the
ipmitool (http://linux.die.net/man/1/ipmitool) to communicate with any
out-of-band management interface that support IPMI 2.0.
This feature allows following common use-cases:
- Restarting stalled/failed hosts
- Powering off under-utilised hosts
- Powering on hosts for provisioning or to increase capacity
- Allowing system administrators to see the current power state of the host
For testing this feature `ipmisim` can be used:
https://pypi.python.org/pypi/ipmisim
FS:
https://cwiki.apache.org/confluence/display/CLOUDSTACK/Out-of-band+Management+for+CloudStack
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
- Uses non-blocking SSL handshake and non-blocking connections
- Uses 60s as timeout for both client/server to guard against indefinitely
blocking clients
- Unit test to prove fix, client and malicious clients trying to connect to server
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
There 2 things which has been changed.
* We look on power_state_update_time instead of update_time. Didn't make sense to me at all to look at update_time.
* Due DB update optimisation, powerState will only be updated if < MAX_CONSECUTIVE_SAME_STATE_UPDATE_COUNT. That is why we can not rely on these information unless we make sure these are up to date.
During ping task, while scanning and updating status of all VMs on the host that are stuck in a transitional state
and are missing from the power report, do so only for VMs that are not removed.
(cherry picked from commit de7173a0ed)
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Fixed on 4.4 and master but not on 4.5, cherry-picked on 4.5 using commit
fbafc957dc
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Conflicts:
engine/orchestration/src/com/cloud/agent/manager/DirectAgentAttache.java
pool, the source and destination pools cannot be local and cluster/zone and vice versa.
Cloudstack detects it and throws a exception. However, the end user only sees an
unexpected exception and not the reason for failure. Fixed it by making sure the
reason for the failure is correctly captured and shown to the end user.
(cherry picked from commit cffae8eef0)
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Conflicts:
server/src/com/cloud/storage/VolumeApiServiceImpl.java
When migration fails instead of returning NULL, throw the exception.
(cherry picked from commit a5a65c7b55)
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
If VM has been cold migrated across different VMware DCs, then unregister the VM from source host.
(cherry picked from commit 15b348632d)
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Before registering a VM check if a different CS VM with same name exists in vCenter.
(cherry picked from commit 33179cce56)
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Separate global config to enable/disable Storage Migration during normal deployment
Introduced a configuration parameter named enable.storage.migration
(cherry picked from commit c55bc0b2d1)
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
During vmsync if StopCommand (issued as part of PowerOff/PowerMissing report) fails to stop VM (since VM is running on HV),
don't transition VM state to "Stopped" in CS db. Also added a check to throw ConcurrentOperationException if vm state is not
"Running" after start operation.
Changes:
- When there is HA we try to redeploy the affected vm using regular planners and if that fails we retry using the special planner for HA (which skips checking disable threshold)
Now because of job framework the InsufficientCapacittyException gets masked and the special planners are not called. Job framework needs to be fixed to rethrow the correct exception.
- Also the VM Work Job framework is not setting the DeploymentPlanner to the VmWorkJob. So the HA Planner being passed by HAMgr was not getting used.
- Now the job framework sets the planner passed in by any caller of the VM Start operation, to the job
On default iptables rules are updated to add ACCEPT egress traffic.
If the network egress default policy is false, CS remove ACCEPT and adds the DROP rule which
is egress default rule when there are no other egress rules.
If the CS network egress default policy is true, CS won't configure any default rule for egress because
router already came up to accept egress traffic. If there are already egress rules for network then the
egress rules get applied on VR.
For isolated network with out firewall service, VR default allows egress traffic (guestnetwork --> public network)