Root Cause:
Some global parameters contain NULL values, and the code does not check for
NULL. It therefore fails with an exception, so nothing appears in the field (ERROR).
Solution:
Added the required NULL check.
Per Ilya's reply, host_view may contain duplicate entries when hosts
have tags. Changing host_view may cause unforeseen regressions, so
to fix the issue we've modified the zone/cluster metrics code to use
the `host` table (hostdao) to iterate through the list of hosts in a
cluster during zone/cluster metrics listing.
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
This fixes the following:
- Fixes thread leaks in RemoteHostEndPoint
- Fixes a potential NPE while finding EP for a storage/scope
Unbounded thread growth can be reproduced, with the following findings:
- Every unreachable template produces 6 new threads (in a single
ScheduledExecutorService instance) spaced 10 seconds apart.
- Every reachable template url without the template produces 1 new
thread (and one ScheduledExecutorService instance); it errors out quickly
without causing more thread growth.
- Every valid url produces up to 10 threads, as the same ep (endpoint
instance) is reused to query upload/download (async callback)
progress.
Every RemoteHostEndPoint instance creates its own
ScheduledExecutorService instance, which is why in the jstack dump we
see several threads that share the prefix RemoteHostEndPoint-{1..10}
(given poolsize is defined as 10, it uses suffixes 1-10).
This fixes the discovered thread leakage, with the following notes:
- Instead of a per-instance ScheduledExecutorService, a cached pool is
now used. It has `static` scope so that it is reused by all future
RemoteHostEndPoint instances.
- It was not clear why we would want to wait when Answers have already
been returned from the remote EP, so a scheduled/delayed Runnable is
not required for processing answers. ScheduledExecutorService
was therefore not really required; moved to ExecutorService instead.
- Another benefit of using a cached pool is that it shuts down
threads that have not been used for 60 seconds, and idle threads are
re-used for future runnable submissions.
- Caveat: the executor service is still unbounded. However, it is only
used for short jobs that check upload/download progress, so unbounded
growth is acceptable for this use-case.
- Refactored CmdRunner to not use/reference objects from parent class.
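The pool change described in the notes above can be sketched roughly as
follows. This is a minimal illustration, not the actual patch: the class
name `EndPointSketch`, the method `processAnswerAsync` and the thread
name prefix are hypothetical; only the use of a `static` cached
`ExecutorService` in place of a per-instance ScheduledExecutorService
comes from the notes.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an endpoint class; not the real CloudStack code.
final class EndPointSketch {
    // One cached pool shared by every instance. Executors.newCachedThreadPool
    // reuses idle threads and terminates threads idle for 60 seconds.
    private static final ExecutorService EXECUTOR =
            Executors.newCachedThreadPool(new NamedFactory("EndPointSketch"));

    // Runs the callback as soon as a thread is available; no delay is scheduled.
    void processAnswerAsync(Runnable callback) {
        EXECUTOR.submit(callback);
    }

    // Exposed only so callers can observe that the pool is shared.
    static ExecutorService sharedExecutor() {
        return EXECUTOR;
    }

    // Names threads with a common prefix so they are easy to spot in a jstack dump.
    private static final class NamedFactory implements ThreadFactory {
        private final String prefix;
        private final AtomicInteger counter = new AtomicInteger();

        NamedFactory(String prefix) {
            this.prefix = prefix;
        }

        @Override
        public Thread newThread(Runnable r) {
            Thread t = new Thread(r, prefix + "-" + counter.incrementAndGet());
            t.setDaemon(true); // assumption: do not block JVM shutdown
            return t;
        }
    }
}
```

Because the pool is static, creating many endpoint instances does not
create additional executors; the number of live threads tracks the number
of concurrently running callbacks rather than the number of instances.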
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
- Improves job scheduling using state/event-driven logic
- Reduces database and CPU load by reducing all background threads to one
- Improves Simulator and KVM host-ha integration tests
- Triggers VM HA on successful host (ipmi reboot) recovery
- Improves internal datastructures and checks around HA counter
- New FSM events to retry fencing and recovery
- Fixes KVM activity script to aggressively check against last update time
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
* FR12: Have basic constraint in CA certificate
- Refactors certificate generation to use V3
- Removes use of V1 based certificate generator
- Puts basic constraint and keyusage extensions in certificate generator
when caCert is not provided, i.e. for building CA certificate
- For normal certificate generation, skips the basic constraint and
instead puts the authority key identifier (the ca cert)
- Fixes tests to use the V3 certificate generator
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
* FR12: backup and restore cpvm/ssvm keystore during reboot
This is backported from:
https://github.com/apache/cloudstack/pull/2278
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Per communication with Marcus, and his test results, this requires
clients to provide certificates and stops SSL negotiations on failure
when auth strictness is set to true.
This uses the `setNeedClientAuth`, where if the option is set and the
client chooses not to provide authentication information about itself,
the negotiations will stop and the engine will begin its closure
procedure:
https://docs.oracle.com/javase/7/docs/api/javax/net/ssl/SSLEngine.html#setNeedClientAuth(boolean)
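As a minimal sketch of the call described above (not the actual patch),
a server-side `SSLEngine` could be configured as follows. The class name
`ClientAuthSketch` and the `authStrictness` parameter are hypothetical
names that mirror the setting mentioned above; `createSSLEngine`,
`setUseClientMode` and `setNeedClientAuth` are standard JSSE calls.

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

final class ClientAuthSketch {
    // Creates a server-mode engine. When authStrictness is true the handshake
    // fails if the client does not present a certificate.
    static SSLEngine createServerEngine(SSLContext context, boolean authStrictness) {
        SSLEngine engine = context.createSSLEngine();
        engine.setUseClientMode(false);           // act as the server side
        engine.setNeedClientAuth(authStrictness); // require client certificates if strict
        return engine;
    }
}
```

With `authStrictness` set to false the engine neither requires nor
requests a client certificate in this sketch, which corresponds to
one-way SSL authentication.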
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
This introduces a new certificate authority framework that allows
pluggable CA provider implementations to handle certificate operations
around issuance, revocation and propagation. The framework injects
itself into `NioServer` to handle agent connections securely. The
framework also makes `NioClient` assume that a keystore with the known
name `cloud.jks`, if available, will be used for SSL negotiations and
handshake.
This includes a default 'root' CA provider plugin which creates its own
self-signed root certificate authority on first run and uses it for
issuance and provisioning of certificates to CloudStack agents such as
the KVM, CPVM and SSVM agents and also for the management server for
peer clustering.
Additional changes and notes:
- A comma-separated list of management server IPs can be set in the 'host'
global setting. Newly provisioned agents (KVM/CPVM/SSVM etc.) will get a
randomized comma-separated list and will attempt connection or
reconnection in the provided order. This removes the need for a TCP LB on
port 8250 (default) of the management server(s).
- All fresh deployments will enforce two-way SSL authentication where
connecting agents will be required to present certificates issued
by the 'root' CA plugin.
- Existing environments, on upgrade, will continue to use one-way SSL
authentication and connecting agents will not be required to present
certificates.
- A script `keystore-setup` is responsible for initial keystore setup
and CSR generation on the agent/hosts.
- A script `keystore-cert-import` is responsible for importing the provided
certificate payload into the java keystore file.
- Agent security (keystore, certificates etc.) is set up initially using
SSH, and later provisioning is handled via an existing agent connection
using command-answers. The supported clients and agents are limited to
CPVM, SSVM, and KVM agents, and clustered management server (peering).
- Certificate revocation does not terminate an existing agent-mgmt server
connection; however, a revoked certificate is rejected when used during
an SSL handshake.
- The older `cloudstackmanagement.keystore` is deprecated and will no longer
be used by mgmt server(s) for SSL negotiations and handshake. New
keystores will be named `cloud.jks`; additional SSL certificates
should not be imported into it for use with tomcat etc. The `cloud.jks`
keystore is strictly used for agent-server communications.
- Management server keystores are validated and renewed on startup only;
their validity is the same as that of the CA certificates.
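The randomized management server list mentioned in the notes above could
be produced along these lines. This is only a sketch under assumptions:
the class `HostListSketch` and the method `shuffledHosts` are
hypothetical, and the actual randomization used by the framework may
differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class HostListSketch {
    // Splits the comma-separated 'host' setting, shuffles the entries and
    // joins them again. An agent would then try the hosts in this order.
    static String shuffledHosts(String hostSetting, Random random) {
        List<String> hosts = new ArrayList<>();
        for (String host : hostSetting.split(",")) {
            String trimmed = host.trim();
            if (!trimmed.isEmpty()) {
                hosts.add(trimmed); // ignore blanks such as "a,,b"
            }
        }
        Collections.shuffle(hosts, random);
        return String.join(",", hosts);
    }
}
```

Giving each agent a differently ordered list spreads initial connections
across the management servers, which is what makes a TCP load balancer
unnecessary.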
New APIs:
- listCaProviders: lists all available CA provider plugins
- listCaCertificate: lists the CA certificate(s)
- issueCertificate: issues X509 client certificate with/without a CSR
- provisionCertificate: provisions certificate to a host
- revokeCertificate: revokes a client certificate using its serial
Global settings for the CA framework:
- ca.framework.provider.plugin: The configured CA provider plugin
- ca.framework.cert.keysize: The key size for certificate generation
- ca.framework.cert.signature.algorithm: The certificate signature algorithm
- ca.framework.cert.validity.period: Certificate validity in days
- ca.framework.cert.automatic.renewal: Certificate auto-renewal setting
- ca.framework.background.task.delay: CA background task delay/interval
- ca.framework.cert.expiry.alert.period: Days to check and alert expiring certificates
Global settings for the default 'root' CA provider:
- ca.plugin.root.private.key: (hidden/encrypted) CA private key
- ca.plugin.root.public.key: (hidden/encrypted) CA public key
- ca.plugin.root.ca.certificate: (hidden/encrypted) CA certificate
- ca.plugin.root.issuer.dn: The CA issuer distinguished name
- ca.plugin.root.auth.strictness: Are clients required to present certificates
- ca.plugin.root.allow.expired.cert: Are clients with expired certificates allowed
UI changes:
- Button to download/save the CA certificates.
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
* FR17: 1. Add timeout to the volume stats command
2. When an unknown command is received return a BadCommand from request processor
* FR17: Unit test for checking bad and a good command sent to the agent as json
Consider the CPU and memory overcommit ratios with total cpu/ram values
or thresholds for host metrics. This will fix incorrect notification
(cells turning yellow/red) in the metrics view.
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Consider the CPU and memory overcommit ratios with total cpu/ram values
or thresholds for host and cluster metrics. This will fix incorrect
notification (cells turning yellow/red) in the metrics view.
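A rough sketch of the idea, assuming the overcommit ratio scales the
total capacity before usage is compared with a threshold. The class
`OvercommitSketch`, its method names and the example threshold are
hypothetical and are not taken from the metrics code.

```java
final class OvercommitSketch {
    // Usage as a fraction of the overcommit-adjusted capacity.
    static double usageFraction(double used, double total, double overcommitRatio) {
        return used / (total * overcommitRatio);
    }

    // True when usage crosses the threshold (a fraction such as 0.85).
    static boolean exceedsThreshold(double used, double total,
                                    double overcommitRatio, double threshold) {
        return usageFraction(used, total, overcommitRatio) > threshold;
    }
}
```

For example, with 24 units allocated on a host with 16 units of capacity
and an overcommit ratio of 2.0, the adjusted capacity is 32 and the usage
fraction is 0.75. Ignoring the ratio would give 1.5 and flag the cell
even though the host is within its configured limits.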
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
This removes username and passwords details from the listClusters
response. The details are usually seen in VMware environments only.
With dynamic roles features, the listClusters API may be provided
to a read-only root-admin user role/type which should not be able to get
the credentials.
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
1. Add timeout to the volume stats command
2. When an unknown command is received return a BadCommand from request processor
3. Unit test for checking bad and a good command sent to the agent as json
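The unknown-command handling in item 2 can be illustrated with a small
sketch. Everything here is hypothetical except the name BadCommand: the
class `RequestProcessorSketch`, the `Answer` type and the handler
registry are stand-ins and do not reflect the agent's real classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class RequestProcessorSketch {
    // Hypothetical answer type carrying a kind and a detail message.
    static final class Answer {
        final String kind;
        final String details;

        Answer(String kind, String details) {
            this.kind = kind;
            this.details = details;
        }
    }

    private final Map<String, Function<String, Answer>> handlers = new HashMap<>();

    // Registers a handler for a known command name.
    void register(String commandName, Function<String, Answer> handler) {
        handlers.put(commandName, handler);
    }

    // Returns a BadCommand answer instead of failing when the command is unknown.
    Answer process(String commandName, String payload) {
        Function<String, Answer> handler = handlers.get(commandName);
        if (handler == null) {
            return new Answer("BadCommand", "Unknown command: " + commandName);
        }
        return handler.apply(payload);
    }
}
```

Returning an explicit answer for an unknown command lets the sender
distinguish "not understood" from a timeout or a dropped connection.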
This allows native CloudStack users to change password from the UI.
Overall changes:
- New 'usersource' key returned in the listUsers API
- Removed ldap specific check from the UI, added checks based on usersource
- Native CloudStack users will be allowed to change password from the UI
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
HostStats returns cpu usage as a percentage while memory usage is in bytes.
This fixes a regression in the maximum CPU usage deviation: the code did
not assume the values to be percentages and multiplied the final ratios
by 100, which produced 100x the actual deviation value.
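The unit mix-up can be shown with a tiny sketch. The class
`CpuPercentSketch` and its methods are hypothetical; they only illustrate
what happens when a value that is already a percentage is treated as a
fraction, and they do not reproduce the actual deviation formula.

```java
final class CpuPercentSketch {
    // Correct: the input is already a percentage, so it is returned unchanged.
    static double asPercent(double cpuUsagePercent) {
        return cpuUsagePercent;
    }

    // Buggy: treats the percentage as a fraction and scales it by 100 again.
    static double asPercentBuggy(double cpuUsagePercent) {
        return cpuUsagePercent * 100.0;
    }
}
```

With an input of 45.0 (meaning 45%), the correct helper reports 45.0
while the buggy one reports 4500.0, which is the 100x error described
above.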
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Fixes log4j transformation using replace.properties to translate
@AGENTLOG@ to a valid value during rpm/mvn build.
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
This implements an out-of-band management plugin for nested-cloudstack
environments where the hypervisor host is a VM in a parent CloudStack environment
that is used as a host in the (testing) CloudStack environment. This plugin
allows power operations to translate into start/stop/reboot of the VM (host).
The out-of-band management configuration fields accepted are:
- Address: The API URL of the parent CloudStack environment
- Port: The uuid of the (host) VM in the parent CloudStack environment
- Username: The apikey of the user account who has ownership on the (host) VM
- Password: The secretkey of the user account who has ownership on the (host) VM
Note: changing the password of the oobm interface is not supported by this plugin
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
Host-HA offers investigation, fencing and recovery mechanisms for hosts that
are malfunctioning for any reason. It uses Activity and Health checks to
determine the current host state, based on which it may degrade a host or try
to recover it. On failing to recover it, it may try to fence the host.
The core feature is implemented in a hypervisor agnostic way, with two separate
implementations of the driver/provider for Simulator and KVM hypervisors. The
framework also allows other hypervisor specific providers to be implemented
in future.
The Host-HA provider implementation for KVM hypervisor uses the out-of-band
management sub-system to issue IPMI calls to reset (recover) or poweroff (fence)
a host.
The Host-HA provider implementation for Simulator provides a means of testing
and validating the core framework implementation.
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>