cloudstack

Commit Graph

Author	SHA1	Message	Date
Abhishek Kumar	c6cc136ce6	changes in pools for agent executors Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>	2024-10-08 14:24:09 +05:30
Abhishek Kumar	c859f8ba80	changes in vm powerstate sync Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>	2024-10-07 16:25:14 +05:30
Abhishek Kumar	a9661f4587	changes in statscollection Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>	2024-10-07 16:23:15 +05:30
Abhishek Kumar	a5d02665b4	changes for host retrieval from db Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>	2024-10-07 16:21:06 +05:30
Abhishek Kumar	adae7c88b8	changes Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>	2024-09-27 13:00:38 +05:30
Abhishek Kumar	fa50740514	wip Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>	2024-09-26 12:10:38 +05:30
Abhishek Kumar	1652086ff9	wip changes for agent reconnection Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>	2024-09-25 18:33:55 +05:30
Abhishek Kumar	0ca8722c38	Merge remote-tracking branch 'apple/scalability-improvements' into scalability-improvements-fixes	2024-09-23 14:47:25 +05:30
Abhishek Kumar	1d0b90f984	Merge remote-tracking branch 'apple/apple-base418' into scalability-improvements	2024-09-23 14:45:21 +05:30
Abhishek Kumar	a78a2508e9	server: refactor MS list retrieval for agent connect During agent join and while changing configs - host and indirect.agent.lb.algorithm, optimize calling DB for zone's host list Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>	2024-09-12 16:00:17 +05:30
Abhishek Kumar	d5a774c736	import fix Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>	2024-09-11 11:44:48 +05:30
Abhishek Kumar	de60fb64e8	fix Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>	2024-09-10 15:35:14 +05:30
Abhishek Kumar	9074c4b6ad	address process vm power state report for transitioning VMs Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>	2024-09-10 15:22:16 +05:30
Abhishek Kumar	3e098b87a9	fix Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>	2024-09-10 15:22:02 +05:30
Abhishek Kumar	61764aba1f	cache and executors refactoring Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>	2024-09-09 19:39:50 +05:30
Abhishek Kumar	8f6c657159	optimize scanStalledVms procedure Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>	2024-09-06 14:27:56 +05:30
Abhishek Kumar	8ee5e6a99a	refactor transitioning vm process report Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>	2024-09-04 18:35:23 +05:30
Abhishek Kumar	1be848da25	server: PingRoutingCommand - enable scanStalledVm Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>	2024-09-03 16:03:11 +05:30
Abhishek Kumar	337add8fb9	server: PingRoutingCommand - apply some optimizations Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>	2024-09-03 16:01:46 +05:30
mprokopchuk	e0d6066935	Bumped pom version to 4.18.1.2 (to add migration SQL script)	2024-08-15 17:55:00 -07:00
Abhishek Kumar	5e98405b38	Merge remote-tracking branch 'apple/apple-base418' into scalability-improvements	2024-07-22 16:12:19 +05:30
Vishesh	c2de75744e	kvm: Add support for cgroupv2 (#8252) (#459) * kvm: Add support for cgroupv2 (#8252) 1. Problem description In Apache CloudStack (ACS), when a VM is deployed in a host with the KVM hypervisor, an XML file is created in the assigned host, which has a property shares that defines the weight of the VM to access the host CPU. The value of this property has no unit, and it is a relative measure to calculate how much CPU a given VM will have in the host. However, this value has a limit, which depends on the version of cgroup utilized by the host's kernel. The problem lies in the range of shares, which varies between the two versions: [2, 262144] for cgroups version 1; and [1, 10000] for cgroups version 2. Currently, ACS calculates the value of shares using Equation 1, presented below, where CPU is the number of cores and speed is the CPU frequency; both specified in the VM's compute offering. Therefore, if a compute offering has, for example, 6 cores at 2 GHz, the shares value will be 12000 and an exception will be thrown by libvirt if the host utilizes cgroup v2. The second version is becoming the default one in current Linux distributions; thus, it is necessary to address this limitation. Equation 1 shares = CPU * speed Fixes: #6744 2. Proposed changes To address the problem described, we propose to apply a scale conversion considering the max shares of the host. Using the same formula currently utilized by ACS, it is possible to calculate the maximum shares of a VM for a given host. In other words, using the number of cores and the nominal speed of the host's CPU as the upper limit of shares allowed to a VM. Then, this value will be scaled to the allowed interval of [1, 10000] of cgroup v2 by using a linear scale conversion.
The VM shares would be calculated as Equation 2, presented below, where VM requested shares is the requested shares value calculated using Equation 1, cgroup upper limit is fixed with a value of 10000 (cgroups v2 upper limit), and host max shares is the maximum shares value of the host, calculated using Equation 1. Using Equation 2, the only case where a VM passes the cgroup v2 limit is when the user requests more resources than the host has, which is not possible with the current implementation of ACS. Equation 2 shares = (VM requested shares * cgroup upper limit)/host max shares To implement the proposal, the following APIs will be updated: deployVirtualMachine, migrateVirtualMachine and scaleVirtualMachine. When a VM is being deployed, a new verification will be added to find a suitable host. The max shares of each host will be calculated, and the VM calculated shares will be verified if it does not surpass the host's value. Likewise, the migration of VMs will have a similar new verification. Lastly, the scale of VMs will also have the same verification for the VM's host. To determine the max shares of a given host, we will use the same equation currently used in ACS for calculating the shares of VMs, presented in Section 1. When Equation 1 is used to determine the maximum shares of a host, CPU is the number of cores of the host, and speed is the nominal CPU speed, i.e., considering the CPU's base frequency. It is important to note that these changes are only for hosts with the KVM hypervisor using cgroup v2 for now. * Update overcommit ratio during live VM migration * minor refactoring --------- Co-authored-by: Bryan Lima <42067040+BryanMLima@users.noreply.github.com>	2024-06-27 12:22:17 +05:30
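The scale conversion described in commit c2de75744e can be sketched as follows. This is a minimal illustration of Equations 1 and 2 only, not the CloudStack implementation: the class and method names are hypothetical, speed is assumed to be in MHz (so that 6 cores at 2 GHz gives 12000, as in the commit message), and integer division with clamping into [1, 10000] is an assumption.

```java
/** Hypothetical sketch of the cgroup v2 shares scaling described in commit c2de75744e. */
public final class CgroupV2SharesSketch {

    // Upper limit of the shares range for cgroup v2, as stated in the commit message.
    public static final long CGROUP_V2_UPPER_LIMIT = 10000L;

    // Equation 1: shares = CPU * speed (number of cores times frequency, assumed in MHz).
    public static long requestedShares(int cores, int speedMhz) {
        return (long) cores * speedMhz;
    }

    // Equation 2: shares = (VM requested shares * cgroup upper limit) / host max shares.
    public static long scaledShares(long vmRequestedShares, long hostMaxShares) {
        if (hostMaxShares <= 0) {
            throw new IllegalArgumentException("host max shares must be positive");
        }
        long scaled = vmRequestedShares * CGROUP_V2_UPPER_LIMIT / hostMaxShares;
        // Assumption: clamp into the cgroup v2 interval [1, 10000].
        return Math.max(1L, Math.min(CGROUP_V2_UPPER_LIMIT, scaled));
    }

    public static void main(String[] args) {
        long vm = requestedShares(6, 2000);    // 12000, above the cgroup v2 limit
        long host = requestedShares(32, 2000); // 64000 for an assumed 32-core, 2 GHz host
        System.out.println(scaledShares(vm, host)); // 12000 * 10000 / 64000 = 1875
    }
}
```

With these assumed host figures, the 12000 shares that libvirt would reject under cgroup v2 scale to 1875, which is inside the allowed interval.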
Vishesh	7ed43e3e43	Let network guru decide if ipv6 cidr size can't be equal to 64 (#462)	2024-06-27 12:20:49 +05:30
Abhishek Kumar	8f88103a29	FR72 - api,server: purge expunged resources (#405 ) This PR introduces the functionality of purging removed DB entries for CloudStack entities (currently only for VirtualMachine). There would be three mechanisms for purging removed resources: - Background task - CloudStack will run a background task which runs at a defined interval. Other parameters for this task can be controlled with new global settings. - API - New API `purgeExpungedResources`. It will allow passing the following parameters - resourcetype, batchsize, startdate, enddate - Config for service offering. Service offerings can be created with purgeresources parameter which would allow purging resources immediately on expunge. Following new global settings have been added: - `expunged.resources.purge.enabled`: Default: false. Whether to run a background task to purge the DB records of the expunged resources. - `expunged.resources.purge.resources`: Default: (empty). A comma-separated list of resource types that will be considered by the background task to purge the DB records of the expunged resources. Currently only VirtualMachine is supported. An empty value will result in considering all resource types for purging. - `expunged.resources.purge.interval`: Default: 86400. Interval (in seconds) for the background task to purge the DB records of the expunged resources. - `expunged.resources.purge.delay`: Default: 300. Initial delay (in seconds) to start the background task to purge the DB records of the expunged resources task. - `expunged.resources.purge.batch.size`: Default: 50. Batch size to be used during purging of the DB records of the expunged resources. - `expunged.resources.purge.start.time`: Default: (empty). Start time to be used by the background task to purge the DB records of the expunged resources. Use format `yyyy-MM-dd` or `yyyy-MM-dd HH:mm:ss`. - `expunged.resources.purge.keep.past.days`: Default: 30. 
The number of days in the past from the execution time of the background task to purge the DB records of the expunged resources for which the expunged resources must not be purged. To enable purging DB records of the expunged resource till the execution of the background task, set the value to zero. - `expunged.resource.purge.job.delay`: Default: 180. Delay (in seconds) to execute the purging of the DB records of an expunged resource initiated by the configuration in the offering. Minimum value should be 180 seconds and if a lower value is set then the minimum value will be used. Upstream PRs: https://github.com/apache/cloudstack/pull/8999 https://github.com/apache/cloudstack-documentation/pull/397 Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com> Co-authored-by: Suresh Kumar Anaparti <suresh.anaparti@shapeblue.com>	2024-06-19 12:59:50 +05:30
Rohit Yadav	7a7f1e2b6e	FIXME/TODO: CPU and DB hotspot found Found these CPU and DB hotspots that handle agent ping commands; they add idle load when there is a high number of hosts. By design, there isn't any quick win here. However, the power sync report/handling could be improved, so it doesn't need to kick in for every ping command received. A few more areas are marked in the codebase. Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>	2024-05-22 20:22:39 +05:30
Rohit Yadav	5484d3c7e6	orchestration: optimise vm list fetching excluding that reported This optimises the sql query and iterator to simply return the VMs list excluding those in the received report. Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>	2024-05-22 20:22:38 +05:30
Rohit Yadav	de82aa8e91	engine/orchestration: wrap db txn in try-with, only fetch id Optimises DB query that seems to run against every Ping command, where whole columns are fetched but only `id` column is used. Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>	2024-05-22 20:22:38 +05:30
Vishesh	c3eba5e213	Fix exceeding of resource limits with powerflex (#443 ) * Fix exceeding of resource limits with powerflex * Fix for volume prepare during VM start * resolve comments * Add e2e tests * Fixup * Update e2e tests * minor refactoring * refactoring * fixup --------- Co-authored-by: Suresh Kumar Anaparti <suresh.anaparti@shapeblue.com>	2024-05-08 20:54:54 +05:30
Vishesh	2f4cea6dca	Fix message publish in transaction (#438) * Fix message publish in transaction * Resolve comments	2024-05-07 13:27:19 +05:30
Vishesh	c21b6d8b52	Update volume's passphrase to null if diskOffering doesn't support encryption (#428)	2024-04-23 09:46:20 -06:00
Vishesh	1b7f33d0e1	This PR fixes the build issue on apple-base418 (#429)	2024-04-12 15:43:23 +02:00
Vishesh	0501678478	Allow overriding root diskoffering id & size, and expunge old root disk while restoring VM (#401 ) * Allow overriding root diskoffering id & size while restoring VM * UI changes * Allow expunging of old disk while restoring a VM * Apply suggestions from code review Co-authored-by: Suresh Kumar Anaparti <suresh.anaparti@shapeblue.com> * resolve comments * Fixup * Rename some variables * Resolve comments * Address comments * Duplicate volume's details while duplicating volume * Allow setting IOPS for the new volume * minor cleanup * fixup * Add checks for template size * Replace strings for IOPS with constants * Fix saveVolumeDetails method * Fixup * Fixup UI styling --------- Co-authored-by: Suresh Kumar Anaparti <suresh.anaparti@shapeblue.com>	2024-04-12 17:47:16 +05:30
Vishesh	8d0915c4c9	Change iops on offering change (#416) * Change IOPS on disk offering change * Remove iops & bandwidth limits before copying template * minor refactor * Handle diskOfferingDetails * Fixup	2024-04-11 16:59:57 +05:30
Marcus Sorensen	f896586925	Update version to 4.18.1.1 (#417) * Update version to 4.18.1.1 * Update changelog * Update changelog * Update changelog --------- Co-authored-by: Marcus Sorensen <mls@apple.com>	2024-04-08 09:27:57 -06:00
Vishesh	5137c196c2	HypervisorType as a class (#393) * HypervisorType as a class * Fixup * fixup * Add missing annotation * Resolve comments * Handle parallels typo	2024-04-02 17:35:16 +05:30
Marcus Sorensen	bf4ea0d59f	Storage drivers to decide if they need data motion for zone-wide use (#392 ) * Storage drivers to decide if they need data motion for zone-wide use * Apply fixes in resolving PrimaryDataStore * add tests Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com> * fix imports Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com> --------- Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com> Co-authored-by: Marcus Sorensen <mls@apple.com> Co-authored-by: Abhishek Kumar <abhishek.mrt22@gmail.com>	2024-03-14 10:53:24 +05:30
Vishesh	ba3284bdc5	Fix resource count discrepancies (#376 ) * Fix resource count discrepancies * Fixup while removing vm * Fix discrepancies when starting VMs * Fixup tests * Fixups * Don't take lock when amount is negative	2024-03-13 18:22:34 +05:30
Abhishek Kumar	1510b44f03	backport: add more unit tests and fix related to #327 (#378 ) Adds: - Fix for volume limit checks for disk offerings with multiple tags - When a VM is deployed with multiple disks having offerings with multiple tags the resource limit check may falter as currently it tries to check based on individual diskoffering. With this, change if offering d1 and d2 for volumes v1 and v2 both have tag1, server will check volume limits for tag tag1 using the combined size of v1 and v2. - Fix for template tag hosts in random host allocator - May affect use of template tag, service offering tags and random host allocator together. The current code for the random host allocator falters while trying to find the host allocation. This was found and fixed during the addition of the unit test here, https://github.com/shapeblue/cloudstack-apple/pull/378/files#diff-bbf9baea014e5cc1dfe9e7d13467c9857208cfe65e93883721d88a6f0452f912 - Unit tests for changes in api,server,ui: tagged resource limits #327	2024-03-01 17:22:14 +05:30
Harikrishna	747d1101c1	New API "checkVolume" to check and repair any leaks or repair all issues (#362 ) * Introduced a new API "checkVolumeAndRepair" that allows users or admins to check and repair if any leaks observed. Currently this is supported only for KVM * some fixes * Added unit tests * addressed review comments * add repair volume while granting access * Changed repair parameter to accept both leaks/all values * Introduced new global setting volume.check.and.repair.before.use to do volume check and repair before VM start or volume attach operations * Added volume check and repair changes only during VM start and volume attach operations * Refactored the names to look similar across the code * Some code fixes * remove unused code * Renamed repair values * Addressed review comments * code refactored * used volume name in logs * Changed the API to Async and the setting scope to storage pool * Fixed exit value handling with check volume command * Fixed storage scope to the setting * Fixed volume format issues * Refactored the log messages * Fix formatting	2024-02-29 14:40:40 +05:30
Vishesh	f30e07b312	Fix host stuck in connecting state (#375 ) * Fix host stuck in connecting state (#8502) There are a lot of test failures due to test_vm_life_cycle.py in multiple PRs due to host not available for migration of VMs. #8438 (comment) #8433 (comment) #7344 (comment) While debugging I noticed that the hosts get stuck in Connecting state because MS is waiting for a response of the ReadyCommand from the agent. Since we take a lock on connection and disconnection, restarting the agent doesn't work. To fix this, we have to restart the MS or wait for ~1 hour (default timeout). On the agent side, it gets stuck waiting for a response from the Script execution. To reproduce, run smoke/test_vm_life_cycle.py (TestSecuredVmMigration test class to be specific). Once the tests are complete, you will notice that some hosts are stuck in Connecting state. And restarting the agent fails due to the named lock. Locks on DB can be checked using the below query. SELECT * FROM performance_schema.metadata_locks INNER JOIN performance_schema.threads ON THREAD_ID = OWNER_THREAD_ID WHERE PROCESSLIST_ID <> CONNECTION_ID() \G; This PR adds a wait for the ready command and a timeout to the Script execution to ensure that the thread doesn't get stuck and the named lock from database is released. * Externalise a few timeouts & fix timeout for hostSupportsUefi in libvirt ready command wrapper (#8547) This PR fixes bug introduced in #8502. Timeout for script execution was set to 60 ms instead of 60s which resulted in host not getting UEFI enabled. This is a blocker for 4.19 release. We do this by introducing a new agent parameter `agent.script.timeout` (default - 60 seconds) to use as a timeout for the script checking host's UEFI status. We also externalize the timeout for the ReadyCommand by introducing a new global setting `ready.command.wait` (default - 60 seconds). For ModifyStoragePoolCommand, we don't externalize the timeout to avoid confusion for the user. 
This is because the required timeout can vary depending on the provider in use, and we are only setting the wait for the default host listener for now. Instead, we reuse the global `wait` setting by dividing it by `5`, making the default value 6 minutes (1800/5 = 360s) for ModifyStoragePoolCommand. Note: the actual time the MS waits is twice the wait set for a Command. Check reference code below. `19250403e6/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentAttache.java (L406-L442)` * fixup	2024-02-21 13:44:53 +05:30
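The wait arithmetic in commit f30e07b312 can be sketched as below. This is only an illustration of the numbers quoted in the message (the global `wait` divided by 5, and the MS waiting twice the wait set on a Command); the class and method names are hypothetical and do not come from the CloudStack codebase.

```java
/** Hypothetical sketch of the wait arithmetic described in commit f30e07b312. */
public final class CommandWaitSketch {

    // ModifyStoragePoolCommand reuses the global `wait` setting divided by 5.
    public static int modifyStoragePoolWaitSeconds(int globalWaitSeconds) {
        return globalWaitSeconds / 5;
    }

    // Per the commit note, the MS actually waits twice the wait set for a Command.
    public static int effectiveWaitSeconds(int commandWaitSeconds) {
        return 2 * commandWaitSeconds;
    }

    public static void main(String[] args) {
        int configured = modifyStoragePoolWaitSeconds(1800); // 1800 / 5 = 360
        int effective = effectiveWaitSeconds(configured);    // 2 * 360 = 720
        System.out.println(configured + "s configured, " + effective + "s effective");
    }
}
```

By the same rule, the default `ready.command.wait` of 60 seconds would correspond to an effective wait of about 120 seconds on the management server.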
Suresh Kumar Anaparti	89f93746ac	Storage plugin support to check if volume on datastore requires access for migration (#380 ) * Check if volume on datastore requires access for migration, and grant/revoke volume access if requires * Updated default implementation for requiresAccessForMigration method in PrimaryDataStoreDriver	2024-02-20 11:32:32 -07:00
Vishesh	8b01c0aa62	Update VM's state if powerstate & state are not in sync (#368) * Update VM's state if powerstate & state are not in sync * Add unit tests * some code improvements for instance / power state check * Update power state after vm stop confirmation (as power state 'PowerOn' is kept after vm stop and not updated later) * Reset the power state update counter before migrate, to allow power state sync to proper state / host * Do not consider transitional states (Starting, Stopping) to check power state sync * set powerstate to off for all vm types --------- Co-authored-by: Suresh Kumar Anaparti <suresh.anaparti@shapeblue.com>	2024-02-20 14:44:04 +05:30
Abhishek Kumar	6a9cdedda4	api,server,ui: tagged resource limits (#327) Introduces the concept of tagged resource limits. Limits can be enforced on accounts and domains for the deployment of entities for a tagged resource. Current tagged resource limits can be used for the following resource types, Host limits user_vm cpu memory Storage limits volume primary_storage Following global settings can be used to specify tags for which limit needs to be enforced, Host: resource.limit.host.tags Storage: resource.limit.storage.tags Option for specifying tagged resource limits and viewing tagged resource usage are made available in the UI. Enhances use of templatetag for VM deployment and template creation Adds option to list disk offering with suitability flag for a virtualmachine. A new parameter named virtualmachineid has been added to the listDiskOfferings API which when passed returns suitableforvirtualmachine param in the response.	2024-02-07 17:35:15 +05:30
Suresh Kumar Anaparti	b44710c8a9	Pass StoragePoolType object for poolType dao attribute - fixes conversion to DB column (#371)	2024-02-02 14:10:02 +05:30
Suresh Kumar Anaparti	7fef155621	Remove sensitive params (VmPassword, etc) from VMWork log (#369) * Remove sensitive params (VmPassword, etc) from VMWork log * Added unit tests * review comments	2024-01-24 17:49:20 +05:30
kishankavala	99939d22a7	CleanUp Async Jobs after mgmt server maintenance (#356) * Cleanup Volume AsyncJob after mgmt server stop * Clean Up Vm async job resources during mgmt server stop * Use State.isTransitional method to identify transition states * Add cleanup for Network Async Job * Add license * Added RevertSnapshotting to volume transition state. Fixed spacing code style * Added transitional flag in Volume state * Updated network event for failed job, (re)added cleanup for volumes created from snapshots, and some code improvements * Added java doc for volume state constructor * Fixed cleanup SNAPSHOT_ID entry in volume details for failed volumes created from snapshots --------- Co-authored-by: Suresh Kumar Anaparti <suresh.anaparti@shapeblue.com>	2024-01-09 17:54:26 +05:30
Vishesh	b9c3752ce0	Fix: Select another pod if all hosts in the pod becomes unavailable (#339)	2023-11-07 15:11:21 +01:00
Vishesh	a7c7a33131	Apple base418 agent lock during reconnect (#340) Co-authored-by: Marcus Sorensen <mls@apple.com>	2023-11-03 16:56:15 +01:00
Pearl Dsilva	73c86b8a30	VR live patching: Allow live patch of VPC VRs even if networks are in allocated / shutdown state (#7958 ) (cherry picked from commit `951ba04cf0`) Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>	2023-10-09 19:15:32 +05:30
Marcus Sorensen	c01ed90569	Trigger out of band VM state update via libvirt event when VM stops (#7963 ) * Trigger out of band VM state update via libvirt event when VM stops * Add License headers, refactor nested try --------- Co-authored-by: Marcus Sorensen <mls@apple.com> (cherry picked from commit `3694667f50`) Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>	2023-09-28 12:22:15 +05:30

1065 Commits