Adds @Mock injection for BackupDetailsDao so NASBackupProvider's
backupDetailsDao field is wired during testDeleteBackup and
takeBackupSuccessfully, fixing the NPE flagged by @harikrishna-patnala.
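A minimal sketch of the wiring, assuming the test class uses Mockito's runner
and @InjectMocks on the provider (the class name and layout shown are the
standard Mockito pattern, not a copy of the actual test):

    import org.junit.runner.RunWith;
    import org.mockito.InjectMocks;
    import org.mockito.Mock;
    import org.mockito.junit.MockitoJUnitRunner;

    @RunWith(MockitoJUnitRunner.class)
    public class NASBackupProviderTest {

        @Mock
        BackupDetailsDao backupDetailsDao;   // newly mocked; the field was previously left null

        @InjectMocks
        NASBackupProvider backupProvider;    // Mockito wires the mock into backupDetailsDao

        // testDeleteBackup and takeBackupSuccessfully can now stub backupDetailsDao
        // calls instead of hitting a NullPointerException.
    }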
@bernardodemarco pointed out that design docs / RFCs go in the project
wiki or as a separate issue rather than into the source tree. The RFC
content has been posted as a comment on the existing tracking issue
#12899 (which is where the design discussion already lives), and the
docs/rfcs/ directory is removed from this PR.
Phase 6 added a hasBackingChain() check before rsync that uses
qemu-img info to detect chained incrementals. The existing
testExecuteWithRsyncFailure test mocks Script.runSimpleBashScriptForExitValue
to return 0 for any command, so the new qemu-img info check
incorrectly evaluates as "has backing chain" and routes the test
through the chain-flatten path instead of rsync — the test then
asserts a failure that never occurs.
Add a clause to the mock that returns 1 (no backing chain) for the
qemu-img info backing-filename probe, so the test continues to
exercise the rsync path it was designed for.
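A sketch of the extra clause inside the test, assuming the static call is
stubbed through Mockito's MockedStatic (the exact string matching on the probe
command is illustrative):

    try (MockedStatic<Script> script = Mockito.mockStatic(Script.class)) {
        script.when(() -> Script.runSimpleBashScriptForExitValue(Mockito.anyString()))
              .thenAnswer(invocation -> {
                  String cmd = invocation.getArgument(0);
                  if (cmd.contains("qemu-img info")) {
                      return 1;   // backing-filename probe: report "no backing chain"
                  }
                  return 0;       // keep the original "succeed for everything else" behaviour
              });
        // ... run the restore and assert the rsync failure exactly as before
    }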
Adds five new test cases to test_backup_recovery_nas.py covering the
end-to-end behaviour of the incremental NAS backup feature:
* test_incremental_chain_cadence
- Sets nas.backup.full.every=3, takes 5 backups, verifies the
type pattern is FULL, INC, INC, FULL, INC.
* test_restore_from_incremental
- FULL + 2 INCs, each with a marker file. Restores from the
latest INC and verifies all three markers are present
(i.e. qemu-img convert flattened the chain correctly).
* test_delete_middle_incremental_repairs_chain
- Builds FULL, INC1, INC2; deletes INC1 (no force needed);
restores from the surviving INC2 and verifies that markers
from FULL, INC1 (which was deleted), and INC2 are all present
— proving the rebase merged INC1's blocks into INC2.
* test_refuse_delete_full_with_children
- Verifies plain delete of a FULL that has children fails, and
delete with forced=true succeeds and removes the whole chain.
* test_stopped_vm_falls_back_to_full
- Sets cadence to 2, takes one backup (FULL), stops the VM,
triggers another (cadence would say INC). Verifies the second
backup is recorded as FULL because the agent fell back when
backup-begin couldn't run on a stopped VM.
All tests restore nas.backup.full.every to 10 in finally blocks.
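For reference, the cadence asserted by the first test follows from a simple
position check; a standalone Java restatement of the rule (not the marvin test
code):

    public class CadenceExample {
        public static void main(String[] args) {
            int fullEvery = 3;   // nas.backup.full.every
            for (int i = 0; i < 5; i++) {
                // a full is taken whenever the chain position wraps around
                String type = (i % fullEvery == 0) ? "FULL" : "INC";
                System.out.println("backup " + (i + 1) + " -> " + type);
            }
            // prints: FULL, INC, INC, FULL, INC
        }
    }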
Refs: apache/cloudstack#12899
Adds the delete-with-chain-repair semantics agreed in the RFC review:
scripts/vm/hypervisor/kvm/nasbackup.sh
- New '-o rebase' operation: rebases an existing on-NAS qcow2 onto
a new backing parent. Uses a SAFE rebase (no -u) so the target
absorbs blocks of the about-to-be-deleted parent before the
backing pointer is moved up to the grandparent. Writes the new
backing reference relative to the target's directory so it
survives mount-point changes.
- New CLI flags --rebase-target, --rebase-new-backing (both passed
mount-relative).
RebaseBackupCommand + LibvirtRebaseBackupCommandWrapper
- New agent command that wraps the script's rebase operation. The
provider sends one of these per child that needs re-pointing.
NASBackupProvider.deleteBackup
- Now plans the chain repair before touching files via
computeChainRepair():
        * No chain metadata -> single-file delete (legacy behaviour)
        * Tail incremental -> single delete, no rebase
        * Middle incremental -> rebase the immediate child onto our parent,
          then delete; shift chain_position of all later descendants by -1
        * Full with descendants -> refuse unless forced=true; with forced=true,
          delete the full plus every descendant, newest-first
- Updates parent_backup_id, chain_position metadata in
backup_details after each rebase so the model in the DB matches
the on-disk chain.
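A condensed sketch of the decision computeChainRepair() encodes (names and
parameters are illustrative, not the provider's actual signature):

    public class ChainRepairSketch {
        enum Plan { SINGLE_DELETE, REBASE_CHILD_THEN_DELETE, REFUSE, CASCADE_DELETE }

        static Plan computeChainRepair(boolean hasChainMetadata, boolean isFull,
                                       boolean hasDescendants, boolean isTail, boolean forced) {
            if (!hasChainMetadata) {
                return Plan.SINGLE_DELETE;              // legacy single-file backup
            }
            if (isFull && hasDescendants) {
                return forced ? Plan.CASCADE_DELETE     // delete full + descendants, newest-first
                              : Plan.REFUSE;            // children would be orphaned
            }
            if (isTail) {
                return Plan.SINGLE_DELETE;              // nothing points at this file, no rebase
            }
            // middle incremental: rebase the immediate child onto our parent, delete,
            // then shift chain_position of all later descendants by -1
            return Plan.REBASE_CHILD_THEN_DELETE;
        }
    }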
This implements the cascade-delete behaviour requested in @abh1sar's
review point #7.
Refs: apache/cloudstack#12899
Two changes that together let an incremental NAS backup be restored
without manual chain assembly:
scripts/vm/hypervisor/kvm/nasbackup.sh
- qemu-img rebase now writes a backing-file path that is RELATIVE to
the new qcow2's directory (e.g. ../<parent-ts>/root.<uuid>.qcow2)
rather than the absolute path on the current mount point. NAS mount
points are ephemeral (mktemp -d), so an absolute reference would
not resolve when the backup is re-mounted at restore time. Relative
references are resolved by qemu-img against the file's own
directory, so the chain stays valid no matter where the NAS is
mounted next.
- Verifies the parent file exists on the NAS before rebasing.
LibvirtRestoreBackupCommandWrapper
- For file-based primary storage (local, NFS-file), the existing
code rsync'd the source qcow2 to the volume. That copies only the
differential blocks of an incremental, leaving a volume whose
backing-file reference points at a path the primary storage host
doesn't have. Now: detect a backing-chain via qemu-img info JSON
and flatten via 'qemu-img convert -O qcow2', which follows the
chain and produces a self-contained qcow2. Full backups continue
to use rsync (faster, no chain to flatten).
- The block-storage path (RBD/Linstor) already used qemu-img convert
via the QemuImg helper, which auto-flattens chains, so that path
needed no change.
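A sketch of the new file-based branch, assuming the agent's existing Script
helper (paths and the exact probe/convert invocations are illustrative):

    import com.cloud.utils.script.Script;

    public class RestoreFlattenSketch {
        static void restoreFileBased(String backupPath, String volumePath) {
            // qemu-img info --output=json reports a "backing-filename" key only
            // when the qcow2 has a parent, i.e. the backup is a chained incremental
            String info = Script.runSimpleBashScript("qemu-img info --output=json " + backupPath);
            boolean hasBackingChain = info != null && info.contains("backing-filename");

            String cmd = hasBackingChain
                    // convert walks the backing chain and writes a self-contained qcow2
                    ? "qemu-img convert -O qcow2 " + backupPath + " " + volumePath
                    // full backup: a plain copy is enough (and faster)
                    : "rsync -az " + backupPath + " " + volumePath;
            Script.runSimpleBashScriptForExitValue(cmd);
        }
    }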
Refs: apache/cloudstack#12899
CloudStack rebuilds the libvirt domain XML on every VM start, which means
persistent QEMU dirty bitmaps don't survive a stop/start cycle. Rather
than hooking into the VM start lifecycle (intrusive across the
orchestration layer), this commit handles the missing bitmap *lazily* at
the next backup attempt:
nasbackup.sh
- When -M incremental is requested, the script first checks
`virsh checkpoint-list` for the parent bitmap. If absent, it
recreates the checkpoint on the running domain so libvirt accepts
the <incremental> reference. The next incremental will be larger
than usual (it captures all writes since recreate, not since the
previous incremental) but is correct; subsequent ones return to
normal size.
- On recreation, emits BITMAP_RECREATED=<name> on stdout for the
orchestrator to record.
BackupAnswer
+ bitmapRecreated field surfaced from the agent.
LibvirtTakeBackupCommandWrapper
- Strips BITMAP_RECREATED= line from stdout before size parsing.
- Sets answer.setBitmapRecreated(...).
NASBackupChainKeys
+ BITMAP_RECREATED key for backup_details.
NASBackupProvider
- When the agent reports a recreated bitmap, persists it under
backup_details and logs an info-level message so operators can
correlate larger-than-usual incrementals with VM restarts.
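The marker handling on the wrapper side reduces to something like this (a
sketch; only setBitmapRecreated comes from the actual change):

    static String stripBitmapRecreatedMarker(String scriptOutput, BackupAnswer answer) {
        StringBuilder remaining = new StringBuilder();
        for (String line : scriptOutput.split("\n")) {
            if (line.startsWith("BITMAP_RECREATED=")) {
                // surface the recreated bitmap name to the orchestrator
                answer.setBitmapRecreated(line.substring("BITMAP_RECREATED=".length()));
            } else {
                remaining.append(line).append("\n");   // size lines stay intact for the parser
            }
        }
        // the existing numeric-suffix size parsing then runs on the remaining output
        return remaining.toString();
    }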
This satisfies the bitmap-loss-on-VM-restart concern from the RFC review
without touching VirtualMachineManager / StartCommand / agent lifecycle.
Refs: apache/cloudstack#12899
Adds the Java side of the incremental NAS backup feature:
TakeBackupCommand
+ mode, bitmapNew, bitmapParent, parentPath fields (null for legacy
callers — script preserves its existing behaviour when these are
omitted).
BackupAnswer
+ bitmapCreated (echoed by the agent on success)
+ incrementalFallback (true when an incremental was requested but the
agent had to fall back to full because the VM was stopped).
LibvirtTakeBackupCommandWrapper
- Forwards the new fields to nasbackup.sh.
- Strips the new BITMAP_CREATED= / INCREMENTAL_FALLBACK= marker lines
out of stdout before the existing numeric-suffix size parser runs,
so the script can keep the same "size as last line(s)" contract.
- Surfaces both markers on the BackupAnswer.
NASBackupProvider
- decideChain(vm) walks backup_details (chain_id, chain_position,
bitmap_name) for the latest BackedUp backup of the VM and decides:
* Stopped VM -> full (libvirt backup-begin needs running QEMU)
* No prior chain -> full (chain_position=0)
* chain_position+1 >= nas.backup.full.every -> new full
* otherwise -> incremental, parent=last bitmap
- Generates timestamp-based bitmap names ("backup-<epoch>") matching
what the script then registers as the libvirt checkpoint name.
- persistChainMetadata() writes parent_backup_id, bitmap_name,
chain_id, chain_position, type into the existing backup_details
key/value table (per the RFC review — no new columns on backups).
- Honours the agent's INCREMENTAL_FALLBACK= signal: re-records the
backup as a full and starts a fresh chain.
- createBackupObject() now takes a type argument so the BackupVO
reflects the actual decision instead of always being "FULL".
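In sketch form, the chain decision reads roughly as follows (types and names
are illustrative, not the provider's actual API):

    public class ChainDecisionSketch {
        enum Type { FULL, INCREMENTAL }
        record Prior(int chainPosition, String bitmapName) {}     // from backup_details
        record Decision(Type type, String parentBitmap) {}

        static Decision decideChain(boolean vmRunning, Prior prior, int fullEvery) {
            if (!vmRunning) {
                return new Decision(Type.FULL, null);              // backup-begin needs running QEMU
            }
            if (prior == null) {
                return new Decision(Type.FULL, null);              // no prior chain, chain_position = 0
            }
            if (prior.chainPosition() + 1 >= fullEvery) {
                return new Decision(Type.FULL, null);              // cadence reached, new full
            }
            return new Decision(Type.INCREMENTAL, prior.bitmapName());  // parent = last bitmap
        }
    }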
Refs: apache/cloudstack#12899
Adds four new optional CLI flags to nasbackup.sh:
-M|--mode <full|incremental>
--bitmap-new <name> (checkpoint to create with this backup)
--bitmap-parent <name> (incremental: parent bitmap to read changes since)
--parent-path <path> (incremental: parent backup file for rebase)
Behavior:
- When -M is omitted, behavior is unchanged (legacy full-only, no checkpoint
created), so existing callers are not affected.
- With -M full + --bitmap-new, a full backup is taken AND a libvirt
checkpoint of that name is registered atomically (via backup-begin's
--checkpointxml), giving the next incremental its starting bitmap.
- With -M incremental, libvirt's <incremental> element references the
parent bitmap; only changed blocks are written. After completion,
qemu-img rebase wires the new file to its parent so the chain on the
NAS is self-describing for restore.
- Stopped VMs cannot use backup-begin; if -M incremental is requested
  while the VM is stopped, the script falls back to a full and emits
  INCREMENTAL_FALLBACK= on stdout so the orchestrator can record it
  correctly in the chain.
- The script echoes BITMAP_CREATED=<name> on success so the Java caller
can store it under backup_details (NASBackupChainKeys.BITMAP_NAME).
Works across local file, NFS-file, and LINSTOR primary storage. The lack of
running-VM support on Ceph RBD is a pre-existing limitation of this script
and is not affected by this change.
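For illustration only, the agent-side invocation could be assembled along
these lines (the builder is hypothetical; only the flag names come from the
script):

    static String buildNasBackupArgs(String baseArgs, String mode, String bitmapNew,
                                     String bitmapParent, String parentPath) {
        StringBuilder args = new StringBuilder(baseArgs);           // existing nasbackup.sh arguments
        if (mode != null) {
            args.append(" -M ").append(mode);                       // full | incremental
            args.append(" --bitmap-new ").append(bitmapNew);        // checkpoint created with this backup
            if ("incremental".equals(mode)) {
                args.append(" --bitmap-parent ").append(bitmapParent);  // bitmap to read changes since
                args.append(" --parent-path ").append(parentPath);      // parent backup file for rebase
            }
        }
        return args.toString();                                     // legacy callers omit the new flags
    }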
Refs: apache/cloudstack#12899
NASBackupChainKeys defines the keys this provider stores under the
existing backup_details kv table (parent_backup_id, bitmap_name,
chain_id, chain_position, type). This keeps the backups table
provider-agnostic per the RFC review.
nas.backup.full.every is a zone-scoped ConfigKey that controls how
often a full backup is taken; the remaining backups in the cycle are
incremental. Counts backups (not days), so it works for hourly,
daily, and ad-hoc schedules. Default 10. Set to 1 to disable
incrementals (every backup is full).
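Sketched in two fragments (the constant names come from the description above;
the ConfigKey constructor arguments are illustrative):

    public interface NASBackupChainKeys {
        String PARENT_BACKUP_ID = "parent_backup_id";
        String BITMAP_NAME      = "bitmap_name";
        String CHAIN_ID         = "chain_id";
        String CHAIN_POSITION   = "chain_position";
        String TYPE             = "type";
    }

    // in the provider (sketch):
    ConfigKey<Integer> NasBackupFullEvery = new ConfigKey<>("Backup", Integer.class,
            "nas.backup.full.every", "10",
            "Take a full NAS backup every N backups; the rest of the cycle is incremental. "
                    + "Set to 1 to disable incrementals.",
            true, ConfigKey.Scope.Zone);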
Refs: apache/cloudstack#12899
Adds the design document for incremental NAS backups using QEMU dirty
bitmaps and libvirt's backup-begin API. The feature reduces daily backup
storage by 80-95% for large VMs.
Refs: apache/cloudstack#12899
* Move logging of the migration setting values out of the loop
* Apply suggestions from code review
Co-authored-by: Suresh Kumar Anaparti <sureshkumar.anaparti@gmail.com>
---------
Co-authored-by: Suresh Kumar Anaparti <sureshkumar.anaparti@gmail.com>
Fixes an issue in NsxResource.executeRequest where Network.Service
comparison failed when DeleteNsxNatRuleCommand was executed in a
different process. Due to serialization/deserialization, the
deserialized Network.Service instance was not equal to the static
instances Network.Service.StaticNat and Network.Service.PortForwarding,
causing the comparison to always return false.
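The pitfall in sketch form (the name-based comparison shown is one robust
pattern, not necessarily the exact code in NsxResource):

    // After Gson round-trips the command, 'service' is a freshly constructed
    // Network.Service instance, so comparing against the static singletons fails.
    static boolean isStaticNat(Network.Service service) {
        // fragile: false once the command has been serialized and deserialized
        // return service == Network.Service.StaticNat;

        // stable: compare by the service name instead
        return Network.Service.StaticNat.getName().equals(service.getName());
    }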
Co-authored-by: Andrey Volchkov <avolchkov@playtika.com>
(cherry picked from commit 30dd234b00)
* Fix static routes to be added to PBR tables in VPC routers
Static routes were only being added to the main routing table, but
policy-based routing (PBR) is active on VPC routers. Traffic arriving on
specific interfaces is looked up in interface-specific routing tables
(Table_ethX), so it never found the static routes.
This fix:
- Adds a helper method to find which interface a gateway belongs to
by matching the gateway IP against configured interface subnets
- Modifies route add/delete operations to update both the main table
and the appropriate interface-specific PBR table
- Uses existing CsAddress databag metadata to avoid OS queries
- Handles both add and revoke operations for proper cleanup
- Adds comprehensive logging for troubleshooting
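The interface lookup is a plain subnet-membership test; a Java restatement of
the logic (the actual helper lives in the router's Python CsAddress/CsHelper
code):

    import java.util.Map;

    public class GatewayDeviceLookup {
        // Returns the device (e.g. "eth2") whose configured CIDR contains the gateway IP,
        // or null when no interface matches.
        static String findDeviceForGateway(Map<String, String> deviceToCidr, String gatewayIp) {
            for (Map.Entry<String, String> e : deviceToCidr.entrySet()) {
                String[] parts = e.getValue().split("/");            // e.g. "10.1.1.1/24"
                int prefix = Integer.parseInt(parts[1]);
                int mask = prefix == 0 ? 0 : -1 << (32 - prefix);
                if ((toInt(gatewayIp) & mask) == (toInt(parts[0]) & mask)) {
                    return e.getKey();
                }
            }
            return null;
        }

        static int toInt(String ip) {
            String[] o = ip.split("\\.");
            return (Integer.parseInt(o[0]) << 24) | (Integer.parseInt(o[1]) << 16)
                 | (Integer.parseInt(o[2]) << 8) | Integer.parseInt(o[3]);
        }
    }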
Fixes #12857
* Add iptables FORWARD rules for nexthop-based static routes
When static routes use nexthop (gateway) instead of referencing a
private gateway's public IP, the iptables FORWARD rules were not
being generated. This caused traffic to be dropped by ACLs.
This fix:
- Adds a shared helper CsHelper.find_device_for_gateway() to determine
which interface a gateway belongs to by checking subnet membership
- Updates CsStaticRoutes to use the shared helper instead of duplicating
the device-finding logic
- Modifies CsAddress firewall rule generation to handle both old-style
(ip_address-based) and new-style (nexthop-based) static routes
- Generates the required FORWARD and PREROUTING rules for nexthop routes:
* -A PREROUTING -s <network> ! -d <interface_ip>/32 -i <dev> -j ACL_OUTBOUND_<dev>
* -A FORWARD -d <network> -o <dev> -j ACL_INBOUND_<dev>
* -A FORWARD -d <network> -o <dev> -m state --state RELATED,ESTABLISHED -j ACCEPT
Fixes the second part of #12857
* Fix the network-matching grep so that 1.2.3.4/32 no longer matches 11.2.3.4/32
* initial attempt at network.loadbalancer.haproxy.idle.timeout implementation
* implement test cases
* move idleTimeout configuration test to its own test case
The list API returns results in pages and includes a `cursor` field when more
pages are available. The previous implementation only fetched the first page
and ignored pagination.
This change updates the list retrieval flow to:
- follow the `cursor` chain until no further pages exist
- accumulate items from all pages
- return a single merged result to the caller
This ensures that list operations return the complete dataset rather than just
the first page.
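The retrieval loop is the usual cursor-following pattern (a generic sketch,
not the actual client code):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Function;

    public class CursorPagination {
        record Page<T>(List<T> items, String cursor) {}     // cursor is null/empty on the last page

        static <T> List<T> listAll(Function<String, Page<T>> fetchPage) {
            List<T> all = new ArrayList<>();
            String cursor = null;
            do {
                Page<T> page = fetchPage.apply(cursor);      // null cursor fetches the first page
                all.addAll(page.items());
                cursor = page.cursor();
            } while (cursor != null && !cursor.isEmpty());
            return all;                                      // merged result across all pages
        }
    }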
Co-authored-by: Andrey Volchkov <avolchkov@playtika.com>
* Fix VPC restart with multi-CIDR networks: handle comma-separated CIDR in NetworkVO.equals()
When a network has multiple CIDRs (e.g. '192.168.2.0/24,160.0.0.0/24'),
NetworkVO.equals() passes the raw comma-separated string to
NetUtils.isNetworkAWithinNetworkB() which expects a single CIDR,
causing 'cidr is not formatted correctly' error during VPC restart
with cleanup=true.
Extract only the first CIDR value before passing to NetUtils.
* Fix root cause: skip CIDR/gateway updates for Public traffic type networks
addCidrAndGatewayForIpv4/Ipv6 (introduced by PR #11249) was called for all
network types without checking if the network is Public. This caused
comma-separated CIDRs to be stored on Public networks, which then triggered
'cidr is not formatted correctly' errors during VPC restart.
Add TrafficType.Public guard in both the VLAN creation (addCidr) and
VLAN deletion (removeCidr) paths in ConfigurationManagerImpl.
* Sanitize legacy network-level addressing fields for Public networks
---------
Co-authored-by: dahn <daan@onecht.net>
Changes to check resource limits with reservations for the following
resource types:
- backup
- backup_storage
- bucket
- object_storage
Signed-off-by: Abhishek Kumar <abhishek.mrt22@gmail.com>