This bug also happens in XenServer 5.6 FP1, but this issue is hidden by bug 11564.
There are two issues here
1. in XenServer 5.6 FP1/SP2, when a slave host is down ( NIC is unplugged), and before master detect the slave is down( it takes about 10 minutes),
template.createClone doesn't work sometime, it complains can not contact slave host, looks like XenServer try to start VM on this slave host, we use VM.create instead, we can specify host in VM.create API, so it will not try to connect the slave
2. in XenServer 5.6 FP1/SP2, host tag is introduced into VDI, however in HA case, VM.reset_powerstate and VM.destory doesn't remove this host tag, when CloudStack try to start this VM on other host, XenServer complains this VDI is already used ( or mount as RW). CloudStack will manually remove the host tage in VDI in FenceCommand
status 11552: resolved fixed
1) Introduce new managers - ProjectManager and DomainManager. Moved all domain related code from AccountManager to DomainManager.
2) Moved some code from ManagementServerImpl to the correct managers.
3) New resource limit for Domain - Project
Added two New values "all" and "default" to global config "network.loadbalancer.haproxy.stats.visibility" . With this change, it can take six possible value:
global - stats visible from public network.
guest-network - stats visible only to guestnetwork.
link-local - stats visible only to link local network(for xen and kvm).
disabled - stats disabled.
all - stats available on public,guest and link-local. (Newly added)
default - stats availble on the serving http port, this does need any specific http port.(Newly added)
Except "default" and "disabled", all the rest of 4 need to configure the stats port.
Force stop the router would release all the resources it used, but router may
still running. Add a column "stop_pending" in the database, and stop it when the
router come back.
Admin would able to choose to force destroy such router, then recover the
network using restartNetwork command with cleanup=false.
remove heartbeat entry for this Primary Storage, when put this Primary Storage into maintenance mode
create heartbeat entry for this Primary Storage, when cancal maintenance for this Primary Storage
status 11275: resolved fixed
status 11036: resolved fixed
1) Use row locks instead of global lock when update resource_count table. When update resource_count for account, make sure that we lock account+all related domains
2) Insert resource_count records for account/domain at the moment when account/domain is created.
3) As a part of DB upgrade, insert missing resource_count records for all non-removed accounts/domains
Conflicts:
core/src/com/cloud/alert/AlertManager.java
server/test/com/cloud/agent/MockAgentManagerImpl.java
Changes :
- Fixing API doc +response name + errorMessage
- Adding seperate events to Egress rules
- Egress rules Using the same database table as that of ingress with new column type.
Pending Tasks:
- db upgrade
- database table rename from security_ingress_rule to generic name, renaming some of the jave class from ingress to generic name.
- Retesting on kvm
Changes:
To make sure migration does not attempt to pick a host that has running VMs more than the max guest VM's limit:
- Changed manual migration to call host allocators to return a list of hosts suitable for migration. Host allocators check for the max guest VM limit.
- Earlier we returned hosts with enough capacity but now Host Allocators make other checks along with capacity. So the list of hosts returned are hosts that have enough capacity AND satisfy all other conditions like host tags, max guests limit etc. Or in other words Allocators dont return the hosts that dont satisfy all conditions even if they have capacity.
-Therefore, now we mark the list of hosts returned for manual migration as 'suitable' hosts instead of 'hasenoughCapacity' in the HostResponse.
- HA migration already calls allocators, so no change is needed there.
Changes:
- Adding a new table 'hypervisor_capabilities' that will record capabilities for each hypervisor version. Added db schema changes for this.
- Currently a few capabilities have been added, namely, 'max_guests_limit' and 'security_group_enabled'
- Added a new column 'hypervisor_version' to host table. StartupRouting command now takes in this parameter. It should be set when a host connects.
- If a host's hypervisor version is not present, we find all the capabilities rows for that hypervisor type and use the first record.
- 'max_guests_limit' is the maximum number of running guest Vms that a host can have for the given hypervisor.
- Host Allocators use this limit and skip a host if the number of running VMs on that host exceeds this limit.
Description :
API's:
- Two new api's authorizeSecurityGroupEgress,revokeSecurityGroupEgressCmd are added. These two API's are similer to ingress rule API's.
- authorizeSecurityGroupEgress :Authorizes a particular egress rule for this security group . Usageof API is very similer to that of authorizeSecurityGroupIngress except that instead of source cidr there will be destination cidr. By default like ingress, all the outgoing flows are blocked.
- revokeSecurityGroupEgress : It is similer to revokeSecurityGroupIngress api, It removes the egress rule.
- listSecurityGroup API's response changed. It include's egress list apart from the existing ingress rules in the output of the API.
Hypervisors :
- It is implemented in Xen and KVM.
Pending Tasks : Blocking using destination security groups.
Previous commits: c9fda641673df7701f44963ef27e1d488f121219 , 24e4e44b8f0712a37147a3777833de3f9e24829e
- adding supprt for Netscaler VPX & MPX load blancers
- implemented for virtual networking
- works only with new fetched public IP, inline support is not added yet
more details will be added in the bug
Added New value "link-local" to global config network.loadbalancer.haproxy.stats.visibility . With this change it can take new parameter "link-local" value apart from the existing 3 values global,guest-network,disabled.
global - stats visible from public network
guest-network - stats visible only to guestnetwork.
link-local - stats visible only to link local network
disabled - stats disabled.
Changes:
- CreateTemplate and RegisterTemplate now support adding a template tag. It is a string value. This is root-admin only action - only admin can add template tags.
- ListTemplates will return the template tag in response.
- HostAllocator changed to use template tag along with the existing tag on service offering. If both tags are present, allocator now finds hosts satisfying both tags. If no hosts have both tags, allocation will fail.
- DB changes to add new column to vm_template table.
- DB upgrade changes for upgrade from 2.2.10 to 2.2.11
previous commit: c9fda641673df7701f44963ef27e1d488f121219 ( this under bug 1067, typing error)
changes: 1) partially implemented listing of egress rules along with ingress rules.
2) partially implemneted egress rules for KVM
status 11029: resolved fixed
Commit also includes the following:
* map firewall rule to pf/lb/staticNat/vpn when the firewall rule is created as a part of pf/lb/staticNat/vpn rule creation
* when delete firewall rules, also delete related firewall rule
1) On enableStaticNat command we actually send the command to the backend (we used to just upgrade the DB in the past). The backend command carries sourceIp and destIp, and creates IP to IP mapping on the domR.
2) On disableStaticNat for the Ip address in addition to cleaning up port ranges, we also delete IP to IP mapping on the domR.
1) Added new apis: createFirewallRule, deleteFirewallRule, listFirewallRules
2) Modified existing apis - added boolean openFirewall parameter to createPortForwardingRule/createIpForwardingRule/createRemoteAccessVpn. If parameter is set to true, open firewall on the domR before creating an actual PF rule there
Modified backend calls appropriately.
3) Schema changes for firewall_rules table:
* startPort/endPort can be null now
* added icmp_type, icmp_code fields (can be not null only when protocol is icmp)
4) Added new manager - FirewallManagerImpl
Conflicts:
api/src/com/cloud/api/BaseCmd.java
client/tomcatconf/commands.properties.in
server/src/com/cloud/api/ApiResponseHelper.java
server/src/com/cloud/configuration/DefaultComponentLibrary.java
server/src/com/cloud/network/lb/LoadBalancingRulesManagerImpl.java
server/src/com/cloud/network/rules/RulesManagerImpl.java
1) Added new apis: createFirewallRule, deleteFirewallRule, listFirewallRules
2) Modified existing apis - added boolean openFirewall parameter to createPortForwardingRule/createIpForwardingRule/createRemoteAccessVpn. If parameter is set to true, open firewall on the domR before creating an actual PF rule there
Modified backend calls appropriately.
3) Schema changes for firewall_rules table:
* startPort/endPort can be null now
* added icmp_type, icmp_code fields (can be not null only when protocol is icmp)
4) Added new manager - FirewallManagerImpl
deployed VM is not impacted, if the guest OS type is not supported, run it as HVM
status 10483: resolved fixed
Conflicts:
core/src/com/cloud/hypervisor/xen/resource/CitrixHelper.java
core/src/com/cloud/hypervisor/xen/resource/XenServer60Resource.java
implement pool-wise VM sync,
For XenServer, VM fullSync is pool-wise now, VM deltaSync is still per host
Conflicts:
server/src/com/cloud/vm/VirtualMachineManagerImpl.java
put copy scripts in SetupCommand,
1. initiate returns host version,
2. if it doesn't match with DB, update DB, and reconnect the host.
status 9997: resolved fixed
Part 2
commit 797839360c65cd348d2eb20630521177ab0919de
bug 9154: redundant virtual router
commit 8ff7f230204d4d3a7a4adee75523a9a84f4276fe
bug 9154: Replace domain_router.is_master with domain_router.redundant_state in DB
commit 230b99e9e0b152648f1dd2a5eab6f22315b8e7b4
bug 9154: Add redundant state to DomainRouterResponse
commit ccefb5ff5e83d713798a347c99bce1a0d04b4317
bug 9154: Add router fault state report
commit 7a3090378f9785caecf741b70554f6ea17c41764
bug 9154: Send alert if found two virtual routers in master state
commit 66831056e4bf27665871bccd24e6159071564847
bug 9154: Code clean up
commit bf3f58a85741fa7118bd848a42d8b21baa4478d4
bug 9154: Add isRedundantRouter to DomainRouterResponse
Part 1
This backport contained:
commit 52317c718c25111c2535657139b541db0c9d1e1f
bug 9154: Initial check in for enabling redundant virtual router
commit 54199112055d754371bfb141168fb5538bf6d6ea
Add host verification for CheckRouterCommand
commit cef978a228c90056ead9be10cbc4de74c2b8de76
Fix CheckRouterAnswer's isMaster report
commit 4072f0a6991ac3b63601a1764fbe14188965f62f
Some build fixes and code refactoring for redundant router
commit 4d3350b7cd8ee2706a9bace4437fc194e36c8dd5
Redundant Router: Fix OVS
commit 6a228830e7c46d819fa0c3317e159e041337e887
Fix findByNetwork()/findByNetworkAndPod()'s return
commit c627777b3d5bdbcd60db4032cebd349a5b1ecd83
Redundant Router: Fix isVmAlive()
commit e1275d2514adc41f8744f5107d4069c38be195f1
Only issue CheckRouterCommand to redundant routers
And all modification to the scripts till
commit 4e3942462ed3fde3a3d7011e95839e2128fba514
logging changes
in the master branch.
put copy scripts in SetupCommand,
1. initiate returns host version,
2. if it doesn't match with DB, update DB, and reconnect the host.
status 9997: resolved fixed
This patch enable redundant virtual routers.
1. To enable this feature, db need to be updated using follow SQL by now(we
would get a UI way later):
UPDATE network_offerings SET redundant_router=1 WHERE guest_type="Virtual" AND
system_only=0;
2. System would try to start up two routers at different hosts. But if there is
only one host in the zone, system would start up two routers on it.
3. The failover part is using keepalived, and connection tracking part is using
conntrackd. There would be one master router and one backup router. The status
of router(master or backup) can be query from the database table domain_router
now. Management server would update the status every 30s by default.
4. The routers for the same zone would use same external NIC(same ip and mac).
The script used for fail-over would ensure only one external NIC present in the
network at any time.
5. Currently management server don't got the ability to stop one of router is
both of them reported as master. The feature is in the todo list.
After two routers start up, disconnect anyone of them, the guest network
shouldn't be affected, and established connection(http, ssh, etc.) should still
works. The fail-over on gateway part should be 3~4 seconds.
Currently the patch works with KVM. Would deal with vmware and XenServer soon.
haproxy tunning:
0. Test case:
httpd running in 5 user VMs, all of them created on a xenserver host(16 core, 42G memroy, 10G network)
domR running on an anther host with same hardware configuration.
test application, ab, running on anther host behind an anther seperate switch
1.haproxy is not a memory intensive app. I can get 4625.96 connection/s with 1G memory. While it's really a CPU intensive app, domR always uses around 100% CPU on the host.
2.By default, you can't get better connection/s rate, because ip_conntrack_max and tw_bucket are too small, you will see the error in domR like:
"TCP: time wait bucket table overflow" or "nf_conntrack: table full, dropping packet".
So I increase these numbers to 1000000 from 65536, then I can steadly get around 4600 connection/s when memory is >= 1G.
Here is the connection per second, tested by "ab -n 1000000 -c 100 http://192.168.170.152:880/test.html"
domR memory conn/s
128M: 3545.55
256M: 4081.38
512M: 4318.18
1G: 4625.96
7G: 4745.53
3. If I enable notrack for both connections between domr/user vm, and public network, that tell iptable in domR don't track the connection during my test, then I can get better number, around
5800 connections/s. But we can't enable notrack, as iptables is used to track throughput in domR.
4. In a word, with this commit, the connection rate of haproxy can be increased from 1000-2000/s to 4700/s when domR's memory is larger than 1G.
5. How many CPU need to assign to domR to get this number? Haven't finished yet, as CPU is shared by all the VMs on the host, if other VMs are busy, it will impact the performance of haproxy.
Changes specific for Xen hypervisor, and DB upgrade. Changes for vmware chcked-in already in commit 1c310a0d2ae81108386f0dd5c2e899ff00fee9e9, e71112e2f587f5d6c9c6d5337cfeb1f239f29633. KVM will not support this feature.
Changes:
- Added a new column `source_template_id` to vm_template table to carry the parent/source template ID from which the tempalte was created
- Added the column in db upgrade 224 to 225
- Changed code to save the source_template_id if there is one associated to the volume/ volume from which the snapshot was taken
- API response returns the sourcetemplateid field, if set, in all template usecases.
status 9623: resolved fixed
Also set ram_size to 1024 for console proxy offering during the upgrade
Conflicts:
core/src/com/cloud/vm/SecondaryStorageVmVO.java
server/src/com/cloud/agent/manager/allocator/impl/UserConcentratedAllocator.java
server/src/com/cloud/consoleproxy/ConsoleProxyManagerImpl.java
server/src/com/cloud/storage/allocator/LocalStoragePoolAllocator.java
server/src/com/cloud/storage/secondary/SecondaryStorageManagerImpl.java
- Local fix to not log the content for ModifySshKeyCommand.
- For commands that do not want to log the parameters, added the facility to indicate this.
- For such commands, we remove the parameters from the log.
status 9336: resolved fixed
Following changes were made:
* deleteSecurityGroup/authorizeSecurityGroupIngress - removed account/domainId parameters as SG is uniquely identified by id now
* removed account_name field from securityGroup DB table; removed allowed_security_group/allowed_sec_grp_acct from security_ingress_rule.
These values were used for api response generation only for performance purposes; added caching on API level to improve performance
* Added missing security checks for securityGroups/ingressRules
Since private and public keys are logged, this is a Security concern
Changes: Added capability to 'Command' instances to support excluding certain fields from getting logged using GSON @Expose annotation.
- Update system vm_instance's template_id if it does not match the system vm template.
- Use _templateDao.findSystemVMTemplate to find the latest system vm template.