Firstly, thank you for reading.
I was asked by a customer to implement an SRM solution in their existing environment so they could replicate between their 2 sites, Production and DR. The company was covered by an ELA in which SRM was included as part of the Cloud Suite licensing, and the renewal was due in the coming months. The first step was to upgrade their environment to a later version in order to take advantage of a later SRM release with enhanced features. This meant upgrading both sites, all 90 hosts, appliances and so on, all of which was staggered and done under change control. Once this was in place I could deploy the SRM appliances, or so I thought. After going through the complicated process I could not pair the 2 sites. I had a call open with VMware which was escalated twice and took over 4 weeks to put right, due to a bug. While all this was going on I discovered Zerto, a third-party replication tool, and went to the business to propose a POC and product comparison given the stresses thus far. Once it was agreed, I installed and configured Zerto in one afternoon, and overnight I had my first group of test VMs replicating. As simple as that.
Don’t get me wrong, I love VMware. I have worked with it for over 12 years, but they don’t get everything right, and I wanted to share my results with everyone. This is quite a lengthy blog as I am taking content from the product comparison that I put to the business, but hopefully the contents will help others with their documentation and decisions.
This first section discusses the architecture of SRM and Zerto and what their components do. The most important thing here is the method of replication: if you want to route your replication traffic specifically to reduce latency, it is more difficult with SRM and simple with Zerto.
SRM is VMware’s proprietary technology that manages the failover of virtual machines within its environment. We have chosen to use the host-to-host replication feature of vSphere Replication over IP, as opposed to array-based replication at the LUN level. This essentially means there are 2 main elements to the replication utility:
vSphere Replication – These are virtual appliances installed within the vSphere environment, responsible for replicating and managing virtual machine data at a block level. One appliance is required in each site. You can scale up to 10 VRAs, 1 being the management server and the other 9 helping with replication load.
SRM – Site Recovery Manager is the management engine used to manage failovers and tests. It is installed on a Windows machine and one is required per site. The SRM server holds protection groups, which maintain a list of the VMs managed in that group, and recovery plans, which hold the workflow by which VMs are failed over.
Essentially, what this diagram shows is that each element of SRM is in contact with the others; the VRA and SRM servers authenticate with one another over SSO. Once the connection is established, replication takes place as indicated by the red line. A vSCSI filter in each ESXi host intercepts IO for each protected VM and holds a copy of the changed blocks in the VM directory under changed block tracking. These changed blocks are then replicated to the destination replication appliance at intervals determined by the configured RPO. The data is then written to the destination VMDK via a network file copy filter. This method is asynchronous and one-directional; when failing back, the path is the same but reversed. This solution is managed via the vSphere Web Client only.
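To make the idea of shipping only the changed blocks at each RPO interval a bit more concrete, here is a minimal Python sketch. It is a toy model and not how vSphere Replication is actually implemented: `TrackedDisk`, `sync` and the block contents are names and values I have made up purely for illustration.

```python
class TrackedDisk:
    """Toy virtual disk that remembers which blocks changed since the last sync."""

    def __init__(self):
        self.blocks = {}    # block number -> current contents
        self.dirty = set()  # block numbers written since the last sync

    def write(self, block_no, data):
        self.blocks[block_no] = data
        self.dirty.add(block_no)


def sync(source, replica):
    """Copy only the changed blocks to the replica and return how many were sent."""
    changed = sorted(source.dirty)
    for block_no in changed:
        replica[block_no] = source.blocks[block_no]
    source.dirty.clear()
    return len(changed)


source = TrackedDisk()
replica = {}

# First RPO interval: block 7 is written twice but only shipped once.
source.write(0, b"boot")
source.write(7, b"data-v1")
source.write(7, b"data-v2")
first = sync(source, replica)

# Second RPO interval: only the newly changed block is shipped.
source.write(3, b"log")
second = sync(source, replica)

print(first, second, replica == source.blocks)
```

The point of the sketch is that a block overwritten several times inside one interval is only sent once, in its latest state, which is why this style of replication is asynchronous and tied to the RPO.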
Zerto, the third-party replication software, will also be used with host-based replication over IP as opposed to storage replication. When installed, Zerto uses APIs and plugins to generate inventories and execute certain VMware features. As with SRM, Zerto is made up of 2 main components:
VRM – The Virtual Replication Manager is responsible for managing the replication engines, RPOs, customised scripts etc. One is required per site and it is a Windows Server based installation.
VRA – The Virtual Replication Appliance is the component responsible for replicating the guests. One is required on each ESXi host that has VMs to be protected, in both sites.
As with vSphere Replication, each major component communicates with the others. The only difference here is that the replication is done directly between the 2 replication appliances and not from the ESXi host. In a similar manner, the solution has a replication agent on each host that intercepts the IO, holds the changed blocks in cache and then replicates them to the remote site appliance, which gathers the blocks, decompresses them and writes them to disk. The frequency is again determined by the configured RPO. This solution can be managed from a web console or via a plugin in the vSphere client, and is a single pane of glass.
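The appliance-to-appliance flow described above (cache the writes, compress, send, decompress, write to disk) can be sketched in a few lines of Python. This is only an illustration under my own assumptions: `ship`, `apply_writes` and the sample writes are hypothetical, and I am using `zlib` simply as a stand-in for whatever compression Zerto really uses.

```python
import zlib


def ship(cache):
    """Source-side appliance: compress each cached write before it crosses the link."""
    return [(block_no, zlib.compress(data)) for block_no, data in cache]


def apply_writes(packets, disk):
    """Target-side appliance: decompress each write and apply it in arrival order."""
    for block_no, payload in packets:
        disk[block_no] = zlib.decompress(payload)


# Hypothetical intercepted writes, in the order the guest issued them.
cache = [(0, b"A" * 4096), (1, b"B" * 4096), (0, b"C" * 4096)]

packets = ship(cache)
target_disk = {}
apply_writes(packets, target_disk)

raw_bytes = sum(len(data) for _, data in cache)
wire_bytes = sum(len(payload) for _, payload in packets)
print(raw_bytes, wire_bytes)
```

Because the writes are applied in order, the later write to block 0 wins on the target, and the bytes on the wire are far fewer than the raw bytes for data that compresses well.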
Installation and Functionality
From start to finish there were a few challenges with the installation of SRM. Once we had upgraded our environment to vSphere 6.0, we noticed that the build versions in each site needed to be identical in order for the sites to pair. We then encountered a time skew bug where the SSO tokens would expire immediately. This was due to a conflict with an EMC VSI plugin, which needed to be uninstalled from vSphere and its string removed from the managed object browser.
The steps below are the process required to install SRM and replicate the first VM:
- Bring both sites’ vCenter and ESXi hosts up to the same build number
- Install SRM on a Windows server in both sites
- Configure SRM with vCenter
- Install the vSphere Replication appliance from OVF in both sites
- Configure the vSphere Replication appliance with IP, SSO, etc and connect it to vCenter
- Pair the SRM plugin within vCenter with the remote site
- Pair the VRA management server with its remote counterpart
- Set up resource mappings for folders, networking, datastores etc
- Configure IP customization rule per subnet
- Configure a protection group
- Configure a recovery plan
- Configure recovery steps
- Configure replication for each VM
- Add replicated VMs to the protection group
Time taken to install and configure SRM – 3 weeks 4 working days
Time taken once the bug was overcome – 2.8 working days
Once the solution is up and running, protection groups and recovery plans are easy enough to create and edit, assuming your resource mappings and permissions for each feature in vSphere are correct. However, when protecting VMs you first need to configure replication for each VM before adding it to the protection group.
The process for protecting a group of virtual machines, assuming that the SRM infrastructure is in place, is as follows:
- In the vSphere Web Client, select all guests that are to be replicated, right-click and choose to configure replication.
- Go through the steps to configure the RPO and the point-in-time recovery interval.
- Configure storage for each disk and each VM, ensuring you do not over-provision the datastores.
- Open the protection group and add the replicated machines (the machines will not be available to add until they have been configured for vSphere Replication).
- Check the recovery process for test and recovery IP addresses etc.
Time taken to configure the 8 SQL servers for replication = 38.20 minutes
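The storage step above is the one where it is easiest to get caught out, so here is a small Python sketch of the kind of sanity check I mean. The function `overcommitted`, the datastore names and all the sizes are hypothetical and are not figures from the POC.

```python
def overcommitted(datastores, placements):
    """Return {datastore: shortfall_gb} for datastores whose planned replicas exceed free space.

    datastores: {name: free_gb}
    placements: list of (vm_name, disk_gb, datastore_name)
    """
    planned = {name: 0 for name in datastores}
    for _vm, disk_gb, ds in placements:
        planned[ds] += disk_gb
    return {
        ds: planned[ds] - free
        for ds, free in datastores.items()
        if planned[ds] > free
    }


# Hypothetical DR datastores and replica disk placements.
datastores = {"DR-DS01": 2000, "DR-DS02": 1000}
placements = [
    ("sql01", 500, "DR-DS01"),
    ("sql02", 800, "DR-DS01"),
    ("sql03", 700, "DR-DS02"),
    ("sql04", 400, "DR-DS02"),
]

shortfall = overcommitted(datastores, placements)
print(shortfall)
```

An empty result means every datastore can hold the replicas planned for it; anything else tells you which datastore to rebalance before you start the initial sync.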
The image below is an extract of the performance reports from SRM. It shows transferred bytes accumulated over 5 minute intervals, as well as RPO violations and replication server status. The transferred data chart can be run over given periods, so we are able to go back over long periods of time, but as mentioned only at a granularity of 5 minute intervals.
The process for installing, configuring and managing Zerto has been made very simple. The installation completed first time without any issues.
The steps below are those followed from scratch:
- Install Zerto management servers in both sites (the install adds the plugins to vCenter)
- Pair the 2 sites
- From the portal select which ESXi hosts to install replication appliances on with predefined IP addresses and execute
- Create a protection group, including which VMs to replicate, custom IPs, storage etc.
Time taken to install and configure Zerto – 4 hours 20 mins
Once installed, creating and deleting VPGs was very simple. There was also the ability to edit existing VPGs, changing configuration information or adding and removing VMs.
The process for protecting a group of machines is as follows:
- Create a new VPG and select all of the VMs to be protected from the Zerto GUI
- Configure the storage for each VM and each disk, ensuring that the allocations are not too much for the datastores
- Configure the recovery options for each VM, i.e. failover or test VMs
Time taken to configure the 8 SQL servers for replication = 7.32 minutes
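To put the timings quoted in this section side by side, here is a quick Python calculation. Two things are assumptions on my part: the figures 38.20 and 7.32 could be read either as decimal minutes or as minutes.seconds, so the sketch works out both, and the install comparison assumes an 8-hour working day.

```python
def as_decimal_minutes(value):
    """Read '38.20' as 38.2 minutes."""
    return float(value)


def as_minutes_seconds(value):
    """Read '38.20' as 38 minutes 20 seconds."""
    minutes, seconds = value.split(".")
    return int(minutes) + int(seconds) / 60


# Time to configure the 8 SQL servers for replication.
srm, zerto = "38.20", "7.32"
ratio_decimal = as_decimal_minutes(srm) / as_decimal_minutes(zerto)
ratio_mmss = as_minutes_seconds(srm) / as_minutes_seconds(zerto)

# Install and configure time, excluding the weeks lost to the SRM bug.
HOURS_PER_WORKING_DAY = 8  # assumption, not a figure from the POC
srm_install_hours = 2.8 * HOURS_PER_WORKING_DAY
zerto_install_hours = 4 + 20 / 60
install_ratio = srm_install_hours / zerto_install_hours

print(round(ratio_decimal, 1), round(ratio_mmss, 1), round(install_ratio, 1))
```

Whichever way the minutes are read, SRM comes out at a little over five times the effort of Zerto, both for protecting the 8 SQL servers and for the installation once the bug was out of the way.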
Below is a screenshot of Zerto’s dashboard. From the dashboard you are able to see a high-level view of the number of VPGs protected, the number of VMs protected and the amount of data protected. It also shows the real-time IOPS, throughput and transfer rate, and the up-to-date RPO time, with the ability to report on individual VPGs but only for the past 24 hours.
The table below outlines some of the features available in the 2 products, Zerto and vSphere Replication with SRM, side by side.
| Feature | Zerto | vSphere Replication with SRM |
|---|---|---|
| Array based replication | Yes | Yes |
| Host based replication | Yes | Yes |
| Protection groups | Yes | Yes with SRM separately configured |
| Recovery priority list | Yes | Yes with SRM separately configured |
| Custom IPs | Yes | Yes with SRM separately configured |
| Recovery to isolation | Yes | Yes |
| Failover scripts | Yes | Yes with SRM separately configured |
| File level restore ability | Yes up to 4 second intervals | No |
| Use of snapshots for replication | No | No but vSphere uses CBT |
| Point in time recovery | Yes up to 4 second intervals | Yes up to 4 intervals per day |
| Change recovery settings | Yes | Yes |
| Multiple vSphere version compatibility | Yes | No |
| Single VM recovery | Yes | Yes |
| VMs with RDMs | Yes | Only virtual compatibility mode |
| Replicate a VM with active snapshots | Yes | Yes |
| vSphere Client plugin | Yes | No |
| vMotion protected guest | Yes | Yes |
| Increase resources for protected VM | Yes | No – You must stop replication |
| Add disk to protected VM | Yes | Yes – Replication is paused until disk is configured for replication |
| Works with CommVault snapshots | Yes – No issues found | Yes – No issues found |
The table below shows the results of the tests carried out on the 2 products, with the test as the variable and the outcome as the result. Each test was carried out on both solutions, which have similar features for compression and file transfer. Fields marked N/A mean either that the test was not carried out for application owner acceptance reasons, or that the value is not available from that particular solution.
Initially Zerto is more expensive, but having said that it is not an expensive product, not for what it does anyway. The total cost of ownership pays for itself: installing, configuring and managing SRM takes far more effort and resource than Zerto. Think about the ongoing costs and stress too. For example, when you come to upgrade vSphere you also need to upgrade the SRM environment, which means ceasing replication, and it will never be a simple upgrade. Zerto, being third party and agnostic, doesn’t care. So long as its VRA (replication appliance) is up and it can see the guest it is replicating, it continues.
As it is VMware’s native product for replication, I would have expected it to be a bit better in terms of deployment and functionality, especially given the number of releases the software has had. Aside from the bug that was discovered, the installation was complicated and presented issues along the way, partly due to a lack of concrete implementation documentation. In order to control and route the replication traffic, a new VMkernel adapter with a VLAN tag needs to be created on each host to replicate across the links to the replication appliances in the remote site. Once installed, the process of protecting VMs and groups is quite cumbersome and not welcoming, with components of the software scattered across different areas of the web client. As far as the product goes, it works: the data is compressible, which reduces the amount of data replicated, and throttling is available to ensure the bandwidth isn’t hammered. During the POC the RPO was set to the product minimum of 15 minutes, which was breached 12 times, meaning the affected VMs had a greater RPO until the sync caught up. Once configured, guests are replicated and monitored efficiently. During the POC I did not experience any issues post implementation apart from the RPO breaches.
vSphere Replication Risks
- The 15 minute RPO keeps being violated; it seems a 15 minute RPO is too aggressive for vSphere Replication.
- When the resources of a VM are edited, for example to increase a disk, add a disk or change memory/CPU, replication must be suspended and then re-enabled, and a sync will commence.
- A failover of a protection group will not happen if any of the guests in the group has an issue with protection.
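For anyone wanting to keep an eye on the first risk, this is a simplified Python sketch of how RPO breaches could be counted from a list of completed sync times. The timestamps are made up and `rpo_breaches` is my own hypothetical helper; the way vSphere Replication itself accounts for RPO violations is more involved than comparing gaps between syncs.

```python
from datetime import datetime, timedelta


def rpo_breaches(sync_times, rpo):
    """Return (previous, current) pairs where the gap between completed syncs exceeded the RPO."""
    return [
        (prev, curr)
        for prev, curr in zip(sync_times, sync_times[1:])
        if curr - prev > rpo
    ]


# Hypothetical completed-sync times, in minutes after 09:00.
start = datetime(2020, 1, 1, 9, 0)
offsets_min = [0, 14, 29, 50, 64, 95]
sync_times = [start + timedelta(minutes=m) for m in offsets_min]

breaches = rpo_breaches(sync_times, timedelta(minutes=15))
for prev, curr in breaches:
    print(f"RPO breached: {curr - prev} between {prev:%H:%M} and {curr:%H:%M}")
```

A gap of exactly 15 minutes is treated as meeting the RPO here, so only the 21 and 31 minute gaps in the sample data count as breaches.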
Zerto is agnostic of the hypervisor and storage, so it is agile in the sense that it can be deployed easily and quickly. This type of solution would work well for integrating environments such as acquisitions, where a replication server can be deployed and replicate virtual machines back to the production datacenter seamlessly and quickly, depending on the link. It also allows us to manage migrating in and out of the cloud. The architecture of the product is very clever, allowing IO to be intercepted and replicated with no need for any changed block tracking. It also simplifies the implementation, in the sense that replication happens directly between the replication appliances, cutting out the VMkernel and allowing us to simply route the traffic via the VLAN. The management of the product was simple and user friendly: groups are created for the different sets of servers to protect. As part of the group creation, all of the parameters for failover and for test are configured, and all of the guests can be selected in one workflow. The intuitive dashboard shows current real-time information on bandwidth consumption and RPO achievement. The other features of the product, especially file level recovery, can be leveraged to recover server files for Test and Dev environments, freeing up storage on the backup environment.
Both solutions would work and replicate guests from one site to another. However, based on the results of this POC, the best product to provide a replication solution is Zerto. A number of factors have contributed to this conclusion, which are summarised in the table below.
Thanks for reading, I hope it was useful. There will be other blogs about Zerto, so have a look through the archives.