At VMware Explore, I got an interesting question from a customer. They are a VCPP partner and use VMware Cloud Director to serve their customers. They asked whether it is possible to create a (nested) VCF environment on top of VCD. I said yes, that should be possible, and the HOL is an excellent example of where this is done, but I had seen little information on exactly how to build this configuration.
At VMware we, as TAMs, have the possibility to create our own lab environments. For this we use a VMware Cloud Director environment dedicated to TAMs. So I thought it would be interesting to see how to go about building a VCF lab on top of VCD.
I will be starting from scratch and will create a VCF 4.5 environment, with all the necessary prerequisites to create and run this, from zero. For this, I created an empty vApp within VCD, called:
And, of course, we also need a little bit of a design. This might (will) change over time, but let’s start small. First, we are building:
So: one DC that will provide network services and Domain Controller functionality and serve as a stepping stone into the environment, one Cloud Builder VM that will help us deploy our VCF environment, and a couple of ESXi servers that will host this VCF environment.
I created dc01 and promoted it to become both Domain Controller and DNS server for the rest of the environment. I can access the DC through the “Public” network, which is (for me) reachable internally over VPN.
As the domain name, I chose “my-sddc.vcd” (I already use my-sddc.local in my local lab 🙂 and “vcd” of course refers to VMware Cloud Director).
After installing the Domain Controller (and configuring it for all the other roles it will get), I deployed the Cloud Builder appliance. There was already a template in the environment that contained Cloud Builder 4.5, so I deployed a new VM from this template. Installing a freshly downloaded Cloud Builder VM would give the same starting point, but the download is rather large, so I took the shortcut of the template.
After starting it, it gives me a nice overview of what I need to do:
So next, it is time to create some prerequisites, which are presented on the next page:
First up, we create some new (nested) ESXi servers. For this I am using the following configuration:
- 4 x CPU
- 64 GB Memory
(In hindsight, this is too small. I would opt for 16 x CPU and 128 GB of memory. I will change it later in the process, during the bring-up, but if you start from scratch, use the larger size if you have the option.)
- 16 GB Storage for OS
- 100 GB Storage PVSCSI connected
- 1000 GB Storage PVSCSI connected
- 100 GB Storage to be used as a cache device -> Bus type: NVMe
- 1000 GB Storage to be used as a capacity device -> Bus type: NVMe
I don’t know exactly what is required from a storage perspective, but this configuration works. It will, however, lead to two hybrid disk groups, one of 100 GB and one of 1000 GB. What I was aiming for (and will eventually configure) is one disk group, with the 100 GB NVMe device as the cache layer and the 1000 GB NVMe device as the capacity layer. But if I start off with only those two devices attached, Cloud Builder fails to create the vSAN datastore.
So I start with the configuration as listed and, after the deployment has completed, move to the configuration with only the NVMe devices (see the last part of this article).
I boot the VM with the .iso attached, to install the correct version of ESXi, related to VCF 4.5 (find out the correct version in the Release Notes), which is 7.0 Update 3g, build 20328353.
After installing this host, I copied it another 3 times, so I have 4 hosts in total:
It is important to reset the configuration on the copies, because otherwise ESXi will reuse the same (virtual) vmk MAC address (which is based on the initial host’s physical MAC address).
So log on to each host and in the DCUI use “Reset Configuration”:
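Once the hosts are back up, it is worth confirming that no two of them still share a vmk0 MAC address. Below is a minimal Python sketch, assuming you have collected the addresses by hand (for example from the DCUI or from the output of `esxcli network ip interface list`). The MAC values are made up for illustration, and esx03/esx04 are assumed names for the remaining copies.

```python
from collections import defaultdict


def find_duplicate_macs(vmk_macs):
    """Return {mac: [hosts]} for every MAC address used by more than one host."""
    hosts_by_mac = defaultdict(list)
    for host, mac in vmk_macs.items():
        # Normalise case and separators so 00-50-56-AA.. equals 00:50:56:aa..
        hosts_by_mac[mac.lower().replace("-", ":")].append(host)
    return {mac: sorted(hosts) for mac, hosts in hosts_by_mac.items() if len(hosts) > 1}


# Hypothetical values: esx02 was copied from esx01 and has not been reset yet.
vmk_macs = {
    "esx01": "00:50:56:aa:bb:01",
    "esx02": "00:50:56:AA:BB:01",
    "esx03": "00:50:56:aa:bb:03",
    "esx04": "00:50:56:aa:bb:04",
}

duplicates = find_duplicate_macs(vmk_macs)
for mac, hosts in duplicates.items():
    print(f"{mac} is shared by {', '.join(hosts)}: reset the configuration on the copies")
```

An empty result means every host has its own vmk0 MAC address.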
Now, it becomes important to configure both the hosts and the environment in a way that VCF can be deployed. That means doing the following activities:
- Configure Networking (allow Guest VLAN tagging)
- Configure VLAN on the ESXi hosts
- Configure Routing
- Configure the hosts with the correct certificate
- Configure DNS (Forward and Reverse)
- Configure NTP
- Set the TSM-SSH service to start and stop with the host
Configure Networking (allow Guest VLAN tagging)
The “Allow Guest VLAN Tagging” setting on the network within the vApp allows all VLANs to be used. Basically, it means that VLAN 4095 is attached to the port group, so I can use all the VLANs I want in the sheet I will be using to deploy VCF:
This also allows me to connect a router to the segment and use it to route between all the VLANs that VCF will be using.
Configure VLAN on ESXi hosts
Since the hosts will be configured using a VLAN (and since, with the previous step, VLANs are allowed throughout the internal network), I configure the hosts to use VLAN 1611:
And also configure that on the Windows DC, for the internal network:
and after that, connectivity is there:
After configuring the VLAN on the Management Interface, I also need to configure it on the VM Network Port group, for all the hosts:
Configure Routing
Because we need internal connectivity and a gateway to connect to, I am going to deploy a routing VM within the internal network. For this environment, I use a pfSense VM, which I deploy and configure to route between the different VLANs I will be using:
The VM will have one network connection to the public network and one connection to “Internal Network”. Due to the VLAN setting I made in the previous step, I can create all necessary VLAN-interfaces on the router, to create my own little (Virtual) DataCenter.
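Before creating the VLAN interfaces on the router, it can help to write the VLAN plan down and sanity-check it. Here is a minimal Python sketch of such a check. VLAN 1611 is the management VLAN used in this article; every other VLAN ID, name and subnet is a placeholder and not my actual design, and using the first usable address as the gateway is just an assumed convention.

```python
import ipaddress
from itertools import combinations

# VLAN 1611 comes from this article; all other entries are placeholders.
vlan_plan = {
    1611: ("management", "10.0.11.0/24"),
    1612: ("vmotion", "10.0.12.0/24"),
    1613: ("vsan", "10.0.13.0/24"),
    1614: ("host-overlay", "10.0.14.0/24"),
}


def check_plan(plan):
    """Return a list of problems: invalid VLAN IDs or overlapping subnets."""
    problems = []
    networks = {}
    for vlan_id, (name, cidr) in plan.items():
        if not 1 <= vlan_id <= 4094:
            problems.append(f"VLAN {vlan_id} ({name}) is outside 1-4094")
        networks[vlan_id] = ipaddress.ip_network(cidr)
    for (a, net_a), (b, net_b) in combinations(networks.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"VLAN {a} and VLAN {b} overlap: {net_a} / {net_b}")
    return problems


def gateway(cidr):
    """First usable address of the subnet (assumed router interface address)."""
    return next(iter(ipaddress.ip_network(cidr).hosts()))


for vlan_id, (name, cidr) in sorted(vlan_plan.items()):
    print(f"VLAN {vlan_id:<5} {name:<13} {cidr:<15} gateway {gateway(cidr)}")
print(check_plan(vlan_plan) or "no problems found")
```

The same table can then be used as a checklist while creating the VLAN interfaces and firewall rules on the router.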
Do check the firewall settings within pfSense: it blocked some traffic between my Cloud Builder network and the Management network (ping did not work), so create the right “allow” rules to make sure traffic flows as needed.
Configure the hosts with the correct certificate
By default, the hosts will contain a self-signed certificate that holds the “wrong” name:
We need to change this, so the correct “Common Name” is reflected in the certificate.
For this, we can run the following command on each host and then reboot the host:
/sbin/generate-certificates
After the reboot, the certificate shows the correct common name:
After DNS and NTP have been set up:
and the TSM-SSH service is started and set to start and stop with the host:
I can download the VCF sheet and start filling it out:
I am going to keep the VLAN IDs but will change the IP addresses to match my own design. I also chose not to validate the thumbprints, but in a production deployment that is of course something you need to do:
Hosts and Networks:
Make sure all names and addresses are neatly registered in DNS, both forward and reverse.
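To keep the forward and reverse entries in sync, you can generate both from one list. The Python sketch below does that for the “my-sddc.vcd” domain. The IP addresses are placeholders and the host list is only illustrative, so replace both with the values from your own sheet. The `resolves_both_ways` helper is an optional check that only works from a machine that uses the lab DNS server.

```python
import ipaddress
import socket

DOMAIN = "my-sddc.vcd"  # domain name used in this article

# Placeholder addresses; take the real ones from your own deployment sheet.
hosts = {
    "esx01": "10.0.11.101",
    "esx02": "10.0.11.102",
    "esx03": "10.0.11.103",
    "esx04": "10.0.11.104",
}


def dns_records(hosts, domain):
    """Return (forward, reverse) lists of (name, type, value) tuples."""
    forward, reverse = [], []
    for name, ip in hosts.items():
        fqdn = f"{name}.{domain}"
        forward.append((fqdn, "A", ip))
        # reverse_pointer gives e.g. 101.11.0.10.in-addr.arpa for 10.0.11.101
        reverse.append((ipaddress.ip_address(ip).reverse_pointer, "PTR", fqdn))
    return forward, reverse


def resolves_both_ways(fqdn, ip):
    """True if fqdn resolves to ip and ip resolves back to fqdn (needs lab DNS)."""
    try:
        forward_ok = socket.gethostbyname(fqdn) == ip
        reverse_ok = socket.gethostbyaddr(ip)[0].lower() == fqdn.lower()
    except OSError:
        return False
    return forward_ok and reverse_ok


forward, reverse = dns_records(hosts, DOMAIN)
for record in forward + reverse:
    print("{:<32} {:<4} {}".format(*record))
```

Every A record should have exactly one matching PTR record, because Cloud Builder validates both directions.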
After this is all done, it is time to put the pedal to the metal, and insert the Sheet into the Cloud Builder Appliance:
and validate the configuration:
And when the validation is completed:
we can start the actual deployment (but first, I created a snapshot of the vApp, just in case):
And then we wait. I started this on a Saturday night (although in the US, where my client is running, it’s only early afternoon 😉), so I’ll see the result tomorrow morning.
It failed a couple of times, for several reasons. But otherwise we wouldn’t learn anything, so let me write down the issues and how I worked around them. Most have to do with the environment being nested and its performance not really being production-grade.
- Network configuration
- vSAN Configuration
- NSX Deployment
- Deployment of SDDC Manager
So, the first time, it failed. It failed because vmk0 of esx02 could not be migrated to the Distributed Switch:
The vmk0 of esx01 had migrated but was no longer reachable from dc01. However, since the deployed vc01 (vCenter Server) was running on esx01, I could still manage the configuration. What I needed to do was both change the dPG to use only vmnic0 (uplink1) and migrate vmk0 and vmnic0 at the same time. After this, the process went through:
I also had to help the vSAN configuration a little by silencing the alerts about the certification of the NVMe and SCSI controllers:
The next issue I ran into was the deployment of NSX. Because it is a nested environment, the NSX appliances take too long to become fully functional. This leads to them being deleted and redeployed again and again (until Cloud Builder decides it has had enough and stops):
The way to address this (at least, that is what I have done) is to wait for the VMs to be powered on and then pause the Cloud Builder VM until the NSX configuration has completed (thanks to Shashank Mohan, from https://communities.vmware.com/t5/VMware-Cloud-Foundation/VCF-4-0-stuck-at-quot-Deploy-and-Configure-NSX-T-Data-Center/m-p/2828328/highlight/true#M649).
After pausing the Cloud Builder, giving NSX the time to do its thing and resuming the Cloud Builder, the process continues:
It does take a long time to deploy and configure the whole NSX part, so be patient. Take a look at the status of the NSX Managers, but don’t force anything while the process is running.
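The pause-and-wait workaround boils down to polling a readiness condition with a generous timeout before resuming Cloud Builder. Here is a minimal Python sketch of that pattern. The `is_ready` callable is hypothetical: I checked the NSX Managers by hand, and the stand-in below simply pretends they become healthy on the third poll.

```python
import time


def wait_until(is_ready, timeout_s, interval_s=30, clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout_s seconds have passed.

    Returns True if the condition was met, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Hypothetical stand-in for a real health check: ready on the third poll.
polls = iter([False, False, True])
ready = wait_until(lambda: next(polls), timeout_s=3600, interval_s=0)
print("resume Cloud Builder" if ready else "still not ready, keep waiting")
```

The clock and sleep functions are parameters only so the loop can be tried out without actually waiting.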
Deployment of SDDC Manager:
I had already noticed that increasing the CPU count of the ESXi hosts might be a good idea, and while deploying the SDDC Manager we ran into contention. VMs were migrating to get resources and that ultimately led to a failed task.
That was a good moment to reconsider the host size. Four CPUs is too small to run these NSX Managers, even when they are “small”. So after this failure I put my hosts in maintenance mode, shut them down and increased the resources (one by one, while leaving the NSX Managers running).
After doing that and “Retrying” the deployment, everything completes:
And after clicking “Finish” I can look at the SDDC Manager:
Now on to some cleaning up.
- Remove the Cloud Builder VM
- Change the Disk Group Layout of the hosts
The Cloud Builder VM can be removed from the VCD environment.
And as described in the initial configuration, the ultimate goal was to move to a single disk group per host, containing only the NVMe devices (and thus creating an all-flash cluster). For this I removed both disk groups from each host and created a new one consisting of the NVMe devices (one host at a time, of course, because everything is still running):
After that, the SCSI disks can be removed from the ESXi host VMs within Cloud Director (on the fly):
And after saving this for all the hosts and rescanning them, the Disk Management page looks like this:
Then, finally, it is time to shut down all hosts and create a template of this VCF on VCD environment (for use by other people or to create a copy, if I want it):
And if you are a TAM reading this: you can find it in the Library “GENERAL-CUSTOM-TEMPLATES-TAM-VDC02”, and the contents are in the description.
Next step: deploying a first Workload Domain (that’s part 2).