Using Data Deduplication and Compression with VDO on RHEL 7 and 8

Storage deduplication technology has been on the market for quite some time now. Unfortunately, all of these implementations have been vendor-specific, proprietary software. With VDO, there is now an open source, Linux-native solution available.

Red Hat introduced VDO (Virtual Data Optimizer) in RHEL 7.5, a storage deduplication technology that came with the acquisition of Permabit in 2017. Of course it has been open sourced since then.

In contrast to ZFS, which provides the same functionality at the file system level, VDO is an inline data reduction technology that works at the block device level; it is file system agnostic.

Use cases

There are basically two major use cases: VM Storage and Object Storage Backends.

VM Storage

The main use case is storage for virtual machines, where a lot of data is redundant, e.g. the base operating system of the VMs. This allows deduplicating the data on disk on a large scale: think of 100 VMs whose operating systems take about 5 GByte each being reduced to approx. 5 GByte instead of 500 GByte.

Typically, VM storage can be overcommitted by a factor of 10.

Object- and Block storage backends

As a backend for CEPH and GlusterFS, it is recommended not to overcommit by more than a factor of 3. The reason for the lower overcommitment is that the storage administrator usually does not know what kind of data will be stored on it.

Availability

VDO has been available since RHEL 7.5 and is included in the base subscription. It is not available for Fedora (yet).

The source code is available on GitHub.

At the moment the kernel code is not yet in the upstream mainline kernel; work to get it there is ongoing.

Typical setup

Physical disk -> VDO -> Volume group -> Logical volume -> File system

The block device can be a physical disk (or a partition on it), a multipath device, a LUKS device, or a software RAID device (md or LVM RAID).

Restrictions

You can not use LVM cache, LVM snapshots, or thin provisioned logical volumes on top of VDO. Theoretically you can use LUKS on top of VDO, but it makes no sense because there would be nothing left to deduplicate. Needless to say, VDO on top of a VDO device does not make any sense either. Also be aware that you can not use partitioning or (LVM) RAID on top of VDO devices; all of that should be done in the layer underneath VDO.

When using a SAN, check whether your storage box already does deduplication. In that case, VDO is useless for you.

Installation

It is straightforward:

[root@vdotest ~]# yum -y install vdo kmod-kvdo

Create the VDO volume

In this test case, I attached a 110 GByte disk, created a 100 GByte partition on it, and will overcommit it by a factor of 10.

Warning! As of writing this article, never use a whole physical disk; use a partition instead and leave some spare space on the disk to avoid data loss! (see further below)

[root@vdotest ~]# vdo create --name=vdo1 --device=/dev/vdb1 --vdoLogicalSize=1T
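
A quick check that the volume was created and is running; the status output below is trimmed, and the exact fields may vary between versions:

[root@vdotest ~]# vdo list
vdo1
[root@vdotest ~]# vdo status --name=vdo1 | grep -Ei 'compression|deduplication'
    Compression: enabled
    Deduplication: enabled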

Creating volume group, logical volume and file system on top of the VDO volume

[root@vdotest ~]# pvcreate /dev/mapper/vdo1
[root@vdotest ~]# vgcreate vg_vdo /dev/mapper/vdo1
[root@vdotest ~]# lvcreate -n lv_vdo vg_vdo -L 900G
[root@vdotest ~]# mkfs.xfs -K /dev/vg_vdo/lv_vdo
[root@vdotest ~]# echo "/dev/mapper/vg_vdo-lv_vdo       /mnt    xfs     defaults,x-systemd.requires=vdo.service 0 0" >> /etc/fstab
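
Mount it right away; the x-systemd.requires option in the fstab entry above makes sure the VDO volume is started before systemd attempts the mount on later boots.

[root@vdotest ~]# mount /mnt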

Display the whole stack

[root@vdotest ~]# lsblk /dev/vdb
NAME                MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vdb                 252:16   0  110G  0 disk 
└─vdb1              252:17   0  100G  0 part 
  └─vdo1            253:7    0    1T  0 vdo  
    └─vg_vdo-lv_vdo 253:8    0  900G  0 lvm  /mnt
[root@vdotest ~]#

Populate the disk with data

The ideal test for VDO is to put some real-life VM images on the file system on top of it. In this case I scp'ed three IPA servers and some other instances to that file system. Such systems are all quite similar, so the disk space saved is tremendous. The total size of the VM images is 105G.

Let's have a look:

[root@vdotest ~]# df -h /mnt
Filesystem                 Size  Used Avail Use% Mounted on
/dev/mapper/vg_vdo-lv_vdo  900G  105G  800G  12% /mnt
[root@vdotest ~]# 
[root@vdotest ~]# ll -h /mnt
total 101G
-rw-------. 1 root root 21G Dec 17 09:41 ipa1.lab.delouw.ch.qcow2
-rw-------. 1 root root 21G Dec 17 09:47 ipa1.ldelouw.ch
-rw-------. 1 root root 21G Dec 17 09:54 ipa2.ldelouw.ch
-rw-r--r--. 1 root root 21G Dec 17 09:58 ipaclient-rhel6.home.delouw.ch
-rw-------. 1 root root 21G Dec 17 10:03 ipatest.delouw.ch.qcow2
[root@vdotest ~]#

Let's use the vdostats utility to display the storage actually used on disk:

[root@vdotest ~]# vdostats --si
Device                    Size      Used Available Use% Space saving%
/dev/mapper/vdo1        107.4G     15.2G     92.2G  14%           89%
[root@vdotest ~]#

Performance Tuning

There are a lot of parameters that can be changed. Unfortunately, the documentation available at the moment is rudimentary, so tuning is more guesswork than facts.

  • Number of worker threads of different kind
  • Enable or Disable compression

On machines with a lot of CPUs, using more threads than the defaults can dramatically boost performance. man 8 vdo gives a glimpse of the different thread-related parameters.
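
As a sketch, using the thread parameters from man 8 vdo as shipped with RHEL 7.5 (check your version before applying; the volume must be restarted for the change to take effect, so unmount the file system and deactivate the volume group on top first):

[root@vdotest ~]# vdo modify --name=vdo1 --vdoCpuThreads=4 --vdoLogicalThreads=4 --vdoPhysicalThreads=4
[root@vdotest ~]# vdo stop --name=vdo1 && vdo start --name=vdo1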

Compression is a quite expensive operation. On top of that, depending on the kind of data you are storing, it may not make much sense to use compression (well, deduplication is a kind of compression as well).
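
Compression can be toggled at runtime with the corresponding vdo subcommands (names as listed in man 8 vdo):

[root@vdotest ~]# vdo disableCompression --name=vdo1
[root@vdotest ~]# vdo enableCompression --name=vdo1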

Pitfalls

Be aware! Every storage deduplication solution comes with a big pitfall: the logical volume on top of VDO shows free disk space while the actual space on the physical disk can be (almost) exhausted. You need to carefully monitor the actual disk usage.

The fill grade can change rapidly if the data to be stored contains a lot of non-deduplicatable and/or non-compressible data. A good example is virtual machine images containing a LUKS encrypted disk; in such a case, use LUKS on the storage, not on the VM level.

Even if you just update one virtual machine, the delta to the other machine images will grow and less physical space will be available.

VDO comes with a few Nagios plugins which are very useful for alerting administrators in case the available physical disk space is filling up. They are located in /usr/share/doc/vdo/examples/nagios.
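
If you do not run Nagios, a small cron job can act as a poor man's monitor. A minimal sketch, assuming the two-line vdostats output format shown in this article and an arbitrarily chosen threshold of 85%:

#!/bin/bash
# Alert when the physical usage of the VDO volume crosses the threshold
THRESHOLD=85
USED=$(vdostats --si /dev/mapper/vdo1 | awk 'NR==2 {sub(/%/,"",$5); print $5}')
if [ "$USED" -ge "$THRESHOLD" ]; then
        echo "VDO physical usage is at ${USED}%" | mail -s "VDO space alert" monitoring@example.com
fi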

According to df -h, there are still 800 GByte available on my test system. What happens if I store my 700 GByte Satellite 6 image? The data is mostly RPMs, which are already compressed quite well. Let's see…

After a transfer of approx. 155 GByte, the physical disk was full and the file system became inaccessible. I hit the worst case that can happen: complete and unrecoverable data loss.

The df command still shows some 241 GByte free.

[root@vdotest ~]# df -h |grep mnt
/dev/mapper/vg_vdo-lv_vdo                900G  241G  660G  27% /mnt
[root@vdotest ~]#

The vdostats command tells a different story, as expected.

[root@vdotest ~]# vdostats --si
Device                    Size      Used Available Use% Space saving%
/dev/mapper/vdo1        107.4G    107.4G      0.0B 100%           59%
[root@vdotest ~]# 

When attempting to access the data, there will be an I/O error.

[root@vdotest ~]# ll -h /mnt
ls: cannot access /mnt: Input/output error
[root@vdotest ~]# 

That's bad. I mean really bad. The device is not accessible anymore.

xfs_repair does not work. Do not attempt to use the -L option! Your file system will be gone.

Recovering from a full physical disk

Let's grow the partition instead. First, unmount the file system:

[root@vdotest ~]# umount /mnt

Delete and recreate the partition using fdisk:

[root@vdotest ~]# fdisk /dev/vdb
Welcome to fdisk (util-linux 2.23.2).

Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.


Command (m for help): d
Selected partition 1
Partition 1 is deleted

Command (m for help): n
Partition type:
   p   primary (0 primary, 0 extended, 4 free)
   e   extended
Select (default p): 
Using default response p
Partition number (1-4, default 1): 
First sector (2048-41943039, default 2048): 
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-41943039, default 41943039): 
Using default value 41943039
Partition 1 of type Linux and of size 20 GiB is set

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.

WARNING: Re-reading the partition table failed with error 16: Device or resource busy.
The kernel still uses the old table. The new table will be used at
the next reboot or after you run partprobe(8) or kpartx(8)
Syncing disks.
[root@vdotest ~]# partprobe
[root@vdotest ~]#
[root@vdotest ~]# vdo growPhysical -n vdo1

Run a file system check.
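
For the XFS file system used in this example (and only now, after growing the physical space; remember the warning about -L above):

[root@vdotest ~]# xfs_repair /dev/vg_vdo/lv_vdo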

Now you can mount the file system again, and your data is accessible once more.

Documentation

Red Hat maintains nice documentation about storage administration; VDO is covered in its own chapter: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/storage_administration_guide/#vdo

Conclusion

The technology is very interesting and will kick some ass. Storage deduplication will become more and more important, and with VDO there is now a Linux-native solution for it.

At the moment it is quite dangerous to use VDO in production. Filling up the physical disk without spare space is an unrecoverable error: a complete data loss. That means: always create the VDO device on top of a partition that does not use the whole disk, or on another device that can grow in size, to prevent data loss.

If you plan to use VDO in production, make sure you have proper monitoring in place that alerts well ahead of time so you are able to take corrective action.

Nevertheless: it's cool stuff, and I'm sure the current situation will be fixed soon.

Leveraging Network-Bound Disk Encryption at Enterprise Scale

Tang and Clevis

Network-Bound Disk Encryption (NBDE) adds scaling to LUKS by automated disk unlocking on system startup.

Why should I encrypt disks? If you don't want to see your corporate and private data leaked, you should do so as an additional security measure.

Use cases

There are basically two use cases for disk encryption. The first one is to prevent data leaks when a device gets stolen or lost (mobile computers, unsecured server rooms, etc.). Theft of devices is usually not a threat for enterprise-grade data centers with physical security.

Here comes the second use case for these enterprise-grade data centers: at some point in time, disks will get disposed of, either because of a defect or because they are technologically outdated. That means a data leak is possible at the end of a disk's life cycle. A defective disk can not be wiped at all, and for someone with deep pockets there is still a chance to at least partially access the data. Wiping a six TiB disk takes many hours just to overwrite it once with zeros, let alone with random data (at roughly 200 MByte/s of sequential write speed, a single pass already takes about nine hours). An encrypted disk without a passphrase set can simply be disposed of without considering whether it needs to be wiped or physically destroyed.

Note: Disk encryption does not protect you from data theft by a person who has access to the data, and it does not help against misbehaving software either.

As you can imagine, it is a good idea to encrypt your storage. The standard for disk encryption on Linux is LUKS (Linux Unified Key Setup).

Adding Tang and Clevis for scaling

Unfortunately, LUKS does not scale at all, because the passphrase must be entered manually on system startup, a no-go for data center operations. Tang and Clevis add the scaling factor to the game.

Tang is the server component; Clevis and luksmeta are the client components. The secret itself is stored on the client: the client asks the server for the data needed to decrypt the key stored in the LUKS metadata. For more information on the crypto algorithms used, please see the slide deck “Tang and Clevis” by Fraser Tweedale.

Availability and support

Tang and Clevis were added to RHEL 7.4 and are supported. The packages tang-nagios and clevis-udisks2 are in technology preview and are not supported. The packages are included in the base subscription.

It is available for Fedora as well.

Set up the Tang servers

Setting up a Tang server is straightforward. For redundancy, please set up at least two Tang servers. A maximum of seven Tang servers is supported by the client, which corresponds to the number of LUKS key slots (eight) minus the one used for the initial passphrase.

[root@tang1 ~]# yum -y install tang
[root@tang1 ~]# systemctl enable --now tangd.socket
[root@tang1 ~]# jose jwk gen -i '{"alg":"ES512"}' -o /var/db/tang/new_sig.jwk
[root@tang1 ~]# jose jwk gen -i '{"alg":"ECMR"}' -o /var/db/tang/new_exc.jwk

Display the thumbprint; it will be added to the Kickstart later on.

[root@tang1 ~]# jose jwk thp -i /var/db/tang/new_sig.jwk

Automated client setup during Kickstart

Be aware that you can run into problems when re-provisioning a system that contains old LUKS keys; you probably want to wipe them. In the following setup, all the key slots are located on the second partition.

# Wipe LUKS keys on the second partition of disk vda
%pre
cryptsetup isLuks /dev/vda2  && dd if=/dev/zero of=/dev/vda2 bs=512 count=2097152
%end

part /boot      --fstype ext2 --size=512 --ondisk=vda
part pv.0       --size=1 --grow --ondisk=vda --encrypted --passphrase=dummy-master-pass

volgroup vg_luksclient pv.0

logvol /        --name=lv_root    --vgname=vg_luksclient --size=4096
logvol /home    --name=lv_home    --vgname=vg_luksclient --size=512 --fsoption=nosuid,nodev
logvol /tmp     --name=lv_tmp    --vgname=vg_luksclient --size=512 --fsoption=nosuid,nodev,noexec
logvol /var     --name=lv_var    --vgname=vg_luksclient --size=2048 --fsoption=nosuid,nodev
logvol /var/log --name=lv_var_log --vgname=vg_luksclient --size=2048 --fsoption=nosuid,nodev
logvol swap     --fstype swap --name=lv_swap    --vgname=vg_luksclient --size=4096

Be aware that the transfer of the Kickstart file is done in clear text, which means that this dummy-master-pass is exposed. It should be removed automatically. You can add a master key in a secure way after the installation, with Ansible, Puppet, or simply manually via SSH.
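
Done manually, replacing the exposed dummy passphrase could look like the following sketch: cryptsetup authorizes the operation with the old key from a temporary key file and then prompts for the new passphrase.

[root@luksclient ~]# echo -n "dummy-master-pass" > /tmp/oldkey
[root@luksclient ~]# cryptsetup luksAddKey --key-file /tmp/oldkey /dev/vda2
Enter new passphrase for key slot:
[root@luksclient ~]# shred -u /tmp/oldkey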

Ensure you have the clevis-dracut package installed so that the initial ramdisk gets created in the right way.

%packages
clevis-dracut
%end

In the %post section of the Kickstart file, add the following to enroll your system with the Tang servers:

%post
clevis bind luks -f -k- -d /dev/vda2 tang '{"url":"http://tang1.example.com","thp":"vkaGTzcBNEeF_X5KX-w9754Gl80"}' <<< "dummy-master-pass"
clevis bind luks -f -k- -d /dev/vda2 tang '{"url":"http://tang2.example.com","thp":"x_KcDG92bVP3SUL9KOzmzps4sZg"}' <<< "dummy-master-pass"
%end

In case you want to remove the master password, put the following into the %post section of your Kickstart file:

%post
cryptsetup luksRemoveKey /dev/vda2 - <<<"dummy-master-pass"
%end

Usage of a passphrase

There are pros and cons to doing so. On one hand, if all Tang servers are unavailable and no master password is set, there is not even a slight chance to access the data. On the other hand, a master password can be leaked, and it should be changed from time to time, which needs to be automated (e.g. with Ansible) to scale.

I personally tend to use a master password. Choose wisely, depending on your specific use case, whether or not to set one.
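
Rotating the master password boils down to one cryptsetup call per client (it prompts for the passphrase to be changed and for the new one), which is easy to wrap in an Ansible task:

[root@luksclient ~]# cryptsetup luksChangeKey /dev/vda2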

Good to know

Be aware that the password prompt on system startup will always show up. It disappears automatically after a few seconds if a Tang server has been reached.

Documentation

The following documents help you get an idea of the Tang/Clevis setup:

A nice presentation from a conference is available here: https://www.usenix.org/conference/lisa16/conference-program/presentation/atkisson

Another more technical presentation is available here: http://redhat.slides.com/npmccallum/sad#/

Important commands

There are a few LUKS and clevis related commands you should know about.

cryptsetup

cryptsetup is used to handle the LUKS key slots, i.e. adding and removing passphrases. More information is available in man 8 cryptsetup.

luksmeta

luksmeta gives you access to the LUKS metadata, e.g. showing which slots are in use:

[root@luksclient ~]# luksmeta show -d /dev/vda2 
0   active empty
1   active cb6e8904-81ff-40da-a84a-07ab9ab5715e
2   active cb6e8904-81ff-40da-a84a-07ab9ab5715e
3   active cb6e8904-81ff-40da-a84a-07ab9ab5715e
4 inactive empty
5 inactive empty
6 inactive empty
7 inactive empty
[root@luksclient ~]#

The following command reads the metadata and puts the encrypted content into the file meta:

luksmeta load -d /dev/vda2 -s 1  > meta

It looks like this:

eyJhbGciOiJFQ0RILUVTIiwiY2xldmlzIjp7InBpbiI6InRhbmciLCJ0YW5nIjp7ImFkdiI6eyJrZXlzIjpbeyJhbGciOiJFUzUxMiIsImNydiI6IlAtNTIxIiwia2V5X29wcyI6WyJ2ZXJpZnkiXSwia3R5IjoiRUMiLCJ4IjoiQVdCeFZSYk9MOXBYNjhRU0lqSEZyNzVuNUVXdDZGblkySmNaNVgxX0s4MldaNW9kMUNQTUJwQ0dsS1ZFZ29LOFQwMERPazFsMHJRQ2kyOEg4SDBsVXlfaCIsInkiOiJBTzNLdmsyc2pqYVpSM3RrbW5KcVQyWGYtd1lnbXZSa0JqNUpmNFgzWmtHTDRHbTYtbE5qemhzVEdraEZLRmZUdnJLUElDTHBSQndCTnNXc0JuZUlVTEViIn0seyJhbGciOiJFQ01SIiwiY3J2IjoiUC01MjEiLCJrZXlfb3BzIjpbImRlcml2ZUtleSJdLCJrdHkiOiJFQyIsIngiOiJBTm05WmQwUDFyT1F6MXhhQVFNTzJxRjRua3ZHMVpKS2VHNkFaWjdPTEo1ejhKS000N0otMUhZWnkyZk0zT29ZQVdiUndQdnJ6aUt4MFJmNWh0QlkzNXBxIiwieSI6IkFTdVZYR3JRQ0c4R3dKTENXbWpVbC1jN0llUUh4TC01cFRGYTJaOU1ESnU4Ym9JZFo3WlNiZHBHZUFWMnhMTzlCTnlqbE5zSzB2ZWJrR3ZDcmU5bDl0aFYifSx7ImFsZyI6IkVTNTEyIiwiY3J2IjoiUC01MjEiLCJrZXlfb3BzIjpbInZlcmlmeSJdLCJrdHkiOiJFQyIsIngiOiJBT01Yc2JqcGd3MUVwMEdmSHd2NFRGTzFhWlNxZlY2NWlURVpWWDc4Y3M1SkI2dlBHaUZwd2RiZnpVWlpzR2FCZVludXdzTHJ0UTZTYm0zWHVTdGNHTFlFIiwieSI6IkFSZUZMckNHZnB6S2tzaTZvVXdLdEZjOW9IbHVHdDJjd3AwNmR1M3dEUGgta0t5N3RfTmZEU1JOSXRuWkZIbWs3eVlwSnkxYlpiRDRUTklTcXFIZDlDbDIifV19LCJ1cmwiOiJodHRwOi8vdGFuZzEuaG9tZS5kZWxvdXcuY2gifX0sImVuYyI6IkEyNTZHQ00iLCJlcGsiOnsiY3J2IjoiUC01MjEiLCJrdHkiOiJFQyIsIngiOiJBQnlMWjZmcWJKVVdzalZVc1ZjN0hwWlhLQ1BIZjJWenhyTExkODdvajBERnhGeTJRUTJHSXNEbFJ6OTg0cmtkNDJVQ3pDVy1VcGE4bG9nTl9BT0hsU0syIiwieSI6IkFBcHJaLUxFMUk3NUxWMXZtTHhkYUl0TmlETnpjUUVpLXJsR1FwVjFnT2IwWU5rbDFyWVgxdU45OE9WcHdiWUowTEpYYnYtRGZnSjU2RjBPMkNFczJOck4ifSwia2lkIjoicTAzQXd4VG5sU3lpQjRnelBTYTBfcXhsVzU4In0..Ws5k2fgQ26yN-mMv.1NwlYoyaUmF5X0jqGDcKO3HWn02StXotqnjZKaZtSUXioyW0-rc8HxH6HkkJTMQJk_EXr8ZXB4hmTXfUqAtRqgpEW4SdzJIw_AsGbJm5h_8lQLPIF4o.fbbNxK51MC14hX46Dgkj6Q

You can decrypt it:

[root@luksclient ~]# clevis decrypt tang < meta 
OTQy6NGfqTjppwIrrM4cc15zr-sxy5PPmKExHul1m-pcMjEHjGdoN5uqD9vcEiuMM56VapPV_LedXYEkktYO-g[root@luksclient ~]#

OTQy6NGfqTjppwIrrM4cc15zr-sxy5PPmKExHul1m-pcMjEHjGdoN5uqD9vcEiuMM56VapPV_LedXYEkktYO-g is the cleartext passphrase returned. It can actually be typed in at the console; I recommend a serial console where you can copy-paste 😉

If you run the same command again while both Tang servers are down, you will get an error:

[root@luksclient ~]# clevis decrypt tang < meta
Error communicating with the server!
[root@luksclient ~]#

As you can see, you don't need to provide a Tang server URL.

lsblk

lsblk is a nice little tool which shows the available storage devices in a tree. You can see the different layers of the storage subsystem.

[root@luksclient ~]# lsblk 
NAME                                          MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
vda                                           252:0    0   20G  0 disk  
├─vda1                                        252:1    0  512M  0 part  /boot
└─vda2                                        252:2    0 19.5G  0 part  
  └─luks-f0a70f08-b745-429f-ba8e-ec07e8953c3d 253:0    0 19.5G  0 crypt 
    ├─vg_luksclient-lv_root                   253:1    0    4G  0 lvm   /
    ├─vg_luksclient-lv_swap                   253:2    0    4G  0 lvm   [SWAP]
    ├─vg_luksclient-lv_var_log                253:3    0    2G  0 lvm   /var/log
    ├─vg_luksclient-lv_var                    253:4    0    2G  0 lvm   /var
    ├─vg_luksclient-lv_tmp                    253:5    0  512M  0 lvm   /tmp
    └─vg_luksclient-lv_home                   253:6    0  512M  0 lvm   /home
[root@luksclient ~]# 

json_reformat

If you want to play with JSON, install the package yajl.

With json_reformat you can minimize JSON, and you actually need to do so, as clevis encrypt sss does not allow spaces and fails otherwise.

Let's reformat this:

[root@luksclient ~]# echo '{"t": 1,"pins": {"tang": [{"url": "http://tang1.example.com"}, {"url": "http://tang2.example.com"}]}}'|json_reformat -m && echo ""
{"t":1,"pins":{"tang":[{"url":"http://tang1.example.com"},{"url":"http://tang2.example.com"}]}}
[root@localhost ~]# 

How to figure out to which servers the client is enrolled

I was curious how Clevis figures out which Tang server to connect to. There is nothing written to the initrd, so it must be stored somewhere in the LUKS metadata. It took me some time to figure out how it works.

Just decode the meta data to JSON:

luksmeta load -d /dev/vda2 -s 1 | jose b64 dec -i- | json_reformat

Unfortunately, the JSON seems to be invalid; at least json_reformat exits with parse error: premature EOF. However, you will see the URL.

Test scenarios

I made a few tests to figure out how Tang and Clevis behave when something goes south.

Tang server(s) not available during system installation

If only one Tang server is available, the installation works, but the system gets enrolled with only one Tang server. It must be enrolled with the second Tang server manually after that server comes up again.

If both servers are down during installation, the installation finishes successfully and the temporary passphrase stays active, as LUKS denies removing the last available passphrase. Of course, no Clevis binding is written to the LUKS metadata. You can enroll the system manually after one or both servers come back online. Remember to remove the temporary passphrase afterwards.

Tang Server(s) not available during reboot

If one Tang server is not available, the other one is used; no impact.

If both servers are down, Plymouth asks for the LUKS passphrase. If you removed the passphrase, you will not be able to boot the server. After starting one or both Tang servers, the boot continues.

Drawbacks

Tang and Clevis are both very young projects and not yet mature. I've figured out the following drawbacks:

Missing Registry

At the moment there is no way to report which clients are enrolled with which Tang server. This makes it hard to check from a central point whether a server is really enrolled with two (or more) Tang servers to ensure smooth operation in case of a failed Tang server.

This is particularly true if one (or more) Tang servers are down during the install time of the client system. As a workaround, set up a monitoring script that checks if there are two active slots, e.g.:

if [ "$(luksmeta show -d /dev/vda2 | grep ' active' | grep -cv empty)" -ne 2 ]; then
        echo "Something is wrong with the LUKS metadata, please check" | mail -s "LUKS metadata failure" monitoring@example.com
fi

Logging

Logging of Tang requests is very basic at the moment; some improvement is needed here as well. Again, the documentation, e.g. for the return codes, is lacking.

Scalability

When using more than one Tang server, always the one defined in the first slot is accessed; there is no round-robin or similar load-balancing method. This means that the sequence of Tang servers must be shuffled on the client, which involves some logic in the Kickstart file.
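
Such shuffling does not need much logic. A sketch for the %post section, reusing the example URLs and thumbprints from above:

%post
# Enroll with the Tang servers in random order
shuf -e \
  "http://tang1.example.com vkaGTzcBNEeF_X5KX-w9754Gl80" \
  "http://tang2.example.com x_KcDG92bVP3SUL9KOzmzps4sZg" | \
while read -r url thp; do
  clevis bind luks -f -k- -d /dev/vda2 tang "{\"url\":\"$url\",\"thp\":\"$thp\"}" <<< "dummy-master-pass"
done
%end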

One Tang server should be able to handle more than 2k requests per second, so this problem only kicks in in very large environments where more than 2000 servers are booting (or being installed) at the same time.

Maturity

It's a brand new project using completely new ideas and methods. At the moment there is not much operational experience, an issue that will be solved over time.

Documentation

There is almost no documentation available that goes beyond a few lines showing how to set up the server and client. What's missing is how to troubleshoot the environment. Another missing part is how to handle key rotation; it is unclear to me if and what has to be done on the client.

Easy-to-read documentation is important, in particular for Tang and Clevis, which use some new-style, die-hard cryptographic mathematics.

Conclusion

Both client and server have a very small footprint and perform well. The idea of Tang and Clevis is brilliant, and a first incarnation is ready to use. Due to the drawbacks mentioned above, I think it is not yet ready for production, and it will take a while until it is.

Due to the nature of the project, stability and reliability are key points; that is why people should test it and provide feedback.

I would like to thank the involved engineers, cool stuff.

Have fun:-)

Building a virtual CEPH storage cluster

This post will guide you through the procedure to build up a testbed on RHEL 7 for a complete CEPH cluster. At the end you will have an admin server, one monitoring node and three storage nodes. CEPH is an object and block storage system, mostly used for virtual machine images and for bulk BLOBs such as video and other media. It is not intended to be used as a file storage (yet).

Machine set up
I've set up five virtual machines: one admin server, one monitoring node, and three OSD servers.

  • ceph-admin.example.com
  • ceph-mon01.example.com
  • ceph-osd01.example.com
  • ceph-osd02.example.com
  • ceph-osd03.example.com

Each of them has a 10GB disk for the OS; the OSD servers have three additional 10GB disks each for the storage, 90GB in total. Each virtual machine got 1GB of RAM assigned, which is barely good enough for some first tests.

Configure your network
It is recommended to have two separate networks: one public, and one for the cluster interconnect (heartbeat, replication, etc.). However, for this testbed only one network is used.

While it is recommended practice to configure your servers with the fully qualified domain name (FQDN), you must also configure the short hostname for CEPH.

Check if this works as needed:

[root@ceph-admin ~]# hostname
ceph-admin.example.com
[root@ceph-admin ~]# hostname -s
ceph-admin
[root@ceph-admin ~]# 

To be able to resolve the short hostname, edit your /etc/resolv.conf and enter a domain search path:

[root@ceph-admin ~]# cat /etc/resolv.conf 
search example.com
nameserver 192.168.100.148
[root@ceph-admin ~]# 

Note: My network is fully IPv6 enabled, and I first tried to set up CEPH with IPv6 only. I was unable to get it working properly with IPv6! Disable IPv6 before you start. Disclaimer: maybe I made some mistakes.

You also need to keep time in sync. The usage of NTP or chrony is best practice anyway.
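
On RHEL 7, chrony is the usual choice; install and enable it on all nodes:

[root@ceph-admin ~]# yum -y install chrony
[root@ceph-admin ~]# systemctl enable chronyd.service
[root@ceph-admin ~]# systemctl start chronyd.service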

Register and subscribe the machines and attach the repositories needed

This procedure needs to be repeated on every node, including the admin server and the monitoring node(s).

[root@ceph-admin ~]# subscription-manager register
[root@ceph-admin ~]# subscription-manager list --available > pools

Search the pools file for the Ceph subscription and attach the pool in question:

[root@ceph-admin ~]# subscription-manager attach --pool=<the-pool-id>

Disable all repositories and enable only the needed ones:

[root@ceph-admin ~]# subscription-manager repos --disable="*"
[root@ceph-admin ~]# subscription-manager repos --enable=rhel-7-server-rpms \
--enable=rhel-7-server-rhceph-1.2-calamari-rpms \
--enable=rhel-7-server-rhceph-1.2-installer-rpms \
--enable=rhel-7-server-rhceph-1.2-mon-rpms \
--enable=rhel-7-server-rhceph-1.2-osd-rpms

Set up a CEPH user
Of course, you should set a secure password instead of this example 😉

[root@ceph-admin ~]# useradd -d /home/ceph -m -p $(openssl passwd -1 <super-secret-password>) ceph

Creating the sudoers rule for the ceph user

[root@ceph-admin ~]# echo "ceph ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ceph
[root@ceph-admin ~]# chmod 0440 /etc/sudoers.d/ceph

Set up passwordless SSH logins. First create an SSH key for root. Do not set a passphrase!

[root@ceph-admin ~]# ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa

And add the key to ~/.ssh/authorized_keys of the ceph user on the other nodes.

[root@ceph-admin ~]# ssh-copy-id ceph@ceph-mon01
[root@ceph-admin ~]# ssh-copy-id ceph@ceph-osd01
[root@ceph-admin ~]# ssh-copy-id ceph@ceph-osd02
[root@ceph-admin ~]# ssh-copy-id ceph@ceph-osd03

Configure your SSH client configuration.

To make your life easier when you run ceph-deploy (so you do not have to provide --username ceph), set up the SSH client config file. This can be done for the user root in ~/.ssh/config or in /etc/ssh/ssh_config.

Host ceph-mon01
     Hostname ceph-mon01
     User ceph

Host ceph-osd01
     Hostname ceph-osd01
     User ceph

Host ceph-osd02
     Hostname ceph-osd02
     User ceph

Host ceph-osd03
     Hostname ceph-osd03
     User ceph

Set up the admin server

Go to https://access.redhat.com and download the ISO image. Copy the image to your admin server and mount it loop-back.

[root@ceph-admin ~]# mount rhceph-1.2.3-rhel-7-x86_64.iso /mnt -o loop

Copy the required product certificates to /etc/pki/product:

[root@ceph-admin ~]# cp /mnt/RHCeph-Calamari-1.2-x86_64-c1e8ca3b6c57-285.pem /etc/pki/product/285.pem
[root@ceph-admin ~]# cp /mnt/RHCeph-Installer-1.2-x86_64-8ad6befe003d-281.pem /etc/pki/product/281.pem
[root@ceph-admin ~]# cp /mnt/RHCeph-MON-1.2-x86_64-d8afd76a547b-286.pem /etc/pki/product/286.pem
[root@ceph-admin ~]# cp /mnt/RHCeph-OSD-1.2-x86_64-25019bf09fe9-288.pem /etc/pki/product/288.pem

Install the setup files

[root@ceph-admin ~]# yum install /mnt/ice_setup-*.rpm

Set up a config directory:

[root@ceph-admin ~]# mkdir ~/ceph-config
[root@ceph-admin ~]# cd ~/ceph-config

and run the installer

[root@ceph-admin ~]# ice_setup -d /mnt

To initialize, run calamari-ctl:

[root@ceph-admin ceph-config]# calamari-ctl initialize
[INFO] Loading configuration..
[INFO] Starting/enabling salt...
[INFO] Starting/enabling postgres...
[INFO] Initializing database...
[INFO] Initializing web interface...
[INFO] You will now be prompted for login details for the administrative user account.  This is the account you will use to log into the web interface once setup is complete.
Username (leave blank to use 'root'): 
Email address: luc@example.com
Password: 
Password (again): 
Superuser created successfully.
[INFO] Starting/enabling services...
[INFO] Restarting services...
[INFO] Complete.
[root@ceph-admin ceph-config]#

Create the cluster

Ensure you are running the following command in the config directory! In this example it is ~/ceph-config.

[root@ceph-admin ceph-config]# ceph-deploy new ceph-mon01

Edit some settings in ceph.conf

osd_journal_size = 1000
osd_pool_default_size = 3
osd_pool_default_min_size = 2
osd_pool_default_pg_num = 128
osd_pool_default_pgp_num = 128

In production, the first value should be bigger, at least 10G. The number of placement groups depends on the number of your cluster members, the OSD servers. For small clusters with up to 5 nodes, 128 PGs are fine.

Install the CEPH software on the nodes.

[root@ceph-admin ceph-config]# ceph-deploy install ceph-admin ceph-mon01 ceph-osd01 ceph-osd02 ceph-osd03

Add the initial monitor server:

[root@ceph-admin ceph-config]# ceph-deploy mon create-initial

Connect all nodes to Calamari:

[root@ceph-admin ceph-config]# ceph-deploy calamari connect ceph-mon01 ceph-osd01 ceph-osd02 ceph-osd03 ceph-admin

Make your admin server an admin node:

[root@ceph-admin ceph-config]# yum -y install ceph ceph-common
[root@ceph-admin ceph-config]# ceph-deploy admin ceph-mon01 ceph-osd01 ceph-osd02 ceph-osd03 ceph-admin

Purge and add your data disks:

[root@ceph-admin ceph-config]# ceph-deploy disk zap ceph-osd01:vdb
[root@ceph-admin ceph-config]# ceph-deploy disk zap ceph-osd01:vdc
[root@ceph-admin ceph-config]# ceph-deploy disk zap ceph-osd01:vdd
[root@ceph-admin ceph-config]# ceph-deploy disk zap ceph-osd02:vdb
[root@ceph-admin ceph-config]# ceph-deploy disk zap ceph-osd02:vdc
[root@ceph-admin ceph-config]# ceph-deploy disk zap ceph-osd02:vdd
[root@ceph-admin ceph-config]# ceph-deploy disk zap ceph-osd03:vdb
[root@ceph-admin ceph-config]# ceph-deploy disk zap ceph-osd03:vdc
[root@ceph-admin ceph-config]# ceph-deploy disk zap ceph-osd03:vdd

[root@ceph-admin ceph-config]# ceph-deploy osd create ceph-osd01:vdb
[root@ceph-admin ceph-config]# ceph-deploy osd create ceph-osd01:vdc
[root@ceph-admin ceph-config]# ceph-deploy osd create ceph-osd01:vdd
[root@ceph-admin ceph-config]# ceph-deploy osd create ceph-osd02:vdb
[root@ceph-admin ceph-config]# ceph-deploy osd create ceph-osd02:vdc
[root@ceph-admin ceph-config]# ceph-deploy osd create ceph-osd02:vdd
[root@ceph-admin ceph-config]# ceph-deploy osd create ceph-osd03:vdb
[root@ceph-admin ceph-config]# ceph-deploy osd create ceph-osd03:vdc
[root@ceph-admin ceph-config]# ceph-deploy osd create ceph-osd03:vdd

You can now check the health of your cluster:

[root@ceph-admin ceph-config]# ceph health
HEALTH_OK
[root@ceph-admin ceph-config]# 

Or with some more information:

[root@ceph-admin ceph-config]# ceph status
    cluster 117bf1bc-04fd-4ae1-8360-8982dd38d6f2
     health HEALTH_OK
     monmap e1: 1 mons at {ceph-mon01=192.168.100.150:6789/0}, election epoch 2, quorum 0 ceph-mon01
     osdmap e42: 9 osds: 9 up, 9 in
      pgmap v73: 192 pgs, 3 pools, 0 bytes data, 0 objects
            318 MB used, 82742 MB / 83060 MB avail
                 192 active+clean
[root@ceph-admin ceph-config]# 

What's next?
Storage is worthless if not used. A follow-up post will guide you through how to use CEPH as storage for libvirt.

Creating and managing iSCSI targets

If you want to create and manage iSCSI targets on Fedora or RHEL, you will stumble upon tgtd and tgtadm. These tools are easy to use but have some obstacles to take care of. This is a quick guide on how to use tgtd and tgtadm.

iSCSI terminology
In the iSCSI world, we are not talking about servers and clients; the server is called the iSCSI target, and the clients are the iSCSI initiators.

Install the tool set
It is just one package to install; afterwards, enable and start the service:

target:~# yum install scsi-target-utils
target:~# chkconfig tgtd on
target:~# service tgtd start

Or Systemd style:

target:~# systemctl start tgtd.service
target:~# systemctl enable tgtd.service

Online configuration vs. configuration file
There are basically two ways of configuring iSCSI targets:

  • Online configuration with tgtadm: changes become available instantly, but are not persistent across reboots
  • Configuration files: changes are persistent, but not instantly available

Well, there is the --dump parameter of tgt-admin, but passwords are replaced with “PLEASE_CORRECT_THE_PASSWORD”, which makes the dump completely useless if you are using CHAP authentication.

If you do not use CHAP authentication and use IP-based ACLs instead, the dump can help you: just dump the config to /etc/tgt/conf.d.

Usage of tgtadm

After you have created the storage, such as a logical volume (used in this example), a partition, or even a file, you can add the first target:

target:~# tgtadm --lld iscsi --op new --mode target --tid 1 --targetname iqn.2013-07.com.example.storage.ssd1

Then you can add a LUN (logical unit) to the target:

target:~# tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 --backing-store /dev/vg_storage_ssd/lv_storage_ssd
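
You can review the result at any time; the show operation lists all targets together with their LUNs and ACLs:

target:~# tgtadm --lld iscsi --op show --mode target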

It is always a good idea to restrict access to your iSCSI targets. There are two ways to do so: IP-based and user-based (CHAP authentication) ACLs.

In this example, we first add two addresses and later remove one of them again, just as a demo:

target:~# tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address=192.168.0.106
target:~# tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address=192.168.0.107

Go to both initiators with those IP addresses and check if the target is visible:

iscsiadm --mode discovery --type sendtargets --portal 192.168.0.1
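
If the target shows up in the discovery output, the initiator can log in to it, using the target name created above:

iscsiadm --mode node --targetname iqn.2013-07.com.example.storage.ssd1 --portal 192.168.0.1 --login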

Let's remove the ACL for the IP address 192.168.0.107:

target:~# tgtadm --lld iscsi --mode target --op unbind --tid 1 --initiator-address=192.168.0.107

Check whether the target is still visible on the host with IP address 192.168.0.107; it is not anymore.

If you want to use CHAP authentication, please be aware that tgt-admin --dump does not save passwords, so initiators will not be able to log in after a restart of tgtd.

To add a new user:

target:~# tgtadm --lld iscsi --op new --mode account --user iscsi-user --password secret

And bind the account to the target:

target:~# tgtadm --lld iscsi --op bind --mode account --tid 2 --user iscsi-user

To remove an account from the target:

target:~# tgtadm --lld iscsi --op unbind --mode account --tid 2 --user iscsi-user

As I wrote further above, configurations done with tgtadm are not persistent across a reboot or restart of tgtd. For basic configurations as described above, the dump parameter works fine. As configuration files in /etc/tgt/conf.d/ are automatically included, you can just dump the config into a separate file:

target:~# tgt-admin --dump |grep -v default-driver > /etc/tgt/conf.d/my-targets.conf

The other way round
If you are using a more sophisticated configuration, you probably want to manage your iSCSI configuration the other way round.

You can edit your configuration file(s) in /etc/tgt/conf.d and invoke tgt-admin with the respective parameters to update the configuration instantly.
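
A minimal configuration file matching the target built above could look like this (a sketch; the CHAP credentials and the initiator ACL are the examples used earlier):

# /etc/tgt/conf.d/my-targets.conf
<target iqn.2013-07.com.example.storage.ssd1>
    backing-store /dev/vg_storage_ssd/lv_storage_ssd
    initiator-address 192.168.0.106
    incominguser iscsi-user secret
</target>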

tgt-admin (not to be mistaken for tgtadm) is a Perl script which basically parses /etc/tgt/targets.conf and updates the targets by invoking tgtadm.

To update your Target(s) issue:

tgt-admin --update ALL --force

for all your targets, including active ones (--force), or

tgt-admin --update --tid=1 --force

for updating target ID 1.

SIGKILL is nasty but sometimes needed
tgtd can not be stopped like a usual daemon; you need the sledgehammer and send kill -9 to the process, followed by a service tgtd start command.

How the startup and stop process can be handled in a proper workaround way is demonstrated by systemd: have a look at /usr/lib/systemd/system/tgtd.service, which does not actually stop tgtd but just removes the targets.

Conclusion
tgtadm can be helpful and sometimes harmful. Carefully consider which is the better way for you: creating config files by dumping the running configuration, or editing the configuration files and activating them with tgt-admin.