How to self-backup your data?
May 28, 2021
Even though larger Internet service providers offer data backup options, they may not suit everyone, whether due to high price, inflexible design or insufficient compression. We were not satisfied with the backup our provider offered either. In this article, we'll look at how difficult (or rather, how easy) it was for us to switch to self-backup. Maybe it will inspire you too!
Backup can be used in various ways
At Dactyl Group, we manage a wide range of web and mobile applications, including their databases and e-mail inboxes. The data is spread across several servers, and we need extra storage to back it up: firstly, to be able to go back to older versions of projects, and secondly, to be able to restore data after an unintentional deletion. Backup also plays a crucial role in security, because although our servers are monitored carefully, even a seemingly unsinkable ship can get into trouble. We know that all too well.
As for providers, everyone approaches backup differently. Some offer internal backup solutions on their own disks for a fee, others encourage you to set up regular backups via time/date and storage settings. "Online" storage services, such as Dropbox or Google Drive, are also often used. However, the majority of providers rely on the client to take care of backups internally.
Not to mention, there are many types of backups that can be created. Whether it is an unstructured backup, where a full copy of the data is created every time; an incremental backup, where only the changes since the previous backup are saved; or a differential backup, where everything changed since the last full backup is saved. It always depends on the circumstances and options, so everyone may benefit from a different approach or a combination of them.
Why we decided to self-backup
Our main server provider offers backup by creating one large ZIP archive of all hosting data (websites, databases, e-mails) on a selected day and time. The archive is then sent to an FTP server of our choice. This process works relatively well for websites that take up hundreds of MB or a few GB. However, some of our projects exceed hundreds of gigabytes in size, and each backup produces one huge zipped file.
Storing even a few such copies is a tough nut to crack. It is demanding both on storage space and on data transfer over the Internet, which can take several hours. Moreover, creating the archive requires free space directly on the server: you need at least as much free space as the largest backed-up project occupies.
Data recovery is also problematic. Why? Because copying or extracting such a file is practically impossible, even if we only need a part of the backup.
How Dactyl self-backups its data
Due to the shortcomings mentioned above, and after internal discussion and analysis, we decided to build our own solution around the open-source tool Rsync. It is a very powerful yet easy-to-use program. It has existed for almost a quarter of a century and ships with almost every Unix-like distribution. It is therefore no experimental innovation, but a years-proven helper for moving and synchronizing data over the network.
It is, however, important to know its disadvantages too. The main one, in our opinion, is the need to connect to the target machine over the SSH protocol, which is how Rsync verifies your right to access the data. Unfortunately, with plain shared web hosting, SSH access is usually not available. Nevertheless, if you are our client, we back up your sites for you.
However, what we appreciate the most are the strengths of this tool. The steps of this process could be summarized in the following points:
- After verifying your access, it recursively retrieves the contents of the source folders (sites) and the target folders (backup).
- It then copies a file only if it does not exist in the destination or if its last-modification time differs (so unchanged files are not transferred again).
- Each transferred file is compressed for the transfer and decompressed at its destination (which saves bandwidth).
- It preserves the owner, the group, and the read and write permissions (which simplifies restoring the data to its original location).
Compared to plain file copying, this is an all-round improvement. With each backup, it either creates a new folder with all current files on the destination disk or just updates an already existing older backup.
To save storage space, we use another important trick that Rsync offers in cooperation with the file system: hard links. With this technique, a reference folder on the destination disk is passed to the program. If, when comparing files, Rsync finds that a file already exists unmodified in the reference folder, it only creates a hard link to it in the destination folder. Hard linking is pretty neat because, although it looks like two files, they take up the space of only one. The file system ensures that the link behaves like a copy, but in reality it is just another reference to the same data. Backups then appear as folders with a complete file structure. A hard-linked file is physically removed only after the last link leading to it is deleted, which solves the problem of deleting old backups without worrying about data loss.
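A small, hypothetical shell session shows the effect: two names, one inode, and the data survives until the last name is gone.

```shell
#!/bin/sh
# Demonstrate hard-link behaviour in a throwaway directory.
set -e
dir=$(mktemp -d)
echo "backup payload" > "$dir/original.txt"
ln "$dir/original.txt" "$dir/link.txt"       # hard link, not a copy

# Both names point to the same inode, so the data is stored only once.
stat -c '%n inode=%i links=%h' "$dir/original.txt" "$dir/link.txt"

# Deleting one name does not delete the data; the other still works.
rm "$dir/original.txt"
cat "$dir/link.txt"                          # prints "backup payload"
```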
Therefore, if the data on the backed-up sites or in the e-mail inboxes does not change too frequently, we can conveniently keep, for example, five backups from different points in time, while the whole set takes up only around 120% of the original space.
How to self-backup - a simple guide
Running backups with Rsync is not complicated at all and can be conveniently done at home.
The prerequisites for creating a backup tool are:
- a computer, ideally with a Unix-like operating system (other systems work too)
- internet connection (public IP is not necessary)
- enough storage space (internal/external disk, USB, memory card, etc.)
At Dactyl, the computer's role is played by a previously unused server machine running Debian. It has Rsync and the other needed programs installed by default and is connected to a standard office Internet network. But it doesn't have to be a server: an old unused laptop, a virtualized machine on a workstation or even a minicomputer, such as a Raspberry Pi, is perfectly fine. Using the CRON scheduler, the Rsync command is then executed automatically, e.g. in the following form:
rsync -avz --link-dest=<path_to_reference> <path_to_web> <path_to_storage>
Description of the command:
- -a (archive mode) preserves the owner, group, permissions and modification times, and recurses into directories,
- -v makes Rsync print more information about the progress of the backup,
- -z compresses the files during transfer,
- --link-dest= determines which folder should be used as the reference for hard linking,
- <path_to_reference> is the path to the folder for links (typically the previous backup),
- <path_to_web> is the path to the folder on the remote hosting server,
- <path_to_storage> is the path to the local folder where we want to save the backup.
If we create the first backup without a reference folder for linking, it will be a complete, full copy of the data; each subsequent backup, however, only needs to store the files that have changed since the previous one.
How backups work in practice
If you are renting hosting from us, you now have an idea of how we back up your data. We generally create backups three times a month (which is included in the price of the service), but we provide tailor-made solutions as well, for example in the case of the ČPZP project.
We also keep the backup disks in a RAID array, so the data is protected even if one of the disks fails. With us, you needn't worry about your data or its backups.
The author of this article is Honza Kohoutek. He is an engineer specializing in computer and embedded systems. He has been working at Dactyl as a backend web developer with PHP and the Yii framework for more than 5 years.
If you are interested in our tips, check out our other articles: