Checksums as a way to preserve file integrity
Digital files are vulnerable and can be lost or changed unintentionally, even in the short term, but you can use checksums to detect this loss of information and verify whether your files still match your backups. In this article, you’ll learn:
- What is bit rot and what causes it?
- What is a checksum and what do you use it for?
- How can you create checksums?
Digital files are vulnerable, not just because technology evolves rapidly but also because all digital media are unreliable for long-term storage unless good back-up and control procedures are in place. Without the necessary precautions, digital information can be lost or changed unintentionally even in the short term. This phenomenon, known as 'bit rot', is often caused by wear and tear or by a change in the carrier's chemical composition. That's why it's essential to always have an identical back-up copy. Data can also be lost through errors when copying files, however, for example when making a back-up.
You can use a checksum to trace any errors or loss of information. A checksum is like a unique fingerprint for a file that can be used to verify whether two files are identical. If a file changes in any way, recalculating its checksum produces a different series of characters, so a changed file no longer matches the checksum that was recorded earlier. This tells you when to replace the original file with the back-up, and allows you to verify that the back-up is an identical copy of the original. Anyone who wants to archive digital files sustainably needs to always generate these checksums and then regularly check them.
How can you use checksums?
The principle of using a checksum is very simple: an algorithm performs a calculation on a series of letters or numbers and produces a new, shorter series of characters. Repeating the same calculation later and comparing the result with the previous one shows whether the original series is still unchanged.
This technique is used for data communication and storage. An algorithm is applied to a series of bits, the collections of zeroes and ones that make up all digital files. If one of the bits in this series changes, the algorithm returns a different checksum, which clearly shows that there is something wrong with the file. These checksums can be calculated on any arbitrary series of bits, including digital image or text files.
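To make this concrete, here is a minimal sketch in Python using the standard hashlib module. The function names file_md5 and flip_one_bit and the file name sample.txt are our own, chosen for illustration only. The sketch calculates a checksum, changes a single bit in the file to imitate bit rot, and calculates the checksum again.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 checksum of a file as a hexadecimal string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def flip_one_bit(path, byte_index=0):
    """Imitate bit rot by inverting the lowest bit of one byte (illustration only)."""
    data = bytearray(Path(path).read_bytes())
    data[byte_index] ^= 0b00000001
    Path(path).write_bytes(bytes(data))


# Demonstration on a throwaway file in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "sample.txt"
    sample.write_bytes(b"An example file for the archive.\n")

    before = file_md5(sample)
    flip_one_bit(sample)
    after = file_md5(sample)

    print("Checksum before:", before)
    print("Checksum after: ", after)
    print("File changed:", before != after)
```

Even though only one bit out of several hundred was changed, the two checksums look completely different, which is what makes the change easy to spot.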
MD5
The Message Digest Algorithm 5 (MD5) produces a 32-character checksum. Each character is a number from 0 to 9 or a letter from a to f, e.g. 5adb6b18a918913e279761a06e5ba73a, which gives 16^32 or 2^128 different possible combinations. So the chance that two different files give the same checksum by accident is extremely small. You can therefore create a quasi-unique fingerprint for every file with an MD5 checksum.
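The arithmetic can be checked in a few lines of Python. An MD5 checksum is 128 bits long, and each hexadecimal character represents 4 bits, so it is written as 32 characters. The sketch below uses the checksum of empty content as its example, because that value is the same everywhere.

```python
import hashlib

# The MD5 checksum of empty content (a file of zero bytes).
checksum = hashlib.md5(b"").hexdigest()
print(checksum)       # d41d8cd98f00b204e9800998ecf8427e
print(len(checksum))  # 32

# 32 characters with 16 possible values each, or equivalently 128 bits.
combinations = 16 ** 32
print(combinations == 2 ** 128)  # True
print(combinations)  # 340282366920938463463374607431768211456
```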
MD5 was initially designed to be used as a security algorithm, but it has been found to suffer from extensive vulnerabilities: it is possible to deliberately construct different files that share the same MD5 checksum. It can still be used as a checksum to detect accidental changes, however, for example in a digital archive. MD5 checksums are created before or while files are added to the digital archive. They are checked against previously generated checksums at regular intervals and/or when consulting a file, to see whether the file is still complete and unchanged (and not corrupted).
This is important because digital files are often stored in large quantities, making it impossible to visually inspect each file individually. Moreover, in most cases, a visual inspection of all the individual files would still not be enough to determine whether the integrity of the stored files is unchanged. If an MD5 checksum reveals that the integrity of a digital file has changed, you need to go back to the (unchanged) back-up and replace the changed file with an exact copy of that back-up.
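The check-and-replace step described above could be sketched as follows in Python. This is an illustration under our own assumptions, not the procedure of any particular archive or tool: the function name check_and_restore, the status texts and the file names are hypothetical. Note that the sketch also checks the back-up against the recorded checksum before copying it, so that a damaged back-up never overwrites the original.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path):
    """Return the MD5 checksum of a file (read in one go, fine for small files)."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def check_and_restore(original, backup, recorded_md5):
    """Compare a file with its recorded checksum and restore it if needed."""
    original, backup = Path(original), Path(backup)
    if original.is_file() and file_md5(original) == recorded_md5:
        return "unchanged"
    # Only trust the back-up if it still matches the recorded checksum.
    if not backup.is_file() or file_md5(backup) != recorded_md5:
        return "back-up unusable"
    shutil.copy2(backup, original)
    return "restored"


# Demonstration with throwaway files in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    original = Path(folder) / "report.txt"
    backup = Path(folder) / "report_backup.txt"
    original.write_bytes(b"Annual report 2020\n")
    shutil.copy2(original, backup)
    recorded = file_md5(original)  # checksum created when the file was archived

    print(check_and_restore(original, backup, recorded))  # unchanged

    original.write_bytes(b"Annual report 2O20\n")  # imitate an unwanted change
    print(check_and_restore(original, backup, recorded))  # restored
    print(file_md5(original) == recorded)  # True
```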
Checksum tools
There are lots of free software packages available for MD5 checksums. The principle is always the same and equally simple: the software creates checksums for a number of files, resulting in a small text file that you store together with the files. To verify the files, the software calculates a new checksum for each file and compares it against the one in the text file. So, if you want to make sure that the checksums are not lost together with the files when a carrier deteriorates, you also need to save the small text file in a different location, e.g. on an external hard drive.
Bear in mind that new checksum tools appear regularly and that support for older checksum tools could stop at any moment. The MD5 checksums themselves do not rely on a particular checksum tool, however.
Various factors can help to determine which checksum tool to choose. Not every tool runs on every operating system or on every version of it, so users may need to choose a different tool depending on whether they're running a particular version of Windows, Mac OS X or Linux. Additionally, not all tools have a graphical user interface, and some users can be put off by tools that only work using a command line. Some checksum tools also offer more extensive or different features than others, but most can create and check other types of checksums alongside MD5 checksums.
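What such a tool does behind the scenes can be sketched in Python. This is a simplified illustration, not the code of any specific tool: the function names write_manifest and verify_manifest are our own, and the layout of the text file (checksum, two spaces, file name) is an assumption modelled on the layout commonly produced by command-line MD5 utilities.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path):
    """Return the MD5 checksum of a file (read in one go, fine for small files)."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def write_manifest(folder, manifest_path):
    """Write one line per file in the folder: checksum, two spaces, relative name."""
    folder = Path(folder)
    lines = []
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            name = path.relative_to(folder).as_posix()
            lines.append(f"{file_md5(path)}  {name}")
    Path(manifest_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def verify_manifest(folder, manifest_path):
    """Return the names of files that are missing or no longer match."""
    folder = Path(folder)
    problems = []
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split("  ", 1)
        target = folder / name
        if not target.is_file() or file_md5(target) != expected:
            problems.append(name)
    return problems


# Demonstration: the text file is kept in a different folder from the files.
with tempfile.TemporaryDirectory() as archive, tempfile.TemporaryDirectory() as elsewhere:
    (Path(archive) / "photo.tif").write_bytes(b"image data")
    (Path(archive) / "letter.txt").write_bytes(b"text data")
    manifest = Path(elsewhere) / "checksums.md5"

    write_manifest(archive, manifest)
    print(verify_manifest(archive, manifest))  # []

    (Path(archive) / "letter.txt").write_bytes(b"text dat4")  # unwanted change
    print(verify_manifest(archive, manifest))  # ['letter.txt']
```

Because the text file only contains checksums and file names, it can be read and checked by a different tool later on, as long as that tool understands the same layout.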
You can find a more detailed overview of examples of checksum tools on Wikipedia.
Get started with checksum tools
We demonstrate three ways to create and check MD5 checksums here for illustration purposes. For user-friendliness, we have chosen checksum tools with a graphical user interface. We have used the checksum tools on an Apple computer ourselves, but they also run on other operating systems. We recommend consulting the relevant manual before installing your chosen tool.
Authors: Rony Vissers (meemoo) with help from Nastasia Vanderperren (meemoo) and Henk Vanstappen