File identification
The first step in keeping your digital archive readable in the long term is to determine the file formats in which your digital documents are stored. You can then take further actions based on this information.
In this article, you’ll learn:
- What is file identification?
- Why is it important to know in which format your files are stored?
- How can you identify file formats in your digital archive and when should you do this?
File identification means determining precisely which file format, and which version of that format, a digital file is stored in. It allows you to detect outdated file formats in time and, if necessary, convert them to a sustainable format. Regularly checking the integrity of your files with checksums tells you that the files themselves have not been altered; the ones and zeros that make up each file remain the same.
However, you have no guarantee that you will still be able to open the files in a few years because the right software may no longer be available. One example of this is WordPerfect files that can no longer be opened by current office software. That’s why it’s important to map out which formats you have in your digital collection, and to check whether there is still software available to open these files.
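The checksum monitoring mentioned above can be sketched in a few lines of Python. This is only an illustration, assuming SHA-256 as the checksum algorithm; the function names sha256_of and is_unchanged are made up for this example and do not come from any particular tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_checksum: str) -> bool:
    """Compare the current checksum with one recorded at an earlier date."""
    return sha256_of(path) == recorded_checksum
```

A matching checksum only shows that the bits are the same as before. It says nothing about whether software still exists that can open the file, which is why identification is needed as well.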
How do you know which file formats you have?
A first step is to look at the file’s extension, which is the string of characters that comes after the last dot in the filename. For example, a file with the filename ‘document.doc’ has the extension ‘.doc’, which indicates that it can probably be opened with a word processor. The extension is only part of the information, however; that .doc file could be a file in Microsoft Word format, but it could just as well be in a completely different format.[1]
Moreover, someone could have manually renamed the file and given it a different extension, so the extension alone does not provide absolute certainty about the file format. Often you also need to know which version of a format a file uses in order to open it with the corresponding software version, and an extension does not provide a clear answer to this either.
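As a rough first pass, the extensions in a folder can be tallied with Python's standard library. The sketch below is only an illustration; count_extensions is a name chosen for this example, and the result tells you what the filenames claim, not what the files actually contain.

```python
from collections import Counter
from pathlib import Path


def count_extensions(root: Path) -> Counter:
    """Count files per extension (lower-cased) in a folder and its subfolders."""
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            # Path.suffix is the last dot plus what follows it, e.g. ".doc";
            # it is an empty string for files without an extension.
            counts[path.suffix.lower()] += 1
    return counts
```

For example, Path("document.doc").suffix gives ".doc". Renaming the file changes that value without changing a single byte of its content.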
The format and version are usually recorded inside the file itself, in internal information that you do not see when you open the file, such as a characteristic sequence of bytes (a signature) at a fixed position in the file. Software relies on this information to recognise the file. DROID (Digital Record Object Identification) is a useful tool that specialises in reading this information, so you can identify the file format and the correct version.
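To make the idea of such internal information concrete, here is a minimal sketch that looks at the first bytes of a file and compares them with a few well-known signatures. It is not how DROID is implemented; the signature table is a small hand-picked sample, and sniff_format is a hypothetical name used only for this illustration.

```python
from pathlib import Path
from typing import Optional

# A handful of well-known signatures, all located at the very start of the file.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (e.g. legacy Microsoft Office)"),
    (b"PK\x03\x04", "ZIP container (e.g. DOCX, ODT, EPUB)"),
]


def sniff_format(path: Path) -> Optional[str]:
    """Return a label for the first matching signature, or None if none matches."""
    with path.open("rb") as handle:
        header = handle.read(16)  # long enough for every signature above
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return None
```

A file named ‘report.doc’ that starts with the bytes %PDF- would be reported as PDF, whatever its extension says. A dedicated tool goes much further than this sketch, with a far larger set of signatures and version-level detail.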
When should you identify files?
There are several moments in a digital object’s life cycle when it is useful to identify files. For example, when you have photos digitised by an external company, you receive the files back with the .tif extension, and you will want to check whether they really are TIFF files. You can use DROID for this.
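For a quick spot check of such a delivery, independent of DROID, a short script can flag files whose extension says TIFF but whose first four bytes do not. This is a simplified sketch: find_suspect_tiffs is a name invented for this example, and it only recognises the classic TIFF header, so other variants such as BigTIFF would be flagged as well.

```python
from pathlib import Path
from typing import List

# Classic TIFF files start with one of these two byte sequences
# (little-endian and big-endian byte order respectively).
TIFF_MAGIC = (b"II*\x00", b"MM\x00*")


def find_suspect_tiffs(folder: Path) -> List[Path]:
    """Return .tif/.tiff files whose first four bytes are not a classic TIFF header."""
    suspects = []
    for path in sorted(folder.rglob("*")):
        if path.is_file() and path.suffix.lower() in {".tif", ".tiff"}:
            with path.open("rb") as handle:
                if handle.read(4) not in TIFF_MAGIC:
                    suspects.append(path)
    return suspects
```

An empty result means every file at least starts like a TIFF. It does not prove that the files are valid or complete.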
DROID is also useful when you have no idea what types of files your digital archive contains. Having DROID analyse the archive gives you a list of all the files and their file formats, so you can better assess the risks.
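Once such a list exists as a CSV file, it can be condensed into a count per format and version. The sketch below assumes that the list has been exported to CSV and that the columns holding the format name and version are called FORMAT_NAME and FORMAT_VERSION; both names are assumptions that you should check against your own export and adjust through the parameters if needed. The function name summarise_formats is likewise just for illustration.

```python
import csv
from collections import Counter
from pathlib import Path


def summarise_formats(
    csv_path: Path,
    name_column: str = "FORMAT_NAME",        # assumed column name
    version_column: str = "FORMAT_VERSION",  # assumed column name
) -> Counter:
    """Count rows per (format name, format version) pair in a CSV export."""
    counts = Counter()
    with csv_path.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Rows without a format name are grouped under "unidentified".
            name = (row.get(name_column) or "").strip() or "unidentified"
            version = (row.get(version_column) or "").strip()
            counts[(name, version)] += 1
    return counts
```

Calling most_common() on the result lists the formats from most to least frequent, which makes rare or outdated formats easy to spot.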
Author: Nastasia Vanderperren (meemoo) with help from Joris Janssens
[1] For a list of all software that uses .doc as an extension but a different file format, see: http://filext.com/file-extension/DOC