Recommended file formats for keeping your digital archive readable

From Tracks
This page is a translated version of the page Aanbevolen bestandsformaten om je digitaal archief leesbaar te houden and the translation is 99% complete.

To keep your digital archive readable in the long term, it’s important to store your files in a sustainable file format. Some file formats may make your documents unreadable over time.
In this article, you’ll learn:

  • What is digital obsolescence and how can you prevent it?
  • What is a file format?
  • Why should you use a sustainable file format for your digital documents?
  • Which file formats are suitable as a sustainable format?

If your digital archive is properly backed up and/or you save everything in the cloud, then you still have all your digital files. But are you sure you can still open them? Hopefully, you have your poster in a format other than the PageMaker file from 1994, because there's no suitable software available for it anymore. Yes, you read that right: digital archives don't preserve themselves.


The problem of digital obsolescence

Digital obsolescence is when a file is so old that the software for opening it is no longer available, unless you resort to some (time-intensive) digital archaeology. And even if there is still software available for it, there's a strong chance that later versions will display files differently from older versions.

Software durability is determined by:

  • the extent of backward compatibility: a new version of the software might not be able to properly read files from older versions;
  • the complexity of the software: the more complex the software, the harder it is to guarantee backward compatibility;
  • its distribution in the market or community: a large market means more software for reading files;
  • its open documentation: if the source code is available, programmers can continue to develop the software to read the file format. Using open file formats reduces the risk of being reliant on particular technologies or providers.

The file format determines how the information is coded in a computer file, and is usually indicated by the extension in the file name. A codec is a piece of software or hardware that allows data to be coded or decoded, or compressed or decompressed. You can use DROID to gain an overview of the file formats in your digital archive.
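As a rough first look before running a dedicated tool such as DROID, you could count the files in your archive by extension. The sketch below is only an illustration: the function name `count_extensions` is made up for this example, and it trusts the extension in the file name, which can be missing or wrong.

```python
from collections import Counter
from pathlib import Path


def count_extensions(root):
    """Count files under `root` by lower-cased file extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

Calling `count_extensions("my_archive").most_common()` (with a folder name of your own) lists the extensions from most to least frequent.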

Other risks

Compression can be an issue for image and video files. Photos are widely saved in JPEG format, for example, which uses lossy compression: image information is discarded to make the file smaller. You can't notice this with the naked eye at first, but it leads to a visible loss of quality when the photo is migrated to a new format or edited and saved again in image editing software such as Photoshop.

Also take into account the issue of files that refer to each other. An InDesign file, for example, does not contain the images, but links to the images which are stored elsewhere on your drive. This link is lost when the files are moved.

How do you choose the right file format?

Keeping a digital archive readable is essentially the continuous migration of old files to current file formats (which we call a 'migration strategy'), or emulating an old computer environment on the current setup, so that the old software can still work (which we call an 'emulation strategy').

Both strategies become very complex over time, and are often only implemented by specialists. As an artist or arts organisation, it's best to focus primarily on choosing an open and well-documented file format when creating your document. That's the best guarantee for ensuring your digital archive remains readable in the long term. You could also bet on more than one horse, for example by saving images or PDFs of complex 3D models. Secondly, you can check whether there are any potentially 'at-risk' files among your existing digital content. If there are, then please feel free to contact one of the partners in the TRACKS network for more tailored advice.
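A first check for 'at-risk' files can be sketched in a few lines. This is a minimal example, not a complete risk assessment: the `AT_RISK` list is an assumption that only contains extensions this article describes further on as outdated or not durable, and the function name `find_at_risk` is invented for the example.

```python
from pathlib import Path

# Extensions this article describes as outdated or not durable.
# The selection is an assumption for illustration; adapt it to your archive.
AT_RISK = {".xls", ".ppt", ".bmp", ".pst", ".msg"}


def find_at_risk(root, at_risk=AT_RISK):
    """Return a sorted list of files under `root` with an at-risk extension."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in at_risk
    )
```

The result is a list of paths you can review by hand or pass on when asking for advice.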

Below is an overview of tips for each file type.

Word processing documents

Examples: DOC, DOCX, ODT, TXT, RTF

Word processing documents are best saved in ODT, or PDF if the document no longer needs to be modified. It's easy to save documents as ODT or PDF files from within Word. In the latter case, do not choose to print to PDF as this is lower quality than the 'share' or 'export' option. Always select PDF/A as the PDF archiving profile, which is available in Word in the PDF save settings. Saving files in the latest version of Word (DOCX) in their original format is not an ideal solution, even though the risks are currently very low.

ODT

ODT (OpenDocument Text) is the open-standard counterpart of DOC and DOCX. As an open format for formatted text, it is therefore the preferred option.

PDF

PDF files can simply be saved (in the medium term) in PDF format. If possible, make sure that every PDF created within the organisation is saved in a PDF archiving profile (preferably PDF/A, or PDF/E for architectural drawings).

Raster images

Examples: TIFF, JPEG, GIF, PNG, PSD, BMP

A raster image or bitmap is an image in digital form, with the colour set for each pixel. The disadvantage of a raster image is that individual pixels become visible when the image is magnified. Bitmap software is available for editing raster images. The counterpart of a raster image is the vector image.

One example of a raster image is when a digital camera captures the image and uses an image chip to record it, which contains a raster of pixels.

TIFF

TIFF is generally recommended as a durable storage format for raster images. It is best not to use compression for images. Indeed, (lossy) compression results in a loss of quality when editing images. You should therefore make sure that photos with artistic value, used for communication and presentation, are delivered and saved in uncompressed TIFF format.

There are various TIFF profiles. Uncompressed baseline IBM TIFF v6.0 is considered to be the most durable. Make sure that an RGB profile is used as the colour space, if possible AdobeRGB or eciRGB v2. It's also best to create equivalent TIFF versions of Photoshop files, but keep the original file with layer information if you want to edit it further.
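If you want to check whether an existing TIFF file is uncompressed, you can read two values from the file itself. The sketch below uses only the Python standard library and looks at the first image directory of a classic TIFF file: the byte order in the header and the compression tag (tag 259, where 1 means uncompressed). We assume here that 'IBM' refers to the little-endian byte order (header `II`) offered as 'IBM PC' in some image editors; the function name `tiff_summary` is made up for this example.

```python
import struct


def tiff_summary(path):
    """Return (byte_order, compression) from the first IFD of a TIFF file.

    byte_order is "little-endian" (header "II") or "big-endian" (header "MM").
    compression is the value of TIFF tag 259; 1 means uncompressed.
    """
    with open(path, "rb") as f:
        header = f.read(8)
        if len(header) < 8 or header[:2] not in (b"II", b"MM"):
            raise ValueError("not a TIFF file")
        endian = "<" if header[:2] == b"II" else ">"
        magic, ifd_offset = struct.unpack(endian + "HI", header[2:8])
        if magic != 42:
            raise ValueError("not a classic TIFF file")
        f.seek(ifd_offset)
        (entry_count,) = struct.unpack(endian + "H", f.read(2))
        compression = 1  # TIFF 6.0 default when tag 259 is absent
        for _ in range(entry_count):
            entry = f.read(12)
            tag, _field_type, _count = struct.unpack(endian + "HHI", entry[:8])
            if tag == 259:
                # A single SHORT value sits in the first two bytes of the value field.
                (compression,) = struct.unpack(endian + "H", entry[8:10])
                break
    byte_order = "little-endian" if endian == "<" else "big-endian"
    return byte_order, compression
```

A result of `("little-endian", 1)` matches the uncompressed profile described above; any other compression value means the file was saved with some form of compression.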

JPEG

It's fine to use JPEG files for photos created for the purpose of documenting an exhibition or public event, but don't use any exotic or obsolete formats such as BMP (Bitmap).

PNG

PNG is an open image format that uses lossless compression (so no image information is lost). PNG is used for high-quality online publications and presentations, and logos and graphics.

2D Vector images

Examples: AI, SVG, EPS

A vector image is a graphical representation composed of simple geometric objects such as points, lines, curves, polygons, etc. Complex forms are created by combining these more elementary shapes. The objects' formulas describe the images, so vector images can be enlarged to any desired format without any loss of quality. This is in contrast to a raster or bitmap image, in which individual pixels are coloured in separately on the digital canvas. This means the resolution for the chosen scale is fixed, causing the image to become blurred or chunky when enlarged.

The description of a vector image might say, for example, to draw a circle of a certain colour and size over a text. The absolute size of neither the text nor the circle is set, only the relationship between them. This flexibility means that vector graphics can therefore be displayed at any size, and the resolution (the information density) remains the same.

SVG

SVG is generally recommended as a durable file format for vector drawings, so always make sure you have an SVG equivalent of definitive vector images.

Text files

Examples: TXT

Text files can simply be saved as such, but note that text can be coded in different ways (e.g. ANSI, ASCII and UTF-8). Where possible, try to ensure that text files are coded in UTF-8.
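Checking and converting the encoding of a text file can be sketched as follows. This is a minimal example with invented function names. The default source encoding `cp1252` is an assumption: it is what Windows tools often call 'ANSI' in Western Europe, but your files may use a different code page, so always check the result by eye.

```python
from pathlib import Path


def is_utf8(path):
    """Return True if the file's bytes decode as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(src, dst, source_encoding="cp1252"):
    """Re-encode a text file as UTF-8.

    `source_encoding` is an assumption; pick the encoding the file
    was really written in, otherwise accented characters come out wrong.
    """
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

Note that a plain ASCII file is already valid UTF-8, so `is_utf8` returns `True` for it and no conversion is needed.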

Presentation files

Examples: PPT, PPTX

These files can be saved (in the medium term) in their original format. PDF is more durable, however, so migrate completed presentations to this format. PPT files have already become outdated, so make sure you have equivalents in PPTX or PDF, and choose PDF/A.

Spreadsheets

Examples: XLS, XLSX, ODS

There is no comprehensive solution for spreadsheet files within the archive community, but XLSX and ODS are considered to be sufficiently durable. XLS is outdated. It is therefore recommended to identify important XLS spreadsheets in the archive and create an equivalent in ODS and XLSX.

Video files

Examples: AVI, FLV, MOV, MPEG-1, MPEG-2, MPEG-4, SWF, WMV

Long-term storage of video files is a job for specialists. When you order videos, however, you can require the providers to deliver them in durable formats. In principle, MKV is the most durable format for storing video. MXF, AVI and MOV are other durable formats. File formats for audio and video are simply containers for the audio and video streams, so it's important to determine how audio and video need to be encoded. FFV1 coding is generally chosen within the archive and heritage sector. For audio streams, LPCM coding is recommended. Make sure that the audio and video streams do not use lossy compression (FFV1 compresses without any loss of information). This often results in large files (for FFV1: 45-50 GB per hour of video!), so use it primarily for valuable videos in which a lot has been invested.
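To get a feel for what 45-50 GB per hour means, you can convert it to an average bitrate and estimate the storage for a collection. The small calculation below assumes decimal gigabytes (1 GB = 1,000,000,000 bytes); the function names are made up for this example.

```python
def gb_per_hour_to_mbit_per_s(gb_per_hour):
    """Convert storage per hour (decimal GB) to an average bitrate in Mbit/s."""
    bits_per_hour = gb_per_hour * 1e9 * 8
    return bits_per_hour / 3600 / 1e6


def storage_needed_gb(hours_of_video, gb_per_hour):
    """Estimate the storage in GB for a given number of hours of video."""
    return hours_of_video * gb_per_hour
```

With these assumptions, 45 GB per hour corresponds to an average of about 100 Mbit/s, and 20 hours of video at 50 GB per hour needs roughly 1,000 GB.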

Lower quality standards can be used for less important videos. The video codecs H.262 and H.264, for example, are widely used in MP4 format. You can read a good overview on sustainable video file storage at SCART.

Audio files

Examples: AC3, AIFF, MP3, WAV, WMA

Important audio files are best saved in WAV format. FLAC and AIFF are also durable formats. Use LPCM for the audio signal coding. MP3 can be used as a reference format or for less important audio files, e.g. to access via your website.
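You can inspect the basic properties of a WAV file with Python's standard `wave` module, which reads PCM-encoded WAV files and raises `wave.Error` for files it cannot parse. The sketch below is an illustration only; the function name `describe_wav` and the keys of the returned dictionary are invented for this example.

```python
import wave


def describe_wav(path):
    """Return basic parameters of a PCM WAV file using the standard library."""
    with wave.open(str(path), "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "bits_per_sample": w.getsampwidth() * 8,
            "sample_rate_hz": rate,
            "duration_s": frames / rate if rate else 0.0,
        }
```

This gives you a quick way to document the sample rate and bit depth of the files you receive.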

Email files

Examples: PST, MBOX, MSG

Emails can be saved in different ways. If entire mailboxes are being saved, it's best to opt for the MBOX format. It is, however, recommended to also save important emails (with high informative value for a project) separately in the project dossier. EML format is best for this. Also always save attachments separately from the email. Gmail has functions for exporting emails or saving them in EML and MBOX. Outlook uses application-specific formats, such as PST and MSG, which are not durable. To save Outlook mailboxes, it is therefore best to use an email client like Thunderbird. (See article on how to archive emails).
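Splitting an MBOX file into separate EML files, with attachments saved next to them, can be sketched with Python's standard `mailbox` module. This is a minimal example under our own assumptions: the function name `split_mbox` and the `message_00001` naming scheme are invented, and real mailboxes may contain messages that need extra care (duplicate attachment names, unusual encodings).

```python
import mailbox
from pathlib import Path


def split_mbox(mbox_path, out_dir):
    """Write each message in an MBOX file to its own EML file, plus attachments.

    Returns the number of messages written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    box = mailbox.mbox(str(mbox_path))
    count = 0
    try:
        for index, message in enumerate(box, start=1):
            stem = f"message_{index:05d}"
            # The EML file is simply the raw message, headers included.
            (out / f"{stem}.eml").write_bytes(message.as_bytes())
            for part in message.walk():
                filename = part.get_filename()
                payload = part.get_payload(decode=True)
                if filename and payload is not None:
                    # Keep only the base name so an attachment cannot
                    # write outside the output folder.
                    safe_name = Path(filename).name
                    (out / f"{stem}_{safe_name}").write_bytes(payload)
            count = index
    finally:
        box.close()
    return count
```

The attachment is still present inside the EML file as well; the separate copy only makes it easier to find and to migrate later.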

Websites

Websites are essentially dynamic information entities that are constantly changing. This means you can only capture all a website's information by taking snapshots of it at regular intervals, much like the Internet Archive does (archive.org). Note: it is insufficient to rely solely on the Internet Archive because the snapshots produced by this service are rarely complete. It's also relatively easy to create snapshots yourself. A snapshot of a website is a 'static copy' of all its HTML pages together with all images, style sheets, etc. The system that the website runs on (often a content management system like Drupal or WordPress) is not included in such a snapshot. The archiving format for websites is WARC. You can find strategies for saving websites in the article for how to archive websites.

The extent to which you can archive websites effectively often depends on the technology used. Flash code is very difficult to archive, for example. You can measure your website's 'archivability' at archiveready.com. If you're developing new websites, try to ensure, wherever possible, that they will be easy to archive at a later date.

Databases

Databases come in different forms and perform different functions. Archiving a database is essentially about exporting the information from the database in a form that can be imported into a new database. It often uses Excel tables, or CSV or XML files, but other data files are also possible. It's important to properly document how the database is organised. And the same applies here as for websites: build databases in such a way that makes it easy to retrieve the information from them in forms that can easily be imported into other databases.
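As an illustration of such an export, the sketch below writes every table of an SQLite database to a CSV file and saves the table definitions in a separate `schema.sql` file as documentation. SQLite is only an assumption chosen because Python supports it out of the box; other database systems have their own export tools. The function name and file names are invented for this example.

```python
import csv
import sqlite3
from pathlib import Path


def export_sqlite_to_csv(db_path, out_dir):
    """Export every table of an SQLite database to CSV and save its schema.

    Returns the list of exported table names.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(str(db_path))
    try:
        tables = conn.execute(
            "SELECT name, sql FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        ).fetchall()
        # Document how the database is organised alongside the data.
        schema = "\n\n".join(sql for _, sql in tables if sql)
        (out / "schema.sql").write_text(schema + "\n", encoding="utf-8")
        for name, _ in tables:
            quoted = '"' + name.replace('"', '""') + '"'
            cursor = conn.execute(f"SELECT * FROM {quoted}")
            with open(out / f"{name}.csv", "w", newline="", encoding="utf-8") as f:
                writer = csv.writer(f)
                # First row: column names, so the CSV is self-describing.
                writer.writerow([col[0] for col in cursor.description])
                writer.writerows(cursor)
        return [name for name, _ in tables]
    finally:
        conn.close()
```

The sketch uses table names directly as file names, which is fine for simple names but would need checking for databases with unusual table names.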

2D CAD

Examples: DWG, DXF, VWX, DGN

It's best to save 2D CAD drawing files in a format that is commonly used and easy to open, usually DWG or DXF. Architects not using Autodesk products are advised to save drawings with exchanged and published status in DWG or DXF. Make sure that files which refer to each other (such as XREFs or plot style files) are saved together (this is possible in AutoCAD via the ETRANSMIT command). 2D CAD drawings are also often converted into PDF, which should continue to be kept. As well as having a legal value, PDFs are much more durable than any existing CAD file. They are currently created mostly using the plot or print function, but software such as AutoCAD and Vectorworks provides the possibility of exporting drawings straight to PDF. In this case, the files can contain more information, the risk of errors when creating the PDF is reduced, and the creator has more control over which elements need to appear in the drawing. Choose PDF/A or PDF/E.

3D CAD

Examples: DWG, DXF, VWX, DGN, SKP, 3DM

CAD files should be saved in a format that is commonly used and easy to open, but there is hardly any such format available for CAD drawings in 3D. 3D models should therefore be saved in their original format, but make sure you properly document the software and version used to create the file as well as its system requirements. There are cases when a 3D CAD file is displayed differently following a software version update. IFC is increasingly becoming the industry standard for exchanging and publishing technical 3D models. It is open, documented and durable, but take into account that the conversion from 3D model to IFC always incurs a certain amount of loss.
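One simple way to document the software, version and system requirements is a small 'sidecar' file stored next to the model. The sketch below writes such a file as JSON. It is only one possible approach: the `.meta.json` suffix, the function name `write_sidecar` and the field names are all invented for this example and do not follow any particular metadata standard.

```python
import json
from pathlib import Path


def write_sidecar(model_path, software, version, system_requirements):
    """Write `<file>.meta.json` next to a model file and return its path."""
    model = Path(model_path)
    sidecar = model.with_name(model.name + ".meta.json")
    record = {
        "file": model.name,
        "software": software,
        "software_version": version,
        "system_requirements": system_requirements,
    }
    sidecar.write_text(
        json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return sidecar
```

Because the sidecar is plain UTF-8 text, it stays readable even if the model file itself can no longer be opened.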

3D modelling files

Examples: 3DS, VRML, X3D, U3D, BLEND

The variety of 3D modelling files is too wide to make general statements about their preservation. X3D and U3D are durable file formats, but they are not suitable for all 3D models. You should therefore save the files in their original format with documentation about the original software, just like for 3D CAD. 3D models are often created to produce other documents, such as renders in 2D. The same recommendations apply for these documents as for image files. In some cases, a 3D model is not a file but an executable, such as for the models in Unity. In this case, make sure you document the system requirements for the executable. Documenting 3D scenes using snapshots or videos (e.g. screen captures) is another good option.

Sheet music

The recommended formats for preserving digital sheet music are PDF/A, TIFF or MusicXML. The format you choose depends on the intended use.

PDF/A and TIFF are good formats for storing and reading documents, and you can handle them just like you would any other PDF document or TIFF image. MusicXML is an open format that allows you to notate and edit sheet music. This ensures you can retain any information in the notes and can easily modify it, but it is less handy for reading and performing music. In this case, it's best to save the manuscript as a PDF/A or TIFF file.


Author: Wim Lowet (VAi) and Nastasia Vanderperren (meemoo)