Capturing data from 3.5-inch diskettes for House for Electronic Arts (HeK)
In May 2018, HeK (House for Electronic Arts)[1] asked PACKED vzw to capture data from its 3.5-inch diskettes. The diskettes held the digital art works Raoul A. Pictor cherche son style (1993)[2] by Hervé Graumann and Über Sehen (1993)[3] by Studer / Van den berg. There were nine high-density diskettes in total, some formatted for Mac and the others for Windows. HeK didn’t have the right reading equipment to capture the data, so PACKED vzw developed a workflow to retrieve it from the diskettes. Diskettes are fragile carriers. If they become too damaged, there’s a very real chance that reading equipment won’t be able to read them and the art works will be lost.
Issue
Diskettes are data carriers that use magnetism to store data, with a capacity of 80 KB (first generation) to 2.88 MB (latest generation). They were ubiquitous from the 1980s until CD-Rs and USB sticks emerged in the late 1990s and early 2000s.
There are various types of diskettes and some variants are not compatible. Many require their own specific reading device, which cannot write or read other types.[4] Diskettes can differ, for example, in:
- size: the first diskettes, invented in the late 1960s by IBM, had an 8-inch diameter. The 5.25-inch diskette was introduced for home computers in the mid 1970s. The 3.5-inch diskette became the most popular data storage medium in 1988. Diskettes were also available in 2, 2.5, 3, 3.25 and 4-inch formats, but they never fully broke through.
- the number of tracks and sectors: data is organised in tracks and sectors on diskettes. Tracks are concentric circles around the centre of the diskette with spaces left in between. Nothing is written in these spaces. Sectors are blocks of a constant size (expressed in bytes), each with its own identification number so the operating system can find the data on the diskette. Diskettes can also differ in the number of tracks per side[5] and per inch, the number of sectors per track, and the number of bytes per sector.
- the number of writeable sides: there are single-sided and double-sided diskettes. A diskette reader that can read single-sided diskettes can’t necessarily read double-sided diskettes, and vice versa.
- density: this is the efficiency with which data can be stored on a magnetic carrier. The higher the density, the more data a diskette can store. A greater density is achieved for example by coding improvements for data storage, the magnetic strength at which the data can be written and the material used. There are single-density (SD or 1D), double-density (DD or 2D), quad-density (QD or 4D), high-density (HD), extra-high density (ED) and triple-density (TD) diskettes.
- logical format: the logical format is the encoding scheme that determines how the data is written to the carrier. The most common formats are FM (for DOS-formatted, single-density diskettes), MFM (for DOS-formatted double-density diskettes and for high-density diskettes) and GCR, which has an Apple variant and a Commodore variant. There are also separate formats for Atari and Amiga, among others.
The consequence of all these differences is, for example, that a 3.5-inch diskette station cannot read every 3.5-inch diskette.
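To illustrate how these parameters combine, the formatted capacity of a diskette is the product of its sides, tracks per side, sectors per track and bytes per sector. The Python sketch below works this out for two common MFM-encoded 3.5-inch PC formats. The class and variable names are illustrative assumptions; the geometry values are the standard ones for these formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Geometry:
    """Physical layout of a diskette format (illustrative helper)."""
    sides: int
    tracks_per_side: int
    sectors_per_track: int
    bytes_per_sector: int

    def capacity_bytes(self) -> int:
        # Formatted capacity is the product of all four parameters.
        return (self.sides * self.tracks_per_side
                * self.sectors_per_track * self.bytes_per_sector)


# Standard geometries of two MFM-encoded 3.5-inch PC formats.
PC_DD = Geometry(sides=2, tracks_per_side=80, sectors_per_track=9, bytes_per_sector=512)
PC_HD = Geometry(sides=2, tracks_per_side=80, sectors_per_track=18, bytes_per_sector=512)

for name, geometry in [("3.5-inch DD", PC_DD), ("3.5-inch HD", PC_HD)]:
    size = geometry.capacity_bytes()
    print(f"{name}: {size} bytes = {size // 1024} KiB")
```

The high-density result, 1,474,560 bytes or 1,440 KiB, is the format that is marketed as "1.44 MB".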
The many variants mean that capturing data from diskettes can be a challenge. Diskette readers with a USB connection, which you can still buy today, can usually only read high-density 1.44 MB diskettes, which were the most popular format after the mid 1990s. Diskettes are also fragile carriers. They’re sensitive to dust, condensation and temperature fluctuations, and should not be stored near magnets or magnetic devices. Any damage can render them unreadable, making it very difficult or even impossible to retrieve data from them.
Status
We captured the content from the nine 3.5-inch diskettes. The files were retrieved from the disk images, identified and saved to a contemporary data carrier.
Method
We decided to create disk images to capture the data. Disk images are bit-for-bit copies of the diskettes. A disk image stores not just the files but also all the system information on the carrier. So the information on the carrier is copied as completely as possible, and remains as close to the original as possible. Then you can retrieve the files from the disk image and identify them. Disk images can be created with software that calculates a checksum for both the source (the original disk content) and the disk image (the copy) and compares the two.[6] Matching checksums confirm that no errors occurred when creating the disk image, and that the disk image is an identical copy of the original.
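The principle of imaging with checksum verification can be sketched in a few lines of Python. This is a simplified illustration, not the tool used for HeK: the function names are assumptions, and unlike dedicated imaging software it does not handle unreadable sectors or write a log.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 64 * 1024  # read in blocks so the whole carrier is never held in memory


def md5_of(path: Path) -> str:
    """Return the MD5 checksum of a file or device as a hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def create_image(source: Path, image: Path) -> str:
    """Copy `source` to `image` bit for bit and verify the copy.

    Returns the MD5 checksum; raises OSError if source and image differ.
    """
    source_digest = hashlib.md5()
    with open(source, "rb") as src, open(image, "wb") as dst:
        for chunk in iter(lambda: src.read(CHUNK_SIZE), b""):
            source_digest.update(chunk)  # checksum of what was read
            dst.write(chunk)
    image_checksum = md5_of(image)  # checksum of what was written, re-read from disk
    if image_checksum != source_digest.hexdigest():
        raise OSError(f"checksum mismatch for {image}")
    return image_checksum
```

On Linux, `source` would typically be the device node of the diskette reader (for example `/dev/sdb`, a hypothetical path that differs per machine), opened through a write blocker.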
The copied carriers were listed in a spreadsheet with the following columns:
- UI (unique identifier): to create the unique identifier, we used the code assigned to the art work by the institution, and then added a consecutive 3-figure number for each carrier, starting with 001. For example, the unique identifier 2008_199_001 refers to the first carrier processed for the art work with number 2008/199.
- Institution: the name of the museum, i.e. HeK.
- Carrier type: the type of diskette. For HeK, these were 3.5-inch DS HD diskettes[7].
- Carrier format: the logical format on the diskette. In the case of the high-density diskettes from HeK, this was MFM.
- Information on the carrier: all the information from the label on the diskette.
- Functional? If the disk image could be opened and the files retrieved from it, then the diskette was considered to be functional.
- Copied with no errors? This field indicates if a disk image could be created without the software encountering any errors while reading the carrier.
- MD5 checksum: an MD5 checksum was created for every disk image. These checksums are used to check the file integrity.
- Notes: this column includes relevant information about the carrier, e.g. it was an empty diskette, not all files could be retrieved from the diskette, or the error messages that we received when we tried to open the disk image.
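As an illustration of this register, the sketch below builds the unique identifier and writes one row with the columns listed above to CSV. The function names and the placeholder values (label text and checksum) are assumptions for the example, not data from the HeK project.

```python
import csv
import hashlib
import io

COLUMNS = [
    "UI", "Institution", "Carrier type", "Carrier format",
    "Information on the carrier", "Functional?",
    "Copied with no errors?", "MD5 checksum", "Notes",
]


def make_uid(artwork_code: str, sequence: int) -> str:
    """Build a unique identifier, e.g. 2008/199 and 1 give 2008_199_001."""
    if not 1 <= sequence <= 999:
        raise ValueError("sequence must fit in three digits")
    return f"{artwork_code.replace('/', '_')}_{sequence:03d}"


def make_row(artwork_code, sequence, label, functional, error_free, md5, notes=""):
    """Return one register row as a dict keyed by column name."""
    return {
        "UI": make_uid(artwork_code, sequence),
        "Institution": "HeK",
        "Carrier type": "3.5-inch DS HD diskette",
        "Carrier format": "MFM",
        "Information on the carrier": label,
        "Functional?": "yes" if functional else "no",
        "Copied with no errors?": "yes" if error_free else "no",
        "MD5 checksum": md5,
        "Notes": notes,
    }


# Placeholder values for illustration only (the checksum is that of empty input).
placeholder_md5 = hashlib.md5(b"").hexdigest()
row = make_row("2008/199", 1, "hypothetical label text", True, True, placeholder_md5)

buffer = io.StringIO()  # stands in for a real CSV file
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```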
To prevent our computer from writing files to the external carriers, we used write blockers. 3.5-inch diskettes have a built-in write-protection slider that makes the diskette read-only. This is the slider in the bottom left corner. We also used a hardware write blocker. This equipment prevents a computer from writing data to the connected carrier.
Create disk images
When testing a reading device with a USB connection, we established that it could read the HeK 3.5-inch high-density diskettes. We used Guymager[8] software to create a disk image from the diskettes. Guymager is open source software that’s used to create disk images of evidence in forensic examinations. It’s extremely important that data is captured unchanged for forensic examinations, and Guymager makes this possible. It has various features to check that the copy is the same as the original. It’s also important that data is saved unchanged for digital preservation. Another of Guymager’s advantages is that it automatically creates metadata in the capturing process, such as the checksums for both the carrier and the disk image, and writes it to a text file.
The software can create an MD5 checksum for both the disk image and the original carrier, and compare the two to ensure that the disk image and carrier are identical. We opted for Linux dd raw image as the file format because it’s an open format that can be read on all major operating systems. Expert Witness Format, by contrast, is a proprietary format and can only be opened with a limited number of applications.
This enabled us to make identical copies of the nine diskettes.
Exporting disk image files
A disk image is not a file that you can simply open to look up data. It differs from copying files from a single location because the disk image saves all the system information as well as the files from the carrier. For a computer, a disk image is therefore equivalent to an external drive or carrier that needs to be read in. To read or use the files and folders from a disk image, you need to connect or mount the disk image to your computer. This can be risky because some operating systems (invisibly) write files to the connected storage media. Sometimes it’s also not possible to mount a disk image because of its file system. A file system is the software structure on a storage medium (e.g. hard drive or external carrier) that the operating system needs in order to present the data on the medium as files and use them in applications. Some file systems can only be used on a specific operating system, whereas others are accessible to multiple operating systems.[9] It’s possible, for example, that a disk image from an (external) drive that’s been formatted for Windows cannot be opened on a Mac computer, or vice versa.
To ensure that HeK had access to the files on the disk images, we exported and identified them, using software that could export all the files, including hidden files, without altering the disk images. Before we could start exporting files from the disk images, we first needed to know which file system the disk images had, because the choice of export tool depends on the file system. This information is also needed if you want to open the files in an emulation environment: the appropriate emulation environment can be selected on the basis of the file system.
We always performed the following actions for the export:
- Determine the file system
- Create an index file with an overview of all files on the disk image
- Retrieve the files from the disk image
- Identify the file formats. This step is required to know which software you can use to open the files (if the computer doesn’t automatically find this out itself).
Determining the file system
The most common file systems for diskettes used in MS-DOS/Windows and Classic Macintosh are FAT12[10] and HFS[11]. FAT is a file system that was developed for MS-DOS and Windows; its FAT12 variant is used specifically for diskettes. It is widely supported, including by almost all modern operating systems (Windows, Mac and Linux). HFS is an obsolete file system that was developed by Apple and used for diskettes and hard drives. Without additional software, HFS disk images can only be read on Mac (both Classic Macintosh and the modern OS X/macOS).
We used Disktype to determine the file system. This is a command line tool that can be used in UNIX environments such as Linux or Mac, or via Cygwin[12] on Windows, to establish the file systems on a disk or disk image. We used the command disktype image.img > disktype.txt to write the information for the disk image named image.img to a text file, disktype.txt (see screenshot).
This is how we established that seven disk images had the FAT12 file system. The other two had the HFS file system.
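To give an idea of what such a detection relies on, the following Python sketch checks two well-known on-disk signatures: the FAT boot sector signature and type label, and the "BD" signature of the HFS Master Directory Block. It is a deliberately simplified heuristic with assumed function names, not a description of how Disktype works internally, and it will miss diskettes whose boot sector lacks the type label.

```python
from pathlib import Path


def guess_file_system(header: bytes) -> str:
    """Guess the file system from the first 2 KiB of a raw diskette image."""
    # FAT: the boot sector ends with the bytes 0x55 0xAA at offsets 510-511.
    if len(header) >= 512 and header[510:512] == b"\x55\xaa":
        # Newer boot sectors carry an informational type label at offset 54.
        if header[54:62] == b"FAT12   ":
            return "FAT12"
        return "FAT (variant not stated in boot sector)"
    # HFS: the Master Directory Block starts at offset 1024 with 0x4244 ("BD").
    if len(header) >= 1026 and header[1024:1026] == b"BD":
        return "HFS"
    return "unknown"


def guess_from_image(path: Path) -> str:
    """Read only the start of the image file and guess its file system."""
    with open(path, "rb") as handle:
        return guess_file_system(handle.read(2048))
```

Calling `guess_from_image(Path("image.img"))` would return one of the strings above; a dedicated tool remains the better choice for real identification work.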
We then used the Bitcurator Disk Image Access Tool to retrieve the files from disk images with the FAT12 file system. Bitcurator[13] is a specialist version of Ubuntu that consists of a collection of forensic tools to help with the preservation of data on external carriers. Bitcurator Disk Image Access Tool is software with which you can see all files on a disk image (including deleted files) and export them.
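Deleted files can be listed because FAT only overwrites the first byte of a deleted directory entry with 0xE5; the rest of the entry stays in place. The sketch below reads the root directory of a raw FAT12 image on that basis. It is an illustration under stated assumptions (root directory only, short 8.3 names, hypothetical function and type names) and does not show how the Bitcurator tool is implemented.

```python
import struct
from typing import List, NamedTuple


class Entry(NamedTuple):
    name: str
    size: int
    deleted: bool


def list_root_directory(image: bytes) -> List[Entry]:
    """List root directory entries of a raw FAT12 image, deleted ones included."""
    # BIOS Parameter Block fields in the boot sector (little-endian).
    bytes_per_sector, = struct.unpack_from("<H", image, 11)
    reserved_sectors, = struct.unpack_from("<H", image, 14)
    fat_count = image[16]
    root_entries, = struct.unpack_from("<H", image, 17)
    sectors_per_fat, = struct.unpack_from("<H", image, 22)

    # The root directory follows the reserved sectors and all FAT copies.
    start = (reserved_sectors + fat_count * sectors_per_fat) * bytes_per_sector
    entries = []
    for index in range(root_entries):
        raw = image[start + index * 32:start + (index + 1) * 32]
        if len(raw) < 32 or raw[0] == 0x00:  # 0x00 marks the end of the directory
            break
        if raw[11] & 0x08:  # volume label or long-file-name entry: skip
            continue
        deleted = raw[0] == 0xE5  # first name byte was overwritten on deletion
        base = raw[0:8].decode("ascii", "replace").rstrip()
        if deleted:
            base = "?" + base[1:]  # the original first character is lost
        extension = raw[8:11].decode("ascii", "replace").rstrip()
        size, = struct.unpack_from("<I", raw, 28)
        name = f"{base}.{extension}" if extension else base
        entries.append(Entry(name, size, deleted))
    return entries
```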
Bitcurator Disk Image Access Tool cannot use disk images with the HFS file system. There is similar software for HFS: HFSExplorer. With it, you can also export all files (including hidden files) and still retain the original metadata such as the most recent editing date.
With these two tools, we were able to export the files from the disk images of all the diskettes.
Identifying files
Once all the files had been retrieved from the disk images, we could identify them. We used DROID for this. DROID identifies files in two ways: by the file extension, and by a signature (an internal code) stored in the file’s bitstream. It uses the PRONOM database for this. DROID wasn’t able to identify all the files, because files from the HFS file system (Classic Mac environment) either had no extension or had the wrong one. If DROID doesn’t know the internal signature of such a file, it has to fall back on the extension, and when that is missing or wrong, DROID cannot recognise the file.
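The two identification routes, and why Classic Mac files without an extension can fall through both, can be sketched as follows. The signature table is a tiny illustrative stand-in for PRONOM, and the function name and return values are assumptions for this example rather than DROID's actual interface.

```python
from pathlib import Path
from typing import Optional, Tuple

# A few well-known signatures; PRONOM holds far more.
SIGNATURES = {
    b"%PDF": "pdf",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpg",
}
KNOWN_EXTENSIONS = set(SIGNATURES.values())


def identify(name: str, content: bytes) -> Tuple[Optional[str], str]:
    """Return (format, basis), trying the internal signature before the extension."""
    for magic, file_format in SIGNATURES.items():
        if content.startswith(magic):
            return file_format, "signature"
    extension = Path(name).suffix.lower().lstrip(".")
    if extension in KNOWN_EXTENSIONS:
        return extension, "extension"
    return None, "unidentified"


# A file without an extension is still recognised if its signature is known...
print(identify("picture", b"GIF89a" + bytes(10)))
# ...but with an unknown signature and no extension, nothing is left to go on.
print(identify("Untitled", b"\x00\x01\x02\x03"))
```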
Conclusion
Data on obsolete carriers is fragile and at risk of being lost, partly because the reading equipment is rare, but also because the carriers age and can no longer be read properly. The data therefore needs to be transferred to a modern data carrier as soon as possible. We were able to transfer the content of all nine diskettes to a modern data carrier using a diskette reader with USB connection, a write blocker and software such as disktype, Guymager, HFSExplorer and Bitcurator.
If you find a diskette in your own archive, please contact us before attempting to read the carrier yourself. Provide us with all the information you have about the carrier, such as the period when it was used, the computer the carrier was used on (Mac or Windows/MS-DOS) and a photo. This makes it easier for us to identify the carrier and determine which strategy we need to retrieve its data.
Author: Nastasia Vanderperren (Meemoo)
- [1] For more information, see http://www.hek.ch/en.html
- [2] For more information, see http://www.hek.ch/en/collection/collection-single/collection/raoul-a-pictor-cherche-son-style.html
- [3] Über Sehen is a screensaver. See http://www.studervandenberg.ch/works.html
- [4] A non-exhaustive list of diskette types: https://en.wikipedia.org/wiki/List_of_floppy_disk_formats
- [5] The most common number of tracks is 40 or 80.
- [6] Such as Guymager, Isobuster, FTK Imager and Disk Utility.
- [7] DS stands for double-sided, HD for high-density.
- [8] http://guymager.sourceforge.net/
- [9] For more information, see https://en.wikipedia.org/wiki/File_system.
- [10] For more information, see https://en.wikipedia.org/wiki/File_Allocation_Table#FAT12.
- [11] For more information, see https://en.wikipedia.org/wiki/Hierarchical_File_System.
- [12] Cygwin aims to allow programs of Unix-like systems to be recompiled and run natively on Windows with minimal source code modifications. See https://en.wikipedia.org/wiki/Cygwin
- [13] For more information, see https://bitcurator.net/bitcurator