Readiness Spectrum

Creating checksums for digital newspaper content is a relatively easy process. For small collections and non-technical environments (e.g., institutions without technical staff members), there are tools that can be used to calculate checksums. For example, open source graphical user interface (GUI) versions of the BagIt utilities, such as Bagger, make batch checksum creation very easy and provide the institution with a ready-made manifest of files and checksums. The command-line programs mentioned above are relatively simple to invoke, and technical staff with even a moderate level of experience in Unix and Linux environments should have no problem coordinating the programs with scripts (or using tools like md5deep/hashdeep) to automate the batch creation of checksums for multiple objects. Other, more platform-specific tools can also be used.
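For institutions that prefer to script batch checksum creation themselves, the following is a minimal sketch in Python, not a replacement for Bagger or md5deep/hashdeep. It assumes a hypothetical collection directory and uses only the standard library; the function names (file_digest, build_manifest, write_manifest) and the default algorithm are illustrative choices, and the output follows the "checksum, whitespace, path" line layout used by BagIt manifests.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of one file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use flat for large page images.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, algorithm="sha256"):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative_path = path.relative_to(root).as_posix()
            manifest[relative_path] = file_digest(path, algorithm)
    return manifest


def write_manifest(manifest, destination):
    """Write one 'checksum  path' line per file."""
    with open(destination, "w", encoding="utf-8") as handle:
        for relative_path, checksum in sorted(manifest.items()):
            handle.write(f"{checksum}  {relative_path}\n")
```

A call such as write_manifest(build_manifest("newspapers"), "manifest-sha256.txt"), where "newspapers" is a placeholder directory name, would produce the manifest. Writing the manifest outside the collection directory keeps it from being checksummed as content on the next run.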

Managing the checksums you have created requires additional effort. A checksum is only as useful as the format in which it is recorded: applications will need to be able to create new checksums on demand and either automatically compare them against previously recorded checksums or output them in a format that can be processed for comparison purposes. Maintaining the associations and linkages between an individual checksum and the file for which it was generated requires good data management. Coupling this information with existing inventories will take some planning. For example, the algorithm used to create the checksums must be recorded, and it must also be supported by the applications that will be used to create and/or compare them in the future. Institutions should also make sure that the checksum creation process they deploy allows for file name changes in the future, as will be discussed next in Section 5: “Organizing Digital Newspapers for Preservation”.
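The comparison step described above can be sketched as follows. This is an illustrative example under stated assumptions rather than the behavior of any particular tool: it assumes manifests stored as "checksum, whitespace, path" lines and a manifest file name of the form manifest-<algorithm>.txt (the BagIt convention) as the place where the algorithm is recorded. The function names are hypothetical. Because files are also matched by checksum, content that has only been renamed is reported as renamed rather than missing.

```python
import hashlib
from pathlib import Path


def algorithm_from_name(manifest_name):
    """Recover the algorithm from a 'manifest-<algorithm>.txt' file name."""
    prefix, _, algorithm = Path(manifest_name).stem.partition("-")
    if prefix != "manifest" or algorithm not in hashlib.algorithms_available:
        raise ValueError(f"cannot determine algorithm from {manifest_name!r}")
    return algorithm


def parse_manifest(text):
    """Parse 'checksum  path' lines into a {path: checksum} dictionary."""
    manifest = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first run of whitespace so paths may contain spaces.
        checksum, relative_path = line.split(None, 1)
        manifest[relative_path.strip()] = checksum.lower()
    return manifest


def compare_manifests(recorded, current):
    """Classify every recorded file as ok, changed, renamed, or missing."""
    current_by_checksum = {}
    for path, checksum in current.items():
        current_by_checksum.setdefault(checksum, []).append(path)

    report = {}
    for path, checksum in recorded.items():
        if path in current:
            report[path] = "ok" if current[path] == checksum else "changed"
            continue
        # Same content under a name the old manifest never listed:
        # most likely a rename or move rather than a loss.
        candidates = sorted(
            p for p in current_by_checksum.get(checksum, []) if p not in recorded
        )
        if candidates:
            report[path] = "renamed to " + ", ".join(candidates)
        else:
            report[path] = "missing"
    return report
```

Here recorded would come from the stored manifest and current from checksums freshly computed with the same algorithm. Anything reported as changed or missing is a candidate for repair from another copy.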

Depending on the scale of the content being managed, and the degree of sophistication of the tools and approaches being used, an institution may want to document its checksum management workflows up front, in the context of its current or prospective data management environment (more on this next under Optimal Readiness).

Case Study: Boston College

In the NEH-funded Chronicles in Preservation project, several digital newspaper curators experimented with BagIt to inventory, checksum, and package their collections for preservation purposes.

Boston College packaged 183 GB of digital newspaper content using BagIt. Using the BagIt Java Library, this package was then split into smaller 30 GB archival units for preservation storage. The checksum manifests of these smaller archival units were then validated against the original manifest using custom scripts built in the project.
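The project's custom scripts are not reproduced here. As a rough illustration of what validating unit manifests against an original manifest can involve, the hypothetical Python sketch below assumes each manifest has already been loaded as a dictionary of path to checksum, and that every file from the original package should appear in exactly one unit with an unchanged checksum. The function name and the unit names in the usage note are assumptions made for the example.

```python
def check_units_against_original(original, units):
    """Return a list of problems; an empty list means the units match."""
    problems = []
    seen = {}  # path -> name of the unit that first listed it
    for unit_name, unit_manifest in units.items():
        for path, checksum in unit_manifest.items():
            if path in seen:
                problems.append(
                    f"{path}: appears in both {seen[path]} and {unit_name}"
                )
                continue
            seen[path] = unit_name
            if path not in original:
                problems.append(
                    f"{path}: in {unit_name} but not in the original manifest"
                )
            elif original[path] != checksum:
                problems.append(
                    f"{path}: checksum in {unit_name} differs from the original"
                )
    # Any original file that no unit listed was lost during the split.
    for path in original:
        if path not in seen:
            problems.append(f"{path}: in the original manifest but in no unit")
    return problems
```

Calling check_units_against_original(original, {"unit-1": first, "unit-2": second}) and getting back an empty list indicates that the split preserved every file and checksum listed in the original manifest.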

After a successful ingest, these smaller BagIt units were exported, rebuilt, and validated as the original 183 GB package using additional custom scripts built in the project. The rebuilt BagIt package was then returned to Boston College, which was able to validate the checksums using the BagIt tools.