Rationale & Sound Practices

Rationale

Stewards of digital newspapers need to be able to attest to the completeness, correctness, authenticity, and renderability of their collections over time. One way that institutions can do this is to require that checksums (short, fixed-length values computed from a file's contents, sometimes called digital fingerprints) be generated for their master digital files at the time of creation, and then to store these checksums and periodically recompute and compare them. Recent digitization specifications and standards recommend that institutions that outsource digitization request a manifest (a listing of files and their checksums) from their vendors or digitization units and actively use it to verify digitized collections upon receipt (i.e., to make sure that the collection arrives intact). With this manifest, stewards can also perform routine audits and repair corrupted objects from backups or preservation copies as needed.

Sound Practices

Checksums can be generated by several open source tools and utilities (more on this below), and can be stored in a simple output format and/or linked to their corresponding objects through metadata (e.g., METS). Once stored, these checksum records can be called upon by both content curators and preservation service providers to confirm that the objects have survived intact through both network-based transfers and hardware/software processes.
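As an illustration of the "simple output format" mentioned above, the following is a minimal sketch in Python using only the standard library. The function names (file_checksum, write_manifest), the SHA-256 default, and the one-line-per-file layout are assumptions made for this example, not a format prescribed by any particular specification.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file's bytes in chunks so large master files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(root, manifest_path, algorithm="sha256"):
    """Write one '<checksum>  <relative path>' line per file found under root."""
    root = Path(root)
    manifest_path = Path(manifest_path)
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        if path.resolve() == manifest_path.resolve():
            continue  # do not list the manifest itself
        relative = path.relative_to(root).as_posix()
        lines.append(f"{file_checksum(path, algorithm)}  {relative}")
    manifest_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

A call such as write_manifest("batch_0001", "batch_0001_manifest.txt") (both names hypothetical) would produce a text file that can travel with the batch and be rechecked on receipt.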

When recording checksums for master digital newspaper files, a few important practices to follow are:

  1. There are several different kinds of checksum algorithms available for institutions to apply to their files (MD5, SHA-1, and the SHA-2 family being the most prominent). Recording which algorithm was applied is imperative so that later verification processes can apply that same algorithm.
  2. Checksum values should be generated from the file content only, so that they remain valid if a file is later renamed. The utilities used to create the checksums may or may not support hashing the content alone, as opposed to the content plus the filename.
  3. Checksums are best used for near-term audit purposes, because the checksums themselves can be miscalculated at the time of creation or corrupted at later points, and may then need to be recreated from known good originals.
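The three practices above can be sketched together in a short Python example. It is a sketch under stated assumptions rather than a recommended tool: the record layout (a dictionary holding the algorithm name alongside the per-file checksums) and the function names content_checksum, build_record, and audit are hypothetical. The record stores the algorithm so later audits reuse it, hashes file bytes only so a renamed file can still be matched, and reports a "mismatch" rather than "corrupted" because either the file or the stored checksum may be at fault.

```python
import hashlib
from pathlib import Path


def content_checksum(path, algorithm):
    """Hash only the file's bytes; the filename plays no part in the result."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_record(root, algorithm="sha256"):
    """Return the algorithm name together with a checksum for every file under root."""
    root = Path(root)
    files = {
        p.relative_to(root).as_posix(): content_checksum(p, algorithm)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    return {"algorithm": algorithm, "files": files}


def audit(root, record):
    """Compare the current files against a stored record, using the recorded algorithm."""
    current = build_record(root, record["algorithm"])["files"]
    names_by_digest = {}
    for name, digest in current.items():
        names_by_digest.setdefault(digest, []).append(name)

    report = {}
    for name, expected in record["files"].items():
        if name in current:
            # A mismatch means the file OR the stored checksum is wrong.
            report[name] = "intact" if current[name] == expected else "mismatch"
        elif expected in names_by_digest:
            # Same content found under a different name.
            report[name] = "renamed to " + names_by_digest[expected][0]
        else:
            report[name] = "missing"
    return report
```

When an audit reports a mismatch, the file would be compared against a backup or preservation copy before deciding whether the object or the recorded checksum needs to be replaced.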