The complexities involved in managing formats for digital newspapers will vary depending upon two main factors:
- The range of formats an institution holds
- The institution’s decisions regarding normalization and migration activities
Institutions with consistent digital newspaper collections that include a small number of formats will find preservation readiness work easier than institutions with inconsistent collections that span a wider range of formats. Institutions that do not engage in format migration or normalization (whether for resource or policy reasons) will also find format management less time- and resource-intensive, at least in the short term.
Initial management steps vary little across the broad readiness spectrum. Almost any institution will be able to complete the following tasks:
- Identify and document its digital newspaper file formats using tools like DROID, which has a graphical user interface (GUI) and links to PRONOM
- Evaluate and make determinations about sustainability issues presented by the various identified formats (using UDFR, PRONOM, and/or the Library of Congress Sustainability of Formats website)
- Establish policies regarding normalization and migration
- Migrate files deemed “at risk” (e.g., using tools like Xena)
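The first of these tasks, identifying and documenting formats, can be illustrated with a minimal Python sketch. It is not a substitute for DROID or PRONOM: the signature table below is a small, hand-picked assumption covering only a few formats common in newspaper collections, and the function names `identify` and `inventory` are hypothetical.

```python
import os
from collections import Counter

# Leading "magic" bytes for a few formats; illustrative, not exhaustive.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"II*\x00", "TIFF"),  # little-endian TIFF
    (b"MM\x00*", "TIFF"),  # big-endian TIFF
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG2000"),  # JP2 signature box
    (b"\xff\xd8\xff", "JPEG"),
]


def identify(path):
    """Return a format label for path based on its leading bytes."""
    with open(path, "rb") as handle:
        head = handle.read(16)
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return "unknown"


def inventory(root):
    """Walk root recursively and count files per identified format."""
    counts = Counter()
    for folder, _dirs, files in os.walk(root):
        for name in files:
            counts[identify(os.path.join(folder, name))] += 1
    return counts
```

Calling `inventory` on a collection directory returns a tally per format label, which can serve as a starting point for documenting what an institution holds. Files that come back as "unknown" are candidates for closer inspection with a full identification tool.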
More advanced institutions will also complete the following:
- Validate formats during the identification process by using more advanced tools such as JHOVE/2, FITS, or Unix programs like the find and file commands (or their equivalents in other OS environments) combined with shell scripts.
- Perform normalization to streamline file formats into a select range deemed manageable by institutional policy.
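As a rough illustration of combining the find and file commands, the following Python sketch walks a directory tree and asks the Unix `file` utility for each file's MIME type. It assumes `file` is installed and on the PATH, and returns `None` where it is not. The function names `mime_type` and `report` are hypothetical. Note that `file` only identifies a format from its content; it does not check that a file is well-formed, which is the kind of validation tools like JHOVE/2 perform.

```python
import os
import shutil
import subprocess


def mime_type(path):
    """Ask the Unix `file` utility for a MIME type; None if unavailable."""
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        ["file", "--brief", "--mime-type", path],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip() or None


def report(root):
    """Yield (path, mime_type) for every file found under root."""
    for folder, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(folder, name)
            yield path, mime_type(path)
```

Comparing the reported MIME type against each file's extension is one simple way to flag files that deserve a closer look before normalization.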
The essential and optimal steps are described in more detail ahead.
Case Study: Boston College
Boston College has digitized several of its campus newspapers in accordance with the National Digital Newspaper Technical Guidelines. This has provided Boston College with several high-quality archival page scans in both TIFF and JPEG2000 formats.
To conserve storage space, Boston College has opted to treat its JPEG2000 images, rather than its TIFFs, as preservation masters (TIFFs can be quite large). The JPEG2000 images retain the legibility of text and graphics, and because the pages contain a good deal of white space, they were eligible for a small amount of compression. While JPEG2000 is not as widely adopted as TIFF, Boston College believes this will change, and the format still satisfies the criteria of being non-proprietary and open.
Boston College has also tested the conversion from JPEG2000 back to TIFF with satisfactory results.
Case Study: Virginia Tech
Virginia Tech has been a leader in working with publishers of born-digital newspapers to archive their PDFs and web files.
To better manage and preserve the born-digital web files under its care, Virginia Tech made efforts to migrate early versions of this HTML content to the more recent HTML 4.0. Though this was a significant undertaking, it enabled Virginia Tech to render this unique content more consistently and reliably.
Through its participation in the NEH-funded Chronicles in Preservation project, Virginia Tech was also given the opportunity to apply some of the leading format identification and validation tools, such as JHOVE/2, DROID, and FITS. These tools were especially helpful for characterizing its early versions of PDF content.
In addition, the Chronicles in Preservation project enabled Virginia Tech to make use of the FCLA Description Service to generate technical metadata for its born-digital file formats.