Readiness Spectrum

The complexities involved in managing formats for digital newspapers will vary depending upon two main factors:

  1. The range of formats an institution holds
  2. The institution’s decisions regarding normalization and migration activities

Institutions with consistent digital newspaper collections in a small number of formats will find preservation readiness work easier than institutions with inconsistent collections that span a wider spectrum of formats. Likewise, institutions that do not engage in format migration or normalization (whether for resource or policy reasons) will find format management less time- and resource-intensive, at least in the short term.

Initial management steps vary little across the broad readiness spectrum. Almost any institution will be able to complete the following tasks:

  1. Identify and document its digital newspaper file formats using tools like DROID, which has a graphical user interface (GUI) and links to PRONOM
  2. Evaluate and make determinations about sustainability issues presented by the various identified formats (using UDFR, PRONOM, and/or the Library of Congress Sustainability of Formats website)
  3. Establish policies regarding normalization and migration
  4. Migrate files deemed “at risk” (e.g., using tools like Xena)
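
As a rough illustration of the first step, the sketch below tallies the formats in a directory tree by checking each file's leading bytes. It is not a substitute for DROID or PRONOM: the signature table covers only a handful of formats, and the function names (identify_format, inventory) are hypothetical, chosen here for illustration.

```python
from collections import Counter
from pathlib import Path

# Leading-byte signatures for a few formats common in newspaper collections.
# Illustrative only; a registry such as PRONOM covers far more formats.
SIGNATURES = [
    (b"II*\x00", "TIFF"),                             # little-endian TIFF
    (b"MM\x00*", "TIFF"),                             # big-endian TIFF
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG2000"),  # JP2 signature box
    (b"\xff\xd8\xff", "JPEG"),
    (b"%PDF-", "PDF"),
]


def identify_format(path):
    """Return a format label based on the file's first bytes, or 'unknown'."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def inventory(root):
    """Count the files under root by identified format."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[identify_format(path)] += 1
    return counts
```

Calling inventory() on a collection directory returns a Counter keyed by format label. A large "unknown" count suggests the collection needs a fuller identification tool before any sustainability decisions are made.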

More advanced institutions will also complete the following:

  1. Validate formats during the identification process by using more advanced tools such as JHOVE/2, FITS, or Unix programs like the find and file commands (or their counterparts in other OS environments) combined with shell scripts.
  2. Perform normalization to streamline file formats into a select range deemed manageable by institutional policy.
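
One way to turn a normalization policy into a work list is sketched below. The POLICY table and the plan_normalization function are hypothetical, and the format pairings shown are placeholders rather than recommendations; each institution would substitute the decisions recorded in its own policy.

```python
# Hypothetical policy: identified format -> preservation format.
# A format mapped to itself is accepted as-is.
POLICY = {
    "JPEG2000": "JPEG2000",
    "TIFF": "TIFF",
    "PDF": "PDF/A",
    "JPEG": "TIFF",
}


def plan_normalization(identified, policy=POLICY):
    """Split (path, format) pairs into keep, normalize, and review groups."""
    plan = {"keep": [], "normalize": [], "review": []}
    for path, fmt in identified:
        target = policy.get(fmt)
        if target is None:
            plan["review"].append(path)  # no policy decision for this format yet
        elif target == fmt:
            plan["keep"].append(path)
        else:
            plan["normalize"].append((path, target))
    return plan
```

The sketch only produces a plan; it does not convert anything. Files in the "review" group are those whose format the policy does not yet address, which is often where the sustainability evaluation needs to be revisited.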

The essential and optimal steps are described in more detail ahead.

Case Study: Boston College

Boston College has digitized several of its campus newspapers in accordance with the National Digital Newspaper Technical Guidelines. This has provided Boston College with several high-quality archival page scans in both TIFF and JPEG2000 formats.

To conserve storage space, Boston College has opted to treat its JPEG2000 images as its preservation masters (TIFFs can be quite large). The JPEG2000 images retain the legibility of text and graphics. Because of the amount of white space, the images were eligible for a small amount of compression. While JPEG2000 is not as widely adopted as TIFF, Boston College believes this will change, and the format still satisfies the criteria for being non-proprietary and open source.

Boston College has also tested the conversion from JPEG2000 back to TIFF with satisfactory results.


Case Study: Virginia Tech

Virginia Tech has been a leader in working with publishers of born-digital newspapers to archive their PDFs and web files.

To better manage and preserve the born-digital web files under its care, Virginia Tech made efforts to migrate early versions of this HTML content to the more recent HTML 4.0. Though this was a significant undertaking, it enabled Virginia Tech to render this unique content more consistently and reliably.

Through its participation in the NEH-funded Chronicles in Preservation project, Virginia Tech was also given the opportunity to apply some of the leading format identification and validation tools, such as JHOVE/2, DROID, and FITS. These tools were especially helpful for characterizing its early versions of PDF content.

In addition, the Chronicles in Preservation project enabled Virginia Tech to use the FCLA Description Service to generate technical metadata for its born-digital file formats.

7 thoughts on “Readiness Spectrum”

  1. In reality the format registries you mention here offer very little information on sustainability issues at all (the only exception here is the LoC website, which does explicitly address this).

    I’m also somewhat surprised about a relatively obscure tool like Xena getting so many mentions here. I must admit here I’ve never used it myself, but I’m a little bit puzzled about the Xena project website which doesn’t even offer any information whatsoever about which formats are supported. This makes me wonder if the mentions are based on experience with this tool by the authors?

    Overall I’m getting the impression here that the authors have mainly considered tools/registries that were developed within the archival community, whereas much more (and often better!) ones exist elsewhere (especially true for migration tools).

  2. JPEG 2000 is ‘open’ in the sense that it is not proprietary and that it is an ISO standard (although payment is required to obtain the standard text, as is the case with most ISO standards).

    The term ‘open source’ applies to *software*, and I don’t think its use in the current context makes much sense. That aside, it is likely to result in confusion, because JPEG 2000 isn’t well supported by open source software at all! See e.g. this blog post (and also the comments below it):


  3. What about the needs of the user audience the archive is supposed to serve (designated community)? E.g. storing born-digital newspapers in PDF format would be a pretty poor choice if users want to access content with mobile devices/e-readers!


  4. True. But our primary goal with the Guidelines is to help institutions think through how they can preserve what they already have (regardless of how ideal its future format or emerging use case might be – at this stage somewhat of a moving target with all sorts of resource questions surrounding it). Many of the born-digital eprints newspapers we had access to in the project could have undergone OCR work where this had not already occurred. Beyond that there is a whole separate terrain of investigating significant properties that would need to be addressed–all of which would be contingent upon preserving the existing object.

    We deal with ideal creation and acquisition scenarios toward the end of the document:

    There we attempt to bridge the higher preservation level concerns around usable formats back to thinking through how some of what you are focusing on here can be caught up-front.

  5. Admittedly Xena could do a better job of making this easier to locate on its website, but it is available within the tool and on its SourceForge page.

    I think the name of the game with everything at this point, including things like the format registries, is just as Xena promotes on their website:

    “No software is perfect. While we have put in our best efforts towards making sure that Xena is as good as we can make it, there are bound to be some imperfections.

    You can help us find and solve them by reporting any bugs.”

    Perhaps in an ideal world we would continue to try and improve technologies and practices in an open source fashion within our community. But that may not be able to happen for everything or when people need solutions immediately that come out of the box. Would love to hear more about the migration tools you mention and consider any proprietary licensing issues that may govern their usage.

