Rationale & Sound Practices

Rationale

For more than a decade, newspapers have been digitized by an array of different institutions (libraries, commercial vendors, etc.) according to a variety of image, document and text output specifications. During this same timeframe, institutions have been acquiring “born-digital” newspaper content, including both e-prints (often through FTP or hard drive exchanges from a publisher to a library) and web-based files (often “harvested” using web-capture tools like Heritrix or obtained via FTP exchanges). The resulting digital newspaper files come in a variety of flavors, including those typical for digitized newspapers (e.g., TIFF, PDF/A, JPEG2000, XML, etc.) and for “born-digital” newspaper contents (e.g., PDF, various image, audio and multimedia formats, HTML, XHTML, CSS, JavaScript, etc.).

Sound Practices

In order to ensure the longevity of digital newspaper content, an institution must identify what file formats it manages, validate these files according to their specifications, and normalize and/or migrate these files according to the institution’s policy decisions. The processes of validation, normalization, and migration of newspaper files are used by institutions to ensure that newspaper content can be effectively rendered over time.

Identifying file formats is a first step in understanding what file types an institution is managing. As described previously, recording this information in an inventory (see Section 1: “Inventorying Digital Newspapers for Preservation”) will help an institution track a range of file format complexities it will need to address over time (including format obsolescence and format changes).
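
As a rough illustration of the identification step, the sketch below guesses a file's format from its leading "magic" bytes and tallies the results for a directory tree. The signature table, the function names `identify_bytes` and `inventory_formats`, and the 16-byte read size are assumptions made for this example only; a production workflow would more likely rely on an established identification tool and format registry.

```python
from collections import Counter
from pathlib import Path

# Illustrative signature table (an assumption for this sketch, not a registry):
# leading "magic" bytes paired with a human-readable format label.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG2000 (JP2)"),
    (b"<?xml", "XML"),
]


def identify_bytes(head: bytes) -> str:
    """Return a format label based on the first bytes of a file."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return "unknown"


def inventory_formats(root: str) -> Counter:
    """Count the formats found under a directory, for use in an inventory."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            with path.open("rb") as fh:
                counts[identify_bytes(fh.read(16))] += 1
    return counts
```

Signature matching of this kind is only a first pass: it does not distinguish PDF/A from other PDF versions, and it misses XML files that begin with a byte-order mark or lack an XML declaration.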

Format validation, briefly defined, is a process by which an institution assesses the conformance of a file to its format specification (e.g., that a .pdf follows the internal content, layout, and structure rules of the .pdf specification) and checks that a file will be dependably rendered by the program(s) designed for that format. Validating files allows an institution to catch and address errors (files that do not behave as they should).
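
To make the idea of conformance checking concrete, here is a minimal sketch that tests three basic structural features of a PDF file: the `%PDF-` header, the `startxref` keyword, and the `%%EOF` marker near the end. The function name `check_pdf_structure` and the 1024-byte tail window are assumptions for illustration. This is far short of full validation against the PDF specification, and passing these checks says nothing about whether the file will render dependably.

```python
from typing import List


def check_pdf_structure(data: bytes) -> List[str]:
    """Return a list of basic structural problems found in PDF bytes.

    An empty list means only that these few checks passed; it is not
    evidence of full conformance to the PDF specification.
    """
    problems = []
    # A PDF file is expected to begin with a "%PDF-" version header.
    if not data.startswith(b"%PDF-"):
        problems.append("missing %PDF- header")
    # The trailer region is expected near the end of the file.
    tail = data[-1024:]
    if b"startxref" not in tail:
        problems.append("missing startxref keyword near end of file")
    if b"%%EOF" not in tail:
        problems.append("missing %%EOF marker near end of file")
    return problems
```

A file truncated during an FTP transfer, for example, would typically fail the last two checks, which is the kind of error that validation is meant to surface early.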

Normalization is the process of migrating a file from its native format into an open preservation format (e.g., migrating Olive’s PrXML OCR to XML if possible). Migration more generally may be employed to ensure that the content of a file type that is facing obsolescence can be rendered into a new format (proprietary or open).

The library/archive communities have reached consensus regarding well-understood, high-quality open archival formats for image-related collections like digital newspapers, namely TIFF, PDF/A, and to some degree even JPEG2000 (an image format that supports both lossless and lossy compression). The same cannot always be said for OCR and other article-level transcription, but curators and vendors typically aim to produce XML-based formats that have forward-migration pathways. Born-digital news (e-prints) may be contained in various legacy PDF and HTML versions, and web-based content (including social media content) may include a wide range of file formats, depending upon the particular born-digital newspaper.

Once an institution possesses a clear understanding of the range of different formats it hosts, it may determine that some files need normalization or migration attention. The decision to normalize or migrate formats should be thoroughly evaluated. Consultation with a mature, reliable format registry is the first step; this can help an institution to identify migration pathways to more suitable formats. The institution can then familiarize itself with various tools that can perform necessary transformations.

We note that normalization and migration are still fraught topics in the library/archive realm, with passionate advocates both for and against employing these practices. We also note that format registries and migration tools are still in early stages of development, and should be employed only after thorough consideration. In addition, format migration does not require, nor should it imply, that a content curator should dispose of the original or current format. It is advisable to continue preserving both the original and successor formats for as long as resources permit.
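
One way to keep the original and its successor tied together after a migration is to record both in a small manifest entry. The sketch below is an assumption-laden illustration: the `MigrationRecord` fields, the helper names, and the use of SHA-256 checksums to confirm that the retained original is unchanged are choices made for this example, not practices prescribed by this section.

```python
import hashlib
from dataclasses import dataclass


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class MigrationRecord:
    """Links a retained original file to the successor produced from it."""
    original_name: str
    original_sha256: str
    successor_name: str
    successor_sha256: str
    tool: str  # name of the migration tool used (hypothetical in this sketch)


def record_migration(original_name: str, original: bytes,
                     successor_name: str, successor: bytes,
                     tool: str) -> MigrationRecord:
    """Build a record describing one migration, keeping both checksums."""
    return MigrationRecord(
        original_name=original_name,
        original_sha256=sha256_hex(original),
        successor_name=successor_name,
        successor_sha256=sha256_hex(successor),
        tool=tool,
    )


def original_unchanged(record: MigrationRecord, current: bytes) -> bool:
    """Check that the retained original still matches its recorded checksum."""
    return sha256_hex(current) == record.original_sha256
```

Because the record names both files, a curator can later return to the original if the successor format proves unsuitable or the migration tool is found to have lost information.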

6 thoughts on “Rationale & Sound Practices”

  1. You seem to suggest here that it is *always* necessary to perform some kind of normalization or migration step. That looks a bit dangerous to me, because such operations are notoriously risky: without sufficient knowledge of both the source files (*what* do you want to preserve) and the normalization/migration tools, it’s pretty easy to end up with unrecoverable loss of information. The best thing is to get this right at the creation/production stage.

  2. Agree about keeping the original format; however in the case of large image files this may simply not be financially feasible (this depends of course on the size of the collection and the institution’s resources).

  3. Format validation tools (JHOVE etc.) do check conformance to the file specification (to a limited degree), but this doesn’t necessarily guarantee that a file will be ‘dependably rendered’ (particularly important for formats that encapsulate compressed image data)!

    For the latter it is advisable to complement the format validation with additional rendering tests, if possible. See e.g. this write-up by Andy Jackson:

    http://anjackson.github.io/keeping-codes/experiments/Understanding%20Tools%20and%20Formats%20Via%20Bitwise%20Analysis.html

     

  4. Note that PDF/A doesn’t by itself warrant high quality: e.g. it is possible to use low quality DCT (JPEG) compression within PDF/A whereas uncompressed images are possible as well.

    Also PDF(/A) is most typically used to package multiple scanned pages in one file, which is useful for some access use cases. However if those files ever need to be migrated to some other format PDF is more difficult to work with than simple still image formats (TIFF, JP2, etc), because the PDF wrapper adds an extra layer of complexity. So the way in which the newspapers are foreseen to be used is an extremely important factor to consider when choosing a format.

  5. Perhaps we can strengthen what we mean when we say “…as long as resources permit.” Your concern was our implied meaning. Maybe we just need to acknowledge that more forcefully?

  6. These are great points. In using the phrase “high-quality” in conjunction with all the mentioned formats we are unintentionally conflating things a bit here. We’ll work with this paragraph as we go forward.

    Can you provide some examples of migration cases where curators felt compelled to migrate a PDF file to a non-PDF format? In most of the sample content sets we had to work with the resolution or text formatting would not have necessarily benefited from such a migration, and there was not a compelling obsolescence pressure. Any further real world examples we could refer to would help us in contextualizing our guidance more effectively. Thanks in advance for any follow-up.
