To ensure the longevity of digital newspaper content, an institution must identify what file formats it manages, validate these files against their specifications, and normalize and/or migrate these files according to the institution’s policy decisions. Institutions use validation, normalization, and migration to ensure that newspaper content can be effectively rendered over time.
Identifying file formats is a first step in understanding what file types an institution is managing. As described previously, recording this information in an inventory (see Section 1: “Inventorying Digital Newspapers for Preservation”) will help an institution track a range of file format complexities it will need to address over time (including format obsolescence and format changes).
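As a rough illustration of format identification, the following Python sketch tallies the files in a directory tree by their leading byte signatures. The signature table, the function names (`identify_format`, `inventory_formats`), and the choice to read only the first 16 bytes are assumptions made for this example; the table covers only a few of the formats discussed in this section and cannot distinguish, for instance, PDF/A from other PDF versions. In practice an institution would more likely rely on a dedicated identification tool backed by a format registry.

```python
from collections import Counter
from pathlib import Path

# Leading byte signatures for a few formats named in this section.
# Illustrative and incomplete: a format registry covers far more cases.
SIGNATURES = [
    (b"II*\x00", "TIFF"),                              # little-endian TIFF
    (b"MM\x00*", "TIFF"),                              # big-endian TIFF
    (b"%PDF-", "PDF"),                                 # any PDF version, including PDF/A
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG 2000"),  # JP2 signature box
    (b"\xff\x4f\xff\x51", "JPEG 2000 codestream"),     # raw codestream
    (b"<?xml", "XML"),                                 # XML declaration
]


def identify_format(header: bytes) -> str:
    """Return a format label for the first bytes of a file, or 'unknown'."""
    # Skip a UTF-8 byte order mark, which may precede an XML declaration.
    if header.startswith(b"\xef\xbb\xbf"):
        header = header[3:]
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def inventory_formats(root: str) -> Counter:
    """Count the files under `root` by identified format."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            with path.open("rb") as fh:
                counts[identify_format(fh.read(16))] += 1
    return counts
```

Calling `inventory_formats` on a collection directory returns a `Counter` keyed by format label; a large "unknown" count would signal files that need closer inspection before they are recorded in the inventory.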
Format validation, briefly defined, is a process by which an institution assesses the conformance of a file to its format specification (e.g., that a .pdf follows the internal content, layout, and structure rules of the .pdf specification) and checks that a file will be dependably rendered by the program(s) designed for that format. Validating files allows an institution to catch and address errors (files that do not behave as they should).
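To make the idea of a conformance check concrete, the Python sketch below tests two coarse structural features of a PDF: the `%PDF-` header at the start of the file and an `%%EOF` marker near the end. The function name `pdf_structure_problems` and the 1024-byte search window are assumptions chosen for this example. It falls far short of full validation, since it does not parse the file's internal structure, but it shows the kind of error (such as a truncated file) that validation is meant to catch.

```python
from typing import List


def pdf_structure_problems(data: bytes) -> List[str]:
    """Return a list of coarse structural problems found in PDF bytes.

    An empty list means only that these two checks passed, not that
    the file conforms to the PDF specification.
    """
    problems = []
    # A PDF file begins with a header such as "%PDF-1.4".
    if not data.startswith(b"%PDF-"):
        problems.append("missing %PDF- header")
    # The end-of-file marker should appear near the end of the file;
    # its absence often indicates a truncated transfer or write.
    if b"%%EOF" not in data[-1024:]:
        problems.append("no %%EOF marker in the last 1024 bytes")
    return problems
```

A file that returns a non-empty list here would be flagged for follow-up with a full validation tool rather than rejected outright.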
Normalization is the process of migrating a file from its native format into an open preservation format (e.g., migrating Olive’s PrXML OCR to XML if possible). Migration more generally may be employed to ensure that the content of a file type facing obsolescence can be rendered in a new format (proprietary or open).
The library/archive communities have reached consensus regarding well-understood, high-quality open archival formats for image-related collections like digital newspapers, namely TIFF, PDF/A, and to some degree even JPEG2000 (an image format that supports both lossless and lossy compression). The same cannot always be said for OCR and other article-level transcription, but curators and vendors typically aim to produce XML-based formats that have forward-migration pathways. Born-digital news (e-prints) may be contained in various legacy PDF and HTML versions, and web-based content (including social media content) may include a wide range of file formats, depending upon the particular born-digital newspaper.
Once an institution possesses a clear understanding of the range of different formats it hosts, it may determine that some files need normalization or migration attention. The decision to normalize or migrate formats should be thoroughly evaluated. Consulting a mature, reliable format registry is the first step, since it can help an institution identify migration pathways to more suitable formats. The institution can then familiarize itself with the various tools that can perform the necessary transformations.
We note that normalization and migration are still fraught topics in the library/archive realm, with passionate advocates both for and against employing these practices. We also note that format registries and migration tools are still in early stages of development, and should be employed only after thorough consideration. In addition, format migration does not require, nor should it imply, that a content curator should dispose of the original or current format. It is advisable to continue preserving both the original and successor formats for as long as resources permit.