Tools

A number of tools and approaches, ranging from simple to sophisticated, can help with packaging metadata for preserving digital newspapers.

The first is straightforward and relates to the preservation readiness activity of inventorying covered earlier in the Guidelines. At institutions of all sizes, but particularly smaller or under-resourced ones, digital newspaper content may have been acquired in ad hoc ways and on a wide range of media. This is especially true of born-digital newspaper content (e.g., pre-prints or web files from local presses and publishers). As such, metadata may reside in multiple locations and conform to a range of standards (or to no standard at all). The inventorying stage provides an opportunity to record where this metadata lives relative to the actual collection files. This can be done in the inventory instrument, in a simple spreadsheet, or in a database used solely for metadata tabulation. The most important thing is to record the associations between the metadata and the collections and/or items. The next section, “Essential Readiness”, says more about what an institution should do with the outputs of such approaches.
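As an illustration of the spreadsheet approach, the following is a minimal sketch in Python of a tabular inventory that records where metadata lives relative to the collection files. The column names, identifiers, and paths are hypothetical and are not prescribed by the Guidelines; an institution would substitute whatever fields its own inventory instrument uses.

```python
import csv
import sys
from dataclasses import asdict, dataclass, fields


@dataclass
class MetadataLocation:
    """One row tying a collection (or item) to the place its metadata lives."""
    collection_id: str      # local identifier for the collection or item
    content_path: str       # where the collection files are stored
    metadata_path: str      # where the associated metadata is stored
    metadata_standard: str  # e.g. "METS", "Dublin Core", or "none"
    notes: str = ""


def write_inventory(rows, handle):
    """Write the rows as CSV, with a header line, to an open text handle."""
    columns = [f.name for f in fields(MetadataLocation)]
    writer = csv.DictWriter(handle, fieldnames=columns)
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))


if __name__ == "__main__":
    # Hypothetical example rows; the identifiers and paths are placeholders.
    example_rows = [
        MetadataLocation("gazette-1998", "/media/disk01/gazette/1998",
                         "/exports/gazette_1998.xml", "Dublin Core"),
        MetadataLocation("herald-web", "/media/disk02/herald_web",
                         "", "none", "no metadata located yet"),
    ]
    write_inventory(example_rows, sys.stdout)
```

The resulting CSV opens directly in a spreadsheet application, so the same table can be maintained by hand afterwards.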

For institutions that use repository software systems for their digital newspapers, as mentioned above, metadata is often stored within these systems in some relationship to the collection. This metadata may include various administrative, technical, and even structural elements, but more often than not it contains collection- or item-level descriptive information. Extracting and packaging such metadata usually depends on the native export features of the repository system that serves the associated content. For example, if an institution stores its metadata in one of the popular repository software systems (DSpace, Fedora, Olive ActivePaper, ArchivalWare, etc.), the software may provide metadata export functions that can output the records as XML. Institutions should consult the system documentation and make use of developer support during this process to ensure consistent and thorough outputs that include the full range of metadata elements that should be derived from the system. Depending on the purpose of preservation, it may be important to retain certain data elements (e.g., DSpace Handles) so that the collections can be rebuilt at a later date in the same repository environment. In other cases, this data may be extraneous to the institution’s long-term preservation use case and may be excluded accordingly.

Institutions producing digital newspaper collections according to well-formed digitization standards, such as the NDNP Technical Guidelines, will already have adequate-to-excellent metadata records, including page-level METS or METS-ALTO records containing descriptive, technical, administrative, and structural metadata. In these cases metadata may have been produced by a vendor or digitization unit and packaged with the collection files according to those well-formed specifications. Such institutions may have very little additional work required in order to consolidate their metadata for long-term preservation.

For those institutions with adequate resources that have not yet moved beyond descriptive metadata for their digital newspaper collections, there is a range of tools that can analyze image, text, and other multimedia files and extract metadata from them for long-term archival management and metadata packaging. The XML outputs of these tools can then be consolidated and conformed to schemas such as METS and/or PREMIS.
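One way to picture the consolidation step is to wrap each tool's XML output in its own technical metadata section of a METS document. The sketch below, in Python, assumes the tool outputs are already available as XML strings; the function name, the labels, and the ID pattern are illustrative choices. It is not a complete or validated METS record (for instance, it omits the structMap that a full record needs), and mapping to PREMIS would require further work.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def wrap_in_mets(tool_outputs):
    """Return a METS root element with one techMD section per tool output.

    tool_outputs maps a label (hypothetical, e.g. a tool name) to an XML
    string produced by that tool.
    """
    mets = ET.Element(f"{{{METS_NS}}}mets")
    amd_sec = ET.SubElement(mets, f"{{{METS_NS}}}amdSec")
    for index, (label, xml_text) in enumerate(tool_outputs.items(), start=1):
        tech_md = ET.SubElement(amd_sec, f"{{{METS_NS}}}techMD", ID=f"TMD{index}")
        # MDTYPE="OTHER" marks metadata that is not one of the types METS names.
        md_wrap = ET.SubElement(tech_md, f"{{{METS_NS}}}mdWrap",
                                MDTYPE="OTHER", OTHERMDTYPE=label)
        xml_data = ET.SubElement(md_wrap, f"{{{METS_NS}}}xmlData")
        # Embed the tool's XML unchanged inside the wrapper.
        xml_data.append(ET.fromstring(xml_text))
    return mets


if __name__ == "__main__":
    # Placeholder input standing in for a real tool report.
    sample = {"example-tool": "<report><format>TIFF</format></report>"}
    print(ET.tostring(wrap_in_mets(sample), encoding="unicode"))
```

Keeping each tool's output intact inside its own wrapper preserves the original report, so a later, more careful mapping to PREMIS elements remains possible.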

Below is a list of tools with varying sets of use cases and requiring different degrees of knowledge about file specifications:

The Unix file command reports application- and MIME-type-specific information on a per-file basis. It can be combined with shell scripts to run recursively over batches of files, producing tabular output that can be processed further into metadata schemas with better long-term support.

The New Zealand Metadata Extraction Tool is a Java-based graphical user interface (GUI) application that can run on Windows and Unix platforms to analyze files and output findings to XML.

Exiftool is a command-line utility that can read, edit, and create metadata for a wide variety of file formats.

FITS is an open source, command-line tool for Unix-based systems that combines the abilities of many different open-source file identification, validation, and metadata extraction tools and that outputs results to XML.
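The batch approach described above for the Unix file command can be sketched as follows. Python is used here in place of a shell script; the function names and column headings are hypothetical, and the sketch assumes the file command is installed and accepts the --brief and --mime-type options.

```python
import csv
import subprocess
import sys
from pathlib import Path


def mime_type_via_file(path):
    """Ask the Unix `file` command for the MIME type of one path."""
    result = subprocess.run(
        ["file", "--brief", "--mime-type", str(path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def build_rows(root, identify=mime_type_via_file):
    """Walk root recursively; return (relative path, bytes, MIME type) rows."""
    root = Path(root)
    rows = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            rows.append((relative, path.stat().st_size, identify(path)))
    return rows


def write_table(rows, handle):
    """Write the rows as CSV with a header line."""
    writer = csv.writer(handle)
    writer.writerow(["path", "bytes", "mime_type"])
    writer.writerows(rows)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name):
    #   python file_table.py /path/to/collection > file_table.csv
    write_table(build_rows(sys.argv[1]), sys.stdout)
```

Because the identification step is passed in as a function, the same walker could in principle call another command-line tool instead, for example Exiftool with its -json option, provided a small wrapper parses that tool's output.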