Tools

A number of tools and resources can assist an institution with packaging its digital newspapers for long-term preservation. Selecting the appropriate tools depends on the level of resource investment an institution can make, from creating simple, well-documented TAR packages to deploying GUIDs (and making use of related Name Assigning Authorities) and producing more complex packages (whether TAR, WARC, or BagIt).

Examples of GUID and packaging tools are provided here with brief descriptions. Usage of these tools is discussed below in the Essential and Optimal sections.

Globally Unique Identifier (GUID) resources that are popular in digital libraries and archives include:

  1. Handle System
  2. ARK
  3. NOID

For the purposes of creating local GUIDs, the Handle System provides local server-side application support for maintaining a Name Assigning Authority (NAA) and for creating Name Assigning Authority Numbers (NAANs) that institutions can use as prefixes for unique identifiers. Identifiers in the Handle System can consist of any printable characters in UTF-8 encoding, which covers most major languages written today. ARKs, or Archival Resource Keys, are unique identifiers that can be created and managed by tools like NOID (Nice Opaque Identifier).

In order to use these tools, an institution should first register with a federated Name Assigning Authority like that maintained by the California Digital Library (CDL). Doing so will provide the institution with a NAAN prefix that it can use in conjunction with an ARK created by NOID to form a GUID for a collection or set of digital newspaper objects. This GUID can then be stored with metadata and can be used to manage the collection in archival storage. METS and/or PREMIS can be especially helpful metadata tools for institutions that have moved to the stage of creating GUIDs for their digital newspapers (more on this below).
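The combination of a NAAN prefix and an opaque name described above can be illustrated with a short Python sketch. It is not the NOID tool itself: the function names are hypothetical, the NAAN `99999` is only a placeholder, and the name is drawn at random here, whereas NOID mints names from a configured template. The check character follows the scheme given in the NOID documentation (position-weighted sum over a 29-character alphabet).

```python
import secrets

# Alphabet used by NOID for opaque names: digits plus consonants
# (no vowels and no "l"), 29 characters in total.
BETANUMERIC = "0123456789bcdfghjkmnpqrstvwxz"


def check_char(base: str) -> str:
    """Return a NOID-style check character for 'NAAN/name'.

    Each character's index in the alphabet is multiplied by its
    1-based position; characters outside the alphabet (such as "/")
    count as zero. The sum modulo 29 selects the check character.
    """
    total = 0
    for position, ch in enumerate(base, start=1):
        value = BETANUMERIC.find(ch)      # -1 when ch is not in the alphabet
        total += position * max(value, 0)
    return BETANUMERIC[total % len(BETANUMERIC)]


def mint_ark(naan: str, length: int = 8) -> str:
    """Compose an ARK from a NAAN and a randomly drawn opaque name.

    Assumption: random selection stands in for NOID's template-based
    minter, and no registry of already-issued names is consulted.
    """
    name = "".join(secrets.choice(BETANUMERIC) for _ in range(length))
    base = f"{naan}/{name}"
    return f"ark:/{base}{check_char(base)}"


if __name__ == "__main__":
    print(mint_ark("99999"))  # placeholder NAAN, not a registered one
```

In practice an institution would also record each issued identifier, since uniqueness depends on never assigning the same name twice; this sketch leaves that bookkeeping out.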

Lossless archival packaging formats, along with related metadata standards, include:

  1. TAR
  2. WARC
  3. METS
  4. PREMIS
  5. BagIt

TAR, which stands for Tape ARchive, is a lossless packaging format that works especially well for encapsulating folders of files and maintaining file system metadata about the objects and structures contained therein. There are a number of standard and open source utilities available for producing and unpacking TAR packages. TAR packages can be subsequently compressed, but compression is not advised for long-term storage and preservation.
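As an illustration of producing an uncompressed TAR package, the following is a minimal sketch using the `tarfile` module from Python's standard library. The function name `package_issue` and the directory layout are assumptions made for the example; any standard TAR utility would serve equally well.

```python
import tarfile
from pathlib import Path


def package_issue(source_dir: str, tar_path: str) -> list[str]:
    """Write source_dir into an uncompressed TAR file and list its members."""
    source = Path(source_dir)
    # Mode "w" writes a plain TAR with no compression, in line with the
    # advice against compressing packages kept for long-term storage.
    with tarfile.open(tar_path, mode="w") as archive:
        # arcname keeps the folder name inside the package without
        # embedding the full local path.
        archive.add(source, arcname=source.name)
    # Re-open the package to confirm what was actually written.
    with tarfile.open(tar_path, mode="r") as archive:
        return sorted(archive.getnames())
```

Listing the members right after writing gives a simple inventory that can be stored with the package's documentation.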

WARC, which stands for Web ARChive, is a lossless packaging format typically used for encapsulating harvested websites. It improves upon the earlier ARC format by better describing the web resources it contains and by enabling improved harvesting, sharing, and access. Standard web-crawling software such as Heritrix can produce WARC packages.

METS and PREMIS were previously described in Section 3: “Metadata Packaging for Digital Newspapers.” These standards can be used to maintain linkages between associated digital newspaper objects, their metadata, and any assigned GUIDs. They can also help to record preservation actions taken on archived objects.

Finally, BagIt is a packaging specification that can be applied to a digital newspaper collection at any level in a collection hierarchy to produce an inventory and checksums for that content. BagIt can be especially helpful for auditing locally preserved content or verifying the integrity of collection content when exchanged with an external preservation service provider.
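The inventory-and-checksum idea behind BagIt can be sketched in Python using only the standard library. This is a simplified illustration under stated assumptions, not a full implementation of the specification: the function names are hypothetical, only a SHA-256 payload manifest is written, and optional tag files such as `bag-info.txt` are omitted. Dedicated BagIt tools should be preferred in production.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute a file's SHA-256 checksum, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(source_dir: str, bag_dir: str) -> Path:
    """Copy source_dir into bag_dir/data and write a minimal bag around it."""
    bag = Path(bag_dir)
    payload = bag / "data"
    shutil.copytree(source_dir, payload)  # fails if bag_dir/data already exists
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    # One manifest line per payload file: checksum, then path relative to the bag.
    lines = [
        f"{sha256_of(p)}  {p.relative_to(bag).as_posix()}"
        for p in sorted(payload.rglob("*"))
        if p.is_file()
    ]
    (bag / "manifest-sha256.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return bag


def verify_bag(bag_dir: str) -> list[str]:
    """Return the payload paths that are missing or whose checksum differs."""
    bag = Path(bag_dir)
    failures = []
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue  # skip blank lines, e.g. for an empty payload
        expected, rel_path = line.split(maxsplit=1)
        target = bag / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel_path)
    return failures
```

Running `verify_bag` after a transfer, or on a schedule against local storage, returns an empty list when every payload file still matches its recorded checksum.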