Tools

Content curators need lightweight tools to help them determine the full range of file formats that make up their digital newspaper collections and to assess whether those files are valid according to their format specifications. Format identification and validation tools are limited in the types of formats that they can reasonably identify and validate; in some cases, multiple tools may be needed to validate outputs for a single collection. Some format identification and validation tools can also produce technical metadata (more on this in Section 3: “Metadata Packaging for Digital Newspapers”).
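As a rough illustration of what signature-based format identification involves, the following Python sketch walks a collection directory and tallies files by their leading bytes. The signature table, the function names `identify` and `inventory`, and the format labels are assumptions made for this example; the table is far smaller than a real registry, and the sketch only identifies formats. It does not validate files against their specifications, which is the job of dedicated validation tools.

```python
import os
from collections import Counter

# Illustrative signature table (an assumption of this sketch, far smaller
# than a real format registry): leading bytes -> format label.
SIGNATURES = [
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG 2000 (JP2)"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"%PDF-", "PDF"),
    (b"<?xml", "XML"),
]


def identify(path):
    """Return a format label based on the file's first bytes, or 'unidentified'."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unidentified"


def inventory(root):
    """Walk a collection directory and count files per identified format."""
    counts = Counter()
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            counts[identify(os.path.join(folder, name))] += 1
    return counts
```

Calling `inventory` on the top-level directory of a collection returns a `Counter` keyed by format label. A large "unidentified" count is a signal that a fuller identification tool is needed for that collection.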

Helpful format identification tools include:

Normalization and migration decisions are ultimately policy decisions. There is no “right” answer regarding whether or not these activities are necessary or advisable for a particular institution. In order to establish local policy, an institution should consider the following:

  1. Level of need: Does the institution have obsolete digital newspaper formats?
  2. Viability: Are the institution’s current digital newspaper formats still viable?
  3. Range of formats: Is the range of the institution’s current formats so broad that its ability to keep track of viability is compromised?
  4. Resource levels: Is it feasible for the institution to test and run any format management tool?

If normalization and/or migration are undertaken, the tools an institution uses should be thoroughly tested prior to implementation.

Helpful format registry and migration tools include:

7 thoughts on “Tools”

  1. JHOVE(1) isn’t very useful for identification since it only recognises a handful of formats. JHOVE2 uses an old version of DROID for identification, and, worse, a DROID signature file that is a modified version of DROID sig file version 20, which dates back to 2006 (!). The sig file cannot be updated to a more recent one since this will make JHOVE2 go all haywire. So I would consider neither JHOVE nor JHOVE2 to be serious candidates for identification.

    However, one tool I would mention here is Apache Tika, which covers a range of formats similar to that of the Unix file utility.

    You might want to mention Apache Tika here as well.

  2. There are several problems with this section:

    First UDFR, PRONOM and the Lib. Congress website aren’t *tools*, but *format registries*.

    Personally I wouldn’t call UDFR very helpful at all; basically all the information it contains was taken from an older version of PRONOM, which is made available via an impenetrable interface. Also at the time of writing (25/9/2013) there haven’t been any updates to UDFR’s contents in 235 days, which suggests it’s pretty dead!

    I would also mention the ArchiveTeam File Formats Wiki, which is more free-form than PRONOM but incredibly useful:

    http://fileformats.archiveteam.org/

    I also find it confusing that this section presents format registries and migration tools as one category; they’re really very different things!

  3. (Note: somehow this comment accidentally ended up as part of my other comment for section 4, no idea how that happened!)

  4. Will do, Johan. It is great to hear that Apache Tika is getting some traction. We’ll definitely revisit it for inclusion. We found that, for better or worse, many of our project partners had adopted JHOVE (less so JHOVE2) and had begun using it in their workflows as an early effort to track and manage their formats. DROID is becoming popular for its GUI. FITS is increasingly being turned to as well. It is very early in the game for most institutions with respect to these tools, but we feel that, for all of their shortcomings, they deserve reference, and they will likely continue to get treatment and attention in the short term as this document lands.

    Much as we did on the previous page, where we cautioned about the nascent state of format registries and migration tools, we can probably do a better job here of casting these format identification/validation tools in that same light.

  5. This was implied in item #2 under “viability” (i.e., able to be used/rendered), but I suppose that term is too reductive for your larger concern here about a designated community that may want one or more derivative formats for particular access/research use cases. We will rework the list of criteria here to address this.

  6. This list was intended to encompass both registries and tools/resources like the LoC website and Xena. We envision that, as time goes by and institutions become more versed in consulting registries and then turning to tools to perform any necessary migrations, these will come to be considered a related set of resources.

    Perhaps our caution on the first page of this section on formats bears repeating here, to draw attention to the fact that we acknowledge that these are nascent developments that depend on community engagement and participation.

    Recall:

    “We note that normalization and migration are still fraught topics in the library/archive realm, with passionate advocates both for and against employing these practices. We also note that format registries and migration tools are still in early stages of development, and should be employed only after thorough consideration.”

    We’ll return to the ArchiveTeam work and consider it further as well. It began picking up steam as we were formalizing our research & documentation.
